Execution of OPS5 Production Systems on a Massively Parallel Machine
Bruce K. Hillyer
David Elliot Shaw

Abstract
In recent years, the development of expert systems implemented by rule-based
production systems has emerged as one of the dominant paradigms in the field of
artificial intelligence. While production systems offer important advantages in
large-scale AI applications, their use in such applications is typically very costly
in execution time. In this paper, we describe an algorithm for executing production
systems expressed in the OPS5 language on a massively parallel multiple-SIMD
machine called NON-VON, portions of which are currently under construction at
Columbia University. The algorithm, a parallel adaptation of Forgy's Rete Match,
has been implemented and tested on an instruction-level simulator.
We present a detailed performance analysis, based on the implemented code, for the
averaged characteristics of six production systems having an average of 910
inference rules each. The analysis predicts an execution rate of more than 850
production firings per second using hardware comparable in cost to a VAX 11/780.
By way of comparison, a LISP-based OPS5 interpreter running on a VAX 11/780
typically fires 1 to 5 rules per second, while a Bliss-based interpreter executes 5 to
12 rules per second.
1 Introduction 
After several decades of research on artificial intelligence, rule-based production
systems have emerged as one of the most important and widely employed tools for
the implementation of expert systems and other AI software. In general terms, a
production system consists of a set of condition/action rules, or productions, a
working memory representing the current "state of the world", and an interpreter
that repeatedly executes a three-phase cycle:
1. Match. The interpreter identifies all rules whose conditions are satisfied
by the current contents of working memory.
2. Select. One of the matching rule instantiations is selected.
3. Act. The working memory is modified as specified by the action part of
the selected rule.
A production system organization facilitates the modular, incremental growth of
knowledge bases, and allows for the useful but unplanned interaction of
independently-specified rules [Winston, 1977; Nilsson, 1980]. While a few production
systems have already found commercial application, their use in certain other
domains is precluded by slow execution speeds. This is particularly true in the case
of real-time systems characterized by severe and inflexible time constraints, and in
applications where high throughput is necessary to make the use of such systems
cost-effective. A number of researchers [Sauers and Walsh, 1983; Forgy and
McDermott, 1977; Lenat and McDermott, 1977; McCracken, 1979; Lenat et al.,
1979; Buchanan, 1982; Hayes-Roth et al., 1983] have considered the problem of
efficiency in the execution of production systems, and have proposed techniques to
increase the speed of rule-based inferencing.
One approach to the efficient execution of production systems involves the use of
parallel hardware. Forgy [1980] considered the problem of executing production
systems in parallel on the ILLIAC IV, but was forced to significantly modify the
production system paradigm in order to obtain reasonable performance. Stolfo and
Shaw [1982] subsequently proposed a highly parallel machine called DADO, which
was intended specifically for the execution of production systems; an early prototype
of the DADO machine is presently operational. More recently, members of the
DADO project have investigated a number of issues related to languages and
algorithms for the parallel execution of production systems [Stolfo, 1984]. The
present paper describes and analyzes an algorithm for executing production systems
on a parallel machine called NON-VON, which was designed not for the execution
of production systems in particular, but rather, for application to a wide range of
symbolic information processing tasks. A prototype of the NON-VON machine
having 63 processing elements became operational in January 1985, and a larger
prototype is under construction.
In particular, this paper presents an algorithm for the parallel execution of
production systems implemented using the language OPS5 [Forgy, 1981], developed
by Forgy and others at Carnegie-Mellon University. The algorithm may be
regarded as a parallel version of Forgy's Rete Match [1982]. A LISP-based OPS5
interpreter executing the sequential Rete Match algorithm on a VAX 11/780
typically fires between 1 and 5 rules per second, while a Bliss-based interpreter
executes between 5 and 12 productions per second [Gupta, 1984 (private
communication)]. By way of comparison, the results presented in this paper predict
that a NON-VON machine having approximately the same hardware cost as the
VAX 11/780 should execute more than 850 rules per second. This result is based
on measurements obtained by Gupta and Forgy [1983] of the static and dynamic
characteristics of six production systems having an average of 910 inference rules
each.
To establish the background for the work reported here, the next two sections will
discuss the OPS5 production system language and the sequential Rete Match
algorithm for production system execution, respectively. Section 4 provides an
overview of the NON-VON architecture, while section 5 explicates the details of the
machine configuration and performance assumptions that are used in our analysis.
An algorithm for the implementation of OPS5 on NON-VON is described in section
6. Section 7 analyzes the storage requirements of this algorithm, while section 8
presents a detailed analysis of its performance characteristics. The derivation of the
statistics employed in our performance analysis is presented as an appendix to the
paper.
2 OPS5 Production Systems 
The production system language OPS was first described by Forgy and McDermott
[1977]. Several subsequent versions have appeared, with OPS5 being the most
widely known. We have chosen OPS5 as the vehicle for our investigations into
parallel execution of production systems for several reasons:
1. It is widely known, and has been evaluated favorably by other
researchers [Hayes-Roth et al., 1983].
2. It has been used to implement a large and successful commercial
production system [McDermott, 1980].
3. Static and dynamic characteristics of several OPS5 production systems
have been measured [Gupta and Forgy, 1983].
4. Its speed can be increased significantly by parallel execution, even though
the language was designed for sequential processing.
It should be noted, however, that other researchers [Miranker, 1984a] are actively
engaged in the development of a production system language specifically designed
for parallel execution; such a language may well prove better suited to the
capabilities of parallel machines.
The essential elements of the OPS5 language are outlined below; a more complete
exposition can be found in [Forgy, 1981]. By way of illustration, Figure 1 shows a
pair of productions whose execution results in the printing of a sorted list of all
numbers in working memory.
------------------------------------------------------------------------------
Insert Figure 1 (Example OPS5 Productions) here.
------------------------------------------------------------------------------
A corresponding set of sample working memory elements is presented in Figure 2;
they specify that the current task is to sort, that the output counter is 0, and that
there are three numbers to be sorted: 17, 5, and 23.
------------------------------------------------------------------------------
Insert Figure 2 (Example Working Memory Elements) here.
------------------------------------------------------------------------------
A rule expressed as a production in OPS5 consists of a production name, a
conjunction of (possibly negated) clauses known as condition elements, and an arrow
followed by one or more actions. The condition elements of a production are
collectively known as the condition or left-hand side (LHS) of the production.
Similarly, the actions are the right-hand side (RHS). Each condition element
consists of a class name and one or more terms. The class name is the first item
in the condition element. Each term consists of an attribute name, a relational
operator, and a value. Attribute names are prefixed with an up-arrow. Although
attribute-value pairs are the usual form of expression, OPS5 conditions may be
expressed in a positional notation by omitting attribute names. Common actions in
OPS5 write values to output, and remove, modify, or create facts in the working
memory.
Numeric and string values, which may be constants or variables, occur in OPS5.
Variables are denoted by enclosing the name in angle-brackets. The permitted
relational operators are <, <=, >, >=, =, <>, and <=>. The first six
have their usual meanings, but only = and <> may be applied to string values.
The operator <=> evaluates to true provided an attribute and value are of the
same type. If no operator is explicitly specified in a term, = is assumed. OPS5
has grouping operators to express conjunctions and disjunctions of multiple terms
involving an attribute.
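The operator semantics just described can be illustrated with a minimal Python sketch. The function name term_holds and the helper is_number are hypothetical, and the choice to make an ordering test on a string simply fail (rather than signal an error) is an assumption of this sketch, not something the text specifies.

```python
import operator

# Comparison operators named in the text, mapped to Python equivalents.
COMPARE = {
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
    "=": operator.eq, "<>": operator.ne,
}

def is_number(x):
    # bool is excluded so that True/False are not mistaken for numbers.
    return isinstance(x, (int, float)) and not isinstance(x, bool)

def term_holds(attr_value, value, op="="):
    """Evaluate one term: does `attr_value op value` hold?

    When no operator is given, "=" is assumed, as in OPS5.
    """
    if op == "<=>":
        # Same-type test: both numeric or both non-numeric (strings).
        return is_number(attr_value) == is_number(value)
    is_ordering = op not in ("=", "<>")
    if is_ordering and not (is_number(attr_value) and is_number(value)):
        # Assumption of this sketch: an ordering test on a string fails.
        return False
    return COMPARE[op](attr_value, value)
```

For instance, term_holds(17, 5, ">") holds, while term_holds(3, "x", "<=>") does not, because the two values differ in type.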
A working memory element is similar in form to a condition element, but contains
neither operators nor variables, since it expresses a specific fact about the world
modeled by the production system. Each working memory element is assigned a
unique 32-bit integer time-tag upon creation. The tag serves as a compact
identifier for the working memory element, and also permits the distinction of
current facts from old information. A working memory element is said to match a
condition element provided all the constraints specified by relational operators hold.
The left-hand side of a production is said to be satisfied provided that:
1. For every non-negated condition element there exists a working memory
element that matches it.
2. For every negated condition element, there does not exist a working
memory element that matches it.
3. Each variable is bound to the same value in all occurrences.
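The three conditions for satisfaction can be made concrete with a naive matcher. This is only a sketch under stated assumptions: the dictionary representation of working memory elements and condition elements, the function names, and the sample rule are all hypothetical, and an unbound variable is allowed to be bound only through the = operator.

```python
import operator

OPS = {"=": operator.eq, "<>": operator.ne, "<": operator.lt,
       "<=": operator.le, ">": operator.gt, ">=": operator.ge}

def is_var(v):
    # Variables are written in angle brackets, e.g. "<x>".
    return isinstance(v, str) and len(v) > 2 and v[0] == "<" and v[-1] == ">"

def match_ce(ce, wme, bindings):
    """Match one working memory element against one condition element.

    Returns the (possibly extended) bindings, or None on failure.
    """
    if wme.get("class") != ce["class"]:
        return None
    new = dict(bindings)
    for attr, op, value in ce["terms"]:
        if attr not in wme:
            return None
        if is_var(value):
            if value not in new:
                if op != "=":
                    return None      # sketch: bind only through "="
                new[value] = wme[attr]
                continue
            value = new[value]       # condition 3: reuse the earlier binding
        try:
            if not OPS[op](wme[attr], value):
                return None
        except TypeError:            # e.g. ordering a string against a number
            return None
    return new

def satisfied(lhs, working_memory):
    """Return one bindings dict for each way the left-hand side is satisfied."""
    positives = [ce for ce in lhs if not ce.get("negated")]
    negatives = [ce for ce in lhs if ce.get("negated")]

    def extend(i, bindings):
        if i == len(positives):
            yield bindings
            return
        for wme in working_memory:   # condition 1: some element must match
            b = match_ce(positives[i], wme, bindings)
            if b is not None:
                yield from extend(i + 1, b)

    return [b for b in extend(0, {})
            if not any(match_ce(n, w, b) is not None   # condition 2
                       for n in negatives for w in working_memory)]

# Hypothetical rule: a number <x> such that no number is smaller than <x>.
lhs = [
    {"class": "task", "terms": [("name", "=", "sort")]},
    {"class": "number", "terms": [("value", "=", "<x>")]},
    {"class": "number", "negated": True, "terms": [("value", "<", "<x>")]},
]
wm = [{"class": "task", "name": "sort"},
      {"class": "number", "value": 17},
      {"class": "number", "value": 5},
      {"class": "number", "value": 23}]
result = satisfied(lhs, wm)
```

With this illustrative working memory, the only satisfying binding is <x> = 5, since every other number has a smaller number that matches the negated condition element.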
An OPS5 interpreter executes a production system by cycling through the following
three-step process, halting before the third step if no production is satisfied by the
current working memory.
1. Match the working memory elements with the conditions of all
productions: Each ordered tuple of working memory elements satisfying
the corresponding non-negated condition elements of a production is
called an instantiation of that production. The collection of all such
instantiations is called the conflict set.
2. Select one instantiation from the conflict set according to certain
predefined criteria. This step is known as conflict resolution. Conflict
resolution strategies provided in OPS5 favor instantiations containing
recent information, and prefer productions having restrictive conditions.
The former tends to focus the system's attention on one task at a time,
and the latter applies special case rules in preference to general ones.
3. Act on the chosen instantiation by performing the actions specified in the
production's right-hand side. These actions perform input and output,
and modify the contents of working memory. The modifying actions can
make new working memory elements, modify one or more terms in some
working memory elements in the instantiation, and remove elements of
the instantiation from working memory. Performing the actions specified
by a production is sometimes called firing the production.
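The cycle above can be sketched as a short sequential loop. The sketch below is not the OPS5 interpreter: productions are represented by hypothetical Python callables, instantiations are tuples of time tags, and conflict resolution is reduced to a simplified recency comparison followed by a hypothetical "specificity" count. The single sample rule prints and removes the smallest number, so that the numbers come out in sorted order.

```python
import itertools

class WorkingMemory:
    """Facts keyed by a time tag that increases with each creation."""
    def __init__(self):
        self.elements = {}
        self._next_tag = itertools.count(1)

    def make(self, **fact):
        tag = next(self._next_tag)
        self.elements[tag] = fact
        return tag

    def remove(self, tag):
        del self.elements[tag]

def run(productions, wm, max_cycles=100):
    """Cycle through match, select, act; return names of fired productions."""
    fired = []
    for _ in range(max_cycles):
        # 1. Match: the conflict set holds (production, instantiation) pairs.
        conflict_set = [(p, inst) for p in productions
                        for inst in p["match"](wm)]
        if not conflict_set:
            break                      # halt: no production is satisfied
        # 2. Select: prefer recent time tags, then more restrictive rules.
        p, inst = max(conflict_set,
                      key=lambda c: (sorted(c[1], reverse=True),
                                     c[0]["specificity"]))
        # 3. Act: perform the right-hand side of the chosen production.
        p["act"](wm, inst)
        fired.append(p["name"])
    return fired

printed = []

def match_smallest(wm):
    # Instantiation: the one number for which no smaller number exists.
    numbers = [(f["value"], tag) for tag, f in wm.elements.items()
               if f["class"] == "number"]
    return [(min(numbers)[1],)] if numbers else []

def print_and_remove(wm, inst):
    (tag,) = inst
    printed.append(wm.elements[tag]["value"])
    wm.remove(tag)

rule = {"name": "print-smallest", "specificity": 2,
        "match": match_smallest, "act": print_and_remove}

wm = WorkingMemory()
for n in (17, 5, 23):
    wm.make(**{"class": "number", "value": n})
fired = run([rule], wm)
```

After three firings no number remains in working memory, the conflict set is empty, and the loop halts before the act step, as described above.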
3 The Rete Match Algorithm 
Of the three steps in the production system cycle, the matching phase has proven
in practice to be the most time-consuming. According to Forgy [1979], more than
90% of the execution time in a uniprocessor implementation is consumed by
matching. A naive implementation of the interpreter would match the condition
part of each rule in turn against the entire contents of the working memory.
Forgy's Rete Match algorithm exploits his observation that firing an OPS
production causes only a few changes to working memory, and that these changes
have few effects on the conflict set. Hence a computational savings results if the
production system is compiled into a dataflow graph, with state information saved
at each node during execution. A change to working memory is entered into initial
nodes of the graph. Consequent state changes then propagate through the graph,
updating information stored in intermediate nodes. State changes in terminal nodes
of the graph represent changes to the conflict set. Figure 3 shows an example
dataflow graph (as used on NON-VON) corresponding to the production named
sort-work that is depicted in Figure 1.
------------------------------------------------------------------------------
Insert Figure 3 (Example Dataflow Graph for NON-VON) here.
------------------------------------------------------------------------------
The graph can be viewed as a collection of tests that progressively determine which
productions are ready to fire. First, the intra-condition tests check that attributes
in a working memory element satisfy relational operators, and that variables
occurring more than once in a condition element are bound consistently. Any
working memory element satisfying all intra-condition tests for a condition element
is stored as a token in the α-mem node corresponding to that condition element.
Subsequently, inter-condition tests are performed in two-input nodes to verify
consistent binding of variables across multiple condition elements in a production's
left-hand side. This testing occurs in AND-nodes for non-negated condition
elements, and NOT-nodes for negated condition elements. At the output of each
AND-node and each NOT-node in the graph is a β-mem node to store tokens. A
token in a β-mem node represents an ordered tuple of working memory elements
that jointly satisfy all non-negated condition elements that are ancestors of that
node.
The intra-condition tests are local, in that each examines terms of only one working
memory element. Entry of a token into an α-mem or β-mem node triggers the
more complex inter-condition testing, which proceeds as follows. First, the two-input
node following the memory node is identified. Second, the opposite memory
node that serves as the other input is located. Third, the new token is matched
with all members of the opposite node to test for consistent variable bindings, in
accordance with the type of two-input node. If consistent bindings are found, the
output tokens from this two-input node are formed, and they become new entries to
the subsequent β-mem node. In the case of terminal two-input nodes, the result is
an addition to (or deletion from) the conflict set.
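The three steps of inter-condition testing can be sketched for the simplest case, insertion of a token into one input of an AND-node. The class AndNode and the token layout (a tuple of time tags paired with a dictionary of variable bindings) are hypothetical; NOT-nodes and token deletion are omitted from this sketch.

```python
class AndNode:
    """Two-input node with its two input memories and its output memory."""

    def __init__(self, shared_vars):
        self.shared_vars = shared_vars  # variables tested across both inputs
        self.left = []                  # tokens arriving on the left input
        self.right = []                 # tokens arriving on the right input
        self.output = []                # memory node at the node's output

    def _consistent(self, a, b):
        # Tokens agree when every shared variable has the same binding.
        return all(a[1][v] == b[1][v] for v in self.shared_vars)

    def insert(self, token, side):
        """Store a new token, join it with the opposite memory, and return
        the output tokens that were formed."""
        if side == "left":
            own, opposite = self.left, self.right
        else:
            own, opposite = self.right, self.left
        own.append(token)
        formed = []
        for other in opposite:          # match against all opposite members
            if self._consistent(token, other):
                l, r = (token, other) if side == "left" else (other, token)
                formed.append((l[0] + r[0], {**l[1], **r[1]}))
        self.output.extend(formed)      # new entries for the next memory
        return formed

node = AndNode(["<x>"])
first = node.insert(((1,), {"<x>": 5}), "left")
second = node.insert(((2,), {"<x>": 5, "<y>": 7}), "right")
third = node.insert(((3,), {"<x>": 9}), "right")
```

Only the second insertion produces an output token, because it is the only one whose binding for <x> agrees with a token already stored in the opposite memory.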
4 The NON-VON Machine
This section outlines the essentials of the general NON-VON architecture, in support
of the analysis of section 8. A fuller description of the architecture is found in
[Shaw, 1982] and [Shaw and Sabety, 1984]. Although all portions of the general
machine architecture are mentioned here for completeness, only certain subsystems
are required to execute the algorithms described in this paper. Section 5 presents
the reduced configuration assumed for OPS5 production system execution.
The top-level organization of the general NON-VON machine is illustrated in Figure
4.
------------------------------------------------------------------------------
Insert Figure 4 (Organization of the NON-VON Machine) here.
------------------------------------------------------------------------------
NON-VON has two principal components, known as the primary processing
subsystem and the secondary processing subsystem. NON-VON is connected to a
host machine, a general purpose computer serving as a front end device for
interactions with the user.
The primary processing subsystem is organized as a binary tree. It consists of a
large number of small processing elements (SPE's), each having an 8-bit ALU, a
very small RAM, and communication connections to three neighboring SPE's, which
are known as the parent, left child, and right child. In addition, each SPE is
capable of communicating, within a single instruction cycle, with two additional
SPE's, called the left neighbor and right neighbor. These neighbors are the
predecessor and successor in an inorder traversal of the primary processing
subsystem tree. Each leaf node in the tree is also connected by bit-serial lines to
four other leaves known as the North, South, East, and West neighbors, providing
efficient support for an orthogonal mesh-connected communication topology. (The
mesh connections are not used in the execution of OPS5, however.)
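The relationship between the tree connections and the linear (left and right) neighbors can be illustrated in a few lines of Python. The heap-style numbering of nodes (root 1, children 2i and 2i+1) is an assumption made for this sketch only; it is not taken from the NON-VON design.

```python
def inorder(n, i=1):
    """Yield the nodes of a heap-numbered complete binary tree in inorder."""
    if i > n:
        return
    yield from inorder(n, 2 * i)        # left subtree
    yield i                             # the node itself
    yield from inorder(n, 2 * i + 1)    # right subtree

def linear_neighbors(n):
    """Map each node to its (left neighbor, right neighbor) pair, that is,
    its inorder predecessor and successor (None at the two ends)."""
    order = list(inorder(n))
    neighbors = {}
    for pos, node in enumerate(order):
        left = order[pos - 1] if pos > 0 else None
        right = order[pos + 1] if pos + 1 < len(order) else None
        neighbors[node] = (left, right)
    return neighbors

seven = linear_neighbors(7)      # a three-level tree
sixty_three = linear_neighbors(63)   # a six-level tree
```

In the seven-node tree the inorder sequence is 4, 2, 5, 1, 6, 3, 7, so the root's linear neighbors are the leaves 5 and 6 rather than its own children.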
Each SPE (as currently fabricated) contains a local RAM consisting of a 64 x 8-bit
section and a 64 x 1-bit section. A prototype chip containing eight SPE's is
described in detail in [Shaw and Sabety, 1984]. The SPE's do not store programs
locally, but instead receive instructions that are broadcast to them from some
higher level in the primary processing subsystem tree, as described below. This
mode of processing was named single instruction-stream, multiple data-stream
(SIMD) by Flynn [1972].
In the top five to ten levels of the primary processing subsystem, each SPE is
connected to a large processing element (LPE). The LPE's are general-purpose
microcomputers having large RAM's, and supporting locally stored programs.
Unlike the SPE's, the LPE's are capable of operating asynchronously in multiple
instruction-stream, multiple data-stream (MIMD) mode [Flynn, 1972]. In particular,
LPE's at the roots of several subtrees of the primary processing subsystem (possibly
at different levels) can broadcast separate instruction streams to be executed
simultaneously by all SPE's below them, giving NON-VON the capability for what
is sometimes referred to as multiple-SIMD execution. Each LPE also has an active
memory controller to generate control signals and to cache instructions and data for
its subtree of SPE's.
The LPE's are connected by a high-bandwidth interconnection network. For
moderate numbers N of LPE's (say, N < 128), a two-stage root-point network
consisting of N^(1/2) x N^(1/2) crossbar switches gives lower latency than a
log(N)-stage 2 x 2 crossbar network such as a butterfly or omega, at comparable
cost. The use of such a high-bandwidth network is essential to a number of
NON-VON algorithms involving large collections of data. The algorithms presented
in this paper, however, do not make use of the LPE network.
The secondary processing subsystem incorporates a substantial number (perhaps 32
to 256) of disk drives. Each drive is connected via an intelligent head unit to an
LPE in the primary processing subsystem, forming a very high bandwidth
interconnection between these two subsystems. In addition to ordinary disk I/O,
intelligent head units can perform certain computationally simple data filtering and
hashing operations "on the fly", passing results to the associated LPE's. In the
production system algorithms reported in this paper, the secondary processing
subsystem is not needed, as all the required data (even for rather large production
systems) can fit within the primary processing subsystem.
An early prototype of the NON-VON architecture has been operational at Columbia
since January, 1985. This prototype, called NON-VON 1, contains 63 SPE's, each
of which embodies some, but not all of the features described above, and one VAX
11/750 that serves as the sole LPE and host. A larger, significantly enhanced
prototype called NON-VON 3 is currently under construction. This machine will
embody 8,191 SPE's, again operating under the control of a single VAX 11/750.
The machine is being implemented using 3 micron custom nMOS chips, each
containing four SPE's, which were developed using the MOSIS "silicon brokerage"
system at ISI.
5 Configuration and Performance Assumptions
A large-scale NON-VON primary processing subsystem might comprise as many as a
million SPE's, together with a thousand or more LPE's. To execute production
systems, however, a much smaller machine will suffice. In particular, we have
assumed a primary processing subsystem comprising 16K SPE's for purposes of the
analysis presented in this paper. The system is assumed to contain 32 LPE's, all
associated with the fifth level of the tree; the 31 LPE's that would be associated
with the first through fourth levels in a general NON-VON machine are not
required for the execution of the production system algorithm. We assume that
each SPE contains 64 bytes of RAM, as is the case in the current NON-VON
design. Such a configuration would embody 4096 integrated circuit chips for the
SPE's and 640 chips for the LPE's, assuming that 20 chips are required to
implement each LPE. We also assume a dedicated host, together with a bus
(which need not be as fast as that which would be incorporated in a general
NON-VON machine) connecting the host to the LPE's. We assume the LPE's and host
to be capable of executing three million instructions per second, a figure chosen to
correspond roughly with the performance of 32-bit microprocessors such as the
AT&T 32100 and the Motorola M68020.
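The chip counts quoted above follow from simple arithmetic, which the short Python fragment below reproduces. The figure of four SPE's per chip is the one given for the NON-VON 3 chips in section 4; the total RAM figure is derived here and is not stated in the text.

```python
SPES = 16 * 1024          # 16K small processing elements
SPES_PER_CHIP = 4         # per the NON-VON 3 chip described in section 4
LPES = 32                 # large processing elements
CHIPS_PER_LPE = 20        # assumed in the text
RAM_BYTES_PER_SPE = 64    # assumed in the text

spe_chips = SPES // SPES_PER_CHIP            # chips holding SPE's
lpe_chips = LPES * CHIPS_PER_LPE             # chips implementing LPE's
total_chips = spe_chips + lpe_chips          # derived, not quoted in the text
total_spe_ram = SPES * RAM_BYTES_PER_SPE     # derived: aggregate SPE memory
```

The first two values reproduce the 4096 and 640 chips quoted above; the aggregate SPE memory works out to one megabyte.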
Figure 5 depicts the reduced NON-VON configuration assumed for production 
system execution. 
------------------------------------------------------------------------------
Insert Figure 5 (NON-VON: Reduced Configuration for OPS5) here.
------------------------------------------------------------------------------
NON-VON uses a two-speed clock. The short clock period is for the broadcast of
an instruction to all SPE's through a high-fanout tree implemented in fast bipolar
logic, and the execution of that instruction. The long clock period permits a signal
to propagate through combinational logic from the root of the tree to the leaves,
and back to the root again. There are two special instructions that require this
long communication step. The RESOLVE instruction requires two long clock
periods to identify the first (in an inorder enumeration of the tree nodes) of an
arbitrary collection of SPE's having a certain flag register set. The linear neighbor
communication instructions require two long clock periods to permit all SPE's to
communicate simultaneously with their predecessors or successors in an inorder
traversal of the binary tree. On the basis of preliminary chip tests and calculations
for a tree of 16K SPE's, we assume a period of 350 ns for the fast clock (30 ns
broadcast + 320 ns execution) and 3 µs (100 ns per level) for the slow clock.
In the analysis of performance given in section 8, the fast and slow clock periods
are counted separately.
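Because fast and slow periods are counted separately, the elapsed time of an instruction sequence can be estimated from two counts. The helper elapsed_ns below is a hypothetical illustration of that bookkeeping. The reading of "100 ns per level" as a round trip over the 14 levels of a 16K-node tree (2.8 µs, rounded to 3 µs) is an interpretation made for this sketch, not a statement from the text.

```python
FAST_NS = 30 + 320      # broadcast + execution, in nanoseconds
SLOW_NS = 3000          # 3 microseconds for the long clock period

def elapsed_ns(fast_steps, slow_steps):
    """Estimated elapsed time for the given counts of clock periods."""
    return fast_steps * FAST_NS + slow_steps * SLOW_NS

# RESOLVE and linear neighbor communication: two long periods each
# (only the long periods are counted here).
resolve_ns = elapsed_ns(0, 2)

# Interpretation: root to leaves and back at 100 ns per level.
levels = (16 * 1024).bit_length() - 1     # 14 levels for 16K SPE's
round_trip_ns = 2 * levels * 100          # close to the assumed 3 µs
```

Under these figures a single long clock period costs roughly as much as eight or nine ordinary instructions, which is why the two counts are tracked separately.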
The following three sections of the paper describe an algorithm and performance
analysis for the execution of OPS5 on NON-VON. Considerable detail is given, to
show how a heterogeneous massively-parallel machine can be applied to this task,
and to provide the reader with a basis for assessing the performance figures derived
in section 8.
6 Execution of OPS5 on NON-VON
This section presents observations about the potential parallelism embodied in OPS5,
a description of how the Rete Match is processed on NON-VON to exploit the
identified parallelism, and an example to clarify this processing.
Our algorithm has its roots in Algorithm 3 of [Gupta, 1984]. This algorithm was
designed for the DADO machine [Stolfo and Shaw, 1982], in a configuration
consisting of 1023 identical processors. Our algorithm is designed for a NON-VON
configuration having 32 large processing elements, each somewhat more powerful
than a DADO PE, and 16K small processing elements for a greater degree of
associative parallelism. The following discussion presents the rationale for our
approach to this problem. Since the experimental implementation of an OPS5
interpreter for NON-VON comprises more than 1500 lines of LISP code, we describe
portions of the processing relevant to the performance analysis without formally
specifying details of the entire algorithm.
6.1 Parallelism in OPS5 
As discussed previously, the execution cycle for an OPS5 production system has
three steps: match, select, and act. In the Rete match algorithm, the match phase
has two components: the highly local intra-condition testing, and the subsequent
inter-condition testing that evaluates a dataflow graph to combine previous results
and save state in α-mem and β-mem memory nodes.
Three levels of potential parallelism can be identified in this execution cycle:
1. The intra-condition testing can be performed in a massively parallel
manner using associative processing techniques. This has been previously
noted by Stolfo and Shaw [1982]. The NON-VON production system
algorithm guarantees very rapid completion of this step, in time
dependent on static characteristics of the production rules. This
contrasts with other implementations that depend on hashing techniques
to control the amount of matching. Massive parallelism is also applicable
during the deletion of facts from working memory. NON-VON
simultaneously finds all instances of a fact in all α-mem and β-mem
nodes, and removes all affected memory tokens simultaneously.
2. There is a modest amount of potential concurrency in the evaluation of
inter-condition testing in two-input nodes of a Rete dataflow graph.
This has been observed by Gupta [1984], who recommends partitioning
the production rules into 32 subsets (based on the same empirical data
on which our own analysis is based) to exploit parallelism in this phase.
In [Oflazer, 1984], it is determined that for two specific production
systems, the maximum available parallelism factor in the inter-condition
testing is approximately 7. The generality of this result is unknown.
3. We find no significant parallelism in the select and act phases, although
substantial portions of the act phase can be overlapped with the
following match phase.
The heterogeneous architecture of the NON-VON machine is well suited to the
exploitation of these varying degrees of parallelism.
1. The intra-condition testing is performed in the SPE's in two massively
parallel SIMD computation steps. The first step simultaneously evaluates
individual terms of all condition elements. The second step determines
the satisfaction of all condition elements by a parallel communication in
time proportional to the number of terms in the longest condition
element [1]. The synchronous nature of SIMD execution was found not to
limit the rate of processing in this phase.
2. The moderate parallelism of the inter-condition testing is done in 32
LPE's, with NON-VON operating under a partitioned-SIMD execution
discipline. Pure SIMD processing would have been a serious constraint
during this portion of the execution, since evaluation of the Rete
dataflow graphs presents a high degree of data sensitivity. The use of
subtrees of SPE's as active memories enhances the throughput of the
LPE's by providing a fast associative search capability.
3. The select and act phases, which have little inherent parallelism in
OPS5, are performed in a single relatively fast host processor, although
the LPE's and SPE's do some "bookkeeping" and overlapped processing
for the next match phase at this time.
The overlapped processing is as follows. During the action phase of the production
system cycle, the host executes the right hand side of the selected instantiation.
This commonly results in the addition of facts f_1, ..., f_k to working memory. For
each f_i the host assigns a time-tag for identification and converts attribute values
to tokens to obtain the working memory token t_i. Each t_i is installed in a table
of working memory elements in the host, and is transmitted to the LPE's for use in
the next matching phase of the production system cycle. The matching phase for
t_1 starts in the LPE's and SPE's while the host asynchronously creates and
transmits t_2, ..., t_k. With the exception of the time for t_1, the host processing
for an addition to working memory overlaps matching, and does not contribute to the
running time of the algorithm. Similarly, LPE's asynchronously finish the matching
phase for each t_i, depending on the amount of activity in the dataflow graph of
each partition. Thus the host receives conflict set changes from LPE's that finish
early, overlapped with continued matching in other LPE's. Only one
synchronization point occurs in the production system cycle: to be consistent with
the semantics of OPS5, all changes to the conflict set must reach the host prior to
the completion of conflict resolution for the next cycle.

[1] This can be improved to time logarithmic in the number of terms in the longest
condition element with a worst-case 50% decrease in SPE utilization, by techniques
closely related to the allocation schemes for database records described in [Shaw and
Hillyer, 1982].
6.2 Description of Processing
This section gives an overview of the procedures executed by the host, LPE's, and
SPE's during the execution of an OPS5 program, and the following section presents
a concrete example. Salient steps are detailed in the analysis of section 8.
The host processor is responsible for controlling the overall computation and for
communicating with the user. Prior to the commencement of execution, the host
obtains a collection of production rules from the user (or from disk, at the direction
of the user), partitions them into subsets as described further below, and sends one
subset to each LPE. Each subset is compiled by its LPE into a dataflow graph.
These graphs are similar to those of the sequential Rete Match algorithm, but the
dataflow graph used by NON-VON is smaller, with input nodes representing entire
condition elements, rather than individual attribute constants and variables. This
increases parallelism during each execution cycle of the production system: all
condition elements are tested in one parallel step by the 16K SPE's before the 32
LPE's commence their dataflow graph processing.
Host processing during the three phases of the production system execution cycle
proceeds as follows:
1. During the match phase, the host receives messages from the LPE's,
which report changes to the conflict set. In our implementation the host
maintains the conflict set as a list [2] sorted by the OPS5 conflict
resolution criteria.
2. During the select phase, the host chooses a production instantiation to be
fired. In our implementation this amounts to removing the item at the
head of a sorted list.
3. During the act phase, the host executes I/O specified in the right hand
side of the chosen production, creates new working memory tokens to
represent facts to be added to working memory, and broadcasts messages
to the LPE's. Each message contains a working memory token to be
added or deleted.

[2] Average length 16 (Appendix, item 5).
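The host's handling of the conflict set, as described in the list above, can be sketched as a sorted list with insertion, deletion, and removal of the head. The class ConflictSet and its ordering key are hypothetical: the key is a simplified recency comparison on time tags, in which an instantiation with additional tags is ranked ahead of an otherwise equal shorter one, and it stands in for the full OPS5 conflict resolution criteria.

```python
import bisect

class ConflictSet:
    """Instantiations kept in a list sorted best-first."""

    def __init__(self):
        self._items = []    # tuples (key, production name, time tags)

    @staticmethod
    def _key(tags):
        # Negated tags make recent (larger) tags sort first; the trailing 1
        # ranks an exhausted tag list behind one that still has tags left.
        return tuple(-t for t in sorted(tags, reverse=True)) + (1,)

    def add(self, production, tags):
        """Match phase: an LPE reported a new instantiation."""
        bisect.insort(self._items, (self._key(tags), production, tags))

    def remove(self, production, tags):
        """Match phase: an LPE reported that an instantiation is gone."""
        self._items.remove((self._key(tags), production, tags))

    def select(self):
        """Select phase: take the item at the head of the sorted list."""
        if not self._items:
            return None
        _, production, tags = self._items.pop(0)
        return production, tags

cs = ConflictSet()
cs.add("p1", (1, 2))
cs.add("p2", (3,))
cs.add("p3", (2, 3))
best = cs.select()
cs.remove("p1", (1, 2))
next_best = cs.select()
```

With an average list length of about 16, as the footnote reports, a simple sorted list of this kind keeps the select step to a single removal from the head.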
Each LPE is responsible for a subset of the production system rules and for a
subtree of SPE's. The LPE generates an instruction stream for the SPE's,
evaluates the dataflow graph for the production system partition stored in that
subtree, and obtains the resulting conflict set changes.
In particular,
1. During the match phase, each LPE broadcasts SIMD code to its SPE's to
associatively determine additions and deletions to the set of condition
elements that are satisfied. A list of these changes is built. Each
addition is represented by a new memory token which, according to the
Rete technique, is stored in an α-mem or β-mem node in the dataflow
graph. This is implemented by associatively locating an available SPE
and storing the token there. Insertion or deletion of a token T in a
(non-terminal) α-mem or β-mem node M is followed by evaluation of the
two-input node N that follows M in the dataflow graph. The evaluation
of N is implemented as follows. First the LPE looks in its table of
two-input nodes to determine the memory node M' that is the other input of
N. Next, an associative probe is performed in all SPE's in the subtree
to locate tokens in M'. An associative match is performed to discover
which tokens T'i in M' have all their variables bound consistently with
the variables in T. The final step in evaluating N depends on whether
N is of type AND or NOT, and on whether T is a left or right input to
N. Suppressing details, we state that in most cases the LPE retrieves
matching tokens T'i and concatenates them in turn with T to form new
memory tokens T"i that are placed in the memory node at the output of
N. Changes in terminal memory nodes of the dataflow graph represent
changes to the conflict set; these changes are reported to the host as
they are detected.
2. During the select phase, which is just long enough for the host to
extract the first element of its conflict set list, the LPE's are idle.
3. During the act phase of the host, the LPE's receive messages from the
host that contain working memory tokens and commands to add or
delete the tokens from working memory. These tokens are stored in
appropriate LPE tables, and are processed as described above: the
matching phase for LPE's and SPE's is overlapped with the act phase of
the host.
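The evaluation of a two-input AND node described in item 1 can be sketched sequentially as follows. This is only an illustration under stated assumptions, not the NON-VON code: the `Token` layout, the function names, and the choice to place the opposite token's working memory ID's first in the concatenation are all hypothetical, and the Python loop stands in for the associative probe that the SPE's perform in parallel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """A memory token: working memory element IDs plus variable bindings (hypothetical layout)."""
    wm_ids: tuple
    bindings: tuple  # sorted (variable, value) pairs


def consistent(t, other):
    """True when every variable shared by the two tokens is bound to the same value."""
    a, b = dict(t.bindings), dict(other.bindings)
    return all(a[v] == b[v] for v in a.keys() & b.keys())


def evaluate_and_node(t, opposite_memory):
    """Concatenate t with each consistent token T'i found in the opposite memory M'."""
    out = []
    for other in opposite_memory:  # on NON-VON this scan is a single associative probe
        if consistent(t, other):
            merged = dict(other.bindings)
            merged.update(dict(t.bindings))
            out.append(Token(other.wm_ids + t.wm_ids, tuple(sorted(merged.items()))))
    return out


m_prime = [Token((1,), (("x", 3),)), Token((2,), (("x", 7),))]
new_token = Token((5,), (("x", 3), ("y", 4)))
result = evaluate_and_node(new_token, m_prime)
print(result)
```

Only the first token in `m_prime` agrees with the new token on `x`, so a single concatenated token is produced for the output memory node.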
The SPE's in a particular LPE-rooted subtree serve as an active memory. They
contain the condition element terms of compiled productions, and the token memory
for the Rete dataflow graph. The SPE associative processing capability facilitates
the rapid, highly parallel evaluation of condition elements with respect to new 
working memory elements, as well as the parallel evaluation of the two-input nodes 
in the dataflow graph evaluation process.
Specifically,
1. During the match phase, tokenized attribute-value pairs are broadcast to
the SPE's by their controlling LPE. The SPE's, which each contain one
term of a condition element, then simultaneously evaluate the satisfaction
of all terms. Next, chains of adjacent SPE's communicate in parallel to
determine the satisfaction of entire condition elements, and the ID's of
satisfied condition elements are retrieved by the controlling LPE. During
the two-input testing in the Rete dataflow graph, all SPE's containing a
token for the relevant memory node simultaneously check the consistency
of variable bindings, and the LPE retrieves successful matches. Deletion
of a working memory element token T causes all concatenated β-mem
tokens that contain T to become unsupported; they are all associatively
found and removed by the SPE's in one highly parallel step (with
additional matching required when a token is deleted from the right
input of a NOT node).
2. During the select phase the SPE's are idle.
3. During the act phase, the SPE's associatively find and delete the token
for the production instantiation that has been fired.
6.3 Example of OPS5 Data and Processing
This section presents an example of a condition element based on a simple AI
"blocks world", shows how it is stored in NON-VON SPE's, and describes how it is
tested on a sample working memory element.
The following condition element will match a block that has no square faces, saving
the edge lengths in the variables <l>, <w>, and <h>. It ensures that the
width of the block is different from the length, and that the height is different
from the length and width. Note that the first occurrence of a variable simply
binds its value, while later occurrences refer to this value for comparison. Braces
enclose conjunctions of terms pertaining to an attribute.
    (block ^length <l>
           ^width  {<w> <> <l>}
           ^height {<h> <> <l> <> <w>})
This condition element occupies 4 NON-VON SPE's, as depicted in Figure 6. Each
SPE holds a comparison operator and two 32-bit integers, which may be either
constants, or variables that are filled by attribute value tokens at runtime.
[Figure 6: Condition Element Stored in Four SPE's]
To do the matching of a working memory element, for instance
(block ^length 3 ^width 4 ^height 5), two main steps are required. In the first
step, the truth or falsity of each term is determined by SIMD execution of code
that each LPE broadcasts to the SPE's in its partition. The operations performed
for this example are:
1. Associatively probe for all SPE's having a ^classname variable, and
broadcast the token for "block" to them.
2. Associatively probe for all SPE's having a ^length variable, and
broadcast the value 3 to them.
3. Associatively probe for all SPE's having a ^width variable, and broadcast
the value 4 to them.
4. Associatively probe for all SPE's having a ^height variable, and
broadcast the value 5 to them.
5. In parallel, evaluate the stored relational operator on the two stored
values in all SPE's.
In the second step, the truth or falsity of intra-condition tests for entire condition
elements is determined as follows:
1. Associatively locate all SPE's that hold the first term of a condition
element.
2. In parallel, send the boolean result of the comparison in the first terms
to the SPE's having second terms of condition elements, where logical
conjunction is performed. Now the second terms contain an indication of
whether both the first and second terms of a condition element were
satisfied.
3. Send this result in parallel to SPE's containing the third terms, where
logical conjunction is performed again.
4. Continue for a number of steps one less than the number of
terms in the longest condition element. Any SPE holding the last term
of a condition element will then contain TRUE or FALSE for the entire
condition element.
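The two steps above can be sketched for the blocks example as follows. This is a sequential illustration only. The split of the condition element into the four terms below is an assumption (Figure 6 is not reproduced here), and the names `TERMS`, `evaluate_terms`, and `chain_conjunction` are hypothetical; on NON-VON each term lives in its own SPE and all terms are evaluated at once.

```python
import operator

OPS = {"=": operator.eq, "<>": operator.ne}

# One entry per SPE: (first value's attribute, operator, second value).
# The second value is either a constant or another attribute filled at runtime.
TERMS = [
    ("classname", "=", ("const", "block")),
    ("width", "<>", ("attr", "length")),
    ("height", "<>", ("attr", "length")),
    ("height", "<>", ("attr", "width")),
]


def evaluate_terms(wme):
    """Step 1: every SPE applies its stored operator to its two stored values."""
    results = []
    for left, op, (kind, right) in TERMS:
        right_value = wme[right] if kind == "attr" else right
        results.append(OPS[op](wme[left], right_value))
    return results


def chain_conjunction(results):
    """Step 2: pass the running AND from term to term; the last term holds the answer."""
    running = results[0]
    for r in results[1:]:  # one communication step per remaining term
        running = running and r
    return running


brick = {"classname": "block", "length": 3, "width": 4, "height": 5}
cube = {"classname": "block", "length": 3, "width": 3, "height": 3}
print(chain_conjunction(evaluate_terms(brick)))
print(chain_conjunction(evaluate_terms(cube)))
```

The sample working memory element from the text satisfies all four terms, while a cube fails every inequality test.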
After the intra-condition testing has been performed in a partition, the LPE
associatively enumerates the ID's of condition elements that were satisfied. In
NON-VON, this enumeration occurs in time proportional to the number that are
satisfied, independent of the total number of condition elements. The expected
number of satisfied condition elements per partition is less than one (Appendix, item
4). The LPE places α-mem tokens for satisfied condition elements on a stack of
pending memory node additions.
The memory node additions in a partition are handled sequentially by the
controlling LPE. Since the expected number of node additions per partition is less
than one, this is both fast and economical.
To perform a memory node addition, the LPE performs several steps. The LPE
broadcasts SIMD code to associatively allocate SPE's to hold the token, and then
code to store the token. Then the LPE determines which two-input node is
triggered by that memory node addition. The two-input node is evaluated in two
steps. First, the opposite input memory for the node (a distributed set of tokens
stored in SPE's) is activated by an associative search. Second, an associative match
of relevant variable bindings is performed in parallel between the new token and all
members of the opposite memory. Any consistent bindings are discovered by an
associative probe. Portions of successfully matched tokens are reported to the LPE,
and new memory node additions/deletions are placed on the stack.
The LPE recognizes an insertion of a token into any memory node at the bottom
of the Rete network to be the addition of a production instantiation to the conflict
set. Such a token is also reported to the host for conflict set resolution.
When all pending memory node additions have been processed, the LPE sends a
completion signal to the host, which performs conflict resolution when all LPE's are
finished. A centralized algorithm is reasonable for this step since the average size
of the global conflict set is 16 (Appendix, item 5), and the average number of
changes to the conflict set is 5.3 per firing cycle (Appendix, item 6). The host
then executes the right hand side actions of the chosen production, sending messages 
to the LPE's to effect changes to working memory. 
6.4 Partitioning Production Systems 
In the production systems examined in Gupta and Forgy [1983], approximately 30
(Appendix, item 3) condition elements are satisfied by a typical single addition to
working memory. As noted earlier, Gupta [1984] suggests obtaining parallelism by
dividing OPS5 production systems into 32 partitions that execute concurrently. The
problem of obtaining a suitable partitioning is separable from the problem of
executing the resulting partitions, and as such is beyond the scope of this paper.
Relevant to the present discussion, however, is the fact that the goal of partitioning
is to distribute the two-input node testing uniformly over the partitions, despite the
fact that cascading of two-input node activations through the Rete dataflow graph
is necessarily sequential. Although results have been reported in [Ishida, 1984] and
[Oflazer, 1984], the partitioning problem remains an area for future work.
A "good" partitioning method is assumed to exist, where the meaning of "good" is
determined by the statistical parameters given in the performance analysis section,
as justified in the appendix. Somewhat similar assumptions have previously been
adopted by Gupta [1984] and Miranker [1984b].
During execution, a partitioned production system could exhibit transient
over-concentration of α-mem and β-mem tokens in individual partitions, requiring
significantly more storage than for the average case. It remains to be seen whether
this is a problem in practice. One could speculate that such over-concentration,
should it occur,
- results from a production system programming style, or
- is due to the nature of the OPS5 execution semantics, or
- is the consequence of a particular partitioning algorithm, or
- is an unavoidable element of partitioned execution of production systems.
These possibilities have greatly differing implications for appropriate remedies. We
are not aware of any results reported in the literature that illuminate these issues,
and thus regard the (potential) problem of over-concentration as an open question.
7 Hardware Capacities Required 
The storage required in an SPE is 24 bytes plus 18 1-bit flags. This represents
space for one term of a condition element, one α-mem token or portion of a β-mem
token, and one relevant binding. More specifically, the byte space is allocated as
follows:
- Two attribute name ID's (1 byte each)
- Two attribute-value tokens (4 bytes each)
- A condition element ID (2 bytes)
- An α-mem or β-mem node ID (2 bytes)
- A working memory element ID stored in an α-mem or β-mem node (4 bytes)
- A relevant binding value or conflict set member ID (4 bytes)
- The ID number of the SPE (2 bytes).
The 1-bit flags mark subsets of SPE's that contain:
- A condition-element term
- The first term of a condition element
- The last term of a condition element
- An α-mem or β-mem token
- The first cell of a token
- The last cell of a token
- A working memory element ID in a token
- A relevant binding in a token
- The type (string or numeric) of the first value
- The type of the second value
- The type of the relevant binding
- A condition element that feeds the right-hand input of a NOT node
- A member of the conflict set
- Compare for equal
- Compare for not equal
- Compare for less
- Compare for less-or-equal
- Compare for same type
Thus the 64 byte RAM of current NON-VON SPE's is of sufficient capacity to
store a term-evaluating node as well as a memory node and relevant binding. The
extra RAM could hold more tokens (but with a decrease in execution speed), or
possibly data such as the dataflow graph connections that would otherwise be stored
in the LPE's.
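The byte budget above can be checked with a short calculation. The field sizes are those listed in the text; packing the 18 flags eight to a byte in the same RAM is an assumption made only for this sketch, since the text does not say where the flags are held.

```python
import math

# Byte fields of one SPE, as listed in the text.
byte_fields = {
    "attribute name IDs": 2 * 1,
    "attribute-value tokens": 2 * 4,
    "condition element ID": 2,
    "memory node ID": 2,
    "working memory element ID": 4,
    "relevant binding or conflict set member ID": 4,
    "SPE ID": 2,
}
n_flags = 18
ram_bytes = 64

data_bytes = sum(byte_fields.values())
flag_bytes = math.ceil(n_flags / 8)  # assumption: flags packed 8 per byte
spare_bytes = ram_bytes - data_bytes - flag_bytes
print(data_bytes, flag_bytes, spare_bytes)
```

Under that packing assumption, more than half of the 64 bytes remain free, which is the "extra RAM" referred to above.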
The average of the production systems examined has 4 condition elements per
production (Appendix, item 7) and 3 intra-condition tests per condition element
(Appendix, item 10). Thus the number of SPE's required for production memory in
NON-VON is about 12 times the number of productions. The average of the
production systems examined has 910 productions (Appendix, item 11), requiring
10,920 SPE's for condition element terms. The analysis presented below assumes
16K SPE's. Quadrupling the number of SPE's would permit systems of
approximately 5400 rules to be executed, but might decrease the clocking speed by
as much as 10 percent, absent technological compensation.
The maximum aggregate number of tokens in α-mem and β-mem nodes is about
4600 (Appendix, item 12), and the average token requires 2 SPE's (Appendix, item
21); thus, about 9200 SPE's are required to store the tokens. A NON-VON having
16K SPE's is sufficient. Note from the storage allocation described previously that
each SPE can hold both a condition element term and part of a token at the same
time. Note also that the number of tokens stored in an SPE could be increased if
necessary, although the execution speed would be reduced somewhat.
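The SPE counts in the two preceding paragraphs follow from simple arithmetic, sketched here. The variable names are illustrative, and "16K" is taken to mean 16 x 1024.

```python
ce_per_production = 4      # condition elements per production (from the text)
tests_per_ce = 3           # intra-condition tests per condition element
productions = 910          # average number of productions
total_spes = 16 * 1024     # "16K" SPE's

spes_per_production = ce_per_production * tests_per_ce
term_spes = spes_per_production * productions
rules_with_4x = (4 * total_spes) // spes_per_production

max_tokens = 4600          # maximum aggregate number of tokens
spes_per_token = 2         # SPE's per average token
token_spes = max_tokens * spes_per_token
token_fill = token_spes / total_spes

print(term_spes, rules_with_4x, token_spes, round(token_fill, 2))
```

The quadrupled machine holds about 5461 productions, which the text rounds to 5400, and token storage occupies roughly 56 percent of the 16K SPE's.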
A host storage capacity of one megabyte would accommodate the following:
1. A symbol table of string tokens.
2. The current members of the conflict set.
3. Working memory elements with time tags indicating order of creation.
4. Right-hand sides of productions, indexed by production ID's.
5. Software for the interpretation of OPS5, including communication with
the user and with the LPE's.
We estimate that 256K bytes would be sufficient to store the data and code
required by an LPE, which includes:
1. Working memory elements with time tags indicating order of creation.
(This duplicates information stored in the host, but reduces the need for
communication.)
2. Tables encoding the Rete dataflow graph.
3. Precompiled sequences of SPE instructions that the LPE will command
the active memory controller to broadcast into the subtree, to perform
data storage, associative matching, and associative retrieval.
4. Procedures to be executed by the LPE in performing its portion of the
production system work, and for compiling productions into the dataflow
graph.
8 Performance of Rete Match on NON-VON
Execution speeds of 1 to 12 cycles (or "rule firings") per second on a VAX 11/780
are typical for OPS5 systems of the size analyzed in this paper [Gupta, 1984
(private communication)]. Using data from [Gupta and Forgy, 1983], we obtain
below a predicted execution speed for NON-VON of 861 production firings per
second. The formulas suggest that NON-VON's advantage may increase as
production systems become larger. Intuitively, NON-VON is insensitive to the total
number of objects because of its associative processing capabilities.
We have written and tested on an instruction-level simulator an experimental
compiler and runtime system for the execution of OPS5 on a one-LPE NON-VON.
By examining the NON-VON instructions, we determine the number of slow and
fast clock cycles required for each of the SIMD processing steps, as a function of
parameters of the production system. Adding approximate overhead values for
non-overlapped execution in LPE's and the host (see footnote 3) gives the time
required per production firing cycle (see footnote 4), and hence the rate of
production system execution on NON-VON.
These calculations are presented in three sections below. The first derives the time
required for an addition to working memory, the second calculates the time for a
deletion, and finally, the overall time including conflict resolution is obtained.
In the following analysis, F denotes the fast clock period for the SPE's, S denotes
the slow clock period for the SPE's, and H denotes the average instruction time for
the LPE and host microprocessors.
Footnote 3: The figures given for host and LPE processing are estimates, unlike the SPE
figures, which are derived from actual code.
Footnote 4: We assume that the only actions in the right-hand sides of productions are
additions, deletions, and modifications of working memory elements. The right-hand
side of a production expressed in OPS5 can cause arbitrarily large amounts of I/O
and can call any function written in LISP, but the time consumed by such
operations does not give information about the performance of the production
system inferencing engine.
8.1 Addition to Working Memory 
The analysis for an addition to working memory will be carried out in six parts.
1. The time for broadcasting the attribute-value pairs of a new working
memory element.
2. The time for intra-condition testing.
3. The time for processing α-mem node additions.
4. The time for processing β-mem node additions.
5. The time for evaluating two-input node tests.
6. The time for processing that may arise upon deletion of a token from
the right-hand input of a NOT node.
The first step in processing a working memory addition is to store it into a table
in the LPE, assigning a time-tag. Next, the LPE broadcasts the attribute-value
pairs. For each term, the attribute ID is associatively matched with the attribute
ID's stored in all SPE's. Then the value of the attribute is broadcast and stored in
parallel by those SPE's for which the attribute ID matched. The time required is
given by
$$T_{\text{broadcast}} = 26F \times \text{nAttr}_{\text{class}} + 5F + \text{LPE}_{\text{broadcast}}$$
where
nAttr_class, the number of attribute-value pairs in this
working memory element's class, is 11.4 (Appendix, item 14), and
LPE_broadcast, the number of non-overlapped LPE and host
instructions, is simply assumed to be 60. This reflects the
time required to store into the LPE's table and assign a time tag.
Thus $T_{\text{broadcast}} = 301F + 60H$.
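As a check on the arithmetic, the broadcast cost can be evaluated directly. The sketch reads the garbled constant term in the formula as 5F, which is the value that reproduces the stated total; the variable names are illustrative.

```python
n_attr_class = 11.4   # attribute-value pairs per class (Appendix, item 14)
lpe_broadcast = 60    # non-overlapped LPE and host instructions (assumed in the text)

fast = 26 * n_attr_class + 5   # fast SPE clock cycles
print(f"T_broadcast = {round(fast)}F + {lpe_broadcast}H")
```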
The next step is the actual intra-condition testing. First, the two values stored in
an SPE containing a term are compared. Second, the success of the comparison
relative to the stored relational operator is determined. Third, the conjunction of
the terms in a condition element is evaluated in time proportional to the longest
condition element. Finally, matching condition element ID's are reported out, and
tokens for corresponding α-mem node additions are stacked. The time required is
given by
$$T_{\text{match}} = 86F + (\text{maxTerms} - 1) \times (3F + 2S) + \text{successes} \times (6F + 2S) + \text{LPE}_{\text{match}}$$
where
maxTerms, the largest number of terms in a condition element, is 9
(Appendix, item 15),
successes, the maximum number of condition elements that are satisfied
in any partition, on the average, is 3 (Appendix, item 4),
LPE_match, the number of non-overlapped LPE instructions to construct
and stack tokens resulting from successes, is simply assumed to be 40.
Thus $T_{\text{match}} = 128F + 22S + 40H$.
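The intra-condition testing cost can be verified the same way. The variable names are illustrative; the constants are those of the formula and parameter list above.

```python
max_terms = 9    # largest number of terms in a condition element (Appendix, item 15)
successes = 3    # satisfied condition elements in the busiest partition (Appendix, item 4)
lpe_match = 40   # non-overlapped LPE instructions (assumed in the text)

fast = 86 + (max_terms - 1) * 3 + successes * 6
slow = (max_terms - 1) * 2 + successes * 2
print(f"T_match = {fast}F + {slow}S + {lpe_match}H")
```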
The processing required for an α-mem node addition includes removing the top
entry (a node ID and a working-memory element ID) from the stack of pending
additions, looking up relevant binding indices from the Rete net array (subscripted
by node ID), copying relevant binding values from the current working memory
element (array access), selecting an available SPE to hold the token and bindings, if
any, and broadcasting the token into the SPE for storage. The time required for
the α-mem node additions resulting from an addition to working memory is given
by
$$
\begin{aligned}
T_{\alpha\text{-mem}} ={} & \text{nAdds} \times \big( 21F + 2S + (\text{tokLen} - 1) \times (14F + 4S) + \text{wmIDs} \times 4F \\
& \quad + \text{relBinds} \times 6F + \text{strToks} \times 1F + [3F \text{ if rhsNOT}] \big) \\
& + \text{LPE}_{\alpha\text{-mem}}
\end{aligned}
$$
where
nAdds, the maximum number of α-mem additions in any partition, is 3
(Appendix, item 4), on average,
tokLen, the average length for an α-mem token, is 1 (Appendix, item
17),
wmIDs, the number of working memory ID's stored in a token, is 1 for
α-mem, since there is just one condition element above each α-mem
node,
relBinds, the number of relevant bindings in a token, is 1 (Appendix,
item 16),
strToks, the number of relevant bindings that are string tokens rather
than numeric tokens, is assumed to be half of relBinds, or 0.5,
the term "3F if rhsNOT" is for setting a flag if this token is in a memory
node that is the right-hand input of a NOT node. This flag facilitates
rapid processing of deletions, as discussed later. The entry of a token
into the right-hand input of a NOT node is relatively rare, as discussed
in (Appendix, item 18). Thus the total contribution of the strToks and
rhsNOT terms is assumed to be 1 fast cycle.
LPE_α-mem, the number of LPE instructions to perform the
appropriate array accesses, is simply assumed to be 20.
Thus $T_{\alpha\text{-mem}} = 96F + 6S + 20H$.
The processing and formula for additions to β-mem nodes is the same as for α-mem
nodes. The parameter values that differ are:
nAdds is 2 (Appendix, item 19),
tokLen is 3 (Appendix, item 17),
wmIDs is 3 (Appendix, item 13).
Thus $T_{\beta\text{-mem}} = 136F + 20S + 20H$.
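Since the α-mem and β-mem costs share one formula, both totals can be checked with a single function. This sketch takes the leading constant of the formula as 21F, the value that reproduces both stated totals, and folds the strToks and rhsNOT terms into one fast cycle per addition as the text assumes. The function name `mem_add_cost` is hypothetical.

```python
def mem_add_cost(n_adds, tok_len, wm_ids, rel_binds=1, misc_fast=1, lpe=20):
    """Return (fast, slow, lpe_instructions) for memory node additions.

    misc_fast is the combined strToks and rhsNOT contribution, which the
    text assumes to be 1 fast cycle per addition.
    """
    fast = n_adds * (21 + (tok_len - 1) * 14 + wm_ids * 4 + rel_binds * 6 + misc_fast)
    slow = n_adds * (2 + (tok_len - 1) * 4)
    return fast, slow, lpe


alpha = mem_add_cost(n_adds=3, tok_len=1, wm_ids=1)
beta = mem_add_cost(n_adds=2, tok_len=3, wm_ids=3)
print(alpha, beta)
```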
The memory allocation scheme that associatively finds available SPE's to hold
tokens has the same behavior as the classical first-fit algorithm. Since most
allocation requests are for token space of size 1, and none are for more than 10 or
so, first-fit should behave well, and the need for compaction of free space should be
rare, unless almost all of the memory is occupied. Although the addition of a
token to a β-mem node can trigger an incremental memory space compaction if
token memory has become fragmented and is nearly full, we assume that
compaction is sufficiently infrequent to be negligible. Support for this is given by
the 2/3 rule, also known as the "50% rule" [Knuth, 1968], together with the
observation (section 7) that token memory is less than 60% full when running the
production system analyzed here. Experimental examination of actual token
memory behavior in partitioned OPS5 production systems remains an area for
future work.
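For reference, classical first-fit allocation over a row of cells can be sketched as below. This is the textbook sequential algorithm, not the associative NON-VON implementation; modelling token memory as a list of free flags and requiring a token's cells to be adjacent are assumptions made for the illustration, and the function names are hypothetical.

```python
def first_fit(free, size):
    """Return the index of the first run of `size` consecutive free cells, or None."""
    run_start, run_len = None, 0
    for i, is_free in enumerate(free):
        if is_free:
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len == size:
                return run_start
        else:
            run_len = 0
    return None


def allocate(free, size):
    """Mark the first fitting run as occupied and return its start index (None if no fit)."""
    start = first_fit(free, size)
    if start is not None:
        for i in range(start, start + size):
            free[i] = False
    return start


cells = [False, True, False, True, True, True]
first = allocate(cells, 2)    # skips the isolated free cell at index 1
second = allocate(cells, 1)   # the isolated cell now satisfies a size-1 request
third = allocate(cells, 2)    # only one free cell remains, so the request fails
print(first, second, third)
```

The example shows why size-1 requests keep fragmentation harmless: the isolated hole that a two-cell token could not use is filled by the next single-cell token.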
An addition of a token to an α-mem or β-mem node initiates the processing of a
two-input node in the Rete dataflow graph. The purpose of a two-input node is to
evaluate relational operators that reference variables bound in previous condition
elements. The right-hand input of a two-input node is from a single condition
element, the "current" one. The left-hand input is from other condition elements of
the production. The two-input node examines variables that are bound in previous
condition elements and referenced for comparison in the current one. The values of
such variables are relevant bindings for this two-input node. In summary, an
addition to an α-mem or β-mem node causes comparison of relevant bindings with
tokens stored in the opposite input of the two-input node.
The LPE performs array accesses to determine the node ID of the opposite input, 
to identify the comparison operators for this node, and to extract the current 
binding from the token whose addition triggered this processing. An associative
probe activates all tokens in the opposite node. The current binding and
comparison operators are broadcast to those tokens, and associative matching
identifies tokens having all comparisons satisfied (see footnote 5). In the case of
AND nodes, the working memory element ID's are retrieved from satisfied tokens,
and new tokens are formed and stacked for addition to the β-mem node that receives
the output of this two-input node. Processing a right-hand input to a NOT node is
initially similar to that for an AND node, but after working memory ID's are
retrieved from satisfied tokens, deletions (rather than additions) are stacked for
further processing, as accounted for separately below. In contrast, upon addition
to the left-hand input of a NOT node, the input token is copied to the output
β-mem node only if no opposite tokens were satisfied.
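$$
\begin{aligned}
T_{\text{two-input}} ={} & \text{nAdds} \times \Big\{ P_e \times (8F + 2S) \\
& \quad + P_o \times \Big[\, \text{nBind} \times \Big( 2F + 2S +
\begin{cases}
4F & \text{if } \texttt{<=>} \text{ string} \\
5F & \text{if } \texttt{<=>} \text{ number} \\
10F & \text{if } \texttt{=}, \texttt{<>}, \texttt{<}, \texttt{<=}, \texttt{>}, \texttt{>=}
\end{cases} \Big) \\
& \qquad + (11F + 2S) \\
& \qquad + \big[\text{if not LHS input of NOT node: } 7F + 2S + \text{nSuccessfulMatch} \times (4F + 2S + \text{tokLen} \times (7F + 2S))\big] \\
& \qquad + \big[\text{if nBind} > 1\text{: } 4S \times ((\text{nBind} - 1) \;\mathrm{DIV}\; 2)\big] \\
& \qquad + \big[\text{if nBind even: } 1F + 2S\big] \,\Big] \Big\} \\
& + \text{LPE}_{\text{two-input}}
\end{aligned}
$$
where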
nAdds, the maximum number of α-mem plus β-mem additions in any
partition, is 5 (Appendix, items 4, 19) on average,
P_e, the probability that the opposite input memory is empty,
is 0.7 (Appendix, item 20),
P_o, the probability that the opposite input memory has tokens,
is 0.3 (Appendix, item 20),
nBind, the number of relevant bindings in a token, is 1 (Appendix,
item 16),
it is assumed that the computational savings for the <=> operator is
never obtained,
it is assumed that the computational savings for entry to the LHS of
NOT two-input nodes is never obtained,
nSuccessfulMatch, the total number of successful matches during the
comparison, is 2, since these cause the 2 β-mem node additions
(Appendix, item 19),
tokLen, the average length for a token in the opposite memory,
is 2 (Appendix, item 21),
LPE_two-input, the number of non-overlapped LPE instructions to do the
appropriate array accesses and create and stack any output tokens, is
simply assumed to be 50.
Thus $T_{\text{two-input}} = 127F + 34S + 50H$.
Footnote 5: The idea of storing relevant bindings in tokens to facilitate associative
matching is found in [Gupta, 1984].
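The two-input cost is the most involved of the six parts, so a small evaluator helps confirm the stated total. The sketch follows the formula as read here, with the garbled constants taken as 11F and 7F (the values that reproduce 127F + 34S) and the comparison cost fixed at 10F because the text assumes the `<=>` savings are never obtained. The function and parameter names are hypothetical.

```python
def two_input_cost(n_adds, p_empty, n_bind, n_matches, tok_len,
                   op_fast=10, lhs_of_not=False, lpe=50):
    """Return (fast, slow, lpe_instructions) for two-input node processing."""
    p_nonempty = 1 - p_empty
    # Cost when the opposite memory holds tokens.
    fast = n_bind * (2 + op_fast) + 11
    slow = n_bind * 2 + 2
    if not lhs_of_not:
        fast += 7 + n_matches * (4 + tok_len * 7)
        slow += 2 + n_matches * (2 + tok_len * 2)
    if n_bind > 1:
        slow += 4 * ((n_bind - 1) // 2)
    if n_bind % 2 == 0:
        fast += 1
        slow += 2
    # Weight by the probability that the opposite memory is empty or not.
    total_fast = n_adds * (p_empty * 8 + p_nonempty * fast)
    total_slow = n_adds * (p_empty * 2 + p_nonempty * slow)
    return total_fast, total_slow, lpe


fast, slow, host = two_input_cost(n_adds=5, p_empty=0.7, n_bind=1, n_matches=2, tok_len=2)
print(f"T_two-input = {round(fast)}F + {round(slow)}S + {host}H")

# The same function with nAdds = 1 and no successful matches gives the
# per-token match cost that the deletion analysis later rounds to 15F + 3S.
match_fast, match_slow, _ = two_input_cost(1, 0.7, 1, 0, 2)
```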
The processing of token deletions induced by an addition to the right-hand input of
a NOT node is as follows. The goal is to remove all tokens in descendant nodes in
the Rete net that depend on the token that is deleted. This is done by
associatively activating all those nodes, associatively matching the working memory
element ID's comprising the deleted token to find tokens that depend on the deleted
token.




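$$
\begin{aligned}
T_{\text{rhsNOTdeletion}} ={} & \text{edel} \times \big\{ 69F + \text{nDescendants} \times 8F + \text{wmIDs} \times (2F + 2S) \\
& \quad + \big[\text{if wmIDs} > 1\text{: } 4S \times ((\text{wmIDs} - 1) \;\mathrm{DIV}\; 2)\big] \\
& \quad + \big[\text{if wmIDs even: } 1F + 2S\big] \\
& \quad + \text{nCSdel} \times (8F + 2S) \\
& \quad + (\text{maxLen} - 1) \times (3F + 2S) \big\} \\
& + \text{LPE}_{\text{rhsNOTdeletion}}
\end{aligned}
$$
where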
edel, the expected maximal number of deletions in any partition as
a result of an entry into a NOT node, is 0.6 (Appendix, item 23). (It
makes sense to use a non-integral expectation since synchronization is
not required after each addition or deletion.)
nDescendants, the average number of β-mem nodes in the Rete net that
are descendants of a NOT node, is 2 (Appendix, item 24),
wmIDs, the average number of working memory element ID's stored in a
β-mem token, is 3 (Appendix, item 13),
nCSdel, the number of conflict set members deleted as a result of deleting
one token, is 0.16 (Appendix, item 26),
maxLen, the length of the largest token, is 9 (Appendix, item 15),
LPE_rhsNOTdeletion is assumed to be 40 instructions.
Thus $T_{\text{rhsNOTdeletion}} = 70F + 16S + 40H$.
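The deletion cost can be checked numerically as well. Because the head of the formula is damaged in the source, this sketch assumes that the braced expression is multiplied by edel, the first parameter defined in the list; that reading reproduces the stated total of 70F + 16S to within rounding. Variable names are illustrative.

```python
e_del = 0.6          # expected deletions per partition (Appendix, item 23)
n_descendants = 2    # descendant memory nodes of a NOT node (Appendix, item 24)
wm_ids = 3           # working memory element IDs per token (Appendix, item 13)
n_cs_del = 0.16      # conflict set members deleted per token (Appendix, item 26)
max_len = 9          # length of the largest token (Appendix, item 15)
lpe = 40             # LPE instructions (assumed in the text)

fast = 69 + n_descendants * 8 + wm_ids * 2 + n_cs_del * 8 + (max_len - 1) * 3
slow = wm_ids * 2 + n_cs_del * 2 + (max_len - 1) * 2
if wm_ids > 1:
    slow += 4 * ((wm_ids - 1) // 2)
if wm_ids % 2 == 0:
    fast += 1
    slow += 2
fast *= e_del
slow *= e_del
print(f"T_rhsNOTdeletion = {round(fast)}F + {round(slow)}S + {lpe}H")
```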
The total number of processing cycles resulting from an addition to working
memory, T_add, is the sum of the 6 partial results obtained above.
$$
\begin{aligned}
T_{\text{add}} &= T_{\text{broadcast}} + T_{\text{match}} + T_{\alpha\text{-mem}} + T_{\beta\text{-mem}} + T_{\text{two-input}} + T_{\text{rhsNOTdeletion}} \\
&= 858F + 98S + 230H
\end{aligned}
$$
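The sum of the six partial results can be confirmed componentwise. Each tuple below is (fast cycles, slow cycles, LPE/host instructions), with the deletion term read as 70F + 16S + 40H; the dictionary keys are illustrative labels.

```python
parts = {
    "broadcast": (301, 0, 60),
    "match": (128, 22, 40),
    "alpha_mem": (96, 6, 20),
    "beta_mem": (136, 20, 20),
    "two_input": (127, 34, 50),
    "rhs_not_deletion": (70, 16, 40),
}

# Add the fast, slow, and host components separately.
t_add = tuple(sum(component) for component in zip(*parts.values()))
print("T_add = {}F + {}S + {}H".format(*t_add))
```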
8.2 Deletion from Working Memory
Deletion of a working memory element is considerably faster than addition, since
the principal actions are associatively searching among all α-mem and β-mem nodes
to locate tokens that depend on that working memory element, and erasing all such
tokens in parallel. It also is necessary to delete the working memory element from
the LPE table. There are two complications that arise, however. First, deletion of
a working memory element from the right-hand input of a NOT node can
"unblock" tokens in the left-hand input. This case is infrequent, but if it occurs,
processing for the comparison of relevant bindings is necessary. Second, deletion of
a working memory element may cause the removal of members of the conflict set.
This will happen naturally in token memory, but the affected instantiations must be
retrieved by the LPE so the tokens can be removed from the global conflict set
table. The time required for deletion of working memory elements is given by
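$$
\begin{aligned}
T_{\text{del}} ={} & 80F + 6S + \text{maxLen} \times (6F + 4S) \\
& + \text{rhsNOT} \times [10F + (\text{tokLen} - 1) \times (7F + 2S) + \text{matchCost}] \\
& + \text{nCsDel} \times (8F + 2S) \\
& + \text{LPE}_{\text{del}}
\end{aligned}
$$
where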
maxLen, the length of the largest token, is 9 (Appendix, item 15),
rhsNOT, the maximum number of tokens in any one partition removed from
right-hand inputs of NOT nodes as a consequence of a working memory
deletion, is 1 (Appendix, item 25),
tokLen, the average length for a token in the opposite memory,
is 2 (Appendix, item 21),
matchCost, the number of cycles to match relevant bindings, is
given by the formula for T_two-input given above, but with the
nAdds parameter = 1 for the token removed in this case, and the
nSuccessfulMatch parameter = 0 (Appendix, item 28), so
matchCost = 15F + 3S + 50H,
nCsDel, the average number of instantiations removed from the conflict set
as a result of the deletion of an arbitrary working memory element, is
0.16 (Appendix, item 25),
LPE_del is simply assumed to be 40 instructions.
Thus $T_{\text{del}} = 167F + 47S + 90H$.
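The deletion total can be reproduced with the parameter values above. The sketch reads the garbled constants as 10F, 7F, and 2S and takes matchCost as 15F + 3S + 50H, which together give the stated 167F + 47S + 90H; the variable names are illustrative.

```python
max_len = 9               # length of the largest token
rhs_not = 1               # tokens removed from right-hand inputs of NOT nodes
tok_len = 2               # average token length in the opposite memory
n_cs_del = 0.16           # instantiations removed from the conflict set
lpe_del = 40              # LPE instructions (assumed in the text)
match_cost = (15, 3, 50)  # (fast, slow, LPE instructions) for one relevant-binding match

fast = 80 + max_len * 6 + rhs_not * (10 + (tok_len - 1) * 7 + match_cost[0]) + n_cs_del * 8
slow = 6 + max_len * 4 + rhs_not * ((tok_len - 1) * 2 + match_cost[1]) + n_cs_del * 2
host = rhs_not * match_cost[2] + lpe_del
print(f"T_del = {round(fast)}F + {round(slow)}S + {host}H")
```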
8.3 Total Time per Production Firing
Since there are 2.21 changes (additions/deletions) to working memory per production
firing (Appendix, item 27), the time for these changes (Appendix, item 29) is given
by
where
T_rhs is the number of host instructions needed for conflict resolution
and evaluation of the chosen production's right-hand side.
Given the assumption that T_rhs = 500 [Gupta, 1984],
$T_{\text{firing}} = 1133F + 160S + 854H$.
From section 5 we have F = 350 nanoseconds, S = 3 microseconds, and H = 333 
nanoseconds. Hence Tfiring = 1161 microseconds, which yields an execution rate of 
861 productions per second. 
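The final conversion to a firing rate can be reproduced as follows. The sketch rests on one assumption that the surviving text does not state: the 2.21 changes per firing are split evenly between additions and deletions. That reading yields coefficients of 1133F, 160S, and 854H, and the S coefficient of 160 is the value consistent with the stated total of 1161 microseconds. Variable names are illustrative.

```python
F_US = 0.350   # fast SPE clock period, microseconds
S_US = 3.0     # slow SPE clock period, microseconds
H_US = 0.333   # average LPE/host instruction time, microseconds

t_add = (858, 98, 230)   # (fast, slow, host) for one addition
t_del = (167, 47, 90)    # (fast, slow, host) for one deletion
t_rhs = 500              # host instructions for conflict resolution and right-hand side
changes = 2.21           # working memory changes per firing

# Assumption: additions and deletions each account for half of the changes.
half = changes / 2
coeffs = [half * (a + d) for a, d in zip(t_add, t_del)]
coeffs[2] += t_rhs
fast, slow, host = (round(c) for c in coeffs)

t_firing_us = fast * F_US + slow * S_US + host * H_US
rate = 1e6 / t_firing_us
print(fast, slow, host, round(t_firing_us), int(rate))
```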
9 Conclusions
NON-VON's projected rate of production evaluation reflects the performance of a
heterogeneous architecture designed for rapid symbolic computation. The massive
parallelism of NON-VON's small processing elements was designed to be particularly
efficient in executing such operations as associative matching and data storage and
retrieval. The large processing elements and host are few enough in number that it
is feasible to make them quite powerful, with the relatively large RAM memories
needed to hold control tables and data structures such as the compiled Rete
network, together with substantial amounts of program code.
NON-VON's partitioned-SIMD mode of execution, in which instructions are
broadcast to the SPE's for execution, avoids a need to replicate identical code in
thousands of processing elements. In addition, the storage capacity of NON-VON's
small processing elements is well matched to the size of typical condition element
terms and memory nodes. (Viewed in a more general context, the NON-VON small
processing element was designed to have a capacity on the order of the size of a
typical "record".) In coarser grain, strictly MIMD machines the fit is not as
natural, and software mechanisms such as hash tables are used to accommodate the
storage of several items in each processing element, amortizing the cost of the larger
processor and program code over a greater quantity of data.
Although the formulas and parameters that predict NON-VON's performance on
this problem have been analyzed in considerable detail, the limitations of our
analytic techniques should not be ignored. An important area for future work is
the validation of our analysis by experimental measurements.
Acknowledgments
The authors are indebted to Steve Taylor and Dan Miranker for their critical 
examination of an early draft of this paper, and more generally, to Sal Stolfo and 
other members of Columbia's DADO Project, whose ideas have strongly influenced 
our own. Special thanks are due Anoop Gupta for providing data and helpful 
insights. This research was supported in part by the Defense Advanced Research
Projects Agency under contract N00039-82-C-0427, by the New York State Center
for Advanced Technology in Computers and Information Systems at Columbia
University, by an IBM Fellowship, and by an IBM Faculty Development Award.
References 
B. G. Buchanan, "New research on expert systems", Machine Intelligence 10, J. E. 
Hayes et al. (eds.), Halsted Press, New York, 1982, pp. 269-299. 
Michael J. Flynn, "Some computer organizations and their effectiveness", IEEE 
Transactions on Computers, vol. C-21, September 1972, pp. 948-960. 
Charles L. Forgy, On the Efficient Implementation of Production Systems, Ph.D. 
Thesis, Carnegie-Mellon Computer Science Department, February 1979. 
Charles L. Forgy, "Note on production systems and Illiac IV", Technical Report, 
Carnegie-Mellon Computer Science Department, July 1980. 
Charles L. Forgy, OPS5 Users' Manual, Technical Report CMU-CS-81-135, Carnegie-
Mellon University, 1981. 
Charles L. Forgy, "Rete: A fast algorithm for the many pattern/many object 
pattern match problem", Artificial Intelligence, vol. 19, no. 1, September 1982, pp. 
17-37. 
Charles L. Forgy and John McDermott, "OPS, a domain-independent production 
system language", IJCAI-77, Proceedings of the fifth international joint conference 
on artificial intelligence, August 1977, pp. 933-939. 
Anoop Gupta, "Implementing OPS5 production systems on DADO", Proceedings of 
the 1984 international conference on parallel processing, August 21-24, 1984, pp. 
83-91. 
Anoop Gupta and Charles L. Forgy, "Measurements on production systems", 
Technical Report, Carnegie-Mellon Computer Science Department, 1983 (undated). 
Frederick Hayes-Roth et al. (eds.), Building Expert Systems, Addison-Wesley, 
Reading, Mass., 1983. 
Toru Ishida and Salvatore J. Stolfo, "Simultaneous firing of production rules on 
tree-structured machines", Technical Report, Columbia Computer Science 
Department, 1984. 
Donald E. Knuth, The Art of Computer Programming: Volume 1, Fundamental 
Algorithms, Addison-Wesley, Reading, Massachusetts, 1968. 
Douglas B. Lenat, et al., "Cognitive economy in artificial intelligence systems", 
IJCAI-79, Proceedings of the sixth international joint conference on artificial 
intelligence, Tokyo, August 20-23, 1979, pp. 531-536. 
Douglas B. Lenat and John McDermott, "Less than general production system 
architectures", IJCAI-77, Proceedings of the fifth international joint conference on 
artificial intelligence, August 1977, pp. 928-932. 
Donald L. McCracken, "Representation and efficiency in a production system for 
speech understanding", Proceedings of the sixth international joint conference on 
artificial intelligence, Tokyo, August 20-23, 1979, pp. 556-561. 
John McDermott, "R1: A rule-based configurer of computer systems", Artificial 
Intelligence, vol. 19, no. 1, September 1982, pp. 39-88. 
Daniel P. Miranker, "A framework for discussing HERBAL", Working Memo, 
Columbia Computer Science Department, 1984a. 
Daniel P. Miranker, "Performance estimates for the DADO machine: A comparison 
of TREAT and RETE", International conference on fifth generation computer 
systems, November 1984b, pp. 449-457. 
Nils J. Nilsson, Principles of Artificial Intelligence, Tioga, Palo Alto, Calif., 1980. 
Kemal Oflazer, "Partitioning in parallel processing of production systems", 
Proceedings of the 1984 international conference on parallel processing, August 
21-24, 1984, pp. 92-100. 
Ron Sauers and Rick Walsh, "On the requirements of future expert systems", 
IJCAI-83, Proceedings of the eighth international joint conference on artificial 
intelligence, August 1983, pp. 110-115. 
David Elliot Shaw, Knowledge-Based Retrieval on a Relational Database Machine, 
Ph.D. Thesis, Stanford Department of Computer Science, 1980. 
David Elliot Shaw, "Organization and operation of a massively parallel machine", in 
Guy Rabbat (ed.), Computers and Technology, Elsevier - North Holland, 1985. 
David Elliot Shaw and Bruce K. Hillyer, "Allocation and manipulation of records in 
the NON-VON supercomputer", Technical Report, Columbia Computer Science 
Department, 1982. 
David Elliot Shaw and Theodore M. Sabety, "The multiple-processor PPS chip of 
the NON-VON 3 supercomputer", Integration: The VLSI Journal (accepted for 
publication), 1984. 
Salvatore J. Stolfo, "Five parallel algorithms for production system execution on the 
DADO machine", AAAI-84, Proceedings of the national conference on artificial 
intelligence, August 6-10, pp. 300-307. 
Salvatore J. Stolfo, Daniel Miranker, and David Elliot Shaw, "Architecture and 
applications of DADO: a large-scale parallel computer for artificial intelligence", 
IJCAI-83, Proceedings of the eighth international joint conference on artificial 
intelligence, August 1983, pp. 850-854. 
Salvatore J. Stolfo and David Elliot Shaw, "DADO: a tree-structured machine 
architecture for production systems", AAAI-82, Proceedings of the national 
conference on artificial intelligence, August 18-20, pp. 242-246. 
Patrick H. Winston, Artificial Intelligence, Addison-Wesley, Reading, Mass., 1977. 
Appendix 
~ON-VON's performance In executing OPS5 depends on several parameters of the 
productIon system in questloti. For the analysIs presented in this paper, statistics 
have been derived from averages over the SIX production systems measured In 
[Gupta and Forgy, 19831· Although most values are obtained directly from that 
work, some plausible inferences have been necessary, and actual measurements of 
the needed parameters would be preferable. The derivations and Justifications for 
the statistics employed are given below. All references to pages and tables cite 
[Gupta arrd Forgy, 19831. 
1) The average number of α-mem node additions per change to working memory 
is 5.09 (p. 25, Table 5-2, line 1). 
2) The average static sharing of α-mem nodes is 3.51 (p. 23, Table 4-5, 
line 2). 
3) The average number of condition elements having all intra-condition tests 
satisfied by an arbitrary working memory addition (without sharing) is 
approximated as 17.87, the product of 1) and 2) above. This is only an 
approximation, since it is the dynamic sharing of α-mem nodes that 
is relevant; Gupta states (private communication) that 30 is a more 
accurate value, so we use 30 for the analysis. 
4) The average number of α-mem additions per partition is 0.94, the 
ratio of 30 (item 3 above) to 32, the number of partitions. If the 
α-mem additions were evenly spread over the partitions, 30 
partitions would have one addition each, and 2 would have none. If, 
however, the additions occur randomly, with a uniform distribution over 
the 32 partitions, clustering would cause an expected maximum of 3.4 
additions in some partition. It is hoped that intelligent partitioning 
of the production system can do better than randomizing the activity, 
but this remains an open question. Oflazer [1984] reports 3 heuristics 
for partitioning production systems; all 3 perform better than a random 
distribution of rules. He does not state the effect of his heuristics 
on α-mem node additions. We assume that as a result of 
partitioning, on the average there are at most 3 α-mem additions in 
any partition. 
5) The average size of the conflict set is 16.0 (p. 30, Table 5-8, line 3). 
6) The average number of changes to the global conflict set per firing cycle 
is 5.3 (p. 29, Table 5-6, line 3). 
7) The average number of condition elements per production is 4.11 (p. 21, 
Table 3-8, line 2). 
8) The average number of attributes in a condition element is 3.71 (p. 21, 
Table 3-8, line 5). 
9) The average number of variables in a condition element is 1.56 (p. 21, 
Table 3-8, line 6). 
10) The average number of intra-condition tests for a condition element 
is 2.93, the difference between 8) and half of 9). This is based on 
the assumption that each variable occurs once to be bound and once in 
an inter-condition test. All attribute occurrences that are not 
associated with an inter-condition test must be for intra-condition 
testing. This analysis neglects the effect of conjunction and 
disjunction expressions, which although rare, would tend to raise the 
number of intra-condition tests. 
11) The average number of productions is 909.83 (p. 21, Table 3-8, line 1). 
12) The maximum aggregate number of α-mem and β-mem tokens is 4616 
(p. 30, Table 5-8, line 6). 
13) The number of working memory element ID's stored in an α-mem token 
is 1. The size of a β-mem token ranges from 2 (most frequent because 
of the progressive filtering performed by the Rete net) up to the number 
of positive condition elements in a production (average 4). Thus we 
say the average number of working memory element ID's in a β-mem 
token is 3. 
14) The number of attributes per class ranges from 1 to 152 (pp. 19-20, 
Tables 3-1 through 3-6). Since the OPS5 language manual states that 
the maximum number of attribute slots per class is 126, we see that for 
some classes, attribute names have been mapped by OPS5 literal 
declarations to shared physical locations. A static weighted average 
of all 41 most frequent classes listed in these tables, assuming no 
attribute folding (i.e., 152 distinct physical locations are permitted 
in a class) gives 11.4 attributes per class. 
15) The average of the largest number of terms in a condition element is 9, 
by inspection and averaging of the values ranging from 7 to 11 found in 
(p. 13, Figure 3-4). 
16) The average number of relevant bindings in a token is 1. This is 
derived in the following way. The average number of relevant bindings 
is the quotient of the average number of tokens in an opposite node, 
divided by the average number of tests that are performed when a token 
is inserted. For an AND node this value is 0.88 (p. 27, Table 5-4, 
line 4 divided by line 3), and for a NOT node this value is 0.97 (p. 28, 
Table 5-5, line 4 divided by line 3). Note also that a static figure of 
0.8 is obtained from (p. 21, Table 3-8, line 7). Thus we state that the 
average number of relevant bindings in a token is 1. This applies both 
to α-mem and to β-mem tokens. 
17) The average length of a token is the greater of the number of relevant 
bindings, or the number of working memory ID's stored in the token. For 
α-mem tokens, both figures are 1. For β-mem tokens, the number 
of working memory ID's is 3, as described in 13), so the average length 
of a β-mem token is 3. 
18) The average number of entries to the right-hand input memory of an AND 
node is 22.37 (p. 27, Table 5-4, line 1). The average number of entries 
to the right-hand input memory of a NOT node is 4.73 (p. 28, Table 5-5, 
line 1). Thus 17% of right-hand entries are to NOT nodes. 
19) The average number of additions to β-mem node memory resulting from 
an addition to working memory is 6.31 (p. 26, Table 5-3, line 1). If 
6 additions are independently uniformly distributed over 32 partitions, 
the expected maximum number in any partition is 1.41. We assume that 
as a consequence of partitioning the productions, on the average there 
are at most 2 β-mem additions in any partition. 
20) For an AND node, the average number of entries to the right-hand input 
is 22.37, and to the left-hand input is 7.2 (p. 27, Table 5-4, line 1). 
Thus 76% of entries to AND nodes are to the right-hand input, and 24% are 
to the left-hand. When entering the right-hand input, the opposite node 
is empty 87.17% of the time, and when entering the left-hand input, the 
opposite is empty 43% of the time (p. 27, Table 5-4, line 2). Thus the 
weighted probability is 0.77 that an entry to an AND node will find the 
opposite empty. 
For a NOT node, the average number of entries to the right-hand input 
is 4.73, and to the left-hand input is 1.2 (p. 28, Table 5-5, line 1). 
Thus 80% of entries to NOT nodes are to the right-hand input, and 20% are 
to the left-hand. When entering the right-hand input, the opposite node 
is empty 70.33% of the time, and when entering the left-hand input, the 
opposite is empty 25.5% of the time (p. 28, Table 5-5, line 2). Thus the 
weighted probability is 0.61 that an entry to a NOT node will find the 
opposite empty. 
Thus we say that the probability of finding the opposite node empty is 
0.7, and the probability of finding tokens in the opposite node is 0.3. 
Gupta (private communication) points out that this does not hold for 
"long-chain" activations that propagate a token through consecutive 
two-input tests during one execution cycle. 
21) The length of all tokens in α-mem nodes is 1, and the average length 
for tokens in β-mem nodes is 3 (Appendix, item 17). We assume that 
most tokens are in α-mem nodes, and the β-mem tokens are short 
because of the progressive filtering action of the Rete net, so we say 
the average length of an arbitrary token is 2. 
22) The probability that an addition to a NOT node triggers a deletion is 
0.12, which is 0.3 x 0.49 x 0.8, where 0.3 is the probability that the 
opposite node is occupied (p. 28, Table 5-5, line 2), 0.49 is the 
average number of tokens that successfully match given that the opposite 
node is occupied (p. 28, Table 5-5, line 5), and 0.8 is the probability 
that an entry to a NOT node is to the right-hand input (p. 28, Table 5-5, 
line 1). 
23) The expected maximal number of deletions in any partition as a result 
of an entry into a NOT node is calculated as follows. There is a total 
of 5.93 entries to NOT nodes in all partitions (p. 28, Table 5-5, 
line 1). Let T(n,k) denote the probability that an aggregate of n 
entries to NOT nodes triggers k token deletions. We round 5.93 to 6, 
and using the probability 0.12 that any 1 entry triggers a deletion, 
obtain through elementary probability theory that T(6,0) = 0.4644, 
T(6,1) = 0.38, T(6,2) = 0.1295, and the sum of T(6,k) for k > 2 
is 0.0261. Next we assume the 6 entries are uniformly distributed over 
the 32 partitions, and perform elementary combinatorial calculations to 
obtain the number of entries in each partition. (The assumption of 
uniform distribution is discussed in item 4, above.) We state the two 
most common occurrences. The probability that six partitions have one 
entry each is 0.61. The probability that one partition has two entries 
and four other partitions have one entry each is 0.34. Multiplying these 
and similar values by the T(6,k) and grouping by the maximum number 
of deletions triggered gives the probability D(k) that k 
deletions are triggered. We obtain D(0) = 0.46, D(1) = 0.48, D(2) = 0.06, 
and all other values are nearly 0. Hence the expected maximum number of 
deletions in any partition is 0.60. 
24) The average number of β-mem nodes that are descendants of a NOT node 
is no more than 2. This is supported by two considerations. The average 
number of β-mem nodes for a production is one less than the average 
number of condition elements (Appendix, item 7), hence it is 3. The average 
number of NOT nodes is only 481 (p. 21, Table 4-1, line 5), so it is 
likely that the ancestors of a β-mem node are AND nodes. 
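The probability figures quoted in items 4, 19, and 23 can be spot-checked numerically. The following is a minimal sketch, not the procedure used for the paper: it assumes that events fall independently and uniformly over the 32 partitions (the assumption stated in item 4) and that each NOT-node entry triggers a deletion independently with probability 0.12 (item 22). All function names are illustrative.

```python
from fractions import Fraction
from math import comb, perm

def prob_max_at_most(items, partitions, cap):
    """P(no partition receives more than `cap` of `items` uniformly placed items)."""
    # ways[k] = number of ways to place k labelled items into the partitions
    # processed so far, with at most `cap` items in each of them.
    ways = [1] + [0] * items
    for _ in range(partitions):
        ways = [
            sum(comb(k, j) * ways[k - j] for j in range(min(k, cap) + 1))
            for k in range(items + 1)
        ]
    return Fraction(ways[items], partitions ** items)

def expected_max(items, partitions):
    """Expected load of the fullest partition: sum over m of P(max > m)."""
    return float(sum(1 - prob_max_at_most(items, partitions, m)
                     for m in range(items)))

def binomial(n, k, p):
    """T(n,k): probability that exactly k of n independent entries trigger."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def one_pair_probability(items, partitions):
    """P(one partition holds two items and the rest hold one each)."""
    return comb(items, 2) * perm(partitions, items - 1) / partitions ** items

if __name__ == "__main__":
    print("item 4, 30 additions:", round(expected_max(30, 32), 2))
    print("item 19, 6 additions:", round(expected_max(6, 32), 2))
    print("item 23, T(6,k):", [round(binomial(6, k, 0.12), 4) for k in range(3)])
    print("item 23, all distinct:", round(float(prob_max_at_most(6, 32, 1)), 2))
    print("item 23, one pair:", round(one_pair_probability(6, 32), 2))
```

Exact rational arithmetic is used for the occupancy counts so that the 32-partition case with 30 items does not suffer from floating-point cancellation; only the final results are converted to floats.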
On average, 1.48 tokens are removed from right-hand inputs of NOT nodes 
as a consequence of a working memory deletion. This figure is obtained 
in the following way. The average number of tokens in all nodes is 
1289.5 (p. 30, Table 5-8, line 5). The average number of working memory 
elements is 295 (p. 30, Table 5-8, line 1). Thus there are 4.36 
tokens per working memory element. Since an average token contains 2 
working memory element ID's (Appendix, item 21), an average working 
memory element is represented in 8.72 tokens. Since less than 17% of 
all tokens are stored in right-hand inputs of NOT nodes (Appendix, item 
18), the figure of 1.48 is obtained. We assume that at most one occurs 
in any partition. Actual dynamic measurements of this statistic are 
needed. 
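The chain of multiplications in this estimate can be written out as a short sketch. The inputs are the figures quoted in the paragraph above; the function and variable names are illustrative. Carried at full precision, the result lands within about 0.01 of the quoted 1.48, a difference that presumably reflects rounding of the intermediate values.

```python
def tokens_removed_from_not_inputs(total_tokens, wm_elements,
                                   ids_per_token, not_right_input_share):
    """Estimate tokens removed from NOT-node right inputs per WM deletion."""
    tokens_per_element = total_tokens / wm_elements
    # Each token names `ids_per_token` elements, so each element appears in
    # that many times more tokens than the simple ratio suggests.
    tokens_holding_element = tokens_per_element * ids_per_token
    return tokens_holding_element * not_right_input_share

estimate = tokens_removed_from_not_inputs(
    total_tokens=1289.5,         # p. 30, Table 5-8, line 5
    wm_elements=295,             # p. 30, Table 5-8, line 1
    ids_per_token=2,             # Appendix, item 21
    not_right_input_share=0.17,  # upper bound, Appendix, item 18
)
print(round(estimate, 2))
```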
The average number of working memory elements is 295 (p. 30, Table 5-8, 
line 1). The average size of the conflict set is 16 (Appendix, item 5). 
The average number of condition elements in a production is 4.11 
(Appendix, item 7). Since negated condition elements are not represented 
in a conflict set instantiation, and assuming 3/4 of the condition 
elements are non-negated, the average size of an instantiation is 3. 
Thus 3x16/295, or 0.16, is the average probability that deletion of a 
working memory element removes a member of the conflict set. 
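As a worked check of this estimate, the sketch below reproduces the arithmetic. It reads the text as assuming that every working memory element is equally likely to be the one deleted and that instantiations do not share elements; both are our reading rather than a statement from the source, and the function name is illustrative.

```python
def conflict_set_hit_probability(wm_elements, conflict_set_size,
                                 ces_per_production, non_negated_fraction):
    """Probability that deleting one WM element removes a conflict set member."""
    # Only non-negated condition elements contribute an element to an
    # instantiation; the text rounds 4.11 * 3/4 to 3.
    instantiation_size = round(ces_per_production * non_negated_fraction)
    return instantiation_size * conflict_set_size / wm_elements

p = conflict_set_hit_probability(
    wm_elements=295,           # p. 30, Table 5-8, line 1
    conflict_set_size=16,      # Appendix, item 5
    ces_per_production=4.11,   # Appendix, item 7
    non_negated_fraction=0.75, # assumption stated in the text
)
print(round(p, 2))
```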
The average number of changes to working memory per production system 
cycle is estimated as follows. From (p. 20, Table 3-7, lines 1-3) we 
calculate the number of changes to working memory as a percent of the 
total number of rhs actions by summing the percent of actions that are 
MAKE, the percent that are REMOVE, and twice the percent that are 
MODIFY (since a modify is a make and a remove). The product of these 
values with the (static) number of actions per production (p. 21, 
Table 3-8, line 3) gives an estimate of the number of changes to working 
memory that results from firing a production. The average of the values 
thus obtained is 2.21 changes per firing. A value derived from dynamic 
measurements would be preferable. 
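The estimation rule just described can be sketched as follows. The per-system percentages and action counts in the example are hypothetical placeholders chosen only to exercise the formula; they are not the values from [Gupta and Forgy, 1983], and the function name is illustrative.

```python
def changes_per_firing(pct_make, pct_remove, pct_modify, actions_per_production):
    """Estimated working memory changes caused by firing one production."""
    # A MODIFY counts twice because it is a REMOVE followed by a MAKE.
    pct_changes = pct_make + pct_remove + 2 * pct_modify
    return pct_changes / 100 * actions_per_production

# HYPOTHETICAL inputs: (% MAKE, % REMOVE, % MODIFY, actions per production)
hypothetical_systems = [
    (40, 10, 30, 2.0),
    (35, 15, 25, 2.5),
]
per_system = [changes_per_firing(*row) for row in hypothetical_systems]
average = sum(per_system) / len(per_system)
print(per_system, average)
```

The paper's figure of 2.21 is the corresponding average over its six measured systems.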
The number of additions to β-mem resulting from the deletion 
of a working memory element was considered too insignificant to analyze 
by Gupta [1984], and no statistics were presented for this case in 
[Gupta and Forgy, 1983]. We assume these statistics were aggregated with 
those for two-input processing and β-mem additions that result from 
the addition of working memory elements, and thus they have already been 
accounted for in the addition portion of the analysis. 
The formula for total time per production assumes that the number 
of working memory additions is equal to the number of working memory 
deletions, which should be true over the long term, assuming working 
memory size does not grow without bound. We also note a comment from 
Anoop Gupta [private communication, 1985]: "The assumption is made that 
the same partition exhibits worst-case performance for all of the 2.21 
changes. That is probably not so, and will result in overall better 
performance than predicted." 
(p sort-work 
    (current-task ^taskname sort) 
    (counter ^value <n>) 
    (number ^value <x> ^used no) 
  - (number ^value < <x> ^used no) 
  --> 
    (write <x>) 
    (modify 2 ^value (compute <n> + 1)) 
    (modify 3 ^used yes)) 

(p sort-done 
    (current-task ^taskname sort) 
    (counter ^value <total>) 
  - (number ^used no) 
  --> 
    (write <total> items sorted) 
    (remove 1)) 

If current task is to sort 
and output counter is n 
and there is an unused number x 
but no smaller unused number 
Then 
write x to output 
increment output counter 
and mark x as used. 

If current task is to sort 
and output counter is total 
but no unused number remains 
Then 
write total number of items 
and terminate sorting task 

Figure 1: Example OPS5 Productions 
(current-task ^taskname sort) 
(counter ^value 0) 
(number ^value 17 ^used no) 
(number ^value 5 ^used no) 
(number ^value 23 ^used no) 
Figure 2: Example Working Memory Elements 
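To make the behavior of the two productions in Figure 1 concrete, the sketch below hand-translates them into a plain recognize-act loop and runs it on the working memory of Figure 2. It is only an illustration of what the rules compute: it does not build a Rete network, it does not model OPS5 conflict resolution, and the function and variable names are ours.

```python
def run_sort(numbers):
    """Hand-coded recognize-act loop for the sort-work and sort-done rules."""
    task_active = True   # (current-task ^taskname sort)
    counter = 0          # (counter ^value 0)
    wm_numbers = [{"value": v, "used": False} for v in numbers]
    output = []
    while task_active:
        unused = [e for e in wm_numbers if not e["used"]]
        if unused:
            # sort-work: an unused number x with no smaller unused number.
            smallest = min(unused, key=lambda e: e["value"])
            output.append(str(smallest["value"]))  # (write <x>)
            counter += 1                           # (modify 2 ^value ...)
            smallest["used"] = True                # (modify 3 ^used yes)
        else:
            # sort-done: no unused number remains.
            output.append(f"{counter} items sorted")  # (write <total> ...)
            task_active = False                       # (remove 1)
    return output

if __name__ == "__main__":
    for line in run_sort([17, 5, 23]):
        print(line)
```

With the three numbers of Figure 2, sort-work fires once per number in increasing order of value, after which sort-done fires and removes the task element, so no further rule can match.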
Figure 4: Organization of the NON-VON Machine (legend: Small Processing Element, Large Processing Element, Disk Head and Intelligent Head Unit, Leaf Mesh Connections, To Host) 
Figure 5: NON-VON: Reduced Configuration for Production Systems (legend: Small Processing Element, Large Processing Element, Host) 
Figure 6: Condition Elements Stored in Four SPE's (panels: SPE 1, SPE 2, SPE 3, SPE 4) 
