An Algorithmic Taxonomy of Production System Machines by Mills, Russell C.
CUCS-340-88 
An Algorithmic Taxonomy of Production System Machines
Russell C. Mills
Columbia University 
Computer Science Department 
29 April 1988 
Abstract 
This paper presents a survey of computer architectures designed to execute production systems. After a
brief description of production systems and production system languages, the paper summarizes match
algorithms, particularly the Rete algorithm, and outlines suggested parallelizations. Most parallel
production system algorithms have as their unit of sequential computation a single production's left-hand
side, activations of a single Rete node, a single activation of a Rete node, or a single comparison in a
Rete node. The paper discusses a number of proposed production system machine architectures in
terms of the parallel and sequential computations performed in the algorithms suggested for each
machine. A taxonomy of parallel production system algorithms, describing in detail the distribution and
replication of data and computations, concludes the paper.
1 Introduction 
The production system paradigm is a data-directed formalism widely used in artificial intelligence 
research and in the building of expert systems. A number of commercially successful expert systems,
such as XCON at Digital Equipment Corporation [29] and ACE at AT&T [66], have been implemented in
production system languages. The slow execution speeds of production systems (XCON requires five
minutes of CPU time on a fairly powerful computer to configure a single VAX system, while ACE
implemented in OPS4 requires many hours of CPU time to process a single day's data on telephone
cable failure reports for a small city) stand in the way of constructing larger or real-time production
systems. Consequently, many researchers have attempted to accelerate production system execution
through hardware, software, and algorithmic techniques. This paper discusses some proposed
production system machine architectures and proposes a taxonomy based on the algorithm(s) proposed
for each.
1.1 Production Systems
A production system, or PS [37, ?], is a pattern- or data-directed program expressed as a set of
production rules, known collectively as the production memory, or PM, which operates on a global
database, the working memory, or WM, under the direction of a control strategy. Each production
consists of an if-then rule whose left-hand side (LHS) consists of a precondition on the database for
application of the rule, and whose right-hand side (RHS) consists of a set of actions, any of which may
affect the database. The control strategy dictates the order of application, or firing, of rules whose
preconditions are satisfied. A rule whose LHS is satisfied, together with the set of WM elements (WME's)
that satisfies it, is known as an instantiation of the rule.
A typical production system control strategy dictates that the system repeatedly perform execution
steps consisting of three parts:
1. Match: Determine the set of all rule instantiations, which is known as the conflict set.
2. Select: Choose a subset of the instantiations to fire. In the most widely-used production
system languages, this process, known as conflict resolution, selects exactly one
instantiation. If there are no rule instantiations, the system halts.
3. Act: Perform the actions indicated by the RHS's of the selected instantiated production(s).
Production systems are typically but not always forward-chaining inference systems.
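The three-part execution step can be sketched in a few lines of Python. This is a minimal illustration, not the interpreter of any particular production system language: the names `run`, `match_count`, and `act_count`, the representation of a rule as a (name, match function, act function) triple, and the "take the first instantiation" selection policy are all assumptions made for the example.

```python
# Hypothetical sketch of the match-select-act cycle; all names are illustrative.

def run(rules, wm, select, max_cycles=100):
    """rules: list of (name, match_fn, act_fn); wm: a set of hashable WMEs."""
    fired = []
    for _ in range(max_cycles):
        # Match: build the conflict set of (rule name, act_fn, instantiation).
        conflict_set = [(name, act, inst)
                        for name, match, act in rules
                        for inst in match(wm)]
        if not conflict_set:      # no rule instantiations: the system halts
            break
        # Select: conflict resolution picks exactly one instantiation.
        name, act, inst = select(conflict_set)
        # Act: the RHS may add or delete WM elements.
        act(wm, inst)
        fired.append(name)
    return fired


# Toy rule: count a counter WME down to zero.
def match_count(wm):
    return [w for w in sorted(wm) if w[0] == "count" and w[1] > 0]


def act_count(wm, w):
    wm.remove(w)
    wm.add(("count", w[1] - 1))


wm = {("count", 3)}
fired = run([("decrement", match_count, act_count)], wm,
            select=lambda conflict_set: conflict_set[0])
```

Here the system halts on the fourth cycle, when the conflict set is empty.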
1.2 Production Systems as a Basic Computational Paradigm for AI
Production systems are widely used in AI research because they are data-driven and separate
program operations from program control [39]. Many problems in AI can be reduced to production
systems.
Nils Nilsson [39] describes the reduction of theorem-proving and state-space search to generalized
production systems. His reduction makes PM the set of state-generating operators and WM the set of
states being explored.
Michael Rychener [53] shows how semantic nets [46] can be expressed in production system terms,
and constructs an expert system for computer-aided design. In his system, PM consists of two classes of
rules: those that create and traverse the semantic net, and those that encode the net itself, while WM
consists of goals and temporary network structures.
Mark Perlin [45] has proved the equivalence of frame systems [30] and production systems. First he
demonstrates that a frame system can simulate a production system at a cost that is at most linear in the
size of the rule system by constructing, for a given production system, a frame system that performs the
matching operations for the productions' LHS's. Next he shows that a production system can simulate a
frame system, again at a cost that is at most linear. For each active memory operation such as read and
write, and for each local propagation schedule, he constructs a production that implements the operation.
1.3 Production System Languages
In a typical production system, rules are of the form
$P: C_1 \,\&\, \cdots \,\&\, C_n \rightarrow A_1 \ldots A_m,$
where each $C_j$ is a condition (known as a condition element, or CE), and each $A_k$ is an action, which may
add or delete WM elements, or perhaps interact with a user or perform input or output on a file system.
Each condition element is a representation of a class of WM elements that match the condition; each
production system language defines the form of the representation and what it means to match a
condition element. Condition elements can contain variables, and different CE's can refer to the same
variable: when a CE matches a WME, variables in the CE are bound to the corresponding features of the
WME. In most production system languages, condition elements can also be negated. If the above
production $P$ contains $p$ positive (not negated) CE's, an instantiation of $P$ consists of $P$ together with a
$p$-tuple of WME's $(w_1, \ldots, w_p)$ such that
1. each $w_i$ satisfies the $i$-th positive condition element,
2. all variables in the positive CE's are bound consistently,
3. and for each negated CE, there is no WME satisfying it with variable bindings consistent
with those in the positive CE's.
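The three conditions defining an instantiation can be checked directly by brute force. The sketch below is illustrative only and makes several assumptions that are not part of any OPS language: a CE or WME is a (class, attribute dictionary) pair, a variable is a string in angle brackets, equality is the only predicate, and the names `is_var`, `match_ce`, and `instantiations` are hypothetical.

```python
from itertools import product


def is_var(x):
    """Assumed convention: variables are strings such as "<x>"."""
    return isinstance(x, str) and x.startswith("<") and x.endswith(">")


def match_ce(ce, wme, bindings):
    """Return bindings extended by matching wme against ce, or None on failure."""
    (ce_class, ce_attrs), (w_class, w_attrs) = ce, wme
    if ce_class != w_class:
        return None
    new = dict(bindings)
    for attr, pattern in ce_attrs.items():
        if attr not in w_attrs:
            return None
        value = w_attrs[attr]
        if is_var(pattern):
            # Bind the variable, or check consistency with an earlier binding.
            if new.setdefault(pattern, value) != value:
                return None
        elif pattern != value:
            return None
    return new


def instantiations(positive, negated, wm):
    """Enumerate p-tuples of WMEs satisfying conditions 1-3 (naive search)."""
    result = []
    for combo in product(wm, repeat=len(positive)):
        bindings = {}
        for ce, wme in zip(positive, combo):
            bindings = match_ce(ce, wme, bindings)   # conditions 1 and 2
            if bindings is None:
                break
        else:
            # Condition 3: no WME may satisfy a negated CE consistently.
            blocked = any(match_ce(n, w, bindings) is not None
                          for n in negated for w in wm)
            if not blocked:
                result.append(combo)
    return result


wm = [("block", {"name": "a", "on": "table"}),
      ("block", {"name": "b", "on": "a"})]
# "A block <x> with no block on top of it."
clear = instantiations([("block", {"name": "<x>"})],
                       [("block", {"on": "<x>"})], wm)
```

Only block b yields an instantiation, because the WME for b satisfies the negated CE when `<x>` is bound to a.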
The OPS family of production system languages [?3], developed at Carnegie-Mellon University, displays
a variety of representations of working memory. In each OPS language, condition elements are
abstractions of working memory elements. Each language provides a variety of predicates for matching
CE's to WME's.
One of the earlier members of the OPS family, OPS4 [?], represents WME's as arbitrary list structures.
OPS4 provides predicates for matching list structures, but also allows programmers to write arbitrary LISP
functions to match CE's and WME's.
In contrast, OPS5 [3, 10], probably the most widely used OPS language, represents a WM element as
a tuple of (named attribute, constant value) pairs. An OPS5 condition element is an abstraction of a WM
element; it contains a number of attribute fields, each of which can contain variables, constants, and
predicate symbols testing equality, ordering, and equality of type. Each OPS5 WME or CE also belongs
to exactly one class, which is just a name-valued attribute that itself has no name, and occupies the first
position in that WME or CE. OPS5 variable names are enclosed in angle brackets ("<name>"), while
attribute names begin with a caret ("^"). Figure 1 displays a contrived OPS5 production and two WME's
that together constitute an instantiation of the rule.
Figure 1: An OPS5 Production
(p example
    (class1 ^attr1 <x> > 1 ^attr2 <x>)
    (class2 ^attr1 <z> < <x>)
-->
    (remove 1)
    (make (class3 ^attr1 <z>)))
(class1 ^attr1 5 ^attr2 5)
(class2 ^attr1 4)
OPS5's set of predicates is limited, and the language does not allow the programmer to define new
ones. OPS83 [12] extends OPS5's pattern-matching capabilities by allowing the programmer to call
functions in the LHS of productions. In fact, OPS83 is a procedural language with an embedded element
data type (much like a standard record type), and make, modify, and remove statements to manipulate
the contents of WM. Instead of providing a conflict-resolution strategy, OPS83 provides the fire
statement, which fires a programmer-specified rule instantiation.
In spite of the limited expressiveness of OPS5, with its small number of pattern-matching operations
and its lack of programmer-defined predicates, most proposed production system machines are designed
specifically to execute OPS5. This situation may stem from the general availability of an OPS5 interpreter
and the consequent widespread use of the language. But OPS5 may not be an ideal production system
language, and not all production systems are written in OPS5, so OPS5-specific machine designs are
open to criticism as being too inflexible.
1.4 Execution Speed of Uniprocessor PS Implementations
In the worst case, the problem of deciding whether a production's LHS can be satisfied is at least
NP-complete, since a known NP-complete problem, conjunctive boolean queries, can be recast in
production system terms. Average-case execution speed is therefore of much more interest than
worst-case speed. Since there is no model of an average production system, most measurements of
production system execution speed are empirical studies of large working systems.
Charles Forgy has reported [8] that most production systems spend 90% of their execution time in the
match phase, even using an efficient match algorithm. Most research on accelerating production system
execution has therefore centered on the match phase. However, one should remember that if the match,
select, and act phases are not overlapped, completely eliminating the time spent in the match phase can
speed up execution only by a factor of 10.
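The factor of 10 follows from the usual serial-fraction argument: if a fraction f of the run time is spent in match and only match is accelerated, the remaining 1 - f bounds the overall speedup. A small sketch of the arithmetic, with a hypothetical function name and the 90% figure taken from the text:

```python
def overall_speedup(match_fraction, match_speedup=float("inf")):
    """Overall speedup when only the match phase is accelerated.

    match_fraction: share of execution time spent in match (0.9 in the text).
    match_speedup:  factor by which match alone is accelerated; the default
                    models eliminating match time completely.
    """
    remaining = (1.0 - match_fraction) + match_fraction / match_speedup
    return 1.0 / remaining


limit = overall_speedup(0.9)           # match eliminated entirely: about 10
tenfold = overall_speedup(0.9, 10.0)   # match made 10 times faster: about 5.3
```

Even a tenfold faster matcher therefore yields only a little more than a fivefold improvement unless the select and act phases are also accelerated or overlapped with match.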
Because different productions make different numbers of changes to WM, and because the number of
WM changes determines the amount of match computation required, the execution speed of production
system interpreters is usually measured in WM changes per second, rather than rule firings [60].
Measurements are usually made on a moderately powerful computer, typically a VAX-11/780. Execution
speed varies widely and depends on the language, the implementation language of the interpreter, and
the degree of compilation of the rules. For example, Anoop Gupta reports [1?] that the standard OPS5
interpreter written in LISP executes 8 WM changes/second, while the Bliss-based interpreter executes
about 40. OPS83, which is compiled into machine code rather than interpreted, runs at about 200 WM
changes/second. The order-of-magnitude difference in speed between interpreted code and compiled
code complicates the comparison of performance projections for a number of production system
machines discussed later in this paper.
1.5 Algorithms for Sequential Production System Execution
The problem of finding all rule instantiations can be recast in relational database terms [64]. As before,
let the production
$P: C_1 \,\&\, \cdots \,\&\, C_n \rightarrow A_1 \ldots A_m$
have $p$ positive CE's $C_{j_1}, \ldots, C_{j_p}$. For each $C_j$, let $R_j$ be the relation defined as the subset of WM satisfying
$C_j$. Then the set of rule instantiations $I(P)$ is that subset of $R_{j_1} \bowtie \cdots \bowtie R_{j_p}$, the join of the relations $R_j$ on the
variables common to two or more CE's in $P$, not excluded by a WME matching a negated CE with
consistent bindings.
The most naive match algorithm for production systems constructs the conflict set during each match
phase by comparing every production with every tuple of WM elements. However, since the number of
changes to WM each cycle is typically smaller than WM itself (in fact, in OPS5 programs, much smaller
[15]), more efficient match algorithms process the changes to WM to produce changes to the conflict set.
These algorithms save some of the state of the match algorithm between production system cycles.
Researchers have designed a spectrum of state-saving algorithms, described below, distinguished by the
amount of state they save.
Daniel Miranker's TREAT algorithm [32, 33] saves only the relations $R_i$ and the conflict set $I(P)$. In
each match cycle, it uses new WME's as seeds to construct new instantiations, and it limits the search for
new instantiations to those productions all of whose positive CE's have a nonempty corresponding
relation.
Charles Forgy's Rete algorithm [8, 11] saves all initial subsequence relations of $P$. For each $i$ in $1, \ldots, n$,
let $P_i$ be the partial LHS
$P_i: C_1 \,\&\, \cdots \,\&\, C_i$
The Rete algorithm saves all the $I(P_i)$, the sets of all instantiations of the $P_i$, in addition to the $R_i$. A change to
WM, whether addition or deletion, that affects $R_j$ can cause changes in all the $I(P_i)$ for $i \geq j$. The Rete
algorithm underlies the commercial OPS5 and OPS83 interpreters and compilers.
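The state-saving idea can be illustrated with a small sketch that keeps the relations $R_i$ and the prefix instantiations $I(P_i)$ for one production and updates them when a WME is added. It is a simplification under stated assumptions, not Forgy's network: it handles positive CE's and additions only, shares nothing between productions, and the class name `PrefixMatcher` and the `alpha_tests` / `join_tests` callables are hypothetical. The example tests encode the production of Figure 1.

```python
class PrefixMatcher:
    """Saves R_i and I(P_i) for one production with positive CE's only."""

    def __init__(self, alpha_tests, join_tests):
        self.alpha_tests = alpha_tests   # alpha_tests[j](wme) -> bool
        self.join_tests = join_tests     # join_tests[j](partial, wme) -> bool
        n = len(alpha_tests)
        self.R = [[] for _ in range(n)]            # R[j]: WMEs satisfying CE j
        # I[i]: instantiations of the first i CE's; I[0] holds the empty tuple.
        self.I = [[()]] + [[] for _ in range(n)]

    def add(self, wme):
        """Add one WME; return the new complete instantiations it creates."""
        n = len(self.R)
        new_full = []
        for j in range(n):
            if not self.alpha_tests[j](wme):
                continue
            self.R[j].append(wme)
            # A change to R_j can change I(P_i) for every i >= j.
            delta = [p + (wme,) for p in self.I[j]
                     if self.join_tests[j](p, wme)]
            self.I[j + 1].extend(delta)
            for i in range(j + 1, n):
                delta = [p + (w,) for p in delta for w in self.R[i]
                         if self.join_tests[i](p, w)]
                self.I[i + 1].extend(delta)
            new_full.extend(delta)
        return new_full


# The production of Figure 1, with intra-CE tests as alpha tests.
alpha = [
    lambda w: (w["class"] == "class1" and w["attr1"] > 1
               and w["attr1"] == w["attr2"]),
    lambda w: w["class"] == "class2",
]
join = [
    lambda partial, w: True,
    lambda partial, w: w["attr1"] < partial[0]["attr1"],   # <z> < <x>
]
matcher = PrefixMatcher(alpha, join)
first = matcher.add({"class": "class2", "attr1": 4})
second = matcher.add({"class": "class1", "attr1": 5, "attr2": 5})
```

The first addition only fills $R_2$; the second creates one complete instantiation by joining the saved state with the new WME, without re-examining the rest of working memory.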
A variant of the Rete algorithm, the subject of some experiments by Anoop Gupta [18], recursively
splits the LHS of each production in two, computes the set of instantiations of each half, and computes
the join of the two sets of partial instantiations. Gupta's results indicated that on the OPS5 programs he
studied, computing the partial instantiations of the right-hand half of each LHS wasted time and space.
A final variant of the Rete algorithm, proposed by Kemal Oflazer and reported in [Gupta86a], saves all
nonredundant instantiations of all subsequences of the LHS of each production. Oflazer's algorithm
underlies the design of a proposed machine for production systems, described later in this paper.
1.6 The Rete Match Algorithm in More Detail
Since many of the production system architectures discussed in this survey attempt to accelerate
matching by parallelizing the Rete match algorithm, a more detailed discussion of the sequential Rete
algorithm is warranted. The PS language compiler translates the LHS of each production into a dataflow
network, which the runtime system interprets or executes. For each production $P$ and each change to
WM, the network computes the changes to $I(P_j)$ from the changes to $R_j$ and to $I(P_{j-1})$. Each node in the
network either stores part of the saved match state as a set of tokens, each of which represents an element of
$R_j$ or $I(P_j)$, or performs part of the computation to update the state. Rete network nodes are of several
types:
• alpha-memory nodes, each of which stores a relation $R_j$.
• beta-memory nodes, each of which stores a relation $I(P_j)$. The beta-memory node storing
$I(P_n)$ is also known as an output node, and it stores part of the conflict set.
• single-input-test nodes, which perform tests on WME's for membership in a relation $R_j$.
These tests include comparing a WME attribute against a constant and checking the
consistency of variable bindings within a single CE.
• two-input-test nodes, which construct $I(P_j)$ from $I(P_{j-1})$ (the left input) and $R_j$ (the right input).
If the CE corresponding to the right input of a node is negated, the node is known as a
not-node; otherwise it is called an and-node. These tests are also known as inter-condition
tests.
Figure 2 shows the network constructed from the production given in Figure 1. In practice, the compiler
merges the alpha-memory node storing $R_1$ and the beta-memory node storing $I(P_1)$.
Rete alpha- and beta-memory nodes can also be shared. If two productions $P$ and $P'$ share a
substring of CE's $P_j$ and $P'_j$, the compiler can generate a network in which the tests involved in
constructing $I(P_j)$ and $I(P'_j)$ are performed once, and a single beta-memory node storing $I(P_j)$ has two
outputs. Notice that if the compiler does not generate shared nodes, it can merge each two-input-test
node with its two memory-node inputs.
During production system execution, adding an element to WM initiates a sequence of one-input tests
and the possible creation of an alpha-memory token, which activates the and- and not-nodes connected
to the alpha-memory node. And-nodes receiving the new token compare it with the tokens stored in their
other memory-node input, and create a new token for each new partial instantiation. Not-nodes receiving
a new token from the right also compare it with the tokens stored in their left input. If this new token is the
first token from the right input matching a particular token from the left, some previously-created partial
instantiations must be retracted, so the not-node creates a negated token, which flows through the
network annihilating matching positive tokens and creating new negated tokens. Deleting an element
from WM causes the removal of all alpha-memory tokens corresponding to it, and the creation of negated
tokens. A negated token arriving at the right input of a not-node causes the creation of positive tokens if
the positive alpha-memory token corresponding to the newly-arrived negated token is the only one
blocking the creation of partial instantiations. The sequential Rete match algorithm guarantees that a
negated token arriving at a memory node finds a corresponding positive token.
Figure 2: Rete network for the production of Figure 1 (surviving node labels: ^attr1 > 1; <x> > <z>; conflict set)
1.7 Some Statistics on Production Systems
Anoop Gupta [15, 18] has collected extensive statistics on two sets of six production systems each. He
measured the following properties of production systems:
• Textual characteristics such as the number of CE's per production, the number of negated
CE's per production, and the number of WM changes per RHS. In his sample, 27% of all
productions contained at least one negated CE, so negated CE's are important.
• Rete network characteristics such as the number of nodes per CE, for compilation with node
sharing and without.
• Run-time characteristics such as the average and maximum number of Rete match tokens in
beta-memory nodes, the percentage of and- and not-nodes that perform no test for variable
equality, the number of node activations per WM change, and the number of productions
whose saved state is changed per WM change. He found that on the average, each WM
change affects about 26 productions in this manner.
Many of the designs discussed in this paper are based on these statistics.
1.8 Suggested Parallelizations of Production Systems
This paper divides proposed production system architectures into groups based on characteristics of
the algorithms proposed for them. This division highlights important common features of the machines in
each group, and the characteristics that distinguish the groups. The paper first discusses uniprocessor
designs, because these designs are often used as building blocks in multiprocessor architectures. It then
divides the parallel production system algorithms into classes based on which operations are parallelized.
The simplest parallelization technique performs the match operation for different rules in parallel.
Several variants on this technique attempt to provide more speedup. Non-state-saving algorithms can be
parallelized; one example is the distributed version of the TREAT algorithm described later in this paper.
The Rete match can also be parallelized in a number of ways. A parallel Rete match algorithm can
process sequentially-activated Rete network nodes in parallel, it can process sequentially nodes activated
in parallel, or it can process in parallel nodes activated in parallel. Finally, the productions' RHS's can be
processed in parallel if the production system interpreter allows multiple rules to fire simultaneously.
2 Specialized Production System Uniprocessors 
Several researchers have explored uniprocessor architectures specialized for production systems.
Uniprocessor architectures are important in this context because fast uniprocessors set upper bounds on
communication speed in parallel systems. These designs are also used as processing elements (PE's) in
some of the parallel machines described in this paper.
2.1 RISC architectures for production systems
Theodore Lehr has proposed [26] a Reduced Instruction Set Computer (RISC) [44] architecture for
production system execution. His RISCF processor implements the complete Rete match algorithm. He
bases its design on several characteristics of OPS5 program execution. First, a processor executing the
Rete match algorithm makes many references to memory, since inter-condition tests typically involve
large numbers of tokens. Second, the processor executes arithmetic operations consisting mostly of
integer comparisons. Finally, the processor's branching behavior is complex and erratic, as the assembly
code generated from OPS5 programs contains a large number of conditional jumps and subroutine calls,
which can slow down pipelined architectures and cause cache misses.
The proposed RISCF processor addresses the problems of heavy memory traffic and branching and
has a simple ALU. The compare instructions, for example, compare a register with the contents of
memory and set condition codes and branch prediction bits, and must precede a conditional branch.
Statistics derived from Anoop Gupta's measurements of production systems show that the results of
many of these tests can be predicted statically with 90% accuracy. The branch prediction bits then allow
the processor to keep its instruction fetch/decode/execute pipeline full most of the time. Lehr projects an
overall speedup of 1.15 from the branch prediction strategy. The RISCF processor, like other RISC's,
also has a large register file.
Lehr has also designed a gallium arsenide realization of the RISCF processor [2?]. This chip set
comprises 9 chips, and has a projected execution rate of one instruction every 30-nanosecond machine
cycle.
James Quinlan has studied the effectiveness of a number of uniprocessor architectures, including
Lehr's, in executing production systems [47]. Using Anoop Gupta's measurements on six production
systems, Quinlan estimated the number of instruction fetches, data reads and writes, and computation
instructions performed by six architectures, including a hypothesized microcoded OPS matcher, the
RISCF processor, and a VAX-11/780. From these statistics he derived total numbers of machine cycles
and execution times for the various architectures. He concluded that the microcoded machine should run
three to six times as fast as a VAX and twice as fast as an NMOS realization of the RISCF processor, but
that a gallium arsenide realization of the RISCF would be as fast as the microcoded machine provided it
had an effective cache.
At the time of Quinlan's study, the VAX-11/780 represented technology that was almost ten years old.
While the use of the VAX is legitimate for baseline comparisons, the age difference among the
technologies he studied vitiates his conclusions about the effectiveness of special processor designs
relative to conventional architectures.
3 Production-level parallelism
Production-level parallelism entails distributing entire productions among different processors. Each
processor receives a different subset of the rules in a production system, and stores the initial working
memory elements potentially matching its rules. During the match phase, all processors, using a suitable
match algorithm (perhaps naive match, TREAT, or Rete), match their rules against their subset of
working memory. During the select phase, the processors cooperate to determine which rule gets fired in
the act phase. Production-level parallelism can be combined with other levels of parallelism, as is seen in
machines described in later sections.
The copy-and-constrain method [61] applied to a rule produces constrained copies of the rule, each of
which matches a subset of the instantiations of the original rule. If the set of possible working memory
elements is finite, then the set of possible instantiations of a rule not containing any negated condition
elements is finite as well, and it is at least theoretically possible to create a constrained copy of the rule
for each instantiation. Such a specialized rule contains no variables, and an instantiation can be found
with just one constant test per condition element. At least two working production system interpreters
work on this principle: the Aspro and the Concurrent Inference System.
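The limiting case of copy-and-constrain, one variable-free copy per possible binding, can be sketched as follows. The rule representation (a list of (class, attribute dictionary) condition elements with variables written as strings such as "<c>"), the function name `constrained_copies`, and the example rule are assumptions made for illustration and are not taken from [61].

```python
from itertools import product


def constrained_copies(ces, domains):
    """Return one variable-free copy of a rule's LHS per combination of values.

    ces:     list of (class, {attribute: constant or variable}) CE's
    domains: {variable: finite list of its possible values}
    """
    variables = sorted(domains)
    copies = []
    for values in product(*(domains[v] for v in variables)):
        subst = dict(zip(variables, values))
        # Replace each variable by its constant; leave constants untouched.
        copies.append([(cls, {a: subst.get(p, p) for a, p in attrs.items()})
                       for cls, attrs in ces])
    return copies


# Hypothetical rule: a car reacts to the light colour it sees.
lhs = [("light", {"color": "<c>"}), ("car", {"sees": "<c>"})]
copies = constrained_copies(lhs, {"<c>": ["red", "green"]})
```

Each copy can then be matched with one constant test per condition element, at the price of a rule count that grows with the product of the domain sizes.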
3.1 Illiac-IV
Charles Forgy has studied the possibility of interpreting production systems on the Illiac-IV [9]. The
Illiac-IV is an 8-by-8 mesh-connected array of SIMD¹ PE's. Each PE consists of a powerful processor, a
small local memory (2K bytes), connections to its four neighbors in the array, connections to a broadcast
bus for instructions and data, and access to its own section of a disk. Forgy's algorithm is just an SIMD
version of the Rete match algorithm in which the rules are distributed among the PE's. Each PE receives
a number of productions; during the match phase, the host computer broadcasts changes to working
memory and instructions to control the Rete network match in each PE. The algorithm does not use the
mesh inter-PE connections: it simply uses the Illiac-IV as a set of processors on a broadcast bus.
In order to simplify the implementation, Forgy adopted a simple production system language, SPS, that
represents data as tuples, rather than as tuples of attribute-value pairs or as lists. SPS's match strategy is not
guaranteed to find all satisfied productions: starting with the most recently-added working memory
elements, it attempts to build instantiations CE by CE from left to right in each production, binding
variables as it proceeds. If at some point in the process it cannot satisfy a CE, it abandons the
production rather than backtracking. SPS has no conflict resolution strategy: the interpreter simply fires
all satisfied productions on each cycle.
¹ Single Instruction, Multiple Data
The SPS compiler divides the set of Rete network nodes into a number of different classes such as
not-, and-, and single-input-test nodes, and allocates space for the input and output memories of each
node. The no-backtracking strategy allows the compiler to determine the maximum size of the alpha- and
beta-memories required. During each production system cycle, all PE's evaluate together all nodes of
each type.
Forgy presents no performance projections or data for his SPS implementation. His parallel match
algorithm differs from that used by other researchers on SIMD machines such as NON-VON and the CAP
(see below) in that it performs the entire match operation in lockstep. In doing so, the algorithm abandons
backtracking and sacrifices complete search.
3.2 Finding Optimal Partitions
In each PS execution cycle, the amount of match computation required varies tremendously among the
rules in the system. Kemal Oflazer of CMU has studied the problem of assigning productions to
processors at compile time to achieve run-time load balancing among the partitions [40]. His studies
used both analyses of program texts and statistics derived from previous program executions to derive
production system partitionings; simulations showed, however, that they were only slightly better than
random ones.
Oflazer compared three partitioning methods to random partitioning. The first method assigned
productions to processors in a round-robin fashion based on their textual order in a program. This
assignment strategy is not random, since programmers are likely to put similar productions, or
productions that work on similar patterns in working memory, close together. Round-robin assignment
should therefore place similar productions, which are likely to require processing during the same
interpreter cycles, into different processors, and contribute to load balancing. The second method used
syntactic information. Many OPS5 productions have goal or context condition elements, whose purpose
is to condition their activation to a phase of program execution. Syntactic assignment places productions
with the same goal element into different partitions. The third technique used the processing time
required by the Rete algorithm to maintain the state of each production and the frequency with which the
algorithm processed productions together during the same match cycle to predict good partitionings.
Finding optimal partitionings is an NP-complete bin-packing problem, so Oflazer used simulated
annealing to search for approximations to the optimal partition.
Simulated executions of the partitioned production systems showed that no static partitioning scheme
worked very well. The method based on execution time gave somewhat better partitionings than the two
textual methods, but all three techniques gave speedups of 1.15 to 1.25 over random partitioning. One
explanation for this poor performance is that production systems have irregular and very data-dependent
execution paths, and that these paths are difficult to detect statically.
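The rationale for round-robin assignment can be made concrete with a toy cost model. The sketch below is not Oflazer's simulator and does not implement simulated annealing; the function names, the per-cycle cost table, and the assumption that a match cycle takes as long as its busiest processor are all hypothetical.

```python
def round_robin(num_rules, num_procs):
    """Assign rule i (in textual order) to processor i mod num_procs."""
    return [i % num_procs for i in range(num_rules)]


def match_time(assignment, cycle_costs, num_procs):
    """Total match time when each cycle lasts as long as its busiest processor.

    cycle_costs[c][i] is the (assumed) match cost of rule i in cycle c.
    """
    total = 0
    for costs in cycle_costs:
        load = [0] * num_procs
        for rule, proc in enumerate(assignment):
            load[proc] += costs[rule]
        total += max(load)
    return total


# Rules 0 and 1 are textually adjacent and active in the same cycle,
# as are rules 2 and 3.
cycle_costs = [[5, 5, 0, 0],
               [0, 0, 5, 5]]
blocked = match_time([0, 0, 1, 1], cycle_costs, 2)           # neighbours together
interleaved = match_time(round_robin(4, 2), cycle_costs, 2)  # neighbours apart
```

In this contrived case, separating textually adjacent rules halves the match time; the simulations reported above indicate that real programs are far less regular.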
3.3 The Aspro System 
The Aspro Parallel Inference Engine [49], developed by Goodyear Aerospace, is a specialized
pattern-matching machine attached to a sequential processor. The Aspro consists of a 2K-element array of
bit-serial PE's, each with 4K bits of memory. The Aspro represents the complete state of working
memory as a single 2K-bit vector; it uses the same representation for the LHS and RHS of a production.
During each match phase, the interpreter compares each production bitwise in turn with the current state
of working memory. A production is satisfied if every bit set in its LHS (i.e., every condition) is also set in
working memory. During the act phase, the interpreter fires every satisfied production; firing a production
may include calling functions in the host.
Since each bitwise comparison of working memory with the LHS of a production is a single instruction,
match time is linear in the number of productions. But since the Aspro memory can hold at most 2K
productions, the system is guaranteed to execute at least 500 match cycles/second.
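The satisfaction test described above is a subset test on bit vectors. A minimal software model, using Python integers as bit vectors and hypothetical names (the Aspro itself performs the comparison in its PE array):

```python
def match_cycle(productions, wm_bits):
    """Return the names of productions whose LHS bits are all set in WM.

    productions: list of (name, lhs_bits); one bitwise test per production,
    so the loop is linear in the number of productions.
    """
    return [name for name, lhs_bits in productions
            if (lhs_bits & wm_bits) == lhs_bits]


wm_bits = 0b1011
productions = [("p1", 0b0011),   # both conditions present in WM
               ("p2", 0b0100),   # its condition is absent
               ("p3", 0b1000)]
satisfied = match_cycle(productions, wm_bits)
```

Productions p1 and p3 are satisfied; p2 is not, because its single condition bit is clear in working memory.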
3.4 The Concurrent Inference System 
The Concurrent Inference System (CIS) [1], developed at the MIT AI lab, is a forward- and
backward-chaining inference engine. It has been implemented on the Connection Machine [20], a massively
parallel, highly-connected SIMD array of very small processors.
The CIS represents rules in a standard if-then form, but each conclusion has associated with it a
real-valued certainty factor. Variables in rules are allowed, provided that the set of possible values is
finite and known at compile time; the compiler creates an equivalent set of constrained rules containing
no variables. Since the compiler knows each variable's set of possible values, it can compile the set of
rules into a fixed graph.
At run time, each rule is associated with an activity factor. Inference proceeds synchronously as an
activity network. During each inference step, each rule computes the minimum or maximum activity, for
conjunctions and disjunctions respectively, of its LHS. It then adjusts the activity of each of its
conclusions. All inferences proceed simultaneously, but the system stops after a user-specifiable number
of cycles so that the user can interact with it.
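One synchronous inference step of such an activity network might look like the sketch below. The min/max evaluation of the LHS follows the description above; everything else is an assumption made for illustration, in particular the rule encoding, the function names, and the way a conclusion's activity is adjusted (certainty factor times LHS activity, combined by maximum), which the text does not specify.

```python
def lhs_activity(expr, activity):
    """expr is a fact name, or a tuple ("and", e1, ...) or ("or", e1, ...)."""
    if isinstance(expr, str):
        return activity.get(expr, 0.0)
    op, *args = expr
    values = [lhs_activity(a, activity) for a in args]
    # Minimum for conjunctions, maximum for disjunctions.
    return min(values) if op == "and" else max(values)


def step(rules, activity):
    """One synchronous step: every rule reads the old activities.

    rules: list of (lhs, conclusion, certainty_factor).
    """
    new = dict(activity)
    for lhs, conclusion, certainty in rules:
        support = certainty * lhs_activity(lhs, activity)   # assumed update rule
        new[conclusion] = max(new.get(conclusion, 0.0), support)
    return new


rules = [(("and", "a", "b"), "c", 1.0),
         (("or", "a", "b"), "d", 0.5)]
after = step(rules, {"a": 0.8, "b": 0.5})
```

Because every rule reads only the activities from the previous step, all rules can be evaluated simultaneously, one per processor.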
The performance claimed for the CIS is very good: Blelloch states that the system can contain 100000
rules and still provide interactive responses. The system obtains high concurrency by taking advantage
of concurrent matching and concurrent forward propagation. It is difficult to compare the performance of
the CIS with that of any OPS implementation, since the model of matching used in the CIS is very
different from that in OPS systems.
Citing the Prospector [6] and Mycin [56] expert systems as examples, Blelloch claims that restricting
variables to a finite set of values is not very limiting. As he notes, the requirement that the set of values
be known at compile time can be relaxed, provided that the activity network be allowed to change
dynamically, something the Connection Machine architecture allows. But no large rule bases have been
implemented in the CIS, so the flexibility of the system has yet to be proved.
Another limitation of both the Aspro and the CIS is the absence of negated conditions. Since the
Connection Machine supports global reduction operations in parallel, it is possible that negations could be
added to the CIS.
4 Non-State-Saving Machines and Algorithms 
The two machines proposed for non-state-saving algorithms, DADO and the Delta-Driven Computer,
execute different algorithms and are quite different architecturally.
4.1 DADO 
The DADO machine [57, 62] represents an attempt to accelerate production system execution through
massive parallelism. As originally conceived, the DADO machine consists of a very large number
(perhaps 10000 to 100000) of fairly small PE's connected in a complete binary tree. The PE's do not
share memory; all inter-PE communication is through I/O circuitry. Each PE has a modest amount of
local memory, enough to store a small matching program and some data. The DADO machine functions
as a rule coprocessor attached to a conventional host machine.
The current operational prototype, DADO2, has 1023 PE's. Each PE consists of an eight-bit processor
(an Intel 8751), 16K bytes of memory, and a semicustom I/O processor. The I/O processor provides rapid
bidirectional global communication. Its broadcast circuit allows the host to broadcast data to all PE's in
the tree; its resolve/report circuit calculates the minimum of a set of 8-bit values submitted by the various
PE's and sets a flag in the PE having the smallest value. All PE's have to participate in each
communication through the I/O processor. Each PE can also communicate with its parent and left and
right children through a channel separate from the I/O processor. One interesting architectural feature of
the I/O processor allows a PE to disconnect itself, under software control, from its parent: a PE that does
so becomes the root of its own DADO subtree, and can broadcast data to its descendants, as well as
receive data from them through its resolve/report circuit. Several algorithms proposed for DADO (see
below) exploit this capability.
The DADO machine combines aspects of SIMD and MIMD computers. Since all PE's must execute
the same sequence of communication instructions, communication is SIMD. But between
communications, the DADO machine is computationally an MIMD computer, since each PE has its own
local memory and stored program, and can execute arbitrary non-communicating code.
Stolfo [58] has proposed a number of algorithms for production system execution on DADO. All these
algorithms use DADO to accelerate the match and select phases of production system execution, and
make the host computer responsible for actions in the RHS of rules.
The full distribution algorithm is the simplest, and exploits only production-level parallelism, as
described in the previous section. The algorithm also uses DADO I/O hardware during the select and act
phases. During the select phase, PE's communicate encoded priorities for their best instantiations to the
host. The PE with the highest priority instantiation, as determined by the I/O processor, then
communicates its instantiation to the host. During the act phase, the host broadcasts changes to working
memory to the DADO PE's; each PE retains those changes that are potentially relevant to its rules.
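The select phase maps naturally onto the resolve/report circuit, which computes a minimum over 8-bit values. The following Python sketch models that mapping. The function names, the priority encoding (a higher priority becomes a smaller 8-bit code, with 255 reserved for "nothing to report"), and the tie-break toward the lowest-numbered PE are all assumptions for illustration; the survey says only that priorities are "encoded".

```python
def resolve_min(values):
    """Model of the resolve/report circuit: index of the PE holding the minimum."""
    assert all(0 <= v <= 255 for v in values)   # 8-bit values only
    return min(range(len(values)), key=lambda i: values[i])

def select(best_instantiations):
    """best_instantiations[i] is (priority, instantiation) for PE i, or None.

    Priorities are assumed to lie in 0..254.
    """
    # Assumed encoding: higher priority -> smaller code; 255 = no satisfied rule.
    codes = [255 if b is None else 254 - b[0] for b in best_instantiations]
    winner = resolve_min(codes)
    if codes[winner] == 255:
        return None                            # no PE has anything to fire
    return best_instantiations[winner][1]      # the flagged PE reports to the host

# Four PE's: PE 1 has no satisfied rule, PE's 2 and 3 tie on priority 9.
chosen = select([(3, "r1"), None, (9, "r7"), (9, "r8")])
```

All PE's contribute a value to every resolve operation, which mirrors the requirement that every PE participate in each communication through the I/O processor.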
The original DADO algorithm divides the DADO into three components: the PM-level (one level of the
tree), the upper tree (those PE's above the PM-level), and the set of WM-subtrees, one for each PE in the
PM-level. Each PE in the PM-level receives a number of productions; PE's in the WM-subtree below a
PM-level PE store working-memory elements relevant to a rule in that PM-level PE. During the match
phase, each PM-level PE constructs all instantiations of its satisfied rules. For each production P with
LHS $C_1 \ldots C_n$, and for each $j$ in $1, \ldots, n-1$, the algorithm constructs
$R_1 \otimes \cdots \otimes R_{j+1}$ from $R_1 \otimes \cdots \otimes R_j$. For each
element $Q_j$ of $R_1 \otimes \cdots \otimes R_j$, the PM-level PE substitutes into $C_{j+1}$ all variables
bound in $Q_j$, broadcasts the
resulting constrained condition element, and reads back sequentially all WM elements in its WM-subtree
satisfying the broadcast condition element. The PM-level PE combines each such WM element with $Q_j$ to
form another element of $R_1 \otimes \cdots \otimes R_{j+1}$. Thus the PM-level PE's use their WM-level subtrees as
generalized content-addressable memory.
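The incremental construction of instantiations can be sketched as follows. The sketch is sequential: the inner loop over `working_memory` stands in for the broadcast to the WM-subtree, where every PE would test its own element at once. The dictionary encoding of condition elements and WM elements, the `?x` variable syntax, and the names `matches` and `instantiations` are hypothetical choices, not taken from the DADO papers.

```python
def matches(condition, wme, bindings):
    """Return the extended bindings if wme satisfies condition, else None."""
    new = dict(bindings)
    for attr, term in condition.items():
        if attr not in wme:
            return None
        if isinstance(term, str) and term.startswith("?"):   # variable
            if new.setdefault(term, wme[attr]) != wme[attr]:
                return None
        elif wme[attr] != term:                               # constant
            return None
    return new

def instantiations(conditions, working_memory):
    """Build R_1 x ... x R_n one condition element at a time."""
    partial = [((), {})]           # the empty join: no WMEs, no bindings
    for cond in conditions:        # extend R_1..R_j to R_1..R_{j+1}
        extended = []
        for wmes, bindings in partial:
            # "Broadcast" the constrained condition; each WM element answers.
            for i, wme in enumerate(working_memory):
                b = matches(cond, wme, bindings)
                if b is not None:
                    extended.append((wmes + (i,), b))
        partial = extended
    return partial

wm = [
    {"type": "goal", "obj": "b1"},
    {"type": "block", "name": "b1", "color": "red"},
    {"type": "block", "name": "b2", "color": "red"},
]
lhs = [
    {"type": "goal", "obj": "?x"},
    {"type": "block", "name": "?x", "color": "red"},
]
result = instantiations(lhs, wm)
```

Substituting the current bindings before the test is what makes each broadcast a constrained, content-addressed lookup rather than a full scan followed by a join.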
Daniel Miranker's distributed TREAT algorithm [32] is a state-saving version of the original DADO
algorithm. Each PM-level PE stores (either in itself or in its WM-level subtree) the current instantiations of
its rules. Each WM-level PE stores a number of Rete alpha-memory tokens; the tokens for a single
alpha-memory node are distributed throughout the WM-level PE's. During the match phase of the TREAT
algorithm, each PM-level PE first determines which rules are affected by the current changes to its
working memory and have non-empty alpha-memories for each positive condition element. For each
such rule, the PE then constructs all new instantiations of that rule.
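The filtering step that precedes instantiation building can be written as a single predicate. This is a minimal sketch under assumed data structures: `rules` maps a rule name to the alpha-memory identifiers of its positive condition elements, `alpha` maps each identifier to its stored tokens, and `changed` is the set of alpha-memories touched in the current cycle. None of these names come from the TREAT description.

```python
def rules_to_match(rules, alpha, changed):
    """Rules touched by this cycle's changes whose positive CEs all have tokens."""
    return [
        name
        for name, memories in rules.items()
        if any(m in changed for m in memories)   # affected by a WM change
        and all(alpha[m] for m in memories)      # every alpha-memory non-empty
    ]

rules = {"r1": ["a", "b"], "r2": ["b", "c"], "r3": ["c"]}
alpha = {"a": [1], "b": [2], "c": []}
active = rules_to_match(rules, alpha, changed={"b"})
```

Here `r2` is affected by the change to `b` but is skipped because its other alpha-memory is empty, so no join work is spent on it.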
The fine-grain parallel Rete algorithm maps the Rete network directly onto the DADO machine. The
Rete network without node sharing is in fact a binary tree, so the mapping is a trivial one: leaf PE's store
linear chains of one-input tests, while interior PE's receive and- and not-test nodes. During the match
phase, the host broadcasts changes to PE's containing one-input tests; these PE's construct tokens for
WM elements matching their CEs and pass the tokens to their parent PE's. Two-input test PE's in a
pipelined fashion read tokens from their children, construct new tokens from them, and pass the new
tokens to their parent PE's. This algorithm processes activations of different nodes in parallel, but
because each node is mapped to a single PE, it cannot process multiple activations of a single node in
parallel.
Anoop Gupta [16] has suggested another match algorithm for DADO. His algorithm distributes
productions to the PM-level and distributes the tokens stored in each Rete match alpha- and beta-memory
node throughout the WM-level subtree of the associated production. Each Rete network token
for this algorithm consists of the node ID, a list of WME IDs, and a list of the values of the variables used
in the following two-input test. During the match phase, the host broadcasts WM changes to the PM-level
PE's, which perform all intra-condition tests on them; each PM-level PE then stores the resulting
alpha-memory tokens throughout its WM-subtree. For each new token, the WM-level PE broadcasts it and the
ID of the opposite memory node to all WM-level PE's; each WM-level PE executes the consistency
checks required for the new token. For each two-input test passed, the PM-level PE then creates a new
token, which it stores in some WM-level PE. Conflict resolution proceeds as in the original DADO
algorithm. The algorithm exploits production-level parallelism and parallelism in the evaluation of each
node activation, but it serializes node activations during a single match phase. Gupta's algorithm requires
that each PM-level PE store all working memory elements (together with all their attribute values) relevant
to any of its rules, as well as a list of the attributes used by each two-input test node. Thus this algorithm
requires that the PM-level PE's store more data than their WM-level PE's, and is more suitable to a
heterogeneous architecture than to the homogeneous DADO architecture.
Miranker [31] has projected the performance of the DADO2 machine by estimating the number of
instructions required to perform each of a number of primitive operations and the number of times the
match phase performs each such operation. He concluded that the DADO2 machine running the
distributed TREAT algorithm should execute about 212 WM changes/second. Based on the same
instruction counts, Gupta [16] estimated an execution speed of 167 WM changes/second for the
fine-grain Rete algorithm. As the instruction counts are wildly inaccurate, neither performance figure can be
trusted.
4.2 The Delta-Driven Computer
As part of the European Esprit project, a group at BULL SA Research Center in France has designed a
parallel computer architecture, the Delta-Driven Computer [50], to execute relational, logic, and functional
programs. The Delta-Driven Computer is unique in having an intermediate language based on production
rules with a forward-chaining inference strategy. Compilers for relational, logic, and functional programs
translate the programs into production rules.
The architecture of the Delta-Driven Computer is a cluster of bus-based message-passing PE's with no
shared memory. Each PE consists of a powerful microprocessor, local memory, and an attached
symbolic coprocessor for performing joins, unification, or pattern matching. The authors do not specify
the scale of their machine, but from their hopes for speedup on the order of 1000, one can deduce that
the machine will contain many PE's.
The authors propose a forward-chaining production rule paradigm as the machine's intermediate
language. Their paradigm encompasses rules of the form P1 & ... & Pn -> A1 ... Am, where the Pi and Aj
are predicates of atoms and variables. Thus, although their paradigm seems to lack negated condition
elements (predicates), it permits more general patterns than OPS5. Execution proceeds by cycles. At
each cycle, the interpreter feeds the changes made to the global database back into the rules, hence the
name "Delta Driven." All satisfied rules fire on each cycle; the system provides no conflict resolution
strategy.
Rather than distributing productions among the processors and duplicating parts of the database, the
system distributes individual relations (corresponding to alpha-memories), and duplicates productions.
The interpreter further distributes each relation to a number of processors by hashing on attributes used
in subsequent joins.
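Hashing both inputs of a join on the join attribute guarantees that matching tuples land on the same PE, so each PE can join its own fragments without talking to the others. The sketch below illustrates the idea. The function names, the dictionary tuple format, and the use of integer keys with a modulo hash are assumptions chosen to keep the example deterministic; the survey does not describe the actual hash function.

```python
def partition(relation, attr, n_pes):
    """Distribute the tuples of one relation to PE's by hashing the join attribute."""
    parts = [[] for _ in range(n_pes)]
    for row in relation:
        parts[row[attr] % n_pes].append(row)   # assumed hash: integer key mod n_pes
    return parts

def local_joins(left_parts, right_parts, attr):
    """Each PE joins only its own two fragments; no inter-PE traffic is needed."""
    out = []
    for left, right in zip(left_parts, right_parts):
        out.extend((l, r) for l in left for r in right if l[attr] == r[attr])
    return out

p = [{"k": 1, "a": "x"}, {"k": 2, "a": "y"}, {"k": 5, "a": "z"}]
q = [{"k": 1, "b": "p"}, {"k": 5, "b": "q"}, {"k": 3, "b": "r"}]
joined = local_joins(partition(p, "k", 4), partition(q, "k", 4), "k")
```

The same property explains the difficulty with negated conditions noted in the text: showing that no matching tuple exists anywhere requires an answer from every fragment, which is a global operation.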
The algorithms suggested for the Delta-Driven Computer do not seem suitable as accelerators for
OPS-style production systems, since they do not provide negated condition elements or conflict
resolution. Further, distributing the elements of individual alpha-memories makes negated conditions
difficult to implement without a fast global communication mechanism. On the other hand, it would be
very interesting to see how well a machine designed for OPS-style production system execution could
execute the intermediate language generated by the Delta-Driven Computer's higher-level language
translators.
5 Parallel Processing of Sequentially-Activated Nodes
All the machines described in this section parallelize the Rete algorithm by distributing the contents of
Rete alpha- and beta-memory nodes. During the match phase, they process each node activation by
using the distributed memory nodes as associative memory. Because this processing requires high
communication bandwidth and low latency, the machines are all SIMD.
5.1 NON-VON 
The NON-VON machine (proposed in [55]) is a fine-grain parallel architecture designed to accelerate a
wide range of artificial intelligence tasks, including database operations, expert systems, and image
understanding. Like DADO, NON-VON is organized as a binary tree, but its granularity is much finer, its
architecture is not homogeneous, and its operating modes, and hence the algorithms proposed for it, are
different.
NON-VON is actually organized as two very different subsystems connected to a conventional host
computer. The primary processing subsystem consists of a very large number (perhaps a million, in a
full-scale version of the machine) of small processing elements (SPE's) connected in a binary tree. Each
SPE consists of an 8-bit ALU, a very small amount of memory (perhaps 64 bytes), and connections to its
parent and left and right children. Each SPE also has a connection to its predecessor and successor in
an inorder traversal of the entire tree; thus algorithms can treat any subtree of the primary processing
subsystem as a linearly ordered array. Leaf SPE's are also interconnected in a mesh, but the production
system algorithm proposed for NON-VON does not use these connections. The SPE's are SIMD
processors: they read and execute instructions broadcast by a controlling processor. An enabled bit
controls conditional instruction execution; disabled PE's ignore all instructions except the instruction that
enables the PE. A special instruction, the resolve instruction, sets a bit in the first enabled PE in an
inorder traversal of the SPE tree. Algorithms that process sequentially a set of data stored in the SPE's
use the resolve instruction to enumerate the set. Global instructions (the resolve operation and
communication between linear neighbors) take ten times as long as computation instructions.
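The enumeration idiom built on the resolve instruction can be modelled as a loop that repeatedly selects the first enabled PE, reads it, and disables it. In this sketch a Python list stands in for the SPE's in inorder sequence; the function names and the read-then-disable protocol are assumptions about how such an algorithm would be written, not NON-VON assembly code.

```python
def resolve(enabled):
    """Index of the first enabled PE in inorder sequence, or None if none is enabled."""
    for i, flag in enumerate(enabled):
        if flag:
            return i
    return None

def enumerate_matches(data, predicate):
    """Read out, one at a time, every datum whose PE passes a broadcast test."""
    enabled = [predicate(d) for d in data]   # all PE's evaluate the test at once (SIMD)
    results = []
    while (i := resolve(enabled)) is not None:
        results.append(data[i])    # the controller reads the selected PE
        enabled[i] = False         # that PE is disabled; resolve again
    return results

found = enumerate_matches([3, 8, 1, 9, 4], lambda x: x > 3)
```

The test itself costs one broadcast regardless of how many PE's hold data, but the readout costs one resolve per match, which matters because global instructions are the slow ones.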
The secondary processing subsystem is a highly connected network of a smaller number (perhaps 31
to 1023) of powerful processors known as large processing elements (LPE's). Each LPE has a 32-bit
processor, a fairly large memory (at least 256K bytes), and a connection to one of the SPE's in the upper
levels of the primary processing subtree. The LPE's operate in MIMD mode, but each LPE can also
broadcast instructions and data to the subtree of SPE's to which it is attached. Thus NON-VON supports
multiple-SIMD processing, where some subset of the LPE's, typically those attached to SPE's at one level
of the primary processing subsystem, control SIMD computation in their respective subtrees.
Bruce Hillyer and David Shaw propose in [21] an algorithm for production system execution inspired by
Gupta's algorithm for DADO [16] and by an unpublished algorithm of Daniel Miranker. Their algorithm
partitions the rules in a production system into 32 groups; they assume that the partitioning scheme
places similar rules in different groups, so that the processing time for each production system cycle is
roughly the same for all partitions. Each LPE-SPE subtree rooted at the sixth level (the level containing
32 PE's) receives one of the groups of rules.
Their match algorithm, like Gupta's for DADO, processes node activations sequentially, but uses the
tree of PE's as associative memory to speed up each activation. Their algorithm differs from Gupta's in
its representation of condition elements and Rete network tokens, and in its handling of intra-condition
tests. Since each NON-VON SPE has very little memory, it cannot store an entire condition element or a
complete Rete network token. Thus Hillyer and Shaw propose that each condition element be stored in a
linear array of SPE's, with each SPE storing a single term of the condition, and that each token likewise
be stored in a linear array, with each SPE storing a single working memory element ID. An SPE can
store both a term of a condition element and a piece of a token. Each LPE performs intra-condition tests
by broadcasting a working memory change and having each SPE test for satisfaction of its term of a
condition element. If the longest condition element in its productions has n terms, the SPE's then
compute in (n-1) cycles the AND of the tests for each condition element: this computation exploits the
linear connection of the SPE's. LPE's allocate Rete network tokens and process node activations very
much as the WM-level PE's do in Gupta's DADO algorithm.
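One way to obtain the AND of each condition element's term tests in n-1 neighbor steps is for every SPE to combine its own test bit with the running result of its inorder predecessor, without reading across the boundary between two condition elements. The sketch below implements that scheme. The boundary flags (`starts`) and the function name are assumptions; the survey states only the cycle count and that the linear connections are used.

```python
def and_reduce(bits, starts, n):
    """AND the term tests of each condition element along the linear array.

    bits[i]   -- result of the term test held by SPE i
    starts[i] -- True if SPE i holds the first term of a condition element
    n         -- number of terms in the longest condition element
    After n-1 steps the last SPE of each condition element holds its AND.
    """
    acc = list(bits)
    for _ in range(n - 1):
        prev = acc[:]                      # all SPE's shift simultaneously
        for i in range(1, len(acc)):
            if not starts[i]:              # never read across a CE boundary
                acc[i] = bits[i] and prev[i - 1]
    return acc

# Three condition elements occupying SPE's 0-2, 3-4 and 5.
bits = [False, True, True, True, True, True]
starts = [True, False, False, True, False, True]
final = and_reduce(bits, starts, n=3)
```

In the example the failed first term of the first condition element needs both steps to reach SPE 2, which is why the step count is governed by the longest condition element.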
Hillyer and Shaw have analyzed their algorithm by writing NON-VON assembly code for each of the
primitive operations in the match cycle and counting machine cycles. Basing their analysis on Gupta's
statistics on the execution of six large OPS5 programs, they then compute the average match-phase
processing time per working memory change. Assuming an LPE processing speed of 3 MIPS, and an
SPE instruction cycle time of 300 microseconds, they conclude that NON-VON should be able to execute
about 2000 working memory elements/second. These performance projections depend on a good
rule-partitioning strategy, but they do not depend on reducing the variance in the processing time for node
activations, since their algorithm processes node activations sequentially.
5.2 The Cellular Array Processor
Ruven Brooks and Rosalyn Lum propose [2] using the ITT Cellular Array Processor (CAP) for
production systems. The CAP is an array of perhaps 32 to 128 processors, each with a fairly large local
memory, connected to a broadcast bus. The CAP has been realized in a CMOS chip set comprising five
chips for a 16-bit CAP processor.
Brooks and Lum suggest a parallelized Rete match algorithm much like the algorithm executed by each
NON-VON LPE. Their algorithm distributes constant tests and input memories for two-input nodes
among the CAP PE's. During the match phase, the controlling computer broadcasts changes to working
memory. Their match algorithm processes node activations sequentially, but uses the CAP PE's to speed
up each activation. For each change to working memory, the CAP PE's evaluate the associated constant
tests and create or delete Rete network tokens. For each two-input node activation, the CAP PE's search
the opposite memory in parallel, creating or deleting new tokens.
The authors suggest performance evaluations of the CAP based on comparisons of aggregate
memory-processor bandwidths. For example, they contrast DADO2, with the capacity to process 1023
bytes/instruction, with the 512 bytes/instruction processed by a CAP consisting of 256 16-bit processors
operating with a faster clock (each CAP PE can perform a 16-bit addition in 100 nanoseconds), and
conclude that the CAP offers comparable processing capacity with many fewer components. They further
assert that since two-input node activations consume most of the processing in the Rete match, and since
most memory nodes are either large or empty, CAP processor utilization should be high.
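The two bandwidth figures follow from the processor counts and word widths quoted in the text (1023 eight-bit PE's for DADO2, 256 sixteen-bit PE's for the CAP), assuming that each PE processes one word per instruction:

```latex
\begin{align*}
B_{\text{DADO2}} &= 1023 \times 1\ \text{byte} = 1023\ \text{bytes/instruction},\\
B_{\text{CAP}}   &= 256 \times 2\ \text{bytes} = 512\ \text{bytes/instruction}.
\end{align*}
```

Both are per-instruction totals with every PE busy, so they say nothing about instruction rate or about how many PE's do useful work on a given match.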
The performance claims for the CAP can be questioned on two grounds. First, it is not clear that
comparing the speeds of processors of different generations has any relevance to performance
evaluation. As Stolfo has pointed out [59], DADO2 was a prototype machine, and a DADO built today
would be constructed of fast 32[...] only an upper bound on its performance. Claims for the CAP's
performance can be verified only by a detailed study of processor utilization.
The CAP, with its fast communication (broadcasts occur at instruction speed) and lack of contention for
shared memory, may accelerate production system execution. However, the CAP, and all parallel
processors that use a large number of processors as associative memory, must compete with
uniprocessor indexing schemes that reduce the cost of searching for a pattern in a set of data.
5.3 The OOPS-MOP 
A group at the University of Tennessee has designed and fabricated their OOPS-MOP (Our OPS
Matching-Only Processor) [38] as part of an accelerator for OPS production systems. They envision a
system built of a number of OOPS-MOP slave chips and a controlling processor.
Each OOPS-MOP chip consists of eight identical "chunks", each of which consists of six "slivers"
whose outputs are ANDed together; each sliver consists of a programmable arithmetic comparator. The
slivers in a chunk compare the value of a single binding variable to five constants or bound variables.
Binding a variable and ANDing together a number of comparisons with constants or variables takes one
instruction cycle.
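Functionally, a chunk behaves like a conjunction of up to five programmable comparisons applied to one bound value. The following sketch models that behavior only. The function name and the representation of a comparison as an (operator, operand) pair are hypothetical, and the set of comparison operators the real slivers support is not given in the survey.

```python
import operator

def chunk(value, comparisons):
    """AND of up to five programmable comparisons against one bound value."""
    assert len(comparisons) <= 5
    return all(op(value, operand) for op, operand in comparisons)

# Bind the variable to 7, then test 7 > 3, 7 != 8 and 7 <= 7 together.
passed = chunk(7, [(operator.gt, 3), (operator.ne, 8), (operator.le, 7)])
```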
The group only hints at a system architecture and an algorithm to exploit the OOPS-MOP.
Presumably, each one- and two-input test in an OPS program would be assigned to a different chunk at
compile time. During program execution, the controlling processor would execute an algorithm much like
Daniel Miranker's TREAT algorithm. Starting with all variables unbound, the controlling processor would
try to construct a complete instantiation from working memory elements by binding variables in
succession and backtracking when it could not complete an instantiation. Since the OOPS-MOP chips
have only enough memory to bind a variable, they cannot store the tokens involved in the Rete match.
The authors suggest that OOPS-MOP's may speed up OPS programs by a factor of 10 or 20 over
brute force sequential implementations, but they offer no substantiation for this claim, and they do not
specify what they mean by a brute force method: the Rete match algorithm alone speeds up matching by
at least a factor of 10.
Lack of on-chip memory and an inflexible instruction set hobble the OOPS-MOP architecture. An
OOPS-MOP system must contain a dedicated "chunk" for each one- and two-input test. Since many
OPS5 programs contain thousands of tests, a usable OOPS-MOP system would have to contain
thousands of matching chips, almost all of which would be idle during each OPS5 match cycle. The
OOPS-MOP processor can handle only the simplest matching tasks: the spareness of its instruction set
makes it unsuitable even for OPS83.
6 Sequential Processing of Nodes Activated in Parallel
The machines described in this section parallelize the Rete match by assigning individual Rete network
nodes to processors, either at compile time or at run time. They require fast access to shared memory
containing WM elements, and are based on fast buses.
6.1 The Carnegie-Mellon Production System Machine
Anoop Gupta and others at Carnegie-Mellon University [13, 18] have proposed and analyzed a
specialized Production System Machine (PSM) to accelerate production system execution. The PSM is a
multiprocessor realization of the parallelized Rete match algorithm; processors are assigned dynamically
to individual alpha- and beta-node activations. Successful node activations create new tokens: a task
scheduler, when passed these tokens, creates new node activations. A few architectural features define
the PSM.
First, the PSM is a shared-memory multiprocessor with a fairly small number (32 to 64) of processors.
Gupta justifies the small number of processors by citing the relatively low degree of parallelism he found
in the execution traces of the OPS5 programs he studied [15]. The granularity of communication and
computation in the PSM dictates a shared-memory architecture. Individual node activations, which
consume only 50-100 instruction cycles [18], create large tokens that must be communicated to other
nodes in the Rete network. Since pointers to tokens are much smaller than the tokens themselves, it is
faster to pass pointers to the newly-created tokens, and since processors must dereference pointers to
structures created by other processors, the tokens must be in shared memory.
Second, each processor in the PSM is a powerful processor with some private memory and a cache.
Since the PSM has relatively few processors, each one can be fairly powerful. But since each processor
runs code that is memory-reference-intensive rather than computation-intensive [47], caches are required
if processors are not to wait on memory. Further, the caches must be able to store shared data objects
such as Rete network tokens, since many of the references to shared memory are to these tokens.
Gupta proposes the specialized RISCF processor for the PSM.
Because shared data objects must be cacheable, the PSM must have a cache-coherency scheme.
Therefore, processors communicate with shared memory over a shared bus, rather than through an
interconnection network such as a crossbar switch or a log(n) stage interconnection network, since
cache-coherency schemes for shared buses are much easier to construct ([52], cited by Gupta). The
shared bus limits the number of processors in the PSM, since contention for the shared bus seriously
degrades performance if the number of processors is larger than 64. Gupta proposes a multiple-bus
system for larger numbers of processors.
Finally, a hardware task scheduler assigns processes (node activations) to waiting processors. The
scheduler sits on the shared bus. A processor passes the scheduler a pointer to a new token by writing
to memory-mapped registers in the scheduler; the scheduler assigns a node activation to an idle
processor by writing to memory-mapped registers in the processor. Both operations take one bus cycle.
The scheduler is responsible for ensuring that multiple activations of a single node that cannot be
processed simultaneously are assigned serially. It handles this task by maintaining a task queue in
associative memory of all active and pending node activations. The task queue must be quite large, since
Gupta's simulations show that for the programs he simulated, the maximum number of nodes in the
queue was 2000, while the average number was 90.
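The serialization duty of the scheduler can be illustrated with a small software model. It makes a simplifying assumption that the hardware does not: any two activations of the same node are treated as conflicting, so at most one activation per node is in flight. The class and method names are hypothetical, and the linear scan of the pending queue stands in for the associative lookup performed by the hardware.

```python
from collections import deque

class Scheduler:
    """Toy model: at most one in-flight activation per node."""

    def __init__(self):
        self.pending = deque()    # (node_id, token) pairs awaiting a processor
        self.active = set()       # node ids currently being processed

    def submit(self, node_id, token):
        """A processor hands over a newly created token for node_id."""
        self.pending.append((node_id, token))

    def assign(self):
        """Give the oldest runnable activation to an idle processor, or None."""
        for i, (node_id, token) in enumerate(self.pending):
            if node_id not in self.active:
                del self.pending[i]
                self.active.add(node_id)
                return node_id, token
        return None

    def done(self, node_id):
        """A processor reports that its activation of node_id has finished."""
        self.active.discard(node_id)

s = Scheduler()
s.submit(7, "t1")
s.submit(7, "t2")
s.submit(9, "t3")
first = s.assign()     # (7, "t1")
second = s.assign()    # (9, "t3"): the second activation of node 7 must wait
```

A queue that must hold up to 2000 entries and be searched by node identifier on every assignment suggests why the text places it in associative memory.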
The hardware task scheduler transforms a fairly conventional general-purpose shared-memory
bus-based architecture into a specialized dataflow machine (for a description of dataflow machines, see [22]).
Each computation (node activation) in the Rete network can create new tokens, which actively (through
the scheduler) create new node activations. The mapping of computations to processors can be
completely dynamic, since the shared bus imposes no topological restraints on the pattern of
interprocessor communication.
Gupta's performance projections for the PSM are based on extensive event-driven simulations of the
execution of six OPS5 programs with a parametrized cost model. Traces of actual OPS5 program
execution drive the simulator; they show changes to working memory elements, node activations, and the
creation of Rete network tokens. The cost model includes the costs of the operations involved in
scheduling and processing node activations. Gupta estimated these costs by writing hypothetical
assembly code to perform various primitive operations and counting processor cycle times. The cost
model also includes the effects of contention for shared memory, based on user-specified cache-hit
ratios.
Gupta measured both concurrency (the mean number of processors busy) and speedup for
parallelization of the Rete match at the production level, at the node level (where the machine processes
node activations in parallel, but only one activation of a given node at a time), and at the intra-node level
(multiple activations of all nodes). Speedup differs from concurrency in the PSM because of the effects of
memory contention, loss of node sharing in the Rete match network, and scheduling overhead. The
intra-node parallelism figures are the most significant, since the PSM is designed to take advantage of
parallelism at this fine grain. Only one of the six programs Gupta studied showed significant performance
gains with more than 32 processors. With 32 processors, concurrency ranged from 4 to 22; the true
speedup was less, ranging from 2 to 15. These figures corresponded to execution speeds of 2000 to
14000 WME changes/second. The corresponding figures for node-level parallelism are somewhat worse:
concurrency ranged from 4 to 13, speedup from 1.6 to 6.6, and execution speed from 2000 to 9000 WME
changes/second. The figures for production-level parallelism are worse still: concurrency ranged from 2
to 12, speedup from 1.2 to 4.8, and execution speed from 800 to 7200 WME changes/second.
Gupta's simulations also showed that the hardware task scheduler is a necessary part of the PSM.
Using multiple software task queues rather than a hardware scheduler resulted in a halving of the
performance of the system.
According to the simulations, a single-processor PSM should be able to execute about 1000 WME
changes/second, which is several times faster than the best OPS83 implementations. Much of the
improvement seems to come from a change in the data structures used in the two-input nodes. Current
OPS implementations keep the tokens for the left and right inputs of the two-input nodes in linked lists, so
that the average cost of deleting a token from a node or of searching the node for a token with consistent
bindings is proportional to the length of the list. Gupta proposes keeping all tokens in two global hash
tables, one for left inputs of two-input nodes, and one for right inputs. Tokens would be hashed on an
identifying number for the node and on the values of the attributes used in the consistency tests in the
node. With this scheme, the average cost of deleting a token or of searching a node for tokens satisfying
equality tests should be independent of the number of tokens stored in the node. One architectural
consequence of maintaining a global hash table for tokens is that processors inserting or deleting tokens
must be able to lock an individual hash bucket.
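The global token memory can be sketched as a bucket array with one lock per bucket, keyed on the node identifier together with the values tested for equality at that node. This is an illustrative model using Python threads, not Gupta's design: the class name, the bucket count, and the use of Python's built-in `hash` are assumptions.

```python
import threading
from collections import defaultdict

class TokenTable:
    """One of the two global token memories (for example, all left inputs)."""

    def __init__(self, n_buckets=64):
        self.buckets = [defaultdict(list) for _ in range(n_buckets)]
        self.locks = [threading.Lock() for _ in range(n_buckets)]

    def _slot(self, node_id, key):
        # Hash on the node id and on the attribute values used in its equality tests.
        return hash((node_id, key)) % len(self.buckets)

    def insert(self, node_id, key, token):
        i = self._slot(node_id, key)
        with self.locks[i]:                  # lock only this bucket
            self.buckets[i][(node_id, key)].append(token)

    def delete(self, node_id, key, token):
        i = self._slot(node_id, key)
        with self.locks[i]:
            self.buckets[i][(node_id, key)].remove(token)

    def lookup(self, node_id, key):
        """Tokens of node_id whose tested attribute values equal key."""
        i = self._slot(node_id, key)
        with self.locks[i]:
            return list(self.buckets[i].get((node_id, key), ()))

left = TokenTable()
left.insert(12, ("b1",), "tokA")
left.insert(12, ("b2",), "tokB")
left.insert(13, ("b1",), "tokC")
hits = left.lookup(12, ("b1",))
```

A lookup touches only the tokens that share both node and key, which is why its cost does not grow with the total number of tokens at the node. It also shows the weakness seen later on the Encore: if many tokens share one key, they all fall into one bucket and serialize on one lock.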
6.2 The Encore Multiprocessor
A group at Carnegie-Mellon University has recently implemented OPS5 on the Encore multiprocessor
[19] and collected statistics on execution times of several programs. The Encore multiprocessor
resembles the proposed PSM in many ways, so the group's experiments also represent tests of the PSM.
The Encore multiprocessor is a shared-memory bus-based machine with 2 to 20 processors. Each pair
of processors shares a 32K byte cache: the caches monitor bus activity to maintain cache coherency, just
as in the PSM. Each Encore processor is a 32-bit microprocessor, an NS32032. Since the Encore is a
general-purpose machine, it lacks the PSM's specialized hardware task scheduler: therefore, Gupta's
experimental results must be compared with his simulation results for a PSM with multiple software task
queues.
The group varied the number of task queues, locking strategies for global hash table buckets, and the
number of processors used, and recorded actual execution times for three fairly large OPS5 programs
(not the same programs studied in [18]). The results agreed quite well with the predictions of Gupta's
simulator: once calibrated for the NS32032 instruction set, the simulator predicted many execution
times that were within 50% of the observed. Speedups over a single-processor execution ranged from 2
to 11 with 13 processors performing matching tasks. One unanticipated phenomenon was considerable
contention for shared hash buckets in a program that constructed large cross-products of working
memory elements: each hash bucket grew quite large, and each processor spent hundreds of cycles
waiting for each access to the hash table.
The experiments on the Encore multiprocessor are a partial, preliminary validation of the effectiveness
of the PSM. Gupta's experiments indicated that one major bottleneck in the Encore implementation was
slow task scheduling, and his simulator predicted much greater speedup for a machine with a fast
scheduler. On the other hand, for some programs, contention for shared hash buckets was the major
difficulty, and for these programs, software techniques such as copying and constraining rules promise
greater speedup than the hardware task scheduler.
6.3 MANJI 
A group of researchers at Keio University in Japan has proposed and evaluated MANJI, a
shared-memory multiprocessor for production systems [34]. The algorithm proposed for MANJI differs somewhat
from the PSM algorithm, and the differences dictate architectural choices made by the researchers. No
performance projections are available, so comparisons with the PSM and the Encore must be tentative.
The MANJI group proposes a Rete match algorithm with distributed node activations, but with nodes
assigned statically to processors, several nodes per processor. The arrival of a token at a processor
activates the processing of a node. Thus task assignment in MANJI is a distributed operation, while it is a
centralized one in the PSM. The PSM requires centralized task assignment because the architecture
supports multiple simultaneous activations of a node: since MANJI processes activations of a single node
sequentially, it does not require a centralized arbiter.
MANJI is a shared-memory machine with two buses, one for access to working memory, and one for
broadcasting Rete match tokens. The match processors use the working memory bus for access to
global working memory; in order to reduce bus traffic, each processor has a cache whose block size and
replacement strategy are tied to the structure of OPS5 working memory elements. Match processors use
the token bus for broadcasting and receiving Rete match tokens. The broadcast mechanism is a
receiver-selectable multicast (any processor can broadcast). A processor generating a token writes it to
an address in global memory determined by the node ID; each destination processor detects the arrival of
the token, and reads and processes it. There is no global queue for the activations of each node, so a
processor generating a node activation must wait to write the token to global memory until all destination
processors have read the previous token at that address. The receiver-selectable multicast has a
complex paging mechanism so that processors can avoid mapping the entire global address space into
their local address space. The active role of the token arrival queue at each node gives the MANJI
machine a distinct dataflow character.
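The write-blocking rule for token addresses can be illustrated with a one-slot mailbox. The sketch
below is a simplified software model; the TokenSlot class and its method names are hypothetical, and
the real mechanism is a hardware multicast with paging that is not modeled here.

```python
class TokenSlot:
    """A single global-memory address holding one node's token (no queue)."""

    def __init__(self, destinations):
        self.destinations = frozenset(destinations)
        self.token = None
        self.unread = set()    # destinations that have not yet read the token

    def try_write(self, token):
        # The writer must wait while any destination still has to read
        # the previous token stored at this address.
        if self.unread:
            return False
        self.token = token
        self.unread = set(self.destinations)
        return True

    def read(self, pid):
        if pid not in self.unread:
            return None        # nothing new for this processor
        self.unread.discard(pid)
        return self.token


slot = TokenSlot(destinations={1, 2})
first = slot.try_write("tok-A")          # the slot is free
blocked = slot.try_write("tok-B")        # processors 1 and 2 have not read tok-A
got1 = slot.read(1)
still_blocked = slot.try_write("tok-B")  # processor 2 has still not read tok-A
got2 = slot.read(2)
second = slot.try_write("tok-B")         # every destination has read tok-A
```

The slowest destination processor therefore sets the pace for every producer of that node's tokens.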
The MANJI group studied the steady-state characteristics of a simple Markov model of bus congestion
in the machine. They concluded that most of the time, in the absence of page faults in the
receiver-selectable multicast mechanism, bus congestion is minor, and most processors can be kept busy.
Presumably, large OPS5 programs would generate page faults, which the model does not take into
account.
Several potential problems loom over MANJI. Since nodes are assigned statically to processors, one
problem is load balancing, but this problem should not be as severe for node-level parallelism as for
production-level parallelism, since there are many more nodes than productions. Another problem is
serialization of activations of a single node; Gupta's studies seem to indicate that this serialization limits
the available parallelism.
7 Parallel Processing of Nodes Activated in Parallel 
In the Carnegie-Mellon Production System Machine [13], the unparallelizable unit of computation is the
node activation. Parallelizing single node activations may offer more parallelism than distributing
complete node evaluations. Several researchers [14, 51] have suggested dataflow machines as possible
architectures for fine-grained execution of production systems. Two proposed dataflow machines try to
exploit this parallelism by distributing each Rete memory node among a number of processors. Another
machine described below, a distributed-memory message-passing machine, executes a state-saving
variant of the Rete match algorithm, which distributes tokens among the PE's at the leaves of a complete
binary tree.
7.1 The Waterloo Dataflow machine
Michael Kelly and Rudolph Seviora of the University of Waterloo suggest a distributed Rete match
algorithm and a specialized dataflow machine for accelerating production system execution [25]. The
architecture they propose is a highly parallel distributed-memory machine with fast global communication.
Since their algorithm motivates the architectural specifics of the machine, it is described first.
Their algorithm distributes Rete match alpha- and beta-memory tokens throughout the machine. Each
logical Rete node contains a single token, though a processor may hold several such nodes. This
scheme lumps memory nodes together with the two-input-test nodes that follow them in the network.
Each logical node in the Waterloo machine consists of the tests and ID of a two-input-test node, together
with one token from its left or right input.
The arrival of a token (by a mechanism to be specified later) triggers one of several actions. If a
matching positive token arrives at the empty side of a logical and-node, the node creates a new
composite token and passes it on. If a positive token arrives at the full side of a logical and-node, and
that node is the single generative copy of the node, the node creates a new copy of itself, gives the new
node the token it received, and makes the new node the generative copy. The arrival of a negative token
at the full side of a logical and-node destroys the token and the node. Processing not-nodes is similar,
but more complicated, since no processor stores global information. The arrival of a token at the negated
input of a not-node generates tokens of the opposite polarity. The authors present a scheme for handling
the tokens generated at not-nodes that depends on a serializing communication channel.
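One possible reading of the three and-node rules is sketched below. Several points are assumptions
made only for this illustration, not taken from Kelly and Seviora: the names LogicalAndNode and
on_token, the use of one generative copy per input side, an initially empty generative copy that
fills itself with the first token, and the choice to empty rather than destroy a generative copy
whose token is deleted. Negative tokens at the empty side and not-nodes are not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LogicalAndNode:
    full_side: str                   # "left" or "right": the input whose token is stored
    stored: Optional[tuple] = None   # the single token this logical node holds
    generative: bool = False         # the copy allowed to spawn new copies


def on_token(copies, side, sign, token, test):
    """Show one broadcast token to every logical copy of a single and-node.

    Returns the composite tokens passed on to successor nodes.
    """
    out = []
    for node in list(copies):        # iterate over a snapshot; copies may change
        if side != node.full_side:
            # Token arrives at the empty side: join it with the stored token.
            if sign == "+" and node.stored is not None:
                left, right = ((node.stored, token) if node.full_side == "left"
                               else (token, node.stored))
                if test(left, right):
                    out.append((left, right))
        elif sign == "+" and node.generative:
            if node.stored is None:
                node.stored = token  # assumption: an empty generative copy fills itself
            else:
                node.generative = False
                copies.append(LogicalAndNode(node.full_side, token, generative=True))
        elif sign == "-" and node.stored == token:
            if node.generative:
                node.stored = None   # assumption: keep an empty generative copy alive
            else:
                copies.remove(node)  # destroy the token and the node
    return out


def same_key(left, right):
    # Hypothetical two-input test: the first fields of both tokens must agree.
    return left[0] == right[0]


copies = [LogicalAndNode("left", generative=True),
          LogicalAndNode("right", generative=True)]
on_token(copies, "left", "+", ("k1", "a"), same_key)
first = on_token(copies, "right", "+", ("k1", "b"), same_key)
second = on_token(copies, "left", "+", ("k1", "c"), same_key)
on_token(copies, "left", "-", ("k1", "a"), same_key)
```

Each call stands for one token seen by all copies, so the comparisons inside the loop are the work
that the machine would spread over its PE's.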
The proposed architecture consists of a number of PE's interconnected with a bidirectional
tree-structured bus with higher throughput near the root of the tree. (Such an architecture was proposed and
analyzed by Leiserson [28], who called it a fat-tree.) Each token generated by a logical node flows up
and down throughout the entire tree, since all logical nodes must see it. In addition, the PE's are
connected locally, perhaps in a grid, to facilitate load balancing. At the end of each match phase, the
entire machine executes a load balancing algorithm: each PE passes some of its newly generated logical
nodes to its neighbors, choosing the destination PE's by some criterion unspecified by the researchers.
Kelly and Seviora simulated the performance of their machine by simulating the steady-state execution
of a single repetitively-fired production. They obtained timings from a register-transfer-level simulator and
found a speedup of 2.5 with 4 PE's and 4.5 with 16 PE's. They claim that their machine has the potential
to speed up production systems by a factor of 350, the average number of comparisons performed while
processing the two-input nodes in the average execution cycle in Gupta's study [15].
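The reported speedups can be restated as parallel efficiency (speedup divided by the number of
PE's). The short calculation below uses only the two figures quoted above; the efficiency function
is a helper introduced for this illustration.

```python
def efficiency(speedup, num_pes):
    """Fraction of ideal linear speedup that was achieved."""
    return speedup / num_pes


reported = {4: 2.5, 16: 4.5}   # PE count -> simulated speedup
for pes, speedup in reported.items():
    print(f"{pes:2d} PE's: speedup {speedup}, efficiency {efficiency(speedup, pes):.1%}")
```

Quadrupling the number of PE's raises the speedup by a factor of only 1.8, so efficiency falls from
62.5% to about 28%.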
Several questions about the performance of Kelly and Seviora's machine remain unanswered. The
tree-structured bus, though it is asynchronous, must depend on buffers at each PE for its asynchrony.
Since buffers are finite, will the variance in processing time at each PE (processing time must vary in spite
of the load-balancing procedure) have a synchronizing effect? Parallelizing the search of a list by
distributing its elements and using many processors does speed up the search, but so can hashing
techniques, which use only one processor. How much speedup will distributing the contents of memory
nodes contribute over hashing them? Also, logical node migration implies code migration as well, since
the tests for a logical node migrate with it. How much does code migration increase the communication
overhead of the machine? Finally, the potential speedup by a factor of 350 is only an upper bound. How
many of the 350 comparisons are inherently sequential?
7.2 PESA-1 
A group of researchers at Honeywell Computer Sciences Center has proposed and analyzed a
different approach to building a dataflow production system machine. The PESA-1 machine [48, 54] uses
buses and random-number generators to distribute Rete network tokens evenly throughout the machine,
rather than relying on token replication and a load-balancing algorithm to effect distribution.
Conceptually, PESA-1 is just a bus-based distributed-memory machine built of custom processors.
Each PE stores in its local memory all the code for evaluating the production system. PESA-1 executes a
distributed Rete match algorithm; the algorithm's execution consists of a series of token creations and
node activations. A node is activated when a PE storing part of that node receives a token tagged with
the node's ID. If the new token matches the tokens stored in the PE, the PE generates new combined
tokens and broadcasts them on the bus. For each new token, the creating PE also generates a
destination PE number, using some random number generator. Thus in this distribution scheme, each
PE has a random selection of Rete match tokens from all Rete network nodes. The authors do not
discuss the processing of not-nodes, but presumably they can be handled as in the Waterloo dataflow
machine.
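The random placement of tokens can be sketched as follows. The distribute function, the fixed seed,
and the synthetic token stream are assumptions made for the illustration; the source does not say
which random number generator PESA-1 uses.

```python
import random


def distribute(tokens, num_pes, seed=0):
    """Give every newly created token a randomly chosen destination PE."""
    rng = random.Random(seed)          # seeded only to make the sketch repeatable
    placement = {pe: [] for pe in range(num_pes)}
    for token in tokens:
        placement[rng.randrange(num_pes)].append(token)
    return placement


# Synthetic tokens tagged with the ID of the Rete node they belong to.
tokens = [(f"node{n % 5}", n) for n in range(1000)]
placement = distribute(tokens, num_pes=8)
load = {pe: len(held) for pe, held in placement.items()}
print(load)
```

No token is replicated and no separate load-balancing pass runs; the balance is statistical, so each
PE ends up holding a mixture of tokens from many different nodes.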
PESA-1's actual architecture and algorithm are somewhat more complicated. The PESA-1 PE's are
arranged in levels; at each level i, two shared buses connect it with level i+1 and level i-1. The levels
correspond to levels in the Rete match network, and the number of PE's at each level decreases with
increasing level number, as Gupta's statistics indicate that the Rete match creates fewer tokens in each
successive matching level. If the compiler generates a Rete network with more levels than there are
levels in PESA-1, the compiler folds the network to make it fit in the machine. When a PE creates a
token, it tags it not just with its destination PE, but also with a level number.
Simulations with an instruction-level simulator predict an execution rate of 25K working memory
element changes/second on a small program (10 productions) and a configuration of 4 PE's in the first
level, 32 in the second, 4 in the third, 2 in the fourth, and 1 in the fifth. Analyses of bus contention predict
that a bus with a 100-nanosecond cycle time (the same cycle time assumed in the Production System
Machine) should be able to handle 160K WME changes/second, so bus contention should not be a
limiting factor for PESA-1.
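The figures quoted above imply the following totals. The variable names are introduced only for this
calculation, and "25K" and "160K" are read as 25,000 and 160,000.

```python
pes_per_level = [4, 32, 4, 2, 1]   # configuration quoted for levels 1 to 5
predicted_rate = 25_000            # WME changes/second from the simulator
bus_capacity = 160_000             # WME changes/second the bus analysis allows

total_pes = sum(pes_per_level)
headroom = bus_capacity / predicted_rate
print(total_pes, headroom)
```

The simulated configuration has 43 PE's, and the bus could carry 6.4 times the predicted match rate
before becoming the bottleneck.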
7.3 Oflazer's machine
Kemal Oflazer [41] proposes a tree-structured machine to execute a variant of the Rete match
algorithm that saves all nonredundant state information, and provides some performance projections
derived from simulations.
Recall that for each production, the Rete match saves the relations (i.e. ordered k-tuples of WME's)
satisfying each prefix C1 & ... & Ck of the production's LHS. Oflazer proposes an algorithm that, for each
production, saves the relations satisfying all subsequences Ci1 & ... & Cik of the LHS. Oflazer terms a
member of one of these relations an instance element. Although a change to working memory can cause
changes to many of these relations, most of the processing of these changes can be done in parallel.
Oflazer calls a subsequence of an instance element a redundant instance element, since the state
information it contains is duplicated in the larger one. His algorithm does not save redundant instance
elements. Unfortunately, changes to WM can create redundant instance elements, and eliminating them
must be done sequentially.
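The difference between the two state-saving schemes shows up in a simple count of the
condition-element combinations for which relations are kept. The sketch assumes that "subsequence"
means any order-preserving selection of condition elements, not only contiguous runs; the function
names are illustrative.

```python
from itertools import combinations


def prefixes(ces):
    """Combinations of condition elements whose matches Rete saves."""
    return [tuple(ces[:k]) for k in range(1, len(ces) + 1)]


def subsequences(ces):
    """All non-empty order-preserving selections of condition elements."""
    return [combo
            for k in range(1, len(ces) + 1)
            for combo in combinations(ces, k)]


lhs = ["C1", "C2", "C3", "C4"]
print(len(prefixes(lhs)), len(subsequences(lhs)))
```

For an LHS with n condition elements there are n prefixes but 2^n - 1 subsequences, so the number of
relations that may need to be maintained grows exponentially with the length of the LHS.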
Oflazer proposes a tree-structured distributed-memory multicomputer to execute his algorithm. The
proposed machine has several hundred fast processors at the leaves and specialized switches at the
interior nodes. The algorithm distributes the stored state for a production among the leaf processors of
some subtree, and each processor stores part of the state of a number of productions. During the match
phase, processors make changes to their stored state in response to each working memory change, and
communicate through the I/O switches at the interior nodes in order to eliminate redundant instance
elements.
Results from simulations of three of the OPS5 programs Gupta studied indicated that 512 5- to
10-MIPS processors should execute the systems at about 2200 to 7000 WME changes per second.
The most serious difficulty with Oflazer's algorithm is the potential for an exponential explosion of the
size of the saved state. As a remedy for this problem, he suggests splitting productions with many CE's
into several productions with fewer CE's each, and connecting them with message CE's. Unfortunately,
this remedy serializes some of the processing, and increases the number of productions fired in a
program run.
8 Parallel Rule Firings 
A final source of parallelism in production systems is sometimes called application parallelism [18],
which may consist of multiple threads of control or of firing many non-interfering rule instantiations
simultaneously. Several groups have considered multiple rule firings. While their results have generally
been inconclusive, their methods promise to have some influence on production system architectures.
D. I. Moldovan and F. M. Tenorio [63, 35, 36] have studied the problem of partitioning production
systems in order to minimize communication among partitions. They identify four types of dependencies
between productions, which they term output dependence, interface dependence, input-output
dependence, and input dependence. Only two types of dependence prevent simultaneous rule firings:
input dependence, where one production's actions destroy the preconditions for another production's
firing, and input-output dependence, where one production creates preconditions for another. When two
rules cannot fire simultaneously, they must be in the same partition, or there must be communication
between their respective partitions. The authors built a simulator that estimated total execution time,
taking into account synchronization constraints and communication costs for various networks, and claim
that a 16-processor system should execute 4000 rule cycles per second.
One problematic aspect of Moldovan and Tenorio's research is that it derives partitionings that are very
different from Oflazer's. Oflazer's methods, which addressed load-balancing problems, placed similar
productions into different partitions. But similar productions are likely to have similar pre- and
post-conditions, and hence to have dependencies. Oflazer's partitioning trades communication (broadcasting
working memory changes to all partitions in the hope that each change affects all partitions) and storage
(replicated working memory elements) for load balancing, while Moldovan and Tenorio's partitioning
seems to do the opposite.
Toru Ishida and Salvatore Stolfo have analyzed dependencies and synchronization requirements in
production systems in [23, 24]. They also propose a framework for multiple rule firings on tree-structured
machines.
They construct a graph with one vertex for each production and working memory class and plus- and
minus-labeled directed edges between productions and working memory classes. The graph contains an
edge from a production to a working memory class if that production can create (+) or destroy (-) a
working memory element of that class. Likewise, the graph contains an edge from a working memory
class to a production if that class appears in a positive (+) or negative (-) condition element on the LHS of
the production. They distinguish three classes of interactions between productions. Suppose P and Q
are two productions. If, for every working memory class W such that P has a + edge to W, there is no
edge, or there is a + edge from W to Q, then P and Q can be fired in parallel. If there is a class W such
that P has a + edge to W, but W has a - edge to Q, then P and Q may not be fired in parallel, since firing
P may destroy Q's execution environment. Finally, if there is a class W such that P has a + edge to W,
while Q has a - edge to W, then P and Q cannot be fired in parallel, since the result may not be the same
as any serial firing of P and Q.
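The three cases can be checked mechanically once the graph is stored as signed edge sets. The sketch
below encodes only the rules as stated above. The data layout, the function names, and the decision
to apply the rules in both orders (P against Q and Q against P) are assumptions made for this
illustration rather than details of Ishida and Stolfo's analysis.

```python
def interference(p, q, writes, reads):
    """Return a reason why firing p alongside q is unsafe, or None.

    writes[p] holds (class, sign) pairs: '+' if p can create, '-' if p can destroy.
    reads[q] holds (class, sign) pairs: '+' for a positive CE, '-' for a negative CE.
    """
    for wm_class, sign in sorted(writes.get(p, set())):
        if sign != "+":
            continue   # the rules above are stated only for + edges out of P
        if (wm_class, "-") in reads.get(q, set()):
            return f"{p} may create a {wm_class} that violates a negative CE of {q}"
        if (wm_class, "-") in writes.get(q, set()):
            return f"{p} creates and {q} destroys {wm_class}"
    return None


def can_fire_in_parallel(p, q, writes, reads):
    return (interference(p, q, writes, reads) is None
            and interference(q, p, writes, reads) is None)


# Hypothetical four-rule program.
writes = {"P": {("goal", "+")}, "Q": {("order", "+")}, "R": {("goal", "-")}}
reads = {"P": {("part", "+")}, "Q": {("goal", "+")},
         "R": {("part", "+")}, "S": {("goal", "-")}}

print(can_fire_in_parallel("P", "Q", writes, reads))
print(interference("P", "S", writes, reads))
print(interference("P", "R", writes, reads))
```

Here P and Q may fire together, P and S may not because S tests for the absence of a goal that P can
create, and P and R may not because they write the same class with opposite signs.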
Ishida and Stolfo next designed a hierarchical partitioning algorithm intended to maximize parallel
execution. Applying this algorithm to one production system whose synchronization analysis had been
improved manually, they found that they could fire an average of seven productions in parallel. Based on
this splitting algorithm, they propose a scheme for executing multiple rule firings in parallel on a
tree-structured machine such as DADO.
An unresolved problem in the work of both groups is the consistency of their multiple-firing schemes
with standard conflict resolution strategies. For instance, suppose that instantiations of rules P and Q
both exist, but that the conflict resolution strategy favors P's instantiation. Suppose further that firing P
creates a working memory element that creates another instantiation of Q, and that conflict resolution
selects Q's second instantiation for firing. Then the work of Ishida and Stolfo allows P's instantiation and
Q's first instantiation to fire simultaneously, while sequential execution with conflict resolution fires P's
instantiation, followed by Q's second instantiation.
A. O. Oshisanwo and P. P. Dasiewicz propose in [42] a heterogeneous production system machine,
MAPPS, and a scheme for allowing multiple rule firings. They propose a run-time check for conflicting
instantiations, and present high performance predictions for a distributed Rete algorithm running on the
machine.
The MAPPS machine comprises three sets of PE's. The first set executes constant tests, which are
distributed statically among its PE's. The second set consists of clusters of PE's sharing common memory
within a cluster but communicating between clusters on a bus; each cluster executes the two-input tests
for a set of productions, which constitutes a subset of a partition. The third set is a tree of PE's for conflict
resolution and execution of multiple rule firings.
Oshisanwo and Dasiewicz state that an (unspecified) statistical model of production systems based on
Gupta's statistics gave simulated execution rates of 10K to 19K working memory element changes per
second. They are now working on a register-transfer-level simulator of the machine.
Unfortunately, the authors do not present the details of their run-time strategy for avoiding the firing of
conflicting rules. They seem to imply that the cost of the run-time checks involved is linear in the number
of partitions, the number of condition elements in each production, and the number of working memory
elements created or destroyed by each rule. Since each pair of instantiations must be checked for
conflicts, the obvious algorithms have costs quadratic in the number of partitions. If the number of
partitions is large, this check is likely to be expensive, and a major bottleneck in the machine.
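The quadratic growth is simply the number of unordered pairs. The sketch below assumes, as the
argument above does, that one candidate instantiation per partition must be checked against every
other; the function names are illustrative.

```python
from itertools import combinations


def pairwise_checks(n):
    """Number of unordered pairs among n candidate instantiations."""
    return n * (n - 1) // 2


def count_by_enumeration(items):
    """The same count obtained by listing every pair explicitly."""
    return sum(1 for _ in combinations(items, 2))


for partitions in (8, 64, 512):
    print(partitions, pairwise_checks(partitions))
```

Multiplying the number of partitions by eight multiplies the number of checks by roughly sixty-four,
which is why a linear-cost claim for this step would need justification.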
9 Conclusion 
This paper has presented descriptions of proposed production system architectures classified
according to properties of the algorithm(s) each is designed to execute. Almost no consensus can be
discerned about any aspect of production system architecture or method of parallelizing the match
computation. The proposed architectures include shared- and distributed-memory machines. Among the
distributed-memory computers, PE's communicate through buses, interconnection networks of various
topologies, or a combination of the two. Researchers have proposed homogeneous and heterogeneous
machines, both SIMD and MIMD, with processors ranging from single-bit ALU's to powerful custom 32-bit
chips. The proposed algorithms occupy several points in the state-saving spectrum, although most are
parallelizations of the Rete match. Parallel versions of the Rete match algorithm parallelize different parts
of the computation. The algorithms display tradeoffs between the number of parallel tasks (inversely, the
granularity of each task) and delays due to communication, synchronization, and scheduling constraints
and overheads. Each parallelization technique distributes some of the data structures and replicates some
data, as displayed below.
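[Table: data structures distributed and replicated by each parallelization technique. Most of the table
did not survive extraction.
Surviving row labels: "Sequential node activations, parallel processing of each"; "Parallel node
activations, sequential processing of each"; "Parallel node activations, ..." (truncated).
Surviving cell entries: "alpha- and beta-memories"; "One- and two-input tests (statically or dynamically
distributed)"; "Contents of alpha- and beta-memories"; "PM, alpha-memories, conflict set"; "All partial
instantiations"; "(possibly at runtime)".]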
The wide range of parallel algorithms for PS execution stems partly from ignorance about the run-time
behavior of production systems. In particular, questions such as how much state a production system
interpreter should save, and how parallelizable production systems are, still lack definitive answers.
Miranker's comparisons of TREAT and Rete [31, 33] show that, for at least some OPS5 production
systems, maintaining the state saved by Rete is computationally more expensive than throwing away
partial instantiations that do not form part of a complete instantiation, and recomputing them as
necessary. Although Gupta's statistics indicate that for the systems he studied, most changes to working
memory caused changes to the stored state of some 26 productions, the size of his sample was quite
small--he published statistics on only twelve production systems, all written in OPS5. Although he tried to
select a representative sample of OPS5 programs, the phenomena he noted may be artifacts of a
programming style dictated by the language (cf. [32, 65]). Also, the number of affected rules in each PS
cycle is closely related to the number of changes to WM in each cycle. Firing several rules in each cycle
can increase the number of changes to WM, but semantic problems involved in multiple rule firing must
still be resolved.
The performance projections offered by various research groups do not go far toward establishing
sound bases for comparing different machines. The projections are based almost entirely on simulation
results, in some cases at a register-transfer level, in others at the level of events such as node
activations. Some working prototype machines (such as DADO) exist, but few performance
measurements for them have been published. Device technology and processor cycle time are important
determinants of performance, but each research group assumes different values of these parameters. To
make matters worse, performance measurements on prototypes cannot be compared directly with
performance projections, because the prototypes may not use current technology. Finally, the different
degrees of rule compilation assumed in different machines, and the confounding role of the interpreter's
base language, make comparisons even more difficult.
Even if the various performance projections were normalized and directly comparable, most of them
would still be inadequate for evaluating the proposed machines. Simulations at the register-transfer level,
for example, give very accurate timing information, but they must be run on very small data sets, and can
simulate only very small systems. Conclusions based on these small systems cannot be generalized to
very large systems in the absence of statistical and worst-case knowledge about their behavior. On the
other hand, simulations driven by events such as Rete node activations can give information about much
larger systems, but the results are only as general as the data driving the simulations are representative.
Statistical models of production system behavior offer generality, but only the MANJI group have
constructed such models, and only of bus contention in their machine (which is probably not its
performance-limiting factor anyway).
Most published analyses of production system machines omit two important evaluation criteria:
generality and efficiency. Production system machines should be general--capable of efficient execution
of more than OPS5 programs--because not all production systems are OPS5 programs. Performance
evaluations of these machines should consider not just throughput, but should also include studies of
utilization and efficiency.
In summary, although many interesting algorithms and machines for accelerating production system
execution have been proposed, it is impossible to predict how well they will perform on real systems.
Furthermore, even though most of the proposed machines are designed to execute one language
(OPS5), it is impossible to conclude which is fastest. Given the large number of proposed machines, and
the importance of speeding up production systems, a more systematic and general evaluation is required.
References 
[1] Blelloch, G. E.
CIS: A Massively Concurrent Rule-Based System.
In Fifth National Conference on Artificial Intelligence, pages 735-741. AAAI, 1986.
[2] Brooks, R., and Lum, R.
Yes, An SIMD Machine Can Be Used For AI.
In Ninth International Joint Conference on Artificial Intelligence, pages 73-79. ACM, 1985.
[3] Brownston, L., Farrell, R., Kant, E., and Martin, N.
Programming Expert Systems in OPS5.
Addison-Wesley, Reading, Massachusetts, 1985.
[4] Chandra, A. K., and Merlin, P. M.
Optimal Implementation of Conjunctive Queries in Relational Data Bases.
In Proceedings of the Ninth Annual ACM Symposium on Theory of Computing, pages 77-90.
1977.
[5] Davis, R., and King, J.
An Overview of Production Systems.
Machine Intelligence 8: Machine Representations of Knowledge.
Ellis Horwood, Chichester, 1977, pages 300-322.
Also published as Technical Report AIM-271, Computer Science Department, Stanford University,
1975.
[6] Duda, R. O., Hart, P. E., Konolige, K., and Reboh, R.
A Computer-based Consultant for Mineral Exploration.
Technical Report, SRI International, September, 1979.
[7] Forgy, C. L.
OPS4 User's Manual.
Technical Report CMU-CS-79-132, Computer Science Department, Carnegie-Mellon University,
July, 1979.
[8] Forgy, C. L.
On the Efficient Implementation of Production Systems.
PhD thesis, Carnegie-Mellon University, 1979.
[9] Forgy, C. L.
Note on Production Systems and Illiac IV.
Technical Report CMU-CS-80-130, Computer Science Department, Carnegie-Mellon University,
1980.
[10] Forgy, C. L.
OPS5 User's Manual.
Technical Report CMU-CS-81-135, Computer Science Department, Carnegie-Mellon University,
July, 1981.
[11] Forgy, C. L.
Rete: A Fast Algorithm for the Many Pattern/Many Object Pattern Match Problem.
Artificial Intelligence (19):17-37, 1982.
[12] Forgy, C. L.
The OPS83 Report.
Technical Report CMU-CS-84-133, Computer Science Department, Carnegie-Mellon University,
May, 1984.
[13] Forgy, C. L., and Gupta, A.
Preliminary Architecture of the CMU Production System Machine.
In Nineteenth Hawaii International Conference on System Sciences, pages 194-200. ACM, Kona,
Hawaii, January, 1986.
[14] Graham, P. C. J.
Providing Architectural Support for Expert Systems.
Computer Architecture News 12(5):12-18, 1984.
[15] Gupta, A., and Forgy, C. L.
Measurements on Production Systems.
Technical Report CMU-CS-83-167, Computer Science Department, Carnegie-Mellon University,
1983.
[16] Gupta, A.
Implementing OPS5 Production Systems on DADO.
In Proceedings of the 1984 International Conference on Parallel Processing, pages 83-91. IEEE,
1984.
[17] Gupta, A., Forgy, C. L., Newell, A., and Wedig, R.
Parallel Algorithms and Architectures for Rule-Based Systems.
In The 13th Annual International Symposium on Computer Architecture, pages 28-37. IEEE,
Tokyo, Japan, 1986.
[18] Gupta, A.
Parallelism in Production Systems.
PhD thesis, Carnegie-Mellon University, March, 1986.
[19] Gupta, A., Forgy, C. L., Kalp, A., Newell, A., and Tambe, M.
Results of Parallel Implementation of OPS5 on the Encore Multiprocessor.
Technical Report CMU-CS-87-146, Computer Science Department, Carnegie-Mellon University,
1987.
[20] Hillis, W. D.
The Connection Machine (Computer Architecture for the New Wave).
Technical Report 646, M.I.T. Artificial Intelligence Laboratory, September, 1981.
[21] Hillyer, B. K., and Shaw, D. E.
Execution of OPS5 Production Systems on a Massively Parallel Machine.
Journal of Parallel and Distributed Computing 3(2):236-268, 1986.
[22] Hwang, K., and Briggs, F. A.
Computer Architecture and Parallel Processing.
McGraw-Hill, New York, New York, 1984.
Pages 732.7臼.
[23] Ishida, T., and Stolfo, S. J.
Simultaneous Firing of Production Rules on Tree Structured Machines.
Technical Report CUCS-109-84, Computer Science Department, Columbia University, 1984.
[24] Ishida, T., and Stolfo, S. J.
Towards the Parallel Execution of Rules in Production System Programs.
In Proceedings of the 1985 International Conference on Parallel Processing, pages 568-575.
IEEE, 1985.
[25] Kelly, M. A., and Seviora, R. E.
A Multiprocessor Architecture for Production System Matching.
In Sixth National Conference on Artificial Intelligence, pages 36-41. AAAI, 1987.
[26] Lehr, T. F.
The Implementation of a Production System Machine.
In Nineteenth Hawaii International Conference on System Sciences, pages 177-186. ACM, Kona,
Hawaii, January, 1986.
[27] Lehr, T. F.
The GaAs Realization of a Production System Machine.
In Nineteenth Hawaii International Conference on System Sciences, pages 246-252. ACM, Kona,
Hawaii, January, 1986.
[28] Leiserson, C. E.
Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing.
In Proceedings of the 1985 International Conference on Parallel Processing, pages 393-402.
IEEE, 1985.
[29] McDermott, J.
R1: A Rule-based Configurer of Computer Systems.
Artificial Intelligence (19):39-88, 1982.
[30] Minsky, M.
A Framework for Representing Knowledge.
The Psychology of Computer Vision.
McGraw-Hill, New York, 1975.
[31] Miranker, D. P.
Performance Estimates for the DADO Machine: A Comparison of TREAT and RETE.
Technical Report CUCS-118-84, Computer Science Department, Columbia University, 1984.
[32] Miranker, D. P.
Treat: A New and Efficient Match Algorithm for AI Production Systems.
PhD thesis, Columbia University, October, 1986.
[33] Miranker, D. P.
TREAT: A Better Match Algorithm for AI Production Systems.
In Sixth National Conference on Artificial Intelligence, pages 42-47. AAAI, 1987.
[34] Miyazaki, J., Amano, H., and Aiso, H.
MANJI: An Architecture for Production Systems.
In Twentieth Hawaii International Conference on System Sciences, pages 236-245. ACM, Kona,
Hawaii, January, 1987.
[35] Moldovan, D. I.
A Model for Parallel Processing of Production Systems.
In Proceedings of the 1986 IEEE International Conference on Systems, Man, and Cybernetics,
pages 568-573. IEEE, 1986.
[36] Moldovan, D. I.
A Multiprocessor for Rule-Based Systems.
In Proceedings Supercomputing '87, pages 482-490. International Supercomputing Institute, Inc.,
1987.
[37] Newell, A.
Production Systems: Models of Control Structures.
Visual Information Processing.
Academic Press, New York, 1973, pages 463-526.
[38] Newport, D. F., Alley, G. T., Bryan, W. L., Eason, R. O., and Bouldin, D. W.
A Parallel Symbol-Matching Co-processor for Rule Processing Systems.
In Proceedings of the 1986 IEEE International Conference on Systems, Man, and Cybernetics,
pages 578-81. IEEE, 1986.
(39] Nilsson , N. J. 
Principles 01 Artiliciallntel/i旨ence.
Tioga Publishing Company, Palo Alto , Calitornia , 1980. 
[40] Otlazer. K. 
Partitioning in Parallel Processing ot Production Systems. 
In Proceedings 01 the 19841nternatωnal Conference on Parallel Processing. pages 92-100. 
IEEE, 1984. 
[41] Otlazer, K. 
Partitioning in Parallel Processing of PrOduction Systems. 
PhD thesis. Carnegie-Mellon University , March, 1987. 
[42J OShisanwo, A. 0. , and Dasiewicz, P. P. 
A Parallel Model and Architecture for Production Systems. 
In Proceedings of the 1987 International Conference on Parallel Processing, pages 147-153. 
IEEE, 1987. 
[43J Pasik. A. 
ηle OPS Family of Production System Languages. 
Technical Report CUCS-232-86, Computer Sclence Department, Columbia University, 1986. 
[44J Patterson , D. A.. and Sequin, C. H. 
A VLSI RISC. 
IEEE Computer , 5(9) :8-21. 1982. 
[45] Perlin , M. 
On the Computational Equivalence of Frame Systems and Rule Systems. 
In U. S.-Japan AI Symposium 8i气 Tokyo ， November, 1987. 
[46J Ouillian. R. 
Semantic Memory. 
Semantic Inlormation Processir略
MIT Press. Cambridge. Massadluse口5 ， 1968. 
[4ηOuinlan ， J. 
A Comparative Analysis of Computer Architectures for Production System Machines. 
In Nineteenth Hawaii International Conlerence on System Sciences. pages , 87 -193. ACM , Kona. 
Hawaii. January, 1986. 
[48J Ramnarayan , R. , Zimmermann. G. , and Krolikoski , S. 
PESA-1: A Parallel Architecture for OPS5 Production Systems. 
In Nineteenth Hawaii International Conference on System Sciences, pages 201 -205. ACM. Kona, 
Hawaii. January. 1986. 
[49] Reed Jr. , B. 
The Aspro Parallellnference Engine (P.I. E.): A Real Time Production Rule System. 
Technical Repoπ85-6048 ， Goodyear Aerospace , 1985. 
Pages 459-464. 
[50] Rohmer, J., and Gonzalez-Rubio, R. 
Delta Drive Computer: A Parallel Machine for Symbolic Processing.
In Convention Informatique, pages 150-155. SICOB, Paris, France, 1986.
[51] Rokey, M. 
The Dataflow Architecture: A Suitable Base for the Implementation of Expert Systems.
Computer Architecture News 13(4):8-14, 1985.
[52] Rudolph, L., and Segall, Z.
Dynamic Decentralized Cache Schemes for MIMD Parallel Processors.
In Eleventh Annual International Symposium on Computer Architecture, pages 340-7. Ann Arbor,
Michigan, 1984. 
[53] Rychener, M. D.
A Semantic Network of Production Rules in a System for Describing Computer Structures.
In Sixth International Joint Conference on Artificial Intelligence, pages 738-743. ACM, 1979.
[54] Schreiner, F., and Zimmermann, G.
PESA 1--A Parallel Architecture for Production Systems.
In Proceedings of the 1987 International Conference on Parallel Processing, pages 166-169.
IEEE, 1987.
[55] Shaw, D. E.
The NON-VON Supercomputer.
Technical Report CUCS-29-82, Computer Science Department, Columbia University, August,
1982.
[56] Shortliffe, E. H.
Computer-based Medical Consultations: MYCIN.
Elsevier, New York, 1976.
[57] Stolfo, S. J., and Miranker, D. P.
DADO: A Parallel Processor for Expert Systems. 
In Proceedings of the 1984 International Conference on Parallel Processing, pages 74-82. IEEE, 
1984. 
[58] Stolfo, S. J.
Five Parallel Algorithms for Production System Execution on the DADO Machine.
In Proceedings of the National Conference on Artificial Intelligence, pages 300-307. AAAI, 1984.
[59] Stolfo, S. J.
A Note on Implementing OPS5 Production Systems on DADO.
Technical Report CUCS-130-84, Computer Science Department, Columbia University, 1984.
[60] Stolfo, S. J.
On the Design of Parallel Production System Machines: What's in a LIP?
In Eighteenth Hawaii International Conference on System Sciences, pages 232-237. ACM, Kona,
Hawaii, January, 1985.
[61] Stolfo, S. J., Miranker, D. P., and Mills, R. C.
More Rules May Mean Faster Parallel Execution.
In Proceedings of the Workshop on AI and Distributed Problem Solving. Washington, D.C., May,
1985.
[62] Stolfo, S. J., and Miranker, D. P.
The DADO Production System Machine.
Journal of Parallel and Distributed Computing 3(2):269-296, 1986.
[63] Tenorio, M. F. M., and Moldovan, D. I.
Mapping Production Systems into Multiprocessors.
In Proceedings of the 1985 International Conference on Parallel Processing, pages 56-62. IEEE, 
1985. 
[64] Ullman, J. D.
Principles of Database Systems.
Computer Science Press, Rockville, Maryland, 1982.
[65] van Biema, M., Miranker, D. P., and Stolfo, S. J.
The Do-Loop Considered Harmful in Production System Programming.
In First International Conference on Expert Database Systems, pages 88-97. ACM, Charleston,
South Carolina, April, 1986.
[66] Vesonder, G., Stolfo, S. J., Zielinsky, J. E., Miller, F. D., and Copp, D. H.
ACE: An Expert System for Telephone Cable Maintenance.
In Eighth International Joint Conference on Artificial Intelligence, pages 116-121. ACM, 1983.
