This paper presents a new production system architecture that takes advantage of modern associative memory devices to allow parallel production firing, concurrent matching, and overlap among matching, selection, and firing of productions. We prove that the results produced by the architecture are correct according to the serializability criterion. A comprehensive event-driven simulator is used to evaluate the scaling properties of the new architecture and to compare it with a parallel architecture that does global synchronization before every production firing. We also present measures for the improvement in speed due to the use of associative memories and an estimate for the amount of associative memory needed. Architectural evaluation is facilitated by a new benchmark program that allows for changes in the number of productions, the size of the database, the variance between the sizes of local data clusters, and the ratio between local and global data. Our results indicate that substantial improvements in speed can be achieved with a very modest increase in hardware cost.
Introduction
Considerable efforts have been made towards speeding up production system machines in the past twenty years [6, 29]. Originally, production systems were realized as interpreted language programs for sequential machines. The high cost of matching motivated the development of concurrent matching systems and, subsequently, systems that also allowed multiple productions to be fired at the same time. In a separate line of research, modern compile optimization techniques were developed to run production system programs more efficiently on general purpose sequential machines.
These efforts have led to great advances in the understanding of the issues involved in the construction of faster production system machines, but only limited improvement in actual performance. Also, there have been few attempts to integrate progress made in different areas: the use of the restrictive commutativity criterion for correctness and the notion of a match-select-act "cycle" forced even advanced architectures to perform synchronization before each production firing; compile optimization techniques were mostly restricted to sequential machines; many of the concurrent matching engines were constructed with a large number of small processors and were not combined with parallel firing techniques. Moreover, parallel processing researchers failed to take advantage of the fact that, in typical production systems, reading operations are performed much more often than writing ones.
We propose a novel parallel production system architecture that uses the less restrictive serializability criterion for correctness. This architecture eliminates the concept of a production system "cycle", thus eliminating the need to construct a global "conflict set" and to perform global synchronization before each production firing. Productions are partitioned among processors based on information about the workload of each production and on production dependencies identified through compiling techniques. The use of modern content addressable memories allows a new production to be selected to fire before all the matches resulting from previous production actions are complete. This architecture follows an early recommendation of Gupta and Forgy [15], i.e., that a parallel production system machine be constructed with a small number of relatively powerful processors.
Background
Attempts to speed up Production Systems (PS) date back to 1979, when Forgy created the Rete network, a state-saving algorithm to speed up the matching phase of PS [11]. Following a 1986 study by Gupta, which indicated that a significant portion of the processing time in a Rete-based PS machine is consumed in the matching phase [14], substantial efforts were made to improve this phase. These include concurrent implementations of the Rete network [13, 16, 25, 24, 39], generalization of the Rete network [30], elimination of internal memories from the Rete network to increase speed [31], extension of the Rete network for compatibility with real-time systems [9], and the use of message-passing computers to implement the Rete network [2]. Progress in other aspects of production system machines included compile time optimization for the Rete network [19], nondeterministic resolution for the conflict set combined with parallel firing of productions [21, 27], loosely coupled implementations of production systems [21], and the use of meta rules to solve the conflict set [40]. Comprehensive surveys of the research towards speeding up production systems are found in the works of Kuo and Moldovan [29] and Amaral and Ghosh [6].
The issue of which criterion to use for correctness in the execution of a production system is still an open question. The two most prominent candidates are the commutativity criterion and the serializability criterion. When commutativity is used, a set of rules can be executed in parallel if and only if the result is the same as that produced by any possible sequential execution of the set. Under serializability it is enough that the result produced by the parallel execution be equal to that of at least one sequential execution of the set [36].
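The difference between the two criteria can be illustrated with a small sketch. It is not part of the architecture: the working memory is modelled as a set of facts, each rule is a hypothetical Python function that fires at most once, and the names `sequential_results`, `is_serializable`, `is_commutative`, `r1` and `r2` are assumptions introduced only for this illustration.

```python
from itertools import permutations


def sequential_results(rules, state):
    """Final states obtained by firing every rule once, in each possible order."""
    results = set()
    for order in permutations(rules):
        wm = frozenset(state)
        for rule in order:
            wm = rule(wm)
        results.add(wm)
    return results


def is_serializable(parallel_result, rules, state):
    # Enough that ONE sequential order yields the parallel result.
    return frozenset(parallel_result) in sequential_results(rules, state)


def is_commutative(parallel_result, rules, state):
    # EVERY sequential order must yield the parallel result.
    return sequential_results(rules, state) == {frozenset(parallel_result)}


# Hypothetical rules (assumptions for illustration only).
def r1(wm):
    """If fact 'a' is present, add fact 'b'."""
    return wm | {"b"} if "a" in wm else wm


def r2(wm):
    """If fact 'a' is present, replace it with fact 'c'."""
    return (wm - {"a"}) | {"c"} if "a" in wm else wm


initial = {"a"}
print(is_serializable({"b", "c"}, [r1, r2], initial))  # True: r1 then r2
print(is_commutative({"b", "c"}, [r1, r2], initial))   # False: r2 then r1 gives {"c"}
```

In this example the pair of rules could be fired in parallel under serializability, provided the outcome is the one of the order r1, r2, but not under commutativity, because the two sequential orders disagree.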
The commutativity criterion proposed by Ishida and Stolfo [20] is favored by programmers because it allows for easy verification of correctness in a production system. However, it is very restrictive and the amount of parallelism extracted from a PS using this criterion is very low. The use of the serializability criterion increases the amount of parallelism available but makes the verification of correctness in a program more difficult. Nevertheless, if serializable production systems are proven to be sufficiently faster than commutable ones, development tools will be created to aid the verification of correctness. Schmolze and Snyder [38] studied the use of confluence to control a parallel production system. They suggest the use of term rewriting systems [17, 26] to verify the confluence of a production set. They argue that a confluent production set that is guaranteed to terminate will produce the same final result independent of the sequence in which the productions are executed. Therefore, for such a class of systems, the verification of correctness with the serializability criterion would not impose an extra burden on the programmer.
The need to improve other phases of production execution besides the match cycle is now evident [6]. In this paper we present a parallel architecture based on the serializability criterion of correctness. The architecture exploits the high read/write ratio of production systems, and the increased importance of associative search operations when global synchronization is eliminated, to yield a fast and efficient production system engine. The next section presents the architectural model and proves that its operation is correct. In section 4 we present a partitioning algorithm that performs the assignment of productions to processors. Section 5 describes the benchmarks used to study performance and introduces a new benchmark program. Section 6 presents comparative measurements with a synchronized architecture and an evaluation of the volume of activity in the bus and the size of associative memories.
Architectural Model
The parallel architecture presented in this paper stems from the realization that improvements restricted to the matching phase of the traditional match-select-act cycle of Production Systems (PS) fail to produce significant speedup. Even machines that allow concurrent execution of the acting and matching phases, while maintaining the global production selection, yield limited improvements in speed. The architecture proposed here allows parallel firing of productions allocated to distinct processors. Within a processor, activities related to matching, acting and selecting are concurrent. Thus the next instantiation to be fired may be selected even before the Rete network updates due to a previous production firing are completed.
Such aggressive parallelism is possible because the concept of a match-select-act cycle is eliminated. The principle of firing the most recent and specific instantiation is replaced by an approximation of it: only instantiations that are known at the time of the selection are considered. We call this a partially informed selection mechanism. The use of associative memories allows for quick elimination of instantiations that are no longer fireable. We also replace the restrictive commutativity criterion by the serializability criterion of correctness. The use of serializability reduces the number of situations in which synchronization is necessary, increasing the amount of parallelism available.
On surveying measurements published by other authors [14, 33], we found that the ratio of reading to writing operations in the benchmarks studied is between 100 and 1000. We also found that in complex benchmarks that bear more similarity with "real life" problems, this ratio tends to be higher than in "toy problems". This is primarily because productions have a larger number of antecedents than consequents in such problems [4]. Our observation motivates an architecture based on a broadcasting network over which only writing operations occur. Such an architectural model imposes limits on the number of processors used. However, two characteristics of PS make them compatible with an architecture with a moderate number of processors: the amount of inter-production parallelism is limited and, as a PS grows, the size of the database grows much faster than its production set. Section 3.1 presents basic definitions that set the environment for the processing model. Section 3.2 introduces the architectural organization and expands on the processor model, conflict set management, and processor operation. Section 3.3 presents a theorem that demonstrates that the results produced by the processing model are correct according to the serializability criterion of correctness.
Basic Definitions
A Production $R_i$ consists of a set of antecedents $A(R_i)$ and a set of consequents $C(R_i)$: the antecedents specify the conditions upon which the production can be fired; the consequents specify the actions performed when the production is fired.
Definition 1 The database manipulated by a Production System consists of a set of assertions. Each assertion is represented by a Working Memory Element (WME), denoted by $W_k$. A WME consists of a class name and a set of attribute-value pairs. The class name and the set of attribute names of a WME together characterize its type, $T[W_k]$.
Definition 2 Each production antecedent specifies a type of WME and a set of values for its attribute-value pairs. A WME $W_k$ is tested by an antecedent if it has the specified type. An antecedent is matched by a WME if the WME has the type specified and all the values in the antecedent match the ones in the WME.
Definition 3 If an antecedent of a production $R_i$ tests WMEs of type $T[W_k]$, then we say that $W_k$ is tested by the production $R_i$; this is denoted by $W_k \in A(R_i)$.
Definition 4 A non-negated antecedent tests for the presence of a matching WME in the memory. A negated antecedent tests for the absence of any matching WME in the memory. A production $R_i$ is said to be fireable if all its non-negated antecedents are matched and none of its negated antecedents are matched.
The consequent of a production can specify three kinds of actions that modify WMEs: addition, deletion, or modification.
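Definitions 1 to 4 can be made concrete with a minimal sketch. The class and function names (`WME`, `Antecedent`, `fireable`) are assumptions made for this illustration, and the sketch deliberately ignores variable bindings shared between antecedents, which a real matcher such as Rete has to keep consistent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WME:
    class_name: str
    attrs: tuple  # tuple of (attribute name, value) pairs

    @property
    def type(self):
        # Definition 1: the class name plus the set of attribute names.
        return (self.class_name, frozenset(name for name, _ in self.attrs))


@dataclass(frozen=True)
class Antecedent:
    wme_type: tuple        # same shape as WME.type
    values: tuple          # (attribute name, value) pairs that must match
    negated: bool = False

    def tests(self, wme):
        # Definition 2: a WME is tested if it has the specified type.
        return wme.type == self.wme_type

    def matched_by(self, wme):
        # Definition 2: matched if the type and all specified values agree.
        have = dict(wme.attrs)
        return self.tests(wme) and all(have.get(a) == v for a, v in self.values)


def fireable(antecedents, memory):
    """Definition 4: every non-negated antecedent is matched by some WME
    and no negated antecedent is matched by any WME."""
    for ant in antecedents:
        present = any(ant.matched_by(w) for w in memory)
        if present == ant.negated:
            return False
    return True


block = ("block", frozenset({"color", "on"}))
memory = [WME("block", (("color", "red"), ("on", "table")))]
rule = [Antecedent(block, (("color", "red"),)),
        Antecedent(block, (("color", "blue"),), negated=True)]
print(fireable(rule, memory))  # True: a red block exists and no blue block does
```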
Definition 5 A WME $W_k$ is modifiable by the consequents of a production $R_i$ iff the firing of $R_i$ adds (deletes) any WME of type $T[W_k]$ to (from) the Working Memory. This is denoted by $W_k \in C(R_i)$.

In the processing model discussed in section 3.2 some productions fire locally while others need to change WMEs that are stored in the local memory of remote processors. The following definitions describe important situations that appear in the execution of the model.
Definition 8 A WME $W_k$ is local to a processor $P_i$ iff $W_k$ is stored in the local memory of $P_i$; $W_k$ is not stored in the local memory of any other processor $P_j$; and there is no production allocated to a processor other than $P_i$ that changes $W_k$.
Definition 9 A WME $W_k$ is pseudo-local to a processor $P_i$ iff $W_k$ is stored in the local memory of $P_i$; $W_k$ is not stored in the local memory of any other processor $P_j$; and there is at least one production allocated to $P_j \neq P_i$ that changes $W_k$. We say that $P_j$ shares $W_k$.
For example, a WME that is written by many processors and read by only one processor is pseudo-local for the processor that reads it; it is a shared WME for all processors that write it. A processor does not store shared WMEs in its local memory.
Definition 10 A production $R_n$ fires locally in a processor $P_i$ iff $\forall W_k \in C(R_n)$, $W_k$ is local or pseudo-local to $P_i$.
Consequence 2 A production that does not fire locally is said to be a global production. Such a production must propagate actions to remote processors.
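The distinction between productions that fire locally and global productions can be sketched as follows. The sketch works at the level of WME types and assumes that each processor stores exactly the types tested by its productions; the function names and the example partition are hypothetical and serve only to illustrate Definitions 8 to 10 and Consequence 2.

```python
def stored_types(partition, tests):
    """WME types held by each processor: those tested by its productions."""
    stored = {}
    for proc, rules in partition.items():
        stored[proc] = set()
        for rule in rules:
            stored[proc] |= tests[rule]
    return stored


def fires_locally(rule, proc, stored, writes):
    """Definition 10: every written type is local or pseudo-local to proc,
    i.e. stored in proc and in no other processor."""
    for wme_type in writes[rule]:
        elsewhere = any(wme_type in types
                        for other, types in stored.items() if other != proc)
        if wme_type not in stored[proc] or elsewhere:
            return False
    return True


def classify(partition, tests, writes):
    stored = stored_types(partition, tests)
    return {rule: "local" if fires_locally(rule, proc, stored, writes) else "global"
            for proc, rules in partition.items() for rule in rules}


# Hypothetical partition of three productions over two processors.
partition = {"P1": ["R1", "R2"], "P2": ["R3"]}
tests = {"R1": {"goal"}, "R2": {"block"}, "R3": {"block", "hand"}}
writes = {"R1": {"goal"}, "R2": {"block"}, "R3": {"hand"}}
print(classify(partition, tests, writes))
# {'R1': 'local', 'R2': 'global', 'R3': 'local'}
```

Here R2 is global because the type it writes is also tested, and therefore stored, in P2. Separating local productions further into INT and DNT requires the dependency analysis of Definition 13 and is not attempted in this sketch.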
Definition 11 A production $R_n$ enables a production $R_m$ iff $\exists W_k$ such that $S_{C(R_n)}[W_k] = S_{A(R_m)}[W_k]$. A production $R_n$ disables a production $R_m$ iff $\exists W_k$ such that $S_{C(R_n)}[W_k] \neq S_{A(R_m)}[W_k]$.
Definition 12 A production $R_n$ has an output conflict with a production $R_m$ iff $\exists W_k$ such that $S_{C(R_n)}[W_k] \neq S_{C(R_m)}[W_k]$.
Productions that can fire locally are classified as Independent of Network Transactions (INT) or Dependent on Network Transactions (DNT), according to their dependencies with other productions that belong to other processors. INT and DNT productions have to be mapped and processed differently for correct execution according to the serializability criterion. Productions are partitioned into disjoint sets with one set assigned to each processor. $R_n \in P_i$ indicates that production $R_n$ belongs to processor $P_i$.
The Working Memory is distributed among the processors in such a way that a processor stores in its local memory all and only the WMEs tested by its productions.
De nition 13 A production that can re locally is DNT if and only if at least one of the following conditions holds:
(i) two of the antecedents of the production are changed by the consequents of a single production allocated to another processor: one of these changes produces an enabling dependency and the other produces a disabling one;
(ii) the production has two conflicting writes with a production allocated to another processor;
(iii) the production has an output conflict and a disabling dependency with a production allocated to another processor.
At compile time, after the set of productions is partitioned among the processors, the set of antecedents and the set of consequents of each production are analyzed to determine whether the production is global, local INT, or local DNT. Checking whether a production is local DNT is a simple matter of checking whether any of the conditions of definition 13 holds.
Definition 14 A production $R_n$ is INT iff $R_n$ can fire locally and $R_n$ is not DNT.
An INT production can start firing at any time as long as its antecedents are satisfied. A DNT production $P_i$ only starts firing after all tokens generated by a production $P_j$, currently being fired by a remote processor, are broadcast in the network and consumed by the processor that fires $P_i$. This prevents the actions of $P_i$ and $P_j$ from being intermingled, thus avoiding non-serializable behavior.
System Overview
The architectural model proposed in this paper consists of a moderate number of processors interconnected through a broadcasting network. The set of productions is partitioned among these processors with each production assigned to exactly one processor. A processor reads data only from its local memory, i.e., no read operations are performed over the network. Due to the absence of network reads and the low frequency of network writes, a simple bus should be adequate as the broadcasting system. This conclusion is supported by detailed experimental results showing the bus not to be a bottleneck even for a twenty-processor system. A number of associative memories implement a system of lookaside tables to allow parallel operations within each processor. This scheme does not allow parallel production firing within a processor, but allows the match-select-act phases of a PS to overlap. A snooping directory isolates the activities in remote processors from the activities in a local processor, and interrupts a local operation only when pieces of data that affect it are broadcast over the network.
The parallel architecture is formed by identical processors connected via a Broadcasting Interconnection Network (BIN), as shown in Figure 1. At start-up the I/O processor (I/OP) loads the productions on all processors. System level I/O and user interface are also handled through the I/OP. The main components of each processor are the Snooping Directory (SD), the Matching Engine (ME), the Production Controller (PC) and the Instantiation Controller (IC). Whenever a processor $P_i$ needs to broadcast a change to a WME that is stored in other processors' local memories, $P_i$ creates a token to broadcast in BIN. The Snooping Directory is an associative memory that identifies whether a token broadcast on BIN conveys an action relevant to the local processor. Relevant tokens are kept in a Broadcasting Network Buffer (BNB) until the IC and the ME are able to process them. The Matching Engine is a Rete-based matcher that implements a state-saving algorithm. The IC uses specialized memory structures to maintain and rapidly update the list of fireable instantiations. To perform this task, it has to monitor the outputs of ME as well as the firing of local (through PC) and global (through SD) productions. One of the memories controlled by IC is the Fireable Instantiation Memory (FIM) that keeps a list of all the production instantiations that are enabled to fire. The Production Controller (PC) selects an instantiation to be fired from the list maintained by IC, and, whenever necessary, synchronizes the production firing with BIN operations to guarantee that production firings appear to be atomic.
Productions are divided in three categories: local INT, local DNT, and global. The firing of a local INT production does not require BIN ownership because all its actions modify local WMEs only. Therefore, upon selecting an INT production, the PC immediately propagates its actions to ME and IC. To avoid interleaving of actions belonging to distinct productions, all tokens broadcast in BIN during local production firing are buffered in BNB. These tokens are processed as soon as the local firing finishes. When a local DNT production is selected, its execution has to wait until the BIN changes ownership, which is an indication that the firing of a global production has been concluded. The local DNT production is then fired in the same fashion as a local INT. A global production modifies shared WMEs, i.e., WMEs that belong to the antecedents of productions assigned to other processors. Thus, these changes need to be broadcast to all processors. When a global production is selected, PC acquires access to the BIN, processes all outstanding changes in the BNB, and, if the selected production is still fireable, proceeds to broadcast tokens with changes to shared WMEs. The BIN ownership is not released until all actions that change shared WMEs are broadcast. After releasing the BIN, PC prevents any incoming token from proceeding to local processing. These tokens are buffered in BNB and processed locally after the local execution of the selected production is complete. This avoids write interleaving in the local memories and guarantees an atomic operation for production firing within a processor.
The main steps in the machine operation are presented below in an algorithmic form. The steps of the algorithm are performed by different structures of the processing element.

enable local execution of incoming tokens

Note that no production is fired while there are outstanding tokens in BNB. The selection of a fireable instantiation in step 2 of PRODUCTION-FIRING is done according to the "pseudo-recency" criterion: the most recent instantiation in FIM is selected. This is not a true recency criterion because ME may still be processing a previous token, and thus the instantiations that it will produce are not in FIM yet.
The test in step 7 is necessary because between the time the BIN was requested and the time its ownership is acquired, incoming tokens might have changed the status of the production selected to fire. If this occurs, the firing of the selected production is aborted. Steps 12-14 are executed for productions that are dependent on network transactions, as defined in section 3.1. If such productions were to start firing while a remote processor is in the middle of a production execution, the intermingling of actions could result in non-serializable behavior. Notice that the BIN is released in step 10, before changes to local memory take place. To guarantee that no token is processed before the local changes are executed, buffering of tokens in BNB in step 15 is activated immediately upon releasing the BIN.
The architectural model presented in this section bears some similarity with the systems proposed by Schmolze and Goel [37] and Ishida et al. [21]. In all three systems, each production is uniquely assigned to one processor and all WMEs tested by the production are stored locally. Contrary to the architecture presented in this paper, the systems proposed in [21] and [37] use a taxing synchronization mechanism and require each processor to keep a list of all dependencies that each production has with other processors. The bus-based architecture with snoopy mechanism presented in this paper substantially simplifies synchronization and avoids the potential for incorrect behaviour or deadlock. Similar synchronization mechanisms are nowadays employed for cache coherency in several commercial medium-scale multiprocessor systems [18].
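The firing protocol described in prose above can be approximated by the following single-threaded sketch. It is not the PRODUCTION-FIRING listing, and its structure does not follow the step numbers cited in the text. All names (`Bus`, `Processor`, `Production`, `fire`), the token format, and the convention that a token `("disable", name)` retracts an instantiation are assumptions made for this illustration; the Snooping Directory filter and the Rete matcher are omitted, and buffered tokens are drained lazily at the next firing attempt rather than as soon as the processor is idle.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Production:
    name: str
    kind: str                                    # "INT", "DNT" or "global"
    shared_actions: list = field(default_factory=list)
    local_actions: list = field(default_factory=list)


@dataclass
class Processor:
    bnb: deque = field(default_factory=deque)    # buffered incoming tokens
    memory: list = field(default_factory=list)   # actions applied, in order
    fireable: set = field(default_factory=set)   # names of fireable productions

    def drain_bnb(self):
        while self.bnb:
            token = self.bnb.popleft()
            self.memory.append(token)
            if token[0] == "disable":            # hypothetical token convention
                self.fireable.discard(token[1])


class Bus:
    def __init__(self):
        self.owner = None
        self.processors = []

    def broadcast(self, sender, tokens):
        for proc in self.processors:
            if proc is not sender:
                proc.bnb.extend(tokens)


def fire(proc, prod, bus):
    """Try to fire one selected production: 'fired', 'aborted' or 'wait'."""
    proc.drain_bnb()                             # no firing with outstanding tokens
    if prod.name not in proc.fireable:
        return "aborted"
    if prod.kind in ("DNT", "global") and bus.owner is not None:
        return "wait"                            # retry after ownership changes
    if prod.kind == "global":
        bus.owner = proc                         # acquire the BIN
        proc.drain_bnb()                         # tokens that arrived meanwhile
        if prod.name not in proc.fireable:       # status may have changed
            bus.owner = None
            return "aborted"
        bus.broadcast(proc, prod.shared_actions)
        bus.owner = None                         # release before local changes
    proc.memory.extend(prod.local_actions)       # incoming tokens stay in the BNB
    proc.fireable.discard(prod.name)
    return "fired"


bus = Bus()
p1 = Processor(fireable={"G"})
p2 = Processor(fireable={"D"})
bus.processors = [p1, p2]
g = Production("G", "global", shared_actions=[("disable", "D")])
print(fire(p1, g, bus))                          # fired
print(fire(p2, Production("D", "INT"), bus))     # aborted: D was disabled by G
```

Because the sketch is sequential, the second call to `drain_bnb` never finds new tokens; in a concurrent setting it is the point at which tokens received while waiting for the BIN would be consumed.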
Detailed Processor Model
The processor architecture is detailed in Figure 2. The Instantiation Firing Engine (IFE) implements the outgoing interface with the Broadcasting Interconnection Network (BIN) and synchronizes internal activities. The IFE selects an instantiation to be fired among the ones stored in the Fireable Instantiation Memory (FIM). If the production selected to fire is global, the IFE places a request for ownership of the BIN. Upon receiving BIN ownership, IFE waits until all outstanding tokens stored in BNB are processed. If the selected instantiation becomes unfireable due to such processing, IFE has to abandon it and select a new instantiation. Otherwise IFE broadcasts tokens with changes to the shared WMEs, releases the BIN, and executes the local actions. The Snooping Directory (SD), along with the Broadcasting Network Buffer (BNB), implements the incoming network interface. The Snooping Directory is an associative memory that contains all WME types that belong to the antecedent sets of the productions assigned to the processing element. BNB is used to store tokens broadcast on BIN and captured by SD during the local firing of a production, or during the execution of local actions of a global production. The tokens stored in BNB are processed as soon as the firing of the current production finishes. In the rare situation in which BNB is full, a halt signal is issued to freeze the activity on BIN. When the halt signal is reset, the activity in the bus resumes: the same processor that had BIN ownership continues to broadcast tokens as if nothing had happened.
Whether a WME change originates locally or is captured from BIN, it needs to be forwarded to the Rete network and to the Fireable Instantiation Control (FIC). Like the original Rete network, the one used in this architecture has α- and β-memories. To avoid the high cost of waiting for the removal of a WME, which was pointed out by Miranker [31], negated antecedents are stored in both β-memories and in the fireable instantiations produced for the conflict set. The presence of the negated conditions in this representation allows the quick removal of non-fireable instantiations when a new token is processed. There is a possibility that a WME change previously processed by FIC and not yet processed by Rete disables an instantiation freshly generated by Rete. To avoid a possibly non-serializable behavior, before adding a new instantiation to FIM, FIC checks it against the Pending Matching Memory (PMM), which stores all tokens still to be processed by Rete. The deletion of an instantiation from FIM is also performed by FIC. The operation of FIM, AFIM, PMM and FIC is explained in greater detail in section 3.2.2.
Conflict Set Management
The Fireable Instantiation Control (FIC) uses the Antecedents of Fireable Instantiation Memory (AFIM) to maintain a list of all enabled instantiations in the Fireable Instantiation Memory (FIM). AFIM and FIM are fully associative memories with capability to store don't cares in some of their cells. The fields in each line of FIM and AFIM are shown in Figure 3. FIC maintains an internal timer that is used to time stamp each instantiation added to FIM. Each line of AFIM stores either a WME that is the antecedent of a fireable instantiation, or an α-test that specifies a negated antecedent of an instantiation. Its fields are:
Presence - indicates whether the AFIM line is occupied. It is used to manage the space in the memory.
Negated - indicates whether this line stores a WME or a negated antecedent.
Type - stores the WME type.
Bindings - contains the values stored in each attribute-value pair of the WME. Notice that the name of the attribute does not need to be stored. Symbolic names are translated into integer values at compile time.
α-test - is used only for negated antecedents: specifies the α-test to be performed to verify a production antecedent.
Instantiation - indicates which fireable instantiations have this antecedent.
The maximum number of attribute-value pairs in a single WME is limited by the size of the field Bindings in AFIM. A situation in which a WME has more attribute-value pairs than this limit is handled at compile time by splitting this WME into different WMEs with subsets of attribute-value pairs and performing the corresponding changes in the entire source code. Notice that because AFIM stores antecedents of fireable instantiations, most of the variables are bound; therefore the bindings field stores mostly constants. For easy handling of unbound variables, which match any value, the bindings field of AFIM is a ternary memory. Besides the values 0 and 1, it can also store a "don't care" value X. Such a memory might be implemented using two bits per cell, or using actual ternary logic in VLSI. One example of the latter is the Trit Memory developed by Wade [43]. One alternative to implement a non-bound value is to add a tag bit to bindings that indicates whether the value is bound or not. The advantage of this representation is that there is only one extra bit per word.

Each line in FIM stores one fireable instantiation, with the following fields:
Presence - indicates whether the line is occupied.
Fireable - indicates whether the instantiation stored in the line is still fireable.
PM Address - contains a pointer to the Production Memory indicating where the production actions are stored.
Time Tag - records the time at which the instantiation became fireable. It is used to implement a pseudo-recency criterion to select an instantiation to be fired.
The third piece of memory managed by FIC is a fully associative memory called Pending Matching Memory (PMM). When a token is placed in the input nodes of the Rete network, it is also stored in PMM. The token is removed from PMM when the Rete network produces a signal indicating that all changes to the conflict set originated by that token have been processed. Upon receiving a new fireable instantiation from Rete, FIC associatively searches PMM. FIC has to perform an independent search for each antecedent of the new instantiation. If any line of PMM indicates the deletion (addition) of a WME that matches a non-negated (negated) condition of the instantiation, the new instantiation is ignored. If no such line is found in PMM, FIC records the new instantiation in one line in FIM and stores each one of its antecedents in a separate line in AFIM. Figure 4 shows the organization of PMM with four fields:
Presence - indicates whether there is a WME stored in the line.
Sign - indicates whether this WME has been added to or deleted from the working memory.
Type - stores the type of WME.
Bindings - records the bindings of the WME.
Figure 4: Organization of the Pending Matching Memory (PMM), with the fields Presence, Sign, Type, and Bindings.

An instantiation is only removed from FIM after an incremental garbage collector removes the corresponding antecedents from AFIM.
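The PMM check applied to a freshly generated instantiation can be sketched in software as follows. The hardware performs each search associatively; the loops below only model the outcome. The tuple layouts, the use of `None` for the "don't care" value X, and the names `bindings_match` and `blocked_by_pmm` are assumptions made for this illustration.

```python
X = None  # stands in for the ternary "don't care" value


def bindings_match(pattern, values):
    """A pattern cell matches if it is a don't care or equals the stored value."""
    return len(pattern) == len(values) and all(
        p is X or p == v for p, v in zip(pattern, values))


def blocked_by_pmm(antecedents, pmm):
    """True if a pending token invalidates the new instantiation.

    antecedents: list of (negated, wme_type, pattern)
    pmm:         list of (sign, wme_type, values), sign is '+' (add) or '-' (delete)
    """
    for negated, a_type, pattern in antecedents:
        # A pending deletion kills a non-negated condition;
        # a pending addition kills a negated one.
        bad_sign = "+" if negated else "-"
        for sign, wme_type, values in pmm:
            if (sign == bad_sign and wme_type == a_type
                    and bindings_match(pattern, values)):
                return True
    return False


new_instantiation = [(False, "block", ("red", X)),   # needs a red block
                     (True, "hand", ("full",))]      # needs no full hand
pending = [("-", "block", ("red", "table"))]         # red block being deleted
print(blocked_by_pmm(new_instantiation, pending))    # True: instantiation ignored
```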
During the execution of a token, FIC performs three actions in parallel: it sends the token to the Rete network input, adds the token to PMM, and updates FIM and AFIM. To update AFIM and FIM, FIC first executes an associative search in AFIM for entries with the same WME present in the token, but with opposite sign. For each matching entry in AFIM, FIC marks the corresponding instantiation in FIM as unfireable. Finally FIC resets the presence bit for these entries in AFIM. This process leaves "garbage" in FIM and AFIM, consisting of all the non-fireable instantiations still present in FIM plus the antecedents of these instantiations in AFIM.
FIC has an Incremental Garbage Collector that searches FIM for an instantiation $I_k$ that is non-fireable. FIC performs an associative search in AFIM and removes all antecedents of $I_k$, and then eliminates $I_k$ from FIM. To guarantee the consistency of FIM and AFIM, the garbage collection is always performed as an atomic operation. For efficiency, the position in FIM in which the last garbage collection was executed is kept internally in FIC, and is used as the starting point of the next search. If FIM and AFIM are not full, garbage collection is performed at least once between two instantiation additions. Whenever FIM or AFIM are full, extra garbage collection is executed to free space. This solution trades memory space for speed: a WME that is tested by antecedents of many instantiations is stored many times in AFIM.
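The interplay between token processing, pseudo-recency selection, and incremental garbage collection can be sketched as below. This is a software model under simplifying assumptions: the class `ConflictSet` and its methods are hypothetical names, negated antecedents are compared by equality rather than by a stored test, and the rotating start position of the collector is omitted.

```python
class ConflictSet:
    def __init__(self):
        self.fim = {}     # instantiation id -> {"fireable": bool, "time": int}
        self.afim = []    # one entry per antecedent of each instantiation
        self.clock = 0    # internal timer used for the time tag

    def add_instantiation(self, inst, antecedents):
        """antecedents: list of (negated, wme) pairs."""
        self.clock += 1
        self.fim[inst] = {"fireable": True, "time": self.clock}
        for negated, wme in antecedents:
            self.afim.append({"present": True, "negated": negated,
                              "wme": wme, "inst": inst})

    def process_token(self, sign, wme):
        """sign '-' invalidates non-negated entries, '+' invalidates negated ones."""
        for entry in self.afim:
            if (entry["present"] and entry["wme"] == wme
                    and entry["negated"] == (sign == "+")):
                self.fim[entry["inst"]]["fireable"] = False
                entry["present"] = False

    def select(self):
        """Pseudo-recency: most recent instantiation that is still fireable."""
        live = [(line["time"], inst) for inst, line in self.fim.items()
                if line["fireable"]]
        return max(live)[1] if live else None

    def collect_one(self):
        """Remove one non-fireable instantiation and all its AFIM entries."""
        dead = next((i for i, line in self.fim.items()
                     if not line["fireable"]), None)
        if dead is not None:
            self.afim = [e for e in self.afim if e["inst"] != dead]
            del self.fim[dead]
        return dead


cs = ConflictSet()
cs.add_instantiation("I1", [(False, ("block", "red"))])
cs.add_instantiation("I2", [(False, ("block", "red")), (True, ("block", "blue"))])
print(cs.select())                        # I2, the most recent
cs.process_token("+", ("block", "blue"))  # a blue block appears
print(cs.select())                        # I1, since I2 is no longer fireable
```

Marking an instantiation as unfireable is a single pass over AFIM, while reclaiming its lines is deferred to `collect_one`, which mirrors the separation between fast invalidation and background collection described above.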
Broadcasting Interconnection Network Arbitration
Access arbitration in a broadcasting network is a well-studied problem. In this machine we adopt the scheme used in the first prototype of the Alpha architecture by DEC [42]. During startup each processor is assigned an arbitrary priority number from 0 to N. N is the highest priority and 0 is the lowest. When a processor requests the network, it uses its priority. The requester with highest priority is the winner and is granted access to the network. The winner has possession of the network as long as it needs to write all consequents of one production. After releasing the network, the winner sets its own priority to zero. All processors that had a priority number less than the winner increment their priority number by one, regardless of whether they made a request.
This scheme works as a round robin arbitration if all processors are requesting the network at the same time. If fewer processors are requesting the network, this mechanism creates the illusion that only these active processors are present in the machine.
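The priority rotation can be checked with a few lines of code. The function name `arbitrate` and the dictionary representation of priorities are assumptions made for this sketch; the update rule itself is the one described above.

```python
def arbitrate(priority, requesters):
    """Grant the network to the requester with the highest priority.

    priority:   dict mapping processor -> distinct priority number
    requesters: non-empty collection of processors requesting the network
    The priorities are updated in place, as done after the winner releases
    the network.
    """
    winner = max(requesters, key=lambda proc: priority[proc])
    won = priority[winner]
    for proc in priority:
        if priority[proc] < won:
            priority[proc] += 1      # applies whether or not proc requested
    priority[winner] = 0
    return winner


# All three processors request continuously: round-robin behaviour.
priority = {"A": 2, "B": 1, "C": 0}
print([arbitrate(priority, {"A", "B", "C"}) for _ in range(4)])
# ['A', 'B', 'C', 'A']

# Only B and C request: they alternate as if A were not in the machine.
priority = {"A": 2, "B": 1, "C": 0}
print([arbitrate(priority, {"B", "C"}) for _ in range(4)])
# ['B', 'C', 'B', 'C']
```

Since only priorities below that of the winner are incremented and the winner drops to zero, the priority numbers remain distinct after every grant.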
In section 3.2 we establish that broadcast writes need to be kept in a buffer while a processor is firing local productions. When this buffer overflows, a halt signal is issued by the processor. This signal stalls all network broadcasting activities, giving time for the overloaded processor to consume its tokens and alleviate its buffer load. When the stall signal is removed, the network continues its activity without any change in the ownership. To avoid a great impact on the speed of the machine, the buffer must be sufficiently large to avoid frequent stalling of the network.
Correctness of the Processing Model
This section investigates whether the machine proposed in section 3 correctly executes a production system. The correctness criterion used is serializability 36] and the condition of ownership is stated in axiom 1.
Axiom 1 A WME $W_k$ is stored in the local memory of a processor $P_i$ iff $W_k \in A(R_n)$ and $R_n \in P_i$.
Theorem 1 Given the parallel machine model presented in this document and the definitions of local DNT, local INT, and global productions, Axiom 1 is a necessary and sufficient condition of ownership to guarantee correct execution of a production system under the serializability criterion of correctness.
Proof:
First we prove that axiom 1 is necessary. For the sake of contradiction, suppose that the ownership condition stated in axiom 1 is not satisfied. Assume that there is a production $R_n \in P_i$ and a WME $W_k$ such that $W_k \in A(R_n)$ and $W_k$ is not stored in the local memory of $P_i$. Because reading operations are not allowed in the broadcasting network, $P_i$ cannot perform the matching of $R_n$. Therefore a production system cannot be executed in such a machine.
Thus, axiom 1 is necessary.
To prove that axiom 1 is sufficient, we must show that, in every possible circumstance, the results produced by this model could be obtained by a sequential execution of the productions. Therefore, we must analyze all situations in which parallel execution might occur and show that each one of them results in a serializable outcome. Because there is no parallel production firing within a processor, the following analysis is restricted to concurrent firing of productions allocated to distinct processors. Inter-processor parallelism occurs in two situations: among productions firing locally in distinct processors, and between a production being broadcast over the BIN and one (or more) firing locally. All situations described below involve two productions allocated to distinct processors being fired concurrently.
The fact that all antecedents and consequents are local indicates that the productions being fired in parallel are completely independent of productions allocated to other processors; therefore the same results produced by the parallel firing could be obtained by any sequential firing of the same productions.
Situation 2: A production $R_m \in P_j$ enables a production $R_n \in P_i$; $R_m$ and $R_n$ might have non-conflicting shared outputs; $R_m$ does not disable $R_n$; $R_n$ fires locally.
Since $R_n$ fires locally, all WMEs that are changed by both $R_n$ and $R_m$ are pseudo-local for $P_i$ and shared for $P_j$. Because those are non-conflicting outputs and $R_m$ enables $R_n$, parallelism occurs when $R_n$ starts firing after being enabled by an action of $R_m$ and before $R_m$ finishes broadcasting changes to the network. The firing of $R_n$ prevents the changes broadcast by $R_m$ from being processed locally until $R_n$ finishes. As long as the actions broadcast by $R_m$ are queued and processed after $R_n$ finishes, the result is the same as if $R_n$ had been fired after $R_m$ finished. Thus, it is serializable.
Situation 3: A production $R_m \in P_j$ disables a production $R_n \in P_i$; there are no enabling dependencies between $R_m$ and $R_n$; $R_m$ and $R_n$ might have non-conflicting shared outputs; $R_n$ fires locally.
The only possibility for the parallel firing of $R_m$ and $R_n$ is for $P_i$ to start firing $R_n$ before $P_j$ has broadcast any action that disables $R_n$. Even if $P_j$ has broadcast some of the shared non-conflicting outputs when $R_n$ starts firing, the effect is the same as firing $R_n$ before $R_m$.
Therefore, the result is serializable.
Situation 4: A production $R_n \in P_i$ changes a pseudo-local WME $W_k$ and a production $R_m \in P_j$ modifies $W_k$; $R_n$ fires locally.
Because $R_m$ modifies $W_k$, $R_m$ is a global production. It is necessary to analyze three different cases:
Case 1: $W_k$ is the only shared output between $R_n$ and $R_m$.
Notice that the (possibly) conflicting WME $W_k$ is exclusively stored in $P_i$. Therefore if $P_i$ disables the BIN before $P_j$ broadcasts changes to $W_k$, the result is the same as firing $R_n$ before $R_m$. If $P_i$ disables the BIN after changes to $W_k$ are broadcast, the result is equivalent to firing $R_n$ after $R_m$. In both cases it is serializable.
Case 2: $R_n$ and $R_m$ have more than one shared output, but no more than one of them is conflicting.
The concern with multiple shared outputs is that the actions of the local and the global production might be intermingled. This would happen if $P_i$ inhibited actions from the network after $P_j$ had broadcast some but not all actions of $R_m$. Since $R_m$ has only one action conflicting with $R_n$, the interruption of the remote firing will take place either before or after this conflicting action is broadcast. If the interruption occurs before the conflicting action is executed in $P_i$, the result is equivalent to $R_n$ firing before $R_m$. If it occurs after, the result is equivalent to $R_n$ firing after $R_m$. In either case this situation results in serializable behavior.
Case 3: $R_n$ and $R_m$ have more than one conflicting action.
In this case, if intermingled execution were allowed, non-serializable behavior would result. However, according to condition (ii) of definition 13, $R_n$ is DNT and therefore cannot start firing until the network changes ownership, indicating that the global production either has finished or has not started. This ensures the necessary synchronization and results in serializable behavior.
Situation 5: A production $R_n \in P_i$ is enabled and disabled by a production $R_m \in P_j$; $R_n$ fires locally.
In this situation, there would be non-serializable behavior if production $R_n$ were allowed to fire after $P_j$ had broadcast the action that enables $R_n$ and before the action that disables $R_n$ is broadcast. This situation does not occur because, according to condition (i) of definition 13, $R_n$ is DNT: it only starts firing when the network changes ownership.
Situation 6: A production $R_n \in P_i$ is enabled by a production $R_m \in P_j$; $R_n$ has one output conflict with $R_m$; $R_n$ and $R_m$ may or may not have shared non-conflicting outputs; and $R_n$ fires locally.
Parallelism occurs if $R_n$ starts firing in $P_i$ after the action that enables $R_n$ has been broadcast by $P_j$ and before $P_j$ finishes broadcasting the actions of $R_m$. If at that point the conflicting action has already been broadcast, the result will be equivalent to firing $R_n$ before $R_m$. If the conflicting action has not been broadcast, the result is equivalent to $R_m$ firing before $R_n$. Either way, the result is serializable.
Situation 7: A production $R_n \in P_i$ is disabled by a production $R_m \in P_j$; $R_n$ has one output conflict with $R_m$; $R_n$ and $R_m$ may or may not have shared non-conflicting writes; $R_n$ fires locally.
This situation could result in non-serializable behavior if $R_n$ were to start firing after $P_j$ broadcasts the conflicting action of $R_m$ and before the action that disables $R_n$ is broadcast. However, this cannot occur because, according to condition (iii) of definition 13, $R_n$ is DNT.
Situations 1 through 7 deal with possible dependencies involving two productions $R_m$ and $R_n$ allocated to distinct processors. The local firing of $R_n$ in all situations indicates that its consequents change only local or pseudo-local WMEs. Table 1 helps to verify that every possible combination of dependencies between two productions in this situation has been analyzed. In this table a "-" indicates no dependencies, "1" indicates one dependency, "1+" indicates one or more dependencies, "2+" indicates two or more dependencies, and "X" indicates zero or any number of dependencies. Table 1 has five columns: "Enabling" indicates the number of actions in $C(R_m)$ that enable $R_n$; "Disabling" indicates the number of actions in $C(R_m)$ that disable $R_n$; "Non-Conflicting Write" indicates the number of non-conflicting shared actions between $R_n$ and $R_m$; "Conflicting Write" indicates the number of conflicting shared actions between $R_n$ and $R_m$; and "Situation" indicates which of the situations analyzed in this proof covers each case. Every possible combination of dependencies between two productions is covered in table 1.
Table 1 columns: Enabling, Disabling, Non-Conflicting Write, Conflicting Write, Situation.
There is still the possibility that dependencies involving more than two productions create a situation in which the parallel model yields non-serializable behavior. The only situations in which this might occur are cycles of disablings, analyzed in situation 8. Therefore $W_k$ is a shared WME for $P_j$, $W_l$ is a shared WME for $P_i$, and neither $R_m$ nor $R_n$ can fire locally. The acquisition of the broadcasting network works as a synchronizing element preventing $R_n$ and $R_m$ from firing in parallel. The same reasoning can be extended to disabling cycles with any number of productions.
This concludes the proof. Since the results are serializable for any possible conflicting situation, we conclude that Axiom 1 is a sufficient condition of ownership and that the results produced by the proposed model are serializable.
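The serializability criterion used throughout the proof can be made concrete with a brute-force check, sketched below in Python. This is a simplification for illustration: productions are modeled as plain functions on a dictionary standing in for working memory, every production fires exactly once, and enabling and disabling are ignored. The names `is_serializable`, `r_n`, and `r_m` are hypothetical.

```python
from itertools import permutations


def is_serializable(initial, productions, observed):
    """True if `observed` equals the result of some sequential firing order."""
    for order in permutations(productions):
        state = dict(initial)
        for fire in order:
            state = fire(state)
        if state == observed:
            return True
    return False


# Two productions with two conflicting outputs, as in Case 3 of Situation 4.
def r_n(wm):
    return {**wm, "x": 1, "y": 1}


def r_m(wm):
    return {**wm, "x": 2, "y": 2}
```

An intermingled outcome such as `{"x": 1, "y": 2}` matches neither sequential order, which is the kind of result that the DNT restriction is meant to rule out.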
4 Production Partitioning Algorithm
The problem of partitioning a Production System into disjoint production sets, which are then mapped onto distinct processors, has been studied by a number of researchers. Most partitioning algorithms are designed with the goal of reducing enabling, disabling, and output dependencies among productions allocated to different processors [37]. Oflazer formulates partitioning as a minimization problem and concludes that the best-suited architecture for Production Systems has a small number of powerful processors [35]. Oflazer also indicates that only a limited improvement in PS speed can be obtained by an adequate assignment of productions to processors. Moldovan presents a detailed description of production dependencies and expresses the potential parallelism in a "parallelism matrix" and the cost of communication among productions in a "communication matrix" [32]. Xu and Hwang use a similar scheme with cost matrices to construct a simulated annealing optimization of the production partition problem [44].
Although certain basic principles are maintained in all partitioning schemes, partition algorithms are tailored to specific architectures. We are concerned with two kinds of relationships among productions: productions that share antecedents, and productions that have conflicting actions. Assigning productions with common antecedents to the same processor reduces memory duplication, while assigning productions with conflicting actions to the same processor prevents traffic on the bus. Previous partition algorithms were greatly influenced by enabling and disabling dependencies among productions [32, 35, 44]. Our experience with production systems shows that grouping productions with common antecedents is much more effective in reducing the communication cost. Moreover, in the production system programs that we examined, a production seldom creates a WME that was not tested in its antecedents. Therefore, productions that have a greater number of common antecedents are also most likely to have a greater number of enabling and disabling dependencies among them. Thus, our partition algorithm does not include these dependencies, but only shared antecedents and conflicting outputs.
We analyzed and experimented with several partitioning algorithms and found the following algorithm to be the most effective [4, 5]. The optimal partitioning of productions into disjoint sets is modeled as a minimum cut problem, which is NP-complete [12]. The polynomial-time approximate solution presented in this section has three goals: minimizing the duplication of working memory elements; reducing traffic on the bus; and balancing the amount of processing in each processor. In the architecture presented in section 3 these goals translate to minimizing the number of global productions and reducing the number of local DNT productions. As a consequence, the number of local INT productions is increased.
To represent the relationships among productions we define an undirected, fully connected graph $PRG = (P, E)$ called the Production Relationship Graph. Each vertex in $P$ represents one of the productions in the system, and each weighted edge in $E$ is a combined measure of the production relationships. PRG has a weight function $w : E \to \mathbb{Z}$.
Empirical studies with a parallel architecture simulator show that the main factor limiting further reduction is the time spent in the matching phase in the Rete network. Consequently, the load balancing must concentrate on the processing performed in the Rete network. Furthermore, most of the time in the Rete network is spent in $\beta$-node activities. Thus, the number of $\beta$-tests performed in the antecedents of a production is used as a measure of the workload associated with this production. To address the constraint of balancing the amount of processing among processors, we define the function $B : \{P_0, \ldots, P_{N-1}\} \to \mathbb{Z}^+$, which computes the number of beta tests that are expected to be performed by processor $P_i$:
$$B(P_i) = \sum_{j=0}^{N-1} \beta(R_j)\,\varphi_{ij}, \qquad (2)$$
where $\beta(R_j)$ is the number of beta tests performed for production $R_j$, and $\varphi_{ij}$ is 1 if $R_j$ is assigned to $P_i$, and 0 otherwise (footnote 6). $N$ is the total number of productions in the system.
Let $S_i$ denote the set of productions assigned to processor $P_i$. When the algorithm starts, all subsets $S_i$ are empty and all productions are in the set $S$. The fitness of placing production $R_i$ in set $S_k$ is measured by the value of the function $F(R_i, S_k)$. The value of the fitness function indicates how well the production represented by the vertex $R_i$ fits in the subset $S_k$. $F(R_i, S_k)$ computes a weighted sum of the connections between vertex $R_i$ and all other vertices in PRG. A strong connection with a vertex that has been assigned to a set other than $S_k$ reduces the fitness of $R_i$ to $S_k$, while a strong connection with a vertex already in $S_k$ increases the fitness.
A strong connection with a vertex that has not been assigned to any subset has an intermediate value because S k may be able to attract both vertices.
The strategy used in this partitioning algorithm consists of selecting the processor with the smallest number of estimated beta tests and then finding the production best fitted to this processor. The productions strongly related to other productions in PRG are the first ones to be assigned to processors. The algorithm ends when there are no more productions in $S$.
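The greedy strategy can be sketched in Python as follows. The text does not give the exact form of the fitness function, so the weights used here (+1 for a neighbor already in the candidate subset, +0.5 for an unassigned neighbor, -1 for a neighbor in another subset) are an assumption chosen only to respect the qualitative ordering described above. The helper names and the dictionary representation of the edge weights and beta-test estimates are likewise hypothetical.

```python
def edge_weight(w, a, b):
    """Weight of the PRG edge between a and b (0 if not listed)."""
    return w.get(frozenset((a, b)), 0)


def estimated_load(subset, beta):
    """B(P_i): estimated beta tests of the productions in one subset."""
    return sum(beta[r] for r in subset)


def fitness(r, k, subsets, unassigned, productions, w):
    """Assumed form of F(R_i, S_k); the coefficients are illustrative."""
    total = 0.0
    for other in productions:
        if other == r:
            continue
        weight = edge_weight(w, r, other)
        if other in subsets[k]:
            total += weight          # neighbor already in S_k
        elif other in unassigned:
            total += 0.5 * weight    # neighbor may still join S_k
        else:
            total -= weight          # neighbor placed elsewhere
    return total


def partition(productions, w, beta, num_processors):
    """Greedy assignment: least loaded processor, then best fitted production."""
    unassigned = set(productions)
    subsets = [set() for _ in range(num_processors)]
    while unassigned:
        k = min(range(num_processors),
                key=lambda i: estimated_load(subsets[i], beta))
        best = max(sorted(unassigned),
                   key=lambda r: fitness(r, k, subsets, unassigned,
                                         productions, w))
        subsets[k].add(best)
        unassigned.remove(best)
    return subsets
```

For four productions with strong A-B and C-D edges and equal beta-test estimates, this sketch places A and B on one processor and C and D on the other.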
PARTITION($S, E, w, N, B, F$)
    while $S \neq \emptyset$
        do $S_k \leftarrow S_k \cup \{R_i \mid R_i \in S$ and $B(P_k) = \min_k B(P_k)$ and $F(R_i, S_k) = \max_i F(R_i, S_k)\}$
           $S \leftarrow S - \{R_i\}$
Footnote 6: $\beta(R_j)$ is an estimate of the number of beta tests performed because of the presence of production $R_j$. It is measured in previous runs of the same production system.
5 Performance Evaluation
Performance evaluation can be accomplished through measurement, simulation, and analytic modeling [23]. Measurement consists of observing actual values for specified parameters in an existing system. Simulation consists of creating a model for the behavior of a system, writing a computer program that reproduces this behavior, feeding the simulator with an appropriate sample of the workload of the actual system, and computing selected parameters of interest. In analytic modeling a mathematical model of the system is created and its solution provides the performance evaluation [23]. In related work, we used an analytical model to investigate the effect of using multiple functional units to update the Rete network within each processor [7].
In this research we use an event-driven simulator to evaluate the speedup of the proposed architecture. The input of the simulator consists of production system programs written in OPS5 syntax. For syntax and lexical analysis, the tools yyacc and yylex were used (footnote 7).
Benchmarking
A well-known weakness of production system machine research is the lack of a comprehensive and broadly used set of benchmarks for performance evaluation. In the process of searching for benchmarks to evaluate this novel architecture, we contacted many researchers with the same problem: a new idea to be evaluated in need of a suitable set of benchmark programs. Most of the benchmarks obtained were toy programs with a small number of productions in which the researcher can only change the size of the database. A benchmark in which the number of productions and the database size can be changed independently would allow researchers to study various aspects of new architectures. Section 5.1.1 presents a new benchmark that has such characteristics. It is a modification of the well-known Traveling Salesperson Problem that we call a Contemporaneous TSP (CTSP) [8]. Another benchmark that we wrote is a solution to the "Confusion of Patents Problem". The following sections describe CTSP in detail and briefly present some other benchmarks used to test the architecture.
A Contemporaneous TSP
In this modified version of the TSP, the cities are grouped into "countries". The tour has to be constructed such that the salesperson enters each country only once. The location and borders of the countries must allow the construction of a tour observing this restriction. The problem is formally stated as follows:
An instance of CTSP is represented by $(K, C, c, \mu_c, \sigma_c, O, d)$. $K = \{C_1, C_2, \ldots, C_n\}$ is a "continent" formed by "countries". Each country $C_i = \{c_{i,(1)}, c_{i,(2)}, \ldots$ [...]
This formulation of TSP is called "contemporaneous" because it reflects some aspects of modern-day life. In the current global economy, travelpersons are likely to have greater needs than the traditional salesperson driving from town to town. Consider a music star on a worldwide tour carrying along a huge crew and sophisticated equipment: the singer will visit many different locations in each continent; the cost of flying back and forth between continents is much higher than that of movements within a continent and depends on the locations of departure and arrival. Other situations involving sophisticated traveling requirements include the planning of airline routes and national political campaigns in large countries such as the USA, Brazil, and India. Applications in which data locality allows the creation of clusters include insurance database management, the banking industry, a national health care information network, and a national criminal offense information network (footnote 8).
Footnote 8: In the 1994 "Brady Bill", the USA Congress mandated the construction of such a network for background verification for the purchase of firearms.
A Production System Solution for CTSP
The formulation presented above for the CTSP is generic enough to allow its application in many fields: there is no restriction on what the words continent, country, city, and distance might represent. To facilitate the construction of a Production System solution that is useful for testing new PS architectures, we used a simpler version of CTSP with the following restrictions:
The problem is symmetric, i.e., $d(c_i, c_j) = d(c_j, c_i)$ for any $i$ and $j$.
A continent is a two-dimensional Euclidean space.
A country is a contiguous, rectangular shape within this space.
The number of cities in each country follows a normal distribution with average $\mu_c$ and standard deviation $\sigma_c$.
The city locations are uniformly distributed within each country.
There is a common boundary between two countries that are consecutive in the ordering O.
Our PS solution for CTSP has a set of productions for each country and a set of productions for each country boundary. The data set is constructed in such a way that the distances among cities located within each country are stored in WMEs with different types. Given a country $C_i$, the country that precedes $C_i$ in the order $O$ is denoted $P(C_i)$; the country that succeeds $C_i$ in the order $O$ is denoted $S(C_i)$.
It is not necessary to store in the database the distance between every two cities in the continent. For a city $c_j$ in a country $C_i$, the only relevant distances are the distances to the cities within $C_i$, to the cities in $P(C_i)$, and to the cities in $S(C_i)$. The following list illustrates WMEs typically used in our solution to [...]
Our solution has seventeen local productions per country and twelve productions per country boundary. This organization allows the researcher to vary the number of productions by creating continents with different numbers of countries. The size of the database is determined by the number of countries and the average number of cities per country. The variance between the amounts of data processed by each cluster of productions is given by $\sigma_c$.
The heuristic used in the PS solution of the problem involves the computation of two extra locations for each country $C_i$: the geometric centers of the borders with $P(C_i)$ and with $S(C_i)$. Because we impose the restriction that countries have rectangular shapes in a two-dimensional Euclidean space, the border between two subsequent countries in the tour is always a segment of a straight line. The border center $b(C_i, C_j)$ between countries $C_i$ and $C_j$ is the center of the line segment that forms the boundary. The heuristic used to construct the internal tour in a country $C_i$ is described below:
The first city $c_k$ in the internal tour of a country $C_i$ is the city with minimum distance $d(b(C_i, P(C_i)), c_k)$.
While the internal tour of country $C_i$ is not complete, select a city $c_l \in C_i$ such that $d(c_k, c_l) - d(c_l, b(C_i, S(C_i)))$ is minimum, where $c_k$ is the last city inserted in the tour.
Whenever the internal tours of two adjacent countries $C_i$ and $C_j$ are completed, the last city visited in $C_i$ is connected to the first city visited in $C_j$. The rationale of this heuristic is to add to the internal tour the cities that are close to the last city included in the tour and far from the border at which the internal tour shall end. There is a limited local optimization of the constructed tour. We developed a C program that allows researchers to specify continent maps and to experiment with different numbers of countries, $\mu_c$, and $\sigma_c$.
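The internal tour heuristic for a single country can be sketched in a few lines of Python. This is an illustration outside the production system itself: the function name `internal_tour` is hypothetical, cities and border centers are assumed to be coordinate tuples, and the limited local optimization mentioned above is not included.

```python
import math


def internal_tour(cities, entry_border, exit_border):
    """Build the internal tour of one country with the heuristic above.

    cities:       list of (x, y) tuples inside the country.
    entry_border: center of the border with the preceding country.
    exit_border:  center of the border with the succeeding country.
    """
    remaining = list(cities)
    # First city: the one closest to the entry border center.
    current = min(remaining, key=lambda c: math.dist(entry_border, c))
    tour = [current]
    remaining.remove(current)
    while remaining:
        # Prefer cities close to the last one inserted
        # and far from the exit border center.
        nxt = min(remaining,
                  key=lambda c: math.dist(current, c)
                  - math.dist(c, exit_border))
        tour.append(nxt)
        remaining.remove(nxt)
        current = nxt
    return tour
```

Subtracting the distance to the exit border postpones the cities nearest to it, so the tour tends to finish where the next country begins.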
This simplified CTSP offers many advantages for production system benchmarking: the number of productions in the program can be varied by changing the number of countries; the ratio of global to local data is controlled by the average number of cities in each country; the balance in the size of local data clusters is specified by $\sigma_c$; and the specification of the continent "map" is very simple, making it easy for a researcher to generate new instances of the benchmark. This benchmarking facility is available through anonymous ftp to: pine.ece.utexas.edu in /a/pine/home/pine/ftp/pub/parprosys. In the measurements presented in section 6, instances of the CTSP appear as south, south2, moun, and moun2. In moun and south a single set of productions performs the optimization in all country borders, while in south2 and moun2 a specialized set of productions is used for the optimization of each country border. Table 2 shows the relation between the number of countries in each benchmark $C$, the average number of cities in each country $\mu_c$, and important parameters in the benchmarks generated by the facility.
Confusion of Patents Problem
We constructed a solution for the formulation of the Confusion of Patents Problem presented in [10, 22]. The problem presents five patents, five inventors, five cities, and ten constraints. Using these constraints we must decide who invented what and where. In our solution, all 125 possible combinations and 10 constraints are present in the initial database; 67 productions use the constraints to eliminate combinations that are not possible; 19 productions select the right combinations and print the solution.
Because this solution has only four different types of WMEs, most of the productions either change or test the same kinds of WME. As a consequence, productions have strong interdependencies, resulting in a production system poorly suited for clustering. Even in a machine with a moderate number of processors, most of the actions need to be broadcast on the network. The main source of parallelism is the concurrent execution of different portions of the Rete network. Performance measures for this solution of the confusion of patents problem are reported under the name patents.
The Hotel Operation Problem
Originally written by Steve Kuo at the University of Southern California, hotel is a production system that models the operation of a hotel. It is a relatively large and varied production system (80 productions, 65 WME types) with 17 non-exclusive contexts. Because each production in hotel is related to the activities that actually take place in a hotel, the amount of speedup obtained depends on the balance of work among these activities. For example, if a hotel is specified with a large number of tables in the restaurant and very few rooms, the productions that take care of the restaurant tables will have a much larger load than the productions that clean up the rooms. This work imbalance is transferred to parallel architectures that partition the program at the production level.
The Game of Life
This is an implementation of Conway's game of life, as constructed by Anurag Acharya. After our modifications, life has forty productions. Twenty-five of these productions are in the context that computes the value of each cell for the next generation and can potentially be fired in parallel. The other fifteen productions are used for sequencing and printing and can be only slightly accelerated by Rete network parallelism.
The Line Labeling Problem
Different versions of the line labeling problem (Waltz and Toru-Waltz) have been used for performance evaluation [27, 28, 34, 37]. Our version was originally written by Toru Ishida (Columbia Univ.), and successively modified by Dan Neiman (Univ. of Massachusetts), Anurag Acharya (Carnegie-Mellon Univ.), and José Amaral (Univ. of Texas). The current version has two non-overlapping stages of execution, each one with four productions. Because the system is partitioned at the production level, the amount of parallelism is limited to fourfold. Such a low limit in speedup occurs because this is a simple "toy" problem with only ten productions, not adequate for the proposed architecture. The line labeling problem is identified as waltz2 in our set of benchmarks.
Table 3 shows static measures (number of productions, number of distinct WME types, average number of antecedents per production, average number of consequents per production) for the benchmarks used to estimate performance in the multiple functional unit Rete network. south and south2 are CTSPs with four countries and ten cities per country; moun and moun2 are CTSPs with ten countries and 15 cities per country; life, patents, waltz2, and hotel are the benchmarks discussed in section 5.
6 Performance Measurements
The benchmarks described in section 5.1 were used to evaluate the performance of the proposed architecture. First we measure the amount of speedup over an architecture with global synchronization and without overlapping between matching and selecting-acting within a processor. Then we investigate the effectiveness of the use of associative memories. Finally we obtain estimates for the size of associative memories needed for each one of the benchmarks and for the level of activity on the bus.
Notice that this section measures the performance improvement obtained from two distinct ideas: section 6.1 measures the improvement due solely to the elimination of over-synchronization, and section 6.2 measures the improvement due solely to the use of associative memories. However, because there is some interaction between these improvements, their product is only a rough estimate of the combined benefit of these ideas.
Parallel Firing Speedup
To measure the advantages of parallel production firing and of the internal parallelism in each processor, we define a globally synchronized architecture that is very similar to the one proposed in this paper, except that it performs global conflict set resolution to implement the OPS5 recency strategy. This synchronized architecture is also very similar to the one suggested by Gupta, Forgy, and Newell [15]. In this architecture, each processor reports the best local instantiation to be fired to the bus controller. The bus controller selects the instantiation whose time tag indicates it to be the latest one to become fireable. This added decision capability in the bus controller implements the recency strategy to resolve the conflict set. The processor selected to fire a production broadcasts all changes on the bus. A processor only selects a new candidate to fire when the matching in the Rete network is complete. The bus controller waits until all processors report a new candidate to fire. This mechanism reproduces the global synchronization and conflict set generation/resolution present in many of the previously proposed architectures. In order to have a fair comparison, we assumed that the synchronized architecture uses an associative memory to store and resolve the local conflict sets, and that the bus controller chooses the "winner" in one time step.
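The selection step performed by the bus controller in the synchronized architecture can be sketched as below. The function name `select_winner` and the representation of each report as a `(time_tag, production_name)` pair, with `None` for a processor that has no fireable instantiation, are assumptions made for illustration; ties between equal time tags are resolved arbitrarily in this sketch.

```python
def select_winner(reports):
    """Pick the processor whose candidate became fireable most recently.

    reports: dict mapping processor id to (time_tag, production_name),
             or to None when that processor has nothing to fire.
    The caller is expected to have waited for every processor to report,
    which is the global synchronization point.
    """
    live = {p: r for p, r in reports.items() if r is not None}
    if not live:
        return None
    # Recency strategy: the largest time tag wins.
    return max(live, key=lambda p: live[p][0])
```

Because the controller cannot decide until every processor has reported, the slowest matcher sets the pace of each firing, which is the over-synchronization that the proposed architecture removes.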
Since the synchronized architecture also uses associative memory to store and search the local conflict sets, the comparisons of Figures 5 and 6 do not reflect the advantages of using such memories in our architecture. We delay this analysis until Section 6.2.
Figure 5 shows the speedup curves for the benchmarks life, hotel, patents, and waltz2. In this and the next section, we will observe a significant difference in performance and memory requirements between this group of benchmarks and the ones based on CTSP (south, south2, moun, and moun2). This is due to a gap in complexity between the two groups of benchmarks: the CTSP programs have higher data locality, a larger number of productions, and larger data sets. Due to these characteristics, CTSP programs reflect more closely the characteristics encountered in production system applications in industry. The curve names starting with "s" indicate measures in the synchronized architecture; the curve names starting with "a" indicate measures in the architecture proposed in this paper. All speedups are measured against a single-processor synchronized architecture. For the benchmarks presented in Figure 5, there is not much distinction between the two architectures when they have a single processor. This indicates that the parallelism between the matching phase and the selecting/execution phase does not result in much speed improvement for these benchmarks. Yet, even with these "toy problems", the parallel firing of productions and the elimination of the global synchronization provide significant speedup.
Figure 6 shows the comparative performance for the CTSP benchmarks. Here, significant speedup is observed over the synchronized architecture even for the single-processor configuration. This measures the amount of speed that is gained due to the parallelism between matching and selecting/firing.
The apparent superlinear speedup in the curves of Figure 6 reflects the fact that these curves show the combined speedup due to two different factors: intra- and interprocessor parallelism. To obtain the speedup due exclusively to parallel production firing, the reader should divide the values in the "a" curves by the values in the same curve for a single-processor machine. These results confirm our initial conjecture that the elimination of the global synchronization in a production system allows the construction of machines with significant speedup.
Another way to compare the two architectures is to measure how much speedup the proposed architecture has over the synchronized one with the same number of processors. Measurements were made for machines with one through twenty processors. Table 4 shows the mean and the variance for the speedups obtained with each configuration. It also shows the maximum and minimum speedup obtained with any number of processors. Because our architecture implements "eager" production firing without generating a global conflict set, in rare cases some extra production execution might cause it to be slower than the synchronized architecture (see the minimum speedup for patents). The gap in performance between the CTSP and the other benchmarks in Table 4 indicates that the proposed architecture is very effective at extracting parallelism from PS programs that possess data locality.
Table 4: Speedup over synchronized architecture using the same number of processors.
Effectiveness of Associative Memories
An associative memory or content addressable memory (CAM) is a storage device that retrieves data upon receiving a partial specification of its contents. We adopt Wade's terminology and call a traditional memory accessed by addresses a reference addressable memory (RAM) [43]. CAMs are most beneficial for systems in which storage devices are often searched for a cell with a given pattern. The best-known applications of the CAM mechanism are tag matching in a cache memory and data checking in a snooping cache or directory. When a CAM receives a request for a piece of data, it searches all positions of the memory and reports the contents of the records that match the specified pattern. Obvious advantages of a CAM over a RAM are the possibility of parallel matching when enough hardware is available to implement it, the liberation of the processor during memory searches, and reduced traffic between processor and memory [41].
In Section 3 we stated that the design of the architecture is based on the premise that the use of CAMs significantly improves the processing speed. In this section we address two questions that come to the mind of an inquisitive computer architect when analyzing the architecture. First, assume a machine configuration in which all memory components are CAMs: what would be the impact of replacing one of these CAMs with a RAM? Second, consider a machine in which all memories are RAMs: how much speedup would be gained if one of these memory components were replaced with a CAM?
To evaluate the speedup obtained by the use of CAMs, we implemented options in the simulator that allow us to specify whether each of the individual memory components (AFIM, FIM, and PMM) is a CAM or a RAM. If a component is specified as a RAM, the simulator counts the number of accesses performed until the searched data item is found. This number is multiplied by the RAM access time to find the time for that particular search. If a component is specified as a CAM, every access takes the same amount of time.
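The timing rule just described can be illustrated with a minimal sketch. This is not the simulator's code: the function names, the linear probe order, the treatment of an unsuccessful search as probing every cell, and the modeling of the CAM access time as the RAM access time scaled by the technology factor T (introduced later in this section) are all assumptions made for illustration.

```python
from typing import Sequence


def ram_search_time(cells: Sequence[object], target: object,
                    ram_access_time: float) -> float:
    """Time to search a RAM: one access per cell probed until the item is found."""
    for probes, cell in enumerate(cells, start=1):
        if cell == target:
            return probes * ram_access_time
    # Assumption: a search that finds nothing has probed every cell.
    return len(cells) * ram_access_time


def cam_search_time(ram_access_time: float,
                    technology_factor: float = 1.0) -> float:
    """Time to search a CAM: constant, independent of how many items are stored.

    Assumption: a CAM access costs technology_factor (T) RAM accesses.
    """
    return technology_factor * ram_access_time
```

Under this model the RAM cost grows with the position of the item in the memory, while the CAM cost stays fixed, which is why the benefit of a CAM depends on how much data the component holds.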
The effectiveness of a CAM in the architecture depends on the amount of data stored in the memory, the frequency of access, and whether its accesses are in the critical path of execution. Thus, the amount of speedup obtained by a given combination of CAM/RAM memories depends on the production system program that the machine is executing. For a production system program that maintains a large number of productions in the conflict set, the use of CAM for AFIM and FIM might result in a considerable speed improvement. If the conflict set is small, the use of CAM for these memories only improves the speed slightly.
To set up experiments to measure these effects, we defined two quantities: Speedup(M, B) and Slowdown(M, B). Speedup(M, B) is the amount of speedup that results when the memory component M is replaced with a CAM in a machine that was originally formed only by RAMs. Correspondingly, Slowdown(M, B) is the amount of slowdown that results when M is replaced with a RAM in a machine that was originally formed only by CAMs. M designates one of the memory components (PMM, AFIM, or FIM) and B is a benchmark program. While Speedup(M, B) in this section measures the amount of speed gained because of the use of CAMs, the speedup measured in Section 6.1 compared the asynchronous firing of productions with a machine that fires productions synchronously but also uses CAMs. Because the base machines used to compute the speedup in this section and in Section 6.1 are different, these two sets of measurements should not be compared.
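One way to write the two quantities is as ratios of simulated execution times. The notation t(c, B), the execution time of benchmark B on memory configuration c, is an assumption introduced here for illustration, and the ratio form is a sketch of the verbal definitions rather than a formula stated in this paper.

```latex
\[
\mathrm{Speedup}(M, B) =
  \frac{t(\text{all RAM},\, B)}
       {t(\text{all RAM except } M \text{ as CAM},\, B)},
\qquad
\mathrm{Slowdown}(M, B) =
  \frac{t(\text{all CAM except } M \text{ as RAM},\, B)}
       {t(\text{all CAM},\, B)}
\]
```

Under this reading, both quantities equal 1 when changing M has no effect on execution time, and larger values indicate a stronger dependence on M being associative.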
For a given benchmark program, the amount of speedup obtained by using CAMs varies with the number of processors used in the architecture. Table 5 presents the average speedup for machines with one to twenty processors. In practical designs, CAMs might be slower than RAMs, either because they are constructed with older technology or because they need more silicon area for the comparator circuits. To account for these factors we introduce a technology factor T that indicates how much slower a basic CAM operation, such as reading or writing a single data element, is assumed to be in this comparison. Table 5 shows measures for a machine with CAMs that have the same speed as the RAMs (T = 1) and for a machine with CAMs that are four times slower than the RAMs (T = 4). Observe that there is no significant difference in speedup between the two measures, indicating the advantage of using CAMs even if they are slower than RAMs.
Table 5 shows the speedup and the slowdown due to each piece of associative memory for each of the benchmarks presented in Section 5.1. The last column shows the speedup of a configuration with all three memories associative over one in which all three memories are RAMs. Table 5 shows that replacing just one memory with a CAM results in quite low speedup. This limited speedup is a result of the slow operation of the RAMs remaining in the machine. Only when all three memories are CAMs does the processing speed improve considerably. The numbers in the slowdown columns show that the use of a RAM for PMM or AFIM alone might cause a significant reduction in speed. Both experiments show that the use of a CAM for FIM is not very important. Overall, these results confirm our initial conjecture that the use of CAMs can provide considerable speedup in production system architectures.
Table 6: Maximum and average "crest" for memory size (bytes).

Associative Memory Size
The next question that the inquisitive computer architect must ask is: how large do these associative memories need to be? The simulator has an option to report the "crest" (see footnote 10) of each memory component in any given run. Table 6 shows the maximum and the average crest over machines with up to twenty processors. The average crest is the average of the largest memory needed for each machine configuration. The maximum crest indicates the minimum memory size needed to run that specific benchmark on any of these configurations. Observe that for some memory/benchmark combinations the average crest is several times smaller than the maximum crest (see AFIM in moun2 and PMM in waltz2). If memory size becomes a concern in the construction of the machine, a RAM can be used to contain overflow. The absence of a direct correlation between the size of the memory crest and the speedup and slowdown shown in Table 5 reflects the fact that the processing speed does not depend solely on the amount of data stored in each memory: it also depends on the frequency and time of access of these memories.
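The relation between the crest of a single run and the two summary figures reported in Table 6 can be sketched as follows. The data layout (per-processor lists of occupancy samples in bytes, keyed by processor count) and the function names are hypothetical; the simulator's actual bookkeeping is not described here.

```python
from statistics import mean
from typing import Dict, List, Tuple


def crest(occupancy_by_processor: List[List[int]]) -> int:
    """Crest of one run: the largest occupancy seen in any processor at any time."""
    return max(max(samples) for samples in occupancy_by_processor)


def crest_summary(runs: Dict[int, List[List[int]]]) -> Tuple[int, float]:
    """Return (maximum crest, average crest) over runs keyed by processor count."""
    crests = [crest(trace) for trace in runs.values()]
    return max(crests), mean(crests)
```

For example, if a one-processor run peaks at 40 bytes and a two-processor run peaks at 12 bytes in its busiest processor, the maximum crest is 40 and the average crest is 26.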
The speed comparison with the synchronized architecture presented in Section 6.1 assumed that both architectures use associative memory to store and search the conflict set. The average and the maximum crests of the associative memories for the synchronized architecture are presented in the rightmost columns of Table 6. Observe that for most of the significant benchmarks, the synchronized architecture needs a much larger memory. For the CTSP benchmarks moun2 and south2, the maximum crest in the synchronized architecture was ten times larger than in the architecture proposed in this paper. This is evidence that the "eager firing" mechanism also reduces the demand for memory.
Footnote 10: The crest of a memory component is the maximum amount of data stored in that memory component in any processor of the machine for a given benchmark and a specified number of processors.
Table 7: Percentage of time that the bus is busy.
Use of Bus
A legitimate concern about any bus-based parallel architecture is the limitation of a bus as a broadcasting network. In Sections 2 and 3 we conjectured that bus bandwidth is not a limitation in the proposed architecture. Table 7 presents the percentage of time that the bus is busy for machines with 4, 8, and 16 processors, assuming that the bus bandwidth is the same as that of local memory. These measures include the arbitration time and the token broadcasting time. Observe that technological limitations would have to render the bus much slower than the memories before the bus speed becomes a concern in this architecture.
Concluding Remarks
We proposed a new architecture for production systems that eliminates global synchronization and the generation of a global conflict set. The increased importance of associative search for maintaining fireable instantiation tables in this setting is underscored by the large performance gains obtained by using modest amounts of associative memory. Note that a single physical CAM can be logically partitioned into PMM, FIM, and AFIM, and the "crests" in each partition are not expected to occur in the same processor at the same time. Thus, only a few kilobytes of associative memory is sufficient for most of the benchmarks considered.
A number of issues remain for future research in this area. With the improved speed in production selection and firing due to the CAMs, the matching in the Rete network is again a bottleneck. We have developed an analytical model to investigate the utilization of multiple functional units in the Rete network of each processor. The predictions indicate that a small number of functional units provide significant improvement in the Rete network speed [7]. One can now study the system-level effect of a faster Rete network on the architecture proposed in this paper.
Acharya and Tambe have shown the usefulness of handling collections of WMEs instead of single WMEs during the match phase [1]. The manipulation of collections in the architecture presented in this paper would further reduce the amount of traffic on the bus. However, more theoretical studies are necessary before collection-oriented production systems are built. For example, the handling of self-disabling productions in collection-oriented systems needs to be studied with care.
This research assumed the use of serializability as a correctness criterion. Our experience with PS benchmarks indicates that programmers often rely on knowledge about conflict set resolution strategies when writing PS programs. This is mostly evidenced by the omission of important antecedents in productions that are enabled but never selected to fire by a specific strategy. For problems like CTSP, writing a PS program that is correct under serializability was fairly straightforward. Now that our study has indicated that serializable systems offer great speed improvements, it is desirable to develop programming aid tools to help in the specification and verification of a wider range of serializable PS programs.
