The Rewrite Rule Machine (RRM) is a massively parallel MIMD/SIMD computer designed with the explicit purpose of supporting very high-level parallel programming with rewrite rules. The RRM's node architecture consists of a SIMD processor, a SIMD controller, local memory, and network and I/O interfaces. A 64-node cluster board is already an attractive RRM system capable of extremely high performance on a variety of applications. A cluster is SIMD at the node level, but MIMD at the system level, so that it can flexibly exploit the parallelism of complex nonhomogeneous applications. In addition to reporting the detailed simulation experiments used to validate the node design, we measure the performance of an RRM cluster on three relevant applications.
Introduction
The Rewrite Rule Machine (RRM) is a Multiple Instruction, Multiple Data / Single Instruction, Multiple Data (MIMD/SIMD) massively parallel computer being designed, simulated, and prototyped at SRI International. The RRM project is unique in that it emerged from an initial design search that was primarily focused on software issues. The outcome of this high-level design effort has been coupled with a bottom-up quantitative approach, resulting in an architecture that balances complexity, performance, and cost while still inheriting the important guidelines of the initial theoretical work. Two main characteristics of the overall design are the use of the concurrent rewriting model of computation and the use of active memory.
RRM Software Model
A rewrite rule p → p' consists of a lefthand side pattern p and a righthand side pattern p', and is interpreted as the replacement, called rewriting, of p by p' in some data structure. The RRM's model of computation is concurrent rewriting, that is, the process of concurrently replacing instances of lefthand side patterns by corresponding instances of righthand side patterns. Since rule application depends only on the local existence of a pattern, rewrite rules are intrinsically concurrent. A program is then a collection of rewrite rules. In its concurrent execution, each rule can be applied simultaneously to many instances (SIMD rewriting), and many different rules can each be simultaneously applied to many instances.
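As a concrete illustration (a standard textbook example, not a program taken from the RRM benchmarks), the following two rules define addition on numbers built from 0 and the successor operator s:

    0 + N    →  N
    s(M) + N →  s(M + N)

In a term such as (s(0) + s(0)) + (0 + s(0)), both inner subterms are instances of a lefthand side and can be rewritten in the same step, illustrating how one rule, or several rules, can fire at many sites at once.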
Rewrite rules have been used for expressing the implicit parallelism of functional programs in a declarative way, leading to the investigation of so-called reduction architectures (see for example [11, 21]). However, when generalized adequately [18, 16], rewrite rules are not limited to functional computations; they can express with similar ease many other parallel but nonfunctional applications. As explained in [16], concurrent rewriting gives rise to a machine-independent parallel language, Maude [19, 16], in which a very wide range of parallel applications can be expressed in a very high-level, declarative way; Maude supports three different types of rewriting.

*Supported by Office of Naval Research Contract N00014-92-C-0222.
RRM Hardware Hierarchy
Our parallel programming paradigm diverges from the standard von Neumann model of computation where every execution step requires some interaction between the CPU and data memory. One way of describing the RRM architecture is to imagine a parallel system whose computational units are in its first-level caches. One can think of the SIMD processors as a self-modifiable programmable active store, and of the data memory as conventional passive memory. This organization blurs the distinction between the computational agent and memory, and thus limits the negative effects of random memory access [17] .
As displayed in Fig. 1, the RRM is a 7-tiered hierarchical architecture. The most basic unit is a 16-bit processing element with 16 registers, called a cell. Four cells, which share local communication buses, make up a tile, and 144 tiles operating in SIMD mode make up an ensemble, which is expected to fit on a single die. A node consists of a collection of hardware devices that constitute a self-contained computational building block. In our case the node is a tightly coupled design that is tuned to supply the ensemble SIMD processor with enough resources to efficiently sustain computation. A node contains an ensemble, data and instruction memory, and I/O and network interfaces, and is expected to be realized as a multichip module. A cluster consists of 64 or more nodes connected on a high-speed network and fitting on a single board. The Rewrite Rule Machine as a whole is a collection of clusters connected on a network and sharing a common host, which runs a standard operating system and handles user interaction. We view an RRM system with a single cluster as an attractive accelerator for applications such as event-driven simulation, image processing, neural networks, artificial intelligence, and symbolic computation in general. Such a single-board system has a raw peak performance of 3.6 teraops and, as explained in this paper, is flexible enough to achieve very good performance on a heterogeneous variety of applications.
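The peak figure can be checked from the hierarchy parameters above, together with the 100-MHz clock rate given in Section 2; the short check below is ours, and it assumes one 16-bit operation per cell per cycle:

    #include <stdio.h>

    int main(void) {
        /* From the text: 4 cells/tile x 144 tiles/ensemble, one
           ensemble per node, 64 nodes per cluster. */
        long   cells    = 4L * 144 * 64;   /* 36,864 cells  */
        double clock_hz = 100e6;           /* 100-MHz clock */
        /* Assuming one 16-bit operation per cell per cycle: */
        printf("%.2f teraops\n", cells * clock_hz / 1e12);  /* 3.69 */
        return 0;
    }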
Implementation of Concurrent Rewriting on the RRM
The RRM is designed to exploit the massive parallelism of many types of applications expressed with rewrite rules. Fast SIMD rewriting is supported at the chip level, but the RRM as a whole operates in MIMD/SIMD mode to efficiently and flexibly exploit parallelism at all levels. The RRM can perform globally-SIMD homogeneous computations, but can also effectively exploit heterogeneous MIMD parallelism at the cluster and RRM system levels.
Rewrite rules are surprisingly well-suited to massively parallel computation. The most striking architectural advantage of using rewrite rules for parallel computation is that proper compilation techniques can greatly reduce the need for synchronization [15]. Consistent with this framework, our design exposes the underlying architecture and satisfies the remaining synchronization requirements through application-specific software primitives, of the general kind sketched below. Our design supports both the shared-memory and message-passing communication schemes; shared-memory consistency is maintained entirely with barrier synchronization mechanisms and test-and-set operations, while message passing is supported with a very simple active message scheme [23]. The simplicity of our hardware somewhat increases software complexity, but it allows integration of message-passing and shared-memory communication schemes in a more natural way than in other shared-memory designs [10, 13].
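The paper does not spell these primitives out; a minimal sketch of such mechanisms — a test-and-set spin lock and a counting barrier — might look as follows in C11 (names and structure are ours, not RRM code):

    #include <stdatomic.h>

    /* Test-and-set spin lock. */
    typedef atomic_flag lock_t;
    static void lock(lock_t *l)   { while (atomic_flag_test_and_set(l)) ; }
    static void unlock(lock_t *l) { atomic_flag_clear(l); }

    /* Sense-reversing counting barrier for nprocs participants. */
    typedef struct { atomic_int count; atomic_int generation; } barrier_t;

    static void barrier_wait(barrier_t *b, int nprocs) {
        int gen = atomic_load(&b->generation);
        if (atomic_fetch_add(&b->count, 1) == nprocs - 1) {
            atomic_store(&b->count, 0);           /* last arrival resets */
            atomic_fetch_add(&b->generation, 1);  /* and releases all    */
        } else {
            while (atomic_load(&b->generation) == gen)
                ;                                 /* spin until released */
        }
    }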
We have developed two compilers mapping rewrite rules to parallel RRM code [2, 15]. The latest compiler produces code whose efficiency is within 20% of the corresponding hand-compiled code. Given the great flexibility of the concurrent rewriting model, we believe that conventional code can also be compiled and parallelized on the RRM with reasonable ease and efficiency. In this way, one could support legacy code written in conventional languages and integrate it with new code written in a rewriting language.
Terms and graphs are represented by having each RRM cell represent a vertex. Each cell has one register holding the datum labeling its vertex, and a small number of registers (two or three) holding the addresses of the child cells; a sketch of this layout is given below. Our indirect addressing scheme allows extreme flexibility in representing a graph; vertices of the same graph can reside in neighboring tiles, in nonneighboring tiles, in different RRM nodes, or in passive memory. A mix of software and hardware mechanisms allows communication between vertices residing in any of these locations. All cells in an ensemble listen to the same SIMD instructions broadcast by a common controller. The instructions are interpreted depending on the cell's internal state; cells to which an instruction does not apply become inactive. Under SIMD control, cells can communicate with each other to find patterns that are instances of a rewrite rule's lefthand side. Many such instances can be found simultaneously within a single ensemble and across multiple RRM nodes; the found instances can then be simultaneously replaced by righthand side patterns. The ensemble's SIMD controller has a feedback mechanism that is used to interrogate cells. In this way, the scheduling of code for different rewrite rules can be made conditional on the appropriate data being present in the cells. Different RRM ensembles can then work asynchronously in MIMD/SIMD mode on very different types of data, with each ensemble using only the rules that are relevant for the data it currently holds.
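In C-like notation, one register layout consistent with this description would be the following; the field names and the exact register assignment are ours, not a specification:

    /* Hypothetical register layout of a cell holding a graph vertex. */
    typedef unsigned short reg16;     /* 16-bit registers, 16 per cell */

    typedef struct {
        reg16 label;        /* datum labeling the vertex               */
        reg16 child[3];     /* addresses of child cells (2-3 in use)   */
        reg16 scratch[12];  /* remaining registers: matching, flags    */
    } cell_regs_t;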
Related Research
Key ways the RRM design differs from massively parallel SIMD machine designs of the past include (1) its MIMD/SIMD character, (2) its use of software-controlled prefetching [12, 8] , which allows data access to be decoupled from the instruction stream, (3) the extreme simplicity of its SIMD controller and (4) its RISC-like instruction set architecture.
Several other features of the RRM are novel in combination, although most have appeared in isolation in earlier machine designs. As a concrete comparison, the Goodyear/NASA MPP [5] has local connections between large numbers of 1-bit cells; however, its cells have minimal computational power and there is no support for indirect addressing. The CM-1 and CM-2 architectures are also composed of SIMD-controlled 1-bit cells, and in addition have floating-point hardware support. The RRM has no dedicated floating-point support, but features much more powerful computational agents (much more active memory and a 16-bit ALU). The CM-5 is a MIMD machine with vector units in each node when fully configured. The vector units could be thought of as a very limited form of SIMD computational agents, but they require significant hand-coded software support and are not designed for symbolic computation. The MasPar line of architectures [6] is another modern SIMD design with some similarity to the RRM. The MasPar architectures utilize 4-bit computational cells, which are smaller and hold less state than RRM cells. MasPar machines support floating-point arithmetic better than the RRM, but lack some of the addressing support, as well as the MIMD/SIMD capabilities, found in the RRM.
Section 2 describes in detail the node architecture, gives a brief description of the ensemble, and discusses the (preliminary) cluster architecture used in the simulations. Section 3 discusses our simulation methodology and experiments. We have measured the performance of an RRM cluster on three applications: the DARPA Image Understanding benchmark, a logic level circuit simulation, and a parallel sorting algorithm.
RRM Architecture
After a brief description of the system and cluster levels, for which only preliminary designs exist, this section focuses on the detailed architecture of an RRM node by describing and interrelating its components.
RRM System
The RRM system is composed of a number of cluster boards interconnected with a high-performance network. A host (a conventional workstation) is responsible for the user interface, compilation, system functions, and high-level synchronization. We include a separate I/O network for generality because I/O requirements will depend on the particular application area of the final design. The number of cluster boards employed in the system will depend on both technological issues and performance requirements. For the time being, we focus on a system with one cluster board.
RRM Cluster Board
Each RRM cluster board is composed of either 64 or 128 computational nodes; initial estimates indicate that a 64-node cluster implemented in multichip-module technology will fit on a reasonably small board of 40×40 cm. Details of the cluster interconnection topology have not yet been decided; for simulation purposes we model the node-to-node interconnection network as a point-to-point 500-Mbyte/s bidirectional 2-D mesh. We have derived the topology, the link controller architecture, and the bandwidth estimate (500 Mbyte/s) from the IEEE SCI standard 1596-1992 [20]. Even though 1-Gbyte/s communication drivers are already on the market, we prefer to assume 500 Mbyte/s to be conservative on an aspect of the design that we have not yet fully explored.
Node Architecture
The RRM node architecture augments the ensemble SIMD processor with local memory and with powerful communication capabilities. We have chosen a non-blocking Load/Store scheme so that software-controlled prefetching can overlap computation and communication. We have completely decoupled the data flow from the control flow to parallelize the execution of control and data access operations. One interesting result of our design effort is the observation that this paradigm applies quite well to the SIMD world and in some respects allows an overall simplification of the design. As we shall see later, by sharply dividing the execution of control and data access instructions between the SIMD controller and the SIMD processing elements (PEs), one can achieve greater parallelism and at the same time reduce software and hardware complexity. Fig. 2 is a functional block diagram of the node architecture. The ensemble's cells are continuously fed instructions by the SIMD controller, which steps through the instruction memory. The internal request buses are used for distributing data among the devices of the node. All devices are interfaced to this data path with proprietary bus interface units (BIUs) that, as described below, offer a simple and uniform way of propagating nonblocking split-transaction requests.
An important characteristic of this architecture is its flexibility; it can be modified by adding or removing BIUs and/or buses to fine-tune its performance. Each BIU can be connected to a small, arbitrary number of devices and, provided it has enough multiplexers, to an arbitrary number of request buses. The 4-bus configuration depicted above was derived by gathering execution information from a mix of heterogeneous benchmarks (symbolic Fibonacci, sorting, image component labeling, event-driven simulation, image understanding) and by choosing parameters that yield good average performance while exhibiting good hardware utilization. We justify this choice of configuration in more detail later.
Ensemble
Our SIMD processor, called an ensemble, fits on a single die. The ensemble has been the object of extensive study in the past [1, 3, 14], and its topology and architecture are based on the results of both theoretical and experimental research. For expository purposes we summarize its main characteristics here.
The ensemble contains a 12x12 grid of buses and a controller (Fig. 3a) . The row buses (really one large unidirectional bus) are used to broadcast SIMD instructions to all cells within the chip, and the column buses are used for data input-output. The controller does not have access to the column buses, which are for the exclusive use of the cells.
Each square formed by the intersection of the buses is called a tile (Fig. 3b) and contains four 16-bit processing elements called cells (Fig. 3c). Each cell is connected to one row bus, to one column bus, and to four local 16-bit buses (NEWS). The four local buses allow direct communication between cells of adjacent tiles, and one of the buses (North) also allows communication between cells within the same tile. This unique topology offers a large degree of connectivity, trading hardware simplicity against the need to multiplex eight cells on each of the NEWS buses. Non-neighboring cells that cannot communicate through the NEWS buses use the column buses, regardless of whether they reside in the same ensemble chip or in different nodes. This greatly simplifies both software and hardware, at the expense of having to service all non-local communication requests off the ensemble chip, even in the case of non-local communication inside the ensemble. A simple fixed-priority scheme synchronizes access to the shared buses.
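The arithmetic tying a cell to its buses is simple; the indexing function below is our own illustration of the 144-tile, 4-cells-per-tile layout, not RRM hardware or software:

    /* Our illustration: map a cell number 0..575 to its buses
       (576 cells = 12 x 12 tiles x 4 cells per tile). */
    typedef struct { int row_bus, col_bus, pos_in_tile; } cell_loc;

    cell_loc locate(int cell) {
        int tile = cell / 4;
        cell_loc loc = { tile / 12,    /* row bus broadcasting its code */
                         tile % 12,    /* column bus used for data I/O  */
                         cell % 4 };   /* position within the tile      */
        return loc;
    }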
SIMD Controller
Our SIMD controller is simple enough to fit within the SIMD processor chip. Its simplicity is, in our opinion, of paramount importance because it allows decentralization and simplification of the hardware design and because it permits instructions to be propagated within the chip, thereby allowing faster clock rates.
The controller's hardware (Fig. 4) consists of an ALU, a register file, and some control logic. The instruction memory is matched to the controller speed; a secondary program memory can also be included to implement instruction caching. The SIMD controller steps through the program memory and either executes or broadcasts each instruction.
Our instruction set design closely follows the RISC philosophy: we allow only simple elementary instructions and expose the underlying architecture in order to take advantage of optimizing compilation techniques. Based on our detailed hardware design for the ensemble, we are confident that all instructions can execute in two 5-ns half-cycles, i.e., at 100 MHz. The RRM uses the A-SIMD mode of execution: although the controller continuously broadcasts instructions, individual cells may stop executing instructions based on the values of their internal registers, as sketched below. This powerful program control scheme makes control information implicit in the ordering of the instruction stream, thus simplifying the hardware design.
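The A-SIMD model can be pictured as the following loop, which compresses the broadcast-and-conditionally-execute behavior into sequential C purely for illustration (names are ours; in hardware all cells act in the same cycle):

    #define NCELLS 576

    typedef struct { int active; unsigned short regs[16]; } cell_t;
    typedef unsigned short instr_t;

    extern void execute(cell_t *c, instr_t insn);  /* hypothetical
                                                      per-cell datapath */

    void broadcast_step(cell_t cells[NCELLS], instr_t insn) {
        for (int i = 0; i < NCELLS; i++)
            if (cells[i].active)          /* inactive cells ignore the */
                execute(&cells[i], insn); /* broadcast instruction     */
    }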
As shown in Fig. 4, the instructions in the Instruction Register (IR) can either be placed on a latch to be broadcast to the SIMD cells or be executed internally by the controller hardware to control the program flow. Synchronization between the controller and the cells is achieved with a simple wired-OR mechanism used to determine whether one or more of the 576 cells is in the active state. Besides program flow control mechanisms, the controller also offers some simple hardware support for asynchronous message passing between nodes. Controller messages coming from outside the node contain a predefined vector that points into the program memory; application-specific handlers service messages by executing the appropriate interrupt routines. To keep the controller design as simple as possible, we do not anticipate automatic context-switch support or nested interrupt capabilities. Messages are typically very small and rely on the message handlers for data movement (the active message paradigm). This part of the controller can be directly derived from conventional processor design techniques and is therefore not of particular interest at this point.
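The dispatch mechanism just described amounts to a table of handlers indexed by the message vector; the sketch below is our reading of it (the table size and all names are assumptions, not RRM specifications):

    /* Our sketch of vectored active-message dispatch. */
    typedef void (*handler_t)(unsigned short payload);

    #define NVECTORS 64                  /* assumed table size */
    static handler_t vectors[NVECTORS];  /* installed by the application */

    void on_controller_message(unsigned short vec, unsigned short payload) {
        if (vec < NVECTORS && vectors[vec] != 0)
            vectors[vec](payload);       /* no nesting, no context switch */
    }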
Bus Interface Units and Bus Architecture
The BIUs (Fig. 5) synchronize information flow between devices within the node. All transactions are non-blocking: all requests are buffered and, after issuing a request, a device is free to perform other tasks. All requests and messages consist of either two address words, for read requests, or one address word and one data word, for write requests.
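In C terms the two-word format could be pictured as follows (field names are ours); as the next paragraph explains, a satisfied read comes back as a Write/Reply carrying the data:

    /* Our sketch of the two-word request format. */
    typedef unsigned short word16;

    typedef enum { REQ_READ, REQ_WRITE, REQ_WRITE_REPLY } req_kind;

    typedef struct {
        req_kind kind;
        word16   addr;    /* target address                             */
        word16   second;  /* reply address (read) or data word (write)  */
    } request_t;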
It is important to note that a read request, after it has reached its source location and obtained the necessary data, is transformed into a Write/Reply request that is processed by the hardware as a normal write request and propagated back to the reader. The detection of outstanding read requests is achieved using a mix of software and hardware techniques. The number of request buses determines how many bus transactions can happen in parallel. Depending on the application, internode communication requirements can vary greatly. Here we report the results of some experiments designed to determine a sensible number of request buses for our current node configuration. Fig. 6 details the performance variations of a 4-node system (expressed as percentages) when the number of buses, memory units, and SIMD processor BIUs are all varied from 1 to 12. This graph supports the choice of a 4-bus system because, except for the hardware simulator application, the incremental advantage of increasing the bus width beyond 4 is very small. The sorting, Fibonacci, and hardware simulator applications never use passive memory because the problem size was chosen to fit entirely in active memory. The image understanding benchmark, however, relies on passive memory to store temporary results. This benchmark's small performance variation as the bus bandwidth is reduced reinforces our conviction that our programming model can be quite resilient to memory bandwidth limitations. The hardware simulator is the application that relies most heavily on the internal node communication capabilities, because of the extremely high connectivity this application requires.
Network Interface
The network interface supports communication between nodes. It consists of communication drivers and high-speed hardware queues that store incoming and outgoing messages. Although we have not yet committed to a final network topology, we have simulated a 2-D bidirectional mesh with point-to-point links. This part of the node architecture is a good example of how changing specification parameters causes only relatively minor changes to the overall system. Our current network interface is assumed to have four bidirectional ports connected to its immediate neighbors; if, for example, we could only employ an interface with one bidirectional port, four such devices could be placed on a single BIU, thus emulating the original topology with only very localized changes to the design.
In Fig. 7 we report the result of an experiment aimed at determining a suitable number of network interfaces. This experiment was conducted with a system of 16 nodes. The number of interfaces was varied from 1 to 4, thus measuring the effect of internode communication parallelism. In the 1-interface configuration, packets traveling between nodes can only be sent and received sequentially, while in the 4-interface version packets can be sent and received in parallel from the four NEWS directions. As expected, sorting, which is bound by internode communication bandwidth, shows the highest sensitivity to this parameter and would justify the adoption of multiple interfaces. We have nevertheless chosen a more conservative 1-interface base configuration so that our performance results do not depend on an optimistic node-to-node communication mechanism.
Memory Controller, I/O Controller, and Addressing
The flexibility of our node architecture allows tuning the memory subsystem to a required throughput. Our base configuration uses four memory BIUs, one memory controller per BIU, and assumes that memory is matched to the memory controller speed. Because of the adoption of the active memory paradigm, the applications we have developed so far make very little use of passive memory and are therefore only marginally influenced by the memory subsystem characteristics. Addressing is a part of our design that has been left underspecified because it is usually not a critical aspect of computer designs. For the moment we do not simulate any indirect system-level addressing mechanisms and assume instead hard-wired addresses. In the future we plan to include a standard virtual memory mechanism to handle a larger memory address space. I/O ports are memory mapped and are accessed just like any other memory location.
Simulation and Performance Results
We describe here our simulation methodology and performance measurements for an RRM cluster of 64 or 128 nodes. Although communication-to-computation ratio, bus contention, network throughput, and other performance metrics are all important measurements, we chose to report only wall-clock time, because it is the only performance measure that allows easy comparison of the RRM cluster with other designs and thus gives an accurate relative account of the RRM's estimated performance.
A register-transfer-level simulator of an RRM cluster has been implemented. The simulator holds a very detailed description of all the hardware down to the register level; it uses the libraries provided by the general-purpose simulation package Csim [9]. This package is an extension of the C language that allows very efficient process-oriented, event-driven simulations. Each device of each node is a separate process that interfaces with other processes through synchronization lines (events) and hardware queues (mailboxes). This simulation scheme is very similar to the behavioral style of simulation found in hardware description languages such as VHDL or Verilog. Contention is carefully taken into account at all levels, and timing (the amount of time each process takes to perform a given operation) is derived from a careful analysis of the hardware as it would be implemented with realistic high-end microelectronics technology. All chips are clocked at 100 MHz. Request bus transactions execute at 50 MHz, while node-to-node packets travel at the rate of 500 Mbyte/s. Since the RRM compiler is only partly complete, we hand-compiled the benchmarks in RRM assembly language. Based on our experience, we expect compiled code to perform within ±20% of this handwritten code.
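For readers unfamiliar with this simulation style, the stripped-down event loop below shows the essential mechanism in plain C: each device posts timestamped events, and the scheduler always advances to the earliest pending one. This is our own sketch, not the Csim API, which additionally provides the process, event, and mailbox abstractions mentioned above:

    #include <stdlib.h>

    typedef struct event {
        double        time;                 /* simulated time stamp   */
        void        (*action)(void *dev);   /* device behavior        */
        void         *dev;
        struct event *next;
    } event;

    static event  *queue;                   /* kept sorted by time */
    static double  now;

    void post(double t, void (*action)(void *), void *dev) {
        event *e = malloc(sizeof *e);
        e->time = t; e->action = action; e->dev = dev;
        event **p = &queue;
        while (*p && (*p)->time <= e->time) p = &(*p)->next;
        e->next = *p; *p = e;               /* insert in time order */
    }

    void run(void) {
        while (queue) {
            event *e = queue; queue = e->next;
            now = e->time;                  /* advance simulated time  */
            e->action(e->dev);              /* may post further events */
            free(e);
        }
    }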
Since the network architecture for the cluster has not yet been determined, further simulation work will be required. However, since our communication assumptions are based on existing off-the-shelf technologies, the performance estimates derived from the present simulation experiments are well-grounded.
Performance Estimates
Sorting was implemented with a new version of the Shear Sort algorithm [22]. Even though our particular implementation is architecture-dependent, the ideas we used can easily be extended to other architectures offering good connectivity among their computational agents. The trick is to lay out the problem in a manner allowing efficient communication both for the normal 2-D pattern required by the Shear algorithm and for longer-range links among the elements of the list. We have found that using registers to hold long-range pointers is fully justified by the resulting performance improvements. Another important improvement over the algorithm discussed in [22] is that we keep a sublist of elements in each processor, thus avoiding the need to alternate shuffle exchanges between odd and even locations.
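For reference, the underlying algorithm is the textbook shearsort on a 2-D mesh: rows are sorted in alternating (snake) order and columns top-to-bottom, for about log2(rows)+1 rounds. The C sketch below shows only this skeleton on a small in-memory grid; it contains none of the RRM-specific long-range pointers or per-processor sublists described above:

    #include <stdlib.h>

    #define ROWS 8
    #define COLS 8

    static int up(const void *a, const void *b)
        { return *(const int *)a - *(const int *)b; }
    static int down(const void *a, const void *b)
        { return *(const int *)b - *(const int *)a; }

    /* Textbook shearsort: leaves the grid sorted in snake order. */
    void shearsort(int g[ROWS][COLS]) {
        int rounds = 1;
        for (int n = ROWS; n > 1; n >>= 1) rounds++;
        for (int r = 0; r < rounds; r++) {
            for (int i = 0; i < ROWS; i++)       /* rows, snake order */
                qsort(g[i], COLS, sizeof(int), (i % 2) ? down : up);
            for (int j = 0; j < COLS; j++) {     /* columns, ascending */
                int col[ROWS];
                for (int i = 0; i < ROWS; i++) col[i] = g[i][j];
                qsort(col, ROWS, sizeof(int), up);
                for (int i = 0; i < ROWS; i++) g[i][j] = col[i];
            }
        }
        for (int i = 0; i < ROWS; i++)           /* final row pass */
            qsort(g[i], COLS, sizeof(int), (i % 2) ? down : up);
    }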
In Fig. 8 we report the speedup obtained by a 64-node RRM cluster over an optimized quicksort implementation on a SPARC-10/41 with 48 Mbytes of memory. The anomalous speedup behavior between problem sizes 4464 and 8929 is due to the internode I/O overhead, which becomes predominant when the data size grows beyond the active memory available in a single RRM node. Notice that the RRM's parallel performance is vastly better than that of the sequential version, with execution time growing much more slowly as the problem size approaches the active memory size of the RRM cluster. We anticipate some performance degradation when the data set grows beyond the available active memory; this will be the object of future studies.
Hardware simulation is representative of a wide class of applications that fall under the category of discrete event simulation. We have simulated a 540-gate LSI design consisting of several cascaded binary counters used for digital image processing. Each of the logic gates in the LSI design is mapped to an RRM cell; a gate can have a maximum of 5 inputs and can be programmed with a maximum delay of 15 time steps (see the sketch below). Mapping of the network was performed off-line to minimize distant connections. We replicated the same circuit enough times to obtain a suitable number of gates for the different experiments. Fig. 9 compares the performance of a 64-node RRM with that of the Mentor Graphics QuickSim simulation tool run on a SPARC-10/41 for 100,000 iterations. The QuickSim execution time was estimated by subtracting the time taken to simulate one time step from the time taken to simulate the 100,000 steps, to mask out system-level overhead. These results point out the great versatility of the RRM interconnection network by indicating good performance figures even for an application whose required connectivity is extraordinarily high. The largest example required a total of 64,592 connections between gates, of which 68% (44,195) required, at each time step, the use of the distant communication mechanisms.
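The per-gate state a cell must carry follows directly from the stated limits; the record below is our illustration (field names ours), and it fits comfortably in a cell's 16 registers:

    /* Our sketch of the per-gate state held in one RRM cell. */
    typedef struct {
        unsigned short kind;      /* gate type: AND, OR, NOT, ...    */
        unsigned short delay;     /* programmable, 0..15 time steps  */
        unsigned short ninputs;   /* up to 5                         */
        unsigned short input[5];  /* addresses of the driving gates  */
        unsigned short out;       /* current output value            */
    } gate_t;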
The DARPA Image Understanding Benchmark for Parallel Computers [24] is a good benchmark because it allows direct performance comparisons with other parallel machines and because it is composed of different phases that test different performance aspects of a design. The benchmark consists of detecting and abstracting a pattern of rectangles embedded in a cluttered color digital image, and then matching the resulting model against a set of given hypotheses. We have not yet completed this benchmark; therefore, we report only the execution times of the low- and intermediate-level processing parts, which detect and abstract the pattern of rectangles from an input test image of size 512x512x8 bits. For this benchmark we relaxed the assumption of a 64-node board and increased the number of nodes to 128 to allow a fairer comparison with the ASP and IUA architectures. Fig. 10 contains the reported execution times of several parallel machines [7] with the addition of the RRM performance. Notice that the RRM compares favorably with even the fastest reported simulated execution times, which are based on massive special-purpose signal processing designs. We expect the symbolic processing phase of the rest of the benchmark to perform very well in comparison with other machines, given that the RRM was originally designed to support symbolic computation. A fair performance comparison should point out that the ASP, IUA, and RRM execution times were obtained through simulation and with substantial development efforts, while the other execution times were obtained on real machines and in some cases required minimal software development time. The clock rates of the ASP and IUA machines at the time of their simulations (1989) were 20 MHz and 10 MHz, respectively. Although this might suggest a technological imbalance (the RRM is clocked at 100 MHz), a more careful analysis of the architectures shows that the RRM's high clock rate is justified by its RISC-like design and on-chip controller; in addition, our understanding is that the ASP and IUA clock rate estimates would still be reasonably adequate today, having been little influenced by recent advances in microelectronic technology.
Conclusion
We think that our design is well-suited for massively parallel computation because it unifies state-of-the-art computer architecture and hardware solutions with a well-understood and mature high-level programming paradigm. Our declarative model of computation allows parallelism to be exploited at many levels simultaneously while reducing synchronization overhead. We have shown very good performance of an RRM cluster on a set of representative applications. We have also laid the basis for further tuning of our base architecture to application requirements and technological constraints, thus providing design flexibility that will be very useful for future implementations. In the near future we will develop and simulate more applications and experiment with a range of network architectures for the cluster. The current RRM compiler will be extended to handle a wider class of rewrite rules and will be enriched with optimization techniques. In addition, a hardware prototype of the SIMD processor will be built using the SPLASH-2 FPGA system [4].
