In multiprocessor systems, processing nodes contain a processor, some cache and a share of the system memory, and are connected through a scalable interconnect. The system memory partitions may be shared (shared-memory systems) or disjoint (message-passing systems). Within each class of systems many architectural variations are possible. Fair comparisons among systems are difficult because of the lack of a common hardware platform to implement the different architectures.
INTRODUCTION
Multiprocessor systems are becoming commonplace in the computer industry. The consensus among machine designers is to favor asynchronous MIMD (Multiple Instructions Multiple Data streams) systems, in which processors execute their own instructions and run on different clocks.
In these systems, processing elements contain a processor, some cache and a share of the system memory, and are connected through a scalable interconnect which facilitates machine packaging, such as a bus or a mesh (see Fig. 1). Whereas this physical model dominates, disagreement exists as to the interprocessor communication mechanism. There are two dominant models. One is based on disjoint memories and message-passing and the other is based on shared-memory. In a message-passing system, processors communicate by exchanging explicit messages through send and receive primitives. In the shared-memory model, processors communicate through load and store instructions and some form of explicit synchronization among processors is required at times to avoid data access races [7]. The shared-memory model facilitates fine-grain (word-level) communication but requires a large number of instructions to transmit large chunks of data, whereas the message-passing model can transmit large amounts of data in a single message. In terms of ease of programming, the shared-memory model has so far been the favored transition path from uniprocessors to multiprocessors. On the other hand, message-passing systems are generally perceived as more scalable than shared-memory systems. One concern in both kinds of systems is the growing disparity between the processor speed and the speed of communication. In message-passing systems, message-passing primitives have traditionally suffered from a large software overhead. In shared-memory systems, the large latency of loads and stores on shared data is usually addressed by complex shared-memory access mechanisms.
[Fig. 1: System Memory; Caches; Processors; Interconnection]
Some researchers advocate private caches [15] with hardware- or software-based consistency maintenance, such as in the Stanford DASH [12] prototype. The consistency protocol, the constraints on the ordering of memory accesses [7], the cache parameters, as well as the interconnect latency and bandwidth are all factors affecting the performance and programming ease of multiprocessors. Machines such as the DASH have been called CC-NUMA (Cache-Coherent Non-Uniform Memory Access) architectures to differentiate them from COMAs (Cache-Only Memory Architectures), exemplified by the Data Diffusion Machine (or DDM) [9]. A COMA has the same architecture as the one shown in Fig. 1 and communication is done through shared variables; however, the main memory in each processor node is replaced by a huge cache called attraction memory.
For some forms of communication, message-passing is more efficient than shared-memory and there is a trend today towards integrating the message-passing and shared-memory paradigms in order to draw from the strengths of both [11]. To understand which form such an integrated system should take, comparisons among systems are required. Presently, comparisons are difficult to make and hard to validate because of the lack of a common hardware platform to implement the different models.
Because multiprocessors are complex and powerful, the correctness of a design and its expected performance are very difficult to evaluate before the machine is built. Traditionally, two approaches have been taken to verify a design: breadboard prototyping and software simulation. A breadboard prototype is costly, takes years to build and explores only a single or a few design points.
Discrete-event simulation is very flexible but very slow if the design is simulated in detail, and it is subject to validity problems because the target system must be considerably abstracted in order to keep simulation times reasonable. In some industrial projects where a detailed and faithful simulation of a target system has been done in software, the simulation runs at the speed of a few cycles of the target system per second of simulation. The parallelization of discrete-event simulation is an ad hoc procedure and usually exhibits low speedup [8]. Most simulators [6] [14] rely on the direct execution of each target instruction on the host and, because the code (either source or binary) must be instrumented, it is difficult to simulate efficiently the execution of interesting workloads, other than scientific programs with little or no I/O.
The major objective of the RPM project is to develop a common, configurable hardware platform to emulate faithfully the different models of MIMD systems with up to eight execution processors. Emulation is orders of magnitude faster than simulation and therefore an emulator can run problems with large data set sizes, more representative of the workloads for which the target machine is designed. An emulation is closer to the target implementation than an abstracted simulation and therefore more reliable performance evaluation and design verification can be done. Finally, an emulator is a real computer with its own I/O and the code running on the emulator is not instrumented. As a result, the emulator "looks" exactly like the target machine to the programmer and a variety of workloads can be ported to the emulator, including code from production compilers, operating systems, database systems, and software utilities.
In the following, we first introduce the hardware emulation approach for multiprocessors.
Then we describe the architecture of RPM. In Section 4 we explain the methodology to keep track of emulated time, to measure performance, and to program an emulation. In Sections 5 and 6 we show the expected performance of RPM and we compare the emulation approach to simulation and breadboard prototyping. Current status and concluding remarks follow.
THE HARDWARE EMULATION APPROACH
The approach to system prototyping in RPM is based on hardware emulation. Emulators have been used in the past to experiment with instruction sets, and more recently in the pre-silicon validation of complex VLSI circuits. To the best of our knowledge, no attempt to apply the emulation approach to the design and verification of multiprocessor systems has ever been made. RPM is built from off-the-shelf components except for the cache, memory, coherence and communication controllers which are implemented with FPGAs. The emulation of a particular machine model is done through the FPGAs and a part of the memory to which they are attached.
This approach enables the emulation of an entire multiprocessor system at low cost.
The clock rate is 10 MHz, which is about 10 times slower than the rate permitted by current board and PLD technologies (estimated at 100 MHz). This compromise on the emulation efficiency results from two trade-offs. First, the design and fabrication of the PC boards are greatly simplified. Second, the lower clock rate facilitates the configuration of the FPGAs. FPGAs are slower than other programmable logic devices or custom circuits (especially when they are programmed with VHDL synthesizers), and clocking them at a lower speed facilitates the mapping of more complex circuits. Additionally, in order to further simplify the design and to have the flexibility to emulate complex mechanisms, each processor clock (pclock) is emulated in several consecutive clocks. Therefore, overall, the processing speed is 10 MHz divided by the number of clocks in a pclock. This relatively low rate allows us to use a standard interconnection fabric and still have enough bandwidth to emulate useful interconnections.
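The relation between the board clock and the emulated processing speed can be written as a one-line calculation. The following is a minimal sketch; the function name is hypothetical, and the value of eight clocks per pclock is the one quoted in the time-scaling discussion of this paper.

```python
# Illustrative arithmetic only; the function name is hypothetical.

def emulated_pclock_rate_hz(board_clock_hz: float, clocks_per_pclock: int) -> float:
    """Rate at which the emulator advances target processor clocks (pclocks)."""
    if clocks_per_pclock <= 0:
        raise ValueError("clocks_per_pclock must be positive")
    return board_clock_hz / clocks_per_pclock


# 10 MHz board clock with eight clocks per pclock.
rate = emulated_pclock_rate_hz(10_000_000, 8)
print(rate)  # 1250000.0 pclocks per second
```

With these numbers the emulator advances 1.25 million pclocks of the target per second.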
Timings are preserved in RPM by scaling down the speed of all components (time scaling), and performance data are collected in a set of counters stored in each memory (count memory). Time scaling and count memory are explained in Section 4. In the following we give an overview of RPM's architecture.
ARCHITECTURE OF RPM
The architecture of RPM has been geared towards the evaluation of multiprocessors with the general architecture shown in Fig. 1. A typical memory transaction in RPM has three phases: prelude, suspension and completion. During the prelude the packet is received and decoded. The request is then suspended by storing a transaction completion record (containing all the information needed to resume and complete the request) in a memory location associated with the virtual memory bank and by filling the interleaving register with a value in pclocks equal to the suspension time. While the request is suspended, accesses to a different virtual memory bank can be serviced. When the interleaving register reaches zero, the memory controller is interrupted; it then fetches the transaction completion record in memory and completes the transaction by (possibly) sending some messages (completion phase). Virtual memory interleaving is further illustrated in Section 4.
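The three-phase transaction described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the class names, the record fields and the choice of one interleaving register per virtual bank are hypothetical and are not taken from the RPM hardware design.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CompletionRecord:
    """Information needed to resume a suspended request (fields are hypothetical)."""
    requester: int
    block_address: int


@dataclass
class VirtualBank:
    record: Optional[CompletionRecord] = None  # present while a request is suspended
    interleaving_register: int = 0             # remaining suspension time, in pclocks


class MemoryControllerModel:
    """Toy model: assumes one interleaving register per virtual memory bank."""

    def __init__(self, num_banks: int) -> None:
        self.banks = [VirtualBank() for _ in range(num_banks)]
        self.completed: List[CompletionRecord] = []

    def request(self, bank: int, requester: int, block_address: int,
                suspension_pclocks: int) -> bool:
        """Prelude and suspension. Returns False if the bank is already busy."""
        b = self.banks[bank]
        if b.record is not None:
            return False
        b.record = CompletionRecord(requester, block_address)
        b.interleaving_register = suspension_pclocks
        return True

    def tick(self) -> None:
        """Advance one pclock; complete requests whose register reaches zero."""
        for b in self.banks:
            if b.record is None:
                continue
            b.interleaving_register -= 1
            if b.interleaving_register <= 0:
                self.completed.append(b.record)  # completion phase
                b.record = None
```

In this model a second request to a suspended bank is refused, while a request to a different bank is accepted and may complete earlier, which mirrors the interleaving behavior described in the text.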
The internal bus is a synchronous bus with a protocol similar to the Sun Microsystems MBUS protocol [4] . It is a 32-bit wide packet-switched bus which transfers packets of size between 16 and 128 bytes. Controllers MC2 and MC3 connect to the internal bus through very large, two-way FIFO buffers, which are there to prevent deadlocks and to relieve the controllers from managing the data transfers. All on-board datapaths are 32-bit wide.
The Delay Unit (DU) is a programmable unit which emulates variable interconnection delays. It is built with a FIFO controlled by one AMD MACH 210 chip. The FIFO (16 Kbytes) contains blocks and messages which are sent to the bus interface after a programmable delay depending on the target machine's interconnect latencies and packet size. This delay is computed by the formula:
The Futurebus+ interface is made of off-the-shelf chip sets. It includes bus transceivers, plus the LIFE chip from Newbridge and a distributed arbiter chip from National Semiconductor.
More details on the hardware design of RPM can be found in [13] .
Emulation of CC-NUMAs with Central Directory Protocol (CC-NUMA-CD)
Our first emulator is a system with hardware-enforced cache-coherence under strong ordering of memory accesses, to enforce sequential consistency [7]. In this emulator, MC1/RAM1 is a first-level write-through cache (containing both data and instructions), MC2/RAM2 is a second-level write-back cache and MC3/RAM3 is the main memory. The protocol is directory-based and each memory block has a home node where a directory records the presence and state of copies in every cache [12] [15]. The memory and the cache directories have pending states so that transactions for different blocks are executed concurrently. The protocol is a pure write-invalidate protocol, but write-update and competitive-update protocols as described in [5] are easy extensions.
The SRAM implementing the first-level cache (FLC) is divided into five parts: the data memory (up to 1 Mbyte), the cache directory, the Translation Lookaside Buffer (TLB) needed for virtual memory support [4] , the space for the emulation of prefetch and write buffers, and the space dedicated to the collection of performance statistics. The controller is partitioned across two FPGAs: one for the control unit and the other for the data unit. Currently the first-level cache is write-through and direct-mapped with a block size of 16 bytes. A store issued by the processor is always propagated to the second-level cache. The first-level cache and the processor block on every write access that misses or requires coherence activity in the second-level cache.
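For a direct-mapped cache with 16-byte blocks, an address splits into a tag, an index and a block offset. The sketch below shows that split. It assumes, for illustration only, that the whole 1 Mbyte data memory is used as cache data; the constant and function names are hypothetical.

```python
BLOCK_SIZE = 16        # bytes per block, from the text
DATA_BYTES = 1 << 20   # assumption: the full 1 Mbyte data memory holds cache data

NUM_BLOCKS = DATA_BYTES // BLOCK_SIZE        # number of direct-mapped entries
OFFSET_BITS = BLOCK_SIZE.bit_length() - 1    # log2(16) = 4
INDEX_BITS = NUM_BLOCKS.bit_length() - 1     # log2(65536) = 16


def split_address(addr: int):
    """Return (tag, index, offset) for a direct-mapped lookup."""
    offset = addr & (BLOCK_SIZE - 1)
    index = (addr >> OFFSET_BITS) & (NUM_BLOCKS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


tag, index, offset = split_address(0x12345678)
print(hex(tag), hex(index), hex(offset))  # 0x123 0x4567 0x8
```

A hit occurs when the tag stored in the cache directory at `index` matches `tag` and the entry is valid.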
A typical data read cycle in the first-level cache consists of receiving an address from the processor, translating the address in the TLB space, accessing the cache directory space, fetching the data from the cache data space, fetching the counter in count memory for that event, updating the counter and returning the data to the processor (all this is done in eight cycles). 
Emulation of Other Architectures
Given its generic board architecture, RPM is capable of emulating in full detail very complex target multiprocessors with very different architectures, provided they fit the generic block diagram of Fig. 1. Examples are:
• CC-NUMAs with centralized directories (CC-NUMA-CD) under weakly-ordered memory models: Under weak ordering of memory accesses [7] there can be a write buffer between the first- and second-level caches and between the second-level cache and the internal bus. Write buffers can be assisted by a write cache (WC), which is a small cache keeping track of partially modified blocks [5]. Additionally, prefetching hardware may be added.
• CC-NUMAs with distributed directories (CC-NUMA-DD): Instead of a centralized directory located at the home memory, the directory is distributed by linking caches containing a copy of the block through a set of hardware pointers in each cache entry. This organization is adopted in the Scalable Coherent Interface (SCI) standard [10] .
• COMAs (Cache-Only Memory Architectures): This is the architecture of machines such as the Data Diffusion Machine (DDM) [9] . In this case there is no system memory and MC3/RAM3 acts as a huge cache (also called attraction memory). The DRAM directory is replaced by the attraction memory state. MC1/RAM1 and/or MC2/RAM2 may be configured as caches.
• MPS (Message-Passing Systems): In this configuration MC3/RAM3 acts as the local (private) memory of the processor, MC2/RAM2 acts as a Message Passing Controller (MPC), and MC1/ RAM1 acts as the processor cache. The MPC buffers messages sent or received by the processor, formats out-going packets according to the protocol in the target, decodes the received messages, and, if needed, interrupts the processor when messages are received.
• Mixed Shared-Memory and Message Passing Systems: Every shared-memory organization can be augmented with a message-passing facility for bulk transfers of data among processors, as was done in Alewife [11] .
• Virtual Shared Memory: Shared memory can be implemented on top of a distributed, message-passing system through virtual memory mechanisms. To promote efficiency, hardware support is often needed [2] and RPM is an ideal vehicle to experiment with such hardware.
EMULATION METHODOLOGY
The success of the RPM project relies on several novel approaches for measuring time and collecting performance data. In this section we briefly describe the performance evaluation methodology as well as the methodology to program the emulator.
Keeping Track of Time: Time Scaling
Since the speeds of the hardware emulation and of the target system are different, the timings measured on the emulator must be related to the timings in the target machine. Rather than keeping track of simulated time through event-driven mechanisms and timestamping (as is done in software simulators [8]), time is scaled. Time scaling preserves the relative timing of components in the emulator and in the target, and absolute times in the target are derived from executions on RPM by simple scaling arguments. For example, it is intuitively obvious that the processor utilization in a system with processors running at 100 MHz and with average memory latencies of 100 nanoseconds is equal to the processor utilization in a system with the same architecture but with processors running at 1 MHz and with average memory latencies of 10 microseconds. Therefore we do not have to build the system with the most up-to-date and fastest technology provided we scale memory, interconnection and processor speeds appropriately.
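The example above can be checked by expressing both memory latencies in processor clocks. The following is a minimal sketch; the function name is hypothetical and integer units (MHz and nanoseconds) are used to keep the arithmetic exact.

```python
def latency_in_pclocks(clock_mhz: int, latency_ns: int) -> float:
    """Memory latency expressed in processor clock periods.

    One clock period lasts 1000 / clock_mhz nanoseconds.
    """
    return latency_ns * clock_mhz / 1000


fast = latency_in_pclocks(clock_mhz=100, latency_ns=100)     # 100 MHz, 100 ns
slow = latency_in_pclocks(clock_mhz=1, latency_ns=10_000)    # 1 MHz, 10 microseconds
print(fast, slow)  # 10.0 10.0
```

Both systems see a memory latency of ten processor clocks, which is why their processor utilizations are equal when the architecture is otherwise the same.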
Every component (interconnect, cache, memory and I/O processor) is characterized by two fundamental performance measures: latency and bandwidth. These two measures can be independent. For example, two networks can have the same latency but one may have more bandwidth because it has more links; similarly, the bandwidth of a memory can be increased (while its latency remains the same) by interleaving it.
A convenient unit for all timings is the pclock, the clock period of the processor. In RPM, a pclock is currently eight cycles. This gives the emulator eight cycles to simulate all the activities occurring in one pclock in the target system. To simulate variable latencies, we delay requests. To simulate variable bandwidth of a given resource, we must vary the number of cycles that each request keeps the resource busy.
The flexibility of this adjustment varies with the resource. All on-board data paths in RPM are 32-bit wide and the system bus (Futurebus+) is 64-bit wide. We can emulate one cycle of a data path with 64, 128 or even 256 bits by 2, 4, or 8 cycles of the 32-bit data paths in RPM. The latencies and bandwidths of the first-level cache and of the internal bus are not adjusted. The implication is that their speed scales up proportionally with the speed of processors in all target systems. The Delay Unit (DU) (see Fig. 4 ) simulates variable latencies in the interconnection (emulated by the Futurebus+). As configured, the bandwidth available on the Futurebus+ is very large, in effect implementing an interconnect with infinite bandwidth. To run experiments under limited interconnect bandwidth, we can reduce the width of the bus (by reprogramming the Futurebus+ controllers) and artificially increase the size of each packet. To increase the interconnect bandwidth the number of clocks per pclock can be doubled or even quadrupled.
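The emulation of wider target data paths on the 32-bit on-board data paths reduces to a simple ratio. A minimal sketch, with a hypothetical function name, restricted to target widths that are multiples of 32 bits:

```python
RPM_DATAPATH_BITS = 32  # width of the on-board data paths, from the text


def rpm_cycles_per_target_cycle(target_width_bits: int) -> int:
    """Number of 32-bit RPM cycles needed to emulate one target data path cycle."""
    if target_width_bits <= 0 or target_width_bits % RPM_DATAPATH_BITS != 0:
        raise ValueError("target width must be a positive multiple of 32 bits")
    return target_width_bits // RPM_DATAPATH_BITS


for width in (64, 128, 256):
    print(width, rpm_cycles_per_target_cycle(width))  # 2, 4 and 8 cycles
```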
To illustrate time scaling and virtual memory interleaving, consider the simple case of a miss in the second-level cache such that the block address maps to the local on-board memory (local miss), and the block is uncached elsewhere. Let $T_m$ be the latency of the miss and let $n$ be the degree of memory interleaving so that the memory can deliver $n$ blocks every $T_m$ pclocks.
RPM has a monolithic memory controller, which must emulate the operations of the $n$ parallel memory controllers of the target system. In RPM the memory transaction must busy the controller for $T_m/n$ pclocks by idling it before suspending the miss request. On the other hand, the total latency of the miss in RPM must be $T_m$ pclocks, the same as in the target system. Let $T_p$, $T_i$,
(1) enforces the same latency in RPM and in the target whereas (2) enforces the same utilization 10 MIPS peak, whatever the target system speed is (provided one pclock = eight clocks). I/O latency must be scaled. The service of I/O requests must be delayed as faster processors are emulated. This is done in software, with the support of an interrupt timer.
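Returning to the local miss discussed above, the two requirements stated in the text (controller busy for T_m/n pclocks, total miss latency of T_m pclocks) can be turned into a small calculation. This is a sketch under an explicit assumption that is not taken from the paper: the controller is busy only outside the suspension phase and the phases do not overlap, so that the suspension time is the total latency minus the busy time. The function name is hypothetical.

```python
from fractions import Fraction


def miss_timing(t_m: int, n: int):
    """Return (controller_busy, suspended) in pclocks for a local miss.

    Assumption: busy time and suspension time do not overlap and together
    make up the total miss latency t_m.
    """
    if t_m <= 0 or n <= 0:
        raise ValueError("t_m and n must be positive")
    busy = Fraction(t_m, n)    # same utilization as n interleaved banks
    suspended = t_m - busy     # remainder, so that the total latency is t_m
    return busy, suspended


busy, suspended = miss_timing(t_m=20, n=4)
print(busy, suspended)  # 5 15
```

With a 20-pclock miss latency and four-way interleaving, the monolithic controller would be occupied for 5 pclocks per miss and the request would be suspended for the remaining 15.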
Collecting Performance Data
The primary mechanism to collect performance data involves event counters stored in a special to obtain the required performance data. This counting mechanism can be started and stopped under software control.
FIGURE 6. Generation of addresses for event counting in the first-level cache's count memory
Fig. 6 illustrates how the addresses in count memory are generated for mutually exclusive events in the first-level cache. In this simple example, three signals (programmed in MC1) each correspond to one property of an access in the first-level cache. One signal indicates whether the access is to private or to shared data. The second signal is the read/write signal from the processor and the third signal is high when the access hits and low when the access misses in the first-level cache. The combination of these signals forms a three-bit address, which can be used to address a counter in the area of RAM1 allocated to count memory. In this example there are only eight addresses, but in a practical situation up to 20 signals can be defined in each controller. For instance, one signal may also distinguish between instructions and data, and, in the case of instructions, the opcode could also be a field of the address of the performance counters.
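The address generation for mutually exclusive events can be sketched in software as follows. The bit ordering of the three signals and the class and function names are assumptions made for illustration; the text only states that the three signals together form a three-bit counter address.

```python
def counter_address(shared: bool, write: bool, hit: bool) -> int:
    """Concatenate three one-bit signals into a three-bit counter address.

    Assumed bit order: shared/private (bit 2), read/write (bit 1), hit/miss (bit 0).
    """
    return (int(shared) << 2) | (int(write) << 1) | int(hit)


class CountMemory:
    """Toy count memory: one counter per combination of event signals."""

    def __init__(self, num_signals: int = 3) -> None:
        self.counters = [0] * (1 << num_signals)
        self.enabled = True  # counting can be started and stopped by software

    def record(self, shared: bool, write: bool, hit: bool) -> None:
        if self.enabled:
            self.counters[counter_address(shared, write, hit)] += 1


cm = CountMemory()
cm.record(shared=True, write=False, hit=True)    # shared read hit
cm.record(shared=True, write=False, hit=True)
cm.record(shared=False, write=True, hit=False)   # private write miss
print(cm.counters)  # [0, 0, 1, 0, 0, 2, 0, 0]
```

Because each access maps to exactly one address, the events are mutually exclusive and the counters can be summed afterwards to obtain aggregate figures such as the total number of misses.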
Several other special purpose monitoring strategies can be implemented in all levels of the memory hierarchy. They can be added to the basic count memory mechanism or they can replace it, depending on the amount of hardware required. 
Programming RPM
RPM is programmed by mapping the controllers of the target design into the FPGAs. Currently, we do not have tools to partition automatically a design across multiple FPGAs. The data path and the RTL description of each controller are specified separately in VHDL. Design parameters such as block sizes, cache sizes, latencies and bandwidth are specified as constants in the VHDL descriptions. Nevertheless, the task of re-programming the FPGAs is error-prone and increases the turnaround time for emulating different machines. In the future, we expect that the turnaround time for this design stage will be reduced as behavioral compilers become more widely available. A behavioral compiler is a high-level synthesis tool which accepts an algorithmic description of a circuit to cre-
PERFORMANCE OF RPM
Table 1 shows the slowdown factor, i.e., the speed ratio between the target system and the emulator, for various uniprocessor technologies in the target systems. These slowdown factors can be predicted accurately in our emulation approach and they are independent of the number of processors (from 2 to 8) since the number of processors is the same in RPM and in the target system. Table 1 also shows the time taken by RPM for different execution times of the target. We can reasonably expect to obtain experimental points for realistic workloads for systems with processors of up to 1 GIPS. Some of these experiments would take months to run on a current software simulator and the simulation would have to be significantly simplified and abstracted.
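The slowdown factor discussed with Table 1 can be estimated from the clock rates alone. The sketch below is illustrative and does not reproduce the entries of Table 1. It assumes that the target advances one pclock per target clock period and that RPM runs at 10 MHz with eight clocks per pclock; the function and constant names are hypothetical.

```python
RPM_CLOCK_HZ = 10_000_000   # board clock, from the text
CLOCKS_PER_PCLOCK = 8       # current setting, from the text
RPM_PCLOCK_RATE_HZ = RPM_CLOCK_HZ / CLOCKS_PER_PCLOCK  # 1.25 million pclocks/s


def slowdown_factor(target_clock_hz: float) -> float:
    """Speed ratio between the target system and the emulator."""
    return target_clock_hz / RPM_PCLOCK_RATE_HZ


def emulation_time_s(target_time_s: float, target_clock_hz: float) -> float:
    """Time RPM needs to emulate target_time_s seconds of target execution."""
    return target_time_s * slowdown_factor(target_clock_hz)


print(slowdown_factor(100_000_000))          # 80.0 for a 100 MHz target
print(emulation_time_s(1.0, 1_000_000_000))  # 800.0 s for 1 s on a 1 GHz target
```

Under these assumptions the slowdown depends only on the target clock rate, which is consistent with the statement that it is independent of the number of processors.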
COMPARISON WITH OTHER APPROACHES
Recent breakthroughs in simulation methodology have opened the possibility of efficient, detailed with a complex memory model per second on a 50 MIPS workstation. This low performance is due to multiple factors: the overhead due to event scheduling (e.g., context switching and event list management or activity scanning), the code expansion due to code instrumentation to keep track of target instruction execution times in execution-driven simulators (which was reported to be between 2 and 3 in [6]) or the overhead of decoding and executing instructions in program-driven simulations, the management of timestamps associated with events, the collection of performance data, the semantic gap between hardware mechanisms and their execution on the simulator (the fact that each basic activity which takes one cycle in the hardware of the target takes several instructions to simulate on the host), and the speedup of the target multiprocessor.
To keep simulation times reasonable, the data set sizes of the workload must be drastically reduced. Observations made on the small data set sizes must then be extrapolated to the workload with the actual data set size, a difficult task which has never really been validated.
A common drawback of all simulations is that they abstract the behavior of the target multiprocessor. Many effects are ignored or approximated, on the premise that they are negligible. In some cases some key hardware components and physical events are totally removed from the simulation. For example, it is not uncommon that a simulator avoids simulating the caches and the data transfers among them. The validity of these simplifications is usually not verified and relies on the experience and judgement of the evaluator.
To predict the speedup of RPM with respect to a simulator such as Cache-Mire or Tango, we run the simulation of the target system and measure the simulation time as well as the execution time of the target in pclocks (same as RPM's). Based on such measurements we have observed speedups of between 100 and 1,000 over execution-driven simulations. However, the level of implementation details in RPM is actually closer to a cycle-by-cycle, register-transfer level simulation. In practical cases, such simulations run at the rate of a few cycles per second and the speedup of RPM over these detailed simulators can be up to one million.
Besides simulation, breadboard prototypes are also extensively used in industry to validate a new architecture and in academic research projects to explore novel architectures. Prototypes are fast and faithful to the target system, but their cost is high and they provide little research information besides proving one point. In an academic environment, the building of a prototype is a valuable experience for students in architecture and it helps direct the simulation experiments.
In most projects, however, most of the research results are still derived through simulations.
STATUS AND FUTURE PLANS
Currently, the hardware is up and running and our first emulator (a CC-NUMA-CD under sequential consistency) is in the late debugging phase. We are replacing the Xilinx XC4013 in the controllers with the XC4025. This upgrade will double the number of equivalent gates in each controller. It is possible because the XC4025 is pin-to-pin compatible with the XC4013. FPGA programs for a CC-NUMA with release consistency, for the Scalable Coherent Interface (SCI) protocol and for a COMA are being developed. The I/O capability of RPM is being upgraded with two Gbytes of disk connected to each processor node. As soon as this is done, the total disk space will be 16 Gbytes and we will be able to port applications with distributed I/O. On the software side, we are currently porting some of the SPLASH benchmarks to the first emulator. In parallel, we are starting an effort to port an operating system kernel as well as a database engine to run database applications.
CONCLUDING REMARKS
The emulation methodology described in this paper is very cost-effective for the prototyping of various multiprocessors. In this sense it is an extremely valuable research tool in architecture, especially in an academic environment. Students working on the project gain valuable experience in complex chip designs as well as in system architecture. Each system architecture mapped onto RPM is an actual system design, which a graduate student can complete in a matter of months by re-using the same hardware platform and thus concentrating on the essential parts of the design.
For all its advantages, the current emulator has some weaknesses. The first weakness is the relatively low pclock rate. We were somewhat conservative in our design because of our lack of experience with FPGAs and with FPGA CAD tools. The low clock rate and the simplicity of the board-level design have facilitated the construction of RPM in an academic environment. The total time taken by the design and the construction was 15 months.
The second limitation of the current emulator is the number of processors. This is due to our limited budget. Unfortunately, we have no technical solution to this problem: to investigate larger machines, an emulator with more processors will have to be built. Finally, as for any piece of hardware, the efficiency advantage of an emulator over simulation erodes every year, as faster workstations and personal computers are introduced. Nonetheless, given the current speed of RPM, we expect that it will remain competitive with software simulators for at least ten years.
There are several ways to improve on the speed of the emulator. With current CAD tools and FPGA technologies a more aggressive design could raise the clock speed to 20 MHz. The number of clocks in each pclock could also be cut in half if a processor with an on-chip instruction cache were used to avoid emulating instruction fetches. Such an emulator would emulate 5 million cycles of the target per second and large speedups over software simulation could be expected for large-scale emulators with 128 or 256 processors.
