Abstract
Introduction
Execution driven simulation (EDS) is a powerful method for evaluating computer architectures. EDS is particularly useful for deriving accurate results for multiprocessor systems, since the precise sequence of instruction executions on the parallel processors may vary as the simulated design varies. We have developed Cerberus, an EDS simulation system, to provide accurate cycle-by-cycle instruction simulation with a low degree of slowdown relative to untraced native execution.† Cerberus is an extremely flexible system for simulating programs targeted for MIMD machines. Many other EDS tools allow a trade-off of accuracy for speed. Cerberus provides the highest level of simulation accuracy with speed comparable to tools using less accurate simulation modes that run on uniprocessor workstations.
† The work presented here has been supported in part by the State of California under the MICRO program, Sun Microsystems, Toshiba Corporation, Fujitsu Microelectronics, Cirrus Corporation, Microsoft Corporation, Quantum Corporation, and Sony USA Research Laboratories. Partial support was also provided by Siemens A.G., which supported Jeff Rothman during some of this work.
There are a number of methods for studying multiprocessor machine behavior. Hardware can be electronically monitored and workloads run directly [26] . Synthetic models can be created to generate artificial reference streams [28] . Trace driven simulation (TDS) has been widely used, with a wide variety of methods for collecting traces [23, 29] . EDS is gaining wide usage for accurate simulation of new architectural models. EDS varies from the other methods of modeling by linking the architectural simulator directly to the trace instruction generator, without the intermediate step of storing a set of (invariant) traces. This allows interaction between the hardware model and the software. A trace generation system using EDS has the advantage of simplicity of use, low slowdown, small disk space requirements, and accuracy with easy system parameter reconfiguration. EDS does have the disadvantage that the trace consumer (e.g., a cache simulator) must be run on the same system as the trace generator. Thus, if the tracer runs on a slow machine, the trace consumer must do so as well.
The execution time overhead of a trace generation system is also an important factor to consider. For a system which studies multiprocessor designs by running the simulator on a uniprocessor, the most important speed optimization is incorporating the parallel trace generator, processor thread scheduler and cache simulator into a single process. By utilizing these optimizations with careful design and refinement through profiling, we have created in Cerberus a very efficient system for investigating the properties of multiprocessor hardware and software which does not trade accuracy for speed. Programs are easy to create for it, using simple SMP primitives to specify parallelism. It uses low-level cycle-by-cycle simulation to model the interactions between simulated processors very accurately. It uses assembly language routines for commonly called simulation functions, which reduces simulation slowdown. And it has a very simple trace generation interface, allowing users to easily add cache simulators, code profilers or other statistics gathering tools.
The remainder of this paper is structured as follows: Section 2 provides background on multiprocessor tracing methodology and an overview of some of the most recent tools developed for tracing memory accesses. Section 3 briefly summarizes the implementation details of Cerberus and some of the difficulties that arose during its creation. The performance evaluation of Cerberus is found in Section 4. In Section 5 we present our conclusions. This paper is a condensation of [19] , which discusses all of the issues considered here in significantly more detail.
Summary of Trace Generation Methods
A variety of methods exist for deriving trace information from workloads. We provide a brief summary of the methods here; see [29, 19] for a more detailed description of the different approaches. Table 1 provides some details of each of the various methods, specifying whether the method could only collect traces for the target architecture (trace driven simulation or TDS) or if the trace information could be piped or used directly with a cache simulator (execution driven simulation or EDS, also referred to as program driven simulation). In addition, we specify the hardware on which it can be run and the approximate slowdown compared to running the workloads under test in native and/or uniprocessor mode.
Hardware approaches to collecting traces for multiprocessors have included adding I/O cards to capture transactions on the memory bus [32, 30] and modifying the microcode to capture memory references [22] . OS system calls have also been used to capture traces, such as the Unix debugging command Ptrace, to cause an interrupt to a reference collection routine for each instruction of a user-level program [13] . Modification of error correcting code (ECC) bits was used to allow interrupts on just the shared memory references in [17] , which then invoked a cache simulator subsystem.
Software methods have been the most widespread means for collecting and using traces. These approaches can be generally broken down into interpreters (native code running on top of a simulator) and code modifiers (augmented code which allows traces to be collected). The interpretive methods have been used to simulate 68000 execution [6] and the MIPS R3000 instruction sets [31] .
Augmenting source or object code has been used to trace code running on parallel machines or to simulate parallel execution on uniprocessor workstations. The former approach has been used to add minimally intrusive instructions to the low-level code, writing records to memory buffers [9, 1, 25] . Modifying code and adding routines to simulate the execution of a parallel machine on a uniprocessor machine has been the most popular method [5, 7, 4, 8, 3] . Our trace simulator falls into this category.
Implementation of Cerberus
At the time we were developing Cerberus, the existing tools for generating traces were slow, inexact with respect to individual processor timings, and/or difficult to use or interface with user created cache simulators. We designed our tool to overcome these problems and to simulate multiprocessor program execution as quickly and efficiently as possible. The goals we set out to achieve in creating our tracing tool were: (1) create a system for uni- and multiprocessor simulation and instruction and address trace generation; (2) have an efficient system that would be able to simulate billions of instructions in a reasonable amount of time; (3) support more than one model for expressing shared-memory parallelism; (4) be able to attach modules easily (such as a cache simulator or code profiler) to consume traces on the fly, avoiding the use of massive amounts of disk space to store traces; (5) be able to link a simulator to the trace generator in one UNIX process to minimize operating system context switching; (6) accurately estimate execution time by simulating all user code and system libraries using instruction-level simulation (i.e., processor thread switching after each instruction). The following sections describe basic features of the Cerberus system. A much more detailed description can be found in [19].
R3000 Architecture Characteristics
The original target of Cerberus was the DEC 3100/5000 series of machines, which use the MIPS R2000/3000 RISC microprocessors. The R3000 instruction set employs a fixed 32-bit format with 3 main format types. This makes it easy to disassemble and decode for modification purposes. There are some characteristics that cause difficulties (discussed in Section 3.5 and [19] ). One of the most interesting "features" of the MIPS architecture is the partial exposure of the 5 stage instruction pipeline to the programmer. To allow exploitation of pipeline delays caused by certain operations, delay slots are associated with these instructions. Any branch or jump instruction is followed by a delay slot instruction that is executed during the bubble in the instruction pipeline caused by the change in control flow. The delay slot instruction is executed regardless of whether the branch is taken. As part of the code modification process detailed below, the instruction in the delay slot is moved to the position before the branch or jump with which it is associated, taking care to make sure that the results of the delay slot instruction have no effect on the branch. This instruction movement can cause other difficulties when the delay slot is the target of a branch (which some optimizers can introduce into the code). Load instructions are also followed by a delay slot instruction, which cannot have a dependency on the register being loaded. This load delay causes no major difficulties, but must be observed in the modification process and in the hand-coded assembly routines. One of the chief goals for Cerberus was to be able to run a multiprocessor trace generator on a uniprocessor workstation using a standard C compiler. Figure 1 shows the steps necessary to accomplish this task. 
This process begins by running the parallel source code through a series of macro packages and lexical analyzers, which generate suitable new
Code Modification
Opcode               Explanation
addu r2,r3,r4        Add unsigned r3 and r4, results put in reg. r2
jal routine          Jump to routine, place return address in r31
lw r2,-5000(r10)     Load the word at address r10-5000 into reg. r2
nop                  Do nothing for one cycle
sw r10, 16(r2)       Store the word in r10 into the address r2+16
Each instruction in the original unmodified object code is turned into a block of eight instructions in the modified object code. Not all the instructions in the block are used in all cases, but all use the first few instructions of the block of eight to call the address collator (indirectly through an assembly language "staging" function). If there are still active threads (simulated processors) to process during the current cycle, the thread scheduler is called. When all the addresses have been collected for the current time step, the address collator calls the cache simulator (or other tool module). The scheduler returns control to a particular thread only when this simulated processor is ready to proceed, after any (simulated) memory or other delay. Figure 2 shows an example of augmentation, demonstrating how an instruction from the unmodified object code gets augmented to call the simulator, load the necessary simulated register information, perform the operation, and save the resulting state (with the MIPS instructions explained in Table 2). The call to pSIMstep is a call to an assembly language "staging" function which in turn calls the C language portions of the simulator. This staging function sets up for a call to the C-based collator/simulator routines, saving simulated processor state and preparing the instruction and data (if appropriate) addresses for the simulator to process using the MIPS standard function calling conventions [12]. Not all augmented instructions call pSIMstep; some call one of a number of similar hand-coded assembly language routines to handle special cases, such as load-store operations, locking, floating-point instructions and other more complicated operations.
The assembly language routines call one of three C language routines, which handle instruction addresses (SIMstep), instruction and data addresses (SIMmem), or locking operations (SIMlock). These three routines collate the memory addresses, call the cache simulator or other tool once each time step, then call the scheduler to return control to one of the active threads. At some point, control is returned to each of the threads that are active, on a round-robin basis.
Once the scheduler has returned control to the thread, the rest of the eight instruction block is run, including the direct execution of the actions performed by the simulated instruction. The value returned by the scheduler to the thread in register r2 is the base address of the context block for that particular thread. All the saved context for a thread is accessed as an offset from r2. For example, each simulated integer register rx can be found at memory address $\mathtt{r2} + 4 \times x$. Simulated floating-point registers fy are found at memory address $\mathtt{r2} + 4 \times y + 128$. Other information, such as the program counter and status control registers, is found at higher offsets from r2.
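The context block addressing described above can be illustrated with a small sketch. It assumes 4-byte register slots, with the 32 integer registers stored first and the floating-point registers starting at byte offset 128; the function names are hypothetical and are not taken from Cerberus.

```python
# Sketch of context-block addressing (hypothetical names, assumed layout).
WORD_BYTES = 4                         # R3000 registers are 32 bits wide
NUM_INT_REGS = 32
FP_BASE = WORD_BYTES * NUM_INT_REGS    # assumed start of FP registers: 128


def int_reg_offset(x):
    """Byte offset of simulated integer register rx from the block base."""
    if not 0 <= x < NUM_INT_REGS:
        raise ValueError("integer register number out of range")
    return WORD_BYTES * x


def fp_reg_offset(y):
    """Byte offset of simulated floating-point register fy from the base."""
    if not 0 <= y < 32:
        raise ValueError("floating-point register number out of range")
    return WORD_BYTES * y + FP_BASE


if __name__ == "__main__":
    base = 0x10000  # stand-in for the value the scheduler returns in r2
    print(hex(base + int_reg_offset(3)))   # address of simulated r3
    print(hex(base + fp_reg_offset(2)))    # address of simulated f2
```

With this layout the register file alone occupies 256 bytes, so the program counter and control registers would sit above that.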
Register r2 was chosen for this purpose because the MIPS calling conventions specify that r2 contains the return value from function calls. For uniprocessor mode operation, this allows the C simulator functions to return control to a thread using a simple C return instruction. This has the effect of making it appear as though the augmented program makes simple calls to the simulator, which then returns as a function call should. In the case of a multiprocessor simulation, the paradigm is somewhat different than standard function calls, in that control jumps around between different parts of the augmented code and the scheduler, and some scheduler related function calls never return, or at least not as expected in a normal program. The changes of control act more as gotos with passed parameters than proper function calls. An assembly language routine psched is called by the scheduler to change the function call paradigm into the appropriate changes of control for the whole simulation.
Memory Model
Parallel programs begin with a single thread, which then creates the other threads after an initialization phase in the program. Cerberus supports two programming models, the Sequent model [20] in which N threads are forked off at once (m_fork), and the s_fork model, which forks off one thread at a time. Much of the parallelism is created by the use of special functions in the original source code, such as special fork functions (m_fork and s_fork) to create multiple processor threads. These functions, as well as synchronization (locks and barriers), are handled by function calls to the thread library at runtime.
During runtime, when a "fork" routine is called the first time, each new thread is provided with its own copy of the entire data space (except code and read-only data) when it is created. One additional data space is shared by all threads, and holds the shared variables, as shown in Figure 3 . Cerberus ensures that all references to shared variables are directed to this shared address space. In the Sequent model, all variables that are not explicitly designated to be shared with the shared type qualifier are private. This includes global variables as well as automatic (stack) variables. Therefore it is necessary to provide each processor with its own (private) copy of the memory space. There also needs to be a shared memory space, for which purpose we use the memory space created by the compiler. In addition, memory space is set aside to allow dynamic memory allocation, keeping the shared heap in proximity to the shared memory space. Likewise, the local heaps and stacks for each processor are contiguous with each processor's local memory space.
The "lightweight threads" model is also supported by Cerberus, which is used by the SPLASH-2 applications suite [33]. In this model, all global variables are implicitly shared; only the stack variables are private. Use of this model is specified by passing a command to the runtime system at start-up, which causes the threads to use the shared memory for all global variables.
... instruction (and data) addresses accessed during that cycle. Some threads may not be available for scheduling, due to the ability to stall individual threads for cache misses or at barriers. Once all the memory references are collated for a time step, the information is passed on to the cache simulator (or other tool). The cache simulator is called even if all threads are stalled, to tell the cache simulator to advance its clock and perform shared bus/memory system operations, which will eventually cause the simulated processors to become available again. When each thread has completed execution of the function it was assigned, it is deactivated. When all threads have been deactivated, the system returns to single processor mode and finishes the program.
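The per-time-step flow described above (run each available thread, collate its addresses, then call the cache simulator even when every thread is stalled) can be sketched as follows. This is a toy model under stated assumptions, not the Cerberus scheduler: threads are Python iterators yielding one address per simulated instruction, and `ToyCache`, `run`, and the fixed miss penalty are hypothetical.

```python
# Toy model of cycle-by-cycle scheduling with a stub cache (all names assumed).
class ToyCache:
    """Stalls a processor for a fixed number of steps on a first-touch miss."""

    def __init__(self, nprocs, miss_penalty=2):
        self.nprocs = nprocs
        self.miss_penalty = miss_penalty
        self.seen = [set() for _ in range(nprocs)]
        self.stall = [0] * nprocs
        self.log = []                       # (cycle, refs) for every call

    def step(self, cycle, refs):
        # Advance the clock: stalled processors count down toward ready.
        for pid in range(self.nprocs):
            if self.stall[pid] > 0:
                self.stall[pid] -= 1
        for pid, addr in refs:
            block = addr >> 4               # 16-byte blocks
            if block not in self.seen[pid]:
                self.seen[pid].add(block)
                self.stall[pid] = self.miss_penalty
        self.log.append((cycle, list(refs)))
        ready = 0                           # bit vector of schedulable procs
        for pid in range(self.nprocs):
            if self.stall[pid] == 0:
                ready |= 1 << pid
        return ready


def run(threads, cache_step):
    """Round-robin over threads; one cache_step call per simulated cycle."""
    n = len(threads)
    active = [True] * n
    ready = (1 << n) - 1
    cycle = 0
    while any(active):
        refs = []
        for pid in range(n):
            if active[pid] and (ready >> pid) & 1:
                try:
                    refs.append((pid, next(threads[pid])))
                except StopIteration:
                    active[pid] = False     # thread finished: deactivate
        ready = cache_step(cycle, refs)     # called even if refs is empty
        cycle += 1
    return cycle


if __name__ == "__main__":
    cache = ToyCache(2)
    total = run([iter([0x100, 0x104]), iter([0x200])], cache.step)
    print("simulated cycles:", total)
    for entry in cache.log:
        print(entry)
```

Returning a bit vector from the cache step keeps the decision about which processors may proceed inside the memory-system model, which is where stall information lives.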
Scheduling
To support simulation of parallelism on a uniprocessor workstation, it is necessary to provide the appearance of multiple code streams executing simultaneously and to support the illusion of shared and private data spaces. Supporting both of these requires a 200+ byte context block per processor to keep track of the state of each simulated processor, such as integer and floating-point register values, control register values, special private and shared memory pointers, and simulation state. The context block is the key item through which modified code and simulated processors interact to provide the semblance of parallelism.
To be able to simulate multiple threads of execution on a uniprocessor workstation, a simulator has to provide some sort of a thread package with a simulated processor scheduler, or rely on the host machine's operating system for scheduling the simulated processors. With a user-level threads package using instruction-level task switching granularity (which Cerberus uses), each simulated instruction appropriately loads and stores the registers with which it interacts and returns back to the simulator. All instructions that read a register (or registers) must load that register from the context block (except for register r0, which is always 0). For many instructions, particularly ALU instructions, two registers are read, which requires two values to be read from the context block for those simulated instructions. Any register that is modified during the instruction (typically one register) must have its value copied to the context block at the end of that simulated instruction.
Loading and storing state each cycle provides the ability to switch processor threads after execution of a single instruction. Some simulators switch processors on a basic block granularity [5, 3] , which involves calling the scheduler at the beginning of each basic block and requires loading and storing all the affected simulated registers at the beginning and end of each basic block.
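To make the per-instruction load and store traffic concrete, the sketch below emulates a single addu against a context block held in a byte array. It is an illustration only: the helper names, the 256-byte block size, and the little-endian slot encoding are assumptions, and the real system performs these steps in augmented MIPS code rather than in a high-level language.

```python
# Sketch: one simulated "addu rd, rs, rt" using a per-thread context block.
import struct

CTX_BYTES = 256  # assumed size of one context block for this sketch


def load_reg(ctx, base, n):
    """Read simulated integer register n; r0 is hardwired to zero."""
    if n == 0:
        return 0
    return struct.unpack_from("<I", ctx, base + 4 * n)[0]


def store_reg(ctx, base, n, value):
    """Write simulated integer register n; writes to r0 are discarded."""
    if n != 0:
        struct.pack_into("<I", ctx, base + 4 * n, value & 0xFFFFFFFF)


def sim_addu(ctx, base, rd, rs, rt):
    """Two loads from the context block, one add, one store back."""
    a = load_reg(ctx, base, rs)
    b = load_reg(ctx, base, rt)
    store_reg(ctx, base, rd, a + b)       # wraps modulo 2**32 like addu


if __name__ == "__main__":
    memory = bytearray(2 * CTX_BYTES)     # context blocks for two threads
    t0, t1 = 0, CTX_BYTES
    store_reg(memory, t0, 3, 5)
    store_reg(memory, t0, 4, 7)
    sim_addu(memory, t0, 2, 3, 4)         # thread 0: r2 = r3 + r4
    sim_addu(memory, t1, 2, 0, 0)         # thread 1 can run next: r2 = 0
    print(load_reg(memory, t0, 2), load_reg(memory, t1, 2))
```

Because all state is written back after every instruction, the scheduler can hand control to a different thread between any two calls.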
Synchronization operations provide an opportunity for scheduling optimization. When processors reach a barrier, they are descheduled until all processors reach that barrier. This provides a means of reducing simulation overhead and eliminating redundant spin-wait traces. A further optimization (which we have not investigated) is to also deschedule processors waiting for locks, reactivating them when they acquire the lock.
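The barrier optimization can be pictured with a short sketch: a thread that arrives at the barrier is removed from the runnable set, so it generates no spin-wait references, and the last arrival reactivates everyone. The `Barrier` class and the use of a Python set for the runnable threads are hypothetical choices for illustration.

```python
# Sketch of descheduling at a barrier (hypothetical names and data layout).
class Barrier:
    def __init__(self, nprocs):
        self.nprocs = nprocs
        self.waiting = set()

    def arrive(self, pid, runnable):
        """Deschedule pid; release all threads once the last one arrives."""
        runnable.discard(pid)
        self.waiting.add(pid)
        if len(self.waiting) == self.nprocs:
            runnable.update(self.waiting)   # reactivate every waiter
            self.waiting.clear()
            return True
        return False


if __name__ == "__main__":
    runnable = {0, 1, 2}
    barrier = Barrier(3)
    for pid in (0, 2, 1):
        released = barrier.arrive(pid, runnable)
        print(pid, released, sorted(runnable))
```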
Not all EDS methods use the low-level scheduling and fine granularity thread context switches that Cerberus uses. Some methods use a UNIX process per simulated processor, which avoids user-level scheduling and explicit saving of simulated processor status [1, 7] ; UNIX process switching, however, incurs a rather high overhead. Synchronization between processors is performed at varying levels of granularity, which can be controlled by the user by specifying the accuracy/slowdown trade-off they are willing to make. Another method inserts calls to the simulator into the C code before compilation, which is used to switch processor contexts when an interesting event requiring synchronization occurs [8] . Since the compiler takes care of saving state between function calls, little explicit effort needs to be made to keep track of each processor's state.
Summary of Implementation Difficulties
This is a short summary of the implementation difficulties we encountered while creating our tool. A more detailed version with our solutions to the problems can be found in [19] . Many of the difficulties have little to do with multiprocessor simulation, but were encountered while trying to simulate the uniprocessor workloads, such as SPEC92 or the usual systems libraries.
When trying to modify FORTRAN code, it was found that the FORTRAN compiler puts read-only data into the text section. This data has to be detected so that it is not mistaken for code and modified.
A pair of functions implementing non-local gotos allowing jumps out of and into the middle of functions (setjmp and longjmp) requires that special state be saved in order to save and restore state the way those functions expect.
Low-level memory allocation functions sbrk and brk need to be intercepted, because both the modified code and the simulation package have their own versions of these functions. This can lead to inconsistent pointers to the top of memory, which can cause segmentation faults.
Branches must reach eight times as far in the modified code, since the code is expanded by a factor of eight. On rare occasions the branches cannot reach far enough, so measures had to be taken to substitute jumps for branches.
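A rough calculation shows why the branch reach shrinks. MIPS conditional branches encode a signed 16-bit offset counted in instruction words, so the sketch below checks whether an original offset still fits after eightfold expansion. It assumes the offset scales by exactly eight, which ignores where within the eight instruction block the branch actually sits; the function names are hypothetical.

```python
# Sketch: does a branch still reach its target after 8x code expansion?
EXPANSION = 8
BRANCH_MIN = -(1 << 15)       # most negative 16-bit signed word offset
BRANCH_MAX = (1 << 15) - 1    # most positive 16-bit signed word offset


def expanded_offset(orig_offset_words):
    """Offset in the modified code, assuming exact scaling by eight."""
    return orig_offset_words * EXPANSION


def branch_still_reaches(orig_offset_words):
    """True if the scaled offset still fits in the branch's 16-bit field."""
    return BRANCH_MIN <= expanded_offset(orig_offset_words) <= BRANCH_MAX


if __name__ == "__main__":
    for offset in (100, 4095, 4096, -4096, -4097):
        print(offset, branch_still_reaches(offset))
```

Under this simplification, only branches spanning more than about 4096 original instructions need to be replaced by jumps, which agrees with the observation that the problem is rare.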
Branch delay slots in the MIPS machine language are very difficult to deal with in certain cases. Some optimized code makes the delay slot instruction the target of other branches. This situation effectively makes the delay slot instruction part of the overlap between two basic blocks. It is necessary to be able to detect during run-time whether the delay slot instruction is being executed at the end of a basic block or at the beginning of the next basic block.
It was found when porting the simulator between different machines using the same CPU that the C compilers defined program symbols in inconsistent ways, and typically at variance with the official MIPS specifications [15] . C macros were analyzed to detect which compiler was being used.
Running the simulator on different machines (with identical operating systems) or under the standard debugger would often give slightly different results in terms of miss ratios. This was due to slightly different values that the stack pointer was assigned by the different machines at runtime.
Some of the C compilers ignore the volatile type qualifier, which is used to force the code to read values from (shared) memory instead of caching those values in registers. This requires using more recent compilers which implement volatile. Sometimes it is necessary to insert extra volatile qualifiers in front of variables which are used for spin-locks.
Performance
To determine the overhead of simulation, we compared the user CPU time of the uniprocessor version of the programs, as computed by /bin/time, with the time using the stub cache simulator in uniprocessor mode (Table 3), and with the time in multiprocessor mode using both a stub simulator and a full coherent cache simulator (Table 5). As will be shown, an average slowdown of 31 is observed for the workloads tested by the simulator in uniprocessor mode, and a 40 to 50 times slowdown for multiprocessor operation simulating just the stub (no cache simulator) but with synchronization operations supported.
Workloads
The four programs used for performance measurements come from the first SPLASH suite [21], which are regularly used in our research. These workloads consist of: MP3D: a hypersonic rarefied fluid flow simulation, using Monte Carlo methods; LOCUS: a commercial quality VLSI standard cell router; OCEAN: a simulation of large-scale ocean movements based on eddy and boundary currents; and WATER: a measurement of the forces and potentials involved over time among water molecules in motion. The problem size and characteristics of the workloads we used can be found in Tables 3-5.
[Table: Simulation Characteristics without Parallelism, listed by program]
Table 4. Number of total memory references for the workloads increases as the number of processors increases. [Column: Memory Refs. with Parallelism]
Measurement of Overhead
To measure the overhead due to the trace generation process, we compared the user times to run the unmodified program (with a single processor) to the modified code with a stub memory interface. The first set of measurements (Table 3) uses the m4 NULL (uniprocessor) ANL macros [14] supplied by Stanford University, which eliminate all parallel constructs from programs. This removes locks, barriers, and the parallel fork mechanism. Without using parallel scheduling, Cerberus's scheduler is a much simpler routine which has low overhead.
The difference in slowdowns among the various workloads (seen in Table 3 and further down in Table 5) is due to the instruction mix of each program. As will be explained in Section 4.4, floating-point instructions require about 40 instructions to emulate in the worst case, yet because many of them require multiple cycles to execute, the slowdown for a floating-point instruction can be relatively small. The LOCUS workload has no floating-point instructions, whereas MP3D, OCEAN, and WATER have 9.7, 25.6 and 17.2 percent floating-point operations of instructions executed, respectively. In addition, these programs have varying mixes of fixed point arithmetic and memory operations, which can cause their overheads to vary.
Simulation Slowdown
Using a single workstation and the user CPU time from the UNIX /bin/time command, we measured the execution time for the sequential version of our workloads run natively on a workstation and for the simulated parallel versions. These measurements were all taken on a DEC 5000/125. Table 5 shows the ratios of the runtimes for all the simulations in comparison to the native sequential version of the workloads. The average slowdown with the stub simulator attached for small amounts of parallelism is around 45-50; with a complex cache simulator attached simulating a cache coherency protocol [18] (similar to the Illinois protocol [16] ) with 16 byte blocks and 16 Kbyte split instruction and data caches per processor, the average slowdown ranges between 920 and 1030. Note that we are not claiming that our cache simulator is as efficient as some more tightly integrated EDS-cache simulator tools; rather our point is that the EDS portion of the simulation only requires about 5 percent of the execution cycles.
One reason the simulations slow down with more processors is the increase in the number of references (Table 4 ). OCEAN in particular shows a large increase in the number of addresses generated, which naturally causes the simulation to slow down, particularly with a cache simulator in use. In addition, increasing the number of processors (threads) increases the size of the working set perceived by the host workstation, causing the system to slow down due to cache misses and page faults.
Instruction Overhead
The modification process is designed to minimize the amount of overhead for the most common instructions while fitting all the necessary state preserving operations into the 8 instruction block. However, a number of instructions cannot be handled under those restrictions and require additional routines to aid in saving and restoring simulated processor state. For example, floating-point instructions have approximately twice the overhead of integer instructions, due to the effort required to load all of the registers involved. In the worst case it must load four floating-point registers (two double precision registers) and the floating-point control register. To make the call to the FP loading routine fit into the eight instruction limit, some decoding must be done to the argument passed to the special assembly language loading routine. The routine determines what precision must be handled, and which registers must be loaded. For example, an integer instruction with no special cases to handle (such as the add instruction in Figure 2) requires the eight instruction block in the code and 12 instructions in assembly language to handle the interfacing with the C code routines. Floating-point operations require the eight instruction block plus 32 instructions to load two floating-point registers (27 for one register). However, many floating-point instructions require multiple cycles to execute. For example, a double precision multiply takes 5 cycles to execute; a double precision divide takes 19 cycles.* Code with a high density of floating-point instructions will generally show less slowdown than pure integer code.
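The argument that floating-point code slows down less can be checked with a little arithmetic on the figures quoted above. The sketch below divides the host instructions spent in the block and its assembly helper by the number of native cycles the simulated instruction takes. It deliberately ignores the C-level collation and scheduling work, so the results are relative indicators rather than measured slowdowns; the function and constant names are hypothetical.

```python
# Sketch: host instructions per simulated native cycle, from the quoted counts.
BLOCK = 8                 # instructions in every augmented block
INT_HELPER = 12           # assembly instructions for a plain integer op
FP_HELPER_TWO_REGS = 32   # assembly instructions to load two FP registers


def host_instr_per_cycle(helper, native_cycles):
    """Block plus helper instructions, spread over the native latency."""
    return (BLOCK + helper) / native_cycles


if __name__ == "__main__":
    print("integer add     :", host_instr_per_cycle(INT_HELPER, 1))
    print("double multiply :", host_instr_per_cycle(FP_HELPER_TWO_REGS, 5))
    print("double divide   :", host_instr_per_cycle(FP_HELPER_TWO_REGS, 19))
```

The integer case costs 20 host instructions for one native cycle, while the double precision multiply costs 40 host instructions spread over 5 native cycles, which is why a floating-point heavy mix lowers the overall ratio.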
The simulation overhead for each workload depends upon the mix of instructions to be simulated. Some instructions have less overhead than normal integer instructions. Load-store instructions have lower overhead per memory reference because the additional overhead for capturing the data address is small. Branch and jump instructions also have slightly lower overhead than normal instructions, because the branch often takes place in the middle of the block of eight instructions, skipping several of the nops used to pad-out the block to the proper length. In addition, some instructions use register r0 (which always has value 0) or have only 0 or 1 operands. It is then possible to complete those operations in less than 8 instructions (such as nops from the original code, which require only 4). In cases where the operation can be performed with few instructions, an optimization is made whereby a branch is inserted to skip over the nops which are used to pad-out the block.
Simulation Overhead
To determine and measure where the simulator spends its time, Pixie [24] was used to profile the simulation executable. Among other statistics, Pixie is able to determine the number of instruction and data references for the simulation and the cycles spent in each function. Table 6 was derived by grouping related functions together and summing the instructions executed in the functions. The largest single source of overhead (data collation is spread over several functions and also includes the single thread scheduler) is the multi-thread scheduler. During execution, it cycles through the processors, scheduling each one that is not stalled. When all the available processors have been run, the cache simulator is called with the address information generated by the active processors. The return value is a bit vector of the processors which can be scheduled the next step. The process of determining which processors (threads) to schedule, calling the cache simulator and advancing the time step requires approximately 20-40 instructions per call to the scheduler. The number of instructions it takes to find the next processor to schedule falls with the number of processors simulated, so that the simulator actually executes fewer instructions to run the whole simulation with more processors (for well behaved workloads). As the number of processors increases further, the working-set size of the simulation increases sufficiently to bog down the host system due to cache misses and page faults.
* The latency of these operations on the R3000 can be partially hidden until the results are required. Our simulations, however, assume that instructions are executed serially.
Table 5. Slowdown ratios of simulated vs. native execution of workloads with parallelism support (ratio of simulation time to native runtime).
Table 6. Measurement of overhead in the simulator with a minimal tool stub, 4 processors (percentage of cycles in simulation).
Figure 5 shows the average user time for each workload spent by the simulator for each memory reference generated, with a stub simulator attached, and with a shared bus coherent cache simulator. These times per reference are quite low: corresponding numbers for a uniprocessor simulator were reported in the 200-400 microsecond per second range for a DEC 5000/240 in [10]. We note that the trace generation time is insignificant compared to the trace consumption (cache simulation) time (around 5 percent).
As the number of processors increases, the execution time decreases slightly up to 4 processors and then increases with additional processors. This is particularly noticeable for the cache simulations. The decrease is due to factors such as more efficient scheduling of processors and the increase in total simulated cache space, which reduces the miss ratio (misses are costlier than hits to simulate). The increase in execution time beyond 4 processors is due to the growing working-set and memory space requirements on the host system and the growing communication and synchronization needs of the coherent cache simulation. In addition, the system time for larger simulations increases as the demand on the virtual memory system grows with more processors.
Table 7 shows a comparison of the processing time per reference for a few cache simulators. All of the simulations were run on a DEC 5000/125. The timings were computed using the user time from the ULTRIX utility /bin/time. The input to each TDS simulator is a trace file generated from MP3D using 10 steps of 1000 particles in the test geometry. Our cache simulator computes timing information and simulates bus operations cycle-by-cycle, maintaining a write buffer and performing complex coherency operations, including caching of locks. It also interacts with the memory reference generator to stop the flow of addresses during cache stalls. Each processor maintains fully-associative data and instruction caches. The MPSIM simulator is based on the work in [27], which simulates a range of cache sizes by using stacks. It does not simulate bus timing. The other cache simulator is Dinero, a public domain uniprocessor cache simulator by Mark Hill. Dinero was evaluated using both direct-mapped and fully-associative configurations.
Table 7. Comparison of Cerberus attached to a fully-associative cache simulator (using an adaptive coherence protocol) with various TDS cache simulators.
Also included in Table 7 are the time it took just to generate the traces and the trace file sizes. The trace format uses 6 bytes per entry, with 1 byte for the processor number, 1 byte for the reference type, and 4 bytes for the memory address. The traces were generated using Cerberus with a utility to dump the traces to disk. An interesting point to note is that it takes Cerberus approximately 2 microseconds for each address reference generated (Figure 5), but 13 to 15 additional microseconds to write the information to disk (Table 7), which shows that disk operations heavily dominate trace generation time. A filter program was used to turn the trace files into a format suitable for Dinero, which takes ASCII input of the form "type address" (the processor number is not necessary). All of the simulations used cache configurations that were as similar as possible: 16-byte blocks and, where they could be specified, 16K-byte instruction and data caches. Of the parallel cache simulators, MPSIM used the Illinois coherency protocol, and our Cerberus-based cache simulator used an adaptive invalidation protocol similar to Illinois [18]. The results show that a cache simulator using Cerberus has speed comparable to (and often better than) TDS-based simulators and scales well in multiprocessor mode.
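The 6-byte trace entry and the filter step for Dinero described above can be illustrated with a short sketch. The field widths (1 byte processor number, 1 byte reference type, 4 bytes address) are from the text; everything else is an assumption made for illustration: the field order within the entry, the big-endian byte order, the numeric reference-type codes, the hexadecimal rendering of the address, and the helper names `pack_entry` and `to_dinero_lines`.

```python
import struct

# One trace entry: processor number (1 byte), reference type (1 byte),
# memory address (4 bytes). Field order and big-endian byte order are
# assumptions; the source gives only the field widths.
ENTRY = struct.Struct(">BBI")


def pack_entry(proc, ref_type, addr):
    """Encode one trace entry as 6 bytes (hypothetical helper)."""
    return ENTRY.pack(proc, ref_type, addr)


def to_dinero_lines(trace_bytes):
    """Convert packed entries to ASCII "type address" lines.

    The processor number is dropped, since a uniprocessor simulator does
    not need it. Writing the address in hexadecimal is an assumption.
    """
    for _proc, ref_type, addr in ENTRY.iter_unpack(trace_bytes):
        yield f"{ref_type} {addr:x}"
```

For example, packing two entries with `pack_entry(0, 2, 0x1000)` and `pack_entry(3, 1, 0xDEADBEEF)` gives 12 bytes, and `to_dinero_lines` yields one text line per entry. At 6 bytes per entry, a trace of N references occupies 6N bytes on disk before filtering.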
Design and Performance Trade-offs
In the process of designing the simulation tool, we had to make decisions about trading accuracy for speed. Our approach was to use the minimum granularity possible for switching between threads, while attempting to minimize the slowdown. Other EDS simulators have chosen a coarser grain of context switching: some switch at basic block granularity [5, 3], and others at user-determined levels of granularity [7, 8]. Proteus [8] in particular modifies the C language source code only for certain (shared) memory references; because the processors synchronize time only at shared memory points, each processor can have a different value of the global time at any given point in the program. Our method of task switching on each instruction is the most accurate, yet is no slower than the other, less accurate, emulators. Considering all the other management features that must be handled by our simulator, such as scheduling processor threads, our simulator does very well in comparison to other simulators that can be run on single-processor workstations.
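The effect of switching granularity on the observed order of memory references can be seen in a small sketch. This is an illustration only and assumes a simple round-robin with no stalls; it does not reproduce the scheduling of Cerberus or of any of the cited simulators. Each simulated thread is modeled as a list of basic blocks, each block a list of reference labels, and the function names are hypothetical.

```python
def per_instruction(threads):
    """Round-robin, switching threads after every instruction."""
    flat = [[ref for block in t for ref in block] for t in threads]
    order = []
    for i in range(max(len(f) for f in flat)):
        for f in flat:
            if i < len(f):
                order.append(f[i])
    return order


def per_basic_block(threads):
    """Round-robin, switching threads only at basic block boundaries."""
    order = []
    for i in range(max(len(t) for t in threads)):
        for t in threads:
            if i < len(t):
                order.extend(t[i])   # the whole block runs without a switch
    return order


# Two toy threads: p0 has blocks [a0, a1] and [a2]; p1 has [b0] and [b1, b2].
p0 = [["a0", "a1"], ["a2"]]
p1 = [["b0"], ["b1", "b2"]]
```

Here `per_instruction([p0, p1])` produces `a0, b0, a1, b1, a2, b2`, while `per_basic_block([p0, p1])` produces `a0, a1, b0, a2, b1, b2`. A cache or bus model fed the second stream sees `a1` before `b0`, an ordering that the finer-grained interleaving does not produce.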
Conclusion
The key to understanding how multiprocessor systems work is an accurate model of the interactions between the processors. Many parallel programs use dynamically scheduled task distribution among the processors, with work queues for load balancing. This can cause the interleaving of memory references to be quite different depending on the target environment. Failure to accurately model processor interactions could lead to mutual exclusion and synchronization violations, as well as incorrect load balancing [2]. To provide an accurate view of program execution, execution driven simulation is the best method for dynamically scheduled workloads. EDS also has the side benefit of eliminating the massive amount of disk space necessary for storing traces.
Cerberus is an EDS-based multiprocessor simulation system that allows program trace generation with a high degree of flexibility and fine-grained accuracy without sacrificing performance. It allows easy attachment of user-created tools for code profiling, cache simulation, trace generation, and other statistics gathering. The intuitive shared memory programming models used by Cerberus lead to easy expression of parallelism in programs.
Cerberus provides efficient simulation of multiprocessors by creating a single UNIX process with lightweight threads for each simulated processor, tightly linked with a user's measurement tool. This eliminates the extra context switches needed by other simulation systems, and allows for very low-level and accurate simulation of the interleaving of processor memory references.
Some trace generation tools have less slowdown, but with some loss of accuracy. Cerberus derives its accuracy by simulating instruction-by-instruction; other tools use basic block granularity or instrumented high-level code to synchronize simulated processors. Cerberus does not sacrifice efficiency to attain its accuracy. Moreover, since the actual execution time of instrumented code is heavily dominated by the measurement tools (especially in the case of multiprocessor cache simulators), efficiency is generally not a major concern for the address generation subsystem. One result of this study is that TDS systems show little speed advantage over Cerberus in measured execution time. We believe that Cerberus is a very good system for accurately and flexibly studying new computer architecture designs.
