This paper examines the cost/performance of simulating a hypothetical target parallel computer using a commercial host parallel computer.
We address the question of whether parallel simulation is simply faster than sequential simulation, or whether it is also more cost-effective. Over the last several years, direct execution has become widely used to accelerate architectural simulations [6, 4, 3, 7, 15]. Direct execution exploits the commonality between the instruction set of the simulated target machine and that of the underlying host system. For example, a floating-point multiply on the target is "simulated" by executing a floating-point multiply on the host. Such a system need only simulate the differences between the target system and the host, achieving impressive performance when the two systems are very similar.
Simulations of parallel computers have exploited direct execution in several ways [3, 7, 5]. Most commonly, a parallel target system is simulated on a uniprocessor host. For example, the Tango system spawns an event generation process for each processor in a target shared-memory system. All parameters can be extracted from four runs of the fully parallel simulation.
In the remainder of this section, we describe how we model each of the major contributors to simulation time: event processing time, direct context switch overhead, and host cache and TLB interference.
3.1 Modeling Processing Times
A potentially serious problem with conservative fixed-window simulation algorithms is that most host nodes will be idle while they wait for the slowest node to reach the barrier.
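The barrier idle time described above can be illustrated with a minimal sketch. It assumes a hypothetical list of per-node processing times for a single quantum; the function names and the sample numbers are illustrative only and are not taken from WWT.

```python
def barrier_idle_times(node_times):
    """Return (quantum_time, idle_times) for one fixed-window quantum.

    node_times holds the processing time of each host node in the quantum
    (any consistent time unit). Every node waits at the barrier until the
    slowest node arrives, so the quantum lasts as long as the slowest node
    and each node idles for the difference.
    """
    quantum_time = max(node_times)
    idle_times = [quantum_time - t for t in node_times]
    return quantum_time, idle_times


def idle_fraction(node_times):
    """Fraction of the total host-node time spent waiting at the barrier."""
    quantum_time, idle_times = barrier_idle_times(node_times)
    return sum(idle_times) / (quantum_time * len(node_times))


# Hypothetical processing times (microseconds) for four host nodes.
times = [40, 100, 60, 80]
quantum, idle = barrier_idle_times(times)
print(quantum, idle)         # 100 [60, 0, 40, 20]
print(idle_fraction(times))  # 0.3
```

In this made-up example the quantum is bounded by the 100-microsecond node, and 30% of the aggregate host time is spent idle at the barrier.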
In WWT, [...] The footprint of a process is the set of blocks that the process leaves in an infinite cache. In a finite cache, some of the blocks in the footprint will not fit, and are replaced. We define the projection of a process to be the set of blocks a process leaves in a finite cache that it may reference again. Given the size of the footprints of two processes, Thiebaut and Stone's model estimates the projection of each process and uses it to determine the interference.
We have extended the model to allow for sharing between processes, to estimate the interference among more than two processes, and to take as input the size of the projections of the processes, rather than of the footprints.
We estimate the average cache (TLB) projection of a target node by measuring the average processing time both with and without flushing the cache (TLB) at the beginning of every quantum. The difference between these times is due to refetching the blocks in the target's projection; by dividing this difference by the CM-5's cache (TLB) miss penalty, we can determine the expected size of the projection of a target node. To accurately estimate interference on the critical path, we found it necessary to measure not only the average projection but also the average of the largest projection in a quantum.
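The projection estimate described above is a single division, sketched below. The function name, the choice of cycles as the time unit, and the sample values are assumptions made for illustration; they are not measurements from the CM-5.

```python
def estimate_projection_size(time_with_flush, time_without_flush, miss_penalty):
    """Estimate the projection size in cache blocks (or TLB entries).

    time_with_flush:    average processing time when the cache (TLB) is
                        flushed at the start of every quantum
    time_without_flush: average processing time without flushing
    miss_penalty:       cost of one cache (TLB) miss

    All three arguments must use the same time unit. The extra time in the
    flushed run is attributed entirely to refetching the projection.
    """
    if miss_penalty <= 0:
        raise ValueError("miss_penalty must be positive")
    return (time_with_flush - time_without_flush) / miss_penalty


# Hypothetical values in cycles, for illustration only.
blocks = estimate_projection_size(12_000, 9_000, 30)
print(blocks)  # 100.0
```

The same function applies to the cache and to the TLB; only the measured times and the miss penalty differ.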
3.4 Running Time of a Quantum
Putting these three submodels together, along with the fixed quantum overhead, T_quantum_overhead, allows us to estimate the mean running time of a quantum: [...]
[...] with a maximum observed error of -24% for sparse on a 4-node host system. The model is more accurate at the extremes: it is exact, by definition, when p = 1, and the error is less than 16% for p = 32. The second, more fundamental observation is that the inherent simulation parallelism is low, providing speedups of only 4 to 9 on 32 host nodes. This is at least partially due to the low target system speedups these programs achieve for the small data sets used in this study.
Despite the relatively low "inherent" parallelism in event processing times, the Wisconsin Wind Tunnel actually achieves acceptable overall speedups, as illustrated in the right-hand side of Figure 2. These plots show the overall simulation speedups, plus a breakdown into the contributions of the various overheads. The central observation is that overhead increases the simulation parallelism by up to a factor of two. This result is consistent with additional measurements which indicate that overhead accounts for 44% to 68% of the computation in a sequential WWT simulation.
These overheads [...]
We approximate the cache and TLB interference for K > 32 by simply using the estimated interference for K = 32; since both the cache and the TLB begin thrashing at more than 4 target nodes per host node, there will be essentially no reuse (i.e., hits) for large K.
Modeling the Cost of Host Systems
In this section, we introduce cost models for uniprocessors (Uni), small-scale bus-based shared-memory multiprocessors (Bus), and large-scale parallel supercomputers (MPP). The cost models are based on current products and allow us to vary the number of host processors, p, and the number of target nodes per host node, K. We assume that each host node requires 32 megabytes per target node. This is significantly more than needed for the small data sets used in this study; however, these data sets were chosen so that we could simulate 32 target nodes within 32 megabytes of memory (i.e., on one CM-5 node). Real data sets are much larger; for example, the official NAS input to appbt is 125 times larger than the data set presented here [2].
Our uniprocessor cost model is based on the Silicon Graphics CHALLENGE M, a rack-mounted uniprocessor workstation server. We use a server configuration
Current implementations of massively parallel processors consist of a collection of workstation-like processing nodes connected together by a high-bandwidth interconnection network. Our cost model for these systems does not include a fixed base cost because they are generally expanded by adding entire cabinets, rather than individual processor boards.
Rather than try to capture the complex step function of the actual cost, we simply approximate it as a linear function of p; this approximation should not introduce significant error since we only consider values of p that are powers of two. Modeling the network cost as a multiplier, X_network, of the processor cost, the overall cost (for all p > 2) is:
$$C_{MPP}(K, p) = p\,(1 + X_{network})\,C_{processor} + K\,p\,C_{memory}$$
For the purposes of this study, we use current Silicon Graphics list prices for our uniprocessor and shared-memory multiprocessor cost estimates: C_processor = $20000, C_memory = $3200 (32 megabytes), BaseC_Uni = $3200, and BaseC_Bus [...]
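The MPP cost formula above can be evaluated directly, as in the following minimal sketch. The processor and memory prices are the list prices quoted in the text; applying them to an MPP and the network multiplier of 0.5 are assumptions made purely for illustration, and the function name is hypothetical.

```python
C_PROCESSOR = 20_000  # dollars per processor (list price quoted in the text)
C_MEMORY = 3_200      # dollars per 32 megabytes (list price quoted in the text)


def mpp_cost(K, p, x_network, c_processor=C_PROCESSOR, c_memory=C_MEMORY):
    """Cost of an MPP host with p nodes and K target nodes per host node.

    Implements C_MPP(K, p) = p (1 + X_network) C_processor + K p C_memory.
    Each target node needs one 32-megabyte memory unit, so the memory term
    scales with the total number of target nodes, K * p.
    """
    return p * (1 + x_network) * c_processor + K * p * c_memory


# Hypothetical network multiplier of 0.5; the text does not state a value here.
print(mpp_cost(K=1, p=32, x_network=0.5))  # 1062400.0
```

Because the model is linear in p with no base cost, doubling p at fixed K doubles the total cost.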
Modeling Cost/Performance
Since speedup is a measure of parallel simulation performance, cost/performance is simply the cost of the host system divided by the simulation speedup it achieves. For a uniprocessor system, the cost/performance is simply C_Uni, because speedup is 1 by definition. For parallel simulation of a Kp-node target system, the cost/performance is:
$$CP_{Machine}(K, p) = \frac{C_{Machine}(K, p)}{\mathit{Speedup}(K, p)} \qquad (7)$$
[...] K_min. This result is intuitive, since higher parallelism gives rise to larger speedups which in turn offset the cost of adding more host nodes.
The model also predicts that a decrease in the cost of memory with respect to the cost of a processor board results in a larger value of K_min. The intuition behind this result is that, for a given target system size, [...]
The second interesting prediction of this model is the lack of continuity in p. That is, parallel simulation does not gradually become more effective; rather, once the speedup is sufficient to overcome the large base cost, the optimum cost/performance occurs when the simulation is either fully parallel (barnes) or nearly so (appbt).
The lower half of Figure 4 plots CP_MPP(K, p) and C_Uni(Kp) for appbt and barnes and Kp [...]
Decreasing the processor cost (and/or the cost of the network for MPPs) has a complementary effect, not only decreasing K_min but also reducing the break-even target system size. Similarly, increasing the parallel simulation speedups, as we expect for larger data sets, will also tend to make parallel simulation increasingly cost-effective.
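The cost/performance comparison used in this section can be sketched as follows, assuming only the definition given in the text (host cost divided by simulation speedup, with a uniprocessor speedup of 1). The function names and all dollar figures and speedups in the example are hypothetical and do not come from the paper's measurements.

```python
def cost_performance(cost, speedup):
    """Cost of the host system divided by the simulation speedup it achieves."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return cost / speedup


def parallel_is_cost_effective(parallel_cost, speedup, uni_cost):
    """True when the parallel host beats the uniprocessor on cost/performance.

    The uniprocessor has speedup 1 by definition, so its cost/performance
    is simply its cost.
    """
    return cost_performance(parallel_cost, speedup) < uni_cost


# Hypothetical numbers: a $1,000,000 parallel host against a $60,000 uniprocessor.
print(parallel_is_cost_effective(1_000_000, 20, 60_000))  # True  (50000.0 < 60000)
print(parallel_is_cost_effective(1_000_000, 10, 60_000))  # False (100000.0 > 60000)
```

With these made-up figures, the break-even speedup is the ratio of the two costs, roughly 16.7, which is one way to read the break-even behavior discussed above.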
ACM Transactions on
