Abstract-This paper proposes to speed up sampled microprocessor simulations by reducing warmup times without sacrificing simulation accuracy. It exploits the observation that, of the memory references that precede a sample cluster, references that occur nearest to the cluster are more likely to be germane to the execution of the cluster itself. Hence, while modeling all cache and branch predictor interactions that precede a sample cluster would reliably establish their state, this is overkill and leads to long-running simulations. Instead, accurately establishing simulated cache and branch predictor state can be accomplished quickly by modeling only a subset of the memory references and control-flow instructions immediately preceding a sample cluster.
I. Introduction
This paper explores a technique for accelerating sampled microarchitecture simulations by reducing the amount of cache and branch predictor warmup prior to each sample cluster where cycle-accurate simulation data are gathered. By warmup we refer to the practice of modeling cache and branch predictor interactions for a specified interval prior to actual data gathering, in an effort to establish the simulated cache and branch predictor state precisely as they would have appeared had the entire simulation been conducted in cycle-accurate detail.
Unfortunately, highly detailed software simulation of a microprocessor is prohibitively slow. Even on the fastest hardware, slowdowns of several orders of magnitude (relative to native execution) are common. For example, KleinOsowski et al. [9] show that cycle-accurate modeling of many SPEC CPU2000 [11] benchmarks on reference inputs can take many weeks. Still, software simulation is fundamental to all computer architecture research. To make simulation-driven research tractable, many studies employ sampling: taking measurements of a small, representative subset of the instructions that are executed over the lifetime of the benchmark. Since it is precisely the software simulation of the cycle-by-cycle progression of individual instructions through the pipeline that produces the overwhelming slowdowns, in sampled simulation only the subset of instructions which constitute the sample are modeled in cycle-accurate detail. Fortunately, measuring the instruction throughput (i.e., instructions per cycle, IPC) of only a subset of the instructions can, for a properly chosen subset, yield information about the instruction throughput of a benchmark's entire end-to-end execution. Conte et al. [3] and Sherwood et al. [15], [16] propose strategies for choosing representative samples that yield very good approximations to true end-to-end IPC; both will be discussed in Section II.
Kevin Skadron
Department of Computer Science
University of Virginia, Charlottesville, VA 22904, skadron@cs.virginia.edu
To preserve the integrity of sampled measurements, the simulated processor state must be accurately established prior to the cycle-accurate simulation of each cluster. In other words, accuracy is predicated upon successfully defeating the so-called cold-start bias: because cache and branch predictor performance are critical to microprocessor performance, if the state of the cache (at all levels of the hierarchy) and branch predictor at the leading edge of a cluster does not appear at least approximately as it would have had the entire simulation been performed in cycle-accurate detail, the simulation results may be inaccurate. One straightforward technique to guarantee the accuracy of pre-cluster cache and branch predictor state is to model the interaction of each memory reference (instructions and data) with the cache hierarchy, and of every control-flow instruction with the branch predictor, while the simulator is executing pre-cluster instructions. (All cache and branch predictor interactions are already modeled within the cycle-accurate clusters.) In this way, the cache and branch predictor will always contain exactly the same state as if cycle-accurate simulation had been employed throughout the simulation. Though its accuracy is unimpeachable in terms of cache and branch predictor state, this fullwarmup method is heavy-handed. While not as expensive (in terms of running time) as cycle-accurate simulation, modeling all cache and branch predictor interactions is still costly.
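As a minimal sketch of what modeling a reference's interaction with a cache means during warmup, the following assumes a hypothetical direct-mapped, tag-only cache. The class and function names and the cache geometry are illustrative assumptions, not taken from the simulator used in this paper; only state is tracked, with no timing.

```python
class DirectMappedCache:
    """Tag-only cache model: tracks which line occupies each set, no timing."""

    def __init__(self, num_sets=256, line_bytes=32):  # hypothetical geometry
        self.num_sets = num_sets
        self.line_bytes = line_bytes
        self.tags = [None] * num_sets

    def access(self, address):
        """Update state for one reference; return True on a hit."""
        line = address // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        hit = self.tags[index] == tag
        self.tags[index] = tag  # on a miss, the new line replaces the old one
        return hit


def warm(cache, references):
    """Functionally replay pre-cluster references to establish cache state."""
    for address in references:
        cache.access(address)
    return cache
```

Under fullwarmup every pre-cluster reference would pass through a routine like `warm`; the technique studied here would pass only the references nearest the cluster.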
One method for further accelerating sampled simulations is to avoid fullwarmup by modeling only those interactions that occur within a given number of instructions prior to the cluster [12]. Our technique determines when to engage cache and branch predictor warmup by exploiting memory reference reuse latencies (MRRL), a measurement of the number of instructions that elapse between successive references to the same address. We have developed software that facilitates MRRL measurements and determines the pre-sample warmup interval independently for the instruction stream, data stream, and control-flow instruction stream.
The rest of this paper is organized as follows. We discuss related work in Section II. Section III explains Memory Reference Reuse Latency, its measurement and its significance. Section IV applies MRRL to sampled simulation. We explain our experimental methodology in Section V, present our results in Section VI and finally conclude in Section VII.
II. Related Work
Several studies examine ways to reduce overall simulation running times by executing only a small subset of the benchmark in cycle-accurate detail. Skadron et al. [17] identify short, representative simulation windows of 50 million instructions for the SPECInt95 benchmarks. The key insight which guides their approach is to exclude the benchmarks' unrepresentative start-up behavior (e.g., data structure setup and initialization). Conte et al. [3] take a different approach and instead simulate multiple randomly-chosen, fixed-sized clusters of contiguous instructions from the complete dynamic instruction stream. By choosing clusters randomly (i.e., such that all parts of the dynamic instruction stream have equal probability of being selected), random cluster sampling is amenable to statistical analysis, and allows the determination of a confidence interval within which the true IPC resides. Their work demonstrates the applicability of random cluster sampling to microprocessor simulation and focuses on the problem of warming up the branch prediction structures (assuming a perfect cache). They furthermore show that using stale predictor state from the previous cluster plus a short warmup interval of at least 7,000 instructions [10] prior to each cluster is sufficient to minimize cold-start bias and achieve very small errors of a few percent in the mean observed IPC. In the experiments conducted for this research we use random cluster sampling, prefixing a warmup interval determined by MRRL before each cluster and preserving stale cache and branch predictor state.
by approaching the problem analytically, their technique is able to achieve rapid warmup without compromising accuracy.
Haskins and Skadron [5] propose a warmup acceleration technique called Minimal Subset Evaluation (MSE). The MSE technique uses formulas derived from combinatorics and probability theory to calculate, for some user-chosen probability p, the number of memory references prior to each cluster that must be modeled in order to achieve accurate cache state; thus with probability p, cache state will appear exactly as it would had fullwarmup been used. As with PARSIM, MSE's mathematical underpinnings improve upon prior efforts by maintaining accuracy while reducing warmup times. Unlike PARSIM, MSE requires neither a priori knowledge of the steady-state cache miss ratio nor an instruction trace of the benchmark to be modeled (from which to derive the proportion of the instruction stream populated by memory references). The work in [5], however, only treats warmup acceleration of pre-cluster memory reference interactions with the first-level data and instruction caches; it is not obvious that MSE extends to (sometimes unified) secondary caches or branch predictors.
Karlsson et al. [6] develop an analytical model for characterizing working-set size as a function of database size in decision-support systems (DSSs). They perform an insightful investigation of temporal locality in DSSs and construct a model for identifying potentially reusable query components. Phalke and Gopinath [13] model inter-reference gaps (which are equivalent to memory reference reuse latencies) as kth order Markov chains. By modeling per-address temporal locality in this way, they were able to develop improved algorithms for page replacement, dynamic memory management and trace compression. Thiébaut [19] draws an analogy between memory access patterns and fractal random walks on the one-dimensional lattice (where the countably infinite lattice is mimicked by a large memory address space). From this framework, Thiébaut describes a method for accurately predicting the miss ratio of fully associative caches. While these works do not treat warmup in execution-driven simulation, they were instructive in their analytical assessment of temporal locality in memory reference streams.
Wood et al. [21] establish the concept of cache generations. Each cache generation begins immediately after a new line is brought into the cache and ends when the line is evicted and replaced. Their notion of cache generations establishes a framework for analytically estimating the unknown or cold-start reference miss ratio, μ. They further establish that μ is substantially higher than the miss ratio of references chosen at random. Armed with a reliable μ (the estimated unknown reference miss ratio), they were able to accurately estimate cache miss ratios in sampled trace-driven simulations. This research, however, does not address the issue of accurately establishing simulated cache hierarchy or branch predictor state for execution-driven simulations.
In their Cache Decay research, Kaxiras et al. [7] propose a technique of cutting power to (heuristically presumed) dead cache lines, thereby reducing leakage power. For the SPEC CPU2000 benchmarks, their measurements show that for a 32KB L1 data-cache, a cache line's dead time can range from 45% to as much as 99% of the total time since being loaded. Their work shows that most cache lines' active lifetime is significantly longer than their useful lifetime, which confirms our hypothesis that references occurring many instructions before a cluster are unlikely to have any relevance within the cluster and can therefore be safely omitted from warmup.
As in prior research, we achieve efficient execution by breaking the simulation into three separate phases. The first, aggressive fast-forward phase can be considered the "cold" phase; this is followed by the "warm" phase, where cache and branch predictor interactions are modeled; and concluded by the "hot" phase, where cycle-accurate simulation of the processor pipeline takes place. The hot phase contains the sample cluster instructions, and the preceding cold and warm phases contain the pre-cluster instructions. Hence, for each pre-cluster-cluster pair, the aim of our research is to preserve simulation accuracy as we increase the duration of the cold phase while reducing the duration of the warm phase, always leaving the hot phase unchanged.
Ad-hoc warmup methods that guess a warmup amount (e.g., X% of all pre-cluster instructions) may still yield inaccurate results (if warming up only X% of pre-cluster instructions is too few) or fall short of the potential speedup (if warming up fewer than X% of pre-cluster instructions would have still yielded accurate results). By measuring the reuse latency of individual memory addresses, we were able to forge an alternative warmup acceleration technique that preserves accuracy by determining which references are likely to be germane to each cycle-accurate cluster.
We measured MRRLs for each benchmark using custom-made MRRL profiling software. As the profiler simulates each pre-cluster-cluster pair, the profiling software maintains several associative arrays of memory reference addresses, M[A]: one for the instruction stream, one for the data stream, and one for the stream of branch instructions. Each element of the array is logically timestamped with the number of instructions executed as of the currently simulating memory or branch instruction; if a previously-encountered address is re-accessed, the difference of the previous timestamp and the current number of executed instructions is temporarily stored as binsn.
These binsn are used to concurrently build a reuse latency histogram by incrementing the count of the bucket that contains binsn. When a pre-cluster-cluster section concludes, the profiler outputs its binsn histogram. These histograms contain the complete memory reference reuse latency profile for each pre-cluster-cluster pair.
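The profiling procedure described above can be sketched roughly as follows for a single reference stream. This is an illustration only: the function name, the `(instruction_count, address)` input format, and the uniform bucket width are assumptions made for the example, since the bucket layout of the actual profiler is not specified here.

```python
from collections import Counter


def reuse_latency_histogram(references, bucket_width):
    """Histogram of reuse latencies for one reference stream.

    references: iterable of (instruction_count, address) pairs in program order.
    bucket_width: width of each bucket in instructions (assumed uniform).
    Returns a Counter mapping bucket index -> number of reuses.
    """
    last_seen = {}        # address -> instruction count at most recent access
    histogram = Counter()
    for now, address in references:
        if address in last_seen:
            latency = now - last_seen[address]
            histogram[latency // bucket_width] += 1
        last_seen[address] = now  # (re)timestamp the address
    return histogram
```

One such histogram would be kept per stream (instructions, data, branches) and per pre-cluster-cluster pair.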
III. Memory Reference Reuse Latency
A pre-cluster-cluster histogram counts the number of references whose reuse latencies fall within n disjoint length intervals. Formally, each histogram gives the count of references for which the number of elapsed instructions between successive accesses to the same address lies within the interval bucket_j, where j ∈ {1, 2, ..., n} for all n buckets. Not surprisingly, the histograms invariably tell the same story when plotted: a far greater number of references are revisited a small number of instructions after their most recent access (i.e., the histogram bucket with the largest population was always bucket_1). Thus, the more instructions that elapse after an access to M[A], the less likely M[A] is to be accessed again during the current pre-cluster-cluster pair. This is exactly as we had expected, in light of concepts pioneered in [21] and subsequent work in [7].
From the histograms we can calculate the reuse distance corresponding to any desired percentile N, i.e., the $\mathit{bucket}_j$ for which at least N% of references are contained in $\bigcup_{i=1}^{j} \mathit{bucket}_i$. Let $W_N \leftrightarrow \mathit{bucket}_j$ mean that the jth bucket is this N-th percentile bucket, with $W_N$ its upper bound in instructions.
In other words, of all the references in the current pre-cluster-cluster pair, N% have reuse latencies of less than W_N instructions.
By engaging warmup W_N instructions prior to the current pre-cluster-cluster boundary for large enough N¹, we know that the overwhelming majority of addresses that will be accessed during the simulation cluster will have been initialized.
We argue that if N% of references require only W_N instructions between successive accesses, then it is pointless to model the few ((100 - N)%) pre-cluster cache and branch predictor interactions that occur more than W_N instructions before the cluster, since these references will probably not be relevant to the cluster's precision and require disproportionately long to warm up. This strategy of delaying pre-cluster cache and branch predictor modeling will be explained in more detail in the next section.
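To make the percentile calculation concrete, the sketch below derives W_N from a reuse latency histogram by accumulating bucket counts until the requested fraction of references is covered. The function name, the uniform bucket width, and the choice of the bucket's upper bound as W_N are assumptions made for illustration.

```python
def warmup_length(histogram, bucket_width, percentile):
    """Smallest bucket upper bound covering `percentile` of all reuses.

    histogram: mapping bucket index -> count, where bucket j spans
               [j * bucket_width, (j + 1) * bucket_width) instructions.
    percentile: fraction in (0, 1], e.g. 0.999.
    """
    total = sum(histogram.values())
    if total == 0:
        return 0  # no reuse observed: nothing to warm up
    covered = 0
    for bucket in sorted(histogram):
        covered += histogram[bucket]
        if covered >= percentile * total:
            return (bucket + 1) * bucket_width
    return (max(histogram) + 1) * bucket_width
```

Because most reuses fall in the lowest buckets, the returned length is typically far shorter than the whole pre-cluster region, even for percentiles close to 1.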
IV. Accelerating Warmup
The steps of the MRRL warmup acceleration technique are enumerated below:
1) First, the user selects the locations of the cycle-accurate clusters within the benchmark; as a corollary, non-cluster regions are selected simultaneously. Each cluster is paired with its own preceding non-cluster (i.e., pre-cluster) region.
2) The user next profiles the benchmark to characterize, for each pre-cluster-cluster pair, the reuse latencies of all references that occur. As this profile data is valid for any cache and branch predictor configuration, this is a one-time cost for each benchmark sample.
3) Simulations can then be run in an aggressive fast-forward mode, updating only architected state. At W_N instructions prior to the cluster, the simulator shifts into warmup mode, where cache hierarchy and branch predictor interactions are modeled. Once the cluster is reached, the cache(s) and branch predictor will contain accurate state, and cycle-accurate simulation begins. This last step repeats for each pre-cluster-cluster pair.

¹ A discussion of "large enough N" appears in Section VI.

Contrast this approach to the more conservative technique of modeling all pre-cluster cache and branch predictor interactions, i.e., fullwarmup. Obviously, modeling all pre-cluster cache and branch predictor interactions will maintain perfect state throughout all levels of the cache hierarchy and in the branch predictor, rendering the simulation data impervious to inaccuracies that arise from cold-start bias; only sampling error remains. Reciprocally, stale-state or nowarmup (as the latter name implies) does not model any pre-cluster cache or branch predictor interactions, but merely recycles state as it appeared at the conclusion of the previous cluster. By not modeling cache and branch predictor state prior to each cluster, nowarmup is very susceptible to cold-start bias, as will be shown in the next section. In our discussion of MRRL's accuracy, we refer not only to whether the true end-to-end cycle-accurate IPC is contained within a statistical confidence interval, but also to the deviation between the IPC yielded by MRRL-driven warmup and, for the same sample, by fullwarmup. We measure this deviation by calculating the relative error thus:

$$100\% \cdot \frac{IPC_{MRRL} - IPC_{fullwarmup}}{IPC_{fullwarmup}}$$

In our discussion of MRRL's speedup capability, we refer to the amount of potential speedup over fullwarmup which (as shall be shown in the next section) is the running time of nowarmup.
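The three enumerated steps amount to a per-cluster schedule of cold, warm, and hot instruction ranges. The following sketch computes such a schedule; the function name and the clamping of the warm phase at the end of the previous cluster are assumptions for illustration, not a description of the actual simulator.

```python
def phase_plan(cluster_starts, cluster_len, warmup_lengths):
    """Return (cold, warm, hot) ranges for each pre-cluster-cluster pair.

    cluster_starts: sorted start positions of the clusters (instruction counts).
    cluster_len: length of each cluster in instructions.
    warmup_lengths: W_N for each pair, in instructions.
    Ranges are half-open [begin, end).
    """
    plan = []
    previous_end = 0
    for start, w_n in zip(cluster_starts, warmup_lengths):
        # Assumption: warmup never reaches back past the previous cluster.
        warm_begin = max(previous_end, start - w_n)
        cold = (previous_end, warm_begin)    # architected state only
        warm = (warm_begin, start)           # caches and branch predictor modeled
        hot = (start, start + cluster_len)   # cycle-accurate simulation
        plan.append((cold, warm, hot))
        previous_end = start + cluster_len
    return plan
```

In this framing, fullwarmup corresponds to an empty cold range for every pair, and nowarmup to an empty warm range.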

V. Methodology
The data discussed in this section were gathered using random cluster sampling as described by Conte et al.
TABLE II. IPC %-ERROR RELATIVE TO FULLWARMUP, 100% · (IPC_MRRL - IPC_fullwarmup) / IPC_fullwarmup. MEAN CALCULATIONS BASED ON THE ABSOLUTE VALUE OF ERRORS.
[2] toolset, to obtain the dynamic instruction count. Next, a simple Perl script was used to select 50 1-million-instruction clusters at random² from the discrete interval [1, L], where L is the dynamic instruction count. The locations (in number of instructions relative to the start of execution) of the 50 selected clusters were then saved to a file, and subsequently used to drive the multiple cluster profiling and simulation steps enumerated in Section IV.
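The selection step could be sketched as below. The original used a Perl script whose details are not given here, so this Python version is only illustrative; in particular, drawing clusters from aligned, non-overlapping slots is an assumption made to keep the example simple.

```python
import random


def select_clusters(total_instructions, num_clusters=50,
                    cluster_len=1_000_000, seed=None):
    """Pick cluster start positions uniformly at random.

    Assumption: clusters are drawn from aligned, non-overlapping slots of
    `cluster_len` instructions, so every slot is equally likely to be chosen.
    Returns the sorted start offsets (instructions from the start of execution).
    """
    slots = total_instructions // cluster_len
    if slots < num_clusters:
        raise ValueError("benchmark too short for the requested sample")
    rng = random.Random(seed)
    chosen = rng.sample(range(slots), num_clusters)
    return sorted(slot * cluster_len for slot in chosen)
```

Saving the returned offsets to a file lets the profiling run and every later simulation run use exactly the same clusters.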
For the MRRL simulations, the warm phase was engaged W_N instructions prior to each cluster for N = 0.999. Recall from Section III that MRRL initiates cache and branch predictor warmup according to the maximum length of reuse latencies that compose the N-th percentile of all reuse latency measurements. We chose to study N = 0.999 (i.e., the 99.9-th percentile) and find that it performs well in terms of absolute deviations in IPC, statistical analysis of the deviations, and speedup. Analysis of other N to find a minimum percentile is beyond the scope of this work, and is an area of future research.
All benchmarks come from the SPEC CPU2000 suite [11]; the binaries were compiled into the Alpha AXP instruction set and statically linked so that the simulations see all user-space program behavior, including library routines. The MRRL profiler and the multiple cluster simulator were adapted from sim-safe and sim-outorder, respectively, from SimpleScalar.
To measure simulation time data as accurately as possible, sim-outorder was further modified to use the UNIX system call getrusage() to monitor the CPU time of each simulation regardless of other activity on the host system. (All the scripts and software developed for this research are available for download from http://lava.cs.virginia.edu/.) Table I gives a brief description of the cache hierarchy and branch predictor configuration.

² By "at random," we mean such that all regions of the discrete interval [1, L] have equal probability of being selected.
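For readers unfamiliar with getrusage(), the sketch below shows the same idea through Python's standard `resource` module, which wraps the system call on Unix hosts. The modified simulator itself is written in C; the helper name `cpu_seconds` and the stand-in workload are assumptions for illustration.

```python
import resource


def cpu_seconds():
    """User + system CPU time consumed so far by this process (Unix only)."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_utime + usage.ru_stime


start = cpu_seconds()
sum(i * i for i in range(100_000))  # stand-in for a simulation run
elapsed = cpu_seconds() - start
```

Measuring CPU time rather than wall-clock time keeps the result largely independent of other load on the host.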
Once each benchmark's 50-cluster sample was selected, the next step was to profile to gather MRRL data for each benchmark. A Perl script was then used to extract W_N for each benchmark's pre-cluster-cluster pairs. When fed to the multiple cluster simulator, these data were used to demarcate the cold phase from the warm phase. The previously chosen hot phases (clusters) remained fixed just as they were during the profile.
The three metrics we use to measure MRRL's merit are percent-error IPC deviation from fullwarmup, accuracy with respect to the true IPC, and running time as a percentage of fullwarmup. For completeness and as a basis for discussing simulation acceleration, in addition to demonstrating the validity of MRRL as a tool for rapid, accurate simulation, we also show data arising from nowarmup for each of the three aforementioned metrics. (Recall that nowarmup merely recycles state from one cluster to the next, and models caching and branch prediction solely during the clusters.)
For each benchmark, Table II shows the true end-to-end IPC³ (i.e., IPC_true) generated by simulating in cycle-accurate detail for the entire dynamic instruction stream, fullwarmup IPC (i.e., IPC_fullwarmup) percent-error deviation relative to IPC_true, and MRRL_0.999 IPC (i.e., IPC_MRRL0.999) and

³ Most of these IPCs come from the SimPoint [14] Web site. They were generated for a specific configuration of sim-outorder (linked to from the site). MRRL, fullwarmup, and nowarmup experiments compared against these IPCs use the same sim-outorder configuration and the same benchmark binaries.
VI. Evaluation
A. IPC Accuracy: MRRL versus fullwarmup
For most benchmarks, Table II shows that fullwarmup's percent-error deviation from IPC_true (i.e., the sampling error) is small, less than 5%. While applu, galgel, and especially gap buck this trend, this is not a failure of random cluster sampling, but a failure to draw a suitably large sample of clusters from the dynamic instruction stream. A larger sample (of more than 50 clusters) would have reduced the sampling error by more accurately representing all aspects of these benchmarks' behavior. Conte et al. [3], for example, achieve relative errors in IPC of less than 3% through sampling. Of paramount importance to our research, however, is that in general, MRRL_0.999 does not introduce statistically significant additional nonsampling error, which arises chiefly from cold-start bias [3]. In other words, our primary objective is to develop a warmup technique such that IPC_MRRL0.999 strays very little from IPC_fullwarmup, and on that count we claim victory. For all benchmarks except fma3d, the percent-error deviation from fullwarmup is less than 0.50%. fma3d's seemingly drastic deviation is due to the small numbers involved in the percent-error calculation: IPC_fullwarmup = 0.533, IPC_MRRL0.999 = 0.554. The relative error, 100% · (0.554 - 0.533)/0.533 = 3.9%, makes the deviation look much worse than it really is when one considers that the absolute error, 0.554 - 0.533 = 0.021, is so small. nowarmup, on the other hand, yields percent-error deviations from fullwarmup of less than 2% in general, but substantially larger relative errors for facerec and galgel. In Section VI-B, we show why these much larger errors make nowarmup an untrustworthy warmup strategy.
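The fma3d arithmetic above can be checked with a few lines. The helper name `percent_error` is an assumption for illustration; the two IPC values are those quoted in the text.

```python
def percent_error(ipc_measured, ipc_reference):
    """Relative error of ipc_measured with respect to ipc_reference, in percent."""
    return 100.0 * (ipc_measured - ipc_reference) / ipc_reference


fma3d_relative = percent_error(0.554, 0.533)  # about 3.9
fma3d_absolute = 0.554 - 0.533                # about 0.021
```

A small denominator inflates the relative figure, which is why the absolute difference is worth reporting alongside it for low-IPC benchmarks.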
B. IPC Accuracy: MRRL versus IPC_true
While MRRL_0.999 is apparently a sound warmup strategy, and nowarmup apparently unsound, we will now rigorously demonstrate these hypotheses. As mentioned before, a significant advantage of random cluster sampling is that results obtained from this style of simulation can be statistically analyzed. Sampling produces error because only a subset of the population is measured rather than the entire population.
Random sampling allows us to rigorously gauge the amount of error and the probability that the amount is significant, based upon the assumption that all members of the population had uniform probability of being included in the sample.
For each benchmark, the mean instruction throughput was measured by counting the number of cycles consumed by all 50 clusters. Dividing the total number of executed instructions (50 million) by this amount yielded the overall sample IPC. For a well-chosen sample, this sample IPC will be a good estimate of the end-to-end IPC.
---dence confirms our hypothesis that their respective -10.46% and -14.99% percent-error deviations from the IPC_fullwarmup sample means are significant.
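The overall sample IPC described above (total instructions divided by total cycles over all clusters) can be written as a short helper. The function name and argument layout are assumptions for illustration.

```python
def sample_ipc(cluster_cycles, cluster_len=1_000_000):
    """Overall IPC of a sample: total instructions over total cycles."""
    total_instructions = cluster_len * len(cluster_cycles)
    return total_instructions / sum(cluster_cycles)
```

Note that this is not the same as averaging the per-cluster IPCs: two 1-million-instruction clusters taking 2,000,000 and 500,000 cycles give an overall IPC of 0.8, whereas the mean of their individual IPCs (0.5 and 2.0) is 1.25.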
C. Acceleration: MRRL versus fullwarmup
Before discussing MRRL's acceleration capability, it is important to discuss the optimality of nowarmup's runtime. Since nowarmup does not model any pre-cluster cache or branch predictor interactions, nowarmup simulations have no warm phase, only cold and hot. Because the cold phase models a proper subset of the activity modeled in the warm phase, eliminating the warm phase reduces execution time to its absolute minimum under the three-phase cold-warm-hot simulation strategy described in Section III.
Since nowarmup simulation time is the minimum possible simulation time, it also represents the per-benchmark maximum potential speedup from warmup. Table IV shows that these potential speedups ranged from 32.31% for gzip-graphic to 59.84% for fma3d. of each benchmark's fullwarmup run- Table IV. In other words, Table IV shows not only MRRL_0.999's reduction relative to fullwarmup, but also how close to the maximum possible speedup each MRRL_0.999 simulation was able to come (the higher the percentage the better).
MRRL_0.999's achieved potential speedup for all benchmarks is respectable, averaging 90.62% of the maximum, and ranging from 88.55% for vpr-route to 97.51% for apsi. These translate into running times of only 39.75% and 59.56%, respectively, of the time taken to simulate via fullwarmup.
VII. Conclusions and Future Work
Memory reference reuse latency analysis is a useful technique that can be used to reduce the running times of sampled simulations by reducing the amount of time spent warming up simulated cache and branch predictor state during the simulation phase preceding each sample cluster. By measuring the reuse latency (in number of instructions) between consecutive accesses to each memory address, we can discover the memory reference reuse latency that corresponds to an arbitrary percentile: MRRL_N. This MRRL_N is used to determine the amount of warmup to perform during inter-cluster regions.
To make simulation as rapid as possible, cold mode uses aggressive low-detail simulation, updating only architected state; in warm mode, memory reference interactions within the cache hierarchy and branch instruction interactions with the branch predictor are also modeled. At the conclusion of the warm mode, cache and branch predictor state will be accurately established, allowing the subsequent hot mode to simulate in cycle-accurate detail without imprecision arising from cold-start bias, which can reduce accuracy.
Our results show that, used in conjunction with random cluster sampling, MRRL does not compromise accuracy. For the SPEC CPU2000 benchmarks tested, the percent-error between IPC_fullwarmup and IPC_MRRL0.999 averaged less than 1%, and was shown to be statistically insignificant for MRRL_0.999 in that all but four of the benchmarks' 95% confidence intervals predicted the observed error from IPC_true. Additionally, the fullwarmup simulations of the same four benchmarks also failed to predict the observed error. Since fullwarmup is impervious to nonsampling error due to cold-start bias, this implies that the failure of both MRRL_0.999 and fullwarmup to predict the observed error is attributable to sampling error and that MRRL_0.999 accomplishes our objective of reducing nonsampling error. Thus, we conclude that MRRL at the 99.9-th percentile mimics fullwarmup well. MRRL_0.999 accomplishes our second objective as well, cutting simulation times by 50.06% on average, which is 90.62% of the maximum potential speedup.
Since MRRL works by accurately establishing cache and branch predictor state, an interesting avenue for future research would be to analyze whether MRRL accurately estimates the cache miss rate and branch misprediction rate from the sample clusters. Currently under investigation is the use of hypothesis testing to demonstrate that the difference between IPC_fullwarmup and IPC_MRRL is statistically insignificant for some MRRL percentile N. In particular, we will implement a matched-pairs t-test, pitting the per-cluster IPC_i of each benchmark against each other for fullwarmup simulations and MRRL_N simulations. In preliminary experiments we computed, from the matched pairs, a set of differences which were then used to calculate a t-score based on the difference of the means, the standard error of the means, and their Pearson product-moment correlation coefficient [20], thus:

$$t = \frac{\mu_X - \mu_Y}{\sqrt{\sigma_X^2 + \sigma_Y^2 - 2\, r_{XY}\, \sigma_X \sigma_Y}}$$

where $\mu_X - \mu_Y$ is the difference of the fullwarmup and MRRL_N means, $\sigma_X$ and $\sigma_Y$ are the standard errors among the fullwarmup and MRRL_N cluster IPCs, and $r_{XY}$ is the Pearson product-moment correlation coefficient between the fullwarmup and MRRL_N cluster IPCs. (This is necessary because we are measuring the effects of the tested warmup strategies as different "treatments" of the same sample population [20].) This is then repeated for fullwarmup and nowarmup.
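A matched-pairs t-score built from the quantities named above (difference of means, standard errors of the means, and the Pearson correlation coefficient) can be sketched as follows. The function name is an assumption, and the formula used is the standard matched-pairs form, which is assumed here to be the one intended; it is algebraically equivalent to dividing the mean of the per-pair differences by its standard error.

```python
import math
import statistics


def matched_pairs_t(x, y):
    """t-score for paired samples x and y (e.g. per-cluster IPCs)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally sized samples of at least 2 values")
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sd_x, sd_y = statistics.stdev(x), statistics.stdev(y)
    # Pearson product-moment correlation from the sample covariance.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    r_xy = cov / (sd_x * sd_y)
    se_x, se_y = sd_x / math.sqrt(n), sd_y / math.sqrt(n)
    denom = math.sqrt(se_x**2 + se_y**2 - 2.0 * r_xy * se_x * se_y)
    return (mean_x - mean_y) / denom
```

The sketch does not guard against degenerate inputs (a constant sample, or identical per-pair differences), which would make a denominator zero. The magnitude of the returned score would then be compared against the critical value for n - 1 degrees of freedom.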
At the 5% level of significance, for instance, the critical value for our 50-cluster-sample experiments is 2.0096. Table V lists the t-scores of the benchmarks calculated by pairing the cluster IPCs from fullwarmup and MRRL_0.999, and by pairing the cluster IPCs from fullwarmup and nowarmup. These early results are very promising and quantitatively insightful. Recall from Section VI-A the relatively large 3.9% error between IPC_MRRL0.999 and IPC_fullwarmup for fma3d. We qualitatively concluded that, since the absolute error was very small (0.021), the percent-error was insignificant. Table V quantitatively confirms this, since the fma3d t-score is less than the aforementioned critical value; thus, for fma3d, the difference between IPC_MRRL0.999 and IPC_fullwarmup is statistically insignificant at the 5% level.
