INTRODUCTION
Trace-driven simulation is a common approach for evaluating memory systems. Unfortunately, it also demands large amounts of space and time, particularly for large caches and long-running applications. These demands can be greatly reduced by employing sampling techniques, at the expense of providing only a statistical estimate of the properties of a full trace.
Our interest in using sampling is three-fold. First, we are interested in the behavior of commonly used desktop applications. When compared to benchmarks such as SPEC95, these applications have larger working sets, are feature rich, and, of course, can run for billions of instructions. Hence, traces based on exhaustive executions of such applications will be large. In this work, we consider trace sampling for a suite of five publicly available desktop application traces for Windows NT on the Intel X86 platform. To determine the feasibility of sampling for these traces, we present cache miss rate results for a large set of cache sizes and sampling techniques. Second, we want to demonstrate the utility of these sampling techniques for architectural studies. Although it has been shown that trace sampling is not very accurate for metrics such as hit rate when simulating large multi-megabyte caches [3], we demonstrate that sampling is useful in assessing trends not only for caches but also for other architectural structures whose state depends on the processing of past references. Two architectural studies are presented here that apply sampling techniques. The first study demonstrates how sampling may be used to assess trends in victim cache performance [2], and the second study uses sampling to assess trends in branch prediction techniques. Third, we can estimate parameters for analytical cache models with sampled traces as precisely as with whole reference traces (cf. [1] for this part of the study).
TRACE SAMPLING
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGMETRICS '99 5/99 Atlanta, Georgia, USA. © 1999 ACM 1-58113-083-X/99/0004...$5.00
In trace sampling, an observation, or sample, is obtained by recording a fixed number, the sample size, of consecutive references from a reference stream. Another fixed number of references are ignored before the next observation is made. The sampling ratio is the percentage of total references used in all the observations. Sampling theory states that sets of random, unbiased observations from a population may be used to make inferences about that population. As described above, observations in trace sampling are not random; they are systematic, since they are evenly spaced throughout the trace. This non-random pattern is not a problem, though, since systematic observations can be used to make even more precise inferences than random observations when the variance of systematic observations is greater than the variance of the population. Unfortunately, however, trace sampling for memory systems involves neither unbiased observations nor a sufficient alternative. The problem is that the state of the cache is unknown at the start of each observation, i.e., within a sample it is unknown whether the first reference to each cache block will be a hit or a miss. These are known as unknown [7] or cold-start [4] references.
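The systematic sampling scheme described in this section can be illustrated with a minimal Python sketch. The function names `systematic_samples` and `sampling_ratio`, the `skip` parameter, and the choice to drop a trailing partial observation are assumptions made for illustration; they are not taken from the original study.

```python
from typing import Iterator, Sequence


def systematic_samples(
    trace: Sequence[int], sample_size: int, skip: int
) -> Iterator[Sequence[int]]:
    """Yield evenly spaced observations from a reference trace.

    Each observation holds `sample_size` consecutive references; the next
    `skip` references are ignored before the following observation starts.
    A trailing partial observation is dropped (an illustrative choice).
    """
    if sample_size <= 0 or skip < 0:
        raise ValueError("sample_size must be positive and skip non-negative")
    period = sample_size + skip
    for start in range(0, len(trace) - sample_size + 1, period):
        yield trace[start:start + sample_size]


def sampling_ratio(sample_size: int, skip: int) -> float:
    """Percentage of references that fall inside observations."""
    return 100.0 * sample_size / (sample_size + skip)


if __name__ == "__main__":
    toy_trace = list(range(100))  # stand-in for a stream of addresses
    for sample in systematic_samples(toy_trace, sample_size=10, skip=40):
        print(sample[0], "...", sample[-1])
    print("sampling ratio (%):", sampling_ratio(10, 40))
```

Because each observation starts at a fixed offset within a fixed period, the observations are evenly spaced rather than random, which is the systematic pattern discussed above.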
METHODOLOGY
Sampling Techniques
A number of techniques have been employed to mitigate the bias due to unknown references. The techniques considered here are described in Figure 1. One approach is to make assumptions about, or construct, the state of the cache at the start of each sample (e.g., cold, half, and stitch). Another is to directly estimate the miss ratio of unknown references (e.g., INITMR). The efficacy of these assumptions depends on workload, cache organization, and choice of sampling parameters. For example, if most misses in a cache for a given workload are due to conflicts in a small number of cache lines, then stitch may work well since only a small portion of the working set is likely to change between samples. Note that true-sample simulates the caches over the full trace and reports the miss ratio observed over the regions that are sampled with the other techniques. It is therefore an unbiased estimator of the miss ratio for the entire trace. Its accuracy depends on how "fine-tuned" our sampling parameters, sample size and sampling ratio, are to a given cache and workload. While true-sample is not a practical method, it is the basis for comparisons with the other techniques, which, in addition to the same sampling errors, will have unknown-reference biases.
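The following Python sketch contrasts three of the techniques on a toy direct-mapped cache. Figure 1 is not reproduced here, so the definitions used are assumptions based on common usage: cold starts every sample with an empty cache, and stitch starts every sample with the state left at the end of the previous sample. The true-sample behaviour follows the description above. The class `DirectMappedCache`, the function `sampled_miss_ratio`, and all parameter values are hypothetical; half and INITMR are omitted.

```python
from typing import List, Optional, Sequence


class DirectMappedCache:
    """Minimal direct-mapped cache that tracks one block number per line."""

    def __init__(self, num_lines: int, block_size: int = 32) -> None:
        self.num_lines = num_lines
        self.block_size = block_size
        self.tags: List[Optional[int]] = [None] * num_lines

    def access(self, addr: int) -> bool:
        """Return True on a hit; on a miss, install the block."""
        block = addr // self.block_size
        index = block % self.num_lines
        hit = self.tags[index] == block
        self.tags[index] = block
        return hit

    def flush(self) -> None:
        self.tags = [None] * self.num_lines


def sampled_miss_ratio(
    trace: Sequence[int],
    sample_size: int,
    skip: int,
    technique: str,
    num_lines: int = 256,
) -> float:
    """Miss ratio over the sampled regions under the given technique."""
    if technique not in ("cold", "stitch", "true-sample"):
        raise ValueError(f"unknown technique: {technique}")
    cache = DirectMappedCache(num_lines)
    period = sample_size + skip
    refs = misses = 0
    for i, addr in enumerate(trace):
        offset = i % period
        if offset >= sample_size:
            # Skipped region: only true-sample keeps simulating here.
            if technique == "true-sample":
                cache.access(addr)
            continue
        if offset == 0 and technique == "cold":
            cache.flush()  # assumed: cold begins each sample empty
        # stitch needs no action: state carries over from the last sample.
        refs += 1
        if not cache.access(addr):
            misses += 1
    return misses / refs if refs else 0.0


if __name__ == "__main__":
    loop = [0, 32, 64, 96] * 100  # small working set, no conflicts
    for name in ("cold", "stitch", "true-sample"):
        print(name, sampled_miss_ratio(loop, 8, 8, name))
```

On this tiny looping trace, cold reports a much higher miss ratio than true-sample because every sample re-pays the cold-start misses, while stitch matches true-sample because the working set does not change between samples.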
Our experiments compare the complete branch prediction accuracy results to stitch-sampled results for two branch predictors: a simple bimodal [1] predictor and a gshare [6] predictor. We consider predictor tables ranging in size from 512 to 32,768 entries.
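For readers unfamiliar with the two predictors, the sketch below shows one common formulation of each, using a table of 2-bit saturating counters. The class names, the initial counter value, the use of the raw branch address as the index, and a history length equal to the index width are all assumptions for illustration; the configurations used in the experiments are not specified here.

```python
from typing import Iterable, Tuple


class Bimodal:
    """Table of 2-bit saturating counters indexed by branch address."""

    def __init__(self, entries: int) -> None:
        if entries <= 0 or entries & (entries - 1):
            raise ValueError("entries must be a power of two")
        self.mask = entries - 1
        self.counters = [1] * entries  # assumed start: weakly not-taken

    def _index(self, pc: int) -> int:
        return pc & self.mask

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        c = self.counters[i]
        self.counters[i] = min(3, c + 1) if taken else max(0, c - 1)


class Gshare(Bimodal):
    """Same counters, indexed by branch address XOR global history."""

    def __init__(self, entries: int) -> None:
        super().__init__(entries)
        self.history = 0

    def _index(self, pc: int) -> int:
        return (pc ^ self.history) & self.mask

    def update(self, pc: int, taken: bool) -> None:
        super().update(pc, taken)  # uses the history seen by predict()
        self.history = ((self.history << 1) | int(taken)) & self.mask


def accuracy(predictor: Bimodal, branches: Iterable[Tuple[int, bool]]) -> float:
    """Fraction of branches predicted correctly, updating as we go."""
    total = correct = 0
    for pc, taken in branches:
        correct += predictor.predict(pc) == taken
        predictor.update(pc, taken)
        total += 1
    return correct / total if total else 0.0


if __name__ == "__main__":
    always_taken = [(0x40, True)] * 100
    print("bimodal:", accuracy(Bimodal(512), always_taken))
    print("gshare: ", accuracy(Gshare(512), always_taken))
```

Both predictors keep state that depends on past branches, which is why the choice of initial state at the start of each sample matters for them in the same way it does for caches.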
Benchmarks
Our experiments are based on traces gathered from the execution of five desktop applications: Adobe Acrobat Reader, Netscape Navigator, Adobe Photoshop, Microsoft PowerPoint, and Microsoft Word. A complete description of these applications and their workloads can be found in [5].
The results using stitch for the bimodal predictor are extremely good, even for large tables. Because individual branches are most often either always taken or always not taken, keeping the results of past predictions, even distant ones, is beneficial. Regarding the gshare predictor, we observe that stitch is still quite accurate up to 8K entries. Moreover, stitch can lead to the correct conclusion that the gshare predictor is more accurate than the bimodal predictor only when the predictor tables are sufficiently large.
CONCLUSIONS
Commonly used Windows NT desktop applications can run for billions of instructions. Performing architectural studies on traces of billions of references is infeasible in terms of both time and space. Trace sampling is an efficient alternative that greatly reduces simulation time and, in many cases, nominally reduces accuracy. We have shown that among the choices of sampling techniques, stitch most accurately predicts cache miss rates for these workloads. Furthermore, we have shown that sampling can be used to efficiently drive architectural studies and gather parameters for analytical cache models.
ACKNOWLEDGEMENTS
This work was supported in part by NSF Grant MIP-9700970 and by a gift from Intel Corporation.
SAMPLING RESULTS
Determination of Miss Rates
We simulated direct-mapped and 4-way set-associative instruction and data caches with sizes ranging from 8KB to 128KB and direct-mapped and 4-way set-associative combined caches with sizes ranging from 256KB to 4MB. We include 90% confidence intervals for all sampling results to give each estimate context. The following observations are representative of the complete set of simulations.
1. The real miss rate is within the 90% confidence interval of true-sample for all experiments.
2. All techniques are accurate within the 90% confidence interval for caches up to 32KB in size.
3. stitch and INITMR are reliable for these workloads up to 64KB, and stitch is the most accurate.
4. No technique is reliable for all large caches, although stitch comes quite close.
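As a rough illustration of how a 90% confidence interval can be attached to a sampled miss-rate estimate, the sketch below uses a normal approximation over per-sample miss ratios. How the intervals in this study were computed is not stated here, so the method, the function name `confidence_interval_90`, the constant `Z_90`, and the example values are assumptions.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple

Z_90 = 1.645  # two-sided 90% quantile of the standard normal distribution


def confidence_interval_90(miss_ratios: Sequence[float]) -> Tuple[float, float]:
    """Normal-approximation 90% interval for the mean per-sample miss ratio.

    Treats the per-sample miss ratios as independent observations, which is
    an assumption of this sketch rather than a property of real traces.
    """
    n = len(miss_ratios)
    if n < 2:
        raise ValueError("need at least two samples")
    center = mean(miss_ratios)
    half_width = Z_90 * stdev(miss_ratios) / math.sqrt(n)
    return center - half_width, center + half_width


if __name__ == "__main__":
    per_sample = [0.10, 0.12, 0.08, 0.10]  # made-up miss ratios
    low, high = confidence_interval_90(per_sample)
    full_trace_miss_ratio = 0.105  # made-up reference value
    print(f"interval: [{low:.4f}, {high:.4f}]")
    print("covers reference:", low <= full_trace_miss_ratio <= high)
```

An estimate is counted as accurate in the sense used above when the miss rate from the full trace falls inside the interval.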
Architectural Studies
Trace-driven simulation is used not only for the evaluation of cache parameters but also for studying hardware assists to the memory hierarchy or to the processor core. Often these hardware assists contain structures which, like caches, have states that depend on the recent history of data references or instruction execution.
Victim Caches
The scenario we consider here is that of an architect who wishes to gather an efficient estimate of the expected change in miss rate for data caches augmented with victim caches of between one and five entries. Our experiments compare the true results for the victim cache simulations to stitch-sampled results, whose miss rate accuracy has already been discussed. In addition to the actual miss rates being very similar, the true miss rate falls within, or just outside, the 90% confidence interval for caches up to 128KB in size. The trends in improved miss rate due to the victim caches are correctly predicted, even for large caches.
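A victim cache of the kind studied here can be sketched as a small fully associative buffer that holds blocks recently evicted from a direct-mapped cache. The sketch below is a minimal illustration, assuming swap-on-hit behaviour and oldest-first replacement in the victim buffer; the class `VictimCachedDM`, the helper `count_misses`, and the parameter values are hypothetical and do not describe the simulator used in the experiments.

```python
from collections import OrderedDict
from typing import List, Optional, Sequence


class VictimCachedDM:
    """Direct-mapped cache backed by a small fully associative victim cache."""

    def __init__(self, num_lines: int, victim_entries: int,
                 block_size: int = 32) -> None:
        self.num_lines = num_lines
        self.block_size = block_size
        self.victim_entries = victim_entries
        self.tags: List[Optional[int]] = [None] * num_lines
        self.victim: "OrderedDict[int, None]" = OrderedDict()  # oldest first

    def access(self, addr: int) -> bool:
        """Return True if the block is found in either structure."""
        block = addr // self.block_size
        index = block % self.num_lines
        resident = self.tags[index]
        if resident == block:
            return True
        hit = block in self.victim
        if hit:
            del self.victim[block]  # block moves back to the main cache
        self.tags[index] = block
        if resident is not None and self.victim_entries > 0:
            self.victim[resident] = None  # evicted block becomes a victim
            if len(self.victim) > self.victim_entries:
                self.victim.popitem(last=False)  # drop the oldest victim
        return hit


def count_misses(trace: Sequence[int], victim_entries: int,
                 num_lines: int = 4) -> int:
    cache = VictimCachedDM(num_lines, victim_entries)
    return sum(0 if cache.access(addr) else 1 for addr in trace)


if __name__ == "__main__":
    ping_pong = [0, 128] * 10  # two blocks that map to the same line
    for entries in range(0, 3):
        print(entries, "victim entries:", count_misses(ping_pong, entries))
```

In this toy trace two conflicting blocks alternate, so a single victim entry removes every miss after the first two; this is the type of conflict-miss reduction whose trend the sampled simulations aim to predict.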
Branch Prediction
Branch prediction is typically simulated over the same workloads used to drive cache memory simulations. As these prediction techniques increase in complexity, so do the costs of simulation.
