Abstract-Efficient execution on modern architectures requires good data locality, which can be measured by the powerful stack distance abstraction. Based on this abstraction, the miss rate for LRU caches of any size can be predicted. However, measuring stack distance requires the number of unique memory objects to be counted between successive accesses to the same data object, which requires complex and inefficient data collection. This paper presents a new efficient way of estimating the stack distances of an application. Instead of counting the number of unique memory objects touched between successive accesses to the same data, our scheme only requires the number of memory accesses to be counted, a task efficiently handled by existing builtin hardware counters. Furthermore, this information only needs to be captured for a small fraction of the memory accesses. A new efficient off-line algorithm is proposed to estimate the corresponding stack distance based on this sparse information.
I. INTRODUCTION
The high latency and limited bandwidth of DRAM have been identified as two potential bottlenecks of present and future computer systems [20]. However, both memory latency and bandwidth issues can often be tolerated if an application's data locality is improved, leading to better use of the caches in the memory hierarchy. To accomplish this, we need powerful techniques that give insight into application data locality. Furthermore, such techniques have to be efficient enough to evaluate long-running applications operating on large input data sets.
Stack distance, introduced by Mattson [17] , is the basis for several such techniques. The stack distance is the number of unique memory objects accessed during a reuse epoch, where a reuse epoch is the time between two successive memory accesses to the same memory object. For cache analysis, the memory objects are usually cache lines. Based on the distribution of stack distances for all memory operations of an application, the miss ratio of a fully associative LRU cache of any size can be calculated trivially and quickly [17] . Since its introduction, stack distance has been widely used, for example, to model cache reuse [8] , to guide program transformation [27] and insertion of cache hints [5] , to find locality phases [22] , and for modeling of cache contention between parallel processes [6] .
While the stack distance abstraction is powerful, the requirement to count the number of unique memory objects accessed during every reuse epoch requires elaborate bookkeeping, which results in complex and slow algorithms. Furthermore, the traditional way of implementing a stack distance algorithm requires full address traces collected from part or all of the execution of the target application. We have measured slowdowns of 100-1000× while collecting full address traces using the Pin [14] and Valgrind [18] dynamic instrumentation tools.
The goal of this research is to find an inexpensive way to model LRU caches, based on sparse and easily captured information, such as that used by StatCache [2] [3]. Instead of keeping track of the number of unique memory objects accessed during reuse epochs, StatCache simply records the number of memory accesses executed during each reuse epoch, called a reuse distance. Reuse distances can be measured with low overhead by leveraging hardware counters and watchpoint mechanisms provided by modern CPUs and operating systems. Furthermore, StatCache only measures the reuse distance for a very sparse selection of memory accesses, for example, only one out of every 10,000 memory accesses. Practical experiments using this method report an average runtime overhead of 40 percent [3], compared to 100-1000× to capture full address traces. StatCache then uses an off-line statistical model to estimate the miss ratio for an arbitrary size cache with a random-replacement policy.
The sparsely collected reuse distance information from StatCache appears to be a poor fit for building up the state needed to model LRU caches. However, it is possible to use the sparse reuse distance information to infer much of the lost state. Instead of explicitly trying to count the number of unique memory objects accessed during each reuse epoch, we start off with an algorithm that calculates stack distances based on the reuse distances of all memory accesses. The stack distances can then be estimated based on sparse reuse distance information by applying several approximations. The validity of these approximations is evaluated through experiments, which show excellent accuracy with reuse distance information for only one out of every 10,000 memory accesses.
The paper continues by reviewing StatCache and its efficient data collection. Next, in Section III, we give a rigorous description of StatStack. In Sections IV and V, we discuss possible sources of errors and describe a scheme for collecting runtime data that minimizes these errors. Finally, in Section VI, we evaluate the accuracy of StatStack by comparing its results with the output from a traditional cache simulator fed with full address traces for 28 applications. We show that a sparse sample of reuse distances can be used to model a fully-associative LRU cache accurately across a wide range of cache sizes.
II. STATCACHE IN REVIEW
StatCache is a tool for estimating an application's cache miss ratio [2] [3]. It uses an online sampler that attaches to the target application and measures the reuse distances of a set of randomly selected memory references. The reuse distance of a memory reference A is equal to the number of memory references in the target application's address trace between A and the previous memory reference to the same cache line. The sampler records the set of measured reuse distances in a structure called a reuse distance sample or RDS, which is fed as input to an offline statistical cache model. StatCache models a fully associative cache with a random replacement policy, and can accurately estimate miss ratios using RDSs collected with sample rates as low as $10^{-6}$ [3], i.e., only containing reuse information for one out of every million memory accesses.
Unlike collection of address traces, the collection of reuse distances allows the sampler to make efficient use of functionality supported by contemporary hardware and operating systems, such as hardware performance counters and watchpoints. The StatCache sampler enables sampling of unmodified binaries and has a relatively small impact on the sampled application's runtime. Berg et al. [3] report an average slowdown of only 40 percent when hardware and operating system support are used.
The StatCache sampler works as follows. Hardware counter overflow traps are used to halt the execution when a memory reference that has been selected for sampling executes. A one-time watchpoint is then set for the address it accesses. The number of memory accesses performed to date is recorded by reading a hardware counter, after which the execution is continued. The next time the same address is accessed, a watchpoint trap is generated, which again halts the execution. The reuse distance is calculated as the difference between the number of memory accesses executed to date and the previously recorded number.
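The sampling procedure can be imitated in software on a recorded address trace. The following Python sketch is only an illustration under stated assumptions: the function name `sample_reuse_distances`, the per-access sampling probability, and the use of a dictionary in place of hardware watchpoints are all hypothetical, and the reuse distance is taken as the number of accesses strictly between the two accesses.

```python
import random

def sample_reuse_distances(trace, sample_rate, seed=0):
    """Software stand-in for the sampler: returns (reuse distances, dangling count).

    trace       -- sequence of cache line identifiers, one per memory access
    sample_rate -- probability that any given access is selected (assumption)
    """
    rng = random.Random(seed)
    watch = {}  # cache line -> access count recorded when the watchpoint was set
    rds = []
    for count, line in enumerate(trace):
        # A pending watchpoint on this line triggers: record the reuse distance.
        if line in watch:
            rds.append(count - watch.pop(line) - 1)
        # Possibly select this access for sampling and set a one-time watchpoint.
        if rng.random() < sample_rate:
            watch[line] = count
    # Watchpoints that never triggered correspond to lines not accessed again.
    return rds, len(watch)
```

With `sample_rate=1.0` every access is sampled, which is convenient for checking the logic on short traces.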
III. STATSTACK
StatStack, like StatCache, is a method for estimating an application's cache miss ratio. StatStack introduces a new statistical cache model that, unlike the statistical cache model of StatCache, models a fully associative cache with an LRU replacement policy. The StatStack cache model uses the same RDS input data as the cache model of StatCache. This allows StatStack to use the same efficient sampling technique as StatCache, and thereby inherit StatCache's low runtime overhead.
A natural approach to modeling of LRU caches is to use the stack distance abstraction. However, the sparse input data used by StatStack does not contain enough information to support traditional methods for computing stack distances. StatStack is based on a new approach that uses the concept of an expected stack distance, which is roughly the average number of unique cache lines accessed during a reuse epoch.
Our approach can be summarized as follows: 1) Based on the reuse distance information in an RDS, we estimate the target application's reuse distance distribution; 2) We use that distribution to determine the likelihood that a memory access, to cache line l, is the last of the memory accesses executed in the current reuse epoch to access l; 3) Based on these likelihoods, we estimate the average number of unique cache lines accessed during each reuse epoch recorded in the RDS, which gives us the expected stack distances of the sampled memory accesses; 4) We then use these expected stack distances to estimate the target application's miss ratio.
In the remainder of this section, we develop the StatStack cache model. We start by introducing a fundamental relation between reuse and stack distance (Section III-A). Based on this relation we derive an expression for expected stack distances given the knowledge of all reuse distances in the target application's address trace (Section III-B). We then introduce a set of approximations that enable us to estimate the expected stack distances based only on the sparse information in an RDS (Section III-C). Finally, we put the pieces together and show how the miss ratio of the target application is estimated based on expected stack distances (Section III-E).
A. Stack Reuse Relation
In this section, we introduce a relation between reuse distance and stack distance. This relation allows us to compute the stack distance of a memory access, x, given that we know the reuse distance of all memory accesses executed between x and the previous memory accesses to the same cache line. Before proceeding, we need to make our terminology more precise.
Definition 3.1: Let $x_i$ and $x_j$ be two successive memory accesses to the same cache line, then
• the reuse distance of $x_j$ is the number of memory references executed between $x_i$ and $x_j$,
• the forward reuse distance of $x_i$ is the number of memory references executed between $x_i$ and $x_j$,
• the stack distance of $x_j$ is the number of distinct cache lines accessed by the memory references executed between $x_i$ and $x_j$.
The relation between reuse distance and stack distance is best explained by an example.
Example 1: Figure 1 shows a segment of an address trace, where memory references are marked by dots on the horizontal time axis. The memory references are labeled according to their position in the address trace. The arcs connect successive memory references to the same cache line to indicate a reuse epoch. Consider, for example, the reuse epoch between $x_1$ and $x_8$. The stack distance of $x_8$ is three, since there are three unique cache lines, $l_2$, $l_3$ and $l_4$, accessed by the memory references executed between $x_1$ and $x_8$. Each of these three cache lines has to be accessed one and only one last time by the memory references executed between $x_1$ and $x_8$. The last of these memory references to $l_2$, $l_3$ and $l_4$ are $x_5$, $x_4$ and $x_7$, respectively. These are exactly the memory references executed between $x_1$ and $x_8$ whose outbound arcs reach beyond $x_8$, i.e., their forward reuse distance is greater than their distance to $x_8$. The stack distance of $x_8$ is therefore equal to the number of memory references executed between $x_1$ and $x_8$ with a forward reuse distance greater than their distance to $x_8$, which is three in this example.
The result of the above example holds true in general. To show this we introduce the following notation: Let $R(x_t)$ and $\overrightarrow{R}(x_t)$ denote the reuse distance and the forward reuse distance of $x_t$, respectively. Further, let $Q(x_t)$ be the set of memory references between $x_t$ and the previous memory reference to the same cache line. For example, in Figure 1, $Q(x_8) = \{x_2, x_3, x_4, x_5, x_6, x_7\}$. We can now generalize the result of Example 1 as follows: The stack distance of a memory reference $x_t$ is equal to the number of memory references, $x_i \in Q(x_t)$, for which $\overrightarrow{R}(x_i) > t - i$. This statement follows from the observation that $\overrightarrow{R}(x_i) > t - i$ if and only if $x_i$ is the last of the memory references in $Q(x_t)$ to access cache line $l_i$. The stack distance of $x_t$, denoted by $S(x_t)$, can now be expressed as follows,
$$S(x_t) = \sum_{x_i \in Q(x_t)} 1\big(\overrightarrow{R}(x_i) > t - i\big) = \sum_{j=1}^{r} 1\big(\overrightarrow{R}(x_{t-j}) > j\big) \qquad (1)$$
where $r = R(x_t)$ and $1(\alpha)$ is defined to be one if $\alpha$ is true and zero otherwise. The second equality in (Eq. 1) follows from the variable substitution $j = t - i$. (Eq. 1) can also be derived from equations presented by Bennett and Kruskal [1] (see Appendix).
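The relation in (Eq. 1) can be checked numerically with a short Python sketch. The trace used in the check is hypothetical and is not the trace of Figure 1, and the function names are illustrative. One convention is assumed and should be read as such: the forward distance of a reference is measured here as the gap in trace positions to the next access of the same line (infinite if there is none), which makes the strict inequality exact.

```python
def next_use_gap(trace):
    """Gap in trace positions from each access to the next access of the same line."""
    gaps = [float("inf")] * len(trace)
    last_seen = {}
    for pos, line in enumerate(trace):
        if line in last_seen:
            gaps[last_seen[line]] = pos - last_seen[line]
        last_seen[line] = pos
    return gaps

def previous_access(trace, t):
    """Position of the previous access to the line accessed at position t."""
    return max(i for i in range(t) if trace[i] == trace[t])

def stack_distance_direct(trace, t):
    """Count the distinct lines accessed strictly inside the reuse epoch."""
    prev = previous_access(trace, t)
    return len(set(trace[prev + 1:t]))

def stack_distance_via_reuse(trace, t):
    """Count the accesses in the epoch whose next use lies beyond position t."""
    gaps = next_use_gap(trace)
    prev = previous_access(trace, t)
    return sum(1 for i in range(prev + 1, t) if gaps[i] > t - i)
```

Both functions should agree for every access that has a previous access to the same line; the direct count needs the addresses inside the epoch, whereas the second version needs only distances.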
B. Expected Stack Distance
To compute the stack distance of a memory reference x t using the method of Section III-A we need to know the forward reuse distance of all memory references executed between x t and the previous memory reference to the same cache line. However, our goal is to approximate an application's miss ratio given only the reuse distance of a sparse set of the application's memory references. To this end, we introduce the concept of an expected stack distance. We start by assuming that we know the reuse distance of all memory references in the target application's address trace, and derive an expression for the expected stack distances based on this information. Later, we introduce a set of approximations that enable us to estimate expected stack distances based on the sparse information in an RDS.
The expected stack distance of the memory references with a reuse distance of r is the average stack distance of all memory references with a reuse distance of r. As an example, consider the three memory references $x_i$, $x_j$ and $x_k$ in Figure 2. Using the method of Section III-A, we can compute their stack distances, which are one, two and three, respectively. Assuming that these three memory references are the only memory references with a reuse distance of three, we can compute their expected stack distance simply by averaging their stack distances, which in this case gives us an expected stack distance of (1 + 2 + 3)/3 = 2.
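To derive an analytical expression for the expected stack distance we introduce the following notation. Let $T(r)$ be the set of all memory references with a reuse distance of $r$ in the target application's address trace, and let $n_r$ be the number of such memory references. We can now express the expected stack distance of the memory references with a reuse distance of $r$, denoted $ES(r)$, as follows,
$$ES(r) = \frac{1}{n_r} \sum_{x_i \in T(r)} S(x_i) = \frac{1}{n_r} \sum_{x_i \in T(r)} \sum_{j=1}^{r} 1\big(\overrightarrow{R}(x_{i-j}) > j\big) \qquad (2)$$
We get the second equality in (Eq. 2) by substituting $S(x_i)$ with the right hand side of (Eq. 1). We again consider the memory references $x_i$, $x_j$ and $x_k$ in Figure 2, but this time we compute their expected stack distance using (Eq. 2). By expanding the two nested sums in (Eq. 2), we get the following expression,
$$\begin{aligned} ES(3) = {} & \tfrac{1}{3}\big[1(\overrightarrow{R}(x_{i-1}) > 1) + 1(\overrightarrow{R}(x_{j-1}) > 1) + 1(\overrightarrow{R}(x_{k-1}) > 1)\big] \\ {} + {} & \tfrac{1}{3}\big[1(\overrightarrow{R}(x_{i-2}) > 2) + 1(\overrightarrow{R}(x_{j-2}) > 2) + 1(\overrightarrow{R}(x_{k-2}) > 2)\big] \\ {} + {} & \tfrac{1}{3}\big[1(\overrightarrow{R}(x_{i-3}) > 3) + 1(\overrightarrow{R}(x_{j-3}) > 3) + 1(\overrightarrow{R}(x_{k-3}) > 3)\big] \end{aligned} \qquad (3)$$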
Notice the order in which the terms are arranged. For example, the first row in (Eq. 3) is the fraction of the memory references $x_{i-1}$, $x_{j-1}$ and $x_{k-1}$ with a forward reuse distance greater than one, the second row is the fraction of the memory references $x_{i-2}$, $x_{j-2}$ and $x_{k-2}$ with a forward reuse distance greater than two, and so on. We denote these fractions $f_1$, $f_2$ and $f_3$, respectively. The benefit of rearranging the terms of (Eq. 2) is the following. Given the reuse distance distributions for the memory references appearing in each row, we can compute $f_1$, $f_2$ and $f_3$ simply by computing the fraction of forward reuse distances in the respective distributions that are greater than one, two and three, respectively. This will later allow us to make the approximations necessary to estimate expected stack distances based on the sparse reuse distance information in an RDS. Above, a subscript of $i$ denotes that $f_i$ is the fraction of memory references with a forward reuse distance greater than $i$, at position $i$ counted backwards from the end, in the reuse epochs of length three. Generalizing the notation by adding a superscript that denotes the length of the reuse epochs in question allows us to write the following expression for the expected stack distances,
$$ES(r) = \sum_{i=1}^{r} f_i^r \qquad (4)$$
As an aside we note that the $f_i^r$ fractions can be interpreted as a likelihood: If we were to pick one of the memory references $x_{i-2}$, $x_{j-2}$ and $x_{k-2}$ in Figure 2 at random, we can interpret $f_2^3$ as the likelihood that the forward reuse distance of the memory reference is greater than two.
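As an illustration of the averaging that defines the expected stack distance, the following Python sketch groups the accesses of a full trace by reuse distance and averages their stack distances. It assumes complete knowledge of the trace, counts the accesses strictly inside the reuse epoch as the reuse distance, and uses a hypothetical function name; the trace in the usage note is constructed for illustration and is not the trace of Figure 2.

```python
from collections import defaultdict

def expected_stack_distances(trace):
    """Map each reuse distance r to the average stack distance of accesses with that r."""
    last_pos = {}
    by_reuse = defaultdict(list)
    for t, line in enumerate(trace):
        if line in last_pos:
            prev = last_pos[line]
            r = t - prev - 1                   # accesses strictly inside the epoch
            s = len(set(trace[prev + 1:t]))    # distinct lines inside the epoch
            by_reuse[r].append(s)
        last_pos[line] = t
    return {r: sum(v) / len(v) for r, v in by_reuse.items()}
```

For the trace `list("ABBBACDDECFGHIF")`, the three accesses with a reuse distance of three have stack distances one, two and three, so the sketch yields an expected stack distance of 2.0 for r = 3.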
C. RDS Approximation
To compute an expected stack distance using (Eq. 4), we need to compute all the $f_i^r$ fractions. For each of these fractions, we need to know the reuse distance distribution for a specific set of memory references. For example, to compute $f_2^3$ in Figure 2, we need to know the reuse distance distribution of $x_{i-2}$, $x_{j-2}$ and $x_{k-2}$. In this paper, we use RDSs that contain the reuse distances of only one out of every 10,000 to 50,000 memory references on average. It is therefore unlikely that these RDSs contain enough reuse distances to accurately estimate the $f_i^r$ fractions. Our approach is to introduce an approximation that relaxes the constraint of having to know the reuse distance distributions of these specific sets of memory references.
The approximation that we use relies on the assumption that the reuse distance distribution observed for any (large enough) set of memory references is the same as the reuse distance distribution of all memory references in the address trace. Under this assumption, we can make the following approximation,
$$f_i^r \approx F_i \qquad \text{(Approx. A)}$$
where $F_i$ is the fraction of all memory references with a reuse distance greater than $i$ in the target application's address trace.
With the above approximation, we can use the reuse distance distribution of all memory references in the target application to compute the $f_i^r$ fractions. This might seem like a step in the wrong direction, since we now need to know the reuse distance of more, in fact all, memory references to compute the $f_i^r$ fractions. However, the benefit of this is that we can now use all reuse distances in an RDS to estimate the reuse distance distribution needed to compute the expected stack distances. Let $\hat{F}_i$ be the fraction of reuse distances in an RDS that are greater than $i$. We can now make the following approximation,
$$F_i \approx \hat{F}_i \qquad \text{(Approx. B)}$$
By putting together (Approx. A), (Approx. B) and (Eq. 4), we get the following approximate expression for expected stack distance,
$$ES(r) \approx \sum_{i=1}^{r} \hat{F}_i \qquad (7)$$
We have now arrived at an expression that allows us to compute the expected stack distance of any memory reference in the target application's address trace, given only its reuse distance and a sparse RDS collected from the target application. In order to move forward with the development of the cache model we defer the discussion of the implications of the above approximation to Section V. However, it is important to recognize the negative effects that program phase changes [7] [23] can have on (Approx. A). Fortunately, the StatCache sampler has a sampling mode, discussed in Section IV, that allows us to estimate the expected stack distances of memory references executed in different program phases separately. As we will see in Section VI, this will, under most circumstances, eliminate the sensitivity to program phase changes.
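A direct, unoptimized reading of this approximation sums, for i = 1, ..., r, the fraction of sampled reuse distances that are greater than i. The Python sketch below illustrates only that reading; `rds` is a hypothetical list of sampled reuse distances and the function name is illustrative.

```python
def expected_stack_distance(rds, r):
    """Approximate ES(r) as the sum over i = 1..r of the fraction of samples > i."""
    n = len(rds)
    total = 0.0
    for i in range(1, r + 1):
        fraction_greater = sum(1 for d in rds if d > i) / n
        total += fraction_greater
    return total
```

For `rds = [1, 2, 4, 8]` and `r = 3`, the three fractions are 3/4, 2/4 and 2/4, giving an expected stack distance of 1.75. The cost grows with both r and the number of samples, so a practical implementation would reuse partial sums across reuse distances.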
D. Cold Misses
Since the number of cold misses experienced by an application is independent of the replacement policy of the cache, we can use the same method as StatCache to compute the target application's cold miss ratio. This method is based on the concept of dangling samples. When a memory reference that the StatCache sampler has selected for monitoring executes, the sampler sets a watchpoint for the address being accessed. If this address is not accessed again, the watchpoint will never trigger.
When the application has finished its execution, the untriggered watchpoints are recorded in the RDS as dangling samples.
The number of cold misses experienced by an application is proportional to the number of dangling samples [4]. A cold miss occurs when the application accesses a cache-line-sized piece of memory for the first time. Every cache-line-sized piece of memory accessed by the application is also accessed one last time, and if the last memory reference is sampled it will result in a dangling sample. Since the sampler samples all memory accesses with equal probability, the cold miss ratio is equal to the number of dangling samples divided by the number of reuse distances in the RDS.
E. Cache Model
In this section, we put the pieces together and show how StatStack estimates an application's miss ratio, including both cold and capacity misses, from its input data, an RDS collected by the StatCache sampler.
Armed with equation (Eq. 7), we can estimate the expected stack distance distribution of the target application given its RDS as follows. First, we compute the expected stack distance of each distinct reuse distance in the RDS using (Eq. 7). Our implementation of StatStack does this in three steps: First, it sorts the reuse distances in the RDS into a histogram data structure. It then computes all $\hat{F}_i$'s in a single pass over the histogram's buckets. Finally, it computes the expected stack distance for each distinct reuse distance in the RDS by computing a running sum over the sorted sequence of $\hat{F}_i$'s. Then, by weighting each of the expected stack distances with the frequency of the corresponding reuse distance in the RDS, we get the expected stack distance distribution.
To estimate the target application's capacity miss ratio, we approximate its actual stack distance distribution with its expected stack distance distribution; we discuss this approximation further in Section V-A. For a fully associative cache with an LRU replacement policy, a memory reference will result in a cache miss if its stack distance is greater than the cache size, measured in number of cache lines [17]. We can therefore compute the target application's capacity miss ratio for a given cache size C simply by computing the fraction of stack distances in its expected stack distance distribution that are greater than or equal to C. Finally, by adding the cold miss ratio to the capacity miss ratio we get the target application's miss ratio for a fully associative cache with an LRU replacement policy. Figure 3 shows the miss ratios estimated by StatStack as a function of cache size for the SPEC CPU2006 benchmarks, together with a reference miss ratio obtained from a trace-driven cache simulator. The RDSs used to estimate these miss ratios have been collected using a sample rate of $10^{-4}$, i.e., the RDSs contain the reuse distances of only one out of every 10,000 memory references. These results will be analyzed in Section VI, where we perform a sensitivity analysis to quantify the impacts of the approximations on the accuracy of the cache model.
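The histogram and running-sum procedure, followed by the capacity miss ratio computation, can be sketched in Python as follows. This is a minimal illustration and not the StatStack implementation: the function names are hypothetical, reuse distances are assumed to be non-negative integers, and dangling samples are left out, so the cold miss ratio would have to be added separately.

```python
from collections import Counter

def expected_stack_distance_table(rds):
    """Map each distinct reuse distance r in the RDS to its expected stack distance."""
    n = len(rds)
    hist = Counter(rds)
    greater = n - hist.get(0, 0)  # number of samples with distance > prev_r
    prev_r = 0
    es = 0.0                      # running sum of the fractions for i = 1..prev_r
    table = {}
    for r in sorted(hist):
        if r > 0:
            # For prev_r < i < r no sample equals i, so the fraction is unchanged.
            es += (r - prev_r - 1) * greater / n
            # At i = r the samples equal to r stop counting as "greater than i".
            greater -= hist[r]
            es += greater / n
            prev_r = r
        table[r] = es
    return table

def capacity_miss_ratio(rds, cache_lines):
    """Fraction of samples whose expected stack distance is >= the cache size."""
    table = expected_stack_distance_table(rds)
    hist = Counter(rds)
    misses = sum(count for r, count in hist.items() if table[r] >= cache_lines)
    return misses / len(rds)
```

Because the table is built once, miss ratios for many cache sizes can be read off the same table, which matches the goal of predicting the miss ratio for a cache of any size from a single RDS.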
IV. HIERARCHICAL SAMPLING
In this section, we discuss a hierarchical sampling policy implemented by the StatCache sampler. In this work, we use the hierarchical sampling policy for two reasons: First, it reduces StatStack's sensitivity to program phase changes, and second, it reduces the runtime overhead of the sampler.
Under the hierarchical sampling policy, the sampler alternates between two phases, the sampling phase and the hibernation phase. It is only during the sampling phase that the sampler starts new reuse distance measurements. The sampler stays in the sampling phase for a fixed number of memory references. We call the duration of a sampling phase a sampling window. The hibernation phases are of random length; this is to ensure random sampling. When a sampling phase comes to its end, the watchpoints are kept alive so that no measurements are lost. For most applications, the majority of reuse distances are short, so most of the watchpoints trigger early in the hibernation phase. This allows the target application to execute at close to native speed during the sampler's hibernation phases.
The sampler tags the reuse distances it measures with the id of the window in which the measurement started. This allows the cache model to estimate the miss ratio of each sampling window individually. As we alluded to in Section III-C, this is what eliminates the cache model's sensitivity to program phase changes.
The StatCache sampler exposes three parameters to the user: the sampling phase length (s), the average hibernation phase length (h), and the average number of samples per sampling phase (n). The setting of these parameters is important since they affect the accuracy of the cache model and the runtime overhead of the sampler.
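To make the roles of s, h and n concrete, the following Python sketch generates the positions at which new measurements would start under a hierarchical policy. It is a sketch under assumptions that the text does not specify: hibernation lengths are drawn from an exponential distribution with mean h, each access inside a sampling window is selected independently with probability n/s, and the function name is hypothetical.

```python
import random

def hierarchical_sample_positions(trace_len, s, h, n, seed=0):
    """Return (position, window_id) pairs for accesses selected for sampling."""
    rng = random.Random(seed)
    samples = []
    pos = 0
    window = 0
    while pos < trace_len:
        end = min(pos + s, trace_len)
        # Sampling phase: start new measurements only inside the window.
        for p in range(pos, end):
            if rng.random() < n / s:
                samples.append((p, window))
        # Hibernation phase of random length (assumed exponential, mean h).
        gap = int(rng.expovariate(1.0 / h)) if h > 0 else 0
        pos = end + gap
        window += 1
    return samples
```

Tagging each sample with its window id is what allows a cache model to be evaluated per sampling window. Setting `h = 0` and `n = s` selects every access, which corresponds to a non-sparse RDS.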
V. ERROR SOURCES
To be able to drive the cache model with the sparse reuse distance information in an RDS we have introduced a number of approximations that are potential sources of errors. We group these error sources into two categories, approximation errors and statistical sampling errors. These categories are further divided into type I and type II approximation errors and type I and type II statistical sampling errors. In the remainder of this section we discuss these error sources one by one.
A. Type I Approximation Error
StatStack computes the expected stack distance distribution of the target application and uses it to approximate the target application's miss ratio. By doing this, StatStack approximates the application's actual stack distance distribution with its expected stack distance distribution. Since these two distributions are generally not the same, this approximation is a potential source of errors. Figure 4 shows an example stack distance distribution of memory references with a reuse distance of r and also their expected stack distance. The memory references with a stack distance greater than the cache size will result in a cache miss. As shown in Figure 4, the expected stack distance of the memory references for this distribution is less than the cache size. The miss ratio calculated from the expected stack distance distribution is therefore zero. However, since some of the stack distances are larger than the cache size, the miss ratio calculated from the actual stack distance distribution is not zero.
The above error tends to be large if the application makes a large number of memory references with the same reuse distance but with widely varying stack distances. This is most likely to be the case for applications having program phases with very different cache behaviors. However, since the cache model estimates the miss ratio for each sampling window individually, it is only for the windows that span a program phase change that this is a problem. This error is reduced by having the sampler use a large number of small sampling windows. This results in the portion of windows that span program phase changes being small, and therefore their impact on the overall estimated miss ratio is low.
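A small numeric illustration of this effect, with made-up stack distances rather than the data behind Figure 4, is sketched below in Python. The function name is hypothetical, and a miss is counted when a stack distance is greater than or equal to the cache size in lines.

```python
def actual_and_estimated_miss_ratio(stack_distances, cache_lines):
    """Compare the true miss ratio with the one implied by the mean stack distance.

    stack_distances -- stack distances of accesses sharing one reuse distance
    """
    n = len(stack_distances)
    actual = sum(1 for s in stack_distances if s >= cache_lines) / n
    expected = sum(stack_distances) / n
    # All accesses are treated alike once they are replaced by their average.
    estimated = 1.0 if expected >= cache_lines else 0.0
    return actual, estimated
```

For stack distances `[1, 1, 1, 9]` and a cache of four lines, the average is 3, so the estimate is a miss ratio of zero, while one in four of the accesses actually misses.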
B. Type II Approximation Error
In Section III-C, we made the assumption that the reuse distance distribution observed for any (large enough) set of memory references is the same as the reuse distance distribution of all memory references in the address trace. If this assumption does not hold, the accuracy of the cache model will suffer.
For example, if an application's reuse distance distribution changes over time, the above assumption does not hold. However, for these types of applications, we can reduce the problem by using a sampling window size small enough that the reuse distance distribution stays the same within the sampling windows. Another scenario where the assumption does not hold is for applications whose memory references' reuse distances display certain patterns. For example, every memory reference with a forward reuse distance of length r is followed by r memory references with a reuse distance of 1. In this case sampling windows do not help. However, as we show in Section VI, the approximation errors are small, which suggests that reuse distance patterns that compromise the accuracy of the cache model are rare.
We have argued that both the type I and type II approximation errors can be reduced by using short sampling windows. However, sampling windows that are too short can hurt the cache model. Consider a memory reference with a reuse distance of r. As we showed in Section III-A, its stack distance is determined by the forward reuse distances of the r memory references executed before it. If the sampling window size is smaller than r, the sampler cannot capture the reuse behavior of these r memory references. When modeling a cache of size C, the accuracy of the estimated stack distances with lengths close to C will have the largest impact on the overall accuracy of the cache model. Therefore, the window size should typically not be shorter than the reuse distances of the memory accesses with expected stack distances close to the cache size.
C. Type I Statistical Sampling Error
StatStack estimates the target application's reuse distance distribution with the distribution of reuse distances in a sparse RDS. This gives rise to the type I statistical sampling error. The magnitude of the type I statistical sampling error depends on the number of reuse distances measured in the sampling windows and on the variance of the target application's reuse distance distribution.
D. Type II Statistical Sampling Error
The hibernation phase of the hierarchical sampling policy introduces a second type of statistical sampling error, which we call the type II statistical sampling error.
The use of non-zero-length hibernation phases, as compared to zero-length hibernation phases, can roughly be thought of as randomly selecting only a subset of the sampling windows for estimating the overall miss ratio. Since the overall miss ratio is the average of the miss ratios for the sampling windows, the type II statistical sampling errors can be analyzed using standard statistical techniques for random sampling. This type of analysis, for the similar method of trace sampling [13] [16], has been thoroughly investigated in previous work [26]. In essence, the type II statistical sampling error depends on the average hibernation phase length and how much the target application's miss ratio varies over time.
VI. EVALUATION
In this section, we evaluate the accuracy of the StatStack cache model by comparing its output, an estimated miss ratio, to a reference miss ratio obtained from a traditional trace-driven cache simulator, for the SPEC CPU2006 benchmark suite.
A. Experimental Setup
We use 28 out of the 29 programs in the SPEC CPU2006 benchmark suite (481.wrf did not compile on our system), run with their first reference input sets. All benchmark programs are compiled using GCC version 4.1.2 with optimization level -O2, targeting an x86-64 system.
In order to evaluate the accuracy of our model, we first collect reference address traces from the benchmark programs using an in-house instrumentation tool. The address trace collection is started after approximately 60 seconds of uninstrumented execution for all programs except for 403.gcc, whose total execution time with its first reference input set is less than 60 seconds. All traces used for evaluation contain five billion memory accesses. This trace size was chosen for practical reasons.
To obtain reference miss ratios we use a trace-driven cache simulator, configured to simulate a fully associative cache with an LRU replacement policy for cache sizes ranging from 32kB up to 8MB. The reference miss ratios include both cold and capacity misses. We then collect sparse RDSs by measuring the reuse distances of randomly selected memory references in the address traces using the hierarchical sampling policy. We finally use an implementation of the StatStack cache model, as described in Section III, to estimate the miss ratios of the traces for cache sizes ranging from 32kB up to 8MB.
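For readers who want to reproduce a reference miss ratio on small traces, a fully associative LRU cache can be simulated with an explicit LRU stack. The Python sketch below is a deliberately simple stand-in, not the simulator used in this evaluation: it runs in time proportional to the trace length times the number of distinct lines, and the function names are hypothetical.

```python
def lru_stack_distances(trace):
    """Stack distance of every access; None marks the first access to a line."""
    stack = []  # least recently used line first, most recently used line last
    distances = []
    for line in trace:
        if line in stack:
            idx = stack.index(line)
            # Lines above this one were touched since its previous access.
            distances.append(len(stack) - 1 - idx)
            stack.pop(idx)
        else:
            distances.append(None)
        stack.append(line)
    return distances

def lru_miss_ratio(trace, cache_lines):
    """Miss ratio, including cold misses, of a fully associative LRU cache."""
    distances = lru_stack_distances(trace)
    misses = sum(1 for d in distances if d is None or d >= cache_lines)
    return misses / len(distances)
```

A single pass that records the stack distances is enough to obtain the miss ratio for every cache size, since only the comparison against `cache_lines` changes.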
B. Sampler Parameters
As mentioned in Section IV, the StatCache sampler has three parameters: sampling phase length (s), average hibernation phase length (h) and average number of samples per sampling phase (n). We have empirically found two sets of parameters, $S_1: \{s = 10^6, h = 14 \cdot 10^6, n = 1500\}$ and $S_2: \{s = 10^6, h = 74 \cdot 10^6, n = 1500\}$, that work well for our benchmarks. Here s and h are measured in number of executed memory references. These two sets of parameters result in sample rates of $10^{-4}$ and $2 \cdot 10^{-5}$, respectively. Unless otherwise noted, these are the parameters used to collect the RDSs in this section. For an in-depth discussion of how we found these sampler parameters please see [11].
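The two sample rates are consistent with dividing the average number of samples per sampling phase by the combined length of one sampling phase and one average hibernation phase. This reading is an inference from the numbers rather than a formula stated in the text:

$$\frac{n}{s + h} = \frac{1500}{10^6 + 14 \cdot 10^6} = 10^{-4}, \qquad \frac{n}{s + h} = \frac{1500}{10^6 + 74 \cdot 10^6} = 2 \cdot 10^{-5}.$$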
C. Sensitivity Analysis
In this section, we present the results of two experiments. In the first, we evaluate the cache model's sensitivity to the approximation errors, and in the second, we evaluate the cache model's sensitivity to the sampling errors.
1) Approximation Sensitivity: To evaluate the cache model's sensitivity to the approximation errors, we collect RDSs that contain the reuse distances of all memory references in the address traces and estimate miss ratios using these RDSs. Since these non-sparse RDSs contain the reuse distances of all memory references, we effectively eliminate the sampling errors. By comparing these miss ratios to the reference miss ratios, we can evaluate the approximation errors in isolation. Note that, even though we measure the reuse distance of all memory references in the address traces, we still use a sampling window of size $s = 10^6$ (the other two parameters are set to $h = 0$ and $n = 10^6$). Figure 5 shows the estimated miss ratio next to the reference miss ratio for all benchmarks (except for 429.mcf$^1$, 433.milc$^1$, 471.omnetpp$^1$, 410.bwaves$^2$ and 453.povray$^2$). The close agreement between the estimated and reference miss ratios indicates that the approximation errors are relatively small.
The largest discrepancies between the estimated and the reference miss ratios are for small cache sizes of less than 64KB. This is likely due to the type I approximation error: for most applications, the distributions of stack distances are heavily weighted towards short stack distances. This makes it likely that there are a large number of memory references with the same reuse distance that display the behavior shown in Figure 4 for small cache sizes. The application with the largest error for large cache sizes is 473.astar, for which the estimated miss ratio is somewhat lower than the reference for 4MB caches. This indicates that some portion of the expected stack distances with lengths close to 64k (4MB / line-size) are underestimated, which is likely due to the type II approximation error.
2) Sampling Sensitivity: In order to evaluate the sampling errors, we sampled the address traces using sampler parameters $S_1$ and estimated their miss ratios 32 times each. By observing the differences between the estimated miss ratios, we can evaluate the sampling errors. Figure 6 shows three graphs for each benchmark application: Statstack-min and Statstack-max show the minimum and the maximum of the 32 estimated miss ratios, and the third graph, labeled Reference, shows the reference miss ratio. The close proximity of the min and max graphs indicates that the largest differences among the 32 estimated miss ratios are small, which further implies that the sampling errors are small. We repeated the above experiment with the sampler parameters $S_2$; the results are shown in Figure 7. Since RDSs collected using $S_2$ contain on average 5 times fewer reuse distances than RDSs collected using $S_1$, we expect the sampling error to be larger in Figure 7 than in Figure 6, which is indeed the case. However, the only difference between the two sets of sampler parameters is that $S_2$ has longer hibernation periods and therefore contains fewer sampling windows; the larger sampling error in Figure 7 is therefore due to the type II sampling error.
To gain further insight into the sampling errors, we consider the distributions of errors for the two sets of sampler parameters, as shown in Figure 8. These error distributions are computed using the same data as above, which contains 32 miss ratios for each application and each cache size. We compute the errors for each of these sets of 32 miss ratios individually as the absolute error with respect to the average miss ratio. To obtain a large enough data sample to accurately estimate the error distributions, the distributions shown in Figures 8(a) and 8(b) contain the errors of all applications and all cache sizes for the miss ratios estimated using the sampler parameters $S_1$ and $S_2$, respectively.
The reason for using absolute error as opposed to relative error is that many of our benchmark applications have very low miss ratios. For example, 456.hmmer has a miss ratio of less than 0.5% for all cache sizes greater than 128KB. The performance gained by reducing a miss ratio this small, by say 50%, is negligible. Furthermore, small absolute errors tend to become large relative errors for small cache sizes, and the other way around for large caches. If we use relative errors, the errors for low miss ratios will be overrepresented and the errors for large miss ratios, for which there are potential performance gains, will be suppressed. We therefore use absolute errors.
As we can see in Figure 8, when RDSs are collected using $S_1$, 90% of the measured errors are less than 0.2%; for RDSs collected using $S_2$, 74% of the errors are less than 0.2% and 89% are less than 0.4%.
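The error metric used for these distributions can be sketched as follows. For one application and one cache size, each of the repeated miss ratio estimates is compared against their average, and the fraction of errors below a threshold is then read off the pooled errors. The function names and the three example values are illustrative only and are not measured data.

```python
from statistics import mean

def absolute_errors(miss_ratios):
    """Absolute error of each estimate with respect to the average estimate."""
    avg = mean(miss_ratios)
    return [abs(m - avg) for m in miss_ratios]

def fraction_below(errors, threshold):
    """Fraction of errors strictly below `threshold`."""
    return sum(e < threshold for e in errors) / len(errors)

# Illustrative values only: three estimated miss ratios for one cache size.
estimates = [0.10, 0.12, 0.14]
errors = absolute_errors(estimates)
share = fraction_below(errors, 0.01)
```

In the evaluation, the errors from all applications and cache sizes would be pooled into one list before computing such fractions.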
VII. CACHE MODEL PERFORMANCE
The overall performance of a cache model such as StatStack has two parts: the performance of the data collection mechanism and the performance of the model itself.
The StatStack cache model uses the StatCache sampler to collect its input data. The sparseness of the collected data in conjunction with the use of hardware and operating system support makes the execution time overhead of the target application as low as 40 percent [3] .
The execution time of the cache model, for the RDSs used for evaluation, is only a few seconds. The short execution time is due to the sparseness of the input data. Internally, the cache model stores the reuse distances in a histogram. The most time consuming operation for the cache model is to sort the data in the input RDS into the histogram, but because of the sparseness of the RDSs this operation is still fast. When the histogram is built, the cache model requires only two passes over its buckets to compute the miss ratio. The number of buckets in the histogram is equal to the number of distinct reuse distances in the RDS.
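The histogram-based computation can be sketched as below. This is a hedged illustration, not the StatStack implementation: it assumes that the expected stack distance of a reuse distance d is approximated by the sum, over j from 0 to d - 1, of the fraction of sampled reuse distances greater than j, and that a reference misses when its expected stack distance is at least the number of cache lines. Samples without a reuse (cold misses) are ignored, and the function name is hypothetical.

```python
from collections import Counter

def estimate_miss_ratio(reuse_distances, cache_lines):
    """Estimate an LRU miss ratio from sampled reuse distances (sketch)."""
    if not reuse_distances:
        return 0.0
    hist = Counter(reuse_distances)  # one bucket per distinct reuse distance
    distances = sorted(hist)
    total = sum(hist.values())

    # Pass 1: ccdf[d] = fraction of samples with reuse distance > d.
    ccdf = {}
    remaining = total
    for d in distances:
        remaining -= hist[d]
        ccdf[d] = remaining / total

    # Pass 2: accumulate the expected stack distance bucket by bucket
    # and count the samples whose expected stack distance does not fit.
    expected_sd, prev_d, prev_ccdf = 0.0, 0, 1.0
    misses = 0
    for d in distances:
        expected_sd += (d - prev_d) * prev_ccdf
        if expected_sd >= cache_lines:
            misses += hist[d]
        prev_d, prev_ccdf = d, ccdf[d]
    return misses / total
```

Because both passes iterate over distinct reuse distances rather than individual samples, the cost after sorting grows with the number of buckets, which is small for sparse RDSs.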
VIII. RELATED WORK
Since its introduction by Mattson [17] in the early 70s, stack distance has earned plenty of attention from researchers in a quest to find efficient techniques to study cache locality. Mattson proposed a stack-based algorithm to compute stack distances. For each memory reference, Mattson's stack-based algorithm searches the stack for the accessed address. If the address is in the stack, the algorithm moves the address to the top of the stack; otherwise, it pushes the address onto the top of the stack. At any time, the number of unique addresses accessed since the last access to an address is the number of addresses above it in the stack. The stack maintains the history of uniquely accessed addresses. This history is sufficient to find the stack distance of a memory reference.
To improve the performance of Mattson's algorithm, Bennett and Kruskal [1] replace the stack with an m-ary tree. The m-ary tree allows for faster operations than the linked-list stack used by Mattson. To further improve performance, other types of trees have been proposed, for example AVL trees [19] and splay trees [24]. Kim et al. [15] propose an approximate algorithm in which the stack is sliced up into disjoint ranges and only the address range is tracked. This allows them to use a hash table to search the stack. Other approximate algorithms have also been proposed by Ding and Zhong [9] and Shen et al. [21].
All of the above algorithms use full address traces. However, the time it takes to collect and analyze large traces can be prohibitive. Trace sampling is a technique to reduce the size of the traces [13] [16] [26]. Trace sampling only collects and analyzes address traces for chosen sections of the application's execution. By applying statistical techniques, trace sampling infers the overall miss ratio from that of the short sub-traces. A drawback of trace sampling is the large number of additional memory references needed to regain the access history lost between sub-traces. The number of additional memory references required and techniques to reduce them have been investigated [10] [12]. Shen et al. [21] proposed a probabilistic model to estimate stack distance distributions. The input to their model is a reuse distance distribution and the size of the memory footprint. StatStack differs from their model in two key aspects. First, StatStack does not require the size of the memory footprint to be known a priori. Second, StatStack takes an RDS as input and uses it to estimate the reuse distance distribution, while Shen et al. use the actual reuse distance distribution of all memory references. It appears that they could use an estimated reuse distance distribution, but this is not considered.
Tam et al. [25] present a hardware-supported approach to collect address traces efficiently. Their focus is on online generation of miss ratio curves for the last-level cache. By using hardware performance counters, they are able to trace only the memory references that miss in higher cache levels. Furthermore, they use trace sampling, which in combination with their tracing technique makes their approach very efficient.
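The stack-based algorithm described above can be sketched in a few lines. The sketch uses a plain Python list as the stack, with index 0 as the top, and reports `None` for the first access to an address, for which no stack distance is defined; both choices are illustrative assumptions.

```python
def stack_distances(trace):
    """Stack distance of every reference in `trace` (None for first accesses)."""
    stack = []       # stack[0] is the most recently accessed address
    distances = []
    for addr in trace:
        if addr in stack:
            depth = stack.index(addr)  # unique addresses above addr in the stack
            stack.pop(depth)           # remove it so it can move to the top
            distances.append(depth)
        else:
            distances.append(None)     # first access: no previous reference
        stack.insert(0, addr)          # addr is now on top of the stack
    return distances

result = stack_distances(["a", "b", "c", "a", "b", "b"])
```

The linear search through the stack on every reference is what makes this approach slow for long traces.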
IX. CONCLUSIONS
This paper presented StatStack, a new statistical cache model that models a fully associative cache with an LRU replacement policy. The input to the StatStack cache model is an RDS, which contains the reuse distances of a sparse set of randomly selected memory references. To obtain RDSs, StatStack uses the same sampling technique as StatCache, and therefore inherits StatCache's low data collection overhead.
We evaluated the accuracy of StatStack using the SPEC CPU2006 benchmarks. The results show that StatStack accurately estimates miss ratios based on RDSs containing as few as 500,000 and 100,000 reuse distances, for which 90% of the estimated miss ratios have absolute errors less than 0.2% and 0.4%, respectively. Furthermore, the execution time of the StatStack cache model is only a few seconds.
To show that (Eq. 1) follows from the above equations, we identify the following equalities: $B_t(i) = 1(R(x_i) + i > t)$ and $P_{t-1}(x_t) = t - R(x_t) - 1$. By substituting these equalities into (Eq. 8), (Eq. 1) follows.
