Abstract-Stack and reuse distances have been widely adopted in studying memory locality and cache behavior. However, memory references, normally profiled by a binary instrumentation tool, only reflect the access sequence of instruction fetches and load/store executions. Consequently, the stack or reuse distances obtained from these memory references cannot be used to predict L2 or lower-level cache misses. This paper proposes a probability model to calculate the L2 reuse distance histogram from the L1 stack distance histogram without any extra simulation. Based on the result of our model, L2 cache misses and memory locality can be predicted quickly and accurately. We use 13 benchmarks chosen from Mobybench 2.0 and SPEC 2006 to evaluate the accuracy of our model. With the support of StatCache and StatStack, the average absolute error of modeling L2 cache misses is about 8%. Meanwhile, compared with gem5 fast simulations, the process of predicting L2 cache misses can be sped up by 50 times on average.
I. INTRODUCTION
For studying cache behaviors, analytical models are normally fed with statistical memory characteristics, such as reuse or stack distance histograms [1]. However, these characteristics are usually obtained by profiling memory traces or emulating application executions [2], and thus only reflect memory behaviors in L1 caches. Therefore, many prior studies resort to trace-driven simulations to predict L2 or lower-level cache misses [3]. Although trace-driven simulations run faster than detailed ones, their time overhead is still considerably larger than that of analytical models. Furthermore, the growing length of application traces and the non-unified simulator interfaces often cause storage and flexibility problems, which, according to [4], have become more serious in recent years. The flexibility problem in particular remains a big challenge and demands considerable coding and debugging effort. Last but not least, because they mainly focus on evaluating proposals rather than guiding architecture design, simulation-based methods normally offer few architectural insights.
Therefore, this paper proposes a probability model that estimates the L2 reuse distance histogram directly from the L1 stack distance histogram, without any simulation. To the best of our knowledge, this model is the first analytical method to predict the L2 reuse distance histogram directly from L1 stack distance histograms. The calculated L2 reuse distance histogram can be applied to predict L2 cache misses in single-core processors. Meanwhile, studies of contention behavior in multi-core shared caches can also benefit from the fast modeling enabled by our predicted results. The rest of the paper is organized as follows: Section II introduces how to calculate the L2 reuse distance, and the model generalization is given in Section III. Section IV describes the experimental setup and discusses the evaluation results. Finally, Section V concludes the paper.
II. CALCULATING THE L2 REUSE DISTANCE

A. Classical Stack Distance Theory
For studying cache behaviors, the stack distance is defined as the number of unique cache lines accessed by the memory references during a reuse epoch, where a reuse epoch refers to the time interval between two successive memory references to the same cache line [5]. To collect stack distances, prior studies usually construct the LRU stack history [6], which records the latest references to different cache lines within each reuse epoch. In this way, the stack distance of the current memory request can be easily calculated by counting these latest references. Figure I illustrates this process. Normally, the collected stack distances are used to construct the so-called stack distance histogram for estimating LRU cache misses [5]. When the L1 cache has $C$ cache lines, the number of L1 cache misses can be calculated as

$$\mathrm{Miss}_{L1} = \sum_{d=C}^{\infty} H_{L1}(d), \qquad (1)$$

where $H_{L1}(d)$ is defined as the L1 stack distance histogram, i.e., the number of references whose stack distance equals $d$.
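To make the procedure concrete, the following minimal Python sketch (our illustration, not the paper's tooling; the trace format and function names are assumptions) builds $H_{L1}(d)$ with an LRU stack and evaluates (1):

```python
from collections import Counter

def stack_distance_histogram(trace):
    """Build H_L1(d) from a sequence of cache-line IDs.

    The LRU stack keeps one entry per cache line, most recently used at
    the end; a reference's stack distance is the number of distinct lines
    touched since the previous reference to the same line (cold references
    are encoded as -1, i.e., an infinite distance).
    """
    stack, hist = [], Counter()
    for line in trace:
        if line in stack:
            d = len(stack) - 1 - stack.index(line)
            stack.remove(line)
        else:
            d = -1
        stack.append(line)
        hist[d] += 1
    return hist

def l1_misses(hist, C):
    # Equation (1): references with stack distance >= C (or cold) miss.
    return sum(cnt for d, cnt in hist.items() if d == -1 or d >= C)

h_l1 = stack_distance_histogram(["a", "b", "c", "a", "b", "b", "a"])
print(l1_misses(h_l1, C=2))  # 3 cold misses + 2 capacity misses = 5
```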
B. L2 Reuse Distance
Generally, memory references to the L2 cache are caused by L1 cache misses. Therefore, the L2 reuse distance of a reference $x$ in Figure I can be calculated by counting the number of L1 cache misses within the L1 reuse epoch of $x$. We define $P_{d \to D}$ to describe the ratio of L1 reuse epochs with stack distance $d$ that generate $D$ L1 cache misses, or in other words, the L2 reuse distance $D$. In this case, the L2 reuse distance histogram $H_{L2}(D)$ can be calculated by (2):

$$H_{L2}(D) = \sum_{d} P_{d \to D} \cdot H_{L1}(d). \qquad (2)$$
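Given the transition probabilities, composing (2) is a single weighted sum. A sketch, assuming $P_{d \to D}$ is supplied as a callable (it is derived later in (9)):

```python
def l2_reuse_histogram(h_l1, p, max_D):
    """Equation (2): H_L2(D) = sum_d P_{d->D} * H_L1(d).

    h_l1  : dict mapping L1 stack distance d -> reference count
    p     : callable p(d, D) returning P_{d->D} (assumed precomputed)
    max_D : largest L2 reuse distance of interest
    """
    # Cold entries (d == -1) are left out of the finite-distance sum.
    return {D: sum(p(d, D) * cnt for d, cnt in h_l1.items() if d >= 0)
            for D in range(max_D + 1)}
```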
Typically, the references within each L1 reuse epoch can be classified into two groups. As shown in Figure I, the references of the first group, denoted as the set {A}, do not generate reuse behaviors within the reuse epoch of $x$. On the other hand, the references in the second group, denoted as the set {B}, have reuse epochs embedded in the epoch of $x$. In this paper, we name these nested reuse epochs "embedded epochs".
1) L1 Cache Misses in the Set {A}: Across the whole program, there may be more than one possible reuse epoch sharing the same LRU stack history. Although the references in the set {A} do not generate reuse epochs within the epoch of $x$, their minimum L1 stack distances are determined, as analyzed above. For these references, the probabilities of generating L1 cache misses can be calculated using the L1 stack distance histogram. For each memory reference, we define $P_{miss}(d)$ as the probability of generating an L1 cache miss given that its L1 stack distance is larger than or equal to $d$. Generalizing from this case, $P_{miss}(d)$ can be calculated as (3) when the L1 cache has $C$ cache lines:

$$P_{miss}(d) = \frac{\sum_{d'=C}^{\infty} H_{L1}(d')}{\sum_{d'=d}^{\infty} H_{L1}(d')}, \quad d < C. \qquad (3)$$

If $C$ is smaller than or equal to $d$, the reference is certain to trigger an L1 cache miss; in this case, $P_{miss}(d)$ equals 1.
To make the expression of $P_{d \to D}$ more concise and easier to calculate, we introduce one unified, weighted average probability $\bar{P}_{miss}$ to replace the group of different $P_{miss}(d)$ values. According to (3), $\left(\sum_{d'=d}^{\infty} H_{L1}(d')\right) \cdot P_{miss}(d)$ represents the number of L1 cache misses generated by the references whose L1 stack distances are larger than or equal to $d$. By exploring all possible $d$, $\bar{P}_{miss}$ can be calculated as (4):

$$\bar{P}_{miss} = \frac{\sum_{d} H_{L1}(d)\, P_{miss}(d)}{\sum_{d} H_{L1}(d)}. \qquad (4)$$
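A sketch of (3) and (4) under the reconstruction above; cold references, encoded as -1 in the first sketch, are treated as guaranteed misses:

```python
def p_miss(h_l1, d, C):
    """Equation (3): P(L1 miss | stack distance >= d), L1 cache of C lines."""
    if d >= C:
        return 1.0  # stack distance >= C always misses
    # Treat -1 (cold / infinite distance) as larger than any threshold.
    at_least_d = sum(cnt for dd, cnt in h_l1.items() if dd == -1 or dd >= d)
    at_least_C = sum(cnt for dd, cnt in h_l1.items() if dd == -1 or dd >= C)
    return at_least_C / at_least_d if at_least_d else 0.0

def p_miss_avg(h_l1, C):
    """Equation (4): histogram-weighted average of p_miss over all d."""
    total = sum(h_l1.values())
    return sum(cnt * (1.0 if dd == -1 else p_miss(h_l1, dd, C))
               for dd, cnt in h_l1.items()) / total
```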
2) L1 Cache Misses in the Set {B}: As shown in Figure III, there is only one case (case 4) in which the embedded epoch generates an L1 cache miss (as assumed before, the L1 cache has only 2 cache lines). In this paper, we assume that all possible cases of an embedded epoch appear within the reuse epoch of $x$ with the same probability. Therefore, the probability of an embedded epoch causing an L1 cache miss is 1/6, because there are 6 possible cases in Figure III.
3) Generalization: $P_{emb}(d)$ is defined as the probability that a reference in the set {B} generates an L1 cache miss. Based on the discussion of Figure III, $P_{emb}(d)$ can be calculated as (5), where the L1 cache has $C$ cache lines and the L1 stack distance of $x$ is $d$:

$$P_{emb}(d) = \frac{\binom{d-C}{2}}{\binom{d}{2}}. \qquad (5)$$

Actually, there may be more than one member in {B}. We observe that the probability of generating $m$ L1 cache misses, represented as $P_k(m)$, can be calculated approximately as (6) when there are $k$ references in the set {B}:

$$P_k(m) = \binom{k}{m} P_{emb}(d)^{m} \left(1 - P_{emb}(d)\right)^{k-m}, \quad (k \ge m). \qquad (6)$$
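A sketch of (5) and (6) as reconstructed above; the closed form of (5) is our reading of the equal-probability case counting (it yields 1/6 for $C = 2$, $d = 4$, matching Figure III):

```python
from math import comb

def p_emb(d, C):
    # Equation (5): among the comb(d, 2) equally likely placements of an
    # embedded epoch's endpoints, only those with at least C distinct
    # lines between them miss in an L1 cache of C lines.
    if d < 2 or d - C < 2:
        return 0.0
    return comb(d - C, 2) / comb(d, 2)

def p_set_b(k, m, d, C):
    # Equation (6): probability of m L1 misses among k set-{B} references,
    # treating each embedded epoch as an independent Bernoulli trial.
    p = p_emb(d, C)
    return comb(k, m) * p**m * (1.0 - p)**(k - m)

print(p_emb(4, 2))  # 1/6, the Figure III example
```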
III. MODEL GENERALIZATION
A. The Probability of Multiple Embedded Epochs

As we mentioned above, more embedded epochs are generated when more references are executed in the reuse epoch of $x$. To describe the appearance ratios of different numbers of embedded epochs, we use $Pr(d, n)$ to represent the probability of having $n$ embedded epochs within the epoch of $x$, given that the L1 stack distance of $x$ is $d$.
Given a reuse epoch $e$ with stack distance $d'$ ($d' \le d$), the epoch $e$ may or may not be embedded in the reuse epoch of $x$. Meanwhile, the epoch $e$ can only appear among the references whose L1 stack distances are larger than or equal to $d'$, and the number of these references is $\sum_{d''=d'}^{\infty} H_{L1}(d'')$, where $H_{L1}(d)$ represents the L1 stack distance histogram. Thus, for $x$ with L1 stack distance $d$, the probability of having one embedded epoch with L1 stack distance $d'$ can be calculated as $H_{L1}(d') / \sum_{d''=d'}^{\infty} H_{L1}(d'')$. Considering all possible $d'$, $Pr(d, 1)$ can be estimated as (7):

$$Pr(d, 1) = \sum_{d'=1}^{d} \frac{H_{L1}(d')}{\sum_{d''=d'}^{\infty} H_{L1}(d'')}. \qquad (7)$$
Furthermore, assuming the embedded epochs are independent of each other, $Pr(d, n)$ can be calculated as (8):

$$Pr(d, n) = \left(Pr(d, 1)\right)^{n}. \qquad (8)$$
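A sketch of (7) and (8) under the reconstruction above (the clamp in pr_one is our own guard, since (7) is an estimate rather than a normalized distribution):

```python
def pr_one(h_l1, d):
    # Equation (7): estimated probability of one embedded epoch with some
    # stack distance d' <= d appearing inside the epoch of x.
    total = 0.0
    for dp in range(1, d + 1):
        tail = sum(cnt for dd, cnt in h_l1.items() if dd == -1 or dd >= dp)
        if tail:
            total += h_l1.get(dp, 0) / tail
    return min(total, 1.0)

def pr_n(h_l1, d, n):
    # Equation (8): n embedded epochs, assumed mutually independent.
    return pr_one(h_l1, d) ** n
```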
B. The Calculation of $P_{d \to D}$

Ultimately, the probability $P_{d \to D}$ can be obtained using (9) when the L1 stack distance equals $d$ and the L2 reuse distance is $D$. For each reuse epoch, $Pr(d, n)$ represents the probability of having $n$ embedded reuse epochs (equal to the number of members of the set {B}). The term $\binom{n}{m} P_{emb}(d)^{m} \left(1 - P_{emb}(d)\right)^{n-m}$ denotes the probability of generating $m$ L1 cache misses when there are $n$ reuse epochs within the current epoch. For example, in Figure III, $m$ is either 1 or 0 while $n$ is 1 (because there is one and only one reuse epoch within the epoch of $x$). Lastly, the binomial term over $\bar{P}_{miss}$ gives the probability of causing the remaining $D - m$ L1 cache misses from the set {A}:

$$P_{d \to D} = \sum_{n} Pr(d, n) \sum_{m=0}^{\min(n, D)} \binom{n}{m} P_{emb}(d)^{m} \left(1 - P_{emb}(d)\right)^{n-m} \binom{d-n}{D-m} \bar{P}_{miss}^{\,D-m} \left(1 - \bar{P}_{miss}\right)^{(d-n)-(D-m)}. \qquad (9)$$
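Assembling (9) from the earlier sketches (our illustration; n_max is an assumed truncation of the sum over $n$, and the helpers p_miss_avg, p_set_b, and pr_n come from the sketches above):

```python
from math import comb

def p_d_to_D(h_l1, d, D, C, n_max=8):
    # Equation (9): probability that an epoch of L1 stack distance d
    # produces D L1 misses, i.e., L2 reuse distance D.
    pm = p_miss_avg(h_l1, C)
    total = 0.0
    for n in range(0, min(n_max, d) + 1):        # n embedded epochs ({B})
        w = pr_n(h_l1, d, n)                     # equation (8)
        for m in range(0, min(n, D) + 1):        # m misses from {B}, eq (6)
            rest = D - m                         # misses still needed from {A}
            if rest > d - n:
                continue
            miss_a = comb(d - n, rest) * pm**rest * (1.0 - pm)**(d - n - rest)
            total += w * p_set_b(n, m, d, C) * miss_a
    return total
```

This p_d_to_D can then be passed as the callable p in the l2_reuse_histogram sketch of (2).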
IV. EVALUATIONS
The cache architecture used for evaluation in this paper has two levels. From the CPU side to the memory side, independent instruction and data caches are connected to a shared L2 cache through a crossbar. To simplify the discussion, all caches are fully associative; however, it would not be difficult to extend our work to set-associative architectures. The replacement policy of the L1 caches is Least Recently Used (LRU), while that of the L2 cache can be configured as Random or LRU. The evaluation platform is implemented with the gem5 AtomicSimpleCPU full-system simulator (containing two-level caches and DRAMs).

A. Evaluations of Predicting L2 Cache Misses

1) StatCache and L2 Cache Misses with the Random Replacement Policy: Figure IV shows the absolute errors of predicting L2 cache misses using StatCache [7]. Briefly, the average absolute error over the tested benchmarks is around 8%. However, for some benchmarks, such as BaiduMap and BBench, the prediction errors increase significantly, up to 11.9%. StatCache assumes that the memory references share the same L1 cache miss rate during a small execution slot, but the correctness of this assumption depends on the length of the profiling interval. The memory references of BaiduMap and BBench may not satisfy the assumption of StatCache when the profiling interval contains one million memory references, as in this paper.

2) StatStack and L2 Cache Misses with the LRU Replacement Policy: StatStack [5] provides an efficient method to calculate the stack distance histogram from the reuse distance histogram, which we adopt in this paper. Figure V shows the absolute errors of predicting L2 LRU cache misses. Most absolute errors are below 7%, lower than the errors of the L2 Random cache miss predictions. We believe this error reduction is caused by error-masking effects. For example, a reuse epoch whose stack distance should be 8 may be predicted with stack distance 9 by our model and StatStack; however, this reuse epoch will be regarded as a cache hit when the L2 cache has 16 cache lines, regardless of whether the predicted stack distance is 8 or 9.

FIGURE V. ABSOLUTE ERRORS OF PREDICTING L2 LRU CACHE MISSES

B. Time Overhead

Figure VI shows the time overhead comparison between gem5 AtomicSimpleCPU simulations and implementations of our model. The Y-axis gives the minutes consumed by the predictions, shown on a log scale, and the X-axis lists the tested benchmarks. Briefly, prediction can be sped up by more than 50 times on average by using our model. Meanwhile, the time consumed by the L2 histogram calculation contributes approximately half of the total overhead of our model implementation.

V. CONCLUSION

This paper proposes a probability model to predict the L2 reuse distance histogram without any extra simulations. With the support of StatCache and StatStack, the calculated L2 histogram can be adopted for modeling L2 Random and LRU cache misses with average errors of 8% and 6.8%, respectively. Our future work will extend the model to multi-core architectures and consider the influence of set-associative organizations as well.
