Trace-driven cache simulation is a time-consuming yet valuable procedure for evaluating the performance of embedded memory systems. In this paper we present a novel technique, called iterative cache simulation, to produce a variety of performance metrics for several different cache configurations. Compared with previous work in this field, our approach has the following features.
Introduction
Caches are very important to embedded computer systems, especially as the gap between microprocessors and memories keeps widening. Trace-driven cache simulation is a popular and essential tool for performance evaluation of memory systems. The simulation can be run either after the trace has been collected (post simulation) or while the trace is being generated (on-the-fly simulation), and there are pros and cons with both types. On one hand, post simulation requires large storage for the trace, which is not necessary for on-the-fly simulation. On the other hand, generating traces takes time; so if we want to simulate several cache architectures with the same collected trace, it is better to use post simulation.
As we know, the three basic parameters of a cache are: cache size, block size (also known as line size), and degree of associativity.
In cache designs, the problem often resides in the choice of the three parameters. To get optimal performance for a program, we need to run cache simulation over the reference trace many times with varying parameters.
While the idea of trace-driven simulation is simple, it is very time consuming since the simulation time is proportional to the length of the trace. For real applications, the traces can easily exceed millions of references. Therefore numerous efforts have been made to reduce trace length and simulation time. Among many techniques a very attractive class is one-pass simulation [2], which simulates several cache configurations in a single run. However, although these methods can vastly reduce simulation time, each of them has certain limitations. Traditional trace reduction cannot handle varying block sizes, and neither can most one-pass algorithms. Few one-pass algorithms support multiprocessor caches that use invalidation-based coherence protocols.
With the data-memory trace generated by an instrumented program, researchers can obtain a number of performance metrics such as cache miss ratio, write-back [...] bottleneck of the systems of interest. This is particularly useful for [...] where the memory plays an important part in the overall performance and must be tweaked to reduce cost.
In hardware/software co-design, fast simulation of several candidate cache configurations is of special importance. Li and [...] complexity of cache simulations, they only described an estimation scheme for miss rates of direct-mapped caches [1]. A good reference on hardware/software co-synthesis with memory hierarchies can be found in Li [...].
In this paper, we propose iterative cache simulation, which is a [...] time, ours produces precise results and supports many performance metrics, including miss ratio, write-back counts, distribution of misses, etc. Even though we have to run cache simulation multiple times (once for each configuration), results show that the total simulation time is very similar to that of Cheetah, the fastest one-pass simulator of which we are aware.
The rest of this paper is organized as follows. In section 2 we briefly review some previous work in cache simulation. Section 3 describes our method in detail. Simulation results of some multimedia applications based on our method are shown in section 4, together with the runtime of Cheetah. Finally, concluding remarks are drawn in section 5.
Related work

Trace reduction
Since the simulation time and storage requirement are directly related to the size of the trace, a lot of work has been done to reduce the trace length. The key idea is to retain only the references that contribute to the performance metrics. E.g., consecutive accesses to the same cache block can be shrunk to one without changing the miss ratio. [...] [5], which compacts traces through a cache filter (stripping temporal locality) followed by a block filter (stripping spatial locality). While a good compression ratio is achieved with this approach, accuracy is sacrificed. Laha et al. noticed remarkable errors in large caches with small miss rates in their sampling techniques [6].
We are interested in a trace stripping technique that can always guarantee accurate performance metrics while compressing traces substantially. As pointed out by Wang and Baer, Puzak's method, albeit precise for set-associative caches in terms of miss ratio, is not sufficient for other metrics (e.g. write-back counts) or multiprocessor caches. Therefore we need to devise some other methods.
One-pass cache simulation
One-pass simulation algorithms attack the problem from another perspective. Instead of simulating one cache configuration at a time, these methods simulate a class of caches efficiently in a single pass through the trace. One such algorithm, referred to as stack simulation, takes advantage of the inclusion property (i.e. at any time during simulation, the contents of a cache are always a subset of the contents of any larger one), which is guaranteed by certain replacement policies including LRU.
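As a rough illustration of the inclusion property at work, the sketch below computes LRU stack distances for a trace of block numbers and derives the miss counts of fully associative LRU caches of several capacities from a single pass. The function names `stack_distances` and `misses_for_sizes` are ours and purely illustrative; the list-based stack is chosen for clarity, not speed.

```python
def stack_distances(block_trace):
    """Return the LRU stack distance of each reference (None on first touch)."""
    stack = []        # most recently used block at index 0
    distances = []
    for block in block_trace:
        if block in stack:
            depth = stack.index(block)     # 0-based position in the LRU stack
            distances.append(depth + 1)    # distance 1 means "most recent block"
            stack.pop(depth)
        else:
            distances.append(None)         # cold miss for every cache size
        stack.insert(0, block)             # referenced block becomes most recent
    return distances


def misses_for_sizes(block_trace, capacities):
    """Miss counts of fully associative LRU caches holding `c` blocks each."""
    dist = stack_distances(block_trace)
    # A reference hits in a cache of c blocks iff its stack distance is <= c.
    return {c: sum(1 for d in dist if d is None or d > c) for c in capacities}


if __name__ == "__main__":
    print(misses_for_sizes([0, 1, 0, 2, 1], [1, 2, 3]))
```

Because a cache of c blocks holds exactly the top c entries of the stack, the miss counts can never increase as the capacity grows.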
Originally stack simulation only worked for fully associative caches and only reported miss ratio. Hill and Smith identified the set-refinement property (i.e. blocks mapped to the same set in larger caches are also mapped to the same set in smaller caches) and extended the inclusion property to direct-mapped and set-associative caches. They further proposed forest simulation and all-associativity simulation, which respectively work for direct-mapped caches and caches with arbitrary associativity [7]. Meanwhile Thompson and Smith introduced dirty level and writes avoided [8], which track the status of a written block in write-back caches and give write-back counts in addition to miss ratio. Their algorithm was then generalized by Wang and Baer to set-associative caches [9]. The new algorithm can be used to simulate multiprocessor caches with update-based protocols, but it has some difficulties in dealing with invalidation-based protocols. Later Wu and Muntz fixed the problems by using covering vector and marker splitting to track invalidated data blocks in the stack [10].
Aside from stack simulation, which is based on stacks that record accesses, Sugumar and Abraham developed more efficient algorithms using binomial trees [11]. Their new one-pass schemes are reported to outperform earlier ones by a factor of up to 5. Furthermore, they proposed a one-pass algorithm to simulate caches with varying block size. Their algorithms were implemented in Cheetah, a very fast one-pass cache simulator.
In spite of their efficiency, one-pass simulation algorithms all have certain limitations. Since they have to track block status for all the caches in the simulation pool, the bookkeeping for each reference has to be simple; otherwise there is little gain over simulating the caches one by one. As a matter of fact, none of the one-pass simulations support prefetching, sub-block replacement, or multi-level caches; nor can they produce useful performance metrics such as distribution of misses.
Iterative cache simulation
Motivation and goals
Our attempt at evaluating cache performance originated from studies of video applications on a VLIW-based video signal processor. Since the processor consists of several clusters each having its local cache, we need a simulator that can support multiprocessor caches. Due to limited inter-cluster communication bandwidth and uncommon data sharing patterns, invalidation-based protocols are preferred over update-based protocols. Moreover, we are interested in prefetching-based systems. However, no existing one-pass simulators could do the job. At first, we tried to adapt them to the new requirements, but it turned out that the modifications would be so complicated that there would not be much difference from running the simulation for each cache configuration.
It seems clear that we have to simulate all the candidates one by one. On one hand, this gives us the flexibility to deal with various cache models (e.g. multiprocessor caches, multi-level caches, caches with sub-block replacement and/or prefetching strategies, etc.) and performance metrics (e.g. write-back counts, distribution of misses) that are not supported or only partially supported by one-pass simulators. On the other hand, however, the gains might come at the cost of excessive time. Therefore our focus is on reducing simulation time.
In order to reduce simulation time, we have to reduce workload. There is always a tradeoff between accuracy and trace length. Our goal is to reduce trace length as much as possible and meanwhile keep performance metrics 100% correct.
Notations and assumptions
In the following discussions, we use a 3-tuple (nsets, blksize, assoc) to refer to a specific cache configuration where the cache consists of nsets sets and each set contains assoc blocks of blksize bytes. Therefore assoc=1 indicates a direct-mapped cache; otherwise it is an assoc-way set-associative cache (with LRU replacement being used). We further assume that nsets and blksize are both powers of two.
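To make the notation concrete, the following sketch shows one way the 3-tuple could be represented and how a byte address maps to a set index and tag under the power-of-two assumption. The names `CacheConfig`, `size_bytes` and `locate` are illustrative and do not come from the paper.

```python
from typing import NamedTuple, Tuple


class CacheConfig(NamedTuple):
    """The (nsets, blksize, assoc) tuple used in the text."""
    nsets: int     # number of sets, assumed to be a power of two
    blksize: int   # block (line) size in bytes, assumed to be a power of two
    assoc: int     # blocks per set; 1 means direct-mapped

    @property
    def size_bytes(self) -> int:
        """Total data capacity of the cache in bytes."""
        return self.nsets * self.blksize * self.assoc


def locate(cfg: CacheConfig, addr: int) -> Tuple[int, int]:
    """Return (set index, tag) of a byte address for the given configuration."""
    block = addr // cfg.blksize        # strip off the block offset
    return block % cfg.nsets, block // cfg.nsets


if __name__ == "__main__":
    cfg = CacheConfig(256, 16, 1)
    print(cfg.size_bytes, locate(cfg, 0x1234))
```

Note that (256, 32, 1) and (512, 16, 1) have the same total size, a fact that the universal-trace discussion later relies on.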
Miss ratio is probably the most important figure among all the useful performance metrics. We will use this metric as an example in the next subsection, and then generalize our algorithms to other metrics at the end of this section.
Iterative cache simulation
As mentioned previously, we want to discard some uninteresting references from the trace after each simulation, so that the workload is reduced the next time we simulate another cache configuration. E.g., sometimes when we go from one cache configuration C1 to another one C2, all the references that hit in C1 will always hit in C2; therefore by stripping off those references that hit in C1 (which is a significant reduction), we reduce trace length yet are still able to get the exact miss ratio for C2.
Hill and Smith found that the set-refinement property implies the inclusion property in direct-mapped caches with the same block size [7]. This observation facilitates the simulation of such caches. E.g., during the simulation of cache (256, 16, 1), we could toss out all the references that cause a hit, and later the reduced trace can be used to simulate (512, 16, 1) or other direct-mapped caches with more sets but the same block size. Remember that the extended inclusion property says that cache configuration (nsets1, blksize, assoc1) contains a subset of the blocks in (nsets2, blksize, assoc2), provided that nsets1 ≤ nsets2 and assoc1 ≤ assoc2. So once we have the reduced trace from (256, 16, 1), we could also simulate set-associative caches with 256 or more sets.
Now the question is which references we should discard when simulating a set-associative cache. Of course we can use the reduced trace from the direct-mapped cache for all the set-associative caches, but that is not economical as we could filter out more redundant references. Unfortunately set-refinement does not imply inclusion in set-associative caches, which means that we cannot simply throw away references that cause a hit. This is because LRU replacement may change the mapping of a missed block. E.g., suppose we have a cache configuration (1, 1, 2) and a reference sequence 0, 1, 0, 2, 1: then the outcome would be, respectively, miss & place into block 0, miss & place into block 1, hit block 0, miss & replace block 1, miss & replace block 0 (Figure 3.3.1a). If we remove the references that cause a hit (as we did with direct-mapped caches), the resulting sequence 0, 1, 2, 1 would result in a hit for the last 1 (Figure 3.3.1).
The reason for the discrepancy is that hits can change the recentness of a block. Therefore in addition to the references that cause a miss, we also need to save those hit references that alter the recentness of blocks in a set.
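The (1, 1, 2) example above can be checked with a few lines of code. The sketch below is a minimal LRU simulator of our own (the names `hit_flags`, `misses` and `strip_hits` are illustrative). It shows that stripping hit references changes the miss count for the 2-way cache, whereas the same stripping stays exact when both caches are direct-mapped.

```python
def hit_flags(nsets, blksize, assoc, trace):
    """Simulate an LRU cache; return True (hit) or False (miss) per reference."""
    sets = [[] for _ in range(nsets)]   # per set: resident blocks, LRU first
    flags = []
    for addr in trace:
        block = addr // blksize
        lines = sets[block % nsets]
        hit = block in lines
        if hit:
            lines.remove(block)         # will be re-appended as most recent
        elif len(lines) == assoc:
            lines.pop(0)                # evict the least recently used block
        lines.append(block)
        flags.append(hit)
    return flags


def misses(cfg, trace):
    """Number of misses of configuration cfg = (nsets, blksize, assoc)."""
    return hit_flags(*cfg, trace).count(False)


def strip_hits(cfg, trace):
    """Keep only the references that miss in cfg."""
    return [a for a, hit in zip(trace, hit_flags(*cfg, trace)) if not hit]


if __name__ == "__main__":
    example = [0, 1, 0, 2, 1]
    two_way = (1, 1, 2)
    print(misses(two_way, example))                       # 4 misses on the full trace
    print(misses(two_way, strip_hits(two_way, example)))  # 3 misses: the last 1 now hits
```

For direct-mapped caches a hit never changes the cache state, which is why dropping hits is harmless there; with LRU sets the dropped hit to block 0 is exactly what made block 1 the replacement victim.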
Notice that recentness does not matter at all until a miss occurs, so we only need to save at most assoc-1 hit references before the miss reference when an assoc-way set-associative cache is being simulated. As shown in Algorithm 3.3.1, the overhead of this method is fairly small, but the gain in trace length reduction is significant. For many traces and cache configurations we have analyzed, the trace length could be cut by 50% after simulating a 2-way set-associative cache. Each time we simulate a cache, we can strip off some redundant references, thereby accelerating further simulations of caches with more sets and/or higher associativity.
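The filtering step described above can be sketched as follows. This is our own reading of the idea, not the paper's listing: every miss is kept, and just before each miss the most recent hit to each block of the same set since the previous miss is emitted in recency order (at most assoc-1 of them). The names `lru_misses` and `filter_trace` are illustrative.

```python
def lru_misses(nsets, blksize, assoc, trace):
    """Count misses of an LRU set-associative cache on a list of byte addresses."""
    sets = [[] for _ in range(nsets)]          # per set: blocks, LRU first
    count = 0
    for addr in trace:
        block = addr // blksize
        lines = sets[block % nsets]
        if block in lines:
            lines.remove(block)
        else:
            count += 1
            if len(lines) == assoc:
                lines.pop(0)                   # evict the least recently used block
        lines.append(block)
    return count


def filter_trace(nsets, blksize, assoc, trace):
    """Simulate one cache and return the reduced trace for later simulations."""
    sets = [[] for _ in range(nsets)]
    pending = [dict() for _ in range(nsets)]   # per set: block -> latest hit address
    reduced = []
    for addr in trace:
        block = addr // blksize
        idx = block % nsets
        lines, hits = sets[idx], pending[idx]
        if block in lines:
            lines.remove(block)
            hits.pop(block, None)
            hits[block] = addr                 # re-insert: dict order follows recency
        else:
            kept = list(hits.values())
            if len(kept) == assoc:             # all blocks were hit: the oldest hit
                kept = kept[1:]                # is implied by the other assoc-1
            reduced.extend(kept)
            reduced.append(addr)               # misses are always kept
            hits.clear()
            if len(lines) == assoc:
                lines.pop(0)
        lines.append(block)
    return reduced
```

A reduced trace produced for (nsets, blksize, assoc) is meant to be replayed only on caches with the same block size and at least as many sets and ways. Hits still pending at the end of the trace are dropped, which does not affect miss counts.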
Algorithm 3.3.1 (only part of the listing survives; lost lines are marked "...")
    ...
    /* set and blk indicate the set (in cache) and the block (in set)
       addr ended up with, no matter whether it is a hit or miss. */
    (set, blk, hit) = conventional-cache-sim( ... );
    if (hit) {
        set.block[blk].hit_addr = addr;
    ...
[...] While Wang and Baer's algorithm applies to caches with the same number of sets, Sugumar and Abraham's algorithm assumes a fixed cache size. Our approach to the problem is a combination of both. It first finds a dominant set, which contains the minimum cache for each block size. E.g., given the cache configurations of interest in Table 3.3.1, the dominant set is { (512, 16, 1), (256, 32, 1), (256, 64, 1) }, as (due to the inclusion property) the union of the miss traces of the three configurations will be a superset of the references that cause misses in any of the caches in Table 3.3.1. Algorithm 3.3.2 shows the function to generate the universal trace using the same example. It simulates (256, 32, 1) and meanwhile saves the miss references for all three configurations. Notice that cache (512, 16, 1) has the same size as (256, 32, 1), which means that the tags for both configurations are of the same size too, so it is treated differently from (256, 64, 1). In our implementation, this algorithm is realized using bit-vectors, so it is quite efficient.
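As a simplified illustration of the universal trace, the sketch below keeps a reference whenever it misses in at least one direct-mapped cache of the dominant set. Unlike the bit-vector implementation mentioned above, it assumes one independent tag array per configuration, which is easier to follow but less compact. The names `dm_misses` and `universal_trace` are ours.

```python
def dm_misses(nsets, blksize, trace):
    """Count misses of a direct-mapped cache on a list of byte addresses."""
    tags = [None] * nsets
    count = 0
    for addr in trace:
        block = addr // blksize
        if tags[block % nsets] != block:
            tags[block % nsets] = block
            count += 1
    return count


def universal_trace(dominant, trace):
    """Keep every reference that misses in at least one dominant cache."""
    for _, _, assoc in dominant:
        assert assoc == 1, "this sketch only handles direct-mapped dominant caches"
    tables = {cfg: [None] * cfg[0] for cfg in dominant}
    kept = []
    for addr in trace:
        missed = False
        # Every cache is updated on every reference; no early exit is allowed.
        for (nsets, blksize, _), tags in tables.items():
            block = addr // blksize
            if tags[block % nsets] != block:
                tags[block % nsets] = block
                missed = True
        if missed:
            kept.append(addr)
    return kept


if __name__ == "__main__":
    dominant = [(512, 16, 1), (256, 32, 1), (256, 64, 1)]
    print(universal_trace(dominant, [0, 4, 16, 32, 0, 8192, 0]))
```

A reference is dropped only if it hits in all three dominant caches, so by inclusion it also hits in any larger direct-mapped cache with one of those block sizes and cannot change that cache's state.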
Algorithm 3.3.2 (lost lines are marked "...")
    univ-cache-filter( addr )
    {
        (set, hit) = conventional-cache-sim( (256, 32, 1), addr );
        if (hit) {
            saddr = addr >> 4;    // strip off block offset
            ...
            save( addr );    // for (512, 16, 1)
            ...
            last[(addr>>6) % 256] = addr>>6;
            save( addr );    // for (256, 64, 1)
        ...
[Figure 3.3.2: lattice of the cache configurations in Table 3.3.1; one surviving chain reads (1024, 32, 1) - (1024, 32, 2) - (1024, 32, 3)]
Once we have the universal trace, we can simulate all the cache configurations. In order to reduce the workload as much as possible, we need to sort the candidate cache configurations into a special order such that later simulations can use the reduced traces generated by earlier simulations. E.g., both (512, 16, 2) and (1024, 16, 3) can be simulated using the universal trace. If we simulate the former first, then we can take advantage of the further reduced trace when simulating the latter. However, if the latter configuration is simulated first, we have to use the universal trace again when simulating (512, 16, 2). Obviously the cache configurations in our design space form a partially ordered set, in which the relation is defined by the inclusion property. As an example, Figure 3.3.2 depicts the lattice of a set consisting of the configurations in Table 3.3.1.
Starting from the universal trace, we can traverse the lattice in associativity-first or number-of-sets-first order. Depending on the trace, either one can be faster, but the difference is not significant in our experiments.
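One possible way to derive a simulation order from the partial order is sketched below. It is only an illustration under our own assumptions: configurations are plain (nsets, blksize, assoc) tuples, a configuration may reuse the reduced trace of any earlier one with the same block size and no more sets or ways, and the largest such predecessor is picked as a heuristic. The names `precedes` and `simulation_plan` are hypothetical.

```python
def precedes(a, b):
    """True if the reduced trace produced while simulating a may be fed to b."""
    return a != b and a[1] == b[1] and a[0] <= b[0] and a[2] <= b[2]


def simulation_plan(configs):
    """Return [(config, trace source)] in an order that respects inclusion."""
    # If a precedes b then nsets * assoc of a is strictly smaller, so sorting
    # by that product within a block size is a valid linear extension.
    order = sorted(configs, key=lambda c: (c[1], c[0] * c[2], c[0]))
    plan = []
    for i, cfg in enumerate(order):
        earlier = [p for p in order[:i] if precedes(p, cfg)]
        # Heuristic: reuse the trace of the largest predecessor, if there is one.
        source = max(earlier, key=lambda p: p[0] * p[2]) if earlier else "universal"
        plan.append((cfg, source))
    return plan


if __name__ == "__main__":
    for cfg, source in simulation_plan([(1024, 16, 3), (512, 16, 2), (512, 16, 1)]):
        print(cfg, "<-", source)
```

With the example from the text, (512, 16, 2) is scheduled before (1024, 16, 3), so the latter can start from the trace already reduced by the former.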
Other performance metrics
Although miss ratio is important, there are some other interesting and useful performance metrics as well. E.g., in write-back cache systems, the frequency of write-backs greatly affects the traffic that goes to the next level of memory. Since we perform a conventional simulation for each cache configuration, it is very easy to adapt the simulation algorithm to take into account other performance metrics and/or cache architectures. In fact, for some metrics like distribution of misses, we need not change Algorithm 3.3.1 or 3.3.2 at all.
For write-back counts, we need to record references that cause a write hit. In Algorithm 3.3.1, at most assoc-1 hit references are recorded before a miss reference. However, the write hits that make a block dirty (and thus contribute to write-back counts) may not be among the hit references recorded, since only the most recent access to each block is kept. A trick to solve this problem is to change any read operation to a write operation. Notice that this change to Algorithm 3.3.1 is minor and does not increase the trace length. However, in Algorithm 3.3.2 we have to record all the write hits, as we do not know which of them will modify which block in a set-associative cache.
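For reference, counting write-backs in a conventional (one configuration at a time) simulation might look like the sketch below. It assumes a write-back, write-allocate LRU cache and a trace of (op, addr) pairs with op being 'r' or 'w'; these conventions and the name `writeback_count` are our assumptions. The sketch does not implement the trace-filtering trick discussed above.

```python
def writeback_count(nsets, blksize, assoc, trace):
    """Return (misses, write-backs) for a write-back, write-allocate LRU cache."""
    # Per set: block -> dirty flag; dict order runs from LRU to MRU.
    sets = [dict() for _ in range(nsets)]
    misses = writebacks = 0
    for op, addr in trace:
        block = addr // blksize
        lines = sets[block % nsets]
        if block in lines:
            dirty = lines.pop(block)          # removed, re-inserted below as MRU
        else:
            misses += 1
            dirty = False
            if len(lines) == assoc:
                victim = next(iter(lines))    # least recently used block
                if lines.pop(victim):         # a dirty victim must be written back
                    writebacks += 1
        lines[block] = dirty or op == 'w'
    return misses, writebacks


if __name__ == "__main__":
    refs = [('w', 0), ('r', 1), ('r', 0), ('w', 0), ('r', 2)]
    print(writeback_count(1, 1, 1, refs))
```

Dirty blocks still resident at the end of the trace are not counted here; whether a final flush should be charged is a modelling choice.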
As for multiprocessor caches, read accesses to shared data will not cause any problem. This is because a shared read will clear the dirty flag (if set) of the block containing the shared data on the exclusive processor and bring the block to the processor that issues the read, regardless of whether update-based or invalidation-based protocols are used. However, things become different for a shared write. In update-based protocols, a shared write will update all the blocks that have the shared data, which has the same effect as a shared read, hence no extra processing is required for one-pass simulation algorithms. In invalidation-based protocols, however, those blocks will be invalidated by the shared write.
This will cause deletion of a block in the LRU stacks used by one-pass algorithms. Tracking the propagation of a deleted block for different cache configurations is a complicated matter, so one-pass algorithms cannot handle invalidation-based protocols. Our method does not have such a problem, since only one cache configuration is simulated at a time. Note that update or invalidation happens only when a miss occurs. Because all the miss references are saved anyway, Algorithm 3.3.1 and Algorithm 3.3.2 do not need any modification.
Results
Since we are interested in video applications and the traces of several such programs are readily available, we use them as examples. The outputs (performance metrics) of our simulations agree with results from other methods. Discussions of the miss ratios or the impact of cache configurations on video applications are beyond the scope of this paper, but they can be found in our other work [12]. In this section we will concentrate on the execution time. The generation of original traces and our trace-driven cache simulations are done on an SGI Power Challenge workstation which has four 194 MHz MIPS R10000 microprocessors. Table 4.1 shows the trace lengths of original data references. To highlight the reduction of disk space, we also show in the table the trace lengths after the first and last simulations.
We compared the simulation time of our method with that of Cheetah, which is a very fast one-pass cache simulator but lacks flexibility. It only reports miss ratio and it does not support complex cache models (e.g. caches using write-back, prefetching, and/or sub-block replacement). Moreover, it can only simulate caches with the same block size or direct-mapped caches with the same cache size. For these reasons, we only measured miss ratios, for caches with the same block size and for direct-mapped caches with the same cache size; the results are given in Table 4.2 and Table 4.3 respectively. We also compared our iterative method with the traditional trace length reduction technique, in which the reduced trace generated during the simulation of (256, 32, 1) is used for all the other configurations (i.e. "multi-run"). A speedup of 10-20% of the iterative method over multi-run is observed in Table 4.2. As we can also see, our method outperforms Cheetah for caches with the same block size (Table 4.2); but for fixed-size direct-mapped caches, it could be 50% slower (Table 4.3), because for each configuration we have to use the universal trace.
Conclusions
In this paper, we presented iterative cache simulation, a trace-driven method to simulate a set of cache configurations accurately and quickly. In our approach, we sort candidate cache configurations in such an order that after each simulation we can reduce the trace length and thus speed up the following simulations. Compared with other cache simulators, our method has the following features:
1. It supports a wide range of performance metrics, including miss ratio, write-back counts, bus traffic, etc.
