Abstract

Hard real-time systems must obey strict timing constraints. Therefore, one needs to derive guarantees on the worst-case execution times of a system's tasks. In this context, predictable behavior of system components is crucial for the derivation of tight and thus useful bounds. This paper presents results about the predictability of common cache replacement policies. To this end, we introduce three metrics, evict, fill, and mls, that capture aspects of cache-state predictability. A thorough analysis of the LRU, FIFO, MRU, and PLRU policies yields the respective values under these metrics. To the best of our knowledge, this work presents the first quantitative, analytical results for the predictability of replacement policies. Our results support empirical evidence in static cache analysis.
Introduction
Embedded systems as they occur in application domains such as automotive, aeronautics, and industrial automation often have to satisfy hard real-time constraints. Timeliness of reactions is absolutely necessary. Off-line guarantees on the worst-case execution time of each task have to be derived using safe methods. Execution times of a task vary depending on the task's inputs and the initial hardware state. The vast number of cases prohibits exhaustive testing to exactly determine the worst-case execution time. Instead, approximate methods have to be applied. Such methods must be conservative, i.e., they must never underestimate the worst-case execution time. To this end, they statically overapproximate the dynamic behavior of a task on all possible inputs and hardware states.
Caches, deep pipelines, and all kinds of speculation are increasingly used in today's embedded systems to improve average-case performance. At the same time they increase the variability of execution times of instructions due to the possibility of timing accidents with high penalties: a cache miss may take 100 times as long as a cache hit. Thus, the precision (tightness of upper bounds) of a static analysis greatly depends on its ability to statically exclude as much behavior detrimental to the timing of the program's instructions as possible: cache misses, mispredicted branches, pipeline stalls, etc. Exclusion of these so-called timing accidents tightens the upper bound by the associated timing penalty, e.g., the cache miss penalty or the time to refill the pipeline. Examples of static analyses to exclude timing accidents can be found in Langenbach et al. (2002), Thesing (2004), and Ferdinand and Wilhelm (1999).
A designer who introduces caches, deep pipelines, or other performance-boosting components into his system may find himself in the paradoxical situation that he has successfully improved the average-case performance of the system but fails to derive sufficient timing guarantees despite his best efforts. This may be for two reasons: first, although the system's average-case behavior has improved, its worst-case performance may have deteriorated; second, even if the worst-case performance is sufficient, the provable bound may be too imprecise due to low predictability of the new components. Hence, a system with good average-case but poor worst-case performance or low predictability will not be certifiable. Thiele and Wilhelm (2004) describes threats to the predictability of systems and proposes design principles that support timing predictability.
The timing predictability of a system is a measure for the possibility of determining tight bounds on execution times. As depicted in Fig. 1, timing predictability is composed of uncertainty and associated penalties. Uncertainty comprises timing accidents that cannot be excluded statically but never happen during execution. High penalties do not automatically make a system unpredictable: if there is no uncertainty, this is not a problem. On the other hand, high levels of uncertainty only become harmful to timing predictability if the associated penalties are large.

Fig. 1 Execution times of tasks vary depending on inputs and the initial state of the hardware they are executed on. The figure depicts a distribution of execution times. The border cases are known as Best- and Worst-Case Execution Time (BCET and WCET). A correct timing analysis obtains a safe upper bound on all possible execution times.
Caches

As noted before, the processor caches have a strong influence on both the average-case and the worst-case performance. Due to the high cache-miss penalties they have a potentially strong impact on the predictability of a system. Several properties of the processor caches influence predictability: associativity, replacement policy, write policy, and whether there are separated data and instruction caches, see Heckmann et al. (2003). Of these, the cache replacement policy has by far the strongest influence on the predictability of the cache behavior. We will investigate the following widely used replacement policies regarding their timing predictability:
• Least Recently Used (LRU) used in Intel Pentium I and MIPS 24K/34K
• First-In First-Out (FIFO or Round-Robin) used in Intel XScale, ARM9, ARM11
• Most Recently Used (MRU) as described in Al-Zoubi et al. (2004), Malamy et al. (1994)
• Pseudo-LRU (PLRU) used in PowerPC 75x and Intel Pentium II-IV

The cache miss penalty is the same for all of the considered replacement policies. Timing predictability of cache replacement policies therefore only depends on the amount of uncertainty.
Contributions
We introduce two metrics, evict and fill, that capture our notion of the predictability of cache replacement policies.
Every cache analysis has to cope with a certain amount of uncertainty resulting from various sources explained in Sect. 3.1. The two metrics, evict and fill, indicate how quickly knowledge about cache hits and misses can be (re-)obtained. They mark a limit on the precision that any cache analysis can achieve. A thorough analysis of the LRU, FIFO, MRU, and PLRU policies yields the respective values under these metrics.
Further analyses elaborate on these results and yield a more refined view on the limits of cache analyses: While evict and fill constitute milestones in the recovery of information, supplementary results show how information evolves in between.
To the best of our knowledge, this work presents the first quantitative, analytical results about the predictability of replacement policies.
Caches
Caches are commonly employed to hide the speed gap between main memory and the CPU by exploiting locality in memory accesses. They are very fast but small memories that store a subset of the main memory's contents. On today's architectures a cache miss may have an associated penalty of several hundred CPU cycles. Future architectures are expected to exhibit even larger cache miss penalties.
To reduce traffic and management overhead, the main memory is logically partitioned into memory blocks of size B bytes. Memory blocks are cached as a whole in cache lines of equal size. Usually, B is a power of two. This way the block number is determined by the most significant bits of a memory address.
When accessing a memory block one has to determine whether the memory block is stored in the cache (cache hit) or not (cache miss). To enable an efficient look-up, each memory block can be stored in a small number of cache lines only. For this purpose, caches are partitioned into equally-sized cache sets. The size of a cache set is called the associativity k of the cache. Again, k is usually a power of two, such that the set number is determined by the least significant bits of the block number. The remaining bits, known as the tag, are stored along with the data to finally decide whether and where a memory block is cached within a set.
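To make the address decomposition concrete, here is a minimal sketch; the block size and the number of sets are assumed example values, not taken from any particular architecture:

```python
B = 32          # assumed block size in bytes (a power of two)
NUM_SETS = 128  # assumed number of cache sets (a power of two)

def decompose(addr: int) -> tuple[int, int, int]:
    offset = addr % B             # position within the memory block
    block = addr // B             # block number: the most significant address bits
    set_index = block % NUM_SETS  # least significant bits of the block number
    tag = block // NUM_SETS       # remaining bits, stored along with the data
    return tag, set_index, offset
```

A look-up then searches the k lines of set set_index for a matching tag.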
Since the number of memory blocks that map to a set is far greater than the associativity of the cache, a so-called replacement policy must decide which memory block to replace upon a cache miss. To facilitate useful replacement decisions, a number of status bits is maintained, storing information about previous accesses. We only consider replacement policies that have independent status bits per cache set. Almost all known policies comply with this.
Cache analysis
In cache analysis there is a concept of may and must cache information at program points: may- and must-caches are upper and lower approximations, respectively, to the contents of all concrete caches that will occur whenever program execution reaches a program point. So, the must-cache at a program point is a set of memory blocks that are definitely in each concrete cache at that point. The may-cache is a set of memory blocks that may be in a concrete cache whenever program execution reaches that program point. May and must cache information is obtained by static analysis. Ferdinand and Wilhelm (1999), for instance, presents cache analyses based on abstract interpretation.
Must cache information is used to derive safe information about cache hits; in other words it is used to exclude the timing accident "cache miss". The complement of the may cache information is used to safely predict cache misses. The more cache hits can be predicted, the better the upper bound on the worst-case execution time will be. Vice versa, predicting more cache misses will result in a better lower bound on the best-case execution time. Observe the asymmetry between may and must: while a greater must-cache means more precise information, a greater may-cache means less precise information.
Sources of uncertainty
There are several reasons for uncertainty about cache contents:
• Static cache analyses usually cannot make any assumptions about the initial cache contents. Cache contents on entrance depend on previously executed tasks. Even assuming a completely empty cache may not be conservative as shown in Thesing (2004) . The only safe initial must-cache is the empty set, whereas the only safe initial may-cache must contain every memory block that may be mapped to the particular cache set.
• At control-flow joins, analysis information about different paths needs to be safely combined. Intuitively, one must take the intersection of the incoming must-information and the union of the incoming may-information: a memory block can only be in the must-cache if it is in the must-caches of all predecessor control-flow nodes, and correspondingly for may-caches (see the sketch after this list).
• If the analysis cannot exactly determine the address of a memory access, it must conservatively account for all possible addresses. This especially deteriorates may-information.
• Statically undetermined preempting tasks may change the cache state at preemption points.
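The combination of analysis information at control-flow joins mentioned above can be sketched as follows; representing may- and must-information as plain sets of block addresses is a simplification, since concrete analyses track additional per-block information such as ages:

```python
def join(info1, info2):
    """Combine the abstract cache information of two incoming paths.

    Each argument is a pair (must, may) of sets of memory-block
    addresses. A minimal sketch of the intuition described above.
    """
    must1, may1 = info1
    must2, may2 = info2
    joined_must = must1 & must2  # cached on *all* paths: intersection
    joined_may = may1 | may2     # possibly cached on *some* path: union
    return joined_must, joined_may
```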
Since information about the cache state may thus be unknown or lost, it is important to recover information quickly to be able to classify memory accesses safely as cache hits or misses. Fortunately, this is possible for most caches. The speed of this recovery greatly depends on the cache replacement policy employed and influences uncertainty about cache hits and misses. Thus, it is an indicator of timing predictability.
Cache predictability metrics
For a timing analysis the data that is actually cached is irrelevant. Only the address ranges that are cached influence timing. In the following, if we talk about cache contents, we only really talk about the addresses of the cached memory blocks. We show how quickly cache contents become known when accessing a sequence of memory blocks starting from an unknown cache state. For the replacement policies we consider, an access to a cache set does not affect the state of other sets. Thus, we consider the recovery of information about single cache sets.
Notation and basic notions
We use the following generic names and notations:

• a, b, c, ... ∈ A: the set of memory addresses
• q ∈ C_k^p: the set of cache-set states of associativity k under policy p, e.g., [b, e, c, f]
• s ∈ S ⊆ A*: the set of access sequences with pairwise different accesses, e.g., b, c, d

Depending on the replacement policy, additional status bits, as e.g. in [e, b, c, d]_0010, are used to fully describe a state. Their meaning will become clear in the description of the particular policy.
We assume all memory accesses in the regarded sequences to be pairwise different. This is sensible because recurring accesses do not contribute additional information about the cache contents. Another reason is that arbitrarily long access sequences can be constructed for two of the considered replacement policies, namely PLRU and MRU, that never recover complete information about the cache contents if repetitive accesses are allowed. In other words, there are access sequences such that different initial states result in different states for an arbitrary number of accesses; they never converge.
May- and must-information available after observing an access sequence s without knowing the initial set state can be defined as follows:
May_k^p(s) is the set of cache contents that may still be in the cache set after accessing the sequence s, regardless of the initial cache state. Analogously, Must_k^p(s) is the set of cache contents that must be in the cache set after accessing the sequence s. Since we take into account every initial state, Must_k^p(s) is always a subset of CC_k^p(s), the blocks occurring in s. The following two definitions show how much may- and must-information is available after observing any access sequence s of length n:

may_k^p(n) := |May_k^p(s)| and must_k^p(n) := |Must_k^p(s)| for some s ∈ S with |s| = n.
Note that may_k^p(n) and must_k^p(n) are well-defined: for all sequences s of length n, |May_k^p(s)| is equal (the same goes for |Must_k^p(s)|). The sequences contain pairwise different accesses only and are thus equal up to renaming. Thus, May_k^p(s_1) equals May_k^p(s_2) up to renaming, too. In the following proofs we may therefore always restrict our attention to one representative access sequence.
Metrics
Based on may_k^p(n) and must_k^p(n) we are ready to define evict and fill, which indicate how quickly may- and must-information can be recovered:

evict^p(k) := min{n | may_k^p(n) ≤ n} and fill^p(k) := min{n | must_k^p(n) = k}.
Figure 2 illustrates the two metrics. evict^p(k) tells us at which point we can safely predict that some elements are no longer in the cache, i.e., that they are in the complement of the may-information. Any element not contained in the last evict^p(k) accesses cannot be in the cache set: if some element not contained in the sequence could have "survived", then any other element not contained in the sequence could have "survived" as well, and then may_k^p(n) = c > n, where c is the number of blocks that map to the cache set. Fewer than evict^p(k) accesses do not allow us to predict any misses. The greater evict^p(k), the longer it takes to gain may-information, and the less precise the obtained may-information is, because any of the greater number of evict^p(k) elements may still be in the cache set.

Fig. 2 Cache-set states when accessing a sequence b, c, d, e, f, g, h, ... of pairwise different memory blocks. After evict accesses, any set contains only elements from the access sequence. fill accesses are required to converge to one completely known cache set. Selected cache sets are annotated with their respective contents.
After fill^p(k) pairwise different memory accesses we know exactly what is contained in the cache set, namely the last k accesses, i.e., we obtain complete may- and must-information. This allows us to precisely predict cache hits and misses. In contrast to may-information, some must-information is directly obtained with the first memory accesses. At least the most recently accessed element is in the cache set. Thus, it is pointless to define a counterpart to evict for must-information, since min{n | must_k^p(n) ≥ 1} = 1 for all policies.
Consider the implications of these metrics on any cache analysis. They mark a limit on achievable precision: no analysis can infer any may-information (complete must-information) given an unknown cache-set state and less than evict(k) (fill(k)) pairwise different memory accesses. At the same time the metrics allow us to investigate the quality of different analyses. Does an analysis need longer access sequences to derive safe information about the cache set contents, or is it optimal with respect to the metrics?
Another application of these metrics is to determine the minimal effort to establish a desired cache-set state, assuming that no explicit instructions are available to do so. This may be used to eliminate initial uncertainty in cache analyses by prepending load instructions, or simply to create uniform conditions for performance measurements. For this special purpose it is interesting to investigate access sequences resulting in cache misses only; in such a case, a desired cache-set state can be obtained faster. We therefore distinguish M- and HM-access sequences: if we assume all accesses in the regarded sequences to be cache misses, we denote this by the subscript M, otherwise by HM. Thus fill^LRU_HM(8) is the number of pairwise different accesses (hits or misses) needed to know the exact contents of an 8-way cache set using LRU replacement. For brevity, we will also use e(k) and f(k) for evict(k) and fill(k).
As we have noted above, some must-information can be immediately obtained with one individual access. The following definition of the minimal life-span captures how this generalizes:

mls^p(k) := max{n | must_k^p(n) = n}.
The minimal life-span is the minimal number of accesses necessary to evict an element out of a cache set that has just been accessed (not counting the access that possibly brought the element into the set). In other words, it tells us how many of the most-recently accessed elements are always in the cache.
Based on the minimal life-span of a policy, it is easy to determine some must-information: the last mls(k) accessed elements are always in the cache set. Therefore, one can construct a must-analysis that remembers the last mls(k) accesses. A high value of mls(k) would make this a reasonably precise must-analysis. Depending on the replacement policy, optimal analyses that eventually obtain complete must-information may be much more expensive.
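A minimal sketch of such an mls-based must-analysis; the class and its interface are illustrative, not taken from an existing tool:

```python
from collections import deque

class MustWindow:
    """Must-analysis that merely remembers the last mls(k) accesses.

    Sound because the last mls(k) accessed elements are always in the
    cache set; the value of mls depends on the replacement policy and
    is determined for each policy in the following sections.
    """
    def __init__(self, mls: int):
        self.recent = deque(maxlen=mls)  # older accesses drop out automatically

    def access(self, block) -> None:
        self.recent.append(block)

    def definitely_cached(self, block) -> bool:
        return block in self.recent      # safe to classify as a cache hit
```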
Equalities
The definitions given in the previous subsection were chosen to be as uniform as possible: they all relate must_k^p(n) and may_k^p(n) with k and n. However, for the following proofs we need to establish some equalities to ease argumentation about evict, fill, and mls.
Lemma 1 evict^p(k) is the minimal length of access sequences such that only elements of the sequence may be contained in the cache set.
Proof We need to show may_k^p(n) ≤ n ⇔ ∀s ∈ S, |s| = n : May_k^p(s) ⊆ CC_k^p(s). The ⇐ direction is obvious. For ⇒, assume May_k^p(s) is not contained in CC_k^p(s) for some s of length n. Then at least one element a not contained in s must have survived. Upon an access, the update process of the status bits is independent of the tag bits of all non-accessed elements. Thus, the tag bits of a can be chosen arbitrarily, i.e., any other element b not in CC_k^p(s) could have survived as well. Then, may_k^p(n) = c > n, where c is the number of blocks that map to the cache set.
Lemma 2 The following equation makes explicit that the cache set is filled with the last k accesses of the access sequence s, once its state is known:

must_k^p(n) = k ⇔ ∀s ∈ S, |s| = n : Must_k^p(s) = {s_{n-k+1}, ..., s_n}.

Proof The ⇐ direction of the equivalence is obvious. For ⇒, one needs to show that whenever must_k^p(n) = k, then for any sequence s of length n ≥ k, Must_k^p(s) consists of the last k accesses of s. For any sequence there is an initial state such that the sequence will result in misses only. Therefore, one of the intersected sets in the definition of Must_k^p(s) is always equal to the set of the last k accessed elements; since Must_k^p(s) contains k elements and is a subset of every intersected set, the claim follows.
Lemma 3 An address a that has just been accessed will at least remain in the cache set for the mls^p(k) subsequent accesses.
Proof ⇐: must_k^p(n) ≥ n, since the last n elements are always contained in the cache set. Obviously, must_k^p(n) ≤ n.
LRU caches
LRU replacement conceptually maintains a queue of length k for each cache set, where k is the associativity of the cache. If an element is accessed that is not yet in the cache (a miss), it is placed at the front of the queue. The last element of the queue is then removed if the set is full. It is the least-recently-used element of those in the queue. At a cache hit, the element is moved from its position in the queue to the front, in this respect treating hits and misses equally.
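The queue semantics of LRU can be modeled in a few lines; this is an illustrative sketch of the policy as described above, not of a concrete hardware implementation:

```python
def lru_update(state: list, x, k: int = 8) -> list:
    """One access to an LRU cache set of associativity k; 'state' lists
    the cached blocks from most- to least-recently used."""
    s = [b for b in state if b != x]  # a hit removes x from its old position
    s.insert(0, x)                    # the accessed block moves to the front
    return s[:k]                      # a miss to a full set drops the LRU block
```

Folding k pairwise different accesses through lru_update yields the same state for every initial state, which is exactly the content of the bounds below.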
The contents of LRU caches are very easy to predict. For memory access sequences with pairwise different accesses and a strict least-recently-used replacement, we obtain the tight bounds

evict^LRU_M(k) = evict^LRU_HM(k) = fill^LRU_M(k) = fill^LRU_HM(k) = mls^LRU(k) = k.
evict(k) and fill(k) tell us at which point any may-information and complete must-information can be determined. However, the metrics do not tell us how may- and must-information evolves before and after these points. For the common case of an 8-way set-associative cache, we have precisely determined how much may- and must-information is available as a function of the number of accesses. Note that these functions mark the maximum information that can be obtained; a particular analysis may be less precise. Figure 3 shows plots of these functions. In the case of LRU replacement these functions are quite obvious, which will be different in the following cases of FIFO, MRU, and PLRU. Must-information rises with every access up to the minimal life-span mls^LRU(8), which is equal to fill^LRU_HM(8) and evict^LRU_HM(8). Up to evict(k) accesses, any memory block mapped to the cache set may reside in the set.
We have determined these functions by exhaustively generating all successor states of all possible initial cache-set states, exploiting symmetries. For LRU and FIFO replacement this could have been done analytically, but for the other cases this would have been very tedious. This automatic computation was only possible up to associativity 8, as the number of states grows rapidly with rising associativity.

Fig. 3 Evolution of may- and must-information of an 8-way LRU cache set. c is the number of blocks that can be mapped to the cache set. May- and must-information is shown by the dashed and the solid curve, respectively. From fill(k) on, the two functions have the same value.

FIFO caches

FIFO cache sets can also be seen as a queue: new elements are inserted at the front, evicting elements at the end of the queue. In contrast to LRU, hits do not change the queue. Our representation of FIFO cache sets has to be interpreted in this way: in [b, c, e, d], d will be replaced on a miss, resulting in [x, b, c, e].
Implementations use a round-robin replacement counter for each set pointing to the cache line to replace next. This counter is increased if an element is inserted into a set, while a hit does not change this counter.
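Compared to LRU, only the treatment of hits differs, as a small model makes explicit (again an illustrative sketch mirroring the description above):

```python
def fifo_update(state: list, x, k: int = 8) -> list:
    """One access to a FIFO cache set of associativity k; 'state' lists
    the cached blocks from newest to oldest insertion."""
    if x in state:
        return list(state)    # a hit leaves the queue unchanged
    return ([x] + state)[:k]  # a miss inserts x, evicting the oldest block
```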
In the case of misses only, FIFO behaves like LRU. Thus, the following tight bounds are obvious:

evict^FIFO_M(k) = fill^FIFO_M(k) = k.

Theorem (mls^FIFO) The minimal life-span under FIFO replacement is mls^FIFO(k) = 1.

Proof Since the queue is not changed on a hit, the element just accessed may reside at the end of the queue. Thus, it may be evicted with the next access.
As in the LRU case, we have determined the evolution of must- and may-information experimentally. Figure 4 illustrates the results. Disappointingly from a predictability point of view, must-information exceeding the minimal life-span of 1 is only attained after 17 accesses.
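Such an exhaustive computation can be sketched as follows for policies like LRU and FIFO, whose set state is an ordered list without extra status bits; MRU and PLRU would additionally require enumerating their status bits. Restricting the enumeration to full initial states is our simplification:

```python
from itertools import permutations

def info_sizes(update, k: int, n: int, universe) -> tuple[int, int]:
    """Sizes of must- and may-information after n pairwise different
    accesses, over all (full) initial states drawn from 'universe'.
    'update(state, x, k)' is a policy model such as lru_update above."""
    universe = list(universe)
    xs = universe[:n]                       # the n pairwise different accesses
    finals = []
    for init in permutations(universe, k):  # unknown initial state: try them all
        state = list(init)
        for x in xs:                        # hits and misses both possible, since
            state = update(state, x, k)     # the initial state may contain the xs
        finals.append(set(state))
    must = set.intersection(*finals)        # present for every initial state
    may = set.union(*finals)                # present for some initial state
    return len(must), len(may)

# e.g. info_sizes(fifo_update, 4, 2, range(8)) == (1, 8): only the most
# recent access is guaranteed, matching FIFO's minimal life-span of 1
```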
MRU caches
MRU stores one status bit for each cache line. In the following, we call these bits MRU-bits. Every access to a line sets its MRU-bit to 1, indicating that the line was recently used. Whenever the last remaining 0 bit of a set's status bits is set to 1, all other bits are reset to 0. This asymmetry regarding the last bit set to 1 will play a role, as we will see later. At cache misses, the line with the lowest index (in our representation the left-most) whose MRU-bit is 0 is replaced.
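A small model of this update rule; the encoding of a set state as a list of blocks plus a list of MRU-bits is ours:

```python
def mru_update(state: list, bits: list, x, k: int = 8):
    """One access to an MRU cache set of associativity k."""
    state, bits = list(state), list(bits)
    if x in state:
        i = state.index(x)  # hit: the line keeps its position
    else:
        i = bits.index(0)   # miss: replace the left-most line with MRU-bit 0
        state[i] = x
    bits[i] = 1             # every access sets the line's MRU-bit to 1
    if all(bits):           # the last remaining 0 bit was just set to 1:
        bits = [1 if j == i else 0 for j in range(k)]  # reset all others to 0
    return state, bits
```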
We represent a sample state of an MRU cache set as [a, b, c, d]_0010, where the bit vector contains the MRU-bits of the respective lines; here, only c's MRU-bit is set.

Theorem 4 (evict^MRU) evict^MRU(k) = 2k − 2.

Proof
For the upper bound, notice that at some point during any k pairwise different accesses (hits or misses), the MRU-bits are reset. MRU-bits of lines that have not been accessed until this point are then set to 0. If it took k accesses to reset the bits, exactly these k elements make up the cache set. Otherwise (fewer than k accesses), after the reset k − 1 MRU-bits are 0, and an additional k − 1 accesses are sufficient, because accesses to elements with MRU-bit 1 are impossible from the reset point on: they would be hits and violate the property of pairwise different accesses.
Theorem 5 (fill^MRU) For the MRU replacement policy it is impossible to give a bound on the number of accesses needed to reach a completely known cache-set state:

fill^MRU(k) = ∞ for k > 2.
Proof Consider an access sequence of pairwise different accesses. After at most 2k − 2 accesses there will be only misses. From then on, the set always contains, besides the most recently accessed elements, one older element whose identity depends on the unknown initial state; hence the cache-set states resulting from different initial states never converge.

Lemma 5 Once an element of the access sequence has been inserted into the cache set, at least k − 2 other cache lines are accessed before it is replaced.

Proof Consider an arbitrary element e of the sequence. Since elements are inserted from left to right, all elements in the set left of e will be replaced earlier (after the next reset). To the right of e there can be at most one element with MRU-bit 1. Thus, at least k − 2 other cache lines will be accessed before the next reset, and thus before e is replaced.
Theorem 6 Let k > 2. After at most 2k − 4 misses the last k − 1 accessed elements are present in the cache set, and the set is stable with respect to this weaker property. This bound is tight.
Proof The first reset of the MRU-bits occurs after at most k − 1 accesses. If it takes exactly k − 1 accesses, the initial cache-set state fits the requirements of Lemma 5, proving the theorem for this case. Otherwise, the reset takes place after at most k − 2 accesses, and k − 2 additional accesses are sufficient due to Lemma 5, because the miss causing the reset has an MRU-bit of 1 and cannot be evicted by the next k − 2 misses. For tightness, one can construct an initial state and access sequence whose last k − 2 accesses are misses, but where neither x_{k−1} nor x_k, which are still in the cache set, belongs to the last k − 1 misses.
Theorem 7 Let k > 2. After at most 3k − 4 accesses (hits or misses), the last k − 1 accessed elements are present in the cache set, and the set is stable with respect to this weaker property. This bound is tight.
Proof Due to our general assumption of pairwise different accesses, it holds that after the MRU-bits have been reset the second time, no more hits are possible, because every line has been accessed at least once: every MRU-bit must have been 0 at some time and 1 later on. Now Lemma 5 is applicable, and k − 2 further accesses are sufficient.
The first reset occurs after at most k accesses, the second one after exactly k − 1 additional accesses. Adding the k − 2 accesses after the second reset yields 3k − 3. We now exclude the cases where k accesses are needed for the first reset, proving the upper bound of 3k − 4: if exactly k accesses were needed to reset the bits for the first time, every cache line with MRU-bit 1 must have been accessed. Thus no further hits are possible after the first reset already.
For tightness, consider cache-set states and access sequences u_1, ..., u_{k−2}, v_1, ..., v_{k−2}, w_1, ..., w_{k−3} in which the last k − 1 accesses were v_{k−3}, v_{k−2}, w_1, ..., w_{k−3}, but v_{k−3} has just been evicted by w_{k−3}. Only the next access (evicting u_{k−2}) will make sure the last k − 1 accessed elements are present in the cache set. This shows tightness for k > 2. Note that for k = 4 the accesses v_1, ..., v_{k−4} and the MRU-bit prefixes 0...0 and 1...1 do not exist.
Theorem 8 (mls^MRU) The minimal life-span of an element in an MRU cache set is 2.
Proof The MRU-bit of an accessed element e is always set to 1, resulting in mls^MRU(k) > 1. But the next access may reset all the MRU-bits. If e is then the left-most element, it will be replaced with the next access, which yields mls^MRU(k) = 2.
The evolution of may- and must-information is depicted in Fig. 5. As complete must-information is never attained, the must-curve peaks at 7. Interestingly, may-information never drops below the 14 = 2k − 2 memory blocks that are reached after evict accesses. This can be explained quite easily: the element that causes the reset of the MRU-bits remains in the set for 2k − 2 further accesses, and due to the unknown initial state, any access could have caused the reset. This behavior is in contrast to that of LRU, FIFO, and PLRU, where eventually only the last k accessed elements may reside in a cache set.
PLRU caches
PLRU (Pseudo-LRU) is a tree-based approximation of the LRU policy. It arranges the cache lines at the leaves of a tree with k − 1 "tree bits" pointing to the line to be replaced next; a 0 indicates the left subtree, a 1 the right. See Fig. 6 for an explanation of the replacement policy. PLRU is much cheaper to implement than true LRU in terms of storage requirements and update logic. This comes at a price: it does not always replace the least-recently-used element, which reduces predictability.
PLRU also tracks invalid lines. On a cache miss, invalid lines are filled from left to right, ignoring the tree bits. The tree bits are still updated.
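The following sketch models the policy just described, storing the k − 1 tree bits in heap order; the representation is ours, and k is assumed to be a power of two:

```python
def plru_access(lines: list, bits: list, x):
    """One access to a PLRU cache set; lines[i] is None for an invalid line,
    bits[0] is the root tree bit (0 = left subtree, 1 = right subtree)."""
    k = len(lines)
    if x in lines:
        i = lines.index(x)                   # cache hit
    elif None in lines:
        i = lines.index(None)                # miss with invalid lines: fill from
        lines[i] = x                         # left to right, ignoring the tree bits
    else:
        node = 0                             # miss: follow the tree bits from the
        while node < k - 1:                  # root down to the victim leaf
            node = 2 * node + 1 + bits[node]
        i = node - (k - 1)                   # leaf number -> line index
        lines[i] = x
    node = i + (k - 1)                       # in all cases, flip the bits on x's
    while node > 0:                          # path so that they point away from x
        parent = (node - 1) // 2
        bits[parent] = 0 if node == 2 * parent + 2 else 1
        node = parent
    return lines, bits
```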
Since illustrating the states of these cache sets is rather complicated, we introduce the notion of a normalized cache-set state. With no invalid lines, equivalent cache-set states with the same content and the same order of replacements can be obtained by interchanging neighboring subtrees and flipping the corresponding tree bit. We represent a concrete cache set by the equivalent one with all tree bits set to 1. For instance, the concrete cache-set state [a, b, c, d]_010 with tree bits 010 in Fig. 6 is represented by [d, c, a, b]_≅. Disregarding invalid lines, the right-most element will be replaced in the normalized representation on a cache miss; it is pointed at by the tree bits. An access moves an element to the left-most position.
An access path to a cache line is a sequence of bits indicating the directions one has to take to walk from the root to this line in the normalized representation of the cache set; 0 for left, 1 for right. E.g., the access path of d in [a, b, c, d, e, f, g, h]_≅ is 011.
We will interpret access paths as binary numbers. We will use two operators on them: reversal, writing p_n...p_1 for the reversal of p_1...p_n, and bitwise negation, written ¬.

Definition 1 (Miss replacement distance) The miss replacement distance mrd(e) of an element e is the minimum number of consecutive misses that are necessary to evict e from a cache set q. For elements e ∉ q we define mrd(e) = 0.
Lemma 6 (Miss replacement distance) A cache line e with access path p_1...p_n has miss replacement distance mrd(e) = ¬(p_n...p_1) + 1, assuming no invalid lines. The cache line d with access path 011 from the example above will be replaced after ¬(110) + 1 = 001 + 1 = 2 consecutive misses: 011 → 111 → replaced.

Fig. 5 Evolution of may- and must-information of an 8-way MRU cache set. c is the number of blocks that can be mapped to the cache set. Note that complete must-information cannot be obtained, thus fill^MRU(8) = ∞.
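Under the stated assumptions, Lemma 6 translates directly into code (the function name is ours):

```python
def mrd(path: str) -> int:
    """Miss replacement distance of a line with access path 'path',
    per Lemma 6 (normalized representation, no invalid lines)."""
    negated_reversal = path[::-1].translate(str.maketrans("01", "10"))
    return int(negated_reversal, 2) + 1  # read as a binary number, plus one

assert mrd("011") == 2  # the element d above: 011 -> 111 -> replaced
assert mrd("000") == 8  # a just-accessed line in an 8-way set survives 8 misses
```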
Observation 9 An access to an element flips at most one bit of another element e's access path from 0 to 1, namely the bit at the node where the two access paths diverge; bits at nodes above that one are flipped to 0.

Theorem 10 (mls^PLRU) The minimal life-span of an element in a PLRU cache set is mls(k) = log₂(k) + 1. In other words, the elements of the last log₂(k) + 1 accesses to a cache set always reside in the set.
Proof After the access to an element, its access path is 0...0. To replace this element, all bits on its access path must be flipped to 1...1. By Observation 9, each access to another element flips at most one of the bits of the access path to 1. To reach the lower bound of log₂(k) + 1, one must access the neighboring subtrees in a bottom-up fashion, to avoid flipping bits back to 0.
Eviction
Theorem 11 (evict^PLRU_M) evict^PLRU_M(k) = 2k − 2^⌈n/2⌉ − 2^⌊n/2⌋, where n = log₂(k), is a tight bound on the number of misses to evict all entries from a k-way set-associative PLRU cache set.
Proof Assuming no invalid lines, this proof is easy: it is a simple consequence of Lemma 6 that k misses suffice to evict a complete set. If all lines are invalid, the problem is equally easy. It becomes more complicated if some subset of size 0 < i < k of the lines is invalid. The first i misses will then go into these invalid lines instead of following the standard PLRU replacement policy. These accesses do, however, modify the tree bits in the standard way, as if they had been hits. The number of misses needed to completely evict the cache set is then determined by the positions of the remaining k′ = k − i non-accessed lines. Each line can be associated with the number of misses necessary to replace the content of that line. By Lemma 6, the line with access path p_1...p_n will be replaced after ¬(p_n...p_1) + 1 consecutive misses, i.e., the number of trailing 0s in p_1...p_n mainly determines the miss replacement distance. To have m trailing 0s, none of the 2^m − 1 neighbors in the particular subtree of height m may have been accessed in the first phase filling up the invalid lines; any access in the subtree would have flipped at least one of the final m bits. If k′ lines have not been accessed yet, the maximal number of trailing 0s in any of these lines' access paths is ⌊log₂ k′⌋. So, the maximal distance to eviction of any untouched line, together with the i = k − k′ misses of the first phase, yields

z = 2k − k′ − k/k′

as an upper bound for the number of accesses to evict a PLRU set with misses only, assuming k′ is a power of two. Obviously, z is maximized by a power of two (for any non-power of two k′ = 2^l + δ, 0 < δ < 2^l, k′ = 2^l results in a higher value of z). Maximizing 2k − k′ − k/k′ over powers of two k′ yields the bound stated in the theorem, attained at k′ = 2^⌈n/2⌉ and k′ = 2^⌊n/2⌋. If one cannot assume that only misses will occur, the number of accesses for eviction gets even larger. However, we do not have to consider invalid lines in that case, because allocations to invalid lines are equivalent to hits at those positions.
For the case of hits and misses we need a simple lemma that relates the number of accesses to the two halves of a cache set:

Lemma 7 The numbers of accesses to the two halves c_1, c_2 of a 2k-way cache set differ by at most k.
Proof Consider a situation with h_i hits and m_i misses to c_i, and let a_i = h_i + m_i. For each but the first miss on c_2 there must be an access to c_1 to flip the root bit back to c_2; together with h_2 ≤ k this bounds the difference of a_1 and a_2 in all but one case. The last possible case is h_2 = k, in which all hits h_2 must have preceded all misses m_2, due to the accesses in the sequence being pairwise different. But every further access to c_2 must then be directly preceded by at least one access to c_1 again, which again yields a difference of at most k.

Theorem 12 (evict^PLRU_HM) evict^PLRU_HM(k) = (k/2) log₂(k) + 1 is a tight bound on the number of accesses (hits or misses) to evict all entries from a k-way set-associative PLRU cache set.

Proof Claim: let z(k) be an upper bound for the number of accesses needed to evict a cache set of associativity k. Then z(2k) = 2z(k) + k − 1 is an upper bound for a set of associativity 2k.
We consider a set of size 2k to be composed of two halves c_1, c_2 of size k. W.l.o.g., let c_1 be the first half with no initial contents left, and let a_1 and a_2 be the numbers of accesses on c_1 and c_2, respectively, to reach this state. Then c_2 needs at most z(k) − a_2 further accesses. Since c_1 consists of elements from the access sequence only, every subsequent access to c_1 will be a miss. Therefore, there can be at most one access to c_1 between two consecutive accesses to c_2 from now on.
Combining the last two statements there can be at most 2(z(k) − a 2 ) − 1 further accesses until c 2 is completed, too. Adding the first a 1 + a 2 accesses results in a 1 + a 2 + 2(z(k) − a 2 ) − 1 = 2z(k) + a 1 − a 2 − 1. Using Lemma 7 this is bounded by 2z(k) + k − 1.
Solving the recurrence for z with the trivial value z(2) = 2 yields z(k) = (k/2) log₂(k) + 1 and proves the upper bound.
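The solution of the recurrence can be checked mechanically; the closed form below is the one stated in the theorem:

```python
from math import log2

def z_closed(k: int) -> int:
    return (k // 2) * int(log2(k)) + 1   # (k/2) * log2(k) + 1

z = {2: 2}                               # the trivial base case
for k in (2, 4, 8, 16, 32):
    z[2 * k] = 2 * z[k] + k - 1          # unfold z(2k) = 2z(k) + k - 1

assert all(z[k] == z_closed(k) for k in z)  # e.g. z(8) = 13, z(16) = 33
```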
To prove tightness, assume a worst-case initial cache-set state c_k and a worst-case access sequence s_k = u_1, ..., u_{z(k)} for associativity k are known. The access sequence x_1, ..., x_k, u_1, v_1, ..., u_{z(k)−1}, v_{z(k)−1}, u_{z(k)} evicts the contents of the cache set with initial state [x_1, ..., x_k] • c_k with no fewer than k + 2z(k) − 1 accesses.
For k = 2, all cache-set states and all access sequences of length 2 are worst-case initial cache-set states, serving as a basis for the recursion.
Lemma 8 If it takes evict^PLRU_HM(k) accesses to evict a cache set, the last two accesses must have gone to different halves of the cache set.

Proof Assuming this is false, one could insert an additional miss-access between the last two accesses on the half not accessed. Thus the number of accesses for eviction would be increased by one, contradicting the assumption of a worst case.

Theorem 13 (fill^PLRU_HM) fill^PLRU_HM(k) = evict^PLRU_HM(k) + k − 2 = (k/2) log₂(k) + k − 1 is a tight bound on the number of accesses (hits or misses) to reach a completely known cache-set state.

Proof We want to prove the given bound based on our results for evict^PLRU_HM(k).
Since the last access to a set always resides in the left-most position with access path 0...0, k − 1 additional misses suffice to fill the set, due to Lemma 6. This still leaves us one short of the given bound if eviction took exactly evict^PLRU_HM(k) steps. In that case, however, the last two accesses must have gone to different halves due to Lemma 8. Thus, they have access paths 0...0 and 10...0. Due to Lemma 6, they will be replaced after k and k − 1 misses, respectively. Thus k − 2 further accesses suffice.
Tightness is shown by modifying a generic worst-case example for evict^PLRU_HM(k). Let s = x_1, ..., x_e be this worst-case access sequence (assuming the same initial cache-set state), and let | denote the center of the cache set. Effectively, we remove the last two accesses from the old example and insert a hit h into the access sequence, accessing the left side of the (normalized) cache set.
Knowing that the last two accesses x_{e−3}, x_{e−2} accessed different halves of the set, the hit h changes the order in which these two elements will be replaced. Thus x_{e−3} must be evicted from the set to stabilize it. Due to Lemma 6, this takes k − 1 additional accesses, because x_{e−3} has access path 10...0 after the hit; after k − 2 accesses it is still in the cache set, so the set does not yet consist of the last k accessed elements. Carrying out s only will result in the cache-set state depicted above, which is not yet stabilized.
The evolution of may- and must-information for a PLRU set of associativity k = 8 is depicted in Fig. 7. As for every policy, must-information initially rises up to mls(k) and reaches k after fill(k) accesses; may-information drops to at most evict(k) elements after evict(k) accesses. The further development of both curves is less uniform than in the other cases, which might be attributed to the more complicated policy.

Related work

Sleator and Tarjan (1985) consider replacement policies from a different point of view. They investigate the amortized efficiency of the list update and paging rules LRU, FIFO, LIFO, and LFU. As a reference they take Belady's (1966) optimal offline policy OPT. They show that any online algorithm must fare worse than OPT by a certain factor, and go on to prove that LRU and FIFO perform as well as possible for an online algorithm. Their work concerns theoretical performance limits rather than predictability of replacement policies.
Al-Zoubi et al. (2004) perform measurements using the SPEC CPU2000 benchmarks, comparing the performance of different associativities and replacement policies including FIFO, LRU, PLRU, MRU, and OPT. They conclude that LRU, PLRU, and MRU show nearly the same performance. These policies are approximately as good as a cache of half the size with OPT policy, while clearly outperforming FIFO. This interesting experimental result yields insights concerning average-case performance in practice. It does, however, not deal with predictability.

Heckmann et al. (2003) provide must- and may-analyses for LRU, PLRU, and a pseudo round-robin replacement policy in the context of worst-case execution time tools. Cache lines are assigned ages, where "old" lines are close to eviction. Newly introduced lines assume the minimum age 0. Updates change these ages to account for all possible concrete scenarios: in the may-analysis the minimal possible age is taken, in the must-analysis the maximal. For LRU, this yields very precise and efficient analyses. For PLRU, the must-analysis loses precision while staying efficient: it can maximally infer 4 of the 8 lines of an 8-way set-associative PLRU cache set, which is strongly related to our Theorem 10. The may-analysis becomes useless, since only ages 0 and 1 are reachable. They also give an example of a replacement policy with very poor predictability: pseudo round-robin, used in the Motorola ColdFire 5307. It is effectively a FIFO replacement, except that the replacement counter is shared among all cache sets. The inability to analyze the sets independently results in an even lower predictability than for the FIFO policy.

The manual of the PowerPC 75x series (Freescale Semiconductor Inc. 2002) gives the number of uniquely addressed misses to flush an 8-way PLRU cache set used in these CPUs, which is an instance of evict^PLRU_M.

Fig. 7 Evolution of may- and must-information of an 8-way PLRU cache set. c is the number of blocks that can be mapped to the cache set.
Putting it all together, Al-Zoubi et al. (2004) provide empirical performance results, whereas Sleator and Tarjan (1985) present a theoretical performance analysis that is independent of any particular benchmark.
In contrast to performance, predictability in the sense of this paper is concerned with the obtainable precision of provable upper and lower bounds on execution times. Static analysis is used to determine such bounds. Heckmann et al. (2003) provide specific static cache analyses for several replacement policies and compare their precision. Our work presents the theoretical limits of any static cache analysis.
Conclusions and future work
An important part of the design of hard real-time systems is the proof of timeliness, which is determined by the worst-case performance of the system. Performance-boosting components like caches have an increasing impact on both the average-case and the worst-case performance. We investigated the predictability of four popular cache replacement policies. To this end, we introduced the metrics evict and fill and determined their values.
In these metrics, no policy can perform better than LRU, because k is an obvious lower bound for any replacement policy. The other policies under investigation, PLRU, MRU, and FIFO, perform considerably worse: in the more interesting cases of evict_HM(k) and fill_HM(k), FIFO and MRU exhibit linear growth in terms of k, while PLRU grows super-linearly. However, instantiating k with the common values 4 and 8 shows a different picture, see Table 2. Here, PLRU even fares slightly better than FIFO and MRU. Yet, compared to 8-way LRU, PLRU, MRU, and FIFO take more than twice as long to regain complete information. In particular, this differs from the worst-case performance results obtained in Sleator and Tarjan (1985), where FIFO and LRU fared equally well.
