Abstract. Abstract interpretation is a technique for the static analysis of dynamic properties of programs. It is semantics based, that is, it computes approximative properties of the semantics of programs. On this basis, it supports correctness proofs of analyses. It thus replaces commonly used ad hoc techniques by systematic, provable ones, and it allows the automatic generation of analyzers from specifications, as in the Program Analyzer Generator, PAG. In this paper, abstract interpretation is applied to the problem of predicting the cache behavior of programs. Abstract semantics of machine programs are defined for different types of caches; they determine the contents of caches. The calculated information makes it possible to sharpen worst case execution times of programs by replacing the worst case assumption 'cache miss' by 'cache hit' at some places in the programs. It is possible to analyze instruction, data, and combined instruction/data caches for common (re)placement and write strategies. The analysis is designed to be generic, with the cache logic as a parameter.
Cache Memories and Real-Time Applications
Caches are used to improve the access times of fast microprocessors to relatively slow main memories. They can reduce the number of cycles a processor is waiting for data by providing faster access to recently referenced regions of memory.
Programs with hard real time constraints have to be subjected to a schedulability analysis by the compiler [24, 9]; it has to be determined whether all timing constraints can be satisfied. WCETs (Worst Case Execution Times) for processes have to be used for this. For hardware with caches, the appropriate worst case assumption is that all accesses miss the cache. This is an overly pessimistic assumption which leads to a waste of hardware resources.
Correct information about the contents of the cache at program points could help to sharpen the worst case execution times. Such information can be computed by an abstract interpretation statically collecting information about cache contents. The way this information is computed, an abstraction of the concrete semantics of the programs, depends on the type of cache regarded and the cache replacement strategy. Several abstract semantics are described, for different types of caches.
Overview
In the following section we briefly sketch the underlying theory of abstract interpretation and present the analysis tool PAG. Section 4 describes related approaches for the prediction of cache behavior.
Cache memories are briefly described in section 5. In section 6 we give a semantics for programs that reflects only memory accesses (to fixed addresses) and their effects on cache memories for common cache architectures. In section 7 we present a must analysis that computes a subset of the memory blocks that must be in the cache and a may analysis that computes a superset of the memory blocks that may be in the cache, and describe how the results of the analyses can be interpreted.
The functional and the callstring approach developed for the abstract interpretation of programs with recursive procedures are used in section 8 to compute the behavior of memory references within loops by combining the results of the must and may analyses. An example is given in section 9.
In section 10 we describe how the analyses can be transferred to the analysis of data caches or combined instruction/data caches for a restricted class of programs, and how a combination of the must and may analyses can be used for the analysis of writes to the cache for common cache organizations.
Program Analysis by Abstract Interpretation
Program analysis is a widely used technique to determine runtime properties of a given program without actually executing it. There is a common theory for all program analyses called abstract interpretation [6, 7, 8]. With this theory, termination and correctness of a program analysis can be easily proven. According to this theory a program analysis is determined by an abstract semantics.
The program analyzer generator PAG [1, 2] offers the possibility to generate a program analyzer from a description of the abstract domain and of the abstract semantic functions. These descriptions are given in two high level languages, which support the description even of complex domains and semantic functions. The domain can be constructed from some simple sets like integers by operators like building power sets or constructing functions. The semantic functions are described in a functional language which combines high expressiveness with efficient implementation. Additionally one has to write a join function combining two incoming values of the domain into a single one. This function is applied whenever a point in the program has two (or more) possible execution predecessors.
For the analysis of programs with (recursive) procedures PAG supports the functional approach and the call string approach [22].
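The role of the join function can be illustrated with a small worklist solver. This is only a sketch of the general idea, not PAG's specification language or its actual algorithm; the names `solve`, `succs`, `transfer` and `join`, and the toy domain of assigned variables, are assumptions made for illustration.

```python
from collections import deque

def solve(succs, entry, init, transfer, join):
    """Propagate abstract values through a control flow graph until stable.

    succs:    dict mapping each node to the list of its successors
    transfer: abstract semantic function, (node, value_in) -> value_out
    join:     combines two incoming values at a merge point
    Returns a dict with the value holding at the entry of each reached node.
    """
    value_in = {entry: init}
    worklist = deque([entry])
    while worklist:
        node = worklist.popleft()
        out = transfer(node, value_in[node])
        for succ in succs.get(node, []):
            old = value_in.get(succ)
            # The join is only needed once a second value reaches the node.
            new = out if old is None else join(old, out)
            if new != old:
                value_in[succ] = new
                worklist.append(succ)
    return value_in

# Toy domain (hypothetical): sets of variable names that may have been assigned.
succs = {"start": ["then", "else"], "then": ["merge"], "else": ["merge"], "merge": []}
assigns = {"start": set(), "then": {"x"}, "else": {"y"}, "merge": set()}

result = solve(
    succs, "start", frozenset(),
    transfer=lambda node, value: value | assigns[node],
    join=lambda a, b: a | b,
)
print(sorted(result["merge"]))  # ['x', 'y']
```

The node `merge` has two execution predecessors, so its incoming value is the join of both branches. Replacing the union by an intersection would turn the same solver into a "must" style analysis.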
Related Work
The computation of WCETs for real-time programs is an ongoing research activity. Park and Shaw [18] describe a method to derive WCETs from the structure of programs. In [20] Puschner and Koza propose methods to guide the computation of WCETs by user annotations. Neither approach takes cache behavior into account.
The possibilities of using optimizing compilers to improve the cache performance of programs have been studied extensively [13, 14, 19, 26]. However, the proposed program transformations and code reorganizations do not necessarily help in computing the worst case execution time of a program. An overview of 'Cache Issues in Real-Time Systems' is given in [4]. We restrict our examination here to the intrinsic cache behavior. In [16, 15] Arnold, Mueller, Whalley, and Harmon describe a data flow analysis for the prediction of instruction cache behavior of programs for direct mapped caches.
A method for data cache analysis by graph coloring is described in [17, 21]. Similar to the Chow-Hennessy register allocator, variables are allocated to cache lines. The objective of the analysis is to show that throughout the live range of a cache line, no other memory access interferes with this particular cache line.
In [12] a general framework is described for the computation of WCETs of programs in the presence of pipelines and cache memories. Two kinds of pipeline and cache state information are associated with every program construct for which timing equations can be formulated. One describes the pipeline and cache state when the program construct is finished. The other one can be combined with the state information from the previous construct to refine the WCET computation for that program construct. An approximation to the solution for the set of timing equations has been proposed.
Cache Memories
A cache can usually be characterized by three major parameters: [...]

A set can be considered as a fully associative subcache. In the case of an A-way set associative or fully associative cache, a cache line has to be selected for replacement when the cache is full and the processor requests further data. This is done according to a replacement strategy. Common strategies are LRU (Least Recently Used), FIFO [...]

The fully associative set $\{l_j, \ldots, l_{j+A-1}\}$ is treated as the fully associative cache above. For all cache lines that are not in the set, the cache state remains unchanged.

We represent programs by control flow graphs consisting of nodes and typed edges. The nodes represent basic blocks⁴. For each basic block it is known which memory blocks it references. For our cache analysis, it is most convenient to have one memory reference per control flow node. Therefore, our nodes may represent the different fragments of machine instructions that access memory. The goal is to determine for every control flow node $n$ whether the references to the memory blocks $L(n)$ will result in cache hits or cache misses.

⁴ A basic block is a sequence (of fragments) of instructions in which control flow enters at the beginning and leaves at the end without halt or possibility of branching except at the end.
This can be computed from the abstract semantics by:
- If a memory block $s$ is not in $\hat{c}(l)$ for any $l$, then it is definitely not in any cache line. This memory reference will always miss the cache.
- If $\hat{c}(l) = \{s\}$ for a cache line $l$, then $s$ is definitely in cache line $l$. This memory reference will always hit the cache.
- If $\hat{c}(l) = \{I, s\}$ for a cache line $l$, then $s$ is definitely in cache line $l$ for the second and all following executions of $n$.
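The three rules above can be written as a small classification function. This is a sketch under assumptions: the abstract cache state is modeled as a Python dict from cache line names to sets of possible contents, and `I` is assumed to be the marker for an empty (invalid) line; the function name and the label strings are hypothetical.

```python
INVALID = "I"  # assumed marker for an empty (invalid) cache line

def classify(abstract_state, s):
    """Classify a reference to memory block s for a direct mapped cache.

    abstract_state: dict mapping each cache line to the set of its
    possible contents at the control flow node in question.
    """
    # Rule 1: s occurs in no line, so the reference always misses.
    if all(s not in contents for contents in abstract_state.values()):
        return "always miss"
    for contents in abstract_state.values():
        # Rule 2: the line can only contain s.
        if contents == {s}:
            return "always hit"
        # Rule 3: the line is either still empty or contains s.
        if contents == {INVALID, s}:
            return "hit from the second execution on"
    # s shares a line with other possible blocks: nothing is guaranteed.
    return "unknown"

state = {"l0": {"a"}, "l1": {INVALID, "b"}, "l2": {"c", "d"}}
for block in ("a", "b", "c", "z"):
    print(block, classify(state, block))
```

The fourth outcome, `unknown`, is not one of the rules in the text; it is added here so that the function is total.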
In [3] references to instruction caches are further categorized taking the loop nesting level of the instruction into account. An instruction within a loop is called first miss if the first reference to the instruction is a cache miss and all remaining references during the execution of the loop are cache hits. Likewise, a first hit indicates that the first reference to the instruction will be a hit and all remaining references during the execution of the loop will be misses (see Table 1). This categorization of instructions is used in a timing tool to compute the WCET of a program.

For fully associative caches and set associative caches, two different join functions have to be used. For the identification of 'always hits', the join function corresponds to set intersection, and for the identification of 'always miss', the join function corresponds to set union.
During the analysis for direct mapped caches empty sets never occur. The interpretation of sets of one element is equivalent under union and intersection: $\#(A \cup B) = 1 \text{ and } A \neq \emptyset \text{ and } B \neq \emptyset \Rightarrow (A \cup B) = (A \cap B)$.
Join Functions for Fully Associative Caches with LRU Replacement
For the fully associative cache with LRU replacement strategy we can use the following join function to determine if a memory block $s$ is in the cache at a control flow node $n$: $\hat{J}^{\cap}_{M_{assoc}}(\hat{c}_1, \hat{c}_2) = \hat{c}$ where $\hat{c}(l_x) = \{ s_i \mid \exists\, l_a, l_b \text{ with } s_i \in \hat{c}_1(l_a),\ s_i \in \hat{c}_2(l_b) \text{ and } x = \max(a, b) \}$.
The position of the memory blocks in the abstract cache state, i.e. the number of the cache line, represents the relative age of a memory block. If a memory block $s$ has two different relative ages in two abstract cache states, i.e. is in different positions $s \in \hat{c}_1(l_x)$ and $s \in \hat{c}_2(l_y)$, then the join function takes the oldest relative age, i.e. the highest position.
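A minimal sketch of this join, assuming an abstract cache state is represented as a Python list of sets indexed by relative age (index 0 is the youngest line) and that each block occurs at most once per state. The function name `must_join` and the example states are illustrative assumptions.

```python
def must_join(c1, c2):
    """Keep only blocks present in both states, at the older of their two ages."""
    age1 = {s: age for age, line in enumerate(c1) for s in line}
    age2 = {s: age for age, line in enumerate(c2) for s in line}
    joined = [set() for _ in c1]
    for s in age1.keys() & age2.keys():      # blocks known in both states
        joined[max(age1[s], age2[s])].add(s)  # x = max(a, b)
    return joined

c1 = [{"a"}, set(), {"c", "f"}, {"d"}]
c2 = [{"c"}, {"e"}, {"a"}, {"d"}]
print(must_join(c1, c2))
```

In this example `a` and `c` both end up at age 2, `d` stays at age 3, and `e` and `f` are dropped because each is present in only one of the two states.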
A reference to s will always hit the cache.
- If $s \in \hat{c}(l_x)$ then $s$ will remain in the cache for at least $(\frac{capacity}{line\ size} - x)$ cache updates that put a 'new' element into the cache.

To determine if a memory block $s$ is never in the cache at a control flow node $n$ we use the join function $\hat{J}^{\cup}_{M_{assoc}}(\hat{c}_1, \hat{c}_2) = \hat{c}$ where $\hat{c}(l) = \hat{c}_1(l) \cup \hat{c}_2(l)$. Here we have the same join function as for the direct mapped cache.
An abstract cache state $\hat{c}$ at a control flow node $n$ can be interpreted in the following way:
- If a memory block $s$ is not in $\hat{c}(l)$ for any $l$, then it is definitely not in any cache line. This memory reference will always miss the cache.
- If $s \in \hat{c}(l_x)$ and $\{l_j, \ldots, l_x, \ldots, l_{j+A-1}\}$ is the fully associative set of the cache with $j \le x \le j + A - 1$, then $s$ will remain in the cache for at least $(j + A - 1) - x$ cache updates that put a 'new' element into the cache.
To determine if a memory block $s$ is never in the cache at a control flow node $n$ we use the same join functions and the same interpretation as in the fully associative case: $\hat{J}^{\cup}_{M_{A\text{-way}}} = \hat{J}^{\cup}_{M_{assoc}}$.
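The line-wise union used to find blocks that are never in the cache can be sketched as follows, again assuming a list-of-sets representation of abstract cache states. The names `may_join` and `never_in_cache` are hypothetical.

```python
def may_join(c1, c2):
    """Line-wise union: a block may be in a line if it may be there in either state."""
    return [line1 | line2 for line1, line2 in zip(c1, c2)]

def never_in_cache(state, s):
    """A reference to s always misses if s occurs in no line of the may state."""
    return all(s not in line for line in state)

c1 = [{"a"}, set(), {"c", "f"}, {"d"}]
c2 = [{"c"}, {"e"}, {"a"}, {"d"}]
joined = may_join(c1, c2)
print(joined)
print(never_in_cache(joined, "g"))  # True
```

Unlike the intersection-style join, no block is ever lost here, so a block may appear in several lines of the joined state.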
Loops are of special interest, since many programs spend most of their runtime within loops. In a control flow graph, a loop is represented as a cycle. The start node of a loop has two incoming edges. One represents the entry into the loop, the other represents the control flow from the end of the loop to the beginning of the loop. The latter is called the loop edge⁸.
A loop usually iterates more than once. Since the execution of the loop body usually changes the cache contents, it is useful to distinguish the first iteration from all others. This could be achieved by virtually unrolling each loop once.
Example 4. Let us consider a sufficiently large fully associative data cache with LRU replacement strategy and the following program fragment:

    ...
    /* Variable x not in the data cache */
    for i:=1 to .. do
      ... y:=x ...
    end
    ...

In the first execution of the loop, the reference to x will be a cache miss, because x is not in the cache. In all further iterations the reference to x will be a cache hit, if the cache is sufficiently large to hold all variables referenced within the loop.
For the abstract interpretation, the join function $\hat{J}^{\cap}_{M_{assoc}}$ combines the abstract cache states at the start node of the loop. Since the join function is 'similar' to set intersection, the combined abstract cache state will never include the variable x, because x is not in the abstract cache state before the loop is entered. For a WCET computation for a program this is a safe approximation, but nevertheless not a very good one.
Loop unrolling would overcome this problem. After the first unrolled iteration, x would be in the abstract cache state and would be classified as always hit.
For nested loops, loop unrolling can be an expensive transformation which is exponential in the nesting depth. This problem is similar to the problem of analyzing procedures in program analysis, for which solutions exist (see Section 3).
For our analysis of cache behavior we transform loops into procedures to be able to use the existing methods and tools (see Figure 1).
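The concrete behavior described in this example can be reproduced with a small LRU simulation. This is a sketch with assumed values: a capacity of four lines, two unrelated blocks `a` and `b` already cached, and four loop iterations; none of these numbers come from the text.

```python
from collections import OrderedDict

def access(cache, block, capacity):
    """Concrete LRU access; returns True on a hit, False on a miss."""
    hit = block in cache
    if hit:
        cache.move_to_end(block)       # block becomes most recently used
    else:
        if len(cache) >= capacity:
            cache.popitem(last=False)  # evict the least recently used block
        cache[block] = True
    return hit

capacity = 4
cache = OrderedDict.fromkeys(["a", "b"], True)  # x is not cached before the loop

x_hits = []
for _ in range(4):                               # loop body: y := x
    x_hits.append(access(cache, "x", capacity))
    access(cache, "y", capacity)

print(x_hits)  # [False, True, True, True]
```

Only the first reference to `x` misses, because the cache is large enough to hold everything the loop body touches.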
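The loss of precision at the loop head, and its recovery by unrolling once, can be sketched for a four-line fully associative cache. Both helper functions are assumptions: `must_join` follows the intersection-with-maximal-age join described earlier, while `must_update` is a commonly used LRU must update (the referenced block becomes youngest, younger blocks age by one) that is not spelled out in the text above.

```python
def must_join(c1, c2):
    """Keep blocks present in both states, at the older of their two ages."""
    age1 = {s: a for a, line in enumerate(c1) for s in line}
    age2 = {s: a for a, line in enumerate(c2) for s in line}
    joined = [set() for _ in c1]
    for s in age1.keys() & age2.keys():
        joined[max(age1[s], age2[s])].add(s)
    return joined

def must_update(state, s):
    """Assumed LRU must update for a reference to block s."""
    n = len(state)
    old_age = next((a for a, line in enumerate(state) if s in line), n)
    new = [set() for _ in range(n)]
    new[0] = {s}
    for a, line in enumerate(state):
        rest = line - {s}
        if a < old_age:
            if a + 1 < n:          # blocks in the oldest line drop out
                new[a + 1] |= rest
        else:
            new[a] |= rest         # blocks older than s keep their age
    return new

def body(state):
    """Loop body of the example: references to x and then y."""
    for block in ("x", "y"):
        state = must_update(state, block)
    return state

def loop_head(entry_state):
    """Fixpoint of the state at the loop start node (entry edge joined with loop edge)."""
    head = entry_state
    while True:
        new_head = must_join(entry_state, body(head))
        if new_head == head:
            return head
        head = new_head

def contains(state, s):
    return any(s in line for line in state)

before = [{"a"}, {"b"}, set(), set()]   # x is not in the cache before the loop
print(contains(loop_head(before), "x"))        # False
print(contains(loop_head(body(before)), "x"))  # True
```

Without unrolling, the join at the loop head removes `x`, so its reference cannot be classified as always hit. Analysing the first iteration separately and starting the fixpoint from its result keeps `x` in the must state for all remaining iterations.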
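Since the figure is not reproduced here, the following sketch shows one plausible shape of the transformation; it is an assumption, not the authors' exact construction. The loop body becomes a procedure that calls itself in place of the loop edge, and the original loop is replaced by a single call to that procedure.

```python
def loop_iterative(n):
    """Original form: an ordinary counting loop."""
    visited = []
    for i in range(1, n + 1):
        visited.append(i)          # loop body
    return visited

def loop_procedure(i, n, visited):
    """Transformed form: one call per iteration."""
    if i > n:                      # loop exit condition
        return visited
    visited.append(i)              # loop body
    return loop_procedure(i + 1, n, visited)  # recursive call replaces the loop edge

# Call at the original place of the loop in the program.
result = loop_procedure(1, 3, [])
print(result)  # [1, 2, 3]
```

The call from the original loop position starts the first iteration, and every later iteration is entered through the recursive call, which is what lets an interprocedural analysis tell the two apart.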
Callstring Approach
There are only a finite number of cache lines and for each program a finite number of memory blocks. This means, the domain of abstract cache states $\hat{c} : L \to 2^{S'}$ is finite. Additionally, the abstract cache update functions $\hat{U}$ and the join functions $\hat{J}$ are monotonic. This guarantees that abstract interpretations with both the callstring approach and the functional approach will terminate.

⁸ We consider here loops that correspond to the loop constructs of 'higher programming languages'. Program analysis is not restricted to this, but will produce more precise results for programs with well behaved control flow.

In the callstring approach, the high complexity of the functional approach can be circumvented. If we restrict the callstring length to 1 (callstring(1)), then for each transformed loop only two different incoming abstract cache states are considered: one for the call to the loop-procedure at the original place of the loop in the program (1) (see Figure 1) and one for the recursive call of the loop-procedure (2). The first call corresponds to the first iteration of the loop. The second call corresponds to all other iterations of the loop.
This means, we can interpret the abstract cache states $\hat{c}_f$ for the first iteration and $\hat{c}_o$ for all other iterations at a control flow node $n$ within the loop-procedure according to Table 2.

Note: For A-way set associative caches and fully associative caches the determination of 'always hit' and 'always miss' requires analysis with both $\hat{J}^{\cap}_M$ and $\hat{J}^{\cup}_M$. We call the analysis with $\hat{J}^{\cap}_M$ the must analysis because it computes all blocks that must be in the cache, and we call the analysis with $\hat{J}^{\cup}_M$ the may analysis because it computes all blocks that may be in the cache.
Functional Approach
During the analysis of a program, PAG tabulates for each procedure (and each loop that has been transformed into a procedure) all abstract cache states within the procedure for all different incoming abstract cache states. This computes the same values as if the loops had been unrolled. In the worst case, the exponential growth in program code of the loop unrolling corresponds to exponentially many different incoming abstract cache states that are tabulated during the analysis. But often there are far fewer different incoming abstract cache states than unrolled loop bodies for a deeply nested loop nest.
The functional approach gives the most detailed results for the abstract interpretation but may be very expensive.

In the current design, the work is limited to the prediction of memory references to addresses that can be determined at analysis time. This allows, for example, the prediction of instruction cache behavior.
Our analysis can also be used to predict the behavior of data caches or combined instruction/data caches for programs that use only scalar variables. For this kind of program, it is possible to compute for each data reference to a procedure parameter or a local variable the address within the procedure stack frame by a static stack level simulation [25]. For each call to a procedure, the address of the procedure stack frame depends only on a statically computable offset to the procedure stack frame of the caller.
For our abstract interpretation, we extend the function that maps control flow nodes to the list of referenced memory blocks by an argument that is the set of possible absolute stack frame addresses¹⁰. For programs without recursive procedures, there are only finitely many stack frame addresses. This guarantees termination of the abstract interpretation. With the functional approach and the callstring approach where the procedure nesting depth of the program does not exceed the callstring length, the sets of stack frame addresses for the $\hat{U}'$ and $\hat{J}'$ functions always contain exactly one element. This means there is no loss of information.
For programs with recursive procedures, the number of stack frame addresses may grow infinitely during the analysis so that the analysis does not terminate. Cousot and Cousot [5] proposed a technique called 'widening' that speeds up the analysis.
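The tabulation idea can be sketched as a table keyed by the incoming abstract cache state. This is a simplified illustration, not PAG's implementation: it ignores recursion and the fixpoint iteration a real tabulation needs, and `Tabulation`, `freeze` and `toy_effect` are hypothetical names.

```python
def freeze(state):
    """Make a list-of-sets abstract state hashable so it can key a table."""
    return tuple(frozenset(line) for line in state)

class Tabulation:
    def __init__(self, effect):
        self.effect = effect   # hypothetical abstract effect of one procedure body
        self.table = {}        # incoming state -> outgoing state
        self.evaluations = 0

    def call(self, incoming):
        key = freeze(incoming)
        if key not in self.table:          # analyse the body once per distinct input
            self.evaluations += 1
            self.table[key] = self.effect(incoming)
        return self.table[key]

def toy_effect(state):
    """Toy body: reference block 'x', ageing every other block by one line."""
    shifted = [set()] + [set(line) - {"x"} for line in state[:-1]]
    shifted[0] = {"x"}
    return shifted

tab = Tabulation(toy_effect)
s1 = [{"a"}, {"b"}, set()]
s2 = [{"b"}, {"a"}, set()]
tab.call(s1)
tab.call(s1)   # same incoming state: answered from the table
tab.call(s2)
print(tab.evaluations)  # 2
```

The cost of the approach is governed by the number of distinct incoming states, which is the quantity the paragraph above compares with the number of unrolled loop bodies.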
We use a 'widening' function $\nabla$ to restrict the number of stack frame addresses. When during the analysis the number of elements in a set of stack frame addresses exceeds a given limit $R$, $\nabla$ replaces this set by $\mathbb{N}_0$¹². This can only occur when the join function is applied.

¹⁰ This works for C-type languages where all procedures are 'global'. PASCAL-like languages with local procedures referencing local variables of other procedures can't be modeled in this way.

¹¹ This holds only for procedures of the original program. The newly introduced loop-procedures do not change the procedure stack frame address.

¹² PAG includes a 'negative' set representation, so that this operation is efficiently implemented.

- Write back: The data is written only to the cache line. The modified cache line is written to main memory only when it is replaced. This is usually implemented with a bit (called dirty bit) for each cache line that indicates if the cache line has been modified.

The execution time of a store instruction often depends on whether the memory block that is written is in the cache (write hit) or not (write miss). For the prediction of hits and misses we can use our analysis. There are two common cache organizations with respect to write misses:

- Write allocate: The block is loaded into the cache. This is generally used for write back caches.
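The widening step described above, applied when two sets of stack frame addresses are joined, can be sketched as follows. The sentinel `TOP` stands for the set of all natural numbers; the function names and the way the limit is passed are assumptions for illustration.

```python
TOP = object()  # sentinel standing for N_0, the set of all natural numbers

def widen(addresses, limit):
    """Replace a set of stack frame addresses by TOP once it exceeds the limit R."""
    if addresses is TOP or len(addresses) > limit:
        return TOP
    return addresses

def join_addresses(a, b, limit):
    """Join two address sets; widening is only applied here."""
    if a is TOP or b is TOP:
        return TOP
    return widen(a | b, limit)

small = join_addresses({0, 16}, {32}, limit=3)
large = join_addresses({0, 16, 32}, {48}, limit=3)
print(small)          # {0, 16, 32}
print(large is TOP)   # True
```

Once a set has been widened to `TOP` it stays there, so the chain of address sets at any program point has bounded length and the analysis terminates, at the price of losing all information about that frame address.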
- No write allocate: The block is not loaded into the cache. The write changes only the main memory. This is often used for write through caches.

Writes to write through/write allocate caches can be treated as reads. For no write allocate caches, the update functions have to be adapted. For A-way set associative caches (A > 1) and fully associative caches, a write access to a block $s$ is treated as a read access if $s$ is already in the concrete or abstract cache state. Otherwise, and for direct mapped caches¹³, the write access is ignored, i.e. the update function is the identity function for this case.
Write back caches write a modified line to memory when the line is replaced. The timing of a load or store instruction may depend on whether a modified or an unmodified line is replaced¹⁴. To keep track of modified cache lines, we extend the cache states by a 'dirty' bit, where $d$ means modified, $p$ means unmodified.

Let $n$ be a control flow node, $s_a$ be one read or write memory reference at $n$, $\hat{c}_1$ [...] and $s$ is an always miss in $\hat{c}_2$, then a dirty memory block has been replaced. This reference has definitely caused a write back.

- If $\{s \mid (d, s) \in \hat{c}_1(l_x)\} \neq \emptyset$ then for a WCET analysis we have to consider a possible write back.

The identified (possible) write backs can be used in another abstract interpretation similar to the cache analysis for the prediction of the write buffer behavior.

¹³ This is to preserve the interpretation of sets of one element as always hits.
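The adapted update for no write allocate caches can be sketched on a concrete LRU cache. The representation (a Python list ordered from youngest to oldest block) and both function names are assumptions; the same case distinction would apply to an abstract update function.

```python
def lru_read(cache, s, capacity):
    """Concrete LRU read on a list ordered youngest first."""
    return ([s] + [block for block in cache if block != s])[:capacity]

def no_write_allocate_write(cache, s, capacity, direct_mapped=False):
    """Write to block s in a no write allocate cache."""
    if direct_mapped or s not in cache:
        return list(cache)              # write is ignored: identity update
    return lru_read(cache, s, capacity)  # write hit: treated like a read

cache = ["a", "b", "c"]
print(no_write_allocate_write(cache, "c", 4))  # ['c', 'a', 'b']
print(no_write_allocate_write(cache, "z", 4))  # ['a', 'b', 'c']
```

A write hit refreshes the age of the block exactly as a read would, while a write miss leaves the cache state untouched because the block is not loaded.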
State of the Implementation and Future Work
The presented techniques have been validated with an ANSI-C frontend that has been interfaced to PAG. We are currently developing a PAG interface for executables based on the Wisconsin architectural research tool set (WARTS).

¹⁴ Many cache designs use write buffers that hold a limited number of blocks. Write buffers may delay a cache access when they are full or when data is referenced that is still in the buffer. To analyze the behavior of the write buffers, possible 'write backs' have to be determined.
Conclusion
We have described several semantics of programs executed on machines with several types of one level caches. Abstract interpretations based on these semantics statically analyze the intrinsic cache behavior of programs. The information computed allows interpretations such as 'always hit', 'always miss', 'first hit', 'first miss', and 'write back'. It can be used to improve execution time calculations for programs. The analyses are specified as needed by the program analyzer generator PAG.
Specification
For the sake of simplicity and space, we assume only references to fixed addresses, and we consider only direct mapped caches and the must analysis for fully associative caches:
