Abstract interpretation is a technique for the static detection of dynamic properties of programs. It is semantics based, that is, it computes approximate properties of the semantics of programs. On this basis, it allows for correctness proofs of analyses. It replaces commonly used ad hoc techniques by systematic, provable ones, and it allows the automatic generation of analyzers from specifications as in the Program Analyzer Generator, PAG. In this paper, abstract interpretation is applied to the problem of predicting the cache behavior of programs. Abstract semantics of machine programs are defined which determine the contents of caches. For interprocedural analysis, existing methods are examined and a new approach that is especially tailored for the cache analysis is presented. This allows for a static classification of the cache behavior of memory references of programs. The calculated information can be used to sharpen worst case execution time estimations. It is possible to analyze instruction, data, and combined instruction/data caches for common (re)placement and write strategies. Experimental results are presented that demonstrate the applicability of the analysis.
Cache Memories and Real-Time Applications
Caches are used to improve the access times of fast microprocessors to relatively slow main memories. They can reduce the number of cycles a processor is waiting for data by providing faster access to recently referenced regions of memory¹. Caching is used for virtually all general purpose processors, and, with increasing application sizes, it is becoming more and more relevant for high performance microcontrollers and DSPs.
Programs with hard real-time constraints have to be subjected to a schedulability analysis, e.g. by the compiler [32, 8]. This should determine whether all timing constraints can be satisfied. WCET (Worst Case Execution Time) estimations for processes have to be used for this. The degree of success for such a timing validation [31] depends on sharp WCET estimations. There are two components to the prediction of WCETs:
(i) architecture modeling, the determination of how much time it will take to execute an execution path on the target system, and (ii) program path analysis, the determination of a worst case execution path.
Here, we focus on the first point.
For hardware with caches, the typical worst case assumption is that all accesses miss the cache. This is an overly pessimistic assumption which leads to a waste of hardware resources.
Overview
In the following section we briefly sketch the underlying theory of abstract interpretation and present the program analyzer generator PAG. Cache memories are briefly described in Section 4. In Section 5 we give a semantics for programs that reflects only memory accesses (to fixed addresses) and their effects on cache memories, and we present the must analysis, which computes for all program points a set of memory blocks that must be in the cache whenever control reaches this point, and the may analysis, which computes a set of memory blocks that may be in the cache. The behavior of memory references within loops and recursive procedures can be analyzed with interprocedural analysis methods. In Section 6 existing approaches are discussed and a new approach is presented. An example is given in Section 7. Section 8 describes extensions to data and combined caches. In Section 10 we present and discuss the results of practical experiments from an implementation of the analyses, and Section 11 describes related work.
Program Analysis by Abstract Interpretation
Program analysis is a widely used technique to determine runtime properties of a given program without actually executing it. Such information is used for example in optimizing compilers [33] to enable code improving transformations. A program analyzer takes a program as input and computes some interesting properties. Most of these properties are undecidable. Hence, correctness and completeness of the computed information cannot both be achieved. Program analysis makes no compromise on the correctness side: the computed information has to be reliable, for example when it is used to enable optimizing transformations. It therefore cannot guarantee completeness. The quality of the computed information, usually called its precision, should be as good as possible.
There is a well developed theory of static program analysis called abstract interpretation [5-7]. With this theory, correctness of a program analysis can be easily derived. According to this theory a program analysis is determined by an abstract semantics. Usually, the meaning of a language is given as functions for the statements of the language computing over a concrete domain. A domain is a complete partially ordered set of values. For such a semantics, an abstract version consists of a new simpler abstract domain and simpler abstract functions which define the abstract meaning for every program statement.
For an abstract semantics and an input program, a system of recursive equations can be constructed. The variables in this system stand for the values of the abstract domain at every program point. In this equation system, the value at a program point depends on the values at all program points which can directly precede the execution of this program point. For example, the value after the exit of a loop depends on the value at the end of the loop body and on the value before the loop because it is possible that the loop is never executed. The control flow graph of a program describes every possible flow of control and therefore all dependencies between the variables of the equation system. Lattice theory underlying abstract interpretation states that the recursive equation system can be solved by fixpoint iteration if the abstract domain has only finite ascending chains, i.e., every chain of values $v_1 < v_2 < \dots$ has only finite length, and if in addition every semantic function is monotonic.
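As a minimal sketch of such a fixpoint iteration (not PAG's actual implementation), the following Python function solves the equation system with a worklist. The names `solve`, `transfer`, and `join`, the four-node example graph, and the toy domain (sets of variable names joined by intersection) are assumptions made only for illustration.

```python
from collections import deque

def solve(succ, entry, init, transfer, join):
    """Worklist fixpoint iteration over a control flow graph.

    succ:     dict mapping each node to the list of its successors
    entry:    start node of the graph
    init:     abstract value that holds before the entry node
    transfer: abstract semantic function (node, value_before) -> value_after
    join:     combines two abstract values where control flow merges
    Returns a dict with the abstract value before each reachable node.
    """
    before = {entry: init}
    work = deque([entry])
    while work:
        node = work.popleft()
        after = transfer(node, before[node])
        for nxt in succ.get(node, []):
            new = after if nxt not in before else join(before[nxt], after)
            if nxt not in before or new != before[nxt]:
                before[nxt] = new          # value changed: revisit the successor
                work.append(nxt)
    return before

# Hypothetical graph: 1 -> 2 (loop head), 2 -> 3 (body), 3 -> 2 (loop edge), 2 -> 4 (exit)
succ = {1: [2], 2: [3, 4], 3: [2], 4: []}
refs = {1: "a", 2: "b", 3: "x", 4: None}   # variable referenced at each node

def transfer(node, value):
    # the referenced variable is added to the set of "definitely seen" names
    return value | {refs[node]} if refs[node] else value

result = solve(succ, 1, frozenset(), transfer, lambda u, v: u & v)
```

In this toy run the value before the loop head (node 2) stays `{"a"}`: the name `x` arrives over the loop edge but not over the entry edge, so the intersection discards it. The iteration stops because the values can only shrink finitely often.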
The program analyzer generator PAG [1, 2] offers the possibility to generate a program analyzer from a description of the abstract domain and of the abstract semantic functions in two high level languages, one for the domains and the other for the semantic functions. Domains can be constructed inductively starting from simple domains using operators like constructing power sets and function domains. The semantic functions are described in a functional language which combines high expressiveness with efficient implementation. Additionally the user has to supply a join function combining two domain values into one. This function is applied whenever a point in the program has two (or more) possible execution predecessors.
In the case of an associative cache, a memory block has to be selected for replacement when the cache is full and the processor requests further data. This is done according to a replacement strategy. Common strategies are LRU (Least Recently Used), FIFO (First In First Out), and random.
We restrict our description to the semantics of A-way set associative caches with LRU replacement strategy. The fully associative and the direct mapped caches are special cases of the A-way set associative cache where $A = n$ and $A = 1$, respectively.
Cache Semantics
In the following, we consider an A-way set associative cache as a sequence of (fully associative) sets $F = \langle f_1, \dots, f_{n/A} \rangle$, a set $f_i$ as a sequence of set lines $L = \langle l_1, \dots, l_A \rangle$, and the store as a set of memory blocks $M = \{m_1, \dots, m_s\}$.
The function $adr : M \to \mathbb{N}_0$ gives the address of each memory block. The function $set : M \to F$ gives the set where a memory block would be stored (% denotes the modulo division):
$set(m) = f_i$ where $i = adr(m) \,\%\, (n/A) + 1$
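As a small illustration, the mapping can be written in Python as follows. The function name `set_index` and the example parameters (a cache with n = 8 lines and associativity A = 2, with block addresses counted in units of blocks) are assumptions made for this sketch.

```python
def set_index(adr: int, n: int, A: int) -> int:
    """Return the 1-based index i of the set f_i that holds the block at address adr.

    n is the total number of cache lines and A the associativity,
    so the cache consists of n / A sets.
    """
    if n % A != 0:
        raise ValueError("n must be a multiple of A")
    return adr % (n // A) + 1

# n = 8 lines, A = 2: four sets f_1 .. f_4, consecutive blocks cycle through them
indices = [set_index(adr, 8, 2) for adr in range(6)]
```

With A = n there is a single set (fully associative cache), and with A = 1 every line is its own set (direct mapped cache).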
To indicate the absence of any memory block in a set line, we introduce a new element $I$; $M' = M \cup \{I\}$.
Our cache semantics separates two key aspects:
- The set where a memory block is stored: This can statically be determined as it depends only on the address of the memory block. The dynamic distribution of memory blocks into sets is modeled with the cache states.
- The aspect of associativity and the replacement strategy within one set of the cache: Here the history of memory reference executions is relevant. This is modeled with the set states. If $s(l_x) = m$ for a concrete set state $s$, then $x$ describes the relative age of the memory block according to the LRU replacement strategy and not the physical position in the cache hardware.
The update function describes the side effects on the set (cache) of referencing the memory:
- The set where a memory block may reside in the cache is uniquely determined by the address of the memory block, i.e., the behavior of the sets is independent of each other.
- The LRU replacement strategy is modeled by using the positions of memory blocks within a set to indicate their relative age. The order of the memory blocks reflects the "history" of memory references. The most recently referenced memory block is put in the first position $l_1$ of the set. If the referenced memory block $m$ is in the set already, then all memory blocks in the set that have been more recently used than $m$ are shifted by one position to the next set line, i.e., they increase their relative age by one. If the memory block $m$ is not yet in the set, then all memory blocks in the set are shifted and the 'oldest', i.e., least recently used memory block is removed from the set.
The cache state for a path $(k_1, \dots, k_p)$ in the control flow graph is given by applying $U_C$ to the initial cache state $c_I$ that maps all set lines in all sets to $I$ and the concatenation of all sequences of memory references along the path.
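A minimal Python sketch of this LRU update for a single set, applied along a path, could look as follows. It is an illustration rather than the formal definition: the names `update_set` and `run_path` are hypothetical, `None` stands in for the element $I$, and only one set is modeled because the sets behave independently of each other.

```python
from typing import List, Optional, Sequence

Block = Optional[str]  # None plays the role of the element I (empty set line)

def update_set(s: Sequence[Block], m: str) -> List[Block]:
    """LRU update of one concrete set state; s[0] is l_1, the youngest line."""
    if m in s:
        h = s.index(m)
        # hit: blocks younger than m age by one, older blocks keep their position
        return [m] + list(s[:h]) + list(s[h + 1:])
    # miss: every block ages by one and the least recently used one is removed
    return [m] + list(s[:-1])

def run_path(refs: Sequence[str], A: int) -> List[Block]:
    """Apply the update along a path, starting from the empty set state."""
    s: List[Block] = [None] * A
    for m in refs:
        s = update_set(s, m)
    return s

state = run_path(["a", "b", "c", "a", "d"], 4)   # second reference to "a" is a hit
evicted = run_path(["a", "b", "c", "a", "d", "e"], 4)
```

After the first path the set holds `d, a, c, b` from youngest to oldest; one further miss removes `b`, the least recently used block.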
² A basic block is a sequence (of fragments) of instructions in which control flow enters at the beginning and leaves at the end without halt or possibility of branching except at the end. For our cache analysis, it is most convenient to have one memory reference per control flow node. Therefore, our nodes may represent the different fragments of machine instructions that access memory.
³ This is appropriate for instruction caches and can be too restricted for data caches and combined caches. See Section 8 for weaker restrictions.
Abstract Semantics
The domain for our abstract interpretation consists of abstract cache states that are constructed from abstract set states:
Definition 5 (abstract set state) An abstract set state $\hat{s} : L \to 2^{M'}$ maps set lines to sets of memory blocks. $\hat{S}$ denotes the set of all abstract set states.
Definition 6 (abstract cache state) An abstract cache state $\hat{c} : F \to \hat{S}$ maps sets to abstract set states. $\hat{C}$ denotes the set of all abstract cache states.
We will present two analyses. The must analysis determines a set of memory blocks that are definitely in the cache whenever control reaches a given program point. The may analysis determines all memory blocks that may be in the cache at a given program point. The latter analysis is used to guarantee the absence of a memory block in the cache.
The analyses are used to compute a categorization for each memory reference that describes its cache behavior. The categories are described in Table 1.
Definition 7 (join function) A join function $\hat{J} : \hat{C} \times \hat{C} \to \hat{C}$ combines two abstract cache states.
Must Analysis
An abstract cache state $\hat{c}$ describes a set of concrete cache states $c$, and an abstract set state $\hat{s}$ describes a set of concrete set states $s$.
To determine if a memory block is definitely in the cache we use abstract set states where the position (the relative age) of a memory block in the abstract set state $\hat{s}$ is an upper bound of the positions (the relative ages) of the memory block in the concrete set states that $\hat{s}$ represents. The address of a memory block determines the set in which it is stored. This is reflected in the abstract cache update function.
The join function for abstract set states is similar to set intersection. A memory block only stays in the abstract set state if it is in both operand abstract set states. If it has two different ages, it gets the oldest age. The join function for abstract cache states applies the join function for abstract set states to all its abstract set states.
An abstract cache state $\hat{c}$ at a control flow node $k$ is interpreted in the following way: Let $m$ be a memory block and $\hat{s} = \hat{c}(set(m))$. If $m \in \hat{s}(l_y)$ for a set line $l_y$, then $m$ is definitely in the cache every time control reaches $k$. Therefore, a reference to $m$ is categorized as always hit (ah).
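The must analysis for a single set can be sketched in Python as follows. This is one possible formulation that is consistent with the description above, not a transcription of the formal definitions: an abstract set state is represented as a list of Python sets (index 0 is $l_1$), and the names `must_update`, `must_join`, and `always_hit` are hypothetical.

```python
from typing import List, Set

AbstractSet = List[Set[str]]  # index 0 is l_1 (youngest), index A-1 is l_A

def must_update(s: AbstractSet, m: str) -> AbstractSet:
    """Abstract update for the must analysis: positions are upper bounds of ages."""
    A = len(s)
    h = next((i for i, line in enumerate(s) if m in line), None)
    if h is None:
        # m is not known to be cached: all blocks age by one, the oldest line drops out
        return [{m}] + [set(line) for line in s[:A - 1]]
    if h == 0:
        return [{m}] + [set(line) for line in s[1:]]
    new = [{m}] + [set(line) for line in s[:h - 1]]  # blocks younger than m age by one
    new.append(s[h - 1] | (s[h] - {m}))              # ... and merge into m's old line
    new += [set(line) for line in s[h + 1:]]         # older blocks keep their age
    return new

def must_join(s1: AbstractSet, s2: AbstractSet) -> AbstractSet:
    """Keep only blocks present in both states, each at its oldest age."""
    age1 = {b: i for i, line in enumerate(s1) for b in line}
    age2 = {b: i for i, line in enumerate(s2) for b in line}
    out: AbstractSet = [set() for _ in range(len(s1))]
    for b in age1.keys() & age2.keys():
        out[max(age1[b], age2[b])].add(b)
    return out

def always_hit(s: AbstractSet, m: str) -> bool:
    return any(m in line for line in s)

entry = [{"a"}, set(), set(), set()]     # state on one incoming edge
other = must_update(entry, "x")          # state on another edge, after referencing x
joined = must_join(entry, other)
```

Here `a` survives the join at its older position, while `x` is dropped because it is missing from one operand, so only a reference to `a` would be classified as always hit.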
May Analysis
To determine if a memory block is never in the cache, we compute the set of all memory blocks that may be in the cache. We use abstract set states $\hat{s}$ where the position (the relative age) of a memory block in the abstract set state is a lower bound of the positions (the relative ages) of the memory block in the concrete set states that $\hat{s}$ represents. $m_a \in \hat{s}(l_x)$ means that the memory block $m_a$ may be in the cache. The position (relative age) of a memory block $m_a$ in a set can only be changed by references to memory blocks $m_b$ with $set(m_a) = set(m_b)$, i.e., by memory references that go into the same set. Other memory references do not change the position of $m_a$. The position is also not changed by references to memory blocks $m_b \in \hat{s}(l_y)$ where $y < x$, i.e., memory blocks that are already in the cache and are "younger" than $m_a$.
If there are no memory references to $m_a$, then $m_a$ will be removed from the cache after at most $A - x + 1$ references to memory blocks that go into the same set and are not yet in the cache or are older than or the same age as $m_a$.
The may analysis has its own concretization function $conc_{\hat{C}}$, and its abstract cache update function has the same structure as the one for the must analysis.
The join function is similar to set union. If a memory block has two different ages in two abstract cache states, then the join function takes the youngest age.
An abstract cache state $\hat{c}$ at a control flow node $k$ is interpreted in the following way: Let $m$ be a memory block and $\hat{s} = \hat{c}(set(m))$. If $m$ is not in $\hat{s}(l_y)$ for any set line $l_y$, then it is definitely not in the cache whenever control reaches $k$. Therefore, a reference to $m$ is categorized as always miss (am).
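A corresponding Python sketch of the may analysis for a single set is given below. It is a hedged illustration that follows the prose description (positions are lower bounds of ages, the join keeps the youngest age); the representation as a list of Python sets and the names `may_update`, `may_join`, and `always_miss` are assumptions.

```python
from typing import Dict, List, Set

AbstractSet = List[Set[str]]  # index 0 is l_1 (youngest), index A-1 is l_A

def may_update(s: AbstractSet, m: str) -> AbstractSet:
    """Abstract update for the may analysis: positions are lower bounds of ages."""
    A = len(s)
    h = next((i for i, line in enumerate(s) if m in line), None)
    if h is None:
        # m is definitely not cached: all blocks age by one, the oldest line drops out
        return [{m}] + [set(line) for line in s[:A - 1]]
    rest = s[h] - {m}                            # blocks sharing m's lower bound
    new = [{m}] + [set(line) for line in s[:h]]  # blocks younger than m age by one
    if h + 1 < A:
        new.append(s[h + 1] | rest)              # blocks of the same age also age by one
        new += [set(line) for line in s[h + 2:]]
    return new                                   # if h was the last line, rest is evicted

def may_join(s1: AbstractSet, s2: AbstractSet) -> AbstractSet:
    """Keep every block of either state, each at its youngest age."""
    ages: Dict[str, int] = {}
    for s in (s1, s2):
        for i, line in enumerate(s):
            for b in line:
                ages[b] = min(i, ages.get(b, i))
    out: AbstractSet = [set() for _ in range(len(s1))]
    for b, i in ages.items():
        out[i].add(b)
    return out

def always_miss(s: AbstractSet, m: str) -> bool:
    return all(m not in line for line in s)

joined = may_join([{"a"}, {"b"}, set(), set()], [{"c"}, {"a"}, set(), set()])
after_b = may_update(joined, "b")
full = may_update([{"a"}, {"b"}, {"c"}, {"d"}], "e")
```

In the last call the block `d` leaves the abstract set state, so a following reference to `d` would be classified as always miss.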
Termination of the Analysis
There are only a finite number of sets and set lines and for each program a finite number of memory blocks. This means the domain of abstract cache states $\hat{c} : F \to (L \to 2^{M'})$ is finite. Hence, every ascending⁵ chain is finite. Additionally, the abstract cache update functions $\hat{U}$ and the join functions $\hat{J}$ are monotonic. This guarantees that our analysis will terminate.
Analysis of Loops and Recursive Procedures
Loops and recursive procedures are of special interest, since programs spend most of their time there. In a control flow graph, a loop is represented as a cycle. The start node of a loop⁶ has two incoming edges. One represents the entry into the loop, the other represents the control flow from the end of the loop to the beginning of the loop. The latter is called loop edge (see Figure 1). There are loops that can iterate more than once. Since the execution of the loop body usually changes the cache contents, it is useful to distinguish the first iteration from others. This could be achieved by conceptually unrolling each loop once.
In the first execution of the loop, the reference to x will be a cache miss, because x is not in the cache. In all further iterations the reference to x will be a cache hit, if the cache is sufficiently large to hold all variables referenced within the loop.
For the abstract interpretation, the join function $\hat{J}^{\cap}$ combines the abstract cache states at the start node of the loop. Since the join function is 'similar' to set intersection, the combined abstract cache state will never include the variable x, because x is not in the abstract cache state before the loop is entered. For a WCET computation for a program this is a safe approximation, but not a precise one.
Loop unrolling would overcome this problem. After the first unrolled iteration, x would be in the abstract cache state and would be classified as always hit. For our analysis of cache behavior we treat loops as procedures to be able to use existing methods for the interprocedural analysis⁷. This is done by transforming all loops into "loop-procedures" in the control flow graph according to Figure 2. This is only done for the analyses and has no influence on the program code.
In the presence of (recursive) procedures, a memory reference can be executed in different execution contexts. An execution context corresponds to a path in the call graph of the program.
The interprocedural analysis methods differ in which execution contexts are distinguished for a memory reference within a procedure. Widely used are the callstring approach and the functional approach, which have been proposed by Sharir and Pnueli [29] and are implemented in PAG.
The callstring approach limits the number of distinguished execution contexts statically. To do this the call graph is considered. The goal is not to merge information that is obtained on different paths through the graph. But in the presence of recursion, the graph is cyclic and therefore has an infinite number of paths. So only the information obtained on paths which differ in suffixes of a fixed length K is kept separated.
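The effect of the length bound K can be illustrated with a short Python sketch. The function `callstring` and the edge labels `"main->loop"` and `"loop->loop"` are hypothetical; they model a loop-procedure that is called once from the surrounding code and then calls itself recursively.

```python
from typing import Sequence, Tuple

def callstring(path: Sequence[str], k: int) -> Tuple[str, ...]:
    """Execution context under callstring(K): the suffix of length k of a call path."""
    if k < 0:
        raise ValueError("k must be non-negative")
    return tuple(path[-k:]) if k > 0 else ()

# Call paths reaching the body of a loop-procedure in iterations 1, 2, 3 and 4
paths = [["main->loop"] + ["loop->loop"] * i for i in range(4)]

# number of distinguished execution contexts for K = 0, 1, 2
contexts = {k: len({callstring(p, k) for p in paths}) for k in (0, 1, 2)}
```

With K = 0 all four paths collapse into one context, with K = 1 the first iteration is separated from all others, and with K = 2 a third context appears for the second iteration.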
In the functional approach, the number of distinguished execution contexts is not statically limited. The PAG generated analyzer tabulates all different input values and output values of the abstract domain (here: abstract cache states) for every procedure. To guarantee termination of the analysis, the abstract domain has to be finite. The functional approach computes the most precise solution.
The applicability of these approaches to the cache behavior prediction is limited:
- Callstring approach: If we restrict the callstring length K to 0 (callstring(0)), then one categorization for each memory reference in the program is computed. This is fast, but does not yield very precise information.
Callstring(1) gives better results, as it distinguishes as many different execution contexts of a memory reference in a procedure as there are calls. For each transformed loop there is one call to the loop-procedure at the original place of the loop in the program (1) (see Figure 2) and one for the recursive call of the loop-procedure (2). The first call corresponds to the first iteration of the loop. The second call corresponds to all other iterations of the loop.
Longer callstrings increase the analysis effort and lead to a more precise categorization. The gain in precision is small compared with the sharply increasing analysis cost, since many of the distinguished execution contexts are not interesting for our analysis.
- Functional approach: The dynamically distinguished execution contexts cannot be easily combined with the results of a program path analysis that determines a safe approximation to the worst case execution path. This makes a WCET estimation more difficult.
To overcome the deficiencies of the callstring(>1) and the functional approaches, we have developed the VIVU approach, which has been implemented with the mapping mechanism of PAG as described in [1]. It corresponds to callstring(1), but paths through the call graph that only differ in the number of repeated passes through a cycle are not distinguished. It can be compared with a combination of virtual inlining of all non recursive procedures⁸
The results of the callstring(0), callstring(1), and the VIVU approach are compared in Section 10.
Example
We are shown in Figure 3 . We assume that all variables are stored in pairwise di erent memory blocks. The nodes of the control ow graph are numbered 1 to 6, and each n o d e is marked with the variable it accesses. For the analysis, we assume the loop has been implicitly transformed into a loop-procedure according to Figure 2 .
Each node is marked with the abstract cache states (in the same format as in Example 1) computed by the PAG{generated analyzer immediately before the abstract cache states are updated according to the memory references. The loop entry edge is marked with the incoming abstract cache states. The loop exit edge is marked with the outgoing abstract cache states.
Data Caches and Combined Caches
Our analysis can be used to predict the behavior of data caches or combined instruction/data caches, if the addresses of referenced data can be statically computed.
Addresses of references to global data can usually be easily determined. Local variables and procedure parameters that are allocated on the stack are addressed relatively to the stack pointer or frame pointer, i.e., a register that points to a known address within the procedure frame on the execution stack.
If the values of the stack pointer or frame pointer are known, the absolute addresses of the variables and parameters can be determined by a data flow analysis [12]. For programs without recursive procedures, it is possible to determine all values of the stack or frame pointers for all procedures for the distinguished execution contexts of the cache behavior analysis.
⁹ Here, the analyses with callstring(1) yield the same results.
(Node, Variable)    first iteration    other iterations
(1,e), (2,b)        always hit         always miss
(3,c)               always miss        always hit
(4,a), (5,d)        always miss        always miss
(6,c)               always hit         always hit
To support the analysis of programs for which not all addresses of the memory references can precisely be determined, the $\hat{U}$ functions are extended to handle a set of possibly referenced memory locations¹⁰.
Since it is not definitely known which memory block is put into the cache, the update function $\hat{U}^{\cap}_{C}$ for the must analysis applied to a set of possible memory locations $\{m_1, \dots, m_x\}$ and an abstract cache state $\hat{c}$ only affects the ages of the memory blocks in $\hat{c}$ in all sets where an element of $\{m_1, \dots, m_x\}$ could be stored.
So far, we have ignored writing to a cache and only considered reading from a cache. There are two common cache organizations with respect to writing to the cache [10]:
- Write through: On a cache write the data is written to both the memory block and the corresponding set line.
- Write back: The data is written only to the set line. The modified set line is written to main memory only when it is replaced. This is usually implemented with a bit (called dirty bit) for each set line that indicates whether the set line has been modified.
The execution time of a store instruction often depends on whether the memory block that is written is in the cache (write hit) or not (write miss). For the prediction of hits and misses we can use our analysis. There are two common cache organizations with respect to write misses:
- Write allocate: The block is loaded into the cache. This is generally used for write back caches.
- No write allocate: The block is not loaded into the cache. The write changes only the main memory. This is often used for write through caches.
Writes to write through/write allocate caches can be treated as reads for the cache analysis. For no write allocate caches, a write access to a block m is treated as a read access, if m is already in the concrete or abstract cache state. Otherwise, the write access is ignored.
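The treatment of writes can be sketched for a single concrete LRU set as follows. This is a simplified illustration, not the analyzer's implementation: the names `lru_read` and `lru_access` and the flags `is_write` and `write_allocate` are hypothetical, and `None` marks an empty set line.

```python
from typing import List, Optional

SetState = List[Optional[str]]  # index 0 is the youngest line

def lru_read(s: SetState, m: str) -> SetState:
    """LRU update of one concrete set state for a read of block m."""
    if m in s:
        h = s.index(m)
        return [m] + s[:h] + s[h + 1:]
    return [m] + s[:-1]

def lru_access(s: SetState, m: str, is_write: bool, write_allocate: bool) -> SetState:
    """Effect of a read or write access to block m on the set contents."""
    if not is_write or write_allocate or m in s:
        return lru_read(s, m)   # reads, allocating writes and write hits
    return list(s)              # write miss without allocation: contents unchanged

s0 = ["a", "b", None, None]
miss_no_alloc = lru_access(s0, "c", is_write=True, write_allocate=False)
miss_alloc = lru_access(s0, "c", is_write=True, write_allocate=True)
hit_no_alloc = lru_access(s0, "b", is_write=True, write_allocate=False)
```

For the abstract analyses the same case distinction would apply, with the membership test and the update taken over abstract set states instead.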
Write back caches write a modified line to memory when the line is replaced. The timing of a load or store instruction may depend on whether a modified or an unmodified line is replaced¹¹. To keep track of modified set lines, we extend the cache states by a 'dirty' bit, i.e., we use pairs $(m, b)$ of memory blocks and dirty bits instead of memory blocks in the set/cache states, where $b = d$ means modified, $b = p$ means unmodified. The identified (possible) write backs can be used in another abstract interpretation similar to the cache analysis for the prediction of the write buffer behavior.
Practical Experiments
For reasons of simplicity, we have restricted our practical experiments to the analysis of instruction caches.
The cache analysis techniques are implemented in a PAG generated analyzer that gets as input the control flow graph of a program and an instruction cache description and produces a categorization $cat$ of the instruction/context pairs of the input program. A context represents the execution stack, i.e., the function calls and loops along the corresponding path in the call graph. It is represented as a sequence.
Additionally, we compute for every instruction/context pair $ic$ with $cat(ic) = nc$ the set of competing instructions, i.e., the instructions that are in the same fully associative set in the abstract cache state of the may analysis. For instance, if the competing instructions reside in fewer than A (= level of associativity) memory blocks, then all executions of the instruction will result in at most one cache miss. Generally, an upper bound of the number of cache misses of the instruction is given by one plus the maximal number of possible sequences of length A of executions of competing instructions that are stored in pairwise disjoint memory blocks. Determining the bound is a nontrivial problem. We use simple heuristics to compute a safe approximation to the upper bound.
Our experiments have been performed for the Sun SPARC architecture. The Sun SPARC is a RISC architecture with pipelined instruction execution. It has a uniform instruction size of four bytes. The front end to the analyzer reads a Sun SPARC executable in a.out format. Our implementation is based on the EEL library [13] of the Wisconsin Architectural Research Tool Set (WARTS). EEL (Executable Editing Library) is a C++ library for building tools to analyze and modify an executable (compiled) program. It hides system-specific detail (like the executable file format) and allows editing linked executables, not just object files.
The objective of our work is to improve the WCET estimation of programs on computer systems with caches. The execution time of a program depends on the program path, i.e., the sequence of instructions that are executed and their individual execution times. But the program path is usually dependent on the program input and cannot generally be determined in advance. Therefore, a program path analysis is part of a WCET analysis [27, 17, 14, 15]. For example, with the help of user annotations, like maximal iteration counts of loops, an architecture dependent worst case execution profile can be determined that gives a conservative approximation to the worst case execution path.
The program path analysis can be very accurate. Yau-Tsun Steven Li and Sharad Malik report that their estimated bounds are within two percent of the (calculated) worst case bounds for their set of benchmark examples [14]. The worst case execution profile makes it possible to compute how often each instruction/context pair is maximally encountered. Combined with the categorizations of our cache analysis, the overall number of cache hits and cache misses can be estimated (see Figure 4).
In our experiments, we have circumvented the program path analysis problem and combine the categorizations $cat$ with "exact" execution profiles instead of worst case execution profiles (see Figure 4). This allows us to assess the effectiveness of our analysis without the influence of possibly pessimistic path analyses. The profilers that produce the profiles are generated with the help of qpt2 (Quick program Profiler and Tracer) that is part of the WARTS distribution. A profiler for a program computes an execution profile $profile$, i.e., the execution counts for the instruction/context pairs.
$profile : IC \to \mathbb{N}_0$
For the experiments we use parts of the program suites of Frank Müller [3, 22], the djpeg and fdct program of Yau-Tsun Steven Li [16], and some additional programs (see Table 2). For some programs, there exists a worst case input, so that our execution profiles are worst case execution profiles. The programs are compiled with the GNU C compiler version 2. The programs fft, stats and lloops use arithmetic library functions. These functions are more or less structured into treatment of special cases, normalization, computation, and final rounding. Not all parts are necessarily executed when the function is called. This uncertain execution path typically leads to relatively many occurrences of nc in our categorizations.
The executable of lloops consists of more than 100 loops in deeply nested loop nests. This program structure leads to a very high number of distinguished execution contexts with the VIVU approach.
The AVL tree as implemented in avl2 is a height balanced binary tree. Every insert or delete operation may lead to a series of recursive calls for rebalancing. The code of the insert and delete operations consists of many cases for the different rebalancing operations called rotations. Such a program structure seems to be rather typical for the handling of many dynamic data structures.
Table 3 shows the distribution of ah, am, and nc in the categorizations for the test programs for callstring(0), callstring(1), and VIVU for one selected cache configuration. The sum of ah, am, and nc in the categorizations is the number of distinguished instruction/context pairs. It is a measure for the complexity of the analysis. In our current implementation, the categorization for a given cache configuration can be computed within seconds on a SUN SPARCstation 20 for most of our test programs, but the computation for lloops with VIVU requires about 7 minutes. In our implementation, there is room for improvements, though.
To give a more expressive presentation of the results of our experiments than bounds on cache hit ratios, we assume an idealized hardware that executes all instructions that result in an instruction cache hit in one cycle and all instructions that result in an instruction cache miss in 10 cycles¹³.
The cache behavior of the test programs for different cache configurations is computed by simulating the cache for the program trace. The cache simulation is always started with the empty cache, and we assume uninterrupted execution. For technical reasons, instructions in functions from dynamic link libraries¹⁴ are not traced and their effects on the cache are therefore ignored. From the number of hits and misses in the trace we compute the execution time ET of our idealized hardware.
With our categorization an upper and a lower bound of the execution time can be computed by combining the profiles with the results of our analysis. An upper bound of the execution time is given if we count all instructions in the profile as misses that cannot be determined from the categorization as cache hits. A lower bound of the execution time is given if we count all instructions in the profile as hits that cannot be determined from the categorization as cache misses. The upper and lower bounds of the test programs for various cache configurations are shown in Figures 5 and 6 in percent of the execution time ET (the meaning of the x axis tic marks is given in Table 4). Figures 5 and 6 can be interpreted as follows:
- The VIVU approach generally leads to the most precise predictions.
- Conditionally executed code, e.g. as found in the arithmetic library functions or in avl2, can lead to less precise predictions which result from many nc in the categorizations.
- There can be a wide variation of the quality of the prediction depending on the cache configuration.
- For all test programs our method (especially with VIVU) gives much better results than the naive method that counts all memory references as misses for a WCET estimation, and as hits for a BCET estimation.
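The computation of the two bounds described above can be sketched in Python for the idealized hardware (one cycle per hit, 10 cycles per miss). The function name `execution_time_bounds`, the dictionary representation of $profile$ and $cat$, and the sample data are assumptions made for illustration; the additional bound on misses derived from competing instructions is not used here.

```python
HIT_CYCLES = 1
MISS_CYCLES = 10

def execution_time_bounds(profile, cat):
    """Return (lower, upper) bounds of the execution time in cycles.

    profile: dict mapping instruction/context pairs to execution counts
    cat:     dict mapping the same pairs to 'ah', 'am' or 'nc'
    """
    lower = upper = 0
    for ic, count in profile.items():
        c = cat[ic]
        # upper bound: everything not known to hit is counted as a miss
        upper += count * (HIT_CYCLES if c == "ah" else MISS_CYCLES)
        # lower bound: everything not known to miss is counted as a hit
        lower += count * (MISS_CYCLES if c == "am" else HIT_CYCLES)
    return lower, upper

# Hypothetical profile: one instruction in two loop contexts, one unclassified instruction
profile = {("i1", "first"): 1, ("i1", "other"): 9, ("i2", "first"): 10}
cat = {("i1", "first"): "am", ("i1", "other"): "ah", ("i2", "first"): "nc"}
lower, upper = execution_time_bounds(profile, cat)
```

The gap between the two bounds comes only from the pair categorized as nc, which is counted as ten hits in the lower bound and as ten misses in the upper bound.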
Related Work
The computation of WCETs for real-time programs is an ongoing research activity. P ark and Shaw 24] describe a method to derive W CETs from the structure of programs. In 27], Puschner and Koza propose methods to guide the computation of WCETs by user annotations like maximal loop counts. This approach seems to be commonly used in WCET analysis tools. Both approaches do not take c a c he behavior into account.
The possibilities of using optimizing compilers to improve the cache performance of programs have been studied extensively [18, 19, 25, 26, 34]. But all the proposed An overview of 'Cache Issues in Real-Time Systems' is given in [4]. We restrict our examination here to the intrinsic cache behavior.
The work of Arnold, Müller, Whalley, and Harmon has been one of the starting points of our work. [22, 20] describe a data flow analysis for the prediction of the instruction cache behavior of programs for direct mapped caches. The extension to set associative instruction caches was later given in [21]. Two data flow analyses are used. The result of the first corresponds to the result of our may analysis. The second is only required for set associative caches for the categorization of instructions within loops. It corresponds to the first analysis with the loop back edges deleted from the control flow graph. In contrast to our method, which derives semantics based categorizations of memory references only from the results of our analyses, an additional complex bottom-up algorithm over the control flow graph is used to compute a classification of the instructions for each loop level. The distinction between a first and a further execution of a loop is not explicit but is expressed by the classifications first miss and first hit. For a set of small programs, the same or slightly worse upper bounds of the execution time than our results are reported in [21]. But the assessment is difficult as the environment for the experiments is not the same, e.g., different compilers have been used to compile the test programs.
In [15, 16], Yau-Tsun Steven Li, Sharad Malik, and Andrew Wolfe describe an integrated method to determine the worst case execution path of a program and to model architecture features like instruction caches and/or pipelines. The problem of finding an accurate worst case execution time bound is formulated as an integer linear program that must be solved, which is an NP-hard problem. This approach has been implemented in the cinderella tool. Unlike the method described in [22] or our method, which rely only on the control flow graph to determine the cache behavior of a memory reference, user provided functionality constraints can be used to describe the control flow more precisely. For direct mapped instruction caches and programs whose execution path is well defined and not very input dependent, the predictions can be computed fast and are very accurate [16]. Increasing levels of associativity, where the cache behavior of one memory reference depends on more other references, and less well defined execution paths lead to prohibitively high analysis times.
In [17], Lim et al. describe a general framework for the computation of WCETs of programs in the presence of pipelines and cache memories. Two kinds of pipeline and cache state information are associated with every program construct for which timing equations can be formulated. One describes the pipeline and cache state when the program construct is finished. The other can be combined with the state information from the previous construct to refine the WCET computation for that program construct. Unlike our method, which is based on well explored theories and tools for abstract interpretation, the set of timing equations must be explicitly solved. An approximation to the solution for the set of timing equations has been proposed. The usage of an input and output state provides a way to modularize the timing analysis. Experimental results are reported for three small programs, but they cannot be easily compared with our experiments.
The approach of Lim et al. has also been applied to data caches. In [11], Hur et al. treat references to unknown addresses as two cache misses. The reported results are worse than the ones without data cache analysis, where one assumes one cache miss for every data reference. But the authors expect that the results will improve with better methods to resolve addresses of data references. For loops that reference only data that fit entirely into the cache, Kim et al. [12] have improved the approach based on the pigeonhole principle. Applied to the cache analysis, the pigeonhole principle says: if we have n memory references to m memory locations, n > m, and all referenced memory blocks fit into the cache, then there must inevitably be some cache hits.
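The pigeonhole argument can be made concrete with a small sketch. It is an illustration of the counting argument only, not the analysis of Kim et al.; the function names are hypothetical, and the sketch assumes that a block, once loaded, is never evicted because all referenced blocks fit into the cache.

```python
def min_guaranteed_hits(num_references, num_locations):
    """Lower bound on cache hits for n references to at most m locations.

    Assumption: no referenced block is ever evicted, so each location
    causes at most one miss and at least n - m references must hit.
    """
    return max(0, num_references - num_locations)


def hits_without_eviction(trace):
    """Exact hit count for a trace of locations when nothing is evicted."""
    # Only the first reference to each distinct location misses.
    return len(trace) - len(set(trace))


if __name__ == "__main__":
    trace = ["a", "b", "a", "c", "b", "a"]  # n = 6 references, m = 3 locations
    print(min_guaranteed_hits(len(trace), 3), hits_without_eviction(trace))
```

The bound is conservative: locations that share a memory block, or a trace that touches fewer than m locations, can only produce additional hits.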
A method for data cache analysis by graph coloring is described in [23, 28]. Similar to the Chow-Hennessy register allocator, variables are allocated to cache lines. The objective of the analysis is to show that throughout the live range of a cache line, no other memory access interferes with this particular cache line. This approach has limited success even for small programs.
Conclusion and Future Work
We have described semantics based analysis methods by abstract interpretation that allow the prediction of the intrinsic cache behavior of programs for various types of one level caches. The theory of abstract interpretation supports the correctness proofs for the analysis and provides efficient implementation methods.
The analyzers are generated by the program analyzer generator PAG from very concise specifications. It is possible to trade time for precision, but even with the VIVU approach our implementation of the analyses is quite fast. No special input from a skilled user is required to tune for acceptable results. This makes it feasible to use our analyses as part of the compilation process to support the automatic schedulability analysis by the compiler.
The applicability of our methods has been shown with the results of our practical experiments. The newly developed VIVU approach makes it possible to predict the cache behavior within tight bounds for many programs and cache configurations.
We directly analyze executables, and no special compilers or linkers are required. Our current implementation supports the SPARC architecture. Other architectures can be supported by supplying additional front ends to our analyzers. The analyses are extensible to accommodate further cache designs like multilevel caches or wrap around line fill.
Future work includes the integration of our tool with a program path analysis. We are working on an extension to predict the pipeline behavior of processors. The pipeline analyzers will be generated from a description similar to the specifications used for the generation of code schedulers. For the analysis of array references, there exist methods based on data dependency analysis which should be combined with our approach. Finally, we will explore methods that allow the separate analyses of modules, libraries, or operating system calls to be combined and thereby support the modularization of the analysis.
