The use of caches poses a difficult tradeoff for architects of real-time systems. While caches provide significant performance advantages, they have also been viewed as inherently unpredictable since the behavior of a cache reference depends upon the history of the previous references. The use of caches will only be suitable for real-time systems if a reasonably tight bound on the performance of programs using cache memory can be predicted. This paper describes an approach for bounding the worst-case instruction cache performance of large code segments. First, a new method called Static Cache Simulation is used to analyze a program's control flow to statically categorize the caching behavior of each instruction. A timing analyzer, which uses the categorization information, then estimates the worst-case instruction cache performance for each loop and function in the program.
Introduction
Caches present a dilemma for architects of real-time systems. The use of cache memory in the context of real-time systems introduces a potentially high level of unpredictability. An instruction's execution time can vary greatly depending on whether the instruction causes a cache hit or miss. Whether or not a particular reference is in cache depends on the program's previous dynamic behavior. As a result, it has been common practice to simply disable the cache for sections of code where predictability is required [1]. Unfortunately, even the use of other architectural features, such as a prefetch buffer, cannot approach the effectiveness of using a cache. Furthermore, as processor speeds continue to increase faster than the speed of accessing memory, the performance advantage of using cache memory becomes more significant. Thus, the performance penalty for not using cache memory in real-time applications will continue to increase.
Bounding instruction cache performance for real-time applications may be quite beneficial. The use of instruction caches has a greater impact on performance than the use of data caches. In addition, code generated for RISC machines often results in four times more instruction references than data references [2]. There also tends to be greater locality for instruction references than data references, resulting in higher hit ratios for instruction cache performance. Unlike many data references, the address of each instruction remains the same during a program's execution. Thus, instruction caching behavior should be inherently more predictable than data caching behavior.

*This work was supported in part by the Office of Naval Research under contract number N00014-94-1-0006.

Marion Harmon
Comp. & Info. Sys. Dept., Florida A&M Univ.
Tallahassee, FL 32307-3101, e-mail: harmon@cis.fmu.edu, phone: (904) 599-3042
This paper shows that with certain restrictions it is possible to predict much of the instruction caching behavior of a program. Let a task be the portion of code executed between two scheduling points (context switches) in a system with a non-preemptive scheduling paradigm. When a task starts execution, the cache memory is assumed to be invalidated. During task execution, instructions are brought into cache and often result in many hits and misses that can be predicted statically.
Figure 1 depicts an overview of the approach described in this paper for bounding instruction cache performance of large code segments. Control-flow information, which could have also been obtained by analyzing assembly or object files, is stored as the side effect of the compilation of a file. The control-flow information is passed to a static cache simulator. It constructs the control-flow graph of the program that consists of the call graph and the control flow of each function. The program control-flow graph is then analyzed for a given cache configuration and a categorization of each instruction's potential caching behavior is produced. Next, a timing analyzer uses the instruction caching categorizations along with the control-flow information provided by the compiler to estimate the worst-case instruction caching performance for each loop within the program. A user is then allowed to request the instruction cache performance bounds for any function or loop within the program.
Related work
Several tools to predict the execution time of programs have been designed for real-time systems. The analysis has been performed at the level of source code [3], intermediate code [4], and machine code [5]. Only the last tool attempted to estimate the effect of instruction caching [7]. While this approach may have some slight benefit for a few tasks, the performance of the remaining tasks will be significantly decreased. Part of their rationale was that if a task could not entirely fit in cache, then the worst-case execution would be the same as an uncached system since cache hits could not be guaranteed. It will be shown later that a high percentage of instruction cache hits for such programs can be guaranteed and that the worst-case performance is significantly better than a comparable system with a disabled cache.
There have been attempts to improve the performance and predictability of accessing memory for real-time systems by architectural modifications. For instance, Kirk described a system that relied on the ability to segment cache memory into a number of dedicated partitions, each of which can only be accessed by a dedicated task [8].
But this approach introduced new problems that included lower hit ratios due to the partitioning and an increased complexity of scheduling analysis by introducing another resource (cache partitioning) in the allocation process. Lee et al. suggested prefetching instructions in the direction that improves the worst-case execution time [9]. The justification for using their approach was that "it is very difficult, if not impossible, to determine the worst-case execution path and, therefore, the worst-case execution time of a task" when instruction caching is employed. Their analysis measured a 45% improvement of the predicted worst-case time as compared to no prefetching (and no instruction cache). This improvement is probably quite optimistic since bus contention was not taken into consideration (contention between instruction prefetching, data access, and thread prefetching). Furthermore, mispredicted branches may result in an uninterruptible block fetch along the wrong path that cannot be aborted. This misprediction penalty may now cause worst-case behavior along the (previously) shorter path. It will be shown later in this paper that much better worst-case performance predictions can be made in the presence of instruction caching than with just a prefetch buffer.
Static cache simulation
The method of static cache simulation is used to statically categorize the caching behavior of each instruction for a given program/task with a specific cache configuration. The static simulation consists of three phases. First, a program control-flow graph is constructed. Next, this graph is analyzed to determine the possible program lines that can be in cache at the entry and exit of each basic block within the program. Finally, the control-flow analysis information is used to categorize the caching behavior of each instruction. The following subsections give a brief overview of the static simulator. A more formal approach can be found elsewhere [10], [11].

In Figure 2, instruction a is the first instruction that can be executed within the program line x in the outer loop. Instruction b is the first instruction that can be executed within the program line y in the inner loop. Assume program lines x and y are the only two lines that map to cache line c and there are no conditional transfers of control within the two loops. In other words, instructions a and b will always be executed on each iteration of the outer and inner loops, respectively.

How should instruction b be classified? With respect to the inner loop, instruction b will not be in cache when referenced on the first iteration, but will be in cache when referenced on the remaining iterations. This situation can be ascertained by the static cache simulator since it can determine that there are no other program lines within the inner loop that conflict with program line y and the abstract cache state at the exit point of the basic block preceding the inner loop does not contain program line y. With respect to the outer loop, instruction b will always cause a miss on each iteration since it will not be in cache as the outer loop initially enters the inner loop.
The caching behavior of instruction a in Figure 2 can also be predicted, assuming that the instruction that physically precedes the outer loop has to be executed immediately before the loop is entered. In this situation the first reference to instruction a will be a hit. All subsequent references to instruction a will be misses. This situation can be ascertained by the static simulator since it can determine that no other conflicting program lines can be accessed before instruction a is referenced for the first time and program line x will never be in cache on transitions back to the loop header.
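The behavior described for instructions a and b can be checked with a small concrete simulation. The following is a minimal Python sketch, not the static simulator itself: the cache size, the program line numbers chosen for x and y, the iteration counts, and the function names are all assumptions made for illustration. The trace starts with one reference to line x to model the instruction that physically precedes the outer loop.

```python
def simulate(trace, num_cache_lines):
    """Replay (label, program_line) references through a direct-mapped cache.

    Returns a list of (label, was_hit) pairs in reference order.
    """
    cache = [None] * num_cache_lines
    outcome = []
    for label, line in trace:
        slot = line % num_cache_lines      # direct mapping: one slot per line
        outcome.append((label, cache[slot] == line))
        cache[slot] = line                 # the referenced line replaces the old one
    return outcome


def figure2_trace(x, y, outer_iters, inner_iters):
    """Reference trace for two nested loops with no conditional control flow."""
    trace = [("pre", x)]                   # instruction preceding the outer loop
    for _ in range(outer_iters):
        trace.append(("a", x))             # instruction a, program line x
        trace.extend([("b", y)] * inner_iters)  # instruction b, program line y
    return trace


# Hypothetical geometry: 4 cache lines, so program lines 1 and 5 conflict.
outcome = simulate(figure2_trace(x=1, y=5, outer_iters=3, inner_iters=4), 4)
a_hits = [hit for label, hit in outcome if label == "a"]
b_hits = [hit for label, hit in outcome if label == "b"]
print(a_hits)  # [True, False, False]
print(b_hits)
```

Under these assumptions, a hits only on its first reference, and b misses on the first iteration of every entry to the inner loop and hits afterwards, which matches the classifications discussed above.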
Constructing the control-flow graph
The static cache simulator will produce a classification for each loop level in which an instruction is contained. 
Implementation of static cache simulation
The iterative algorithm in Figure 3 was used to calculate the abstract cache states. Each basic block has an input and output state of program lines that can potentially be in cache at that point. Initially, the top block's input state (the entry block of the main function) is set to all invalid lines. The input state of a block is calculated by taking the union of the output states of its immediate predecessors. The output state of a block is calculated by

A simple example will be used to illustrate the approach for bounding instruction cache performance. Figure 4 contains C code for a simple toy program that finds the largest value in an array. Figure 5 shows the actual SPARC assembly instructions generated for this program within a control-flow graph of basic blocks.
Note the immediate successor of a block with a call is the first block in that instance of the called function. Assume there are 4 cache lines and the line size is 16 bytes (4 SPARC instructions). Block 8a corresponds to the first instance of value() called from block 2 and block 8b corresponds to the second instance of value() called from block 4. The instruction categorizations are given to the right of each instruction. For instructions that were not categorized as always being a hit or always a miss for each loop level, a categorization of each loop level is given, proceeding left to right from the innermost to the outermost loop. Note that a function is considered a loop with a single iteration.

Two passes are required to calculate the input and output states of the blocks, given that the blocks are processed in the order shown in Figure 6. Pass 3 results in no more changes. After determining the input states of all the blocks, each instruction is categorized according to the criteria specified in the previous section. By examining the input states of each block, one can make observations that may not be detected by a naive inspection of only physically contiguous sequences of references. For instance, the static cache simulator determined that the last instruction in block 6 will always be in cache (an always hit) due to spatial locality. It also determined that the first instruction in block 8b will always be in cache (an always hit) due to temporal locality. It detected that the first instruction of block 3 and the second instruction of block 8 will never be in cache (always misses) since the program lines associated with the two instructions map to the same cache line and the execution of block 8 always precedes block 3.
The static cache simulator was also able to predict the caching behavior of instructions that could not be classified as always being a hit or always a miss. It determined that the second instruction in block 3 will miss on its first reference and all subsequent references will be hits. Since the first instruction in block 5 and first instruction in block 6 are both classified as first misses and they are in the same program line, then only one miss will occur associated with both instructions during the program execution.
Finally, the first instruction in block 2 will always be in cache on its first reference and may or may not be in cache on subsequent references depending on whether the second call to value() is executed. Thus, in the worst case the instruction is viewed as a first hit.
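The iterative calculation of abstract cache states described above can be sketched in Python. This is an illustration under stated assumptions, not the algorithm of Figure 3: the input rule (union of predecessor output states) comes from the text, but the output rule used here is an assumption, namely that each program line referenced by a block replaces every state entry that maps to the same cache line. The block numbers, program lines, and function names are hypothetical.

```python
def transfer(state, lines, num_cache_lines):
    """Assumed output rule: each referenced line evicts its conflicting entries."""
    out = set(state)
    for line in lines:
        slot = line % num_cache_lines
        out = {entry for entry in out if entry[0] != slot}
        out.add((slot, line))
    return out


def abstract_cache_states(blocks, preds, entry, num_cache_lines):
    """Iterate to a fixed point over per-block input and output states.

    blocks: {block: [program lines referenced, in order]}
    preds:  {block: [immediate predecessor blocks]}
    A state is a set of (cache_slot, program_line) pairs; program_line is
    None for an invalid line.
    """
    invalid = {(slot, None) for slot in range(num_cache_lines)}
    inp = {b: set() for b in blocks}
    out = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for b, lines in blocks.items():
            # Input state: union of the predecessors' output states; the
            # entry block additionally starts with all lines invalid.
            new_in = set(invalid) if b == entry else set()
            for p in preds.get(b, []):
                new_in |= out[p]
            new_out = transfer(new_in, lines, num_cache_lines)
            if new_in != inp[b] or new_out != out[b]:
                inp[b], out[b] = new_in, new_out
                changed = True
    return inp, out


# Hypothetical CFG: 1 -> 2 -> 3 -> 2 (loop), 3 -> 4, with 4 cache lines.
# Program lines 1 (block 2) and 5 (block 3) map to the same cache line.
blocks = {1: [0], 2: [1], 3: [5], 4: [2]}
preds = {2: [1, 3], 3: [2], 4: [3]}
inp, out = abstract_cache_states(blocks, preds, entry=1, num_cache_lines=4)
print(sorted(inp[2], key=str))
```

In this example, program line 1 never appears in the input state of block 2 and program line 5 never appears in the input state of block 3, so a categorizer working from these states could mark both references as always misses.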
The current implementation of the static simulator imposes some restrictions. First, only direct-mapped caches are supported.¹ Second, recursive programs are not allowed since cycles in the call graph would complicate the generation of unique function instances.² Finally, indirect calls are not handled since an explicit call graph must be generated statically.
Timing analysis
The goal of this research is to allow a user to acquire the most accurate bounds on instruction caching performance of code segments that can be obtained in a reasonable amount of time. After the static cache simulator has produced the instruction categorizations, the user will be queried for a maximum number of iterations for each loop that the compiler could not determine statically. Next, a timing analysis tree is constructed and the worst-case instruction cache performance is estimated for each loop in the tree. Once this initial timing analysis has been completed, the timing analyzer accepts timing requests for either functions or loops.³
¹ Recent studies have shown that direct-mapped caches typically have a faster access time for hits, which outweighs the benefit of a higher hit ratio in set-associative organizations for large caches [13].
² While cycles in a call graph can be detected, they are also difficult to describe to a user and it is difficult for the user to estimate the maximum number of recursive iterations that will be performed.
³ Work is currently progressing on processing timing requests for ranges of source lines within a single iteration of a loop.
Constructing the timing analysis tree
A timing analysis tree is constructed to simplify the process of predicting the worst-case times. Each node within the tree is considered a natural loop.⁴ The outer level of each function instance is treated as a loop that will iterate only once when entered.
The timing analyzer next determines the set of possible paths through each loop. A path is a sequence of unique blocks in the loop connected by control-flow transitions.
Each path starts with the loop header and is terminated by a block with a backedge or a transition to an exit block outside the loop. Figure 7 shows a simple example that identifies a loop header, backedges, exit blocks, continue paths, and exit paths. Each path is designated as either a continue path (the last block is the head of a backedge transition), an exit path (the last block has a transition to an exit block outside the loop), or both. Thus, each path corresponds to a possible sequence of blocks that could be executed during a single loop iteration. The number of loop iterations indicates the number of times the header of the loop is executed once the loop is entered. If a path within a loop enters a child loop, then the entire child loop is represented as a single block along that path. Associated with each loop is a set of exit blocks, which indicates the possible blocks outside the loop that can be reached from the last block in each exit path. Thus, the possible paths within non-leaf loops that contain child loops can also be calculated. Figure 8 shows some of the information in the timing analysis tree for the program in Figure 5. Within each loop node the maximum number of iterations is indicated.
To the right of each loop node are the possible paths through the loop. Blocks representing a child loop in a path are denoted by having a dashed line boundary. In this example all paths can both continue and exit. The worst-case instruction cache performance is given adjacent to each loop node. The calculation of these results is described in the next section.

⁴ A natural loop is a loop with a single entry block. While the static simulator can process unnatural loops, the timing analyzer is restricted to only analyzing natural loops since it would be difficult for both the timing analyzer and the user to determine the set of possible blocks associated with a single iteration in an unnatural loop. It should be noted that unnatural loops occur quite infrequently.
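The path definitions used by the timing analyzer (paths start at the loop header, visit unique blocks, and end at a block with a backedge or an exit transition) can be sketched as a depth-first enumeration. The Python below is an illustrative sketch only; the loop shape, block numbers, and the function name `loop_paths` are hypothetical and are not taken from Figure 7 or Figure 8.

```python
def loop_paths(header, loop_blocks, succs):
    """Enumerate the paths through a natural loop.

    Returns {path_tuple: kinds}, where kinds is a subset of
    {"continue", "exit"}: "continue" if the last block has a backedge to
    the header, "exit" if it has a transition to a block outside the loop.
    """
    paths = {}

    def walk(path):
        block = path[-1]
        for succ in succs.get(block, []):
            if succ == header:
                paths.setdefault(tuple(path), set()).add("continue")
            elif succ not in loop_blocks:
                paths.setdefault(tuple(path), set()).add("exit")
            elif succ not in path:         # a path is a sequence of unique blocks
                walk(path + [succ])

    walk([header])
    return paths


# Hypothetical loop: header 2, body blocks 3 and 4, exit block 5.
succs = {2: [3, 4], 3: [4], 4: [2, 5]}
paths = loop_paths(header=2, loop_blocks={2, 3, 4}, succs=succs)
for path, kinds in sorted(paths.items()):
    print(path, sorted(kinds))
```

In this small loop both paths end at block 4, which has a backedge and an exit transition, so each path is both a continue path and an exit path. A child loop would appear in `succs` as a single block, as described above.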
Loop analysis
The loops in the timing analysis tree are processed in a bottom-up manner. In other words, the worst-case time for a loop is not calculated until the times for all of its immediate child loops are known. There will be a worst-case time calculated that corresponds to each exit block. Thus, when the timing analyzer is calculating the worst-case time for a path containing a child loop, it uses the child loop times associated with the exit block of the child loop that is the next block along the path. For instance, the time associated with the loop in Figure 7 exiting to block 5 would be less than the time exiting to block 7 since block 6 would not be executed on the last iteration.
Let n be the maximum number of iterations associated with a loop. The algorithm for estimating the worst-case time for the loop is as follows:
(1) Calculate the maximum time required to execute any continue path assuming that all first misses are counted as hits and first hits are counted as misses. Set the number of calculated iterations to 0.
(2) Go to step 6 if the number of calculated iterations is n - 1.
(3) Calculate the maximum time required to execute any continue path in the current iteration, where each instruction classified as a first miss and not yet encountered is counted as a miss and all first hits are counted as misses.
(4) Go to step 6 if the time calculated in step 3 is equal to the time calculated in step 1.
(5) Add the maximum time calculated in step 3 to the total worst-case time for the loop. If this is the first iteration, subtract the difference between a miss and a hit from the total worst-case time for each first hit in the loop. Denote which first misses will now be counted as hits. Add one to the number of calculated iterations. Go to step 2.
(6) Add (n - 1 - number of calculated iterations) * (time from step 1) to the total worst-case time for the loop.
(7) Calculate the times for all exit paths within the loop for the last iteration. For each set of exit paths that have a transition to a unique exit block, add the longest time within that set to the time calculated in step 6 to produce a total worst-case time associated with that exit block for the loop.
The algorithm terminates when the number of calculated iterations reaches n - 1. The algorithm can terminate earlier if the maximum time required to execute any continue path is equal to the maximum time required to execute a continue path where all first misses are treated as hits. In fact, the upper bound on the number of times that step 3 has to be processed is m + 1, where m is the number of paths in the loop. Each path will have its first misses treated as misses at most once. After all first misses are eliminated, the next maximum path found would be equal to the value calculated in step 1.
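The seven steps can be followed literally in a short Python sketch for a leaf loop. This is an illustration under assumptions, not the authors' timing analyzer: the representation of a path as (program line, category) pairs, the cycle costs of 1 for a hit and 10 for a miss, the function names, and the example paths are all hypothetical. Child loops and the adjustments between loop levels are not modeled.

```python
HIT, MISS = 1, 10  # assumed cycle costs: 1-cycle hit, 10-cycle miss


def path_time(path, resolved):
    """Cycles for one pass over a path of (program_line, category) pairs.

    Categories: "h" always hit, "m" always miss, "fm" first miss,
    "fh" first hit. `resolved` holds first-miss lines already charged as
    misses; None means every first miss counts as a hit (step 1).
    First hits are charged as misses, as steps 1 and 3 require.
    """
    seen = None if resolved is None else set(resolved)
    total = 0
    for line, cat in path:
        if cat == "h":
            total += HIT
        elif cat == "fm" and (seen is None or line in seen):
            total += HIT
        else:
            total += MISS
            if cat == "fm":
                seen.add(line)             # later references to this line hit
    return total


def worst_case_loop(continue_paths, exit_paths, n):
    """Return {exit_block: worst-case cycles} for at most n iterations."""
    steady = max((path_time(p, None) for p in continue_paths), default=0)  # step 1
    first_hits = {line for p in continue_paths for line, cat in p if cat == "fh"}
    total, done, resolved = 0, 0, set()
    while done < n - 1:                                                    # step 2
        worst = max(continue_paths, key=lambda p: path_time(p, resolved))  # step 3
        t = path_time(worst, resolved)
        if t == steady:                                                    # step 4
            break
        total += t                                                         # step 5
        if done == 0:
            total -= (MISS - HIT) * len(first_hits)
        resolved |= {line for line, cat in worst if cat == "fm"}
        done += 1
    total += (n - 1 - done) * steady                                       # step 6
    return {blk: total + max(path_time(p, resolved) for p in paths)        # step 7
            for blk, paths in exit_paths.items()}


# Hypothetical loop with two paths, each both a continue and an exit path.
A = [(1, "fm"), (2, "m"), (5, "m")]
B = [(3, "fm"), (4, "m"), (6, "m")]
result = worst_case_loop([A, B], {"exit": [A, B]}, n=10)
print(result)  # {'exit': 228}
```

With two paths, step 3 runs three times here (m + 1): two iterations cost 30 cycles while each path's first miss is charged, and the remaining eight iterations cost the step 1 value of 21 cycles.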
The algorithm selects the longest path on each iteration of the loop. In order to demonstrate the correctness of the algorithm, one must show that no other path for a given iteration of the loop will produce a longer worst-case time than that calculated by the algorithm. The calculation of a worst-case time associated with a path simply requires summing the times associated with each of the instructions in the path. The time used for each instruction depends on whether it is assumed to be a hit or miss, which depends on its categorization. The cache hit time is one cycle on most machines. The cache miss time is the cache hit time plus the miss penalty, which is the time required to access main memory. All categorizations are treated identically on repeated references, except for first misses and first hits. Assuming that the instructions have been categorized correctly for each loop, it remains to be shown that first misses and first hits are interpreted appropriately for a given iteration of the loop.
A first hit implies that the instruction will be a hit on its first reference after the loop is entered and all subsequent references to the instruction during the execution of the loop will be misses. The definition the authors used for a first hit requires that the instruction be within every path of the loop. Thus, the first path chosen for step 3 will encounter every first hit in the loop. After the first iteration, first hits are treated as misses.
A first miss implies that the instruction will be a miss on its first reference after the loop is entered and all subsequent references will be hits.
Step 3 indicates that an instruction classified as a first miss will be counted as a miss only the first time it is encountered.
Once the maximum time of the current iteration is equal to the time calculated in step 1 (where all first misses are treated as hits), then this value is replicated for all remaining iterations, except for the last one. Once there are no more first misses encountered for the first time (and the first iteration has encountered all first hits), then the worst-case cache performance for a path will not change since the instructions within a path will always be treated the same. The last iteration is treated separately in step 7. The longest exit path for a loop may be shorter than the longest continue path. By examining the exit paths separately, a tighter estimate can be obtained. Thus, the algorithm estimates a bound that is at least as great as the actual worst-case bound.
The timing of a non-leaf loop is accomplished using this algorithm and the times from its immediate child loops. Whenever a path in a non-leaf loop contains a child loop, then the time associated with that child loop will be used in the calculation of the path time. The transition of a categorization from the child loop level to the current loop level will be used to determine if any adjustment to the child loop time is required. These transitions between categorizations and appropriate adjustments are given in Table 2. The fm=>fm adjustment is necessary since there should be only one miss associated with the instruction⁵ and a miss should only occur the first time the child loop is entered. The m=>fh adjustment is necessary since the first reference will be a hit.

To illustrate the use of the worst-case algorithm, the calculation of the worst-case instruction cache performance for the example shown in Figures 4, 5, 6, and 8 will be described. The worst-case performance results for each loop in the timing analysis tree are shown in Figure 8. Since a loop cannot be timed until its immediate child loops are processed, the two function instances of value will be processed first, followed by loop 1 in main, and finally the function main. For loops with just a single iteration, only step 7 in the worst-case algorithm contributes to the calculated performance of that loop.

⁵ Note that additional work was required when the number of distinct paths containing first misses to different program lines exceeds the number of loop iterations. This situation can commonly occur within functions. A maximum adjustment value was used to compensate in an efficient manner for the remaining loop iterations.
The worst-case performance for the example is calculated in the following manner. The leaf loops of the timing analysis tree are the two instances of the function value and are processed first. The worst-case instruction cache performances of value (a) and value (b) are {2 misses, 3 hits} and {1 miss, 4 hits}, respectively. For loop 1 in main, step 1 of the algorithm calculates a cache performance of {4 misses, 18 hits} given that all first misses are treated as hits and first hits are treated as misses. This result was obtained from {2 misses, 10 hits} from instructions directly in loop 1 and {1 miss, 4 hits} from both of the invoked function instances of value.
Note that the time obtained from the first function instance of value was adjusted as described in Table 2 (fm => fm). The result found for the first iteration in step 3 is {6 misses, 16 hits}, which was obtained by adding {3 misses, 9 hits} from instructions directly in loop 1, {2 misses, 3 hits} from value (a), and {1 miss, 4 hits} from value (b). The next result calculated in step 3 is equal to the result from step 1. By applying step 6, 8*{4 misses, 18 hits} will be used to represent the performance of the next 8 iterations. Since both paths through the loop are exit paths, the worst-case time for the exit paths calculated in step 7 is the same as the result in step 1. Thus, the total worst-case performance for loop 1 in main is {42 misses, 178 hits} ({6+9*4 misses, 16+9*18 hits}). The loop representing the entire function main only iterates once and is calculated in step 7. The worst-case instruction cache performance for the entire program is
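The totals quoted for loop 1 can be re-derived mechanically from the per-component counts given above. The short Python check below is only a restatement of that arithmetic; the helper names are hypothetical, and the conversion to cycles at the end assumes a 1-cycle hit and a 10-cycle miss, which is an assumption layered on top of the counts.

```python
def add(*parts):
    """Component-wise sum of (misses, hits) pairs."""
    return tuple(sum(column) for column in zip(*parts))


def scale(k, perf):
    """Multiply a (misses, hits) pair by an iteration count."""
    return tuple(k * value for value in perf)


# Step 1: first misses as hits; both instances of value contribute (1, 4).
step1 = add((2, 10), (1, 4), (1, 4))
# Step 3, first iteration: loop 1 directly, value (a), value (b).
first_iteration = add((3, 9), (2, 3), (1, 4))
# One step 3 iteration, 8 replicated iterations (step 6), last iteration (step 7).
loop1_total = add(first_iteration, scale(8, step1), step1)

HIT_CYCLES, MISS_CYCLES = 1, 10   # assumed costs, not part of the counts above
cycles = loop1_total[0] * MISS_CYCLES + loop1_total[1] * HIT_CYCLES

print(step1)            # (4, 18)
print(first_iteration)  # (6, 16)
print(loop1_total)      # (42, 178)
print(cycles)           # 598
```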
Effectiveness of the timing analyzer
To assess the effectiveness of the timing analyzer, six simple programs were selected. Des (Data Encryption Standard) encrypts and decrypts 64 bits. Matmul multiplies two 50x50 matrices. Matsum determines the sum of the nonnegative values in a 100x100 matrix. Matcnt is a variation of Matsum since it also counts the number of elements that were summed. Sort uses the bubblesort algorithm to sort 500 numbers into ascending order. The final program, Stats, calculates the sum, mean, variance, and standard deviation for two arrays of numbers and the linear correlation coefficient between the two arrays.
These programs and the results of evaluating these pro- . Column 4 shows the number of cycles estimated by the timing analyzer. Column 5 shows the ratio of the predicted worst-case instruction cache performance using the timing analyzer in column 4 to the observed worst-case performance in column 3. Column 6 shows a similar ratio assuming a disabled cache. This naive prediction simply determines the maximum number of instructions that could be executed and assumes that each instruction reference requires a memory fetch of ten cycles (miss time).

The Matcnt program not only determines the sum of the nonnegative elements (like the Matsum program), but also determines the number of nonnegative and negative elements in the matrix. Thus, there was an if-then-else construct used in the code to either add a nonnegative value to a sum and increment a counter for the number of nonnegative elements or just increment a counter for the negative elements. The adding of the nonnegative value to a sum was accomplished in a separate function. This function was placed in a location that would conflict with the program line containing the code to increment a counter for the negative elements. Multiple executions of the then path, which includes the call to the function to perform the addition, still required more cycles than alternating between the two paths. Yet, the algorithm for estimating the worst-case performance assumed that the first reference to a program line within a path would always be a miss if there were accesses to any other conflicting program lines within the same loop (see Table 1). This assumption simplified the algorithm since the effect of all combinations of paths does not have to be calculated and an exponential time complexity was avoided. Thus, one reference was counted repeatedly as a miss instead of a hit. This path was executed 10,000 times and this accounted for a 90,000 cycle [10,000*miss penalty] or 9% overestimation.
Note that the execution of this single path accounted for 43.56% of the total instructions referenced during the execution of the program.
The analysis of the final two programs, Des and Sort, depicts problems faced by all timing analyzers. The timing analyzer did not accurately determine the worst-case paths in a function within Des primarily due to data dependencies. A longer path could not be taken in a function due to a variable's value in an if statement. The Sort program contains an inner loop whose number of iterations depends on the counter of an outer loop. At this point the timing tool either automatically receives the maximum loop iterations from the control-flow information produced by the compiler or requests a maximum number of iterations from the user. Yet, the tool would need a sequence of values representing the number of iterations for each invocation of the inner loop. The number of iterations performed was overrepresented on average by a factor of two for this specific loop. This inaccuracy accounted for the overestimation in both the estimated and naive ratios since most of the cycles for the program were produced within this loop. Note that both of these problems have nothing to do with cache predictability.
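The factor-of-two overestimate for Sort can be illustrated with a quick count. The Python sketch below assumes a textbook bubblesort in which the inner loop runs n - 1 - i times on the i-th pass of the outer loop; the actual loop structure of the Sort program is not given in the text, so both this shape and the function name are assumptions.

```python
def bubble_inner_iterations(n):
    """Total inner-loop iterations for a textbook bubblesort of n items."""
    return sum(n - 1 - i for i in range(n - 1))


n = 500
actual = bubble_inner_iterations(n)   # inner bound shrinks on every outer pass
assumed = (n - 1) * (n - 1)           # one fixed maximum used for every pass
ratio = assumed / actual

print(actual)   # 124750
print(assumed)  # 249001
print(ratio)
```

Under these assumptions a single maximum iteration count per loop charges almost twice as many inner iterations as can actually occur, which is consistent with the average factor of two reported above and is independent of the cache analysis.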
Processing user timing requests
Once the timing analyzer has calculated a worst-case time for each loop in the timing analysis tree, the user can request specific timing information about portions of the program. The user first specifies the name of a function. The user is then presented with the set of loops that are within the function. Each loop is identified by its loop nesting level within the function and the source line numbers it spans. The user can choose to obtain a worst-case performance for the entire function or select a loop. Since there may be more than one instance of a function within the timing analysis tree, the timing analyzer will determine the worst-case times from all function instances associated with the user request.
Future work
We have designed and partially implemented an algorithm to estimate the best-case instruction cache performance for each loop within a program. A naive best-case estimation, which assumes all instructions along the shortest paths will be hits, will be much closer to the observed best-case performance since locality within programs causes most instruction references to be hits. We expect that the estimated best-case performance can be as tightly predicted as the estimated worst-case performance.
We are exploring methods to predict the timing of other architectural features associated with RISC processors. Work is currently ongoing that uses a micro-analysis technique [5] to predict pipeline performance for the MicroSPARC I. The effect of data caching is also an area that we are pursuing. Unlike instruction caching, many of the addresses of references to data can change during the execution of a program. Thus, obtaining reasonably tight bounds for worst-case and best-case data cache performance is significantly more challenging. However, many of the data references are known. For instance, static or global data references retain the same addresses during the execution of a program. Due to the analysis of a function instance tree (no recursion allowed), addresses of run-time stack references can be statically determined as well. Compiler flow analysis can be used to detect the pattern of many calculated references, such as indexing through an array. While the benefits of using a data cache for real-time systems will probably not be as significant as using an instruction cache, its effect on performance should still be substantial.
Conclusions
Predicting the worst-case execution time of a program on a processor that uses cache memory has long been considered an intractable problem [1], [7], [SI. This paper has presented a technique for predicting worst-case instruction cache performance in two steps. It has been demonstrated that instruction cache behavior is sufficiently predictable for real-time applications. Thus, instruction caches should be enabled, yielding a speedup of four to nine for the predicted worst case as compared to disabled caches (depending on the hit ratio and miss penalty). This speedup is a considerable improvement over prior work, such as requiring special architectural modifications for prefetching, which only results in a speedup factor of 2 [9]. As processor speeds continue to increase faster than the speed of accessing memory, the performance benefits for using cache memory in real-time systems will only increase.
