Abstract
Introduction
Many architectural features, such as pipelines and caches, present a dilemma for architects of real-time systems. Use of these architectural features can result in significant performance improvements.
In order to exploit these performance improvements in a real-time system, the WCET (Worst Case Execution Time) must be predicted statically. In addition, sometimes the BCET (Best Case Execution Time) is also needed. However, the aforementioned performance-enhancing features introduce a potentially high level of unpredictability. Dependencies between instructions can cause pipeline hazards that may delay the completion of instructions. While much work has been accomplished on analyzing the execution performance of a sequence of instructions within a basic block, the analysis of pipeline performance across basic blocks is more problematic. Instruction or data cache misses further complicate the performance prediction problem since they require several more cycles to resolve than cache hits. Predicting the caching behavior of an instruction is even more difficult since it may be affected by memory references that occurred long before the instruction was executed.
The timing analysis of these features is further complicated since pipelining and caching behavior are not independent. For instance, consider the code segment and pipeline diagram in Figure 1 consisting of three SPARC instructions. The pipeline cycles and stages represent the execution of these instructions on a MicroSPARC I processor [1]. Each number within the pipeline diagram denotes that the specified instruction is currently in the pipeline stage shown on the left and is in that stage during the cycle indicated above. The first instruction performs a floating-point addition and requires a total of 20 cycles. Fetching the second instruction results in a cache miss, which is assumed to have a miss penalty of nine additional cycles in this paper. The third instruction has a data dependency with the first instruction and the execution of its MEM stage is delayed until the floating-point addition is completed [footnote 1]. The miss penalty associated with the access to main memory to fetch the second instruction is completely overlapped with the execution of the floating-point addition in the first instruction.
If pipeline stalls and cache misses were treated independently, then the number of estimated cycles associated with these instructions would be increased from 22 to 31 (i.e. by the cache miss penalty).
Unfortunately, the problem of overestimating WCET and underestimating BCET may become more severe in the future. Cache miss penalties are increasing due to the growing gap between processor and main memory speeds. Delays due to pipeline stalls become more likely with the introduction of superscalar and superpipelined architectures. Thus, naive timing analysis of programs on machines with pipelines and caches will result in increased execution time prediction errors.
Let us define a task as the portion of code executed between two scheduling points (context switches) in a system with a non-preemptive scheduling paradigm. When a task starts execution, the cache memory is assumed to be invalidated. During task execution, instructions are brought into cache and often result in many hits and misses that can be predicted statically. These caching predictions can be integrated with pipeline analysis to estimate tight WCET and BCET bounds. Figure 2 depicts an overview of the approach described in this paper for bounding the worst and best-case performance of large code segments on machines with pipelines and instruction caches.
Control-flow information, which could have been obtained by analyzing assembly or object files, is stored as a side effect of the compilation. This information identifies the loops that are in each function, the basic blocks that comprise each loop, the instructions that reside in each basic block, and the register operands associated with each instruction. The control-flow information is passed to a static cache simulator, which analyzes it for a given cache configuration to produce a categorization of each instruction's potential caching behavior.
Footnote 1: A std instruction has no write back stage since a store instruction only updates memory and not a register. The std instruction also requires three cycles to complete the MEM stage on the MicroSPARC I.
The timing analyzer uses these categorizations to determine whether an instruction fetch should be treated as a hit or a miss during the pipeline analysis. It also reads machine-dependent and control-flow information to determine how each instruction proceeds through the pipeline. The timing analyzer produces a worst and best-case estimate of execution time for each loop and function within the program. Finally, a window-based interface is used to allow the user to request the timing bounds for portions of the program.
Instruction Caching Categorization
Static cache simulation [footnote 2] is used to statically categorize each instruction in a given program according to its caching behavior under a specific cache configuration. The static simulation consists of three phases. First, the control-flow graph of the entire program is constructed. This graph includes the control-flow information of each function and a function instance graph, which is simply a call graph where each function instance is uniquely identified by the sequence of call sites required for its invocation. Thus, a directed acyclic call graph (without recursion) is transformed into a tree of function instances.
Footnote 2: Static cache simulation is only briefly introduced in this section. It is described in more detail elsewhere [2, 3, 4, 5, 6].
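The expansion of an acyclic call graph into a tree of function instances can be sketched as follows. This is a minimal illustration rather than the simulator's implementation: the name `function_instances`, the dictionary layout of the call graph, and the call-site labels are assumptions made for the example.

```python
def function_instances(call_graph, root="main"):
    """Expand an acyclic call graph into a list of function instances.

    call_graph maps a function name to a list of (call_site, callee) pairs.
    Each instance is identified by the tuple of call sites leading to it.
    (Hypothetical representation, chosen only for this sketch.)
    """
    instances = []

    def expand(func, sites, active):
        # A function already on the current call chain means recursion,
        # which the described approach does not allow.
        if func in active:
            raise ValueError(f"recursion detected at {func}")
        instances.append((func, sites))
        for site, callee in call_graph.get(func, []):
            expand(callee, sites + (site,), active | {func})

    expand(root, (), frozenset())
    return instances


# Function c is called from both a and b, so it yields two instances.
graph = {
    "main": [("main:1", "a"), ("main:2", "b")],
    "a": [("a:1", "c")],
    "b": [("b:1", "c")],
}
for func, sites in function_instances(graph):
    print(func, sites)
```

A function reached through two different call chains appears twice in the result, which is what allows its instructions to receive a separate categorization per instance.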
Next, this program control-flow graph is analyzed to determine the program lines that may be in cache at the entry and exit of each basic block within the program. The iterative algorithm in Figure 3 is used for this analysis.
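One way to compute which program lines may be in cache at block entry and exit is an iterative data-flow fixpoint. The sketch below is an assumption-laden illustration and is not claimed to reproduce the algorithm of Figure 3: the function name, the block and predecessor dictionaries, and the mapping of a program line to cache line `line % num_cache_lines` (a direct-mapped cache) are all chosen for the example.

```python
def abstract_cache_states(blocks, preds, num_cache_lines):
    """Compute sets of program lines that MAY be cached at block entry/exit.

    blocks: block name -> ordered list of program line numbers it references.
    preds:  block name -> list of predecessor block names.
    Assumes a direct-mapped cache where program line l maps to cache line
    l % num_cache_lines (an assumption of this sketch).
    """
    in_state = {b: set() for b in blocks}
    out_state = {b: set() for b in blocks}
    changed = True
    while changed:                      # iterate until a fixpoint is reached
        changed = False
        for b, lines in blocks.items():
            new_in = set()
            for p in preds.get(b, []):  # entry state = union over predecessors
                new_in |= out_state[p]
            new_out = set(new_in)
            for line in lines:
                # A reference replaces any line mapped to the same cache line.
                idx = line % num_cache_lines
                new_out = {l for l in new_out if l % num_cache_lines != idx}
                new_out.add(line)
            if new_in != in_state[b] or new_out != out_state[b]:
                in_state[b], out_state[b] = new_in, new_out
                changed = True
    return in_state, out_state


# A loop (header, body) in which program line 5 conflicts with line 1.
blocks = {"entry": [0], "header": [1], "body": [2, 5], "exit": [3]}
preds = {"header": ["entry", "body"], "body": ["header"], "exit": ["header"]}
in_state, out_state = abstract_cache_states(blocks, preds, num_cache_lines=4)
print(in_state["header"], out_state["body"])
```

In this example line 1 is absent from the exit state of the loop body because line 5 replaces it on every iteration, which is the kind of conflict the categorizations depend on.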
Table 1. Worst-case instruction categorizations.
always miss: The instruction is not guaranteed to be in cache when it is referenced.
always hit: The instruction is guaranteed to always be in cache when it is referenced.
first miss: The instruction is not guaranteed to be in cache on its first reference each time the loop is executed, but is guaranteed to be in cache on subsequent references.
first hit: The instruction is guaranteed to be in cache on its first reference each time the loop is executed, but is not guaranteed to be in cache on subsequent references.
Table 2. Best-case instruction categorizations.
always miss: The instruction is guaranteed to not be in cache when it is referenced.
always hit: It is possible that the instruction is in cache every time it is referenced.
first miss: The instruction is guaranteed to not be in cache on its first reference each time the loop is executed, but may be in cache on subsequent references.
first hit: The instruction may be in cache on its first reference each time the loop is executed, but is guaranteed to not be in cache on subsequent references.
In the worst case, an instruction in program line L is categorized as a first hit when the following conditions hold:
(1) The instruction is the first reference to L in the block, and L is in the abstract cache state.
(2) There exists a program line in the abstract cache state for this loop that conflicts with L.
(3) L is in the abstract output cache state of all preheaders [footnote 3] of this loop.
(4) None of the conflicting lines is in the abstract output cache state of the preheaders of this loop. The purpose of this stipulation is to guarantee that the instruction will be a hit in cache on the first iteration of the loop, in accord with the definition of first hit in Table 1.
(5) L is in the post dominator of the loop's headers, i.e. the current line will be referenced during each loop iteration [footnote 4].
(6) None of the conflicting lines is in the linear cache state of the current block, i.e. for each loop iteration, the current line will be referenced before any conflicting line. This requirement guarantees that L can only be replaced by a conflicting line after the instruction has been referenced at least once.
An instruction is a first miss if it is not already categorized as an always hit or first hit, the instruction was a first miss at the next deeper loop nesting level (if that level exists), it is the first instruction encountered in L in the block and L is in the abstract cache state, and there exist conflicting program lines but only outside the current loop nesting level. In all other cases, the instruction is conservatively categorized as an always miss.
The instruction's best-case categorization is determined as follows. The instruction is categorized as an always miss if it is the first reference to L in the block and L is not in the abstract cache state.
Footnote 3: The loop header of a natural loop is the single basic block in which the loop is initially entered. The preheader is the basic block that precedes the header.
Footnote 4: Note that an instruction does not have to be referenced during each loop iteration to be classified as a first miss.
The instruction is categorized as a first miss if it was a first miss or always hit at the next deeper loop nesting level (if that level exists), this instruction is the first reference to L in the block, L is in the abstract cache state, and L is not in the linear cache state of the block [footnote 4]. The instruction is categorized as a first hit if it was a first hit for the previous (deeper) loop nesting levels or if the following conditions (1)-(5) hold:
(1) The instruction is the first reference to L in the block, and L is in the abstract cache state.
(3) L is in the abstract output cache state of all preheaders of this loop.
(4) L is in the post dominator of the loop's headers, i.e. the current line will be referenced during each loop iteration.
(5) L is not in the abstract cache state preceding any of the back edges, i.e. L is replaced by a conflicting line during each loop iteration. The purpose of this requirement is to guarantee that the program line conflicting with L will be encountered on every iteration after the first. Thus, the instruction will be a cache miss on these iterations, in agreement with the definition of first hit in Table 2.
In all other cases, the instruction is conservatively categorized as an always hit. Formal definitions of these instruction categorizations are given in the appendix.
The current implementation of the static simulator imposes some restrictions. First, only direct-mapped cache configurations are allowed [footnote 6]. Second, recursive programs are not allowed since cycles in the call graph would complicate the generation of unique function instances [footnote 7]. Finally, indirect calls are not handled since an explicit call graph must be generated.
Footnote 6: Recent studies have shown that direct-mapped caches often have a faster access time for hits, which sometimes outweighs the benefit of a higher hit ratio in set-associative organizations for large caches [7]. We are currently investigating the timing analysis of set-associative caches.
Footnote 7: While cycles in a call graph can be detected, they are also difficult to describe to a user and it is difficult for the user to estimate the maximum number of recursive iterations that will be performed.
Pipeline Path Analysis
This section describes how the analysis of the pipeline performance of a sequence of instructions is accomplished. Information for all levels of timing analysis is stored in data structures as depicted in Figure 4. First, information about each type of instruction is read from a machine-dependent data file.
This pipeline information for each type of instruction includes the worst and best-case number of cycles required by each stage of the pipeline for its execution [footnote 8]. The analyzer also reads from the machine-dependent data file other information for each instruction. This information includes the latest stage in which each source operand of an instruction can receive its value via hardware forwarding without causing a pipeline stall and the earliest stage in which the result of the instruction can be forwarded.
Finally, information about the specific instructions in the sequence is obtained and stored in instances of struct inst_node. This information includes the actual registers associated with the source and destination operands, which is obtained from the control-flow information generated by the compiler, and the instruction caching categorization of each instruction, which is produced by the static cache simulator.
A path of instructions consists of all the instructions that can be executed during a single iteration of a loop (or in the case of a function, all the instructions that are executed in one invocation of the function). Thus, a path consists of a sequence of basic blocks connected by control-flow transitions.
If a loop has no conditional control flow (e.g. if or switch statements), then there will be only one path associated with this loop.
During the analysis of a path, the analyzer stores path information in instances of struct path_node. This information includes the total number of cycles required by the path and a set of pipeline information that records when each pipeline stage was first and last used within the path for avoiding structural hazards [footnote 9]. It is represented as the number of cycles from the beginning and end of the path for each pipeline stage. In addition, information indicating when each register was first and last used in the path is also maintained to avoid data hazards [footnote 10]. Again, this information is represented as the number of cycles from the beginning and end of the path for each register. The set of pipeline information, as stored in path->wc_pipeline_information, for avoiding hazards after the three instructions in Figure 1 have been analyzed is shown in Tables 3 and 4.
Tables 3 and 4 (pipeline hazard information for the instructions in Figure 1).
Footnote 8: The number of cycles required for some floating-point instructions on processors can vary depending upon the values of their operands.
Footnote 9: A structural hazard indicates that a stage of an instruction cannot be executed earlier due to the pipeline stage already being used.
Footnote 10: A data hazard indicates that a particular stage of an instruction cannot be executed earlier due to the pipeline stage using a source register that matches the destination register not yet updated by a pipeline stage of another instruction.
This set of pipeline information is created by processing one instruction at a time from the sequence of instructions that comprise a path. Figure 5 depicts an algorithm that creates this pipeline information for worst-case analysis. The best-case path analysis algorithm is analogous. Each instruction can be represented by the same form of pipeline information that is shown in Tables 3 and 4 for a path. This information is modified if it is found that the instruction's caching categorization indicates that the instruction fetch was a miss. The miss penalty is used to increment the total number of cycles and the cycles from the beginning (structural hazard information) for all other stages besides the IF stage and the first needed registers (data hazard information) for that instruction. The addition of an instruction to the pipeline information for a path will not only update the total number of cycles and the information associated with the end of the pipeline, but also the beginning of the pipeline if a referenced stage or register in the instruction had not been previously used.
void Time_Path(struct path_node *path)
{
   struct block_node *block;
   struct inst_node *instruction;

   path->wc_pipeline_information = NULL.
   FOR each block in path->block_list DO
      FOR each instruction in block->inst_list DO
         IF (instruction->cat_list->wc_cat == first miss AND
             this instruction has not been encountered already) OR
            (instruction->cat_list->wc_cat == first hit AND
             this instruction has been encountered already) OR
            instruction->cat_list->wc_cat == always miss THEN
            Treat this instruction fetch as a miss in the pipeline.
         ELSE
            Treat this instruction fetch as a hit in the pipeline.
         Concatenate w.c. pipeline information for instruction->inst_type
            with path->wc_pipeline_information.
      END FOR
   END FOR
   path->wcet = temporal length of path->wc_pipeline_information.
}
Figure 5. Worst-Case Path Analysis Algorithm.
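The hit or miss decision made for each fetch can be sketched in isolation. The example below follows the worst-case category definitions (a first miss misses only on its first reference, a first hit hits only on its first reference) and simply sums fetch cycles, ignoring all pipeline overlap. The function names, the tuple layout of a path, and the one-cycle hit time are assumptions of this sketch; the nine-cycle miss penalty is the value assumed in the paper.

```python
HIT_CYCLES = 1     # assumed hit time for this sketch
MISS_PENALTY = 9   # miss penalty assumed in the paper


def fetch_is_miss_wc(category, seen_before):
    """Worst-case treatment of one instruction fetch."""
    if category == "always miss":
        return True
    if category == "always hit":
        return False
    if category == "first miss":
        return not seen_before   # miss only on the first reference
    if category == "first hit":
        return seen_before       # hit only on the first reference
    raise ValueError(f"unknown category: {category}")


def fetch_cycles(path, encountered):
    """Sum fetch cycles for one traversal of a path.

    path is a list of (instruction id, category) pairs; encountered is the
    set of instruction ids already referenced and is updated in place.
    """
    total = 0
    for inst_id, category in path:
        miss = fetch_is_miss_wc(category, inst_id in encountered)
        total += HIT_CYCLES + (MISS_PENALTY if miss else 0)
        encountered.add(inst_id)
    return total


path = [(1, "always hit"), (2, "first miss"), (3, "first miss"), (4, "always miss")]
seen = set()
print(fetch_cycles(path, seen))  # first traversal: both first misses miss
print(fetch_cycles(path, seen))  # second traversal: only the always miss misses
```

Because the set of encountered instructions changes between traversals, the same path yields different fetch times on different iterations, which is why a path may need more than one pipeline analysis.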
Retaining this set of pipeline information allows additions to the beginning or end of a path. Since both the pipeline requirements for a path and a single instruction can be represented with this set of pipeline information, concatenating two paths together can be accomplished in the same manner as concatenating an instruction onto the end of a path. The concatenation is accomplished one stage at a time. A stage from the second set of pipeline information is moved to the earliest cycle that does not violate any of the following conditions.
(1) There is no structural hazard with another instruction. For instance, the beginning of the IF stage of instruction 2 in Figure 1 could not be placed in cycle 1 since that stage was already occupied.
(2) There is no data hazard due to a previous instruction producing a result that is needed by a source operand of the current instruction in that stage. For example, the beginning of the MEM stage for instruction 3 in Figure 1 could not be moved past the FEX stage of instruction 1 at cycle 19 due to the data hazard between the faddd and std instructions.
Thus, the amount of pipeline information associated with a path is dramatically reduced as opposed to storing how each stage is used during every cycle. Furthermore, no limit need be imposed on the amount of potential overlap when concatenating the analysis of two paths.
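The two placement conditions can be illustrated with a small in-order pipeline model that places each stage at the earliest cycle free of structural and data hazards. This is a simplified stand-in, not the paper's begin/end summary representation: the stage names, latencies, register name, and dictionary layout are hypothetical and do not correspond to the MicroSPARC I or to Figure 1.

```python
def schedule(instructions):
    """Return the total cycles needed by a sequence of instructions.

    Each instruction is a dict with
      "stages":   ordered list of (stage name, cycles),
      "needs":    optional {stage: [registers needed when the stage starts]},
      "produces": optional {stage: register available when the stage ends}.
    (Hypothetical layout used only for this sketch.)
    """
    stage_free = {}  # stage -> first cycle at which the stage is free again
    reg_ready = {}   # register -> first cycle at which its value is available
    total = 0
    for inst in instructions:
        t = 0
        prev_stage = None
        for stage, cycles in inst["stages"]:
            # Condition (1): wait until the stage is no longer occupied.
            start = max(t, stage_free.get(stage, 0))
            # Condition (2): wait until every needed register is available.
            for reg in inst.get("needs", {}).get(stage, ()):
                start = max(start, reg_ready.get(reg, 0))
            # A stalled instruction keeps occupying its previous stage.
            if prev_stage is not None:
                stage_free[prev_stage] = start
            t = start + cycles
            if stage in inst.get("produces", {}):
                reg_ready[inst["produces"][stage]] = t
            prev_stage = stage
        stage_free[prev_stage] = t
        total = max(total, t)
    return total


def make(if_cycles, ex_cycles, needs=None, produces=None):
    stages = [("IF", if_cycles), ("ID", 1), ("EX", ex_cycles), ("MEM", 1), ("WB", 1)]
    return {"stages": stages, "needs": needs or {}, "produces": produces or {}}


fadd = make(1, 12, produces={"EX": "f2"})          # long-running producer
use_hit = make(1, 1, needs={"EX": ["f2"]})         # consumer, fetch hits
use_miss = make(1 + 9, 1, needs={"EX": ["f2"]})    # consumer, fetch misses
print(schedule([fadd, use_hit]), schedule([fadd, use_miss]))
```

With these made-up latencies the miss penalty of the second instruction is hidden entirely behind the long EX stage of the first, so both sequences need the same number of cycles, whereas adding the penalty independently would overestimate by nine cycles.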
Loop Analysis
In order to predict the worst-case execution time of a loop, the timing analyzer has to predict the execution time of each possible path within the loop. The static cache simulator provides categorizations for each instruction. The timing analyzer will reserve either one cycle or the number of cycles ...
Each path starts with the loop header and is terminated by a block with a back edge [footnote 11] or a transition to an exit block outside the loop. Figure 6 shows a simple example that identifies a loop header, back edges, exit blocks, continue paths, and exit paths. Each path is designated as either a continue path (the last block is the head of a back edge transition), an exit path (the last block has a transition to an exit block outside the loop), or both. The number of loop iterations indicates the number of times the header of the loop is executed once the loop is entered.
Footnote 11: A back edge is a control-flow transition from a basic block in a loop to its loop header.
Alternation between the paths will produce the worst-case execution time since there will be a structural hazard between the two floating-point additions.
To avoid the problem of calculating all combinations of paths, which would be the only method for obtaining perfectly accurate estimations, it was decided to union the pipeline effects of the paths for a single iteration of a loop together. A union, an instance of struct union_node in Figure 4, is dynamically allocated for each path and loop. Calculating the union of the beginning pipeline structural hazard information for a given stage in the WCET analysis is accomplished by determining the earliest initial occupation of that stage by any path in the union. Likewise, we calculate the WCET union of the ending pipeline structural hazard information for a given stage by finding the last occupation of that stage, relative to the last cycle of the longest path, by any path in the union.
The BCET unioning of pipeline information is accomplished in an analogous manner. The beginning (ending) pipeline structural hazard information for each stage is updated to contain the latest initial (earliest final) occupation of that stage. If a path does not use a particular stage, then the BCET union will record that stage as empty. The data hazard information is handled similarly with the earliest and latest use of each register from the paths in the union being updated.
This unioning of pipeline information simplified the algorithm and also did not cause a noticeable overestimation or underestimation in the worst or best-case analysis, respectively. The beginning pipeline information (stages and registers) is rarely affected since all paths through a loop start with the same loop header block. Paths through a loop often end with the same block of instructions. In addition, one path may be significantly longer or shorter than the others, so the ending pipeline information for worst and best-case analysis is often not affected.
Figure 7 shows a toy function and its corresponding SPARC assembly code [footnote 12]. There are two paths within the loop.
Figure 7. Example C Source Code and Corresponding SPARC Instructions.
Footnote 12: Note that the generated assembly code has been optimized by the compiler. The local variables i, count, and dcount have been allocated to registers %o2, %o1, and %f2, respectively. The instruction following each transfer of control takes effect before the transfer of control is taken since the SPARC has delayed branches. The cmp comparison preceding the bge branch (instruction 7) has been moved to both immediately precede the loop and to the delay slot (instruction 16) of the bl branch (instruction 15). Branches with a ",a" suffix indicate that the result of the instruction within the delay slot will be annulled if the branch is not taken.
Figure 8 shows the instructions and the corresponding pipeline diagrams for the two paths within the loop [footnote 13]. To simplify the example, it is assumed that the loop has already been executed and all of the instructions and data are in cache (i.e. there are no instruction fetch or data memory misses). Table 5 shows the structural hazard information for the two paths in Figure 8 and how the information in path 1 has to be adjusted before being unioned. The worst-case union of the number of cycles from the beginning and end of the paths for a given stage will simply be the minimum number encountered. Likewise, the best-case union will be the maximum number encountered. The structural hazard information indicating the number of cycles from the end of path 1 has to be adjusted since its total number of cycles is 13 less than the cycles required by path 2. The resulting worst-case union of the structural hazard information of the two paths would be identical to the structural hazard information for path 2. Likewise, the best-case union would be identical to the information for path 1. Note that the data hazard information would change slightly since instruction 12 references register %o0 as a source operand and %o1 as both a source and destination. Yet, representing access to these registers would not likely have an effect when the timing analysis is performed between this path and its predecessor and successor paths since the EX stage is used before and after cycle 6, which is when instruction 12 enters the EX stage.
Footnote 13: ... Thus, there are no additional pipeline stages associated with these instructions. Also note the one-cycle stall between instructions 8 and 12 in the EX stage of path 1 due to a load hazard. Finally, the ldd (instruction 9) requires two cycles to complete the MEM stage [1].
Table 5. Structural Hazard Information for the Paths in Figure 8.
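The worst-case union of structural hazard information can be sketched as below. Each path records, per stage, the number of cycles from the beginning of the path to the first use of the stage and from the last use to the end of the path. Ending values of shorter paths are first shifted so that they are relative to the end of the longest path, and then the minimum is taken for each stage. The function name, the dictionary layout, and all numbers are hypothetical; they are not the values of Table 5.

```python
def wcet_union(paths):
    """Worst-case union of per-stage structural hazard information.

    paths: list of {"total": cycles, "stages": {stage: (from_begin, from_end)}}.
    (Hypothetical layout used only for this sketch.)
    """
    longest = max(p["total"] for p in paths)
    union = {}
    for p in paths:
        # Re-express "cycles from the end" relative to the longest path.
        shift = longest - p["total"]
        for stage, (from_begin, from_end) in p["stages"].items():
            adjusted_end = from_end + shift
            if stage in union:
                begin, end = union[stage]
                union[stage] = (min(begin, from_begin), min(end, adjusted_end))
            else:
                union[stage] = (from_begin, adjusted_end)
    return {"total": longest, "stages": union}


# Path 2 is 13 cycles longer than path 1 and also uses a floating-point stage.
path1 = {"total": 10, "stages": {"IF": (0, 5), "EX": (2, 3), "MEM": (3, 1)}}
path2 = {"total": 23,
         "stages": {"IF": (0, 5), "EX": (2, 3), "FEX": (3, 2), "MEM": (3, 1)}}
union = wcet_union([path1, path2])
print(union["total"], union["stages"])
```

In this made-up example the union equals the information of the longer path, mirroring the outcome described for the two paths of Figure 8. The best-case union, which takes maxima and treats unused stages as empty, is not shown.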
Let n be the maximum number of iterations associated with a loop. The algorithm for estimating the worst-case execution time for a loop is shown in Figure 9 . The algorithm contains three phases.
During the first phase, the loop is analyzed one iteration at a time. For each iteration, the algorithm chooses the path with the greatest WCET. The first phase continues as long as new first miss instructions are encountered on each iteration. The WHILE loop in the algorithm represents this first phase, and it terminates when the number of calculated iterations reaches n − 1 or no more first misses (first hits) are encountered as misses (hits). Thus, the WHILE loop will iterate at most min(n − 1, m + 1) times, where m is the number of paths in the loop, since a first miss (first hit) can miss (hit) at most once during the loop execution. During the second phase, the pipeline effects of the last chosen continue path are replicated for the remaining iterations except the last iteration. In the third and final phase, the last iteration of the loop is handled separately. If the loop being analyzed has only one iteration, as is the case with a function, only this third phase is performed.
The algorithm selects the longest path on each iteration of the loop. In order to demonstrate the correctness of the algorithm, one must show that no other path for a given iteration of the loop will produce a longer worst-case time than that calculated by the algorithm. Since the pipeline effects of each of the paths within the loop are unioned, it only remains to be shown that the caching effects are treated properly. The instruction fetch time used for each instruction depends on whether it is assumed to be a hit or miss, which depends on its categorization. The cache hit time is one cycle on most machines. The cache miss time is the cache hit time plus the miss penalty, which is the time required to access main memory. All categorizations are treated identically on repeated references, except for first misses and first hits. Assuming that the instructions have been categorized correctly for each loop and the pipeline analysis was correct, it remains to be shown that first misses and first hits are interpreted appropriately for a given iteration of the loop.
A first hit implies that the instruction will be a hit on its first reference after the loop is entered and all subsequent references to the instruction during the execution of the loop will be misses. The definition the authors used for a first hit requires that the instruction be within every path of the loop.
Thus, the first path chosen in the WHILE loop of the algorithm will encounter every first hit in the loop. After the first iteration, first hits are treated as misses.
A first miss implies that the instruction will be a miss on its first reference after the loop is entered and all subsequent references will be hits. An instruction classified as a first miss will be counted as a miss only the first time it is encountered within the WHILE loop of Figure 9. Because of this dual caching behavior of a first miss instruction, it is necessary to perform more than one pipeline analysis of a path since the caching behavior of the instructions comprising the path can change between iterations.
Once no more first miss instructions are encountered that miss, the pipeline effects associated with the path chosen will not change since the caching behavior of the instructions within a path will always be treated the same. The pipeline effects of the last chosen continue path are efficiently replicated for all but one of the remaining iterations. The last iteration of the loop is treated separately.
The longest exit path for a loop may be shorter than the longest continue path. By examining the exit paths separately, a tighter estimate can be obtained. Thus, the algorithm estimates a bound that is at least as great as the actual worst-case bound.
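The three phases of the worst-case loop analysis can be sketched with a deliberately simplified cost model. Here each path has a base time (all fetches hit) and a set of first miss instructions, the times of successive iterations are simply added (so the pipeline overlap between iterations that the analyzer exploits is ignored), and first hits are omitted. The names, the path layout, and the example numbers are assumptions of this sketch.

```python
MISS_PENALTY = 9  # miss penalty assumed in the paper


def path_time(path, missed):
    """Time of one traversal and the first misses charged by it."""
    new_misses = path["first_misses"] - missed
    return path["cycles"] + MISS_PENALTY * len(new_misses), new_misses


def loop_wcet(paths, n):
    """Simplified worst-case loop time for n iterations."""
    continue_paths = [p for p in paths if p["continue"]]
    exit_paths = [p for p in paths if p["exit"]]
    missed = set()   # first misses already charged as misses
    total = 0
    done = 0
    # Phase 1: one iteration at a time while first misses still miss.
    while done < n - 1:
        time, new = max((path_time(p, missed) for p in continue_paths),
                        key=lambda result: result[0])
        if not new:
            break
        total += time
        missed |= new
        done += 1
    # Phase 2: replicate the longest continue path for the middle iterations.
    if done < n - 1:
        steady = max(path_time(p, missed)[0] for p in continue_paths)
        total += steady * (n - 1 - done)
    # Phase 3: the last iteration takes the longest exit path.
    total += max(path_time(p, missed)[0] for p in exit_paths)
    return total


paths = [
    {"cycles": 20, "first_misses": {1, 2}, "continue": True, "exit": False},
    {"cycles": 12, "first_misses": {3, 4, 5}, "continue": True, "exit": True},
]
print(loop_wcet(paths, 5))  # 39 + 38 + 20 + 20 + 12
print(loop_wcet(paths, 1))  # only phase 3 runs
```

For five iterations the sketch charges each first miss once (39 and 38 cycles in the first two iterations), then replicates the 20-cycle path twice, and finishes with the 12-cycle exit path.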
The algorithm used for estimating the best-case execution time for a loop is somewhat simpler. Let n be the minimum number of iterations associated with a loop. Like the corresponding algorithm for worst case, the best-case loop analysis algorithm contains three phases. However, during the first phase, a shortest path is found only for the first iteration of the loop. The second phase of the algorithm determines the shortest path for the middle n − 2 iterations of the loop. The third phase finds the shortest exit path from the loop in the final iteration. The algorithm for estimating the BCET for a loop is shown in Figure 10.
The best-case algorithm selects the shortest path on each iteration of the loop. In order to demonstrate the correctness of the algorithm, one must show that no other path for a given iteration will produce a shorter best-case time than that calculated by the algorithm. The pipeline information for the first iteration is typically calculated within the IF-THEN portion (i.e. when the loop iterates more than once). The first time program lines are referenced in a loop, first misses will be misses and first hits will be hits. Thus, the algorithm will calculate the shortest path for the first iteration. The shortest continue path will then be calculated given that first misses will be hits and first hits will be misses. All the first hits within the loop will be encountered on the first iteration according to the definition of first hits that was used by the authors. Thus, they can be safely treated as misses on subsequent iterations. A first miss will be a hit if it has been encountered previously. Even if a first miss had not been encountered in the first iteration, treating the reference as a hit in the second iteration will only cause a slight underestimation. The pipeline information for the first iteration will be concatenated to the pipeline information calculated for the next n − 2 iterations. The algorithm in Figure 10 examines the last iteration separately since paths associated with the exit blocks may be shorter than the shortest continue path. When the number of loop iterations is one (i.e. the loop is actually a function), first misses and first hits will be treated as misses and hits, respectively, in the pipeline analysis of the exit path. Thus, the algorithm estimates a bound that is at least as small as the actual best-case bound.
It is important to note that the worst-case and best-case loop analysis algorithms are not perfectly analogous. Consider a loop having three paths with information depicted in Table 6. In this case, the timing analyzer will underestimate the BCET of the loop by five cycles, and this underestimation is due to the incorrect prediction of which path had been chosen for the first iteration. In order to make an exact prediction in the best case, it becomes necessary to re-examine path choices for prior iterations. We believe that having to re-examine all combinations of path choices for prior iterations to compute the BCET of a current iteration is overly inefficient. As a result, the best-case loop analysis algorithm shown in Figure 10 assumes that the same path will be taken during the middle iterations of the loop at the expense of a small underestimation in the total BCET.
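The best-case phases can be sketched under the same kind of simplification: iteration times are added without pipeline overlap, first hits are omitted, and each path carries a base time plus a set of best-case first miss instructions. The names, layout, and numbers are assumptions of this sketch and are unrelated to Table 6.

```python
MISS_PENALTY = 9  # miss penalty assumed in the paper


def cold_time(path):
    """Path time when its first misses are treated as misses."""
    return path["cycles"] + MISS_PENALTY * len(path["first_misses"])


def warm_time(path):
    """Path time when its first misses are treated as hits."""
    return path["cycles"]


def loop_bcet(paths, n):
    """Simplified best-case loop time for n iterations."""
    continue_paths = [p for p in paths if p["continue"]]
    exit_paths = [p for p in paths if p["exit"]]
    if n == 1:
        # A single iteration: first misses miss on the exit path.
        return min(cold_time(p) for p in exit_paths)
    first = min(cold_time(p) for p in continue_paths)               # phase 1
    middle = min(warm_time(p) for p in continue_paths) * (n - 2)    # phase 2
    last = min(warm_time(p) for p in exit_paths)                    # phase 3
    return first + middle + last


paths = [
    {"cycles": 20, "first_misses": {1, 2}, "continue": True, "exit": False},
    {"cycles": 12, "first_misses": {3, 4, 5}, "continue": True, "exit": True},
]
print(loop_bcet(paths, 5))  # 38 + 3 * 12 + 12
print(loop_bcet(paths, 1))
```

The example also exhibits the underestimation discussed above: the first iteration is charged for the 38-cycle path, while the later iterations use the 12-cycle path as if its lines were already cached, although that path was never taken. No prior path choice is re-examined.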
Program Analysis
A timing analysis tree is constructed to predict the worst-case times of code segments containing nested loops and function calls. In the context of the notation in Figure 4, the root of this tree is an instance of struct loop_node representing main(). Each node of the tree represents either a loop or a function in the function instance graph. Each node is assumed to be a natural loop [footnote 14]. The nodes representing the outer level of function instances are treated as natural loops that will iterate only once when entered.
The loops in the timing analysis tree are processed in a bottom-up manner. In other words, the worst-case and best-case times for a loop are not calculated until the times for all of its immediate child loops are known. The algorithm given in the previous section described how a loop containing no other loops would be analyzed. The timing of a non-leaf loop is accomplished using this algorithm and the pipeline information and total times from its immediate child loops. Associated with each loop is a set of exit blocks, which indicates the possible blocks outside the loop that can be reached from the last block in each exit path. A unique set of timing information is stored for the child loop with each of these exit blocks. If a path within a loop enters a child loop, then the pipeline information and total time from the appropriate exit block are used at that point during the analysis of the path. For instance, if the loop in Figure 6 exits to block 5, then the last iteration of the loop will be shorter than if it had exited to block 7. Thus, the possible paths within non-leaf loops that contain child loops can also be calculated [footnote 15]. The transition of an instruction categorization from the child loop level to the current loop level will be used to determine if any adjustment to the child loop time is required. The transitions between categorizations requiring adjustments are described in Table 7.
Footnote 14: A natural loop is a loop with a single entry block. While the static simulator can process unnatural loops, the timing analyzer is restricted to only analyzing natural loops since it would be difficult for both the timing analyzer and the user to determine the set of possible blocks associated with a single iteration in an unnatural loop. It should be noted that unnatural loops occur quite infrequently.
Footnote 15: The timing analysis across loop levels is only briefly introduced in this section. It is described in more detail elsewhere [2, 4].
Table 7. Use of Child Loop Times.
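The bottom-up order of processing can be sketched with a post-order traversal of a small tree. The class `LoopNode`, its fields, and the cost model (every iteration executes the node's own cycles plus each child loop once, with no paths, categorization adjustments, or pipeline overlap) are simplifying assumptions made for this illustration.

```python
class LoopNode:
    """A loop or function instance in a (simplified) timing analysis tree."""

    def __init__(self, name, own_cycles, iterations=1, children=()):
        self.name = name
        self.own_cycles = own_cycles    # cycles per iteration outside child loops
        self.iterations = iterations    # functions iterate once
        self.children = list(children)
        self.time = None                # filled in by analyze()


def analyze(node, order):
    """Compute node.time after all child loops have been timed."""
    child_time = 0
    for child in node.children:
        child_time += analyze(child, order)   # children are processed first
    node.time = node.iterations * (node.own_cycles + child_time)
    order.append(node.name)
    return node.time


inner = LoopNode("inner", own_cycles=3, iterations=10)
outer = LoopNode("outer", own_cycles=4, iterations=10, children=[inner])
main = LoopNode("main", own_cycles=5, children=[outer])
order = []
total = analyze(main, order)
print(order, total)
```

The recorded order shows that the innermost loop is timed first and the root last, so each loop's time is available before its parent needs it.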
The fm=>fm adjustment is necessary since there should be only one miss associated with the instruction and a miss should only occur the first time the child loop is entered. 16 For instance, consider a program with two nested loops in which each loop iterates 10 times. An instruction within both loops is classified as a first miss (fm) at both the inner and outer loop levels. The instruction should miss only during the first iteration of the inner loop within the first iteration of the outer loop (1 miss, 99 hits).
If no adjustment were made and the inner (child) loop pipeline information was used directly, then an overestimation would result since the analyzer would treat the instruction as initially missing for each iteration of the outer loop (10 misses, 90 hits). The m=>fh adjustment is necessary since the first reference to the instruction in the outer loop will be a hit. These same adjustments were used in previous work on bounding only instruction cache performance [4, 6].
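The miss counts in the nested loop example can be reproduced with a few lines of Python. The function name `charged_misses` is hypothetical and the sketch covers only the fm=>fm case for a single instruction. The final figure is a simple product that ignores any overlap of a miss penalty with pipeline stalls.

```python
MISS_PENALTY = 9  # cycles; the miss penalty assumed in this paper


def charged_misses(outer_iters: int, adjusted: bool) -> int:
    """Misses charged to one instruction that is a first miss (fm) at both loop levels."""
    if adjusted:
        # fm=>fm adjustment: only the first reference in the whole loop nest misses.
        return 1
    # No adjustment: the child loop's first miss is charged on every entry
    # to the child loop, that is, once per outer iteration.
    return outer_iters


OUTER_ITERS = INNER_ITERS = 10
references = OUTER_ITERS * INNER_ITERS

for adjusted in (True, False):
    misses = charged_misses(OUTER_ITERS, adjusted)
    print(f"adjusted={adjusted}: {misses} misses, {references - misses} hits")

# Extra cycles charged without the adjustment, ignoring any pipeline overlap.
extra_cycles = (
    charged_misses(OUTER_ITERS, False) - charged_misses(OUTER_ITERS, True)
) * MISS_PENALTY
print(extra_cycles)
```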
Making these adjustments when pipelining is involved resulted in some slight mispredictions. The problem is that the caching behavior of a particular instruction depends on the loop level being analyzed. When a worst-case adjustment at an outer loop level would be needed for an instruction having a transition in Table 6, we conservatively added the maximum number of cycles associated with a cache miss penalty to the total time of the path containing the instruction and treated the instruction fetch as a cache hit within the path pipeline analysis for the inner loop. When the instruction fetch should be viewed as a cache hit at an outer loop level, the previously added miss penalty cycles were subtracted from the loop's time. This strategy permitted a single pipeline analysis of each loop, yet adjustments could still be made at outer levels of the program. A worst-case overestimation occurs when the instruction fetch is regarded as a miss and the cache miss penalty could have been overlapped with other pipeline delays (as shown in Figure 1).
16 Note that additional work was required when the number of distinct paths containing first misses to different program lines exceeds the number of loop iterations. This situation can commonly occur within functions. A maximum adjustment value was used to compensate in an efficient manner for the remaining loop iterations.
For best-case estimations, we treated the fetch of an instruction having a transition in Table 6 as a cache miss within the path pipeline analysis of the inner loop. When the instruction fetch should be viewed as a cache hit at an outer loop level, then the miss penalty will be subtracted from the total time of the path. If the miss penalty could be overlapped with some hazard (as shown in Figure 1), then an underestimation will result.
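The source of the worst-case overestimation and the best-case underestimation can be illustrated with a toy model in Python. The model is an assumption made for this example and is not the analyzer's pipeline representation: a path costs `base` cycles plus a hazard stall of `stall` cycles, and a miss penalty is assumed to overlap completely with that stall when both occur. The function names and the path values are hypothetical.

```python
MISS_PENALTY = 9  # cycles; the miss penalty assumed in this paper


def true_cycles(base: int, stall: int, is_miss: bool) -> int:
    """Toy model: a miss penalty and a hazard stall overlap completely."""
    return base + (max(stall, MISS_PENALTY) if is_miss else stall)


def worst_case_estimate(base: int, stall: int, hit_at_outer_level: bool) -> int:
    # The pipeline analysis treats the fetch as a hit; the full penalty is added on top.
    estimate = true_cycles(base, stall, is_miss=False) + MISS_PENALTY
    if hit_at_outer_level:
        estimate -= MISS_PENALTY  # the previously added penalty is removed again
    return estimate


def best_case_estimate(base: int, stall: int, hit_at_outer_level: bool) -> int:
    # The pipeline analysis treats the fetch as a miss.
    estimate = true_cycles(base, stall, is_miss=True)
    if hit_at_outer_level:
        estimate -= MISS_PENALTY  # the whole penalty is subtracted
    return estimate


base, stall = 10, 5  # hypothetical path: 10 cycles plus a 5-cycle hazard stall

# Worst case when the fetch really is a miss: the hidden part of the penalty is counted twice.
wc_error = worst_case_estimate(base, stall, False) - true_cycles(base, stall, True)
# Best case when the fetch really is a hit: the hidden part of the penalty is subtracted anyway.
bc_error = best_case_estimate(base, stall, True) - true_cycles(base, stall, False)
print(wc_error, bc_error)
```

Under this model both errors have magnitude min(stall, MISS_PENALTY), and both vanish when there is no stall to overlap with.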
The timing analyzer could achieve an exact prediction by storing pipeline information about both cases (whether an instruction having such a categorization transition between loop levels should be treated as a miss or a hit in the pipeline). However, there could be several instructions within a single loop having such caching categorization transitions between loop levels. Storing pipeline information about both cases for each instruction would result in exponential growth in space and complexity since all combinations of categorizations would have to be analyzed.
During best-case analysis, it is sometimes necessary to ignore a potential data hazard between a parent and child loop to avoid a potential overestimation in execution time. This situation can occur when a hazard is overlapped with some other delay (e.g., an instruction cache miss). The timing analyzer determines the number of cycles that a particular stage is vacant from the point it is first occupied to the point it is last occupied. If a data or structural hazard is detected for a particular stage between a parent and child loop, then the delay is reduced by the number of vacant cycles for that stage in the child loop. If there were no vacant cycles, then the hazard could not be overlapped with other delays. The potential underestimation introduced by this reduction could be avoided by storing more information about the child loop. Again, this would increase the complexity of the algorithm. A more detailed discussion about dealing with vacant cycles for best-case timing analysis is given elsewhere [8].
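A minimal sketch of the vacant cycle computation is shown below. Representing a stage's occupancy as a list of cycle numbers, the function names, and the clamping of the reduced delay at zero are all assumptions made for this example. The occupancy values are hypothetical.

```python
from typing import Iterable


def vacant_cycles(occupied: Iterable[int]) -> int:
    """Cycles in which a stage is empty between its first and last occupied cycle."""
    cycles = sorted(set(occupied))
    if not cycles:
        return 0
    span = cycles[-1] - cycles[0] + 1
    return span - len(cycles)


def best_case_delay(hazard_delay: int, occupied: Iterable[int]) -> int:
    """Hazard delay between parent and child loop, reduced by the child's vacant cycles."""
    # Assumption: the reduced delay is never allowed to become negative.
    return max(0, hazard_delay - vacant_cycles(occupied))


# Hypothetical child loop: a stage is occupied in cycles 3, 4, 7, 8, and 12.
stage_use = [3, 4, 7, 8, 12]
print(vacant_cycles(stage_use))       # cycles 5, 6, 9, 10, 11 are vacant
print(best_case_delay(8, stage_use))  # an 8-cycle hazard is reduced by the vacant cycles
print(best_case_delay(4, [1, 2, 3]))  # no vacant cycles, so the delay is unchanged
```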
Fortunately, these adjustments are not that common. For instance, results indicated that only about 4.5% of the instructions within the function instance graph were classified as first misses or first hits and many of these did not require adjustments. Thus, these adjustments resulted in only small and relatively infrequent worst-case overestimations and best-case underestimations.
Results
Measurements were obtained on code generated for the SPARC architecture by the vpo optimizing compiler [9]. Six simple programs described in Table 8 were used to assess the effectiveness of the timing analyzer. A direct-mapped instruction cache configuration containing 8 lines of 16 bytes was used. Thus, the cache contained 128 bytes of instructions. A very small cache size was chosen because the test programs were relatively small themselves. The instruction cache performance of each entire program was predicted. The sizes of these test programs may be comparable to the size of typical code segments containing timing constraints in real-time applications. In addition, the code executed between two scheduling points (context switches) in a non-preemptive system is often smaller than the code of a typical program. Using a small cache also provided a more realistic simulation of a typical ratio of program to cache size. The programs were 4 to 17 times larger than the cache as shown in column 2 of Table 8. The analysis of test cases with smaller ratios, where test programs fit into the instruction cache, could be accomplished quite easily and would not represent a significant challenge. Using a smaller cache demonstrates the ability of the timing analyzer to predict tight bounds under a more difficult setting. Column 3 shows that each program was highly modularized to illustrate the handling of timing predictions across functions. Column 4 shows the worst-case hit ratio of each program. Only Matmul had a very high ratio due to three tightly nested loops in a single function to perform the matrix multiplication.
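For the direct-mapped configuration used in these measurements (8 lines of 16 bytes), the mapping from an instruction address to a cache line can be sketched as follows. The function names and the example addresses are hypothetical. Only the line count and line size come from the text.

```python
LINE_SIZE = 16  # bytes per cache line
NUM_LINES = 8   # lines in the direct-mapped instruction cache
CACHE_SIZE = LINE_SIZE * NUM_LINES  # 128 bytes


def memory_block(addr: int) -> int:
    """Number of the line-sized block of memory that contains addr."""
    return addr // LINE_SIZE


def cache_line(addr: int) -> int:
    """Cache line that addr maps to in a direct-mapped cache."""
    return memory_block(addr) % NUM_LINES


def conflicts(addr_a: int, addr_b: int) -> bool:
    """True if the two addresses compete for the same cache line."""
    return (cache_line(addr_a) == cache_line(addr_b)
            and memory_block(addr_a) != memory_block(addr_b))


# Hypothetical instruction addresses.
print(CACHE_SIZE)
print(conflicts(0x1000, 0x1080))  # 128 bytes apart: same cache line, different block
print(conflicts(0x1000, 0x1010))  # adjacent blocks map to different cache lines
print(conflicts(0x1000, 0x1004))  # same block, so no conflict
```

With a cache this small, any two program lines whose addresses differ by a multiple of 128 bytes compete for the same cache line.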
The results of evaluating these programs are shown in Table 9. The observed cycles for these measurements were obtained by enhancing the Ease cache simulator [10]. This simulator produced the pipeline only observed cycles and the timing analyzer produced the [...]
Table 9. Results for the Test Programs.
The number of iterations performed was overrepresented on average by a factor of two for this specific loop. Note that both of these problems are encountered by other timing tools and are not directly related to the pipeline analysis.
The best-case pipeline only timing analysis resulted in exact predictions for Matmul and Stats. The predictions for Matcnt and Matsum were slightly underestimated because the effect of a data hazard was diminished by vacant cycles within a child loop. Even though Matmul has no conditional control flow, its BCET is less than its WCET because the integer multiply instruction smul can spend 1 to 19 cycles in the EX stage. Floating-point instructions also take a varying time to execute, which can result in a WCET that is significantly greater than the corresponding BCET. The best-case predictions for Des and Sort were substantially underestimated for the same reasons they were overestimated in the worst-case analysis.
The worst-case and best-case caching only timing analysis results were also quite accurate. This analysis had exact predictions for Matmul, Matsum, and Stats since there were few conditional constructs except to exit loops. The Matcnt program used an if-then-else construct to either add a nonnegative value to a sum and increment a counter for the number of nonnegative elements or just increment a counter for the negative elements. The adding of the nonnegative value to a sum was accomplished in a separate function, which was purposely placed in a location that would conflict with the program line containing the code to increment a counter for the negative elements. Multiple executions of the then path, which includes the call to the function to perform the addition, still required more cycles than alternating between the two paths. Yet, the algorithm for estimating the worst-case instruction caching performance assumes that the first reference to a program line within a path would always be a miss if there were accesses to any other conflicting program lines within the same loop. This assumption simplified the algorithm since the effect of all combinations of paths need not be calculated. Thus, one reference was counted repeatedly as a miss instead of a hit in the worst-case analysis. This path was executed 10,000 times and accounted for a 90,000 cycle [10,000 * miss penalty] or an 8% overestimation. Note that the execution of this single path accounted for 40.61% of the total instructions referenced during the program execution. The best-case analysis for Matcnt was exact since the shorter path did not contain the call to add a nonnegative value. The programs Des and Sort had overestimations for the worst-case predictions and underestimations for the best-case predictions due to the same problems described previously for the pipeline only measurements. The worst-case naive ratio was lower than initially anticipated by the authors.
These test programs contained many long running instructions (floating-point operations and integer multiplies and divides) that were frequently executed and often resulted in stalls. In addition, transfers of control were also quite frequent and were only considered to require two pipeline stages in our analysis.
The integrated pipeline and caching worst-case analysis also resulted in quite tight predictions.
Again, the predictions for the programs Matmul, Matsum, and Stats were very accurate. Note that the estimated worst-case cycles were slightly greater than the observed cycles for these programs. This overestimation was due to the problem of an instruction's caching behavior changing between loop levels. These changes require an adjustment as shown in Table 6. The approach used by the authors was to treat such an instruction as a hit in the pipeline analysis and simply add the miss penalty to the total time. When the instruction should be viewed as a hit at an outer level, then this miss penalty was simply subtracted and an accurate estimation is obtained. However, in these three programs the potential overlap between a miss penalty and a stall due to a hazard was not always detected. 17 The Des, Matcnt, and Sort programs had their usual worst-case overestimations due to data dependencies, a cache conflict, and an inaccurate number of estimated loop iterations, respectively. The naive ratio indicates that much tighter WCET bounds can be obtained when the benefits of pipelining and instruction caching are analyzed.
The integrated pipeline and caching best-case analysis for the four programs (Matcnt, Matmul, Matsum, and Stats) without data dependency or loop iteration problems was within 8% of the observed cycles. The underestimations were largely due to inaccuracies resulting from a fm=>fm transition between inner and outer loops. The timing analyzer treats the instruction in this case as a miss in the pipeline best-case analysis and subtracts the miss penalty from the time of the path when the instruction will be viewed as a hit. Thus, if a portion of the miss penalty can be overlapped with a delay due to a data hazard, an underestimation will occur on each iteration except the first. In contrast, the worst-case analysis would treat the instruction as a hit in the pipeline analysis and only overestimate in a similar situation on the first iteration of the loop when the instruction reference was regarded as a miss. In addition, some of the underestimation in the best-case analysis was from disregarding data hazard stall cycles between a parent and a child loop due to subtracting vacant cycles from the stall. Thus, it was common to have a larger underestimation in best-case analysis than an overestimation in worst-case analysis. Fortunately, most timing constraints are associated with meeting deadlines, which requires worst-case analysis, instead of finishing a task too soon, which would require best-case analysis. The other two programs (Des and Sort) were significantly underestimated due to data dependencies and loop iteration problems discussed previously.
17 For instance, the 502 cycle overestimation in Matmul occurred from 50 miss penalties completely overlapping with stalls from an integer multiply instruction and 52 misses overlapping with one cycle load hazards.
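The 502 cycle figure in footnote 17 follows directly from the nine-cycle miss penalty assumed in this paper. The short Python check below uses a hypothetical list of (number of misses, overlapped cycles per miss) pairs to make the arithmetic explicit.

```python
MISS_PENALTY = 9  # cycles; the miss penalty assumed in this paper

# (number of misses, cycles of each miss penalty hidden behind a stall)
overlaps = [
    (50, MISS_PENALTY),  # completely overlapped with integer multiply stalls
    (52, 1),             # overlapped with one-cycle load hazards
]

overestimation = sum(count * hidden for count, hidden in overlaps)
print(overestimation)  # 50 * 9 + 52 * 1 = 502
```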
If the pipeline and caching analysis had been handled independently, then the cache miss penalty would not have the opportunity to overlap with a pipeline stall, as shown in Figure 1. Thus, one would anticipate a greater overestimation in predicting WCET with an independent analysis approach. The effect of an independent analysis strategy would be to add the cache miss penalty to the total time of a path when an instruction fetch is predicted to be a miss and treat the instruction as a hit in the pipeline. The benefit of integrating the pipeline and instruction cache worst-case analysis is depicted in Table 10. Without an integrated analysis, the test programs would have been overestimated by an additional 3% on average. Note that the most significant effect was on the worst-case prediction of Stats, which was the only floating-point intensive test program. Programs requiring floating-point operations result in more frequent and lengthy delays that may sometimes be overlapped with instruction cache misses or any other source of multicycle pipeline stage occupation.
Thus, the benefit of using an integrated analysis approach would be more pronounced in floating-point intensive programs.
Table 10. Ratios for Integrated versus Independent Worst-Case Analysis
User Interface
Once the initial timing analysis has been completed, the graphical user interface depicted in Figure 11 is invoked. The main window on the left allows the user to quickly request timing predictions for functions, loops, paths, subpaths, or ranges of machine instructions and reports the [...] be more than one instance of a function within the timing analysis tree, the user interface displays the worst-case and best-case times from all of the instances of the construct associated with the user request. Whenever a different construct is selected, the highlighted lines in windows containing the source and assembly code are automatically updated and scrolled to the appropriate position. Thus, the user can quickly observe the relationship between timing constraints associated with the source code and sequences of machine instructions. This interface is described in more detail elsewhere [11].
Comparison with Previous Work
There has been much work on the issue of predicting execution time of programs. However, most approaches in the past have not dealt with the effects of pipelining and instruction caching [12, 13, 14]. There have also been some recent studies on predicting pipeline performance by Harmon et al. [15] and Narasimhan and Nilsen [16]. Yet, these studies did not address caching issues. 18 Furthermore, the former study was limited to nonnested functions and the latter study required the sequence of executed instructions to be known. Finally, there has been some recent work on predicting instruction caching performance. Arnold et al. [4] implemented a timing analysis system to tightly bound instruction cache performance. However, this approach did not address pipelining issues.
18 Harmon assumed the entire code segment would fit into cache. Thus, at most one miss could occur for each instruction reference.
Li et al. [17, 18] used an integer linear programming (ILP) approach to model instruction caching behavior. Their approach is also used to predict data and set-associative caching behavior [19]. The authors automatically derived constraints from a program's control-flow graph that could be solved using ILP. Additional user-provided constraints regarding data dependencies within the control flow can be easily integrated into the analysis. In their control-flow analysis, each set of instructions within a basic block mapping to the same cache line was identified as a line-block. Three possible states were identified for each cache line. First, if only one line-block is mapped to it, then it will experience at most one miss penalty. Second, if two or more non-conflicting line-blocks map to it, then these line-blocks will have at most one miss penalty among them. Finally, if two or more conflicting line-blocks map to it, then a cache conflict graph is constructed for this cache line. Each edge between line-blocks in this graph represents a possible path between the two conflicting line-blocks. Additional constraints are generated to represent the number of times these edges are traversed. Whenever a line-block is reached from a conflicting line-block, it is assumed that there is a miss penalty associated with its execution.
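The three cache line states described above can be sketched as a small classification routine. This is an illustration under stated assumptions and not Li et al.'s implementation: the function name, the line-block names, and the rule that two line-blocks conflict when they map to the same cache line but come from different memory blocks are assumptions made for this example. The construction of the cache conflict graph and the ILP constraints is omitted.

```python
from collections import defaultdict
from typing import Dict, List

NUM_LINES = 8  # hypothetical direct-mapped cache with 8 lines


def classify_cache_lines(line_blocks: Dict[str, int],
                         num_lines: int = NUM_LINES) -> Dict[int, str]:
    """Classify each used cache line.

    line_blocks maps a line-block name to its memory block number
    (instruction address divided by the cache line size).
    """
    by_line: Dict[int, List[int]] = defaultdict(list)
    for mem_block in line_blocks.values():
        by_line[mem_block % num_lines].append(mem_block)

    states: Dict[int, str] = {}
    for line, mem_blocks in by_line.items():
        if len(mem_blocks) == 1:
            states[line] = "single"           # at most one miss penalty
        elif len(set(mem_blocks)) == 1:
            states[line] = "non-conflicting"  # at most one miss penalty among them
        else:
            states[line] = "conflicting"      # needs a cache conflict graph
    return states


# Hypothetical line-blocks: B1.1 and B2.1 share memory block 0,
# while B4.1 (block 2) and B5.1 (block 10) compete for cache line 2.
example = {"B1.1": 0, "B2.1": 0, "B3.1": 1, "B4.1": 2, "B5.1": 10}
print(classify_cache_lines(example))
```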
Apparently, the pipeline behavior was not modeled and it is unclear how well Li's approach will work when pipelining is addressed. However, it is possible that pipeline behavior for instructions within a single basic block can be modeled with Li's ILP approach. Performing no general pipeline analysis allowed their approach to disregard the potential effects of different paths on pipeline behavior. Thus, they had only two possible times for the instructions within a line-block, one with an instruction cache miss and one without a miss. Unfortunately, the state of the pipeline can affect the execution time associated with a sequence of instructions. Thus, there was also no method shown for detecting pipeline stalls or potential overlap between stalls and cache misses.
There has been only one previous study that attempted to address the issue of predicting the WCET of programs on machines with both pipelining and an instruction cache. Lim et al. [20] described a method of predicting the performance of pipelining and instruction caching, which is based on an extension of a previous timing tool [21]. They have also extended this tool to address data caching [22]. It has been proposed that the Lim approach can be extended to analyze set-associative caching behavior as well. Lim's method differs quite significantly from our approach described in this paper, which instead builds on flow analysis techniques found in optimizing compilers. Lim's method uses a timing schema associated with each source-level language program construct. They stored information about the number of cycles at the head and tail of a reservation table produced as a result of the pipeline analysis on the instructions associated with a program construct. In addition, this method stored information about the set of memory blocks whose first reference depends upon the cache contents prior to the execution of the construct. Lim also stored the set of memory blocks known to remain in cache after the execution of the construct. Eventually, this timing information is concatenated with another construct that would be executed immediately before the current construct.
Their timing analyzer attempted to overlap the head of the reservation table of the current construct with the tail of the reservation table of the other construct as much as possible. Their row-based approach of concatenating reservation tables is equivalent to our tables of structural and data hazard information depicted in Tables 3 and 4. Likewise, the list of memory blocks known to be in cache after executing the other construct is used to adjust the time of the current construct by comparing this list to the list of first reference blocks in the current construct. This method stored multiple paths for conditional constructs, such as an if-then-else. They pruned or eliminated a particular path when it was found that the worst-case execution time of the path was faster than the best-case execution time of another path within the same construct.
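The pruning rule described at the end of this paragraph can be sketched in a few lines of Python. The function name, the dictionary representation of a construct's paths, and the cycle counts are hypothetical and are not taken from Lim et al.'s tool. The sketch only shows the comparison between one path's worst-case time and another path's best-case time.

```python
from typing import Dict, Tuple

PathTimes = Tuple[int, int]  # (best-case cycles, worst-case cycles)


def prune_paths(paths: Dict[str, PathTimes]) -> Dict[str, PathTimes]:
    """Drop every path whose worst case is faster than another path's best case."""
    if not paths:
        return {}
    # A path's own best case never exceeds its own worst case, so comparing
    # against the largest best case over all paths is equivalent to comparing
    # against every other path.
    largest_bcet = max(bcet for bcet, _ in paths.values())
    return {name: times for name, times in paths.items() if times[1] >= largest_bcet}


# Hypothetical paths through one conditional construct.
construct = {"then": (30, 40), "else": (10, 25), "third": (28, 35)}
print(sorted(prune_paths(construct)))  # the else path can never be the worst case
```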
The approach that Lim et al. used to analyze caching behavior limits the accuracy of the analysis. They used a single bottom-up pass when performing the timing analysis of a program. The caching behavior of a large percentage of the instruction fetches within a construct would be unknown until many of the surrounding constructs were processed. Their approach was to treat the instruction fetch as a hit within the pipeline and add the cycles associated with a cache miss penalty to the total time of the construct. When it was later found that an instruction reference was a hit, they would subtract the miss penalty from the total time. However, an overestimation may result when the instruction is not found in cache. As shown in Figure 1, the instruction fetch miss penalty of one instruction (instruction 2) can be completely hidden by a stall with a long running instruction (data hazard stall on instruction 3). Whether the fetch of instruction 2 was a hit or a miss would have no effect on the total number of cycles. The Lim method would rarely detect instruction fetches that would always be misses until the surrounding constructs are analyzed, which is after the pipeline analysis of a construct has already occurred. Our approach of categorizing the caching behavior of each instruction before starting the timing analysis allows the detection of such situations. For instance, about 25% of the instructions within the function instance graphs of the programs we evaluated were statically categorized as always misses. As Table 10 above indicates, we found that the pipeline and caching estimated ratio for the six test programs increased on average by about 3% when the complete miss penalty was always added for each predicted miss.
Future Work
We are working on several enhancements to the timing analyzer. We plan to automate the detection of many data dependencies using existing compiler optimization techniques to obtain tighter performance estimations [23]. We also plan to accurately calculate the number of iterations for loops that are dependent on the value of a loop counter variable of an outer loop. The retargetability of the timing analyzer will also be enhanced by isolating any remaining machine dependent information in data files.
We are exploring methods to predict the timing of other architectural features associated with RISC processors. Work is currently ongoing to verify that our technique accurately predicts performance for the MicroSPARC I by using a logic analyzer. This will require predicting the performance of other features, such as wrap-around filling of cache lines. The effect of data caching is also an area that we are pursuing. Unlike instruction caching, many of the addresses of references to data can change during the execution of a program. Thus, obtaining reasonably tight bounds for worst-case and best-case data cache performance is significantly more challenging. However, many of the data references are known. For instance, static or global data references retain the same addresses during the execution of a program. Due to the analysis of a function instance tree (no recursion allowed), addresses of run-time stack references can be statically determined even when the addresses may differ for different invocations of the same function. Compiler flow analysis can be used to detect the pattern of many calculated references, such as indexing through an array. While the benefits of using a data cache for real-time systems will probably not be as significant as using an instruction cache, its effect on performance should still be substantial. We are also currently working on extending the timing analyzer to predict the performance of set-associative caches.
Conclusions
This paper has presented a technique for predicting the worst-case and best-case execution time of programs on machines with pipelining and instruction caches. First, a static cache simulator analyzes the control flow of a program to statically categorize the caching behavior of each instruction within the program. Second, a timing analyzer uses these instruction categorizations when analyzing the pipeline performance of a path of instructions. Third, the timing analyzer uses a concise representation of the pipeline information to concatenate the performance of paths in an efficient manner when predicting the performance of loops. Fourth, a timing analysis tree is used to predict the performance of an entire program. Finally, a graphical user interface has been implemented that allows users to obtain timing predictions of portions of the program. The results indicate that the timing analyzer can quickly obtain tight predictions of performance.
Acknowledgements
Lo Ko and Emily Ratliff implemented the user interface. We are also grateful to the anonymous referees who provided helpful suggestions that improved the quality of the paper.
