I INTRODUCTION
In real-time computing systems, tasks have timing requirements (i.e., deadlines) that must be met for correct operation. Thus, it is of utmost importance to guarantee that tasks finish before their deadlines. Various scheduling techniques, both static and dynamic, have been proposed to ensure this guarantee. These scheduling algorithms generally require that the WCET (Worst Case Execution Time) of each task in the system be known a priori. Therefore, it is not surprising that considerable research has focused on the estimation of the WCETs of tasks.
In a non-pipelined processor without cache memory, it is relatively easy to obtain a tight bound on the WCET of a sequence of instructions. One simply has to sum up their individual execution times, which are usually given in a table. The WCET of a program can then be calculated by traversing the program's syntax tree bottom-up and applying formulas for calculating the WCETs of various language constructs. However, for RISC processors such a simple analysis may not be appropriate because of their pipelined execution and cache memory. In RISC processors, an instruction's execution time varies widely depending on many factors such as pipeline stalls due to hazards and cache hits/misses. One can still obtain a safe WCET bound by assuming the worst case execution scenario (e.g., each instruction suffers from every kind of hazard and every memory access results in a cache miss). However, such a pessimistic approach would yield an extremely loose WCET bound, resulting in severe under-utilization of machine resources.
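As a minimal sketch of the simple analysis described at the start of this paragraph (fixed instruction times, no pipeline, no cache), the bottom-up traversal can be written as below. The node classes, the cycle table `CYCLES`, and the `bound` field are illustrative assumptions, not values from any data book or tool.

```python
from dataclasses import dataclass
from typing import List, Union

# Hypothetical per-instruction execution times in cycles (assumed for illustration).
CYCLES = {"add": 1, "load": 2, "store": 2, "mul": 4, "branch": 2}

@dataclass
class Block:            # straight-line sequence of instructions
    instrs: List[str]

@dataclass
class Seq:              # S1; S2
    first: "Node"
    second: "Node"

@dataclass
class If:               # if (exp) then S1 else S2
    cond: "Node"
    then: "Node"
    other: "Node"

@dataclass
class While:            # while (exp) S1, with the body executed at most `bound` times
    cond: "Node"
    body: "Node"
    bound: int

Node = Union[Block, Seq, If, While]

def wcet(node: Node) -> int:
    """Bottom-up WCET bound using one simple time-bound per construct."""
    if isinstance(node, Block):
        return sum(CYCLES[i] for i in node.instrs)
    if isinstance(node, Seq):
        return wcet(node.first) + wcet(node.second)
    if isinstance(node, If):
        return wcet(node.cond) + max(wcet(node.then), wcet(node.other))
    if isinstance(node, While):
        # the condition is evaluated once more than the body
        return node.bound * (wcet(node.cond) + wcet(node.body)) + wcet(node.cond)
    raise TypeError(f"unknown node {node!r}")
```

Because each construct is summarized by a single number, this scheme cannot express that a construct's time depends on its neighbors, which is the limitation the rest of the paper addresses.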
Our goal is to predict tight and safe WCET bounds of tasks for RISC processors. Achieving this goal would permit RISC processors to be widely used in real-time systems. Our approach is based on an extension of the timing schema [1]. The timing schema is a set of formulas for computing execution time bounds of language constructs. In the original timing schema, the timing information associated with each program construct is a simple time-bound. This choice of timing information facilitates a simple and accurate timing analysis for processors with fixed execution times. However, for RISC processors, such timing information is not sufficient to accurately account for timing variations resulting from pipelined execution and cache memory.
This paper proposes extensions to the original timing schema to rectify the above problem. We associate with each program construct what we call a WCTA (Worst Case Timing Abstraction). The WCTA of a program construct contains timing information of every execution path that might be the worst case execution path of the program construct. Each piece of timing information includes information about the factors that may affect the timing of the succeeding program construct. It also includes the information that is needed to refine the execution time of the program construct when the timing information of the preceding program construct becomes available at a later stage of WCET analysis. This extension leads to a revised timing schema that accurately accounts for the timing variation which results from the history sensitive nature of pipelined execution and cache memory.
We assume that each task is sequential and that some form of cache partitioning [2, 3] is used to prevent tasks from affecting each other's timing behavior. Without these assumptions, it would not be possible to eliminate the unpredictability due to task interaction. For example, consider a real-time system in which a preemptive scheduling policy is used and the cache is not partitioned. In such a system, a burst of cache misses usually occurs when a previously preempted task resumes execution. The increase in task execution time resulting from such a burst of cache misses cannot be bounded by analyzing each task in isolation.
This paper is organized as follows. In Section II, we survey the related work. Section III focuses on the problems associated with accurately estimating the WCETs of tasks in pipelined processors. We then present our method for solving these problems. In Section IV, we describe an accurate timing analysis technique for instruction cache memory and explain how this technique can be combined with the pipeline timing analysis technique given in Section III. Section V identifies the differences between the WCET analysis of instruction caches and that of data caches, and explains how we address the issues resulting from these differences. In Section VI, we report on preliminary results of WCET analyses for a RISC processor. Finally, the conclusion is given in Section VII.
II RELATED WORK
A timing prediction method for real-time systems should be able to give safe and accurate WCET bounds of tasks. Measurement-based and analytical techniques have been used to obtain such bounds. Measurement-based techniques are, in many cases, inadequate for producing timing estimates for real-time systems since their predictions are usually not guaranteed or can be obtained only at enormous cost. Due to these limitations, analytical approaches are becoming more popular [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]. Many of these analytical studies, however, consider a simple machine model, thus largely ignoring the timing effects of pipelined execution and cache memory [8, 12, 13, 15].
A. Timing Analysis of Pipelined Execution
The timing effects of pipelined execution have been recently studied by Harmon, Baker, and Whalley [6], Harcourt, Mauney, and Cook [5], Narasimhan and Nilsen [11], and Choi, Lee, and Kang [4]. In these studies, the execution time of a sequence of instructions is estimated by modeling a pipelined processor as a set of resources and representing each instruction as a process that acquires and consumes a subset of the resources in time. In order to mechanize the process of calculating the execution time, they use various techniques: pattern matching [6], SCCS (Synchronous Calculus of Communicating Systems) [5], retargetable pipeline simulation [11], and ACSR (Algebra of Communicating Shared Resources) [4]. Although these approaches have the advantage of being formal and machine independent, their applications are currently limited to calculating the execution time of a sequence of instructions or a given sequence of basic blocks. Therefore, they rely on ad hoc methods to calculate the WCETs of programs.
The pipeline timing analysis technique by Zhang, Burns and Nicholson [16] can mechanically calculate the WCETs of programs for a pipelined processor. Their analysis technique is based on a mathematical model of the pipelined Intel 80C188 processor. This model takes into account the overlap between instruction execution and opcode prefetching in the 80C188. In their approach, the WCET of each basic block in a program is individually calculated based on the mathematical model. The WCET of the program is then calculated using the WCETs of the constituent basic blocks and timing formulas for calculating the WCETs of various language constructs.
Although this approach represents significant progress over the previous schemes that did not consider the timing effects of pipelined execution, it still suffers from two inefficiencies. First, the pipelining effects across basic blocks are not accurately accounted for. In general, due to data dependencies and resource conflicts within the execution pipeline, a basic block's execution time will differ depending on what the surrounding basic blocks are. However, since their approach requires that the WCET of each basic block be independently calculated, they make the worst case assumption on the preceding basic block (e.g., the last instruction of every basic block that can precede the basic block being analyzed has data memory access, which prevents the opcode prefetching of the first instruction of the basic block being analyzed). This assumption is reasonable for their target processor since its pipeline has only two stages. However, completely ignoring pipelining effects across basic blocks may yield a very loose WCET estimation for more deeply pipelined processors. Second, although their mathematical model is very effective for the Intel 80C188 processor, the model is not general enough to be applicable to other pipelined processors. This is due to the many machine specific assumptions made in their model that are difficult to generalize.
B. Timing Analysis of Cache Memory
Cache memories have been widely used to bridge the speed gap between processor and main memory. However, designers of hard real-time systems are wary of using caches in their systems since the performance of caches is considered to be unpredictable. This concern stems from the following two sources: inter-task interference and intra-task interference. Inter-task interference is caused by task preemption. When a task is preempted, most of its cache blocks are displaced by the newly scheduled task and the tasks scheduled thereafter. When the preempted task resumes execution, it makes references to the previously displaced blocks and experiences a burst of cache misses. This type of cache miss cannot be avoided in real-time systems with preemptive scheduling of tasks. The result is a wide variation in task execution time. This execution time variation can be eliminated by partitioning the cache and dedicating one or more partitions to each real-time task [2, 3]. This cache partitioning approach eliminates the inter-task interference caused by task preemption.
Intra-task interference in caches occurs when more than one memory block of the same task compete with each other for the same cache block. This interference results in two types of cache miss: capacity misses and conflict misses [19]. Capacity misses are due to finite cache size. Conflict misses, on the other hand, are caused by a limited set associativity. These types of cache miss cannot be avoided if the cache has a limited size and/or set associativity.
Among the analytical WCET prediction schemes that we are aware of, only four schemes take into account the timing variation resulting from intra-task cache interference (three for instruction caches [10, 9, 7] and one for data caches [14]). The static cache simulation approach, which statically predicts hits or misses of instruction references, is due to Arnold, Mueller, Whalley and Harmon [10]. In this approach, instructions are classified into the following four categories based on a data flow analysis:

always-hit: The instruction is always in the cache.
always-miss: The instruction is never in the cache.
first-miss: The first reference to the instruction misses in the cache. However, all the subsequent references hit in the cache.
first-hit: The first reference to the instruction hits in the cache. However, all the subsequent references miss in the cache.

This approach is simple but has a number of limitations. One limitation is that the analysis is too conservative. As an example, consider the program fragment given in Fig. 1. Assume that both of the instruction memory blocks corresponding to $S_i$ (i.e., $b_i$) and $S_j$ (i.e., $b_j$) are mapped to the same cache block and that no other instruction memory block is mapped to that cache block. Further assume that the execution time of $S_i$ is much longer than that of $S_j$. Under these assumptions, the worst case execution scenario of this program fragment is to repeatedly execute $S_i$ within the loop. In this worst case scenario, only the first access to $b_i$ will miss in the cache and all the subsequent accesses within the loop will hit in the cache. However, because $b_i$ is classified as always-miss, all the references to $b_i$ are treated as cache misses in this approach, which leads to a loose estimation of the loop's WCET. Another limitation of this approach is that it does not address the issues regarding pipelined execution and the use of data caches, both of which are commonly found in RISC processors.
In [9], Niehaus et al. discuss the potential benefits of identifying instruction references corresponding to always-hit and first-miss in the static cache simulation approach. However, as stated in [10], their analysis is rather abstract and no general method for analyzing the worst case timing behavior of programs is given.
In [7], Liu and Lee propose techniques to derive WCET bounds of a cached program based on a transition diagram of cached states. Their WCET analysis uses an exhaustive search technique through the state transition diagram, which has an exponential time complexity. To reduce the time complexity of this approach, they propose a number of approximate analysis methods, each of which makes a different trade-off between the analysis complexity and the tightness of the resultant WCET bounds. Although the paper mentions that the methods are equally applicable to the data cache, the main focus is on the instruction cache since the issues pertinent to the data cache such as handling of write references and references with unknown addresses (cf. Section V) are not considered. Also, it is not clear how one can incorporate the analysis of pipelined execution into the framework.
Rawat performs a static analysis for data caches [14]. His approach is similar to the graph coloring approach to register allocation [20]. The analysis proceeds as follows. First, live ranges of variables and those of memory blocks are computed. Second, an interference graph is constructed for each cache block. An edge in the interference graph connects two memory blocks if they are mapped to the same cache block and their live ranges overlap with each other. Third, live ranges of memory blocks are split until they do not overlap with each other. If a live range of a memory block does not overlap with that of any other memory block, the memory block never gets replaced from the cache during execution within the live range. Therefore, the number of cache misses due to a memory block can be calculated from the frequency counts of its live ranges (i.e., how many times the program control flows into the live ranges). Finally, the total number of data cache misses is estimated by summing up the frequencies of all the live ranges of all the memory blocks used in the program.
Although this analysis method is a step forward from the analysis methods in which every data reference is treated as a cache miss, it still suffers from the following three limitations. First, the analysis does not allow function calls and global variables, which severely limits its applicability. Second, the analysis leads to an overestimation of data cache misses resulting from the assumption that every possible execution path can be the worst case execution path. This limitation is similar to the first limitation of the static cache simulation approach. The third limitation of this approach is that it does not address the issues of locating the worst case execution path and of calculating the WCET, again limiting its applicability.
III PIPELINING EFFECTS
In pipelined processors, the execution steps of successive instructions are overlapped. Due to this overlapped execution, an instruction's execution time will differ depending on what the surrounding instructions are. However, this timing variation could not be accurately accounted for in the original timing schema since the timing information associated with each program construct is a simple time-bound. In this section, we extend the timing schema to rectify this problem. In our extended timing schema, the timing information of each program construct is a set of reservation tables rather than a time-bound. The reservation table was originally proposed to describe and analyze the activities within a pipeline [21]. In a reservation table, the vertical dimension represents the stages in the pipeline and the horizontal dimension represents time. Fig. 2 shows a sample basic block in the MIPS assembly language [22] and the corresponding reservation table. In the figure, each x in the reservation table specifies the use of the corresponding stage for the indicated time slot. In the proposed approach, we analyze the timing interactions among instructions within a basic block by building its reservation table. In the reservation table, not only the conflicts in the use of pipeline stages but also data dependencies among instructions are considered.
A program construct such as an if statement may have more than one execution path. Moreover, in pipelined processors, it is not always possible to determine which one of the execution paths is the worst case execution path by analyzing the program construct alone. As an example, suppose that an if statement has two execution paths corresponding to the two reservation tables shown in Fig. 3. The worst case execution path here depends on the instructions in the preceding program constructs. For example, if one of the instructions near the end of the preceding program construct uses the MD stage, the execution path corresponding to $R_1$ will become the worst case execution path. On the other hand, if there is an instruction using the DIV stage instead, the execution path corresponding to $R_2$ will become the worst case execution path. Therefore, we should keep both reservation tables.

Fig. 4 shows the data structure for a reservation table used in our approach in both textual and graphical form. In the data structure, $t_{max}$ is the worst case execution time of the reservation table, which is determined by the number of columns in the reservation table. In implementation, not all the columns in the reservation table are maintained. Instead, we maintain only the first few (i.e., head) columns and the last few (i.e., tail) columns. The larger head and tail are, the tighter the resulting WCET estimation is, since more execution overlap between program constructs can be modeled, as we will see later. head = tail = $\infty$ corresponds to the case where the full reservation table is maintained.
As explained earlier, we associate with each program construct a set of reservation tables where each reservation table contains the timing information of an execution path that might be the worst case execution path of the program construct. We call this set the WCTA (Worst Case Timing Abstraction) of the program construct. This WCTA corresponds to the time-bound in the original timing schema and each element in the WCTA is denoted by $(t_{max}, head, tail)$.
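With this framework, the timing schema can be extended so that the timing interactions across program constructs can be accounted for. The timing formula of a sequential statement S: $S_1$; $S_2$ is given by

$$W(S) = W(S_1) \oplus W(S_2)$$

where $W(S)$, $W(S_1)$, and $W(S_2)$ are the WCTAs of $S$, $S_1$, and $S_2$, respectively, and the $\oplus$ operation between two WCTAs is defined as

$$W_1 \oplus W_2 = \{\, w_1 \oplus w_2 \mid w_1 \in W_1,\ w_2 \in W_2 \,\}$$

where $w_1$ and $w_2$ are reservation tables and the $\oplus$ operation concatenates two reservation tables resulting in another reservation table. This concatenation operation models the pipelined execution of a sequence of instructions followed by another sequence of instructions. The semantics of this operation for a target processor can be deduced from its data book. Fig. 5 shows an application of the $\oplus$ operation. From the figure, one can note that as more columns are maintained in head and tail, more overlap between adjacent program constructs can be modeled and, therefore, a tighter WCET estimation can be obtained. During each instantiation of this timing formula, an element can be pruned from the resulting WCTA if its WCET is guaranteed to be shorter than that of some other element in the same WCTA regardless of what the surrounding program constructs are.

The timing formula of S: if (exp) then $S_1$ else $S_2$ is given by

$$W(S) = (W(exp) \oplus W(S_1)) \cup (W(exp) \oplus W(S_2))$$

where $\cup$ is the set union operation. As in the previous timing formula, pruning is performed during each instantiation of this timing formula.

Function calls are processed like sequential statements. In our approach, functions are processed in a reverse topological order in the call graph.

Finally, the timing formula of a loop statement S: while (exp) $S_1$ is given by

$$W(S) = \left( \bigoplus_{i=1}^{N} \left( W(exp) \oplus W(S_1) \right) \right) \oplus W(exp)$$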
where $N$ is a loop bound that is provided by some external means (e.g., from user input). This timing formula effectively enumerates all the possible candidates for the worst case execution scenario of the loop statement. This approach is exact but is computationally intractable for a large $N$. In the following, we provide approximate methods for loop timing analysis.
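The set-level timing formulas for sequential, if, and loop statements can be illustrated with the following sketch. It is a toy model under loud assumptions: each element keeps a single head column and a single tail column holding one stage name, two adjacent constructs overlap by a hypothetical `OVERLAP` cycles unless those stages clash, and pruning keeps only the slowest element per (head, tail) pair. The real concatenation semantics depend on the target processor's data book and are not reproduced here.

```python
from itertools import product
from typing import NamedTuple

OVERLAP = 2  # hypothetical number of cycles two constructs overlap when no stage clashes

class Elem(NamedTuple):
    t_max: int   # worst case cycles of this execution path
    head: str    # stage used in the first column (toy: one column, one stage)
    tail: str    # stage used in the last column

def concat(w1: Elem, w2: Elem) -> Elem:
    """Toy element-level concatenation: overlap unless tail and head use the same stage."""
    overlap = 0 if w1.tail == w2.head else OVERLAP
    return Elem(w1.t_max + w2.t_max - overlap, w1.head, w2.tail)

def prune(elems) -> frozenset:
    """Keep the slowest element for each (head, tail) pair; in this toy model
    elements with the same pair interact identically with their neighbors."""
    best = {}
    for w in elems:
        key = (w.head, w.tail)
        if key not in best or w.t_max > best[key].t_max:
            best[key] = w
    return frozenset(best.values())

def seq(W1, W2) -> frozenset:
    """Sequential statement: concatenate every pair of elements."""
    return prune(concat(a, b) for a, b in product(W1, W2))

def branch(Wexp, W1, W2) -> frozenset:
    """If statement: union of the two branches, each preceded by the condition."""
    return prune(seq(Wexp, W1) | seq(Wexp, W2))

def loop(Wexp, Wbody, n) -> frozenset:
    """While statement with loop bound n: n iterations followed by the final test."""
    iteration = seq(Wexp, Wbody)
    W = None
    for _ in range(n):
        W = iteration if W is None else seq(W, iteration)
    return frozenset(Wexp) if W is None else seq(W, Wexp)
```

In the usage below, the if statement yields two elements, and the locally shorter one (7 cycles) turns into the longer path once a following construct that starts in the MD stage is concatenated, which is why both must be kept until the neighbors are known.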
Approximate Loop Timing Analysis

The problem of finding the worst case execution scenario for a loop statement with loop bound $N$ can be formulated as a problem to find the longest weighted path (not necessarily simple) containing exactly $N$ arcs in a weighted directed graph. Thus, the approximate loop timing analysis method is explained using a graph theoretic formulation.
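Let $G = (P, A)$ be a weighted directed graph where $P = \{p_1, p_2, \ldots\}$ is the set of vertices and $A$ is the set of arcs. Each vertex represents one possible execution of a loop iteration, and the weight $w_{ij}$ of the arc from $p_i$ to $p_j$ is the execution time of $p_j$ when its execution is immediately preceded by that of $p_i$. Let $D_{\ell,i,j}$ denote the maximum weight of a path from $p_i$ to $p_j$ containing exactly $\ell$ arcs. Computing $D_{\ell,i,j}$ exactly takes time that grows with the loop bound $N$. For a large $N$, this time complexity is still unacceptable. In the following, we describe a faster technique that gives a very tight upper bound for $D_{\ell,i,j}$. This technique is based on the calculation of the maximum cycle mean of $G$.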
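To make the quantity $D_{\ell,i,j}$ concrete, the sketch below computes the maximum weight of a path (not necessarily simple) with exactly `length` arcs between every pair of vertices using the standard recurrence $D_{\ell,i,j} = \max_k (D_{\ell-1,i,k} + w_{kj})$. This is an illustrative assumption about how the exact values could be obtained, not necessarily the procedure used in the paper; the weight matrix is made up, and its running time grows linearly with `length`.

```python
NEG_INF = float("-inf")

def longest_walks(w, length):
    """Return D where D[i][j] is the maximum total weight of a path from
    vertex i to vertex j containing exactly `length` arcs.
    w[k][j] is the weight of arc k -> j, or NEG_INF if the arc is absent."""
    n = len(w)
    # zero arcs: only the empty path from a vertex to itself exists
    D = [[0 if i == j else NEG_INF for j in range(n)] for i in range(n)]
    for _ in range(length):
        D = [[max(D[i][k] + w[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
    return D
```

Each step costs on the order of $|P|^3$ operations in this naive form, so the total cost scales with the loop bound, which motivates a bound that does not depend on $N$.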
The maximum cycle mean of a weighted directed graph $G$ is $m = \max_{c \in C} m(c)$ where $C$ is the set of all directed cycles in $G$ and $m(c)$ is the mean weight of cycle $c$. The maximum cycle mean can be calculated in $O(|P| \times |A|)$ time, which is independent of $N$, using an algorithm due to Karp [24]. Given the maximum cycle mean $m$ of $G$, $D_{\ell,i,j}$ can then safely be approximated by an upper bound derived from $m$.

Interference

Up to now, we have assumed that tasks execute without preemption. However, in real systems, tasks may be preempted for various reasons: preemptive scheduling, external interrupts, resource contention, and so on. For a task, these preemptions are interference that breaks the task's execution flow. The problem regarding interference is that of adjusting the prediction made under the assumption of no interference such that the prediction is applicable in an environment with interference. Fortunately, the additional per-preemption delay introduced by pipelined execution is bounded by the maximum number of cycles for which an instruction remains in the pipeline (in the MIPS R3000, it is 36 cycles in the case of the div instruction). Once this information is available, adjusting the predictions to reflect interference can be done using the techniques explained in [26].
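Returning to the maximum cycle mean used in the approximate loop timing analysis, the following is a minimal sketch of Karp's algorithm adapted from its usual minimum form to the maximum. The function name, the arc-list representation, and the choice to start walks from every vertex (equivalent to adding a zero-weight super source) are assumptions made for illustration. It runs in time proportional to the number of vertices times the number of arcs.

```python
NEG_INF = float("-inf")

def max_cycle_mean(n, arcs):
    """Maximum mean weight over all directed cycles, or None if the graph is acyclic.
    Vertices are 0..n-1; arcs is a list of (u, v, weight) triples."""
    # F[k][v]: maximum weight of a walk with exactly k arcs that ends at v
    # (walks may start at any vertex, hence F[0][v] = 0 for all v)
    F = [[NEG_INF] * n for _ in range(n + 1)]
    F[0] = [0.0] * n
    for k in range(1, n + 1):
        for u, v, wt in arcs:
            if F[k - 1][u] != NEG_INF:
                F[k][v] = max(F[k][v], F[k - 1][u] + wt)
    best = None
    for v in range(n):
        if F[n][v] == NEG_INF:
            continue  # no walk of n arcs ends here, so v constrains nothing
        # Karp's characterization: max over v of min over k of (F_n - F_k) / (n - k)
        worst = min((F[n][v] - F[k][v]) / (n - k)
                    for k in range(n) if F[k][v] != NEG_INF)
        best = worst if best is None else max(best, worst)
    return best
```

For the two-vertex graph in the test below, the cycles are the two self-loops (means 4 and 3) and the two-arc cycle (mean 1.5), so the result is 4.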
IV INSTRUCTION CACHING EFFECTS
For a processor with an instruction cache, the execution time of a program construct will differ depending on which execution path was taken prior to the program construct. This is a result of the history sensitive nature of the instruction cache. As an example, consider a program construct that accesses instruction blocks $(b_2, b_3, b_2, b_4)$ in the sequence given (cf. Fig. 6). Assume that the instruction cache has only two blocks and is direct-mapped. In a direct-mapped cache, each instruction block can be placed in exactly one cache block whose index is given by instruction block number modulo number of blocks in the cache.
In this example, the second reference to $b_2$ will always hit in the cache because the first reference to $b_2$ will bring $b_2$ into the cache and this cache block will not be replaced in the meantime. On the other hand, the reference to $b_4$ will always miss in the cache because the cache block to which $b_4$ is mapped holds $b_2$ at the time of the reference. (Note that $b_2$ and $b_4$ are mapped to the same cache block in the assumed cache configuration.) Unlike the above two references whose hits or misses can be determined by local analysis, the hit or miss of the first reference to $b_2$ cannot be determined locally and is dependent on the cache contents immediately before executing this program construct. Similarly, the hit or miss of the reference to $b_3$ will depend on the previous cache contents. The hits or misses of these two references will affect the (worst case) execution time of this program construct. Moreover, the cache contents after executing this program construct will, in turn, affect the execution time of the succeeding program construct in a similar way. These timing variations, again, cannot be accurately represented by a simple time-bound of the original timing schema.
This situation is similar to the case of pipelined execution discussed in the previous section and, therefore, we adopt the same strategy; we simply extend the timing information of elements in the WCTA, leaving the timing formulas intact. Each element in the WCTA now has two sets of instruction block addresses in addition to $t_{max}$, head, and tail used for the timing analysis of pipelined execution. Fig. 7 gives the data structure for an element in the WCTA in this new setting where n_block denotes the number of blocks in the cache.
In the given data structure, the first set of instruction block addresses (i.e., first_reference) maintains the instruction block addresses of the references whose hits or misses depend on the cache contents prior to the program construct. In other words, this set maintains for each cache block the instruction block address of the first reference to the cache block. The second set (i.e., last_reference) maintains the addresses of the instruction blocks that will remain in the cache after the execution of the program construct. In other words, this set maintains for each cache block the instruction block address of the last reference to the cache block. These are the cache contents that will determine the hits or misses of the instruction block references in the first_reference of the succeeding program construct. In calculating $t_{max}$, we accurately account for the hits and misses that can be locally determined such as the second reference to $b_2$ and the reference to $b_4$ in the previous example. However, the instruction block references whose hits or misses are not known (i.e., those in first_reference) are conservatively assumed to miss in the cache in the initial estimate of $t_{max}$. This initial estimate is later refined as the information on the hits or misses of those references becomes available at a later stage of the analysis. Fig. 8 shows the timing information maintained for the program construct given in the previous example.

Fig. 8. Contents of the WCTA element corresponding to the example in Fig. 6
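The local analysis described above can be sketched for a direct-mapped cache as follows. The function name `analyze` and the string labels are illustrative assumptions; `None` marks a cache block that the program construct never touches.

```python
def analyze(refs, n_block):
    """Locally classify instruction block references in a direct-mapped cache.

    Returns (first_reference, last_reference, outcome), where outcome[k] is
    'hit', 'miss', or 'unknown' for the k-th reference in refs."""
    first_reference = [None] * n_block
    last_reference = [None] * n_block
    outcome = []
    for b in refs:
        idx = b % n_block                  # direct-mapped placement
        if last_reference[idx] is None:
            first_reference[idx] = b       # depends on cache contents on entry
            outcome.append("unknown")
        elif last_reference[idx] == b:
            outcome.append("hit")          # block is still resident
        else:
            outcome.append("miss")         # displaced by another block
        last_reference[idx] = b
    return first_reference, last_reference, outcome
```

For the running example, `analyze([2, 3, 2, 4], 2)` reports the first references to $b_2$ and $b_3$ as unknown, the second reference to $b_2$ as a hit, and the reference to $b_4$ as a miss, with $b_4$ and $b_3$ left in the cache.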
With this extension, the timing formula of S: $S_1$; $S_2$ is given by

$$W(S) = W(S_1) \oplus W(S_2)$$
This timing formula is structurally identical to the one given in the previous section for the sequential statement. The differences are in the structure of the elements in the WCTAs and in the semantics of the concatenation operation $\oplus$. The revised semantics of the $\oplus$ operation is procedurally defined in Fig. 9.
The function concatenate given in the figure concatenates two input elements $w_1$ and $w_2$ and puts the result into $w_3$, thus implementing the $\oplus$ operation. In lines 9-12 of function concatenate, $w_3$ inherits $w_1$'s first_reference if the corresponding cache block is accessed in $w_1$. If the cache block is not accessed in $w_1$, the first reference to the cache block in $w_1 \oplus w_2$ is from $w_2$. Therefore, $w_3$ inherits $w_2$'s first_reference in that case. By the same reasoning, $w_3$ inherits $w_2$'s last_reference if the corresponding cache block is accessed in $w_2$ and $w_1$'s last_reference otherwise. The $t_{max}$ of $w_3$ is calculated in lines 17-18. In this calculation, the $\oplus_{pipeline}$ operation is the $\oplus$ operation defined in the previous section for the timing analysis of pipelined execution and t_miss_penalty is the time needed to service a cache miss.

As before, an element in a WCTA can safely be eliminated (i.e., pruned) from the WCTA if we can guarantee that the element's WCET is always shorter than that of some other element in the same WCTA regardless of what the surrounding program constructs are. This condition for pruning is procedurally specified in Fig. 10. The function prune given in the figure checks whether either one of the two execution paths corresponding to the two input elements $w_1$ and $w_2$ can be pruned and returns the pruned element if the pruning is successful and null if neither of them can be pruned.

Fig. 10. Semantics of pruning operation
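The cache-related part of the concatenation can be sketched as below. This is not the procedure of Fig. 9: plain addition of `t_max` stands in for the pipeline concatenation, head and tail are omitted, the miss penalty is a made-up constant, and `None` marks a cache block not accessed by an element. The sketch only shows how first_reference and last_reference are inherited and how a conservatively assumed miss is refined into a hit.

```python
from dataclasses import dataclass
from typing import List, Optional

T_MISS_PENALTY = 10  # hypothetical cache miss penalty in cycles

@dataclass
class Elem:
    t_max: int
    first_reference: List[Optional[int]]  # per cache block; None if not accessed
    last_reference: List[Optional[int]]

def concatenate(w1: Elem, w2: Elem) -> Elem:
    t_max = w1.t_max + w2.t_max   # stand-in for the pipeline concatenation
    first, last = [], []
    for f1, l1, f2, l2 in zip(w1.first_reference, w1.last_reference,
                              w2.first_reference, w2.last_reference):
        first.append(f1 if f1 is not None else f2)   # w1's if w1 touches the block
        last.append(l2 if l2 is not None else l1)    # w2's if w2 touches the block
        # w2's first reference finds its block left in the cache by w1: it hits,
        # so the miss assumed in w2.t_max is taken back
        if f2 is not None and f2 == l1:
            t_max -= T_MISS_PENALTY
    return Elem(t_max, first, last)
```

In the test below, the second element begins by referencing the blocks the first one leaves in the cache, so two assumed misses are refined into hits.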
In the function prune, lines 6-12 determine how many entries in $w_1$'s first_reference and last_reference are different from the corresponding entries in $w_2$'s first_reference and last_reference. The difference bounds the cache memory related execution time variation between $w_1$ and $w_2$. Line 13 checks whether $w_2$ can be pruned by $w_1$. Pruning of $w_2$ by $w_1$ can be made if $w_2$'s WCET assuming the worst case scenario for $w_2$ is shorter than $w_1$'s WCET assuming $w_1$'s best case scenario. Likewise, line 16 checks whether $w_1$ can be pruned by $w_2$.
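One possible reading of this pruning condition, restricted to cache effects, is sketched below. It is an assumption rather than the content of Fig. 10: pipeline-related variation through head and tail is ignored, and each differing entry is taken to shift the relative execution time of the two elements by at most one miss penalty.

```python
from typing import NamedTuple, Optional, Tuple

class Elem(NamedTuple):
    t_max: int
    first_reference: Tuple[Optional[int], ...]
    last_reference: Tuple[Optional[int], ...]

def prune(w1: Elem, w2: Elem, t_miss_penalty: int) -> Optional[Elem]:
    """Return the element that can safely be dropped, or None if neither can."""
    n_diff = sum(a != b for a, b in zip(w1.first_reference, w2.first_reference))
    n_diff += sum(a != b for a, b in zip(w1.last_reference, w2.last_reference))
    slack = n_diff * t_miss_penalty   # bound on cache-related timing variation
    if w2.t_max + slack < w1.t_max:   # w2 at its worst is still shorter than w1 at its best
        return w2
    if w1.t_max + slack < w2.t_max:   # symmetric check
        return w1
    return None
```

Entries that agree in both elements affect both paths equally and therefore contribute nothing to the slack.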
Again as before, the timing formula of S: if (exp) then $S_1$ else $S_2$ is given by

$$W(S) = (W(exp) \oplus W(S_1)) \cup (W(exp) \oplus W(S_2))$$
As in the previous section, the problem of calculating $W(S)$ for a loop statement S: while (exp) $S_1$ can be formulated as a graph theoretic problem. Here, $wcta(wp^N_{ij})$ is given by

$$(p_i.t_{max} + D'_{N-1,i,j},\ p_i.head,\ p_j.tail,\ p_i.first\_reference,\ p_j.last\_reference)$$

After calculating $wcta(wp^N_{ij})$ for all $p_i, p_j \in P$, $W(S)$ can be computed from them.
The loop timing analysis discussed in the previous section assumes that each loop iteration benefits only from the immediately preceding loop iteration. This is because in the calculation of $w_{ij}$, we only consider the execution time reduction of $p_j$ due to the execution overlap with $p_i$. This assumption holds in the case of pipelined execution since the execution time of an iteration's head is affected only by the tail of the immediately preceding iteration. In the case of cache memory, however, the assumption does not hold in general. For example, an instruction memory reference may hit in a cache block that was loaded into the cache in an iteration other than the immediately preceding one. Nevertheless, since the assumption is conservative, the resulting worst case timing analysis is safe in the sense that the result does not underestimate the WCET of the loop statement. The degradation of accuracy resulting from this conservative assumption can be reduced by analyzing a sequence of $k$ ($k > 1$) iterations at the same time rather than just one iteration [25]. In this case, each vertex represents an execution of a sequence of $k$ iterations and $w_{ij}$ is the execution time of sequence $j$ when its execution is immediately preceded by an execution of sequence $i$. This analysis corresponds to the analysis of the loop unrolled $k$ times and trades increased analysis complexity for more accurate WCTA calculation.

a) Set associative caches: Up to now we have considered only the simplest cache organization called the direct-mapped cache, in which each instruction block can be placed in exactly one cache block. In a more general cache organization called the n-way set associative cache, each instruction block can be placed in any one of the n blocks in the mapped set. Set associative caches need a policy that decides which block to replace among the blocks in the set to make room for a block fetched on a cache miss. The LRU (Least Recently Used) policy is typically used for that purpose. Once this replacement policy is given (assuming that it is not random), it is straightforward to implement the $\oplus$ and prune operations needed in our analysis method.
V DATA CACHING EFFECTS
The timing analysis of data caches is analogous to that of instruction caches. However, the former differs from the latter in several important ways. First, unlike instruction references, the actual addresses of some data references are not known at compile-time. This complicates the timing analysis of data caches since the calculation of first_reference and last_reference, which is the most important aspect of our cache timing analysis, assumes that the actual address of every memory reference is known at compile-time. This complication, however, can be avoided completely if simple hardware support in the form of one bit in each load/store instruction is available. This bit, called the allocate bit, decides whether the memory block fetched on a miss will be loaded into the cache. For a data reference whose address cannot be determined at compile-time, the allocate bit is set to zero, preventing the memory block fetched on a miss from being loaded into the cache. For other references, this bit is set to one, allowing the fetched block to be loaded into the cache. With this hardware support, the worst case timing analysis of data caches can be performed very much like that of instruction caches, i.e., treating the references whose addresses are not known at compile-time as misses and completely ignoring them in the calculation of first_reference and last_reference. Even when such hardware support is not available, the worst case timing analysis of data caches is still possible by taking two cache miss penalties for each data reference whose address cannot be determined at compile-time, and then ignoring the reference in the analysis [27]. One cache miss penalty is due to the fact that the reference may miss in the cache. The other is due to the fact that the reference may replace a cache block that contributes a cache hit in our analysis.
The second difference stems from accesses to local variables. In general, the data area for the local variables of a function, called the activation record of the function, is pushed onto and popped from a runtime stack as the associated function is called and returns. In most implementations, a specially designated register, called sp (Stack Pointer), marks the top of the stack and each local variable is addressed by an offset relative to sp. The offsets of local variables are determined at compile-time. The sp value of a function, however, differs depending on where the function is called from. Fortunately, the number of distinct sp values a function may have is bounded. Therefore, the WCTA of a function can be computed for each sp value the function may have. Such sp values can be calculated from the activation record sizes of functions and the call graph.
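The calculation of the possible sp values can be sketched as a walk over the call graph. The code below is an illustrative sketch only: the function name `sp_offsets`, the dictionary-based call graph, and the example frame sizes are assumptions made here, and the sketch assumes a call graph without recursion. Each function is mapped to the set of stack depths (bytes already occupied by its callers' activation records) at which its own activation record can begin; the actual sp value then follows from the initial sp and the direction of stack growth.

```python
from typing import Dict, List, Set


def sp_offsets(call_graph: Dict[str, List[str]], frame_size: Dict[str, int],
               root: str) -> Dict[str, Set[int]]:
    """Map each function to the stack depths at which it can be entered.

    Assumes an acyclic call graph (no recursion), so the walk terminates.
    """
    offsets: Dict[str, Set[int]] = {f: set() for f in frame_size}

    def visit(func: str, depth: int) -> None:
        if depth in offsets[func]:
            return  # this (function, depth) pair was already expanded
        offsets[func].add(depth)
        for callee in call_graph.get(func, []):
            # The callee starts below the caller's activation record.
            visit(callee, depth + frame_size[func])

    visit(root, 0)
    return offsets


# Hypothetical program: main calls f and g, and f also calls g.
example_graph = {"main": ["f", "g"], "f": ["g"], "g": []}
example_frames = {"main": 32, "f": 16, "g": 24}
example_offsets = sp_offsets(example_graph, example_frames, "main")
```

In this hypothetical program, g can be entered at two depths (directly from main, or through f), so its WCTA would be computed twice, once per sp value.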
The final difference is due to write accesses. Unlike instruction references, which are read-only, data references may both read from and write to memory. In data caches, either the write-through or the write-back policy is used to handle write accesses [18]. In the write-through policy, the effect of each write is reflected on both the block in the cache and the block in main memory. On the other hand, in the write-back policy, the effect is reflected only on the block in the cache and a dirty bit is set to indicate that the block has been modified. When a block whose dirty bit is set is replaced from the cache, the block's contents are written back to main memory.
The timing analysis of data caches with the write-through policy is relatively simple. One simply has to add a delay to each write access to account for the accompanying write access to main memory. However, the timing analysis of data caches with the write-back policy is slightly more complicated. In a write-back cache, a sequence of write accesses to a cached memory block without a replacement in between, which we call a write run, requires only one write-back to main memory. We attribute this write-back overhead (i.e., delay) to the last write in the write run, which we call the tail of the write run. With this setting, one has to determine whether a given write access can be a tail in order to accurately estimate the delay due to write-backs. In some cases, local analysis can determine whether a write access is a tail or not, as in the case of hit/miss analysis for a memory reference. However, local analysis is not sufficient to determine whether a write access is a tail in every case. Hence, when this is not possible, we conservatively assume that the write access is a tail and add a write-back delay to t_max. If later analysis over the program syntax tree reveals that the write access is not a tail, we subtract the incorrectly attributed write-back delay from t_max. This global analysis can be performed by providing a few bits to each block in first reference and last reference and augmenting the concatenation and pruning operations [27].
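The local part of the tail analysis can be sketched for a straight-line sequence of accesses. This is a simplified illustration rather than the analysis of [27]: it assumes a direct-mapped, write-allocate cache in which every address is known, and the names `classify_writes` and `writeback_cycles`, the three verdict strings, and the example delay are hypothetical. A write is classified as "not_tail" when a later write to the same block arrives before any conflicting block, as "tail" when a conflicting block arrives first, and as "unknown" when the sequence ends before either happens.

```python
from typing import List, Tuple

Access = Tuple[str, int]  # ("r" or "w", memory block number)


def classify_writes(seq: List[Access], num_sets: int) -> List[Tuple[int, str]]:
    """Classify each write in seq as "tail", "not_tail", or "unknown"."""
    result = []
    for i, (kind, block) in enumerate(seq):
        if kind != "w":
            continue
        verdict = "unknown"  # the write run may continue past this sequence
        for later_kind, later_block in seq[i + 1:]:
            if later_block == block and later_kind == "w":
                verdict = "not_tail"  # a later write extends the same run
                break
            if later_block != block and later_block % num_sets == block % num_sets:
                verdict = "tail"  # the block is replaced, ending the run
                break
        result.append((i, verdict))
    return result


def writeback_cycles(seq: List[Access], num_sets: int,
                     writeback_delay: int) -> int:
    """Conservative charge: every write not proven "not_tail" pays a write-back."""
    verdicts = classify_writes(seq, num_sets)
    return sum(writeback_delay for _, v in verdicts if v != "not_tail")


# Hypothetical sequence for a 4-set cache; blocks 0 and 4 map to the same set.
example_seq = [("w", 0), ("w", 0), ("r", 4), ("w", 1)]
```

The delay charged for an "unknown" write is the one that the later global analysis may subtract again once the surrounding program constructs are known.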
VI EXPERIMENTAL RESULTS
We tested whether our extended timing schema approach could produce useful WCET bounds by building a timing tool based on the approach and comparing the WCET bounds predicted by the timing tool to the measured times. Our timing tool consists of a compiler and a timing analyzer. We chose an IDT7RS383 board as the timing tool's target machine. The target machine's CPU is a 20 MHz R3000 processor, which is a typical RISC processor. The R3000 processor has a five-stage integer pipeline and an interface for off-chip instruction and data caches. It also has an interface for an off-chip Floating-Point Unit (FPU).
The IDT7RS383 board contains instruction and data caches of 16 Kbytes each. Both caches are direct-mapped and have block sizes of 4 bytes. The data cache uses the write-through policy and has a one-entry deep write buffer. The cache miss service times of both the instruction and data caches are 4 cycles. The FPU used in the board is a MIPS R3010. Although the board has a timer chip that provides user-programmable timers, their resolutions are too low for our measurement purposes. To facilitate the measurement of program execution times in machine cycles, we built a daughter board that consists of simple decoding circuits and counter chips, and provides one user-programmable timer. The timer is started and stopped by writing to specific memory locations and has a resolution of one machine cycle (50 ns).
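The arithmetic implied by these parameters can be checked with a short sketch. The constant and function names are chosen here for illustration, and 16 Kbytes is taken to mean 16 × 1024 bytes, which is an assumption.

```python
CLOCK_HZ = 20_000_000      # 20 MHz R3000
CACHE_BYTES = 16 * 1024    # size of each cache (assuming 1 Kbyte = 1024 bytes)
BLOCK_BYTES = 4            # cache block size
MISS_CYCLES = 4            # cache miss service time

NUM_BLOCKS = CACHE_BYTES // BLOCK_BYTES   # blocks in each direct-mapped cache
CYCLE_NS = 1_000_000_000 // CLOCK_HZ      # length of one machine cycle in ns


def cache_index(address: int) -> int:
    """Cache block that a byte address maps to in a direct-mapped cache."""
    return (address // BLOCK_BYTES) % NUM_BLOCKS


def cycles_to_ns(cycles: int) -> int:
    """Convert a cycle count from the timer into nanoseconds."""
    return cycles * CYCLE_NS
```

Under these assumptions each cache holds 4096 blocks, two addresses that are 16 Kbytes apart compete for the same block, and a single cache miss costs 200 ns.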
Three simple benchmark programs were chosen: Clock, Sort and MM. The Clock benchmark is a program used to implement a periodic timer. The program periodically checks 20 timers held in a linked list and, if any of them expires, calls the corresponding handler function. The Sort benchmark sorts an array of 20 integer numbers and the MM program multiplies two 5 × 5 floating-point matrices. Table 1 compares the WCET bounds predicted by the timing tool and the measured execution times for the three benchmark programs. In all three cases, the tool gives fairly tight WCET bounds (within a maximum of about 30% overestimation). A closer inspection of the results revealed that more than 90% of the overestimation is due to data references whose addresses are not known at compile-time. (Remember that we have to account for two cache miss penalties for each such data reference.)
Table 1. Predicted and measured execution times of the benchmark programs
Program execution time is heavily dependent on the program execution path, and the logic of most programs severely limits the set of possible execution paths. However, we intentionally chose benchmark programs that do not suffer from overestimation due to infeasible paths. The rationale behind this selection is that predicting tighter WCET bounds by eliminating infeasible paths using dynamic path analysis is an issue orthogonal to our approach and that this analysis can be introduced into the existing timing tool without modifying the extended timing schema framework. In fact, a method for analyzing dynamic program behavior to eliminate infeasible paths of a program within the original timing schema framework is given in [29] and we feel that our timing tool will equally benefit from the proposed method.
We view the experimental work reported here as an initial step toward validating our extended timing schema approach. Clearly, much experimental work, especially with programs used in real systems, needs to follow to demonstrate that our approach is practical for realistic systems.
VII CONCLUSION
In this paper, we described a technique that aims at accurately estimating the WCETs of tasks for RISC processors. In the proposed technique, two kinds of timing information are associated with each program construct. The first type of information is about the factors that may affect the timing of the succeeding program construct. The second type of information is about the factors that are needed to refine the execution time of the program construct when the first type of timing information of the preceding program construct becomes available at a later stage of WCET analysis. We extended the existing timing schema using these two kinds of timing information so that we can accurately account for the timing variations resulting from the history-sensitive nature of pipelined execution and cache memory. We also described an optimization that minimizes the overhead of the proposed technique by pruning the timing information associated with an execution path that cannot be part of the worst case execution path.
We also built a timing analyzer based on the proposed technique and compared the WCET bounds of sample programs predicted by the timing analyzer to their measured execution times. The timing analyzer gave fairly tight predictions (within a maximum of about 30% overestimation) for the benchmark programs we used, and the sources of the overestimation were identified.
The proposed technique has the following advantages. First, the proposed technique makes possible an accurate analysis of the combined timing effects of pipelined execution and cache memory, which was not possible previously. Second, the timing analysis using the proposed technique is more accurate than that of any other technique we are aware of. Third, the proposed technique is applicable to most RISC processors with in-order issue and single-level cache memory. Finally, the proposed technique is extensible in that its general rule may be used to model other machine features that have history-sensitive timing behavior. For example, we used the underlying general rule to model the timing variation due to write buffers [27].
One direction for future research is to investigate whether or not the proposed technique applies to more advanced processors with out-of-order issue [30] and/or multi-level cache hierarchies [18]. Another research direction is the development of theory and methods for the design of a retargetable timing analyzer. Our initial investigation of this issue was made in [31]. The results indicated that the machine-dependent components of our timing analyzer, such as the routines that implement the concatenation and pruning operations of the extended timing schema, can be automatically generated from an architecture description of the target processor. The details of the approach are not repeated here and interested readers are referred to [31].
