Abstract Real-time 
Introduction
The execution time of a program running on a given system may vary significantly according to different input data values and initial system state. In many cases it is essential to know the worst case execution time (WCET) of a program running on a particular hardware system. This WCET is useful in many areas. In hard real-time systems, the designer must prove that the WCET satisfies the timing deadlines. Many real-time operating systems rely on this information for process scheduling. In embedded system designs, the WCET of the software is often required for deciding how hardware/software partitioning is done.
The actual WCET of a program cannot be determined unless we simulate all possible combinations of input data values and initial system states. This is clearly impractical due to the huge number of simulations required. As a result, we can only obtain an estimate of the actual WCET by performing a static analysis of the program. For it to be useful, the estimated WCET must be tight and conservative such that it bounds the actual WCET without introducing undue pessimism.
(Footnote, on the name of the tool cinderella: In recognition of her hard real-time constraint: she had to be back home at the stroke of midnight!)
The objective of this paper is to examine the problem of determining the estimated WCET of a given program on a given hardware system, assuming uninterrupted execution. There are two components involved in solving this problem. They are:
1. Program path analysis. This determines what sequence of instructions will be executed in the worst case scenario. Infeasible program paths should be removed from the solution search space as much as possible. This can be done by a data flow analysis of the program, but is more effective with the help of the programmer. Therefore, the analysis should provide a mechanism for program path annotations. Another important aspect is that the number of program paths is typically exponential in the program size. An efficient path analysis method is required to avoid exhaustive program path search.
2. Microarchitecture modeling. This models the hardware system and computes the WCET of a given sequence of instructions. This is becoming difficult to model because most modern processors have pipelined instruction execution units and cached memory systems. These features, while speeding up the typical performance of the system, complicate timing analysis. The execution time of a single instruction depends on many factors and varies more than in the previous generation of microprocessors. The cache memory is particularly difficult to model accurately. To determine whether or not the execution of an instruction results in a cache hit, several previously executed instructions must be examined. Any incorrect prediction will result in large pessimism.
Both components must be studied well and together in order to obtain a tight estimated WCET. We will study program path analysis in Section 3 and microarchitecture modeling in Section 4.
Related Work
The problem of determining a program's estimated WCET is in general undecidable and is equivalent to the halting problem. Kligerman and Stoyenko [1], as well as Puschner and Koza [2], have listed the conditions for this problem to be decidable. These conditions are: (i) absence of recursive function calls, (ii) absence of dynamic structures and (iii) bounded loops. These restrictions can be imposed either through specific language constructs or through an external annotation mechanism. These researchers adopt the first approach. They develop the languages Real-Time Euclid and MARS-C respectively. The main drawback of this approach is that it comes with the usual high costs associated with a new programming language. Mok et al. [3], and Park and Shaw [4] choose the latter approach. The loop bounds and other path information are annotated in a separate file. The analyzing tool reads the annotated file and the executable code of the program and then computes the estimated WCET. We believe this is a better approach, as it does not limit the choice of programming language and it requires only minimal additional programming tools.
Most researchers have recognized that in order to tighten the estimated WCET, it is necessary to remove some logically infeasible program paths. While some of them can be automatically inferred from the program using data flow and symbolic analysis techniques, it is widely felt that this is a difficult task. In contrast, it is relatively less difficult for the programmer to provide such information, since the programmer is familiar with what the program is supposed to do. The scope of the path information that the programmer can provide may have a direct impact on the tightness of the estimated WCET. Initial efforts [2, 3] to include this information were restricted to providing loop bounds and maximum execution counts of program statements. Subsequent work by Park and Shaw accepts path annotations regarding interactions among program statements (e.g., two statements are mutually exclusive). They use regular expressions to represent all feasible program paths. While this approach is powerful in describing program paths and can evaluate the worst case program path, it is computationally expensive. As a result, several compromises are made to limit both the scope of user annotations and the tightness of the analysis.
Microarchitecture modeling handles the timing analysis of a known sequence of instructions. It is widely agreed that this must be done at the assembly level in order to capture all the effects of compiler optimizations and the microarchitecture implementation. Previously, the modeling was simple because the execution time of an instruction was largely independent of others. Researchers normally assumed that the execution time of an instruction was a constant and was equal to its worst case execution time throughout the execution of the program. In modern processors, however, the execution time of an instruction depends on its surrounding instructions and it varies more than in the previous generations of processors. Using the above simple model will result in a very loose estimated WCET. Worse, as the processors become more complicated, the pessimism due to inaccurate modeling is becoming a dominant factor and is increasing. This problem must be overcome before modern processors can be used efficiently in the real-time community. Much research effort has been shifted from path analysis to microarchitecture modeling.
The CPU pipeline is considered to be relatively easy to model because it is only affected by adjacent instructions. The cache memory system poses a much bigger challenge. Fetching an instruction may result in a cache miss, causing it and perhaps some surrounding instructions to be loaded into the cache line. This action has two effects: (i) subsequent instructions are more likely to be found in the cache, and (ii) the displaced cache contents may later cause cache misses if they are needed again. Therefore, during the program path analysis stage, if any portion of the instruction sequence is altered, the cache memory activities of the whole sequence will be affected and a new cache analysis is required. This often leads to explicit enumeration of program paths. Data cache analysis is even more difficult because some data addresses may not be determined statically.
Several methods for WCET analysis with instruction cache modeling have been proposed. Liu and Lee [5] ... [9] handles data cache performance analysis by using graph-coloring techniques. However, this approach has limited success even for small programs. A severe drawback of all the above methods is that they cannot handle any user annotations describing infeasible program paths, which are essential in tightening the estimated WCET.
Explicit path enumeration is not a necessity in obtaining a tight estimated WCET. An important observation here is that the WCET can be computed by methods other than path enumeration. We propose a method that determines the worst case execution counts of the instructions and, from these counts, computes the estimated WCET. The main advantage of this method is that it reduces the solution search space significantly. Further, as we will show in Section 4, only minimal necessary sequencing information is kept in doing the cache analysis. No path enumeration is needed. The method supports user annotations that are at least as powerful as Park's Information Description Language (IDL), and at the same time computes cache memory activity far more accurately than Lim's work. To the best of our knowledge, our research is the first to address both issues together.
Program Path Analysis
In this section, we will show how our method handles the program path analysis problem. Here, we use a simple microarchitecture model that assumes the execution time of an instruction to be a constant, i.e., every instruction fetch is assumed to result in a cache miss. This pessimistic assumption will be removed in Section 4 when we introduce a more sophisticated microarchitecture modeling that includes the cache analysis.
As stated in the previous section, our method uses the counting approach to compute the estimated WCET. The method converts the problem of solving the estimated WCET into a set of integer linear programming (ILP) problems in which the estimated WCET and the worst case execution counts of the instructions are solved for. There are similarities between our analysis technique and the one used by Avrunin ... More details will be provided in the following subsections.
ILP Formulation
Since we assume that each instruction takes a constant time to execute, the total execution time can be computed by summing the products of instruction counts and their corresponding instruction execution times. Furthermore, since the instructions within a basic block are always executed together, their execution counts are always the same. This allows us to consider them as a single unit. If we let xi be the execution count of a basic block Bi, and ci be the execution time of the basic block, then given that there are N basic blocks in the program, the total execution time of the program is given as:
$$\text{Total execution time} = \sum_{i=1}^{N} c_i x_i. \qquad (1)$$
... xi, is associated with each node. Each edge in the CFG is labeled with a variable di which serves both as a label for that edge and as a count of the number of times that the program control passes through that edge. Analysis of the CFG is equivalent to a standard network-flow problem. Structural constraints can be derived from the CFG from the fact that, for each node Bi, its execution count is equal to the number of times that the control enters the node (inflow), and is also equal to the number of times that the control exits the node (outflow). The structural constraints of this example are:
The first constraint (2) is needed to specify that the code fragment is executed once. The structural constraints do not provide any loop bound information. This information can be provided by the user as a functionality constraint. In this example, we note that since k is positive before it enters the loop, the loop body will be executed between 0 and 10 times each time the loop is entered. The constraints to specify this information are:
The functionality constraints can also be used to specify other path information. For example, we observe that the else statement (B5) can be executed at most once inside the loop. This information can be specified as:
More complicated path information can also be specified. For instance, the user may know that if the else statement is executed, then the loop will be executed exactly 5 times. The constraint to represent this information is:
Here, the symbols '&' and '|' represent conjunction and disjunction respectively. This constraint is not a linear constraint by itself, but a disjunction of linear constraint sets. This can be viewed as a set of constraint sets, where at least one constraint set member must be satisfied. We have been able to show that all language constructs in Park's IDL can be transformed into sets of linear constraints. As a result, using the linear constraints is at least as descriptive as using IDL.
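The interplay of structural and functionality constraints can be sketched in a few lines of Python. The sketch below is not the paper's example: the seven-block CFG (a loop whose body is an if/else), the edge numbering d1..d10, and the costs in COST are all hypothetical, and a brute-force search over two free counts stands in for the ILP solver. It checks flow conservation at every node, applies a loop bound of at most 10 iterations and an "else at most once" constraint, and maximizes the sum of ci times xi.

```python
from itertools import product

# Hypothetical per-block execution times c_i in cycles (illustrative only).
COST = {1: 4, 2: 2, 3: 3, 4: 5, 5: 12, 6: 2, 7: 1}

def counts(n_then, n_else):
    """Derive edge counts d and block counts x from the two free choices."""
    d = {1: 1, 2: 1, 4: n_then, 5: n_else, 6: n_then, 7: n_else, 9: 1, 10: 1}
    d[3] = n_then + n_else          # entries into the loop body
    d[8] = d[6] + d[7]              # back edge to the loop header
    x = {1: d[1], 2: d[2] + d[8], 3: d[3], 4: d[4],
         5: d[5], 6: d[8], 7: d[9]}
    return x, d

def structural_ok(x, d):
    """Inflow == count == outflow at every node; the fragment runs once."""
    return (d[1] == 1
            and d[1] == x[1] == d[2]
            and d[2] + d[8] == x[2] == d[3] + d[9]
            and d[3] == x[3] == d[4] + d[5]
            and d[4] == x[4] == d[6]
            and d[5] == x[5] == d[7]
            and d[6] + d[7] == x[6] == d[8]
            and d[9] == x[7] == d[10])

def functionality_ok(x):
    """Loop body runs 0..10 times per entry; else block runs at most once."""
    return 0 * x[1] <= x[3] <= 10 * x[1] and x[5] <= 1

def total_time(x):
    """Cost function: sum of c_i * x_i over all basic blocks."""
    return sum(COST[i] * x[i] for i in x)

# Brute force over the two free counts stands in for the ILP solver.
best_time, best_x = max(
    ((total_time(x), x) for x, d in
     (counts(t, e) for t, e in product(range(11), repeat=2))
     if structural_ok(x, d) and functionality_ok(x)),
    key=lambda pair: pair[0])
```

With these made-up costs the search settles on ten loop iterations, one of which takes the more expensive else branch. A real tool would hand the same constraints to an ILP solver rather than enumerate.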
Solving the Constraints
Because of the '&' and '|' operators, the program functionality constraints may, in general, be a disjunction of conjunctive constraint sets. To solve the estimated WCET, each set of the functionality constraint sets is combined (the conjunction taken) with the set of structural constraints. The combined set is passed to the ILP solver with (1) to be maximized. The ILP solver returns the maximum value of the expression, as well as the basic block counts. The above procedure is repeated for every functionality constraint set. The maximum over all these running times is the estimated WCET. The total time required to solve the estimated WCET depends on the number of functionality constraint sets and the time to solve each constraint set. Although the number of functionality constraint sets doubles every time a functionality constraint with the disjunction operator '|' is added, we found the size to be small for all the experiments we did, as reported in Section 6. The second issue is the complexity of solving each ILP problem, which is, in general, an NP-hard problem.
We are able to demonstrate that if we restrict our functionality constraints to those that correspond to the constructs in IDL, then the ILP problem collapses to a network flow problem, which can be solved in polynomial time. Our experiments show that the time to solve the estimated WCET is negligible. 
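The procedure of expanding disjunctions into conjunctive constraint sets, solving each set, and taking the maximum can be illustrated as follows. This is a minimal sketch under stated assumptions: the timing model in total_time, the two counts n_body and n_else, and the constraint "else never runs | (else runs once & the loop runs exactly 5 times)" are hypothetical, and exhaustive search again replaces the ILP solver.

```python
from itertools import product

def total_time(n_body, n_else):
    """Hypothetical cost: 7 fixed cycles, 12 per 'then' pass, 19 per 'else' pass."""
    return 7 + 12 * (n_body - n_else) + 19 * n_else

# A functionality constraint is a list of alternatives (joined by '|');
# each alternative is a list of predicates (joined by '&').
loop_bound = [[lambda n, e: 0 <= n <= 10]]
else_rule = [
    [lambda n, e: e == 0],
    [lambda n, e: e == 1, lambda n, e: n == 5],
]

def expand(constraints):
    """Turn a conjunction of disjunctions into a list of conjunctive sets."""
    sets = []
    for choice in product(*constraints):
        sets.append([pred for alt in choice for pred in alt])
    return sets

def maximize(constraint_set):
    """Exhaustive stand-in for one ILP call; returns (time, n_body, n_else)."""
    feasible = [(total_time(n, e), n, e)
                for n in range(11) for e in range(n + 1)
                if all(pred(n, e) for pred in constraint_set)]
    return max(feasible) if feasible else None

constraint_sets = expand([loop_bound, else_rule])
results = [maximize(s) for s in constraint_sets]
wcet = max(r[0] for r in results if r is not None)
```

Each constraint with one '|' doubles the number of sets returned by expand, which is why the number of solver calls grows with the number of disjunctive annotations.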
Microarchitecture Modeling
Our goal is to model the CPU pipeline and the cache memory systems and find out the execution times (ci's) of the basic blocks. In this paper, we will limit our method to model a direct-mapped instruction cache. However, it can be extended to handle set associative instruction cache memory.
Direct-mapped Instruction Cache Analysis
To incorporate cache memory analysis into our ILP model shown in the previous section, we will need to modify the cost function (1) and add a list of linear constraints, denoted as cache constraints, representing the cache memory behavior. These will be described in the following subsections.
Modified Cost Function
With cache memory, the execution time of an instruction will be different depending on whether it results in a cache hit or cache miss. Thus, we need to subdivide the original instruction counts into counts of cache hits and misses. If we can determine these counts, and the hit and miss execution times of each instruction, then a tighter bound on the execution time of the program can be established. As in the previous section, we can group adjacent instructions together. We define a new type of atomic structure for analysis, the line-block or simply l-block. An l-block is defined as a contiguous sequence of instructions within the same basic block that are mapped to the same line in the instruction cache. All instructions within an l-block will always have the same cache hit/miss counts, and the same total execution counts. Fig. 2(i) shows a CFG with 3 basic blocks. Suppose that the instruction cache has 4 lines. Since the starting address of each basic block can be determined from the program's executable code, we can find all the cache lines that instructions within it map to, and add an entry on these cache lines in the cache table (Fig. 2(ii)). The boundary of each l-block is shown by the solid line rectangle. Suppose a basic block Bi is partitioned into ni l-blocks. We denote these l-blocks as Bi.1, Bi.2, ..., Bi.ni.
Any two l-blocks that map to the same cache line conflict with each other if the execution of one l-block will displace the cache content of the other. Otherwise, they are called non-conflicting l-blocks (e.g. B1.3 and B2.1 in Fig. 2).
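The partitioning of basic blocks into l-blocks can be sketched as below. The byte addresses of the three basic blocks are invented for illustration and do not reproduce Fig. 2; LINE_SIZE and NUM_LINES are assumed parameters. The conflicts test encodes one plausible reading of the definition for a direct-mapped cache: two l-blocks on the same cache line conflict only when they come from different memory lines.

```python
LINE_SIZE = 16   # bytes per cache line (assumed)
NUM_LINES = 4    # number of lines in the direct-mapped cache (assumed)

def l_blocks(basic_blocks):
    """Split each basic block (name -> (start, end_exclusive)) at line boundaries."""
    result = []
    for name, (start, end) in basic_blocks.items():
        j = 1
        addr = start
        while addr < end:
            mem_line = addr // LINE_SIZE
            chunk_end = min(end, (mem_line + 1) * LINE_SIZE)
            result.append({
                "name": f"{name}.{j}",
                "cache_line": mem_line % NUM_LINES,  # direct mapping
                "mem_line": mem_line,
                "range": (addr, chunk_end),
            })
            addr = chunk_end
            j += 1
    return result

def conflicts(a, b):
    """Same cache line but different memory lines: one displaces the other."""
    return a["cache_line"] == b["cache_line"] and a["mem_line"] != b["mem_line"]

# Hypothetical layout: three consecutive basic blocks.
blocks = l_blocks({"B1": (0, 40), "B2": (40, 72), "B3": (72, 88)})
by_name = {b["name"]: b for b in blocks}
cache_table = {}
for b in blocks:
    cache_table.setdefault(b["cache_line"], []).append(b["name"])
```

In this layout B1.3 and B2.1 share both a cache line and a memory line, so they are non-conflicting, while B1.1 and B2.3 map to cache line 0 from different memory lines and therefore conflict.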
Since l-block Bi.j is inside the basic block Bi, its execution count is equal to xi. The cache hit and the cache miss counts of l-block Bi.j are denoted as $x_{i.j}^{hit}$ and $x_{i.j}^{miss}$ respectively, and
$$x_i = x_{i.j}^{hit} + x_{i.j}^{miss}, \qquad j = 1, 2, \ldots, n_i. \qquad (13)$$
The new total execution time (cost function) is given by:
$$\text{Total execution time} = \sum_{i=1}^{N} \sum_{j=1}^{n_i} \left( c_{i.j}^{hit}\, x_{i.j}^{hit} + c_{i.j}^{miss}\, x_{i.j}^{miss} \right), \qquad (14)$$
where $c_{i.j}^{hit}$ and $c_{i.j}^{miss}$ are the hit cost and the miss cost of the l-block Bi.j respectively. Equation (13) links the new cost function (14) with the program structural constraints and the program functionality constraints, which remain unchanged. In addition, the cache behavior can now be specified in terms of the new variables $x_{i.j}^{hit}$ and $x_{i.j}^{miss}$.
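As a small numerical illustration of how (13) ties the hit and miss counts to the basic block counts before the cost function (14) is evaluated, consider the sketch below. All counts and costs are made up; the function name total_time and the record layout are assumptions of this example only.

```python
def total_time(blocks):
    """Evaluate the cache-aware cost function over all l-blocks.

    blocks: list of dicts with 'x' (basic block count) and 'l_blocks',
    a list of (hit_count, miss_count, hit_cost, miss_cost) tuples.
    """
    total = 0
    for b in blocks:
        for hits, misses, c_hit, c_miss in b["l_blocks"]:
            # Linking constraint: every execution is either a hit or a miss.
            if hits + misses != b["x"]:
                raise ValueError("hit + miss counts must equal the block count")
            total += c_hit * hits + c_miss * misses
    return total

# Hypothetical program: B1 runs once, B2 runs ten times and spans two lines.
program = [
    {"x": 1, "l_blocks": [(0, 1, 4, 12)]},
    {"x": 10, "l_blocks": [(9, 1, 5, 13), (10, 0, 3, 11)]},
]
estimate = total_time(program)
```

The check inside the loop is what keeps the new variables consistent with the structural and functionality constraints, which are still written in terms of the xi.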
Cache Constraints
These constraints are used to constrain the hit/miss counts of the l-blocks. Consider a simple case. For each cache line, if there is only one l-block Bk.l mapping to it, then once Bk.l is loaded into the cache it will permanently stay there. In other words, only the first execution of this l-block may cause a cache miss and all subsequent executions will result in cache hits. Thus,
$$x_{k.l}^{miss} \le 1. \qquad (15)$$
A slightly more complicated case occurs when two or more non-conflicting l-blocks map to the same cache line, such as B1.3 and B2.1 in Fig. 2. The execution of any of them will load all the l-blocks into the cache line. Therefore, the sum of their cache miss counts is at most one. In this example, the constraint is:
$$x_{1.3}^{miss} + x_{2.1}^{miss} \le 1. \qquad (16)$$
When a cache line contains two or more conflicting l-blocks, the hit/miss counts of all the l-blocks mapped to this line will be affected by the sequence in which these l-blocks are executed. An important observation is that the execution of any other l-blocks from other cache lines will have no effect on these counts. This leads us to examine the control flow of the l-blocks mapped to that particular cache line by defining a cache conflict graph.
Cache Conflict Graph
A cache conflict graph (CCG) is constructed for every cache line containing two or more conflicting l-blocks. It contains a ... Fig. 3. The program control begins at the start node. After executing some other l-blocks from other cache lines, it will eventually reach any one of node Bk.l, node Bm.n or the end node. Similarly, after executing Bk.l, the control may pass through some l-blocks from other cache lines and then reach node Bk.l again, or it may reach node Bm.n or the end node.
For each edge from node Bi.j to node Bu.v, we assign a variable p(i.j,u.v) to count the number of times that the control passes through that edge. At each node Bi.j, the sum of control flow going into the node must be equal to the sum of control flow leaving the node, and it must also be equal to the execution count of l-block Bi.j. Therefore, two constraints are constructed at each node Bi.j:
$$x_i = \sum_{u.v} p_{(u.v,\,i.j)} = \sum_{u.v} p_{(i.j,\,u.v)}, \qquad (17)$$
where 'u.v' may also include the start node 's' and the end node 'e'. This set of constraints is linked to the program structural and functionality constraints via the x-variables.
The program is executed once, so at the start node:
$$\sum_{u.v} p_{(s,\,u.v)} = 1.$$
Equations (15) through (20) are the possible cache constraints for bounding the cache hit/miss counts. These constraints, together with (13), the structural constraints and the functionality constraints, are passed to the ILP solver with the goal of maximizing the cost function (14). Because of the cache information, a tighter estimated WCET will be returned. Further, some path sequencing information can be expressed in terms of p-variables as extra functionality constraints. The CCGs are network flow graphs and thus the cache constraints are typically solved rapidly by the ILP solver. In the worst case, there is one CCG for each cache line.
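To make the meaning of the p-variables concrete, the following sketch derives them from one known execution trace rather than solving for them. It assumes a single cache line whose l-blocks all conflict with one another and an initially empty cache, so an execution hits exactly when it follows the same l-block (a self-edge). The trace and the helper names ccg_counts and flow_conserved are hypothetical.

```python
from collections import Counter

def ccg_counts(trace):
    """Edge counts p, node counts x, and hit/miss counts for one cache line.

    trace: names of mutually conflicting l-blocks mapped to the same cache
    line, in execution order (l-blocks on other lines are already dropped).
    """
    p = Counter()
    prev = "s"                      # start node
    for blk in trace:
        p[(prev, blk)] += 1
        prev = blk
    p[(prev, "e")] += 1             # end node
    x = Counter(trace)
    hits = {b: p[(b, b)] for b in x}          # self-edges are hits
    misses = {b: x[b] - hits[b] for b in x}
    return p, x, hits, misses

def flow_conserved(p, x):
    """Check inflow == outflow == execution count at every l-block node."""
    for b in x:
        inflow = sum(c for (src, dst), c in p.items() if dst == b)
        outflow = sum(c for (src, dst), c in p.items() if src == b)
        if not (inflow == outflow == x[b]):
            return False
    return True

trace = ["B4.1", "B4.1", "B7.1", "B7.1", "B7.1", "B4.1"]
p, x, hits, misses = ccg_counts(trace)
start_flow = sum(c for (src, dst), c in p.items() if src == "s")
```

Any real trace yields counts that satisfy the flow constraints; the ILP formulation runs the argument the other way and searches over all counts that satisfy them.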
The above constraints can also be used to solve for the best case execution time. In this case the ILP solver will try to increase the value of $x_{i.j}^{hit}$ as much as possible. If p(i.j,i.j) (the self-edge variable) exists, then the ILP solver may set $p_{(i.j,\,i.j)} = x_{i.j}^{hit} = x_i$.
However, this is not possible in any execution trace. Before this path can occur, control must first flow into node Bi.j from some other node. To handle this problem, an additional constraint is required for all nodes Bi.j with a self-edge:
where Z is a large positive integer constant. The addition of this kind of constraint may generate some non-integral optimal variable values when the whole constraint set is passed to the LP solver. If the ILP solver uses branch and bound techniques for solving the ILP problem, the computational time may be lengthened significantly.
Bounds on p-variables
In this subsection, we discuss bounds on the p-variables. Without the correct bounds, the solver may return an infeasible l-block count and an overly pessimistic estimated WCET. This is demonstrated by the example in Fig. 4. In this example, the CFG contains two nested loops. Suppose that there are two conflicting l-blocks B4.1 and B7.1. A CCG will be constructed (Fig. 4(ii)) and the following cache constraints will be generated:
Suppose that the user specifies that both loops will be executed 10 times each time they are entered and that basic block B4 will be executed 9 times each time the outer loop is entered. The functionality constraints for this information are:
(27-29)
If we feed the above constraints and the structural constraints into the ILP solver, it will return a worst case solution in which the counts are as shown on the left of the variables in the figure. From the CCG, we observe that these p-values imply that l-blocks B4.1 and B7.1 will be executed alternately, with l-block B7.1 being executed first. This execution sequence will generate the maximum number of cache misses and hence the WCET. However, if we look at the CFG, we know that this sequence is impossible because the inner loop will be entered only once. Once the program control enters the inner loop, l-block B7.1 must be executed 10 times before control exits the inner loop. Hence, there must be at least 9 cache hits for l-block B7.1. The ILP solver over-estimates the number of cache misses based on the given constraints. Upon closer investigation, we find that the correct solution also satisfies the above set of constraints. This implies that some constraints for tightening the solution space are missing.
The reason for producing such a pessimistic worst case solution is that the p-variables are not properly bounded. When we assign the p-variables to the edges of the CCG, we do not specify any upper limits on these p-variables. However, the flow equations (17) place a bound on them. For any variable p(i.j,u.v), its bounds are:
$$0 \le p_{(i.j,\,u.v)} \le \min(x_i, x_u). \qquad (30)$$
Consider the case that two conflicting l-blocks Bi.j and Bu.v are in the same loop and at the same loop nesting level. In this case the maximum control flow allowed between these two l-blocks is equal to the total number of loop iterations. This will be the upper bound on p(i.j,u.v). Since l-blocks Bi.j and Bu.v are inside the loop, xi and xu can at most be equal to the total number of loop iterations. Therefore, (17) will bound p(i.j,u.v) correctly.
Suppose that there are two nested loops such that l-block Bi.j is in the outer loop while Bu.v is in the inner loop. If edge (Bi.j, Bu.v) exists, all paths represented by this edge go from basic block Bi to basic block Bu in the CFG. They must pass through the loop preheader, say basic block Bh, of the inner loop. Since the execution count of basic block Bh, xh, may be smaller than xi and xu, a constraint
$$p_{(i.j,\,u.v)} \le x_h$$
is needed to properly bound p(i.j,u.v).
In general, a constraint is constructed at each loop preheader. All the paths that go from outside the loop to inside the loop must pass through the loop preheader. Therefore, the sum of these flows can at most be equal to the execution count of the loop preheader. In our example, a constraint at loop preheader B5 is needed:
(Footnote: A loop preheader is the basic block just before entering the loop. For instance, in the example shown in Fig. 4, basic block B1 is the loop preheader of the outer loop and basic block B5 is the loop preheader of the inner loop.)
With this constraint, the ILP solver will generate a correct solution.
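The effect of the preheader constraint on the example can be checked with a little arithmetic. The sketch below uses the counts quoted in the text (B7.1 executed 10 times, B4.1 executed 9 times, inner loop entered once) and assumes that every CCG edge into B7.1 from another node, including the one from the start node, crosses the inner-loop preheader. The function names are hypothetical.

```python
def inflow_bound(x_self, x_others, preheader_count=None):
    """Upper bound on the flow entering a node from nodes other than itself.

    Each edge from another l-block is limited by min(x_self, x_other); the
    edge from the start node carries at most 1 because the program runs once.
    """
    bound = 1 + sum(min(x_self, xo) for xo in x_others)
    bound = min(bound, x_self)      # inflow cannot exceed the node's count
    if preheader_count is not None:
        # All entries from outside the loop pass through the preheader.
        bound = min(bound, preheader_count)
    return bound

def guaranteed_hits(x_self, bound):
    """Executions not reached from another node must use the self-edge."""
    return x_self - bound

x7, x4, x5 = 10, 9, 1               # counts of B7, B4 and the preheader B5

loose = inflow_bound(x7, [x4])
tight = inflow_bound(x7, [x4], preheader_count=x5)
hits_without = guaranteed_hits(x7, loose)
hits_with = guaranteed_hits(x7, tight)
```

Without the preheader constraint nothing forces any self-edge flow, so the solver may report ten misses for B7.1; with it, at least nine of the ten executions must be hits, matching the argument above.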
Interprocedural Calls
A function may be called many times from different locations of the program. The variable xi represents the total execution count of the basic block Bi when the whole program is executed once. Similarly, $x_{i.j}^{hit}$ and $x_{i.j}^{miss}$ represent the total hit and miss counts of the l-block Bi.j respectively. Equation (13) is still valid and (14) still represents the total execution time of the program.
Every function call is treated as if it is inlined. During the construction of the CFG, a function call is represented by an f-edge pointing to an instance of the callee function's CFG. The edge has a variable fk which represents the number of times that the particular instance of the callee function is called.
Each variable and name in the callee function has a suffix ".fk" to distinguish it from other instances of the same callee function.
Consider the example shown in Fig. 5 . Here, function inc is called twice in the main function. The CFG is shown in Fig. 5 (ii). The structural constraints are:
The last equation above links the total execution counts of basic block B3 with its counts from two instances of the function. Based on these variables, the user can provide specific information on different instances of the same function.
The CCG is constructed as before by treating each instance of l-block Bi.j.fk as different from other instances of the same l-block. In the example, if l-block B1.1 conflicts with l-block B3.1, then since l-block B3.1 has two instances (B3.1.f1 and B3.1.f2), there will be 5 nodes in the CCG (Fig. 5(iii)).
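The node set of such a CCG can be enumerated mechanically. The sketch below assumes a simple representation that is not taken from the tool itself: a list of l-block names on one cache line plus a dictionary giving the call-instance suffixes of those l-blocks that live in a callee. It also shows the linking of a block's total count to its per-instance counts, using invented numbers.

```python
def ccg_nodes(l_block_names, instances):
    """List CCG nodes, expanding callee l-blocks into one node per call instance."""
    nodes = ["s"]                                   # start node
    for blk in l_block_names:
        suffixes = instances.get(blk, [])
        if suffixes:
            nodes.extend(f"{blk}.{f}" for f in suffixes)
        else:
            nodes.append(blk)
    nodes.append("e")                               # end node
    return nodes

def total_count(per_instance_counts):
    """Total count of a callee block is the sum over its call instances."""
    return sum(per_instance_counts.values())

nodes = ccg_nodes(["B1.1", "B3.1"], {"B3.1": ["f1", "f2"]})
x3 = total_count({"f1": 1, "f2": 1})    # hypothetical: each call site runs once
```

For the example in the text this yields the five nodes s, B1.1, B3.1.f1, B3.1.f2 and e.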
The cache constraints and the bounds on p-variables are constructed as before, except that the hit constraints are modified slightly. In addition to the self-edges, the edges going from one instance of an l-block (say Bi.j.fk) to another instance of the same l-block (Bi.j.fl) are counted as cache hits of the l-block Bi.j, as they represent the execution of l-block Bi.j at fl after the same l-block has just been executed at fk. The complete cache constraints derived from the example's CCG are:
CPU Pipeline
Since the $c_{i.j}^{hit}$'s and $c_{i.j}^{miss}$'s must be constants, we assume that the time required to execute a sequence of instructions in the CPU pipeline is always a constant throughout the execution of the program. The hit cost $c_{i.j}^{hit}$ of an l-block Bi.j is determined by adding up the effective execution times of the instructions in the l-block. Since the effective execution times of some instructions, especially the floating point instructions, are data dependent, a conservative approach is taken by assuming the worst case effective execution time. This may induce some pessimism in the final estimated WCET. Additional time is also added to the last l-block of each basic block so as to ensure that all the buffered load/store instructions are completed when the control reaches the end of the basic block. The miss cost $c_{i.j}^{miss}$ of the l-block is equal to the time needed to load the instructions of the l-block into the cache memory and to execute them in the CPU.
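A minimal sketch of how the two cost constants might be assembled is given below. The cycle counts, the drain time for buffered load/store instructions, and the line-fill time are invented, and treating the miss cost as the hit cost plus a fixed line-fill time is a simplifying assumption of this example rather than a statement about the i960KB.

```python
def l_block_costs(worst_case_cycles, is_last_in_block, line_fill_cycles,
                  drain_cycles):
    """Return (hit_cost, miss_cost) for one l-block.

    worst_case_cycles: worst case effective time of each instruction.
    is_last_in_block: True for the final l-block of its basic block.
    line_fill_cycles: assumed time to load the l-block's cache line.
    drain_cycles: assumed time for buffered loads/stores to complete.
    """
    hit_cost = sum(worst_case_cycles)
    if is_last_in_block:
        hit_cost += drain_cycles        # wait for outstanding memory operations
    miss_cost = hit_cost + line_fill_cycles   # load the line, then execute
    return hit_cost, miss_cost

# Hypothetical l-block: three integer instructions and one floating point one.
hit, miss = l_block_costs([1, 2, 1, 9], is_last_in_block=True,
                          line_fill_cycles=8, drain_cycles=3)
```

Because each instruction contributes its worst case time, both constants are conservative, which is the source of the pessimism mentioned above.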
Implementation
The above cache analysis method has been implemented in a tool called cinderella, which estimates the WCET of programs running on an Intel QT960 development board [11] containing a 20MHz Intel i960KB processor, 128KB of main memory and several I/O peripherals. The i960KB processor is a 32-bit RISC processor used in many embedded systems (e.g., in laser printers). It contains an on-chip 512-byte direct-mapped instruction cache which is organized as 32 x 16-byte lines. It also features a floating point unit, a 4-stage instruction pipeline, and 4 register windows for faster execution of function call instructions [12, 13].
Cinderella contains about 15,000 lines of C++ code. The tool reads the subject program's executable code and constructs the CFGs and the CCGs. It then outputs the annotation files in which the x's and f's are labeled along with the program's source code. The user is then asked to provide loop bounds. An estimated WCET can thus be computed. The user can provide additional path information, if available, to tighten this bound. We use a public domain ILP solver, lp_solve, to solve the constraints generated by cinderella. The solver uses the branch and bound procedure to solve the ILP problem.
An optimization implemented in cinderella reduces the number of variables and CCGs. If two or more cache lines can hold instructions from the same set of basic blocks, e.g. cache lines 0 and 1 in Fig. 2(ii), then the corresponding l-blocks can be combined and only one CCG is constructed for these cache lines. This technique is used in cinderella to improve efficiency.
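One way to find the cache lines that can share a CCG is to group them by the set of basic blocks whose l-blocks they hold. The sketch below assumes a cache table given as a dictionary from cache line to l-block names of the form "B1.2"; the table contents are hypothetical and the function is not taken from cinderella.

```python
from collections import defaultdict

def group_cache_lines(cache_table):
    """Group cache lines that hold l-blocks from the same set of basic blocks."""
    groups = defaultdict(list)
    for line, l_block_names in sorted(cache_table.items()):
        # 'B1.2' belongs to basic block 'B1'.
        key = frozenset(name.split(".")[0] for name in l_block_names)
        groups[key].append(line)
    return list(groups.values())

# Hypothetical cache table with 4 lines.
table = {
    0: ["B1.1", "B3.1"],
    1: ["B1.2", "B3.2"],
    2: ["B1.3", "B2.1"],
    3: ["B2.2"],
}
merged = group_cache_lines(table)
```

Lines 0 and 1 both hold code from B1 and B3 only, so they fall into one group and need a single CCG between them.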
Experimental Results
Our goal is to find a tight bound on a program's WCET. A small amount of pessimism is normally present in the estimated bound. This is due to two factors: (i) insufficient path information from the user so that some infeasible program paths are considered, and (ii) inaccuracy in microarchitecture modeling which affects the accuracy of the values of the $c_{i.j}^{hit}$'s and $c_{i.j}^{miss}$'s in (14). The first factor can be reduced by providing more path information and the second can be reduced by a more sophisticated hardware model.
Since it is impractical to simulate all the possible program input data and all initial system states, a program's actual WCET cannot be computed. Instead, we try to identify the worst case data set by a careful study of the program and use the logic analyzer to measure the program's execution time for this worst case data set. We denote this time as the program's measured WCET. A program's measured WCET is always bounded by its actual WCET and we assume that it is very close to the actual WCET.
Table 2 shows the results of our experiments. The second and third columns show the measured WCET and the estimated WCET with cache analysis. For comparison, we also estimate WCET without performing the cache analysis. This is shown in the last column. Clearly the WCET bound with cache analysis is much tighter than the one without it. For small programs (e.g. check-data and piksrt), the estimated WCETs are very close to their corresponding measured WCETs. For larger programs, the differences are larger. We found that this discrepancy is mainly due to two factors. The first is that we assume that the register window overflows (underflows) on each function call (return). This pessimism incurs about 50 clock cycles on each function call and function return. This can be illustrated in examples matcnt and matcnt2. Matcnt has two small functions which are called frequently inside the loops. These function calls generate a large amount of pessimism.
In matcnt2, these two functions are inlined and the estimated WCET is tightened significantly. We are currently working on this area to reduce the pessimism. The second factor is due to the pessimism in the execution times of floating point instructions. The execution time of a floating point instruction depends on the values of its arguments, and its worst case execution time is typically 30%-40% more than its average execution time.
The structural constraints and the cache constraints are derived from the CFG and the CCGs which are very similar to network flow graphs. We therefore expect that the ILP solver can solve the problem efficiently. Table 3 shows, for each program, the number of variables and constraints, the number of branches in solving the ILP problem, and the CPU time required to solve the problem. Since each program may have more than one set of functionality constraints, a '+' symbol is used to separate the number of functionality constraints in each set. For a program having n sets of functionality constraints, the ILP will be called n times. The '+' symbol is once again used to separate the number of ILP branches and the CPU time for each ILP call.
We found that even with thousands of variables and constraints, the branch and bound ILP solver can still find an integer solution within the first few calls to the linear programming solver. The time taken to solve the problem ranges from less than a second to a few minutes on an SGI Indigo2 workstation. With the commercial ILP solver CPLEX, the CPU time reduces significantly to a few seconds.
In order to evaluate how the cache size will affect the time needed for solving the problem, we double the number of cache lines (and hence the cache size) from 32 lines to 64 lines and find the CPU time needed to solve the problems. Table 4 shows the results. From the table, we find that the number of variables and the number of constraints change little when the number of cache lines is doubled. The solution time is of the same order as before. The primary reason is that although increasing the number of cache lines will increase the number of CCGs and hence more cache constraints are generated, each CCG has fewer nodes and edges. As a result, there are fewer cache constraints in each CCG. These two factors roughly cancel each other out.
Conclusions
In this paper, we present a method to determine a tight bound on a program's worst case execution time. The method includes a direct-mapped instruction cache analysis and uses an integer linear programming formulation to solve the problem. This approach avoids enumeration of program paths. Furthermore, it allows the user to provide program path annotations so that a tighter bound may be obtained. The method is implemented in the tool cinderella and the experimental results show that the WCET bound is much closer to the measured WCET than if cache analysis is not included. Since the linear constraints are mostly derived from the network flow graphs, the ILP problems are typically solved efficiently.
We are now working on set-associative instruction cache and data cache memory modeling, as well as register window modeling.
