Generalizing Timing Predictions to Set-Associative Caches by Mueller, Frank
Generalizing Timing Predictions to SetAssociative Caches
Frank Mueller
HumboldtUniversitat zu Berlin
Institut fur Informatik
 Berlin Germany
email muellerinformatikhuberlinde phone 	
 
 
Abstract
Hard realtime systems rely on the assumption that the deadlines of tasks can be met  otherwise the
safety of the controlled system is jeopardized Several scheduling paradigms have been developed to support
the analysis of a task sets and determine if a schedule is feasible These scheduling paradigms rely on the
assumption that the worstcase execution time WCET of hard realtime tasks be known apriori
In the past years research in the static prediction of WCET has been extended from unoptimized
programs on simple CISC processors to optimized programs on pipelined RISC processors and from un
cached architectures to directmapped instruction caches The work presented here goes one step beyond
the previous research by introducing the rst framework to handle WCET prediction for setassociative
caches Generalizing the work of static cache simulation of directmapped caches to setassociative caches
a formalization of the new method is given and the operational characteristics are presented and discussed
by example Furthermore WCET predictions for several programs are presented by combining the static
cache analysis for setassociative caches with a timing analysis tool This approach has the advantage that
cache conguration details are handled by static cache simulation but remain transparent to the timing
analyzer Overall this work lls another gap between realistic WCET prediction of contemporary cached
architectures and its use in schedulability analysis for hard realtime systems
1 Introduction
Hard realtime systems rely on the assumption that the deadlines of tasks can be met  otherwise the safety of the
controlled system is jeopardized Several scheduling paradigms have been developed to support the analysis of a
task sets and determine if a schedule is feasible eg ratemonotone analysis 	
 These scheduling paradigms rely
on the assumption that the worstcase execution time WCET of hard realtime tasks be known apriori If the
WCET of all tasks is known it can be determined if a schedule is feasible ie if the tasks are guaranteed to meet
their deadlines
Determining the WCET is a prerequisite for oline schedulability analysis but most practitioners use adhoc
methods to measure the execution time of a program with some worstcase input However such an approach
may yield incorrect overly optimistic results in the context of modern processors with pipelines caches or even
instructionlevel parallelism For example while a longer execution path typically contributes to the WCET in
uncached systems a shorter path with frequent cache misses may result in longer execution in a cached system
An analytical approach is needed supported by tools to determine the WCET for contemporary architectures if
realtime systems want to exploit the speed of such processors
Modern processors generally use instruction and data caches to bridge the increasing gap between ever-faster processors and only moderately faster memory. Most caches are split caches, i.e., the instruction cache and the data cache are kept separate. The level of associativity of such caches varies between processors (see Table 1).
In the past years, research in static analysis of the WCET of programs has intensified. Conventional methods for static analysis have been extended from unoptimized programs on simple CISC processors to optimized programs on pipelined RISC processors, and from uncached architectures to direct-mapped instruction caches.



Associativity Processors

 most SPARC and MIPS chips
 Intel Pentium Nexgen Nx PowerPC  MIPS R

 AMD K Motorolla  PowerPC 
 PowerPC 
Table 
 Associativity of Caches for various Processors
The work presented here goes one step beyond the previous research by introducing the first framework to handle WCET prediction for set-associative caches. Generalizing the work on static cache simulation from direct-mapped caches to set-associative caches, a formalization of the new method is given, and the operational characteristics are presented and discussed by example. Furthermore, WCET predictions for several programs are presented by combining the static cache analysis for set-associative caches with a timing analysis tool. This approach has the advantage that cache configuration details are handled by static cache simulation but remain transparent to the timing analyzer. Overall, this work fills another gap between realistic WCET prediction for contemporary cached architectures and its use in schedulability analysis for hard real-time systems.
The paper is structured as follows. Section 2 summarizes related work. Section 3 gives an overview of the operational framework. Section 4 presents a formalization of the static analysis for set-associative caches, describes the operational semantics, and presents an example. Section 5 outlines the task of the timing analyzer. The remaining sections present and discuss measurements, outline future work, and present conclusions.
2 Related Work
The task of bounding the WCET of programs is, due to the undecidability of the halting problem, generally constrained by a set of assumptions about the use of programming language constructs and about the underlying operating system. For a static estimate of the WCET, an upper bound on the number of loop iterations has to be known, indirect calls should not be used, and memory should not be allocated dynamically. Often, recursive functions are also not allowed, although there exist outlines on treating bounded recursion similar to bounded loops.
In particular, in the presence of caches, non-preemptive scheduling is assumed to prevent nondeterministic behavior due to unpredictable context-switch points. If context switches occurred at arbitrary points, e.g., in a preemptive system, cache invalidations may occur, resulting in unexpected cache misses when the execution of a task is resumed later on. Hardware and software approaches have been proposed to counter this problem but find little use in practice due to a loss of cache performance when caches are partitioned. Recently, attempts have been made to incorporate caching into rate-monotone analysis and response-time analysis, which may allow WCET predictions for non-preemptive systems to be used in the analysis of preemptively scheduled systems.
Early work in the field of WCET prediction used a timing schema to propagate execution times of programming structures along the control-flow graph of functions and the call graph of a program. Park analyzed programs at the source level, disregarding compiler optimizations, while Harmon et al. performed instruction-level timing augmented by the execution times of each instruction for a given architecture. Recent advances in computer architecture forced researchers to extend these methods. Pipelined processors were handled by simulating the execution stages of instructions and overlapping the stages of adjacent instructions along execution paths.
To model the effect of cache memories, several approaches were taken. The first approach uses an extension of the timing schema by Park and updates caching information associated with paths during the traversal of the timing graph. Distinct cache states of multiple paths may have to be considered during the analysis before shorter paths are pruned until only the longest (worst-case) path remains. In another approach, integer linear programming (ILP) was used to describe constraints on the execution paths and to derive the WCET from these constraints, first by Puschner and most recently by Li et al. Li et al. enhanced this method by a set of finite-state automata, one for each cache line with conflicts. The automata simulated the behavior of direct-mapped caches and were described by constraints placed at the reference points of program lines in the control flow. The ILP solver would then take caching effects into account, but the cache constraints increased the complexity and the search space of the ILP problem. Since the ILP approach is subject to long response times even in the absence of caching (slower than our approach), it seems questionable whether this approach is feasible for everyday use in the software development cycle. Also, the additional overhead of dealing with set-associative caches, once implemented, will increase polynomially with the number of cache sets due to the increased search space, whereas our approach scales linearly.
The method described here, static cache simulation, was used to separate cache analysis from path analysis. It uses dataflow information to categorize instructions according to their caching behavior. The timing analyzer receives these instruction categorizations from the static cache simulator and proceeds by traversing paths and propagating timing predictions within a timing tree. The cache configuration remains transparent to the timing analyzer, but pipelining has to be modeled by storing the leading and trailing active stages of a pipeline for paths.
The discussed methods of WCET analysis that model cache effects only handle direct-mapped caches. The possibility of extending Park's timing schema to set-associative caches has been briefly mentioned in the literature but neither formalized nor implemented. This paper formulates an approach to dealing with set-associative caches in the area of WCET prediction and reports results of its implementation.
3 The Framework for Timing Prediction
The framework of WCET prediction discussed here includes a modified compiler and a set of tools to replace hand-calculated or externally timed estimations with analytically derived, reliable timings. The programmer does not need to know the worst-case input since the framework statically determines execution paths leading to WCET predictions by means of path analysis. Figure 1 gives an overview of the tools within the framework.
[Figure 1: Framework for Timing Predictions. Components shown: source files, compiler, assembly files, assembler, object files, linker, executable program, control-flow information, cache configuration, static cache simulator, cache prediction, cache analysis library routines, code instrumentation, timing analyzer, user requests, timing predictions, source-level debugger.]
An optimizing compiler accepts the source code of a program, currently for the language C. The compiler produces object code and separately emits control-flow information and the calling structure of functions. The static cache simulator uses the control-flow information and calling structure in conjunction with the cache configuration to produce instruction categorizations describing the caching behavior of each instruction. The timing analyzer combines these categorizations with the control-flow information to perform a path analysis of the program. This may include the simulation of architectural characteristics, e.g., pipelining, but the caching behavior can be inferred from the instruction categories, i.e., the process of cache simulation is entirely separated from and transparent to the timing analyzer. The timing analyzer produces WCET predictions for portions of the program or the entire program, depending on user requests.
4 Static Simulation of Set-Associative Caches
Static cache simulation provides the means to predict the caching behavior of the instructions of a program or task. The predicted caching behavior is distinguished by the following categories:
Always-hit: The instruction results in a cache hit on each reference.
Always-miss: The instruction results in a cache miss on each reference.
First-hit: The instruction results in a cache hit on the first reference and a cache miss for any subsequent references.
First-miss: The instruction results in a cache miss on the first reference and a cache hit for any subsequent references.
A program may consist of a number of loops, possibly nested and distributed over several functions. For each loop level, an instruction receives a distinct categorization. The timing analyzer can then derive tight bounds on the execution time by inspecting the categorizations for each loop level.
Since instruction categorizations have to be performed interprocedurally for the entire program, the call graph of the program has to be analyzed. The method of static cache analysis traces the origin of calls within the call graph by distinguishing function instances. Since instruction categorizations for a function are specified for each function instance, the timing analyzer can interpret different caching behaviors depending on the calling sequence to yield tighter WCET predictions. More detailed explanations of the operational characteristics of the timing analyzer, as well as examples, will be given later.
4.1 Abstract Cache State
The static cache simulator determines the categories of an instruction based on a novel view of cache memories, using a variation of iterative interprocedural dataflow analysis. We first introduce the formal framework to reason about the caching behavior.
Definition 1 (Potentially Cached): A program line l can potentially be cached if there exists a sequence of transitions in the combined control-flow graphs and function-instance graph such that l is cached when it is reached in the current path.
The process of determining whether a line is potentially cached may be performed by a path traversal. However, the combinatorial problem of traversing every possible sequence of paths leads to an exponential explosion in the search space with regard to the branching factor, i.e., nodes with conditional branches that have two successors in the control flow.
Static cache simulation counters this complexity problem via interprocedural dataflow analysis modified for caching purposes. Dataflow analysis within compilers yields sets of live objects, whereas static cache simulation yields sets of cached program lines. The latter sets are referred to as abstract cache states.
Definition 2 (Abstract Cache State, ACS): The abstract cache state of a program line l within a path and a function instance is the set of program lines that can potentially be cached prior to the execution of l within the path and the function instance.
For direct-mapped caches, the ACS is a single set used to determine the category of an instruction describing the cache behavior. For an n-way set-associative cache, the ACS is an n-tuple of sets used for the purpose of instruction categorization. The n different sets are employed to support the operational framework of dataflow analysis, simulating the cache invalidation protocol of set-associative caches.
4.2 Operational Framework for Dataflow Analysis
Given the control-flow information of a program and a cache configuration, the ACS for each path has to be calculated. Using dataflow analysis, each path has an input state and an output state corresponding to the ACS before and after the execution of the path, respectively.
Before program execution, the cache is assumed to be invalid, i.e., it does not contain any lines of the program. Thus, the input state of the first path contains only invalid lines. As a path is executed, its lines are cached, i.e., they are added to the output state. When a line is cached, it may replace a conflicting line within the current state. Such conflicting lines are subject to the replacement policy. For an n-way set-associative cache, the least-recently used (LRU) line is generally replaced. (We assume the LRU policy for the remainder of the paper since it is the most commonly used policy in practice.) Other cached lines age upon such a reference. Given the n-tuple of an ACS, this replacement process is simulated by shifting the replaced conflicting lines of the 1st cache state to the 2nd cache state. If any lines were shifted, they subsequently cause conflicting lines in the 2nd set to be shifted to the 3rd set, i.e., the shifting operation cascades until the conflicting lines in the nth set are evicted from the cache.
Finally, the input state of a path with predecessors in the control flow is obtained by the union of the output states of its predecessors, i.e., any potentially cached line is included along the control flow. For each set of the n-tuple, the union over the predecessors' sets at the same tuple position is calculated separately.
Algorithm 1 depicts the calculation of the ACS for an n-way set-associative cache. Changes to the algorithm due to the extension to set-associative caches are depicted in bold face (except for minor details, e.g., indexing states by sets). The first path of function main is invalidated with respect to the incoming ACS of the 1st tuple. For all other paths, the input states are calculated as the union of the predecessors' output states, as discussed before. The output state is determined for each item in the n-tuple by adding new cached lines and subtracting conflicting lines within the input state. The conflicting lines cascade through the tuple space, i.e., they become the new cached lines of the next tuple.
This dataflow analysis requires a time overhead comparable to that of interprocedural dataflow analysis performed in optimizing compilers. The space overhead is O(pl × bb × fi × n), where pl, bb, fi, and n denote the number of program lines, basic blocks, function instances, and the cache associativity, respectively. Notice that set-associative caches impose a factor of n, which is typically very small for instruction caches in contemporary architectures (for direct-mapped caches, n = 1). The correctness of iterative dataflow analysis has been discussed elsewhere.
Algorithm 1 (Calculation of Abstract Cache States)
Input: Function-instance graph of the program and control-flow graph for each function.
Output: Abstract cache state for each path.
Algorithm: Let prog_lines(P) be the set of program lines of path P. Let map_into_same_line(s, t) return the subset of lines in s that map into the same cache line as any lines in t. Let n be the associativity of the cache.

    input_state(main, 1) := all invalid lines
    WHILE any change DO
        FOR each instance of a path P in the program DO
            FOR set := 1 TO n DO
                input_state(P, set) := ∅
                FOR each immediate predecessor Pred of P DO
                    input_state(P, set) := input_state(P, set) ∪ output_state(Pred, set)
            cache_lines := prog_lines(P)
            FOR set := 1 TO n DO
                conf_lines := map_into_same_line(input_state(P, set), cache_lines)
                output_state(P, set) := (input_state(P, set) ∪ cache_lines) \ conf_lines
                cache_lines := conf_lines
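The iteration above can be sketched in executable form as follows. This is a minimal illustration under several assumptions that are not taken from the paper: program lines are integers that map to cache sets by line number modulo `num_sets`, "conflicting" lines exclude the lines being referenced themselves, invalid lines are modeled as empty sets, and a `max_rounds` guard bounds the iteration. The function and parameter names are hypothetical.

```python
def map_into_same_line(s, t, num_sets):
    """Lines of s that share a cache set with some line of t,
    excluding the lines of t themselves (assumed reading of 'conflict')."""
    sets_of_t = {line % num_sets for line in t}
    return {line for line in s if line % num_sets in sets_of_t and line not in t}


def abstract_cache_states(prog_lines, preds, n, num_sets, max_rounds=1000):
    """prog_lines: path -> set of program lines; preds: path -> predecessor paths.
    Returns (input_state, output_state), each mapping path -> list of n sets."""
    paths = list(prog_lines)
    inp = {p: [set() for _ in range(n)] for p in paths}
    out = {p: [set() for _ in range(n)] for p in paths}
    for _ in range(max_rounds):
        changed = False
        for p in paths:
            # Input state: component-wise union over all predecessors.
            new_in = [set() for _ in range(n)]
            for way in range(n):
                for pred in preds.get(p, []):
                    new_in[way] |= out[pred][way]
            # Output state: newly cached lines push conflicting lines
            # into the next tuple component (LRU aging cascade).
            cache_lines = set(prog_lines[p])
            new_out = []
            for way in range(n):
                conf = map_into_same_line(new_in[way], cache_lines, num_sets)
                new_out.append((new_in[way] | cache_lines) - conf)
                cache_lines = conf
            if new_in != inp[p] or new_out != out[p]:
                inp[p], out[p] = new_in, new_out
                changed = True
        if not changed:
            break
    return inp, out


# Hypothetical three-path program: entry -> loop (self-loop) -> exit,
# with a 2-way cache that has a single set, so all lines conflict.
lines = {"entry": {0}, "loop": {1}, "exit": {2}}
preds = {"loop": ["entry", "loop"], "exit": ["loop"]}
inp, out = abstract_cache_states(lines, preds, n=2, num_sets=1)
print(inp["loop"], out["exit"])
```

In this toy run, line 0 ages into the second component when the loop references line 1 and is pushed out of the tuple entirely once the exit path references line 2.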

4.3 Deriving Instruction Categorizations
Instructions have to be categorized for each loop level based on the ACS. Some additional dataflow information is required to determine these categories, namely the linear cache state and the post-dominator set for each path. The linear cache state is based on the forward control-flow graph, i.e., the acyclic graph resulting from the removal of back edges (backward edges forming loops) in the regular control-flow graph.
Definition 3 (Linear Cache State, LCS): The linear cache state of a program line l within a path and a function instance is the set of program lines that can potentially be cached in the forward control-flow graph prior to the execution of l within the path and the function instance.
Informally, the LCS represents the hypothetical cache state in the absence of loops. It will be used to determine whether a program line may be cached due to loops or due to the sequential control flow. In essence, the algorithm to calculate the ACS can also be used to calculate the LCS by simply using the forward control flow. As a result, the LCS is an n-tuple of sets of program lines, assuming an n-way set-associative cache.
The postdominator set of a path includes the program lines certain to be reached from the current path regardless
of the taken paths in the control ow It can also be calculated by dataow analysis and results in a singleton set
even for setassociative caches
Denition  Postdominator Set The postdominator set of a program line l within a path and a function
instance is the selfreexive transitive closure of postdominating program lines

This information is commonly used with respect to basic blocks in optimizing compilers A more detailed discussion
of post dominators can be found elsewhere 	
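As a reminder of how post-dominators are typically obtained by dataflow analysis, the following sketch uses the textbook iteration over graph nodes: the exit post-dominates only itself, and every other node is post-dominated by itself plus whatever post-dominates all of its successors. It works on nodes rather than program lines, assumes a single exit node, and uses hypothetical names; it is not the simulator's implementation.

```python
def postdominators(succs, exit_node):
    """succs: node -> list of successor nodes. Returns node -> set of
    post-dominating nodes (self-reflexive)."""
    nodes = set(succs) | {exit_node}
    pdom = {v: set(nodes) for v in nodes}   # start from the full set
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:
        changed = False
        for v in nodes - {exit_node}:
            succ_sets = [pdom[s] for s in succs.get(v, [])]
            common = set.intersection(*succ_sets) if succ_sets else set()
            new = {v} | common
            if new != pdom[v]:
                pdom[v] = new
                changed = True
    return pdom


# Hypothetical diamond: A branches to B or C, both continue to D (exit).
diamond = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
pdom = postdominators(diamond, "D")
print(pdom["A"])
```

Mapping each post-dominating node to the program lines it contains would give a line-level set of the kind used in the categorization.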

The instruction categories can now be defined with respect to the available dataflow information. Definition 5 formalizes the worst-case instruction categories for each loop level. Different loop levels can be distinguished by extracting only the program lines of the dataflow information within the current loop level. Operationally, this can be achieved efficiently by intersecting the set of program lines within the loops with any dataflow set of program lines. Changes to the definition due to the extension to set-associative caches are depicted in bold face (except for minor details).
Denition  WorstCase Instruction Categorization 
 Let i
k
be an instruction within a path a loop  and a function instance
 Let n be the degree of associativity of the cache
 Let l  i

i
m  
be the program line containing i
k
and let i
first
be the rst instruction of l within the path
 Let s
j
be the jth component of the ACS ntuple for l within the path and let s  
 jn
s
j

 Let l map into cache line c denoted by l  c
 Let u be the set of program lines in loop 
 Let child be the child loop innernext loop within nesting of  for this path and function instance if such
a child loop exists
 Let header be the set of header paths and preheader be the set of preheader paths of loop  respectively

 Let sp be the abstract output cache state of path p
 Let linear
j
be the jth component of the LCS ntuple for l within the path and let linear  
 jn
linear
j

 Let postdomp be the set of selfreexive postdominating programming lines of path p
Then
category i
k
 
 






















alwayshit if k  first  l  linear	 	 

 jn
l  s
j
	  
mcm  l
jm  s
j
j    
mcm  l
jm  sj  n
rsthit if categoryi
k
 child rsthitk  first 	 l  s 	 l  linear	

ppreheaders
l  sp 	 
pheaders
l  postdomp 	 
mcm l
jm  s  uj  n	

mcm l
ppreheaders
jm  sp  uj  n	 
mcm l
jm  linear uj  n
rstmiss if worsti
k
 child rstmiss	k  first 	 l  s	

mcm l
jm  sj  n 	 
mcm l
jm  s  uj  n
alwaysmiss otherwise
While the definition seems complex, it can be implemented rather efficiently once the dataflow information has been calculated. First, simple set operations on bit vectors suffice to test the conditions. Second, if one conjunct in a condition fails, the remaining ones are not tested. Third, the implementation orders the conjuncts such that the least likely ones are tested first. To motivate this definition, an informal description of the conditions shall be given.

The common notion of natural loops denes a single loop header preceded by a single preheader outside the loop 	 This work
extends this notion to handle more general control 
ow with unstructured loops Multiple loop headers occur only for unstructured loops
which are handled by the simulator Multiple loop preheaders occur when the loop can be entered from more than one path outside the
loop which can occur even for natural loops

Alwayshit If the rst instruction of the line has been categorized then any other instruction of the same line is an
alwayshit follows from spacial locality Otherwise there are enough sets to cache at most n 
 conicting
lines ie
 if the line is in the LCS cached without backedges
 if the line is in some jth ACS potentially cached
 if there is no conicting line in the jth ACS or there are enough cache sets to hold conicting lines in the
ACS
Firsthit If the instruction was a rsthit at the innernext loop level or if the line was guaranteed to be cached
upon entering the loop but may have been replaced in cache after iterating within the loop ie
 if the instruction is rst within the line
 if the line is in the ACS
 if the line is in the LCS
 if the line is in the output ACS for all preheaders of the loop
 if the line is guaranteed to be executed when the loop is entered post dominator
 if there are more conicting lines within the loop than available cache sets
 if there are enough cache sets to hold conicting lines within the loop in the output ACS of the preheaders
 if there are enough cache sets to hold conicting lines within the loop in the output LCS of the preheaders
Firstmiss If the instruction was a rstmiss at the innernext loop level or if the line may not be in cache when
the loop is entered but is guaranteed to be brought into cache after one loop iteration ie
 if the instruction is rst within the line
 if the line is in the ACS
 if there are more conicting lines than available cache sets
 if there are enough cache sets to hold conicting lines within the loop in the ACS
Alwaysmiss This is the pessimistic assumption for the prediction of worstcase execution time when none of the
above conditions apply
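The informal conditions can be turned into a small decision procedure. The sketch below is one possible reading of them, written for illustration only: program lines are integers assumed to map to cache sets by line number modulo `num_sets`, each preheader output state is passed as a single set of lines, "more conflicting lines than available" is read as "at least n", and all names (`conflicts`, `categorize`, the parameters) are hypothetical. The sample data is made up.

```python
def conflicts(line, lines, num_sets):
    """Lines other than `line` in `lines` that map into the same cache set."""
    return {m for m in lines if m != line and m % num_sets == line % num_sets}


def categorize(line, is_first, acs, lcs, loop_lines, preheader_outs,
               header_postdoms, n, num_sets, child=None):
    """acs, lcs: sequences of n sets; preheader_outs, header_postdoms:
    sequences of sets; child: category at the inner-next loop level, if any."""
    s = set().union(*acs)
    linear = set().union(*lcs)
    u = set(loop_lines)
    pre = set().union(*preheader_outs)
    if not is_first:
        return "always-hit"      # the line was fetched by its first instruction
    if line in linear and any(
            line in s_j and (not conflicts(line, s_j, num_sets)
                             or len(conflicts(line, s, num_sets)) < n)
            for s_j in acs):
        return "always-hit"
    if child == "first-hit" or (
            line in s and line in linear
            and all(line in sp for sp in preheader_outs)
            and all(line in pd for pd in header_postdoms)
            and len(conflicts(line, s & u, num_sets)) >= n
            and len(conflicts(line, pre & u, num_sets)) < n
            and len(conflicts(line, linear & u, num_sets)) < n):
        return "first-hit"
    if child == "first-miss" or (
            line in s
            and len(conflicts(line, s, num_sets)) >= n
            and len(conflicts(line, s & u, num_sets)) < n):
        return "first-miss"
    return "always-miss"


# Made-up data for a 2-way cache with two sets.
acs = [{4, 6}, {0}]
lcs = [set(), set()]
inner = categorize(0, True, acs, lcs, {0, 1}, [], [], n=2, num_sets=2)
outer = categorize(0, True, acs, lcs, set(range(7)), [], [], n=2, num_sets=2)
print(inner, outer)
```

With this data, the line has two conflicting lines in the ACS overall but none within the inner level, so it becomes a first-miss there, while at the outer level all conflicting lines are inside the loop and the result falls through to always-miss.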
   Example
The example in the figure "Program Structure and Instruction Categorizations" shows a small program calculating the sum of the positive elements of an array less the number of non-positive elements. It consists of the functions main and value, the latter being called from two different places within a loop in main. Assuming a two-way set-associative cache with four instructions per cache line and a total of two sets, program lines {   } are in conflict such that only two of these lines can be cached at a time (2-way set-associative cache), and program lines {   } are also in conflict such that two of these lines can be cached at the same time. The static cache simulator determined the categories to the right of each instruction, which could not have been detected by manual inspection of sequential sections of instructions. Static cache simulation handles not only such spatial locality but also temporal locality across loops as well as interprocedurally.
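Which program lines compete for the same cache set follows from the cache geometry. The sketch below assumes fixed-width 4-byte instructions, code starting at address 0, and the common mapping of a program line to a set by line number modulo the number of sets; these assumptions and the helper names are for illustration and are not stated in the paper. The seven program lines correspond to the labels "program line 0" through "program line 6" in the example figure.

```python
INSTR_BYTES = 4  # assumption: fixed-width 32-bit instructions


def program_line(addr: int, instrs_per_line: int = 4) -> int:
    """Program line that contains the instruction at byte address `addr`."""
    return addr // (instrs_per_line * INSTR_BYTES)


def cache_set(line: int, num_sets: int = 2) -> int:
    """Cache set a program line maps into (assumed modulo mapping)."""
    return line % num_sets


def conflict_groups(lines, num_sets: int = 2):
    """Group program lines by the cache set they compete for."""
    groups = {}
    for line in lines:
        groups.setdefault(cache_set(line, num_sets), []).append(line)
    return groups


groups = conflict_groups(range(7))
print(groups)
```

In a 2-way cache, at most two lines of each group can be resident at the same time.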
For instance the rst instruction of value a is a rstmiss at the innermost loop level value a and the
next loop level loop within main The instruction is an alwaysmiss at the outermost level function main The
function instance value a is called in block  within the loop in main and program line  is uncached when the
loop is entered But this line will remain in cache on subsequent executions of block  temporal locality since there
exists only one more conicting line program line  within the loop and a way setassociative cache can hold both
lines Thus a rstmiss is reported for the function instance and the loop This also explains the rstmiss at the
loop level of instruction  in block  which is the rst reference to program line  In both cases an alwaysmiss is
reported for the outermost level since main as a function is only considered to iterate once A rstmiss behaves
equivalent to an alwaysmiss on the rst iteration The distinction between categories results from Denition  and
will be explained later

[Figure: Program Structure and Instruction Categorizations. The figure shows the C source of the example program, its SPARC assembly code partitioned into Blocks 1 to 8 and program lines 0 to 6, and the categorization of each instruction per loop level.]

Legend: fm = first miss, m = always miss, h = always hit, fh = first hit

C source:
    int a[  ] = {                    };
    int init =   ;
    int value(index)
    int index;
    {
        return a[index];
    }
    int main()
    {
        int i, sum, neg;
        sum = neg = init;
        for (i = 0; i < 10; i++)
            if (value(i) > 0)
                sum = sum + value(i);
            else
                neg++;
        return sum - neg;
    }

Assembly code of value(), function instances (a) and (b):
    sethi   %hi(_a),%o1
    add     %o1,%lo(_a),%o1
    sll     %o0,2,%o0
    ld      [%o0 + %o1],%o1
    retl
    mov     %o1,%o0

Assembly code of main():
    save    %sp,-96,%sp
    sethi   %hi(_init),%o0
    ld      [%o0+%lo(_init)],%o0
    mov     %o0,%l2
    mov     %o0,%l0
    mov     %g0,%l1
    call    _value,1
    mov     %l1,%o0
    cmp     %o0,%g0
    bge,a   L22
    add     %l2,1,%l2
    call    _value,1
    mov     %l1,%o0
    add     %l0,%o0,%l0
    add     %l1,1,%l1
    cmp     %l1,10
    bl      L21
    nop
    sub     %l0,%l2,%o1
    ret
    restore %o1,%g0,%o0
[Figure: Data-Flow Sets for Selected Blocks. For each cache line and program line, the table lists the abstract cache state (rows "value (a) in" and "main out"), the linear cache state (row "value (a) in"), and the post-dominator set (row "main in").]

Now consider the rst instruction of value b an alwayshit The call to value a in block  precedes the
execution of block  calling value b Thus program lines  and  are cached when value b is called Thus
the rst instruction of value b is an alwayshit due to interprocedural spacial locality
Instruction  of value a belongs to program line 
 This line is in conict with lines  and  and all three
lines are referenced in each loop iteration Instruction  of value a was determined to be an alwaysmiss for the
innermost loop level and a rsthit for higher levels Consider the outermost level rst On the rst call to value
a program line 
 is still in cache since the loop had just been entered from block  containing this line and block
 also brought line  in cache causing a hit for instruction  This is equivalent to a rsthit at the outermost loop
level the function level of main
At the next loop level the hit remains on loop entry On subsequent iterations lines  and  will replace line 

when blocks  and  are executed respectively This results in misses for instruction  of value a starting with
the second iteration Thus the instruction is categorized as a rsthit at this level
The innermost loop level corresponds to the function level of value a where an alwaysmiss is reported This
innermost level has a loop frequency of one iteration since it corresponds to a function When the timing analyzer
determines worstcase predictions for program subranges like this function instance it has to report the worst case
Since value a is executed in a loop and this line was a rsthit within the loop the worst case for a single iteration
is a miss While the static cache simulator supplies the worstcase scenario for each loop level the timing analyzer
decides which values to use according the the loop level of analysis requested by the user
Instruction  of value b on the other hand is an alwayshit since line 
 and line  are still cached from the
call to value a This is another example of interprocedural spacial locality
So far the categorizations have been motivated by informal arguments based on analyzing the execution paths
of the program The static cache simulator on the other hand does not performs such path analysis Instruction
categories are instead based on the dataow information and derived from Denition 
For example consider instruction 
 of value a again Line  is not in the current LCS see Figure  Thus
an alwayshit or rsthit can be counted out But line  is in the ACS in ACS	
 and lines  and  are in ACS	
When only the lines within the innermost loop function value are regarded ACS intersect lines at loop level
line  remains by itself Thus the line is categorized as a rstmiss at the innermost level The same holds for
the next loop level the loop within main since only lines  and  are in the intersection At the outermost level
function main all three lines are in the intersection Thus an alwaysmiss is reported This is consistent since a
rstmiss at the the inner levels corresponds to an alwaysmiss for the rst iteration ie the function level of main
with one iteration
Finally consider instruction  of value a at the level of the loop within main categorized as a rsthit The
ACS contains lines 
  and  Both lines 
 and  are in ACS	
 Thus an alwayshit can be counted out Line 
 is
in the LCS in the output ACS of block  preheader and in the postdominator of block  header Lines  and 
are in the intersection between ACS and lines in the loop ie the number of lines in the intersection equals to the
level of associativity n   There are no conicting lines in the output ACS of block  preheader and line  is
the only conicting line in the LCS at this loop level Thus the instruction is categorized as a rsthit
 Timing Analyzer
The timing analyzer calculates the WCET by constructing a timing tree, traversing paths within each loop level, and propagating this timing information bottom-up within the tree. During this traversal, the timing analyzer has to take hardware characteristics into account (e.g., pipelining [ ]), and the instruction categorizations have to be interpreted. However, the timing analyzer does not have to take the cache configuration into account. The approach of splitting cache analysis via static cache simulation and timing analysis makes the caching aspects completely transparent to the timing analyzer. Based solely on the instruction categorizations, the timing analyzer can derive the WCET by propagating timing predictions bottom-up within the timing tree. In the following, this interpretation of instruction categories is described in more detail.
The timing tree represents the calling structure and the loop structure of the entire program. As seen in the context of the static cache simulator, functions are distinguished by their calling paths into function instances. This allows a tighter prediction of the WCET due to the enhanced information about the calling context. Each function instance is regarded as a loop level with one iteration and is represented as a node in the timing tree. Regular loops within the program are represented as child nodes of their surrounding function instance (outermost loops) or as child nodes of another loop that they are nested in.
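A timing tree of this shape could be represented as sketched below. The class and function names are hypothetical. The node names and iteration bounds follow the paper's running example: main, one loop within main bounded by 10 iterations, and two function instances of value.

```python
from dataclasses import dataclass, field


@dataclass
class TimingNode:
    """One loop level: a regular loop or a function instance."""
    name: str
    max_iterations: int  # function instances always have 1
    children: list = field(default_factory=list)

    def walk(self, depth=0):
        """Yield (depth, node) pairs in pre-order."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


def build_example_tree():
    """Tree for the running example (structure taken from the timing tree figure)."""
    value_a = TimingNode("value (a)", 1)
    value_b = TimingNode("value (b)", 1)
    loop = TimingNode("loop 1 in main", 10, [value_a, value_b])
    return TimingNode("main", 1, [loop])


for depth, node in build_example_tree().walk():
    print("  " * depth + f"{node.name} [max:{node.max_iterations}]")
```

Distinguishing function instances by calling path means that value appears twice in the tree, once per call site, rather than once per function.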

The timing analyzer determines the WCET in a bottom-up traversal of the tree. For any node, all possible paths (sequences of basic blocks) within the current loop level have to be analyzed. When a child node is encountered along a path, its WCET is already calculated and can simply be added to the WCET of the current path, sometimes with small adjustments. For a loop with n iterations, a fixpoint algorithm is used to determine the cumulative WCET of the loop along a sequence of (possibly different) paths. Once a pattern of longest paths has been established, the remaining iterations can be calculated by a closed formula. In practice, most loops have one longest path. Thus, the first iteration is needed to adjust the WCET of child loops along the path, and the second iteration represents the fixpoint time for all remaining iterations. The scope of the WCET analysis can thus be limited to one loop level at a time, making timing analysis very efficient compared to an exhaustive analysis of all permutations of paths within a program. A more detailed description of the timing analyzer can be found elsewhere [ ].
Consider the example from Figure  again. The timing analyzer predicts the WCET by traversing a timing tree consisting of a node for each loop level (see Figure ). The leaf nodes correspond to the function instances of value, each with a maximum number of one iteration. The loop within main has a maximum iteration count of 10, and main has an iteration count of one again since it is a function.
main [max:1] {worst case: 34 misses + 215 hits = 555 cycles}
    loop 1 in main [max:10] {worst case: 31 misses + 209 hits = 519 cycles}
        value (a) [max:1] {worst case: 1 miss + 5 hits = 15 cycles}
        value (b) [max:1] {worst case: 6 hits = 6 cycles}
Figure  Timing Tree with WCET Prediction
The WCET of value a is given by a miss and  hits either instruction  misses on the rst iteration or
instruction 
 misses on consecutive iterations Assuming a miss penalty of 
 cycles the predicted WCET for
value a is 
 cycles Since value b consists of  hits  cycles are predicted The WCET for the loop in main
is bounded by executing the longer path blocks  and  during each iteration There are  misses at this level
 alwaysmisses within the loop and one between the rsthit and rstmiss in value a There are  hits 
alwayshits in the loop 
 alwayshits in the instances of value and one more hit between the rstmiss and rst
hit in value a Each of these hits and misses occur during of each the 
 loop iterations In addition there is a
rstmiss counted as  hits and 
 miss Thus there is a total of    
  
  
 misses and    
     hits
ie 
   
    
 cycles
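The cycle counts in the timing tree can be checked with a few lines of arithmetic. This is only a sketch: the per-access costs below are inferred from the leaf node of the figure (1 miss + 5 hits = 15 cycles implies 10 cycles per miss if a hit costs 1 cycle), and the function name `cycles` is hypothetical.

```python
HIT_CYCLES = 1    # assumed cost of a cache hit
MISS_CYCLES = 10  # inferred from the figure: 1 miss + 5 hits = 15 cycles


def cycles(misses, hits):
    """Worst-case cycles for a given number of misses and hits."""
    return misses * MISS_CYCLES + hits * HIT_CYCLES


value_a = cycles(1, 5)       # leaf node value (a)
value_b = cycles(0, 6)       # leaf node value (b)
loop_main = cycles(31, 209)  # loop 1 in main, 10 iterations
main = cycles(34, 215)       # function level of main

print(value_a, value_b, loop_main, main)  # 15 6 519 555
```

All four values agree with the worst-case annotations of the timing tree.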
Notice that the WCET at a non-leaf node is calculated by taking the values of the child nodes, adjusting them if necessary, and then adding the estimates of instructions at the current level. Separating the calculation of each node speeds up the process of WCET prediction considerably.

For the level of main, 6 hits and 3 misses are added to result in 555 cycles. This WCET prediction, estimated by static analysis without program execution, is  % accurate. We confirmed these numbers by measuring the cache behavior of the program's execution with a trace-driven cache simulator on the worst-case input data.
 Measurements
Static cache simulation and timing analysis were performed for instruction caches for  -way set-associative caches with  lines, respectively, and a line size of  bytes. Thus, each cache configuration has the equivalent storage capacity of  bytes. The test programs were  to  times larger than the cache and included a data encryption program (des), matrix operations such as multiplication (matmult), summation (matsum), and counting of non-negative elements (matcnt), as well as the bubble-sort algorithm (sort) and a program calculating statistical functions of two arrays of numbers (stats). The estimated number of cycles for a program execution was derived from static cache simulation and timing analysis without program execution. This number is compared to the number of observed cycles obtained by a trace-driven cache simulation. In the latter case, the program was executed with its worst-case input data. The miss penalty was assumed to be  cycles, a realistic value for contemporary architectures [ ].

Adjustments are necessary for transitions from first-misses to first-misses and always-misses to first-hits between loop levels [ ].

Adjustments due to different categorizations at loop levels are discussed elsewhere [ ].
Table  shows the results of WCET prediction for a  -way associative cache with  lines. The observed cycles during program execution (column 2) are slightly less than the number of cycles estimated by our tools (column 3). The ratio between estimated and observed cycles (column 4) shows that our method yields tight estimations, sometimes even exact ones. The results for some programs require further explanation.
Program    Observed Cycles    Estimated Cycles    Ratio
Des
Matcnt
Matmult
Matsum
Sort
Stats
average
Table  Worst-Case Times for a  B  -way Set-Associative Cache
The program sort contains an inner loop whose termination depends on the iteration count of the outer loop. The static bound on the maximum iterations of the inner loop, however, is presented as a constant to the timing analyzer. Thus, the timing analyzer overestimates the number of cycles by a factor of  due to a lack of information. Tighter estimations would be possible by requesting a more detailed analysis of the loop structure and providing distinct bounds on the inner loop for each iteration of the outer loop. The program des has a similar data dependency preventing tighter estimates. However, these problems are not caused by the cache analysis approach.
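The effect of a constant inner-loop bound can be illustrated with a small calculation. The sketch assumes a textbook bubble sort over n elements whose inner loop runs n - 1 - i times in outer iteration i; the actual loop structure of the sort benchmark is not given in the text, so both functions are hypothetical.

```python
def inner_iterations_actual(n):
    """Inner-loop iterations of a triangular loop nest: sum of (n - 1 - i)."""
    return sum(n - 1 - i for i in range(n - 1))


def inner_iterations_constant_bound(n):
    """Inner-loop iterations when every outer iteration is charged
    the maximum inner bound of n - 1."""
    return (n - 1) * (n - 1)


n = 100
actual = inner_iterations_actual(n)           # n * (n - 1) / 2
bounded = inner_iterations_constant_bound(n)  # (n - 1) ** 2
print(actual, bounded, bounded / actual)
```

For this loop shape the ratio is 2(n - 1)/n, which approaches 2 as n grows. Supplying a distinct inner bound for each outer iteration would remove this source of overestimation.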
For the programs matcnt, matsum, and stats, the number of cycles was also overestimated. The first two programs contain conditional control flow and would require exhaustive analysis of all permutations of execution paths to yield more accurate results. Such an approach would result in exponential complexity. Instead, the timing analyzer approximates the execution times conservatively using the fixpoint algorithm described earlier. This tradeoff between accuracy and feasible time complexity still results in relatively tight, but not always precise, estimations. The program stats suffers from an overly pessimistic categorization due to a program line crossing a function boundary. However, the pessimistic category reported results in safe estimates that are still very tight (see stats).
Figure  shows the average ratio between estimated and observed cycles for cache associativities between 1 and 8. The estimations remain tight for different levels of associativity. A more detailed analysis can be seen in Figure , representing the distribution of the instruction categories averaged over the test set. The distribution varies only insignificantly for different levels of associativity. Thus, the presented method for WCET predictions yields tight results regardless of the associativity of caches.

[Plot: ratio (y-axis, 1 to 2) versus associativity (x-axis: 1, 2, 4, 8)]
Figure  Ratio between Estimated and Observed Cycles

For the numbers reported here, pipeline simulation of the timing analyzer was intentionally disabled to isolate the effects of caching.

[Plot: percent of instructions (y-axis, 0 to 100) versus associativity (x-axis: 1, 2, 4, 8); categories: always-hit, always-miss, first-miss, first-hit]
Figure  Distribution of Instruction Categories

Figure  displays the average time measured for static cache simulation on a lightly loaded SPARC  via gettimeofday. It shows that the execution time increases linearly with the level of cache associativity. The increase can be attributed to the overhead of bit-vector operations implementing the data-flow equations. The performance overhead for direct-mapped caches is extremely low (about  ms) and is still respectable (about  sec) for the largest associativity found in today's processors. Thus, static cache simulation is an adequate method to model caches efficiently for WCET predictions on contemporary architectures.
[Plot: time t in seconds (y-axis, 0 to 2) versus associativity (x-axis, 1 to 8)]
Figure  Performance Overhead of Static Cache Simulation
 Future Work
The current implementation of the static cache simulator handles instruction caches. Work is under way to handle data caching as well [ ]. The data flow for set-associative data caches should be handled similarly to the methods presented in this paper. The current implementation of the tools only supports non-recursive programs. This constraint could be lifted within the framework of the static cache simulator by bounding the recursion depth [ ].
The timing analyzer has been extended to support pipeline simulation [ ]. Current work includes an extension of the pipeline model to support wrap-around caches, e.g., for the MicroSparc I [ ].
The process of static cache simulation could be further enhanced by providing more detailed information about the control flow of a program. Currently, we are investigating analytical methods and user annotations to support such improvements.


	 Conclusion
This paper describes a formal method and the corresponding operational framework for simulating set-associative caches and yielding worst-case execution time (WCET) predictions of real-time programs. The method of static cache simulation is generalized to model set-associative caches by means of data-flow analysis. Instructions are categorized to describe their caching behavior for each loop level of the program. This information is used by a timing analyzer to bound the WCET. The isolated handling of caching concerns in the static cache simulator allows caching aspects to remain transparent to the timing analyzer. The method of static cache simulation is shown to yield adequate results to enable tight predictions of the WCET by the timing analyzer, regardless of the degree of cache associativity. The overhead of the static cache simulator is remarkably low for direct-mapped caches and increases linearly with the associativity to a moderate level, even for the highest degree of associativity found in practice. By handling set-associative caches in the context of timing analysis, this work fills another gap between realistic WCET prediction of contemporary architectures and its use in schedulability analysis for hard real-time systems.
Acknowledgement
David Whalley pointed out mistakes in the manuscript. Chris Healy provided the timing analyzer, including some modifications and corrections for this work.
References
	
 A V Aho R Sethi and J D Ullman Compilers  Principles Techniques and Tools AddisonWesley 

	 R Arnold F Mueller D B Whalley and M Harmon Bounding worstcase instruction cache performance In
IEEE RealTime Systems Symposium pages 


 December 

	 J V BusquetsMatraix The impact of extrinsic cache performance on predictability of realtime systems In
Workshop on RealTime Computing Systems and Applications 

	 J V BusquetsMatraix Adding instruction cache eect to an exact schedulability analysis of preemptive real
time systems In EuroMicro RealTime Workshop June 

	 UC Berkeley CS CPU info center httpinfopadeecsberkeleyeduCICsummarylocalindexhtml February


	 M Harmon T P Baker and D B Whalley A retargetable technique for predicting execution time In IEEE
RealTime Systems Symposium pages  December 

	 C A Healy D B Whalley and M G Harmon Integrating the timing analysis of pipelining and caching In
IEEE RealTime Systems Symposium pages  December 

	 C A Healy D B Whalley and M G Harmon Worstcase timing analysis of instruction caches with wrap
around ll In IEEE RealTime Systems Symposium December 
 submitted
	 J Hennessy and D Patterson Computer Architecture A Quantitative Approach Morgan Kaufmann nd
edition 

	
 Y Hur Y H Bea SS Lim BD Rhee S L Min Y C Park M Lee H Shin and C S Kim Worst case
timing analysis of RISC processors RR
 case study In IEEE RealTime Systems Symposium pages

 December 

	

 D B Kirk SMART strategic memory allocation for realtime cache design In IEEE RealTime Systems
Symposium pages  December 

	
 YT S Li S Malik and A Wolfe Ecient microarchitecture modeling and path analysis for realtime software
In IEEE RealTime Systems Symposium pages  December 



	
 SS Lim Y H Bea G T Jang BD Rhee S L Min Y C Park H Shin and C S Kim An accurate worst
case timing analysis for RISC processors In IEEE RealTime Systems Symposium pages 
 December


	
 CL Liu and James W Layland Scheduling algorithms for multiprogramming in a hardrealtime environment
Journal of the Association for Computing Machinery 

 January 

	
 F Mueller Static Cache Simulation and its Applications PhD thesis Dept of CS Florida State University
July 

	
 F Mueller Compiler support for softwarebased cache partitioning In ACM SIGPLAN Workshop on Language
Compiler and Tool Support for RealTime Systems pages 

 June 

	
 C Y Park Predicting program execution times by analyzing static and dynamic program paths RealTime
Systems 


 March 

	
 P Puschner Zeitanalyse von Echtzeitprogrammen PhD thesis Dept of CS Technical University Vienna
December 

	
 P Puschner Computing maximum task execution times  a graphbased approach RealTime Systems to
appear October 

	 P Puschner and C Koza Calculating the maximum execution time of realtime programs RealTime Systems



 September 

	
 R White D B Whalley and M G Harmon Bounding worstcase data cache performance In IEEE RealTime
Systems Symposium December 
 submitted
	 A Wolfe Softwarebased cache partitioning for realtime applications In Workshop on Responsive Computer
Systems 

	 N Zhang A Burns and M Nicholson Pipelined processors and worst case execution times RealTime Systems

 October 



