Scope-aware data cache analysis for WCET estimation by HUYNH BACH KHOA





National University of Singapore
A THESIS SUBMITTED
FOR THE DEGREE OF MASTER OF SCIENCE
SCHOOL OF COMPUTING
NATIONAL UNIVERSITY OF SINGAPORE
2010
Acknowledgement
First and foremost, I thank Lord God in heaven for His providence, His words, and the blessings
I enjoyed. I thank Him for the opportunity to pursue graduate study, for all the people that I
met, and for their kindness and their support.
Next, I wish to express my sincere gratitude to my supervisor, A/P. Abhik Roychoudhury. I
am very grateful for his encouragement, his patience and his advice throughout my research.
I give special thanks to my senior Ju Lei for his discussions, his help and the
time we worked together. Besides, I thank my fellow labmates: Wang Chundong, Sudipta
Chattopadhyay, Dawei Qi, Vivy Suhendra, Liang Yun, Huynh Phung Huynh, to name a few. I
thank my friends in church and my roommates. I am grateful for their friendship throughout
my study, and I really enjoyed my time with these brilliant people.
Finally, I wish to thank my parents for their unconditional love.
Summary
Caches are widely used in modern computer systems to bridge the increasing gap between
processor speed and memory access time. However, the presence of caches, especially data caches,
complicates static worst-case execution time (WCET) analysis. Access pattern analysis
(e.g., cache miss equations) is applicable only to a specific class of programs, where all
array accesses must have predictable access patterns. Abstract interpretation-based methods
(must/persistence analysis) determine cache conflicts based on coarse-grained memory access
information from address analysis, which usually leads to significant over-estimation.
In this thesis, we first present a refined persistence analysis method which fixes the potential
underestimation problem in the original persistence analysis. Based on our new persistence
analysis, we propose a framework that combines access pattern analysis and abstract
interpretation for accurate data cache analysis. We capture the dynamic behavior of a memory access
by computing its temporal scope (the loop iterations where a given memory block is accessed
for a given data reference) during address analysis. Temporal scopes as well as the loop hierarchy
structure (the static scopes) are integrated and utilized to achieve a more precise abstract cache
state modeling. We also prove the correctness of the proposed new persistence analysis.
Experimental results show that our proposed analysis obtains up to 74% reduction in the WCET






1.1 Background and Motivations
1.2 Thesis Contributions
2 Related work
3 Correcting persistence analysis
3.1 Assumptions and Notations
3.2 Persistence Analysis
3.2.1 Overview
3.2.2 Safety issue
3.2.3 Correcting the persistence analysis
3.3 Safety Proofs of Corrected Persistence Analysis
3.3.1 Structure of the proof
3.3.2 Safety of update function
3.3.3 Safety of join function
3.3.4 Safety of set update function
3.3.5 Termination of the analysis
4 Scope-aware Persistence Analysis
4.1 Motivations
4.2 Temporal Scope and Address Analysis
4.3 Scope-aware Persistence Analysis
4.3.1 Overall framework
4.3.2 Scope-aware update and join functions
4.3.3 ACS computation of the motivating example
4.4 Safety proofs of scope-aware persistence analysis
4.4.1 Structure of the proof
4.4.2 Safety proof of scope-aware update function
4.4.3 Safety proof of scope-aware join function
4.5 Cache Miss Computation
4.6 Experimental Results
5 Discussion and Conclusion
List of Figures
3.1 Running example and analysis result of persistence analysis [11]
3.2 Analysis result with the proposed update and join functions
3.3 Cache update for set of possible access addresses
4.1 Motivating example
4.2 Address expressions and temporal scopes
4.3 Multi-level analysis and results for the motivating example in Figure 4.1
4.4 Scope-aware ACS computation for L2 of the motivating example in Figure 4.1
4.5 Temporal scopes and loop iterations




1.1 Background and Motivations
Worst-case Execution Time (WCET) is a key metric for real-time embedded software. Static
WCET analysis provides a safe bound on the maximum execution time of a program on a
target platform over all possible program inputs. For cost-sensitive domains like automotive
electronics, the WCET estimation must be tight for cost-effective design and resource dimensioning.
However, modern processors contain performance-enhancing features such as caches and pipelines
whose run-time timing behavior is hard to predict statically. This makes micro-architectural
modeling (building timing models for micro-architectural features such as caches)
a key component of WCET analysis.
Timing models of instruction caches for WCET analysis have been well-studied [23]. On
the other hand, static timing analysis of data cache behavior remains a major challenge for
WCET analysis methods and tools. Accurate data cache modeling is of paramount importance
for tight WCET analysis of data-intensive routines. However, the run-time computed access
address (which data locations are accessed by different instances of an instruction) and dynamic
cache behavior make it difficult to develop a tight yet flexible and scalable static analysis.
Conservatively assuming that every memory access results in a cache miss yields a safe but
pessimistic WCET estimate.
Different static data cache analysis techniques have been developed so far. Access pattern-based
techniques (e.g., the cache miss equation framework in [13]) achieve tight estimation, but
are applicable only to programs that contain regular accesses with predictable patterns. On
the other hand, abstract interpretation-based data cache analysis techniques ([11, 20]) work on
general programs but suffer from large over-estimation. In this thesis, we seek to combine the
strengths of these two approaches. We observe that the over-estimation in existing abstract
interpretation-based data cache analysis stems from the globally defined abstract domain. In
particular, a coarse-grained address analysis is adopted to compute the set of memory blocks
possibly referenced by a memory access, while the temporal property of the access is ignored
(e.g., a memory block may be accessed only in certain iterations of a loop execution). The
approximation in the address analysis causes substantial over-estimation in WCET estimates.
Furthermore, abstract interpretation traditionally computes the fixed point of the abstract cache
state conservatively for the entire program execution (disregarding cache behavior in specific
program scopes), leading to large over-estimation.
In this work, we propose a general and accurate static data cache analysis method by com-
bining access pattern analysis and abstract interpretation. For abstract cache state computation,
we extend the cache behavior categorization of “persistence” as in the persistence analysis of
[11] to capture the access pattern information. In our new persistence analysis framework, we
also fix an error in the original persistence analysis which may result in underestimation of the
cache misses.
1.2 Thesis Contributions
Our contributions are as follows.
Firstly, given a data reference D and its access pattern, we derive not only the set of possibly
accessed memory blocks, but also their temporal scopes. The temporal scope of a memory
block m captures the loop iterations in the program where m may get accessed. Our proposed
data cache analysis decides whether a memory block is persistent within its temporal scope. In
particular, two memory blocks accessed in mutually exclusive temporal scopes do not conflict
with each other within their scopes, even though they are mapped to the same cache set.
Secondly, we also consider the static scopes in our analysis. Similar to the multi-level
cache analysis for instruction caches proposed in [2], we maintain a copy of the abstract data cache
states for each loop nesting level of the program execution. As a result, certain memory blocks
can be classified as persistent within a local scope of program execution (though they cannot be
guaranteed to be persistent globally).
Thirdly, we utilize scope-aware persistence while computing the number of data cache
misses. In the original persistence analysis, a data reference is classified as globally persistent
throughout the program execution. In contrast, our persistence analysis framework can guarantee
that a data reference is persistent within certain temporal and static scopes.
Last but not least, we have integrated our proposed framework into the open-source
Chronos WCET analyzer ([9]). The experimental results show that our proposed scope-aware





Early work in data cache analysis classifies data accesses into static data accesses for scalar
references and dynamic data accesses for array and pointer references. [8] analyzes static data
accesses in the same way as instruction memory accesses, and conservatively assumes that each
dynamic data access causes two cache misses. The first miss arises because the dynamic access
itself may reference a memory block that is not in the cache. The second arises because the
dynamic access may evict a useful cache line that the analysis of static data accesses had
counted as a cache hit. This approach leads to significant over-estimation when
there are more dynamic data accesses than static data accesses.
To guarantee cache hits without knowing the access pattern, [14] proposes using the pigeonhole
principle. In a loop, if a data reference D may access n1 distinct memory blocks
and they will not be evicted due to cache conflicts, then D has at most n1 cold misses. If
D is executed n2 times in that loop, it will have at least n2 − n1 cache hits. This approach
effectively detects cache reuse if the cache can hold all possibly accessed memory blocks in a
loop. However, it cannot guarantee cache reuse when cache conflicts occur, nor detect cache
reuse across different loop-nests.
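The pigeonhole bound described above can be illustrated with a minimal sketch. The function name and the clamping for the case n2 < n1 are assumptions made for this illustration; they are not taken from [14].

```python
def pigeonhole_bounds(n_blocks: int, n_executions: int) -> tuple[int, int]:
    """Return (max_cold_misses, min_hits) for one data reference in a loop.

    Assumes the accessed blocks never conflict in the cache, as required by
    the argument in the text. n_blocks is n1, n_executions is n2.
    """
    # Each distinct block can miss at most once; a reference executed fewer
    # times than it has blocks cannot miss more often than it executes
    # (this clamp is an addition for the illustrative corner case n2 < n1).
    max_cold_misses = min(n_blocks, n_executions)
    min_hits = n_executions - max_cold_misses
    return max_cold_misses, min_hits


# A reference touching 8 distinct blocks, executed 100 times in the loop.
print(pigeonhole_bounds(8, 100))  # (8, 92)
```

The bound says nothing once conflicts are possible, which is the limitation noted in the text.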
[17] extends their instruction cache conflict graph (CCG) to a data CCG to capture possible
cache reuses of data accesses as constraints in their integer linear programming (ILP) framework.
However, they require a separate constraint for each possible cache reuse between two
possibly accessed addresses. This causes a scalability problem for large arrays, given the
complexity of solving ILP problems. No experimental results are reported.
Many successful techniques for instruction cache analysis using abstract interpretation have
been extended to data caches, such as must analysis [20] and persistence analysis [11]. They
compute an abstract cache state (ACS) that conservatively represents all possible concrete cache
states at a program point under all circumstances. From the ACS, they derive the pessimistic
cache behavior for each data reference. However, the ACS is insensitive to local behavior (e.g.,
behavior within a subset of loop iterations). To overcome this problem, [20] proposes virtual loop
unrolling, which makes the analysis computationally expensive. Moreover, in the presence of
input-dependent branches, even with unrolling, no memory block can be guaranteed to be
loaded into the cache for later reuse in must analysis.
While the behavior of data accesses is very complex, in many real programs the access
pattern of array accesses follows a regular, loop-affine pattern. The cache miss equation (CME)
framework [13] and the Presburger Arithmetic formulation [4] apply mathematical models to
analyze the cache behavior of those accesses. The CME framework computes the reuse vector
for each regular reference and generates a set of Diophantine equations to characterize whether
the cache reuse can be realized or is interfered with by cache conflicts. The solutions of this equation
set are the possible conflict points, from which the number of cache misses can be derived.
[18] extends the CME framework to analyze scalar accesses and more general loop-nests, and
reduces over-estimation at the cost of higher computational complexity. The Presburger
Arithmetic framework is exact and can handle certain non-linear access patterns; however, it has
super-exponential computational cost in the worst case. Aside from being computationally
expensive, these approaches cannot handle programs with input-dependent branches and
unpredictable data accesses. Very recently, [12] presents an analytical model for analyzing the worst-case
performance of data caches without knowing the base addresses of data structures (e.g., arrays,
objects). They analyze the reuse vector of each data reference, and estimate the worst-case conflict
rate (the ratio of evicted lines over total accessed lines). Their approach is fast; however, as with
other reuse-based analyses, it is restricted to regular loop-affine access patterns without
input-dependent branches and irregular accesses. Because these approaches rely on mathematical
models, it is hard to combine them with the WCET analysis of other micro-architectural features,
for example with instruction cache analysis to perform unified cache analysis [5], or with cache
analysis for multi-core [6].
Array access analysis in the CME framework is typically performed at a high level. [25]
proposes a framework to detect loop-affine array accesses at the binary code level. From the array
access pattern, they can guarantee the cache reuse of data blocks that must have been loaded into the
cache in previous loop iterations. However, this approach requires analyzing each loop
iteration individually. As this is computationally expensive, for a loop in which there is no conflicting
line, they bound the worst-case number of cache misses by the maximum number of data blocks that
could be accessed according to the access pattern. However, they do not consider unpredictable data
accesses, nor discuss how possible cache conflicts influence the worst-case cache performance.
[22] identifies single data sequence (SDS) data references in program fragments where
both control flow and access addresses are input independent. Their cache performance can
be determined by simple simulation. They bound the impact of non-SDS data references on
simulation result using a cache miss counter. The cache miss counter is increased by one
for each data access that causes cache conflict with SDS data references. To bound cache
performance of non-SDS data references, they perform persistence analysis to determine if data
memory can be evicted from the cache once it is loaded. If all possibly accessed memory blocks
of a data reference D are persistent, D will have only one cold miss for each possibly accessed
memory block, while its other accesses must result in cache hits. The SDS classification is
quite restrictive, while the persistence analysis does not consider access pattern and could not




3.1 Assumptions and Notations
In our cache analysis, we consider a memory hierarchy containing separate L1 instruction and
data caches. We use the following notations to represent the instruction/data cache configuration
and accessibility.
• Capacity C: size of the cache in bytes.
• Block (line) size B: number of contiguous bytes loaded from memory to cache on
each memory access.
• Associativity A: an A-way set associative cache means that information stored at some
address in memory can be loaded into any of A locations in the cache (depending on
the cache replacement policy).
• Cache sets $F = \langle f_1, \ldots, f_{(C/B)/A} \rangle$: a cache set $f_i$ is a sequence of cache blocks (lines)
$CL = \langle l_1, \ldots, l_A \rangle$ which contains all the A ways that can be addressed with the same
index. set(m) returns the cache set that memory block m maps to.
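As an illustration of the notation above, the following sketch derives the number of cache sets (C/B)/A and a set index for a memory block. The class name, the field names, and in particular the modulo placement used for `set_of` are assumptions for this example; the text only states that set(m) returns the set a block maps to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheConfig:
    capacity: int       # C, in bytes
    block_size: int     # B, in bytes
    associativity: int  # A, number of ways per set

    @property
    def num_sets(self) -> int:
        # (C / B) / A, as in the definition of F.
        return (self.capacity // self.block_size) // self.associativity

    def block_of(self, address: int) -> int:
        # Memory block number containing a byte address.
        return address // self.block_size

    def set_of(self, block: int) -> int:
        # Assumption: conventional modulo placement of blocks onto sets.
        return block % self.num_sets


cfg = CacheConfig(capacity=1024, block_size=32, associativity=2)
block = cfg.block_of(4660)
print(cfg.num_sets, block, cfg.set_of(block))  # 16 145 1
```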
Reineke et al. [19] have investigated the predictability of popular cache replacement policies
such as LRU, PLRU, MRU, and FIFO. Their analysis indicates that the LRU policy is the most
suitable for timing-critical systems, and that the other policies (PLRU, MRU, and FIFO) are considerably
worse in their predictability benchmark. As a result, we choose the LRU policy for our analysis.
We assume that the LRU (Least Recently Used) replacement policy determines the relative
age of a memory block in an A-way associative cache set. Given a concrete cache state c at a
program point p, the concrete set state $s_i$ describes the state of cache set $c[f_i]$ at p. If $s_i(l_x) = m$,
memory block m has relative age x (1 ≤ x ≤ A) in cache set $c[f_i]$, and is in cache line $l_x$.
The cache line $l_1$ contains the youngest (most recently used) memory block, while $l_A$ contains
the oldest (least recently used) memory block. We assume a write-through, no-write-allocate
policy for memory store instructions in our discussion of data cache analysis. However, our




Persistence analysis determines if a memory block m is persistent: once loaded, it will not be
evicted from the cache in any possible execution. Therefore, the first access to a persistent
memory block m may encounter a miss, but all subsequent accesses are guaranteed to
result in cache hits.
To determine if a memory block m is persistent at a program point p, the persistence analysis
[10, 11] computes an abstract cache state (ACS) that bounds the maximum relative age x
of each memory block m which may be in the cache when the program control reaches p, over all
possible executions. If x is not higher than the cache associativity A, then once loaded, m is guaranteed
to remain in the cache at program point p. As a result, m is classified as persistent and causes
at most one cold miss.
An ACS $\hat{c} = \langle \hat{s}_1, \ldots, \hat{s}_{n/A} \rangle$ at a program point p models an A-way set associative cache
with n cache lines and n/A cache sets. Each abstract set state $\hat{s}_k = \langle l_1, \ldots, l_A, l_\top \rangle$ consists of A
cache lines $l_1, \ldots, l_A$ and an additional evicted cache line $l_\top$ to record evicted memory blocks.
For each memory block m, $\hat{s} = \hat{c}[set(m)]$ returns the abstract set state $\hat{s}$ in ACS $\hat{c}$ that m is
mapped to. If $m \in \hat{s}(l_x)$, m has maximal relative age x over all possible concrete cache states
when program control reaches p. If m is in the evicted line $\hat{s}(l_\top)$, the maximum relative age of m
is greater than the cache associativity A, so it may be evicted from the cache in some executions.
Persistence analysis can be performed on the control flow graph (CFG). A CFG consists of
a set of nodes $V = \{n_1, \ldots, n_k\}$ connected by directed edges. Each control flow node $n_k$ is a
basic block where the program execution is strictly sequential without any jump or jump target.
At basic block $n_k$ with incoming ACS $\hat{c}^{in}$, if the program accesses memory block m, the cache
update function $\hat{U}_{\hat{C}}$ computes the output ACS $\hat{c}^{out}$ after accessing m. If a basic block $n_k$ has
two or more incoming ACSs, the cache join function $J_{\hat{C}}$ combines them into their upper bound,
the representative input ACS $\hat{c}^{in}$ of node $n_k$. The persistence analysis repeatedly
traverses the CFG and performs these computations until the input ACSs of all nodes
reach a fixed point.
Given an access to memory block m and a concrete cache state c, the update of an A-way
set associative cache is modeled using the concrete cache update function $U_C$ [10] as follows:
$$U_C(c, m) = c[set(m) \mapsto U_S(c[set(m)], m)]$$
The concrete cache update function $U_C$ models the change in cache set s = set(m), where
$$U_S(s, m) = \begin{cases}
[\, l_1 \mapsto \{m\},\ l_i \mapsto s(l_{i-1}) \mid i = 2 \ldots h,\ l_i \mapsto s(l_i) \mid i = h+1 \ldots A \,] & \text{if } \exists h \in \{1..A\},\ m \in s(l_h) \\
[\, l_1 \mapsto \{m\},\ l_i \mapsto s(l_{i-1}) \mid i = 2 \ldots A \,] & \text{otherwise}
\end{cases}$$
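The concrete LRU set update can be sketched as follows. This is an illustrative sketch, not code from the thesis or from Chronos: the list representation (youngest block first) and the function name `lru_update` are assumptions.

```python
def lru_update(set_state, m, assoc):
    """Concrete LRU update of one cache set.

    set_state lists the cached blocks from youngest (l_1) to oldest.
    Returns the new set state after an access to block m.
    """
    # Blocks younger than m shift one line older; blocks older than m keep
    # their line. If m was absent, every block shifts and the block in l_A
    # is dropped by the truncation below.
    others = [block for block in set_state if block != m]
    return ([m] + others)[:assoc]


state = ["a", "b", "c", "d"]         # 4-way set, a is youngest
state = lru_update(state, "c", 4)    # hit on c at line l_3
print(state)                         # ['c', 'a', 'b', 'd']
state = lru_update(state, "e", 4)    # miss: d is evicted
print(state)                         # ['e', 'c', 'a', 'b']
```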
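From the concrete update function, Ferdinand and Wilhelm [11] propose an abstract cache
update function $\hat{U}_{\hat{C}}$ to compute the ACS after an access to memory block m as follows:
$$\hat{U}_{\hat{S}}(\hat{s}, m) = \begin{cases}
[\, l_1 \mapsto \{m\},\ l_i \mapsto \hat{s}(l_{i-1}) \mid i = 2 \ldots h-1,\ l_h \mapsto \hat{s}(l_h) \cup \hat{s}(l_{h-1}) \setminus \{m\},\ l_i \mapsto \hat{s}(l_i) \mid i = h+1 \ldots A, \top \,] & \text{if } \exists h \in \{1..A\},\ m \in \hat{s}(l_h) \\
[\, l_1 \mapsto \{m\},\ l_i \mapsto \hat{s}(l_{i-1}) \mid i = 2 \ldots A,\ l_\top \mapsto \hat{s}(l_\top) \cup \hat{s}(l_A) \setminus \{m\} \,] & \text{otherwise}
\end{cases}$$
The abstract set update function $\hat{U}_{\hat{S}}$ computes the change in the abstract set state
$\hat{s} = \hat{c}[set(m)]$ after accessing m. It brings (or renews) the newly accessed memory block m to the
youngest cache line $l_1$. If $m \notin \hat{s}$, $\hat{U}_{\hat{S}}$ ages all memory blocks m′ currently in $\hat{s}$. If $m \in \hat{s}(l_h)$,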






















































Figure 3.1: Running example and analysis result of persistence analysis [11]. Panels: (a) CFG, (b) 1st iteration, (c) Final ACS.
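Otherwise (k ≥ h), m′ remains in $\hat{s}(l_k)$.
If a CFG node n has two immediate predecessors $n_1$ and $n_2$, a join function $J_{\hat{C}}$ combines
the output ACSs of $n_1$ and $n_2$ to form the input ACS of n. The new relative age of a memory
block m is equal to the maximum of its ages over the output ACSs of the predecessor
nodes of n in which it appears. Let $\hat{c}_1$, $\hat{c}_2$ be the output ACSs of predecessors $n_1$, $n_2$. The join
function $J_{\hat{C}}$ computes the input ACS $\hat{c}$ of node n as follows:
$$J_{\hat{C}}(\hat{c}_1, \hat{c}_2) = \hat{c}[s_i \mapsto J_{\hat{S}}(\hat{c}_1[s_i], \hat{c}_2[s_i])]$$
$$J_{\hat{S}}(\hat{s}_1, \hat{s}_2) = \hat{s} \text{ where:}$$
$$\begin{aligned}
\hat{s}(l_x) ={} & \{m \mid m \in \hat{s}_1(l_a) \wedge m \in \hat{s}_2(l_b),\ x = \max(a, b)\} \\
& \cup \{m \mid m \in \hat{s}_1(l_x) \wedge m \notin \hat{s}_2\} \\
& \cup \{m \mid m \notin \hat{s}_1 \wedge m \in \hat{s}_2(l_x)\}
\end{aligned}$$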
Figure 3.1 describes the CFG of a program fragment with six basic blocks B0 ... B5 in a
loop. The program accesses memory block a in B1 and B4, b in B3, and c in B2. Assume
a, b, c are all mapped to cache set s with associativity A = 2. In the first iteration, if the
program takes execution path B0 → B1 → B3, it accesses memory block a in B1 and then
b in B3. Abstract set state $\hat{s}^{out}_{B3}$ in Figure 3.1(b) models the output cache state after B3 has
been executed. Memory block $b \in \hat{s}^{out}_{B3}$ has just been accessed, so it is brought to the youngest
cache line $\hat{s}^{out}_{B3}(l_1)$. Memory block a, accessed in B1, is mapped to the same cache set as
b. Therefore, the access to b in B3 ages memory block a to cache line $\hat{s}^{out}_{B3}(l_2)$. Similarly,
abstract set state $\hat{s}^{out}_{B4}$ in Figure 3.1(b) models the output cache state of B4 when the program
executes path B0 → B2 → B4, with memory block a in the youngest cache line $\hat{s}^{out}_{B4}(l_1)$ and
memory block c in line $\hat{s}^{out}_{B4}(l_2)$.
In Figure 3.1(b), as B5 has two predecessors B3 and B4, the join function $J_{\hat{S}}$ joins $\hat{s}^{out}_{B3}$
and $\hat{s}^{out}_{B4}$ to compute the input abstract set state $\hat{s}^{in}_{B5}$ of B5. $\hat{s}^{in}_{B5}$ captures the maximum relative
age of each memory block a, b, c when the program reaches B5 in the first iteration. Memory
block a has relative age x = 2 in B3 ($a \in \hat{s}^{out}_{B3}(l_2)$) and relative age x = 1 in B4 ($a \in \hat{s}^{out}_{B4}(l_1)$).
Therefore, it has maximum relative age x = 2 at B5 ($a \in \hat{s}^{in}_{B5}(l_2)$). Memory block b
does not appear in B4, and is the youngest memory block at B3. Therefore, it has maximum
relative age x = 1 at B5 ($b \in \hat{s}^{in}_{B5}(l_1)$). In the same way, c has maximum relative age x = 2
at B5 ($c \in \hat{s}^{in}_{B5}(l_2)$).
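The first-iteration join at B5 can be reproduced with a small sketch of the original join function. Representing an abstract set state as a dictionary from block to maximum relative age is an assumption of this sketch, as is the name `join_original`.

```python
def join_original(s1, s2):
    """Original persistence join on one abstract set state.

    Each state maps a memory block to its maximum relative age. A block
    present in both inputs takes the larger age; a block present in only
    one input keeps the age it has there.
    """
    joined = dict(s1)
    for block, age in s2.items():
        joined[block] = max(age, joined.get(block, age))
    return joined


out_b3 = {"b": 1, "a": 2}   # after path B0 -> B1 -> B3
out_b4 = {"a": 1, "c": 2}   # after path B0 -> B2 -> B4
in_b5 = join_original(out_b3, out_b4)
print(sorted(in_b5.items()))  # [('a', 2), ('b', 1), ('c', 2)]
```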
Figure 3.1(c) describes the ACSs after the second iteration through the loop, which are also the final
ACSs at the fixed point. Starting from the output cache state $\hat{s}^{out}_{B5}$ of B5 in Figure 3.1(b), in the loop-back
B5 → B0 → B1 → B3, the program accesses memory block a in B1 and b in B3. As b
has just been accessed, it is renewed to the youngest cache line $\hat{s}^{out}_{B3}(l_1)$. Memory block a is
aged to $\hat{s}^{out}_{B3}(l_2)$ by b. Since the maximum relative age of memory block c is older than or equal to
that of a and b, the accesses to a in B1 and b in B3 will not further increase the maximum relative
age of c, according to the update function $\hat{U}_{\hat{S}}$ described above. Therefore, memory block c
keeps maximum relative age x = 2 ($\hat{s}^{out}_{B3}(l_2)$). Similarly, output abstract set state $\hat{s}^{out}_{B4}$ captures
the maximum relative age of each memory block after the execution of B4. Because all
memory blocks a, b, and c are in the ACSs, no access to a, b, or c will further increase
the maximal relative age of the other memory blocks to the evicted line $l_\top$. As a result, the analysis
reaches the fixed point, where the ACSs capture the maximum relative age of each memory block
throughout program execution.
From the analysis result, in input set state $\hat{s}^{in}_{B5}$ of B5, memory block a has maximum relative
age x = 2, so it is persistent. Once loaded, it will always remain in the cache at B5 in all executions,
thus it causes at most one cold miss. Similarly, memory blocks b and c are also persistent, and each
causes at most one cold miss throughout the program's execution.
3.2.2 Safety issue
It has been pointed out that the persistence analysis proposed in [10] is unsafe. Figure 3.1
also illustrates an unsafe scenario of the original persistence analysis as proposed by [11]. As
described above, Figure 3.1(c) gives the ACS at the fixed point. The input ACS of B5 at the fixed
point ($\hat{s}^{in}_{B5}$ in Figure 3.1(c)) shows that memory block c is persistent in the loop. However, on
the path B0 → B2 → B4 → B5 followed by B0 → B1 → B3, c is evicted by the accesses
to a and b. Therefore, c is not persistent at B5, and the persistence analysis in [11] is unsafe.
The incorrectness is due to an error in the update function $\hat{U}_{\hat{S}}$. It wrongly assumes that if
memory block $b \in \hat{s}^{in}_{B5}$ (Figure 3.1(c)), then b is in the concrete set state $s^{in}_{B5}$ on all possible execution paths.
Consequently, the update function does not age memory blocks whose relative age is equal to or older
than that of b in $\hat{s}^{in}_{B5}$, such as a or c. However, $b \in \hat{s}^{in}_{B5}$ only means that b may be in the concrete set state $s^{in}_{B5}$. As
a result, there exist concrete set states $s^{in}_{B5}$ that do not contain b (e.g., only a and c are in $s^{in}_{B5}$
on path B0 → B2 → B4 → B5). In that case, b will age both a and c in $s^{in}_{B5}$, and the original
persistence analysis [10] will underestimate the relative age of a and c.
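The unsafe scenario can be replayed with a short sketch that runs the original abstract update next to a concrete LRU simulation. The dictionary representation (block to maximum age, with age A + 1 standing for the evicted line) and both function names are assumptions of this sketch; the abstract start state is the first-iteration state at B5 described for Figure 3.1(b).

```python
def lru_update(set_state, m, assoc):
    # Concrete LRU set, youngest block first.
    return ([m] + [b for b in set_state if b != m])[:assoc]


def update_original(state, m, assoc):
    """Original abstract set update on a block -> maximum-age map."""
    top = assoc + 1                  # age used for the evicted line
    h = state.get(m, top)            # current maximum age of m
    new_state = {}
    for block, age in state.items():
        if block == m:
            continue
        if h <= assoc and age >= h:
            new_state[block] = age   # not aged: the unsafe step
        else:
            new_state[block] = min(age + 1, top)
    new_state[m] = 1
    return new_state


A = 2
# Abstract state at B5 after the first iteration, then B1 (a) and B3 (b).
abstract = {"b": 1, "a": 2, "c": 2}
for block in ["a", "b"]:
    abstract = update_original(abstract, block, A)

# Concrete run: B2 (c), B4 (a), then B1 (a), B3 (b).
concrete = []
for block in ["c", "a", "a", "b"]:
    concrete = lru_update(concrete, block, A)

print(abstract["c"] <= A)   # True: the analysis still reports c as cached
print("c" in concrete)      # False: c has actually been evicted
```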
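Let $conc_{\hat{C}}(\hat{c}^{in})$ be the set of all possible concrete cache states represented by ACS $\hat{c}^{in}$ at
program point p. The unsafe scenario when accessing a memory block $m_a \in \hat{c}^{in}$ can be formulated
mathematically as follows:
$$\begin{aligned}
& \hat{s}^{in} = \hat{c}^{in}[set(m_a)] \wedge m_a \in \hat{s}^{in}(l_h) \\
& \rightarrow \exists c^{in} \in conc_{\hat{C}}(\hat{c}^{in}),\ s^{in} = c^{in}[set(m_a)] \wedge m_a \notin s^{in} \\
& \quad \wedge \exists m,\ m \in \hat{s}^{in}(l_h) \wedge m \in s^{in}(l_h) \\
& \quad \wedge h > 1 \wedge h \leq A
\end{aligned}$$
Let $s^{out} = U_S(s^{in}, m_a)$ and $\hat{s}^{out} = \hat{U}_{\hat{S}}(\hat{s}^{in}, m_a)$ be the output concrete set state and
abstract set state after the cache update. The relative ages of memory block m in the output
concrete set state $s^{out}$ and abstract set state $\hat{s}^{out}$ are as follows:
$$m \in s^{in}(l_h) \wedge m_a \notin s^{in},\ s^{out} = U_S(s^{in}, m_a) \rightarrow m \in s^{out}(l_{h+1})$$
$$m \in \hat{s}^{in}(l_h) \wedge m_a \in \hat{s}^{in}(l_h),\ \hat{s}^{out} = \hat{U}_{\hat{S}}(\hat{s}^{in}, m_a) \rightarrow m \in \hat{s}^{out}(l_h)$$
Because $m_a$ is not in $s^{in}$, $m_a$ ages m from line $l_h$ to $l_{h+1}$. On the other hand, $m_a$ is in $\hat{s}^{in}(l_h)$, so
update function $\hat{U}_{\hat{S}}$ does not age m from $l_h$ to $l_{h+1}$. Therefore, $m \in \hat{s}^{out}(l_h)$ but $m \in s^{out}(l_{h+1})$:
the abstract set state $\hat{s}^{out}$ underestimates the maximum relative age of m in concrete set state $s^{out}$.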
3.2.3 Correcting the persistence analysis
As demonstrated above, we cannot use the maximum relative age of memory block $m_a$ in ACS
$\hat{c}$ to determine if an access to $m_a$ would further age other memory blocks in $\hat{c}$. Given abstract
set state $\hat{s}$ with $m_a \in \hat{s}(l_h)$ and $m \in \hat{s}(l_k)$, an access to $m_a$ could still increase the maximum
relative age k of memory block m even when m has an older maximum relative age (k ≥ h). As a
result, we propose to track the set of memory blocks that may be more recently used (younger)
than memory block m in the ACS. An access to memory block $m_a$ will increase the maximum
relative age of m only if $m_a$ is not in the current younger set of m. Otherwise, $m_a$ is already
counted as a possible younger memory block than m. Therefore, according to the LRU policy, it
will not further increase the maximum relative age of memory block m. We define the Younger
Set (YS) as follows.
Definition 1 (Younger Set): For an abstract set state $\hat{s}$ at program point p, the younger set
$YS(\hat{s}, m)$ captures a superset of all memory blocks that may be more recently used (younger)










































































Figure 3.2: Analysis result with the proposed update and join functions
Under the LRU replacement policy, the relative age of memory block m is determined by the number of
memory blocks more recently used (younger) than m in the same cache set. Consequently, the
maximum relative age x of m in $\hat{s}$ is one more than the number of memory blocks possibly
younger than m, i.e., the size of the younger set $YS(\hat{s}, m)$ ($x = |YS(\hat{s}, m)| + 1$). If the maximum
relative age x is not greater than the cache associativity A, memory block m is guaranteed to
remain in the cache once it has been accessed.
To optimize analysis performance, we stop tracking the younger set $YS(\hat{s}, m)$ of m once it has
more memory blocks than the cache associativity A (hence m is not persistent). For caches using
LRU replacement, A is usually small (e.g., A ≤ 4). Therefore, the younger set $YS(\hat{s}, m)$ is
generally small and easy to track.
Figure 3.2(a) illustrates the younger set of each memory block a, b, c in the ACSs of B3, B4, and
B5 in the first loop iteration. In B3, b has just been accessed, so b is brought to the youngest line
$\hat{s}^{out}_{B3}(l_1)$ with no younger memory block. a is older than b, so a is in $\hat{s}^{out}_{B3}(l_2)$ with younger set
$YS(\hat{s}^{out}_{B3}, a) = \{b\}$. Similarly in B4, a has just been accessed, so a is in the youngest cache line $\hat{s}^{out}_{B4}(l_1)$,
and the younger set $YS(\hat{s}^{out}_{B4}, a)$ is empty. c is older than a, so $YS(\hat{s}^{out}_{B4}, c) = \{a\}$. In B5, b has
no younger memory block in either incoming block B3 or B4, so it has no younger memory
block in B5. a has younger memory block b in incoming block B3 and none in B4, so the
younger set $YS(\hat{s}^{in}_{B5}, a) = \{b\}$. Similarly, c has only one younger memory block a in B4, so
the younger set $YS(\hat{s}^{in}_{B5}, c) = \{a\}$.
Notice that from the younger set, we know that in the first iteration, memory block b is not
a possible younger memory block of c in any concrete cache state at B5, even though the
maximum relative age of b is smaller than the maximum relative age of c in $\hat{s}^{in}_{B5}$. Therefore, we
know that a subsequent access to b will increase the maximum relative age of c. Consequently,
our proposed younger set notion helps avoid the incorrectness of the original persistence analysis
in [11] (Figure 3.2(c)).
We propose new update and join functions that track and use the younger set in the ACS
computation, as follows.
New update function: Given a program point p with ACS $\hat{c}^{in}$, if the program accesses
memory block $m_a$ at p, our cache update function $\hat{U}_{\hat{C}}$ updates the state of cache set $set(m_a)$
using the set update function $\hat{U}_{\hat{S}}$:
$$\hat{U}_{\hat{C}}(\hat{c}^{in}, m_a) = \hat{c}^{out}[set(m_a) \mapsto \hat{U}_{\hat{S}}(\hat{c}^{in}[set(m_a)], m_a)]$$
Given the accessed memory block $m_a$ and the input abstract set state $\hat{s}^{in}$ to which $m_a$ is
mapped, the update function $\hat{U}_{\hat{S}}$ computes the output abstract set state $\hat{s}^{out}$ and calculates the
younger set $YS(\hat{s}^{out}, m)$ for each memory block m in $\hat{s}^{out}$ as follows:
$$\hat{U}_{\hat{S}}(\hat{s}^{in}, m_a) = \hat{s}^{out} \text{ with } \hat{s}^{out}(l_x) = \{m \mid m \in \hat{s}^{in} \cup \{m_a\},\ x = \min(|YS(\hat{s}^{out}, m)| + 1, \top)\}$$
where, for all $m \in \hat{s}^{in} \cup \{m_a\}$,
$$YS(\hat{s}^{out}, m) = \begin{cases} YS(\hat{s}^{in}, m) \cup \{m_a\} & \text{if } m \neq m_a \\ \emptyset & \text{if } m = m_a \end{cases}$$
When $m_a$ is accessed, for each memory block m in $\hat{s}^{in}$, $m_a$ becomes a more recently used
memory block than m if $m \neq m_a$. Therefore, the update function $\hat{U}_{\hat{S}}$ adds $m_a$ to the younger set
$YS(\hat{s}^{out}, m)$ and changes the maximum relative age of m accordingly. If $m = m_a$, m is accessed
and becomes the youngest memory block in set $\hat{s}^{out}$. As a result, the update function $\hat{U}_{\hat{S}}$ brings m
to $\hat{s}^{out}(l_1)$ and sets its younger set $YS(\hat{s}^{out}, m)$ to empty.
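The update rule can be tried out on a small example. The following is a minimal Python sketch, not the thesis implementation: it models one abstract set state as a dictionary from block name to younger set, and the names `update`, `max_age` and `TOP` are illustrative assumptions. The input state is the one the text describes for B1 in Figure 3.2(b), and a 2-way associative cache is assumed.

```python
TOP = "l_top"  # hypothetical marker standing for the evicted line


def update(ys_in, m_a):
    """Abstract set update for an access to block m_a.

    ys_in maps every block of the abstract set state to its younger set.
    """
    # Every other block gains m_a as a possible younger block.
    ys_out = {m: ys | {m_a} for m, ys in ys_in.items() if m != m_a}
    # The accessed block becomes the youngest: its younger set is empty.
    ys_out[m_a] = frozenset()
    return ys_out


def max_age(ys, assoc):
    """Maximum relative age min(|YS| + 1, TOP) for associativity assoc."""
    age = len(ys) + 1
    return age if age <= assoc else TOP


# State before B1 in the second iteration, as described for Figure 3.2(b).
s_in = {"b": frozenset(), "a": frozenset({"b"}), "c": frozenset({"a"})}
s_out = update(s_in, "a")
ages = {m: max_age(ys, assoc=2) for m, ys in s_out.items()}
```

In this sketch the cache line of a block is never stored explicitly; it is always derived from the size of the younger set, which mirrors the definition of $\hat{s}^{out}(l_x)$.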
Figure 3.2(b) shows our update function at B1 after the first iteration described in Figure
3.2(a). $\hat{s}^{in}_{B1}$ contains memory block b in cache line $l_1$, and a and c in cache line $l_2$. As seen in
Figure 3.2(a), after the first iteration, b is the youngest memory block. Therefore, $YS(\hat{s}^{in}_{B1}, b)$
is empty. a is aged by b in B3, so $YS(\hat{s}^{in}_{B1}, a) = \{b\}$. Similarly, c is aged by a in B4, so
$YS(\hat{s}^{in}_{B1}, c) = \{a\}$. At B1, the program accesses memory block a. Consequently, a is renewed
to the youngest line $\hat{s}^{out}_{B1}(l_1)$ and its younger set $YS(\hat{s}^{out}_{B1}, a)$ is set to empty. a becomes a new
younger block of b, so $YS(\hat{s}^{out}_{B1}, b) = \{a\}$. With one possible younger memory block, b has maximal
relative age x = 2. Because c already has a in its younger set $YS(\hat{s}^{in}_{B1}, c)$, it keeps the same
maximal relative age and younger set.
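New join function: Given a program point p with two incoming edges from p1 and p2
having ACSs $\hat{c}_1$ and $\hat{c}_2$, the join function $\hat{J}_{\hat{C}}$ computes the joined ACS $\hat{c}$ as the combined upper
bound of the incoming ACSs:
$$\hat{J}_{\hat{C}}(\hat{c}_1, \hat{c}_2) = \hat{c}[s_i \mapsto \hat{J}_{\hat{S}}(\hat{c}_1[s_i], \hat{c}_2[s_i])]$$
Given two incoming abstract set states $\hat{s}_1$ and $\hat{s}_2$, we propose a new join function to compute
the combined abstract set state $\hat{s}$ and track the younger set for each memory block $m \in \hat{s}$ as
follows:
$$\hat{J}_{\hat{S}}(\hat{s}_1, \hat{s}_2) = \hat{s} \text{ with } \hat{s}(l_x) = \{m \mid m \in \hat{s}_1 \cup \hat{s}_2,\ x = \min(|YS(\hat{s}, m)| + 1, \top)\}$$
where, for all $m \in \hat{s}_1 \cup \hat{s}_2$,
$$YS(\hat{s}, m) = \begin{cases} YS(\hat{s}_1, m) \cup YS(\hat{s}_2, m) & \text{if } m \in \hat{s}_1 \wedge m \in \hat{s}_2 \\ YS(\hat{s}_1, m) & \text{if } m \in \hat{s}_1 \wedge m \notin \hat{s}_2 \\ YS(\hat{s}_2, m) & \text{if } m \notin \hat{s}_1 \wedge m \in \hat{s}_2 \end{cases}$$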
The joined abstract set state $\hat{s}$ is the set union of $\hat{s}_1$ and $\hat{s}_2$. Moreover, the younger set $YS(\hat{s}, m)$
of each memory block m in $\hat{s}$ is the union of the younger sets of m in $\hat{s}_1$ and $\hat{s}_2$, wherever m is present.
The relative age of m in $\hat{s}$ is then set according to the size of its younger set. Because the younger
set of m in $\hat{s}$ is the union of its younger sets over the incoming states, it contains all
the possible memory blocks younger than m in $\hat{s}$ in all possible executions.

Figure 3.3: Cache update for set of possible access addresses
Figure 3.2(c) illustrates our join function. In B3, memory block b has no younger memory
block, but in B4, b has two younger memory blocks a and c, so $YS(\hat{s}^{in}_{B5}, b) = \{a, c\}$ in the combined
abstract set state $\hat{s}^{in}_{B5}$ of B5. Similarly, $YS(\hat{s}^{in}_{B5}, c) = \{a, b\}$ and $YS(\hat{s}^{in}_{B5}, a) = \{b\}$. Our
proposed persistence analysis accurately points out that a is persistent at B5. However, b and c
have up to two possible younger memory blocks, so they may be evicted.
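The join can be sketched in the same dictionary-based style. This is an illustrative Python sketch under stated assumptions: the function names are hypothetical, the cache is assumed to be 2-way associative, and the two input states are reconstructed from the prose (in particular, the younger set of c on the B3 side is an assumption chosen to be consistent with the joined result quoted above).

```python
def join(ys1, ys2):
    """Join two abstract set states given as block -> younger set."""
    joined = {}
    for m in ys1.keys() | ys2.keys():
        # A block missing on one side contributes an empty younger set there,
        # which matches the three cases of the join definition.
        joined[m] = ys1.get(m, frozenset()) | ys2.get(m, frozenset())
    return joined


def persistent(ys, assoc):
    """A block is persistent if its maximum relative age fits in the set."""
    return len(ys) + 1 <= assoc


# Output states of B3 and B4 in the second iteration (reconstructed).
out_b3 = {"b": frozenset(), "a": frozenset({"b"}), "c": frozenset({"a", "b"})}
out_b4 = {"a": frozenset(), "c": frozenset({"a"}), "b": frozenset({"a", "c"})}

in_b5 = join(out_b3, out_b4)
verdict = {m: persistent(ys, assoc=2) for m, ys in in_b5.items()}
```

With these inputs only a is reported persistent at B5, in line with the discussion of Figure 3.2(c).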
New update function for set: Unlike an instruction reference, a data reference D can access
a set of possible data addresses Addr(D). Therefore, the cache update function $\hat{U}_{\hat{C}}$ needs
to handle sets of possibly referenced memory blocks, as in [11]. We propose a new update
function for sets that updates the ACS $\hat{c}$ and tracks the younger sets after an access by data
reference D as follows:
$$\hat{U}_{\hat{C}}(\hat{c}, Addr(D)) = \hat{c}[f_i \mapsto \hat{U}_{\hat{S}}(\hat{c}[f_i], X_{f_i})] \quad \text{for all } f_i \in \{f = set(m) \mid m \in Addr(D)\}$$
where $X_{f_i} = \{m_y \mid m_y \in Addr(D),\ set(m_y) = f_i\}$.
Given the set of possible access addresses Addr(D) of data reference D, the abstract cache
update function $\hat{U}_{\hat{C}}$ divides it into the sets $X_{f_i}$, where $X_{f_i}$ contains the possible access addresses
in Addr(D) that map to cache set $f_i$. Our new abstract set update function $\hat{U}_{\hat{S}}$ computes the output
abstract set state $\hat{s}^{out}$ from the input abstract set state $\hat{s}^{in}$ and the subset $X_{f_i}$ of Addr(D) mapped to
this cache set as follows:
$$\hat{U}_{\hat{S}}(\hat{s}^{in}, X_{f_i}) = \hat{s}^{out} \text{ with } \hat{s}^{out}(l_x) = \{m \mid m \in \hat{s}^{in} \cup X_{f_i},\ x = \min(|YS(\hat{s}^{out}, m)| + 1, \top)\}$$
where, for all $m \in \hat{s}^{in} \cup X_{f_i}$,
$$YS(\hat{s}^{out}, m) = \begin{cases} YS(\hat{s}^{in}, m) \cup (X_{f_i} \setminus \{m\}) & \text{if } m \in \hat{s}^{in} \\ \emptyset & \text{otherwise} \end{cases}$$
Because no memory block $m_a \in Addr(D)$ is guaranteed to be accessed, we cannot renew
$m_a \in \hat{s}^{in}$ even though $m_a \in Addr(D)$. However, any $m_a \in X_{f_i}$ could become a new
younger memory block of every other memory block m currently in $\hat{s}^{in}$. Therefore, the update function
$\hat{U}_{\hat{S}}$ adds $X_{f_i} \setminus \{m\}$ to the younger set $YS(\hat{s}^{out}, m)$ of m. If a memory block $m_a \in X_{f_i}$ and
$m_a \notin \hat{s}^{in}$, $m_a$ may be a newly accessed memory block in $\hat{s}^{out}$. Therefore, the update function
$\hat{U}_{\hat{S}}$ adds $m_a$ to the abstract set state $\hat{s}^{out}$ as a youngest memory block with an empty younger set.
Figure 3.3(a) illustrates such a scenario. A data reference D in B3 may access a set of
possible memory blocks {b, c, d} mapped to $\hat{s}^{in}_{B3}$. Figure 3.3(b) shows the input abstract set state
$\hat{s}^{in}_{B3}$ and the resulting abstract set state $\hat{s}^{out}_{B3}$ after the memory access. As any of {b, c, d} could
be accessed, the set update function adds all of them to the younger sets of memory blocks a and
b in $\hat{s}^{in}_{B3}$ (excluding each block from its own younger set). Therefore, a is aged to the evicted line
$l_\top$ because it has {b, c, d} as possible younger blocks. b is also moved to $l_\top$ because it has two
possible younger blocks c and d. c and d are added
to $\hat{s}^{out}_{B3}(l_1)$ as most recently used memory blocks with no younger memory block.
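The set-valued update admits an equally small sketch. The Python code below is illustrative only: `update_set` is a hypothetical name, and the input state (b youngest, a aged once by b) is an assumption chosen to be consistent with the outcome described for Figure 3.3(b), since the figure itself is not reproduced here.

```python
def update_set(ys_in, x_f):
    """Abstract set update for a reference that may touch any block in x_f.

    ys_in maps each block in the abstract set state to its younger set;
    x_f is the part of Addr(D) that maps to this cache set.
    """
    x_f = frozenset(x_f)
    # No block is guaranteed to be accessed, so nothing is renewed: every
    # block already present may be aged by any other block in x_f.
    ys_out = {m: ys | (x_f - {m}) for m, ys in ys_in.items()}
    # Blocks seen for the first time enter as youngest, with no younger block.
    for m in x_f:
        ys_out.setdefault(m, frozenset())
    return ys_out


# Assumed input state: b is youngest, a has been aged once by b.
s_in_b3 = {"b": frozenset(), "a": frozenset({"b"})}
s_out_b3 = update_set(s_in_b3, {"b", "c", "d"})
```

For a 2-way associative cache, a (three possible younger blocks) and b (two) both exceed the associativity and land in the evicted line, while c and d enter at the youngest line.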
3.3 Safety Proofs of Corrected Persistence Analysis
In this section, we will prove the safety and termination of our proposed persistence analysis.
In our persistence analysis and the proofs, we consider a program point before and after
each program instruction. Note that for data cache analysis, it is possible that there is no data
memory reference between two program points if the instruction does not access data memory.
For each memory block m, the relative age of m in the cache is determined by the number
of more recently used (younger) memory blocks in the same cache set. Consider an execution
path pa that reaches program point p with concrete cache state c. Memory block m in cache
set s = c[set(m)] has relative age y ($m \in s(l_y)$) if there are y−1 younger memory blocks
in s (from $s(l_1)$ to $s(l_{y-1})$). We define the concrete younger set of memory block m as follows:
Definition 2 (Concrete younger set) The concrete younger set ys(s, m) of memory block m is the
set of memory blocks more recently used (younger) than m in the concrete set state s of the cache set
to which m is mapped.
$$m \in s(l_y) \rightarrow ys(s, m) = s(l_1) \cup \dots \cup s(l_{y-1}) \wedge y = |ys(s, m)| + 1$$
In our proposed persistence analysis, at program point p with ACS $\hat{c}$ at fixed point, we
determine the maximum relative age x of memory block m from the younger set $YS(\hat{s}, m)$, the
set of all memory blocks possibly younger (more recently used) than m in the abstract set state
$\hat{s} = \hat{c}[set(m)]$, i.e. $x = |YS(\hat{s}, m)| + 1$. To prove the safety of our persistence analysis, we
prove that, with our proposed update and join functions, the younger set $YS(\hat{s}, m)$ is a superset
of the concrete younger set ys(s, m) in the concrete set state s = c[set(m)] at p in any execution path
that reaches p. This is captured by the younger set property.
Definition 3 (YS property) Consider an arbitrary path pa from the start of execution to program
point p which results in concrete cache state c. Let $\hat{c}$ be the computed fixed-point ACS at p.
For each memory block $m \in c$, let $\hat{s} = \hat{c}[set(m)]$ and s = c[set(m)] be the abstract and
concrete states of the cache set to which m is mapped. The younger set $YS(\hat{s}, m)$ is a superset of
the concrete younger set ys(s, m):
$$\forall m \in c,\ s = c[set(m)],\ \hat{s} = \hat{c}[set(m)]:\quad ys(s, m) \subseteq YS(\hat{s}, m)$$
If the younger set $YS(\hat{s}, m)$ is a superset of the concrete younger set ys(s, m), the maximum
relative age x of m in $\hat{s}$ computed by our analysis ($x = |YS(\hat{s}, m)| + 1$) is always greater than or
equal to the concrete relative age y of m in s ($y = |ys(s, m)| + 1$). Hence, if the maximum
relative age x is less than or equal to the cache associativity A, m is not evicted from the cache in
any concrete cache set s at p. Therefore, our persistence analysis is safe.
3.3.1 Structure of the proof
We prove by induction that the YS property holds in all possible execution paths in the program.
• Because the concrete cache state c is empty at the start of the execution, the YS property is
trivially true initially.
• Assume the YS property holds at $p^{in}$, before program point p. If, at p, the program accesses
memory block $m_a$ (or a set of possible memory blocks Addr(D) = {m1...mk} of data
reference D), we prove that the YS property holds at $p^{out}$, after program point p, by proving
the correctness of our update function $\hat{U}_{\hat{S}}$ (Section 3.3.2 and Section 3.3.4).
• Assume the YS property holds at $p^{out}$, after program point p. We prove that the YS property
holds at $p^{in}_n$, before the next program point $p_n$, by proving the correctness of our join
function $\hat{J}_{\hat{S}}$ (Section 3.3.3).
As the YS property is true at the start of the execution, before and after each program point,
and from one program point to another, the YS property holds for all possible executions of the
program. Therefore, given the fixed-point ACS $\hat{c}$ at program point p, in any execution path that
reaches p with concrete cache state c, let $\hat{s} = \hat{c}[set(m)]$ and s = c[set(m)]; the younger set
$YS(\hat{s}, m)$ is a superset of the concrete younger set ys(s, m) of m in s. Consequently, the
maximal relative age x of m in $\hat{s}$ ($x = |YS(\hat{s}, m)| + 1$) is always greater than or equal to the
relative age y of m in s ($y = |ys(s, m)| + 1$). As a result, if the maximal relative age x is less
than or equal to the cache associativity A, m is persistent when the program control reaches p in
all executions.
3.3.2 Safety of update function
We prove that our update function preserves the YS property. If the program accesses $m_a$ at program
point p and the YS property holds at $p^{in}$, we prove that the YS property holds at $p^{out}$.
Given a path pa having concrete cache state $c^{in}$ at $p^{in}$, before program point p, let $\hat{c}^{in}$ be
the fixed-point ACS at $p^{in}$. Assuming the YS property holds at $p^{in}$, we have
$$\forall m \in c^{in},\ s^{in} = c^{in}[set(m)],\ \hat{s}^{in} = \hat{c}^{in}[set(m)]:\quad ys(s^{in}, m) \subseteq YS(\hat{s}^{in}, m) \qquad \text{[B.1]}$$
If the program accesses memory block $m_a$ at program point p, let $c^{out}$ be the concrete cache
state of path pa at $p^{out}$, after program point p. Let $\hat{c}^{out}$ be the fixed-point ACS at $p^{out}$. We prove
that the YS property holds at $p^{out}$:
$$\forall m \in c^{out},\ s^{out} = c^{out}[set(m)],\ \hat{s}^{out} = \hat{c}^{out}[set(m)]:\quad ys(s^{out}, m) \subseteq YS(\hat{s}^{out}, m) \qquad \text{[B.2]}$$
Case 1: $set(m) \neq set(m_a)$
Because $set(m) \neq set(m_a)$, the cache state of m is unaffected by the access to memory
block $m_a$. As a result, there is no change in the concrete set state, $s^{out} = s^{in}$, so $ys(s^{out}, m) =
ys(s^{in}, m)$. Similarly, there is no change in the abstract set state, $\hat{s}^{out} = \hat{s}^{in}$, so $YS(\hat{s}^{out}, m) =
YS(\hat{s}^{in}, m)$. Therefore, the YS property continues to hold from $p^{in}$ to $p^{out}$.
Case 2: $set(m) = set(m_a)$
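As m and $m_a$ are mapped to the same cache set, if $m \neq m_a$, $m_a$ becomes a new younger
memory block of m. Otherwise ($m = m_a$), m is renewed to the youngest cache line and has no
younger memory block. The concrete younger set therefore changes as follows:
$$ys(s^{out}, m) = \begin{cases} ys(s^{in}, m) \cup \{m_a\} & \text{if } m \neq m_a \\ \emptyset & \text{if } m = m_a \end{cases} \qquad \text{[B.3]}$$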
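From our proposed update function $\hat{U}_{\hat{S}}$, the new younger set of each memory block in $\hat{s}^{in}$
is computed as follows:
$$\forall m \in \hat{s}^{in}:\quad YS(\hat{s}^{out}, m) = \begin{cases} YS(\hat{s}^{in}, m) \cup \{m_a\} & \text{if } m \neq m_a \\ \emptyset & \text{if } m = m_a \end{cases} \qquad [\hat{U}_{\hat{S}}]$$
As a result, we have
$$\text{[B.1]} \rightarrow ys(s^{in}, m) \subseteq YS(\hat{s}^{in}, m)$$
$$\text{[B.3]} \rightarrow ys(s^{out}, m) = \begin{cases} ys(s^{in}, m) \cup \{m_a\} & \text{if } m \neq m_a \\ \emptyset & \text{if } m = m_a \end{cases}$$
$$[\hat{U}_{\hat{S}}] \rightarrow YS(\hat{s}^{out}, m) = \begin{cases} YS(\hat{s}^{in}, m) \cup \{m_a\} & \text{if } m \neq m_a \\ \emptyset & \text{if } m = m_a \end{cases}$$
Combining [B.1], [B.3] and $[\hat{U}_{\hat{S}}]$:
• if $m = m_a$: $ys(s^{out}, m) = \emptyset \subseteq YS(\hat{s}^{out}, m)$;
• if $m \neq m_a$: $ys(s^{out}, m) = ys(s^{in}, m) \cup \{m_a\}$, $YS(\hat{s}^{out}, m) = YS(\hat{s}^{in}, m) \cup \{m_a\}$ and
$ys(s^{in}, m) \subseteq YS(\hat{s}^{in}, m)$, hence $ys(s^{out}, m) \subseteq YS(\hat{s}^{out}, m)$.
Therefore, the YS property holds at $p^{out}$, after the execution of step p.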
3.3.3 Safety of join function
Assume the YS property holds at $p^{out}$, after program point p. We prove that the YS property holds at
$p^{in}_n$, before the immediately following program point $p_n$, by proving the correctness of our join function $\hat{J}_{\hat{S}}$.
Given a path pa having concrete cache state $c^{out}$ at $p^{out}$, let $\hat{c}^{out}$ be the fixed-point ACS at
$p^{out}$. Assuming the YS property holds at $p^{out}$, we have
$$\forall m \in c^{out},\ s^{out} = c^{out}[set(m)],\ \hat{s}^{out} = \hat{c}^{out}[set(m)]:\quad ys(s^{out}, m) \subseteq YS(\hat{s}^{out}, m) \qquad \text{[C.1]}$$
Let $c^{in}_n$ be the concrete cache state of path pa at $p^{in}_n$, before the next program point $p_n$. Let
$\hat{c}^{in}_n$ be the fixed-point ACS at $p^{in}_n$. We prove that the YS property holds at $p^{in}_n$:
$$\forall m \in c^{in}_n,\ s^{in}_n = c^{in}_n[set(m)],\ \hat{s}^{in}_n = \hat{c}^{in}_n[set(m)]:\quad ys(s^{in}_n, m) \subseteq YS(\hat{s}^{in}_n, m) \qquad \text{[C.2]}$$
From our proposed join function $\hat{s} = \hat{J}_{\hat{S}}(\hat{s}_1, \hat{s}_2)$, the younger set $YS(\hat{s}^{in}_n, m)$ of m at $p^{in}_n$ is the
union of the younger sets of m over all incoming edges of $p^{in}_n$. As $p^{out}$ is one of the incoming edges, we
have
$$YS(\hat{s}^{out}, m) \subseteq YS(\hat{s}^{in}_n, m) \qquad [\hat{J}_{\hat{S}}]$$
Because program point $p^{in}_n$ is immediately after $p^{out}$, no new memory block is accessed, so
the concrete set state remains the same, $s^{in}_n = s^{out}$. As a result, the concrete younger set of
each memory block m also remains the same:
$$ys(s^{in}_n, m) = ys(s^{out}, m) \qquad \text{[C.3]}$$
In summary,
$$\text{[C.1]} \rightarrow ys(s^{out}, m) \subseteq YS(\hat{s}^{out}, m)$$
$$[\hat{J}_{\hat{S}}] \rightarrow YS(\hat{s}^{out}, m) \subseteq YS(\hat{s}^{in}_n, m)$$
$$\text{[C.3]} \rightarrow ys(s^{in}_n, m) = ys(s^{out}, m)$$
$$\rightarrow ys(s^{in}_n, m) \subseteq YS(\hat{s}^{in}_n, m)$$
So the younger set $YS(\hat{s}^{in}_n, m)$ always contains all possible memory blocks younger than m in
set(m) of $c^{in}_n$ at $p^{in}_n$. Therefore, the YS property holds at the next program point $p^{in}_n$.
3.3.4 Safety of set update function
A data reference D can access a set of possible data addresses Addr(D) = {m1...mk}.
Therefore, the cache update function $\hat{U}_{\hat{C}}$ needs to handle sets of possibly referenced memory blocks,
as in [11]. We prove that our set update function preserves the YS property. If the program may
access any $m_a \in Addr(D) = \{m_1 \dots m_k\}$ at p and the YS property holds at $p^{in}$, before program
point p, we prove that the YS property holds at $p^{out}$, after the data memory access at program point p.
Given a path pa having concrete cache state $c^{in}$ at $p^{in}$, let $\hat{c}^{in}$ be the fixed-point ACS at $p^{in}$.
Assuming the YS property holds at $p^{in}$, we have
$$\forall m \in c^{in},\ s^{in} = c^{in}[set(m)],\ \hat{s}^{in} = \hat{c}^{in}[set(m)]:\quad ys(s^{in}, m) \subseteq YS(\hat{s}^{in}, m) \qquad \text{[D.1]}$$
Let $c^{out}$ be the concrete cache state of path pa at $p^{out}$, after the memory access at p. Let $\hat{c}^{out}$
be the fixed-point ACS at $p^{out}$. We prove that the YS property holds at $p^{out}$:
$$\forall m \in c^{out},\ s^{out} = c^{out}[set(m)],\ \hat{s}^{out} = \hat{c}^{out}[set(m)]:\quad ys(s^{out}, m) \subseteq YS(\hat{s}^{out}, m) \qquad \text{[D.2]}$$
For each memory block m in the cache set $s^{in}$, let $X_{f_i}$ be the set of memory blocks in
Addr(D) mapped to $s^{in}$. The data reference D can access any memory block $m_a \in X_{f_i}$.
If $m \neq m_a$, $m_a$ becomes a new younger memory block of memory block m. Otherwise
($m = m_a$), m is renewed to the youngest cache line and has no younger memory block.
$$ys(s^{out}, m) = \begin{cases} ys(s^{in}, m) \cup \{m_a\} \text{ for any } m_a \in X_{f_i} & \text{if } m \in s^{in} \wedge m \neq m_a \\ \emptyset & \text{otherwise} \end{cases} \qquad \text{[D.3]}$$
Our proposed set update function calculates the new possible younger set of m in $\hat{s}^{in}$ when
accessed by set $X_{f_i}$ as follows:
$$YS(\hat{s}^{out}, m) = \begin{cases} YS(\hat{s}^{in}, m) \cup (X_{f_i} \setminus \{m\}) & \text{if } m \in \hat{s}^{in} \\ \emptyset & \text{otherwise} \end{cases} \qquad [\hat{U}_{\hat{S}}]$$
In summary, combining [D.1], [D.3] and $[\hat{U}_{\hat{S}}]$:
• if $m \neq m_a$: $ys(s^{out}, m) = ys(s^{in}, m) \cup \{m_a\}$ for any $m_a \in X_{f_i}$ with $m \neq m_a$,
$YS(\hat{s}^{out}, m) = YS(\hat{s}^{in}, m) \cup (X_{f_i} \setminus \{m\})$ and $ys(s^{in}, m) \subseteq YS(\hat{s}^{in}, m)$,
hence $ys(s^{out}, m) \subseteq YS(\hat{s}^{out}, m)$;
• if $m = m_a$: $ys(s^{out}, m) = \emptyset$, hence $ys(s^{out}, m) \subseteq YS(\hat{s}^{out}, m)$.
So $YS(\hat{s}^{out}, m)$ contains all possible memory blocks younger than m in $c^{out}[set(m)]$ at $p^{out}$
after the access of data reference D. As a result, the YS property holds at program point $p^{out}$,
after the data access in p.
3.3.5 Termination of the analysis
The number of memory blocks in a program and the number of cache lines are finite. Therefore,
the abstract domain $\hat{c}: L \mapsto 2^S$ is finite. Moreover, the cache update function $\hat{U}_{\hat{S}}$ and join





The current persistence analysis (proposed in [11] and corrected in the previous chapter) determines
whether, once loaded, a memory block m will never be evicted from the cache under any circumstances.
However, a data memory block m remains in the cache under all circumstances only when the
data cache is large enough to hold all possible data addresses. Otherwise, memory block m
could be evicted, and hence it cannot be classified as persistent. Consequently, all data accesses to
an unclassified m are conservatively treated as misses.
However, we notice that for each loop L, a data reference D may access memory block m
only in a limited interval [lw, up] of L's iterations (from iteration lw to iteration up of loop L).
In this interval, if memory block m is guaranteed to remain in the cache once loaded, the first
time D accesses m may cause one cache miss, but all subsequent accesses to m must result in
cache hits. Moreover, outside this interval, memory block m is not accessed by data reference
D, so it causes no cache miss to D. As a result, if memory block m is persistent (not evicted
from the cache once loaded) in the interval [lw, up] of loop L's iterations, it causes at most
one cache miss to D each time loop L is executed. Therefore, by capturing the persistence of
memory block m in a smaller scope (i.e. interval [lw, up] of loop L), we can guarantee a
tighter worst-case performance of the data cache.

(a) Example program:
    int A[16];  int B[4][16];
    int D[4]; short int C[4][16];
    for (i=0; i<4; i++) { //L1
        a = A[x];
        for (j=0; j<16; j++) { //L2
            if (a%2==0) b = B[i][j];
            else b = C[i][j];
            sum += D[0] + b;
        }
    }
(c) Memory blocks accessed, by iteration of L1 (columns 0 to 3); only these rows are recoverable:
    Reference   Range    0     1     2     3
    A[x]        0..15    m0,m1
    B[i][j]     0..7     m2    m4    m6    m8
    B[i][j]     8..15    m3    m5    m7    m9
(d) Memory blocks grouped by cache set (recovered rows only):
    m0, m4, m8, m12
    m1, m5, m9, m13
    m2, m6, m10, m14
Figure 4.1: Motivating example
Figure 4.1(a) presents our motivating example with four array references in two nested loops
L1 and L2. The unpredictable array reference A[x] could access any memory block in address
set Addr(A) = {m0, m1} (assuming A[x] always accesses within the address range of array A).
Similarly, the array references B[i][j] and C[i][j] could access any memory block in address sets
Addr(B) = {m2...m9} and Addr(C) = {m12...m15} respectively, and D[0] accesses only
memory block m10. Figure 4.1(b) shows the CFG and the possible memory addresses of each data
reference. Assuming a 2-way set-associative cache with four cache sets {f0...f3}, Figure 4.1(d)
gives the possible cache conflicts within the loop nest. Because no memory block is persistent
throughout the program execution, all data accesses are conservatively treated as misses in the
worst case according to the existing persistence analysis framework.
However, Figure 4.1(c) describes the access pattern of each data reference in the running
example. As A[x] is an unpredictable data access, it could access either m0 or m1 in any iteration
of loop L1. On the other hand, B[i][j] and C[i][j] are loop-affine array accesses with statically
predictable access patterns. When i = 2 and j = 0..7, B[i][j] only accesses m6. Therefore,
if m6 is not evicted in the scope $\{L1 \mapsto [2, 2], L2 \mapsto [0, 7]\}$ (interval [0, 7] of L2's iterations,
for each execution of L2 in interval [2, 2] of L1's iterations), B[i][j] has at most one cache miss
for 8 accesses. Similarly, if m15 is persistent in the scope $\{L1 \mapsto [3, 3], L2 \mapsto [0, 15]\}$, C[i][j]
has at most one cache miss for 16 accesses. As a result, by capturing the persistence of memory
blocks in those scopes, we can obtain a much tighter data cache performance estimation.
4.2 Temporal Scope and Address Analysis
Central to our scope-aware data cache analysis is the notion of temporal scope that characterizes
the behavior of a data reference over different loop iterations. Furthermore, we parameterize the
definition and operations of temporal scopes with the static scope information on loop nesting.
We will discuss how our proposed persistence analysis can utilize such information for more
accurate abstract domain construction in Section 4.3.
Definition 4 (Temporal scope) A temporal scope $m^D$ of memory block m which may be accessed
by a data reference D is defined as
$$m^D = \{L_i \mapsto [lw, up] \mid \forall L_i \in reside(D)\}$$
where reside(D) is the set of loops in which D resides. To simplify the presentation, we use
m to denote $m^D$ when there is no ambiguity about the data reference. For each such loop
$L_i$, temporal scope m (or $m^D$) maintains a mapping between $L_i$ and $m[L_i]$, a closed interval
[lw, up] of $L_i$'s iterations in which D may access m.
For a data reference D, address analysis calculates the set of memory blocks possibly accessed
by D. We follow the register expansion framework in [25] to identify the address expression of
each data reference at binary-code level. For each register used to specify the address of a load/store
instruction, we perform register expansion to trace the source registers and the computation
performed. We recursively expand a source register until it traces back to a defined constant
c, an unpredictable value ⊥, or a loop induction variable V. Readers are referred to [25] for
details of address expression detection.
Given the address expression of a data reference D, the set of possibly accessed memory blocks
and their corresponding temporal scopes are automatically derived as follows.
• If the address expression is a constant, it corresponds to a scalar access to a fixed
memory block m. Data reference D will access m in all loop iterations. Therefore, the
temporal scope of m covers every iteration of all loops in which D resides. For example, in Figure
4.2(a), the address expression of D[0] evaluates to BaseD, which corresponds to m10.
Because D[0] will access m10 in all iterations of loops L1 and L2 in which it resides, the
temporal scope $m10 = \{L1 \mapsto [0, 3], L2 \mapsto [0, 15]\}$.
• If the address expression contains the unpredictable value ⊥, the corresponding array access
may reference any of the memory blocks contained in the array. For example, in Figure
4.2, A[x] is an unpredictable access which may reference m0 or m1 in any iteration of
L1. Therefore, the temporal scope $m0 = \{L1 \mapsto [0, 3]\}$. Similarly, temporal scope
$m1 = \{L1 \mapsto [0, 3]\}$.
• If the address expression contains a linear expression of loop induction variables, it
corresponds to a loop-affine access with a predictable access pattern, such as B[i][j] in Figure
4.2(a). By enumerating the possible values of the loop induction variables i and j, the temporal
scope of each memory block that is possibly accessed by B[i][j] can be automatically
calculated. For example, when i = 2 and 0 ≤ j ≤ 7, the address expression
of B[i][j] evaluates to the range [128 + BaseB, 128 + 28 + BaseB], where BaseB is the base
address of array B. Given our assumption that BaseB corresponds to memory block m2
and that the memory block size is 32 bytes, the address range [128 + BaseB, 128 + 28 + BaseB]
corresponds to m6, so the temporal scope $m6 = \{L1 \mapsto [2, 2], L2 \mapsto [0, 7]\}$.

Figure 4.2: Address expressions and temporal scopes
(recoverable entry of panel (a): B[i][j] has address expression 16 × i × 4 + j × 4 + BaseB, with BaseB in m2)
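The enumeration for the loop-affine case can be sketched as follows. This Python sketch is illustrative only: the function name and constants are hypothetical, BaseB is assumed to be aligned to a block boundary, and each interval is taken as the bounding range of the iterations that touch a block, which is exact for this example but is an assumption in general.

```python
from collections import defaultdict

BLOCK_SIZE = 32   # bytes per memory block, as stated in the text
ELEM_SIZE = 4     # bytes per element of B, from the address expression
BASE_BLOCK = 2    # BaseB corresponds to memory block m2


def temporal_scopes_of_b(n_i=4, n_j=16):
    """Return {block index: {loop: (lw, up)}} for the reference B[i][j]."""
    touched = defaultdict(list)
    for i in range(n_i):          # iterations of L1
        for j in range(n_j):      # iterations of L2
            # Address expression 16*i*4 + j*4 + BaseB, relative to BaseB.
            offset = 16 * i * ELEM_SIZE + j * ELEM_SIZE
            touched[BASE_BLOCK + offset // BLOCK_SIZE].append((i, j))
    scopes = {}
    for block, iters in touched.items():
        i_vals = [i for i, _ in iters]
        j_vals = [j for _, j in iters]
        scopes[block] = {"L1": (min(i_vals), max(i_vals)),
                         "L2": (min(j_vals), max(j_vals))}
    return scopes


scopes = temporal_scopes_of_b()
```

The entry for block 6 corresponds to the temporal scope derived for m6 in the text.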
Consider two memory blocks $m_i$ and $m_j$ accessed in temporal scopes $m_i$ and $m_j$ respectively.
An access to $m_i$ in scope $m_i[L]$ will increase the relative age of $m_j$ in scope $m_j[L]$ only if $m_i$
and $m_j$ are mapped to the same cache set and their temporal scopes overlap during the execution
of L. We define the overlap between two temporal scopes $m_i$ and $m_j$ in loop L as follows.
Definition 5 (Scope overlap) The overlap between two temporal scopes $m_i$ and $m_j$ in loop
L is recursively defined as
$$overlap(m_i, m_j, L) \iff (m_i[L] \cap m_j[L]) \neq \emptyset \wedge overlap(m_i, m_j, outer(L)) \qquad (4.1)$$
where outer(L) is the immediate outer loop of L. Thus, two temporal scopes overlap at loop
level L only if the access intervals for loop L and for all outer loops containing L are not mutually
exclusive.
In Figure 4.2(b), since m6[L2] and m7[L2] refer to intervals [0, 7] and [8, 15] of L2's iterations,
they do not overlap. In another example, m15[L2] and m6[L2] overlap in interval [0, 7] of
L2's iterations. However, in the parent loop L1, m15[L1] refers to interval [3, 3] while m6[L1]
refers to the disjoint interval [2, 2] of loop L1's iterations. Therefore, the scopes m15 and
m6 do not overlap at loop level L2, because they belong to executions of L2 in disjoint intervals of L1.
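Equation 4.1 translates directly into a short loop over the enclosing loops. The sketch below is an illustration under assumptions: a temporal scope is represented as a dictionary from loop name to a closed interval, `outer` is a hypothetical map from each loop to its immediate outer loop, and the recursion is assumed to end with "true" once the outermost loop has been checked (the base case is not spelled out in the definition).

```python
def overlap(scope_i, scope_j, loop, outer):
    """Scope overlap at `loop`: intervals intersect there and at every
    enclosing loop, following Equation 4.1."""
    while loop is not None:
        lw_i, up_i = scope_i[loop]
        lw_j, up_j = scope_j[loop]
        # Two closed intervals are disjoint when one starts after the other ends.
        if max(lw_i, lw_j) > min(up_i, up_j):
            return False
        loop = outer.get(loop)   # None once the outermost loop is passed
    return True


outer = {"L2": "L1"}   # L1 is the immediate outer loop of L2
m6 = {"L1": (2, 2), "L2": (0, 7)}
m7 = {"L1": (2, 2), "L2": (8, 15)}
m10 = {"L1": (0, 3), "L2": (0, 15)}
m15 = {"L1": (3, 3), "L2": (0, 15)}

results = {
    "m6/m7": overlap(m6, m7, "L2", outer),     # disjoint in L2
    "m15/m6": overlap(m15, m6, "L2", outer),   # disjoint in L1
    "m10/m6": overlap(m10, m6, "L2", outer),   # m10 covers all iterations
}
```

The sketch assumes both scopes define an interval for every loop on the path from `loop` outwards; scopes of references that reside only in an outer loop, such as A[x], would need separate handling.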
To capture the persistence of a data memory block in a scope for more accurate WCET analysis,
we integrate access pattern analysis into the abstract interpretation framework. In our analysis,
we extend the definition of memory block persistence in [11], and utilize the computed temporal
scope information for a scope-aware analysis. The proposed framework is built on our corrected
version of persistence analysis as described in Chapter 3. The soundness proofs are presented
in Section 4.4.
4.3 Scope-aware Persistence Analysis
The basic idea of our scope-aware persistence analysis is to categorize the persistence of memory
blocks in the calculated temporal scopes (Section 4.2), instead of the globally defined
persistence in [11]. For a data reference D, the temporal scope $m^D$ identifies a mapping between
each loop L in which D resides and L's iteration interval $m^D[L]$ in which D may access m. The
scope-aware analysis approach allows us to integrate access patterns into the abstract interpretation
framework and determine the local behavior of the data cache. In particular, our scope-aware
persistence analysis computes memory block persistence within its temporal scope for each
static scope (loop hierarchy) in which it may be accessed.
Definition 6 (Scope persistence) Let $m^D$ define the loop interval $[m^D[L].lw, m^D[L].up]$ in which
data reference D may access memory block m in an execution of loop L (between L's entry
and exit). The temporal scope $m^D$ is persistent at loop level L if and only if, within interval
$m^D[L]$, m is guaranteed to remain in the cache after the first time it is loaded into the cache by D.
Given the above definition of scope persistence, for memory block m to cause only one
cache miss to data reference D in one complete execution of loop L, it does not need to stay
in the cache for all iterations of L. In loop L, the temporal scope mD (or m for short) defines
an interval m[L] (from iteration m[L].lw to iteration m[L].up of loop L) where D may access
m. If once loaded, memory block m is not evicted out of the cache in any execution within the
interval m[L], all data accesses to m from D cause at most one cache miss for each complete
execution of L.
To capture scope persistence in the abstract domain of the persistence analysis framework,
we define our scope-aware abstract set state and abstract cache state as follows.
Definition 7 (Scope-aware abstract cache state) In the analysis at loop level L, the abstract cache
state $\hat{c}[L]: F \rightarrow \hat{S}$ maps cache sets to abstract set states.
Definition 8 (Scope-aware abstract set state) An abstract set state $\hat{s}: \{l_1 \dots l_A\} \cup \{l_\top\} \rightarrow
2^M$ maps cache lines (including the specially introduced evicted line $l_\top$) to sets of temporal
scopes, where M is the set of all temporal scopes. $\hat{S}$ denotes the set of all abstract set states.
In our scope-aware ACS $\hat{c}[L]$ of loop L, if temporal scope m is in $\hat{s}(l_x)$, then once loaded into the cache
in scope m[L], memory block m reaches at most relative age x in any possible execution
from iteration m[L].lw to iteration m[L].up of loop L.
We have re-designed the update function $\hat{U}_{\hat{C}}$ and join function $\hat{J}_{\hat{C}}$ to utilize the scope
information when modeling cache conflicts in the ACS. By capturing such fine-grained persistence
properties, our analysis can accurately model the local behavior of the data cache for WCET
estimation.

Figure 4.3: Multi-level analysis and results for the motivating example in Figure 4.1
4.3.1 Overall framework
We adopt the multi-level persistence framework for instruction cache analysis from [2], and
extend it for our data cache analysis. As shown in Figure 4.3(a), for each loop L, we perform
a separate persistence analysis on the CFG fragment within L, with the empty initial ACS
$\hat{c}^{in}_{L_{entry}}[L] = \bot$ as the input ACS of L's entry node $L_{entry}$. Consequently, the analysis
considers only paths and data accesses within loop L. As a result, we can determine the local
persistence of a memory block at different loop levels. Figure 4.3 shows the estimation
results of our analysis for the motivating example presented in Figure 4.1; a detailed
discussion is given in Section 4.3.3.
Algorithm 1 MPA(L): Multi-level Persistence Analysis Algorithm. L denotes a loop (or the
main procedure) under analysis.
 1: $\hat{c}^{in}_{L_{entry}}[L] = \bot$;
 2: Queue.insert($L_{entry}$);
 3: while !Queue.empty() do
 4:     n = Queue.remove();
 5:     $\hat{c}^{in}_n[L] = \hat{J}_{\hat{C}}(\{\hat{c}^{out}_{n'}[L] \mid \forall n' \in Pred(n) \wedge n' \in L\})$;
 6:     if reached_fixed_point($\hat{c}^{in}_n[L]$) then continue;
 7:     $\hat{c}^{out}_n[L] = \hat{c}^{in}_n[L]$;
 8:     for each data reference D in n do
 9:         $\hat{c}^{out}_n[L] = \hat{U}_{\hat{C}}(\hat{c}^{out}_n[L], D, L)$;
10:     end for
11:     Queue.insert($\{n' \mid \forall n' \in Succ(n) \wedge n' \in L\}$);
12: end while
Algorithm 1 describes the multi-level persistence analysis algorithm to analyze loop L.
$\hat{c}^{in}_n[L]$ and $\hat{c}^{out}_n[L]$ denote the input and output ACSs of a node n for the analysis at loop level L.
Pred(n) and Succ(n) refer to the sets of predecessors and successors of n within the CFG of
the loop L currently being analyzed. We perform a standard fixed-point computation of the ACSs.
The analysis initializes the input ACS of loop entry node $L_{entry}$ to empty (line 1) because
initially no memory block has been accessed in this loop. The processing queue Queue starts
with the loop entry node (line 2). For each node n, we compute the input ACS $\hat{c}^{in}_n[L]$ by joining
all the output ACSs of its predecessors within L (line 5). The scope-aware join function $\hat{J}_{\hat{C}}$
computes the joined ACS as the union of all input ACSs. If the input ACS $\hat{c}^{in}_n[L]$ has reached a
fixed point, the analysis continues with the next node in Queue (line 6). Otherwise, we
compute $\hat{c}^{out}_n[L]$ from its input ACS and each memory reference D in node n (lines 7-10). When
a no-write-allocate policy is used (with either write-through or write-back), a store instruction
does not modify the cache state, so we consider only load instructions in the cache analysis.
Under a write-allocate policy, all load and store instructions are considered in the
ACS calculation. Finally, all successors of n within L are inserted into Queue to propagate the
possible changes in $\hat{c}^{out}_n[L]$ (line 11).
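The worklist structure of Algorithm 1 can be sketched in Python as follows. This is a simplified illustration, not the thesis implementation: it handles a single cache set, uses the plain (not scope-aware) younger-set update with exactly one known block per access, and runs on a small hypothetical loop CFG (`head`, `left`, `right`, `tail`) that does not come from the thesis.

```python
from collections import deque


def join(states):
    """Union the blocks and their younger sets over all incoming states."""
    joined = {}
    for state in states:
        for m, ys in state.items():
            joined[m] = joined.get(m, frozenset()) | ys
    return joined


def update(state, m_a):
    """Access m_a: it becomes youngest and may age every other block."""
    out = {m: ys | {m_a} for m, ys in state.items() if m != m_a}
    out[m_a] = frozenset()
    return out


def analyze(entry, succ, pred, accesses):
    """Fixed-point computation over one loop, following Algorithm 1."""
    state_in, state_out = {}, {}
    queue = deque([entry])
    while queue:
        n = queue.popleft()
        # Join the outputs of the predecessors computed so far (line 5).
        new_in = join(state_out[p] for p in pred[n] if p in state_out)
        if n in state_in and state_in[n] == new_in:
            continue                      # input unchanged (line 6)
        state_in[n] = new_in
        out = new_in
        for m_a in accesses[n]:           # lines 8-10
            out = update(out, m_a)
        state_out[n] = out
        queue.extend(succ[n])             # line 11
    return state_in, state_out


# Hypothetical loop: head branches to left (reads a) or right (reads b),
# both reach tail (reads c), and tail jumps back to head.
succ = {"head": ["left", "right"], "left": ["tail"], "right": ["tail"], "tail": ["head"]}
pred = {"head": ["tail"], "left": ["head"], "right": ["head"], "tail": ["left", "right"]}
accesses = {"head": [], "left": ["a"], "right": ["b"], "tail": ["c"]}

state_in, state_out = analyze("head", succ, pred, accesses)
```

Termination follows from the same argument as in the text: both functions only ever grow the younger sets, and the number of blocks is finite.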
4.3.2 Scope-aware update and join functions
Scope-aware update function
Given a data reference D which accesses a set of possible addresses Addr(D) = {m1...mk}
in loop L, the scope-aware update function $\hat{U}_{\hat{C}}$ calculates the change in ACS $\hat{c}[L]$ after an access
by D (line 9 in Algorithm 1). For each memory block $m_a \in Addr(D)$, the temporal
scope $m_a^D$ (or $m_a$ for short) identifies the loop intervals in which D may access $m_a$. An access to $m_a$
in scope $m_a[L]$ (from iteration $m_a[L].lw$ to iteration $m_a[L].up$) does not affect the maximum
relative age (and the scope persistence) of a memory block m in scope m[L] if $m_a$ and m do not
overlap in loop L (refer to Equation 4.1 in Section 4.2). Therefore, our proposed scope-aware
update function $\hat{U}_{\hat{C}}$ only considers memory block $m_a$ as conflicting with memory block m in scope
m[L] when the temporal scopes $m_a$ and m overlap in loop L.
$$\hat{U}_{\hat{C}}(\hat{c}, D, L) = \hat{c}[f_i \mapsto \hat{U}_{\hat{S}}(\hat{c}[f_i], D, L)] \quad \text{for all } f_i \in \{set(m_a) \mid \forall m_a \in Addr(D)\}$$
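Given data reference D and its set of possible addresses Addr(D), our scope-aware cache
update function $\hat{U}_{\hat{C}}$ computes the change in each cache set $f_i$ possibly affected by the data access
using our scope-aware set update function $\hat{U}_{\hat{S}}$. For each input abstract set state $\hat{s}^{in}$, the set
update function computes the output abstract set state $\hat{s}^{out}$ and tracks the younger set of each
temporal scope $m \in \hat{s}^{in}$ as follows.
$$\hat{U}_{\hat{S}}(\hat{s}^{in}, D, L) = \hat{s}^{out} \text{ with } \hat{s}^{out}(l_x) = \{m \mid m \in \hat{s}^{in} \cup \{m_a \mid m_a \in X_{f_i}\},\ x = \min(|YS(\hat{s}^{out}, m)| + 1, \top)\}$$
where, for all $m \in \hat{s}^{in} \cup \{m_a \mid m_a \in X_{f_i}\}$,
$$YS(\hat{s}^{out}, m) = \begin{cases} \emptyset & \text{if } m \notin \hat{s}^{in} \\ \emptyset & \text{if } OpS(m, D, L) = \{m\} \\ YS(\hat{s}^{in}, m) \cup ((OpS(m, D, L) \cap X_{f_i}) \setminus \{m\}) & \text{otherwise} \end{cases}$$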
where
• $X_{f_i}$ denotes the set of memory blocks possibly accessed by data reference D which are
mapped to cache set $f_i$ of abstract set state $\hat{s}^{in}$:
$$X_{f_i} = \{m_a \mid m_a \in Addr(D),\ set(m_a) = f_i\}$$
• The overlap set OpS(m, D, L) denotes the set of memory blocks which data reference D
may access in scope m[L]. For each memory block $m_a \in Addr(D)$, D may access $m_a$
in scope m[L] if temporal scopes m and $m_a$ overlap in loop L:
$$OpS(m, D, L) = \{m_a \mid m_a \in Addr(D) \wedge overlap(m, m_a, L)\}$$
The update function $\hat{U}_{\hat{S}}$ determines the maximum relative age x of temporal scope m in the
output abstract set state $\hat{s}^{out}$ by computing the younger set $YS(\hat{s}^{out}, m)$. In our scope-aware
ACS, the younger set $YS(\hat{s}^{out}, m)$ identifies the set of all memory blocks that could be younger
than m in any execution in scope m[L] after the first access to m in this scope. To determine
the younger set $YS(\hat{s}^{out}, m)$, we distinguish the following scenarios:
• If temporal scope m is not in $\hat{s}^{in}$, memory block m has not yet been accessed for the first
time in scope m[L] in any execution. If the data reference D accesses m, m will be brought
to the youngest cache line $l_1$ with no younger memory block. Otherwise, memory block
m remains not accessed. Since our scope-aware persistence analysis only captures the
maximum relative age of m after the first access to m in scope m[L], our scope-aware
update function $\hat{U}_{\hat{S}}$ adds m to $\hat{s}^{out}$ as the youngest memory block with an empty younger set.
• If temporal scope $m \in \hat{s}^{in}$, memory block m may have been accessed in scope m[L]. In
scope m[L], the data reference D only accesses memory block $m_a$ if it is in OpS(m, D, L).
If there exists another memory block $m_a \in OpS(m, D, L)$ with $m_a \neq m$, D may access $m_a$ and
not renew m. However, if m is the only memory block in OpS(m, D, L), all data accesses
of D in scope m[L] will definitely access and renew m. Consequently, if the overlap
set OpS(m, D, L) contains only memory block m, we can guarantee that data reference
D will indeed access m in scope m[L] and renew m to the youngest cache line $l_1$.
• Otherwise, in scope m[L], the data reference D may access any memory block $m_a$ ($m_a \neq m$)
in the overlap set OpS(m, D, L). Consequently, any $m_a \in OpS(m, D, L)$ that is mapped
to cache set $f_i$ ($m_a \in X_{f_i}$) can be accessed and become a new younger memory block of
m in scope m[L]. Therefore, our scope-aware update function $\hat{U}_{\hat{S}}$ adds all those memory
blocks to the younger set $YS(\hat{s}^{out}, m)$ of m, and sets its maximal relative age accordingly.
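The three scenarios above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used in the thesis: an abstract set state is modeled as a dictionary from temporal scope to younger set, `overlap` is assumed to be a plain interval-intersection test standing in for Equation 4.1, the evicted line is encoded as `assoc + 1`, and the scope intervals and the modulo-4 set mapping in the example are hypothetical values chosen to mimic step (4) of the running example.

```python
from typing import Dict, FrozenSet, Set, Tuple

Interval = Tuple[int, int]             # (lw, up): iteration bounds of a temporal scope in loop L
SetState = Dict[str, FrozenSet[str]]   # temporal scope m -> younger set YS(s, m)


def overlap(a: Interval, b: Interval) -> bool:
    """Assumed stand-in for overlap(m, ma, L): the two intervals share an iteration."""
    return a[0] <= b[1] and b[0] <= a[1]


def update_set(s_in: SetState, addr_d: Set[str], set_of: Dict[str, int],
               f_i: int, scope: Dict[str, Interval]) -> SetState:
    """Scope-aware set update for cache set f_i, following the three cases of YS."""
    x_fi = {ma for ma in addr_d if set_of[ma] == f_i}
    s_out: SetState = {}
    for m in set(s_in) | x_fi:
        ops = {ma for ma in addr_d if overlap(scope[m], scope[ma])}
        if m not in s_in:            # first access in scope: youngest, nothing younger
            s_out[m] = frozenset()
        elif ops == {m}:             # D can only touch m in scope m[L]: m is renewed
            s_out[m] = frozenset()
        else:                        # overlapping blocks of the same set may age m
            s_out[m] = s_in[m] | frozenset((ops & x_fi) - {m})
    return s_out


def line_index(ys: FrozenSet[str], assoc: int) -> int:
    """x = min(|YS| + 1, top), with top encoded as assoc + 1 (the evicted line)."""
    return min(len(ys) + 1, assoc + 1)


# Assumption: blocks map to cache sets by index modulo 4 (4 cache sets).
set_of = {"m2": 2, "m6": 2, "m10": 2, "m14": 2}
# Hypothetical scopes in L2, chosen so that m10 overlaps with every other block.
scope = {"m2": (0, 1), "m6": (4, 5), "m14": (2, 3), "m10": (0, 7)}
s_in = {"m2": frozenset(), "m6": frozenset(), "m14": frozenset()}

# Data reference D[i] accesses only m10.
s_out = update_set(s_in, {"m10"}, set_of, 2, scope)
ages = {m: line_index(ys, assoc=2) for m, ys in s_out.items()}
```

With these inputs, m10 enters the youngest line with an empty younger set, while m2, m6 and m14 each gain m10 as a younger block and move to the second line.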
Figure 4.4(a) illustrates our scope-aware persistence analysis within loop L2 of the running
example in Figure 4.1. Initially, the input ACS $\hat{c}^{in}_{B5}[L2]$ of loop header B5 is empty
($\hat{c}^{in}_{B5}[L2] = \bot$), since no memory block has yet been accessed in L2. Because the program does
not access memory in B5, $\hat{c}^{out}_{B5}[L2] = \hat{c}^{in}_{B5}[L2] = \bot$. In step (1), if the program takes
execution path B5 → B6, it may access any memory block in Addr(B[i][j]) = {m2 ... m9} in B6.
Since $\hat{c}^{in}_{B6}[L2] = \hat{c}^{out}_{B5}[L2] = \bot$, no memory block in {m2 ... m9} has yet been accessed in
loop L2. Therefore, {m2 ... m9} are added to the youngest line $l_1$ of ACS $\hat{c}^{out}_{B6}[L2]$. Similarly,
in step (2), if the program takes execution path B5 → B7, it may access any memory block in
Addr(C[i][j]) = {m12 ... m15}. As a result, {m12 ... m15} are added to the youngest line $l_1$ of
ACS $\hat{c}^{out}_{B7}[L2]$.

Figure 4.4: Scope-aware ACS computation for L2 of the motivating example in Figure 4.1
In step (4), the program executes data reference D[i] in block B8 and accesses m10. Since
D[i] only accesses m10, temporal scope m10 is added to the youngest line $l_1$ of ACS $\hat{c}^{out}_{B8}[L2]$.
Moreover, because m10 is mapped to the same cache set as m2 and temporal scope m10 overlaps
with m2, m10 becomes a younger memory block of m2 and ages m2 in the scope m2. As a
result, temporal scope m2 is aged to $l_2$ in $\hat{c}^{out}_{B8}[L2]$. Similarly, m10 also ages m6 and m14 in
their temporal scopes. Therefore, m6 and m14 are aged to $l_2$ in $\hat{c}^{out}_{B8}[L2]$.
In step (6), the analysis loops back to loop header B5 and takes path B8 → B5 → B6. In
B6, data reference B[i][j] may access any memory block in Addr(B[i][j]) = {m2 ... m9}.
Without scope awareness, persistence analysis as in Section 3.2 will assume that memory blocks
mapped to the same cache set conflict with each other. For example, because memory
blocks m4 and m8 are mapped to the same cache set $f_0$ as m12, they will age m12 to the evicted
line, as in Figure 4.3(b). However, with temporal scope information, our scope-aware update
function can guarantee that in temporal scope m12, data reference B[i][j] will not access m4 or
m8, because B[i][j] only accesses m4 in temporal scope m4, and m8 in temporal scope m8, while
those temporal scopes do not overlap with m12. Therefore, memory block m12 will remain the
youngest memory block in scope m12, as in step (6) of Figure 4.4(a), and is not evicted as in
Figure 4.3(b).
Scope-aware join function
At any program point p in loop level L, the join function $\hat{J}_{\hat{C}}$ (line 5 in Algorithm 1) computes
an ACS from all the output ACSs of p's control flow predecessors. This can be done by pair-wise
joining of two output ACSs $\hat{c}_1[L]$ and $\hat{c}_2[L]$ into a representative ACS $\hat{c}[L]$ at p using the
scope-aware join function $\hat{J}_{\hat{C}}$. For each temporal scope m, the scope-aware join function takes
the union of the younger sets of m in both output ACSs from the control flow predecessors to form
the younger set $YS(\hat{s}, m)$ of m in abstract set state $\hat{s} = \hat{c}[set(m)]$ at p. Therefore, $YS(\hat{s}, m)$
always contains all possible younger memory blocks of m in scope m at p. Formally, our
scope-aware join function is defined as follows.

$$\hat{J}_{\hat{C}}(\hat{c}_1, \hat{c}_2) = \hat{c}\,[s_i \mapsto \hat{J}_{\hat{S}}(\hat{c}_1[s_i], \hat{c}_2[s_i])]$$

$$\hat{J}_{\hat{S}}(\hat{s}_1, \hat{s}_2) = \hat{s} \quad \text{with:}$$

$$\hat{s}(l_x) = \{ m \mid m \in \hat{s}_1 \cup \hat{s}_2,\; x = \min(|YS(\hat{s}, m)| + 1, \top) \}$$

where, for all $m \in \hat{s}_1 \cup \hat{s}_2$,

$$YS(\hat{s}, m) = \begin{cases}
YS(\hat{s}_1, m) \cup YS(\hat{s}_2, m) & \text{if } m \in \hat{s}_1 \wedge m \in \hat{s}_2 \\
YS(\hat{s}_1, m) & \text{if } m \in \hat{s}_1 \wedge m \notin \hat{s}_2 \\
YS(\hat{s}_2, m) & \text{if } m \notin \hat{s}_1 \wedge m \in \hat{s}_2
\end{cases}$$
In Figure 4.4(a), step (8), as B8 has two predecessors B6 and B7, for each temporal scope,
our scope-aware join function $\hat{J}_{\hat{C}}$ takes the union of its younger sets in ACS $\hat{c}^{out}_{B6}[L2]$ and
$\hat{c}^{out}_{B7}[L2]$ to compute its younger set in $\hat{c}^{in}_{B8}[L2]$. In B6, m5 is renewed to the youngest cache
line $l_1$ because B[i][j] is guaranteed to access m5 in temporal scope m5. However, in B7, C[i][j]
may access m13 in temporal scope m13, which overlaps with m5. Therefore, m13 becomes a possible
younger memory block of m5 in scope m5 and ages m5 to $l_2$. As a result, m5 is in $l_2$ of
$\hat{c}^{in}_{B8}[L2]$, having m13 in its younger set as shown in Figure 4.4(b).
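The join can be sketched in Python as a per-scope union of younger sets. This is a minimal sketch, not the thesis implementation: abstract set states are modeled as dictionaries from temporal scope to younger set, the evicted line is encoded as `assoc + 1`, and the two predecessor states are reduced, hypothetical versions of the B6 and B7 output states that contain only the blocks discussed in step (8).

```python
from typing import Dict, FrozenSet

SetState = Dict[str, FrozenSet[str]]   # temporal scope m -> younger set YS(s, m)


def join_set(s1: SetState, s2: SetState) -> SetState:
    """Union the younger sets of every temporal scope present in either state.

    A scope missing from one state contributes an empty set, which reproduces
    the three cases of the definition.
    """
    empty: FrozenSet[str] = frozenset()
    return {m: s1.get(m, empty) | s2.get(m, empty) for m in set(s1) | set(s2)}


def line_index(ys: FrozenSet[str], assoc: int) -> int:
    """x = min(|YS| + 1, top), with top encoded as assoc + 1 (the evicted line)."""
    return min(len(ys) + 1, assoc + 1)


# Hypothetical, reduced output states of B6 and B7 for the cache set holding m5 and m13.
out_b6: SetState = {"m5": frozenset()}                          # m5 renewed in B6
out_b7: SetState = {"m5": frozenset({"m13"}), "m13": frozenset()}  # m13 may age m5 in B7

in_b8 = join_set(out_b6, out_b7)
age_m5 = line_index(in_b8["m5"], assoc=2)
```

Because the join only ever grows younger sets, the joined state places m5 in the second line even though one predecessor had just renewed it.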
4.3.3 ACS computation of the motivating example
Figure 4.3(b), (c) and (d) show the fixed-point ACSs computed by the original persistence
analysis (at basic block B4, exit of L1), and by our multi-level analysis for L1 (at B4) and L2
(at basic block B8, exit of L2), respectively. Given a 2-way set-associative cache with 4 cache
sets, no memory block accessed by B[i][j] and C[i][j] can be categorized as persistent in the
original persistence analysis. On the other hand, our multi-level scope-aware persistence analysis
produces much tighter estimates of the worst-case cache behavior. For example, m4
accessed by B[i][j] is guaranteed to be scope persistent at both loop levels, resulting in at most
1 cold miss globally. m5 is scope persistent only in L2. Thus, accesses to m5 in each complete
execution of L2 (from entry to exit) incur at most 1 cold miss.
4.4 Safety proofs of scope-aware persistence analysis
In this section, we will prove the safety of our proposed scope-aware persistence analysis frame-
work.
In a concrete cache state c, for the LRU replacement policy, the relative age of memory block
m is determined by the number of memory blocks more recently used (younger) than m in the
same cache set. Let s = c[set(m)] be the concrete set state of the cache set to which memory
block m is mapped, and let the concrete younger set ys(s, m) be the set of memory blocks more
recently used (younger) than m in set s (as in Definition 2). We have

$$m \in s(l_y) \rightarrow ys(s, m) = s(l_1) \cup \dots \cup s(l_{y-1}) \;\wedge\; y = |ys(s, m)| + 1$$
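The relation between relative age and the concrete younger set can be checked on a toy LRU cache set. The sketch below is illustrative only: a concrete set state is modeled as a Python list ordered from youngest to oldest line, and the helper names `lru_access` and `younger_set` are hypothetical.

```python
from typing import List, Set


def lru_access(lines: List[str], m: str, assoc: int) -> List[str]:
    """Access m in one LRU cache set; lines[0] is the youngest line l1."""
    rest = [b for b in lines if b != m]
    return ([m] + rest)[:assoc]        # the oldest block falls out when the set is full


def younger_set(lines: List[str], m: str) -> Set[str]:
    """ys(s, m): blocks held in lines younger than the line holding m."""
    return set(lines[:lines.index(m)])


s: List[str] = []
for block in ["m2", "m10"]:
    s = lru_access(s, block, assoc=2)

# m10 was used after m2, so it is the only block younger than m2.
age_m2 = len(younger_set(s, "m2")) + 1

# A third conflicting block pushes m2 out of the 2-way set.
s_after = lru_access(s, "m6", assoc=2)
```

In this run m2 sits in the second line with one younger block, and one further conflicting access evicts it, which is exactly the situation the younger-set bound is meant to rule out for persistent blocks.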
A memory block m is persistent in the scope m[L] (from iteration m[L].lw to iteration
m[L].up of loop L) if, once m has been loaded into the cache for the first time in this scope, it
will not be evicted from the cache in any possible execution before the program exits the
scope (i.e. finishes iteration m[L].up of loop L). In our ACS semantics, given the ACS $\hat{c}[L]$ of
the analysis in loop L and $\hat{s} = \hat{c}[L][set(m)]$, if temporal scope $m \in \hat{s}(l_x)$, then once loaded
into the cache in scope m[L], memory block m has maximum relative age x in all possible
executions in the scope. Our scope-aware persistence analysis computes the maximum relative age
x by tracking the younger set $YS(\hat{s}, m)$, the set of all memory blocks which are possibly younger
than m in the scope m[L] after m is loaded into the cache. As the relative age of memory block m
is determined by the number of memory blocks more recently used (younger) than m in the
same cache set, the maximum relative age of m in scope m[L] is one greater than the size of the
younger set $YS(\hat{s}, m)$, i.e. $x = |YS(\hat{s}, m)| + 1$. If memory block m has fewer than A possibly
younger memory blocks in scope m[L], then once loaded, it will not be evicted from the cache and
is persistent in scope m[L].
To prove the safety of our scope-aware persistence analysis, we prove that for any execution
path pa that reaches program point p in the scope m[L] with concrete cache state c, if path pa
has accessed memory block m in this scope, the younger set $YS(\hat{s}, m)$ contains all memory
blocks in the concrete younger set ys(s, m), the set of memory blocks younger than m in cache
set s = c[set(m)]. Consequently, the maximum relative age x determined by our analysis
($x = |YS(\hat{s}, m)| + 1$) will always be greater than or equal to the relative age y of memory block
m in concrete cache set s ($y = |ys(s, m)| + 1$). Therefore, our scope-aware persistence analysis
is safe.
Note that our scope-aware persistence analysis computes the maximum relative age x of
memory block m only after the first time memory block m has been loaded into the cache in
scope m[L]. We do not consider the relative age of memory block m before its first access in
this scope, as we conservatively assume that the first access to m in the scope m[L] always
results in a cache miss.
4.4.1 Structure of the proof
We prove by induction that for each temporal scope m in ACS $\hat{c}[L]$, the ScopeYS property
holds in all possible execution paths in scope m[L].
Definition 9 (ScopeYS property): Let pa be an arbitrary path from the start of execution to
program point p in scope m[L] of loop L which results in concrete cache state c, and let $\hat{c}[L]$ be
the computed fixed-point ACS of loop L at p. For each memory block $m \in s = c[set(m)]$
and its corresponding temporal scope $m \in \hat{s} = \hat{c}[L][set(m)]$, if path pa has accessed memory
block m in scope m[L], the younger set $YS(\hat{s}, m)$ contains all memory blocks in the concrete
younger set ys(s, m):

$$\forall m \in c,\; s = c[set(m)],\; \hat{s} = \hat{c}[L][set(m)]: \quad \neg Accessed(m, m[L], s) \;\vee\; ys(s, m) \subseteq YS(\hat{s}, m)$$

where Accessed(m, m[L], s) indicates whether memory block m has been accessed in scope m[L]
for concrete set state s.
We prove by induction that for each memory block m and its corresponding temporal scope
m, the ScopeYS property holds in all possible execution paths in scope m[L] (from iteration
m[L].lw to iteration m[L].up of loop L):
• If memory block m has not been accessed in scope m[L] ($\neg Accessed(m, m[L], s)$), the
ScopeYS property is trivially true. We do not consider the relative age of memory block
m before its first access in scope m[L], as we conservatively assume that the first access to m
in the scope results in a miss.
• At the first access to m in scope m[L], memory block m is brought to concrete set state s
at the youngest line $s(l_1)$. Consequently, $ys(s, m) = \emptyset$, so $ys(s, m) \subseteq YS(\hat{s}, m)$. Therefore
the ScopeYS property is true immediately after the first access to m in scope m[L].
• Assume the ScopeYS property holds at $p^{in}$, before the program point p. If at p, a data
reference D accesses a set of possible memory blocks {m1 ... mk} in their respective
temporal scopes {m1 ... mk}, we prove that the ScopeYS property holds at $p^{out}$, after program
point p, by proving the correctness of our scope-aware update function (Section 4.4.2).
• Assume the ScopeYS property holds at $p^{out}$. We prove that the ScopeYS property holds at
$p^{in}_n$, before the next program point $p_n$, by proving the correctness of our scope-aware join
function (Section 4.4.3).
For each memory block m in scope m[L], we prove that the ScopeYS property holds before
and immediately after the first access to m in scope m[L]. In subsequent executions within the
scope, the ScopeYS property holds after each data access, and from one program point to another.
Therefore, the ScopeYS property holds for any arbitrary path pa in the scope m[L]. Consequently,
at any program point p in scope m[L] with concrete cache state c, the younger set $YS(\hat{s}, m)$
contains all memory blocks in the concrete younger set ys(s, m) of m in the set s = c[set(m)].
As a result, the maximum relative age x of memory block m in scope m[L] determined by our
ACS $\hat{c}[L]$ ($x = |YS(\hat{s}, m)| + 1$) is always greater than or equal to the relative age y of m in set
s ($y = |ys(s, m)| + 1$). Therefore, our analysis safely estimates the maximum relative age and
the persistence of m in scope m[L].
4.4.2 Safety proof of scope-aware update function
At program point p in loop L, a data reference D accesses a set of possible memory blocks
Addr(D) = {m1 ... mk} in their respective temporal scopes {m1 ... mk}. The scope-aware update
function computes the change in ACS $\hat{c}[L]$, and tracks the younger set $YS(\hat{s}, m)$ of each
temporal scope m after the data access. We prove that our scope-aware update function preserves
the ScopeYS property: assuming the ScopeYS property holds at $p^{in}$, before program point p, we
prove that the ScopeYS property holds at $p^{out}$, after program point p.
Let $c^{in}$ be the concrete cache state of path pa at $p^{in}$, and let $\hat{c}^{in}[L]$ be the computed ACS of
loop L at $p^{in}$. Assuming the ScopeYS property holds at $p^{in}$, we have

$$\forall m \in c^{in},\; s^{in} = c^{in}[set(m)],\; \hat{s}^{in} = \hat{c}^{in}[L][set(m)]: \quad \neg Accessed(m, m[L], s^{in}) \;\vee\; ys(s^{in}, m) \subseteq YS(\hat{s}^{in}, m) \qquad \text{[B.1]}$$

Let $c^{out}$ be the concrete cache state of path pa at $p^{out}$, and let $\hat{c}^{out}[L]$ be the computed ACS of
loop L at $p^{out}$. We prove that the ScopeYS property holds at $p^{out}$:

$$\forall m \in c^{out},\; s^{out} = c^{out}[set(m)],\; \hat{s}^{out} = \hat{c}^{out}[L][set(m)]: \quad \neg Accessed(m, m[L], s^{out}) \;\vee\; ys(s^{out}, m) \subseteq YS(\hat{s}^{out}, m) \qquad \text{[B.2]}$$

At program point p in loop L, given a data reference D and input abstract set state $\hat{s}^{in}$,
our scope-aware update function $\hat{U}_{\hat{S}}$ computes the output abstract set state $\hat{s}^{out}$ and the updated
younger set $YS(\hat{s}^{out}, m)$ as follows:
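$$\hat{U}_{\hat{S}}(\hat{s}^{in}, D, L) = \hat{s}^{out} \quad \text{with:}$$

$$\hat{s}^{out}(l_x) = \{ m \mid m \in \hat{s}^{in} \cup \{m_a \mid m_a \in X_{f_i}\},\; x = \min(|YS(\hat{s}^{out}, m)| + 1, \top) \}$$

where, for all $m \in \hat{s}^{in} \cup \{m_a \mid m_a \in X_{f_i}\}$,

$$YS(\hat{s}^{out}, m) = \begin{cases}
\emptyset & \text{if } m \notin \hat{s}^{in} \\
\emptyset & \text{if } OpS(m, D, L) = \{m\} \\
YS(\hat{s}^{in}, m) \cup (OpS(m, D, L) \cap X_{f_i} \setminus \{m\}) & \text{otherwise}
\end{cases}$$

where
• $X_{f_i}$ denotes the set of memory blocks possibly accessed by data reference D which are
mapped to cache set $f_i$ of abstract set state $\hat{s}^{in}$:
$$X_{f_i} = \{m_a \mid m_a \in Addr(D),\; set(m_a) = f_i\}$$
• The overlap set OpS(m, D, L) denotes the set of memory blocks which data reference D
may access in scope m[L]. For each memory block $m_a \in Addr(D)$, D may access $m_a$
in scope m[L] if temporal scopes m and $m_a$ overlap in loop L:
$$OpS(m, D, L) = \{m_a \mid m_a \in Addr(D) \wedge overlap(m, m_a, L)\}$$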
We prove the correctness of our scope-aware update function $\hat{U}_{\hat{S}}$ by dividing the access
scenarios into two cases:
Case 1: Memory block m has not been accessed in scope m[L]
• Case 1.1: Data reference D does not access m at program point p
As D does not access m at p, m remains not accessed at $p^{out}$. We have
$$\neg Accessed(m, m[L], s^{in}) \wedge D \text{ does not access } m \;\rightarrow\; \neg Accessed(m, m[L], s^{out}) \qquad \text{([B.2] proven)}$$
• Case 1.2: Data reference D accesses m at program point p
Since data reference D accesses m, m becomes the most recently used memory block in
cache line $l_1$. Consequently, m has no younger memory block.
$$ys(s^{out}, m) = \emptyset \;\rightarrow\; ys(s^{out}, m) \subseteq YS(\hat{s}^{out}, m) \qquad \text{([B.2] proven)}$$
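Case 2: Memory block m has been accessed in scope m[L]
Since memory block m has been accessed in scope m[L] and ScopeYS holds at $p^{in}$, we have
$$\text{[B.1]} \wedge Accessed(m, m[L], s^{in}) \;\rightarrow\; ys(s^{in}, m) \subseteq YS(\hat{s}^{in}, m) \qquad \text{[1]}$$
In scope m[L], D may access memory block $m_a$ only if temporal scope $m_a$ overlaps with
m in loop L ($m_a \in OpS(m, D, L)$). Moreover, $m_a$ will become a younger memory block of m
in $s^{out}$ if $m_a \neq m$ and they are mapped to the same cache set ($m_a \in X_{f_i}$). As a result, we have
$$\text{[2]} \quad ys(s^{out}, m) = \begin{cases}
\emptyset & \text{if } m_a = m \\
ys(s^{in}, m) \cup \{m_a\} & \text{if } m_a \neq m \wedge m_a \in X_{f_i} \\
ys(s^{in}, m) & \text{otherwise}
\end{cases} \qquad \text{where } m_a \in OpS(m, D, L)$$
$$\text{[3]} \quad YS(\hat{s}^{out}, m) = YS(\hat{s}^{in}, m) \cup (OpS(m, D, L) \cap X_{f_i} \setminus \{m\})$$
$$\text{[1][2][3]} \;\rightarrow\; ys(s^{out}, m) \subseteq YS(\hat{s}^{out}, m) \qquad \text{([B.2] proven)}$$
(If $OpS(m, D, L) = \{m\}$, D can only access m in scope m[L], so $ys(s^{out}, m) = \emptyset = YS(\hat{s}^{out}, m)$
and [B.2] holds directly.)
As a result, in all cases, either memory block m has not been accessed, or $YS(\hat{s}^{out}, m)$
contains all possible temporal scopes of memory blocks accessed within scope m[L] which
may be younger than m. Therefore, the ScopeYS property holds at $p^{out}$.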
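As an informal sanity check of the argument above (not a substitute for the proof), the ScopeYS property can be tested on a toy model. The sketch below makes several simplifying assumptions that are not part of the thesis: a single loop with one data reference per iteration, a single cache set (so $X_{f_i}$ equals Addr(D)), unbounded concrete history instead of a fixed associativity, `overlap` taken to be interval intersection, and hypothetical scope intervals. It computes a fixed point of update and join, then replays random concrete paths that respect the temporal scopes and checks that every concrete younger set is contained in the abstract one.

```python
import random
from typing import Dict, FrozenSet, List, Set, Tuple

Interval = Tuple[int, int]
SetState = Dict[str, FrozenSet[str]]


def overlap(a: Interval, b: Interval) -> bool:
    return a[0] <= b[1] and b[0] <= a[1]


def in_scope(i: int, iv: Interval) -> bool:
    return iv[0] <= i <= iv[1]


def update(s_in: SetState, addr: List[str], scope: Dict[str, Interval]) -> SetState:
    """Scope-aware set update, single cache set assumed (X_fi == Addr(D))."""
    s_out: SetState = {}
    for m in set(s_in) | set(addr):
        ops = {ma for ma in addr if overlap(scope[m], scope[ma])}
        if m not in s_in or ops == {m}:
            s_out[m] = frozenset()
        else:
            s_out[m] = s_in[m] | frozenset(ops - {m})
    return s_out


def join(s1: SetState, s2: SetState) -> SetState:
    empty: FrozenSet[str] = frozenset()
    return {m: s1.get(m, empty) | s2.get(m, empty) for m in set(s1) | set(s2)}


def fixed_point(addr: List[str], scope: Dict[str, Interval]) -> SetState:
    """Iterate update and join around the loop until the abstract state is stable."""
    s: SetState = {}
    while True:
        nxt = join(s, update(s, addr, scope))
        if nxt == s:
            return s
        s = nxt


def scope_ys_holds(addr: List[str], scope: Dict[str, Interval],
                   n_iters: int, seed: int) -> bool:
    """Replay one random concrete path and check ys(s, m) is a subset of YS for in-scope m."""
    rng = random.Random(seed)
    abstract = fixed_point(addr, scope)
    younger: Dict[str, Set[str]] = {}   # m -> blocks accessed since the last access to m
    for i in range(n_iters):
        candidates = sorted(m for m in addr if in_scope(i, scope[m]))
        if not candidates:
            continue
        accessed = rng.choice(candidates)
        for ys in younger.values():
            ys.add(accessed)
        younger[accessed] = set()       # the accessed block becomes the youngest
        for m, ys in younger.items():
            if in_scope(i, scope[m]) and not ys <= abstract[m]:
                return False
    return True


# Hypothetical scopes: m4 and m8 overlap each other but not m12.
addr = ["m4", "m8", "m12"]
scope = {"m4": (0, 3), "m8": (2, 5), "m12": (6, 7)}
abstract = fixed_point(addr, scope)
all_ok = all(scope_ys_holds(addr, scope, 8, seed) for seed in range(50))
```

In this toy setting the fixed point gives m12 an empty younger set, mirroring the earlier observation that blocks whose scopes do not overlap with m12 cannot age it.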
4.4.3 Safety proof of scope-aware join function
Assuming the ScopeYS property holds at $p^{out}$, after program point p, we prove that the ScopeYS
property holds at $p^{in}_n$, before the next program point $p_n$, by proving the correctness of our
scope-aware join function $\hat{J}_{\hat{S}}$.
Let $c^{out}$ be the concrete cache state of path pa at $p^{out}$, and let $\hat{c}^{out}[L]$ be the computed ACS of
loop L at $p^{out}$. Assuming the ScopeYS property holds at $p^{out}$, we have

$$\forall m \in c^{out},\; s^{out} = c^{out}[set(m)],\; \hat{s}^{out} = \hat{c}^{out}[L][set(m)]: \quad \neg Accessed(m, m[L], s^{out}) \;\vee\; ys(s^{out}, m) \subseteq YS(\hat{s}^{out}, m) \qquad \text{[C.1]}$$

Let $c^{in}_n$ be the concrete cache state of path pa at $p^{in}_n$, and let $\hat{c}^{in}_n[L]$ be the computed ACS
of loop L at $p^{in}_n$. We prove that the ScopeYS property holds at $p^{in}_n$:

$$\forall m \in c^{in}_n,\; s^{in}_n = c^{in}_n[set(m)],\; \hat{s}^{in}_n = \hat{c}^{in}_n[L][set(m)]: \quad \neg Accessed(m, m[L], s^{in}_n) \;\vee\; ys(s^{in}_n, m) \subseteq YS(\hat{s}^{in}_n, m) \qquad \text{[C.2]}$$

From our proposed scope-aware join function $\hat{s} = \hat{J}_{\hat{S}}(\hat{s}_1, \hat{s}_2)$, the younger set $YS(\hat{s}, m)$ of m
at $p^{in}_n$ is the union of the younger sets on all incoming edges of $p^{in}_n$. As $p^{out}$ is one of the
incoming edges of $p^{in}_n$, we have

$$YS(\hat{s}^{out}, m) \subseteq YS(\hat{s}^{in}_n, m) \qquad [\hat{J}_{\hat{S}}]$$

Because $p^{in}_n$ is immediately after $p^{out}$, no new memory block is accessed. Therefore the
concrete set state $s^{in}_n$ is exactly the same as the concrete set state $s^{out}$, and the concrete
younger set remains the same:

$$ys(s^{in}_n, m) = ys(s^{out}, m) \qquad \text{[C.3]}$$

If m has not been accessed in scope m[L] at $p^{out}$, m remains not accessed at $p^{in}_n$, and the
ScopeYS property holds at $p^{in}_n$.
Otherwise, if m has been accessed in scope m[L] at $p^{out}$, we have

$$\begin{aligned}
\text{[C.1]} \quad & ys(s^{out}, m) \subseteq YS(\hat{s}^{out}, m) \\
[\hat{J}_{\hat{S}}] \quad & YS(\hat{s}^{out}, m) \subseteq YS(\hat{s}^{in}_n, m) \\
\text{[C.3]} \quad & ys(s^{in}_n, m) = ys(s^{out}, m) \\
\rightarrow \quad & ys(s^{in}_n, m) \subseteq YS(\hat{s}^{in}_n, m) \qquad \text{([C.2] proven)}
\end{aligned}$$

The younger set $YS(\hat{s}^{in}_n, m)$ contains all possible memory blocks younger than m in set(m) of
$s^{in}_n$ at $p^{in}_n$. Therefore the ScopeYS property holds at $p^{in}_n$.
According to the proof structure outlined in Section 4.4.1, the ScopeYS property holds
before and immediately after memory block m is first accessed in scope m[L]. The ScopeYS
property then holds before and after the memory access at each program point p, and from p to
the next program point $p_n$. As a result, the maximum relative age x of memory block m in scope
m[L] determined by our scope-aware persistence analysis (i.e. $x = |YS(\hat{s}, m)| + 1$) is always
greater than or equal to the relative age of m in concrete set state s = c[set(m)] on an arbitrary
path pa after the first access to m in scope m[L]. Therefore, our scope-aware persistence analysis
is safe.
4.5 Cache Miss Computation
In abstract interpretation-based approaches, the cache analysis results are used to classify the
cache behavior of each data reference D in the program. Typical worst case categories are (1)
All Hit (AH): all data accesses of D result in cache hit; (2) All Miss (AM): all data accesses of
D result in cache miss; (3) Persistent (PS): all possible accessed memory blocks of D are per-
sistent (D has at most one cold miss for each persistent memory block); and (4) Non Classified
(NC): the cache behavior of D could not be classified (all accesses of D are considered to be
misses).
In the presence of a data cache, different executions of the same data reference may access
various memory blocks and result in different cache behavior. In our motivating example shown
in Figure 4.1, data reference B[i][j] may access m4, m5, and m6 in the temporal scopes m4,
m5, and m6 respectively. As illustrated in Figure 4.3(c) and Figure 4.3(d), memory blocks may
have distinct cache behaviors in different loop nesting levels. The scope persistence of the
above-mentioned memory blocks is shown in Figure 4.5. In Figure 4.3, because temporal scope m4
is not aged to the evicted line $l_\top$ in either L1 or L2, m4 is persistent in both scope m4[L1] and
m4[L2]. Therefore, we annotate the iterations of L1 and L2 bounded by m4 with PS. On the
other hand, m5 is not persistent in the outer loop L1 (annotated as ¬PS) but is persistent in the
inner loop L2, so m5 is persistent in scope m5[L2] but not m5[L1]. m6 is not persistent in any of the
loop levels. Pessimistically categorizing all data accesses from B[i][j] as Non Classified (as in







Figure 4.5: Temporal scopes and loop iterations
data misses, which can be avoided in our scope-aware data cache analysis.
Our multi-level analysis computes fixed-point abstract cache states $\hat{c}^{in}_n[L]$ ($\hat{c}^{out}_n[L]$) for the
entry (exit) of each CFG node n in each loop level L. If m is persistent in scope m[L] (or
mD[L]) of loop level L, accesses to m by data reference D incur only one cold miss for each
complete execution of L (between entry and exit). Let $L_{ps}$ be the outermost loop level where
m is persistent. Hence, accesses to m incur 1 cold miss for each execution of $L_{ps}$ (including all
its inner loops). The following function blockMiss(D, m) computes the maximum number of
cache misses incurred by accesses of D to m:

$$blockMiss(D, m) = \begin{cases}
\prod_{L_i \in reside(D)} (m[L_i].up - m[L_i].lw + 1) & \text{if } L_{ps} = \emptyset \\
1 & \text{if } outer(L_{ps}) = \emptyset \\
\prod_{L_i \in outer(L_{ps})} (m[L_i].up - m[L_i].lw + 1) & \text{otherwise}
\end{cases}$$

where $outer(L_{ps})$ is the set of all outer loops of $L_{ps}$. In other words, blockMiss(D, m) computes
the number of times $L_{ps}$ is executed (in its outer loops) given the temporal scope where m
may get accessed by D. In case m is not persistent in any loop level ($L_{ps} = \emptyset$), each access
to m within its temporal scope results in 1 miss. On the other hand, if $L_{ps}$ is the outermost loop
of the program (m is globally persistent), all accesses to m incur only 1 cold miss.
As illustrated in Figure 4.5, L1 is the outermost loop where m4 is persistent. Since L1
is the outermost loop, m4 causes at most one cold miss globally. m5 is only persistent in L2.
Therefore, accesses to m5 from B[i][j] cause one cold miss for each iteration of L1 in the
interval [1, 1] defined by m5[L1]. m6 is not persistent in any level, so all occurrences of B[i][j]
in the scope result in cache misses. The temporal scope m6 covers interval [2, 2] of L1 and
[0, 7] of L2, so m6 causes at most 1 × 1 × 8 = 8 misses to B[i][j].
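The per-block miss bound can be sketched as a small Python function. This is an illustrative sketch, not the thesis implementation: the loop nest is described by a hypothetical `outer` dictionary mapping each loop to its enclosing loops, `l_ps=None` encodes the case where m is persistent in no loop level, and only the scope intervals quoted in the text (m5[L1] = [1, 1], m6[L1] = [2, 2], m6[L2] = [0, 7]) are used. It relies on `math.prod`, available from Python 3.8.

```python
from math import prod
from typing import Dict, Iterable, List, Optional, Tuple

Interval = Tuple[int, int]   # (lw, up) of a temporal scope in one loop


def block_miss(scope: Dict[str, Interval], reside: List[str],
               l_ps: Optional[str], outer: Dict[str, List[str]]) -> int:
    """Upper bound on misses caused by accesses of D to one memory block m.

    scope  : loop -> (lw, up) bounds of the temporal scope of m in that loop
    reside : loops in which data reference D resides
    l_ps   : outermost loop where m is persistent, or None if there is none
    outer  : loop -> list of its enclosing (outer) loops
    """
    def executions(loops: Iterable[str]) -> int:
        return prod(scope[l][1] - scope[l][0] + 1 for l in loops)

    if l_ps is None:             # never persistent: every access in the scope may miss
        return executions(reside)
    if not outer[l_ps]:          # persistent in the outermost loop: one cold miss
        return 1
    return executions(outer[l_ps])   # one cold miss per execution of l_ps


outer = {"L1": [], "L2": ["L1"]}
reside = ["L1", "L2"]

miss_m4 = block_miss({}, reside, "L1", outer)                            # persistent in L1
miss_m5 = block_miss({"L1": (1, 1)}, reside, "L2", outer)                # persistent only in L2
miss_m6 = block_miss({"L1": (2, 2), "L2": (0, 7)}, reside, None, outer)  # not persistent

# Six further blocks of B[i][j] are persistent in both loops, one cold miss each.
total_b = miss_m6 + miss_m5 + 6 * 1
```

The three calls reproduce the per-block bounds discussed above for m4, m5 and m6, and summing them with the six fully persistent blocks gives the overall bound for B[i][j] in the motivating example.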
Finally, the maximal possible number of cache misses incurred by D is the summation of
blockMiss(D, m) over all memory blocks that D may access:

$$\sum_{m \in Addr(D)} blockMiss(D, m)$$
In our motivating example, B[i][j] accesses 8 memory blocks ({m2, ..., m9}). According
to our scope-aware analysis results shown in Figure 4.3, m6 is non-persistent in both L1 and L2,
m5 is persistent only in L2, and the other 6 memory blocks are persistent in both loops. According
to our cache miss estimation, the maximal number of cache misses from B[i][j] is
8 + 1 + 1 × 6 = 15 misses, compared to the original pessimistic analysis, which considers all
accesses of B[i][j] to be misses, for a total of 64 cache misses.
4.6 Experimental Results
In this section, we evaluate the performance of our proposed scope-based persistence analysis
using the data-intensive routines taken from the WCET Benchmarks [1]. We assume the
benchmarks are executed on a processor architecture with a 5-stage pipeline, in-order execution,
perfect branch prediction, and separate L1 instruction and data caches. Both the instruction and
data caches have a cache size of 2 KB, a block size of 32 B, associativity 2, and a perfect LRU
replacement policy. The cache hit latency is 1 cycle, and the cache miss latency is 6 cycles. We use
the SimpleScalar tool set [3] to obtain simulation results. We extend SimpleScalar to support a
write-through, no-write-allocate policy with no write buffer in the simulation, to be consistent
with the assumptions made in our analysis. The cache analysis results on the maximum number
of data cache misses for each data reference are integrated as linear constraints into Chronos
[9], an ILP-based WCET analysis tool for static WCET estimation. We use the existing instruction
cache modeling in Chronos [16]. As we assume separate instruction and data caches,
we can model their behavior separately. In the current experiment, we assume a processor
architecture without timing anomalies [7]. However, it is possible to integrate our cache analysis
results in the presence of timing anomalies. To deal with the timing anomaly problem, we can
consider the cache behavior of data references in the pipeline analysis, similar to [16]. If a data
reference D is persistent, then the latency corresponding to D in the pipeline analysis is N (miss)
cycles for the first execution, and one (hit) cycle for the subsequent executions. Table 4.1 shows
the set of benchmarks used in our evaluation. We have enlarged array sizes (and corresponding
loop bounds) to introduce more data cache conflicts and amplify the effect of data cache
performance on overall program execution time. Array Size shows the array size used in our
simulation and analysis for each of the benchmarks. Simulation shows the observed WCET
from SimpleScalar simulation in CPU clock cycles. However, the simulation results may be
smaller than the actual WCET values for benchmarks with input-dependent branches/accesses
(e.g., Cnt, Bsort100, InsertSort and Adpcm). Finally, we report the WCET results obtained with our
scope-aware persistence data cache analysis, as well as the time spent on the analysis (on an
Intel(R) Xeon(TM) 2.20 GHz processor with 2.5 GB of RAM).

Table 4.1: Benchmark descriptions and WCET estimation result

Benchmark   Description                                      Array Size  Simulation (cycles)  Our WCET estimate (cycles)  Analysis time
Edn         Finite Impulse Response (FIR) filter.            2048        2,542,444            2,631,312                   0.25s
Fdct        Fast Discrete Cosine Transform.                  2048        917,636              970,646                     0.82s
Cnt         Counts non-negative numbers in a matrix.         32 × 32     21,611               22,826                      0.02s
Matmult     Matrix multiplication.                           24 × 24     374,887              467,116                     0.02s
Bsort100    Bubblesort program.                              1024        15,945,200           16,556,926                  0.02s
InsertSort  Insertion sort on a reversed array.              1024        14,900,732           16,298,086                  0.58s
Jfdctint    Discrete-cosine transformation of pixel blocks.  256 × 64    1,485,075            1,499,938                   2.05s
Lms         LMS adaptive signal enhancement.                 1024        1,425,585            1,527,952                   0.02s

Figure 4.6: WCET estimation results from different analyses
(y-axis: Est/Obs Ratio; x-axis: Edn, Fdct, Cnt, Matmult, Bsort100, Insertsort, Jfdctint, Lms, Adpcm;
series: Simulation Result, Persistence Analysis, Must Analysis (20% unrolling), Must Analysis
(50% unrolling), Our Analysis)
We have implemented the must analysis with loop unrolling as proposed in [20], and the
revised persistence analysis (Section 3), to compare with our proposed scope-aware analysis.
Figure 4.6 shows the percentage of overestimation from the various data cache analysis approaches,
compared to the normalized observed WCET results from SimpleScalar simulation (shown in
Table 4.1). Given the array sizes in our experiment, since the entire array does not fit into the
data cache for any of the benchmarks, no memory block can be categorized as persistent in
the original persistence analysis of [11]. As a result, the estimated WCET results with the original
persistence analysis are up to 83% higher than the observed WCET (for InsertSort). We also
compare the estimated WCET results using must analysis with 20% and 50% virtual unrolling
of the loop nest [20], where the analysis is repeatedly performed for each unrolled loop
iteration. As shown in Figure 4.6, even when 50% of the loop nest is unrolled, [20] still reports
up to 65% higher WCET estimates compared to the observed simulation time (for Adpcm). In
particular, must analysis requires loop unrolling to bring memory blocks into the data cache and
to capture subsequent cache reuse. As a result, for the remaining portion of the loop nest where
unrolling is not applied, it cannot capture any cache reuse.
On the other hand, our scope-aware analysis always obtains tighter WCET estimates com-
pared to existing approaches. In most of the benchmarks, our WCET estimates are less than
10% higher than the simulation results (except for Matmult and Adpcm). We observe that many
data references in these benchmarks have sequential array access patterns. They traverse array
elements in sequential order, according to the row-major arrangement of array in the memory.
Our scope-aware approach fully captures the temporal locality of such data accesses to bound
the worst-case data cache performance. Our scope-aware persistence analysis achieves 12%
to 74% tighter WCET estimates compared to original persistence analysis, and 5% to 35%
compared to must analysis with 50% unrolling.
Matmult contains a column array access in addition to sequential array accesses. In our
analysis, a temporal scope captures the lower and upper bound of loop iterations where a mem-
ory block may get accessed. For column array access, array elements contained in a single
memory block are usually accessed in non-contiguous loop iterations, which leads to over-
estimation in the computed temporal scopes. However, as shown in Figure 4.6, our estimated
WCET is only 25% higher than the observed WCET, and is 10% to 40% tighter than other
approaches.
Adpcm is a complex benchmark with input-dependent branches and accesses, so our simula-
tion result may underestimate the real WCET. Due to the presence of input-dependent branches
and accesses, must analysis cannot guarantee a memory block to be loaded into the cache for
subsequent reuse even with unrolling. In our scope-aware persistence analysis, by guaranteeing
the scope persistence of memory blocks, we can achieve 20% tighter WCET estimate compared




In this thesis, we have revised and corrected the persistence analysis as proposed in [10, 11],
and presented a novel data cache modeling approach for static WCET analysis. Our analysis
effectively exploits regular data access patterns, while retaining the strength and wide applica-
bility of the abstract interpretation approach. We define temporal scopes to capture the local
behavior of memory references (when a particular memory block is accessed). These tem-
poral scopes are automatically calculated during address analysis. Our proposed scope-aware
multi-level data cache analysis extends the cache persistence analysis framework to compute
fine-grained scope-based persistence information to tightly capture the worst case performance
of data cache. Our data cache modeling has been integrated into the open-source Chronos
WCET analyzer ([9]).
In terms of future work, our scope-aware data cache analysis inherits the flexibility and
wide applicability of the abstract interpretation framework. Abstract interpretation techniques
have been used for cache analysis in various cache types and environments, e.g. unified
data/instruction cache analysis [5], multi-level caches [21], and instruction caches in multi-core
platforms [6]. An immediate next step is to develop a tight and scalable analysis of data caches
in multi-core platforms, based on the analysis technique proposed for instruction caches in [6].
Besides caches, the abstract interpretation approach is also used to analyze other mechanisms
that bridge the gap between processor speed and memory access time, such as prefetching.
Recently, there has been an effort to estimate WCET in the presence of instruction prefetch [26].
Data prefetching [15, 24] has been used to hide the latency of data memory access. Our scope-




[1] WCET Benchmarks, http://www.mrtc.mdh.se/projects/wcet/benchmarks.html.
[2] C. Ballabriga and H. Casse. Improving the First-Miss Computation in Set-Associative
Instruction Caches. In ECRTS, 2008.
[3] D. Burger and T.M. Austin. The SimpleScalar tool set, version 2.0. ACM SIGARCH,
25(3), 1997.
[4] S. Chatterjee, E. Parker, P. J. Hanlon, and A. R. Lebeck. Exact analysis of the cache
behavior of nested loops. In PLDI, 2001.
[5] S. Chattopadhyay and A. Roychoudhury. Unified cache modeling for wcet analysis and
layout optimizations. In RTSS, 2009.
[6] S. Chattopadhyay, A. Roychoudhury, and T. Mitra. Modeling shared cache and bus in
multi-cores for timing analysis. In SCOPES, 2010.
[7] J. Reineke et al. A definition and classification of timing anomalies. In WCET Workshop,
2006.
[8] S. Lim et al. An accurate worst case timing analysis for risc processors. IEEE Transac-
tions on Software Engineering, 21(7):593–604, 1995.
[9] X. Li et al. Chronos: A timing analyzer for embedded software. Science of Computer
Programming, 69(1-3):56–67, 2007, http://www.comp.nus.edu.sg/~rpembed/chronos.
[10] C. Ferdinand. Cache behavior prediction for real-time systems. PhD thesis, Saarland
University, 1999.
[11] C. Ferdinand and R. Wilhelm. On predicting data cache behavior for real-time systems.
In LCTES, 1998.
[12] B.B. Fraguela, D. Andrade, and R. Doallo. Address-independent estimation of the worst-
case memory performance. IEEE Transactions on Industrial Informatics, 2010.
[13] S. Ghosh, M. Martonosi, and S. Malik. Cache miss equations: a compiler framework for
analyzing and tuning memory behavior. ACM Transactions on Programming Languages
and Systems, 21(4):703–746, 1999.
[14] S.K. Kim, S.L. Min, and R. Ha. Efficient worst case timing analysis of data caching. In
RTAS, 1996.
[15] Alexander C. Klaiber and Henry M. Levy. An architecture for software-controlled data
prefetching. SIGARCH Comput. Archit. News, 19:43–53, April 1991.
[16] X. Li, A. Roychoudhury, and T. Mitra. Modeling out-of-order processors for software
timing analysis. In RTSS, 2004.
[17] Y.-T. S. Li, S. Malik, and A. Wolfe. Cache modeling for real-time software: beyond direct
mapped instruction caches. In RTSS, 1996.
[18] H. Ramaprasad and F. Mueller. Bounding worst-case data cache behavior by analytically
deriving cache reference patterns. In RTAS, 2005.
[19] J. Reineke et al. Timing predictability of cache replacement policies. Real-Time Systems,
2007.
[20] R. Sen and Y.N. Srikant. WCET estimation for executables in the presence of data caches.
In EMSOFT, 2007.
[21] Tyler Sondag and Hridesh Rajan. A more precise abstract domain for multi-level caches
for tighter WCET analysis. In RTSS, 2010.
[22] J. Staschulat and R. Ernst. Worst case timing analysis of input dependent data cache
behavior. In ECRTS, 2006.
[23] H. Theiling, C. Ferdinand, and R. Wilhelm. Fast and precise WCET prediction by sepa-
rated cache and path analyses. Real-Time Systems, 18(2):157–179, 2000.
[24] S.P. Vander Wiel and D.J. Lilja. When caches aren’t enough: data prefetching techniques.
Computer, 30(7):23–30, July 1997.
[25] R. T. White et al. Timing analysis for data and wrap-around fill caches. Real-Time Systems,
17(2-3):209–233, 1999.
[26] Jun Yan and Wei Zhang. Analyzing the worst-case execution time for instruction caches
with prefetching. ACM Trans. Embed. Comput. Syst., 8:7:1–7:19, January 2009.
