Cache Persistence Analysis: Finally Exact by Stock, Gregory et al.
Gregory Stock, Sebastian Hahn, and Jan Reineke
Saarland University
Saarland Informatics Campus
Saarbrücken, Germany
{g.stock, sebastian.hahn, reineke}@cs.uni-saarland.de
Abstract—Cache persistence analysis is an important part
of worst-case execution time (WCET) analysis. It has been
extensively studied in the past twenty years. Despite these efforts,
all existing persistence analyses are approximative in the sense
that they are not guaranteed to find all persistent memory blocks.
In this paper, we close this gap by introducing the first exact
persistence analysis for caches with least-recently-used (LRU)
replacement. To this end, we first introduce an exact abstraction
that exploits monotonicity properties of LRU to significantly
reduce the information the analysis needs to maintain for exact
persistence classifications. We show how to efficiently implement
this abstraction using zero-suppressed binary decision diagrams
(ZDDs) and introduce novel techniques to deal with uncertainty
that arises during the analysis of data caches.
The experimental evaluation demonstrates that the new exact
analysis is competitive with state-of-the-art inexact analyses in
terms of both memory consumption and analysis run time, which
is somewhat surprising as we show that persistence analysis is
NP-complete. We also observe that, while prior analyses are not
exact in theory, they come close to being exact in practice.
I. INTRODUCTION
Modern processors can perform several arithmetic and logic
operations in a single cycle. On the other hand, a single access
to main memory can take hundreds of cycles. To bridge this
performance gap, modern processors include one or multiple
levels of caches. Caches are small but fast memories that store
parts of main memory to quickly serve accesses to commonly
used instructions and data. Memory accesses that “hit” the
cache are served from the cache at a low latency, while
accesses that “miss” the cache are served from main memory
at a much higher latency. The execution time of a program
thus heavily depends on how effective the processor’s caches
are in hiding the high latency of main memory.
Real-time systems are systems that, in order to function
correctly, have to perform their computations within a limited
amount of wall-clock time. To verify a system’s real-time
behavior, a major task is to bound each software component’s
worst-case execution time (WCET). In the presence of caches,
WCET analysis [1] has to account for the software’s cache
behavior. Simply assuming that each memory access could
result in a cache miss would yield extremely pessimistic
WCET bounds. Thus, static cache analyses [2] have been
developed to soundly and precisely characterize a program’s
cache behavior on a particular cache architecture. These can
broadly be categorized into two groups:
1) Classifying cache analyses aim to classify individual
memory accesses as cache hits or cache misses.
2) Quantitative cache analyses aim to determine the number
of cache misses resulting from a set of memory accesses.
[Figure 1: control-flow graph of a loop whose iterations access either block x or block y.]
Fig. 1: Simple motivating example for persistence analysis.
In this paper we study persistence analysis, an instance of
quantitative cache analysis. Persistence analysis considers all
memory accesses in a program, or a fragment of a program
such as a loop, that access the same memory block. A memory
block is persistent if all memory accesses referring to this
memory block may cumulatively result in at most one cache
miss during any possible program execution.
For a motivating example, consider Figure 1, which contains
the control-flow graph of a simple program. The program
consists of a loop in which each iteration accesses either
memory block x or memory block y. As neither
block x nor block y is guaranteed to have been accessed in any
loop iteration, it is impossible for a classifying cache analysis
to classify any of the memory accesses in the program as a
guaranteed cache hit, and so a WCET analysis would have to
pessimistically account for misses upon all memory accesses.
However, provided the cache is large enough to hold blocks x
and y simultaneously, among all memory accesses to x (and
similarly to y) only the very first may result in a cache miss.
Both x and y are persistent and WCET analysis can safely
account for at most two misses in total.
Given a program, the goal of persistence analysis is to deter-
mine which of the memory blocks accessed in the program are
persistent. Persistence analysis has been extensively studied for
caches with least-recently-used (LRU) replacement, starting
with Mueller’s [3]–[6] and Ferdinand’s [7], [8] work in the
1990s up until today [9]–[17]. Notably, all prior persistence
analyses are approximative, in the sense that they are not
guaranteed to find all persistent memory blocks of a program.
In this paper, we close this gap by introducing the first
exact persistence analysis. We develop this analysis via a
sequence of three consecutive exact abstractions. The first
abstraction is based on the observation that the persistence of
a memory block can be determined by examining its possible
conflict sets, i. e., the sets of blocks that may have been
accessed since the last access to the block itself. The two
following abstractions exploit a monotonicity property of LRU
replacement to further increase analysis efficiency without
sacrificing exactness.
Next, we discuss how to efficiently implement the exact
abstraction using zero-suppressed binary decision diagrams
(ZDDs), a data structure that enables the compact repre-
sentation of sets of conflict sets, sharing information across
program points and between different memory blocks. We also
introduce novel techniques to deal with uncertainty about the
memory-access behavior that arises in data cache analysis.
We experimentally evaluate the new exact persistence anal-
ysis and a selection of prior persistence analyses and make the
following high-level observations:
• Our exact analysis is competitive with prior analyses in
terms of both memory consumption and analysis run time.
• While prior persistence analyses are not exact in theory,
they come very close to being exact in practice.
Even though our exact persistence analysis is fairly efficient
in practice, its worst-case complexity is exponential. We show
that persistence analysis is NP-complete, which implies that a
persistence analysis that is polynomial in all input parameters
is not attainable, unless P=NP.
II. BACKGROUND: CACHES, CONTROL-FLOW GRAPHS,
AND CACHE PERSISTENCE ANALYSIS
A. Caches
Caches are fast but small memories that buffer parts of the
large but slow main memory in order to bridge the speed gap
between the processor and main memory. Caches operate at
the granularity of memory blocks b ∈ B, which are stored in
the cache in cache lines of the same size. In order to facilitate
the cache lookup, the cache is organized in sets such that each
memory block maps to a unique cache set. The size k of a
cache set is called the associativity of the cache. If an accessed
block resides in the cache, the access hits the cache. Upon a
cache miss, the block is loaded from main memory. To ease the
formal presentation in this paper, we assume a fully-associative
cache, i. e., with a single cache set. Set-associative caches with
n sets can be treated as n independent fully-associative caches
as described in [18].
Upon a cache miss, another memory block has to be evicted
due to the limited size of the cache. The block to evict
is determined by the replacement policy. In this paper, we
assume the least-recently-used (LRU) policy that replaces the
block that has not been accessed for the longest time. A memory
block b hits in an LRU cache of associativity k if b has been
accessed before and fewer than k distinct other blocks have been
accessed since the last access to b. LRU is generally considered
to be the most predictable replacement policy [19].
In this paper, we refer to the age of block b as the number
of distinct blocks accessed since the last access to b, including
b itself.¹ Thus, an access to block b hits the cache if the age of b
is less than or equal to the associativity k.
¹ This is subtly different from most of the related work, in which the age of
a block does not account for the access of the block itself.
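The hit condition and the age convention above can be checked against a concrete LRU cache. The following is a minimal sketch in Python; the function name `simulate_lru` and the log format are illustrative assumptions and not part of the paper.

```python
from collections import OrderedDict

def simulate_lru(trace, k):
    """Replay `trace` on an initially empty LRU cache of associativity k.

    Returns a list of (block, age, hit) tuples. The age follows the
    convention of this paper: the number of distinct blocks accessed since
    the last access to the block, including the block itself (None if the
    block has not been accessed before).
    """
    cache = OrderedDict()   # least recently used block first
    last_seen = {}          # block -> index of its most recent access
    log = []
    for i, b in enumerate(trace):
        age = len(set(trace[last_seen[b]:i])) if b in last_seen else None
        hit = b in cache
        # Hit condition from the text: the block hits iff its age is <= k.
        assert hit == (age is not None and age <= k)
        if hit:
            cache.move_to_end(b)
        else:
            if len(cache) == k:
                cache.popitem(last=False)   # evict the least recently used block
            cache[b] = True
        last_seen[b] = i
        log.append((b, age, hit))
    return log

# Blocks x and y alternate in a cache of associativity 2: two misses in total.
misses = sum(not hit for _, _, hit in simulate_lru("xyxyx", 2))
```

With associativity 1, the same trace misses on every access, since each access to x is separated from the previous one by an access to y.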
B. Programs as Control-Flow Graphs
In this paper, we follow the common approach of rep-
resenting the program under analysis by its control-flow
graph (CFG). A CFG G = (V,E, i) consists of a set of
vertices V , corresponding to control locations in the program;
a set of edges E ⊆ V × B × V , which represent the possible
control flow between locations; and the initial control location
i ∈ V . Each edge is annotated with a single memory block
accessed between the source and target location.
The CFG is an abstraction of the program behavior as it
does not capture the functional semantics of the instructions.
In particular, all paths in the graph are assumed to be feasible
even if, in reality, some are not, e. g., in case of a nested
conditional statement with contradicting conditions. All our
claims of exactness are relative to this control-flow graph
abstraction. Incorporating the program semantics into the per-
sistence analysis problem immediately renders it undecidable
due to Rice’s theorem. For example, it is undecidable whether
a certain path in a program is feasible, and hence, it is also
undecidable to collect all memory access sequences which are
needed for exact persistence analysis.
To ease the visualization of a CFG, we allow empty edges
on which no access is performed. For the sake of simplicity,
we do not treat such empty edges in the formalization, but
the extension would be trivial as such empty edges do not
influence the cache state.
C. Notion of Persistence
A memory block is persistent during a program’s execution
if all accesses to the memory block collectively result in at
most one cache miss. Assuming the cache is empty at the
start of the program’s execution, the first access to any memory
block will always result in a cache miss. Thus, to be persistent,
all accesses but the very first to a memory block must hit the
cache.
In other words, a memory block is persistent during a
program’s execution if every access to the block hits the cache
whenever the block has been accessed before. As an example, in
Figure 1, as discussed before, blocks x and y are persistent in
a cache of associativity two.
Due to its dependence on previous accesses, persistence is a
property of execution traces. A trace τ = b0b1 . . . bn−1 ∈ B∗
is a sequence of memory blocks. We use τi to denote the i-th
block bi in the sequence and |τ | := n to denote its length. The
trace of length zero is denoted by ε.
The conflict set of block b on trace τ is the set of all memory
blocks accessed from the last access of block b onward:
$$\mathit{CS}(\tau, b) := \bigcap_{\substack{0 \le i < |\tau| \\ \tau_i = b}} \{\tau_j \mid j \ge i\}$$
If block b has not been accessed on trace τ , the empty
intersection yields the universe B.
The cardinality of the conflict set is referred to as the age
of block b at the end of trace τ :
age(τ, b) := |CS (τ, b)|
In accordance with the discussion at the beginning of this
subsection, we call block b persistent on trace τ if b is cached
once it has been accessed before.
Definition II.1 (Persistence on Trace). Memory block b ∈ B
is persistent on trace τ ∈ B∗ if:
(∃0 ≤ i < |τ | : τi = b)→ age(τ, b) ≤ k
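As an illustration of the definitions of CS, age, and persistence on a trace, consider the following minimal Python sketch. The function names and the explicit `universe` parameter (standing for B) are assumptions made for this example. Since the suffix sets in the intersection are nested, the intersection equals the suffix set that starts at the last access to b.

```python
def conflict_set(trace, b, universe):
    """CS(trace, b): the blocks accessed from the last access to b onward.

    If b does not occur in the trace, the empty intersection yields the
    whole universe of blocks.
    """
    for i in range(len(trace) - 1, -1, -1):
        if trace[i] == b:
            return frozenset(trace[i:])
    return frozenset(universe)

def age(trace, b, universe):
    """age(trace, b) := |CS(trace, b)|."""
    return len(conflict_set(trace, b, universe))

def persistent_on_trace(trace, b, k, universe):
    """Definition II.1: if b occurs in the trace, its age must be at most k."""
    accessed = any(block == b for block in trace)
    return (not accessed) or age(trace, b, universe) <= k

U = {"v", "w", "x", "y"}
trace = ["v", "w", "x", "w"]
cs_v = conflict_set(trace, "v", U)   # blocks v, w, and x
```

On this trace, v has age 3, so it is persistent on the trace for k = 3 but not for k = 2.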
The above definition captures persistence of a block on
a single trace. In order to reason about the persistence of
a memory block during a program’s execution, we need to
capture all possible traces of a given program. To this end, we
define the trace semantics, which captures for each location in
a control-flow graph all possible traces that end in this location.
Formally, the trace semantics is defined as the least solu-
tion RC of the following set of equations where RC : V →
P(B∗) maps each location to a set of traces:
$$\forall w \in V:\quad R^C(w) = I^C(w) \cup \bigcup_{(v,b,w) \in E} \mathit{update}\bigl(R^C(v), b\bigr)$$
where update(T, b) is defined as {τ ◦ b | τ ∈ T} and ◦
denotes the concatenation operator. The initial values, denoted
by IC(w), are the empty set for all locations except the
program’s entry point, which is initialized to the set containing
the empty trace: IC(i) = {ε}.
The equations can intuitively be understood as follows:
Either a trace starts in location w, then it is given by
IC(w) = {ε}; or it starts elsewhere and reaches w from one
of its predecessors v via edge (v, b, w). In the latter case,
the trace reaching location w is obtained by concatenating the
memory access b on the edge from v to w to the trace reaching
location v.
A memory block b is persistent throughout a given program
if b is persistent on each trace that may be immediately
followed by an access to b. The set Vb = {v ∈ V |
∃w ∈ V : (v, b, w) ∈ E} contains the program locations that
can be followed by an access to block b.
Definition II.2 (Persistence). Memory block b ∈ B is persis-
tent, denoted by persistent(b), if at each location v ∈ Vb ⊆ V
that can be followed by an access to b:
$$\mathit{persistent}\bigl(R^C(v), b\bigr) := \forall \tau \in R^C(v) : \mathit{persistent}(\tau, b)$$
A memory block b may cause at most one miss during any
possible execution through G if and only if it is persistent
according to the definition above.
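Since the trace semantics is infinite in general, Definition II.2 cannot be evaluated by enumeration. For intuition only, the following Python sketch checks the definition on all traces up to a bounded length; a bounded check like this can refute persistence but cannot establish it. The function names, the length bound, and the single-location encoding of the loop from Figure 1 are assumptions made for this example.

```python
def traces_up_to(edges, entry, max_len):
    """Map each location to the traces of length <= max_len that reach it.

    `edges` is a list of (source, block, target) triples, as in E ⊆ V × B × V.
    """
    reach = {entry: {()}}            # the empty trace reaches the entry point
    frontier = [(entry, ())]
    while frontier:
        v, trace = frontier.pop()
        if len(trace) == max_len:
            continue
        for src, b, dst in edges:
            if src != v:
                continue
            extended = trace + (b,)
            if extended not in reach.setdefault(dst, set()):
                reach[dst].add(extended)
                frontier.append((dst, extended))
    return reach

def persistent_up_to(edges, entry, b, k, max_len):
    """Check Definition II.2 for block b on all traces of bounded length."""
    reach = traces_up_to(edges, entry, max_len)
    sources = {src for src, blk, _ in edges if blk == b}   # the set V_b
    for v in sources:
        for trace in reach.get(v, ()):
            if b not in trace:
                continue                      # first access to b may miss
            last = len(trace) - 1 - trace[::-1].index(b)
            if len(set(trace[last:])) > k:    # age(trace, b) > k
                return False
    return True

# Assumed encoding of Figure 1: one location h with a self-loop per block.
loop = [("h", "x", "h"), ("h", "y", "h")]
ok_k2 = persistent_up_to(loop, "h", "x", k=2, max_len=6)
ok_k1 = persistent_up_to(loop, "h", "x", k=1, max_len=6)
```

For k = 1 the trace xy already violates the definition, so the check fails, whereas no violation exists for k = 2.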
Scopes: A memory block might only persist in the cache
during a certain phase of execution, e. g., in the innermost loop
of a loop nest, and not during the overall program execution.
To capture this behavior, scopes have been introduced in [9],
[11] to describe a portion of program execution. A block that
is persistent within a scope can cause at most one miss for
each entrance of the scope during execution. For the sake of
readability, we limit our formalization to persistence within the
whole program. It is an easy exercise to extend the definitions
to account for scopes. Indeed, the experimental evaluation in
Section VI is performed by analyses at scope level.
D. Existing Cache Persistence Analyses
In general, the trace semantics is not computable as the
number of traces may be infinite and the lengths of individ-
ual traces are unbounded. Persistence analyses thus rely on
abstractions of cache traces that lead to finite representations.
A variety of cache persistence analyses has been proposed
in the literature [3]–[16] with varying degrees of precision. In
this section, we briefly discuss two of the existing persistence
analyses, C-Must and Block-CS, to convey how such persistence
analyses operate in general, and also to illustrate that
they are not exact.
Observe that according to Definition II.2, whether or not a
memory block b is persistent is determined by the sizes of b’s
conflict sets at all program points that may be followed by
accesses to block b. As a consequence, all existing persistence
analyses can be seen as approximating the possible sets of
conflict sets of each memory block at each program point.
Existing persistence analyses employ two different ap-
proaches to approximate the conflict sets of a memory block:
1) The C-Must analysis [7], [8] maintains an upper bound
on the sizes of all of the block’s possible conflict sets.
2) The Block-CS analysis [11], [14] maintains a superset
of all of the block’s possible conflict sets.
See Figure 2 for an example illustrating the results of the two
analyses. For readability, the figure only includes the analysis
information for memory block v. Following the access to v, the
only possible conflict set of v is {v} and so C-Must maintains
a bound of 1 and Block-CS maintains {v} as a superset of
all of v’s conflict sets. After the possible accesses to w and x,
the possible conflict sets of v are {v, w} and {v, x}, and
C-Must(v) = 2. However, to overapproximate both conflict
sets, Block-CS(v) = {v, w, x} = {v, w} ∪ {v, x}. Thus,
C-Must is able to conclude that v is persistent in a cache
of size 2, while Block-CS is not, as |{v, w, x}| > 2. Clearly,
Block-CS is not an exact analysis.
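The difference between the two summaries in this example can be reproduced directly from the two possible conflict sets of v. The following Python sketch only applies the two summaries to known conflict sets; it does not implement the C-Must or Block-CS analyses themselves, and the function names are assumptions made for this example.

```python
def size_bound(conflict_sets):
    """Summary in the style of C-Must: an upper bound on the set sizes."""
    return max(len(s) for s in conflict_sets)

def superset(conflict_sets):
    """Summary in the style of Block-CS: one set covering all conflict sets."""
    return set().union(*conflict_sets)

k = 2
conflict_sets_of_v = [{"v", "w"}, {"v", "x"}]

bound = size_bound(conflict_sets_of_v)   # 2
cover = superset(conflict_sets_of_v)     # the set of v, w, and x

persistent_by_bound = bound <= k         # True: the bound fits into the cache
persistent_by_cover = len(cover) <= k    # False: the cover has three elements
```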
Unfortunately, as the example in Figure 3 illustrates,
C-Must is not exact either: Both x and y are persistent in a
cache of size 2, and indeed Block-CS is able to show that,
as Block-CS(x) = {x, y}. On the other hand, C-Must is not
able to derive any finite bound on the sizes of x’s possible
conflict sets. This is because C-Must does not “remember”
whether or not a given memory block has already been
accounted for in its upper bounds. This may lead the analysis
to account for the same block multiple times in loops.
E. A General Framework for Cache Persistence Analyses
Persistence analyses can be formalized within the frame-
work of abstract interpretation [20] to reason about their
correctness and precision. Here, we briefly present a simpli-
fied version of the persistence analysis framework developed
in [17]. Persistence abstractions Ĉ are characterized by
• an abstract update function ûpdate to model the effect of
a memory access,
• a join operator ⊔ to combine multiple abstract traces into
one at control-flow joins, and
• an abstract function p̂ersistent to classify memory blocks
as persistent.
[Figure 2: control-flow graph with accesses to v, w, and x. After the access to v: C-Must: v ↦ 1, Block-CS: v ↦ {v}. After the possible accesses to w and x: C-Must: v ↦ 2, Block-CS: v ↦ {v, w, x}.]
Fig. 2: Example illustrating C-Must and Block-CS, which
shows that C-Must may be more precise than Block-CS.
[Figure 3: control-flow graph with accesses to x and y, annotated with C-Must: x ↦ ∞ and Block-CS: x ↦ {x, y}.]
Fig. 3: Example illustrating C-Must and Block-CS, which
shows that Block-CS may be more precise than C-Must.
Note that the join operator also defines a partial order ⊑ on
the abstract traces as follows: x ⊑ y if and only if x ⊔ y = y.
This partial order captures the relative precision of different
abstract traces, where x ⊑ y implies that x is more precise
analysis information than y.
In order to formally capture the meaning of abstract traces,
abstraction and concretization functions, α and γ, can be
defined to relate sets of concrete traces to abstract traces.
Analogously to the concrete semantics, the abstract seman-
tics is captured as the least solution R̂ : V → Ĉ of the
following set of equations:
∀w ∈ V : R̂(w) = Î(w) unionsq
⊔
(v,b,w)∈E
ûpdate
(
R̂(v), b
)
The initial value Î(w) is ⊥, the bottom element of the partial
order ⊑, for all locations except for the program’s entry point,
where Î(i) = α(IC(i)).
A sound persistence analysis overapproximates the concrete
trace semantics, i. e., RC(v) ⊆ γ(R̂(v)) for all locations v.
Equivalently, given that (α, γ) form a Galois connection [20],
we have α(RC(v)) ⊑ R̂(v) for all locations v.
Analogously to persistence in the concrete case, a memory
block b is classified as persistent in a given program, denoted
by p̂ersistent(b), if b is classified as persistent at each control
location in Vb, i. e., each location that might be immediately
followed by an access to b. A sound persistence analysis never
classifies a block as persistent that is not actually persistent
on all concrete traces:
Definition II.3 (Soundness of Persistence Analysis). A persistence
analysis is sound if:
$$\widehat{\mathit{persistent}}(b) \rightarrow \mathit{persistent}(b)$$
Sound abstractions are not guaranteed to be exact, i. e.,
there can be persistent memory blocks that are not classified
as persistent by the abstraction. Indeed, none of the existing
persistence analyses is exact [17].
III. EXACT CACHE PERSISTENCE ANALYSIS:
A SEQUENCE OF ABSTRACTIONS
In this paper, we do not just aim for yet another sound
persistence analysis, but we aim for an exact persistence
analysis that determines each and every persistent memory
block.
Definition III.1 (Exactness of Persistence Analysis). A persistence
analysis is exact if:
$$\widehat{\mathit{persistent}}(b) \leftrightarrow \mathit{persistent}(b)$$
Note that by definition an exact analysis is also sound. How
can we obtain an exact analysis? This requires an abstraction
that satisfies the following two properties:
1) Applying the abstraction to “perfect” concrete information
preserves enough information to precisely classify
memory blocks as persistent or not. This property is
comparatively easy to achieve. It is captured formally by (3) in
the theorem below. The C-Must abstraction, for example,
satisfies this property, while Block-CS does not.
2) Abstract joins and abstract updates may not lose any
additional information beyond the information loss inherent
to the abstraction itself. This is more difficult to achieve,
and indeed none of the existing persistence analyses achieves it.
This property is captured formally by (1) and (2) in the
theorem below.
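Theorem III.2 (Exactness of Persistence Analysis). A persistence
analysis over a finite abstract domain Ĉ is exact if:
$$\forall T, b:\quad \alpha(\mathit{update}(T, b)) = \widehat{\mathit{update}}(\alpha(T), b) \tag{1}$$
$$\forall I, T_i:\quad \alpha\Bigl(\bigcup_{i \in I} T_i\Bigr) = \bigsqcup_{i \in I} \alpha(T_i) \tag{2}$$
and the abstraction preserves the persistence classification:
$$\forall T, b:\quad \mathit{persistent}(T, b) \leftrightarrow \widehat{\mathit{persistent}}(\alpha(T), b) \tag{3}$$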
Proof. We will use standard arguments from abstract interpre-
tation and fixpoint theory (along the lines of [21]) to show that
(1) and (2) imply that:
∀v ∈ V : α(RC(v)) = R̂(v) (4)
i. e., the abstract semantics is precisely the abstraction of
the concrete semantics. Applying (3) to (4) then yields the
theorem.
To this end, we first define concrete and abstract transformers
f and f̂ as follows:
$$f(R) := \lambda w.\; I^C(w) \cup \bigcup_{(v,b,w) \in E} \mathit{update}(R(v), b)$$
$$\hat{f}(\hat{R}) := \lambda w.\; \hat{I}(w) \sqcup \bigsqcup_{(v,b,w) \in E} \widehat{\mathit{update}}\bigl(\hat{R}(v), b\bigr)$$
By construction, the trace semantics RC and the abstract
semantics R̂ are the least fixed points of f and f̂.
Applying (1) and (2) we show below that:
$$\alpha \circ f = \hat{f} \circ \alpha \tag{5}$$
where α : C → Ĉ is lifted to the required domain α : (V →
C) → (V → Ĉ) as follows:
$$\alpha(h) := \lambda v \in V.\; \alpha(h(v)) \tag{6}$$
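The transformations to show (5) are as follows:
$$\begin{aligned}
(\alpha \circ f)(R)
&\overset{\text{Def}}{=} \alpha\Bigl(\lambda w.\; I^C(w) \cup \bigcup_{(v,b,w) \in E} \mathit{update}(R(v), b)\Bigr) \\
&\overset{(6)}{=} \lambda w.\; \alpha\Bigl(I^C(w) \cup \bigcup_{(v,b,w) \in E} \mathit{update}(R(v), b)\Bigr) \\
&\overset{(1),(2)}{=} \lambda w.\; \hat{I}(w) \sqcup \bigsqcup_{(v,b,w) \in E} \widehat{\mathit{update}}(\alpha(R(v)), b) \\
&\overset{(6)}{=} \lambda w.\; \hat{I}(w) \sqcup \bigsqcup_{(v,b,w) \in E} \widehat{\mathit{update}}(\alpha(R)(v), b) \\
&\overset{\text{Def}}{=} (\hat{f} \circ \alpha)(R)
\end{aligned}$$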
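By construction, f is continuous [22], and f̂ is continuous as
it is monotone and the abstract domain is finite.
By Kleene’s fixpoint theorem (⋆) and (5), we have:
$$\begin{aligned}
\alpha\bigl(R^C\bigr)
&\overset{\text{Def}}{=} \alpha(\operatorname{lfp} f)
\overset{(\star)}{=} \alpha\Bigl(\bigcup_{i \in \mathbb{N}_0} f^i(\bot)\Bigr)
\overset{(2)}{=} \bigsqcup_{i \in \mathbb{N}_0} \alpha\bigl(f^i(\bot)\bigr)
\overset{(5)}{=} \bigsqcup_{i \in \mathbb{N}_0} \hat{f}^i(\alpha(\bot)) \\
&= \bigsqcup_{i \in \mathbb{N}_0} \hat{f}^i\bigl(\hat{\bot}\bigr)
\overset{(\star)}{=} \operatorname{lfp} \hat{f}
\overset{\text{Def}}{=} \hat{R}
\end{aligned} \tag{7}$$
Applying (3) to (7) then yields the theorem. □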
The remaining part of this section presents the abstraction of
cache traces underlying our new exact persistence analysis. For
pedagogical reasons, the abstraction is presented incrementally
in three steps: First, an abstraction of concrete traces as
a mapping from memory blocks to sets of conflict sets is
defined. In the second step, a monotonicity property of LRU
replacement is exploited to reduce the number of conflict sets
the analysis needs to track. Third, conflict sets that exceed
the cache’s associativity are collapsed to further improve
efficiency. The overall scheme is depicted in Figure 4.
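[Figure 4: chain of abstractions P(B∗) ⇄ Ĉ0 ⇄ Ĉ↑ ⇄ Ĉ≤k, connected by the abstraction and concretization pairs (α0, γ0), (α↑, γ↑), and (α≤k, γ≤k).]
Fig. 4: Exact persistence abstractions developed in this section.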
A. From Memory-Access Traces to Sets of Conflict Sets
The first abstraction Exact-CS0 maintains for each memory
block the set of all possible conflict sets that appear in cache
traces in which the memory block has been accessed. Recall
that the conflict set of block b is the set of all distinct memory
blocks accessed since the last access of b. Formally, the
abstract domain is defined as:
Ĉ0 := B → P(P(B))
An abstract trace Ŝ ∈ Ĉ0 represents all concrete traces where
for each memory block b ∈ B either b’s conflict set is in Ŝ(b)
or b has not been accessed on the trace.
Upon a memory access to block b, its set of conflict sets is
set to {{b}}, i. e., b conflicts just with itself. For all remaining
memory blocks b′, the accessed block b is added to every
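conflict set s ∈ Ŝ(b′):
$$\widehat{\mathit{update}}_0\bigl(\hat{S}, b\bigr) := \lambda b'.\; \begin{cases} \{\{b\}\} & \text{if } b' = b \\ \{s \cup \{b\} \mid s \in \hat{S}(b')\} & \text{otherwise} \end{cases} \tag{8}$$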
Note that as long as a block b′ has not been accessed, the
information Ŝ(b′) = ∅ propagates.
At control-flow joins in the program, the union of the sets
of conflict sets is taken:
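$$\forall I \subseteq \mathbb{N}_0:\quad \mathop{{\bigsqcup}_{0}}\limits_{i \in I} \hat{S}_i := \lambda b \in B.\; \bigcup_{i \in I} \hat{S}_i(b)$$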
A drawback of the Block-CS abstraction is the loss of
precision at joins as seen in Figure 2. Using sets of sets,
Exact-CS0 keeps all conflict sets side by side without losing
any precision.
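The update function (8) and the join can be sketched in a few lines of Python. In this sketch, an abstract trace is represented as a dictionary from blocks to frozensets of frozensets; this representation and the function names are assumptions made for illustration, not the paper's implementation (which uses ZDDs, see Section IV).

```python
def update0(S, b):
    """Abstract update (8): reset b's conflict sets, add b to all others."""
    out = {}
    for blk, sets in S.items():
        if blk == b:
            out[blk] = frozenset({frozenset({b})})
        else:
            out[blk] = frozenset(s | {b} for s in sets)
    return out

def join0(states, blocks):
    """Join: block-wise union of the sets of conflict sets."""
    return {blk: frozenset().union(*(S[blk] for S in states)) for blk in blocks}

blocks = ("v", "w", "x")
bottom = {blk: frozenset() for blk in blocks}   # no block accessed yet

after_v = update0(bottom, "v")
after_w = update0(after_v, "w")                 # branch accessing w
after_x = update0(after_v, "x")                 # branch accessing x
joined = join0([after_w, after_x], blocks)
```

After the join, `joined["v"]` contains the two conflict sets {v, w} and {v, x} side by side, rather than their union {v, w, x}.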
Next, we define the abstraction function α0 that relates a set
of concrete traces to an abstract trace. For singleton sets, i. e.,
sets containing a single concrete trace, the abstraction function
is recursively defined as follows:
α0({}) := λb ∈ B. ∅
α0({τ ◦ bn−1}) := ûpdate0(α0({τ}), bn−1)
Abstractions of concrete traces are recursively constructed
using the abstract update function with the accessed memory
blocks. The base case, i. e., the abstraction of the empty trace ,
assigns an empty set of conflict sets to each block b since no
memory block has been accessed.
The abstraction α0 is lifted to arbitrary sets in the usual way:
α0({ti | i ∈ I ⊆ N0}) :=
⊔
0
i∈I
α0({ti})
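To make the abstraction function concrete, the following Python sketch computes α0 for finite sets of traces and checks property (1) of Theorem III.2 on one small, hand-picked set of traces. The dictionary representation, the function names, and the sample traces are assumptions made for this example; a single check like this illustrates the property but does not prove it.

```python
def update0(S, b):
    """Abstract update of Exact-CS0 on a dict: block -> set of conflict sets."""
    return {blk: frozenset({frozenset({b})}) if blk == b
            else frozenset(s | {b} for s in sets)
            for blk, sets in S.items()}

def join0(states, blocks):
    """Block-wise union; the join of no states is the bottom element."""
    return {blk: frozenset().union(*(S[blk] for S in states)) for blk in blocks}

def alpha0(traces, blocks):
    """Abstraction of a finite set of traces (each trace is a tuple of blocks)."""
    singles = []
    for trace in traces:
        S = {blk: frozenset() for blk in blocks}   # abstraction of the empty trace
        for b in trace:
            S = update0(S, b)
        singles.append(S)
    return join0(singles, blocks)

def update(traces, b):
    """Concrete update: append the access to b to every trace."""
    return {trace + (b,) for trace in traces}

blocks = ("v", "w", "x")
T = {("v",), ("v", "w"), ("x",)}

lhs = alpha0(update(T, "w"), blocks)       # abstract after the concrete update
rhs = update0(alpha0(T, blocks), "w")      # update the abstraction directly
```

Both sides map v to {{v, w}}, w to {{w}}, and x to {{x, w}}.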
[Figure 5: control-flow graph with locations l0 to l4 and accesses to blocks v, w, x, and y.]
Fig. 5: Example illustrating Exact-CS.
A memory block b is classified as persistent if all conflict
sets have cardinality less than or equal to the associativity k.
This means that at most k − 1 distinct other blocks have been
accessed since the last access to the block itself:
$$\widehat{\mathit{persistent}}_0\bigl(\hat{S}, b\bigr) := \max\{|s| \mid s \in \hat{S}(b)\} \le k$$
Note that we assume the maximum over an empty set to be
zero. As a consequence, a block that has never been accessed
is classified as persistent.
See Figure 5 for an example of a control-flow graph and the
corresponding Exact-CS cache trace abstractions in Table I.
Note that only the analysis information for memory block v
is shown as it highlights best the differences between the
different exact analyses.
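The Exact-CS0 column of Table I can be reproduced with a small fixpoint iteration. The Python sketch below is illustrative only: the edge list is an assumed reconstruction of the control-flow graph in Figure 5, chosen so that it generates the reachable traces listed in Table I, and empty edges are encoded as `None`. All function names are hypothetical.

```python
def update0(S, b):
    return {blk: frozenset({frozenset({b})}) if blk == b
            else frozenset(s | {b} for s in sets)
            for blk, sets in S.items()}

def join0(states, blocks):
    return {blk: frozenset().union(*(S[blk] for S in states)) for blk in blocks}

def analyze(edges, locations, blocks):
    """Iterate the abstract equations from bottom until nothing changes."""
    R = {loc: {blk: frozenset() for blk in blocks} for loc in locations}
    changed = True
    while changed:
        changed = False
        for w in locations:
            # Empty edges (block None) propagate the state unchanged.
            incoming = [R[v] if b is None else update0(R[v], b)
                        for v, b, dst in edges if dst == w]
            new = join0(incoming, blocks)
            if new != R[w]:
                R[w], changed = new, True
    return R

def persistent0(S, b, k):
    """Classification: the maximum over an empty set is taken to be zero."""
    return max((len(s) for s in S[b]), default=0) <= k

# Assumed structure of Figure 5 (a loop over w or x, then optionally y).
edges = [("l0", "v", "l1"), ("l1", None, "l2"),
         ("l2", "w", "l3"), ("l2", "x", "l3"), ("l3", None, "l2"),
         ("l3", None, "l4"), ("l3", "y", "l4")]
R = analyze(edges, ["l0", "l1", "l2", "l3", "l4"], ("v", "w", "x", "y"))
```

Under these assumptions, `R["l4"]["v"]` contains the six conflict sets listed for l4 in Table I, and `persistent0(R["l4"], "v", 3)` is false because of the conflict set {v, w, x, y}.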
Theorem III.3 (Exactness of Exact-CS0 ). The persistence
analysis Exact-CS0 is exact in the sense of Definition III.1.
Due to space limitations we omit the proofs in this paper,
but they can be found in the appendix of this technical report.
B. Exploiting Monotonicity
While Exact-CS0 is exact, the number of conflict sets to
track for a given block can be high. To classify a block as
persistent only the largest conflict set of each block at each
program location is relevant. It would be tempting to only
keep the largest sets, but, unfortunately, this would yield an
incorrect analysis, as these largest sets could not be correctly
maintained across updates and joins.
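A small counterexample shows why keeping only the largest conflict sets is incorrect. In the following Python sketch, the two conflict sets of a block v and the subsequent accesses are hypothetical values chosen for illustration; `keep_largest` is the unsound shortcut, not part of the analysis.

```python
def access(sets, b):
    """Add the accessed block b to every conflict set (b is not v itself)."""
    return {s | {b} for s in sets}

def keep_largest(sets):
    """Unsound shortcut: keep only the conflict sets of maximal cardinality."""
    top = max(len(s) for s in sets)
    return {s for s in sets if len(s) == top}

exact = {frozenset("vab"), frozenset("vc")}   # hypothetical conflict sets of v
naive = keep_largest(exact)                   # drops the smaller set {v, c}

for b in "ab":                                # subsequent accesses to a and b
    exact = access(exact, b)
    naive = keep_largest(access(naive, b))

exact_max = max(len(s) for s in exact)        # {v, a, b, c} has 4 elements
naive_max = max(len(s) for s in naive)        # {v, a, b} has 3 elements
```

For k = 3, the shortcut would classify v as persistent although the discarded conflict set {v, c} grows to {v, a, b, c}, which exceeds the associativity.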
It is, however, safe to remove conflict sets that are com-
pletely subsumed by others. This is because the update func-
tion ûpdate0 is monotonic: For example, consider the set of
conflict sets {{a, b, c}, {a, b}}. No matter which blocks are
accessed and used to update these conflict sets, the first set
{a, b, c} and its successors will always subsume the second
set {a, b} and its successors. Removing such subsumed sets
reduces the computational effort of the analysis, in particular
the memory consumption, without any loss of precision. The
resulting abstraction Exact-CS↑ is defined relative to the
Exact-CS0 abstraction as shown in Figure 4.
The abstraction function α↑ takes an abstract trace and
removes all conflict sets that are subsumed by larger sets:
$$\alpha_\uparrow\bigl(\hat{S}\bigr) := \lambda b \in B.\; \mathit{maxSet}(\hat{S}(b))$$
where maxSet is defined as follows:
$$\mathit{maxSet}(S) := \{s \in S \mid \neg\exists s' \in S : s \subsetneq s'\}$$
In order to maintain a minimal set of conflict sets, the join
operator takes the maximal conflict sets of the union of the
sets of conflict sets to be joined:
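$$\forall I \subseteq \mathbb{N}_0:\quad \mathop{{\bigsqcup}_{\uparrow}}\limits_{i \in I} \hat{S}_i := \lambda b \in B.\; \mathit{maxSet}\Bigl(\bigcup_{i \in I} \hat{S}_i(b)\Bigr) \tag{9}$$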
Similarly, the abstract update function removes all non-maximal
conflict sets after adding b to the conflict sets:
$$\widehat{\mathit{update}}_\uparrow\bigl(\hat{S}, b\bigr) := \lambda b' \in B.\; \mathit{maxSet}\bigl(\widehat{\mathit{update}}_0(\hat{S}, b)(b')\bigr)$$
The classification function can be reused without modification.
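The pruning step can be sketched as follows in Python. The representation of sets of conflict sets as frozensets of frozensets and the function names are assumptions made for this example; the check at the end illustrates, for the example from the text, that pruning before an update yields the same result as pruning afterwards.

```python
def max_set(sets):
    """maxSet: drop every set that is a proper subset of another set."""
    sets = frozenset(sets)
    return frozenset(s for s in sets if not any(s < t for t in sets))

def update0(S, b):
    return {blk: frozenset({frozenset({b})}) if blk == b
            else frozenset(s | {b} for s in sets)
            for blk, sets in S.items()}

def update_up(S, b):
    """Update of Exact-CS↑: update as in Exact-CS0, then prune."""
    return {blk: max_set(sets) for blk, sets in update0(S, b).items()}

def join_up(states, blocks):
    """Join (9): maximal sets of the block-wise union."""
    return {blk: max_set(frozenset().union(*(S[blk] for S in states)))
            for blk in blocks}

full = frozenset({frozenset("abc"), frozenset("ab")})
pruned = max_set(full)                        # only {a, b, c} remains

# Accessing some other block d: prune-then-update equals update-then-prune.
after_full = max_set(s | {"d"} for s in full)
after_pruned = max_set(s | {"d"} for s in pruned)
```

Incomparable sets such as {v, a, b} and {v, c} are both kept by `max_set`, since neither subsumes the other.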
Theorem III.4 (Exactness of Exact-CS↑). The persistence
analysis Exact-CS↑ is exact in the sense of Definition III.1.
C. Limit at Associativity
The abstraction described above allows the individual
conflict sets to grow arbitrarily large. However, the persistence
classification only checks for the existence of a single conflict set
with cardinality greater than the associativity k. Thus, there
is no need to distinguish conflict sets containing more than k
elements.
Instead, all conflict sets with a cardinality larger than k
can be collapsed into a single representative B, i. e., the set
of all memory blocks. Note that all conflict sets are trivially
subsumed by B. This fact and the monotonicity property from
Section III-B allow us to replace the whole set of conflict sets
by {B} in the presence of an oversized conflict set. This
reduces the computational effort of the analysis even further,
in particular its memory consumption, without any loss of
precision. The resulting abstraction Exact-CS≤k is defined
relative to the Exact-CS↑ abstraction as shown in Figure 4.
The abstraction function α≤k takes an Exact-CS↑ abstract
trace and eliminates conflict sets with cardinality larger than k:
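$$\alpha_{\le k}\bigl(\hat{S}\bigr) := \lambda b \in B.\; \mathit{limit}(\hat{S}(b))$$
where
$$\mathit{limit}(S) := \begin{cases} \{B\} & \text{if } \exists s \in S : |s| > k \\ S & \text{otherwise} \end{cases}$$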
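The update function ûpdate≤k is consequently defined as:
$$\widehat{\mathit{update}}_{\le k}\bigl(\hat{S}, b\bigr) := \lambda b' \in B.\; \mathit{limit}\bigl(\widehat{\mathit{update}}_\uparrow(\hat{S}, b)(b')\bigr)$$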
The join operator and the classification function can be reused
as they do not increase the size of any conflict set.
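The collapse at the associativity can be sketched as follows in Python. The explicit `universe` argument (standing for B), the dictionary representation, and the function names are assumptions made for this example.

```python
def max_set(sets):
    sets = frozenset(sets)
    return frozenset(s for s in sets if not any(s < t for t in sets))

def limit(sets, k, universe):
    """Collapse to {B} as soon as one conflict set has more than k elements."""
    if any(len(s) > k for s in sets):
        return frozenset({frozenset(universe)})
    return frozenset(sets)

def update_leq_k(S, b, k, universe):
    """Update of Exact-CS≤k: add b, prune subsumed sets, then limit."""
    out = {}
    for blk, sets in S.items():
        if blk == b:
            new = {frozenset({b})}
        else:
            new = {s | {b} for s in sets}
        out[blk] = limit(max_set(new), k, universe)
    return out

B = frozenset("vwxyz")                     # hypothetical universe of blocks
k = 3
S = {blk: frozenset() for blk in B}
S["v"] = frozenset({frozenset("vwx")})     # v's only conflict set has size k

after_y = update_leq_k(S, "y", k, B)       # {v, w, x, y} exceeds k
after_z = update_leq_k(after_y, "z", k, B) # {B} is stable under further updates
```

Once the representative {B} has been reached, further updates leave it unchanged, because adding a block to B yields B again.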
Theorem III.5 (Exactness of Exact-CS≤k ). The persistence
analysis Exact-CS≤k is exact in the sense of Definition III.1.
D. Example of Superiority over Prior Persistence Analyses
Figure 6 contains an example control-flow graph on which
all existing persistence analyses fail. In the example, block v
is clearly persistent in a fully-associative cache of size k = 3:
at most two distinct blocks are accessed between any two
accesses to v.
TABLE I: Concrete traces as regular expressions and persistence abstractions for Figure 5 (with k = 3).
| Location | Reachable access traces | Exact-CS0 | Exact-CS↑ | Exact-CS≤k |
|---|---|---|---|---|
| l0 | ε | v ↦ ∅ | v ↦ ∅ | v ↦ ∅ |
| l1 | v | v ↦ {{v}} | v ↦ {{v}} | v ↦ {{v}} |
| l2 | v[w\|x]* | v ↦ {{v}, {v, w}, {v, x}, {v, w, x}} | v ↦ {{v, w, x}} | v ↦ {{v, w, x}} |
| l3 | v[w\|x]+ | v ↦ {{v, w}, {v, x}, {v, w, x}} | v ↦ {{v, w, x}} | v ↦ {{v, w, x}} |
| l4 | v[w\|x]+, v[w\|x]+y | v ↦ {{v, w}, {v, x}, {v, w, x}, {v, w, y}, {v, x, y}, {v, w, x, y}} | v ↦ {{v, w, x, y}} | v ↦ {B} |
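[Figure 6: control-flow graph with accesses to blocks v, w, x, and y, including an inner loop, annotated with the question "persistent?".]
Fig. 6: Example illustrating exact analysis outperforming previous
persistence analyses. Fully-associative cache with k = 3.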
The most precise known persistence analysis is the combination
C-Must × Must × Block-CS of the traditional must
analysis, the C-Must analysis, and the Block-CS analysis [17].
Since none of x, y, and w is guaranteed to be accessed, the
must analysis does not gain any information about them. The
Block-CS analysis fails to classify v as persistent because
there are three distinct blocks that may conflict with v in
between two consecutive accesses to v. As a consequence,
neither of the two analyses is able to support the C-Must
analysis. On its own, the C-Must analysis ages v upon each
access distinct from v. Since more than k accesses might
happen between two accesses to v (note the inner loop),
the C-Must analysis is also of no use.
IV. EXACT CACHE PERSISTENCE ANALYSIS:
IMPLEMENTATION
A. Implementation using Binary Decision Diagrams
The implementation of the Exact-CS≤k analysis needs to
maintain a set of conflict sets for each memory block in the
program at every program point. Maintaining separate explicit
representations of each set of conflict sets for each memory
block and at each program point would likely be highly
inefficient. An efficient implementation should implicitly (and
thus hopefully more compactly) represent sets of conflict sets
and it should share as much information as possible between
different memory blocks and program points.
Observe that sets of conflict sets can be represented using
Boolean functions: Let the set of memory blocks be
$\mathcal{B} = \{b_1, \dots, b_n\}$ and let $\mathbb{B} = \{0, 1\}$ denote the
Boolean values. Then a Boolean valuation
$v = (v_1, \dots, v_n) \in \mathbb{B}^n$ represents a set $\mathit{set}(v)$ of
memory blocks as follows:
$$\mathit{set}(v_1, \dots, v_n) = \{b_i \mid 1 \le i \le n \wedge v_i = 1\}$$
Building on this, a Boolean function $f : \mathbb{B}^n \to \mathbb{B}$
represents a set of conflict sets $\mathit{CS}(f)$ as follows:
$$\mathit{CS}(f) := \{\mathit{set}(v) \mid v \in \mathbb{B}^n \wedge f(v) = 1\}$$
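To make the encoding concrete, the following sketch enumerates CS(f) explicitly for a tiny universe. It is only an illustration under stated assumptions: the names `set_of` and `conflict_sets` and the example function `f` are introduced here, and a ZDD-based implementation would represent f symbolically rather than enumerate all valuations.

```python
from itertools import product

def set_of(valuation, blocks):
    """set(v): the blocks b_i whose bit v_i equals 1."""
    return frozenset(b for b, bit in zip(blocks, valuation) if bit == 1)

def conflict_sets(f, blocks):
    """CS(f): the block sets of all valuations that satisfy f."""
    n = len(blocks)
    return {set_of(v, blocks) for v in product((0, 1), repeat=n) if f(v)}

blocks = ["v", "w", "x"]

# Hypothetical example function: v is present and at least one of w, x is.
def f(valuation):
    return valuation[0] == 1 and (valuation[1] == 1 or valuation[2] == 1)

cs = conflict_sets(f, blocks)
```

For this choice of f, `cs` contains the three conflict sets {v, w}, {v, x}, and {v, w, x}.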
    int arr[10];
    int sum = 0;
    for (int i = 0; i < 100; ++i)
        sum += arr[read_sensor()];  /* index depends on program input */
Fig. 7: Input-dependent data accesses.
Binary decision diagrams [23] are a class of data structures
that have been designed to efficiently represent and
manipulate Boolean functions. They compactly represent
individual Boolean functions and share information between the
representations of different Boolean functions stored in the
same data structure. Our implementation uses zero-suppressed
binary decision diagrams (ZDDs) [24], [25], a variant of binary
decision diagrams optimized to represent sets of sparse sets.
This is beneficial in our setting as the sets of conflict sets used
in our trace abstractions are typically sparse with respect to
the universe B; no conflict set in Exact-CS≤k is greater than
the associativity k.
To perform operations on ZDDs, the Colorado University
Decision Diagram (CUDD) library [26] (footnote 2) is used in
combination with the EXTRA library (footnote 3). The latter offers
extended ZDD procedures that facilitate the manipulation of ZDDs. More
precisely, the following operators of the EXTRA library are
used in the persistence analysis where S and T are sets
containing sets:
1) maxUnion calculates the maximum of the union of sets S
and T :
maxUnion(S, T ) := maxSet(S ∪ T )
This function is used to compute the abstract joins
exploiting monotonicity in (9).
2) maxDotProduct takes all pair-wise unions of subsets
from S and T and computes the maximum of this set,
i. e., it drops subsumed elements in the result:
maxDotProduct(S, T ) :=
maxSet{s ∪ t | s ∈ S ∧ t ∈ T}
This function is involved in the abstract update (8) to add
an accessed block b to every conflict set in a set. More
precisely, the following property is exploited:
maxDotProduct(S, {{b}}) = maxSet{s ∪ {b} | s ∈ S}
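The semantics of the two operators can be sketched on explicit sets. The code below is not the EXTRA library interface: `max_set`, `max_union`, and `max_dot_product` are illustrative Python names, and sets of sets are modeled as collections of frozensets.

```python
def max_set(sets):
    """Keep only elements that are not a proper subset of another element."""
    sets = {frozenset(s) for s in sets}
    return {s for s in sets if not any(s < t for t in sets)}

def max_union(S, T):
    """maxSet(S ∪ T)."""
    return max_set(list(S) + list(T))

def max_dot_product(S, T):
    """maxSet of all pair-wise unions s ∪ t with s in S and t in T."""
    return max_set(frozenset(s) | frozenset(t) for s in S for t in T)

S = {frozenset("v"), frozenset("vw")}
T = {frozenset("vx")}
joined = max_union(S, T)                          # {v} is subsumed by {v, w}
accessed = max_dot_product(S, {frozenset("b")})   # add block b to every set
```

Here `joined` keeps {v, w} and {v, x}, and `accessed` collapses to the single maximal set {v, w, b}.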
Footnote 2: Available at https://github.com/ivmai/cudd
Footnote 3: Available at https://people.eecs.berkeley.edu/~alanmi/research/extra/
B. Extension to Data Caches
Until now, we have assumed that only a single known
memory block is accessed on each control-flow edge. This
is valid in the context of an instruction cache analysis. For
data cache analysis, however, we have to deal with uncertainty
about which memory blocks might be accessed. The memory
blocks a single load or store instruction accesses might
depend on dynamic aspects, e. g., the program input or a loop
iteration counter. See Figure 7 for an example of a memory
access depending on program input. Array arr is small
compared to the number of loop iterations in the example.
If arr completely fits into the cache, an exact persistence
analysis has to classify it as persistent.
Instead of cache updates with a single accessed memory
block, the update function for data caches has to consider
a set of potentially accessed memory blocks $B \subseteq \mathcal{B}$. The
straightforward solution is to lift the update function for single
blocks to sets of blocks by performing an update for each
block $b \in B$ individually and then joining the results:
$$\widehat{\mathit{update}}(\hat{S}, B) := \bigsqcup_{b \in B} \widehat{\mathit{update}}(\hat{S}, b) \qquad (10)$$
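A minimal sketch of this lifting on explicit sets follows. It assumes the single-block update resets the accessed block to {{b}} and adds b to every conflict set of the other blocks, and that the join is the point-wise maximal union; `update_block`, `join`, and `update_blocks` are names chosen here, and the limit to associativity k is left out for brevity.

```python
def max_set(sets):
    """Keep only elements that are not a proper subset of another element."""
    sets = {frozenset(s) for s in sets}
    return {s for s in sets if not any(s < t for t in sets)}

def update_block(state, b):
    """Abstract update for one known block b (state: block -> set of conflict sets)."""
    new = {blk: max_set(s | {b} for s in css) for blk, css in state.items()}
    new[b] = {frozenset({b})}          # information for b itself is reset
    return new

def join(s1, s2):
    """Point-wise maximal union of two abstract states."""
    return {blk: max_set(list(s1.get(blk, ())) + list(s2.get(blk, ())))
            for blk in set(s1) | set(s2)}

def update_blocks(state, candidates):
    """Lifted update: one update per candidate block, then join (non-empty candidates)."""
    result = {}
    for b in candidates:
        result = join(result, update_block(state, b))
    return result

state = {"v": {frozenset("v")}}
out = update_blocks(state, ["w", "x"])
```

The cost grows linearly with the number of candidate blocks, which is what makes this lifting expensive for large candidate sets.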
Implementation: Prior to cache analysis, a preprocessing
analysis determines a set $B \subseteq \mathcal{B}$ of potentially accessed
memory blocks for each memory instruction in the program.
Such sets can be large, e. g., when a large array is accessed.
The preprocessing analysis might even fail to derive useful
information for some accesses, resorting to the set $\mathcal{B}$ of all
memory blocks. In such cases, performing a cache update
according to (10) can be computationally expensive or infeasible.
To efficiently cope with large sets $B \subseteq \mathcal{B}$, we implemented
a slightly different analysis that maps memory blocks to a list
of k sets of conflict sets, i. e.,
$$\hat{C} := \mathcal{B} \to \underbrace{\mathcal{P}(\mathcal{P}(\mathcal{B})) \times \dots \times \mathcal{P}(\mathcal{P}(\mathcal{B}))}_{k \text{ times}}$$
The list entry at index $i \in \{0, \dots, k-1\}$ corresponds to the set
of conflict sets where each conflict set implicitly contains
an additional i distinct unknown anonymous memory blocks.
Upon classification, these anonymous blocks have to be
accounted for by adding them to the conflict set cardinalities:
$$\widehat{\mathit{persistent}}(\hat{S}, b) := \forall i : \max\{|s| \mid s \in \hat{S}_i(b)\} + i \le k$$
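At updates with a single memory block b, the information
for b is reset to the list whose first entry contains only the
conflict set {b} and whose remaining entries are empty.
For all other memory blocks, the original update is performed
for each entry in the list:
$$\widehat{\mathit{update}}(\hat{S}, b) := \lambda b'. \begin{cases} [\{\{b\}\}, \emptyset, \dots, \emptyset] & \text{if } b' = b \\ [\widehat{\mathit{update}}(\hat{S}_i, b)(b') \mid 0 \le i < k] & \text{otherwise} \end{cases}$$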
In case of a completely unknown access, all entries are shifted
by one position to the right. If the entry at the rightmost
position was non-empty, the corresponding block
is not guaranteed to be cached any more, which is indicated
by $\{\mathcal{B}\}$ in the first position:
$$\widehat{\mathit{update}}(\hat{S}) := \lambda b'. \begin{cases} [\emptyset, \hat{S}_0(b'), \dots, \hat{S}_{k-2}(b')] & \text{if } \hat{S}_{k-1}(b') = \emptyset \\ [\{\mathcal{B}\}, \emptyset, \dots, \emptyset] & \text{otherwise} \end{cases}$$
[Figure 8 content: snippet of an abstract execution graph with nodes µ1 to µ6 and edges annotated "b miss" and "b hit". Integer linear program: variable x_{µi→µj} is the execution frequency of edge µi → µj; persistence constraint: x_{µ1→µ2} + x_{µ4→µ5} ≤ 1.]
Fig. 8: Abstract execution graph snippet and persistence path
analysis constraint.
For sets $B \subseteq \mathcal{B}$ of potentially accessed blocks that contain
more blocks than the associativity k, the above update for unknown
accesses can be performed without any loss of precision. For smaller
sets $B \subseteq \mathcal{B}$, the update is performed as defined by Equation 10.
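The list-based domain can be sketched as follows. This is an illustration under assumptions, not the implementation described here: the state maps each block to a Python list of k sets of frozensets, the string `TOP` stands in for the conflict set of all blocks, and the maxSet and limit normalizations are omitted because they do not change the classification in this small example.

```python
import math

TOP = "TOP"  # placeholder for the conflict set containing all blocks

def size(conflict_set):
    return math.inf if conflict_set == TOP else len(conflict_set)

def add_block(conflict_set, b):
    return TOP if conflict_set == TOP else conflict_set | {b}

def update_known(state, b, k):
    """Access to a known block b: reset b, add b to all other conflict sets."""
    new = {blk: [{add_block(s, b) for s in entry} for entry in entries]
           for blk, entries in state.items()}
    new[b] = [{frozenset({b})}] + [set() for _ in range(k - 1)]
    return new

def update_unknown(state, k):
    """Completely unknown access: shift all entries one position to the right."""
    new = {}
    for blk, entries in state.items():
        if not entries[k - 1]:
            new[blk] = [set()] + entries[:k - 1]
        else:
            new[blk] = [{TOP}] + [set() for _ in range(k - 1)]
    return new

def persistent(state, b, k):
    """Every conflict set at index i, plus i anonymous blocks, must fit into k ways."""
    return all(size(s) + i <= k
               for i, entry in enumerate(state[b]) for s in entry)

k = 2
s0 = update_known({}, "v", k)     # v -> [{{v}}, {}]
s1 = update_unknown(s0, k)        # v -> [{}, {{v}}]
s2 = update_known(s1, "w", k)     # v -> [{}, {{v, w}}]
s3 = update_unknown(s1, k)        # v -> [{TOP}, {}]
```

With associativity 2, block v survives one unknown access (`s1`), but neither an additional access to w (`s2`) nor a second unknown access (`s3`).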
C. Validation
The two features described in this section are used for
testing purposes and by default turned off during analysis.
Their intention is to validate our implementation and spot
bugs. One of the advantages of having an exact analysis is
that it can be used to check the correctness of other analyses
to some extent, as the exact analysis is always at least as
precise as the inexact ones. We implemented a flag
that allows us to run the exact persistence analysis alongside
any other persistence analysis. After every operation, e. g.,
updates or joins, the sets of persistent memory blocks are
compared. The set computed by the exact analysis must always
include the other set due to its exactness. This by no means proves
the implementation of the exact analysis to be correct, but it gives
some indication that the analyses are computing reasonable results.
Another critical aspect of the implementation is the manipulation
of ZDDs to represent sets of conflict sets. Therefore, the
analysis was extended with an explicit representation that
uses the set containers of the C++ Standard Template Library. Again,
after each operation, the ZDDs and the explicit set
representations are checked for equality to see whether the ZDD
library is working as intended.
D. Integration with WCET Analysis
The information on the persistence of memory blocks is
used in worst-case execution time (WCET) analysis to obtain
more precise upper bounds on a program’s execution time.
WCET analysis is commonly performed in two phases: mi-
croarchitectural analysis and path analysis. Microarchitectural
analysis constructs an abstract execution graph that represents
the execution of a program on a given hardware platform at
the granularity of processor cycles. Nodes correspond to the
abstract (microarchitectural) state of the hardware platform,
including information about the caches, and edges represent
the actual execution. Figure 8 illustrates an example graph.
The path analysis determines the longest path through the
abstract execution graph using an integer linear program [27].
Since persistence information is a property of execution traces,
it is taken into account in the path analysis. To this end, for
each persistent memory block b ∈ B in the program, a linear
constraint is used to limit the number of cache misses of b to
one. An example constraint is shown in Figure 8.
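The constraint of Figure 8 can be written in a general form. In the following sketch, $M_b$ is a symbol introduced here (an assumption, not notation from the analysis) for the set of edges of the abstract execution graph on which b misses the cache; the second line instantiates it with the two miss edges of Figure 8.

```latex
\sum_{e \in M_b} x_e \le 1
\qquad\text{with } M_b = \{\mu_1 \to \mu_2,\ \mu_4 \to \mu_5\}:
\quad x_{\mu_1 \to \mu_2} + x_{\mu_4 \to \mu_5} \le 1
```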
V. RELATED WORK
For LRU caches, persistence analysis is strongly related
to must analysis [8]: A memory block b must be cached at
location v, if b’s age is at most k, the associativity of the cache,
on all traces ending in location v. This is the case if and only
if block b is persistent on all traces ending in location v and
block b has previously been accessed on all these traces. As a
consequence, must analysis could be solved by a persistence
analysis running alongside a “definitely-accessed” analysis
that determines whether a block is guaranteed to have been
accessed before reaching a given program location. Exploring
this relation further is future work.
The development of the exact persistence analyses in this
paper has been inspired by recent work of Touzeau et al. [28],
in which they develop exact must and may analyses for LRU
caches. Similarly to our persistence analysis, their implemen-
tation employs ZDDs to compactly represent sets of sets of
memory blocks and their abstraction exploits the monotonicity
of LRU replacement to reduce the number of sets to track.
Besides solving a related but different problem, our analysis
adds support for the efficient analysis of data caches as
discussed in Section IV and unlike [28] we evaluate the
analysis in the context of WCET analysis.
The presence of timing anomalies [29], [30] can often be
traced back to the non-monotonicity of a system’s dynamics.
In contrast to LRU, other cache replacement policies such as
FIFO, NMRU, and PLRU do not behave monotonically and
have been found to exhibit timing anomalies [31], [32]. For
that reason it seems unlikely that our analysis approach can be
extended to such policies. The strictly in-order core [33] is a
pipelined processor core that has been designed to be free of
timing anomalies by eliminating dependencies in the pipeline
that induce non-monotonicity.
A variety of cache persistence analyses has been proposed in
the literature [3]–[16] with varying degrees of precision. Many
persistence analyses can be seen as combinations of analyses
from a small set of basic persistence analyses. We have already
discussed two of these basic analyses in Section II: Block-CS
and C-Must. Another basic analysis, called Global-CS, maintains
a superset of the conflict sets of all memory blocks, rather
than maintaining separate information for each memory block,
as Block -CS does. This simple approach has been particularly
popular in the literature [3]–[6], [14], [15] and it constitutes
the most efficient known analysis.
Reineke [17] has analyzed the landscape of persistence
analyses and has shown how the different analyses relate
to each other in terms of precision, and how they can be
explained as combinations of basic persistence analyses. In
Figure 9, we reproduce the landscape of persistence analyses
from [17]. In the diagram, a node labeled A×B denotes an
analysis that is obtained by the combination of the basic
analyses A and B. Further, if analysis A is provably more
precise than analysis B, then A and B are connected by an
edge and A is higher up in the diagram.
[Figure 9 content, nodes of the diagram with their references:
  Exact-CS [this paper]
  C-Must×Must×Block-CS [17]
  C-Must×Must×C-May [12], [13]
  C-Must×Block-CS [14], [16]
  C-Must×Must [17]
  C-Must×C-May [10], [14]–[16]
  C-Must [7], [8]
  Block-CS [11], [14]
  C-May [17]
  Global-CS [3]–[6], [14], [15]
  (Must) [7], [8]]
Fig. 9: Landscape of persistence analyses adapted from [17].
For the experimental evaluation in Section VI, we chose
to evaluate the three basic analyses Global -CS , Block -CS ,
C -Must , and the most precise known combination of analyses
C -Must×Must×Block -CS as explained below in Section VI.
These four analyses and the exact analysis developed in this
paper are framed in Figure 9.
VI. EXPERIMENTAL EVALUATION
In this evaluation, we compare the performance of the
exact analysis Exact-CS in terms of the calculated WCET
bound as well as the run time and memory consumption with
the four existing persistence analyses Global -CS , Block -CS ,
C -Must , and C -Must×Must×Block -CS . While Global -CS
represents the most efficient persistence analysis, C -Must×
Must×Block -CS provides the most precise results among the
previously known analyses. In addition, we chose Block -CS
and C -Must as they represent two basic but complementary
ideas to approximate conflict sets as described in Section II-D.
The results below also indicate that a comparison with the
various possible other combinations (Figure 9) would not have
revealed major further insights.
A. Experimental Setup
We implemented the exact persistence analysis as well as
the previously known analyses within the WCET analysis
framework LLVMTA, which is described in [34]. LLVMTA
analyses the persistence of memory blocks at the level of
scopes instead of the whole program. Precisely, every loop
at any nesting level in the program is considered a separate
persistence scope.
To evaluate the implementation of the exact analysis, we use
the benchmarks of the TACLeBench suite [35]. TACLeBench
consists of several open-source C programs commonly used
to evaluate timing analysis.
[Figure 10 content: chart with a horizontal axis ranging from 0.6 to 1.0 (WCET ratio); benchmarks: susan, powerwindow, pm, ndes, minver, lift, insertsort, huff_dec, h264_dec, epic, cjpeg_wrbmp, adpcm_enc; series: C-Must, Global-CS, Block-CS, C-Must×Must×Block-CS, Exact-CS.]
Fig. 10: WCET ratios of the persistence analyses compared
with performing no persistence analysis at all.
In the following, results and measurements are shown
for TACLeBench compiled without compiler optimizations
enabled. Software for safety-critical embedded systems is
often compiled without optimizations to ease the subsequent
verification of the produced binary w. r. t. the underlying
high-level model [36]. We also conducted experiments with
enabled compiler optimizations. Since compiler optimizations
generally reduce the number of memory operations, optimized
binaries are less affected by the choice of cache persistence analysis
and thus show slightly smaller differences among the analyses.
For the sake of completeness, all the numbers are made
available in the appendix of this technical report.
The analyzed cache configuration has a total size of 4 kB
and consists of 32 cache sets, 8 ways, and cache lines holding
16-byte-sized memory blocks. The same configuration is used by [28],
taking into account the relatively small size of the benchmarks.
Accessing a single word in main memory takes ten processor
cycles with an additional cycle per consecutive word accessed.
In total, the load of a cache line takes 13 cycles which is
a realistic value for main memories such as the Automotive
DRAM MT46V16M16 [37] clocked at 100MHz. An addi-
tional evaluation of the exact analysis with a higher latency of
100 cycles showed no interesting differences.
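The stated numbers can be cross-checked with a short calculation. The word size of 4 bytes is an assumption made here; it is not stated explicitly, but it is the value for which the per-word latencies add up to the reported 13 cycles.

```python
# Cache geometry as stated in the setup.
cache_sets = 32
ways = 8
line_bytes = 16
total_bytes = cache_sets * ways * line_bytes

# Assumption: a memory word has 4 bytes, so a line holds 4 words.
word_bytes = 4
words_per_line = line_bytes // word_bytes

# Ten cycles for the first word, one more for each consecutive word.
first_word_cycles = 10
extra_cycles_per_word = 1
line_load_cycles = first_word_cycles + (words_per_line - 1) * extra_cycles_per_word
```

Under this assumption the total size is 4096 bytes (4 kB) and a cache line load takes 13 cycles.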
All measurements have been performed on the same Linux
machine, equipped with an Intel Core™ i5-4590 CPU (running
at 3.30GHz) and 20GB of main memory.
B. Analysis Precision
This section answers how the exact, ZDD-based analysis
compares with inexact alternatives in terms of the calculated
WCET bounds. Figure 10 shows the WCET bounds obtained
with the different persistence analyses for a selection of the
TACLeBench benchmarks. All bounds are normalized to the
WCET bounds obtained with no persistence analysis at all,
i. e., only running the traditional age-based must and may
analysis [7]. For instance, a value of 0.8 means that the
WCET bound is improved by 20% using the respective
persistence analysis compared with the WCET bound obtained
without any persistence analysis. A general observation is that
persistence analysis is often important to obtain precise WCET
bounds with improvements of more than 20% in several cases.
We omitted a total of 27 benchmarks from the figure, among
which 5 showed no differences between the analyses at all (as
in insertsort); for 10 omitted benchmarks, only C -Must
performed significantly worse than the rest (as in lift); and
12 benchmarks only showed negligible differences (as in pm).
An unabridged chart can be found in the appendix of this
technical report.
The chart shows that, in practice, the exact persistence
analysis is only slightly more precise than existing inexact
approaches. In most cases all persistence analyses perform
almost identically (in particular in most of the omitted bench-
marks), the only outlier being the C -Must analysis, e. g.,
in case of lift or ndes. Even the cheapest Global-CS
analysis usually performs similarly to the exact analysis, except
for cjpeg_wrbmp. There are seven benchmarks in which the
exact analysis obtains strictly better WCET bounds than all
other analyses: the benchmarks adpcm_enc (0.2%), epic
(0.0006%), h264_dec (0.04%), huff_dec (0.6%), ndes
(0.1%), pm (0.007%), and powerwindow (2.4%). The
numbers in parentheses indicate the marginal improvements
in terms of the computed WCET bound.
At its default settings, LLVMTA heuristically performs loop
peeling, i. e., it distinguishes the initial loop iteration from
the following loop iterations using trace partitioning [38].
The motivation for loop peeling is that the memory blocks
used in the loop body miss the cache in the loop’s first
iteration, while the same memory blocks hit the cache in
subsequent iterations. Without loop peeling, i. e., if the loop is
considered as a whole, regular must and may analysis cannot
classify such accesses. On the other hand, persistence analysis
can cover such cases without loop peeling. Thus, to stress
test the persistence analyses further, we performed another
evaluation in which we deactivated loop peeling. As expected,
without loop peeling, the average improvement of the exact
persistence analysis over the plain must and may analysis in
terms of WCET bounds rises to 52%—compared to 17%
with loop peeling. The exact analysis, however, is again on
par with the previously known persistence analyses showing
improvements in the range of only 0.2% to 0.3% in most
cases. An interesting insight from this experiment is that the
Global -CS analysis performs significantly worse relative to
the remaining analyses once loop peeling is deactivated. This
is likely due to the fact that Global-CS is the only analysis
whose analysis information for a block b is not “reset” upon
an access to b itself. Loop peeling conceals this limitation
because the must analysis is able to classify many accesses as
always hit for all but the first loop iteration.
The execution time of a program can be seen as the
sum of computation times and memory access latencies. The
memory access latencies can further be split into contributions
from data accesses and from instruction fetches. To focus
the evaluation on the number of cache misses, we performed
an analysis that separately bounds the maximum number of
instruction and data cache misses. This experiment reveals that
the observed differences between the persistence analyses are
mainly due to the data cache.
C. Analysis Cost
This section evaluates the analysis cost of the different
persistence analyses in terms of analysis run time and memory
consumption. The same set of persistence analyses as in the
previous section is evaluated. Additional experimental results
may be found in the appendix of this technical report.
The run time (Figure 11a) and memory consumption (Fig-
ure 11b) results are visualized in scatter plots. In each of
the scatter plots the horizontal axis corresponds to the value
obtained for the simplest and presumably cheapest analysis
Global -CS . These values are compared with the remaining
analyses on the vertical axis, respectively. A logarithmic scale
is used for all scatter plots because the measured numbers vary
greatly in size. Unsurprisingly, a general trend is that the exact
analysis is the most expensive one regarding both analysis
run time and memory consumption. However, even compared
with the cheapest analysis Global -CS , the memory overhead
of the exact analysis is less than 3× for all benchmarks and
the analysis time is at most 23× higher.
In Figure 12, the exact analysis is compared directly with
the most precise analysis from the literature, C -Must×Must×
Block-CS. The data shows that the exact analysis is on
average 2× slower and needs about 1.6× more memory, which
is indicated by the blue lines in the figure.
VII. PERSISTENCE ANALYSIS IS NP-COMPLETE
In this section, we show that persistence analysis is NP-
complete. The persistence problem is defined as follows: given
a control-flow graph G = (V,E, i), a designated memory
block b, and a cache size k, is there a path through G that
yields an access trace that results in more than one miss upon
accesses to block b in a fully-associative LRU cache of size k?
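For a single, given access trace the question is easy to answer by simulation; the hardness stems from the number of paths through G. The sketch below counts the misses to b on one trace, assuming an initially empty cache. The function names are chosen here for illustration.

```python
def misses_to(trace, b, k):
    """Misses to block b when trace runs on an initially empty,
    fully-associative LRU cache with k lines."""
    cache = []          # most recently used block first
    misses = 0
    for block in trace:
        if block in cache:
            cache.remove(block)        # hit: block moves to the front
        else:
            if block == b:
                misses += 1
            if len(cache) == k:
                cache.pop()            # evict the least recently used block
        cache.insert(0, block)
    return misses

def witnesses_non_persistence(trace, b, k):
    """A trace is a witness if it causes more than one miss to b."""
    return misses_to(trace, b, k) > 1
```

For example, with k = 3 the trace v, w, x, v causes a single miss to v, whereas v, w, x, y, v causes two and is therefore a witness that v is not persistent.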
Theorem VII.1. The persistence problem is NP-complete.
Proof. First, we show that the problem is indeed in NP. To
this end, we show that if there is a witness path π that shows
that block b is not persistent, then there is also a short witness
path π′, i. e., a witness path of length polynomial in the size
of the control-flow graph G. This proves that the problem is
in NP, because a non-deterministic algorithm could first guess
and then verify this witness path in polynomial time.
[Figure 11 content: scatter plots with logarithmic axes. (a) Time: horizontal axis Global-CS; vertical axes C-Must, Block-CS, C-Must×Must×Block-CS, and Exact-CS; axis ticks at 1 s and 10^2 s. (b) Memory: horizontal axis Global-CS; vertical axes C-Must, Block-CS, C-Must×Must×Block-CS, and Exact-CS; axis ticks at 256 MB and 2 GB.]
Fig. 11: Run time and memory comparison of persistence
analyses relative to Global-CS.
Let π be an arbitrary witness path through G containing
at least two accesses to block b that are misses in an LRU
cache of size k. Then π can be decomposed as follows:
$\pi = \mathit{pre} \circ (s_i, b, s_{i+1}) \circ \mathit{mid} \circ (s_j, b, s_{j+1}) \circ \mathit{post}$,
where the transition $(s_j, b, s_{j+1})$ corresponds to the second miss
to b among all accesses to b in π, and mid does not contain
accesses to b, i. e., the transition $(s_i, b, s_{i+1})$ corresponds to
the last access to b before the second miss to b in π.
Clearly, the suffix post can be removed from π, and the
resulting path is still a witness path. Next, we argue that
mid can be replaced by mid′, such that $|\mathit{mid}'| < |V| \cdot |E|$,
maintaining that the subsequent transition $(s_j, b, s_{j+1})$ results
in a miss: To this end, mid is further decomposed into
$\mathit{mid} = \mathit{mid}_1 \circ \mathit{mid}_2 \circ \dots \circ \mathit{mid}_n$, where each $\mathit{mid}_i$ starts with
an access to a memory block that was not accessed previously
in mid. Thus, the number of subpaths n is the number of
distinct memory blocks accessed on the path mid. Clearly,
$n < |E|$. Each $\mathit{mid}_i$ can be replaced by a $\mathit{mid}'_i$, such that
$|\mathit{mid}'_i| \le |V|$: Such a $\mathit{mid}'_i$ can be obtained by keeping the
first transition of $\mathit{mid}_i$ and then removing from $\mathit{mid}_i$ any
cycles, i. e., subpaths starting and ending in the same node.
By construction, $\mathit{mid}' = \mathit{mid}'_1 \circ \mathit{mid}'_2 \circ \dots \circ \mathit{mid}'_n$ does not
contain accesses to b and consists of accesses to at least
as many distinct memory blocks as mid. Finally, pre can
be replaced by the shortest path pre′ in G from the initial
location i to $s_i$. Clearly, $|\mathit{pre}'| \le |V|$. Also, the first access
to b in $\mathit{pre}' \circ (s_i, b, s_{i+1})$, which must exist due to the final
transition $(s_i, b, s_{i+1})$, results in a miss.
Thus, the path $\pi' = \mathit{pre}' \circ (s_i, b, s_{i+1}) \circ \mathit{mid}' \circ (s_j, b, s_{j+1})$
is also a witness to the fact that b is not persistent, and its
length is bounded by $|V| + |V| \cdot |E| + 2$, i. e., it is polynomial
in the size of the control-flow graph G.
[Figure 12 content: two scatter plots with horizontal axis C-Must×Must×Block-CS and vertical axis Exact-CS. Time: axis ticks at 1 s and 10^2 s. Memory: axis ticks at 256 MB and 2 GB.]
Fig. 12: Run time and memory comparison of Exact-CS and
C-Must×Must×Block-CS.
[Figure 13 content: (a) Graph with vertices v0, v1, v2, v3 and a Hamiltonian circuit (thick). (b) Control flow graph obtained by the reduction, with nodes $v_0^0$, $v_i^j$ for $1 \le i, j \le 3$, and $v_0^4$. Edge labels not shown, except for the back edge labeled b, which is to be classified. The thick path corresponds to the Hamiltonian circuit of the graph in (a).]
Fig. 13: Reduction from the Hamiltonian circuit problem.
Now we show that the persistence problem is NP-hard. This
part of the proof is analogous to Touzeau et al.’s proof that
LRU must analysis is NP-hard [28].
We reduce the Hamiltonian circuit problem to the persistence
problem. Let (V,E) be a graph, let $n = |V|$,
$V = \{v_0, \dots, v_{n-1}\}$. We construct a control-flow graph
$G = (V', E', i)$ for cache persistence analysis as follows:
• Create two copies $v_0^0$ and $v_0^n$ of $v_0$ in $V'$.
• For each $v_i$, $i \ge 1$, create $|V| - 1 = n - 1$ copies $v_i^j$,
$1 \le j < n$ in $V'$. This arranges these vertices in layers
indexed by j.
• For each pair $v_j^l$, $v_{j'}^{l+1}$ of nodes in consecutive layers,
create an edge in $E'$, labeled by the address $j'$, if and
only if there is an edge $(j, j')$ in E.
• The initial control location i is $v_0^0$.
See Figure 13 for an example. There is a Hamiltonian circuit
in (V,E) if and only if there is a path in G from $v_0^0$ to $v_0^n$
such that no edge label is repeated, thus if and only if there
exists a path from $v_0^0$ to $v_0^n$ with at least n distinct edge labels.
Now assume an edge going from $v_0^n$ back to $v_0^0$ labeled with
the fresh memory block b. This memory block b is the one
to classify. For cache size n there exists a path resulting in
two or more misses to b if and only if there is a path from $v_0^0$
to $v_0^n$ with at least n distinct edge labels, corresponding to a
Hamiltonian circuit in the graph (V,E). □
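The construction can be sketched in a few lines. The code below builds the layered control-flow graph and then searches, by brute force, for a path from $v_0^0$ to $v_0^n$ without repeated labels. It is an illustration under assumptions: nodes are encoded as pairs (vertex index, layer), E is passed as a list of ordered pairs (so an undirected edge is listed in both directions), and both function names are chosen here.

```python
def build_cfg(n, edges):
    """Layered CFG of the reduction as a list of (source, label, target)."""
    adj = set(edges)

    def layer(l):
        # Layers 0 and n hold the two copies of v_0, the others v_1..v_{n-1}.
        return [0] if l in (0, n) else list(range(1, n))

    cfg = []
    for l in range(n):
        for j in layer(l):
            for j2 in layer(l + 1):
                if (j, j2) in adj:
                    cfg.append(((j, l), j2, (j2, l + 1)))  # label: address j2
    cfg.append(((0, n), "b", (0, 0)))  # back edge accessing the fresh block b
    return cfg

def has_path_with_n_distinct_labels(n, cfg):
    """Exponential-time search for a label-repetition-free path."""
    succ = {}
    for src, label, dst in cfg:
        if label != "b":
            succ.setdefault(src, []).append((label, dst))

    def dfs(node, seen):
        if node == (0, n):
            return len(seen) == n
        return any(dfs(dst, seen | {label})
                   for label, dst in succ.get(node, [])
                   if label not in seen)

    return dfs((0, 0), frozenset())

def both_directions(pairs):
    return list(pairs) + [(j, i) for i, j in pairs]

cycle = both_directions([(0, 1), (1, 2), (2, 3), (3, 0)])  # has a circuit
star = both_directions([(0, 1), (0, 2), (0, 3)])           # has none
```

The search itself takes exponential time in general, which is consistent with the hardness result; only the construction of the control-flow graph is polynomial.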
It is important to point out that the above NP-hardness proof
critically relies on the cache size being an input parameter. In
fact, it turns out to be possible to devise a persistence analysis
that is exponential in the cache size but polynomial in the
size of the control-flow graph [39] based on recent results in
theoretical computer science [40].
VIII. CONCLUSIONS AND FUTURE WORK
We have shown that it is possible to perform exact cache
persistence analysis for caches with LRU replacement at a rea-
sonable analysis cost. To this end, we introduced a sequence of
exact abstractions, exploiting monotonicity properties inherent
to LRU replacement; followed by an efficient implementation
based on zero-suppressed binary decision diagrams (ZDDs).
In addition, we introduced novel techniques to efficiently deal
with uncertainty arising in the context of data cache analysis.
The motto of our paper could be: “in practice, theory and
practice are different”, as the following findings demonstrate:
Our experimental evaluation reveals that the new exact analysis
is only moderately more costly than existing inexact analyses;
in particular, in our experiments it does not exhibit exponential
complexity in terms of the input size. This is in spite of the fact
that its worst-case complexity is indeed exponential and that
persistence analysis is NP-hard, as we show in Section VII.
Similarly, while even the most precise existing persistence
analyses are not exact in theory, our experiments show that
they are close to exact in practice.
ACKNOWLEDGEMENTS
This work was supported by the Deutsche Forschungsge-
meinschaft as part of the project PEP – 289264719. We thank
the anonymous reviewers for their helpful comments.
IX. APPENDIX
Theorem IX.1 (Exactness of Persistence Analysis). Let P be
a persistence analysis over a finite abstract domain $\hat{C}_P$ that
satisfies the equations of Theorem III.2. Furthermore let Q be
a persistence analysis over a finite abstract domain $\hat{C}_Q$ defined
relative to P by an abstraction function $\alpha_Q$. The persistence
analysis Q is exact if:
$$\forall T, b : \alpha_Q(\widehat{\mathit{update}}_P(T, b)) = \widehat{\mathit{update}}_Q(\alpha_Q(T), b)$$
$$\forall I, T_i : \alpha_Q\Big(\bigsqcup^{P}_{i \in I} T_i\Big) = \bigsqcup^{Q}_{i \in I} \alpha_Q(T_i)$$
and the abstraction preserves the persistence classification:
$$\forall T, b : \widehat{\mathit{persistent}}_P(T, b) \leftrightarrow \widehat{\mathit{persistent}}_Q(\alpha_Q(T), b)$$
Proof. Composing the abstraction functions $\alpha_P$ and $\alpha_Q$ we
obtain the abstraction function $\alpha_{Q \circ P} := \alpha_Q \circ \alpha_P$ that relates
the persistence analysis Q directly with the concrete trace
semantics. If we show that the equations of Theorem III.2
are satisfied for $\alpha_{Q \circ P}$, then the exactness of Q follows.
1) $\forall T, b : \alpha_{Q \circ P}(\mathit{update}(T, b)) = \widehat{\mathit{update}}_Q(\alpha_{Q \circ P}(T), b)$
2) $\forall I, T_i : \alpha_{Q \circ P}\big(\bigcup_{i \in I} T_i\big) = \bigsqcup^{Q}_{i \in I} \alpha_{Q \circ P}(T_i)$
3) $\forall T, b : \mathit{persistent}(T, b) \leftrightarrow \widehat{\mathit{persistent}}_Q(\alpha_{Q \circ P}(T), b)$
The proof is straightforward and follows immediately from
the equations of Theorem III.2 for P and the premises of this
theorem. □
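Theorem III.3 (Exactness of Exact-CS0). The persistence
analysis Exact-CS0 is exact in the sense of Definition III.1.
Proof. It suffices to prove the equations of Theorem III.2.
1) $\forall T, b : \alpha_0(\mathit{update}(T, b)) = \widehat{\mathit{update}}_0(\alpha_0(T), b)$
Let T be a set of traces and $b \in \mathcal{B}$. By unfolding
definitions, the term $\alpha_0(\mathit{update}(T, b))$ reduces to:
$$\lambda b'. \bigcup_{\tau \in T} \begin{cases} \{\{b\}\} & \text{if } b' = b \\ \{s \cup \{b\} \mid s \in \alpha_0(\{\tau\})(b')\} & \text{otherwise} \end{cases}$$
Next, the union over all traces can be moved inward:
$$\lambda b'. \begin{cases} \{\{b\}\} & \text{if } b' = b \\ \{s \cup \{b\} \mid s \in \bigcup_{\tau \in T} \alpha_0(\{\tau\})(b')\} & \text{otherwise} \end{cases}$$
This is equivalent to $\widehat{\mathit{update}}_0(\alpha_0(T), b)$ by definition.
2) $\forall T_i : \alpha_0\big(\bigcup_{i \in \mathbb{N}_0} T_i\big) = \bigsqcup^{0}_{i \in \mathbb{N}_0} \alpha_0(T_i)$
The claim follows trivially from the definition of $\alpha_0$.
3) $\forall T, b : \mathit{persistent}(T, b) \leftrightarrow \widehat{\mathit{persistent}}_0(\alpha_0(T), b)$
Let T be an arbitrary set of traces and $b \in \mathcal{B}$. The left
hand side is equivalent to $\forall \tau \in T : \mathit{persistent}(\tau, b)$ by
definition while the right hand side can be reformulated
as $\forall \tau \in T : \max\{|s| \mid s \in \alpha_0(\tau)(b)\} \le k$. It is therefore
sufficient to apply Lemma IX.4 on all $\tau \in T$. □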
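Lemma IX.2.
$$\forall \tau, b : (\forall\, 0 \le i < |\tau| : \tau_i \ne b) \to \alpha_0(\tau)(b) = \emptyset$$
Proof. Let τ be a trace and $b \in \mathcal{B}$. We are always in the
second case of $\widehat{\mathit{update}}_0$ and propagate the initial information
$\alpha_0(\varepsilon)(b) = \emptyset$. □
Lemma IX.3.
$$\forall \tau, b : (\exists\, 0 \le i < |\tau| : \tau_i = b) \to \alpha_0(\tau)(b) = \{\mathit{CS}(\tau, b)\}$$
Proof. Let τ be a trace, $b \in \mathcal{B}$ and $m := \max\{0 \le i < |\tau| \mid \tau_i = b\}$.
We know that $\alpha_0(\tau_0 \circ \dots \circ \tau_m)(b) = \{\{b\}\}$ by
the definition of $\widehat{\mathit{update}}_0$. Furthermore, $\mathit{CS}(\tau, b) = \{\tau_j \mid j \ge m\}$.
Now, all upcoming updates just add some block in
$\mathit{CS}(\tau, b)$ to $\{b\}$ and the claim follows. □
Lemma IX.4.
$$\forall \tau, b : \mathit{persistent}(\tau, b) \leftrightarrow \widehat{\mathit{persistent}}_0(\alpha_0(\{\tau\}), b)$$
Proof. Let τ be a trace and $b \in \mathcal{B}$. The left hand side
reduces to $(\exists\, 0 \le i < |\tau| : \tau_i = b) \to |\mathit{CS}(\tau, b)| \le k$ and the
right hand side to $\max\{|s| \mid s \in \alpha_0(\{\tau\})(b)\} \le k$ by
unfolding all definitions. Applying Lemmas IX.2 and IX.3
completes the proof. □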
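Theorem III.4 (Exactness of Exact-CS↑). The persistence
analysis Exact-CS↑ is exact in the sense of Definition III.1.
Proof. It suffices to prove the equations of Theorem IX.1.
1) $\forall T, b : \alpha_\uparrow(\widehat{\mathit{update}}_0(T, b)) = \widehat{\mathit{update}}_\uparrow(\alpha_\uparrow(T), b)$
Let $T \in \alpha_0$ be an abstract trace and $b \in \mathcal{B}$. We show that
both functions agree on all $b' \in \mathcal{B}$. The case $b' = b$ is
trivial.
If $b' \ne b$, the left hand side $\alpha_\uparrow(\widehat{\mathit{update}}_0(T, b))(b')$
reduces to:
$$\mathit{maxSet}(\{s \cup \{b\} \mid s \in T(b')\})$$
while the right hand side $\widehat{\mathit{update}}_\uparrow(\alpha_\uparrow(T), b)(b')$ reduces
to:
$$\mathit{maxSet}(\{s \cup \{b\} \mid s \in \mathit{maxSet}(T(b'))\})$$
The missing step is closed by Lemma IX.5.
2) $\forall I, T_i : \alpha_\uparrow\big(\bigsqcup^{0}_{i \in I} T_i\big) = \bigsqcup^{\uparrow}_{i \in I} \alpha_\uparrow(T_i)$
Let $I \subseteq \mathbb{N}_0$ and $T_i \in \alpha_0$ be arbitrary abstract
traces. We show that both functions agree
on all $b \in \mathcal{B}$. By definition, we can transform
$\alpha_\uparrow\big(\bigsqcup^{0}_{i \in I} T_i\big)(b)$ to $\mathit{maxSet}\big(\bigcup_{i \in I} T_i(b)\big)$. This is equal to
$\mathit{maxSet}\big(\bigcup_{i \in I} \mathit{maxSet}(T_i(b))\big)$ by Lemma IX.6 which is
equal to $\big(\bigsqcup^{\uparrow}_{i \in I} \alpha_\uparrow(T_i)\big)(b)$ by definition.
3) $\forall T, b : \widehat{\mathit{persistent}}_0(T, b) \leftrightarrow \widehat{\mathit{persistent}}_\uparrow(\alpha_\uparrow(T), b)$
Let $T \in \alpha_0$ be an abstract trace and $b \in \mathcal{B}$. The claim is
proven with Lemma IX.7 after unfolding all definitions. □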
Lemma IX.5.
∀A, b : maxSet({s ∪ {b} | s ∈ A}) =
maxSet({s ∪ {b} | s ∈ maxSet(A)})
Proof. Let A be a set and b ∈ B. We prove the claim by
showing mutual inclusion.
“⊆”: Let t ∈ maxSet({s ∪ {b} | s ∈ A}), i. e., t = s ∪ {b}
for some s ∈ A. We have to prove two statements:
1) t ∈ {s ∪ {b} | s ∈ maxSet(A)}
In the case t ∈ A, we claim that t ∈ maxSet(A). By
assuming the contrary, we immediately get a contra-
diction with t ∈ maxSet({s ∪ {b} | s ∈ A}).
If t /∈ A, i. e., t = s unionmulti {b}, we claim that s ∈
maxSet(A). Assume the contrary, i. e., there is s′ ∈ A,
s ( s′. Note that s′ 6= t as t /∈ A. Then, s′ ∪ {b} ) t
contradicts t ∈ maxSet({s ∪ {b} | s ∈ A}).
2) ¬∃s′ ∈ {s ∪ {b} | s ∈ maxSet(A)} : t ( s′
If we assume otherwise, we get a contradiction to t ∈
maxSet({s ∪ {b} | s ∈ A}).
“⊇”: Trivial, as maxSet(A) ⊆ A. 
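Lemma IX.5 can also be checked mechanically on small inputs. The sketch below enumerates every family of subsets of a three-element universe and compares both sides of the lemma. The helper names (`max_set`, `add_block`, `all_families`, `lemma_ix5_holds`) and the choice of universe are assumptions for illustration; the check complements the proof and does not replace it.

```python
from itertools import combinations


def max_set(family):
    """Keep the sets of `family` that have no strict superset in `family`."""
    return {s for s in family if not any(s < t for t in family)}


def add_block(family, b):
    """Form {s ∪ {b} | s ∈ family}."""
    return {s | {b} for s in family}


def all_families(universe):
    """Enumerate every family of subsets of `universe` (tiny inputs only)."""
    subsets = [frozenset(c)
               for r in range(len(universe) + 1)
               for c in combinations(universe, r)]
    for r in range(len(subsets) + 1):
        for fam in combinations(subsets, r):
            yield set(fam)


def lemma_ix5_holds(universe):
    """Compare maxSet(add(A, b)) with maxSet(add(maxSet(A), b)) for all A, b."""
    return all(
        max_set(add_block(A, b)) == max_set(add_block(max_set(A), b))
        for A in all_families(universe)
        for b in universe
    )


# 2^8 = 256 families over three blocks, each combined with every block b.
assert lemma_ix5_holds(["a", "b", "c"])
```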
Lemma IX.6.
$\forall I, A_i : \operatorname{maxSet}\Big(\bigcup_{i \in I} A_i\Big) = \operatorname{maxSet}\Big(\bigcup_{i \in I} \operatorname{maxSet}(A_i)\Big)$
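Proof. Let $I \subseteq \mathbb{N}_0$, $A_i$ be arbitrary sets and $\hat{A}_i := \operatorname{maxSet}(A_i)$.
We prove the claim by showing mutual inclusion.
“⊆”: Let $s \in \operatorname{maxSet}\big(\bigcup_{i \in I} A_i\big)$, i. e., $s \in \bigcup_{i \in I} A_i$ and
$\neg\exists s' \in \bigcup_{i \in I} A_i : s \subsetneq s'$. It is easy to see that $s \in \hat{A}_j$ for
some $j \in I$, and hence, $s \in \bigcup_{i \in I} \hat{A}_i$. As $\big(\bigcup_{i \in I} \hat{A}_i\big) \subseteq \big(\bigcup_{i \in I} A_i\big)$,
we get that $s \in \operatorname{maxSet}\big(\bigcup_{i \in I} \hat{A}_i\big)$.
“⊇”: Trivial, as $\big(\bigcup_{i \in I} \hat{A}_i\big) \subseteq \big(\bigcup_{i \in I} A_i\big)$.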
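A concrete instance of Lemma IX.6, as a hedged Python sketch: the join is modeled as maxSet of the union, and the two input families are made up for the example. The names `max_set` and `join` are illustrative assumptions.

```python
def max_set(family):
    """Keep the sets of `family` that have no strict superset in `family`."""
    return {s for s in family if not any(s < t for t in family)}


def join(families):
    """maxSet of the union of all given families."""
    return max_set(set().union(*families))


# Two made-up families, e.g. the sets tracked for one block on two paths.
A1 = {frozenset("a"), frozenset("ab")}
A2 = {frozenset("b"), frozenset("c")}

direct = join([A1, A2])                    # maxSet(A1 ∪ A2)
pruned = join([max_set(A1), max_set(A2)])  # maxSet(maxSet(A1) ∪ maxSet(A2))

assert direct == pruned == {frozenset("ab"), frozenset("c")}
```

Note that {b} is maximal within A2 but is dropped after the join because A1 contributes the superset {a, b}; this is why the outer maxSet is still needed on the right hand side.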
Lemma IX.7.
∀A : max{|x| | x ∈ A} = max{|x| | x ∈ maxSet(A)}
Proof. Let A be a set. We prove the equality by showing
both inequalities.
“≤”: Let n := max{|x| | x ∈ A} and m := max{|x| |
x ∈ maxSet(A)}. Assume that n > m, and let a ∈ A
with |a| = n. There cannot exist b ∈ A with a ⊊ b
because otherwise |b| > n, contradicting the maximality of n. But then,
a ∈ maxSet(A), which contradicts |a| = n > m.
“≥”: Trivial, as maxSet(A) ⊆ A.
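The following short Python sketch illustrates Lemma IX.7 on a made-up family: pruning to maximal sets shrinks the family but leaves the largest cardinality unchanged. The names `max_set` and `max_cardinality` are assumptions for this example, and the family is assumed to be non-empty.

```python
def max_set(family):
    """Keep the sets of `family` that have no strict superset in `family`."""
    return {s for s in family if not any(s < t for t in family)}


def max_cardinality(family):
    """max{|x| | x ∈ family}; assumes `family` is non-empty."""
    return max(len(s) for s in family)


# Made-up family of five sets over the blocks a, b, c, d.
family = {frozenset("a"), frozenset("ab"), frozenset("c"),
          frozenset("cd"), frozenset("d")}
maximal = max_set(family)

assert maximal == {frozenset("ab"), frozenset("cd")}
assert max_cardinality(family) == max_cardinality(maximal) == 2
```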
Theorem III.5 (Exactness of Exact-CS≤k ). The persistence
analysis Exact-CS≤k is exact in the sense of Definition III.1.
Proof. With the proof of Theorem III.4, we have already
shown that the domain of α↑ satisfies the equations of
Theorem III.2. As α≤k is defined relative to this domain, it is
sufficient to prove the equations of Theorem IX.1.
1) $\forall T, b : \alpha^{\le k}\big(\widehat{\mathit{update}}^{\uparrow}(T, b)\big) = \widehat{\mathit{update}}^{\le k}(\alpha^{\le k}(T), b)$
Let $T \in \alpha^{\uparrow}$ be an abstract trace and $b \in B$. We show that
both functions agree on all $b' \in B$. The case $b' = b$ is trivial.
If $b' \neq b$, the left hand side $\alpha^{\le k}\big(\widehat{\mathit{update}}^{\uparrow}(T, b)\big)(b')$
reduces to:
$\operatorname{limit}(\operatorname{maxSet}(\{s \cup \{b\} \mid s \in T(b')\}))$ (11)
We have to show equality with:
$\operatorname{limit}(\operatorname{maxSet}(\{s \cup \{b\} \mid s \in \operatorname{limit}(T(b'))\}))$ (12)
which is the right hand side $\widehat{\mathit{update}}^{\le k}(\alpha^{\le k}(T), b)(b')$
where all definitions are unfolded.
We distinguish the two cases of $\operatorname{limit}(T(b'))$. The case
$\operatorname{limit}(T(b')) = T(b')$ is trivial. Therefore, assume
$\exists s \in T(b') : |s| > k$. Equation 12 reduces to $\{B\}$ by definition
as $\operatorname{limit}(T(b')) = \{B\}$. For Equation 11, the union
with $\{b\}$ cannot decrease the cardinality and we get
$\exists s' \in \{s \cup \{b\} \mid s \in T(b')\} : |s'| > k$. The proof is
completed by applying Lemma IX.7 and the definition of
limit, which prove that (11) is equal to $\{B\}$, too.
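2) $\forall I, T_i : \alpha^{\le k}\Big(\bigsqcup^{\uparrow}_{i \in I} T_i\Big) = \bigsqcup^{\le k}_{i \in I} \alpha^{\le k}(T_i)$
Let $I \subseteq \mathbb{N}_0$ and $T_i \in \alpha^{\uparrow}$ be arbitrary abstract traces. We show
that both functions agree on all $b \in B$. By definition, we can transform
$\alpha^{\le k}\Big(\bigsqcup^{\uparrow}_{i \in I} T_i\Big)(b)$ to $\operatorname{limit}\Big(\operatorname{maxSet}\Big(\bigcup_{i \in I} T_i(b)\Big)\Big)$. This is
equal to $\operatorname{maxSet}\Big(\bigcup_{i \in I} \operatorname{limit}(T_i(b))\Big)$ by Lemma IX.8, which
is equal to $\Big(\bigsqcup^{\le k}_{i \in I} \alpha^{\le k}(T_i)\Big)(b)$ by definition.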
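3) $\forall T, b : \widehat{\mathit{persistent}}^{\uparrow}(T, b) \leftrightarrow \widehat{\mathit{persistent}}^{\le k}(\alpha^{\le k}(T), b)$
Let $T \in \alpha^{\uparrow}$ be an abstract trace and $b \in B$. The claim is
proven with Lemma IX.9 after unfolding all definitions.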
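To make the effect of the bound tangible, the sketch below runs a bounded and an unbounded variant side by side and compares the resulting threshold checks. It is a minimal Python illustration under several assumptions that are not taken from the paper: a four-block universe `BLOCKS` standing in for B, a bound `K` = 2 with |B| > k, a starting family containing only the empty set, and accesses that all differ from the tracked block. All function names are hypothetical.

```python
BLOCKS = frozenset("abcd")  # hypothetical full block set B
K = 2                       # hypothetical bound k, chosen so that |B| > k


def max_set(family):
    """Keep the sets of `family` that have no strict superset in `family`."""
    return {s for s in family if not any(s < t for t in family)}


def limit(family):
    """Collapse to {B} as soon as some set has more than K elements."""
    return {BLOCKS} if any(len(s) > K for s in family) else set(family)


def update_unbounded(family, b):
    """Add the accessed block b to every set, then keep the maximal sets."""
    return max_set({s | {b} for s in family})


def update_bounded(family, b):
    """Same update, followed by limit."""
    return limit(update_unbounded(family, b))


def within_bound(family):
    """Threshold check max{|s|} <= K; assumes a non-empty family."""
    return max(len(s) for s in family) <= K


def verdicts(accesses):
    """Run both variants over the accesses and return both threshold checks."""
    exact = {frozenset()}
    bounded = limit(exact)
    for b in accesses:
        exact = update_unbounded(exact, b)
        bounded = update_bounded(bounded, b)
    return within_bound(exact), within_bound(bounded)


assert verdicts("aba") == (True, True)    # only two distinct blocks accessed
assert verdicts("abc") == (False, False)  # a third block exceeds the bound
```

Once the bounded variant has collapsed to {B}, further accesses leave it unchanged, so it never has to store a set with more than K elements other than B itself.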

Lemma IX.8.
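$\forall I, A_i : \operatorname{limit}\Big(\operatorname{maxSet}\Big(\bigcup_{i \in I} A_i\Big)\Big) = \operatorname{maxSet}\Big(\bigcup_{i \in I} \operatorname{limit}(A_i)\Big)$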
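Proof. Let $I \subseteq \mathbb{N}_0$ and $A_i$ be arbitrary sets. We prove the
claim by showing mutual inclusion.
“⊆”: Let $s \in \operatorname{limit}\big(\operatorname{maxSet}\big(\bigcup_{i \in I} A_i\big)\big)$.
– Assume $\exists s' \in \operatorname{maxSet}\big(\bigcup_{i \in I} A_i\big) : |s'| > k$,
i. e., $s = B$. W. l. o. g. let $s' \in A_j$ for some
$j \in I$. Then, $\operatorname{limit}(A_j) = \{B\}$ and $s \in \{B\} = \operatorname{maxSet}\big(\bigcup_{i \in I} \operatorname{limit}(A_i)\big)$.
– Otherwise, $s \in \operatorname{maxSet}\big(\bigcup_{i \in I} A_i\big)$ and
$\neg\exists s' \in \operatorname{maxSet}\big(\bigcup_{i \in I} A_i\big) : |s'| > k$. Therefore,
$\forall i \in I : \neg\exists s' \in A_i : |s'| > k$, i. e., $\operatorname{limit}(A_i) = A_i$.
“⊇”: Let $s \in \operatorname{maxSet}\big(\bigcup_{i \in I} \operatorname{limit}(A_i)\big)$. If
$\forall i \in I : \neg\exists s' \in A_i : |s'| > k$, then $s \in \operatorname{maxSet}\big(\bigcup_{i \in I} A_i\big)$. Otherwise
let w. l. o. g. $\operatorname{limit}(A_j) = \{B\}$ for some $j \in I$. Then,
$s = B \in \operatorname{limit}\big(\operatorname{maxSet}\big(\bigcup_{i \in I} A_i\big)\big)$.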
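Both cases of the proof of Lemma IX.8 can be replayed on small inputs. In this Python sketch, `BLOCKS` is a hypothetical four-block universe standing in for B, `K` = 2 is a hypothetical bound, and the two inputs are made up: `small` stays within the bound, `large` contains one set that exceeds it.

```python
BLOCKS = frozenset("abcd")  # hypothetical full block set B
K = 2                       # hypothetical bound k


def max_set(family):
    """Keep the sets of `family` that have no strict superset in `family`."""
    return {s for s in family if not any(s < t for t in family)}


def limit(family):
    """Collapse to {B} as soon as some set has more than K elements."""
    return {BLOCKS} if any(len(s) > K for s in family) else set(family)


def limit_of_join(families):
    """Left hand side: limit(maxSet(union of all A_i))."""
    return limit(max_set(set().union(*families)))


def join_of_limits(families):
    """Right hand side: maxSet(union of all limit(A_i))."""
    return max_set(set().union(*(limit(A) for A in families)))


small = [{frozenset("a"), frozenset("ab")}, {frozenset("c")}]
large = [{frozenset("a"), frozenset("ab")}, {frozenset("bcd")}]

# No set exceeds K: limit is the identity on both sides.
assert limit_of_join(small) == join_of_limits(small) == {frozenset("ab"), frozenset("c")}
# One set exceeds K: both sides collapse to {B}.
assert limit_of_join(large) == join_of_limits(large) == {BLOCKS}
```

In the second case the right hand side relies on every tracked set being a subset of B, so that the outer maxSet discards everything except B.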
Lemma IX.9.
∀A, k : max{|x| | x ∈ A} ≤ k ↔
max{|x| | x ∈ limit(A)} ≤ k
Proof. Let A be a set and k ∈ N. We prove the claim by
showing mutual implication.
“→”: Let max{|x| | x ∈ A} ≤ k, i. e., ¬∃y ∈ A : |y| > k.
But then we have limit(A) = A.
“←”: Let max{|x| | x ∈ limit(A)} ≤ k. Assume limit(A) =
{B}, then ∃y ∈ A : |y| > k, a contradiction. 
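The “←” direction of Lemma IX.9 relies on B itself having more than k elements. The Python sketch below checks the equivalence for all families of one or two subsets of a small universe under that assumption. `BLOCKS`, `K`, and all helper names are hypothetical and chosen for illustration only.

```python
from itertools import combinations

BLOCKS = frozenset("abcd")  # hypothetical full block set B
K = 2                       # hypothetical bound k; the check assumes |B| > k


def limit(family):
    """Collapse to {B} as soon as some set has more than K elements."""
    return {BLOCKS} if any(len(s) > K for s in family) else set(family)


def within_bound(family):
    """Threshold check max{|x|} <= K; assumes a non-empty family."""
    return max(len(s) for s in family) <= K


# All subsets of BLOCKS, then all families consisting of one or two of them.
subsets = [frozenset(c)
           for r in range(len(BLOCKS) + 1)
           for c in combinations(sorted(BLOCKS), r)]
families = [set(f) for r in (1, 2) for f in combinations(subsets, r)]

# The threshold check gives the same answer before and after limit.
assert all(within_bound(A) == within_bound(limit(A)) for A in families)
```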
[Figure 14: two panels, (a) Without compiler optimizations and (b) With compiler optimizations. Axis: WCET ratio, ticks from 0.5 to 1.0. One entry per benchmark (adpcm dec through susan). Series: C-Must, Global-CS, Block-CS, C-Must×Must×Block-CS, Exact-CS.]
Fig. 14: WCET ratios of the persistence analyses compared with performing no persistence analysis at all.
Cache configuration: 32 cache sets, 8 ways, 16B line size.
[Figure 15: two panels, (a) Without compiler optimizations and (b) With compiler optimizations. Axis: WCET ratio, ticks from 0.4 to 1.0. One entry per benchmark (adpcm dec through susan). Series: C-Must, Global-CS, Block-CS, C-Must×Must×Block-CS, Exact-CS.]
Fig. 15: WCET ratios of the persistence analyses compared with performing no persistence analysis at all.
Cache configuration: 32 cache sets, 8 ways, 16B line size, loop peeling disabled.
[Figure 16: two panels, (a) Without compiler optimizations and (b) With compiler optimizations. Axis: WCET ratio, ticks from 0.4 to 1.0. One entry per benchmark (adpcm dec through susan). Series: C-Must, Global-CS, Block-CS, C-Must×Must×Block-CS, Exact-CS.]
Fig. 16: WCET ratios of the persistence analyses compared with performing no persistence analysis at all.
Cache configuration: 32 cache sets, 8 ways, 16B line size, higher latency of 100 cycles to deliver the first word.
[Figure 17: two panels, (a) Instruction cache misses and (b) Data cache misses. Axis: ratio of cache misses, ticks from 0.0 to 1.0. One entry per benchmark (adpcm dec through susan). Series: C-Must, Global-CS, Block-CS, C-Must×Must×Block-CS, Exact-CS.]
Fig. 17: Ratios of the maximum number of instruction and data cache misses of the persistence analyses compared with
performing no persistence analysis at all.
Cache configuration: 32 cache sets, 8 ways, 16B line size, without compiler optimizations.
[Figure 18: two panels, (a) Instruction cache misses and (b) Data cache misses. Axis: ratio of cache misses, ticks from 0.0 to 1.0. One entry per benchmark (adpcm dec through susan). Series: C-Must, Global-CS, Block-CS, C-Must×Must×Block-CS, Exact-CS.]
Fig. 18: Ratios of the maximum number of instruction and data cache misses of the persistence analyses compared with
performing no persistence analysis at all.
Cache configuration: 32 cache sets, 8 ways, 16B line size, with compiler optimizations.
[Figure 19: two panels, (a) Without compiler optimizations and (b) With compiler optimizations. Axis: WCET ratio, ticks from 0.5 to 1.0. One entry per benchmark (adpcm dec through susan). Series: C-Must, Global-CS, Block-CS, C-Must×Must×Block-CS, Exact-CS.]
Fig. 19: WCET ratios of the persistence analyses compared with performing no persistence analysis at all.
Cache configuration: 8 cache sets, 8 ways, 16B line size.
[Figure 20: four panels, (a) Time, without compiler optimizations; (b) Time, with compiler optimizations; (c) Memory, without compiler optimizations; (d) Memory, with compiler optimizations. Each panel contains four plots with Global-CS on one axis and, respectively, C-Must, Block-CS, C-Must×Must×Block-CS, and Exact-CS on the other. Axis ticks: 1 s and 10^2 s for time, 256MB and 2GB for memory.]
Fig. 20: Run time and memory comparison of persistence analyses relative to Global-CS.
Cache configuration: 32 cache sets, 8 ways, 16B line size.
[Figure 21: four panels, (a) Time, without compiler optimizations; (b) Time, with compiler optimizations; (c) Memory, without compiler optimizations; (d) Memory, with compiler optimizations. Each panel contains four plots with Global-CS on one axis and, respectively, C-Must, Block-CS, C-Must×Must×Block-CS, and Exact-CS on the other. Axis ticks: 1 s and 10^2 s for time, 256MB and 2GB for memory.]
Fig. 21: Run time and memory comparison of persistence analyses relative to Global-CS.
Cache configuration: 8 cache sets, 8 ways, 16B line size.
REFERENCES
[1] R. Wilhelm, J. Engblom, A. Ermedahl, N. Holsti, S. Thesing, D. B.
Whalley, G. Bernat, C. Ferdinand, R. Heckmann, T. Mitra, F. Mueller,
I. Puaut, P. P. Puschner, J. Staschulat, and P. Stenstro¨m, “The worst-case
execution-time problem - overview of methods and survey of tools,”
ACM Trans. Embedded Comput. Syst., vol. 7, no. 3, pp. 36:1–36:53,
2008. [Online]. Available: https://doi.org/10.1145/1347375.1347389
[2] M. Lv, N. Guan, J. Reineke, R. Wilhelm, and W. Yi, “A survey on
static cache analysis for real-time systems,” Leibniz Transactions on
Embedded Systems, vol. 3, no. 1, pp. 05–1–05:48, 2016. [Online].
Available: https://doi.org/10.4230/LITES-v003-i001-a005
[3] F. Mueller, “Static cache simulation and its applications,” Ph.D.
dissertation, Florida State University, Tallahassee, United States, 1994.
[Online]. Available: https://www.cs.fsu.edu/∼whalley/papers/mueller
diss94.pdf
[4] R. D. Arnold, F. Mueller, D. B. Whalley, and M. G. Harmon,
“Bounding worst-case instruction cache performance,” in Proceedings
of the 15th IEEE Real-Time Systems Symposium, San Juan, Puerto
Rico, December 7-9, 1994, ser. RTSS 1994, 1994, pp. 172–181.
[Online]. Available: https://doi.org/10.1109/REAL.1994.342718
[5] R. T. White, C. A. Healy, D. B. Whalley, F. Mueller, and M. G. Harmon,
“Timing analysis for data caches and set-associative caches,” in 3rd
IEEE Real-Time Technology and Applications Symposium, Montreal,
Canada, June 9-11, 1997, ser. RTAS 1997, 1997, pp. 192–202.
[Online]. Available: https://doi.org/10.1109/RTTAS.1997.601358
[6] F. Mueller, “Timing analysis for instruction caches,” Real-Time
Systems, vol. 18, no. 2, pp. 217–247, May 2000. [Online]. Available:
https://doi.org/10.1023/A:1008145215849
[7] C. Ferdinand, “Cache behavior prediction for real-time systems,” Ph.D.
dissertation, Saarland University, Saarbru¨cken, Germany, 1997, iSBN:
3-9307140-31-0. [Online]. Available: https://d-nb.info/953983706
[8] C. Ferdinand and R. Wilhelm, “Efficient and precise cache behavior
prediction for real-time systems,” Real-Time Systems, vol. 17, no. 2-3,
pp. 131–181, Nov. 1999. [Online]. Available: https://doi.org/10.1023/A:
1008186323068
[9] C. Ballabriga and H. Casse, “Improving the first-miss computation in set-
associative instruction caches,” in Proceedings of the 2008 Euromicro
Conference on Real-Time Systems, ser. ECRTS 2008, 2008, pp.
341–350. [Online]. Available: https://doi.org/10.1109/ECRTS.2008.34
[10] C. Cullmann, “Cache persistence analysis: a novel approach,” in
Proceedings of the ACM SIGPLAN/SIGBED 2011 conference on
Languages, compilers, and tools for embedded systems, Chicago,
IL, USA, April 11-14, 2011, ser. LCTES 2011, 2011, pp. 121–130.
[Online]. Available: https://doi.org/10.1145/1967677.1967695
[11] B. K. Huynh, L. Ju, and A. Roychoudhury, “Scope-aware data
cache analysis for WCET estimation,” in Proceedings of the 2011
17th IEEE Real-Time and Embedded Technology and Applications
Symposium, ser. RTAS 2011, 2011, pp. 203–212. [Online]. Available:
https://doi.org/10.1109/RTAS.2011.27
[12] K. Nagar and Y. N. Srikant, “Interdependent cache analyses for better
precision and safety,” in Tenth ACM/IEEE International Conference on
Formal Methods and Models for Codesign, Arlington, VA, USA, July
16-17, 2012, ser. MEMOCODE 2012, 2012, pp. 99–108. [Online].
Available: https://doi.org/10.1109/MEMCOD.2012.6292306
[13] K. Nagar, “Cache analysis for multi-level data caches,” Master’s thesis,
Indian Institute of Science, Bangalore, India, 2012. [Online]. Available:
http://clweb.csa.iisc.ac.in/kartik.nagar/thesis.pdf
[14] C. Cullmann, “Cache persistence analysis for embedded real-time sys-
tems,” Ph.D. dissertation, Saarland University, Saarbru¨cken, Germany,
2013. [Online]. Available: https://doi.org/10.22028/D291-26418
[15] ——, “Cache persistence analysis: Theory and practice,” ACM Trans.
Embedded Comput. Syst., vol. 12, no. 1s, pp. 40:1–40:25, 2013.
[Online]. Available: https://doi.org/10.1145/2435227.2435236
[16] Z. Zhang and X. D. Koutsoukos, “Improving the precision of abstract
interpretation based cache persistence analysis,” in Proceedings of the
16th ACM SIGPLAN/SIGBED Conference on Languages, Compilers
and Tools for Embedded Systems, Portland, OR, USA, June 18-19,
2015, ser. LCTES 2015, 2015, pp. 10:1–10:10. [Online]. Available:
https://doi.org/10.1145/2670529.2754967
[17] J. Reineke, “The semantic foundations and a landscape of cache-
persistence analyses,” LITES, vol. 5, no. 1, pp. 03:1–03:52, 2018.
[Online]. Available: https://doi.org/10.4230/LITES-v005-i001-a003
[18] M. Alt, C. Ferdinand, F. Martin, and R. Wilhelm, “Cache behavior
prediction by abstract interpretation,” in Static Analysis, Third
International Symposium, SAS’96, Aachen, Germany, September
24-26, 1996, Proceedings, 1996, pp. 52–66. [Online]. Available:
https://doi.org/10.1007/3-540-61739-6 33
[19] J. Reineke, D. Grund, C. Berg, and R. Wilhelm, “Timing predictability
of cache replacement policies,” Real-Time Systems, vol. 37, no. 2,
pp. 99–122, Nov. 2007. [Online]. Available: https://doi.org/10.1007/
s11241-007-9032-3
[20] P. Cousot and R. Cousot, “Abstract interpretation: A unified lattice
model for static analysis of programs by construction or approximation
of fixpoints,” in Proceedings of the 4th ACM SIGACT-SIGPLAN
Symposium on Principles of Programming Languages, ser. POPL 1977.
New York, NY, USA: ACM, 1977, pp. 238–252. [Online]. Available:
https://doi.org/10.1145/512950.512973
[21] ——, “Systematic design of program analysis frameworks,” in
Proceedings of the 6th ACM SIGACT-SIGPLAN Symposium on
Principles of Programming Languages, ser. POPL ’79. New
York, NY, USA: ACM, 1979, pp. 269–282. [Online]. Available:
https://doi.org/10.1145/567752.567778
[22] B. A. Davey and H. A. Priestley, Introduction to Lattices and Order,
2nd ed. Cambridge University Press, 2002. [Online]. Available:
https://doi.org/10.1017/CBO9780511809088
[23] R. Drechsler and D. Sieling, “Binary decision diagrams in theory
and practice,” International Journal on Software Tools for Technology
Transfer, vol. 3, no. 2, pp. 112–136, May 2001. [Online]. Available:
https://doi.org/10.1007/s100090100056
[24] S. Minato, “Zero-suppressed BDDs and their applications,” International
Journal on Software Tools for Technology Transfer, vol. 3, no. 2,
pp. 156–170, May 2001. [Online]. Available: https://doi.org/10.1007/
s100090100038
[25] A. Mishchenko, Introduction to Zero-Suppressed Decision Diagrams.
Morgan & Claypool, Dec. 2014, ch. 1. [Online]. Available:
https://doi.org/10.2200/S00612ED1V01Y201411DCS045
[26] F. Somenzi, CUDD: CU Decision Diagram Package, 2018. [Online].
Available: https://add-lib.scce.info/assets/documents/cudd-manual.pdf
[27] Y. S. Li and S. Malik, “Performance analysis of embedded software
using implicit path enumeration,” in Proceedings of the ACM SIGPLAN
1995 Workshop on Languages, Compilers, & Tools for Real-Time
Systems (LCT-RTS 1995). La Jolla, California, June 21-22, 1995, 1995,
pp. 88–98. [Online]. Available: https://doi.org/10.1145/216636.216666
[28] V. Touzeau, C. Maı¨za, D. Monniaux, and J. Reineke, “Fast and
exact analysis for LRU caches,” Proc. ACM Program. Lang.,
vol. 3, no. POPL, pp. 54:1–54:29, Jan. 2019. [Online]. Available:
https://doi.org/10.1145/3290367
[29] T. Lundqvist and P. Stenstro¨m, “Timing anomalies in dynamically
scheduled microprocessors,” in Proceedings of the 20th IEEE Real-Time
Systems Symposium, Phoenix, AZ, USA, December 1-3, 1999, 1999, pp.
12–21. [Online]. Available: https://doi.org/10.1109/REAL.1999.818824
[30] J. Reineke, B. Wachter, S. Thesing, R. Wilhelm, I. Polian, J. Eisinger,
and B. Becker, “A definition and classification of timing anomalies,”
in Proceedings of 6th International Workshop on Worst-Case
Execution Time (WCET) Analysis, Jul. 2006. [Online]. Available:
https://doi.org/10.4230/OASIcs.WCET.2006.671
[31] C. Berg, “PLRU cache domino effects,” in 6th International Workshop
on Worst-Case Execution Time Analysis (WCET’06), ser. OpenAccess
Series in Informatics (OASIcs), F. Mueller, Ed., vol. 4. Dagstuhl,
Germany: Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 2006.
[Online]. Available: https://doi.org/10.4230/OASIcs.WCET.2006.672
[32] G. Gebhard, “Timing anomalies reloaded,” in 10th International
Workshop on Worst-Case Execution Time Analysis (WCET 2010),
ser. OpenAccess Series in Informatics (OASIcs), B. Lisper, Ed.,
vol. 15. Dagstuhl, Germany: Schloss Dagstuhl–Leibniz-Zentrum fuer
Informatik, 2010, pp. 1–10. [Online]. Available: https://doi.org/10.4230/
OASIcs.WCET.2010.1
[33] S. Hahn and J. Reineke, “Design and analysis of SIC: A
provably timing-predictable pipelined processor core,” in 2018 IEEE
Real-Time Systems Symposium, RTSS 2018, Nashville, TN, USA,
December 11-14, 2018, 2018, pp. 469–481. [Online]. Available:
https://doi.org/10.1109/RTSS.2018.00060
[34] S. Hahn, “On static execution-time analysis — compositionality,
pipeline abstraction, and predictable hardware,” Ph.D. dissertation,
Saarland University, Saarbru¨cken, Germany, 2019. [Online]. Available:
https://doi.org/10.22028/D291-27991
[35] H. Falk, S. Altmeyer, P. Hellinckx, B. Lisper, W. Puffitsch,
C. Rochange, M. Schoeberl, R. B. Sorensen, P. Wa¨gemann, and
S. Wegener, “TACLeBench: A benchmark collection to support
worst-case execution time research,” in 16th International Workshop
on Worst-Case Execution Time Analysis, WCET 2016, July 5,
2016, Toulouse, France, 2016, pp. 2:1–2:10. [Online]. Available:
https://doi.org/10.4230/OASIcs.WCET.2016.2
[36] R. B. Franc¸a, D. Favre-Felix, X. Leroy, M. Pantel, and J. Souyris,
“Towards Formally Verified Optimizing Compilation in Flight Control
Software,” in Bringing Theory to Practice: Predictability and
Performance in Embedded Systems, ser. OpenAccess Series in
Informatics (OASIcs), P. Lucas, L. Thiele, B. Triquet, T. Ungerer, and
R. Wilhelm, Eds., vol. 18. Dagstuhl, Germany: Schloss Dagstuhl–
Leibniz-Zentrum fuer Informatik, 2011, pp. 59–68. [Online]. Available:
https://doi.org/10.4230/OASIcs.PPES.2011.59
[37] Micron Technology, Inc., Automotive DDR SDRAM MT46V32M8,
MT46V16M16, Available at https://micron.com/∼/media/documents/
products/data-sheet/dram/mobile-dram/low-power-dram/lpddr/256mb
x8x16 at ddr t66a.pdf.
[38] L. Mauborgne and X. Rival, “Trace partitioning in abstract interpretation
based static analyzers,” in Programming Languages and Systems, 14th
European Symposium on Programming,ESOP 2005, Held as Part of the
Joint European Conferences on Theory and Practice of Software, ETAPS
2005, Edinburgh, UK, April 4-8, 2005, Proceedings, 2005, pp. 5–20.
[Online]. Available: https://doi.org/10.1007/978-3-540-31987-0 2
[39] H. Dell and C. Brand, Personal Communication, 2019.
[40] C. Brand, H. Dell, and T. Husfeldt, “Extensor-coding,” in Proceedings
of the 50th Annual ACM SIGACT Symposium on Theory of Computing,
ser. STOC 2018. New York, NY, USA: ACM, 2018, pp. 151–164.
[Online]. Available: https://doi.org/10.1145/3188745.3188902
