Abstract-In many multi-core architectures, inclusive shared caches are used to reduce cache coherence complexity. However, the enforcement of the inclusion property can cause invalidation of memory blocks at higher cache levels. In order to ensure safety, analysis of cache hierarchies with inclusive caches for worst-case execution time (WCET) estimation is typically based on conservative decisions. Thus, the estimation may not be tight. In order to tighten the estimation, this paper proposes an approach that can more precisely analyze the behavior of a cache hierarchy maintaining the inclusion property. We illustrate the approach in the context of multi-level instruction caches. The approach first analyzes all the inclusive caches in the hierarchy in a bottom-up direction, and then analyzes the remaining non-inclusive caches in a top-down direction. In order to capture the inclusion victims and their effects, we also propose a concept of aging barrier and integrate it with the traditional must and persistence analyses to safely slow down their aging process so as to derive more precise analyses. We evaluate the proposed approach on a set of benchmarks and the evaluation reveals that the estimations are tightened.
I. INTRODUCTION
Hard real-time system design requires WCET estimation for each task. Since the exact WCET of a task is impossible to derive in general, an overestimation is necessary to ensure safety. Yet, in order to maximize resource utilization, the estimation should be as tight as possible. However, due to the complex behavior of many performance enhancing features in modern processors, it is very challenging to safely and tightly estimate the WCET.
Caches are very common in processors in order to bridge the increasing gap between the processor clock cycle time and main memory access time. Although the presence of caches improves the average performance, it poses great challenges to the tightness of WCET estimation. Over the past two decades, the effects of single-level cache behavior on WCET estimation have been studied extensively [17, 19].
Recently, multi-level cache analysis has drawn much attention in real-time systems [8, 12, 4, 18, 9], since there is a rising need to exploit high-performance processors, which are often equipped with multi-level caches. However, compared to single-level cache analysis, multi-level cache analysis is much more challenging. Besides the sequence of memory references, there is a need to take into account the effects of the behavior of one cache level on the behavior of other cache levels (e.g. filtering memory accesses and invalidating memory blocks), which can be different depending on the type of the cache hierarchy.
Typically, there are three cache hierarchy types: inclusive, exclusive, and non-inclusive. Multi-level inclusive caches require that the contents at upper cache levels be a subset of the contents at lower cache levels. In contrast, multi-level exclusive caches require that the contents at a cache level not be duplicated at any other cache level. Multi-level non-inclusive caches allow duplicated contents at any cache level, but they do not strictly enforce inclusion. Moreover, there are some hybrid cache hierarchies, in which some cache levels are inclusive and/or exclusive and the other levels are non-inclusive. In this paper, we call a cache hierarchy a multi-level inclusive cache as long as it maintains the inclusion property at some cache level(s).
Compared to an exclusive/non-inclusive cache hierarchy, a cache hierarchy enforcing inclusion has less effective cache capacity, but the inclusion property can significantly simplify the maintenance of cache coherence [1]. Therefore, multi-level inclusive caches are widely used in many multi-core architectures. A multi-level cache analysis framework that can precisely analyze cache hierarchies that enforce inclusion becomes necessary for WCET estimation.
Most of the current approaches target multi-level non-inclusive cache analysis, and it is not straightforward to extend these approaches to tightly analyze inclusive caches, since the invalidation behavior introduced by maintaining the inclusion property requires making conservative decisions in order to ensure safety [9]. The main idea in this paper is that this pessimism can actually be reduced by analyzing the multi-level inclusive caches in a bottom-up direction, which is counter-intuitive compared with the natural top-down cache hierarchy access direction that is used in existing methods for multi-level cache analysis. In this paper, the top-down direction refers to the direction from the uppermost cache level (i.e. L1) down to the lowest cache level, and the bottom-up direction refers to the opposite.
The main technical contributions of this paper are: (1) We propose an approach which analyzes all the inclusive caches in the bottom-up direction first, and then analyzes the remaining non-inclusive caches in the top-down direction. Due to the bottom-up analysis, the invalidation behavior becomes visible at the time of analyzing upper levels; (2) We propose the concept of an aging barrier to capture the effects of the invalidations caused by inclusive caches; by using aging barriers, we can safely slow down the increase of memory block ages in a cache that is above an inclusive cache level, so more precise must and persistence analyses can be achieved; (3) We evaluate the proposed approach using a set of benchmarks, and we find that it can tighten the WCET estimation by 12.2% on average, compared to the approach proposed in [9]. In this paper, we only consider multi-level inclusive instruction caches for a single processor. Although the effects of data references and inter-core interferences are not considered, this approach can serve as a basis for such extensions.
The rest of the paper is organized as follows: Section II shows why a multi-level inclusive cache is hard to analyze for WCET estimation; Section III gives the system model considered in this paper; Section IV presents our multi-level inclusive cache analysis; Section V evaluates the proposed approach; Section VI describes the related work; and Section VII concludes this paper.
II. PROBLEM STATEMENT
In the case of single-level cache analysis, only the effects of the memory reference sequences need to be taken into account. In order to make the analysis scalable, most of the approaches are based on abstract interpretation. An abstract interpretation based approach aims to assign a cache hit/miss classification (CHMC) to each memory reference according to the abstract cache states (ACSs) derived by three different analyses [19, 5] . The analyses are usually performed on the control-flow graph (CFG) reconstructed from the low-level code of the program. At a given program point, a must analysis is used to determine the set of memory blocks that are definitely in the cache, so a memory reference to a block being in the set can be classified as always hit (AH); a may analysis is used to determine the set of memory blocks that are possibly in the cache, so a memory reference to a block not being in the set can be classified as always miss (AM); a persistence analysis is used to determine the set of memory blocks that stay in the cache once they are loaded, and a memory reference to such a block is classified as persistent (PS) or first miss (FM); and, if a memory reference cannot be classified as AH, AM, or PS, it is classified as not classified (NC).
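The classification rules above can be sketched as a small decision function. This is a minimal illustration, assuming the three analyses have already produced, for the program point of the reference, plain sets of block identifiers; the names `CHMC`, `classify`, and the set parameters are hypothetical and not taken from any existing tool.

```python
from enum import Enum


class CHMC(Enum):
    AH = "always hit"
    AM = "always miss"
    PS = "persistent"
    NC = "not classified"


def classify(block, must_blocks, may_blocks, persistent_blocks):
    """Assign a CHMC to a reference to `block`.

    must_blocks:       blocks definitely in the cache (must analysis)
    may_blocks:        blocks possibly in the cache (may analysis)
    persistent_blocks: blocks that stay cached once loaded (persistence analysis)
    """
    if block in must_blocks:
        return CHMC.AH          # guaranteed to be cached
    if block not in may_blocks:
        return CHMC.AM          # cannot possibly be cached
    if block in persistent_blocks:
        return CHMC.PS          # at most the first access misses
    return CHMC.NC              # none of the analyses gives a guarantee
```

For instance, `classify("m1", {"m1"}, {"m1", "m2"}, set())` yields `CHMC.AH`, while a block that appears only in the may set falls through to `CHMC.NC`.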
When analyzing multi-level caches, it is also important to consider the effects of other cache levels, like cache access filtering and memory block invalidation. For example, if we treat every possible access at a level as always happening, the analysis may become unsafe, since doing so may underestimate the set reuse distances¹ of memory blocks [8].
For a reference at a cache level, a cache access classification (CAC) can be used to represent whether the cache access at this level will occur: always (A) denotes the access will always occur, never (N) denotes the access will never happen, and uncertain (U) denotes the access may occur [8] . In order to ensure safety, the updates of the abstract cache states due to U accesses need to take into account the two possible cases (access occurring and not occurring).
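To illustrate how a U access can be handled, the sketch below applies the standard LRU must update and then joins the result with the unchanged state, so both cases (access occurring and not occurring) are covered. It is an assumption-laden illustration: one abstract set state is modeled as a dictionary from block to maximal age, and the function names are hypothetical.

```python
def must_update(state, block, assoc):
    """Standard LRU must update of one abstract set state.

    `state` maps a block to its maximal age (1 = youngest, `assoc` = oldest).
    """
    old_age = state.get(block, assoc + 1)   # assoc + 1 means "not in the state"
    new_state = {}
    for other, age in state.items():
        if other == block:
            continue
        aged = age + 1 if age < old_age else age
        if aged <= assoc:                   # blocks aged past `assoc` are dropped
            new_state[other] = aged
    new_state[block] = 1
    return new_state


def must_join(state1, state2):
    """Must join: keep blocks present in both states with their maximal age."""
    return {b: max(state1[b], state2[b]) for b in state1.keys() & state2.keys()}


def must_update_with_cac(state, block, assoc, cac):
    """Update according to the cache access classification A, N, or U."""
    if cac == "N":
        return dict(state)                  # access never occurs
    updated = must_update(state, block, assoc)
    if cac == "A":
        return updated                      # access always occurs
    return must_join(updated, state)        # U: cover both possibilities
```

With a 2-way set holding `{"a": 1, "b": 2}`, a U access to `c` leaves only `{"a": 2}` in the must state: `c` is not guaranteed to be loaded, and `b` is not guaranteed to survive.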
In the case of multi-level non-inclusive cache analysis, the CAC for a reference r at a cache level l can be derived from the CHMC and CAC for r at l − 1 (as described in [8] ), and the behavior of l will not be affected by any lower cache level. However, in the case of analyzing cache hierarchies containing inclusive caches, the CAC for r at l cannot be safely derived from CHMC and CAC for r at l − 1. The reason is the behavior of l depends not only on the behavior of l − 1, but also on the invalidation behavior induced by some lower inclusive cache level(s): When a memory block is evicted from a lower inclusive cache level, all the contents that belong to this memory block need to be invalidated from its upper cache levels (the invalidated memory blocks are called inclusion victims).
Example: Fig. 1 shows a 3-level inclusive cache, where L1 is 2-way set associative, L2 is 4-way set associative, and L3 is 8-way set associative (at each level, only one set is shown). We assume L1 has the smallest cache block size and L3 has the biggest, so a block in L1 is a sub-block of some block in L2 and that block in L2 is a sub-block of some block in L3. For a memory block $m$ in L3, let $\dot{m}$ denote a sub-block of $m$ in L2, and let $\ddot{m}$ denote a sub-block of $\dot{m}$ in L1. For example, we have $\ddot{m}_a \subset \dot{m}_a \subset m_a$. If the next reference needs the information that is in $m_x$ ($m_x$ is also mapped to the shown set of L3), the oldest block $m_a$ in that set needs to be evicted. The eviction of $m_a$ will also invalidate $\ddot{m}_a$ in L1 and $\dot{m}_a$ in L2 to maintain the inclusion property. Due to the invalidation, $\ddot{m}_h$ in L1 can live longer, and depending on which sub-block of $m_x$ is needed by the reference, there may be some "holes" left in L1 and L2.
In [9], multi-level non-inclusive cache analysis is adapted to multi-level inclusive cache analysis. To achieve this, several conservative decisions are made on the CAC and CHMC for a reference at a cache level, due to any possible invalidation, to ensure safety: (1) Except for L1, which is always accessed, the CAC at any other level should be classified as U; (2) If a reference is classified as AH or PS at a level, this CHMC may be changed into NC depending on the analysis of lower inclusive levels; (3) Even if a memory reference is classified as AM at a level, this CHMC has to be changed into NC. In this way, although safety is ensured, the tightness of the estimation may suffer considerably. Therefore, we need a method that can more precisely analyze the effects of multi-level inclusive caches on WCET estimation.
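The inclusion-victim effect can be reproduced with a small concrete simulation. The sketch below is a simplification and not the configuration of Fig. 1: it assumes only two levels (a 2-way L1 above a 4-way inclusive L2), one set per level, equal block sizes, and LRU replacement; all names are hypothetical.

```python
def access(l1, l2, block, l1_ways=2, l2_ways=4):
    """Access `block`; l1 and l2 are LRU-ordered lists (index 0 = youngest)."""
    if block in l1:
        # L1 hit: the access is filtered, so L2's LRU order is NOT refreshed.
        l1.remove(block)
        l1.insert(0, block)
        return "L1 hit"
    if block in l2:
        l2.remove(block)
        l2.insert(0, block)
        result = "L2 hit"
    else:
        if len(l2) == l2_ways:
            victim = l2.pop()            # evict the oldest block of L2
            if victim in l1:
                l1.remove(victim)        # inclusion victim: leaves a "hole" in L1
        l2.insert(0, block)
        result = "miss"
    if len(l1) == l1_ways:
        l1.pop()                         # no hole available: evict L1's oldest block
    l1.insert(0, block)
    return result


l1, l2 = [], []
trace = [access(l1, l2, b) for b in "abacadaea"]
# Block "a" keeps hitting in L1, so it ages in L2 until L2 evicts it;
# the eviction invalidates "a" in L1 and the final access to "a" misses.
```

In this trace the last access to `a` misses even though `a` was the most recently used block in L1 just before `e` was loaded, which is exactly the behavior a top-down analysis cannot see without information from the lower level.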
III. SYSTEM MODEL
We focus on a general multi-level inclusive cache model. The model has $p$ cache levels, where $p \ge 2$, among which $q$ levels are inclusive, where $p > q \ge 1$, and the other $p - q$ levels are non-inclusive². We also assume the time for a processing element to access a cache level is bounded and predictable, which can be achieved by using deterministic interconnects to connect the caches, like TDMA buses [11].
Let $L = \{l_x \mid 1 \le x \le p\}$ be the set of all the cache levels, in which $l_x$ denotes the $x$-th cache level. Let $I$ be the set of all the inclusive cache levels, and let $N$ be the set of all the non-inclusive cache levels. Thus, we have $L = I \cup N$, $I \cap N = \emptyset$, and $|I| = q$. Since it does not matter whether $l_1$ is inclusive or non-inclusive, we can simply assume $l_1 \in N$, so neither $I$ nor $N$ is an empty set. Fig. 2 gives two examples of the models focusing on single cores of two multi-core architectures.
We assume at each level the cache is set associative, and the least recently used (LRU) replacement policy is used. The size of a cache block can be different at different cache levels, and it is common to assume the block size does not increase as the level goes up. It is also common to assume the capacity decreases as the level goes up. Let $C_{l_x}$ denote the cache at the cache level $l_x$, let $A_{l_x}$ denote the associativity of $C_{l_x}$, and let $s_{l_x}$ denote the number of cache sets of $C_{l_x}$. Sometimes we use "cache level" to actually mean the cache located at that level if there is no ambiguity.
Although we do not consider exclusive caches in the model, we can easily add them into our analysis by using the approach proposed in [9] . Basically, the exclusive cache levels can be collapsed by concatenating them to the end of the upper level to form a single level for the analysis, as long as they all have the same number of cache sets and the same cache block size. In this paper, we focus on how to analyze multi-level caches in the presence of invalidations caused by inclusion enforcement, so we simply consider multi-level instruction caches in terms of a single processor. This work can serve as a basis for analysis of multi-level data or unified caches, that may also suffer from invalidations, in terms of a multi-core processor.
In order to facilitate the presentation, we introduce the following notations. As described in [19] , an abstract cache state is a mapping from a cache set number to an abstract set state, where an abstract set state is a mapping from a position to a set of memory blocks. [10] or [5] .
For a memory reference $r$ at a cache level $l_x$, let $m^r_{l_x}$ denote the memory block that contains the information $r$ needs with respect to the cache block size and the number of cache sets in $C_{l_x}$. We use $m^r_{l_x} \in C_{l_x}$ to denote that the needed memory block is in the corresponding concrete set state of $C_{l_x}$, and use $m^r_{l_x} \in \alpha^t_{l_x}$ to denote that the block is in the corresponding abstract set state of the $t$-analysis at this level, where $t$ is either must, may, or persistence.
IV. MULTI-LEVEL INCLUSIVE CACHE ANALYSIS: GOING TOP-DOWN OR BOTTOM-UP?
To our knowledge, existing work analyzes the cache hierarchies in a top-down direction, since it is the natural direction of accessing a multi-level cache. As long as there are no invalidations at any cache level, a top-down analysis can be safe and precise. However, when there are inclusive caches in the cache hierarchy, a top-down analysis cannot capture the possible invalidation behavior precisely, since the invalidations appearing at a cache level are actually caused by the inclusive caches located below this level. Thus, as discussed in [9] , conservative decisions have to be made to ensure safety which makes the analysis pessimistic.
In order to make the analysis of multi-level inclusive caches more precise, we propose a safe approach which analyzes the cache hierarchy in a rather counter-intuitive way: We first analyze all the inclusive cache levels in the bottom-up direction so as to make the possible invalidation behavior visible at a cache level, and then we analyze all the non-inclusive levels in the traditional top-down direction taking into account the revealed invalidations. The analysis process is shown in Fig. 3 . Our bottom-up analysis of inclusive caches is based on the following observation, that is related to the amount of information that can be derived for the access to an inclusive cache level ly from the state of C ly . to handle the access uncertainty so as to carry out a safe t-analysis at this level, where t is either must, may, or persistence [8] . However, the more uncertainty we can resolve, the more precise the analysis can become.
A. Last Inclusive Cache Analysis
The proposed multi-level inclusive cache analysis begins with the last inclusive cache. There can be other non-inclusive caches located between the last inclusive cache and the main memory. Let us assume the last inclusive cache level corresponds to lLIC ∈ I, so we have ∀lx ∈ L : x > LIC =⇒ lx ∈ N . , we can safely categorize r as AM at any cache level lx where 1 ≤ x ≤ LIC, since, according to the inclusion property, if a memory block is absent from the underlying inclusive cache, it is also absent from all of the included upper-level caches. Therefore, compared to the top-down approach proposed in [9] , which needs to conservatively change any reference classified as AM to NC at any cache level, the approach is more precise.
2) Last Inclusive Cache Must and Persistence Analysis: At a program point, the proposed must and persistence analyses of the last inclusive cache depend on the $\alpha^{may}_{l_{LIC}}$ of that point. This is because only the information deduced from $\alpha^{may}_{l_{LIC}}$ can be used to determine whether $l_{LIC}$ will definitely be accessed, according to Lemma 1.
For the last inclusive cache must (resp. persistence) analysis, we define the join function J must LIC (resp. J pers LIC ) and update function U must LIC (resp. U pers LIC ) as follows: , this reference will cause no cache misses at lLIC, but may result in misses at a cache level lx where 1 ≤ x < LIC. In other words, this classification for this memory reference is only locally safe. If r is classified as AH by the last inclusive cache must analysis, no memory blocks need to be evicted from C lx because of this reference, so no invalidations are enforced by C lLIC . Similarly, if a memory reference r is classified as PS by the last inclusive cache persistence analysis (i.e. m r lLIC is not in of the corresponding set of α pers lLIC ), r will result in at most one cache miss at lLIC, but may cause more than one misses at a cache level lx where 1 ≤ x < LIC. Finally, if r is classified as PS by the last inclusive cache persistence analysis, at most one memory block will be evicted from C lLIC so that at most one invalidation enforcement can be caused because of r.
B. Aging Barriers
In order to analyze a cache located above an inclusive cache level more precisely, the effects of the invalidations need to be captured. Since the invalidations are caused by lower inclusive caches, compared to the top-down approach, one advantage of the bottom-up approach is the invalidation behavior becomes visible when analyzing an upper level.
At a cache level, if a memory block is invalidated due to the maintenance of the inclusion property, a "hole" will be left in the cache; until this "hole" is filled by some memory block, any access to the corresponding cache set will not increase the ages of the memory blocks that are behind this "hole". Yet, it does not mean the age of a memory block behind the "hole" will not be decreased, since a reference to such a block will decrease its age to 1 and fill the "hole", in which case another "hole" will be created behind the filled "hole". A "hole" will be filled and no new one will be created when the referenced memory block is not in the cache.
We propose a concept of aging barrier to capture this "hole" behavior so as to perform more precise must and persistence analyses of a cache that may suffer from invalidations. Without loss of generality, we present the concept in terms of an A-way set associative cache C which has s cache sets.
Definition 1 (Aging Barrier). A valid aging barrier $(i, j)$, where $i \in \{1, \cdots, s\}$ and $j \in \{1, \cdots, A\}$, represents an unused position within the range $[1, j]$ in the $i$-th cache set, which prevents the age of any memory block in the $i$-th abstract set state of $\alpha^{must}$ or $\alpha^{pers}$ from increasing on an access if that age is already greater than or equal to $j$.
We treat an aging barrier $(i, j)$ as an abstract must "hole": if there is a valid aging barrier $(i, j)$ at a program point, then in any concrete state of $C$ there must be a corresponding real "hole" in the $i$-th cache set of $C$ within the position range $[1, j]$. Thus, $j$ serves as the position upper bound of the real "hole". For example, the aging barrier $(1, 2)$ represents that either the youngest or the second youngest memory block in the 1st cache set is invalidated and the position it occupied becomes available.
It is possible to have multiple valid aging barriers with respect to the $i$-th cache set, which are listed as $(i, j_1), \cdots, (i, j_k)$ where $k \ge 1$. In that case, there are certainly at least $k$ real "holes" in the $i$-th cache set, whose positions are bounded by $j_1, \cdots, j_k$ respectively. Note that it is valid to have multiple identical $j$'s with respect to the $i$-th cache set, as long as the multiset³ formed by these upper bounds satisfies the condition: given any position $pos$ in the cache set, the total number of $j$'s with $j \le pos$ is at most $pos$. Let $\Xi$ denote the set of all of the valid multisets formed by "hole" position upper bounds of a cache set. Formally, we have:

$$\Xi = \{\xi \mid \xi = \emptyset \,\vee\, (max(\xi) \le A \,\wedge\, \forall pos \in \{1, \cdots, A\} : \sum_{j=1}^{pos} \nu(\xi, j) \le pos)\}$$

where $max(\xi)$ gives the maximum member and $\nu(\xi, j)$ gives the multiplicity of $j$ in the multiset $\xi$.
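The validity condition on a multiset of "hole" position upper bounds can be checked mechanically. The following is a minimal sketch under two assumptions: a multiset is represented as a Python list of integers, and the largest admissible bound is the associativity $A$ (inferred from the fact that $\{1, \cdots, A\}$ is used later as a valid multiset). The function name is hypothetical.

```python
from collections import Counter


def is_valid_bounds(bounds, assoc):
    """Return True if the multiset `bounds` of hole position upper bounds is valid.

    Valid means: every bound lies in [1, assoc], and for every position pos
    the number of bounds j with j <= pos is at most pos.
    """
    counts = Counter(bounds)
    if any(j < 1 or j > assoc for j in counts):
        return False
    total = 0
    for pos in range(1, assoc + 1):
        total += counts.get(pos, 0)      # running count of bounds <= pos
        if total > pos:                  # more holes than positions in [1, pos]
            return False
    return True
```

For a 4-way cache, `[2, 2, 3]` passes the check, whereas `[1, 1]` does not, because two holes cannot both lie at position 1.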
Definition 2 (Aging Barrier State). An aging barrier state β : {1, · · · , s} → Ξ is a mapping from a cache set number to a multiset of "hole" position upper bounds.
Given an aging barrier state $\beta$, the set of all the valid aging barriers is $\{(i, j)^{\nu(\beta(i), j)} \mid i \in \{1, \cdots, s\} \wedge j \in \beta(i)\}$, which is a multiset and uses $\nu(\beta(i), j)$ as the multiplicity of $(i, j)$. Let $ABS$ denote the set of all the aging barrier states of $C$. We define three functions to operate on the aging barrier states. Let $\top = A + 1$ be the invalid aging barrier indicator. The function $A : ABS \times \{1, \cdots, s\} \times \{1, \cdots, A, \top\} \to ABS$ is used to add an aging barrier into the state and is defined as:

$$A(\beta, i, j) = \begin{cases} \beta[i \mapsto \beta(i) \uplus \{j\}] & \text{if } \beta(i) \uplus \{j\} \in \Xi \\ \beta & \text{otherwise} \end{cases}$$

³ A multiset is a set in which members are allowed to appear more than once.
The function adds the aging barrier $(i, j)$ into the state $\beta$ only if the result of $\beta(i) \uplus \{j\}$ ($\uplus$ is the multiset sum operation) is a member of $\Xi$; otherwise, it keeps $\beta$ unchanged. For example, given a 4-way set associative cache (i.e. $A$ is 4), when we want to add an aging barrier $(1, 3)$ into the state $\beta$, the function $A$ needs to check if $\beta(1) \uplus \{3\}$ is a member of $\Xi$. Assume we have $\beta(1) = \{2, 2\}$; then $\beta(1) \uplus \{3\} = \{2, 2, 3\}$ is a member of $\Xi$ according to the condition given above: the maximum member in $\{2, 2, 3\}$ is 3, which is less than 4, and no matter what $pos$ is, the total number of the members that are less than or equal to $pos$ is at most $pos$. Therefore, after applying $A(\beta, 1, 3)$, we will have $\beta(1) = \{2, 2, 3\}$.
The function $U : ABS \times \{1, \cdots, s\} \to ABS \times \{1, \cdots, A, \top\}$ is used to acquire an aging barrier from the state and is defined as:

$$U(\beta, i) = \big(\beta[i \mapsto \beta(i) \setminus \{minc(\beta(i))\}],\; minc(\beta(i))\big)$$

Given a cache set number $i$, the resultant aging barrier depends on whether the mapped multiset $\beta(i)$ is empty: if $\beta(i)$ is not empty, $minc(\beta(i))$ equals $min(\beta(i))$, the minimum member in $\beta(i)$, and the composite $(i, min(\beta(i)))$ will be a valid aging barrier; otherwise, $minc(\beta(i))$ equals $\top$ and there is no valid aging barrier for the $i$-th cache set. Since a valid aging barrier may be acquired, in which case this aging barrier should no longer be in the state, the function changes the state by mapping $i$ to $\beta(i) \setminus \{minc(\beta(i))\}$ ($\setminus$ is the multiset asymmetric difference operation). For example, let us continue with the last example, in which we have $\beta(1) = \{2, 2, 3\}$. Since the minimum member in $\{2, 2, 3\}$ is 2, after applying $U(\beta, 1)$, we have a valid aging barrier $(1, 2)$ and $\beta(1)$ becomes $\{2, 3\}$.
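The add and acquire operations described above can be sketched as follows. This is an illustrative model, not the paper's implementation: an aging barrier state is assumed to be a dictionary from a cache set number to a list of upper bounds, the invalid indicator is represented by `assoc + 1`, and all function names are hypothetical.

```python
def is_valid(bounds, assoc):
    """Validity check for a multiset of hole position upper bounds."""
    in_range = all(1 <= j <= assoc for j in bounds)
    return in_range and all(
        sum(1 for j in bounds if j <= pos) <= pos for pos in range(1, assoc + 1)
    )


def add_barrier(state, i, j, assoc):
    """Add barrier (i, j) if the resulting multiset is valid; else keep the state."""
    candidate = sorted(state.get(i, []) + [j])
    if not is_valid(candidate, assoc):
        return dict(state)               # also covers j == assoc + 1 (invalid)
    new_state = dict(state)
    new_state[i] = candidate
    return new_state


def acquire_barrier(state, i, assoc):
    """Remove and return the smallest bound of set i, or assoc + 1 if none exists."""
    bounds = sorted(state.get(i, []))
    if not bounds:
        return dict(state), assoc + 1    # no valid aging barrier for set i
    new_state = dict(state)
    new_state[i] = bounds[1:]
    return new_state, bounds[0]


beta = add_barrier({1: [2, 2]}, 1, 3, assoc=4)    # beta[1] is now [2, 2, 3]
beta_after, j = acquire_barrier(beta, 1, assoc=4)  # j is 2, beta_after[1] is [2, 3]
```

The two calls at the end replay the running example of the text for a 4-way cache.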
The function J : ABS × ABS → ABS is used to join two aging barrier states and is defined as:
where ). When joining two aging barrier states, for the i th cache set, the cardinality of β1(i) c β2(i) (i.e. k) is the smaller one of the cardinalities of β1(i) and β2(i), which implies the number of aging barriers that can be derived from J (β1, β2) will never exceed that derived from either β1 or β2. In the case of k ≥ 1, j1 is the bigger one between the two minimum members of β1(i) and β2(i), which safely captures an aging barrier since there must be a "hole" within position range [1, j1] 
Definition 3 (Partial Ordering). Let $\beta_1$ and $\beta_2$ be two aging barrier states. We define $\beta_1 \sqsubseteq \beta_2$ if and only if $\forall i \in \{1, \cdots, s\}$ :
Therefore, we have $\beta_1 \sqsubseteq \beta_2$ if and only if, for any cache set $i$, the mapped multisets $\beta_1(i)$ and $\beta_2(i)$ satisfy: the number of members of $\beta_2(i)$ is not greater than that of $\beta_1(i)$, and when we iterate the two multisets in ascending order in parallel, the iterated number from $\beta_2(i)$ is not smaller than that from $\beta_1(i)$. According to Definition 3, we can deduce that $\beta_1 \sqsubseteq J(\beta_1, \beta_2)$ and $\beta_2 \sqsubseteq J(\beta_1, \beta_2)$. Let $\beta_\bot = \{i \mapsto \{1, \cdots, A\} \mid i = 1, \cdots, s\}$ and $\beta_\top = \{i \mapsto \emptyset \mid i = 1, \cdots, s\}$; thus, according to Definition 3, we can deduce that $\forall \beta \in ABS : \beta_\bot \sqsubseteq \beta \sqsubseteq \beta_\top$.
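The join and the partial ordering can be sketched together. The code below is one reading of the textual description, stated as an assumption: the join of two multisets keeps as many bounds as the smaller multiset has and takes the pairwise maximum when both are iterated in ascending order. States are dictionaries from set number to a list of bounds, and the names `join` and `leq` are hypothetical.

```python
def join(state1, state2, num_sets):
    """Join two aging barrier states set by set (pairwise max in ascending order)."""
    joined = {}
    for i in range(1, num_sets + 1):
        b1 = sorted(state1.get(i, []))
        b2 = sorted(state2.get(i, []))
        # zip truncates to the shorter multiset, so the cardinality is the minimum
        joined[i] = [max(x, y) for x, y in zip(b1, b2)]
    return joined


def leq(state1, state2, num_sets):
    """Return True if state1 is below or equal to state2 in the partial ordering."""
    for i in range(1, num_sets + 1):
        b1 = sorted(state1.get(i, []))
        b2 = sorted(state2.get(i, []))
        if len(b2) > len(b1):
            return False                 # state2 may not promise more holes
        if any(y < x for x, y in zip(b1, b2)):
            return False                 # state2's bounds may not be tighter
    return True


s1 = {1: [2, 2, 3]}
s2 = {1: [1, 4]}
joined = join(s1, s2, num_sets=1)        # {1: [2, 4]}
```

In this example both inputs are below the joined state, and the joined state never promises more "holes" or tighter bounds than either input, which is what makes the join safe.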
C. Integrating Aging Barriers into Update Functions
In order to realize more precise must and persistence analyses of the caches which suffer from invalidations, we need to integrate the aging barriers into the update functions of these analyses. Let $M$ denote the set of all the memory blocks. Given a reference to a memory block $m \in M$ that is mapped to the $i$-th set of $C$ and an aging barrier $(i, j)$, where $j \in \{1, \cdots, A, \top\}$ (recall that we use $\top$ as the invalid aging barrier indicator), we redefine the update function $U^{must} : ACS^{must} \times M \times \{1, \cdots, A, \top\} \to ACS^{must}$ for the must analysis as:
The rationale of the redefined update function is as follows. If there is no valid aging barrier available (i.e. $j = \top$), or if the current valid aging barrier $(i, j)$ is not needed (i.e. $m \in \alpha^{must}(i)(h) \wedge j \le A \wedge h \le j$, in which case this update never attempts to affect the ages of the memory blocks "protected" behind this aging barrier), then we can simply use the normal $U^{must}$ to update $\alpha^{must}$. Otherwise, the current aging barrier prevents the memory blocks that are behind it in the corresponding abstract set state from aging: it means there is a "hole" at or before position $j$ that needs to be filled, so we only increase the ages of the memory blocks up to $j$ and leave the ages of the other blocks unchanged (except $m$, which is moved to the first age position if it is in the current state). Fig. 4 shows an example of using an aging barrier to update $\alpha^{must}$ more precisely. If $m_c$ in $\alpha^{must}$ is invalidated, since it is definitely in the cache before the invalidation with an overestimated maximal age of 3, a "hole" will definitely appear within the range $[1, 3]$, namely we have an aging barrier with $j = 3$. When $m_d$ is referenced, even if it is not in the cache, there is a "hole" to fill, so the maximal ages of $m_b$ and $m_a$ should not be increased. Therefore, using the redefined function $U^{must}$ leads to a more precise analysis.
Similarly, when updating $\alpha^{pers}$, given an aging barrier $(i, j)$ and the $k$ which is the affected position range upper bound when applying the normal $U^{pers}$, if we have $k \le j$, we simply perform the normal $U^{pers}$; otherwise, we know that in any concrete state of $C$ there will be a "hole" in the $i$-th cache set within the position range $[1, j]$, so we can take advantage of this information to carry out a more precise update.
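The rationale above can be turned into a small executable sketch of a barrier-aware must update. It is an interpretation of the prose and not the paper's formal definition: one abstract set state is a dictionary from block to maximal age, the invalid indicator is `assoc + 1`, and the example state is hypothetical rather than the one drawn in Fig. 4.

```python
def must_update_with_barrier(state, block, assoc, j):
    """Must update of one abstract set state under an aging barrier bound j.

    j == assoc + 1 means that no valid aging barrier is available.
    """
    invalid = assoc + 1
    h = state.get(block, invalid)        # maximal age of the referenced block
    if j == invalid or h <= j:
        limit = h                        # normal update: blocks younger than h age
    else:
        limit = j                        # barrier: blocks with age >= j do not age
    new_state = {}
    for other, age in state.items():
        if other == block:
            continue
        aged = age + 1 if age < limit else age
        if aged <= assoc:
            new_state[other] = aged
    new_state[block] = 1                 # referenced block becomes the youngest
    return new_state


state = {"x": 1, "b": 3, "a": 4}         # 4-way set after an invalidation
with_barrier = must_update_with_barrier(state, "d", assoc=4, j=3)
without_barrier = must_update_with_barrier(state, "d", assoc=4, j=5)
```

With the barrier, `b` and `a` keep their ages 3 and 4 because the new block fills the "hole"; without it, `b` ages to 4 and `a` is dropped from the must state.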
We redefine the update function if we have j < k, for the memory blocks whose maximal ages are already greater than or equal to j in the i th abstract set state, their ages will not be increased (but one of them may be decreased to 1 if that block is the referenced one).
We maintain an aging barrier state for each cache which is located above at least one inclusive cache level so as to achieve more precise analysis (described in the next subsection).
D. Cache Analysis above One Inclusive Cache Level
Since we first analyze the inclusive caches in the bottom-up direction, the analyses of $C_{l_y}$ are already completed at the time of analyzing $C_{l_x}$, and these analyses of $C_{l_y}$ have captured the possible invalidations caused by the inclusive levels lower than $l_y$ if there are any. Thus, from $\alpha^{may}_{l_y}$, we can deduce whether the contents of a memory block are definitely absent from $C_{l_y}$, and from $\alpha^{pers}_{l_y}$, we can deduce whether the contents of a memory block are possibly absent from $C_{l_y}$. Thus, we only need to check $l_x$ against $l_y$ and not any other lower inclusive cache levels.
1) May Analysis: As described in [9], it is unsafe to update the abstract cache state $\alpha^{may}_{l_x}$ without considering the possible invalidations caused by its underlying inclusive levels, since there possibly exist some "holes" so that some memory blocks at $l_x$ may live longer. Fortunately, since we first analyze all the inclusive caches in the bottom-up direction, when we analyze $C_{l_x}$, the invalidation behavior induced by its underlying inclusive levels has already become visible.
First, let us redefine the update function $U^{may}$ for the may analysis of $C_{l_x}$ that is located above the inclusive level $l_y$. Similar to the $U^{must}$ and $U^{pers}$ described in IV-C, given a memory reference $r$, in $U^{may}(\alpha^{may}_{l_x}, m^r_{l_x}, j)$, $j$ controls the upper bound on the aging process. However, different from $U^{must}$ and $U^{pers}$, where $j$ is given by an aging barrier, here $j$ is decided by finding the youngest position in which there is a possible inclusion victim (i.e. there is possibly a "hole" within the range $[j, A_{l_x}]$ if such a $j$ can be found). Thus, if we have $j = \top$, we just perform the normal $U^{may}$; otherwise, for the memory blocks whose ages are already greater than or equal to $j$, their ages will not be increased (but may be decreased to 1 by the reference). The steps to update $\alpha^{may}_{l_x}$ are given in Algorithm 1. The first loop (lines 4-6) checks whether there is a memory block $m_{l_x}$ whose contents are in a block located in a position of $\alpha^{pers}_{l_y}$ after the reference $r$ (i.e. $\alpha^{pers}_{l_y}$ has taken into account the effect of the reference), namely it checks if $m_{l_x}$ is a sub-block of a possibly evicted memory block due to the reference at $l_y$. If there is such a block found in a position $k \le A_{l_x}$, increasing the ages of the memory blocks which are not less than $k$ may make the may analysis unsafe (since there may be a "hole" within the range $[k, A_{l_x}]$), so we set $j$ as the youngest $k$; otherwise, $j$ is $\top$. If $l_x$ is an inclusive level (lines 7-9), we are still moving up in the cache hierarchy, so it is not possible to decide the access occurrence by using the traditional CAC method. Therefore, like in the last inclusive cache analyses, the algorithm checks against itself (i.e. α by taking into account the two cases (i.e. access occurring and not occurring). If $l_x$ is a non-inclusive level (lines 10-14), we have already analyzed all the inclusive levels and are moving down in the cache hierarchy.
Therefore, no matter which type lx−1 is, where x > 1 (when lx is l1, it is always accessed), the analyses of C l x−1 have been completed, so it is possible to derive the CAC for r at lx from the CHMC and CAC for r at lx−1, and then to update the α has reached a fixed-point).
2) Must Analysis: In the must analysis of C lx , we maintain both the abstract cache state α must lx and the aging barrier state β lx . As we discussed above, at a join point, we simply perform J must (α must lx,1 , α must lx,2 ) to safely join two abstract cache states. Similarly, given two aging barrier states β lx,1 , β lx,2 , we simply perform J (β lx,1 , β lx,2 ) to join these two aging barrier states. At a program point in a basic block, we update the α must lx and β lx following the steps described in Algorithm 2.
The loop (line 1-7) first checks whether a memory block in α must lx is definitely an inclusion victim (i.e. the contents of the block are not in α may ly after the reference r). If there is such a block, there will be a "hole" created by removing this block from α must lx , since it was definitely in the cache C lx before the reference r. Thus, we add an aging barrier corresponding to this certainly invalidated block into β lx (line 3-6). In order to guarantee safety of the must analysis, the algorithm also (line 7) takes into account all the possibly evicted memory blocks by removing them from the α must lx .
In the next steps, we first acquire an aging barrier $(i, j)$ by applying $(\beta'_{l_x}, j) = U(\beta_{l_x}, i)$ (line 9). Since $l_x$ can be either inclusive or non-inclusive, lines 11-18 take into account the two possibilities, similarly to the corresponding steps in the may analysis. A valid aging barrier $(i, j)$ (i.e. $j \neq \top$) means there must be a "hole" in the $i$-th cache set within the position range $[1, j]$. This differs from Algorithm 1, where $j$ is chosen to be the lower bound on the position of a possible "hole". After updating $\alpha^{must}_{l_x}$, we update the aging barrier state by performing $A(\beta'_{l_x}, i, \max(j, k))$ to add an aging barrier back to the state (line 19). Three cases arise:

(1) If $k \le j$, we perform the normal update function $U^{must}$, and line 19 adds the acquired aging barrier back to the aging barrier state. Since $k \le j$, $\max(j, k)$ is always $j$, so whether or not $j$ is $\top$, the state $\beta''_{l_x}$ obtained after line 19 is the same as the input $\beta_{l_x}$. If $j \neq \top$, the acquired aging barrier is valid and, because $k \le j$, the "hole" it represents has not been filled yet, so line 19 restores the input $\beta_{l_x}$. If $j = \top$, no valid aging barrier was acquired from the input $\beta_{l_x}$ at line 9, so $\beta'_{l_x}$ is still the same as the input $\beta_{l_x}$, and after line 19 $\beta''_{l_x}$ is the same as $\beta'_{l_x}$ and hence as the input $\beta_{l_x}$.

(2) If $j < k = \top$, the referenced memory block $m^r_{l_x}$ is not in the $i$-th set state of $\alpha^{must}_{l_x}$ (since $k = \top$), and the acquired aging barrier is valid (i.e. $j \neq \top$), so $m^r_{l_x}$ is assumed to fill the "hole" represented by this valid aging barrier. Since $\max(j, k) = k = \top$, $A(\beta'_{l_x}, i, \top)$ does not change the state $\beta'_{l_x}$, which reflects that the valid aging barrier has already been used.

(3) If $j < k < \top$, $m^r_{l_x}$ is definitely present in any concrete state, so no other memory block will be loaded due to this reference, and we can safely guarantee that there is a "hole" in the range $[1, k]$, even if the "hole" that was in the range $[1, j]$ has been filled. Since $\max(j, k) = k < \top$, $(i, k)$ is a valid aging barrier, and $A(\beta'_{l_x}, i, k)$ adds this new valid aging barrier to the state $\beta'_{l_x}$.
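The three cases can be summarized by the single rule that the barrier put back has position max(j, k). The following Python sketch illustrates this under stated assumptions: the sentinel for "no valid barrier" or "block absent from the must state" is modeled as `math.inf` and named `TOP`, and the function name and case labels are hypothetical.

```python
import math

TOP = math.inf  # assumed sentinel: no valid barrier / block not in the must state


def barrier_after_reference(j, k):
    """Return (position of the barrier added back, case label).

    j: position of the acquired aging barrier (TOP if none was valid).
    k: age of the referenced block in the must state (TOP if absent).
    A returned position of TOP means that no barrier is added back.
    """
    new_j = max(j, k)
    if k <= j:
        case = "restored"  # case (1): barrier state equals the input state
    elif k == TOP:
        case = "consumed"  # case (2): the hole is assumed to be filled
    else:
        case = "moved"     # case (3): a hole is still guaranteed within [1, k]
    return new_j, case
```

For instance, `barrier_after_reference(2, 3)` yields a barrier at position 3, while `barrier_after_reference(2, TOP)` yields no barrier.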
3) Persistence Analysis: For the persistence analysis, the steps to update $\alpha^{pers}_{l_x}$ are similar to the steps in Algorithm 2. The differences are: (i) we set $j$ according to the aging barrier state $\beta_{l_x}$ maintained by the must analysis of $C_{l_x}$, but we do not change $\beta_{l_x}$ in these steps; that is, we only use the fact that if a valid aging barrier is available before executing the reference, there is a "hole" within the position range $[1, j]$; (ii) we do not remove memory blocks from $\alpha^{pers}_{l_x}$.

There can also be some non-inclusive caches located below the last inclusive cache level, but they do not suffer from any invalidation. When moving down in the cache hierarchy, the analysis of any of them is the same as the traditional multi-level non-inclusive cache analysis. Theoretical analysis of the approach's safety and termination is provided in the appendix.
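The overall order in which the levels are analyzed (inclusive levels bottom-up, then the remaining non-inclusive levels top-down) can be sketched as follows. This is only an illustration of the ordering; the function name and the list-of-pairs representation of the hierarchy are assumptions.

```python
def analysis_order(levels):
    """Return the order in which cache levels are analyzed.

    levels: list of (name, is_inclusive) pairs, ordered from the top of the
    hierarchy (closest to the core) to the bottom. Representation is assumed.
    """
    inclusive = [name for name, is_inclusive in levels if is_inclusive]
    non_inclusive = [name for name, is_inclusive in levels if not is_inclusive]
    # Inclusive levels are visited bottom-up, the rest top-down.
    return list(reversed(inclusive)) + non_inclusive


# Three-level hierarchy with inclusive L2 and L3.
print(analysis_order([("L1", False), ("L2", True), ("L3", True)]))
# prints ['L3', 'L2', 'L1']
```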
V. EVALUATION
The objective of this paper is to tighten the WCET estimation in the presence of inclusive caches. We evaluate the proposed approach and compare it with the approach proposed in [9]. In order to analyze the effects of multi-level inclusive caches, we developed a research prototype tool, which reconstructs the CFG from the binary executable of the program and recursively derives the fixed-points of the abstract cache states at each level. Currently, our tool does not distinguish calling contexts, so overestimations are possible. However, in terms of precision, handling contexts is orthogonal to the problem considered in this paper.
In order to calculate the WCET bound, we apply the widely used IPET (Implicit Path Enumeration Technique) [14] . IPET uses a set of integer linear constraints to combine the flow information and the timing effects of the multi-level caches [9, 12] . In terms of the flow information, the structural constraints are generated directly, but currently the loop bounds need to be determined and input manually in our tool. The CPLEX solver is used to solve the generated ILP (Integer Linear Programming) problems.
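For reference, a generic textbook form of an IPET problem is sketched below. It is not the exact constraint system generated by the tool, which follows [9, 12]. The symbols are assumptions introduced here: $x_b$ is the execution count of basic block $b$, $c_b$ its worst-case cost including cache effects, $f_e$ the traversal count of CFG edge $e$, and $n_h$ the manually supplied bound on the iterations of the loop with header $h$ per loop entry. A virtual edge $e_{start}$ into the entry block is assumed.

```latex
\begin{align}
\text{maximize}\quad & \sum_{b \in B} c_b \, x_b \\
\text{subject to}\quad
& x_b = \sum_{e \in \mathit{in}(b)} f_e = \sum_{e \in \mathit{out}(b)} f_e
  && \forall b \in B, \\
& f_{e_{start}} = 1, \\
& \sum_{e \in \mathit{back}(h)} f_e \le n_h \sum_{e \in \mathit{entry}(h)} f_e
  && \text{for each loop header } h, \\
& x_b,\, f_e \in \mathbb{Z}_{\ge 0}.
\end{align}
```

The first constraint family is the structural flow conservation derived directly from the CFG, and the loop-bound family carries the manually provided flow information.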
Due to the limitations of our current tool, we only take into account the timing effects of multi-level caches on the WCET estimation and do not consider the effects of other micro-architectural components such as pipelines and branch predictors, so we assume there are no timing anomalies. Therefore, a reference that is classified as NC can be safely treated as an AM when used to estimate the WCET. If timing anomalies were considered, however, we would expect the proposed approach to gain more precision, since it can safely classify some references as AM where the approach in [9] cannot. We leave this as future work.
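Under the no-timing-anomaly assumption, the worst-case latency of a single reference can be sketched as below. This is a simplified illustration rather than the tool's timing model. It assumes that latencies add up level by level, that any classification other than always-hit (written `"AH"` here) is charged as a miss, and that first-miss (persistence) effects are ignored. The 200-cycle memory latency is the value used in the experiments, while the per-level hit latencies in the test values are hypothetical.

```python
def worst_case_latency(classes, hit_latencies, memory_latency=200):
    """Worst-case latency of one reference in a multi-level hierarchy.

    classes: per-level classification from the top level down,
             each one of "AH", "AM", or "NC".
    hit_latencies: per-level access latency in cycles (hypothetical values).
    """
    total = 0
    for classification, latency in zip(classes, hit_latencies):
        total += latency  # this level is accessed
        if classification == "AH":
            return total  # guaranteed hit: no lower level is accessed
        # "AM" and "NC" are both charged as a miss (safe without anomalies).
    return total + memory_latency  # missed in every level
```

Treating NC as AM only affects the bound, not its safety, because a miss is assumed to be the locally worst outcome when no timing anomalies exist.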
Our experiments are carried out on the set of benchmarks maintained by the Mälardalen WCET research group [6], compiled for the MIPS R3000 processor using gcc-3.4.4. Since the approach proposed in [9] only considers strict multi-level inclusive caches (i.e. it does not consider mixed inclusive and non-inclusive cache levels), we carry out the experiments on a three-level cache hierarchy and configure L2 and L3 to be inclusive. The parameters of the cache at each level are shown in Tab. I. Moreover, we assume all needed information can be found in the main memory with a 200-cycle latency. The experimental results are shown in Tab. II. For each benchmark, $WCET_{top\text{-}dw}$ is derived using the method proposed in [9], and $WCET_{bot\text{-}up}$ is derived using the method proposed in this paper. The WCET estimation is reported in clock cycles. The precision improvement is calculated as $\frac{WCET_{top\text{-}dw}}{WCET_{bot\text{-}up}} - 1$. We also report the computation time overhead in seconds, along with the reported WCET. The experiments are performed on a Linux machine with a 1.2GHz quad-core processor and 12GB memory.
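The precision improvement metric is the ratio of the two bounds minus one. A minimal Python sketch is given below; the function name is hypothetical, and any numbers passed to it in examples are illustrative rather than taken from Tab. II.

```python
def precision_improvement(wcet_top_dw, wcet_bot_up):
    """Relative tightening of the bottom-up bound over the top-down bound.

    Both arguments are WCET bounds in clock cycles; the result is a
    fraction (multiply by 100 for a percentage).
    """
    if wcet_bot_up <= 0:
        raise ValueError("WCET bound must be positive")
    return wcet_top_dw / wcet_bot_up - 1
```

A value of 0 means both methods produce the same bound, and larger values mean the bottom-up bound is tighter.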
We sort Tab. II in descending order of the precision improvement. The results show that the bound is tightened by about 12.2% on average. In some cases, the improvement is more than 20%, e.g. 57.3% for fibcall and 44.4% for insertsort. For some benchmarks, the improvement is less substantial (less than 3%), e.g. only 2.7% for ludcmp and only 2.4% for adpcm. We find that most of these benchmarks contain nested loops and/or are context-sensitive. The advantage of the proposed method may become larger if the persistence analysis is multi-leveled to handle nested loops [2] and contexts are taken into account in the inter-procedural analysis. Furthermore, as mentioned above, our prototype tool currently does not analyze micro-architectural features other than multi-level caches. Since the proposed approach can classify some references as AM while the method in [9] cannot, we would expect more precision gains if timing anomalies are considered. Although these techniques are not integrated in our tool yet, the improvement is still significant. Even in cases where the improvement is less than 3%, thousands of overestimated cycles are removed (e.g. 12400 clock cycles in the case of adpcm). It should also be noted that the proposed approach is standalone and can be integrated with other techniques without any changes.

From the results, we can also see that the computation times of the two methods differ by only a few seconds in most cases. The biggest difference is about 93 seconds in the case of nsichneu. Since this difference is only a small portion of the overall computation times, which are 6.4 and 7.9 minutes respectively, we believe the computation time overhead is acceptable.
VI. RELATED WORK
Abstract interpretation based single-level cache analysis has been widely used in WCET analysis [19]. However, its original persistence analysis has been found to be unsafe, and safe persistence analyses are proposed in [5, 10]. The first multi-level cache analysis is proposed in [16], which is an extension to another well-established single-level cache analysis method called static cache simulation [17]. Later, [8] points out that this method is actually unsafe for analyzing multi-level set-associative caches. It proposes using CAC to filter the references at each level and defines an update strategy that takes into account the uncertain accesses.
Based on the work in [8], which does not take into account data caches, a method for analyzing multi-level non-inclusive data caches is proposed in [12], and a method for analyzing non-inclusive cache hierarchies with unified caches is proposed in [4]. In [18], an abstract domain called live caches is used to model the relationships between cache levels, and the analysis based on this domain can handle unified caches using the write-back policy.
Cache hierarchies are natural in multi-core processors, for which the analysis needs to take into account the inter-core interferences. In [20] , a dual-core processor with a shared L2 cache model is considered. In [13] , task lifetime information is computed and utilized to refine possible interferences. In [7] , a method for identifying and bypassing the static single usage memory blocks so as to reduce the number of interferences is proposed. In [15] , abstract interpretation based cache analysis is combined with model checking based bus analysis to achieve more precise interference analysis. In [3] , a WCET analysis framework that covers different micro-architectural components in a multi-core processor is presented. All these works assume multi-level non-inclusive caches are used.
In [9], methods to analyze cache hierarchies of different types (non-inclusive, inclusive, and exclusive) are presented. That work shows the difficulties in deriving a tight WCET estimation for systems using multi-level inclusive caches and non-LRU replacement policies. It considers the different multi-level instruction cache types separately, without taking into account hybrid types such as a combination of non-inclusive and inclusive caches.
VII. CONCLUSION AND FUTURE WORK
In this paper, we propose an approach that can safely and more precisely analyze multi-level inclusive caches for WCET estimation. The approach first analyzes all the inclusive levels in the bottom-up direction and then analyzes the remaining non-inclusive levels in the top-down direction. Although bottom-up sounds counter-intuitive considering that the cache levels are accessed in the top-down direction, we show that it is actually very suitable for analyzing inclusive caches. In order to capture the effects of the invalidations caused by an inclusive level, we propose the concept of an aging barrier. Aging barriers can safely slow down the increase of memory blocks' ages, and we show how to integrate them into the must and persistence analyses to gain more precision. The experimental results show that the proposed approach can tighten the bound by 12.2% on average. In the future, we want to extend the approach to take into account the effects of data references and inter-core interferences, and we also want to enhance our tool to consider the interactions between multi-level caches and other micro-architectural features.
APPENDIX
In order to prove that the proposed multi-level (inclusive) cache analysis is safe, we need to prove that the may, must, and persistence analyses of the last inclusive cache are safe, and we also need to prove that the analyses of the caches located above at least one inclusive cache are safe.
When analyzing a cache level, we can safely use the well-defined join function of the single-level cache may, must, or persistence analysis at a join point for the corresponding analysis [8], so we can focus on proving that the defined update functions are safe.
A. Safe Analyses of the Last Inclusive Cache
Given the last inclusive cache level $l_{LIC}$, we first prove the proposed may, must, and persistence analyses are safe.

Lemma 4. The last inclusive cache persistence analysis is safe. In other words, at a program point $p$, any memory block that has been loaded into $C_{l_{LIC}}$ is in an age position of $\alpha^{pers}_{l_{LIC}}$ which is greater than or equal to its possible maximal age when the execution reaches $p$ (which implies that if it is possibly absent from $C_{l_{LIC}}$, it is in a position $\top$ of $\alpha^{pers}_{l_{LIC}}$).

Proof: The proof is the same as the proof of Lemma 3, except that we prove the defined $U^{pers}_{LIC}$ is safe.
B. Safe Analyses of Inclusive Caches Located above One Inclusive Cache
Since we analyze all the inclusive caches in the bottom-up direction first, we now prove that the analyses of the inclusive caches located above the last inclusive cache are safe. Let $l_v$ be the second last inclusive cache level. Since $\alpha^{may}_{l_v}$ is updated according to Algorithm 1, we need to prove that the steps in the algorithm do not overestimate the age of a memory block. In the algorithm, $j$ is calculated and used to control the upper bound on the aging process when updating $\alpha^{may}_{l_v}$. Note that if $j$ is not greater than the smallest position at which the cache set actually has a "hole", lines 7-9 are always safe (some blocks' ages may be underestimated, but none will be overestimated). Based on Lemma 4, we know the last inclusive cache persistence analysis captures all the possibly evicted memory blocks in the positions $\top$ of $\alpha^{pers}_{l_{LIC}}$. Thus, lines 4-6 give a $j$ for which this bound holds. Therefore, $\alpha^{may}_{l_v}$ is safely derived at each point, and Lemma 5 holds.
Lemma 6. The must analysis of $C_{l_v}$ is safe. In other words, at a program point $p$, any aging barrier $(i, j)$ derived from $\beta_{l_v}$ corresponds to a "hole" in the $i$-th set within the position range $[1, j]$, and the memory blocks contained in $\alpha^{must}_{l_v}$ are definitely in $C_{l_v}$ when the execution reaches $p$.
Proof: As discussed in IV-B concerning the definition of the $J$ function, $J$ ensures that only the "holes" that definitely exist along either path are kept, and it overestimates the position upper bounds of these "holes". Since the join function $J^{must}$ does not underestimate the age of a memory block, we only need to prove that updating $\beta_{l_v}$ and $\alpha^{must}_{l_v}$ is safe. We prove this by mathematical induction.

Base case: At the beginning, $\beta_{l_v} = \beta_\bot$, which means all the positions in all sets are "holes", and $\alpha^{must}_{l_v}$ corresponds to an empty state. With a cold start, no memory blocks are loaded. Therefore, the lemma holds in the base case.

Inductive hypothesis: Before a reference $r$ which accesses the memory block $m^r_{l_v}$ that is mapped to the $i$-th cache set, any aging barrier $(i, j)$ derived from $\beta_{l_v}$ corresponds to a "hole" in the $i$-th set within the position range $[1, j]$, and the memory blocks contained in $\alpha^{must}_{l_v}$ are definitely in $C_{l_v}$.

Inductive step: Based on the inductive hypothesis and Lemma 2, if a memory block is in the current $\alpha^{must}_{l_v}$ but its contents are not in $\alpha^{may}_{l_{LIC}}$ after the reference, this memory block needs to be invalidated, so a "hole" will be created. Since the must analysis captures the maximal ages of memory blocks, adding the created "hole" to $\beta_{l_v}$ will not violate the lemma. Based on Lemma 4, the memory blocks in the positions $\top$ of $\alpha^{pers}_{l_{LIC}}$ after the reference are possibly evicted; thus, after line 7 the lemma still holds with respect to the updated $\beta_{l_v}$ and $\alpha^{must}_{l_v}$. When updating the states according to the rest of Algorithm 2, after line 9, $j$ has a position, and if $j \neq \top$, there is a "hole" within the range $[1, j]$, and any of the remaining aging barriers derived from the used $\beta_{l_v}$, namely $\beta'_{l_v}$, still corresponds to a "hole" (deduced from the inductive hypothesis). There are two possibilities when updating $\alpha^{must}_{l_v}$.

In the first, based on Lemma 5, we have $k = \top$ and we are sure that $m^r_{l_v}$ is not in $C_{l_v}$. Based on Lemma 1, $C_{l_v}$ will definitely be accessed due to the reference $r$. Therefore, line 16 (i.e. applying the $U^{must}$ that takes into account the effects of the existence of a "hole") can safely update $\alpha^{must}_{l_v}$, and that "hole" is possibly filled. In this case, $\max(j, k) = \top$ no matter what $j$ is, so $A$ will not change $\beta'_{l_v}$ at line 19.

In the second, we do not know whether $C_{l_v}$ will be accessed or not, so line 17 can safely update $\alpha^{must}_{l_v}$ by taking into account both the access occurring and not occurring. We have $j = \top$ or $j \neq \top$, and $k = \top$ or $k \neq \top$. If $j = \top$ or $k = \top$, then $\max(j, k) = \top$, so $A$ will not change $\beta'_{l_v}$ at line 19. The only case in which $A$ will change $\beta'_{l_v}$ is when $j \neq \top$ and $k \neq \top$. In this case, although the "hole" within the range $[1, j]$ may possibly be filled, there is still a "hole" within the range $[1, \max(j, k)]$. This is because, based on the hypothesis, $m^r_{l_v}$ is definitely in $C_{l_v}$ if $k \neq \top$, and in either case of $k \le j$ or $j < k$, the reference does not load a new memory block into $C_{l_v}$. So after applying $A$ on $\beta'_{l_v}$, the resultant $\beta''_{l_v}$ does not violate the lemma. Thus, after line 19, the lemma still holds with respect to the updated $\beta_{l_v}$ and $\alpha^{must}_{l_v}$.

Lemma 7. The persistence analysis of $C_{l_v}$ is safe. In other words, at a program point $p$, any memory block that has been loaded into $C_{l_v}$ is in an age position of $\alpha^{pers}_{l_v}$ which is greater than or equal to its possible maximal age.
Proof: Since $J^{pers}$ does not underestimate the age of a memory block, we only need to prove that this lemma holds in terms of updating, which we do by mathematical induction.

Base case: At the beginning, no memory block is loaded, and all the positions of $\alpha^{pers}_{l_v}$ are empty. The lemma holds.

Inductive hypothesis: Before a reference $r$ which accesses the memory block $m^r_{l_v}$, any memory block that has been loaded into $C_{l_v}$ is in an age position that is greater than or equal to its possible maximal age.

Inductive step: There are two cases. In the first, based on Lemma 5, we know $m^r_{l_v}$ is not in $C_{l_v}$, and based on Lemma 1, $C_{l_v}$ will be accessed due to the reference $r$. According to the definition of the persistence update function used here, when $j = \top$ it is the normal $U^{pers}$ and does not underestimate the possible maximal ages of the blocks. When $j \neq \top$, no matter what $k$ is, it never increases the ages of the blocks that are already greater than or equal to $j$, so we need to prove that in this case the possible maximal ages of these memory blocks are not greater than these unchanged ages. Since $j \neq \top$, there is definitely a "hole" within the position range $[1, j]$, so even if $C_{l_v}$ is accessed and $m^r_{l_v}$ is not in $C_{l_v}$, the ages of the blocks that are behind this "hole" will not be increased. This means the possible maximal ages of the memory blocks whose ages are already greater than or equal to $j$ will not be increased. From the inductive hypothesis, we know that before this reference, the position of a memory block in $\alpha^{pers}_{l_v}$ is an upper bound on its possible maximal age position. Thus, even though the ages of the memory blocks that are already greater than or equal to $j$ are unchanged after applying the update function, they are still not less than the possible maximal ages of these memory blocks.

In the second case, we do not know whether $C_{l_v}$ is accessed or not, so we safely join the two states corresponding to the access occurring and not occurring. Thus, the lemma holds with respect to the updated $\alpha^{pers}_{l_v}$.

Theorem 1. The proposed may, must, and persistence analyses of the inclusive caches in the bottom-up direction are safe.
Proof: Since we have proven the analyses of the last inclusive cache are safe (Lemma 2, Lemma 3, and Lemma 4), we only need to prove by mathematical induction that the analyses of the remaining inclusive caches in the bottom-up direction are safe.

Base case: The analyses of $C_{l_v}$ are safe, where $l_v$ is the second last inclusive cache level.

Inductive hypothesis: The analyses of all the inclusive caches that are located beneath $C_{l_y}$ are safe, where $l_y$ is an inclusive level above the last inclusive level $l_{LIC}$.

Inductive step: Let the next inclusive level located beneath $l_y$ in the top-down direction be $l^i_y$. Following the proofs of Lemma 5, Lemma 6, and Lemma 7, we can prove that the may, must, and persistence analyses of $C_{l_y}$ are safe, as long as the analyses of $C_{l^i_y}$ are safe. Since the inductive hypothesis gives that the analyses of $C_{l^i_y}$ are safe, the analyses of $C_{l_y}$ are safe.
C. Safe Analyses of Non-Inclusive Caches
