Optimizing compilers implement program transformation strategies aimed at reducing data movement to or from main memory by exploiting the data-cache hierarchy. However, instead of attempting to minimize the number of cache misses, very approximate cost models are used, due to the lack of precise compile-time models for misses for hierarchical caches. The current state of practice for cache miss analysis is based on accurate simulation. However, simulation requires time proportional to the dataset/problem size, as well as the number of distinct cache configurations of interest to be evaluated. This paper takes a fundamentally different approach, by focusing on polyhedral programs with static control flow. Instead of relying on costly simulation, a closed-form solution for modeling of misses in a set-associative cache hierarchy is developed. This solution can enable program transformation choice at compile time to optimize cache misses. A tool implementing the approach has been developed and used for validation of the framework.
INTRODUCTION
Optimizing compilers currently use simplified cost models to select the composition of loop transformations applied to programs. For example, Bondhugula et al. maximized idealized measures of data locality [Bondhugula et al. 2008], Kong et al. maximized data locality under single instruction/multiple data (SIMD) constraints [Kong et al. 2013], and other works have examined combined objectives for parallelization and data locality (e.g., [Feautrier 1992; Lim and Lam 1997]). While useful in practice, the major drawback of these approaches is that they are limited to abstract performance models, such as “maximizing parallelism” or “minimizing reuse distance,” which are only indirectly related to concrete performance metrics, such as data cache misses.
Overcoming these limitations raises several challenges. First, approaches to accurately model cache behavior typically employ actual execution or simulation to measure or analyze the program's execution on the target device. These approaches can be expensive, with cost proportional to the number of instructions the program executes. For example, running the DineroIV [Edler and Hill 1999] cache simulator on a simple matrix-multiply code for 1024 × 1024 matrices can take hours. Most importantly, these approaches do not provide an exact model of cache behavior for use at compile time. Second, nearly all practical modern systems use cache hierarchies. A small, first-level (L1) cache is complemented with a larger second-level (L2) cache, and so on; the Intel Haswell CPU may contain up to four levels of caches. To accurately model the accesses to main memory, which are much slower than any cache accesses and can significantly influence program performance, the full hierarchy of caches must be modeled, where accesses to level L+1 are derived from the misses at level L. Third, most practical caches are set-associative rather than fully associative, where multiple virtual memory addresses may map to a specific set of physical cache lines, requiring accurate models of conflict misses.
In the quest for an approach that would enable compile-time analysis of runtime cache behavior, we need to: (1) model the cache hierarchy; (2) accurately model cold, capacity, and conflict misses of set-associative caches at each level; and (3) achieve an analysis time that is essentially independent of problem size and cache size. In this work, we achieve all of these goals by focusing specifically on the class of affine programs, where key properties of the control and data flows afford accurate capture of runtime cache behavior. We also propose a novel and purely static analysis to accurately compute cache behavior at compile time, without the need for execution or simulation.
Compile-time cache modeling (i.e., avoiding simulation) for affine programs has been studied in the past. For example, tile size selection has been investigated by approximating capacity misses for regular/affine data tiles, such as the distinct line model [Sarkar and Megiddo 2000] and its generalization with the minimum line model [Shirako et al. 2012]. Ghosh et al. [1998, 1999] proposed Cache Miss Equations (CME) to capture situations when a cache miss occurs in perfectly nested loop programs with uniform references. The approach handles set-associative caches, but only for single-level caches. Moreover, a major limitation of the approach is the difficulty of translating these equations into a cache miss count, which requires actually solving the CMEs (i.e., determining whether or not they have valid solutions) for a multitude of cases. Furthermore, translating a miss count at the first-level cache into a set of cache accesses for the next level is impractical. Thus, this approach is not suitable for effectively modeling hierarchical caches that are ubiquitous in practical systems. Chatterjee et al. [2001] made a significant advance by providing an exact closed-form solution for polyhedral programs for direct-mapped caches using Presburger formulae, which was partially generalized to model a set-associative single-level cache. However, none of these prior formulations of the cache miss problem is suited for modeling hierarchical caches. There is a fundamental difference between simply counting cache misses, as in prior work, and modeling the set of accesses that lead to a cache miss, as developed in this work. Such a set of events should be exactly modeled to compute misses in cache hierarchies because only the subset of accesses to L1 that miss in L1 forms the set of accesses to L2.
We make the following contributions:
• We present a closed-form solution to the problem of modeling cache behavior for arbitrary polyhedral programs executing on processors with set-associative virtually-indexed hierarchical caches under a variety of write policies.
• We develop a tool that computes the number of misses at each cache level for set-associative caches, from an input C source file containing affine program(s).
• We perform extensive evaluation of our cache miss analysis and validate it against a simulator with identical hardware assumptions, and present several approximation heuristics.
The rest of the paper is organized as follows: Sec. 2 elaborates on the problem. Sec. 3 presents the notation and background on mathematical concepts used in the work. Sec. 4 presents a static analysis for a single-level set-associative cache, while Sec. 5 generalizes to hierarchical caches with various write policies. Extensive evaluation is reported in Sec. 7. Additional optimizations and use cases are discussed in Sec. 8. Related work is presented in Sec. 9 before concluding the paper.
Fig. 1(a)). Fig. 1(d) shows the execution trace of the various reads/writes for execution of the program. The following includes a walk-through for computing the set of memory accesses that lead to cache misses in a simple 2-way set-associative cache.
A K-way set-associative cache is partitioned into cache sets, each of which can store K cache lines. A cache line stores B bytes of data. For a cache of size n bytes, the total number of cache lines, nblines, is given by n/B, and the total number S of cache sets is (n/B)/K. To determine the specific cache set that a memory address maps to, a classical and typical approach is to use the least significant bits of the memory address: a virtual memory address addr maps to cache set (addr/B)%S. Since the cache is K-way associative, up to K lines mapping to the same cache set can co-exist simultaneously in the cache. We illustrate this in Fig. 1(d) for a simple 2-way associative cache of size 32 bytes with 4 cache lines: B = 8, K = 2, and S = 2; that is, each cache line stores a single element of A.
The trace of accesses in Fig. 1(d) is typical of input traces provided to cache simulators, such as DineroIV [Edler and Hill 1999]. First, the (annotated) program is executed to build the trace of accesses. Then, the simulator reads this trace in sequential order, one reference at a time. For each address accessed, it computes the mapped cache set and maintains an internal data structure representing the cache state (i.e., what data is currently in cache). When the accessed data item is not currently in cache, a cache miss occurs, possibly leading to the eviction of a cache line previously stored in this cache set, and the loading of the data into a line in this set from a copy in a later level cache (or main memory). Conversely, when the accessed data is already in cache, a cache hit occurs. Fig. 1(d) shows the cache line and cache set to which each address maps. The “virtual cache line” is computed by L_id = addr/B and the cache set by S_{L_id} = (addr/B)%S. We remark that S_{L_id} = (addr/B)%S = ((addr/B)%nblines)%S because we have nblines = K * S, K ≥ 1. That is, L1,S1 is read as L_id = 1, S_{L_id} = 1 for this reference. Fig. 1(e) shows whether each access is a hit or miss for this cache configuration.
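The address-to-line and line-to-set mapping can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `line_and_set`, the base address 0, and the 8-byte element size are assumptions made here, not values taken from Fig. 1.

```python
def line_and_set(addr: int, B: int, S: int) -> tuple[int, int]:
    """Return (L_id, set index) for a byte address, with L_id = addr // B."""
    line_id = addr // B
    return line_id, line_id % S


B, K, S = 8, 2, 2        # configuration of the walk-through: 32-byte cache
nblines = K * S          # 4 cache lines in total

# Hypothetical layout: element e of the array lives at byte address 8 * e.
for e in range(8):
    line_id, cset = line_and_set(8 * e, B, S)
    # Reducing the line id modulo nblines first does not change the set index,
    # because nblines is a multiple of S.
    assert cset == (line_id % nblines) % S
    print(f"element {e}: L{line_id}, S{cset}")
```

The assertion inside the loop checks the remark that (addr/B)%S equals ((addr/B)%nblines)%S for every address visited.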
Consider a cache line L_id to be accessed. This access is a miss if it is the first time L_id is accessed or K distinct cache lines that map to the same set have been accessed more recently than L_id. Otherwise, it is a cache hit. Indeed, in a set-associative cache, each line maps to a specific set with K slots. A cold miss (an unavoidable miss even in an infinite cache) occurs the first time L_id is accessed. A capacity/conflict miss (one that can be avoided in an infinite cache) occurs if K different lines mapping to the same set have been accessed prior to (re)accessing L_id. In other words, for a particular cache set, every (K + 1)-th distinct accessed line mapping to this set will lead to a cache miss.
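The miss condition above can be checked directly on a trace of line indices, without maintaining any cache state. The following is a minimal brute-force sketch, assuming LRU replacement; the function name `misses_by_definition` and the sample trace are hypothetical.

```python
def misses_by_definition(lines, K, S):
    """Trace positions that miss, using the stated condition directly."""
    out = []
    for t, line in enumerate(lines):
        cset = line % S
        # Position of the most recent earlier access to the same line, if any.
        prev = max((u for u in range(t) if lines[u] == line), default=None)
        if prev is None:
            out.append(t)                      # cold miss: first access
            continue
        # Distinct other lines of the same set accessed since that access.
        between = {l for l in lines[prev + 1:t] if l % S == cset and l != line}
        if len(between) >= K:
            out.append(t)                      # capacity/conflict miss
    return out


# Hypothetical trace of line indices, all mapping to set 1 when S = 2.
print(misses_by_definition([1, 3, 1, 5, 3], K=2, S=2))
```

The quadratic scan is deliberate: it mirrors the definition rather than an efficient implementation.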
Focusing on Set 1 in the 2-way set-associative cache, Fig. 1(d) shows that first A[0,1] and then A[1,1] are loaded in Set 1 (they are both cold misses), and A[0,1] then is accessed again (it is a hit: no line has been evicted from Set 1 yet). When A[0,3] is loaded, Set 1 is already full, and the least recently used (LRU) line is evicted. A[1,1] is deleted, and A[0,3] is stored in the cache. A cache simulator such as DineroIV operates exactly the same way. Following the sequence order in the program trace, it determines where a reference maps, using L_id and S_{L_id}, and keeps track of the current cache state to determine if an access is a hit or a miss, performing evictions as needed, e.g., based on the LRU policy. The total number of misses is reported, and, for hierarchical caches, accesses that lead to a miss in L1 cache are used as accesses to L2 cache, etc.
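For comparison with the analytical approach, the trace-driven procedure just described can be sketched as follows. This is not DineroIV itself, only a minimal LRU simulator written for illustration; the name `simulate_lru` and the byte addresses in the example are assumptions chosen so that all accesses fall in Set 1.

```python
from collections import OrderedDict


def simulate_lru(addrs, B, K, S):
    """Return the trace positions that miss in a K-way LRU cache with S sets."""
    sets = [OrderedDict() for _ in range(S)]   # per set: lines in LRU order
    misses = []
    for t, addr in enumerate(addrs):
        line = addr // B
        ways = sets[line % S]
        if line in ways:
            ways.move_to_end(line)             # hit: now most recently used
        else:
            misses.append(t)
            if len(ways) == K:
                ways.popitem(last=False)       # evict least recently used line
            ways[line] = True
    return misses


# Hypothetical addresses: lines 1, 3, 1, 5, 3, all in set 1 for B = 8, S = 2.
print(simulate_lru([8, 24, 8, 40, 24], B=8, K=2, S=2))
```

The third access hits because nothing has been evicted yet, while the fourth access evicts the second line loaded, matching the Set 1 narrative.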
In this paper, we describe a compile-time analysis intended to produce the same output as a cache simulation, i.e., the set of accesses that lead to cache misses, without actually generating the program trace or simulating cache state. We achieve this by restricting the class of modeled programs to polyhedral programs (defined in Sec. 3). In a nutshell, for polyhedral programs, the exact ordered set of all accessed memory locations can be described in a compact form using integer sets and relations, as presented in Sec. 3. From this example, we observe that the following are needed for such an analysis: (1) the ordered set of memory accesses (capturing the same information as the trace in Fig. 1(d)), (2) a mechanism to determine which line and set an address is mapped to (modeling exactly Fig. 1(d)'s mapping), and (3) the ability to determine sequences of accesses to K + 1 distinct lines mapping to the same set. Item (1) above is a direct consequence of polyhedral programs. The exact ordered set of accesses can always be built for these programs. For (2), the L_id and S_{L_id} expressions can be modeled as Presburger formulae if n, B and S are known at compile time, as Chatterjee et al. [2001] showed. Achieving (3) is a core contribution of this paper (Sec. 4).
We illustrate the approach with Fig. 1(f)-(h), using the previously described code and cache configuration. First, a polyhedral representation of the program is generated. The set of executions of the statement is captured by a polyhedron; Fig. 1(b) plots this iteration domain of the program. Each dot corresponds to an execution of the associated color-coded reference, and the coordinates of the dot capture the value of the corresponding loop iterators. The program execution order (the schedule) can always be determined. In Fig. 1(b), it corresponds to the lexicographic ordering.
For each array reference, an access function from each iteration point to the accessed memory cell is built. Fig. 1(c) is obtained from Fig. 1(b) by applying the access functions to the iteration domain. Fig. 1(g) corresponds to the subset of cells from A that map to Set 1. This can be obtained by adding constraints so that only accesses with S_{L_id} = 1 are kept, i.e., (addr/4)%2 == 1, where addr is the accessed memory address. addr is obtained from F_A by linearizing the access function. We do require the array sizes to be known and constant, to obtain {[i, j]→[A + i * 4 + j + 1]}. We remark that because the hit/miss pattern of a cache set is not impacted by the accesses to other cache sets, we can model accesses to each cache set independently and aggregate the results for all cache sets at the end. Fig. 1(f) displays the same subset of references that map only to Set 1, by showing the subset of points in the iteration domain that correspond to these references. This pruned iteration domain forms the (ordered) set of references to Set 1 only, obtained in a manner similar to how Fig. 1(g) is built, but reasoning on iteration points instead.
The last and most critical step is to obtain Fig. 1(h), the subset of only the accesses that lead to a miss (cold and capacity/conflict). This is built in two stages. First, we determine the subset corresponding to cold misses, which is computed as the first access to a unique cache line (unique L_id). Given a function IterToLine that maps iteration points to the virtual cache line L_id they access (as defined in Sec. 4), the first access to each L_id is computed by taking the lexicographically first (with respect to the program schedule) iteration accessing L_id, i.e., the preimage of lexmin(IterToLine). This can be computed analytically for polyhedral programs. In Fig. 1(h), these accesses correspond to the dots not inside a square.
Second, we must determine the ordered sequence of virtual lines L_id accessed by the program, where for a particular set S_{L_id}, every (K + 1)-th accessed line mapping to that set will incur a cache miss, and thus must be captured. In Fig. 1(h), it corresponds to the dot inside the square. We solve this problem by building several functions, typically from an iteration point to another (set of) iteration point(s) with some specific properties. For example, the composition SameSet = IterToSet ∘ IterToSet^{-1} maps an iteration point i⃗ to the other iteration points {j⃗} accessing the same cache set, some of which will be miss events. We successively refine SameSet to end up with only the iterations {j⃗} which do incur a cache miss. Iterations mapping to the same virtual line are computed as SameLine = IterToLine ∘ IterToLine^{-1}. We can combine SameSet and SameLine to keep only the iterations that correspond to accesses to distinct cache lines within a set. For a K-way associative cache, we then successively remove the second, third, ..., K-th unique line accessed mapping to the same cache set and retain only the (K + 1)-th ones, i.e., the collection of miss events. This is achieved by building a function from a point in an iteration domain to the immediately next executed point in this domain. This function can be composed with others to get the immediately next relevant iteration for our analysis, e.g., the next iteration accessing the same cache set and a different cache line. We proceed by successively removing such next iterations, K times, to end up with only the iterations incurring a miss, as detailed in Sec. 4.
The resulting set (e.g., Fig. 1(h)) has two fundamental properties: it can be counted to obtain the number of cache misses, and it can be used to model accesses to the next level cache, since misses at L1 form the set of accesses at L2. By composing this process, we can seamlessly reason about hierarchical set-associative caches and produce per-level cache miss counts, paving the way for simulation-free compile-time analysis and optimization of cache behavior for polyhedral programs (discussed in Sec. 8).
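The role of the miss set, as opposed to a mere miss count, can be illustrated by chaining two simulated levels: the sub-trace of L1 misses is fed unchanged to L2. The sketch below is a simulation-based illustration, not the closed-form analysis; the function name `lru_misses`, the trace, and both cache geometries are assumptions, and all accesses are treated as reads.

```python
from collections import OrderedDict


def lru_misses(addrs, B, K, S):
    """Return the ordered sub-trace of addresses that miss in a K-way LRU cache."""
    sets = [OrderedDict() for _ in range(S)]
    missed = []
    for addr in addrs:
        line = addr // B
        ways = sets[line % S]
        if line in ways:
            ways.move_to_end(line)         # hit: refresh recency
            continue
        missed.append(addr)                # miss: this access reaches the next level
        if len(ways) == K:
            ways.popitem(last=False)       # evict least recently used line
        ways[line] = True
    return missed


# Hypothetical trace: two sweeps over five 8-byte elements.
trace = [8 * e for e in (0, 1, 2, 3, 4, 0, 1, 2, 3, 4)]
l1_misses = lru_misses(trace, B=8, K=2, S=2)       # L1: 4 lines
l2_misses = lru_misses(l1_misses, B=8, K=2, S=4)   # L2: 8 lines, sees L1 misses only
print(len(l1_misses), len(l2_misses))
```

Knowing only that L1 incurs eight misses would not be enough to evaluate L2; the identity and order of the missing accesses are what the second call consumes.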
Wenlei Bao, Sriram Krishnamoorthy, Louis-Noel Pouchet, and P. Sadayappan
PROGRAM REPRESENTATION
In this work, we target the compile-time analysis of polyhedral program regions [Feautrier 1992; Girbal et al. 2006], also called affine programs. A program region is polyhedral if the only control-flow structures are for loops and if conditionals; the iteration set of each syntactic statement can be described by static analysis using affine inequalities of the surrounding loop iterators and program region parameters (unknown but constant values for one execution of the region); the access functions of arrays are multidimensional affine functions of the surrounding loop iterators and parameters; each for loop has loop-invariant loop bounds, a constant scalar stride, and the exit value of the loop iterator is not read outside the loop; and syntactic statements may contain a data-dependent conditional (e.g., using a ternary operator in C), but both the true and false branch will be assumed to be always taken in the analysis.
Polyhedral programs have essential properties that we rely on to build our static analysis (Sec. 3.2). This program class covers a wide spectrum of compute- and data-intensive processing kernels typically found in linear algebra methods, image processing, or physics simulation [Bao et al. 2016a,b; Girbal et al. 2006] where high performance is a must and, consequently, exploiting cache hierarchies effectively is highly desired. We first describe the key mathematical objects manipulated in this work and their operations. We follow with presenting the representation of affine programs using these structures.
Modeling Integer Tuples
All mathematical structures and operations used in this work involve modeling integer tuples using Presburger formulae to model sets and relations among them. Operations on these structures are readily available in the Integer Set Library (ISL) [Verdoolaege 2010a], including more advanced operations, such as counting the number of points in a set or relation. All structures and operations are briefly surveyed in this work. As integer sets are manipulated, the worst-case complexity of most operations is NP-hard. However, in practice, quasi-polynomial time is often achieved. Sec. 7.6 details the measured execution time of these operations.
Integer tuple. A point in a set is a multidimensional integer vector, or integer tuple. Its n components take values in Z: i⃗ ∈ Z^n.
Integer sets. A set S of integer tuples is a subset of Z^n defined as:
S = [p_1, ..., p_p] → {[i_1, ..., i_n] : c_1 ∧ ... ∧ c_m}
where i_1, ..., i_n index the n dimensions of the set (noted i⃗); p_1, ..., p_p are invariant parameters (noted p⃗); and c_1, ..., c_m are m Presburger formulae typically in the form of affine inequalities defining constraints on the values of i⃗. Presburger formulae in this work involve existential quantifiers, modulo operations, integer division, and ceil/floor.
Integer sets are not polyhedra. They can contain “holes,” i.e., they can model the intersection of a polyhedron with an integer lattice, for example, to model a subset containing only some even integers.
Relation. A relation R maps points Z^n → Z^m and is defined as:
R = [p_1, ..., p_p] → {[i_1, ..., i_n] → [j_1, ..., j_m] : c_1 ∧ ... ∧ c_m}
where i⃗ are the input dimensions and j⃗ the output dimensions. Similarly to sets, Presburger formulae are used to express a relation between points in two sets of possibly different dimensions but also to express constraints on the input and output sets.
Domain and range of a relation. To retrieve the set of points where a relation can be applied (its domain) and the set of points that may be produced (its range), use:
and:
Composition. Two relations R 1 : Z n →Z m and R 2 : Z m →Z o can be composed to form a new relation:
A relation can be inverted (reversed), i.e., its input and output dimensions are swapped. For a relation R as defined previously, its inverse is:
A relation can be applied to a set, which results in a set of points that are the image of the input set S by the relation R:
Union and intersection. We manipulate union, intersection, and difference of sets and relations. These binary operations are written ∪ for union, ∩ for the intersection, and − for the difference.
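For intuition, the operations surveyed above can be mimicked on small finite relations stored as Python sets of (input, output) pairs. This is only a toy model: ISL manipulates parametric, possibly unbounded sets symbolically, and the helper names below are hypothetical. The composition order (first relation applied first) is an assumption chosen to agree with SameSet = IterToSet ∘ IterToSet^{-1}.

```python
def domain(R):
    """Points to which the relation can be applied."""
    return {i for (i, _) in R}


def range_(R):
    """Points that the relation may produce."""
    return {j for (_, j) in R}


def inverse(R):
    """Swap input and output of every pair."""
    return {(j, i) for (i, j) in R}


def compose(R1, R2):
    """Apply R1 first, then R2 (ordering assumed in this sketch)."""
    return {(i, k) for (i, j) in R1 for (j2, k) in R2 if j == j2}


def apply(R, S):
    """Image of the set S by the relation R."""
    return {j for (i, j) in R if i in S}


# Hypothetical relation: integer i maps to i // 2.
R = {(i, i // 2) for i in range(4)}
same_image = compose(R, inverse(R))   # pairs of inputs sharing an output
print(sorted(same_image))
# Union, intersection, and difference are Python's |, &, and - on sets.
print(sorted(same_image - {(i, i) for i in range(4)}))
```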
Counting. Finally, an essential operation we extensively rely on is the ability to build a counting formula for an integer set (or relation). This formula takes the form of an Ehrhart quasi-polynomial and can be computed using the Barvinok algorithm implementation in ISL [Verdoolaege 2007]. We note this operation #S for the cardinality of a set S.
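To illustrate what a counting formula provides, the sketch below compares brute-force enumeration of a small parametric set with a closed-form polynomial in the parameter. The triangular set {[i, j] : 0 ≤ i < N ∧ 0 ≤ j ≤ i} is a hypothetical example chosen here, and the closed form N(N + 1)/2 is derived by hand rather than by the Barvinok algorithm.

```python
def count_by_enumeration(N: int) -> int:
    """Count points of {[i, j] : 0 <= i < N and 0 <= j <= i} one by one."""
    return sum(1 for i in range(N) for j in range(i + 1))


def count_closed_form(N: int) -> int:
    """Closed-form cardinality of the same set, valid for N >= 0."""
    return N * (N + 1) // 2


for N in (0, 1, 10, 100):
    assert count_by_enumeration(N) == count_closed_form(N)
print(count_closed_form(1024))
```

Enumeration takes time proportional to the number of points, whereas evaluating the formula does not depend on N, which is the property the analysis relies on.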
Representing Programs
Programs with affine data flow and static control flow are called static control parts (SCoP) [Feautrier 1992; Girbal et al. 2006]. Polyhedral programs are represented in this work using (a union of) integer sets and integer relations. Three key structures are needed in this work to define a program. For each statement in the program, we capture its iteration domain, data access relations, and schedule of iterations. We use the illustrative example in Fig. 2.
Fig. 2. Triangular Matrix-Multiply
Iteration domains. Iteration domains capture the set of runtime executions of a statement. Because programs are polyhedral, this set can be exactly captured using integer sets where the loop bounds are used to constrain the number of points in the set. Each statement R is associated with an iteration vector i⃗_R with one component per surrounding loop, and the values i⃗_R can take are captured by defining its iteration space D_R. For example, the iteration domain of R in Fig. 2 is:
It is possible to count the number of points in this set with:
We note ProgDomain the union of all per-statement iteration domains, i.e., the set of all iteration domains for the entire program.
Data access functions. An essential part of our analysis is based on representing the data accessed by each program iteration. For polyhedral programs, the function that maps a statement instance to the array cell being accessed is, by definition, an affine relation, involving surrounding loop iterators and parameters. We will distinguish the read and write references, and the access relation maps an iteration domain to the multidimensional array index being accessed. For example, the function that relates the iterations of R with the location read in array A for the reference Fig. 2 is:
We note Write for the write references, and the set of all read and/or write access functions for the program is obtained by building the union of the per-statement data access relations. Furthermore, it also is possible to build the relation restricted to the set of iterations of R by computing R = Read Finally, we note ReadRefs, the union of all Read access relations for the program; WriteRefs, the union of all Write relations in the program; and the union of all access relations for the program ProgRefs = ReadRefs ∪ WriteRefs.
Program execution order. A schedule is a relation used to specify the execution order of all statement instances. It maps points in the iteration domain to those in an integer set (the set of timestamps). As such, statement instances in the iteration domain are executed following the lexicographic ordering ≺ of their associated timestamp. ≺ is defined as (a_1, ..., a_n) ≺ (b_1, ..., b_m) iff there exists an integer 1 ≤ i ≤ min(n, m) such that (a_1, ..., a_{i−1}) = (b_1, ..., b_{i−1}) and a_i < b_i.
The original program schedule is modeled using 2d + 1 timestamps, where d is the maximal nesting depth in the program [Girbal et al. 2006] . For example, the schedules of R and S in Fig. 2 are:
where each odd dimension of the output space is a scalar dimension whose value denotes the lexical abstract syntax tree (AST) ordering of the loops surrounding the statement. For statements surrounded by fewer than d loops, the even schedule components associated with the missing loops are set to 0. This approach seamlessly models imperfectly nested loops. A schedule can be constrained by the iteration domain of its statement, e.g., via Sched_R ∩ D_R. Consequently, the set of all distinct statement iterations in the program can be built by making the union of all schedules constrained by their respective statement iteration domain. Sched denotes this union.
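The 2d + 1 encoding can be illustrated with Python tuples, which compare lexicographically. The two-statement program below is hypothetical (it is not the code of Fig. 2): statement R sits in loop i, and statement S follows it inside an inner loop j, so d = 2 and timestamps have five components.

```python
# Hypothetical imperfectly nested program:
#   for i in range(N):
#       R(i)
#       for j in range(N):
#           S(i, j)
N = 2


def sched_R(i):
    # R has one surrounding loop; the component of the missing loop is 0.
    return (0, i, 0, 0, 0)


def sched_S(i, j):
    # The third component is 1 because S's loop follows R inside loop i.
    return (0, i, 1, j, 0)


instances = [("R", (i,), sched_R(i)) for i in range(N)]
instances += [("S", (i, j), sched_S(i, j)) for i in range(N) for j in range(N)]

# Sorting by timestamp recovers the original execution order.
order = [(name, it) for name, it, ts in sorted(instances, key=lambda x: x[2])]
print(order)
```

The scalar components at positions 1, 3, and 5 encode textual order, while positions 2 and 4 carry the loop iterators.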
SINGLE-LEVEL CACHE ANALYSIS
We begin in this section by modeling accesses to a single-level set-associative cache and then extend to hierarchical caches in Sec. 5. For a set-associative cache with associativity K and S sets, for each data read/write executed in the program, we represent the specific set in the cache to which the corresponding cache line L is mapped. If it is the first time L is accessed, the access is a miss. It is also a miss if K distinct cache lines that map to the same set have been accessed more recently than L. Otherwise, the access is a cache hit. Indeed, in a set-associative cache, each cache line maps to a specific set, which contains K slots. By reasoning on the sequence of events (i.e., data read/write events in the program) that lead to access of distinct cache lines mapping to the same set, we can build the set of events corresponding to the (K + 1)-th distinct line accessed for a set, i.e., the set of events leading to a cache miss. This information can be built in a closed form for affine programs.
Modeling Cache Accesses
To model events corresponding to accessing different cache lines, we must first translate array indices given by access relations into unique cache line indices and their associated set in the cache. We assume a least-recently-used (LRU) replacement policy and define a cache and the mapping of virtual memory addresses to cache elements as follows.
Definition 4.1 (Set-associative cache). A set-associative cache C with associativity K, cache line (i.e., block) size of B bytes, and size n bytes contains S sets, with S = n/B/K. A virtual memory address addr maps to a unique line index L_id = floor(addr/B), and the line maps to a unique set S_{L_id} = L_id % S.
We now assume that all array accesses have been linearized. This is always possible if the array extents are known at compile time. For example, the linearization transformation from a two-dimensional access relation R A for 2D array A with size sz along the fastest varying dimension (i.e., number of columns of A for row-major linearization as used in C) and start A as its starting address is:
Generalizing to n-dimensional arrays is straightforward.
The values of n, B, and K are assumed to be known at compile time, which means the value of S also is a numerical constant. Given a virtual memory address, the unique cache line index is given by applying the relation:
The set to which a cache line maps is given by the relation:
The definition of the relation from an array access function to a cache set follows.
Definition 4.2 (Array to Cache set index).
Given an access function F_A to array A of size s⃗z and starting address start_A for a cache as defined in Def. 4.1, the associated cache line in C is identified by AccessToLine as:
The associated cache set in C is identiied by AccessToSet as:
With these relations, we can now reason in the cache space used by the program. For example, the set of distinct cache lines accessed in the program is modeled as:
and can be counted immediately with #Clines. This expression corresponds exactly to computing, in a general form and for arbitrary affine programs, the Distinct Line (DL) expression used in prior work for tile size selection [Sarkar and Megiddo 2000; Shirako et al. 2012]. Notably, it is significantly simpler than the original DL formulation, with no loss of accuracy.
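The chain from array access to linearized address, cache line, cache set, and distinct line count can be sketched by enumeration. All concrete values below (array shape, element size, line size, starting addresses) are assumptions for illustration, and the byte-based linearization with an explicit element size is one plausible convention rather than the exact relation used in the analysis.

```python
def linearize(i, j, start, sz, elem_size):
    """Row-major byte address of A[i][j], with sz columns per row (assumed layout)."""
    return start + (i * sz + j) * elem_size


def access_to_line(i, j, start, sz, elem_size, B):
    """Cache line index touched by the access A[i][j]."""
    return linearize(i, j, start, sz, elem_size) // B


def access_to_set(i, j, start, sz, elem_size, B, S):
    """Cache set index touched by the access A[i][j]."""
    return access_to_line(i, j, start, sz, elem_size, B) % S


def distinct_lines(start, rows=4, cols=4, elem_size=8, B=64):
    """Set of cache lines touched when every A[i][j] is accessed once."""
    return {access_to_line(i, j, start, cols, elem_size, B)
            for i in range(rows) for j in range(cols)}


# A 4 x 4 array of 8-byte elements spans 128 bytes: two aligned 64-byte lines,
# or three lines when the array starts in the middle of a line.
print(len(distinct_lines(start=0)), len(distinct_lines(start=32)))
```

The second result shows why the starting address must be part of the model: the same access pattern touches a different number of lines depending on alignment.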
Miss Events in Set-Associative Caches
Equipped with the ability to reason on cache lines and cache sets being accessed, we can now model if a particular access is a cache miss by reasoning on the sequence of accesses to distinct lines mapping to the same set.
Modeling the next iteration. Intuitively, we want to model consecutive (in terms of program execution order) accesses to the same cache set. Thus, we need to model a notion of sequencing in program execution by using the program schedule Sched. Specifically, we want to capture the point(s) j of the program iteration domain that is executed immediately after point i such that i and j have some properties, such as accessing the same cache set but a different cache line. Given an iteration i_1, i_2 is consecutive to i_1 if there is no iteration i_3 in between. The non-existence of such a point i_3 can be equivalently expressed using relations and set/relation differences.
The relation LexSucc maps a point to all points that are executed after it. We have:
Similarly, we build LexSuccEq, the relation that generates points that are executed after an input point, including it; and LexPrec, the relation that generates points executed prior to an input point. Note these relations are each specifically built with respect to the particular program we are analyzing.
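On a small explicit domain, these relations and the "immediately next" relation can be sketched with sets of pairs. The construction below removes from LexSucc every pair that has an intermediate point, which is one way to express the non-existence of i_3 through a relation difference; the function names and the 2 × 2 domain are hypothetical.

```python
def lex_succ(points):
    """All pairs (a, b) such that b executes strictly after a."""
    return {(a, b) for a in points for b in points if a < b}


def lex_succ_eq(points):
    """Like lex_succ, but each point is also related to itself."""
    return {(a, b) for a in points for b in points if a <= b}


def lex_prec(points):
    """All pairs (a, b) such that b executes strictly before a."""
    return {(a, b) for a in points for b in points if b < a}


def next_point(points):
    """Pairs (a, b) where b is the immediate successor of a."""
    succ = lex_succ(points)
    # (a, c) has an intermediate point b exactly when a -> b -> c in succ.
    with_intermediate = {(a, c) for (a, b) in succ for (b2, c) in succ if b == b2}
    return succ - with_intermediate


# Hypothetical domain: timestamps of a 2 x 2 loop nest in lexicographic order.
points = [(i, j) for i in range(2) for j in range(2)]
print(sorted(next_point(points)))
```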
Modeling iterations accessing the same line/set. To model cache misses, we start by modeling a relation from the iteration domain to the cache line indices and cache sets:
These relations associate to each point of the iteration space the set of cache lines (and cache sets) it accesses. To reason about iterations accessing the same lines or sets, we use the relation inversion to build a map from iterations to iterations accessing the same cache line/set, i.e.:
Modeling sets of relevant iterations. Relations such as LexSucc are meant to be intersected with other relations/sets that encode specific properties. For example, one can model the relation from an iteration i to all iterations accessing the same set that are executed after i as:
A complete procedure is built by assembling the set of iterations that access different cache lines for each different set, retaining only the ones leading to the (K + 1)-th distinct line accessed in a particular set, as this incurs a miss, as described below.
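The way SameSet, SameLine, and LexSucc combine can be sketched on an explicit one-dimensional domain. Everything concrete here is an assumption for illustration: the domain, the access (8-byte elements at address 8 * i), and the cache parameters; relations are finite sets of pairs rather than Presburger relations.

```python
def inverse(R):
    """Swap input and output of every pair."""
    return {(b, a) for (a, b) in R}


def compose(R1, R2):
    """Apply R1 first, then R2 (ordering assumed for this sketch)."""
    return {(a, c) for (a, b) in R1 for (b2, c) in R2 if b == b2}


B, S = 8, 2
domain = [(i,) for i in range(6)]      # hypothetical iteration domain


def addr(p):
    """Hypothetical access: iteration i touches byte address 8 * i."""
    return 8 * p[0]


iter_to_line = {(p, addr(p) // B) for p in domain}
iter_to_set = {(p, (addr(p) // B) % S) for p in domain}

same_line = compose(iter_to_line, inverse(iter_to_line))
same_set = compose(iter_to_set, inverse(iter_to_set))
lex_succ = {(a, b) for a in domain for b in domain if a < b}

# Later iterations that touch the same set through a different line.
later_other_line = (same_set - same_line) & lex_succ
print(sorted(b for (a, b) in later_other_line if a == (0,)))
```

For iteration 0, which accesses line 0 in set 0, the relation returns the later iterations accessing lines 2 and 4, the only candidates that can push line 0 toward eviction.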
Algorithm for Miss Calculation
Algorithm 1 provides the complete procedure to obtain the set of iterations incurring a cache miss event for a set-associative cache, as well as the count of these events.
Set specialization. Notably, building a formulation for all cache sets at once is unnecessary. To manipulate simpler systems, it is possible to embed an additional constraint, e.g., in LineIdToCacheSet, where the set is fixed to a constant value, e.g., cset = x, where x ∈ N is a known constant. Then, an iterative algorithm can be built, specializing SingleLevelMisses for each set value S_i ∈ [0, S − 1] and forming the union of all Miss_i sets to form the complete set of all misses for all cache sets.
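Set specialization can be illustrated with a simulation-based stand-in for SingleLevelMisses: each cache set is processed in isolation, and the per-set miss sets are united at the end. The function names and the trace are hypothetical, and the per-set step is a plain LRU simulation rather than the closed-form computation of Algorithm 1.

```python
from collections import OrderedDict


def misses_in_one_set(trace_lines, K, S, x):
    """Trace positions that miss in cache set x, ignoring all other sets."""
    ways, out = OrderedDict(), set()
    for t, line in enumerate(trace_lines):
        if line % S != x:
            continue                       # specialization: keep cset = x only
        if line in ways:
            ways.move_to_end(line)         # hit: refresh recency
        else:
            out.add(t)
            if len(ways) == K:
                ways.popitem(last=False)   # evict least recently used line
            ways[line] = True
    return out


def all_misses(trace_lines, K, S):
    """Union of the per-set miss sets for set indices 0 .. S - 1."""
    result = set()
    for x in range(S):
        result |= misses_in_one_set(trace_lines, K, S, x)
    return result


# Hypothetical trace of line indices: two sweeps over five lines.
lines = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
print(sorted(all_misses(lines, K=2, S=2)))
```

Because the hit/miss pattern of one set never depends on accesses to another, the S subproblems are independent and could also be solved in parallel.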
WRITE POLICIES AND HIERARCHICAL CACHES
In the preceding section, we modeled accesses to cache sets of a single-level cache from read and write operations in an affine program. To model a multi-level cache, we need to determine the read and write operations that are performed on the next level of the cache. The reads and writes that arrive at a cache at level L + 1 depend on the policies employed by the cache at level L.
Thus far, we have modeled caches that manage reads and writes in a symmetric fashion. In this section, we model caching approaches that employ alternative strategies to manage the write operations. When a cache encounters a write operation, it can: (a) Write-through: immediately forward the write to the next cache level, or (b) Write-back: write to the cache and forward to the next level only if the cache line is evicted. In addition, a write to a line not in the cache (a write miss) can be handled in two ways: (a) Write allocate: allocate a line in the cache, read in the current contents of that cache line, and then perform the write in cache, or (b) No-write allocate: do not allocate a line but forward the write operation to the next cache level.
These policies change the cache contents at any point in time, impacting the cache hits, misses, dirty cache lines, and action taken on evictions. Table 1 describes the various scenarios.
Write-allocate write-through policy. The cache miss model for this policy is exactly the same as that described in Algorithm 1. All misses at cache level L with write-allocate, write-through policy become reads to the cache at level L + 1. All writes to such a cache, irrespective of whether they hit or miss in the cache at level L, become writes to the next cache level.
Write-allocate write-back policy. The cache miss model for this policy is exactly the same as that discussed in Algorithm 1. All misses at cache level L with write-allocate, write-back policy become reads to the cache at level L + 1. Evictions of dirty cache lines become writes to the next cache level. This is illustrated in Algorithm 2. The algorithm takes as input the Miss relation computed in line 13 of Algorithm 1, the relation from the iteration that allocates a cache line to the nearest following iteration that evicts it.
No-write-allocate write-through policy. A cache with this policy caches all read operations and only the write operations that result in a cache hit. In particular, while write misses do not affect the cache state, a write hit updates a cache line's priority, affecting subsequent evictions under LRU replacement policy. Therefore, modeling the misses for a no-write-allocate cache requires that we first model the write hits, and thus, in turn, write misses. To overcome this challenge, we present an approximate solution. We employ a cache miss formulation that is the same as in Algorithm 1, except that only the read references (ReadRefs) are used in formulating the cache misses. All read misses at cache level L become reads to the cache at level L + 1. All writes to a no-write-allocate write-through cache become writes to the next cache level.
No-write-allocate write-back policy. We approximate the cache misses using the same strategy as with a no-write-allocate write-through cache. We compute the read misses using only ReadRefs. All read misses at cache level L become reads to the cache at level L + 1. Write miss operations are forwarded to the next cache level. We compute the write misses using Algorithm 1, modified to look for K distinct reads between a write and its immediately preceding read. Specifically, to compute the write misses, line 5 in Algorithm 1 is changed to:
where WriteIterToLine considers only iterations involving write operations. Evictions of dirty cache lines (Algorithm 2) and write misses become writes to the next cache level.
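To make the derivation of next-level traffic concrete, the following sketch enumerates a small access trace through one write-allocate write-back level. It is an illustration under assumptions, not Algorithm 2: it works on an explicit trace rather than on integer sets, assumes LRU replacement and a modulo line-to-set mapping, and emits the write-back of a dirty victim before the fill read (the relative order is a choice made here). The function name write_back_level is hypothetical.

```python
from collections import OrderedDict

def write_back_level(accesses, num_sets, assoc):
    """Simulate one write-allocate, write-back LRU level.

    accesses: list of (op, line) with op in {'R', 'W'}.
    Returns (miss_count, accesses forwarded to the next level).
    """
    sets = [OrderedDict() for _ in range(num_sets)]  # line -> dirty flag
    misses, downstream = 0, []
    for op, line in accesses:
        s = sets[line % num_sets]
        if line in s:
            s.move_to_end(line)                      # hit: refresh recency
        else:
            misses += 1
            if len(s) == assoc:
                victim, dirty = s.popitem(last=False)
                if dirty:                            # dirty eviction -> write
                    downstream.append(('W', victim))
            downstream.append(('R', line))           # every miss -> read
            s[line] = False                          # allocated clean
        if op == 'W':
            s[line] = True                           # write marks line dirty
    return misses, downstream

l1_misses, to_l2 = write_back_level([('W', 0), ('R', 1), ('R', 2)],
                                    num_sets=1, assoc=2)
```

Feeding to_l2 into a second call with the L2 parameters chains the two levels, which mirrors how accesses at level L + 1 are derived from events at level L.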
CACHE MODELING ACROSS PROGRAM PHASES
We have presented an analysis of cache misses for an affine program executed with an initial cold cache (all cache lines are invalid). Programs can consist of affine and non-affine phases. Our modeling approach can be used in the context of a whole program analysis framework for the affine program phases, with a conventional trace-based simulation being used for the non-affine phases. To enable such a “hybrid” analysis, we need to adapt our modeling approach to produce the actual final state of the cache at the end of an affine phase, so that it can be provided to a conventional cache simulator to model the subsequent non-affine program phase.
Final Cache State
The final cache state is defined by the set of cache lines in the cache, with information on whether or not they are dirty, and the recency order among the lines in each set. This information about the final cache state after the affine phase is needed to ensure correct simulation for cache accesses in the subsequent non-affine phase using standard cache simulation. Given a cache of associativity K, the cache lines resident in the set S in the final cache state can be computed as the last K distinct accessed cache lines that map to S.
Cache lines in final cache state. Algorithm 3 shows the steps involved in computing this information for a write-allocate cache.
It starts with computing the last iteration (A_0) that accesses each cache set using the lexicographic maximum operation. Then, all iterations that access the same cache line as this last iteration are removed from the set of all program iterations. This procedure is repeated until K iterations are computed. The union of all these cache lines (identified in terms of the iterations that access them) defines the cache lines in the final cache state. Note that this procedure works even if fewer than K distinct cache lines are accessed for some cache sets. Write-allocate caches allocate cache lines for both read and write accesses. Thus, the algorithm is invoked with the set of all program references (ProgRefs) to compute the final state.
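The peeling procedure described above can be mimicked on an explicit trace. This is a minimal sketch, assuming a list of cache-line ids in program order and a modulo line-to-set mapping; the "last entry per set" lookup plays the role of the lexicographic maximum, and final_lines is a hypothetical name.

```python
def final_lines(trace, num_sets, assoc):
    """Return {set: [lines]}, most recently used first, after `trace`."""
    remaining = list(enumerate(trace))        # (iteration, line), in order
    final = {s: [] for s in range(num_sets)}
    for _ in range(assoc):                    # peel one line per set, K times
        last = {}                             # set -> (iteration, line)
        for t, line in remaining:
            last[line % num_sets] = (t, line) # later iterations overwrite
        for s, (_, line) in last.items():
            final[s].append(line)
        picked = {line for _, line in last.values()}
        # Remove every access to the lines just selected.
        remaining = [(t, l) for t, l in remaining if l not in picked]
    return final

trace = [0, 4, 8, 0, 1, 5, 1, 12, 4]          # cache-line ids (assumed example)
state = final_lines(trace, num_sets=4, assoc=2)
```

Sets that see fewer than K distinct lines simply end up with shorter lists, matching the remark that the procedure still works in that case.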
Given the relation returned by the algorithm, the actual lines in the cache are given by:
[Algorithm 3 listing; only the following fragments are recoverable]
A_assoc = lexmax(tmp) // Compute the last assoc-th access to each cache set
// Remove the assoc-th access from the list of accesses to be considered
12: FinalAccess = ∪_{i=1}^{K} A_i // Relation from each cache set to the iterations that access the last K distinct cache lines
13: return FinalAccess
Dirty cache lines. In the case of write-through caches, no cache lines need to be marked as dirty because all write operations are also reflected in the next cache level. In the case of write-back caches, cache lines that were written to after they were allocated in the cache for the last time need to be marked as being dirty in the final cache state. This computation is performed as shown in Algorithm 4. We now have the collection of dirty and non-dirty cache lines in the final cache state. The program's schedule gives us the last iteration that accesses each line in the final cache state. Accessing the cache lines in the final cache state (reading the non-dirty lines and writing the dirty ones) will reproduce the final cache accesses in program order, preserving the LRU characteristics.
Algorithm 4 Determine Dirty Cache Lines In The Final Cache State For Write-Back Caches
Together, the cache lines, dirty markers, and recency information in the final cache state mirror those that would be observed in an actual program execution.
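The dirty-line rule (a resident line is dirty if it was written after its last allocation) can be checked on a small trace. The sketch below is an enumerative stand-in for Algorithm 4, whose listing is not reproduced here; it assumes a write-allocate write-back LRU cache with a modulo set mapping, and final_dirty_lines is a hypothetical name.

```python
from collections import OrderedDict

def final_dirty_lines(accesses, num_sets, assoc):
    """Return {set: [(line, dirty)]} ordered LRU -> MRU after `accesses`."""
    sets = [OrderedDict() for _ in range(num_sets)]
    last_alloc = {}                      # line -> iteration of last allocation
    last_write = {}                      # line -> iteration of last write
    for t, (op, line) in enumerate(accesses):
        s = sets[line % num_sets]
        if line in s:
            s.move_to_end(line)
        else:
            if len(s) == assoc:
                s.popitem(last=False)    # evict LRU line
            s[line] = None
            last_alloc[line] = t         # (re)allocation resets dirtiness
        if op == 'W':
            last_write[line] = t
    return {
        i: [(line, last_write.get(line, -1) >= last_alloc[line]) for line in s]
        for i, s in enumerate(sets)
    }

accesses = [('W', 0), ('R', 1), ('R', 2), ('R', 0), ('W', 2)]
state = final_dirty_lines(accesses, num_sets=1, assoc=2)
# Replaying the state (read clean lines, write dirty ones, LRU first)
# recreates the same lines, dirty markers, and recency order.
replay = [('W' if dirty else 'R', line) for line, dirty in state[0]]
```

In this example line 0 is written early, evicted, and later re-read, so it is clean in the final state even though the program wrote to it.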
Initial Cache State
The cache behavior of an affine program phase is dependent on the initial cache state when starting the phase. The cache state at any point can be encoded with an affine program that, if executed, would lead to such cache state. When composing two affine program phases, the final cache state of the first phase can be represented as an affine program to be executed before the second affine phase. This will correctly evaluate the cache behavior.
For arbitrary initial cache states, this program representation can be as large as the cache and expensive to manipulate in the form of integer sets. In this scenario, the first K distinct cache lines to be accessed in each set can be computed analogously to computing the final cache state. We enumerate each such access to check if it is a hit or a cold miss. The remaining accesses are modeled as presented in preceding sections.
The overall procedure to combine simulation of arbitrary program phases and the affine program phase involves the following steps:
(1) Formulate a model of the first access to K distinct cache lines to each set in the affine phase.
(2) For each such initial access, determine if it is a hit or a miss based on the initial cache state.
(3) Model the affine phase using the integer set formulation.
(4) Reconstruct the final cache state (Sec. 6.1).
We note that while the affine phase is modeled in a problem- and cache-size independent fashion, the interface between the affine and non-affine phases incurs costs proportional to the cache size.
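The idea of encoding a cache state as a program that recreates it can be sketched with explicit traces. The code below is an assumption-laden illustration (LRU replacement, modulo set mapping, read-only accesses, hypothetical names simulate and state_as_prefix): it shows that prepending the encoded state of a first phase to a second phase, starting from a cold cache, yields the same hit/miss outcome for the second phase as continuing from the warm cache.

```python
from collections import OrderedDict

def simulate(trace, num_sets, assoc, sets=None):
    """LRU simulation; returns (per-access miss flags, final per-set state)."""
    if sets is None:
        sets = [OrderedDict() for _ in range(num_sets)]   # cold cache
    flags = []
    for line in trace:
        s = sets[line % num_sets]
        hit = line in s
        if hit:
            s.move_to_end(line)
        else:
            if len(s) == assoc:
                s.popitem(last=False)
            s[line] = None
        flags.append(not hit)
    return flags, sets

def state_as_prefix(sets):
    """Encode a state as accesses (LRU first) that recreate it from cold."""
    return [line for s in sets for line in s]

phase1 = [0, 4, 8, 1]
phase2 = [4, 8, 0, 5, 1]
_, st = simulate(phase1, num_sets=4, assoc=2)
prefix = state_as_prefix(st)              # taken before `st` is mutated below
warm, _ = simulate(phase2, num_sets=4, assoc=2, sets=st)
cold, _ = simulate(prefix + phase2, num_sets=4, assoc=2)
```

The prefix holds at most one access per resident line, which is consistent with the remark that the interface cost is proportional to the cache size.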
EXPERIMENTAL EVALUATION
The following details an extensive evaluation of our framework.
Experimental Setup
The presented approach to cache modeling has been implemented in the PolyCache tool, which takes as input a C program and information about the cache parameters and the size and starting address of arrays. PolyCache outputs the cache miss count of the program, and was used to generate all results presented in this paper. It is implemented using ISL-0.17 [Verdoolaege 2010b] (with barvinok-0.38 and pet-0.07), an integer set library for the polyhedral model. The Polyhedral Extraction Tool (PET) [Verdoolaege and Grosser 2012], a powerful SCoP extractor on top of Clang, detects affine regions and extracts the polyhedral model from C source code. ISL [Verdoolaege 2010a] is used to perform the operations in the algorithms described in the previous sections. We use the set-specialization approach discussed in Sec. 4.3 to compute the analysis for each set independently, and then accumulate the results as appropriate. These computations are run in parallel, using one process per set in the cache, since each set's behavior is independent of the others.
Evaluation of Hierarchical Set-Associative Cache
Benchmarks. We evaluate PolyCache's performance with the PolyBench/C benchmarking suite [Pouchet 2017b], which contains key numerical kernels and programs written as SCoPs. We also select some scientific computing benchmark kernels from HPGMG [Adams 2014], which is a representative high-performance computing benchmark based on geometric multigrid methods, and CCSD(T) kernels from the computational chemistry suite NWChem [Valiev et al. 2010]. Table 2 shows the various codes (64 in total) that were evaluated. We employed the standard dataset size provided for PolyBench/C, which typically leads to a data footprint exceeding 1 MB.
Table 2 (only the rows below are recoverable)
Benchmark  Description                          L1 accesses  L2 accesses  L1 misses  L2 misses  DineroIV (s)  PolyCache (s)
(missing)  (row label not recovered)            10612080000  166855680    133562880  133562880  5844          963
ccsd_d1_1  NWChem tensor contraction kernel 1   1073741824   270598144    269549568  2777765    444           2
ccsd_d1_2  NWChem tensor contraction kernel 2   1073741824   271581184    270532608  2263970    439           2
ccsd_d1_3  NWChem tensor contraction kernel 3   1073741824   287309824    286261248  2587673    449           2
ccsd_d1_4  NWChem tensor contraction kernel 4   1073741824   270598144    269549568  1723455    537           2
ccsd_d1_5  NWChem tensor contraction kernel 5   1073741824   271581184    270532608  1505958    529           2
ccsd_d1_6  NWChem tensor contraction kernel 6   1073741824   287309824    286261248  1830622    534           2
ccsd_d1_7  NWChem tensor contraction kernel 7   1073741824   270598144    269549568  1784865    494           2
ccsd_d1_8  NWChem tensor contraction kernel 8   1073741824   271581184    270532608  1552533    484           2
ccsd_d1_9  NWChem tensor contraction kernel 9   1073741824   287309824    286261248  1877197    491           2
ccsd_d2_1  NWChem tensor contraction kernel 10  1073741824   3211264      2162688    1444382    397           2
ccsd_d2_2  NWChem tensor contraction kernel 11  1073741824   18939904     17891328   1447441    418           2
ccsd_d2_3  NWChem tensor contraction kernel 12  1073741824   18954304     17905728   1754040    415           2
ccsd_d2_4  NWChem tensor contraction kernel 13  1073741824   3211264      2162688    1489164    397           2
ccsd_d2_5  NWChem tensor contraction kernel 14  1073741824   18939904     17891328   1508608    418           2
ccsd_d2_6  NWChem tensor contraction kernel 15  1073741824   18954304     17905728   1813508    407           2
ccsd_d2_7  NWChem tensor contraction kernel 16  1073741824   3149824      2101248    2101248    399           2
ccsd_d2_8  NWChem tensor contraction kernel 17  1073741824   18886144     17837568   2464140    411           2
ccsd_d2_9  NWChem tensor contraction kernel 18  1073741824   18893824     17845248   2708672    410           2
ccsd_s1_1  NWChem tensor contraction kernel 19  67108864     2162704      1114128    1114128    28            1
ccsd_s1_2  NWChem tensor contraction kernel 20  67108864     2162704      1114128    1114128    27            1
ccsd_s1_3  NWChem tensor contraction kernel 21  67108864     2162704      1114128    1114128    27            1
ccsd_s1_4  NWChem tensor contraction kernel 22  67108864     2162704      1114128    1114128    27            1
ccsd_s1_5  NWChem tensor contraction kernel 23  67108864     2162704      1114128    1114128    28            1
ccsd_s1_6  NWChem tensor contraction kernel 24  67108864     2162704      1114128    1114128    28            1
ccsd_s1_7  NWChem tensor contraction kernel 25  67108864     2101264      1052688    1052688    27            1
ccsd_s1_8  NWChem tensor contraction kernel 26  67108864     2101264      1052688    1052688    27            1
ccsd_s1_9  NWChem tensor contraction kernel 27  67108864     2101264      1052688    1052688    28            1
The number of data access operations ranges from millions to billions. These benchmarks provide a variety of challenges, including out-of-cache dataset sizes, multiple arrays, imperfectly nested loops, triangular loops, non-uniform reuse distance, etc.
Tools and setup. As a comparison point for both correctness and performance of PolyCache, we used the DineroIV simulator [Edler and Hill 1999] , a uniprocessor cache simulator that can handle hierarchical set-associative caches, as well as numerous replacement and write policies. All experiments were performed on a cluster of Intel Xeon E5640 processors running at 2.67 GHz with 32 KB L1 cache. The programs were all compiled using a GCC-6.1.0 compiler with -O3 optimization. We parallelized the PolyCache computation across sets, as mentioned above, using as many processes as sets in the cache and report the time to completion.
Evaluation. The first experiment aims to validate the correctness of the framework against a trace-based simulator for a real-life complex scenario: a 2-level cache memory hierarchy with a 32 KB 4-way set-associative L1 cache and a 256 KB 4-way set-associative L2 cache, both implementing a write-allocate write-back policy. Both caches have a 64-byte cache line size. Table 2 compares the cache accesses/misses computed by DineroIV with the accesses/misses obtained with PolyCache, as well as the execution time (in seconds) for both systems. The number of data reads/writes issued by the program is not reported herein. Instead, we focus solely on the cache access/miss count.
A key observation is that for all situations, there is a perfect match between DineroIV and PolyCache. We performed this validation for all experiments reported in this paper and always obtained an exact match in the number of accesses/misses reported by DineroIV and PolyCache for the non-approximated schemes.
To illustrate the intricacy of the situations modeled by our analytical framework, we detail results for the Floyd-Warshall benchmark. Of note, the total number of misses in L2 is actually slightly higher than the number of misses in L1, by exactly 64,000. This is the expected result, but it results from a complicated effect. First, cold misses are the same for L1 and L2. Capacity misses are also the same. Floyd-Warshall uses a matrix of size 1024 × 1024, and the reuse distance between two iterations of the outer loop is larger than both the 32- and the 256-KB cache. Moreover, there are nearly no conflict misses in either L1 or L2 for this experiment due to the large number of capacity misses. Thus, cache misses for L1 and L2 are expected to be mostly the same, yet there are 64,000 more misses in L2. In the 2-level cache hierarchy with write-back policy implemented for both L1 and L2, a line evicted from L1 is written to L2 to maintain coherency between the caches, and a line evicted from L2 is written back to memory. Writing a line evicted from L1 to L2 is a cache hit (write hit) only if the line is already in L2. Otherwise, it will generate an L2 miss (write miss). Furthermore, it may lead to evicting a line in L2 if the set to which this line maps in L2 is already full with other lines. Therefore, events in L1, such as dirty line eviction, may result in a cache miss in L2. In fact, there are 64,000 lines written in Floyd-Warshall (using a matrix of size 1024 × 1024), which corresponds exactly to the additional 64,000 misses in L2.
We note that there is significant variability in execution time across benchmarks and methods. As expected, for a trace-driven simulator, the DineroIV time is proportional to the trace size. For example, the 2mm benchmark, composed of a sequence of two gemm matrix multiplications, takes about twice as much time to simulate as gemm (there are about a billion operations in gemm). Our analytical approach is much faster, up to 365× faster in this case. However, there exist some cases (2 out of 64, shown in bold in Table 2) where the compile-time approach is slower than the simulation. Unfortunately, predicting a priori the execution time of PolyCache is infeasible. By design, we operate on integer linear programs, where even a simple polyhedron emptiness test is NP-hard. However, for many typical cases the observed complexity is polynomial in practice. This issue is addressed in greater detail in Sec. 7.6.
Evaluation of Write Policies
As discussed in Sec. 5, the framework is capable of handling various write-allocate policies exactly. In PolyCache, we have implemented the write-allocate write-back policy used in Table 2, and evaluate below our implementation of the no-write-allocate write-back policy. Table 3 compares the cache misses computed by PolyCache (L1 and L2 columns) with DineroIV (% diff. columns; -0.1% means PolyCache under-approximated the misses by 0.1%). We use L1 and L2 caches of the same size and associativity as in Table 2, but both using instead the no-write-allocate write-back policy.
Table 3 (only the rows below are recoverable)
Benchmark  L1 misses   % diff.  L2 misses   % diff.  DineroIV (s)  PolyCache (s)
(missing)  -           -        -           -        1685          764
trisolv    530944      0%       527488      0%       12            2
chebyshev  409900120   0%       409900120   0%       1003          3135
heat       630612960   0%       630612960   0%       2573          324
minigmg    659660352   0%       659660352   0%       803           776
poisson    630874080   0%       630874080   0%       4650          3650
j3d7pt     630612960   0%       630612960   0%       1953          208
j3d13pt    3179606544  0%       3179606544  0%       3780          1664
j3d27pt    630874080   0%       630874080   0%       5986          4749
ccsd_d1_1  269549568   0%       2531989     0%       438           3
ccsd_d1_6  286261248   0%       1798021     0%       534           4
ccsd_d2_1  2162688     0%       1444382     0%       398           3
ccsd_d2_9  17845248    0%       2466976     0%       405           3
ccsd_s1_1  1114128     0%       1114128     0%       27            2
ccsd_s1_9  1052688     0%       1052688     0%       27            2
We make three observations. First, as the formulation is significantly more complex for this write policy, the execution time of PolyCache significantly increases compared to Table 2. In fact, 8 benchmarks (2mm, cholesky, durbin, fdtd-apml, lu, reg-detect, symm and trmm) timed out; that is, PolyCache did not complete the calculation within 120 minutes, our timeout for this experiment. Second, for most benchmarks in our experimental setup, there is 0% difference in the cache miss count computed by PolyCache compared to DineroIV's output. That is, for these experiments, approximating the no-write-allocate policy by only considering read references does not change the miss count at either L1 or L2.
This highlights that ignoring the effect of write hits on LRU in these experiments does not change the miss count. Third, only three benchmarks (shown in bold) show differences between our approximation and DineroIV: PolyCache under-approximates the miss count by less
In two cases (Table 4, in bold) the execution times are longer than simulation, for example, fdtd-apml, where it takes almost 1 hour to complete. The reason is the same as explained earlier.
Due to the inherent difficulty of predicting when this analytical formalism will lead to prohibitive execution time, we advocate a timeout-based approach. Users may set the timeout to a few minutes. If no answer is provided, users can, for example, fall back to use of approximation heuristics (refer to Sec. 7.5).
Evaluation of Approximation Heuristics
Here, we discuss possible acceleration techniques with the goal of reducing static analysis time, while allowing for an approximate result. Thus, the methods presented in this section do not guarantee an exact miss count, in contrast to the results presented earlier, which are exact by definition.
Per-array analysis. We have developed a simple heuristic that analyzes only the references to a single array in the program at a time, and the process is repeated for all arrays. In other words, we only capture self-interference in this process. Table 5 reports the number of misses computed for each array in the program, where for programs with more than four arrays, we have summed the number of misses obtained for all subsequent arrays. We compare the sum of misses obtained by this approximation, based solely on self-interference, with the exact number of misses in the full program, and the execution time of this heuristic with that of the full program analysis with PolyCache. The average speedup of the analysis time is 8×, and, in many situations, there are no cross-interference misses between the various arrays in the benchmark.
Table 5. Summary of Per-Array Heuristic
such as the minimum, maximum, average, and standard deviation, and report the miss count error (% difference) and execution time of the heuristic compared to the execution time of PolyCache. Note an essential difference in the setups between the set-0 heuristic and full program analysis. For the set-0 heuristic, we compute only one set, meaning PolyCache is executed on a single core without any parallelism. In contrast, for “Time all sets,” the time reported is the time for PolyCache to complete when using S-way parallelism, i.e., using S cores.
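The two heuristics discussed in this section can be mimicked on a toy trace to see where they deviate from the exact count. This is a sketch under stated assumptions rather than PolyCache's implementation: it enumerates an explicit trace with LRU replacement and a modulo set mapping, the function names are hypothetical, and extrapolating from set 0 by multiplying its miss count by the number of sets is an assumption made here for illustration.

```python
from collections import OrderedDict

def count_misses(lines, num_sets, assoc, only_set=None):
    """LRU miss count; optionally restricted to the accesses of one set."""
    sets, misses = {}, 0
    for line in lines:
        idx = line % num_sets
        if only_set is not None and idx != only_set:
            continue
        s = sets.setdefault(idx, OrderedDict())
        if line in s:
            s.move_to_end(line)
        else:
            misses += 1
            if len(s) == assoc:
                s.popitem(last=False)
            s[line] = None
    return misses

def per_array_estimate(refs, num_sets, assoc):
    """Sum of misses when each array is analyzed alone (self-interference)."""
    arrays = {name for name, _ in refs}
    return sum(count_misses([l for n, l in refs if n == a], num_sets, assoc)
               for a in arrays)

def set0_estimate(refs, num_sets, assoc):
    """Assumed extrapolation: misses of set 0 times the number of sets."""
    return num_sets * count_misses([l for _, l in refs], num_sets, assoc,
                                   only_set=0)

# (array name, cache-line id) pairs; arrays A and B collide in sets 0 and 1.
refs = [('A', 0), ('B', 4), ('A', 0), ('B', 4), ('A', 1), ('B', 5)]
exact = count_misses([l for _, l in refs], num_sets=4, assoc=1)
per_array = per_array_estimate(refs, num_sets=4, assoc=1)
set0 = set0_estimate(refs, num_sets=4, assoc=1)
```

On this deliberately adversarial trace the per-array sum misses the cross-interference between A and B, and the set-0 extrapolation overshoots because set 0 is not representative of the other sets.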
First, we observe a variety of benchmarks where the set-0 heuristic actually provides near-exact results. These illustrate regular cache access patterns. Conversely, the approach may fail dramatically, such as for LU, where the number of misses is under-approximated by 20×. Yet, these results show that a surprisingly good approximation can be obtained very quickly. For all but two benchmarks (LU and symm), the cache misses are off by a factor of 2 or less, with 18/30 benchmarks giving a nearly exact value. We believe these results pave the way for additional focused experimental studies of the regularity of cache accesses for affine programs with the goal of determining when the cache behavior may be approximated effectively by the behavior of a single (or few) set(s). As future work, we will investigate sampling heuristics, computing the value of a few randomly chosen sets and using their average miss count instead of remaining limited to set-0.
Time Complexity of PolyCache
Table 2 shows very large execution time variation across benchmarks for PolyCache. The explanation is simple to state but the execution time is infeasible to characterize a priori. In this work, we rely extensively on manipulating sets modeled by Presburger formulae, in particular, on counting the number of points in these sets. This computation is NP-hard. As such, we obviously see situations where a much more complex system (i.e., miss set) is generated and counting takes a very long time, despite an excellent implementation from the Barvinok library [Verdoolaege 2007] integrated in ISL. To the best of our understanding, it is impossible to predict a priori the execution time of PolyCache because it relates to the irregularities in the generated cache miss set, not to simple input program features, such as the number of loops or problem size.
In this section, we provide details on two representative cases, Gemm and Symm. While both have roughly similar numbers of cache misses and similar DineroIV execution times (shown in Table 2 ), for Symm, PolyCache is 150× slower to compute the solution than for Gemm. For each operator type invoked by PolyCache, Table 7 details the number of times this operator is used (Nb Ops) and the total time to execute the Nb Ops calls.
For both computations, the operators are called exactly the same number of times. This is expected, given that it follows Algorithm 1 using the same cache configuration, but it highlights the fact that the number of operator calls does not determine execution time. Interestingly, we see that the time taken for set difference explodes for Symm. This is due to the shape of the sets being subtracted. The result is not a convex set but a union of small, distinct convex sets (pieces). For Symm, the number of pieces explodes, indicating high irregularity in the Miss set. The coalesce operation similarly explodes. Coalescing is a form of simplification of the representation, and tries to combine smaller pieces into a larger convex piece. It is invoked prior to counting (the final operation of our algorithm), i.e., on the result of the difference operation. Intuitively, the execution time appears to be driven by how complicated the intermediate polyhedral sets we manipulate are, which itself strongly depends on the cache configuration being modeled. Because predicting when PolyCache's execution time will be very high is impractical, the approximation techniques outlined herein can prove critical to quickly compute a solution.
simulations, where the lifetime of cache lines can be precisely monitored via hooks in the simulator. However, it is usually too expensive for practical use in design space exploration to optimize vulnerability. Instead, our formulation can be extended to perform detailed lifetime analysis of each cache line so as to characterize vulnerability of the entire cache hierarchy at compile time.
Array padding. Inter-array padding has been shown to impact cross-interference, i.e., conflict misses between arrays (e.g., [Rivera and Tseng 1998]). The inter-array offset can be cast as a linear parameter that needs to be determined to minimize the cache misses.
Cache design analysis. Our formulation enables two optimizations when exploring cache configurations by varying the associativity (cache lines per set) and number of sets, for a given input program. First, as seen in Algorithm 1, an analysis of an input program for a K-way cache with a given number of sets incorporates an analysis of caches with the same number of sets but with lower associativity. Therefore, only the maximum associativity for a given number of sets needs to be evaluated. On the other hand, a simulation may have to consider all combinations of associativity and number of sets. Second, given a configuration for a cache at level L, our approach can present the reads and writes that reach the cache at level L + 1. This affords reuse of the cache evaluation at level L to explore the choice of cache configurations at level L + 1. Together, these can significantly reduce the cost to explore the space of possible cache configurations.
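The first optimization rests on the inclusion property of LRU: with a fixed number of sets, the content of a k-way set is always contained in that of a (k+1)-way set. A minimal sketch of how one pass at the maximum associativity yields the miss counts for every lower associativity follows. It uses per-set LRU stack depths on an explicit trace, which is an enumerative stand-in for the symbolic analysis; the modulo set mapping and the function name are assumptions.

```python
def misses_for_all_assoc(trace, num_sets, max_assoc):
    """Return [misses for assoc=1, ..., assoc=max_assoc] from a single pass."""
    stacks = {}                        # set -> lines, most recently used first
    misses = [0] * max_assoc
    for line in trace:
        stack = stacks.setdefault(line % num_sets, [])
        # Depth = number of distinct lines of this set touched since the
        # previous access to `line`; None if beyond max_assoc or never seen.
        depth = stack.index(line) if line in stack else None
        for k in range(1, max_assoc + 1):
            if depth is None or depth >= k:
                misses[k - 1] += 1     # a k-way set no longer holds the line
        if depth is not None:
            stack.pop(depth)
        stack.insert(0, line)
        del stack[max_assoc:]          # deeper entries miss for every k
    return misses

trace = [0, 4, 8, 0, 1, 5, 1, 12, 4]   # cache-line ids (assumed example)
by_assoc = misses_for_all_assoc(trace, num_sets=4, max_assoc=2)
```

Here the direct-mapped and 2-way miss counts come out of the same traversal, whereas a conventional simulator would typically be run once per configuration.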
Bounding worst-case execution time. The execution time of programs running on real-time systems must be bounded to guarantee response times. Cache-based systems complicate this analysis by introducing unpredictable data access latencies [Alt et al. 1996]. Our exact analysis of cache misses can assist in more accurate determination of cache miss costs and, thus, better bounds on the execution times.
RELATED WORK
Cache simulation. Simulators such as Dinero [Edler and Hill 1999] and Sniper [Carlson et al. 2014] characterize an application's cache behavior with varying degrees of fidelity. Also, approaches have been developed to simulate multiple levels of associativity simultaneously [Hill and Smith 1989]. In general, simulation cost is proportional to the number of executed operations, with the approaches either executing the program or manipulating large memory reference traces [Bao et al. 2017; Wang and Baer 1990]. These approaches do not provide a closed-form model that can be used by compile-time optimizers.
Approximate analysis of cache behavior. Approximate metrics, such as reuse distance, have been used as an approximate measure of cache reuse by compile-time optimization techniques [Ahmed et al. 2001; Bondhugula et al. 2008; Carr et al. 1994; Kelly and Pugh 1993; Lim and Lam 1997]. Ferrante et al. [1991] estimate the number of distinct cache lines accessed by a loop nest. Similar analyses have been developed to predict cache miss ratios for set-associative and fully associative caches [Agarwal et al. 1989; Harper et al. 1999; Singh et al. 1992]. Other approaches to estimate cache misses include probabilistic estimation of an array reference incurring a miss [Fraguela et al. 1999, 2003] and sampling the iteration to approximate the absolute miss ratio for each static array reference, dynamic memory reference [Berg and Hagersten 2004], and individual instructions [Fang et al. 2004, 2005]. These techniques provide inexact cache behavior analysis while being potentially applicable to a larger class of programs. Frumkin and Van W. [2002] developed bounds on cache misses for stencil operations on rectangular grids. Our approach can generate exact cache miss information for a larger class of programs that includes such stencil codes.
Exact analysis of cache misses. Cascaval and Padua [2003] estimate cache misses using stack distances to construct a stack histogram. The model is accurate for fully associative caches with LRU replacement policy and provides approximate solutions for set-associative caches. This work is restricted to perfectly nested loops and dependences characterized as distance vectors. Beyls and D'Hollander [2005] develop a compile-time analysis of cache misses for fully associative caches by analytical modeling of reuse distances, defined as the number of other distinct cache lines accessed between two successive references to a particular cache line. By identifying cache misses and hits in certain parts of the program, Alt et al. [1996] and Bao et al. [2014] model cache behavior using abstract interpretation to improve worst-case execution time bounds. Ghosh et al. [1998, 1999] present cache miss equations (CMEs), an approach to counting the exact number of cache misses in perfectly nested loop nests in the form of the number of solutions to a set of affine equations. The number of solutions to the cache miss equations gives an exact count of the set of misses in direct-mapped and set-associative caches. The CMEs have been extended for use in a variety of contexts, including bounding worst-case data cache behavior and execution time for real-time systems [Ramaprasad and Mueller 2005] and cache vulnerability analysis [Shrivastava et al. 2010]. Solving cache miss equations is expensive [Vera et al. 2005; Vera and Xue 2002], motivating the design of approximate counting strategies [Ghosh et al. 1999; Vera et al. 2003]. Vera et al. [Vera and Xue 2002; Xue and Vera 2004] extend the use of CMEs to count cache misses in programs consisting of imperfectly nested loops and non-recursive subroutine calls using abstract inlining and loop normalization, sinking with if-conditionals.
CMEs and approaches that build on them are restricted to handling dependences characterized as distance vectors.
Chatterjee et al. present an exact solution for direct-mapped caches using Presburger formulae [Chatterjee et al. 2001] . Their formulation for set-associative caches is limited to interior misses, but it appears extensible to boundary misses. As discussed previously, such formulation cannot be used to build the ordered set of events (e.g., cache miss) needed for hierarchical caches.
CONCLUSION
Cache behavior analysis is an essential tool for program optimization. Optimizing the data traffic for cache-equipped processors has been investigated both via software transformations and hardware cache design space exploration. However, the current state of practice to analyze cache behavior is limited to actual execution and/or cache simulation, both with time complexity proportional to the number of executed operations, making compile-time optimization of cache behaviors difficult.
In this work, we proposed a fully compile-time approach to static analysis of cache behavior for hierarchical set-associative virtually-indexed caches for the class of affine (polyhedral) programs. The framework was validated by implementation of a tool to analyze affine C programs to compute cache misses at any level, producing cache statistics equivalent to trace-based cache simulators.
Since the framework makes extensive use of complex polyhedral operations with NP-hard worst-case complexity, long analysis times remain possible. As future work, we plan to investigate additional acceleration schemes for the formulations, as well as the development of compiler analyses and optimizations driven by cache miss estimations.
