Application-specific system-on-chip platforms create the opportunity to customize the cache configuration for optimal performance with minimal chip real estate. Simulation, in particular trace-driven simulation, is widely used to estimate cache hit rates. However, simulation is too slow to be deployed in design space exploration, especially when it involves hundreds of design points and huge traces or long program executions. In this paper, we propose a novel static analysis technique for rapid and accurate design space exploration of instruction caches. Given the program control flow graph (CFG) annotated only with basic block and control flow edge execution counts, our analysis estimates the hit rates for multiple cache configurations in one pass. We achieve this by modeling the cache states at each node of the CFG in a probabilistic manner and exploiting the structural similarities among related cache configurations. Experimental results indicate that our analysis is 24-3,855 times faster than the fastest known cache simulator while maintaining high accuracy (0.7% average error) in predicting hit rates for popular embedded benchmarks.
INTRODUCTION
The fixed functionality nature of embedded systems opens up the opportunity to design a customized system-on-chip (SoC) platform for a particular application or an application domain. The memory subsystem plays a critical role in the design of such customized SoC both in terms of performance and energy consumption. Thus careful tuning of the memory subsystems, in particular the cache parameters, is of paramount importance in meeting the design constraints of a specific embedded application. The cache design parameters include the size of the cache, the line size, the degree of associativity, the replacement policy, and many others. This entire design space has to be explored to identify the cache configuration that optimizes certain objectives, such as performance, energy consumption, or a combination of the two.
The design space exploration of caches is a well-studied problem. The exploration process requires cache hit/miss rates for all possible design points. The most popular approach to compute the hit rate for a particular cache configuration is to employ trace-driven simulation or functional simulation. Unfortunately, simulation-based approaches are too slow, and huge trace sizes put a practical limit on both the size of the application and its input. In this paper, we explore a static analysis method as an alternative to simulation for fast and accurate estimation of cache hit rates.
Recently, we have introduced the concept of probabilistic cache states [10], which captures the set of possible cache states at a program point along with their probabilities. We have also proposed a static analysis method [10] that models the cache behavior to estimate the expected (average) execution time of a program over all possible program inputs. The notion of probabilistic cache states is quite general. It can be easily adapted to construct a fast and accurate static analysis method that estimates the cache hit rate of a program for a particular configuration. Unfortunately, when employed in the context of design space exploration, the runtime of this static cache analysis approach is not competitive with state-of-the-art cache simulators such as Cheetah [13]. This is because fast cache simulators employ single-pass simulation that estimates the hit rates for a large number of cache configurations in one pass. In contrast, static cache analysis has to estimate the hit rate for each cache configuration individually, leading to slower overall design space exploration.
We observe that if a static analysis approach can model multiple cache configurations in one pass, we get a very powerful tool for design space exploration. In this paper, we extend the concept of probabilistic cache states to achieve this goal. We borrow the data structure, called Generalized Binomial Tree (GBT), proposed by Sugumar and Abraham [13] to exploit the inclusion property among related cache configurations. GBT enables us to capture the cache states corresponding to a number of related configurations in one succinct representation. However, as a program point can be reached from different contexts, we may have a number of GBTs, each associated with the probability of the corresponding context.
In this paper, we propose probabilistic GBT to capture the cache states corresponding to all cache configurations and all contexts at any program point. We also define operators for update and concatenation of probabilistic GBTs. These operators are employed in our static program analysis to obtain the probabilistic GBTs at every program point in an efficient manner. Given a probabilistic GBT, we can easily estimate the hit rate of a memory access for all possible cache configurations. However, maintaining these probabilistic GBTs and operating on them can become space and time inefficient as the number of contexts increases. Therefore, we propose a number of optimizations for space and time efficiency.
In summary, we propose a static analysis method for rapid and accurate design space exploration of instruction caches. Our analysis method can estimate the hit rates for all cache configurations with varying number of sets and associativity in one pass as long as the cache line size remains constant. The input to our analysis is simply the basic block and control flow edge execution count profiles, which are significantly more compact than the memory address traces required by trace-driven simulators. Our experimental evaluation on a number of embedded benchmarks reveals that our estimation is highly accurate (0.7% average error) and that our single-pass static cache analysis is 24-3,855 times faster than the fastest known single-pass cache simulator, Cheetah.
ANALYSIS FRAMEWORK
The inputs to our analysis framework are the executable program code and its corresponding input. We can obtain the basic block and control flow edge counts through execution or quick functional simulation of an instrumented version of the program. The instrumentation can be done very efficiently by using edge profiling [2]. More importantly, the profiling needs to be done only once, as basic block and edge execution counts remain unchanged across different cache configurations.
Our analysis first constructs the loop-procedure hierarchy graph (LPHG) corresponding to the whole program [9]. The LPHG represents the procedure calls and loop nest relations in the program. Loop and procedure bodies are represented as directed acyclic graphs (DAG), where the nodes of a DAG are the basic blocks. If a loop (procedure) contains other loops within its body, then the inner loops are represented as dummy nodes in the DAG. Each loop $L$ is annotated with its loop count $N_L$, and its control flow graph is transformed such that every loop has a loop pre-header, post-loop, start, and end node (see Figure 6).
Given a basic block $B$ and an edge $B' \to B$, we use $N_B$ and $N_{B' \to B}$ to denote their execution counts, respectively. For a control flow edge $B' \to B$, the edge frequency $f(B' \to B)$ is defined as the probability that $B$ is reached from $B'$, that is,

$$f(B' \to B) = \frac{N_{B' \to B}}{N_B}.$$

By definition, $\sum_{e \in In(B)} f(e) = 1$, where $In(B)$ represents all the incoming edges of $B$.
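As an illustration of the edge frequency definition, the following Python sketch derives the frequencies from profile counts. The dictionary layout and the names `edge_frequencies`, `block_counts`, and `edge_counts` are assumptions made for this example and are not part of the original framework.

```python
def edge_frequencies(block_counts, edge_counts):
    """Return f(B' -> B) = N_{B'->B} / N_B for every profiled edge.

    block_counts: {block: execution count N_B}
    edge_counts:  {(pred, block): execution count N_{pred->block}}
    Edges into blocks that never execute are skipped to avoid dividing by zero.
    """
    freqs = {}
    for (pred, block), count in edge_counts.items():
        n_b = block_counts[block]
        if n_b > 0:
            freqs[(pred, block)] = count / n_b
    return freqs


# Hypothetical profile: B4 is reached 30 times from B2 and 70 times from B3.
block_counts = {"B2": 30, "B3": 70, "B4": 100}
edge_counts = {("B2", "B4"): 30, ("B3", "B4"): 70}
freqs = edge_frequencies(block_counts, edge_counts)
```

In this hypothetical profile the frequencies of the two edges entering B4 add up to 1, as the definition requires.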
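Cache Hit Rate. Let us use $\mathcal{B}$ to represent the set of the basic blocks of the program and $R_{hit}$ to represent the cache hit rate of the program. Let $I_B$ be the number of instructions and $M_B$ be the set of memory blocks of $B$. Within one execution of $B$, only the first fetch from each memory block can miss; the remaining $I_B - |M_B|$ instruction fetches hit. Then, $R_{hit}$ can be computed as

$$R_{hit} = \frac{\sum_{B \in \mathcal{B}} N_B \times \left( I_B - |M_B| + \sum_{m \in M_B} H_m \right)}{\sum_{B \in \mathcal{B}} N_B \times I_B} \qquad (1)$$

where $H_m$ is the cache hit rate of memory block $m \in M_B$. $N_B$ and $I_B$ are constants across different cache configurations and are available through profiling. However, $H_m$ is unknown and may change across different cache configurations. In the following, we will illustrate how to estimate $H_m$ for all cache configurations through our static cache modeling.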
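The following Python sketch shows one way the program hit rate could be aggregated from per-block quantities. It assumes that, within one execution of a basic block, only the first fetch from each memory block can miss and every other instruction fetch hits. The function name `program_hit_rate` and the dictionary keys are hypothetical.

```python
def program_hit_rate(blocks, block_hit_rate):
    """Aggregate per-memory-block hit rates into a program-level hit rate.

    blocks: list of dicts with keys
        "count"      -> N_B, execution count of the basic block
        "instrs"     -> I_B, number of instructions in the basic block
        "mem_blocks" -> M_B, memory blocks the basic block occupies
    block_hit_rate: {memory block: H_m}
    Assumption: only the first fetch from each memory block may miss.
    """
    hits = 0.0
    accesses = 0
    for b in blocks:
        first_fetch_hits = sum(block_hit_rate[m] for m in b["mem_blocks"])
        guaranteed_hits = b["instrs"] - len(b["mem_blocks"])
        hits += b["count"] * (first_fetch_hits + guaranteed_hits)
        accesses += b["count"] * b["instrs"]
    return hits / accesses


# Hypothetical basic block: 8 instructions spread over two memory blocks.
blocks = [{"count": 10, "instrs": 8, "mem_blocks": ["m0", "m1"]}]
rate = program_hit_rate(blocks, {"m0": 0.5, "m1": 1.0})
```

For this input the sketch counts 75 expected hits out of 80 fetches.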
CACHE MODELING
We rely on the General Binomial Forest (GBF) data structure to estimate $H_m$ for multiple cache configurations simultaneously. GBF was originally proposed for simulating multiple cache configurations in one pass [13]. In this section, we provide a brief background on GBF and then proceed to present our probabilistic GBF. We use the probabilistic GBF for static cache analysis in Section 4.
GBF Background
Let us explain the GBF data structure with an example. In this paper, we consider LRU as the cache replacement policy. Figure 1(a) shows, for the same memory address trace, the contents of six caches with number of sets = 1, 2, 4 and associativity = 1, 2.
From the example, we observe that for the caches with the same associativity, the memory blocks in the cache with 2(1) sets are included in the cache with 4(2) sets. For the caches with the same number of sets, the memory blocks in the cache with associativity 1 are included in the cache with associativity 2.
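The inclusion property can be checked empirically with a small LRU simulator. The sketch below is illustrative only: the trace is hypothetical (it is not the trace of Figure 1), the set index is assumed to be the low-order bits of the memory block address, and the function names are invented for this example.

```python
def simulate_lru(trace, num_sets, assoc):
    """Simulate an LRU set-associative cache; return per-set lists, MRU first."""
    sets = [[] for _ in range(num_sets)]
    for block in trace:
        lines = sets[block % num_sets]   # low-order bits select the set
        if block in lines:
            lines.remove(block)          # hit: block moves to the MRU slot
        elif len(lines) == assoc:
            lines.pop()                  # miss in a full set: evict the LRU block
        lines.insert(0, block)
    return sets


def contents(sets):
    """Flatten the per-set lists into the set of cached memory blocks."""
    return {block for lines in sets for block in lines}


def check_inclusion(trace, set_counts=(1, 2, 4), assocs=(1, 2)):
    """Return True if both inclusion relations hold for the given trace."""
    # Same associativity: fewer sets must be included in more sets.
    for assoc in assocs:
        for small, large in zip(set_counts, set_counts[1:]):
            a = contents(simulate_lru(trace, small, assoc))
            b = contents(simulate_lru(trace, large, assoc))
            if not a <= b:
                return False
    # Same number of sets: lower associativity must be included in higher.
    for num_sets in set_counts:
        for small, large in zip(assocs, assocs[1:]):
            a = contents(simulate_lru(trace, num_sets, small))
            b = contents(simulate_lru(trace, num_sets, large))
            if not a <= b:
                return False
    return True


# Hypothetical trace of memory block addresses.
trace = [0b1001, 0b1100, 0b1011, 0b0101, 0b1001, 0b0010, 0b1100, 0b0101]
inclusion_holds = check_inclusion(trace)
```

The check covers the same six configurations as the example (1, 2, and 4 sets with associativity 1 and 2).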
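GBF exploits the aforementioned inclusion property that holds between cache configurations. Let us denote a set-associative cache with $2^S$ sets, line size $L$, and associativity $N$ as $C^L_S(N)$. A GBF captures the group of cache configurations $\{C^L_S(n) \mid S_{min} \le S \le S_{max},\ 1 \le n \le N\}$, where $2^{S_{min}}$ ($2^{S_{max}}$) is the minimum (maximum) number of sets among the group of cache configurations and $N$ is the maximal associativity.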
A GBF consists of one or more Generalized Binomial Trees (GBT). A GBT can be defined recursively as follows. A GBT of degree 0 is a list of length $N$ and the elements in the list are ordered according to the LRU policy (i.e., the top element is the most recently accessed address, while the bottom element is the least recently accessed address). A GBT of degree $k$ is constructed by linking two GBTs of degree $k-1$ together, with the most recently accessed $N$ references in either root list of the two GBTs as the new root list. By definition, a GBT of degree $k$ has $2^k \cdot N$ nodes.

Let us explain the construction of GBF based on the example shown in Figure 1. The GBF for the cache configuration $C^L_2(2)$ consists of 4 GBTs of degree 0 (one corresponding to each set). We use ⊥ to denote an empty cache block. The GBF for the cache configuration $C^L_1(2)$ contains 2 GBTs of degree 1 (one corresponding to each set). The GBT for a set $s$ in $C^L_1(2)$ is constructed by linking the two degree-0 GBTs of $C^L_2(2)$ whose sets map to $s$.

Array Implementation. We use an array based implementation of GBT [13]. Let us assume the degree of the GBT is $M$. The GBT is implemented as a two-dimensional array with $2^{M+1} - 1$ rows and $N$ columns. The rows are divided into $M + 1$ levels from 0 to $M$, and level $k$ has $2^k$ rows. As discussed before, a GBT of degree $M$ has $2^M \cdot N$ nodes. Thus, the array implementation has about a factor of two redundancy. Figure 2 shows an example of the array implementation of GBT, where $M = 2$ and $N = 2$. Given a node $t$ in the GBT, we use $des(t)$ to denote the number of descendants (inclusive) of node $t$. The rank of a node is defined as $\log_2(des(t)/N)$. A memory block at a node of rank $k$ maps to level $M - k$, and the row within the level is determined by the least significant $M - k$ bits of the memory block address. There are at most $N$ memory blocks in the same row and they are arranged in the order in which they have been accessed (i.e., the leftmost memory block is the most recently used, while the rightmost memory block is the least recently used).
Given an incoming memory block address address, the search and update procedure of GBT starts from the top level and only one row in each level is checked. The row examined in level k is determined by the least significant k bits of address and the tag matches are done with the memory blocks in that row. For example, in Figure 2 , suppose we are searching for address 0101. We first examine 1001 and 1100 in level 0. Then, in level 1, the address 0101 maps to row 1 and so 1011 is examined. Finally, in level 2, the address 0101 maps to row 1 and it is found there.
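The search walk described above can be sketched in Python as follows. Only the lookup is shown; the update step is omitted. The nested-list layout and the name `gbt_search` are assumptions for this illustration, and slots whose contents are not stated in the text are filled with `None`.

```python
def gbt_search(levels, address):
    """Search a GBT stored as levels[k][row][col]; level k has 2**k rows.

    Returns (level, column) of the first match, or None on a miss.
    Only one row per level is examined, selected by the least
    significant k bits of the address.
    """
    for k, rows in enumerate(levels):
        row = address & ((1 << k) - 1)
        for col, block in enumerate(rows[row]):
            if block == address:
                return k, col
    return None


# Partial array for M = 2, N = 2, holding only the blocks named in the text.
# None marks a slot whose content is not given there (an assumption).
levels = [
    [[0b1001, 0b1100]],                                          # level 0
    [[None, None], [0b1011, None]],                              # level 1
    [[None, None], [0b0101, None], [None, None], [None, None]],  # level 2
]
found = gbt_search(levels, 0b0101)
```

With this layout, the search for 0101 inspects row 0 of level 0, row 1 of level 1, and row 1 of level 2, where it finds the block.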
Cache Hits Computation. A two-dimensional array hit is used for storing the cache hits for multiple cache configurations. Whenever a memory block access is a cache hit, the corresponding entries of hit are incremented by 1. However, hit[m][n] only stores the number of references that hit in cache configuration $C^L_m(n)$ but miss in the smaller caches $C^L_m(n')$ where $n' < n$. According to the inclusion property related to associativity, the number of hits in $C^L_m(n)$ can be computed by summing up its own hits and those of the smaller caches as

$$Hits(C^L_m(n)) = \sum_{n'=1}^{n} hit[m][n'].$$
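The summation over smaller associativities amounts to a running sum along each row of the hit array. A minimal Python sketch follows, assuming a 0-based layout in which column `n - 1` holds the entry for associativity `n`; the name `total_hits` and the sample counts are hypothetical.

```python
from itertools import accumulate


def total_hits(hit):
    """Turn per-associativity incremental hits into total hits.

    hit[m][n - 1] counts references that hit with associativity n but
    miss with every smaller associativity (for 2**m sets).
    The result holds, at the same position, the total number of hits.
    """
    return [list(accumulate(row)) for row in hit]


# Hypothetical counts for one set configuration and associativities 1..4.
hit = [[5, 3, 0, 2]]
totals = total_hits(hit)
```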
Probabilistic GBT
We now describe the probabilistic cache modeling based on General Binomial Forest (GBF). The cache configurations we support share a constant line size and vary in the number of cache sets and the degree of associativity. Based on the description in Section 3.1, we are interested in the set of configurations $\{C^L_S(n) \mid S_{min} \le S \le S_{max},\ 1 \le n \le N\}$, where $2^{S_{min}}$ ($2^{S_{max}}$) is the minimum (maximum) number of cache sets and $N$ is the maximum associativity.
Assumptions. For the set of cache configurations above, we will have $2^{S_{min}}$ GBTs of degree $S_{max} - S_{min}$ in the GBF. However, a memory block maps to only one GBT based on its index in $C^L_{S_{min}}(N)$, so there is no interference between different GBTs. Hence, we assume $S_{min} = 0$. In other words, there is only one GBT of degree $S_{max}$ in the GBF. For configurations with more than one GBT, each GBT can be modeled independently.
More concretely, in the following, we consider a GBT of degree $M$ (i.e., $S_{max}$) with root list length $N$. To indicate the absence of any memory block in a cache line, we introduce a new element ⊥. We use Ω to denote the set of all the possible GBTs of the program. We also introduce a special empty GBT $c_\perp$.

At any program point, the GBT is determined by the program path taken before reaching this program point. Usually a program point can be reached via multiple program paths, leading to a number of possible GBTs at that point. Thus, we introduce the notion of probabilistic GBT.

DEFINITION 1 (Probabilistic GBT). A probabilistic GBT $\mathcal{C}$ is a 2-tuple $\langle C, X \rangle$, where $C \in 2^{\Omega}$ is a set of GBTs and $X$ is a random variable. The sample space of the random variable $X$ is Ω. Given a GBT $c$, we define $Pr[X = c]$ as the probability of $c$ in $\mathcal{C}$. If $c \notin C$, then $Pr[X = c] = 0$. By definition, $\sum_{c \in \Omega} Pr[X = c] = 1$. Finally, we define a special probabilistic GBT $\mathcal{C}_\perp$ denoting the empty probabilistic GBT. That is, $\mathcal{C}_\perp = \langle \{c_\perp\}, X_\perp \rangle$ with $Pr[X_\perp = c_\perp] = 1$.

We use $\odot$ to denote the GBT search and update operator. Given a memory block $m$ and a GBT $c$, $c \odot m$ returns the GBT after accessing $m$. We also define a new operator $\boxdot$ as the search and update operator of probabilistic GBTs. Given a memory block $m$ and a probabilistic GBT $\mathcal{C} = \langle C, X \rangle$, $\boxdot$ updates each GBT $c \in C$, and $\mathcal{C} \boxdot m$ returns the updated probabilistic GBT.
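To make the probabilistic update concrete, the sketch below uses a single LRU list (a GBT of degree 0) as a stand-in for a full GBT and a dictionary from states to probabilities as the probabilistic GBT. These representations, the choice to add up the probabilities of states that become identical, and the names `lru_update` and `prob_update` are assumptions of this example.

```python
from collections import defaultdict

EMPTY = None  # stands in for the empty-line element


def lru_update(state, block):
    """Access `block` in an LRU list given as a tuple, most recent first."""
    kept = [b for b in state if b != block and b is not EMPTY]
    new = ([block] + kept)[:len(state)]          # drop the LRU block if full
    return tuple(new + [EMPTY] * (len(state) - len(new)))


def prob_update(pstate, block):
    """Apply the access to every state; identical results share one entry."""
    result = defaultdict(float)
    for state, prob in pstate.items():
        result[lru_update(state, block)] += prob
    return dict(result)


# Hypothetical probabilistic state with two possible contents of one set.
pstate = {("a", "b"): 0.25, ("b", "a"): 0.75}
after = prob_update(pstate, "a")
```

Both input states lead to the same list after accessing `a`, so the updated probabilistic state holds a single entry with probability 1.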
Concatenation of Probabilistic GBTs
In this subsection, we introduce the concatenation of probabilistic GBTs, which will be used later. We first define the operator $\oplus$ for the concatenation of two GBTs in Algorithm 1.

Algorithm 1: Implementation of the $\oplus$ operation
input : GBTs c1 and c2
output : c = c1 $\oplus$ c2
c = c1;
for lev ← M to 0 do
    Let T be the two-dimensional array at level lev in c2;
    for each row r of T do
        for col ← N − 1 to 0 do
            if T[r][col] ≠ ⊥ then
                apply the GBT search and update operation to c with memory block T[r][col];
return c;

In the array based implementation of GBT, c2 is a multilevel two-dimensional array. The concatenation is done by using the memory blocks in c2 from the bottom level to the top level and from right to left to update c1. In other words, the update is done from the least recently used to the most recently used memory blocks of c2. An example of GBT concatenation is shown in Figure 3. Let us assume the GBTs after the first and second memory traces are c1 and c2, respectively. Then the GBT after the accesses corresponding to the two memory traces sequentially is c1 $\oplus$ c2. Next, we extend the concatenation operation to probabilistic GBTs.

DEFINITION 2 (Concatenation of Probabilistic GBTs). Given probabilistic GBTs $\mathcal{C}_1 = \langle C_1, X_1 \rangle$ and $\mathcal{C}_2 = \langle C_2, X_2 \rangle$, the concatenation $\mathcal{C}_1 \boxplus \mathcal{C}_2$ is the probabilistic GBT $\langle C, X \rangle$ where $C = \{c_1 \oplus c_2 \mid c_1 \in C_1, c_2 \in C_2\}$ and $Pr[X = c] = \sum_{c_1 \oplus c_2 = c} Pr[X_1 = c_1] \times Pr[X_2 = c_2]$.

Let us assume the execution of two program fragments sequentially, each starting with an empty GBT. The probabilistic GBTs after the execution of the first and second program fragments are $\mathcal{C}_1$ and $\mathcal{C}_2$, respectively. Then the probabilistic GBT after execution of the two program fragments sequentially is $\mathcal{C}_1 \boxplus \mathcal{C}_2$.
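A minimal sketch of concatenation, again using one LRU list (a GBT of degree 0) in place of a full GBT. The second state is replayed onto the first from its least to its most recently used block. For the probabilistic version, the sketch assumes the two fragments are independent, so the probability of a pair of states is the product of their probabilities. All names are hypothetical.

```python
from collections import defaultdict

EMPTY = None  # stands in for the empty-line element


def lru_update(state, block):
    """Access `block` in an LRU list given as a tuple, most recent first."""
    kept = [b for b in state if b != block and b is not EMPTY]
    new = ([block] + kept)[:len(state)]
    return tuple(new + [EMPTY] * (len(state) - len(new)))


def concat(c1, c2):
    """Replay the blocks of c2 onto c1, least recently used first."""
    c = c1
    for block in reversed(c2):
        if block is not EMPTY:
            c = lru_update(c, block)
    return c


def prob_concat(p1, p2):
    """Concatenate every pair of states, weighting by the product (assumed)."""
    result = defaultdict(float)
    for c1, pr1 in p1.items():
        for c2, pr2 in p2.items():
            result[concat(c1, c2)] += pr1 * pr2
    return dict(result)


# Hypothetical probabilistic states of two consecutive program fragments.
p1 = {("a", "b"): 0.5, ("b", "a"): 0.5}
p2 = {("c", EMPTY): 0.5, (EMPTY, EMPTY): 0.5}
joined = prob_concat(p1, p2)
```

Here each of the four state pairs yields a distinct result, so `joined` has four entries of probability 0.25.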
Merging GBTs in a Probabilistic GBT
A program path can be specified by the basic block sequence. Although multiple paths could reach a program point, they probably traverse some common basic block subsequence. Thus, the set of GBTs in a probabilistic GBT can include some identical memory blocks. By merging the similar GBTs together, we can reduce the space requirement of probabilistic GBTs. More importantly, the search and update of probabilistic GBTs will be much faster.
In the array based implementation, a GBT is divided into $M + 1$ levels. We merge the GBTs level by level from top to bottom. More concretely, given two GBTs, if the contents of the top $k$ ($k \le M + 1$) levels are identical, then they are merged together to have only one copy of the top $k$ levels as shown in Figure 4(a). Also, as the GBTs are merged together, the probabilities are now associated with each level rather than with the GBTs. It is possible to perform merging at a finer granularity, for example, using rows rather than levels. However, the complexity of the merging process increases considerably, leading to a slower implementation. It is also possible that two GBTs are different at the top levels but identical at the bottom levels. We choose not to perform merging for such GBTs. This is because, as the probabilistic GBT is updated, the contents from the upper levels move to the lower levels. Thus the commonality among the GBTs is lost and they have to be split again. It is far more efficient to merge GBTs only if they are identical at the top levels.
The implementation of a merged GBT can be viewed as a tree with the sub-arrays (levels) of the original GBTs as nodes (see Figure 4(a) ). The sub-array corresponding to the common top levels 0 − k is the root node of this tree. Level k, however, has multiple children at level k + 1. Now the search and update of probabilistic GBTs become more efficient. Consider a memory block m that is present somewhere in the top k levels. Without merging, m will be searched in all the original GBTs; now it will be searched only once in the merged GBT. For example, in Figure 4 (a), before merging, the reference to memory block 100 is searched in both c1 and c2. With merged GBT, it is only searched once. In Figure 4 (b), we show the merged probabilistic GBT after concatenation operation.
Bounding the size of Probabilistic GBT
We observe that, in a probabilistic GBT, some of the constituent GBTs have very low probabilities. That is, these GBTs correspond to rare program paths. Based on this observation, we prune some of the GBTs for space and time efficiency.
We define the metric dist for pruning. Consider a merged GBT with two nodes at level $k$. Each node is a two-dimensional array with $2^k$ rows and $N$ columns. Given two such nodes n1, n2 at the same level, we define d(n1, n2) as the measure of the distance between them. It is defined as a function of the number of different memory blocks between them. But higher priority is given to the more recently used memory blocks as shown in Equation 2.
We apply two pruning strategies. First, if the probability of a node n is too small (< Te), then the subtree rooted at n is pruned, and its probability is added to the subtree rooted at the closest sibling of n (the closest is defined by the dist metric). Second, if the number of children of a node exceeds a pre-defined limit Z, then the Z children with the highest probability are kept and the subtrees rooted at the rest of the children are pruned. As before, the probability of each pruned child is added to its closest surviving sibling defined by the dist metric. The pruning process continues from top to bottom. As shown in Figure 5, the subtree rooted at m (including m) is pruned because its probability is too small. However, its probability is added to the subtree rooted at m1, which is the closest sibling of m. A similar pruning strategy can be applied across independent or merged GBTs in a probabilistic GBT. In practice, we set Te to $10^{-6}$ and Z to 4.
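The two pruning rules can be sketched for the children of a single node as follows. The distance function here is a hypothetical stand-in that counts mismatching slots and weights columns closer to the most recently used position more heavily; it is not the exact formula of Equation 2. The sketch also applies both rules in one call and always keeps the most probable child, which are simplifications. The names `dist` and `prune_children` are invented for this example.

```python
def dist(n1, n2):
    """Hypothetical distance: weighted count of slots that differ.

    Nodes are tuples of rows, each row a tuple of memory blocks ordered
    from most to least recently used. Earlier columns weigh more.
    """
    total = 0
    for row1, row2 in zip(n1, n2):
        width = len(row1)
        for col, (a, b) in enumerate(zip(row1, row2)):
            if a != b:
                total += width - col
    return total


def prune_children(children, t_e=1e-6, z=4):
    """Prune sibling nodes; children maps node -> probability.

    Keeps at most z children, drops kept children below t_e (except the
    most probable one), and moves each pruned probability to the closest
    surviving sibling so that the total probability is preserved.
    """
    ranked = sorted(children.items(), key=lambda kv: kv[1], reverse=True)
    survivors = dict(ranked[:z])
    pruned = list(ranked[z:])
    for node, prob in ranked[:z][1:]:
        if prob < t_e:
            del survivors[node]
            pruned.append((node, prob))
    for node, prob in pruned:
        closest = min(survivors, key=lambda s: dist(s, node))
        survivors[closest] += prob
    return survivors


# Hypothetical sibling nodes with one row and two columns each.
A, B, C = ((1, 2),), ((3, 4),), ((3, 5),)
kept = prune_children({A: 0.5, B: 0.3, C: 0.2}, z=2)
```

In this example C exceeds the limit of two children and differs from B in only the least recently used slot, so its probability moves to B.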
Cache Hit Rate of a Memory Block
Recall that in Section 3.1, if a memory block m results in a cache hit, the corresponding entries in the array hit are incremented by 1. However, in our probabilistic cache modeling, we get a cache hit probability by looking up the probabilistic GBT. The hit probability is simply the sum of the probabilities of all the nodes where m can be found in the probabilistic GBT. Now we add this hit probability to the hit array.
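As a simplified illustration, the sketch below computes the hit probability of a memory block for every associativity from a probabilistic state of a single cache set. Each state is one LRU list (a GBT of degree 0), so the position of the block in the list decides the smallest associativity that hits. The representation and the name `hit_probabilities` are assumptions of this example.

```python
def hit_probabilities(pstate, block, max_assoc):
    """Return a list whose entry n - 1 is the hit probability for associativity n.

    pstate maps LRU lists (tuples, most recently used first, at most
    max_assoc entries) to probabilities. A block at position p hits in
    every cache with associativity greater than p.
    """
    exact = [0.0] * max_assoc          # probability mass per list position
    for state, prob in pstate.items():
        if block in state:
            exact[state.index(block)] += prob
    cumulative, running = [], 0.0
    for p in exact:                    # sum over smaller associativities
        running += p
        cumulative.append(running)
    return cumulative


# Hypothetical probabilistic state of one set with associativity up to 2.
pstate = {("a", "b"): 0.5, ("b", "a"): 0.25, ("c", "b"): 0.25}
h_a = hit_probabilities(pstate, "a", 2)
```

For block `a` the sketch yields a hit probability of 0.5 with associativity 1 and 0.75 with associativity 2.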
For memory block m, we can get its hit rate Hm for different cache configurations if the probabilistic GBT at that program point is known. Then the cache hit rate of the whole program can be derived from Equation 1. Now we present our static analysis method to derive the probabilistic GBTs at every program point.
STATIC CACHE ANALYSIS
In this section, we first describe cache analysis for a loop in isolation. Subsequently, we will extend this analysis to the whole program. For a loop, we consider its control flow graph as a directed acyclic graph (DAG). We first perform the analysis on the DAG for a single iteration, followed by modeling across iterations.

Analysis of DAG

The incoming probabilistic GBT of a basic block $B$ is obtained from the outgoing probabilistic GBTs of its predecessors. We rely on the following new operator to do the combination.

DEFINITION 3 (Probabilistic GBTs Combination). We define $\biguplus$ as the combination operator for probabilistic GBTs. It takes in $n$ probabilistic GBTs $\mathcal{C}_i = \langle C_i, X_i \rangle$ and a corresponding weight function $w$ as input s.t. $\sum_{i=1}^{n} w(\mathcal{C}_i) = 1$. It produces a combined probabilistic GBT $\mathcal{C} = \langle C, X \rangle$ as follows.

$$C = \bigcup_{i=1}^{n} C_i, \qquad Pr[X = c] = \sum_{i=1}^{n} w(\mathcal{C}_i) \times Pr[X_i = c]$$

In other words, the set of GBTs in $\mathcal{C}$ is the union of all the GBTs in $\mathcal{C}_1, \ldots, \mathcal{C}_n$. The probability of a GBT $c \in C$ is a weighted summation of the probabilities of $c$ in the input probabilistic GBTs. Let $in(B) = \{B', B'', \ldots\}$ be the set of predecessors of $B$. Then the incoming probabilistic GBT of $B$ can be derived as

$$\mathcal{C}^{in}_B = \biguplus_{B' \in in(B)} \mathcal{C}^{out}_{B'}$$

where the weight function $w$ is defined as $w(\mathcal{C}^{out}_{B'}) = f(B' \to B)$. Starting with $\mathcal{C}_\perp$, Figure 6 shows an example of probabilistic GBT combination at basic block B4 and the probabilistic GBT after B4 in the first iteration of the loop, where $M = 1$ and $N = 2$.
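The combination at a join point can be sketched as a weighted union of dictionaries, where each dictionary maps a cache state to its probability and the weights are the edge frequencies of the incoming edges. The state encoding, the function name `combine`, and the sample values are hypothetical.

```python
from collections import defaultdict


def combine(pstates, weights):
    """Weighted combination of probabilistic states at a control flow join.

    pstates: list of {state: probability} dicts, one per predecessor
    weights: edge frequencies of the corresponding incoming edges;
             they must add up to 1
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    result = defaultdict(float)
    for pstate, weight in zip(pstates, weights):
        for state, prob in pstate.items():
            result[state] += weight * prob
    return dict(result)


# Hypothetical outgoing states of two predecessors, reached with
# edge frequencies 0.25 and 0.75.
out_b2 = {("m1", "m0"): 1.0}
out_b3 = {("m2", "m0"): 0.5, ("m1", "m0"): 0.5}
in_b4 = combine([out_b2, out_b3], [0.25, 0.75])
```

The state shared by both predecessors receives contributions from each of them, weighted by the corresponding edge frequency.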
Extension to Loop Iterations
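In the previous subsection, we assume $\mathcal{C}^{in}_L = \mathcal{C}_\perp$. However, for a loop iterating multiple times, the input GBT at the start node of the loop body is different for each iteration. More concretely, let us add the subscript $n$ for the $n$-th iteration of the loop. Then $\mathcal{C}^{in}_{start_n} = \mathcal{C}^{out}_{end_{n-1}}$ for $n > 1$. However, in order to compute $\mathcal{C}^{in}_{start_1}, \ldots, \mathcal{C}^{in}_{start_N}$, where $N = N_L$ is the loop count, we do not need to traverse the DAG $N$ times. Instead, we can rely on the concatenation operator $\boxplus$. First, we note that

$$\mathcal{C}^{out}_{end_n} = \mathcal{C}^{out}_{end_{n-1}} \boxplus \mathcal{C}^{out}_{end_1} \quad \text{for } n > 1,$$

where $\mathcal{C}^{out}_{end_1}$ is the outgoing probabilistic GBT of a single iteration starting with $\mathcal{C}_\perp$. The final probabilistic GBT after $N$ iterations starting with $\mathcal{C}_\perp$ is $\mathcal{C}^{gen}_L = \mathcal{C}^{out}_{end_N}$.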
The cache hit rate of a memory block is dependent on the input probabilistic GBT $\mathcal{C}^{in}_B$ of the corresponding basic block $B$, which in turn is dependent on $\mathcal{C}^{in}_{start_n}$ of the loop $L$. Computing the cache hit rate for each memory block in each iteration is equivalent to complete loop unrolling. Instead, we observe that we only need to compute an "average" probabilistic GBT $\mathcal{C}^{avg}_L$ at the start node of the loop body. This captures the input GBT of the loop over $N$ iterations. That is, $\mathcal{C}^{avg}_L$ is defined as

$$\mathcal{C}^{avg}_L = \biguplus_{n=1}^{N} \mathcal{C}^{in}_{start_n}$$

where $w(\mathcal{C}^{in}_{start_n}) = 1/N$.

More importantly, the concatenation operator need not be invoked $N_L$ times, as the probabilistic GBTs across iterations may converge. After the convergence point, the size and content of the probabilistic GBT as well as the probability of each GBT in the probabilistic GBT do not change. In practice, we relax the convergence constraint. If the differences of probabilities between every pair of identical GBTs in $\mathcal{C}^{out}_{end_n}$ and $\mathcal{C}^{out}_{end_{n+1}}$ are within Te, we declare convergence. Experimental results confirm that convergence is reached quickly for most of the loops in all the benchmark programs. In the worst case, the concatenation operations are terminated at a pre-defined threshold of MaxN iterations. The average probabilistic GBT across these MaxN iterations is used as an approximation of the average probabilistic GBT across $N_L$ iterations. In practice, we set MaxN to 100 and Te to $10^{-6}$.
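The iteration scheme can be sketched as follows, once more with one LRU list (a GBT of degree 0) standing in for a full GBT. The sketch assumes that the loop-entry state of iteration n + 1 is the entry state of iteration n concatenated with the state generated by one iteration from an empty cache, that every iteration carries weight 1/N in the average, and that a state missing from one side counts as probability 0 in the convergence test. All function names are hypothetical.

```python
from collections import defaultdict

EMPTY = None  # stands in for the empty-line element


def lru_update(state, block):
    """Access `block` in an LRU list given as a tuple, most recent first."""
    kept = [b for b in state if b != block and b is not EMPTY]
    new = ([block] + kept)[:len(state)]
    return tuple(new + [EMPTY] * (len(state) - len(new)))


def concat(c1, c2):
    """Replay the blocks of c2 onto c1, least recently used first."""
    for block in reversed(c2):
        if block is not EMPTY:
            c1 = lru_update(c1, block)
    return c1


def prob_concat(p1, p2):
    """Concatenate all state pairs, weighting by the product (assumed)."""
    result = defaultdict(float)
    for c1, pr1 in p1.items():
        for c2, pr2 in p2.items():
            result[concat(c1, c2)] += pr1 * pr2
    return dict(result)


def converged(p1, p2, t_e):
    """True if every state's probability differs by at most t_e."""
    keys = set(p1) | set(p2)
    return all(abs(p1.get(k, 0.0) - p2.get(k, 0.0)) <= t_e for k in keys)


def average_loop_input(gen, loop_count, assoc, t_e=1e-6, max_n=100):
    """Average the loop-entry state over min(loop_count, max_n) iterations."""
    current = {(EMPTY,) * assoc: 1.0}      # first iteration starts empty
    total = defaultdict(float)
    steps = min(loop_count, max_n)
    n = 0
    while n < steps:
        for state, prob in current.items():
            total[state] += prob
        n += 1
        nxt = prob_concat(current, gen)
        if converged(current, nxt, t_e):
            # All remaining iterations see (almost) the same entry state.
            for state, prob in nxt.items():
                total[state] += (steps - n) * prob
            n = steps
        current = nxt
    return {state: prob / steps for state, prob in total.items()}


# Hypothetical state generated by one loop iteration from an empty cache.
gen = {("a", "b"): 0.5, ("c", EMPTY): 0.5}
avg = average_loop_input(gen, loop_count=4, assoc=2)
```

The early exit mirrors the convergence test: once two successive entry states agree within the threshold, the remaining iterations are accounted for without further concatenations.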
Analysis of Whole Program
We first traverse the LPHG in a bottom-up fashion, i.e., we start with the innermost loops/procedures and compute $\mathcal{C}^{gen}_L$ and $\mathcal{C}^{avg}_L$ for all such loops/procedures. Next, we replace the innermost loops or procedures with "dummy" nodes in the DAG of the enclosing loop or procedure. While traversing the DAG of the enclosing loop or procedure, special care is taken for the dummy nodes. Let $\mathcal{C}^{in}_L$ be the input GBT for dummy node $L$ during traversal of the DAG. Then we treat the dummy node as a black box and compute the output GBT of the dummy node as

$$\mathcal{C}^{out}_L = \mathcal{C}^{in}_L \boxplus \mathcal{C}^{gen}_L.$$

At the end of this bottom-up traversal process, we reach the root node (main procedure). Then, we perform a top-down traversal to compute the probabilistic GBT at each basic block in the context of the whole program. Suppose $L$ is a dummy node during this top-down traversal with input probabilistic GBT $\mathcal{C}^{in}_L$ and start node start. Then we traverse the DAG of $L$ with

$$\mathcal{C}^{in}_{start} = \mathcal{C}^{in}_L \boxplus \mathcal{C}^{avg}_L$$

and compute the probabilistic GBT at each node of the DAG. This top-down process continues till we traverse all the loops/procedures. At this point, we have computed the "average" probabilistic GBT for each basic block in the context of the whole program. Now the cache hit rate for each memory block across multiple cache configurations can be computed.
EXPERIMENTAL RESULTS
We evaluate the accuracy and efficiency of our static cache analysis by comparing it with cache simulator Cheetah [13] . Cheetah is the fastest known cache simulator, which can simulate multiple cache configurations in a single pass.
We select 10 programs from MiBench [5]. We fix a line size for each benchmark, but vary the number of cache sets from 4 to 64 and the associativity from 1 to 8. That is, a total of 20 cache configurations are estimated and simulated. The line size for each benchmark is selected such that the cache hit rate has a wide coverage. The benchmarks, corresponding line sizes, and trace sizes are shown in Table 1. For trace-driven simulation, the trace size can be quite large even for small programs, as shown in column Trace. We use the SimpleScalar toolset [1] for the experiments. We instrument its functional simulator to collect execution counts of basic blocks and control flow edges. The time spent in our instrumentation during the functional simulation is shown in column Prof. Our estimator first disassembles the executable to construct the CFG and LPHG, and then proceeds with the cache hit estimation. We perform all experiments on a 3GHz Pentium 4 CPU with 2GB memory. The estimation and simulation times are shown in Table 1. Our static analysis method is significantly faster (24-3,855× speedup) than Cheetah simulation. To compare accuracy, for each benchmark, we show the cache hit rates of both simulation and estimation across all the 20 configurations in Figure 7. The estimation for rijndael is identical to simulation for all configurations, so it is not shown. The estimation results from analysis track the simulation results quite closely. For all the benchmarks and cache configurations, we achieve high accuracy (0.7% average error). The error is defined as |est − sim|, where est (sim) is the estimated (simulated) cache hit rate.
RELATED WORK
Trace-driven simulation is widely used for evaluating cache design parameters. Janapsatya et al. [7] propose an instruction cache simulation methodology that can operate directly on a compressed program trace file. Simulating reduced traces obtained by statistical sampling is proposed in [8]. In addition, lossless techniques for trace reduction are studied in [14, 15]. The inclusion property is exploited to remove certain references from the trace prior to simulation [14]. By simulating the cache configurations in a particular order, some redundant information can be stripped off from the trace after each simulation [15]. Single-pass simulation is proposed in [13, 6, 11]. These techniques are based on the inclusion property, which states that the content of a smaller cache is included in a bigger cache for certain replacement policies. Various data structures, such as single stack [11], forest [6], and generalized binomial tree [13], have been proposed for utilizing the inclusion property.
Given an address trace, [3, 12] propose probability-based analytical models to compute the cache hit ratio. But their approaches apply only to either direct-mapped caches or fully associative caches. In contrast, our method works on the program control flow graph and does not require address traces. We also predict hit rates for multiple configurations in a single pass. Ghosh and Givargis [4] propose an analytical approach for design space exploration that can directly compute cache parameters satisfying the desired performance.
CONCLUSION
In this paper, we present a fast and accurate design space exploration technique for instruction caches via static analysis. We introduce the probabilistic Generalized Binomial Tree (GBT) to represent the cache contents for multiple paths and configurations, define operations on the probabilistic GBT, and discuss optimizations to improve their space and time efficiency. Finally, we show how to derive these probabilistic GBTs at any point in the program. The experimental results indicate that our method achieves significant speedup compared to simulation while maintaining high accuracy.
ACKNOWLEDGMENTS
This work was supported by NUS project R-252-000-292-112.
