Multimedia-dominated consumer electronics devices (such as cellular phone, digital camera, etc.) operate under soft real-time constraints. Overly pessimistic worst-case execution time analysis techniques borrowed from hard real-time systems domain are not particularly suitable in this context. Instead, the execution time distribution of a task provides a more valuable input to the system-level performance analysis frameworks. Both program inputs and underlying architecture contribute to the execution time variation of a task. But existing probabilistic execution time analysis approaches mostly ignore architectural modeling. In this paper, we take the first step towards remedying this situation through instruction cache modeling. We introduce the notion of probabilistic cache states to model the evolution of cache content during program execution over multiple inputs. In particular, we estimate the mean and variance of execution time of a program across inputs in the presence of instruction cache. The experimental evaluation confirms the scalability and accuracy of our probabilistic cache modeling approach.
INTRODUCTION
Moore's Law has moved the center of gravity of computing from personal computers to numerous embedded computers hidden away inside our everyday electronic products. The application domain of embedded computing systems ranges from automotive, avionics, health-care to the multimedia-dominated consumer electronics devices. The safety-critical systems employed in automotive, avionics, and health-care domain demand strong timing predictability in addition to the functional correctness. Traditional schedulability analysis techniques can guarantee the satisfiability of timing constraints for such hard real-time systems consisting of multiple concurrent tasks. One of the key inputs required for the schedulability analysis is the Worst-Case Execution Time (WCET) of each of the tasks. WCET of a task on a target processor is defined as its maximum execution time across all possible inputs [16].
Multimedia-dominated consumer electronics devices, on the other hand, are known as soft real-time systems. These systems require the timing constraints to be satisfied most of the time. One can employ WCET-driven schedulability analysis in the context of soft real-time systems. But this approach leads to over-dimensioning of the processor resources due to the over-estimations inherent in static WCET estimation techniques. In particular, the complexity of static WCET analysis techniques has grown significantly over the years as embedded processors include performance-enhancing features such as caches, branch prediction, out-of-order pipelines, etc. [19]. While such complexities of WCET analysis have to be tolerated for hard real-time systems, systems with somewhat relaxed timing constraints call for novel performance analysis approaches.
Probabilistic schedulability analysis [6, 9] is gaining popularity in providing timing guarantees for soft real-time systems. Probabilistic analysis techniques can exploit the timing flexibility of soft real-time systems to offer better resource dimensioning while meeting the quality of service (QoS) requirements. Most proposals in probabilistic schedulability analysis assume probabilistic distribution of the execution times of the tasks. The distribution of execution times is also equally important in design space exploration, compiler optimizations and parallel program performance prediction for partitioning, scheduling, and load balancing [11, 18] .
A naïve approach to derive this distribution through simulation or execution of a large number of "representative inputs" is not suitable for the following reasons. First, it is extremely difficult, if not impossible, to identify representative inputs for a complex program with billions of possible inputs. If the inputs are not chosen appropriately, then the corresponding distribution can be completely different from the actual distribution. Second and most importantly, the target platform may not be available during the design phase of an embedded system, thereby leaving slow simulation for a large number of inputs as the only choice. Thus static program analysis techniques need to be explored in this context. However, despite their importance, static analysis techniques to predict the distribution of execution times remain largely unexplored. Moreover, the few static analysis approaches proposed either completely ignore the micro-architectural effects or leave them as future work [7, 11, 18].
In this work, we take the first step towards incorporating the timing effects of architectural features in probabilistic execution time analysis. In particular, we focus on the instruction cache in this paper as (1) it is most commonly used in embedded processors, and (2) the variation in execution time from instruction cache effects can be quite large. For example, Figure 1 shows the wide variation in instruction cache miss rate of the susan-c benchmark, which translates to a large variation in the execution time. Ignoring the instruction cache effect may produce a distribution that is far from the actual distribution.
More concretely, our contribution in this paper is modeling the instruction cache timing effects in static estimation of the execution time distribution of a program. Ideally the static analysis technique should derive the probability distribution function (pdf) of the execution time. However, such elaborate analysis will be, in general, computationally expensive and often of little practical value. Instead, we characterize the execution-time distribution through mean and variance. If necessary, our work can be easily extended to include additional statistical moments (such as skewness and kurtosis) so that the distribution can be completely reconstructed [11].
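Motivating Example. To illustrate the difficulty of modeling the instruction cache for probabilistic execution time analysis, let us consider a program fragment consisting of a for loop over the index i whose body is an if-else statement: statement S1 executes when the branch condition a[i] != b[i] holds, and statement S2 executes otherwise.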
For the purpose of this example, let us ignore the execution time of the for and if conditions. Also let p be the truth probability of the branch condition a[i] != b[i]. Then the expected execution time of this program fragment per iteration will be estimated by existing techniques as
$$p \times T_{S1} + (1 - p) \times T_{S2}$$
where $T_{S1}$ ($T_{S2}$) is the constant execution time of statement S1 (S2). This assumption of constant execution time for each instruction fails in the presence of instruction cache, which has global timing effects.
In the presence of instruction cache, let us assume S1 and S2 conflict with each other for the same cache block. Now the execution time depends on whether S1 (S2) incurs cache miss, which in turn depends on the cache content after the previous iteration. For any loop iteration i > 1, the cache contains S1 with probability p and S2 with probability 1 − p at the end of the previous iteration. Therefore, S1 (S2) will incur cache miss with probability 1 − p (p). Thus the expected execution time per iteration now changes to
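The effect described above can be checked with a small sketch. It assumes that branch outcomes in successive iterations are independent, that $T_{S1}$ and $T_{S2}$ are the execution times on a cache hit, and that a miss adds a fixed penalty (called `miss_penalty` here; all function and parameter names are illustrative and not taken from the paper). Under these assumptions the expectation for iterations i > 1 works out to $p \times T_{S1} + (1 - p) \times T_{S2} + 2p(1 - p) \times$ `miss_penalty`, which the code confirms by enumerating the four combinations of previous and current branch outcomes.

```python
from itertools import product


def expected_time_constant(p, t_s1, t_s2):
    """Per-iteration expectation when S1 and S2 each take a fixed time."""
    return p * t_s1 + (1 - p) * t_s2


def expected_time_with_conflict(p, t_s1, t_s2, miss_penalty):
    """Per-iteration expectation (iterations i > 1) when S1 and S2 share a cache block.

    Enumerates the (previous, current) branch outcomes. The statement executed
    now misses exactly when the other statement ran in the previous iteration.
    Assumes branch outcomes are independent across iterations.
    """
    total = 0.0
    for prev_true, cur_true in product((True, False), repeat=2):
        prob = (p if prev_true else 1 - p) * (p if cur_true else 1 - p)
        hit_time = t_s1 if cur_true else t_s2
        penalty = miss_penalty if prev_true != cur_true else 0.0
        total += prob * (hit_time + penalty)
    return total


if __name__ == "__main__":
    # Hypothetical numbers: p = 0.25, T_S1 = 2, T_S2 = 4, miss penalty = 8.
    print(expected_time_constant(0.25, 2.0, 4.0))
    print(expected_time_with_conflict(0.25, 2.0, 4.0, 8.0))
```

The gap between the two values is largest at p = 0.5, where the two statements evict each other most often.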
It is clear from the previous example that the cache content at any program point P depends on the probability of execution of the various paths leading to P. Thus we introduce the notion of probabilistic cache states to capture the probability of the different cache contents at any program point. Our analysis technique then proceeds by transforming the probabilistic cache states as we traverse the control flow graph of the program. Once we compute the probabilistic cache states at all the program points, we can estimate the cache miss probability of any code block. Our experimental results with a number of embedded benchmarks confirm that the model is accurate in estimating the miss probability as well as mean and variance of execution time in the presence of instruction caches.
RELATED WORK
A general framework for determining average program execution times and their variance has been presented by Sarkar [18]. A static analysis framework to obtain probabilistic distributions of execution times has been proposed by David et al. [7]. Based on the assumption that external variables (input data) are independent and their probability distributions are known, they derive the execution time and probability of each path through the program. Gautama et al. [11] present a program performance prediction approach. In their work, loop bounds, branch probabilities, and execution times of basic blocks are all characterized by their statistical moments. The objective of these works is to predict the statistical moments of the performance distribution or even the full probability distribution function. However, architectural features such as caches have not been modeled so far. In other words, the execution time variations come only from the program level (variation in loop bounds, branch direction, etc.) and not from the architecture. Finally, Bernat et al. [5] propose a WCET analysis technique (without architectural modeling) that estimates the WCET with a high probabilistic guarantee. In contrast, we are interested in the entire distribution rather than just the tail end of the distribution.
Instruction caches have been modeled for WCET estimation in hard real-time systems. Alt et al. [1] use abstract interpretation to model the cache behavior, while Li et al. [13] use a cache conflict graph and propose an Integer Linear Programming (ILP) solution. Another technique based on categorization of cache accesses is proposed in [2, 15]. Lim et al. model the instruction cache using timing schema in [14]. As these techniques have been developed in the context of WCET analysis, the instruction cache is modeled for the worst-case scenario. Given an address trace, [4, 17] propose analytical models to compute cache miss probability. In contrast, ours is a static analysis method that works on the program control flow graph to generate cache miss probability across multiple inputs and does not require address traces.
PROBABILISTIC TIMING ANALYSIS
The inputs to our analysis are the executable program code, cache parameters and program statistical information. We assume that the statistical information about the loop bounds and the truth probability of the conditional branches is provided as input to our analyzer. This information can be derived through either program analysis [7], user annotation, comprehensive profiling, or a combination of these approaches, which is beyond the scope of this paper. In the following, we use E[X], Var[X], Cov[X, Y], and Pr (where X and Y are random variables) to represent the expected value, variance, covariance, and probability, respectively.
Given a program, we first construct the loop-procedure hierarchy graph (LPHG) for the whole program [12] . The LPHG represents the procedure call and loop nest relations in the application. We assume that the loop or procedure body corresponds to a directed acyclic graph (DAG). The nodes of a DAG are the basic blocks within that loop or procedure. If a loop (procedure) contains other loops (procedures) within its body, then these inner loops (procedures) are represented by dummy nodes. The control flow graph within a loop is transformed such that every loop has a loop preheader and a post-loop node. In addition, there exists a unique start and end basic block corresponding to each such DAG ( Figure 2 ).
Given a basic block B, its execution frequency $N_B$ is defined relative to the start basic block of the innermost loop or procedure it is in. Given the truth probability of the conditional branches, it is easy to compute $E[N_B]$. For a control flow edge B' → B, the edge frequency f(B' → B) is defined as the probability that B is reached from B'. Again, edge frequencies can be easily derived by propagating the branch truth probabilities. As shown in Figure 2, f(B3 → B4) can be obtained from the branch truth probability. By definition of edge frequency, $\sum_{e \in In(B)} f(e) = 1$, where In(B) represents the incoming edges of B.
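The propagation described above can be sketched as follows. This is one plausible reading, not the paper's stated algorithm: expected block frequencies are pushed through the DAG in topological order, and the edge frequency is then taken as the share of B's expected executions that arrive via B'. The function names, the dictionary layout, and the small diamond-shaped DAG are all hypothetical.

```python
def block_frequencies(topo_order, succ_prob, start):
    """Expected execution frequency of each block, relative to the start block.

    topo_order: blocks of the DAG in topological order.
    succ_prob:  {block: {successor: probability of taking that edge}}.
    """
    freq = {block: 0.0 for block in topo_order}
    freq[start] = 1.0
    for block in topo_order:
        for succ, prob in succ_prob.get(block, {}).items():
            freq[succ] += freq[block] * prob
    return freq


def edge_frequencies(topo_order, succ_prob, start):
    """f(B' -> B): probability that B, when executed, was reached from B'."""
    freq = block_frequencies(topo_order, succ_prob, start)
    result = {}
    for src in topo_order:
        for dst, prob in succ_prob.get(src, {}).items():
            if freq[dst] > 0:
                result[(src, dst)] = freq[src] * prob / freq[dst]
    return result


if __name__ == "__main__":
    # Hypothetical diamond: B1 branches to B2 (0.25) or B3 (0.75); both join at B4.
    order = ["B1", "B2", "B3", "B4"]
    probs = {"B1": {"B2": 0.25, "B3": 0.75}, "B2": {"B4": 1.0}, "B3": {"B4": 1.0}}
    print(edge_frequencies(order, probs, "B1"))
```

With this definition the frequencies of the edges entering any reachable block sum to 1, matching the property stated in the text.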
For each loop L, we define both relative loop bound NL and absolute loop bound N L . Relative loop bound is the execution count of the loop in one execution of its preheader, while absolute loop bound is the total execution count of the loop in one complete execution of the program. For a procedure L, we only define the total number of invocations N L . Relative loop bound is used to derive the probabilistic cache states. Absolute loop bound is used to compute program execution time. Usually the loop bound of inner loop and outer loop are not independent. The expected value of loop 
where TL is the total execution time and tL is the execution time per iteration of L. Let B be the set of basic blocks (excluding the dummy nodes for inner loops and callee procedures) in L and tB be the execution time of basic block B per execution. Then
As 
By assuming Var[t_B] = 0 and ignoring the covariance between basic blocks, Var[t_L] can be simplified as
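As a sketch of how such a simplification might go, suppose the per-iteration time is the sum $t_L = \sum_{B \in \mathcal{B}} N_B \times t_B$ over the basic blocks of L. This form is an assumption made for illustration, since the equation itself is not reproduced here. With $t_B$ treated as a constant and the covariance terms dropped, the variance reduces to a sum of per-block terms:

```latex
\begin{align*}
Var[t_L] &= Var\Big[\sum_{B \in \mathcal{B}} N_B \, t_B\Big] \\
         &= \sum_{B \in \mathcal{B}} Var[N_B \, t_B]
            + 2 \sum_{B < B'} Cov[N_B \, t_B,\; N_{B'} \, t_{B'}] \\
         &\approx \sum_{B \in \mathcal{B}} t_B^2 \, Var[N_B]
\end{align*}
```

If each basic block executes at most once per iteration of the DAG, $N_B$ is a Bernoulli variable, so $Var[N_B] = E[N_B](1 - E[N_B])$ follows directly from the branch probabilities.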
The covariance between TL and T L can be approximated by 
CACHE MODELING
Cache Terminology. A cache memory is defined in terms of four major parameters: block or line size L, number of sets K, associativity A, and replacement policy. The block or line size determines the unit of transfer between the main memory and the cache. A cache is divided into K sets. Each cache set, in turn, is divided into A cache blocks, where A is the associativity of the cache. For a direct-mapped cache A = 1, for a set-associative cache A > 1, and for a fully associative cache K = 1. In other words, a direct-mapped cache has only one cache block per set, whereas a fully associative cache has only one cache set. The cache size is defined as (K × A × L). A memory block m can be mapped to only one cache set, given by (m modulo K). For a set-associative cache, the replacement policy (e.g., LRU, FIFO, etc.) defines the block to be evicted when a cache set is full.
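The parameters above translate directly into a few lines of arithmetic. The sketch below assumes byte addresses and a line size given in bytes; the class and method names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheConfig:
    line_size: int      # L: bytes per cache block (assumed unit)
    num_sets: int       # K: number of cache sets
    associativity: int  # A: cache blocks per set

    @property
    def size_bytes(self):
        """Total capacity, K x A x L."""
        return self.num_sets * self.associativity * self.line_size

    def memory_block(self, address):
        """Index of the memory block containing a byte address."""
        return address // self.line_size

    def cache_set(self, address):
        """Cache set the address maps to: (memory block) modulo K."""
        return self.memory_block(address) % self.num_sets


if __name__ == "__main__":
    cfg = CacheConfig(line_size=32, num_sets=16, associativity=2)
    print(cfg.size_bytes, cfg.cache_set(0x1040))
```

Setting `associativity=1` gives a direct-mapped cache and `num_sets=1` a fully associative one, as in the text.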
Assumptions. Due to space limitations, we will limit our discussion to a fully associative cache. A set-associative cache with associativity A can be easily modeled by modeling each cache set as a fully associative cache containing A blocks. Let $M_i$ denote the set of all the memory blocks that can map to the i-th cache set. Clearly, $M_i \cap M_j = \emptyset$ for $i \neq j$, since each memory block maps to exactly one cache set.
Thus, there is no interference between the cache sets and they can be modeled independently.
In this paper, we assume LRU (least recently used) replacement policy, where the block replaced is the one that has been unused for the longest time. However, the technique presented in this work is general enough that it can be easily used for other replacement policies such as FIFO (first-in first-out).
More concretely, in the following, we consider a fully-associative LRU cache with A cache blocks and the program store as a set of memory blocks M . To indicate the absence of any memory block in a cache line, we introduce a new element ⊥.
Concrete Cache States
Let us first formally define the concrete cache states and the operations involving concrete cache states. These definitions will be used later to introduce the notion of probabilistic cache states. 
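For intuition, a concrete state of a fully associative LRU cache can be represented as a tuple ordered from most to least recently used, with an empty line standing in for ⊥. The sketch below is one common way to realize the LRU update; the representation and the names `EMPTY`, `empty_state`, and `lru_update` are assumptions of this example, not the paper's formal definitions.

```python
EMPTY = None  # stands in for the "no memory block" element


def empty_state(associativity):
    """Concrete cache state with every line empty."""
    return (EMPTY,) * associativity


def lru_update(state, block):
    """Return the state after accessing `block` (most recently used first).

    On a hit the block moves to the front; on a miss it is inserted at the
    front and the least recently used entry (possibly empty) is dropped.
    """
    rest = [b for b in state if b != block]
    if len(rest) == len(state):  # miss: block was not in the cache
        rest = rest[:-1]
    return (block,) + tuple(rest)


if __name__ == "__main__":
    state = empty_state(2)
    for block in ["m0", "m1", "m0", "m2"]:
        state = lru_update(state, block)
        print(block, state)
```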
Probabilistic Cache States
At any program point, the concrete cache state is dependent on the program path taken before reaching this program point. In general, a program point can be reached through multiple program paths leading to a number of possible cache states at that point. We have to model the probability of each of these cache states in probabilistic execution time analysis. For this purpose, we introduce the notion of probabilistic cache states.
DEFINITION 4 (Probabilistic Cache States).
A probabilistic cache state $\mathcal{C}$ is a 2-tuple $\langle C, X \rangle$, where $C \in 2^{\Omega}$ is a set of concrete cache states and X is a random variable. The sample space
In other words, we add up the probability of all the concrete cache states c ∈ C that contain the memory block m, that is, $PHit(\mathcal{C}, m) = \sum_{c \in C,\, m \in c} Pr[X = c]$. The cache miss probability can now be defined as $PMiss(\mathcal{C}, m) = 1 - PHit(\mathcal{C}, m)$.
DEFINITION 6 (Probabilistic Cache State Update). We define ¢ as the probabilistic cache state update operator. Given a probabilistic cache state $\mathcal{C} = \langle C, X \rangle$ and an access to memory block m ∈ M, $\mathcal{C}$ ¢ m defines the updated probabilistic cache state.
For example, in Figure 2 , the probabilistic cache state at the end of basic block B4 after the first loop iteration (starting with empty cache state) consists of two concrete cache states c3 and c4 with equal probability 0.5. The cache miss probability of memory blocks m1-m3 in this probabilistic cache state is 0.5 whereas the miss probability of m0 and m4 are 0.
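A probabilistic cache state can be prototyped as a dictionary from concrete states to probabilities. The sketch below assumes that representation, an LRU update on tuples ordered from most to least recently used, and a 4-block cache. The two concrete states are hypothetical, chosen only to reproduce the probabilities quoted above, and do not claim to match the actual layout of Figure 2.

```python
from collections import defaultdict


def lru_update(state, block):
    """LRU update of a concrete state (tuple, most recently used first)."""
    rest = [b for b in state if b != block]
    if len(rest) == len(state):  # miss: evict the least recently used entry
        rest = rest[:-1]
    return (block,) + tuple(rest)


def p_hit(pstate, block):
    """Sum the probabilities of all concrete states that contain `block`."""
    return sum(prob for state, prob in pstate.items() if block in state)


def p_miss(pstate, block):
    return 1.0 - p_hit(pstate, block)


def p_update(pstate, block):
    """Apply the access to every concrete state, merging states that coincide."""
    updated = defaultdict(float)
    for state, prob in pstate.items():
        updated[lru_update(state, block)] += prob
    return dict(updated)


if __name__ == "__main__":
    # Hypothetical entry state of B4: two equally likely paths.
    entry = {("m2", "m1", "m0", None): 0.5, ("m3", "m0", None, None): 0.5}
    exit_state = p_update(entry, "m4")
    for block in ["m0", "m1", "m2", "m3", "m4"]:
        print(block, p_miss(exit_state, block))
```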
Analysis of Loops
In this subsection, we describe cache analysis for a loop in isolation, i.e., we assume an empty cache state at the loop entry point. Subsequently, we will extend this analysis to the whole program. In the following, we consider the control flow graph (CFG) to be a directed acyclic graph (DAG), representing the body of the loop. We first perform the analysis on the DAG to model cache behavior for a single iteration of a loop. This will be followed by probabilistic cache state modeling across iterations. 
Analysis of DAG
That is, the outgoing probabilistic cache state of a basic block can be derived by repeatedly updating the incoming probabilistic cache state with the memory accesses in B. Now in order to generate the incoming cache state of B from its predecessor cache states, we need to define the following new operator.
DEFINITION 7 (Probabilistic Cache States Merging).
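We define a merging operator for probabilistic cache states. It takes as input n probabilistic cache states $\mathcal{C}_i = \langle C_i, X_i \rangle$ and a corresponding weight function w such that $\sum_{i=1}^{n} w(\mathcal{C}_i) = 1$. It produces a merged probabilistic cache state $\mathcal{C} = \langle C, X \rangle$ as follows.
$$C = \bigcup_{i=1}^{n} C_i, \qquad Pr[X = c] = \sum_{i=1}^{n} w(\mathcal{C}_i) \times Pr[X_i = c]$$
where $Pr[X_i = c] = 0$ if $c \notin C_i$. In other words, the concrete states in $\mathcal{C}$ are the union of all the concrete cache states in $\mathcal{C}_1, \ldots, \mathcal{C}_n$. The probability of a concrete cache state c ∈ C is a weighted summation of the probabilities of c in the input probabilistic cache states.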
Let in(B) denote the set of predecessor basic blocks of B. Then, we can derive the incoming probabilistic cache state of B by employing the merging operation on the outgoing probabilistic cache states of in(B). We define the weight function w through the edge frequencies: $w(\mathcal{C}^{out}_{B'}) = f(B' \to B)$ for each $B' \in in(B)$.
Figure 2 shows the merging operator at the input of B4. There are two concrete cache states c1 and c2 at the entry of B4. As the two incoming edges to B4 have equal probability, the resulting probabilistic cache state at the entry of B4 contains c1 and c2 with equal probability. This probabilistic cache state is updated with memory block m4 inside B4 to obtain the concrete cache states c3 and c4 with equal probability at the end of B4.
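The merging step can be sketched in a few lines, assuming a probabilistic cache state is stored as a dictionary from concrete states to probabilities and that the weights are the edge frequencies of the incoming edges. The function name and the example states are illustrative, not taken from the paper.

```python
from collections import defaultdict


def p_merge(pstates, weights):
    """Weighted merge of probabilistic cache states.

    pstates: list of {concrete_state: probability} dictionaries.
    weights: one weight per input state; they must sum to 1.
    """
    if len(pstates) != len(weights):
        raise ValueError("need exactly one weight per probabilistic cache state")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    merged = defaultdict(float)
    for pstate, weight in zip(pstates, weights):
        for state, prob in pstate.items():
            merged[state] += weight * prob
    return dict(merged)


if __name__ == "__main__":
    # Hypothetical concrete states c1 and c2 arriving over two equally likely edges.
    c1 = ("m2", "m1", "m0", None)
    c2 = ("m3", "m0", None, None)
    print(p_merge([{c1: 1.0}, {c2: 1.0}], [0.5, 0.5]))
```

A concrete state that appears in several inputs simply accumulates its weighted probabilities, so the merged probabilities still sum to 1.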
Mean Execution Time of Basic Block
Recall that $gen_B = \langle m_1, \ldots, m_k \rangle$ is the sequence of memory blocks accessed within a basic block B. Now let us define k random variables $Y_1, \ldots, Y_k$ corresponding to the memory blocks $m_1, \ldots, m_k$ in $gen_B$. $Y_i$ denotes the cache hit/miss event for the access of memory block $m_i$. $Y_i$ can be modeled as a random variable with Bernoulli distribution by assuming $Y_i = 1$ if $m_i$ is a cache miss and $Y_i = 0$ otherwise. Consider also the random variable denoting the execution time of B when the cache is modeled. Then
where δ is a constant denoting the cache miss penalty.
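By linearity of expectation, the mean time of a basic block is its cache-free time plus δ times the sum of the per-access miss probabilities. The sketch below illustrates this under stated assumptions: probabilistic cache states are dictionaries from concrete LRU states to probabilities, `base_time` stands for the block's execution time without cache misses, and `miss_penalty` plays the role of δ. All names are hypothetical.

```python
from collections import defaultdict


def lru_update(state, block):
    """LRU update of a concrete state (tuple, most recently used first)."""
    rest = [b for b in state if b != block]
    if len(rest) == len(state):  # miss: evict the least recently used entry
        rest = rest[:-1]
    return (block,) + tuple(rest)


def p_hit(pstate, block):
    return sum(prob for state, prob in pstate.items() if block in state)


def p_update(pstate, block):
    updated = defaultdict(float)
    for state, prob in pstate.items():
        updated[lru_update(state, block)] += prob
    return dict(updated)


def expected_block_time(pstate_in, gen_b, base_time, miss_penalty):
    """Return (mean execution time of the block, outgoing probabilistic state).

    Each access contributes miss_penalty times its miss probability, where the
    probability is read from the state reached just before that access.
    """
    pstate = dict(pstate_in)
    expected_misses = 0.0
    for block in gen_b:
        expected_misses += 1.0 - p_hit(pstate, block)
        pstate = p_update(pstate, block)
    return base_time + miss_penalty * expected_misses, pstate


if __name__ == "__main__":
    incoming = {("m0", None): 0.5, (None, None): 0.5}
    print(expected_block_time(incoming, ["m0", "m1", "m0"], 5.0, 10.0))
```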
Extension to Loop Iterations
In the previous subsection, we have derived the incoming and outgoing probabilistic cache states of each basic block for a single iteration of the loop body starting with the empty cache state $\mathcal{C}^{in}_L = \mathcal{C}_\perp$. However, for a loop iterating multiple times, the input cache state at the start node of the loop body is different for each iteration. More concretely, let us add the subscript n for the n-th iteration of the loop. Then $\mathcal{C}^{in}_{start_n} = \mathcal{C}^{out}_{end_{n-1}}$ for n > 1. However, in order to compute $\mathcal{C}^{in}_{start_1}, \ldots, \mathcal{C}^{in}_{start_N}$ as shown in Figure 2, where $N = E[N_L]$ is the expected loop bound, we do not need to traverse the DAG N times. Instead, we introduce two new operators.
DEFINITION 8 (Concatenation of Concrete Cache States).
Given two concrete cache states c1, c2 c1 c2 = c where c = c1
DEFINITION 9 (Concatenation of Probabilistic Cache States). Given probabilistic cache states C1 = C1, X1 and C2 = C2, X2
Let us assume the execution of two program fragments each starting with an empty cache state. The probabilistic cache state after the execution of the first and second program fragments are C1 and C2, respectively. Then the probabilistic cache state after execution of the two program fragments sequentially is C1 C2. Now we can compute the outgoing probabilistic cache state of a loop L for each iteration by applying the operator. First, we note that C Instead, we observe that we only need to compute an "average" probabilistic cache state C avg L at the start node of the loop body. This captures the input cache state of the loop over N iterations. That is, C avg L = C, X is defined in terms of C in start n = C n , X n for 1 ≤ n ≤ N as follows.
This can be alternatively defined as
where w(C for direct mapped cache are simpler. In direct mapped cache, the concrete cache state will not change after the first iteration. Probabilistic cache state could be changed only if {⊥} exists in it. Thus, closed form expressions exist for computing the probability of concrete cache states in C avg L and C gen L , which we do not show due to space constraints. More importantly, for any cache configuration, the operator need not be invoked E[NL] times in practice. The probabilistic cache states converge very quickly for most loops. 70% of the cache sets converge after the second iteration for all associativity settings (for all loops in all our benchmarks) and almost 80% cache sets converge within 10 iterations.
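The concatenation operators of Definitions 8 and 9 can be approximated for an LRU cache as follows. This is a sketch under assumptions rather than the paper's exact definitions: concrete states are tuples ordered from most to least recently used, the second state is replayed onto the first from its oldest to its newest block, and the two fragments are treated as statistically independent so that probabilities multiply. The names `concat_concrete` and `p_concat` are hypothetical.

```python
from collections import defaultdict


def lru_update(state, block):
    """LRU update of a concrete state (tuple, most recently used first)."""
    rest = [b for b in state if b != block]
    if len(rest) == len(state):  # miss: evict the least recently used entry
        rest = rest[:-1]
    return (block,) + tuple(rest)


def concat_concrete(c1, c2):
    """State after a fragment ending in c2 (from empty) runs on top of c1."""
    state = c1
    for block in reversed(c2):  # replay from least to most recently used
        if block is not None:   # empty lines contribute nothing
            state = lru_update(state, block)
    return state


def p_concat(p1, p2):
    """Concatenate two probabilistic states, assuming independent fragments."""
    result = defaultdict(float)
    for c1, prob1 in p1.items():
        for c2, prob2 in p2.items():
            result[concat_concrete(c1, c2)] += prob1 * prob2
    return dict(result)


if __name__ == "__main__":
    first = {("m1", "m0"): 1.0}
    second = {("m2", None): 0.5, (None, None): 0.5}
    print(p_concat(first, second))
```

Repeatedly concatenating a loop's single-iteration state onto the running state and stopping once the dictionary no longer changes is one way to exploit the fast convergence noted above.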
Analysis of Whole Program
In this section, we first show how to compute C and compute the probabilistic cache state at each node of the DAG. This top-down process continues till we traverse all the loops/procedures. At this point, we have computed the "average" probabilistic cache state for each basic block in the context of the whole program. We can now use Equation 9 to compute mean execution time for each basic block.
EXPERIMENTAL EVALUATION
In order to evaluate the accuracy of our probabilistic cache modeling, we should ideally compare our estimation result with the actual mean and variance of execution time of a program, based on the given statistical information. However, given the statistical information, there is no way to determine the actual mean and variance (that is the exact problem we are trying to solve). Therefore, we compare our estimation results to the results obtained from simulation. Given an application, we select multiple inputs and profile the application to collect the statistical information described earlier. By simulating the application with multiple inputs, we obtain the actual mean and variance of execution time across these inputs. Then we apply our analysis technique based on the statistical information of these multiple inputs. Finally, we compare our estimation with the simulation results.
We evaluate our modeling technique with nine benchmarks from MiBench. We provide for each benchmark multiple inputs with high variability [10]. We use the SimpleScalar toolset [3] for the experiments. The profiling is done by sim-profile and cache simulation is done by sim-cheetah. Our estimator first disassembles the executable to construct the CFG and LPHG, and then proceeds with the estimation. Standard deviation is the square root of variance and measures the average deviation from the mean. In the experiments we compare our estimated mean (standard deviation) to the simulated mean (standard deviation). We fix a cache block size for each benchmark, but consider different numbers of cache sets (8, 16, 32) and associativities (1, 2, 4, 8). So a total of 12 cache configurations are simulated for each benchmark. As our modeling is focused on the instruction cache, we assume constant execution time for each basic block in the absence of caches.
Figure 3 shows the mean and standard deviation of the total number of cache misses corresponding to simulation and estimation. Due to space considerations, we only show the values for three cache configurations per benchmark. The results are similar for other configurations. It is clear that our modeling is quite accurate in estimating both the mean and the standard deviation.
Estimation Error for E[T]
As for execution time, our estimation is accurate for both the mean and the standard deviation. Figure 4 shows our relative estimation error compared to simulation for all (benchmark, cache configuration) pairs. The average relative errors across all these pairs are 0.05% and 0.7% for mean and standard deviation, respectively. Our estimation technique is also very fast and robust with respect to cache configuration and benchmark size. The total runtime to estimate mean and variance for all the benchmarks and configurations is about 34 seconds on a 3.0 GHz Pentium 4 CPU with 2 GB memory.
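The comparison described in this section amounts to simple bookkeeping, sketched below. The sketch enumerates the 12 cache configurations and computes a relative error in percent; the sample execution times are made up for illustration, and the use of the population standard deviation (`pstdev`) is an assumption, since the text does not say which estimator was used.

```python
from itertools import product
from statistics import mean, pstdev

# (number of sets, associativity) pairs evaluated for each benchmark.
CACHE_CONFIGS = list(product((8, 16, 32), (1, 2, 4, 8)))


def summarize(execution_times):
    """Mean and (population) standard deviation across inputs."""
    return mean(execution_times), pstdev(execution_times)


def relative_error_percent(estimated, simulated):
    """Relative estimation error with respect to the simulated value."""
    return abs(estimated - simulated) / abs(simulated) * 100.0


if __name__ == "__main__":
    # Hypothetical simulated execution times (cycles) over eight inputs.
    simulated_times = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
    sim_mean, sim_std = summarize(simulated_times)
    print(len(CACHE_CONFIGS), sim_mean, sim_std)
    print(relative_error_percent(4.95, sim_mean))
```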
CONCLUSION AND FUTURE WORK
This paper presents, for the first time, an approach to instruction cache modeling in probabilistic timing analysis. We introduce the notion of probabilistic cache states and define operators to manipulate probabilistic cache states at control flow merge points, across loop iterations, and within the whole program. Finally, we show how to compute the cache miss probability of a memory block at any program point given the probabilistic cache states. This allows us to include the variation due to cache behavior in estimating the execution time distribution of a program. Our experimental results indicate that the cache modeling presented is both accurate and scalable. In the future, we plan to consider other architectural features (e.g., pipeline, branch predictor) in probabilistic modeling.
ACKNOWLEDGMENTS
This work was partially supported by NUS project R-252-000-292-112 and A*STAR SERC project R-252-000-258-305.
