With the advent of multicore architectures, worst-case execution time (WCET) analysis has become an increasingly difficult problem. In this article, we propose a unified WCET analysis framework for multicore processors featuring both a shared cache and a shared bus. Compared to previous work, our work differs by modeling the interaction of the shared cache and shared bus with the other basic microarchitectural components (e.g., pipeline and branch predictor). In addition, our framework does not assume a timing-anomaly-free multicore architecture for computing the WCET. A detailed experimental evaluation shows that we can obtain reasonably tight WCET estimates for a wide range of benchmark programs.
INTRODUCTION
Hard real-time systems require absolute guarantees on program execution time. Worst-case execution time (WCET) analysis has therefore become an important problem to address. The WCET of a program depends on the underlying hardware platform. Therefore, to obtain a safe upper bound on the WCET, the underlying hardware needs to be modeled. However, performance-enhancing microarchitectural features of a processor (e.g., caches, pipelines) make WCET analysis a very challenging task.
With the rapid growth of multicore architectures, it is quite evident that multicore processors will soon be adopted for real-time system design. Although multicore processors aim at improving performance, they introduce additional challenges for WCET analysis. Multicore processors employ shared resources; two prominent examples are shared caches and shared buses. The presence of a shared cache requires the modeling of intercore cache conflicts. On the other hand, the presence of a shared bus introduces variable latencies for accesses to the shared cache and to shared main memory. The delay introduced by shared cache conflict misses and shared bus accesses is propagated through the pipeline stages and affects the overall execution time of a program. WCET analysis is further complicated by a well-known phenomenon called timing anomalies [Lundqvist and Stenström 1999]. In the presence of timing anomalies, a local worst-case scenario may not lead to the WCET of the overall program. As an example, a cache hit rather than a cache miss may lead to the WCET of the entire program. Therefore, we cannot always assume a cache miss or the maximum bus delay as the worst-case scenario: such assumptions are not just imprecise, they may also lead to an unsound WCET estimation. A few solutions have been proposed which model the shared cache and/or the shared bus in isolation ([Yan and Zhang 2008; Li et al. 2009; Chattopadhyay et al. 2010; Lv et al. 2010]), but all of these previous solutions ignore the interactions of shared resources with important microarchitectural features such as pipelines and branch predictors.
In this article, we propose a WCET analysis framework for multicore platforms featuring both a shared cache and a shared bus. In contrast to previous work, our analysis can efficiently model the interaction of the shared cache and bus with various other microarchitectural features (e.g., pipeline, branch prediction). A few such meaningful interactions include the effect of shared cache conflict misses and shared bus delays on the pipeline, and the effect of speculative execution on the shared cache. Moreover, our analysis framework does not rely on a timing-anomaly-free architecture and gives a sound WCET estimate even in the presence of timing anomalies. In summary, the central contribution of this article is a unified analysis framework that covers most of the basic microarchitectural components (pipeline, (shared) cache, branch prediction, and shared bus) in a multicore processor.
Our analysis framework deals with timing anomalies by representing the timing of each pipeline stage as an interval. The interval covers all possible latencies of the corresponding pipeline stage. The latency of a pipeline stage may depend on cache miss penalties and shared bus delays. In turn, the cache and shared bus analyses interact with the pipeline stages to compute the possible latencies of each pipeline stage. Our analysis is context sensitive: it takes care of different procedure call contexts and different microarchitectural contexts (i.e., cache and bus) when computing the WCET of a single basic block. Finally, the WCET of the entire program is formulated as an integer linear program (ILP). The formulated ILP can be solved by any commercial solver (e.g., CPLEX) to obtain the whole program's WCET.
We have implemented our framework as an extension of Chronos [Li et al. 2007], a freely available, open-source, single-core WCET analysis tool. To evaluate our approach, we have also extended a cycle-accurate simulator [Austin et al. 2002] with both shared cache and shared bus support. Our experiments with moderate- to large-size benchmarks from Gustafsson et al. [2010] show that we can obtain tight WCET estimates for most of the benchmarks over a wide range of microarchitectural configurations.
RELATED WORK
WCET Analysis in Single Core. Research in single-core WCET analysis started a few decades ago. Initial works used only integer linear programming (ILP) for both microarchitectural modeling and path analysis [Li et al. 1999]. However, the work proposed in Li et al. [1999] faces scalability problems due to the explosion in the number of generated ILP constraints. In Theiling et al. [2000], a novel approach was proposed which employs abstract interpretation for microarchitectural modeling and ILP for path analysis. Subsequently, an iterative fixed-point analysis was proposed in Li et al. [2006] for modeling advanced microarchitectural features such as out-of-order and superscalar pipelines. The pipeline analysis proposed in Li et al. [2006] was later extended to consider parametric execution contexts in Rochange and Sainrat [2009]. An ILP-based modeling of branch predictors was proposed, among others, in Li et al. [2005], and this work was later extended by Maiza and Rochange [2011] to automatically generate models for different dynamic branch prediction schemes. Our baseline framework is built upon the techniques proposed in Li et al. [2005, 2006].
Timing Analysis of Shared Cache. Although there has been significant progress in single-core WCET analysis research, little has been done so far in WCET analysis for multicores. Multicore processors employ shared resources (e.g., shared cache, shared bus), which gives rise to the new problem of modeling intercore conflicts. A few solutions have already been proposed for analyzing a shared cache [Yan and Zhang 2008; Li et al. 2009; Hardy et al. 2009]. All of these approaches extend the abstract-interpretation-based cache analysis proposed in Theiling et al. [2000]. However, in contrast to our proposed framework, these approaches model the shared cache in isolation, assume a timing-anomaly-free architecture, and ignore the interaction of the shared cache with other microarchitectural features (e.g., pipeline and branch prediction). A recent approach has enhanced the abstract-interpretation-based shared cache analysis with a gradual and controlled use of model checking. In this approach, abstract interpretation is used as a baseline analysis. Subsequently, a model checking pass is applied to improve the result generated by abstract interpretation. Since abstract interpretation is inherently path insensitive, it generates some spurious cache conflicts due to the presence of infeasible program paths. Due to the path-sensitive search process employed by a model checker, the model checking pass eliminates certain spurious shared cache conflicts that can never be realized in any real execution. This approach does not model the shared bus, and any improvement generated by it will directly improve the precision of WCET prediction using our framework.
Timing Analysis of Shared Bus. A shared bus introduces the difficulty of accurately analyzing the variable bus delay. It has been shown in Wilhelm et al. [2009] that a time division multiple access (TDMA) scheme is useful for WCET analysis due to its statically predictable nature. Subsequently, the analysis of TDMA-based shared buses was introduced in Rosen et al. [2007]. In Rosen et al. [2007], it was shown that a statement inside a loop may exhibit different bus delays in different iterations. Therefore, all loop iterations are virtually unrolled to accurately compute the bus delays of a memory reference inside a loop. As loop unrolling is sometimes undesirable due to its inherent computational complexity, Chattopadhyay et al. [2010] proposed a TDMA bus analysis technique which analyzes a loop without unrolling it. However, Chattopadhyay et al. [2010] requires a fixed alignment cost for each loop iteration. Such a loop alignment ensures that a particular memory reference inside the loop suffers exactly the same bus delay in any iteration. The analysis proposed in Chattopadhyay et al. [2010] is fast, as it avoids loop unrolling, but imprecise due to the alignment cost added for each loop iteration. Finally, a more recent approach proposes an efficient TDMA-based bus analysis technique which avoids full loop unrolling but is almost as precise as Rosen et al. [2007], while significantly improving the analysis time compared to Rosen et al. [2007]. However, none of these works model the interaction of the shared bus with the pipeline and branch prediction. Additionally, Rosen et al. [2007] and Chattopadhyay et al. [2010] assume a timing-anomaly-free architecture. A recent approach [Lv et al. 2010] has combined abstract interpretation and model checking for WCET analysis in multicores. The microarchitecture analyzed by Lv et al. [2010] contains a private cache for each core and a shared bus connecting all the cores to main memory. The framework uses abstract interpretation [Theiling et al. 2000] for analyzing the private caches and model checking for analyzing the shared bus. However, Lv et al. [2010] ignores the interaction of the shared bus with the pipeline and branch prediction. It is also unclear whether the proposed framework would remain scalable in the presence of a shared cache and other microarchitectural features (e.g., pipeline).
Time-predictable Microarchitecture and Execution Model.
To eliminate the problem of pessimism in multicore WCET analysis, researchers have proposed predictable multicore architectures [Paolieri et al. 2009] and predictable execution models via code transformations [Pellizzoni et al. 2011]. The work in Paolieri et al. [2009] proposes several microarchitectural modifications (e.g., shared cache partitioning among cores, a TDMA round robin bus) so that existing WCET analysis methodologies for single cores can be adopted for analyzing hard real-time software running on such systems. On the other hand, Pellizzoni et al. [2011] proposes compiler transformations to partition the original program into several time-predictable intervals. Each such interval is further partitioned into a memory phase (where memory blocks are prefetched into the cache) and an execution phase (where the task does not suffer any last-level cache miss and does not generate any traffic on the shared bus). As a result, bus traffic scheduled during the execution phases of other tasks does not suffer any additional delay due to bus contention. We argue that the aforementioned approaches are orthogonal to the idea of this article, and that our approach can be used to pinpoint the sources of overestimation in multicore WCET analysis.
In summary, the limited progress made so far on multicore WCET analysis has modeled the different microarchitectural components (e.g., shared cache, shared bus) in isolation. Our work differs from all previous works by proposing a unified framework which is able to analyze the basic microarchitectural components and their interactions in a multicore processor.
BACKGROUND
In this section, we introduce the basic background behind our WCET analysis framework. Our WCET analysis framework for multicore is based on the pipeline modeling of Li et al. [2006] .
Pipeline Modeling through Execution Graphs.
The central idea of pipeline modeling revolves around the concept of the execution graph [Li et al. 2006]. An execution graph is constructed for each basic block in the program control flow graph (CFG). For each instruction in the basic block, the corresponding execution graph contains a node for each of the pipeline stages. We assume a five-stage pipeline: instruction fetch (IF), decode (ID), execution (EX), write back (WB), and commit (CM). Edges in the execution graph capture the dependencies among pipeline stages, either due to resource constraints (instruction fetch queue size, reorder buffer size, etc.) or due to data dependencies (read-after-write hazards). The timing of each node in the execution graph is represented by an interval, which covers all possible latencies suffered by the corresponding pipeline stage. Figure 1 shows a snippet of assembly code and the corresponding execution graph. The example assumes a 2-way superscalar processor with a 2-entry instruction fetch queue (IFQ) and a 4-entry reorder buffer (ROB). Since the processor is 2-way superscalar, instruction I3 cannot be fetched before the fetch of I1 finishes. This explains the edge between the IF nodes of I1 and I3. On the other hand, since the IFQ size is 2, the IF stage of I3 cannot start before the ID stage of I1 finishes (edge between the ID stage of I1 and the IF stage of I3). Note that I3 is data dependent on I1 and, similarly, I5 is data dependent on I4. Therefore, we have edges from the WB stage of I1 to the EX stage of I3 and from the WB stage of I4 to the EX stage of I5. Finally, as the ROB size is 4, I1 must be removed from the ROB (i.e., committed) before I5 can be decoded. This explains the edge from the CM stage of I1 to the ID stage of I5.

Fig. 1. Execution graph for the example program in a 2-way superscalar processor with 2-entry instruction fetch queue and 4-entry reorder buffer. Solid edges show the dependency between pipeline stages, whereas the dotted edges show the contention relation.
A dotted edge in the execution graph (e.g., the edge between the EX stages of I2 and I4) represents the contention relation (i.e., a pair of instructions which may contend for the same functional unit). Since I2 and I4 may contend for the same functional unit (the multiplier), they might delay each other due to contention. The pipeline analysis is iterative. The analysis starts without any timing information and assumes that all pairs of instructions which use the same functional unit and can coexist in the pipeline may contend with each other. In the example, therefore, the analysis starts with {(I1,I2), (I2,I4), (I1,I4), (I3,I5)} as the contention relation. After one iteration, the timing information of each pipeline stage is obtained, and the analysis may rule out some pairs from the contention relation if their timing intervals do not overlap. With this updated contention relation, the analysis is repeated and, subsequently, refined timing information is obtained for each pipeline stage. The analysis terminates when no further elements can be removed from the contention relation. The WCET of the code snippet is then given by the worst-case completion time of the CM node of I5.
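The iterative refinement just described can be sketched as follows. This is an illustration of the fixed-point loop, not the actual implementation of Li et al. [2006]; the interval representation and the callback name are ours.

def overlaps(a, b):
    """Two timing intervals can contend only if they overlap."""
    return a[0] <= b[1] and b[0] <= a[1]

def refine_contention(contention, compute_intervals):
    """Fixed-point refinement of the contention relation.

    contention: set of instruction pairs that may contend initially,
    e.g., {("I1","I2"), ("I2","I4"), ("I1","I4"), ("I3","I5")}.
    compute_intervals: one pass of the pipeline analysis; given the
    current contention relation, it returns a timing interval per node.
    """
    while True:
        intervals = compute_intervals(contention)
        # Drop pairs whose timing intervals can never overlap.
        refined = {(i, j) for (i, j) in contention
                   if overlaps(intervals[i], intervals[j])}
        if refined == contention:      # no pair removed: fixed point
            return contention, intervals
        contention = refined

In the example of Figure 1, the initial relation would shrink as soon as the computed intervals of a pair (say, (I1, I4)) stop overlapping, after which the analysis is repeated under the smaller relation.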
OVERVIEW OF OUR ANALYSIS

Figure 2 gives an overview of our analysis framework. Each processor core is analyzed one at a time, taking care of the intercore conflicts generated by all other cores. Figure 2 shows the analysis flow for some program A running on a dedicated processor core. The overall analysis can broadly be classified into two separate phases: (1) microarchitectural modeling and (2) path analysis. In microarchitectural modeling, the timing behavior of different hardware components is analyzed (as shown by the big dotted box in Figure 2). We use abstract interpretation (AI)-based cache analysis [Theiling et al. 2000] to categorize memory references as all-hit (AH) or all-miss (AM) in the L1 and L2 caches. A memory reference is categorized AH (AM) if the resulting access is always a cache hit (miss). If a memory reference cannot be categorized as AH or AM, it is categorized as unclassified (NC). In the presence of a shared L2 cache, the categorization of a memory reference may change from AH to NC due to intercore conflicts [Li et al. 2009].
Moreover, as shown in Figure 2, the L1 and L2 cache analyses have to consider the effect of speculative execution when a branch instruction is mispredicted (refer to Section 7 for details). Similarly, the timing effects generated by the mispredicted instructions are also taken into account during the iterative pipeline modeling (refer to Li et al. [2006] for details). The shared bus analysis computes the bus context under which an instruction can execute. The outcome of the cache analysis and the shared bus analysis is used to compute the latency of different pipeline stages during the pipeline analysis (refer to Section 5 for details). Pipeline modeling is iterative and finally computes the WCET of each basic block. The WCET of the entire program is formulated as maximizing the objective function of a single integer linear program (ILP). The WCETs of individual basic blocks are used to construct the objective function of the formulated ILP. The constraints of the ILP are generated from the structure of the program's control flow graph (CFG), from microarchitectural modeling (branch predictor and shared bus), and from additional user-given constraints (e.g., loop bounds). The modeling of the branch predictor generates constraints to bound the execution count of mispredicted branches (for details, refer to Li et al. [2005]). On the other hand, the constraints generated for bus contexts bound the execution count of a basic block under different bus contexts (for details, refer to Section 6). Path analysis finds the longest feasible program path from the formulated ILP through implicit path enumeration (IPET). Any ILP solver (e.g., CPLEX) can be used for IPET and for deriving the whole program's WCET.
System and Application Model. We assume a multicore processor where each core has a private L1 cache. Additionally, multiple cores share an L2 cache. The extension of our framework to more than two levels of caches is straightforward. If a memory block is not found in the L1 or L2 cache, it has to be fetched from main memory. Any memory transaction to the L2 cache or to main memory has to go through a shared bus. For the shared bus, we assume a TDMA-based round-robin arbitration policy, where a fixed-length bus slot is assigned to each core. We also assume fully separated caches and buses for instruction and data memory. Therefore, data references do not interfere with instruction references. In this work, we only model the effect of instruction caches. However, data cache effects can be considered in a similar fashion. Since we consider only instruction caches, the cache miss penalty (computed from cache analysis) directly affects the instruction fetch (IF) stage of the pipeline. We do not consider self-modifying code; therefore, we do not need to model coherence traffic. Finally, we consider only the LRU cache replacement policy and noninclusive caches for now. Later, in Section 11 and Section 12, we shall extend our framework to the FIFO cache replacement policy and discuss the extension of our framework to other cache replacement policies (e.g., PLRU), other cache hierarchies (e.g., inclusive), and data caches.
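For concreteness, the platform parameters used by the code sketches in the rest of this article can be collected in one place. This is our own scaffolding: the field names and default values are illustrative and do not reproduce the defaults of Table I.

from dataclasses import dataclass

@dataclass
class SystemModel:
    """Assumed multicore platform parameters (illustrative values)."""
    cores: int = 2        # cores sharing the L2 cache and the TDMA bus
    slot_len: int = 50    # TDMA bus slot length per core (cycles)
    lat_l1: int = 6       # L1 miss penalty: latency of an L2 access
    lat_l2: int = 30      # L2 miss penalty: latency of a memory access

    def round_len(self) -> int:
        """Length of one complete TDMA round: S_l * C."""
        return self.slot_len * self.cores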
INTERACTION OF SHARED RESOURCES WITH PIPELINE
Let us assume each node i in the execution graph is annotated with timing parameters, namely the earliest and latest ready, start, and finish times of the corresponding pipeline stage, which are computed iteratively. Pipeline modeling is iterative: the analysis starts with the coarse interval [0, ∞] for each node, and the interval is subsequently tightened in each iteration. The computation of a precise interval takes into account the analysis results of the caches and the shared bus. The iterative analysis eliminates certain infeasible contentions among the pipeline stages in each iteration, thereby leading to tighter timing intervals after each iteration. The iterative analysis starts with a contention relation. Such a contention relation contains pairs of instructions which may potentially delay each other due to contention. Initially, all possible pairs of instructions are included in the contention relation, and after each iteration, pairs of instructions whose timing intervals do not overlap are removed from this relation. If the contention relation does not change in some iteration, the iterative analysis terminates. Since the number of instructions in a basic block is finite, the contention relation contains a finite number of elements, and every iteration that does not terminate the analysis removes at least one element from the relation. Therefore, the analysis is guaranteed to terminate. Moreover, once the contention relation no longer changes, the timing interval of each node has reached a fixed point. In the following, we shall discuss how the presence of a shared cache and a shared bus affects the timing information of the different pipeline stages.
Interaction of Shared Cache with Pipeline
Let us assume CHMC^L1_i (CHMC^L2_i) denotes the AH/AM/NC cache hit-miss classification of an IF node i in the L1 (shared L2) cache. Further assume that E_i denotes the possible latencies of an IF node i without considering any shared bus delay. E_i can be defined as follows:

E_i = [1, 1]                                          if CHMC^L1_i = AH
E_i = [1 + LAT^L1, 1 + LAT^L1]                        if CHMC^L1_i = AM and CHMC^L2_i = AH
E_i = [1 + LAT^L1 + LAT^L2, 1 + LAT^L1 + LAT^L2]      if CHMC^L1_i = AM and CHMC^L2_i = AM
E_i = [1 + LAT^L1, 1 + LAT^L1 + LAT^L2]               if CHMC^L1_i = AM and CHMC^L2_i = NC
E_i = [1, 1 + LAT^L1]                                 if CHMC^L1_i = NC and CHMC^L2_i = AH
E_i = [1, 1 + LAT^L1 + LAT^L2]                        otherwise                          (1)

where LAT^L1 and LAT^L2 represent the fixed L1 and L2 cache miss latencies, respectively, and an L1 hit is assumed to take a single cycle. Note that the interval-based representation captures the possibilities of both a cache hit and a cache miss in the case of an NC categorized cache access. Therefore, the computation of E_i can also deal with architectures that exhibit timing anomalies.
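The case analysis of Eq. (1) can be transcribed directly. The sketch below assumes a one-cycle fetch on an L1 hit and uses the SystemModel scaffolding introduced earlier; both are our assumptions.

def fetch_latency_interval(chmc_l1, chmc_l2, m):
    """E_i of Eq. (1) as an interval (lo, hi) of possible IF latencies,
    excluding any shared bus delay. chmc_* is 'AH', 'AM', or 'NC';
    m.lat_l1 / m.lat_l2 stand for LAT^L1 / LAT^L2."""
    base = 1                                   # assumed 1-cycle L1 hit
    if chmc_l1 == 'AH':
        return (base, base)
    l2_best = m.lat_l2 if chmc_l2 == 'AM' else 0
    l2_worst = 0 if chmc_l2 == 'AH' else m.lat_l2
    lo = base + (m.lat_l1 + l2_best if chmc_l1 == 'AM' else 0)
    hi = base + m.lat_l1 + l2_worst
    return (lo, hi)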
Interaction of Shared Bus with Pipeline
Let us assume that we have a total of C cores and that the TDMA-based round robin scheme assigns a slot of length S_l to each core. Therefore, the length of one complete round is S_l · C. We begin with the following definitions, which are used throughout the article.
Definition 5.1 (TDMA Offset).
A TDMA offset at a particular time T is defined as the relative distance of T from the beginning of the last scheduled round. Therefore, at time T, the TDMA offset can be precisely defined as T mod (S_l · C).
Note that max_lat_p and min_lat_p are not constants: they depend on the incoming bus context (O^in_i) and on the set of possible latencies of IF node i in the absence of a shared bus (E_i). max_lat_p and min_lat_p are defined as follows:

max_lat_p = max { Δ_p(o, t) | o ∈ O^in_i, t ∈ E_i }    (4)
min_lat_p = min { Δ_p(o, t) | o ∈ O^in_i, t ∈ E_i }    (5)

In these equations, E_i represents the set of possible latencies of an IF node i in the absence of shared bus delay (refer to Eq. (1)). Given a TDMA offset o and a latency t in the absence of shared bus delay, Δ_p(o, t) computes the total delay (including shared bus delay) faced by the IF stage of the pipeline. Δ_p(o, t) is defined similarly to Chattopadhyay et al. [2010]: if a bus transaction issued at offset o can complete within the current bus slot of the core, it proceeds without additional delay; otherwise, it is deferred until the beginning of the core's next slot.
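Since the exact definition of Δ_p is not reproduced in this version of the text, the following sketch implements the slot-fitting rule just described. All names are ours, and the assumption that one bus transaction must fit entirely within a single slot is a simplification.

def tdma_delay(o, t, core, m):
    """A plausible Delta_p(o, t): total IF latency, including the wait
    for the shared TDMA bus, for a transaction of t cycles issued at
    offset o by `core` (m is the SystemModel sketched earlier)."""
    R = m.round_len()
    assert t <= m.slot_len, "simplification: one transaction fits in a slot"
    start = core * m.slot_len               # begin of this core's bus slot
    if start <= o and o + t <= start + m.slot_len:
        return t                            # fits in the current slot
    return (start - o) % R + t              # wait until the core's next slot

def lat_bounds(offsets_in, e_i, core, m):
    """max_lat_p and min_lat_p of Eqs. (4)-(5) over O^in_i x E_i."""
    d = [tdma_delay(o, t, core, m) for o in offsets_in for t in e_i]
    return max(d), min(d)

For example, with two cores and a 50-cycle slot, a 30-cycle transaction issued by core 0 at offset 10 completes in 30 cycles, while the same transaction issued at offset 30 must wait 70 cycles for core 0's next slot, giving a total delay of 100 cycles.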
In the following, we show the computation of the incoming and outgoing bus contexts (O^in_i and O^out_i), of the possible latencies of an execution graph node i (including shared bus delay), and of the contention suffered by the corresponding pipeline stage. In the modeled pipeline, in-order stages (i.e., IF, ID, WB, and CM) do not suffer from contention. But the out-of-order stage (i.e., the EX stage) may experience contention when it is ready to execute (i.e., its operands are available) but cannot start execution due to the unavailability of a functional unit. The worst-case contention period of an execution graph node i is bounded by the difference between the latest start time and the earliest ready time of the corresponding pipeline stage. The outgoing bus context of node i is obtained by advancing its incoming bus context with the possible latencies of i:

O^out_i = u(O^in_i, D_i)    (7)

where D_i denotes the set of possible total latencies of node i (including shared bus delay).
Here, u denotes the update function on a TDMA offset set with a set of possible latencies of node i and is defined as follows:

u(O, D) = { (o + t) mod (S_l · C) | o ∈ O, t ∈ D }    (8)
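Composing Δ_p with u yields the outgoing context of Eq. (7). The sketch below pairs each incoming offset with its own total delay, a slightly more precise reading than a plain set product; as before, the function and parameter names are ours.

def outgoing_context(offsets_in, e_i, core, m):
    """O^out_i = u(O^in_i, D_i) (Eqs. (7)-(8), as reconstructed):
    advance every incoming TDMA offset by every possible total latency
    of node i, modulo one TDMA round."""
    R = m.round_len()
    return {(o + tdma_delay(o, t, core, m)) % R
            for o in offsets_in for t in e_i}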
Figure 3 gives a sound approximation of O^in_i by propagating the outgoing bus contexts of all predecessors of node i. However, it is important to observe that not all predecessors in the execution graph can propagate TDMA offsets to node i. Recall that the edges in the execution graph represent dependencies (either due to resource constraints or due to true data dependencies). Therefore, node i in the execution graph can only start when all the nodes in pred(i) have finished. Consequently, the TDMA offsets are propagated to node i only from the predecessor j that finishes immediately before i is ready. Nevertheless, our static analyzer may not be able to compute a single predecessor that propagates TDMA offsets to node i. However, for two arbitrary execution graph nodes j1 and j2, if we can guarantee that latest[t^finish_j1] ≤ earliest[t^finish_j2], we can also guarantee that j2 finishes later than j1. Therefore, the incoming bus context of node i can be refined as

O^in_i = ∪ { O^out_j | j ∈ pred(i) and latest[t^finish_j] ≥ earliest[t^finish_pmax] }    (9)

where pmax is a predecessor of i such that earliest[t^finish_pmax] = max { earliest[t^finish_j] | j ∈ pred(i) }. Therefore, O^in_i captures all possible outgoing TDMA offsets from the predecessor nodes that may finish latest. Given that the value of O^out_j is an overapproximation of the outgoing bus context for each predecessor j of i, Eq. (9) gives an overapproximation of the incoming bus context at node i. Finally, Eqs. (7) and (9) together ensure a sound computation of the bus contexts at the entry and exit of each execution graph node.
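The predecessor filtering behind Eq. (9) (as reconstructed above) can be sketched as follows; finish[j] holds the (earliest, latest) finish times of node j and out_ctx[j] its outgoing offset set, both names being ours.

def incoming_context(preds, out_ctx, finish):
    """O^in_i: union of O^out_j over all predecessors j of i that may
    finish last. finish[j] = (earliest, latest) finish times."""
    if not preds:
        return set()
    # pmax: predecessor with the largest earliest finish time
    pmax = max(preds, key=lambda j: finish[j][0])
    ctx = set()
    for j in preds:
        if finish[j][1] >= finish[pmax][0]:    # j may finish after pmax
            ctx |= out_ctx[j]
    return ctx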
WCET COMPUTATION UNDER MULTIPLE BUS CONTEXTS

Execution Context of a Basic Block
Computing Bus Context without Loops. In the previous section, we discussed the pipeline modeling of a basic block B in isolation. However, to correctly compute the execution time of B, we need to consider (1) the prologue of B, that is, the instructions before B which may directly affect the execution time of B, and (2) the epilogue of B, that is, the instructions after B which may overlap with B in the pipeline. The bus context at the entry of B is therefore propagated through the execution graph of B extended with its prologue and epilogue.

Computing Bus Context in Loops. For a basic block inside a loop l, the bus context may differ from one loop iteration to the next. π^in_l and π^out_l are the sets of pipeline stages which could propagate TDMA offsets across loop iterations and outside of the loop, respectively. Therefore, π^in_l corresponds to the pipeline stages of instructions inside l which resolve loop-carried dependencies (due to resource constraints, pipeline structural constraints, or true data dependencies). On the other hand, π^out_l corresponds to the pipeline stages of instructions inside l which resolve the dependencies of instructions outside of l.
Bounding the Execution Count of a Bus Context
Foundation. As discussed in the preceding, a basic block inside some loop may execute under different bus contexts. For all nonfirst iterations, a loop l is entered with a bus context of the form (O_x1, O_x2, . . . , O_xn), where {x1, x2, . . . , xn} is the set of π^in_l nodes as described in Figure 4 and O_xj is the set of TDMA offsets propagated by node xj across loop iterations. These bus contexts are computed during an iterative analysis of the loop l (described here). On the other hand, the bus context at the first iteration of l is a tuple of TDMA offsets propagated from outside of l to some pipeline stage inside l. Note that the bus context at the first iteration of l is computed by following the general procedure described in Section 5.
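The per-loop analysis can be pictured as a fixed-point iteration over bus contexts: starting from the first-iteration context, the loop body is reanalyzed under every context discovered so far until no new context appears. The sketch below is our simplification; `analyze_body` stands for one pass of the Section 5 analysis of the loop body and returns the context entering the next iteration.

def loop_contexts(first_ctx, analyze_body):
    """Collect all bus contexts under which loop l may be entered, plus
    the successor relation between contexts of consecutive iterations.
    A context must be hashable, e.g., a tuple of frozensets of offsets."""
    seen, edges = {first_ctx}, set()
    frontier = [first_ctx]
    while frontier:
        w = frontier.pop()
        w_next = analyze_body(w)     # context entering the next iteration
        edges.add((w, w_next))
        if w_next not in seen:
            seen.add(w_next)
            frontier.append(w_next)
    return seen, edges

Since TDMA offsets range over a finite set, the number of distinct contexts is finite and the iteration terminates.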
In this section, we shall show how the execution count of different bus contexts can be bounded by generating additional ILP constraints. These additional constraints are added to a global ILP formulation to find the WCET of the entire program. We begin with the following notations.
Ω_l: the set of all bus contexts that may reach loop l in any iteration.
M^w_l: the number of times l can be entered with bus context w at some iteration.
M^(w1→w2)_l: the number of times l can be entered with bus context w1 at some iteration n and with bus context w2 at iteration n + 1 (where n + 1 does not exceed the loop bound of l).
Σ_{w ∈ Ω_l} M^w_l ≤ N_{l.h}    (11)

where N_{l.h} denotes the number of times the header of loop l is executed. Equation (10) generates standard flow constraints from each graph G^s_l constructed for loop l, relating M^w_l to the incoming flows M^(x→w)_l and the outgoing flows M^(w→x)_l. Special constraints need to be added for the bus contexts with which the loop is entered at the first iteration and at the last iteration. If w is a bus context with which loop l is entered at the last iteration, M^w_l exceeds the execution count of the outgoing flows (i.e., M^(w→x)_l). Equation (10) takes this special case into consideration. On the other hand, Eq. (11) bounds the aggregate execution count of all possible contexts w ∈ Ω_l by the total execution count of the loop header. Note that N_{l.h} will further be involved in defining the CFG structural constraints, which relate the execution count of a basic block with the execution count of its incoming and outgoing edges [Theiling et al. 2000]. Equations (10)-(11) do not ensure that whenever loop l is invoked, the loop must be executed at least once with some bus context in Ω^s_l. We add additional ILP constraints (Eqs. (12)-(13)) to ensure this, where bound_l represents the relative loop bound of l and parent(G^s_l) denotes the program fragment immediately enclosing loop l. Finally, we need to bound the execution count of any basic block i (immediately enclosed by loop l) with different bus contexts. We generate the following two constraints to bound this value:

N_i ≤ Σ_{w ∈ Ω_l} N^(w.i)_i    (14)
N^(w.i)_i ≤ M^w_l    (15)

where N_i represents the total execution count of basic block i and N^(w.i)_i represents the execution count of basic block i with bus context w.i. Equation (15) captures the fact that basic block i can execute with bus context w.i at some iteration of l only if l is reached with bus context w at the same iteration (by definition). N_i is further constrained through the structure of the program's CFG, which we exclude from our discussion.
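To make the constraint generation concrete, the fragment below emits Eq. (10)- and Eq. (11)-style constraints as plain text for an ILP solver. The variable naming is ours, and the special first- and last-iteration cases of Eq. (10) are omitted.

def context_flow_constraints(loop_id, edges, contexts):
    """Emit simplified Eq. (10)/(11)-style ILP constraints as text.
    edges: (w1, w2) pairs of the per-loop bus context graph."""
    cons = []
    for w in contexts:
        out = [f"M_{loop_id}_{w1}_{w2}" for (w1, w2) in edges if w1 == w]
        # Eq. (10), simplified: the outgoing flow of a context cannot
        # exceed the number of times the loop is entered with it.
        cons.append(" + ".join(out or ["0"]) + f" <= M_{loop_id}_{w}")
    # Eq. (11): all context counts are bounded by the header count.
    cons.append(" + ".join(f"M_{loop_id}_{w}" for w in contexts)
                + f" <= N_{loop_id}_h")
    return cons

For example, context_flow_constraints("l", {("w1", "w2"), ("w2", "w2")}, ["w1", "w2"]) yields one constraint per context plus the header bound.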
Computing Bus Contexts at Loop Exit. The bus contexts propagated past the exit of a loop (through the π^out_l nodes) are computed by a similar iterative procedure, which is also sound. For details of this analysis, readers are further referred to the accompanying technical report.
EFFECT OF BRANCH PREDICTION
The presence of branch prediction introduces additional complexity in WCET computation. If a conditional branch is mispredicted, the timing of the mispredicted instructions needs to be computed. Mispredicted instructions introduce additional conflicts in the L1 and L2 caches, which need to be modeled for a sound WCET computation. Similarly, branch misprediction also affects the bus delay suffered by subsequent instructions. In the following, we describe how our framework models the interaction of the branch predictor with the caches and the bus. It is important to note that speculation also impacts the timing of different pipeline stages. Our proposed multicore WCET analysis framework is based on the work proposed in Li et al. [2006], which accounts for the timing effects of speculation on the pipeline by augmenting the execution graph along the wrong-path execution (i.e., along the mispredicted branch path). As the central theme of this article is WCET analysis in multicores, we only discuss the impact of speculation on the (shared) caches and the bus. For timing interactions that are not specific to multicores (e.g., timing interactions between speculation and pipeline), we refer the reader to our previous work [Li et al. 2006].
Effect on Cache for Speculative Execution.
Abstract-interpretation-based cache analysis produces a fixed point on the abstract cache content at the entry (denoted as ACS^in_i) and at the exit (denoted as ACS^out_i) of each basic block i. If a basic block i has multiple predecessors, the output cache states of the predecessors are joined to produce the input cache state of basic block i. Consider an edge j → i in the program's CFG. If j → i is an unconditional edge, the computation of ACS^in_i does not require any change. However, if j → i is a conditional edge, the condition could be correctly or incorrectly predicted during the execution. For a correct prediction, the cache state ACS^in_i is still sound. On the other hand, for an incorrect prediction, ACS^in_i must be updated with the memory blocks accessed on the mispredicted path. We assume that there can be at most one unresolved branch at a time. Therefore, the number of mispredicted instructions is bounded by the number of instructions until the next branch, as well as by the total size of the instruction fetch queue and the reorder buffer. To maintain a safe cache state at the entry of each basic block i, we join the two cache states arising from the correct and incorrect predictions of the conditional edge j → i. We demonstrate the entire scenario in Figure 5, which shows the procedure for computing the abstract cache state at the entry of a basic block i that is conditionally reached from basic block j. To compute a safe cache content at the entry of basic block i, we combine two different possibilities: one where the respective branch is correctly predicted (Figure 5(a)) and the other where the respective branch is incorrectly predicted (Figure 5(b)). The combination is performed through an abstract join operation, which depends on the type of analysis (must or may) being computed. Stabilization of the abstract cache contents at the entry and exit of each basic block is achieved through conventional fixed-point analysis.
It is worthwhile to mention that our analysis conservatively estimates the impact of speculation on the cache content. The cache hit/miss categorization of a memory reference is determined via the overapproximation (during the may analysis) and the underapproximation (during the must analysis) of cache contents at each program point. The join operation, as shown in Figure 5(c), ensures that the overapproximation (underapproximation) of cache contents is preserved during the may (must) analysis in the presence of zero or more branch mispredictions. It is important to note that such an overapproximation and underapproximation consider all possible branch misprediction scenarios (including zero branch mispredictions). Therefore, our analysis preserves soundness in the presence of timing anomalies. However, our modeling is also conservative in the sense that we do not take the exact number of branch mispredictions into account for cache analysis.
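A sketch of the abstract cache state computation at a conditional edge j → i (Figure 5) follows. Here `update` and `join` stand for the replacement-policy-specific abstract transformer and join (intersection-based for must, union-based for may), and `wrong_path_blocks` for the memory blocks fetched along the mispredicted path; all names are ours.

def acs_in_conditional(acs_out_j, wrong_path_blocks, update, join):
    """Abstract cache state at the entry of i for a conditional edge
    j -> i. Combines the correct-prediction state (unchanged) with the
    misprediction state (updated by the wrong-path fetches)."""
    correct = acs_out_j                  # branch predicted correctly
    wrong = acs_out_j
    for blk in wrong_path_blocks:        # bounded by IFQ + ROB sizes
        wrong = update(wrong, blk)
    return join(correct, wrong)          # sound for 0+ mispredictions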
Effect on Bus for Speculative Execution.
Due to branch misprediction, some additional instructions might be fetched from the mispredicted path. As described in Section 6, the execution graph of each basic block B contains a prologue (instructions before B which directly affect the execution time of B). If the last instruction of the prologue is a conditional branch, the respective execution graph is augmented with the instructions along the mispredicted path [Li et al. 2006]. Since the propagation of bus contexts is performed entirely on the execution graph (as shown in Section 5), our shared bus analysis remains unchanged, except that it works on an augmented execution graph (which contains instructions from the mispredicted path) in the presence of speculation.
Computing the Number of Mispredicted Branches.
In the presence of a branch predictor, each conditional edge j → i in the program CFG can be correctly or incorrectly predicted. Let us assume E_{j→i} denotes the total number of times control flow edge j → i is executed and E^mp_{j→i} denotes the number of times edge j → i is mispredicted. Following Li et al. [2005], the modeling of the branch predictor generates ILP constraints that bound E^mp_{j→i}; these bounds, in turn, limit the misprediction penalties accounted for in the WCET computation.
WCET COMPUTATION OF AN ENTIRE PROGRAM
We compute the WCET of the entire program with N basic blocks by using the following objective function:

Maximize Σ_{i=1..N} Σ_{w.i ∈ Γ_i} c_{w.i} · N^(w.i)_i + Σ_{j→i} c^mp_{j→i} · E^mp_{j→i}    (16)

where c_{w.i} denotes the WCET of basic block i under bus context w.i, c^mp_{j→i} denotes the misprediction penalty associated with the conditional edge j → i, and Γ_i denotes the set of all bus contexts under which basic block i can execute. Basic block i can be executed with different bus contexts. However, the number of elements in Γ_i is always bounded by the number of bus contexts entering the loop immediately enclosing i. The variables N^(w.i)_i and E^mp_{j→i} are bounded by the CFG structural constraints [Theiling et al. 2000] and the constraints proposed by Eqs. (10)-(15) in Section 6. Note that in Eqs. (10)-(15), we only discuss the ILP constraints related to the bus contexts. Other ILP constraints, such as CFG structural constraints and user constraints, are also used in our framework for an IPET implementation.
Finally, the WCET of the program is obtained by maximizing the objective function in Eq. (16). Any ILP solver (e.g., CPLEX) can be used for this purpose.
SOUNDNESS OF ANALYSIS
In this section, we provide the basic ideas behind the proof of soundness of our analysis framework. Due to space constraints, details of the proofs are included in the accompanying technical report.
The heart of the soundness guarantee follows from the fact that we represent the timing of each pipeline stage as an interval that covers all of its possible latencies. Moreover, the incoming and outgoing bus contexts of each execution graph node are overapproximated by Eqs. (7) and (9). Therefore, we can always compute the interval spanning from the minimum to the maximum bus delay using O^in_i (Eqs. (4) and (5)). To conclude, we argue that the longest acyclic path search in the execution graph always results in a sound estimation of a basic block's WCET. Finally, the IPET approach searches for the longest feasible program path to ensure a sound estimation of the whole program's WCET.
EXPERIMENTAL EVALUATION

Experimental Setup
We have chosen moderate- to large-size benchmarks from Gustafsson et al. [2010], which are generally used for timing analysis. Individual benchmarks are compiled into simplescalar PISA (Portable Instruction Set Architecture) [Austin et al. 2002], a MIPS-like instruction set architecture. We use the simplescalar gcc cross compiler with optimization level -O2 to generate the PISA-compliant binary of each benchmark. The control flow graph (CFG) of each benchmark is extracted from its PISA-compliant binary and is used as an input to our analysis framework. In our current implementation, the analysis frontend (the CFG extractor) and the modeling of the pipeline do not appropriately handle recursion, switch statements, unstructured gotos, or break statements inside loops. Such programs from Gustafsson et al. [2010] are therefore not included in our evaluation.
To validate our analysis framework, the simplescalar toolset [Austin et al. 2002] was extended to support the simulation of shared cache and shared bus. The simulation infrastructure is used to compare the estimated WCET with the observed WCET. Observed WCET is measured by simulating the program for a few program inputs. Nevertheless, we would like to point out that the presence of a shared cache and a shared bus makes the realization of the worst-case scenario extremely challenging. In the presence of a shared cache and a shared bus, the worst-case scenario depends on the interleavings of threads, which are running on different cores. Consequently, the observed WCET result in our experiments may sometimes highly underapproximate the actual WCET.
For all of our experiments, we present the WCET overestimation ratio, which is measured as Estimated WCET / Observed WCET. For each reported overestimation ratio, the system configuration during the analysis (which computes the estimated WCET) and the measurement (which computes the observed WCET) are kept identical. Unless otherwise stated, our analysis uses the default system configuration in Table I (as shown by the column "Default settings"). Since data cache modeling is not yet included in our current implementation, all data accesses are assumed to be L1 cache hits (for both analysis and measurement).
To check the dependency of the WCET overestimation on the type of conflicting task (being run in parallel on a different core), we use two different tasks to generate the intercore conflicts: (1) jfdctint, a single-path program, and (2) statemate, which has a huge number of paths. In our experiments (Figures 6-8), we use jfdctint to generate intercore conflicts for the first half of the tasks (i.e., matmult to lcdnum). On the other hand, we use statemate to generate intercore conflicts for the second half of the tasks (i.e., minver to st). Due to the absence of any infeasible program path, intercore conflicts generated by a single-path program (e.g., jfdctint) can be more accurately modeled than those of a multipath program (e.g., statemate). Therefore, in the presence of a shared cache, we expect a better WCET overestimation ratio for the first half of the benchmarks (i.e., matmult to lcdnum) than for the second half (i.e., minver to st).
To measure the WCET overestimation due to cache sharing, we compare the WCET results with two different design choices in which the level 2 cache is partitioned. For a two-core system, two different partitioning choices are explored: first, each partition has the same number of cache sets but half the number of ways compared to the original shared cache (called vertical partitioning); second, each partition has half the number of cache sets but the same number of ways compared to the original shared cache (called horizontal partitioning). In our default configuration, therefore, each core is assigned a 2-way associative, 2KB L2 cache in the vertical partitioning, whereas each core is assigned a 4-way associative, 2KB L2 cache in the horizontal partitioning.
Finally, to pinpoint the sources of WCET overestimation, we can selectively turn off the analysis of different microarchitectural components. We say that a microarchitectural component has a perfect setting if the analysis of that component is turned off (refer to the column "Perfect settings" in Table I).
Basic Analysis Result
Effect of Caches. Figure 6 shows the WCET overestimation ratio with respect to different L1 and L2 cache settings in the presence of a perfect branch predictor and a perfect shared bus. The results show that we can reasonably bound the WCET overestimation ratio except for a few benchmarks (e.g., qurt, nsichneu, lcdnum, select). The major source of this overestimation is the presence of many infeasible paths in such programs, which may lead to infeasible microarchitectural states and WCET overestimation. These infeasible paths can be eliminated by providing additional user constraints to our framework, thereby improving the ILP-based WCET calculation. We also observe that partitioned L2 caches may lead to a lower WCET overestimation than shared L2 caches, with vertical L2 cache partitioning almost always being the best choice. The positive effect of vertical cache partitioning is visible in programs such as adpcm, ndes, and edn, where the overestimation in the presence of shared L2 caches is higher than with partitioned L2 caches. This is due to the difficulty in modeling the intercore cache conflicts from the programs being run in parallel (i.e., jfdctint and statemate).
Effect of Speculative Execution. As explained in Section 7, the presence of a branch predictor and speculative execution may introduce additional computation cycles for executing a mispredicted path. Moreover, speculative execution may introduce additional cache conflicts from a mispredicted path. The results in Figures 7(a) and 7(b) show the effect of speculation on the L1 and L2 caches, respectively. qurt and ndes show reasonable increases in the WCET overestimation in the presence of speculation (Figures 7(a) and 7(b)). A similar increase in the WCET overestimation is also observed for bs and sqrt in the presence of L1 caches and speculation (Figure 7(a)). Such an increase in the overestimation ratio can be explained by the overestimation arising in the modeling of the effect of speculation on caches (refer to Section 7). Due to the abstract join operation that combines the cache states of the correct and mispredicted paths, we may introduce some spurious cache conflicts. Nevertheless, our approach for modeling the speculation effect on caches is scalable and produces tight WCET estimates for most of the benchmarks.

Effect of Shared Bus. Figure 8 shows the WCET overestimation in the presence of a shared cache and a shared bus. We observe that our shared bus analysis can reasonably control the overestimation due to the shared bus. Except for a few benchmarks (e.g., edn, nsichneu, ndes, qurt), the overestimation in the presence of a shared cache and a shared bus is mostly equal to the overestimation when the shared bus analysis is turned off (i.e., using a perfect shared bus). Recall that each overestimation ratio is computed by performing the analysis and the measurement on an identical system configuration. Therefore, the analysis and the measurement both include the shared bus delay only when the shared bus is enabled. For a perfect shared bus setting, both the analysis and the measurement consider a zero latency for all bus accesses. As a result, we also observe that our shared bus analysis might be more accurate than the analysis of other microarchitectural components (e.g., in the case of nsichneu, expint, and fir, where the WCET overestimation ratio in the presence of a shared bus might be less than that with a perfect shared bus). In particular, nsichneu shows a drastic fall in the WCET overestimation ratio when the shared bus analysis is enabled. For nsichneu, we found that the execution time is dominated by shared bus delay, which is most accurately computed by our analysis for this benchmark. On the other hand, we observed in Figure 6 that the main source of WCET overestimation in nsichneu is path analysis, due to the presence of many infeasible paths. Consequently, when shared bus analysis is turned off, the overestimation arising from path analysis dominates and we obtain a high WCET overestimation ratio. The average WCET overestimation in the presence of both a shared cache and a shared bus is around 50%.
WCET Analysis Sensitivity with Respect to Microarchitectural Parameters
In this section, we report the WCET overestimation sensitivity with respect to different microarchitectural parameters. For all the experiments (Figures 9-10), the reported WCET overestimation denotes the geometric mean of the ratio Estimated WCET / Observed WCET over all the benchmarks.
We evaluate our framework for different L1 and L2 cache sizes and configurations (Figure 9(a) and Figure 9(b), respectively). We observe that the average WCET overestimation is around 40% (50%) with respect to different L1 (L2) cache configurations. Figure 9(c) presents the WCET overestimation for different pipeline configurations. Superscalar pipelines increase the instruction-level parallelism and hence the performance of the entire program. However, it also becomes more difficult to model the inherent instruction-level parallelism in the presence of superscalar pipelines. Therefore, Figure 9(c) shows an increase in the WCET overestimation with superscalar pipelines. Finally, Figure 10 shows the WCET overestimation sensitivity with respect to the number of cores and different bus slot lengths. For the four-core experiments, we take groups of four programs (from left to right as shown in Figure 6) to run on four different cores. Figure 10 reports the geometric mean of the WCET overestimation over all the benchmarks. With a very high TDMA round length (i.e., the number of cores multiplied by the TDMA bus slot length), the WCET overestimation normally increases (as shown in Figure 10). This is due to the fact that with higher TDMA round lengths, the search space of possible bus contexts (i.e., sets of TDMA offsets) increases. As a result, it is less probable to expose the worst-case scenario in simulation with higher bus slot lengths.
Analysis Time
We have performed all the experiments on an 8-core, 2.83GHz Intel Xeon machine having 4 GB of RAM and running Fedora Core 4 operating system. Table II reports the maximum analysis time when the shared-bus analysis is disabled and Table III reports the maximum analysis time when all the analyses are enabled (i.e., cache, shared bus and pipeline). Recall from Section 4 that our WCET analysis framework is broadly composed of two different parts, namely, microarchitectural modeling and implicit path enumeration (IPET) through integer linear programming (ILP). The column labeled "μ arch" captures the time required for microarchitectural modeling. On the other hand, the column labeled "ILP" captures the time required for path analysis through IPET.
In the presence of speculative execution, the number of mispredicted branches is modeled by integer linear programming [Li et al. 2005]. Such an ILP-based branch predictor modeling therefore increases the number of constraints which need to be considered by the ILP solver. As a result, the ILP solving time increases in the presence of speculative execution (as evidenced by the second rows of both Tables II and III).
Shared bus analysis increases the microarchitectural modeling time (as evidenced by Table III), and the analysis time usually increases with the bus slot length. The cost of the shared-bus analysis mainly arises from tracking the bus contexts at different pipeline stages. A higher bus slot length usually leads to a higher number of bus contexts to analyze, thereby increasing the analysis time.
In Table II and Table III, we have only presented the analysis time for the longest-running benchmark (nsichneu) from our test suite. For any other program used in our experiments, the entire analysis (microarchitectural modeling and ILP solving) takes around 20-30 seconds on average to finish.
The results reported in Table II show that the ILP-based modeling of the branch predictor usually increases the analysis time. Therefore, for a more efficient but less precise analysis of branch predictors, one can explore different modeling techniques, such as abstract interpretation. The shared-bus analysis time can be reduced by using different offset abstractions, such as an interval instead of an offset set. Nevertheless, the appropriate choice of analysis method and abstraction depends on the precision-scalability tradeoff required by the user.
EXTENSION OF SHARED CACHE ANALYSIS
Our discussion on cache analysis has so far concentrated on the least-recently-used (LRU) cache replacement policy. However, another widely used cache replacement policy is first-in-first-out (FIFO). The FIFO cache replacement policy has been used in embedded processors such as the ARM9 and ARM11 [Reineke et al. 2007]. Recently, abstract-interpretation-based analyses of the FIFO replacement policy have been proposed in Grund and Reineke [2009, 2010a] for single-level caches and in Hardy and Puaut [2011] for multilevel caches. In this section, we shall discuss the extension of our shared cache analysis to the FIFO cache replacement policy. We shall also show that such an extension does not change the modeling of the timing interactions among the shared cache and the other basic microarchitectural components (e.g., pipeline and branch predictor).
Review of Cache Analysis for FIFO Replacement
We use the must cache analysis for FIFO replacement as proposed in the aforementioned literature. In FIFO replacement, when a cache set is full and the processor requests a fresh memory block (which maps to the same cache set), the cache line that entered the respective cache set first (i.e., first-in) is replaced. Therefore, the set of tags in a k-way FIFO abstract cache set (say A_s) can be arranged from last-in to first-in order (leftmost capturing the last-in position) as follows:

A_s = [T_1, T_2, . . . , T_k]
where each T_i ⊆ T and T is the set of all cache tags. Unlike LRU, the cache state never changes upon a cache hit with the FIFO replacement policy. Therefore, the cache state update on a memory reference depends on the hit-miss categorization of that memory reference. Assume that a memory reference belongs to cache tag tag_i. The FIFO abstract cache set A_s = [T_1, T_2, . . . , T_k] is updated on an access of tag_i as follows:

A_s' = [T_1, T_2, . . . , T_k]              if the access is a guaranteed cache hit
A_s' = [{tag_i}, T_1, . . . , T_{k-1}]      if the access is a guaranteed cache miss
A_s' = [∅, T_1, . . . , T_{k-1}]            otherwise
The first scenario captures a cache hit and the second scenario captures a cache miss.
The third scenario appears when the static analysis cannot accurately determine the hit-miss categorization of the memory reference.
The abstract join function for the FIFO must cache analysis is exactly the same as for the LRU must cache analysis. The join function between two abstract FIFO cache sets computes the intersection of the abstract cache sets. If a cache tag is available in both abstract cache sets, the rightmost relative position of the cache tag is kept after the join operation.
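The update and join just described can be sketched as follows. The NC case follows the reconstructed third scenario above, and the list-of-sets representation (index 0 being the last-in position) is ours.

def fifo_update(acs, tag, cls):
    """Must-analysis update of a FIFO abstract cache set.
    cls: 'AH' (hit), 'AM' (miss), or 'NC', as determined by the
    surrounding hit-miss analysis."""
    k = len(acs)
    if cls == 'AH':                                   # hits do not reorder
        return [set(ts) for ts in acs]
    if cls == 'AM':                                   # tag enters last-in
        return [{tag}] + [set(ts) for ts in acs[:k - 1]]
    return [set()] + [set(ts) for ts in acs[:k - 1]]  # NC: shift only

def fifo_join(a, b):
    """Must-join: keep tags present in both states, at their rightmost
    (i.e., closest-to-eviction) relative position."""
    k = len(a)
    joined = [set() for _ in range(k)]
    pos_a = {t: i for i, ts in enumerate(a) for t in ts}
    for i, ts in enumerate(b):
        for t in ts:
            if t in pos_a:
                joined[max(i, pos_a[t])].add(t)
    return joined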
We implement the must cache analysis for FIFO replacement as described in the preceding. To distinguish the cold cache misses in the first iterations of loops and different procedure calling contexts, our cache analysis employs the virtual inlining and virtual unrolling (VIVU) approach (as described in Theiling et al. [2000]). After analyzing the L1 cache, memory references are categorized as all-hit (AH), all-miss (AM), or unclassified (NC). AM and NC categorized memory references may access the L2 cache; therefore, the L2 cache state is updated for the memory references which are categorized AM or NC in the L1 cache (as in Hardy and Puaut [2011]).
To analyze the shared cache, we use our previous work on shared caches [Li et al. 2009] for the LRU cache replacement policy. Li et al. [2009] employs a separate shared cache conflict analysis phase. For the FIFO replacement policy, too, we can use exactly the same idea to analyze the set of intercore cache conflicts. Shared cache conflict analysis may change the categorization of a memory reference from all-hit (AH) to unclassified (NC). For the sake of illustration, assume a memory reference which accesses the memory block m. This analysis phase first computes the number of unique conflicting shared cache accesses from different cores. Then, it is checked whether the number of conflicts from different cores can potentially evict m from the shared cache. More precisely, for an N-way set-associative cache, the hit/miss categorization (CHMC) of the corresponding memory reference is changed from all-hit (AH) to unclassified (NC) if and only if the following condition holds:

|M_c(m)| > N - AGE_fifo(m)

where |M_c(m)| represents the number of conflicting memory blocks from different cores which may potentially access the same L2 cache set as m, and AGE_fifo(m) represents the relative position of memory block m in the FIFO abstract cache set in the absence of intercore cache conflicts. Recall that the memory blocks (or tags) are arranged according to the last-in to first-in order in the FIFO abstract cache set. Therefore, the term N - AGE_fifo(m) captures the maximum number of fresh memory blocks which can enter the FIFO cache set before m is evicted.
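The conflict check itself is then a one-liner. Here num_conflicts stands for |M_c(m)|, computed by the separate conflict analysis phase; the function and parameter names are ours.

def classify_shared_fifo(chmc_l2, age_fifo, n_ways, num_conflicts):
    """Shared cache conflict analysis (reconstructed condition): an AH
    reference is downgraded to NC when |M_c(m)| > N - AGE_fifo(m)."""
    if chmc_l2 == 'AH' and num_conflicts > n_ways - age_fifo:
        return 'NC'
    return chmc_l2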
Interaction of FIFO Cache with Pipeline and Branch Predictor
As described in the preceding, after the FIFO shared cache analysis, memory references are categorized as all-hit (AH), all-miss (AM), or unclassified (NC). In the presence of a pipeline, such a categorization of instruction memory references adds computation cycles to the instruction fetch (IF) stage. Therefore, we use Eq. (1) to compute the latency suffered by a cache hit/miss and propagate the latency through the different pipeline stages.
Recall from Section 7 that speculative execution may introduce additional cache conflicts. In Section 7, we proposed to modify the abstract-interpretation-based cache analysis to handle the effect of speculative execution on the cache. From Figure 5, we observe that our solution is independent of the cache replacement policy concerned. Our proposed modification performs an abstract join operation on the cache states along the correct and mispredicted paths (as shown in Figure 5). Therefore, for the FIFO replacement policy, the abstract join operation is performed according to the FIFO must analysis (instead of the LRU join operation we performed in the case of LRU caches).

Experimental Result

Figure 11 demonstrates our WCET analysis experience with the FIFO replacement policy. We have used exactly the same experimental setup as mentioned in Section 10. On average, our analysis framework can reasonably bound the WCET overestimation for FIFO cache replacement, except for fdct, lcdnum, select, and qurt. This overestimation is largely due to the presence of a FIFO cache and not due to the presence of cache sharing, as clearly evidenced by Figure 11(a). However, as mentioned in Berg [2006], the observed worst case for FIFO replacement may highly underapproximate the true worst case due to the domino effect.
Figure 11(b) shows that our modeling of the interaction between the FIFO cache and the branch predictor (which is configured as in Table I) does not significantly affect the WCET overestimation (except for select in the presence of vertically partitioned L2 caches). As evidenced by Figure 11(b), the average increase in the WCET overestimation due to speculation is minimal.
We also report the average WCET overestimation of the FIFO replacement policy compared to the LRU replacement policy. In Figure 11, the results labeled "FIFO/LRU" capture the geometric mean of the WCET overestimation in the presence of FIFO replacement, considering the WCET overestimation with LRU replacement as a baseline. Our results show that FIFO replacement does not lead to more than a 25% worse WCET estimate on average when compared to the WCET estimate with LRU replacement. Therefore, we believe that FIFO is a reasonably good alternative to LRU replacement even in the context of shared caches.
Other Cache Organizations
In the preceding, we have discussed the extension of our WCET analysis framework to the FIFO replacement policy. We have shown that, as long as the cache tags in an abstract cache set can be arranged according to the order of their replacement, our shared cache conflict analysis can be integrated. Moreover, our modeling of the timing interaction among the (shared) cache, pipeline, and branch predictor is independent of the underlying cache replacement policy. Nevertheless, for some cache replacement policies, arranging the cache tags according to the order of their replacement poses a challenge (e.g., PLRU [Grund and Reineke 2010b]). Cache analysis based on relative competitiveness [Reineke et al. 2007] tries to analyze a cache replacement policy with respect to an equivalent LRU cache, but with different parameters (e.g., associativity). Any cache replacement analysis based on relative competitiveness can directly be integrated with our WCET analysis framework. Nevertheless, more precise analyses than the ones based on relative competitiveness can be designed, as shown in Grund and Reineke [2010b] for the PLRU policy. However, we believe that designing more precise cache analyses is outside the scope of this article. The purpose of our work is to propose a unified WCET analysis framework, and any precision gain in existing cache analysis techniques will directly benefit our framework by improving the precision of WCET prediction.
In this article, we have focused on noninclusive cache hierarchies. In multicore architectures, an inclusive cache hierarchy may limit performance when the size of the largest cache is not significantly larger than the combined size of the smaller caches. Therefore, processor architects sometimes resort to noninclusive cache hierarchies [Zahran et al. 2007]. On the other hand, inclusive cache hierarchies greatly simplify the cache coherence protocol. The analysis of an inclusive cache hierarchy must account for the invalidation of certain cache lines to maintain the inclusion property (as shown in Hardy and Puaut [2011] for multilevel private cache hierarchies). Such invalidations may change an all-hit (AH) categorized memory reference to unclassified (NC) [Hardy and Puaut 2011]. Our shared cache conflict analysis phase can then be applied to this reduced set of AH categorized memory references, keeping the rest of our WCET analysis framework entirely unchanged. Therefore, we believe that inclusive cache hierarchies do not pose any additional challenge in the context of shared caches, and the analysis of such hierarchies can easily be integrated into our framework.
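The only adaptation needed is thus a demotion pass over the cache hit/miss classification before the shared cache conflict analysis runs. The following sketch assumes a simple map-based representation of the classification; the names are illustrative.

```python
def demote_for_inclusion(chmc, possibly_invalidated):
    """chmc: map from memory reference -> category ('AH', 'AM', 'NC').
    possibly_invalidated: references whose cache lines may be invalidated
    to maintain inclusion (as computed by an analysis in the style of
    Hardy and Puaut [2011]). AH references that may be invalidated are
    demoted to NC; everything else is untouched."""
    return {ref: ('NC' if cat == 'AH' and ref in possibly_invalidated
                  else cat)
            for ref, cat in chmc.items()}

# The shared cache conflict analysis is then applied, unchanged, to the
# (smaller) set of references still classified AH.
```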
DISCUSSION AND FUTURE WORK
In this article, we have proposed a unified WCET analysis framework which considers the timing effects of shared caches and shared buses together with several other microarchitectural components (e.g., pipeline and branch predictor). The key contribution of our work is a coherent strategy to evaluate the impact of different microarchitectural components on the WCET, in both single-core and multicore architectures. We have evaluated our framework, which models the pipeline, (shared) instruction caches, dynamic branch prediction, shared buses, and the nontrivial timing interactions among these components. We model a five-stage pipeline, similar to the five-stage pipeline in ARM-9 embedded processors. Dynamic branch predictors are also used in modern embedded processors, such as the ARM Cortex-A8, ARM-11, and PowerPC-750 [Maiza and Rochange 2011]. Shared caches are widely used in embedded multicore processors, such as the ARM Cortex-A8. For shared buses, we have currently modeled TDMA arbitration schemes. Existing literature [Wilhelm et al. 2009; Paolieri et al. 2009] argues that TDMA arbitration is acceptable when guaranteed performance is required (e.g., for hard real-time systems), rather than high average performance. A real implementation of such TDMA arbitration is the AEthereal network-on-chip (NoC) [Goossens and Hansson 2010]. In our modeling, the concept of a bus context and its ILP-based formulation is generic with respect to different bus arbitration policies. However, for the TDMA arbitration policy, a bus context can easily be defined from the TDMA offset set (cf. Definition 5.2), which in turn leads to a compositional and tractable shared bus analysis technique. For other, dynamic arbitration policies (e.g., priority based), the number of bus contexts may easily grow exponentially (since the interleavings of different threads on different cores need to be considered), making the whole WCET analysis process either intractable or highly imprecise in practice.
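The compositionality of TDMA stems from the fact that a request's bus delay depends only on its offset within the TDMA round, not on the behavior of other cores. The following sketch computes this delay for a single slot per round; the parameter names (slot start, slot length, round period) are illustrative and not tied to our exact formulation.

```python
def tdma_delay(o, d, start, slot_len, period):
    """Bus delay for a request of duration d issued at round offset o,
    given the core's slot [start, start + slot_len) in a round of
    length `period`. The request must fit entirely within the slot."""
    assert d <= slot_len, "request must fit into one slot"
    end = start + slot_len
    if start <= o and o + d <= end:
        return 0                      # fits in the current slot
    if o <= start:
        return start - o              # wait for the slot to begin
    return (period - o) + start       # wait for the slot in the next round

# Example: slot [0, 5) in a round of 20; a 2-cycle request at offset 7
# waits 13 cycles for the next round: tdma_delay(7, 2, 0, 5, 20) == 13.
```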
However, it is worth pointing out that our current implementation does not include the modeling of a few microarchitectural components, notably data caches and branch target buffers (BTBs). For an accurate WCET estimation, the modeling of such components is essential; it is a subject of our future implementation.
The modeling of data caches is usually more complicated than that of instruction caches. The difficulty arises from the fact that different instances of the same instruction may access different data memory blocks (e.g., array accesses inside a loop, pointer aliasing). Therefore, the modeling of data caches usually involves an address analysis phase (e.g., similar to the analysis proposed in Balakrishnan and Reps [2004]), whose output is an over-approximation of the set of addresses accessed by each load/store instruction. Using the results of address analysis, a modeling of data caches has been proposed in Sen and Srikant [2007]. This data cache modeling is a must analysis; therefore, each load/store instruction is classified as all-hit (AH) or unclassified (NC). The extension of this basic data cache modeling to multilevel data caches (as well as to unified caches) has been discussed in Chattopadhyay and Roychoudhury [2009]. Since the basic technique applied in such data cache modeling is abstract interpretation, it can easily be integrated into our framework (e.g., refer to Eq. (1) for integration with the pipeline and Figure 5 for integration with branch prediction) and does not pose any additional challenge. However, a recent approach [Huynh et al. 2011] has shown that data cache modeling based on address analysis (e.g., using Balakrishnan and Reps [2004]) may highly overestimate the WCET. To overcome the imprecision caused by address analysis, Huynh et al. [2011] compute the set of loop iterations in which a particular data memory block could be accessed. This strategy is useful for data accesses, since data memory blocks accessed in disjoint loop iterations can never conflict with each other in the data cache. In the future, we plan to extend our framework with such precise data cache modeling, as well as to handle shared data caches and cache coherence in the presence of shared data accesses.
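The impact of imprecise address information on a must analysis can be made concrete: when address analysis yields several possible blocks for one access, the resulting must state is the join over all possibilities, so no single block can be guaranteed cached afterwards. The sketch below assumes a simplified LRU must-cache domain (tag to age) purely for illustration.

```python
def must_update(state, possible_blocks, associativity):
    """LRU must-analysis update for an access whose address analysis
    yields `possible_blocks`. state: {tag: age}, smaller age = younger."""
    def update_one(s, b):
        age_b = s.get(b, associativity)   # age == associativity: not cached
        out = {}
        for tag, age in s.items():
            out[tag] = age + 1 if age < age_b else age  # age younger blocks
        out[b] = 0                         # accessed block becomes youngest
        return {t: a for t, a in out.items() if a < associativity}
    # Join over all possible accessed blocks: intersect tags, keep max age.
    states = [update_one(dict(state), b) for b in possible_blocks]
    joined = states[0]
    for s in states[1:]:
        joined = {t: max(joined[t], s[t]) for t in joined.keys() & s.keys()}
    return joined

# With possible_blocks = ['c', 'd'], neither c nor d survives the join:
# must_update({'a': 0, 'b': 1}, ['c', 'd'], 4) == {'a': 1, 'b': 2}
```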
Modern embedded processors also employ a branch target buffer (BTB) to cache the target address of a branch. Our current implementation does not include the modeling of BTBs. As shown in Grund et al. [2011], the modeling of BTBs can be accomplished via abstract interpretation. The BTB analysis proposed in Grund et al. [2011] is a combined must and may analysis. Given any branch instruction address, the analysis classifies the branch instruction as t (i.e., the branch instruction must be in the BTB), f (i.e., the branch instruction must not be in the BTB), or unknown (i.e., static analysis cannot determine the presence of the branch instruction in the BTB). Such a classification is analogous to the classification in the instruction cache analysis. Therefore, given an upper bound on the BTB miss penalty (say, BTB_miss), the classification can be integrated into our framework using a technique similar to Eq. (1). Moreover, the static analysis of BTB contents (as proposed in Grund et al. [2011]) can be used in our framework to determine the speculative instructions and their effects on caches (cf. Figure 5).
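A minimal sketch of this integration is given below: each branch contributes an interval of possible target-resolution latencies, and the unknown case keeps both outcomes (rather than assuming the miss is the worst case, consistent with our treatment of timing anomalies). The constant BTB_MISS is an assumed, hypothetical upper bound on the BTB miss penalty.

```python
BTB_MISS = 3  # hypothetical upper bound on the BTB miss penalty (cycles)

def btb_latency_interval(classification):
    """Map the BTB classification of a branch to an interval of possible
    latency contributions, analogous to Eq. (1) for instruction caches."""
    if classification == 't':        # must be in the BTB: no penalty
        return (0, 0)
    if classification == 'f':        # must not be in the BTB: full penalty
        return (BTB_MISS, BTB_MISS)
    return (0, BTB_MISS)             # unknown: keep both outcomes
```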
We believe that our framework can be useful for evaluating the precision and scalability of different analysis methodologies. In our current framework, the ILP-based branch prediction modeling and the shared bus modeling account for most of the analysis time. Therefore, exploring different techniques for modeling branch predictors and shared buses, with scalability in mind, will be an interesting direction for future work.
