Abstract-In the domain of real-time systems, the analysis of the timing behavior of programs is crucial for guaranteeing the schedulability and thus the safety of a system. Static analyses of the WCET (Worst-Case Execution Time) have proven to be a key element of timing analysis, as they provide safe upper bounds on a program's execution time. For single-core systems, industrial-strength WCET analyzers are already available, but up to now, only first proposals have been made to analyze the WCET in multicore systems, where the different cores may interfere during the access to shared resources. An important example is a shared bus which connects the cores to a shared main memory. The time to gain access to the shared bus may vary significantly, depending on the bus arbitration protocol used and on the access timings. In this paper, we propose a new technique for analyzing the duration of accesses to shared buses. We implemented a prototype tool which uses the new analysis and tested it on a set of real-world benchmarks. Results demonstrate that our analysis achieves the same precision as the best existing approach while drastically outperforming it in terms of analysis time.
I. INTRODUCTION
With the rising importance of multicore systems in the processor market, including the embedded and cyber-physical systems domain, there is a growing need for tools that verify the timing behavior of such systems, in particular the WCET. For embedded systems, this may be the most important metric, because they often must work under real-time conditions where a response must be delivered within a predefined time. Therefore, fine-grained WCET analyses have been developed for single-core systems in the last decade [1], resulting in a variety of commercially available tools. In contrast, for multicores only first proposals exist. One of the major difficulties in analyzing the WCET for multicore platforms is that programs running on different cores may interfere with each other, for example during accesses to a common shared bus which connects the cores to a shared main memory. A possible approach to resolve these interferences is to implement a Time Division Multiple Access (TDMA) bus arbitration protocol which assigns a fixed-length time slot to each core in round-robin fashion. In a scenario with $n_c$ cores, each having a TDMA time slot of length $s_l$ cycles, this leads to a maximum delay of $D_{max} = (n_c - 1) s_l + (z - 1)$ cycles for a bus access which occupies the bus for $z$ cycles. This maximum delay is encountered when the access request is issued $z - 1$ cycles before the end of the executing core's slot. The bus cannot be granted then, since the access would span into the slot of the next core. On the other hand, the bus access is granted instantly if the request is issued while the bus is assigned to the executing core for at least $z$ remaining cycles. $D_{max}$ is therefore a valid but, for most accesses, highly overestimated bound, and an important problem is to determine tighter bounds on the durations of bus accesses.
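To make the pessimistic bound concrete, the following minimal sketch evaluates the $D_{max}$ formula from the paragraph above. The function and parameter names are illustrative and not part of the original work; the check that an access fits into one slot reflects the no-split-transactions assumption of Section III.

```python
def max_tdma_delay(n_cores: int, slot_len: int, access_len: int) -> int:
    """Return D_max = (n_c - 1) * s_l + (z - 1), in cycles."""
    if n_cores < 1 or not 1 <= access_len <= slot_len:
        raise ValueError("need n_cores >= 1 and 1 <= access_len <= slot_len")
    return (n_cores - 1) * slot_len + (access_len - 1)


# Two cores, 80-cycle slots, an access that occupies the bus for 5 cycles.
print(max_tdma_delay(n_cores=2, slot_len=80, access_len=5))  # 84
```

In this configuration, a 5-cycle access may thus wait up to 84 cycles, which illustrates why assuming $D_{max}$ for every access is costly.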
In this paper, we present a new type of analysis which safely bounds the access time for TDMA-arbitrated resources with high precision and moderate analysis times thus enabling a tighter WCET estimation. The results can be used to avoid pessimistic hardware overdimensioning and to derive tighter system schedules.
The rest of this paper is organized as follows: Section II presents related work, Section III introduces the system model used in the analyses, and Sections IV and V introduce the overall analysis framework and the general analysis concepts, respectively. Section VI presents our new analyses, which are evaluated in Section VII. Finally, we provide a summary of our results and give directions for future work in Section VIII.
II. RELATED WORK
The first approaches to multicore WCET analysis only modeled the shared resources to some extent. Suhendra [2] and Zhang [3] analyzed the effects of a shared L2 cache without considering the interference on a shared bus that is used to access the shared cache. [3] provides a bound on the number of additional cache misses due to the intercore interference, whereas [2] eliminates the interference altogether by exploring different scenarios of locking and partitioning the shared cache.
Gustavsson [4] investigates a totally different approach, where the whole multicore system is modeled as a set of timed automata. The WCET is obtained by proving special predicates through model checking. This approach allows for a detailed system modeling, but does not scale very well as all system states have to be explored in the course of the WCET analysis, leading to a state explosion.
For analyses that include the shared bus, the choice of the bus arbitration method is crucial. Pitter [5] compared the predominant arbitration methods and found TDMA arbitration to be the most predictable one. Therefore, most of the works which include a bus analysis are restricted to TDMA bus arbitration. To provide a better access time estimation than the mentioned $D_{max}$ cycles, Andrei [6] tries to determine the precise time at which every single memory access takes place. The bus delay estimation is then performed separately for each access. The main problem is that an access in a loop with an iteration count of $i$ can potentially have $i$ different access times associated with the same memory access. Therefore, the analysis has to virtually unroll all loops to determine the access times for each access individually, which makes the analysis runtime dependent on the loop iteration counts.
Chattopadhyay [8] circumvents this costly unrolling by aligning each loop head execution to the first TDMA slot during the analysis. However, this artificial alignment of each loop iteration results in an additional penalty term to be added in the WCET estimation. Therefore, the analysis proposed in [8] is far more efficient but also less precise than [6]. The analysis which we propose in the following presents a compromise between the two approaches, being almost as precise as [6] and only slightly less efficient than [8].
Finally, Pellizzoni [7] derives the worst-case bus delays in a multicore system analytically with the help of memory traffic arrival curves. This approach is different from ours since we do not require such curves.
A different direction in static timing analysis is the adaption of multicore hardware to exhibit better predictability properties. Paolieri [9] proposed a multicore architecture in which the WCET of basic blocks is measurable, whereas Mische [10] developed a superscalar SMT processor, which provides built-in real-time capabilities. These approaches are orthogonal to ours since we focus on estimating the WCET of tasks on existing hardware platforms.
III. SYSTEM AND APPLICATION MODEL
We assume a system architecture where $n_c \geq 2$ cores are present in a single processor. Each of the cores has an in-order pipeline and a private L1 cache, and all the cores are connected to a shared TDMA-arbitrated memory bus with a uniform TDMA slot size of $s_l$ cycles per core. The bus is used to access a shared L2 cache, which itself is linked to the main memory. The bus, the L2 cache and the main memory may be located on-chip or off-chip. We do not allow split transactions on the bus. Therefore, $T_{max} \leq s_l$ must hold for the maximum duration $T_{max}$ of a bus transaction. An access to the TDMA bus may incur a variable delay, depending on when the access is performed, but the delay cannot exceed $D_{max}$ cycles. As explained in the introduction, this bound is not tight in general. Due to $T_{max} \leq s_l$ and $D_{max} \geq (n_c - 1) s_l$, the maximum bus delay will be at least $(n_c - 1)$ times as large as the maximum memory latency. Thus, the bus access delay is the factor with the greatest variability and also with the greatest potential for overestimations during WCET analysis. This underlines the need for precise analyses of the bus access delays. In this paper we will provide such an analysis using a fixed TDMA schedule. The optimization of the TDMA schedule itself is out of the scope of the paper.
All the caches in the considered system are non-inclusive and use the least-recently-used (LRU) replacement policy. The cache hierarchy can be easily extended e.g. with more private cache levels, because we apply the generic framework from [11] to determine which accesses from cache level i−1 hit cache level i. We only model instruction caches and thus assume that data accesses occur via a different bus and do not interfere with the instruction accesses in any other way. The integration of a data cache analysis into our analysis would remove these restrictions. We do not allow self-modifying code hereby removing the need to deal with cache-coherency in our model.
The input task dependencies are given as acyclic task graphs with a fixed mapping of tasks to cores. Each edge $(x, y)$ in the task graph denotes that task $y$ can start execution only after task $x$ has finished. We use fixed-priority, non-preemptive scheduling. For each loop $L$ in the tasks, the minimum and maximum loop iteration counts $B^{min}_L$ and $B^{max}_L$ are given, and the control flow graphs (CFGs) of the tasks are assumed to be well structured (reducible).
IV. ANALYSIS FRAMEWORK
We embed our new analyses into the CHRONOS timing analyzer framework from [8]. Figure 1 shows the analysis process. The framework first analyzes the cache behavior of each task in isolation and then computes the maximum possible cache interference in the shared L2 cache. This interference information is used to update the worst-case cache states of the individual tasks. The cache analysis assigns to each single access one of the following categories for each cache level: "Always Hit" (AH), "Always Miss" (AM), "First Miss" / "Persistent" (PS) or "Unknown Behavior" (UNKNOWN). PS means that the first execution of the instruction suffers a cache miss, but every following execution hits the cache, which is most useful for instructions inside of loops. For details on the cache analysis, the interested reader is referred to [8], since we are only using its results here. In the next analysis step, the cache information is used to compute BCET (Best-Case Execution Time) and WCET values per task. This module (marked in bold in Figure 1) has been equipped with our new analysis technique, whereas all other modules have not been modified. After the tasks' BCETs and WCETs have been computed, the overall system worst-case response time (WCRT) is determined. This process repeats as long as the task interference changes, e.g. due to altered task lifetimes. In the following, we will focus on the determination of single task WCETs with given cache states, as this is our main contribution. Nevertheless, all of our analyses are applicable to the computation of BCETs as well.
V. STATIC ANALYSIS OF TDMA OFFSETS
Our new analysis builds upon concepts which are heavily used in the analysis of other architectural features. To establish the link between those existing analyses and our new analysis, we first give a short overview of existing static analysis techniques. We also demonstrate why those techniques are not sufficient in our scenario.
A. Abstract Interpretation In Timing Analysis
A static timing analysis is usually composed of a microarchitectural analysis and a path analysis [1]. The microarchitectural analysis is responsible for determining abstract hardware states which describe the possible concrete hardware states at every basic block entry. This microarchitectural analysis is normally based on abstract interpretation, a technique for static program analysis which can provide safe approximations of program or, in this case, hardware states. In the past it was successfully employed to analyze cache, branch prediction and pipeline behavior. With these hardware states, a basic block WCET can be computed, which in turn can be fed into the path analysis to compute the longest path through the program. The abstract hardware states which are used in our analysis model the state of the shared TDMA bus, i.e. at which points in time the execution of a block may start. Since the TDMA schedule is cyclic, we can restrict ourselves to representing only offsets instead of absolute times. An offset $o$ in our case can be computed from an absolute time $t$ as $o = t \bmod (n_c s_l)$. To model the fact that a block can be entered with more than one offset, we devise two offset representations:
• Offset sets $S \subseteq \{0, \ldots, n_c s_l - 1\}$, which contain every possible offset explicitly.
• Offset intervals $I = [o_{min}, o_{max}]$, which represent all offsets between the bounds $o_{min}$ and $o_{max}$.
These offset representations are the abstract hardware states that will be used in the analyses. An example for the different representations can be found in Figure 2 . While Figure 2 (b) shows the offset set representation with the represented offsets marked in gray, Figure 2 (a) presents the same offset information, again marked in gray, for the offset interval representation. Obviously, the set representation is more precise, but it also requires greater effort to maintain the sets during the analysis, thus leading to a typical tradeoff between analysis precision and analysis duration.
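The tradeoff between the two representations can be illustrated with a small sketch. The helper names below are hypothetical and only mirror the definitions given in the text: an offset is an absolute time modulo the schedule length, and an interval covers every offset between its bounds.

```python
def to_offset(t, n_cores, slot_len):
    """Map an absolute time t to its offset in the cyclic TDMA schedule."""
    return t % (n_cores * slot_len)


def interval_of(offsets):
    """Smallest offset interval [o_min, o_max] that covers an offset set."""
    return (min(offsets), max(offsets))


def set_of(interval):
    """All offsets represented by an interval."""
    o_min, o_max = interval
    return set(range(o_min, o_max + 1))


# 2 cores with 80-cycle slots: the schedule repeats every 160 cycles.
offsets = {to_offset(t, 2, 80) for t in (10, 170, 310)}
print(sorted(offsets))                    # [10, 150]
print(len(set_of(interval_of(offsets))))  # 141
```

Here the set representation tracks exactly two offsets, whereas the covering interval stands for 141 offsets, most of which can never occur.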
In the following, we use a special definition of basic blocks. A basic block b = (i 1 , . . . , i k ) in our definition is a sequence of instructions which may only be entered at i 1 and only be exited at i k and which, in addition, must also either not contain any instruction which potentially accesses the shared bus, or the block contains only a single instruction. The information whether an instruction potentially accesses the shared bus can be extracted from the cache information. In our case it may access the bus when it may access the L2 cache. This splits up a standard basic block which contains l potential bus accesses into at most 2l + 1 basic blocks whose WCET is either fixed (no bus access) or variable (bus access). The basic blocks execute in-order, since we required an in-order pipeline. A generalization of our concepts to out-of-order execution is possible, but it is omitted due to size constraints. With this notion of basic blocks and the results from the other microarchitectural analyses which yield WCET values for the blocks without bus accesses, we can formulate the offset analysis as a data flow problem. The data flow analysis requires a function u which updates the computed state after the execution of a basic block b and a function m which merges the states at control flow joins in the control flow graph. Given the set ET b ⊆ N of possible execution times of b and either an offset set S b or an offset interval I b , we have
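$$u(S_b) = \begin{cases} \mathit{off}_{execute} & \text{if } b \text{ never accesses the bus} \\ \mathit{off}_{execute} \cup \mathit{off}_{access} & \text{if } b \text{ may access the bus} \\ \mathit{off}_{access} & \text{if } b \text{ always accesses the bus} \end{cases} \qquad (1)$$
with $set([o_{min}, o_{max}]) = \{o_{min}, \ldots, o_{max}\}$, where $\mathit{off}_{execute}$ and $\mathit{off}_{access}$ denote the offsets that are reached when $b$ executes without and with a bus access, respectively.
The $\Phi_p(o, d)$ function returns the time needed to finish the bus access (including the bus delay) when the bus request is issued by core $p \in \{0, \ldots, n_c - 1\}$, begins at offset $o \in \{0, \ldots, n_c s_l - 1\}$ and needs $d \in \{1, \ldots, T_{max}\}$ cycles to complete after the bus access was granted. In the TDMA arbitration we can define $\Phi_p(o, d)$ as: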
Note that ET b may for example model the fact that we have a block with variable-latency instructions or a block whose L2 instruction memory access was classified as UNKNOWN. The merge functions for the two offset representations are:
where $S_m = m(set(I_1), \ldots, set(I_j))$. With these functions, we can establish a standard data flow analysis on the interprocedural control flow graph of each task (with given starting offsets for the task start block) which terminates after all offset data has stabilized. Unfortunately, this analysis will not be very precise, because branches and loops in the control flow force us to repeatedly merge the offset information, which quickly leads to results where a block can be reached with arbitrary offsets. In this situation, we cannot provide a better estimation than the pessimistic assumption that each bus access is delayed by $D_{max}$ cycles. The imprecision that stems from branches can be reduced through the offset set representation, which allows tracking the offset development in more detail. Loops pose a bigger problem. They can only be handled effectively with the concept of contexts in the analysis.
Since the functions u and m are defined for both offset sets S and offset intervals I, we will formulate our analyses based on an abstract offset data structure O in the following which may be either an offset set or an offset interval. 
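The update and merge functions of this subsection can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: core $p$ is assumed to own the offsets $[p \cdot s_l, (p+1) \cdot s_l)$ of the schedule, an access is granted only if it completes within the requesting core's own slot (as described in Section I), and the same set of times serves as execution times and as access durations. All function names are hypothetical.

```python
def phi(p, o, d, n_cores, slot_len):
    """Cycles until an access of length d, requested by core p at offset o, completes."""
    period = n_cores * slot_len
    slot_start = p * slot_len            # assumption: slots are laid out in core order
    slot_end = slot_start + slot_len     # first offset after the slot of core p
    if slot_start <= o and o + d <= slot_end:
        return d                         # granted at once, fits into the own slot
    return (slot_start - o) % period + d  # wait for the next own slot, then access


def update(offsets, exec_times, access, p, n_cores, slot_len):
    """Offset set after a block; access is 'never', 'may' or 'always'."""
    period = n_cores * slot_len
    off_execute = {(o + t) % period for o in offsets for t in exec_times}
    off_access = {(o + phi(p, o, t, n_cores, slot_len)) % period
                  for o in offsets for t in exec_times}
    if access == "never":
        return off_execute
    if access == "always":
        return off_access
    return off_execute | off_access


def merge_sets(*offset_sets):
    """Merge at a control flow join: union of all incoming offset sets."""
    return set().union(*offset_sets)


def merge_intervals(*intervals):
    """Merge intervals via the set S_m built from all represented offsets."""
    s_m = merge_sets(*(range(lo, hi + 1) for lo, hi in intervals))
    return (min(s_m), max(s_m))


print(phi(0, 76, 5, 2, 80))                                 # 89
print(sorted(update({70, 78}, {5}, "always", 0, 2, 80)))    # [5, 75]
print(merge_intervals((10, 20), (150, 155)))                # (10, 155)
```

The first call reproduces the worst case from Section I: a 5-cycle access requested 4 cycles before the end of the slot waits $D_{max} = 84$ cycles and finishes after 89 cycles.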
B. Abstract hardware states and contexts
Usually, the hardware states presented in Section V-A are computed in a context-insensitive way, meaning that the abstract interpretation computes states which are valid for all execution contexts of a basic block, where an execution context denotes a certain loop iteration or calling context. This behavior is insufficient for some analyses like e.g. the cache analysis, where the first loop iteration may have a significantly different cache behavior than the following ones. For this purpose, analysis contexts were introduced, which describe the hardware states for a certain execution context. The known methods for dealing with contexts during bus access duration analysis are the following:
• The loop is virtually unrolled by a factor equal to its loop bound, and thus each loop iteration is explicitly analyzed [6]. This method, called full virtual unrolling, is very precise but also very inefficient for larger loop bounds. It results in the analysis of $B^{max}_L$ analysis contexts, each of which represents exactly one execution context.
• The analysis is performed for a fixed offset $o$, and a delay is added that represents the maximum additional delay that can occur due to execution with offsets $s \neq o$. This is the approach from [8], and we will refer to it as the fixed-alignment approach. It results in a single analysis context which represents all $B^{max}_L$ execution contexts.
In the next section, we will present a third, novel approach to context handling in bus access duration analysis, which will analyze $1 \leq x \leq B^{max}_L$ contexts to provide a compromise between analysis duration and analysis precision. Our approach is based on an analysis of TDMA offsets as presented above.
VI. COMPUTING LOOP OFFSET BOUNDS
Our approach is based upon the observation that for each loop iteration which starts from a given set of offsets, we can compute the set of offsets in which the iteration may terminate. Therefore, our goal is to track the development of the TDMA offsets of the loop header block and thus to provide more precise offset bounds than by using the data flow analysis from Section V. This requires:
• A structural analysis to find loops in the CFG, and to build a directed acyclic graph (DAG) from each loop or procedure body. Nested loops are represented as single nodes in the surrounding DAG. Due to this, we required our input tasks to be reducible in Section III.
• An analysis that computes the set of offsets that may be reached when a loop body is executed once with given starting offsets.
The overall analysis will then proceed in a hierarchical way, starting at the beginning of the task entry procedure and descending into called procedures or loops only when they are discovered in the CFG. The structural analysis is already present in the CHRONOS framework, whereas a single-iteration offset analysis is presented in Section VI-A. Section VI-B then introduces the core analysis which combines the single-iteration results into a complete loop WCET.
A. Determination of offset results for single iterations
As mentioned, we are interested in determining the offsets that can be reached after a single execution of the loop body finishes. This will be called a loop iteration in the following, in contrast to a loop execution which denotes the (possibly) repeated execution of the loop body until the loop condition is false. Figure 3 shows a scenario where a loop iteration, starting from a single, given offset may end at various different offsets, e.g. due to different paths through the loop. These single-iteration offset results can be determined by iterating over the loop's basic block DAG in topological order, as sketched in the following.
In our analysis of a single loop iteration, each basic block is seen as a transformation function which maps input offsets O in (either an offset set or an offset interval as explained in Section V-A) to resulting offsets O out and produces WCET values which are valid for the given O in . Algorithm 1 shows the analysis of single basic blocks. Function calls (lines 11 -14) or blocks which represent inner loops (line 2) are handled by specialized analysis functions. Note that function calls terminate basic blocks in our model. The WCET and offsets which result from bus accesses (lines 4 -8) or simple instructions (line 10) are computed with the known ET b values and Φ p and u functions from Section V-A, where p is the core which executes the currently analyzed task. Each DAG analysis, on either a procedure or a loop body, then composes the single-block results in topological order and forms its own WCET and offset result out of them. Algorithm 2 shows this for the case of a single loop iteration, where b sink and b header are the sink and header node of loop l, respectively and pred (b i ) returns the set of predecessor blocks for block b i . By supplying the starting offsets to the loop iteration analysis (lines 3 -4), this information becomes part of the analysis context, as explained in Section V-B. The iteration analysis then analyzes the behavior of each single block (lines 9 -11) and propagates the results to the successor blocks (lines 6 -7). Finally the results per loop iteration are summarized (line 13). The analysis of procedures in "AnalyzeProcedure" works analogously as "AnalyzeLoopIteration". This implies that recursive calls must be converted to standard loops before our analysis can handle them. For the analysis of complete loop executions (all iterations) in "AnalyzeLoop", we need to combine the context-sensitive single iteration results to form an overall loop WCET and offset result. This will be discussed in the next section.
Algorithm 1 AnalyzeBlock
    ...
    end for
 8: return wcet, u(O_in)
 9: else
10: ...
    if b is terminated by call to function f then
    ...
    end if
    ...
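The single-iteration analysis described in Section VI-A can be sketched as below. This is a simplified illustration and not a reconstruction of Algorithms 1 and 2: nested loops and function calls are left out, offsets are kept as sets, a bus-accessing block is assumed to always access the bus, and the slot layout of `phi` (core $p$ owns offsets $[p \cdot s_l, (p+1) \cdot s_l)$) is an assumption. All names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    name: str
    exec_times: set                             # ET_b in cycles (access durations for bus blocks)
    bus_access: bool = False                    # True: block is a single bus access
    preds: list = field(default_factory=list)   # names of predecessor blocks


def phi(p, o, d, n_cores, slot_len):
    """Completion time of an access of length d requested by core p at offset o."""
    start = p * slot_len
    if start <= o and o + d <= start + slot_len:
        return d
    return (start - o) % (n_cores * slot_len) + d


def analyze_block(b, o_in, p, n_cores, slot_len):
    """Return (block WCET for the given input offsets, resulting offsets)."""
    period = n_cores * slot_len
    if b.bus_access:
        times = {(o, phi(p, o, d, n_cores, slot_len)) for o in o_in for d in b.exec_times}
    else:
        times = {(o, t) for o in o_in for t in b.exec_times}
    return max(t for _, t in times), {(o + t) % period for o, t in times}


def analyze_loop_iteration(blocks, o_start, p, n_cores, slot_len):
    """blocks: loop body DAG in topological order, header first and sink last."""
    wcet_at, offs_at = {}, {}
    for b in blocks:
        if b.preds:  # merge the results of all predecessors
            o_in = set().union(*(offs_at[q] for q in b.preds))
            wcet_in = max(wcet_at[q] for q in b.preds)
        else:        # loop header: use the starting offsets of the context
            o_in, wcet_in = set(o_start), 0
        wcet_b, o_out = analyze_block(b, o_in, p, n_cores, slot_len)
        wcet_at[b.name] = wcet_in + wcet_b
        offs_at[b.name] = o_out
    sink = blocks[-1].name
    return wcet_at[sink], offs_at[sink]


body = [
    Block("header", {2}),
    Block("fetch", {5}, bus_access=True, preds=["header"]),
    Block("skip", {3}, preds=["header"]),
    Block("sink", {1}, preds=["fetch", "skip"]),
]
# WCET 19 cycles, end offsets 6 and 13
print(analyze_loop_iteration(body, {7}, p=0, n_cores=2, slot_len=10))
```

As in the scenario of Figure 3, the iteration starts from a single offset but may end at two different offsets, depending on the path taken through the loop body.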
B. Deriving full loop WCETs
To implement "AnalyzeLoop" for a given loop $l$ and starting offsets $O_{in,l}$, full unrolling could be performed by analyzing all iterations and supplying the offset results from one iteration as inputs to the next one. Alternatively, only a single iteration can be analyzed, with a forced alignment at the TDMA schedule border and an added alignment penalty as suggested in [8]. Section V-B already mentioned that our goal is to avoid these two approaches, because they are computationally too expensive or lose precision, respectively. In this section we devise two new methods which present a compromise between those two extremes.
Global Convergence Analysis:
Starting from $O_{in,l}$, single loop iterations are analyzed one after another until either the iteration count $j$ reaches the loop bound $B^{max}_l$ or the starting offsets of iteration $j$ no longer differ from those of the previous iteration. In the first case we have hit the loop bound and thus have performed full unrolling implicitly; this is the undesired case. In the second case we have reached a fixpoint of the starting offsets, and thus the result from iteration $j$ stays valid for all following iterations. In total, no more than $n_c s_l$ iterations have to be analyzed, which is the number of possible offset values. The final loop WCET is then computed from the WCETs of the analyzed iterations, where the WCET of iteration $j$ also accounts for every iteration after $j$.
The offset result for the loop is equal to the offset result from iteration $j$, because this result stays valid for all following iterations.
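The global convergence analysis can be sketched as follows. The sketch assumes that the starting offsets of the next iteration are obtained by merging the current starting offsets with the current result offsets, which matches the example with $O^j_{in} = \{x, y\}$ discussed for the graph tracking analysis, and that the loop WCET is the sum of the individually analyzed iterations plus the WCET of iteration $j$ for each remaining iteration. The callback `analyze_iteration` and the toy loop are hypothetical.

```python
def global_convergence(analyze_iteration, o_start, b_max):
    """analyze_iteration(offset_set) -> (wcet, result_offsets); returns (WCET, offsets, j)."""
    o_in = frozenset(o_start)
    wcets = []
    while True:
        wcet, o_out = analyze_iteration(o_in)
        wcets.append(wcet)
        merged = o_in | frozenset(o_out)      # assumption: accumulate starting offsets
        if len(wcets) == b_max or merged == o_in:
            break                              # loop bound hit or fixpoint reached
        o_in = merged
    j = len(wcets)
    # Iterations 1..j-1 use their own WCET, iterations j..b_max use wcet_j.
    loop_wcet = sum(wcets[:-1]) + (b_max - j + 1) * wcets[-1]
    return loop_wcet, set(o_out), j


# Toy loop: an iteration starting at offset 0 takes 6 cycles and ends at offset 2,
# an iteration starting at offset 2 takes 4 cycles and ends at offset 0.
STEP = {0: (6, {2}), 2: (4, {0})}


def toy_iteration(o_in):
    results = [STEP[o] for o in o_in]
    return max(w for w, _ in results), set().union(*(s for _, s in results))


print(global_convergence(toy_iteration, {0}, b_max=10))  # (60, {0, 2}, 2)
```

For this toy loop the analysis converges after two iterations, but it charges 6 cycles for every iteration, although the iterations actually alternate between 6 and 4 cycles.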
Graph Tracking Analysis:
The global convergence analysis is superior to static unrolling in that it implicitly unrolls the loops selectively, as long as new information can be obtained. However, it still relies on the idea of unrolling the first $j$ iterations and handling the rest of the iterations under a single analysis context.
The drawback is that cyclic progressions of offsets cannot be captured by the analysis. Consider e.g. a loop in which all even iterations start with offset x and all odd iterations start with offset y ≥ x, because only the even iterations have to wait for the TDMA bus access, whereas the odd iterations can then proceed with direct bus access. The global convergence analysis will analyze the first two iterations (j = 2), compute O j in = {x, y} and use this offset information for all following iterations. This is clearly valid, but still imprecise. The example shows the need to handle cyclic contexts which do not distinguish the first j execution contexts from the remaining ones, but which distinguish groups of execution contexts which repeat cyclically. In our case, a cyclic context consists of all iterations starting with offset o, which leads to s l n c contexts. Thus, we can identify a cyclic context via the offset which it represents.
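To obtain the final timing results using cyclic contexts, we construct a weighted, directed graph from the contexts and compute the loop WCET by solving a flow problem on that graph. This graph $G = (V, E, c)$, also called offset graph in the following, has $V = \{v^+, v^-\} \cup V_{off}$, where $v^+$ is the start node, $v^-$ is the sink node and $V_{off}$ contains one node per cyclic context, i.e. per offset.
We have $E = E_{enter} \cup E_{transition} \cup E_{exit}$, where $E_{enter}$ contains the edges from $v^+$ into the offset nodes, $E_{exit}$ contains the edges from the offset nodes to $v^-$, and $E_{transition}$ contains the edges between offset nodes.
For all edges $e \in (E_{enter} \cup E_{exit})$ we set the weight $c(e)$ to 0. $E_{transition}$ is then constructed by iteratively analyzing single iterations. For each iteration $i$, we compute $wcet_i$ and $O_{out}$ and add an edge $e$ from the node of the iteration's starting offset to the node of each offset in $O_{out}$, with $c(e) = wcet_i$. If any of these edges already exists, we update its weight by setting it to $\max(c(e), wcet_i)$. We stop the iteration analyses when we reach an iteration where no edge is added or updated. An example for such a graph is given in Figure 4, where the first iteration starts with offset $s$ and the succeeding iterations alternate between starting offset $x$ and $y$, as sketched in the example at the beginning of this section.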
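A minimal sketch of the offset graph construction is given below. It assumes that every cyclic context is identified by a single starting offset, that a worklist over newly discovered offsets is an acceptable way to iterate until no edge is added or updated, and that an exit edge exists for every discovered offset node. The callback and the node labels "v+" and "v-" are hypothetical.

```python
def build_offset_graph(analyze_iteration, o_start):
    """analyze_iteration(o) -> (wcet, end_offsets) for one iteration starting at offset o."""
    transitions = {}                 # (o, o') -> c(e), the maximum WCET seen for this edge
    seen = set(o_start)
    worklist = list(o_start)
    while worklist:                  # stops once no new offset node is discovered
        o = worklist.pop()
        wcet, o_out = analyze_iteration(o)
        for o_next in o_out:
            edge = (o, o_next)
            transitions[edge] = max(transitions.get(edge, 0), wcet)
            if o_next not in seen:
                seen.add(o_next)
                worklist.append(o_next)
    enter = {("v+", o) for o in o_start}   # weight 0
    exits = {(o, "v-") for o in seen}      # weight 0
    return transitions, enter, exits


# Toy loop in the spirit of Figure 4: start offset s = 1, then offsets 0 and 2 alternate.
STEP = {1: (5, {0}), 0: (6, {2}), 2: (4, {0})}
transitions, enter, exits = build_offset_graph(lambda o: STEP[o], {1})
print(transitions[(1, 0)], transitions[(0, 2)], transitions[(2, 0)])  # 5 6 4
```

Each offset is analyzed only once here, so at most $s_l n_c$ single-iteration analyses are needed regardless of the loop bound.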
The offset graph can then be used to obtain the final loop WCET by solving a dynamic flow problem [12]. In contrast to standard flow problems, dynamic flow problems have an explicit notion of time built into the problem formulation. Based on the offset graph we can derive two different dynamic flow problems: one for determining the WCET and one for the resulting offsets. The basis of the problem formulation is a flow function $x : E \times T \rightarrow \mathbb{N}$, which specifies for each edge $e = (u, v)$ the amount of flow $x(e, t)$ which leaves $u$ at the discrete time instant $t$. This flow arrives at $v$ at time $t + \tau(e)$, where $\tau(e)$ is the constant runtime of the edge. Conceptually, in our graph, a single time step of the flow problem corresponds to a single iteration of the loop, which implies $T = \{0, \ldots, B^{max}_l\}$. Thus a flow of $x(e, t) = 1$ through an edge $e = (v, w) \in E_{transition}$ represents the loop iteration $t$ which starts at offset $v$, ends at offset $w$ and has a maximum runtime of $c(e)$. Therefore we set $\tau(e) = 1$ for all $e \in E_{transition}$, since these edges model single loop iterations, and we set $\tau(e) = 0$ for all $e \in E_{enter} \cup E_{exit}$, modeling entry into and exit from the loop. Both dynamic flow problems share a common constraint that ensures that all flow which enters a node at a time step must leave it in the same step (i.e. there must be one loop iteration per time step):
$$\forall t \in T : \forall v \in V_{off} : \sum_{e \in \delta^-(v)} x(e, t - \tau(e)) = \sum_{e \in \delta^+(v)} x(e, t)$$
Here, $\delta^-(v)$ and $\delta^+(v)$ denote the sets of incoming and outgoing edges at node $v \in V$. For the start node $v^+$ and the sink node $v^-$ we need to provide explicit bounds on the flow. We want $F$ units of flow to leave $v^+$ at time 0 and to arrive at $v^-$ at time $B^{max}_l$ (i.e. we can model $F$ full loop executions in a single flow problem). Therefore we have:
$$\sum_{e \in \delta^+(v^+)} x(e, 0) = F \qquad (8)$$
$$\sum_{e \in \delta^-(v^-)} x(e, B^{max}_l) = F \qquad (10)$$
For the WCET analysis we only model the single worst-case loop execution scenario by setting $F = 1$ and by maximizing the objective function
$$\sum_{t \in T} \sum_{e \in E} c(e) \cdot x(e, t)$$
The loop WCET is then given by the value of the objective function.
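For a single flow unit, the WCET flow problem selects one walk through the offset graph that uses exactly $B^{max}_l$ transition edges and has maximum total weight. The sketch below computes this value with dynamic programming instead of an ILP solver. This is a swapped-in technique for illustration, not the CPLEX-based formulation used in the experiments, and it assumes that every reachable offset node has at least one outgoing transition edge. The function name and the edge dictionary layout are hypothetical.

```python
def loop_wcet(transitions, o_start, b_max):
    """Maximum weight of a walk with exactly b_max transition edges.

    transitions: dict mapping (o, o_next) to the edge weight c(e)
    o_start:     offsets with which the loop may be entered
    """
    best = {o: 0 for o in o_start}        # best[o]: worst-case time to reach offset o
    for _ in range(b_max):                # one time step per loop iteration
        nxt = {}
        for (u, v), c in transitions.items():
            if u in best:
                nxt[v] = max(nxt.get(v, 0), best[u] + c)
        best = nxt
    return max(best.values())


# Offset graph of a loop entered at offset 1 whose iterations then alternate
# between offsets 0 and 2.
graph = {(1, 0): 5, (0, 2): 6, (2, 0): 4}
print(loop_wcet(graph, {1}, b_max=5))                       # 25
print(loop_wcet({(0, 2): 6, (2, 0): 4}, {0}, b_max=10))     # 50
```

The second call shows the benefit of cyclic contexts: the alternating 6-cycle and 4-cycle iterations yield 50 cycles, whereas charging the larger value for every iteration would yield 60 cycles.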
For the offset analysis, we use F = s l n c flow units which must arrive at the sink between time step B min L and B max L . We therefore need different sink flow constraints which replace Equations 10 and 11:
with
The flow of each of the flow units through the graph models a possible loop execution scenario. If K is the (unknown) set of offsets with which the loop can be left, then we have |K| ≤ s l n c since this is the total number of possible offsets. With F = s l n c flow units we can thus model at least one loop execution scenario which terminates with offset k for each offset k ∈ K. Therefore we can compute an overapproximation of K by maximizing the objective function
The offsets O out,l which result after the loop execution are then given as the elements of the set from Equation 16 with K ⊆ O out,l . A formal proof of correctness is omitted due to space constraints but can be found in [18] . Using either the global convergence or the graph tracking analysis, the analysis of tasks as a whole now only requires the offset information at the entry point of the task, which is provided by the overall analysis framework through the known processor mapping and task dependencies. All internal offset information, and with this, the WCET of the task, can then be computed through the presented framework.
C. Offset analysis in architectures without timing anomalies
Timing anomalies are a phenomenon which complicates WCET analysis. According to the definition from [14], a system shows timing anomalies whenever local worst-case behavior does not necessarily lead to global worst-case behavior, for example whenever a cache hit instead of a cache miss triggers the global worst-case behavior. This may be the case e.g. on systems with instruction prefetching and speculative execution [13]. In the static analysis of systems with timing anomalies it is not feasible to prune the search space of the analysis [14]. Therefore, in a cache analysis for a system exposing timing anomalies we may not assume an UNKNOWN access to be a cache miss (AM), but must instead consider both possibilities, a hit and a miss, in the analysis. On systems without timing anomalies we can safely assume the local worst case (AM) to increase the analysis performance and precision.
In our offset analysis we did not prune the search space (the set of reachable offsets) at any point up to now. To increase the analysis precision for timing-anomaly-free architectures we can thus reduce the offset result of any merge or update operation to the offset o which is reached by the local worst-case path. Therefore, the differentiation between offset sets and offset intervals is of no importance for the analysis any longer, because we are only tracking single offsets after this reduction. The graph-based analysis is then ideally suited to track the development of the worst-case offsets inside of loops using the known ILPs from Section VI-B to compute the total loop WCET. This reduction to the local worst-case makes the analysis highly precise, because the main source of imprecision, the divergence of offset information, is eliminated.
D. Extensions for further micro-architectural analyses
In an analysis that includes the analysis of more microarchitectural features like pipeline and branch prediction, the computed overapproximations of the hardware states must become part of the analysis context, in addition to the offset information. For the global convergence analysis, this means that a global overapproximation of the hardware states at the loop header is built and used in the analyses. For the offset graph, every context node must be annotated with an overapproximation of the hardware states with which the node may be entered, including cache, pipeline and branch prediction states. In such a scenario, the graph must be iteratively refined until
1) no more edges are added or updated, and
2) the hardware states on all nodes have converged.
Alternatively, it is also possible to construct only a single, global overapproximation of the hardware states, depending on which degree of precision is required.
VII. EXPERIMENTAL RESULTS
In the following, the different approaches to bus-aware WCET analysis are compared. As mentioned, we have implemented our approaches based upon the code from [8], which enables a precise comparison. The prototype tool analyzes executables compiled for the SIMPLESCALAR platform and includes a thorough cache analysis. Unfortunately, no pipeline or branch prediction analysis is integrated yet, so all instruction latencies are set to 1 cycle. Section VI-D nevertheless introduced the general concept of how to perform such an integration. It can be expected that the classification of the approaches with respect to precision and analysis time stays the same even after additional microarchitectural analyses are integrated, since the number of analysis contexts depends directly on the analysis type, as explained in Section V-B. The number of contexts in turn has the biggest influence on the analysis precision and duration. All experiments were run on an Intel Xeon 2.13GHz machine with 4GB of main memory under Debian Linux. The dynamic flow problems arising in the graph-tracking analysis were solved with the CPLEX ILP solver.
The experiments were performed on a subset of the MRTC test bench [15] in which the tasks are independent of one another. We therefore map each MRTC test case i ∈ [0, 23] from Table I to core (i mod n_c) with priority i, where 0 is the highest priority. We also tested the presented algorithms with the publicly available PapaBench [16] and Debie [17] benchmarks, which are an unmanned aerial vehicle control software and a space debris monitoring software, respectively. The mapping of tasks to cores was done manually for these two benchmarks. The default system configuration is a 2-core system with 1KB L1 cache (direct-mapped, block size 32 byte) and 2KB L2 cache (4-way associative, block size 64 byte). Only for Debie, the cache configuration was changed to 2KB L1 cache (2-way associative) and 8KB L2 cache to account for the bigger program sizes of Debie. In any case, the L1 hit penalty is 0 cycles, the L2 hit penalty is 1 cycle and the main memory access time is 5 cycles, modeling a Flash-based main memory. The default TDMA schedule assigns a slot of 80 cycles to each core. A more detailed overview of the used benchmarks is provided in Table I, including the byte size s_byte of the "text" section of the executable (excluding startup code), the lines of code LOC (excluding comments and empty lines), the number of loops L, the maximum loop nesting level D and the average loop bound ∅B. The Debie and PapaBench benchmarks consist of 6 and 13 tasks, respectively, which have a relatively simple structure, especially since they have almost no nested loops.
[Table I: Benchmark properties. Recoverable row: PapaBench 4663 200256 10 0 3]
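For the default configuration, the task mapping and the maximum bus delay D_max from Section I can be worked through in a short Python sketch. It assumes that a main memory access occupies the bus for z = 5 cycles and that core k owns the k-th slot of each TDMA round; both assumptions, like the function names, are illustrative and not taken from the prototype tool.

```python
from typing import Dict, Tuple


def map_tasks(n_tasks: int, n_cores: int) -> Dict[int, Tuple[int, int]]:
    """Map test case i to (core, priority) = (i mod n_cores, i); 0 is the highest priority."""
    return {i: (i % n_cores, i) for i in range(n_tasks)}


def max_tdma_delay(n_cores: int, slot_len: int, duration: int) -> int:
    """D_max = (n_c - 1) * s_l + (z - 1), as stated in Section I."""
    return (n_cores - 1) * slot_len + (duration - 1)


def tdma_wait(offset: int, core: int, n_cores: int, slot_len: int, duration: int) -> int:
    """Waiting time of a request issued at `offset` (assumed slot layout, see lead-in)."""
    assert 0 < duration <= slot_len
    period = n_cores * slot_len
    start = core * slot_len
    offset %= period
    if start <= offset and offset + duration <= start + slot_len:
        return 0
    return (start - offset) % period


N_CORES, SLOT_LEN, BUS_CYCLES = 2, 80, 5  # default configuration, z = 5 assumed

mapping = map_tasks(24, N_CORES)
d_max = max_tdma_delay(N_CORES, SLOT_LEN, BUS_CYCLES)
# Cross-check the closed formula against all offsets of one TDMA round.
observed = max(tdma_wait(o, 0, N_CORES, SLOT_LEN, BUS_CYCLES)
               for o in range(N_CORES * SLOT_LEN))
print(mapping[5], d_max, observed)
```

Under these assumptions the worst case is reached for a request issued z - 1 = 4 cycles before the end of the core's own slot, matching the description in Section I.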
A. Precision gain
In this section, we distinguish between the approaches that assume no timing anomalies on the target hardware and those that do not make such an assumption. The fully unrolling and fixed-alignment analyses that are built into CHRONOS do make this assumption. Therefore, strictly speaking, only the comparison to our approaches with the extension from Section VI-C is feasible. In Figure 5(a), we have listed the WCET results for the different approaches on the MRTC test bench subset with the mentioned default machine configuration. In Figure 5 as well as in the following, all WCET results are relative to the WCET result of the fully unrolling analysis, which does not consider timing anomalies. We use the following shorthands for the different approaches: U- denotes the fully unrolling analysis and F- the fixed-alignment analysis; OC and OT denote our offset analyses with global convergence and with graph-tracking, respectively, where the suffix + marks the general variant that accounts for timing anomalies and the suffix - marks the variant with the extension from Section VI-C; W denotes the pessimistic assumption that all bus accesses incur the maximum delay. The results for OC- are not displayed here, because the graph-tracking is the most suitable method for the case without timing anomalies. As can be seen in Figure 5(a), OT- almost always (except for minver) reaches the same precision as U- (100% = U-). It also outperforms F-, which does not analyze cyclic contexts (compare Section VI-B), but instead analyzes all the loop iterations with a fixed alignment and finally adds a penalty term to the result which accounts for the ignored actual alignment of the loop iterations. This leads to imprecision because the actual blocking time due to bus accesses may be much lower than the blocking time for the fixed-alignment situation plus the penalty. In contrast to OT-, our general analyses OC+ and OT+ are less precise, which was expected, but they still outperform F- on benchmarks with deeply nested loops or loops with high loop bounds, for example mergesort, edn, ludcmp or select. On benchmarks which have a flat structure with many branches, like statemate, OC+ and OT+ are outperformed by F-, because they lose track of the offsets and must revert to worst-case assumptions.
Nevertheless, even in those cases, they are still much more precise than the pessimistic assumption (W) that all bus accesses incur maximum delay, which results in an average WCET ratio of 414%. A surprising result is that OT+ is worse than OC+ on average for the MRTC test bench subset. This is possible because the global convergence analysis implicitly unrolls the first iterations as discussed in Section VI-B, whereas the graph-tracking analysis summarizes the iteration behavior in the offset graph. Therefore, once the offset information becomes highly imprecise, the graph will be imprecise for all iterations, whereas the global convergence analysis may achieve a better precision during its implicit unrolling. For loops with few iterations, this can have a strong impact on the precision of the WCET estimations. The graph-tracking only shows its strength on the rather sparse graphs of OT-.
On Debie and PapaBench, F- performs much better than on the MRTC test bench, because there are almost no nested loops and the loop bounds are rather small. Nevertheless, F- is still outperformed by OT-, and OT+ also performs consistently better than OC+, which emphasizes its applicability to real-world programs.
All presented results of the offset analyses use the offset interval representation from Section V-A. Using the offset set representation, the WCET estimation is further reduced by a maximum of 89% for bsort100 (avg. 1.3%) when combined with graph-tracking, or by a maximum of 0.3% for bs (avg. 0.0%) when combined with the global convergence analysis. This underlines the suitability of the combination of offset sets with the graph-tracking analysis.
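Why an offset set can yield tighter bounds than an offset interval may be seen from a small Python sketch. The exact representations are defined in Section V-A; here it is simply assumed that a set stores the reachable offsets explicitly, that an interval stores only their non-wrapping hull [min, max], and that core k owns the k-th TDMA slot. All names are hypothetical.

```python
from typing import Iterable, Set, Tuple


def tdma_wait(offset: int, core: int, n_cores: int, slot_len: int, duration: int) -> int:
    """Waiting time of a request issued at `offset` (assumed slot layout, see lead-in)."""
    assert 0 < duration <= slot_len
    period = n_cores * slot_len
    start = core * slot_len
    offset %= period
    if start <= offset and offset + duration <= start + slot_len:
        return 0
    return (start - offset) % period


def hull(offsets: Set[int]) -> Tuple[int, int]:
    """Interval representation: smallest non-wrapping interval covering the set."""
    return min(offsets), max(offsets)


def worst_wait(offsets: Iterable[int], core: int, n_cores: int,
               slot_len: int, duration: int) -> int:
    """Largest waiting time over all offsets the representation admits."""
    return max(tdma_wait(o, core, n_cores, slot_len, duration) for o in offsets)


reachable = {10, 150}  # two offsets merged at a control-flow join
lo, hi = hull(reachable)

wait_set = worst_wait(reachable, 0, 2, 80, 5)
wait_interval = worst_wait(range(lo, hi + 1), 0, 2, 80, 5)
print(wait_set, wait_interval)
```

In this example the interval also admits offsets such as 76 that are never reached, so the interval-based bound degrades to the maximum delay while the set-based bound stays small.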
To evaluate the impact of different TDMA slot sizes or processor configurations on the precision of the WCET estimations, the analyses were performed for a varied number of cores n_c (with manually adapted task mapping) and varied TDMA slot lengths s_l. The average WCET results of these experiments are shown in Figure 5; each configuration is described as a tuple (n_c, s_l). The experiment shows that OT- is able to compute results which are almost equal to those of U-, whereas the other analyses suffer from the increased maximum bus delay, F- even more so than OC+ and OT+.
[Table II: Analysis time comparison]
Table II summarizes the analysis duration in seconds for the WCET analyses that generated Figure 5(a). Here it becomes visible that all analyses are much faster in total than U-, which takes 53.8 minutes. OT- requires only 7.7% of that time and delivers WCET results which deviate by less than 1% from those of U-. Therefore, OT- is the best choice when high analysis precision with moderate runtimes is required. For applications where an extremely short analysis time is required, F- can be better suited. It delivers results with 79% overestimation compared to U- in only 0.4% of the analysis time of OT-.
B. Analysis time
The unrolling is quick for benchmarks with few loops and low loop bounds, such as PapaBench. Since its analysis time depends directly on the loop structures and the loop bound values, it performs much worse for Debie and MRTC, where nested loops and higher loop bounds are found (see Table I). This indicates that the unrolling is unsuitable for bigger real-world applications.
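The reported analysis time ratios for the MRTC subset can be turned into approximate absolute durations with a few lines of Python. The inputs are the rounded figures quoted in the text (53.8 minutes for U-, 7.7% of that for OT-, 0.4% of the OT- time for F-), so the derived values are estimates and the variable names are chosen here for illustration.

```python
# Figures quoted in the text (already rounded there).
u_minutes = 53.8          # total analysis time of U-
ot_fraction_of_u = 0.077  # OT- needs 7.7% of the U- time
f_fraction_of_ot = 0.004  # F- needs 0.4% of the OT- time

u_seconds = u_minutes * 60
ot_seconds = u_seconds * ot_fraction_of_u
f_seconds = ot_seconds * f_fraction_of_ot
ot_speedup = 1 / ot_fraction_of_u

print(f"U-:  {u_seconds:.0f} s")
print(f"OT-: {ot_seconds:.0f} s (speedup about {ot_speedup:.1f}x over U-)")
print(f"F-:  {f_seconds:.1f} s")
```

The speedup derived from the rounded 7.7% agrees with the factor reported in the conclusions up to rounding.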
VIII. CONCLUSIONS
We have presented a new approach to the WCET analysis of TDMA-arbitrated shared resources and applied it to a multicore system with a shared bus. Our new analysis type is based on a static analysis of the TDMA offsets with which basic blocks may be entered and uses the key concept of cyclic contexts to improve the analysis precision. Concerning precision and analysis time, our solutions provide a good compromise between the fastest and the most precise approaches. The best variant (OT-) reduces the WCET overestimation by 79% compared to the quickest preexisting approach (F-) and achieves a speedup of 12.9 compared to the most precise preexisting approach (U-). Possible improvements to our methods are 
