Abstract
Introduction
The knowledge of the worst-case execution time (WCET) of software components is a prerequisite for ensuring the timeliness of a real-time system. Since the end of the 1980s significant effort has been spent on research towards the development of WCET analysis tools.
The research leading to these results has received funding from the Austrian Science Fund (Fonds zur Förderung der wissenschaftlichen Forschung) within the research project "Compiler-Support for Timing Analysis" (COSTA) under contract P18925-N13.
The two main tasks of WCET analysis tools are the control-flow analysis (also called path analysis [1] or high-level analysis) that determines the (in)feasible paths in a program and the processor-behavior analysis (also known as hardware modeling [1] or low-level analysis) that assesses instruction timing [2].
Within this paper we discuss an open problem of processor-behavior analysis, namely the occurrence of so-called timing anomalies [3], [4], [5]. Timing anomalies are a challenge for WCET analysis, because they violate the continuity properties "proportionality" and "monotony" of program execution time.
Timing anomalies so far have only been discussed in the context of composing instruction sequences. We motivate in Section 3 that they are also a challenge for multiple phases of processor-behavior analysis, which we call parallel composition. In Section 4 we define parallel timing anomalies as timing effects due to changes of the initial state and show that they can occur in practice. In Section 5 we discuss two fundamental techniques of parallel composition and prove as an impossibility result that in case of arbitrary forms of parallel timing anomalies parallel composition does not provide safe WCET bounds. But we also prove that in case of restricted forms of parallel timing anomalies it is still possible to obtain safe WCET bounds. In Section 6 we discuss practical issues of timing anomalies on WCET analysis and discuss methods of how to avoid anomalous behavior.
Worst-Case Execution Time Analysis
WCET analysis is about finding the longest feasible path through a program, where length means execution time [1], [2]. For example, the implicit path-enumeration technique allows arbitrary linear flow constraints to be considered [6], [7]. One or more program analysis phases precede the longest path search to calculate the instruction timing [8], [9]. On modern processors with peak-performance improving features like caches or pipelines, the WCET of an instruction sequence I depends on the set of initial computer states that potentially reach the beginning of instruction sequence I. Since a precise notion of computer state is important for WCET analysis, we introduce the timing-relevant dynamic computer state (TRDCS) for a program scope S. As shown in Figure 1, the TRDCS includes only those parts of the timing-relevant computer state that are changed within a program scope S. Those parts of the timing-relevant computer state that are changed only outside of S are called computer configuration and are not part of the TRDCS.
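As a minimal sketch of the longest-path view of WCET analysis, the following Python snippet computes the longest path through a small acyclic control-flow graph with fixed per-block costs. The graph, the block names, and the cycle counts are hypothetical. The sketch ignores infeasible paths, loops, and the dependence of instruction timing on the initial state, and it is not the implicit path-enumeration technique of [6], [7].

```python
from functools import lru_cache

# Hypothetical acyclic control-flow graph: block -> successor blocks.
CFG = {"entry": ["then", "else"], "then": ["exit"], "else": ["exit"], "exit": []}
# Hypothetical, state-independent execution time of each block in cycles.
COST = {"entry": 2, "then": 5, "else": 9, "exit": 1}


def longest_path(cfg, cost, entry):
    """Return the largest summed cost over all paths starting at `entry`."""

    @lru_cache(maxsize=None)
    def longest_from(block):
        # Longest continuation among all successors (0 for a block without successors).
        tail = max((longest_from(succ) for succ in cfg[block]), default=0)
        return cost[block] + tail

    return longest_from(entry)


print(longest_path(CFG, COST, "entry"))  # 12, via entry -> else -> exit
```

On real processors the per-block cost is not a constant but depends on the TRDCS at block entry, which is what makes the processor-behavior analysis discussed in the rest of the paper necessary.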
Notation
The following sections discuss several formal properties on WCET analysis. To keep the definition of these properties short and intuitive we use the following notation:
T (I, s) . . . the execution time of an instruction sequence 
Parallel Decomposition
WCET analysis with parallel decomposition is a technique to reduce the complexity of processor-behavior analysis.
The idea of parallel decomposition is to calculate the WCET T_max(I, S) of an instruction sequence I = I_0 • I_1 • ... • I_n in two steps. Before this calculation, the TRDCS S is partitioned into S = A ∪ B, where A is the state space of a processor component hw_A and B is the state space of the other components hw_B in the processor. For example, the hardware component hw_A may be the instruction cache and the state fraction B may cover the pipeline and the other processor components. For the timing function T(I, s) of an instruction sequence I given in Figure 2, the corresponding timing function with the partitioned state space A and B is shown in Figure 3. The state spaces of hw_A and hw_B may also overlap (A ∩ B ≠ ∅).
Figure 2. Timing of Non-Partitioned TRDCS
In the first step, the timing of processor component hw A is analyzed and one state a ∈ A is chosen (details describing the choice of a will follow below). Based on this result, the overall processor timing is analyzed in the second step by searching the state space B while using the result for state a ∈ A to model the timing of hw A .
The challenge is to find a composable timing calculation method that can be used to calculate safe WCET bounds for the target processor of interest. Concrete calculation methods are discussed in Section 5. 
Timing Anomalies
The name timing anomalies is used to describe system behavior where relaxing some constraints leads to an increase of the system timing. This is typically caused by a greedy scheduler that cannot foresee the future impact of its local decisions. With respect to WCET analysis, for example, such a constraint may require the execution of two instruction sequences to finish within a given deadline. Decreasing the execution time of the first instruction sequence relaxes the constraint for the second instruction sequence to finish within the deadline, which can lead to timing anomalies.
Related Work on Timing Anomalies
Program execution time is not the first field where timing anomalies have been observed. For example, Graham described this effect for task scheduling [10]. He has shown that a greedy task scheduler can produce a longer schedule if the scheduling constraints are weakened, e.g., by using shorter tasks, fewer dependencies, or more processors.
Lundqvist and Stenström first described timing anomalies in the context of WCET analysis [3]. Their definition of timing anomalies is semi-formal. They have shown an example where a change from a cache hit to a cache miss of the first instruction of an instruction sequence on a processor with out-of-order pipeline and instruction cache can result in a decrease of the total execution time. However, it has proved rather challenging to understand the potential triggers of timing anomalies. For example, Lundqvist and Stenström believed that an out-of-order pipeline is required to trigger timing anomalies [3], which later turned out to be too specific, see below.
Schneider developed an integrated WCET analysis method, i.e., he integrated response-time analysis with WCET analysis. He did this on a PowerPC 755, where he demonstrated that timing anomalies occur on real processors [11]. Besides, he has also demonstrated the occurrence of the so-called domino effect, i.e., different states at the header of a loop never converge during execution of the loop. Domino effects do not necessarily cause timing anomalies. In the concrete example shown by Schneider it did result in a strong timing anomaly, as he showed that a delay caused in the loop header results in a constant delay in each loop iteration, resulting in a total delay that is at least a linear function of the loop bound. Berg has shown an example of a domino effect that results in a weak timing anomaly with a constant execution-time change in each loop iteration [12].
Wenzel et al. have analyzed different patterns of processor architectures to gain knowledge about the possible triggers of timing anomalies [13] , [4] . They have shown that timing anomalies can occur even on processors with in-order execution. Further, Wenzel et al. provide a necessary precondition for the potential occurrence of timing anomalies, the resource allocation criterion. However, the concrete formulation of the criterion was a bit too restrictive as it covers only cases in which exactly one instruction changes its timing. This criterion needs to be generalized to cover timing anomalies caused by speculative execution [5] or certain cache replacement policies like pseudo-round robin [14] .
Reineke et al. for the first time provide a formal definition of timing anomalies in the context of program execution time. Their definition of timing anomalies is based on a transition system as processor model where a timing anomaly occurs if the WCET path within a local scope is not part of the WCET path of a surrounding scope [5] . This was an important step towards improving the understanding of timing anomalies. However, this formalization of timing anomalies is rather complex, making it clumsy to use as a tool for exploring safeness properties of WCET analyses. Further, the authors enumerated three different sources of timing anomalies without claiming this to be a complete list: speculation, scheduling, and cache effects. Reineke et al. only discuss the type of timing anomaly that Eisinger et al. call a strong timing anomaly: the case where a local increase of execution time leads to a global decrease [15] . The other case, where a local increase of execution time leads to an even larger global increase -called weak timing anomaly -is not treated in the context of WCET analysis because it does not hinder an efficient search of the worst-case. The definition of a timing anomaly given by Reineke et al. is specific to WCET analysis, while the original definition given by Lundqvist includes also processor behavior that is not a challenge for WCET analysis.
The research described above discusses timing anomalies only in the context of serial decomposition of WCET analysis. Kirner et al. for the first time describe a class of timing anomalies not considered so far: timing anomalies in the context of WCET analysis using parallel decomposition. A timing anomaly in case of parallel decomposition is defined as the situation where the worst-case initial state of a hardware/processor sub-component is not part of the worst-case initial state of the total hardware/processor [16] . So far the authors formalized this new type of timing anomalies without discussing it in the context of the previously mentioned timing anomalies based on local changes within an instruction sequence.
Parallel Timing Anomalies
Recently it was found that timing anomalies do not only occur between the timing of two subsequent instruction sequences, but also between the component latency of processor components and the total execution time [16] . We call these timing anomalies "parallel timing anomalies" because they are challenging for the parallel decomposition described in Section 3.
Parallel timing anomalies are formally defined by Definition 4.1, which comprises two properties, TA-P-I ("Parallel Inversion") and TA-P-A ("Parallel Amplification"). Both are existentially quantified over states a, a' ∈ A and b ∈ B.
The parallel timing anomaly TA-P-I states that a change of the component latency of instruction sequence I for the processor component hw_A results in a change in the opposite direction for the execution time over the state ⟨a, b⟩ ∈ A × B. Due to this behavior TA-P-I is also called "parallel inversion".
Analogous to series amplification, the parallel timing anomaly TA-P-A states that a change of the component latency of instruction sequence I for the processor component hw_A results in a larger change in the same direction for the execution time over the state ⟨a, b⟩ ∈ A × B. Thus, the parallel timing anomaly TA-P-A is also called "parallel amplification".
In case of parallel timing anomalies, both TA-P-I and TA-P-A can be considered to be "strong (parallel) timing anomalies", since both potentially invalidate the parallel decomposition described in Section 3. However, as described in Section 5, not all occurrences of TA-P-I and of TA-P-A are challenging.
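The two properties can be illustrated with a small, hedged Python sketch that scans a timing table for TA-P-I and TA-P-A. It assumes that a change is measured as a difference, Δ_hwA(I, a, a') = T_hwA(I, a') − T_hwA(I, a) and Δ(I, ⟨a, b⟩, ⟨a', b⟩) = T(I, ⟨a', b⟩) − T(I, ⟨a, b⟩), and it considers every pair of states only in the direction of increasing component latency. The state names and cycle counts are hypothetical, and the function `find_parallel_anomalies` is an illustration, not the formal Definition 4.1.

```python
from itertools import permutations

# Hypothetical component latency T_hwA(I, a) of hw_A (e.g. an instruction cache).
T_HW = {"hit": 1, "miss": 10}
# Hypothetical overall execution time T(I, <a, b>).
T_TOTAL = {
    ("hit", "b0"): 20, ("miss", "b0"): 18,   # slower component, faster overall
    ("hit", "b1"): 20, ("miss", "b1"): 35,   # overall change exceeds component change
}


def find_parallel_anomalies(t_hw, t_total):
    """Return (inversions, amplifications) as lists of (a, a', b) triples."""
    b_states = sorted({b for (_, b) in t_total})
    inversions, amplifications = [], []
    for a1, a2 in permutations(sorted(t_hw), 2):
        d_hw = t_hw[a2] - t_hw[a1]
        if d_hw <= 0:
            # Equal latency is never an anomaly; decreasing direction is the
            # mirror image of a pair that is visited in increasing direction.
            continue
        for b in b_states:
            d_total = t_total[(a2, b)] - t_total[(a1, b)]
            if d_total < 0:
                inversions.append((a1, a2, b))       # opposite direction
            elif d_total > d_hw:
                amplifications.append((a1, a2, b))   # same direction, larger change
    return inversions, amplifications


inv, amp = find_parallel_anomalies(T_HW, T_TOTAL)
print(inv)  # [('hit', 'miss', 'b0')]
print(amp)  # [('hit', 'miss', 'b1')]
```

In this table both kinds of anomaly occur, but for different states b, which matters for the safeness results of Section 5.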
Analogously to series timing anomalies, there also exists a stricter definition for parallel timing anomalies than that given in Definition 4.1. This stricter definition of parallel timing anomalies is given in Definition 4.2. These timing anomalies are called worst-case parallel timing anomalies, since they describe exactly the cases that cause problems for efficient WCET analysis with parallel decomposition.
Definition 4.2: (Worst-Case Parallel Timing Anomalies) Given a partitioned TRDCS S = A ∪ B with the timing behavior (component latency) of hardware component hw_A modeled as T_hwA(I, a), the timing behavior T(I, ⟨a, b⟩) of an instruction sequence I on a processor is called a worst-case parallel timing anomaly, iff at least one of the two properties TAW-P-I and TAW-P-A holds.
Note that Definition 4.2 is almost identical to Definition 4.1, with the small difference that it uses ∀a' ∈ A_max, ∀b' ∈ B_A,max(a') and ∀a' ∈ A_min, ∀b' ∈ B_A,max(a'), respectively, instead of ∃a ∈ A. The worst-case timing anomalies are more specific than the others, which is due to the ∀ quantifier replacing the ∃ quantifier, besides the specific elements being compared. Analogous to series timing anomalies, the generic form of timing anomalies given in Definition 4.1 can imply for a specific program (with the right set of reachable states) the occurrence of the worst-case timing anomalies given in Definition 4.2.
Visualization of Parallel TAs.
To visualize parallel timing anomalies we assume that the TRDCS is partitioned into A and B as explained in Section 3. The component latency of instruction sequence I on hardware component hw_A is assumed to be as shown in Figure 4.a. It is easier to identify the occurrence of parallel timing anomalies if the component latencies for the different states a_i are in decreasing order. To obtain a decreasing order we relabel the states a_i into states a'_i as shown in Figure 4.b.
To find occurrences of parallel timing anomalies, we have to compare the decreasing component latency with the overall execution time. The only exceptions are those cases where the component latency does not change: whatever the change of the execution time is, as long as the component latency does not change, it is not considered to be a timing anomaly. As described in Section 5, such cases can be handled by doing the parallel composition for multiple component latencies of hw_A.
Figure 5.b shows an example of timing anomaly TA-P-I (parallel inversion): the change from state a_3 to state a_4, where the execution time increases while the component latency decreases. Note that between state a_2 and state a_3 there is no timing anomaly, though the execution time also increases. This is not a timing anomaly because in this case the component latency of hw_A does not change. Figure 5.c shows examples of timing anomaly TA-P-A: between states a_0 and a_1 and between states a_3 and a_4. In those cases the execution time decreases more than the component latency of hw_A does.
Of course, it can also happen that both parallel timing anomalies, TA-P-I and TA-P-A, occur. As described in Section 5, such a scenario in general does not invalidate parallel decomposition. But it becomes problematic when TA-P-I and TA-P-A occur for the same b ∈ B. The limitations of parallel decomposition in case of such a scenario are described in Section 5.3.1.
Figure 5 (surviving panel titles): b) TA-P-I only; d) TA-P-I and TA-P-A
Examples of Timing Anomalies
In the previous section we have shown how a processor behaves in case of timing anomalies. In this section we show examples of concrete hardware patterns that can cause such timing anomalies. It is not fully understood how to determine efficiently whether a given hardware platform exhibits timing anomalies. The following presents known instances of timing anomalies, which might help to identify further sources of timing anomalies.
One of the first known potential sources of timing anomalies is out-of-order execution. Wenzel et al. have constructed simple patterns of hardware architectures and studied whether they may exhibit timing anomalies [13], [4]. Figure 6 shows a simple example of an inversion timing anomaly, which has been taken from [4]. The assumed processor has an out-of-order pipeline with two non-overlapping resources. Non-overlapping resources means that no instruction can choose from more than one alternative during a resource allocation. The bold arrows show data dependencies, which restrict the set of different possible executions through the pipeline. The timing anomalies in this example show up due to the combined effect of data dependencies and out-of-order execution. Figure 7 shows, for the same processor model, an example of an amplification timing anomaly, which has also been taken from [4]. Both examples of timing anomalies can manifest as series timing anomalies or as parallel timing anomalies.
Figure 7. Example of TA-P-A (out-of-order pipeline + cache + data dependencies)
Most patterns of hardware and software that can cause series timing anomalies can also cause parallel timing anomalies. However, there is an interesting difference between them: in case of series timing anomalies, the inversion is challenging, but not the amplification. As we show by Theorem 5.7 and Theorem 5.9, in case of parallel timing anomalies, only the coupled occurrence of inversion and amplification is challenging.
WCET Analysis with Parallel Composition
In Section 3 we described the basic idea of reducing analysis complexity by using parallel decomposition of the TRDCS into two sets A and B. The challenge is to find a composable timing calculation method that can be used to calculate safe WCET bounds for the target processor of interest. In the following we describe two different timing-composition techniques and analyze their correctness in case of parallel timing anomalies. These are the only two possible approaches of parallel composition that first search the states of hardware hw_A and hw_B independently based on the maxima, minima, and maximal variation of hardware component hw_A.
Delta-Composition
The first prototypical technique to derive the maximum overall instruction timing of an instruction sequence I based on a decomposition of the TRDCS into two state fractions A and B is called Delta-Composition. The principle of Delta-Composition is given in Figure 8:
1) Δ_hwA,max, the maximum variability (Δ_hwA) of T_hwA(I, a), is determined.
2) The set A_min of local states a_hwA,min ∈ A where T_hwA(I, a) is minimal is determined.
3) The overall timing is calculated as T_dc(I) = max_{a ∈ A_min, b ∈ B} …
In case there are multiple states a ∈ A_min of minimal latency, Delta-Composition evaluates each of these minima and takes the overall maximum. The computational cost for T_dc(I) is O((|A| + |A_min| · |B|) · |I|). Thus, the more minima a ∈ A_min exist, the higher is the computational cost of Delta-Composition. In the extreme case of A = A_min, Delta-Composition degrades to searching all states s ∈ A × B. However, this extreme case of A = A_min rarely seems to be a real problem, because in that case the whole set A has no influence on the timing and thus cannot be part of the TRDCS. The worst case of complexity is rather the scenario |A| − 1 = |A_min|.
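The steps of Delta-Composition can be sketched in Python as follows. The formula for T_dc(I) is not fully spelled out in the text, so the sketch rests on a labeled assumption: the maximum execution time over A_min × B is increased by Δ_hwA,max, and Δ_hwA,max is taken as the difference between the largest and the smallest component latency. The function name `delta_composition` and all timing values are hypothetical.

```python
# Hypothetical component latency T_hwA(I, a) and overall timing T(I, <a, b>).
T_HW = {"hit": 1, "miss": 10}
T_TOTAL = {
    ("hit", "b0"): 20, ("miss", "b0"): 27,
    ("hit", "b1"): 22, ("miss", "b1"): 30,
}


def delta_composition(t_hw, t_total):
    """Sketch of T_dc(I) under the assumptions stated in the lead-in."""
    b_states = sorted({b for (_, b) in t_total})
    lat_min = min(t_hw.values())
    # Step 1 (assumption): maximum variability = max latency - min latency.
    delta_max = max(t_hw.values()) - lat_min
    # Step 2: all states of hw_A with minimal component latency.
    a_min = [a for a, lat in t_hw.items() if lat == lat_min]
    # Step 3 (assumption): search only A_min x B, then add the variability.
    return max(t_total[(a, b)] for a in a_min for b in b_states) + delta_max


t_dc = delta_composition(T_HW, T_TOTAL)
t_max = max(T_TOTAL.values())  # exhaustive search over A x B, for comparison
print(t_dc, t_max)  # 31 30
```

In this table no change of the overall execution time exceeds the change of the component latency, so the sketch yields a bound of 31 cycles that covers the exhaustive maximum of 30 cycles without being tight.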
Theorem 5.1 describes the sufficient and necessary condition on the hardware behavior such that Delta-Composition (T_dc(I)) is safe, i.e., that it provides an upper bound for the execution time of an instruction sequence I.
Theorem 5.1: Safeness of Delta-Composition: Based on the above definitions of A_min, B_A,max(a), and Δ_hwA,max, Delta-Composition provides a safe WCET bound on processor hardware whose timing characteristics obey the following sufficient and necessary condition (proof given in [17]):
Condition 2 states that there exists at least one local best-case state a' ∈ A_min (a_hwA,min) such that whenever the state of hardware component hw_A changes from any state a ∉ A_min to this specific best-case state a', the resulting change in the execution time of instruction sequence I (Δ(I, ⟨a', b'⟩, ⟨a, b⟩)) is not higher than the maximum change that is possible in the component latency of the hardware component hw_A (Δ_hwA(I, a', a)).
The correctness condition given in Equation 2 is the negation of TAW-P-A (see Definition 4.2).
Max-Composition
The second prototypical technique to derive the maximum overall instruction timing of an instruction sequence I based on a decomposition of the TRDCS into two state fractions A and B is called Max-Composition. The principle of Max-Composition is given in Figure 9:
1) The set A_max of local states a_hwA,max ∈ A where T_hwA(I, a) is maximal is determined.
Equation 3 shows how the maximum instruction timing is calculated with Max-Composition (T_mc(I)). Max-Composition provides a precise WCET bound: T_mc(I) = T_max(I).
… Δ(I, ⟨a, b⟩, ⟨a', b'⟩) ≥ 0).
The correctness condition given in Equation 4 is the negation of TAW-P-I (see Definition 4.2).
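A hedged Python sketch of Max-Composition follows. Equation 3 is not reproduced in the text, so the sketch assumes that T_mc(I) is the maximum execution time over A_max × B. The function name `max_composition` and both timing tables are hypothetical: the first table has no parallel inversion, the second one has.

```python
# Hypothetical component latency T_hwA(I, a) of hw_A.
T_HW = {"hit": 1, "miss": 10}
# Overall timing without inversion: a slower hw_A never speeds up the program.
T_MONOTONE = {
    ("hit", "b0"): 20, ("miss", "b0"): 27,
    ("hit", "b1"): 22, ("miss", "b1"): 30,
}
# Overall timing with a parallel inversion (TA-P-I) for state b0.
T_INVERTED = {("hit", "b0"): 20, ("miss", "b0"): 18}


def max_composition(t_hw, t_total):
    """Sketch of T_mc(I), assuming T_mc(I) = max over A_max x B of T(I, <a, b>)."""
    b_states = sorted({b for (_, b) in t_total})
    lat_max = max(t_hw.values())
    # Step 1: all states of hw_A with maximal component latency.
    a_max = [a for a, lat in t_hw.items() if lat == lat_max]
    # Assumed second step: search only A_max x B.
    return max(t_total[(a, b)] for a in a_max for b in b_states)


print(max_composition(T_HW, T_MONOTONE), max(T_MONOTONE.values()))  # 30 30
print(max_composition(T_HW, T_INVERTED), max(T_INVERTED.values()))  # 18 20
```

Without inversion the sketch reproduces the exhaustive maximum exactly, while in the inverted table it returns 18 cycles although the true maximum is 20 cycles, which illustrates why inversion is the anomaly that endangers Max-Composition.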
Safeness of Parallel Composition
The following two theorems state which types of parallel timing anomalies are a challenge for the correctness of Delta-Composition and Max-Composition.
From Corollary 5.5 it follows that parallel timing anomalies are not a serious problem as long as only one type of them occurs. Note that since the Delta-Composition is not tight, T dmc (I) also does not have to be tight: T dmc (I) ≥ T max (I).
Coupled Parallel Timing Anomaly.
In the previous section we have shown that parallel composition can be safe if at most one type of parallel timing anomalies is possible.
In the following we analyze in more detail what happens if both types of parallel timing anomalies (TA-P-I and TA-P-A) occur. In this case we can distinguish between the case where both types of parallel timing anomalies occur only for different states b ∈ B and b' ∈ B (discussed in Section 5.3.2) and the more severe case where they also occur for the same state b ∈ B. The latter case is discussed in the following.
A formal definition of the case where both types of timing anomalies TA-P-I and TA-P-A occur for the same state b ∈ B is given in Definition 5.6. For ease of reference, we name this case TA-P-C, where "C" stands for the coupled occurrence (same b ∈ B) of both parallel timing anomalies.
Definition 5.6: Given a partitioned TRDCS S = A ∪ B with the timing behavior (component latency) of hardware component hw_A modeled as T_hwA(I, a), the timing behavior T(I, ⟨a, b⟩) of an instruction sequence I on a processor is called a coupled parallel timing anomaly, iff the following property holds:
… Δ_hwA(I, a_3, a_4) < Δ(I, ⟨a_3, b⟩, ⟨a_4, b⟩))
The above definition combines the definitions of TA-P-I and TA-P-A given in Definition 4.1. The word "coupled" signals that both types of timing anomalies occur for the same state b ∈ B.
Theorem 5.7 states that timing anomalies of type TA-P-C can only be bounded by searching the whole state space A × B. This is actually an impossibility result for applying efficient parallel composition whenever the occurrence of TA-P-C is possible.
Definition 5.8: Given a partitioned TRDCS S = A ∪ B with the timing behavior (component latency) of hardware component hw_A modeled as T_hwA(I, a), the timing behavior T(I, ⟨a, b⟩) of an instruction sequence I on a processor is called an exclusive parallel timing anomaly, iff the following property holds:
…
The above definition allows the occurrence of both TA-P-I and TA-P-A as given in Definition 4.1. But the word "exclusive" signals that the two types of timing anomalies can only occur for different states b ∈ B and b' ∈ B.
Theorem 5.9 states that timing anomalies of type TA-P-E can efficiently be bounded without having to search the whole state space A × B. This is the most generic form of occurrence of parallel timing anomalies that can efficiently be bounded. As Theorem 5.7 states, this is not possible in the case of TA-P-C.
Table 1 summarizes the situations where parallel composition is safe. Max-Composition (MC) is safe if at most parallel timing anomalies of type TA-P-A are present, and Delta-Composition (DC) is safe if at most parallel timing anomalies of type TA-P-I are present. If TA-P-I and TA-P-A can both occur, but only for different states b ∈ B and b' ∈ B (scenario TA-P-E), then the maximum of Max-Composition and Delta-Composition is a safe upper bound of the execution time. But if TA-P-I and TA-P-A can both occur for the same state b ∈ B (scenario TA-P-C), then there is no efficient method that does not rely on examining the combined state space of A and B. The examination of the combined state space ("Full State") is not a composition method anymore, but is given to show the consequences in case of TA-P-C.
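The case distinction summarized above can be written down as a small decision function. This is a sketch only: it assumes that the occurrences of TA-P-I and TA-P-A are already known as lists of hypothetical (a, a', b) triples, and the function name `choose_method` as well as the returned labels are illustrative rather than taken from Table 1.

```python
def choose_method(inversions, amplifications):
    """Map observed TA-P-I / TA-P-A occurrences to (scenario, safe method)."""
    inv_b = {b for (_, _, b) in inversions}      # states b with an inversion
    amp_b = {b for (_, _, b) in amplifications}  # states b with an amplification
    if inv_b & amp_b:
        # Both anomaly types for the same b: no efficient composition is safe.
        return "TA-P-C", "full state search over A x B"
    if inv_b and amp_b:
        # Both types, but only for different b: combine both compositions.
        return "TA-P-E", "max(Delta-Composition, Max-Composition)"
    if inv_b:
        return "TA-P-I only", "Delta-Composition"
    if amp_b:
        return "TA-P-A only", "Max-Composition"
    return "no parallel timing anomaly", "Max-Composition or Delta-Composition"


# Hypothetical observations: inversion at b0, amplification at b1.
print(choose_method([("hit", "miss", "b0")], [("hit", "miss", "b1")]))
# Hypothetical observations: both anomaly types at the same state b0.
print(choose_method([("a1", "a2", "b0")], [("a0", "a1", "b0")]))
```

The first call reports scenario TA-P-E and the second reports TA-P-C, matching the distinction between anomalies at different states b and at the same state b.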
Summary of Parallel Composition

Implications on WCET Analysis
The potential occurrence of timing anomalies depends on the target hardware as well as on the program code. Thus, one solution to avoid timing anomalies on existing hardware is to rewrite the program such that no timing anomalies can occur [18] , [19] . So far, only timing anomalies for series composition have been addressed. Avoiding parallel timing anomalies is open research.
The other way to avoid timing anomalies is to design predictable hardware that avoids timing anomalies by design. Besides the impossibility result, this paper also provides special cases of parallel timing anomalies where WCET analysis with parallel composition is safe. The latter provide important hints to hardware designers on their way to constructing predictable hardware components.
Conclusion and Outlook
The most challenging problem of WCET analysis is the high complexity of today's processors. Features like caches and pipelines create a huge state space. Even worse, effects like timing anomalies can make it impossible to construct an efficient processor behavior analysis that does not need to search the whole state space for the whole program at once.
In this paper we presented a new class of timing anomalies, which we call parallel timing anomalies. Parallel timing anomalies can occur on parallel composition, i.e., when analyzing the processor behavior in multiple phases based on a decomposition of the computer state (TRDCS). We have introduced the two fundamental techniques to perform parallel composition: Delta-Composition and Max-Composition. As an impossibility result we have proved that in case of arbitrary forms of parallel timing anomalies it is not possible to exclude underestimation of the WCET with parallel composition. Additionally, we have shown that parallel composition provides safe WCET bounds for all types of timing anomalies except TA-P-C. These results provide the foundation for a useful tradeoff between flexibility and predictability in processor hardware design.
Future work is needed on identifying the concrete types of timing anomalies that might occur for a concrete processor implementation.
