Abstract-We consider complex embedded systems, where large, heterogeneous sets of communicating tasks are executed on heterogeneous multi-processor / multi-bus architectures with RTOSes and bus-arbitration. Examples include systems-on-chip for mobile communication, multimedia platforms or distributed automotive control systems. In such systems, data-dependent task execution times and preemption lead to data jitter and bursts, and consequently to sophisticated run-time interdependencies between tasks. Reliable validation of timing constraints in such systems can no longer be achieved through simulation due to incomplete corner-case coverage, but instead requires formal performance analysis.
I. EMBEDDED MULTIPROCESSOR SYSTEMS

A. Application Properties
Modern embedded systems are designed to execute large, heterogeneous sets of applications, many of them under hard real-time constraints. Applications are structured into communicating tasks. A task is usually activated when predecessors have produced output data on which the task can operate. Such a task is called a reactive task. Activating events may be sporadic by nature, e. g. alarm signals or user input, or basically periodic, e. g. a stream of sensor data or packets in a communication protocol. Even strictly periodic task activation can be viewed as event-driven, since it is the result of the expiration of a timer.
Activation dependencies in realistic embedded applications can be rather complex. A consumer task may require a different amount of data per execution than produced by a producer task, leading to multi-rate systems. Task activation may also be conditional, leading to execution-rate intervals. Furthermore, a task may consume data from multiple task inputs. Then, task activation is determined by a boolean function over the available input data. The behavior of tasks may also heavily depend on the type and value of received data. Any useful system-level performance analysis framework must be able to handle the complexity of real-world applications.
B. System Integration and Performance Validation
The ever increasing number of complex embedded applications can only be implemented efficiently on heterogeneous, programmable multiprocessor systems. Multiprocessor system-on-chip (MpSoC) designs use on-chip buses or networks to integrate multiple different programmable processor cores, specialized memories, and other intellectual property (IP) components on a single chip. MpSoCs have become the architecture of choice in network processing, mobile communication and consumer electronics. In the automotive domain, a trend can be seen towards distributed embedded systems.
Systems integration is becoming the major challenge in embedded system design. Complex hardware and software component interactions pose a serious performance threat, including transient overloads, memory overflow, data loss, and missed deadlines. The International Technology Roadmap for Semiconductors names system-level performance verification as one of the top three co-design issues [1] .
Simulation and test are state of the art in embedded system performance verification. Tools such as Mentor Graphics' Seamless-CVE [2] or Axys Design Automation's MaxSim [3] support cycle-accurate co-simulation of a complete MpSoC. In the automotive domain, sophisticated hardware-setups such as ETAS LabCar [4] or dSPACE AutoBox [5] allow rapid test of software prototypes. However, for complex systems it becomes increasingly difficult to cover all corner cases. A corner case describes a critical execution scenario. For each corner case, the designer must provide a simulation pattern that reaches that corner case during simulation. Essentially, if all corner cases satisfy the given performance constraints, then the system will satisfy its constraints. However, such corner cases are extremely difficult to find and debug, and it is even more difficult to find simulation patterns to cover them all.
To limit complexity, an industrial trend can be observed towards conservative but easily predictable communication protocols, such as time-division multiple access (TDMA), that minimize nonfunctional component dependencies. The Sonics SiliconBackplane Micronetwork [6] is an example of these protocols, which make communication timing straightforward and predictable. Distributed systems exhibit a similar trend toward conservative protocols, such as the time-triggered protocol [7] in automotive and aerospace electronics. However, this integration simplicity comes at a significant performance price, and that price increases with system complexity, since such protocols are inefficient in dynamically changing load situations that are typical for reactive embedded systems.
In this paper, we first argue in favor of performance analysis as an alternative to performance simulation for complex reactive embedded systems (section II). In particular, we introduce our compositional performance analysis approach to enable performance analysis for heterogeneous embedded architectures. Sections IV to X constitute the main part of the paper, where we explain how to extend compositional performance analysis beyond simple task dependencies, in order to handle complex applications. Specifically, we focus on AND-activated tasks (section V), OR-activated tasks (section VI), data rate transitions between tasks (section VII), cyclic task dependencies (section IX), and tasks with input and output data rate intervals (section X). In section XI, we apply the results of this paper to analyze the performance of a larger example. We demonstrate the feasibility of our approach, and provide results for several experiments which show some possibilities that performance analysis offers for verification and design-space exploration.
II. PERFORMANCE ANALYSIS
Performance analysis is a promising alternative to simulation and test for the verification of performance constraints. It calculates minimum and maximum response times for tasks or task chains based on task properties, scheduling parameters and all possible timing of activating events. This has three attractive advantages: the calculated performance bounds are reliable, properties can be specified before task implementation for early design-space exploration, and analysis usually runs fast.
The worst- and best-case response times of tasks sharing a single component (e. g. a CPU or a bus) under the control of a scheduler can be calculated using so-called scheduling analysis techniques. Scheduling analysis techniques assume that processes are activated by a stream of activating events. The minimum and maximum number of events within a certain time interval is bounded, which can be efficiently expressed with so-called event models, e. g. periodic with a maximum jitter, or sporadic with a minimum distance. Event models will be considered in more detail in section III. Using these event models, as well as the core execution time of each process and assigned scheduling parameters, a scheduling analysis for a component can calculate the load of that component as well as minimum and maximum response times for each process scheduled on that component. This makes it possible to validate deadlines, for example. A classical analysis technique is Rate-Monotonic Analysis (RMA) [8], which considers a system of periodically activated, independent tasks. Overviews of more general analysis techniques for single processors are given e. g. in [9], [10].
Scheduling analysis has found its way into industrial practice. TriPacific's Rapid RMA [11] , LiveDevices' Real-Time Architect [12] , and TimeSys's TimeWiz [13] all calculate worst-case task response times for single processors or homogeneous multi-processors with static priority scheduling. The most sophisticated tools also take RTOS-overhead into account.
The techniques mentioned so far cannot analyze systems with heterogeneous processor scheduling and bus arbitration policies. Tindell [14] as well as Eles et al. [15] have extended existing scheduling analysis techniques to allow performance analysis of special heterogeneous architectures, e. g. fixed-priority-scheduled CPUs connected via a TDMA-scheduled bus. Wolf et al. [16] have provided a sophisticated analysis algorithm for task graphs scheduled on homogeneous multiprocessors. The main problem with these 'holistic' approaches is that they do not scale well to increasingly heterogeneous systems.
We recently developed a compositional performance analysis methodology which integrates different local scheduling analysis techniques, such as those described above, through event stream propagation [17] , [18] . The local techniques are composed on the system level by connecting their input and output event streams according to the overall application and communication structure. This allows us to re-use the host of work in real-time systems research, and to easily scale to larger and more complex heterogeneous systems, a major advantage over 'holistic' multi-processor analysis approaches.
Event streams are described by event models (section III). We use event model interfaces (EMIFs) and event adaptation functions (EAFs) to adapt output event streams that are incompatible with input requirements. EMIFs and EAFs are also used for traffic shaping, e. g. to reduce transient load peaks.
A different compositional performance analysis approach based on a real-time calculus has been developed by Thiele et al. [19] , [20] . This approach is geared towards performance analysis of network processors and uses a related event model representation (section III).
Our basic compositional approach is only concerned with single-rate, single-input tasks without cyclic dependencies. In this work it is extended to cover multi-rate systems, data-rate intervals, multiple inputs, and functional cycles. The result is a compositional performance analysis methodology which remains applicable even if complex applications are executed on heterogeneous architectures.
Little previous work exists that considers performance analysis in the presence of complex task dependencies. A notable exception is recent work on the timing analysis of conditional task graphs which considers tasks with multiple inputs [21] , [22] . However, these approaches assume global, homogeneous scheduling of multi-processors and thus do not support the coupling of different analysis techniques in a heterogeneous system with heterogeneous scheduling strategies.
III. EVENT MODELS
An event model describes all possible timing of events in an event stream. Event models are often categorized as periodic or sporadic and additionally can display jitter or bursts [14] , [18] . More general formalisms have been devised, e. g. [23] , but they are complex and have only been used in special scheduling configurations.
We focus on periodic with jitter event models, which can accurately describe the possible event timing in most signal-processing as well as control systems. Note that this focus does not restrict the composability of different scheduling analysis techniques.
A periodic with jitter event model states that each event generally occurs periodically with period P, but that it can jitter around its exact position within a jitter interval J . Consider an example where
This event model is visualized in Fig. 1 . Each gray box indicates a jitter interval of length J which repeats with the event model period P. The figure additionally shows a snapshot of an event stream which satisfies the event model, since each event falls within one jitter interval box. If the jitter is zero, then the event model is strictly periodic. If the jitter is larger than the period, then two or more events can occur at the same time, leading to bursts. A more detailed discussion can be found in [18] .
Jitter is a critical event stream property. While a task that is activated with a certain average period also completes with the same average period, its completion may jitter by a considerably larger amount than its activation. This jitter increase is a result of non-constant task-execution times, as well as of scheduling effects, in particular preemption by higher-priority tasks. Calculating the jitter increase and the resulting output event model is at the core of our compositional performance analysis approach.
An event model can be described by two event functions $\eta^u(\Delta t)$ and $\eta^l(\Delta t)$.
Definition 1 (Upper Event Function):
The upper event function $\eta^u(\Delta t)$ specifies the maximum number of events that can occur during any time interval of length $\Delta t$.
Definition 2 (Lower Event Function):
The lower event function $\eta^l(\Delta t)$ specifies the minimum number of events that have to occur during any time interval of length $\Delta t$.
The actual number of events for any time interval of length $\Delta t$ is bounded by the upper and lower event functions. In the following, the dependency of $\eta^u$ and $\eta^l$ on $\Delta t$ is omitted for brevity.
A periodic with jitter event model is described by the following event functions $\eta^u_{P+J}$ and $\eta^l_{P+J}$ [18].

$$\eta^u_{P+J} = \left\lceil \frac{\Delta t + J}{P} \right\rceil \quad (1)$$

$$\eta^l_{P+J} = \max\left(0, \left\lfloor \frac{\Delta t - J}{P} \right\rfloor\right) \quad (2)$$

Fig. 2 shows upper and lower event function plots for our periodic with jitter event model example. Note that at the points of discontinuity, the smaller value is valid for the upper event function, while the larger value is valid for the lower event function in accordance with equations 1 and 2.

To get a better feeling for event functions, imagine a sliding window of length $\Delta t$ that is moved over the (infinite) length of an event stream. Consider $\Delta t = 4$ (gray vertical bar in Fig. 2). The upper event function indicates that at most 2 events can be observed during any time interval of length $\Delta t = 4$. This corresponds e. g. to a window position between $t_0 + 8.5$ and $t_0 + 12.5$ in Fig. 1. The lower event function indicates that no events have to be observed during $\Delta t = 4$. This corresponds e. g. to a window position between $t_0 + 12.5$ and $t_0 + 16.5$ in Fig. 1.
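To make the sliding-window reading concrete, the following Python sketch evaluates both event functions. It assumes the ceiling and floor forms commonly given for periodic with jitter event models in [18]. The function names and the parameters P = 4, J = 1 are illustrative assumptions, chosen only because they reproduce the readings of 2 and 0 events quoted for ∆t = 4.

```python
from fractions import Fraction
from math import ceil, floor


def eta_upper(dt, period, jitter):
    """Maximum number of events in any window of length dt (dt > 0)."""
    return ceil(Fraction(dt + jitter) / period)


def eta_lower(dt, period, jitter):
    """Minimum number of events in any window of length dt."""
    return max(0, floor(Fraction(dt - jitter) / period))


# Assumed example parameters (not taken from the text).
P, J = 4, 1
print(eta_upper(4, P, J), eta_lower(4, P, J))  # prints "2 0"
```

Exact rational arithmetic is used so that the value taken exactly at a step position is not affected by floating-point rounding.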
In network processing, load models are common that capture packet-size, peak-load and long-term load [19] . Such models are often described in form of arrival curves [24] , [25] . As was pointed out by Chakraborty et al. and ourselves in [26] and [20] , the plots of upper and lower event functions can be interpreted as event arrival curves.
IV. PROBLEM FORMULATION
We consider communicating tasks with one or more inputs and one or more outputs. The amount of communicated data (the data rate) is specified as a number of tokens, where a token is an atomic unit of data of a certain size. Activation of a task depends on the availability of a certain number of tokens at its inputs. The task consumes the activating tokens at the beginning of one execution, and produces a certain number of tokens at its outputs when execution completes. In this paper, we assume unidirectional FIFO-channels with non-destructive write and destructive read semantics that connect exactly one task output to one task input.
In sections V to VII, data rates are fixed, and tokens are not further distinguished. As an extension, in section X the number of consumed and produced tokens per port per execution can be an interval. A task graph with these properties is able to capture a large variety of established modeling semantics [27], [28], [29], thus making the presented methodology rather domain- and tool-independent.
V. AND-ACTIVATION
For a consumer task C with multiple inputs, AND-activation implies that C is activated once the input conditions are satisfied at each input. The input condition at input i is satisfied if at least $r_{c,i}$ tokens are available for consumption at that input. The arrival of $r_{c,i}$ tokens at input i shall be described by a periodic with jitter input event model $(P_i, J_i)$. We are interested in calculating the activating event model $(P_{AND}, J_{AND})$ for the AND-activated task, which we will need for scheduling analysis.
a) AND-activation period: To ensure bounded input buffer sizes, the period of all input event models must be the same. The period of the activating event model equals this period. Input event model jitters are not restricted, since jitter represents a transient load peak (or trough) which can be buffered with finite-size FIFOs.
b) AND-activation jitter: The activation jitter $J_{AND}$ is less obvious. Let us consider the example in Fig. 3. Let us assume that the required number of tokens (2 at the upper, 2 at the middle, and 3 at the lower input) arrive at the three inputs of AND-activated task C with the following periodic with jitter event models:
As can be seen, the minimum distance between two AND-activations (activations 3 and 4 in Fig. 4) equals the minimum distance between two input events at input 3, which is the input with the largest input event model jitter. Likewise, the maximum distance between two AND-activations (activations 1 and 2 in Fig. 4) equals the maximum distance between two input events at input 3. It is not possible to find a different sequence of input events leading to a smaller minimum or a larger maximum distance between two AND-activations. From this we can conclude that the input with the largest input event jitter determines the activation jitter of the AND-activated task. A proof can be found in [30].
Using equations 1 and 2, the results of this section can also be expressed in form of upper and lower activating event functions.
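The two rules of this section (a common period, and the largest input jitter as activation jitter) can be sketched in Python as follows. The function name and the input values in the usage line are hypothetical and are not the event models of the Fig. 3 example.

```python
def and_activation(input_models):
    """Return (P_AND, J_AND) for a list of (period, jitter) input models."""
    periods = {period for period, _ in input_models}
    if len(periods) != 1:
        # Differing periods would let tokens pile up without bound.
        raise ValueError("AND-activation requires equal input periods")
    p_and = periods.pop()
    # The input with the largest jitter determines the activation jitter.
    j_and = max(jitter for _, jitter in input_models)
    return p_and, j_and


# Hypothetical input event models (period, jitter).
print(and_activation([(4, 1), (4, 3), (4, 2)]))  # prints "(4, 3)"
```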
A. AND-Activation Incurred Token Delay and Backlog
We need to calculate the maximum token delay, and the maximum token backlog and consequently the required buffer size for each input. For this purpose, the lower activating event function (equation 6) is interpreted as a lower service function [19] of a hypothetical AND-concatenation task: this task serves each input immediately after enough tokens have arrived at all inputs.
This interpretation allows us to reuse results by Thiele et al. who showed in [19] that the maximum delay experienced by input data before it is processed by a consumer task is the maximum horizontal distance between the upper arrival curve of the data and the lower service curve of the task. The maximum buffer size required to buffer that data is determined by the maximum vertical distance between the upper arrival curve and the lower service curve.
Let us define an upper input event function η 
These inequations can be further improved for the input with the largest jitter, since tokens arriving at that input cannot wait for themselves. For that input, the term max(J i ) in equation 6 can be replaced with the maximum of all other input jitters.
In conclusion, for our example we obtain
$\mathit{delay}_1 \le 7$ time units, $\mathit{backlog}_1 \le 4$ tokens
$\mathit{delay}_2 \le 9$ time units, $\mathit{backlog}_2 \le 6$ tokens
$\mathit{delay}_3 \le 9$ time units, $\mathit{backlog}_3 \le 9$ tokens
VI. OR-ACTIVATION
For a consumer task C with multiple inputs, OR-activation implies that C is activated each time the input condition is satisfied at any input. In contrast to AND-activation, input event models are not restricted, and no activation buffering is required, since tokens at one input never have to wait for tokens to arrive at a different input. For any $\Delta t$, the minimum (maximum) number of activating events for OR-activated task C is the sum of the minimum (maximum) numbers of satisfied input conditions over all inputs. For the periodic with jitter event models that we consider in this paper this implies

$$\eta^u_{OR} = \sum_i \left\lceil \frac{\Delta t + J_i}{P_i} \right\rceil \quad (9)$$

$$\eta^l_{OR} = \sum_i \max\left(0, \left\lfloor \frac{\Delta t - J_i}{P_i} \right\rfloor\right) \quad (10)$$
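A minimal Python sketch of this summation is given below. It assumes the usual ceiling and floor event functions for periodic with jitter event models [18]; the function names and the two input models in the usage lines are hypothetical.

```python
from fractions import Fraction
from math import ceil, floor


def eta_upper_or(dt, models):
    """Upper activating event function: sum of the upper input event functions."""
    return sum(ceil(Fraction(dt + j) / p) for p, j in models)


def eta_lower_or(dt, models):
    """Lower activating event function: sum of the lower input event functions."""
    return sum(max(0, floor(Fraction(dt - j) / p)) for p, j in models)


# Hypothetical (period, jitter) input models.
models = [(4, 0), (3, 0)]
print(eta_upper_or(12, models), eta_lower_or(12, models))  # prints "7 7"
```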
Fig. 6. Example of an OR-activated task with two inputs
Let us consider the example task in Fig. 6 . Let us assume that the required number of tokens arrive at the two inputs with the following periodic with jitter event models:
The corresponding upper and lower input event functions are shown in Fig. 7 . According to equations 9 and 10, the upper and lower activating event functions are constructed by adding the respective input event functions. The result is shown in Fig. 8(a) .
Due to the irregularities, the activating event functions cannot be described by a periodic with jitter event model. We could use a scheduling analysis algorithm that considers each input event model individually and internally calculates the sums in equations 9 and 10. Such an approach yields the most accurate results. However, we cannot expect that every analysis algorithm is able to do that. Furthermore, we are still faced with the problem that in the following system-level analysis step, an output event model has to be propagated to the next analysis component [18] . This requires an input event model of the same class. Since we are working with periodic with jitter event models, we would like to find conservative approximations for η u OR and η l OR that can be described by a periodic with jitter event model (P OR , J OR ).
c) OR-activation period:
The period of OR-activation is the least common multiple LCM(P i ) of all input event model periods (the macro period), divided by the sum of input events during the macro period assuming zero jitter for all input event streams.
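The rule above can be sketched in a few lines of Python. Integer periods are an assumption of this sketch (needed for the least common multiple), and the periods 4 and 3 are the ones used in the macro-period calculation LCM(P 1 , P 2 ) = 4 * 3 = 12 of this section.

```python
from fractions import Fraction
from math import lcm  # available from Python 3.9


def or_period(periods):
    """Macro period divided by the number of input events per macro period."""
    macro = lcm(*periods)
    events_per_macro = sum(macro // p for p in periods)  # zero jitter assumed
    return Fraction(macro, events_per_macro)


print(or_period([4, 3]))  # prints "12/7"
```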
For our example, with input periods 4 and 3 and hence a macro period of 12, we obtain $P_{OR} = 12 / (3 + 4) = 12/7$.
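d) OR-activation jitter: Approximating $\eta^u_{OR}$ and $\eta^l_{OR}$ with a periodic with jitter event model implies that equations 9 and 10 are replaced by the following inequations.

$$\left\lceil \frac{\Delta t + J_{OR}}{P_{OR}} \right\rceil \ge \sum_i \left\lceil \frac{\Delta t + J_i}{P_i} \right\rceil \quad (12)$$

$$\max\left(0, \left\lfloor \frac{\Delta t - J_{OR}}{P_{OR}} \right\rfloor\right) \le \sum_i \max\left(0, \left\lfloor \frac{\Delta t - J_i}{P_i} \right\rfloor\right) \quad (13)$$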
In order to be as accurate as possible, we are interested in the minimum jitter that satisfies both inequations. In [30] it is proven that both inequations yield the same minimum jitter. In the following, the upper approximation (inequation 12) is used.
Inequation 12 is evaluated piecewise for each interval ]∆t i , ∆t i+1 ], during which the right side of the inequation has a constant value k i ∈ N. For each constant piece of the right side, a condition for a local jitter J OR,i is obtained that 
Since the left side of this inequation is monotonically increasing with ∆t, it is sufficient to evaluate it for the smallest value of ∆t. I. e.
An algorithm to calculate $\Delta t_i$ and $k_i$ is given in [30]. $\eta^u_{OR}$ displays a pattern of distances between steps which repeats periodically with a macro period that is the least common multiple (LCM) of all input event model periods. Therefore, it is sufficient to perform the above calculation for one macro period. The global minimum jitter is then the smallest value which satisfies all local jitter conditions.
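The piecewise evaluation can be sketched in Python as follows. This is not the algorithm of [30]. It is a brute-force sketch that assumes the ceiling form of the upper event function, integer periods, and the local condition J OR,i ≥ (k i − 1) * P OR − ∆t i , which is what the limit ∆t → ∆t i of the upper approximation yields under that assumption. The function names and the input models in the usage line are hypothetical.

```python
from fractions import Fraction
from math import ceil, lcm  # lcm needs Python 3.9+


def or_jitter(models):
    """Smallest J_OR satisfying all local conditions; models = [(P_i, J_i), ...]."""
    periods = [p for p, _ in models]
    macro = lcm(*periods)
    p_or = Fraction(macro, sum(macro // p for p in periods))
    # Two macro periods are scanned so that one full repetition is covered.
    horizon = 2 * macro + max(j for _, j in models)
    # The upper event function of input i steps just after dt = n * P_i - J_i.
    starts = {Fraction(0)}
    for p, j in models:
        n = 1
        while n * p - j <= horizon:
            if n * p - j > 0:
                starts.add(Fraction(n * p - j))
            n += 1
    starts = sorted(starts)
    j_or = Fraction(0)
    for dt_i, dt_next in zip(starts, starts[1:]):
        # Constant value k_i on ]dt_i, dt_next], read off at the right end.
        k_i = sum(ceil(Fraction(dt_next + j) / p) for p, j in models)
        j_or = max(j_or, (k_i - 1) * p_or - dt_i)
    return j_or


print(or_jitter([(4, 0), (3, 0)]))  # prints "12/7"
```

For a single input the sketch returns the input jitter unchanged, which is a quick sanity check of the local condition.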
In our example, inequation 12 becomes
Constant pieces of the right side of this inequation now have to be found. The first constant piece is for 0 < ∆t ≤ 1. According to inequation 14
The following table shows all relevant constant segments and the resulting local J OR,i . Each constant segment can also be directly obtained from the plot of η u OR in Fig. 8(a) .
The last table entry only shows that starting with ∆t > 1, the pattern repeats every ∆t = LCM(P 1 , P 2 ) = 4 * 3 = 12.
The global minimum jitter is
VII. DATA RATE TRANSITIONS
We speak of a data rate transition if the number of tokens produced during one execution of a producer task (the producer data rate $r_p$) differs from the number of tokens required for one activation of a consumer task (the consumer data rate $r_c$). Dataflow graphs such as Synchronous Dataflow (SDF) [31] are typical models of computation (MoC) where this can happen. Consider the data rate transition example in Fig. 9. Task P produces 2 tokens per execution with the following periodic with jitter event model:
Task C consumes 3 tokens per execution, and therefore requires at least 3 tokens at its input to be activated. This implies that some tokens may have to be buffered until enough tokens for one consumer activation have arrived. Irregular buffering times can lead to consumer activation jitter. Furthermore, data rate transitions change the average period of consumer activations compared to producer activations [31] . We need to calculate the activating event model (P c , J c ) for the consumer task, which we will need for scheduling analysis.
e) consumer activation period: Calculation of the activating event model period P c is straightforward [31] .
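As a small sketch of this calculation: the producer delivers r p tokens every P p time units on average and the consumer needs r c tokens per activation, so the average activation period scales by r c / r p . The function name is hypothetical, and the producer period of 4 in the usage line is an assumption that matches the consumer period of 6 appearing in the worked example of this section.

```python
from fractions import Fraction


def consumer_period(producer_period, r_p, r_c):
    """Average consumer activation period for producer rate r_p, consumer rate r_c."""
    return Fraction(producer_period) * r_c / r_p


print(consumer_period(4, 2, 3))  # prints "6"
```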
f) consumer activation jitter: Calculation of the activating event model jitter J c is more complex. Let η 
In the following, the dependency of τ u and τ l on ∆t is omitted for brevity.
Producer token functions always assume values which are integer multiples of the producer data rate r p . Consumer token functions always assume values which are integer multiples of the consumer data rate r c . A simple relationship exists between token functions and corresponding event functions.
If r p is different from r c , then producer and consumer token function have an added level of expressiveness compared to event functions. This is exploited in our data rate transition problem.
To correctly construct the upper consumer token function, the maximum initial number of tokens in the buffer between P and C not leading to an activation of C, i. e. r c − 1 tokens, has to be assumed. Furthermore, it must be assumed that subsequent tokens arrive as early as possible. These assumptions are captured by shifting the upper producer token function upwards r c − 1 tokens. The shifted function shall be calledτ To correctly construct the lower consumer token function, the minimum initial number of tokens in the buffer, i. e. 0, has to be assumed. Furthermore, it must be assumed that subsequent tokens arrive as late as possible. These assumptions are already expressed by the lower producer token function.
Upper and lower consumer token functions can now be constructed. They step as soon as possible, without rising aboveτ 
Let us apply these results to the example in Fig. 9 . The upper and lower producer token functions τ u p and τ l p for producer task P are shown in Fig. 10(a) . In Fig 10(b) , the upper producer token function has been shifted upwards by r c − 1 = 2. The resulting consumer token functions τ . This is shown in Fig. 11(b) . Therefore, equation 20 can also be written as
This leads to a second interpretation of the transformation result. Consumer token functions are a tight conservative approximation of producer token functions. They are conservative, since all possible timing of the arrival of tokens allowed by the producer token functions (represented by the area between the curves) is also allowed by the consumer token functions.
Finally, activating event functions are obtained from consumer token functions using equations 18 and 19. The result for our example is shown in Fig. 12(a) .
The activating event functions cannot be described by a periodic with jitter event model. For the same reasons as in section VI, we would like to find conservative approximations for $\eta^u_c$ and $\eta^l_c$ that can be described by a periodic with jitter event model $(P_c, J_c)$. I. e. we have to solve the following inequations.
We are interested in the minimum jitter that satisfies both inequations. In [30] it is proven that both inequations yield the same minimum jitter. The intended result is shown in Fig. 12(b) . In the following, the upper approximation (inequation 23) is used.
Inequation 23 is evaluated piecewise for each interval ]∆t i , ∆t i+1 ], during which the right side of the inequation has a constant value k i . For each constant piece of the right side, 
Since the left side of this inequation is monotonically increasing with ∆t, it is sufficient to evaluate it for the smallest value of ∆t. I. e. r c * lim
An algorithm to calculate $\Delta t_i$ and $k_i$ is given in [30]. $\eta^u_c$ displays a pattern of distances between steps which repeats periodically with a macro period that is the least common multiple (LCM) of the producer and consumer periods. Therefore, it is sufficient to perform the above calculation for one macro period. The global minimum jitter is then the smallest value which satisfies all local jitter conditions.
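The piecewise evaluation can be sketched in Python as follows. This is not the algorithm of [30]. It assumes that k i is the value of the upper producer token function on ]∆t i , ∆t i+1 ] and that the local condition reads J c,i ≥ (⌈k i / r c ⌉ − 1) * P c − ∆t i , a reading that reproduces the local jitter values of the worked example in this section. The producer model (period 4, jitter 1) in the usage line is likewise an assumption inferred from that example, and the function name is hypothetical.

```python
from fractions import Fraction
from math import ceil


def consumer_activation_model(p_p, j_p, r_p, r_c):
    """Return (P_c, J_c) for a producer model (p_p, j_p) and data rates r_p, r_c."""
    p_c = Fraction(p_p) * r_c / r_p
    # The upper producer token function rises by r_p just after n * p_p - j_p.
    # Two repetitions of the token pattern (r_c producer periods each) are scanned.
    horizon = 2 * r_c * p_p + j_p
    starts = [Fraction(0)]
    n = 1
    while n * p_p - j_p <= horizon:
        if n * p_p - j_p > 0:
            starts.append(Fraction(n * p_p - j_p))
        n += 1
    j_c = Fraction(0)
    for dt_i, dt_next in zip(starts, starts[1:]):
        # Producer tokens k_i on ]dt_i, dt_next], read off at the right end.
        k_i = r_p * ceil(Fraction(dt_next + j_p) / p_p)
        activations = -(-k_i // r_c)  # integer ceiling of k_i / r_c
        j_c = max(j_c, (activations - 1) * p_c - dt_i)
    return p_c, j_c


p_c, j_c = consumer_activation_model(4, 1, 2, 3)
print(p_c, j_c)  # prints "6 3"
```

When the producer and consumer rates are equal, the sketch returns the producer event model unchanged, as expected for a transition-free connection.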
In our example, inequation 23 becomes
Constant pieces of the right side of this inequation now have to be found. The first constant piece is for 0 < ∆t ≤ 3. According to inequation 25
The following table shows all relevant constant pieces and the resulting local J c,i .
∆t range          k_i    local J_c,i
3 < ∆t ≤ 7        4      J_c,1 ≥ 1 * 6 − 3 = 3
7 < ∆t ≤ 11       6      J_c,2 ≥ 1 * 6 − 7 = −1
11 < ∆t ≤ 15      8      J_c,3 ≥ 2 * 6 − 11 = 1
15 < ∆t ≤ 19      10     J_c,4 ≥ 3 * 6 − 15 = 3
The last table entry is there only to show that starting with ∆t > 3 the pattern repeats every ∆t = 3 * 4 = 12 time units. We are looking for the smallest jitter that satisfies all local jitter conditions. Therefore, $J_c = \max_i(J_{c,i}) = 3$.
A. Rate Transition Incurred Token Delay and Backlog
We need to calculate the best- and worst-case token delay, and the maximum token backlog and consequently the required buffer size at the consumer input. The best-case delay is obviously zero, since each consumer activation is the result of at least one token arriving from the producer. Therefore, that token does not have to wait for additional tokens. The worst-case delay is incurred for the smallest set of tokens that can remain in the rate-transition buffer after a consumer activation. Let the size of this set be $N_{min}$.
If r p is an integer multiple of r c , then such a non-empty set of tokens does not exist, and the delay is always zero. Otherwise, N min can be obtained by starting with 0 initial tokens and iterating one macro period to obtain each possible number of tokens in the buffer.
If N min tokens remain in the rate-transition buffer, the consumer is activated after n producer activations, where n is the smallest integer solution for the following inequation.
The maximum activation delay is obtained if the next n groups of r p tokens arrive as late as possible. Consequently
The arrival of $r_p$ tokens and the consumption of $r_c$ tokens are assumed to be instantaneous. However, in practice it is safe to assume that $r_p$ tokens arrive first, and $r_c$ tokens are consumed immediately afterwards. Therefore, the possible backlog due to data rate transitions and hence the required buffer size is

$$\mathit{backlog} \le r_c - 1 + r_p \quad (27)$$
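The buffer-related quantities of this subsection can be sketched in Python as follows. Two points are assumptions of the sketch: N min is taken as the smallest non-zero buffer fill seen while iterating r c producer executions from an empty buffer, and n is taken as the smallest integer with N min + n * r p ≥ r c . The function name is hypothetical.

```python
def rate_transition_buffer(r_p, r_c):
    """Return (N_min, n, backlog bound) for producer rate r_p and consumer rate r_c."""
    levels = set()
    buffered = 0
    # After r_c producer executions the buffer fill is back at its start value.
    for _ in range(r_c):
        buffered = (buffered + r_p) % r_c  # consumer fires as often as it can
        if buffered:
            levels.add(buffered)
    n_min = min(levels) if levels else None  # None: no tokens ever left waiting
    # Producer executions needed until the waiting tokens activate the consumer.
    n = 0 if n_min is None else -(-(r_c - n_min) // r_p)
    backlog = r_c - 1 + r_p  # bound from equation 27
    return n_min, n, backlog


print(rate_transition_buffer(2, 3))  # prints "(1, 1, 4)"
```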
VIII. COMBINATION OF MULTIPLE INPUTS, DATA RATE TRANSITIONS AND TASK SCHEDULING
A task with multiple inputs can additionally be subject to data rate transitions at some of its inputs. In this case, the result from section VII can be combined with the results from sections V and VI to calculate an activating event model for the task. In the first step, the input event model at each rate transition input of consumer task C is calculated according to section VII. In the second step, the activating event model for task C is calculated from all input event models according to sections V and VI.
As was explained in sections V and VII, AND-activation as well as data rate transitions require token buffering. Additionally, tokens have to be buffered from the moment the consumer has been activated until the consumer actually starts to execute and the input tokens are consumed (execution buffering). The sum of rate transition delay, AND-activation delay, and task response time determines the contribution of a task to end-to-end latency. The sum of rate transition buffering, AND-activation buffering, and execution buffering determines the required communication buffer sizes. In certain cases, the buffer sizes can be slightly optimized. The reader is referred to [30] for further discussion.
IX. CYCLIC DEPENDENCIES
Tasks with multiple inputs allow to build cyclic dependencies. A typical application is a control loop, where one task represents the controller and the other task interacts with the controlled system. An example of an application with a cycle is shown in Fig. 13 . OR-activated tasks are not allowed in a cycle, since the output period of an OR-activated task is smaller than the smallest input period (section VI). This output period would be eventually propagated to an input of the OR-activated task, leading to yet a smaller output period. A fix-point cannot be reached, and the cycle thus cannot be scheduled. Therefore, only AND-activated tasks are allowed. Additionally, in this paper only cycles without data rate transitions are considered. While a cycle with data rate transitions is schedulable in principle if it satisfies certain requirements [32] , performance analysis becomes more complicated. The reader is referred to [30] for further discussion. Finally, without loss of generality, we consider only unit-rate cycles, since the extension to non-unit rates is trivial. It merely requires larger communication buffers. The example in Fig. 13 is in fact a system with a unit-rate cycle.
To avoid deadlock, at least one initial token has to be present in a unit-rate cycle [32]. The number of tokens in the cycle remains constant, since for every consumed token, one token is produced. More than one initial token allows cycle tasks to execute in parallel if they are mapped onto different resources. The interesting question considered here is whether a deadlock-free cycle can be executed fast enough to avoid buffer overflow at the external cycle inputs. This question can only be answered if task implementation and scheduling are taken into account.
Let a unit-rate cycle be externally activated with average period $P_{ext}$ and an arbitrary jitter $J_{ext}$. Then the following two conditions are sufficient to guarantee that the cycle can be executed fast enough.
1) The sum of worst-case response times of cycle tasks mapped onto the same CPU does not exceed $P_{ext}$.
2) The number of initial tokens in the cycle is equal to or larger than the sum of worst-case response times of all cycle tasks, divided by $P_{ext}$.
Rationale for condition 1: tasks mapped onto the same CPU cannot execute at the same time. Therefore, the sum of worst-case response times of cycle tasks mapped onto the same CPU is the worst-case time that the CPU needs to process one cycle iteration. If this time does not exceed $P_{ext}$, then the external period definitely can be maintained on this CPU.
Rationale for condition 2: tasks mapped onto different CPUs can be executed in parallel. However, parallel execution of tasks is only possible if a sufficient number of tokens is present in the cycle.
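The two sufficient conditions can be checked mechanically once worst-case response times are known. The following is a minimal sketch, not part of the original analysis tool: the function name, the dictionary-based task description, and all numeric values are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from math import ceil


def check_cycle_conditions(wcrt_by_task, cpu_of_task, p_ext, initial_tokens):
    """Evaluate the two sufficient conditions for a unit-rate cycle.

    wcrt_by_task:   worst-case response time of every cycle task (hypothetical input)
    cpu_of_task:    resource each cycle task is mapped onto
    p_ext:          average period of the external activation
    initial_tokens: number of initial tokens placed in the cycle
    """
    # Condition 1: per resource, the summed response times must not exceed p_ext.
    per_cpu = defaultdict(int)
    for task, wcrt in wcrt_by_task.items():
        per_cpu[cpu_of_task[task]] += wcrt
    cond1 = all(total <= p_ext for total in per_cpu.values())

    # Condition 2: tokens >= (sum of all response times) / p_ext.
    # Tokens are integral, so the smallest admissible count is the ceiling.
    required_tokens = ceil(sum(wcrt_by_task.values()) / p_ext)
    cond2 = initial_tokens >= required_tokens

    return cond1, cond2, required_tokens


# Illustrative values only: three cycle tasks on two resources.
wcrt = {"t1": 30, "t2": 40, "t3": 30}
mapping = {"t1": "cpu1", "t2": "cpu2", "t3": "cpu1"}
print(check_cycle_conditions(wcrt, mapping, p_ext=70, initial_tokens=2))
```

In this made-up configuration, one token would not be enough, because the summed response times (100) exceed a single external period (70).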
The two presented conditions can be rather pessimistic. It is possible to specify tighter conditions, but then phase information between events in different event streams (so-called inter-event-stream contexts [33]) is required during scheduling analysis in addition to period and jitter values, and a scheduling analysis algorithm must be able to take these contexts into account. The advantage of the two presented conditions is that they allow us to determine activating event models for each cycle task without the use of contexts, which means that the host of 'context-blind' scheduling analysis techniques remains applicable to systems with cyclic dependencies. The interested reader is referred to [30] for further discussion.
A. Analysis Idea
Let us initially assume that the activating event model for an AND-activated task in a cycle is solely determined by input event models at the cycle-external inputs. This assumption allows us to ignore the event models at the cycle-internal inputs of AND-activated tasks during response-time calculation, effectively cutting the cycles and thus yielding a purely feed-forward system. This feed-forward system can then be analyzed using existing scheduling analysis techniques.
The analysis results allow us to reason about the validity of the initial assumption. Without loss of generality, let us assume that the AND-activated task has one cycle-external and one cycle-internal input. Input events at the cycle-external input are described by a periodic with jitter event model $(P_{ext}, J_{ext})$.
Let us introduce functions $\delta^{min}(N)$ and $\delta^{max}(N)$, which return the minimum and maximum distance, respectively, between $N \ge 2$ events in an event stream. For periodic with jitter event models we obtain

$$\delta^{min}(N) = \max\{0,\ (N-1) \cdot P - J\} \tag{28}$$

$$\delta^{max}(N) = (N-1) \cdot P + J \tag{29}$$
For example, as can also be seen from Fig. 2, the minimum distance between 3 events in a periodic with jitter event model with $P = 4$, $J = 1$ is 7 time units, and the maximum distance between 3 events is 9 time units.
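The distance functions can be written down directly. The sketch below assumes the usual closed forms for a periodic with jitter event model, namely a minimum distance of (N − 1)·P − J (clamped at zero) and a maximum distance of (N − 1)·P + J; the function names are hypothetical. It reproduces the numbers of the example above.

```python
def delta_min(n, period, jitter):
    """Minimum distance between n >= 2 events of a periodic with jitter stream."""
    if n < 2:
        raise ValueError("distance functions are defined for n >= 2 events")
    # Assumed closed form; a distance can never be negative, hence the clamp.
    return max(0, (n - 1) * period - jitter)


def delta_max(n, period, jitter):
    """Maximum distance between n >= 2 events of a periodic with jitter stream."""
    if n < 2:
        raise ValueError("distance functions are defined for n >= 2 events")
    return (n - 1) * period + jitter


# Example from the text: P = 4, J = 1, three events.
print(delta_min(3, 4, 1), delta_max(3, 4, 1))
```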
Let us further define $t^{min}_{ff}$ and $t^{max}_{ff}$ to be the minimum and maximum sum, respectively, of the response times of all tasks belonging to a cycle, as obtained through analysis of the corresponding feed-forward system.
B. Cycles with one required and one available initial token
Consider a cycle where, after analysis of the corresponding feed-forward system, $t^{max}_{ff} \le P_{ext}$. Sufficient condition 1 is then fulfilled independent of the number of resources. At least one initial token is required in the cycle to fulfill sufficient condition 2. Let us assume that exactly one token is available in the cycle.
Let us assume that the AND-concatenated task (task c in Fig. 13) has just been activated. Consequently, the buffer at the cycle-internal input of task c is now empty (since there is only one token in the cycle), and thus no further activation of task c is possible at that moment. It will take between $t^{min}_{ff}$ and $t^{max}_{ff}$ time units until the next token becomes available at the cycle-internal input of task c.
According to equation 29, the maximum distance between two consecutive external input events is $\delta^{max}_{ext}(2) = P_{ext} + J_{ext}$. From $t^{max}_{ff} \le P_{ext}$ it follows that a late external token never has to wait for an internal token. Since activation also cannot happen earlier than the arrival of an external token, it follows that the external input event model is a conservative approximation of the activating event model, i.e.

$$P_{act} = P_{ext}, \qquad J_{act} = J_{ext}$$

An early external token may have to wait for an internal token, since two token arrivals at the cycle-internal input of task c cannot follow closer than $t^{min}_{ff}$, and thus

$$\delta^{min}_{act}(2) = \max\{\delta^{min}_{ext}(2),\ t^{min}_{ff}\}$$

Therefore, if an extended periodic with jitter event model with a $d_{min} = \delta^{min}(2)$ parameter is used, as is the case in SymTA/S [34], then it is worthwhile to perform scheduling analysis again with the tighter activating event model for the AND-concatenated task.
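The tightening step for the single-token case can be sketched as follows. This is an illustration under stated assumptions rather than the SymTA/S implementation: the function name is hypothetical, the minimum external distance is taken to be $P_{ext} - J_{ext}$ clamped at zero, and the numeric values are invented.

```python
def tightened_d_min(p_ext, j_ext, t_ff_min, t_ff_max):
    """Minimum distance between two activations of the AND-concatenated task.

    Applies only to the single-token case, which requires t_ff_max <= p_ext.
    """
    if t_ff_max > p_ext:
        raise ValueError("single-token case requires t_ff_max <= p_ext")
    # Earliest possible spacing of two external events (assumed closed form).
    delta_min_ext = max(0, p_ext - j_ext)
    # An early external token may have to wait for the internal token,
    # which cannot return sooner than t_ff_min after the last activation.
    return max(delta_min_ext, t_ff_min)


# Hypothetical numbers: the cycle is the limiting factor here (50 > 70 - 30).
print(tightened_d_min(p_ext=70, j_ext=30, t_ff_min=50, t_ff_max=65))
```

With a small external jitter, the external stream itself limits the spacing and the result falls back to $P_{ext} - J_{ext}$.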
C. Cycles with M required and M available initial tokens
Now consider a cycle where, after analysis of the corresponding feed-forward system, $(M-1) \cdot P_{ext} \le t^{max}_{ff} < M \cdot P_{ext}$. At least $M$ initial tokens are required in the cycle to fulfill sufficient condition 2. Let us assume that exactly $M$ tokens are available in the cycle. Let us further assume that sufficient condition 1 is also fulfilled.
According to equation 29, the maximum distance between $M + 1$ external input events is $\delta^{max}_{ext}(M+1) = M \cdot P_{ext} + J_{ext}$. From $t^{max}_{ff} < M \cdot P_{ext}$ it follows that a late external token never has to wait for an internal token. We can now continue reasoning as in section IX-B to arrive at the same result.
D. Cycles with more available initial tokens than are required
Additional tokens obviously cannot increase the time between the arrival of two cycle-internal tokens. Therefore, the results from sections IX-B and IX-C remain valid, with the exception of the optional calculation of $\delta^{min}_{act}$, which has to be adjusted.
In conclusion, a single-rate cycle can be treated as a feed-forward system during performance analysis if it satisfies the two sufficient conditions given at the beginning of this section. This is an important result, since it allows systems with cycles to be analyzed without having to consider phase information between events in different event streams. Therefore, the host of existing scheduling analysis techniques that do not consider phases remains applicable. Furthermore, we can continue to compose different scheduling analysis techniques to analyze the performance of heterogeneous architectures, which, as explained in section II, has many advantages over 'holistic' techniques.
X. DATA RATE TRANSITIONS WITH INTERVALS
Conditional control flow inside a task can lead to the situation that the task consumes or produces a non-constant number of tokens per execution, i.e. producer and consumer data rates become intervals: $[r_{p,min}, r_{p,max}]$ for a producer task and $[r_{c,min}, r_{c,max}]$ for a consumer task. Communication between tasks with data rate intervals can always lead to a data rate transition.
In general, we do not assume any correlation between the number of produced tokens and the number of consumed tokens. A lower consumer data rate of zero is not allowed, since without additional information a bounded communication buffer cannot be guaranteed. For the same reason, without additional information data rate intervals cannot be combined with AND-activation (and consequently cannot be used in cycles).
We would like to extend the results for data rate transitions with fixed data rates (section VII) to calculate consumer token functions and activating event functions in the presence of data rate intervals. Before doing so, an interpretation issue has to be addressed regarding the minimum number of tokens required for one activation of consumer task C. Our interpretation is as follows: $r_{c,max}$ tokens are required for one activation of C. This is because the total number of tokens consumed is not known a priori, and we do not want C to stall for lack of tokens. Stalling for lack of tokens is problematic for scheduling, since it can lead to deadlocks. This interpretation is consistent with most models of computation. However, the results from this section remain equally valid if only $r_{c,min}$ tokens are required for one activation of C.
Following the approach in section VII, to construct the upper consumer token function, we assume the maximum number of initial tokens at the input of C that does not lead to an activation. We also assume that as many additional tokens as possible arrive as soon as possible. To construct the lower consumer token function, zero initial tokens at the input of C are assumed, and as few additional tokens as possible arriving as late as possible. The upper and lower producer token functions are shown in Fig. 15(a). In Fig. 15(b), the upper producer token function has been shifted upwards by $r_{c,max} - 1$ in analogy to Fig. 10(b). Since the consumer data rate is not fixed, there exists no single upper and lower consumer token function. This is illustrated for our example in Fig. 16, with Fig. 16(a)-16(c) showing three possible scenarios. As can be seen, depending on the sequence of consumer data rates, a different function can dominate the other functions for a particular ∆t.
We could now determine all relevant sequences of consumer data rates and then define the upper consumer token function to be the maximum of all upper 'consumer token scenario functions'. Likewise, we could define the lower consumer token function to be the minimum of all lower 'consumer token scenario functions'. However, in the end we are interested in the upper and lower activating event functions for the consumer task. For any ∆t, it is not possible to obtain a larger number of activations than in the scenario in which the minimum number of tokens is consumed during each consumer execution (upper 'consumer token scenario function' in Fig. 16(a)). Likewise, it is not possible to obtain a smaller number of activations for any ∆t than in the scenario in which the maximum number of tokens is consumed during each consumer execution (lower 'consumer token scenario function' in Fig. 16(a)).
We conclude that in the presence of a data rate transition with intervals, the upper activating event function is calculated using the maximum producer data rate $r_{p,max}$, the minimum consumer data rate $r_{c,min}$ and the maximum number of initial tokens at the input of C. To calculate the lower activating event function, the minimum producer data rate $r_{p,min}$, the maximum consumer data rate $r_{c,max}$ and zero initial tokens at the input of C are used.
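The parameter selection stated above can be summarized in a few lines. This sketch only picks the parameters; it does not reproduce the token and event function calculations of section VII. All names are hypothetical, and the maximum number of initial tokens is assumed to be one less than the activation threshold, which defaults to $r_{c,max}$ in line with the interpretation adopted in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RateTransitionParams:
    producer_rate: int   # tokens produced per producer execution
    consumer_rate: int   # tokens consumed per consumer execution
    initial_tokens: int  # tokens assumed in the buffer at the start


def bounding_parameters(r_p_min, r_p_max, r_c_min, r_c_max, activation_threshold=None):
    """Return (upper, lower) parameter sets for the activating event functions."""
    if r_c_min < 1:
        raise ValueError("a minimum consumer data rate of zero is not allowed")
    if r_p_min > r_p_max or r_c_min > r_c_max:
        raise ValueError("interval bounds are out of order")
    # Assumption: r_c_max tokens are needed for one activation unless stated otherwise.
    threshold = r_c_max if activation_threshold is None else activation_threshold
    # Upper bound: most tokens produced, fewest consumed, fullest non-activating buffer.
    upper = RateTransitionParams(r_p_max, r_c_min, threshold - 1)
    # Lower bound: fewest tokens produced, most consumed, empty buffer.
    lower = RateTransitionParams(r_p_min, r_c_max, 0)
    return upper, lower


# Illustrative intervals: producer [2, 4], consumer [1, 3].
print(bounding_parameters(2, 4, 1, 3))
```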
Two sets of period and jitter values are now required for a conservative periodic with jitter approximation of the upper and lower activating event functions. During scheduling analysis, worst-case load has to be calculated using the upper activating event function, while best-case load has to be calculated using the lower activating event function. This additional complexity is easily handled in system-level performance analysis, since it is never necessary to maintain more than two sets of event model parameters per event stream. Alternatively, it would be possible to use a single sporadic event model, where the period and jitter parameters describe the upper activating event function and the lower activating event function is assumed to be zero for all ∆t. However, this would be a very conservative approximation.
For our example, we obtain two periodic event models with the following properties 
A. Rate Interval Transition Incurred Token Delay and Backlog
As in section VII, the worst-case delay is incurred for the smallest set of tokens $N_{min}$ that can remain in the rate-transition buffer after a consumer activation. This delay is maximized if subsequently as few tokens as possible arrive as late as possible. Consequently, in correspondence to inequation 26
In correspondence to inequation 27, the possible backlog due to data rate transitions, and hence the required buffer size, is

$$backlog \le r_{c,max} - 1 + r_{p,max}$$
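The backlog bound can be sanity-checked with a small token-level simulation. The sketch below is illustrative only and makes simplifying assumptions that are not part of the original analysis: the consumer is activated as soon as $r_{c,max}$ tokens are available, it consumes its tokens instantaneously (so execution buffering is ignored), and produced and consumed token counts are drawn at random from their intervals. Function names and rate values are hypothetical.

```python
import random


def rate_transition_backlog_bound(r_c_max, r_p_max):
    """Upper bound on the rate-transition backlog, as stated in the text."""
    return r_c_max - 1 + r_p_max


def simulate_max_backlog(r_p, r_c, steps=10_000, seed=0):
    """Largest buffer fill level observed over a random run.

    r_p, r_c: (min, max) data rate intervals of producer and consumer.
    """
    rng = random.Random(seed)
    r_p_min, r_p_max = r_p
    r_c_min, r_c_max = r_c
    tokens = 0
    worst = 0
    for _ in range(steps):
        # One producer execution adds between r_p_min and r_p_max tokens.
        tokens += rng.randint(r_p_min, r_p_max)
        worst = max(worst, tokens)
        # The consumer is activated while r_c_max tokens are available
        # and removes between r_c_min and r_c_max tokens per execution.
        while tokens >= r_c_max:
            tokens -= rng.randint(r_c_min, r_c_max)
    return worst


bound = rate_transition_backlog_bound(r_c_max=3, r_p_max=4)
observed = simulate_max_backlog(r_p=(2, 4), r_c=(1, 3))
print(observed, "<=", bound)
```

Under these assumptions at most $r_{c,max} - 1$ tokens remain after the consumer has run, so one further producer execution cannot push the fill level above the bound.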
XI. EXAMPLE
We will now combine most of what we have learned in a larger example. The system in Fig. 17 represents an SoC consisting of a micro-controller (uC), a digital signal processor (DSP) and dedicated hardware (HW), all connected via an on-chip bus (Bus). The HW acts as an interface to a physical system. It runs one task (sys), which issues actuator commands to the physical system and collects routine sensor readings. sys is controlled by controller task ctrl, which evaluates the sensor data and calculates the necessary actuator commands. ctrl is activated by a periodic timer (tmr) and by the arrival of new sensor data. This is an example of AND-activation in a cycle.
The physical system is additionally monitored by 3 smart sensors (s1 - s3), which produce data sporadically as a reaction to irregular system events. This data is registered by an OR-activated monitor task (mon) on the uC, which decides how to update the control algorithm. This information is sent to task upd on the DSP, which writes the updated controller parameters into shared memory.
The DSP additionally executes a signal-processing task (fltr), which down-samples, filters and compresses a stream of data arriving at input in, and sends the processed data via output out. This is an example of a data rate transition. All communication (with the exception of shared memory) is carried out by communication tasks c1 - c5 over the on-chip Bus.
We assume the following event models at system inputs.
| input | event model |
|---|---|
| s1 | sporadic, $P_{s1} = 1000$ |
| s2 | sporadic, $P_{s2} = 750$ |
| s3 | sporadic, $P_{s3} = 600$ |
| in | periodic, $P_{in} = 60$ |
| tmr | periodic, $P_{tmr} = 70$ |
A sporadic event model with period $P$ has the same upper event function as a periodic event model with the same period. However, the lower event function of the sporadic event model is zero for all ∆t. OR-concatenation of s1 - s3 results in the following sporadic activating event model for task mon: $P_{OR} = 250$, $J_{OR} = 500$.
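The period of the OR-concatenated stream can be reproduced by adding the event rates of the three sensor inputs. The sketch below assumes exactly that (the combined long-term rate is the sum of the individual rates) and uses a hypothetical function name; the jitter value $J_{OR}$ follows from the rules of section VI and is not recomputed here.

```python
from fractions import Fraction


def or_activation_period(periods):
    """Period of an OR-activated stream, assuming the input event rates add up."""
    total_rate = sum(Fraction(1, p) for p in periods)
    return 1 / total_rate


# Sensor periods from the input event model table.
print(or_activation_period([1000, 750, 600]))
```

Exact rational arithmetic is used so that the result is not affected by floating-point rounding.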
Let us assume that to function correctly, the system has to satisfy the following path latency constraints. In paths with data rate transitions, path latency is defined for causally dependent tokens [35].

| constraint # | source → sink | maximum latency |
|---|---|---|
| 3 | cycle (e.g. ctrl → ctrl) | 140 (= 2 · $P_{tmr}$) |
Constraint 3 implies 2 initial tokens in the cycle. Finally, we require maximum jitter at output out (the output period has to be 90 due to the data rate transition).

We will use static priority scheduling both on the DSP and on the Bus. In the following table we show performance analysis results for different priority assignments that were obtained using SymTA/S [34]. In the first two columns, tasks are ordered by priority, with the highest priority on the left. In the last four columns, we give the actual values of all four constrained quantities and indicate which constraints are met (√). Note that a latency along path 2 is only calculated if constraint 4 is met.

As can be seen, if we increase priorities of tasks belonging to path 2 until constraints 4 and 2 are met (experiment 3), we violate constraint 1. One solution to this dilemma is to increase the speed of one or several components. Let us repeat experiment 3, but this time let us increase the clock rate of uC to 120% of its original clock rate. This will speed up task mon, which has the largest core execution time of all tasks belonging to path 1. As can be seen in the following table, all constraints are now satisfied.

There are several other solutions to satisfy all timing constraints. Let us consider another, particularly interesting one. We will repeat experiment 4 in our set of original experiments, but this time we will reduce the clock rate of uC to 80% of its original clock rate. Surprisingly, as can be seen in the following table, again all constraints are met.

The satisfaction of constraint 1 is easily comprehensible. This constraint was already satisfied in experiment 4, and the now slower uC did not increase the latency beyond the given deadline. But why did the worst-case latencies on paths 2 and 3, as well as the jitter at output out, decrease compared to experiment 4? Due to the slower uC, task mon produces data with a larger minimum distance. This leads to larger minimum gaps between activations of task c1, and consequently to a smaller worst-case transient load by task c1 on the Bus. Since task c1 interrupts c2 - c4, the larger gaps between these interrupts allow the lower-priority tasks to execute earlier, thus reducing the worst-case latencies on paths 2 and 3, as well as the jitter at output out.
We call such an effect a scheduling anomaly [16], [17]. One of the major benefits of our compositional performance analysis is its ability to uncover scheduling anomalies. A simple worst-case response time analysis would not have caught this situation, since the minimum distance between activations of task c1 is determined by the best-case response time of task mon. Performance simulation probably would not have caught this situation either, since best cases are often not tested. In this particular example, failing to detect the anomaly does not hurt, since in the worst case a valid solution is rejected. However, in a situation where the anomaly works in the opposite direction and leads to a constraint violation, failure to detect it could be catastrophic.
XII. CONCLUSION
In this paper we first argued in favor of compositional performance analysis to verify timing constraints and optimize the implementation of complex, heterogeneous multi-processor embedded systems. However, existing solutions handle only single-rate, single-input tasks, and no cyclic task dependencies. These restrictions severely inhibit the applicability of system-level performance analysis to realistic embedded applications, which often exhibit more complex task dependencies. Therefore, an extension of compositional performance analysis to applications with complex task dependencies is highly desirable.
Compositional performance analysis is based on the propagation of activating event models. In the main part of this paper, we therefore devised rules to calculate activating event models for tasks in the presence of AND-activation, OR-activation, data rate transitions, and cyclic task dependencies. We also considered tasks with non-constant producer or consumer data rates. Together, these properties cover a substantially wider range of embedded applications than previous models. Our calculations can be applied in the context of compositional performance analysis, thus enabling performance analysis of complex embedded applications.
In the last section, we integrated our calculations into the compositional performance analysis framework SymTA/S and demonstrated their use in the performance analysis of a larger embedded system that included OR-activation, AND-activation, a data rate transition and a cyclic task dependency. We performed several experiments with different system configurations. For each configuration we verified whether all timing constraints were met. We obtained reliable performance results in very short time (about half a second in total for all six experiments in batch mode on a standard Pentium III PC). We conclude that analytical techniques are now applicable also to complex embedded applications, which has a huge potential to speed up reliable performance validation and to enable much more thorough design-space exploration.
