Compositional Performance Analysis for Complex Embedded Applications by Jersak, Marek
Compositional Performance Analysis
for Complex Embedded Applications
MAREK JERSÁK
Dissertation, 2005
Institute of Computer and Communication Network Engineering
Department of Electrical Engineering and Information Technology
Technical University of Braunschweig
Braunschweig, Germany
 
 
 
 
 
 
 
Compositional Performance Analysis 
for Complex Embedded Applications 
 
 
 
 
 
 
 
Von der Gemeinsamen Fakultät für Maschinenbau und Elektrotechnik  
der Technische Universität Carolo-Wilhelmina zu Braunschweig 
 
zur Erlangung der Würde 
 
eines Doktor-Ingenieurs (Dr.-Ing.) 
 
genehmigte Dissertation 
 
 
  
 
 
von: Marek Jersak 
 
aus: Bern, Schweiz 
 
 
 
eingereicht am: 28. Mai 2004 
 
mündliche Prüfung am: 24. Juni 2004 
 
 
Referent: Prof. Dr.-Ing. Rolf Ernst 
 
Referent: Prof. Dr.-Ing. Lothar Thiele 
 
Vorsitzender: Prof. Dr.-Ing. Harald Michalik 
 
 
 
 
2005 
 
Compositional Performance Analysis
for Complex Embedded Applications
MAREK JERSA´K
Institute of Computer and Communication Network Engineering
Department of Electrical Engineering and Information Technology
Technical University of Braunschweig
Braunschweig, Germany

Abstract
Performance verification is key during the design of embedded sys-
tems. It must be ensured that a system meets all performance con-
straints, in particular hard real-time constraints. The system must also
be optimized for cost, size, power-consumption and flexibility to succeed
in the market. This requires evaluating the performance impact of each
design decision.
Performance verification is challenging, in particular when a large
numbers of tasks is implemented on a communication-centric heteroge-
neous multi-processor architecture with dynamic task and communica-
tion scheduling. Function- and subsystem-integration introduce many
performance dependencies which are extremely hard to track. Simula-
tion and test are increasingly insufficient for reliable performance verifi-
cation, since identification of all performance corner-cases and of stimuli
sequences to exercise them is unrealistic for complex systems. They
are also very time-consuming and require executable code, making them
poorly suited for early design-space exploration.
A promising alternative is formal system-level performance analysis.
It calculates best-case and worst-case performance bounds and thus
guarantees full corner-case coverage, making it a reliable approach to
performance verification. It requires neither executable code, nor a large
test-bench. Instead, only models of those system properties that influ-
ence performance are needed, as well as models for the possible timing
of task activations. Estimates for those relevant properties and timing
information can be obtained early in the design, and can be refined as
design progresses. Performance analysis also runs dramatically faster
than simulation, making it ideal for design-space exploration.
Recently, our group has developed a compositional approach which
enables performance analysis for heterogeneous multi-processor architec-
tures. The novelty of this approach lies in its scalability and its support
iv ABSTRACT
for subsystem-integration, which are required for large systems. How-
ever, the approach uses a simple application model where one activation
of a task depends on exactly one execution of exactly one predecessor
task. In realistic embedded applications, a task may require a differ-
ent amount of data per execution than produced by a predecessor task,
leading to multi-rate systems. Communication may also be conditional,
leading to execution-rate intervals. Furthermore, a task may consume
data from multiple task inputs. Then, task activation timing is a func-
tion of the possible arrival timing of all required input data. Applications
may also exhibit cyclic task dependencies. Any useful system-level per-
formance analysis framework must be able to handle the complexity of
real-world applications.
In this work, transformations are developed between the variety of
task dependencies that are found in complex embedded applications,
and the possible timing of activating events at the implementation level,
as required by our compositional performance analysis approach. Specif-
ically, it is shown how to calculate activating event timing in the presence
of data rate transitions (with fixed rates and rate intervals) and multiple
activating inputs (AND- or OR-concatenated). This includes the anal-
ysis of cyclic task dependencies (e. g. in a control loop). If buffering of
communicated data is required, then the required buffer sizes as well as
minimum and maximum buffering delay are also calculated.
As a further extension, the type of communicated data is captured
and considered during analysis, since it can heavily affect processing
times and communication load, leading to substantially tighter analysis
bounds. Phase information between different task activations can also
narrow possible performance bounds, and is considered as well. Finally,
to demonstrate a link to real-world design-flows, it is shown how to
apply the presented methodology to designs specified in the industry-
standard modeling tool Simulink. This additionally requires relaxing the
perfectly synchronous Simulink model of computation into a dataflow
representation.
In summary, the methodology presented in this thesis allows to apply
compositional performance analysis to obtain tight performance bounds
for complex applications that are mapped onto heterogeneous multi-
processor architectures. Due to analysis speed, the methodology also
allows rapid design-space exploration to optimize the implementation
of realistic embedded applications while satisfying all critical timing
constraints. The presented methodology has been implemented in the
system-level performance analysis framework SymTA/S.
Kurzfassung
Moderne eingebettete Systeme fu¨hren eine Vielzahl unterschiedlicher
Anwendungen auf heterogenen, programmierbaren Multiprozessorsyste-
men aus. Beispiele sind Multimedia-Plattformen oder verteilte Rege-
lungssysteme im Automobil. Eingebettete Anwendungen mu¨ssen ha¨ufig
in Echtzeit ausgefu¨hrt werden. Wa¨hrend des Entwurfs muss daher die
Performanz, insbesondere das Echtzeitverhalten, der Systemimplemen-
tierung verifiziert werden. Performanzverifikation auf Systemebene ist
eine Herausforderung fu¨r komplexe Systeme, in denen eine Vielzahl kom-
munizierender, gemischt transformativ-reaktiver Hardware- und Soft-
wareprozesse auf spezialisierten Multiprozessorsystemen mit verschiede-
nen Bussen samt Echtzeitbetriebssystemen und arbitrierenden Buspro-
tokollen ausgefu¨hrt wird.
Stand der Technik ist simulationsbasierte Performanzverifikation. Auf-
grund der Vielzahl von Abha¨ngigkeiten ist es jedoch praktisch unmo¨glich,
alle kritischen Performanzrandfa¨lle zu simulieren. Performanzsimulation
ist daher potentiell unzuverla¨ssig. Sie ist auch sehr zeitaufwa¨ndig und
beno¨tigt ausfu¨hrbaren Code, und daher wenig geeignet zur fru¨hen und
schnellen Entwurfsraumexploration. Eine viel versprechende Alterna-
tive ist die so genannte Performanzanalyse. Hierbei werden ku¨rzeste
und la¨ngster Antwortzeiten von Prozessen berechnet, basierend auf bes-
timmten Prozesseigenschaften, Schedulingeigenschaften, sowie unter Be-
ru¨cksichtigung aller mo¨glichen Aktivierungszeitpunkte mittels so genan-
nter Ereignismodelle, z.B. periodisch mit einer maximalen Abweichung
(Jitter), oder sporadisch mit einem Mindestabstand. Dieses Vorgehen
hat attraktive Vorteile: Die berechneten zeitlichen Grenzen sind zu-
verla¨ssig; performanzrelevante Eigenschaften ko¨nnen zwecks fru¨hzeitiger
Exploration schon vor der Implementierung gescha¨tzt werden; Analyse
la¨uft gewo¨hnlich sehr schnell. Ku¨rzlich wurde am Institut fu¨r Datentech-
nik und Kommunikationsnetze ein kompositionales Verfahren entwickelt,
vi KURZFASSUNG
mit dem Performanzanalyse fu¨r beliebig komplexe und heterogene Ar-
chitekturen anwendbar wird.
Problemdefinition: Unser kompositionaler Ansatz zur Performanz-
analyse ist bisher nicht fu¨r reale eingebettete Anwendungen mit kom-
plexen Prozessabha¨ngigkeiten anwendbar. Dies liegt daran, dass bisher
ein sehr einfaches Aktivierungsmodell verwendet wird, bei dem das Been-
den eines Prozesses sofort zur Aktivierung eines abha¨ngigen Prozesses
fu¨hrt. In realen Anwendungen findet sich aber eine Vielzahl verschiedener
Prozessabha¨ngigkeiten. Ein konsumierender Prozess kann eine andere
Menge von Daten pro Aktivierung beno¨tigen als von einem produzieren-
den Prozess pro Ausfu¨hrung erzeugt wird. Dies fu¨hrt zu Multiraten-
systemen. Daten ko¨nnen auch bedingt erzeugt werden, was zu Daten-
ratenintervallen fu¨hrt. Weiter kann ein Prozess Daten von verschiedenen
Produzenten konsumieren. Dann sind die mo¨glichen Aktivierungszeit-
punkte des Prozesses eine Funktion der mo¨glichen Ankunftszeitpunkte
aller beno¨tigten Daten. Schließlich ist das Verhalten von Prozessen im
Allgemeinen stark abha¨ngig von der Art der empfangenen Daten. Jede
in der Praxis verwendbare Performanzanalyse-Methodik muss die Kom-
plexita¨t realer eingebetteter Anwendungen beherrschen.
Entwickelte Lo¨sung: In der eingereichten Dissertation werden Trans-
formationen entwickelt zwischen der Vielzahl verschiedener Prozessab-
ha¨ngigkeiten auf der Anwendungsebene, und Modellen fu¨r mo¨gliche Ak-
tivierungszeitpunkte auf der Implementierungsebene, wie sie von ex-
istierenden Performanzanalysen beno¨tigt werden. Konkret wird dar-
gestellt, wie man aktivierenden Ereignismodelle berechnen kann unter
Beru¨cksichtigung von Datenratentransitionen ohne Intervalle, Datenra-
tentransitionen mit Intervallen, sowie von mehreren aktivierenden Ein-
ga¨ngen, die UND- oder ODER-verknu¨pft sein ko¨nnen. Dies schließt
insbesondere die Analyse zyklischer Prozessabha¨ngigkeiten ein, wie sie
z.B. in Regelschleifen ga¨ngig sind. Falls Daten gepuffert werden mu¨ssen,
werden die beno¨tigten Puffergro¨ßen und die maximalen, durch Pufferung
verursachten Verzo¨gerungen berechnet.
Daru¨ber hinaus wird die Art von kommunizierten Daten sowie zeitliche
Korrelationen zwischen Aktivierungszeitpunkten betrachtet und fu¨r die
Analyse erfasst, da diese einen starken Einfluss auf Ausfu¨hrungszeiten
und Kommunikationslast haben ko¨nnen, und deren Beru¨cksichtigung
eine deutlich engere Berechnung von Grenzen der mo¨glichen System-
performanz ermo¨glicht. Schließlich wird gezeigt, wie die entwickelte
Methodik fu¨r Applikationen angewendet werden kann, die mittels des
Standardwerkzeugs Simulink entworfen wurden. Hierzu muss in einem
zusa¨tzlichen Schritt zuna¨chst das streng synchrone Simulationsmodell
von Simulink relaxiert werden.
vii
Zusammenfassend ermo¨glicht die in der eingereichten Dissertation
entwickelte Methodik, existierende zuverla¨ssige Verfahren zur Perfor-
manzverifikation auf komplexe, heterogene, gemischt transformativ-reak-
tive Applikationen anzuwenden, die auf komplexen, heterogenen Ar-
chitekturen ausgefu¨hrt werden. Aufgrund der Analysegeschwindigkeit
erlaubt die Methodik eine schnelle Entwurfsraumexploration, um die Im-
plementierung realistischer eingebetteter Anwendungen zu optimieren,
und dabei Echtzeitfa¨higkeit sicherzustellen. Die theoretischen Ergeb-
nisse wurden im Performanzanalyse-Werkzeug SymTA/S umgesetzt.

Acknowledgments
This thesis is the result of my research at the Institute of Computer
and Communication Network Engineering (IDA) at the Technical Uni-
versity of Braunschweig, Germany.
I want to express my gratitude to my advisor Professor Rolf Ernst
for his trust and the freedom he gave me to find my way; for sharing
his visions, for countless fruitful discussions, and for generously funding
my travel to so many conferences. I want to thank Professor Lothar
Thiele for helping me focus my research, and for agreeing to co-examine
this work. I want to thank Professor Harald Michalik for chairing the
examination committee.
Many thanks to Dirk Ziegenbein and Kai Richter, my wonderful, fun
and challenging colleagues and friends in the SPI-Project. Kai again for
our fruitful collaboration in the SymTA/S project, and for sharing the
dream about its future. Many thanks to my colleagues Arne Hamann,
Rafik Henia and Razvan Racu who joined the SymTA/S-Project and
enriched it both scientifically and personally. Thanks to all other re-
searchers and students who I had the pleasure to work with at IDA, the
University of Paderborn, ETH Zu¨rich and other universities. Thanks to
the IDA staff for providing professional service otherwise only seen in
large companies.
I want to thank my parents for their love, support and determination
to guide me towards science. Most of all, I want to thank my wonderful
wife Silke for our deep love and the joy to pursue our goals together.
Her support and understanding for all the effort I invested made it easy
to write this thesis while fully enjoying our personal life.
To all you people who influenced it – without you, this thesis would
not have been possible!

Contents
1 INTRODUCTION 1
1.1 Motivation 2
1.2 Outline 5
2 SYSTEM-LEVEL DESIGN 7
2.1 Executable Specifications 9
2.1.1 Dataflow Process Networks 10
2.1.2 Globally Synchronous Models 12
2.1.3 Globally Asynchronous Models 12
2.1.4 Multi-Language Design 13
2.1.5 Commercial Tools 13
2.1.6 System-Level and Implementation Languages 14
2.2 Target Architectures 14
2.2.1 Systems-on-Chip 15
2.2.2 Distributed Systems 16
2.3 Performance Constraints 17
2.4 Function Implementation and Integration 17
2.4.1 Task Activation and Scheduling 18
2.5 Integration Problems 20
2.6 The SPI Project 21
2.7 Summary and Conclusion 24
3 PERFORMANCE ANALYSIS 27
3.1 Scheduling Analysis 27
3.1.1 Event Models 28
3.2 Single-Processor Scheduling Analysis 31
xii Contents
3.3 Homogeneous Multi-Processor Scheduling Analysis 32
3.4 Holistic Scheduling Analysis 33
3.5 Compositional Performance Analysis 34
3.6 Composition Using Standard Event Models 35
3.6.1 Task Model 35
3.6.2 Output Event Model Calculation 36
3.6.3 Analysis Composition 39
3.6.4 Starting Point 40
3.6.5 System-Level Analysis Iteration 40
3.6.6 Event Stream Adaptation 41
3.6.7 Communication Buffers 42
3.7 Composition Using Packet Flows and Resource Streams 43
3.8 Performance Analysis for Complex Applications 45
3.9 Other Performance Analysis Techniques 46
3.9.1 Model Checking-Based Scheduling Analysis 46
3.9.2 Tightly Coupled System Model and Scheduling
Analysis 47
3.10 Summary and Conclusion 48
4 TASKS WITH MULTIPLE INPUTS 51
4.1 AND-Activation 52
4.1.1 Calculation of Activating Event Functions 53
4.1.2 AND-Activation Incurred Delay and Backlog 55
4.1.3 AND-activation for periodic with jitter Input Event
Models 56
4.1.4 Example 58
4.2 OR-Activation 60
4.2.1 Calculation of Activating Event Functions 60
4.2.2 OR-activation for periodic with jitter Input Event
Models 61
4.2.3 Example 65
4.2.4 OR-activation for sporadic with jitter Input Event
Models 66
4.3 Combination of AND- and OR-Activation 66
4.4 Multiple Outputs 67
4.5 Summary and Conclusion 67
Contents xiii
5 RATE TRANSITIONS BETWEEN TASKS 69
5.1 Fixed Data Rates 70
5.1.1 Token Functions 70
5.1.2 Calculation of Activating Event Functions 72
5.1.3 Data Rate Transitions for periodic with jitter Event
Models 75
5.1.4 Example 76
5.1.5 Data Rate Transitions for sporadic with jitter Input
Event Models 77
5.1.6 Rate Transition Incurred Delay and Backlog 77
5.1.7 Special Cases 78
5.2 Data Rate Intervals 79
5.2.1 Data Rate Transitions for periodic with jitter Event
Models 82
5.2.2 Data Rate Transitions for sporadic with jitter Input
Event Models 83
5.2.3 Rate Interval Transition Incurred Token Delay and
Backlog 83
5.3 Combination with Multiple Inputs 83
5.4 Chaining of Rate Transitions, Multiple Inputs and EAFs 85
5.5 Combined Token Buffering 86
5.6 Summary and Conclusion 88
6 CYCLIC TASK DEPENDENCIES 91
6.1 Single-Rate Cycles 92
6.2 Analysis Idea 93
6.3 Cycles with one initial token 94
6.3.1 Buffer Calculation 96
6.4 Cycles with two or more initial tokens 97
6.4.1 Buffer Calculation 99
6.5 Analyzability Condition 99
6.6 Cycles with more available than required initial tokens 100
6.7 Cycles with fewer available than required initial tokens 100
6.8 Self-Cycles 101
6.9 System Startup 101
6.10 Nested Cycles 102
6.11 Cycles with Multiple Inputs 103
6.12 Multi-Rate Cycles 106
xiv Contents
6.13 Summary and Conclusion 106
7 SYSTEM-LEVEL ANALYSIS EXAMPLE 107
8 CONTEXT-AWARE ANALYSIS 115
8.1 Intra Event Stream Contexts 116
8.2 Inter Event Stream Contexts 118
8.3 Combination of Contexts 119
8.4 Example 119
8.5 Summary and Conclusion 122
9 PERFORMANCE ANALYSIS FOR
SIMULINK 123
9.1 Simulink Model of Computation 123
9.2 Code Generation from Simulink 125
9.2.1 Tick Scheduling 125
9.2.2 Rate Monotonic Scheduling 126
9.3 Simulink Code Generation Issues 127
9.3.1 Multi-Processor Implementation 127
9.3.2 Multi-Language Design 128
9.4 Model Relaxation 128
9.4.1 Single-Rate Designs 129
9.4.2 Multi-Rate Designs 129
9.4.3 Implications for Scheduling 131
9.5 Scheduling Analysis 133
9.5.1 Schedulability Condition 134
9.5.2 Analysis Approach 134
9.5.3 Triggered and Enabled Blocks 137
9.6 Summary and Conclusion 138
10 SUMMARY AND OUTLOOK 139
List of Figures 143
List of Tables 148
Bibliography 149
Chapter 1
INTRODUCTION
An embedded system is an application-specific computer system which
is embedded in a device that itself often does not appear as a computer.
Typical devices containing embedded systems are mobile phones, multi-
media devices, automobiles or Internet-routers. Usually, the system en-
vironment imposes timing constraints on the embedded system. I. e. the
system not only must produce the right results, it must produce them
at the right time. In this case, we speak of an embedded real-time sys-
tem. For example, a mobile phone must process a certain amount of
voice or data packets within a time frame set by a mobile communica-
tion standard. Such hard timing constraints must be satisfied under all
operating conditions. Safety-critical systems are an even more extreme
case. For example, an airbag in a vehicle must fire within a guaranteed
time interval after impact detection to optimally protect a passenger.
A modern embedded real-time system executes a set of complex,
specialized applications on a specialized architecture. Designing such
a system is an elaborate and costly task. A particularly challenging
design-step is the implementation of an executable function specifica-
tion, which is target hardware independent, on a real architecture with
a limited number of specialized resources and finite hardware speed. At
this point, performance becomes a major issue as the implementation
has to meet all timing constraints. The design-step during which system
performance is validated against timing constraints is called performance
verification.
Architecture and implementation also need to be optimized for cost,
power consumption, extensibility and other aspects. Good architecture
and implementation decisions can mean the difference between a prof-
itable product and one that cannot be sold. Ever increasing function
2 Introduction
and hardware architecture complexity lead to a huge number of im-
plementation alternatives. Each implementation decision influences the
performance of the system. Therefore, the performance impact of each
implementation decision needs to be evaluated and compared against
performance constraints and optimization goals. This process is often
called design-space exploration.
Design-space exploration and performance verification have a lot in
common. While the goal of design-space exploration is to rapidly esti-
mate the impact of implementation decisions on system performance, in
order to identify promising candidates, the goal of performance verifica-
tion is to validate the performance of a chosen system implementation
in detail against performance constraints that are specified in the func-
tion specification. Consequently, similar techniques can be applied to
address both problems.
1.1 Motivation
System-level performance verification is challenging for complex sys-
tems, in particular when a large number of communicating hardware and
software tasks are executed on heterogeneous multi-processor/multi-bus
architectures with real-time operating systems (RTOSes) and commu-
nication via shared busses. The integration of different functions on
a heterogeneous architecture introduces a confusing number of run-time
interdependencies that heavily affect system performance and can be ex-
tremely hard to track [97]. For example, a task may be delayed beyond
its deadline due to a transient overload on a resource, which occurred
because a burst of data arrived from a different resource. Performance
verification has been identified as one of the top 3 system-level design
issues in the 2001 International Technology Roadmap for Semiconduc-
tors [99].
The key in performance verification is to cover performance corner-
cases. A performance corner-case is the smallest or largest value that
can be observed for a performance property during regular operation
of a system, independent of the operation history. It must fall within
the bounds defined by hard performance constraints. During design-
space exploration, corner-cases are equally important since these are the
cases for which good estimates of system performance are most urgently
required.
State of the art in performance verification are cycle-true simulation,
as well as test using prototype hardware. These approaches are in-
creasingly insufficient for the reliable verification of hard performance
constraints, since identification of all performance corner-cases and of
stimuli sequences to exercise them is unrealistic for sufficiently large
Motivation 3
and complex systems. Simulation and test are thus potentially unreli-
able for system-level performance verification. They are also very time-
consuming and require executable code, which makes them poorly suited
for early design-space exploration.
Formal system-level performance analysis is an interesting alternative
to simulation and test for performance verification and design-space ex-
ploration. It calculates best-case and worst-case performance bounds
and thus guarantees full corner-case coverage, making it a reliable ap-
proach. Furthermore, performance analysis requires neither executable
code, nor a large test-bench. Instead, only models of those system prop-
erties that influence performance are needed, as well as models for the
possible timing of task activations. Estimates for those relevant prop-
erties and timing information can be obtained early in the design, and
can be refined as design progresses. Since performance analysis also runs
dramatically faster than simulation [109, 50], it is ideally suited for early
and thorough design-space exploration.
If performance analysis has so many advantages over simulation, why
isn’t is widely used? Performance analysis faces three major challenges.
Firstly, a performance analysis framework must be able to handle a va-
riety of performance interdependencies resulting from the complexity of
todays communication-centric, heterogeneous multi-processor architec-
tures. Secondly, architecture and timing models as well as algorithms
used for performance analysis must be sufficiently accurate to calculate
performance bounds are both conservative and tight. Thirdly, a perfor-
mance analysis framework must be able to handle a variety of complex
functional interdependencies that are common in state-of-the-art embed-
ded applications.
Recently, our group has developed a compositional approach which
enables performance analysis for heterogeneous multi-processor archi-
tectures [97, 98]. The novelty of this approach lies in its scalability and
support its for subsystem-integration, which are required for large sys-
tems. However, the approach uses an simple application model where
one activation of a task depends on exactly one execution of exactly one
predecessor task.
In realistic applications, a consumer task may require a different
amount of data per execution than produced by a producer task, leading
to multi-rate systems. Communication may also be conditional, leading
to execution-rate intervals. Furthermore, a task may consume data from
multiple task inputs. Then, task activation timing is a function of the
possible arrival timing of all required input data. Applications may also
exhibit cyclic task dependencies, which require analysis extensions to
accurately consider correlations induced by the cycle.
4 Introduction
HW
uC DSPBus
1
1
OR
mon
s1
s2
s3
1
1
1
1
sys
c1
c2
c3
c4
c5
upd
ctrl
fltr
tmr
AND
in
out
[0,1] 1 1 1
1 1 1 1
1 1
1
1
1
2 2 2 3
1 1
1
1
1
ex
Figure 1.1. Example of an embedded real-time system with a variety of task depen-
dencies
Consider the example in Fig. 1.1. It shows an embedded system con-
sisting of a micro-controller (uC ), a digital signal processor (DSP) and
dedicated hardware (HW ), all connected via a system bus (Bus). A
number of computation and communication tasks share theses compo-
nents, and display a variety of functional dependencies. Task mon is an
OR-activated task, which reacts to new input data at any of its inputs.
Task ctrl is AND-activated, and reacts once new input data is available
at all of its inputs. Task ctrl and sys form a functional cycle together
with communication channels c2 and c3, which strongly correlates the
timing of data produced and consumed by task ctrl. Data arrives at the
input of task fltr in groups of 2 tokens, but the task requires 3 tokens for
one execution. Task mon produces output data conditionally. All these
functional dependencies influence the possible timing of task activations,
and consequently system performance. Additionally, non-functional de-
pendencies between tasks due to sharing of the Bus and the DSP also
influence performance. Any useful system-level performance analysis
framework must be able to consider both the variety of functional, and
the variety non-functional dependencies in complex embedded systems.
Our compositional performance analysis approach can currently cope
with the heterogenous architecture in our example, but it cannot be
applied due to the complex application structure. On the other hand,
applications like this are easily modeled in a variety of system-level lan-
guages and tools. Obviously, there is a missing link in system-level design
between application models and performance analysis. This thesis pro-
Outline 5
vides a solution how to close this gap, in order to enable compositional
performance analysis for real-world embedded applications.
As a further extension, the type of communicated data, which is
known from the application model, is captured and considered during
analysis, since it can heavily affect processing times and communication
load, leading to substantially tighter analysis bounds. Phase information
between different task activations can also narrow possible performance
bounds, and is considered as well. Finally, to demonstrate a link to
real-world application modeling, it is shown how to apply the presented
methodology to designs specified in the industry-standard modeling tool
Simulink.
1.2 Outline
The remainder of this thesis is organized as follows.
Chapter 2 starts with an overview of the state-of-the-art in system-
level design for embedded systems. We emphasize the complexity and
heterogeneity of modern embedded applications, and cover the resulting
variety of modeling languages and tools. We also illustrate the need for
specialized multi-processor target architectures. We then explain why
function and subsystem integration leads to complex performance depen-
dencies, in particular for systems containing a large amount of embedded
software. These performance problems motivate formal system-level per-
formance analysis. We illustrate the role of performance analysis during
system-level design. The chapter concludes with a brief presentation of
the SPI model which comprehensively captures application properties
for the purpose of system-level performance analysis.
Chapter 3 gives an overview of the state-of-the-art in formal system-
level performance analysis. Scheduling analysis, which calculates worst-
and best-case response times for tasks sharing a single component, forms
the basis of sophisticated performance analysis approaches for multi-
processor architectures. We present several of these approaches. Our
main focus is on a compositional performance analysis approach which
allows to combine different local scheduling analysis techniques to enable
performance analysis for arbitrary architectures. A missing link is then
identified between realistic application models on one hand, and compo-
sitional performance analysis on the other. While realistic application
models do not adequately take architecture into account, compositional
performance analysis works only for very simple applications.
The following three chapters describe our novel approach to enable
compositional performance analysis for realistic applications with a va-
riety of task dependencies. Chapter 4 focuses on tasks with multiple
activating inputs. We argue that AND- and OR-activation are the only
6 Introduction
meaningful concatenations of multiple activating inputs. We calculate
possible activation timing, as well as required communication buffers
and incurred token delay for both types of activation dependencies. The
chapter concludes with a brief discussion of combination of AND- and
OR-activation.
Chapter 5 focuses on data rate transitions between a producer and
a consumer task. We start with transitions between fixed data rates,
and then extend our ideas to transitions between data rate intervals, in
order to model conditional communication. In both cases, we calculate
possible activation timing, as well as required communication buffers and
incurred token delay. We then combine the results from this chapter with
the results from chapter 4 to enable analysis of tasks with both data rate
transitions and multiple inputs. The chapter concludes with a discussion
of the required communication buffers.
Chapter 6 shows how to consider cyclic task dependencies during com-
positional performance analysis. The presented solution for single-rate
cycles allows to reuse previous results with only minimal modifications,
thus avoiding restricting the set of analyses that can be composed. The
chapter concludes with an outlook on analysis of multi-rate cycles.
In chapter 7, we apply the results from the previous chapters to an-
alyze the performance of our initial system-on-chip example. We show
how a designer is guided through the analysis and performance verifica-
tion process, and demonstrate the potential of our approach for design-
space exploration.
Chapter 8 shows how existing system-level models and analysis tech-
niques can be extended to exploit certain correlations between consecu-
tive task activations. Specifically, we consider sequences of activations of
the same task (intra contexts), and possible phases between activations
of different tasks (inter contexts). We show that significantly tighter
analysis results can be obtained when contexts are considered.
Chapter 9 gives an outlook on integrating system-level performance
analysis into a design-flow starting from the industry-standard modeling
tools Simulink. The limitations of schedulers synthesizable from Simu-
link today are discussed first. Then, an improved solution is presented
where a Simulink model is first transformed into a suitable task-graph
representation, which can then be scheduled more freely, and addition-
ally allows to apply the performance analysis techniques developed in
chapters 4 - 6.
Chapter 10 concludes the thesis with a summary and an outlook on
future research directions.
Chapter 2
SYSTEM-LEVEL DESIGN
This chapter presents a brief overview over system-level design. It
is needed to understand the role of formal system-level performance
analysis and its integration into a system-level design-flow, which is the
focus of this work. We cannot give a full account in a few pages of
the many design-flows that exist. Therefore, we mainly point out key
commonalities. We focus on problems that motivate the integration of
formal system-level performance analysis into system-level design.
sld
specification
exploration
implementation
verification
Figure 2.1. Coarse-grain view of a generic system-level design flow
A coarse-grain view of a generic system-level design flow is shown in
Fig. 2.1. Starting from a requirements specification, the first system-
level design step is to build an executable specification that models the
intended system functionality [34]. Often, multiple design-teams work
on different system functions. Some functions may also be re-used from
previous designs (legacy code), or purchased as intellectual property (IP)
from a 3rd party. In parallel to function design, a test-bench for system
verification is developed. The executable specification is combined with
8 System-Level Design
the test-bench in a simulation environment in order to test its correct-
ness. Certain critical functions may also be verified using formal model
checking techniques [21]. During this phase, the target architecture is
idealized, e. g. by assuming zero or unit computation and communication
latencies, perfect parallelism, unlimited buffers etc. Such idealizations
are attractive at this stage, since they allow to focus on function design
and verification without having to worry about implementation details
and implementation verification.
Once the executable specification satisfies functional requirements,
the focus shifts to target architecture design and function implementa-
tion. The target architecture is usually at least partially given due to the
need to use standardized components and evolve previous designs, and
constrained by the intended cost, size etc. of each unit. The challenge
becomes to integrate all functions on the target architecture under the
constraints imposed by a limited number of resources running at finite
speeds.
At this stage, implementation verification becomes a key issue. On
one hand it must be assured that different subsystems correctly func-
tion together. On the other hand, it must be assured that the system
meets all performance constraints. The test-bench is re-used for this pur-
pose, together with additional tests. The implementation is also rated
with respect to optimization goals. Problems at this stage can often be
fixed through implementation changes, but sometimes also necessitate
re-design of parts of the function or changes to the target architecture.
This is expressed e. g. in the Y-model (Fig. 2.2) which is often used in
hardware/software co-design [28].
y_model
P1
compiler
(manual) code adaptation 
• memory access optimization
• memory layout optimization
• parallelization (incl. SIMD)
• performance
• power consumption
• cost
P1‘
HW architecture
implementation 
language
object code
analysis
architecture adaptation
• instruction set optimization
• data path optimization ...
Figure 2.2. Y-model for hardware/software co-design
Executable Specifications 9
Implementation changes and implementation verification form several
cycles which are iterated as often as necessary to reach an acceptable
implementation. Such iterations can be costly, in particular if a problem
discovered late in the design can only be fixed through a major functional
or architecture change, which then impacts many other parts of the
system.
Typically, only performance problems have the potential for such a
large-scale impact. Functional problems tend to be confined to inter-
faces between functions or stem from incomplete or contradicting re-
quirements, and usually can be fixed locally. This is because it is well-
established design practice to modularize functions such that most inter-
dependencies are localized, and interaction between modules is kept to a
well-defined minimum. On the other hand, there is no simple approach
to ‘performance modularization’ or budgeting, since performance inter-
dependencies are introduced between functionally independent tasks due
to resource sharing.
Performance requirements are a key motivator for specialized embed-
ded architectures1. Good architecture and implementation decisions can
mean the difference between a profitable product and one that cannot be
sold. Therefore, it is desirable to thoroughly evaluate architecture and
implementation alternatives to arrive at an attractive solution. This
process is generally called design-space exploration. Obviously, design-
space exploration is limited if a system has to be fully implemented
before its performance can be evaluated. Therefore, industry strives to
work with simulation-based virtual prototypes as soon as possible, which
allow major changes relatively quickly and at relatively low cost. How-
ever, turnaround-times are still in the range of hours to days.
2.1 Executable Specifications
An executable specification is an executable model of the desired sys-
tem function. In this section, typical characteristics of executable spec-
ifications are described.
From a system-level perspective, an executable specification is divided
into multiple communicating functions. The coordination, meaning the
communication structure as well as activation dependencies between
functions [123] can be complex and consequently are best formalized.
A formal representation of these system-level dependencies is called a
model of computation (MOC) [100]. A MOC can be seen as the ‘laws
of physics’ which govern the interaction of elements of the model [60].
1A second major motivator, which is not considered in this thesis is power consumption.
10 System-Level Design
There is no single best MOC for all modeling problems. Instead, several
established MOCs exist that are tailored towards a specific application
properties (e. g. predominantly reactive vs. predominantly transforma-
tive, data or control dominated) [60, 123]. Lee has identified a set of
‘basic’ MOCs with fundamentally different key properties [64, 60].
Different research and commercial modeling tools are available which
either implement these basic MOCs directly, or as a variation or com-
bination. Modeling tools are typically graphical and often tailored to-
wards specific application domains (e.g. signal processing, communica-
tion protocols, control systems). Major benefits of mature, specialized
tools include quick design entry, availability of large function-component
libraries, excellent support for simulation, debugging and optimization,
as well as automatic code-generation.
We will now take a closer look at some important MOCs and tools
that implement them. The focus will be on the variety of functional
dependencies that can be modeled, as well as on suitability for imple-
mentation on complex multi-processor architectures. A more extensive
overview can be found in [123].
2.1.1 Dataflow Process Networks
Dataflow process networks [63], a special case of Kahn process net-
works [55], operate on streams of data. Processes communicate via
FIFO-buffers with unrestricted size, using non-blocking write and block-
ing read semantics. Processes are decomposed into sequences of indivis-
ible quantum of computation called actor firings. During each firing
an actor consumes input tokens and produces output tokens, where a
token is considered an elementary unit of data. Activation rules spec-
ify the type and number of data tokens which have to be available at
the actor inputs in order to allow an actor to fire. A key property of
dataflow process networks is that an actor firing is solely dependent on
the availability of data. This implies that dataflow process networks
are un-timed, meaning that nothing is specified about points in time at
which firings occur. Only valid firing sequences can be determined for
dataflow process networks.
Dataflow process networks expose the parallelism in an application.
Any actor with sufficient input data can execute in parallel to any other
actor with sufficient input data. Due to FIFO-communication, the or-
der in which actors execute, including actors interrupting each other,
does not affect the calculated results, i. e. dataflow process networks
are determinate in the sense that internal and output streams depend
only on input streams [63]. These properties greatly simplify mixed
hardware/software implementation on parallel architectures, since re-
Executable Specifications 11
sults cannot change due to different execution times of parallel actors,
or due to a particular processor- or bus-scheduling strategy. As we will
see below, several other MOCs do not have this benefit.
Token rate transitions are allowed between two actors, leading to dif-
ferent execution rates of these actors (multi-rate system). An actor can
also have multiple inputs, in which case the required number of tokens
must be available at each actor input for one firing (AND-concatenation).
Actors can also form functional cycles. In cycles, a sufficient number of
initial tokens must be available on some cycle edge to avoid deadlock.
Several specializations of dataflow process networks have been devel-
oped. Of particular importance is synchronous dataflow (SDF) [61],
where the number of tokens consumed at each input and produced at
each output by each actor per firing is fixed and specified a priori. This
allows to determine a static firing sequences which returns the SDF
graph into its initial state. Such a firing sequence can be repeated in
a loop to statically schedule an SDF graph operating on a stream of
data [62]. Fig. 2.3 shows a simple SDF graph and a valid sequential
firing sequence that returns the graph into its initial state. Most of
the existing work on SDF scheduling focuses on optimizing static sched-
ules for parallel execution, required sizes of FIFO-buffers, or end-to-end
latency (e. g. [82, 89]).
SDF
a c
b
1
1
21
1
2
step
1
2
3
4
0
firing
c
a
a
b
a → b b → c c → a
tokens on edges after firing
21
a → c
Figure 2.3. A simple SDF graph and a valid sequential firing sequence that returns
the graph into its initial state
In the cyclo-static dataflow (CSDF) model [27], the number of con-
sumed and produced tokens by an actor may vary cyclically. For exam-
ple, an actor may consume one token at its odd firings and consume three
tokens at its even firings. This still yields a statically determinable fir-
ing sequence as for an CSDF network a corresponding SDF network can
be created by unfolding. A further generalization is boolean dataflow
(BDF) [12] where the numbers of consumed and produced tokens de-
pends on the value of a token read from a dedicated control input. While
12 System-Level Design
this results in Turing completeness, it also causes a BDF process network
to be in general not statically schedulable.
2.1.2 Globally Synchronous Models
In globally synchronous models, parallel actors are assumed to exe-
cute at exactly the same points in time. Synchrony is enforced by perfect
clocks. Execution and communication are usually timeless. Application
topologies are not restricted and can include actors with multiple in-
puts and outputs, as well as cycles. Simulink [74] allows multiple clocks
with different clock-speeds in one system. All actors connected to the
same clock are activated periodically at the same time, independent of
a change at actor inputs, and always produce output values. We will
consider Simulink in detail in chapter 9. In synchronous/reactive mod-
els [37] all processes simultaneously react to changes at their inputs and
may produce new output values (conditional communication). A change
at any input of a multi-input actor causes a reaction (OR-activation).
One example are hierarchical finite state machines (HFSMs), e. g. Harel’s
StateCharts [38], where the reaction of parallel FSMs to an input event
is instantaneous and infinitely fast.
A major problem with global synchrony is that it does not scale well
to distributed architectures, since enforcing global synchrony across dis-
tributed resources makes the system inefficient [20]. Additionally, such
a system is often globally sensitive to small local changes. On the other
hand, the timing in a properly implemented distributed synchronous
system is highly predictable, and consequently popular for safety-critical
applications. One example for a distributed synchronous system is the
time-triggered protocol (TTP) [57] currently considered as a candidate
for system-level communication in automotive networks for applications
like steer-by-wire and break-by-wire.
2.1.3 Globally Asynchronous Models
One workaround around the limitations of a fully synchronous large
system is a locally synchronous, but globally asynchronous system. This
paradigm has been implemented e. g. in the Codesign Finite State Ma-
chines (CFSM) [18] used in the Polis codesign environment [6]. A
disadvantage is that the system behavior depends on the arrival or-
der of events between synchronous regions, and is thus fundamentally
implementation-dependent [123]. The same is true for system designed
in the Specification and Description Language (SDL) [45], a language
widely used in the design of telecommunication protocols. In SDL,
processes representing extended FSMs communicate via unbounded FI-
Executable Specifications 13
FOs. However, different to dataflow, all senders write into the same un-
bounded input queue. The resulting non-deterministic system behavior
is usually acceptable, since it reflects the nature of typical telecommu-
nication systems [103].
2.1.4 Multi-Language Design
Embedded systems often combine aspects from different application
domains. For example, a state-of-the-art mobile phone needs to inte-
grate various transformative signal processing functions, multiple reac-
tive communication protocols, and an interactive user interface. These
fundamentally different application properties encourage multi-language
design, since a good model for one aspect is often awkward for an-
other [19]. Two basic approaches can be distinguished [123]. In the
compositional approach, different MOCs are integrated hierarchically.
A prominent research tool supporting this approach is Ptolemy [117].
A special case is the unified modeling language (UML) [70] which
covers many different semantics but currently has little simulation, ver-
ification and optimization support. UML is thus positioned somewhere
between requirements specification and executable models.
The second approach is co-simulation. Different system parts are
designed with different tools, and then co-simulated using co-simulation
interfaces that most tools provide. For example, in [10] an efficient co-
simulation approach for Matlab on one hand, and SDL on the other
is presented. The simulation results from one tool can also be used
in form of characteristic values during a simulation in a different tool.
This approach makes sense if some level of detail, often the granularity
of time, is significantly different in both simulations, and co-simulation
thus would be either extremely inaccurate, or extremely time-consuming.
2.1.5 Commercial Tools
All widely-used MOCs are supported by one or more commercial tools.
Simulink/StateFlow [74, 75] by The MathWorks implements both con-
tinuous-time and perfectly synchronous MOCs. The tool is often used
for the modeling of control systems and signal-processing systems. A
transformative Simulink actor can internally be described as a StateFlow
FSM (composition). Simulink will be discussed in detail in chapter 9.
Tau [108] by Telelogic implements the language SDL (section 2.1.3),
and is widely used for the specification of telecommunication proto-
cols. CoCentric System Studio [106] by Synopsys implements various
dataflow models of computation, and allows hierarchical composition
with un-timed FSMs. It is used for the design of signal-processing sys-
14 System-Level Design
tems. StateMate [42] by I-Logix implements synchronous parallel FSMs
(Harel’s StateCharts [38]), and as such is used for complex, local control
algorithms.
2.1.6 System-Level and Implementation
Languages
All modeling tools offer code-generation into implementation langua-
ges, in particular C for functions marked for software implementation,
and VHDL or Verilog for functions that are to be implemented in hard-
ware. C++ and Java are alternatives to C for some application domains.
A modern alternative are system-level languages, which are both suited
for software and hardware implementation and thus allow the designer
to delay this decision. Two prominent examples are SystemC [83], which
is implemented as a C++ library, and SpecC [34], which is an extension
of C.
In many designs, graphical model-based approaches are never used.
Instead, the design starts directly with system-level or implementation
languages. In other designs, at least part of the function is specified in
these languages, either as a means to manually optimize the function, or
because legacy code in such a language needs to be integrated. Specifi-
cations in implementation languages offer maximum flexibility and opti-
mization potential, since these languages do not restrict the designer in
any way beyond the basic language semantics. However, corresponding
tools provide little simulation and verification support. Flexibility also
comes at the price of considerably more opportunity for errors.
The system-level languages SystemC and SpecC attempt to maintain
the expressiveness of implementation languages, while enforcing some
higher-level semantics, e. g. certain communication conventions between
tasks (so-called “transaction-level modeling”). This favors structured
design and reduces errors.
2.2 Target Architectures
In this thesis, we concern ourselves with system-level performance
analysis, and therefore assume a system-level view of an embedded archi-
tecture. Consequently, architecture models below the system-level, in-
cluding detailed models of processors, dedicated hardware blocks, mem-
ories etc. , as well as instruction sets, gates, transistors, and physical
models are not considered. On the other hand, our definition of ‘system-
level’ goes beyond the typical hardware-designer’s definition to include
operating systems and communication protocols, which are required in
complex embedded systems.
Target Architectures 15
In our system-level view, an embedded architecture consists of compu-
tation resources and communication resources. Computation resources
include (often specialized) programmable processors such as micro-con-
trollers (µCs) or digital signal processors (DSPs), and dedicated or
weakly-configurable co-processors (ASICs respectively FPGAs). A pro-
cessor often executes an embedded real-time operating system (RTOS),
e. g. VxWorks [119] in telecommunication applications, or OSEK-based
RTOSes [66] such as ERCOSEK [29] in automotive applications, to
schedule different software tasks, manage task access to local memory,
control timers and react to external interrupts.
Communication resources include busses and networks, bus interface
units (BIUs), and shared memory. Busses typically use standardized ar-
chitectures, such as AMBA [3] or SiliconBackplane [104] for System-on-
Chip (SoC) communication, and CAN [93] or the upcoming FlexRay [33]
for communication between distributed automotive control units. Some
bus architecture specifications include a protocol for bus arbitration
(e. g. a static-priority collision-avoidance protocol in CAN), others sup-
port several arbitration mechanisms (e. g. a combination of TDMA and
static priority scheduling in FlexRay), and yet others do not specify
arbitration mechanisms (e. g. AMBA).
2.2.1 Systems-on-Chip
We are particularly interested in complex architectures consisting of
many different components. In domains such as telecommunication or
multimedia, different components are often integrated on a single chip,
leading to so called systems on chip (SoC). Recently, the term ‘multi-
processor systems on chip’ (MPSoC) has emerged to describe complex
SoCs with multiple programmable cores and a complex communication
structure. An SoC usually comprises different intellectual property (IP)-
components from different vendors [28, 56], since a system integrator
usually tries to re-use as many proven components as possible to concen-
trate on his added value. Typical IP components are the aforementioned
types of processor cores with RTOSes, the aforementioned bus architec-
tures, but also standard interface blocks and hardware-components ded-
icated to a particular computation function. Often, a complete domain-
specific SoC platform is provided by a platform vendor (e. g. for mobile
communication), which already consists of suitable components to per-
form many standard domain functions, and can be extended with special
functions by an OEM (original equipment manufacturer, meaning a com-
pany who produces an end product). A platform also includes software
and reference designs, thus further reducing the effort for an OEM to
build a product. An example are the Philips Nexperia Home (Fig. 2.4)
16 System-Level Design
and Nexperia Mobile platforms for multimedia and mobile digital audio
applications [88].
External SDRAM
Interrupt controller
Enhanced JTAG
Universal async.
receiver/transmitter
(UART)
Universal serial bus
IC debug
Clocks
CPU debug
ISO UART
Reset
MIPS
(PR3940)
CPU
TriMedia
C-Bridge
Memory controller
Fast C-Bridge
TriMedia
(TM32)
CPU
MIPS bridge
IEEE 1394 
link layer controller
High-performance
2D-rendering engine
MIPS C-Bridge
I²C
Exp. bus interface
unit PCI/XIO
CRC
 DMA
Adv. image
composition
Processor
MPEG-2
video decoder
Video input
processor
Memory-based
scaler
MPEG
system proc.
C-bridge
Interrupt ctrl.
Audio I/O
Sony Philips
Digital I/O
Transport stream
DMA
General-purpose
I/O
Synchronous
serial interface
F a
s t
P I
 b
u s
M
IP
S
P I
 b
u s
M
em
.
M
gm
t .
IF
 b
u s
Tr
i M
ed
ia
P I
 b
us
D$
I$
D$
I$
nexperia
Figure 2.4. The Philips Nexperia Home platform for multimedia applications
2.2.2 Distributed Systems
In the automotive domain, a multitude (close to 100 in high-end ve-
hicles) of dedicated electronic control units (ECUs) located in different
parts of a vehicle are networked via the aforementioned CAN, as well as
LIN [67], MOST [80] and (soon to come) FlexRay busses. Application-
specific ECUs that already include specialized peripheral interfaces as
well as so-called basic software are provided by ECU vendors. For ex-
ample, an engine control ECU by default includes enough basic software
that it is able operate an engine, albeit maybe not very efficiently [11].
All that remains for the automotive OEM is to calibrate the control
algorithms on the ECU for optimal behavior, and potentially to add
functions that distinguish him from competing OEMs [115].
A more traditional alternative is board-level integration, where several
closely spaced boards with dedicated processors, memories and peripher-
als are connected via a backplane. An example is board-level integration
based on the VME bus [43], which is often used in industrial control sys-
tems, aerospace and military applications.
Function Implementation and Integration 17
2.3 Performance Constraints
Performance constraints are imposed by the environment in which
an embedded system will operate. They constrain valid implementa-
tions of the system function. Performance constraints can be catego-
rized as either hard or soft. A system implementation has to meet all
hard performance constraints; otherwise the system does not function
correctly. Hard performance constraints are typically timing constraints,
e. g. deadlines, maximum allowed jitter, or minimum execution rate. An
additional hard performance constraint is the avoidance of communica-
tion buffer over- or underruns that can corrupt system function.
Soft constraints are similar to optimization goals. Violation of soft
constraints gradually reduces the usability, marketability, profitability
etc. of the system, but does not lead to immediate failure of system
functions.
2.4 Function Implementation and Integration
During function implementation, a function available in an implemen-
tation or system-level language is mapped onto a specific architecture
component, using compilation (software) and synthesis (hardware), or
manual re-code. As a result, an image of the function specification is
obtained which is executable on the target architecture component.
If the architecture component is a programmable processor, then a
user-function will interact with low-level software functions, in particu-
lar the operating system and drivers, which in turn use low-level hard-
ware functions such as I/O-units, timers, interrupts control etc. Fig. 2.5
shows a typical hardware/software architecture on an embedded proces-
sor, with user-written application software resting on several levels of
system hardware and software.
Chip Bus
core
RTOS
I/O Int Bus-CTRL
timer
timer
drivers
RTOS-APIs
application
periphery
cache
mem
private
private
private
private
s h
a r
e d
hardware
software
architecture
application
embedded processor
API
hw_sw_arch
Figure 2.5. Typical hardware/software architecture on an embedded processor
18 System-Level Design
Often, function structure is changed during implementation. Changes
include the combination or splitting of functions, which may be imple-
mented on different resources, and the change of communication mech-
anisms, e. g. from message passing to shared memory or vice versa.
The target architecture has a constrained number of computation and
communication resources operating at finite speeds, as well as limited
memories and communication buffers. Consequently, computation re-
sources have to be shared between functions, and communication re-
sources have to be shared between logical communication channels.
2.4.1 Task Activation and Scheduling
A task is activated due to an activating event. Activating events can
be generated in a multitude of ways, including expiration of a timer,
external or internal interrupt, and task chaining.
The purpose of a scheduler is to arbitrate between multiple tasks that
want to simultaneously use a resource. A scheduler selects a task to
which it grants the resource out of the set of active tasks according to
some scheduling policy [13]. Other active tasks have to wait. The most
complex schedulers are part of operating systems that schedule software
tasks on embedded processors. But schedulers are also implemented in
hardware to control access to other types of resources. For example,
communication via a shared bus requires bus access scheduling, which
is often performed by a bus-master implemented in hardware. Memory
bandwidth can be considerably increased for certain types of memories,
e. g. SDRAM, if memory-accesses are scheduled smartly [39]. A node in a
network-on-chip needs to schedule different communication requests [17].
Three major classes of scheduling strategies can be distinguished.
1 Non-preemptive static execution order scheduling, mainly for highly
regular digital signal processing (DSP) applications (section 2.1.1).
2 Priority-driven scheduling, often used in highly reactive systems op-
erating in dynamic environments. Here, we further distinguish sche-
duling with fixed priorities and scheduling with dynamic priorities.
In the fixed-priority case, priority assignments often follow a rate-
monotonic scheduling (RMS) [68] or deadline-monotonic scheduling
(DMS) [4] strategy, but priorities can be statically assigned in any
other way. In the dynamic-priority case, priority assignments often
follow an earliest deadline first (EDF) [68] strategy, but priorities can
be dynamically assigned in any other way.
3 Preemptive time-slicing techniques, often used for a fair distribu-
tion of the resource among all tasks. Here, we further distinguish
Function Implementation and Integration 19
static time-division media access (TDMA) scheduling with fixed-
length time slots, and more dynamic round-robin (RR) scheduling
with variable-length time slots [15].
A major advantage of static execution order scheduling is that it re-
quires no run-time scheduler and hence incurs no scheduling overhead.
However, the flexibility that is needed in modern embedded systems,
as well as data-dependent task execution times and increased interac-
tion with the system environment more and more necessitate dynamic
scheduling. Static scheduling is thus typically limited to highly regu-
lar ‘islands’ [123] in an otherwise dynamic system. An example is the
static ordering of tasks with the same period, but preemptive schedu-
ling of tasks with different periods in the automotive operating system
ERCOSEK [29].
TDMA scheduling essentially separates a real resource into multi-
ple independent virtual resources, each with a guaranteed fraction of
the available processing time or communication bandwidth. This is
considered attractive for safety-critical applications in domains such as
aerospace or automotive, in particular because prediction of response
times is easy, and because faults in one virtual subsystem cannot im-
pact a different virtual subsystem. However, TDMA scheduling is inef-
ficient in the presence of dynamically changing loads, since a time-slice
large enough to handle a worst-case load remains partially unused in
all other cases. TDMA is also poorly suited to achieve short response
times. Unless a system is globally synchronized, which is problematic
(section 2.1.2), an activation may wait for a full turn (the sequence of all
time slots) in the worst-case until it is granted a TDMA resource. It will
also be interrupted for additional turns if it does not complete within
one time slot. RR scheduling resolves the problem of underused TDMA
resources, but short response times still cannot be achieved. Addition-
ally, the calculation of worst- and best-case response times is complex.
RR scheduling is rarely used in hard real-time applications.
Reactive, priority-based scheduling does not have the inefficiencies
of TDMA. In particular, it is well suited for multi-processor systems,
where a task is activated by the arrival of data from a different task,
often mapped onto a different component. In multi-processor systems,
arrival timing of data often jitters or can even arrive in bursts. High-
priority tasks can immediately react to data arrival, thus ensuring short
response times. The tasks also do not waste resources if they are idle,
instead leaving a resource fully to lower-priority tasks. While EDF-
scheduling has been proven to be theoretically optimal in the sense that
if EDF cannot guarantee that all deadlines are met, no other scheduling
20 System-Level Design
strategy can [68], it produces considerable scheduler overhead, since pri-
orities have to be frequently re-calculated. Static priority scheduling is
theoretically less efficient, but scheduler implementation is simple, and
the smaller scheduler overhead often compensates for the theoretical in-
efficiencies [14].
2.5 Integration Problems
The long list of differences between executable specification and im-
plementation imply that the architecture idealizations assumed during
function specification are no longer valid. This leads to a variety of in-
tegration problems which generally fall into three categories: functional
problems, resource problems and performance problems.
Typical functional problems during integration include different in-
terpretations of ambiguous functional requirements by different design-
teams, mismatches in bit-width, endian-ness or physical unit, and im-
proper handling of special situations, e. g. system startup. Typical re-
source problems during integration include insufficient memory to hold
all code and data at the same time, or deadlock due to faulty resource
arbitration. Functional problems can be debugged quite well using the
existing test-bench (though it is rarely complete and consequently some
bugs remain undetected). Code memory problems are usually detected
at linkage time. Data memory problems can be hard to detect in pro-
grams that allocate memory dynamically (either on the heap or on the
stack). Tools like Purify [94] help. Deadlocks can be avoided by proper
resource-allocation protocols, e. g. the priority ceiling or priority inheri-
tance protocols [102]. These problems are not further considered in this
thesis.
Performance problems typically arise due to finite resource speed,
and in particular due to resource sharing as a result of function- and
subsystem-integration. They include the violation of deadlines, failure
to sustain a required execution rate, buffer over- or under-run, or out-
put jitter beyond the acceptable limit. For example, in Fig. 2.6 tasks
belonging to one subsystem are first integrated onto shared processing
components (shown for subsystem 1). Then, functionally independent
subsystems are integrated onto a common architecture. In that process,
the double-arrowed communication links in both subsystems are mapped
onto a shared bus. Both integration steps create performance dependen-
cies between tasks. In reality, the number of performance dependencies
is much larger, since processors are also shared between subsystems,
and communication also occurs across subsystems boundaries. Track-
ing performance dependencies can be extremely difficult, in particular
The SPI Project 21
intergration
M2DSP
Subsystem 2Subsystem 1
M1
HW
CPU
T1
T3
T2
Sens
Subsystem
Integration
Task Integration
M2IP2M3
M1
Bus
DSPIP1
HWCPUSens
IP2M3IP1
Figure 2.6. Performance dependencies between functionally independent subsystems
introduced by resource sharing
if neither the system integrator nor the subsystem designers understand
the complete system.
State-of-the art to debug performance problems are test and cycle-
accurate system-simulation. Test requires the (prototyping) hardware
to be fully available, which limits test to the last design stage. Errors
found at this stage often lead to expensive re-designs. Cycle-accurate
system-simulation with tools like Axys MaxSim [5] or Mentor Graph-
ics Seamless [78] alleviates this problems by allowing to find problems
before hardware has been produced, and allowing reasonably fast and
inexpensive architecture changes. Cycle-accurate system-simulation ad-
ditionally offers superior debugging support, since internal state can
be visualized beyond what is accessible on the actual hardware, and
since debugging does not interfere with the system. Still, cycle-accurate
system-simulation requires detailed architecture models and fully imple-
mented functions, and is thus also confined to late design stages.
2.6 The SPI Project
The goal of the SPI project is to enable system-wide analysis of non-
functional system properties, in order to allow safe and reliable integra-
tion and optimized implementation of heterogeneously specified multi-
22 System-Level Design
language embedded systems [126, 123, 44]. The acronym SPI is derived
from (System Property Intervals).
The key concepts behind SPI are abstraction of functional and ar-
chitecture details and the use of intervals. Abstraction allows to focus
entirely on non-functional system properties, in particular those that in-
fluence performance. The use of intervals leads to a single description of
many different behaviors in the presence of uncertainty. Together, ab-
straction and the use of intervals lead to system-representations which
are excellent for performance analysis and design-space exploration.
Abstraction and the use of intervals have several further advantages.
A major advantage is the ability to effectively cope with complex em-
bedded systems [123], such as those described in this chapter. Further-
more, a design-flow with incomplete or partially hidden function and
architecture specifications is naturally supported. As long as proper-
ties of function or architecture components that influence performance
are available, performance analysis is possible without further knowl-
edge of internal details. Incomplete specifications are the rule in early
design-stages, while hidden internal details are the result of component
integration from different design teams.
2.6.0.1 The SPI Model
The SPI project initially focused on capturing non-functional system
properties from a variety of MOCs, in order to be independent of spe-
cific MOCs, and in particular to support multi-language design. The
idea was to transform a heterogeneous function specification into a ho-
mogeneous, internal representation of non-functional system properties,
which would then be used as a basis for performance analysis and co-
ordination synthesis [28], i. e. the control of mapping and scheduling
decisions as well as the synthesis of function interfaces during function
integration. The goal was to avoid many of the integration problems that
today are discovered late in the design when function implementations
are finally integrated and co-simulated.
Ziegenbein developed the SPI model [126, 123] as an internal represen-
tation for this purpose. SPI captures communication and coordination
as a process network, but abstracts detailed process behavior. The SPI
model is sufficiently expressive to capture a large variety of different
communication and coordination semantics common in different MOCs,
and thus can in fact serve as a unifying representation that allows to
consider all application components during performance analysis inde-
pendent of the modeling languages or tools used. To prove this claim,
model transformations from Simulink to SPI [48, 47], as well as from
The SPI Project 23
SDL to SPI [53] have been developed. The transformation from several
other MOCs to SPI was considered in [95].
In SPI, processes communicate via two different channel types, FIFO-
queues and registers. The activation of SPI processes is implicit and
based on data availability, i. e. a process may start if there is sufficient
data on its input queues to support one execution. A process is modeled
by an activation function, an execution time, as well as consumer and
producer data rates at task inputs. Data is produced on and consumed
from all connected communication channels simultaneously at the end
of the execution of a process (atomic buffer update). Channels can be
initialized with tokens.
Process and channel properties are specified as value intervals. For
example, a data rate interval limits the number of produced or consumed
data, capturing conditional communication. Using the concept of process
modes, such conditional process behavior depending on internal states or
input data can also be modeled explicitly, i. e. a process is refined to have
different behaviors (parameter sets) that are modeled as modes. Virtual
processes and channels are used to model the system environment and
to represent scheduling constraints.
Finally, timing constraints can be specified in SPI using latency path
constraints that limit the time for causal process executions along a
certain path. Other types of timing constraints (rate, jitter etc.) can
be modeled by latency path constraints over virtual elements. A formal
description of the SPI model can be found in [126, 123].
2.6.0.2 The SPI Workbench
The SPI model by itself is generally not well suited for performance
analysis, since the modeling of preemptive resource sharing, as well as the
modeling of timing in event streams are awkward. The modeling of re-
source sharing effectively requires describing state machines that model
the scheduler and the different scheduling states that each process can
assume. This is both complex and unwanted, since the benefit of the SPI
model, a pure representation of application coordination and communi-
cation, would be completely blurred by a multitude of implementation-
dependent components. The same hold for the modeling of possible
timing in event streams using SPI. For example, Fig. 2.7 shows a SPI
representation of an event source which produces an output event on av-
erage every 4 time units. The period is not strictly maintained. Instead,
each event can be delayed from its perfect position by up to 1 time unit.
The period is enforced by the latency constraint LC = 4 on channel cp.
The delay interval stems from the latency constraint interval LC = [0, 1]
on channel cj. The possible timing at the output of cj would have to be
24 System-Level Design
detected by a sophisticated analysis algorithm. In the next chapter we
will see a much more efficient way to express timing information.
SPI_periodic_jitter
1
1
LC = 4
cp
Pp
1
cj1
LC = [0 , 1]
0
lat = 0
Figure 2.7. SPI representation of an event source which produces an output event
on average every 4. Each event can be delayed from its perfect position by up to 1
time unit
In conclusion, SPI combines several key concepts which make it a
very efficient model to describe application coordination, which is funda-
mental for formal system-level performance analysis. In particular, these
concepts include abstraction of functional details and the use of intervals.
The choice between FIFO and register communication, together with
process modes and mode tags to capture correlation between modes, as
well a virtual elements allow to efficiently capture performance-relevant
data at various levels of detail.
However, SPI cannot capture efficiently neither the target architec-
ture, nor scheduling effects. This is obvious when comparing timing and
scheduling modeled directly in SPI [124] against an approach where an
additional representation better suited for scheduling analysis is used [127].
2.7 Summary and Conclusion
In this chapter we gave an overview over system-level design. We
emphasized the complexity and heterogeneity of both embedded appli-
cations and embedded architectures. We identified system-level per-
formance verification as a key design challenge, stemming from the
huge number of performance-interdependencies that arise from resource-
sharing and scheduling due to function- and subsystem-integration. We
emphasized the need to consider performance as early as possible in
the design flow, due to the large cost of fixing performance problems
discovered late in the design. Considering performance early also allows
design-space exploration, leading to optimized solutions which can make
the difference between a successful product and one that cannot be sold
profitably.
Summary and Conclusion 25
We argued that cycle-accurate simulation, which is state-of-the-art in
performance verification, is poorly suited both for early design stages
and for design-space exploration, since it requires detailed architecture
models and fully implemented functions. We then hinted at an alterna-
tive, formal performance analysis. The SPI-project took a first step in
that direction by systematically capturing application properties into a
representation that abstracts functional details and focuses exclusively
on performance-relevant properties (separation of concerns). In particu-
lar, SPI captures a large variety of activation dependencies which occur
in embedded applications (application coordination model).
Our overview of MOCs and languages showed that activation depen-
dencies common in embedded applications are AND- and OR-activation,
cyclic dependencies, conditional communication, data-rate transitions
and data-dependent behavior. Any useful system-level performance anal-
ysis framework must be able to deal with these complex application
dependencies. In the next chapter we will see that representations for
scheduling analysis also use abstraction and intervals to model the possi-
ble scheduling dependencies (i. e. execution coordination) between tasks.
On the other hand, the application models in these representations are
a far cry from the expressiveness of common MOCs and languages. The
focus of this thesis will thus be to combine powerful scheduling repre-
sentations with powerful application representations, in order to enable
performance analysis of real-world systems.

Chapter 3
PERFORMANCE ANALYSIS
System-level performance analysis calculates conservative bounds for
system performance. The most widely researched approaches use sche-
duling analysis, which constructs worst-case (and sometimes best-case)
scenarios, and calculates the performance of these scenarios.
In this chapter, we present several system-level performance analy-
sis approaches based on scheduling analysis. Our main focus is a novel
compositional performance analysis approach developed by our group.
We then point out that existing performance analysis techniques in gen-
eral do not support the complexity of real-world embedded applications.
We also give a brief overview over alternative performance analysis ap-
proaches that are not based on scheduling analysis.
3.1 Scheduling Analysis
When multiple computation (communication) tasks share one compu-
tation (communication) resource, then two or more tasks may request
the resource at the same time. Scheduling resolves these conflicting
requests (section 2.4.1). To properly design an embedded real-time sys-
tems it is important to understand the effect that scheduling has on
system performance.
Scheduling analysis calculates worst-case (sometimes also best-case)
task response times, i. e. the time between task activation and task com-
pletion, for all tasks sharing a resource under the control of a scheduler.
Scheduling analysis guarantees that all observable response times will
fall into the calculated [best-case, worst-case] interval. We therefore say
that scheduling analysis is conservative.
Worst-case response times can then be compared against deadlines. If
all deadlines are met in the worst case, then they will be met in all cases,
28 Performance Analysis
thus guaranteeing that a system will satisfy all deadline constraints. Ob-
viously, the calculated response times should also be as tight as possible
to avoid over-dimensioning of a system.
Scheduling analysis requires information about certain system prop-
erties (section 3.2), and about the possible timing of task activations.
The possible timing of task activations is efficiently described using event
models.
3.1.1 Event Models
An event model describes the possible timing of events. It can be
expressed using two event functions ηu(∆t) and ηl(∆t).
Definition 3.1 (Upper Event Function) The upper event function
ηu(∆t) specifies the maximum number of events that can occur during
any time interval of length ∆t.
Definition 3.2 (Lower Event Function) The lower event function
ηl(∆t) specifies the minimum number of events that have to occur during
any time interval of length ∆t.
Event functions are piecewise constant step functions with unit-height
steps, each step corresponding to the occurrence of one event. An exam-
ple is shown in Fig. 3.1. Note that at the points where the functions step
the smaller value is valid for the upper event function, while the larger
value is valid for the lower event function (indicated by dark dots). We
will explain this example in more detail in a minute. For any time inter-
val of length ∆t, the actual number of events is bound by the upper and
lower event functions. Event functions resemble arrival curves [23, 87]
which have been successfully used by Thiele et al. for compositional per-
formance analysis of network processors [109]. We will take a closer look
at their work in section 3.7. In the following, the dependency of ηu and
ηl on ∆t is omitted for brevity.
Event models can also be described by sets of parameters, together
with rules how to construct both event functions from them. For exam-
ple, a periodic with jitter event model has two parameters (P, J ) and
states that each event generally occurs periodically with period P, but
that it can jitter around its exact position within a jitter interval J .
A periodic with jitter event model is described by the following event
functions ηuP+J and η
l
P+J [98].
ηuP+J =
⌈
∆t+ J
P
⌉
(3.1)
ηlP+J = max
(
0,
⌊
∆t− J
P
⌋)
(3.2)
Scheduling Analysis 29
ef
5 10 15
∆t
5
events
2 * J = 2
P = 4 u
JP+η
l
JP+η
Figure 3.1. Upper and lower event functions for a periodic with jitter event model
with (P = 4, J = 1)
Consider an example where
(P, J ) = (4, 1)
This event model is visualized in Fig. 3.2. Each gray box indicates a
jitter interval of length J = 1. The jitter intervals repeat with the event
model period P = 4. The figure additionally shows a sequence of events
which satisfies the event model, since exactly one event falls within each
jitter interval box, and no events occur outside the boxes. An infinitely
long sequence of events is called an event stream. If all events in the
event stream satisfy an event model, then the event stream is said to
satisfy the event model.
es_em
t
P = 4 J = 1
t0 t0+ 4 t0+ 8 t0+ 12 t0+ 16
∆t = 4 ∆t = 4
Figure 3.2. Example of an event stream that satisfies a periodic with jitter event
model with (P = 4, J = 1)
Let us now return to the event functions in Fig. 3.1. These event func-
tions correspond to our periodic with jitter event model in accordance
with equations 3.1 and 3.2. To get a better feeling for event functions,
imagine a sliding window of length ∆t that is moved over the (infi-
nite) length of an event stream. Consider ∆t = 4 (gray vertical bar in
30 Performance Analysis
Fig. 3.1). The upper event function indicates that at most 2 events can
be observed during any time interval of length ∆t = 4. This corresponds
e. g. to a window position between t0+8.5 and t0+12.5 in Fig. 3.2. The
lower event function indicates that no events have to be observed during
∆t = 4. This corresponds e. g. to a window position between t0 + 12.5
and t0 + 16.5 in Fig. 3.2.
Let us further introduce distance functions δmin(N ≥ 2) and δmax(N ≥
2), which return the minimum respectively maximum distance between
N ≥ 2 consecutive events in an event stream.
Definition 3.3 (Minimum Distance Function) The minimum dis-
tance function δmin(N ≥ 2) specifies the minimum distance between
N ≥ 2 consecutive events in an event stream.
Definition 3.4 (Maximum Distance Function) The maximum dis-
tance function δmax(N ≥ 2) specifies the maximum distance between
N ≥ 2 consecutive events in an event stream.
For periodic with jitter event models we obtain
δmin(N ≥ 2) = max {0 , (N − 1) ∗ P − J } (3.3)
δmax(N ≥ 2) = (N − 1) ∗ P + J (3.4)
For example, the minimum distance between 2 events in a periodic
with jitter event model with (P = 4, J = 1) is 3 time units, and the
maximum distance between 2 events is 5 time units.
Periodic with jitter event models are well suited to describe generally
periodic event streams, which often occur in control, communication and
multimedia systems [97]. If the jitter is zero, then the event model is
strictly periodic. If the jitter is larger than the period, then two or more
events can occur at the same time, leading to bursts. In the presence of
bursts, a periodic with jitter event model can be extended with a dmin
parameter that captures the minimum distance between events within a
burst. A more detailed discussion can be found in [98].
Additionally, sporadic events are also common [97]. We model spo-
radic event streams with the same set of parameters as periodic event
streams. The difference is that for sporadic event streams, the lower
event function ηl(∆t) is always zero. The maximum distance function
δmax(N ≥ 2) approaches infinity for all values of N [98]. Note that jitter
and dmin parameters are also meaningful in sporadic event models, since
they allows to accurately capture sporadic transient load peaks.
Other event model representations have been proposed. Gresser [35,
36] proposes a very general formalism that uses tuples which capture
Single-Processor Scheduling Analysis 31
the maximum number of events that can occur for certain ∆t. It is
possible to express our event models using Gresser’s notation. However,
system-level performance analysis has been demonstrated only for earli-
est deadline first (EDF) scheduling where deadlines were locally assigned
and true completion times were not taken into account. Therefore, each
resource could be analyzed separately, which as we will see below is not
possible in general.
In section 2.6 we showed that event models can also be represented
directly in SPI. In fact the example we used there represents the same
periodic with jitter event model that we use in this section. As can be
seen, the modeling effort is considerably larger in SPI. The SPI repre-
sentation of event models resembles timed automata which are used for
model checking-based scheduling analysis (see below).
3.2 Single-Processor Scheduling Analysis
Single-processor scheduling analysis is a local analysis technique ap-
plicable to one scheduling component (typically a processor, hence the
name, but the same approach can be used e. g. to analyze bus-arbitration).
A classic example is rate-monotonic analysis (RMA) by Liu & Lay-
land [68], which analyzes if tasks scheduled according to the rate-mono-
tonic scheduling (RMS) policy can meet their deadlines. This work was
generalized by Joseph and Pandya to static-priority preemptive schedu-
ling on a single processor [54]. The following equation gives the worst-
case response time ri of a lower priority task i due to a worst-case number
of interrupts by all higher priority task j ∈ hp(i). Ci, Cj are the worst-
case core execution times (WCET, i. e. assuming no interrupts) of tasks
i and j. The equation holds only if ri is smaller than the minimum
distance between two activations of task i, and ignores context-switch
overhead.
ri = Ci +
∑
∀j∈hp(i)
ηuj (ri) ∗ Cj (3.5)
The term ri appears on both sides of this equation, which thus cannot
be solved directly. Audsley presented an iterative algorithm to obtain
the desired result. It is first assumed that ri = Ci. It is then checked,
for how long task i can be interrupted during this time. The sum of all
possible interrupts is added to ri, and the check is repeated, until a fix-
point is reached (success) or ri exceeds the minimum distance between
two activations of task i (abort).
Worst-case and best-case core execution times can be obtained in a
variety of ways. A task can be executed in isolation on the target archi-
tecture for a range of stimuli in search of worst-case and best-case paths.
32 Performance Analysis
This is much easier for a single task than for a whole system, because the
designer of the task knows the code very well. A more reliable approach
is static execution-time analysis [120], which considers all possible stim-
uli to calculate conservative worst- and best-case core execution times.
Core execution times can also be guessed in early design-stages before a
task has been coded, e. g. based on values from previous designs.
Single-processor scheduling analysis has improved over time to reduce
constraints under which the analysis works, e. g. consideration of arbi-
trary deadlines [65] or more complex activation timing [105, 35], and
to support different scheduling techniques [35, 46]. Additionally, sche-
duling effects have been modeled in more and more detail, including
blocking-times [102], RTOS-overhead [114] and various types of correla-
tions between different task activations [113, 79, 85, 7]. A good overview
of single-processor scheduling analysis techniques can be found in [32,
13, 96]. Single-processor scheduling analysis has also found its way into
industrial practice. TriPacific’s Rapid RMA [116], LiveDevices’ Real-
Time Architect (RTA) [69], and TimeWiz by TimeSys [111] all calculate
worst-case task response times for single processors with static priority
scheduling. RTA models the scheduling influences of a specific automo-
tive RTOS in detail.
We have also used a real-world RTOS from the automotive domain to
show that scheduling analysis accuracy can be increased to a point where
the calculated worst-case response time is only a few percent longer than
the longest measured response-time [52]. High accuracy comes at the
cost of higher model and analysis complexity, and consequently higher
modeling and analysis times. However, even at a high level of detail,
scheduling analysis run-time compares favorably to the time required
for cycle-true simulation [50].
One major open issue in single-processor scheduling analysis are caches.
In the past, embedded processors rarely had caches, or they were turned
off for hard real-time applications, because predictability of cache-beha-
vior was poor. Recently, performance requirements for embedded pro-
cessors and the size of embedded applications have grown so much that
caches are becoming commonplace. Examples for embedded processors
with caches include PowerPC, StrongARM, or TriCore. Extension of
scheduling analysis to consider cache-effect has become an interesting
field of research [59, 81].
3.3 Homogeneous Multi-Processor Scheduling
Analysis
Single-processor scheduling analysis cannot handle multi-processor ar-
chitectures, which are common in modern embedded systems. The sim-
Holistic Scheduling Analysis 33
plest form of multi-processor scheduling analysis are extensions to the
single-component analysis techniques described above, e. g. [101]. Com-
munication between tasks is via shared memory. Multiprocessor analysis
applies the component-level techniques locally, and captures task depen-
dencies by creating relations between certain process timing parameters
and the response time equations. It is restricted to homogeneous sche-
duling for all tasks.
As was shown by Yen and Wolf [122], in multi-processor scheduling
analysis it is generally not sufficient to consider only worst-case response
times. For example, a shorter response time of task a on one resource
can lead to an earlier activation of a dependent task b on a different
resource, which my then preempt a second task c on that resources that
would not have been preempted if task a had had a longer response time.
This may cause task c to miss its deadline. Such scheduling anomalies
can only be reliably detected if both worst- and best-case response times
are calculated during scheduling analysis.
3.4 Holistic Scheduling Analysis
“Holistic” in the context of scheduling analysis refers to a consistent
end-to-end response time analysis approach for multi-processor real-time
systems, where processors communicate over a bus. The maximum re-
sponse time of a task is treated as the release jitter for the next task
along a task dependency chain. The first such approach by Tindell con-
siders processes scheduled under RMS which exchange messages via a
TDMA bus [112]. Newer approaches by Eles et al. consider a larger va-
riety of mixed time/event-triggered distributed embedded systems, and
focus on real-world communication protocols [90–92]. They consider the
time-triggered protocol (TTP) [57] as a specific TDMA implementation,
and control area network (CAN) [93] as a specific static-priority imple-
mentation. Garcia and Harbour refined Tindell’s basic approach in sev-
eral ways to improved analysis tightness, including static and dynamic
offsets, arbitrary deadlines [85] and best-case analysis techniques to im-
prove worst-case analysis in the presence of scheduling anomalies [84].
Flexibility, subsystem integration and scalability are major weak-
nesses of holistic techniques. If an architecture changes, then it is easily
possible that a holistic analysis does no longer fit the problem. This
means that an architecture cannot be composed as desired without com-
promising analyzability. The problem is aggravated if subsystems, in
particular IP components need to be integrated. Then, a local perfor-
mance model for a subsystem may not satisfy holistic assumptions. For
example, holistic analyses by Tindell, Garcia and Harbour assume peri-
odic events at system inputs. This means that a subsystem cannot be
34 Performance Analysis
analyzed, if it is connected to a second subsystem with output jitter.
End-to-end delay calculation across subsystems is also not supported.
3.5 Compositional Performance Analysis
Compositional performance analysis [97, 98, 109] enables performance
analysis for complex, heterogeneous embedded architectures and sup-
ports subsystem integration, thus overcoming the restrictions of other
techniques. It works by integrating different local scheduling analysis
techniques into a system-level analysis. This process mimics the integra-
tion of different architecture components into the system architecture.
The local scheduling analysis techniques are composed on the system
level by connecting their input and output event streams according to
the overall application and communication structure.
Obviously, for compositional performance analysis to work, local sche-
duling analysis techniques must be available for each architecture compo-
nent. This is not necessarily the case. However, each newly programmed
local scheduling analysis can be added into a library of analysis tech-
niques for re-use in future systems. This reduces the amount of analysis
programming to novel components that have not been considered before.
The effort to scale to larger and more complex heterogeneous systems is
thus minimized. This is a major advantage over ‘holistic’ multi-processor
analysis approaches [114, 90]. On the other hand, holistic approaches
may make it easier to take global performance effects into account than
compositional performance analysis, potentially yielding tighter analysis
bounds.
In this section, we mainly focus on a novel compositional performance
analysis approach developed by Richter et al. [97, 98]. The performance
analysis methodology for complex embedded applications, which is the
subject of this thesis, builds upon Richter’s work. We additionally
present a compositional performance analysis methodology developed
by Thiele et al. [109], which uses a novel local analysis approach and is
geared towards the analysis of network processors. Richter’s approach
has the advantage that it allows to re-use the large number of exist-
ing analysis techniques developed by the real-time community, such as
those introduced in section 3.1. It is well suited for several important em-
bedded systems domains, including telecommunication, multi-media and
control systems. To the best of our knowledge, Richter’s and Thiele’s ap-
proach are the only two compositional performance analysis approaches
currently available for heterogeneous embedded architectures.
Composition Using Standard Event Models 35
3.6 Composition Using Standard Event Models
In the compositional performance analysis methodology developed by
Richter et al. [125, 97, 98] (henceforth called ‘our methodology’), input
and output event streams are described by event models with period,
jitter, and minimum distance parameters. Additionally, periodic and
sporadic event streams are distinguished. These event models were in-
troduced in detail in section 3.1.1.
Event models with this small set of parameters have several advan-
tages. Firstly, they are easily understood by a designer, since period, jit-
ter etc. are familiar event stream properties. Secondly, the corresponding
event functions and distance functions can be evaluated quickly, which
is important for scheduling analysis to run fast. Thirdly, as we will
shortly see, compositional performance analysis requires the modeling
of possible timing of output events for propagation to the next schedu-
ling component. Our event models allow us to specify simple rules to
obtain output event models that can be described with the same set of
parameters as the activating event models. Therefore, we do not have to
depart from our event models independent of size and structure of the
composed system (hence the term ‘standard’). This makes our compo-
sitional performance analysis approach very general.
Throughout this thesis, slightly simpler periodic with jitter and spo-
radic with jitter event models with parameters (P,J ) but without a
minimum distance parameter dmin will be used. This is sufficient to
illustrate our ideas for compositional performance analysis for complex
applications. It should be emphasized that this is not a restriction of the
work presented in this thesis. It can be extended to also consider dmin,
and is equally valid for any other type of event model representations
possibly used in compositional performance analysis.
3.6.1 Task Model
Our existing approach assumes that each task has one input FIFO. A
task reads its activating data from its input FIFO and writes data into
the input FIFO of a dependent task. A task may read its input data
at any time during one execution. We therefore assume that the data
needs to be available at the input during the whole execution of the task.
When calculating the required size of input FIFOs we assume that input
data is removed from the input FIFO at the end of one execution.
We also assume that a task writes its output data at the end of one
execution. This assumption is standard in scheduling analysis. It is
required to deduce the timing of output data from the calculated task
response times. Tasks that deviate from this model can be split into a
36 Performance Analysis
chain of sub-tasks, each of which writes data at the end of its execu-
tion [85].
3.6.2 Output Event Model Calculation
An activating event model describes the possible timing of task acti-
vations on a resource. As was explained in section 3.1, this is required
for response time calculation during scheduling analysis. An important
aspect of our approach is that we not only calculate maximum response
times, but also minimum response times. The importance of minimum
response times to detect scheduling anomalies has been pointed out by
several groups [122, 84] (see also section 3.1).
Our approach goes an essential step further and calculates the result-
ing output event model. An output event model describes the possible
timing of task completions. The output event model period obviously
equals the activation period. The difference between maximum and min-
imum response times (the response time jitter) is added to the activating
event model jitter, yielding the output event model jitter1.
Jout = Jact + (tresp,max − tresp,min)
For an intuitive explanation of this formula, recall that the event
model jitter captures the size of an interval within which each event can
deviate from its exact periodic position. Imagine that the nth activat-
ing event occurs as soon as possible, and is processed in the minimum
response time. The n + 1 st activating event occurs as late as possible,
and is processed in the maximum response time. Then, the maximum
distance between output events has increased by tresp,max − tresp,min
compared to the maximum distance between activating events. A sim-
ilar argument can be made for the minimum distance between output
events, thus explaining the jitter increase.
T1
T2
sa_ex
Figure 3.3. Example to illustrate output event model calculation
1In certain, a tighter output event model jitter can be calculated.
Composition Using Standard Event Models 37
Consider the example in Fig. 3.3. Tasks T1 and T2 are mapped onto
a common resource with static priority preemptive scheduling. The fol-
lowing task parameters are relevant for scheduling analysis: priority,
activation period, activation jitter, and core execution time interval. Let
us assume parameter values as given in the following table (a smaller
number indicates a higher priority). The table also shows the calculated
response time intervals, output jitters and, in case of T2, minimum dis-
tance between output events.
task prio. act. period act. jitter exe. time resp. time out. jitter min. dist.
T1 1 6 1 [2, 3] [2, 3] 2 −
T2 2 20 5 [6, 9] [8, 24] 21 8
Table 3.1. Input and calculated parameter values for both tasks from Fig. 3.3
Sched_ex_worst
0 248 3216
t
t
T1
T2
40
P1 - J1 = 5 P1 = 6
P2 - J2 = 15 P2 = 20
r1,max = 3
r2,max = 24
15
19
97
59
Figure 3.4. Scenario leading to the worst-case response times for both tasks from
Fig. 3.3
Figs. 3.4 and 3.5 show the scenarios leading to the worst-case respec-
tively best-case response time for both tasks. Note that in general the
best-case calculation is NP-complete, but conservative polynomial ap-
proximations have been published [84]. Fig. 3.6 shows a scenario lead-
ing to the minimum and maximum distance between output events for
task T1, from which we obtain the output jitter. The calculation for T2
works accordingly.
As can be seen, for both tasks the output jitter equals the activation
jitter plus the difference between the maximum and minimum response
38 Performance Analysis
Sched_ex_best
0 248 3216
t
t
T1
T2
40
P1 + J1 = 7 P1 = 6
P2 + J2 = 25 P2 = 20
r1,min = 2
r2,min = 8
Figure 3.5. Scenario leading to the best-case response times for both tasks from
Fig. 3.3
Sched_ex_jitter
tT1
P1 - J1 = 5 P1 + J1 = 7
P1 - J1,out = 4 P1 + J1,out = 8
Figure 3.6. Scenario leading to the minimum and maximum distance between output
events for task T1 from Fig. 3.3
times. The response time interval of T1 equals its execution time inter-
val, since T1 cannot be preempted. I. e. for task T1, the output jitter
increases only due to its non-constant execution time.
For task T2, the jitter increases due to a combination of its execu-
tion time interval, a non-constant number of preemptions by task T1,
and a possible activation backlog (an activation of T2 occurs before the
previous activation has been fully executed). In the best case T2 is pre-
empted by T1 once, resulting in a shortest possible interrupt of 2 time
units (the minimum execution time of T1). In this situation, no backlog
is possible. Consequently, The minimum response time of task T2 is
6 + 2 = 8 time units. In the worst case T2 is preempted by T1 several
times, and activation backlog occurs, leading to a maximum response
time of 24 time units.
Note, that the output event model of task T2 has a larger jitter than
period. This information alone would indicate that an early output
Composition Using Standard Event Models 39
event could occur before a late previous output event, which obviously
cannot be correct. In reality, output events cannot follow closer than
the minimum response time of a task. This is indicated by the value
of the minimum distance parameter. For T1, the minimum distance of
output events is implicitly its output period minus its output jitter.
Further effects, such as blocking times or scheduler overhead, have to
be modeled for accurate response-time and output event model calcula-
tion. They are omitted here because they are not needed to understand
the ideas presented in this thesis. Further details can be found in [96].
3.6.3 Analysis Composition
We assume that a task sends output data when it finishes execution.
Tasks that do not satisfy this model can be split into a chain of sub-
tasks for which the model is valid (section 3.6.1). The sending of output
data at task completion time is interpreted as one output event in an
event stream that satisfies the calculated output event model. Our com-
positional performance analysis allows to connect a task output with
the input of a different task. Consequently, the output event model be-
comes the activating event model of the second task and can be used for
scheduling analysis for that task.
DSPBus
mon
sys
c1
c2
upd
ctrl tmr
comp_ex
Figure 3.7. Example to illustrate compositional performance analysis
Consider the example in Fig. 3.7. Monitor task mon sends data that
satisfies an arbitrary event model via communication task c1 to update
task upd. Control task ctrl, which is activated periodically by timer tmr,
sends data via communication task c2 to a system interface sys. Tasks
upd and ctrl are both mapped to the same DSP, while both communica-
tion tasks share the same Bus. The idea of compositional performance
analysis is to perform scheduling analysis of both resources separately,
and to propagate output event models on the event streams between the
two resources. Our compositional performance analysis approach makes
no assumptions about scheduling policies and scheduling parameters and
merely requires that scheduling analysis techniques are available for both
40 Performance Analysis
resources. The example allows us to study several interesting aspects of
or compositional performance analysis approach.
3.6.4 Starting Point
Initially, only event models at the external system inputs are known.
As explained in section 3.1, local scheduling analysis generally requires
an activating event model for each task on a resource. Consequently, in
our example in Fig. 3.7 scheduling analysis initially cannot be performed
neither for the Bus nor for the DSP, since there is a cyclic scheduling
dependency between the functionally independent task graphs. Schedu-
ling analysis for the Bus requires an activating event model for task c2,
which is obtained only after scheduling analysis on the DSP, which in
turn requires an activating event model for task upd, which is obtained
only after scheduling analysis on the Bus.
One solution to the starting point problem is to initially propagate
all external event model periods and jitters along each task graph [96].
This approach is safe since on one hand scheduling cannot change an
event model period. On the other hand, as will be explained shortly,
scheduling can only increase an event model jitter [114]. Since a smaller
jitter interval is contained in a larger jitter interval, the minimum initial
jitter assumption is safe.
3.6.5 System-Level Analysis Iteration
In our compositional performance analysis approach we alternate local
scheduling analysis and event model propagation [98]. A global analysis
step consists of two phases [98]. In the first phase local scheduling anal-
ysis is performed for each resource. In the second phase, all output event
models are propagated. It is then checked if the first phase has to be
repeated because some activating event models are no longer up-to-date,
meaning that a newly propagated output event model is different from
the output event model that was propagated in the previous global anal-
ysis step. Analysis completes if either all event models are up-to-date
after the propagation phase, or if an abort condition, e. g. the violation
of a timing constraint has been reached.
Event model periods do not change during global analysis. Even
model jitters can only increase. For an intuitive explanation, imagine
that in Fig. 3.7 task upd interrupts task ctrl. The smaller the activa-
tion jitter of upd and ctrl, the tighter the bounds for the minimum and
maximum number of interrupts of ctrl by upd, and hence the output
jitter of ctrl. If any of the activation jitters increases, then the output
jitter obviously cannot decrease, but it may increase. This argument
Composition Using Standard Event Models 41
can be made for every resource in system. Therefore, propagation of
an increased output jitter can only increase other output jitters, which
are in turn propagated. A fix-point can be reached, since an activation
jitter increase for one task does not increase output jitter of a differ-
ent task if it does not change the minimum and maximum number of
interrupts [98].
3.6.6 Event Stream Adaptation
A second key property of our compositional performance analysis ap-
proach is the ability to adapt the possible timing of events in an event
stream (expressed through the adaptation of an event model [98]) to sat-
isfy requirements imposed on the event stream. Such requirements can
have different origins. It may be that a scheduler or a scheduling analy-
sis for a particular component requires certain event stream properties.
For example, rate-monotonic scheduling and analysis [68] require strictly
periodic task activation. Alternatively, an integrated IP component may
require certain event stream properties. External system outputs may
also impose event model constraints, e. g. a minimum distance between
output events or a maximum acceptable jitter. Such a constraint may be
the result of a performance contract with an external subsystem [115].
Event stream adaptation can also be done for the sole purpose of traffic
shaping [98]. Traffic shaping can be used e. g. to reduce transient load
peaks, in order to obtain more regular system behavior. It is also possible
to completely decouple non-functional cyclic scheduling dependencies by
reducing event stream jitter to zero. This can assure that global analysis
reaches a fix-point that without adaptation would not have been reached
(section 3.6.5).
In our compositional performance analysis approach, we logically dis-
tinguish event model interfaces (EMIFs) and event adaptation func-
tions (EAFs) [98]. An EMIF transforms between different event models
merely by calculating the target event model parameters from the source
event model parameters. Such a transformation is only possible in a few
cases. The transformation can be lossless, e. g. from periodic with jitter
= 0 to strictly periodic (no jitter parameter) with the same period. It
can also be lossy, e. g. from periodic with jitter to sporadic. If the min-
imum distance between events in the periodic with jitter event model
is not smaller than the required minimum distance between events in
the target even model, then the target event model is satisfies, but the
information about periodicity is lost.
In all other cases, active buffering of events is needed to arrive at
the desired result, e. g. from periodic with jitter > 0 to strictly periodic.
An EAF serves this purpose. Practically, we distinguish event model
42 Performance Analysis
adaptation from event model shaping in SymTA/S [107]. Adaptation is
required to satisfy an event model constraint, while shaping is voluntary
to obtain more regular system behavior. We have currently implemented
two types of EAFs: a periodic EAF produces a strictly periodic output
event stream from a periodic with jitter input event stream. A dmin-EAF
enforces a minimum distance between output events.
3.6.7 Communication Buffers
Two effects determine the required size of communication buffers, as
well as the worst- and best-case buffering delay: activation buffering
(section 3.6.2) and EAF buffering (section 3.6.6). We assume a FIFO
buffer at the input of a task which receives all incoming tokens. Tokens
are kept in the buffer from the moment of arrival until the instance of the
task that was activated by the token has been completed (section 3.6.1).
An optional EAF may delay task activation after the arrival of a token.
actBuffer
CP
act = in ex
activation
backlog
FIFO
(a) Activation buffering
actEAF_Buffer
CP
FIFO
act = EAF ex
activation
backlog
in
EAF
backlog
(b) Activation and EAF buffering
Figure 3.8. Communication buffering in our compositional performance analysis ap-
proach
Consider the example buffer states in Fig. 3.8 for a buffer between
producer task P and consumer task C. Pointer ex (execute) points
at the oldest token whose corresponding task instance has not been
completed. Pointer in (incoming) points at the FIFO position where
the next incoming token will be stored. The distance between in and ex
is the current backlog. If the buffer is empty, then both pointers point
at the same buffer position, indicating a backlog of zero.
In Fig. 3.8(a), each incoming token immediately leads to a consumer
task activation (act = in). Consequently, we call the backlog activation
backlog. In Fig. 3.8(b), an EAF has been additionally inserted at the
consumer task input. The EAF determines when a token leads to an
activation (act = EAF ). The EAF pointer follows the in pointer ac-
cording to the type and parameters of the EAF, e. g. with a minimum
distance between steps in case of a dmin-EAF. Consequently, the backlog
Composition Using Packet Flows and Resource Streams 43
is separated into EAF backlog and activation backlog. In these particular
buffer states, the distinction between Figs 3.8(a) and 3.8(b) has no effect
on task C. Assume however that both the total backlog and the EAF
backlog (in Fig. 3.8(b)) drop to 2. In that case, the activation backlog
is 2 in Fig. 3.8(a), but 0 in Fig. 3.8(b). Consequently, the task cannot
start executing in the presence of the EAF.
3.7 Composition Using Packet Flows and
Resource Streams
The compositional performance analysis methodology for network pro-
cessors developed by Thiele et al. focuses on scheduling analysis of
packet flows [109]. Packet flows are described by upper and lower ar-
rival curves [23, 87], which capture minimum and maximum packet-sizes,
maximum gaps between packets, peak-load and long-term load. An ex-
ample is shown in Fig. 3.9.
arrival_curves
∆t
load
uα
lα
largest gap
largest packet
maximum transient load
minimum transient load
average load
Figure 3.9. Arrival curves used to describe packet flows
Additionally, so called resource streams described by upper and lower
service curves represent the available processing power on a particular
resource. Each task is modeled with two inputs, one for an input packet
flow, one for an available resource stream, and with two outputs, one
for an output packet flow, one for a remaining resource stream. The
arrival curves describing the output packet flow, as well as the service
curves describing the remaining resource stream are constructed from the
input arrival and service curves using a so called network calculus. A nice
property of this approach is that the basic network calculus equations for
one task are independent of a scheduling policy. Scheduling policies are
expressed by appropriately connecting packet flows and resource streams
on one resource. An example with two tasks with two different priorities
on the same resource is given in Fig. 3.10 (task 1 has a higher priority
than task 2).
44 Performance Analysis
thiele_sched_comp_ex
task 2
task 1
available resource stream
input packet flow
input packet flow
output packet flow
output packet flow
available resource stream
remaining resource stream
remaining resource stream
=
Figure 3.10. Network calculus with two priority-scheduled tasks (task 1 has a higher
priority than task 2)
Modeling other schedulers is less straight-forward and may require ad-
ditional operations beyond the basic network calculus equations [16]. A
novel solution has to be found for each scheduler, since the network cal-
culus is different from standard scheduling analysis techniques developed
in the real-time community. In particular, this may require considerable
work for detailed scheduling analyses taking operating-system overhead,
blocking times, or event correlations into account.
Local analyses can be composed into a scheduling network [109] with
the same flexibility and scalability advantages that our compositional
performance analysis has over holistic approaches. Tasks are connected
through packet flows in the same way as tasks are connected trough event
stream in our approach. Thiele’s group has successfully demonstrated
how quickly a design-space can be explored using their network calculus.
They also evaluated different genetic algorithms to guide the exploration
process [109].
A major difference to our approach is the lack of discrete task acti-
vations in packet flows. This is reflected in the shape of arrival curves,
which are piecewise linear and continuous. Thiele’s group has demon-
strated that their network calculus is also applicable to discrete event
streams [16], and has extended their model to capture different types of
events [77]. While such a representation allows a more detailed model
of possible event timing compared to our set of event models, it is not
clear what kind of additional detail can actually be obtained as a result
of scheduling analysis, how complex it is to propagate and evaluate that
additional detail, and whether the obtainable benefit justifies the addi-
tional complexity and the need to derive scheduling analysis algorithms
anew.
Performance Analysis for Complex Applications 45
3.8 Performance Analysis for Complex
Applications
In section 2.1, we gave an overview over a range of models of com-
putation (MOCs), languages and tools used during application specifi-
cation in embedded system design. It became obvious that embedded
applications exhibit a variety of task-dependencies beyond the simple
sequential dependencies considered so far by our compositional perfor-
mance analysis approach 3.5. Specifically, we need to consider multi-
ple activating inputs (with different concatenation semantics), data-rate
transitions, conditional communication, cyclic task dependencies and
data-dependent behavior.
In this section we summarize to what extent the performance analy-
sis techniques presented in the previous sections support complex task
dependencies. A overview is given in Tab. 3.8.
analysis multiple
inputs
rate transitions
with fixed rates
conditional
communication
cycles data-dependent
behavior
Eles AND
√
(unrolling)
Palencia AND, OR
√
(N to 1)
Thiele
(packet
flow)
OR
Thiele
(event
streams)
√
Richter
Table 3.2. Complex application properties supported by a variety of performance
analysis approaches
The holistic approach by Eles et al. [90–92] supports AND-activation
and data rate transitions. The latter is supported through unrolling
into a single-rate schedule. The same approach has been proposed by
Ziegenbein [127]. Unrolling is common for static scheduling of SDF
graphs [62]. However, unrolling makes scheduling analysis of preemptive
systems considerably more difficult, since the length of the unrolled time-
window has to be considered in each step. It is also not applicable to
data rate transitions, where the producer or consumer data rate is an
interval.
The holistic approach by Palencia et al. supports AND- and OR-
activation as well as data-rate transitions, but only if external input
events are strictly periodic [86]. In particular, the effect that these acti-
46 Performance Analysis
vation dependencies have on activation jitter is not considered. There-
fore, their approach is limited to relatively simple, regular systems, and
in general is not applicable if subsystem-integration is required (sec-
tion 3.4). Furthermore, in their approach OR-activation requires dupli-
cation of subsystems, which in the worst-case leads to system-growth
that is exponential with the number of OR-activated nodes. Finally,
allowing only N to 1 rate transitions is a severe restriction since it does
not allow cyclic dependencies (see example in section 2.1.1).
Both Eles et al. and Palencia et al. additionally model phase informa-
tion between the activation of different tasks to calculate tighter analysis
bounds. This is also possible in our compositional approach, as we will
see in chapter 8.
The network calculus for packet flows by Thiele et al. [109] supports
OR-concatenation. Total packet flow into a task is the sum of packet
flows at all inputs. Individual activations are not considered. When the
network calculus is applied to event streams, then data-dependent be-
havior can be considered [77], but only for sequential task dependencies.
Our own compositional performance analysis approach (Richter) [97,
98] currently does not consider complex task dependencies. However,
the composition of local scheduling analysis techniques for the analysis of
arbitrarily complex architectures is very attractive and merits an exten-
sion of our approach to complex applications. In the following chapters
we will therefore develop solutions to support each of the application
features identified above as important for the analysis of real-world em-
bedded applications.
3.9 Other Performance Analysis Techniques
Performance analysis techniques that are not based on scheduling
analysis have also been proposed. We take a look at two interesting
approaches.
3.9.1 Model Checking-Based Scheduling Analysis
In model checking, a system is modeled as an automaton, and an ex-
haustive search-algorithm traverses the state-space to prove or disprove
certain system properties. Several groups have used model checking
based on timed automata [2] to derive solutions for certain scheduling
problems. The approach seems quite successful to derive schedules when
the timing of task activations is predictable, and when task preemp-
tion is not possible. An example is the scheduling of manufacturing
jobs on a complex assembly-line [30, 1]. A second example is the val-
idation of telecommunication protocols. In the tool TAXYS [22], non-
Other Performance Analysis Techniques 47
preemptive single-processor schedules specified in the synchronous lan-
guage ESTEREL [9] augmented by uncertainty and timing constraints
are validated using a timed automaton. It seems that the strength of
model checking based on timed automata lies in the verification of log-
ical properties of a scheduler such as reachability of all scheduler states
or deadlock-free operation. These properties cannot be checked by the
aforementioned scheduling analysis techniques, which assume a correct
scheduler.
When it comes to verification of temporal task properties assuming
preemptive scheduling (e. g. response times), then on one hand an ineffi-
ciency of timed automata, namely the inability to stop and restart clocks,
requires cumbersome workarounds. On the other hand, the well-known
state space explosion problem inherent in model checking becomes ap-
parent. A straight-forward model using the model checking tool UP-
PAAL [118] for as little as 10 tasks sharing one processor with a very
simple scheduler and periodic with jitter activation does not complete
analysis in any reasonable time [71]. A hybrid approach by Fersman is to
combine timed automata with standard task models and scheduling anal-
ysis approaches from the real-time community [31]. The search-space is
reduced because the task model and the known scheduling analysis ap-
proach guide the model checker to the interesting points. This produces
results in reasonable time for single-processor scheduling. Fersman com-
pares her approach to scheduling analysis which assumes periodic task
activation, and argues that in general activation timing can be more com-
plex, and is thus best modeled using a timed automaton. However, she
gives no indication if she has timing in mind beyond periodic, sporadic,
jitter and bursts, which are state-of-the-art in scheduling analysis. She
also does not consider core execution time intervals or best-cases, both of
which are required for the analysis of complex systems. Therefore, while
it is interesting to observe that model-checking can be applied to solve
certain scheduling-analysis problems, the future direction of this work is
unclear. In particular, we are not aware of any publications on model
checking-based scheduling analysis of complex multi-processor systems.
3.9.2 Tightly Coupled System Model and
Scheduling Analysis
Several tool-flows have been proposed where a specific application rep-
resentation is tightly coupled with a specific set of analysis techniques.
An interesting example is the RADHA-RATAN tool-chain by Gupta et
al. [24, 25, 72]. The focus of this tool-chain is the calculation of possible
execution rates for communicating tasks in a reactive real-time system.
These rates can then be compared against timing constraints. For this
48 Performance Analysis
purpose, a system is represented as a generalized task graph using a
simple specification language. Generalized in the sense that tasks with
different consumer and producer data-rates are allowed, similar to syn-
chronous dataflow (section 2.1.1), as well as complex topologies including
cycles. Generalized task graphs have a two-level hierarchy, where the up-
per level is cycle-free, and each of its nodes can be a strongly connected
component (SCC) on the lower level. The hierarchy is mandated by the
fact that the rate derivation techniques for acyclic and cyclic components
are different [24, 25].
Rate derivation for acyclic task graphs is part of RADHA [24, 25],
while RATAN focuses on cyclic task graphs [24, 72]. RATAN is more
restricted than RADHA since it allows only single-rate cycles, and only
a subset of the concatenation rules for tasks with multiple inputs.
RADHA-RATAN supports a variety of activation conditions for tasks
with multiple inputs. Specifically, these are AND- and OR-activation,
which are further refined into un-skipped and skipped tasks. An un-
skipped task has to consume every token produced by one of its prede-
cessors. A skipped task is allowed to drop tokens if it cannot keep up
with the rate of token arrivals. The expressiveness of the task graphs is
the most interesting feature of RADHA-RATAN from our perspective,
since it allows to model realistic embedded applications.
On the downside, while the authors claim that RADHA-RATAN is
a hardware/software-codesign methodology [24], SW synthesis remains
unclear. This is due to the independence assumption for tasks [72],
which implies that each task has its own exclusive resource. While this
can be a valid assumption for hardware-synthesis, it is unrealistic for
software, where multiple tasks share a processor under the control of
an operating system. Resource sharing is not considered in RADHA-
RATAN. Therefore, there is no obvious application of this approach
to scheduling analysis of complex embedded applications executed on
heterogeneous architectures with RTOSes and bus-arbitration.
3.10 Summary and Conclusion
In this chapter, we gave an overview of the state-of-the-art in formal
system-level performance analysis. Scheduling analysis, which calculates
worst- and best-case response times for tasks sharing a single compo-
nent, forms the basis of sophisticated performance analysis approaches
for multi-processor architectures. We focused on a compositional per-
formance analysis approach developed by our group which allows to
compose the large number of existing scheduling analysis techniques, is
applicable to complex, heterogeneous architectures, and supports sub-
system integration.
Summary and Conclusion 49
One problem of scheduling analysis is that the calculated corner cases
can be overly conservative, because scheduling analysis may not model
all scheduling effects in sufficient detail. Analysis tightness can be im-
proved considerably by modeling and evaluating select additional infor-
mation about the application (so called contexts). Our contribution to
analysis improvements through the use of contexts will be presented in
chapter 8.
However, the main drawback of our compositional performance analy-
sis approach is that it considers only single-rate, single-input tasks with-
out cyclic dependencies. As explained in the previous chapter, such task
graphs are too simple for realistic embedded applications. These short-
coming are remedied in the following three chapters in order to enable
compositional performance analysis for complex applications executed
on heterogenous architectures.

Chapter 4
TASKS WITH MULTIPLE INPUTS
Our compositional performance analysis approach (section 3.5) as-
sumes that one producer task can be connected to exactly one consumer
task, and vice versa. Consequently, the only application topology sup-
ported are functionally independent chains of tasks1.
In realistic embedded applications, task activation often depends on
tokens arriving from multiple other tasks and consequently requires mul-
tiple inputs. Likewise, a task often sends (potentially different) tokens
to multiple tasks and therefore may require multiple outputs. Multiple
inputs are the more complicated situation, since the possible activation
timing of a multi-input task depends on the possible output timing of
all tasks that are sending tokens.
Definition 4.1 (Input Event Functions) A lower input event func-
tion ηli(∆t) specifies the minimum number of times tokens become avail-
able for consumption at a consumer task port i during any time interval
of length ∆t. An upper input event function ηui (∆t) specifies the max-
imum number of times tokens become available for consumption at a
consumer task port i during any time interval of length ∆t.
In the following, the dependency of ηli and η
u
i on ∆t is omitted for
brevity.
The activation function of a task is a boolean function of input events
at the different task inputs. A restriction is that activation must not
1A chain could also form a loop, but then would have neither an external input for activation,
nor an external output to observe results. Nevertheless, it is possible to model such a ‘useless’
loop. In that case, our compositional performance analysis approach will not work, since an
activating event model for the tasks in the loop cannot be determined.
52 Tasks with Multiple Inputs
be invalidated due to the arrival of additional tokens [123]. This means
that negation is not allowed in the activation function. Consequently,
the only acceptable boolean operators are AND and OR, since an input
is negated in all other commonly used boolean operators (NOT, XOR,
NAND, NOR)2. In practice, this is not a restriction since negation does
not occur directly in commonly used MOCs. Negation internally would
have to be implemented by a separate ‘activator’ task that evaluates
the activation function upon arrival of a token to decide if activation of
the main task should be revoked. The ‘activator’ task would need to be
OR-activated.
AND- and OR-activation will be considered individually first. In sec-
tion 4.3, some issues that arise from combination of AND- and OR-
activation will be discussed.
4.1 AND-Activation
For a consumer task C with multiple inputs, AND-activation implies
that C is activated if tokens are available for consumption at each input i.
An example of an AND-activated task is shown in Fig. 4.1.
ANDex 
C
FIFO3
ANDFIFO2
FIFO1
1
1
1
P1
P2
P3
1
1
1
Figure 4.1. Example of an AND-activated task C with three inputs
Input tokens may have to wait at some inputs until enough tokens have
arrived at all other inputs for one consumer activation. This requires
token buffering, which we will refer to as AND-buffering. AND-buffering
is required on top of activation buffering (section 3.6.7). Logically, the
former holds tokens before task activation. The latter holds tokens from
the moment of task activation until execution of the task has been com-
pleted. We will consider token buffering due to a combination of both
effects in section 5.5. In this section, we concern ourselves exclusively
with AND-buffering.
2Using the same logic, buffered one-to-many communication is not allowed, since the activa-
tion of one task could be invalidated through the removal of an input token from the shared
buffer by another task.
AND-Activation 53
To ensure bounded AND-buffer sizes, the following requirements are
sufficient.
1 For lim∆t→∞, the difference between the minimum and maximum
number of input events at input i must be finite.
2 For lim∆t→∞, the difference between the number of input events at
different inputs must be finite.
Together, both requirements guarantee that every arriving token will
lead to a consumer activation in finite time. Note that the first require-
ment can be relaxed if the numbers of input events at different inputs
are suitably correlated. Such correlations will be briefly discussed in
chapter 8. Without a model of such correlations the first requirement in
particular implies that sporadic input events are not acceptable.
4.1.1 Calculation of Activating Event Functions
Returning to our example, imagine that tasks P1, P2, P3 and C are
mapped onto resources and scheduled together with other tasks under
RTOS control. In order to perform scheduling analysis on the resource
to which C is mapped, activating event functions for task C have to be
calculated from the input event functions at the three inputs.
Let us assume for now that task C requires one token at each input
for one activation. This restriction will be dropped in section 5.3, after
we have discussed data rates and data rate transitions. During any time
interval ∆t, the port with the smallest minimum number of available
tokens determines the minimum number of AND-activations. Likewise,
the port with the smallest maximum number of available tokens deter-
mines the maximum number of AND-activations. Intuitively, one could
think that the maximum number of AND-activations is determined by
the port with the smallest upper input event function. However, the
number of available tokens at input port i during a time interval ∆t
depends on both the number of tokens arriving during ∆t, and on the
number of tokens that arrived earlier, but did not yet lead to an activa-
tion because tokens at one or more other input ports are still missing.
This is illustrated in the following example. Let us assume that our task
in Fig. 4.1 receives tokens at each input with the following periodic with
jitter input event models:
P1 = 4, J1 = 0
P2 = 4, J2 = 2
P3 = 4, J3 = 3
54 Tasks with Multiple Inputs
Fig. 4.2 shows a possible sequence of input events that adhere to these
event models, and the resulting AND-activation events. The number-
ing of events in the figure indicates which events together lead to one
activation of AND-activated task C.
ANDes
t
t0 t0+ 4 t0+ 8 t0+ 12 t0+ 16
t
t
t
input1 events
input2 events
input3 events
AND-activation
 events
1
1
1
1
2
2
2
2
3
3
3
3
4
4
4
4
Figure 4.2. AND-activation timing example for a valid sequence of input events
As can be seen, the minimum distance between two AND-activations
(activations 3 and 4 in Fig. 4.2) equals the minimum distance between
two input events at input 3, which is the input with the largest input
event model jitter. Likewise, the maximum distance between two AND-
activations (activations 1 and 2 in Fig. 4.2) equals the maximum distance
between two input events at input 3. It is not possible to find a differ-
ent sequence of input events leading to a smaller minimum or a larger
maximum distance between two AND-activations. From this we can
conclude that the input with the largest input event jitter determines
the activation jitter of the AND-activated task. This observation can be
generalized to any combination of input event models which satisfy the
two AND-activation conditions given above.
Lemma 4.2 Let C be an AND-activated task with upper and lower input
event functions ηui , η
l
i for each input i. Then for any time interval ∆t,
the upper activating event function of task C equals the maximum of
all upper input event functions. Likewise, the lower activating event
function of task C equals the minimum of all lower input event functions.
ηuAND = max{ηui } (4.1)
ηlAND = min{ηli} (4.2)
Proof 4.2 Upon arrival, a token either has to wait for other tokens to
arrive for one consumer activation, or it is the last token required for one
AND-Activation 55
consumer activation. Let us assume that a token arrives at input i at
time tact,i, and that this token is the last token required for one consumer
activation. Then ηui is an upper bound on the number of times that the
input condition at input i can be satisfied during any time interval ∆tact,i
starting at tact,i, and thus bounds the maximum number of times that
task C can be activated during ∆tact,i.
Let us now consider all points in time tact when a token arrival leads to
a consumer activation. Without further information we have to assume
that the last required token can arrive at any input i. Then, max{ηui }
is an upper bound on the number of times that task C can be activated
during a time interval ∆tact starting at any point in time tact. For all
other points in time, the upper bound on the number of times that
task C can be activated during a time interval ∆t will be equal or lower,
since the first activation does not happen until the next tact. This proves
equation 4.1.
For any time interval of length ∆t, ηli is a lower bound on the number
of times that the input condition at input i has to be satisfied. Without
further information we have to assume that the last required token can
arrive at any input i. Then, min{ηli} is a lower bound for the number of
times that task C has to be activated. This proves equation 4.2. 2
Note that in Fig. 4.2 we assumed that corresponding input events
can occur at the same time. In some cases it may be possible to calcu-
late phases between the arrival of corresponding tokens in more detail,
e. g. through the use of inter-contexts (section 8.2). It may then be pos-
sible to calculate tighter activating event functions if it can be shown
that a certain input cannot (fully) influence the activation timing of
an AND-activated task, because tokens at this input arrive relatively
early. This is particularly important for the analysis of functional cycles
(chapter 6).
4.1.2 AND-Activation Incurred Delay and
Backlog
In embedded real-time systems, meeting deadlines as well as accu-
rately dimensioning memory are critical. Therefore, we need to calcu-
late the maximum AND-delay, as well as the maximum AND-backlog
and consequently the required AND-buffer size for each input. For this
purpose, the lower activating event function is interpreted as a lower
service function [109] of a hypothetical AND-concatenation task with 0
execution time: this task serves each input immediately after enough
tokens have arrived at all inputs.
56 Tasks with Multiple Inputs
This interpretation allows us to reuse results by Thiele et al. who
showed in [109] that the maximum delay experienced by input data
before it is processed by a consumer task is the maximum horizontal
distance between the upper arrival curve of the data and the lower service
curve of the task. The maximum buffer size required to buffer that
data is determined by the maximum vertical distance between the upper
arrival curve and the lower service curve.
Inserting ηui and η
l
AND into the delay and backlog inequations from [109]
yields
delayi ≤ max
∆t≥0
{
min
t≥0
{
t | ηui (∆t) = ηlAND(∆t+ t)
}}
(4.3)
backlogi ≤ max
∆t≥0
{
ηui (∆t)− ηlAND(∆t)
}
(4.4)
The minimum delay and backlog at each input are zero, since the last
token required for AND-activation can arrive at any input.
4.1.3 AND-activation for periodic with jitter Input
Event Models
For the reasons stated in section 3.1.1 we are specifically interested
in periodic with jitter input event models, which we describe with pa-
rameters (Pi, Ji). We would like to calculate the activating event model
(PAND, JAND) for the AND-activated task, which we will need for sche-
duling analysis.
4.1.3.1 AND-Activation Period
To ensure bounded input buffer sizes, the period of all input event
models must be the same. The period of the activating event model
equals this period.
Pi != Pj ; i, j = 1..k ⇒
PAND = Pi ; i = 1..k (4.5)
4.1.3.2 AND-Activation Jitter
The input event model with the largest jitter has both the maximum
upper and the minimum lower input event function, and thus according
to equations 4.1 and 4.2 determines the jitter of the activating event
model of the AND-activated task.
JAND = max{Ji} ; i = 1..k (4.6)
AND-Activation 57
4.1.3.3 AND-Activation Incurred Token Delay and Backlog
The input event model with the largest jitter determines ηlAND, which
is then inserted into inequations 4.3 and 4.4.
delayi ≤ max
∆t≥0
{
(4.7)
min
t≥0
{
t |
⌈
∆t+ Ji
P
⌉
= max
(
0,
⌊
∆t+ t− JAND
P
⌋)}}
backlogi ≤ max
∆t≥0
{⌈
∆t+ Ji
P
⌉
−max
(
0,
⌊
∆t− JAND
P
⌋)}
(4.8)
To find a simple solution for inequation 4.7, let us first consider the
strictly periodic case where all input jitters are zero. In that case, in-
equation 4.7 becomes
delayi ≤ max
∆t≥0
{
min
t≥0
{
t |
⌈
∆t
P
⌉
=
⌊
∆t+ t
P
⌋}}
For ∆t = n∗P+lim²→+0 ; n ∈ N0 the required value for t approaches
P. For all other ∆t, the required value for t is smaller. We conclude
that for this simplest case, delayi ≤ P. This result is also obvious from
Fig. 4.3(a) which shows the upper and lower event functions of a strictly
periodic event model with period P = 4. Obviously, the maximum
horizontal distance between the two curves approaches P.
Let us now return to the general case. Compared to the strictly
periodic case, a jitter Ji at input i moves the upper input event function
ηui left by Ji. The activation jitter JAND moves the lower activating event
function ηlAND right by JAND. Consequently, the maximum horizontal
distance between the two functions is
delayi ≤ P + Ji + JAND (4.9)
The maximum backlog is obviously incurred for a ∆t just before the
lower activating event function steps. The lower activating event func-
tion steps for the first time at ∆t = P +JAND. At that point, the value
of the upper input event function is d(P + JAND + Ji)/Pe according to
equation 3.1. Since thereafter both functions step periodically every P,
we do not need to consider any other point. Consequently
backlogi ≤
⌈P + Ji + JAND
P
⌉
(4.10)
These inequations can be further improved for the input with the
largest jitter, since a token arriving at that input can only wait for
58 Tasks with Multiple Inputs
ANDin_a
u
1η
5 10 15
∆t
5
events
l
1η
14
15
135
164
(a) AND input 1
ANDin_a
5 10 15
∆t
5
events
u
2η
l
2η
(b) AND input 2
Figure 4.3. Upper and lower input event functions at the first and second input in
our AND-example
ANDin_c
5 10 15
∆t
5
events
uu
AND 3ηη =
ll
AND 3ηη =
Figure 4.4. Upper and lower input event functions at the third input, as well as
upper and lower activating event functions in our AND-example (AND activation
∧
=
AND input 3)
tokens arriving at other inputs. Consequently, for the input with the
largest jitter, the second largest jitter can be used instead of max{Ji}
in inequation 4.6. If two or more inputs have the same largest jitter,
then one of them can be interpreted as the second largest, and a special
treatment of these inputs is not necessary.
4.1.4 Example
Let us return to our example in Fig. 4.1. The input event functions
as well as the activating event functions are shown in Figs. 4.3 and 4.4.
The activating event model equals the input event model at the third
input, since it has the largest jitter.
PAND = P3 = 4, JAND = J3 = 3
AND-Activation 59
ANDdelayBacklog_a
u
1η
5 10 15
∆t
5
events
backlog1,max
delay1,max
l
ANDη
u
1τ
5 10 15
∆t
5
10
token
backlog1,max
delay1,max
l
AND 1,τ
14
15
135
164
(a) AND buffer at input 1
ANDdelayBacklog_b
5 10 15
∆t
5
events
backlog2,max
delay2,max
u
2η
l
ANDη
5 10 15
∆t
5
10
token
backlog2,max
delay2,max
u
2τ
l
AND 2,τ
(b) AND buffer at input 2
Figure 4.5. Maximum delay and backlog incurred at the first and second input of
our example AND-activated task
ANDdelayBacklog_c
5 10 15
∆t
5
events
u
3η
l
AND 3,η
backlog3,max
delay3,max
5 10 15
∆t
5
10
token u
3τ
l
AND 3,τ
backlog3,max
delay3,max
Figure 4.6. Maximum delay and backlog incurred at the third input of our example
AND-activated task
We obtain the following values for maximum delay and maximum
backlog at each input.
delay1 ≤ 7 time units, backlog 1 ≤ 2 events / tokens
delay2 ≤ 9 time units, backlog 2 ≤ 3 events / tokens
delay3 ≤ 9 time units, backlog 3 ≤ 3 events / tokens
The calculation of theses values is shown in Figs. 4.5 and 4.6. Note
that we use a different lower service function ηlAND,3 at the input with the
largest jitter (the 3rd input) according to section 4.1.3.3. Further note
that backlog and delay at the two inputs with the largest and second
largest jitter (inputs 2 and 3) are the same. This is always the case
since the left-shift of ηui from the second-largest to the largest jitter is
60 Tasks with Multiple Inputs
compensated by the same-size left-shift of ηlAND from the largest to the
second-largest jitter.
4.2 OR-Activation
For a consumer task C with multiple inputs, OR-activation implies
that C is activated each time tokens are available for consumption at
any input. Different to AND-activation, input event models are not
restricted, and no OR-buffering is required, since tokens at one input
never have to wait for tokens to arrive at a different input in order to
activate C. Of course, activation buffering (section 3.6.7) is still required.
We will consider activation buffering in the presence of OR-activation in
section 5.5.
An example of an OR-activated task is shown in figure 4.7.
ORex
C
OR
1
1
Figure 4.7. Example of an OR-activated task with two inputs
4.2.1 Calculation of Activating Event Functions
We assume that no input event is ever skipped. Therefore, for any ∆t,
the maximum (minimum) number of activating events for OR-activated
task C is the sum of the maximum (minimum) number of input events.
ηuOR =
n∑
i=1
ηui (4.11)
ηlOR =
n∑
i=1
ηli (4.12)
Formalisms have also been proposed, e. g. by Gupta et al. ([24] and
section 3.9.2), that allow to skip input events of OR-activated tasks.
Some operating systems, e. g. ERCOSEK [29], skip input events of an
OR-activated task as long as the task has not completed its previous ex-
ecution. This implies that the activating event functions can no longer
be calculated from input event functions alone, but that task response
times have to be considered. Essentially it means that the upper acti-
vating event function is limited from above by a step function, were the
distance between steps corresponds to the minimum response time of
the activated task. The lower activating event function is limited from
OR-Activation 61
5 10 15
∆t
5
events
u
1η
l
1η
ORin_a
(a) OR input 1 (P1 = 4, J1 = 2)
5 10 15
∆t
5
events
u
2η
l
2η
ORin_b
(b) OR input 2 (P2 = 3, J2 = 2)
Figure 4.8. Upper and lower input event functions in our OR-example
below by a step function, were the distance between steps corresponds to
the maximum response time of the activated task. Skipped input events
are not further considered in this thesis.
4.2.2 OR-activation for periodic with jitter Input
Event Models
For the reasons stated in section 3.1.1 we are specifically interested
in periodic with jitter input event models, which we describe with pa-
rameters (Pi, Ji). We would like to calculate the activating event model
(POR, JOR) for the OR-activated task, which we will need for scheduling
analysis.
Let us consider the example task in Fig. 4.7. Let us assume that the
tokens arrive at the two inputs with the following periodic with jitter
event models:
P1 = 4, J1 = 2
P2 = 3, J2 = 2
The corresponding upper and lower input event functions are shown
in Fig. 4.8. According to equations 4.11 and 4.12, the upper and lower
activating event functions for task C are constructed by adding the re-
spective input event functions. The result is shown in Fig. 4.9(a).
Recall a key requirement of compositional performance analysis, namely
that event streams are described in a form that can serve both as input
for local scheduling analysis, and can be produced as an output of local
scheduling analysis for propagation to the next analysis component (sec-
tion 3.1.1). Due to the irregularly spaced steps (visible in Fig. 4.9(a)),
62 Tasks with Multiple Inputs
u
ORη
l
ORη
5 10 15
∆t
5
events
10
ORout_exact
14
15
135
164
(a) exact
5 10 15
∆t
5
events
10
ORout_pju JPOR +,η
l
JPOR +,η
(b) periodic with jitter approximation
Figure 4.9. Upper and lower activating event functions in our OR-example
the exact activating event functions cannot be described by a periodic
with jitter event model, and thus cannot serve directly as input for local
scheduling analysis. One solution would be to use a scheduling analysis
algorithm that considers each periodic with jitter input event model in-
dividually and internally calculates the sums in equations 4.11 and 4.12.
Such an approach would yield the most accurate response times. How-
ever, we cannot expect that every analysis algorithm is able to do that.
Furthermore, after local scheduling analysis a periodic with jitter out-
put event model has to be propagated to the next analysis component.
We need an activation jitter in order to calculate an output jitter (sec-
tion 3.6.2). Therefore, we need to find conservative approximations for
ηuOR and η
l
OR that can be described by a periodic with jitter event model
(POR,JOR). The intended result is shown in Fig. 4.9(b) (the exact curves
appear as dotted lines).
4.2.2.1 OR-Activation Period
The period of OR-activation is the least common multiple LCM(Pi)
of all input event model periods (the macro period), divided by the sum
of input events during the macro period assuming zero jitter for all input
OR-Activation 63
event streams.
POR = LCM(Pi)∑n
i=1
LCM(Pi)
Pi
=
1∑n
i=1
1
Pi
(4.13)
4.2.2.2 OR-Activation Jitter
A conservative approximation of ηuOR and η
l
OR with a periodic with
jitter event model implies that equations 4.11 and 4.12 are replaced by
the following inequations.⌈
∆t+ JOR
POR
⌉
≥
n∑
i=1
⌈
∆t+ Ji
Pi
⌉
(4.14)
max
(
0,
⌊
∆t− JOR
POR
⌋)
≤
n∑
i=1
max
(
0,
⌊
∆t− Ji
Pi
⌋)
(4.15)
In order to be as accurate as possible, we are interested in the min-
imum jitter that satisfies inequations 4.14 and 4.15. Let us define J uOR
to be the minimum upper activation jitter that satisfies inequation 4.14,
and J lOR to be the minimum lower activation jitter that satisfies inequa-
tion 4.15. Then the first interesting question is whether J uOR and J lOR
are the same.
Lemma 4.3 For any combination of periodic with jitter input event
models of an OR-activated task C, the minimum upper activation jitter
J uOR and the minimum lower activation jitter J lOR are the same.
Proof 4.3 Let us first assume that Ji = 0 for all i input event models of
task C. Then, the upper and lower activating event functions of task C
meet periodically each macro period, i. e. each ∆tLCM,n where ∆tLCM,n =
n∗LCM{Pi} ; n ∈ N 0. The activating event functions obviously display
an inverse symmetry around each ∆tLCM,n. If the upper activating event
function steps at ∆tu = ∆tLCM,n + t, then the lower activating event
function steps at ∆tl = ∆tLCM,n − t.
Upper and lower event function ηuOR,P+0 and η
l
OR,P+0 which corre-
spond to a periodic with jitter event model with (P = POR , J = 0)
also meet periodically each ∆tLCM,n. These two event functions are not
conservative approximations of ηuOR and η
l
OR. To obtain conservative
approximations, ηuOR,P+0 has to be shifted left by a minimum amount,
and ηlOR,P+0 has to be shifted right by a minimum amount. These mini-
mum shifts equal J uOR and J lOR, respectively. Due to the aforementioned
inverse symmetry, J uOR = J lOR.
64 Tasks with Multiple Inputs
Let us now drop the initial assumption that Ji = 0 for all i. The upper
and lower activating event functions of task C will no longer meet, but
the inverse symmetry around each ∆tLCM,n remains (e. g. around ∆t =
12 in Fig. 4.9(a)). The reason is that jitter increases the maximum, and
decreases the minimum distance between consecutive events by the same
amount (equations 3.3 and 3.4). Therefore, ηuOR,P+0 and η
l
OR,P+0 still
have to be shifted left respectively right by an equal minimum amount
to obtain conservative approximations of ηuOR and η
l
OR
3. Consequently,
J uOR = J lOR is true for any combination of periodic with jitter input
event models of an OR-activated task. 2
In the following, the upper approximation (inequation 4.14) is used
to calculate the OR-activation jitter. Since the left and right sides of
this inequation are only piecewise continuous, the inequation cannot be
simply transformed to obtain the desired minimum jitter. The solu-
tion used here is to evaluate inequation 4.14 piecewise for each interval
]∆tj ,∆tj+1], during which the right side of the inequation has a con-
stant value kj ∈ N. For each constant piece of the right side, a condition
for a local jitter JOR,j is obtained that satisfies the inequation for all
∆t : ∆tj < ∆t ≤ ∆tj+1.
For each constant piece of the right side, inequation 4.14 becomes⌈
∆t+ JOR,j
POR
⌉
≥ kj ; ∆tj < ∆t ≤ ∆tj+1 , kj ∈ N
Since the left side of this inequation is monotonically increasing with
∆t, it is sufficient to evaluate it for the smallest value of ∆t, which
approaches ∆tj . I. e.
lim
²→+0
⌈
∆tj + ²+ JOR,j
POR
⌉
≥ kj ; kj ∈ N
⇔ lim
²→+0
∆tj + ²+ JOR,j
POR > kj − 1
⇔ lim
²→+0
(JOR,j + ²) > (kj − 1) ∗ POR −∆tj
⇔ JOR,j ≥ (kj − 1) ∗ POR −∆tj (4.16)
The global minimum jitter is then the smallest value which satisfies
all local jitter conditions. As already said, ηuOR displays a pattern of
distances between steps which repeats periodically every macro period.
Therefore, it is sufficient to perform above calculation for one macro
period. An algorithm can be found in [41].
3The curves in Fig. 4.9(b) are the result of this operation.
OR-Activation 65
4.2.3 Example
Let us return to our initial example and calculate the activating event
model for OR-activated task C. According to equation 4.13 we obtain
the following OR-activation period.
POR = P1 ∗ P2P1 + P2 =
12
7
≈ 1.714
OR-activation jitter is calculated according to inequation 4.14, which
becomes ⌈
∆t+ JOR
12/7
⌉
≥
⌈
∆t+ 2
4
⌉
+
⌈
∆t+ 2
3
⌉
Constant segments of the right side of this inequation now have to
be found. The first constant segment is for 0 < ∆t ≤ 1. According to
inequation 4.16 ⌈
∆t+ JOR,0
12/7
⌉
≥ 2 ; 0 < ∆t ≤ 1
⇔ JOR,0 ≥ 1 ∗ 127 − 0 = 12/7
The following table shows all constant segments and the resulting
local JOR,i for the first macro period. Each constant segment can also
be seen in the plot of ηuOR in Fig. 4.9(a).
∆t range kj local JOR,j
0 < ∆t ≤ 1 2 JOR,0 ≥ 1 ∗ 127 − 0 = 127
1 < ∆t ≤ 2 3 JOR,1 ≥ 2 ∗ 127 − 1 = 177
2 < ∆t ≤ 4 4 JOR,2 ≥ 3 ∗ 127 − 2 = 227
4 < ∆t ≤ 6 5 JOR,3 ≥ 4 ∗ 127 − 4 = 207
6 < ∆t ≤ 7 6 JOR,4 ≥ 5 ∗ 127 − 6 = 187
7 < ∆t ≤ 10 7 JOR,5 ≥ 6 ∗ 127 − 7 = 237
10 < ∆t ≤ 13 9 JOR,6 ≥ 8 ∗ 127 − 10 = 267
13 < ∆t ≤ 14 10 JOR,7 ≥ 9 ∗ 127 − 13 = 177
The last table entry only shows that starting with ∆t > 1, the pattern
repeats every ∆t = LCM(P1,P2) = 4 ∗ 3 = 12 time units.
Finally, we take to largest JOR,i to obtain the activating event model
jitter.
JOR = 267 ≈ 3.714
66 Tasks with Multiple Inputs
4.2.4 OR-activation for sporadic with jitter Input
Event Models
If all input event models for an OR-activated task are sporadic with
jitter, then the same rules to calculate the activating event model apply
as for periodic with jitter event models. The only difference of course is
that the activating event model is also sporadic with jitter.
If some input event models are periodic with jitter and some are spo-
radic with jitter, then the simplest, but very conservative solution is
to treat all event models as sporadic with jitter. This is valid, since a
sporadic with jitter event model is a conservative approximation of a
periodic with jitter event model with the same parameters [98].
A second solution yielding tighter bounds is to perform a second calcu-
lations to obtain separate period and jitter parameters that more accu-
rately models the lower activating event function of the OR-concatenated
task. This calculation works exactly as described above, but ignores all
sporadic with jitter input event models and considers only the periodic
with jitter ones. This solution is only useful if our system-level analysis
framework is extended to propagate two separate sets of parameters for
each event stream. The additional complexity is bounded, since it is
never necessary to maintain more than two sets per event stream (one
for the upper, and one for the lower event function). During scheduling
analysis, worst case load, worst case response time and upper output
event function parameters are calculated using the parameters for the
upper input event function, and best case load, best case response time
and lower output event function parameters are calculated using the
parameters for the lower input event function.
4.3 Combination of AND- and OR-Activation
A task may have an activation function with nested AND- and OR-
concatenations, e. g. (i1OR i3) AND i2. For each concatenation, the con-
catenated event model is calculated as described in sections 4.1 and 4.2,
and used as an input event model for the next concatenation. The
number of nested concatenations, as well as the order of AND- and
OR-concatenations is not restricted. As was the case for pure AND-
activation, it has to be ensured that buffer over- or underrun can be
reliably avoided in the presence of a combination of OR- and AND-
activation. This implies that all input event models for each AND-
concatenation must be periodic and have the same period. For our
example, we obtain the following condition:
P1 ∗ P3
P1 + P3
!= P2
Summary and Conclusion 67
An interesting problem arises if the same input appears more than
once in the activation function. For example, the following is a correct
boolean transformation:
(i1OR i3) AND i2 = (i1 AND i2)OR (i2 AND i3)
This implies P1 != P2 != P3, which is obviously not consistent with the
condition that we gave above. The reason is that we implicitly assumed
that i2 can be read twice. Since this is not possible in a FIFO, we
conclude that an input must not appear more than once in the activation
function.
4.4 Multiple Outputs
A different number of tokens can be produced at different outputs of a
multi-output task at the end of one execution. The output rate can be a
fixed number of tokens or an interval (including a lower bound of zero).
Data rates will be considered in detail in the next chapter. However,
we assume that the specified number (interval) of tokens is produced at
each output every execution. Tasks that communicate conditionally are
modeled with an output rate interval. Once we consider intra-contexts
during analysis, it starts to make sense to explicitly distinguish different
task behaviors modes. We will return to this in section 8.1.
4.5 Summary and Conclusion
In this chapter we showed how to apply our compositional perfor-
mance analysis approach to tasks with multiple activating inputs. We
argued that AND- and OR-activation (and their combination) are the
only meaningful concatenations of multiple activating inputs. We cal-
culated activating event models for both types of concatenations, as
well as required communication buffers and incurred token delay for
AND-activation. In case of OR-concatenation, we had to conservatively
approximate the exact activating event functions, in order to obtain pe-
riod and jitter parameters required in our compositional performance
analysis approach.
We used input event functions which allowed us to abstract from the
specific number of tokens required at a particular input for one consumer
task activation. For simplicity, we assumed a single required token per
input during buffer-size calculation. In the following chapter we will first
consider data rate transitions between tasks, and then combine those
results with the results from this chapter into a general model which
allows to calculate activating event models and the required buffers in
the presence of both data rate transitions and multiple activating inputs.

Chapter 5
RATE TRANSITIONS BETWEEN TASKS
So far, we have assumed that a task consumes (produces) one token
per execution per input (output). However, in realistic applications a
task may consume (produce) multiple tokens per execution at a partic-
ular input (output). From this follows that the number of tokens pro-
duced by a producer task at a particular output may be different from
the number of tokens consumed by a consumer task at the connected in-
put, leading to a data rate transition. Furthermore, task communication
may also be conditional, leading to data rate transitions with intervals.
Data rate transitions occur in MoCs where process activation de-
pends on the availability of a certain amount of data (section 2.1.1).
A prominent example of such dataflow models is Synchronous Dataflow
(SDF) [61], which is a standard MoC used in signal-processing. For ex-
ample, Fig. 5.1 shows an SDF model of a chain of filters which convert
the digital audio format used for CDs to one used for DATs (digital
audio tapes).
cd2dat
b
4
c
7 5
d
7 4
out
1
in
1
a
1 2 3
Figure 5.1. SDF model of a chain of filters which convert digital audio from CD to
DAT format. The system exhibits data rate transitions between tasks.
Data rate transitions have two implications. They change the fre-
quency of consumer activations compared to producer executions. They
may also require buffering of some tokens (rate transition buffering) until
enough tokens have arrived for one consumer activation [61]. Note that
70 Rate Transitions between Tasks
a rate transition buffer is required on top of the execution buffer that
hold tokens from the moment of task activation until execution of the
task has been completed (section 3.6.7). We will consider token buffer-
ing due to a combination of these effects in section 5.5. In this chapter,
we concern ourselves exclusively with rate transition buffering.
We first consider data rate transitions between fixed data rates. We
initially assume tasks with only one input and one output. We show
how to calculate activating event functions from output event functions
in the presence of a data rate transition. We also calculate the required
buffer size and worst-case token delay due to a data rate transition. The
results are then extended to data rate transitions with intervals. At the
end of the chapter, we consider tasks with multiple inputs and combine
data rate transitions with AND- and OR-activation.
5.1 Fixed Data Rates
In a data rate transition with fixed data rates, a producer task pro-
duces a fixed number of tokens per execution, the producer data rate
rp. The connected consumer task consumes a different fixed number of
tokens per execution, the consumer data rate rc.
P
2
C
3FIFO RTex
Figure 5.2. Data rate transition with smaller producer data rate (rp = 2) and larger
consumer data rate (rc = 3)
Consider the data rate transition example in Fig. 5.2. Task P pro-
duces 2 tokens per execution, task C consumes 3 tokens per execution.
Now imagine that P and C are mapped onto resources and scheduled
together with other tasks under RTOS control. In order to perform
scheduling analysis on the resource to which C is mapped, activating
event functions for task C have to be calculated from the output event
functions of task P . The answer is not obvious, since due to the data
rate transition different tokens have to wait for different lengths of time
until enough tokens have arrived for one activation of C.
5.1.1 Token Functions
So far, we have implicitly assumed that an output event corresponds
to the production of one token, and that an input event corresponds to
the arrival of one token. In particular, we have made these assumptions
Fixed Data Rates 71
in the previous chapter when calculating activating event functions for
AND- and OR-activated tasks.
In this chapter, the amounts of data produced and consumed per ex-
ecution are different or even intervals. We introduce producer and con-
sumer token functions for the purpose of explicitly modeling the amounts
of data produced and consumed data. We initially assume tasks with
only one input and one output.
Definition 5.1 (Producer Token Functions) The lower producer
token function τ lp(∆t) specifies the minimum number of tokens that task P
produces during any time interval of length ∆t. The upper producer to-
ken function τup (∆t) specifies the maximum number of tokens that task P
can produce during any time interval of length ∆t.
Definition 5.2 (Consumer Token Functions) The lower consumer
token function τ lc(∆t) specifies the minimum number of tokens that be-
come available for consumption by task C during any time interval of
length ∆t. The upper consumer token function τuc (∆t) specifies the max-
imum number of tokens that become available for consumption by task C
during any time interval of length ∆t.
In the following, the dependency of τ l and τu on ∆t is omitted for
brevity.
The production of rp tokens is idealized to take zero time. Therefore,
producer token functions are piecewise constant between vertical steps
whose height corresponds to the producer’s data rate. The consumer is
activated if at least rc tokens are available. Therefore, consumer to-
ken functions are piecewise constant between vertical steps whose height
corresponds to the consumer’s data rate. Token functions are similar to
arrival curves [23, 110, 87] with the special property that production
and consumption of tokens is atomic and timeless [49, 51].
A simple relationship exists between token functions and event func-
tions. Let ηup and η
l
p explicitly refer to the upper and lower event func-
tions describing the output timing of producer task P . Let ηuc and η
l
c
explicitly refer to the upper and lower event functions describing the
activation timing of consumer task C. Then
τup = rp ∗ ηup (5.1)
τ lp = rp ∗ ηlp (5.2)
τuc = rc ∗ ηuc (5.3)
τ lc = rc ∗ ηlc (5.4)
If the producer data rate rp is different from the consumer data rate
rc, then producer and consumer token function have an added level of
72 Rate Transitions between Tasks
expressiveness compared to event functions. This is exploited to calcu-
late activating event functions for the consumer task in the presence of
data rate transitions.
5.1.2 Calculation of Activating Event Functions
As already said, the goal of this section is to calculate activating event
functions ηuc and η
l
c from output event functions η
u
p and η
l
p in the presence
of a data rate transition between producer task P and consumer task C.
Let us assume that P ’s output event functions are given. In the first
step, producer token functions are obtained from P ’s output event func-
tions using equations 5.1 and 5.2. In the second, central step, consumer
token functions have to be constructed from producer token functions.
Remember that consumer token functions describe the availability of
multiples of rc tokens as a function of ∆t.
To correctly construct consumer token functions, all possible num-
bers of tokens buffered between tasks P and C have to be considered.
Therefore, to correctly construct the upper consumer token function,
the maximum initial number of tokens in the buffer between P and C
not leading to an activation of C, i. e. rc − 1 tokens, has to be assumed.
Furthermore, it must be assumed that subsequent tokens arrive as early
as possible. These assumptions are captured by shifting the upper pro-
ducer token function upwards rc − 1 tokens. The shifted function shall
be called τ¯up .
τ¯up = τ
u
p + rc − 1
To correctly construct the lower consumer token function, the minimum
initial number of tokens in the buffer, i. e. 0, has to be assumed. Fur-
thermore, it must be assumed that subsequent tokens arrive as late as
possible. These assumptions are already expressed by the lower producer
token function.
Upper and lower consumer token functions can now be constructed.
They step as soon as possible, without rising above τ¯up and τ
l
p, respec-
tively. Otherwise, non-existent tokens might be consumed.
τuc =
⌊
τ¯up
rc
⌋
∗ rc (5.5)
τ lc =
⌊
τ lp
rc
⌋
∗ rc (5.6)
In the final step, the consumer token functions are transformed into
activating event functions, where each arrival of rc tokens leads to one
activation of task C according to equations 5.3 and 5.4.
Fixed Data Rates 73
5 10 15
∆t
5
10
token
u
pτ
l
pτ
RTin_a
(a) original
5 10 15
∆t
5
10
token
rC- 1
l
pτ
u
pτ RTin_b
(b) shifted upper producer token function
Figure 5.3. Upper and lower producer token functions for our data rate transition
example
5 10 15
∆t
5
10
token
l
pτ
u
cτ
l
cτ
u
pτ
RTout_a
(a) in comparison to τ¯up and τ
l
p
5 10 15
∆t
5
10
token
u
pτ
l
pτ
u
cτ
l
cτ
RTout_b
(b) in comparison to τup and τ
l
p
Figure 5.4. Upper and lower consumer token functions for our data rate transition
example
Let us apply these results to the example in Fig. 5.2. Let us assume
that task P produces 2 tokens per execution with the following periodic
with jitter output event model:
Pp = 4, Jp = 1
The upper and lower producer token functions τup and τ
l
p are shown in
Fig. 5.3(a). In Fig 5.3(b), the upper producer token function has been
shifted upwards by rc − 1 = 2. The resulting consumer token functions
τuc and τ
l
c are shown in Fig. 5.4(a). As required, they step as soon as
74 Rate Transitions between Tasks
u
cη
l
cη
5 10 15
∆t
5
events RTout_exact
(a) exact
RTout_pj
u
JPc +,η
l
JPc +,η
5 10 15
∆t
5
events
(b) periodic with jitter approximation (the
dotted lines represent the exact curves from
Fig. 5.5(a))
Figure 5.5. Activating event functions for consumer task C in our data rate transition
example
possible, without rising above τ¯up and τ
l
p, respectively. For comparison,
τ¯up and τ
l
p are also shown as dotted lines.
While τ¯up is helpful for the construction process, it is otherwise not
useful to describe event models. It is more interesting to compare the
upper consumer token function τuc to the upper producer token function
τup . τ
u
c is at most rc− 1 below τ¯up . Since τ¯up is exactly rc− 1 above τup , it
follows that τuc is always equal or above τ
u
p . This is shown in Fig. 5.4(b).
Therefore, equation 5.5 can also be written as
τuc =
⌈
τup
rc
⌉
∗ rc (5.7)
This leads to a second interpretation of the transformation result.
Consumer token functions are a tight conservative approximation of
producer token functions. Conservative, since all possible timing of the
arrival of tokens allowed by the consumer token functions (represented
by the area between the curves) is also allowed by the producer token
functions.
Finally, activating event functions are obtained from consumer token
functions using equations 5.3 and 5.4. The result for our example is
shown in Fig. 5.5(a). We can now also directly specify the relationship
Fixed Data Rates 75
between output event functions and activating event functions.
ηuc =
⌈
rp
rc
∗ ηup
⌉
(5.8)
ηlc =
⌊
rp
rc
∗ ηlp
⌋
(5.9)
5.1.3 Data Rate Transitions for periodic with jitter
Event Models
For the reasons stated in section 3.1.1 we are specifically interested
in the influence of data rate transitions on periodic with jitter event
models. I. e. our goal is to calculate the activating event model (Pc,Jc)
for the consumer task, which we will need for scheduling analysis, from
the output event model (Pp,Jp) of the producer task.
The output event model in our initial example is in fact periodic with
jitter. As can be seen in Fig. 5.5(a), the exact activating event func-
tions display irregularly spaced steps and thus cannot be described by a
periodic with jitter event model. The situation is basically the same as
for OR-activation 4.2.2 and has the same implications. We need to find
conservative approximations for ηuc and η
l
c that can be described by a
periodic with jitter event model (Pc,Jc). The intended result is shown in
Fig. 5.5(b) in comparison to the exact activating event functions (dotted
lines).
5.1.3.1 Consumer Activation Period
Calculation of the activating event model period Pc is straightfor-
ward [61].
Pc = Pp ∗ rc
rp
(5.10)
5.1.3.2 Consumer Activation Jitter
Calculation of the activating event model jitter Jc is more complex. A
conservative approximation of ηuc and η
l
c with a periodic with jitter event
model implies that equations 5.8 and 5.9 are replaced by the following
inequations.
rc ∗
⌈
∆t+ Jc
Pc
⌉
≥ rp ∗
⌈
∆t+ Jp
Pp
⌉
(5.11)
rc ∗max
(
0,
⌊
∆t− Jc
Pc
⌋)
≤ rp ∗max
(
0,
⌊
∆t− Jp
Pp
⌋)
(5.12)
76 Rate Transitions between Tasks
In order to be as accurate as possible, we are interested in the mini-
mum jitter that satisfies both inequations. Using a similar line of argu-
mentation as for OR-activation (section 4.2.2.2), it can be proven that
both inequations are satisfied by the same minimum jitter. In the fol-
lowing, the upper approximation (inequation 5.11) is used. Since the
left and right sides of this inequation are only piecewise continuous, the
inequation cannot be simply transformed to obtain the desired minimum
jitter. The solution used here is to evaluate inequation 5.11 piecewise for
each interval ]∆ti,∆ti+1], during which the right side of the inequation
has a constant value ki ∈ N. For each constant piece of the right side, a
condition for a local jitter Jc,i is obtained that satisfies the inequation
for all ∆t : ∆ti < ∆t ≤ ∆ti+1. For each constant piece of the right side,
inequation 5.11 becomes
rc ∗
⌈
∆t+ Jc,i
Pc
⌉
≥ ki ; ∆ti < ∆t ≤ ∆ti+1 , ki ∈ N
Since the left side of this inequation is monotonically increasing with
∆t, it is sufficient to evaluate it for the smallest value of ∆t, which
approaches ∆ti. I. e.
rc ∗ lim
²→+0
⌈
∆ti + ²+ Jc,i
Pc
⌉
≥ ki ; ki ∈ N
⇔ lim
²→+0
∆ti + ²+ Jc,i
Pc >
⌈
ki
rc
− 1
⌉
⇔ lim
²→+0
(Jc,i + ²) >
⌈
ki
rc
− 1
⌉
∗ Pc −∆ti
⇔ Jc,i ≥
⌈
ki
rc
− 1
⌉
∗ Pc −∆ti (5.13)
The global minimum jitter is then the smallest value which satisfies all
local jitter conditions. Since the ratio between rp and rc is rational, ηuc
displays a pattern of distances between steps which repeats periodically
every LCM(Pp , Pc) (the macro period). Therefore, as was the case for
OR-activation, it is sufficient to perform above calculation for one macro
period. An algorithm can be found in [41].
5.1.4 Example
Let us return to our initial example and calculate the activating event
model for consumer task C. According to equation 5.10 we obtain the
following activation period.
Pc = 4 ∗ 32 = 6
Fixed Data Rates 77
Activation jitter is calculated according to inequation 5.11, which be-
comes
3 ∗
⌈
∆t+ Jc,i
6
⌉
≥ 2 ∗
⌈
∆t+ 1
4
⌉
; ∆ti < ∆t ≤ ∆ti+1
Continuous pieces of the right side of this inequation now have to
be found. The first continuous piece is for 0 < ∆t ≤ 3. According to
inequation 5.12
3 ∗
⌈
∆t+ Jc,0
6
⌉
≥ 2 ; 0 < ∆t ≤ 3
⇔ Jc,0 ≥ 0 ∗ 6− 0 = 0
The following table shows all relevant constant pieces and the resulting
local Jc,i.
∆t range ki local Jc,i
0 < ∆t ≤ 3 2 Jc,0 ≥ 0 ∗ 6− 0 = 0
3 < ∆t ≤ 7 4 Jc,1 ≥ 1 ∗ 6− 3 = 3
7 < ∆t ≤ 11 6 Jc,2 ≥ 1 ∗ 6− 7 = −1
11 < ∆t ≤ 15 8 Jc,3 ≥ 2 ∗ 6− 11 = 1
15 < ∆t ≤ 19 10 Jc,4 ≥ 3 ∗ 6− 15 = 3
Please note that the a lower bound of −1 in the third line is purely
hypothetical, since the jitter cannot be less than zero. The last table
entry is there only to show that starting with ∆t > 3 the pattern repeats
every ∆t = LCM(Pp,Pc) = 3 ∗ 4 = 12 time units.
Finally, we take the largest Jc,i to obtain the activating event model
jitter.
Jc = 3
5.1.5 Data Rate Transitions for sporadic with jitter
Input Event Models
If the producer output event model is sporadic with jitter, then the
same rules to calculate the activating event model apply in the presence
of a data rate transition as for periodic with jitter event models. The only
difference of course is that the activating event model is also sporadic
with jitter.
5.1.6 Rate Transition Incurred Delay and Backlog
We would like to calculate the best- and worst-case token delay, and
the maximum token backlog and consequently the required buffer size
78 Rate Transitions between Tasks
at the consumer task input due to a data rate transition. This buffering
is required on top of the execution buffer that holds tokens from the
moment the consumer task is activated until execution of the task has
been completed (section 3.6.7). We will consider token buffering due to
a combination of these effects in section 5.5.
The best-case rate transition delay is obviously zero, since each con-
sumer activation is the result of at least one token arriving from the
producer. Consequently, that token does not have to wait for additional
tokens. The worst-case delay is incurred for the smallest set of tokens
that can remain in the rate-transition buffer after a consumer activation.
Let the size of this set be Nmin. Nmin can be obtained by starting with
the initial number of tokens in the input buffer, and iterating one macro
period to obtain each possible number of tokens in the buffer. Without
loss of generality, let us assume that the initial number of tokens in the
input buffer is 0.
If Nmin tokens remain in the rate-transition buffer, the consumer is
activated after n producer activations, where n is the smallest integer
solution for the following inequation.
n ∗ rp +Nmin ≥ rc ⇔ n ≥ rc −Nmin
rp
I.e.
n =
⌈
rc −Nmin
rp
⌉
The maximum activation delay is obtained if the next n groups of rp
tokens arrive as late as possible. Consequently
delay ≤
⌈
rc −Nmin
rp
⌉
∗ Pp + Jp (5.14)
The maximum backlog due to data rate transitions and hence the
required buffer size is
backlog ≤ rc − 1 (5.15)
For our example, we obtain
delay ≤ 5 , backlog ≤ 2
5.1.7 Special Cases
In this section, we take a brief look at special combinations of producer
and consumer data rates.
Data Rate Intervals 79
5.1.7.1 Consumer Data Rate is an Integer Multiple of
Producer Data Rate
If the consumer data rate is n ∈ N times the producer data rate, then
the consumer token curves step periodically every n∗Pp after the initial
jitter. If Jp = 0, then Jc also = 0. If Jp 6= 0, then the upper producer
token curve is shifted left by Jp for all ∆t > 0, and the lower producer
token curve is shifted right. Consequently, the upper and lower consumer
token curve are also shifted left respectively right by Jp. I. e. the jitter
calculation in section 5.1.3 can be replaced by the simple equation
Jc = Jp ; Pc = n ∗ Pp , n ∈ N (5.16)
5.1.7.2 Producer Data Rate is an Integer Multiple of
Consumer Data Rate
If the producer data rate is n ∈ N times the consumer data rate, then
each arrival of rp tokens immediately leads to n consumer activations.
I. e. the jitter calculation in section 5.1.3 can be replaced by the simple
equation
Jc = (n− 1) ∗ Pp + Jp ; Pp = n ∗ Pc , n ∈ N (5.17)
Rate-transition buffering is not required.
5.1.7.3 Identical Producer and Consumer Data Rates
If rp = rc, then a tight conservative approximation with step-height
rc of the upper producer token curve is the curve itself, and likewise for
lower token curves. I. e. in the absence of a data rate transition, the
consumer’s activating event model equals the producer’s output event
model, as would have been expected.
Jc = Jp ; Pp = Pc (5.18)
5.2 Data Rate Intervals
Conditional control-flow inside a tasks can lead to the situation that
the task consumes or produces a non-constant number of tokens per
execution. I. e. producer and consumer data rates become intervals:
[rp,min, rp,max] for a producer task and [rc,min, rc,max] for a consumer
task. Communication between tasks with data rate intervals can always
lead to a data rate transition.
In general, we do not assume any correlation between the number of
produced tokens and the number of consumed tokens. A lower consumer
data rate of zero is not allowed, since without additional information a
80 Rate Transitions between Tasks
bounded communication buffer cannot be guaranteed. A lower producer
data rate of zero corresponds to a sporadic output event model.
We would like to extend the results for data rate transitions with
fixed data rates (section 5.1) to calculate consumer token functions and
activating event functions in the presence of data rate intervals. Be-
fore doing so, an interpretation issue has to be addressed regarding the
minimum number of tokens required for one activation of consumer task
C. Our interpretation is as follows: rc,max tokens are required for one
activation of C. This is because the total number of tokens consumed
is not known a priory, and we do not want C to stall for lack of tokens.
Stalling for the lack of token is problematic for scheduling, since it can
lead to deadlocks. This interpretation is consistent with most models
of computation. However, the results from this section remain equally
valid if only rc,min tokens are required for one activation of C.
Following the approach in section 5.1, to construct the upper con-
sumer token function, we assume the maximum number of initial tokens
at the input of C not leading to an activation. We also assume that as
many additional tokens as possible arrive as soon as possible. To con-
struct the lower consumer token function, zero initial tokens at the input
of C are assumed, and as few additional tokens as possible arriving as
late as possible.
P
[1,2]
C
[2,3]FIFO int_ex
Figure 5.6. Example of a data rate transition with data rate intervals
Let us consider the example in Fig. 5.6 with producer task P and
consumer task C, where
rp = [rp,min, rp,max] = [1, 2] Pp = 4 Jp = 1
rc = [rc,min, rc,max] = [2, 3]
The upper and lower producer token functions are shown in Fig. 5.7(a).
In Fig. 5.7(b) the upper producer token function has been shifted up-
wards by rc,max − 1 in analogy to Fig. 5.3(b). Since the consumer data
rate is not fixed, there exists no single upper and lower consumer to-
ken function. Figs. 5.8 and 5.9 show three possible scenarios1. As can
1Remember that C requires at least 3 tokens for one activation. Therefore, no two simul-
taneous activations can be observed when 4 tokens are available, even though it is possible
that only 4 tokens are consumed in two subsequent consumer executions.
Data Rate Intervals 81
5 10 15
∆t
5
10
token
u
pτ
l
pτ
RTint_in_a
(a) original
5 10 15
∆t
5
10
token
l
pτ
u
pτ
rC,max- 1
RTint_in_b
(b) shifted upper producer token function
Figure 5.7. Upper and lower producer token functions for our data rate transition
with intervals example
5 10 15
∆t
5
10
token
l
pτ
u
ac,τ
l
ac,τ
u
pτ RTint_out_a
(a)
5 10 15
∆t
5
10
token
l
pτ
u
bc,τ
l
bc,τ
u
pτ RTint_out_b
(b)
Figure 5.8. Two possible sets of upper and lower consumer token functions for our
data rate transition with intervals example
be seen, depending on the sequence of consumer data rates, a different
function can dominate the other functions for a particular ∆t.
We could now determine all relevant sequences of consumer data rates
and then define the upper consumer token function to be the maximum
off all upper ‘consumer token scenario functions’. Likewise, we could de-
fine the lower consumer token function to be the minimum off all lower
‘consumer token scenario functions’. However, in the end we are inter-
ested in the upper and lower activating event functions for the consumer
task. Definitely, it is not possible to obtain a larger number of activations
for any ∆t than in the scenario in which the minimum number of tokens
82 Rate Transitions between Tasks
5 10 15
∆t
5
10
token
l
pτ
u
cc,τ
l
cc ,τ
u
pτ RTint_out_c
Figure 5.9. A third possible set of upper and lower consumer token functions for our
data rate transition with intervals example
are consumed during each consumer execution (upper ‘consumer token
scenario function’ in Fig. 5.8(a)). Likewise, it is not possible to obtain a
smaller number of activations for any ∆t than in the scenario in which
the maximum number of tokens are consumed during each consumer
execution (lower ‘consumer token scenario function’ in Fig. 5.8(a)).
We conclude that in the presence of a data rate transition with inter-
vals, the upper activating event function is calculated using the maxi-
mum producer data rate rp,max, the minimum consumer data rate rc,min
and the maximum number of initial tokens at the input of C. To cal-
culate the lower activating event function, the minimum producer data
rate rp,min, the maximum consumer data rate rc,max and zero initial
tokens at the input of C are used. Equations 5.8 and 5.9 are modified
accordingly.
ηuc =
⌈
rp,max
rc,min
∗ ηup
⌉
(5.19)
ηlc =
⌊
rp,min
rc,max
∗ ηlp
⌋
(5.20)
5.2.1 Data Rate Transitions for periodic with jitter
Event Models
Obviously, two sets of period and jitter values are now required for
a conservative periodic with jitter approximation of both the upper and
the lower activating event functions. The situation is similar to OR-
concatenation of a combination of periodic with jitter and sporadic with
jitter input event models (section 4.2.4). As was explained there, the
alternatives are to either use a single, very conservative sporadic with
jitter approximation with parameters that correspond to the upper ac-
tivating event function. The second, tighter alternative is to extend our
Combination with Multiple Inputs 83
system-level analysis framework to propagate two separate event models
for each event stream.
For our example, we obtain two periodic with jitter event models with
the following properties according to equations 5.18 (upper event model)
and 5.16 (lower event model).
ηuc,P+J : Puc = 4 J uc = 1
ηlc,P+J : P lc = 12 J lc = 1
5.2.2 Data Rate Transitions for sporadic with jitter
Input Event Models
If the producer output event model is sporadic with jitter, then we
proceed for the upper activating event model as in the periodic case. The
lower activating event function is obviously zero for all ∆t, since the lower
output event function is zero. Therefore, we do not need two separate
periodic with jitter event models. Instead, we use the parameters of
upper activating event model and additionally set it to sporadic.
5.2.3 Rate Interval Transition Incurred Token
Delay and Backlog
As in section 5.1, the worst-case delay is incurred for the smallest set
of tokens Nmin that can remain in the rate-transition buffer after a con-
sumer activation. This delay is maximized if subsequently as few tokens
as possible arrive as late as possible. Consequently, in correspondence
to inequation 5.14
delay ≤
⌈
rc,max −Nmin
rp,min
⌉
∗ Pp + Jp (5.21)
In correspondence to inequation 5.15, the possible backlog due to data
rate transitions and hence the required buffer size is
backlog ≤ rc,max − 1 (5.22)
For our example, we obtain
delay ≤ 9 , backlog ≤ 2
5.3 Combination with Multiple Inputs
In the previous chapter we showed how to calculate activating func-
tions for a consumer task C with multiple inputs from the individual
input event functions. In this chapter we have shown how to calculate
activating functions for a consumer task C with a single input in the
84 Rate Transitions between Tasks
presence of a data rate transition between producer task P and con-
sumer task C. In this section, both results are combined.
Let us review the equations that were derived for activating event
functions in the previous sections.
AND-activation (equations 4.1 and 4.2).
ηuAND = max{ηui }
ηlAND = min{ηli}
OR-activation (equations 4.11 and 4.12).
ηuOR =
n∑
i=1
ηui
ηlOR =
n∑
i=1
ηli
Data rate transitions with fixed data rates (equations 5.8 and
5.9).
ηuc =
⌈
rp
rc
∗ ηup
⌉
ηlc =
⌊
rp
rc
∗ ηlp
⌋
Data rate transitions with data rate intervals (equations 5.19
and 5.20).
ηuc =
⌈
rp,max
rc,min
∗ ηup
⌉
ηlc =
⌊
rp,min
rc,max
∗ ηlp
⌋
A combination of AND- and OR-activation has been discussed in sec-
tion 4.3 and is not considered here. The combination of AND-activation
with data-rate intervals is not acceptable, since bounded buffers cannot
be guaranteed at all task inputs. The remaining combinations can be
easily obtained by first performing rate-transition calculation, and then
multiple-input calculation.
Chaining of Rate Transitions, Multiple Inputs and EAFs 85
AND-activation combined with fixed data rate transitions.
ηuAND = max
⌈
rp,i
rc,i
∗ ηup,i
⌉
(5.23)
ηlAND = min
⌊
rp,i
rc,i
∗ ηlp,i
⌋
(5.24)
OR-activation combined with fixed data rate transitions.
ηuOR =
n∑
i=1
⌈
rp,i
rc,i
∗ ηup,i
⌉
(5.25)
ηlOR =
n∑
i=1
⌊
rp,i
rc,i
∗ ηlp,i
⌋
(5.26)
OR-activation combined with interval data rate transitions.
ηuOR =
n∑
i=1
⌈
rp,max,i
rc,min,i
∗ ηup,i
⌉
(5.27)
ηlOR =
n∑
i=1
⌊
rp,min,i
rc,max,i
∗ ηlp,i
⌋
(5.28)
The application to periodic with jitter event models and sporadic
with jitter event models is straight-forward using the results from sec-
tions 4.1.3, 4.2.2, 5.1.3 and 5.2.1.
5.4 Chaining of Rate Transitions, Multiple Inputs
and EAFs
An interesting question is in which order to chain rate transitions,
multiple inputs and EAFs when calculating activating event functions of
a consumer task. In the previous section we already established that rate
transition calculations are performed prior to multiple input calculations.
However, EAFs are a more complicated case.
The purpose of an EAF is to change the possible activation timing of a
task in order to meet some requirement (section 3.6.7). In the presence of
a rate transition this implies that it makes little sense to imagine that an
EAF changes the output event model of the producer task, and to then
calculate the input event model for the consumer, since a rate transition
changes the event model. For example, if strictly periodic consumer
activation was required, then this could not be enforced in general by a
periodic-EAF followed by a rate transition, due to the jitter introduced
by the rate transition. We conclude that EAF-calculation needs to follow
86 Rate Transitions between Tasks
rate transition calculation. This is consistent with our model that an
EAF delays task activation after an input event (section 3.6.7).
A similar argument applies for a combination of OR-activation and
EAFs. Assume an OR-activated task with periodic with jitter 6= 0 input
event models with different periods, which requires strictly periodic acti-
vation. We cannot satisfy our requirement with individual EAFs at each
port that change each input event model into strictly periodic, since the
activating event model will have a jitter 6= 0 due to the different input
event model periods (section 4.2). We therefore conclude that it makes
more sense to let EAF-calculation follow OR-concatenation. This im-
plies that a single EAF needs to be specified for the OR-activated task
which directly influences activation timing, instead of individual EAFs
for each input, where the influence on timing is indirect and not suffi-
ciently controllable. In the next section we will see a straight-forward
implementation of such a task-level EAF. Additional EAFs at individual
inputs are not forbidden, but serve little purpose.
The case is less clear for a combination of AND-activation and EAFs.
If we want to reduce the activation jitter of an AND-activated task, then
we can either use a single task-level EAF in analogy to OR-activated
tasks, or we can use individual EAFs at those inputs where the input
jitter exceeds the required jitter (the activation jitter equals the largest
input jitter, section 4.1). The advantage of a single EAF is simplicity
and consistency with OR-concatenation. The approach is also applicable
if a task is activated by a combination of AND- and OR-concatenations
(section 4.3).
In summary, we propose the following order to calculate an activating
event models:
rate transition → multiple inputs (AND, OR, combination) → EAF.
5.5 Combined Token Buffering
AND-activation (section 4.1.2) and data rate transitions (section 5.1.6)
both require token buffering. This buffering happens on top of activation
buffering and EAF buffering (section 3.6.7).
We shall assume that input tokens are buffered separately at each
consumer task input. In correspondence to section 3.6.7, tokens are
kept in the buffers from the moment of arrival until the instance of the
task that was activated by the tokens has been completed. For ease of
illustration, the input buffers have been laid out in Fig. 5.10 such that
the filling of one column at one input leads to one input event at that
input. We assume that the buffers are written and read in the order
indicated by the arrows.
Combined Token Buffering 87
bufferCombo
C
2
3
P1
3
P2
2
FIFO1
FIFO2
15
13
95
156
Figure 5.10. Communication buffer (implemented as circular buffer)
actAND_Buffer
C
2
3
P1
3
P2
2
AND
FIFO1
exin1
FIFO2
RT + AND
backlog2
RT + AND backlog1
15
13
41
156
in2
ex
C
exact = in
FIFOex (events)
activation
backlog
(a) Rate transitions with AND-
activation
actAND_EAF_Buffer
C
2
3
P1
3
P2
2
AND
FIFO1
exin1
FIFO2
RT + AND backlog1
in2
ex
C
in ex
activation
backlog
FIFOex (events)
act = EAF
EAF
backlog
RT + AND
backlog2
(b) Rate transitions with AND-
activation and EAF
Figure 5.11. Token buffering and activation mechanism for a combination of rate
transitions with AND-activation and an optional EAF
A possible solution for a combination of data rate transitions with
AND-activation and an optional EAF is shown in Fig. 5.11. The same
solution is applicable for a combination of data rate transitions with OR-
activation and an optional EAF (Fig. 5.12). Each input buffer has its
own rate transition and, in case of AND-activation, AND-concatenation
backlog. There is an additional buffer (bottom) which keeps track of
88 Rate Transitions between Tasks
actOR_Buffer
C
2
3
P1
3
P2
2
OR
FIFO1
ex1in1
FIFO2
RT
backlog2
RT
backlog1
ex2
C
ex
activation
backlog
15
13
41
156
in2
act = in
FIFOex (events)
(a) Rate transitions with OR-
activation
actOR_EAF_Buffer
C
2
3
P1
3
P2
2
OR
FIFO1
ex1in1
FIFO2
RT
backlog2
RT
backlog1
in2
C
in ex
activation
backlog
FIFOex (events)
act = EAF
EAF
backlog
ex2
(b) Rate transitions with OR-
activation and EAF
Figure 5.12. Token buffering and activation mechanism for a combination of rate
transitions with OR-activation and an optional EAF
the number, and in case of OR-activation, order in which input events
occurred at the two inputs. This buffer is also used to implement an
optional task-level EAF (Figs. 5.11(b) and 5.12(b)) in accordance with
section 3.6.6.
5.6 Summary and Conclusion
In this chapter we looked at tasks that consume or produce multi-
ple tokens per execution. We showed how to calculate activating event
functions for a consumer task in the presence of data rate transitions.
We also calculated required communication buffers and incurred token
delay due to a data rate transition. We considered both data rate transi-
tions between fixed data rates, and between data rate intervals. In both
cases, we had to conservatively approximate the exact activating event
functions for the consumer task, in order to obtain period and jitter pa-
rameters required in our compositional performance analysis approach.
We then combined those results with the results from the previous
chapter in order to calculate activating event models and the required
Summary and Conclusion 89
buffers in the presence of both data rate transitions and multiple acti-
vating inputs. We showed how input buffering can be implemented that
also allows traffic shaping using EAFs. This allows us to analyze the
performance of complex feed-forward applications. In the next chapter,
this work is extended to applications with cyclic task dependencies.

Chapter 6
CYCLIC TASK DEPENDENCIES
Tasks with multiple inputs allow to build cyclic dependencies. A typi-
cal application is a control loop, where one task represents the controller
and the other task a model of the controlled system. A task graph with
a cycle is shown in Fig. 6.1.
1
FIFO1
FIFO2 1
a 1
c 11
1
1
AND b
cyc_ex
Figure 6.1. Example of a task graph with a cycle.
In this chapter, we focus on single-rate cycles, i. e. without data rate
transitions, and with unconditional communication. Cycles with data
rate transitions will be briefly considered at the end of the chapter.
We will also consider a special case of cycles with date rate transitions
when discussing system-level performance analysis for Simulink designs
(chapter 9).
92 Cyclic Task Dependencies
6.1 Single-Rate Cycles
All tasks in a single-rate cycle consume and produce the same fixed
number of tokens per execution at their cycle-internal inputs and out-
puts. This implies that tasks with multiple inputs in a single-rate cycle
have to be AND-activated. OR-activated tasks are not allowed, since
the output period of an OR-activated task is smaller than the small-
est input period (section 4.2). This output period would be eventually
propagated around the cycle, leading to yet a smaller output period. A
fix-point cannot be reached, and the cycle thus cannot be scheduled1.
Without loss of generality, only unit-rate cycles are considered. The ex-
tension to non-unit single-rates is trivial. The example in Fig. 6.1 is in
fact a system with a unit-rate cycle.
Let us further assume that all AND-activated tasks in a unit-rate
cycle have exactly one cycle-internal input. Then, at least one initial
token has to be present at the input of at least one task in the cycle to
avoid deadlock [55]. More than one initial token allow to execute cycle
tasks in parallel, if they are mapped onto different resources or can be
pipelined. The number of tokens in the cycle remains constant, since
for every consumed token, one token is produced (atomic buffer update,
section 3.6.1). For more complex topologies, SDF scheduling rules [62]
allow to calculate the required number and positions of initial tokens to
avoid deadlock.
For lim∆t→∞, all tasks in a unit-rate cycle are activated the same
number of times. Consequently, for lim∆t→∞ the same number of input
events must occur at cycle-external inputs of all AND-activated tasks be-
longing to the same cycle to ensure that external buffers can be bounded.
For our compositional performance analysis approach this implies that a
cycle must be activated externally with periodic with jitter event models,
and that all external periods must be the same. An important exception
is a cycle with only one external input. In this case, any external event
model, in particular sporadic with jitter is acceptable since the number
of cycle-internal tokens is fixed an thus cannot grow if no external tokens
arrive.
If we want to use our performance analysis in the presence of single-
rate cycles, we first have to solve a starting-point problem. The input
event model at the cycle-internal input of an AND-activated task ul-
timately depends on the task’s activating event model, which cannot
be determined without all input event models. A conservative starting
1If we allow a combination of OR- and AND-activation (section 4.3), then cycle-external
inputs of a multi-input task can be OR-concatenated before AND-concatenation with the
cycle internal inputs.
Analysis Idea 93
point is to initially assume zero jitter at the cycle-internal input (sec-
tion 3.6.4). The activating event model of the AND-activated task then
equals the external input event model (section 4.1). We can now iterate
analysis and event model propagation around the cycle, hoping to find
a fix-point.
However, if only one task along the cycle has a response time which
is an interval, then after the first round of analysis and event model
propagation the internal input jitter of the AND-activated task will be
larger than the external input jitter. In our compositional performance
analysis approach, this larger jitter will be propagated around the cycle
again, resulting in an even larger jitter at the cycle-internal input of the
AND-activated task (section 3.5). It is obvious that the jitter appears
unbounded if calculated this way.
Non-functional cycles that do not converge by themselves can be made
to converge by actively reducing event model jitter through an EAF
(section 3.6.6). Unfortunately, such an approach cannot in general be
applied to functional cycles, since the additional EAF delay may yield a
cycle that cannot be executed fast enough. For example, let us assume
an external event model (Pb,ext = 100, Jb,ext = 0) in Fig. 6.1, and a
single initial token in the cycle. If tasks b and c have response time
intervals of [10, 12] and [20, 25], respectively, then the time around the
cycle is [30, 37]. If on the other hand a periodic resynchronization buffer
is inserted between tasks c and b in order to reduce the jitter to zero,
then according to [96] the time around the cycle increases to [130, 244],
thus exceeding the external period.
The problem boils down to the fact that event model propagation
as presented so far captures neither correlations between the timing of
events in different event streams, nor the fact that the number of tokens
in a cycle is fixed. Therefore, the activation jitter for the AND-activated
task is calculated very conservatively.
6.2 Analysis Idea
In order to enable performance analysis in the presence of cyclic task
dependencies, we need to consider event correlations arising from func-
tional cycles. It would be especially useful if we could show for a partic-
ular cycle that the cycle-internal input of the AND-concatenated task
cannot increase the activation jitter of that task. This would allow us
to ignore the cycle-internal input of the AND-concatenated task dur-
ing response-time calculation, effectively cutting the cycle and yielding
a purely feed-forward system. This feed-forward system could then be
analyzed using our compositional performance analysis approach with-
out further extensions, and without the need to propagate event models
94 Cyclic Task Dependencies
around the cycle multiple times in search of a fix-point, which as shown
above fails.
The interesting part is to determine conditions under which this ap-
proach is valid. Let us for now consider unit-rate cycles with exactly one
AND-concatenated task with one cycle-external and one cycle-internal
input (as in Fig. 6.1). Let us assume an external periodic with jitter
event model with period Pext and jitter Jext. Let us define tminff and
tmaxff to be the minimum respectively maximum sum of worst-case re-
sponse times of all tasks belonging to a cycle (the ‘time around the
cycle’) as obtained through analysis of the corresponding feed-forward
system2. Furthermore, let us assume for now that all tokens in the cycle
are initially at the cycle-internal input of the AND-concatenated task.
The number of initial tokens in the cycle shall be M ≥ 1 ∈ N.
At system startup, the first M tokens arriving at the cycle-external
input will immediately activate the AND-concatenated task together
with the M tokens already waiting at the cycle-internal input. We as-
sume that the external input event model is valid from the moment that
the first external token arrives [96]. Consequently, the activating event
model equals the external input event model for the first M activations.
Let us now consider the following activations.
6.3 Cycles with one initial token
Consider a cycle where after analysis of the corresponding feed-forward
system, tmaxff ≤ Pext. Let us assume that exactly one initial token is
available in the cycle, and that the AND-concatenated task (task b in
Fig. 6.1) has just been activated for the first time. Consequently, the
buffer at the cycle-internal input of task b is now empty (since there is
only one token in the cycle), and thus no further activation of task b
is possible at that moment. It will take between tminff and t
max
ff time
units until the next token becomes available at the cycle-internal input
of task b.
The first external token does not have to wait for an internal token, be-
cause the initial token has been placed at the cycle-internal input of the
AND-concatenated task (section 6.2). The maximum distance between
two consecutive external tokens is δmaxext (2) = Pext+Jext (equation 3.4).
From tmaxff ≤ Pext follows that it is not possible that the 2nd external
token arriving as late as possible after the 1st external token has to wait
for an internal token.
2This of course implies that the feed-forward system in not over-loaded and can be analyzed.
Cycles with one initial token 95
The 3rd external token can arrive at most δmaxext (3) = 2 ∗ Pext + Jext
after the 1st external token. Therefore, if both the 2nd and the 3rd
external tokens arrive as late as possible, then the 3rd arrives Pext after
the 2nd. From tmaxff ≤ Pext follows that the 3rd external token arriving as
late as possible after the 1st external token cannot wait for an internal
token, even if the 2nd external token also arrived as late as possible.
This argument can be extended to all further tokens. We infer that no
external token arriving as late as possible has to wait for an internal
token.
Activation of task b also cannot happen earlier than the arrival of
an external token. Therefore, the activating event model of task b is
conservatively captured by the external input event model. We conclude
that our approach is valid for a cycle withM = 1 initial token, for which
tmaxff ≤ Pext.
Pact = Pext ; Jact = Jext (6.1)
For example, let us assume that in our system in Fig. 6.1 task b is
activated externally with (Pb,ext = 4, Jb,ext = 3). Let us further assume
that feed-forward analysis has determined the time around the cycle to
be [tminff , t
max
ff ] = [2, 3]. I. e. each internal input event follows between
[2, 3] time units after the previous activating event. Fig. 6.2 shows a
snapshot of a sequence of external, internal and activating events for
task b (numbers indicate corresponding input events and the resulting
activating event). The first internal event is due to the initial token. As
can be seen, activating event timing can be described by the same event
model as external input event timing. If on the other hand analysis of
the corresponding feed-forward system determines tmaxff > Pext, then
this statement is no longer true, since for example the 3rd internal event
could occur later than the latest possible 3rd external event.
In Fig. 6.2 it can also be seen that an early external token may have
to wait for an internal token since two token arrivals at the cycle-internal
input of task b cannot follow closer than tminff , and thus
δminact (2) =
{
δminext (2) ; t
min
ff ≤ δminext (2)
tminff ; t
min
ff > δ
min
ext (2)
(6.2)
Effectively, if tminff > δ
min
ext (2), then the cycle acts like a dmin-EAF with
dmin = tminff (section 3.6.6). This additional effect of the cycle does not
require a new scheduling analysis, since the possible activation timing is
only tightened. All possible event timing in the tighter model is already
included in the wider model. Therefore, the results in equation 6.1 re-
main valid. However, if an extended periodic with jitter event model
96 Cyclic Task Dependencies
cyc_1_es
t
4 8 120
t
t
internal events
activating events
(1)
1
2
2
3
3
4
4
external events
)2(minextδ
)2(minactδ
1 2 3 4
Figure 6.2. Possible event sequence for our cycle example. Gray boxes indicate jitter
intervals during which an event can occur. Note that line 2 displays the possible
timing of internal events depending on the previous activating event, while lines 1
and 3 display the possible timing of events independent of previous events.
with a dmin = δmin(2) parameter is used as is the case in our composi-
tional performance analysis approach, then it is worthwhile to perform
scheduling analysis again with the tighter activating event model for the
AND-concatenated task.
6.3.1 Buffer Calculation
The required buffer sizes inside the cycle are trivial: each buffer re-
quires size one, since exactly one token is circling the cycle. The worst-
case AND-concatenation delay at the cycle-internal input of the AND-
concatenated task equals δmaxext (2) − tminff . This value is of no further
interest.
Buffer calculation is more interesting at the cycle-external input of
the AND-concatenated task. If tmaxff ≤ δminext (2) then a cycle-external
token can never wait for the cycle-internal token, and consequently no
AND-buffer is required at the cycle-external input (activation buffer cal-
culation is not affected). However, if tmaxff > δ
min
ext (2), then even though
the activation jitter of the AND-concatenated task is not affected by the
cycle-internal input, the required size of the cycle-external buffer may be.
For example, consider a cycle-external event model (Pext = 6, Jext = 18)
with a minimum distance δminext (2) = 1 between events. Let us as-
sume that the worst-case response time of the AND-concatenated task is
rmaxAND = 1, and that the worst-case time around the cycle is t
max
ff = 5. If
the cycle-internal input is ignored during buffer calculation, then feed-
forward analysis will determine that the required external buffer size
Cycles with two or more initial tokens 97
for the AND-concatenated task is 1, since δminext (2) ≥ rmaxAND, i. e. exter-
nal events cannot follow closer than the worst case response time of the
AND-concatenated task. This result is obviously wrong, since in reality
external tokens can backlog while a token is circling the cycle.
A conservative worst-case backlog and delay at the cycle-external in-
put due to AND-concatenation is obtained if it is assumed that external
tokens arrive as soon as possible, and that each iteration of the cycle
takes as long as possible, i. e. tmaxff time units
3. The results can be
obtained by treating the cycle like a dmin-EAF with dmin = tmaxff (sec-
tion 3.6.6). For the example in this section, this results in an external
AND-buffer size of 3 and a worst-case external backlog of 14.
6.4 Cycles with two or more initial tokens
Now consider a cycle where after analysis of the corresponding feed-
forward system, (M − 1) ∗ Pext < tmaxff ≤ M ∗ Pext ; M ∈ N. Let us
assume that exactly M tokens are available in the cycle, and that the
AND-concatenated task has just been activated for the first time. It will
take the cycle-internal token that has just been consumed between tminff
and tmaxff time units to return to the cycle-internal input of task b, where
it becomes the M +1st internal token from the perspective of that task.
The first M external tokens do not have to wait for internal tokens,
because theM initial tokens have been placed at the cycle-internal input
of the AND-concatenated task (section 6.2). The maximum distance
between M +1 consecutive external tokens is δmaxext (M +1) =M ∗Pext+
Jext (equation 3.4). From tmaxff ≤M ∗Pext follows that it is not possible
that the M +1st external token arriving as late as possible after the 1st
external token has to wait for the M + 1st internal token.
The M +2nd external token can arrive at most δmaxext (M +2) = (M +
1) ∗ Pext + Jext after the 1st external token. Therefore, if both the
M + 2nd and the 2nd external tokens arrive as late as possible, then
the M + 2nd arrives M ∗ Pext after the 2nd. From tmaxff ≤ M ∗ Pext
follows that the M+2nd external token arriving as late as possible after
the 1st external token cannot wait for an internal token, even if the 2nd
external token also arrived as late as possible. This argument can be
extended to all further tokens. We infer that no external token arriving
as late as possible has to wait for an internal token.
3Tighter results could be obtained if worst-case times around the cycle would be calculated
for individual iterations in a sequence of iterations. This would require a considerable analysis
extension and is not further considered in this thesis.
98 Cyclic Task Dependencies
Activation of task b also cannot happen earlier than the arrival of
an external token. Therefore, the activating event model of task b is
conservatively captured by the external input event model. We conclude
that our approach is valid for a cycle withM > 1 initial tokens, for which
(M − 1) ∗ Pext < tmaxff ≤ M ∗ Pext.
Pact = Pext ; Jact = Jext
Additionally, we may be able to tighten the minimum distance be-
tween (M + 1) activating events compared to the minimum distance
between (M + 1) external input events.
δminact (M + 1) =
{
δminext (M + 1) ; t
min
ff ≤ δminext (M + 1)
tminff ; t
min
ff > δ
min
ext (M + 1)
(6.3)
Effectively, if tminff > δ
min
ext (M + 1), then the cycle acts like a dmin-
EAF [96] with dmin,M+1 = tminff .
cyc_2_ex
a
c
AND b
ed sporadic: 40
(P = 4, J = 3)
tcore = 3
tcore = 2
tcore = 1
Figure 6.3. Cycle example with 2 initial tokens at the cycle-internal input of AND-
concatenated task b. Unit data rates and FIFOs have been omitted.
Consider the example in Fig. 6.3. Task b is activated externally pe-
riodically with jitter with (Pb,ext = 4, Jb,ext = 3). It shares a resource
with task e which is activated sporadically at most every 40 time units.
We assume static priority preemptive scheduling and a higher priority
for task e. Task c is mapped onto a separate resource. The core exe-
cution times are 3, 2, and 1 for tasks e, b and c, respectively. For this
system, feed-forward analysis determines the time around the cycle to
be [tminff , t
max
ff ] = [3, 7].
Fig. 6.4 shows a snapshot of a sequence of external, internal and ac-
tivating events for task b (numbers indicate corresponding input events
Analyzability Condition 99
cyc_2_es
t
4 8 120
t
t
internal events
activating events
(1)
1
3
2
5
3
4
5
external events
1 2 3
(2)
4
54
16
Figure 6.4. Possible event sequence for the cycle example in Fig. 6.3. Gray boxes
indicate jitter intervals during which an event can occur. Note that line 2 displays the
possible timing of internal events depending on the previous activating event, while
lines 1 and 3 display the possible timing of events independent of previous events.
and the resulting activating event). The first two internal events are
due to the initial tokens (the 3rd internal event can occur before the
2nd initial token has led to an activation). As can be seen, activating
event timing can be described by the same event model as external input
event timing. If we increase core execution times such that analysis of
the corresponding feed-forward system determines tmaxff > 2∗Pext, then
this statement is no longer true, since for example the 4th internal event
could occur later than the latest possible 4th external event.
6.4.1 Buffer Calculation
In the worst-case, the required buffers inside the cycle have size M ,
since exactly M tokens are circling the cycle. The buffer sizes can be
smaller for some cycle tasks, if it can be shown that a particular buffer
cannot contain all tokens at the same time. Such an analysis extension
is not further considered in this thesis.
If tmaxff ≤ δminext (M + 1) then a cycle-external token can never wait for
a cycle-internal token, and consequently no AND-buffer is required at
the cycle-external input of task b (activation buffer calculation is not
affected). If tmaxff > δ
min
ext (M + 1), then a modification of the approach
in section 6.3.1 can be applied using a dmin-EAF [96] with dmin,M+1 =
tmaxff .
6.5 Analyzability Condition
The results obtained in this section can be summarized in the following
analyzability condition.
100 Cyclic Task Dependencies
Lemma 6.1 A single-rate cycle with one AND-concatenated task with
one external input and external period Pext and one internal input can
be treated as a feed-forward system, if the sum of worst-case response
times of all cycle tasks calculated for the feed-forward system, divided by
the number M of initial tokens in the cycle, does not exceed the external
period Pext, and if the M tokens are initially at the cycle-internal input
of the AND-concatenated task.
Proof 6.1 Proof by induction follows from the argumentation in sec-
tions 6.3 and 6.4. 2
6.6 Cycles with more available than required
initial tokens
If the number of initial tokens in the cycle is larger than required by
lemma 6.1, then the additional tokens obviously cannot increase the time
between arrivals of tokens at cycle-internal inputs of the AND-activated
task. Consequently, the cycle-internal input still cannot increase the
activation jitter of the AND-concatenated task, and analyzing the cor-
responding feed-forward system remains a valid approach.
However, the additional tokens can decrease the minimum AND-delay
for early external token. A detailed calculation is left for future work.
We stay on the safe side by assuming that δminact (M +1) = δ
min
ext (M +1).
6.7 Cycles with fewer available than required
initial tokens
If the number of initial tokens in the cycle is smaller than required
by lemma 6.1, then analyzing the corresponding feed-forward system
is no longer a valid approach. The cycle may still be schedulable, but
then possible phases between events have to be modeled during analysis
for several iterations of the cycle. This is a major extension of our
compositional performance analysis approach which is beyond the scope
of this thesis. Inter-contexts (section 8.2) and the ideas for performance
analysis of Simulink (chapter 9) are first steps in this direction.
A much simpler solution is to alert the designer of the problem and
indicate how many initial tokens are needed in the cycle to render anal-
ysis of the corresponding feed-forward system a valid approach. The
designer then has two choices: he can either change the system imple-
mentation until lemma 6.1 is satisfied, or increase the number of initial
tokens in the cycle. However, the second choice changes properties of
the algorithm being implemented, e. g. the dynamics of a control loop,
and thus may not always be applicable.
System Startup 101
6.8 Self-Cycles
Self-cycles with N initial tokens are an explicit way to indicate that
M executions of a task must be completed before the M + Nth execu-
tion is allowed to start. Most single-resource schedulers will only start
executing a task again after the previous execution has been completed.
Therefore, tasks with self-cycles with more than one initial token would
have to be mapped onto multiple resources or pipelined to exploit par-
allelism. This is not considered further in this thesis.
6.9 System Startup
So far we have assumed that all cycle tokens are initially at the cycle-
internal input of the AND-concatenated task. If the initial tokens are
placed at any other cycle task input, then it is possible that the first
activation is delayed after the arrival of the first external token, and that
the next few activating events occur in quicker succession than allowed
by the calculated event model. Let us assume that in the system in
Fig. 6.1 the initial token is placed at the input of task c, and that the
response time interval of task c is [1, 2]. As can be seen in Fig. 6.5, if
external tokens arrive as soon as possible, and the first internal event
arrives as late as possible, then the first 2 activating events violate the
(Pb,act = 4, Jb,act = 3) event model that we obtained in section 6.3. For
all subsequent activating events, the results from section 6.3 are valid.
5
cyc_1_startup_es
t
4 8 12 160
t
t
internal events
activating events
1
1
1
2
2
2
3
3
3
4
4
4
5
external events
6
5 6
6
Figure 6.5. Possible event timing at system startup that violates the calculated ac-
tivating event model. Note the absolute times on the t-axis.
System startup is a problem independent of the presence of cycles,
since none of the calculated event models anywhere in the system is
valid as long as no events have occurred. This implies that none of
the calculated performance numbers is initially valid. For example, a
best-case task response time may be initially shorter than calculated,
because a second task that would otherwise have certainly led to an
102 Cyclic Task Dependencies
interrupt has not been activated yet. The system needs some time to
settle into a consistent state. One idea to control the process of settling
in is to build systems in a way that all event models are enforced from
the beginning. For example, an operating system could generate dummy
task activations as long as no real activating event has occurred.
This idea could also be applied to cycles if the cycle tokens initially
were not at the cycle-internal input of the AND-concatenated task. Task
activation would consider only the cycle-external input, and read dummy
tokens at the cycle-internal input as long as no real token has arrived.
However, a thorough investigation of this idea is beyond the scope of this
thesis. As long as it has not been performed, the condition in lemma 6.1
restricts the kind of cycles that can be analyzed.
6.10 Nested Cycles
So far, we have considered single cycles. The analysis approach that
we derived is obviously also applicable to multiple independent cycles
in a system. We now show that it can also be applied to nested cycles,
which occur in complex control-loops and filters.
cyc_nested
a
AND
f
c
b
d
e
AND
AND
Figure 6.6. Example of a task graph with nested cycles. Unit data rates and FIFOs
have been omitted. Edges that need to be cut for corresponding feed-forward analysis
are indicated.
Consider the example in Fig. 6.6, which consists of three nested cy-
cles joined at AND-concatenated tasks. Our analysis approach that we
developed in this section requires cycles with one external input. This
requirement is satisfied by cycle c → d → f (the innermost cycle). If
we determine during performance analysis that this cycle satisfies the
conditions in lemma 6.1, then the activating event model of task c does
Cycles with Multiple Inputs 103
not depend on the output event model of task f . We can therefore
assume that edge f → c is cut (indicated by a large ‘X’ in Fig. 6.6).
Consequently, we can now also treat cycle b → c → d → e as an inner-
most cycle (since the timing of c is not affected by f). If we determine
during performance analysis that this cycle also satisfies the conditions
in lemma 6.1, then it is correct to assume that edge e → b is cut, and
cycle a → b → c also becomes an innermost cycle. If we determine
during performance analysis that this cycle also satisfies the conditions
in lemma 6.1, then it is also correct to assume that edge c → a is cut.
The result is a valid corresponding feed-forward system for our system
with nested cycles, which was rightfully analyzed using our existing feed-
forward analysis.
Note that the order in which we considered the three cycles in the
preceeding paragraph served only to explain our line of thought. During
performance analysis we need to check the validity of the conditions in
lemma 6.1 for all cycles, but we can do it in any order. If the conditions
are violated for only one cycle, analyzing the corresponding feed-forward
system is not a valid analysis approach.
6.11 Cycles with Multiple Inputs
Let us now consider cycles with more than one external activating in-
puts. This adds a level of complexity since in general the phases between
tokens arriving at the two external inputs are not restricted.
multi_input_cycle
a
AND
c
b
AND
tcore = [2,3]
tcore = [2,3]
tcore = [2,3]
(Pa = 10, Ja = 0, ϕa = ?)
(Pb = 10, Jb = 0, ϕb = ?)
Figure 6.7. Cycle with two external inputs.
Consider the example of a cycle with two inputs in Fig. 6.7. Let us
assume that each task has a core execution time interval of [2, 3] and
is mapped on an exclusive resource (i. e. response time intervals equal
core execution time interval), that the cycle contains one initial token,
and that the external event models are (Pa,ext = Pb,ext = 10, Ja,ext =
Jb,ext = 0). What we do not know are the phases of events in one exter-
nal event stream relative to events in the other external event stream.
104 Cyclic Task Dependencies
Even though the sum of worst-case response times around the cycle
is 9 and thus less than the external period (10), it is possible that a
token arriving at the cycle-external input of task a has to wait for an
internal token because of the AND-activation delay of task b, and vice
versa. This is illustrated in the two gantt charts in Figs. 6.8 and 6.9.
0 5
...
...
...
line 1: aext , bext
line 2: bact = bext  AND acomp
line 3: cact = bcomp
...line 5: acomp
10 15 20 25
a b a b
...line 4:  aact = aext AND ccomp
a b
initial
token
t
multi_input_cycle_gantt_b
Figure 6.8. Possible event timing and events for our multi-input example (Fig. 6.7)
assuming a phase of +8 for cycle-external events at task b relative to task a, and an
initial token at the cycle-internal input of a.
0 5
...
...
...
line 1: aext , bext
line 3: bact = bext  AND acomp
line 4: cact = bcomp
...line 2: aact = aext  AND ccomp
10 15 20
b a b a
...line 5: ccomp
b a
initial
token
t
multi_input_cycle_gantt_a
14
14
122
67
- 5
Figure 6.9. Possible event timing and events for our multi-input example (Fig. 6.7)
assuming a phase of −8 for cycle-external events at task b relative to task a, and an
initial token at the cycle-internal input of b.
In Fig. 6.8 an external token arrives at task b 8 time units after the
corresponding external token has arrived at task a. We assume that the
Cycles with Multiple Inputs 105
cycle token is initially at the cycle-internal input of task a 4. As can
be seen, the schedule quickly settles into a pattern where the activating
jitter of task a is Ja,act = 2, and is thus larger than the external jitter.
On the other hand, the activating jitter of task b is Jb,act = 0, and thus
equals the external jitter. We observe the opposite effect in Fig. 6.9,
where an external token arrives at task b 8 time units before the corre-
sponding external token has arrived at task a. We assume that the cycle
token is initially at the cycle-internal input of task b 5. It is also obvious
that the additional AND-delay does not lead to unbounded backlog at
the cycle-external inputs.
multi_input_cycle_jitter
0 5 10-10 -5
ϕb
Ja,act , Jb,act
2 Ja,act
Jb,act
ϕa = 0
Figure 6.10. Activating jitter of tasks a and b as a function of the phase between
external tokens arriving at the two tasks. ϕa is zero, ϕb is varied in the interval
[−10, 10].
Fig. 6.10 shows the activating jitter of tasks a and b as a function of the
phase between external tokens arriving at the two tasks. ϕa is defined
as zero, ϕb is varied in the interval [−10, 10]. We observe that at least
one of the activation jitters is zero for all phases. We could therefore
analyze two corresponding feed-forward systems separately, one where
the cycle is cut at the input of task a, the other where the cycle is cut at
the input of task b. For each task in the cycle we would keep the smaller
minimum response time, and the larger maximum response time.
The generality of this idea has not been investigated further, since
the practical relevance of cycles with more than one external activating
inputs is doubtful. Should such a system occur in practice, then the
designer can be made aware of the complex scheduling dependencies
that he is building into his system and guided towards a more robust
solution. One such solution would be to replace all but one input by
communication registers, which can be read at any time and do not
4If the cycle token is initially at the cycle-internal input of task b, then we obtain the same
schedule for t ≥ 8.
5If the cycle token is initially at the cycle-internal input of task a, then we obtain the same
schedule for t ≥ 0.
106 Cyclic Task Dependencies
contribute to task activation. An alternative would be to use a single
AND-activated task which activates the cycles once all require input
data for one cycle iteration is available.
6.12 Multi-Rate Cycles
Due to the data-rate transitions in a multi-rate cycle, token arrival
at cycle-internal inputs becomes more irregular than in a single-rate
cycle (chapter 5). Therefore, conditions under which analysis of a cor-
responding feed-forward system is valid may be very conservative. A
less conservative solution requires considering possible event timing for
at least a macro period (chapter 5). A general solution is beyond the
scope of this thesis. However, the idea will be considered for the special
case of cycles stemming from Simulink multi-rate designs in chapter 9.
6.13 Summary and Conclusion
In this chapter we considered compositional performance analysis for
single-rate cyclic task dependencies, which are typical in control systems
and complex filters. We showed that a straight-forward application of
our compositional performance analysis approach does not yield a fix-
point. However, by considering phases between events and initial tokens
in a cycle we then showed that a significant number of cycles in fact
has a fix point such that analysis of an equivalent feed-forward system
produces correct results. Furthermore we showed that such cycles can
be arbitrarily nested. These are significant results since they allow us to
analyze many real-world systems with cyclic task dependencies with only
minor extensions to our compositional performance analysis approach
(detection of cycles, distinguishing AND-concatenated tasks inside cycles
from those outside cycle, performing simple checks).
We provided suggestions how to transform those systems that initially
do not satisfy all conditions that allow us to analyze an equivalent feed-
forward system. The designer needs to be alerted of the problem, and
a solution must be indicated. The potential problems that we identified
can be resolved either by increasing the number of initial tokens in the
cycle, or by reducing the number of external activating inputs that in-
fluence cycle tasks. The decision to modify the system then becomes a
trade-off between (perceived) performance and analyzability.
Chapter 7
SYSTEM-LEVEL ANALYSIS EXAMPLE
In this chapter, we apply the results from the previous chapters to
analyze the performance of our introductory system. For convenience,
the system is shown here again.
HW
uC DSPBus
1
1
OR
mon
s1
s2
s3
1
1
1
1
sys
c1
c2
c3
c4
c5
upd
ctrl
fltr
tmr
AND
in
out
[0,1] 1 1 1
1 1 1 1
1 1
1
1
1
2 2 2 3
1 1
1
1
1
ex
Figure 7.1. Example of an embedded real-time system with a variety of task depen-
dencies
The embedded system in Fig. 7.1 represents a hypothetical SoC con-
sisting of a micro-controller (uC ), a digital signal processor (DSP) and
dedicated hardware (HW ), all connected via an on-chip bus (Bus). DSP
and uC are equipped with local memory. The HW acts as an interface
to a physical system. It runs one task (sys) which issues actuator com-
108 System-Level Analysis Example
mands to the physical system and collects routine sensor readings. sys
is controlled by task ctrl, which evaluates the sensor data and calculates
the necessary actuator commands. ctrl is activated by a periodic timer
(tmr) and by the arrival of new sensor data (AND-activation in a cycle).
We assume 2 initial tokens in the cycle.
The physical system is additionally monitored by 3 sensors (s1 - s3 ),
which produce data sporadically as a reaction to irregular system events.
This data is registered by an OR-activated monitor task (mon) on the
uC, which decides if the control algorithm needs to be updated. If an
update is necessary (conditional communication), then this information
is sent to task upd on the DSP, which updated parameters into shared
memory.
The DSP additionally executes a signal-processing task (fltr), which
down-samples (data rate transition), filters and compresses a stream of
data arriving at input in, and sends the processed data via output out.
All communication, except for shared-memory on the DSP, is carried out
by communication tasks c1 - c5 over the on-chip Bus. Core execution
times for each task are shown in Tab. 7.1.
computation task C communication task C
mon [10, 12] c1 [4, 4]
sys [15, 15] c2 [4, 4]
upd [5, 5] c3 [4, 4]
ctrl [20, 23] c4 [8, 8]
fltr [12, 15] c5 [4, 4]
Table 7.1. Core execution and communication times
We assume the following event models at system inputs (Tab. 7.2).
input s/p Pin Jin dmin,in
s1 s 1000 0 0
s2 s 750 0 0
s3 s 600 0 0
in p 60 0 0
tmr p 70 0 0
Table 7.2. Event models at external system inputs.
In order to function correctly, the system has to satisfy a set of path
latency constraints (Tab. 7.3). Constraints 1 and 3 have been explicitly
specified by the designer. The 2nd constraint implicitly follows from the
109
fact that the cycle contains 2 initial tokens (chapter 6). Constraint 3
is defined for causally dependent tokens [123]. We shall also impose a
maximum jitter constraint at output out (Tab. 7.4).
constraint # path maximum latency
1 s1, s2, s3 → upd 70
2 cycle (ctrl → ctrl) 140
3 in → out 120
Table 7.3. Path latency constraints
constraint # output event model period event model jitter
4 out Pout = 90 Jout,max = 60
Table 7.4. Output jitter constraint (the output period automatically follows from
the data rate transition)
Figure 7.2. Single-rate version of our example system modeled in SymTA/S
110 System-Level Analysis Example
Performance analysis results were obtained to a large extent using our
system-level performance analysis framework SymTA/S [107]. However,
at the time of writing, our data rate transition algorithms had not been
integrated into SymTA/S. Therefore, we used a similar single rate sys-
tems, and performed the missing calculations using an external program.
A SymTA/S screen-shot of the single-rate system is shown in Fig. 7.2.
In the first step, SymTA/S performs OR-concatenation of the output
event models of s1 - s3 (section 4.2) and obtains the following sporadic
activating event model for task mon: Pact = POR = 250, Jact = JOR =
500. The large jitter is due to the fact that input events happening
at the same time lead to a burst of up to 3 activations (we assume no
correlations between s1 - s3 ). Since task mon is the only task mapped
onto uC, we can now perform local scheduling analysis for this resource,
in order to calculate the minimum and maximum response times, as well
as the output event model of task mon. The results of this analysis are
shown in Tab. 7.5. This is also the analysis status displayed in Fig. 7.2.
task s/p Pact Jact dmin,act C r s/p Pout Jout dmin,out
mon s 250 500 0 [10, 12] [10, 36] s 250 502 10
Table 7.5. Scheduling analysis input and output on resource uC
The worst-case response time of task mon increases compared to its
worst-case core execution time, since later activations in a burst have
to wait for the completion of the previous activations. The output jit-
ter increases by the difference between maximum and minimum core
execution times compared to the activation jitter. The minimum dis-
tance between output events equals the minimum core execution time
(section 3.6.2).
At this point, the rest of the system cannot be analyzed, because on
every resource activating event models for at least one task are missing.
SymTA/S therefore generates a conservative starting-point by propa-
gating all output event models along all paths until an initial activating
event model is available for each task (section 3.5). The initial acti-
vating event model for task fltr and all subsequent tasks in the path
is obtained by performing rate-transition calculation (chapter 5) on the
output event model of source-task in. SymTA/S then checks that the
system cannot be overloaded in the long term. This calculation requires
only activation periods and worst-case core execution times and thus can
be done before response-time calculation.
111
System-level analysis can now be performed by iterating local schedu-
ling analysis and event model propagation (section 3.6.5). SymTA/S de-
termines that task ctrl belongs to a cycle, checks that AND-concatenation
is selected, and then proceeds to analyze the corresponding feed-forward
system. SymTA/S executes until a fix-point for the whole system has
been reached, and then compares the calculated performance values
against performance constraints1.
Let us explore different scheduling options with regards to the timing
constraints that we specified above. We use static priority scheduling
both on the DSP and on the Bus and vary task priorities. In Tab. 7.6
we show performance analysis result for four experiments with different
priority assignments. In the 2nd and 3rd columns, tasks are ordered by
priority, highest priority on the left. In the last four columns, we give
the actual values for all four constrained values from Tabs. 7.3 and 7.4,
and indicate which constraints are met (
√
). A latency along path 3 is
only calculated if constraint 4 is met.
exp. Bus tasks DSP tasks constr. 1 constr. 2 constr. 3 constr. 4
1 c1, c2, c3, c4, c5 upd, ctrl, fltr 45
√
77
√
n/a 119
2 c1, c2, c3, c4, c5 fltr, upd, ctrl 60
√
107
√
n/a 97
3 c4, c5, c1, c2, c3 fltr, upd, ctrl 74 130
√
95
√
41
√
4 c1, c4, c5, c2, c3 fltr, upd, ctrl 60
√
158 111
√
57
√
Table 7.6. Bus and DSP tasks ordered by priority (highest on left), and calculated
actual values for all four constrained values from Tabs. 7.3 and 7.4
In the first experiment, only constraints 1 and 2 are satisfied. If we
increase priorities of tasks belonging to path 3 until constraints 3 and
4 are met, we violate either constraint 1 or constraint 2 (experiments
3 and 4). At this point, one solution is to increase the speed of one or
several components. Let us repeat experiment 3, but this time let us
increase the clock-rate of uC to 120% of its original clock rate. This will
speed up task mon which has the largest core execution time of all tasks
belonging to path 1 (which violates its constraint in experiment 3). As
can be seen in Tab. 7.7, all constraints are now satisfied.
There are several other solutions to satisfy all timing constraints. Let
us consider another, particularly interesting one. We will repeat exper-
iment 4 in our set of original experiments, but this time we will reduce
1In the future, it will also be possible to specify abort conditions, since it cannot be guaranteed
that a fix-point can be reached. Currently, a designer has to hit the ‘stop’ button to abort.
112 System-Level Analysis Example
exp. Bus tasks DSP tasks constr. 1 constr. 2 constr. 3 constr. 4
3′ c4, c5, c1, c2, c3 fltr, upd, ctrl 68
√
130
√
95
√
41
√
Table 7.7. Experiment 3 from Tab. 7.6 repeated with the clock-rate of uC increased
to 120% of its original clock rate.
the clock-rate of uC to 80% of its original clock rate. Surprisingly, again
all constraints are met (Tab. 7.8).
exp. Bus tasks DSP tasks constr. 1 constr. 2 constr. 3 constr. 4
4′ c1, c4, c5, c2, c3 fltr, upd, ctrl 69
√
124
√
107
√
53
√
Table 7.8. Experiment 4 from Tab. 7.6 repeated with the clock-rate of uC reduced
to 80% of its original clock rate.
The satisfaction of constraint 1 is easily comprehensible. This con-
straint was already satisfied in experiment 4, and the now slower uC did
not increase the latency on path 1 beyond the given deadline. But why
did the worst-case latencies on paths 2 and 3, as well as the jitter at
output out decrease compared to experiment 4? Due to the slower uC,
task mon produces data with a larger minimum distance d′min,out = 12.5.
This leads to larger minimum gaps between activations of task c1, and
consequently to a smaller worst-case transient load by task c1 on the
Bus. Since task c1 interrupts c2 - c4, the larger gaps between these
interrupts allow the lower priority task to execute earlier, thus reducing
the worst-case latencies on paths 2 and 3, as well as the jitter at output
out. Alternatively, we could have inserted a dmin-EAF (section 3.6.6)
between tasks mon and c1 to separate consecutive activating events for
c1. This might be a better solution if other tasks were sharing uC that
would violate their constraints if uC was slowed down.
Reducing transient overload on the bus is an example for a scheduling
anomaly [122, 97]. Worst-case response time analysis alone would not
have caught this situation, since the minimum distance between acti-
vations of task c1 is determined by the best-case response time of task
mon. Performance simulation might not have caught this situation ei-
ther, since best-cases are often not tested.
In summary, our analysis extensions allow us to reap the full range
of benefits of compositional performance analysis in the presence of a
variety of task dependencies typical in complex embedded applications.
113
This is an important step towards applicability of performance analysis
in the real world. As was shown, we were able to calculate conservative
performance bounds for our example featuring both OR- and AND-
concatenated tasks, a functional cycle, conditional communication and
data rate transitions. We explored several different scheduling options,
identified a scheduling anomaly which would likely have been overlooked
during simulation, and found a non-intuitive solution that meets all tim-
ing constraints on the target architecture while allowing to reduce the
speed of one component.

Chapter 8
CONTEXT-AWARE ANALYSIS
Most of the scheduling analysis techniques considered so far can be
rather pessimistic, because they ignore certain correlations between con-
secutive task activations. For example, worst-case response time analysis
typically assumes that every activation of a task leads to the worst-case
core execution time of that task. However, in a sequence of activations
of the same task, some activations may lead to shorter core execution
times. The designer may be able to derive rules from his knowledge of
the application which narrow down the execution times in a sequence of
task activations. We call such rules intra event stream contexts.
Also, worst-case response time analysis typically assumes worst-case
transient load distributions. This usually implies the assumption that all
tasks sharing a resource can be activated at the same time. However, in
reality it may not be possible to activate all these tasks at the same time.
Based on his knowledge of the system environment, the designer may
be able to narrow down possible phases between activations of different
tasks. We call such phase information inter event stream contexts.
The idea to improve scheduling analysis by considering some kind of
system context is not new. Mok and Chen introduced the idea of intra
event stream contexts and showed promising results for response-time
analysis for the transmission of MPEG-streams, where the worst-case
load for I-, P- and B-frames differs considerably [79, 7]. Recently, Thiele
et al. extended their compositional performance analysis with load curves
which allow them to consider intra event stream contexts [77]. Tindell
introduced the concept of inter event stream contexts for tasks scheduled
by a static priority preemptive scheduler [113]. This work was later
generalized by Palencia and Harbour [85]. Inter event stream contexts
116 Context-Aware Analysis
are also used in the holistic scheduling analysis by Eles et al. for mixed
time-triggered/event-triggered systems [90–92].
Context-aware analysis is another important step towards acceptance
of performance analysis for complex embedded applications. Our contri-
bution lies in the generalization of intra event stream contexts, as well
as in the combination of both types of contexts during analysis.
8.1 Intra Event Stream Contexts
Context-blind analysis assumes that in the worst case respectively
best case, every task executes each activation with its worst-case exe-
cution time (WCET) respectively best-case execution time (BCET). In
reality, different input data often activates different behaviors of a com-
putation task with significantly different WCETs [121, 120], or can lead
to significantly different bus loads for a communication task [126]. Such
different behaviors are sometimes called modes [126]. Modes can also be
generated internally in a task by a finite state machine which steps into
a different state each activation.
If it is possible to provide some rules for the sequence of activations
of different modes of a task C (i. e. specify intra event stream contexts),
then compared to the context-blind case a lower maximum load and
a higher minimum load can be determined during scheduling analysis
for a sequence of successive activations of the task. This in turn can
lead to shorter calculated worst-case response times and longer best case
response times of tasks that can be preempted by task C, thus obtaining
tighter performance bounds for theses tasks.
In their work, Mok and Chen assume that a periodic sequence of types
of activating events is completely known [7, 79]. While this is sufficient
to model e. g. a single MPEG-stream with a fixed sequence of I-, P- and
B-frames, in general intra event stream contexts can be more complex.
For example, an MPEG stream may be encoded in one of several patterns
of I-, P- and B-frames, or several MPEG streams may be interleaved.
A different example is an OR-activated task (section 4.2) with different
behaviors depending on the input at which data is received. In all theses
cases, the exact order of different event types is unknown.
Mok and Chen also do not clearly distinguish between different types
of events on one hand, and different task behaviors on the other. How-
ever, this distinction is crucial for subsystem integration and composi-
tional performance analysis. Different types of events are a property of
the sender, while different behaviors are a property of the receiver. Both
can be specified separately from each other and later correlated. Fur-
thermore, it may be possible to propagate intra event stream contexts
Intra Event Stream Contexts 117
along a chain of tasks. It is then possible to also correlate the modes of
consecutive tasks.
In our extension to Mok’s and Chen’s work, we allow minimum- and
maximum-conditions for the occurrence of a certain type of event in a
sequence of a certain length n, in order to capture partial information
about an event stream. For example, “min 2 blue events out of 6”
indicates that in any sequence of 6 events in an event stream, at least
two will have a property that we call ‘blue’ (in allusion to the terminology
used for colored petri nets [58]).
Once intra event stream contexts and modes are combined for a con-
sumer task C, a single worst-case and a single best-case sequence of
events with length n can be determined from the available min- and
max-conditions that can be used to calculate the worst- and best-case
load due to any number of consecutive activations of C. We have ex-
tended static-priority preemptive response-time calculation to exploit
this idea [50]. The following algorithm performs worst-case calculation.
A detailed discussion can be found [40].
/* generate worst-case token sequence of length n */
1 FulfillAllMinConditions();
2 CompleteSequenceConsideringMaxConditions();
3 SortSequenceByWeigth();
/* peform scheduling analysis */
4 CalculateWorstCaseResponseTimes(); // using equation 8.1
Figure 8.1. Algorithm for worst-case response time calculation considering intra-
contexts with minimum- and maximum-conditions for the occurrence of a certain
type of event.
Our algorithm (Fig. 8.1) first fulfills min-conditions, which specify the
smallest number of events of a certain color in a sequence of n events
(line1). After this step, in order to obtain a worst-case sequence, as
many of the remaining places in the sequence as max-conditions allow
are first filled with event colors that activate the mode with the largest
WCET, then with event colors that activate the mode with the second
largest WCET, and so on (line2). Then the resulting sequence is sorted
by weight, starting with the color that activates the mode with the
largest WCET (line3).
Intra event stream contexts can now be exploited for scheduling anal-
ysis (line4). In case of a static priority preemptive scheduler, the fol-
lowing extension of equation 3.5 gives the worst-case response time for
118 Context-Aware Analysis
lower-priority task i. Cj,k is the WCET of the kth activation of higher-
priority task j in a worst-case sequence of ηuj (ri) activations due to an
intra event stream context. If j is a task without intra event stream
context information, Cj,k equals Cj .
ri = Ci +
∑
∀j∈hp(i)
ηuj (ri)∑
k=1
Cj,k (8.1)
In [50], we compared worst-case response time analysis improvements
using intra event stream context information to the context-blind case
for a variety of system-bus speeds of a hypothetical set-top-box. An
important observation was that the greatest reductions in the calculated
worst-case response time (up to 90 %) due to intra event stream contexts
are obtained for slow bus speeds. The reason is that for slower bus
speeds low-priority traffic is interrupted more often, and thus calculation
of larger gaps between high-priority traffic using intra-contexts has a
larger influence. This observation is important since the designer usually
strives to reduce resource speed as much as possible to save cost and
power consumption.
8.2 Inter Event Stream Contexts
Context-blind analysis usually assumes that in the worst-case all sched-
uled tasks are activated simultaneously. In reality, activating events
are often time-correlated, which rules out simultaneous activation of all
tasks. For priority-based scheduling, this may lead to a lower maximum
number, and a higher minimum number of interrupts of a lower-priority
task through higher-priority tasks compared to the context-blind case,
again resulting in a shorter worst-case response time, and longer best-
case response time of the lower priority task.
Inter event stream contexts capture information about time-correlated
events in different event streams in a way that can be exploited by per-
formance analysis. Tindell introduced this idea for tasks scheduled by
a static priority preemptive scheduler [113]. Each set of time-correlated
tasks is grouped into a so called transaction. Each task is activated
when an offset elapses after the activation of the transaction to which it
belongs. Transactions are activated by periodic external events.
Tindell showed that the worst-case contribution of a transaction to the
response time of a lower-priority task occurs when the activation time
of the lower-priority task happens as soon as possible after the critical
instant of the transaction. The critical instant is the activation time of
one of the transaction’s higher-priority tasks. Subsequent activations of
Example 119
higher-priority tasks belonging to the transaction also have to happen
as soon as possible after the critical instant.
Since all activation times of all higher-priority tasks belonging to a
transaction are candidates for the critical instant of the transaction, the
worst-case response time of a lower-priority task has to be calculated for
all possible combinations of all critical instants of all transactions that
contain higher priority tasks, to find the absolute worst-case. I. e. the
algorithm is exponential with the number of transactions.
In [50], we compared worst-case response time analysis improvements
using inter event stream context information to the context-blind case
for a variety of offset values. An important observation was that inter
event stream context analysis reveals the non-linear influence that a
small local change, in our example the speed of one execution unit, can
have on system-performance.
8.3 Combination of Contexts
Inter and intra event stream contexts are orthogonal: the worst-case
response time of a lower-priority task is reduced both because fewer
high-priority task activations can interrupt its execution during a cer-
tain time interval, and because the time required to process a sequence
of activations of each higher-priority task is reduced. Therefore, perfor-
mance analysis can be further improved if it is possible to consider both
types of contexts in combination.
8.4 Example
Let us apply context-aware scheduling analysis to the system in Fig. 8.2,
where three tasks have been mapped onto a static priority preemptive
scheduled resource. Task properties are given in table 8.1. The core ex-
ecution time of task b is an interval. A detailed model reveals that task
b has three modes with different core execution times that are activated
by events with different colors.
task Pact C priority
a 100 30 hi
b 100 [15, 25] (red = 25, green = 20, blue = 15) mid
c 300 100 lo
Table 8.1. Task properties used in our context-aware scheduling analysis example.
Task b has three modes with different core execution times that are activated by
events with different colors.
120 Context-Aware Analysis
context_ex
static priority
c
b
a(P = 100, J = 0)
tcore = 30
tcore = [15,25]
tcore = 100
(P = 100, J = 0)
(P = 300, J = 0)
hi
mid
lo
Figure 8.2. Example system for context-aware scheduling analysis
color min # of events max # of events
red 2
green 2
blue 1 1
Table 8.2. Intra event stream context: Min- and max-conditions for event colors in
a sequence of 4 events generated by task sb
Let us suppose that we know min- and max-conditions for a sequence
of 4 events generated by source sb which activates task b (table 8.2). The
resulting partial sequence after fulfilling all min-conditions is: (‘green’,
‘green’, ‘blue’). A color now has to be assigned to the 4th event, and
the sequence must be sorted. In the worst-case, the 4th event is ‘red’,
because it activates the mode of task b with the longest core execution
time (table 8.1). The sorted worst-case event sequence of length 4 is:
(‘red’, ‘green’, ‘green’, ‘blue’). In the best-case, the 4th event is ‘green’.
This event only activates the mode with the 2nd shortest core execution
time (table 8.1). However, we cannot add another ‘blue’ event (which
activates the mode with the shortest core execution time), because it
would violate the max-condition in table 8.2. The sorted best-case event
sequence of length 4 is: (‘blue’, ‘green’, ‘green’, ‘green’).
Let us further assume that we know the phase between activations of
tasks a and b: activations of task b occur 50 time units after activations
of task a. We can now compare scheduling analysis results for four cases:
context-blind, only intra contexts, only inter context, and combination
of both contexts. Calculated response times are given in table 8.3.
Neither intra nor inter-contexts can affect task a, since its response
time is already a single value in the context-blind case. The response
time of task b cannot be affected by its own intra-contexts, since for
Summary and Conclusion 121
task rblind rintra rinter rintra+inter
a 30 30 30 30
b [15, 55] [15, 55] [15, 30] [15, 30]
c [145, 265] [150, 255] [160, 240] [165, 235]
Table 8.3. Calculated response times: Context-blind, only intra contexts, only inter
context, and combination of both contexts
gant3
T1 T2 T1 T2 T1 T2
t
T3 T3
T1 T2 T2 T2T1 T1
255
t
T3T3
T1 T2 T1 T2 T1
T3 T3 T3 T3 t
240
T1 T2 T1 T2 T1
t
T3 T3 T3 T3
c)
d)
a)
b)
45 45 10
265
45 50 5
20 25 20 30 5
235
20 25 20 25 10
Figure 8.3. Worst-case response time scenarios for task c obtained in the context-
blind case (a); analysis improvement due to intra (b), inter (c), and the combination
of intra and inter event stream contexts (d)
worst-case (best-case) response time calculation the longest (shortest)
core execution time must be considered across all modes, which is equiva-
lent to not considering the intra-context at all. The inter-context reduces
worst-case response time, since a phase of 50 guarantees that b cannot
be preempted by a. Both the best- and the worst-case response time of
task c are affected by both types of contexts, and by their combination.
The worst-case response time scenarios for task c for all four cases are
shown in Fig. 8.3.
122 Context-Aware Analysis
8.5 Summary and Conclusion
In this chapter we introduced intra and inter event stream contexts
and demonstrated that considering them can yield significantly tighter
performance analysis bounds compared to context-blind analysis. We
followed the approach suggested in SPI [126] and separated the modeling
of context information from the modeling of task modes. This approach
supports the idea of subsystem integration in compositional performance
analysis, since contexts and modes can be specified separately and later
correlated. We used a simple activation function where the occurrence
of an event with a particular color activates a particular mode. This
could be extended to more complex activation functions as suggested
in SPI, e. g. to consider a combination of colors at different task inputs
when determining which mode is activated.
Apart from fully specifying repeating sequences of colors, we addition-
ally consider minimum- and maximum-conditions for the occurrence of
a particular color in a sequence of a certain length n, in order to capture
partial information about an event stream. This allows us to apply intra-
context aware analysis for event streams where the exact order of colors
is unknown, e. g. if an event stream is multiplexed from multiple sources.
We presented an algorithm which allows us to calculate the worst-case
sequence of colors for the activation of a task with known modes. We ex-
tended static-priority preemptive response-time calculation to exemplify
the applicability of our approach. We borrowed the inter event stream
context model and analysis ideas by Tindell and Palencia and showed
that they can be nicely combined with intra event stream contexts, since
both types of contexts are orthogonal and improve analysis in different
ways.
We assumed that intra- and inter-contexts are specified manually by
the designer. Future work could derive inter-contexts from the response
times of analyzed tasks. We implicitly used this approach to determine
under which conditions a cycle-internal token does not influence the acti-
vating event model of an AND-concatenated task in a cycle (chapter 6)1.
Some intra-contexts can also be derived automatically e. g. by analyzing
the timing at different inputs of an OR-concatenated task.
1Contexts alone would not have done the trick since the concept of initial tokens is also
needed
Chapter 9
PERFORMANCE ANALYSIS FOR
SIMULINK
Simulink [74] is a block diagram oriented industry standard modeling
and simulation tool for control and signal processing systems. It builds
on the Matlab [73] environment for technical computing. The Simulink
extension StateFlow [75] allows to internally model a Simulink block
as a state-transition diagram. A typical application of Simulink is the
modeling and simulation of a physical system (e. g. from the automotive
domain) together with associated control or signal processing function-
ality. While the model of the physical system is eventually replaced
by the actual system, the control or signal processing functionality is
implemented as an embedded real-time system.
After an overview of Simulink, we first argue that the scheduling poli-
cies that can be synthesized for Simulink designs using existing code-
generators are too restricted for state-of-the art applications and ar-
chitectures. We then show how the Simulink model of computation
can be relaxed to enable event-driven scheduling, which is much more
flexible, scales well to multi-processor architectures, and is subsystem-
integration friendly. Finally, we show that system-level performance
analysis is possible for Simulink using our compositional performance
analysis approach, albeit with an extended model for phases between
different events.
9.1 Simulink Model of Computation
A number of fixed-step and variable-step solvers are available to eval-
uate a Simulink block-diagram at certain points in time. Variable-step
solvers are important in certain situations to model a physical system ac-
curately (e.g. to find zero crossings), but are of little significance for the
design of control and signal-processing embedded systems. These appli-
124 Performance Analysis forSimulink
B1
ts =1
B4
ts=4
B3
ts =3
B2
ts =2
simulink_ex_diag
(a) Simulink block-diagram
with four blocks with different
sample times
B1
B3
B2
B4
tsim   0  1  2  3  4  5  6  7  8  9
simulink_ex_time
(b) Block execution sequence and
progress of simulated time
Figure 9.1. Simulink example
cations are generally modeled using fixed-step solvers to enable code gen-
eration into periodic processes. In the following, only fixed-step solvers
are considered.
When a fixed-step solver is used, the Simulink model of computa-
tion adheres to the perfect synchrony hypothesis [8]. A perfect global
clock and strictly periodic block activation are assumed (different pe-
riods are allowed for different blocks). While blocks are executed and
results are communicated, time does not advance. Time advances only
after all computations and communications that need to be performed
at a particular clock-tick have been completed. This model is well suited
for function simulation, since influences from a real architecture are
completely ignored (separation of concerns). Consider the example in
Fig. 9.1(a) that shows a simple Simulink block-diagram with four blocks
B1 − B4, each with a different sample time (or execution period) ts.
The execution timing of the different blocks during simulated time is
shown in Fig. 9.1(b). As can be seen, computation and communication
are assumed timeless. If multiple blocks need to be executed at the same
point in time, then the edges between blocks (Fig. 9.1(a)) define prece-
dence constraints which are obeyed by the simulation engine (indicated
by gray arrows in Fig. 9.1(b)).
Simulink blocks additionally can have enabling and/or triggering in-
puts to allow conditional execution. An enabled block is activated with
it sample time, but only executed if the enabling condition is satisfied.
A triggered block is activated with the sample time of the block driving
the trigger input, but only executed if the triggering condition is satis-
Code Generation from Simulink 125
fied. Trigger inputs are of particular interest for blocks that contain a
state-transition diagram.
9.2 Code Generation from Simulink
A perfectly synchronous model cannot be implemented on a real ar-
chitecture. In practice, the implementation must be sufficiently close to
simulation. This implies on one hand, that Simulink code generators
such as Real-Time Workshop [76] and Target Link [26] have to generate
code that allows to compute the same results in the same order as simula-
tion. On the other hand, the target architecture must be able to execute
all blocks fast enough with respect to external stimuli (satisfaction of
the synchrony hypothesis [8]).
9.2.1 Tick Scheduling
One solution is to execute all computations and communications re-
quired at a certain point in time before time has advanced to a value
where the next set of computations has to be performed [8]. Real-Time
Workshop supports this solution in the so-called single tasking mode [76].
In this mode, a tick-scheduler is used to activate each set of computations
at the right points in time.
t
t
t
t
B1
B2
B3
B4
simulink_ex_tick
idle t
0 2 4 61 3 5
Figure 9.2. Tick scheduling of our Simulink example
A tick-scheduler can be a rather inefficient solution. It must be en-
sured that the longest sequence of computations that is required at any
clock-tick completes before the next clock-tick. This can lead to exces-
sive idle times after short sequence of computations. In our example, it
must be possible to execute all four blocks during one clock-tick, as this
is required for t = 0, t = 12 and so on. During all other clock ticks, the
126 Performance Analysis forSimulink
load is considerably smaller (idle line in Fig. 9.2). For the core execution
times assumed in our example, processor utilization is only 47.22%.
9.2.2 Rate Monotonic Scheduling
An improved solution is to assign priorities to tasks according to their
execution rate. This is generally called rate monotonic scheduling (RMS,
section 3.2). Real-Time Workshop supports this solution in the so-called
multi-tasking mode [76].
t
t
t
t
B1
B2
B3
B4
0 2 4 6 8 10 12
simulink_ex_RMS
idle t
Figure 9.3. RMS scheduling of our Simulink example
In RMS, it must only be ensured that a task with a particular exe-
cution rate has been completed before the next activation of the task.
Compared to tick-scheduling, load peaks can be processed over a longer
stretch of time, thus balancing the load that has to be processed dur-
ing any time interval. This is shown for our set of blocks in Fig. 9.3.
Note that the processor clock-speed has been halved compared to tick-
scheduling (different time scales in Figs. 9.2 and 9.3), and that the system
is still schedulable. For the core execution times assumed, now only B1
and B2 can be fully executed during one clock tick. The execution of
B3 and B4 is spread over several clock ticks. However, all tasks meet
their deadline. Resource utilization has increased to 94.44%. In reality,
the performance improvement due to RMS is diminished somewhat by
the RTOS-overhead (section 2.4.1).
However, RMS introduces a new problem [76]. If a faster task provides
data to a slower task, then the faster task may produce new output
values before the slower task has finished reading all previous values.
Data integrity is thus not guaranteed. Likewise, if a slower task provides
data to a faster task, then the faster task may start executing before the
slower task has provided all necessary values. Both effect can be seen in
Simulink Code Generation Issues 127
Fig. 9.3, e. g. at t = 1 between B1 and B3 (fast → slow), and at t = 0
between B3 and B2 (slow → fast).
To solve these problems, Simulink provides rate transition blocks [76].
These blocks buffer values between faster and slower blocks in a way that
data consistency is guaranteed. Rate transition blocks require additional
memory for buffering data, and in case of a transition from a slower to a
faster execution rate, delay the propagation of data and thus may change
the dynamics of a system.
9.3 Simulink Code Generation Issues
Even though preemptive scheduling for Simulink improves on tick-
scheduling, both approaches have several limitations which render them
poorly suited for complex architectures and applications.
9.3.1 Multi-Processor Implementation
As explained in section 2.1.2, a major problem with global synchrony
is that it does not scale well to large architectures. In the case of Simu-
link, the available schedulers require that each task can be activated
periodically, and that activations cannot backlog. This is difficult to
achieve for complex multi-processor architectures where arrival times
of data can jitter considerably. For example, the automotive industry,
which often models in Simulink, is trying to move away from many
insular single-processor control systems towards a global view of a net-
worked multiprocessor system that allows them to distribute functions
more efficiently [115]. In section 2.4.1 it was explained that an efficient
multi-processor implementation of complex applications is only possible
if task scheduling is reactive. This is neither the case for tick-scheduling
nor for rate-monotonic scheduling.
Another inefficiency stems from the fact that code generators cannot
decide for which Simulink blocks strict periodicity is most important.
Often, periodicity is only required at sub-system inputs and outputs,
while internally the timing is less restricted. Furthermore, even at inputs
and outputs it may be acceptable to diverge from a strictly periodic
execution up to a certain jitter. Finally, certain paths through a sub-
system may be more time-critical than others. There thus exists a variety
of timing constraints which cannot be specified in Simulink.
Today, in order to satisfy timing constraints, the designer can manu-
ally try to tweak the execution order for a set of blocks within the limits
imposed by the existing schedulers. This is tedious and may have to be
repeated often, since such restrictive schedules are sensitive to design
changes (section 2.1.2). Therefore, it would be much more desirable if
128 Performance Analysis forSimulink
the restrictions imposed by perfect synchrony could be replaced by a
set of timing constraints that exactly specify which timing really mat-
ters (compositional approach). This novel freedom would provide more
scheduling options, thus increasing the potential for an optimized imple-
mentation that is also robust to small design changes. Ideally, a chosen
schedule could be validated against timing constraints using scheduling
analysis techniques.
9.3.2 Multi-Language Design
As explained in section 2.1.4, complex applications are often designed
using multiple languages or tools. The different sub-functions then have
to be integrated on the target architecture. Function integration can
easily violate assumptions made during tick-scheduling or RMS of Simu-
link designs. For example, preemption can easily corrupt a tick-schedule,
since execution of a set of blocks may no longer be complete before the
next clock tick. TDMA scheduling, where the tick-schedule is assigned
a sufficiently large time-slot may be possible, but this only aggravates
the inefficiencies inherent to time-driven scheduling (section 2.4.1), since
an even larger worst-case number of tasks must be executable in a short
time window.
RMS is better suited for integration with other applications, since it
already assumes a priority-based scheduler. However, rate-monotonic
priority assignment by no means guarantees timely completion, and a
previously valid schedule can easily be invalidated by additional tasks
with a sufficiently high priority. We again conclude that in dynamic,
reactive systems, more flexible scheduling policies are better suited to
resolve such resource conflicts.
9.4 Model Relaxation
It has been established in the previous section that scheduling poli-
cies supported by current Simulink code-generators quickly reach their
limits when it comes to complex, dynamic, or multi-language systems.
Consequently, it would be desirable to preserve the synchrony hypothe-
sis during function design, but relax it in a way during implementation
that multi-processor implementation, reactive scheduling, and language
integration are supported.
We present an idea where the time-driven activation of Simulink
blocks is replaced by data-driven (and hence reactive) task activation.
The following aspects of Simulink have to be preserved: 1) relative exe-
cution rates between Simulink blocks, and 2) causality and the resulting
partial ordering of Simulink blocks, including read and write access se-
Model Relaxation 129
quences on each edge to model the register semantics used by Simulink
for communication. The SPI model ([126] and section 2.6) will be used
to express the transformed Simulink design. SPI allows to efficiently
capture the activation dependencies that we are interested in, while ab-
stracting functional details which do not directly influence performance.
9.4.1 Single-Rate Designs
When two Simulink blocks A and B with the same sample rate com-
municate, then it is possible to replace the communication register by
a communication FIFO. This implies that the execution sequence be-
tween the two blocks is no longer fixed to ABAB . . . . Instead, A can
be executed several times, and the output values buffered until they are
consumed by B. This scheduling flexibility comes at the cost of a larger
communication buffer and delay. However, buffers and delays can be
easily bounded by additional constraints (including a buffer size of one,
which then enforces the original, alternating schedule).
9.4.2 Multi-Rate Designs
When two Simulink blocks A and B with different sample rates com-
municate, then the simple FIFO-solution from the previous section is no
longer applicable. In case of a rate transition from a slower to a faster
sample-rate, a value may be read more than once. In case of a rate
transition from a faster to a slower sample-rate, a value may not be read
at all. Both effects cannot be implemented using a FIFO. Therefore,
we use SPI register communication to accurately capture the Simulink
destructive write, non-destructive read semantics. One token is written
(read) on the register channel per activation of the writing (reading)
process.
Since a register channel can be read at any time, it cannot contribute
to the activation function of a reading SPI process (section 2.6). There-
fore, a pair of FIFO-queues is additionally inserted between every two
processes with different sample times that communicate over a register
channel [48]. Activation of the generated SPI processes is enabled by
availability of tokens on those FIFO-queues. Relative execution rates
and partial ordering between Simulink blocks are maintained by writing
(reading) the appropriate number of tokens to (from) each queue, and by
the number of initial tokens on each queue, as specified in the following
equations.
r(Pi) = ts(Bi) (9.1)
nCj(Pwr→Prd) = r(Prd)− 1 (9.2)
nCj(Prd→Pwr) = r(Pwr) (9.3)
130 Performance Analysis forSimulink
r(Pi) is the number of tokens written and read by process (Pi) per exe-
cution on each of its FIFO channels. ts(Bi) is the sample time of block i.
nCj is the number of initial tokens on FIFO queue Cj . The direction of
queue Cj is indicated by indices Pwr and Prd, which refer to the writing
and reading processes of the corresponding register channel. If sample
times are not integer values, they have to be multiplied by an appropri-
ate factor to obtain integer values r(Pi) for each process. Likewise, if
sample times have a common divisor, they can be canceled down.
P1
P4
P2P3
4
1
1
3
3 3
3 2
2
1
1
4
3
12
1
3
1
c1
c2
c4
c5
c6
c3
simulink_ex_SPI_diag
Figure 9.4. SPI representation of the Simulink example from Fig. 9.1(a). Dashed
elements are virtual
simulink_ex_SPI_sched
step
1
2
3
4
0
firing
P3
P2
P2
P3
c5 c6
tokens in virtual 
 FIFOs after firing
1
4
2
0
3
3
0
2
4
1
13P25
P2P3
3
3 2
2
3
1
c5
c6
Figure 9.5. Valid execution sequence for processes P2 and P3
Let us apply these rules to our Simulink example in Fig. 9.1(a). The
SPI representation of the system is shown in Fig. 9.4. Large circles
indicate processes. The number of tokens written or read per execution
is annotated to each process port. Small circles indicate FIFO-queues.
The number of initial tokens is annotated. Squares indicate register
channels. For example, each execution of process P3 consumes 3 tokens
on channels c4 and c5, produces 3 tokens on channels c3 and c6, and
Model Relaxation 131
reads and writes the connected registers. Fig. 9.5 focuses on processes
P2 and P3. The table shows the only valid execution sequence for these
two processes resulting form their production/consumption rates and
from the initial number of tokens on channels c5 and c6. The sequence
repeats periodically after the 5th step and is consistent with the order
in which the corresponding Simulink blocks are simulated (Fig.9.1(b)).
Possible execution sequences for the complete example in Fig. 9.4, or for
any other Simulink design, can be derived accordingly.
9.4.3 Implications for Scheduling
As can be seen, the exact timing of processes is no longer specified in
the SPI-representation. Essentially, we have transformed the time-driven
Simulink model of computation into a synchronous dataflow representa-
tion (section 2.1.1). The only difference to SDF is that the FIFO-queues
are not used for communication between processes. They merely rep-
resent scheduling constraints, and consequently are modeled as virtual
SPI elements (section 2.6). This does not hinder the application of a
variety of available SDF-scheduling techniques. For example it is now
easy to distribute processes to different processors to exploit application
parallelism, and to optimize schedules for throughput, latency, memory
consumption etc.
If we choose a dynamic scheduling policy, tasks generated from com-
municating Simulink blocks with different sample times cannot interrupt
each other, since at any time, only one of them can be active. This is
because the total sum of tokens in a cycle is r(Pwr) + r(Prd)− 1. How-
ever, for simultaneous activation of both processes, r(Pwr) + r(Prd) to-
kens would be needed. Since communicating tasks with different sample
times cannot interrupt each other, there is no need for rate transition
blocks, which were needed for RMS (section 9.2.2).
Reactive scheduling is much more flexible than RMS. Due to activa-
tion by availability of data, it does not matter if the timing of input data
jitters. A task can only start executing once all tasks that it depends
upon have completed. Let us consider the example in Fig. 9.6. Let us
assume that B1 is externally activated periodically with a certain jit-
ter. A sample sequence of external input events is shown in line ext in
Fig. 9.6. Let us further assume that the path B1 → B4 is time criti-
cal, and that therefore task priorities have been assigned (highest first):
B1, B4, B3, B2.
Compared to Fig. 9.3, activation timing of tasks in Fig. 9.6 is more
irregular (including bursts of 2 activations for B2), both due to the ex-
ternal jitter and the data-dependencies in the system. For example, the
activation of B1 can occur later than an external event, if the previous
132 Performance Analysis forSimulink
t
t
t
t
B1
B2
B3
B4
0
simulink_ex_dist
idle t
2 4 6 8 10 12
text
Figure 9.6. Reactive scheduling of our Simulink example
executions of B3 and B4 have not been completed. No rate transitions
blocks are required, since data-dependent tasks do not interrupt each
other (but B1 and B4 can interrupt B2). The path B1 → B4 is exe-
cuted as quickly as possible, much faster than in the RMS case. The
processor utilization is as good as for RMS1. We could have chosen a
different priority assignment, or even a different scheduling policy de-
pending on the input event timing and timing constraints that need to
be satisfied. This flexibility is simply not available with RMS.
One remaining issue is scheduler synthesis. In [48] we showed how to
synthesize cyclo-static schedules for multi-rate Simulink designs. In case
of dynamic scheduling, the scheduler needs to keep track of the number
of produced and consumed tokens on the virtual FIFO channels. Locally,
this could be achieved by simple counters. If communicating processes
are mapped to different resources, then the number of produced vir-
tual tokens could be made part of a handshake communication protocol.
Scheduler synthesis is not considered further in this thesis. It is however
essential to exploit the benefits of reactive scheduling and should be de-
veloped in close cooperation with a company that offers code-generation
for Simulink.
1Even better in general, since RMS may become un-schedulable if processor load increases
above 69% [68]
Scheduling Analysis 133
simulink_ex2
BA
ts =4
BB
ts =6
(a) Simulink model
simulink_ex2_SPI
14
150
210
BA
2
2 3
3
2
2
c1
c2
12
1(PA,ext = 4, JA,ext = 0)
(b) SPI representation (sample times have
been canceled down to obtain token rates)
Figure 9.7. Example used to explain our analysis approach
9.5 Scheduling Analysis
Like any other scheduling policy, reactive scheduling does not guar-
antee that all timing constraints are met. Therfore, we would like to
analyze the possible timing of a reactive Simulink implementation using
the SymTA/S approach. This is straight-forward for single-rate systems,
where the tasks generated from Simulink communicate via FIFO buffers
(section 9.4.1). However, if two Simulink blocks with different sample
times communicate, we obtain a cyclic dependency with two data rate
transitions (section 9.4.2). So far, we have only studied data rate transi-
tions without cycles (chapter 5), and cycles without data rate transitions
(chapter 6). Both shall now be combined for the special case of tasks
generated from Simulink.
In chapter 6 it was shown that even for single-rate cycles a naive ap-
proach where only event models are propagated along the cycle without
considering further event timing correlations does not yield an analysis
fix-point. Therefore, let us follow the approach from chapter 6 and try
to find a fix-point for our multi-rate cycle problem by additionally con-
sidering the correlations between the arrival timing of different tokens
in the cycle.
The Simulink example in Fig. 9.7(a) shall be used for illustration. It
consists of two blocks BA and BB. BA has a sample time of 4 time
units, BB has a sample time of 6 time units. The corresponding SPI
representation with tasks A and B is shown in Fig. 9.7(b). We assume
that an input event at the cycle-external input of task A occurs period-
ically every 4 time units with a jitter of 1 time unit. We are interested
in calculating the activating event models, response time intervals and
output event models for both tasks.
134 Performance Analysis forSimulink
9.5.1 Schedulability Condition
Due to the data rate transitions, not every token arrival leads to an
activation of tasks A or B. Therefore, let us consider one complete macro
period, after which all tokens are in their initial buffers. The following
is a sufficient condition to avoid overload: let nA be the number of
executions of task A, and nB the number of executions of task B during
one macro period. In our example, the macro period is ABABA, i.e.
nA = 3 and nB = 2. An input event at the cycle-external input of task A
occurs on average every PA,ext (4 in our example). Consequently, a new
macro period needs to be started on average every nA ∗PA,ext (12 in our
example). Since A and B can never be executed at the same time due
to data dependencies, the cycle is definitely schedulable with bounded
external buffers if the sum of worst case response times of tasks A and
B during one macro period does not exceed the average time between
starts of the macro period. I. e.
nA ∗WCRT(A) + nB ∗WCRT(B)
!≤ nA ∗ Pext (9.4)
9.5.2 Analysis Approach
As was the case in chapter 6, let us initially assume that the acti-
vation timing of task A is solely determined by tokens arriving at the
cycle-external input. For our example this implies (PA,act = PA,ext =
4 , JA,act = JA,ext = 1). Let us additionally define that at system-
startup, the first external input event of task A occurs during t = [0, 1],
the second during t = [4, 5], and so on (Fig. 9.8, line 1).
Let us assume that scheduling analysis yields a response-time in-
terval of [1, 2] for task A. The resulting output event model is then
(PA,out = 4 , JA,out = 2). Consequently, task A completes the first exe-
cution during t = [1, 3], the next execution during t = [5, 7], and so on
(line 2).
Due to the data rate transition between tasks A and B, the activating
event model for task B becomes (PB,act = 6 , JB,act = 4) (section 5.1).
Let us assume that scheduling analysis yields a response-time interval of
[1, 3] for task B. We now need to check the schedulability condition in
inequation 9.4, which is satisfied: 3 ∗ 2 + 2 ∗ 3 ≤ 3 ∗ 4.
The output event model of task B is (PB,out = 6 , JB,out = 6). If we
simply calculated the cycle-internal input event model for task A from
the output event model of task B using the results from section 5.1, we
would obtain an internal input jitter JA,int = 10, invalidating our initial
assumption. Such an approach will not yield a fix-point. Therefore, we
will consider event timing in more detail.
Scheduling Analysis 135
(
SimulinkCycle
0 4 8 12
...
16
...
...
...
...
...
...
...
...
fix-point
line 1: Aact = Aext
line 2: Bact = Acomp
line 3: Bcomp
line 4: Aact = Aext  AND Bcomp
line 5: Bact = Acomp
line 6: Bcomp
line 7: Aact = Aext  AND Bcomp
line 8: Bact = Acomp
line 9: Bcomp
)
( )
( )
t
Figure 9.8. Possible event timing as calculated during scheduling analysis of our
multi-rate Simulink example
The first and second completions of task A each lead to one activation
of task B. Consequently, task B completes its first execution during t =
[2, 6], and the second execution during t = [6, 10]. The third completion
of task A does not activate task B (indicated by parentheses in line 2).
The second and third activations of task A depend on the first and
second completions of task B, respectively. Therefore, the second and
third activation of task A can be delayed by up to one time unit com-
pared to our initial assumption (line 4). This means that our initial
assumption was wrong. We have to repeat scheduling analysis with
a modified activating event model for task A. For this purpose, we
need a single activation jitter. To be conservative, we use the largest of
all individual activation jitters, yielding a new activating event model
(PA,act = 4 , JA,act = 2) for task A.
For simplicity, let us assume that the modified activating event model
does not change the response time interval of task A. The new possible
timing of completions of task A is shown in line 5. A conservative output
136 Performance Analysis forSimulink
event model is (PA,out = 4 , JA,out = 3), which results in an activating
event model (PB,act = 6 , JB,act = 5) for task B. Let us assume that
the modified activating event model also does not change the response
time interval of task B, yielding the possible completion times shown in
line 6.
From line 7 we see that we now have to modify the activating event
model for task A to (PA,act = 4 , JA,act = 3) to capture the largest jitter
of all activations. The resulting completion timing of task A is shown in
line 8. Compared to line 5, only the output jitter of the third execution
of task A has increased. Since the third execution of task A does not
activate task B, the activating event model for task B remains valid.
Therefore, a fix-point has been found. The output event model of task
A is conservatively assumed as (PA,out = 4 , JA,out = 4). This event
model is propagated to any other task which might be activated by A.
The final step is to obtain a periodic with jitter output event model
for task B. PB,out is obviously 6. Since the maximum distance between
two completions of task B is dmax = 18 − 6 = 12, and the minimum
distance between two completions of task B is dmin = 6 − 6 = 0, we
obtain (PB,out = 6, JB,out = 6) (section 3.1.1).
Let us review the extensions to our compositional performance anal-
ysis that were required to arrive at these results. Instead of only prop-
agating event models, we additionally propagated the possible phase of
each event during one macro period, and exploited our knowledge of
task dependencies to arrive at a tight activating event model for task A.
Other extensions were not necessary. Iteration to reach a fix-point is al-
ready a key existing part of our approach. The calculation of activating
event models in the presence of data rate transitions (section 5.1) was
reused on the feed forward path from task A to task B. Most impor-
tantly, we did not need to extend local scheduling analysis techniques,
since phase information was not required for this step. Consequently,
any available technique can be used, and our compositional approach is
not compromised.
Let us now discuss the generality of the presented approach. The ap-
proach is independent of the size of the external jitter, and also remains
valid if different response time intervals for a task are calculated for dif-
ferent iterations of our algorithm. In the latter case, the schedulability
condition (inequation 9.4) has to be checked each iteration. The algo-
rithm will always terminate with one of two possible outcomes. Either
a fix-point is found, or the schedulability condition is violated.
The approach has to be extended for situations where a Simulink
block A is connected to multiple Simulink blocks with sample-rates dif-
ferent from A (as e. g. in Fig. 9.1(a)). In that case, the task generated
Scheduling Analysis 137
from A has more than two AND-concatenated inputs, which must all
be considered when calculating the possible activation timing of A. A
much simpler solution is to decouple neighboring sample-rate transitions
by inserting dummy tasks such that each task has at most one neighbor
generated from a block with a different sample rate. Communication
between task and dummy task is via a FIFO-channel in accordance with
section 9.4.1. The FIFOs are responsible for decoupling. This solution
is an attractive trade-off between buffer-size versus analyzability (de-
sign for analyzability). An example is shown in Fig. 9.9 for our initial
Simulink system (compare to Fig. 9.4).
P1 P3aP3
3
31
1
1
c3
c4
simulink_ex_SPI_diag_decoupled
P4P1a
1
1 4
4
1
3
c2
c1
P1b P2
2
23
3
c5
c6
1
1
1 1
2
1 1
3
1
Figure 9.9. SPI representation of the Simulink example from Fig. 9.1(a) with decou-
pled cyclic task-dependencies.
9.5.3 Triggered and Enabled Blocks
Triggered and enabled blocks are modeled as processes with two modes
(section 8.1): In one mode, the triggering/enabling condition evalu-
ates to true and the block is executed. In the other mode, the trig-
gering/enabling condition evaluates to false and nothing else happens.
Without further knowledge, during scheduling analysis the core exe-
cution time of a triggered/enabled task is assumed to be an interval
covering both modes. If intra-context information exists on the occur-
rence of true/false conditions, then this can be taken into account during
scheduling analysis as explained in section 8.1.
FIFO-communication between blocks with the same sample time (sec-
tion 9.4.1) is not allowed if one of the blocks is a triggered or enabled
block, since the FIFO-buffer could over- or underrun. Instead, the so-
lution for multi-rate designs (section 9.4.2) can be used with unit pro-
ducer/consumer rates on the virtual FIFOs between the blocks. These
FIFOs are written/read in both modes and thus are accessed in the same
138 Performance Analysis forSimulink
way as in the case of regular Simulink blocks. The reading/writing of
the communication register can be conditional, since it is not buffered
and does not contribute to activation.
9.6 Summary and Conclusion
In this chapter we relaxed the perfectly-synchronous Simulink model
of computation into an SDF-like representation, which allows reactive
scheduling. We then showed the advantages of reactive scheduling com-
pared to the standard tick-scheduling or rate-monotonic scheduling. In
particular, it becomes much easier to handle activation with jitter, and
to optimize a schedule to satisfy timing constraints. Both benefits are
particularly valuable for complex applications and architectures. The
relaxation step is applicable for arbitrary Simulink designs.
In the second part of the chapter, we showed that standard scheduling
analysis techniques combined with event-model calculation in the pres-
ence of AND-activation and data-rate transitions can be re-used to ana-
lyze the performance of a reactive system obtained from Simulink. This
is an important result since all these analysis techniques were developed
without a specific modeling tool in mind. The only, albeit non-trivial
extension introduced specifically for Simulink was a more detailed mod-
eling of possible event timing during one macro period of neighboring
Simulink blocks with different sample-times. This allowed us to dis-
cover correlations and find a fix-point. This extension requires far less
work than a hypothetical, non-reusable scheduling analysis framework
designed specifically for Simulink.
We provided a suggestion how to transform a system with multiple
neighboring sample-rate transitions to enable analysis. As was the case
for single-rate cycles, the designer needs to be alerted of the problem, and
a solution must be indicated. Again, the decision to modify the system
becomes a trade-off between (perceived) performance and analyzability.
Chapter 10
SUMMARY AND OUTLOOK
In this thesis, we presented an analytical approach for performance
verification of embedded systems. Our main contribution is a method-
ology that enables reusing existing analysis techniques to analyze the
performance of complex embedded applications.
Embedded applications exhibit a variety of activation dependencies
between tasks. Of particular importance are tasks with multiple activat-
ing inputs, data-rate transitions between tasks, conditional communica-
tion, and cyclic task dependencies. During implementation, applications
are mapped onto heterogeneous multi-processor architectures, where a
variety of scheduling policies controls access to computation and com-
munication resources shared between tasks. This leads to a large number
of non-functional dependencies which affect performance. While exist-
ing scheduling analysis techniques capture scheduling dependencies in
great detail, they mostly use an over-simplified application model where
one activation of a task depends on exactly one execution of exactly one
predecessor task. These techniques are thus not applicable for complex
embedded applications. In particular, this is a restriction of a novel
compositional performance analysis approach developed by our group.
This approach is one of the first to enable performance analysis for het-
erogeneous multi-processor architectures in a scalable way that supports
subsystem-integration, making it an attractive choice for large systems.
In order to extend our compositional performance analysis approach
for complex embedded applications, we developed transformations in
this thesis between the variety of application-level task dependencies,
and the possible timing of activating events at the implementation level
as required by scheduling analysis. In particular, we showed how to cal-
culate activating event timing in the presence of data rate transitions
140 Summary and Outlook
(with fixed rates and rate intervals), for tasks with multiple activat-
ing inputs (AND- or OR-concatenated), and in the presence of cyclic
task dependencies. These transformations allow us to apply available
scheduling analysis techniques without modification, and to maintain
analysis composability for heterogeneous architectures. In other words,
we make available all benefits of compositional performance analysis
for complex embedded applications with a small number of additional,
re-usable steps. If buffering of communicated data is required due to
AND-activation or rate-transitions, then we additionally calculate re-
quired buffer sizes as well as minimum and maximum buffering delay.
We derived a set of conditions under which analysis of cyclic task de-
pendencies can be replaced by analysis of a corresponding feed-forward
system. This is an enabling step, since it avoids keeping track of possible
phases between activations of all cycles tasks over multiple cycle itera-
tions. We provided suggestions how to transform systems that initially
do not satisfy these conditions. In a practical analysis flow, the designer
needs to be alerted of this situation, and a solution must be indicated.
The decision to modify the system then becomes a trade-off between
(perceived) performance and analyzability.
We further showed that analysis tightness can be improved consid-
erably by modeling and evaluating different types of system contexts.
Specifically, we considered intra and inter event stream contexts, and
their combination. Contexts can be specified by the designer based on
his application knowledge, and represent another important step towards
analyzability of complex applications.
The theoretical results of this thesis have been implemented to a large
extent in our compositional performance analysis framework SymTA/S.
We used SymTA/S to analyze a SoC featuring both OR- and AND-
concatenated tasks, a functional cycle, conditional communication and
data rate transitions. We explored several different scheduling options,
identified a scheduling anomaly which would likely have been overlooked
during simulation, and found a non-intuitive solution that meets all tim-
ing constraints on the target architecture while allowing to reduce the
speed of one component.
Finally, to demonstrate a link to real-world application modeling, we
showed how to apply our performance analysis extensions to designs
specified in the industry-standard tool Simulink. We first relaxed the
perfectly-synchronous Simulink model of computation into a dataflow
representation, which allows reactive scheduling, and is thus well-suited
for implementation on multi-processor architectures. We then applied
our results for AND-activation and data-rate transitions, extended by a
more detailed model of possible activation timing.
141
Where do we go from here? The conservative nature of performance
analysis is often a cause of concern. It is correct that capturing of
possible event timing in standard event models may incur a loss of in-
formation, both at system inputs and during event model calculation.
In particular, construction of periodic with jitter or sporadic with jitter
event models for OR-concatenated tasks or in the presence of data-rate
transitions is conservative. One remedy is to use more detailed event
models where appropriate. However, it is not sufficient to specify such
models at system inputs. It must be ensured that these models can be
internally calculated and propagated as well, otherwise they are of little
use for analysis composition. More complex models require more com-
plex analyses which need to be developed first, a tradeoff that must be
carefully considered. A promising research direction is extensive consid-
eration of contexts beyond the first steps presented in chapter 8. For
example, inter-contexts can be used to obtain tighter activating event
models for AND- and OR-concatenated tasks. A second suggestion are
separate event model parameters for upper and lower event functions,
to more accurately model possible event timing in the presence of condi-
tional communication or a combination of periodic and sporadic events
(sections 5.2 and 4.2.4). In general, any event model and scheduling
analysis extension can be incorporated in SymTA/S, but it should not
compromise composability.
Further extensions include the ability to model and analyze additional
architecture components with sufficient detail. Important research di-
rections in this area include the analysis of shared memory and cache-
related preemption delay, both of which are currently not considered in
SymTA/S. On the application side, interesting issues include the con-
sideration of soft constraints, and analysis of transitions between modes
of operations, including system startup.
Independent of its current limitations, we believe that in the near
future the intractable number of performance dependencies in complex
systems will encourage designers to accept performance analysis to reap
the benefits of a reliable implementation. Furthermore, we believe that
the perceived over-conservative nature of performance analysis is a mis-
conception when viewed from a global perspective. One one hand, local
worst-cases do indeed occur at the same time, leading to the calculated
global worst case. The poor reliability of modern embedded systems is
a sad reminder of this fact. On the other hand, critics of performance
analysis often forget how conservatively some system parts are designed
today. For example, average load on an SoC bus is rarely more than
30%, in order to have enough headroom for transient load peaks. How-
ever, it is often not clear if that is really enough, or if, on the contrary,
142 Summary and Outlook
precious bandwidth is wasted. A systematic performance analysis ap-
proach would provide an upper limit on how bad those peaks really
are, thus allowing to reliably dimension bus bandwidth. Furthermore,
our compositional performance analysis approach allows to smooth out
these peaks, and calculates required buffer sizes and delays incurred in
the process (chapter 7).
The tremendous speed of performance analysis, and its applicability
early in the design-flow may prove to be its biggest advantages over per-
formance simulation. The ability to perform design-space exploration
early allows to optimize an implementation in ways that would not have
been considered with simulation for mere lack of time, thus leading to
superior solutions which more than compensate the conservative nature
of analysis. Therefore, coupling SymTA/S with an exploration frame-
work presents an exciting research direction. Closely related is sensitivity
analysis, to analyze the robustness of a solution to small performance
deviations from expected values, which are inevitable during implemen-
tation and integration.
List of Figures
1.1 Example of an embedded real-time system with a
variety of task dependencies 4
2.1 Coarse-grain view of a generic system-level design flow 7
2.2 Y-model for hardware/software co-design 8
2.3 A simple SDF graph and a valid sequential firing
sequence that returns the graph into its initial state 11
2.4 The Philips Nexperia Home platform for multime-
dia applications 16
2.5 Typical hardware/software architecture on an em-
bedded processor 17
2.6 Performance dependencies between functionally in-
dependent subsystems introduced by resource shar-
ing 21
2.7 SPI representation of an event source which pro-
duces an output event on average every 4. Each
event can be delayed from its perfect position by
up to 1 time unit 24
3.1 Upper and lower event functions for a periodic with
jitter event model with (P = 4, J = 1) 29
3.2 Example of an event stream that satisfies a periodic
with jitter event model with (P = 4, J = 1) 29
3.3 Example to illustrate output event model calculation 36
3.4 Scenario leading to the worst-case response times
for both tasks from Fig. 3.3 37
3.5 Scenario leading to the best-case response times
for both tasks from Fig. 3.3 38
144 List of Figures
3.6 Scenario leading to the minimum and maximum
distance between output events for task T1 from Fig. 3.3 38
3.7 Example to illustrate compositional performance
analysis 39
3.8 Communication buffering in our compositional per-
formance analysis approach 42
3.9 Arrival curves used to describe packet flows 43
3.10 Network calculus with two priority-scheduled tasks
(task 1 has a higher priority than task 2) 44
4.1 Example of an AND-activated task C with three inputs 52
4.2 AND-activation timing example for a valid sequence
of input events 54
4.3 Upper and lower input event functions at the first
and second input in our AND-example 58
4.4 Upper and lower input event functions at the third
input, as well as upper and lower activating event
functions in our AND-example (AND activation ∧=
AND input 3) 58
4.5 Maximum delay and backlog incurred at the first
and second input of our example AND-activated task 59
4.6 Maximum delay and backlog incurred at the third
input of our example AND-activated task 59
4.7 Example of an OR-activated task with two inputs 60
4.8 Upper and lower input event functions in our OR-example 61
4.9 Upper and lower activating event functions in our
OR-example 62
5.1 SDF model of a chain of filters which convert dig-
ital audio from CD to DAT format. The system
exhibits data rate transitions between tasks. 69
5.2 Data rate transition with smaller producer data
rate (rp = 2) and larger consumer data rate (rc = 3) 70
5.3 Upper and lower producer token functions for our
data rate transition example 73
5.4 Upper and lower consumer token functions for our
data rate transition example 73
5.5 Activating event functions for consumer task C in
our data rate transition example 74
5.6 Example of a data rate transition with data rate intervals 80
List of Figures 145
5.7 Upper and lower producer token functions for our
data rate transition with intervals example 81
5.8 Two possible sets of upper and lower consumer to-
ken functions for our data rate transition with in-
tervals example 81
5.9 A third possible set of upper and lower consumer
token functions for our data rate transition with
intervals example 82
5.10 Communication buffer (implemented as circular buffer) 87
5.11 Token buffering and activation mechanism for a
combination of rate transitions with AND-activation
and an optional EAF 87
5.12 Token buffering and activation mechanism for a
combination of rate transitions with OR-activation
and an optional EAF 88
6.1 Example of a task graph with a cycle. 91
6.2 Possible event sequence for our cycle example. Gray
boxes indicate jitter intervals during which an event
can occur. Note that line 2 displays the possible
timing of internal events depending on the previ-
ous activating event, while lines 1 and 3 display the
possible timing of events independent of previous events. 96
6.3 Cycle example with 2 initial tokens at the cycle-
internal input of AND-concatenated task b. Unit
data rates and FIFOs have been omitted. 98
6.4 Possible event sequence for the cycle example in
Fig. 6.3. Gray boxes indicate jitter intervals during
which an event can occur. Note that line 2 displays
the possible timing of internal events depending on
the previous activating event, while lines 1 and 3
display the possible timing of events independent
of previous events. 99
6.5 Possible event timing at system startup that vio-
lates the calculated activating event model. Note
the absolute times on the t-axis. 101
6.6 Example of a task graph with nested cycles. Unit
data rates and FIFOs have been omitted. Edges
that need to be cut for corresponding feed-forward
analysis are indicated. 102
6.7 Cycle with two external inputs. 103
146 List of Figures
6.8 Possible event timing and events for our multi-
input example (Fig. 6.7) assuming a phase of +8
for cycle-external events at task b relative to task
a, and an initial token at the cycle-internal input of a. 104
6.9 Possible event timing and events for our multi-
input example (Fig. 6.7) assuming a phase of −8
for cycle-external events at task b relative to task
a, and an initial token at the cycle-internal input of b. 104
6.10 Activating jitter of tasks a and b as a function of
the phase between external tokens arriving at the
two tasks. ϕa is zero, ϕb is varied in the interval
[−10, 10]. 105
7.1 Example of an embedded real-time system with a
variety of task dependencies 107
7.2 Single-rate version of our example system modeled
in SymTA/S 109
8.1 Algorithm for worst-case response time calculation
considering intra-contexts with minimum- and maximum-
conditions for the occurrence of a certain type of
event. 117
8.2 Example system for context-aware scheduling analysis 120
8.3 Worst-case response time scenarios for task c ob-
tained in the context-blind case (a); analysis im-
provement due to intra (b), inter (c), and the com-
bination of intra and inter event stream contexts
(d) 121
9.1 Simulink example 124
9.2 Tick scheduling of our Simulink example 125
9.3 RMS scheduling of our Simulink example 126
9.4 SPI representation of the Simulink example from
Fig. 9.1(a). Dashed elements are virtual 130
9.5 Valid execution sequence for processes P2 and P3 130
9.6 Reactive scheduling of our Simulink example 132
9.7 Example used to explain our analysis approach 133
9.8 Possible event timing as calculated during schedu-
ling analysis of our multi-rate Simulink example 135
9.9 SPI representation of the Simulink example from
Fig. 9.1(a) with decoupled cyclic task-dependencies. 137
List of Tables
3.1 Input and calculated parameter values for both
tasks from Fig. 3.3 37
3.2 Complex application properties supported by a va-
riety of performance analysis approaches 45
7.1 Core execution and communication times 108
7.2 Event models at external system inputs. 108
7.3 Path latency constraints 109
7.4 Output jitter constraint (the output period auto-
matically follows from the data rate transition) 109
7.5 Scheduling analysis input and output on resource uC 110
7.6 Bus and DSP tasks ordered by priority (highest
on left), and calculated actual values for all four
constrained values from Tabs. 7.3 and 7.4 111
7.7 Experiment 3 from Tab. 7.6 repeated with the clock-
rate of uC increased to 120% of its original clock
rate. 112
7.8 Experiment 4 from Tab. 7.6 repeated with the clock-
rate of uC reduced to 80% of its original clock rate. 112
8.1 Task properties used in our context-aware schedu-
ling analysis example. Task b has three modes with
different core execution times that are activated by
events with different colors. 119
8.2 Intra event stream context: Min- and max-conditions
for event colors in a sequence of 4 events generated
by task sb 120
148 List of Tables
8.3 Calculated response times: Context-blind, only in-
tra contexts, only inter context, and combination
of both contexts 121
Bibliography
[1] Y. Abdeddaim and O. Maler. Job-shop scheduling using timed automata. In
Computer Aided Verification, pages 478–492, 2001.
[2] R. Alur and D. Dill. A theory of timed automata. Theoretical Computer
Science, 126(2):183–235, 1994.
[3] ARM. AMBA Bus. http://www.arm.com/products/solutions/
AMBAOverview.html.
[4] N. C. Audsley, A. Burns, M. F. Richardson, and A. J. Wellings. Hard Real-
Time Scheduling: The Deadline Monotonic Approach. In Proceedings 8th IEEE
Workshop on Real-Time Operating Systems and Software, Atalanta, 1991.
[5] Axys Inc.MaxSim. http://www.axysdesign.com/products/products maxsim.asp.
[6] F. Balarin et al. Hardware-Software Co-Design of Embedded Systems. The
POLIS Approach. Kluwer Academic Publishers, 1997.
[7] S. K. Baruah, D. Chen, and A. K. Mok. Static-priority scheduling of multiframe
tasks. In Euromicro Conference on Real-Time Systems, June 1999.
[8] A. Benveniste and G. Berry. The synchronous approach to reactive and real-
time systems. Proc. of the IEEE, 79(9):1270–1282, 1991.
[9] G. Berry and G. Gonthier. The esterel synchronous programming language:
Design, semantics, implementation. Science of Computer Programming, 2002.
[10] P. Bjureus and A. Jantsch. Heterogeneous system-level cosimulation with SDL
and Matlab. In Proc. 2nd Forum on Design Languages, FDL, September 1999.
[11] Bosch. Electronic Control Units. http://www.kraftfahrzeugtechnik-
heute.de:8090/k/en/start/product.jsp?mfacKey=ae 01 stg.
[12] J. T. Buck. Scheduling dynamic dataflow graphs with bounded memory using
the Token Flow Model. Technical Report UCB/ERL 93/69, Ph.D dissertation,
Dept. of EECS, UC Berkeley, Berkeley, CA 94720, U.S.A., 1993.
150 Bibliography
[13] G. Buttazzo. Real-Time Computing Systems - Predictable Scheduling Algo-
rithms and Applications. Kluwer Academic Publishers, 2002.
[14] G. Buttazzo. Rate Monotonic vs. EDF: Judgment day. In Proc. 3rd Workshop
on Embedded Software (EMSOFT), Philadelphia (PA), USA, October 2003.
[15] Beatriz Asensio Calvo. Real-time analysis of time-driven scheduling – a quan-
titative comparison of Round Robin and TDMA. Technical report, Institute
of Computer and Communication Networks Engineering, Technical University
of Braunschweig, September 2001.
[16] S. Chakraborty, S. Ku¨nzli, and L. Thiele. A general framework for analysing
system properties in platform-based embedded system designs. In Proc.
DATE’03, Munich, Germany, March 2003.
[17] D. Ching, P. Schaumont, and I. Verbauwhede. Integrated modeling and gener-
ation of a reconfigurable network-on-chip. In 2004 Reconfigurable Architectures
Workshop (RAW 2004), April 2004.
[18] M. Chiodo, P. Giusto, H. Hsieh, A. Jurecska, L. Lavagno, and A. Sangiovanni-
Vincentelli. A formal methodology for hardware /software codesign of embed-
ded systems. IEEE Micro, August 1994.
[19] P. Chou, K. Hines, K. Partidge, and G. Borriello. Control generation for
embedded systems based on composition of modal processes. In Proceedings
of the 1998 IEEE/ACM international conference on Computer-aided design,
1998.
[20] P. Chou, R. B. Ortega, and G Borriello. The Chinook hardware/software
co-synthesis system. In Proc. of the 8th International Symposium on System
Synthesis, Cannes, France, September 1995.
[21] E. M. Clarke, O. Grumberg, and D. A. Peled. Model Checking. MIT Press,
2000.
[22] E. Closse, M. Poize, J. Pulou, J. Sifakis, P. Venter, D. Weil, and S. Yovine.
TAXYS: A tool for the development and verification of real-time embedded
systems. Lecture Notes in Computer Science, 2102, 2001.
[23] R. L. Cruz. A calculus for network delay. IEEE Transactions on Information
Theory, 37(1):114–141, January 1991.
[24] A. Dasdan, D. Ramanathan, and R. K. Gupta. Rate derivation and its ap-
plications to reactive, real-time embedded systems. In Proceedings of the 35th
design automation conference, 1998.
[25] A. Dasdan, D. Ramanathan, and R. K. Gupta. A timing-driven design and val-
idation methodology for embedded real-time systems. ACM Trans. on Design
Automation of Electronic Systems, 3(4), October 1998.
[26] dSPACE GmbH. TargetLink 2.0. http://www.dspace.de/ww/en/pub/products/
prodover/targetli.htm.
Bibliography 151
[27] M. Engels, G. Bilsen, R. Lauwereins, and J. Peperstraete. Cyclo-Static Data
Flow: Model and implementation. In Proc. 28th Asilomar Conf. on Signals,
Systems, and Computers, pages 503–507, Pacific Grove, CA, 1994.
[28] R. Ernst. Codesign of embedded systems: Status and trends. IEEE Design &
Test of Computers, April 1998.
[29] ETAS. ERCOSEK Automotive Real-Time Operating Sys-
tem. http://www.etas.info/html/products/ec/ercosek/
en products ec ercosek index.php.
[30] A. Fehnker. Scheduling a Steel Plant with Timed Automata. In Sixth In-
ternational Conference on Real-Time Computing Systems and Applications
(RTCSA’99), 1999.
[31] E. Fersman. A Generic Approach to Schedulability Analysis of Real-Time
Systems. PhD thesis, Uppsala Faculty of Science and Technology, 2003.
[32] C. J. Fidge. Real-time schedulability tests for preemptive multitasking. Real-
Time Systems, 14:61–93, 1998.
[33] FlexRay Consortium. FlexRay. http://www.flexray.com.
[34] D. D. Gajski, J. Zhu, R. Doemer, A. Gerstlauer, and S. Zhao. SpecC: Spec-
ification Language and Methodology. Kluwer Accademic Publishers, March
2000.
[35] K. Gresser. Echtzeitnachweis ereignisgesteuerter Realzeitsysteme. PhD thesis,
Technische Universita¨t Mu¨nchen, 1993.
[36] K. Gresser. An event model for deadline verification of hard real-time systems.
In Proceedings 5th Euromicro Workshop on Real-Time Systems, pages 118–123,
Oulu, Finland, 1993.
[37] N. Halbwachs. Synchronous Programming of Reactive Systems. Kluwer Aca-
demic Publishers, 1993.
[38] David Harel. Statecharts: A visual formalism for complex systems. Science of
Computer Programming, 8(3):231–274, June 1987.
[39] S. Heithecker, A. Lucas, and R. Ernst. A mixed qos sdram controller for fpga-
based high-end image processing. In IEEE Workshop on Signal Processing
Systems (SIPS’03), Seoul, Korea, August 2003.
[40] R. Henia. Analyse von kontextabha¨ngigem systemverhalten komplexer einge-
betteter systeme. Master’s thesis, Institute of Computer and Communication
Networks Engineering, Technical University of Braunschweig, 2003.
[41] M. Y. Houri. Task graph analysis with complex dependencies. Master’s thesis,
Institute of Computer and Communication Networks Engineering, Technical
University of Braunschweig, 2004.
[42] I-Logix. Statemate MAGNUM. http://www.ilogix.com/fs prod.htm.
[43] IEEE. VME bus. http://www.interfacebus.com/Design Connector VME.html.
152 Bibliography
[44] Institute of Computer and Communication Network Engineering, TU
Braunschweig. SPI - System Property Intervals. http://www.ida.ing.tu-
bs.de/research/projects/spi/home.e.shtml.
[45] ITU. ITU-T. Recommendation Z.100. Specification and Description Language,
1993.
[46] Kevin Jeffay and Steve Goddard. A theory of rate-based execution. In IEEE
Real-Time Systems Symposium, 1999.
[47] M. Jersak, Y. Cai, and A. Lucas. Translating simulink to SPI. Technical report,
Institute of Computer and Communication Networks Engineering, Technical
University of Braunschweig, 2001.
[48] M. Jersak, Y. Cai, D. Ziegenbein, and R. Ernst. A transformational approach
to constraint relaxation of a time-driven simulation model. In Proceedings
13th International Symposium on System Synthesis, Madrid, Spain, September
2000.
[49] M. Jersak and R. Ernst. Enabling scheduling analysis of heterogeneous systems
with multi-rate data dependencies and rate intervals. In Proceeding 40th Design
Automation Conference, Annaheim, USA, June 2003.
[50] M. Jersak, R. Henia, and R. Ernst. Context-aware performance analysis for
efficient embedded system design. In Proceeding Design Automation and Test
in Europe, Paris, France, March 2004.
[51] M. Jersak, K. Richter, and R. Ernst. Performance analysis for complex embed-
ded applications. International Journal of Embedded Systems, Special Issue on
Codesign for SoC, 2004.
[52] M. Jersak, K. Richter, R. Ernst, J.-C. Braam, Z.-Y. Jiang, and F. Wolf. Formal
methods for integration of automotive software. In Proc. of Design, Automa-
tion and Test in Europe Conference (DATE’03), Munich, Germany, March
2003.
[53] M. Jersak, K. Richter, R. Henia, R. Ernst, and F. Slomka. Transformation
of SDL specifications for system-level timing analysis. In Tenth International
Symposium on Hardware/Software Codesign (CODES’02), Estes Park, Col-
orado, USA, May 2002.
[54] M. Joseph and P. Pandya. Finding response times in a real-time system. The
Computer Journal, 29(5):390–395, 1986.
[55] G. Kahn. The semantics of a simple language for parallel programming. In
Proceedings of the IFIP Congress 74, pages 471 – 475. North-Holland, 1974.
[56] K. Keutzer, S. Malik, A. R. Newton, J. M. Rabaey, and A. Sangiovanni-
Vincentelli. System-level design: Orthogonalization of concerns and platform-
based design. IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems, 19(12):1523–1543, 2000.
Bibliography 153
[57] H. Kopetz and G. Gruensteidl. TTP - a time-triggered protocol for fault-
tolerant computin g. In Proceedings 23rd International Symposium on Fault-
Tolerant Co mputing, 1993.
[58] L.M. Kristensen, S. Christensen, and K. Jensen. The practitioner’s guide to
coloured petri nets. International Journal on Software Tools for Technology
Transfer, 2:98–132, 1998.
[59] C.-G. Lee, K. Lee, J. Hahn, Y.-M. Seo, S. L. Min, R. Ha, S. Hong, C. Y.
Park, M. Lee, and C. S. Kim. Bounding cache-related preemption delay for
real-time systems. IEEE Transactions on software engineering, 27(9):805–826,
November 2001.
[60] E. A. Lee et al. Overview of the ptolemy project. Technical report, University
of California at Berkeley, March 2001.
[61] E. A. Lee and D. G. Messerschmitt. Synchronous dataflow. Proceedings of the
IEEE, 75(9):1235–1245, 1987.
[62] E. A. Lee and D.G. Messerschmitt. Static scheduling of synchronous data
flow programs for digital signal processing. IEEE Transactions on Computers,
36(1), January 1987.
[63] E. A. Lee and Th. M. Parks. Dataflow process networks. Proceedings of the
IEEE, May 1995.
[64] E. A. Lee and A. Sangiovanni-Vincentelli. A framework for comparing models
of computation. IEEE Trans. CAD, December 1998.
[65] J. Lehoczky. Fixed priority scheduling of periodic task sets with arbitrary
deadlines. In Proceedings Real-Time Systems Symposiom, pages 201–209, 1990.
[66] J. Lemieux. Programming in the OSEK/VDX Environment. CMP Books,
2001.
[67] LIN Consortium. LIN - Local Interconnect Network. http://www.lin-
subbus.org/.
[68] C. L. Liu and J. W. Layland. Scheduling algorithm for multiprogramming in
a hard-real-time environment. Journal of the ACM, 20, 1973.
[69] LiveDevices Inc. Realogy Real-Time Architect Overview.
http://www.livedevices.com/realtime.shtml.
[70] K. Scott M. Fowler. UML Distilled: A Brief Guide to the Standard Object
Modeling Language. Addison-Wesley Pub Co, 2 edition, 1999.
[71] A. Mader. Experiment during Dagstuhl workshop on software intensive em-
bedded systems with special emphasis on automotive, November 2003.
[72] A. Mathur, A. Dasdan, and R. K. Gupta. Rate analysis for embedded systems.
ACM Trans. on Design Automation of Electronic Systems, 3(3), July 1998.
[73] The MathWorks. Matlab. http://www.mathworks.com.
154 Bibliography
[74] The MathWorks. Simulink. http://www.mathworks.com.
[75] The MathWorks. Stateflow. http://www.mathworks.com.
[76] The MathWorks, Inc. Real-Time Workshop, version 5.
http://www.mathworks.com/access/helpdesk/help/toolbox/rtw/rtw.shtml.
[77] A. Maxiaguine, S. Ku¨nzli, and L. Thiele. Workload characterization model for
tasks with variable execution demand. In Proc. Design, Automation and Test
in Europe (DATE 2004), Paris, France, 2004.
[78] Mentor Graphics. Seamless. http://www.mentor.com/seamless/.
[79] A.K. Mok and D. Chen. A multiframe model for real-time tasks. IEEE Trans-
actions on Software Engineering, 23(10):635–645, 1997.
[80] MOST Cooperation. MOST - Media Oriented Systems Transport.
http://www.mostcooperation.com/.
[81] H. S. Negi, T. Mitra, and A. Roychoudhury. Accurate estimation of cache-
related preemption delay. In CODES+ISSS’03, Newport Beach, California,
USA, October 2003.
[82] H. Oh and S. Ha. Fractional rate dataflow model and efficient code synthesis
for multimedia applications. In Languages, Compilers, and Tools for Embed-
ded Systems (LCTES’02) and Software and Compilers for Embedded Systems
(SCOPES’02), Berlin, Germany, June 2002.
[83] Open SystemC Initiative. SystemC. http://www.systemc.org/.
[84] J. C. Palencia, J. J. Gutierrez Garcia, and M. G. Harbour. Best-case analysis
for improving the worst-case schedulability test for distributed hard real-time
systems. In Proceedings of 10th Euromicro Workshop on Real-Time Systems,
June 1998.
[85] J. C. Palencia and M. G. Harbour. Schedulablilty analysis for tasks with
static and dynamic offsets. In Proc. 19th IEEE Real-Time Systems Symposium
(RTSS98), Madrid, Spain, 1998.
[86] J. C. Palencia and M. G. Harbour. Exploiting precedence relations in the
schedulability analysis of distributed real-time systems. In Proceedings of the
20th Real-Time Systems Symposium, December 1999.
[87] A. K. Parekh and R. G. Gallager. A generalized processor sharing approach to
flow control in integrated services networks: The single-node case. IEEE/ACM
Transactions on Networking, 1(3):344–257, June 1993.
[88] Philips. Nexperia. http://www.semiconductors.philips.com/ prod-
ucts/nexperia/index.html.
[89] J. Pino, S. Bhattacharyya, and E. Lee. A hierarchical multiprocessor schedu-
ling system for dsp applications. In Proc. of IEEE Asilomar Conference on
Signals, Systems, and Computers, November 1995.
Bibliography 155
[90] P. Pop, P. Eles, and Z. Peng. Holistic scheduling and analysis of mixed
time/event-triggered distributed embedded systems. In Tenth International
Symposium on Hardware/Software Codesign (CODES’02), Estes Park, Col-
orado, USA, May 2002.
[91] P. Pop, P. Eles, and Z. Peng. Schedulability analysis and optimization for
the synthesis of multi-cluster distributed embedded systems. In Proc. Design,
Automation and Test in Europe (DATE 2003), Munich, Germany, 2003.
[92] P. Pop, P. Eles, Z. Peng, V. Izosimov, M. Hellring, and O. Bridal. Design
optimization of multi-cluster embedded systems for real-time applications. In
Proc. Design, Automation and Test in Europe (DATE 2004), Paris, France,
2004.
[93] R. Bosch GmbH. CAN Specification Version 2.0, 1991.
[94] Rational Software. Purify. http://www-
306.ibm.com/software/awdtools/purifyplus/.
[95] K. Richter. Developing a general model for scheduling of mixed transforma-
tive/reactive systems. Master’s thesis, Technical University of Braunschweig,
January 1998.
[96] K. Richter. Compositional Scheduling Analysis Using Standard Event Models.
PhD thesis, Technical University of Braunschweig, 2004.
[97] K. Richter, M. Jersak, and R. Ernst. A formal approach to MpSoC performance
verification. IEEE Computer, 36(4), April 2003.
[98] K. Richter, R. Racu, and R. Ernst. Scheduling analysis integration for het-
erogeneous multiprocessor SoC. In Proceedings 24th International Real-Time
Systems Symposium (RTSS’03), Cancun, Mexico, December 2003.
[99] Semiconductor Industry Association. 2001 International Technology Roadmap
for Semiconductors. http://public.itrs.net/Files/2001ITRS/Home.htm.
[100] M. Sgroi, L. Lavagno, and A. Sangiovanni-Vincentelli. Formal models for em-
bedded system design. IEEE Design & Test of Computers, 17(2):14–27, 2000.
[101] L. Sha, R. Rajkumar, and S. S. Sathaye. Generalized rate-monotonic schedu-
ling theory: A framework for developing real-time systems. Proceedings of the
IEEE, 82(1):68–82, January 1994.
[102] Lui Sha, Ragunathan Rajkumar, and John P. Lehoczky. Priority inheritance
protocols: An approach to real-time synchronization. IEEE Transactions on
Computers, 39(9), September 1990.
[103] F. Slomka, M. Do¨rfel, and R. Mu¨nzenberger. Generating mixed hard-
ware/software systems from SDL specifications. In Proc. 9th International
Symposium on Hardware/Software Co-Design (CODES 2001), Copenhagen,
Denmark, April 2001.
[104] Sonics. SiliconBackplane III MicroNetwork IP. http://www.sonicsinc.com/
sonics/products/siliconbackplaneIII/.
156 Bibliography
[105] B. Sprunt, L. Sha, and J. Lehoczky. Aperiodic task scheduling for hart real-
time systems. In Journal of Real-Time Systems, 1989.
[106] Synopsys. System Studio. http://www.synopsys.com/products/cocentric studio/.
[107] Technical University of Braunschweig. SymTA/S – Symbolic Timing Analysis
for Systems. http://www.symta.org.
[108] Telelogic. Tau. http://www.telelogic.com/products/tau/.
[109] L. Thiele, S. Chakraborty, M. Gries, and S. Ku¨nzli. Design space exploration
of network processor architectures. In Mark Franklin, Patrick Crowley, Haldun
Hadimioglu, and Peter Onufryk, editors, Network Processor Design Issues and
Practices, Volume 1, chapter 4, pages 55–90. Morgan Kaufmann, October 2002.
[110] L. Thiele, S. Chakraborty, M. Gries, and S. Ku¨nzli. Design space exploration of
network processor architectures. In First Workshop on Network Processors at
the 8th International Symposium on High-Performance Computer Architecture
(HPCA8), Cambridge MA, USA, February 2002.
[111] TimeSys. TimeWiz. http://www.timesys.com/.
[112] K. Tindell and J. Clark. Holistic schedulability analysis for distributed hard
real-time systems. Microprocessing & Microprogramming, 50(2-3):117–134, apr
1994.
[113] K. W. Tindell. Adding time-offsets to schedulability analysis. Technical Report
YCS 221, Univ. of York, 1994.
[114] K. W. Tindell. An extendible approach for analysing fixed priority hard real-
time systems. Journal of Real-Time Systems, 6(2):133–152, Mar 1994.
[115] Ken Tindell, Herman Kopetz, Fabian Wolf, and Rolf Ernst. Safe automo-
tive software development. In Proc. Design, Automation and Test in Europe
(DATE’03), Munich, Germany, March 2003.
[116] Tri-Pacific Software, Inc. RAPID RMA. http://www.tripac.com/html/prod-
fact-rrm.html.
[117] University of California at Berkeley. Ptolemy.
http://ptolemy.eecs.berkeley.edu/ptolemyII.
[118] Uppsala Universitet and Aalborg University. UPPAAL.
http://www.uppaal.com/.
[119] WindRiver. VxWorks. http://www.windriver.com/announces/vxworks/.
[120] F. Wolf. Behavioral Intervals in Embedded Software. Kluwer Academic Pub-
lishers, 2002.
[121] F. Wolf, R. Ernst, and W. Ye. Path clustering in software timing analysis.
IEEE Transactions on VLSI Systems, 9(6), December 2001.
Bibliography 157
[122] T. Yen and W. Wolf. Performance estimation for real-time distributed embed-
ded systems. IEEE Transactions on Parallel and Distributed Systems, 9(11),
November 1998.
[123] D. Ziegenbein. A Compositional Approach to Embedded System Design. PhD
thesis, Technical University of Braunschweig, 2003.
[124] D. Ziegenbein, R. Ernst, K. Richter, J. Teich, and L. Thiele. Combin-
ing multiple models of computation for scheduling and allocation. In
Proceedings Sixth International Workshop on Hardware/Software Co-Design
(Codes/CASHE ’98), pages 9–13, Seattle, USA, March 1998.
[125] D. Ziegenbein, M. Jersak, K. Richter, and R. Ernst. Breaking down complexity
for reliable system-level timing validation. In Ninth IEEE/DATC Electronic
Design Processes Workshop (EDP’02), Monterey, USA, April 2002.
[126] D. Ziegenbein, K. Richter, R. Ernst, L. Thiele, and J. Teich. SPI – A system
model for heterogeneously specified embedded systems. IEEE Transactions on
Very Large Scale Integration (VLSI) Systems, 10(4), August 2002.
[127] D. Ziegenbein, J. Uerpmann, and R. Ernst. Dynamic Response Time Optimiza-
tion for SDF Graphs. In Proceedings International Conference on Computer-
Aided Design (ICCAD ’00), San Jose, November 2000.
