










The handle http://hdl.handle.net/1887/24481 holds various files of this Leiden University 
dissertation 
 
Author: Bamakhrama, Mohamed A. 
Title: On hard real-time scheduling of cyclo-static dataflow and its application in system-
level design 
Issue Date: 2014-03-12 
On Hard Real-Time Scheduling of Cyclo-Static Dataflow
and its Application in System-Level Design
Mohamed A. Bamakhrama

On Hard Real-Time Scheduling of Cyclo-Static Dataflow and
its Application in System-Level Design
DISSERTATION
to obtain
the degree of Doctor at Universiteit Leiden,
by authority of the Rector Magnificus prof.mr. C.J.J.M. Stolker,
according to the decision of the Doctorate Board (College voor Promoties),
to be defended on Wednesday 12 March 2014
at 15:00 hours
by




Promotor: Prof. dr. Ed F. Deprettere Universiteit Leiden
Co-Promotor: Dr. Todor P. Stefanov Universiteit Leiden
Other Members: Prof. dr. Petru Eles Linköpings Universitet
Prof. dr. Rolf Ernst Technische Universität Braunschweig
Prof. dr. Marco Bekooij Universiteit Twente
Prof. dr. Joost Kok Universiteit Leiden
Prof. dr. Farhad Arbab Universiteit Leiden
Prof. dr. Harry Wijshoff Universiteit Leiden
On Hard Real-Time Scheduling of Cyclo-Static Dataow
and its Application in System-Level Design
Mohamed A. Bamakhrama. -
Dissertation Universiteit Leiden. - With ref. - With summary in Dutch.
Copyright © 2014 by Mohamed A. Bamakhrama. All rights reserved.
This dissertation is licensed under the Creative Commons Attribution-ShareAlike 3.0
license. You can obtain a copy of this license from the following URL:
http://creativecommons.org/licenses/by-sa/3.0/
This dissertation was typeset using LaTeX and version controlled using Git.
ISBN 978-90-9028032-5
Printed in the Netherlands.
My prayer and my sacrifice and my life and my death are all for God,




Table of Contents
List of Figures
List of Tables
1 Introduction
1.1 Current Design Challenges and Trends
1.1.1 Design of Concurrent Software
1.1.2 Designer Productivity
1.1.3 Real-Time Guarantees
1.2 Problem Statement
1.3 Research Contributions
1.4 Related Work
1.4.1 Hard Real-Time Scheduling of Streaming Programs
1.4.2 Design Flows for Hard Real-Time Streaming Systems
1.5 Organization of this Dissertation
2 Background
2.1 Notations
2.2 Parallel Execution of Programs
2.3 Cyclo-Static Dataflow (CSDF)
2.4 Real-Time Scheduling
2.4.1 Task Model
2.4.2 Scheduling Concepts
2.4.3 Uniprocessor Schedulability Analysis
2.4.4 Multiprocessor Schedulability Analysis
3 Automated Parallelization and Model Construction
3.1 Input Programs
3.1.1 Top-Level Part
3.1.2 Implementation Part
3.2 Automated Parallelization
3.3 Model Construction
4 Scheduling Framework
4.1 Input Streams
4.2 Basic Definitions
4.3 Deriving Periods
4.4 Deriving Deadlines and Start Times
4.5 Deriving Buffer Sizes
4.6 Throughput Analysis
4.7 Latency Analysis
4.8 Deriving Architecture and Mapping Specifications
5 System-Level Synthesis
5.1 Hardware
5.2 Software
5.2.1 Scheduling Infrastructure
5.2.2 Communication Infrastructure
6 Evaluation and Results
6.1 Experiment I: Evaluating Automated Parallelization and Model Construction
6.2 Experiment II: Evaluating Performance and Resource Usage Metrics under Periodic Scheduling
6.2.1 Benchmarks
6.2.2 Throughput Evaluation
6.2.3 Latency Evaluation
6.2.4 Processor Requirements Evaluation
6.2.5 Memory Requirements Evaluation
6.2.6 Summary of Experiment II
6.3 Experiment III: Validating Synthesized Systems
7 Summary and Future Work









List of Figures
1.1 The challenges involved in designing modern hard real-time multiprocessor streaming systems
1.2 Decidability and expressiveness for popular dataflow MoCs
1.3 System-level design of modern embedded systems
1.4 Popular real-time task models and the complexity of their feasibility tests
1.5 Bridging dataflow MoCs and real-time task models through the proposed scheduling framework
1.6 Input and outputs of the proposed scheduling framework
1.7 Overview of the proposed design flow
2.1 Example of a CSDF graph that corresponds to the SANLP program in Listing 1
3.1 Automated parallelization and model construction
3.2 The parallel program corresponding to the SANLP shown in Listing 1
3.3 The port domains of 𝒫snk shown in Figure 3.2
3.4 The domains of v1 and v2
4.1 Scheduling framework
4.2 Occurrence of tMIT(Ii, j)
4.3 Computing tbuffer(Ii, j)
4.4 Schedule S1
4.5 Schedule S2
4.6 Schedule SL
4.7 Schedule S∞
4.8 The periodic schedule for the CSDF graph shown in Figure 2.1 constructed using Theorem 4.3.1
4.9 Timeline of Ai and Aj when t′ ≥ Si
4.10 Timeline of Ai and Aj when t′ < Si
4.11 Execution time-lines of Ai and Aj when Si ≤ Sj
4.12 Execution time-lines of Ai and Aj when Si > Sj
4.13 Decision tree for scheduling CSDF actors as real-time periodic tasks
4.14 The CSDF graph corresponding to the SANLP program shown in Listing 2
4.15 Mapping of G1 and G2 onto 6 processors assuming EDF and FFD
5.1 Electronic System-Level Synthesis
5.2 Top-level block diagram of the hardware platform considered in this dissertation
5.3 Tile organization
5.4 Complete MPSoC architecture
5.5 Crossbar Topology
5.6 Detailed description of function vTaskDelayUntil
5.7 FIFO layout in memory and the read/write registers
6.1 Results of the latency evaluation
6.2 Minimum number of processors required by optimal and partitioned schedulers
6.3 Zynq-7000 SoC architecture
List of Tables
2.1 Summary of mathematical notations
2.2 Approximation ratios for known bin packing heuristics
3.1 Deriving production/consumption rates sequences from the shortest repetitive pattern
4.1 Computing D⃗ and S⃗ for the CSDF graph shown in Figure 2.1 on page 29 under different values of η⃗
4.2 Computing the buffer sizes for the CSDF graph shown in Figure 2.1 on page 29 under different values of η⃗
4.3 Values of Kui and Kuj defined in (4.49) and (4.50) for the CSDF graph shown in Figure 2.1
4.4 The output paths latencies and graph maximum latency of the CSDF graph shown in Figure 2.1 on page 29 under different values of η⃗
4.5 The taskset parameters for G1 and G2 assuming µG = 1 and η⃗ = 1⃗ for both graphs
6.1 Specifications of the machine on which the experiments were performed
6.2 Time needed to parallelize and derive the CSDF model for the benchmark programs
6.3 Benchmarks used for evaluating the periodic scheduling framework proposed in Chapter 4
6.4 Results of Throughput Comparison
6.5 The total amount of memory needed to realize the buffers in the communication channels under periodic and self-timed schedules
6.6 Programs used in Experiment III
6.7 Hardware platforms used in Experiment III




Chapter 1. Introduction
The computer was born to solve
problems that did not exist before.
Bill Gates
The current period of human history is known as the information age. This name stems
from the fact that humanity is shifting from the traditional industry-based
society that characterized the industrial revolution period between 1700 and 1900 CE into
a knowledge-based society. The knowledge that used to be concentrated in
libraries and accessible only to a small fraction of society has now become available
(mostly for free) to anyone with a computer and Internet access. This dissemination of
knowledge has been enabled mainly by the advent of electronic computers and the
advances that they brought, such as the Internet. Electronics in general,
and computers in particular, have changed many aspects of our behavior and way of
thinking. Today, computers form the “backbone” of modern society. They are
everywhere, from mobile phones and MP3 players all the way to satellites,
airplanes, and nuclear reactors. Every day, they process vast amounts of data to ensure that
our societies continue to function properly. Computer systems can be classified based
on their functionality into two categories:
1. General-purpose systems such as Personal Computers (PC). Such systems are
flexible and can be controlled directly by the user to perform a variety of tasks
such as web browsing, word processing, gaming, etc.
2. Embedded systems such as the ones that you can find inside digital TVs, cars,
trains, etc. Such systems have specific functionality and they are mostly invisible
to the user. Such computer systems are said to be embedded within larger
systems.
Most people are familiar with the first category since they use such systems in their daily
life. However, almost 90% of all computer systems shipped worldwide in 2010 were
actually embedded systems [MRP+11]. Embedded systems have become pervasive in
our life. They are everywhere, and even though we mostly do not see them, we
still feel their presence through their actions. A very important property of embedded
systems is that their correct functionality does not depend only on producing the correct
result but also on producing the correct result at the right time. Such systems, where
time is critical to the correct functionality, are called real-time systems. Real-time
systems can be either hard or soft. A hard real-time system is one where the failure
to meet the timing requirements leads to a system failure. In contrast, a soft real-time
system is one where the failure to meet the timing requirements does not lead to a
failure but to degraded system performance that can be tolerated. Deciding whether
a system is hard or soft usually depends on the overall system requirements and the
environment where the system is deployed.
Real-time systems can be further classified, based on the type of programs they run,
into:
1. Control programs that wait for external events from the physical world, and
then react to these events. Examples include factory automation programs
running on Programmable Logic Controllers (PLC) and railway switching
systems.
2. Streaming programs that are characterized by processing continuous streams of
data which arrive at the system. Usually, such programs process large amounts
of data within short periods of time. Examples of such programs include those
used in video and audio processing, digital signal processing, and network
protocol processing.
In many cases, a single embedded system can contain both control and streaming
programs.
A hard real-time streaming system is a hard real-time system that runs a set of
streaming programs. This implies that all the streaming programs running on the
system have hard timing requirements (i.e., timing requirements that must always be
met). Examples of such systems include: (1) collision avoidance and path planning
sub-systems in Unmanned Aerial Vehicles (UAV) and self-driving cars, and (2)
Software-Defined Radio (SDR) used in wireless communication. For instance, a typical value of
the required reaction time to new sensor data in self-driving cars is around 300 ms [Thr10].
At the same time, some self-driving cars, such as Google’s self-driving car, have been
reported to gather around 750 MB of data per second [Woo]. Such a tremendous amount
of gathered data implies the need for parallel processing in order to meet the required
reaction times. In this dissertation, we tackle the problem of designing such streaming
systems in an automated and systematic way such that all the programs “provably” meet
their timing requirements. In the next section, we explain the current challenges in
designing such systems and their implications.
1.1 Current Design Challenges and Trends
As general-purpose computers have moved from single-core processors to multicore
processors [HNO97], the same has happened in the embedded systems domain. Today,
embedded systems designers integrate multiple processors, hardware peripherals and
memories into a single chip. Such chips are referred to as Multiprocessor
System-on-Chip (MPSoC) [JTW05]. MPSoCs are good candidates for running many
computationally-intensive streaming programs in hard real-time systems. For example,
Thrun reported in [Thr10] that Stanford’s Stanley self-driving car, which was used
in the DARPA 2005 challenge, employed two Intel quad-core processors in order to run
the autonomous driving software, including path planning and collision avoidance
algorithms. Such algorithms are also highly parallelizable [KKLR13], which means that
they benefit from execution on multiprocessor systems. However, such algorithms are
specified, most of the time, as sequential programs. This means that such programs
must be parallelized in order to meet their timing constraints and utilize the underlying
processors. Another challenge is that designing MPSoCs is becoming increasingly
complex as the number of processors integrated onto a single chip keeps increasing.
According to the International Technology Roadmap for Semiconductors (ITRS) report
for 2011 [Int11]:
“In the near term, the grand challenges for design technology remain (1) power
management, and (2) design productivity and design for manufacturability.
In the long term, the grand challenges for design technology have been
updated as (1) design of concurrent software, and (2) design for reliability and
resilience.”
Therefore, we can represent the problem of designing hard real-time multiprocessor
systems that run multiple streaming programs as an intersection of three different
problems as shown in Figure 1.1. The gray area represents the area to which the
aforementioned design problem belongs. In this gray area, the designer must produce a hard
real-time multiprocessor system that runs several streaming programs in parallel with
the ability to provide temporal isolation between the programs. Temporal isolation is
the ability to start/stop programs, at run-time, without violating the timing
requirements of other already running programs. In the following three sections, we describe
each of the three challenges shown in Figure 1.1 in detail and the current approaches to
addressing these challenges.








Figure 1.1: The challenges involved in designing modern hard real-time multiprocessor streaming
systems.
1.1.1 Design of Concurrent Software
Software has become a very important component in modern embedded systems. The
amount and complexity of software that is running on such systems has increased
dramatically over the last decades. For example, the size of embedded software in
automotive systems has increased by two orders of magnitude between 1990 and 2010
[EJ09]. According to [EJ09], software is becoming a key differentiator in many domains
such as automotive. A complicating factor in designing modern embedded systems
is that embedded software must be written as parallel software in order to utilize the
underlying processors in an MPSoC. Given a sequential program, researchers have
identified three possible types of parallelism:
1. Instruction-Level Parallelism (ILP): this represents the lowest level of paral-
lelism that is visible to the programmer. Under ILP, multiple instructions within
the program may be executed in parallel.
2. Data-Level Parallelism (DLP): under DLP, a function (or block) within the
program is executed simultaneously on several processors and each copy of the
function processes its own stream of data.
3. Task-Level Parallelism (TLP): under TLP, the program is split into a set of
functions (or tasks) and these functions execute in parallel.
It is important to note that a program might contain more than one type of
parallelism. For example, a program might exhibit task-level parallelism and data-level
parallelism. In this case, the program is split into multiple tasks that execute in parallel
and a certain task has several concurrently executing replicas with each replica
processing a separate stream of data. In general, identifying parallelism in a sequential
program is a tedious task. This tedious task is exacerbated further by the fact that
designers need to provide timing guarantees for programs running on hard real-time
multiprocessor systems. Traditionally, embedded software has been designed at the
level of Board Support Package (BSP) and high-level Application Programming
Interface (API) [HHBT09]. An example of a popular API for designing concurrent software
is the POSIX threads (Pthreads) standard [SG13]. However, designing concurrent
software at this level is known to be a cumbersome and error-prone task and it is not
easy to provide timing guarantees [Lee06]. Therefore, it has been recognized that
designers need to abstract from the actual programs by building high-level models of
them [EJ09, Tei12]. Then, these models are used to analyze the program performance
under different scheduling and mapping decisions. Such a design approach is often called
Model-Based Design (MBD) or Model-Driven Design (MDD), and the models used
in such approaches are called Models of Computation (MoC) [HHBT09].
In the broadest sense, a MoC defines the set of permitted operations used in
computation. MoCs can be classified as either sequential or parallel. In this dissertation, we
are interested in parallel MoCs as they are suitable for expressing programs that are
mapped onto MPSoCs. In particular, dataflow MoCs have been identified as suitable
parallel MoCs for expressing streaming programs [TA10]. In general, dataflow MoCs
abstract the program in the form of a directed graph, where graph nodes represent the
tasks of the program and graph edges represent the data dependencies among the tasks.
Thus, parallelism is explicitly specified in the model. According to [TA10], almost all
streaming programs can be modeled as Synchronous Dataflow (SDF, [LM87]) graphs.
Several parallel dataflow MoCs have been proposed in the literature that vary in their
expressiveness and decidability [LN05, JS05]. The expressiveness of a MoC is usually
measured by its Turing completeness. On the other hand, decidability refers to the
ability to perform scheduling decisions at compile-time [HO10]. If the execution order of
tasks can be determined at compile-time, then the designer can decide before running
the program if the program has the possibility of buffer overflow or deadlock. Decidable
dataflow MoCs achieve their decidability by placing restrictions on the semantics of the
MoC [HO10]. Generally speaking, expressiveness and decidability are inversely related
as shown in Figure 1.2, which shows the expressiveness and decidability for popular
dataflow MoCs.
Another development in designing parallel software is the advent of automated
parallelization tools. Streaming programs, in general, are designed at an algorithmic
level using a high-level sequential language such as C or MATLAB. Once the correctness
of these sequential specifications is verified, they are passed to subsequent design stages.
It has been noted that most of the execution time of these sequential specifications is spent
in nested loops [Fra10]. Thus, researchers have investigated several techniques for
parallelizing such programs. One particular class of nested loop programs that has received
a lot of attention is static control and affine indices programs, also called Static Affine
Nested Loop Programs (SANLP) [Fea91]. This class has been shown to embody a large















[Figure 1.2 axis: expressiveness from low to high, with decidability from high to low. Regions: Turing incomplete (decidable) and Turing complete (undecidable).]
Figure 1.2: Decidability and expressiveness for popular dataflow MoCs. The arrows between the
MoCs indicate that the MoC on the left-side is a subset of the one on the right-side. For example,
SDF is a subset of CSDF. The dotted vertical line represents the borderline between decidable and
undecidable models.
















portion of streaming programs [Bas04]. An example of a valid SANLP is shown in
Listing 1. It has been shown in [Fea91] that a SANLP can be automatically analyzed
to construct a parallel version of it. Hence, it is important to utilize this property to
relieve the designer of the burden of parallelizing such programs manually. Given a
sequential program, automated parallelization tools analyze the program and construct
a parallel version of it. This parallel version of the program exposes the parallelism
present in the original sequential program. Several parallelizing compilers have been
proposed for SANLPs, such as the PNgen compiler [VNS07].
1.1.2 Designer Productivity
Improving designer productivity has been an active area of research since the advent
of electronic systems. This area of research is called Electronic Design Automation






Figure 1.3: System-level design of modern embedded systems
(EDA). A common trait of all electronic systems is that they are made of
transistors. Modern electronics are developed by putting an enormous number of transistors
into an Integrated Circuit (IC) (commonly known as a chip) that is manufactured on a
single plane of silicon. Gordon Moore predicted in 1965 [Moo65] that the number
of transistors that can be put into an IC doubles approximately every year. In 1975, he
revised the doubling period to approximately 24 months. The rationale behind this prediction
is that the physical processes used to manufacture ICs keep improving. This in turn
leads to the ability to shrink the transistors and, hence, to put more transistors into
the same area. Moore’s prediction has so far survived the test of time and turned
into one of the most important rules of the semiconductor industry, known as Moore’s
law. Therefore, as the number of transistors in a chip keeps increasing, it is necessary
to design the chip at the right level of abstraction. In the 1960s and 1970s, electronic
systems were designed at the gate level; designers represented their system as a set of
Boolean equations and then they translated these equations into circuits composed
of gates. With the number of transistors doubling every 24 months, the designers
eventually had to deal with a very large number of gates. This forced the shift from
gate-level design to Register-Transfer Level (RTL) in the 1980s. At RTL, the designer no
longer deals with gates, but rather with components such as registers, adders, multipliers,
etc. Again, as the number of transistors kept increasing, the designers eventually had to
deal with a very large number of such components. Therefore, it became necessary to
raise the level of abstraction again from RTL to system-level [KM+00]. At system-level,
the designer deals with processors, memories, interconnects, and peripherals as the
primitive blocks that form the system. System-level design has been identified as a
promising design approach for MPSoCs [Hen03]. System-level design abstracts the
SoC design by considering it as a process of mapping a set of tasks onto a set of
processing elements as shown in Figure 1.3. Once such a mapping is determined, for example
using design space exploration, Electronic System-Level (ESL) synthesis [GHP+09]
tools provide (mostly) automated procedures to generate the hardware descriptions
(i.e., RTL) and the parallel software running on the processors. In addition,
Platform-Based Design (PBD) has emerged as a de facto solution to address the design re-use
problem [KM+00]. System-level design and platform-based design are closely related.
Under PBD, a system is divided into three layers:
1. Hardware platform: consists of a set of processors, Input/Output (I/O)
peripherals, and accelerators. The hardware components have a largely fixed
functionality, with some degree of parameterization.
2. Software platform: consists of the RTOS, device drivers, and Basic I/O System
(BIOS) routines. The software platform offers a set of APIs to the user programs
running on the system. The software platform is sometimes called
Hardware-dependent Software (HdS).
3. Programs software: consists of the users’ programs running on the system. These
programs communicate with the underlying software platform through the
API, which abstracts the underlying hardware platform.
1.1.3 Real-Time Guarantees
As mentioned earlier, several modern streaming programs have very high
computational demands together with hard timing requirements. Such programs have two
primary performance metrics, which are throughput and latency. Throughput
measures how many samples (or data-units) a program can produce during a given time
interval. Latency measures the time elapsed between receiving a certain input sample
and the program producing the corresponding processed sample. Therefore, it is important to
provide guaranteed throughput and latency for each program running on the designed
system. Providing such guarantees depends on the system hardware and software. The
hardware must be predictable, which means that any hardware operation must have a
bounded worst-case duration. The same applies to system software such as the operating
system and device drivers. In addition, the operating system scheduler must be capable
of enforcing temporal isolation among the programs running on the system. Temporal
isolation, as mentioned earlier, is the ability to start/stop programs, at run-time, without
violating the timing requirements of other already running programs. Commercial
Off-The-Shelf (COTS) multicore hardware systems have been identified as inadequate
for hard real-time embedded systems [WGR+09]. Recently, several attempts have been
made to propose predictable multicore hardware architectures that can be used in hard
real-time embedded systems (e.g., [HGBH09, Sch09, UCS+10, Liu12]).
On the software side, several scheduling policies have been proposed for scheduling
streaming programs on MPSoCs [NVC10]. For a long time, self-timed scheduling was
considered the most appropriate policy for streaming programs modeled as dataflow
graphs [LH89, SB09]. Under self-timed scheduling, a task is executed as soon as possible

















Figure 1.4: Popular real-time task models and the complexity of their feasibility tests. The arrows
between the models indicate that the model on the left-side is a subset of the one on the right-side.
For example, L&L is a subset of GMF.
(i.e., when its input data is available). However, the need to support multiple programs
running on a single system without prior knowledge of the properties of the programs
(e.g., required throughput, number of tasks, etc.) at system design-time is forcing a shift
towards run-time scheduling approaches. Most of the existing run-time scheduling
solutions assume programs modeled as task graphs and provide best-effort or soft real-time
guarantees [NVC10]. Few run-time scheduling solutions exist which support programs
modeled using a MoC and provide hard real-time guarantees [God98, BHM+05, Mor12].
However, these solutions use either simple MoCs such as SDF graphs or Time Division
Multiplexing (TDM) scheduling.
Several algorithms from the classical hard real-time multiprocessor scheduling
theory [DB11] can perform fast admission and scheduling decisions for incoming
programs while providing hard real-time guarantees. Moreover, these algorithms
enforce temporal isolation between running programs. Another key advantage of using
classical hard real-time scheduling algorithms is the ability to derive in an analytical
and fast way the minimum number of processors needed to schedule a set of tasks and
the mapping of tasks to processors. Hard real-time scheduling theory and algorithms
work in a similar way to model-based design; they abstract the programs in the form
of a real-time task model. Such task models impose restrictions on the timing of the
program tasks. As a result, it becomes easier to perform timing analysis of the program
and reason about its behavior during the design phase. Several real-time task models
have been proposed in the literature which differ in the complexity of their feasibility
tests as shown in Figure 1.4. A feasibility test deals with the problem of deciding whether
a given set of tasks can be scheduled to meet all their deadlines.
The most famous model is the real-time periodic task model proposed by Liu and
Layland in 1973 [LL73], often called the Liu and Layland (L&L) model. This model has
a simple feasibility analysis, which led to its wide adoption. However, this model places
several restrictions on the tasks such as:
• The releases (i.e., invocations) of all tasks are periodic, with a constant interval
between releases.
• All tasks are independent in that releases of a certain task do not depend on the
initiation or completion of releases of other tasks.
• The execution time of each task is constant.
Several models were proposed to extend the Liu and Layland model as shown in Figure 1.4.
However, these extended models have more sophisticated feasibility analysis.
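To make the notion of a simple feasibility test concrete, the following Python sketch applies the two classical utilization-based tests from [LL73] to a set of L&L tasks on a single preemptive processor. It assumes implicit deadlines (each deadline equals the period). The function names and the example task set are illustrative and are not taken from this dissertation.

```python
from fractions import Fraction


def utilization(tasks):
    """Total utilization of a task set given as (wcet, period) integer pairs."""
    return sum(Fraction(c, t) for c, t in tasks)


def edf_feasible(tasks):
    """Exact test for preemptive EDF on one processor with implicit deadlines:
    the task set is schedulable if and only if its utilization is at most 1."""
    return utilization(tasks) <= 1


def rm_sufficient(tasks):
    """Sufficient (not necessary) test for rate-monotonic priorities:
    utilization must not exceed n * (2^(1/n) - 1) for n tasks."""
    n = len(tasks)
    if n == 0:
        return True
    return utilization(tasks) <= n * (2 ** (1.0 / n) - 1)


if __name__ == "__main__":
    example = [(1, 4), (2, 6), (1, 8)]  # hypothetical (wcet, period) pairs
    print("U =", utilization(example))
    print("EDF feasible:", edf_feasible(example))
    print("RM bound satisfied:", rm_sufficient(example))
```

Both tests run in time linear in the number of tasks, which is the kind of simplicity that the extended task models in Figure 1.4 give up.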
1.2 Problem Statement
We can summarize the discussion in Section 1.1 as follows. Model-based design and
electronic system-level synthesis have emerged as de facto solutions to the problems
of designing parallel software for MPSoCs and generating the complete MPSoC,
respectively. However, no such de facto solution exists yet for the problem of scheduling
parallel streaming programs on MPSoCs used in hard real-time systems. Scheduling
has a direct influence on the architecture and mapping specifications needed to perform
electronic system-level synthesis as shown in Figure 1.3. One possible and attractive
solution is to use classical hard real-time scheduling algorithms due to their benefits
mentioned in Section 1.1.3. However, most hard real-time scheduling algorithms
assume independent periodic or sporadic tasks [DB11]. Such a simple task model is not
directly applicable to modern streaming programs which, as mentioned in Section
1.1.1, are typically modeled as directed graphs, where graph nodes represent actors
(i.e., tasks) and graph edges represent data-dependencies. The actors in such graphs
have data-dependency constraints and do not necessarily conform to the periodic or
sporadic task models. Therefore, the core problem addressed by this dissertation is
to investigate the applicability of hard real-time scheduling theory for real-time periodic
tasks to streaming programs modeled as acyclic Cyclo-Static Dataflow (CSDF, [BELP96])
graphs.
We argue that in a hard real-time system, one would like to choose, as much as
possible, a decidable MoC in order to verify the timing behavior at system design
time. At the same time, one would like to use the least complex (in terms of feasibility)
real-time task model that is permissive enough in its semantics to model the underlying
program. Therefore, in this dissertation, we consider Cyclo-Static Dataflow (CSDF) as
the model of computation and the real-time periodic (i.e., L&L [LL73]) model as the
real-time task model. The choice of CSDF is motivated by the fact that it is expressive
enough to model most streaming programs as demonstrated in [TA10], while the choice
of the real-time periodic model is motivated by the fact that it has a simple feasibility test
and is widely adopted by existing real-time operating systems. By considering acyclic
CSDF graphs, our findings and results are applicable to most streaming programs since
it has been shown recently that around 90% of streaming programs can be modeled
as acyclic SDF graphs [TA10]. Note that SDF graphs are a subset of the CSDF graphs
considered in this dissertation.





























[Figure 1.5 labels: dataflow MoCs are ordered by expressiveness (design-time analysis) from low (high) to high (low), i.e., from Turing incomplete (decidable) to Turing complete (undecidable). Real-time task models are ordered by feasibility test from easy (pseudo-polynomial) to difficult (strongly (co)NP-hard). The link between the two families is labeled "can be scheduled as".]
Figure 1.5: Bridging dataflow MoCs and real-time task models through the proposed scheduling
framework. The link indicates that any acyclic CSDF graph can be scheduled as a set of L&L tasks.
1.3 Research Contributions
The research contributions of this dissertation can be summarized as follows.
Contribution 1: Proposing a Scheduling Framework that Bridges Dataflow
MoCs and Real-Time Task Models
We propose an analytical scheduling framework (see [BS11,BS12]) that bridges
classical dataflow models and classical real-time task models as shown in Figure 1.5. We
prove analytically using this framework that any streaming program, modeled as an
acyclic CSDF graph, can be executed as a set of real-time periodic tasks. The proposed
framework computes the parameters (i.e., periods, start times, and deadlines) of the
periodic tasks corresponding to the graph actors and the minimum buffer sizes of the
communication channels such that a valid periodic schedule is guaranteed to exist.
This framework shows that the use of both models is possible and that they complement
each other; CSDF captures the functional aspects of the program, while the real-time
periodic task model captures the timing aspects. Using both models, as demonstrated
by our proposed framework, enables the designer to: (1) schedule the tasks to meet
certain performance metrics (i.e., throughput and latency), (2) derive analytically the
scheduling parameters that guarantee the required performance, and (3) compute
analytically the minimum number of processors that guarantee the required
performance together with the mapping of tasks to processors. An overview of the inputs and
outputs of the proposed scheduling framework is shown in Figure 1.6. Each CSDF graph is
annotated with the Worst-Case Execution Time (WCET) of its actors. The WCET values
are obtained from the program through either static analysis tools or profiling the
program on the target MPSoC platform [WEE+08].
[Figure 1.6 labels: the input "CSDF annotated with WCET" is fed to the block "Scheduling Framework".]
Figure 1.6: Inputs and outputs of the proposed scheduling framework
The user input constraints include:
(i) type of scheduling algorithm, (ii) type of allocation algorithm, and (iii) values of
parameters used to control the derivation of periods and deadlines as explained later
in Chapter 4. The outputs of the framework are: (i) architecture specifications which
describe how many processors are needed to schedule the programs, (ii) mapping
specifications which associate each task with the processor on which it runs, and (iii)
temporal specifications which consist of the parameters (i.e., periods, start times, and
deadlines) of the periodic tasks corresponding to the CSDF actors together with the
buffer sizes of the CSDF communication channels.
Additionally, the proposed scheduling framework establishes the following results:
• Matched I/O rates graphs (which correspond to roughly 90% of streaming
programs) have a throughput under periodic schedules that is equal to their
throughput under worst-case self-timed schedules. Periodic schedules here refer to
scheduling the graph actors as real-time periodic tasks. This result opens the
door for applying periodic scheduling to streaming programs.
• For certain classes of CSDF graphs, it is possible to achieve throughput and
latency, under periodic schedules, that are equal to the throughput and latency
under worst-case self-timed schedules. It is also shown that, for CSDF graphs
in general, the latency can be reduced via reducing the deadlines of the actors
along the critical paths.
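The analytical derivation of the number of processors and of the task-to-processor mapping can be illustrated with a minimal Python sketch. It assumes implicit-deadline periodic tasks given as (WCET, period) pairs and partitioned EDF, under which a processor is schedulable as long as its total utilization does not exceed 1. The first-fit decreasing heuristic and all names below are illustrative choices and are not necessarily the allocation algorithms supported by the framework.

```python
import math
from fractions import Fraction


def processor_lower_bound(tasks):
    """No scheduler can use fewer processors than ceil(total utilization)."""
    return math.ceil(sum(Fraction(c, t) for c, t in tasks))


def first_fit_decreasing(tasks):
    """Partition tasks onto processors so that each processor has utilization <= 1.

    Returns (number_of_processors, mapping), where mapping[i] is the index of
    the processor assigned to tasks[i].
    """
    # Visit tasks from highest to lowest utilization (stable for equal values).
    order = sorted(range(len(tasks)),
                   key=lambda i: Fraction(*tasks[i]), reverse=True)
    loads = []     # current utilization of each opened processor
    mapping = {}
    for i in order:
        u = Fraction(*tasks[i])
        for p, load in enumerate(loads):
            if load + u <= 1:          # EDF test on processor p still holds
                loads[p] += u
                mapping[i] = p
                break
        else:                          # no existing processor fits: open a new one
            loads.append(u)
            mapping[i] = len(loads) - 1
    return len(loads), mapping
```

The lower bound and the heuristic can disagree: three tasks of utilization 2/3 each have a lower bound of 2 processors, yet any partitioned approach needs 3, because no two of them fit on the same processor.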
Contribution 2: Proposing and Realizing a System-Level Design Flow
that Incorporates the Proposed Scheduling Framework
In order to demonstrate the applicability of the proposed scheduling framework, we
propose a system-level design flow that incorporates the proposed scheduling framework
(see [BZNS12]). This design flow represents a contribution because most of the
existing scheduling frameworks for streaming programs represent theoretical frameworks
that have limited or no applicability in real design flows. The proposed design flow is
based on an existing state-of-the-art system-level design flow for streaming programs
called Daedalus [TNS+07,NTS+08]. Similar to Daedalus, the proposed design flow
starts from the sequential specifications of the programs, and then generates, in a fully
automated manner, the final system implementation, which provably meets the timing
requirements of the programs. A complete implementation of the proposed design flow
is available for download as an open source framework from http://daedalus.liacs.nl/.
This implementation of the proposed design flow is called the DaedalusRT design flow.
The proposed design flow is illustrated in Figure 1.7. It consists, in total, of six steps
that are marked inside circles in Figure 1.7. Step 1 accepts, as input, a set of SANLPs and
then uses the PNgen compiler to parallelize them and generate the parallel specification
of these input programs. The parallel specification consists of the Polyhedral Process
Network (PPN, [VNS07]) representation of the program. PPN is a parallel MoC that is
useful for code generation and optimizations. However, it is not suitable for analytical
performance analysis. This leads us to the next step.
In Step 2, the PPNs generated in Step 1 are used to construct the performance analysis
model, i.e., the CSDF model. Given a PPN, we use the algorithm proposed in [BZNS12] to
derive a CSDF graph that is equivalent to the given PPN.
In Step 3, we perform WCET analysis on the parallel specification of the program.
WCET analysis, as mentioned earlier, can be performed through either static analysis
tools or profiling the code on the target MPSoC platform.
In Step 4, the CSDF models generated in Step 2, the WCET values generated in
Step 3, and the user constraints, which include for example the type of scheduler and
other parameters as explained earlier, are fed to the proposed scheduling framework
explained in Contribution 1. This results in: (i) the architecture specification, which
describes how many processors are needed to schedule the programs, and (ii) the
mapping specification, which describes the allocation of tasks to processors.
In Step 5, the PPNs together with the architecture and mapping specifications are
processed by ESPAM [NSD08]. ESPAM is an ESL synthesis tool that supports MPSoC
synthesis from PPNs. We have extended ESPAM to support synthesizing the target
MPSoC hardware and software. The output from this step is a full MPSoC
implementation consisting of the RTL needed to perform low-level synthesis for FPGA or ASIC
together with the software running on each processor in the MPSoC.
Step 6 is the last step in the design flow and consists of performing low-level
synthesis for FPGA or ASIC together with compiling the code for each processor.
Figure 1.7: Overview of the proposed design flow
1.4 Related Work
Design of hard real-time streaming systems has been an active research area for a
long time. We give in this section a survey of the existing work and how it relates to
this dissertation. The related work is organized into two categories: hard real-time
scheduling of streaming programs, and design flows for hard real-time streaming
systems.
1.4.1 Hard Real-Time Scheduling of Streaming Programs
It is shown in [LM87, BELP96] that any SDF/CSDF graph can be converted into a
functionally equivalent Homogeneous SDF (HSDF, [LM87]) graph, where HSDF is
equivalent to Computation Graphs introduced by Karp and Miller in 1966 [KM66].
Several scheduling techniques (e.g., [Mor12]) utilize this property by converting a
given SDF/CSDF graph into an HSDF graph and then performing the scheduling analysis
on the HSDF graph. However, the resulting HSDF graph has a size that can grow exponentially
with the size of the input SDF/CSDF graph. Therefore, such scheduling techniques have
two disadvantages. First, the resulting HSDF graph might be huge, which introduces a large
overhead when the actual HSDF graph is scheduled on the system. Second, in a periodic
schedule of a CSDF graph, each actor must have a start time and a period and each channel
must have a buffer size. By deriving the schedule for the HSDF graph, it is possible to
derive a start time and a period for each CSDF actor. However, it is not clear how to
derive a buffer size for each CSDF channel from the buffer sizes of the HSDF graph.
Parks and Lee [PL95] studied the applicability of non-preemptive fixed task priority
scheduling with rate monotonic priority assignment to streaming programs modeled
as SDF graphs. Our work differs in the following aspects. First, they considered
non-preemptive scheduling, while we consider only preemptive scheduling.
Non-preemptive scheduling is known to be NP-hard in the strong sense even for the
uniprocessor case [JSM91]. Second, they considered SDF graphs which are a subset of the
more general CSDF graphs.
Verhaegh et al. [VLA+96] proposed Multidimensional Periodic Scheduling (MPS)
to schedule digital signal processing programs written as a set of nested loops and
modeled using Signal Flow Graphs (SFG). The inputs to MPS are the SFG and a set of
explicit timing constraints. Given an SFG, MPS derives a schedule for each operation,
where this schedule is described by multiple periods and offsets, and a buffer size for
each channel such that the precedence and timing constraints are met. The scheduled
operations execute in a strictly periodic manner (similar to the real-time periodic
model). Verhaegh et al. showed that MPS is NP-hard and they proposed a two-stage
solution approach in [VAvGL01]. The MPS framework is very similar to the framework
proposed in Chapter 4 in that both frameworks derive the periods and start times of
the tasks and the buffer sizes of the channels. However, the frameworks differ in the
considered MoC. MPS considers SFG, while the framework in Chapter 4 considers
CSDF which is more general than SFG [PC13]. Another difference is that the framework
in Chapter 4 can derive a deadline for each task to reduce the graph latency.
Goddard [God98] studied applying real-time scheduling to streaming programs
modeled using the Processing Graphs Method (PGM). He used a task model called
Rate-Based Execution (RBE) which is a generalization of the real-time periodic task
model. For a given PGM, he developed an analysis technique to find the RBE task
parameters of each actor and buffer size of each channel. Each channel in PGM is
associated with a production/consumption rate (as in SDF) and a consumption threshold.
The interpretation is that the consumer consumes a number of tokens equal to the
consumption rate only if the channel contains a number of tokens that is greater than
or equal to the consumption threshold. In contrast, CSDF supports a sequence of
predefined production/consumption rates. As a result, the analysis technique in [God98]
is not applicable to CSDF graphs.
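The difference between a single SDF rate and a CSDF rate sequence can be sketched in Python as follows. The helpers below are a hypothetical illustration and are not taken from [BELP96] or [God98]. They count the tokens on one channel after a given number of producer and consumer firings, where the k-th firing of an actor uses entry k mod n of its rate sequence of length n. An SDF actor is the special case of a sequence of length 1. The count ignores the order in which the firings happen, so it is a bookkeeping aid rather than a schedule.

```python
from itertools import cycle, islice


def tokens_moved(rate_seq, firings):
    """Tokens produced (or consumed) by `firings` consecutive firings of an
    actor whose rates repeat cyclically through rate_seq."""
    return sum(islice(cycle(rate_seq), firings))


def channel_fill(prod_seq, cons_seq, prod_firings, cons_firings, initial=0):
    """Tokens left on a channel after the given numbers of firings."""
    fill = (initial
            + tokens_moved(prod_seq, prod_firings)
            - tokens_moved(cons_seq, cons_firings))
    if fill < 0:
        raise ValueError("consumer would read tokens that were never produced")
    return fill
```

For example, a producer with the rate sequence [1, 0, 2] emits 3 tokens over one full cycle of three firings, which exactly feeds one firing of an SDF-style consumer with the constant rate [3].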
Ziegenbein et al. [ZUE00] proposed a technique to optimize the response time
of SDF graphs under Earliest Deadline First (EDF) scheduling assuming jittery input
streams. Their technique accepts, as input, the SDF graphs, the average period of each
actor, and the jitter bounds of the input streams. Then, they revise the deadline of each
actor in such a way that the data dependencies are respected and the total response
time is minimized. Our approach differs from [ZUE00] in the following aspects. First,
we compute all the scheduling parameters of the tasks including the deadline, while
in [ZUE00] the authors assume that the periods are given. Thus, our approach gives
the designer greater flexibility in controlling the timing behavior of the tasks. Second,
we consider CSDF graphs which extend the SDF model considered in [ZUE00].
Bekooij et al. [BHM+05] analyzed the impact of TDM scheduling on programs
modeled as SDF graphs running on embedded real-time multiprocessor systems. Wiggers
et al. [WBS09] proposed a classification scheme for run-time scheduling algorithms
based on the causes of interference among programs. They identified two causes of
interference which are (i) how often other tasks are executed and (ii) what execution
time is associated with these executions. According to Wiggers' taxonomy [WBS09],
run-time scheduling algorithms are classified into:
1. Non-starvation-free algorithms: under such algorithms, interference depends
on (i) and (ii). Examples include fixed-task priority scheduling.
2. Starvation-free algorithms: interference is independent of (i) but depends on
(ii). Examples include round-robin scheduling.
3. Budget-schedulers: interference is independent of (i) and (ii). Thus, a budget
scheduler guarantees every task a minimum amount of time x in every time
interval of length y. Examples of budget schedulers include TDM and Constant
Bandwidth Server (CBS, [AB98]).
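The guarantee offered by a budget scheduler can be illustrated with TDM. The following Python sketch computes a worst-case response time bound under simplifying assumptions that are ours: the task owns one contiguous slice of length x in every TDM wheel of length y, there is no scheduling overhead, and the task becomes ready at the worst possible instant (just as its slice ends). The function name is hypothetical.

```python
def tdm_worst_case_response(wcet, slice_len, wheel_len):
    """Upper bound on the response time of one execution under TDM.

    wcet      -- worst-case execution time of the task (integer time units)
    slice_len -- time x guaranteed to the task in every wheel
    wheel_len -- length y of the TDM wheel (x <= y)
    """
    if not 0 < slice_len <= wheel_len:
        raise ValueError("need 0 < slice_len <= wheel_len")
    slices_needed = -(-wcet // slice_len)   # ceiling division on integers
    # Before each slice it uses, the task may wait for all other slices.
    return wcet + slices_needed * (wheel_len - slice_len)
```

The bound depends only on the task's own execution time and budget, not on how often other tasks execute or for how long, which is the defining property of the third class above.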
Wiggers et al. defined a subset of dataflow graphs called functionally deterministic
dataflow graphs and showed that such graphs have time deterministic behavior under
budget schedulers. CSDF is an example of a functionally deterministic dataflow graph.
In [SBW09], Steine et al. proposed a priority-based budget scheduling algorithm that
overcomes some of the limitations of TDM. Recently, Hausmans et al. [HWGB13]
extended the analysis in [BHM+05,WBS09] to programs modeled as arbitrary HSDF
graphs when they are scheduled using algorithms of the first class according to Wiggers'
taxonomy. In another work, Hausmans et al. [HGWB13] proposed a two parameter
(σ, ρ) workload characterization to reduce the gap between the worst-case throughput
determined by the analysis and the actual throughput of the program. The (σ, ρ)
workload characterization uses the average execution time of consecutive executions of a
task to provide throughput and latency guarantees. Compared to [BHM+05,HWGB13],
the framework in Chapter 4 applies the analysis directly to a more expressive MoC
(namely CSDF) and avoids the conversion to a larger HSDF. Compared to [WBS09,
SBW09], we use the real-time periodic model which does not restrict the designer to
a certain class of algorithms as defined by Wiggers' taxonomy. The designer can use
any algorithm that supports the underlying task model. Compared to [HGWB13], we
consider “classical” hard real-time tasks, where each execution of a task must meet its
deadline. In contrast, under the (σ, ρ) workload characterization, the average WCET
is used to improve the minimum guaranteed throughput/latency. Thus, an internal task
in the dataflow graph may miss its deadline without causing the program to violate its
guaranteed throughput/latency.
Thiele and Stoimenov [TS09] proposed an analysis framework for HSDF graphs
based on Real-Time Calculus (RTC, [CKT03]). Their analysis framework provides
upper and lower bounds on the performance metrics under different scheduling policies
(e.g., TDM and fixed task priority scheduling). An advantage of their framework is its
ability to handle cyclic graphs. However, their framework acts as a general performance
analysis technique that provides only upper and lower bounds on the performance. In
contrast, our scheduling framework computes the tasks’ parameters that guarantee a
certain performance under certain scheduling policies. Moreover, we apply the analysis
directly on the more general CSDF model (although acyclic graphs only), while the
framework in [TS09] applies the analysis to HSDF graphs. This means that a program
modeled as an SDF/CSDF graph must be converted into HSDF in order to apply the
analysis in [TS09]. Such conversion has disadvantages as mentioned earlier.
Moreira [Mor12] has investigated temporal analysis of hard real-time radio
programs modeled as SDF graphs. He proposed a scheduling framework based on TDM
combined with static allocation. He also proved that it is possible to derive a periodic
schedule for the actors of a cyclic SDF graph if and only if the periods are greater than
or equal to the maximum cycle mean of the graph. He formulated the conditions on
the start times of the actors in the equivalent HSDF graph in order to enforce a periodic
execution of every actor as a Linear Programming (LP) problem. Our approach differs
from [Mor12] in the following aspects. First, we use the periodic task model which
allows applying a variety of hard real-time scheduling algorithms for multiprocessors.
Second, we use the CSDF model which is more expressive than the SDF model and
perform the analysis directly on CSDF instead of converting it into HSDF as done
in [Mor12].
Bodin et al. [BMKdD12] studied the throughput of programs modeled as SDF
graphs under K-periodic schedules. In a K-periodic schedule, a schedule of Ki
occurrences of task i is repeated every Oi time units. It has been shown that self-timed
schedules are K-periodic schedules [CMQV89]. Thus, K-periodic schedules achieve
the maximum throughput for a given SDF program. When Ki = 1 for all tasks, we
obtain 1-periodic schedules which are equivalent to the schedules generated using the
real-time periodic task model. Thus, K-periodic schedules serve as a powerful tool
to analyze different scheduling policies. However, the realization of such schedules is
more complex than the 1-periodic ones. In this dissertation, we prove the existence of
1-periodic schedules for acyclic CSDF graphs. 1-periodic schedules are easier to realize.
However, this simplicity comes at the price of extra buffer requirements as shown later
in Chapter 6.
Bouakaz et al. [BTV12,BT13] recently proposed a new dataflow model called Affine
Dataflow (ADF) which extends the CSDF model. They also proposed an analysis
framework similar to ours to schedule the actors in an ADF graph as periodic tasks.
They also claim that their analysis framework is capable of handling cyclic ADF graphs.
An advantage of their approach is the enhanced expressiveness of the ADF model. The
framework in [BTV12,BT13] was proposed after our framework, and
the authors in [BTV12,BT13] refer to our framework and compare their
framework empirically with ours using the benchmarks explained later in Chapter 6. For most
benchmarks, both CSDF and ADF achieve the same throughput and latency while
requiring the same buffer sizes. However, in a few cases, ADF results in reduced buffer
sizes compared to CSDF [BT13].
Benabid-Najjar et al. [BNHMMK12] studied periodic scheduling of SDF graphs.
For acyclic graphs, they proved that any acyclic SDF graph can be scheduled as a set
of periodic tasks. For cyclic SDF graphs, they showed that the existence of a periodic
schedule depends on the number of initial tokens in the graph cycles and provided
a framework to derive the graph throughput under a periodic schedule. Compared
to [BNHMMK12], our framework proves the existence of periodic schedules for acyclic
CSDF graphs, which are more expressive than SDF graphs. Recently, Bodin et al.
[BMKdD13] extended the work in [BNHMMK12] to cyclic CSDF graphs by providing
a framework to derive the maximum throughput of a CSDF graph under a periodic
schedule. Similar to [BT13], the work in [BMKdD13] is very recent and was proposed
after our framework.
Geuns et al. [GHB13] proposed a technique to parallelize automatically sequential
streaming programs containing while loops and if statements. Their technique derives a
Structured Variable-rate Phased Dataflow (SVPDF) model of the parallelized program.
After that, they perform scheduling analysis on the model to derive a schedule which
ensures that the source and sink tasks can be scheduled in a strictly periodic fashion.
A key difference between the analysis framework in [GHB13] and the framework in
Chapter 4 is that the analysis in [GHB13] does not impose strict periodicity on all
tasks. It is imposed only on the source and sink tasks. In contrast, the framework
in Chapter 4 imposes strict periodicity on all the tasks to ensure that they conform
with the real-time periodic task model. Using the real-time periodic task model has
the following advantages. First, any scheduling algorithm that supports the real-time
periodic task model can be used to schedule the programs. Second, multiple programs
can be scheduled, while preserving temporal isolation, as long as the programs’ tasks
conform to the task model and satisfy the schedulability test of the used scheduling
algorithm. Third, the minimum number of processors needed to schedule the programs
can be determined in a fast and analytical way.
1.4.2 Design Flows for Hard Real-Time Streaming Systems
Several design flows for automated mapping of streaming programs onto MPSoC
platforms are surveyed in [GHP+09]. Most of these flows deal with soft real-time
streaming systems. Additionally, these flows assume that the program model is derived
manually by the designer/programmer. In contrast, our proposed design flow deals
with hard real-time systems and derives the program model in a completely automated
manner.
Distributed Operation Layer (DOL, [TBHH07]) is a framework for mapping parallel
applications onto tiled MPSoCs. It accepts, as input, an application, which is specified as
a process network, and an architecture specification. After that, it uses multi-objective
optimization algorithms to perform the mapping of application to architecture. Then, it
applies analytical performance analysis based on Real-Time Calculus (RTC, [CKT03])
to estimate the performance of the application after mapping. Our proposed design flow
differs from DOL in the following aspects. First, we perform automated parallelization
and model construction of the input applications, while DOL assumes that the input is
a parallel application. Second, Real-Time Calculus gives worst-case upper and lower
bounds on the performance. However, in reality these bounds may rarely be reached. In
contrast, our proposed design flow guarantees a certain performance of the programs
under certain scheduling policies.
PeaCE [HKL+08] is an integrated hardware/software co-design framework for
embedded multimedia systems. It employs Synchronous Piggybacked Dataflow (SPDF)
for computation tasks and Flexible Finite State Machines (fFSM) for control tasks.
PeaCE uses hardware/software co-simulations during the design phase in order to meet
certain timing constraints. In contrast, our proposed flow avoids these iterative steps
by applying hard real-time multiprocessor scheduling theory to guarantee temporal
isolation and a given throughput of each application running on the target MPSoC.
CA-MPSoC [SKS+10] is an automated design flow for mapping multiple
applications modeled as SDF graphs onto a Communication Assist (CA) based MPSoC platform.
The flow uses non-preemptive scheduling to schedule the applications. In contrast, we
consider only preemptive scheduling, because non-preemptive scheduling to meet all
the deadlines is known to be NP-hard in the strong sense even for the uniprocessor
case [JSM91]. Moreover, we consider a more expressive MoC, namely the CSDF model.
CompSOC [GAC+13] is a platform and an associated design flow for running
applications modeled as CSDF graphs. The platform part (called CoMPSoC [HGBH09])
provides predictability and composability, which means that the applications running
on the system are completely isolated in terms of execution time, power, and access
to shared resources. Similar to CoMPSoC, the hardware architecture proposed in
Chapter 5 is designed to provide predictability. For the software side, CompSOC uses
a custom OS called Compose that implements two-level hierarchical scheduling. In
the first (or base) level, it divides the processor time into fixed intervals and uses TDM
scheduling to provide complete isolation between the applications running on different
intervals. In the second level, each interval may use a different scheduling policy (e.g.,
EDF or round-robin) to schedule the tasks executed within the interval. The design
flow associated with CompSOC accepts, as input, the CSDF models of the application.
Then, it uses SDF3 [SGB06] to derive the buffer sizes and TDM interval sizes needed
to guarantee a certain performance. Therefore, CompSOC provides a platform for
executing given parallel applications, while our proposed design flow is concerned with
providing a complete integrated design flow that parallelizes the applications and then
derives the scheduling and platform parameters that guarantee a certain performance.
MAPS [Cas13] is a design flow for mapping dataflow applications onto MPSoCs. The
design flow accepts, as input, a set of sequential programs written in a variant of C called
C for Process Networks (CPN). After that, it parallelizes these programs and generates
a performance analysis model based on Kahn Process Networks (KPN, [Kah74]).
Then, it uses a simulation-based composability analysis to provide certain performance
guarantees on the target platform. Our proposed design flow differs from MAPS in that
our flow provides hard real-time guarantees to the programs, while MAPS provides
soft real-time guarantees.
MAMPSx [FSH+13] is an automated design flow for mapping applications modeled
as SDF graphs onto heterogeneous MPSoCs while providing performance guarantees.
The flow requires the designer to specify a sequential implementation (in C) of each
actor in an SDF graph. The generated implementation uses either self-timed scheduling
or TDM scheduling to ensure meeting the throughput constraints. In contrast, our
proposed design flow starts from sequential applications written in C, and consequently,
the parallelized programs are automatically extracted from the initial C programs.
Moreover, we schedule the actors as real-time periodic tasks, which enables applying
very fast schedulability analysis to determine the minimum number of required
processors. Instead, [FSH+13] applies design space exploration techniques to determine the
minimum number of processors and the mapping. Finally, our methodology supports
multiple applications to run simultaneously on an MPSoC, while the work in [FSH+13]
does not support multiple applications.
1.5 Organization of this Dissertation
The rest of this dissertation is organized as follows:
1. Chapter 2 presents an overview of dataflow models and hard real-time
scheduling theory. This overview is necessary to understand the subsequent chapters.
2. Chapter 3 presents the first two stages in the proposed design flow: automated
parallelization and model construction.
3. Chapter 4 presents the key contribution of this dissertation: the scheduling
framework for streaming programs. This framework constitutes the third stage in the
proposed design flow.
4. Chapter 5 presents the fourth stage of the proposed design flow (i.e., ESL
synthesis) and explains the hardware and software parts of the synthesized systems.
5. Chapter 6 presents the results of empirical evaluation of the proposed scheduling
framework and design flow. This empirical evaluation is performed through a
set of experiments.
6. Chapter 7 ends this dissertation with conclusions and suggestions for future
work.
Chapter 2
Background
Essentially, all models are wrong, but
some are useful.
George E. P. Box
This chapter introduces the notations, definitions, and existing results that are used
in the subsequent chapters. It also contains material from the theory of dataflow
models and hard real-time scheduling that is needed to understand the subsequent
chapters.
2.1 Notations
We present in Table 2.1 a summary of the mathematical notations used throughout this
dissertation.
Denition 2.1.1 (Partition of a Set). LetV be a set. An x-partition ofV is a set, denoted
by xV , where
xV = {xV 1, xV 2,⋯, xV x},








xV i = V
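As a small illustration of Definition 2.1.1, the following Python sketch checks whether a list of subsets forms a partition of a set V. It assumes the standard partition conditions (the subsets are pairwise disjoint and their union equals V). The function name is hypothetical.

```python
def is_partition(subsets, v):
    """Return True if `subsets` are pairwise disjoint and their union equals v."""
    subsets = [set(s) for s in subsets]
    union = set().union(*subsets)
    # If no element is shared, the sizes of the subsets add up to the size of the union.
    pairwise_disjoint = sum(len(s) for s in subsets) == len(union)
    return pairwise_disjoint and union == set(v)
```

For instance, with V the set of tasks and x the number of processors, a mapping of tasks to processors induces an x-partition of V in which each subset holds the tasks of one processor.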
2.2 Parallel Execution of Programs
We start by dening what we mean by a program.
Table 2.1: Summary of mathematical notations
Symbol        Meaning
N             The set of natural numbers excluding zero
N0            N ∪ {0}
Z             The set of integers
Q             The set of rational numbers
|x|           The cardinality (i.e., size) of a set x
x̂             The maximum value of x
x̌             The minimum value of x
lcm           The least common multiple operator
gcd           The greatest common divisor operator
÷             The integer division operator
mod           The integer modulo operator
${}^{x}V$     An x-partition of a set V (see Definition 2.1.1)
Definition 2.2.1 (Program). A program (also called application) is a sequence of
operations (also called statements) that transform a given input to an output.
A statement can be a simple expression (e.g., z = x + y), an invocation of a function
(e.g., z = f(x,y)), or a control statement (e.g., if(x>1)). For some programs, the
statements need to be executed in a strictly sequential way in order to maintain the
correct functionality of the program. For some other programs, the statements can be
executed in a parallel fashion while maintaining the correct functionality. In general,
the main objective of executing the statements of a given program in parallel is to
achieve a speedup. Let ∆1 be the time needed to run the program on one processor, and
∆m be the time needed to run the program on m processors. We define the speedup as:
\[ \text{speedup} = \frac{\Delta_1}{\Delta_m} \tag{2.1} \]
An ideal parallel implementation of a program running on m processors achieves a
speedup equal to m. However, Amdahl [Amd67] observed in 1967 that, in reality, any
program consists of two portions: a parallelizable portion and a sequential portion.
The statements in the sequential portion cannot be executed in parallel and hence do
not benefit from execution on multiprocessor systems. Let f ∈ [0, 1] be a fraction that
denotes the relative size of the parallelizable portion of a program. Amdahl showed
that the actual speedup is given by:
\[ \text{speedup} = \frac{1}{(1 - f) + \frac{f}{m}} \tag{2.2} \]
For example, for a program where f = 0.9 (i.e., 90% of the program is parallelizable),
the maximum speedup is:
\[ \text{maximum speedup} = \lim_{m \to \infty} \frac{1}{(1 - 0.9) + \frac{0.9}{m}} = \frac{1}{0.1} = 10 \tag{2.3} \]
This means that the speedup that can be obtained by executing the program on a
multiprocessor system is bounded by 10. This bound is approached only asymptotically:
according to (2.2), 10 processors yield a speedup of about 5.3, and every additional
processor brings a progressively smaller gain.
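The behaviour of (2.2) for f = 0.9 can be checked numerically. The following is a minimal sketch; the function names `amdahl_speedup` and `max_speedup` are illustrative and do not come from the dissertation.

```python
def amdahl_speedup(f: float, m: int) -> float:
    """Speedup on m processors when a fraction f of the program is parallelizable (Eq. 2.2)."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie in [0, 1]")
    if m < 1:
        raise ValueError("m must be at least 1")
    return 1.0 / ((1.0 - f) + f / m)


def max_speedup(f: float) -> float:
    """Limit of the speedup for m -> infinity; unbounded when f == 1."""
    return float("inf") if f == 1.0 else 1.0 / (1.0 - f)


if __name__ == "__main__":
    # Print the speedup for a growing number of processors.
    for m in (1, 2, 10, 100, 1000):
        print(m, round(amdahl_speedup(0.9, m), 3))
    print("limit:", max_speedup(0.9))
```

The printed values grow with m but stay below the limit of 10, which illustrates the diminishing return of adding processors.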
The ability to execute two program statements in parallel is constrained by the data
dependencies between them. For example, if a statement Sj requires the data produced
by statement Si, then Sj must be executed after Si has completed its execution. To
find all data dependencies in a given program, one needs to perform dependency
analysis, which reveals all data dependencies among the statements of the program.
Let S be a program where Si and Sj represent two statements of the program.
Additionally, let input(Si) be the set of resources¹ read by Si, and output(Si) be the
set of resources written to by Si. We denote the sequential execution of Sj after Si by
Si → Sj, while we denote the parallel execution of Si and Sj by Si ∥ Sj. Bernstein
[Ber66] showed that Si → Sj and Si ∥ Sj are equivalent provided that:
1. output(Si) ∩ output(Sj) = ∅
2. output(Si) ∩ input(Sj) = ∅
3. output(Sj) ∩ input(Si) = ∅
The three conditions above are known in the literature as Bernstein’s conditions. They
form the basis of how we can analyze a given program to determine the statements
that can be executed in parallel. During the execution of a program S, we say that a
statement Si precedes a statement Sj, denoted by Si ≺ Sj, if Si is executed before Sj.
Given a program S where Si and Sj are two statements and Si ≺ Sj, one can identify
the following types of data dependencies:
• Flow (True) Dependence: If output(Si) ∩ input(Sj) ≠ ∅
• Anti-Dependence: If input(Si) ∩ output(Sj) ≠ ∅
• Output Dependence: If output(Si) ∩ output(Sj) ≠ ∅
• Input Dependence: If input(Si) ∩ input(Sj) ≠ ∅
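Bernstein's conditions and the four dependence types translate directly into set operations. The sketch below assumes that each statement is described by two Python sets of resource names; the helper names and the three example statements are hypothetical.

```python
from typing import List, Set


def bernstein_parallel(in_i: Set[str], out_i: Set[str],
                       in_j: Set[str], out_j: Set[str]) -> bool:
    """True if Si and Sj satisfy all three of Bernstein's conditions."""
    return (not (out_i & out_j)      # condition 1: no common written resource
            and not (out_i & in_j)   # condition 2: Sj does not read what Si writes
            and not (out_j & in_i))  # condition 3: Si does not read what Sj writes


def dependences(in_i: Set[str], out_i: Set[str],
                in_j: Set[str], out_j: Set[str]) -> List[str]:
    """Dependence types between Si and Sj, assuming Si precedes Sj."""
    kinds = []
    if out_i & in_j:
        kinds.append("flow")
    if in_i & out_j:
        kinds.append("anti")
    if out_i & out_j:
        kinds.append("output")
    if in_i & in_j:
        kinds.append("input")
    return kinds


# S1: z = x + y    S2: w = z * 2    S3: v = x - 1
s1 = ({"x", "y"}, {"z"})
s2 = ({"z"}, {"w"})
s3 = ({"x"}, {"v"})
print(bernstein_parallel(*s1, *s2), dependences(*s1, *s2))
print(bernstein_parallel(*s1, *s3), dependences(*s1, *s3))
```

In this example S1 and S2 must stay sequential because of a flow dependence, while S1 and S3 share only an input and can run in parallel.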
Standard dataflow analysis [ALSU86] is a body of techniques that derive the data
dependencies among the statements of a given program. Array dataflow analysis [Fea91]
is a technique which performs dataflow analysis for SANLP programs. Feautrier [Fea91]
showed that array dataflow analysis can be used to construct a parallel version of a
given SANLP program. This means that programs written in SANLP form can be
automatically analyzed and parallelized. An example of a tool which implements such
automated analysis and parallelization is the PNgen compiler [VNS07]. The details
of how a parallel program is derived using automated parallelization are explained in
Chapter 3.
¹ Resource here means hardware resources used to store data such as memory locations and registers.
2.3 Cyclo-Static Dataflow (CSDF)
As mentioned earlier in Chapter 1, we use in this dissertation the Cyclo-Static Dataflow
(CSDF) model for modeling streaming programs. In this section, we introduce this
model and its properties.
CSDF is a dataflow model that extends the well-known Synchronous Dataflow
(SDF, [LM87]) model. It is defined in [BELP96] as a directed graph G = (A, E),
where A is a set of actors that correspond to the graph nodes and E ⊆ A × A is a set
of communication channels that correspond to the graph edges. Actors represent
statements in the program that transform incoming data streams into outgoing data
streams, while communication channels represent data dependencies among the actors.
The communication channels carry streams of data, and an atomic data object is called
a token. A channel Eu ∈ E is a first-in, first-out (FIFO) queue with unbounded capacity
defined by a tuple Eu = (Ai, Aj). The tuple means that Eu is directed from Ai (called
source) to Aj (called destination). An actor receiving an input stream of the program is
called an input actor, and an actor producing an output stream of the program is called
an output actor. A path Wk between actors Aa and Az is an ordered sequence of channels
defined as Wk = {(Aa, Ab), (Ab, Ac), ⋯, (Ay, Az)}. A path Wk is called an output path if
the starting actor Aa is an input actor and the ending actor Az is an output actor. For
a graph G, we use W to denote the set of all output paths in G. Each actor Ai ∈ A is
associated with two sets of channels and two sets of actors. The sets of channels are
the input channels set, denoted by inp(Ai), which consists of all the input channels to
Ai, and the output channels set, denoted by out(Ai), which consists of all the output
channels from Ai. The sets of actors are the successors set, denoted by succ(Ai), and
the predecessors set, denoted by prec(Ai). They are given by:
succ(Ai) = {Aj ∈ A : ∃Eu = (Ai, Aj) ∈ E} (2.4)
prec(Ai) = {Aj ∈ A : ∃Eu = (Aj, Ai) ∈ E} (2.5)
We assume that: (1) for any input actor Ai, prec(Ai) = ∅, and (2) for any output actor
Aj, succ(Aj) = ∅.
Every actor Aj ∈ A has an execution sequence $[f_j(1), f_j(2), \cdots, f_j(\mathcal{N}_j)]$ of
length $\mathcal{N}_j$. The interpretation of this sequence is: the nth time that actor Aj is fired,
it executes the code of function $f_j(((n - 1) \bmod \mathcal{N}_j) + 1)$. Similarly, production and
consumption rates of tokens are also sequences of length $\mathcal{N}_j$ in CSDF. The token
production of actor Aj on channel Eu is represented as a sequence of constant integers
$[x_j^u(1), x_j^u(2), \cdots, x_j^u(\mathcal{N}_j)]$. The nth time that actor Aj is fired, it produces
$x_j^u(((n - 1) \bmod \mathcal{N}_j) + 1)$ tokens on channel Eu. The consumption of actor Ak is
completely analogous; the token consumption of actor Ak from a channel Eu is represented
as a sequence $[y_k^u(1), y_k^u(2), \cdots, y_k^u(\mathcal{N}_k)]$. The firing rule of a CSDF actor Ak is
evaluated as “true” for its nth firing if and only if all its input channels contain
at least $y_k^u(((n - 1) \bmod \mathcal{N}_k) + 1)$ tokens. The total number of tokens produced by
actor Aj on channel Eu during the first n invocations is denoted by $X_j^u(n)$ and given by
\[ X_j^u(n) = \sum_{l=1}^{n} x_j^u(l) \]
Similarly, the total number of tokens consumed by actor Ak from channel Eu during the
first n invocations is denoted by $Y_k^u(n)$ and given by
\[ Y_k^u(n) = \sum_{l=1}^{n} y_k^u(l) \]

Algorithm 1 Levels(G)
Require: Acyclic CSDF graph G = (A, E)
i ← 1
while A ≠ ∅ do
    ^LA_i ← {Aj ∈ A : prec(Aj) = ∅}
    E′_i ← {Eu ∈ E : ∃Ak ∈ ^LA_i such that Eu = (Ak, Al)}
    A ← A ∖ ^LA_i
    E ← E ∖ E′_i
    i ← i + 1
end while
L ← i − 1
return L-partition of A given by ^LA = {^LA_1, ^LA_2, ⋯, ^LA_L}
An acyclic CSDF graph has a number of levels, denoted by L, which is computed by
Algorithm 1. Algorithm 1 builds an L-partition of A, denoted by ^LA, by partitioning it
in a way similar to topological sort. The actors belonging to subset ^LA_i are said to be
level-i actors.
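Algorithm 1 can be sketched in a few lines of code. The version below is an illustration only: it represents actors as strings and channels as (source, destination) pairs, and it adds a cycle check that Algorithm 1 does not need because it requires an acyclic graph. The example graph uses the five channels that appear in the output paths of Example 2.3.1.

```python
from typing import Iterable, List, Set, Tuple


def levels(actors: Iterable[str], channels: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Partition the actors of an acyclic graph into levels, as in Algorithm 1."""
    remaining = set(actors)
    edges = set(channels)
    result = []
    while remaining:
        # Level i: actors that have no predecessor among the remaining channels.
        has_pred = {dst for (_, dst) in edges}
        level = {a for a in remaining if a not in has_pred}
        if not level:
            raise ValueError("graph contains a cycle")
        result.append(level)
        # Remove the level's actors and their outgoing channels.
        edges = {(s, d) for (s, d) in edges if s not in level}
        remaining -= level
    return result


actors = ["A1", "A2", "A3", "A4"]
channels = [("A1", "A2"), ("A1", "A3"), ("A1", "A4"), ("A2", "A4"), ("A3", "A4")]
print(levels(actors, channels))
```

For this graph the sketch yields three levels: A1 alone, then A2 and A3 together, then A4.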
An important property of the CSDF model is its decidability, which is, as mentioned
in Chapter 1, the ability to derive at compile-time a schedule for the actors. This is
formulated in Definition 2.3.1 and Theorem 2.3.1.
Definition 2.3.1 (Valid Static Schedule [BELP96]). Given a connected CSDF graph
G, a valid static schedule for G is a finite sequence of actor invocations that can be
repeated infinitely on the incoming sample stream while the amount of data in the
buffers remains bounded. A vector q⃗ = [q1, q2, ⋯, q|A|]^T, where qj > 0, is a repetition
vector of G if each qj represents the number of invocations of an actor Aj in a valid static
schedule for G. The repetition vector of G with the smallest norm is called the basic
repetition vector of G and is denoted by $\check{\vec{q}}$. G is consistent if there exists a repetition
vector. If a deadlock-free schedule can be found, G is said to be live. Both consistency
and liveness are required for the existence of a valid static schedule.
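Bilsen et al. [BELP96] proved the following theorem:
Theorem 2.3.1. Let G be a CSDF graph. A repetition vector q⃗ = [q1, q2, ⋯, q|A|]^T of G
is given by
\[ \vec{q} = P \cdot \vec{r}, \quad \text{with } P = [P_{jk}] \text{ and } P_{jk} = \begin{cases} \mathcal{N}_j & \text{if } j = k \\ 0 & \text{otherwise} \end{cases} \tag{2.6} \]
where r⃗ = [r1, r2, ⋯, r|A|]^T is a positive integer solution of the balance equation
\[ \Gamma \cdot \vec{r} = \vec{0} \tag{2.7} \]
and where the topology matrix Γ, of size |E| × |A|, is defined by
\[ \Gamma_{uj} = \begin{cases} X_j^u(\mathcal{N}_j) & \text{if actor } A_j \text{ produces on channel } E_u \\ -Y_j^u(\mathcal{N}_j) & \text{if actor } A_j \text{ consumes from channel } E_u \\ 0 & \text{otherwise} \end{cases} \tag{2.8} \]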
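The computation in Theorem 2.3.1 can be sketched for a connected graph as follows. This is an illustration under stated assumptions, not the procedure of [BELP96]: each channel is given as a tuple (source, destination, X, Y), where X and Y are the total production and consumption over one full execution sequence, and the balance equation is solved by propagating rational ratios along the channels. The three-actor graph and its rates are hypothetical.

```python
from fractions import Fraction
from functools import reduce
from math import gcd
from typing import Dict, List, Tuple


def repetition_vector(phases: Dict[str, int],
                      channels: List[Tuple[str, str, int, int]]):
    """Return (r, q): smallest positive integer solution r of the balance
    equation and q = P * r, where P is the diagonal matrix of phase counts."""
    actors = list(phases)
    r = {actors[0]: Fraction(1)}
    changed = True
    while changed:  # propagate r_src * X = r_dst * Y across the channels
        changed = False
        for src, dst, x, y in channels:
            if src in r and dst not in r:
                r[dst] = r[src] * x / y
                changed = True
            elif dst in r and src not in r:
                r[src] = r[dst] * y / x
                changed = True
    if len(r) != len(actors):
        raise ValueError("graph is not connected")
    for src, dst, x, y in channels:  # every row of the balance equation must hold
        if r[src] * x != r[dst] * y:
            raise ValueError("graph is inconsistent")
    # Scale the rational solution to the smallest positive integers.
    scale = reduce(lambda a, b: a * b // gcd(a, b), (v.denominator for v in r.values()), 1)
    ints = {a: int(v * scale) for a, v in r.items()}
    g = reduce(gcd, ints.values())
    r_int = {a: v // g for a, v in ints.items()}
    q = {a: phases[a] * r_int[a] for a in actors}
    return r_int, q


# Hypothetical graph: A (2 phases) -> B (1 phase) -> C (3 phases).
phases = {"A": 2, "B": 1, "C": 3}
channels = [("A", "B", 3, 2), ("B", "C", 1, 3)]
r, q = repetition_vector(phases, channels)
print(r, q)
```

In one graph iteration of this hypothetical example, A fires 4 times and produces 6 tokens, which B consumes in 3 firings, so the net change on the channel is zero.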
Definition 2.3.2. For a consistent and live CSDF graph G, an actor iteration is the
invocation of an actor Ai ∈ A for qi times, and a graph iteration is the invocation of
every actor Ai ∈ A for qi times, where qi ∈ q⃗.
Corollary 2.3.1 (From [BELP96]). If a consistent and live CSDF graph G completes k
iterations, where k ∈ N, then the net change to the number of tokens in the buffers of G is
zero.
Lemma 2.3.1. Any acyclic consistent CSDF graph is live.
Proof. Bilsen et al. proved in [BELP96] that a CSDF graph is live if and only if every
cycle in the graph is live. Equivalently, a CSDF graph deadlocks only if it contains at
least one cycle. Thus, absence of cycles in a CSDF graph implies its liveness. ∎
Example 2.3.1. Figure 2.1 shows an example of a CSDF graph. This CSDF graph models
the SANLP program shown in Listing 1. The graph has four actors which correspond
to the functions in the program. The correspondence between the actors and the
functions is as follows: A1 corresponds to src, A2 corresponds to f1, A3 corresponds
to f2, and A4 corresponds to snk. The graph has five edges which represent the data
dependencies between the functions in the program. In this graph, there is one input
actor (i.e., A1) and one output actor (i.e., A4). The graph has three output paths given
by W = {W1 = {(A1, A2), (A2, A4)}, W2 = {(A1, A3), (A3, A4)}, W3 = {(A1, A4)}}.
[Figure 2.1: Example of a CSDF graph that corresponds to the SANLP program in Listing 1]
[Matrix equations of the example, incomplete in the source. Surviving rows of the
topology matrix Γ:
 2 −1  0  0
 1  0 −1  0
 3  0  0 −3
 0  1  0 −2
Surviving rows of the matrix P:
 3  0  0  0
 0  1  0  0
 0  0  1  0]
We show later in Chapter 3 how such a graph can be automatically derived from a given
SANLP program. ◻
2.4 Real-Time Scheduling
In this section, we introduce the real-time periodic task model, some important real-
time scheduling concepts, and schedulability analysis for uniprocessor and multipro-
cessor systems.
2.4.1 Task Model
A system is composed of a set of m identical processors {π1, π2, ⋯, πm}. These
processors execute a set of n tasks T = {T1, T2, ⋯, Tn}, and a task Ti may be preempted
at any time. A task Ti corresponds to a CSDF actor Ai and we model the tasks using the
real-time periodic task model. Under the periodic task model, each task is a recurrent
one with a constant inter-arrival time. A task Ti ∈ T is characterized by a 4-tuple of
integers Ti = (Si, Ci, Pi, Di). The tuple parameters are interpreted as follows: Si ≥ 0
is the start time of Ti in absolute time units, Ci > 0 is the Worst-Case Execution Time
(WCET) of Ti, Pi ≥ Ci is the task period (i.e., inter-arrival time) in relative time units,
and Di, where Ci ≤ Di ≤ Pi, is the deadline of Ti in relative time units.
A periodic task Ti is invoked at time instants ri,k = Si + kPi for all k ∈ N0. When
Ti is invoked, we say that Ti releases a job. The kth job (or invocation) of Ti is denoted
by Ti,k. Upon invocation, a task must execute for Ci time units. The deadline Di is
interpreted as follows: job Ti,k has to finish its execution before time di,k = ri,k + Di for
all k ∈ N0. If Di = Pi, then Ti is said to have an implicit deadline. If Di < Pi, then Ti is
said to have a constrained deadline. If all the tasks in a taskset T are implicit-deadline
tasks, then we say that T is an implicit-deadline taskset. Otherwise, we say that T is a
constrained-deadline taskset. Similarly, if all the tasks in a taskset T have the same start
time, then we say that T is synchronous. Otherwise, we say that T is asynchronous. For
synchronous tasksets, we assume that their start time is 0.
The utilization of a task Ti is ui = Ci/Pi. For a taskset T, the total utilization of T
is usum(T) = ∑Ti∈T ui and the maximum utilization factor of T is û(T) = maxTi∈T ui.
Similarly, the density of a task Ti is δi = Ci/min(Di, Pi), the total density of T is
δsum(T) = ∑Ti∈T δi, and the maximum density of T is δ̂(T) = maxTi∈T δi. Note that
the density is equivalent to the utilization for implicit-deadline tasks.
The processor demand, denoted by demand(Ti, t1, t2), of a task Ti over a time
interval [t1, t2] is the total computation time of all the jobs of Ti having activation time
and deadline within [t1, t2]. According to [BRH90], demand(Ti, t1, t2) is given by:
\[ \text{demand}(T_i, t_1, t_2) = \zeta(T_i, t_1, t_2) \cdot C_i \tag{2.9} \]
where ζ(Ti, t1, t2) is the total number of Ti jobs that are activated in the interval [t1, t2]
and have a deadline within the interval [t1, t2]. The authors in [BRH90] showed that
ζ(Ti, t1, t2) is given by:
\[ \zeta(T_i, t_1, t_2) = \max\left\{0, \left\lfloor \frac{t_2 - S_i - D_i}{P_i} \right\rfloor - \max\left\{0, \left\lceil \frac{t_1 - S_i}{P_i} \right\rceil \right\} + 1 \right\} \tag{2.10} \]
For a taskset T, the processor demand of T over the time interval [t1, t2] is denoted
by demand(T, t1, t2) and given by:
\[ \text{demand}(T, t_1, t_2) = \sum_{T_i \in T} \text{demand}(T_i, t_1, t_2) \tag{2.11} \]
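Equations (2.9) to (2.11) can be evaluated with integer arithmetic only, since all task parameters are integers. The following sketch assumes a `Task` tuple with the fields (S, C, P, D) of the task model; the type, the helper names, and the example tasks are illustrative.

```python
from typing import List, NamedTuple


class Task(NamedTuple):
    S: int  # start time
    C: int  # worst-case execution time
    P: int  # period
    D: int  # relative deadline


def ceil_div(a: int, b: int) -> int:
    """Ceiling of a / b for integers with b > 0."""
    return -(-a // b)


def zeta(task: Task, t1: int, t2: int) -> int:
    """Number of jobs released in [t1, t2] whose deadline is also in [t1, t2] (Eq. 2.10)."""
    last = (t2 - task.S - task.D) // task.P          # index of the last job with deadline <= t2
    first = max(0, ceil_div(t1 - task.S, task.P))    # index of the first job released at or after t1
    return max(0, last - first + 1)


def demand(tasks: List[Task], t1: int, t2: int) -> int:
    """Processor demand of a taskset over [t1, t2] (Eqs. 2.9 and 2.11)."""
    return sum(zeta(x, t1, t2) * x.C for x in tasks)


t = Task(S=0, C=2, P=5, D=4)   # jobs released at 0, 5, 10 with deadlines 4, 9, 14
print(demand([t], 0, 14), demand([t], 1, 14), demand([t], 0, 13))
```

Over [0, 14] all three jobs count, over [1, 14] the job released at time 0 drops out, and over [0, 13] the job with deadline 14 drops out.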
2.4.2 Scheduling Concepts
Given a system and a taskset T, a correct schedule is one that allocates a processor to a
task Ti ∈ T for exactly Ci time units in the interval [Si + kPi, Si + kPi + Di) for all k ∈ N0,
with the restriction that a task may not execute on more than one processor at the
same time. The schedule itself can be constructed either offline (i.e., at system
design-time) or online (i.e., at system run-time). Offline scheduling allows the designer to
perform computationally expensive optimizations to produce the best possible schedule
according to some criteria (e.g., minimizing energy). However, offline scheduling lacks
the flexibility to deal with new events at run-time. In contrast, online scheduling can
deal with such new events at run-time. However, online scheduling introduces extra
overheads due to the scheduling decisions taken during run-time. In the rest of this
dissertation, we assume online scheduling algorithms unless we explicitly mention
otherwise.
According to [DB11], real-time multiprocessor scheduling algorithms can be viewed
as attempting to solve two problems:
• The priority assignment problem: when and in what order with respect to other
tasks, each job should execute.
• The allocation problem: on which processor a task should execute and whether
a task can migrate between processors.
As a result, real-time multiprocessor scheduling algorithms can be classified based
on priority as follows [DB11]:
• Fixed task priority: Each task has a single priority that is used for all its jobs.
• Fixed job priority: The jobs of a single task may have different priorities; however,
each job has a single priority. An example of an algorithm using this policy is
the Earliest Deadline First (EDF, [LL73]) algorithm.
• Dynamic priority: A single job might have different priorities during the course
of its execution. An example is the Least Laxity First (LLF) algorithm.
Based on allocation, algorithms can be classified as follows [DB11]:
• No migration: Each task is allocated to a processor and no migration is permitted.
• Task-level migration: The jobs of a task may execute on different processors;
however, a single job can only execute on a single processor.
• Job-level migration: A job may execute on more than one processor; however, it
may not execute on more than one processor at the same time.
A scheduling algorithm that does not permit migration at all is said to be a partitioned
scheduling algorithm, while an algorithm that permits all tasks to be migrated between
all processors is said to be a global algorithm, and finally an algorithm that permits
a subset of tasks to be migrated among a subset of processors is said to be a hybrid
algorithm.
Graham [Gra69] showed in 1969 that a system may exhibit unexpected scheduling
“anomalies” even though the system operates under a “better” set of conditions. For
example, he showed that decreasing the WCET of tasks may result in an increase in the
schedule length. Such anomalies are known in the literature as Graham’s anomalies (or
scheduling anomalies). A scheduling algorithm under which scheduling anomalies
can never occur is said to be anomaly-free [AJ02]. Examples of anomaly-free
uniprocessor scheduling algorithms include EDF and fixed-task priority scheduling with
rate monotonic priority assignment. Partitioned multiprocessor scheduling algorithms in
which each processor is scheduled using an anomaly-free uniprocessor scheduling
algorithm are known to be anomaly-free as well [AJ02].
In this dissertation, we focus only on partitioned scheduling due to the following
reasons: (1) task migration on distributed-memory MPSoCs has a non-negligible
overhead, (2) partitioned scheduling has been shown to be the most suitable type for
hard real-time systems [Bra11], and (3) partitioned scheduling where each processor is
scheduled using an anomaly-free uniprocessor scheduling algorithm is anomaly-free
as mentioned earlier.
A taskset T is said to be feasible on a system if there exists a scheduling algorithm
that can construct a correct schedule for T on the system. A taskset T is said to be
schedulable using a scheduling algorithm 𝒜 if 𝒜 can construct a correct schedule for
T on the system. Finally, a scheduling algorithm 𝒜 is said to be optimal with respect
to a task model and a system if and only if it can schedule all tasksets that comply with
the task model and are feasible on the system.
On uniprocessor systems, EDF scheduling is known to be optimal for all periodic
tasksets [But11]. On multiprocessor systems, several global and hybrid algorithms are
known to be optimal for implicit-deadline periodic tasksets (e.g., [BCPV96, CRJ10,
FLS+11, RLM+12]). In contrast, optimal online scheduling of constrained-deadline
tasksets is impossible [FGB10]. Finally, partitioned scheduling is known to be
non-optimal for periodic tasksets [CFH+04].
A schedulability test for a scheduling algorithm 𝒜 decides, given a taskset T,
whether T is schedulable using 𝒜. Schedulability tests can be classified further into
[DB11]:
• Sufficient: If all tasksets that are deemed schedulable by the test are in fact
schedulable.
• Necessary: If all the tasksets that are deemed unschedulable by the test are in
fact unschedulable.
• Exact: If the test is both sufficient and necessary.
2.4.3 Uniprocessor Schedulability Analysis
On uniprocessor systems, the most popular classes of priority assignment are the
fixed-task priority and the fixed-job priority. In particular, EDF scheduling is the most
popular algorithm in the fixed-job class. Therefore, we present in this section the
schedulability tests for EDF and fixed-task priority scheduling.
Earliest Deadline First (EDF)
Under EDF, an exact test for implicit-deadline periodic tasksets is formulated in the
following theorem:
Theorem 2.4.1 (From [LL73]). An implicit-deadline periodic taskset T is schedulable
under EDF if and only if
usum(T) ≤ 1 (2.12)
For constrained-deadline periodic tasks, Baruah et al. [BRH90] derived in 1990 an
exact schedulability test for EDF on uniprocessor systems. The test is formulated in the
following lemma:
Lemma 2.4.1 (From [BRH90]). A periodic taskset T is feasible on one processor if and
only if usum(T) ≤ 1 and demand(T, t1, t2) ≤ (t2 − t1) for all 0 ≤ t1 < t2 < Ŝ + 2H, where
Ŝ = max{S1, ⋯, Sn} and H = lcm{P1, ⋯, Pn}.
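The exact test in Lemma 2.4.1 is known to be co-NP-hard in the strong sense
[BRH90]. A more efficient exact test for synchronous periodic tasksets was derived
in 2009 by Zhang and Burns [ZB09]. The test is known as the Quick-convergence
Processor-demand Analysis (QPA). Let T be a synchronous periodic taskset and let
Ď = min{D1, D2, ⋯, Dn}. The test uses two bounds, La and Lb, defined in (2.13) and
(2.14), and a bound L∗ derived from them.
[Equations (2.13) and (2.14), defining La and Lb: not recoverable from the source]
\[ L^{*} = \begin{cases} \min(L_a, L_b) & \text{when } u_{sum} < 1 \\ L_b & \text{when } u_{sum} = 1 \end{cases} \tag{2.15} \]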
Recall from Section 2.4.1 that di,k denotes the absolute deadline of the kth job of task
Ti and it is given by di,k = ri,k + Di. Zhang and Burns [ZB09] proved the following
theorem:
Theorem 2.4.2. A synchronous periodic taskset T is schedulable if and only if usum ≤ 1
and the result of the following algorithm is demand(T, 0, t) ≤ Ď.
t ← max{di,k : di,k < L∗}
while demand(T, 0, t) ≤ t and demand(T, 0, t) > Ď do
    if demand(T, 0, t) < t then
        t ← demand(T, 0, t)
    else
        t ← max{di,k : di,k < t}
    end if
end while
QPA can be used as a sufficient test for asynchronous periodic tasksets based on
the following lemma from [BG04].
Lemma 2.4.2 ([BG04]). Let T = {T1 = (S1, C1, P1, D1), T2 = (S2, C2, P2, D2), ⋯,
Tn = (Sn, Cn, Pn, Dn)} be an asynchronous periodic taskset. T is schedulable if the
synchronous periodic taskset T̃ = {T̃1 = (0, C1, P1, D1), T̃2 = (0, C2, P2, D2), ⋯, T̃n =
(0, Cn, Pn, Dn)} is schedulable.
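The iteration of Theorem 2.4.2 can be sketched for a synchronous taskset as follows. Two points are assumptions of this sketch rather than statements of the text: the starting bound is taken to be the length of the synchronous busy period (the fixed point of w = Σ ⌈w/Pi⌉ Ci, a bound commonly used in the literature), and all names (`Task`, `demand0`, `busy_period`, `qpa_schedulable`) are illustrative.

```python
from fractions import Fraction
from typing import List, NamedTuple, Optional


class Task(NamedTuple):
    C: int  # worst-case execution time
    P: int  # period
    D: int  # relative deadline (start time is 0 for every task)


def demand0(tasks: List[Task], t: int) -> int:
    """demand(T, 0, t) for a synchronous taskset."""
    return sum(max(0, (t - x.D) // x.P + 1) * x.C for x in tasks)


def last_deadline_before(tasks: List[Task], t: int) -> Optional[int]:
    """Largest absolute deadline k*P + D strictly smaller than t, or None."""
    best = None
    for x in tasks:
        if x.D < t:
            d = ((t - 1 - x.D) // x.P) * x.P + x.D
            best = d if best is None else max(best, d)
    return best


def busy_period(tasks: List[Task]) -> int:
    """ASSUMPTION: synchronous busy period, used here as the starting bound."""
    w = sum(x.C for x in tasks)
    while True:
        nxt = sum(-(-w // x.P) * x.C for x in tasks)  # sum of ceil(w / P) * C
        if nxt == w:
            return w
        w = nxt


def qpa_schedulable(tasks: List[Task]) -> bool:
    if sum(Fraction(x.C, x.P) for x in tasks) > 1:
        return False
    d_min = min(x.D for x in tasks)
    t = last_deadline_before(tasks, busy_period(tasks))
    if t is None:
        return True  # no deadline lies below the bound, nothing to check
    h = demand0(tasks, t)
    while h <= t and h > d_min:
        t = h if h < t else last_deadline_before(tasks, t)
        h = demand0(tasks, t)
    return h <= d_min


print(qpa_schedulable([Task(3, 6, 5), Task(4, 8, 8)]))
print(qpa_schedulable([Task(2, 4, 2), Task(2, 4, 3)]))
```

The first taskset has a total utilization of exactly 1 and passes the test; in the second one the demand at time 3 is 4, so the iteration stops immediately with a failure.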
Fixed-Task Priority
Under fixed-task priority scheduling, Joseph and Pandya [JP86] derived an exact
schedulability test for synchronous periodic tasksets based on Response Time Analysis
(RTA). Let hp(Ti) be the set of tasks with priorities higher than the priority of Ti and Ri
be the total response time of Ti. Then, the schedulability test is formulated in the
following theorem:
Theorem 2.4.3. A synchronous periodic taskset T is schedulable using fixed-task priority
scheduling if and only if
∀Ti ∈ T : Ri ≤ Di (2.16)
where the total response time Ri is given by solving the following fixed-point equation:
\[ R_i = C_i + \sum_{T_j \in hp(T_i)} \left\lceil \frac{R_i}{P_j} \right\rceil C_j \tag{2.17} \]
It is shown in [JP86] that (2.17) has a fixed-point if usum(T) ≤ 1. The test in Theorem
2.4.3 can be used as a sufficient test for asynchronous periodic tasksets [Aud91].
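The fixed-point equation of the response time analysis is usually solved by iterating from Ri = Ci until the value stops changing or exceeds the deadline. The sketch below follows that common scheme; the `Task` type, the function names, and the example tasksets are illustrative, and the tasks are assumed to be listed from highest to lowest priority.

```python
from typing import List, NamedTuple, Optional, Tuple


class Task(NamedTuple):
    C: int  # worst-case execution time
    P: int  # period
    D: int  # relative deadline


def response_time(task: Task, higher: List[Task]) -> Optional[int]:
    """Smallest fixed point of R = C + sum(ceil(R / Pj) * Cj), or None if R exceeds D."""
    r = task.C
    while r <= task.D:
        nxt = task.C + sum(-(-r // h.P) * h.C for h in higher)
        if nxt == r:
            return r
        r = nxt
    return None


def rta_schedulable(tasks_by_priority: List[Task]) -> Tuple[bool, List[Optional[int]]]:
    """Apply the test Ri <= Di to every task; tasks are ordered highest priority first."""
    times = [response_time(t, tasks_by_priority[:i])
             for i, t in enumerate(tasks_by_priority)]
    return all(r is not None for r in times), times


print(rta_schedulable([Task(1, 4, 4), Task(2, 6, 6), Task(3, 12, 12)]))
print(rta_schedulable([Task(2, 4, 4), Task(3, 6, 6)]))
```

In the first taskset the lowest-priority task converges through the values 3, 6, 7, 9, and 10, which is within its deadline of 12. In the second, the response time of the second task grows past its deadline of 6, so the taskset fails the test.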
An important question in fixed-task priority scheduling is how to assign the priorities
to the tasks. Let prio(Ti) ∈ N denote the priority of task Ti. The lowest priority value
is 1 and a higher value of prio(Ti) means a higher priority. Several priority schemes
are proposed in the literature such as:
• Rate Monotonic (RM) [LL73]: Let Ti and Tj be two tasks. Under RM, if Pi <
Pj, then prio(Ti) > prio(Tj). Liu and Layland [LL73] derived a sufficient
schedulability test for implicit-deadline tasksets under fixed-task priority
scheduling when the priorities are assigned according to the RM scheme. The
test is:
Theorem 2.4.4 (From [LL73]). Let T be an implicit-deadline synchronous taskset
where the priorities are assigned according to the RM rule and the tasks are
scheduled using a fixed-task priority scheduler. T is schedulable if
usum(T) ≤ n(2^{1/n} − 1) (2.18)
where n = |T|.
If the size of the taskset grows significantly (i.e., n → ∞), then the right-hand
side of (2.18) tends to ln(2) ≈ 0.693. This means that any implicit-deadline
synchronous taskset with a total utilization of at most 0.69 is schedulable using
fixed-task priority scheduling with RM priority assignment. The RM priority
assignment rule is shown to be optimal for synchronous implicit-deadline tasksets
when scheduled using a fixed-task priority scheduler [LL73].
• Deadline Monotonic (DM) [LW82]: Let Ti and Tj be two tasks. Under DM, if
Di < Dj, then prio(Ti) > prio(Tj). The exact schedulability test when using
the DM policy is the same test presented in Theorem 2.4.3. The DM priority
assignment rule is shown to be optimal for synchronous constrained-deadline
tasksets when scheduled using a fixed-task priority scheduler [LW82].
• Optimal Priority Assignment (OPA) [Aud91]: The optimal priority assignment
policy proposed by Audsley in 1991 is an optimal priority assignment for
asynchronous periodic tasksets when scheduled using fixed-task priority scheduling.
The algorithms for setting the priorities and checking the schedulability can be
found in [Aud91].
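The RM and DM rules and the utilization bound of Theorem 2.4.4 can be illustrated with a short sketch. The function names and the example values are hypothetical; ties between equal periods or deadlines are broken arbitrarily here, which the rules above permit.

```python
import math
from typing import Dict, List, Tuple


def monotonic_priorities(keys: Dict[str, int]) -> Dict[str, int]:
    """Assign priorities 1 (lowest) to n (highest): a smaller key means a higher priority.
    Pass periods as keys for RM and relative deadlines as keys for DM."""
    order = sorted(keys, key=lambda name: keys[name], reverse=True)
    return {name: prio for prio, name in enumerate(order, start=1)}


def ll_bound(n: int) -> float:
    """Right-hand side of (2.18): n * (2^(1/n) - 1)."""
    return n * (2 ** (1.0 / n) - 1)


def rm_sufficient_test(tasks: List[Tuple[int, int]]) -> bool:
    """Sufficient RM test for implicit-deadline tasks given as (C, P) pairs."""
    usum = sum(c / p for c, p in tasks)
    return usum <= ll_bound(len(tasks))


print(monotonic_priorities({"T1": 10, "T2": 5, "T3": 20}))
for n in (1, 2, 3, 10, 1000):
    print(n, round(ll_bound(n), 4))
print("ln 2 =", round(math.log(2), 4))
print(rm_sufficient_test([(1, 4), (1, 6), (1, 12)]))
print(rm_sufficient_test([(1, 4), (2, 6), (3, 12)]))
```

The last call returns False because the total utilization of about 0.83 exceeds the bound for three tasks (about 0.78). Since the test is only sufficient, such a result is inconclusive rather than a proof that the taskset is unschedulable.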
2.4.4 Multiprocessor Schedulability Analysis
An exact schedulability test for implicit-deadline periodic tasksets on m processors is:
usum(T) ≤ m (2.19)
Based on (2.19), one can compute the absolute minimum number of processors
needed to schedule an implicit-deadline taskset T as follows:
m̌OPT = ⌈usum(T)⌉ (2.20)
It is important to notice that (2.19) and (2.20) are valid only assuming optimal
scheduling algorithms, which are either global or hybrid. However, as mentioned
earlier in Section 2.4.2, we consider in this dissertation partitioned scheduling only.
With partitioned scheduling, one must first allocate the tasks to processors. Then,
the tasks on each processor are scheduled using a uniprocessor scheduling algorithm.
Recall from Definition 2.1.1 on page 23 that ^xT denotes an x-partition of a taskset T. The
minimum number of processors needed to schedule a taskset T assuming partitioned
scheduling is given by:
m̌PAR = min{x ∈ N | ∃ ^xT and ∀i : ^xT_i is schedulable} (2.21)
Partitioning Schemes
Let V be a set of n items and B be a set of m containers (i.e., bins). Each item Vi ∈ V has
a size, denoted by size(Vi), where size(Vi) ∈ [0, 1]. Similarly, each container Bi has
a maximum capacity equal to 1, and a current capacity, denoted by capacity(Bi),
which gives the total size of the items currently placed in Bi. We seek for V an
m-partition, denoted by ^mV, such that the total size of the items in each ^mV_i is less than
or equal to the maximum capacity of container Bi. That is
\[ \forall i = 1, 2, \cdots, m : \sum_{V_j \in {}^{m}V_i} \text{size}(V_j) \le 1 \tag{2.22} \]
Solving the aforementioned partitioning problem requires solving the classical bin
packing problem which is known to be NP-hard [GJ79]. Therefore, many heuristics
exist for partitioning items over bins such as First-Fit, Best-Fit, Worst-Fit, etc. Below,
we summarize from [Joh74, CGJ96] some of the most used heuristics.
• First-Fit (ff): ff is a straightforward solution to the assignment problem. An
item Vi is placed in the first (i.e., lowest indexed) bin Bj that can accommodate
Vi. That is
j = min{k : size(Vi) + capacity(Bk) ≤ 1} (2.23)
If no such bin exists, then create a new bin and place Vi into it.
• Best-Fit (bf): bf places an item Vi into a bin Bj such that Bj will have minimal
remaining capacity after placing Vi. That is
j = min{k : size(Vi) + capacity(Bk) is closest to, without exceeding, 1} (2.24)
If no such bin exists, then create a new bin and place Vi into it.
• Worst-Fit (wf): wf is the opposite of bf as it tries to place an item Vi into a
bin Bj that will have maximal remaining capacity after placing Vi. That is
j = min{k : size(Vi) + capacity(Bk) is minimized} (2.25)
If no such bin exists, then create a new bin and place Vi into it.
Table 2.2: Approximation ratios for known bin packing heuristics
Heuristic   Approximation ratio                    Source
FF          R_ff = 17/10                           [CGJ96]
BF          R_bf = 17/10                           [GJ79]
FFD         ∀V : ffd(V) ≤ 11/9 · opt(V) + 1        [Yue91]
It is also possible to perform a preprocessing step before performing the heuristic.
This preprocessing step is usually sorting the items based on some criteria, such as their
sizes. This leads us to the following heuristics:
• First-Fit-Decreasing (ffd): In this heuristic, the items are first sorted, in
decreasing (i.e., descending) order, based on their sizes. Then, ff is performed on
the sorted items.
• Best-Fit-Decreasing (bfd): Similar to ffd, the items are first sorted, in
descending order, based on their sizes. Then, bf is performed on the sorted items.
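The heuristics above differ only in how they pick a bin among those that can still accommodate the item. The following sketch makes this explicit; the function `pack`, its `rule` argument, and the example item sizes are illustrative, and exact fractions are used so that the capacity check is not affected by floating-point rounding.

```python
from fractions import Fraction
from typing import List


def pack(sizes: List[Fraction], rule: str) -> List[List[Fraction]]:
    """Place the items one by one using rule 'ff', 'bf', or 'wf'; bins have capacity 1."""
    loads: List[Fraction] = []       # current capacity (total size placed) of each bin
    bins: List[List[Fraction]] = []
    for s in sizes:
        if not 0 <= s <= 1:
            raise ValueError("item size must lie in [0, 1]")
        fits = [k for k, load in enumerate(loads) if load + s <= 1]
        if not fits:                 # no bin can accommodate the item: open a new bin
            loads.append(s)
            bins.append([s])
            continue
        if rule == "ff":             # lowest-indexed bin that fits
            j = fits[0]
        elif rule == "bf":           # fullest bin that fits, lowest index on ties
            j = max(fits, key=lambda k: (loads[k], -k))
        elif rule == "wf":           # emptiest bin that fits, lowest index on ties
            j = min(fits, key=lambda k: (loads[k], k))
        else:
            raise ValueError("unknown rule: " + rule)
        loads[j] += s
        bins[j].append(s)
    return bins


def pack_decreasing(sizes: List[Fraction], rule: str) -> List[List[Fraction]]:
    """ffd / bfd: sort the items in descending order of size, then apply the rule."""
    return pack(sorted(sizes, reverse=True), rule)


def tenths(bins):
    """Helper for display: express each item size in tenths."""
    return [[int(s * 10) for s in b] for b in bins]


sizes = [Fraction(n, 10) for n in (5, 7, 5, 2, 4, 2, 5, 1)]
print(tenths(pack(sizes, "ff")))
print(tenths(pack(sizes, "wf")))
print(tenths(pack_decreasing(sizes, "ff")))
```

For this item list all variants use four bins, which is optimal because the total size is 3.1; the variants differ in how evenly the bins are filled.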
An important metric in evaluating the performance of the aforementioned heuristics
is their approximation ratio. Let opt(V) be the minimum number of bins needed
to partition a set of items V using the optimal partitioning scheme. Then, for a given
heuristic h(V), its approximation ratio, denoted by R_h, is given by
R_h = inf{r ≥ 1 : h(V)/opt(V) ≤ r for all V} (2.26)
Several approximation ratios are available for the heuristics described earlier. These
ratios are summarized in Table 2.2.
Chapter 3




A good scientist is a person with original ideas.
A good engineer is a person who makes a design
that works with as few original ideas as possible.
Freeman Dyson
Recall from Figure 1.7 on page 14 that the first two steps in the proposed design flow
are automated parallelization and model construction. In this chapter, we
explain, in detail, these two steps and how they are performed in the proposed design
flow. It is important to note that these two steps are not contributions of this dissertation
but they are presented here to make the dissertation self-contained.
3.1 Input Programs
We assume that an input program is written in the ANSI 89 C programming language
and consists of two parts: a top-level part and an implementation part. ¿ese two parts
are explained in detail in the next two sections.
3.1.1 Top-Level Part
This part consists of the main function of the program. It defines the top-level structure
of the algorithm realized by the program and intended for parallelization. We assume
that the main function is written as a Static Affine Nested Loop Program (SANLP)
and does not contain anti-dependencies (see Section 2.2). During parallelization, the
automated parallelization tool processes only this part to extract all the dependencies
and generate a parallel version of it. An example of a valid top-level file is shown in
Listing 1.
[Figure 3.1: Automated parallelization and model construction]
In [VNS07], SANLPs have been defined as programs that can be represented using the
polytope model. However, we prefer to use a definition that characterizes the semantics
of the program. Such a definition is the following one from [Mei10].
Definition 3.1.1 (Static Affine Nested Loop Program (SANLP)). A SANLP is a program
where each program statement is enclosed by one or more loops and if-statements, and
where:
C1 loops have a constant step size;
C2 loops have bounds that are affine expressions of the enclosing loop iterators,
static program parameters, and constants;
C3 if-statements have affine conditions in terms of the loop iterators, static program
parameters, and constants;
C4 index expressions of array references are affine constructs of the enclosing loop
iterators, static program parameters, and constants;
C5 data flow between statements in the loop is explicit, which prohibits that two
statements that contain function calls communicate through shared variables
outside the main function.
3.1.2 Implementation Part
This part consists of all the remaining functions of the program. These functions can
contain arbitrary code as long as this code does not break condition C5 from Definition
3.1.1. An example of an implementation part is the implementation of functions src,
f1, f2, and snk from Listing 1.
3.2 Automated Parallelization
This step is realized using the PNgen compiler [VNS07]. PNgen accepts as input the
top-level part written in SANLP form. After that, it performs array dataflow analysis
[Fea91] on the SANLP and translates each non-control statement in the top-level
part into a separate process. A process encapsulates the statement from which it got
translated and represents a separately executing entity¹ in the parallel version of the
program. The processes communicate among each other using FIFOs in accordance
with the semantics of Kahn Process Networks (KPN, [Kah74]). In KPN, the processes
are sequential programs that communicate over FIFOs and a process blocks when it
attempts to read from an empty FIFO.
Since we consider programs consisting of nested loops, each process (i.e., statement)
is executed for a number of iterations. ¿e set of iterations in which a process is
executed is called process domain. ¿e process domain is represented using the polytope
model [Fea96]. A polytope𝒟 is a bounded subset ofQn given by
𝒟 = {x⃗ ∈ Qn ⋃︀ Fx⃗ ≥ b⃗} (3.1)
where F ∈ Nk×n and b⃗ ∈ Nk . Since a process encapsulates a non-control statement,
the process has to read the input data needed by the statement, and write the output
data produced by the statement. Reading and writing data is done through process
input/output ports. ¿e ports of a process are connected to the FIFO channels inter-
connecting it with other processes. Similar to the process domain, PNgen constructs
input/output domains as polytopes. Due to the fact that PNgen represents the exe-
cution and communication using the polytope model, the generated process-based
representation is called Polyhedral Process Network (PPN, [VNS07]). To illustrate the
concepts of process, input/output ports, and domain, we give the following example.
Example 3.2.1. Consider the program in Listing 1 on page 6. PNgen processes this
program to produce the parallel program shown in Figure 3.2. The arrows
represent the FIFO channels and the black dots represent the ports. READ and WRITE
represent the primitives used to read from and write to the FIFOs, respectively. The
detailed implementation of these primitives is explained later in Chapter 5. For example,
process $\mathcal{P}_{snk}$ has three input ports, represented by IP1, IP2, and IP3. Since function
snk does not produce any data for other functions, process $\mathcal{P}_{snk}$ does not have any
output ports. IP1 provides the data produced by process $\mathcal{P}_{f1}$, IP2 provides the data
produced by process $\mathcal{P}_{f2}$, and IP3 provides the data produced by process $\mathcal{P}_{src}$. The
process domain of process $\mathcal{P}_{snk}$ is given by:
$\mathcal{D}_{snk} = \{(w, i, j) \in \mathbb{Z}^3 \mid w > 0 \wedge 1 \leq i \leq 10 \wedge 1 \leq j \leq 3\}$
¹ This entity can be mapped onto an OS process or thread.

Figure 3.2: The parallel program corresponding to the SANLP shown in Listing 1. This parallel
program is produced using the PNgen compiler.
Reading data produced by function f1 to initialize the first argument in1 of function
snk is represented by the following input domain:
$\mathcal{D}_{IP1} = \{(w, i, j) \in \mathbb{Z}^3 \mid w > 0 \wedge 1 \leq i \leq 10 \wedge 1 \leq j \leq 2\}$
In a similar way, reading data produced by function f2 to initialize the first argument
in1 of function snk is represented by the following input domain:
$\mathcal{D}_{IP2} = \{(w, i, j) \in \mathbb{Z}^3 \mid w > 0 \wedge 1 \leq i \leq 10 \wedge j = 3\}$
Finally, reading data produced by function src to initialize the second argument in2
of function snk is represented by the following input domain:
$\mathcal{D}_{IP3} = \{(w, i, j) \in \mathbb{Z}^3 \mid w > 0 \wedge 1 \leq i \leq 10 \wedge 1 \leq j \leq 3\}$
These port domains are visualized in Figure 3.3. The dots encapsulated by rectangles
represent the port domains, and the arrows show the domain execution (i.e., scanning)
order. ◻
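The port domains of Example 3.2.1 can be enumerated directly. The following is a minimal Python sketch, not PNgen output: the function and dictionary names are hypothetical, dimension w is fixed to a single value, and the bounds are copied from the three input domains of the example.

```python
from itertools import product

def snk_domain(i_max=10, j_max=3):
    """Iterations (i, j) of P_snk for one fixed w, in lexicographic order."""
    return [(i, j) for i, j in product(range(1, i_max + 1), range(1, j_max + 1))]

# Port domains as predicates over (i, j); bounds taken from D_IP1, D_IP2, D_IP3.
PORT_DOMAINS = {
    "IP1": lambda i, j: 1 <= j <= 2,   # data produced by f1
    "IP2": lambda i, j: j == 3,        # data produced by f2
    "IP3": lambda i, j: 1 <= j <= 3,   # data produced by src
}

def ports_read(i, j):
    """Names of the input ports accessed in iteration (i, j)."""
    return sorted(name for name, inside in PORT_DOMAINS.items() if inside(i, j))

domain = snk_domain()
print(len(domain))        # 30
print(ports_read(1, 1))   # ['IP1', 'IP3']
print(ports_read(1, 3))   # ['IP2', 'IP3']
```

In every iteration exactly one of IP1 and IP2 is read, which is consistent with both ports feeding the same argument in1 of snk.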
Figure 3.3: The port domains of $\mathcal{P}_{snk}$ shown in Figure 3.2
3.3 Model Construction
Model construction is the second step in the proposed design flow shown in Figure 1.7
on page 14. As mentioned in Section 1.3, we adopt the CSDF model as a performance
analysis model in the proposed design flow. It is straightforward to see that PPNs
generated by PNgen share the following similarities with CSDF:
• Both PPN and CSDF are represented as directed graphs, where nodes correspond
to either processes (in PPN) or actors (in CSDF), and edges correspond to
FIFOs.
• For a given PPN, the equivalent CSDF graph has the same topology as the PPN.
However, PPN and CSDF differ in the following aspects:
• Communication in PPN is represented using input/output port domains, while
in CSDF, it is represented using sequences of production/consumption rates.
• Synchronization in PPN is implemented by blocking reads/writes, while in
CSDF, it is implemented explicitly using a schedule.
Deprettere et al. showed in [DSBS06] that for any non-parameterized PPN, it
is possible to derive a functionally equivalent CSDF graph, where the production
and consumption rate sequences consist of only 0s and 1s. A ‘0’ in the sequence
indicates that a token is not produced/consumed, while a ‘1’ indicates that a token
is produced/consumed. The authors in [BZNS12] proposed an algorithm to derive
the CSDF graph automatically from a given PPN. We show how this algorithm works
through the following example.
Example 3.3.1. Consider the PPN shown in Figure 3.2. The first step in deriving the
equivalent CSDF graph is to derive the CSDF topology. The CSDF graph (shown in
Figure 2.1 on page 29) has the same topology as the PPN in Figure 3.2. The second
step is to derive the access patterns of the processes. For each process in the PPN
shown in Figure 3.2, process variants are derived. A process variant $v$ of a process
$\mathcal{P}$ is a tuple $(\mathcal{D}_v, ports)$, where $\mathcal{D}_v \subseteq \mathcal{D}_{\mathcal{P}}$ is the variant domain, and ports is a set of
input/output ports that are accessed in all iterations of $\mathcal{D}_v$. A process variant captures
the consumption/production behavior of the process. For example, consider process
$\mathcal{P}_{snk}$ in Figure 3.2. It has the following process variants:
$v_1 = (\{(w, i, j) \in \mathbb{Z}^3 \mid w > 0 \wedge 1 \leq i \leq 10 \wedge 1 \leq j \leq 2\}, \{IP1, IP3\})$ (3.2)
$v_2 = (\{(w, i, j) \in \mathbb{Z}^3 \mid w > 0 \wedge 1 \leq i \leq 10 \wedge j = 3\}, \{IP2, IP3\})$ (3.3)
Process $\mathcal{P}_{snk}$ always reads data from input ports IP1 and IP3 in process variant $v_1$, while
it always reads data from input ports IP2 and IP3 in process variant $v_2$. The domains of
$v_1$ and $v_2$ are depicted in Figure 3.4.

Figure 3.4: The domains of $v_1$ and $v_2$. Dimension $w$ is omitted from the picture.

The third step is to construct a one-dimensional sequence of process variants by
scanning the process domain. To do that, we first project out dimension $w$ from all
the domains. Then, we build a sequence of the process domain points using their
lexicographical order. For example, for process $\mathcal{P}_{snk}$, the process domain can be
represented as the following sequence of loop iterator values $(i, j)$ according to the
loop execution shown in Figure 3.4 (i.e., lexicographical order):
$\{(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), \dots, (10, 3)\}$ (3.4)
After that, we replace each iteration with the process variant to which it belongs. This
results in the following sequence:
$\mathcal{S}_{snk} = \{v_1, v_1, v_2, v_1, v_1, v_2, \dots, v_1, v_1, v_2\}$ (3.5)
In general, sequence $\mathcal{S}_{\mathcal{P}}$ has a length equal to $|\mathcal{D}_{\mathcal{P}}|$; hence, it might be very long.
Therefore, it is desirable to express $\mathcal{S}_{\mathcal{P}}$ using the shortest repetitive pattern that covers
the whole sequence. This is accomplished in [BZNS12] using a special data structure
called suffix trees [Ukk95]. For example, for process $\mathcal{P}_{snk}$ and its sequence of variants
$\mathcal{S}_{snk}$, the shortest repetitive pattern is $\{v_1, v_1, v_2\}$.
Table 3.1
Port  v1  v1  v2
IP1   1   1   0
IP2   0   0   1
IP3   1   1   1
The last step in deriving the equivalent CSDF graph is to derive the production
and consumption rate sequences from the shortest repetitive pattern. For each port
of a CSDF actor, a consumption/production rate sequence is generated. This is done
by building a table in which each row corresponds to an input/output port, and each
column corresponds to a process variant in the shortest repetitive pattern. If the
input/output port is in the set of ports of the process variant, then its entry in the
table is ‘1’. Otherwise, its entry is ‘0’. Each row in the resulting table represents a
consumption/production rate sequence for the corresponding input/output port.
For example, consider process $\mathcal{P}_{snk}$ in Figure 3.2. The consumption and production
rate sequences of the corresponding CSDF actor $A_4$ in Figure 2.1 on page 29 are
generated as shown in Table 3.1. We see from Table 3.1 that the consumption/production
rate sequences for the ports are the same as the ones shown in Figure 2.1. ◻
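The last three steps of Example 3.3.1 can be sketched in a few lines of Python. This is an illustrative sketch only: the helper names are hypothetical, and the shortest repetitive pattern is found by brute force over prefix lengths rather than with the suffix trees used in [BZNS12].

```python
def variant_of(i, j):
    """Process variant of P_snk that iteration (i, j) belongs to (Eqs. 3.2, 3.3)."""
    return "v1" if j <= 2 else "v2"

# Ports accessed by each variant, as listed in Eqs. (3.2) and (3.3).
VARIANT_PORTS = {"v1": {"IP1", "IP3"}, "v2": {"IP2", "IP3"}}

def shortest_repetitive_pattern(seq):
    """Shortest prefix whose repetition reproduces seq exactly (brute force)."""
    n = len(seq)
    for length in range(1, n + 1):
        if n % length == 0 and seq[:length] * (n // length) == seq:
            return seq[:length]
    return seq

def rate_sequences(pattern, ports):
    """One 0/1 rate sequence per port, one entry per variant in the pattern."""
    return {p: [1 if p in VARIANT_PORTS[v] else 0 for v in pattern] for p in ports}

# Scan the domain in lexicographic order with dimension w projected out.
sequence = [variant_of(i, j) for i in range(1, 11) for j in range(1, 4)]
pattern = shortest_repetitive_pattern(sequence)
rates = rate_sequences(pattern, ["IP1", "IP2", "IP3"])
print(pattern)   # ['v1', 'v1', 'v2']
print(rates)     # {'IP1': [1, 1, 0], 'IP2': [0, 0, 1], 'IP3': [1, 1, 1]}
```

The resulting rows match the entries of Table 3.1. The brute-force search is adequate for short sequences; it is a stand-in, not the method of [BZNS12].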
Chapter 4
Scheduling Framework
Everything should be made as simple as possible,
but not simpler
Albert Einstein
Recall from Section 1.1.3 that several algorithms from hard real-time multiprocessor
scheduling theory can perform fast admission and scheduling decisions
for incoming programs while providing hard real-time guarantees. Moreover, these
algorithms provide temporal isolation which is, as mentioned in Chapter 1, the ability
to start/stop programs, at run-time, without violating the timing requirements of other
already running programs. However, most of these algorithms assume independent
real-time periodic or sporadic tasks [DB11]. Such a simple task model is not directly
applicable to modern embedded streaming programs. Modern streaming programs,
as shown in Section 2.3, are modeled as dataflow graphs, which means that the graph
actors have data-dependency constraints and do not necessarily conform to the
real-time periodic or sporadic task models. Therefore, we need to link the dataflow MoCs
used for performance analysis of the programs and the real-time task models used for
timing analysis. Such a link is proposed in this chapter. Given a streaming program
modeled as an acyclic CSDF graph, we analytically prove that it is possible to execute
the graph actors as a set of real-time periodic tasks. We present an analytical scheduling
framework for computing the parameters of the periodic tasks corresponding to the
graph actors and the minimum buffer sizes of the communication channels such that a
valid schedule is guaranteed to exist. The computed parameters of the periodic tasks
are the period, deadline, and start time as defined in Section 2.4. These parameters
are used, as shown in Figure 4.1, to derive the architecture and mapping specifications
needed later in Chapter 5 to perform system-level synthesis. Furthermore, we present
several results related to the quality of periodic scheduling of programs modeled as
acyclic CSDF graphs. By considering acyclic CSDF graphs, our findings are applicable
to most streaming programs since it has been shown recently in [TA10] that around
90% of streaming programs can be modeled as acyclic SDF graphs, which are a subset
of CSDF graphs.

Figure 4.1: Scheduling framework

4.1 Input Streams
In the remainder of this chapter, a graph $G$ refers to an acyclic consistent CSDF graph.
A graph $G$ has a set of periodic input streams connected to its input actors. The set of
input streams connected to an actor $A_i$ is disjoint from all other sets of input streams
connected to other actors. That is, no two actors share the same input
stream. Let $I_{i,j}$ be the $j$th input stream received by actor $A_i$ and $P_i$ be the period of $A_i$.
We assume that $I_{i,j}$ satisfies the following: (1) it starts either prior to or together with $A_i$,
(2) it has a constant inter-arrival time equal to $P_i$, and (3) it has jitter bounds $[J^-_{i,j}, J^+_{i,j}]$.
Let $t'$ be the time at which $I_{i,j}$ starts. The interpretation of the jitter bounds is that the
$k$th sample of the stream is expected to arrive in the interval $[t' + kP_i - J^-_{i,j},\ t' + kP_i + J^+_{i,j}]$.
If a sample arrives in the interval $[t' + kP_i - J^-_{i,j},\ t' + kP_i)$, then it is called an early
sample. On the other hand, if the sample arrives in the interval $(t' + kP_i,\ t' + kP_i + J^+_{i,j}]$,
then it is called a late sample.
Figure 4.2: Occurrence of $t_{MIT}(I_{i,j})$. Down arrows represent sample arrival.
Figure 4.3: Computing $t_{buffer}(I_{i,j})$. Down arrows represent sample arrival.

Suppose that $A_i$ is started at the same time as $I_{i,j}$. It is trivial to show that
early samples do not affect the periodicity of the input actor as the samples arrive prior
to the actor release time. Late samples, however, pose a problem as they might arrive
after an actor is released, which causes the actor to block. To overcome this, it is
possible to insert a de-jitter buffer before each input actor. The buffer accumulates
a certain amount of incoming samples to the input actor. This accumulation occurs
for a certain duration of time, denoted by $t_{buffer}(I_{i,j})$, before starting the input actor.
$t_{buffer}(I_{i,j})$ has to be computed such that once the input actor is started, it always finds
data in the buffer. Suppose that $J^-_{i,j}, J^+_{i,j} \in [0, P_i]$. Then, we can derive the minimum
value for $t_{buffer}(I_{i,j})$ and the minimum de-jitter buffer size. In order to do that, we start
by proving the following lemma:
Lemma 4.1.1. Let $I_{i,j}$ be a jittery input stream with $J^-_{i,j}, J^+_{i,j} \in [0, P_i]$. The maximum
inter-arrival time between any two consecutive samples in $I_{i,j}$, denoted by $t_{MIT}(I_{i,j})$,
satisfies:
$t_{MIT}(I_{i,j}) = 3P_i$ (4.1)
Proof. Based on the jitter model, $t_{MIT}(I_{i,j})$ occurs when the $k$th sample is early by the
maximum value of jitter (i.e., arrives at time $t = kP_i - P_i$), and the $(k+1)$th sample is late
by the maximum value of jitter (i.e., arrives at time $t = (k+1)P_i + P_i$). The two arrivals
are therefore separated by $((k+1)P_i + P_i) - (kP_i - P_i) = 3P_i$. This is illustrated
in Figure 4.2. ∎
Lemma 4.1.2. An input actor $A_i \in A$ is guaranteed to always find an input sample in
each of its input de-jitter buffers if the following holds:
$\forall I_{i,j} : t_{buffer}(I_{i,j}) \geq 2P_i$ (4.2)
Proof. During the time interval $(t, t + t_{MIT}(I_{i,j}))$, $A_i$ can fire at most twice. Therefore,
it is necessary to buffer at least two samples in order to guarantee that the input actor $A_i$
can continue firing periodically when the samples are separated by $t_{MIT}(I_{i,j})$ time-units.
Suppose that $I_{i,j}$ is started at time $t'$ and suppose also that the first and second samples
are delayed by the maximum jitter value (i.e., $P_i$) as shown in Figure 4.3. Hence, the
duration needed to accumulate two samples before starting $A_i$ (i.e., $t_{buffer}(I_{i,j})$) has to
be equal to or greater than $2P_i$. ∎
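The sufficiency of the condition in Lemma 4.1.2 can be checked with a small randomized experiment. The Python sketch below rests on simplifying assumptions that are not part of the text: $t' = 0$, samples are indexed from 0, the jitter of each sample is drawn uniformly from $[-P_i, P_i]$, and firing $k$ needs only sample $k$. The function name is hypothetical. It checks sufficiency only and says nothing about minimality.

```python
import random

def first_underflow(period, n_samples, start_delay, seed=0):
    """Return the index of the first firing that finds no sample, or None.

    Sample k nominally arrives at k * period (t' = 0) and is shifted by a
    jitter drawn from [-period, +period]. Firing k happens at
    start_delay + k * period and needs sample k to have arrived by then.
    """
    rng = random.Random(seed)
    for k in range(n_samples):
        arrival = k * period + rng.uniform(-period, period)
        firing = start_delay + k * period
        if arrival > firing:
            return k
    return None

P = 10
print(first_underflow(P, 10_000, start_delay=2 * P))   # None
```

With a start delay of $2P_i$, the latest possible arrival of sample $k$ is $kP_i + P_i$, which precedes firing $k$ at $kP_i + 2P_i$, so no underflow can occur in this model.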
Lemma 4.1.3. Let $A_i$ be an input actor and $I_{i,j}$ be a jittery input stream to $A_i$. Suppose
that $I_{i,j}$ starts at time $t'$ and $A_i$ starts at time $t' + 2P_i$. The de-jitter buffer must be able to
hold at least three samples.
Proof. Suppose that the $(k-1)$th and $(k+1)$th samples arrive late and early, respectively,
by the maximum amount of jitter. This means that they arrive at time $t' + kP_i$. Now,
suppose that the $k$th sample arrives with no jitter. This means that at time $t' + kP_i$
there are three samples arriving simultaneously. Hence, the de-jitter buffer must be
able to store them. During the interval $[t' + kP_i,\ t' + (k+1)P_i)$, there are no incoming
samples and $A_i$ processes the $(k-1)$th sample. At time $t' + (k+1)P_i$, the $(k+2)$th sample
might arrive, which means that there are again three samples available to $A_i$. By the
periodicity of $A_i$ and $I_{i,j}$, the previous pattern can repeat. ∎
The main advantage of the de-jitter buffer approach is that the actors are still
scheduled as periodic tasks. However, the disadvantages are the extra delay in starting the
input actors and the extra memory needed for the buffers. Unless otherwise mentioned,
we assume that each input stream is connected to a de-jitter buffer whose capacity
is at least the bound given by Lemma 4.1.3.
4.2 Basic Denitions
First, we introduce the following denitions.
Denition 4.2.1 (Execution Time Vector). For a graphG = (A, E), an execution time
vector C⃗, where C⃗ ∈ N⋃︀A⋃︀, represents the worst-case execution times, measured in










E l ∈inp(A j)









where𝒩 j is the length of CSDF ring/production/consumption sequences of actor A j,
CR is the worst-case time needed to read a single token from an input channel, y lj is
the consumption sequence of A j from channel El , CW is the worst-case time needed
to write a single token to an output channel, xrj is the production sequence of A j into
channel Er , and CCj (k) is the worst-case computation time of A j in ring k.
The worst-case computation time of an actor $A_j$ is the worst-case time needed
to execute the function corresponding to $A_j$ in the PPN. For example, consider the
PPN in Figure 3.2 on page 42. The worst-case computation time of process $\mathcal{P}_{f2}$ is
the worst-case time needed to execute function f2(in1). The worst-case
computation time can be obtained in two ways [WEE+08]: (1) static analysis of the
function source code, or (2) profiling the parallel program on the target platform.
Denition 4.2.2 (ActorWorkload). ¿eworkload of an actor Ai isWi = qiCi and the
maximum actor workload of the graph is Ŵ = maxA i∈A{Wi}.
Let lcm(q⃗) = lcm{q1, q2,⋯, q⋃︀A⋃︀}. Now, we give the following denition.
Denition4.2.3 (Matched Input/OutputRatesGraph). AgraphG is said to bematched
input/output (I/O) rates graph if and only if
Ŵ mod lcm(q⃗) = 0 (4.4)
If (4.4) does not hold, then G is said to bemis-matched I/O rates graph.
¿e concept of matched I/O rates applications was rst introduced in [TA10] as the
applications with low value of lcm(q⃗). However, the authors in [TA10] did not establish
an exact test for determining whether an application has matched I/O rates or not. ¿e
test in (4.4) is a novel contribution of this dissertation. If Ŵ mod lcm(q⃗) = 0, then
there exists at least a single actor in the graph which fully utilizes the processor on
which it runs. ¿is, as shown later, allows the graph to achieve optimal throughput.
On the other hand, if Ŵ mod lcm(q⃗) ≠ 0, then there exist idle durations in the period
of each actor which results in sub-optimal throughput.
Denition 4.2.4 (Balanced Graph). A graph G is called balanced if and only if
q1C1 = q2C2 = ⋯ = q⋃︀A⋃︀C⋃︀A⋃︀ (4.5)
If (4.5) does not hold, then the graph is called unbalanced.
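Definitions 4.2.2 to 4.2.4 translate into short checks. The Python sketch below uses hypothetical function names; the repetition vector and execution times are the values that appear in Example 4.3.1 and serve here only as sample input. It requires Python 3.9 or later for `math.lcm`.

```python
from math import lcm

def workloads(q, C):
    """Actor workloads W_i = q_i * C_i (Definition 4.2.2)."""
    return [qi * ci for qi, ci in zip(q, C)]

def is_matched(q, C):
    """Test (4.4): the maximum workload is a multiple of lcm(q)."""
    return max(workloads(q, C)) % lcm(*q) == 0

def is_balanced(q, C):
    """Test (4.5): all actor workloads are equal."""
    return len(set(workloads(q, C))) == 1

q = [3, 2, 1, 3]      # repetition vector
C = [5, 8, 24, 4]     # worst-case execution times
print(workloads(q, C))    # [15, 16, 24, 12]
print(is_matched(q, C))   # True
print(is_balanced(q, C))  # False
```

This sample graph is matched but unbalanced, which shows that the two properties are independent tests on the same workload vector.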
Denition 4.2.5 (Output Path Latency). Let Wk = {(Aa ,Ab),⋯, (Ay ,Az)} be an
output path in graph G. ¿e latency ofWk under periodic input streams, denoted by
ℒ(Wk), is the elapsed time between the start of the rst ring of Aa that produces data
to channel (Aa ,Ab) and the nish of the rst ring of Az that consumes data from
channel (Ay ,Az).
Consequently, we dene the maximum latency of G as follows:
Denition 4.2.6 (Graph Maximum Latency). For a graph G, themaximum latency




¿e path with the largest output latency is called the critical path of the graph.
Denition 4.2.7 (Self-Timed Schedule). A Self-Timed Schedule (STS) is one where all
the actors are red as soon as their input data are available.
A self-timed schedule does not impose any extra latency (e.g., by the scheduler) on
the actors. ¿is leads us to the following result proven in [SB09] for HSDF graphs and
extended in [Gha08] to SDF graphs.
¿eorem 4.2.1 (From [SB09,Gha08]). For a graph G, the minimum achievable latency
and the maximum achievable throughput are obtained when the actors ofG are scheduled
using self-timed scheduling policy.
¿eorem 4.2.1 applies to CSDF graphs since any CSDF graph can be converted to
an equivalent (H)SDF graph [BELP96].
Denition 4.2.8 (Worst-Case Self-Timed Schedule). A self-timed schedule in which
every actor ring has a duration equal to the value given by (4.3) is called a worst-case
self-timed schedule.
For acyclic graphs, the throughput of an actor Ai , denoted byℛWSTS(Ai), under
a worst-case self-timed schedule is given according to [Gha08] by:
ℛWSTS(Ai) = qi⇑Ŵ (4.7)
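Equation (4.7) can be evaluated exactly with rational arithmetic. The following Python sketch uses a hypothetical function name and, as sample input, the repetition vector and execution times that appear in Example 4.3.1.

```python
from fractions import Fraction

def wsts_throughput(q, C):
    """Per-actor throughput q_i / W_hat under a worst-case self-timed schedule (4.7)."""
    w_hat = max(qi * ci for qi, ci in zip(q, C))
    return [Fraction(qi, w_hat) for qi in q]

rates = wsts_throughput([3, 2, 1, 3], [5, 8, 24, 4])
print([str(r) for r in rates])   # ['1/8', '1/12', '1/24', '1/8']
```

Using `Fraction` avoids floating-point rounding, so the throughput values can be compared exactly against the reciprocals of integer periods.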
4.3 Deriving Periods
The first step of the proposed scheduling framework is to derive a valid period for each
actor in the graph. To do that, we first introduce some definitions.
Definition 4.3.1 (Periodic Actor). Let $A_i$ be a CSDF actor with start time $S_i$ and period
$P_i$. $A_i$ is periodic if and only if it can be fired at time instants $S_i + kP_i$ for all $k \in \mathbb{N}_0$.
Definition 4.3.2 (Periodic Schedule). Let $G$ be a CSDF graph. A valid static schedule
(see Definition 2.3.1 on page 27) in which all the actors are periodic actors is called a
periodic schedule.
Definition 4.3.3 (Period Vector). For a CSDF graph $G$, a period vector $\vec{P}$, where
$\vec{P} \in \mathbb{N}^{|A|}$, represents the periods, measured in time-units, of the actors in $G$. $P_j \in \vec{P}$ is
the period of actor $A_j \in A$. $\vec{P}$ is given by the solution to both
$q_1 P_1 = q_2 P_2 = \dots = q_{n-1} P_{n-1} = q_n P_n$ (4.8)
and
$\vec{P} - \vec{C} \geq \vec{0}$, (4.9)
where $q_j \in \check{\vec{q}}$ and $n = |A|$.
Denition 4.3.3 implies that all the actors take the same time duration to complete
one actor iteration (see Denition 2.3.2 on page 28). ¿is common time duration of the
actor iteration is called the iteration period and it is denoted by α = qiPi for any Ai .
Now, we prove the existence of a periodic schedule when the input streams are
strictly periodic. Recall from Section 4.1 that we use de-jitter buers to hide the eect
of jitter and make the input samples delivered to the input actors arrive in a strictly
periodic fashion.







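Lemma 4.3.1. For a graph $G$, the minimum period vector $\check{\vec{P}}$ that satisfies both (4.8) and (4.9) is given by
$\check{P}_i = \frac{\mathrm{lcm}(\vec{q})}{q_i} \left\lceil \frac{\hat{W}}{\mathrm{lcm}(\vec{q})} \right\rceil$ for $A_i \in A$. (4.10)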
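Proof. (4.8) can be re-written as:
$Q \cdot \vec{P} = \vec{0}$, (4.11)
where $Q$ is the $(n-1) \times n$ matrix whose entry in row $i$ and column $j$ is
$Q_{ij} = \begin{cases} q_1 & \text{if } j = 1 \\ -q_j & \text{if } j = i + 1 \\ 0 & \text{otherwise} \end{cases}$ (4.12)
Observe that $\mathrm{nullity}(Q) = 1$. Thus, there exists a single vector which forms the
basis of the null-space of $Q$. This vector can be represented by taking any unknown $P_k$
as the free unknown and expressing the other unknowns in terms of it, which results
in:
$\vec{P} = P_k [q_k/q_1,\ q_k/q_2,\ \dots,\ q_k/q_n]^T$
The minimum $P_k \in \mathbb{N}$ is
$P_k = \mathrm{lcm}\{q_1, q_2, \dots, q_n\}/q_k$
Thus, the minimum $\vec{P} \in \mathbb{N}^n$ that solves (4.8) is given by
$P_i = \mathrm{lcm}(\vec{q})/q_i$ for $A_i \in A$ (4.13)
Let $\tilde{\vec{P}}$ be the solution given by (4.13). (4.8) and (4.9) can be re-written as:
$Q(c\tilde{\vec{P}}) = \vec{0}$ (4.14)
$c\tilde{P}_1 \geq C_1,\ c\tilde{P}_2 \geq C_2,\ \dots,\ c\tilde{P}_n \geq C_n$ (4.15)
where $c \in \mathbb{N}$. (4.15) can be re-written as:
$c \geq q_1 C_1/\mathrm{lcm}(\vec{q}),\ c \geq q_2 C_2/\mathrm{lcm}(\vec{q}),\ \dots,\ c \geq q_n C_n/\mathrm{lcm}(\vec{q})$ (4.16)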
It follows that $c$ is greater than or equal to $\max_{A_i \in A}\{C_i q_i\}/\mathrm{lcm}(\vec{q}) = \hat{W}/\mathrm{lcm}(\vec{q})$.
However, $\hat{W}/\mathrm{lcm}(\vec{q})$ is not always guaranteed to be an integer. As a result, the value is
rounded up by taking its ceiling. It follows that the minimum $\check{\vec{P}}$ which satisfies both
(4.8) and (4.9) is given by
$\check{P}_i = (\mathrm{lcm}(\vec{q})/q_i) \lceil \hat{W}/\mathrm{lcm}(\vec{q}) \rceil$ for $A_i \in A$
∎
$\check{P}_i$ is the minimum value of the period of an actor $A_i$ according to Lemma 4.3.1.
However, in many cases, the designer might not need the minimum period. Therefore,
we define the period scaling factor of a graph $G$, denoted by $\mu_G \in \mathbb{N}$, which is used to
scale all the minimum periods of the actors in a graph $G$. Thus, the actual period of
an actor $A_i$ is given by:
$P_i = \mu_G \check{P}_i$ (4.17)
Observe that, for a graph $G$, scaling the minimum periods of the actors by $\mu_G$ results
in periods that satisfy (4.8) and (4.9).
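The minimum periods derived in the proof of Lemma 4.3.1 and the scaling of (4.17) can be computed as follows. This is a minimal Python sketch with hypothetical function names; the input values are those that appear in Example 4.3.1, and `math.lcm` requires Python 3.9 or later.

```python
from math import lcm

def minimum_periods(q, C):
    """Minimum period vector of Lemma 4.3.1 for repetitions q and WCETs C."""
    q_lcm = lcm(*q)
    w_hat = max(qi * ci for qi, ci in zip(q, C))
    scale = -(-w_hat // q_lcm)            # integer ceiling of w_hat / q_lcm
    return [q_lcm // qi * scale for qi in q]

def scaled_periods(q, C, mu=1):
    """Actual periods P_i = mu * (minimum period of A_i), Eq. (4.17)."""
    return [mu * p for p in minimum_periods(q, C)]

q, C = [3, 2, 1, 3], [5, 8, 24, 4]
P = scaled_periods(q, C, mu=2)
print(minimum_periods(q, C))                     # [8, 12, 24, 8]
print(P)                                         # [16, 24, 48, 16]
print(len({qi * pi for qi, pi in zip(q, P)}))    # 1, i.e. (4.8) holds
print(all(pi >= ci for pi, ci in zip(P, C)))     # True, i.e. (4.9) holds
```

The last two lines check conditions (4.8) and (4.9) for the scaled periods. Integer arithmetic is used throughout so that the ceiling is exact.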
Theorem 4.3.1. For any graph $G$, a periodic schedule exists such that every actor $A_i \in A$
is periodic with a constant period $P_i$ given by (4.17) and every communication channel
$E_u \in E$ has a bounded buffer capacity.
Proof. Recall from Section 2.3 that for any input actor $A_i$, $\mathrm{prec}(A_i) = \emptyset$. By Algorithm
1 on page 27, we obtain that all input actors belong to $LA_1$ (i.e., level-1 actors). Hence,
level-1 actors consist of either input actors or actors that generate data by themselves.
Now, recall from Section 4.1 that the input streams to the input actors are periodic with
periods equal to the input actors' periods. Therefore, it follows that the input actors
in level-1 can execute periodically since their input streams are always available when
they fire. By Definition 2.3.2 on page 28, level-1 actors will complete one iteration when
they fire $q_i$ times, where $q_i$ is the repetition of $A_i \in LA_1$. Assume that level-1 actors
start executing at time $t = 0$. Then, by time $t = \alpha$, level-1 actors are guaranteed to
finish one iteration. According to Theorem 2.3.1 on page 28, level-1 actors will also
generate enough data such that every actor $A_k \in LA_2$ can execute $q_k$ times (i.e., one
iteration) with a period $P_k$. According to (4.8) on page 52, firing $A_k$ for $q_k$ times with
a period $P_k$ takes $\alpha$ time-units. Thus, starting level-2 actors at time $t = \alpha$ guarantees
that they can execute periodically with their periods given by Definition 4.3.3 for $\alpha$
time-units. Similarly, by time $t = 2\alpha$, level-3 actors will have enough data to execute
for one iteration. Thus, starting level-3 actors at time $t = 2\alpha$ guarantees that they can
execute periodically for $\alpha$ time-units. By repeating this over all the $L$ levels, a schedule
$S_1$ (shown in Figure 4.4) is constructed in which all the actors that belong to $LA_i$ are
started at start time $S_i$ given by
$S_i = (i - 1)\alpha$ (4.18)
time   [0, α)    [α, 2α)   [2α, 3α)   ⋯   [(L − 1)α, Lα)
level  LA_1(1)   LA_2(1)   LA_3(1)    ⋯   LA_L(1)
Figure 4.4: Schedule $S_1$

time   [0, α)    [α, 2α)   [2α, 3α)   ⋯   [(L − 1)α, Lα)
level  LA_1(1)   LA_2(1)   LA_3(1)    ⋯   LA_L(1)
                 LA_1(2)   LA_2(2)    ⋯   LA_{L−1}(2)
Figure 4.5: Schedule $S_2$

time   [0, α)    [α, 2α)   [2α, 3α)   ⋯   [(L − 1)α, Lα)
Figure 4.6: Schedule $S_L$

$LA_j(k)$ denotes level-$j$ actors executing their $k$th iteration. For example, $LA_2(1)$
denotes level-2 actors executing their first iteration. At time $t = L\alpha$, $G$ completes
one graph iteration. It is trivial to observe from $S_1$ that as soon as level-1 actors finish
one iteration, they can immediately start executing the next iteration since their input
streams arrive periodically. If level-1 actors start their second iteration at time $t = \alpha$,
their execution will overlap with the execution of the level-2 actors. By doing so, level-2
actors can start their second iteration immediately after finishing their first iteration
since they will have all the needed data to execute one iteration periodically at time
$t = 2\alpha$. This overlapping can be applied to all the levels to yield the schedule $S_2$ shown
in Figure 4.5. Now, the overlapping can be applied $L$ times on schedule $S_1$ to yield a
schedule $S_L$ as shown in Figure 4.6.
Starting from time $t = L\alpha$, a schedule $S_\infty$ can be constructed as shown in Figure
4.7. In schedule $S_\infty$, every actor $A_i$ is fired every $P_i$ time-units once it starts. The start
time defined in (4.18) guarantees that actors in a given level will start only when they
have enough data to execute one iteration in a periodic way. The overlapping guarantees
that once the actors have started, they will always find enough data for executing the
next iteration since their predecessors have already executed one additional iteration.
Thus, schedule $S_\infty$ shows the existence of a periodic schedule of $G$ where every actor
$A_j \in A$ is periodic with a period equal to $P_j$.
time   [0, α)    [α, 2α)   [2α, 3α)   ⋯   [(L − 1)α, Lα)   [Lα, (L + 1)α)   ⋯
level  LA_1(1)   LA_2(1)   LA_3(1)    ⋯   LA_L(1)          LA_L(2)          ⋯
                 LA_1(2)   LA_2(2)    ⋯   LA_{L−1}(2)      LA_{L−1}(3)      ⋯
                           LA_1(3)    ⋯   LA_{L−2}(3)      LA_{L−2}(4)      ⋯
                                      ⋯   LA_{L−3}(4)      LA_{L−3}(5)      ⋯
                                      ⋯   ⋯                ⋯                ⋯
                                          LA_1(L)          LA_1(L + 1)      ⋯
Figure 4.7: Schedule $S_\infty$
The next step is to prove that $S_\infty$ executes with bounded memory buffers. In $S_\infty$,
the largest delay in consuming the tokens occurs for a channel $E_u \in E$ connecting a
level-1 actor $A_m \in LA_1$ and a level-$L$ actor. This is illustrated in Figure 4.7 by observing
that the data produced by iteration-1 of a level-1 source actor will be consumed by
iteration-1 of a level-$L$ destination actor after $(L - 1)\alpha$ time-units. In this case, $E_u$ must
be able to store at least $(L - 1)X_m^u(q_m)$ tokens. However, starting from time $t = L\alpha$,
both the level-1 and level-$L$ actors execute in parallel. Thus, we increase the buffer
size by $X_m^u(q_m)$ tokens to account for the overlapped execution. Hence, the total buffer
size of $E_u$ is $L X_m^u(q_m)$ tokens. Similarly, if a level-2 actor $A_l \in LA_2$ is connected directly
to a level-$L$ actor via channel $E_v$, then $E_v$ must be able to store at least $(L - 1)X_l^v(q_l)$
tokens. By repeating this argument over all the different pairs of levels, it follows that
each channel $E_u \in E$, connecting a level-$i$ source actor $A_k$ and a level-$j$ destination
actor, where $j \geq i$, will store according to schedule $S_\infty$ at most:
$b_u = (j - i + 1)X_k^u(q_k)$ (4.19)
tokens, where $q_k \in \check{\vec{q}}$. Thus, an upper bound on the FIFO sizes exists. ∎
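The start times (4.18) and the buffer bound (4.19) used in the proof are simple closed forms. The Python sketch below uses hypothetical function names and illustrative input values; the parameter `tokens_per_iteration` stands for $X_k^u(q_k)$, the number of tokens the source actor produces into the channel during one iteration.

```python
def level_start_time(level, alpha):
    """Start time (4.18) of the actors in a given level (levels start at 1)."""
    return (level - 1) * alpha

def buffer_bound(src_level, dst_level, tokens_per_iteration):
    """Upper bound (4.19) on the number of tokens stored in a channel."""
    return (dst_level - src_level + 1) * tokens_per_iteration

alpha = 24  # iteration period q_i * P_i (illustrative value)
print([level_start_time(level, alpha) for level in (1, 2, 3)])   # [0, 24, 48]
print(buffer_bound(1, 3, tokens_per_iteration=3))                # 9
```

The bound grows linearly with the level distance between producer and consumer, which is why reducing start times also reduces buffer sizes.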
Example 4.3.1. We now give an example of how to compute the periods of the actors
for a given CSDF graph. Consider the CSDF graph shown in Figure 2.1 on page 29.
Recall from Example 2.3.1 on page 28 that the basic repetition vector of the graph is
$\check{\vec{q}} = [3, 2, 1, 3]^T$. Suppose that the execution time vector is $\vec{C} = [5, 8, 24, 4]^T$. It follows
that $\mathrm{lcm}(\vec{q}) = \mathrm{lcm}\{3, 2, 1\} = 6$ and the maximum actor workload is
$\hat{W} = \max\{3 \times 5, 2 \times 8, 1 \times 24, 3 \times 4\} = 24$. Based on Lemma 4.3.1, the minimum period vector of the graph
is $\check{\vec{P}} = [8, 12, 24, 8]^T$. The periodic schedule resulting from Theorem 4.3.1, assuming
$\mu_G = 1$, is depicted in Figure 4.8. The actors in the schedule shown in Figure 4.8 have
start times given by (4.18).
◻
Figure 4.8: The periodic schedule for the CSDF graph shown in Figure 2.1 constructed using
Theorem 4.3.1. The x-axis represents the time and the up-arrows represent the start times of the
actors.
4.4 Deriving Deadlines and Start Times
The second step in the scheduling framework is to derive the deadlines and start times
of the actors. In the proof of Theorem 4.3.1, we use the notion of start time $S_i$ to refer to
the time at which an actor $A_i$ can start executing. The start time used in Theorem 4.3.1
is sufficient but not minimum. Therefore, we are interested in finding the earliest start
times of the actors that guarantee the existence of a periodic schedule. Minimizing the
start times is crucial since it has a direct impact on the latency of the graph and the
buffer sizes of the communication channels.
First, we introduce the cumulative production and cumulative consumption
functions.
Denition 4.4.1 (Cumulative Production Function). ¿e cumulative production func-
tion of actor Ai producing into channel Eu during a time interval (︀ts , te), denoted by
prd(︀ts ,te)(Ai , Eu), is the sum of the number of tokens produced by Ai into Eu during
the interval (︀ts , te).
Denition 4.4.2 (Cumulative Consumption Function). ¿e cumulative consumption
function of actor Ai consuming from channel Eu over a time interval (︀ts , te), denoted
by cns(︀ts ,te)(Ai , Eu), is the sum of the number of tokens consumed by Ai from Eu
during the interval (︀ts , te).
Note that the time interval in Denitions 4.4.1 and 4.4.2 can be either open or
closed from the right. Next, we give the following denition:
Denition 4.4.3 (Valid Start Time). Let Eu = (Ai ,A j) be a communication channel
in a graph G. Under a periodic schedule, a valid start time of A j, denoted by S j,
guarantees that A j nds enough data in Eu to re at time instants S j+ kPj for all k ∈ N0.
Denition 4.4.3 implies that A j never blocks on reading from Eu when it has a valid
start time (i.e., no buer underow). We utilize this to determine conditions under
which one can compute the earliest start time such that it is always valid regardless of
when the actor is actually scheduled to write/read tokens under a periodic schedule.
Lemma 4.4.1. Let $E_u = (A_i, A_j)$ be a communication channel in a graph $G$. Suppose
that $S_i$ and $S_j$ are valid start times of $A_i$ and $A_j$, respectively, under a periodic schedule.
Suppose also that $S_i$ and $S_j$ are derived when $A_i$ is always scheduled to write data as late
as possible and $A_j$ is always scheduled to read data as early as possible. $S_i$ and $S_j$ remain
valid when $A_i$ is scheduled to write earlier and/or $A_j$ is scheduled to read later.
Proof. If $A_i$ is scheduled to write earlier, then this leads to the tokens being available
earlier on the channel. Thus, $A_j$ can never have a blocking read regardless of when it is
scheduled to read because it has a valid start time derived assuming that it is scheduled
to read as early as possible. If $A_j$ is scheduled to read later, then the tokens stay
longer in the channel. Thus, $A_j$ can never block on reading since the channel
has enough tokens because $S_j$ is a valid start time for $A_j$. ∎
Lemma 4.4.1 states that if we derive the earliest start time assuming that the token
production happens as late as possible and the token consumption happens as early
as possible, then the derived earliest start time is valid regardless of when the actor is
actually scheduled to execute during its period. Thus, for the purpose of computing
the start time, the cumulative production function assumes that the token production
happens as late as possible. In this case, the cumulative production function is denoted
by $\mathrm{prd}^S_{[t_s, t_e)}(A_i, E_u)$ and is given by:
$\mathrm{prd}^S_{[t_s, t_e)}(A_i, E_u) = \begin{cases} X_i^u\left(\left\lfloor \frac{t_e - t_s}{P_i} \right\rfloor + 1\right) & \text{if } (t_e - t_s) \bmod P_i \geq D_i \\ X_i^u\left(\left\lfloor \frac{t_e - t_s}{P_i} \right\rfloor\right) & \text{if } (t_e - t_s) \bmod P_i < D_i \end{cases}$ (4.20)
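Similarly, for the purpose of computing the start time, the cumulative consumption function assumes that the data consumption happens as early as possible. The function is denoted by $\mathrm{cns}^{S}_{[t_s,t_e]}(A_i, E_u)$ and is given by

$$
\mathrm{cns}^{S}_{[t_s,t_e]}(A_i, E_u) =
\begin{cases}
Y_i^u\left(\left\lceil \frac{t_e - t_s}{P_i} \right\rceil + 1\right) & \text{if } (t_e - t_s) \bmod P_i = 0 \\
Y_i^u\left(\left\lceil \frac{t_e - t_s}{P_i} \right\rceil\right) & \text{if } (t_e - t_s) \bmod P_i \neq 0
\end{cases}
\tag{4.21}
$$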
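The two functions in (4.20) and (4.21) can be evaluated mechanically. The following Python sketch is only illustrative: it assumes integer time instants and assumes that $X_i^u(n)$ and $Y_i^u(n)$ denote the cumulative number of tokens moved during the first $n$ firings of a cyclic rate sequence. The helper names `cumulative`, `prd_s`, and `cns_s` are hypothetical and do not come from the dissertation.

```python
def cumulative(rates, n):
    """Tokens moved during the first n firings of a cyclic rate sequence
    (assumed meaning of X_i^u(n) and Y_i^u(n))."""
    full, rest = divmod(n, len(rates))
    return full * sum(rates) + sum(rates[:rest])


def prd_s(ts, te, period, deadline, rates):
    """Eq. (4.20): production over [ts, te) when each firing writes as late as possible."""
    length = te - ts
    firings = length // period            # floor((te - ts) / P_i)
    if length % period >= deadline:       # the current firing has reached its deadline
        firings += 1
    return cumulative(rates, firings)


def cns_s(ts, te, period, rates):
    """Eq. (4.21): consumption over [ts, te] when each firing reads as early as possible."""
    length = te - ts
    firings = -(-length // period)        # ceil((te - ts) / P_i)
    if length % period == 0:              # a new period starts exactly at te
        firings += 1
    return cumulative(rates, firings)


# Hypothetical actor with P = 8, D = 6 and a rate sequence [1]:
print(prd_s(0, 5, 8, 6, [1]), prd_s(0, 6, 8, 6, [1]), cns_s(0, 0, 8, [1]))
```

With these assumed parameters, no token counts as produced before the deadline at time 6, whereas the first token already counts as consumed at time 0.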
It can be seen from (4.20) that the deadline has an impact on when the tokens are produced. Therefore, we introduce the deadline factor of an actor $A_i$, denoted by $\eta_i$, where $\eta_i \in [0, 1]$. Similar to the period scaling factor $\mu_G$, the deadline factor is a parameter controlled by the designer. Given the WCET, period, and deadline factor of an actor $A_i$, its deadline $D_i$ is given by

$$
D_i = \left\lfloor C_i + \eta_i (P_i - C_i) \right\rfloor \tag{4.22}
$$
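As a quick numerical check of (4.22), the sketch below recomputes the deadline vectors for the WCET and period vectors quoted in Example 4.4.1. The function name `deadline` is a hypothetical helper, and applying one common deadline factor to all actors is an assumption made for brevity.

```python
from math import floor


def deadline(wcet, period, eta):
    """Eq. (4.22): D_i = floor(C_i + eta_i * (P_i - C_i)), with eta_i in [0, 1]."""
    if not 0.0 <= eta <= 1.0:
        raise ValueError("the deadline factor must lie in [0, 1]")
    return floor(wcet + eta * (period - wcet))


C = [5, 8, 24, 4]    # WCET vector quoted in Example 4.4.1
P = [8, 12, 24, 8]   # minimum period vector quoted in Example 4.4.1

for eta in (1.0, 0.5, 0.0):
    print(eta, [deadline(c, p, eta) for c, p in zip(C, P)])
# 1.0 [8, 12, 24, 8]
# 0.5 [6, 10, 24, 6]
# 0.0 [5, 8, 24, 4]
```

The three printed vectors agree with the deadline column of Table 4.1: a factor of 1 yields implicit deadlines ($D_i = P_i$), and a factor of 0 yields $D_i = C_i$.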
Now, we give the following lemma regarding the earliest start time.
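Lemma 4.4.2. For a graph $G$, the earliest start time of an actor $A_j \in A$, denoted by $S_j$, is given by

$$
S_j =
\begin{cases}
0 & \text{if } \mathrm{prec}(A_j) = \emptyset \\
\max_{A_i \in \mathrm{prec}(A_j)} \{ S_{i \to j} \} & \text{if } \mathrm{prec}(A_j) \neq \emptyset
\end{cases}
\tag{4.23}
$$

where $S_{i \to j}$ is given by

$$
S_{i \to j} = \min_{t \in [0, S_i + \alpha]} \left\{ t \;\middle|\; \forall k = 0, 1, \dots, \alpha : \mathrm{prd}^{S}_{[S_i, \max\{S_i, t\} + k)}(A_i, E_u) \geq \mathrm{cns}^{S}_{[t, \max\{S_i, t\} + k]}(A_j, E_u) \right\}
\tag{4.24}
$$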
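Proof. Theorem 4.3.1 proved that starting a level-$k$ actor $A_j$ at a start time

$$
S_j = (k - 1)\alpha \tag{4.25}
$$

guarantees periodic execution of the actor $A_j$. Any start time later than that also guarantees periodic execution since $A_j$ will always find enough data to execute in a periodic way. The start time in (4.25) can be written equivalently as:

$$
S_j =
\begin{cases}
0 & \text{if } \mathrm{prec}(A_j) = \emptyset \\
\max_{A_i \in \mathrm{prec}(A_j)} \{ S_i \} + \alpha & \text{if } \mathrm{prec}(A_j) \neq \emptyset
\end{cases}
\tag{4.26}
$$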
The equivalence follows from observing that a level-$k$ actor, where $k > 1$, has a level-$(k-1)$ predecessor. Hence, applying (4.26) to a level-$k$ actor, where $k > 1$, yields:

$$
S_j = \max\{(k - 2)\alpha, (k - 3)\alpha, \dots, 0\} + \alpha = (k - 1)\alpha
$$

Now, we are interested in starting $A_j$ in level $k$, where $k > 1$, earlier. That is:

$$
S_j \leq \max_{A_i \in \mathrm{prec}(A_j)} \{ S_i \} + \alpha \tag{4.27}
$$

$S_j$ is also bounded from below, since an actor $A_j$ cannot start before the application is started. That is:

$$
0 \leq S_j \leq \max_{A_i \in \mathrm{prec}(A_j)} \{ S_i \} + \alpha \;\Rightarrow\; 0 \leq S_j \leq \max_{A_i \in \mathrm{prec}(A_j)} \{ S_i + \alpha \} \tag{4.28}
$$
If we select $S_j$ such that

$$
S_j = \max_{A_i \in \mathrm{prec}(A_j)} \{ S_{i \to j} \}, \quad \text{where } S_{i \to j} = t' \text{ and } t' \in [0, S_i + \alpha] \tag{4.29}
$$

then this guarantees that $S_j$ also satisfies (4.28).

[Figure 4.9: Timeline of $A_i$ and $A_j$ when $t' \geq S_i$. The time axis marks $S_i$, $t'$, $S_i + \alpha$, and $t' + \alpha$.]
[Figure 4.10: Timeline of $A_i$ and $A_j$ when $t' < S_i$. The time axis marks $t'$, $S_i$, $t' + \alpha$, and $S_i + \alpha$.]
In (4.29), a valid start time candidate $S_{i \to j}$ must guarantee that the number of tokens available on channel $E_u = (A_i, A_j)$ at any time instant $t \geq t'$ is greater than or equal to the number of consumed tokens at the same instant. To guarantee that, we consider the following two possible cases:
Case I: $t' \geq S_i$. This case is illustrated in Figure 4.9. In this case, a valid start time candidate $t'$ must satisfy:

$$
\forall k = 0, 1, \dots, \alpha : \mathrm{prd}^{S}_{[S_i, t' + k)}(A_i, E_u) \geq \mathrm{cns}^{S}_{[t', t' + k]}(A_j, E_u) \tag{4.30}
$$

Satisfying (4.30) guarantees that $A_j$ can fire at times $t = t', t' + P_j, \dots, t' + \alpha$. Thus, a valid value of $t'$ guarantees that once $A_j$ is started, it always finds enough data to fire for one iteration. As a result, $A_j$ executes in a periodic way.
Case II: $t' < S_i$. This case is illustrated in Figure 4.10. A valid start time candidate $t'$ must satisfy:

$$
\forall k = 0, 1, \dots, \alpha : \mathrm{prd}^{S}_{[S_i, S_i + k)}(A_i, E_u) \geq \mathrm{cns}^{S}_{[t', S_i + k]}(A_j, E_u) \tag{4.31}
$$

This case occurs when $A_j$ consumes zero tokens during the interval $[t', S_i]$. This is a valid behavior since the consumption rate sequence of a CSDF actor can contain zero elements (see for example the CSDF graph on page 29). Since $t' < S_i$, it is sufficient to check the cumulative production and consumption over the interval $[S_i, S_i + \alpha]$, because by time $t = S_i + \alpha$ both $A_i$ and $A_j$ are guaranteed to have finished one iteration. Thus, $t'$ also guarantees that once $A_j$ is started, it always finds enough data to fire. Hence, $A_j$ executes in a periodic way.
Table 4.1: Computing $\vec{D}$ and $\vec{S}$ for the CSDF graph shown in Figure 2.1 on page 29 under different values of $\vec{\eta}$.

$\vec{\eta}$     $\vec{D}$              $\vec{S}$
$\vec{1}$        $[8, 12, 24, 8]^T$     $[0, 8, 24, 32]^T$
$\vec{0.5}$      $[6, 10, 24, 6]^T$     $[0, 6, 22, 30]^T$
$\vec{0}$        $[5, 8, 24, 4]^T$      $[0, 5, 21, 29]^T$
Now, we can merge (4.30) and (4.31), which results in:

$$
\forall k = 0, 1, \dots, \alpha : \mathrm{prd}^{S}_{[S_i, \max\{S_i, t'\} + k)}(A_i, E_u) \geq \mathrm{cns}^{S}_{[t', \max\{S_i, t'\} + k]}(A_j, E_u) \tag{4.32}
$$

Any value of $t'$ which satisfies (4.32) is a valid start time value that guarantees periodic execution of $A_j$. Since there might be multiple values of $t'$ that satisfy (4.32), we take the minimum one because it is the earliest start time that guarantees periodic execution of $A_j$. ∎
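The search described in the proof can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the author's implementation: time is assumed to be integer, $X_i^u(n)$ and $Y_i^u(n)$ are assumed to be cumulative sums over cyclic rate sequences, and $\alpha$ is passed in as a parameter. The test data is the chain $A_1 \to A_2 \to A_3 \to A_4$ of Figure 4.14 (all rates $[1]$, all periods 7), for which $\alpha = 7$ is assumed. All function names are hypothetical.

```python
def cumulative(rates, n):
    """Assumed meaning of X(n) / Y(n): tokens moved in the first n firings."""
    full, rest = divmod(n, len(rates))
    return full * sum(rates) + sum(rates[:rest])


def prd_s(length, period, deadline, rates):
    """Eq. (4.20) for a right-open interval of the given length."""
    firings = length // period + (1 if length % period >= deadline else 0)
    return cumulative(rates, firings)


def cns_s(length, period, rates):
    """Eq. (4.21) for a closed interval of the given length."""
    firings = -(-length // period) + (1 if length % period == 0 else 0)
    return cumulative(rates, firings)


def start_candidate(s_i, p_i, d_i, x, p_j, y, alpha):
    """Eq. (4.24): smallest integer t in [0, S_i + alpha] that satisfies (4.32)."""
    for t in range(s_i + alpha + 1):
        base = max(s_i, t)
        if all(prd_s(base + k - s_i, p_i, d_i, x) >= cns_s(base + k - t, p_j, y)
               for k in range(alpha + 1)):
            return t
    raise ValueError("no valid start time in [0, S_i + alpha]")


def chain_start_times(periods, deadlines, alpha):
    """Start times along a chain in which every channel has rate sequences [1]."""
    starts = [0]                                   # the source actor has no predecessors
    for j in range(1, len(periods)):
        i = j - 1
        starts.append(start_candidate(starts[i], periods[i], deadlines[i], [1],
                                      periods[j], [1], alpha))
    return starts


print(chain_start_times([7, 7, 7, 7], [7, 7, 7, 7], alpha=7))  # [0, 7, 14, 21]
print(chain_start_times([7, 7, 7, 7], [2, 4, 7, 1], alpha=7))  # [0, 2, 6, 13]
```

The first call uses implicit deadlines and reproduces the start-time vector listed for $G_2$ in Table 4.5. The second call sets each deadline to the WCET and shows how shorter deadlines pull the start times earlier.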
Example 4.4.1. Now, we give an example of how to compute the earliest start times and deadlines for a given CSDF graph. Consider the CSDF graph shown in Figure 2.1 on page 29. Recall from Example 4.3.1 on page 56 that $\vec{C} = [5, 8, 24, 4]^T$ and $\check{\vec{P}} = [8, 12, 24, 8]^T$. In Table 4.1, we show the values of start times and deadlines under different values of the deadline factors $\vec{\eta}$. These values are computed by applying Lemma 4.4.2 and (4.22) to the graph. ◻
4.5 Deriving Buffer Sizes
The third step in the scheduling framework is to derive a bounded buffer size for each communication channel in the graph. Similar to the start time, we prove in Theorem 4.3.1 the existence of a periodic schedule when each channel has a bounded buffer size. However, the buffer size used in Theorem 4.3.1 is sufficient but not minimum. Therefore, we would like to derive the minimum buffer sizes that guarantee periodic execution of all the actors.
Definition 4.5.1 (Valid Buffer Size). Let $E_u = (A_i, A_j)$ be a communication channel in a graph $G$. Under a periodic schedule, a valid buffer size of $E_u$, denoted by $b_u$, guarantees that $A_i$ can store tokens to $E_u$ at time instants $S_i + kP_i$ for all $k \in \mathbb{N}_0$.
Definition 4.5.1 implies that a producer never blocks when writing to a communication channel which has a valid buffer size (i.e., no buffer overflow). Similar to Lemma 4.4.1, we want to derive the conditions under which one can compute the minimum buffer size such that it is always valid regardless of when the actors are actually scheduled to write/read during their periods.
Lemma 4.5.1. Let $E_u = (A_i, A_j)$ be a communication channel in a graph $G$. Suppose that $E_u$ has a valid buffer size $b_u$ that is derived when $A_i$ is always scheduled to write as early as possible and $A_j$ is always scheduled to read as late as possible. Then $b_u$ remains valid when $A_i$ is scheduled to write later and/or $A_j$ is scheduled to read earlier.
Proof. If $A_i$ is scheduled to write later, then the tokens are written later to the channel. Thus, $A_i$ can never have a blocking write regardless of when $A_j$ is scheduled to read, because $b_u$ is valid and is derived assuming that $A_j$ is always scheduled to read as late as possible. If $A_j$ is scheduled to read earlier, then the tokens stay for a shorter time in the channel. Thus, $A_i$ can never block on writing since $b_u$ is valid and the channel has enough free space. ∎
Lemma 4.5.1 states that the buffer size derived when the token production happens as early as possible and the token consumption happens as late as possible is valid regardless of when the actors are actually scheduled to execute during their periods. Thus, for the purpose of computing the buffer size, the cumulative production function assumes that the token production happens as early as possible. In this case, the cumulative production function is denoted by $\mathrm{prd}^{B}_{[t_s,t_e]}(A_i, E_u)$ and is given by:

$$
\mathrm{prd}^{B}_{[t_s,t_e]}(A_i, E_u) =
\begin{cases}
X_i^u\left(\left\lceil \frac{t_e - t_s}{P_i} \right\rceil + 1\right) & \text{if } (t_e - t_s) \bmod P_i = 0 \\
X_i^u\left(\left\lceil \frac{t_e - t_s}{P_i} \right\rceil\right) & \text{if } (t_e - t_s) \bmod P_i \neq 0
\end{cases}
\tag{4.33}
$$
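Similarly, for the purpose of computing the buffer size, the cumulative consumption function assumes that the data consumption happens as late as possible. In this case, the function is denoted by $\mathrm{cns}^{B}_{[t_s,t_e)}(A_i, E_u)$ and is given by

$$
\mathrm{cns}^{B}_{[t_s,t_e)}(A_i, E_u) =
\begin{cases}
Y_i^u\left(\left\lfloor \frac{t_e - t_s}{P_i} \right\rfloor + 1\right) & \text{if } (t_e - t_s) \bmod P_i \geq D_i \\
Y_i^u\left(\left\lfloor \frac{t_e - t_s}{P_i} \right\rfloor\right) & \text{if } (t_e - t_s) \bmod P_i < D_i
\end{cases}
\tag{4.34}
$$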
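Lemma 4.5.2. For a graph $G$, the minimum bounded buffer size $b_u$ of a communication channel $E_u = (A_i, A_j)$ is given by

$$
b_u = \max_{k \in \{0, 1, \dots, \alpha\}} \left\{ \mathrm{prd}^{B}_{[S_i, \max\{S_i, S_j\} + k]}(A_i, E_u) - \mathrm{cns}^{B}_{[S_j, \max\{S_i, S_j\} + k)}(A_j, E_u) \right\}
\tag{4.35}
$$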
Proof. (4.35) tracks the maximum cumulative number of unconsumed tokens in $E_u$ during one iteration of $A_i$ and $A_j$. There are two cases:

[Figure 4.11: Execution time-lines of $A_i$ and $A_j$ when $S_i \leq S_j$. The time axis marks $S_i$, $S_j$, $S_i + \alpha$, and $S_j + \alpha$, which delimit the intervals $t_1$, $t_2$, and $t_3$.]
[Figure 4.12: Execution time-lines of $A_i$ and $A_j$ when $S_i > S_j$. The time axis marks $S_j$, $S_i$, $S_j + \alpha$, and $S_i + \alpha$, which delimit the intervals $t_1$, $t_2$, and $t_3$.]

Case I: $S_i \leq S_j$. In this case, (4.35) tracks the maximum cumulative number of unconsumed tokens in $E_u$ during the time interval $[S_i, S_j + \alpha)$. Figure 4.11 illustrates the execution time-lines of $A_i$ and $A_j$ when $S_i \leq S_j$. In interval $t_1$, $A_i$ is actively producing tokens while $A_j$ has not yet started executing. As a result, it is necessary to buffer all the tokens produced in this interval in order to prevent $A_i$ from blocking on writing. Thus, $b_u$ must be greater than or equal to $\mathrm{prd}^{B}_{[S_i, S_j]}(A_i, E_u)$. Starting from time $t = S_j$, both $A_i$ and $A_j$ are executing in parallel (i.e., overlapped execution). In the proof of Theorem 4.3.1, an additional $X_i^u(q_i)$ tokens were added to the buffer size of $E_u$ to account for the overlapped execution. However, this value is a "worst-case" value. The minimum number of tokens that needs to be buffered is given by the maximum number of unconsumed tokens in $E_u$ at any time over the time interval $[S_j, S_j + \alpha)$ (i.e., intervals $t_2$ and $t_3$ in Figure 4.11). Taking the maximum number of unconsumed tokens guarantees that $A_i$ will always have enough space to write to $E_u$. Thus, $b_u$ is sufficient and minimum for guaranteeing strictly periodic execution of $A_i$ and $A_j$ in the time interval $[S_i, S_j + \alpha)$. At time $t = S_j + \alpha$, both $A_i$ and $A_j$ have completed one iteration and the number of tokens in $E_u$ is the same as at time $t = S_j$ (this follows from Corollary 2.3.1 on page 28). Due to the periodicity of $A_i$ and $A_j$, the pattern shown in Figure 4.11 repeats. Thus, $b_u$ is also sufficient and minimum for any $t \geq S_j + \alpha$.
Case II: $S_i > S_j$. Figure 4.12 illustrates this case. According to Lemma 4.4.2, $S_j$ can be smaller than $S_i$ if and only if $A_j$ consumes zero tokens in interval $t_1$. Therefore, the intervals in which there is actually production/consumption of tokens are $t_2$ and $t_3$. During interval $t_2$, there is overlapped execution and $b_u$ gives the maximum number of unconsumed tokens in $E_u$ during $[S_i, S_j + \alpha)$, which guarantees that $A_i$ always has enough space to write to $E_u$ and $A_j$ has enough data to consume from $E_u$. At time $t = S_j + \alpha$, $A_j$ finishes one iteration and interval $t_3$ starts. During interval $t_3$, $A_i$ is producing data to $E_u$ while $A_j$ is consuming zero tokens. Therefore, $E_u$ has to accommodate all the tokens produced during interval $t_3$ and $b_u$ must be greater than or equal to $\mathrm{prd}^{B}_{[S_j + \alpha, S_i + \alpha]}(A_i, E_u)$. As in Case I, $b_u$ is sufficient and minimum for guaranteeing periodic execution of $A_i$ and $A_j$ in the interval $[S_j, S_i + \alpha]$. At time $t = S_i + \alpha$, both $A_i$ and $A_j$ have completed one iteration and $E_u$ contains a number of tokens equal to the number of tokens at time $t = S_i$. Due to the periodicity of $A_i$ and $A_j$, their execution pattern repeats. Thus, $b_u$ is also sufficient and minimum for any $t \geq S_i + \alpha$. ∎

Table 4.2: Computing the buffer sizes for the CSDF graph shown in Figure 2.1 on page 29 under different values of $\vec{\eta}$.

$\vec{\eta}$     $\vec{b}$
$\vec{1}$        $[2, 2, 5, 3, 2]^T$
$\vec{0.5}$      $[2, 2, 5, 3, 2]^T$
$\vec{0}$        $[2, 2, 5, 3, 2]^T$
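The maximum in (4.35) can be evaluated by direct enumeration over $k$. The Python sketch below is a hedged illustration: it assumes integer time, cumulative rate functions over cyclic sequences, and the chain of Figure 4.14 with unit rates, periods of 7, implicit deadlines, and $\alpha = 7$ (an assumption, as $\alpha$ is defined earlier in the chapter). The names `prd_b`, `cns_b`, and `buffer_size` are hypothetical.

```python
def cumulative(rates, n):
    """Assumed meaning of X(n) / Y(n): tokens moved in the first n firings."""
    full, rest = divmod(n, len(rates))
    return full * sum(rates) + sum(rates[:rest])


def prd_b(length, period, rates):
    """Eq. (4.33): production over a closed interval, writes as early as possible."""
    firings = -(-length // period) + (1 if length % period == 0 else 0)
    return cumulative(rates, firings)


def cns_b(length, period, deadline, rates):
    """Eq. (4.34): consumption over a right-open interval, reads as late as possible."""
    firings = length // period + (1 if length % period >= deadline else 0)
    return cumulative(rates, firings)


def buffer_size(s_i, p_i, x, s_j, p_j, d_j, y, alpha):
    """Eq. (4.35): largest number of unconsumed tokens over one iteration."""
    base = max(s_i, s_j)
    return max(prd_b(base + k - s_i, p_i, x) - cns_b(base + k - s_j, p_j, d_j, y)
               for k in range(alpha + 1))


starts = [0, 7, 14, 21]      # start times of G2 from Table 4.5
sizes = [buffer_size(starts[i], 7, [1], starts[i + 1], 7, 7, [1], alpha=7)
         for i in range(3)]
print(sizes)  # [2, 2, 2]
```

Under these assumptions the result matches the buffer-size vector listed for $G_2$ in Table 4.5.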
Example 4.5.1. We now give an example of how to compute the buffer sizes for a given CSDF graph. Consider the CSDF graph shown in Figure 2.1 on page 29. Recall that we computed $\vec{P}$, $\vec{D}$, and $\vec{S}$ in Examples 4.3.1 and 4.4.1. Therefore, we show in Table 4.2 the values of $\vec{b}$ under different values of $\vec{\eta}$. Observe in this particular example that the buffer sizes do not change when $\vec{\eta}$ is varied. However, the buffer sizes, in general, are reduced when $\vec{\eta}$ is reduced. ◻
Now, we present the following theorem, which represents the main result of this chapter.
Theorem 4.5.1. For a graph $G$, let $T$ be a periodic taskset such that $T_i \in T$ corresponds to $A_i \in A$. $T_i$ is given by:

$$
T_i = (S_i, C_i, D_i, P_i), \tag{4.36}
$$

where $S_i$ is the earliest start time of $A_i$ given by (4.23), $C_i \in \vec{C}$ is the WCET given by (4.3), $D_i$ is the deadline given by (4.22), and $P_i$ is the period given by (4.17). $T$ is schedulable on $m$ processors using any hard-real-time scheduling algorithm $\mathcal{A}$ for asynchronous sets of periodic tasks if:
1. every communication channel $E_u \in E$ has a capacity of at least $b_u$ tokens, where $b_u$ is given by (4.35)
2. $T$ satisfies the schedulability test of $\mathcal{A}$ on $m$ processors
Proof. Follows from Theorem 4.3.1, and Lemmas 4.4.2 and 4.5.2. ∎
Theorem 4.5.1 states that for each program modeled as an acyclic CSDF graph, one can schedule the corresponding real-time taskset on any system composed of $m$ processors using a scheduling algorithm $\mathcal{A}$ if (1) the communication channels are sized appropriately, and (2) the taskset satisfies the schedulability test of $\mathcal{A}$ on $m$ processors.
4.6 Throughput Analysis
Now, given a CSDF graph, we analyze the throughput of the graph actors under a
periodic schedule and compare it with the throughput under a worst-case self-timed
schedule. We start with the following definitions:
Definition 4.6.1 (Actor Throughput). For a graph $G$, the throughput of actor $A_i$ under a periodic schedule, denoted by $\mathcal{R}_{PS}(A_i)$, is given by

$$
\mathcal{R}_{PS}(A_i) = 1 / P_i \tag{4.37}
$$

Definition 4.6.2 (Rate-Optimal Periodic Schedule [MB07]). For a graph $G$, a periodic schedule that delivers the same throughput as a worst-case self-timed schedule for all the actors is called a Rate-Optimal Periodic Schedule (ROPS).
Recall matched I/O rates graphs introduced in Definition 4.2.3 on page 51. Now,
we give the following result.
Lemma 4.6.1. For a matched I/O rates graph G, the maximum achievable throughput of
the graph actors under a periodic schedule is equal to their throughput under a worst-case
self-timed schedule.
Proof. The maximum achievable throughput under periodic scheduling is the one obtained when the actors are scheduled using their minimum periods.








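Recall also from Table 2.1 on page 24 that $\div$ denotes the integer division operator. Let us re-write $\hat{W}$ as $\hat{W} = p \cdot \mathrm{lcm}(\vec{q}) + r$, where $p = \hat{W} \div \mathrm{lcm}(\vec{q})$ and $r = \hat{W} \bmod \mathrm{lcm}(\vec{q})$. Then the minimum period of $A_i$ can be written as:

$$
\check{P}_i =
\begin{cases}
\hat{W} / q_i & \text{if } \hat{W} \bmod \mathrm{lcm}(\vec{q}) = 0 \\
(p + 1)\, \mathrm{lcm}(\vec{q}) / q_i & \text{if } \hat{W} \bmod \mathrm{lcm}(\vec{q}) \neq 0
\end{cases}
\tag{4.39}
$$

Recall from (4.7) that

$$
\mathcal{R}_{WSTS}(A_i) = q_i / \hat{W} \tag{4.40}
$$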
Now, recall from Definition 4.2.3 that a matched I/O rates graph satisfies the following condition:

$$
\hat{W} \bmod \mathrm{lcm}(\vec{q}) = 0 \tag{4.41}
$$

Therefore, the maximum achievable throughput of the actors of a matched I/O rates graph under periodic scheduling is:

$$
\mathcal{R}_{PS}(A_i) = q_i / \hat{W} = \mathcal{R}_{WSTS}(A_i) \tag{4.42}
$$

∎
(4.39) shows that the throughput under periodic scheduling depends solely on the relationship between $\mathrm{lcm}(\vec{q})$ and $\hat{W}$. If $\hat{W} \bmod \mathrm{lcm}(\vec{q}) = 0$, then $\mathcal{R}_{PS}(A_i)$ is exactly the same as $\mathcal{R}_{WSTS}(A_i)$. If $\hat{W} \bmod \mathrm{lcm}(\vec{q}) \neq 0$, then $\mathcal{R}_{PS}(A_i)$ is lower than $\mathcal{R}_{WSTS}(A_i)$.
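The two cases of (4.39) are easy to compare numerically. The sketch below is illustrative: it takes $\hat{W}$ to be the maximum workload $\max_i q_i C_i$ (as in (4.43)), uses the repetition vector $[1, 1, 1, 1]$ for the unit-rate chain of Figure 4.14 (an assumption that follows from its unit rates), and adds a purely hypothetical two-actor graph with mismatched rates. The function names are hypothetical as well.

```python
from fractions import Fraction
from math import lcm  # accepts several arguments since Python 3.9


def max_workload(q, wcet):
    """W-hat: the maximum of q_i * C_i over all actors."""
    return max(qi * ci for qi, ci in zip(q, wcet))


def minimum_period(q, wcet, i):
    """Eq. (4.39): minimum period of actor i."""
    w_hat = max_workload(q, wcet)
    l = lcm(*q)
    p, r = divmod(w_hat, l)
    return Fraction(w_hat, q[i]) if r == 0 else Fraction((p + 1) * l, q[i])


def throughput_ps(q, wcet, i):
    """Eq. (4.37) evaluated at the minimum period."""
    return 1 / minimum_period(q, wcet, i)


def throughput_wsts(q, wcet, i):
    """Eq. (4.40)."""
    return Fraction(q[i], max_workload(q, wcet))


# Matched I/O rates: W-hat = 7 is a multiple of lcm(q) = 1.
print(throughput_ps([1, 1, 1, 1], [2, 4, 7, 1], 0),
      throughput_wsts([1, 1, 1, 1], [2, 4, 7, 1], 0))   # 1/7 1/7

# Hypothetical mismatched graph: W-hat = 4 is not a multiple of lcm(q) = 6.
print(throughput_ps([3, 2], [1, 2], 0),
      throughput_wsts([3, 2], [1, 2], 0))               # 1/2 3/4
```

In the matched case both schedules deliver the same throughput, while in the mismatched case the periodic schedule is slower because the period is rounded up to the next multiple of $\mathrm{lcm}(\vec{q}) / q_i$.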
Now, we prove the following result regarding matched I/O rates programs:
Theorem 4.6.1. For a matched I/O rates graph $G$ scheduled as a periodic taskset $T$ using its minimum period vector $\check{\vec{P}}$, the maximum utilization factor is $\hat{u}(T) = 1$.
Proof. Recall from Section 2.4.1 that the utilization of a task $T_i$ is defined as $u_i = C_i / P_i$, where $C_i \leq P_i$. Therefore, the maximum possible value of $u_i$ is reached when $C_i = P_i$, which leads to $u_i = 1$. Now, let $A_m$ be the actor with the maximum workload. It follows that

$$
q_m C_m = \max_{A_i \in A} \{ q_i C_i \} = \hat{W} \tag{4.43}
$$








Now, let us write $\hat{W}$ as $\hat{W} = p \cdot \mathrm{lcm}(\vec{q}) + r$, where $p = \hat{W} \div \mathrm{lcm}(\vec{q})$ and $r = \hat{W} \bmod \mathrm{lcm}(\vec{q})$.





















4.7 Latency Analysis







Given the start times and deadlines of the actors as computed using Lemmas 4.4.2 and 4.5.2, we can compute the latency of an output path in the graph. Let $W_k$ be an output path in a graph $G$, where $E_r$ is the first channel and $E_u$ is the last channel. We define $K_j^u$ and $K_i^r$ as two constants given by:

$$
K_i^r = \min\{k \in \mathbb{N} : x_i^r(k) > 0\} - 1 \tag{4.49}
$$

$$
K_j^u = \min\{k \in \mathbb{N} : y_j^u(k) > 0\} - 1 \tag{4.50}
$$

Then, according to Definitions 4.2.5 and 4.2.6 in Section 4.2 on page 51, the graph latency $\mathcal{L}(G)$ is given by:

$$
\mathcal{L}(G) = \max_{W_k \in W} \left\{ S_j + K_j^u P_j + D_j - (S_i + K_i^r P_i) \right\} \tag{4.51}
$$

where $S_j$ and $S_i$ are the earliest start times of the output actor $A_j$ and the input actor $A_i$, respectively, $P_j$ and $P_i$ are the periods of $A_j$ and $A_i$, respectively, and $D_j$ is the deadline of $A_j$.
Example 4.7.1. We give an example to illustrate how to compute the graph latency. Consider the CSDF graph shown in Figure 2.1 on page 29. Recall from Example 2.3.1 on page 28 that the graph has three output paths given by $W = \{W_1 = \{(A_1, A_2), (A_2, A_4)\}, W_2 = \{(A_1, A_3), (A_3, A_4)\}, W_3 = \{(A_1, A_4)\}\}$. Recall also from Example 4.3.1 on page 56 that the minimum period vector of the graph is $\check{\vec{P}} = [8, 12, 24, 8]^T$. First, we list the values of $K_i^r$ and $K_j^u$ defined in (4.49) and (4.50) in Table 4.3. Then, in Table 4.4, we show the output path latencies and the graph maximum latency under different values of $\vec{\eta}$. Observe in this example that the different output path latencies (i.e., $\mathcal{L}(W_1)$, $\mathcal{L}(W_2)$, and $\mathcal{L}(W_3)$) are equal since the source and sink actors (i.e., $A_1$ and $A_4$) are the same for all paths and the values of $K_i^r$ and $K_j^u$ are equal for each output path. ◻
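The latency expression (4.51) reduces to simple arithmetic once the start times and deadlines are known. The following Python sketch recomputes the latencies of Table 4.4 from the start times and deadlines in Table 4.1. Since the values of Table 4.3 are not reproduced here, both constants $K$ are set to 0 as an assumption; because $P_1 = P_4 = 8$ and the example states that the two constants are equal on every path, their contributions cancel in any case. The function name `path_latency` is hypothetical.

```python
def path_latency(s_in, k_in, p_in, s_out, k_out, p_out, d_out):
    """One term of (4.51) for a path from input actor A_i to output actor A_j."""
    return s_out + k_out * p_out + d_out - (s_in + k_in * p_in)


# (S_1, S_4, D_4) for A_1 and A_4, read from Table 4.1 for each deadline factor.
cases = {1.0: (0, 32, 8), 0.5: (0, 30, 6), 0.0: (0, 29, 4)}

latencies = {}
for eta, (s1, s4, d4) in cases.items():
    # All three output paths share A_1 and A_4, so one term covers the maximum.
    latencies[eta] = path_latency(s1, 0, 8, s4, 0, 8, d4)

print(latencies)  # {1.0: 40, 0.5: 36, 0.0: 33}
```

The results agree with the graph latencies of 40, 36, and 33 reported in Table 4.4.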
Recall from (4.22) that the designer controls the deadline values using the deadline factors $\vec{\eta}$. Such control allows the designer to reduce the latency by reducing the values of the deadlines as illustrated in Example 4.7.1. However, it is not necessary to reduce all the deadlines in order to reduce the latency. Recall from Definition 4.2.6 on page 51 that the graph latency is dictated by the critical path, which is the output path with the largest latency. Therefore, it is sufficient to reduce the deadlines of the actors belonging to the critical path while leaving the deadlines of the other actors set to their periods. This is useful since lower deadlines translate to constrained-deadline tasks, which in general translates into a higher number of processors needed to schedule the actors. Reducing the deadlines of the critical path is accomplished using Algorithm 2. Algorithm 2 starts by initializing the deadlines of all actors to their periods. After that, it iteratively reduces the deadline of the actor $A_m$ that induces the largest start time on a given destination actor $A_j$. This reduction in deadline results in reducing $S_{m \to j}$. However, reducing $S_{m \to j}$ might cause another actor (e.g., $A_k$) to have a larger $S_{k \to j}$ than $S_{m \to j}$. Thus, the deadline reduction is repeated until no other actor $A_k$ with a larger $S_{k \to j}$ is found.

Table 4.4: The output path latencies and graph maximum latency of the CSDF graph shown in Figure 2.1 on page 29 under different values of $\vec{\eta}$.

$\vec{\eta}$     $\mathcal{L}(W_1)$   $\mathcal{L}(W_2)$   $\mathcal{L}(W_3)$   $\mathcal{L}(G)$
$\vec{1}$        40                   40                   40                   40
$\vec{0.5}$      36                   36                   36                   36
$\vec{0}$        33                   33                   33                   33
Now, we show that for two sub-classes of CSDF graphs, one can derive a simpler expression for the latency. These two classes are: (1) balanced graphs (see Definition 4.2.4 on page 51), and (2) graphs where $\check{\vec{q}} = \vec{1}$.
Theorem 4.7.1. The minimum achievable latency of a balanced graph $G$ when its actors are scheduled as implicit-deadline periodic tasks is equal to its latency under a worst-case self-timed schedule.
Proof. By Definitions 4.2.4 on page 51 and 4.3.3 on page 52, a balanced graph has a period vector in which each actor has a period equal to its execution time (i.e., $P_i = C_i$). Therefore, each actor requires execution on a dedicated processor (since $u_i = C_i / P_i = 1$). In this case, the actor fully utilizes the processor on which it executes. Hence, there is no idle time or latency imposed by the scheduler and the actors execute in a way that emulates self-timed execution. As a result, the graph has the same latency as if it were executed in a self-timed way. ∎

Algorithm 2 Set-deadline-and-start-time($G$, $\vec{\eta}$)
Require: A graph $G$
Require: Deadline factors $\vec{\eta}$
 1: for all $A_j \in A$ do
 2:   if $\mathrm{prec}(A_j) = \emptyset$ then
 3:     $S_j \leftarrow 0$
 4:   else
 5:     Initialize the deadline of each actor $A_i \in \mathrm{prec}(A_j)$ to its period (i.e., $D_i = P_i$)
 6:     Find $A_m \in \mathrm{prec}(A_j)$ such that $S_{m \to j} = \max_{A_i \in \mathrm{prec}(A_j)} \{S_{i \to j}\}$
 7:     Set the deadline of $A_m$ to $D_m = C_m + \eta_m (P_m - C_m)$
 8:     Find $S_j = \max_{A_i \in \mathrm{prec}(A_j)} \{S_{i \to j}\}$ with the new deadline of $A_m$
 9:     Find $A_k \in \mathrm{prec}(A_j)$ such that $S_{k \to j} = \max_{A_i \in \mathrm{prec}(A_j)} \{S_{i \to j}\}$
10:     if $A_k \neq A_m$ then
11:       Repeat lines 6 to 9 {A new actor $A_k$ is the bottleneck}
12:     else

17: Set the deadline of each output actor $A_o$ to $D_o = C_o + \eta_o (P_o - C_o)$
18: return $\vec{D}$, where $D_i \in \vec{D}$ is the deadline of actor $A_i$, and $\vec{S}$, where $S_i \in \vec{S}$ is the start time of actor $A_i$
Theorem 4.7.1 implies that a balanced graph scheduled to achieve the minimum achievable latency requires a number of processors $m = n$, where $n$ is the number of actors in the graph. In such a case, the system has a "one-to-one" mapping (i.e., one task per processor). Hence, the type of scheduler is irrelevant since each task requires a dedicated processor, which is fully utilized, in order to achieve the minimum achievable latency.
Theorem 4.7.2. The minimum achievable latency of a graph $G$ with basic repetition vector $\check{\vec{q}} = \vec{1}$, when its actors are scheduled as implicit-deadline periodic tasks, is $\mathcal{L}_{PS}(G) = L \max_{A_i \in A} \{C_i\}$.
Proof. Based on Lemma 4.3.1 on page 53, a CSDF graph with basic repetition vector $\check{\vec{q}} = \vec{1}$ will have a period vector in which all the periods are the same and equal to $\max_{A_i \in A} \{C_i\}$. Thus, the latency of the graph is the sum of the periods along the longest output path. By Algorithm 1 on page 27, the longest output path has $L$ actors in it. Thus, the graph latency is $\mathcal{L}(G) = L \max_{A_i \in A} \{C_i\}$. ∎
Theorem 4.7.3. The minimum achievable latency of a graph $G$ with basic repetition vector $\check{\vec{q}} = \vec{1}$, when the actors of $G$ are scheduled as constrained-deadline periodic tasks with $\vec{D} = \vec{C}$, is equal to its latency under a worst-case self-timed schedule.
Proof. When the actors of $G$ are scheduled with $\vec{D} = \vec{C}$, an actor $A_i \in A$ is started immediately after all its predecessors have finished at least one firing. Hence, the latency encountered by the first sample is equal to the sum of the actors' execution times along the output path with the largest sum of actors' execution times. This is equivalent to the latency under a worst-case self-timed schedule. ∎
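The closed-form latencies of Theorems 4.7.2 and 4.7.3 can be contrasted on a small graph. The sketch below uses the chain of Figure 4.14 (WCETs 2, 4, 7, and 1, unit rates), for which a unity basic repetition vector is assumed. The function names and the representation of output paths as lists of actor names are hypothetical choices for this illustration.

```python
def latency_implicit_deadlines(wcet, paths):
    """Theorem 4.7.2: L * max C_i, with L the number of actors on the longest output path."""
    return max(len(path) for path in paths) * max(wcet.values())


def latency_deadline_equals_wcet(wcet, paths):
    """Theorem 4.7.3: the largest sum of execution times along any output path."""
    return max(sum(wcet[actor] for actor in path) for path in paths)


wcet = {"A1": 2, "A2": 4, "A3": 7, "A4": 1}
paths = [["A1", "A2", "A3", "A4"]]          # the only output path of the chain

print(latency_implicit_deadlines(wcet, paths))    # 28
print(latency_deadline_equals_wcet(wcet, paths))  # 14
```

For this chain, tightening every deadline from the period to the WCET halves the latency, from 28 to 14 time units.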
We summarize the results presented so far in the form of the decision tree shown in Figure 4.13. This decision tree can be used by the designer to determine the quality of periodic schedules for a given graph $G$. If $G$ is a matched I/O rates graph (see Definition 4.2.3 on page 51), then the graph will achieve its optimal throughput when it is scheduled using its minimum period vector. If a matched I/O rates graph $G$ is a balanced one (see Definition 4.2.4 on page 51) or has a unity basic repetition vector (i.e., $\check{\vec{q}} = \vec{1}$), then it also has an optimal latency when it is scheduled using its minimum period vector. Otherwise, $G$ might have a sub-optimal latency under a periodic schedule. Finally, mis-matched I/O rates graphs have sub-optimal throughput and latency under periodic schedules.
4.8 Deriving Architecture and Mapping Specifications
So far, we have shown how to derive the parameters of the taskset corresponding to a given CSDF graph. Recall from Figure 1.7 on page 14 that the scheduling framework produces two outputs that are used as inputs to the system-level synthesis step. These outputs are the architecture and mapping specifications. The architecture specification refers to the number of processors needed to schedule the taskset, while the mapping specification refers to the allocation of tasks to processors. Suppose that we want to run a set of programs modeled by a set of graphs $\mathcal{G} = \{G_1, G_2, \dots, G_N\}$ on a platform. Let $T(G_i)$ denote the taskset corresponding to the actors of $G_i$. Then, we define $T$ to be the union of these tasksets:

$$
T = \bigcup_{G_i \in \mathcal{G}} T(G_i)
$$

After that, we apply one of the partitioning schemes described in Section 2.4.4 on $T$. This results in an $m$-partition of $T$, denoted by mT. Then, the architecture specification states that the system consists of $m$ processors, while the mapping specification states that the allocation of tasks to processors is given by mT. To illustrate these steps, we give the following example.
[Figure 4.13: Decision tree for scheduling CSDF actors as real-time periodic tasks. $\mathcal{R}_{PS}$ and $\mathcal{L}_{PS}$ refer to the throughput and latency, respectively, when $\vec{P} = \check{\vec{P}}$. Recoverable node labels: "$\mathcal{L}_{PS} = \mathcal{L}_{WSTS}$", "$\check{\vec{q}} = \vec{1}$?", and "$\mathcal{L}_{PS} = \mathcal{L}_{WSTS}$ if $\vec{\eta} = \vec{0}$".]
Example 4.8.1. Consider the programs shown in Listing 1 (on page 6) and Listing 2. Applying the automated parallelization and model construction steps explained in Chapter 3 to both programs results in the CSDF graphs shown in Figures 2.1 and 4.14. We denote the graph shown in Figure 2.1 by $G_1$ and the one shown in Figure 4.14 by $G_2$. Recall from Example 4.3.1 that the execution time vector of $G_1$ is $\vec{C} = [5, 8, 24, 4]^T$. For $G_2$, the execution times are given between the parentheses in Figure 4.14 and $\vec{C} = [2, 4, 7, 1]^T$. Assume that the period scaling factor for both graphs is 1 (see (4.17) on page 54) and the deadline factor for both graphs is 1.0 (see (4.22) on page 58). Then, we compute the periods, start times, and buffer sizes of both graphs as shown in Table 4.5. We see from Table 4.5 that the total utilization of $T(G_1)$ and $T(G_2)$ is equal to 4.7916. Thus, the absolute minimum number of processors needed to schedule $G_1$ and $G_2$ on a system, assuming an optimal scheduling algorithm, is given by (2.20) on page 35 and is equal to $\check{m}_{OPT} = \lceil 4.7916 \rceil = 5$. Under partitioned scheduling, the minimum number of processors needed to schedule $G_1$ and $G_2$ on a system is given by (2.21) on page 36.

Listing 2: Another example of a SANLP in C
    int main()
    {
        for (i = 0; i < 100; i++) {

[Figure 4.14: The CSDF graph corresponding to the SANLP program in Listing 2. $A_1$ corresponds to in, $A_2$ corresponds to g1, $A_3$ corresponds to g2, and $A_4$ corresponds to out. The numbers between the parentheses are the WCETs of the actors. For example, the WCET of actor $A_2$ is 4. The graph is the chain $A_1(2) \to A_2(4) \to A_3(7) \to A_4(1)$ connected through channels $E_1$, $E_2$, and $E_3$, with all production and consumption rate sequences equal to $[1]$.]

Table 4.5: The taskset parameters for $G_1$ and $G_2$ assuming $\mu_G = 1$ and $\vec{\eta} = \vec{1}$ for both graphs. $\vec{D}$ is omitted since $\vec{D} = \vec{P}$.

Graph    $\vec{C}$             $\vec{P}$              $\vec{S}$               $\vec{b}$               $u_{sum}(T)$
$G_1$    $[5, 8, 24, 4]^T$     $[8, 12, 24, 8]^T$     $[0, 8, 24, 32]^T$      $[2, 2, 5, 3, 2]^T$     2.7916
$G_2$    $[2, 4, 7, 1]^T$      $[7, 7, 7, 7]^T$       $[0, 7, 14, 21]^T$      $[2, 2, 2]^T$           2.0

[Figure 4.15: Mapping of $G_1$ and $G_2$ onto 6 processors assuming EDF and FFD. $\pi_i$ denotes the $i$th processor and $u$ denotes the total utilization of a task. Each task corresponds to a function in either Listing 1 or 2. For example, $T_{g1}$ corresponds to function g1 in Listing 2. The arrows represent the communication between the tasks.]

Suppose that EDF is used as a scheduling algorithm and FFD (see Section 2.4.4) is used as an allocation algorithm. Then, the resulting mapping is shown in Figure 4.15. In this case, the minimum number of processors needed to schedule $G_1$ and $G_2$ is equal to 6. Figure 4.15 contains both the architecture and mapping specifications needed for performing the system-level synthesis in Chapter 5. It states how many processors are needed (i.e., 6) and the mapping of each task to processors together with the communication pattern among the processors. ◻
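The processor counts in Example 4.8.1 can be reproduced with a short first-fit decreasing (FFD) sketch. It is an illustration under assumptions, not the allocation code used in the dissertation: tasks are represented only by their utilizations $C_i / P_i$ taken from Table 4.5, and a processor is assumed to accept a task as long as its total utilization does not exceed 1, which is the EDF schedulability condition for implicit-deadline periodic tasks on one processor. The function name `first_fit_decreasing` is hypothetical.

```python
from fractions import Fraction
from math import ceil


def first_fit_decreasing(utilizations, capacity=Fraction(1)):
    """Place each task, largest utilization first, on the first processor that still fits it."""
    processors = []
    for u in sorted(utilizations, reverse=True):
        for tasks in processors:
            if sum(tasks) + u <= capacity:
                tasks.append(u)
                break
        else:
            processors.append([u])      # no processor fits: open a new one
    return processors


C1, P1 = [5, 8, 24, 4], [8, 12, 24, 8]  # G1, from Table 4.5
C2, P2 = [2, 4, 7, 1], [7, 7, 7, 7]     # G2, from Table 4.5
utilizations = [Fraction(c, p) for c, p in zip(C1 + C2, P1 + P2)]

total = sum(utilizations)
mapping = first_fit_decreasing(utilizations)

print(float(total))    # about 4.7917
print(ceil(total))     # 5: lower bound with an optimal global scheduler
print(len(mapping))    # 6: processors used by FFD with EDF
```

Exact fractions are used so that the fit test is not affected by floating-point rounding. The sketch yields 6 processors, consistent with the example, although the assignment of individual tasks to processors may differ from the one drawn in Figure 4.15.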
Chapter 5
System-Level Synthesis
Build a system that even a fool can use,
and only a fool will want to use it.
George Bernard Shaw
System-level synthesis represents the fifth step in the proposed design flow. The inputs to this step are the program, architecture, and mapping specifications. The program specification consists of the PPNs derived in Chapter 3 together with the tasks' parameters and buffer sizes derived in Chapter 4. The architecture specification describes the number of processors as derived in Section 4.8. Finally, the mapping specification, derived in Section 4.8, associates each task with the processor on which it runs. All these specifications are used, as shown in Figure 5.1, to generate the MPSoC platform, which consists of the hardware part together with the software running on that hardware. The whole system-level synthesis procedure is performed using ESPAM [NSD08]. ESPAM is a system-level synthesis tool for streaming systems with support for model-based design. We extended ESPAM in order to: (1) generate the hardware architecture explained in the following section, (2) generate the software with the proper scheduling and communication infrastructures explained later in Section 5.2, and (3) add support for Xilinx ML605 and Zynq boards that are used later in Chapter 6 for prototyping the synthesized systems.
5.1 Hardware
In this dissertation, we consider a hardware platform consisting of a tiled distributed memory MPSoC as shown in Figure 5.2. The on-chip interconnect is assumed to be predictable, that is, one that provides bounded worst-case latency on read/write operations between any communicating source and destination pair in the SoC. An example of such an interconnect is the Æthereal network-on-chip [GDR05]. This assumption is necessary in order to compute in Section 4.3 a safe upper bound on the worst-case execution time of each actor.

[Figure 5.2: Top-level block diagram of the hardware platform considered in this dissertation]
Each tile consists, as shown in Figure 5.3, of a processor, several memories, and a timer. Each tile contains three dedicated memories:
• Program Memory (PM) to store the programs' binaries
• Data Memory (DM) to store the data segments, heap and stack
• Communication Memory (CM) to store the data sent to other processors
Each processor writes the processed data to its local communication memory. After that, remote consumer processors read this data from the communication memory of the producer processor. This means that data writes are always local, and data reads are either local or remote depending on the actual mapping of tasks to processors. All the memories are implemented as dual-port memories, which means that the communication memory can be accessed by its owner processor and a remote processor at the same time.

[Figure 5.3: Tile organization]
A complete detailed picture of the SoC architecture integrated into ESPAM is shown in Figure 5.4. The on-chip interconnect in the SoC is a general-purpose, high-performance AXI-4 [ARM10] crossbar switch which is provided by a commercial IP vendor. The crossbar features a Shared-Address, Multiple-Data (SAMD) topology as shown in Figure 5.5. It has two arbiters: one for read transactions and one for write transactions. Both arbiters are independent and can be active at the same time. The arbitration policy can be configured to be round-robin or priority-based. Parallel write and read data pathways connect each master to all the slaves that it can access according to a sparse connectivity map. When more than one source has data to send to different destinations, data transfers can occur independently and in parallel. We configure the crossbar to use the round-robin arbitration policy, which enables us to derive a safe upper bound on the latencies of the communication operations.
In order to perform hardware generation, we store the hardware platform shown in Figure 5.4 as a parametrized template in ESPAM. Then, we use ESPAM to generate the actual platform in Xilinx Platform Studio (XPS) Microprocessor Hardware Specifications (MHS, [Xil11]) format. This allows importing the hardware project directly into the Xilinx XPS tool and performing FPGA synthesis.
5.2 Software
Recall from Chapter 3 that the code of the parallelized programs is generated by the
PNgen compiler. The generated code has the same form as the example shown in
Figure 3.2 on page 42. Thus, the remaining components that we have to generate are: (1)
the scheduling infrastructure, and (2) the communication infrastructure implementing
the FIFO reads/writes.
Figure 5.4: Complete MPSoC architecture. P-Bus and D-Bus stand for program and data buses,
respectively. M-AXI and S-AXI stand for AXI master and slave, respectively.
5.2.1 Scheduling Infrastructure
Recall from Chapter 4 that we schedule the tasks as periodic tasks. For the scheduler,
we consider fixed task priority scheduling with Deadline Monotonic priority assignment.
This choice is driven by the wide availability of real-time operating systems
supporting fixed task priority scheduling. Nevertheless, it is important to note that
different scheduling algorithms (e.g., EDF) can be used. We chose to implement the
scheduling infrastructure using FreeRTOS [Reab]. FreeRTOS is an open source RTOS
that implements fixed task priority scheduling and supports Xilinx FPGAs, which are
used later in Chapter 6 for evaluating the synthesized systems.
Tick-Based Implementation
FreeRTOS, like many real-time operating systems, relies on hardware timers to
keep track of time. Such timers generate periodic interrupts and these interrupts cause
the OS to invoke the scheduler. A single interrupt and the associated scheduling event
are called an OS clock tick. The OS clock tick defines the shortest time granularity visible
to the OS. The OS clock tick is different from the processor (or CPU) clock tick. A processor
clock tick refers to the duration of a single clock cycle of the clock signal used to operate
the processor. Most of the WCET analysis tools measure the WCET of a task in terms of
Figure 5.5: Crossbar topology. AW stands for AXI write address channel, AR for AXI read address
channel, W for write data channel, and R for read data channel.
are multiples of the OS clock tick duration. Therefore, it is necessary during the system
synthesis phase to ensure that all the timing parameters are converted to the appropriate
OS clock tick values. This can be done, for example, by rounding the parameters up to
the nearest multiple of the OS clock tick duration. To illustrate the previous concepts,
we provide the following example.
Example 5.2.1. Suppose that we have a system comprised of a processor with a clock
frequency equal to 1 GHz (i.e., the processor clock cycle is 1 ns). Suppose that we want to
run a task T1 with the following parameters (all in processor clock cycles):
T1 = (C1 = 1.5 × 10^6, P1 = 2.5 × 10^6, D1 = 2.5 × 10^6, S1 = 0). Now, suppose that the
OS clock tick frequency is 1000 Hz. This means that the OS performs scheduling events
every 1 ms, which is equivalent to 10^6 processor clock cycles. We see that the period and
deadline of T1 are not multiples of the OS clock tick duration. Therefore, P1 and D1 must be
rounded up to the nearest multiple of the OS clock tick duration, which is 3.0 × 10^6.
Such rounding might of course violate the timing requirements dictated by the designer.
Therefore, it is important to keep in mind the effect of such rounding while specifying
the system and program timing requirements. ◻
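The rounding in Example 5.2.1 can be sketched in C as follows. This is a minimal
illustration and not code generated by ESPAM: the helper names cycles_per_os_tick and
cycles_to_ticks_ceil are hypothetical, and the sketch assumes that the CPU frequency is
an integer multiple of the OS tick rate.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of FreeRTOS): number of processor clock
 * cycles in one OS clock tick. Assumes cpu_hz is a multiple of tick_hz. */
static uint64_t cycles_per_os_tick(uint64_t cpu_hz, uint64_t tick_hz)
{
    assert(tick_hz > 0 && cpu_hz % tick_hz == 0);
    return cpu_hz / tick_hz;
}

/* Hypothetical helper: converts a duration in processor clock cycles into
 * OS clock ticks, rounding up to the next whole tick. */
static uint64_t cycles_to_ticks_ceil(uint64_t cycles, uint64_t cycles_per_tick)
{
    assert(cycles_per_tick > 0);
    return cycles / cycles_per_tick + (cycles % cycles_per_tick != 0);
}
```

With the values of Example 5.2.1 (1 GHz processor, 1000 Hz OS clock tick), one tick is
1000000 cycles, so a period or deadline of 2500000 cycles becomes 3 ticks and a WCET of
1500000 cycles occupies 2 ticks.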
One effect of the tick-based implementation that must be taken into account is
the ratio between the tasks’ WCET and the OS clock tick duration. If the WCET is a
fraction of the OS clock tick, then the resulting schedule has sub-optimal throughput
with under-utilized processors and the overhead of the RTOS is not amortized. On the
other hand, if the WCET is larger than the OS clock tick duration (preferably a multiple
of the OS clock tick), then the RTOS overhead is amortized. Therefore, it is important
to consider this relation between the tasks’ WCET and the OS clock tick duration when
the timing parameters are converted from CPU clock cycles into OS clock ticks.
A periodic task Ti can be implemented in FreeRTOS as shown in Listing 3.
Listing 3 Implementing a periodic task in FreeRTOS
#include "FreeRTOS.h"
#include "task.h"

void task(void *arg) {
    portTickType LastReleaseTime;
    const portTickType Period = 5;          /* period in OS clock ticks */
    LastReleaseTime = xTaskGetTickCount();  /* initialized when the task starts */

    for (;;) {
        function();                         /* the task function */
        /* Sleep until LastReleaseTime + Period; LastReleaseTime is updated */
        vTaskDelayUntil( &LastReleaseTime, Period );
    }
}
Variable LastReleaseTime records, as its name implies, the last release time of Ti in OS
clock ticks. This variable is initialized when the task starts. Constant Period represents
the period of the task in terms of OS clock ticks. For example, in Listing 3, the
period is 5 OS clock ticks. Inside the for-loop, the task function (i.e., function())
is executed repeatedly without terminating. After each execution, function
vTaskDelayUntil, which is part of the FreeRTOS API, is called. The detailed description
of vTaskDelayUntil is shown in Figure 5.6. The function takes two parameters:
LastReleaseTime and Period. When called, it puts the task in the sleep state and
schedules it for reactivation at time t = LastReleaseTime + Period. It also updates the
value of LastReleaseTime accordingly.
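The release arithmetic described above can be modeled on a host machine without
FreeRTOS. The sketch below is a simplified model and not the FreeRTOS implementation:
tick_t stands in for portTickType, the current tick count is passed in explicitly, and
delay_until_model is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t tick_t;  /* stand-in for portTickType (an assumption) */

/* Host-side model of the behaviour described for vTaskDelayUntil.
 * Returns true if the task would block until the wake time, and false if
 * the wake time is not in the future (the call returns immediately). */
static bool delay_until_model(tick_t *last_release, tick_t period, tick_t now)
{
    assert(last_release != NULL);
    tick_t wake = *last_release + period;     /* absolute wake time */
    bool blocks = (int32_t)(wake - now) > 0;  /* wrap-tolerant "wake > now" */
    *last_release = wake;                     /* updated for the next call */
    return blocks;
}
```

Starting from LastReleaseTime = 0 with Period = 5, successive calls produce the release
times 5, 10, 15 regardless of when within each period the call is made, which is what
keeps the execution frequency constant.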
Another effect of the tick-based implementation that must also be taken into account
is the need to synchronize the time returned by xTaskGetTickCount() among
the different processors. In an MPSoC, the clock signals of the different processors
are usually generated from a single “reference” clock signal produced by an oscillator.
Therefore, the clock signals used by the processors can be kept in phase. The moment
at which the OS clock tick count returned by xTaskGetTickCount() is initialized
can be synchronized through the use of a global barrier in the initialization code of
each processor. For example, see line 14 in Listing 5.
Example 5.2.2. Consider the PPN shown in Figure 3.2 on page 42. Process 𝒫snk is
realized under FreeRTOS as shown in Listing 4. Note that the while-loop shown in
Figure 3.2 is replaced with for(;;) in Listing 4. Note also that vTaskDelayUntil
is placed such that after each invocation of function snk, the task postpones its next
execution to the next release time in accordance with the real-time periodic task model
as defined in Section 2.4.1. The READ primitive together with its counterpart WRITE are
used to read/write from/to the FIFOs, respectively. These two primitives are explained
later in Section 5.2.2. ◻
Function Prototype:
void vTaskDelayUntil( portTickType *LastReleaseTime,
portTickType Period );
Description:
Delay a task until a specified time. This function can be used by cyclical tasks to ensure a
constant execution frequency.
This function differs from vTaskDelay() in one important aspect: vTaskDelay() specifies
a time at which the task wishes to unblock relative to the time at which vTaskDelay()
is called, whereas vTaskDelayUntil() specifies an absolute time at which the task wishes
to unblock.
vTaskDelay() will cause a task to block for the specified number of ticks from the time
vTaskDelay() is called. It is therefore difficult to use vTaskDelay() by itself to generate
a fixed execution frequency as the time between a task unblocking following a call to
vTaskDelay() and that task next calling vTaskDelay() may not be fixed (the task may
take a different path through the code between calls, or may get interrupted or preempted a
different number of times each time it executes).
Whereas vTaskDelay() specifies a wake time relative to the time at which the function
is called, vTaskDelayUntil() specifies the absolute (exact) time at which it wishes to
unblock.
It should be noted that vTaskDelayUntil() will return immediately (without blocking)
if it is used to specify a wake time that is already in the past. Therefore a task using
vTaskDelayUntil() to execute periodically will have to re-calculate its required wake
time if the periodic execution is halted for any reason (for example, the task is temporarily
placed into the Suspended state) causing the task to miss one or more periodic executions.
This can be detected by checking the variable passed by reference as the LastReleaseTime
parameter against the current tick count. This is however not necessary under most usage
scenarios.
This function must not be called while the RTOS scheduler has been suspended by a call to
vTaskSuspendAll().
Parameters:
LastReleaseTime: Pointer to a variable that holds the time at which the task was
last unblocked. The variable must be initialized with the current time prior to its first
use (see the example below). Following this the variable is automatically updated within
vTaskDelayUntil().
Period: The cycle time period. The task will be unblocked at time (*LastReleaseTime
+ Period). Calling vTaskDelayUntil with the same Period parameter value will
cause the task to execute with a fixed interval period.
Figure 5.6: Detailed description of function vTaskDelayUntil. Source: [Reaa].
Enforcing Start Times
Listing 4 Implementing process 𝒫snk in Figure 3.2 as a periodic task under FreeRTOS
void task_snk(void *arg) {
    portTickType LastReleaseTime;
    const portTickType Period = 5;
    LastReleaseTime = xTaskGetTickCount();

In many commercial real-time operating systems, the API provided by the RTOS does
not allow the programmer to specify explicitly the start time of a task when the task is
created. Therefore, it is the programmer’s responsibility to implement a mechanism
which ensures that a task starts at its specified start time as derived in Section 4.4. The
start time of a task may be realized under such an RTOS using several mechanisms.
We list here two possible mechanisms:
1. The first mechanism is to use a master task that releases the programs’ tasks at
the time when they are supposed to start. This master task releases each task at
the OS clock tick on which the task should begin its execution. After starting
all the tasks, the master task can be terminated or put into a permanent sleep
state.
2. The second mechanism is to release all the tasks simultaneously. After that,
each task is put into sleep state from the moment of simultaneous release till
the moment at which it should start.
The first mechanism provides tight control on when to start the tasks. However,
a disadvantage of this mechanism is the extra overhead introduced by the master
task. The utilization of the master task must be taken into account while deriving the
architecture and mapping specifications (see Section 4.8) in order to avoid any deadline
misses.
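As an illustration of the first mechanism, a master task could walk through the tasks
in order of their start times and sleep between consecutive releases. The sketch below
only computes such a release plan in plain C; it is an assumption about how a master
task might be organized, not the mechanism implemented in ESPAM, and release_t,
by_start_tick and build_release_plan are hypothetical names. The master task is assumed
to start at OS clock tick 0.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    int      task_id;     /* hypothetical task identifier */
    uint32_t start_tick;  /* start time of the task in OS clock ticks */
} release_t;

/* qsort comparator: ascending start time. */
static int by_start_tick(const void *a, const void *b)
{
    const release_t *x = a, *y = b;
    return (x->start_tick > y->start_tick) - (x->start_tick < y->start_tick);
}

/* Sorts plan[] by start time and stores in delays[i] the number of ticks
 * the master task would sleep before releasing plan[i]. */
static void build_release_plan(release_t *plan, uint32_t *delays, size_t n)
{
    assert(plan != NULL && delays != NULL);
    qsort(plan, n, sizeof plan[0], by_start_tick);
    uint32_t now = 0;  /* master task assumed to start at tick 0 */
    for (size_t i = 0; i < n; i++) {
        delays[i] = plan[i].start_tick - now;
        now = plan[i].start_tick;
    }
}
```

For three tasks with start times 20, 0 and 5 ticks, the plan releases them in the order
0, 5, 20 with sleeps of 0, 5 and 15 ticks in between. The time spent by the master task
in this loop is the overhead that has to be accounted for in the utilization.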
The second mechanism does not have the utilization overhead of the master task
Figure 5.7: FIFO layout in memory and the read/write registers
task model as it does not cause the task to block the processor from other tasks.
Example 5.2.3. Consider task snk shown in Listing 4. The implementation of snk
assuming the second mechanism (i.e., simultaneous release) is shown in Listing 5.
Variable SimultaneousReleaseTime is passed from the main function that
releases all the program’s tasks. This variable is used to put the task in sleep state until
its start time. After that, the task executes as a periodic task starting from the time
assigned to LastReleaseTime at line 29. ◻
5.2.2 Communication Infrastructure
The communication infrastructure deals with the implementation of the READ and
WRITE primitives shown for example in Figure 3.2 on page 42 and Listing 4 on page
82. These primitives provide the actors with the ability to communicate with each
other. Recall from Section 5.1 that each actor produces data to its local communication
memory, and reads data from its communication memory and/or remote communication
memories. The FIFOs are implemented as circular buffers, and they are stored
in the communication memories of the processors (see Figure 5.4 on page 78). The
size of a single data word in the FIFOs is 32 bits. Each FIFO contains two special data
words called wr_cnt and rd_cnt as shown in Figure 5.7(a). These data words store
two pieces of information as shown in Figure 5.7(b) and 5.7(c): (1) the write and read
counters of the FIFO, and (2) a special bit called “Flag” which is used for detecting
counter overflows. Whenever the read/write counter reaches the FIFO size, the counter
is reset and the flag bit is toggled. Storing the counter and flag in one data word enables
updating the FIFO state in a producer/consumer task using a single atomic operation.
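The handling of the counter words can be sketched as a few helper functions. The bit
layout used here (flag in bit 31, position in the lower 31 bits) is inferred from the
masks 0x80000000 and 0x7FFFFFFF that appear in Listings 6 and 7. The helper names are
hypothetical, and the sketch assumes that the FIFO size is a multiple of the transfer
length, so that the position lands exactly on the size at the end of the buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FLAG_MASK  0x80000000u  /* bit 31: wrap-around flag */
#define COUNT_MASK 0x7FFFFFFFu  /* bits 30..0: position in the buffer */

/* Advances a counter word by len words. When the position reaches size,
 * the position is reset to zero and the flag bit is toggled. */
static uint32_t fifo_advance(uint32_t word, uint32_t len, uint32_t size)
{
    assert(len > 0 && size % len == 0);
    word += len;
    if ((word & COUNT_MASK) == size) {
        word = (word & FLAG_MASK) ^ FLAG_MASK;
    }
    return word;
}

/* Empty: same position and same flag in both words. */
static bool fifo_is_empty(uint32_t wr, uint32_t rd)
{
    return wr == rd;
}

/* Full: same position, but the writer has wrapped once more than the reader. */
static bool fifo_is_full(uint32_t wr, uint32_t rd)
{
    return rd == (wr ^ FLAG_MASK);
}
```

Without the flag, equal positions could not distinguish an empty FIFO from a full one;
keeping the flag in the same 32-bit word as the position means that one store publishes
both.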
A detailed implementation of the read/write operations is depicted in Listings 6
and 7. The read/write operations accept four input parameters: (1) a pointer to the value
read/written (val), (2) a pointer to the FIFO (pos), (3) the amount of data, in 32-bit
words, being read/written during an invocation of the read/write operation (len), and
(4) the size of the FIFO in 32-bit words (size). The implementation shown in Listings
6 and 7 assumes that the amount of data written/read by the producer/consumer,
Listing 5 Implementing the simultaneous release mechanism under FreeRTOS. The
listing shows only the code relevant to the simultaneous release mechanism; other
details are omitted.
1 int main() {
2 static portTickType SimultaneousReleaseTime;
3 /* xTaskCreate() (part of FreeRTOS API) creates new tasks */
4 xTaskCreate( task_snk, "snk", SNK_STACK_SIZE, \
5 &SimultaneousReleaseTime, SNK_PRIORITY, NULL);
6 /* Other xTaskCreate() invocations go here */
7
8 /* xTaskGetTickCount() (part of FreeRTOS API) returns the count
9 of OS clock ticks since vTaskStartScheduler() was called.
10 If vTaskStartScheduler() was not called, it returns 0 */
11 SimultaneousReleaseTime = xTaskGetTickCount();
12
13 /* Set up a global barrier to synchronize the processors */
14 waitForGlobalStartSignal(); /* */
15
16 /* vTaskStartScheduler() (part of FreeRTOS API) invokes the
17 scheduler for the first time. It also resets the count of
18 OS clock ticks returned by xTaskGetTickCount() to 0 */
19 vTaskStartScheduler();
20 }
21 void task_snk(void *arg) {
22 portTickType LastReleaseTime, SimultaneousReleaseTime;
23 const portTickType Period = 5;
24 const portTickType StartTime = 20;
25 /* SimultaneousReleaseTime is set in main() */
26 SimultaneousReleaseTime = *((portTickType *) arg);
27 vTaskDelayUntil( &SimultaneousReleaseTime, StartTime );
28 /* Set LastReleaseTime to the actual start time */
29 LastReleaseTime = xTaskGetTickCount(); /* */
30







38 vTaskDelayUntil( &LastReleaseTime, Period );
39 } } } }
Listing 6 An example implementation of the read macro under FreeRTOS
1 READ(void *val, void *pos, int len, int size){
2 volatile int *fifo=(int *)pos;
3 int r_cnt = fifo[1];
4 int w_cnt = fifo[0];
5 int i = 0;
6 while(w_cnt == r_cnt){
7 taskDISABLE_INTERRUPTS();
8 xil_printf("PANIC! Buffer Underflow\n");
9 for (;;);
10 }
11 for(i = 0; i < len; i++){
12 ((volatile int *)val)[i]= fifo[(r_cnt & 0x7FFFFFFF)+2+i];
13 }
14 r_cnt += len;
15 if((r_cnt & 0x7FFFFFFF) == size){
16 r_cnt &= 0x80000000;
17 r_cnt ^= 0x80000000;
18 }
19 fifo[1] = r_cnt;
20 }
respectively, is always the same. That is, for a given communication channel, the value
of len used in WRITE by the producer and the value of len used in READ by the
consumer are the same. When a task Ti reads len words from the FIFO into a buffer
val, the read macro performs the following steps:
1. The read and write counters are copied into local variables (lines 3 and 4).
2. If a buffer underflow occurs (i.e., the FIFO is empty), then the interrupts are disabled
and a “panic” message is printed to the user to indicate that a buffer underflow
has occurred (lines 6-10). It is important to note that this situation should not
occur under normal operating conditions since the start times and buffer sizes
derived in Sections 4.4 and 4.5 are valid. Recall from Section 1.2 that normal
operating conditions mean that both system hardware and software function
properly without faults.
3. The for-loop copies the data from the communication memory into val and
the read counter is incremented (lines 11-14).
4. After that, the read counter is checked for the overflow condition and Flag is toggled
accordingly (lines 15-18).
5. Finally, the macro updates the rd_cnt register in the FIFO with the new value of
the read counter by doing a single atomic assignment (line 19).
Listing 7 An example implementation of the write macro under FreeRTOS
1 WRITE(void *val, void *pos, int len, int size){
2 volatile int *fifo=(int *)pos;
3 int w_cnt = fifo[0];
4 int r_cnt = fifo[1];
5 int i = 0;
6 while(r_cnt == (w_cnt ^ 0x80000000)){
7 taskDISABLE_INTERRUPTS();
8 xil_printf("PANIC! Buffer overflow\n");
9 for (;;);
10 }
11 for(i = 0; i < len; i++) {
12 fifo[(w_cnt & 0x7FFFFFFF)+2+i] = ((volatile int *)val)[i];
13 }
14 w_cnt += len;
15 if((w_cnt & 0x7FFFFFFF) == size){
16 w_cnt &= 0x80000000;
17 w_cnt ^= 0x80000000;
18 }
19 fifo[0] = w_cnt;
20 }
Analogously, when a task Ti writes len words to the FIFO from a buffer val, the
write macro performs the following steps:
1. The read and write counters are copied into local variables (lines 3 and 4).
2. If a buffer overflow occurs (i.e., the FIFO is full), then, similar to READ, the
interrupts are disabled and a “panic” message is printed to the user (lines 6-10).
Note again that this situation should not occur under normal operating
circumstances since the start times and buffer sizes derived in Sections 4.4 and 4.5 are
valid.
3. The for-loop copies the data from val into the communication memory and
the write counter is incremented (lines 11-14).
4. After that, the write counter is checked for the overflow condition and Flag is toggled
accordingly (lines 15-18).
5. Finally, the macro updates the wr_cnt register in the FIFO with the new value of
the write counter by doing a single atomic assignment (line 19).
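The read and write steps can be exercised on a host machine with the following model.
It is a sketch rather than the code of Listings 6 and 7: model_write and model_read are
hypothetical names, they return false on overflow or underflow instead of disabling
interrupts and halting, they use unsigned 32-bit words, and they assume that size is a
multiple of len.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FLAG  0x80000000u  /* wrap-around flag (bit 31) */
#define COUNT 0x7FFFFFFFu  /* position within the buffer */

/* Layout taken from the listings: fifo[0] = wr_cnt, fifo[1] = rd_cnt,
 * data words start at fifo[2]. */
static bool model_write(volatile uint32_t *fifo, const uint32_t *val,
                        uint32_t len, uint32_t size)
{
    assert(len > 0 && size % len == 0);
    uint32_t w = fifo[0], r = fifo[1];                /* step 1 */
    if (r == (w ^ FLAG)) return false;                /* step 2: FIFO full */
    for (uint32_t i = 0; i < len; i++)                /* step 3 */
        fifo[(w & COUNT) + 2 + i] = val[i];
    w += len;
    if ((w & COUNT) == size) w = (w & FLAG) ^ FLAG;   /* step 4 */
    fifo[0] = w;                                      /* step 5: one store */
    return true;
}

static bool model_read(volatile uint32_t *fifo, uint32_t *val,
                       uint32_t len, uint32_t size)
{
    assert(len > 0 && size % len == 0);
    uint32_t w = fifo[0], r = fifo[1];                /* step 1 */
    if (w == r) return false;                         /* step 2: FIFO empty */
    for (uint32_t i = 0; i < len; i++)                /* step 3 */
        val[i] = fifo[(r & COUNT) + 2 + i];
    r += len;
    if ((r & COUNT) == size) r = (r & FLAG) ^ FLAG;   /* step 4 */
    fifo[1] = r;                                      /* step 5: one store */
    return true;
}
```

Only the producer stores to fifo[0] and only the consumer stores to fifo[1], so each
side publishes its progress with a single word write and neither needs a lock.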
Chapter 6
Evaluation and Results
In theory, there is no difference between
theory and practice. But, in practice, there is.
Jan L. A. van de Snepscheut
In this chapter, we evaluate the proposed scheduling framework and design flow by
performing a set of experiments. The first experiment evaluates the first two phases
of the proposed design flow (i.e., automated parallelization and model construction) by
measuring the time needed to perform them on a set of real-life programs. The second
experiment evaluates the scheduling framework proposed in Chapter 4. Namely, it
evaluates the following performance and resource usage metrics for streaming programs
under periodic scheduling: (1) throughput, (2) latency, (3) processor requirements, and
(4) memory requirements. It also compares these metrics to their counterparts obtained
under self-timed scheduling. Finally, the third experiment validates the synthesized
systems by running them on actual hardware and checking the timing behavior during
system run-time against the timing specifications reported by the scheduling framework
during the design process.
Unless mentioned otherwise, all the experiments were performed on a Lenovo
ThinkPad T500 laptop which has the specifications outlined in Table 6.1.
Table 6.1: Specifications of the machine on which the experiments were performed
Property          Value
Processor         Intel Core2 Duo T9400 CPU at 2.53 GHz
RAM               4 GB
Operating system  Ubuntu 12.04 LTS (64-bit)
Table 6.2: Time needed to parallelize and derive the CSDF model for the benchmark programs
Program                                    No. of actors  No. of edges  Lines of code  Time (s)
Filter-bank                                69             89            367            1.60
Alternating direction implicit solver      28             167           209            7.26
FM radio                                   28             39            195            0.66
2D finite difference time domain kernel    17             71            144            0.89
2D gauss blur filter for image processing  11             26            75             7.82
Gram-Schmidt                               9              20            48             1.85
Regularity detector                        8              11            54             2.86
6.1 Experiment I: Evaluating Automated Parallelization and Model Construction
In this experiment, we evaluate the first two phases in the proposed design flow. These
two phases are automated parallelization and model construction. We do that by
parallelizing a set of real-life programs and deriving their CSDF models. The programs
used are from the PolyBench benchmark [Pou]. The programs are specified as
sequential programs in C and vary in their size and complexity. The list of programs
together with the time needed to parallelize them and derive their CSDF models is
shown in Table 6.2.
The time reported in Table 6.2 includes: (1) the time needed by the PNgen compiler
to parse the C program and generate the PPN, and (2) the time needed to derive the CSDF
model as described in Chapter 3. We see clearly that the first two phases of the proposed
flow (i.e., automated parallelization and model construction) are very fast: every program
in Table 6.2 is processed in under 8 seconds. The fast derivation of the PPN and CSDF
models relieves the designer from the burden of writing the parallel specifications
manually. Moreover, this allows the designer to explore a large number of alternative
program specifications in a short period of time [ZNS13, ZBS13].
6.2 Experiment II: Evaluating Performance and Resource Usage Metrics under Periodic Scheduling
In this experiment, given a streaming program executed under a periodic schedule,
we evaluate the following performance and resource usage metrics: (1) throughput,
(2) latency, (3) processor requirements, and (4) memory requirements. Then, we
compare these metrics with those obtained under a self-timed schedule. Recall from
Theorem 4.2.1 on page 52 that the maximum achievable throughput and minimum
achievable latency of a streaming program modeled as a CSDF graph are the ones
achieved under self-timed scheduling. For brevity, we refer in the remainder of this
section to periodic scheduling/schedule as PS and the self-timed scheduling/schedule
as STS. In this experiment, we report the throughput for the output actors (i.e., the
actors producing the output streams of the program, see Section 2.3). For latency, we
report the graph maximum latency according to Definition 4.2.6 on page 51. Under
periodic scheduling, we use the minimum period vector given by Lemma 4.3.1 on page
53. The self-timed schedule parameters are computed using the SDF3 tool-set [SGB06].
SDF3 is a powerful analysis tool-set which is capable of analyzing CSDF and SDF
graphs to check for consistency errors, compute the repetition vector, compute the
maximum achievable throughput and latency, etc. SDF3 defines ℛSTS(G) as the graph
throughput under self-timed scheduling, and ℛSTS(Ai) = qi · ℛSTS(G) as the actor
throughput. Similarly, ℒSTS(G) denotes the graph latency under self-timed scheduling.
We use the sdf3analysis tool from SDF3 to compute the throughput and latency
for the self-timed schedule assuming unbounded FIFO channel sizes. We also use the
same tool to compute the minimum buffer sizes required to achieve the maximum
achievable throughput under a self-timed schedule. We compute the throughput using
the throughput option, the latency using the latency(min_st) option, and
the buffer sizes using the buffersize option.
This experiment is performed on a set of real-life streaming programs. These programs
come from different domains (e.g., signal processing, communication, multimedia,
etc.). The benchmark programs are described in detail in the following section.
6.2.1 Benchmarks
We collected the benchmarks from several sources. The first source is the StreamIt
benchmark [TA10] which contributes 11 streaming programs. The second source is
the SDF3 benchmark [SGB06] which contributes five streaming programs. The third
source is individual research articles which contain real-life CSDF graphs such as
[MBvdBvM08, OH04, PMN+09]. In total, 19 programs are considered as shown in
Table 6.3. These programs are modeled using a mixture of CSDF and SDF graphs. For
StreamIt benchmarks, the actors’ execution times are specified in CPU clock cycles
measured on the MIT RAW architecture [TKM+02], while for SDF3 benchmarks, the
actors’ execution times are specified in CPU clock cycles on the ARM architecture. For
the graphs from [OH04, PMN+09], the authors do not mention explicitly the actors’
execution times. As a result, we make assumptions regarding the execution times which
are reported below Table 6.3.
6.2.2 Throughput Evaluation
Table 6.3: Benchmarks used for evaluating the periodic scheduling framework proposed in Chapter
4. |A| denotes the number of actors in the graph, while |E| denotes the number of communication
channels.
Domain             No.  Program                                     |A|  |E|  Source
Signal Processing  1    Multi-channel beamformer                    57   70   [TA10]
                   2    Discrete cosine transform (DCT)             8    7    -
                   3    Fast Fourier transform (FFT) kernel         17   16   -
                   4    Filterbank for multirate signal processing  85   99   -
                   5    Time delay equalization (TDE)               29   28   -
Cryptography       6    Data Encryption Standard (DES)              53   60   -
                   7    Serpent                                     120  128  -
Sorting            8    Bitonic Parallel Sorting                    40   46   -
Video processing   9    MPEG2 video                                 23   26   -
                   10   H.263 video decoder                         4    3    [SGB06]
Audio processing   11   MP3 audio decoder                           14   18   -
                   12   CD-to-DAT rate converter (SDF) (1)          6    5    [OH04]
                   13   CD-to-DAT rate converter (CSDF)             6    5    -
                   14   Vocoder                                     114  147  [TA10]
Communication      15   Software FM radio with equalizer            43   53   -
                   16   Data modem                                  6    5    [SGB06]
                   17   Satellite receiver                          22   26   -
                   18   Digital Radio Mondiale receiver             4    3    [MBvdBvM08]
Medical            19   Heart pacemaker (2)                         4    3    [PMN+09]
(1) We use two implementations for CD-to-DAT: SDF and CSDF and we refer to them as CD2DAT-S and
CD2DAT-C, respectively. The assumed WCET are C⃗ = [5, 2, 3, 1, 4, 6]^T µs.
(2) We assume the following WCET: Motion Est.: 4 µs, Rate Adapt.: 3 µs, Pacer: 5 µs, and EKG: 2 µs.
Table 6.4 shows the results of comparing the throughput of the output actor for every
program under both self-timed and periodic schedules. The most important column
in the table is the last column which shows the ratio of the PS schedule throughput to
the STS schedule throughput (ℛPS(Aout)/ℛSTS(Aout)), where Aout denotes the output
actor. We clearly see that periodic scheduling delivers the same throughput as
self-timed scheduling for 16 out of 19 programs. All these 16 programs are matched I/O
rates programs that have ℛWSTS equal to ℛSTS. Only three programs (CD2DAT-(S,C)
and Satellite) are mis-matched and have lower throughput under periodic scheduling.
Table 6.4 also confirms the observation made by the authors in [TA10] who reported
an interesting finding: “Neighboring actors often have matched I/O rates. This reduces
the opportunity and impact of advanced scheduling strategies proposed in the literature”.
According to [TA10], the advanced scheduling strategies proposed in the literature
(e.g., [SB09]) are suitable for mis-matched I/O rates programs. Looking into the results
in Table 6.4, we see that periodic scheduling performs very well for matched I/O
programs.
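The ℛPS(Aout) column of Table 6.4 is consistent with an output-actor period of
(lcm(q⃗)/q̌out) · ⌈Ŵ/lcm(q⃗)⌉. This expression is inferred here from the rows of the
table and is an assumption about the minimum period vector of Lemma 4.3.1, which is
stated in Chapter 4 and not reproduced in this section. A minimal sketch in C, with
hypothetical function names, that recomputes the period and the PS-to-STS ratio:

```c
#include <assert.h>
#include <stdint.h>

/* Period of the output actor under the assumed minimum period vector:
 * (lcm_q / q_out) * ceil(w_hat / lcm_q). All arguments are taken from the
 * columns of Table 6.4. */
static uint64_t ps_output_period(uint64_t q_out, uint64_t w_hat, uint64_t lcm_q)
{
    assert(q_out > 0 && lcm_q > 0 && lcm_q % q_out == 0);
    uint64_t scale = w_hat / lcm_q + (w_hat % lcm_q != 0);  /* ceiling */
    return (lcm_q / q_out) * scale;
}

/* Ratio of PS to STS throughput for the output actor, where the PS
 * throughput is the reciprocal of the period. */
static double ps_to_sts_ratio(uint64_t period, double r_sts)
{
    assert(period > 0 && r_sts > 0.0);
    return (1.0 / (double)period) / r_sts;
}
```

For CD2DAT-S (q̌out = 160, Ŵ = 960, lcm(q⃗) = 23520) the period evaluates to 147, and
for Satellite (240, 1056, 5280) it evaluates to 22, matching the 1/147 and 1/22 entries
of the table. Dividing 1/22 by the STS throughput of 2.27 × 10^-1 gives approximately
0.2, the value reported in the last column.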
6.2.3 Latency Evaluation
Table 6.4: Results of Throughput Comparison. Aout denotes the output actor.
Program     q̌out    ℛSTS(Aout)     Ŵ        lcm(q⃗)  ℛPS(Aout)  ℛPS(Aout)/ℛSTS(Aout)
Beamformer  1       1.97 × 10^-4   5076     1       1/5076     1.0
DCT         1       2.1 × 10^-5    47616    1       1/47616    1.0
FFT         1       8.31 × 10^-5   12032    1       1/12032    1.0
Filterbank  1       8.84 × 10^-5   11312    1       1/11312    1.0
TDE         1       2.71 × 10^-5   36960    1       1/36960    1.0
DES         1       9.765 × 10^-4  1024     1       1/1024     1.0
Serpent     1       2.99 × 10^-4   3336     1       1/3336     1.0
Bitonic     1       1.05 × 10^-2   95       1       1/95       1.0
MPEG2       1       1.30 × 10^-4   7680     1       1/7680     1.0
H.263       1       3.01 × 10^-6   332046   594     1/332046   1.0
MP3         2       5.36 × 10^-7   3732276  2       1/1866138  1.0
CD2DAT-S    160     1.667 × 10^-1  960      23520   1/147      0.04
CD2DAT-C    160     1.361 × 10^-1  1176     23520   1/147      0.05
Vocoder     1       1.1 × 10^-4    9105     1       1/9105     1.0
FM          1       6.97 × 10^-4   1434     1       1/1434     1.0
Modem       1       6.25 × 10^-2   16       16      1/16       1.0
Satellite   240     2.27 × 10^-1   1056     5280    1/22       0.2
Receiver    288000  4.76 × 10^-2   6048000  288000  1/21       1.0
Pacemaker   64      2.0 × 10^-1    320      320     1/5        1.0
Figure 6.1 shows the ratios of the latency under periodic scheduling to the latency under
self-timed scheduling. Recall from Section 4.7 that the latency can be controlled using
the deadline factors η⃗. For all the programs scheduled using η⃗ = 1⃗, the average latency
under periodic scheduling is five times the latency under self-timed scheduling. We
also see that the mis-matched programs have large latency due to their sub-optimal
throughput. If we exclude the mis-matched programs, then the average latency is four
times the latency under self-timed scheduling. For latency-insensitive programs, this
is acceptable as long as they can be scheduled using the periodic task model to achieve
the maximum achievable throughput. For latency-sensitive programs, reducing the
latency can be done by using smaller values of the deadline factors η⃗ as explained in
Section 4.7. For example, the Vocoder program has a ratio ℒPS(G)/ℒSTS(G) ≈ 13.5
when η⃗ = 1⃗. This ratio is reduced to 1.0 when η⃗ = 0⃗. If η⃗ is set to 0⃗ for all the programs,
then we see that 14 out of 19 programs achieve optimal latency. Two matched I/O
rates programs (Receiver and Pacemaker) have sub-optimal latency under periodic
scheduling when η⃗ = 0⃗. This is due to the following reasons. First, the execution times
in these two programs have large variations between firings. Such variation is captured
under self-timed scheduling, while the scheduling framework proposed in Chapter
4 always assumes the WCET. Second, we report the maximum latency while SDF3
reports the actual latency under a self-timed schedule.
6.2.4 Processor Requirements Evaluation
Figure 6.1: Results of the latency evaluation. The latency is computed by setting η⃗ and applying
Algorithm 2 to compute the start times and deadlines.
For processor requirements, we compute the minimum number of processors needed
to schedule the program under optimal and partitioned schedulers. We choose to
compute the minimum required number of processors under these two classes of
schedulers because it is possible to do so in an analytical and easy way using (2.20) and
(2.21) on page 35. Unfortunately, such easy computation of the minimum number of
processors is not possible under self-timed scheduling. This is because the minimum
number of processors required by self-timed scheduling, denoted by m̌STS, cannot
be easily computed with equations such as (2.20) and (2.21). Finding m̌STS in practice
requires design space exploration procedures to find the best allocation which delivers
the required throughput and latency. The SDF3 tool-set used to compute the self-timed
scheduling parameters does not support such design space exploration for self-timed
scheduling. Therefore, we choose to set the number of processors required by self-timed
scheduling to its upper bound, which is the number of actors in the graph.
Figure 6.2 shows the minimum number of processors required to schedule the
matched I/O rates programs from Table 6.3 under optimal and partitioned schedulers.
The number of processors is computed assuming: (1) EDF algorithm with QPA
schedulability test (see Section 2.4.3), (2) FFD allocation (see Section 2.4.4), and (3) period
scaling factor µG = 1 for all the programs. When η⃗ = 1⃗, we see that nine out of 16
programs require the same number of processors under both optimal and partitioned
schedulers. The remaining seven programs require on average 14% more processors
under partitioned schedulers. As η⃗ is decreased, we observe the following trends. First,
some programs tend to require the same or a slightly higher number of processors
compared to the case when η⃗ = 1⃗ (e.g., Filterbank and FM Radio). Second, some other
programs tend to have an increase in the number of processors that is proportional to
the decrease in η⃗ (e.g., Beamformer and Serpent). Finally, for all programs, we observe
a large “jump” in the number of processors when η⃗ = 0⃗.
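To make the partitioned case concrete, the following sketch shows a first-fit decreasing
(FFD) allocation in C. It is a simplification and not the procedure used for Figure 6.2:
instead of the QPA test, each processor accepts tasks while its total utilization stays
at or below 1, which is the EDF bound for implicit deadlines only (the case η⃗ = 1⃗).
The names ffd_processors, descending and MAX_PROCS are hypothetical, and utilizations
are expressed as integers in permille to avoid floating-point comparisons.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_PROCS 64  /* arbitrary capacity of this sketch */

/* qsort comparator: descending order of utilization. */
static int descending(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (y > x) - (y < x);
}

/* First-fit decreasing allocation. util[] holds the task utilizations in
 * permille (1000 * C_i / P_i) and is sorted in place. Each task goes to the
 * first processor whose load stays at or below 1000 permille.
 * Returns the number of processors used, or -1 if MAX_PROCS is exceeded. */
static int ffd_processors(int *util, size_t n)
{
    int load[MAX_PROCS] = {0};
    int used = 0;
    qsort(util, n, sizeof util[0], descending);
    for (size_t i = 0; i < n; i++) {
        assert(util[i] > 0 && util[i] <= 1000);
        int p = 0;
        while (p < used && load[p] + util[i] > 1000) {
            p++;                      /* task does not fit, try next one */
        }
        if (p == used) {              /* no open processor fits: open one */
            if (used == MAX_PROCS) return -1;
            used++;
        }
        load[p] += util[i];
    }
    return used;
}
```

For utilizations of 600, 500, 400, 300 and 200 permille, the total is 2.0 and FFD packs
the tasks onto two processors, matching the lower bound given by the rounded-up total
utilization. For three tasks of 700 permille each, no two tasks fit together, so three
processors are needed although the total utilization is only 2.1.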
6.2.5 Memory Requirements Evaluation
Given a streaming program, we compute the total amount of memory needed to realize
the buffers in the communication channels under periodic and self-timed schedules.
We compute the total amount of memory assuming period scaling factor µG = 1 and
deadline factors η⃗ = 0⃗. This part of Experiment II is conducted on a Dell PowerEdge
T710 server running Ubuntu 11.04 (64-bit) Server OS. Table 6.5 shows, for the matched
I/O rates programs in Table 6.3, the total amount of memory required under a periodic
schedule, denoted by M_PS, and the total amount of memory required under a self-timed
schedule, denoted by M_STS. We also report the time needed to compute the buffer sizes
under both schedules (i.e., t_PS and t_STS). We see that eight out of 16 programs have
identical memory requirements under both periodic and self-timed schedules. DES, MPEG2,
Vocoder, and Modem have increased memory requirements under periodic schedules;
nevertheless, the increase remains below 12%. Only one program (Pacemaker) has
a +144% increase in memory requirements under periodic schedules. The reason for
this huge increase is related to the reason for its sub-optimal latency as explained in
Section 6.2.3. Pacemaker has large variations in its execution time which are taken
into account under self-timed scheduling. These variations, however, are not taken
into account under the scheduling framework proposed in Chapter 4, which always
assumes the WCET.
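The percentages quoted above follow directly from the M_PS and M_STS columns of Table 6.5. As a small illustration, the C sketch below computes the relative increase in tenths of a percent using integer arithmetic. The function name memory_overhead_permille is hypothetical, and the sketch assumes M_PS is never smaller than M_STS, which holds for every row of Table 6.5 that has a self-timed result.

```c
#include <assert.h>

/* Relative memory increase of the periodic schedule over the self-timed
 * schedule, in tenths of a percent, rounded to the nearest integer. */
long memory_overhead_permille(long m_ps, long m_sts)
{
    assert(m_sts > 0 && m_ps >= m_sts);
    return ((m_ps - m_sts) * 1000 + m_sts / 2) / m_sts;
}
```

With the Pacemaker values (M_PS = 105, M_STS = 43) the function returns 1442, i.e., an increase of about 144%; with the Modem values (20 and 18) it returns 111, i.e., about 11%.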
Table 6.5: The total amount of memory needed to realize the buffers in the communication
channels under periodic and self-timed schedules. Fields marked with “N/A” indicate that SDF3,
which is used to compute the self-timed schedule parameters, could not compute a solution within
30 days.

Program      M_PS    t_PS (seconds)   M_STS   t_STS (seconds)   M_PS/M_STS
Beamformer   366     3.6              366     859               1.0
DCT          2816    3.0              2816    0.15              1.0
FFT          15872   0.42             15872   1.1               1.0
Filterbank   1128    13.72            N/A     -                 -
TDE          77280   1.41             77280   2.96              1.0
DES          4420    1.0              3986    6400 × 60         1.11
Serpent      19241   12.4             N/A     -                 -
BitonicSort  175     0.23             175     9.96              1.0
MPEG2        9753    2.1              9393    40.73             1.038
H.263        1257    131.4            1257    81.88             1.0
MP3          22      652              22      0.48              1.0
Vocoder      700     31.7             699     1.55              1.001
FM           63      1.13             63      0.1               1.0
Modem        20      0.1              18      0.02              1.111
Receiver     3475    19431            N/A     -                 -
Pacemaker    105     0.2              43      2.03              2.442

6.2.6 Summary of Experiment II
In Sections 6.2.2-6.2.5, we provided a detailed comparison between periodic and
self-timed scheduling for a set of real streaming programs. We compared the following
metrics: (1) throughput, (2) latency, (3) processor usage, and (4) memory requirements.
It is shown that, for more than 70% of the benchmarks, periodic scheduling results
in optimal throughput and latency. It is also shown that the memory requirements
under a periodic schedule, compared to a self-timed one, are the same or slightly higher
(at most +11%) for 75% of the benchmarks. In the cases when periodic scheduling is
optimal (for example in terms of throughput and latency), we argue that it provides
additional benefits over self-timed scheduling. These benefits are:
1. Ability to modify the set of running programs easily. Under periodic scheduling,
one can use efficient schedulability tests and partitioning schemes explained
in Chapter 2 to perform online admission control of new programs. In contrast,
self-timed scheduling requires either re-performing the design space
exploration or using heuristics to devise the new allocation of the programs’
tasks.
2. Ability to use a wide variety of scheduling algorithms for real-time periodic
tasks. This variety gives the designer extra flexibility in choosing the most
suitable algorithm for a certain platform.
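The first benefit, online admission control, can be sketched in a few lines of C. The sketch is only an illustration under stated assumptions and is not part of the proposed design flow: it assumes implicit-deadline tasks scheduled by partitioned EDF, so that a task fits on a processor whenever the processor utilization stays at or below 1, and it uses plain first-fit placement. The names admit_program, UTIL_SCALE, and MAX_PROCS are hypothetical, and utilizations are integers in units of 1/1000.

```c
#include <assert.h>
#include <stddef.h>

#define UTIL_SCALE 1000  /* assumed unit: 1/1000 of one processor */
#define MAX_PROCS  16    /* assumed upper limit on the processor count */

/* Try to admit a new program, given as the utilizations of its n tasks,
 * onto m processors whose current utilizations are stored in load[].
 * Tasks are placed first-fit. The loads are updated only if every task
 * fits. Returns 1 if the program is admitted and 0 if it is rejected. */
int admit_program(int *load, size_t m, const int *util, size_t n)
{
    int trial[MAX_PROCS];

    assert(m <= MAX_PROCS);
    for (size_t p = 0; p < m; p++)
        trial[p] = load[p];

    for (size_t i = 0; i < n; i++) {
        size_t p = 0;
        while (p < m && trial[p] + util[i] > UTIL_SCALE)
            p++;
        if (p == m)
            return 0;        /* no processor can host this task */
        trial[p] += util[i];
    }

    /* Commit the new allocation only after all tasks were placed. */
    for (size_t p = 0; p < m; p++)
        load[p] = trial[p];
    return 1;
}
```

Because the test only inspects the current processor loads, its cost is proportional to the number of new tasks times the number of processors, and the already running programs are left untouched.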
6.3 Experiment III: Validating Synthesized Systems
In this experiment, we synthesize a set of MPSoC systems, where each system runs a
set of streaming programs. After that, the synthesized systems are validated by
instrumenting the generated code to detect deadline misses and buffer underflows/overflows.
Deadline misses, for implicit-deadline tasks, can be detected in FreeRTOS using the
following observation from Figure 5.6: “It should be noted that vTaskDelayUntil()
will return immediately (without blocking) if it is used to specify a wake time that is
already in the past”. Therefore, the task implementation shown in Listing 3 is updated
to detect deadline misses, which results in Listing 8. We see in Listing 8 that the function
invocation is followed by an if statement that checks whether LastReleaseTime
has elapsed or not. If a deadline miss is detected, then the following actions are
performed: (1) all interrupts are disabled, (2) a message is printed to the user, and (3) the
system freezes its execution by entering into an infinite loop.

Listing 8 Task implementation with code for detecting deadline misses
    #include <stdio.h>
    #include "FreeRTOS.h"
    #include "task.h"

    void task(void *arg) {
        portTickType LastReleaseTime, SimultaneousReleaseTime;
        portTickType ticks;
        const portTickType Period = 5;
        const portTickType StartTime = 20;

        /* Wait for this task's start time, counted from the common release time. */
        SimultaneousReleaseTime = *((portTickType *) arg);
        vTaskDelayUntil( &SimultaneousReleaseTime, StartTime );
        LastReleaseTime = xTaskGetTickCount();

        for (;;) {
            function();                      /* actor function, defined elsewhere */
            ticks = xTaskGetTickCount();     /* completion time of this job */
            /* Advances LastReleaseTime by Period (the next release, which is
               also the deadline of the job that has just finished). */
            vTaskDelayUntil( &LastReleaseTime, Period );
            if (ticks > LastReleaseTime) {   /* the job finished after its deadline */
                taskDISABLE_INTERRUPTS();
                printf("Deadline miss detected\n"); /* placeholder message and print routine */
                for (;;) { }                 /* freeze the system */
            }
        }
    }
This experiment is conducted using the programs outlined in Table 6.6. They
include a mixture of real and synthetic programs. The programs shown in Table 6.6 are
validated on two types of hardware platforms. These hardware platforms are listed in
Table 6.7. When prototyping the systems on the ML605 board, the synthesized systems
have the architecture shown in Figure 5.4. In contrast, the Zedboard is based
on the Xilinx Zynq-7000 SoC, which is a dual ARM system with largely fixed functionality
as shown in Figure 6.3.

Table 6.6: Programs used in Experiment III
Program   Description                                     No. of tasks
JPEG-E    Image encoder from raw format to JPEG format    6
JPEG-D    Image decoder from JPEG format to raw format    2
Sobel     Sobel edge-detector filter                      5
G1        The program shown in Listing 1                  4
G2        The program shown in Listing 2                  4

Table 6.7: Hardware platforms used in Experiment III
Platform   Description
ML605      Xilinx ML605 Virtex-6 FPGA board with SoC architecture shown in Figure 5.4
Zedboard   Avnet ZedBoard board with Xilinx Zynq-7000 SoC shown in Figure 6.3

[Figure 6.3: Zynq-7000 SoC architecture, showing two ARM cores (ARM 1 and ARM 2)]

The set of synthesized systems is outlined in Table 6.8. The synthesis time includes
the time needed to perform all the steps shown in Figure 1.7 except for the WCET
analysis and low-level synthesis and compilation. The short synthesis times in Table 6.8
clearly demonstrate the speed of the proposed design flow. The throughput constraints
correspond to the periods that are requested by the designer from the scheduling
framework. This is done by choosing different values of the period scaling factor µG. By
using µG, it is possible to reduce the programs’ throughput and, accordingly, reduce the
required number of processors. This reduction is necessary, for example, on the Zynq
platform in order to ensure that the programs can be scheduled on two processors. For
the deadline factors η⃗, we always use η⃗ = 1⃗. Each synthesized system is validated by
running it with real input data for a duration between 1 and 12 hours. For all the
synthesized systems shown in Table 6.8, no deadline misses or buffer underflows/overflows
are detected during the whole validation phase.
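The effect of the period scaling factor on the processor count can be illustrated with a short C sketch. It rests on assumptions that are not taken from the framework itself: each task period is simply multiplied by an integer µG, deadlines are implicit (η⃗ = 1⃗), and the processor count is the ceiling of the total utilization, which is the bound for optimal schedulers (a partitioned scheduler may need more). The names min_procs_scaled and gcd_long are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Greatest common divisor, used to build the hyperperiod. */
static long gcd_long(long a, long b)
{
    while (b != 0) {
        long t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Returns ceil(U), where U = sum of wcet[i] / (mu * period[i]).
 * The sum is evaluated exactly over the hyperperiod of the unscaled
 * periods, so no floating-point arithmetic is needed. */
int min_procs_scaled(const long *wcet, const long *period, size_t n, long mu)
{
    long hyper = 1;   /* least common multiple of the periods */
    long demand = 0;  /* execution time requested per hyperperiod */

    assert(mu >= 1);
    for (size_t i = 0; i < n; i++) {
        assert(wcet[i] > 0 && wcet[i] <= period[i]);
        hyper = hyper / gcd_long(hyper, period[i]) * period[i];
    }
    for (size_t i = 0; i < n; i++)
        demand += wcet[i] * (hyper / period[i]);

    long capacity = mu * hyper;  /* time one processor offers per scaled hyperperiod */
    return (int)((demand + capacity - 1) / capacity);
}
```

For three tasks with WCETs 2, 3, and 4 and periods 4, 4, and 8, the total utilization is 1.75, so two processors are needed for µG = 1. Doubling the periods (µG = 2) halves the throughput and lowers the utilization to 0.875, so a single processor suffices.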
Table 6.8: The set of synthesized systems
System   Board      Programs         Synthesis time (seconds)   Throughput constraints
Sys1     ML605      JPEG-D | Sobel   11.3                       1 fps | 4.7 fps
Sys2     Zedboard   JPEG-D | Sobel   8.6                        2 fps | 5 fps
Sys3     ML605      G1 | G2          8.2                        3.125 fps | 3.571 fps
In order to “double-check” the previous finding, we also measured the throughput
of the actors in the generated systems. The throughput is measured by measuring
the periods of the output actors using custom hardware counters. For example, for
Sys3 in Table 6.8, we measured the periods of the output actors when the system
is run on the ML605 board for a time duration equal to 1 hour. The periods that were
requested from the scheduling framework are 280 ms for G2 and 320 ms for G1, and
the OS clock tick was set to 10 ms. The average measured periods were 279915940 ns
for G2 and 319719020 ns for G1. The deviations from these averages were always less
than 1 ms, which is much below the time granularity visible to the RTOS. Thus, these




Chapter 7. Summary and Future Work

Show me a hard real-time system, and I will show
you a hammer that will cause it to miss its deadlines.
Paul E. McKenney
This dissertation addressed the problem of designing hard real-time streaming
systems running a set of parallel streaming programs in an automated way such
that the programs provably meet their timing requirements. Such systems are usually
realized nowadays as MPSoCs. Model-based design and electronic system-level
synthesis have emerged as de facto solutions to the problems of designing parallel software
for MPSoCs and generating the complete MPSoC, respectively. However, no such de
facto solution exists yet for the problem of scheduling parallel streaming programs on
MPSoCs. Scheduling has a direct influence on the architecture and mapping
specifications needed to perform electronic system-level synthesis. One possible and attractive
solution is to use classical hard real-time scheduling algorithms. However, most hard
real-time scheduling algorithms assume independent periodic or sporadic tasks, while
modern streaming programs are often modeled as directed graphs in which the actors
(i.e., tasks) have data dependency constraints and do not necessarily conform to the
real-time periodic task model. In this dissertation, a scheduling framework is
proposed with which it is analytically proven that any streaming program, modeled as an
acyclic CSDF graph, can be executed as a set of real-time periodic tasks. The proposed
framework computes the parameters of the periodic tasks corresponding to the graph
actors and the minimum buffer sizes of the communication channels such that a valid
periodic schedule is guaranteed to exist. The proposed framework shows that the use
of both models is possible and that they complement each other; CSDF captures the
functional aspects of the program, while the real-time periodic task model captures
the timing aspects. Using both models, as demonstrated by the proposed framework,
enables the designer to: (1) schedule the tasks to meet certain performance metrics (i.e.,
throughput and latency), (2) derive analytically the scheduling parameters that
guarantee the required performance, and (3) compute analytically the minimum number
of processors that guarantee the required performance together with the mapping of
tasks to processors. Additionally, the scheduling framework (explained in Chapter 4)
establishes the following results:
• Matched I/O rates graphs (which correspond to roughly 90% of streaming
programs) have a throughput under periodic schedules that is equal to their
throughput under worst-case self-timed schedules.
• For certain classes of CSDF graphs, it is possible to achieve throughput and
latency under periodic schedules that are equal to the throughput and latency
under worst-case self-timed schedules. It is also shown that, for CSDF graphs
in general, the latency can be reduced via reducing the deadlines of the actors
along the critical paths.
In order to demonstrate the effectiveness and efficiency of the proposed scheduling
framework, a system-level design flow that incorporates the scheduling framework is
proposed. This design flow accepts, as input, algorithmic sequential specifications of
streaming programs, and then applies a set of systematic and fully automated steps that
produce, as output, a complete system implementation which provably meets the timing
requirements of the programs. The system implementation consists of the parallelized
versions of the input streaming programs together with the hardware needed to run
them. The proposed design flow consists of the following key steps: (1) automated
parallelization and model construction, (2) scheduling framework (as proposed in
Chapter 4), and (3) electronic system-level synthesis. A complete implementation of
the proposed design flow is available for download, as an open source framework called
DaedalusRT, from http://daedalus.liacs.nl/.
The proposed scheduling framework and design flow are evaluated through a set
of experiments. The first experiment (Section 6.1) shows that automated parallelization
and model construction are very fast for many real-life programs. The second
experiment (Section 6.2) demonstrates the quality of periodic scheduling of streaming
programs in terms of (1) throughput, (2) latency, (3) processor usage, and (4) memory
requirements. It shows that, for more than 70% of the benchmarks, periodic scheduling
results in optimal throughput and latency. It also shows that the memory requirements
under a periodic schedule, compared to a self-timed one, are the same or slightly higher
(at most +11%) for 75% of the benchmarks. In the cases where periodic scheduling is
optimal, it can be argued that it provides additional benefits over self-timed scheduling.
These benefits are:
1. Ability to modify the set of running programs easily. Under periodic scheduling,
one can use efficient schedulability tests and partitioning schemes explained
in Chapter 2 to perform online admission control of new programs. In contrast,
self-timed scheduling requires either re-performing the design space
exploration or using heuristics to devise the new allocation of the programs’
tasks.
2. Ability to use a wide variety of scheduling algorithms for real-time periodic
tasks. This variety gives the designer extra flexibility in choosing the most
suitable algorithm for a certain platform.
Finally, the third experiment (Section 6.3) validated the correctness of the systems
synthesized using the proposed design flow. The validation was done by running the
synthesized systems on FPGA boards with real input data for long durations. During
the whole validation phase, no deadline misses or buffer underflows/overflows were
observed.
7.1 Suggestions for Future Work
In this section, we provide a summary of the issues that deserve further investigation
in the future.
Support for More Expressive MoCs
A more expressive MoC allows a more accurate analysis of the modeled program. A first
step towards this goal is the work in [BTV12]. The authors in [BTV12], as mentioned in
Section 1.4, presented a scheduling framework similar to ours with support for a MoC
called Affine Data-Flow (ADF), which extends the CSDF model. Another option is to
consider support for dynamic MoCs which model programs that change their behavior
during run-time (e.g., Boolean Data-Flow [BL93]).
Improving the WCET by Considering the Effect of Mapping
In Chapter 4, we assume that the WCET of an actor is computed assuming the worst-case
latency of communication operations. This worst-case latency occurs when the
underlying interconnect is fully congested. However, such an assumption overestimates
the WCET value. In a real system, many communication streams are isolated from
the others (see for example Figure 4.15). Their communication operations therefore occur
without congestion and do not take their worst-case latency. Hence, it is possible
to reduce the WCET values if the mapping is taken into account. A first step towards
“communication-aware” allocation in hard real-time systems realized on MPSoCs is
the work presented in [ZM12]. Zimmer and Mueller in [ZM12] presented a framework
for deriving low-contention mapping of real-time programs mapped onto NoC-based
MPSoCs. They devised two solvers: one based on exhaustive search and another based
on a heuristic. The resulting mapping tries to reduce the communication contention
and, hence, reduce the communication latency. This, in turn, leads to tighter WCET
estimates of the tasks.
Support for Hierarchical Scheduling
The architecture and mapping derivation explained in Section 4.8 does not support
hierarchical scheduling. Hierarchical scheduling is becoming more popular in modern
hard real-time systems since it allows different programs to be scheduled using different
scheduling policies. Furthermore, in some application domains such as avionics, it is
mandatory to use two levels of scheduling in order to provide complete partitioning in
time and space as mandated by industry standards (such as the ARINC 653 Specification
[ARI]). Therefore, it is interesting to investigate how such hierarchical scheduling
schemes affect the derivation of the architecture and mapping specifications.
Bibliography
[AB98] L. Abeni and G. Buttazzo. Integrating multimedia applications in
hard real-time systems. In Proceedings of the 19th IEEE Real-Time
Systems Symposium, RTSS ’98, pages 4–13, 1998. doi:10.1109/
REAL.1998.739726.
[AJ02] Björn Andersson and Jan Jonsson. Preemptive multiprocessor scheduling
anomalies. In Proceedings of the International Parallel and Distributed
Processing Symposium, IPDPS 2002, Los Alamitos, CA, USA,
2002. IEEE Computer Society. doi:10.1109/IPDPS.2002.1015483.
[ALSU86] Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman.
Compilers: Principles, Techniques, and Tools. Addison Wesley, Boston,
MA, U.S.A, 2nd edition, 1986.
[Amd67] Gene M. Amdahl. Validity of the single processor approach to achieving
large scale computing capabilities. In Proceedings of the Spring Joint
Computer Conference, AFIPS ’67 (Spring), pages 483–485, New York,
NY, USA, 1967. ACM. doi:10.1145/1465482.1465560.
[ARI] ARINC Incorporated. 653P1-3 Avionics Application Software Standard
Interface, Part 1, Required Services. URL: http://www.arinc.com/
[cited September 13, 2013].
[ARM10] ARM Ltd. AMBA® AXI Protocol - Version: 2.0 : Specification, 2010.
URL: http://www.arm.com/ [cited June 19, 2012].
[Aud91] Neil C. Audsley. Optimal priority assignment and feasibility of static
priority tasks with arbitrary start times. Technical Report YCS 164,
University of York, 1991.
[Bar03] Sanjoy K. Baruah. Dynamic- and Static-priority Scheduling of Re-
curring Real-time Tasks. Real-Time Systems, 24(1):93–128, 2003.
doi:10.1023/A:1021711220939.
[Bar10] Sanjoy Baruah. The Non-cyclic Recurring Real-Time Task Model. In
Proceedings of the IEEE 31st Real-Time Systems Symposium, RTSS ’10,
pages 173–182, Los Alamitos, CA, USA, 2010. IEEE Computer Society.
doi:10.1109/RTSS.2010.19.
[Bas04] Cédric Bastoul. Improving Data Locality in Static Control Programs.
PhD thesis, Université Pierre-et-Marie-Curie (Paris VI), France, 2004.
[BCGM99] Sanjoy Baruah, Deji Chen, Sergey Gorinsky, and Aloysius Mok. Gen-
eralized Multiframe Tasks. Real-Time Systems, 17:5–22, 1999. doi:
10.1023/A:1008030427220.
[BCPV96] S. K. Baruah, N. K. Cohen, C. G. Plaxton, and D. A. Varvel. Propor-
tionate progress: A notion of fairness in resource allocation. Algorith-
mica, 15(6):600–625, 1996. doi:10.1007/BF01940883.
[BELP96] Greet Bilsen, Marc Engels, Rudy Lauwereins, and Jean Peperstraete.
Cyclo-static dataflow. IEEE Transactions on Signal Processing,
44(2):397–408, 1996. doi:10.1109/78.485935.
[Ber66] A. J. Bernstein. Analysis of Programs for Parallel Processing. IEEE
Transactions on Electronic Computers, EC-15(5):757–763, 1966. doi:
10.1109/PGEC.1966.264565.
[BG04] Sanjoy Baruah and Joël Goossens. Scheduling Real-Time Tasks: Algorithms
and Complexity. In Joseph Y.-T. Leung, editor, Handbook of
Scheduling: Algorithms, Models, and Performance Analysis. CRC Press,
Boca Raton, FL, U.S.A, 2004. doi:10.1201/9780203489802.ch28.
[BHM+05] Marco Bekooij, Rob Hoes, Orlando Moreira, Peter Poplavko, Milan
Pastrnak, Bart Mesman, Jan Mol, Sander Stuijk, Valentin Gheorghita,
and Jef Meerbergen. Dataflow Analysis for Real-Time Embedded
Multiprocessor System Design. In Peter van der Stok, editor, Dynamic
and Robust Streaming in and between Connected Consumer-Electronic
Devices, volume 3, pages 81–108. Springer Netherlands, 2005.
doi:10.1007/1-4020-3454-7_4.
[BL93] Joseph T. Buck and Edward A. Lee. Scheduling dynamic dataflow
graphs with bounded memory using the token flow model. In
Proceedings of the IEEE International Conference on Acoustics, Speech,
and Signal Processing, volume 1 of ICASSP ’93, pages 429–432, 1993.
doi:10.1109/ICASSP.1993.319147.
[BMKdD12] B. Bodin, A. Munier-Kordon, and B.D. de Dinechin. K-Periodic
schedules for evaluating the maximum throughput of a Synchronous
Dataflow graph. In Proceedings of the International Conference on
Embedded Computer Systems, SAMOS ’12, pages 152–159, 2012.
doi:10.1109/SAMOS.2012.6404169.
[BMKdD13] Bruno Bodin, Alix Munier-Kordon, and Benoît Dupont de Dinechin.
Periodic Schedules for Cyclo-Static Dataflow. In Proceedings of the 11th
IEEE Symposium on Embedded Systems for Real-Time Multimedia,
ESTIMedia 2013, pages 105–114, 2013.
doi:10.1109/ESTIMedia.2013.6704509.
[BNHMMK12] Abir Benabid-Najjar, Claire Hanen, Olivier Marchetti, and Alix
Munier-Kordon. Periodic Schedules for Bounded Timed Weighted Event
Graphs. IEEE Transactions on Automatic Control, 57(5):1222–1232,
2012. doi:10.1109/TAC.2012.2191871.
[Bra11] Björn B. Brandenburg. Scheduling and Locking in Multiprocessor Real-
Time Operating Systems. PhD thesis, University of North Carolina at
Chapel Hill, U.S.A., 2011.
[BRH90] Sanjoy K. Baruah, Louis E. Rosier, and Rodney R. Howell. Algorithms
and complexity concerning the preemptive scheduling of periodic,
real-time tasks on one processor. Real-Time Systems, 2:301–324, 1990.
doi:10.1007/BF01995675.
[BS11] Mohamed Bamakhrama and Todor Stefanov. Hard-real-time scheduling
of data-dependent tasks in embedded streaming applications. In
Proceedings of the ninth ACM International Conference on Embedded
Software, EMSOFT ’11, pages 195–204, New York, NY, USA, 2011. ACM.
doi:10.1145/2038642.2038672.
[BS12] Mohamed A. Bamakhrama and Todor Stefanov. Managing latency in
embedded streaming applications under hard-real-time scheduling.
In Proceedings of the eighth IEEE/ACM/IFIP International Conference
on Hardware/Software Codesign and System Synthesis, CODES+ISSS
’12, pages 83–92, New York, NY, USA, 2012. ACM.
doi:10.1145/2380445.2380464.
[BT13] Adnan Bouakaz and Jean-Pierre Talpin. Buffer minimization in
earliest-deadline first scheduling of dataflow graphs. In Proceedings of
the 14th ACM SIGPLAN/SIGBED Conference on Languages, Compilers
and Tools for Embedded Systems, LCTES ’13, pages 133–142, New York,
NY, USA, 2013. ACM. doi:10.1145/2465554.2465558.
[BTV12] Adnan Bouakaz, Jean-Pierre Talpin, and Jan Vitek. Affine Data-Flow
Graphs for the Synthesis of Hard Real-Time Applications. In
Proceedings of the 12th International Conference on Application of Concurrency
to System Design, ACSD ’12, pages 183–192, Los Alamitos, CA, USA,
2012. IEEE Computer Society. doi:10.1109/ACSD.2012.16.
[But11] Giorgio C. Buttazzo. Hard Real-Time Computing Systems. Springer
US, 3rd edition, 2011. doi:10.1007/978-1-4614-0676-1.
[BZNS12] Mohamed A. Bamakhrama, Jiali Teddy Zhai, Hristo Nikolov, and
Todor Stefanov. A methodology for automated design of hard-real-
time embedded streaming systems. In Proceedings of the 15th Design,
Automation Test in Europe Conference and Exhibition, DATE 2012,
pages 941–946, 2012. doi:10.1109/DATE.2012.6176632.
[Cas13] Jeronimo Castrillon. Programming Heterogeneous MPSoCs: Tool
Flows to Close the Software Productivity Gap. PhD thesis,
Rheinisch-Westfälische Technische Hochschule Aachen, Germany, 2013.
[CFH+04] John Carpenter, Shelby Funk, Philip Holman, Anand Srinivasan,
James Anderson, and Sanjoy Baruah. A Categorization of Real-time
Multiprocessor Scheduling Problems and Algorithms. In Joseph
Y.-T. Leung, editor, Handbook of Scheduling: Algorithms, Models,
and Performance Analysis. CRC Press, Boca Raton, FL, U.S.A, 2004.
doi:10.1201/9780203489802.ch30.
[CGJ96] E. G. Coffman, Jr., M. R. Garey, and D. S. Johnson. Approximation
algorithms for bin packing: A survey. In Dorit S. Hochbaum, editor,
Approximation algorithms for NP-hard problems, pages 46–93. PWS
Publishing Co., Boston, MA, USA, 1996.
[CKT03] Samarjit Chakraborty, Simon Künzli, and Lothar Thiele. A general
framework for analysing system properties in platform-based embedded
system designs. In Proceedings of the Design, Automation and Test
in Europe Conference and Exhibition, DATE ’03, pages 190–195, 2003.
doi:10.1109/DATE.2003.1253607.
[CMQV89] Guy Cohen, Pierre Moller, Jean-Pierre Quadrat, and Michel Viot.
Algebraic tools for the performance evaluation of discrete event
systems. Proceedings of the IEEE, 77(1):39–85, 1989.
doi:10.1109/5.21069.
[CRJ10] Hyeonjoong Cho, Binoy Ravindran, and E. Douglas Jensen. T-L
plane-based real-time scheduling for homogeneous multiprocessors.
Journal of Parallel and Distributed Computing, 70(3):225–236, 2010.
doi:10.1016/j.jpdc.2009.12.003.
[DB11] Robert I. Davis and Alan Burns. A survey of hard real-time scheduling
for multiprocessor systems. ACM Computing Surveys, 43(4):35:1–35:44,
2011. doi:10.1145/1978802.1978814.
[DSBS06] Ed F. Deprettere, Todor Stefanov, Shuvra S. Bhattacharyya, and
Mainak Sen. Affine Nested Loop Programs and their Binary Parameterized
Dataflow Graph Counterparts. In Proceedings of the International
Conference on Application-specific Systems, Architectures and Processors,
ASAP 2006, pages 186–190, 2006. doi:10.1109/ASAP.2006.7.
[EJ09] Christof Ebert and Capers Jones. Embedded Software: Facts, Figures,
and Future. IEEE Computer, 42(4):42–52, 2009.
doi:10.1109/MC.2009.118.
[Fea91] Paul Feautrier. Dataflow analysis of array and scalar references.
International Journal of Parallel Programming, 20(1):23–53, 1991.
doi:10.1007/BF01407931.
[Fea96] Paul Feautrier. Automatic Parallelization in the Polytope Model. In
Guy-René Perrin and Alain Darte, editors, The Data Parallel Programming
Model: Foundations, HPF Realization, and Scientific Applications,
volume 1132, pages 79–103. Springer Berlin Heidelberg, 1996.
doi:10.1007/3-540-61736-1_44.
[FGB10] Nathan Fisher, Joël Goossens, and Sanjoy Baruah. Optimal on-
line multiprocessor scheduling of sporadic real-time tasks is impos-
sible. Real-Time Systems, 45(1-2):26–71, 2010. doi:10.1007/
s11241-010-9092-7.
[FKPY07] Elena Fersman, Pavel Krcal, Paul Pettersson, and Wang Yi. Task
automata: Schedulability, decidability and undecidability. Information
and Computation, 205(8):1149–1172, 2007. doi:10.1016/j.ic.
2007.01.009.
[FLS+11] Shelby Funk, Greg Levin, Caitlin Sadowski, Ian Pye, and Scott Brandt.
DP-Fair: a unifying theory for optimal hard real-time multiproces-
sor scheduling. Real-Time Systems, 47(5):389–429, 2011. doi:
10.1007/s11241-011-9130-0.
[Fra10] Björn Franke. C Compilers and Code Optimization for DSPs. In
Shuvra S. Bhattacharyya, Ed F. Deprettere, Rainer Leupers, and Jarmo
Takala, editors, Handbook of Signal Processing Systems, pages 575–601.
Springer US, 2010. doi:10.1007/978-1-4419-6345-1_21.
[FSH+13] Shakith Fernando, Firew Siyoum, Yifan He, Akash Kumar, and Henk
Corporaal. MAMPSx: A Design Framework for Rapid Synthesis of
Predictable Heterogeneous MPSoCs. In Proceedings of IEEE Interna-
tional Symposium on Rapid System Prototyping, RSP ’13, pages 136–142,
2013. doi:10.1109/RSP.2013.6683970.
[GAC+13] Kees Goossens, Arnaldo Azevedo, Karthik Chandrasekar, Manil Dev
Gomony, Sven Goossens, Martijn Koedam, Yonghui Li, Davit Mirzoyan,
Anca Molnos, Ashkan Beyranvand Nejad, Andrew Nelson, and
Shubhendu Sinha. Virtual Platforms for Mixed-Time-Criticality
Applications: The CompSOC Architecture and Design Flow. ACM SIGBED
Review, 10(3):23–34, 2013. doi:10.1145/2544350.2544353.
[GB04] Marc Geilen and Twan Basten. Reactive process networks. In
Proceedings of the 4th ACM International Conference on Embedded
Software, EMSOFT ’04, pages 137–146, New York, NY, USA, 2004. ACM.
doi:10.1145/1017753.1017778.
[GDR05] Kees Goossens, John Dielissen, and Andrei Rădulescu. Æthereal
Network on Chip: Concepts, Architectures, and Implementations.
IEEE Design and Test of Computers, 22(5):414–421, 2005. doi:10.
1109/MDT.2005.99.
[Gha08] Amir Hossein Ghamarian. Timing Analysis of Synchronous Data Flow
Graphs. PhD thesis, Technische Universiteit Eindhoven, Netherlands,
2008.
[GHB13] Stefan J. Geuns, Joost P.H.M. Hausmans, and Marco J.G. Bekooij.
Automatic dataflow model extraction from modal real-time stream
processing applications. In Proceedings of the 14th ACM SIGPLAN/SIGBED
Conference on Languages, Compilers and Tools for Embedded
Systems, LCTES ’13, pages 143–152, New York, NY, USA, 2013.
ACM. doi:10.1145/2465554.2465561.
[GHP+09] Andreas Gerstlauer, Christian Haubelt, Andy D. Pimentel, Todor P.
Stefanov, Daniel D. Gajski, and Jürgen Teich. Electronic System-
Level Synthesis Methodologies. IEEE Transactions on Computer-
Aided Design of Integrated Circuits and Systems, 28(10):1517–1530, 2009.
doi:10.1109/TCAD.2009.2026356.
[GJ79] Michael R. Garey and David S. Johnson. Computers and Intractability:
A Guide to the Theory of NP-Completeness. WH Freeman & Co., New
York, NY, USA, 1979.
[God98] Steve Goddard. On the Management of Latency in the Synthesis of
Real-Time Signal Processing Systems from Processing Graphs. PhD
thesis, University of North Carolina at Chapel Hill, U.S.A., 1998.
[Gra69] R. L. Graham. Bounds on Multiprocessing Timing Anomalies.
SIAM Journal on Applied Mathematics, 17(2):416–429, 1969. doi:
10.1137/0117039.
[Hen03] Jörg Henkel. Closing the SoC design gap. IEEE Computer, 36(9):119–
121, 2003. doi:10.1109/MC.2003.1231200.
[HGBH09] Andreas Hansson, Kees Goossens, Marco Bekooij, and Jos Huisken.
CoMPSoC: A template for composable and predictable multi-processor
system on chips. ACM Transactions on Design Automation of
Electronic Systems, 14(1):2:1–2:24, 2009.
doi:10.1145/1455229.1455231.
[HGWB13] Joost P. H. M. Hausmans, Stefan J. Geuns, Maarten H. Wiggers,
and Marco J.G. Bekooij. Two parameter workload characterization
for improved dataflow analysis accuracy. In Proceedings of the IEEE
19th Real-Time and Embedded Technology and Applications Symposium,
RTAS ’13, pages 117–126, Los Alamitos, CA, USA, 2013. IEEE Computer
Society. doi:10.1109/RTAS.2013.6531085.
[HHBT09] Wolfgang Haid, Kai Huang, Iuliana Bacivarov, and Lothar Thiele.
Multiprocessor SoC software design flows. IEEE Transactions on Signal
Processing, 26(6):64–71, 2009. doi:10.1109/MSP.2009.934111.
[HKL+08] Soonhoi Ha, Sungchan Kim, Choonseung Lee, Youngmin Yi, Seongnam
Kwon, and Young-Pyo Joo. PeaCE: A hardware-software codesign
environment for multimedia embedded systems. ACM Transactions
on Design Automation of Electronic Systems, 12(3):24:1–24:25,
2008. doi:10.1145/1255456.1255461.
[HNO97] Lance Hammond, Basem A. Nayfeh, and Kunle Olukotun. A single-
chip multiprocessor. IEEE Computer, 30(9):79–85, 1997. doi:10.
1109/2.612253.
[HO10] Soonhoi Ha and Hyunok Oh. Decidable Signal Processing Dataflow
Graphs: Synchronous and Cyclo-Static Dataflow Graphs. In Shuvra S.
Bhattacharyya, Ed F. Deprettere, Rainer Leupers, and Jarmo Takala,
editors, Handbook of Signal Processing Systems, pages 851–874. Springer
US, 2010. doi:10.1007/978-1-4419-6345-1_30.
[HWGB13] Joost P. H. M. Hausmans, Maarten H. Wiggers, Stefan J. Geuns,
and Marco J. G. Bekooij. Dataflow analysis for multiprocessor systems
with non-starvation-free schedulers. In Proceedings of the 16th
International Workshop on Software and Compilers for Embedded
Systems, M-SCOPES ’13, pages 13–22, New York, NY, USA, 2013. ACM.
doi:10.1145/2463596.2463603.
[Int11] International Technology Roadmap for Semiconductors. 2011 Edition:
Executive Summary, 2011. URL: http://www.itrs.net/ [cited
August 19, 2013].
[Joh74] David S. Johnson. Fast algorithms for bin packing. Journal of
Computer and System Sciences, 8(3):272–314, 1974. doi:10.1016/
S0022-0000(74)80026-7.
[JP86] M. Joseph and P. Pandya. Finding Response Times in a Real-Time
System. The Computer Journal, 29(5):390–395, 1986.
doi:10.1093/comjnl/29.5.390.
[JS05] A. Jantsch and I. Sander. Models of computation and languages
for embedded system design. IEE Proceedings-Computers and Dig-
ital Techniques, 152(2):114–129, 2005. doi:10.1049/ip-cdt:
20045098.
[JSM91] Kevin Jeffay, Donald F. Stanat, and Charles U. Martel. On
non-preemptive scheduling of periodic and sporadic tasks. In Proceedings of
the 12th Real-Time Systems Symposium, RTSS ’91, pages 129–139, 1991.
doi:10.1109/REAL.1991.160366.
[JTW05] Ahmed Jerraya, Hannu Tenhunen, and Wayne Wolf. Multiprocessor
Systems-on-Chips. IEEE Computer, 38(7):36–40, 2005. doi:10.
1109/MC.2005.231.
[Kah74] Gilles Kahn. The Semantics of a Simple Language for Parallel
Programming. In Proceedings of the IFIP Congress, pages 471–475.
North-Holland Publishing Company, 1974.
[KKLR13] Junsung Kim, Hyoseung Kim, Karthik Lakshmanan, and Ragu-
nathan (Raj) Rajkumar. Parallel scheduling for cyber-physical sys-
tems: analysis and case study on a self-driving car. In Proceedings of
the ACM/IEEE 4th International Conference on Cyber-Physical Sys-
tems, ICCPS ’13, pages 31–40, New York, NY, USA, 2013. ACM.
doi:10.1145/2502524.2502530.
[KM66] Richard M. Karp and Raymond E. Miller. Properties of a Model for
Parallel Computations: Determinacy, Termination, Queueing. SIAM
Journal on Applied Mathematics, 14(6):1390–1411, 1966. doi:10.
1137/0114108.
[KM+00] Kurt Keutzer, Sharad Malik, A. Richard Newton, Jan M. Rabaey, and
A. Sangiovanni-Vincentelli. System-level design: orthogonalization of
concerns and platform-based design. IEEE Transactions on Computer-
Aided Design of Integrated Circuits and Systems, 19(12):1523–1543, 2000.
doi:10.1109/43.898830.
[Lee06] Edward A. Lee. The Problem with Threads. IEEE Computer, 39(5):33–42,
2006. doi:10.1109/MC.2006.180.
[LH89] Edward Ashford Lee and Soonhoi Ha. Scheduling strategies for multi-
processor real-time DSP. In Proceedings of the IEEE Global Telecommu-
nications Conference and Exhibition: Communications Technology for
the 1990s and Beyond, volume 2 of GLOBECOM 1989, pages 1279–1283,
1989. doi:10.1109/GLOCOM.1989.64160.
[Liu12] Isaac Liu. Precision Timed Machines. PhD thesis, University of
California, Berkeley, U.S.A., 2012.
[LL73] C. L. Liu and James W. Layland. Scheduling Algorithms for Multipro-
gramming in a Hard-Real-Time Environment. Journal of the ACM,
20(1):46–61, 1973. doi:10.1145/321738.321743.
[LM87] Edward A. Lee and David G. Messerschmitt. Synchronous data flow.
Proceedings of the IEEE, 75(9):1235–1245, 1987. doi:10.1109/
PROC.1987.13876.
[LN05] E. A. Lee and S. Neuendorffer. Concurrent models of computation
for embedded software. IEE Proceedings-Computers and Digital
Techniques, 152(2):239–250, 2005. doi:10.1049/ip-cdt:20045065.
[LW82] Joseph Y.-T. Leung and Jennifer Whitehead. On the complexity of
fixed-priority scheduling of periodic, real-time tasks. Performance
Evaluation, 2(4):237–250, 1982. doi:10.1016/0166-5316(82)
90024-4.
[MB07] Orlando M. Moreira and Marco J. G. Bekooij. Self-Timed Scheduling
Analysis for Real-Time Applications. EURASIP Journal on Advances
in Signal Processing, 2007(1), 2007. doi:10.1155/2007/83710.
[MBvdBvM08] Arno Moonen, Marco Bekooij, René van den Berg, and Jef van Meer-
bergen. Cache aware mapping of streaming applications on a multi-
processor system-on-chip. In Proceedings of the Conference on Design,
Automation and Test in Europe, DATE ’08, pages 300–305, New York,
NY, USA, 2008. ACM. doi:10.1145/1403375.1403448.
[Mei10] Sjoerd Meijer. Transformations for Polyhedral Process Networks. PhD
thesis, Universiteit Leiden, Netherlands, 2010.
[Moo65] Gordon E. Moore. Cramming more components onto integrated
circuits. Electronics, 38(8):114–117, 1965.
[Mor12] Orlando Moreira. Temporal Analysis and Scheduling of Hard Real-
Time Radios running on a Multi-Processor. PhD thesis, Technische
Universiteit Eindhoven, Netherlands, 2012.
[MRP+11] Mario Morales, Shane Rau, Michael J. Palma, Mali Venkatesan, Flint
Pulskamp, and Abhi Dugar. Worldwide Intelligent Systems 2011–2015
Forecast: The Next Big Opportunity. International Data Corporation,
5 Speen Street, Framingham, MA 01701, U.S.A., 2011.
[NSD08] Hristo Nikolov, Todor Stefanov, and Ed Deprettere. Systematic and
Automated Multiprocessor System Design, Programming, and Imple-
mentation. IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems, 27(3):542–555, 2008. doi:10.1109/TCAD.
2007.911337.
[NTS+08] H. Nikolov, M. Thompson, T. Stefanov, A. Pimentel, S. Polstra, R. Bose,
C. Zissulescu, and E. Deprettere. Daedalus: toward composable
multimedia MP-SoC design. In Proceedings of the 45th annual Design
Automation Conference, DAC ’08, pages 574–579, New York, NY, USA,
2008. ACM. doi:10.1145/1391469.1391615.
[NVC10] Vincent Nollet, Diederik Verkest, and Henk Corporaal. A Safari
Through the MPSoC Run-Time Management Jungle. Journal
of Signal Processing Systems, 60:251–268, 2010.
doi:10.1007/s11265-008-0305-4.
[OH04] Hyunok Oh and Soonhoi Ha. Fractional Rate Dataflow Model for
Efficient Code Synthesis. The Journal of VLSI Signal Processing, 37:41–51,
2004. doi:10.1023/B:VLSI.0000017002.91721.0e.
[PC13] Keshab K. Parhi and Yanni Chen. Signal Flow Graphs and Data
Flow Graphs. In Shuvra S. Bhattacharyya, Ed F. Deprettere, Rainer
Leupers, and Jarmo Takala, editors, Handbook of Signal Processing
Systems, pages 1277–1302. Springer New York, 2013. doi:10.1007/
978-1-4614-6859-2_39.
[PL95] Thomas M. Parks and Edward A. Lee. Non-preemptive real-time
scheduling of dataflow systems. In Proceedings of the International
Conference on Acoustics, Speech, and Signal Processing, volume 5
of ICASSP-95, pages 3235–3238, 1995. doi:10.1109/ICASSP.1995.479574.
[PMN+09] Rodolfo Pellizzoni, Patrick Meredith, Min-Young Nam, Mu Sun,
Marco Caccamo, and Lui Sha. Handling mixed-criticality in SoC-based
real-time embedded systems. In Proceedings of the 7th ACM
International Conference on Embedded Software, EMSOFT ’09, pages 235–244,
New York, NY, USA, 2009. ACM. doi:10.1145/1629335.1629367.
[Pou] Louis-Noël Pouchet. PolyBench/C: the Polyhedral Benchmark suite.
URL: http://www.cs.ucla.edu/~pouchet/software/
polybench/ [cited July 22, 2013].
[Reaa] Real Time Engineers Ltd. Task Control: vTaskDelayUntil. URL:
http://www.freertos.org/vtaskdelayuntil.html
[cited July 29, 2013].
[Reab] Real Time Engineers Ltd. The FreeRTOS Project. URL:
http://www.freertos.org/ [cited June 19, 2012].
[RLM+12] Paul Regnier, George Lima, Ernesto Massa, Greg Levin, and Scott
Brandt. Multiprocessor scheduling by reduction to uniprocessor:
an original optimal approach. Real-Time Systems, pages 1–39, 2012.
doi:10.1007/s11241-012-9165-x.
[SB09] Sundararajan Sriram and Shuvra S. Bhattacharyya. Embedded Multi-
processors: Scheduling and Synchronization. CRC Press, Boca Raton,
FL, USA, 2nd edition, 2009.
[SBW09] Marcel Steine, Marco Bekooij, and Maarten Wiggers. A Priority-Based
Budget Scheduler with Conservative Dataflow Model. In Proceedings
of the 12th Euromicro Conference on Digital System Design, Architectures,
Methods and Tools, DSD ’09, pages 37–44, 2009.
doi:10.1109/DSD.2009.148.
[Sch09] Martin Schoeberl. Time-predictable computer architecture. EURASIP
Journal on Embedded Systems, 2009:2:1–2:17, 2009. doi:10.1155/
2009/758480.
[SEGY11a] Martin Stigge, Pontus Ekberg, Nan Guan, and Wang Yi. On the
Tractability of Digraph-Based Task Models. In Proceedings of the
23rd Euromicro Conference on Real-Time Systems, ECRTS ’11, pages
162–171, Los Alamitos, CA, USA, 2011. IEEE Computer Society. doi:
10.1109/ECRTS.2011.23.
[SEGY11b] Martin Stigge, Pontus Ekberg, Nan Guan, and Wang Yi. The Digraph
Real-Time Task Model. In Proceedings of the 17th IEEE Real-Time and
Embedded Technology and Applications Symposium, RTAS ’11, pages
71–80, Los Alamitos, CA, USA, 2011. IEEE Computer Society. doi:
10.1109/RTAS.2011.15.
[SG13] IEEE Computer Society and The Open Group. Standard for Information
Technology Portable Operating System Interface (POSIX®): Base
Specifications, Issue 7. IEEE Std 1003.1, 2013 Edition, pages 1–3906,
2013. doi:10.1109/IEEESTD.2013.6506091.
[SGB06] Sander Stuijk, Marc Geilen, and Twan Basten. SDF3: SDF For Free.
In Proceedings of the 6th International Conference on Application of
Concurrency to System Design, ACSD ’06, pages 276–278, Los Alamitos,
CA, USA, 2006. IEEE Computer Society. doi:10.1109/ACSD.
2006.23.
[SKS+10] A. Shabbir, A. Kumar, S. Stuijk, B. Mesman, and H. Corporaal.
CA-MPSoC: An automated design flow for predictable multi-processor
architectures for multiple applications. Journal of Systems Architec-
ture, 56(7):265–277, 2010. doi:10.1016/j.sysarc.2010.03.
007.
[TA10] William Thies and Saman Amarasinghe. An empirical characterization
of stream programs and its implications for language and compiler
design. In Proceedings of the 19th International Conference on
Parallel Architectures and Compilation Techniques, PACT ’10, pages 365–376,
New York, NY, USA, 2010. ACM. doi:10.1145/1854273.1854319.
[TBHH07] Lothar Thiele, Iuliana Bacivarov, Wolfgang Haid, and Kai Huang.
Mapping Applications to Tiled Multiprocessor Embedded Systems.
In Proceedings of the Seventh International Conference on Applica-
tion of Concurrency to System Design, ACSD ’07, pages 29–40, 2007.
doi:10.1109/ACSD.2007.53.
[Tei12] Jürgen Teich. Hardware/Software Codesign: The Past, the Present, and
Predicting the Future. Proceedings of the IEEE, 100(Special Centennial
Issue):1411–1430, 2012. doi:10.1109/JPROC.2011.2182009.
[Thr10] Sebastian Thrun. Toward robotic cars. Communications of the ACM,
53(4):99–106, 2010. doi:10.1145/1721654.1721679.
[TKM+02] Michael Bedford Taylor, Jason Kim, Jason Miller, David Wentzlaff,
Fae Ghodrat, Ben Greenwald, Henry Hoffman, Paul Johnson, Jae-Wook
Lee, Walter Lee, Albert Ma, Arvind Saraf, Mark Seneski, Nathan
Shnidman, Volker Strumpen, Matt Frank, Saman Amarasinghe, and
Anant Agarwal. The Raw microprocessor: a computational fabric
for software circuits and general-purpose programs. IEEE Micro,
22(2):25–35, 2002. doi:10.1109/MM.2002.997877.
[TNS+07] Mark Thompson, Hristo Nikolov, Todor Stefanov, Andy D. Pimentel,
Cagkan Erbas, Simon Polstra, and Ed F. Deprettere. A framework
for rapid system-level exploration, synthesis, and programming of
multimedia MP-SoCs. In Proceedings of the 5th IEEE/ACM International
Conference on Hardware/Software Codesign and System Synthesis,
CODES+ISSS ’07, pages 9–14, New York, NY, USA, 2007. ACM.
doi:10.1145/1289816.1289823.
[TS09] Lothar Thiele and Nikolay Stoimenov. Modular performance analysis
of cyclic dataflow graphs. In Proceedings of the seventh ACM
International Conference on Embedded Software, EMSOFT ’09, pages 127–136,
New York, NY, USA, 2009. ACM. doi:10.1145/1629335.1629353.
[UCS+10] Theo Ungerer, Francisco J. Cazorla, Pascal Sainrat, Guillem Bernat,
Zlatko Petrov, Hugues Cassé, Christine Rochange, Eduardo Quiñones,
Sascha Uhrig, Mike Gerdes, Irakli Guliashvili, Michael Houston, Florian
Kluge, Stefan Metzlaff, Jörg Mische, Marco Paolieri, and Julian
Wolf. Merasa: Multicore Execution of Hard Real-Time Applications
Supporting Analyzability. IEEE Micro, 30(5):66–75, 2010.
doi:10.1109/MM.2010.78.
[Ukk95] Esko Ukkonen. On-line construction of suffix trees. Algorithmica,
14:249–260, 1995. doi:10.1007/BF01206331.
[VAvGL01] Wim F. J. Verhaegh, Emile H. L. Aarts, Paul C. N. van Gorp, and Paul
E. R. Lippens. A two-stage solution approach to multidimensional
periodic scheduling. IEEE Transactions on Computer-Aided Design
of Integrated Circuits and Systems, 20(10):1185–1199, 2001. doi:10.
1109/43.952736.
[VLA+96] W.F.J. Verhaegh, P.E.R. Lippens, E.H.L. Aarts, J.L. Meerbergen, and
A. Werf. Multidimensional periodic scheduling model and complexity.
In Luc Bougé, Pierre Fraigniaud, Anne Mignotte, and Yves Robert,
editors, Euro-Par’96 Parallel Processing, volume 1124 of Lecture Notes
in Computer Science, pages 226–235. Springer Berlin Heidelberg, 1996.
doi:10.1007/BFb0024706.
[VNS07] Sven Verdoolaege, Hristo Nikolov, and Todor Stefanov. pn: a tool
for improved derivation of process networks. EURASIP Journal on
Embedded Systems, 2007(1):19–19, 2007. doi:10.1155/2007/
75947.
[WBS09] Maarten H. Wiggers, Marco J.G. Bekooij, and Gerard J.M. Smit.
Monotonicity and run-time scheduling. In Proceedings of the seventh
ACM International Conference on Embedded Software, EMSOFT ’09,
pages 177–186, New York, NY, USA, 2009. ACM. doi:10.1145/
1629335.1629359.
[WEE+08] Reinhard Wilhelm, Jakob Engblom, Andreas Ermedahl, Niklas Holsti,
Stephan Thesing, David Whalley, Guillem Bernat, Christian Ferdinand,
Reinhold Heckmann, Tulika Mitra, Frank Mueller, Isabelle Puaut,
Peter Puschner, Jan Staschulat, and Per Stenström. The worst-case
execution-time problem–overview of methods and survey of tools.
ACM Transactions on Embedded Computing Systems, 7(3):36:1–36:53,
2008. doi:10.1145/1347375.1347389.
[WGR+09] Reinhard Wilhelm, Daniel Grund, Jan Reineke, Marc Schlickling,
Markus Pister, and Christian Ferdinand. Memory Hierarchies,
Pipelines, and Buses for Future Architectures in Time-Critical Embed-
ded Systems. IEEE Transactions on Computer-Aided Design of Inte-
grated Circuits and Systems, 28(7):966–978, 2009. doi:10.1109/
TCAD.2009.2013287.
[Woo] Victoria Woollaston. New images show how Google’s self-driving
cars see the world. URL: http://www.dailymail.co.uk/
sciencetech/article-2317594/ [cited September 9, 2013].
[Xil11] Xilinx, Inc. Platform Specification Format Reference Manual: Embedded
Development Kit (EDK) 13.2, July 2011. URL: http://www.xilinx.com/
[cited June 19, 2012].
[Yue91] Minyi Yue. A simple proof of the inequality FFD(L) ≤
11/9 OPT(L) + 1, ∀L for the FFD bin-packing algorithm. Acta
Mathematicae Applicatae Sinica, 7:321–331, 1991. doi:10.1007/
BF02009683.
[ZB09] Fengxiang Zhang and Alan Burns. Schedulability Analysis for Real-
Time Systems with EDF Scheduling. IEEE Transactions on Computers,
58(9):1250–1258, 2009. doi:10.1109/TC.2009.58.
[ZBS13] Jiali Teddy Zhai, Mohamed A. Bamakhrama, and Todor Stefanov.
Exploiting just-enough parallelism when mapping streaming appli-
cations in hard real-time systems. In Proceedings of the 50th Annual
Design Automation Conference, DAC ’13, pages 170:1–170:8, New York,
NY, USA, 2013. ACM. doi:10.1145/2463209.2488944.
[ZM12] Christopher Zimmer and Frank Mueller. Low Contention Mapping
of Real-Time Tasks onto TilePro 64 Core Processors. In Proceedings
of the IEEE 18th Real-Time and Embedded Technology and Applications
Symposium, RTAS ’12, pages 131–140, Los Alamitos, CA, USA, 2012.
IEEE Computer Society. doi:10.1109/RTAS.2012.36.
[ZNS13] Jiali Teddy Zhai, Hristo Nikolov, and Todor Stefanov. Mapping of
streaming applications considering alternative application specifications.
ACM Transactions on Embedded Computing Systems, 12(1s):34:1–34:21,
2013. doi:10.1145/2435227.2435230.
[ZUE00] Dirk Ziegenbein, Jan Uerpmann, and Rolf Ernst. Dynamic re-
sponse time optimization for SDF graphs. In Proceedings of the
2000 IEEE/ACM International Conference on Computer-Aided Design,
ICCAD ’00, pages 135–141, Piscataway, NJ, USA, 2000. IEEE Press.
doi:10.1109/ICCAD.2000.896463.
Curriculum Vitae
Mohamed A. Bamakhrama was born on July 11, 1983 in Dubai, United Arab Emirates.
In 2001, he obtained his high school diploma from Muaath Bin Jabal secondary school
in Sharjah, United Arab Emirates, and was ranked among the top 10 students in the United
Arab Emirates. In 2005, he obtained a B.Sc. (honors) in Computer Engineering from the
University of Sharjah and graduated with the highest GPA in the whole university. In
2007, he obtained an M.Sc. (honors) from the Institute of Informatics at the Technical
University of Munich in Germany. From May 2008 until September 2009, he worked as
a research scientist at the Research department of NXP Semiconductors in Eindhoven,
Netherlands. Between October 2009 and October 2013, he worked towards
his Ph.D. degree as a research assistant at Leiden University in Leiden, Netherlands.
Since October 2013, he has been working as a postdoctoral researcher at the same university.
His research interests include real-time embedded systems design and programming,
hardware/software co-design, computer architecture, and computer communication and
protocol design.
List of Publications
The work described in this dissertation has resulted in the following publications:
Journal Articles
• Mohamed A. Bamakhrama and Todor P. Stefanov. On the hard-real-time
scheduling of embedded streaming applications. Design Automation for Embed-
ded Systems, 2012. doi: 10.1007/s10617-012-9086-x.
Refereed, Peer-Reviewed Conference Proceedings
• Emanuele Cannella, Mohamed A. Bamakhrama, and Todor Stefanov,
“System-level Scheduling of Real-time Streaming Applications using a Semi-partitioned
Approach,” Accepted for publication in Proceedings of the 17th Design, Automa-
tion, and Test in Europe Conference and Exhibition, DATE ’14, 2014.
• Jiali Teddy Zhai, Mohamed A. Bamakhrama, and Todor Stefanov, “Exploiting
just-enough parallelism when mapping streaming applications in hard real-
time systems,” in Proceedings of the 50th Annual Design Automation Confer-
ence, DAC ’13, (New York, NY, USA), pp. 170:1–170:8, ACM, 2013. doi:
10.1145/2463209.2488944. Winner of HiPEAC 2013 Paper Award.
• Mohamed A. Bamakhrama and Todor Stefanov, “Managing latency in embed-
ded streaming applications under hard-real-time scheduling,” in Proceedings of
the eighth IEEE/ACM/IFIP International Conference on Hardware/Software
Codesign and System Synthesis, CODES+ISSS ’12, (New York, NY, USA), pp. 83–92,
ACM, 2012. doi: 10.1145/2380445.2380464.
• Mohamed A. Bamakhrama, Jiali Teddy Zhai, Hristo Nikolov, and Todor
Stefanov, “A methodology for automated design of hard-real-time embedded
streaming systems,” in Proceedings of the 15th Design, Automation, and Test in Europe
conference, DATE ’12, pp. 941–946, 2012.
doi: 10.1109/DATE.2012.6176632.
• Mohamed Bamakhrama and Todor Stefanov, “Hard-real-time scheduling of
data-dependent tasks in embedded streaming applications,” in Proceedings of the
ninth ACM International Conference on Embedded Software, EMSOFT ’11, (New
York, NY, USA), pp. 195–204, ACM, 2011.
doi: 10.1145/2038642.2038672. Best Paper Candidate.
Summary
This dissertation substantiates models and methods for the (semi-)automatic
design of embedded computing systems that process one or more data streams in such
a way that they provably meet their timing constraints. The computing systems
favored in this dissertation are the so-called multiprocessors on a single
silicon die, MPSoCs in technical jargon. Designing software and hardware that
express concurrency in a natural way has proven successful when starting from
abstract models and system-level synthesis methods. Concurrency in software requires
scheduling on parallel hardware. Scheduling methods for implementing multi-stream
applications with hard timing constraints are scarce and not truly satisfactory. The
classical scheduling algorithms assume independent periodic tasks and are therefore
not applicable to streaming applications that are modeled as graphs in which the
tasks depend on one another.
This dissertation provides a scheduling of the tasks (or actors) in acyclic
dataflow graphs, which can provably be scheduled as a set of independent periodic
tasks. This opens the domain of dataflow graphs to the classical literature on
scheduling independent periodic tasks that must meet strict timing conditions. The
acyclic dataflow graphs addressed in this dissertation are the so-called synchronous
dataflow (SDF) graphs and cyclo-static dataflow (CSDF) graphs. The word
“cyclo” here refers not to the graphs but to the evaluation cycle of the tasks.
The framework proposed in this dissertation computes the parameters of the
periodic tasks that correspond to the functional graph actors, as well as the minimum
capacity of the buffers on the inter-task communication channels, such that the
existence of a valid periodic schedule is guaranteed. The breakthrough here is that
using two models, the dataflow graph model and the periodic task model subject to
timing constraints, is not only possible but that the two models are also
complementary: the SDF and CSDF graphs model the functional behavior of the application,
while the periodic task model with timing constraints models the temporal behavior. With
these two models, a designer can get to work. She can schedule the tasks such that
performance measures (latency and throughput) are taken into account. She can
analytically derive scheduling parameters that guarantee the performance. And she can
analytically determine the minimum number of processors and the minimum capacity of
the communication channels needed to meet the performance requirements. Finally, she
can unambiguously determine the assignment of tasks to processors.
The proposed models and methods furthermore lead to the following results.
The so-called matched I/O rates graphs, which comprise roughly 80% of dataflow
graphs, achieve optimal throughput under periodic scheduling. For a certain class
of CSDF graphs, optimal latency and throughput can be achieved under periodic
scheduling. Moreover, the latency of CSDF graphs can in general be reduced by
shortening the deadlines of the actors on the critical path.
Insofar as a sound theoretical foundation still calls for experimental
confirmation, a system-level design methodology has been set up that incorporates
the scheduling models and methods developed in this dissertation. The design flow
starts from a sequential specification (or program) of a streaming application. After
a series of systematic and automatic transformations, a parallel implementation is
produced that unconditionally meets the given timing constraints. The implementation
is a mapping of a non-sequential version of the original sequential program onto
a multiprocessor execution platform on a single silicon die. The main
transformations are: automatic derivation of a non-sequential variant of the
input program and of the associated model; the scheduling methodology
described above; and system-level synthesis. The entire procedure and the
accompanying software implementation are available in the public domain. See
http://daedalus.liacs.nl/
Finally, the dissertation offers a number of validation experiments that, thanks to
the publicly available framework, can be repeated by anyone. What do they show?
The first of the transformations mentioned takes little time, at least for a fairly
large number of meaningful programs. This is confirmed by experiments. Other experiments
confirm the quality of periodic schedules of streaming applications in terms of
throughput, latency, processor utilization, and memory requirements. For more than
70% of the evaluated applications, periodic scheduling leads to optimal latency and
throughput. Even when periodic scheduling is only as good as self-timed scheduling
(in terms of latency and throughput), it can be argued that periodic scheduling
has advantages over self-timed scheduling. Those advantages are the ability to
change the set of active programs easily, and the ability to apply a wide variety
of scheduling algorithms for real-time periodic tasks. Finally, the synthesized
streaming applications have been assessed on their actual performance compared
with their expected performance. To that end, the synthesized implementations were
executed on FPGA hardware. None of the experimental examples exhibited a deadline
miss or an error related to the computed capacity of the communication buffers.
Acknowledgments
First and foremost, all the thanks and praises are to God, the lord of the worlds, for
all that he gave me. Over the past four years, many people and organizations have
contributed to the successful completion of this dissertation. On the scientific and
professional level, I would like to thank the following researchers for their support
and collaboration: Bill Thies from Microsoft Research, Sander Stuijk from Technische
Universiteit Eindhoven, Orlando Moreira from Ericsson, and Adnan Bouakaz from
Université de Rennes 1. I would also like to thank Xilinx, Inc. for their generous donations,
in the form of hardware development boards and software licenses, which enabled me to
conduct my PhD research. Another thank you goes to the European CATRENE
program, which partially funded my research through its Tera-Scale Multicore
Architectures (TSAR) project. I would like to thank all my current and former colleagues at
Leiden Embedded Research Center (LERC), who formed a true team. I enjoyed every
single moment of my work with you. A very BIG thank you goes to Jiali Teddy Zhai,
Hristo Nikolov, Sven van Haastregt, Emanuele Cannella, Mohammed Al-Hissi, Di Liu,
Jelena Spasic, Sjoerd Meijer, and Dmitry Nadezhkin.
On the social level, I made several friends during my stay in the Netherlands who
were truly helpful and supportive and gave me the feeling of being at home. They all
deserve my thanks and gratitude. From Leiden, I would like to thank the following
friends and their families: Moosa Elayah, Saleh (Samir) Naser Al-Deen, Taleb Alkurdi,
Bilal Karasneh, Abdeljalil El Boujadayni, Khaled Younes, Mohamed Thabit, Mohamed
Ghaly, Attiya Abdelbaki, Umar Ryad, Hafiz Osman, and Mohamd Al-Sulami. From
Eindhoven, I would like to thank the following friends and their families: Hishem
Ali, Nabil Eissa, Mohammed Ezz, Ammar Osaiweran, Younes El-Waaoui, Mohamed
Azimane, and Sabri Boughorbel.
On the family level, I would like to start by expressing my thanks and gratitude
to my whole family, and in particular, my mother, my sister Khadeja, my brothers
Abdulrahman and Khalid, and my nephew Ahmed. Thank you all for your continuous
support, prayers, and encouragement. Special thanks go to my kids, Ahmed and
Khadeja, who always drew a smile on my face and brought hope to me. A special thank
you goes to my wife’s family, especially my parents-in-law, for their understanding,
encouragement, and patience during the last four years. I would like to conclude
this acknowledgement with its “dessert”. The biggest thank you goes to my other half,
Bushra, who shared the PhD journey with me from Day 1 and was instrumental to
the completion of this work. She was always there when I needed her and always
encouraged me to continue the PhD journey. This dissertation would not have been
possible without her sacrifices, patience, and continuous support. I cannot express my
gratitude in words, so I will just say: May God reward you, Bushra, for all that you did!

