Analysis and optimization of dynamic dataflow programs by Casale-Brunet, Simone
Thesis No. 6663 (2015)
ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE
Presented on 8 June 2015
to the Faculté des Sciences et Techniques de l'Ingénieur
SCI STI MM Group
Doctoral Program in Electrical Engineering
for the degree of Docteur ès Sciences
by
Simone CASALE BRUNET
accepted on the proposal of the jury:
Dr J.-M. Vesin, jury president
Dr M. Mattavelli, thesis director
Prof. J. Castrillon, examiner
Prof. N. Zufferey, examiner
Prof. A. P. Burg, examiner
Switzerland
2015

This thesis is dedicated to the loving memory of my younger brother Edo
Multas per gentes et multa per aequora uectus
Advenio has miseras, frater, ad inferias,
Vt te postremo donarem munere mortis
Et mutam nequiquam adloquerer cinerem,
Quandoquidem fortuna mihi tete abstulit ipsum,
Heu miser indigne frater adempte mihi.
Nunc tamen interea haec, prisco quae more parentum
Tradita sunt tristi munere ad inferias,
Accipe fraterno multum manantia fletu
Atque in perpetuum, frater, aue atque uale.
— Catullus (Carmi, CI. Ad inferias)

Acknowledgements
First of all, I would like to express my deepest sense of gratitude to my supervisor, Dr. Marco
Mattavelli, who offered his continuous advice and encouragement throughout the course of
this thesis.
I would also like to express my very sincere gratitude to Dr. Jörn W. Janneck, from Lund
University, for his support and systematic guidance throughout this thesis. Special thanks to Prof.
Massimo Canale, from the Politecnico di Torino, who encouraged me to pursue the goal of
earning a PhD.
The work presented in this thesis was partly supported by the Fonds National Suisse pour
la Recherche Scientifique under grant 200021.138214. This support is gratefully acknowledged.
I am thankful to all my lab colleagues. Very special thanks go to Endri Bezati for his precious
friendship and technical assistance with my project. I also take this opportunity to express my
gratitude to my friends Christian and Nicoletta, Giacomo, Aurora and Tia, Lucianone, Marco,
Martina, Rinaldo and Jenny, Renato and Sandra.
My warmest thanks and profound gratitude go to Jessica, who loved and supported
me during the writing of my dissertation. She made me feel like everything was possible
and incredible. I love you∞∞ Principessinadellafavolapiùbella♥
I thank my parents, Patrizia and Andrea, who have always given me the strength and wisdom
to be sincere in my work, for setting high moral standards and supporting me through their
hard work, and for their unselfish love and affection.
Lausanne, 8 May 2015 Simone Casale Brunet

Abstract
Computing platforms, from mobile devices to supercomputers, are becoming increasingly
heterogeneous and massively parallel. While they can provide higher power efficiency and
computation throughput, using them effectively and with confidence still requires knowledge
of low-level programming. The average time needed to develop and optimize a design on a
heterogeneous platform keeps growing compared with typical homogeneous systems.
Dataflow models of computation (MoCs) are quickly becoming common practice
in heterogeneous systems development. In domains such as signal processing and multimedia
communication, dataflow MoCs have become an accepted standard. However, the shift from a
sequential and architecture-specific MoC to a dataflow MoC still raises several programming
and development challenges. The Cal Actor Language (CAL) is a recently specified
dataflow and actor-based language capable of concisely expressing complex, general-purpose
parallel applications. However, the design tools supporting this language are generally not
adequate to fully exploit its features and expressive power. In fact, they generally restrict
its MoC in order to reduce the design space exploration (DSE) effort. The objective of this
thesis is to provide a DSE methodology in which all the features of CAL and dynamic dataflow
MoCs can be exploited in a more general and effective manner. This dissertation presents a
novel profiling, analysis and performance estimation methodology for the DSE of dynamic
dataflow programs. The main research contributions of this thesis are: the formalization
of a graph-based representation of the program execution, called an execution trace graph
(ETG); the formalization of a systematic methodology for profiling generic dynamic dataflow
programs through the interpretation of their code; and the formalization of a complete DSE
methodology for dynamic dataflow programs that efficiently identifies close-to-optimal design
points according to various tailored performance merit functions. In particular, the following
design space optimization problems for dynamic dataflow programs are addressed: the analysis
of the hotspots and algorithmic bottlenecks of a parallel program; the bounding and
optimization of the buffer size configuration for complex designs; and the minimization of the
dynamic power dissipation of programs implemented on multi-clock-domain architectures.
Furthermore, the design space critical path of a dataflow application is defined and its
potential speedup is revisited. The thesis also presents a DSE
framework developed to demonstrate the effectiveness of this design methodology.
Key words: dynamic dataflow, design space exploration, heterogeneous computing, CAL

Résumé
Nowadays, all computing platforms, from mobile devices to supercomputers, are becoming
more and more heterogeneous and massively parallel, which makes them very efficient in
terms of power and computation. Making good use of these systems requires ever more
knowledge of low-level programming. Moreover, the average time needed to develop and
optimize this type of system keeps growing compared with typically sequential systems.
Dataflow models of computation are quickly becoming the most common practice in the
development of heterogeneous systems. In domains such as signal processing and
multimedia, these dataflow models of computation have become a widely accepted standard.
However, the shift from a sequential, architecture-specific method to a dataflow method
shows that several programming and development challenges are yet to be uncovered. To
address this shift, a recently specified dataflow programming language has been developed.
This language, called the Cal Actor Language (CAL), can concisely express complex parallel
applications with a simple and generic formalism. Nevertheless, the design tools based on
this language are generally not sufficient to fully exploit all of its features, especially
its expressive power. In general, the currently available tools severely restrict its model
of computation in order to reduce the design space exploration effort. The objective of
this thesis is therefore to provide a design space exploration methodology in which all the
features of CAL and of its model of computation can be exploited in a more general and more
effective manner. It also demonstrates a new performance estimation and analysis
methodology for dynamic dataflow applications. The main research contributions of this
thesis are: the formalization of a graph-theoretic representation of the program execution,
called the "execution trace graph"; the formalization of a systematic methodology for
profiling generic dynamic dataflow programs through the high-level interpretation of their
source code; the formalization of a complete design space exploration methodology for
dynamic dataflow programs. In addition, the design space optimization problems addressed
for dynamic dataflow programs are: the analysis of the algorithmic bottlenecks of a
program; the selection and optimization of the memory size configuration for complex
applications; the minimization of the dynamic power dissipation of programs implemented in
a multi-clock architecture. Furthermore, theoretical concepts such as the design space
critical path and the potential speedup of a dataflow application have been defined and
revisited, respectively. The thesis also presents design space exploration software
developed to demonstrate the effectiveness of this method.
Key words: dataflow, design space exploration, parallel computing, heterogeneous platforms, CAL
Contents
Acknowledgements
Abstract
Résumé
List of symbols
List of figures
List of tables
List of listings
List of algorithms
1 Introduction
1.1 Heterogeneous systems development
1.1.1 Requirements for effective design development
1.1.2 Models of computation
1.1.3 Design space exploration
1.2 Motivation of this thesis
1.3 System development design flow
1.4 Research contributions of this thesis
1.5 Thesis organization
2 Dataflow programming
2.1 Dataflow programs
2.1.1 Kahn process networks
2.1.2 Dataflow process networks
2.1.3 Actor transition systems
2.2 Dataflow paradigm
2.2.1 Modular programming
2.2.2 Parallelism flavors
2.3 Dataflow classes
2.3.1 Static dataflow programs
2.3.2 Cyclo-static dataflow programs
2.3.3 Dynamic dataflow programs
2.4 Code interpretation and generation
2.4.1 Abstract syntax tree
2.4.2 Intermediate representation
2.4.3 Control flow graph
2.5 The Cal Actor Language
2.5.1 CAL program
2.5.2 Execution model
2.5.3 CAL syntax and semantics
2.5.4 An example of a CAL program
2.5.5 RVC-CAL
2.5.6 Compiler infrastructure
2.6 Conclusions
3 Profiling CAL programs
3.1 Actor classification
3.2 Static analysis
3.2.1 Source lines of code
3.2.2 Operators count
3.2.3 Cyclomatic complexity
3.2.4 Halstead metrics
3.3 Data-dependent analysis
3.3.1 Computational load
3.3.2 Data-transfers and storage load
3.4 Conclusions
4 Exploring the design space of dataflow programs
4.1 Orthogonalization of concerns
4.1.1 Model of computation
4.1.2 Model of architecture
4.1.3 Mapping
4.2 The design space of a dataflow program
4.2.1 Design space and design points
4.2.2 Exploration methods
4.2.3 Performance estimation
4.3 Related work
4.4 Advances in design space exploration of CAL programs
4.4.1 Space for improvement
4.4.2 New requirements
4.5 Conclusions
5 Execution trace graph
5.1 Geometry of execution
5.1.1 Partially-ordered space
5.1.2 Execution trace
5.1.3 Execution trace space
5.2 Execution trace graph
5.2.1 Firings
5.2.2 Dependencies
5.2.3 Example of an execution trace graph
5.3 Properties
5.3.1 Topological order
5.3.2 Mapping independence
5.3.3 Untimed
5.3.4 Maximum parallelism
5.3.5 Data dependent
5.3.6 Modeling a dynamic program execution
5.4 Timed execution trace graph
5.4.1 Firing weight
5.4.2 Dependency weight
5.5 Transformations
5.5.1 Firing expansion
5.5.2 Dependency amalgamation
5.5.3 Event-driven system representation
5.6 Conclusions
6 TURNUS: a design space exploration environment for CAL programs
6.1 Design flow features and capabilities
6.1.1 Profiler
6.1.2 Execution trace graph post-mortem scheduling and analysis
6.2 High-level models
6.2.1 CAL dataflow program
6.2.2 Architecture and constraints
6.2.3 Execution trace graph
6.2.4 Profiling information
6.3 Integration with third-party CAL dataflow environments
6.4 Conclusions
7 Profiling CAL programs with TURNUS
7.1 Advances in profiling CAL programs
7.2 Data collection
7.2.1 Firing data
7.2.2 Action data
7.2.3 Actor data
7.2.4 Buffer data
7.2.5 Statistical data
7.2.6 Profiled token
7.3 Building of the execution trace graph
7.4 Application programming interface
7.5 Conclusions
8 Design space exploration and optimization with TURNUS
8.1 Performance estimation
8.1.1 Post-mortem scheduler models
8.1.2 Execution trace graph post-mortem scheduling
8.1.3 Execution statistics
8.1.4 Analysis of a collection of execution trace graphs
8.2 Design space critical path
8.2.1 Critical path length
8.2.2 Algorithmic critical path
8.2.3 Throughput and design space critical path
8.2.4 Potential speedup
8.3 Hotspot analysis
8.3.1 Critical actions ranking
8.3.2 Impact analysis
8.4 Buffer size dimensioning
8.4.1 Related work
8.4.2 Deadlock and feasible regions
8.4.3 Minimization by the use of a model predictive control approach
8.4.4 Optimization by the exploration of the design space critical path
8.5 Dynamic power dissipation minimization
8.5.1 Related work
8.5.2 Multi-clock domain partitioning
8.5.3 Linear programming formulation
8.5.4 Heuristic approach
8.6 Conclusions
9 Experimental results
9.1 Design cases
9.1.1 JPEG decoder
9.1.2 MPEG4-SP decoder
9.1.3 MPEG-HEVC decoder
9.2 CAL source code static and dynamic profiling
9.2.1 Source code static analysis
9.2.2 Memory requirements and utilization
9.2.3 Execution trace graph
9.2.4 Initial design-refactoring directions
9.3 Design refactoring
9.3.1 Critical action ranking
9.3.2 Impact analysis
9.4 Bounded buffer size configuration
9.5 Buffer size optimization
9.6 Dynamic power dissipation minimization
9.7 Conclusions
10 Conclusions
10.1 Future work
A Discrete event system and simulation
A.1 Petri nets
A.1.1 State
A.1.2 Transition firing
A.2 Discrete event system specification
A.2.1 The atomic model
B Model predictive control
C A CAL esoteric example
C.1 A Chef chocolate cake
C.2 From a sequential to a dataflow program specification
C.3 The first CAL chocolate cake
C.4 A dynamic refrigerator
C.5 Design space exploration of a kitchen
Bibliography
Curriculum Vitae

List of Symbols
Acronyms
AAM Algorithm architecture adequation matching
ACP Algorithmic critical path
API Application programming interface
ASIC Application-specific integrated circuit
ATM Actor transition system
CAL Cal Actor Language
CIF Common interchange format
CP Critical path
CPL Critical path length
CSDF Cyclo-static dataflow program
DAG Directed acyclic graph
DCT Discrete cosine transform
DDF Dynamic dataflow program
DPN Dataflow process network
DSCP Design space critical path
DSE Design space exploration
ETG Execution trace graph
FID Firing identifier
FIFO First in, first out
FNL Functional unit network language
FPGA Field-programmable gate array
GALS Globally asynchronous locally synchronous
HEVC High efficiency video coding
HW Hardware
IDCT Inverse DCT
ILP Integer linear programming
IQ Inverse quantization
JPEG Joint photographic experts group
KPN Kahn process network
LP Linear programming
LTS Labeled transition systems
LUB Least upper bound
MCD Multiple-clock domain
MoA Model of architecture
MoC Model of computation
MPC Model predictive control
MPEG Moving picture experts group
MPSoC Multiprocessor system on chip
NPM Native programming model
Orcc Open RVC-CAL compiler infrastructure
PAPI Performance application programming interface
PAPS Periodic admissible parallel schedule
PASS Periodic admissible sequential schedule
PiMM Parameterized and interfaced dataflow meta-model
PN Petri net
QCIF Quarter CIF
RMC Reconfigurable media coding
RTL Register-transfer level
RVC Reconfigurable video coding
S-LAM System-level architecture model
SDF Static or synchronous dataflow program
SoC System on chip
SP Simple profile
SW Software
TETG Timed execution trace graph
UML Unified modeling language
VHDL VHSIC hardware description language
VHSIC Very high speed integrated circuit
VLSI Very large scale integration
XDF XML dataflow format
XML Extensible markup language
Operators
$[\cdot]'$ The matrix transpose operator
$\arg\max(\cdot)$ The argument of the maximum operator
$\arg\min(\cdot)$ The argument of the minimum operator
$\bullet$ The amalgamation operator
$|\cdot|$ The cardinality operator
$\|\cdot\|_n$ The n-norm operator
$|\overrightarrow{\cdot}|$ The directed path length operator
$\max(\cdot)$ The maximum value operator
$\min(\cdot)$ The minimum value operator
$\oplus$ The concatenation operator
$\overrightarrow{\cdot}$ The directed path operator
$E[\cdot]$ The expected value operator
$\mathrm{Var}(\cdot)$ The variance operator
Variables
$(X, \leq)$ A partially-ordered space, also called po-space
$(X, dX)$ A directed topological space, also called d-space
$\beta \in C_\beta$ A buffer size configuration
$\eta$ Action execution index
$\kappa \in K$ A CAL actor-class
$\lambda \in \Lambda$ A CAL action
$\Lambda$ The set of CAL actions
$\Lambda_{CP} \subseteq \Lambda$ The set of actions that have at least one action firing along the critical path
$|\overrightarrow{CP}|$ The critical path length
$\hat{T}$ The estimated design throughput
$S(n)$ The speedup of a program when executed on $n$ processing elements
$T$ The design throughput
$\mu$ Dependency kind
$\overrightarrow{1} = [0,1]$ The closed and directed unit interval
$\overrightarrow{CP}$ The critical path
$\overrightarrow{CP}_{algo}$ The algorithmic critical path
$\overrightarrow{p}$ A directed path, also called d-path
$\perp$ The empty sequence
$\rho \in C_\rho$ A partitioning configuration
$\sigma \in C_\sigma$ A scheduling configuration
$\Theta$ The set of clock-accurate profiling information
$a \in A$ A CAL actor
$A$ The set of CAL actors
$A_{CP} \subseteq A$ The set of actors that have at least one action firing along the critical path
$B$ The set of buffers of a dataflow program
$b_i \in B$ The i-th buffer of a dataflow program
$C_\beta$ The set of buffer size configurations
$C_\rho$ The set of partitioning configurations
$C_\sigma$ The set of scheduling configurations
$D$ The ETG dependencies set
$d$ Dependency direction
$D^\bullet$ The set of amalgamated dependencies
$D_c \subseteq D$ The critical dependencies set
$D_f \subseteq D$ The finite state machine dependencies set
$D_g \subseteq D$ The guard dependencies set
$D_p \subseteq D$ The port dependencies set
$D_t \subseteq D$ The tokens dependencies set
$D_v \subseteq D$ The internal variables dependencies set
$D_{CP} \subseteq D_c$ The dependencies set along the critical path
$E(X, dX)$ The execution trace space
$e^\bullet_n \in D^\bullet$ The n-th amalgamated dependency
$e_n = (s_i, s_j) \in D$ The n-th dependency of the ETG, with $s_i$ and $s_j$ the source and target firings
$G(PU, ME, L)$ The platform architecture model
$G(V, E)$ A generic graph, with $V$ and $E$ the sets of vertices and edges
$K$ The set of CAL actor-classes
$K_{CP} \subseteq K$ The set of actor-classes that have at least one action firing along the critical path
$L$ The set of links available on the architecture
$l_i \in L$ The i-th link (i.e. the interconnection between a processing element and a medium)
$M$ The set of mapping points
$ME$ The set of media of an architecture
$me_i \in ME$ The i-th medium
$P$ The set of Petri net places
$p_i \in P$ The i-th Petri net place
P;− The set of fictive Petri net places
$PU$ The set of processing elements of an architecture
$pu_i \in PU$ The i-th processing element
$S$ The ETG action firings set
$S_c \subseteq S$ The critical action firings set
$s_i \in S$ The i-th firing of the ETG
$S_{CP} \subseteq S_c$ The action firings set along the critical path
$T$ The set of Petri net transitions
$t_i \in T$ The i-th Petri net transition
$w(s_i)$ Execution time (or time weight) required by action firing $s_i$
$w(s_i, s_j)$ Execution time (or time weight) required by dependency $(s_i, s_j)$
List of Figures
1.1 Simplified typical design flow of a heterogeneous hardware and software system. 2
1.2 Heterogeneous system development design flow for CAL dataflow programs. . 7
2.1 Pipeline parallelism. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Task parallelism. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3 Data parallelism. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4 Dataflow MoCs classes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5 A dataflow graph with two actors, ai and a j , connected through the buffer bn .
pi ,n defines the number of tokens produced on bn during each firing of ai . c j ,n
defines the number of tokens consumed from bn during each firing of a j . . . . 19
2.6 Dataflow graph example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.7 Code compiler and interpreter flowcharts. . . . . . . . . . . . . . . . . . . . . . . 23
2.8 CAL network and actors structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.9 Action execution model according to Equation 5.10. . . . . . . . . . . . . . . . . 26
2.10 Basic dataflow program. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.11 The RVC-CAL compiler and Xronos infrastructure integrated in the design flow
presented in Figure 1.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.1 Mapping from an application to an architecture. Constraints represent the
feasible regions of the design space. . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2 The design space M =Cρ×Cσ×Cβ = {m1,m2, . . . ,mnM } and the corresponding
performance T(m) and estimated performance T̂(m). . . . . . . . . . . . . . . . 46
4.3 Platform independent simulation of the CAL network depicted in Fig. 2.10
with the mapping configurations described in Table 4.1. The execution of each
action is supposed to take at least one (abstract) clock cycle (when there are no
blocking output buffers), the overhead introduced by the action selection and
buffer access overheads are both neglected. In gray the actor execution with the
corresponding action firing. In striped-gray the actor execution is postponed
due to the unavailability of a token (i.e. blocking reading). . . . . . . . . . . . . 47
5.1 Execution space in R2 of two actors A and B mapped on two processing units
pu1 and pu2, respectively. The dashed arrow represents a possible execution
path of the program. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
xix
List of Figures
5.2 Execution trace graph obtained after the execution of the CAL program described
in Section 2.5.4. The firing set S is summarized in Table 2.2, and the
dependencies set D is summarized in Table 5.2.
5.3 Execution Trace Graphs of the CAL network depicted in Fig. 2.10. Dashed lines
represent additional edges that model a particular scheduling configuration
defined within the mapping configurations described in Table 4.1.
5.4 Two possible execution paths of the GuardedInverter actor illustrated in
Listing 5.1. The corresponding execution trace graphs do not take into account
the guard enable and disable dependencies.
5.5 Guard enable and disable dependencies couples that model the guard enable
windows n = 1 and n = 2 depicted in Figure 5.4. The firing $s_b$ represents a generic
firing of the action B.
5.6 The ETGs related to the execution paths depicted in Figure 5.4a and Figure
5.4b where for each firing of B a couple of guard enable and disable has been
considered in order to model the guard enabled windows n = 1 and n = 2.
5.7 Firings expansion of an execution trace graph.
5.8 Amalgamation of the execution trace graph illustrated in Figure 5.2.
5.9 Petri net obtained from the execution trace graph depicted in Figure 5.2.
6.1 TURNUS design flow.
6.2 The Network object.
6.3 The ActorClass object.
6.4 The Actor object.
6.5 The Action object.
6.6 The Quid object.
6.7 The Procedure object.
6.8 The Variable object.
6.9 The Guard object.
6.10 The Port object.
6.11 The Buffer object.
6.12 The Type object.
6.13 The Version object.
6.14 Xilinx Zynq-7 ZC702 evaluation-board architecture model.
6.15 The Platform object.
6.16 The ProcessingElement object.
6.17 The Medium object.
6.18 The Link object.
6.19 The Scheduler object.
6.20 The Trace object.
6.21 The Firing object.
6.22 The Dependency object.
6.23 The NetworkProfilingData and ActionProfilingData objects.
6.24 The open RVC-CAL compiler (Orcc) and Xronos infrastructure integrated in the
TURNUS design flow.
7.1 CAL profilers design flow.
7.2 The FiringData object.
7.3 The ActionData object.
7.4 The ActorData object.
7.5 The ActorTracingData object.
7.6 The BufferData object.
7.7 The Statistics object.
7.8 The Token object.
8.1 Execution trace graph post-mortem scheduler: simulation models.
8.2 Sequence diagram for the DEVS atomic implementation of an actor.
8.3 Design space critical path.
8.4 Critical path length linear model $|CP|(\rho_n)$.
8.5 Theoretical speedup $S(n)$ defined in Equation (8.27) for different values of
$h = |CP|_{algo}/w \in [0,1]$ when $n_A = 10$.
8.6 Example of impact analysis for three actions $\lambda_1$, $\lambda_2$ and $\lambda_3$.
8.7 Critical path design space given different buffer size configurations.
8.8 Bounded buffer scheduling with deadlock avoidance approach.
8.9 Bounded buffer scheduling with deadlock recovery approach.
9.1 JPEG decoder.
9.2 MPEG4-SP decoder.
9.3 HEVC decoder.
9.4 The rendering of a small portion (i.e. approximately 80000 action firings and
350000 dependencies) of the execution trace graph described in Table 9.4. Action
firings are colored according to the corresponding actor.
9.5 Refactoring strategies for the Inter-Prediction actor.
9.6 Impact analysis for the initial version of the Shared-Memory MPEG-HEVC decoder.
9.7 Buffer size optimization of the MPEG4-SP decoder implemented on an ST Microelectronics
STHorm platform.
C.1 A dataflow representation of the Hello World Cake with Chocolate
sauce Chef program illustrated in Listing C.1.
C.2 The Liquify CAL actor defined in Listing C.4.
C.3 The modified version of the ChocolateSauce CAL actor.

List of Tables
2.1 CAL lexical tokens.
2.2 Firing of the CAL program described in Section 2.5.4.
3.1 Profiled executed operators and statements.
4.1 Mapping configurations for the dataflow network illustrated in Figure 2.10. For
brevity, the actors Producer, Filter and Consumer are denoted with P, F, C, respectively.
The partitioning of the buffers is not considered.
5.1 Dependencies kinds, directions, parameters and additional attributes.
5.2 Dependencies set D of the execution trace graph depicted in Figure 5.2.
5.3 Firings sequence of the CAL actor Split defined in Listing 2.2 when two input
sequences are available in its input port I: $I_1 = \{0, 1, -10, -5\}$ and
$I_2 = \{-1, -1, 0, -1\}$, respectively.
5.4 Firings with the corresponding internal variable and guard values for the execution
trajectories and graphs depicted in Figure 5.4.
5.5 Firing weight parameters for the linear model of Equation 5.10.
7.1 CAL profilers features.
9.1 Static code complexity of the MPEG-HEVC decoder.
9.2 Actor memory requirements for the initial HEVC design.
9.3 Buffer utilization profiling data of the MPEG-HEVC decoder.
9.4 Execution trace graph configuration of the Ref-Standard MPEG-HEVC decoder.
9.5 Memory requirement for the actor internal variables of the Shared-Memory
MPEG-HEVC decoder.
9.6 Execution trace graph configuration of the Shared-Memory MPEG-HEVC decoder.
9.7 Critical action ranking analysis of the MPEG-HEVC decoder. Results are for the
initial and the full-parallel version, summarized for 5 different actions in Table
9.7a and 9.7b respectively.
9.8 Description of different configurations of the HEVC decoder design and corresponding
speedup, computational complexity and critical path length values.
9.9 Design sizes: numbers of actors, buffers and action firings.
9.10 Bounded buffer size configurations of the JPEG decoder using the MPC approach.
Results are compared to state of the art approaches.
9.11 Bounded buffer size configurations of the MPEG-HEVC decoder using the MPC
approach. Results are compared to state of the art approaches.
9.12 2-Clock domains dynamic power minimization results of the MPEG-4 SP decoder
implemented in a Xilinx Virtex-5 FPGA. Nominal: all the domains use the
maximum available frequency; Optimized: with the clock frequencies illustrated
in Table 9.12a. ∆% defines the percentage reduction between the nominal and
optimized case of each contribution.
9.13 4-Clock domains dynamic power minimization results of the MPEG-4 SP decoder
implemented in a Xilinx Virtex-5 FPGA. Nominal: all the domains use the
maximum available frequency; Optimized: with the clock frequencies illustrated
in Table 9.13a. ∆% defines the percentage reduction between the nominal and
optimized case of each contribution.
List of Listings
2.1 Inverter.cal
2.2 Split.cal
2.3 ParametrizedProducer.cal
2.4 PingPongMerge.cal
2.5 BiasedMerge.cal
2.6 TokenConsumer.cal
2.7 BasicNetwork.nl
2.8 BasicNetwork.xdf
5.1 GuardedInverter.cal
C.1 Cake.chef
C.2 Cake.cal
C.3 ChocolateSauce.cal
C.4 Liquify.cal
C.5 ModifiedChocolateSauce.cal
C.6 Refrigerate.cal
C.7 Refrigerate.cal

List of Algorithms
1 Compute the set of parameters $ES(s_i)$, $EF(s_i)$, $LS(s_i)$, $LF(s_i)$ for each $s_i \in S$.
2 Compute the slack values $SL(s_i)$ for each $s_i \in S$ and $SL(s_i, s_j)$, and the sets of critical
firings $S_c$ and critical dependencies $D_c$.
3 Critical path extraction.
4 Impact analysis for the set of critical actions $\Lambda_{CP}$.
5 Critical path length reduction by increasing the size of critical buffers.
6 Heuristic algorithm for solving the problem of the multi-clock domain partitioning
defined in Equation (8.45).

1 Introduction
This thesis addresses the problem of analyzing complex applications modeled with
emerging dataflow programming languages. In recent decades, there has been a great deal of
activity and advancement in the field of dataflow programming languages. This interest
stems from the fact that the increasing demand for computing power can hardly be met
by improvements in device technology alone. Nowadays, heterogeneous parallel platforms,
which combine the processing features of FPGAs with multi-core CPUs, offer in a single
silicon die a potential computing power that far exceeds what was available in past
years. However, programming these platforms is becoming more and more complex.
Consequently, designers have to implement
increasingly complex applications for increasingly complex and networked platforms. The
potential power of those platforms can only be exploited if existing design flows are able to
support the new heterogeneous architectures. Designs capable of efficiently exploiting the
architecture characteristics must encompass both hardware and software design concepts,
which are currently expressed by using completely different abstractions. This thesis defines a
complete design flow, supported by a software tool environment, that guides the designer
efficiently and easily through the entire application development process.
1.1 Heterogeneous systems development
All computing platforms, from mobile devices to supercomputers, are becoming more and more
heterogeneous and massively parallel. At a time when new hardware meant higher clock
frequencies, old programs almost always ran faster on more modern equipment. However,
this is no longer the case when programs written for single-core systems have to
execute, for example, on low-power multi-core platforms running at possibly lower clock
speeds. While these heterogeneous and massively parallel platforms can provide higher
power efficiency and computational throughput, their effective and confident use always
requires knowledge of low-level programming. Hence, the average time necessary to
develop and optimize a design on heterogeneous platforms is increasingly higher than on
typical homogeneous systems. A common practice is to choose an a priori partitioning of the
design.
Each part of the design is specifically developed for the assigned computing element.
This typical design flow, depicted in Figure 1.1, starts with a behavioral description of
the application. This description is generally made using a plethora of different programming
languages. Application parts that are implemented in hardware (HW) are generally specified
using parallel languages (e.g. VHDL, Verilog), and application parts that are implemented in
software (SW) are generally specified using sequential languages (e.g. C/C++, CUDA, OpenCL)
that sometimes make use of parallel pragmas (e.g. OpenMP) specified by the designer. This
initial partitioning choice can affect all the development stages. In fact, if the design
requirements and constraints are not satisfied, then the design must be optimized. If the
modification requires a change of the partitioning configuration, then part of the design
(or the entire design) must be rebuilt from scratch. This is not only frustrating for the
designer, but it also increases
the time-to-market of the application. As a consequence, this typical design flow cannot
be considered an adequate and productive methodology for design development on
heterogeneous platforms.
[Figure 1.1 (flow diagram). Blocks: Behavioral description; Software porting; Procedural optimization; Compilation; Interfaces synthesis; Hardware porting; Latency/Area/Power minimization; HDL synthesis; Co-Simulation performance estimation; Optimization up until satisfying system constraints (iteration loop); Implementation.]
Figure 1.1: Simplified typical design flow of a heterogeneous hardware and software system.
1.1.1 Requirements for effective design development
The main requirements for effective design development on heterogeneous platforms can be
summarized in terms of:
• Design abstraction: one of the most important questions that a designer faces during
the early stage of the development is which level of abstraction should be used. The
answer is not always trivial because of the diverse nature of platforms. Different degrees of
abstraction may be employed depending on the amount of detail needed to describe
the requirements.
• Modularity: design abstraction should create opportunities for a more flexible and
modular implementation. The functionalities of a program should be separated into
independent and interchangeable modules, where each module contains everything
necessary to execute only a specific functionality.
• Composability: design abstraction should create opportunities for implementations
built from recombinable components. These components should be selected and assembled in
various combinations in order to satisfy specific design requirements.
• Reuse: design abstraction and the modularity of an application should create opportunities
for the reuse of program components. As an example, several audio codecs share
part of the same functional units, which makes modularity a necessity.
These requirements are essential for a unified computation abstraction for HW and SW, which
requires programming the application with a model of computation that is modular and, at
the same time, abstracts away platform-specific details.
1.1.2 Models of computation
One of the main obstacles that may prevent the widespread usage of heterogeneous parallel
platforms is the fact that serial models of computation (MoCs) and programming methods are
still adopted. The vast majority of existing software is written in sequential form. However,
efficient parallel implementations are challenging and arduous to achieve using sequential
MoCs. In fact, sequential languages are notoriously difficult to parallelize in general, so
efficient parallel implementations usually require significant guidance from the user.
Consequently, serious problems arise when porting existing technologies and applications
to the new high-performance heterogeneous and massively parallel platforms. The
understandability, predictability, and determinism of purely sequential MoCs remain crucial
requirements for parallel MoCs. Hence, a shift to a new programming paradigm that exploits
the parallelism and diversity involved in heterogeneous systems development is clearly
becoming a necessity. In application areas characterized by the use of highly parallel
computing platforms, the use of a dataflow MoC to describe the algorithms creates opportunities
for more flexible implementation and also for more extensive analysis. One of the reasons for
this is that a dataflow program describes an algorithm as a (possibly hierarchical) network of
communicating computational kernels, also called actors. Actors are connected by directed,
lossless, order-preserving point-to-point channels. This makes the flow of data between
actors explicit: actors are not permitted to share data in any other way than by sending each
other messages, called tokens. Furthermore, this MoC also exposes the application-internal
parallelism between actors. As actors are forbidden to share state, implementation tools have
a great range of freedom in mapping dataflow programs to hardware and software
implementations, or mixtures thereof. Dataflow programs can be analyzed in different ways
for different purposes. The subclass of statically schedulable programs, also called static
dataflow programs, is amenable to pure compile-time analysis that yields not only a static
schedule, but also exact minimal bounds for buffer sizes, exact predictions of
throughput and latency, and a guarantee of the absence of deadlocks. However,
for many complex applications (e.g. signal processing), it is not possible to represent all of the
functionality in terms of a purely statically schedulable program. Functionality that involves
conditional execution or dynamically varying token production and consumption rates can
only be directly expressed through a dynamic dataflow representation. Intuitively, in dynamic
dataflow programs the production and consumption rates of actors can vary in ways
that are not entirely predictable at compile time. As a consequence, compile-time analysis
may provide inconclusive results.
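As a minimal illustration of these notions, the following Python sketch (not CAL, and not part of the tool chain described in this thesis) models a single actor that communicates only through FIFO channels. The names `Channel`, `split_fire` and `run_split` are hypothetical, and the routing rule of this split actor (non-negative tokens to one output, negative tokens to the other) is an assumption made for the example. Each firing consumes exactly one token, but the output port that receives it depends on the token value, so the per-port production rate cannot be determined at compile time.

```python
from collections import deque


class Channel:
    """Directed, lossless, order-preserving point-to-point FIFO channel."""

    def __init__(self):
        self._fifo = deque()

    def write(self, token):
        self._fifo.append(token)

    def read(self):
        return self._fifo.popleft()

    def has_token(self):
        return len(self._fifo) > 0

    def tokens(self):
        return list(self._fifo)


def split_fire(inp, pos, neg):
    """Fire the (hypothetical) split actor once, if a token is available.

    One token is consumed per firing; which output receives a token
    depends on its value, so the production rate per port is dynamic.
    Returns True if the actor fired, False otherwise.
    """
    if not inp.has_token():
        return False  # blocking read: the firing is postponed
    token = inp.read()
    (pos if token >= 0 else neg).write(token)
    return True


def run_split(values):
    """Feed `values` to the actor and fire it until its input is empty."""
    inp, pos, neg = Channel(), Channel(), Channel()
    for v in values:
        inp.write(v)
    firings = 0
    while split_fire(inp, pos, neg):
        firings += 1
    return firings, pos.tokens(), neg.tokens()
```

With the input sequence 0, 1, -10, -5 the actor fires four times and places two tokens on each output; with -1, -1, 0, -1 it also fires four times but places one token on the first output and three on the second. The number of firings is the same, yet the traffic on each channel differs with the data.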
1.1.3 Design space exploration
Design space exploration (DSE) refers to the activity of evaluating and exploring the different
design alternatives during the system development of an application. For large and complex
designs, implemented on heterogeneous and massively parallel platforms, the number of
design alternatives easily becomes too large for a manual exploration, which would be
inefficient and error-prone.
For this reason, several DSE methodologies have been investigated in recent decades. Each
DSE methodology generally makes use of the following functionalities:
• Rapid prototyping: DSE is used to generate a set of prototypes prior to implementation.
Validating and testing the design before its final implementation may reduce the cost
and the time required for solving problems that can arise in the late production cycle
of an application. Furthermore, it can increase understanding of the impact of design
decisions during the implementation process.
• Optimization: even though validation is an important part of the design process, it is
possible that the application does not satisfy the requirements. Consequently, feasible
design configurations should be explored in order to meet the requirements. If none
exists, the design should be modified; in that case, clear and precise refactoring
directions should be provided to the designer.
• System integration: when heterogeneous platforms are used as target architecture of
the application, system integration can become one of the most tedious and error-prone
stages of the development. System integration requires a working assembly and
configuration of multiple components. DSE can be used to find feasible assembly
configurations that satisfy the design constraints and requirements.
Therefore, a designer must have a formal method, supported by a computer-aided framework,
for finding a feasible set of design alternatives, also referred to as design points, that
meet the specification requirements. However, general and structured methodologies are
lacking for designing application-specific architectures that are sufficiently modular and
programmable. In fact, the current practice is to design application-specific architectures at a
detailed level. The level of detail involved limits the number of design points that can be
explored effectively. As a consequence, this may limit the freedom to make trade-offs between
programmability, resource utilization and achievable performance. For this reason, a generic
DSE environment should encompass the following main components:
• Application model: a suitable representation of the design space is essential. This
should be formal (so that automated analysis and exploration techniques can be applied),
general (the application should be platform-independent and retargetable), and expressive
(so that constraints imposed by the target platform can be captured and enforced).
• Exploration and analysis: the environment should provide a collection of computer-
aided techniques for discovering potential design configuration candidates. Moreover,
the framework should be able to tackle the challenge of solving, in a reasonable time
frame, a large number of complex design constraints. As far as the user is concerned,
the framework must provide a method for navigating through the set of interesting and
distinctive solutions.
• Performance estimation: the environment should be able to estimate the application
performance and directly test the different candidate solutions. As far as the user is
concerned, testing the different configurations one by one without the possibility of
estimating the performance can be an error-prone procedure, since this may require
several partial implementations of the application. Furthermore, a good estimation also
means that the DSE analysis provides reliable results.
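The interplay of these three components can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not the exploration method developed in this thesis: a design point is reduced to a mapping of actors onto processing units, the performance estimate is simply the load of the most loaded unit (communication and scheduling overheads are ignored), and the only constraint is a deadline. The function names and the workload values are hypothetical.

```python
from itertools import product


def design_points(actors, n_units):
    """Enumerate every mapping of actors onto n_units processing units."""
    for assignment in product(range(n_units), repeat=len(actors)):
        yield dict(zip(actors, assignment))


def estimated_time(mapping, workload):
    """Toy estimate: the most loaded unit bounds the execution time.

    Assumption: communication cost and scheduling overhead are ignored.
    """
    load = {}
    for actor, unit in mapping.items():
        load[unit] = load.get(unit, 0) + workload[actor]
    return max(load.values())


def explore(workload, n_units, deadline):
    """Return the feasible design points, best estimate first."""
    feasible = []
    for mapping in design_points(sorted(workload), n_units):
        t = estimated_time(mapping, workload)
        if t <= deadline:  # design constraint
            feasible.append((t, mapping))
    feasible.sort(key=lambda item: item[0])
    return feasible
```

For three actors with hypothetical workloads `{"P": 4, "F": 6, "C": 3}`, two units and a deadline of 7, only two of the eight mappings are feasible. Even this small case shows why exhaustive enumeration does not scale: the number of mappings grows as the number of units raised to the number of actors, before buffer sizes or scheduling policies are even considered.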
1.2 Motivation of this thesis
Dataflow MoCs are a promising approach to heterogeneous systems development. It has
already been demonstrated that they can efficiently support both the portability of an
application and the portability of its parallelism. In domains such as signal
processing and multimedia communication, where scalability is also increasingly regarded
as a fundamental requirement, dataflow MoCs have already become an accepted standard.
However, the shift from a sequential and architecture-specific MoC to a dataflow MoC still
raises several programming and development challenges. The Cal Actor Language (CAL)
is a recently specified dataflow and actor-based language capable of concisely expressing
complex and general purpose parallel applications. A subset of this language has also been
standardized within the MPEG Reconfigurable Video Coding (RVC) framework, where it is
used to specify the standard Video Tool Library (VTL). However, design tools that support
this language are generally not adequate to fully exploit its features and capabilities. Current
design methodologies that use this language severely restrict its MoC in order to reduce
the DSE effort. The objective of this thesis is to provide a DSE methodology in which all the
features of CAL, and of dynamic dataflow MoCs more generally, are fully exploited.
1.3 System development design flow
In this thesis the system development design flow and methodology illustrated in Figure 1.2
is used. The functional behavior of the program is kept separate from the architecture model.
Program behavior is expressed using the CAL dataflow language, which is based on dataflow
processing network principles. The architecture, together with its constraints, is modeled
with a high-level abstraction used to describe the platform where the design is implemented.
The architecture model is based on the notions of processing elements, media and links. A
processing element defines the kind of platform, a medium defines the way in which this
platform communicates, and a link defines a connection between processing elements and media.
Constraints are applied to both the architecture and the program and are used to define, for
example, the maximal clock frequency of each operator. The six stages of this
design flow are:
• Compiler infrastructure: transforms the source code of a CAL program into an equivalent
intermediate representation. The compiler should make it possible to verify
the behavioral correctness of the program directly from the intermediate representation,
without requiring any partial implementation or prototyping.
• Profiling and analysis: the design alternatives of the application are explored such that
constraints and performance requirements can be satisfied. The design can also be
statically or dynamically analyzed in order to evaluate its computational and commu-
nication costs. In the case where a design point satisfies the requirements, this is then
used to drive the compiler infrastructure through a set of compiler directives. Otherwise,
in the case where requirements cannot be satisfied, refactoring directions should be
provided to the designer. These highlight which part of the design requires modification
to allow requirement satisfaction.
• Performance estimation: the performance of a given design point is evaluated without
requiring any partial implementation of the program: only the high-level models of
both the program and the architecture are used. Results of the estimation are analyzed in
order to narrow down the set of design points that can satisfy the requirements.
• Code generation: the CAL program is transformed to a low-level code representation.
Software and/or hardware code is generated according to the mapping of the program
to the target architecture.
• Synthesis or compilation: the software or hardware code is compiled or synthesized,
respectively. Standard tools are used in order to obtain the software executables and the
hardware binary files and netlists of the implementation.
• Implementation: when both the performance requirements and the constraints are satisfied,
the design is implemented on the hardware and/or software architecture. If the implementation
contains both hardware and software parts, then interfaces provided by the architecture
should be automatically integrated into the design.
[Figure 1.2 (flow diagram). Inputs: CAL program; Architecture; Constraints. Stages: Compiler Infrastructure; Profiling and Analysis; Performance Estimation; Code Generation; Synthesis or Compilation; Implementation. Feedback edges: Refactoring Directions; Compiler Directives.]
Figure 1.2: Heterogeneous system development design flow for CAL dataflow programs.
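The architecture model introduced at the beginning of this section (processing elements, media and links, plus constraints) can be pictured with the following Python sketch. It is only an illustration under assumptions: the class and field names are hypothetical and do not reproduce the objects of the TURNUS framework, and the element kinds and clock values used in the usage note are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ProcessingElement:
    name: str
    kind: str             # kind of platform, e.g. "cpu" or "fpga" (illustrative)
    max_clock_mhz: float  # constraint: maximal clock frequency


@dataclass(frozen=True)
class Medium:
    name: str
    kind: str             # how elements communicate, e.g. "bus" (illustrative)


@dataclass(frozen=True)
class Link:
    element: ProcessingElement
    medium: Medium


@dataclass
class Architecture:
    links: list = field(default_factory=list)

    def connect(self, element, medium):
        """Add a link between a processing element and a medium."""
        self.links.append(Link(element, medium))

    def can_communicate(self, a, b):
        """True if a and b are linked to at least one common medium."""
        media_a = {link.medium for link in self.links if link.element == a}
        media_b = {link.medium for link in self.links if link.element == b}
        return bool(media_a & media_b)
```

Connecting a hypothetical `ProcessingElement("cpu0", "cpu", 667.0)` and `ProcessingElement("pl", "fpga", 100.0)` to the same `Medium("axi", "bus")` makes `can_communicate` return `True` for that pair, while an element without any link to that medium is reported as unreachable. A mapping step could use such a check to reject partitions whose buffers cross elements that share no medium.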
1.4 Research contributions of this thesis
This dissertation focuses on the profiling and analysis stage and the performance estimation
stage of the CAL program development design flow illustrated in Section 1.3. Together, these
stages constitute the DSE of a CAL program. In this context, the main contributions of this thesis are:
(i) Execution Trace Graph [1, 2, 3, 4]: a graph-based representation of the program
execution is formalized. This mathematical formalism can be used to model the execution
of static, cyclo-static and dynamic dataflow programs. A collection of analyses and
transformations is illustrated. As an example, it is demonstrated how it is possible to
estimate the design performance by scheduling the execution trace graph post-mortem, or
how this representation can be transformed to an event-driven system where advanced
control techniques can be used to explore the design points of the application.
(ii) Profiling of dynamic dataflow programs [1, 2]: a systematic methodology for
profiling generic dynamic dataflow programs is formalized. This is based on code
interpretation, which does not require any partial implementation of the program. It
is demonstrated how this methodology can be effectively used to extract the execution
trace graph through a serial code interpretation.
(iii) Design space exploration methodology: a collection of heuristic methods, based on the
analysis of the execution trace graph, is formalized for exploring the different design
points of a generic dynamic dataflow program. In particular, the following problems
have been addressed and solved:
• Design space critical path formalization [1, 5]: the concept of design space critical
path is formalized and used to bound the design points of an application.
Furthermore, with this notion, the concept of potential speedup, as defined by the
well-known Amdahl’s law, has also been revisited and adapted to the domain of
dataflow programs.
• Hotspots analysis [1, 5, 6, 7, 8, 9]: a methodology for highlighting the bottlenecks
of the program and providing clear code refactoring directions is formalized.
• Buffer size configuration dimensioning [1, 10, 11, 12]: a systematic methodology
for bounding and optimizing the buffer size configuration of
complex dynamic dataflow programs has been formalized.
• Dynamic power dissipation minimization [13, 14, 8, 15]: a systematic methodology
for reducing the dynamic power dissipation of complex dynamic dataflow
programs, implemented on multi-clock domain architectures, has been formalized.
(iv) Design space exploration environment [16, 17, 18, 19, 20]: a computer-aided framework
for exploring the design space of dynamic dataflow programs has been implemented. This
framework, called TURNUS, has been released as an open-source project that has already
been integrated with other open-source CAL HW synthesis and SW code generation tools
(Xronos and Orcc, respectively). Its integration with these tools provides
a complete system design environment for CAL applications. The main
functionalities and structure of TURNUS are discussed in this dissertation and used to prove the
effectiveness of the illustrated design methodology and its heuristic algorithms.
1.5 Thesis organization
This dissertation is organized as follows:
• Chapter 2 provides an overview of the main concepts of dataflow programming. The
taxonomy of dataflow programs and the different models of computation
that can be defined are presented, together with a general discussion of static, cyclo-static and
dynamic dataflow programs. Furthermore, an introduction to the CAL
actor language is given with a collection of examples.
• Chapter 3 summarizes the possible profiling options of a dataflow program. Two main
profiling axes are illustrated: computational load and memory utilization. Furthermore,
a discussion concerning static and dynamic analysis of the code is presented. Particular focus is
given to profiling the CAL actor language and to how well-known profiling metrics, such
as cyclomatic complexity and the Halstead metrics, are used in the context of this
language.
• Chapter 4 defines the design space of an application. It summarizes the concept of
orthogonalization of concerns. In this direction, it is illustrated how the application
can be modeled at a high level of abstraction by defining the model of computation
and the model of architecture. Furthermore, the concept of mapping configuration is
defined together with the definition of design points. Different design space exploration
strategies are presented. The gaps in the exploration and optimization of CAL
applications that this thesis aims to fill are also
discussed.
• Chapter 5 defines the concept of execution trace graph of a dataflow program. The
main properties of this graph-based representation are illustrated and used to provide a
formal definition of the design space of an application. Furthermore, it is demonstrated how a dynamic
dataflow program can be handled by means of guard enable and
disable dependencies.
• Chapter 6 illustrates the main functionalities and the design flow of the TURNUS
dataflow exploration framework. This represents the tool implementation of the heuristics
proposed throughout this thesis. The main software engineering structure, together
with the formal design methodology, is presented.
• Chapter 7 focuses on the TURNUS dataflow profiling functionalities. The main advancements
and improvements over state-of-the-art tools are presented.
• Chapter 8 focuses on the TURNUS design exploration and optimization functionalities.
Design performance is estimated by means of a post-mortem execution trace scheduler.
It is illustrated how the estimation results are used to guide the optimization
heuristics during the exploration phases. Furthermore, the concept of design space
critical path is defined and used as the primary metric of the optimization heuristics
illustrated in this chapter. These are: hotspot analysis, buffer size
dimensioning (i.e. minimization and optimization) and partitioning (with
a particular focus on the problem of minimizing the dynamic power dissipation on
reconfigurable platforms).
• Chapter 9 presents a collection of experimental results for video codec applications.
More precisely, results obtained during the different stages of the design space exploration
of video and image decoders (such as MPEG4-SP, HEVC and JPEG) are
presented and discussed.
• Chapter 10 concludes the dissertation, highlighting possible future work and illustrating
the open problems that this thesis has not yet solved.
Furthermore, some additional in-depth material is available in the appendixes of this disserta-
tion. This additional material represents a quick reference guide for the reader. Appendixes
are structured as follows:
• Appendix A illustrates the main concepts of discrete event systems and simulation. The
formal definition of a Petri net is introduced. This is used in Section 5.5.3, where the
transformation of the execution trace graph into an event-driven system is formalized. Furthermore,
the formalism behind the concept of discrete event system specification and simulation
is presented.
• Appendix B illustrates the main concepts and functionalities of model predictive
control. This receding horizon control technique is used in Section 8.4.3, where the
problem of buffer size dimensioning is solved.
2 Dataflow programming
Stream processing is a term widely used in the literature to describe a variety of systems.
Streaming applications are programs that process continuous data streams. These applications
have become ubiquitous due to increased automation in signal and video processing,
telecommunications, health care, transportation, retail, science, security, emergency response
and finance. As a result, various research communities have independently developed
programming models for streaming applications. While there are differences both at the language
level and at the system level, each of these communities ultimately represents streaming
applications as a graph of streams and operators, generally called a dataflow program. This chapter
provides an overview of dataflow programming. Starting from the definition of a dataflow
program, different models of computation are illustrated. Subsequently, an overview of the
Cal Actor Language is presented, as this language is used as the reference dataflow programming
language in the remaining chapters of this dissertation.
2.1 Dataflow programs
In the context of this dissertation, a dataflow program is defined as a directed graph whose
vertices are operators, called actors, and whose edges are streams. In general, stream graphs
might be cyclic, though some systems only support acyclic graphs. Dataflow programs
implement streams as FIFO (first-in, first-out) queues, called buffers, sometimes with limited
capacity, sometimes not. Conceptually, streams are infinite sequences of atomic data items,
called tokens, and each actor consumes data items from incoming streams and produces
data items on outgoing streams. The token is the atomic unit of communication in a dataflow
program. One of the main properties of dataflow programs is that they have data-driven
semantics: it is the availability of tokens that enables an actor. One of the principal strengths
of dataflow programs is that they do not over-specify an algorithm by imposing unnecessary
sequencing constraints between actors. Instead, they only specify a partial order, where
sequencing constraints are imposed only by data dependencies and, since actors can run
concurrently, dataflow programs inherently expose the application parallelism [21, 22].
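The data-driven semantics described above can be illustrated with a minimal Python sketch. The two actors (`scale` and `add_pairs`), their rates and the buffer names are hypothetical and chosen only for illustration; buffers are modeled as unbounded FIFO queues, and an actor fires only when enough tokens are available on its input.

```python
from collections import deque

def run_dataflow(source_tokens):
    a_to_b = deque(source_tokens)   # FIFO buffer feeding the actor "scale"
    b_to_c = deque()                # FIFO buffer between "scale" and "add_pairs"
    result = []

    def scale():                    # consumes 1 token, produces 1 token
        if a_to_b:
            b_to_c.append(10 * a_to_b.popleft())
            return True
        return False

    def add_pairs():                # consumes 2 tokens, produces 1 token
        if len(b_to_c) >= 2:
            result.append(b_to_c.popleft() + b_to_c.popleft())
            return True
        return False

    # Data-driven execution: fire any actor that has enough input tokens,
    # and stop when no actor is enabled any more.
    while any(actor() for actor in (add_pairs, scale)):
        pass
    return result
```

No firing order is imposed beyond token availability: swapping the order in which the two actors are polled changes the interleaving of firings but, for this network, not the produced stream.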
In the remainder of this section, an overview of the different dataflow models of computation
(MoCs) used within this dissertation is presented. These are: Kahn Process Networks
[23], which are the underpinning representation for dataflow graphs; Dataflow
Process Networks [24], which are closely related to Kahn Process Networks; and Actor
Transition Systems [25], which extend Dataflow Process Networks with the notions of atomic step,
priority and actor internal variables.
2.1.1 Kahn process networks
A Kahn process network (KPN) [23] is a network of processes that can communicate only
through unidirectional and unbounded buffers. Each buffer carries a possibly infinite
sequence of tokens. Using the notation formalized in [24], each sequence of tokens is denoted
by $X = [x_1, x_2, x_3, \ldots]$, where each $x_i$ represents a token drawn from some set. A token is
considered an atomic data object that is written (produced) exactly once and read (consumed)
exactly once. Writes to the buffers are non-blocking, in the sense that they always succeed
immediately. Reads from buffers are blocking, in the sense that if a process attempts to read a
token from a buffer and no data is available, then it stalls (waits) until the buffer has sufficient
tokens to satisfy the read. Consequently, it is not possible to test for the presence of input tokens.
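As an illustration of these read and write semantics, the following Python sketch runs a two-process network on threads. It is only a sketch under stated assumptions: the process names, the `END` marker used to terminate a finite stream, and the doubling function are hypothetical and not part of the KPN definition. The standard `queue.Queue` with its default size is unbounded, so `put` never blocks, while `get` blocks until a token is available.

```python
import queue
import threading

END = object()  # hypothetical end-of-stream marker (KPN streams may be infinite)

def producer(out_q, values):
    for v in values:
        out_q.put(v)          # unbounded buffer: the write always succeeds
    out_q.put(END)

def doubler(in_q, out_q):
    while True:
        tok = in_q.get()      # blocking read: stalls until a token is available
        if tok is END:
            out_q.put(END)
            return
        out_q.put(2 * tok)

def run_network(values):
    a_to_b = queue.Queue()    # maxsize=0 (default) means unbounded
    b_to_out = queue.Queue()
    threads = [
        threading.Thread(target=producer, args=(a_to_b, values)),
        threading.Thread(target=doubler, args=(a_to_b, b_to_out)),
    ]
    for t in threads:
        t.start()
    result = []
    while True:
        tok = b_to_out.get()
        if tok is END:
            break
        result.append(tok)
    for t in threads:
        t.join()
    return result
```

Because the processes never test for the presence of tokens, the output stream does not depend on how the threads are interleaved.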
Kahn process
Let $S^p$ denote the set of $p$-tuples of sequences, as in $X = \{X_1, X_2, \ldots, X_p\} \in S^p$. A Kahn process
is then defined as a mapping from a set of input sequences to a set of output sequences:
$F : S^p \to S^q$ (2.1)
The KPN process $F$ has event semantics, rather than state semantics as in some other domains
such as continuous time. Moreover, the only technical restriction is that $F$ must be a
continuous mapping.
Monotonicity and continuity
Considering a prefix ordering of sequences, the sequence $X$ precedes the sequence $Y$ (written
$X \sqsubseteq Y$) if $X$ is a prefix of (or is equal to) $Y$. For example, if $X = [x_1, x_2]$ and $Y = [x_1, x_2, x_3]$ then
$X \sqsubseteq Y$, and it is common to say that $X$ approximates $Y$, since it provides partial information
about $Y$. The empty sequence, denoted by $\bot$, is a prefix of any other sequence.
An increasing chain (possibly infinite) of sequences is defined as $\chi = \{X_0, X_1, \ldots\}$ where
$X_0 \sqsubseteq X_1 \sqsubseteq \ldots$. Such an increasing chain of sequences has one or more upper bounds $Y$, where
$X_i \sqsubseteq Y$ for all $X_i \in \chi$. The least upper bound (LUB) $\sqcup\chi$ is an upper bound such that for any
other upper bound $Y$, $\sqcup\chi \sqsubseteq Y$. The LUB may be an infinite sequence.
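For finite sequences, the prefix order and the least upper bound of an increasing chain can be sketched in a few lines of Python. The function names are hypothetical, and the sketch deliberately covers only finite chains of finite sequences, for which the LUB is simply the longest element.

```python
def is_prefix(x, y):
    """Return True if x is a prefix of (or equal to) y."""
    return len(x) <= len(y) and list(y[:len(x)]) == list(x)

def is_increasing_chain(chain):
    """Check that each sequence is a prefix of the next one."""
    return all(is_prefix(a, b) for a, b in zip(chain, chain[1:]))

def lub(chain):
    """Least upper bound of a finite increasing chain: its longest element."""
    if not is_increasing_chain(chain):
        raise ValueError("not an increasing chain")
    return max(chain, key=len) if chain else []
```

For example, `lub([[], [1], [1, 2]])` returns `[1, 2]`, and the empty list is a prefix of every sequence.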
Given a functional process $F$, as defined in Equation (2.1), and an increasing chain of sets of
sequences $\chi$, $F$ maps $\chi$ into another set of sequences that may or may not be an increasing chain.
Let $\sqcup\chi$ denote the LUB of the increasing chain $\chi$. Then $F$ is said to be Scott-continuous [26] if
for all such chains $\chi$, $\sqcup F(\chi)$ exists and:
$F(\sqcup\chi) = \sqcup F(\chi)$ (2.2)
Networks of Scott-continuous processes have a more intuitive property called monotonicity.
A process $F$ is said to be monotonic if:
$X \sqsubseteq Y \Rightarrow F(X) \sqsubseteq F(Y)$ (2.3)
Remark. Monotonicity can be thought of as a form of causality that does not invoke time, in
that "future input concerns only future output".
A continuous process is monotonic. However, a monotonic process may not be continuous.
A key consequence of this property is that a process can be computed iteratively [27]. This
means that given a prefix of the final input sequences, it may be possible to compute part
of the output sequences. In other words, a monotonic process is non-strict (its inputs need
not be complete before it can begin computation). In addition, a continuous process will not
wait forever before producing an output (it will not wait for completion of an infinite input
sequence). Networks of monotonic processes are determinate.
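The following Python sketch checks Equation (2.3) empirically along the chain of prefixes of one finite input. It is a test on a single input, not a proof of monotonicity, and both example processes (`running_sum` and `count_so_far`) are hypothetical. The first can extend its output as more input arrives; the second would have to retract an output already produced, so it is not monotonic.

```python
from itertools import accumulate

def is_prefix(x, y):
    return len(x) <= len(y) and y[:len(x)] == x

def running_sum(xs):
    """Monotonic: more input only appends to the output."""
    return list(accumulate(xs))

def count_so_far(xs):
    """Not monotonic: the single output token changes as input grows."""
    return [len(xs)]

def is_monotonic_on(process, full_input):
    """Check that F(X) is a prefix of F(Y) for consecutive prefixes X, Y of the input."""
    prefixes = [full_input[:k] for k in range(len(full_input) + 1)]
    return all(is_prefix(process(a), process(b))
               for a, b in zip(prefixes, prefixes[1:]))
```

Since the prefix order is transitive, checking consecutive prefixes is sufficient for the whole chain.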
2.1.2 Dataflow process networks
Dataflow process networks (DPN) [24] formally establish a special case of KPNs, where the
computational blocks are called actors. As for KPN processes, actors can communicate only
through unidirectional and unbounded buffers, which can carry possibly infinite sequences of
tokens. As for KPNs, writes to buffers are non-blocking. Unlike in KPNs, however, reads from buffers
are also non-blocking, in the sense that an actor can test for the presence of input tokens. If there are
not enough input tokens, then the read returns immediately and the actor does not need to be
suspended when it cannot read. This could introduce non-determinism, without requiring
the actor to be non-determinate.
Actor with firings
DPN networks are a special case of KPN where each process consists of repeated firings of an
actor [28]. An actor firing can be defined as an indivisible (atomic) quantum of computation.
The firings themselves can be described as functions, and the invocation of these firings is
controlled by some firing rules. Sequences of firings define a continuous Kahn process as
the least-fixed-point of an appropriately constructed functional mapping, therefore formally
establishing DPN as a special case of KPN [29].
An actor with $m$ inputs and $n$ outputs is defined as a pair $(f, R)$, where:
• $f : S^m \to S^n$ is a function called the firing function.
• $R \subseteq S^m$ is a set of finite sequences called the firing rules.
• $f(r_i)$ is finite for all $r_i \in R$.
• no two distinct $r_i, r_j \in R$ are joinable, in the sense that they do not have an LUB.
The Kahn process $F$ defined in Equation (2.1), based on the actor $(f, R)$, has to be interpreted
as the least fixed point of the functional $\phi : (S^m \to S^n) \to (S^m \to S^n)$ defined as:
$(\phi(F))(s) = \begin{cases} f(r) \oplus F(s') & \text{if there exists } r \in R \text{ such that } s = r \oplus s' \text{ and } r \sqsubseteq s \\ \bot & \text{otherwise} \end{cases}$ (2.4)
where $\oplus$ represents the concatenation operator and $(S^m \to S^n)$ the set of functions mapping
$S^m$ to $S^n$. It is possible to demonstrate that $\phi$ is both a continuous and monotonic function.
The firing function $f$ need not be continuous. In fact, it does not even need to be monotonic.
It merely needs to be a function, and its value must be finite for each of the firing rules [29].
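The construction of Equation (2.4) can be sketched for the single-input, single-output case ($m = n = 1$) and finite input sequences. In this hypothetical example the firing rules are concrete token sequences: a control token 0 is discarded, while a control token 1 forwards the token that follows it. The helper `make_process` and the rule set are assumptions made for illustration.

```python
def make_process(fire, rules):
    """Build the process F from a firing function and firing rules (m = n = 1)."""
    if any(len(r) == 0 for r in rules):
        raise ValueError("this sketch requires non-empty firing rules")

    def F(s):
        s = list(s)
        out = []
        while True:
            # Look for a firing rule r that is a prefix of the remaining input.
            match = next((r for r in rules
                          if len(r) <= len(s) and tuple(s[:len(r)]) == tuple(r)),
                         None)
            if match is None:
                return out               # no rule applies: nothing more is produced
            out.extend(fire(match))      # f(r) concatenated with F(s')
            s = s[len(match):]           # s' is the input left after consuming r
    return F

# Hypothetical actor: no two rules share a common upper bound.
RULES = [(0,), (1, 0), (1, 1)]

def fire(r):
    return [] if r == (0,) else [r[1]]

forward_if_one = make_process(fire, RULES)
```

On the input `[1, 0, 0, 1, 1, 1]` the process fires three times and then stops, because the last token alone matches no rule; the loop is the iterative form of the recursion in Equation (2.4).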
2.1.3 Actor transition systems
Actor transition systems (ATS) [25] describe actors in terms of labeled transition systems (LTS).
The ATS extends the notion of actor with firings by introducing the notions of atomic step,
internal state, and priority. In an ATS, a step makes a transition from one state to another.
An actor maintains and updates its internal variables: these are not sequences of tokens, but
simple internal values that cannot be shared among actors. Moreover, the notion of priority
allows actors to ascertain and react to the absence of tokens. This notion can make actors
harder to analyze, and it may introduce unwanted non-determinism into a dataflow
application.
Remark. The state of an actor depends upon the value (state) of its internal variables, and not
just on the sequence of tokens it has received.
Let $\Sigma$ denote the non-empty actor state space, $u$ the universe of tokens that can be exchanged
between actors and $U^n$ a finite and partially-ordered sequence of $n$ tokens over $u$. An $n$-to-$m$
actor is an LTS $(\sigma_0, \tau, \succ)$ where:
• $\sigma_0 \in \Sigma$ is the actor initial state.
• $\tau \subset \Sigma \times U^n \times U^m \times \Sigma$ defines the transition relation.
• $\succ \subset \tau \times \tau$ defines a strict partial order over $\tau$.
Any $(\sigma, s, s', \sigma') \in \tau$ is called a transition, where $\sigma \in \Sigma$ is its source state, $s \in U^n$ its input tuple,
$\sigma' \in \Sigma$ its destination state and $s' \in U^m$ its output tuple. Note that $\succ$ is an
irreflexive, anti-symmetric and transitive relation on $\tau$ (that is, a strict partial order), also called its priority
relation. An equivalent and more compact notation for the transition $(\sigma, s, s', \sigma')$ is
$\sigma \xrightarrow{s \to s'} \sigma'$.
As for any LTS, in an ATS each transition can be labeled and referred to as an action $\lambda$:
$\lambda : \sigma \xrightarrow{s \to s'} \sigma'$ (2.5)
In summary, a step makes a transition from one state to another, each transition can be labeled
as an action, and the execution of a step is called a firing, in which tokens may be consumed
and produced, and the internal variables may be updated.
Enabled transition and step of an actor
Intuitively, the priority relation determines that a transition cannot occur if some other
transition is possible. This can be seen as the definition of a valid step of an actor, which is a
transition such that two conditions are satisfied:
• The required input tokens must be present.
• There must not be another transition that has priority.
Given an $n$-to-$m$ actor $(\sigma_0, \tau, \succ)$, a state $\sigma \in \Sigma$ and an input tuple $v \in U^n$, a transition
$\sigma \xrightarrow{s \to s'} \sigma'$ is enabled if and only if:
$s \sqsubseteq v \;\wedge\; \not\exists\, \sigma \xrightarrow{r \to r'} \sigma'' \in \tau : r \sqsubseteq v \,\wedge\, \sigma \xrightarrow{s \to s'} \sigma' \succ \sigma \xrightarrow{r \to r'} \sigma''$ (2.6)
Hence, a step from state $\sigma$ with input $v$ is defined as any enabled transition
$\sigma \xrightarrow{s \to s'} \sigma'$.
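The two conditions for an enabled transition can be sketched as follows for an actor with a single input port. The `Transition` record, the transition names and the encoding of the priority relation as a set of `(winner, loser)` name pairs are assumptions introduced only for this illustration.

```python
from collections import namedtuple

# Hypothetical record: source state, required input, produced output, destination state.
Transition = namedtuple("Transition", "name src inp out dst")

def is_prefix(x, y):
    return len(x) <= len(y) and tuple(y[:len(x)]) == tuple(x)

def enabled(t, state, available, transitions, higher_priority):
    """Return True if t is a valid step from `state` given the available input.

    higher_priority is a set of (winner_name, loser_name) pairs (assumed encoding).
    """
    # Condition 1: the required input tokens must be present.
    if t.src != state or not is_prefix(t.inp, available):
        return False
    # Condition 2: no other transition with priority may be possible.
    for other in transitions:
        if (other is not t and other.src == state
                and is_prefix(other.inp, available)
                and (other.name, t.name) in higher_priority):
            return False
    return True

t_two = Transition("two", "s0", (1, 2), (3,), "s0")
t_one = Transition("one", "s0", (1,), (1,), "s0")
TRANSITIONS = [t_two, t_one]
PRIORITY = {("two", "one")}   # "two" has priority over "one"
```

With the input `(1, 2)` only `two` is enabled, because it has priority over `one`; with the input `(1,)` the actor reacts to the absence of the second token and `one` becomes enabled.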
Actors composition
For any transition relation $\tau$, its set of input ports $P^{in}_\tau$ and its set of output ports $P^{out}_\tau$ are
defined as the ports from which at least one transition consumes input, or to which at least one
transition produces output:
$P^{in}_\tau = \{p \in P \mid \exists\, \sigma \xrightarrow{s \to s'} \sigma' \in \tau : s(p) \neq \bot\}$
$P^{out}_\tau = \{p \in P \mid \exists\, \sigma \xrightarrow{s \to s'} \sigma' \in \tau : s'(p) \neq \bot\}$ (2.7)
where $P$ is the set of input and output port names. It is assumed that an input port with name
$p$ and an output port of the same name are in no way related. In order to express complex
functionality, actors are composed into a dataflow network. As an example, Figure 2.6 depicts
a dataflow network composed of five actors interconnected by five buffers. The structure
of a network can be represented by a partial function from (input) ports to (output) ports,
mapping each input port in its domain to the output port that connects to it. Note
that this representation implies the absence of fan-in (as every input port is connected to at
most one output port), and it permits unconnected (open) input (and output) ports.
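A partial function from input ports to output ports maps naturally onto a dictionary. The sketch below uses hypothetical actor and port names; since dictionary keys are unique, fan-in cannot be expressed, whereas fan-out and open ports remain possible.

```python
# Keys are input ports, values are the output ports feeding them (hypothetical names).
connections = {
    ("B", "in"): ("A", "out"),
    ("C", "in"): ("A", "out"),    # fan-out: A.out feeds two input ports
    ("D", "in0"): ("B", "out"),
    ("D", "in1"): ("C", "out"),
}

def open_inputs(all_input_ports, connections):
    """Input ports outside the domain of the partial function are unconnected."""
    return sorted(p for p in all_input_ports if p not in connections)

def fan_out(connections):
    """Number of input ports fed by each output port."""
    counts = {}
    for out_port in connections.values():
        counts[out_port] = counts.get(out_port, 0) + 1
    return counts
```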
2.2 Dataflow paradigm
The emergence of massively parallel architectures, along with the difficulty of programming
them, makes the dataflow paradigm an appealing alternative to the imperative
paradigm [22, 30, 31, 32, 33, 34, 35]. The main advantages of this paradigm are related to
the ability to express concurrency without complex synchronization mechanisms. This
is made possible by the internal representation of the program as a network of processing
blocks that only communicate through communication channels. As a matter of fact, blocks
are independent and do not produce any side effects. This removes the potential concurrency
issues that could arise when the programmer is asked to manually manage the synchronization
between parallel computations [36, 37]. Moreover, this paradigm explicitly exposes all the
natural parallelism of a program [36, 22].
2.2.1 Modular programming
The decomposition of the program into processing blocks improves its maintainability by
enforcing the encapsulation of the components. Such a decomposition naturally makes the
program description modular. The main capabilities of a modular description are:
• Reusability: a single processing block can be used multiple times in the same dataflow
network.
• Reconfigurability: a processing block can be easily replaced by another one when their
input and output ports are (strictly) identical.
• Hierarchical representation: a processing block of the dataflow network may represent
another dataflow network.
In the rest of this dissertation, a processing block that represents a hierarchical composition of
processing blocks is referred to as a sub-network, otherwise it is simply referred to as an actor.
2.2.2 Parallelism flavors
It is worth summarizing the terminology for the various flavors of parallelism
among the actors of a dataflow program. These are:
• Pipeline parallelism is inherent to a streaming execution model in the case of a chain of
actors. Pipelining does not speed up a single calculation, but increases the throughput
of a sequence of calculations. As an example, Figure 2.1 depicts the concurrent
execution of a producer actor A with a consumer actor B. This parallelism flavor can
be considered as a mixture of task and data parallelism. Pipelining represents the
separation of a computation into several stages that can be executed in parallel.
Figure 2.1: Pipeline parallelism. (a) Dataflow network (actor A feeding actor B). (b) Parallel execution.
• Task parallelism refers to the parallelism between independent parts of an application.
In a dataflow context, it appears when two or more actors do not have any dependency
constraints. As an example, Figure 2.2 depicts the concurrent execution of two different
actors, B and C, that do not constitute a pipeline.
Figure 2.2: Task parallelism. (a) Dataflow network (actors A, B, C and D). (b) Parallel execution.
• Data parallelism refers to a unique computation performed on different sets of data. It
can be applied by duplicating an actor when it processes several sets of data successively
with no dependencies between them. Data parallelism is also sometimes characterized
as SPMD (single program, multiple data). As an example, Figure 2.3 depicts the con-
current execution of multiple replicas of the same actor B on different portions of the
same data.
Figure 2.3: Data parallelism. (a) Dataflow network (actor A, two replicas of actor B, actor C). (b) Parallel execution.
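Data parallelism, as in Figure 2.3, can be sketched by replicating a stateless actor over independent portions of the data. The actor body (`actor_b`), the chunking policy and the number of replicas are hypothetical; the standard `ThreadPoolExecutor.map` is used because it returns results in submission order, so the output stream keeps the order of the input.

```python
from concurrent.futures import ThreadPoolExecutor

def actor_b(chunk):
    """Hypothetical stateless actor: squares every token of its chunk."""
    return [x * x for x in chunk]

def run_data_parallel(data, replicas=2):
    # Split the data into at most `replicas` contiguous chunks.
    size = max(1, -(-len(data) // replicas))     # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=replicas) as pool:
        parts = list(pool.map(actor_b, chunks))  # one replica of B per chunk
    # Merge the partial results back into a single ordered stream.
    return [y for part in parts for y in part]
```

In CPython, threads mainly illustrate the structure of the replication; a process pool would be needed to obtain an actual speedup on CPU-bound actors.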
2.3 Dataflow classes
Since the representation of a dataflow program does not over-constrain the order of operations,
a scheduler of the program has the freedom it needs to adequately exploit the different
kinds of parallelism, to maximize re-use, or simply to reduce the usage of the limited hardware
resources available on the implementation platform. Figure 2.4 illustrates the three main dataflow MoC
classes. The actor behavior that can be represented by each of them is discussed in
this section.
Figure 2.4: Dataflow MoC classes (DDF, CSDF, SDF).
2.3.1 Static dataflow programs
Static dataflow (SDF), sometimes also referred to as synchronous dataflow, is a special class of
dataflow MoC where the number of tokens consumed and produced by each actor is fixed
and known at compile time. Repeated firings of the same actor exhibit the same behavior.
This is the least expressive class of dataflow programs, but it is also the one that can be most
easily analyzed. In fact, its main property is its complete compile-time predictability with respect
to scheduling, memory consumption, and execution termination.
Static scheduling
In order to build a static schedule, the compiler should construct a single cycle of a periodic
schedule. The first step is to evaluate how many invocations of each actor should be
included in each cycle. This can be easily obtained from the number of tokens produced and
consumed at each actor firing. As depicted in Figure 2.5, the number of tokens
consumed at each firing by the $i$-th actor from the $n$-th buffer is denoted by $c_{i,n} \in \mathbb{N}$, the
number of tokens produced at each firing by the $i$-th actor on the $n$-th buffer is denoted by
$p_{i,n} \in \mathbb{N}$, and the number of times the $i$-th actor is invoked (i.e. repeated) in each cycle of the
iterated schedule is denoted by $r_i \in \mathbb{N}$. Hence, in order to have a feasible periodic schedule, it
must be ensured that for each $n$-th buffer of the dataflow graph, connecting a producer $a_i$ to a
consumer $a_j$, the following condition is satisfied:
$p_{i,n} r_i = c_{j,n} r_j$ (2.8)
In other words, these equations ensure that in each cycle of the iterated schedule the number
of tokens produced on each buffer is equal to the number of tokens consumed from that buffer.
Indeed, the first step in finding a schedule for an SDF graph is to solve the set of equations of
the form (2.8) for the unknowns $r_i$.
Figure 2.5: A dataflow graph with two actors, $a_i$ and $a_j$, connected through the buffer $b_n$. $p_{i,n}$
defines the number of tokens produced on $b_n$ during each firing of $a_i$. $c_{j,n}$ defines the number
of tokens consumed from $b_n$ during each firing of $a_j$.
Since for SDF programs the number of consumed and produced tokens for each actor firing is
fixed and known at compile time, the set of equations can be concisely written by constructing
a topological matrix $\Gamma$, with one row per buffer and one column per actor. The entry $[\Gamma]_{n,i}$
contains the integer $p_{i,n}$ when the $i$-th actor produces $p_{i,n}$ tokens on the $n$-th buffer, the integer
$-c_{i,n}$ when the $i$-th actor consumes $c_{i,n}$ tokens from the $n$-th buffer, and zero otherwise.
In general, this matrix does not need to be square.
For example, the dataflow graph shown in Figure 2.6 has the following topological matrix:
$\Gamma = \begin{bmatrix}
p_{A,1} & -c_{B,1} & 0 & 0 & 0 \\
p_{A,4} & 0 & 0 & -c_{D,4} & 0 \\
0 & p_{B,2} & -c_{C,2} & 0 & 0 \\
0 & 0 & p_{C,3} & 0 & -c_{E,3} \\
0 & 0 & 0 & p_{D,5} & -c_{E,5}
\end{bmatrix}$ (2.9)
The system of equations to be solved can be formulated as:
$\Gamma \vec{r} = \vec{0}$ (2.10)
where $\vec{r}$ is the repetition vector containing the $r_i$ value for each $i$-th actor, and $\vec{0}$ is a
zero-vector. Equation (2.10) is usually referred to as the balance equation of the dataflow
program.
Remark. If an actor has a connection to itself (i.e. a self-loop) then only one entry in Γ describes
this buffer. This entry gives the net difference between the amount of tokens produced on this
buffer and the amount of tokens consumed from this buffer each time the actor is invoked.
This difference needs to be zero for a correctly constructed graph. Hence, the entry describing a
self-loop should be zero [38].
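For a connected graph, the balance equations can be solved by propagating rational firing ratios along the buffers and then scaling to the smallest integers. The sketch below is one possible way of doing this, not the procedure of [38]; the buffer encoding `(producer, consumer, p, c)` and the numeric rates assigned to the topology of Figure 2.6 are hypothetical, since the figure only gives symbolic rates.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

def repetition_vector(actors, buffers):
    """buffers: list of (producer, consumer, p, c). Returns {actor: r_i}, or None
    if the balance equations only admit the all-zero solution."""
    rate = {actors[0]: Fraction(1)}
    pending = [actors[0]]
    while pending:
        a = pending.pop()
        for src, dst, p, c in buffers:
            if src == a and dst not in rate:
                rate[dst] = rate[a] * p / c      # from p * r_src = c * r_dst
                pending.append(dst)
            elif dst == a and src not in rate:
                rate[src] = rate[a] * c / p
                pending.append(src)
    if len(rate) != len(actors):
        raise ValueError("graph is not connected")
    # Verify every balance equation, including cycles and self-loops.
    if any(rate[s] * p != rate[d] * c for s, d, p, c in buffers):
        return None
    scale = reduce(lambda x, y: x * y // gcd(x, y),
                   (r.denominator for r in rate.values()), 1)
    ints = {a: int(r * scale) for a, r in rate.items()}
    g = reduce(gcd, ints.values())
    return {a: v // g for a, v in ints.items()}

# Topology of Figure 2.6 with hypothetical numeric rates.
ACTORS = ["A", "B", "C", "D", "E"]
BUFFERS = [("A", "B", 2, 1), ("B", "C", 1, 2), ("C", "E", 1, 1),
           ("A", "D", 1, 1), ("D", "E", 1, 1)]
```

With these rates, B must fire twice per cycle and every other actor once. A self-loop with different production and consumption rates makes the function return `None`, in line with the remark above.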
Figure 2.6: Dataflow graph example (actors A, B, C, D and E; buffers $b_1$ to $b_5$).
Existence of an admissible schedule
An admissible sequential schedule $\phi_s$ is defined as a non-empty ordered list of actors such
that if the actors are executed in the sequence given by $\phi_s$, then the number of tokens stored
in each buffer will remain non-negative and bounded. Each actor must appear in $\phi_s$ at least
once. A periodic admissible sequential schedule (PASS) is a periodic and infinite admissible
sequential schedule. In [38] it has been demonstrated that, for any connected SDF graph, a
necessary condition to be able to construct a PASS is that the rank of $\Gamma$ should be:
$\mathrm{rank}(\Gamma) = s - 1$ (2.11)
where $s$ is the number of actors in the graph. In other terms, the null space of $\Gamma$ should have
dimension one. It is shown in [38] that when the rank is correct, there always exists a repetition
vector $\vec{r}$ that contains only integers and lies in this null space. This vector defines how
many times each actor should be invoked in one period of a PASS. Conversely, an incorrect rank
of the topology matrix indicates a sample-rate inconsistency in the graph. SDF graphs that
have a topology matrix such that $\mathrm{rank}(\Gamma) = s$ are said to be defective: any schedule for such a
graph will result either in a deadlock or in an unbounded buffer size configuration.
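The rank condition of Equation (2.11) can be checked with exact rational Gaussian elimination. The two matrices below instantiate Equation (2.9) with hypothetical numeric rates: the first is consistent, while the second changes one consumption rate and becomes defective.

```python
from fractions import Fraction

def rank(matrix):
    """Rank of an integer matrix, computed by exact Gaussian elimination."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    rk = 0
    n_cols = len(rows[0]) if rows else 0
    for col in range(n_cols):
        pivot = next((i for i in range(rk, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for i in range(rk + 1, len(rows)):
            factor = rows[i][col] / rows[rk][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[rk])]
        rk += 1
    return rk

# Rows follow Equation (2.9); columns are the actors A, B, C, D, E (hypothetical rates).
GAMMA_OK = [[2, -1, 0, 0, 0],
            [1, 0, 0, -1, 0],
            [0, 1, -2, 0, 0],
            [0, 0, 1, 0, -1],
            [0, 0, 0, 1, -1]]
GAMMA_DEFECTIVE = GAMMA_OK[:-1] + [[0, 0, 0, 1, -2]]
```

For the first matrix the rank is $s - 1 = 4$, and the vector $(1, 2, 1, 1, 1)$ lies in its null space; for the second the rank is $s = 5$, so only the zero vector satisfies the balance equation.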
A PASS targets a single-processing-unit implementation, which
does not exploit the parallelism of a dataflow application. Clearly, if a workable
schedule for a single processing unit can be generated, then a workable schedule for a system
with multiple processing units can also be generated. The objective is then to find a periodic
admissible parallel schedule (PAPS), defined as a set of lists $\Psi = \{\psi_i, i = 1, \ldots, M\}$ where $M$ is
the number of processing units and $\psi_i$ specifies a periodic schedule for the $i$-th processing
unit. For single-processing-unit targets, reasonable scheduling objectives might include the
minimization of data or program memory requirements. For multi-processing-unit targets,
maximizing the throughput or minimizing the flow time are more likely objectives [38, 39, 40].
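Once a repetition vector is known, one period of a sequential schedule can be constructed by simulation: an actor is fired whenever it still has firings left in the period and enough tokens on its inputs. The sketch below is a simple greedy variant written for illustration, with a hypothetical buffer encoding `(producer, consumer, p, c)`; it returns `None` when no actor can fire before the period is complete, which indicates a deadlock for the given initial tokens.

```python
def build_pass(repetitions, buffers, initial_tokens=None):
    """repetitions: {actor: r_i}; buffers: list of (producer, consumer, p, c).
    Returns one period of a sequential schedule, or None on deadlock."""
    tokens = list(initial_tokens or [0] * len(buffers))
    remaining = dict(repetitions)
    schedule = []
    while any(remaining.values()):
        for actor in remaining:
            runnable = remaining[actor] > 0 and all(
                tokens[k] >= c
                for k, (_, dst, _, c) in enumerate(buffers) if dst == actor)
            if runnable:
                for k, (src, dst, p, c) in enumerate(buffers):
                    if dst == actor:
                        tokens[k] -= c      # consume from input buffers
                    if src == actor:
                        tokens[k] += p      # produce on output buffers
                remaining[actor] -= 1
                schedule.append(actor)
                break
        else:
            return None                     # no actor can fire: deadlock
    return schedule
```

For a producer with rate 2 feeding a consumer with rate 3, the period contains three firings of the producer and two of the consumer, after which the buffer is back to its initial state. A two-actor cycle without initial tokens deadlocks, while one initial token on the feedback buffer makes it schedulable.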
2.3.2 Cyclo-static dataflow programs
Cyclo-Static Dataflow (CSDF) generalizes the SDF MoC by defining cyclically-changing firing
rules. Note that CSDF extends SDF with the notion of state, while maintaining
the same compile-time properties concerning scheduling and memory consumption. CSDF
programs allow the number of tokens consumed and produced by an actor to vary from one
firing to the next in a cyclic pattern: unlike the scalar consumption and production parameters
of SDF, in CSDF programs $c_{i,n}$ and $p_{i,n}$ are integer vectors, both denoted by $\vec{\gamma}_{i,n}$. Because
these patterns are periodic and predictable, it is still possible to statically construct periodic
schedules using techniques based on those developed for SDF. State can be represented as an
additional argument to the firing rules and firing function: in other words, it is modeled as a
self-loop [41, 42].
Static scheduling
The topological matrix entries are defined as:
$[\Gamma]_{i,j} = t_j \frac{\sigma_{i,j}}{d_{i,j}}$ (2.12)
where $d_{i,j} = \dim(\vec{\gamma}_{i,j})$ is the length, or period, of the token production/consumption pattern
for the $i$-th buffer connected to the $j$-th actor. If there is no connection, then $d_{i,j} = 1$. The
$j$-th actor fires in a cycle with period $t_j = \mathrm{lcm}(d_{i,j}, \forall i)$, the least common multiple of the
consumption and production periods for all the buffers connected to that actor. Finally, $\sigma_{i,j}$ is
the sum of the elements in $\vec{\gamma}_{i,j}$. As done for SDF, it is also possible for CSDF programs to
solve the balance equation (2.10) and verify the existence of an admissible schedule. However,
in CSDF programs the repetition vector $\vec{r}$ does not represent the number of actor firings, but
the number of cycles. In this case, the number of firings of the $i$-th actor is $r_i t_i$.
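The quantities appearing in Equation (2.12) can be computed directly from the rate patterns of an actor. The sketch below handles a single actor and returns only the magnitudes of its matrix entries, leaving aside the distinction between production and consumption; the function name and the example patterns are hypothetical.

```python
from functools import reduce
from math import gcd

def lcm(a, b):
    return a * b // gcd(a, b)

def csdf_entries(patterns):
    """patterns: {buffer: cyclic rate pattern} for one actor j.
    Returns (t_j, {buffer: t_j * sigma / d})."""
    # t_j is the least common multiple of the pattern lengths d.
    t_j = reduce(lcm, (len(g) for g in patterns.values()), 1)
    # sigma is the sum of a pattern; t_j is a multiple of d, so the division is exact.
    entries = {b: (t_j // len(g)) * sum(g) for b, g in patterns.items()}
    return t_j, entries
```

For an actor with the pattern `[1, 0]` on one buffer and `[2, 1, 3]` on another, the periods are 2 and 3, so the actor cycle has $t_j = 6$ firings and the entries are 3 and 12 tokens per cycle. With $r_j = 2$ cycles in the schedule, the actor fires $r_j t_j = 12$ times.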
2.3.3 Dynamic dataflow programs
Although SDF and CSDF are adequate models for representing parts of many algorithms, they
are rarely sufficient for expressing entire complex programs since they are not adequate to
express data-dependent iterations, conditionals and recursion. For example, functionality
that involves conditional execution of dataflow subsystems or actors with dynamically-varying
production and consumption rates cannot be expressed in decidable dataflow models [43, 44].
The dynamic dataflow (DDF) MoC defines actors with a number of produced and consumed
tokens that is not statically specified. In a DDF program, an actor may have both firing rules
and firing functions that are data-dependent. In other words, the token production and
consumption rate can vary according to the program input sequence.
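A data-dependent firing rule can be illustrated with a hypothetical actor that reads packets: the first token of each packet is a header giving the number of payload tokens that follow, and the actor outputs the sum of the payload. The number of tokens consumed per firing therefore depends on the value of a token and cannot be fixed at compile time.

```python
from collections import deque

def try_fire_packet_sum(inp, out):
    """Fire once if possible; return True if the actor fired."""
    if not inp:
        return False
    n = inp[0]                       # peek at the header without consuming it
    if len(inp) < 1 + n:
        return False                 # the data-dependent firing rule is not satisfied
    inp.popleft()                    # consume the header
    payload = [inp.popleft() for _ in range(n)]
    out.append(sum(payload))         # produce one token per packet
    return True

def run_packet_sum(tokens):
    inp, out = deque(tokens), deque()
    while try_fire_packet_sum(inp, out):
        pass
    return list(out), list(inp)
```

On the stream `[2, 10, 20, 0, 3, 1, 1]` the actor consumes three tokens in its first firing and one in its second, then waits because the third packet is incomplete.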
Analysis techniques
The increased modeling flexibility and expressive power make DDF programs much
harder to analyze. Due to their Turing-complete nature, many analysis problems may
become undecidable [43]. For example, DDF analysis techniques may succeed in guaranteeing a
bounded buffer size execution and deadlock avoidance only for a significant subset of
specifications (e.g. input streams in the context of signal processing systems) [1, 11, 12]. Similarly, DDF
scheduling is generally a run-time operation. However, some or all of the scheduling decisions
can be predicted at compile time by either describing the program with a more restricted
programming model or by analyzing the program to find if parts of it can be described in a
more restricted way [45, 46, 47, 48]. A systematic and effective analysis methodology for
DDF programs is illustrated in the following chapters of this dissertation.
2.4 Code interpretation and generation
The portability of dataflow programs across different HW and SW platforms is provided
by a compiler infrastructure capable of generating low-level code from the high-level,
system-level program representation. As illustrated in Figure 1.2, the compiler infrastructure is an
essential part of enabling an effective design space exploration (DSE) and implementation of a dataflow
program. In this section, the basic components of a dataflow compiler infrastructure are
illustrated. These are extensively used in the rest of this dissertation when the profiling of a
dataflow program is presented. Interpreters and compilers have much in common [49]. As
illustrated in Figure 2.7, both take the source code of the program as input. Moreover,
both analyze and validate the input program and build an internal (i.e. intermediate)
representation of it. However, the main difference is that a compiler generates a stand-alone
machine code program, while an interpreter performs the actions described by the high-level
input program description.
Figure 2.7: Code compiler and interpreter flowcharts. (a) Compiler flowchart: source program, compiler, target program. (b) Interpreter flowchart: source program, interpreter, results.
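The difference between the two flowcharts can be sketched on a toy expression language. Everything here is hypothetical and chosen for brevity: expressions are nested tuples such as `("add", 1, 2)`, the interpreter performs the described actions directly, and the compiler emits code for a small stack machine that is executed separately.

```python
def interpret(expr):
    """Interpreter: walk the program and perform its actions directly."""
    if isinstance(expr, (int, float)):
        return expr
    op, left, right = expr
    a, b = interpret(left), interpret(right)
    return a + b if op == "add" else a * b

def compile_expr(expr):
    """Compiler: translate the program into stand-alone stack-machine code."""
    if isinstance(expr, (int, float)):
        return [("push", expr)]
    op, left, right = expr
    return compile_expr(left) + compile_expr(right) + [(op, None)]

def run_target(code):
    """Execute the generated target program on a stack machine."""
    stack = []
    for op, arg in code:
        if op == "push":
            stack.append(arg)
        else:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b if op == "add" else a * b)
    return stack.pop()

PROGRAM = ("mul", ("add", 1, 2), 4)
```

Both paths yield the same result for `PROGRAM`, but only the compiler produces an artifact (the instruction list) that can be run without the original source.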
2.4.1 Abstract syntax tree
An abstract syntax tree (AST) is a tree representation of the abstract syntactic structure of the
source code. Each node of the tree denotes a construct occurring in the source code. The
syntax is abstract in the sense that it does not represent every detail appearing in the real
syntax. An AST is usually the result of the syntax analysis phase of a compiler or an interpreter.
It often serves as an intermediate representation of the program through several of the stages
that the compiler requires, and it has a strong impact on the final output of the compiler. After
its correctness has been verified, the AST is used to generate the intermediate representation
on which code generation or interpretation is based.
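As a small illustration of what "abstract" means here, the following sketch uses the `ast` module of the Python standard library. Python is used only because it ships a ready-made parser; it is not part of the toolchain discussed in this thesis, and the helper name `tree_of` is hypothetical. Redundant parentheses and spacing are details of the real syntax and leave no trace in the tree, whereas grouping that changes the meaning is encoded by the shape of the tree.

```python
import ast


def tree_of(expression: str) -> str:
    """Parse an expression and return a textual dump of its AST."""
    return ast.dump(ast.parse(expression, mode="eval"))


# Redundant parentheses and spacing do not appear in the AST.
same = tree_of("-(val)") == tree_of("- val")

# Grouping that changes the meaning is encoded by the tree structure.
different = tree_of("(a + b) * c") != tree_of("a + b * c")

print(same, different)
```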
2.4.2 Intermediate representation
An intermediate representation (IR) is a representation of a program partway between the input
source and the output target code. A well-structured IR depends on neither the input source
code nor the target architecture, which maximizes its ability to be re-used in a retargetable
compiler.
2.4.3 Control flow graph
The control flow graph (CFG) is a graph-based representation of the program control flow,
which is generally used for performing analyses on the IR of the input program [50]. The CFG
of a function is a connected, directed graph where the set of nodes represents the sequences of
program instructions and the set of directed edges (i.e. ordered pairs of nodes) represents the
flow of control. More precisely, a node represents a basic block, which is a maximal sequence of
consecutive statements with a single entry point, a single exit point, and no internal branches.
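The construction of a CFG from a linear instruction sequence can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used by any specific compiler: the three-field instruction format, the opcode names (`stmt`, `branch`, `jump`) and the function `build_cfg` are all hypothetical. Basic blocks are delimited by the classical leader rule: the first instruction, every jump target and every instruction following a jump start a new block.

```python
from typing import List, Optional, Set, Tuple

# (label, opcode, jump target); opcodes: "stmt", "branch" (conditional), "jump"
Instr = Tuple[Optional[str], str, Optional[str]]


def build_cfg(code: List[Instr]) -> Tuple[List[List[int]], Set[Tuple[int, int]]]:
    """Split the code into basic blocks and return (blocks, edges)."""
    label_at = {label: i for i, (label, _, _) in enumerate(code) if label}

    # Leaders: first instruction, jump targets, instructions after a jump.
    leaders = {0}
    for i, (_, op, target) in enumerate(code):
        if op in ("branch", "jump"):
            leaders.add(label_at[target])
            if i + 1 < len(code):
                leaders.add(i + 1)

    starts = sorted(leaders)
    ends = starts[1:] + [len(code)]
    blocks = [list(range(s, e)) for s, e in zip(starts, ends)]
    block_of = {s: b for b, s in enumerate(starts)}

    edges = set()
    for b, block in enumerate(blocks):
        last = block[-1]
        _, op, target = code[last]
        if op in ("branch", "jump"):
            edges.add((b, block_of[label_at[target]]))   # taken jump
        if op != "jump" and last + 1 < len(code):
            edges.add((b, block_of[last + 1]))           # fall-through
    return blocks, edges


# if-then-else: the branch jumps to "else" when the condition is false.
IF_ELSE = [
    (None, "branch", "else"),
    (None, "stmt", None),
    (None, "jump", "end"),
    ("else", "stmt", None),
    ("end", "stmt", None),
]
blocks, edges = build_cfg(IF_ELSE)
```

For the if-then-else fragment, the sketch yields four basic blocks (condition, then-branch, else-branch, join) connected by four edges.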
2.5 The Cal Actor Language
The Cal Actor Language (CAL) [51] is a domain-specific language that provides useful
abstractions for dataflow programming with actors. CAL directly captures the features of ATS
actors, adding the notion of atomic action firings, also called steps. Figure 2.8 illustrates the
basic concepts of a CAL program. This is a dataflow network composed of a set of actors and a
set of first-in first-out (FIFO) buffers. Each CAL actor is defined by a set of input ports, a set
of output ports, a set of actions, and a set of internal variables. CAL also includes the
possibility of defining an explicit finite state machine (FSM). The FSM captures the actor state
behaviour and drives the action selection according to its particular state, to the presence of
input tokens, and to the value of the tokens evaluated by other language operators called guard
functions. Each action may capture only a part of the firing rule of the actor, together with the
part of the firing function that pertains to the input/state combinations enabled by that
partial rule defined by the FSM. An action is enabled according to its input patterns and guard
expressions. Input patterns are defined by the amount of data that is required in the input
sequences, whereas guards are boolean expressions on the current state and/or on the input
sequences that need to be satisfied for enabling the execution of an action. The rest of this
section presents a basic overview of the main concepts concerning the syntax, the semantics and
the different MoCs that can be represented with this language.
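Figure 2.8: CAL network and actors structure. (The figure shows a network of five actors A, B,
C, D and E connected by the buffers b1 to b5, together with the structure of an actor: input
ports Pin, output ports Pout, actions, internal variables and FSM.)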
2.5.1 CAL program
A CAL program network $N$ is defined as a tuple $(K, A, B)$ where:
• $K = \{\kappa_1, \kappa_2, \ldots, \kappa_{n_\kappa}\}$ is a finite set of actor-classes.
• $A = \{a_1, a_2, \ldots, a_{n_A}\}$ is a finite set of actors.
• $B = \{b_1, b_2, \ldots, b_{n_B}\}$ is a finite set of buffers.
A CAL actor-class $\kappa$ defines the program-code-template and the implementation behaviors of
the actor (i.e. the CAL source code). Different actors can instantiate the same class, however
each actor corresponds to a different object with its own internal states that cannot be shared.
A CAL actor $a$ is defined as a tuple $(\kappa, P^{in}, P^{out}, \Lambda, V, \mathit{FSM})$ where:
• $\kappa$ is the actor-class.
• $P^{in} = \{p^{in}_1, p^{in}_2, \ldots, p^{in}_{n_I}\}$ is the finite set of input ports.
• $P^{out} = \{p^{out}_1, p^{out}_2, \ldots, p^{out}_{n_O}\}$ is the finite set of output ports.
• $\Lambda = \{\lambda_1, \lambda_2, \ldots, \lambda_{n_\Lambda}\}$ is the finite set of actions.
• $V = \{v_1, v_2, \ldots, v_{n_V}\}$ is the finite set of internal variables.
• $\mathit{FSM}$ is the internal finite state machine.
A CAL buffer $b$ is defined as a tuple $(a_s, p_s, a_t, p_t)$ where:
• $a_s \in A$ is the source actor (i.e. the one that produces the tokens).
• $p_s \in P^{out}_{a_s}$ is the output port of the source actor.
• $a_t \in A$ is the target actor (i.e. the one that consumes the tokens from the buffer).
• $p_t \in P^{in}_{a_t}$ is the input port of the target actor.
It is important to note that each input port can be connected to at most one buffer. By
contrast, there is no limitation on how many buffers can be connected to an output port.
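The tuple-based definitions above can be mirrored by a small data model. The following Python sketch is only an illustration under simplifying assumptions: the class and field names (`Buffer`, `Network`, `connect`) are hypothetical, actors are reduced to a mapping from actor name to actor-class name, and port sets are not checked. It enforces the connectivity rule stated above: an input port accepts at most one buffer, while an output port may feed several buffers.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Buffer:
    """A buffer b = (a_s, p_s, a_t, p_t)."""
    source_actor: str
    source_port: str
    target_actor: str
    target_port: str


@dataclass
class Network:
    """A network reduced to its actors (name -> actor-class) and buffers."""
    actors: Dict[str, str] = field(default_factory=dict)
    buffers: List[Buffer] = field(default_factory=list)

    def connect(self, buffer: Buffer) -> None:
        for name in (buffer.source_actor, buffer.target_actor):
            if name not in self.actors:
                raise KeyError(f"unknown actor {name!r}")
        taken = {(b.target_actor, b.target_port) for b in self.buffers}
        if (buffer.target_actor, buffer.target_port) in taken:
            raise ValueError("an input port can be connected to at most one buffer")
        self.buffers.append(buffer)


net = Network(actors={
    "Producer": "ParametrizedProducer",
    "Filter": "Inverter",
    "Consumer": "TokenConsumer",
})
net.connect(Buffer("Producer", "O", "Filter", "I"))    # b1
net.connect(Buffer("Filter", "O", "Consumer", "I"))    # b2
```

Connecting a second buffer to `Consumer.I` raises a `ValueError`, whereas a second buffer leaving `Producer.O` towards another actor is accepted.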
2.5.2 Execution model
For the purpose of this thesis, it is assumed that the firing of an action is performed by following
the serial execution of the stages summarized in Figure 2.9. These are:
• Wait for tokens $Q_{br}$: the firing waits until all the required input tokens are available
from the corresponding buffers.
• Consume input tokens $Q_r$: the firing consumes the input tokens.
• Action execution $Q_e$: the firing executes its algorithmic part.
• Wait for space $Q_{bw}$: the firing waits until all the output tokens can be accommodated
in the corresponding buffers.
• Write output tokens $Q_w$: the firing produces the output tokens.
where the transition conditions are the following:
• hasTokens: the required number of input tokens is available in each corresponding
input buffer.
• hasSpace: the required space for the output tokens is available in each corresponding
output buffer.
Figure 2.9: Action execution model according to Equation 5.10. (State sequence: start → Qbr →
Qr → Qe → Qbw → Qw → end. Qbr is kept while !hasTokens and left on hasTokens; Qbw is kept
while !hasSpace and left on hasSpace.)
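The serial stages of a firing can be sketched as a small state machine. The following Python fragment is only an illustration: the function name `fire` and the way the two conditions are supplied (one boolean per polling step, the last one being true) are assumptions made for the example and are not part of the execution model itself.

```python
from typing import Iterable, List


def fire(has_tokens: Iterable[bool], has_space: Iterable[bool]) -> List[str]:
    """Return the stages visited by one firing.

    has_tokens / has_space give the value of the corresponding condition
    at each polling step; each sequence is assumed to end with True.
    """
    tokens, space = iter(has_tokens), iter(has_space)
    trace = ["Qbr"]
    while not next(tokens):        # !hasTokens: keep waiting for tokens
        trace.append("Qbr")
    trace += ["Qr", "Qe", "Qbw"]   # consume, execute, then wait for space
    while not next(space):         # !hasSpace: keep waiting for space
        trace.append("Qbw")
    trace.append("Qw")             # write the output tokens
    return trace


print(fire([False, False, True], [True]))
```

With tokens becoming available at the third polling step and space immediately available, the firing stays in Qbr for three steps and then traverses Qr, Qe, Qbw and Qw once each.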
2.5.3 CAL syntax and semantics
In this section, an overview of CAL is provided. The syntax and the semantics of this dataflow
language are illustrated through simple but effective examples. The interested reader can
refer to [51] for further details.
Lexical tokens
Lexical tokens help the user to understand the functionality provided by any language. A
lexical token is an indivisible string of characters, also known as a lexeme. The CAL lexical
tokens, also summarized in Table 2.1, are described in the following:
• Keywords Keywords are a special type of identifier, reserved by default in the
programming language. These keywords can never be used as identifiers
in the code. Some of these keywords are action, actor, begin, else, if, while,
true and false.
• Operators Operators usually represent mathematical, logical or algebraic operations.
Operators are written as any string of the characters !, %, ^, &, *, /, +, -, =, <, >, ?, ~ and |.
• Delimiters Delimiters are used to indicate the start or the end of a syntactical element
in CAL. The following elements are used as delimiters: (, ), [, ], { and }.
• Comments Comments in CAL are the same as in Java and C/C++. Single-line
comments start with // and multiple-line comments start with /* and end with */.
Table 2.1: CAL lexical tokens.
Keywords    action, actor, procedure, function, begin, if, else, end, foreach, while, do,
            in, list, int, uint, float, bool, true, false
Operators   !, %, ^, &, *, /, +, -, =, <, >, ?, ~, |
Delimiters  (, ), [, ], {, }, ==>, ->, :=
Comments    //, /* ... */
Actions, input patterns and output patterns
The simplest actor that can be described using CAL is the Inverter actor defined in Listing
2.1. This actor consumes a token from its input port and produces a token on its output
port. The actor header is defined in line 1, which contains the actor name, followed by a list
of parameters contained inside the () construct (empty, in this case), and the declaration
of the input and output ports. The input ports are those in front of the ==> construct and
the output ports are those after it. In this case the input and output port sets are defined
as $P^{in}_{Inverter} = \{I\}$ and $P^{out}_{Inverter} = \{O\}$, respectively. For each parameter and port, the
data type is specified before the name (all defined with an int data type, in this case). This
actor contains only one action, labeled invert, as defined in line 3. In this case, the action
set is defined as $\Lambda_{Inverter} = \{invert\}$. Action invert demonstrates how to specify token
consumption and production. The part in front of the ==>, which defines the input patterns,
specifies how many tokens to consume, from which ports, and what to call those tokens in the
rest of the action. In this case, there is one input pattern: I:[val]. This pattern indicates
that one token is to be read (i.e. consumed) from the input port I, and that the token is to
be called val in the rest of the action. Such an input pattern also defines a condition that
must be met for this action to fire: if the required token is not present, this action will not be
executed. Therefore, input patterns do the following:
• They define the number of tokens (for each port) that will be consumed when the action
is executed (fired).
• They declare the variable symbols by which tokens consumed by an action firing will be
referred to within the action.
• They define a firing condition for the action, i.e. a condition that must be met for the
action to be able to fire.
The output patterns of an action are those defined after the ==> construct. They simply
define the number and values of the output tokens that will be produced on each output port
by each firing of the action. In this case, the output pattern O:[-val] says that exactly one
token will be generated at output port O, and that its value is -val. It is worth noting that,
although syntactically the use of val in the input pattern I:[val] looks the same as the one in
the output expression O:[-val], their meanings are very different. In the input pattern the name
val is declared: in other words, it is introduced as the name of the token that is consumed
whenever the action is fired. By contrast, the occurrence of val in the output expression uses
that name.
Listing 2.1: Inverter.cal
1 actor Inverter() int I ==> int O :
2
3 invert: action I:[val] ==> O:[-val] end
4
5 end
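The three roles of the input pattern can be mimicked outside CAL. The following Python sketch is a hand-written analogue of the Inverter actor, not code generated by any CAL tool: the function names `invert_can_fire` and `invert_fire`, and the use of `collections.deque` objects as FIFO buffers, are assumptions made for the example.

```python
from collections import deque
from typing import Deque


def invert_can_fire(buffer_i: Deque[int]) -> bool:
    """Firing condition implied by the input pattern I:[val]: one token on I."""
    return len(buffer_i) >= 1


def invert_fire(buffer_i: Deque[int], buffer_o: Deque[int]) -> None:
    """One firing of the invert action."""
    val = buffer_i.popleft()   # I:[val] consumes one token and names it val
    buffer_o.append(-val)      # O:[-val] produces one token with value -val


buffer_i, buffer_o = deque([1, -2, 3]), deque()
while invert_can_fire(buffer_i):
    invert_fire(buffer_i, buffer_o)

print(list(buffer_o))
```

Once the input buffer is empty the firing condition no longer holds, so the actor simply stops firing.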
Guards
So far, the only firing condition for actions was that there be sufficient tokens for them to
consume, as specified in their input patterns. However, in many cases, it is possible to specify
additional criteria that need to be satisfied for an action to fire, for instance conditions that
depend on the values of the tokens, the actor internal variables, or both. These conditions can
be specified using guards, as for example in the Split actor, defined in Listing 2.2. This actor
defines one input port I, two output ports O1 and O2, and two actions A and B. Both actions
require the availability of one token in I; however, their selection is guarded by the value of the
input token val read from I, through the guards defined in line 4 and line 7, respectively. In this
example, if val >= 0 then action A is selected, otherwise action B is selected.
Listing 2.2: Split.cal
1 actor Split() int I ==> int O1, int O2 :
2
3 A: action I:[val] ==> O1:[val]
4 guard val >= 0 end
5
6 B: action I:[val] ==> O2:[val]
7 guard val < 0 end
8
9 end
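The effect of the two guards can be sketched in Python as follows. This is only an analogue of the Split actor under the assumption that buffers are modelled as `collections.deque` objects; the function name `split_step` is hypothetical. The guard is evaluated on the token at the head of the buffer before that token is consumed.

```python
from collections import deque
from typing import Deque, Optional


def split_step(i: Deque[int], o1: Deque[int], o2: Deque[int]) -> Optional[str]:
    """Try to fire one action of Split; return its label, or None if none is enabled."""
    if not i:                  # input pattern I:[val] not satisfied
        return None
    val = i[0]                 # the guard inspects the token without consuming it
    if val >= 0:               # guard of action A
        o1.append(i.popleft())
        return "A"
    o2.append(i.popleft())     # guard of action B (val < 0)
    return "B"


i, o1, o2 = deque([3, -1, 0, -7]), deque(), deque()
fired = []
while (label := split_step(i, o1, o2)) is not None:
    fired.append(label)

print(fired, list(o1), list(o2))
```

Since the two guards are mutually exclusive and exhaustive, exactly one action is enabled for every available token.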
Actor parameters and internal variables
Using CAL, it is possible to define a set of actor parameters. These can be used when the same
actor definition is used more than once in the same program definition. For example, the
ParametrizedProducer actor, defined in Listing 2.3, uses the parameter maxCounter.
This parameter, defined in line 1, is used in the guard condition of the (only) action produce,
as defined in line 7. This actor also defines the internal variable counter, which is used and
updated during each firing of the action, as described in line 9.
Listing 2.3: ParametrizedProducer.cal
1 actor ParametrizedProducer(int maxCounter) ==> int O :
2
3 int counter := 0;
4
5 produce: action ==> O:[counter]
6 guard
7 counter < maxCounter
8 do
9 counter := counter + 1;
10 end
11
12 end
Priorities and State Machines
In the PingPongMerge actor, reported in Listing 2.4, a finite state machine schedule is used
to force the action sequence to alternate between the two actions A and B. The schedule
statement introduces two states s1 and s2. By contrast, in the BiasedMerge actor,
reported in Listing 2.5, the selection of which action to fire is not only determined by the
availability of tokens, but also depends on the priority statement.
Listing 2.4: PingPongMerge.cal
actor PingPongMerge() T In1, T In2 ==> T O :

  A: action In1:[val] ==> O:[val] end

  B: action In2:[val] ==> O:[val] end

  schedule fsm s1:
    s1 (A) --> s2;
    s2 (B) --> s1;
  end

end
Listing 2.5: BiasedMerge.cal
actor BiasedMerge() T In1, T In2 ==> T O :

  A: action In1:[val] ==> O:[val] end

  B: action In2:[val] ==> O:[val] end

  priority
    A > B
  end

end
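The difference between the two selection mechanisms can be made concrete with a small Python sketch. It is only an analogue under simplifying assumptions: all input tokens are supposed to be already present in the two input buffers, and the function names `ping_pong_merge` and `biased_merge` are hypothetical.

```python
from collections import deque
from typing import List, Sequence


def ping_pong_merge(in1: Sequence[int], in2: Sequence[int]) -> List[int]:
    """FSM schedule: state s1 enables only A (reads In1), state s2 only B (reads In2)."""
    a, b, out = deque(in1), deque(in2), []
    state = "s1"
    while True:
        if state == "s1" and a:
            out.append(a.popleft())
            state = "s2"
        elif state == "s2" and b:
            out.append(b.popleft())
            state = "s1"
        else:
            return out         # the only action enabled by the FSM lacks its token


def biased_merge(in1: Sequence[int], in2: Sequence[int]) -> List[int]:
    """Priority A > B: whenever both actions are fireable, A is selected."""
    a, b, out = deque(in1), deque(in2), []
    while a or b:
        out.append(a.popleft() if a else b.popleft())
    return out


print(ping_pong_merge([1, 2, 3], [10, 20]))
print(biased_merge([1, 2, 3], [10, 20]))
```

With the same inputs, the FSM version interleaves the two streams strictly, whereas the priority version drains In1 before reading In2.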
2.5.4 An example of a CAL program
In CAL it is possible to define a network of interconnected actors. Figure 2.10 depicts a
CAL program composed of three actors, Producer, Filter and Consumer, and two
buffers, b1 and b2. Two different representation approaches are supported for defining the
CAL network structure: the first one is based on a functional programming language called
Functional unit Network Language (FNL), the second one is based on the eXtensible Markup
Language (XML) and is known as XML Dataflow Format (XDF).
As an example, the FNL and XDF network representations illustrated in Listings 2.7 and
2.8, respectively, both define a CAL program where the Producer actor instantiates the
ParametrizedProducer actor-class defined in Listing 2.3, the Filter actor instantiates
the Inverter actor-class defined in Listing 2.1, and the Consumer actor instantiates the
TokenConsumer actor-class defined in Listing 2.6. It must be noted that, in this particular
example, the Producer actor instantiates its actor-class using the parameter maxCounter=3.
Supposing that this program is executed on a single-core processing unit, with an unlimited
buffer size configuration (i.e. it is always possible to produce tokens in a buffer), the
corresponding action firings are those summarized in Table 2.2.
Figure 2.10: Basic dataflow program. (Producer → b1 → Filter → b2 → Consumer.)
Listing 2.6: TokenConsumer.cal
actor TokenConsumer() int I ==> :

  consume: action I:[val] ==> end

end
Listing 2.7: BasicNetwork.nl
network BasicNetwork () ==> :

  entities

    Producer = ParametrizedProducer(maxCounter = 3);
    Filter = Inverter();
    Consumer = TokenConsumer();

  structure

    Producer.O --> Filter.I
    Filter.O --> Consumer.I

end
Listing 2.8: BasicNetwork.xdf
<?xml version="1.0" encoding="UTF-8"?>
<xdf name="BasicNetwork">
  <instance id="Producer">
    <class name="ParametrizedProducer"/>
    <parameter name="maxCounter">
      <expr kind="literal" literal-kind="integer" value="3"/>
    </parameter>
  </instance>
  <instance id="Filter">
    <class name="Inverter"/>
  </instance>
  <instance id="Consumer">
    <class name="TokenConsumer"/>
  </instance>
  <connection src="Producer" src-port="O" dst="Filter" dst-port="I"/>
  <connection src="Filter" src-port="O" dst="Consumer" dst-port="I"/>
</xdf>
Table 2.2: Firing of the CAL program described in Section 2.5.4.
Firing        Actor     Actor-class           Action
s1, s2, s3    Producer  ParametrizedProducer  produce
s4, s5, s6    Filter    Inverter              invert
s7, s8, s9    Consumer  TokenConsumer         consume
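The firing sequence of Table 2.2 can be reproduced with a few lines of Python. The sketch below is only an illustration: it assumes a scheduling policy in which each actor is fired repeatedly until none of its actions is enabled (one policy that is consistent with the order reported in the table), it tracks only the buffer occupancy and not the token values, and the function name `run_basic_network` is hypothetical.

```python
from typing import List, Tuple

Firing = Tuple[str, str, str, str]   # (firing, actor, actor-class, action)


def run_basic_network(max_counter: int = 3) -> List[Firing]:
    occupancy = {"b1": 0, "b2": 0}   # tokens stored in each (unbounded) buffer
    counter = 0                      # internal variable of the Producer actor
    trace: List[Firing] = []

    def fired(actor: str, actor_class: str, action: str) -> None:
        trace.append((f"s{len(trace) + 1}", actor, actor_class, action))

    while counter < max_counter:     # guard of action produce
        counter += 1
        occupancy["b1"] += 1
        fired("Producer", "ParametrizedProducer", "produce")
    while occupancy["b1"] > 0:       # input pattern of action invert
        occupancy["b1"] -= 1
        occupancy["b2"] += 1
        fired("Filter", "Inverter", "invert")
    while occupancy["b2"] > 0:       # input pattern of action consume
        occupancy["b2"] -= 1
        fired("Consumer", "TokenConsumer", "consume")
    return trace


for row in run_basic_network():
    print(*row)
```

With maxCounter = 3 the guard of produce is satisfied exactly three times, which bounds the total number of firings to nine.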
2.5.5 RVC-CAL
The CAL language has been expressly designed to be fully analyzable and thus to support
different forms of code analysis. This makes it possible to look for a variety
of optimization techniques that can be applied before and during the synthesis from the
dataflow program to the implementation code. A subset of the more general CAL language,
called RVC-CAL, has been standardized by the ISO/IEC SC29WG11 committee, also known as
MPEG [52, 53, 54, 55]. This subset restricts the data-types, operators, and features that can
be used when describing a CAL actor. RVC-CAL is used within the MPEG community as a
reference software language for the specification of the MPEG video-coding technology in
the form of a library of components (i.e. the actors) that are configured and instantiated into
networks to generate standard MPEG video decoders (e.g. MPEG4-SP, AVC, HEVC).
2.5.6 Compiler infrastructure
The RVC-CAL compiler infrastructure used in the context of this thesis is summarized in
Figure 2.11. It is called the open RVC-CAL compiler infrastructure (Orcc) [56, 57, 58]. It
provides the necessary tools for the design, simulation and code generation of RVC-CAL
programs for different targets. During the compilation flow, the RVC-CAL program is translated
into a code intermediate representation (IR). The IR is built using a model-driven engineering
(MDE) meta-model. More precisely, it makes use of the MDE technologies available on
the Eclipse IDE [59] such as the Eclipse modeling framework (EMF) [60, 61], Xtext [62] and
Xtend [63]. The Orcc compilation flow can be summarized as follows:
• Front-end: the RVC-CAL code is parsed and translated into an Abstract Syntax Tree
(AST). The AST is subsequently transformed into an IR. At this stage the semantic
validation, the type inference and the expression evaluation are performed.
• Core: a meta-model of the IR is created and serialized. The serialization enables
incremental compilation and analysis.
• Interpreter: the IR can be directly interpreted from its meta-model generated by the
back-end. The code interpretation is type-accurate and it permits a first high-level and
behavioral verification of the program.
• Back-end: target-specific optimizations (i.e. IR to IR transformations) are performed
before the low-level code generation. Subsequently, the IR is translated into a
general purpose programming language (e.g. C/C++, Java) or to a register transfer
language (RTL) (e.g. VHDL, Verilog).
For the purposes of this dissertation, the framework in charge of generating RTL
descriptions from a CAL code representation is Xronos [8, 64, 65]. This is the evolution of the
work presented in [66, 67] and it is fully integrated into the Orcc environment.
Figure 2.11: The RVC-CAL compiler and Xronos infrastructure integrated in the design flow
presented in Figure 1.2. (Design-flow blocks: CAL program, Architecture, Constraints, Compiler
Infrastructure, Code Generation with Source Code and Build Script, Synthesis or Compilation,
Implementation, Profiling and Analysis, Performance Estimation, Refactoring Directions and
Compiler Directives. Orcc and Xronos: cal and xdf inputs → Front-end → IR → Core → IR →
Back-Ends, Xronos and Interpreter, with C, LLVM, Promela, Java and HDL as generated targets.)
2.6 Conclusions
In this chapter the notion of dataflow programming has been illustrated. Three different
classes of dataflow graphs have been investigated, namely the Kahn process network
(KPN), the dataflow process network (DPN) and the actor transition system (ATS). For each
one of these classes the main mathematical formalization has been provided and discussed.
The notion of monotonicity has been introduced and used to illustrate the main analysis
problems that can arise when an operator (or actor) is not monotonic. Subsequently, the
main features of modularity and the different parallelism flavors exposed by the dataflow MoC
have been illustrated. The discussion has then covered the taxonomic classifications
of dataflow programs. The main properties of static (SDF), cyclo-static (CSDF) and dynamic
dataflow (DDF) programs have been illustrated. It has been shown why the analysis of dynamic
dataflow programs is considered a challenging task. Finally, the Cal Actor Language (CAL)
has been introduced. Concepts like actor-class, actor, action, procedure, internal variable,
ports, guards, internal state machine and priority have been illustrated through a collection
of source code examples. Furthermore, the RVC-CAL standardized subset and its compiler
infrastructure have been illustrated. It has been shown how, starting from a CAL source code
representation of the program, it is possible to generate a low-level code representation
suitable for implementing the program both in software and in hardware.

3 Profiling CAL programs
An appropriate complexity analysis stage is a fundamental step for any methodology aiming at
the implementation of today’s complex applications [68, 69]. Such a stage may have different
final implementation goals such as defining a new architecture dedicated to a specific applica-
tion under study, defining an optimal instruction set for a selected processor architecture, or
guiding the software optimization process in terms of control-flow and dataflow optimization
targeting a specific architecture. In this context, the term complexity is intended in a broader
and more intuitive sense than its strict mathematical definition only considering the size of
the minimal algorithm descriptions. More precisely, the various aspects and results of the
run-time algorithm complexity metrics are investigated. Such metric results can hardly be
evaluated from the algorithm code itself, because of its size or because the program behavior
is data-dependent and therefore a more sophisticated analysis methodology should be used.
In this chapter, some methods to classify the behavior of an actor are illustrated first,
followed by different static and run-time analyses.
3.1 Actor classification
Actor classification determines the behavior of a given actor in terms of production/consump-
tion of tokens, patterns that may govern token exchanges, and possible acceptable token
values. The final goal of this analysis is the detection of the class of each actor composing
the network. Restricted dataflow classes represent different trade-offs between algorithmic
expressiveness and execution predictability (see Section 2.3). In the simplest case, struc-
tural information of an actor is sufficient for the classification (e.g. the rules for an actor to
be considered SDF only depend on the input and output patterns of actions). However, in
more general cases, it is necessary to gather information from an actual execution of the
actor [70, 71, 72, 73].
Using the set of dataflow classes illustrated in Section 2.3, it is possible to classify dynamic
actors into a restricted dataflow class as follows:
• Static behavior: the classification tries to classify each actor within classes that are
increasingly expressive and complex. The rationale behind this is that the more expressive
(powerful) a class is, the more difficult it is to analyze. If an actor cannot be classified as
a static actor, the method will try to classify it as CSDF. An actor is classified as static if
and only if it conforms to the SDF class, which means that all its actions have the same
input and output patterns. A one-action actor is by definition static.
• Cyclo-static behavior: an actor has to meet two conditions to be a candidate for CSDF
classification: it must have a state, and there must be a fixed number of data-independent
firings that depart from the initial state, modify the state, and return the actor to its
original state. Once the actor has been identified as a valid CSDF candidate, abstract
interpretation can be used to determine the sequence of actions characterizing its behavior,
as well as its production and consumption rates [71, 72].
• Dynamic behavior: if not classified as SDF or CSDF, the actor is defined as DDF.
After being classified, the actors, as well as the network they compose, may be subject to
additional analyses and optimizations that require conformance to more restricted dataflow
classes.
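The structural test for the static class can be sketched as follows. This Python fragment is only an illustration of the rule stated above (all actions share the same input and output patterns); it is not the classifier of [70, 71, 72, 73]. The representation of an action as a pair of port-to-rate dictionaries and the names `is_sdf` and `_normalized` are assumptions made for the example.

```python
from typing import Dict, Tuple

Rates = Dict[str, int]            # port name -> number of tokens per firing
Action = Tuple[Rates, Rates]      # (consumption rates, production rates)


def _normalized(rates: Rates) -> frozenset:
    # A port with a zero rate is equivalent to a port that is not mentioned.
    return frozenset((port, n) for port, n in rates.items() if n > 0)


def is_sdf(actions: Dict[str, Action]) -> bool:
    """True if every action has the same input and output patterns."""
    signatures = {(_normalized(c), _normalized(p)) for c, p in actions.values()}
    return len(signatures) == 1


inverter = {"invert": ({"I": 1}, {"O": 1})}
split = {"A": ({"I": 1}, {"O1": 1}), "B": ({"I": 1}, {"O2": 1})}
ping_pong = {"A": ({"In1": 1}, {"O": 1}), "B": ({"In2": 1}, {"O": 1})}

print(is_sdf(inverter), is_sdf(split), is_sdf(ping_pong))
```

The one-action Inverter is static, whereas Split and PingPongMerge are not: the former produces on data-dependent ports, while the latter alternates its consumption ports and would therefore be a candidate for the cyclo-static test.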
3.2 Static analysis
The methods based on a static analysis of the source code range from simply counting the
number of operations up to defining dependencies among the basic blocks. This information
can be used during different optimization stages. For example, lower and upper bounds on the
run-time of a given program on a given processing element can be directly evaluated from
the operator count analysis [74, 75]. While this simple counting technique provides a very
accurate evaluation of the operations, it cannot handle loops, recursion, conditional
statements and data-dependent applications except for some particular cases. Explicit or implicit
enumeration of program paths can handle loops and conditional statements and can yield
bounds on best and worst case run-time [74, 75, 50]. The main drawback of these techniques
is that the typical real processing complexity of many algorithms heavily depends on the
input data statistics, while static analysis can only detect upper and lower bounds. Restricted
programming styles, such as the absence of dynamic data structures and recursion and the use
of bounded loops only, are required in order to correctly perform a static analysis [74].
3.2.1 Source lines of code
Source lines of code (SLOC) is one of the most-used metrics when dealing with program
development complexity and maintainability. Using the definition proposed in [76], a line of
code is a line of program text that is not a comment or blank line, regardless of the number of
statements or fragments of statements in the line. This specifically includes all lines containing
program headers, declarations, and executable and non-executable statements. However, the
SLOC of a program can be strongly dependent on how the counting procedure is interpreted.
For this reason, the number of lines of code should be used only as a crude complexity
measure [77].
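One possible interpretation of the counting procedure is sketched below in Python. It is only an illustration under stated assumptions: comments follow the // and /* */ conventions of CAL, string literals containing comment markers are not handled, and the function names `strip_comments` and `sloc` are hypothetical. A line that contains both code and a comment is counted, since it is not a pure comment line.

```python
def strip_comments(source: str) -> str:
    """Remove // and /* */ comments while preserving the line structure."""
    out = []
    i, n = 0, len(source)
    in_block = False
    while i < n:
        two = source[i:i + 2]
        if in_block:
            if two == "*/":
                in_block = False
                i += 2
            else:
                if source[i] == "\n":      # keep line breaks inside block comments
                    out.append("\n")
                i += 1
        elif two == "/*":
            in_block = True
            i += 2
        elif two == "//":
            while i < n and source[i] != "\n":
                i += 1                     # skip up to, but not including, the newline
        else:
            out.append(source[i])
            i += 1
    return "".join(out)


def sloc(source: str) -> int:
    """Count the lines that are neither blank nor made only of comments."""
    return sum(1 for line in strip_comments(source).splitlines() if line.strip())


INVERTER = """// Inverter actor
actor Inverter() int I ==> int O :

  /* negate each
     incoming token */
  invert: action I:[val] ==> O:[-val] end  // single action

end
"""
print(sloc(INVERTER))
```

For the commented Inverter source, only the header, the action and the closing end are counted.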
3.2.2 Operators count
As for the SLOC metric, the occurrence of each operator can be used as a crude complexity
measure of the program. Table 3.1 reports the set of unary, binary, data handling and flow
control operators available for the CAL language. However, basing the program complexity
on the number of operator occurrences can be misleading as conditional blocks (e.g. if and
while) are taken into account only once.
3.2.3 Cyclomatic complexity
The cyclomatic complexity analysis [78] is a quantitative measure of the complexity of
programming instructions. It directly measures the number of linearly independent paths through
the program source code. In other words, this is a software metric that equates complexity to
the number of decisions in a program. Developers can use this measure to determine which
modules (i.e. network, actor, action, procedure) of a program are overly complex and need
to be re-coded. For each module, the metric can be calculated either by evaluating the
CFG of the module (see Section 2.4.3) or by evaluating the program’s statements. The
cyclomatic complexity is defined as:
\[ v = e - n + 2p \tag{3.1} \]
where $e$ is the number of edges, $n$ is the number of nodes, and $p$ is the number of modules.
It must be noted that this equation is based on the assumption that the CFG is a strongly-
connected graph. The cyclomatic complexity of a module also gives the maximum number
of linearly independent paths through it. In other words, it can be evaluated by counting the
branch conditions in a module. Hence, Equation (3.1) can be rewritten as:
\[ v = b + 1 \tag{3.2} \]
where $b$ represents the number of simple branch conditions. The formulation defined in
Equation (3.2) is convenient because it allows developers to calculate the cyclomatic complexity
of a program without having to use graph analysis. However, this only applies to individual
modules that contain only single-entry, single-exit, structured blocks of code [79].
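The agreement between Equations (3.1) and (3.2) can be checked on small examples. The Python sketch below is only an illustration: the CFGs are given as hand-written edge sets over numbered basic blocks (an assumption of the example), and the function names are hypothetical.

```python
from typing import Set, Tuple

Edge = Tuple[int, int]


def cyclomatic_from_cfg(edges: Set[Edge], modules: int = 1) -> int:
    """Equation (3.1): v = e - n + 2p, with the nodes derived from the edges."""
    nodes = {node for edge in edges for node in edge}
    return len(edges) - len(nodes) + 2 * modules


def cyclomatic_from_branches(branches: int) -> int:
    """Equation (3.2): v = b + 1."""
    return branches + 1


# if-then-else: condition (0), then (1), else (2), join (3); one branch condition.
if_else = {(0, 1), (0, 2), (1, 3), (2, 3)}

# while loop containing an if-then: entry (0), loop test (1), if test (2),
# then (3), join (4), exit (5); two branch conditions.
loop_with_if = {(0, 1), (1, 2), (1, 5), (2, 3), (2, 4), (3, 4), (4, 1)}

print(cyclomatic_from_cfg(if_else), cyclomatic_from_branches(1))
print(cyclomatic_from_cfg(loop_with_if), cyclomatic_from_branches(2))
```

Both formulations give 2 for the if-then-else and 3 for the loop with a nested conditional, as expected for single-entry, single-exit structured code.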
Table 3.1: Profiled executed operators and statements.
Kind           Symbol      Name
Unary          ~           binary not
               !           logical not
               -           unary minus
               #           number of elements
Binary         &           bit and
               |           bit or
               ^           bit xor
               ==          equal
               !=          not equal
               >=          greater than or equal
               >           greater than
               <=          less than or equal
               <           less than
               &&          logical and
               ||          logical or
               -           minus
               +           plus
               *           times
               /           division
               div         integer division
               **          exponentiation
               %           modulo
               <<          shift left
               >>          shift right
Data Handling  ASSIGN      assign
               CALL        call
               LOAD        load
               STORE       store
               LIST_LOAD   list load
               LIST_STORE  list store
Flow Control   if          if then else statement
               while       while, do while and for statements
3.2.4 Halstead metrics
Halstead metrics [80] are used to estimate the production effort and the quality of a program
based on the numbers of operands and operators used in the source code. Halstead metrics are
based on the following set of parameters:
• $n_1$: the number of distinct operators present.
• $N_1$: the total number of operators present.
• $n_2$: the number of distinct operands present.
• $N_2$: the total number of operands present.
In the context of a dataflow program, these parameters can be defined with different levels of
granularity: they can be defined for the overall program or for each actor, action and procedure.
Some of the most-used Halstead metrics are the following:
• Program length: describes the size of the abstracted program obtained by removing
everything except operators and operands from the original source code. It is defined as:
\[ N = N_1 + N_2 \tag{3.3} \]
In contrast to the SLOC metric (see Section 3.2.1), the Halstead length gives a clearer
accounting of the overall statement complexity. In fact, SLOC does not tell anything about
how complex the lines of code are.
• Program volume: models the number of bits required to store an abstracted program
of length $N$. It is defined as:
\[ V = N \log_2(n_1 + n_2) \tag{3.4} \]
With this formulation, it is supposed that both the operators and the operands are
encoded as binary strings of uniform (and potentially non-integral) length.
• Program level: describes the ratio between the most compact volume of the same
algorithm implementation and the volume $V$ of the current program [80]. It is defined as:
\[ L = \frac{2}{n_1} \, \frac{n_2}{N_2} \tag{3.5} \]
In other words, a longer implementation of an algorithm has a lower program level than
a shorter implementation of the same algorithm.
• Program difficulty: is defined as the inverse of the program level, such as:
\[ D = \frac{1}{L} \tag{3.6} \]
In other words, a longer implementation of an algorithm has a higher difficulty
compared to a shorter implementation of the same algorithm.
• Programming effort: defines the effort required to develop (or understand) a program.
It is defined as:
\[ E = D \, V \tag{3.7} \]
In other words, the programming effort is proportional to both the difficulty and the
volume of the program.
• Programming time: defines the time in seconds required to develop the program. It is
defined as:
\[ T = \frac{E}{S} \tag{3.8} \]
where the $S$ value is the Stroud number, defined as the number of elementary
discriminations performed by the human brain per second [81]. $S$ ranges from 5 to 20 and its
value for software scientists is generally set to 18.
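Equations (3.3) to (3.8) can be evaluated directly once the operators and operands of a module have been listed. The Python sketch below is only an illustration: the function name `halstead` is hypothetical, and the way the statement `counter := counter + 1` is split into operators and operands is an assumption made for the example, not a tokenization prescribed by [80].

```python
import math
from typing import Dict, Sequence


def halstead(operators: Sequence[str], operands: Sequence[str],
             stroud: float = 18.0) -> Dict[str, float]:
    """Evaluate the Halstead metrics of Equations (3.3)-(3.8)."""
    n1, n2 = len(set(operators)), len(set(operands))   # distinct counts
    N1, N2 = len(operators), len(operands)             # total counts
    length = N1 + N2                                   # (3.3)
    volume = length * math.log2(n1 + n2)               # (3.4)
    level = (2 / n1) * (n2 / N2)                       # (3.5)
    difficulty = 1 / level                             # (3.6)
    effort = difficulty * volume                       # (3.7)
    time = effort / stroud                             # (3.8)
    return {"N": length, "V": volume, "L": level,
            "D": difficulty, "E": effort, "T": time}


# Assumed tokenization of: counter := counter + 1
metrics = halstead(operators=[":=", "+"], operands=["counter", "counter", "1"])
for name, value in metrics.items():
    print(f"{name} = {value:.3f}")
```

For this single statement, $n_1 = 2$, $n_2 = 2$, $N_1 = 2$ and $N_2 = 3$, so that $N = 5$, $V = 10$, $L = 2/3$, $D = 1.5$, $E = 15$ and, with $S = 18$, $T \approx 0.83$ seconds.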
3.3 Data-dependent analysis
The execution of DDF programs can vary according to a particular input stimulus (see Section
2.3.3). For this reason, the complexity of a DDF program cannot be defined only through a static
code analysis such as the one illustrated in the previous section. In other words, in order to
identify the program’s basic structure and complexity with different levels of abstraction, the DDF
program should be executed considering a statistically meaningful set of input sequences [1].
The different approaches that are generally used are:
• Binary-code execution: where a low-level code representation of the dataflow program
is generated and successively profiled through an instrumented platform-dependent
(host-)execution [33, 82, 83, 84, 85].
• Code interpretation: where the dataflow program IR is executed through a platform-
independent code interpretation [1, 73, 86].
The main difference between these two approaches is how the program execution abstracts
from the platform and how results are biased by low-level code optimizations. The complexity
measure obtained through a binary code-execution can be dependent on the particular
platform where the program is executed and can be biased by low-level code optimizations
performed by the compiler. Conversely, with an IR-code interpretation, the complexity measure
is totally platform-independent and not biased by low-level code optimizations. During a
data-dependent analysis it is possible to identify the program’s basic structure and complexity
with different levels of abstraction, independently of the approach. Two main axes are typically
recognized: the computational load and the data-transfers and storage load.
3.3.1 Computational load
The computational load is expressed in terms of executed operators and control statements
(i.e. comparison, logical, arithmetic and data movement instructions). It is possible to model
the time of an action firing based on the number of operators and control statements it
executes, as retrieved during the program code interpretation. For each action firing s_i this is
defined as:
w(s_i) = \sum_j c_j \, o(s_i)_j    (3.9)
where o(s_i)_j represents the number of executions of the j-th operator or control statement
performed by the action firing s_i, and c_j is the weight of the respective operator or control
statement. It must be noted that c_j can be defined according to a desired target architecture. As for
the static analysis discussed in Section 3.2, Table 3.1 reports the set of operators and control
statements that can be retrieved interpreting a CAL program.
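The weighted sum of Equation (3.9) can be sketched as below. This is only an illustration: the function name `firing_weight`, the operator names and the weight values are hypothetical and are not taken from Table 3.1. Operators without an explicit weight are assumed to cost one unit.

```python
def firing_weight(op_counts, weights, default_weight=1.0):
    """Computational weight of one action firing, as in Equation (3.9).

    op_counts: mapping operator name -> number of executions in this firing.
    weights: mapping operator name -> weight for the target architecture.
    default_weight: assumed weight for operators missing from `weights`.
    """
    return sum(weights.get(op, default_weight) * count
               for op, count in op_counts.items())


# Hypothetical counts collected while interpreting one firing.
counts = {"add": 4, "mul": 2, "compare": 1, "assign": 3}
# Hypothetical weights, e.g. a multiplication costing three abstract cycles.
cycle_weights = {"add": 1.0, "mul": 3.0, "compare": 1.0, "assign": 1.0}
w = firing_weight(counts, cycle_weights)
```

Keeping the counts separate from the weights means the same interpreted trace can be re-weighted for another target architecture without running the program again.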
3.3.2 Data-transfers and storage load
The data-transfers and storage load are expressed in terms of internal actor variable utilization,
input/output port utilization, buffer utilization and token production/consumption. During
the program code interpretation, some statistical information concerning the actor internal
variables and the buffer utilization can be stored to evaluate the memory load and utilization.
Internal actor variables
During the program code interpretation, for each firing and each actor internal variable the
following information can be collected:
• Writes: number of writes that each firing has made on an internal actor variable.
• Reads: number of reads that each firing has made on an internal actor variable.
• List writes: number of writes that each firing has made on an internal actor list variable.
• List reads: number of reads that each firing has made on an internal actor list variable.
Tokens and buffers
During the program code interpretation, the following information can be collected for each
firing and each buffer:
• Writes: number of tokens written on a buffer.
• Reads: number of tokens read from a buffer.
• Peeks: number of peeks (i.e. test of tokens presence) made by each firing on the respec-
tive input buffers.
• Read miss: number of unavailable tokens on the input buffers that made the selected
action not fireable.
• Write hit: number of unavailable token places on the output buffers that made the
selected action not fireable.
Furthermore, for each buffer, the maximal occupancy can be taken as an initial estimate of the
required buffer size.
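The buffer statistics listed above could be collected by an instrumented FIFO along the following lines. This is a sketch under stated assumptions, not the profiler described in this thesis: the class `ProfiledBuffer` and its method names are hypothetical, and a bounded buffer with a fixed capacity is assumed.

```python
from collections import deque


class ProfiledBuffer:
    """Bounded FIFO that records the statistics listed above (hypothetical API)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tokens = deque()
        self.writes = 0         # tokens written on the buffer
        self.reads = 0          # tokens read from the buffer
        self.peeks = 0          # tests of token presence
        self.read_misses = 0    # unavailable tokens that blocked a firing
        self.write_hits = 0     # unavailable token places that blocked a firing
        self.max_occupancy = 0  # initial estimate of the required size

    def peek(self, n=1):
        """Test whether n tokens are present, counting the missing ones."""
        self.peeks += 1
        missing = max(0, n - len(self.tokens))
        self.read_misses += missing
        return missing == 0

    def has_room(self, n=1):
        """Test whether n free places exist, counting the missing ones."""
        blocked = max(0, n - (self.capacity - len(self.tokens)))
        self.write_hits += blocked
        return blocked == 0

    def write(self, token):
        if len(self.tokens) >= self.capacity:
            raise OverflowError("buffer is full")
        self.tokens.append(token)
        self.writes += 1
        self.max_occupancy = max(self.max_occupancy, len(self.tokens))

    def read(self):
        token = self.tokens.popleft()  # raises IndexError when empty
        self.reads += 1
        return token


buf = ProfiledBuffer(capacity=2)
buf.peek()       # empty buffer: one read miss
buf.write("a")
buf.write("b")
buf.has_room()   # full buffer: one write hit
buf.read()
```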
3.4 Conclusions
In this chapter the main requirements of the profiling of a dataflow program have been
summarized. It has been shown how actors can be analyzed and their behavior classified as
static, cyclo-static and dynamic. Subsequently, the different static code analysis metrics have
been illustrated. These are the elementary count of source lines of code and the operator count,
but also the more complex cyclomatic and Halstead metrics. Finally, data-dependent
analysis for DDF programs has been discussed, and the concepts of computational load and
data-transfer and storage load have been introduced.
4 Exploring the design space of dataflow programs
Complex software systems may have many design points in terms of the selection of software
components and hardware architectures for implementation. These choices create a
large space of possible design solutions called the design space. The design process requires
exploring this design space to find design solutions before the actual implementation.
The aim of the design space exploration (DSE) is to find design solutions that satisfy functional
performance constraints and/or optimize portions of the system. In addition, the heterogene-
ity of modern parallel architectures and the diverse requirements of target applications greatly
complicate modern systems design. Developing efficient programs for this kind of platform requires
design methodologies that can deal with system complexity and flexibility. This has led
to the notion of system-level design, where key roles are played by aspects such as high-level
modeling and simulation, and separation of concerns [87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97].
In this context, the exploration of the design space becomes an essential step when implementing
applications on heterogeneous and parallel platforms. This is due to the combinatorial
explosion of design options when dealing with multiple concurrent processing units. In
order to have an efficient implementation and integration process, the design has to be sufficiently
modular and portable, without requiring any complete or partial re-implementation or manual
rewriting.
4.1 Orthogonalization of concerns
Orthogonalization of concerns is a well-established design paradigm [98]. Alternative solutions
of the design space can be efficiently evaluated through design performance estimations. One
of the main features of this design methodology is the separation between:
• Functional behavior and architecture.
• Communication and computation.
According to [89, 91], a formal model of a design is defined by the following components:
• A functional specification, given as a set of explicit or implicit relations which involve
inputs, outputs and possibly internal state information.
• A set of properties that the design must satisfy, given as a set of relations over inputs,
outputs, and states, that can be checked against the functional specification.
• A set of performance indexes that evaluate the quality of the design (e.g. in terms of cost,
reliability, speed, size) given as a set of equations involving inputs and outputs.
• A set of constraints on performance indexes, specified as a set of inequalities.
The functional specification fully characterizes the operation of a system, while the perfor-
mance constraints bound the cost. In other words, target points of the design space can be
formulated in terms of minimization problems where the objective functions are defined as
performance indexes and constraints as inequalities of the problem. In the following, the
concept of orthogonalization of concerns is illustrated using the formalism described in [98],
where the notions of model of computation, model of architecture and mapping are used.
4.1.1 Model of computation
The Model of Computation (MoC) is a formal representation of the operational semantics of
networks of functional blocks describing computation [99, 98]. Depending on the modeling
perspective, MoCs can be classified as an abstract or executable description [100]. Abstract
models are used to define the application workload without executing the specification. On
the other hand, executable specifications allow different abstraction levels: they can directly represent
the application or, for example, a discrete-event performance model of the application
itself. In the context of this thesis, only abstract dataflow MoCs are analyzed; more precisely,
MoCs whose taxonomy is described in Section 2.3.
4.1.2 Model of architecture
The Model of Architecture (MoA) is a formal representation of the operational semantics of
networks of functional blocks describing architectures [90, 98, 101, 102]. Depending on the
modeling perspective, a MoA can be classified as an abstract or an executable architecture
description [100]. Abstract models are used to represent performance symbolically.
For example, they associate the required latency in clock cycles with each operation without
actually executing any hardware description. On the other hand, executable specifications
allow state-dependent behavior, such as the timing of caches and pipelines, to be modeled
more precisely. In the context of this thesis, only abstract MoAs are analyzed, such as the ones
illustrated in [101, 102].
Figure 4.1: Mapping from an application to an architecture. Constraints represent the feasible
regions of the design space. (Diagram labels: Application, Architecture, Model of Computation,
Model of Architecture, Constraints.)
4.1.3 Mapping
The mapping involves defining which part of the program is executed on a particular processing
element, and which part of the communication structure is assigned to a particular medium.
In the context of hardware-software co-design the problem to be solved is coordinating the
design of the parts of the system to be implemented as SW and the parts to be implemented
as HW blocks [103]. The main requirement is to avoid HW/SW integration problems that
can arise when heterogeneous platforms are used. As such, a set of constraints should be
imposed and respected. Figure 4.1 depicts this process: the application is mapped onto a target
architecture if the set of constraints is fully satisfied. Constraints can be defined in terms of
data type [90, 98] (e.g. an application that makes use of floating-point arithmetic can be mapped only
onto an architecture that supports this numeric representation) but also in terms of memory
allocation, power utilization or clock frequency.
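The role of mapping constraints can be illustrated with a small feasibility check. This is a sketch only: the dictionary keys, the function name `unmet_constraints` and the two example processing elements are hypothetical, and only two of the constraint kinds named above (data type and memory allocation) are modeled.

```python
def unmet_constraints(application, architecture):
    """Return the constraints that prevent the mapping (empty list if feasible)."""
    unmet = []
    # Data-type constraint: every numeric representation used by the
    # application must be supported by the architecture.
    missing_types = application["data_types"] - architecture["data_types"]
    if missing_types:
        unmet.append(f"unsupported data types: {sorted(missing_types)}")
    # Memory constraint: the application must fit in the available memory.
    if application["memory_bytes"] > architecture["memory_bytes"]:
        unmet.append("not enough memory")
    return unmet


# Hypothetical application and processing elements.
app = {"data_types": {"int", "float"}, "memory_bytes": 4096}
fixed_point_core = {"data_types": {"int"}, "memory_bytes": 8192}
fpu_core = {"data_types": {"int", "float"}, "memory_bytes": 8192}

feasible_on_fpu = not unmet_constraints(app, fpu_core)
feasible_on_fixed = not unmet_constraints(app, fixed_point_core)
```

Returning the list of violated constraints, rather than a plain boolean, makes it easier to report why a given mapping falls outside the feasible region.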
4.2 The design space of a dataflow program
The design space describes the different mapping configurations that can be defined between
the application and the target architecture. However, there may exist many design alternatives
that implement a given system specification. Each of these gives the design
different qualities [104]. As such, these different implementations have to be explored
and judged for their quality so that a designer can decide which configuration has
to be implemented. Consequently, during DSE many design alternatives have to be evaluated.
Each design alternative may consist of different configuration choices with different levels
of parameters: for example from the choice of the partitioning of an application block to a
processing element, to a lower-level design parameter such as clock frequency or bus widths.
Hence, the DSE objective is to evaluate one or more mapping configurations so that design
objectives are satisfied. These objectives can be formulated in terms of real-time constraints,
throughput, resource efficiency and utilization. This list can easily be extended by, for example,
introducing requirements on the power consumption and silicon area utilization. The problem
can be defined as efficiently finding a feasible design configuration so that requirements are
fully satisfied.
Figure 4.2: The design space M = Cρ × Cσ × Cβ = {m1, m2, ..., m_nM} and the corresponding
performance T(m) and estimated performance T̂(m). (Plot labels: axes Cρ, Cσ and Cβ spanning M;
for the points m1 and m2, the values T(m), T̂(m) and |T − T̂|(m).)
4.2.1 Design space and design points
The evaluation of design points is one of the fundamental steps of the DSE. Its objective is to
define the design space in terms of a set of independent parameters so that performance and
requirements can be evaluated. The set of parameters is defined according to the abstraction
level used for modeling both the application and the architecture. As such, when dealing with
an abstract MoC and MoA, these parameters are defined in terms of partitioning, scheduling
and buffer size configurations. In the following, the sets of available partitioning, scheduling
and buffer size configurations are referred to as Cρ, Cσ and Cβ, respectively. Hence, a mapping
configuration point is defined as a 3-tuple m = (ρ,σ,β) where:
• ρ ∈ Cρ defines a partitioning configuration of the network (i.e. actors and buffers
mapped on the available processing elements and media, respectively).
• σ ∈Cσ defines a scheduling configuration of each partition.
• β ∈Cβ defines a size configuration of each buffer.
The design space of a dataflow program is then defined as the set of those independent
configurations such as:
M = \{m_1, m_2, \ldots, m_{n_M}\} \subseteq C_\rho \times C_\sigma \times C_\beta    (4.1)
Consequently, the DSE problem is to efficiently find a mapping configuration point m∗ ∈ M
so that the design objectives are met. As an example, consider the dataflow program
presented in Section 2.5.4. Its network configuration is depicted in Figure 2.10 and it is composed
of 3 actors: Produce, Filter and Consume. When this program is executed with
the different mapping configurations listed in Table 4.1, the corresponding execution
Gantt charts are those depicted in Figure 4.3.
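The definition of the design space in Equation (4.1) can be sketched as an enumeration of 3-tuples. This is a minimal illustration under stated assumptions: the function `design_space`, the scheduling labels and the memory bound are hypothetical, and partitions are encoded simply as tuples of actor names (P, F, C).

```python
from itertools import product


def design_space(C_rho, C_sigma, C_beta, is_feasible=None):
    """Enumerate mapping points m = (rho, sigma, beta).

    Without a predicate the full product C_rho x C_sigma x C_beta is returned;
    with one, only the feasible subset M is kept.
    """
    points = product(C_rho, C_sigma, C_beta)
    if is_feasible is None:
        return list(points)
    return [m for m in points if is_feasible(*m)]


# Partitioning configurations: one, two or three partitions.
C_rho = [
    (("P", "F", "C"),),
    (("P", "F"), ("C",)),
    (("P",), ("F",), ("C",)),
]
# Hypothetical labels standing in for concrete scheduling configurations.
C_sigma = ["schedule_a", "schedule_b"]
# Sizes of the two buffers, in tokens.
C_beta = [(1, 1), (512, 512)]

full_space = design_space(C_rho, C_sigma, C_beta)


def fits_memory(rho, sigma, beta, max_tokens=600):
    """Hypothetical constraint: total buffer space bounded by max_tokens."""
    return sum(beta) <= max_tokens


restricted_space = design_space(C_rho, C_sigma, C_beta, fits_memory)
```

Even this toy example shows why exhaustive enumeration does not scale: the number of points is the product of the sizes of the three configuration sets.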
[Figure 4.3, panels (a) to (f): Gantt charts for the mapping configurations m1 to m6. In each
panel the rows Producer, Filter and Consumer show the firings s1 to s3, s4 to s6 and s7 to s9,
respectively.]
Figure 4.3: Platform-independent simulation of the CAL network depicted in Fig. 2.10 with
the mapping configurations described in Table 4.1. The execution of each action is assumed
to take at least one (abstract) clock cycle (when there are no blocking output buffers); the
overheads introduced by action selection and buffer access are both neglected.
Gray: actor execution with the corresponding action firing. Striped gray: actor
execution postponed due to the unavailability of a token (i.e. blocking read).
Table 4.1: Mapping configurations for the dataflow network illustrated in Figure 2.10. For
brevity, the actors Producer, Filter and Consumer are denoted with P, F, C, respectively. The
partitioning of the buffers is not considered.

| Mapping m_i | Partitions ρ_i (static) | Scheduler σ_i | Buffer size β_i |
|---|---|---|---|
| m1 | ρ^1_1 = {P, C, F} | σ^1_1 = {P, P, P, C, C, C, F, F, F} | β^1_1 = 512; β^2_1 = 512 |
| m2 | ρ^1_2 = {P, C, F} | σ^1_2 = {P, F, C, P, F, C, P, F, C} | β^1_2 = 512; β^2_2 = 512 |
| m3 | ρ^1_3 = {P, F}; ρ^2_3 = {C} | σ^1_3 = {P, F, P, F, P, F}; σ^2_3 = {C, C, C} | β^1_3 = 1; β^2_3 = 1 |
| m4 | ρ^1_4 = {P, F}; ρ^2_4 = {C} | σ^1_4 = {P, F, P, F, P, F}; σ^2_4 = {C, C, C} | β^1_4 = 512; β^2_4 = 512 |
| m5 | ρ^1_5 = {P}; ρ^2_5 = {F}; ρ^3_5 = {C} | σ^1_5 = {P, P, P}; σ^2_5 = {F, F, F}; σ^3_5 = {C, C, C} | β^1_5 = 1; β^2_5 = 1 |
| m6 | ρ^1_6 = {P}; ρ^2_6 = {F}; ρ^3_6 = {C} | σ^1_6 = {P, P, P}; σ^2_6 = {F, F, F}; σ^3_6 = {C, C, C} | β^1_6 = 512; β^2_6 = 512 |
4.2.2 Exploration methods
DSE methodologies can be classified according to whether single or multiple design
objectives are taken into account. In the latter case, optimality is usually defined using the notion
of Pareto-dominance [105]: a design point dominates another one if it is equal or better in
all criteria and strictly better in at least one. Within a set of design points, those that are
not dominated by any other are called Pareto-optimal. With this notion in mind, the different DSE
approaches can be characterized, as summarized in [106], as follows:
• Exploration by hand: the selection of design points is done by the designer.
The major focus is on how design performance can be efficiently estimated [107].
• Exhaustive search: all design points of a specified region are evaluated. Generally, this
approach is combined with local optimization heuristics where one or multiple design
parameters are evaluated in order to reduce the size of the design space [108].
• Reduction to a single objective: design points are selected by reducing the DSE problem
to a set of single-criterion problems. Manual or exhaustive sampling is done in one
or several directions of the search space and a constrained optimization (e.g. iterative
improvement or analytic methods) is done in the others [109, 110, 111, 112].
• Black-box randomized search: design points are evaluated using a black-box optimization
approach. The design space is iteratively analyzed, where at each iteration the new
design point is computed based on prior information and by defining an appropriate
neighborhood function. The properties of these new design points are estimated.
Examples of sampling and search strategies are Pareto-simulated annealing [113],
Pareto-tabu search [114], evolutionary multi-objective optimization [115, 116] and Monte
Carlo methods improved by statistical estimation of bounds [117]. These black-box
optimizations are generally combined with local search methods [118].
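The notion of Pareto-dominance used at the beginning of this section can be sketched as follows. The example assumes that every criterion is to be minimized and that each design point is given as a tuple of criterion values; the function names and the sample (latency, memory) values are hypothetical.

```python
def dominates(a, b):
    """True if point a dominates point b (all criteria minimized).

    a must be equal or better in all criteria and strictly better in one.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_optimal(points):
    """Return the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (latency, memory) values of five design points.
candidates = [(10, 4), (8, 6), (12, 3), (9, 7), (10, 5)]
front = pareto_optimal(candidates)
```

Here (9, 7) is dominated by (8, 6) and (10, 5) by (10, 4), so three mutually incomparable points remain. This quadratic filter is adequate for small sets; the search strategies cited above are needed when the design space cannot be enumerated.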
4.2.3 Performance estimation
Performance analysis always involves three issues: a modeling effort, an evaluation effort
and the accuracy of the obtained results [119, 120, 121]. Very accurate performance numbers
can be achieved, but at the expense of a lot of detailed modeling and long evaluation times.
Conversely, performance numbers can be obtained in a shorter time and with a modest modeling
effort, but at the expense of accuracy. Independently from the abstraction level used
to model the application and the architecture, the objective for efficiently exploring the design
space is to find an appropriate performance estimation of the application for each mapping
configuration point. If the performance of each mapping configuration point m = (ρ,σ,β) ∈M
of the design space is defined in terms of application throughput as:
T(m)= f (m) (4.2)
the approximated model can be defined as:
T̂(m)= f̂ (m) (4.3)
hence, the objective is to reduce the accuracy error defined as:
\epsilon = \|T - \hat{T}\|_2 = \left( \sum_{m_i \in M} |T(m_i) - \hat{T}(m_i)|^2 \right)^{1/2}    (4.4)
where \|\cdot\|_2 is the 2-norm operator.
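The accuracy error of Equation (4.4) can be computed directly once the measured and the estimated throughput are available for every point of M. The sketch below assumes both are stored as dictionaries keyed by mapping point; the function name and the throughput values are hypothetical.

```python
import math


def estimation_error(measured, estimated):
    """2-norm of the difference between T(m) and its estimate over M, Eq. (4.4)."""
    if measured.keys() != estimated.keys():
        raise ValueError("both models must cover the same mapping points")
    return math.sqrt(sum((measured[m] - estimated[m]) ** 2 for m in measured))


# Hypothetical throughput values for three mapping points.
T = {"m1": 100.0, "m2": 80.0, "m3": 120.0}
T_hat = {"m1": 97.0, "m2": 84.0, "m3": 120.0}
error = estimation_error(T, T_hat)
```

Because the differences are squared, a single badly estimated point weighs more than several small deviations, which is worth keeping in mind when comparing estimation models.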
4.3 Related work
In the following section, an overview of some design space exploration tools and frameworks
is presented. For each one, the main functionalities and limitations are discussed.
CAL Design Suite
The CAL Design Suite [122, 33] is a set of tools for exploring and optimizing the design space of
RVC-CAL applications. It represents the first functional attempt to provide a complete design
flow for optimizing RVC codec specifications for multi-core and heterogeneous platforms [30]. It
is based on the analysis of the execution trace graph (ETG) of the program (see Section 7.1).
However, its ETG definition is limited since only internal variable and token dependencies
are supported. Furthermore, the CAL Design Suite provides a very basic architecture model
for heterogeneous platforms. A detailed discussion on how CAL programs are profiled is
presented in Section 7.1.
COMPA
The COMPA project [123, 46] provides an analysis and optimization framework for RVC-CAL
applications. The design space exploration is performed through a static analysis of the source
code. Different trade-offs between parallelism, communication traffic cost, and memory size
requirement are implemented as source to source transformations.
Daedalus
Daedalus [124, 125, 126] provides a unified environment for rapid system-level architectural
exploration, high-level synthesis, programming and prototyping of multimedia MPSoC ar-
chitectures. The Daedalus framework is an automatic design flow for KPN networks. The
application is modeled using a C/C++ imperative specification which is then automatically
converted into a KPN using the KPNgen tool [127]. Because of the nature of KPN models,
modeling of interrupts is difficult and inefficient. The design space exploration is performed
using the Sesame system-level simulation framework.
MAPS
The MPSoCs Application Programming Studio (MAPS) [93, 94, 95, 97] is a DSE framework for
KPN programs. Both the performance estimation and the design space exploration are performed through
an ETG analysis. ETGs are obtained by profiling and are augmented with timing information
via performance estimation. However, their definition is limited since only internal variable
and token dependencies are supported. Several heuristics for buffer sizing, mapping, and
scheduling are available within the framework. For fast and functional validations, MAPS
is fully integrated with the High-Level Virtual Platform (HVP) simulator [128]. Furthermore,
MAPS is equipped with a pioneering multi-application analysis component that performs
composability analysis in order to assess if a set of applications may run simultaneously, on
the same platform, without interfering with each other.
Mescal
The Mescal project [129, 130] aims at designing heterogeneous, application-specific, programmable
(multi) processors. The goal is to allow the programmer to describe the application in any
combination of models of computation that is natural for the application domain. The goal
is also to find a disciplined and correct-by-construction abstraction path from the underlying
micro-architecture to an efficient mapping between application and architecture. The
micro-architecture description including the memory subsystem is based on an architecture
description language.
Metropolis
Metropolis [131] is a framework allowing the description and refinement of a design at dif-
ferent levels of abstraction and integrates modeling, simulation, synthesis, and verification
tools. It provides an infrastructure based on meta-modeling with precise semantics that are
general enough to support various models of computation. This meta-model can capture
the functionality, the architecture and the mapping between the two different abstraction
levels. The function of a system, such as the application, is modeled as a set of processes that
communicate through media. Architectural building blocks are represented by performance
models where events are annotated with the costs of interest. A mapping between functional
and architecture models is determined by a third network that correlates the two models by
synchronizing events (using constraints) between them. Non-deterministic behavior can be
modeled and constraints can restrict the set of possible executions.
PeaCE
The PeaCE Environment [132] specifies the system behavior with a heterogeneous compo-
sition of three models of computation. These are an extended SDF model (called SPDF) for
computation tasks, an extended FSM model (called fFSM) for control tasks, and a task model
to describe the task interactions, respectively. The PeaCE environment provides a seamless
co-design flow from functional simulation to system synthesis, utilizing the features of the
formal models maximally during the whole design process. This framework is based on the
Ptolemy project [133]. However, when dealing with C/C++ specifications, the PeaCE approach
does not provide an automatic procedure to transform this specification into dataflow graphs.
Preesm
Preesm [134, 135] is a rapid-prototyping framework for static dataflow applications that has
been inspired by the algorithm architecture adequation matching methodology (AAM, also
sometimes called AAA) [136]. Preesm makes use of a parameterized and interfaced dataflow
meta-model (PiMM) [137] representation of the application, together with a System-Level
Architecture Model (S-LAM) for the high-level architecture description. It automatically
generates functional code for heterogeneous multi-core embedded systems, optimizing the
application scheduler by using the throughput as an optimization requirement.
Ptolemy
Ptolemy [133, 99] is a component-based heterogeneous modeling environment. It allows the
hierarchical combination of different models of computation with a high level of abstraction.
It uses tokens as the underlying communication mechanism. Controllers regulate how actors
fire and how tokens are sent between each actor. This mechanism allows different models of
computation to be combined within the Ptolemy framework. The design space exploration is
performed with third party environments (e.g. the PeaCE framework [132]).
SDF3
SDF3 [138] is a dataflow analysis tool that supports SDF and CSDF dataflow models of computation.
SDF3 is oriented towards model analysis and simulation without generating an
executable prototype of the application.
Sesame
The Sesame system-level simulation framework [139] addresses the problem of finding a
suitable and efficient target MP-SoC platform architecture. Sesame deploys separate appli-
cation and architecture models: the application model describes the functional behavior
of an application, while the architecture model defines architecture resources and captures
their performance constraints. Sesame maps application models onto architecture models
for cosimulation by means of trace-driven simulation, while using an intermediate mapping
layer for scheduling and event-refinement purposes. This allows for evaluation of the system
performance of a particular application, mapping, and underlying architecture. Essential in
this methodology is that an application model is independent from architectural specifics and
assumptions on hardware/software partitioning. The main disadvantage in this methodology
is that only KPN application models can be used and analyzed.
Space Codesign
Space Codesign [140, 141] is a design environment that provides an interface for user-written
SystemC modules that model application software to make calls to a real-time operating
system kernel. It provides a cosimulation environment for user-written SystemC hardware
modules. The environment also facilitates successive refinement through three software
abstraction layers for hardware-software codesign suitable for embedded-system design. The
first level focuses on the system design: the application is specified and functionality validated
through the SystemC simulator. In the second layer the application is partitioned among
different software and hardware modules. The hardware is modeled and emulated via the
SystemC simulator, while the software is encapsulated in the SystemC-RTOS interface via
an RTOS emulation process. At the third level, a more sophisticated architecture model is
emulated with the support of cycle accurate simulation at a chosen processor frequency.
SPADE
The Stream Processing Application Declarative Engine (SPADE) [142, 143] is a stream process-
ing application development framework for System S [144], which is a large-scale, distributed
datastream processing middleware. As a front-end for rapid application development for Sys-
tem S, SPADE provides an intermediate language for composition of parallel and distributed
dataflow graphs, together with a toolkit of type-generic, built-in stream processing operators,
that support scalar as well as vectorized processing and can seamlessly inter-operate with user-
defined operators. It provides a code generation framework to create optimized applications
that run natively on the Stream Processing Core (SPC), the execution and communication
substrate of System S. Subsequently, an optimizing compiler automatically maps applications
into appropriately-sized execution units in order to minimize communication overhead, while
at the same time exploiting available parallelism.
SynDEx
SynDEx [145] is a graphical and interactive software tool implementing the Algorithm Architecture
Adequation Matching methodology (AAM, also sometimes called AAA) [136]. Within this
environment, the designer defines an algorithm graph, an architecture graph and system
constraints. SynDEx is a computer-aided design tool aiming at mapping an algorithm
onto an architecture. The architecture taken into account is composed only of several processors;
hardware logic, such as FPGAs, cannot be taken into account in this flow. The design
space exploration is done according to a single criterion: the application throughput. It
provides the possibility of low-level code generation, but this is not actually provided within the
distributed tools.
SystemCoDesigner
SystemCoDesigner [146, 147, 148] is an actor-oriented approach using a high-level language
named SysteMoC, which is built on top of SystemC. It generates HW-SW SoC with automatic
design space exploration techniques. The model is translated into a behavioral SystemC
model as a starting point for HW and SW synthesis. During DSE, the design space is explored
using state of the art multi-objective optimization algorithms. For each design alternative,
performance is estimated by using performance models (which are generated automatically
from the SystemC behavioral model) and the behavioral synthesis results. The HW synthesis
is delegated to Forte Cynthesizer [149], a commercial tool which generates RTL code from a
SystemC intermediate model.
4.4 Advances in design space exploration of CAL programs
In the previous sections it has been discussed how the design space of an application can
be modeled and explored. Moreover, a list of available design exploration tools has been
presented. In the following section the main improvements and advances concerning the
exploration and optimization of CAL dataflow programs are illustrated.
4.4.1 Space for improvement
• Dynamic program analysis: dynamic program analysis is not supported by the tools
available for CAL programs. They limit their analysis to static and cyclo-static MoC
classes. Even though they can provide guarantees on the system performance and
requirements (e.g. deadlock-free execution), complex dynamic programs (e.g. video
codecs) can be analyzed only under strong assumptions and limitations on the design
cases.
• Design space modeling: as mentioned before, the design space can be modeled only
for restricted classes of dataflow programs. Moreover, performance estimation method-
ologies are specifically targeted for restricted sets of architectures.
• Bottlenecks and refactoring directions: except for the CAL Design Suite, bottleneck and
design refactoring directions are not provided. As such, the designer must implement the
application on the target architecture and profile the resulting implementation with an
additional third-party profiler. Relating the profiling results of the application
to the corresponding CAL source code is done by hand. However, sometimes this
relation cannot be univocally obtained (e.g. due to code-inlining optimizations done by
compilers).
• Automatic mapping and code generation: automatic mapping and code generation are
supported only partially and for a limited set of architectures. Moreover, tools do not provide a
uniform and interoperable methodology to provide the mapping configuration.
4.4.2 New requirements
Consolidated design space exploration methodologies for static and cyclo-static dataflow pro-
grams are hardly extensible to dynamic dataflow programs. In fact, they make use of analytical
models of the application MoC. For dynamic programs, this leads to possibly non-linear and
difficult-to-solve formulations. Consequently, this formalism should be extended in order to
make the design exploration and performance analysis possible using a unique mathematical
tool-set. For this reason, the concept of an execution trace graph of a dataflow program has
been formalized. This is a graph-based representation of the program where each node represents
a single action firing and each directed arc represents a dependency between a pair of action
firings. In the next chapter it is demonstrated how this formalism makes it possible to efficiently
explore the design space of dynamic dataflow applications (and, as a consequence, also that of
static and cyclo-static programs).
4.5 Conclusions
In this chapter the main requirements for a design space exploration (DSE) environment have
been summarized. The notion of orthogonalization of concerns has been introduced. The
main features of these design methodologies are the separation between functional behavior
and architecture and between communication and computational load. Furthermore, the no-
tions of high-level models of computation (MoC) and models of architecture (MoA) have been
presented. The design space and design points (i.e. design alternatives) of an application have
been formalized. Each design point has been defined as a particular mapping configuration
of the design, expressed in terms of partitioning, scheduling and buffer size configuration.
Subsequently, different DSE analysis and performance estimation methods have been illustrated
together with an overview of the currently available frameworks. Finally, the room for
improvement of the methodology and tools in the context of dynamic dataflow
programming, and more precisely for the CAL dataflow language, has been discussed.

5 Execution trace graph
Chapter 3 discussed how the execution of a dataflow program consists of a sequence
of action firings. This chapter illustrates how those firings can be correlated in a novel
graph-based representation, called the execution trace graph, in order to model the execution
behaviour of the program. The graph is a directed acyclic graph where each node represents
an action firing and each directed arc represents either a data or a logical dependence between
two different action firings. A partial order of the fired actions can be obtained from the
topological order of the graph. Hence, using the notions of partially-ordered space and
directed path developed in [150, 151, 152, 153], the effectiveness of analyzing a dataflow
program starting from its behavioural execution is demonstrated.
5.1 Geometry of execution
Without the ambition to be complete, this section provides a brief introduction to the trace
space theory formalized in [150, 151, 152, 153]. Looking at the geometry of dataflow program
executions, it is possible to think of a concurrent execution of two actors A and B on two
processing units pu1 and pu2 as a curve in ℝ². A point of this space has as abscissa the local
time spent on pu1 executing A, and as ordinate the local time spent on pu2 executing B.
Figure 5.1 depicts a possible execution path through the execution space of
the program. The execution space of a program can be considered as the set of all possible
increasing paths (since the time flow cannot be inverted) included in the square delineated
by the interleaving of A and B.
5.1.1 Partially-ordered space
The geometric model which has already been implicitly used in Figure 5.1 is a partially-ordered
space, also called a po-space. This is a topological space equipped with a partial order. In
other words, a po-space is a topological space in which points are ordered globally through
time. Formally, a partial order ≤ on a set U is a reflexive, transitive and antisymmetric relation.
Figure 5.1: Execution space in ℝ² of two actors A and B mapped on two processing units pu1
and pu2, respectively. The dashed arrow represents a possible execution path of the program.
A partial order ≤ on a topological space X is said to be closed if ≤ is a closed subset of X × X in
the product topology. In that case (X, ≤) is called a po-space.
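The three axioms of a partial order can be checked mechanically on a finite set. The following is a minimal sketch, assuming a small discretized grid of (pu1 time, pu2 time) points ordered componentwise; the names is_partial_order, grid and componentwise are illustrative and do not come from the cited trace space theory, and the topological closedness condition is not checked.

```python
from itertools import product

def is_partial_order(elements, leq):
    """Return True if leq is reflexive, antisymmetric and transitive on elements."""
    elements = list(elements)
    reflexive = all(leq(x, x) for x in elements)
    antisymmetric = all(
        x == y or not (leq(x, y) and leq(y, x))
        for x, y in product(elements, repeat=2)
    )
    transitive = all(
        not (leq(x, y) and leq(y, z)) or leq(x, z)
        for x, y, z in product(elements, repeat=3)
    )
    return reflexive and antisymmetric and transitive

# Hypothetical 3x3 grid of points (local time on pu1, local time on pu2).
grid = [(a, b) for a in range(3) for b in range(3)]

# Componentwise order: a point precedes another if both local times do.
componentwise = lambda p, q: p[0] <= q[0] and p[1] <= q[1]

# Comparing only the first coordinate is not antisymmetric on the grid.
first_only = lambda p, q: p[0] <= q[0]
```

With these definitions, is_partial_order(grid, componentwise) holds, whereas first_only fails the antisymmetry test because distinct points with the same abscissa precede each other.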
5.1.2 Execution trace
The dashed arrow that has already been intuitively used in Figure 5.1 represents an execution
trace of the two actors A and B. In other words, it represents a directed path, also called d-path,
in the execution space. Formally, a d-path $\vec{p}$ in a po-space $(X, \le)$ is defined as a map
$$\vec{p} : \vec{1} \to X \qquad (5.1)$$
that is continuous and order-preserving, where $\vec{1} = [0,1] \subseteq \mathbb{R}$ represents
the closed and directed unit interval. A d-path considered up to monotone reparametrization is
called a trace, and it is represented as $X(x_1, x_2)$, where $x_1, x_2 \in X$ are such that
$x_1 \le x_2$. A po-space equipped with a notion of direction is defined as a directed topological
space, also called a d-space. A d-space is formally defined as a pair $(X, dX)$ consisting of a
po-space $X$ together with a set $dX$ of paths in $X$. In this case, it is possible to define a new
partial order $\prec$ on $X$ such that $x_1 \prec x_2$ if there is a d-path from $x_1$ to $x_2$ in
$X$. This is a sort of reachability relation that is antisymmetric and
coarser than the relation $\le$ in the sense that $x_1 \prec x_2 \Rightarrow x_1 \le x_2$.
5.1.3 Execution trace space
The concurrent execution of actors A and B depicted in Figure 5.1 might have several feasible
traces. In other words, some equivalent traces might exist such that the corresponding
executions end with the same result. Formally, two d-paths $\vec{p}_1$ and $\vec{p}_2$ are
considered equivalent when $\vec{p}_2$ can be obtained by continuously deforming $\vec{p}_1$ (or
vice versa). This equivalence relation is called dihomotopy. Given two points $x_1$ and $x_2$ of a
d-space $(X, dX)$, $E(X, dX)(x_1, x_2)$ identifies the execution trace space obtained from
$X(x_1, x_2)$ by identifying all the dihomotopically equivalent paths. In particular,
$E(X, dX)(x_1, x_2) \neq \emptyset$ if and only if
there exists at least one directed path in $X$ going from $x_1$ to $x_2$.
5.2 Execution trace graph
The execution of a dataflow program can be modelled as a directed acyclic graph (DAG) where
each node represents a single action firing and each directed arc represents either a data or a
logical dependence between two different action firings [1, 2, 33, 94, 154]. In Section 2 it has
been shown how during each firing, an action can consume a finite number of input tokens,
produce a finite number of output tokens, and modify the actor’s internal variables. Hence,
it can be observed that it is possible to identify the dependencies that arise among different
firings. For example, if during a firing an action consumes some tokens, then it must rely
on the execution of the action that produced those tokens. The same can be stated if the
action, in the processing part of the firing, makes use of some of the internal actor variables
that were previously modified or used by another action. Several other types of dependencies
can be identified and used to characterize the execution of a dataflow program: these are
summarized in Table 5.1 and presented in Section 5.2.2.
An execution trace graph (ETG) is formally defined as a DAG(S, D), where:
• S is the set of single action firings, defining the nodes of the graph.
• D ⊆ S × S is the set of dependencies, defining the directed edges of the graph.
Defining dependencies between action firings establishes a precedence order. If the firing
s2 ∈ S depends on the firing s1 ∈ S, then s1 has to be executed and completed before s2 can be
started. The dependency is then defined as (s1, s2) ∈ D. The transitive closure of the dependencies
is the precedence relation ≤. So, S can be defined as a po-space (S, ≤) and the precedence
constraint between s1 and s2 can be expressed as s1 ≺ s2.
Remark. In this work it is assumed that the number of firings in S and the number of
dependencies in D are finite, which is denoted by |S| < ∞ and |D| < ∞, respectively.
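The step from the dependencies set to the precedence relation can be sketched in a few lines. This is a minimal illustration, assuming firings are identified by plain strings and D is given as a set of (source, target) pairs; the helper names transitive_closure and is_acyclic are hypothetical and are not part of any tool described in this work.

```python
def transitive_closure(edges):
    """Return all pairs (u, v) such that v is reachable from u through edges."""
    successors = {}
    for u, v in edges:
        successors.setdefault(u, set()).add(v)
    closure = set()
    for start in successors:
        stack = list(successors[start])
        while stack:
            node = stack.pop()
            if (start, node) not in closure:
                closure.add((start, node))
                stack.extend(successors.get(node, ()))
    return closure

def is_acyclic(edges):
    """A finite dependency set is acyclic iff no firing precedes itself."""
    return all(u != v for u, v in transitive_closure(edges))

# Illustrative dependencies: s1 -> s2 -> s3 and s1 -> s4.
D = {("s1", "s2"), ("s2", "s3"), ("s1", "s4")}
precedes = transitive_closure(D)
```

Here ("s1", "s3") belongs to precedes although it is not in D, while s3 and s4 remain unordered, which is what allows them to be fired concurrently.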
5.2.1 Firings
Each si ∈ S represents a single action firing occurring during the execution of a dataflow
program. In other words, if an action is fired n times, then n nodes in S are used to represent
the single firings.
A single action firing s ∈ S is formally defined as a 3-tuple s(a, λ, η), where:
• a ∈ A is the actor.
• λ ∈ Λ is the action.
• η ∈ ℕ is the action execution index, which distinguishes different firings of the same action
during the entire program execution.
5.2.2 Dependencies
Each (si, sj) ∈ D represents a dependence between two fired actions si and sj, such that si ≠ sj.
Several kinds of dependencies can be defined during the execution of a dataflow program.
As summarized in Table 5.1, these are: internal variable, finite state machine, guard, port
and tokens. As illustrated in the following, each of these can be refined by a sub-kind (its
direction) and enhanced with some profiling parameters useful for a post-mortem analysis. Hence,
more than one dependence can be defined between each pair si, sj.
A dependency (si, sj) ∈ D is formally defined as a 4-tuple (si, sj, µ, d), where:
• si ∈ S is the source action firing.
• s j ∈ S is the target action firing.
• µ is the dependence kind. As illustrated in the following, the kind can be: internal
variable, finite state machine, guard, port or tokens.
• d is the dependence direction. As illustrated in the following, the direction can be:
read/read, read/write, write/read, write/write, enable, disable or undefined.
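The firing 3-tuple and the dependency 4-tuple map naturally onto immutable records. The following is a minimal sketch, assuming Python dataclasses as the container; the class and field names (Firing, Dependency, Kind, Direction) are illustrative choices and do not reproduce the data model of any existing tool, and the index origin is arbitrary.

```python
from dataclasses import dataclass
from enum import Enum

class Kind(Enum):
    VARIABLE = "internal variable"
    FSM = "finite state machine"
    GUARD = "guard"
    PORT = "port"
    TOKENS = "tokens"

class Direction(Enum):
    READ_READ = "read/read"
    READ_WRITE = "read/write"
    WRITE_READ = "write/read"
    WRITE_WRITE = "write/write"
    ENABLE = "enable"
    DISABLE = "disable"
    UNDEFINED = "undefined"

@dataclass(frozen=True)
class Firing:
    actor: str    # a
    action: str   # lambda
    index: int    # eta, distinguishes firings of the same action

@dataclass(frozen=True)
class Dependency:
    source: Firing
    target: Firing
    kind: Kind                                 # mu
    direction: Direction = Direction.UNDEFINED # d

# Two firings of the same action are two distinct nodes.
s1 = Firing("Producer", "produce", 1)
s2 = Firing("Producer", "produce", 2)

# Two dependencies of different kinds between the same pair of firings.
e1 = Dependency(s1, s2, Kind.VARIABLE, Direction.WRITE_WRITE)
e2 = Dependency(s1, s2, Kind.PORT, Direction.WRITE_WRITE)
```

Because the records are frozen, they are hashable and can be stored in sets, so that several dependencies between the same pair of firings coexist as long as their kind or direction differs.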
The set of incoming dependencies of a firing si is defined as:
$$\delta(s_i)^{-}_{E} = \{(s_n, s_m) \in D : s_m = s_i\} \qquad (5.2)$$
The set of firings which are the source of an incoming dependency of si is called the set of
predecessors of si, and is denoted as:
$$\delta(s_i)^{-}_{S} = \{s_j : \exists (s_j, s_i) \in D\} \qquad (5.3)$$
Firings that do not have any predecessors are called sources of the ETG. The set of sources is
defined as:
$$S^{-}_{\emptyset} = \{s_i : \delta(s_i)^{-}_{S} = \emptyset\} \qquad (5.4)$$
Similarly, the set of outgoing dependencies of a firing si is defined as:
$$\delta(s_i)^{+}_{E} = \{(s_n, s_m) \in D : s_n = s_i\} \qquad (5.5)$$
The set of firings which are the target of an outgoing dependency of si is called the set of
successors of si, and is denoted as:
$$\delta(s_i)^{+}_{S} = \{s_j : \exists (s_i, s_j) \in D\} \qquad (5.6)$$
Firings that do not have any successors are called sinks of the ETG. The set of sinks is defined as:
$$S^{+}_{\emptyset} = \{s_i : \delta(s_i)^{+}_{S} = \emptyset\} \qquad (5.7)$$
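The predecessor, successor, source and sink sets translate directly into set comprehensions. The sketch below assumes that firings are plain strings and that D is a set of (source, target) pairs; the function names are illustrative. The edge list is a hand transcription of the nine-firing Producer, Filter, Consumer example used later in this chapter, with parallel edges between the same pair of firings collapsed.

```python
def predecessors(D, s):
    """Firings that are the source of an incoming dependency of s."""
    return {sn for (sn, sm) in D if sm == s}

def successors(D, s):
    """Firings that are the target of an outgoing dependency of s."""
    return {sm for (sn, sm) in D if sn == s}

def sources(S, D):
    """Firings without predecessors."""
    return {s for s in S if not predecessors(D, s)}

def sinks(S, D):
    """Firings without successors."""
    return {s for s in S if not successors(D, s)}

S = {f"s{i}" for i in range(1, 10)}
D = {
    ("s1", "s2"), ("s2", "s3"),                # Producer firings
    ("s4", "s5"), ("s5", "s6"),                # Filter firings
    ("s7", "s8"), ("s8", "s9"),                # Consumer firings
    ("s1", "s4"), ("s2", "s5"), ("s3", "s6"),  # tokens Producer -> Filter
    ("s4", "s7"), ("s5", "s8"), ("s6", "s9"),  # tokens Filter -> Consumer
}
```

For this graph the only source is s1 and the only sink is s9, while s5 has the two predecessors s2 and s4 and the two successors s6 and s8.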
Internal variable
An internal variable dependency (si, sj) ∈ D occurs when two actions of the same actor share
the same internal variable v ∈ V. More precisely, four different directions can be defined:
• write/read: when the action firing sj reads the internal variable v without an intervening
write operation and si is the last action firing, previous to sj, that wrote v.
• read/write: when the action firing sj writes the internal variable v and si is the last
action firing, previous to sj, that read v.
• read/read: when both the action firings si and sj read the internal variable v without
an intervening write operation and si is the last action firing, previous to sj, that read
v.
• write/write: when both the action firings si and sj wrote the internal variable v and
si is the last action firing, previous to sj, that wrote v.
Only the write/read direction is a data dependency. By contrast, the read/write, read/read and
write/write directions express only a memory utilization precedence between the two actions;
such information can be useful if a memory optimization of the design is applied. The parameter
that can be stored in this kind of dependency is the variable v to which the dependency relates.
Additional attributes retrieved from the profiling are the initial and final value of this variable.
In the following, the set of dependencies of this kind is denoted by Dv ⊆ D.
Finite state machine
A finite state machine dependency (si, sj) ∈ D connects two executed actions belonging
to the same actor and related via its internal state scheduler. In other words, a dependency
of this kind is defined when the execution of both action firings si and sj is driven by the
internal FSM of the actor and si is the last action firing, previous to sj, scheduled by the FSM.
In the following, the set of dependencies of this kind is denoted by Df ⊆ D.
Guard
A guard dependency (si, sj) ∈ D occurs when an action firing si modifies the value of the
guard which conditions the action firing sj. The guard condition, which may be defined as a
combination of state variables and token values, can be enabled or disabled by si through
the modification of its variables or the production of particular token values. For this kind of
dependency, two different directions can be defined:
• enable: when the modification of an internal variable or the production of a token
performed by si makes the action firing sj executable (i.e. enabled).
• disable: when the modification of an internal variable or the production of a token
performed by si makes the action firing sj not executable (i.e. disabled).
The parameters that can be stored are the identifier of the guard to which the dependency relates
and the order of appearance in which this guard was enabled or disabled. In the following,
the set of dependencies of this kind is denoted by Dg ⊆ D. It must be noted that in
some design cases, uncovering these dependencies might have the side effect of making the
trace dependent on both the buffer size and the scheduler configuration used during the
program execution. A more detailed discussion about this kind of dependency is presented in
Section 5.3.6.
Port
A port dependency (si, sj) ∈ D connects two action firings of the same actor that share an input
or an output port p. It defines in which order tokens must be consumed or produced over this
port. More precisely, two different directions can be defined:
• read/read: when both the action firings si and sj retrieved some tokens from the input
port p and si is the last action firing, previous to sj, that retrieved at least one token
from p.
• write/write: when both the action firings si and sj sent some tokens to the output port
p and si is the last action firing, previous to sj, that sent at least one token to p.
The parameter that can be stored in this kind of dependency is the port p (input or output)
to which the dependency relates. In the following, the set of dependencies of this kind is
denoted by Dp ⊆ D.
Tokens
A tokens dependency (si, sj) ∈ D connects the action firing that produces some tokens to
the one that consumes at least one of them. These action firings may belong to different
actors, or to the same actor (i.e. in case of a direct dataflow feedback loop).
The parameter that can be stored in this kind of dependency is the number of tokens,
among those produced by the producer firing si, that the consumer firing sj consumed.
Additional attributes retrieved from the profiling are the token values. In the following, the set
of dependencies of this kind is denoted by Dt ⊆ D.
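Token dependencies and their token counts can be recovered by replaying the traffic of a FIFO buffer. The following is a minimal sketch under the assumption that the profiler emits, for each buffer, an ordered list of produce and consume events; the event format and the function name token_dependencies are hypothetical.

```python
from collections import Counter, deque

def token_dependencies(events):
    """Replay one FIFO and count tokens exchanged per (producer, consumer) pair.

    events is an ordered list of (operation, firing, number_of_tokens) tuples,
    where operation is either "produce" or "consume".
    """
    fifo = deque()        # each queued token remembers the firing that produced it
    counts = Counter()
    for operation, firing, n in events:
        if operation == "produce":
            fifo.extend([firing] * n)
        elif operation == "consume":
            for _ in range(n):
                producer = fifo.popleft()  # IndexError if the FIFO underflows
                counts[(producer, firing)] += 1
        else:
            raise ValueError(f"unknown operation: {operation}")
    return dict(counts)

# Producer firings s1..s3 write one token each, Filter firings s4..s6 read one each.
events = [
    ("produce", "s1", 1), ("produce", "s2", 1), ("produce", "s3", 1),
    ("consume", "s4", 1), ("consume", "s5", 1), ("consume", "s6", 1),
]
deps = token_dependencies(events)

# A consumer reading two tokens that come from two different producers.
mixed = token_dependencies([("produce", "a", 1), ("produce", "b", 2), ("consume", "c", 2)])
```

Each key of the returned dictionary corresponds to one tokens dependency, and its value is the number of tokens stored as the dependency parameter.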
Table 5.1: Dependency kinds, directions, parameters and additional attributes.
Name                      Direction     Parameters          Additional attributes
Dv  internal variable     read/read     variable id         initial value, final value
                          write/read
                          read/write
                          write/write
Df  finite state machine  -             -                   -
Dg  guard                 enable        guard id,           -
                          disable       appearance order
Dp  port                  read/read     port id             -
                          write/write
Dt  tokens                -             output port id,     token values
                                        number of tokens
5.2.3 Example of an execution trace graph
The dataflow program described in Section 2.5.4 is used in order to show the main structure of
an ETG. The firing set S contains nine action firings S = {s1, s2, . . . , s9} which are summarized
in Table 2.2. The firing set S can be divided into three sub-sets, one for each actor of the
network, SP = {s1, s2, s3}, SF = {s4, s5, s6} and SC = {s7, s8, s9}, such that S = SP ∪ SF ∪ SC and
$S_{a_n} \cap S_{a_m} = \emptyset$ for each pair of actors $a_m \neq a_n$. Sets SP, SF and SC
contain the firings of Producer, Filter and Consumer respectively. The dependencies set D
contains sixteen dependencies D = {e1, e2, . . . , e16} which are summarized in Table 5.2.
Even though the firing sequence of this program has already been illustrated in Section 2.5.4,
it is worth re-describing part of it, and highlighting how the ETG of Figure 5.2 can be obtained.
In this context, let’s suppose the mapping configuration m1 described in Table 4.1 is used.
This particular mapping configuration defines a single partition configuration ρ1 = {P, F, C}
(i.e. all the actors are assigned to the same processing element), the scheduler configuration
σ1 = {P, P, P, F, F, F, C, C, C} (i.e. predefined and static) and the buffer size configuration
$\beta_1^1 = \beta_1^2 = 512$. The execution Gantt chart is depicted in Figure 4.3a, where each
action firing takes one (abstract) clock cycle to conclude its execution. At time t = 0, the
scheduler imposes the execution of the actor Producer, which fires the action produce. This single
action firing is denoted with s1. During its execution, s1 updates the internal actor variable
counter: the initial value is counter_i = 0 and the final value is counter_f = 1. Finally, the
firing concludes by writing an output token τ1 = 1 on the output port O. At time t = 1 the
scheduler again selects the actor Producer, which again fires the action produce. This second
firing is denoted with s2. Also s2 updates the internal actor variable counter: the initial value
is counter_i = 1 and the final value is counter_f = 2. Finally, the firing concludes by writing an
output token τ2 = 2 on the output port O. During the firing s2 the internal state variable counter
initially has the value previously written by the firing s1: hence an internal variable dependency
between s1 and s2 can be defined. This is denoted with e1. As both firings wrote this variable, the
dependency direction is write/write. Moreover, both s1 and s2 wrote a token on the same
output port: hence a port dependency, with direction write/write, can be defined. This is denoted
with e2. The same happens at time t = 2, when the same action is fired for the third time in
a row. This new firing is denoted with s3. Also in this case an internal variable and a port
dependency can be defined with the previous firing s2: these are e3 and e4 respectively. At
time t = 3, the scheduler imposes the execution of the actor Filter, which fires the action
invert. This action firing is denoted with s4. During the firing, s4 consumes the token τ1
from its input port I and produces an output token τ4 on its output port O. As the input token
τ1 was previously produced by the firing s1, a token dependency between s1 and s4 can be
defined: this is denoted with e5. At time t = 4, the second execution of the actor Filter,
imposed by the scheduler, again fires the action invert. This action firing is denoted with s5.
This firing also reads the token τ2 from the input port I and writes the token τ5 on the output
port O. As τ2 was previously produced by s2, a new token dependency can be defined: this is
denoted with e8. Furthermore, as the firing s5 read and wrote tokens from and to the same
ports as the firing s4, two new port dependencies can be defined: these are denoted with e6
and e7, which have as direction read/read and write/write respectively. The execution of the
entire program continues until t = 9 and the same considerations can be made in order to build
the remaining dependencies of the ETG.
Figure 5.2: Execution trace graph obtained after the execution of the CAL program described
in Section 2.5.4. The firing set S is summarized in Table 2.2, and the dependencies set D is
summarized in Table 5.2.
Table 5.2: Dependencies set D of the execution trace graph depicted in Figure 5.2.
(si, sj)  Source  Target  Kind      Direction    Parameter                               Attribute
e1        s1      s2      Variable  Write/Write  variable=counter                        initial=1, final=2
e2        s1      s2      Port      Write/Write  port=O                                  -
e3        s2      s3      Variable  Write/Write  variable=counter                        initial=2, final=3
e4        s2      s3      Port      Write/Write  port=O                                  -
e5        s1      s4      Token     -            count=1, target-port=I, source-port=O   value=1
e6        s4      s5      Port      Read/Read    port=I                                  -
e7        s4      s5      Port      Write/Write  port=O                                  -
e8        s2      s5      Token     -            count=1, target-port=I, source-port=O   value=2
e9        s5      s6      Port      Read/Read    port=I                                  -
e10       s5      s6      Port      Write/Write  port=O                                  -
e11       s3      s6      Token     -            count=1, target-port=I, source-port=O   value=3
e12       s4      s7      Token     -            count=1, target-port=I, source-port=O   value=-1
e13       s7      s8      Port      Read/Read    port=I                                  -
e14       s5      s8      Token     -            count=1, target-port=I, source-port=O   value=-2
e15       s8      s9      Port      Read/Read    port=I                                  -
e16       s6      s9      Token     -            count=1, target-port=I, source-port=O   value=-3
5.3 Properties
The aim of this section is to illustrate and demonstrate the main properties of an ETG. In fact,
as demonstrated in Chapter 6, these properties can be successfully exploited when exploring
and optimizing the design space of a dataflow program.
5.3.1 Topological order
As the ETG is a DAG(S, D) it is possible to define a partial order on the firings set S. This
topological order can be defined with a mapping function l : S → ℕ such that, for any two
distinct firings si and sj:
$$s_i \le s_j \Rightarrow l(s_i) < l(s_j) \qquad (5.8)$$
It must be noted that a DAG can have different valid topological orders. In other words,
given two valid topological mapping functions l1 and l2, it is possible that l1(s) ≠ l2(s) for
some firing s ∈ S. As demonstrated in the remainder of this section, an ETG can express the
maximum potential parallelism of the program. This property is strictly related to the fact that
a DAG generally admits several valid topological mapping functions.
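A valid mapping function l can be computed with any topological sorting routine. The sketch below assumes Python 3.9 or later and uses graphlib.TopologicalSorter from the standard library; the helper names topological_mapping and respects, as well as the three-firing graph, are illustrative. Checking the condition on the edges of D is sufficient, since it then extends to the transitive closure.

```python
from graphlib import TopologicalSorter

def topological_mapping(S, D):
    """Return one mapping l: S -> N with l(si) < l(sj) for every (si, sj) in D."""
    preds = {s: set() for s in S}
    for si, sj in D:
        preds[sj].add(si)
    order = TopologicalSorter(preds).static_order()
    return {s: position for position, s in enumerate(order)}

def respects(l, D):
    """Check the ordering condition on every dependency."""
    return all(l[si] < l[sj] for si, sj in D)

# Illustrative graph: s2 and s4 both depend on s1 and are mutually unordered.
S = {"s1", "s2", "s4"}
D = {("s1", "s2"), ("s1", "s4")}

l = topological_mapping(S, D)

# Two different hand-written mappings that are both valid.
l1 = {"s1": 0, "s2": 1, "s4": 2}
l2 = {"s1": 0, "s4": 1, "s2": 2}
```

Both l1 and l2 satisfy the condition although they assign different values to s2 and s4, which illustrates that the freedom left by the partial order is exactly the room available for parallel or reordered execution.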
Execution trace space
By recalling the notation introduced in Section 5.1, (S, ≤) and (S, dX) represent the po-space
and the d-space, respectively, of the program execution defined by the firings set S and the
dependencies set D. The po-space (S, ≤) refers to the collection of firings ordered by their
dependencies. Similarly, the d-space (S, dX) refers to the collection of directed paths that can
be defined among the firings by following their outgoing dependencies. Consequently, the
ETG defines what is called the execution trace space of a program that has been formalized in
Section 5.1.3.
5.3.2 Mapping independence
The mapping independence property can be demonstrated using the same example dataflow
program described in Section 5.2.3 and analyzing the ETGs that are obtained using the different
mapping configurations defined in Table 4.1. It must be noted that those considerations are
valid only for deterministic actors, in the sense that the execution is not time dependent.
Scheduling independence
Let’s consider the two mapping configurations m1 and m2, which differ only in how the
scheduling configuration has been defined. The first was used in Section 5.2.3 to illustrate
how the ETG depicted in Figure 5.2 has been obtained. In this case the firings set S has
been obtained with the following order S(m1) = {s1, s2, s3, s4, s5, s6, s7, s8, s9} as depicted in the
Gantt chart in Figure 4.3a. Using the scheduler configuration defined in m2, the order of the
firing steps changes. In this case, S(m2) = {s1, s4, s7, s2, s5, s8, s3, s6, s9} as depicted in the
Gantt chart in Figure 4.3b. Following the considerations made in Section 5.2.3, the two
dependencies sets D(m1) and D(m2) are the same. This leads to the same ETG as the
one depicted in Figure 5.2, since the partial order of the firings defined on both S(m1) and
S(m2) is the same. This demonstrates that the partial order of the firings set S defined only
using the dependencies set D does not give any information about the scheduling policy
used for constructing the ETG. Hence, the ETG does not depend on the scheduling policy.
Additional edges should be introduced in the ETG in order to define a stricter ordering of S
that models the scheduling configuration. If the objective is
to model S(m1) and S(m2) as the totally ordered sets S(m1) = {s1 < s2 < s3 < s4 < s5 < s6 < s7 < s8 < s9} and
S(m2) = {s1 < s4 < s7 < s2 < s5 < s8 < s3 < s6 < s9}, some additional edges should be introduced
as depicted in Figure 5.3a and Figure 5.3b, respectively. Those additional edges are depicted
with dashed arrows so that they are not confused with the dependencies defined in the
previous section. In fact, those additional edges on the ETG are only used to model the
constraints imposed by the scheduler.
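The role of the additional scheduling edges can be checked mechanically: a firing sequence is compatible with the ETG exactly when adding one edge between each pair of consecutive firings does not create a cycle. The following is a minimal sketch assuming Python 3.9 or later (graphlib); the helper names scheduling_edges and is_consistent are hypothetical, and the edge list is a hand transcription of the example with parallel edges collapsed.

```python
from graphlib import CycleError, TopologicalSorter

# Dependencies of the Producer, Filter, Consumer example (parallel edges collapsed).
D = [
    ("s1", "s2"), ("s2", "s3"), ("s4", "s5"), ("s5", "s6"),
    ("s7", "s8"), ("s8", "s9"), ("s1", "s4"), ("s2", "s5"),
    ("s3", "s6"), ("s4", "s7"), ("s5", "s8"), ("s6", "s9"),
]

def scheduling_edges(sequence):
    """Additional edges chaining consecutive firings of a static schedule."""
    return list(zip(sequence, sequence[1:]))

def is_consistent(dependencies, extra_edges):
    """True if dependencies plus the extra edges still form a DAG."""
    preds = {}
    for si, sj in list(dependencies) + list(extra_edges):
        preds.setdefault(si, set())
        preds.setdefault(sj, set()).add(si)
    try:
        TopologicalSorter(preds).prepare()  # raises CycleError on a cycle
    except CycleError:
        return False
    return True

S_m1 = ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "s9"]
S_m2 = ["s1", "s4", "s7", "s2", "s5", "s8", "s3", "s6", "s9"]
S_bad = ["s4", "s1", "s2", "s3", "s5", "s6", "s7", "s8", "s9"]  # s4 before its producer s1
```

Both S_m1 and S_m2 are accepted with the same D, whereas S_bad is rejected because the scheduling edge from s4 to s1 contradicts the token dependency from s1 to s4.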
Partitioning independence
The same considerations about the ETG scheduling independence can also be made for
the partitioning configuration. Let’s consider the two mapping configurations m2 and m4
defined in Table 4.1. In m2 all the actors are mapped in one partition, contrary to m4 where
two partitions are defined. The scheduling and buffer size configurations of m2 and m4 are
the same. Considering m4, the firings set S has been obtained with the following order
S(m4) = {s1, s4, s2, s7, s5, s3, s8, s6, s9} as depicted in the Gantt chart in Figure 4.3d. Following the
same considerations made in Section 5.2.3, since the dependencies set D(m4) is the same as
D, in this case the corresponding ETG is also the one depicted in Figure 5.2. If the objective is
to model the additional constraints imposed by the scheduling configuration, some additional
edges should be introduced as depicted in Figure 5.3d. Those additional edges are depicted
with dashed arrows so that they are not confused with the dependencies defined in the
previous section. In fact, those additional edges on the ETG are only used to model the
constraints imposed by the scheduler defined in each partition: $\sigma_4^1$ and $\sigma_4^2$,
respectively. This leads to the following partially ordered set
S(m4) = {s1 < s4 < s2 ≤ s7 < s5 < s3 ≤ s8 < s6 < s9}.
When dependencies are satisfied, firings of actors mapped on $\rho_4^1$ can be executed in parallel
to firings of actors mapped on $\rho_4^2$. For example, s2 ≤ s7 means that both s2 and s7 can be
fired during the same clock cycle, as depicted in Figure 4.3d.
Buffer size independence
The same considerations about the ETG scheduling and partitioning independence can also
be made for the buffer size configuration. Let’s consider the two mapping configurations m3
and m4 defined in Table 4.1. In m3 the buffer size configuration is defined as
$\beta_3^1 = \beta_3^2 = 1$, contrary to m4 where the buffer size configuration is defined as
$\beta_4^1 = \beta_4^2 = 512$. The partitioning
and scheduling configurations of m3 and m4 are the same. Considering m3, the firings set S
has been obtained with the following order S(m3) = {s1, s4, s2, s7, s5, s3, s8, s6, s9} as depicted
in the Gantt chart in Figure 4.3d. Following the same considerations made in Section 5.2.3,
since the dependencies set D(m3) is the same as D, in this case the corresponding ETG is also
the one depicted in Figure 5.2. This is because tokens are produced and consumed by the
same firings. Hence no additional dependencies should be considered for D(m3), and as a
consequence D = D(m3) = D(m4). As such, changing the buffer size of a program (provided that it is
not time dependent) does not change the partial order of S imposed by D. This demonstrates
that the ETG is independent of the buffer size used during the program execution.
5.3.3 Untimed
The ETG does not contain any information about the timing of firings and dependencies.
The only information that can be obtained is a partial ordering of the firings. In other words,
the dependency (si, sj) ∈ D defines only that sj can be fired only after the complete firing
of si. Let’s consider the two mapping configurations m5 and m6, which differ only in how
the buffer size configuration has been defined. As can be seen from the two Gantt charts
depicted in Figure 4.3e and Figure 4.3f, respectively, the firing of s2 takes 2 clock cycles using the
mapping configuration m5 and 1 clock cycle using the mapping configuration m6. However,
this information is not defined in the ETG. Section 5.4 discusses how the ETG can be extended
in order to define timing information for both the firings and the dependencies.
5.3.4 Maximum parallelism
The ETG defines the maximum parallel execution that can be performed by the dataflow
program. In fact, as described previously, it is completely independent of the mapping
configuration. In other words, the precedence relations among firings are imposed only by the
order in which data must be processed. For example, a token dependency defines that the firing
that consumes tokens can only be executed after the firing that produced those tokens has
been completed. The same holds for the other kinds of dependencies. As such, the dependencies
set D defines only the minimal information, based on data processing (i.e. tokens, internal
variables) and resource utilization (i.e. ports, guards), that should be respected in order to obtain
a correct program execution. The constraints imposed by a particular mapping configuration
can only be modeled by introducing additional edges as discussed in Section 5.3.2. The ETG
without additional edges can be seen as the execution of the program using a fully-parallel
mapping configuration (i.e. where each partition contains only one actor). Let’s consider for
example the mapping configuration m6, where each actor is mapped in a separate partition.
In this case the resulting ETG is depicted in Figure 5.3f, where the additional edges imposed
by the internal scheduler of each partition do not restrict the partial order of the ETG. This
demonstrates that the ETG defined by S and D expresses the maximum parallelism of the
application.
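One simple way to read the available parallelism off an ETG is to place every firing at its earliest possible step and count the firings per step. The sketch below does this under the assumption that every firing takes one abstract time unit; the function name asap_levels is hypothetical, the edge list is a hand transcription of the example with parallel edges collapsed, and the per-step count is only an as-soon-as-possible estimate, not an exact computation of the width of the partial order.

```python
from collections import Counter

def asap_levels(S, D):
    """Earliest unit-time step of each firing: longest path from a source."""
    preds = {s: [] for s in S}
    for si, sj in D:
        preds[sj].append(si)
    levels = {}

    def level(s):
        # A source sits at step 0; any other firing one step after its latest predecessor.
        if s not in levels:
            levels[s] = 1 + max((level(p) for p in preds[s]), default=-1)
        return levels[s]

    for s in S:
        level(s)
    return levels

S = [f"s{i}" for i in range(1, 10)]
D = [
    ("s1", "s2"), ("s2", "s3"), ("s4", "s5"), ("s5", "s6"),
    ("s7", "s8"), ("s8", "s9"), ("s1", "s4"), ("s2", "s5"),
    ("s3", "s6"), ("s4", "s7"), ("s5", "s8"), ("s6", "s9"),
]

levels = asap_levels(S, D)
profile = Counter(levels.values())      # number of firings per step
critical_path_steps = max(levels.values()) + 1
```

For this example the nine firings fit in five steps and at most three firings share a step, which corresponds to the three actors running on three separate partitions.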
[Figure 5.3 panels, each showing the firings s1 to s9: (a) Mapping configuration m1, (b) Mapping configuration m2, (c) Mapping configuration m3, (d) Mapping configuration m4, (e) Mapping configuration m5, (f) Mapping configuration m6.]
Figure 5.3: Execution Trace Graphs of the CAL network depicted in Fig. 2.10. Dashed lines
represent additional edges that model a particular scheduling configuration defined within
the mapping configurations described in Table 4.1.
5.3.5 Data dependent
The ETG can vary between two different executions of a program if this program contains at least
one actor that is data dependent. For example, let’s consider the CAL actor Split defined in
Listing 2.2. This is composed of two actions, A and B. The firing conditions of both
actions require that one input token is available in the input port I. However, action A
is fireable only if the token value is val ≥ 0, and action B only if val < 0. Let’s suppose that two
input sequences are available in the input port I: I1 = {0, 1, −10, −5} and I2 = {−1, −1, 0, −1},
respectively. Hence, the firing sequence S = {s1, s2, s3, s4} of this actor defines different action
firings as illustrated in Table 5.3. It must also be noted that the dependencies set D can change
too. In this case, the program should be analyzed using different input data sequences in order to
generate representative ETGs on which a statistical analysis can be performed (see Chapter 9 for an
example of how data-dependent stream applications are analyzed).
Table 5.3: Firing sequence of the CAL actor Split defined in Listing 2.2 when two input
sequences are available in its input port I: I1 = {0, 1, −10, −5} and I2 = {−1, −1, 0, −1}, respectively.
Firing  Action (I1)  Action (I2)
s1      A            B
s2      A            B
s3      B            A
s4      B            B
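The action selection of Table 5.3 can be reproduced with a few lines. This is a minimal sketch that models only the guards described in the text (A for val ≥ 0, B for val < 0); the function name split_firings is hypothetical and the body of the actor in Listing 2.2 is not modeled.

```python
def split_firings(tokens):
    """Return the action fired for each input token of the Split actor."""
    return ["A" if val >= 0 else "B" for val in tokens]

I1 = [0, 1, -10, -5]
I2 = [-1, -1, 0, -1]

firings_I1 = split_firings(I1)
firings_I2 = split_firings(I2)
```

The two input sequences give two different action sequences for the same four firings s1 to s4, so the resulting ETGs differ even though the program is unchanged.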
5.3.6 Modeling a dynamic program execution
The execution of dynamic actors whose behavior is mapping dependent can also be
modeled using the ETG. The CAL actor used to demonstrate this property is the GuardedInverter
actor illustrated in Listing 5.1. This actor is composed of two actions A and B, an input
port I, an output port O and an internal actor variable m. The priority B > A is
defined: this makes action A fireable whenever action B is not. It is important
to note that action B is fireable whenever there is at least one input token on its input port
I and the guard condition m > 0 and m < 3 is satisfied. The state variable on which this
guard is defined is modified only by action A: consequently, only A can enable or disable
the guard. Two possible execution paths are illustrated in Figure 5.4a and Figure 5.4b. The
horizontal axis represents the time flow for action B and the vertical axis the time
flow for action A. It is assumed that each firing of A and B requires the same amount of
time. The list of firings, with the respective values assumed by the internal variable m and by the
guard condition, is reported in Table 5.4a and Table 5.4b, respectively. For this example, two
regions where the guard of B is enabled can be identified along the A-axis of the execution
path: these are called the guard enable windows n = 1 and n = 2, respectively. In those regions,
B can be executed if there is at least one input token on its input port I. It is now clear how
the execution path of a dynamic program can vary according to the mapping configuration
used for the execution. The following clarifies how such enabling and disabling windows can
be modeled so that the entire analysis process becomes independent of the mapping
configuration.
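Listing 5.1: GuardedInverter.cal

    actor GuardedInverter() int I ==> int O :

        int m := 0;

        // A consumes no tokens: it only advances the counter m (0, 1, ..., 4, 0, ...)
        A: action ==>
        do
            m := m + 1;
            if m = 5 then m := 0; end
        end

        // B negates one input token, but only while the guard (m is 1 or 2) holds
        B: action I:[val] ==> O:[-val]
        guard m > 0 and m < 3
        end

        // B is selected over A whenever both are fireable
        priority
            B > A;
        end

    end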
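The two execution paths reported in Table 5.4 can be replayed with a small Python sketch. It is not a CAL interpreter: it only tracks the variable m and the guard of B as described above, and the names `replay`, `PATH_A` and `PATH_B` are hypothetical. Token availability, which is what the mapping actually changes, is encoded directly in the order of the actions of each path.

```python
def guard(m):
    """Guard of action B in Listing 5.1: m > 0 and m < 3."""
    return 0 < m < 3


def replay(path):
    """Replay a sequence of action names ('A' or 'B').

    Returns (rows, final_m), where each row is
    (firing, action, m_before, m_after, guard_before, guard_after).
    """
    m = 0
    rows = []
    for k, action in enumerate(path, start=1):
        before = m
        if action == "A":
            m += 1
            if m == 5:
                m = 0
        elif action == "B":
            if not guard(m):
                raise ValueError(f"s{k}: B fired while its guard is disabled")
        else:
            raise ValueError(f"s{k}: unknown action {action!r}")
        rows.append((f"s{k}", action, before, m, guard(before), guard(m)))
    return rows, m


PATH_A = "AABBAAAABBBAAA"  # action sequence of Table 5.4a
PATH_B = "ABAAAAAABBBBAA"  # action sequence of Table 5.4b

for name, path in (("5.4a", PATH_A), ("5.4b", PATH_B)):
    rows, final_m = replay(path)
    print(f"path {name}: final m = {final_m}")
```

Both paths contain nine firings of A and five firings of B, so they end in the same internal state even though B fires at different steps.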
Using guard enable and disable dependencies
Even though the two previous paths are equivalent (i.e. they end in the same internal actor
state configuration, as illustrated in Table 5.4), ignoring the guard enable and disable
dependencies makes the ETG dependent on the mapping configuration used during the
execution. This can be seen from the ETGs obtained from the two execution paths, depicted in
Figure 5.4c and Figure 5.4d, where neither guard enable nor guard disable dependencies are
considered. Considering for example the first ETG, depicted in Figure 5.4c, the firing s3 can be
executed whenever the guard is enabled. In other words, it is possible to fire s3 after the execution
of s1 and before the execution of s5, but also after the execution of s8 and before the execution
of s13. The same argument holds for each firing of B. It is therefore possible to identify two
equivalent constraints on the partial order of the ETG: s1 < sb < s5 and s8 < sb < s13, where sb
identifies any firing of B. These two conditions can be modeled with the guard enable and
disable dependencies, as illustrated in Figure 5.5. Each guard enable and disable dependency
is coupled with an appearance order that identifies which guard enabling window is modeled.
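As a sketch of how the guard enable windows and their appearance order could be extracted from an execution path, the Python fragment below scans the firings of A (the only action that modifies m) and records where the guard of B switches on and off. The function names are hypothetical and the path encoding (one character per firing) is an assumption made for this example only.

```python
def guard_windows(path):
    """Return the guard enable windows as (enabling, disabling) firing indices."""
    m = 0
    opened = None      # index of the A firing that enabled the guard, if any
    windows = []
    for k, action in enumerate(path, start=1):
        if action != "A":
            continue   # only action A modifies m
        was_enabled = 0 < m < 3
        m = (m + 1) % 5            # m := m + 1; if m = 5 then m := 0
        now_enabled = 0 < m < 3
        if now_enabled and not was_enabled:
            opened = k
        elif was_enabled and not now_enabled:
            windows.append((opened, k))
            opened = None
    return windows


def window_of(b_index, windows):
    """Return the appearance order n of the window containing a firing of B."""
    for n, (enable, disable) in enumerate(windows, start=1):
        if enable < b_index < disable:
            return n
    return None


PATH_A = "AABBAAAABBBAAA"  # execution path of Figure 5.4a
windows = guard_windows(PATH_A)
print(windows)                                              # [(1, 5), (8, 13)]
print([window_of(k, windows) for k in (3, 4, 9, 10, 11)])   # [1, 1, 2, 2, 2]
```

The two windows correspond to the constraints s1 < sb < s5 (n = 1) and s8 < sb < s13 (n = 2) stated above.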
Removing cyclic paths
However, this kind of dependency cannot be considered as a strict dependency, otherwise
it could cause the graph to become cyclic. Consequently, the ETG would no longer
represent a po-space and, as a result, a trace space. This happens when we consider a
[Figure 5.4: Two possible execution paths of the GuardedInverter actor illustrated in
Listing 5.1. The corresponding execution trace graphs do not take into account the guard
enable and disable dependencies.
(a) A first possible execution path (horizontal axis: B; vertical axis: A; guard enabled
windows n = 1 and n = 2; firings s1 to s14).
(b) A second possible execution path (horizontal axis: B; vertical axis: A; guard enabled
windows n = 1 and n = 2; firings s1 to s14).
(c) The execution trace graph corresponding to the execution path of Figure 5.4a without
considering the guard enable and disable dependencies. Firings of A: s1, s2, s5, s6, s7, s8,
s12, s13, s14. Firings of B: s3, s4, s9, s10, s11.
(d) The execution trace graph corresponding to the execution path of Figure 5.4b without
considering the guard enable and disable dependencies. Firings of A: s1, s3, s4, s5, s6, s7,
s8, s13, s14. Firings of B: s2, s9, s10, s11, s12.]
Table 5.4: Firings with the corresponding internal variable and guard values for the execution
trajectories and graphs depicted in Figure 5.4.

(a) Firings for the execution trajectory and graph depicted in Figure 5.4a and 5.4c, respectively.

Firing  Action  m initial  m final  Guard initial  Guard final
s1      A       0          1        disabled       enabled
s2      A       1          2        enabled        -
s3      B       2          -        -              -
s4      B       -          -        -              -
s5      A       -          3        -              disabled
s6      A       3          4        disabled       -
s7      A       4          0        -              -
s8      A       0          1        -              enabled
s9      B       1          -        enabled        -
s10     B       -          -        -              -
s11     B       -          -        -              -
s12     A       -          2        -              -
s13     A       2          3        -              disabled
s14     A       3          4        disabled       -

(b) Firings for the execution trajectory and graph depicted in Figure 5.4b and 5.4d, respectively.

Firing  Action  m initial  m final  Guard initial  Guard final
s1      A       0          1        disabled       enabled
s2      B       1          -        enabled        -
s3      A       -          2        -              -
s4      A       2          3        -              disabled
s5      A       3          4        disabled       -
s6      A       4          0        -              -
s7      A       0          1        -              enabled
s8      A       1          2        enabled        -
s9      B       2          -        -              -
s10     B       -          -        -              -
s11     B       -          -        -              -
s12     B       -          -        -              -
s13     A       -          3        -              disabled
s14     A       3          4        disabled       -
[Figure 5.5: Guard enable and disable dependency couples that model the guard enable
windows n = 1 and n = 2 depicted in Figure 5.4. The firing sb represents a generic firing of the
action B. Firings of A: s1, s2, s5, s6, s7, s8, s12, s13, s14. Dependencies attached to sb:
enable1, disable1, enable2, disable2.]
path with an (n+1)-th guard enable and an n-th guard disable (where n represents the appearance
order of the enabling window). Consequently, when analyzing the dependency graph, for each
executed firing of the guarded action B only one of the available guard enable and disable
couples with the same appearance order shall be taken into account (the others are discarded).
For the previously described example, the two ETGs that model the two execution paths
illustrated in Figure 5.4 are depicted in Figure 5.6.
It must be noted that this formalism makes the two ETGs illustrated in Figure 5.6
equivalent. In fact, both ETGs can model either the first or the second execution path by choosing
the appropriate guard enable and disable couple.
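The reason why only couples with the same appearance order may be combined can be checked on the example with a short Python sketch. It assumes that the firings of A form a chain (each firing of A depends on the previous one through the variable m) and attaches to one firing of B a guard enable edge taken from one window and a guard disable edge taken from another. The names below are hypothetical, and `graphlib` from the Python standard library (3.9 or later) is used only to detect cycles.

```python
from graphlib import CycleError, TopologicalSorter

# Firings of A in the path of Figure 5.4a, assumed chained through variable m.
A_CHAIN = ["s1", "s2", "s5", "s6", "s7", "s8", "s12", "s13", "s14"]
# Guard enable windows: appearance order -> (enabling firing, disabling firing).
WINDOWS = {1: ("s1", "s5"), 2: ("s8", "s13")}


def is_acyclic(enable_order, disable_order, b="s3"):
    """Check the graph obtained with an enable/disable couple for firing b."""
    preds = {s: set() for s in A_CHAIN + [b]}
    for src, dst in zip(A_CHAIN, A_CHAIN[1:]):
        preds[dst].add(src)                       # state variable dependencies
    preds[b].add(WINDOWS[enable_order][0])        # guard enable:  s_enable -> b
    preds[WINDOWS[disable_order][1]].add(b)       # guard disable: b -> s_disable
    try:
        list(TopologicalSorter(preds).static_order())
        return True
    except CycleError:
        return False


print(is_acyclic(1, 1))  # couple of window n = 1
print(is_acyclic(2, 2))  # couple of window n = 2
print(is_acyclic(2, 1))  # enable of window 2 with disable of window 1
```

Mixing the enable of window 2 with the disable of window 1 closes the loop s3 → s5 → s6 → s7 → s8 → s3, whereas both same-order couples leave the graph acyclic.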
5.4 Timed execution trace graph
Time information is added to an ETG by defining a corresponding time value for each firing
and each dependency. For this purpose, the ETG is transformed into a weighted graph,
which is a special type of labeled graph where labels are numbers (in this specific case, always
non-negative) called weights.
The timed execution trace graph (TETG) is formally defined by extending the notation of the
ETG as a DAG(S, D, ΨS, ΨD) where:
• ΨS : S → R+ is the firing weight mapping function.
• ΨD : D → R+ is the dependency weight mapping function.
In other words, each firing si ∈ S is assigned a time value called the firing weight, defined
as w(si) ≥ 0. Similarly, the dependency weight w(si, sj) ≥ 0 is defined for each dependency
(si, sj) ∈ D.
[Figure 5.6: The ETGs related to the execution paths depicted in Figure 5.4a and Figure 5.4b,
where for each firing of B a couple of guard enable and disable dependencies has been
considered in order to model the guard enabled windows n = 1 and n = 2.
(a) The execution trace graph corresponding to the execution path of Figure 5.4a, with one
couple of guard enable and disable dependencies for each firing of B. Firings of A: s1, s2, s5,
s6, s7, s8, s12, s13, s14. Firings of B: s3, s4, s9, s10, s11. The couples (e1, e2) and (e3, e4)
model the firings s3 and s4, respectively, in the guard enable window n = 1. The couples
(e5, e6), (e7, e8) and (e9, e10) model the firings s9, s10 and s11, respectively, in the guard
enable window n = 2.
(b) The execution trace graph corresponding to the execution path of Figure 5.4b, with one
couple of guard enable and disable dependencies for each firing of B. Firings of A: s1, s3, s4,
s5, s6, s7, s8, s13, s14. Firings of B: s2, s9, s10, s11, s12. The couple (e1, e2) models the
firing s2 in the guard enable window n = 1. The couples (e3, e4), (e5, e6), (e7, e8) and
(e9, e10) model the firings s9, s10, s11 and s12, respectively, in the guard enable window n = 2.]
5.4.1 Firing weight
The firing weight w(si) models the time required to entirely execute the action firing si. In
other words, using the action execution model discussed in Section 2.5.2, w(si) should model
not only the time required for executing the algorithmic part of the fired action, but also the
time required for reading and writing the input and output tokens. Therefore, w(si) can be
defined as the combination of five terms:
• Wait for available input tokens: models the waiting time of si for the availability of all
its input tokens (i.e. blocking reading).
• Read input tokens: models the time required by si for reading all its input tokens.
• Algorithmic part execution: models the time required by si for executing the action
algorithmic part.
• Wait for available output space: models the waiting time of si for the availability of the
necessary output token places (i.e. blocking writing).
• Write output tokens: models the time required by si for writing all its output tokens.
These terms can vary according to the mapping configuration chosen for the application
implementation. Using the formalism illustrated in Section 4.2.3, where the mapping configuration
has been defined as a 3-tuple (σ, ρ, β), w(si) can be defined as:

$$w(s_i) = f(s_i, \rho, \beta) \tag{5.9}$$

where f is a function of the partitioning and the buffer size configurations only.
Linear model
The firing weight model of Equation 5.9 can be simplified as a linear combination of terms:

$$w(s_i) = w(s_i)_{rd} + w(s_i)_{r} + w(s_i)_{e} + w(s_i)_{wd} + w(s_i)_{w} \tag{5.10}$$

where the meaning of each term is summarized in Table 5.5. Some examples of different
techniques that can be used to measure or estimate these terms are discussed in Section 8.1.
Table 5.5: Firing weight parameters for the linear model of Equation 5.10.

Parameter        Name                              Description
$w(s_i)_{rd}$    Wait for available input tokens   waiting time of si for the availability of all its input tokens (i.e. blocking reading)
$w(s_i)_{r}$     Read input tokens                 time required by si for reading all its input tokens
$w(s_i)_{e}$     Algorithmic part execution        time required by si for executing the action algorithmic part
$w(s_i)_{wd}$    Wait for available output space   waiting time of si for the availability of the necessary output token places (i.e. blocking writing)
$w(s_i)_{w}$     Write output tokens               time required by si for writing all its output tokens
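The linear model of Equation 5.10 amounts to a sum of five non-negative terms. A minimal Python sketch is given below; the class name `FiringWeight`, its field names and the numeric values are hypothetical, and the time unit is left unspecified.

```python
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class FiringWeight:
    """The five terms of the linear firing weight model (Equation 5.10)."""
    wait_input: float = 0.0    # w(s_i)_rd: wait for available input tokens
    read: float = 0.0          # w(s_i)_r:  read input tokens
    execute: float = 0.0       # w(s_i)_e:  algorithmic part execution
    wait_output: float = 0.0   # w(s_i)_wd: wait for available output space
    write: float = 0.0         # w(s_i)_w:  write output tokens

    def __post_init__(self):
        if any(term < 0 for term in astuple(self)):
            raise ValueError("firing weight terms must be non-negative")

    @property
    def total(self):
        """w(s_i) as the sum of the five terms."""
        return sum(astuple(self))


# Illustrative values only.
w = FiringWeight(wait_input=3.0, read=1.0, execute=10.0, wait_output=0.0, write=2.0)
print(w.total)
```

Only the two waiting terms depend on the behavior of the other actors at run time, which is why they are kept separate from the read, execute and write terms.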
5.4.2 Dependency weight
The dependency weight w(si , s j ) models the time required to make the dependency (si , s j ) ∈D
available to the target firing step s j after the execution of the firing si has been completely
performed. Consequently, this value may depend on the particular mapping configuration
m = (σ, ρ, β) ∈ M and it is defined as:

$$w(s_i, s_j) = f(s_i, m) = f(s_i, \sigma, \rho, \beta) \tag{5.11}$$

where f is a function of the scheduling, the partitioning and the buffer size configurations.
Depending on the kind of (si, sj), this weight may model different factors. For example, if
(si, sj) ∈ Dt is a token dependency, then w(si, sj) can model the time required by the buffer to
receive the corresponding tokens and make them available. The same considerations apply
to state variable dependencies (si, sj) ∈ Dv, where the token is now a state variable and the
buffer a local memory region. So, considering a read/write internal variable dependency, the
weight corresponds to the time required for reading and storing the updated value of that
internal variable. Similarly, if (si, sj) ∈ Df is a finite state machine dependency, then w(si, sj)
defines the time required by the internal actor scheduler to select the specific action firing.
Furthermore, when additional fictitious dependencies are introduced to model the scheduling
configuration σ, the weight of these fictitious dependencies models the time required by the
partition scheduler to select the corresponding actor, as discussed in Section 8.1.
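To show how firing weights and dependency weights combine on a timed ETG, the Python sketch below computes the earliest completion time of each firing on a small weighted DAG. It is an illustration under strong assumptions (every firing starts as soon as all its dependencies are available, with no resource contention), not the post-mortem scheduler discussed in Section 8.1; the function name and the example weights are hypothetical.

```python
from graphlib import TopologicalSorter


def finish_times(firing_w, dep_w):
    """Earliest completion time of each firing of a timed ETG.

    firing_w: dict firing -> w(s_i)
    dep_w:    dict (s_i, s_j) -> w(s_i, s_j)
    """
    preds = {s: set() for s in firing_w}
    for si, sj in dep_w:
        preds[sj].add(si)
    finish = {}
    # Visit the firings in an order compatible with the dependencies.
    for s in TopologicalSorter(preds).static_order():
        start = max((finish[p] + dep_w[(p, s)] for p in preds[s]), default=0.0)
        finish[s] = start + firing_w[s]
    return finish


firing_w = {"s1": 2.0, "s2": 3.0, "s3": 1.0}
dep_w = {("s1", "s2"): 1.0, ("s1", "s3"): 0.0, ("s2", "s3"): 2.0}
print(finish_times(firing_w, dep_w))  # {'s1': 2.0, 's2': 6.0, 's3': 9.0}
```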
5.5 Transformations
In the following some ETG transformations are discussed. These represent an overview of the
main graph-based transformations that can be applied to an ETG. These are extensively used
in the rest of this dissertation when the ETG is used to explore the design space of a dataflow
program.
5.5.1 Firing expansion
The firing expansion of an ETG is a new DAG(V, E) where the set of vertices is obtained by
defining, for each firing $s_i \in S$, two new vertices $\pi^{s_i}_{2i-1} \in V$ and
$\pi^{s_i}_{2i} \in V$, connected by a directed edge
$(\pi^{s_i}_{2i-1}, \pi^{s_i}_{2i}) \in E$. Moreover, each dependency $(s_i, s_j) \in D$ is
transformed into a new directed edge $(\pi^{s_i}_{2i}, \pi^{s_j}_{2j-1}) \in E$. As an example,
Figure 5.7 depicts the transformation of an ETG. It can be seen how, for each firing
$s_i \in S$, the corresponding $\pi^{s_i}_{2i-1} \in V$ inherits the incoming dependencies and,
similarly, the corresponding $\pi^{s_i}_{2i} \in V$ inherits the outgoing dependencies.
Furthermore, it is possible to define two new fictitious vertices $\pi_s$ and $\pi_t$, called the
source and sink vertex of G(V, E), respectively. For each vertex such that
$\delta(\pi^{s_i}_{2i-1})^{-}_{S} = \emptyset$ (i.e. that has no incoming edges) a new fictitious
edge $(\pi_s, \pi^{s_i}_{2i-1}) \in E$ is defined. Similarly, for each vertex such that
$\delta(\pi^{s_i}_{2i})^{+}_{S} = \emptyset$ (i.e. that has no outgoing edges) a new fictitious
edge $(\pi^{s_i}_{2i}, \pi_t) \in E$ is defined.
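The expansion rules above can be sketched in Python as follows. The vertex encoding (a tuple holding the firing name and the index 2i−1 or 2i), the labels `pi_s` and `pi_t`, and the three-firing example are assumptions made for this illustration.

```python
def firing_expansion(firings, dependencies):
    """Expand an ETG into DAG(V, E).

    firings:      ordered list [s1, s2, ...]; s_i is the i-th element
    dependencies: list of (s_i, s_j) pairs
    """
    dependencies = list(dependencies)
    head = {s: (s, 2 * i - 1) for i, s in enumerate(firings, start=1)}
    tail = {s: (s, 2 * i) for i, s in enumerate(firings, start=1)}

    # One internal edge per firing, one edge per dependency.
    edges = [(head[s], tail[s]) for s in firings]
    edges += [(tail[si], head[sj]) for si, sj in dependencies]

    # Fictitious source and sink edges.
    has_incoming = {sj for _, sj in dependencies}
    has_outgoing = {si for si, _ in dependencies}
    edges += [("pi_s", head[s]) for s in firings if s not in has_incoming]
    edges += [(tail[s], "pi_t") for s in firings if s not in has_outgoing]

    vertices = ["pi_s"] + [v for s in firings for v in (head[s], tail[s])] + ["pi_t"]
    return vertices, edges


vertices, edges = firing_expansion(["s1", "s2", "s3"], [("s1", "s2"), ("s1", "s3")])
print(len(vertices), len(edges))
```

Representing each firing as an edge between two vertices lets firing weights and dependency weights be handled uniformly as edge weights.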
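[Figure 5.7: Firings expansion of an execution trace graph.
(a) Initial execution trace graph, with firings s1 to s5 and dependencies e1 to e5.
(b) Expanded version, with the source vertex $\pi_s$, the sink vertex $\pi_t$ and the vertex
pairs $(\pi^{s_1}_{1}, \pi^{s_1}_{2})$, $(\pi^{s_2}_{3}, \pi^{s_2}_{4})$,
$(\pi^{s_3}_{5}, \pi^{s_3}_{6})$, $(\pi^{s_4}_{7}, \pi^{s_4}_{8})$ and
$(\pi^{s_5}_{9}, \pi^{s_5}_{10})$; edges are labeled s1 to s5 and e1 to e5.]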
5.5.2 Dependency amalgamation
When analyzing the ETG, it is possible that the only requirement is to know the sets of
predecessors and successors of a given firing (see Equation (5.3) and Equation (5.6),
respectively). In that case, part of the information contained in the dependencies set D is
redundant whenever two firings si and sj are related by more than one dependency. Let
DA = {e1, e2, ..., en} ⊆ D be any subset of dependencies having the same endpoints. The
multi-dependency amalgamation (also called multi-edge amalgamation [155]) corresponding
to DA is an ETG that results from merging (amalgamating) all of the dependencies in DA into
a single unlabeled dependency, generally denoted by e• = e1 • e2 • ... • en. The set of
amalgamated dependencies is denoted by D•. Informally, the amalgamation can be seen as a
non-minimal transitive reduction of a graph.
As an example, in the ETG depicted in Figure 5.2, e1 and e2 have the same endpoints s1 and s2.
Hence, e1 and e2 can be amalgamated as e• = e1 • e2. In the same ETG other dependencies can
be amalgamated, as illustrated in Figure 5.8.
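A minimal Python sketch of the amalgamation groups the dependencies by their endpoints. Only the fact that e1 and e2 share the endpoints s1 and s2 is taken from the text; the function name and the endpoints given to e5 are hypothetical.

```python
from collections import defaultdict


def amalgamate(dependencies):
    """Merge dependencies that share the same endpoints.

    dependencies: dict label -> (s_i, s_j)
    Returns a dict (s_i, s_j) -> list of the labels merged into one dependency.
    """
    merged = defaultdict(list)
    for label, endpoints in dependencies.items():
        merged[endpoints].append(label)
    return dict(merged)


deps = {
    "e1": ("s1", "s2"),
    "e2": ("s1", "s2"),   # same endpoints as e1, as in the text
    "e5": ("s2", "s3"),   # hypothetical endpoints, for illustration only
}
for endpoints, labels in amalgamate(deps).items():
    print(endpoints, " • ".join(labels))
```

The predecessor and successor sets of every firing are unchanged by this operation, since only parallel dependencies are merged.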
5.5.3 Event-driven system representation
This section illustrates a methodology for converting the ETG of a dataflow program into a
discrete event system in the form of a Petri net (PN) [156, 157]. This conversion supports a
more systematic development of design space exploration heuristics based on the application
of automatic control methodologies (in this regard, see Chapter 6).
[Figure 5.8: Amalgamation of the execution trace graph illustrated in Figure 5.2. The graph
contains the firings s1 to s9; dependencies with the same endpoints are merged into single
amalgamated dependencies (e.g. e1 • e2).]
Petri nets
A PN is a particular kind of bipartite directed graph made up of three types of objects: places,
transitions, and directed arcs (for a complete overview about PN see Appendix A.1). Directed
arcs connect places to transitions or transitions to places. Each place can contain tokens: the
presence or the absence of a token can indicate whether a condition associated with this place
is true or false. Formally, a PN is defined as a tuple N (P,T, I ,O, M0), where:
• P = {p1, p2, . . . , pm} is a finite set of places.
• T = {t1, t2, . . . , tn} is a finite set of transitions.
• I : P ×T is the pre-incidence matrix that defines directed arcs from places to transitions.
• O : T ×P is the post-incidence matrix that defines directed arcs from transitions to
places.
• M0 is the initial marking of places.
The execution of a PN is controlled by the number and distribution of tokens over the places.
Similarly to the DPN with firings dataflow model, a PN also executes by firing transitions
governed by enabling and firing rules. In a PN a transition t can be enabled if all its input
places contain at least a number of tokens equal to the weight of the respective directed arcs.
The firing of an enabled transition removes from each input place the number of tokens equal
to the weight of the respective input directed arc and deposits in each output place a number
of tokens equal to the weight of the respective directed output arc. Mathematically, firing the
transition t at event k yields a new marking:
M(p,k)=M(p,k−1)− I (p, t )+O(t , p), ∀p ∈ P (5.12)
for any p ∈ P at each firing instant k ∈N.
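The enabling and firing rules of Equation (5.12) can be sketched in Python with plain nested lists. Following the definitions above, `pre[p][t]` stands for I(p, t) and `post[t][p]` for O(t, p); the function names and the tiny net used as an example are hypothetical.

```python
def is_enabled(marking, pre, t):
    """A transition is enabled if every input place holds enough tokens."""
    return all(marking[p] >= pre[p][t] for p in range(len(marking)))


def fire(marking, pre, post, t):
    """Return the marking M(p, k) = M(p, k-1) - I(p, t) + O(t, p) for all p."""
    if not is_enabled(marking, pre, t):
        raise ValueError(f"transition {t} is not enabled")
    return [marking[p] - pre[p][t] + post[t][p] for p in range(len(marking))]


# Tiny net: place 0 -> transition 0 -> place 1, with an output arc of weight 2.
pre = [[1],      # I(p0, t0) = 1
       [0]]      # I(p1, t0) = 0
post = [[0, 2]]  # O(t0, p0) = 0, O(t0, p1) = 2
m0 = [1, 0]

m1 = fire(m0, pre, post, 0)
print(m1)                       # [0, 2]
print(is_enabled(m1, pre, 0))   # False
```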
Why not transform the dataflow program directly to a PN?
When a dynamic dataflow program belonging to the DPN class is translated into a PN represen-
tation, it is in general required that the MoC of the resulting PN is modified accordingly [158].
This may imply, for instance, the use of a colored PN that allows tokens to carry values and that
preserves the order in their respective places (see, in this regard, [159, 160]). However, a more
effective approach is to directly transform the ETG into a PN. In such cases, the objective is to
obtain a mathematical description of the behavior of an ETG similar to the one provided by
Equation (5.12). A systematic approach to reach such objectives is to correlate the dependency
constraints defined in the ETG with the firing rules of the PN.
ETG to PN transformation
Intuitively, an ETG action firing $s_i \in S$ can be represented as a PN transition $t_i \in T$
that can be fired only if there are enough input tokens in its incoming places $p \in P$.
Similarly, each ETG dependency $(s_i, s_j) \in D$ can be represented as a PN place $p \in P$, for
which the place weight $W(p, t)$ is defined as the number of tokens $n_t$ expressed by the token
dependency if $(s_i, s_j) \in D_t$, or as unitary otherwise (i.e. $(s_i, s_j) \in D \setminus D_t$).
Furthermore, defining $T_{\emptyset^-} \subseteq T$ as the set of transitions whose respective ETG
action firings are contained in $S_{\emptyset^-}$ (i.e. the sources of the ETG, as defined in
Equation (5.4)), an additional fictive incoming place with unitary weight must be defined for
each of those transitions. The set of fictive places is referred to as $P_{\emptyset^-}$.
In order to model the fact that only the transitions contained in $T_{\emptyset^-}$ are initially
enabled (i.e. they do not depend on the firing of any other transition), one token is defined as
the initial marking only for the places contained in $P_{\emptyset^-}$ (i.e. $M_0(p) = 1$,
$\forall p \in P_{\emptyset^-}$ and $M_0(p) = 0$, $\forall p \in P \setminus P_{\emptyset^-}$).
It must be noted that the ETG amalgamation transformation illustrated in Section 5.5.2 can be
applied in order to reduce the number of equivalent PN places. The only requirement is that
token dependencies should not be amalgamated.
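In conclusion, an ETG is formally transformed to its equivalent PN as follows:

$$
\begin{aligned}
T &: s_i \mapsto t_i \in T && \forall s_i \in S \\
P &: (s_i, s_j) \mapsto p \in {}^{\bullet}t_i \cup t_j^{\bullet} && \forall (s_i, s_j) \in D \\
P_{\emptyset^-} &: p \in {}^{\bullet}t_i && \forall t_i \in T_{\emptyset^-}
\end{aligned}
\tag{5.13}
$$

where for each PN place the weight is defined as:

$$
W(p, t_i) =
\begin{cases}
n_t & \text{if } (s_i, s_j) \in D_t \\
1 & \text{otherwise}
\end{cases}
$$

where $n_t$ is the number of tokens defined in the token dependency. The initial PN marking is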
defined as:

$$
M_0(p) =
\begin{cases}
1 & \text{if } p \in P_{\emptyset^-} \\
0 & \text{otherwise}
\end{cases}
\tag{5.14}
$$
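The transformation rules can be sketched in Python as follows. The encoding is an assumption made for this example: each dependency is a tuple (si, sj, kind, n_tokens), the place created for a dependency is written by ti and read by tj with the same arc weight, and the names `etg_to_pn`, `pre`, `post` and `marking` are hypothetical.

```python
def etg_to_pn(firings, dependencies):
    """Build a Petri net from an ETG.

    firings:      list of firing names, e.g. ["s1", "s2"]
    dependencies: list of (s_i, s_j, kind, n_tokens); n_tokens is only used
                  when kind == "token"
    Returns (transition, pre, post, marking) where pre[(p, t)] and
    post[(t, p)] hold the arc weights and marking[p] the initial marking.
    """
    transition = {s: f"t{k}" for k, s in enumerate(firings, start=1)}
    pre, post, marking = {}, {}, {}
    targets = {sj for _, sj, _, _ in dependencies}

    def new_place():
        return f"p{len(marking) + 1}"

    # Fictive, initially marked input place for every source firing.
    for s in firings:
        if s not in targets:
            p = new_place()
            pre[(p, transition[s])] = 1
            marking[p] = 1

    # One place per dependency, weighted by the number of tokens if any.
    for si, sj, kind, n_tokens in dependencies:
        p = new_place()
        weight = n_tokens if kind == "token" else 1
        post[(transition[si], p)] = weight
        pre[(p, transition[sj])] = weight
        marking[p] = 0
    return transition, pre, post, marking


deps = [("s1", "s2", "token", 2), ("s1", "s3", "state", None), ("s2", "s3", "token", 1)]
transition, pre, post, marking = etg_to_pn(["s1", "s2", "s3"], deps)
print(marking)
```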
The transformation just introduced allows the behavior of a dataflow program to be
represented by means of an event-driven system. In this regard, Equation (5.12) can be revisited
in order to describe the evolution of the variables of interest of the Petri net and, in turn, of
the program. More precisely, introducing the incidence matrix of the net, defined as
A(p, t) = O(t, p) − I(p, t) (i.e. A = O′ − I, where [·]′ denotes the matrix transpose),
Equation (5.12) can be rewritten in the following more compact form (see e.g. [157]):

$$M(k+1) = M(k) + A\,u(k) \tag{5.15}$$

where u(k) is an n × 1 column vector with 1 as its i-th entry and 0 in the remaining n − 1 positions,
denoting that only the i-th transition ti fires at event k, and M(k) and M(k+1) are, respectively,
the marking vectors of the net before and after the firing occurrence. Equation (5.15) is usually
referred to as the state equation of the net and can be augmented by an output relation of
the form:

$$y(k+1) = C\,M(k+1) \tag{5.16}$$

in order to highlight suitable variables of interest which can be expressed as (linear) functions
of the tokens currently in the net places. The state equation description of a Petri net and,
more generally, of an event-driven system is needed when performance optimization of such
systems has to be achieved through control-theoretic approaches [161]. Thus, the
transformation defined above can be regarded as an effective and systematic way to extend
the use of such approaches to all the signal processing applications that can be cast within
the considered dataflow programming framework.
Example
As an example, the ETG to PN transformation just introduced can be applied to the ETG
of Figure 5.8, which yields the PN structure depicted in Figure 5.9. It must be noted that,
as required by the transformation of Equation 5.13, token dependencies of the ETG have not been
amalgamated. Moreover, the only fictitious place is $p_1 \in P_{\emptyset^-}$, whose initial
marking is one token. Places that correspond to a token dependency (i.e. p4, p5, p6, p9, p10, p11)
are denoted with a grey background.
[Figure 5.9: Petri net obtained from the execution trace graph depicted in Figure 5.2. The net
contains the transitions t1 to t9 and the places p1 to p13; places p4, p5 and p6 are labeled b1
and places p9, p10 and p11 are labeled b2.]
In this case, the incidence matrix A in Equation (5.15) is given by:

$$
A =
\begin{bmatrix}
-1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & -1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & -1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & -1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & -1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & -1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & -1 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & -1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & -1
\end{bmatrix}
$$

and the initial marking, according to Equation (5.14), is defined as:

$$
M_0 = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix}'
$$

where [·]′ denotes the matrix transpose operator. Moreover, supposing that the variable of
interest is the number of tokens stored in each buffer of the considered dataflow network
(i.e. b1 and b2 from Figure 2.5.4, respectively), the matrix C describing the output relation (5.16) is
defined as:

$$
C =
\begin{bmatrix}
0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 0 & 0
\end{bmatrix}
$$
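The state equation (5.15) and the output relation (5.16) can be evaluated on this example with a few lines of Python. The matrices A, C and M0 are copied from the text; the helper names `fire` and `output` are hypothetical, and the enabling check relies on the fact that all arc weights of this particular net are unitary.

```python
A = [
    [-1, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, -1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, -1, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, -1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, -1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, -1, 0, 0, 0],
    [0, 0, 0, 1, -1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, -1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, -1, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, -1, 0],
    [0, 0, 0, 0, 0, 1, 0, 0, -1],
    [0, 0, 0, 0, 0, 0, 1, -1, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, -1],
]
C = [
    [0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0],
]
M0 = [1] + [0] * 12


def fire(marking, i):
    """Apply M(k+1) = M(k) + A u(k) when u(k) selects transition t_i (1-based)."""
    nxt = [m + row[i - 1] for m, row in zip(marking, A)]
    if min(nxt) < 0:
        raise ValueError(f"t{i} is not enabled in this marking")
    return nxt


def output(marking):
    """Apply y = C M: number of tokens stored in buffers b1 and b2."""
    return [sum(c * m for c, m in zip(row, marking)) for row in C]


M = M0
for i in (1, 2, 4):
    M = fire(M, i)
    print(f"after t{i}: y = {output(M)}")
```

After firing t1, t2 and t4 the output is y = [1, 1]: one token is left in the places of b1 and one has appeared in the places of b2.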
5.6 Conclusions
In this chapter the notion of execution trace graph (ETG) of a dataflow program has been
formalized. It has been shown how the ETG provides a graph-based representation of the
execution of a dataflow program. The execution of a dataflow program has been modeled as a
directed acyclic graph where nodes represent single action firings and edges represent (data
or functional) dependencies between two different action firings. Notions of partially-ordered
sets (i.e. po-sets) and directed paths (i.e. d-paths) have been adapted to this execution model.
Different dependency kinds have been defined, notably the finite state machine dependencies,
the internal variable dependencies, the port dependencies, the token dependencies and the
guard dependencies. The importance of the guard (enable and disable) dependencies in the
context of dynamic dataflow programs has been discussed. In fact, it has been demonstrated
how, by using this kind of dependency, different execution trajectories can be modeled with
the ETG obtained through a serial program execution. Furthermore, the main properties
of the ETG have been illustrated with some examples demonstrating how this graph-based
representation is totally mapping independent and can be used to effectively estimate the
design performance through a post-mortem analysis. Finally, some ETG transformations have
been illustrated. For example, the transformation of the ETG into an event-driven system has
shown how the DSE can be performed using advanced control techniques.
6 TURNUS: a design space exploration
environment for CAL programs
In this chapter the main functionalities and the iterative design flow of TURNUS [16, 17, 19, 20],
a DSE environment for dynamic dataflow programs, are presented. Compared to the
state-of-the-art exploration tools illustrated in Section 4.3, its novel features include the
possibility both to estimate the design performance and to explore and optimize the design
space based on the analysis of the ETG presented in Chapter 5. Moreover, it provides an
application programming interface (API) for profiling CAL programs that can be used by third-party
dataflow compilers. In the following, the design flow is illustrated together with the high-level
models of the dataflow program and of the architecture. Furthermore, it is shown how
this environment can be integrated with existing dataflow environments.
6.1 Design flow features and capabilities
An overview of the TURNUS iterative design flow is depicted in Figure 6.1. This DSE
environment is composed of two main blocks: the TURNUS profiler, and the TURNUS ETG
post-mortem scheduling and analysis. The TURNUS profiler is used in the first stages of the
DSE for evaluating both the ETG and the high-level profiling information of a CAL program.
Subsequently, the ETG is post-mortem scheduled in order to estimate the design performance
and to explore and optimize the design space of the program. The results of the DSE can be used
by the designer, who is informed of which parts of the CAL program should be restructured, and
by third-party tools that implement the program on the mapped target architecture.
6.1.1 Profiler
The TURNUS profiler is used on top of a CAL compiler infrastructure where the source code is
interpreted. It provides a set of application programming interfaces (APIs) that are used to
perform a high-level profiling analysis of the code. The profiler API is illustrated in Section 7.4.
The only input of the profiler is the CAL program description, which is defined by the CAL
project and its collection of source code files. After
the profiled simulation, a high-level profiling data file is generated. This contains profiling
information concerning the workload and the buffer size utilization. Furthermore, using the
high-level profiling information, the ETG file is generated. The collection of profiling data and
how these are used to evaluate the ETG is illustrated in Section 7.2.
6.1.2 Execution trace graph post-mortem scheduling and analysis
The iterative DSE is performed by analyzing the ETG generated by the profiler. The ETG is
used for both the design performance estimation and the exploration of the design space. At
this step both the program and the architecture model (defined as illustrated in Section 6.2)
are used in order to estimate and define the possible design points. The performance
estimation can be enhanced by using clock-accurate profiling information retrieved by third-party
profilers (e.g. GNU gprof, Valgrind, ModelSim). The design space can be constrained using the
constraints information provided by the designer. Examples of available design space
optimizations are the possibility to minimize and optimize the buffer size configuration, to partition
the program on many-core platforms, and to reduce the dynamic power dissipation. At this stage, the
following analyses can be performed:
• Performance estimation: illustrated in Section 8.1, is used to estimate the design per-
formance for a given mapping configuration. The ETG post-mortem scheduler is based
on a discrete event simulator. The main functionality is to assign timing weight to
each firing and dependency of the ETG. The timed ETG is then used by the underlying
analysis provided by the framework.
• Critical path evaluation: illustrated in Section 8.2, is used to evaluate the critical path
of an application. It defines what is called design space critical path, which is used to
define bounds on the design space of the application.
• Impact analysis: illustrated in Section 8.3, is used to provide the code refactoring
directions to the designer. It provides a list of actors and actions where code refactoring
should be concentrated.
• Buffer dimensioning: illustrated in Section 8.4, provides a collection of heuristic al-
gorithms for estimating a feasible bounded buffer size configuration. Furthermore, a
solution for the problem of maximizing the application throughput and, at the same
time, minimizing the buffer size configuration is presented.
• Partitioning: illustrated in Section 8.5, provides a collection of heuristic algorithms
tailored for partitioning the application on multi-clock domains architectures. The
main requirements are that the application throughput is maximized and, at the same
time, the dynamic power dissipation is minimized.
The design space can be iteratively explored. For each iteration a mapping configuration file
is generated. This is used to drive third-party dataflow tools (e.g. low-level code generators)
during the design implementation stages. Section 6.3 illustrates a list of tools already integrated
with this framework.
[Figure 6.1: TURNUS design flow. Inputs: CAL program, architecture, constraints. Blocks:
compiler infrastructure; TURNUS profiler, which produces the execution trace graph and the
high-level profiling data; TURNUS execution trace graph post-mortem scheduling and analysis
(performance estimation, critical path, impact analysis, buffer dimensioning, partitioning);
code generation; synthesis or compilation; implementation. Outputs and feedback paths:
mapping configuration, compiler directives, code refactoring directions, refactoring
directions, profiling data.]
6.2 High-level models
In the following, the high-level models used in the framework to represent the CAL dataflow ap-
plication, the target architecture, the ETG and the program profiling information are illustrated.
These are presented using a (simplified) unified modeling language (UML) representation
that respects the framework APIs available in [16].
6.2.1 CAL dataflow program
The CAL dataflow program model describes the basic structure of the program. A specific
meta model representation is used in order to extend the interoperability of CAL tools al-
ready available. A CAL code compiler infrastructure that wants to make use of the TURNUS
framework should wrap its intermediate representation and generate a consistent program
model. The basic components of this representation are described in the following section.
The formalism that is used is the same as the one illustrated in Section 2.5.
Network
The Network object is used to model a dataflow program network N (A,B). As depicted in
Figure 6.2, this object is defined by the following elements:
• id: String attribute that identifies the network under analysis.
• sourceFile: String attribute that contains the relative source file path of the network
(i.e. the .xdf or .nl file name).
• version: Version element that contains the versioning information of the source file.
• project: String attribute that contains the name of the CAL project where the source
file is stored.
• classes: list of ActorClass elements contained in the network.
• actors: list of Actor elements contained in the network.
• buffers: list of Buffer elements contained in the network.
Actor-class
The ActorClass object is used to model an actor-class κ ∈K . As depicted in Figure 6.3, this
object is defined by the following elements:
• name: String attribute that identifies the actor-class.
[Figure 6.2: The Network object. UML class Network with the attributes id : String,
sourceFile : String and project : String, and the associations version (1, Version),
classes (1..n, ActorClass), actors (1..n, Actor) and buffers (1..n, Buffer).]
• nameSpace: String attribute used to represent the level of hierarchy of the source
file.
• sourceFile: String attribute that contains the relative source file path of the actor-
class.
• version: Version element that contains the versioning information of the source file.
• actions: list of Action elements contained in the actor-class.
• inputPorts: list of input Port elements contained in the actor-class.
• outputPorts: list of output Port elements contained in the actor-class.
• variables: list of Variable elements contained in the actor-class.
• procedures: list of Procedure elements contained in the actor-class.
It must be noted that the concatenation of the name space and the name cannot be shared
among actor-classes defined on the same Network.
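The uniqueness rule above can be checked mechanically. The following Java sketch is illustrative only: the class ActorClassNames, its methods and the use of a dot to join the name space and the name are assumptions made for this example and are not taken from the TURNUS API.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: all names below are hypothetical, not the TURNUS API.
final class ActorClassNames {

    // Joins name space and name; the dot separator is an assumption.
    static String qualifiedName(String nameSpace, String name) {
        return nameSpace.isEmpty() ? name : nameSpace + "." + name;
    }

    // Returns the qualified names that occur more than once in a network.
    // Each entry of 'classes' is a {nameSpace, name} pair.
    static List<String> duplicates(List<String[]> classes) {
        Set<String> seen = new HashSet<>();
        List<String> clashes = new ArrayList<>();
        for (String[] c : classes) {
            String qn = qualifiedName(c[0], c[1]);
            // Set.add returns false when the name was already present.
            if (!seen.add(qn) && !clashes.contains(qn)) {
                clashes.add(qn);
            }
        }
        return clashes;
    }
}
```

Two actor-classes may therefore share the same name as long as their name spaces differ; only an identical (name space, name) pair is reported as a clash.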
Actor
The Actor object is used to model an actor a ∈ A. As depicted in Figure 6.4, this object is
defined by the following elements:
• id: String attribute that identifies the actor. It must be noted that the same id cannot
be shared among actors of the same network.
• actorClass: ActorClass element that is instantiated by the actor.
Figure 6.3: The ActorClass object. [UML class diagram: ActorClass with associations version (1), actions (1..n), inputPorts (1..n), outputPorts (1..n), variables (1..n) and procedures (1..n).]
Figure 6.4: The Actor object. [UML class diagram: Actor with association actorClass (1).]
Action
The Action object is used to model an action λ ∈Λ. As depicted in Figure 6.5, this object is
defined by the following elements:
• id: String attribute that identifies the action. It must be noted that the same id cannot
be shared among actions of the same actor.
• label: Qid element that contains the qualified identifier of the action.
• guards: list of the Guard elements used by the action.
• procedures: list of Procedure elements used by the action.
• variables: list of Variable elements used by the action.
It must be noted that, even though actions are defined in the ActorClass, these are always
considered by the framework as a tuple (Actor,Action) when analyses are performed.
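A minimal Java sketch of what analysing actions as (Actor,Action) tuples can look like is given below. It assumes that a tuple can be identified by the actor id and the action id; the classes ActorActionKey and FiringCounter are hypothetical and only illustrate the idea.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Illustrative sketch only: class and field names are hypothetical.
final class ActorActionKey {
    final String actorId;   // id of the Actor instance
    final String actionId;  // id of the Action defined in its ActorClass

    ActorActionKey(String actorId, String actionId) {
        this.actorId = actorId;
        this.actionId = actionId;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ActorActionKey)) return false;
        ActorActionKey k = (ActorActionKey) o;
        return actorId.equals(k.actorId) && actionId.equals(k.actionId);
    }

    @Override
    public int hashCode() {
        return Objects.hash(actorId, actionId);
    }
}

// Counts firings per (Actor, Action) tuple rather than per action of the class.
final class FiringCounter {
    private final Map<ActorActionKey, Long> firings = new HashMap<>();

    void fired(String actorId, String actionId) {
        ActorActionKey key = new ActorActionKey(actorId, actionId);
        Long old = firings.get(key);
        firings.put(key, old == null ? 1L : old + 1L);
    }

    long count(String actorId, String actionId) {
        Long n = firings.get(new ActorActionKey(actorId, actionId));
        return n == null ? 0L : n;
    }
}
```

With this keying, two actors that instantiate the same actor-class keep separate counters for the same action, which is the behaviour the tuple convention is meant to guarantee.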
Figure 6.5: The Action object. [UML class diagram: Action with associations label (1), guards (0..n), procedures (0..n) and variables (0..n).]
Qid
The Qid object is used to model a qualified identifier, which is a sequence of identifiers
separated by dots. As depicted in Figure 6.6, this object is defined by the following elements:
• ids: array of String elements that contains the ordered sequence of identifiers.
• size: Integer attribute that defines the size of the identifier in terms of elements in
the ids array.
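As a small illustration of this structure, the following Java sketch parses a dotted label into its ordered identifiers. The class QidSketch and the example label are hypothetical; only the two attributes ids and size come from the description above.

```java
// Illustrative sketch only: a hypothetical stand-in for the Qid object.
final class QidSketch {
    final String[] ids;  // ordered sequence of identifiers
    final int size;      // number of elements in ids

    QidSketch(String[] ids) {
        this.ids = ids.clone();
        this.size = ids.length;
    }

    // Parses a dotted label such as "decode.header.start".
    static QidSketch parse(String label) {
        // The dot is a regular-expression metacharacter, so it must be escaped.
        return new QidSketch(label.split("\\."));
    }

    @Override
    public String toString() {
        return String.join(".", ids);
    }
}
```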
Figure 6.6: The Qid object.
Procedure
The Procedure object is used to model a procedure (or a function) defined in an actor-class
and called by an action. As depicted in Figure 6.7, this object is defined by the following
elements:
• name: String attribute that identifies the procedure. It must be noted that the same
name cannot be shared among procedures of the same actor-class.
• variables: list of Variable elements used by the procedure.
It must be noted that, even though procedures are defined in the ActorClass, these are
always considered by the framework as a tuple (Actor,Procedure) when analyses are
performed.
Figure 6.7: The Procedure object. [UML class diagram: Procedure with association variables (0..n).]
Internal actor variable
The Variable object is used to model an internal variable. As depicted in Figure 6.8, this
object is defined by the following elements:
• name: String attribute that identifies the actor internal variable. It must be noted
that the same name cannot be shared among variables of the same actor-class.
• type: Type element that contains the variable type.
It must be noted that, even though variables are defined in the ActorClass, these are always
considered by the framework as a tuple (Actor,Variable) when analyses are performed.
Figure 6.8: The Variable object. [UML class diagram: Variable with association type (1).]
Guard
The Guard object is used to model an action guard. As depicted in Figure 6.9, this object is
defined by the following elements:
• id: String attribute that identifies the guard. It must be noted that the same id cannot
be shared among guards of the same action.
• variables: list of Variable elements used by the guard.
• ports: list of input Port elements used by the guard.
It must be noted that, even though guards are defined in the Action, these are always con-
sidered by the framework as a tuple (Actor,Action,Guard) when analyses are performed.
Figure 6.9: The Guard object. [UML class diagram: Guard with associations variables (0..n) and ports (0..n).]
Port
The Port object is used to model an input port $p^{in}_i \in P^{in}_a$ or an output port $p^{out}_j \in P^{out}_a$. As
depicted in Figure 6.10, this object is defined by the following elements:
• name: String attribute that identifies the port. It must be noted that the same name
cannot be shared among ports of the same kind (i.e. input or output) and of the same
actor-class.
• type: Type element that contains the port type.
It must be noted that, even though ports are defined in the ActorClass, these are always
considered by the framework as a tuple (Actor,Port) when analyses are performed.
Figure 6.10: The Port object. [UML class diagram: Port with association type (1).]
Buffer
The Buffer object is used to model a buffer b ∈B . As depicted in Figure 6.11, this object is
defined by the following elements:
• sourceActor: Actor element that contains the source actor.
• sourcePort: Port element that contains the source output port.
• targetActor: Actor element that contains the target actor.
• targetPort: Port element that contains the target input port.
Figure 6.11: The Buffer object. [UML class diagram: Buffer with associations sourceActor (1), sourcePort (1), targetActor (1) and targetPort (1).]
Type
A basic type system is modeled using the Type object. As depicted in Figure 6.12, this object
is defined by the following elements:
• name: String attribute that identifies the type.
• size: Integer attribute that defines the number of elements contained in a complex
data type (e.g. the number of elements of a list whose elements all have the same type).
• bits: Integer attribute that defines the number of bits required to represent a
variable of the given data type.
• subType: Type element that contains the sub-type of a type, if any. This attribute is
used to model complex data types (e.g. the element type of a list).
Figure 6.12: The Type object. [UML class diagram: Type with a self-association subType (1).]
Version
The Version object is used to define a unique identifier of a file. It is used, for example,
to track the code modification and refactoring that could be made on a network or in an
actor-class. TURNUS supports the Git version control system [162] and, as depicted in Figure 6.13,
defines the following elements for this object:
• date: String attribute that contains the time stamp of the last local file modification.
• revision: String attribute that contains the commit hash identifier of the file.
• repository: String attribute that contains the Git repository URL of the file.
Figure 6.13: The Version object.
6.2.2 Architecture and constraints
The platform model describes the structure of the architecture where the dataflow program is
implemented. A basic meta model representation is used in order to represent the available
processing elements where actors are mapped and the media where buffers are mapped. The
platform is modeled as a graph G(PU , ME ,L) where:
• $PU = \{pu_1, pu_2, \ldots, pu_{n_{PU}}\}$ is the set of processing elements.
• $ME = \{me_1, me_2, \ldots, me_{n_{ME}}\}$ is the set of media.
• $L = \{l_1, l_2, \ldots, l_{n_L}\}$ is the set of links between processing elements and media.
As an example, Figure 6.14 depicts the Xilinx Zynq-7 ZC702 evaluation-board [163] architec-
ture model defined with this formalism. In this case, the set of processing elements PU is
composed of three components: two ARMs and an FPGA, respectively. Each ARM has its own
L1 memory. The two ARMs share an L2 memory between them and a DDR3 memory with
the FPGA. Furthermore, the bus interfaces AXI-HP, AXI-GP and AXI-ACP are modeled with
three different media. Each one of these memories is modeled with a medium $me_i \in ME$ and
each interconnection between a processing element and a medium with a link $l_i \in L$. The
basic components of this architecture high-level representation are described in the following
section.
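The ZC702 example can be written down as a small graph. The Java sketch below is a hedged illustration: the class PlatformSketch and the element names (ARM0, L1_0 and so on) are hypothetical, and only the links explicitly stated in the text are modeled. The links of the three AXI media are left out because the text does not list their end-points.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: hypothetical classes, not the TURNUS API.
final class PlatformSketch {
    final List<String> processingElements = new ArrayList<>(); // PU
    final List<String> media = new ArrayList<>();              // ME
    final List<String[]> links = new ArrayList<>();            // L, as {processing element, medium}

    // Adds a link after checking that both end-points belong to the platform.
    void link(String pe, String medium) {
        if (!processingElements.contains(pe) || !media.contains(medium)) {
            throw new IllegalArgumentException("unknown end-point: " + pe + " / " + medium);
        }
        links.add(new String[] {pe, medium});
    }

    // Media reachable from a processing element through one link.
    List<String> mediaOf(String pe) {
        List<String> result = new ArrayList<>();
        for (String[] l : links) {
            if (l[0].equals(pe)) result.add(l[1]);
        }
        return result;
    }

    static PlatformSketch zc702() {
        PlatformSketch p = new PlatformSketch();
        for (String pe : new String[] {"ARM0", "ARM1", "FPGA"}) {
            p.processingElements.add(pe);
        }
        for (String me : new String[] {"L1_0", "L1_1", "L2", "DDR3", "AXI-HP", "AXI-GP", "AXI-ACP"}) {
            p.media.add(me);
        }
        p.link("ARM0", "L1_0");  // each ARM has its own L1 memory
        p.link("ARM1", "L1_1");
        p.link("ARM0", "L2");    // the L2 memory is shared by the two ARMs
        p.link("ARM1", "L2");
        p.link("ARM0", "DDR3");  // the DDR3 memory is shared by the ARMs and the FPGA
        p.link("ARM1", "DDR3");
        p.link("FPGA", "DDR3");
        // Links of the AXI bus media are omitted: the text does not list them.
        return p;
    }
}
```

Calling PlatformSketch.zc702().mediaOf("FPGA") then returns the media that the FPGA can reach directly, which in this partial model is only the DDR3 memory.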
Figure 6.14: Xilinx Zynq-7 ZC702 evaluation-board architecture model: (a) Xilinx architecture model; (b) TURNUS architecture model. [Blocks shown: ARM, ARM, FPGA, L1, L1, L2, DDR3, AXI-HP, AXI-GP, AXI-ACP.]
Platform
The Platform object is used to model a platform G(PU, ME, L). As depicted in Figure 6.15,
this object is composed of the following elements:
• name: String attribute that contains an identifier of the platform.
• media: list of Medium elements available in the platform.
• processingElements: list of ProcessingElement elements available in the plat-
form.
• links: list of Link elements available in the platform.
Figure 6.15: The Platform object. [UML class diagram: Platform with associations processingElements (1..n), media (1..n) and links (1..n).]
Processing element
The ProcessingElement object is used to model a processing element pu ∈ PU . As
depicted in Figure 6.16, this object is defined by the following elements:
• name: String attribute that contains an identifier of the operator. Operators of the
same platform cannot share the same name.
• family: String attribute that contains the operator family identifier.
• clock: Double attribute that defines the period (in ns) of the operator clock cycle.
• schedulers: list of Scheduler elements available in the processing element.
• supportedTypes: list of Type elements supported by the processing element.
Figure 6.16: The ProcessingElement object. [UML class diagram: ProcessingElement with associations schedulers (1..n) and supportedTypes (1..n).]
Medium
The Medium object is used to model a medium me ∈ ME . As depicted in Figure 6.17, this
object is defined by the following elements:
• name: String attribute that contains an identifier of the medium. Media of the same
platform cannot share the same name.
• family: String attribute that identifies the medium family name.
• schedulers: list of Scheduler elements available in the medium.
• inputClock: Double attribute that defines the period (in ns) of the medium input clock
cycle.
• outputClock: Double attribute that defines the period (in ns) of the medium output
clock cycle.
• maxSize: Integer attribute that defines the maximum size (in bits) of the medium.
• maxPush: Integer attribute that defines the maximum number of bits that can be
consumed by the medium during each input clock cycle.
• maxPop: Integer attribute that defines the maximum number of bits that can be
produced by the medium during each output clock cycle.
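The clock and size attributes above can be combined into simple throughput bounds. The following Java sketch is an illustration under stated assumptions: the peak rates and the minimum push time are derived quantities that are not defined in the text, and the helper class MediumBandwidth is hypothetical.

```java
// Illustrative sketch only: a hypothetical helper, not part of the TURNUS API.
final class MediumBandwidth {

    // Peak input rate in bits per nanosecond: at most maxPush bits are
    // accepted per input clock cycle of inputClockNs nanoseconds.
    static double peakPushRate(int maxPush, double inputClockNs) {
        return maxPush / inputClockNs;
    }

    // Peak output rate in bits per nanosecond, derived in the same way
    // from maxPop and the output clock period.
    static double peakPopRate(int maxPop, double outputClockNs) {
        return maxPop / outputClockNs;
    }

    // Lower bound (in ns) on the time needed to push 'bits' bits,
    // counted in whole input clock cycles.
    static double minPushTimeNs(long bits, int maxPush, double inputClockNs) {
        long cycles = (bits + maxPush - 1) / maxPush; // ceiling division
        return cycles * inputClockNs;
    }
}
```

For instance, a medium with maxPush equal to 64 bits and an input clock period of 10 ns cannot accept 100 bits in fewer than two input cycles.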
Figure 6.17: The Medium object. [UML class diagram: Medium with association schedulers (1..n).]
Link
The Link object is used to model a link l ∈ L. As depicted in Figure 6.18, this object is defined
by the following elements:
• medium: Medium element that defines the medium end-point of the link.
• operator: Operator element that defines the operator end-point of the link.
Figure 6.18: The Link object. [UML class diagram: Link with associations medium (1) and operator (1).]
Scheduler
The Scheduler object is used to model a scheduling policy of an operator or a medium. As
depicted in Figure 6.19, this object is defined by the following elements:
• name: String attribute that contains an identifier of the scheduler. Schedulers of the
same medium or operator cannot share the same name.
• selectionTime: Double attribute that defines the time required for making a schedul-
ing choice.
Figure 6.19: The Scheduler object.
6.2.3 Execution trace graph
The ETG model describes the graph-based structure illustrated in Chapter 5. A basic meta
model representation is used in order to represent both the firings and dependencies sets. The
basic components of this representation are described in the following section.
Trace
The Trace object is used to model an ETG G(V ,E). As depicted in Figure 6.20, this object is
defined by the following elements:
• firings: list of Firing elements contained in the ETG.
• dependencies: list of Dependency elements contained in the ETG.
Figure 6.20: The Trace object. [UML class diagram: Trace with associations firings (1..n) and dependencies (0..n).]
Firing
Each firing $s_i \in S$ of the ETG is represented by a Firing object. As depicted in Figure 6.21,
this object is defined by the following elements:
• id: Long attribute that defines the firing identifier i . Firings of the same trace cannot
share the same id.
• actorClass: String attribute that contains the name of the ActorClass.
• actor: String attribute that contains the id of the Actor.
• action: String attribute that contains the id of the Action.
Figure 6.21: The Firing object.
Dependency
Each dependency $(s_i, s_j) \in D$ of the ETG is represented by a Dependency object. As depicted in
Figure 6.22, this object is defined by the following elements:
• sourceFiring: Long attribute that contains the source Firing identifier i .
• targetFiring: Long attribute that contains the target Firing identifier j .
• kind: String attribute that identifies the dependency kind k. Valid values are: variable,
fsm, guard, port and tokens.
• count: Integer attribute that contains the number of tokens (required only for token
dependencies).
• port: String attribute that contains the Port name (required only for port dependen-
cies).
• sourcePort: String attribute that contains the source Port name (required only for
token dependencies).
• targetPort: String attribute that contains the target Port name (required only for
token dependencies).
• direction: String attribute that identifies the direction of a dependency. Valid values
are: read/read, read/write, write/read, write/write, enable and disable (required only for
port, internal variable and guard dependencies).
• variable: String attribute that contains the Variable name (required only for inter-
nal variable dependencies).
• guard: String attribute that contains the Guard identifier (required only for guard
dependencies).
• appearance: Integer attribute that contains the appearance order of a guard en-
able/disable window (required only for guard dependencies).
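Since most attributes are required only for some dependency kinds, a consistency check per kind is a natural companion to this object. The Java sketch below is a minimal illustration of such a check: the class DependencySketch and its method missingAttributes are hypothetical, and the rules simply restate the "required only for" remarks of the list above.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: a hypothetical mirror of the Dependency attributes.
final class DependencySketch {
    long sourceFiring;
    long targetFiring;
    String kind;         // variable, fsm, guard, port or tokens
    Integer count;       // token dependencies only
    String port;         // port dependencies only
    String sourcePort;   // token dependencies only
    String targetPort;   // token dependencies only
    String direction;    // port, internal variable and guard dependencies
    String variable;     // internal variable dependencies only
    String guard;        // guard dependencies only
    Integer appearance;  // guard dependencies only

    // Returns the names of the required attributes that are not set.
    List<String> missingAttributes() {
        List<String> missing = new ArrayList<>();
        switch (kind) {
            case "tokens":
                require(missing, "count", count);
                require(missing, "sourcePort", sourcePort);
                require(missing, "targetPort", targetPort);
                break;
            case "port":
                require(missing, "port", port);
                require(missing, "direction", direction);
                break;
            case "variable":
                require(missing, "variable", variable);
                require(missing, "direction", direction);
                break;
            case "guard":
                require(missing, "guard", guard);
                require(missing, "direction", direction);
                require(missing, "appearance", appearance);
                break;
            case "fsm":
                break; // no kind-specific attribute is listed for FSM dependencies
            default:
                throw new IllegalArgumentException("unknown dependency kind: " + kind);
        }
        return missing;
    }

    private static void require(List<String> missing, String name, Object value) {
        if (value == null) missing.add(name);
    }
}
```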
Figure 6.22: The Dependency object.
6.2.4 Profiling information
The profiling information represents the data collection provided by third-party profilers. This
data set, denoted by Θ, contains clock-accurate profiling information for each actor and
action of the dataflow program. The basic components of this data set are described in the
following section.
Network profiling data
The NetworkProfilingData object contains the profiling information of a Network
implemented and profiled on a specific Operator. As depicted in Figure 6.23, this object is
defined by the following elements:
• network: String attribute that contains the Network name.
• operator: String attribute that contains the Operator name.
• actionsData: a list of ActionProfilingData elements that contains the profiling
data for each tuple (Actor,Action) of the network.
Action profiling data
The ActionProfilingData object contains the profiling information of each tuple (Actor,Action).
As depicted in Figure 6.23, this object is defined by the following elements:
• actor: String attribute that contains the Actor identifier.
• action: String attribute that contains the Action identifier.
• max: Double attribute that contains the maximum number of clock cycles.
• min: Double attribute that contains the minimum number of clock cycles.
• average: Double attribute that contains the average number of clock cycles.
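Clock-cycle counts become execution times once they are combined with the clock period of the processing element on which the network was profiled. The Java sketch below illustrates this conversion; the helper class ActionTiming is hypothetical, and the total-time estimate assumes that every firing takes the average number of cycles, which is a simplification introduced here.

```java
// Illustrative sketch only: hypothetical helper names.
final class ActionTiming {

    // Converts a clock-cycle count into nanoseconds, given the clock period
    // (in ns) of the processing element that executed the action.
    static double cyclesToNs(double cycles, double clockPeriodNs) {
        return cycles * clockPeriodNs;
    }

    // Rough estimate of the total time spent in one (Actor, Action) tuple,
    // assuming every firing takes the average number of clock cycles.
    static double estimatedTotalNs(long firings, double averageCycles, double clockPeriodNs) {
        return firings * cyclesToNs(averageCycles, clockPeriodNs);
    }
}
```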
Figure 6.23: The NetworkProfilingData and ActionProfilingData objects. [UML class diagram: NetworkProfilingData with association actionsData (1..*) to ActionProfilingData.]
6.3 Integration with third-party CAL dataflow environments
The Orcc compiler and the Xronos framework illustrated in Section 2.5.6 are two examples of
CAL dataflow environments that are integrated into the TURNUS design flow. As depicted in
Figure 6.24, both are used as compiler infrastructures. Additional CAL dataflow environments
have been successfully integrated within the TURNUS design flow as illustrated in [164, 165,
166, 167, 168, 169, 170, 171]. In the following, it is discussed how both Orcc and Xronos interact
with the TURNUS environment, as these two frameworks have been extensively used for the
purpose of this dissertation.
Orcc
As depicted in Figure 6.24, the Orcc and the TURNUS design flow are integrated in the following
parts:
• Code interpretation: the TURNUS Orcc RVC-CAL profiler [172] provides an extension of
the basic Orcc CAL interpreter functionalities, where the TURNUS APIs illustrated in
Section 7.4 have been integrated. This extended code interpreter and profiler is used
both to generate the ETG and the high-level profiling data of the CAL program under
analysis.
• Profiling data: Orcc supports the generation of C/C++ code where the performance
application programming interface (PAPI) [173, 174, 175] is integrated. During the
program execution, clock-accurate profiling information is retrieved and used, during
the DSE performed by TURNUS, to enhance the architecture model.
• Mapping: the mapping configuration file generated with TURNUS can directly be used in
Orcc. In fact, for each Orcc back-end it is possible to drive the code compilation using
the buffer size and partitioning configurations evaluated by TURNUS.
Xronos
As depicted in Figure 6.24, the Xronos and the TURNUS design flow are integrated in the
following parts:
• Profiling data: Xronos provides a test-bench platform where it is possible to retrieve the
exact number of clock cycles required for executing a CAL action. This clock-accurate
profiling information is then used during the DSE performed by TURNUS to enhance
the architecture model and the performance estimation.
• Mapping: the mapping configuration file generated with TURNUS can directly be used in
Xronos. It is possible to drive the code synthesis using the buffer size and partitioning
configurations on multi-clock domain platforms.
6.4 Conclusions
In this chapter, the DSE environment developed and used to demonstrate the effectiveness
of the design methodology discussed in this dissertation has been introduced, and its main
functionalities and structure have been illustrated. This environment provides a complete
DSE solution for dynamic dataflow programs implemented on heterogeneous and massively
parallel architectures. Its core is a collection of application programming interfaces (APIs)
for profiling dataflow programs during their code interpretation. The main features of this
profiler are the capability to generate an ETG and to provide high-level profiling information,
both retrieved during a high-level code interpretation of the program. Furthermore, the main
design space analysis and performance estimation capabilities have been illustrated.
Figure 6.24: The open RVC-CAL compiler (Orcc) and Xronos infrastructure integrated in the
TURNUS design flow. [Block labels: CAL program; Architecture; Constraints; Compiler infrastructure; Orcc and Xronos; Front-end; cal; xdf; IR; Core; Interpreter; Back-end; Code generation; Source Code; Build Script; C; LLVM; Promela; Java; Xronos; HDL; Synthesis or Compilation; Implementation; TURNUS Profiler; Execution trace graph; High-level profiling data; Profiling data; TURNUS Execution trace graph post-mortem scheduling and analysis; Mapping configuration; Code refactoring directions; Compiler Directives; Refactoring Directions.]
7 Profiling CAL programs with TURNUS
The TURNUS CAL profiler, like every program profiler, provides a statistical summary of the
execution complexity of a CAL program. Information that can be retrieved includes, for example,
the number of operators (see Table 3.1) that each procedure, action and actor executed, the
number of tokens produced and consumed by each action and actor, and the buffer utilization.
Moreover, it provides a complete Java application programming interface (API) that can be
plugged into a CAL code interpreter. One of its most powerful functionalities is the possibility to
generate the ETG illustrated in Chapter 5 without generating any partial implementation of
the code. Furthermore, it does not depend on any third-party profiler and it can be integrated
into an existing CAL code interpreter. At the time of writing this thesis, an integrated version
of the TURNUS profiler is provided for the Orcc CAL code interpreter [56] and is available as
an open source product [16]. In this chapter the main functionalities and novelties that have
been introduced are highlighted. Subsequently, the set of data that can be collected during the
program interpretation is illustrated. Finally, it is described how this set of profiling data is
used when building the ETG of the program execution.
7.1 Advances in profiling CAL programs
The DSE based on the ETG post-processing can be effectively performed only
if the ETG satisfies the requirements illustrated in Chapter 5. CAL profilers that are able to
generate an ETG are available in both the CAL Design Suite [122, 33] and the Caltoopia [176]
framework. However, as illustrated in Table 7.1, the ETG that these two profilers are able to
generate does not satisfy all the requirements. For example, the CAL Design Suite is
not able to identify the internal variable and port dependencies. Caltoopia, on the other
hand, can identify only token dependencies. Furthermore, its ETG is not untimed. Moreover,
as illustrated in Figure 7.1, both profilers require a partial C/C++ implementation of the CAL
program. Consequently, the ETGs are evaluated through a binary execution of the program.
The CAL Design Suite makes use of the Intel Pin tool [82, 177] in order to obtain a dynamic
binary code instrumentation, while Caltoopia makes use of its native libraries. However,
in both cases, the high-level profiling data can be biased by low-level code optimizations
performed by the compilers (e.g. GCC [178], ICC [179]). Table 7.1 provides a summarized
overview of the new functionalities introduced by the TURNUS CAL profiler. Contrary to
the other two environments, the TURNUS CAL profiler provides a collection of APIs (see
Section 7.4) that can be integrated in the available CAL code interpreters. Furthermore, the
profiling of the CAL program is performed directly through a CAL code interpretation, without
requiring any partial low-level implementation or binary code execution. The profiling
information can be obtained with different levels of granularity: it is possible to obtain
information for each actor-class, actor, action, procedure and buffer. It must be noted that the
profiling data is provided as a set of statistical data (see Section 7.2) where the calls of
each operator and the reads and writes of each variable and token are reported. The ETG
generated by the TURNUS CAL profiler fully satisfies the requirements illustrated in Chapter 5.
Figure 7.1: CAL profilers design flow: (a) TURNUS CAL profiler; (b) CAL Design Suite; (c) Caltoopia. [Block labels: (a) CAL program, CAL code interpreter, TURNUS Profiler, API, Execution Trace Graph, High-level profiling data; (b) CAL program, Orcc CAL code back-end, C/C++ code, .bin, Binary code execution, Intel Pin tool, Dynamic binary code instrumentation, Execution Trace Graph, Profiling data; (c) CAL program, Caltoopia CAL code back-end, C/C++ code, .bin, Binary code execution, Tokens Trace.]
Table 7.1: CAL profilers features.
(a) Environment and main features.
Tool Environment Static analysis Dynamic analysis API Notes
TURNUS Java 1.7 Code interpretation
Cal Design Suite Java 1.6 C/C++ executable - (1)
Caltoopia Java 1.6 - C/C++ executable -
(b) Profiling granularity.
Tool Actor-class Actor Action Procedure Buffer
TURNUS
Cal Design Suite - -
Caltoopia -
(c) Profiling information.
Tool Statistical data Operator calls Internal variables Tokens Buffers
TURNUS
Cal Design Suite - - -
Caltoopia - - -
(d) Profiling information.
Dependencies
Tool Untimed FSM Internal variables Port Tokens Guard Notes
TURNUS (2)
Cal Design Suite - - -
Caltoopia - - - - - (3)
Notes: (1) requires Intel Pin [82, 177] as a third-party tool for instrumenting the binary
program; (2) guard enable and disable dependencies analysis is an under-development
functionality; (3) token production/consumption is logged during the program execution and
successively used to build the ETG.
7.2 Data collection
In the following section, the set of profiling data that is collected during the code interpretation
of a CAL program is illustrated.
7.2.1 Firing data
For each action execution, the TURNUS profiler generates a unique firing identifier (FID)
stored in a Long variable. During the firing execution, the profiling information is stored in the
FiringData object. As depicted in Figure 7.2, this object contains the following elements:
• scheduledByFsm: Boolean attribute that indicates if the firing has been scheduled by
the internal actor FSM.
• readVariables: key-value map element that has a Variable as a key and an Integer
as a value. It indicates how many times the firing has read the actor internal variable. In
other words, depending on the variable Type, the value corresponds to the number of
Load or LoadList operations that have been performed.
• writeVariables: key-value map element that has a Variable as a key and an Integer
as a value. It indicates how many times the firing has written the actor internal variable.
In other words, depending on the variable Type, the value corresponds to the number
of Store or StoreList operations that have been performed.
• calledOpcodes: key-value map element that has an Opcode as a key and an Integer
as a value. It indicates how many times the firing has called a specific operation code.
• calledProcedures: key-value map element that has a Procedure as a key and an
Integer as a value. It indicates how many times the firing has called a specific
Procedure.
• enabledGuards: list of the Guard elements that have been enabled by the firing.
• disabledGuards: list of the Guard elements that have been disabled by the firing.
• consumedTokens: key-value map element that has a Buffer as a key and a list of
Token elements as a value. It contains the list of consumed tokens for each buffer.
• producedTokens: key-value map element that has a Buffer as a key and a list of
Token elements as a value. It contains the list of produced tokens for each buffer.
Figure 7.2: The FiringData object. [UML class diagram: FiringData with attributes firing : Long and scheduledByFsm : Boolean; every association has multiplicity 0..*.]
7.2.2 Action data
For each action λ ∈Λ a set of statistical data is collected during the entire program execution.
This set of data is defined by the ActionData object. As depicted in Figure 7.3, this object
contains the following elements:
• readVariables: key-value map element that has a Variable as a key and a Statistics
object as a value. It contains the statistical information concerning the number of
readings of a variable performed by all the firings of the action.
• wroteVariables: key-value map element that has a Variable as a key and a Statistics
object as a value. It contains the statistical information concerning the number of
writings of a variable performed by all the firings of the action.
• calledOpcodes: key-value map element that has an Opcode as a key and a Statistics
object as a value. It contains the statistical information concerning the number of calls
of an operation code performed by all the firings of the action.
• calledProcedures: key-value map element that has a Procedure as a key and a
Statistics object as a value. It contains the statistical information concerning
the number of calls of a procedure performed by all the firings of the action.
• consumedTokens: key-value map element that has a Buffer as a key and a Statistics
object as a value. It contains the statistical information concerning the number of tokens
consumed by all the firings of the action.
• producedTokens: key-value map element that has a Buffer as a key and a Statistics
object as a value. It contains the statistical information concerning the number of tokens
produced by all the firings of the action.
This data set is updated each time that a firing ends its execution: the firing’s data is merged in
the corresponding action’s data.
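The merge step described above can be sketched in Java as follows. This is an illustration under stated assumptions: the fields of the Statistics object are not defined at this point of the text, so the class StatisticsSketch (sample count, sum, minimum and maximum) is a guess at a plausible content, and variables are identified by their name for brevity.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: the fields of this class are an assumption,
// since the text does not define the content of the Statistics object here.
final class StatisticsSketch {
    long samples;
    long sum;
    int min = Integer.MAX_VALUE;
    int max = Integer.MIN_VALUE;

    void add(int value) {
        samples++;
        sum += value;
        min = Math.min(min, value);
        max = Math.max(max, value);
    }

    double mean() {
        return samples == 0 ? 0.0 : (double) sum / samples;
    }
}

final class ActionDataSketch {
    // Per-variable read statistics, keyed by variable name for brevity.
    final Map<String, StatisticsSketch> readVariables = new HashMap<>();

    // Called when a firing ends: merges its read counters into the action data.
    void merge(Map<String, Integer> firingReadVariables) {
        for (Map.Entry<String, Integer> e : firingReadVariables.entrySet()) {
            StatisticsSketch s = readVariables.get(e.getKey());
            if (s == null) {
                s = new StatisticsSketch();
                readVariables.put(e.getKey(), s);
            }
            s.add(e.getValue());
        }
    }
}
```

In this sketch a firing that does not read a variable contributes no sample for it; whether such firings should instead count as a zero is a design choice that the text leaves open.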
Figure 7.3: The ActionData object. [UML class diagram: every association has multiplicity 0..*; each map entry has one key and one Statistics value.]
7.2.3 Actor data
For each actor a ∈ A a set of statistical data is collected during the entire program execution.
This information is stored in the ActorData object. As depicted in Figure 7.4, this object
contains the following elements:
• firedActions: key-value map element that has an Action as a key and an Integer as
a value. It contains the number of firings of each action contained in the actor.
• readVariables: key-value map element that has a Variable as a key and a Statistics
object as a value. It contains the statistical information concerning the number of
readings of a variable performed by all the firings of the actor.
• wroteVariables: key-value map element that has a Variable as a key and a Statistics
object as a value. It contains the statistical information concerning the number of
writings of a variable performed by all the firings of the actor.
• calledOpcodes: key-value map element that has an Opcode as a key and a Statistics
object as a value. It contains the statistical information concerning the number of calls
of an operation code performed by all the firings of the actor.
• calledProcedures: key-value map element that has a Procedure as a key and a
Statistics object as a value. It contains the statistical information concerning
the number of calls of a procedure performed by all the firings of the actor.
• consumedTokens: key-value map element that has a Buffer as a key and a Statistics
object as a value. It contains the statistical information concerning the number of tokens
consumed by all the firings of the actor.
• producedTokens: key-value map element that has a Buffer as a key and a Statistics
object as a value. It contains the statistical information concerning the number of tokens
produced by all the firings of the actor.
Furthermore, an additional set of data is used to track which was the last action firing that
used a resource (e.g. internal variable, input or output port). This information is stored in
the ActorTracingData object. As depicted in Figure 7.5, this object contains the following
elements:
• lastFsmScheduled: Long attribute that contains the FID of the last firing scheduled by
the actor state machine.
• lastVariableReader: key-value map element that has a Variable as a key and a Long
as a value. For each variable, it contains the FID, if it exists, of the last action firing that
read the variable.
• lastVariableWriter: key-value map element that has a Variable as a key and a Long
as a value. For each variable, it contains the FID, if it exists, of the last action firing that
wrote the variable.
• lastGuardEnabler: key-value map element that has a Guard as a key and a Long as a
value. For each guard, it contains the FID, if it exists, of the last action firing that enabled
the guard.
• lastGuardDisabler: key-value map element that has a Guard as a key and a Long as a
value. For each guard, it contains the FID, if it exists, of the last action firing that disabled
the guard.
Chapter 7. Profiling CAL programs with TURNUS
Figure 7.4: The ActorData object. [UML class diagram: ActorData owns the map firedActions (Action key, Integer value) and the maps readVariables and writeVariables (Variable key, Statistics value), callOpcodes (Opcode key, Statistics value), callProcedure (Procedure key, Statistics value), and consumedTokens and producedTokens (Fifo key, Statistics value), each with multiplicity 0..*.]
• lastBufferReader: key-value map element that has a Buffer as a key and a Long as
a value. For each buffer, it contains the FID, if it exists, of the last action firing that
consumed a token from this buffer.
• lastBufferWriter: key-value map element that has a Buffer as a key and a Long as
a value. For each buffer, it contains the FID, if it exists, of the last action firing that
produced a token on this buffer.
This data set is updated each time that a firing ends its execution: the FiringData are
merged in the corresponding actor data. It must be noted that the data merging can be done
only after the computation of the ETG dependencies has been performed as described in
Section 7.3.
7.2.4 Buffer data
For each buffer b ∈ B a set of statistical data is collected during the entire program
execution. This information is stored in the BufferData object. As depicted in Figure 7.6,
this object contains the following elements:
Figure 7.5: The ActorTracingData object. [UML class diagram: ActorTracingData has the attribute lastFsmScheduled : Long and owns three groups of maps with a Long value and multiplicity 0..*: the last variable reader and writer maps (Variable key), the last guard enabler and disabler maps (Guard key), and lastBufferReader and lastBufferWriter (Buffer key).]
• consumedTokens: Statistics element that contains the statistical information
about the number of tokens that have been consumed from the buffer.
• producedTokens: Statistics element that contains the statistical information about
the number of tokens that have been produced on the buffer.
• maxOccupancy: Integer attribute that contains the maximum number of stored
tokens in the buffer.
• readMisses: Integer attribute that contains the sum of read misses.
• readHits: Integer attribute that contains the sum of read hits.
• writeMisses: Integer attribute that contains the sum of write misses.
• writeHits: Integer attribute that contains the sum of write hits.
Figure 7.6: The BufferData object. [UML class diagram: BufferData has the Integer attributes readMisses, readHits, writeMisses, writeHits and maxOccupancy, and is associated with two Statistics objects through producedTokens and consumedTokens.]
7.2.5 Statistical data
Summary statistics for a stream of data values are collected in a Statistics object. As
illustrated in Figure 7.7, this object contains the following information:
• min: Double attribute that contains the minimum value of the data stream.
• max: Double attribute that contains the maximum value of the data stream.
• average: Double attribute that contains the average of the data stream.
• variance: Double attribute that contains the variance of the data stream.
• count: Long attribute that contains the number of elements stored in the data stream.
Figure 7.7: The Statistics object. [UML class with the attributes min : Double, max : Double, average : Double, variance : Double and count : Long.]
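As an illustration of how such summary statistics can be maintained for a stream and merged when a firing ends, the following is a minimal Python sketch. It is not the TURNUS implementation: the class layout, the `add` and `merge` method names, the use of Welford's online update, and the choice of the population variance are all assumptions made for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class Statistics:
    """Streaming summary statistics (illustrative sketch, not the TURNUS class)."""
    count: int = 0
    min: float = math.inf
    max: float = -math.inf
    average: float = 0.0
    _m2: float = 0.0  # running sum of squared deviations from the mean

    @property
    def variance(self) -> float:
        # Assumption: population variance; defined as 0.0 for an empty stream.
        return self._m2 / self.count if self.count else 0.0

    def add(self, x: float) -> None:
        """Welford's online update with one new sample."""
        self.count += 1
        self.min = min(self.min, x)
        self.max = max(self.max, x)
        delta = x - self.average
        self.average += delta / self.count
        self._m2 += delta * (x - self.average)

    def merge(self, other: "Statistics") -> None:
        """Fold another summary into this one (pairwise combination of moments)."""
        if other.count == 0:
            return
        total = self.count + other.count
        delta = other.average - self.average
        self._m2 += other._m2 + delta * delta * self.count * other.count / total
        self.average += delta * other.count / total
        self.min = min(self.min, other.min)
        self.max = max(self.max, other.max)
        self.count = total
```

With this structure, merging the data of a firing into the data of its action or actor only requires one `merge` call per map entry, so the memory footprint does not grow with the number of firings.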
7.2.6 Profiled token
During the program execution each Token is treated as an object that contains, as depicted
in Figure 7.8, the following information:
• producer: Long attribute that contains the FID of the firing that produced the token.
• value: generic Object attribute that contains the encapsulated value of the token.
It must be noted that, as illustrated in Section 7.3, the information concerning the
producer is indispensable in order to evaluate the token dependencies of an ETG.
Figure 7.8: The Token object. [UML class with the attributes producer : Long and value : Object.]
7.3 Building of the execution trace graph
At the end of each action firing, a Firing object is created by analyzing the corresponding
FiringData. Each Firing object represents a single action firing $s_i \in S$, where the FID
is given by i = FiringData.firing. The incoming dependencies set $\delta(s_i)^{-}_{S}$ is evaluated
immediately after $s_i$ ends its execution, by analyzing the corresponding FiringData and
ActorData objects. The following paragraphs illustrate how these data objects are analyzed
for each dependency kind described in Section 5.2.2. With this methodology, the ETG can be
streamed and stored in a file during the simulation process. It must be noted that the memory
requirement of the profiler is limited and predictable, because the size of the data sets is
itself limited and predictable.
Internal variable dependencies
The set of variable dependencies is evaluated by analyzing both the FiringData and the
ActorData sets. For each Variable that has been read by the firing $s_i$, a read/read
dependency $(s_j, s_i) \in D_v$ is defined if the read variables map of the ActorData contains an
FID for the given Variable. Similarly, a read/write dependency $(s_j, s_i) \in D_v$ is defined if
the written variables map of the ActorData contains an FID for the given Variable. For
each Variable that has been written by the firing $s_i$, a write/read dependency $(s_j, s_i) \in D_v$ is
defined if the read variables map of the ActorData contains an FID for the given Variable.
Similarly, a write/write dependency $(s_j, s_i) \in D_v$ is defined if the written variables map of the
ActorData contains an FID for the given Variable.
Finite state machine dependency
The internal state machine dependency is evaluated by analyzing both the FiringData and
the ActorData sets. In fact, $(s_j, s_i) \in D_f$ can be defined if the firing $s_i$ has been scheduled
by the actor state machine and if the ActorData contains the FID j of a firing that has been
previously scheduled by the internal state machine.
Guard dependencies
The set of guard dependencies is evaluated by analyzing both the FiringData and the ActorData
sets. For each Guard that has been enabled by the firing $s_i$, an enable dependency $(s_j, s_i) \in D_g$
is defined if the last guard disabler map of the ActorData contains an FID for the given
Guard. Similarly, for each Guard that has been disabled by the firing $s_i$, a disable dependency
$(s_j, s_i) \in D_g$ is defined if the last guard enabler map of the ActorData contains an FID for the
given Guard.
Port dependencies
The set of port dependencies is evaluated by analyzing both the FiringData and the
ActorData sets. For each Buffer from which at least one token has been consumed by the
firing $s_i$, a read/read dependency $(s_j, s_i) \in D_p$ is defined if the last buffer reader map of the
ActorData contains an FID for the given Buffer. Similarly, a write/write dependency
$(s_j, s_i) \in D_p$ is defined if the last buffer writer map of the ActorData contains an FID for the
given Buffer.
Token dependencies
The set of token dependencies is evaluated by directly analyzing the FiringData set. In
fact, for each Token contained in the map of consumed tokens it is possible to identify a
dependency $(s_j, s_i) \in D_t$, where j is the Token.producer (i.e. the FID of the firing that
produced the token).
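The variable and token rules above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the TURNUS code: the `ActorTracing` class, the function names, and the representation of a token as a `(producer FID, value)` pair are hypothetical. The tracing maps are updated only after the dependencies of the firing have been computed, which mirrors the ordering constraint stated in Section 7.2.3.

```python
from dataclasses import dataclass, field


@dataclass
class ActorTracing:
    """Hypothetical stand-in for the last reader and last writer maps (variable -> FID)."""
    last_variable_reader: dict = field(default_factory=dict)
    last_variable_writer: dict = field(default_factory=dict)


def variable_dependencies(fid, read_vars, written_vars, tracing):
    """Return (source FID, target FID, kind) edges for firing `fid`, then update the maps."""
    deps = []
    for var in read_vars:
        if var in tracing.last_variable_reader:
            deps.append((tracing.last_variable_reader[var], fid, "read/read"))
        if var in tracing.last_variable_writer:
            deps.append((tracing.last_variable_writer[var], fid, "read/write"))
    for var in written_vars:
        if var in tracing.last_variable_reader:
            deps.append((tracing.last_variable_reader[var], fid, "write/read"))
        if var in tracing.last_variable_writer:
            deps.append((tracing.last_variable_writer[var], fid, "write/write"))
    # Merge the firing data only after its dependencies are known.
    for var in read_vars:
        tracing.last_variable_reader[var] = fid
    for var in written_vars:
        tracing.last_variable_writer[var] = fid
    return deps


def token_dependencies(fid, consumed_tokens):
    """One (producer FID, consumer FID) edge per distinct producer of the consumed tokens."""
    return sorted({(producer, fid) for producer, _value in consumed_tokens})
```

Because every map holds at most one FID per variable, the work per firing is proportional to the number of variables and tokens it touches.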
7.4 Application programming interface
The collection of profiling APIs provided by TURNUS is described in the following. These
methods are used by a third-party CAL code interpreter. It must be noted that this API can
be used only under the assumption of a serial code interpretation. In other words, the code
interpreter must process one single action firing at a time.
Long startFiring(Actor actor, Action action, Boolean sbfm)
This method is called when a new action can be fired. A new and empty FiringData object
is created and associated with this new firing. The TURNUS profiler generates a new firing
identifier.
Void endFiring()
This method is called when the current action firing has terminated its execution. After the call
of this method the current firing is added to the ETG as illustrated in Section 7.3. Furthermore,
both the ActionData and the ActorData data are updated as illustrated in Section 7.2.
Void read(Variable variable, Object value)
This method is called each time a Variable is read by the current action firing. The
readStateVariable map defined in the FiringData is updated accordingly.
Void write(Variable variable, Object value)
This method is called each time a Variable is written by the current action firing. The
writeStateVariable map defined in the FiringData is updated accordingly.
Object[] produce(Buffer buffer, Object[] tokens)
This method is called each time the current action firing writes a collection of tokens on the
Buffer. It must be noted that the TURNUS profiler internally wraps each token Object
in a ProfiledToken object as discussed in Section 7.2. Both the FiringData and the
BufferData are updated accordingly.
Object[] consume(Buffer buffer, Integer numTokens)
This method is called each time the current action firing consumes numTokens from a Buffer.
It must be noted that the code interpreter receives the plain Object values and is unaware of the
ProfiledToken object used to store the producer identifier (see Section 7.2). In other
words, only the TURNUS profiler handles, internally, the concept of profiled token. Both the
FiringData and the BufferData are updated accordingly.
Void enable(Guard guard)
This method is called each time the current action firing enables a Guard. The FiringData
is updated accordingly.
Void disable(Guard guard)
This method is called each time the current action firing disables a Guard. The FiringData
is updated accordingly.
Void call(Procedure procedure)
This method is called each time the current action firing calls (i.e. enters) a Procedure.
The FiringData is updated accordingly.
Void endProcedure()
This method is called each time the current action firing ends (i.e. exits from) a Procedure.
Void call(Opcode opcode)
This method is called each time the current action firing calls an Opcode. The FiringData
is updated accordingly.
Boolean hasTokens(Buffer buffer, Integer numTokens)
This method is called each time the scheduler checks if there are enough tokens in the given
Buffer. If the result is true, then a readHit is stored in the respective BufferData,
otherwise it is a readMiss.
Boolean hasSpace(Buffer buffer, Integer numTokens)
This method is called each time the scheduler checks if there is enough space in the given
Buffer. If the result is true, then a writeHit is stored in the respective BufferData,
otherwise it is a writeMiss.
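To make the calling sequence concrete, the following Python sketch mimics how a serial interpreter could drive such an interface. It is a toy stand-in and not the TURNUS API: the class names, the snake_case method names, the bounded `capacity` field and the `token_deps` set are assumptions introduced only for this example.

```python
class BufferData:
    """Hypothetical stand-in for the per-buffer counters of Section 7.2.4."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tokens = []                      # stored (producer FID, value) pairs
        self.max_occupancy = 0
        self.read_hits = self.read_misses = 0
        self.write_hits = self.write_misses = 0


class Profiler:
    """Toy serial profiler: at most one firing is open at any time."""

    def __init__(self):
        self.next_fid = 0
        self.current = None                   # FID of the open firing, if any
        self.token_deps = set()               # (producer FID, consumer FID) pairs

    def start_firing(self, actor, action):
        assert self.current is None, "serial interpretation: one firing at a time"
        self.current, self.next_fid = self.next_fid, self.next_fid + 1
        return self.current

    def end_firing(self):
        self.current = None

    def has_tokens(self, buf, n):
        ok = len(buf.tokens) >= n
        if ok:
            buf.read_hits += 1
        else:
            buf.read_misses += 1
        return ok

    def has_space(self, buf, n):
        ok = buf.capacity - len(buf.tokens) >= n
        if ok:
            buf.write_hits += 1
        else:
            buf.write_misses += 1
        return ok

    def produce(self, buf, values):
        # Wrap each value with the FID of the producing firing.
        buf.tokens.extend((self.current, v) for v in values)
        buf.max_occupancy = max(buf.max_occupancy, len(buf.tokens))

    def consume(self, buf, n):
        taken, buf.tokens = buf.tokens[:n], buf.tokens[n:]
        self.token_deps.update((producer, self.current) for producer, _ in taken)
        return [value for _, value in taken]  # the interpreter sees plain values only
```

The interpreter would call `has_tokens` and `has_space` while scheduling, then bracket each firing with `start_firing` and `end_firing`; the producer FID travels with each token but is never exposed to the interpreter.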
7.5 Conclusions
In this chapter, the main functionalities and structure of the CAL dataflow profiler available
in the framework illustrated in Chapter 6 have been discussed. Compared to the available
profiling tools for this dataflow language, the new functionalities are the possibilities to
generate a complete ETG and to obtain statistical profiling information with different levels
of abstraction. The number of executed and called operators and procedures as well as the
internal actor variables and token utilization (i.e. read/write) for each actor-class, actor, action
and procedure can be analyzed. Furthermore, buffer utilization statistics are provided in
terms of token production/consumption rates, maximal occupancy, read hits and misses, as
well as write hits and misses. Finally, the APIs that can be used by a generic third-party CAL
code interpreter have been illustrated. An example of integration with an already-available CAL
compiler has been discussed.
8 Design space exploration and optimization with TURNUS
In this chapter different DSE strategies based on the analysis of the ETG are illustrated. First
of all, it is discussed how design performance can be estimated through a post-mortem
scheduling of the ETG. It is also illustrated how timing information is estimated and assigned
to each firing and each dependency. Then, some DSE analyses are illustrated and
discussed. These heuristics are all based on the analysis of the design space critical path.
Different problems, such as evaluating the design refactoring directions, minimizing and
optimizing the buffer size configuration and minimizing the dynamic power dissipation of a
design are also discussed.
8.1 Performance estimation
Performance of a program is estimated by an ETG post-mortem scheduling that takes into
account a particular mapping configuration. Using the notions of architecture modeling,
enhanced with clock-cycle accurate profiling information, Equation (4.3) can be defined as:
\[ \hat{T}(m) = f(m, ETG(S, D), G(PU, ME, L), \Theta) \tag{8.1} \]
where $m = (\rho, \sigma, \beta) \in M$ is a mapping configuration point of the design space, ETG(S, D) is the
ETG of the program, G(PU, ME, L) is the target architecture model and $\Theta$ is the set of
clock-accurate profiling information retrieved by third-party profilers. Performance, in terms of
throughput, is estimated by introducing the timing information illustrated in Section 5.4 for each
action firing $s_i \in S$ and each dependency $(s_i, s_j) \in D$. Recalling Equation (5.10), the algorithmic
part execution time $w(s_i)_e$ can be obtained by third-party HW and SW profilers (e.g. GNU
gprof, Valgrind, ModelSim). On the contrary, the other terms contained in $w(s_i)$ (e.g. the
action selection time, read and write delays) and the dependency weights $w(s_i, s_j)$ should
be estimated. Furthermore, as discussed in Section 5.3.2, additional mapping dependencies
might be introduced in D according to the particular mapping configuration. It must be
noted that the partial order represented by the ETG should remain the same even after the
post-mortem scheduling (i.e. locally in each actor and globally over the entire design network).
In the remainder of this section, the structure and the main functionalities of the ETG post-mortem
scheduler used to estimate the design performance are discussed. Furthermore, it is clarified
how the timing information can be estimated and used by the underlying analyses that are
illustrated in this chapter.
8.1.1 Post-mortem scheduler models
The ETG post-mortem scheduler is based on a discrete event system specification (DEVS)
formalism [180, 181]. This is a modular, hierarchical and timed-event system which makes
possible, among other things, the modeling and the analysis of discrete-event systems. The
two basic elements that describe a DEVS model are the following:
• AtomicModel: the basic building blocks of a DEVS model. The behavior of an atomic
model is described by its state transition functions (internal, external, and confluent),
its output function, and its time advance function.
• PortValue: makes the communication possible between a pair of atomic models.
Moreover it defines the template argument for the types of objects that can be accepted
as input and produced as output.
The state of an atomic model is realized by the attributes contained in the object that im-
plements the model. The evolution of the state is modeled through the combination of the
following functions and events:
• Internal transition function $\delta_{int}$: describes the autonomous behavior of the model (i.e. how
its state evolves in the absence of input). These types of events are called internal events
because they are self-induced (i.e. internal to the model).
• Time advance function $\delta_{a}$: schedules these autonomous changes of state.
• Output function $\delta_{out}$: describes the output of the model when an internal event occurs.
• External transition function $\delta_{ext}$: describes how the model changes state in response
to the input.
• Confluent transition function $\delta_{conf}$: handles the simultaneous occurrence of both an
internal and an external event.
This section describes how the dataflow program and the target architecture are modeled using the
DEVS formalism. See Appendix A.2 for a complete overview of DEVS.
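The five functions listed above can be illustrated with a small atomic model and a driver loop. The following Python sketch is only an assumption-laden illustration and is unrelated to the actual simulator used by TURNUS: the `Processor` model, its method names and the `simulate` driver are hypothetical, and the model handles a single input stream.

```python
import math


class Processor:
    """Atomic model that holds one job for `service_time`, then emits it."""

    def __init__(self, service_time):
        self.service_time = service_time
        self.job = None              # state: job in service, or None when idle
        self.remaining = math.inf    # state: time left until the next internal event

    def time_advance(self):          # time advance function
        return self.remaining

    def output(self):                # output function, invoked just before delta_int
        return self.job

    def delta_int(self):             # internal transition: the job is done
        self.job, self.remaining = None, math.inf

    def delta_ext(self, elapsed, job):   # external transition: an input arrives
        if self.job is None:
            self.job, self.remaining = job, self.service_time
        else:
            self.remaining -= elapsed    # busy: drop the input, keep the schedule

    def delta_conf(self, job):       # confluent transition: internal first, then external
        self.delta_int()
        self.delta_ext(0.0, job)


def simulate(model, inputs):
    """Run `model` on time-sorted (time, value) inputs; return (time, output) pairs."""
    t_last, outputs, i = 0.0, [], 0
    while True:
        t_int = t_last + model.time_advance()
        t_ext = inputs[i][0] if i < len(inputs) else math.inf
        t = min(t_int, t_ext)
        if math.isinf(t):
            return outputs
        if t_int <= t_ext:
            outputs.append((t, model.output()))
        if t_int < t_ext:
            model.delta_int()
        elif t_ext < t_int:
            model.delta_ext(t - t_last, inputs[i][1])
            i += 1
        else:
            model.delta_conf(inputs[i][1])
            i += 1
        t_last = t
```

For example, with a service time of 2.0 and jobs arriving at times 0.0, 1.0 and 2.0, the second job is dropped because the model is busy, while the third one arrives exactly when the first completes and is handled by the confluent transition.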
Figure 8.1: Execution trace graph post-mortem scheduler: simulation models. (a) DEVS model of a buffer (ports IN_DATA, REQUEST_SPACE, READY_TO_CONSUME, IN_DATA_DONE, OUT_DATA, REQUEST_TOKENS), an actor output port (OUT_DATA, ASK_SPACE, OUT_DATA_RECEIVED, HAS_SPACE) and an actor input port (IN_DATA, ASK_TOKENS). (b) DEVS model of the dataflow application, discussed in Section 2.5.4, mapped on two separate operator partitions: the Producer, Filter and Consumer actors are connected through the buffers b1 and b2, and the scheduler of each partition is connected to the STATUS and ENABLE ports.
Actor model
Each actor a ∈ A is modeled as an AtomicActor element which describes a DEVS atomic
model. Each actor atomic model contains the subset of the actor firings $S_a \subseteq S$. Each actor
output port $p^{out}_i \in P^{out}_a$ is modeled, as illustrated in Figure 8.1a, with the following four
PortValue elements:
• OUT_DATA: used to send the produced tokens to the output buffer.
• ASK_SPACE: used to send the number of tokens that should be produced.
• HAS_SPACE: used for receiving an acknowledgment signal when the requested token
space is available.
• OUT_DATA_RECEIVED: used for receiving an acknowledgment signal when all the
produced tokens have been successfully received.
Similarly, each actor input port $p^{in}_i \in P^{in}_a$ is modeled with the following two PortValue
elements:
• IN_DATA: used to receive the input tokens from the input buffer.
• ASK_TOKENS: used to send the number of tokens that should be consumed.
It must be noted that an input event is associated with each input PortValue. Similarly,
an output event is associated with each output PortValue. Furthermore, as illustrated
in Figure 8.1b, the additional ENABLE input PortValue element and the STATUS output
PortValue element are defined for each actor. Both ports are used by the partition scheduler
(i.e. see the following part of this section) to enable and disable the actor and to retrieve its
status.
Buffer model
Each buffer b ∈B is modeled as an AtomicBuffer which describes a DEVS atomic model.
Each buffer atomic model is modeled as an asynchronous receiver/transmitter (Rx/Tx). The
following four PortValue elements, as illustrated in Figure 8.1a, are used to model the Rx
interface with the source actor:
• IN_DATA: used to receive the tokens produced by the actor.
• IN_REQUEST_SPACE: used to receive the number of tokens that the actor wants to
produce.
• READY_TO_CONSUME: used to send the acknowledgment signal when the requested
token space is available.
• IN_DATA_DONE: used to send the acknowledgment signal when all the tokens pro-
duced by the actor have been received.
Similarly, the following two PortValue elements are used to model the Tx interface with the
target actor:
• OUT_DATA: used to send the tokens requested by the actor.
• REQUEST_TOKENS: used to receive the number of tokens required by the actor.
It must be noted that an input event is associated with each input PortValue. Similarly,
an output event is associated with each output PortValue. Furthermore, as illustrated
in Figure 8.1b, the additional ENABLE_RX and ENABLE_TX input PortValue elements
are defined for each buffer. These are used by the partition scheduler (i.e. see below) to
asynchronously enable and disable the Rx and Tx interfaces, respectively, of the buffer.
Mapping model
Each partition is modeled as an AtomicPartition which describes a DEVS atomic model.
As illustrated in Figure 8.1b, the scheduler of each actor and buffer partition is modeled as a
controller that enables or disables the corresponding atomic objects. Each actor is enabled by
sending a signal to the ENABLE port according to its status provided through the STATUS port.
Similarly, each buffer is enabled by sending a signal to the ENABLE_RX and ENABLE_TX
ports. These ports can be used asynchronously in order to model buffers that are on the
boundary of two actor partitions or buffers that are used in a multi-clock domain architecture.
As an example, Figure 8.1b illustrates the post-mortem scheduler model for the dataflow program
discussed in Section 2.5.4. In this case the Producer and Filter actors are mapped to
the same partition, Partition A, and the Consumer actor is mapped to Partition B.
Each of these partitions has an actor scheduler and a buffer scheduler. It must be noted
that the buffer b1 is modeled as a synchronous buffer (i.e. the input and output interfaces are
activated at the same time), contrary to the buffer b2, which is modeled as an asynchronous
buffer (i.e. the activation of the input and the output interfaces is decoupled).
8.1.2 Execution trace graph post-mortem scheduling
Performance of a program is estimated by a post-mortem scheduling of the ETG using the
DEVS simulator previously described. For each firing $s_i \in S$, the timing information illustrated
in Section 5.4 is estimated. Additional dependencies are introduced in D according to the
particular mapping configuration. Figure 8.2 illustrates how each firing $s_i \in S$ is post-scheduled
in six stages: schedule firing, ask tokens, consume tokens, execute firing, ask space and
produce tokens. The starting and ending times of each stage are
used, as described below, to evaluate both the firing and the dependency weights.
Schedule firing
During this stage the actor is selected by the partition scheduler by using the ENABLE signal.
A new unprocessed firing $s_i \in S_a$ is selected. Considering $s_j$ as the last firing already processed
in the given partition, the additional dependency $(s_j, s_i)$ should be added to the original
dependencies set D as discussed in Section 5.3.2. The time required for performing this stage
defines the dependency weight as:
\[ w(s_j, s_i) = t(s_i)^{end}_{schedule} - t(s_i)^{start}_{schedule} \tag{8.2} \]
The time required for performing this step is estimated according to the architecture model and
the scheduling policy.
Ask tokens
This stage is performed if there is at least one incoming token dependency of $s_i$ that should be
processed. In this case, the actor sends a token request to each corresponding input buffer through
the corresponding ASK_TOKENS port. The time required for performing this stage defines
the input wait time of the firing:
\[ w(s_i)_{rd} = t(s_i)^{end}_{askTokens} - t(s_i)^{start}_{askTokens} \tag{8.3} \]
The time required for performing this step is estimated according to the architecture model.
Figure 8.2: Sequence diagram for the DEVS atomic implementation of an actor. [Sequence diagram between the Actor and its input and output buffer interfaces, covering the schedule firing, ask tokens, read tokens, execute firing, ask space and write tokens stages. The actor requests numTokens through ASK_TOKENS (REQUEST_TOKENS on the buffer side) and receives the tokens through OUT_DATA/IN_DATA; it then requests space for numTokens through ASK_SPACE (REQUEST_SPACE), is acknowledged with true through READY_TO_CONSUME/HAS_SPACE, sends the tokens through OUT_DATA/IN_DATA, and is acknowledged with true through IN_DATA_DONE/OUT_DATA_RECEIVED.]
Consume tokens
During this stage the incoming token dependencies are processed and the corresponding
tokens are consumed. Each token is retrieved from the corresponding IN_DATA port. The
time required for performing this stage defines the read input token time of the firing:
\[ w(s_i)_{r} = t(s_i)^{end}_{consumeTokens} - t(s_i)^{start}_{consumeTokens} \tag{8.4} \]
Furthermore, for each token dependency, the time when this stage is performed is recorded
and defined as $t(s_i)_{consume}$. The time required for performing this stage is estimated according
to the architecture model.
Execute firing
During this stage the algorithmic part of the firing is executed. The time required to perform
this stage defines the algorithmic part execution time of the firing:
\[ w(s_i)_{e} = t(s_i)^{end}_{execute} - t(s_i)^{start}_{execute} \tag{8.5} \]
It must be noted that the time required for performing this step can be obtained from third-party
profiling information.
Ask space
This stage is performed if there is at least one outgoing token dependency of $s_i$ that should be
processed. In this case, the actor sends a space request to each corresponding output
buffer through the corresponding ASK_SPACE port. The time required for performing this
stage defines the output waiting time of the firing:
\[ w(s_i)_{wd} = t(s_i)^{end}_{askSpace} - t(s_i)^{start}_{askSpace} \tag{8.6} \]
Furthermore, for each token dependency, the time when this stage is performed is recorded
and defined as $t(s_i)_{askSpace}$. The time required for performing this step is estimated according
to the architecture model.
Produce tokens
During this stage the outgoing token dependencies are processed and the corresponding
tokens are produced. Each token is produced on the corresponding OUT_DATA port. The time
required for performing this stage defines the write output token time of the firing:
\[ w(s_i)_{w} = t(s_i)^{end}_{produce} - t(s_i)^{start}_{produce} \tag{8.7} \]
Furthermore, for each token dependency, the time when this stage is performed is recorded
and defined as $t(s_i)_{produce}$. Consequently, the token dependency weight can be defined as:
\[ w(s_j, s_i) = t(s_i)_{consume} - t(s_j)_{produce} \tag{8.8} \]
The time required for performing this step is estimated according to the architecture model.
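The stage-by-stage definitions above all follow the same pattern: a weight is the difference between the end and the start time stamp of a stage. A minimal Python sketch of this bookkeeping is given below. The function names, the dictionary layout and the stage keys are assumptions of this example, and the sketch deliberately does not combine the terms into a single firing weight, since that combination is given by Equation (5.10).

```python
def stage_durations(stamps):
    """Map each stage name to (end - start), given {stage: (start, end)}."""
    return {stage: end - start for stage, (start, end) in stamps.items()}


def token_dependency_weight(t_consume_target, t_produce_source):
    """Time between the production of a token by s_j and its consumption by s_i."""
    return t_consume_target - t_produce_source


# Hypothetical time stamps (in clock cycles) for one firing s_i.
stamps = {
    "schedule": (0, 1),        # dependency weight w(s_j, s_i) on the previous firing
    "askTokens": (1, 3),       # input wait time
    "consumeTokens": (3, 4),   # read input token time
    "execute": (4, 9),         # algorithmic part execution time
    "askSpace": (9, 10),       # output waiting time
    "produce": (10, 12),       # write output token time
}
durations = stage_durations(stamps)
```

A firing that skips a stage (for instance, one without incoming token dependencies) would simply have no entry for that stage in `stamps`.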
8.1.3 Execution statistics
As an initial computational load statistic, the overall network workload of the entire dataflow
program is defined as:
\[ w = \sum \{ w(s_i) + \max\{ w(s_j, s_i) : (s_j, s_i) \in \delta(s_i)^{-}_{S} \} : s_i \in S \} \tag{8.9} \]
For each actor-class $\kappa \in K$, the corresponding actor-class workload is defined as:
\[ w(\kappa) = \sum \{ w(s_i) + \max\{ w(s_j, s_i) : (s_j, s_i) \in \delta(s_i)^{-}_{S_\kappa} \} : s_i \in S_\kappa \} \tag{8.10} \]
Similarly, for each actor a ∈ A the actor workload is defined as:
\[ w(a) = \sum \{ w(s_i) + \max\{ w(s_j, s_i) : (s_j, s_i) \in \delta(s_i)^{-}_{S_a} \} : s_i \in S_a \} \tag{8.11} \]
where $\delta(s_i)^{-}_{S_a}$ defines the set of incoming dependencies whose source firing belongs to
the same actor a. For each action $\lambda \in \Lambda$ of the actor, the action workload is defined as:
\[ w(\lambda) = \sum \{ w(s_i) + \max\{ w(s_j, s_i) : (s_j, s_i) \in \delta(s_i)^{-}_{S_\lambda} \} : s_i \in S_\lambda \} \tag{8.12} \]
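The four workload definitions share one structure and differ only in the subset of firings considered. The following Python sketch illustrates this under two assumptions that are not stated in the text: the maximum over an empty set of incoming dependencies is taken as 0, and the weights are supplied as plain dictionaries. The function name and the data layout are hypothetical.

```python
def workload(firing_w, dep_w, subset=None):
    """Sum, over the firings in `subset`, of w(s_i) plus the largest weight among
    the incoming dependencies whose source firing is also in `subset`.

    firing_w: {firing: weight}; dep_w: {(source, target): weight}.
    With subset=None the whole network is considered.
    """
    members = set(firing_w) if subset is None else set(subset)
    total = 0
    for si in members:
        incoming = [w for (sj, target), w in dep_w.items()
                    if target == si and sj in members]
        total += firing_w[si] + max(incoming, default=0)  # assumption: empty max is 0
    return total


firing_w = {1: 2, 2: 3, 3: 1}
dep_w = {(1, 2): 1, (1, 3): 4, (2, 3): 2}
network = workload(firing_w, dep_w)            # 2 + (3 + 1) + (1 + 4) = 11
partial = workload(firing_w, dep_w, {2, 3})    # 3 + (1 + 2) = 6
```

This direct transcription scans all dependencies for every firing; indexing the dependencies by target firing first would bring the cost down to one pass over S and D.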
8.1.4 Analysis of a collection of execution trace graphs
As discussed in Section 2.3.3, the execution behavior of a dynamic dataflow program can
change according to the input sequence. Hence, the analysis and the exploration should be
performed using a collection of ETGs generated with different input sequences.
Considering a finite set of input sequences $I = \{I_1, I_2, \ldots, I_{n_I}\}$, the corresponding collection
of ETGs is defined as:
\[ ETGs = \{ ETG(S_1, D_1), ETG(S_2, D_2), \ldots, ETG(S_{n_I}, D_{n_I}) \} \tag{8.13} \]
8.2 Design space critical path
Many metrics for dataflow programs have been developed with the aim of helping designers
reduce the running time of their applications. The main requirement of such metrics is to
provide a clear optimization objective by highlighting both problematic actors (or actions)
and buffers that may reduce the design performance. Such indications are fundamental for
applications whose complexity is beyond what a designer can successfully assess by intuition.
The most widely used metric is the makespan, which is defined as the start-to-end execution time
of an application [182, 183]. Using the formalism of the ETG, the makespan can be seen as
the execution critical path length: this can be defined as the longest, time-weighted sequence
of events from the start of the program to its termination [1, 2]. In the context of RVC-CAL, a
first attempt at defining a critical path analysis methodology was introduced in [33]. However,
this approach makes the simplifying assumption that all actors are executed in parallel with
an unbounded buffer size configuration. As a result, only the computation load of each
action is taken into account, whereas both the scheduler overhead and the buffer latencies are
neglected. In other words, only the fully serial code portion of the program can be identified
as the design bottleneck. Consequently, this approach severely restricts the design space
exploration. In the following section, this methodology is improved by defining the concept of
design space critical path.
8.2.1 Critical path length
The critical path length (CPL) can be evaluated in different ways [32, 33, 184, 185]. Indeed,
as demonstrated in [1, 2] the technique provided in [185] seems to be the most convenient
both for the reduced complexity of the algorithm and for the additional profiling information
that could be retrieved. The latter is illustrated herein below. For each action firing si ∈ S, four
parameters should be evaluated. These are:
• Early Start time ES(si ) which defines its earliest possible starting execution time.
• Latest Start time LS(si ) which defines its latest possible starting execution time without
extending the overall program completion time.
• Early Finish time EF (si ) which defines its earliest possible ending execution time.
• Latest Finish time LF (si ) which defines its latest possible ending execution time with-
out extending the overall program completion time.
Moreover, an additional parameter called slack is introduced both for each action firing $s_i \in S$
and for each dependency $(s_i, s_j) \in D$, denoted by $SL(s_i)$ and $SL(s_i, s_j)$, respectively. The slack
defines the maximum delay that a fired action or a dependency can tolerate without impacting
the overall completion time. The CP can be evaluated in $O(|S| + |D|)$ by performing
Algorithms 1, 2 and 3 in sequence. This evaluation can be summarized as follows. Firstly, for
each $s_i \in S$ the early start time $ES(s_i)$ and the early finish time $EF(s_i)$ are evaluated by
following any valid increasing topological order of $S$. It must be noted that for each source
firing $s_j \in S^{-}_{\emptyset}$ (see Equation (5.4)) the preconditions $ES(s_j) = 0$ and $EF(s_j) = 0$ are imposed.
Secondly, for each $s_i \in S$ the latest start time $LS(s_i)$ and the latest finish time $LF(s_i)$ are
evaluated by following any valid decreasing topological order of $S$. It must be noted that for
each sink firing $s_j \in S^{+}_{\emptyset}$ (see Equation (5.7)) the preconditions $LS(s_j) = ES(s_j)$ and
$LF(s_j) = EF(s_j)$ are imposed. Finally, the slack values of both the action firings $s_i \in S$ and the
dependencies $(s_i, s_j) \in D$ are evaluated. The set of critical action firings is
defined as:
\[
S_c = \{s_i : SL(s_i) = 0\} \subseteq S \tag{8.14}
\]
Chapter 8. Design space exploration and optimization with TURNUS
Similarly, the set of critical dependencies is defined as:
\[
D_c = \{(s_i, s_j) : SL(s_i, s_j) = 0\} \subseteq D \tag{8.15}
\]
Finally, the critical path $\overrightarrow{CP}$ is evaluated by walking back the ETG as illustrated in
Algorithm 3. At each iteration a new action firing is selected by following one of the incoming
critical edges, that is, it is chosen from the set $\delta(s_i)^{-}_{S_c} = \{s_j : \exists (s_j, s_i) \in D_c\}$. The sets
$S_{CP} \subseteq S_c$ and $D_{CP} \subseteq D_c$ contain respectively the fired actions and the dependencies along this
path. Similarly, the sets $K_{CP} \subseteq K$, $A_{CP} \subseteq A$ and $\Lambda_{CP} \subseteq \Lambda$ contain respectively the
actor-classes, actors and actions that have at least one action firing along this path. The CP
can be considered completely determined only when one of the source firings
$s_i \in S^{-}_{\emptyset}$ is reached. It must be noted that one such path always exists [186]. As a
result, the critical path length is defined as:
\[
|\overrightarrow{CP}| = f(\sigma, \rho, \beta) = \sum\{w(s_i) : s_i \in S_{CP}\} + \sum\{w(s_i, s_j) : (s_i, s_j) \in D_{CP}\} \tag{8.16}
\]
where $w(s_i)$ and $w(s_i, s_j)$ represent the action firing and dependency weights, respectively, as
discussed in Section 5.4. For this reason $|\overrightarrow{CP}|$ can be seen as a function $f$ of the scheduling,
partitioning and buffer size configuration. It can also be evaluated as:
\[
|\overrightarrow{CP}| = \max\{LF(s_i) : s_i \in S\} \tag{8.17}
\]
As mentioned above, the main advantage of evaluating the critical path in such a way is that
this can be done in linear time (i.e. $O(|S| + |D|)$). Moreover, all the critical actions or critical
dependencies that are not along the CP can be highlighted through their slack value.
Remark. More than one CP may exist for each weighted ETG. In this case each CP contains
different action firings. However, the length of these paths is always $|\overrightarrow{CP}|$.
Statistical distribution
The profiling clock weights illustrated in Section 6.2.4 make it possible to model the execution
time as a statistical value. In fact, for each action, it is possible to specify the average, the
minimal and the maximal number of clock cycles required for the execution. Hence, it is
possible to model the execution weight $w(s_i)_e$ as a random variable following a normal
distribution with expected value and variance, respectively, defined as:
\[
E[w(s_i)_e] = \frac{1}{\alpha_2}\left(\min(s_i) + \alpha_1\,\mathrm{mean}(s_i) + \max(s_i)\right), \qquad
Var(w(s_i)_e) = \left(\frac{\max(s_i) - \min(s_i)}{\alpha_2}\right)^2 \tag{8.18}
\]
where $\mathrm{mean}(s_i)$, $\min(s_i)$ and $\max(s_i)$ are the average, minimal and maximal execution time,
respectively, defined in Figure 6.23. It must be noted that $\alpha_1$ and $\alpha_2$ are used to model the
distribution shape and they are generally defined as $\alpha_1 = 4$ and $\alpha_2 = 6$ [187]. Furthermore,
making the assumption that the execution time of each firing is independent and uncorrelated
8.2. Design space critical path
Algorithm 1: Compute the set of parameters ES(s_i), EF(s_i), LS(s_i), LF(s_i) for each s_i ∈ S.
Input: (S, ≤) the firings po-set with size H_s = |S|
Result: ES(s_i), EF(s_i), LS(s_i), LF(s_i) for each s_i ∈ S
// Initialize the source firings S^-_∅
for s_j ∈ S^-_∅ do
    ES(s_j) ← 0
    EF(s_j) ← 0
end
// Iterate S with an increasing topological order
i ← 1
while i ≤ H_s do
    if s_i ∉ S^-_∅ then
        ES(s_i) ← max{EF(s_j) + w(s_j, s_i) : (s_j, s_i) ∈ δ(s_i)^-_D}
        EF(s_i) ← ES(s_i) + w(s_i)
    end
    i ← i + 1
end
// Initialize the sink firings S^+_∅
for s_j ∈ S^+_∅ do
    LS(s_j) ← ES(s_j)
    LF(s_j) ← EF(s_j)
end
// Iterate S with a decreasing topological order
i ← H_s
while i ≥ 1 do
    if s_i ∉ S^+_∅ then
        LF(s_i) ← min{LS(s_j) − w(s_i, s_j) : (s_i, s_j) ∈ δ(s_i)^+_D}
        LS(s_i) ← LF(s_i) − w(s_i)
    end
    i ← i − 1
end
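The forward and backward passes of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration and not the TURNUS implementation: the function name `compute_times`, the dictionary-based graph representation and the four-firing toy graph are all assumptions made for this example. The source firing is given a zero weight so that the precondition EF = 0 does not hide any work.

```python
from collections import defaultdict

def compute_times(order, w_node, w_edge):
    """order: firings in a valid increasing topological order (assumed given);
    w_node[s]: firing weight; w_edge[(u, v)]: dependency weight."""
    preds, succs = defaultdict(list), defaultdict(list)
    for (u, v) in w_edge:
        preds[v].append(u)
        succs[u].append(v)
    ES, EF, LS, LF = {}, {}, {}, {}
    # Forward pass: increasing topological order.
    for s in order:
        if not preds[s]:                      # source firing preconditions
            ES[s], EF[s] = 0, 0
        else:
            ES[s] = max(EF[p] + w_edge[(p, s)] for p in preds[s])
            EF[s] = ES[s] + w_node[s]
    # Backward pass: decreasing topological order.
    for s in reversed(order):
        if not succs[s]:                      # sink firing preconditions
            LS[s], LF[s] = ES[s], EF[s]
        else:
            LF[s] = min(LS[q] - w_edge[(s, q)] for q in succs[s])
            LS[s] = LF[s] - w_node[s]
    return ES, EF, LS, LF

# Hypothetical toy ETG: a -> {b, c} -> d, with zero dependency weights.
order = ["a", "b", "c", "d"]
w_node = {"a": 0, "b": 3, "c": 5, "d": 1}
w_edge = {("a", "b"): 0, ("a", "c"): 0, ("b", "d"): 0, ("c", "d"): 0}
ES, EF, LS, LF = compute_times(order, w_node, w_edge)
cp_length = max(LF.values())                  # Equation (8.17)
```

Each firing and each dependency is visited a constant number of times, which matches the O(|S| + |D|) bound stated above.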
Algorithm 2: Compute the slack values SL(s_i) for each s_i ∈ S and SL(s_i, s_j) for each
(s_i, s_j) ∈ D, together with the critical firings set S_c and the critical dependencies set D_c.
Input: S the firings set
Input: D the dependencies set
Result: S_c the critical firings set and D_c the critical dependencies set
Data: ES(s_i), EF(s_i), LS(s_i), LF(s_i) for each firing s_i ∈ S evaluated using Algorithm 1
Data: S_c = ∅ and D_c = ∅
// Compute the critical firings set S_c
for s_i ∈ S do
    SL(s_i) ← LF(s_i) − EF(s_i)
    if SL(s_i) = 0 then
        S_c ← S_c ∪ {s_i}
    end
end
// Compute the critical dependencies set D_c
for (s_i, s_j) ∈ D do
    SL(s_i, s_j) ← LS(s_j) − EF(s_i) − w(s_i, s_j)
    if SL(s_i, s_j) = 0 then
        D_c ← D_c ∪ {(s_i, s_j)}
    end
end
Algorithm 3: Critical path extraction.
Input: S the firings set
Result: CP the critical path
Data: LF(s_i) for each firing s_i ∈ S evaluated using Algorithm 1
Data: S_c the critical firings set and D_c the critical dependencies set evaluated using Algorithm 2
Data: CP = ∅
// Find the last CP firing
s ← argmax{s_i : LF(s_i) ≥ LF(s_j), ∀s_j ∈ S}
while s ≠ ⊥ do
    CP ← s ⊕ CP
    s ← getCriticalPredecessor(s)
end
begin getCriticalPredecessor(s_i)
    for s_j ∈ δ(s_i)^-_S do
        if s_j ∈ S_c and (s_j, s_i) ∈ D_c then
            return s_j
        end
    end
    return ⊥
end
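As an illustration of Algorithms 2 and 3, the following Python sketch computes the slack values and walks back one critical path. The helper names `critical_sets` and `extract_cp` and the hard-coded timing values are assumptions: the values correspond to a hypothetical four-firing graph a -> {b, c} -> d with firing weights 0, 3, 5, 1 and zero dependency weights.

```python
def critical_sets(EF, LS, LF, w_edge):
    """Slack of firings and dependencies, and the critical sets Sc and Dc."""
    slack = {s: LF[s] - EF[s] for s in EF}
    edge_slack = {(u, v): LS[v] - EF[u] - w for (u, v), w in w_edge.items()}
    Sc = {s for s, sl in slack.items() if sl == 0}
    Dc = {e for e, sl in edge_slack.items() if sl == 0}
    return slack, edge_slack, Sc, Dc

def extract_cp(LF, Sc, Dc):
    """Walk back from the firing with the largest LF along critical edges."""
    s = max(LF, key=LF.get)
    path = []
    while s is not None:
        path.insert(0, s)                      # CP <- s (+) CP
        # First critical predecessor reached through a critical dependency.
        s = next((u for (u, v) in sorted(Dc) if v == s and u in Sc), None)
    return path

# Timing values assumed to come from a previous run of Algorithm 1.
EF = {"a": 0, "b": 3, "c": 5, "d": 6}
LS = {"a": 0, "b": 2, "c": 0, "d": 5}
LF = {"a": 0, "b": 5, "c": 5, "d": 6}
w_edge = {("a", "b"): 0, ("a", "c"): 0, ("b", "d"): 0, ("c", "d"): 0}

slack, edge_slack, Sc, Dc = critical_sets(EF, LS, LF, w_edge)
cp = extract_cp(LF, Sc, Dc)
```

In this toy case firing b has a slack of 2, so it is not critical even though it lies on a path from the source to the sink.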
to the others [188], it is possible to redefine the $|\overrightarrow{CP}|$ as:
\[
E[|\overrightarrow{CP}|] = \sum\{E[w(s_i)] : s_i \in S_{CP}\}, \qquad
Var(|\overrightarrow{CP}|) = \sum\{Var(w(s_i)) : s_i \in S_{CP}\} \tag{8.19}
\]
where $E[\cdot]$ and $Var(\cdot)$ are the expected value and the variance operators, respectively. In this
context, the variance value is used to define the accuracy of the $\overrightarrow{CP}$.
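A small numerical sketch of Equations (8.18) and (8.19) is given below. The function name `pert_stats`, the profiling triples and the choice of firings on the path are invented for illustration only; the default values of alpha1 and alpha2 are the ones quoted in the text.

```python
def pert_stats(t_min, t_mean, t_max, alpha1=4.0, alpha2=6.0):
    """Expected value and variance of one firing weight, Equation (8.18)."""
    expected = (t_min + alpha1 * t_mean + t_max) / alpha2
    variance = ((t_max - t_min) / alpha2) ** 2
    return expected, variance

# Hypothetical (min, mean, max) clock cycles of three firings.
profile = {"s1": (2, 4, 12), "s2": (5, 5, 5), "s3": (1, 3, 5)}
cp = ["s1", "s3"]                 # firings assumed to lie on the critical path

# Equation (8.19): sums over the firings along the path, assuming that the
# execution times are independent and uncorrelated.
cp_expected = sum(pert_stats(*profile[s])[0] for s in cp)
cp_variance = sum(pert_stats(*profile[s])[1] for s in cp)
```

A firing whose minimal and maximal execution times coincide, such as s2 here, contributes zero variance.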
8.2.2 Algorithmic critical path
The algorithmic critical path (ACP) is evaluated by neglecting, for each action firing, the time
spent waiting for the availability of input tokens and output space, as well as the additional
scheduling dependencies (see Section 5.3.2). In other words, the ACP is evaluated supposing that the
outgoing dependencies of an action firing are immediately made available to its successors.
Consequently, for each action firing $s_i \in S$ the corresponding weight is evaluated as:
\[
w(s_i) =
\begin{cases}
w(s_i)_e & \text{for heterogeneous architecture} \\
w(s_i)_r + w(s_i)_e + w(s_i)_w & \text{for homogeneous architecture}
\end{cases}
\qquad w(s_i, s_j) = 0 \tag{8.20}
\]
that is, only the algorithmic execution weight $w(s_i)_e$ (see Table 5.5) of each action firing is
taken into account for heterogeneous architectures, while the read and write token times are
added for homogeneous ones. All other weights, both for firings and dependencies, are neglected.
After that, the CPL is evaluated as illustrated in Section 8.2.1 and denoted as
$|\overrightarrow{CP}|_{algo}$. This value can be considered as the lower bound for the CPL value of the
entire design space:
\[
|\overrightarrow{CP}|_{algo} \le |\overrightarrow{CP}|(m), \quad \forall m \in M \tag{8.21}
\]
It must be noted that in Equation (8.20) the read and write token times, denoted respectively
as $w(s_i)_r$ and $w(s_i)_w$, have been neglected when a heterogeneous architecture is considered.
This choice is motivated by the fact that the writing and reading time of tokens in heterogeneous
architectures may depend on the particular mapping configuration of the buffers.
If these values were also considered, Equation (8.21) could no longer be regarded as a
lower bound for the entire design space $M$.
Remark. Neglecting the time spent waiting for the availability of input tokens and output
space, and neglecting the additional scheduling dependencies, corresponds to post-scheduling
the ETG considering an unbounded buffer size configuration and a partitioning configuration
where only one actor is assigned to each processing unit (i.e. fully-parallel execution).
8.2.3 Throughput and design space critical path
An interesting property of the CP length $|\overrightarrow{CP}|$ is that it can be easily related to the
application throughput $T$ defined in Equation (4.2) as:
\[
T \propto \frac{1}{|\overrightarrow{CP}|} \tag{8.22}
\]
In other words, by reducing the execution critical path length (or makespan) the throughput
of the application increases [183]. This makes it possible to explore the design space in terms
of $|\overrightarrow{CP}|$ in order to find trade-offs between performance and resource configuration and usage.
From Equation (8.21) it is possible to define an upper bound of the potential achievable
performance of the design as:
\[
T \propto \frac{1}{|\overrightarrow{CP}|} \le \frac{1}{|\overrightarrow{CP}|_{algo}} \tag{8.23}
\]
This latter equation defines what is called the design space critical path (DSCP) of an
application, which is represented in Figure 8.3a. It must be noted that evaluating
$|\overrightarrow{CP}|_{algo}$ can be considered as the starting point of a design space exploration. In fact,
in the event that performance does not meet the established requirements with this optimistic
design configuration, the computational load of the actions (or actors) along this critical path
should be reduced. Subsequently, the DSCP should be explored in order to find trade-offs
between performance and resource configuration and usage, as illustrated in Figure 8.3b. Several
purpose-driven design optimization analyses can be performed in this direction, as illustrated
in the next sections of this chapter.
Remark. Sometimes the throughput of a system is referred to as the production or execution
rate of (a particular set of) actors. Throughput is usually measured in terms of bits per second
or, for a dataflow program, in tokens per second. However, in a DDF program this rate can vary
according to the input stimulus. Consequently, in order to compare its execution with different
input stimuli, the throughput should be referred to as the rate at which each input stimulus is
completely processed.
8.2.4 Potential speedup
[Figure 8.3: Design space critical path. (a) The relationship between the design throughput T and the critical path length |CP| as defined in Equation (8.23); the maximum achievable throughput Tmax is evaluated with the algorithmic critical path length |CP|algo. (b) The design space critical path, plotted as |CP| against the configuration (ρ,σ,β); points {c1, c2, ..., c6} represent different mapping configuration points.]
The theoretical speedup of a program is a widely-used metric in the domain of parallel
computing. It predicts the gain in execution time obtained when parallel processing units
(e.g. cores, threads of execution) are used. The theoretical speedup is defined as:
\[
S(n) = \frac{t(1)}{t(n)} \tag{8.24}
\]
where $t(1)$ and $t(n)$ are the program finishing times when mapped on one and $n$ processing
elements, respectively. Its value is generally estimated using the theoretical formulation
proposed in [189], also known as Amdahl's law, defined as a relationship between the parallelized
implementation of an algorithm and its sequential implementation. Even though this
formulation is widely used in computer science and engineering, it has always been criticized for
the assumptions under which it has been formalized [190, 191]. The main criticisms are that the
problem size is assumed to remain the same when parallelized, and that the parallel portions of
a program can hardly be estimated. Furthermore, this law is formulated assuming that the
program MoC is sequential. In the context of dataflow programming, the theoretical speedup
$S(n)$ can be formulated in terms of the network workload $w$ and the critical path length
$|\overrightarrow{CP}|$, defined in Section 8.1.3 and Section 8.2.1, respectively. In fact, it is possible to
define $t(n) = |\overrightarrow{CP}|(\rho_n)$, hence $t(1) = |\overrightarrow{CP}|(\rho_1) = w$ in the case of one processing unit.
Consequently, Equation (8.24) can be redefined as:
\[
S(n) = \frac{w}{|\overrightarrow{CP}|(\rho_n)} \tag{8.25}
\]
It must be noted that, as discussed in Section 8.2, $|\overrightarrow{CP}|(\rho_n)$ could be a non-linear function of
the application mapping configuration. A lower bound can be evaluated using a simplified
linear model, such as the one depicted in Figure 8.4 and defined as:
\[
|\overrightarrow{CP}|(\rho_n) =
\begin{cases}
w - \dfrac{w - |\overrightarrow{CP}|_{algo}}{n_A - 1}\,(n - 1) & \text{if } n \in [1, n_A] \\[1.5ex]
|\overrightarrow{CP}|_{algo} & \text{if } n \ge n_A
\end{cases} \tag{8.26}
\]
where $n_A$ is the number of actors $a \in A$. It must be noted that the model of Equation (8.26)
only makes the assumption that the communication cost between different actor partitions
remains constant and does not dominate the program execution.
[Figure 8.4: Critical path length linear model |CP|(ρn), plotted against n: the length decreases linearly from w at n = 1 to |CP|algo at n = nA.]
Consequently, defining $h = |\overrightarrow{CP}|_{algo} / w \in [0, 1]$, the maximal theoretical speedup of a
dataflow program can be defined as:
\[
S(n) =
\begin{cases}
\dfrac{1}{1 + \dfrac{h - 1}{n_A - 1}\,(n - 1)} & \text{if } n \in [1, n_A] \\[2.5ex]
\dfrac{1}{h} & \text{if } n \ge n_A
\end{cases} \tag{8.27}
\]
As an example, Figure 8.5 depicts this relation for different values of $h$ when $n_A = 10$.
[Figure 8.5: Theoretical speedup S(n) defined in Equation (8.27) for different values of h = |CP|algo/w ∈ [0,1] (h = 0.3, 0.5 and 0.8) when nA = 10.]
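The speedup model of Equation (8.27) can be evaluated numerically with a few lines of Python. The function name `speedup` is an assumption made for this sketch, which also assumes more than one actor so that the slope of the linear model is defined.

```python
def speedup(n, h, n_actors):
    """Maximal theoretical speedup of Equation (8.27).

    h is the ratio |CP|_algo / w, n the number of processing units and
    n_actors the number of actors n_A (assumed to be greater than 1)."""
    if n >= n_actors:
        return 1.0 / h
    return 1.0 / (1.0 + (h - 1.0) / (n_actors - 1) * (n - 1))

# Sampling the curve for h = 0.5 and n_A = 10, as in one of the cases of Figure 8.5.
curve = [speedup(n, 0.5, 10) for n in range(1, 13)]
```

With one processing unit the model returns a speedup of 1, and from n = nA onwards it saturates at 1/h, so a design whose algorithmic critical path is half of the workload can never run more than twice as fast.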
8.3 Hotspot analysis
When performance requirements in terms of $|\overrightarrow{CP}|_{algo}$ cannot be satisfied, the design should
be refactored. In other words, the designer should reduce the algorithmic complexity of the
actions that contribute most to the most serial part of the design. In the following, two different
refactoring direction metrics, useful for the designer, are presented. The first, called critical
actions ranking, is an ordered list of the actions that contribute most to the overall
$|\overrightarrow{CP}|_{algo}$. The second, called impact analysis, estimates which improvement margins can be
obtained by refactoring a critical action.
8.3.1 Critical actions ranking
For each actor-class $\kappa$ of the program, $w(\kappa)_c$ and $w(\kappa)_{CP}$ define the actor-class critical
workload and the actor-class workload along the $\overrightarrow{CP}$, respectively. These values are evaluated as:
\[
\begin{cases}
w(\kappa)_c = \sum\{w(s_i) + \max\{w(s_j, s_i) : (s_j, s_i) \in \delta(s_i)^{-}_{S_c,\kappa}\} : s_i \in S_c \cap S_\kappa\} \\
w(\kappa)_{CP} = \sum\{w(s_i) + \max\{w(s_j, s_i) : (s_j, s_i) \in \delta(s_i)^{-}_{S_{CP},\kappa}\} : s_i \in S_{CP} \cap S_\kappa\}
\end{cases} \tag{8.28}
\]
where, for a given firing $s_i$, $\delta(s_i)^{-}_{S_c,\kappa}$ and $\delta(s_i)^{-}_{S_{CP},\kappa}$ represent the critical incoming edges
and the incoming edges along the CP where source firings belong to the same actor-class $\kappa$,
respectively. Similarly, for each actor $a$ of the program, $w(a)_c$ and $w(a)_{CP}$ define the actor
critical workload and the actor workload along the $\overrightarrow{CP}$, respectively. These values are evaluated as:
\[
\begin{cases}
w(a)_c = \sum\{w(s_i) + \max\{w(s_j, s_i) : (s_j, s_i) \in \delta(s_i)^{-}_{S_c,a}\} : s_i \in S_c \cap S_a\} \\
w(a)_{CP} = \sum\{w(s_i) + \max\{w(s_j, s_i) : (s_j, s_i) \in \delta(s_i)^{-}_{S_{CP},a}\} : s_i \in S_{CP} \cap S_a\}
\end{cases} \tag{8.29}
\]
where, for a given firing $s_i$, $\delta(s_i)^{-}_{S_c,a}$ and $\delta(s_i)^{-}_{S_{CP},a}$ represent the critical incoming edges
and the incoming edges along the CP where source firings belong to the same actor $a$, respectively.
Similarly, for each action $\lambda$ of an actor, $w(\lambda)_c$ and $w(\lambda)_{CP}$ define the action critical workload
and the action workload along the $\overrightarrow{CP}$, respectively. These values are evaluated as:
\[
\begin{cases}
w(\lambda)_c = \sum\{w(s_i) + \max\{w(s_j, s_i) : (s_j, s_i) \in \delta(s_i)^{-}_{S_c,\lambda}\} : s_i \in S_c \cap S_\lambda\} \\
w(\lambda)_{CP} = \sum\{w(s_i) + \max\{w(s_j, s_i) : (s_j, s_i) \in \delta(s_i)^{-}_{S_{CP},\lambda}\} : s_i \in S_{CP} \cap S_\lambda\}
\end{cases} \tag{8.30}
\]
where, for a given firing $s_i$, $\delta(s_i)^{-}_{S_c,\lambda}$ and $\delta(s_i)^{-}_{S_{CP},\lambda}$ represent the critical incoming edges
and the incoming edges along the CP where source firings belong to the same action $\lambda$, respectively.
Actor-classes, actors and actions can be ranked according to their value of critical workload
and workload along the $\overrightarrow{CP}$. Consequently, it is possible to define the actor-class $\kappa^*$, the actor
$a^*$ and the action $\lambda^*$ that contribute most to the overall $|\overrightarrow{CP}|$ as:
\[
\begin{aligned}
\kappa^*_{CP} &= \arg\max\{\kappa_i : w(\kappa_i)_{CP} \ge w(\kappa_j)_{CP}, \forall \kappa_j \in K_{CP}\} \\
a^*_{CP} &= \arg\max\{a_i : w(a_i)_{CP} \ge w(a_j)_{CP}, \forall a_j \in A_{CP}\} \\
\lambda^*_{CP} &= \arg\max\{\lambda_i : w(\lambda_i)_{CP} \ge w(\lambda_j)_{CP}, \forall \lambda_j \in \Lambda_{CP}\}
\end{aligned} \tag{8.31}
\]
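The ranking of Equations (8.30) and (8.31) can be sketched for the action level as follows. The sketch assumes that a single critical path is given as an ordered list of firings, so that each firing has at most one incoming dependency along the path and the inner max reduces to that dependency. The function name `cp_workload_by_action`, the firing and action names and all the weights are hypothetical.

```python
def cp_workload_by_action(cp, action_of, w_node, w_edge):
    """Workload along the CP per action, in the spirit of Equation (8.30)."""
    workload = {}
    for prev, s in zip([None] + cp[:-1], cp):
        lam = action_of[s]
        extra = 0
        # Count the incoming dependency only if it comes from the same action.
        if prev is not None and action_of[prev] == lam:
            extra = w_edge.get((prev, s), 0)
        workload[lam] = workload.get(lam, 0) + w_node[s] + extra
    return workload

cp = ["s1", "s2", "s3", "s4"]
action_of = {"s1": "read", "s2": "decode", "s3": "decode", "s4": "write"}
w_node = {"s1": 2, "s2": 5, "s3": 4, "s4": 1}
w_edge = {("s2", "s3"): 1}

workload = cp_workload_by_action(cp, action_of, w_node, w_edge)
ranking = sorted(workload, key=workload.get, reverse=True)
most_critical = ranking[0]        # lambda* of Equation (8.31)
```

The same aggregation applies to actors and actor-classes by replacing the `action_of` mapping with the corresponding firing-to-actor or firing-to-actor-class mapping.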
8.3.2 Impact analysis
If the maximum achievable design throughput $T_{max}$ does not satisfy the design requirements
(see Figure 8.3a), the exploration process should initially concentrate on the reduction of
the algorithmic complexity of the design, and subsequently on finding an optimal mapping
configuration. In [1, 2] it has been demonstrated how, when dealing with parallel designs,
the information obtained exclusively from the evaluation of the $\overrightarrow{CP}$ (e.g. the critical ranking
previously described) does not provide a reliable direction for refactoring. Hence, the analysis
should concentrate on estimating and highlighting the action $\lambda$ which requires the least
refactoring effort in order to maximally reduce $|\overrightarrow{CP}|_{algo}$ (and, consequently, maximally
improve $T_{max}$). This estimation can be obtained using the impact analysis technique, which
is summarized in Algorithm 4. The $\overrightarrow{CP}_{algo}$ and the set $S_{CP}$ of action firings along this path are
initially evaluated as illustrated in Section 8.2.2. After that, for each single action $\lambda \in \Lambda_{CP}$
(i.e. each action that has at least one action firing along the CP) it is estimated how much
$|\overrightarrow{CP}|_{algo}$ can be reduced by reducing the algorithmic complexity of this action. The algorithmic
complexity reduction-factor is defined as a value $r \in R_w = [1, 2, \ldots, 100]$. In other words,
$|\overrightarrow{CP}|_{algo}$ is iteratively computed for each $\lambda \in \Lambda_{CP}$ and $r \in R_w$ considering at each evaluation
step the following weight configuration:
\[
w(s_i)^{r} =
\begin{cases}
\begin{cases}
\frac{100 - r}{100}\, w(s_i)_e & \text{if } s_i \in S_\lambda \\
w(s_i)_e & \text{if } s_i \notin S_\lambda
\end{cases}
& \text{for heterogeneous architecture} \\[3ex]
\begin{cases}
w(s_i)_r + \frac{100 - r}{100}\, w(s_i)_e + w(s_i)_w & \text{if } s_i \in S_\lambda \\
w(s_i)_r + w(s_i)_e + w(s_i)_w & \text{if } s_i \notin S_\lambda
\end{cases}
& \text{for homogeneous architecture}
\end{cases}
\qquad w(s_i, s_j)^{r} = 0 \tag{8.32}
\]
In other words, $r$ is the percentage by which the algorithmic execution time $w(s_i)_e$
is reduced in order to reduce the ACP length. For each iteration, the corresponding
ACP length is denoted as $|\overrightarrow{CP}|^{(\lambda,r)}_{algo}$ and the percentage decrease as:
\[
\Delta|\overrightarrow{CP}|^{(\lambda,r)}_{algo} = 100 \left(1 - \frac{|\overrightarrow{CP}|^{(\lambda,r)}_{algo}}{|\overrightarrow{CP}|_{algo}}\right) \in [0, 100] \tag{8.33}
\]
Finally, it is possible to clearly identify on which action the refactoring should be concentrated
in order to reduce $|\overrightarrow{CP}|_{algo}$ and, consequently, improve the maximum achievable design
throughput $T_{max}$. As an example, Figure 8.6 depicts the impact analysis for three
actions $\lambda_1$, $\lambda_2$ and $\lambda_3$.
Algorithm 4: Impact analysis for the set of critical actions Λ_CP.
Input: S the firings set
Input: CP_algo the initial algorithmic critical path
Result: Δ|CP|^(Λ,Rw)_algo the ACP length reduction set
Data: Δ|CP|^(Λ,Rw)_algo = ∅
for λ ∈ Λ_CP do
    for r ∈ R_w do
        // Set the action firing weights
        for s_j ∈ S do
            w(s_j) ← Equation (8.32)
        end
        // Use Algorithms 1, 2 and 3
        |CP|^(λ,r)_algo ← computeCpLength()
        // Evaluate the CP length reduction ratio
        Δ|CP|^(λ,r)_algo ← 100 (1 − |CP|^(λ,r)_algo / |CP|_algo)
        Δ|CP|^(Λ,Rw)_algo ← Δ|CP|^(Λ,Rw)_algo ∪ {Δ|CP|^(λ,r)_algo}
    end
end
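The loop of Algorithm 4 can be illustrated with the following Python sketch. It is a simplification under stated assumptions: dependency weights are zero as in Equation (8.32), only the algorithmic weight is modelled, and the ACP length is computed as a plain longest path that also counts the weight of source firings, which differs from the source precondition of Algorithm 1. The names `acp_length` and `impact_analysis` and the toy graph are hypothetical. Action x is deliberately evaluated although it is not on the critical path, to show a zero impact.

```python
def acp_length(order, preds, weight):
    """Longest path over the firings, visited in topological order."""
    finish = {}
    for s in order:
        start = max((finish[p] for p in preds.get(s, [])), default=0.0)
        finish[s] = start + weight[s]
    return max(finish.values())

def impact_analysis(order, preds, w_exec, action_of, actions, reductions):
    """Percentage ACP reduction for each (action, reduction factor) pair."""
    base = acp_length(order, preds, w_exec)
    result = {}
    for lam in actions:
        for r in reductions:
            # Scale only the firings of the selected action, Equation (8.32).
            scaled = {s: (w * (100 - r) / 100 if action_of[s] == lam else w)
                      for s, w in w_exec.items()}
            result[(lam, r)] = 100 * (1 - acp_length(order, preds, scaled) / base)
    return result

order = ["a", "b", "c", "d"]
preds = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
w_exec = {"a": 1, "b": 4, "c": 6, "d": 1}
action_of = {"a": "src", "b": "x", "c": "y", "d": "sink"}

impact = impact_analysis(order, preds, w_exec, action_of, ["x", "y"], [50, 100])
```

In this toy graph, halving action y shortens the ACP by 25%, and removing it entirely brings no further gain because the parallel branch through action x becomes critical. This saturation is the kind of effect that a ranking based on the critical path alone cannot reveal.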
[Figure 8.6: Example of impact analysis for three actions λ1, λ2 and λ3, plotting the percentage decrease Δ|CP|(λ,r)algo against the reduction factor r.]
8.4 Buffer size dimensioning
The total memory size requirement of an application implemented by a dataflow program
consists of the sum of two contributions: the code size and the data buffer size. Minimizing
the total buffer size can be a very important optimization objective in order to reduce the cost
of designs targeting today's FPGAs, which have severe embedded-memory limitations. In the domain of SDF,
CSDF and DPN designs, which are typically implemented in memory-constrained hardware
platforms, the buffer minimization problem is an NP-complete problem [192, 193, 194, 47, 45]
which necessitates the use of heuristic algorithms.
8.4.1 Related work
One of the pioneering works on buffer minimization was presented in [195], where an algorithm
for scheduling a KPN in bounded memory was illustrated: while simulating the design using
any scheduler and imposing an initial buffer size configuration, the buffer capacity is increased
in case of system deadlock caused by buffer overflow. However, this approach is not guaranteed
to find the minimum buffer size requirement. Since SDF is a special case of KPN (see
Section 2.1.2), in [196] this approach has also been extended to SDF programs, where a
backtracking search is added to the initial algorithm. Other authors provide model-checking
based techniques in order to obtain a close-to-optimal solution [197, 40, 196] by
exploring the entire state space. However, the scalability of these techniques is limited by the
capabilities of the state space exploration stage and can fail for large-scale systems. All of these
heuristic algorithms are suitable only for SDF and CSDF designs and cannot be applied in a
DDF context.
8.4.2 Deadlock and feasible regions
From the knowledge of the minimum buffer size configuration, guarantees on the minimum
achievable throughput $T_{min}$ of the design can be obtained. As depicted in Figure 8.7, for a
given configuration of partitioning and scheduling $(\rho^*, \sigma^*)$ the critical path design space can
be divided into two different regions according to the buffer size configuration $\beta$. These two
regions are the deadlock region and the feasible region, respectively. They are separated by
the minimum buffer size configuration $\beta_{min}$. A lower and an upper bound for the
$\overrightarrow{CP}$ length can be defined as:
\[
|\overrightarrow{CP}|_{algo} \le |\overrightarrow{CP}|(\beta) \le |\overrightarrow{CP}|(\beta_{min}) \tag{8.34}
\]
where $|\overrightarrow{CP}|(\beta_{min})$ defines the critical path length evaluated with the minimal buffer size
configuration $\beta_{min}$.
[Figure 8.7: Critical path design space given different buffer size configurations, plotting |CP|(β) against β between |CP|(βmin) and |CP|algo; βmin separates the deadlock region from the feasible region, and c1 to c4 mark configuration points.]
8.4.3 Minimization by the use of a model predictive control approach
As previously discussed, the problem of bounding and minimizing the buffer size configuration
of a dataflow program, without impacting the performance and guaranteeing at the same time
a deadlock-free execution, has been proven to be NP-complete. Consequently,
it requires the use of heuristic algorithms. In this section, this problem is solved using an ETG
transformation and treating the program as a linear discrete-event system, as illustrated
in Section 5.5.3. Considering an ETG with $n_S = |S|$ firings, the problem of bounding (and
minimizing) the buffer size configuration, guaranteeing at the same time a deadlock-free
execution, can be defined as:
\[
\begin{aligned}
\underset{u(k),\,u(k+1),\,\ldots,\,u(k+n_S)}{\text{minimize}} \quad & J = \sum_{k=1}^{n_S} \sum_{j=1}^{b} y(k)_j \\
\text{subject to} \quad & \text{Equation (5.16)} \\
& y(k)_j \ge 0, \quad \forall k \in \{1, 2, \ldots, n_S\},\ \forall j \in \{1, 2, \ldots, b\} \\
& \sum_{i=1}^{n_S} u(k)_i = 1, \quad \forall k \in \{1, 2, \ldots, n_S\} \\
& \sum_{k=1}^{n_S} u(k)_i = 1, \quad \forall i \in \{1, 2, \ldots, n_S\}
\end{aligned} \tag{8.35}
\]
where $y(k)_j$ represents the $j$-th component of $y(k)$, i.e. the number of tokens available on
the $j$-th buffer. The constraint $\sum_{i=1}^{n_S} u(k)_i = 1, \forall k \in \{1, 2, \ldots, n_S\}$ requires that exactly one
firing $s_i \in S$ is executed at each event $k$, while the constraint
$\sum_{k=1}^{n_S} u(k)_i = 1, \forall i \in \{1, 2, \ldots, n_S\}$ requires that each firing is executed exactly once
(i.e. the deadlock condition is avoided because all the firings in $S$ must be executed). In other
words, executing only one firing at each event $k$ can also be seen as finding a topological order
of $S$ for which the sum of all the available tokens along the dataflow network is minimized
during the entire execution. However, the problem defined in Equation (8.35) is an integer
linear programming (ILP) problem, where the number of optimization variables and constraints
can grow significantly according to the ETG size $n_S$. Consequently, a heuristic algorithm should
be applied. The aim is to find a feasible scheduling sequence of the ETG such that the buffer
size is kept bounded and, if possible, minimized, guaranteeing a deadlock-free execution for all
the action firings in $S$. When dealing with large graphs some well-known heuristics, such as
graph-cutting or pattern recognition, can be successfully used to reduce the problem size. The
heuristics that are illustrated in the following rely on both the formalism of the graph and
automatic control theory to reduce the problem size and to find a sub-optimal solution to
Problem (8.35). Model predictive control (MPC) [198, 199] is a receding-horizon control
technique where, at each event, an optimization problem is solved by predicting the future
system behavior (see Appendix B). Bearing in mind the transformations discussed in
Section 5.5.3, it is possible to define an MPC approach that makes use of the ETG, and where
the prediction and control horizons are related to the graph-cut used for reducing the problem
size.
Deadlock avoidance
As discussed in Section 5.3, one of the main properties of an ETG is that it is completely
independent from any buffer size configuration. Moreover, an ETG can have different topological
orders with different minimal buffer size requirements for admitting a deadlock-free execution.
Therefore, the initial problem can be relaxed: by scheduling only one firing at each event
$k \in \{1, 2, 3, \ldots, n_S\}$, the optimization problem of Equation (8.35) can be solved iteratively,
each time only for a limited set of firings. Consequently, this problem can be solved using the
receding horizon control technique through the use of an MPC controller. In this case, the
prediction horizon $H_p$ defines the number of firings of each ETG-cut. At each event $k$ an
ETG-cut $S(k)'_{H_p} \subseteq S$ is evaluated so that it contains only the $H_p$ unscheduled firings of $S$
with the lowest available topological order. Then, according to the procedure just described,
the optimization problem can be formulated as:
\[
\begin{aligned}
\underset{u(k|k),\,u(k+1|k),\,\ldots,\,u(k+H_c-1|k)}{\text{minimize}} \quad & J(k) = \sum_{i=1}^{H_p} \sum_{j=1}^{b} y(k+i|k)_j \\
\text{subject to} \quad & y(k+j|k)_i \ge 0, \quad \forall j, i \in \{1, 2, \ldots, H_p\} \\
& \sum_{i=1}^{H_p} u(k+j|k)_i = 1, \quad \forall j \in \{0, 1, 2, \ldots, H_c - 1\} \\
& \sum_{i=1}^{H_p} u(k+j|k)_i = 0, \quad \forall j \in \{H_c, H_c + 1, \ldots, H_p\}
\end{aligned} \tag{8.36}
\]
where $H_c$ (i.e. the control horizon) is the number of firings that can be executed (i.e. ordered)
inside $S(k)'_{H_p}$. At each event $k$ only the first selected firing, defined by $u(k)^* = u(k|k)$, is
executed. When executing the firing defined by $u(k)^*$, the number of tokens inside each buffer is
updated accordingly. The minimal buffer size configuration can be defined as the maximal
token occupancy of each buffer observed during the entire execution, and it is available only
once all the firings have been executed. In other words, the bounded buffer size configuration
is evaluated as:
\[
\beta(b_i)_{min} = \max\{y(k)_i,\ \forall k \in \{1, 2, 3, \ldots, n_S\}\} \tag{8.37}
\]
where $\beta(b_i)_{min}$ defines the minimal bounded size of each buffer $b_i \in B$ required for scheduling
the ETG. Consequently, the minimal buffer size configuration which defines the border
between the deadlock region and the feasible region depicted in Figure 8.7 can be defined as:
\[
\beta_{min} = \{\beta(b_1)_{min}, \beta(b_2)_{min}, \ldots, \beta(b_{n_B})_{min}\} \tag{8.38}
\]
The flowchart of this approach is depicted in Figure 8.8. It must be noted that, if this analysis is
performed on a collection of ETGs, then the minimal size value of each buffer is the maximal
value obtained within the ETGs collection.
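The receding-horizon idea can be illustrated with a deliberately simplified Python sketch. It is not the MPC controller of Problem (8.36): instead of solving an optimization problem over the whole prediction horizon, it looks one step ahead and fires the ready firing of the ETG-cut that leaves the fewest tokens in the buffers. The function names, the dictionary layout and the producer and consumer example are assumptions made for illustration.

```python
def schedule_with_horizon(firings, order, horizon):
    """Greedy one-step variant of the deadlock avoidance approach.

    firings[s] holds the predecessors of s and the tokens it consumes and
    produces per buffer; order is a valid topological order of the ETG."""
    done, schedule = set(), []
    tokens, peak = {}, {}
    while len(done) < len(order):
        # ETG-cut: the `horizon` unscheduled firings with lowest topological order.
        window = [s for s in order if s not in done][:horizon]
        ready = [s for s in window if all(p in done for p in firings[s]["preds"])]
        # Pick the firing with the smallest net token production.
        best = min(ready, key=lambda s: sum(firings[s]["produce"].values())
                                        - sum(firings[s]["consume"].values()))
        for buf, n in firings[best]["consume"].items():
            tokens[buf] = tokens.get(buf, 0) - n
        for buf, n in firings[best]["produce"].items():
            tokens[buf] = tokens.get(buf, 0) + n
            peak[buf] = max(peak.get(buf, 0), tokens[buf])   # Equation (8.37)
        done.add(best)
        schedule.append(best)
    return schedule, peak

def firing(preds, consume=None, produce=None):
    return {"preds": preds, "consume": consume or {}, "produce": produce or {}}

# Hypothetical producer (p1..p3) and consumer (c1..c3) sharing buffer b1.
firings = {
    "p1": firing([], produce={"b1": 1}),
    "p2": firing(["p1"], produce={"b1": 1}),
    "p3": firing(["p2"], produce={"b1": 1}),
    "c1": firing(["p1"], consume={"b1": 1}),
    "c2": firing(["c1", "p2"], consume={"b1": 1}),
    "c3": firing(["c2", "p3"], consume={"b1": 1}),
}
order = ["p1", "p2", "p3", "c1", "c2", "c3"]
_, peak_h1 = schedule_with_horizon(firings, order, horizon=1)
sched_h4, peak_h4 = schedule_with_horizon(firings, order, horizon=4)
```

With a horizon of one firing the given topological order is followed as is, and buffer b1 must hold three tokens. A horizon of four firings lets the consumer interleave with the producer, so one token of capacity is enough.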
Deadlock recovery
In the previous approach, the problem of Equation (8.36) should be solved $n_S$ times. Another
approach, which reduces the number of times that this problem should be solved, is to schedule
the ETG post-mortem using a dynamic buffer size configuration that is modified each time
that a deadlock condition arises (i.e. no action firing of the ETG can be scheduled
because some buffers are full). This second approach can be considered as an improvement
of the one introduced in [195], where the key idea is to recover from a deadlocked execution by
unblocking only the blocked action that produces the highest number of tokens that could
resolve the deadlock condition. However, it must be noted that in [195] no minimization cost
function based on the prediction of the program execution and buffer utilization is used. On
the contrary, with this second approach, when a deadlock condition arises, a trace sub-graph
$S(k)'_{H_p}$ is evaluated as previously described. Subsequently, the problem of Equation
(8.36) is solved in order to identify the next schedulable firing, as done for the approach that
avoids deadlocks. Hence, the selected firing is scheduled supposing, only at this time, an
unbounded buffer size configuration. The new maximum token capacity of each buffer is then
used as the new buffer size configuration in the subsequent scheduling steps. It is worth noting
that initially the size of all the buffers can be set to 0 tokens. Only when all the action firings have
been scheduled can the bounded buffer size configuration be defined as the maximal token
capacity of each buffer obtained during the entire ETG post-mortem scheduling, as defined in
Equation (8.37). The flowchart of this approach is depicted in Figure 8.9.
8.4.4 Optimization by the exploration of the design space critical path
When the throughput requirements of the design are not met by using the minimal buffer size
configuration, the problem becomes how to increase the buffer sizes in order to reduce
$|\overrightarrow{CP}|(\beta)$. The problem of exploring the design space, for a given scheduling and partitioning
configuration, in order to find a suitable buffer size configuration that guarantees the throughput
requirements, is depicted in Figure 8.7. This can be formulated as:
\[
\begin{aligned}
\text{minimize} \quad & J = \begin{cases} |\overrightarrow{CP}|(\beta) \\ \beta(b_i), \quad \forall b_i \in B \end{cases} \\
\text{subject to} \quad & \beta(b_i)_{min} \le \beta(b_i) \le \beta(b_i)_{max}, \quad \forall b_i \in B \\
& \sum\{\beta(b_i),\ \forall b_i \in B\} \le \beta(B)_{max}
\end{aligned} \tag{8.39}
\]
where $J$ is a multi-objective cost function over the critical path length $|\overrightarrow{CP}|(\beta)$ and the size
$\beta(b_i)$ of each buffer $b_i \in B$. The constraints $\beta(b_i)_{min} \le \beta(b_i) \le \beta(b_i)_{max}$ impose that, for each
buffer, the size should be equal to or larger than the minimal size $\beta(b_i)_{min}$ evaluated in the
previous section. Moreover, an upper bound on each buffer size $\beta(b_i)$ and on the overall buffer
size configuration $\beta(B) = \sum\{\beta(b_i), \forall b_i \in B\}$ can be imposed. These additional constraints are
necessary when the program is implemented on a severely memory-constrained platform (e.g.
DSP, FPGA). Problem (8.39) can be demonstrated to be NP-complete [10] and, therefore, it
needs the use of efficient heuristics in order to find good approximate solutions.
[Figure 8.8: Bounded buffer scheduling with deadlock avoidance approach (flowchart).]
[Figure 8.9: Bounded buffer scheduling with deadlock recovery approach (flowchart).]
Reducing the critical path length
From Equation (5.10) it can be seen how the overall execution time of each firing $s_i \in S$
is affected by the blocking writing overhead $w(s_i)_{wd}$ introduced by a buffer that cannot
accommodate enough tokens. Hence, the objective becomes reducing $|\overrightarrow{CP}|$ by increasing
the size of the buffers that are responsible for introducing the highest total amount of writing
overhead along the $\overrightarrow{CP}$. This can be formulated as an iterative procedure, as illustrated in
Algorithm 5. At each iteration $k$, the $\overrightarrow{CP}(\beta(k))$ is evaluated by scheduling the ETG post-mortem
according to the buffer configuration $\beta(k)$. At $k = 0$, the minimal buffer size $\beta_{min}$ is used as
the starting point of the algorithm. Subsequently, for each firing $s_i$ along the
$\overrightarrow{CP}(\beta(k))$, a tuple $(s_i, b_i, d, \tau)$ is computed for each buffer $b_i$ that introduced a write delay
(i.e. $w(s_i)(k)_{wd} > 0$). Each tuple contains the write delay time $d \le w(s_i)(k)_{wd}$ introduced during
the firing of $s_i$ by the buffer $b_i$ and the corresponding number of blocked tokens $\tau$ (i.e. the
tokens that caused the delay of the execution because they could not be accommodated in $b_i$).
Each tuple is then stored in the set $B(k)_{CP}$. When $B(k)_{CP}$ has been completely determined, it is
possible to obtain the following information for each buffer $b_i \in B$:
\[
\begin{cases}
d(b_i, k) = \sum\{d : n = i,\ \forall (s, b_n, d, \tau) \in B(k)_{CP}\} \\
\tau(b_i, k)_{max} = \max\{\tau : n = i,\ \forall (s, b_n, d, \tau) \in B(k)_{CP}\}
\end{cases} \tag{8.40}
\]
where d(bi ,k)d and τ(bi ,k)max define the overall write blocking delay and the maximal num-
ber of tokens for each buffer bi . Successively, the buffer that needs to be increased in size is
defined as follow:
b∗k = argmax{bi : d(bi ,k)> d(b j ,k)∧β(bi ,k)+τ(bi ,k)max ≤β(bi )max ,∀b j ∈B} (8.41)
Once b(k)∗i has been found, the new buffer size configuration is modified as:
β(bi ,k+1)=
β(bi ,k)+τ(bi ,k)max if bi = b∗kβ(bi ,k) otherwise (8.42)
A new iteration is then made following the same approach. The heuristic can conclude when
the desired critical path length reduction has been achieved (or the maximal number of
iterations k has been performed). It must be noted that, if this analysis is performed on a
collection of ETGs, then the optimal size value of each buffer is the maximal value obtained
within the ETGs collection.
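The selection step of Equations (8.40) to (8.42) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name `select_buffer_to_grow`, the tuple layout and the dictionaries used for the buffer sizes are hypothetical and are not part of TURNUS; the blocking tuples are assumed to have already been collected along the critical path.

```python
from collections import defaultdict


def select_buffer_to_grow(blocking_tuples, size, max_size):
    """Pick the buffer to enlarge and return the updated size configuration.

    blocking_tuples: iterable of (firing, buffer, delay, blocked_tokens),
                     i.e. the set B_CP collected along the critical path.
    size:            dict buffer -> current size beta(b, k).
    max_size:        dict buffer -> upper bound beta(b)_max.
    (All names are hypothetical, chosen for this sketch only.)
    """
    total_delay = defaultdict(float)  # d(b, k) of Equation (8.40)
    max_blocked = defaultdict(int)    # tau(b, k)_max of Equation (8.40)
    for _firing, buf, delay, blocked in blocking_tuples:
        total_delay[buf] += delay
        max_blocked[buf] = max(max_blocked[buf], blocked)

    # Equation (8.41): largest total delay among buffers that may still grow.
    candidates = [b for b in total_delay
                  if size[b] + max_blocked[b] <= max_size[b]]
    if not candidates:
        return None, dict(size)
    best = max(candidates, key=lambda b: total_delay[b])

    # Equation (8.42): only the selected buffer changes size.
    new_size = dict(size)
    new_size[best] += max_blocked[best]
    return best, new_size


# Illustrative (made-up) blocking tuples for two buffers.
tuples = [("s1", "b1", 3.0, 2), ("s2", "b2", 5.0, 4), ("s3", "b1", 4.0, 1)]
best, new_size = select_buffer_to_grow(
    tuples, size={"b1": 1, "b2": 1}, max_size={"b1": 8, "b2": 8})
print(best, new_size)
```

In this toy input, buffer b1 accumulates the largest write delay along the critical path, so it is the one that grows, by its maximal number of blocked tokens. If b1 were already at its upper bound, the next buffer in delay order would be selected instead.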
8.5 Dynamic power dissipation minimization
Even though technological improvements in current VLSI design have led to higher clock
frequencies, larger dies, and higher transistor density, they have created significant design chal-
lenges as a result of power consumption and the need for synchrony at higher speeds [200, 201].
As a result, the performance of applications does not necessarily increase at the same pace.
The two main limiting factors are: the technological constraints (e.g. clock frequency caps
imposed by wire delays and clock skew) and the requirement constraints (e.g. power con-
sumption, low-noise and robustness). In order to address these issues, previous work has
demonstrated that asynchronous circuits have the potential of achieving substantially higher
performance compared to their synchronous equivalents. In addition to the elimination of
clock skew and lower interconnection delays, asynchronous circuits have other advantages
Algorithm 5: Critical path length reduction by increasing the size of critical buffers.
Input: $S$, the firings set
Input: $\beta_{min}$, the minimal buffer size configuration evaluated as discussed in Section 8.4.3
Input: $|\overrightarrow{CP}|_{algo}$, the algorithmic critical path length
Result: $\beta_{opt}$, the optimal buffer size configuration
Data: $k = 0$, the iteration number
Data: $\beta^{(k)}$, the buffer size configuration at iteration $k$
$\beta^{(k)} \leftarrow \beta_{min}$
do
    tracePostProcess($\beta^{(k)}$)
    // use Algorithms 1, 2, 3
    $|\overrightarrow{CP}(\beta^{(k)})| \leftarrow$ computeCpLength()
    $B^{(k)}_{CP} \leftarrow$ getCriticalBuffers($\overrightarrow{CP}(\beta^{(k)})$)
    $\beta^{(k+1)} \leftarrow$ Equation (8.42)
    $k \leftarrow k + 1$
while $B^{(k)}_{CP} \neq \emptyset \wedge |\overrightarrow{CP}(\beta^{(k)})| / |\overrightarrow{CP}|_{algo} > \epsilon$
$\beta_{opt} \leftarrow \beta^{(k)}$

begin getCriticalBuffers($\overrightarrow{CP}$)
    $B_{CP} \leftarrow \emptyset$
    for $s_i \in S_{CP}$ do
        if $w(s_i)_{wd} > 0$ then
            $(s_i, b_i, d, \tau) \leftarrow$ getBlockingBuffers($s_i$)
            $B_{CP} \leftarrow B_{CP} \cup (s_i, b_i, d, \tau)$
        end
    end
    return $B_{CP}$
end
such as a higher tolerance to the influence of the external environment. On the other hand,
the main drawbacks are the complexity of the implementation and the overall power con-
sumption. In an asynchronous design process, performance evaluation, optimization and
implementation are complicated by the presence of complex dependencies among concur-
rent events. While performance estimation for synchronous systems is based mainly on the
static analysis of the critical path, the performance of an asynchronous design is related to
several dynamic factors [201]. Moreover, performance estimation and design optimization
of asynchronous systems are not supported by efficient and comprehensive automatic synthesis
and optimization tools. An interesting trade-off between fully synchronous and fully
asynchronous methodologies is the globally asynchronous locally synchronous (GALS) clocking
style, supported by multiple-clock domain (MCD) architectures. The key feature of a GALS
system is the use of distinct and independent local clocks (i.e. with different frequencies and
phases), rather than a global timing reference. In a typical GALS configuration, a GALS module
(also called synchronous island) consists of a synchronous module, a clock generator and
an asynchronous wrapper (i.e. that encapsulates the synchronous module). GALS modules
communicate with each other through asynchronous interfaces. For GALS-based applications
implemented on MCD architectures, the design objective is to optimize the mapping of the
application into multiple clock domains, subsequently assigning a clock frequency to each
clock domain in order to reduce the overall power consumption, while at the same time,
meeting the design performance requirements.
8.5.1 Related work
The idea of using the dataflow representation for GALS-based applications was introduced
in [202] where advantages and disadvantages of this approach are discussed. In the fol-
lowing, a one-to-one correspondence between hardware resources in the architecture and
actors is assumed. Dataflow design modeling, exploration and optimization for GALS-based
designs have been studied previously by several authors. For example in [203] a GALS design-
partitioning method for high performance and very large VLSI systems is illustrated. The
system is partitioned into an optimal configuration of synchronous blocks by exploring rela-
tionships between power consumption and the number of synchronous blocks which define
the granularity of this approach. In this case, the main limitation is that the synchronous
blocks have fixed sizes that cannot be changed during the optimization process. Moreover, this
approach does not take system performance during the optimization process into account.
In [202], a design and evaluation framework is provided for modeling application-specific
GALS-based dataflow architectures for CSDF, where system performance (e.g. throughput)
during optimization is taken into account. Similarly, in [204, 200], a method for the automatic
synthesis of asynchronous digital systems is discussed. However, both approaches were
developed for fine-grained dataflow graphs, where actors are primitives or combinational
functions.
8.5.2 Multi-clock domain partitioning
The problem of partitioning an isomorphic GALS dataflow application into MCD architectures
can be defined as finding a suitable actor-clock mapping configuration that employs the lowest
clock frequencies that meet the overall design performance requirements. If $F = \{f_1, f_2, \ldots, f_{n_F}\}$
defines the set of available clock frequencies of a platform and $A = \{a_1, a_2, \ldots, a_{n_A}\}$ the set of
actors (see Section 2.5), then the mapping (i.e. partitioning) function can be defined as:

$$
\rho : A \rightarrow F
\tag{8.43}
$$

Consequently, the problem can be formulated as follows:

$$
\begin{aligned}
\text{minimize}\quad & J = \sum \{ c_a \rho(a), \forall a \in A \} \\
\text{subject to}\quad & T(\rho) \geq T_{min}
\end{aligned}
\tag{8.44}
$$

where $c_a$ is a generic objective function weight and $T(\rho)$ is the design performance function
in terms of throughput. In other words, the goal is to find a partitioning configuration $\rho$ that
reduces the total dynamic power dissipation of the design without degrading the performance.
8.5.3 Linear programming formulation
Using the notion of ETG and critical path length, the problem of Equation (8.44) can be
formulated as a linear programming (LP) problem [14, 13]. For this purpose, the weights $w(s_i)$ of
each firing $s_i \in S$ and $w(s_i, s_j)$ of each dependency $(s_i, s_j) \in D$ are evaluated by scheduling
the ETG post-mortem with each actor mapped to the highest available clock frequency
$f_{max} = \max\{ f_i \in F \}$. Next, the ETG dependency amalgamation transformation is applied
(see Section 5.5.2). For each amalgamated dependency $e_1 \bullet e_2 \bullet \ldots \bullet e_n$ the corresponding
weight is evaluated as $w(e_1 \bullet e_2 \bullet \ldots \bullet e_n) = \max\{w(e_1), w(e_2), \ldots, w(e_n)\}$. The
critical path $\overrightarrow{CP}$ and its length $|\overrightarrow{CP}|$ are then computed as described in Section 8.2.1.
Subsequently, the firing extension graph $G(V, E)$ of the ETG is computed (see Section 5.5.1). For each
edge $e \in E$, weights are assigned such that $w(e) = w(s_i)$ if $e$ corresponds to a firing $s_i \in S$,
and $w(e) = w(s_i, s_j)$ if $e$ corresponds to a dependency $(s_i, s_j) \in D$. It must be noted that for
each fictitious edge $(\pi_s, \pi^{s_i}_{2i-1}) \in E$ and $(\pi^{s_i}_{2i}, \pi_t) \in E$ the weights are defined as
$w(\pi_s, \pi^{s_i}_{2i-1}) = 0$ and $w(\pi^{s_i}_{2i}, \pi_t) = 0$, respectively. Finally, the clock domains
partitioning problem of Equation (8.44)
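can be defined as follows:

$$
\begin{aligned}
\text{maximize}\quad & \sum \{ c_a \gamma(a), \forall a \notin A_{CP} \} \\
\text{subject to}\quad
& \varphi(\pi_t) - \varphi(\pi_s) = |\overrightarrow{CP}| \\
& \varphi(\pi^{s_i}_{2i}) - \varphi(\pi^{s_i}_{2i-1}) = w(s_i), && \forall s_i \in S_{CP} \\
& \varphi(\pi^{s_j}_{2j-1}) - \varphi(\pi^{s_i}_{2i}) = w(s_i, s_j), && \forall (s_i, s_j) \in D_{CP} \\
& \varphi(\pi^{s_i}_{2i-1}) - \varphi(\pi_s) = 0, && \forall s_i \in \{ s_i \in S_{CP} : \delta(s_i)^{-}_{S} = \emptyset \} \\
& \varphi(\pi_t) - \varphi(\pi^{s_i}_{2i}) = 0, && \forall s_i \in \{ s_i \in S_{CP} : \delta(s_i)^{+}_{S} = \emptyset \} \\
& \varphi(\pi^{s_i}_{2i}) - \varphi(\pi^{s_i}_{2i-1}) \geq \gamma(a) w(s_i), && \forall s_i \notin S_{CP} \\
& \varphi(\pi^{s_j}_{2j-1}) - \varphi(\pi^{s_i}_{2i}) \geq w(s_i, s_j), && \forall (s_i, s_j) \notin D_{CP} \\
& \varphi(\pi^{s_i}_{2i-1}) - \varphi(\pi_s) \geq 0, && \forall s_i \in \{ s_i \notin S_{CP} : \delta(s_i)^{-}_{S} = \emptyset \} \\
& \varphi(\pi_t) - \varphi(\pi^{s_i}_{2i}) \geq 0, && \forall s_i \in \{ s_i \notin S_{CP} : \delta(s_i)^{+}_{S} = \emptyset \} \\
& \gamma(a) = 1, && \forall a \in A_{CP} \\
& \gamma(a) \geq 1, && \forall a \notin A_{CP}
\end{aligned}
\tag{8.45}
$$

where $\varphi(\pi)$ and $\gamma(a)$ are the unknown variables of this problem. Any of the well-known LP
techniques can be used to solve this problem. Once Problem (8.45) has been solved, the MCD
mapping (i.e. partitioning) function defined in Equation (8.43) is obtained as follows:

$$
\rho(a) =
\begin{cases}
\dfrac{f_{max}}{\gamma(a)}, & \text{if } a \notin A_{CP} \\
f_{max}, & \text{if } a \in A_{CP}
\end{cases}
\tag{8.46}
$$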
In other words, the clock can be reduced only for non-critical actors. Additional constraints
can be added in order to impose clock partitions among actors. The solution of this LP
problem provides an optimal clock domain partition configuration. However, the number of
constraints is at least |S|+ |D|. Hence, the use of a heuristic algorithm must be considered
when dealing with complex dataflow designs that require the exploration of "large" ETGs.
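To make the role of the scaling factors $\gamma(a)$ more concrete, the following Python sketch checks (rather than solves) a candidate assignment on a toy firing graph. It is only an illustration under stated assumptions: the function names, the dictionary-based graph encoding and the toy weights are hypothetical, and the feasibility test is a plain longest-path computation that stands in for the potential constraints of Problem (8.45).

```python
from collections import defaultdict


def critical_path_length(firings, deps, weight):
    """Length of the longest path in the firing DAG.

    firings: firing identifiers, assumed to be listed in topological order.
    deps:    dict (u, v) -> dependency weight w(u, v).
    weight:  dict firing -> firing weight w(s).
    """
    preds = defaultdict(list)
    for (u, v), w in deps.items():
        preds[v].append((u, w))
    finish = {}
    for s in firings:
        start = max((finish[u] + w for u, w in preds[s]), default=0.0)
        finish[s] = start + weight[s]
    return max(finish.values(), default=0.0)


def is_feasible(firings, deps, weight, actor_of, gamma):
    """True if stretching each firing by gamma(actor) keeps the CP length."""
    reference = critical_path_length(firings, deps, weight)
    stretched = {s: weight[s] * gamma[actor_of[s]] for s in firings}
    return critical_path_length(firings, deps, stretched) <= reference + 1e-9


def clock_mapping(gamma, f_max):
    """Equation (8.46): rho(a) = f_max / gamma(a), with gamma = 1 on the CP."""
    return {a: f_max / g for a, g in gamma.items()}


# Toy ETG: actor A is on the critical path (s1 -> s3), actor B has slack.
firings = ["s1", "s2", "s3"]
deps = {("s1", "s3"): 0.0, ("s2", "s3"): 0.0}
weight = {"s1": 4.0, "s2": 1.0, "s3": 4.0}
actor_of = {"s1": "A", "s2": "B", "s3": "A"}

print(is_feasible(firings, deps, weight, actor_of, {"A": 1.0, "B": 4.0}))
print(is_feasible(firings, deps, weight, actor_of, {"A": 1.0, "B": 5.0}))
print(clock_mapping({"A": 1.0, "B": 4.0}, f_max=200.0))
```

In this toy case the firing of actor B can be stretched by a factor of 4 before it delays the critical firing s3, so B could run at a quarter of the maximal frequency, while any larger factor lengthens the critical path.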
8.5.4 Heuristic approach
This section describes a heuristic approach, where the only assumption made
is that the set of available clock frequency domains is known a priori as
$F = \{ f_1 = f_{max}, f_2, \ldots, f_{n_F} \}$, where $f_i > f_j, \forall i < j$, $i = 1, 2, \ldots, n_F$. The heuristic is
illustrated in Algorithm 6, where $\rho(f_i)^{-1}$ defines the set of actors such that $\rho(a) = f_i$. Initially,
all the actors are assigned the highest available clock frequency (i.e.
$\rho(f_1)^{-1} = A = \{a_1, a_2, \ldots, a_{n_A}\}$ and $\rho(f_i)^{-1} = \emptyset, \forall i > 1$). The $|\overrightarrow{CP}|$ is then calculated
with respect to the performance constraints. Iteratively, at each step $k$ the maximum reduction
set $R_k$ is defined as $R_k = \{ a : a \notin A^{k-1}_{CP} \wedge a \in \rho(f_{k-1})^{-1} \}$. The set
$\rho(f_k)^{-1} \subseteq R_k$ is then calculated so that the $|\overrightarrow{CP}|$ does not increase. It must be noted that this
approach does not claim to find the optimal solution, but provides a practical approach that
can be applied to complex dataflow programs [13, 14]. It must also be noted that, if this analysis is
performed on a collection of ETGs, then the clock domain of each actor is the one with the
highest frequency obtained within the ETGs collection.
Algorithm 6: Heuristic algorithm for solving the problem of the multi-clock domain partitioning
defined in Equation (8.45).
Input: $G(S, D)$, the execution trace graph
Result: $\rho(a)$, the clock-actor mapping function
Data: $k = 0$, the iteration number
Data: $\rho(f_1)^{-1} = A$ and $\rho(f_i)^{-1} = \emptyset, \forall i > 1$
// use Algorithms 1, 2, 3
$|\overrightarrow{CP}(0)| \leftarrow$ computeCpLength($\rho(f_1)^{-1}$)
do
    for $a \in R_k$ do
        $\rho(f_{k-1})^{-1*} \leftarrow \rho(f_{k-1})^{-1} \setminus \{a\}$
        $\rho(f_k)^{-1*} \leftarrow \rho(f_k)^{-1} \cup \{a\}$
        // use Algorithms 1, 2, 3
        $|\overrightarrow{CP}(k)| \leftarrow$ computeCpLength($\rho(f_1)^{-1}, \rho(f_2)^{-1}, \ldots, \rho(f_{k-2})^{-1}, \rho(f_{k-1})^{-1*}, \rho(f_k)^{-1*}$)
        // if this is a feasible configuration
        if $|\overrightarrow{CP}(k)| = |\overrightarrow{CP}(0)|$ then
            $\rho(f_k)^{-1} \leftarrow \rho(f_k)^{-1*}$
        end
    end
    $k \leftarrow k + 1$
while $\rho(f_k)^{-1} \neq \emptyset$
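The greedy idea behind the heuristic can be sketched in Python as follows. This is a simplified illustration and not the TURNUS implementation: the function names are hypothetical, dependency weights are ignored, firing weights measured at $f_{max}$ are assumed to scale as $f_{max}/f$, and the set of critical actors is assumed to be given. Since slowing an actor can never shorten the critical path, the equality test of the algorithm is expressed here as "the length does not increase".

```python
from collections import defaultdict


def cp_length(firings, deps, weight_at_fmax, actor_of, freq_of, f_max):
    """Longest path through the firing DAG for a given clock assignment."""
    preds = defaultdict(list)
    for u, v in deps:
        preds[v].append(u)
    finish = {}
    for s in firings:  # assumed to be listed in topological order
        start = max((finish[u] for u in preds[s]), default=0.0)
        # Assumption: a firing takes f_max / f times longer at frequency f.
        finish[s] = start + weight_at_fmax[s] * f_max / freq_of[actor_of[s]]
    return max(finish.values(), default=0.0)


def assign_clock_domains(firings, deps, weight_at_fmax, actor_of, freqs, critical):
    """Greedily push non-critical actors to lower clock frequencies."""
    freqs = sorted(freqs, reverse=True)
    f_max = freqs[0]
    freq_of = {a: f_max for a in set(actor_of.values())}
    reference = cp_length(firings, deps, weight_at_fmax, actor_of, freq_of, f_max)
    for prev, cur in zip(freqs, freqs[1:]):
        # Candidates: non-critical actors currently in the previous domain.
        candidates = sorted(a for a, f in freq_of.items()
                            if f == prev and a not in critical)
        moved = False
        for a in candidates:
            freq_of[a] = cur
            length = cp_length(firings, deps, weight_at_fmax,
                               actor_of, freq_of, f_max)
            if length > reference + 1e-9:
                freq_of[a] = prev  # infeasible: revert the move
            else:
                moved = True
        if not moved:
            break  # the new domain stayed empty, so stop
    return freq_of


# Toy example with three actors and three clock frequencies (in MHz).
firings = ["s1", "s2", "s3", "s4"]
deps = [("s1", "s3"), ("s2", "s3")]
weight = {"s1": 4.0, "s2": 1.0, "s3": 4.0, "s4": 3.0}
actor_of = {"s1": "A", "s2": "B", "s3": "A", "s4": "C"}
mapping = assign_clock_domains(firings, deps, weight, actor_of,
                               freqs=[200.0, 100.0, 50.0], critical={"A"})
print(mapping)
```

In this toy case actor A stays at the highest frequency, actor B can be moved down to the lowest one, and actor C stops at the intermediate frequency because a further reduction would make its firing longer than the critical path.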
8.6 Conclusions
In this chapter, different DSE functionalities based on the analysis of the ETG of a dataflow
program have been illustrated. Firstly, it has been shown how the ETG can be processed
(i.e. scheduled) post-mortem in order to estimate the performance of a given
mapping configuration of the program. A methodology has been illustrated for modeling
the target architecture and scheduling the ETG post-mortem accordingly, in order to assign
timing information (i.e. a weight) to each action firing and each dependency of the ETG.
Secondly, the concept of design space critical path (DSCP) has been formulated and related
to the throughput of a design. This formalism has been used to effectively restrict the
design space that should be analyzed. Furthermore, by considering the CP of the program, the
definition of potential speedup formulated in Amdahl's law has been revisited and adapted
to dataflow programs. Based on the CP analysis of a program, the hotspots analysis has been
introduced. This can be used to show the designer which parts of a CAL program should
be refactored in order to meet the throughput requirements. The problem of bounding the buffer
size configuration of a dynamic dataflow program has been solved by using advanced control
techniques such as a model predictive controller. A heuristic, based on the analysis of the CP,
has been used to evaluate a reasonable trade-off between design throughput and buffer size
configuration. Lastly, the problem of reducing the dynamic power dissipation of a dataflow
execution has been studied for multi-clock domain architectures. A linear programming (LP)
formulation and a heuristic approach have been illustrated to find partitioning configurations
such that the energy consumption is minimized while preserving the throughput of the
design.
9 Experimental results
In this chapter, a collection of experimental results based on the analysis of image and video
codec applications is presented. These applications are a JPEG, an MPEG4-SP and an HEVC
decoder. The applications have been implemented on different target architectures: a
multi-core i7 desktop CPU, an SThorm many-core platform and a Xilinx Virtex-5 FPGA. In
the following, the critical path design exploration and the code refactoring assisted by the
impact analysis are illustrated using an HEVC video decoder implemented on a multi-core i7
desktop CPU as a design case. Subsequently, the bounded buffer size heuristic, based on the
use of an MPC controller, is illustrated using both a JPEG and an HEVC decoder as design
cases. The optimal trade-off between buffer size dimensioning and throughput performance
is then discussed for an MPEG4-SP decoder implemented on an SThorm many-core platform.
Finally, the dynamic power dissipation minimization heuristic is illustrated on an MPEG4-SP
decoder implemented on a Xilinx Virtex-5 FPGA.
9.1 Design cases
In the following, three multimedia applications specified using the RVC-CAL formalism [52,
53, 54, 55] are illustrated and used in the rest of this chapter. These are respectively a JPEG
image decoder, an MPEG4-SP video decoder and an MPEG HEVC video decoder.
9.1.1 JPEG decoder
The first example is a JPEG decoder described using the RVC-CAL formalism [205]. The top-
level network of this design is depicted in Figure 9.1. This is composed of 8 subnetworks
(actor/network composition), 8 actor-classes, and 8 actors. The main functional components
are a JPEG Parser, Huffman decoder, inverse quantization (IQ) and inverse discrete cosine
transform (IDCT) block, respectively. Input to the decoder is a compressed 4:2:0 bit-stream
and output is the decoded image.
[Figure 9.1 (block diagram): input bitstream, JPEG Parser, Huffman, IQ, IDCT, video output.]
Figure 9.1: JPEG decoder.
9.1.2 MPEG4-SP decoder
The second design example is an MPEG-4 simple profile (SP) decoder described using the
RVC-CAL formalism [31]. The top-level network of this design is depicted in Figure 9.2. This is
composed of 8 subnetworks (actor/network composition), 27 actor-classes and 42 actors. The
main functional components are a bit-stream parser and, for each decoding component (i.e.
Y, U, V), a reconstruction, 2D-IDCT, frame buffer, and motion compensation block, respectively.
The merging block makes a composition of the Y, U and V parts. Input to the decoder is a
compressed 4:2:0 bit-stream and output is the decoded video sequence.
[Figure 9.2 (block diagram): input bitstream, Parser, Texture Y/U/V, Motion Y/U/V, Merger, video output.]
Figure 9.2: MPEG4-SP decoder.
9.1.3 MPEG-HEVC decoder
The third example is an MPEG-HEVC decoder described using the RVC-CAL formalism [9].
The top-level network of this design is depicted in Figure 9.3a. The basic version of this
design is composed of 9 subnetworks (actor/network composition), 16 actor-classes and 32
actors. The main functional components are a bit-stream parser, moving prediction (MovPred),
intra-prediction (Intra), inter-prediction (Inter), IDCT, reconstruct coding unit (RecCU), select
coding unit (SelCU), deblocking filter (DebFilter), sample adaptive offset filter (SaoFilter) and
decoding picture buffer (DecPicBuff) block, respectively. Input to the decoder is a compressed
4:2:0 bit-stream and output is the decoded video sequence.
9.2 CAL source code static and dynamic profiling
This section presents the experimental results of the high-level design profiling, illustrated
in Chapters 3 and 7 respectively, for an MPEG-HEVC decoder. The design under study is the
RVC-CAL standardized version [206] of the decoder illustrated in Section 9.1.3. The complete
[Figure 9.3a (block diagram): input bitstream, Parser, Mv Prediction, Inter, Intra, IT, Select CU, Deblocking Filter, SAO Filter, Picture Buffer, video output.]
(a) Top-level network with 9 subnetworks.
(b) RVC-CAL standardized version (without subnetworks).
Figure 9.3: HEVC decoder.
network topology (without subnetworks) is depicted in Figure 9.3b. This design is composed
of 32 actors, 26 actor-classes and 112 buffers. In the following, this initial design configuration
is referred to as Ref-Standard.
9.2.1 Source code static analysis
The first step of a high-level design profiling is the static source-code analysis illustrated in
Section 3.2. Results of this analysis performed on the HEVC Ref-Standard design are summa-
rized in Table 9.1 where the overall number of SLOC and Halstead metric values are reported.
Table 9.1: Static code complexity of the MPEG-HEVC decoder.
(a) Network composition in terms of actors, buffers, internal variables, actor-classes and SLOC.

| Design | Actors | Buffers | Internal variables | Classes | SLOC |
|---|---|---|---|---|---|
| Ref-Standard | 32 | 112 | 745 | 26 | 1.46×10^4 |
| Shared-Memory | 13 | 96 | 800 | 13 | 1.37×10^4 |

(b) Halstead complexity metric results.

| Design | n1 | n2 | n | N1 | N2 | N |
|---|---|---|---|---|---|---|
| Ref-Standard | 3.20×10^2 | 2.35×10^3 | 2.67×10^3 | 3.29×10^4 | 3.33×10^4 | 6.62×10^4 |
| Shared-Memory | 4.25×10^2 | 2.45×10^3 | 2.87×10^3 | 3.19×10^4 | 3.36×10^4 | 6.55×10^4 |

| Design | V | D | E | T | B | I |
|---|---|---|---|---|---|---|
| Ref-Standard | 7.54×10^5 | 2.24×10^3 | 1.69×10^9 | 9.38×10^7 | 3.33×10^-4 | 3.37×10^2 |
| Shared-Memory | 7.52×10^5 | 2.76×10^3 | 2.07×10^9 | 1.15×10^8 | 3.33×10^-4 | 2.73×10^2 |

Table 9.2: Actor memory requirements for the initial HEVC design.

| Design | Internal variables | Bit size |
|---|---|---|
| Ref-Standard | 745 | 12.25 MB |
| Shared-Memory | 800 | 732.0 kB |
The overall design SLOC is 14658, which is considerably smaller than the approximately 110000
C/C++ SLOC of the openHEVC implementation [207, 208]. According to the Halstead development
time T, the HEVC Ref-Standard design is a 37.8 man-months (i.e. T = 9.38×10^7 s) project when
developed using RVC-CAL as the programming language.
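As a rough cross-check of the Halstead figures, the metrics can be recomputed from the operator and operand counts. The sketch below uses the standard textbook Halstead definitions, which are assumed (not confirmed here) to match those of Section 3.2; the function name `halstead` is hypothetical. Because the inputs are the rounded values of Table 9.1b, the results agree with the table only to within about one or two percent.

```python
import math


def halstead(n1, n2, N1, N2):
    """Standard Halstead metrics from operator/operand counts.

    n1, n2: distinct operators and operands.
    N1, N2: total operators and operands.
    """
    n = n1 + n2                # vocabulary
    N = N1 + N2                # program length
    V = N * math.log2(n)       # volume
    D = (n1 / 2) * (N2 / n2)   # difficulty
    E = D * V                  # effort
    T = E / 18                 # development time in seconds
    return {"n": n, "N": N, "V": V, "D": D, "E": E, "T": T}


# Rounded Ref-Standard values taken from Table 9.1b.
m = halstead(n1=320, n2=2350, N1=32900, N2=33300)
for name, value in m.items():
    print(f"{name} = {value:.3g}")
```

The computed volume is close to 7.54×10^5 and the computed time is of the order of 9.4×10^7 s, which is consistent with the order of magnitude reported in the table.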
9.2.2 Memory requirements and utilization
The overall memory requirements, in terms of internal variables and buffer utilization, of a
design can be estimated through a static and dynamic high-level code analysis. The overall
amount of bits required for the actors’ internal variables can be evaluated through a static
code analysis. In fact, in RVC-CAL the dimension of each internal variable must be known at
compile-time, since no dynamic memory allocation can be made. This makes it possible
to estimate exactly the static memory requirement for each internal variable. As summarized
in Table 9.2, the memory requirement for the HEVC Ref-Standard design is 12.25MB. On the
contrary, a dynamic code analysis is required for estimating the memory requirements in
terms of tokens passed on each buffer because of the dynamic dataflow MoC of this design. As
illustrated in Section 3.3, this analysis can be performed with a high-level code interpretation
of the CAL source code. Table 9.3a summarizes the buffer utilization and requirement when
Table 9.3: Buffer utilization profiling data of the MPEG-HEVC decoder.
(a) Tokens passed through the network (i.e. produced and subsequently consumed).

| Design | Total | Bits |
|---|---|---|
| Ref-Standard | 1000824 | 16.91 MB |
| Shared-Memory | 213586 | 1.37 MB |

(b) Buffer bandwidth in terms of produced and consumed tokens for each firing.

| Design | Produced tokens/firing: Average | Min | Max | Consumed tokens/firing: Average | Min | Max |
|---|---|---|---|---|---|---|
| Ref-Standard | 49.74 | 1 | 4096 | 49.75 | 1 | 4096 |
| Shared-Memory | 3.16 | 1 | 8192 | 3.16 | 1 | 8192 |

(c) Write and read hits on each buffer.

| Design | Write hits: Total | Average | Min | Max | Read hits: Total | Average | Min | Max |
|---|---|---|---|---|---|---|---|---|
| Ref-Standard | 1000824 | 8935.9 | 8 | 400210 | 1052894 | 9400.8 | 8 | 400210 |
| Shared-Memory | 213586 | 2224.8 | 8 | 16082 | 235764 | 2455.8 | 8 | 16694 |
the 8-frame 416x240 MERGE_B_TI_3 HEVC conformance bit-stream [209] is used as input
for the design. In this case, the number of tokens passed through the design buffers is
approximately 10^6, which require 16.91MB to be represented. Furthermore, during the dynamic
code interpretation it is possible to evaluate the number of write and read accesses that are
performed on each buffer. Tables 9.3b and 9.3c illustrate, respectively, the number of tokens
produced/consumed each time a firing makes use of a buffer and the number of write/read
accesses that have been performed on each buffer.
9.2.3 Execution trace graph
The structure of the ETG evaluated during the high-level code interpretation described in
the previous section is summarized in Table 9.4. This is composed of approximately 2×10^6
firings (nodes of the graph) and 2.2×10^7 dependencies (directed arcs). As can be seen from
Table 9.4b, the ETG is (in general) a sparsely connected graph. In fact, for this particular case the
average incoming and outgoing degree is approximately 11.64. Figure 9.4 depicts the rendering of a
small portion (i.e. approximately 80000 action firings and 350000 dependencies) of this ETG
made with the Gephi graph-visualizer [210, 211].
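The average degree quoted above follows directly from the graph size: every dependency contributes one outgoing and one incoming arc, so both averages equal the number of dependencies divided by the number of firings. The short Python sketch below illustrates this on a toy graph and on the totals of Table 9.4a; the function name `degree_stats` and the edge-list representation are assumptions made for this example only.

```python
from collections import Counter


def degree_stats(firings, dependencies):
    """Average and maximal in/out degree of a directed graph."""
    in_deg = Counter(v for _u, v in dependencies)
    out_deg = Counter(u for u, _v in dependencies)
    n = len(firings)
    return {
        "avg_in": sum(in_deg[s] for s in firings) / n,
        "avg_out": sum(out_deg[s] for s in firings) / n,
        "max_in": max(in_deg[s] for s in firings),
        "max_out": max(out_deg[s] for s in firings),
    }


# Toy graph: three firings and three dependencies.
stats = degree_stats(["a", "b", "c"], [("a", "b"), ("a", "c"), ("b", "c")])
print(stats)

# Totals of Table 9.4a: dependencies divided by action firings.
ref_standard_avg = 22490814 / 1932226
print(round(ref_standard_avg, 2))
```

The same ratio computed for the Shared-Memory ETG of Table 9.6a (6124379 dependencies over 493551 firings) gives a value of about 12.4.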
9.2.4 Initial design-refactoring directions
From the high-level profiling information obtained during the analysis described in the pre-
vious sections, it is possible to identify the buffer utilization as a critical point of the HEVC
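Table 9.4: Execution trace graph configuration of the Ref-Standard MPEG-HEVC decoder.
(a) Size.

| Action firings | Dependencies |
|---|---|
| 1932226 | 22490814 |

(b) Action firings incoming and outgoing degree.

| Degree | Average | Min | Max | Var |
|---|---|---|---|---|
| $|\delta(s_i)^-|$ | 11.64 | 0 | 200 | 54.63 |
| $|\delta(s_i)^+|$ | 11.64 | 0 | 61530 | 12143.96 |

(c) Dependencies set (the columns from input to ww give the direction breakdown).

| Kind | Total | input | output | rr | rw | wr | ww |
|---|---|---|---|---|---|---|---|
| FSM | 8.2% | - | - | - | - | - | - |
| Port | 8.6% | 54.2% | 45.8% | - | - | - | - |
| Internal variable | 77.8% | - | - | 34.4% | 16.2% | 32.6% | 16.8% |
| Tokens | 5.4% | - | - | - | - | - | - |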
Ref-Standard design. In fact, the tokens exchanged between actors amount to approximately
16.91MB. This issue can be effectively addressed by extending the CAL MoC
with the notion of shared memory between actors. If correctly used, this extension
makes it possible to reduce the overall number of exchanged tokens without introducing (possible)
race conditions. In other words, the shared-memory approach should only be used if an actor
modifies the value of an internal variable whose values are used without any modification
(i.e. read-only access) by a second actor. Without this approach, each time the first actor
modifies an internal variable, the new value must be sent as a token to the second actor.
In the case where the processing result of an actor does not depend on the arrival time of the
token and its value is not locally modified, the shared-memory approach can be used.
The consumption of a token is then transformed into a read access of a variable. This is the case, for
example, for the DecPicBuffer actor. This actor receives the decoded picture as a stream of
tokens. However, the decoded picture can be shared among actors without requiring the use
of tokens by means of a shared-memory approach. Moreover, the additional internal variables
necessary to store the token data values can be removed. Following these considerations,
the HEVC Ref-Standard design has been modified to support the shared-memory MoC.
This new design configuration, summarized in Table 9.1 and referred to as Shared-Memory,
is composed of 13 actors and actor-classes and 96 buffers. The same static and dynamic
code analysis illustrated before has also been performed for this new design configuration.
Table 9.5 summarizes the internal variables that can be shared among the actors of the HEVC
Shared-Memory design. It is possible to see how the overall size of the shared memory is
Figure 9.4: The rendering of a small portion (i.e. approximately 80000 action firings and
350000 dependencies) of the execution trace graph described in Table 9.4. Action firings are
colored according to the corresponding actor.
22kB. However, the real memory requirement reduction is obtained by removing internal
actor variables that store redundant data and by reducing the overall number of exchanged
tokens. In Tables 9.2 and 9.3a, respectively, it can be seen that the overall internal memory of
the Shared-Memory design is 732kB (94% smaller compared to the Ref-Standard version) and
the overall amount of exchanged tokens is 1.37MB (92% smaller compared to the Ref-Standard
version). The ETG structure obtained with this new design configuration is summarized in
Table 9.6. This ETG is used in the following, where the hotspots analysis is performed.
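The two percentages can be verified with a few lines of Python from the values reported in Tables 9.2 and 9.3a. The helper name `reduction_percent` is hypothetical, and 1 MB is assumed to be 10^6 bytes (using 2^20 bytes changes the result only after the rounding to whole percent).

```python
def reduction_percent(before, after):
    """Relative reduction of `after` with respect to `before`, in percent."""
    return 100.0 * (before - after) / before


# Internal variables memory: 12.25 MB (Ref-Standard) vs 732.0 kB (Shared-Memory).
memory = reduction_percent(12.25e6, 732.0e3)
# Exchanged tokens: 16.91 MB (Ref-Standard) vs 1.37 MB (Shared-Memory).
tokens = reduction_percent(16.91, 1.37)
print(round(memory), round(tokens))
```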
Table 9.5: Memory requirement for the actor internal variables of the Shared-Memory MPEG-HEVC
decoder.

| Internal variables | Bit size | Shareable variables | Shareable bits size |
|---|---|---|---|
| 811 | 754 kB | 11 | 22.0 kB |

Table 9.6: Execution trace graph configuration of the Shared-Memory MPEG-HEVC decoder.
(a) Size.

| Action firings | Dependencies |
|---|---|
| 493551 | 6124379 |

(b) Action firings incoming and outgoing degrees.

| Degree | Average | Min | Max | Var |
|---|---|---|---|---|
| $|\delta(s_i)^-|$ | 12.40 | 0 | 84 | 146.02 |
| $|\delta(s_i)^+|$ | 12.40 | 0 | 38643 | 9180.69 |

(c) Dependencies set (the columns from input to ww give the direction breakdown).

| Kind | Total | input | output | rr | rw | wr | ww |
|---|---|---|---|---|---|---|---|
| FSM | 8.0% | - | - | - | - | - | - |
| Port | 6.0% | 63.2% | 36.8% | - | - | - | - |
| Internal variable | 82.0% | - | - | 31.4% | 18.9% | 27.7% | 22.0% |
| Tokens | 4.0% | - | - | - | - | - | - |
9.3 Design refactoring
This section presents the experimental results of the design space critical path exploration
and hotspots analysis, illustrated in Sections 8.2 and 8.3 respectively, for an MPEG-HEVC
decoder. The design under study is the decoder illustrated in Section 9.2.4, referred to as
HEVC Shared-Memory, specified using the RVC-CAL dataflow formalism extended with the
notion of shared variables. The target platform for this implementation is a desktop computer
equipped with an Intel i7-3770 3.40GHz processor and 8GB of memory. The objectives of this
analysis are two-fold: to improve the throughput performance of the initial design and to increase the
potential speedup $S(n)$ in order to fully exploit the 4 cores. To achieve both objectives, the
designer has reduced the algorithmic critical path length $|\overrightarrow{CP}|_{algo}$ following the refactoring
directions provided by the DSE framework.
[Figure 9.5 (block diagrams of the Inter actor): a single Inter actor; a pipeline of n Inter stages; n Inter replicas between a Demux and a Mux.]
(a) Initial version
(b) Pipeline replication
(c) Data parallelism
Figure 9.5: Refactoring strategies for the Inter-Prediction actor.
9.3.1 Critical action ranking
As illustrated in Table 9.8b, the initial maximal speedup (i.e. achievable with $n = n_A$ processing
elements) of this design is 2.43. The first step of the refactoring phase was to evaluate
the critical action ranking. As illustrated in Section 8.3, the objective is to identify which
actions contribute the most to the serial part of the design. Table 9.7a lists
the 5 actions that contribute the most to the overall $|\overrightarrow{CP}|_{algo}$. It can be seen that the
interpolation action, contained in the Inter actor, contributes 45.86% of the overall
$|\overrightarrow{CP}|_{algo}$. Hence, this action should be considered as the refactoring starting point. As
summarized in Figure 9.5, this actor can be split in order to exploit task or data parallelism. In this
case, as illustrated in Figure 9.5c, the actor has been replicated for each video component (i.e.
Y, U, V). In this new design configuration, reported as code optimization in Table 9.8b, both the
overall design complexity and the $|\overrightarrow{CP}|_{algo}$ have been reduced by 52%. However, the maximal
potential speedup has decreased to 2.23. Consequently, a second round of this analysis
has been performed in order to highlight the new most serial parts of the design. In this new
design configuration, the new 5 most critical actions are summarized in Table 9.7b, where the
$|\overrightarrow{CP}|_{algo}$ contributions of the interpolation action, contained in the InterLuma200
actor, and the addResAndClip action, contained in the SelCU actor, are around 12.38%
and 11.14%, respectively.
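The ranking itself amounts to grouping the firings on the critical path by action and summing their weights. The following Python sketch illustrates the idea; it is not the TURNUS implementation, the function name and the tuple layout are hypothetical, the weights are made-up values rather than measured data, and the share of each action is assumed to be computed over the summed firing weights only.

```python
from collections import defaultdict


def critical_action_ranking(cp_firings, top=5):
    """Rank actions by their share of the critical path weight.

    cp_firings: iterable of (actor, action, weight) for the firings that
                belong to the critical path.
    Returns a list of (actor, action, share in percent), largest first.
    """
    per_action = defaultdict(float)
    total = 0.0
    for actor, action, weight in cp_firings:
        per_action[(actor, action)] += weight
        total += weight
    if total == 0.0:
        return []
    ranked = sorted(per_action.items(), key=lambda kv: kv[1], reverse=True)
    return [(actor, action, 100.0 * w / total)
            for (actor, action), w in ranked[:top]]


# Made-up critical path firings (illustrative weights only).
cp = [
    ("Inter", "interpolation", 6.0),
    ("Inter", "applyWeights", 2.0),
    ("Inter", "interpolation", 4.0),
    ("DebFilter", "filterEdges", 8.0),
]
ranking = critical_action_ranking(cp)
for actor, action, share in ranking:
    print(f"{actor:10s} {action:15s} {share:5.1f}%")
```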
Table 9.7: Critical action ranking analysis of the MPEG-HEVC decoder. Results are for the
initial and the full-parallel version, summarized for 5 different actions in Tables 9.7a and 9.7b
respectively.
(a) Critical action ranking for the initial version of the decoder.

| Actor | Action | $w(a)_{CP}$ |
|---|---|---|
| Inter | interpolation | 45.86% |
| Inter | applyWeights | 11.14% |
| DecPicBuff | expandBorders | 6.68% |
| SaoFilter | getSaoTypeIdxDone | 4.58% |
| DebFilter | filterEdges | 4.00% |

(b) Critical action ranking for the full parallel version of the decoder.

| Actor | Action | $w(a)_{CP}$ |
|---|---|---|
| InterLuma200 | interpolation | 12.38% |
| SelCU | addResAndClip | 11.14% |
| SaoFilterLuma | getSaoTypeIdxDone | 7.42% |
| DebFilter | filterEdges | 7.28% |
| SaoFilterLuma | getSaoMerge | 6.85% |
Table 9.8: Description of different configurations of the HEVC decoder design and corresponding
speedup, computational complexity and critical path length values.
(a) Design configurations.

| # | Design version | Notes |
|---|---|---|
| 1 | Shared-memory | Initial version |
| 2 | Code optimization | Reducing impacts of critical copies |
| 3 | InterPred Comp | Splitting luma and chroma |
| 4 | CompPipeline | The inter-prediction actors are pipelined |
| 5 | CompPipeline2x1 | Splitting first part of the pipeline of the luma inter-prediction |
| 6 | Sao Comp | Splitting luma and chroma computation in the Sao filter |
| 7 | DPB Optim | Optimization of the action which expands the borders |
| 8 | Sao Split | Optimization of the Sao and parallelization of the luma part |

(b) Potential speedup, computational complexity and critical path length.

| # | Design version | $S(n_A)$ | $\Delta w$ | $\Delta|\overrightarrow{CP}|$ |
|---|---|---|---|---|
| 1 | Shared-memory | 2.43 | - | - |
| 2 | Code optimization | 2.23 | -52% | -57% |
| 3 | InterPred Comp | 2.84 | -55% | -47% |
| 4 | CompPipeline | 3.65 | -58% | -39% |
| 5 | CompPipeline2x1 | 3.84 | -60% | -38% |
| 6 | Sao Comp | 4.05 | -60% | -36% |
| 7 | DPB Optim | 4.18 | -60% | -35% |
| 8 | Sao Split | 4.46 | -58% | -31% |
9.3.2 Impact analysis
In this case the critical action ranking does not provide any clear direction as to which
action should be refactored first and what the potential $|\overrightarrow{CP}|_{algo}$ reduction is.
Consequently, the impact analysis illustrated in Section 8.3 has been used. Results of this analysis,
summarized in Figure 9.6, show that the two most critical actions highlighted in Table 9.7b
are not the best refactoring candidates. In fact, by refactoring one of these two actions, the
$|\overrightarrow{CP}|_{algo}$ can potentially be reduced by a maximum of 5%, compared to 7% by refactoring the
filterEdges action, contained in the DebFilter actor, or the getSaoMerge action,
contained in the SaoFilterLuma actor. According to this result, the filterEdges action has
been refactored. Subsequently, other iterations have been performed in the same manner.
Results are summarized in Table 9.8. At the end, the maximal potential speedup obtained
is around 4.46 and the throughput improvement (i.e. in terms of $|\overrightarrow{CP}|_{algo}$) is
around 31%.
[Plot: $\Delta|\overrightarrow{CP}|_{algo}(\lambda, r)$ (vertical axis, 0 to 7) versus r (horizontal axis, 0 to 100) for the actions
DebFilter : filterEdges, SaoFilterLuma : getSaoMerge, SelCU : addResAndClip and InterLuma200 : interpolation.]
Figure 9.6: Impact analysis for the initial version of the Shared-Memory MPEG-HEVC decoder.
9.4 Bounded buffer size configuration
This section presents the experimental results on bounding and minimizing the buffer size
configuration, illustrated in Section 8.4.3, for the JPEG and MPEG-HEVC decoders
illustrated in Sections 9.1.1 and 9.1.3, respectively. The number of actors and buffers of each
specific design configuration is summarized in Table 9.9, together with the number of fired
actions Hs contained in each ETG used for the analyses. Tables 9.10 and 9.11 report the results
obtained using the deadlock avoidance and deadlock recovery approaches for the JPEG and
HEVC decoders, respectively. The results have also been compared with those obtained
using a well-known state-of-the-art method [195]. In the comparison, ∆bits% and ∆tokens%
represent the difference obtained with the new approach in terms of bits and tokens,
respectively (negative values are savings). Moreover, for each configuration, the average time
required at each iteration to solve the problem formulated in Equation (8.36) is reported in ms.
The results have been obtained using a standard desktop PC with an i7-3770 3.40 GHz processor
and 32 GB of memory. It can be observed that even with very small trace sub-graphs (Hp = 1)
and a single optimized firing step (Hc = 1), the approach leads to a bounded buffer size
configuration that is smaller than the well-established solution introduced in [195]: about
14% to 16% in terms of bits and about 9% to 15% in terms of tokens (see Tables 9.10a and
9.11a). It is interesting to observe that if a buffer can contain only tokens of the same type
(e.g. unsigned/signed integers, floats) and the number of bits of a single token is known (as in
these two examples), then the optimization objective J(k) of Equation (8.36) can easily be
formulated in terms of tokens by introducing a cost value for each component of the vector y.
Table 9.9: Design sizes: numbers of actors, buffers and action firings.
Design                 Actors  Buffers  Action firings
JPEG                        6       10          181739
HEVC (see Table 9.4a)      32      112         1932226
Table 9.10: Bounded buffer size configurations of the JPEG decoder using the MPC approach.
Results are compared to state of the art approaches.
(a) Deadlock avoidance.
Hp               1      2      2      4      4      4
Hc               1      1      2      2      3      4
|S′|             6     12     12     24     24     24
bits (kB)     2.08   2.08   2.08   2.09   2.09   2.09
tokens        1961   1961   1961   1963   1963   1963
∆bits%       -15.8  -15.8  -15.8  -15.7  -15.7  -15.7
∆tokens%      -8.9   -8.9   -8.9   -8.8   -8.8   -8.8
solver (ms)    2.9    3.4    6.0    8.4   13.1   18.7
(b) Deadlock recovery.
Hp               1      2      2
Hc               1      1      2
|S′|             6     12     12
bits (kB)     2.07   2.07   2.07
tokens        1950   1950   1950
∆bits%       -16.2  -16.2  -16.2
∆tokens%      -9.4   -9.4   -9.4
deadlocks%     0.9    0.9    0.9
solver (ms)    3.0    3.5    4.7
Table 9.11: Bounded buffer size configurations of the MPEG-HEVC decoder using the MPC
approach. Results are compared to state of the art approaches.
(a) Deadlock avoidance.
Hp                 1       2       4       2
Hc                 1       1       1       2
|S′|              32      64     128      64
bits (kB)     122.73  123.17  125.19  122.98
tokens         86411   86405   88285   86238
∆bits%         -14.1   -13.8   -12.4   -13.9
∆tokens%       -15.4   -15.4   -13.6   -15.6
solver (ms)      5.7    10.7    25.3    27.6
(b) Deadlock recovery.
Hp                 1       2       2       1
Hc                 1       1       2       2
|S′|              32      64      64      32
bits (kB)     110.77  111.88  111.78  113.56
tokens         77663   78605   78554   80407
∆bits%         -22.5   -21.7   -21.8   -20.5
∆tokens%       -24.0   -23.0   -23.1   -21.3
deadlocks%       0.5     0.6     0.6     0.5
solver (ms)      8.2    12.5    29.0    15.2
9.5 Buffer size optimization
This section presents the experimental results on finding a trade-off between the buffer size
configuration and the throughput performance, as illustrated in Section 8.4.4, for the MPEG4-SP
decoder illustrated in Section 9.1.2. The target architecture is the ST Microelectronics
STHorm platform [212]. This is an area- and power-efficient many-core platform based on
multiple globally asynchronous, locally synchronous (GALS) clusters of processing elements.
Clusters feature up to 16 processors and one control processor with independent instruction
streams sharing a multi-banked L1 data memory (256 kB), a multi-channel DMA engine, and
specialized hardware for synchronization and scheduling. The fabric can be programmed in
either OpenCL or standard C with the integration of a specific API, called native programming
model (NPM), which is closely coupled to the platform and provides the highest level of
control on application-to-resource mapping, at the expense of abstraction. For the purpose of
this work, a C/NPM implementation of the RVC-CAL network, synthesized using an
STHorm-specific extension of Orcc, has been used. The results reported in the following have
been obtained using a software emulator running on Linux. In all the tests, each CAL actor
composing the MPEG4-SP decoder was mapped onto its own processing element, and a single
STHorm cluster was used. Furthermore, the results presented here have been obtained with four
different 10-frame QCIF bit-streams, which are commonly used video test sequences known
as Akiyo, Foreman, Suzie and News. Figure 9.7 illustrates both the estimated and the
experimental results, where the Akiyo test sequence has been used as a reference for evaluating
both the minimal buffer size, using the heuristic approach introduced in Section 8.4.3, and the
different buffer size configurations, using the heuristic algorithm described in Section 8.4.4.
The other test sequences have only been used to validate this approach. Figure 9.7b depicts the
estimated results (i.e. obtained by post-processing the causation trace as illustrated in
Section 8.1 and using the clock-accurate profiling information retrieved from the STM System
Trace Module) and the experimental results (i.e. obtained from a cycle-accurate, but slower,
design simulation). In the picture, the results obtained with the Akiyo test sequence are
reported. It must be noted that the CP length with a minimal buffer size is roughly 35% higher
than the algorithmic critical path length $|\overrightarrow{CP}|_{algo}$. As can be seen, a good
trade-off can be achieved between performance improvement and resource utilization (i.e. in
terms of memory utilization). In fact, increasing the buffer size from the minimal configuration
by 6% leads to an overall throughput improvement of 30%. Compared to the experimental results,
the performance estimation deviates by about 5% in terms of absolute throughput and execution
time values. This is probably related to the low precision of the STHorm scheduler model
implemented in the DSE performance estimation engine used during the ETG post-mortem
scheduling phase.
[Plots: Critical Path Length [%] (vertical axis) versus Buffer Size [%] (horizontal axis).
(a) Trade-off estimation. Legend: foreman, akiyo, suzie, news, mean, min. buffer size, CPalgo.
(b) Experimental results and estimated values. Legend: estimated (akiyo), experimental (akiyo), CPalgo, min. buffer size.]
Figure 9.7: Buffer size optimization of the MPEG4-SP decoder implemented on an ST Micro-
electronics STHorm platform.
9.6 Dynamic power dissipation minimization
This section presents the experimental results on the dynamic power minimization, illustrated
in Section 8.5, for the MPEG4-SP decoder illustrated in Section 9.1.2. The design under test
contains 41 actors and has been implemented on a Xilinx Virtex-5 FPGA. Performance
profiling and low-level code synthesis have been performed with the Xronos synthesizer. Two
different MCD configurations have been tested: the first has 2 clock domains with
F = {50.0, 6.25} MHz; the second has 4 clock domains with F = {50.0, 25.0, 12.50, 6.25} MHz.
In order to reduce the power dissipation related to memory access, the minimal buffer size
configuration illustrated in Section 8.4.3 has been used. With this configuration, the results
of the heuristic approach illustrated in Section 8.5.3 for the 2-clock and 4-clock domain
configurations are summarized in Table 9.12 and Table 9.13, respectively: part (a) of each
table reports the clock domain partitioning and part (b) the different power contribution
terms. For each of these two configurations, the overall power consumption has been measured
using the Xilinx XPE [213], enhanced with information retrieved during a post-place and route
simulation using 10-frame QCIF bit-streams as input stimulus. Compared to the nominal
configuration, where all the domains use the maximum available frequency, a significant
overall power reduction can be achieved by partitioning the design using the approach
illustrated in this work. In fact, the overall power reduction is about 4% with 2 clock
domains and about 10% with 4 clock domains.
Table 9.12: 2-Clock domains dynamic power minimization results of the MPEG-4 SP decoder
implemented in a Xilinx Virtex-5 FPGA. Nominal: all the domains use the maximum available
frequency; Optimized: with the clock frequencies illustrated in Table 9.12a. ∆% defines the
percentage reduction between the nominal and optimized case of each contribution.
(a) Clock domains and partitioning.
Clock domain  Frequency  Actors
f1            50.0 MHz   21
f2            6.25 MHz   20
(b) Experimental results.
Contribution  Nominal [W]  Optimized [W]  ∆%
Clocks        0.328        0.282          -14.0
Logic         0.069        0.06           -13.0
Signals       0.079        0.083            5.1
BRAMs         0.1          0.082          -18.0
Input/Output  0.005        0.005            0.0
Leakage       1.051        1.05            -0.1
Total         1.632        1.562           -4.3
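As a quick consistency check, the ∆% column of Table 9.12b can be recomputed from the nominal and optimized power values. The sketch below is only illustrative: the helper name `delta_percent` is an assumption and is not part of TURNUS or XPE, and the numbers are copied from the table.

```python
# Power contributions in W copied from Table 9.12b: (nominal, optimized).
TABLE_9_12B = {
    "Clocks": (0.328, 0.282),
    "Logic": (0.069, 0.060),
    "Signals": (0.079, 0.083),
    "BRAMs": (0.100, 0.082),
    "Input/Output": (0.005, 0.005),
    "Leakage": (1.051, 1.050),
    "Total": (1.632, 1.562),
}


def delta_percent(nominal, optimized):
    """Relative change of the optimized value with respect to the nominal one, in %.

    Hypothetical helper: negative values mean that the optimized design dissipates less.
    """
    return 100.0 * (optimized - nominal) / nominal


if __name__ == "__main__":
    for name, (nominal, optimized) in TABLE_9_12B.items():
        print(f"{name:<13} {delta_percent(nominal, optimized):6.1f}")
```

Note that the Signals contribution increases in the 2-clock configuration, so its ∆% is positive, while the total still decreases.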
Table 9.13: 4-Clock domains dynamic power minimization results of the MPEG-4 SP decoder
implemented in a Xilinx Virtex-5 FPGA. Nominal: all the domains use the maximum available
frequency; Optimized: with the clock frequencies illustrated in Table 9.13a. ∆% defines the
percentage reduction between the nominal and optimized case of each contribution.
(a) Clock domains and partitioning.
Clock domain  Frequency  Actors
f1            50.0 MHz   12
f2            25.0 MHz   5
f3            12.5 MHz   8
f4            6.25 MHz   16
(b) Experimental results.
Contribution  Nominal [W]  Optimized [W]  ∆%
Clocks        0.436        0.357          -18.1
Logic         0.07         0.041          -41.4
Signals       0.095        0.053          -44.2
BRAMs         0.106        0.075          -29.2
Input/Output  0.005        0.004          -20.0
Leakage       1.053        1.05            -0.3
Total         1.765        1.58           -10.5
9.7 Conclusions
In this chapter a collection of experimental results based on the analysis of image and video
codec applications has been illustrated and discussed. The results have been obtained during
the different stages of the design space exploration of JPEG, MPEG4-SP and HEVC decoders.
More precisely, the following use cases have been discussed. The critical path design
exploration and the code refactoring assisted by the impact analysis have been illustrated
for the HEVC video decoder, implemented on a multi-core i7 desktop CPU. Subsequently, the
bounded buffer size heuristic, based on the use of an MPC controller, has been used for both
a JPEG and an HEVC decoder. An optimal trade-off between buffer size dimensioning and
throughput performance has been discussed for an MPEG4-SP decoder implemented on an
STHorm many-core platform. Finally, the dynamic power dissipation minimization heuristic
has been used for an MPEG4-SP decoder implemented on a Xilinx Virtex-5 FPGA.

10 Conclusions
This thesis addressed the problem of defining a DSE methodology for complex design
applications modeled with dynamic dataflow MoCs. Despite the increasing interest in massively
parallel and heterogeneous platforms, a unified methodology for the specification and
development of (complex) applications is far from being uniformly adopted. There are still too
many approaches and methodologies: some give more emphasis to the re-use of legacy code
and IP blocks, others require specific methodologies constrained to a given type of platform
or technology. Very few approaches have as their main objective a truly unified methodology
capable of abstracting from SW and HW. The work proposed in this thesis has tried
to demonstrate how a unified HW and SW design methodology for complex designs can be
successfully adopted. One of the major contributions of this work has been the formalization
of a DSE methodology for dynamic dataflow programs. In fact, dynamic dataflow MoCs have
always been criticized with the argument that their behavior is hard to analyze. The general
approach for the implementation of dataflow programs has always been to use MoCs of limited
expressiveness (e.g. static and cyclo-static) under the assumption that run-time performance
can be guaranteed at compile-time. However, this approach severely limits the scalability
of an application when a wide set of features is required (e.g. multimedia applications). This
work has shown how it is possible to efficiently explore the design space and estimate the
performance of an application through the analysis of the ETG. The main advantage of this
approach is that it does not limit the program expressiveness. In fact, it can be adopted
independently of the dataflow MoC class (i.e. static, cyclo-static and dynamic). The
effectiveness of this design methodology has been proven through the development and use of a
DSE framework called TURNUS. The main research contributions of this thesis are:
(i) Execution Trace Graph: a graph-based representation of the program execution has
been formalized and illustrated in Chapter 5. It has been shown how this mathematical
formalism, called execution trace graph (ETG), can be used to model the execution of
static, cyclo-static and dynamic dataflow programs as a directed acyclic graph (DAG).
Nodes and edges of this DAG represent a single action firing and a (data or functional) de-
pendency between two different action firings, respectively. Notions of partially-ordered
sets (i.e. po-sets) and directed paths (i.e. d-paths) have been adapted to this execution
model. Different dependency kinds have been defined, notably the finite state machine
dependencies, the internal variable dependencies, the port dependencies, the token
dependencies and the guard dependencies. Compared to similar graph-based execution
models, the ETG model defines the concept of guard enable and disable dependencies.
By the use of these kinds of dependencies it has been demonstrated how different execu-
tion trajectories can be modeled using the ETG. This can be obtained through a serial
high-level program execution. The guard enable and disable dependencies also make
it possible to model the execution of dynamic dataflow programs without requiring an
MoC expressiveness reduction (i.e. to a static or a cyclo-static MoC) as done by previous
approaches. Two interesting properties of the ETG are that design performance can be
efficiently estimated through post-mortem scheduling (see Section 8.1) and that
different analysis approaches can be used to find design configuration points that satisfy
trade-off requirements between performance and resource utilization. As an example,
LP methods (e.g. see Section 8.5.3) and advanced control techniques (e.g. see Section
8.4.3) can be efficiently used for every class of dataflow MoC.
(ii) Profiling of dynamic dataflow programs: a systematic methodology for profiling
dynamic dataflow programs has been formalized and illustrated in Chapter 7. It has been
shown how static and dynamic information concerning the program complexity can be
retrieved during a high-level code interpretation of the program. Compared to previous
approaches that mainly required a low-level code generation and binary-execution of
the program, the proposed methodology is completely based on a serial and high-level
code interpretation of the program. Furthermore, it has been shown how the profiling
information is systematically used to evaluate the different dependency kinds of an ETG.
(iii) Design space exploration methodology: a unified SW and HW DSE methodology based
on the post-mortem scheduling and analysis of the ETG has been formalized and il-
lustrated in Chapter 8. A collection of heuristic methods used for efficiently exploring
different design configurations of a dynamic dataflow program has been illustrated.
Compared to previous approaches, this methodology makes the analysis of dynamic
dataflow programs possible without limiting MoC expressiveness. The main research
contributions are:
• Performance estimation: a unified performance estimation approach for both
HW and SW, based on ETG post-mortem scheduling, has been introduced and
illustrated in Section 8.1. Compared to other approaches that require several
partial low-level implementations and integrations of the program, the proposed
approach makes use of a DEVS simulator. Additional dependencies and timing
information are evaluated and enhanced by cycle-accurate profiling data obtained
from third-party profilers. The use of the ETG makes it possible to analyze the
application parallelism on its own and to explore its available levels of parallelism.
Furthermore, SW and HW platforms are modeled with a unified approach.
• Design space critical path: the concept of design space critical path has been
formalized and illustrated in Section 8.2. This concept has been used to bound
and limit the design configuration of an application that should be considered by
the DSE optimization heuristics. Furthermore, the concept of potential speedup
defined in terms of critical path length has been formalized using the notion of ETG
critical path. Contrary to the state of the art, this definition clearly identifies the
serial and parallel parts of a program by using the ETG’s graph-based formalism.
• Hotspots analysis: the concept of hotspots of a dataflow program has been formal-
ized and illustrated in Section 8.3. This metric provides clear source code refactoring
information to the designer that can be employed to reduce the algorithmic com-
plexity and improve the throughput of a program. Compared to previous methods,
this metric takes data or functional dependencies defined by the ETG directly into
account.
• Buffer size configuration dimensioning: a buffer size dimensioning approach for
dynamic dataflow programs has been formalized and illustrated in Section 8.4. The
design space of the program has been split into a deadlock and feasible region.
These regions are separated by what is called minimal buffer size configuration.
This thesis has proposed a methodology, based on the use of advanced control
techniques, to find a close-to-minimal buffer size configuration and successively
evaluate different trade-offs between throughput and memory usage. Compared to
previous approaches that are suitable only for static or cyclo-static dataflow MoCs,
this methodology is suitable for dynamic dataflow MoCs and it does not require
any behavioral limitation of the program’s MoC. It must be noted that for dynamic
dataflow programs, the minimal buffer size configuration can vary according to the
input sequence used. As such, a deadlock-free execution can be guaranteed only
for the tested set of input sequences. In other words, a deadlock-free execution can
be guaranteed only if representative input sequences are tested. This is a common
practice in multimedia processing.
• Dynamic power dissipation minimization: a dynamic power dissipation
minimization approach for dynamic dataflow programs has been formalized and
illustrated in Section 8.5. It has been shown how the program can be mapped on a
multi-clock domain architecture, reducing its dynamic power dissipation without
impacting the throughput performance. Compared to previous approaches that
are suitable only for static or cyclo-static dataflow MoCs, this methodology is also
suitable for dynamic dataflow MoCs.
(iv) Design space exploration environment: a DSE environment, called TURNUS, suitable
for the analysis and optimization of CAL applications has been developed and provided
as an open source project. The structure and main functionalities of this framework have
been illustrated in Chapter 6. Throughout this thesis it has been shown how TURNUS has been
integrated with other open-source CAL HW synthesis and SW code generation tools (i.e.
Xronos and Orcc, respectively). Its integration with these tools has provided a
complete system design environment for CAL applications that was not available before
this work.
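To make the notions of ETG critical path and potential speedup summarized in the contributions above more concrete, the following minimal sketch computes the longest weighted path of a small DAG of action firings. The firing names, the weights and the helper `critical_path_length` are hypothetical. The speedup is taken here as the ratio between the total workload and the critical path length, which is an assumption of this sketch and not necessarily the exact definition given in Section 8.2.

```python
from collections import defaultdict, deque


def critical_path_length(weights, dependencies):
    """Length of the longest weighted path in a DAG of firings.

    weights: dict mapping each firing to its execution weight.
    dependencies: iterable of (source, target) pairs; target depends on source.
    """
    successors = defaultdict(list)
    in_degree = {node: 0 for node in weights}
    for source, target in dependencies:
        successors[source].append(target)
        in_degree[target] += 1

    # finish[n]: earliest completion time of n with unbounded parallelism.
    finish = {node: weights[node] for node in weights}
    ready = deque(node for node, degree in in_degree.items() if degree == 0)
    visited = 0
    while ready:
        node = ready.popleft()
        visited += 1
        for succ in successors[node]:
            finish[succ] = max(finish[succ], finish[node] + weights[succ])
            in_degree[succ] -= 1
            if in_degree[succ] == 0:
                ready.append(succ)
    if visited != len(weights):
        raise ValueError("the dependency graph contains a cycle")
    return max(finish.values(), default=0)


# Hypothetical trace: two firings of an actor A feed two firings of an actor B.
weights = {"a1": 2, "a2": 2, "b1": 3, "b2": 3}
dependencies = [("a1", "a2"), ("a1", "b1"), ("a2", "b2"), ("b1", "b2")]

cp = critical_path_length(weights, dependencies)   # path a1 -> b1 -> b2
speedup = sum(weights.values()) / cp               # assumed: workload / CP length
```

The traversal visits each firing only after all its predecessors, so a single pass over the dependencies is sufficient.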
A collection of experimental results based on the analysis of image and video codec
applications has been illustrated and discussed in Chapter 9. The results have been obtained
during the different stages of the design space exploration of JPEG, MPEG4-SP and HEVC
decoders. More precisely, the following use cases have been discussed. Critical path design
exploration and code refactoring assisted by impact analysis have been illustrated for the
HEVC video decoder, implemented on a multi-core i7 desktop CPU. Subsequently, the bounded
buffer size heuristic based on the use of an MPC controller has been used for both a JPEG and
an HEVC decoder. An optimal trade-off between buffer size dimensioning and throughput
performance has been discussed for an MPEG4-SP decoder implemented on an STHorm
many-core platform. Finally, the dynamic power dissipation minimization heuristic has been
used for an MPEG4-SP decoder implemented on a Xilinx Virtex-5 FPGA. It must be noted that
these results have been obtained using the same CAL programs implemented on a wide variety
of parallel architectures. Although the unified SW and HW methodology illustrated in this
thesis has been validated, several challenging issues are still unsolved. These open problems
are discussed in the following section.
10.1 Future work
Since we used the term orthogonalization of concerns, in this section the term
orthogonalization of effort is coined. By orthogonalization of effort it is meant that both
theoretical and implementation work should be pursued in order to improve the DSE
methodology and the supporting framework.
Open theory problems
The following problems, concerning the theory behind the DSE methodology,
need more investigation:
• Hardware and algorithmic critical path: the relation between the program algorithmic
critical path and the hardware critical path should be investigated. The objective is to
verify whether it is possible to provide further information to the designer in order to
scale the design to higher frequencies.
• Many-core partitioning: the problem of many-core partitioning of dynamic dataflow
programs is a must-do problem. A solution should also be provided within the DSE framework
in order to obtain a fully-functional co-design environment. Initial, but still
unpublished, work has concentrated on finding effective partitioning heuristics based on
the analysis of the design space critical path. However, further investigation is required
to validate and prove the effectiveness of this approach.
• Pipelining analysis: initial, but still unpublished, work concerns the identification,
through ETG analysis, of actions whose execution can be pipelined. If these actions are along
the algorithmic critical path, pipelining their execution would directly improve
the design performance without requiring any program modifications.
• Scheduling optimization: another must-do problem is the analysis and optimization of
the scheduling of dynamic dataflow programs. Solving this problem is essential in order
to fully explore the design space defined as the configuration points of partitioning,
scheduling and buffer size configurations.
• Performance estimation: the performance estimation methodology based on the post-
mortem processing of the ETG should be verified and validated on a wider set of het-
erogeneous parallel platforms. An example of a not-yet tested platform is the emerging
many-core Parallella platform [214].
Open implementation problems
The following problems, concerning the improvement of the TURNUS DSE
framework, need more investigation:
• Complex guard conditions: guard enable and disable dependencies play a key role in
modeling different execution trajectories with a single ETG. Guard conditions in which more
than one internal actor variable is involved require the use of a satisfiability modulo
theories (SMT) problem solver [215]. Currently, the detection of enabled and disabled
guard conditions has to be performed by the code interpreter. As future work, a
new functionality that should be integrated within the DSE framework is the possibility
to model such complex guard conditions and directly detect when a guard has been
enabled or disabled. In other words, enabling and disabling guard conditions should
be directly identified when post-processing the ETG by analyzing the internal variable
modifications performed by the firings.
• Big data: the size of an ETG grows rapidly as the number of action firings increases. For
complex designs, and where big input sequences are used as program stimulus, the
corresponding ETG can contain millions, even billions, of nodes and dependencies. This
could become a problem if effective methods for handling such big data are not used. An
initial experimental approach that has been tested within the DSE framework is the use
of a graph database integrated in a Blueprints graph interface ecosystem [216, 217, 218].
However, further investigation is required to consolidate and improve the performance
of this approach.

A Discrete event system and simulation
A.1 Petri nets
A Petri net [156, 157] (PN) is a bipartite directed graph with two kinds of nodes, called tran-
sitions and places, where arcs are either from a place to a transition or from a transition to a
place. Using the concept of conditions and events, places represent conditions, and transitions
represent events. A transition (an event) has a certain number of input and output places repre-
senting the pre-conditions and post-conditions of the event, respectively. Contrary to KPN and
DPN, where tokens are atomic data objects, in a PN, tokens are used to simulate the dynamic
and the concurrent activity of the system. The presence of a token in a place is interpreted
as holding the truth of the condition associated with the place. In another interpretation, n
tokens are put in a place to indicate that n data items or resources are available.
A PN is formally defined as a tuple $\{P, T, E, W, M_0\}$, where:
• $P = \{p_1, \dots, p_m\}$ is a finite set of places.
• $T = \{t_1, \dots, t_n\}$ is a finite set of transitions.
• $E \subseteq (P \times T) \cup (T \times P)$ is a finite set of directed arcs connecting transitions to places and
places to transitions.
• $W : E \to \mathbb{N}$ is a weight function which defines the weight assignment for each arc.
• $M_0 : P \to \mathbb{N}_0$ is the initial marking.
• $P \cap T = \emptyset$ and $P \cup T \neq \emptyset$.
According to the weight function W, arcs are labeled with positive integer numbers $W(p_i, t_j)$
and $W(t_i, p_j)$, representing respectively the weight of an arc from a place to a transition and
the weight of an arc from a transition to a place. The two sets ${}^{\bullet}t = \{p \in P : (p, t) \in E\}$ and
$t^{\bullet} = \{p \in P : (t, p) \in E\}$ define respectively the pre-set and post-set of a transition t.
Remark. PN transitions are like KPN processes or DPN actors: they fire when sufficient input
is available. However, tokens have no value and firing of a transition does not involve any
computation on tokens. For a PN, a firing is just the act of moving tokens from one place to
another. Moreover, contrary to KPN and DPN buffers, places do not preserve the token ordering.
A.1.1 State
The state of a PN is described by a marking function $M : P \to \mathbb{N}_0$, which assigns to each place
a non-negative integer representing the number of tokens residing in that place. Typically, the
marking function M is described as a column vector $M = [M(p_1), \dots, M(p_m)]' \in \mathbb{N}_0^m$ whose
generic entry $M(p_i)$, $i = 1, \dots, m$, represents the number of tokens present in place $p_i$, and the
symbol $[\cdot]'$ is the matrix transpose operator.
A.1.2 Transition firing
Each firing (or occurrence) of a transition produces an update of the net marking vector M . In
order to be able to occur, a transition has to be enabled. A transition t is said to be enabled if it
satisfies the following firing rule:
$$\forall p \in {}^{\bullet}t : M(p) \ge W(p, t) \qquad \text{(A.1)}$$
Roughly speaking, the occurrence of a transition removes tokens from the pre-set of a transi-
tion and adds tokens to its post-set, according to the weights of the arcs connecting the places
to the transition. The firing of a transition t in M results in a new marking M˜ defined as:
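$$
\tilde{M}(p) : p \mapsto
\begin{cases}
M(p) + W(t, p) - W(p, t) & \text{for } p \in {}^{\bullet}t \cap t^{\bullet} \\
M(p) + W(t, p) & \text{for } p \in t^{\bullet} \setminus {}^{\bullet}t \\
M(p) - W(p, t) & \text{for } p \in {}^{\bullet}t \setminus t^{\bullet} \\
M(p) & \text{otherwise}
\end{cases}
\qquad \text{(A.2)}
$$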
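The firing rule of Equations (A.1) and (A.2) can be sketched in a few lines of Python. This is a minimal illustration and not part of the thesis tooling: the function names `is_enabled` and `fire`, the dictionary-based encoding of the weight function W and the example net are all assumptions made here.

```python
def is_enabled(marking, weights, places, t):
    """Equation (A.1): every input place of t holds at least W(p, t) tokens."""
    return all(marking[p] >= weights.get((p, t), 0) for p in places)


def fire(marking, weights, places, t):
    """Equation (A.2): return the marking reached by firing transition t."""
    if not is_enabled(marking, weights, places, t):
        raise ValueError(f"transition {t!r} is not enabled")
    # A missing arc is treated as weight 0, which covers the four cases of (A.2) at once.
    return {
        p: marking[p] + weights.get((t, p), 0) - weights.get((p, t), 0)
        for p in places
    }


# Hypothetical net: p1 --2--> t1 --1--> p2
places = ["p1", "p2"]
weights = {("p1", "t1"): 2, ("t1", "p2"): 1}
m0 = {"p1": 3, "p2": 0}

m1 = fire(m0, weights, places, "t1")  # two tokens leave p1, one token enters p2
```

Because the enabling check is performed before the update, the returned marking never contains a negative entry.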
The marking transition from M to $\tilde{M}$ can be concisely represented using the notation $M[t\rangle\tilde{M}$.
Note that the enabling condition of Equation (A.1) guarantees that the resulting marking can
never assign a negative number to a place. The run of a PN can then be defined as the
marking sequence $M_0^k = \{M_i, i = 0, \dots, k\}$ obtained by firing the enabled transition sequence
$t_1^k = \{t_i, i = 1, \dots, k, t_i \in T\}$, such that $M_{i-1}[t_i\rangle M_i$ for $i \in \{1, \dots, k\}$. It must be noted that the run
of a PN is non-deterministic: when multiple transitions are enabled at the same time, any
one of them may fire. A run of a PN can be effectively computed by means of elementary
matrix operations (i.e. multiplications and additions) using the pre-incidence matrix I and
the post-incidence matrix O, respectively defined as:
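$$
I = [q_{i,j}]_{\substack{i=1,\dots,m \\ j=1,\dots,n}}, \quad
q_{i,j} =
\begin{cases}
W(p_i, t_j) & \text{for } (p_i, t_j) \in E \\
0 & \text{otherwise}
\end{cases}
$$
$$
O = [r_{i,j}]_{\substack{i=1,\dots,m \\ j=1,\dots,n}}, \quad
r_{i,j} =
\begin{cases}
W(t_j, p_i) & \text{for } (t_j, p_i) \in E \\
0 & \text{otherwise}
\end{cases}
\qquad \text{(A.3)}
$$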
In this way, the marking $M_k(p)$ obtained by firing transition t at event k can be expressed as:
$$M_k(p) = M_{k-1}(p) - I(p, t) + O(t, p), \quad \forall p \in P \qquad \text{(A.4)}$$
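Equations (A.3) and (A.4) translate directly into matrix operations. The sketch below uses plain Python lists and indexes both matrices by place (row) and transition (column), following the ranges i = 1, ..., m and j = 1, ..., n of Equation (A.3). The helper names and the example net are assumptions. Since the run of a PN is non-deterministic, the sketch resolves the choice by always firing the enabled transition with the lowest index, which yields only one of the admissible runs.

```python
def enabled_transitions(marking, pre):
    """Indices j of the transitions whose pre-incidence column fits in the marking."""
    n_transitions = len(pre[0])
    return [
        j for j in range(n_transitions)
        if all(marking[i] >= pre[i][j] for i in range(len(marking)))
    ]


def fire(marking, pre, post, j):
    """Equation (A.4) applied to every place: subtract column j of I, add column j of O."""
    return [m - pre[i][j] + post[i][j] for i, m in enumerate(marking)]


def run(marking, pre, post, max_events):
    """Marking sequence obtained by firing the first enabled transition at each event."""
    sequence = [list(marking)]
    for _ in range(max_events):
        candidates = enabled_transitions(sequence[-1], pre)
        if not candidates:
            break  # dead marking: no transition can fire
        sequence.append(fire(sequence[-1], pre, post, candidates[0]))
    return sequence


# Hypothetical net with places (p1, p2) and transitions (t1, t2):
# t1 moves one token from p1 to p2, t2 consumes two tokens from p2.
PRE = [[1, 0],
       [0, 2]]
POST = [[0, 0],
        [1, 0]]

markings = run([2, 0], PRE, POST, max_events=10)
```

A different policy for choosing among the enabled transitions (for instance a random one) would produce another valid run of the same net.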
A.2 Discrete event system specification
Discrete event system specification (DEVS) [180, 181] is a conceptual framework for specifying
modular, hierarchical and timed event systems. Its formalism makes it possible to model
and analyze general systems that can be discrete event systems (i.e. described by state
transition tables), continuous state systems (i.e. described by differential equations), and hybrid
continuous state and discrete event systems.
Discrete event systems are a generalization of discrete time systems that allow time to be
continuous. The trajectories of a discrete-event system are functions from the time base R×N
to its sets of input, output, and state. These trajectories change value only a finite number of
times in any finite interval. This is the defining characteristic of a discrete event system: the
events that cause these discrete changes give the class of systems its name.
A.2.1 The atomic model
An Atomic DEVS model is defined as a tuple $M = (X, Y, S, ta, \delta_{ext}, \delta_{int}, \lambda)$ where
• $X$ is the set of input events.
• $Y$ is the set of output events.
• $S$ is the set of sequential states (also called the set of partial states).
• $ta : S \to T^\infty$ is the time advance function, which is used to determine the lifespan of a
state.
• $\delta_{ext} : Q \times X \to S$ is the external transition function, which defines how an input event
changes a state of the system.
• $\delta_{int} : S \to S$ is the internal transition function, which defines how a state of the system
changes internally (i.e. when the elapsed time reaches the lifetime of the state).
• $\lambda : S \to Y^\phi$ is the output function, where $Y^\phi = Y \cup \{\phi\}$ and $\phi \notin Y$ is a silent event or an
unobserved event. This function defines how a state of the system generates an output
event (when the elapsed time reaches the lifetime of the state).
where $Q = \{(s, t_e) : s \in S,\ t_e \in (T \cap [0, ta(s)])\}$ is the set of total states, $t_e$ is the elapsed time since
the last event, and $T^\infty = [0, \infty]$ defines the extended time base, that is the set of the non-negative
real numbers plus infinity [180].
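The role of each element of the tuple can be illustrated with a small Python sketch. Everything below is an assumption introduced for illustration only: the `Processor` model (a server that needs 2 time units per job), the `simulate` loop, and the choice of handling the internal transition first when an internal and an external event coincide.

```python
import math


class Processor:
    """Hypothetical atomic DEVS model: a server needing 2 time units per job."""

    SERVICE_TIME = 2.0

    def initial_state(self):
        # A state s in S is a pair (phase, sigma); sigma is the remaining lifetime
        return ("idle", math.inf)

    def ta(self, s):
        # Time advance: lifespan of the state
        return s[1]

    def delta_ext(self, s, e, x):
        # (s, e) is a total state in Q; e is the time elapsed since the last event
        phase, sigma = s
        if phase == "idle" and x == "job":
            return ("busy", self.SERVICE_TIME)
        return (phase, sigma - e)  # input ignored, keep counting down

    def delta_int(self, s):
        return ("idle", math.inf)

    def output(self, s):
        # lambda: None plays the role of the silent event phi
        return "done" if s[0] == "busy" else None


def simulate(model, inputs, t_end):
    """inputs: list of (time, event) sorted by time; returns the list of (time, output)."""
    s, t_last = model.initial_state(), 0.0
    pending, outputs = list(inputs), []
    while True:
        t_int = t_last + model.ta(s)
        t_ext = pending[0][0] if pending else math.inf
        t_next = min(t_int, t_ext)
        if t_next > t_end:
            return outputs
        if t_int <= t_ext:
            # Internal event: emit the output, then change state
            y = model.output(s)
            if y is not None:
                outputs.append((t_next, y))
            s = model.delta_int(s)
        else:
            # External event: the elapsed time is passed to delta_ext
            _, x = pending.pop(0)
            s = model.delta_ext(s, t_next - t_last, x)
        t_last = t_next


outputs = simulate(Processor(), [(1.0, "job"), (5.0, "job")], t_end=10.0)
```

With jobs arriving at times 1 and 5, the model leaves the `idle` state (whose lifespan is infinite) only on an external event, and returns to it through an internal transition after the service time has elapsed.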
B Model predictive control
The key aspects of the Model Predictive Control (MPC) are presented in the following. However,
for a more complete presentation, the interested reader can refer to [198, 199] where different
domains of application are also illustrated. MPC, also known as receding horizon control,
is a model-based form of control in which the control action is obtained by solving, at each
sampling time, a finite-horizon open loop optimal control problem. Using the current state of
the plant as the initial state of the problem, the optimization solution yields an optimal control
sequence. The control loop is then closed by using the first control move obtained from the
optimized sequence. This is the main difference from conventional control strategies (e.g.
Proportional-Integral-Derivative, Linear-Quadratic Regulator), which use a pre-computed
control law.
One of the main features of MPC is the possibility to take hard system constraints directly
into account in the optimization problem. The system evolution is predicted over $H_p$ (i.e.
prediction horizon) events $k$. During event $k$, using the standard notation of discrete event
systems (e.g. see Equation (5.16)), the actual output is represented as $y(k)$, while the predicted
output and the optimized input for the event $k+i$ are represented respectively as $y(k+i|k)$ and
$u(k+i|k)$. At each event $k$, the MPC strategy calculates a set of $H_c \le H_p$ (control horizon)
values of the input $U(k)_o^{H_c} = \{u(k+i|k),\ \forall i \in \{0, 1, \dots, H_c-1\}\}$. The input is evaluated so
that the predicted outputs $\hat{Y}(k)_o^{H_p} = \{y(k+i|k),\ \forall i \in \{1, 2, \dots, H_p\}\}$ reach the target point in an
optimal manner. $U(k)_o^{H_c}$ is obtained by optimizing a linear or quadratic constrained objective
function such as:
$$
J(k) = f(x(k|k), u(k|k), u(k+1|k), \dots, u(k+H_c-1|k)) \tag{B.1}
$$
In other words, at each event k the objective is to minimize an objective function subject to
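additional constraints as:
$$
\begin{aligned}
\underset{u(k|k),\,u(k+1|k),\,\dots,\,u(k+H_c-1|k)}{\text{minimize}} \quad & J(k) \\
\text{subject to} \quad & y_{min} \le y(k+i|k) \le y_{max}, && \forall i \in \{1, 2, \dots, H_p\} \\
& u_{min} \le u(k+i|k) \le u_{max}, && \forall i \in \{0, 1, \dots, H_c-1\} \\
& u(k+i|k) = 0, && \forall i \in \{H_c, H_c+1, \dots, H_p\} \\
& g(u(k|k), u(k+1|k), \dots, u(k+H_c|k)) \le 0
\end{aligned}
\tag{B.2}
$$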
It must be noted that during the prediction, the control is held constant after $H_c$ control moves
(i.e. $u(k+i|k) = 0$). As mentioned before, a remarkable feature of MPC is its receding horizon
approach: after evaluating the optimal input set $U(k)_o^{H_c}$, only the first move $u(k)^* = u(k|k)$ is
actually implemented. Then, a new sequence is calculated at the next event and only the first
input move is implemented again.
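The receding horizon mechanism can be illustrated with a deliberately small Python sketch. The plant, the cost and the solver used here are assumptions chosen only to keep the example self-contained: a scalar integrator $x(k+1) = x(k) + u(k)$ with $y = x$, a quadratic tracking cost over the prediction horizon, a finite set of admissible inputs, and an exhaustive enumeration in place of a real optimization solver. The function names are hypothetical.

```python
from itertools import product


def predict(x, moves, hp):
    """Predicted outputs y(k+1|k) .. y(k+Hp|k); the input is 0 after the Hc moves."""
    outputs = []
    for i in range(hp):
        u = moves[i] if i < len(moves) else 0.0
        x = x + u          # assumed plant: x(k+1) = x(k) + u(k), y = x
        outputs.append(x)
    return outputs


def mpc_step(x, target, hc=2, hp=3, u_values=(-1.0, 0.0, 1.0),
             y_min=-10.0, y_max=10.0):
    """Solve the finite-horizon problem by enumeration and return the first move only."""
    best_cost, best_moves = None, None
    for moves in product(u_values, repeat=hc):
        outputs = predict(x, moves, hp)
        if any(y < y_min or y > y_max for y in outputs):
            continue       # hard output constraint violated
        cost = sum((y - target) ** 2 for y in outputs)
        if best_cost is None or cost < best_cost:
            best_cost, best_moves = cost, moves
    if best_moves is None:
        raise ValueError("no feasible control sequence")
    return best_moves[0]


def closed_loop(x0, target, steps):
    """Apply the first optimized move at each event, then re-optimize."""
    x, trajectory = x0, [x0]
    for _ in range(steps):
        x = x + mpc_step(x, target)
        trajectory.append(x)
    return trajectory


trajectory = closed_loop(0.0, 3.0, steps=5)
```

At every event a full sequence of $H_c$ moves is optimized, but only its first element is applied to the plant, which is what closes the loop.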
C A CAL esoteric example
This appendix is a fun dissemination example that can be used to explain what a dataflow
program is to people without any notion of dataflow programming (or of parallel programming
in general). The example that follows is not meant to be strictly scientifically correct. It is
based on a chocolate cake recipe, as a tribute to my parents, Patrizia and Andrea, who are both
hotel-keepers and usually prepare homemade cakes for their guests’ breakfast [219].
Unfortunately for the reader, the recipe presented in the following is not the original chocolate
cake that my parents prepare, because the original recipe could be used by their competitors
when they read this thesis.
C.1 A Chef chocolate cake
The Hello World Cake with Chocolate sauce [220] is an open source recipe. This
recipe, reported in Listing C.1, is written using the Chef esoteric programming language [221].
Within the Chef formalism, a program looks like a recipe. A Chef program is composed of
ingredients, mixing bowls and baking dishes. According to the definition presented in [221],
the ingredients hold individual data values and a program has access to an unlimited number
of mixing bowls and baking dishes. These contain ingredient data values. The ingredients
in a mixing bowl or baking dish are ordered, like a stack of pancakes. New ingredients are
placed on top, and if values are removed then they are removed from the top. If the value
of an ingredient changes, its old value in the mixing bowl or baking dish does not change.
The values in the mixing bowls and baking dishes also retain their dry or liquid designations.
The Chef recipe reported in Listing C.1 is composed of two methods: the first (see line 18)
describes how the cake is baked; the second (see line 47) describes how the chocolate sauce
used to serve the cake is made. Each method defines how ingredients, mixing bowls and
baking dishes are used. For each method, a set of input ingredients is defined (see lines 3
and 40, respectively).
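The stack-like behaviour of a mixing bowl described above can be sketched in Python. This is an illustration only and not a Chef interpreter: the `MixingBowl` class and its method names are assumptions, and the dry or liquid designation of the values is not modeled.

```python
class MixingBowl:
    """Stack-like container: values are added to and removed from the top."""

    def __init__(self):
        self._values = []

    def put(self, value):
        # The bowl stores the value itself, not a reference to the ingredient
        self._values.append(value)

    def take_top(self):
        return self._values.pop()

    def top_to_bottom(self):
        return list(reversed(self._values))


ingredients = {"sugar": 114, "flour": 119}   # quantities taken from Listing C.1

bowl = MixingBowl()
bowl.put(ingredients["sugar"])
bowl.put(ingredients["flour"])

# Changing the ingredient afterwards does not affect the value already in the bowl
ingredients["sugar"] = 0

snapshot = bowl.top_to_bottom()   # flour on top, then the original sugar value
removed = bowl.take_top()         # values leave the bowl from the top
```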
Listing C.1: Cake.chef
1 Hello World Cake with Chocolate sauce.
2
3 Ingredients.
4 33 g chocolate chips
5 100 g butter
6 54 ml double cream
7 2 pinches baking powder
8 114 g sugar
9 111 ml beaten eggs
10 119 g flour
11 32 g cocoa powder
12 0 g cake mixture
13
14 Cooking time: 25 minutes.
15
16 Pre-heat oven to 180 degrees Celsius.
17
18 Method.
19 Clean the mixing bowl.
20 Put chocolate chips into the mixing bowl.
21 Put butter into the mixing bowl.
22 Put sugar into the mixing bowl.
23 Put beaten eggs into the mixing bowl.
24 Put flour into the mixing bowl.
25 Put baking powder into the mixing bowl.
26 Put cocoa powder into the mixing bowl.
27 Stir the mixing bowl for 1 minute.
28 Combine double cream into the mixing bowl.
29 Stir the mixing bowl for 4 minutes.
30 Liquify the contents of the mixing bowl.
31 Pour contents of the mixing bowl into the baking dish.
32 Bake the cake mixture.
33 Wait until baked.
34 Serve with chocolate sauce.
35
36
37
38 chocolate sauce.
39
40 Ingredients.
41 111 g sugar
42 108 ml hot water
43 108 ml heated double cream
44 101 g dark chocolate
45 72 g milk chocolate
46
47 Method.
48 Clean the mixing bowl.
49 Put sugar into the mixing bowl.
50 Put hot water into the mixing bowl.
51 Put heated double cream into the mixing bowl.
52 Dissolve the sugar.
53 Agitate the sugar until dissolved.
54 Liquify the dark chocolate.
55 Put dark chocolate into the mixing bowl.
56 Liquify the milk chocolate.
57 Put milk chocolate into the mixing bowl.
58 Liquify contents of the mixing bowl.
59 Pour contents of the mixing bowl into the baking dish.
60 Refrigerate for 1 hour.
C.2 From a sequential to a dataflow program specification
A Chef program can be seen as a sequential collection of operations on ingredients,
with mixing bowls and baking dishes used as containers. Now suppose that the cake must
be prepared before a given time and in the least time possible. For example, my parents
want to make the breakfast cake in no more than one hour, so they decide to
cooperate in the baking process. However, which part of the cake should be prepared
by my mother and which part by my father? Furthermore, which parts of the cake should be
prepared before the other parts? The problem here is how to effectively find which parts of the
cake can be made at the same time and which parts need to be made before the other parts.
Considering the Hello World Cake with Chocolate sauce illustrated before, it
can be very hard to identify which parts of the recipe can be prepared at the same time. For
example, the facts that the chocolate sauce can be prepared at the same time as other parts of
the cake, and that this sauce is used for dressing the cake, are not explicitly stated by the recipe.
This problem can be solved by using a dataflow formalism for defining the cake recipe. With
a dataflow approach, the cake recipe becomes much easier to read. A basic
dataflow representation of this recipe is depicted in Figure C.1. This is what is called a dataflow
network, where boxes are actors (i.e. computational kernels) interconnected by buffers that
handle unbounded sequences of tokens (i.e. atomic data objects). In this representation, each
actor contains a single Chef method and each buffer models how ingredients flow between
different mixing bowls or baking dishes. Each dataflow token represents a single ingredient
unit (e.g. 1 g of sugar). Using this formalism it is immediately clear what the individual parts
of the recipe are, how they are related and which parts should be prepared before others.
Figure C.1: A dataflow representation of the Hello World Cake with Chocolate
sauce Chef program illustrated in Listing C.1.
(Labels recovered from the figure: actors ChocolateSauce and Cake; buffers Sugar, Water, Cream,
DarkChocolate, MilkChocolate, ChocolateSauce, ChocolateChips, Butter, Sugar, BeatenEggs, Flour,
BakingPowder, CocoaPowder, DoubleCream, ChocolateCake.)
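The execution model of such a network can be sketched in Python. The sketch below is an assumption-laden illustration and not the CAL semantics: the `Actor` class, the fixed consumption rates, the round-robin `run` loop and the scaled-down ingredient quantities are all introduced here for brevity.

```python
from collections import deque


class Actor:
    """Fires when every input buffer holds the tokens required by one firing."""

    def __init__(self, name, inputs, rates, output):
        self.name = name
        self.inputs = inputs    # port name -> FIFO buffer (deque)
        self.rates = rates      # port name -> tokens consumed per firing
        self.output = output    # FIFO buffer receiving the produced token

    def can_fire(self):
        return all(len(self.inputs[p]) >= n for p, n in self.rates.items())

    def fire(self):
        for port, n in self.rates.items():
            for _ in range(n):
                self.inputs[port].popleft()
        self.output.append(self.name)


def run(actors):
    """Fire enabled actors until no actor can fire; return the firing order."""
    order = []
    progress = True
    while progress:
        progress = False
        for actor in actors:
            if actor.can_fire():
                actor.fire()
                order.append(actor.name)
                progress = True
    return order


# Buffers preloaded with ingredient tokens (quantities scaled down for brevity)
sugar, water, flour = deque(["g"] * 2), deque(["ml"]), deque(["g"] * 3)
sauce_buffer, served = deque(), deque()

sauce = Actor("ChocolateSauce", {"Sugar": sugar, "Water": water},
              {"Sugar": 2, "Water": 1}, sauce_buffer)
cake = Actor("Cake", {"Flour": flour, "ChocolateSauce": sauce_buffer},
             {"Flour": 3, "ChocolateSauce": 1}, served)

# The Cake actor is listed first, yet it can only fire once the sauce token exists
firing_order = run([cake, sauce])
```

The order in which the actors fire is determined by the availability of tokens on the buffers, not by the order in which the actors are written down.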
C.3 The first CAL chocolate cake
In the previous section it has been shown how the recipe can be modeled as a dataflow
program. However, it has not been described how the syntax and the semantics (i.e. how the
program is written and what it describes) of a Chef program can be translated to a dataflow
program. This section provides a possible translation of a Chef recipe to an "equivalent" CAL
dataflow program. Basically, each method is translated as an actor and each unitary quantity
of an ingredient as a token. As described before, Figure C.1 illustrates the CAL dataflow
representation of the Chef recipe reported in Listing C.1. This CAL program is composed of
two actors: the Cake actor and the ChocolateSauce actor. The CAL code of these two
actors is reported in Listings C.2 and C.3, respectively. For each actor the input tokens represent
the required ingredients, while the output tokens represent the amalgamation result of the
prepared (input) ingredients. The Cake actor is composed of seven actions, an actor internal
finite state machine (FSM) and two action priority conditions. Inputs and outputs of
this actor are defined in Lines 2-3 and 4, respectively. Similarly, the ChocolateSauce actor is
composed of six actions and an actor internal FSM. Inputs and output of this actor are defined
in Lines 2 and 3, respectively. In both actors the respective Chef mixing bowl is modeled as
an internal variable: the cakeMixture in the Cake actor, and the chocolateSauce in
the ChocolateSauce actor. Activities like liquify, stir, agitate and refrigerate are modeled
as CAL functions that can aggregate, modify or define the status of a single ingredient or of a
collection of ingredients.
Exploiting dataflow properties
The CAL program illustrated before can be used to show the power of a dataflow
approach when dealing with parallel programs. As an example, let’s consider the two actions
liquifyDarkChocolate and liquifyMilkChocolate defined in Lines 25 and 35,
respectively, of the ChocolateSauce actor. It is easy to see how liquefying (i.e. melting)
both the dark and milk chocolate can be performed at the same time if there are at least two
chefs (e.g. my parents). The dataflow approach makes it possible to model this condition
easily and explicitly, as illustrated in Figure C.2: the Chef liquify activity can be modeled as a single
CAL actor where the input is the solid ingredient and the output is the liquefied ingredient,
as illustrated in Listing C.4. The same approach can be applied to the other Chef
activities like stir, agitate and refrigerate. In computer science, the fact that these activities
can be performed at the same time by different chefs is called parallelism (or more precisely
task parallelism). Furthermore, it must be noted that this actor can be used for both milk
and dark chocolate: in computer science this is called code reusability. In order to exploit
these properties the ChocolateSauce actor should be modified as illustrated in Listing C.5.
Consequently, the new dataflow network representation is the one depicted in Figure C.3. In
computer science, the property that an actor can be represented as a network of actors is called
modularity. Furthermore, inside the actors network depicted in Figure C.3, the refrigerate
Chef activity has been modeled with the Refrigerate CAL actor defined in Listing C.6. This
Listing C.2: Cake.cal
1 actor Cake()
2 ChocolateChips Cc, Butter B, DoubleCream Dc, BakingPowder Bp, Sugar S, BeatenEggs Be,
3 Flour F, CocoaPowder Cp, ChocolateSauce Cs
4 ==> CakeMixture Cm :
5
6 CakeMixture cakeMixture;
7
8 int stirMinutes;
9 int stirMaxMinutes;
10
11 clean: action ==>
12 do
13 // initialize mixing bowl
14 cakeMixture := 0;
15 // set the stir timer
16 stirMinutes := 0;
17 stirMaxMinutes := 1;
18 end
19
20 add: action B:[butter] repeat 100,
21 S:[sugar] repeat 114,
22 Cc:[chocoChips] repeat 33,
23 Be:[btnEggs] repeat 111,
24 F:[flour] repeat 119,
25 Bp:[bakingPwd] repeat 2,
26 Cp:[cocoaPwd] repeat 32 ==>
27 do
28 cakeMixture := butter + sugar + chocoChips + btnEggs + flour + bakingPwd + cocoaPwd;
29 end
30
31 stir1m: action ==>
32 guard stirMinutes < stirMaxMinutes
33 do
34 stir1minute(cakeMixture);
35 stirMinutes := stirMinutes + 1;
36 end
37
38 combineCream: action Dc:[doubleCream] repeat 54 ==>
39 do
40 cakeMixture := cakeMixture + doubleCream;
41 // set the stir timer
42 stirMinutes := 0;
43 stirMaxMinutes := 4;
44 end
45
46 liquify: action ==>
47 do
48 while(!isLiquified(cakeMixture)) do
49 liquify(cakeMixture);
50 end
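51 // pour the contents of the mixing bowl into the baking dish:
52 // the mixing bowl is modeled by cakeMixture itself, so no
53 // assignment is needed here, and the baking loop is
54 // performed by the separate bake action below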
55 end
56
57 bake: action ==>
58 do
59 while(!isBaked(cakeMixture)) do
60 bake(cakeMixture);
61 end
62 end
63
64 serve: action Cs:[sauce] ==> Cm:[cakeMixture]
65 do
66 cakeMixture := cakeMixture + sauce;
67 end
68
69 schedule fsm s0 :
70 s0(clean) --> s1;
71 s1(add) --> s2;
72 s2(stir1m) --> s2;
73 s2(combineCream) --> s3;
74 s3(stir1m) --> s3;
75 s3(liquify) --> s4;
76 s4(bake) --> s5;
77 s5(serve) --> s0;
78 end
79
80 priority
81 stir1m > combineCream;
82 stir1m > liquify;
83 end
84
85 end
Listing C.3: ChocolateSauce.cal
1 actor ChocolateSauce()
2 Sugar S, HotWater Hw, HeatedDoubleCream Hdc, DarkChocolate Dc, MilkChocolate Mc
3 ==> ChocolateSauce Cs :
4
5 ChocolateSauce chocolateSauce;
6
7 clean: action ==>
8 do
9 chocolateSauce := 0;
10 end
11
12 add: action Hw:[water] repeat 108, Hdc:[cream] repeat 108 ==>
13 do
14 chocolateSauce := water + cream;
15 end
16
17 dissolveSugar: action S:[sugar] repeat 111 ==>
18 do
19 chocolateSauce := chocolateSauce + sugar;
20 while(!isDissolved(chocolateSauce)) do
21 agitate(chocolateSauce);
22 end
23 end
24
25 liquifyDarkChocolate: action Dc:[darkChocolate] repeat 101 ==>
26 var
27 MeltedDarkChocolate mdc := 0
28 do
29 while(!isLiquified(darkChocolate)) do
30 mdc := mdc + liquify(darkChocolate);
31 end
32 chocolateSauce := chocolateSauce + mdc;
33 end
34
35 liquifyMilkChocolate: action Mc:[milkChocolate] repeat 72 ==>
36 var
37 MeltedMilkChocolate mmc := 0
38 do
39 while(!isLiquified(milkChocolate)) do
40 mmc := mmc + liquify(milkChocolate);
41 end
42 chocolateSauce := chocolateSauce + mmc;
43 end
44
45 refrigerate: action ==> Cs:[chocolateSauce]
46 do
47 foreach int t in 1 .. 60 do
48 refrigerate1minute(chocolateSauce);
49 end
50 end
51
52 schedule fsm s0 :
53 s0(clean) --> s1;
54 s1(add) --> s2;
55 s2(dissolveSugar) --> s3;
56 s3(liquifyDarkChocolate) --> s4;
57 s4(liquifyMilkChocolate) --> s5;
58 s5(refrigerate) --> s0;
59 end
60
61 end
actor models a refrigerator where the number of minutes required for refrigerating a product
is specified as a parameter. This parameter is taken from the recipe: for the chocolate cake
example this value is 60 minutes (i.e. the 1 hour stated in line 60 of Listing C.1). In computer
science, specifying parameters in such a way is referred to as compile-time configuration of
the program.
Figure C.2: The Liquify CAL actor defined in Listing C.4.
Listing C.4: Liquify.cal
actor Liquify(type Solid, type Liquid) Solid S ==> Liquid L :

  // consume one solid token and produce the corresponding liquid token
  liquify: action S:[solid] ==> L:[liquid]
  var
    Liquid liquid := 0
  do
    // keep melting until the whole solid ingredient is liquified
    while(!isLiquified(solid)) do
      liquid := liquid + liquify(solid);
    end
  end

end
Figure C.3: The modified version of the ChocolateSauce CAL actor.
(Labels recovered from the figure: actors Sauce, Liquify (two instances) and Refrigerate; buffers
Sugar, Water, Cream, DarkChocolate, MilkChocolate, MeltedDarkChocolate, MeltedMilkChocolate,
ChocolateSauce.)
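The three properties discussed in this section (task parallelism, code reusability and compile-time configuration) can be mimicked in a short Python sketch. It is only an analogy under stated assumptions: the function names are hypothetical, two worker threads stand in for the two chefs, and a closure stands in for an actor parameter.

```python
from concurrent.futures import ThreadPoolExecutor


def liquify(solid_tokens):
    """Reusable Liquify body: one melted token for each solid token."""
    return [f"melted {token}" for token in solid_tokens]


def make_refrigerate(minutes):
    """'Compile-time' configuration: minutes is fixed when the actor is created."""
    def refrigerate(product):
        return {"product": product, "minutes_refrigerated": minutes}
    return refrigerate


def chocolate_sauce(dark, milk, refrigerate):
    # Two chefs (worker threads) melt the two kinds of chocolate at the same time,
    # both running the same liquify code
    with ThreadPoolExecutor(max_workers=2) as chefs:
        melted_dark = chefs.submit(liquify, dark)
        melted_milk = chefs.submit(liquify, milk)
        sauce = melted_dark.result() + melted_milk.result()
    return refrigerate(sauce)


result = chocolate_sauce(["dark"] * 2, ["milk"], make_refrigerate(60))
```

The same `liquify` function serves both ingredients, and the refrigeration time cannot be changed once `make_refrigerate(60)` has been called.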
C.4 A dynamic refrigerator
Since the topic of this dissertation is the analysis of dynamic dataflow programs, this section
illustrates a very basic example of a dynamic dataflow actor. Let’s consider the Refrigerate actor
defined in Listing C.6. The number of minutes required for refrigerating the input product is defined as
a parameter. In other words, its value is specified before starting to prepare the recipe and
cannot be changed (i.e. compile-time configuration as previously discussed). However, it is
Listing C.5: ModifiedChocolateSauce.cal
actor ModifiedChocolateSauce()
    Sugar S, HotWater Hw, HeatedDoubleCream Hdc, MeltedDarkChocolate Dc, MeltedMilkChocolate Mc
    ==> ChocolateSauce Cs :

  ChocolateSauce chocolateSauce;

  clean: action ==>
  do
    chocolateSauce := 0;
  end

  // the chocolate arrives already melted by the two Liquify actors
  add: action S:[sugar] repeat 111,
              Hw:[water] repeat 108,
              Hdc:[cream] repeat 108,
              Dc:[darkChocolate] repeat 101,
              Mc:[milkChocolate] repeat 72 ==>
  do
    chocolateSauce := sugar + water + cream + milkChocolate + darkChocolate;
  end

  dissolve: action ==> Cs:[chocolateSauce]
  do
    while(!isDissolved(chocolateSauce)) do
      agitate(chocolateSauce);
    end
  end

  schedule fsm s0 :
    s0(clean) --> s1;
    s1(add) --> s2;
    s2(dissolve) --> s0;
  end

end
Listing C.6: Refrigerate.cal
actor Refrigerate(type Product, int minutes) Product H ==> Product C :

  // minutes is an actor parameter, fixed when the actor is instantiated
  refrigerate: action H:[product] ==> C:[product]
  do
    foreach int t in 1 .. minutes do
      refrigerate1minute(product);
    end
  end

end
possible that a chef decides to increase the number of minutes required for refrigerating
a product. This functionality of a refrigerator can be modeled as illustrated in the modified
Refrigerate CAL actor described in Listing C.7. The number of minutes that the product
should stay in the refrigerator is specified as an input token of the actor and can be modified
while making the cake (i.e. during program execution). In other words, (part of) the recipe can
be prepared according to some chef’s choices that are not predictable. In this example the
number of minutes can vary according to the refrigerating status of the cake. In computer
science, the execution of an actor that varies according to some input stimulus (i.e. the chef’s
choices) is referred to as dynamism.
Listing C.7: Refrigerate.cal
actor DynamicRefrigerator(type Product) Product H, int T ==> Product C, int R :

  Product product;
  int remainingTime := 0;

  // add the requested minutes to the timer; this action is not bound to the FSM
  setTimer: action T:[time] ==> R:[remainingTime]
  do
    remainingTime := remainingTime + time;
  end

  // store the incoming product in the refrigerator
  place: action H:[p] ==>
  do
    product := p;
  end

  refrigerate: action ==> R:[remainingTime]
  guard
    remainingTime > 0
  do
    refrigerate1minute(product);
    remainingTime := remainingTime - 1;
  end

  // the timer has expired: release the product
  ready: action ==> C:[product], R:[remainingTime] end

  schedule fsm s0 :
    s0(place) --> s1;
    s1(refrigerate) --> s1;
    s1(ready) --> s0;
  end

  priority
    refrigerate > ready;
    setTimer > place;
    setTimer > refrigerate;
    setTimer > ready;
  end

end
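How the FSM, the guard and the priorities of this actor interact can be followed with a small Python sketch. It is an approximation written for illustration, not generated from the CAL code: the class and attribute names are assumptions, the input buffers are plain lists, and the output ports are omitted.

```python
class DynamicRefrigerator:
    """Action selection of the dynamic refrigerator: one action fires per step."""

    def __init__(self):
        self.state = "s0"
        self.product = None
        self.remaining = 0
        self.minutes_run = 0

    def step(self, product_in, timer_in):
        """Fire at most one action and return its name (None if nothing can fire)."""
        # setTimer has the highest priority and is not bound to the FSM
        if timer_in:
            self.remaining += timer_in.pop(0)
            return "setTimer"
        if self.state == "s0" and product_in:
            self.product = product_in.pop(0)
            self.state = "s1"
            return "place"
        # refrigerate > ready: checked first, and only fires if its guard holds
        if self.state == "s1" and self.remaining > 0:
            self.remaining -= 1
            self.minutes_run += 1
            return "refrigerate"
        if self.state == "s1":
            self.state = "s0"
            return "ready"
        return None


fridge = DynamicRefrigerator()
products, timers = ["sauce"], [2]

trace = [fridge.step(products, timers) for _ in range(3)]
timers.append(1)   # the chef decides, while refrigerating, to add one more minute
while trace[-1] != "ready":
    trace.append(fridge.step(products, timers))
```

The number of `refrigerate` firings is not known before the execution starts: it depends on the time tokens that arrive while the product is in the refrigerator.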
C.5 Design space exploration of a kitchen
The design space exploration (DSE) of a dataflow program is one of the topics of this dissertation.
Ok, but what is the DSE of a dataflow program? In order to easily explain what the DSE is,
an analogy between a dataflow program implementation and the recipe is made in the following.
First of all, the recipe corresponds to a program. As illustrated in the previous
sections, a dataflow program can be seen as a cake recipe. Similarly, the kitchen corresponds to
the target platform, where each processing unit can be seen as a chef. Hence, a massively
parallel platform can be considered as a collection of chefs and commis chefs. So chefs are
like processing units that can execute all kinds of complex operations, and commis chefs are
like processing units that can execute a limited set of operations. Furthermore, the difference
between a chef and a commis chef is how much they are paid per hour. So it is convenient
to assign the simple tasks of a recipe to a commis chef and the complex tasks to a chef. In this
way, the constraints of an application implementation can be defined as the maximum time for
cooking a cake and the maximum amount of money that should be used to pay chefs and commis chefs.
Consequently, the design space of an application is the collection of design alternatives (i.e.
mapping configurations) that define which parts of a recipe should be assigned to a chef or a
commis chef (i.e. partitioning), the size of each mixing bowl and baking dish that should be
used (i.e. buffer size configuration) and the operation order that each chef or commis chef
should follow (i.e. scheduling). In this way, the DSE of a program can be seen as the analysis of
a recipe, and its results as the collection of rules for each available chef and commis chef in
order to bake a cake within the given constraints. With these definitions, the analyses and heuristics
that have been presented in this dissertation can easily be adapted to the cooking domain.
For example, the throughput of a system defined in terms of bits per second can be redefined as
cakes per hour, and energy minimization can be seen as money minimization.
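The kitchen analogy can be turned into a toy exploration in Python. All figures and names below are invented for illustration: three tasks, one chef and one commis chef working in parallel, hypothetical durations and hourly rates, and an exhaustive enumeration of the partitioning only (dependencies between tasks, buffer sizes and scheduling are ignored).

```python
from itertools import product

# task -> (minutes if done by the chef, minutes if done by the commis chef);
# None means the commis chef cannot perform the task (hypothetical figures)
TASKS = {
    "melt chocolate": (10, 15),
    "mix batter": (10, 20),
    "decorate": (15, None),
}
HOURLY_RATE = {"chef": 60.0, "commis": 30.0}


def evaluate(assignment):
    """Return (total minutes, total cost) of a partitioning, or None if infeasible."""
    busy = {"chef": 0, "commis": 0}
    for task, worker in assignment.items():
        minutes = TASKS[task][0 if worker == "chef" else 1]
        if minutes is None:
            return None
        busy[worker] += minutes
    # The two workers operate in parallel, so the slowest one sets the time
    time = max(busy.values())
    cost = sum(HOURLY_RATE[w] * busy[w] / 60 for w in busy)
    return time, cost


def explore(max_minutes, max_cost):
    """Enumerate every partitioning and keep those meeting both constraints."""
    feasible = []
    for workers in product(("chef", "commis"), repeat=len(TASKS)):
        assignment = dict(zip(TASKS, workers))
        result = evaluate(assignment)
        if result is None:
            continue
        time, cost = result
        if time <= max_minutes and cost <= max_cost:
            feasible.append((cost, time, assignment))
    return sorted(feasible, key=lambda point: (point[0], point[1]))


design_points = explore(max_minutes=30, max_cost=40.0)
cheapest = design_points[0]
```

Each element of `design_points` is one design alternative that satisfies the time and money constraints; the first one is the cheapest.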
Remark. An interesting similar example, relating the critical path analysis illustrated in this
dissertation to the cooking process of a recipe, can be found in an episode of the Numb3rs TV
series [222].
Bibliography
[1] S. Casale-Brunet, A. Elguindy, E. Bezati, R. Thavot, G. Roquier, M. Mattavelli, and J. W.
Janneck, “Methods to Explore Design Space for MPEG RMC Codec Specifications,”
Image Commun., vol. 28, pp. 1278–1294, Nov. 2013.
[2] S. Casale-Brunet, M. Mattavelli, and J. Janneck, “Profiling of Dataflow Programs Using
Post Mortem Causation Traces,” in Signal Processing Systems (SiPS), 2012 IEEE Workshop
on, pp. 220–225, Oct. 2012.
[3] S. Casale-Brunet, M. Mattavelli, C. Alberti, and J. Janneck, “Systems design space explo-
ration by serial dataflow program execution,” in Signals, Systems and Computers, 2013
Asilomar Conference on, pp. 1805–1809, Nov. 2013.
[4] S. Casale-Brunet, M. Mattavelli, C. Alberti, and J. Janneck, “Representing Guard
Dependencies in Dataflow Execution Traces,” in Computational Intelligence, Communication
Systems and Networks (CICSyN), 2013 Fifth International Conference on,
pp. 291–295, 2013.
[5] S. Casale-Brunet, M. Mattavelli, C. Alberti, and J. Janneck, “Design Space Exploration of
High-Level Stream Programs on Parallel Architectures,” in 8th International
Symposium on Image and Signal Processing and Analysis (ISPA 2013), Trieste, Italy,
pp. 738–743, Sep. 2013.
[6] A. Ab-Rahman, R. Thavot, S. Casale-Brunet, E. Bezati, and M. Mattavelli, “Design space
exploration strategies for FPGA implementation of signal processing systems using CAL
dataflow program,” in Design and Architectures for Signal and Image Processing (DASIP),
2012 Conference on, pp. 1–8, Oct. 2012.
[7] A. Ab-Rahman, S. Casale-Brunet, C. Alberti, and M. Mattavelli, “Dataflow program
analysis and refactoring techniques for design space exploration: MPEG-4 AVC/H.264
decoder implementation case study,” in Design and Architectures for Signal and Image
Processing (DASIP), 2013 Conference on, pp. 63–70, Oct. 2013.
[8] E. Bezati, S. Casale-Brunet, M. Mattavelli, and J. Janneck, “Synthesis and optimization
of high-level stream programs,” in Electronic System Level Synthesis Conference (ESLsyn),
2013, pp. 1–6, May 2013.
[9] D. de Saint-Jorre, C. Alberti, M. Mattavelli, and S. Casale-Brunet, “Exploring MPEG
HEVC decoder parallelism for the efficient porting onto many-core platforms,” in Image
Processing (ICIP), 2014 IEEE International Conference on, pp. 2115–2119, Oct. 2014.
[10] S. Casale-Brunet, M. Mattavelli, and J. Janneck, “Buffer optimization based on critical
path analysis of a dataflow program design,” in Circuits and Systems (ISCAS), 2013 IEEE
International Symposium on, pp. 1384–1387, May 2013.
[11] S. Casale-Brunet, E. Bezati, M. Mattavelli, M. Canale, and J. Janneck, “Execution trace
graph analysis of dataflow programs: bounded buffer scheduling and deadlock recovery
using model predictive control,” in Proceedings of Conference on Design and Architec-
tures for Signal and Image Processing (DASIP), 2014.
[12] M. Canale, S. Casale-Brunet, E. Bezati, M. Mattavelli, and J. Janneck, “Dataflow programs
analysis and optimization using model predictive control techniques: An example of
bounded buffer scheduling,” in Signal Processing Systems (SiPS), 2014 IEEE Workshop
on, pp. 1–6, Oct. 2014.
[13] S. Casale-Brunet, E. Bezati, C. Alberti, M. Mattavelli, E. Amaldi, and J. Janneck, “Multi-
clock domain optimization for reconfigurable architectures in high-level dataflow ap-
plications,” in Signals, Systems and Computers, 2013 Asilomar Conference on, pp. 1796–
1800, Nov. 2013.
[14] S. Casale-Brunet, E. Bezati, C. Alberti, M. Mattavelli, E. Amaldi, and J. Janneck, “Par-
titioning and optimization of high level stream applications for multi clock domain
architectures,” in Signal Processing Systems (SiPS), 2013 IEEE Workshop on, pp. 177–182,
Oct. 2013.
[15] E. Bezati, S. Casale-Brunet, M. Mattavelli, and J. Janneck, “Coarse grain clock gating of
streaming applications in programmable logic implementations,” in Electronic System
Level Synthesis Conference (ESLsyn), Proceedings of the 2014, pp. 1–6, 2014.
[16] “TURNUS.” http://github.com/turnus. Accessed: May 2015.
[17] S. Casale-Brunet, M. Mattavelli, and J. Janneck, “TURNUS: A design exploration frame-
work for dataflow system design,” in Circuits and Systems (ISCAS), 2013 IEEE Interna-
tional Symposium on, pp. 654–654, May 2013.
[18] S. Casale-Brunet, E. Bezati, C. Alberti, G. Roquier, M. Mattavelli, J. Janneck, and J. Boutel-
lier, “Design space exploration and implementation of RVC-CAL applications using
the TURNUS framework,” in Design and Architectures for Signal and Image Processing
(DASIP), 2013 Conference on, pp. 341–342, Oct. 2013.
[19] S. Casale-Brunet, C. Alberti, M. Mattavelli, and J. Janneck, “TURNUS: A unified dataflow
design space exploration framework for heterogeneous parallel systems,” in Design and
Architectures for Signal and Image Processing (DASIP), 2013 Conference on, pp. 47–54,
Oct. 2013.
[20] S. Casale-Brunet, M. Wiszniewska, E. Bezati, M. Mattavelli, J. Janneck, and M. Canale,
“TURNUS: an open-source design space exploration framework for dynamic stream
programs,” in Proceedings of Conference on Design and Architectures for Signal and
Image Processing (DASIP), 2014.
[21] E. Lee and A. Sangiovanni-Vincentelli, “Comparing models of computation,” in Pro-
ceedings of the 1996 IEEE/ACM international conference on Computer-aided design,
pp. 234–241, IEEE Computer Society, 1997.
[22] W. Johnston, J. Hanna, and R. Millar, “Advances in dataflow programming languages,”
ACM Computing Surveys (CSUR), vol. 36, no. 1, pp. 1–34, 2004.
[23] G. Kahn, “The semantics of a simple language for parallel programming,” in Information
processing (J. L. Rosenfeld, ed.), (Stockholm, Sweden), pp. 471–475, North Holland,
Amsterdam, Aug. 1974.
[24] E. Lee and T. Parks, “Dataflow Process Networks,” in Proceedings of the IEEE, pp. 773–799,
1995.
[25] J. Janneck, “Actor Machines: A machine model for dataflow actors and its applica-
tions,” Technical Memo LTH Report 96, 2011 (corrections 2013-03-01), Lund University,
Computer Science Department, Mar. 2013.
[26] A. Grabowski, “Scott-continuous functions,” Journal of Formalized Mathematics, vol. 10,
1998.
[27] D. McAllester, P. Panangaden, and V. Shanbhogue, “Nonexpressibility of Fairness and
Signaling,” J. Comput. Syst. Sci., vol. 47, pp. 287–321, Oct. 1993.
[28] J. Dennis, “First Version of a Data Flow Procedure Language,” in Programming Sympo-
sium, Proceedings Colloque Sur La Programmation, (London, UK), pp. 362–376, Springer-
Verlag, 1974.
[29] E. Lee and E. Matsikoudis, “A Denotational Semantics for Dataflow with Firing,” in
Memorandum UCB/ERL M97/3, Electronics Research, 1997.
[30] C. Lucarz, G. Roquier, and M. Mattavelli, “High level design space exploration of RVC
codec specifications for multi-core heterogeneous platforms,” in Design and Architec-
tures for Signal and Image Processing (DASIP), 2010 Conference on, pp. 191–198, Oct.
2010.
[31] S. Bhattacharyya, J. Eker, J. Janneck, C. Lucarz, M. Mattavelli, and M. Raulet, “Overview
of the MPEG Reconfigurable Video Coding Framework,” Journal of Signal Processing
Systems, vol. 63, pp. 251–263, 2011.
[32] C. Lucarz, M. Mattavelli, and J. Janneck, “Optimization of portable parallel signal
processing applications by design space exploration of dataflow programs,” in Signal
Processing Systems (SiPS), 2011 IEEE Workshop on, pp. 43–48, Oct. 2011.
[33] C. Lucarz, Dataflow Programming for Systems Design Space Exploration for Multicore
Platforms. PhD thesis, EPFL - STI - EDIC, Lausanne, 2011.
[34] J. Castrillon, A. Tretter, R. Leupers, and G. Ascheid, “Communication-aware Mapping
of KPN Applications Onto Heterogeneous MPSoCs,” in Proceedings of the 49th Annual
Design Automation Conference, DAC ’12, (New York, NY, USA), pp. 1266–1271, ACM,
2012.
[35] S. Bhattacharyya, E. Deprettere, R. Leupers, and J. Takala, eds., Handbook of Signal
Processing Systems. Springer, 2013.
[36] W. Najjar, E. Lee, and G. Gao, “Advances in the dataflow computational model,” Parallel
Computing, vol. 25, no. 13, pp. 1907–1929, 1999.
[37] E. Lee, “The problem with threads,” Computer, vol. 39, no. 5, pp. 33–42, 2006.
[38] E. Lee and D. Messerschmitt, “Static Scheduling of Synchronous Data Flow Programs
for Digital Signal Processing,” IEEE Trans. Comput., vol. 36, pp. 24–35, Jan. 1987.
[39] Y. Kwok and I. Ahmad, “Static Scheduling Algorithms for Allocating Directed Task Graphs
to Multiprocessors,” ACM Comput. Surv., vol. 31, pp. 406–471, Dec. 1999.
[40] Z. Gu, M. Yuan, N. Guan, M. Lv, X. He, Q. Deng, and G. Yu, “Static Scheduling and
Software Synthesis for Dataflow Graphs with Symbolic Model-Checking,” in Proceedings
of the 28th IEEE International Real-Time Systems Symposium, RTSS ’07, (Washington,
DC, USA), pp. 353–364, IEEE Computer Society, 2007.
[41] T. Parks, J. Pino, and E. Lee, “A comparison of synchronous and cycle-static dataflow,”
in Signals, Systems and Computers, 1995. 1995 Conference Record of the Twenty-Ninth
Asilomar Conference on, vol. 1, pp. 204–210, IEEE, 1995.
[42] G. Bilsen, M. Engels, R. Lauwereins, and J. Peperstraete, “Cycle-static dataflow,” Signal
Processing, IEEE Transactions on, vol. 44, no. 2, pp. 397–408, 1996.
[43] S. Bhattacharyya, E. Deprettere, and B. Theelen, “Dynamic dataflow graphs,” in Hand-
book of Signal Processing Systems, pp. 905–944, Springer, 2013.
[44] M. Geilen and T. Basten, “Kahn process networks and a reactive extension,” in Handbook
of Signal Processing Systems, pp. 1041–1081, Springer, 2013.
[45] J. Ersfolk, G. Roquier, J. Lilius, and M. Mattavelli, “Scheduling of Dynamic Dataflow
Programs Based on State Space Analysis,” in Proceedings of the IEEE International
Conference on Acoustics, Speech and Signal Processing, pp. 1661–1664, IEEE, 2012.
[46] H. Yviquel, J. Boutellier, M. Raulet, and E. Casseau, “Automated design of networks of
transport-triggered architecture processors using dynamic dataflow programs,” Signal
Processing: Image Communication, vol. 28, no. 10, pp. 1295–1302, 2013.
[47] J. Ersfolk, “Scheduling Dynamic Dataflow Graphs with Model Checking,” 2014. PhD
Thesis, TUCS Dissertations.
[48] H. Yviquel, A. Sanchez, P. Jaaskelainen, J. Takala, M. Raulet, and E. Casseau, “Embed-
ded Multi-Core Systems Dedicated to Dynamic Dataflow Programs,” Journal of Signal
Processing Systems, vol. 80, no. 1, pp. 121–136, 2015.
[49] L. Torczon and K. Cooper, Engineering A Compiler. San Francisco, CA, USA: Morgan
Kaufmann Publishers Inc., 2nd ed., 2011.
[50] F. Allen and J. Cocke, “A Program Data Flow Analysis Procedure,” Commun. ACM, vol. 19,
pp. 137–147, Mar. 1976.
[51] J. Eker and J. Janneck, “CAL Language Report: Specification of the CAL Actor Language,”
Technical Memo UCB/ERL M03/48, Electronics Research Laboratory, University of
California at Berkeley, Dec. 2003.
[52] ISO/IEC 23001-4:2011, “Information technology - MPEG systems technologies - Part 4: Codec
configuration representation,” 2011.
[53] M. Mattavelli, J. Janneck, and M. Raulet, “MPEG Reconfigurable Video Coding,” in
Handbook of Signal Processing Systems (S. Bhattacharyya, E. Deprettere, R. Leupers, and
J. Takala, eds.), pp. 43–67, Springer US, 2010.
[54] M. Mattavelli, “MPEG reconfigurable video representation,” in The MPEG Representation
of Digital Media (L. Chiariglione, ed.), pp. 231–247, Springer New York, 2012.
[55] E. Jang, M. Mattavelli, M. Preda, M. Raulet, and H. Sun, “Reconfigurable Media Coding:
An overview,” Signal Processing: Image Communication, vol. 28, no. 10, pp. 1215–1223,
2013.
[56] “The Open RVC-CAL Compiler, Orcc.” http://github.com/orcc. Accessed: May 2015.
[57] M. Wipliez, Compilation infrastructure for dataflow programs. Theses, INSA de Rennes,
Dec. 2010.
[58] H. Yviquel, A. Lorence, K. Jerbi, G. Cocherel, A. Sanchez, and M. Raulet, “Orcc: Multime-
dia Development Made Easy,” in Proceedings of the 21st ACM International Conference
on Multimedia, MM ’13, pp. 863–866, ACM, 2013.
[59] “Eclipse IDEs.” http://eclipse.org/ide. Accessed: May 2015.
[60] “Eclipse modeling framework.” http://eclipse.org/modeling/emf. Accessed: May 2015.
[61] D. Steinberg, F. Budinsky, M. Paternostro, and E. Merks, EMF: Eclipse Modeling Frame-
work 2.0. Addison-Wesley Professional, 2nd ed., 2009.
[62] “Xtext: Language development made easy!.” http://eclipse.org/Xtext. Accessed: May
2015.
[63] “Xtend: Modernized java.” http://eclipse.org/xtend. Accessed: May 2015.
[64] “Xronos.” http://github.com/orcc/xronos. Accessed: May 2015.
[65] E. Bezati, High-level synthesis of dataflow programs for heterogeneous platforms: design
flow tools and design space exploration. PhD thesis, EPFL - STI - EDMI, Lausanne, 2015.
[66] J. Janneck, I. Miller, D. Parlour, G. Roquier, M. Wipliez, and M. Raulet, “Synthesizing
Hardware from Dataflow Programs: An MPEG-4 Simple Profile Decoder Case Study,”
Journal of Signal Processing Systems, vol. 63, no. 2, pp. 241–249, 2009.
[67] E. Bezati, H. Yviquel, M. Raulet, and M. Mattavelli, “A unified hardware/software co-
synthesis solution for signal processing systems,” in Design and Architectures for Signal
and Image Processing (DASIP), 2011 Conference on, pp. 1–6, Nov. 2011.
[68] M. Ravasi and M. Mattavelli, “High-abstraction level complexity analysis and memory
architecture simulations of multimedia algorithms,” Circuits and Systems for Video
Technology, IEEE Transactions on, vol. 15, pp. 673–684, May 2005.
[69] A. Abran, Software Metrics and Software Metrology. Wiley-IEEE Computer Society Pr,
2010.
[70] C. Zebelein, J. Falk, C. Haubelt, and J. Teich, “Classification of General Data Flow Actors
into Known Models of Computation,” in Formal Methods and Models for Co-Design,
2008. MEMOCODE 2008. 6th ACM/IEEE International Conference on, pp. 119–128, Jun.
2008.
[71] M. Wipliez and M. Raulet, “Classification and transformation of dynamic dataflow
programs,” in Design and Architectures for Signal and Image Processing (DASIP), 2010
Conference on, pp. 303–310, Oct. 2010.
[72] M. Wipliez and M. Raulet, “Classification of Dataflow Actors with Satisfiability and
Abstract Interpretation,” IJERTCS, vol. 3, no. 1, pp. 49–69, 2012.
[73] I. Chukhman, W. Plishker, and S. Bhattacharyya, “Instrumentation-driven model detec-
tion for dataflow graphs,” in System on Chip (SoC), 2012 International Symposium on,
pp. 1–8, Oct. 2012.
[74] Y. Li and S. Malik, “Performance Analysis of Embedded Software Using Implicit Path
Enumeration,” SIGPLAN Not., vol. 30, pp. 88–98, Nov. 1995.
[75] P. Puschner and C. Koza, “Calculating the Maximum Execution Time of Real-time Pro-
grams,” Real-Time Syst., vol. 1, pp. 159–176, Sep. 1989.
[76] S. Conte, H. Dunsmore, and Y. Shen, Software Engineering Metrics and Models. Redwood
City, CA, USA: Benjamin-Cummings Publishing Co., Inc., 1986.
[77] V. Shen, S. Conte, and H. Dunsmore, “Software Science Revisited: A Critical Analysis
of the Theory and Its Empirical Support,” Software Engineering, IEEE Transactions on,
vol. SE-9, pp. 155–165, Mar. 1983.
[78] T. McCabe, “A Complexity Measure,” Software Engineering, IEEE Transactions on, vol. SE-
2, pp. 308–320, Dec. 1976.
[79] R. Prather, “Theory of program testing: An overview,” Bell System Technical Journal,
vol. 62, pp. 3073–3105, Dec. 1983.
[80] M. Halstead, Elements of Software Science (Operating and programming systems series).
New York, NY, USA: Elsevier Science Inc., 1977.
[81] P. Hamer and G. Frewin, “M.H. Halstead’s Software Science - a Critical Examination,” in
Proceedings of the 6th International Conference on Software Engineering, ICSE ’82, (Los
Alamitos, CA, USA), pp. 197–206, IEEE Computer Society Press, 1982.
[82] C. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. Reddi, and
K. Hazelwood, “Pin: Building Customized Program Analysis Tools with Dynamic Instru-
mentation,” SIGPLAN Not., vol. 40, pp. 190–200, Jun. 2005.
[83] L. Gao, J. Huang, J. Ceng, R. Leupers, G. Ascheid, and H. Meyr, “TotalProf: a fast and
accurate retargetable source code profiler,” in Proceedings of the 7th IEEE/ACM inter-
national conference on Hardware/software codesign and system synthesis, pp. 305–314,
ACM, 2009.
[84] J. Eusse, C. Williams, and R. Leupers, “CoEx: A novel profiling-based algorithm/archi-
tecture co-exploration for ASIP design,” in Reconfigurable and Communication-Centric
Systems-on-Chip (ReCoSoC), 2013 8th International Workshop on, pp. 1–8, IEEE, 2013.
[85] J. Castrillon and R. Leupers, “Parallel Code Flow,” in Programming Heterogeneous MP-
SoCs, pp. 123–164, Springer International Publishing, 2014.
[86] I. Chukhman and S. Bhattacharyya, “Instrumentation-driven framework for validation
of dataflow applications,” in Signal Processing Systems (SiPS), 2014 IEEE Workshop on,
pp. 1–6, Oct. 2014.
[87] G. De-Micheli, Synthesis and Optimization of Digital Circuits. McGraw-Hill Higher
Education, 1st ed., 1994.
[88] G. De-Micheli and R. Gupta, “Hardware/Software Co-Design,” Proceedings of the IEEE, vol. 85,
pp. 349–365, 1997.
[89] S. Edwards, L. Lavagno, E. Lee, and A. Sangiovanni-Vincentelli, “Design of Embedded
Systems: Formal Models, Validation, and Synthesis,” Proceedings of the IEEE, vol. 85,
pp. 366–390, 1997.
[90] A. Kienhuis, Design Space Exploration of Stream-based Dataflow Architectures: Methods
and Tools. PhD thesis, Delft University of Technology, The Netherlands, Jan. 1999.
[91] G. De-Micheli, R. Ernst, and W. Wolf, eds., Readings in Hardware/Software Co-design.
Norwell, MA, USA: Kluwer Academic Publishers, 2002.
[92] A. Nandi and R. Marculescu, “System-level Power/Performance Analysis for Embedded
Systems Design,” in Proceedings of the 38th Annual Design Automation Conference,
DAC’01, (New York, NY, USA), pp. 599–604, ACM, 2001.
[93] J. Ceng, J. Castrillon, W. Sheng, H. Scharwachter, R. Leupers, G. Ascheid, H. Meyr, T. Is-
shiki, and H. Kunieda, “MAPS: an integrated framework for MPSoC application paral-
lelization,” in Proceedings of the 45th annual Design Automation Conference, pp. 754–759,
ACM, 2008.
[94] J. Castrillon, R. Velasquez, A. Stulova, W. Sheng, J. Ceng, R. Leupers, G. Ascheid, and
H. Meyr, “Trace-based KPN composability analysis for mapping simultaneous applica-
tions to MPSoC platforms,” in Design, Automation Test in Europe Conference Exhibition
(DATE), 2010, pp. 753–758, Mar. 2010.
[95] R. Leupers and J. Castrillon, “MPSoC programming using the MAPS compiler,” in Design
Automation Conference (ASP-DAC), 2010 15th Asia and South Pacific, pp. 897–902, Jan.
2010.
[96] J. Castrillon, W. Sheng, and R. Leupers, “Trends in embedded software synthesis,” in
Embedded Computer Systems (SAMOS), 2011 International Conference on, pp. 347–354,
IEEE, 2011.
[97] J. Castrillon, R. Leupers, and G. Ascheid, “Maps: Mapping concurrent dataflow appli-
cations to heterogeneous MPSoC,” Industrial Informatics, IEEE Transactions on, vol. 9,
no. 1, pp. 527–545, 2013.
[98] K. Keutzer, A. Newton, J. Rabaey, and A. Sangiovanni-Vincentelli, “System-level Design:
Orthogonalization of Concerns and Platform-based Design,” Trans. Comp.-Aided Des.
Integ. Cir. Sys., vol. 19, pp. 1523–1543, Dec. 2000.
[99] E. Lee, “Overview of The Ptolemy Project,” Technical Memo UCB/ERL M98/71, Elec-
tronics Research Laboratory, University of California at Berkeley, Nov. 1998.
[100] M. Gries, “Methods for Evaluating and Covering the Design Space During Early Design
Development,” Integr. VLSI J., vol. 38, pp. 131–183, Dec. 2004.
[101] M. Pelcat, J. Nezan, J. Piat, J. Croizer, and S. Aridhi, “A System-Level Architecture Model
for Rapid Prototyping of Heterogeneous Multicore Embedded Systems,” in Conference
on Design and Architectures for Signal and Image Processing (DASIP) 2009, (Nice, France),
8 pages, Sep. 2009.
[102] E. Bezati, R. Thavot, G. Roquier, and M. Mattavelli, “High-level dataflow design of signal
processing systems for reconfigurable and multicore heterogeneous platforms,” Journal
of Real-Time Image Processing, vol. 9, no. 1, pp. 251–262, 2014.
[103] M. Sgroi, L. Lavagno, and A. Sangiovanni-Vincentelli, “Formal Models for Embedded
System Design,” IEEE Design & Test of Computers, vol. 17, no. 2, pp. 14–27, 2000.
[104] R. Ernst, “Codesign of embedded systems: Status and trends,” Design & Test of Comput-
ers, IEEE, vol. 15, no. 2, pp. 45–54, 1998.
[105] K. Miettinen, Nonlinear multiobjective optimization. Kluwer Academic Publishers,
Boston, 1999.
[106] S. Kunzli, Efficient Design Space Exploration for Embedded Systems. PhD thesis, ETH
Zurich, Apr. 2006.
[107] T. Grotker, S. Liao, G. Martin, and S. Swan, System Design with SystemC. Norwell, MA,
USA: Kluwer Academic Publishers, 2002.
[108] G. Palermo, C. Silvano, and V. Zaccaria, “Multi-objective Design Space Exploration of
Embedded Systems,” J. Embedded Comput., vol. 1, pp. 305–316, Aug. 2005.
[109] C. Chantrapornchai, E. Sha, and X. Hu, “Efficient design exploration based on mod-
ule utility selection,” Computer-Aided Design of Integrated Circuits and Systems, IEEE
Transactions on, vol. 19, pp. 19–29, Jan. 2000.
[110] A. Ghosh and T. Givargis, “Analytical design space exploration of caches for embedded
systems,” in Design Automation and Test in Europe (DATE), 2003.
[111] K. Lahiri, A. Raghunathan, and S. Dey, “Design Space Exploration for Optimizing On-
Chip Communication Architectures,” in IEEE transactions on Computer-Aided Design of
Integrated Circuits and Systems, pp. 952–961, 2004.
[112] S. Rajagopal, J. Cavallaro, and S. Rixner, “Design Space Exploration for Real-Time Em-
bedded Stream Processors,” IEEE Micro, vol. 24, pp. 54–66, Jul. 2004.
[113] P. Czyżak and A. Jaszkiewicz, “Pareto simulated annealing—a metaheuristic technique
for multiple-objective combinatorial optimization,” Journal of Multi-Criteria Decision
Analysis, vol. 7, no. 1, pp. 34–47, 1998.
[114] G. Agosta, G. Palermo, and C. Silvano, “Multi-objective Co-exploration of Source Code
Transformations and Design Space Architectures for Low-power Embedded Systems,”
in Proceedings of the 2004 ACM Symposium on Applied Computing, SAC ’04, (New York,
NY, USA), pp. 891–896, ACM, 2004.
[115] T. Blickle, J. Teich, and L. Thiele, “System-Level Synthesis Using Evolutionary Algorithms,”
Design Automation for Embedded Systems, vol. 3, pp. 23–58, 1998.
[116] M. Eisenring, L. Thiele, and E. Zitzler, “Conflicting Criteria in Embedded System Design,”
IEEE Design & Test Of Computers, vol. 17, pp. 51–59, 2000.
[117] D. Bruni, A. Bogliolo, and L. Benini, “Statistical design space exploration for application-
specific unit synthesis,” in Design Automation Conference, 2001. Proceedings, pp. 641–
646, 2001.
[118] N. Bambha, S. Bhattacharyya, J. Teich, and E. Zitzler, “Hybrid global/local search strate-
gies for dynamic voltage scaling in embedded multiprocessors,” in Hardware/Software
Codesign, 2001. CODES 2001. Proceedings of the Ninth International Symposium on,
pp. 243–248, 2001.
[119] P. Bose and T. Conte, “Performance analysis and its impact on design,” Computer, vol. 31,
pp. 41–49, May 1998.
[120] S. Pllana, I. Brandic, and S. Benkner, “Performance Modeling and Prediction of Parallel
and Distributed Computing Systems: A Survey of the State of the Art,” in Complex, Intel-
ligent and Software Intensive Systems, 2007. CISIS 2007. First International Conference
on, pp. 279–284, Apr. 2007.
[121] M. Obaidat and G. Papadimitriou, Applied System Simulation: Methodologies and Appli-
cations. Springer Publishing Company, Incorporated, 2013.
[122] “CAL design suite.” http://sourceforge.net/projects/caldesignsuite/. Accessed: May
2015.
[123] “COMPA Project.” http://www.compa-project.org. Accessed: May 2015.
[124] “Daedalus: System-Level Design For Multi-Processor System-on-Chip.” http://daedalus.
liacs.nl. Accessed: May 2015.
[125] M. Thompson, H. Nikolov, T. Stefanov, A. Pimentel, C. Erbas, S. Polstra, and E. Depret-
tere, “A Framework for Rapid System-level Exploration, Synthesis, and Programming
of Multimedia MP-SoCs,” in Proceedings of the 5th IEEE/ACM International Conference
on Hardware/Software Codesign and System Synthesis, CODES+ISSS ’07, (New York, NY,
USA), pp. 9–14, ACM, 2007.
[126] H. Nikolov, T. Stefanov, and E. Deprettere, “Systematic and Automated Multiproces-
sor System Design, Programming, and Implementation,” Computer-Aided Design of
Integrated Circuits and Systems, IEEE Transactions on, vol. 27, pp. 542–555, Mar. 2008.
[127] S. Verdoolaege, H. Nikolov, and T. Stefanov, “Pn: A Tool for Improved Derivation of
Process Networks,” EURASIP J. Embedded Syst., vol. 2007, pp. 19–19, Jan. 2007.
[128] J. Ceng, W. Sheng, J. Castrillon, A. Stulova, R. Leupers, G. Ascheid, and H. Meyr, “A
high-level virtual platform for early MPSoC software development,” in CODES+ISSS
’09: Proceedings of the 7th IEEE/ACM international conference on Hardware/software
codesign and system synthesis, (New York, NY, USA), pp. 11–20, ACM, 2009.
[129] A. Mihal, C. Kulkarni, M. Moskewicz, M. Tsai, N. Shah, S. Weber, Y. Jin, K. Keutzer,
C. Sauer, K. Vissers, and S. Malik, “Developing Architectural Platforms: A Disciplined
Approach,” IEEE Des. Test, vol. 19, pp. 6–16, Nov. 2002.
[130] M. Gries and K. Keutzer, Building ASIPs: The Mescal Methodology. Springer Publishing
Company, Incorporated, 1st ed., 2010.
[131] F. Balarin, Y. Watanabe, H. Hsieh, L. Lavagno, C. Passerone, and A. Sangiovanni-
Vincentelli, “Metropolis: an integrated electronic system design environment,” Com-
puter, vol. 36, pp. 45–52, Apr. 2003.
[132] S. Ha, S. Kim, C. Lee, Y. Yi, S. Kwon, and Y. Joo, “PeaCE: A Hardware-software Codesign
Environment for Multimedia Embedded Systems,” ACM Trans. Des. Autom. Electron.
Syst., vol. 12, pp. 1–25, May 2008.
[133] “Ptolemy project: heterogeneous modeling and design.” http://ptolemy.eecs.berkeley.
edu. Accessed: May 2015.
[134] “PREESM: the parallel and real-time embedded executives scheduling method.” http:
//sourceforge.net/projects/preesm/. Accessed: May 2015.
[135] M. Pelcat, K. Desnos, J. Heulot, C. Guy, J. Nezan, and S. Aridhi, “Preesm: A dataflow-
based rapid prototyping framework for simplifying multicore DSP programming,” in
Education and Research Conference (EDERC), 2014 6th European Embedded Design in,
pp. 36–40, IEEE, 2014.
[136] T. Grandpierre and Y. Sorel, “From Algorithm and Architecture Specifications to Au-
tomatic Generation of Distributed Real-Time Executives: A Seamless Flow of Graphs
Transformations,” in Proceedings of the First ACM and IEEE International Conference on
Formal Methods and Models for Co-Design, MEMOCODE ’03, (Washington, DC, USA),
pp. 123–133, IEEE Computer Society, 2003.
[137] K. Desnos, M. Pelcat, J. Nezan, S. Bhattacharyya, and S. Aridhi, “PiMM: Parameterized
and Interfaced dataflow Meta-Model for MPSoCs runtime reconfiguration,” in Embed-
ded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIII), 2013
International Conference on, pp. 41–48, Jul. 2013.
[138] S. Stuijk, M. Geilen, and T. Basten, “SDF3: SDF for free,” in Application of Concurrency
to System Design, 2006. ACSD 2006. Sixth International Conference on, pp. 276–278, Jun.
2006.
[139] A. Pimentel, C. Erbas, and S. Polstra, “A systematic approach to exploring embedded
system architectures at multiple abstraction levels,” Computers, IEEE Transactions on,
vol. 55, pp. 99–112, Feb. 2006.
[140] “Space Codesign Systems.” http://www.spacecodesign.com. Accessed: May
2015.
[141] J. Chevalier, M. de Nanclas, L. Filion, O. Benny, M. Rondonneau, G. Bois, and E. Aboul-
hamid, “A SystemC refinement methodology for embedded software,” Design Test of
Computers, IEEE, vol. 23, pp. 148–158, Mar. 2006.
[142] B. Gedik, H. Andrade, K. Wu, P. Yu, and M. Doo, “SPADE: The System S Declarative
Stream Processing Engine,” in Proceedings of the 2008 ACM SIGMOD International
Conference on Management of Data, SIGMOD ’08, (New York, NY, USA), pp. 1123–1134,
ACM, 2008.
[143] W. De Pauw, M. Leția, B. Gedik, H. Andrade, A. Frenkiel, M. Pfeifer, and D. Sow, “Visual
Debugging for Stream Processing Applications,” in Runtime Verification (H. Barringer,
Y. Falcone, B. Finkbeiner, K. Havelund, I. Lee, G. Pace, G. Rosu, O. Sokolsky, and N. Till-
mann, eds.), vol. 6418 of Lecture Notes in Computer Science, pp. 18–35, Springer Berlin
Heidelberg, 2010.
[144] D. Turaga, H. Andrade, B. Gedik, C. Venkatramani, O. Verscheure, J. Harris, J. Cox,
W. Szewczyk, and P. Jones, “Design Principles for Developing Stream Processing Applica-
tions,” Softw. Pract. Exper., vol. 40, pp. 1073–1104, Nov. 2010.
[145] “SynDEx.” http://www.syndex.org. Accessed: May 2015.
[146] “SystemCoDesigner.” http://www.mycodesign.com/research/scd. Accessed: May 2015.
[147] C. Haubelt, M. Meredith, T. Schlichter, and J. Keinert, “SystemCoDesigner: Automatic De-
sign Space Exploration and Rapid Prototyping from Behavioral Models,” in Proceedings
of the 45th Design Automation Conference (DAC ’08), (Anaheim, CA, USA), pp. 580–585,
Jun. 2008.
[148] J. Keinert, M. Streubuhr, T. Schlichter, J. Falk, J. Gladigau, C. Haubelt, J. Teich, and
M. Meredith, “SystemCoDesigner: an Automatic ESL Synthesis Approach by Design
Space Exploration and Behavioral Synthesis for Streaming Applications,” ACM Trans.
Des. Autom. Electron. Syst., vol. 14, pp. 1–23, Jan. 2009.
[149] “Forte Synthesizer.” http://www.cadence.com/products/sd/cynthesizer/. Accessed:
May 2015.
[150] A. Mazurkiewicz, “Trace theory,” in Petri Nets: Applications and Relationships to Other
Models of Concurrency (W. Brauer, W. Reisig, and G. Rozenberg, eds.), vol. 255 of Lecture
Notes in Computer Science, pp. 278–324, Springer Berlin Heidelberg, 1987.
[151] T. Kahl, “Relative directed homotopy theory of partially ordered spaces,” Journal of
Homotopy and Related Structures, vol. 1, no. 1, pp. 79–100, 2006.
[152] L. Fajstrup, E. Goubault, and M. Raussen, “Algebraic Topology And Concurrency,” tech.
rep., Theoretical Computer Science, 1998.
[153] L. Fajstrup, E. Goubault, E. Haucourt, S. Mimram, and M. Raussen, “Trace Spaces: An
Efficient New Technique for State-Space Reduction,” in Programming Languages and
Systems (H. Seidl, ed.), vol. 7211 of Lecture Notes in Computer Science, pp. 274–294,
Springer Berlin Heidelberg, 2012.
[154] J. Janneck, I. Miller, and D. Parlour, “Profiling dataflow programs,” in Multimedia and
Expo, 2008 IEEE International Conference on, pp. 1065–1068, Jun. 2008.
[155] J. Gross and J. Yellen, Graph Theory and Its Applications, Second Edition (Discrete Math-
ematics and Its Applications). Chapman & Hall/CRC, 2005.
[156] J. Peterson, “Petri Nets,” ACM Comput. Surv., vol. 9, pp. 223–252, Sep. 1977.
[157] T. Murata, “Petri nets: Properties, analysis and applications,” Proceedings of the IEEE,
vol. 77, pp. 541–580, Apr. 1989.
[158] J. Rocha, L. Gomes, and O. Dias, “Dataflow model property verification using petri net
translation techniques,” in Industrial Informatics (INDIN), 2011 9th IEEE International
Conference on, pp. 783–788, Jul. 2011.
[159] R. David and H. Alla, “Petri nets for modeling of dynamic systems: A survey,” Automatica,
vol. 30, no. 2, pp. 175–202, 1994.
[160] K. Jensen and L. Kristensen, Coloured Petri Nets. Springer Berlin Heidelberg, 2009.
[161] K. Ogata, Modern Control Engineering. Upper Saddle River, NJ, USA: Prentice Hall PTR,
4th ed., 2001.
[162] D. Spinellis, “Git,” Software, IEEE, vol. 29, pp. 100–101, May 2012.
[163] “Xilinx Zynq-7000 All Programmable SoC ZC702 Evaluation Kit.” http://www.xilinx.com/
products/boards-and-kits/ek-z7-zc702-g.html. Accessed: May 2015.
[164] M. Arslan, J. Janneck, and K. Kuchcinski, “Partitioning and mapping dynamic dataflow
programs,” in Signals, Systems and Computers (ASILOMAR), 2012 Conference Record of
the Forty Sixth Asilomar Conference on, pp. 1452–1456, Nov. 2012.
[165] J. Ahmad, S. Li, R. Thavot, and M. Mattavelli, “Secure computing with the MPEG RVC
framework,” Signal Processing: Image Communication, vol. 28, no. 10, pp. 1315–1334,
2013.
[166] E. Jang, M. Mattavelli, M. Preda, M. Raulet, and H. Sun, “Reconfigurable Media Coding:
An overview,” Signal Processing: Image Communication, vol. 28, no. 10, pp. 1215–1223,
2013.
[167] U. Mirza, F. Gruian, and K. Kuchcinski, “Design Space Exploration for Streaming Appli-
cations on Multiprocessors with Guaranteed Service NoC,” in Proceedings of the Sixth
International Workshop on Network on Chip Architectures, NoCArc ’13, (New York, NY,
USA), pp. 35–40, ACM, 2013.
[168] F. Palumbo, N. Carta, D. Pani, P. Meloni, and L. Raffo, “The multi-dataflow composer
tool: generation of on-the-fly reconfigurable platforms,” Journal of Real-Time Image
Processing, vol. 9, no. 1, pp. 233–249, 2014.
[169] C. Sau, L. Raffo, F. Palumbo, E. Bezati, S. Casale-Brunet, and M. Mattavelli, “Automated
design flow for coarse-grained reconfigurable platforms: An RVC-CAL multi-standard
decoder use-case,” in Embedded Computer Systems: Architectures, Modeling, and Simu-
lation (SAMOS XIV), 2014 International Conference on, pp. 59–66, Jul. 2014.
[170] J. Janneck, S. Casale-Brunet, and M. Mattavelli, “Characterizing communication be-
havior of dataflow programs using trace analysis,” in Embedded Computer Systems:
Architectures, Modeling, and Simulation (SAMOS XIV), 2014 International Conference
on, pp. 44–50, Jul. 2014.
[171] D. Bhowmik, A. Wallace, R. Stewart, X. Qian, and G. Michaelson, “Profile Driven Dataflow
Optimisation of Mean Shift Visual Tracking,” in IEEE Global Conference on Signal and
Information Processing (GlobalSIP), 2014 Conference on, Dec. 2014.
[172] “TURNUS Orcc RVC-CAL Profiler.” http://github.com/turnus/profiler-orcc. Accessed:
May 2015.
[173] “PAPI: Performance Application Programming Interface.” http://icl.cs.utk.edu/papi.
Accessed: May 2015.
[174] P. Mucci, S. Browne, C. Deane, and G. Ho, “PAPI: A Portable Interface to Hardware
Performance Counters,” in Proceedings of the Department of Defense HPCMP Users
Group Conference, pp. 7–10, 1999.
[175] S. Browne, J. Dongarra, N. Garner, G. Ho, and P. Mucci, “A Portable Programming
Interface for Performance Evaluation on Modern Processors,” Int. J. High Perform.
Comput. Appl., vol. 14, pp. 189–204, Aug. 2000.
[176] “Caltoopia.” http://www.caltoopia.org. Accessed: May 2015.
[177] “Pin, A Dynamic Binary Instrumentation Tool.” http://software.intel.com/en-us/
articles/pintool. Accessed: May 2015.
[178] “GCC, the GNU Compiler Collection.” http://gcc.gnu.org. Accessed: May 2015.
[179] “Intel C++ Compiler.” https://software.intel.com/en-us/c-compilers. Accessed: May
2015.
[180] B. Zeigler, T. Kim, and H. Praehofer, Theory of Modeling and Simulation. Orlando, FL,
USA: Academic Press, Inc., 2nd ed., 2000.
[181] J. Nutaro, Building Software for Simulation: Theory and Algorithms, with Applications
in C++. Wiley Publishing, 2010.
[182] E. Coffman, Computer and Job Shop Scheduling Theory. New York: John Wiley & Sons
Inc, 1976.
[183] K. Ravindran, Task Allocation and Scheduling of Concurrent Applications to Multipro-
cessor Systems. PhD thesis, EECS Department, University of California, Berkeley, Dec.
2007.
[184] C. Yang and B. Miller, “Critical path analysis for the execution of parallel and distributed
programs,” in Distributed Computing Systems, 1988., 8th International Conference on,
pp. 366–373, Jun. 1988.
[185] C. Alexander, D. Reese, and J. Harden, “Near-Critical Path Analysis of Program Activity
Graphs,” in Proceedings of the Second International Workshop on Modeling, Analysis, and
Simulation On Computer and Telecommunication Systems, MASCOTS ’94, (Washington,
DC, USA), pp. 308–317, IEEE Computer Society, 1994.
[186] D. West, Introduction to Graph Theory. Prentice Hall, 2nd ed., Sep. 2000.
[187] W. Press, S. Teukolsky, W. Vetterling, and B. Flannery, Numerical Recipes 3rd Edition: The
Art of Scientific Computing. New York, NY, USA: Cambridge University Press, 3rd ed., 2007.
[188] R. Walpole, R. Myers, S. Myers, and K. Ye, Probability & statistics for engineers and
scientists. Upper Saddle River: Pearson Education, 8th ed., 2007.
[189] G. Amdahl, “Validity of the Single Processor Approach to Achieving Large Scale Com-
puting Capabilities,” in Proceedings of the April 18-20, 1967, Spring Joint Computer
Conference, AFIPS ’67 (Spring), (New York, NY, USA), pp. 483–485, ACM, 1967.
[190] J. Gustafson, “Reevaluating Amdahl’s Law,” Commun. ACM, vol. 31, pp. 532–533, May
1988.
[191] S. Krishnaprasad, “Uses and abuses of Amdahl’s law,” Journal of Computing Sciences in
Colleges, vol. 17, no. 2, pp. 288–293, 2001.
[192] S. Bhattacharyya, E. Lee, and P. Murthy, Software Synthesis from Dataflow Graphs. Norwell,
MA, USA: Kluwer Academic Publishers, 1996.
[193] P. Murthy and S. Bhattacharyya, Memory Management for Synthesis of DSP Software.
CRC Press, 2006.
[194] S. Stuijk, M. Geilen, and T. Basten, “Exploring Trade-offs in Buffer Requirements and
Throughput Constraints for Synchronous Dataflow Graphs,” in Proceedings of the 43rd
Annual Design Automation Conference, DAC ’06, (New York, NY, USA), pp. 899–904,
ACM, 2006.
[195] T. Parks, Bounded scheduling of process networks. PhD thesis, University of California at
Berkeley, Berkeley, CA, USA, 1995. UMI Order No. GAX96-21312.
[196] W. Liu, Z. Gu, J. Xu, Y. Wang, and M. Yuan, “An efficient technique for analysis of
minimal buffer requirements of synchronous dataflow graphs with model checking,” in
Proceedings of the 7th IEEE/ACM international conference on Hardware/software codesign
and system synthesis, CODES+ISSS ’09, (New York, NY, USA), pp. 61–70, ACM, 2009.
[197] M. Geilen, T. Basten, and S. Stuijk, “Minimising Buffer Requirements of Synchronous
Dataflow Graphs with Model Checking,” in Proceedings of the 42nd Annual Design
Automation Conference, DAC ’05, (New York, NY, USA), pp. 819–824, ACM, 2005.
[198] C. Garcia, D. Prett, and M. Morari, “Model predictive control: Theory and practice - A
survey,” Automatica, vol. 25, no. 3, pp. 335–348, 1989.
[199] S. Qin and T. Badgwell, “A survey of industrial model predictive control technology,”
Control Engineering Practice, vol. 11, no. 7, pp. 733–764, 2003.
[200] B. Ghavami and H. Pedram, “High performance asynchronous design flow using a novel
static performance analysis method,” Comput. Electr. Eng., vol. 35, pp. 920–941, Nov.
2009.
[201] P. Kudva, G. Gopalakrishnan, E. Brunvand, and V. Akella, “Performance analysis and
optimization of asynchronous circuits,” in Computer Design: VLSI in Computers and
Processors, 1994. ICCD ’94. Proceedings., IEEE International Conference on, pp. 221–224,
1994.
[202] S. Suhaib, D. Mathaikutty, and S. Shukla, “Dataflow architectures for GALS,” Electronic
Notes in Theoretical Computer Science, vol. 200, no. 1, pp. 33–50, 2008.
[203] A. Hemani, T. Meincke, S. Kumar, A. Postula, T. Olsson, P. Nilsson, J. Oberg, P. Ellervee,
and D. Lundqvist, “Lowering power consumption in clock by using globally asyn-
chronous locally synchronous design style,” in Design Automation Conference, 1999.
Proceedings. 36th, pp. 873–878, 1999.
[204] T. Wuu and S. Vrudhula, “Synthesis of Asynchronous Systems from Data Flow Specifica-
tion,” Research Report ISI/RR-93-366, University of Southern California, Information
Sciences Institute, Dec. 1993.
[205] E. Bezati, H. Yviquel, M. Raulet, and M. Mattavelli, “A unified hardware/software co-
synthesis solution for signal processing systems,” in Design and Architectures for Signal
and Image Processing (DASIP), 2011 Conference on, pp. 1–6, Nov. 2011.
[206] “Open RVC-CAL Applications.” http://github.com/orcc/orc-apps. Accessed: May 2015.
[207] W. Hamidouche, M. Raulet, and O. Deforges, “Real time SHVC decoder: Implementation
and complexity analysis,” in Image Processing (ICIP), 2014 IEEE International Conference
on, pp. 2125–2129, Oct. 2014.
[208] “Open HEVC decoder.” http://github.com/OpenHEVC/openHEVC. Accessed: May
2015.
[209] “International Telecommunication Union (ITU) HEVC conformance bit-stream
collection (draft).” http://wftp3.itu.int/av-arch/jctvc-site/bitstream_exchange/draft_
conformance/. Accessed: May 2015.
[210] “Gephi Toolkit.” https://github.com/gephi. Accessed: May 2015.
[211] M. Bastian, S. Heymann, and M. Jacomy, “Gephi: An Open Source Software for Exploring
and Manipulating Networks,” in International AAAI Conference on Weblogs and Social
Media, 2009.
[212] D. Melpignano, L. Benini, E. Flamand, B. Jego, T. Lepley, G. Haugou, F. Clermidy, and
D. Dutoit, “Platform 2012, a Many-core Computing Accelerator for Embedded SoCs:
Performance Evaluation of Visual Analytics Applications,” in Proceedings of the 49th
Annual Design Automation Conference, DAC ’12, (New York, NY, USA), pp. 1137–1142,
ACM, 2012.
[213] “Xilinx Power Estimator (XPE).” http://www.xilinx.com/products/design_tools/logic_
design/xpe.htm. Accessed: May 2015.
[214] “Parallella Board.” https://www.parallella.org. Accessed: May 2015.
[215] L. De-Moura and N. Bjorner, “Satisfiability modulo theories: An appetizer,” in Formal
Methods: Foundations and Applications, pp. 23–36, Springer, 2009.
[216] “Blueprints: A Property Graph Model Interface.” https://github.com/tinkerpop/
blueprints. Accessed: May 2015.
[217] M. Ciglan, A. Averbuch, and L. Hluchy, “Benchmarking traversal operations over graph
databases,” in Data Engineering Workshops (ICDEW), 2012 IEEE 28th International
Conference on, pp. 186–189, IEEE, 2012.
[218] S. Jouili and V. Vansteenberghe, “An empirical comparison of graph databases,” in Social
Computing (SocialCom), 2013 International Conference on, pp. 708–715, IEEE, 2013.
[219] “Hotel Bouton d’Or, Courmayeur Mont-Blanc.” http://www.hotelboutondor.com.
Accessed: May 2015.
[220] “Baking a Hello World Cake.”
http://www.mike-worth.com/2013/03/31/baking-a-hello-world-cake/. Accessed: May 2015.
[221] “Chef.” http://www.dangermouse.net/esoteric/chef.html. Accessed: May 2015.
[222] “Numb3rs: End of Watch (18 Dec. 2014), Season 3, Episode 8.”
http://www.casalebrunet.com/phd/video/criticalPath.html. Accessed: May 2015.

Simone CASALE BRUNET
Doctoral Assistant
casalebrunet@ieee.org • www.casalebrunet.com
skype: casalebrunet • github: casalebrunet • twitter: casalebrunet
Last update: May 2015
Summary Simone Casale-Brunet received a B.S. degree in Electrical Engineering (2008) and an M.S.
degree in Mechatronics Engineering (2010), both with highest honours, from the Politecnico di Torino,
Italy. In 2010 he joined the EPFL SCI STI MM group of the École Polytechnique Fédérale de Lausanne,
Switzerland, where he is currently a Ph.D. candidate under the supervision of Dr. Marco Mattavelli. His
research interests include design space exploration of heterogeneous parallel systems and advanced
control theory. His research work is sponsored by the Fonds National Suisse pour la Recherche Scientifique.
Experience
December 2010 - June 2015 (exp.), École Polytechnique Fédérale de Lausanne
Doctoral assistant: in the SCI-STI-MM Multimedia Group, under the supervision of Dr. Marco
Mattavelli. Design methodologies for software/hardware applications for digital signal processing and
communication.
September 2010 - December 2010, Politecnico di Torino
Research assistant: in the Department of Control and Computer Engineering. Development of a hard
real-time model predictive controller, system identification and optimization.
Education
2010 - present, Electrical Engineering Doctoral School
Main research topic: Design space exploration and optimization for highly parallel heterogeneous systems
using high-level dataflow representation.
Supervisor: Dr. Marco Mattavelli
2008 - 2010, MSc Mechatronics Engineering (Summa Cum Laude)
Advanced control theory, electronics engineering, computer science engineering, mechanical
engineering, mathematical optimization.
Supervisor: Prof. Massimo Canale
2005 - 2008, BSc Electronics Engineering (Summa Cum Laude)
Electronics engineering, computer science engineering, control theory, business management
Supervisor: Prof. Massimo Canale
Skills
Foreign Languages
◦ Italian (Mother tongue) ◦ French (Bilingual) ◦ English (Fluent)
Programming Languages
◦ Java ◦ C/C++ ◦ PHP ◦ CAL
Operating Systems
◦ MacOS ◦ GNU/Linux ◦ Windows
Collaborative Projects
• (FP7) ICT-ALICANTE
Media Ecosystem Deployment through Ubiquitous Content-Aware Network Environments.
http://www.ict-alicante.eu
• (EUREKA’s Eurostars) VAMPA
Embedded Video content Analysis on the STM STHORM Multicore Architecture.
http://vampa.epfl.ch
Open Source Software
• TURNUS: (main contributor and maintainer)
a computer-aided co-exploration framework that guides designers during the co-exploration and
optimization process. Released under GPL3 licence.
http://github.com/turnus
• Open RVC-CAL Compiler (Orcc): (code interpreter contributor)
an RVC-CAL compiler infrastructure that allows several languages (software and hardware) to be
generated from the same description composed of RVC-CAL actors and XDF networks. Released
under BSD licence.
http://github.com/orcc
Grants and Sponsorships
• Fonds National Suisse pour la Recherche Scientifique, grant 200021.138214
Service to the Profession
• Session Co-Chair, Applications of Model Predictive Control
12th European Control Conference (ECC13), Zurich, July 2013
Professional Memberships
• Member of the Institute of Electrical and Electronics Engineers (IEEE)
and the IEEE Computer Society (CS)
and the IEEE Circuits and Systems Society (CSS)
and the IEEE Council on Electronic Design Automation (CEDA)
• Member of the Association for Computing Machinery (ACM)
References
• Dr. Marco Mattavelli
École Polytechnique Fédérale de Lausanne, EPFL SCI STI MM, Switzerland
marco.mattavelli@epfl.ch
• Prof. Massimo Canale
Politecnico di Torino, Dipartimento di Automatica e Informatica, Italy
massimo.canale@polito.it
Publications
Journals
2014 [J2] M. Canale and S. Casale-Brunet. A multidisciplinary approach for Model Predictive
Control Education: A Lego Mindstorms NXT-based framework. International Journal of
Control, Automation and Systems, 12(5):1030–1039, 2014
2012 [J1] S. Casale-Brunet, A. Elguindy, E. Bezati, R. Thavot, G. Roquier, M. Mattavelli, and
J. W. Janneck. Methods to Explore Design Space for MPEG RMC Codec Specifications.
Image Commun., 28(10):1278–1294, Nov. 2013
Conferences
2014 [C23] S. Casale-Brunet, E. Bezati, M. Mattavelli, M. Canale, and J. Janneck. Execution
trace graph analysis of dataflow programs: bounded buffer scheduling and deadlock
recovery using model predictive control. In Proceedings of Conference on Design and Architectures
for Signal and Image Processing (DASIP), 2014
[C22] S. Casale-Brunet, M. Wiszniewska, E. Bezati, M. Mattavelli, J. Janneck, and
M. Canale. TURNUS: an open-source design space exploration framework for dynamic
stream programs. In Proceedings of Conference on Design and Architectures for Signal and
Image Processing (DASIP), 2014
[C21] M. Canale, S. Casale-Brunet, E. Bezati, M. Mattavelli, and J. Janneck. Dataflow
programs analysis and optimization using model predictive control techniques: An example of
bounded buffer scheduling. In Signal Processing Systems (SiPS), 2014 IEEE Workshop on,
pages 1–6, Oct. 2014
[C20] E. Bezati, S. Casale-Brunet, M. Mattavelli, and J. Janneck. Coarse grain clock gating
of streaming applications in programmable logic implementations. In Electronic System
Level Synthesis Conference (ESLsyn), Proceedings of the 2014, pages 1–6, 2014
[C19] J. Janneck, S. Casale-Brunet, and M. Mattavelli. Characterizing communication
behavior of dataflow programs using trace analysis. In Embedded Computer Systems:
Architectures, Modeling, and Simulation (SAMOS XIV), 2014 International Conference on, pages
44–50, Jul. 2014
[C18] A. Ab-Rahman, S. Casale-Brunet, C. Alberti, and M. Mattavelli. A methodology for
optimizing buffer sizes of dynamic dataflow FPGAs implementations. In Acoustics, Speech
and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 5003–5007,
May 2014
[C17] C. Sau, L. Raffo, F. Palumbo, E. Bezati, S. Casale-Brunet, and M. Mattavelli.
Automated design flow for coarse-grained reconfigurable platforms: An RVC-CAL multi-
standard decoder use-case. In Embedded Computer Systems: Architectures, Modeling, and
Simulation (SAMOS XIV), 2014 International Conference on, pages 59–66, Jul. 2014
[C16] D. de Saint-Jorre, C. Alberti, M. Mattavelli, and S. Casale-Brunet. Exploring MPEG
HEVC decoder parallelism for the efficient porting onto many-core platforms. In Image
Processing (ICIP), 2014 IEEE International Conference on, pages 2115–2119, Oct. 2014
[C15] J. Janneck, G. Cedersjo, E. Bezati, and S. Casale-Brunet. Dataflow Machines. In
Signals, Systems and Computers, 2014 Asilomar Conference on, Nov. 2014
2013 [C14] S. Casale-Brunet, M. Mattavelli, C. Alberti, and J. Janneck. Systems design space
exploration by serial dataflow program execution. In Signals, Systems and Computers, 2013
Asilomar Conference on, pages 1805–1809, Nov. 2013
[C13] S. Casale-Brunet, E. Bezati, C. Alberti, M. Mattavelli, E. Amaldi, and J. Janneck.
Multi-clock domain optimization for reconfigurable architectures in high-level dataflow
applications. In Signals, Systems and Computers, 2013 Asilomar Conference on, pages 1796–
1800, Nov. 2013
[C12] S. Casale-Brunet, M. Mattavelli, C. Alberti, and J. Janneck. Design Space Exploration
of High-Level Stream Programs on Parallel Architectures. In 8th International
Symposium on Image and Signal Processing and Analysis (ISPA 2013), Trieste, Italy, pages
738–743, Sep. 2013
[C11] S. Casale-Brunet, M. Mattavelli, and J.W. Janneck. Buffer optimization based on
critical path analysis of a dataflow program design. In Circuits and Systems (ISCAS), 2013
IEEE International Symposium on, pages 1384–1387, May 2013
[C10] S. Casale-Brunet, M. Mattavelli, and J. Janneck. TURNUS: A design exploration
framework for dataflow system design. In Circuits and Systems (ISCAS), 2013 IEEE
International Symposium on, pages 654–654, May 2013
[C9] S. Casale-Brunet, E. Bezati, C. Alberti, G. Roquier, M. Mattavelli, J. Janneck, and
J. Boutellier. Design space exploration and implementation of RVC-CAL applications
using the TURNUS framework. In Design and Architectures for Signal and Image Processing
(DASIP), 2013 Conference on, pages 341–342, Oct. 2013
[C8] S. Casale-Brunet, C. Alberti, M. Mattavelli, and J. Janneck. TURNUS: A unified
dataflow design space exploration framework for heterogeneous parallel systems. In
Design and Architectures for Signal and Image Processing (DASIP), 2013 Conference on, pages
47–54, Oct. 2013
[C7] S. Casale-Brunet, M. Mattavelli, C. Alberti, and J. Janneck. Representing Guard
Dependencies in Dataflow Execution Traces. In Computational Intelligence, Communication
Systems and Networks (CICSyN), 2013 Fifth International Conference on, pages 291–295,
2013
[C6] S. Casale-Brunet, E. Bezati, C. Alberti, M. Mattavelli, E. Amaldi, and J.W. Janneck.
Partitioning and optimization of high level stream applications for multi clock domain
architectures. In Signal Processing Systems (SiPS), 2013 IEEE Workshop on, pages 177–
182, Oct. 2013
[C5] A. Ab-Rahman, S. Casale-Brunet, C. Alberti, and M. Mattavelli. Dataflow program
analysis and refactoring techniques for design space exploration: MPEG-4 AVC/H.264
decoder implementation case study. In Design and Architectures for Signal and Image
Processing (DASIP), 2013 Conference on, pages 63–70, Oct. 2013
[C4] E. Bezati, S. Casale-Brunet, M. Mattavelli, and J. Janneck. Synthesis and optimization
of high-level stream programs. In Electronic System Level Synthesis Conference (ESLsyn),
2013, pages 1–6, May 2013
[C3] M. Canale and S. Casale-Brunet. A Lego Mindstorms NXT experiment for Model
Predictive Control education. In Control Conference (ECC), 2013 European, pages 2549–
2554, Jul. 2013
2012 [C2] S. Casale-Brunet, M. Mattavelli, and J.W. Janneck. Profiling of Dataflow Programs
Using Post Mortem Causation Traces. In Signal Processing Systems (SiPS), 2012 IEEE
Workshop on, pages 220–225, Oct. 2012
[C1] A. Ab-Rahman, R. Thavot, S. Casale-Brunet, E. Bezati, and M. Mattavelli. Design
space exploration strategies for FPGA implementation of signal processing systems using
CAL dataflow program. In Design and Architectures for Signal and Image Processing (DASIP),
2012 Conference on, pages 1–8, Oct. 2012
