Worst-case delay analysis of core-to-IO flows over many-cores architectures by Abdallah, Laure
Submitted for the degree of
DOCTORATE OF THE UNIVERSITÉ DE TOULOUSE
Awarded by: Institut National Polytechnique de Toulouse (INP Toulouse)
Discipline or specialty: Networks, Telecommunications, Systems and Architecture
Presented and defended by: Mrs. LAURE ABDALLAH
on Wednesday, 5 April 2017
Title: Worst-case delay analysis of core-to-IO flows over many-cores architectures
Doctoral school: Mathématiques, Informatique, Télécommunications de Toulouse (MITT)
Research unit: Institut de Recherche en Informatique de Toulouse (I.R.I.T.)
Thesis supervisors:
Mr. CHRISTIAN FRABOUL
Mr. MATHIEU JAN
Reviewers:
Mr. LAURENT GEORGE, ESIEE NOISY LE GRAND
Mrs. ISABELLE PUAUT, UNIVERSITE RENNES 1
Jury members:
Mr. LAURENT GEORGE, ESIEE NOISY LE GRAND, President
Mr. CHRISTIAN FRABOUL, INP TOULOUSE, Member
Mr. JEROME ERMONT, INP TOULOUSE, Member
Mr. MARC GATTI, THALES AVIONICS, Member
Mr. MATHIEU JAN, CEA SACLAY, Member
Acknowledgements
I would like to extend my sincere gratitude to all the reviewers for agreeing to evaluate this
work: to Mr. Laurent George, Professor at the University of Paris-Est, and Mrs. Isabelle
Puaut, Professor at the University of Rennes 1, who very kindly agreed to be the scientific
evaluators of this thesis. My earnest thanks are also due to Mr. Marc Gatti, Research Director
at Thales Avionics, for agreeing to be a member of the jury.
I have immense respect, appreciation and gratitude for my advisors: Mr. Mathieu Jan,
Research Engineer at CEA LIST, Mr. Jérôme Ermont, Associate Professor at INP-ENSEEIHT,
and Mr. Christian Fraboul, Professor at INP-ENSEEIHT, for their consistent support and for
sharing their invaluable knowledge, experience and advice which have been greatly beneficial
to the successful completion of this thesis.
Thanks to all L3S team members for the occasional coffees and discussions especially during
our “PhD Days”.
I would also like to thank my friends Damien, Briag, Sarah, Jad, Hadi and Marcelino for their
support and all our happy get-togethers and for uplifting my spirit during the challenging times
faced during this thesis.
I am particularly grateful to my friends Ola and Mohammad for their priceless friendship and
precious support, and for continuously giving me the courage to make this thesis better.
Last but definitely not least, I am forever indebted to my awesomely supportive parents, my
wonderful sisters, Layal, Reem and Ghina, and my beloved brother Wassim, for inspiring and
guiding me throughout the course of my life, and hence this thesis.
Abstract
Many-core architectures are more promising hardware for designing real-time systems than
multi-core systems, as they should make it easier to integrate a larger number of applications,
potentially of different criticality levels. In embedded real-time systems, these
architectures will be integrated within backbone Ethernet networks, as they mostly provide
Ethernet controllers as Input/Output (I/O) interfaces. Thus, a number of applications of
different criticality levels could be allocated on the Network-on-Chip (NoC) and required to
communicate with sensors and actuators. However, the worst-case behavior of the NoC for both
inter-core and core-to-I/O communications must be established. Several NoCs targeting hard
real-time systems, built on specific hardware extensions, have been designed. However, none
of these extensions is currently available in commercial NoC-based many-core architectures,
which instead rely on wormhole switching with round-robin arbitration. With this switching
strategy, interference patterns can occur between direct and indirect flows on many-cores.
Besides, the mapping of both critical and non-critical applications over the NoC has an impact
on the network contention that core-to-I/O communications exhibit.
These core-to-I/O flows (coming from the Ethernet interface of the NoC) cross two networks of
different speeds: NoC and Ethernet. On the NoC, the maximum packet size is much smaller
than the size of Ethernet frames. Thus, when an Ethernet frame is transmitted over the NoC, it
is divided into many packets. Only when all the data corresponding to this frame have been
received by the DDR-SDRAM memory on the NoC is the frame removed from the buffer of the
Ethernet interface. In addition, congestion on the NoC, due to wormhole switching, can delay
these flows, while the buffer in the Ethernet interface has a limited capacity. This behavior
may therefore lead to Ethernet frames being dropped. The idea is thus to analyze the worst-case
transmission delays on the NoC and to reduce the delays of the core-to-I/O flows.
In this thesis, we show that the pessimism of the existing Worst-Case Traversal Time (WCTT)
computing methods and the existing mapping strategies lead to Ethernet frames being dropped
because of internal congestion in the NoC. We therefore demonstrate properties of such
NoC-based wormhole networks that reduce the pessimism when modeling flows in contention.
Then, we propose a mapping strategy that minimizes the contention of core-to-I/O flows in
order to solve this problem.
We show that the WCTT values can be reduced by up to 50% compared to current state-of-the-art
real-time packet schedulability analysis. These results are due to the modeling of the real
impact of the flows in contention in our proposed computing method. Besides, experimental
results on real avionics applications show significant improvements in core-to-I/O flow
transmission delays, up to 94%, without significantly impacting the transmission delays of
core-to-core flows. These improvements are due to our mapping strategy, which allocates the
applications in such a way as to reduce the impact of non-critical flows on critical flows.
These reductions in the WCTT of the core-to-I/O flows avoid the dropping of Ethernet frames.
Résumé
Many-core architectures are more attractive for designing real-time systems than multi-core
systems, because they are easier to master and make it possible to integrate a larger number
of applications, potentially of different criticality levels. In embedded real-time systems,
these architectures can be used as processing elements within a backbone network, since they
provide a large number of Input/Output interfaces such as Ethernet controllers and DDR-SDRAM
memory interfaces. Applications with different criticality levels can therefore be allocated
on them. These applications communicate with each other through the network-on-chip (NoC) of
the many-core and with sensors and actuators via the Ethernet interface. To guarantee the
real-time constraints of these applications, the worst-case transmission delays (WCTT) must be
computed for the flows between cores ("inter-core") and the flows between cores and
input/output interfaces ("core-to-I/O"). Several networks-on-chip (NoCs) targeting hard
real-time systems have been designed by relying on specific hardware extensions. However, none
of these extensions is currently available in commercial network-on-chip architectures, which
are based on wormhole switching with round-robin arbitration. With this switching strategy,
different types of interference can occur between flows on the network-on-chip. Moreover, the
placement of the tasks of critical and non-critical applications has an impact on the
contention that "core-to-I/O" flows may experience.
These "core-to-I/O" flows cross two networks of different speeds: the NoC and Ethernet. On
the NoC, the maximum packet size is much smaller than the size of Ethernet frames.
Thus, when an Ethernet frame is transmitted over the NoC, it is divided into several packets.
The frame is removed from the buffer of the Ethernet interface only once all of its data
have been transmitted. Unfortunately, NoC congestion adds further delays to the transmission
of packets, and the size of the Ethernet interface buffer is limited. As a consequence, this
behavior can result in Ethernet frames being dropped. The idea is therefore to analyze the
worst-case transmission delays on the NoC and to reduce these delays in order to avoid this
frame-dropping problem.
In this thesis, we show that the pessimism of existing WCTT computing methods and the
existing mapping strategies lead to Ethernet frames being dropped because of internal
congestion on the NoC. Properties of networks using wormhole switching have been defined and
validated in order to better account for conflicts between flows. A task mapping strategy
that takes communications with the I/O into account has then been proposed. This strategy
aims to decrease the contention experienced by the flows coming from the I/O and thus to
reduce their WCTTs.
The results obtained with the computing method defined in this thesis show that the WCTT
values of flows can be reduced by up to 50% compared to the WCTT values obtained with
existing computing methods. In addition, experimental results on real avionics applications
show significant improvements in the transmission delays of "core-to-I/O" flows, up to 94%,
without significant impact on those of "inter-core" flows. These improvements are due to the
proposed allocation strategy, which places the applications so as to reduce the impact of
non-critical flows on critical flows. These WCTT reductions for "core-to-I/O" flows avoid the
dropping of Ethernet frames.
Contents
Acknowledgements
Abstract
Résumé
1 Introduction
1.1 Motivation
1.2 Contributions
1.3 Publications
2 Network-on-Chip for real-time service
2.1 NoC components
2.2 Real-time systems
2.3 NoC concepts and real-time requirements
2.3.1 Latency and Quality-of-Service
2.3.2 Main concepts impacting the latency
2.3.3 Main concepts impacting the design complexity
2.4 Assumed NoC architecture
3 Related works around Worst Case Traversal Time (WCTT)
3.1 WCTT over wormhole networks
3.1.1 Priority-based wormhole networks
3.1.2 BE wormhole networks and recursive methods
3.2 Contention-aware mapping
3.2.1 Task mapping
3.2.2 Application mapping
3.3 WCTT and mapping baseline references
4 Managing I/O in many-cores: problems and approach
4.1 System architecture and assumptions
4.1.1 Description of I/O in Tilera
4.1.2 Model of NoC architecture and assumptions
4.2 An avionic case study
4.3 Problem illustration
4.3.1 Is improving the computation of the WCTT sufficient?
4.3.2 Is a contention-aware mapping strategy the solution?
4.4 Proposal
5 RCNoC: an optimized WCTT analysis for NoC
5.1 Notations and definitions
5.2 Wormhole network properties
5.2.1 Local worst-case scenario
5.2.2 Direct and indirect contentions
5.3 Description of the proposed pipeline-based algorithm
5.3.1 An illustrating example
5.3.2 Identifying the global worst-case scenario
5.3.3 Implementation of direct and indirect contention analysis
5.4 Unitary evaluation of the properties
5.5 Conclusion
6 MapIO: an I/O contention-aware mapping technique
6.1 How can the WCTT of core-to-I/O flows be reduced?
6.2 Overview of our strategy of mapping: MapIO
6.3 Phase 1: Core-to-I/O flows distance minimization
6.3.1 Filling NoC regions with applications
6.3.2 Application shape and its assignment to regions
6.3.3 Reclaiming unused cores within shapes
6.4 Phase 2: Core-to-I/O flows contention minimization
6.4.1 I/O critical paths within applications
6.4.2 Mapping tasks on the critical paths
6.4.3 Mapping tasks around the critical path
6.4.4 Minimizing the contention of outgoing flows
6.5 Conclusion
7 Evaluation of RCNoC and MapIO on case studies
7.1 Impact of RCNoC on the core-to-I/O flows
7.2 Impact of MapIO on the core-to-I/O flows
7.3 Combining our MapIO and RCNoC is necessary
7.4 Impact of MapIO on both core-to-I/O and core-to-core flows
7.4.1 Impact of MapIO rules
7.4.2 Impact of unused cores
7.4.3 Discussion
7.5 Conclusion
8 Conclusion
8.1 Summary of Thesis Contributions
8.2 Future Work
Bibliography
List of Tables
2.1 Table reporting some examples of NoC architectures.
6.1 Different shapes and their free cores for each application.
7.1 Table illustrating the WCTT of the core-to-I/O flows, of the case study A, sharing the Ethernet (0,6) and (0,2) using MapIO and SHiC approach.
7.2 Table illustrating the WCTT of the core-to-I/O flows, of the case study B, sharing the Ethernet (0,6) and (0,2) using MapIO and SHiC approach.
7.3 Table illustrating the WCTT of HM12 blocking HM16 of the case study C and sharing the Ethernet (0,6) using the different mapping strategies and computing methods.
7.4 Table illustrating the different results for applications, of case study A, sharing the Ethernet (0,6) using MapIO and SHiC approach.
7.5 Table illustrating the different results for HM and FADEC in both case studies A and B using MapIO and SHiC approach.
7.6 Table reporting the WCTT of core-to-I/O flows for the applications of the case study D using MapIO and the SHiC method.
7.7 Table reporting the WCTT of the core-to-I/O flow of HM when varying the number of unused cores (Ui) by FADEC9.
List of Figures
2.1 The NoC components.
2.2 The different buffering strategies.
2.3 Example of topologies that are most used in NoCs.
3.1 An example illustrating Recursive Calculus method.
3.2 Allocating App3 using the mapping strategies presented in [dSCCM10], [FRD+12] and [COM08].
3.3 Allocating App3 using SHiC method presented in [FDLP13].
4.1 An overview of a Tilera architecture.
4.2 The different steps of the ingress data flows.
4.3 A Tilera-like NoC architecture.
4.4 Task graph of core-to-core and core-to-I/O communications of the FADEC application assuming 7 tasks.
4.5 ROSACE case study, extracted from [PSG+14].
4.6 Task graph illustrating ROSACE case study.
4.7 Task graph illustrating HM application assuming 5 tasks.
4.8 The 6 FFT communication stages.
4.9 Task graph illustrating FFT application assuming 16 tasks.
4.10 Arbitrary mapping of a case study made of one FADEC, one FFT and two HM applications.
4.11 (a) Core-to-core blocking flows of HM core-to-I/O flow and their flits position, (b) Transmission timeline of core-to-core and core-to-I/O communications.
4.12 SHiC mapping of the case study A made of one FADEC, one FFT and two HM applications.
4.13 FFT flows blocking the HM core-to-I/O flow by considering SHiC mapping.
5.1 Example illustrating the number of routers separating an analyzed flow fa from its indirect flow fid.
5.2 Example illustrating the second condition of Property 1.
5.3 Different scenarios to illustrate the second condition of Property 1.
5.4 Timeline of the transmission of flows for the two different scenarios of the example shown in Figure 5.3.
5.5 Different scenarios illustrating the third condition of Property 1.
5.6 Timeline of the transmission of flows of the example shown in Figure 5.5.
5.7 Example illustrating that Property 1 should be applied recursively.
5.8 A worst-case scenario of the example in Figure 5.7 illustrating the recursion of Property 1.
5.9 A second configuration of the example in Figure 5.7 which does not lead to a worst-case scenario.
5.10 An example illustrating the case where nhops = nfj.
5.11 An example illustrating the case where nhops > nfj.
5.12 An example illustrating the case where nhops < nfj.
5.13 Example illustrating the analyzed flow fa, the direct flow fd and the indirect flow fid.
5.14 Example illustrating Property 3.
5.15 Transmission of flows f1, f2 and f3 of the case in Figure 5.14 where we consider a size of 3 flits for each packet.
5.16 Timelines of transmission of flows f1, f2 and f3 of the case in Figure 5.14 where we consider a size of 4 flits for each packet.
5.17 Example illustrating the method of computation.
5.18 Timeline illustrating the transmission of flows of the example in Figure 5.17 and considering the first scenario.
5.19 Mapping of flows in the first configuration and normalized gain compared to the RC method.
5.20 Mapping of flows in the second configuration and normalized gain compared to the RC method.
6.1 The different steps of our approach MapIO for mapping FADEC, HM and FFT applications of case study A.
6.2 Mapping of FADEC and HM applications of the case study B using our approach MapIO.
6.3 Different cases of a possible critical path, shown in blue, for a core-to-I/O flow (path shown in red) of an application (whose mapping is shown in yellow).
6.4 The different steps of internal mapping for FADEC8 and HM8.
6.5 The order of mapping the tasks by considering the different cases of a critical path.
6.6 The different communications that could block the X and Y part of a critical path depending on where free cores are localized.
6.7 Communications perpendicular to the critical path avoid contention with the core-to-I/O flow.
6.8 For each possible configuration of a critical path, the defined areas and their order in the tasks mapping.
6.9 Flows blocking directly the path of the outgoing flow.
6.10 The configuration in (a) shows how the communications with DDR interfere with the critical path while this interference is avoided by our mapping as shown in (b).
6.11 Mapping of FADEC9 of the case study B.
7.1 Mapping of flows of case studies A and B by applying MapIO mapping and SHiC mapping.
7.2 Mapping of flows of case studies C and D by applying MapIO mapping and SHiC mapping.
7.3 The different steps for mapping the applications of the case study D using MapIO.
7.4 The different steps of MapIO task mapping for ROSACE and FADEC applications of the case study D.
7.5 MapIO mapping of FADEC when varying the number of unused cores UNi.
7.6 SHiC mapping of FADEC when varying the number of unused cores UNi referring to MapIO mapping.
8.1 Other possibility of the task mapping for ROSACE using our approach.
Chapter 1
Introduction
Contents
1.1 Motivation
1.2 Contributions
1.3 Publications
1.1 Motivation
The use of real-time systems continues to spread across many application areas, including
engine management, process control, medical electronics, telecommunications, robotics,
multimedia and avionics. The underlying idea of real-time systems is not speed but
reactivity: a real-time system operates in a dynamic environment and must constantly adapt
to the changes in this environment. This implies that the response to these changes must be
appropriate (functional correctness) and must also respect the time constraints (temporal
correctness). These time constraints vary considerably in their degree of severity. In soft
real-time systems, a late response may be tolerated. In contrast, in hard real-time systems,
a failure to respond within the time constraints may constitute a catastrophic failure of the
system.
In this thesis, we focus on hard real-time systems.
Real-time systems range in complexity from simple controllers implementing a single function
to complex sets of communicating sub-systems, each of which is responsible for performing
a number of critical functions. These systems are becoming increasingly complex along with
their applications, whether in architectures (number of processors, presence of
networks) or in requirements (criticality of tasks, computing power, energy consumption, etc.).
The need to meet these requirements has led to computer systems that comprise millions of
transistors on a single chip, commonly called Systems on Chip (SoCs). The most common
communication backbone used in SoCs is the shared-medium arbitrated bus. However, a
bus-based SoC does not scale with the number of attached cores, so performance degrades.
A search therefore started for the communication backbone of next-generation many-core
SoCs able to support new inter-core communication demands, and led to the design of
what is called a Network on Chip (NoC). The NoC has emerged as a viable alternative, as it
consists of various cores connected to a router-based network. Such a communication
architecture is described as modular and scalable [BDM02b, DT01, KJS+02].
In this thesis, we focus on the NoC-based many-core architectures.
Many-core architectures are indeed promising hardware to support the design of hard real-time
systems [NYP+14a]. They are based on simpler cores, without the complex hardware mechanisms
that can be found in multi-core architectures. The timing behavior of cores within many-core
architectures is thus easier to analyze. In order to support hard real-time traffic,
guarantees in the worst-case scenarios must be established: the Worst-Case Traversal
Time (WCTT) of all the packets generated by a flow must be lower than a predetermined
deadline. A deadline marks the latest time at which a packet should arrive at its destination.
Thus, Quality-of-Service (QoS) has been a major concern for NoCs [BM06]. In fact, packets
are routed in a network where resources are shared, which makes performance unpredictable.
NoCs can be used in hard real-time systems following two approaches. The first is to
use analytic methods to analyze the WCTTs of flows on existing many-cores. The second is
to modify the hardware architecture in such a way that no contention can occur by design,
leading to straightforward WCTTs for flows. However, the problem with the second approach is
that no commercial NoC architecture implements it.
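The deadline requirement stated above can be written as a simple check. The following is only a minimal sketch: the `Flow` type, its field names, the time unit (cycles) and the choice to accept a WCTT exactly equal to the deadline are assumptions made for illustration, not definitions taken from this thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """Hypothetical flow description used only for this illustration."""
    name: str
    wctt: int      # upper bound on the traversal time, in cycles (assumed unit)
    deadline: int  # relative deadline, in the same unit


def meets_deadline(flow: Flow) -> bool:
    # Assumption: a packet arriving exactly at its deadline is still on time.
    return flow.wctt <= flow.deadline


def flows_missing_deadline(flows: list[Flow]) -> list[str]:
    """Return the names of the flows whose WCTT bound exceeds their deadline."""
    return [f.name for f in flows if not meets_deadline(f)]


if __name__ == "__main__":
    example = [Flow("f1", wctt=80, deadline=100), Flow("f2", wctt=120, deadline=100)]
    print(flows_missing_deadline(example))
```

Under the first approach, the WCTT values fed to such a check would come from an analytic method; the numbers in the example are arbitrary.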
In this thesis, we focus on the use of NoC architectures in hard
real-time systems using the first approach.
In embedded real-time systems, a number of applications of different criticality levels could
be allocated on a NoC and required to communicate with sensors and actuators. As many-core
architectures provide a significant number of Ethernet interfaces, they could be used
as processing elements within a backbone Ethernet network. The challenge is then to analyze
the worst-case behavior of the underlying NoC [BDM02a], used for both inter-core
communications and communications between cores and external memories or peripherals. Such
real-time packet schedulability analyses have been carried out for various types of networks
by taking into account the types of contention that can occur between flows. However, none of
these works additionally considers core-to-I/O communications.
In this thesis, we focus on the integration of I/O constraints within
NoC communications.
A core-to-I/O flow experiences a change in speed as it crosses two networks of different
types (NoC and Ethernet). Moreover, congestion is possible in NoCs, especially when
wormhole switching is used. NoC congestion delays the core-to-I/O flow, which can lead to an
overflow of the Ethernet interface buffer, whose capacity is limited. Incoming Ethernet
frames holding critical data could therefore be dropped. Real-time packet schedulability
analysis must then take into account all types of contention that can occur between all types
of flows, which severely complicates the timing analysis.
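The link between NoC delay and frame loss can be illustrated with a small simulation. This is a sketch under simplifying assumptions that are not taken from this thesis: frames arrive periodically, the buffer capacity is counted in whole frames, frames are forwarded over the NoC one at a time in FIFO order, each frame needs the same time `drain_time` to be fully transferred, and a frame frees its buffer slot only once that transfer is complete. All names and parameters are hypothetical.

```python
def count_dropped_frames(n_frames: int, arrival_period: int,
                         drain_time: int, capacity: int) -> int:
    """Count the frames dropped by a bounded FIFO buffer drained over the NoC.

    Frame i arrives at time i * arrival_period. A frame holds one buffer slot
    until its transfer over the NoC has completed; a frame that arrives while
    all slots are held is dropped.
    """
    release_times: list[int] = []  # completion times of buffered frames, ascending
    dropped = 0
    for i in range(n_frames):
        now = i * arrival_period
        # Slots of frames whose transfer has completed by now are free again.
        release_times = [t for t in release_times if t > now]
        if len(release_times) >= capacity:
            dropped += 1
            continue
        # FIFO: this frame starts once the previously buffered frame is done.
        start = max(now, release_times[-1]) if release_times else now
        release_times.append(start + drain_time)
    return dropped


if __name__ == "__main__":
    # Transfer faster than arrivals: the buffer never fills up.
    print(count_dropped_frames(10, arrival_period=10, drain_time=5, capacity=2))
    # Transfer slowed down (e.g. by congestion): some frames are dropped.
    print(count_dropped_frames(10, arrival_period=10, drain_time=25, capacity=2))
```

In this toy model, `drain_time` plays the role of the delay experienced by a core-to-I/O flow on the NoC: reducing it, which is the stated objective for the WCTT of core-to-I/O flows, is what keeps the buffer from overflowing.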
The objective of this thesis is to analyze the WCTT of all flows on the
NoC and to reduce the WCTT of the core-to-I/O flow in order to
avoid the loss of critical payloads.
1.2 Contributions
To reach the objective stated in the previous section, this thesis presents three
main contributions:
• An analysis showing that existing computing methods and congestion-aware mapping
strategies are not sufficient to integrate the I/O constraints within the NoC
communications. In addition, the I/O interfaces are explained and the communications
between cores and I/O interfaces are illustrated. To the best of our knowledge, this
is the first work that integrates I/O communications within the NoC communications.
• An analytical method to compute the WCTT that takes three properties of wormhole
switching into account when analyzing contention between flows. This method reduces the
pessimism of the existing methods, so that tighter delay bounds can be obtained.
• A congestion-aware mapping strategy for critical and non-critical applications that considers
not only the core-to-core communications, but also the core-to-I/O communications. This
strategy aims to reduce the contention on the core-to-I/O flows and hence their WCTTs.
The thesis is organized as follows:
• Chapter 2 presents the main NoC concepts that impact the WCTT of flows over different
architectures. We also present some of the existing NoC architectures in order to justify
the architecture we assume in this work.
• Chapter 3 describes the existing methods to compute the WCTT of flows on
wormhole networks. It also presents the existing mapping strategies that reduce the
congestion on the NoC and thus the WCTT of these flows.
• Chapter 4 presents our first contribution: an illustration of the problems that arise when
using NoCs in real-time systems interconnected to sensors and actuators via Ethernet. First,
we present the model of our NoC architecture and the I/O model. Then, using a motivating
case study made of real-time applications, we show the limitations of the
existing WCTT analysis and mapping methods that lead to these problems.
• Chapter 5 explains our second contribution: an analysis method that reduces the
pessimism when computing the WCTT of flows. This chapter shows the improvements of our
method compared to the existing methods.
• Chapter 6 describes our third contribution: a mapping algorithm that takes into account
the core-to-I/O flows and that aims to reduce the contention on these flows.
• Chapter 7 presents an evaluation of our contributions compared to the current
state-of-the-art methods on several case studies. The objective of this chapter is to
illustrate the impact of our proposed approaches on the WCTT of the core-to-I/O and
core-to-core flows.
• Chapter 8 concludes the manuscript by summarizing the major contributions of the
thesis and proposing interesting research directions as future work.
1.3 Publications
The following papers and publications reflect the results achieved during the development
of the research work presented in this dissertation. A significant part of this thesis is
compiled from these papers and publications.
• Laure Abdallah, Mathieu Jan, Jérôme Ermont and Christian Fraboul. “Reducing the
contention experienced by real-time core-to-I/O flows over a Network on Chip”, In 28th
Euromicro Conference on Real-Time Systems (ECRTS), Toulouse, France, July 2016.
• Laure Abdallah, Mathieu Jan, Jérôme Ermont and Christian Fraboul. “Wormhole net-
works properties and their use for optimizing worst case delay analysis of many-cores”, In
10th IEEE International Symposium on Industrial Embedded Systems (SIES 2015), pages
59–68, Siegen, Germany, June 2015.
• Laure Abdallah, Mathieu Jan, Jérôme Ermont and Christian Fraboul. “I/O contention
aware mapping of multi-criticalities real-time applications over many-core architectures”,
In Proc. of the 22nd IEEE Real-Time and Embedded Technology and Applications Sym-
posium (RTAS), Session Work-In-Progress, Vienna, Austria, April 2016.
• Laure Abdallah, Mathieu Jan, Jérôme Ermont and Christian Fraboul. “Optimizing worst
case delay analysis on wormhole networks by modeling the pipeline behavior”, In Proc.
of the 13th International Workshop on Real-Time Networks (RTN), Madrid, Spain, July
2014.
• Laure Abdallah, Mathieu Jan, Jérôme Ermont and Christian Fraboul. “Why and how to
map real-time core-to-I/O flows over a Tilera-like Network on Chip?”, In FNRS Seminar-
Real-time networks, Brussels, Belgium, May 2016.
• Laure Abdallah, Mathieu Jan, Jérôme Ermont and Christian Fraboul. “Propriétés des
réseaux wormhole pour optimiser l’analyse de délai pire cas dans les many-coeurs”, In
Real-time summer school, Rennes, France, August 2016.
Chapter 2
Network-on-Chip for real-time service
Contents
2.1 NoC components
2.2 Real-time systems
2.3 NoC concepts and real-time requirements
2.3.1 Latency and Quality-of-Service
2.3.2 Main concepts impacting the latency
2.3.3 Main concepts impacting the design complexity
2.4 Assumed NoC architecture
Networks-on-Chip (NoCs) have been a very active research field since their emergence in the
early 2000s, as they offer various opportunities in terms of performance and computing
capabilities. At the same time, using them in real-time systems while ensuring the temporal
predictability of flows poses many challenges. Indeed, the flows on a NoC are routed in a
network where resources are shared, which makes performance hard to predict. Different
parameters impact the upper bound on the delay of a flow on a NoC. This chapter analyzes the
different implementations of the NoC concepts and their impact on meeting real-time
requirements. We first present the NoC components and define real-time systems. Then, we show
how the implementation of NoC concepts impacts the needed Quality-of-Service (QoS) and the
design complexity of the NoC. Finally, this chapter presents the NoC architecture on which our
work is based.
[Figure 2.1: The NoC components (PEs, NIs and routers; router detail with arbiter and control).]
2.1 NoC components
NoC was introduced in [DT01] as a better alternative to global wiring structures, used to
interconnect different Intellectual Property (IP) cores, which are also called Processing Elements
(PE). A NoC, being on a single chip, is composed of a number of interconnected tiles. A tile,
which is also called a node, contains one or many IP cores and a network router. A NoC
also interconnects the tiles to the I/O interfaces such as the external memory, the Ethernet
interfaces, etc. Figure 2.1 illustrates the structure of a NoC which is composed of routers,
network interfaces and links.
Router
A router is used to send messages from one tile to another. It consists of a set of input
and output ports that may or may not contain buffers, an interconnection matrix, and an arbiter.
Routers can have buffers either at the input ports or at the output ports [DT04]. Figure 2.1
illustrates an example of a router with input buffers. The input and output ports are connected
to each other via the interconnection matrix, which is controlled by the arbiter [DMB06].
Links
Routers communicate with each other through one or more physical channels, i.e. links [DMB06,
LRV06]. A link can be uni- or bi-directional.
Network interface
The Network Interface (NI) provides an interface between the core and the network. It makes
the NoC transparent to the PEs, which use it to access the interconnect medium. The
data produced by the core are encapsulated by the source NI and decapsulated by the
destination NI. The NI converts messages into packets for transmission over the NoC.
A packet is divided into flow control units (flits). A flit is the minimum unit of information
that can be transferred across a link [CSG99]. The first flit is called the header. The packet
header contains information about the destination NoC node. This information is needed
by the NI to determine the path of the packet. The header is followed by one or several flits
which compose the data payload. The data payload is the data transmitted by the IP core across
the NoC.
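The encapsulation performed by the NI can be sketched as follows. This is only a minimal illustration: the flit size, the `Flit` type and the header layout (destination coordinates stored as two bytes) are assumptions made for the example, not the format of any particular NoC.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Flit:
    kind: str       # "header" or "payload"
    content: bytes  # destination coordinates for the header, data otherwise


def packetize(payload: bytes, dest: Tuple[int, int], flit_bytes: int = 4) -> List[Flit]:
    """Source NI: build one header flit followed by the payload flits."""
    if flit_bytes <= 0:
        raise ValueError("flit_bytes must be positive")
    header = Flit("header", bytes(dest))  # hypothetical header layout: (x, y)
    body = [
        Flit("payload", payload[i:i + flit_bytes])
        for i in range(0, len(payload), flit_bytes)
    ]
    return [header] + body


def depacketize(flits: List[Flit]) -> bytes:
    """Destination NI: drop the header and reassemble the payload."""
    return b"".join(f.content for f in flits if f.kind == "payload")
```

With a 10-byte payload and 4-byte flits, the packet is made of one header flit and three payload flits, the last one being only partially filled.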
2.2 Real-time systems
In a real-time system, the correctness of the application depends not only on the result but also
on the time at which this result is produced. A data packet received too late by its destination
could be useless or even have severe consequences. Typical examples of real-time systems
include control systems for cars, aircraft and space vehicles. In an Antilock Braking System
(ABS), not only must the braking pressure be calculated, but the time at which it is applied is
also critical for the ABS to function. An ABS should take no more than 150 ms to
receive the information and 1 s to react.
Thus, time is an important characteristic of real-time systems. This characteristic distinguishes
these systems from other types of computing systems. For on-chip communications, the time of
interest is the packet transmission duration. Assigning a deadline to each packet
distinguishes a real-time communication from a non real-time one. A deadline marks the latest
time that a packet should arrive to its destination. Thus, a real-time communication means
meeting the deadlines, i.e. satisfying the timing constraints. For hard real-time systems, a
missed deadline is not tolerated. Thus, communications in these systems must imperatively
meet deadlines, otherwise the system might fail and the penalty incurred is catastrophic. It
could lead to loss of life, serious damage to the environment or threats to business interests. An
example of a hard real-time application is the Full Authority Digital Engine Controller (FADEC)
that controls the activities of an aircraft jet engine. The FADEC design requires particular
timing requirements. For example, if the FADEC senses that a turbine drive shaft has broken,
then the FADEC must respond with a damage mitigating action in a predetermined time.
In this thesis, we are interested in such hard real-time communications, and we focus on
their response time. Now, let us see how the implementation of NoC concepts
impacts the delivery of such real-time requirements.
2.3 NoC concepts and real-time requirements
A number of metrics are needed to measure and quantify the performance of a NoC [KPN+05,
DYN03]. In general, it is desirable that a NoC architecture exhibits low latency, high
throughput, energy efficiency, low cost, low area overhead and high scalability. In hard real-time
systems, the delay bounds of each packet must be guaranteed. As we deal with these systems,
we focus on the latency metric: the maximal latency of each packet, called Worst-case Traversal
Time (WCTT), must be lower than its deadline. The goal of this section is to introduce
a number of NoC concepts that impact the latency and the design complexity. Let us first
define the latency metric.
2.3.1 Latency and Quality-of-Service
The end-to-end network latency is measured either at the packet or at the message
level. The latency of a packet is the time between sending the first flit at the source and
receiving the last flit of this packet at the destination. Similarly, the latency of a message is
the time between the transmission of the first packet and the reception of the last one.
The latency of a packet is the sum of two components [LRV06]:
1. The conflict-free delay;
2. The blocking delay.
The conflict-free delay includes the router delay, the wire delay and the distance between the
source and the destination. The blocking delay results from the possible contention between
packets transmitted on the NoC. These delays depend on the choices made when
designing the elements of a NoC, and the degree of freedom is large. For this reason, the key
NoC concepts have to be carefully considered by the designer, as they affect all the performance
metrics. The latency is an important performance metric as it is often associated with a need
for Quality-of-Service (QoS). A QoS class defines a certain level of guarantees that is given
for packet transfers. [GDvM+03] identifies two basic QoS classes:
1. Best-Effort (BE):
BE services do not reserve any resources, and hence, provide no guarantees on latency
and throughput. Thus, it optimizes the average network resources usage.
2. Guaranteed services (GS):
In this service, a number of mechanisms are used to allocate the network resources to
ensure fixed throughput and/or latency, regardless of network load.
In the context of real-time systems over NoCs, the WCTT must be computed. Thus, in this
context, either we use analytic methods to compute the WCTTs of packets in BE NoCs,
or we use GS NoCs, where the analysis is simpler as it leads to straightforward
WCTT values.
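The decomposition of the packet latency into a conflict-free delay and a blocking delay, and the comparison of the resulting worst case with the deadline, can be sketched as follows. The per-hop model (one router delay plus one link delay per hop) and all function names are simplifying assumptions made for this illustration; computing the blocking delay itself is the hard part and is not addressed here.

```python
def conflict_free_delay(hops: int, router_delay: int, link_delay: int) -> int:
    """Delay of a packet that never competes with another one (assumed per-hop model)."""
    return hops * (router_delay + link_delay)


def packet_latency(conflict_free: int, blocking: int) -> int:
    """Latency = conflict-free delay + blocking delay."""
    return conflict_free + blocking


def meets_deadline(wctt: int, deadline: int) -> bool:
    """A hard real-time packet is acceptable only if its worst case fits the deadline."""
    return wctt <= deadline


# Hypothetical figures, in cycles: 3 hops, 2 cycles per router, 1 cycle per link,
# and an upper bound of 6 cycles on the blocking delay.
base = conflict_free_delay(hops=3, router_delay=2, link_delay=1)
wctt = packet_latency(base, blocking=6)
ok = meets_deadline(wctt, deadline=20)
```

In a GS NoC the blocking term is fixed by design, whereas in a BE NoC it has to be bounded by analysis.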
The way in which a NoC concept is implemented can determine the QoS class offered for the
applications on the NoC. Besides, it impacts the performance of a NoC on several metrics, such
as the latency, the throughput, the area, etc.
2.3.2 Main concepts impacting the latency
The need for a QoS affects the way in which NoC concepts are implemented. Let us first see
what are the essential NoC concepts that are used to provide a QoS class.
Switching techniques
The switching strategy defines how the flits are transmitted and stored by the routers. There
are some points that affect the choice of the strategy, such as the cost, the granularity of data
to be transmitted, the complexity of the router and the need of the Quality-of-Service (QoS).
The switching techniques are divided into two categories: the circuit-switching and the packet
switching [MMM+03].
1. Circuit switching: this strategy is usually used to provide guaranteed services, i.e.
guaranteed latency, as it reduces the blocking delays. In this strategy, the connection
between two communicating nodes is reserved before the message is sent by the
source node. This reservation is only released when the transfer of data is complete, i.e.
when the data have been received by the destination. Thus, the resources (links, buffers) are
only used by this communication and cannot be used by another pair of nodes willing to
communicate via the same path. This strategy is better suited to the transfer of large payloads,
which compensate for the negotiation time needed to establish the connection. Here, we can
mention SoCBus [WL03], which aims to achieve real-time guarantees by implementing circuit
switching.
The advantage of this strategy is that the available bandwidth and the latency of the
message are known. However, one of the disadvantages of circuit switching is the inefficient
bandwidth usage.
2. Packet switching: In this strategy, packets are injected into the network as soon as the
network can accept them, i.e. based on local availability information, without waiting
for a path to be set up. Thus, there is no guarantee that the
path will be entirely available from the source to the destination. Each router decides
whether a packet should be forwarded to the corresponding output port or whether
it should wait because the port is unavailable. Thus, a flow control mechanism is
required. When the port is unavailable, packets are stored in buffers at intermediate
routers.
An advantage of this strategy is that the network is shared between the communications,
whereas with circuit switching a communication is blocked until the release of
the active connection. Besides, there is no connection set-up delay, as opposed
to circuit switching. However, this strategy leads to greater delays than circuit
switching, due to the possible contentions between packets sharing the same resources.
Therefore, packet switching is rather suitable for traffic with low QoS requirements,
i.e. BE. However, it can provide different QoS classes when combined with specific
arbitration mechanisms, as we explain later. There are three basic packet switching
schemes for forwarding data: store-and-forward (SF), virtual cut-through (VCT) and
wormhole [DYN03]. These schemes are differentiated by their packet transmission
granularity, i.e. at the flit level or packet level. SF and VCT schemes are not widely used in
NoC architectures as they induce a high router area cost: a router must
include at each input port a buffer able to store entire packets. The wormhole scheme is
the most widely used in NoCs [SKH08], especially to provide a BE service.
For this reason, we focus on this packet switching scheme. Let us see how
wormhole switching works.
Wormhole switching:
Wormhole switching [Moh98] reduces the buffer requirement to the flit level. The
buffer capacity of a router is a number of flits rather than of packets, so buffers are smaller.
The packets are transmitted between routers in units of flits, where a flit is transmitted as soon
as there is space for one flit in the buffer of the next router. The header flit contains the
routing information, and the following flits, which contain the data, follow this header
contiguously in a pipeline fashion. If a packet is blocked, its flits may be stalled along a
sequence of routers.
This technique provides a low router latency because routers do not have to store full
packets. In addition, the area cost is reduced because the queues are much smaller. However,
a packet may occupy several routers at the same time. In this case, it can
block the transmission of other packets, leading to a high level of congestion and inefficient
use of channel bandwidth due to chained blocking. This chained blocking could lead
to deadlock, where messages wait for each other and none can advance any further. To
avoid deadlock, specific buffering and/or routing schemes are combined with wormhole
switching. Wormhole switching can provide more efficient network channel utilization.
For instance, Tilera [Til11], Mango [BS05], Teraflops [HVS+07], Aethereal [GDR05] and
Kalray [dDvAPL14] NoCs implement this mode.
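The effect of the transmission granularity on the conflict-free latency can be illustrated with a textbook first-order model. The sketch below assumes that a flit crosses one hop in a fixed number of cycles, that router processing time is ignored and that no contention occurs; the function names are chosen for this example only.

```python
def store_and_forward_latency(hops: int, packet_flits: int, cycles_per_flit: int = 1) -> int:
    """Each router receives the whole packet before forwarding it to the next hop."""
    return hops * packet_flits * cycles_per_flit


def wormhole_latency(hops: int, packet_flits: int, cycles_per_flit: int = 1) -> int:
    """The header needs `hops` steps; the remaining flits follow it in a pipeline."""
    return (hops + packet_flits - 1) * cycles_per_flit


# Hypothetical packet of 8 flits crossing 4 hops.
sf = store_and_forward_latency(hops=4, packet_flits=8)
wh = wormhole_latency(hops=4, packet_flits=8)
```

Under these assumptions the pipelined transfer is markedly shorter, which is consistent with the low router latency attributed to wormhole switching; the model says nothing about blocking, which is where wormhole networks become hard to analyze.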
Buffering
The buffering strategy determines the storage capacity at the input or output ports of a router.
Increasing buffering capacities can improve the latency and the throughput at the price of
area and power. Buffers can be positioned at the input and output ports of the
router according to various strategies. In this section, we distinguish three main strategies
used in NoC design: input queuing, output queuing and virtual output queuing, also
called virtual channels [RGR+03].
[Figure 2.2: The different buffering strategies. (a) Input queuing; (b) Output queuing; (c) Virtual output queuing.]
1. Input queuing: In this strategy, N queues are placed at the input ports of the router,
as illustrated in Figure 2.2a, for a router with N input and N output
ports. An arbiter determines when an input queue is connected to an output port so
that no conflict occurs. Although this technique is the least expensive in area, it
can lead to the Head-of-Line blocking problem [KHM87]. This happens when the
packet at the head of a queue cannot access its output port, thus blocking the other
packets in the queue even though their output ports are free.
2. Output queuing: When the queues are placed at the output of the router as shown
in Figure 2.2b, each output port has a number of output queues equal to the number
of input ports. This technique offers higher performance than input queuing but it
increases the area cost: for a router with N input ports and N output ports, we need N^2
queues.
These two schemes are usually used to provide BE services. However, in order
to reduce the delay a flow can suffer while waiting for a place in the next
queues, virtual channels are used.
3. Virtual output queuing or virtual channels: Figure 2.2c illustrates this strategy,
which combines the advantages of input and output queuing. For each input
port, there are a number of input queues that buffer the incoming packets
according to their destination or to a priority level. This strategy is also called virtual
channels (VCs) [Dal90, MTCM05] because for each physical channel, i.e. the link between
two nodes, there are a number of VCs that share the bandwidth of the link. In
this way, a blocked packet can be overtaken by another packet sharing the same physical
channel but stored in another VC. VCs are used to solve the Head-of-Line blocking problem
and to increase the router throughput. Although they imply an increase of the area, VCs
have several advantages: they are deadlock free and support guaranteed traffic. Thus,
this buffering is either coupled with circuit switching to provide guaranteed latency, as
in the MANGO NoC, or with packet switching. However, in the latter case, it should be
implemented with priority arbitration (which is explained in the next paragraph), by
using a priority level for each VC to ensure a low latency for the high-priority traffic.
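The Head-of-Line blocking problem and the way separate queues avoid it can be illustrated with a small sketch. Packets are represented only by the index of the output port they request, and the virtual queues are assumed to be organized per output port; both choices, and the function names, are assumptions made for this example.

```python
from collections import deque


def eligible_with_input_queue(fifo, busy_outputs):
    """Single FIFO per input port: only the packet at the head may be forwarded."""
    if fifo and fifo[0] not in busy_outputs:
        return [fifo[0]]
    return []


def eligible_with_virtual_queues(fifo, busy_outputs):
    """One virtual queue per requested output: the head of each queue is a candidate."""
    heads = []
    for out in fifo:
        if out not in heads and out not in busy_outputs:
            heads.append(out)
    return heads


# Three packets arrive in this order, requesting output ports 0, 1 and 2.
arrivals = deque([0, 1, 2])
busy = {0}  # output port 0 is currently held by another packet
blocked = eligible_with_input_queue(arrivals, busy)        # nothing can leave
overtaking = eligible_with_virtual_queues(arrivals, busy)  # packets for ports 1 and 2 can
```

With a single FIFO, the packet for the busy port 0 prevents the two others from leaving although their ports are free; with separate queues they become candidates for arbitration.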
Arbitration mechanisms
An arbitration mechanism is a way to address the contention problem. At each router, when
several packets on different input ports compete for the same output port, the arbiter
chooses which input port will transfer its packet. Arbitration mechanisms are also
required when VCs are implemented, in order to select which VC transfers a flit on a link. In
the following, we describe the arbitration mechanisms used in NoCs.
1. Time Division Multiple Access (TDMA): This policy is usually coupled with the
circuit-switching mode in order to ensure a guaranteed latency (GS), as it leads to
straightforward WCTT values. This policy divides time into time slots. Each time slot is
assigned to a connection between an input and an output port. Thus, during a time slot,
an input port has full access to an output port. However, when these time slots
are not carefully aligned, higher latencies cannot be avoided [LRV06]. The Aethereal NoC
uses TDMA with circuit switching to ensure guaranteed services.
2. Priority-based: This policy assigns to each packet an individual priority for its
transmission from the source to the destination. When multiple packets on different
input ports contend for the same output port, the packet with the highest priority is
granted first. This technique is usually used in real-time NoCs where the packets with
the highest priority must meet their deadlines. It is especially used with virtual channels,
where a priority level is assigned to each VC. For example, QNoC [BCGK04] chooses
this strategy with virtual channels to meet real-time requirements. However, this policy
can lead to starvation of low-priority packets.
3. Round-Robin (RR): The RR policy assigns a priority to each input port. In each
round of arbitration, the requesting input port with the highest priority is served first,
and the port that has been served then gets the lowest priority in the next round of
arbitration [SRM13]. In this way, this policy ensures fairness between the requesting
input ports and gives an equal chance to all competitors. Besides, it
ensures a faster access to an output port than the TDMA policy: with TDMA, a source port
has to wait for its time slot to send packets even when there are no competitors for this
output port. The drawback of the RR policy is that it does not differentiate the quality of
service demanded by the different packets. Thus, it is usually used in BE NoCs. Nevertheless,
the RR policy is widely used in NoC architectures [TSSJ14] because it is fair and
prevents starvation. For example, the Teraflops [HVS+07], Aethereal [GDR05], Tilera [Til11],
Kalray [dDvAPL14] and Mango [BS05] NoCs use this strategy.
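The rotation of priorities described for the round-robin policy can be sketched for a single output port as follows. The class name and its interface are assumptions made for this illustration; real arbiters are hardware circuits and arbitrate per flit or per packet depending on the NoC.

```python
class RoundRobinArbiter:
    """Round-robin arbiter for one output port shared by `n_inputs` input ports."""

    def __init__(self, n_inputs: int):
        self.n = n_inputs
        self.next_first = 0  # input port holding the highest priority in the next round

    def grant(self, requests):
        """Return the granted input port among `requests` (a set of port indices)."""
        for offset in range(self.n):
            port = (self.next_first + offset) % self.n
            if port in requests:
                # The served port gets the lowest priority in the next round.
                self.next_first = (port + 1) % self.n
                return port
        return None  # no input port is requesting this output


# Ports 0 and 2 continuously request the same output port.
arbiter = RoundRobinArbiter(n_inputs=4)
grants = [arbiter.grant({0, 2}) for _ in range(4)]
```

The two competing ports are served alternately, and a port that is the only requester is served immediately, which illustrates both the fairness and the absence of idle waiting mentioned above.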
To summarize, a GS is provided either by circuit switching, usually coupled with
TDMA arbitration and VCs, or by wormhole switching coupled with
priority arbitration and VCs. A BE NoC usually uses wormhole switching, RR arbitration
and input buffering.
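The remark that a TDMA source must wait for its own slot even when the output port is idle can be made concrete with a short sketch. Time is counted in whole slots and each input port is assumed to own exactly one slot per TDMA frame; these assumptions and the function names are introduced for this example only.

```python
def tdma_wait(now: int, my_slot: int, n_slots: int) -> int:
    """Number of slots to wait at time `now` before the slot `my_slot` starts again."""
    return (my_slot - now) % n_slots


def tdma_worst_case_wait(n_slots: int) -> int:
    """Largest wait over all possible arrival slots within one frame."""
    return max(tdma_wait(now, 0, n_slots) for now in range(n_slots))


# A packet becomes ready just after its slot (slot 0) has passed, in a 4-slot frame.
wait_just_missed = tdma_wait(now=1, my_slot=0, n_slots=4)
worst = tdma_worst_case_wait(n_slots=4)
```

The wait is bounded by the frame length and does not depend on the traffic of other ports, which is what makes the WCTT straightforward, at the cost of idle waiting that a round-robin arbiter would not impose on an uncontended port.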
2.3.3 Main concepts impacting the design complexity
Other concepts have an impact on the design complexity of the NoC, such
as the topology, the routing algorithm and the flow control. However, the choice of an
implementation of these concepts is not tied to a particular QoS class. Let us see
the different possible implementations of these concepts and which ones are the most used in
NoC architectures.
Topology
The topology of a network defines how the routers are interconnected using network links. The
choice of the topology directly impacts the timing performance and the area. It is the first step
in the design of the network as it defines the routing strategy used [BC06].
[Figure 2.3: Example of topologies that are most used in NoCs. (a) 2D Mesh; (b) 2D Torus.]
There is a wide variety of network topologies [DMB06] used in NoCs. 2D mesh and
torus topologies, illustrated in Figure 2.3, constitute over 60% of the NoC architectures
[SKH08]. They allow simple routing rules to be defined and provide good electrical properties.
The most common topology is the 2D mesh [IG13], for the following reasons:
1. It has an acceptable wire cost;
2. It presents a reasonably high bandwidth;
3. It is easy to group IPs that communicate a lot so that they do not consume an
unnecessarily high amount of resources in the network.
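The difference between the two topologies in terms of distance can be illustrated by counting hops between routers. The sketch below assumes routers identified by (x, y) coordinates and minimal paths; the function names are chosen for this example.

```python
from itertools import product


def mesh_hops(src, dst):
    """Minimal number of hops between two routers of a 2D mesh."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def torus_hops(src, dst, width, height):
    """Same on a 2D torus, where each row and column wraps around."""
    dx = abs(src[0] - dst[0])
    dy = abs(src[1] - dst[1])
    return min(dx, width - dx) + min(dy, height - dy)


def diameter(width, height, torus=False):
    """Largest minimal distance between any two routers of the network."""
    nodes = list(product(range(width), range(height)))
    if torus:
        return max(torus_hops(a, b, width, height) for a in nodes for b in nodes)
    return max(mesh_hops(a, b) for a in nodes for b in nodes)


# Corner-to-corner distance on a hypothetical 4x4 network.
corner_mesh = mesh_hops((0, 0), (3, 3))
corner_torus = torus_hops((0, 0), (3, 3), 4, 4)
```

On this 4x4 example the wrap-around links of the torus shorten the longest paths, at the price of longer wires, which is one reason the mesh remains the most common choice.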
Routing algorithms
The routing algorithm, implemented at each router, determines which path the packet should
take through the network, i.e. the output port from which the packet should be delivered.
The selection of the routing algorithm depends on several factors such as the implementation
complexity and the performance requirements [OHM05]. Thus, a compromise is needed between
an optimal use of the network communication channels and a simple implementation of an
algorithm that does not require excessive hardware resources.
Deterministic routing is preferred by NoC designers: according to [SKH08], it is used by
more than 70% of the NoCs. It is a simple routing algorithm in which the path taken
by a packet is completely known in advance for a given source/destination pair [DA93].
It is therefore simple and inexpensive to implement. Besides, it presents a low latency when
the network is not congested. It is deadlock free when wormhole switching is used. An
example of deterministic routing is XY routing, which is a popular routing algorithm
on 2D mesh or torus NoCs [NM93]. Adaptive routing is more complicated to
implement [BDM02b] as it provides multiple paths to route a packet from its source to its
destination. It may lead to network deadlock, livelock or both. Livelock occurs when a
packet keeps circulating in the network without ever reaching its destination.
The path can be determined either at the source (source routing) or constructed
step by step in the routers (distributed routing). In source routing, the source determines the
whole path of the packet, which is indicated in the header. In distributed routing, each
router chooses the next hop according to the final destination. Source routing is
simpler than distributed routing but cannot adapt paths in case of traffic congestion, as the
paths are pre-computed offline. As examples of NoCs that use source deterministic
routing, we can cite the Tilera [Til11] and Mango [BS05] NoCs.
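XY routing on a 2D mesh can be sketched as follows: the packet first travels along the X dimension until it reaches the destination column, then along the Y dimension. The coordinate convention and the function name are assumptions made for this example.

```python
def xy_route(src, dst):
    """Return the list of routers crossed from `src` to `dst` with XY routing on a mesh."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:           # first move along the X dimension
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:           # then move along the Y dimension
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


route = xy_route((0, 0), (2, 1))
```

The path depends only on the source and destination coordinates, so it can be computed offline, which is what makes the contention analysis of a given flow set possible.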
Flow control
The choice of the flow control strategy depends on which switching and buffering strategies
are used. Because of the limited capacity of buffers, flow control is necessary to prevent
buffers from overflowing and thus data from being dropped. Flow control ensures that a router
cannot send any data to the next router if there is not enough space available to buffer it. The
common flow control mechanisms proposed in the literature are the following: ON/OFF [CPC08],
credit-based protocol [BS04, DRGR03, RDG+04] and ACK/NACK [PABB05]. The ON/OFF
strategy minimizes the amount of backpressure signaling by sending only a single control
bit that indicates to the upstream router whether it is allowed to send (on) or not (off). In
the credit-based mechanism, a counter in each upstream router keeps track of the
number of free flit slots in the downstream router. Thus, a router can send a number of flits
corresponding to the number of credits. Finally, in the ACK/NACK protocol, the receiver
port must acknowledge each flit it receives.
In the literature, the credit-based scheme is the most widely used in NoCs because of its high
performance with limited buffering [Gai15].
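The credit counter kept by the upstream router can be sketched as follows. The class and method names are assumptions made for this illustration, and the latency of the credit return signal is ignored.

```python
class CreditedLink:
    """Upstream view of a link whose downstream buffer holds a fixed number of flits."""

    def __init__(self, downstream_buffer_flits: int):
        self.credits = downstream_buffer_flits  # free flit slots downstream

    def can_send(self) -> bool:
        return self.credits > 0

    def send_flit(self) -> None:
        """Sending a flit consumes one credit."""
        if not self.can_send():
            raise RuntimeError("no credit left: the flit must wait upstream")
        self.credits -= 1

    def credit_returned(self) -> None:
        """Called when the downstream router frees one flit slot."""
        self.credits += 1


# Hypothetical downstream buffer of 2 flits.
link = CreditedLink(downstream_buffer_flits=2)
link.send_flit()
link.send_flit()
stalled = not link.can_send()   # the buffer is full, the next flit is stalled
link.credit_returned()
resumed = link.can_send()       # one slot has been freed downstream
```

A stalled flit keeps occupying its upstream buffer, which is how a blocked wormhole packet ends up spread over several routers.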
2.4 Assumed NoC architecture
This chapter has introduced the different concepts of NoCs and how they can be implemented
to provide different classes of QoS. Designing a NoC is difficult as it depends on a
large number of parameters. The design is driven by the required QoS and by a trade-off
between complexity/cost and ensuring a minimum throughput/latency. In order to give a
general view of the state of the art on NoCs, Table 2.1 summarizes a set of NoC
architectures. For a more exhaustive list, [SKH08] lists sixty NoC architectures.
We classify these works into the following types: BE (Best-Effort) NoCs providing a good
average performance, GS (Guaranteed-Services) NoCs meeting hard real-time requirements,
and GS and BE NoCs, where the routers are more complex as they include two switching modes.
The majority of NoCs use the mesh topology, which is explained by the fact that it is the
simplest and most flexible solution to implement. Some NoCs rely on the torus topology in
order to reduce the network diameter, which could reduce the latency. Also, we can see
that the round-robin scheme predominates over other arbitration policies. About 70% of
NoCs [SKH08] use deterministic routing due to its advantages over adaptive routing, as
mentioned before. Besides, in order to reduce the area cost of routers and especially of buffers,
virtual channels and output queuing are avoided.
However, the parameters that are directly related to the QoS are the switching mode and the
arbitration mechanism. We can notice that wormhole switching with round-robin arbitration
is the most common combination in BE NoCs, as it provides lower latency and smaller and
faster routers than other techniques [LRV06]. In GS NoCs, virtual or pure circuit switching
is more common and it is coupled with TDMA to ensure a higher predictability for the network, as
in Aethereal [GDR05] and Nostrum [PJ06, MNTJ04]. Other NoCs use the packet switching mode
with priority-based arbitration to ensure guaranteed services, such as the QNoC [BCGK04] and
Faust [DBL05] NoCs. Thus, in order to provide guarantees to hard real-time traffic, GS NoCs
present a complex hardware architecture in which no contention can occur by design, leading
to straightforward WCTT values for packets but penalizing the average performance. Besides,
none of these NoCs targeting hard real-time constraints are available in commercially existing
many-core architectures.

Class   NoC                  Topology         Switching mode            Buffering     Routing                     Flow control   Arbitration
BE      SCC [HDV+11]         Mesh             Virtual cut-through       VOQ (8 VCs)   Source deterministic        -              WWFA
BE      Teraflops [HVS+07]   Mesh             Wormhole                  VOQ (2 VCs)   Deterministic adaptive      ON/OFF         RRA
BE      Tilera [Til11]       Mesh             Wormhole                  Input         Source deterministic        Credit-based   RRA
BE      Kalray [dDvAPL14]    Torus modified   Wormhole                  Output        Source deterministic        Source-based   RRA
BE      Proteo [STAN04]      Ring             Virtual cut-through       -             Source deterministic        -              -
BE      Xpipes [BB04]        -                Wormhole                  -             Source deterministic        -              -
BE      Spin [AG03]          Tree             Wormhole                  -             Distributed adaptive        -              -
BE      Octagon [KND02]      Ring             Wormhole                  -             Source deterministic        -              -
GS      SoCBUS [WL03]        Mesh             Circuit                   Input         Distributed deterministic   ACK/NACK       RRA
GS      Faust [DBL05]        Mesh             Wormhole                  VOQ (2 VCs)   Source deterministic        ACK/NACK       Priority-based
GS      QNoC [BCGK04]        Mesh             Wormhole                  VOQ           Source deterministic        Credit-based   Priority-based
GS+BE   Aethereal [GDR05]    Mesh             Wormhole + Circuit        Input         Source deterministic        Credit-based   RRA + TDMA
GS+BE   MANGO [BS05]         Mesh             Wormhole + Circuit        VOQ (8 VCs)   Source deterministic        Credit-based   RRA
GS+BE   Nostrum [PJ06]       Mesh             Store&forward + Circuit   VOQ (8 VCs)   Adaptive                    -              TDMA

Table 2.1: Table reporting some examples of NoC architectures.
In hard real-time systems, analytic methods are used to compute the WCTT in BE NoCs. For
instance, Tilera [Til11] and Kalray [dDvAPL14] are commercially available BE NoCs. Besides, this
class of QoS optimizes the average usage of network resources. For this reason, we base our work
on an existing commercial NoC architecture and we choose a Tilera-like NoC. Tilera is a commercial
BE NoC implementing the most commonly used options (mesh topology, wormhole switching,
round-robin arbitration, input buffers). Besides, the documentation on this architecture was
available when this study was made.
In this work, we are interested in real-time packet schedulability analysis for many-core
networks. Thus, we need to analyze the worst-case behavior of both inter-core communications and
communications between cores and external memories or peripherals, by taking into account the
different contentions that could occur between packets. The next chapter presents the state of
the art of the existing methods that are used to analyze, compute and reduce the WCTT.
Chapter 3
Related works around Worst Case
Traversal Time (WCTT)
Contents
3.1 WCTT over wormhole networks
3.1.1 Priority-based wormhole networks
3.1.2 BE wormhole networks and recursive methods
3.2 Contention-aware mapping
3.2.1 Task mapping
3.2.2 Application mapping
3.3 WCTT and mapping baseline references
Chapter 2 has shown different NoC architectures that provide different classes of QoS.
Regardless of the NoC architecture, in distributed hard real-time systems, the WCTT of all the
packets generated by a flow must be lower than a predetermined deadline. Such real-time
packet schedulability analyses have been done for various types of networks by taking into
account the types of contention that can occur between flows. The challenge lies in the
ability to define analysis techniques that have a limited complexity, in order to give results in a
reasonable amount of time, and at the same time compute WCTT values that are not too
pessimistic. The complexity of the analysis depends on the NoC concepts implemented. Virtual
circuit switching relying on TDMA leads to a WCTT that can be easily computed, since
no contention can occur. However, in wormhole networks, the analysis of the system
behavior becomes more complex due to different types of interference. Thus, in this chapter,
we focus on approaches used to compute the WCTT over wormhole networks. Even though we
assume a BE NoC architecture in the remainder of the work, we focus in this chapter not only
on BE wormhole networks but also on priority-based wormhole networks. Indeed, we are
interested in presenting how the existing approaches model the behavior of wormhole switching
and how they consider the different types of interference in the computation of
the WCTT. On the other hand, the WCTT depends on the mapping of the application. Thus,
another challenge lies in the ability to reduce the congestion incurred by the flows, whether
they belong to the same application or not. At the end of the chapter, we present the state of
the art on the mapping strategies used to reduce the contentions between flows and thus their WCTT.
3.1 WCTT over wormhole networks
In this section, we present the state of the art on the approaches used to analyze the WCTT on
priority-based wormhole networks and BE wormhole networks.
3.1.1 Priority-based wormhole networks
Real-time communication over wormhole networks was studied before NoCs were introduced.
The most important mechanism in a wormhole real-time network is the use of virtual
channels. Coupled with a mechanism for preemptive priorities, they allow a message to
temporarily stop the transmission of another, lower-priority message.
The first approaches to upper-bound the latency of packets transmitted over a priority-based
wormhole network with virtual channels were proposed in [Mut94, HO97]. Using these
approaches, the priority of flows is mapped onto the priority bits of VCs. At each router, links
are assigned to the VCs of the highest priority that have a flit ready to be transmitted. [HO97]
presents a feasibility test that determines whether a set of messages can be transmitted in a
given network while respecting the deadlines of all messages. Each message is periodic and has a
different priority level. Each physical link includes a virtual channel for each message that uses
it. The principle to compute the worst-case delay of a message mi is to compute the transmission
delays of the messages mj having a higher priority than mi and sharing at least one link
with mi. Moreover, as the messages are periodic, multiple instances of the same message can
block mi. The authors propose to consider the entire path of a given message as a single shared
resource, i.e. as one link which must be free of any message of higher priority than mi.
Kim et al. [KKHL98] noticed that neither of those approaches considers the impact of indirect
interference. An indirect interference happens when a message shares at least one link with mj
but does not share any link with mi, and thus could have an impact on the latency bound of
mi.
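The distinction between direct and indirect interference can be illustrated with a minimal
Python sketch. The flow names, the link identifiers and the priority convention (a lower number
means a higher priority) are assumptions made for illustration, as is the requirement that an
indirect flow has a higher priority than the direct flow it blocks; none of them is taken from
[KKHL98].

```python
def interference_sets(flows, target):
    """Return the (direct, indirect) interference sets of `target`.

    flows maps a flow name to (priority, set_of_links). A lower number
    means a higher priority (assumption of this sketch).
    """
    prio_t, links_t = flows[target]
    # Direct: higher-priority flows sharing at least one link with the target.
    direct = {
        name for name, (prio, links) in flows.items()
        if name != target and prio < prio_t and links & links_t
    }
    # Indirect: flows sharing no link with the target, but sharing a link
    # with (and having a higher priority than) one of its direct flows.
    indirect = set()
    for d in direct:
        prio_d, links_d = flows[d]
        for name, (prio, links) in flows.items():
            if name == target or name in direct:
                continue
            if prio < prio_d and links & links_d and not (links & links_t):
                indirect.add(name)
    return direct, indirect


# Hypothetical example: mk blocks mj on link l5, and mj blocks mi on link l2.
flows = {
    "mk": (1, {"l5", "l6"}),
    "mj": (2, {"l2", "l3", "l5"}),
    "mi": (3, {"l1", "l2"}),
}
direct, indirect = interference_sets(flows, "mi")
```

In this hypothetical configuration, mj is a direct flow of mi, while mk never meets mi on a link
and can only delay it through mj.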
[KKHL98] proposes an algorithm to compute the WCTT of flows over priority-based wormhole
networks. Compared to [Mut94, HO97], the authors use a dependency graph between flows
to identify both the direct and indirect contentions a flow may suffer. When an indirect flow
is identified, the accuracy of WCTT values is improved, as the blocking effect of indirect
flows can be modeled to occur only when other (periodic) blocking flows are released
simultaneously. Otherwise, this indirect message cannot interfere and its impact is not counted
in the computation of the upper bound. However, [Shi09] shows that this method still introduces
excessive interference to the studied flow. The example given there shows that
parallel communications can take place on disjoint links, which reduces the possible interference
from higher-priority flows.
[LJS05] proposed another improved algorithm. The defect of the approaches in [Mut94, HO97]
is that they aggregate all the links used by the studied flow and the flows that interfere with
it into a single resource. This resource must be fully reserved for the studied flow in order for
it to be transmitted. However, if two flows in conflict with the studied flow do not intersect, they
can be transferred simultaneously. Therefore, the authors in [LJS05] also distinguish direct from
indirect contentions, but by using a contention tree. This tree is further used to identify when
parallel transmissions of flows on disjoint concurrent paths can occur, which reduces the
contention interference.
Priority-based wormhole switching was also applied in the context of NoCs. [SB08] is based on
previous works and provides another method for verifying the schedulability of a set of messages.
It presents a method for computing the WCTT of flows which integrates both the direct and indirect
interference of higher-priority flows. The computation proposed by [SB08] is based on the
assumption that a message undergoes its worst-case delay if it is released at the same
time as all higher-priority messages. The authors then provide an iterative method and
separately compute the delays due to direct and indirect interference. Indirect interference
is treated as additional release jitter on the flows that directly interfere with the analyzed flow.
They also show that their method provides tighter bounds than [HO97], but it is not optimal
when simultaneous interference exists. Besides, they show that the values obtained in [LJS05] can
be optimistic, as the worst-case scenario does not occur when flows are released simultaneously.
However, the authors imply that it is not possible to obtain an exact bound in this case.
Both algorithms proposed in [SB08] and [LJS05] assume that all flows that are in indirect
contention have an impact.
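The iterative computation can be sketched as a fixed-point iteration of the kind used in
response-time analysis. The recurrence below, in which the indirect interference suffered by a
direct flow is added as a jitter term, is a simplified reading of the approach and not a
transcription of the equations of [SB08]. The parameter names (basic latency C, period T, jitter J)
and the numerical values are hypothetical.

```python
def wctt_fixed_point(c_i, direct, deadline, max_iter=1000):
    """Iterate R = C_i + sum(ceil((R + J_j) / T_j) * C_j) until it stabilizes.

    direct is a list of (C_j, T_j, J_j) integer tuples, one per
    higher-priority flow in direct contention. J_j stands for the extra
    release jitter caused by indirect interference (assumption of this
    sketch). Returns the bound, or None if it exceeds the deadline.
    """
    r = c_i
    for _ in range(max_iter):
        # -(-a // b) is an integer ceiling division.
        nxt = c_i + sum(-(-(r + j) // t) * c for c, t, j in direct)
        if nxt > deadline:
            return None          # the flow is deemed unschedulable
        if nxt == r:
            return r             # fixed point reached
        r = nxt
    return None                  # no convergence within max_iter


# Hypothetical flow set: two direct flows, the second one with jitter.
bound = wctt_fixed_point(c_i=4, direct=[(2, 10, 0), (3, 20, 5)], deadline=30)
```

With these values, the iteration stabilizes at 9 time units: the analyzed flow suffers one
instance of each direct flow.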
[Shi09] is an evolution of [SB08] that aims to reduce the number of priority levels necessary
for the proper functioning of this type of network. Indeed, virtual channels are expensive
in terms of memory and power consumption, so the authors propose to share a given priority
level between several messages. They also provide a method to compute the transmission delay
of a flow in this model. The principle of the computation is as follows: at each priority level,
all flows having this level are treated as a single message that must reserve all the links used
by the messages of the same level in order to be transmitted. The WCTT for this "aggregate message"
is then calculated according to the method presented in [SB08]. The delay for a message is
given by the sum of the delay for the "aggregate message" and the delay for the message from
the source node.
More recently, [NYP14b] analyzes the WCTT of flows of applications designed following a
partitioning approach but where runtime migrations are decided by the applications themselves. As
the paths of flows cannot be known at design time, the search space of the worst-case
scenario is larger. Compared to [SB08], a method of lower complexity but higher pessimism is
thus used.
3.1.2 BE wormhole networks and recursive methods
In contrast to a wormhole real-time network, a wormhole BE network does not provide
mechanisms dedicated to meeting real-time constraints, such as priority mechanisms. In the
context of BE NoCs, [QLD10] extends the contention tree used in [LJS05] to identify three basic
contention patterns. Indeed, the authors consider that the shared resources in the network are of
three types: the credit flow control, the links and the router input buffers. Complex
contentions are then decomposed into these patterns and Network Calculus (NC) [LBT01] is
used to compute the WCTT of flows. However, [FFF11] showed that this method leads to
over-dimensioning the resources.
[Lee03] presents a recursive algorithm, called WCFC (Wormhole Channel Feasibility Checking),
which computes a bound on the worst-case delay of a flow induced by contentions on its path
and checks that its deadline is met. This computation is based on a recursive formula that takes
into account the contentions in the routers traversed successively by a packet and the contentions
incurred by the downstream packets that block it. However, WCFC does not assume the existence of a
round-robin mechanism, as used in most BE NoC routers, and thus gives more pessimistic values of
the WCTT bound.
In [RMB+09, RMB+13] and [FFF09b], the tightness of the WCFC method [Lee03] is improved
by modeling the round-robin arbitration. Therefore, only a single flow per port in contention
can use the link before the analyzed flow. The authors in [RMB+09, RMB+13] present
two methods to compute the worst-case delay in a BE wormhole network. The first, called
LL-RTB (Real-Time Bound for Low Latency traffic), assumes that the interval between two
successive packets of the same flow is large enough that these packets do not interfere with
each other.
[Figure 3.1: flows f1, f2, f3 and f4 routed across routers r1 to r8, annotated with the recurrences:
delay(f1) = d(f1) + delay(f2) + delay(f3)
delay(f2) = d(f2) + delay(f3) + delay(f4)
delay(f3) = d(f3) + delay(f4)
delay(f4) = d(f4)]
Figure 3.1: An example illustrating Recursive Calculus method.
The second method, named HB-RTB (Real-Time Bound for High Bandwidth traffic), is
similar in principle to the first one, but is based on a different traffic hypothesis. Here, it is
assumed that successive packets of a flow are transmitted continuously, as long as the network
can support them. Furthermore, it assumes that, initially, the network is already fully saturated
by packets. This method leads to higher bound values than LL-RTB, but it is more general
because it does not require any assumption about the frequency of packets or a minimum
interval. This method was introduced simultaneously with the Recursive Calculus (RC) method
presented in [FFF09b]. They are both based on computing the worst-case delay of a packet
recursively. They analyze the contention incurred by the analyzed flow by considering that
all the intermediate buffers in the routers between the source and the destination are filled to
their capacity. Besides, they also consider that packets can be injected into the network
continuously. A packet delivery is assumed to be divided into two phases. In the first phase,
the header of the packet is routed to its destination and creates a virtual circuit between the
source and the destination. In the second phase, the whole packet is transferred. The
method recursively analyzes the contention on the path of the analyzed flow f. At each router,
the direct flows are identified and their delays are computed, thus taking into account the
indirect flows of f. As a round-robin arbitration is assumed, at each input port, the maximum
delay from the set of flows in direct or indirect contention with f is added.
Figure 3.1 illustrates an example of the computation of the WCTT using the RC method. In
this example, f1 is the analyzed flow, f2 and f3 are direct flows of f1, while f4 is an
indirect flow. f1 is blocked by f2 at r1 and by f3 at r2. Thus, the delays of f2 and f3, denoted
by delay(f2) and delay(f3), are computed recursively and added to the transmission delay of
f1, i.e. d(f1). In turn, f2 is blocked by f3 at r2 and by f4 at r5. Then, to compute delay(f2),
the delays of f3 and f4 are computed and added to d(f2). Similarly, delay(f3) and delay(f4)
are computed. Substituting the recurrences of Figure 3.1 gives delay(f3) = d(f3) + d(f4) and
delay(f2) = d(f2) + d(f3) + 2d(f4), so that delay(f1) = d(f1) + d(f2) + 2d(f3) + 3d(f4).
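The recurrences of Figure 3.1 can be expanded mechanically. The following Python sketch counts
how many times each basic transmission delay d(f) appears in delay(f1). The dictionary of blocking
flows is transcribed from the figure, and the function names are chosen for illustration. The
sketch follows the recurrences literally and does not model the per-input-port maximum applied
under round-robin arbitration.

```python
from collections import Counter


def rc_terms(flow, blockers):
    """Expand delay(flow) into a count of basic transmission delays d(f)."""
    terms = Counter({flow: 1})            # d(flow) itself
    for b in blockers.get(flow, ()):      # add delay(b) for each blocking flow
        terms.update(rc_terms(b, blockers))
    return terms


def rc_delay(flow, d, blockers):
    """Numerical value of delay(flow) for given basic delays d[f]."""
    return sum(n * d[f] for f, n in rc_terms(flow, blockers).items())


# Blocking relations read from the recurrences of Figure 3.1.
blockers = {"f1": ["f2", "f3"], "f2": ["f3", "f4"], "f3": ["f4"], "f4": []}
terms = rc_terms("f1", blockers)
```

For the flows of Figure 3.1, the expansion counts d(f1) and d(f2) once, d(f3) twice and d(f4)
three times. The recursion terminates only if the blocking relation contains no cycle.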
[FFF12] shows that, in general, NC gives tighter WCTT values than RC. However, RC is less
pessimistic than NC when the network is saturated. As in [Lee03, RMB+09, RMB+13], the RC
method supposes that a packet must reach its destination before the next one can progress.
Recently, [DNNP14] introduced a branch-and-prune approach and identified two main sources
of pessimism of the initial RC method over best-effort NoCs. This approach is based on
the RC method but gives tighter WCTT values by reducing the number of times a flow
is counted as blocking the analyzed flow. For instance, an analyzed flow could be blocked
directly by two flows fi and fj at routers ri and rj respectively. As a recursive method is
applied, if fj also blocks fi, then the delay of fj is added twice. To overcome this pessimism,
the method in [DNNP14] introduces, for each flow, constraints on the inter-release time of packets
and the number of packets that can be emitted in a given interval.
All the approaches introduced in this section assume that flows in direct contention must reach
their destinations before the analyzed flow can progress, which is pessimistic. Indeed, these
methods do not model wormhole switching at the flit granularity.
3.2 Contention-aware mapping
Several methods are used in NoCs to allocate hard real-time applications. The way in which the
hard real-time tasks that generate the flows (critical and non-critical) are mapped onto cores is
of utmost importance to control the contention over the NoC and thus the WCTT of flows.
Several contention-aware mapping strategies are proposed in the literature. We classify these
strategies into two groups: task mapping and application mapping. Task mapping covers
the strategies used to allocate a single application. It aims to reduce the congestion of the
inter-core communications belonging to this application, called internal congestion. Application
mapping covers the strategies used to allocate several applications. Its goal is to reduce not only
the internal congestion but also the congestion between applications, called external congestion.
3.2.1 Task mapping
Different strategies exist to reduce the number of contentions that a flow can experience on its
path when allocating an application on the NoC, using minimization functions. A
link-contention-aware mapping problem is formulated in [CM08], where an Integer Linear Programming
(ILP) formulation is proposed. It aims to minimize the contention that flows exhibit at each
intermediate core between their sources and their destinations. The authors first analyze the
factors that produce network contention. These factors may be source-based, destination-based or
path-based contention. Source-based contention is due to flows sharing the same source, while
destination-based contention is due to flows going to the same destination. The analysis shows that
minimizing the path-based contention, which is due to flows sharing only some links, leads to
minimizing the network contention. Thus, the authors propose an ILP formulation that minimizes the
cost function of flows that share a subset of their paths and have different source and destination
cores. The objective function has two parts: 1) a weighted communication distance term, 2)
a term emphasizing the link contention of on-chip communication. As this is an NP-complete
problem, they propose to keep only the first term and thus to reduce only the distance
between communicating tasks, which decreases the number of links shared by the inter-core
communications. The authors claim that by minimizing this distance, the packet latency will
eventually decrease.
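The two terms of such an objective function can be evaluated for a given mapping with a few
lines of Python. This is a minimal sketch and not the ILP formulation of [CM08]: it assumes XY
routing on a mesh, counts path-based contention as the number of links shared by pairs of flows
with different sources and destinations, and uses hypothetical task names, coordinates and
volumes.

```python
from itertools import combinations


def xy_links(src, dst):
    """Directed links traversed from src to dst with XY routing (X first)."""
    (x, y), (xd, yd) = src, dst
    links = []
    while x != xd:
        nx = x + (1 if xd > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != yd:
        ny = y + (1 if yd > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links


def weighted_distance(flows, mapping):
    """Sum of volume * hop count; flows are (src_task, dst_task, volume)."""
    return sum(v * len(xy_links(mapping[s], mapping[d])) for s, d, v in flows)


def path_based_contention(flows, mapping):
    """Number of links shared by flows with distinct sources and destinations."""
    total = 0
    for (s1, d1, _), (s2, d2, _) in combinations(flows, 2):
        if s1 == s2 or d1 == d2:
            continue  # source-based or destination-based, not path-based
        a = set(xy_links(mapping[s1], mapping[d1]))
        b = set(xy_links(mapping[s2], mapping[d2]))
        total += len(a & b)
    return total


# Hypothetical application with two flows and two candidate mappings.
flows = [("t1", "t2", 4), ("t3", "t4", 2)]
mapping_a = {"t1": (0, 0), "t2": (2, 0), "t3": (1, 0), "t4": (2, 1)}
mapping_b = {"t1": (0, 0), "t2": (2, 0), "t3": (1, 1), "t4": (2, 1)}
```

In this example, the second mapping shortens the path of the second flow and also removes the
link it shared with the first flow, which illustrates why reducing the distance term tends to
reduce path-based contention as well.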
The authors in [ZM12] propose a low-contention mapping algorithm for real-time applications.
They developed a TDMA-like approach to ensure a separation of communications on the NoC.
The communications are allocated within several time frames, which minimizes the
number of interfering communications. In order to minimize the contention due to inter-process
communications (IPC) sharing the same time frame, an ILP approach based on the number of
links shared by flows is used, similarly to [CM08].
[RI12] uses a genetic algorithm to explore the search space of contention-aware mappings of
tasks. A genetic algorithm is a stochastic search algorithm based on the operations of natural
genetics. Here, a fixed-size population of chromosomes evolves over a number of generations
following the principle of natural selection. Natural operations, such as crossovers and
mutations, are represented by operators that mimic this behavior. A fitness measure is associated
with each chromosome, which identifies a potential solution. In [RI12], the authors investigate
the effectiveness of genetic algorithms for static task scheduling in wormhole
Network-on-Chip-based systems. They explore the mapping of tasks as well as the priority ordering
of the task set. They define a fitness function which selects the mapping where all communications
of the application meet their deadlines. They randomly select two or five mappings (the
population). Each mapping is represented as a chromosome with two branches, giving the index of
the task and the index or coordinates of the core. Then, crossover and mutation operators are
applied to these chromosomes. Chromosomes are selected to participate in the crossover
operation, and their parts are exchanged to create new offspring. The mutation operator is
implemented by first selecting a parent chromosome, then randomly changing some of its portions.
Then, a response-time analysis of the flows of each mapping, which models contentions, is used as
the fitness function. This function measures the number of unschedulable flows, and the mapping
which presents the minimum number is chosen.
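The structure of such a search can be sketched in Python as follows. This is an illustrative
skeleton and not the algorithm of [RI12]: the chromosome encoding (a permutation of core indices,
task t being placed on the core stored at position t), the operators, and above all the fitness
function are assumptions. In particular, the schedulability test is replaced by a stand-in that
declares a flow unschedulable when its hop count exceeds a hypothetical budget, whereas [RI12]
relies on a response-time analysis.

```python
import random


def hops(chrom, a, b, width):
    """Manhattan distance between the cores hosting tasks a and b."""
    (ra, ca), (rb, cb) = divmod(chrom[a], width), divmod(chrom[b], width)
    return abs(ra - rb) + abs(ca - cb)


def fitness(chrom, flows, width):
    """Number of unschedulable flows (stand-in test: hops > budget)."""
    return sum(1 for s, d, budget in flows if hops(chrom, s, d, width) > budget)


def crossover(p1, p2, rng):
    """Keep a prefix of p1, then append the unused cores in p2 order."""
    head = p1[:rng.randrange(1, len(p1))]
    return head + [c for c in p2 if c not in head]


def mutate(chrom, rng):
    """Swap the cores stored at two randomly chosen positions."""
    i, j = rng.sample(range(len(chrom)), 2)
    child = chrom[:]
    child[i], child[j] = child[j], child[i]
    return child


def evolve(flows, width, height, pop_size=5, generations=50, seed=0):
    """Return the best mapping found; the best individual is never discarded."""
    rng = random.Random(seed)
    cores = list(range(width * height))
    pop = [rng.sample(cores, len(cores)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda c: fitness(c, flows, width))
        if fitness(pop[0], flows, width) == 0:
            break                      # every flow meets its budget
        # Breed the two best individuals and replace the worst one.
        pop[-1] = mutate(crossover(pop[0], pop[1], rng), rng)
    return min(pop, key=lambda c: fitness(c, flows, width))


# Hypothetical chain of four tasks on a 3x3 mesh, one-hop budget per flow.
flows = [(0, 1, 1), (1, 2, 1), (2, 3, 1)]
best = evolve(flows, width=3, height=3)
```

The prefix-and-fill crossover is chosen here only because it keeps every chromosome a valid
permutation, so that no two tasks share a core.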
A tree-model-based contention-aware mapping which takes the bandwidth constraints into
consideration is proposed in [YGSP12]. It abstracts the NoC into an extended tree by traversing
the tiles from the central one. The most communicating task is mapped to the root. Thereafter,
at each step, the task having the largest communication volume with the mapped tasks is
selected and mapped to one node from the next level of the tree. This node is chosen in such
a way that the bandwidth constraints are respected. Thus, like [ZM12] and [CM08], this method
minimizes the distance of the paths between communicating tasks in order to reduce potential
contentions.
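The greedy principle described above can be sketched in Python. The sketch is a simplification
under stated assumptions and not the algorithm of [YGSP12]: the tree is approximated by a
breadth-first traversal of the mesh from its central tile, tiles are consumed in that order, and
the bandwidth constraints are ignored. The task names and volumes are hypothetical.

```python
from collections import deque


def tiles_from_center(width, height):
    """Breadth-first order of the mesh tiles, starting from the central one."""
    start = (width // 2, height // 2)
    seen, order, queue = {start}, [], deque([start])
    while queue:
        x, y = queue.popleft()
        order.append((x, y))
        for n in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= n[0] < width and 0 <= n[1] < height and n not in seen:
                seen.add(n)
                queue.append(n)
    return order


def greedy_tree_mapping(volume, width, height):
    """volume[(a, b)] is the traffic exchanged between tasks a and b."""
    tasks = {t for pair in volume for t in pair}

    def traffic(t, others):
        return sum(v for (a, b), v in volume.items()
                   if (a == t and b in others) or (b == t and a in others))

    tiles = iter(tiles_from_center(width, height))
    # The most communicating task goes to the root (central tile).
    first = max(sorted(tasks), key=lambda t: traffic(t, tasks))
    mapping = {first: next(tiles)}
    while len(mapping) < len(tasks):
        mapped = set(mapping)
        # Next task: largest volume exchanged with the tasks already mapped.
        nxt = max(sorted(tasks - mapped), key=lambda t: traffic(t, mapped))
        mapping[nxt] = next(tiles)
    return mapping


# Hypothetical application of four tasks on a 3x3 mesh.
volume = {("a", "b"): 10, ("a", "c"): 5, ("c", "d"): 1}
mapping = greedy_tree_mapping(volume, width=3, height=3)
```

In this example, task a exchanges the largest volume and is placed on the central tile, and the
other tasks are placed on neighboring tiles in decreasing order of their traffic with the tasks
already mapped.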
All these approaches consider the mapping of a single application. However, a NoC interconnects a
large number of computing cores, and the hardware allows a mix of real-time
applications to be allocated. These strategies cannot be used to allocate several applications as
they do not consider the external congestion. For instance, the underlying genetic algorithms used
in [RI12] should be extended to enable the coexistence of different fitness functions, so that
mapping several applications of potentially different levels of criticality can be supported.
3.2.2 Application mapping
Several mapping strategies deal with the problem of reducing the internal and external
congestion. The selection of the mapping area of each application is of the utmost importance, as
it impacts the mapping of the other applications, as we present in this section.
These strategies divide the mapping into two phases: the application mapping and the task
mapping. The application mapping consists in finding the region where the application is
allocated. A first core is selected to define this region. The task mapping uses different
heuristics to allocate the tasks of each application: it maps a first task on the first
selected core, and different mapping heuristics are used to allocate the remaining tasks around
the first one.
[dSCCM10] and [CCM07] divide the NoC arbitrarily into clusters for the application mapping.
Each cluster is dedicated to an application. An initial task in each application is selected
and is placed inside a cluster. [COM08] enhances the definition of clusters for applications
by making them near-convex regions. Thus, generating non-contiguous regions is avoided,
which reduces the external congestion. Then, within each cluster, a congestion-aware mapping
heuristic, similar to the one in [ZM12], minimizes the bandwidth utilization of NoC links and the
distance between communicating tasks. [FRD+12] points out a problem in these methods: the first
core selection policy used when building regions leads to mapping an application over a fragmented
region.
In order to solve this problem, [FRD+12] proposes a solution, called CoNA, which selects the core
having the most available neighbors (up to 4) as the first core in the allocation. It aims to
avoid region fragmentation and thus decrease both internal and external congestion. The task
mapped onto the first core is the one with the largest number of communications. Then, the
task graph is traversed in breadth-first order from this first allocated task. The subsequent tasks
are mapped to the cores that fit into the smallest square centered on the first core. However,
CoNA cannot guarantee that the first core is always surrounded by enough free cores, as it only
considers direct neighbors when selecting the first core, and thus it can still lead to
fragmentation of areas.
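The first core selection rule, and its limitation, can be illustrated with a short Python sketch.
The representation of the free cores as a set of (x, y) coordinates, the tie-breaking rule and the
function name are assumptions made for illustration; only the criterion itself (the number of free
direct neighbors) comes from the description above.

```python
def cona_first_core(free):
    """Return the free core having the most free direct neighbors (up to 4).

    free is a set of (x, y) coordinates of unallocated cores. Ties are
    broken by coordinate order (assumption of this sketch).
    """
    def free_neighbors(core):
        x, y = core
        around = ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1))
        return sum(1 for n in around if n in free)

    return max(sorted(free), key=free_neighbors)


# A fully free 3x3 area: the central core has 4 free neighbors.
square = {(x, y) for x in (1, 2, 3) for y in (1, 2, 3)}
# A cross-shaped free area: its center also has 4 free neighbors,
# although only 5 cores are available in total.
cross = {(2, 2), (1, 2), (3, 2), (2, 1), (2, 3)}
first_in_square = cona_first_core(square)
first_in_cross = cona_first_core(cross)
```

Both areas yield the same first core with the same score, although the cross-shaped area can host
at most 5 tasks. Since the criterion only looks at direct neighbors, it cannot tell these two
situations apart.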
A mapping performance analysis has been made in [FDLP13] to show the impact of the first
core selection. The authors illustrate the drawbacks of the previously mentioned mapping
strategies. They show that these methods lead to allocating applications into fragmented regions
on the NoC, which increases the external congestion. Figure 3.2 shows an example of such a
fragmentation when applying previous methods. Figure 3.2a illustrates the mapping of two
applications, denoted App1 and App2, using the previous methods. The asterisks represent the
starting core within the different clusters defined by the method in [dSCCM10]. These previous
methods lead to an allocation of the application App3 over a fragmented region, as shown in
Figure 3.2b. Indeed, the method in [dSCCM10] and CoNA choose the core located at (2,2)
as the first core to define the region where App3 will be allocated. This core has 4 direct
neighbors, which is sufficient for it to be selected as the first core by the CoNA method. The core
(1,1) is the first core selected by the method presented in [COM08]. This core is inside a
near-convex region. However, in all cases, the first core selection policy does not take into
consideration the size of the application to be allocated. Here, App3 has a size of 11 tasks,
whereas the first contiguous region selected has an area of 9, which is lower than the size of the
application. Thus, whatever the heuristic used to allocate the tasks within this region, App3 is
fragmented on the NoC.
[Figure 3.2: two panels showing a mesh whose corner cores are (1,1), (7,1), (7,7) and (1,7);
asterisks mark the starting cores. (a) App1 and App2 are mapped. (b) App3 is additionally mapped,
over two separate regions.]
Figure 3.2: Allocating App3 using the mapping strategies presented in [dSCCM10], [FRD+12]
and [COM08].
The Smart Hill Climbing (SHiC) approach [FDLP13] finds a contiguous near-square region having a
number of free cores at least equal to the number of tasks of the application. It considers a new
metric called square-factor (SF) for selecting the first core when building a region, to
approximate the number of contiguous available cores around the selected one. For the SF
calculation of each core, each application already allocated on the NoC is defined by a rectangle
that may or may not include all the tasks of the application. The SF of a core is the maximal size
of the square area in which that core can be placed, to which the number of free cores around this
square is added. This square must not conflict with the other rectangles, i.e. the applications
already mapped. The first core is then one having an SF greater than or equal to the size of
the application to be mapped, i.e. the number of cores that are needed assuming a core can
only execute a single task. The remaining tasks are then allocated in this region as in the CoNA
method.
Figure 3.3a shows, on the example introduced previously, the first core selection policy of the
SHiC method. The core located at (6,5) has an SF equal to 13. In fact, this core is centered on a
square of area 9 and has 4 free cores around this square. Figure 3.3b shows how the SHiC method
leads to allocating App3 in a contiguous region without being fragmented.
[Figure 3.3: two panels showing the same mesh, with App1 and App2 mapped and asterisks marking the
starting cores. (a) The selected first core is labeled SF = 13. (b) App3 is mapped in a contiguous
region.]
Figure 3.3: Allocating App3 using SHiC method presented in [FDLP13].
[FRX+14] adapts SHiC so that contiguous regions are used to map critical applications, in
order to reduce contentions, while non-critical applications are mapped over non-contiguous
regions to increase the system throughput. This can lead to allocating two tasks, one from a
critical application and one from a non-critical application, on the same core. The critical
tasks then need to be scheduled with a higher priority, i.e. the non-critical tasks are suspended
as soon as critical tasks request the system resources.
3.3 WCTT and mapping baseline references
The challenges facing NoCs lie mainly in the ability to analyze the behavior of the network
in order to fulfill hard real-time requirements. Thus, analytic methods are needed to compute
the WCTT of the critical flows exchanged on the NoC. In this chapter, we have presented the
existing methods to compute the worst-case delays over wormhole networks. As we have chosen
a Tilera-like architecture in chapter 2, we focus on the methods used over best-effort
networks. The method presented in [DNNP14] considers all the direct and indirect contentions from
which a flow suffers. It reduces the pessimism of the existing recursive methods by adding
assumptions at the application level. However, we focus first on transmission properties at the
network level to reduce the pessimism of the existing methods. Thus, in this work, we refer to the
Recursive Calculus (RC) method, presented in [FFF09b], as it is the best method that can be
used with our assumed architecture.
Besides, the mapping of the tasks of the different applications on the NoC impacts the
delays of the flows. Several mapping strategies are proposed in the literature to reduce
the congestion on the NoC, which reduces the WCTT of the flows generated by these tasks.
These strategies consider the allocation of either a single application or several applications. A
contention-aware mapping should reduce both the internal and external congestion when
allocating different applications on the NoC. The SHiC method, introduced in [FDLP13], is the
best multi-application congestion-aware mapping approach that reduces both the internal and
external congestion of the inter-core communications. Thus, in this work, we refer to the SHiC
method as the reference contention-aware mapping.
The next chapter illustrates the problems that arise when integrating I/O constraints within the
NoC communications. We detail the drawbacks of existing mapping strategies and computation
methods.
Chapter 4
Managing I/O in many-cores:
problems and approach
Contents
4.1 System architecture and assumptions
4.1.1 Description of I/O in Tilera
4.1.2 Model of NoC architecture and assumptions
4.2 An avionic case study
4.3 Problem illustration
4.3.1 Is improving the computation of the WCTT sufficient?
4.3.2 Is a contention-aware mapping strategy the solution?
4.4 Proposal
In embedded real-time systems, many-core architectures could be used as processing elements
within a backbone Ethernet network, as these architectures provide a significant number of
I/O interfaces, such as Ethernet interfaces and memory controllers. Besides, it is
possible to allocate a number of applications of different levels of criticality on NoC-based
many-core architectures. These applications communicate with sensors and actuators via the
Ethernet interfaces. Therefore, in this context, a NoC supports different types of
communications, either between cores or between cores and I/O interfaces. The real-time constraints
concern not only the inter-core communications but also the I/O communications. In chapter
3, we have presented the analytical methods to compute the WCTT of flows on different NoC
architectures and the strategies to reduce the congestion on the NoC. However, these existing
works only focus on the inter-core communications and do not consider the I/O part. The
objective of this chapter is to present the problems from which core-to-I/O flows suffer when
using NoCs in real-time systems connected to a number of sensors and actuators via Ethernet.
Thus, in this chapter, we first describe the I/O interfaces in the Tilera NoC.
We illustrate how communications are carried out between cores and I/O interfaces. Besides, we
present the model we assume in this work. Finally, we illustrate the problems on a case study
made of several real-time applications and show the limitations of existing works.
4.1 System architecture and assumptions
In this section, we first describe the I/O interfaces of the Tilera NoC and illustrate the
communications between cores and I/O interfaces. We then present our architectural model,
based on the Tilera NoC, and the assumptions considered in this work.
4.1.1 Description of I/O in Tilera
A Tilera NoC interconnects cores with Ethernet and DDR-SDRAM memory interfaces that are
located on its edges. In the remainder of this thesis, the term DDR refers to the DDR-SDRAM
memory. In a Tilera NoC, a tile contains only one core. Identical controllers are in general
spread around the NoC in opposite cardinal directions. For instance, in the Tilera Tile64, two
memory controllers are located to the north of the NoC and two to the south, as illustrated in
Figure 4.1. The Tile64, however, provides Ethernet controllers only on the east side of its
NoC. Finally, each I/O interface is accessed through specific ports from the cores adjacent to
it. For instance, each memory controller of the Tile64 has 3 ports: the 6 central columns (out
of 8) of the Tile64 are connected, at both their north and south edges, to a port of a memory
controller. Each Ethernet controller of the Tile64 is connected to 2 ports.
Figure 4.1: An overview of a Tilera architecture.
Now, let us see how an Ethernet frame is transmitted on the NoC. Figure 4.2 illustrates the
different steps of the transmission of ingress data flows. This transmission proceeds as
follows:
1. The header and the payload of an Ethernet frame are separated into two buffers.
2. The header is transmitted to a specific tile, which determines where the payload will be
sent.
3. The payload can then be sent to the destination core in two ways, depending on where the
data will be stored:
(a) First possibility
i. The data is sent to the DDR memory.
ii. Then, the DDR memory sends the data to the destination tile and sends back an ACK to
the I/O interface.
(b) Second possibility
i. The data is sent directly to the destination tile.
Figure 4.2: The different steps of the ingress data flows.
This transmission is subject to several constraints. First, the buffer of the Ethernet
interface that stores the payload has a limited capacity: 2 KB for a Gigabit Ethernet
interface. Second, the maximal size of NoC packets is limited and depends on the network used
by these packets. Tilera provides five dynamic networks to support different types of flows.
The transmission of ingress data flows can use two of them: the IDN (I/O Dynamic Network) and
the MDN (Memory Dynamic Network). The IDN is primarily used to exchange data between I/O
interfaces, and between I/O interfaces and tiles. It supports NoC packets of at most 128 flits.
The MDN supports data transfers among tiles and I/O interfaces, and between tiles and the DDR
memory. It supports packets of at least 2 flits and at most 19 flits. Thus, a core can receive
data directly from the Ethernet interface via the IDN (step 3b) or through an intermediate
memory controller via the MDN (step 3a). The header, for its part, is transmitted to a specific
tile via the IDN (step 2). Moreover, a flit has a size of 32 bits, so an Ethernet frame is
generally several times larger than a NoC packet. Several NoC packets are therefore needed to
transmit an Ethernet payload to a tile.
A similar process is used for egress data flows, where data is sent to the Ethernet interface
either directly via the IDN or through the DDR memory via the MDN.
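The packet size limits given in flits can be converted to bytes with a short calculation. The
following Python sketch is only illustrative: the constant and function names are ours, the
1500-byte payload is taken as an example, and the packet counts are lower bounds that ignore
any per-packet header flits, which this section does not specify.

```python
import math

FLIT_BYTES = 32 // 8                    # a flit is 32 bits
IDN_MAX_FLITS = 128                     # largest IDN packet
MDN_MIN_FLITS, MDN_MAX_FLITS = 2, 19    # MDN packet size range


def flits_to_bytes(flits: int) -> int:
    """Size in bytes of a packet made of the given number of flits."""
    return flits * FLIT_BYTES


def min_packets(payload_bytes: int, max_flits: int) -> int:
    """Lower bound on the NoC packets needed for a payload (no header flits counted)."""
    return math.ceil(payload_bytes / flits_to_bytes(max_flits))


idn_max_bytes = flits_to_bytes(IDN_MAX_FLITS)             # 512 bytes
mdn_min_bytes = flits_to_bytes(MDN_MIN_FLITS)             # 8 bytes
mdn_max_bytes = flits_to_bytes(MDN_MAX_FLITS)             # 76 bytes
mdn_packets_for_1500 = min_packets(1500, MDN_MAX_FLITS)   # at least 20 packets
idn_packets_for_1500 = min_packets(1500, IDN_MAX_FLITS)   # at least 3 packets
```

Even under this optimistic count, a 1500-byte payload needs about twenty MDN packets, which is
why the time spent by these packets on the NoC matters for the Ethernet buffer.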
4.1.2 Model of NoC architecture and assumptions
NoC Model
We model a Tilera-like NoC as a mesh network that interconnects L×W routers, as illustrated
in Figure 4.3. A core of the NoC is identified by its (x,y) coordinates, and we assume that the
core (1,1) is located at the bottom right of the NoC. The NoC interconnects cores, Ethernet
and DDR interfaces.
We assume that a set of tasks is statically allocated over a NoC. These tasks exchange various
payloads. A payload is modeled as a flow, and we denote by F the set of m flows of an
application. Each flow fi (i = 1..m) is made of successive packets that are transmitted on the
NoC.
We consider the following assumptions related to the model of the NoC architecture, the
transmission mode and the mechanisms used in this NoC:
A1. A mesh NoC whose tiles each contain a single core. We illustrate such a NoC on a 7 × 7
grid.
A2. A core can execute only one task.
A3. The packets are divided into flits of a fixed size of 32 bits.
A4. The transfers are done only on the MDN network; thus, the packets are made of between 2
and 19 flits.
A5. The packets are injected at a given inter-arrival time that prevents contentions with
previous packets of the same flow.
A6. The mechanisms used on this NoC are wormhole switching, Round-Robin arbitration,
credit-based flow control and XY routing.
A7. The destination router immediately consumes any arriving flit and forwards a credit back
to the previous router.
A8. The latency for a flit to be read from an input buffer, traverse the crossbar, and reach the
storage at the input of a neighboring switch is a single cycle.
A9. A worst-case scenario occurs when each NoC packet is blocked at each router by all the NoC
flows that it can encounter.
A10. Since we consider worst-case scenarios and in order to ensure an upper bound on the WCTT
of flows, we assume that the input buffer of each router has a size of a single flit. Indeed,
when the size of the buffer is reduced, the flits of each packet are spread over more routers,
which can increase the contentions on the network.
Figure 4.3: A Tilera-like NoC architecture.
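To make the XY routing of assumption A6 concrete, the following Python sketch computes the
routers visited by a flow and lists the links that two flows have in common, which is where
they can contend for the same output. It is a minimal illustration, not the analysis used in
this work: the function names are ours, the end points are hypothetical, and the real
contention analysis also involves arbitration and flow control.

```python
from typing import List, Tuple

Coord = Tuple[int, int]


def xy_route(src: Coord, dst: Coord) -> List[Coord]:
    """Routers visited from src to dst: move along X until aligned, then along Y."""
    x, y = src
    path = [(x, y)]
    step_x = 1 if dst[0] > x else -1
    while x != dst[0]:
        x += step_x
        path.append((x, y))
    step_y = 1 if dst[1] > y else -1
    while y != dst[1]:
        y += step_y
        path.append((x, y))
    return path


def shared_links(path_a: List[Coord], path_b: List[Coord]) -> List[Tuple[Coord, Coord]]:
    """Directed links used by both paths, i.e. candidate points of direct contention."""
    links_a = set(zip(path_a, path_a[1:]))
    return [link for link in zip(path_b, path_b[1:]) if link in links_a]


# Hypothetical end points chosen for illustration only.
route_a = xy_route((1, 6), (4, 4))
route_b = xy_route((2, 6), (4, 6))
common = shared_links(route_a, route_b)
```

Here route_a first travels along row 6 up to column 4 and then down to row 4, and it shares
the links (2,6) to (3,6) and (3,6) to (4,6) with route_b.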
I/O Model
In this work, we assume that the NoC is connected to two memory controllers, one located to
the north and the other to the south of the NoC, as shown in Figure 4.3. As in Tilera, we
assume that each controller can be accessed from the edges of columns 2 to L − 1 of the NoC,
through ports 1 to n. We therefore have L ≥ 3. On the Ethernet side, we consider that a set of
at least two Ethernet controllers is connected to the east of the NoC, as in Tilera. Similarly
to the memory ports, we assume that these Ethernet controllers can be accessed from the east
of rows 2 to W − 1 of the NoC. We therefore have W ≥ 4, and the maximum number of Ethernet
controllers is W − 2. Ethernet interfaces can therefore be shared between several applications
mapped on the NoC. Note that if W − 2 is greater than the actual number of Ethernet
controllers, the Ethernet controllers are separated by some rows of cores. Each Ethernet
controller can be accessed through one port from the core adjacent to it.
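The placement constraints of this I/O model can be summarized in a few lines of Python. This
is a sketch under our reading of the text (L columns, W rows, one memory port per column from
2 to L − 1, hence n = L − 2); the function name and the returned keys are ours.

```python
def io_layout(L: int, W: int) -> dict:
    """Positions allowed for I/O ports on an L x W mesh under the model above."""
    if L < 3:
        raise ValueError("L must be at least 3 so that columns 2..L-1 exist")
    if W < 4:
        raise ValueError("W must be at least 4 to host at least two Ethernet controllers")
    return {
        "ddr_port_columns": list(range(2, L)),   # columns 2..L-1, north and south
        "ethernet_rows": list(range(2, W)),      # rows 2..W-1, east side
        "max_ethernet_controllers": W - 2,
    }


layout = io_layout(7, 7)   # the 7 x 7 grid used as illustration in this chapter
```

For the 7 × 7 grid, this gives memory ports on columns 2 to 6 and at most 5 Ethernet
controllers on rows 2 to 6.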
Hence, the assumptions used to model the I/O interfaces and the transmission of the I/O flows
are the following:
A11. 3 Ethernet controllers are located to the east, at the second, fourth and sixth rows of
the 7 × 7 grid, i.e. y = 2, 4 and 6. The DDR memories located to the north and to the south
are accessed directly from the second to the sixth column, i.e. x = 2 to x = 6.
A12. Single Gigabit Ethernet interfaces are considered, where the buffer of each interface has
a size of 2 KB.
A13. The transfer of data between I/O interfaces and tiles follows the first possibility
described in Section 4.1.1, i.e. first from the Ethernet interface to the DDR memory and then
from the DDR memory to the destination core. Thus, the WCTT of a core-to-I/O flow is the WCTT
of this flow to reach the DDR memory plus the WCTT of the flow from the DDR memory to the
destination core. In the remainder of this work, we focus on the WCTT of the core-to-I/O flow
to reach the DDR memory, since this WCTT directly impacts the state of the buffer of the
Ethernet interface.
A14. The Ethernet controller sends the data to the nearest memory, through the first DDR
port that is not used by another Ethernet controller. The DDR, however, sends the data to the
destination core from the port located in the same column as this core. For example, in
Figure 4.3, if the south Ethernet controller located at (0,2) sends data to the core (L, 1), it
first sends the data to port 1 of the south memory, and the data are then transmitted by port n
to the core (steps 3a.i and 3a.ii of Figure 4.2). If, instead, the Ethernet controller located
at (0,4) wants to send these data, it first sends them to port 2 of the south memory.
A15. A payload stored in the buffer of the Ethernet interface is removed only when all its
corresponding NoC packets have been received by a port of the DDR memory.
A16. The Ethernet frames have a Maximum Transmission Unit (MTU) of 1500 bytes.
4.2 An avionic case study
As said previously, a NoC integrates a mix of real-time applications with different levels of
criticality, which communicate with a number of sensors and actuators via Ethernet interfaces.
In the remainder of this work, we consider only two levels of criticality: critical and
non-critical. Flows can have various requirements depending on the applications. For instance,
critical applications have strict latency requirements and often small payloads, as these
consist of sensor data and actuator commands. Larger payloads can be exchanged by less
critical applications, which receive data from a larger number of sensors. However, the
latency requirements of these flows are less strict than those of critical applications.
Critical applications
The critical applications considered in this work are based on: (1) the Full Authority Digital
Engine Control (FADEC) application, (2) the Research Open-Source Avionics and Control
Engineering (ROSACE) application.
FADEC is an application that controls all aspects of aircraft engine performance. It receives
small payloads of sensor data from an engine. We thus consider that it receives 1500 bytes
of data from the Ethernet interface. These data are then divided and distributed to n tasks,
noted tf0 to tfn. These tasks, except tf6, exchange 250 bytes of data with each other. All
these tasks also send 250 bytes of data to the task noted tf6. Then, tf6 stores 130 bytes in
the DDR memory through a DDR interface and sends back 76 bytes of actuator data through the
same Ethernet interface.
Figure 4.4 shows a graph of the core-to-core and core-to-I/O communications between the tasks
of the FADEC application, assuming 7 tasks. As the control of an aircraft engine is complex,
the number of tasks composing this application can be increased. Thus, different instances of
this application are considered and noted FADECn, where n corresponds to the number of
tasks composing the application.
Figure 4.4: Task graph of core-to-core and core-to-I/O communications of the FADEC
application assuming 7 tasks.
ROSACE is the second critical application, taken from the case study introduced in [PSG+14].
It manages the longitudinal motion of a medium-range civil aircraft in the en-route phase.
ROSACE is composed of 10 tasks. Figure 4.5 presents the longitudinal flight controller
architecture, which is divided into two parts: the environment simulation and the controller.
The environment simulation represents the aircraft as well as the engines and elevators that
are to be controlled. The controller is composed of 5 filters and 3 controllers. We modify this
application so that it communicates with the I/O interfaces: Figure 4.6 shows the task graph
that we thus assume. In the original application, the data exchanged between the tasks have
small payloads (about tens of bytes). In this work, as we consider data coming from an Ethernet
interface whose buffer can store larger payloads (2 KB), we increase the size of the exchanged
data so as to be aligned with the FADEC application. Hence, we consider that a payload of
600 bytes is transmitted via an Ethernet frame. These data are divided and distributed to 5
tasks: hf, azf, vzf, qf and vaf. Each of these tasks then sends 120 bytes of data to vzc, to
vac, or to both of them. Finally, vzc and vac send these data to eng and elev respectively,
which then transfer them to the DDR in order to be sent back to the Ethernet interface. This
critical application is considered because its requirements differ from those of FADEC: it
presents less-constrained communications.
Figure 4.5: ROSACE case study, extracted from [PSG+14].
Non-critical applications
The non-critical applications considered are based on: (1) a Health Monitoring (HM)
application, (2) a Fast Fourier Transform (FFT) application.
HM applications are signal processing applications used to recognize incipient failure
conditions of engines. An HM application continuously receives, from a large number of sensors
through an Ethernet interface, sets of frames representing data to be processed in order to
anticipate engine failures. The size of a frame is 130 KBytes and a set is made of 30 frames.
When a set of frames is received, every two frames are assigned to a different task amongst n
tasks, noted th0 to thn. While the processing takes place, task thi also sends 112 bytes of
data to thi+1, with i ∈ [0, n]. Finally, all these tasks finish their processing by storing
their frames into the memory. Figure 4.7 shows the task graph of the application HM6, which is
composed of 6 tasks. The HM application is characterized by its high number of communications
with the DDR, where its large payloads are stored. HM is a signal processing application that
can easily be parallelized on an arbitrary number of cores. For this reason, we consider
different instances of the HM application.
Figure 4.6: Task graph illustrating ROSACE case study.
Figure 4.7: Task graph illustrating HM application assuming 6 tasks.
The FFT application is widely used in signal processing and embedded control [DPPB+12].
This application consists of different stages in the computation of the FFT, as shown in
Figure 4.8. Each core exchanges data with another one and performs some local computation.
In stage 5, the result computed by each core is gathered on one tile. In this work, we focus
only on this stage, as it presents a large number of communications, as illustrated in
Figure 4.9. Tasks of this application can exchange data of different sizes. In this work, we
thus consider arbitrary data sizes, and we suppose that FFT receives 750 bytes of data from the
Ethernet interface in order to be aligned with the applications introduced previously. These
data are distributed to 15 tasks, noted tff0 to tff14. Each of these tasks performs some
computing operations and sends its results to the task tff15. We vary the size of the data
sent by these tasks from 8 bytes to 76 bytes, which correspond to the minimal and maximal
sizes of the NoC packets (refer to A4).
Figure 4.8: The 6 FFT communication stages.
Considered case study A
A mix of these realistic applications is considered to illustrate the problem that arises when
the NoC is connected to sensors and actuators via Ethernet interfaces. The considered case
study, noted A, is made of the following applications: FADEC9, FFT16, and two instances of HM
noted HM11 and HM12. Figure 4.10 shows an arbitrary mapping of this case study. The 3 × 3
square whose lower left corner is located at (7,1) defines the region where FADEC9 is mapped.
The 4 × 3 rectangle whose lower right corner is located at (1,1) defines the region where HM12
is mapped. HM12 and FADEC9 therefore use the same Ethernet interface, located at (0,2), as
shown in Figure 4.10. The 3 × 4 rectangle defined by its upper left corner located at (7,7)
and the 4 × 4 square whose upper right corner is located at (1,7) define respectively the
regions where HM11 and FFT16 are mapped. Thus, they use the same Ethernet interface, located
at (0,6). In this case study, we consider that the Ethernet interface at (0,4) is not used.
Figure 4.9: Task graph illustrating FFT application assuming 16 tasks.
4.3 Problem illustration
Different types of communications are exchanged on the NoC: core-to-core and core-to-I/O.
The core-to-I/O flows, however, experience a change in speed as they traverse two networks
of different types: Ethernet and the NoC. An Ethernet frame entering the NoC is first stored
in a buffer of limited capacity (A12). It is then divided into a number of NoC packets in
order to be transmitted on the NoC, because the maximum packet size differs between the two
networks (A4, A13 and A16). However, if a next frame arrives at the same interface, and given
assumption A15, the question is: would the buffer overflow and thus lead to dropping the
frame? To answer this question, the WCTT of the core-to-I/O flows on the NoC must be analyzed.
Let us first illustrate this problem on the considered case study A.
Figure 4.10: Arbitrary mapping of a case study made of one FADEC, one FFT and two HM
applications.
Problem identification
Let us first focus on the steps and the timing of the core-to-I/O flow coming from the Ethernet
interface (0,6) for HM11. We suppose that an Ethernet frame of HM11 is transmitted before
a frame of FFT16. When an Ethernet frame for HM11 arrives at the Ethernet interface, it is
stored into the Ethernet buffer in 12.336 µs (transmission of 1500 bytes of payload at 1 Gb/s).
Since the size of the payload of the Ethernet frame for HM11 is 1500 bytes and the Ethernet
buffer size is 2 KB (A12), the Ethernet buffer can then store an additional Ethernet frame of
only 500 bytes. The size of an FFT Ethernet frame is however 750 bytes. The HM frame must
therefore have been transmitted, through the NoC, to the memory before the FFT frame can
be stored (A13). This means that the WCTT of the HM core-to-I/O flow to reach the memory
must be less than the arrival delay of the FFT frame at the Ethernet interface. We recall that
the HM core-to-I/O flow is composed of the NoC packets corresponding to the HM frame. Due to
the maximum packet size over the NoC (A4), the HM frame is divided into 20 packets of 19 flits
followed by one packet of 15 flits. Each of these packets is blocked by the FFT16 flows that
it can encounter (A9). As an example, we consider that the core-to-core communications of the
FFT application have a size of 8 bytes, which is equivalent to 2 flits.
The WCTT of an individual HM packet on the NoC is computed with the recursive calculus (RC)
method presented in Section 3.1.2 of Chapter 3. Using the RC method, an HM packet made of
19 flits takes tp = 400.775 ns to reach the memory. The analyzed flow, noted fa, i.e. a packet
of an HM core-to-I/O flow, can indeed be blocked by direct and indirect flows. Figure 4.11a
shows the possible blocking flows of the analyzed flow fa. This figure presents one direct
blocking flow, f1, and seven indirect flows, f2 to f8. At each router, RC identifies the
direct blocking flows and adds their delays. The value tp returned by RC also takes indirect
flows into account. For instance, f1 is the direct flow blocking fa. The flows that block f1
on its path are identified: f2, f3, f4, f5, f6, f7 and f8. The delays of these flows are also
computed and added to the delay of f1. Recursively, for each of these flows, blocking flows
are identified and their delays are also added. The blocking delay of f1 is therefore:
delay(f1) = d(f2) + 2d(f3) + 4d(f4) + 4d(f5) + 12d(f6) + 12d(f7) + 36d(f8),
where d(fi) is the transmission delay of flow fi to reach its destination. Therefore,
tp = d(f1) + d(fa) + delay(f1).
Thus, if all the packets of the HM core-to-I/O flow experience their WCTT on the NoC, the
global WCTT for the HM core-to-I/O flow is t1 = 8.407 µs. However, the transmission of the FFT
frame on Ethernet takes t2 = 6 µs. As t1 > t2, this WCTT leads to dropping the FFT frame. Let
us see why we obtain this negative result.
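The frame sizes and Ethernet timings used in this discussion can be checked with a small
Python sketch. It rests on two assumptions of ours that the text does not state: one flit of
each 19-flit packet is reserved for a header (header_flits=1), and the 12.336 µs figure
includes framing overhead on top of the 1500-byte payload. All function names are
illustrative.

```python
import math

FLIT_BYTES = 4      # 32-bit flits (A3)
MAX_FLITS = 19      # largest MDN packet (A4)


def split_payload(payload_bytes: int, header_flits: int = 1):
    """Return (full_packets, leftover_payload_flits).

    header_flits is an ASSUMPTION: flits of each packet that carry no payload.
    """
    payload_flits = math.ceil(payload_bytes / FLIT_BYTES)
    per_packet = MAX_FLITS - header_flits
    return divmod(payload_flits, per_packet)


def ethernet_time_ns(bytes_on_wire: int, rate_gbps: int = 1) -> float:
    """Transmission time in ns (1 Gb/s is one bit per nanosecond)."""
    return bytes_on_wire * 8 / rate_gbps


def bytes_for_time_ns(time_ns: int, rate_gbps: int = 1) -> float:
    """Number of bytes transmitted during time_ns."""
    return time_ns * rate_gbps / 8


full, leftover = split_payload(1500)      # 20 full packets, 15 payload flits left
t_fft_ns = ethernet_time_ns(750)          # 6000 ns, i.e. t2 = 6 us
wire_bytes = bytes_for_time_ns(12336)     # 1542 bytes for 12.336 us
overhead = wire_bytes - 1500              # 42 bytes beyond the payload
```

Under the header assumption, the split matches the 20 full packets followed by a shorter one
with 15 flits of payload. The 6 µs of the FFT frame corresponds to its 750 bytes alone, while
12.336 µs corresponds to 1542 bytes, i.e. 42 bytes more than the 1500-byte payload.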
4.3.1 Is improving the computation of the WCTT sufficient?
The second part of Figure 4.11a shows the positions of the flits of the different flows at the
moment when the flow f3 is no longer blocked by the flows f4 and f5 at the router (4,6). We
note that f3 blocks f1 directly and fa indirectly. Starting from this state, Figure 4.11b
shows the timeline of the pipelined transmission of the different flits of these flows. Each
line represents a router: once a flit is transmitted to the next router, a credit is also
transmitted to the previous router. These transmissions are represented on the timeline by
oblique lines. At each router, we show the time of passage of each flit and its delay, which is
represented by a small rectangle. A flow is blocked at a router by the different flows that
share the same output, according to the RR arbitration (A6). A blocked flow is represented on
the timeline by a line from a router to the next one that is not followed by its flit delay,
i.e. the small rectangle. At t1, the second flit of the direct blocking flow f1 leaves the
router located at (2,6) and is then blocked at the router (3,6). At this moment, the input
buffer of the router (2,6) is thus free and fa can progress since it is no longer blocked. fa
reaches its destination at t2 = 277 ns without being affected by the transmission of the
indirect blocking flows, while f1 progresses slowly, waiting for its blocking flows to
progress. Thus, f1 reaches its destination at t3 = 314 ns. In the recursive methods, however,
fa cannot progress until f1 reaches its destination located at the router (4,4).
Thus, recursive methods do not consider the pipelined transmission, which leads to an
over-approximation of the WCTT. The delay of fa is made dependent on the remaining distance
between its destination and the destination of the flows in contention. Moreover, these
methods always consider all the indirect flows as influential.
Let us now compute the WCTT of the HM12 core-to-I/O flow when its frame is transmitted
before the FADEC9 frame. We recall that these 2 applications share the second Ethernet
interface, located at (0,2). The Ethernet frame for HM12 arrives at the Ethernet interface and
is stored into the Ethernet buffer in 12.336 µs. Each packet corresponding to the HM payload is
blocked by the HM flows. According to the RC method, the WCTT of an individual HM packet on the
NoC is 807.3 ns. The global WCTT of the HM12 core-to-I/O flow is therefore 16.944 µs.
However, the transmission of the FADEC frame on Ethernet also takes 12.336 µs, as its payload
size is 1500 bytes. As the FADEC frame reaches the Ethernet interface before the removal of
the HM frame and its size is greater than the free space in the Ethernet buffer, the FADEC
frame is dropped.
Moreover, even when modeling the real pipelined transmission, as done above for the FFT
application, the WCTT of the HM frame (16.5 µs) still leads to dropping the FADEC frame.
Therefore, we need another strategy to reduce this WCTT.
Figure 4.11: (a) Core-to-core blocking flows of HM core-to-I/O flow and their flits position,
(b) Transmission timeline of core-to-core and core-to-I/O communications. The timeline of (b)
marks t1 = 202 ns, t2 = 277 ns and t3 = 314 ns.
Figure 4.12: SHiC mapping of the case study A made of one FADEC, one FFT and two HM
applications.
4.3.2 Is a contention-aware mapping strategy the solution?
We know that changing the mapping of the tasks of the applications modifies the WCTT of the
different flows. Thus, let us see what happens when we modify the mapping of the FFT and HM
applications using the SHiC method, a congestion-aware mapping presented in Section 3.2.2 of
Chapter 3. Figure 4.12 shows the mapping of our case study obtained by applying the SHiC
method. We note that in this mapping, the regions where the applications are allocated are not
modified compared to the arbitrary mapping proposed before. Only the mapping of the tasks
within each region changes. SHiC first chooses the core located at (6,2), which has a Square
Factor (SF) equal to 16, as this core is centered in a square of area 9 and has 7 cores
available at its borders. On this core, SHiC allocates the task of FADEC9 that has the maximum
number of communications, i.e. tf0. The other tasks communicating with tf0 are allocated to
the neighboring cores, forming the smallest square that includes tf0. The same process is
applied to the next applications: the cores at (2,2), (3,6) and (6,5) are the first cores used
to start the allocation of HM12, FFT16 and HM11 respectively, with SF = 16, 20 and 12.
We now compute the WCTT of the HM11 core-to-I/O flow coming from the Ethernet interface
at (0,6). The global WCTT of the HM core-to-I/O flow blocked by the FFT communications,
computed with the recursive calculus method, is t1 = 1.681 µs. Thus, the HM frame is removed
from the Ethernet buffer before the next FFT frame, which takes t2 = 6 µs to reach the
Ethernet interface, has to be stored: the FFT frame is not dropped. However, if we consider
that packets of 15 flits are exchanged between the FFT tasks, the global WCTT of the HM
core-to-I/O flow increases to t1 = 6.762 µs. t1 is now greater than t2, so this WCTT leads to
dropping the FFT frame. The problem of dropping the FFT frame is thus only partially solved.
Besides, this problem remains when analyzing the WCTT of the core-to-I/O flow of HM12,
whose frame is transmitted before that of FADEC9. The values are the same as those obtained
with the arbitrary mapping.
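The drop condition used throughout this section can be written as a small check and applied to
the values reported so far. The Python sketch below is a simplified model of ours, not the
analysis itself: it considers a single frame already stored in the buffer, treats 2 KB as
2000 bytes as the text does, and uses illustrative names (Scenario, next_frame_dropped).

```python
from dataclasses import dataclass

BUFFER_BYTES = 2000   # the text treats 2 KB as 1500 + 500 bytes


@dataclass
class Scenario:
    name: str
    wctt_us: float          # WCTT of the stored frame to reach the DDR memory
    next_arrival_us: float  # arrival delay of the next Ethernet frame
    stored_bytes: int       # payload already in the Ethernet buffer
    next_bytes: int         # payload of the next frame


def next_frame_dropped(s: Scenario) -> bool:
    """The next frame is dropped if it neither fits nor finds the buffer drained."""
    fits_alongside = s.stored_bytes + s.next_bytes <= BUFFER_BYTES
    drained_in_time = s.wctt_us < s.next_arrival_us
    return not (fits_alongside or drained_in_time)


scenarios = [
    Scenario("arbitrary, HM11 then FFT16 (RC)", 8.407, 6.0, 1500, 750),
    Scenario("arbitrary, HM12 then FADEC9 (RC)", 16.944, 12.336, 1500, 1500),
    Scenario("arbitrary, HM12 then FADEC9 (pipeline)", 16.5, 12.336, 1500, 1500),
    Scenario("SHiC, HM11 then FFT16 (RC)", 1.681, 6.0, 1500, 750),
    Scenario("SHiC, HM11 then FFT16, 15-flit FFT packets (RC)", 6.762, 6.0, 1500, 750),
]
results = {s.name: next_frame_dropped(s) for s in scenarios}
```

With these values, only the SHiC mapping with the small FFT packets avoids the drop, which is
the partial improvement discussed above.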
Need for a congestion-aware mapping considering the core-to-I/O flows
The only goal of SHiC and of related mapping strategies is to reduce the congestion
experienced by core-to-core flows. They therefore do not consider the locations of the I/O
interfaces within the NoC when mapping applications. However, we saw that these interfaces can
be shared between several applications. In this case study, the critical application FADEC is
allocated far from the Ethernet and DDR controllers. So, a core-to-I/O flow experiences a delay
due to the non-critical application HM, which presents a high number of communications with
large payloads, especially with the DDR memory. Besides, the internal mapping of the
applications also influences the WCTT of this core-to-I/O flow. In fact, SHiC, like most
existing strategies, allocates the task with the highest number of communications
approximately at the center of the square where the application is mapped. Thus, for the FFT
application, as illustrated in Figure 4.12, it allocates the task tff15 to the core located at
(3,6), having SF = 16. As this task receives data from all the other tasks, the core-to-I/O
flow is blocked directly by the flow generated by tff1 with tff15 as destination. However, this
flow is itself blocked by other flows, such as the flows coming from tff11, tff12, tff13 and
tff14, which induce indirect contentions with the core-to-I/O flow. Figure 4.13 illustrates
these flows blocking the core-to-I/O flow directly and indirectly. We note that the flows
generated by tff11, tff13 and tff14 directly block the one generated by tff1 at the router
(3,6), as they share the same destination. The flow coming from tff12 blocks the one generated
by tff1 as they share the same link. These direct and indirect contentions with the
core-to-I/O flow lead to an increase in its WCTT. This internal mapping strategy is therefore
not appropriate for reducing the contention on the core-to-I/O flows, as it does not consider
these flows when mapping the tasks. In order to reduce the WCTT of a core-to-I/O flow and avoid
dropping incoming I/O packets, the number of contentions that a core-to-I/O flow experiences
should instead be reduced.
Figure 4.13: FFT flows blocking the HM core-to-I/O flow by considering SHiC mapping.
Thus, we need to define an application mapping strategy that treats core-to-I/O flows on a
Tilera-like NoC as first-class citizens, so that the contentions they experience, and thus
their WCTT, are reduced and the aforementioned problem is avoided.
4.4 Proposal
NoC-based many-cores architectures can be used as processing elements within a backbone
Ethernet network, as they provide different I/O interfaces. Thus, different types of
communications can exist in the NoC: core-to-core and core-to-I/O. The core-to-I/O flows
experience a change in speed when crossing two types of networks: Ethernet and the NoC. When
NoC congestion occurs on the paths taken by core-to-I/O flows, their estimated WCTTs can be
higher than the arrival delay of the next incoming Ethernet frame. In this case, this next
Ethernet frame may be dropped due to the lack of space in the Ethernet buffer, which is of
limited capacity. Previous Ethernet frames may indeed still be stored in this buffer, waiting
for the associated NoC packets to progress towards the DDR interface. For this reason, the
WCTT of core-to-I/O flows should be analyzed on the NoC.
In this chapter, we illustrated this problem using a case study made of critical and
non-critical applications with different requirements. We then showed the need to reduce the
pessimism of the current state-of-the-art method, i.e. Recursive Calculus, when computing the
WCTT of a core-to-I/O flow. In fact, this analysis method does not take advantage of the
pipelined transmission of wormhole switching. However, adding such a capability to RC is not
sufficient to avoid dropping Ethernet frames. To this end, we need to change the mapping to
reduce the WCTT of the core-to-I/O flows. Nevertheless, existing mapping strategies only aim
at reducing the contention of core-to-core communications. We illustrated that such a mapping,
i.e. the SHiC mapping, does not consider the contention on the path of core-to-I/O flows,
which leads to dropping critical and non-critical Ethernet frames. Thus, a static mapping
strategy for critical and non-critical real-time flows that reduces the WCTT of core-to-I/O
communications over a Tilera-like NoC is needed. The next chapters present our approaches to
avoid the aforementioned problem, i.e. dropping Ethernet frames. In Chapter 5, we describe
our analytical method to reduce the pessimism when computing the WCTT of flows. In Chapter 6,
we present our mapping strategy, which takes into account both core-to-I/O flows and
core-to-core communications in order to reduce the WCTT of the core-to-I/O flows. The
evaluation of these approaches is presented in Chapter 7.
Chapter 5
RCNoC: an optimized WCTT analysis for NoC
Contents
5.1 Notations and definitions
5.2 Wormhole network properties
5.2.1 Local worst-case scenario
5.2.2 Direct and indirect contentions
5.3 Description of the proposed pipeline-based algorithm
5.3.1 An illustrating example
5.3.2 Identifying the global worst-case scenario
5.3.3 Implementation of direct and indirect contention analysis
5.4 Unitary evaluation of the properties
5.5 Conclusion
Chapter 4 has shown that the recursive methods do not model the pipeline behavior of
wormhole switching. This is a source of pessimism when computing the WCTT of flows. To
reduce this pessimism, the analysis should be done at the granularity of a flit. To compute
the WCTT of an analyzed flow fa, we have to explore all the possible transmission scenarios of the
different flows. However, the number of possibilities can be too high to be
explored in a reasonable amount of time. In this chapter, we propose an improved
recursive method, called RCNoC, to analyze the behavior of the network. This method is based
on three properties that reduce both the number of scenarios to explore and the pessimism of the
classical recursive methods. The proposed method is then evaluated on a synthetic benchmark.
First, we introduce some notations that will be used to describe our method.
5.1 Notations and definitions
The following notations and definitions are used in this chapter to explain the properties
on which our proposed method is based.
• Notation 1: ds is the switching delay of a flit, i.e. the time required for the router to
grant an output port to the packet. In the Tilera Tile64 NoC, the switching delay of the
header and that of the remaining flits are identical.
• Notation 2: fsize is the size of a flit.
• Notation 3: dt is the traversal delay of a link of capacity C. Thus,
dt = fsize / C
This delay corresponds to the traversal delay of regular data as well as of the credits of the flow
control policy.
• Notation 4: dflit is the delay taken by a flit to be transmitted from a router rk to rk+1.
Then,
dflit = ds + dt
In this work, we assume that ds = dt and dflit = 1 cycle as in the Tilera NoC.
Figure 5.1: Example illustrating the number of routers separating an analyzed flow fa from its
indirect flow fid.
• Notation 5: nfi is the size of a packet of a flow fi, in number of flits.
• Notation 6: a flow fi is described by the set of routers defining its path from its source
router, noted risource, to its destination router, noted ridestination. The path of this flow is
noted path(fi). Then,
path(fi) = risource, rj, rk, ..., ridestination
The example in Figure 5.1 is used to illustrate Notations 7 to 9.
• Notation 7: fd is a direct flow that blocks fa. The last common router between fa and
fd is rd. In the example of Figure 5.1, rd corresponds to rk+2.
• Notation 8: fid is an indirect flow of fa, i.e. a flow that blocks fd at rid. In the example, rid
corresponds to rk+4.
• Notation 9: Er is the number of routers separating fa from fid, i.e. the routers between rd and rid.
Thus, in the example, Er = 1, which corresponds to the router rk+3.
• Definition 1: F is the set of flows in an application.
• Definition 2: Fp ⊂ F is the set of flows to be considered when analyzing fa, whether
they are in direct or indirect contention with fa.
• Definition 3: Fi,k is the set of flows blocking fi on a router rk of its path. Due
to the wormhole behavior, Fp includes not only the flows directly blocking fa on its
path, but also the indirect flows that block the direct blocking flows. This means that
∪ (rk ∈ path(fa)) Fa,k ⊆ Fp.
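As an illustration of Definition 3, the sets Fa,k can be sketched in a few lines of Python. This is only one possible reading of the definition: it assumes that a flow blocks fa at rk when it uses the same link (rk, rk+1) as fa, and the helper name blocking_sets as well as the three paths are hypothetical values chosen for the sketch.

```python
from typing import Dict, List, Set


def blocking_sets(paths: Dict[str, List[str]], fa: str) -> Dict[str, Set[str]]:
    """For each router rk of path(fa) that has a next router on the path,
    return the flows sharing the link (rk, rk+1) with fa."""
    def links(path: List[str]) -> Set[tuple]:
        # Consecutive routers of a path form its links.
        return set(zip(path, path[1:]))

    result: Dict[str, Set[str]] = {}
    pa = paths[fa]
    for rk, r_next in zip(pa, pa[1:]):
        result[rk] = {
            f for f, p in paths.items()
            if f != fa and (rk, r_next) in links(p)
        }
    return result


# Hypothetical paths (not taken from the figures of this chapter).
paths = {
    "f1": ["r1", "r2", "r3"],
    "f2": ["r1", "r2", "r3", "r4", "r5"],
    "f3": ["r4", "r5", "r6"],
}

sets_f1 = blocking_sets(paths, "f1")
sets_f2 = blocking_sets(paths, "f2")
print(sets_f1)        # direct flows of f1 at each router
print(sets_f2["r4"])  # flows blocking f2 at r4, i.e. indirect flows of f1
```

In this sketch, f3 never appears in the sets of f1 but blocks f2 at r4, so it would belong to Fp only as an indirect flow.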
5.2 Wormhole network properties
In the following sections, we describe three properties that make it possible to compute the WCTT
while reducing the pessimism of this computation. The first one reduces the number of possible
scenarios by studying the arrivals of the flows on each router. The two following ones concern
the pipeline behavior of the transmission and its use in the WCTT computation.
5.2.1 Local worst-case scenario
Computing a WCTT requires identifying the worst-case scenario. To identify such a worst-case
scenario, the possible sequences of flows from Fp over routers must be explored, even though
not all of these scenarios lead to the WCTT. To reduce the number of scenarios to explore
when performing a WCTT analysis, we thus define a first property that identifies the worst-case
scenario. The local worst-case scenario is identified when computing the maximal blocking
delay of fa at each router.
Property 1: Identifying the worst-case scenario.
For a flow fa, the worst-case scenario, on each router rk ∈ path(fa), occurs when:
1. The port on which fa arrives is the last port served by the round-robin arbitration (RRA).
2. The headers of fa and a flow fb ∈ Fa,k arrive synchronously on rk.
3. The other flows fj ∈ Fa,k (fj ≠ fb) arrive on rk either synchronously with fa and fb or
no later than the release of the output port by the served flow (i.e. fb or another fj),
and consequently before the next round of arbitration.
The first condition is obvious: when fa is the last one served, it waits for the
progression of all flows contending for the same output port. Now, let us prove condition 2.
Proof. Let us assume that fa is blocked at router rk by two flows fi and fj. These flows therefore
have a common next router rk+1 on their paths towards their destinations, since at least one
link must be shared between two flows for them to be in (direct) contention. When the headers
of fa and fi arrive synchronously on rk, fa must wait until all the flits of fi leave rk+1 before
being able to progress, as we consider buffers with a capacity of a single flit (A11). This leads to a
blocking delay D1 equal to nfi × 2dflit, since the two routers rk and rk+1 must have transmitted
all the flits of fi before being able to proceed with another flow. Now, let us assume that ncfi
flits of fi have already been transmitted by rk before the header of fa arrives at rk. Then, the
blocking delay of fa before being able to progress is D2 = (nfi − ncfi) × 2dflit. Therefore, the
arrival of the header of fa together with any flit of fi other than the header leads to a lower delay for fa,
since D2 < D1. Thus, the synchronous arrival of headers is a worst-case scenario.
Figure 5.2: Example illustrating the second condition of Property 1.
To illustrate this behavior, let us consider the example in Figure 5.2. This example uses 3
flows. f1 is the analyzed flow fa, f2 and f3 are direct flows blocking f1 at r1. We assume that
all packets are made of 4 flits. We focus, first, on the transmission of f1 and f2.
Figure 5.3: Different scenarios to illustrate the second condition of Property 1.
The first possible scenario is shown in Figure 5.3a. f1 and f2 arrive synchronously at r1. As f1
and f2 share the link between r1 and r2, f1 waits for the transmission of all flits of f2 by the
router r2. This leads to a blocking delay D1 = 4 × 2dflit, as illustrated by scenario (a) on the
timeline of Figure 5.4. Thus, delay(f1) = d(f1) + D1 = 17 cycles, where d(f1) corresponds to the
transmission delay of f1 to reach its destination. In the second scenario, shown in Figure 5.3b,
f1 arrives synchronously with the fourth and last flit of f2. Scenario (b) on the timeline of
Figure 5.4 illustrates the waiting delay of f1: f1 only waits for the progression of the last flit of
f2 from the router r2 to r3, leading to D2 = 2dflit. Then, delay(f1) = d(f1) + D2 = 11 cycles.
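The two delays of this example can be reproduced with a short sketch. The function name blocking_delay is hypothetical, and the value d(f1) = 9 cycles is not stated in the text: it is an assumption deduced from delay(f1) = d(f1) + D1 = 17 cycles with D1 = 8 cycles.

```python
D_FLIT = 1  # cycles, as assumed for the Tilera NoC (Notation 4)


def blocking_delay(n_flits: int, n_already_sent: int = 0, d_flit: int = D_FLIT) -> int:
    """Blocking delay of fa caused by a flow of n_flits flits, when
    n_already_sent of them were transmitted before the header of fa arrives."""
    return (n_flits - n_already_sent) * 2 * d_flit


d_f1 = 9  # assumption: deduced from 17 = d(f1) + 4 x 2dflit

D1 = blocking_delay(4)      # headers arrive synchronously (scenario a)
D2 = blocking_delay(4, 3)   # f1 arrives with the 4th flit of f2 (scenario b)

print(d_f1 + D1)  # 17
print(d_f1 + D2)  # 11
```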
Figure 5.4: Timeline of the transmission of flows for the two different scenarios of the example
shown in Figure 5.3.
Now, we have to prove the third condition of Property 1: the worst-case blocking delay of fa
is obtained when fj arrives either synchronously with fa and fb or before the next round of
arbitration.
Proof. If fj arrives after the current round of arbitration, it is clear that fa is going to progress
before fj. In this case, the worst-case blocking delay that fa suffers (due to fb) is equal to
D1, previously introduced. Let us assume that fj arrives before the next round of arbitration,
when ncfb flits of fb have been transmitted by rk. Meanwhile, fa has been blocked for an
amount of time equal to ncfb × 2dflit. However, the next round of arbitration only occurs
when all flits of fb have been transmitted by rk. The blocking delay that fa suffers from fb is thus equal
to nfb × 2dflit, independently of the number of flits that were transmitted before fj arrives.
This blocking delay corresponds to ncfb × 2dflit + (nfb − ncfb) × 2dflit, where the second term
(nfb − ncfb) × 2dflit is the blocking delay that fa suffers from the arrival of fj at rk onwards. The next
round of arbitration is won by fj due to the first condition of Property 1. Hence, fa is again
blocked until all flits of fj are transmitted by rk. In total, fa is thus blocked for an amount of
time D3 = (nfb + nfj) × 2dflit. Now, let us assume that fj arrives before the synchronous arrival
of fa and fb. When fa arrives at rk, it is thus blocked for an amount of time corresponding to
the transmission of the remaining flits by rk. The same rationale leads to a total worst-case
blocking delay for fa equal to D4 = (nfb + ncfj) × 2dflit. As D4 < D3, the worst-case blocking
delay of fa is obtained when fj arrives either synchronously with fa and fb or before the next
round of arbitration.
Note that if fj arrives synchronously with fa and fb (i.e. ncfb = 0), the RRA
and the first condition of Property 1 ensure that fa will be the last flow whose flits are
transmitted by rk. rk will transmit either fb or fj first; in both cases, the blocking delay that fa
suffers is still equal to D3.
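The independence of D3 from the arrival instant of fj can be checked numerically. The sketch below simply sums the three terms of the proof for every value of ncfb; the function name total_blocking and the packet sizes (4 flits each) are assumptions made for the illustration.

```python
def total_blocking(n_fb: int, n_fj: int, nc_fb: int, d_flit: int = 1) -> int:
    """Blocking delay of fa when fj arrives after nc_fb flits of fb
    have been transmitted, but before the next round of arbitration."""
    before_fj = nc_fb * 2 * d_flit            # fa blocked by fb before fj arrives
    after_fj = (n_fb - nc_fb) * 2 * d_flit    # remaining flits of fb
    served_fj = n_fj * 2 * d_flit             # fj wins the next arbitration round
    return before_fj + after_fj + served_fj


# fj may arrive with any flit of fb, from the header (0) to the last one (3).
delays = {total_blocking(4, 4, nc) for nc in range(0, 4)}
print(delays)  # {16}, i.e. (n_fb + n_fj) x 2dflit in every case
```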
Figure 5.5 illustrates the third condition of Property 1 when the 3 flows of our example are
considered. Figure 5.5a presents a first scenario where f3 arrives synchronously with f2 and
f1. The second scenario, shown in Figure 5.5b, illustrates the case where f3 arrives when r1
has already transmitted a number of flits of f2. Here, we assume that f1 and f2 arrive synchronously,
as previously, and that f3 arrives with the fourth flit of f2. At this time, r1 has transmitted
three flits of f2. f1 has thus been waiting for 3 × 2dflit and still has to wait for r1 to transmit the
last flit of f2. The delay caused by f2 is thus equal to 4 × 2dflit. The timeline of Figure 5.6
illustrates this scenario, noted (b). We can see that f1 will only be able to progress after all
flits of f3 have been transmitted by r1, i.e. at t1, since f1 has arrived on the last port served by
the RRA. This corresponds to an additional delay of 4 × 2dflit. Thus, the blocking delay of fa
is equal to 8 × 2dflit and delay(f1) = 8 × 2dflit + d(f1) = 24 cycles. Note that this delay is the
same if f3 arrives synchronously with f1 and f2, as shown in scenario (a) of Figure 5.6.
In this case, the RRA serves f2, then f3 and then f1. Thus, f1 waits for the progression of all
flits of f2 and f3.
Figure 5.5: Different scenarios illustrating the third condition of Property 1.
We claim that the global worst-case scenario is obtained when Property 1 is recursively applied
to each flow that affects fa. This is illustrated by the example in Figure 5.7, where fa = f1 and
f3 blocks f2, which in turn delays f1.
Figure 5.6: Timeline of the transmission of flows of the example shown in Figure 5.5.
Figure 5.7: Example illustrating that Property 1 should be applied recursively.
As mentioned before, f1 has to wait until the second common router between f1 and f2, i.e.
r2, has transmitted all the flits of f2. However, f2 will be blocked by f3 at r4. Again, two scenarios
can be defined at router r4, as shown in Figures 5.8 and 5.9. The scenario leading to the
worst-case delay corresponds to the first scenario, i.e. Figure 5.8a, illustrated in the timeline of
Figure 5.8b. In this scenario, f2 indeed has to wait until r4 has transmitted all flits of f3 before being
able to progress. f1 is thus blocked for a delay D5 equal to (nf3 + nf2) × 2dflit before being
able to progress, leading to a delay of f1 equal to 25 cycles. In the second scenario, plotted in
Figure 5.9b, ncf3 flits of f3 have already been transmitted by r4 before f2 arrives at this router.
f1 thus waits for a delay equal to (nf3 − ncf3 + nf2) × 2dflit, which is lower than D5. This scenario
leads to a delay of f1 equal to 19 cycles.
Figure 5.8: A worst-case scenario of the example in Figure 5.7 illustrating the recursion of
Property 1.
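The recursive application of Property 1 on this example can be sketched as follows. This sketch only covers the worst case where all headers arrive synchronously, and it deliberately ignores the pipeline refinements introduced in Section 5.2.2. The dictionary blocked_by, the function name and the value d(f1) = 9 cycles (deduced from the 17-cycle delay of the example of Figure 5.3) are assumptions made for the illustration.

```python
from typing import Dict, List


def worst_blocking(flow: str, blocked_by: Dict[str, List[str]],
                   size: Dict[str, int], d_flit: int = 1) -> int:
    """Worst-case delay that `flow` imposes on a flow it blocks: the time to
    transmit its own flits, plus, recursively, the delay imposed on it by
    the flows that block it."""
    own = size[flow] * 2 * d_flit
    return own + sum(worst_blocking(b, blocked_by, size, d_flit)
                     for b in blocked_by.get(flow, []))


# Example of Figure 5.7: f2 blocks f1, and f3 blocks f2.
blocked_by = {"f1": ["f2"], "f2": ["f3"]}
size = {"f1": 4, "f2": 4, "f3": 4}  # packets of 4 flits

D5 = sum(worst_blocking(b, blocked_by, size) for b in blocked_by["f1"])
d_f1 = 9  # assumption: transmission delay of f1 deduced from Figure 5.3

print(D5)         # 16, i.e. (n_f3 + n_f2) x 2dflit
print(d_f1 + D5)  # 25
```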
5.2.2 Direct and indirect contentions
To reduce the pessimism when computing the WCTT, two properties are defined. The first property
computes the maximal delay that a flow can suffer. The second property eliminates some
scenarios by identifying flows in indirect contention with fa that have no influence on it. Note that Property
1 is applied each time a flow is blocked by another one.
Figure 5.9: A second configuration of the example in Figure 5.7 which does not lead to a
worst-case scenario.
Property 2: Computing the maximal blocking delays.
If fa is blocked by an unblocked flow fj, its maximal blocking delay is bounded by nfj × 2dflit.
This delay corresponds to the time needed by the last flit of fj to leave the blocked router.
As seen previously, if fa is blocked by fj at router rk, then these flows must have a common
next router rk+1 on their paths towards their destinations. Due to the pipeline transmission,
fa can progress only when the last flit of fj leaves their next common router rk+1. fa is thus
blocked for an amount of time equal to nfj × 2dflit. Scenario (a) on the timeline of
Figure 5.4 is an example of this behavior: f1 is blocked by the unblocked flow f2. We can see
that f1 waits for a delay t equal to nf2 × 2dflit, i.e. 8 cycles, before it can progress.
Now, let us prove by contradiction that this delay t is the maximal blocking delay fa can suffer
when it is blocked by an unblocked flow.
Proof. Let us assume that a larger blocking delay t1 for fa exists. Since t1 > nfj × 2dflit, the
header of fj has moved from rk to rk+2nfj. Indeed, when a flit moves from a router rk+1 to a
router rk+2, rk+1 transmits a credit back to rk, allowing the transmission of the next flit. This
explains why the flits of a flow are separated by one router when they follow the header in a
pipelined way. To illustrate this behavior, let us go back to scenario (a) of Figure 5.4, where
we consider r1 = rk. We can see that when the header of f2 is at the router rk+7, i.e. at r8,
the 2nd flit is at rk+5, i.e. at r6. The 3rd and 4th flits are respectively at r4 and r2.
Therefore, after t1, the header of fj should have been transmitted by router rk+2nfj. The last
flit of fj is then located within router rk+2. fa has therefore progressed at router rk+1, which
contradicts the initial assumption.
We note that when n unblocked flows block fa at rk, the blocking delay of fa is equal to
Σ(j=1..n) nfj × 2dflit. This is the case presented in the timeline of Figure 5.6: f1 first waits for all
flits of f2 to leave the router r2, i.e. nf2 × 2dflit. Then, it waits for all flits of f3
to leave the router r2, i.e. nf3 × 2dflit.
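Both observations of this proof, the one-router spacing between the flits of a progressing packet and the sum of the delays caused by unblocked flows, can be sketched in Python. The function names are hypothetical, and routers are identified by their index on a linear path.

```python
from typing import List


def flit_positions(header_router: int, n_flits: int) -> List[int]:
    """Index of the router holding each flit of a progressing packet, from
    the header to the last flit, when consecutive flits are separated by
    one router (single-flit buffers and credit-based flow control)."""
    return [header_router - 2 * i for i in range(n_flits)]


def blocking_by_unblocked(sizes: List[int], d_flit: int = 1) -> int:
    """Blocking delay of fa when it is blocked at rk by unblocked flows
    whose packet sizes (in flits) are given in `sizes`."""
    return sum(n * 2 * d_flit for n in sizes)


# Scenario (a) of Figure 5.4: header of f2 at r8, packet of 4 flits.
print(flit_positions(8, 4))           # [8, 6, 4, 2]
# Figure 5.6: f1 is blocked by f2 and f3, both made of 4 flits.
print(blocking_by_unblocked([4, 4]))  # 16
```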
However, a flow fi could block fj at a router rl. fj must then be unblocked first, in order to unblock
fa. This property must thus be applied recursively to all blocking flows. For this reason, we
consider Property 2-bis.
Property 2-bis: If fa is blocked by a blocked flow fj, the maximal blocking delay of fa incurred
by fj, once fj is unblocked, depends on the number of hops separating rk from rl.
When fj is blocked by fi at rl, fj has progressed from rk before being blocked at rl. Once
blocked, fj waits for an amount of time equal to t, i.e. nfi × 2dflit, to receive the credit from
rl+1. Meanwhile, fa is blocked at rk and waits for the credit from rk+1 to progress: this blocking
delay depends on the number of hops, noted nhops, separating rk and rl. Three cases exist:
1. nhops = nfj. In this case, the last flit of fj is located within rk+1 when fj has reached rl,
where it is blocked by fi. Indeed, rl can be noted rk+nhops, which is equal in this case to
rk+nfj. This means that the header of fj is blocked at rk+nfj, and so the last flit is located
at rk+1. fa can progress as soon as each flit of fj has progressed by one router, i.e. the last
flit has left rk+1. This is possible once fj is no longer blocked by fi. From this instant,
the blocking delay that fa suffers, noted D, is thus equal to nfj × dflit.
We illustrate this case on the example of Figure 5.10a, where we consider
three flows f1, f2 and f3, each with a size of three flits. In this case, f1 is blocked
at r1 by f2, while f2 is blocked by f3 at r4. Hence, nhops = nf2 = 3. The timeline
in Figure 5.10b illustrates this case. We can see that the last flit of f2 is blocked at r2,
waiting for the progression of the header. But f2 is blocked by the unblocked flow f3,
thus its blocking delay, noted D1, is equal to nf3 × 2dflit = 6 cycles. Then, f1 waits
until each flit of f2 has moved to its next router, i.e. the last flit has moved from the router
r2 to the router r3. This blocking delay D is equal to nf2 × dflit = 3 cycles. Therefore,
delay(f1) = d(f1) + d(headerf2) + D + D1 = 19 cycles, where d(headerf2) corresponds to
the transmission delay of the header of f2 from r1 to r4.
Figure 5.10: An example illustrating the case where nhops = nfj.
2. nhops > nfj. Compared to the previous case, some flits of fa must have already been
transmitted by rk when fj reaches rl. The previous blocking delay D of fa at rk is thus lowered
by the number of routers the header of fa has crossed. This number is equal to nhops − nfj, as
when fj is blocked at rk+nhops, the header of fa has reached rk+nhops−nfj. Then, the blocking
delay of fa at rk when fj is no longer blocked corresponds to D − (nhops − nfj) × dflit.
To illustrate this case, we modify the previous example so that f2 is now blocked at r5. We
illustrate this example in Figure 5.11a, where nhops = 4. The delay D, computed in the
previous case, is lowered by (nhops − nfj) = 1, as shown in Figure 5.11b. In fact, the header
of f1 has progressed by nhops − nfj, i.e. one router, when f2 is blocked at r5.
Figure 5.11: An example illustrating the case where nhops > nfj.
3. nhops < nfj. In this case, a number of flits of fj must be located on rk and possibly on
previous routers. fa is thus blocked until these flits of fj are transmitted by rk. This number
is equal to nfj − nhops. The blocking delay of fa of the first case therefore increases to
D + (nfj − nhops) × dflit.
This case is illustrated in Figure 5.12a, where nhops = 2 as f2 is blocked at r3. As illustrated
in Figure 5.12b, when nhops < nf2, only 2 of the 3 flits of f2 can progress before being
blocked by f3 at the router r3. Thus f1, blocked at r1, must wait for the last flit of f2,
i.e. the (nf2 − nhops) flits, to progress from r1 to r3. So, when f2 is no longer blocked, the
blocking delay of f1 at r1 is equal to D + (nfj − nhops) × dflit = 4 cycles.
Figure 5.12: An example illustrating the case where nhops < nfj.
Finally, we recall that to obtain the total blocking delay of fa, the blocking delay computed
in each case must be added to the transmission delay of the header of fj to reach rl and to its
blocking delay at this router.
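The three cases of Property 2-bis can be summarized by the following sketch, which only computes the residual blocking delay of fa once fj is no longer blocked. The function name residual_blocking is hypothetical, and the sketch does not cover the transmission delay of the header of fj nor its own blocking delay at rl, which must still be added as recalled above.

```python
def residual_blocking(n_fj: int, n_hops: int, d_flit: int = 1) -> int:
    """Blocking delay of fa at rk, counted from the instant where the
    blocking flow fj (n_fj flits, blocked n_hops routers after rk)
    is no longer blocked."""
    d = n_fj * d_flit                         # case 1: n_hops == n_fj
    if n_hops > n_fj:                         # case 2: fa has already progressed
        return d - (n_hops - n_fj) * d_flit
    if n_hops < n_fj:                         # case 3: flits of fj still at rk
        return d + (n_fj - n_hops) * d_flit
    return d


# Examples of Figures 5.10, 5.11 and 5.12, with n_f2 = 3 flits.
print(residual_blocking(3, 3))  # 3 cycles (f2 blocked at r4)
print(residual_blocking(3, 4))  # 2 cycles (f2 blocked at r5)
print(residual_blocking(3, 2))  # 4 cycles (f2 blocked at r3)
```

Note that the three branches coincide algebraically with (2nfj − nhops) × dflit; they are kept separate here only to mirror the presentation of the property.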
Property 3: Reducing the delays by identifying the non-influencing indirect flows.
fd is a direct flow blocking fa, while fid is an indirect flow that blocks fd (Notations 7 and
8). fid does not impact the progression of fa when Er ≥ nfd − 1. We recall that Er corresponds
to the number of routers separating fa and fid (Notation 9).
Figure 5.13: Example illustrating the analyzed flow fa, the direct flow fd and the indirect flow
fid.
Proof. Let us first consider Figure 5.13, which illustrates fa, fd and fid. If Er ≥ nfd − 1,
the header of fd is blocked on rid and the remaining flits are located on routers
rd+1 up to rid−1. rd can therefore transmit the first flit of fa, which will then be able to progress
in a pipelined way. Consequently, fa is not affected by fid, and thus its delay is independent
of fid.
To illustrate this property, we consider the example in Figure 5.14, where Er = 2, as r4 and r5
separate f1 from f3. Let us first suppose that f2 has a size of 3 flits, i.e. Er = nf2 − 1. This
case is illustrated in Figure 5.15. In this case, the header of f2 is blocked at r6. The 2nd and 3rd
flits are located at r5 and r4, as shown in Figure 5.15a. Hence, the path of f1 is no longer blocked.
The timeline in Figure 5.15b illustrates the transmission of these flows, where we can see that
f3 does not affect the transmission of f1. Here, delay(f1) = nf2 × 2dflit + d(f1) = 13 cycles.
However, if Er < nfd − 1, a flit of fd must be located on rd, since fd is blocked at rid by fid. fa
is thus blocked on its path by fd and thus indirectly delayed by fid.
This is the case when f2, in this example, has a size of 4 flits. This case is illustrated in
Figure 5.16. In this case, Er is less than nf2 − 1, thus the 4th flit of f2 is located at r3, blocking
the progression of f1, as shown in Figure 5.16a. f1 has to wait for the progression of f3. The
timeline illustrating this transmission in Figure 5.16b shows that f3 indirectly affects f1: f1
waits until f2 is no longer blocked by f3, so that the last flit of f2 leaves the router r3 at
t1. This case leads to a delay of f1 equal to 25 cycles.
Figure 5.14: Example illustrating Property 3.
Figure 5.15: Transmission of flows f1, f2 and f3 of the case in Figure 5.14 where we consider a
size of 3 flits for each packet.
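The test of Property 3 reduces to a single comparison, sketched below. The function names are hypothetical, and routers are identified by their index on a linear path, so that Er is the number of routers strictly between rd and rid.

```python
def routers_between(rd_index: int, rid_index: int) -> int:
    """Number of routers Er strictly between rd and rid (Notation 9)."""
    return rid_index - rd_index - 1


def indirect_flow_matters(e_r: int, n_fd: int) -> bool:
    """True when the indirect flow fid can delay fa, i.e. when the
    condition Er >= n_fd - 1 of Property 3 does not hold."""
    return e_r < n_fd - 1


# Example of Figure 5.14: rd = r3 and rid = r6, hence Er = 2.
e_r = routers_between(3, 6)
print(e_r)                            # 2
print(indirect_flow_matters(e_r, 3))  # False: f3 does not delay f1
print(indirect_flow_matters(e_r, 4))  # True: f3 indirectly delays f1
```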
5.3 Description of the proposed pipeline-based algorithm
The pessimism when computing the WCTT can be reduced by modeling the pipeline behavior.
Thus, the analysis should be performed at the granularity of flits, as presented in Properties 2
and 3. In this section, we illustrate on an example how our algorithm RCNoC computes the WCTT
of flows. We then describe how RCNoC implements the properties introduced in the previous
section.
Figure 5.16: Timelines of transmission of flows f1, f2 and f3 of the case in Figure 5.14 where
we consider a size of 4 flits for each packet.
5.3.1 An illustrating example
In order to understand the behavior of our proposed algorithm, we study the example
in Figure 5.17. This example is made of 5 flows: f1, f2, f3, f4 and f5. f1 is the studied flow,
i.e. the flow whose WCTT we want to compute. f2 and f3 are in direct contention with f1.
f4 and f5 are in indirect contention with f1, as they are in direct contention with f2 and share
no common routers with f1.
Figure 5.17: Example illustrating the method of computation.
In order to compute the WCTT of f1, we have to determine the list of flows that directly
and indirectly block f1. Thus, we first have to build, for each flow fi, the list of flows that are in
direct contention with fi and impact f1, by using Property 1. We note this list Lfi. It is
made of a set of couples < fj, rk >, where fj is a flow in direct contention with fi and rk is the
first router where fj blocks fi. The flows are ordered according to the round-robin arbitration, assuming
Property 1. To identify when fi can be transmitted by a router rk and then progress, i.e. when it is
no longer blocked at rk, fi is also added to Lfi when passing from one router to another.
For instance, in our example, Lf1 = {< f2, r2 > < f3, r2 > < f1, r2 >}. This means that f2
and f3 block f1 at the same router r2, as they share with f1 the link between r2 and r3.
Considering the RRA, f2 is sent before f3. As f2 and f3 are two blocking flows of f1 at r2,
their own blocking flows may also have an impact on the transmission of f1. This requires building
their lists of blocking flows starting from r3. Thus, Lf2 = {< f4, r6 > < f2, r6 > < f5, r7 >
< f2, r7 >} and Lf3 = {< f3, r2 >}. Indeed, f2 blocks f1 at r2 and is transmitted before f3
due to the RRA. This arbitration at r2 explains why we do not consider f3 as a blocking flow
of f2 and vice versa. f2 can progress without being blocked up to the router r6. The recursion
of Property 1 applies, and thus f2 is blocked by f4 at r6. When f2 is no longer blocked by f4,
it progresses until it is blocked by f5 at r7. After that, f2 is no longer blocked on its path. As for f3, it
progresses after f2. On its path, f3 is not blocked.
Similarly to f2 and f3, as f4 and f5 block f2 at r6 and r7 respectively, their blocking flows
could also have an impact on f1. Thus, we build their lists of blocking flows starting respectively
from r7 and r8. Then, Lf4 = {< f5, r7 > < f4, r7 >} and Lf5 = {< f5, r7 >}.
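The lists of this example can be written down as a small data structure, which makes it easy to check which flows are unblocked and which flows must be considered when analyzing f1. This is only an illustrative encoding: the dictionary layout and the helper names is_unblocked and flows_to_consider are assumptions, not the data structures of RCNoC.

```python
from typing import Dict, List, Tuple

# Lists L_fi of the example of Figure 5.17: ordered couples <fj, rk>.
lists: Dict[str, List[Tuple[str, str]]] = {
    "f1": [("f2", "r2"), ("f3", "r2"), ("f1", "r2")],
    "f2": [("f4", "r6"), ("f2", "r6"), ("f5", "r7"), ("f2", "r7")],
    "f3": [("f3", "r2")],
    "f4": [("f5", "r7"), ("f4", "r7")],
    "f5": [("f5", "r7")],
}


def is_unblocked(flow: str) -> bool:
    """A flow is unblocked when its list only contains the flow itself."""
    return all(other == flow for other, _router in lists[flow])


def flows_to_consider(fa: str) -> List[str]:
    """Flows in direct or indirect contention with fa, in the order in
    which they are met when stepping through the lists from L_fa."""
    seen: List[str] = []

    def visit(flow: str) -> None:
        for other, _router in lists[flow]:
            if other not in (fa, flow) and other not in seen:
                seen.append(other)
                visit(other)

    visit(fa)
    return seen


print(flows_to_consider("f1"))                   # ['f2', 'f4', 'f5', 'f3']
print([f for f in lists if is_unblocked(f)])     # ['f3', 'f5']
```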
Now, we have finished building the lists of flows that have an impact, directly or indirectly, on
the analysis of f1. Let us now see how to analyze the WCTT of f1 by stepping through these
lists, starting from Lf1, as f1 is the studied flow. Lf1 indicates that f2 is the first direct flow
blocking f1. Therefore, we have to analyze Lf2 in order to compute the maximal blocking delay
after which the flows following f2 in Lf1, i.e. f3 and f1, can progress. Since both f4 and f5
in Lf2 are indirect flows of f3 and f1, i.e. of the flows that progress after f2 in Lf1, two
scenarios can occur:
• First scenario: nf2 = 2.
Property 3 is applied: f4 has no indirect influence over f3 and f1, since at least one router
separates f4 from these flows, i.e. Er = nf2 − 1 = 1. Consequently, f2 becomes an unblocked
flow for the flow after f2 in Lf1, i.e. f3. This means that f3 can progress directly after f2,
when the last flit of f2 leaves the router r3. Thus, the blocking of f2 by f4 does not impact the
transmission of f3. Property 2 is therefore applied: the maximal blocking delay of f3 before
it can progress is equal to nf2 × 2dflit. Now, f3 can progress. Thus, we step
to Lf3 to compute the maximal blocking delay suffered by the next flow after f3 in Lf1,
i.e. f1, before it can progress. Lf3 indicates that f3 is an unblocked flow. Thus, the next flow
f1 only waits for the transmission of f3. By Property 2, f1 waits nf3 × 2dflit before it can
progress. Now, f1 can progress and thus the transmission time of f1 to its destination must
be added to compute its WCTT. This WCTT is thus equal to d(f1) + (nf2 + nf3) × 2dflit.
The timeline in Figure 5.18 illustrates the transmission of these flows, where we can see that
WCTT(f1) ≈ 13 cycles.
Figure 5.18: Timeline illustrating the transmission of flows of the example in Figure 5.17 and
considering the first scenario.
• Second scenario: nf2 > 2.
Two cases must be considered in this scenario.
– First case: nf2 = 3. In this case, f4 does not impact the transmission of the flow f3 but
impacts f1. Indeed, only one router separates f4 from f1 and thus Er < nf2 − 1.
On the other hand, two routers separate f4 from f3, which means that
Er = nf2 − 1.
– Second case: nf2 > 3. In this case, f4 has an indirect influence on both f3 and f1, as
the number of routers separating f4 from f1 or f3 is less than nf2 − 1.
In this example, we detail the analysis of the second case. As f4 has an influence
on both of these flows, we need to analyze Lf4 . Lf4 tells us that f4 is blocked by
f5, which requires the analysis of Lf5 in order to compute the maximal blocking delay of
f4. However, f5 is not blocked, as indicated by Lf5 , so Property 2 gives the maximal
blocking delay of f4, which is then no longer blocked as there are no more blocking flows in Lf4 .
As f4 is now an unblocked flow, the maximal blocking delay that it imposes on the flow after f4
in Lf2 is computed by Property 2. Going back to Lf2 , f2 can now progress, but this list indicates
that f2 progresses by only one router, i.e. r6, before being blocked by f5 at r7. Here, f5 has an
influence on f3 and f1, i.e. the flows after f2 in Lf1 . Thus, the analysis of both Lf3 and Lf5
is required.
This behavior has been implemented in an algorithm that is described in the next section.
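The influence test used in the two scenarios above can be sketched as follows. This is only an
illustration: the threshold Er < nf2 − 1 is read off this example rather than taken from the formal
statement of Property 3, and the names is_influent, SEPARATION and influenced_flows are
hypothetical.

```python
def is_influent(empty_routers: int, n_flits: int) -> bool:
    """Return True when an indirect flow can still delay the given flow.

    Assumption taken from the example: the indirect flow has no influence
    once at least n_flits - 1 routers separate it from the flow.
    """
    return empty_routers < n_flits - 1


# Number of routers separating f4 from f3 and from f1, as stated in the example.
SEPARATION = {"f3": 2, "f1": 1}


def influenced_flows(n_f2: int) -> list:
    """List the flows of Lf1 that f4 influences for a given packet size of f2."""
    return sorted(name for name, er in SEPARATION.items() if is_influent(er, n_f2))


for n_f2 in (2, 3, 4):
    print(n_f2, influenced_flows(n_f2))
```

With nf2 = 2 no flow is influenced, with nf2 = 3 only f1 is, and with nf2 = 4 both f1 and f3 are,
which matches the scenarios discussed above.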
5.3.2 Identifying the global worst-case scenario
RCNoC is divided into two parts. The goal of the first part, illustrated in Algorithm 1, is to
identify the set of blocking flows of fa, denoted Fp (Definition 2). The second part, described in
Algorithm 3, computes the WCTT of fa by adding the delays of the blocking flows according to
the properties defined in section 5.2.
Computing the blocking flows Fp
This first part consists in building Fp from the set of all flows exchanged in a given configuration,
i.e. F (Definition 1). This part is composed of four steps, as described in Algorithm 1.
Algorithm 1 Compute_Blocking(F)
1: Fp.insert(fa)
2: fi ← Fp.head()
3: do
4:   F.remove(fi)
     #Step 1: Compute the direct blocking flows of fa
5:   if ((fi = fa) || (fi.same_source(fa))) then
6:     for all rk ∈ [risource, ridestination] in fi do
7:       Compute_direct_blocking(Fp, F, fi, Lfi, rk)
8:     end for
     #Step 2: Compute the indirect blocking flows of fa
9:   else
10:    for all rk ∈ ]riblocked, ridestination] in fi do
11:      Compute_direct_blocking(Fp, F, fi, Lfi, rk)
12:    end for
13:  end if
     #Step 3: Applying the RRA
14:  for all fj ∈ Lfi do
15:    if ((!fj.same_source(fa)) & (fj.same_port(∀fl ∈ Lfi, rjblocked))) then
16:      F.remove(fl)
17:      Fp.remove(< fl, rjblocked >)
18:      Lfi.remove(< fl, rjblocked >)
19:    end if
20:  end for
     #Step 4: Following the progression of fi in Lfi
21:  for all fj ∈ Lfi do
22:    if (rjblocked in < fj, rjblocked > ≠ rkblocked in next(< fj, rjblocked >)) then
23:      Lfi.insert(< fi, rjblocked >)
24:    end if
25:  end for
26:  fi = Fp.next()
27: while !Fp.last_reached()
In the first step, we compute the direct blocking flows of fa that block its path (Notation 6)
from the source router rasource to the destination router radestination (lines 5 to 8). Here
we thus consider the flows blocking fa at the source, i.e. sharing the same source, and on its path,
i.e. sharing the same links.
The second step consists in computing the indirect blocking flows of fa. As seen previously, it
is necessary to consider the blocking flows of each flow fi in Fp (initially from Lfa). Their lists Lfi
must thus also be computed, and any flow fj ∈ Lfi (fj ∉ Fp) must be added to Fp (lines 10 to
12).
In the third step, i.e. after computing the different blocking flows, we must consider the RRA
to choose one blocking flow from each port of each router rjblocked, i.e. the router where fj
blocks a flow fi in Lfi (lines 14 to 20).
The final step consists in following the progression of each flow on its blocked routers. Thus,
the flow fi is also added to its list of blocking flows Lfi each time fi can progress by some
routers before it is blocked by other flows at the following routers on its path (lines 21 to 25).
How to compute the blocking flows?
The function Compute_direct_blocking(), described in Algorithm 2, identifies the direct
blocking flows of a flow fi. For the analyzed flow fa, we must consider two types of direct
blocking flows. The first type consists of the flows that block fa at its source (lines 2 to 4). The
second type consists of the direct flows sharing a part of the path or the same destination with the
analyzed flow. For this type, it is sufficient to identify two consecutive common routers
between the analyzed flow and a flow fj to consider fj as a direct blocking flow (lines 6 to 10).
Each flow of Lfa , identified as a blocking flow, is then added to Fp. The blocking flows of fi
are computed by determining the flows sharing a part of its path. However, we must consider this
path starting from the router that follows riblocked, where fi blocks fa (line 10 in Algorithm 1).
We note that the flows blocking fi at the router riblocked are not added to Lfi , as they block fi
only once due to the RRA.
Algorithm 2 Compute_direct_blocking(Fp, F, fi, Lfi, rk)
1: for all ((fj ∈ F) & (fj ∉ Lfi)) do
     #Step 1: Compute the direct blocking flows sharing the same source of fa
2:   if (fj.same_source(fa) & (fi = fa)) then
3:     Fp.insert_head(< fj, risource >)
4:     Lfi.insert_head(< fj, risource >)
     #Step 2: Compute the direct blocking flows sharing the same path of fa or fi
5:   else if (!fj.same_port(fi, rk)) then
6:     if ((rk.common_router(fi, fj)) & (rk.next().common_router(fi, fj))) then
7:       rjblocked = rk
8:       Fp.push(< fj, rjblocked >)
9:       Lfi.push(< fj, rjblocked >)
10:    end if
11:  end if
12: end for
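The test of lines 6 to 10 of Algorithm 2 (two consecutive common routers) can be sketched as
below. This is a minimal illustration under stated assumptions: a flow is reduced to the ordered
list of routers on its path, "two consecutive common routers" is interpreted as sharing a link in
the same direction, and the paths given for f2 and for the crossing flow are hypothetical (only the
path of f1, from r2 to r4, and the shared routers r2 and r3 come from the example).

```python
def first_shared_link(path_i, path_j):
    """Return the first router rk of path_i such that rk and its successor
    are also consecutive routers of path_j, or None if no link is shared."""
    links_j = set(zip(path_j, path_j[1:]))
    for rk, rk_next in zip(path_i, path_i[1:]):
        if (rk, rk_next) in links_j:
            return rk  # plays the role of rjblocked in Algorithm 2
    return None


f1 = ["r2", "r3", "r4"]             # path of the studied flow
f2 = ["r2", "r3", "r6", "r7"]       # hypothetical path sharing the link r2 -> r3
crossing = ["r9", "r3", "r10"]      # hypothetical flow that only crosses f1 at r3

print(first_shared_link(f1, f2))        # r2
print(first_shared_link(f1, crossing))  # None
```

A flow that merely crosses the analyzed flow at a single router is not reported, since it never
competes for the same output link.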
Example. We go back to the example in Figure 5.17 to describe how the lists of
blocking flows are built using Algorithms 1 and 2. In Algorithm 1, we begin with the first step
(lines 5 to 8), where we have to consider the path of the studied flow f1 from r2 to r4 in order
to determine its direct blocking flows. f1 is not blocked at its source: it is only blocked on its
path. Thus, the second step of Algorithm 2 (lines 6 to 10) identifies r2 and r3 as
two common consecutive routers between f2 and f1. Then, f2 is added to Fp and < f2, r2 > is
added to Lf1 . Similarly, f3 is also a direct blocking flow of f1, so f3 is added to Fp and
< f3, r2 > is added to Lf1 .
As there are now no more direct blocking flows for f1 on its path, we move to the next step
(Step 2, lines 10 to 12), where we build the list of blocking flows for the flows in Fp, i.e. f2
and f3. f3 is not considered a direct blocking flow of f2, as both f2 and f3 block f1 at
the same router. f4 and f5 are identified by the second step of Algorithm 2 as direct blocking
flows of f2. Thus, Fp now contains f3, f4 and f5. This step is applied recursively on the
different flows in Fp and ends when no new flows are added to Fp.
Finally, step 4 of Algorithm 1 is applied on Lf1 , Lf2 , Lf3 , Lf4 and Lf5 . For example, it adds
< f1, r2 > at the end of Lf1 to indicate that f1 can progress.
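The way Fp grows until no new flow is added can be sketched as a simple worklist. This is a
simplified illustration, not the full Algorithm 1: the blocking relation is given directly (it is
the one of this example), and the RRA filtering of step 3 and the blocked routers are ignored.
The names blocking_closure and BLOCKED_BY are hypothetical.

```python
from collections import deque


def blocking_closure(analyzed, blocked_by):
    """Collect the analyzed flow and every flow that blocks it, directly or
    indirectly, stopping when no new flow is discovered."""
    fp = [analyzed]
    queue = deque([analyzed])
    while queue:
        fi = queue.popleft()
        for fj in blocked_by.get(fi, []):
            if fj not in fp:
                fp.append(fj)
                queue.append(fj)
    return fp


# Direct blocking relation of the example: f2 and f3 block f1,
# f4 and f5 block f2, and f5 blocks f4.
BLOCKED_BY = {"f1": ["f2", "f3"], "f2": ["f4", "f5"], "f4": ["f5"]}

print(blocking_closure("f1", BLOCKED_BY))  # ['f1', 'f2', 'f3', 'f4', 'f5']
```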
5.3.3 Implementation of direct and indirect contention analysis
This part consists in computing the WCTT of fa by stepping through the lists of blocking flows,
Lf , built by the previous algorithm. Each direct blocking flow is identified and its blocking
delay, after which it leaves the blocked router, is computed.
Algorithm 3 describes this computation of the WCTT of fa. For each analyzed flow f , we
compute (lines 3 to 6): its first direct blocking flow in Lf (noted fb), its next direct blocking
flow in Lf (fbnext), the blocking flow of fb in Lfb (fbb) and the next blocking flow of fb in Lfb
(fbbnext). These variables are used to compute the delay incurred by fb, potentially blocked by
fbb, before it gives the credit to the router and allows the progression of fbnext. A list L records
each flow that is found to be blocked, which helps to indicate which flow can progress
after the blocking flows release the path. Initially, L contains the analyzed flow fa.
Algorithm 3 Compute_Delay(f, Lf, rleave)
1: delay = 0
2: influent = false
   #Step 1: Compute the first direct blocking flows of f and their blocking flows
3: < fb, rb > = Lf.get_fb()
4: < fbb, rbb > = Lfb.get_fb()
5: < f.fbnext, f.rbnext > = Lf.get_fb().next()
6: < fbbnext, rbbnext > = Lfb.get_fb().next()
   #Case 1: fb can progress some routers, we verify if it will be blocked or not
7: if (fb == fbb) then
8:   delay += check_Next(Lf, fb, fbbnext, L)
9: end if
   #Case 2: fb is blocked directly by fbb, and fbb is in direct contention with fbnext
10: if (fbb.is_direct(f.fbnext)) then
11:   delay += fb.progress_up(rbb)
12:   L.add(fb)   ▷ fb ∉ L
13:   return delay + Compute_Delay(fb, Lfb, rbnext)
14: end if
   #Case 3: fb is blocked directly by fbb, and fbb is in indirect contention with fbnext
15: if (fbb.is_indirect(fbnext)) then
16:   delay += check_influence(Lf, fb, fbb, fbnext, L)
17: end if
18: return delay
Implementing the Property 2: Check whether blocking flows are blocked or not
By construction of the lists of blocking flows, the case fbb = fb means either that fb can
progress by some routers before it is blocked by a flow fbbnext, or that it is not, or no longer,
blocked. At line 8 of Algorithm 3, the function check_Next deals with both cases. Algorithm 4
describes this function. In the case where fb is blocked, fbbnext is not NULL (lines 1 to 6). This
means that fb can be blocked by several fbbnext at rbbnext before no longer blocking f at rb. fb
can thus first progress from its current location rbb to the first rbbnext. This delay is computed
by the function progress_up() and is equal to: dflit × (rbbnext − rbb).
However, as fb can progress on its path, this progression can release the path of the flow f
blocked by fb. Thus, if a flow blocked by f exists, its analysis, started in a previous iteration,
must next be resumed in order to check whether fbbnext is now non influent over fbnext and
can also be discarded (Property 3). The function Compute_Delay() is iterated on the last flow
blocked by f in the list L (which is expressed at line 4). If not, the analysis of f must be
continued on the next flow blocking f , i.e. calling Compute_Delay with fbb set to fbbnext (lines
4 and 6).
Algorithm 4 check_Next(Lf, fb, fbbnext, L)
   #Case 1: fb is blocked
1: if (fbbnext ≠ NULL) then
2:   delay += fb.progress_up(rbbnext)
3:   fb.setTo(fbbnext)
4:   f = L.previous_tail()
5:
6:   return delay + Compute_Delay(f, Lf, rbnext)
   #Case 2: fb is no more blocked
7: else
8:   if (fb == fa) then
9:     return delay + fa.progress_leave(radestination)
10:  else
11:    delay += fb.give_credit(rbnext)
12:    f.setTo(fbnext)
13:    L.remove(fb)
14:    f = L.previous_tail()
15:
16:    return delay + Compute_Delay(f, Lf, rbnext)
17:  end if
18: end if
In the other case (lines 7 to 17), when fb is no longer blocked, fb can progress and
give the credit to fbnext at router rbnext, i.e. fb leaves the router rbnext. This corresponds to
the use of Property 2 to compute the maximal delay before fbnext can progress. This
delay is computed by the function give_credit(). Here, the same iteration is done as in
the previous case to identify the state of progression of the flow previously blocked. However,
when fb is the flow fa, the analyzed flow is no longer blocked and can progress
to its destination router radestination from the last router rb where it was blocked. This delay is
equal to dflit × (radestination − rb + 1) + 2dflit × (nfa − 1), and it is computed by the function
progress_leave() at line 9. We note that dflit × (radestination − rb + 1) is the transmission
delay of fa to reach its destination.
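The three delay terms used so far can be written as small helper functions. This is a sketch
under assumptions: routers are identified by their integer index along the path, dflit defaults
to one cycle, and give_credit takes the number of flits of the blocking flow directly (in the
algorithms it is a method applied to a router); only the formulas themselves come from the text.

```python
def progress_up(r_from: int, r_to: int, d_flit: int = 1) -> int:
    """Delay for the header of fb to advance from router r_from to r_to."""
    return d_flit * (r_to - r_from)


def give_credit(n_flits: int, d_flit: int = 1) -> int:
    """Maximal blocking delay imposed by an unblocked flow of n_flits flits
    (Property 2): n_flits x 2 dflit."""
    return n_flits * 2 * d_flit


def progress_leave(r_blocked: int, r_destination: int, n_flits: int, d_flit: int = 1) -> int:
    """Delay for the analyzed flow to reach its destination once unblocked:
    header transmission plus the remaining n_flits - 1 flits."""
    return d_flit * (r_destination - r_blocked + 1) + 2 * d_flit * (n_flits - 1)


print(progress_up(6, 7))        # 1
print(give_credit(2))           # 4
print(progress_leave(2, 4, 2))  # 5
```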
Implementing Properties 2 and 3: Check whether blocking flows are direct or not
In the other case, where fb ≠ fbb, fb is blocked by fbb. However, fbb can be either in direct
or indirect contention with the flow fbnext. If fbb is in direct contention, the time needed by
fb to progress to rbb, i.e. its blocking router, must first be accounted for by using the function
progress_up() (line 11). Then, fb must be analyzed to compute its blocking delay until it can
progress, thereby allowing the progression of fbnext. Thus, fb is added to the list L (lines 12 and
13).
Otherwise, when fbb is in indirect contention with fbnext, the function check_influence (line
15) establishes whether fbb is influent over fbnext or not by applying Property 3. Algorithm 5
describes this function. The computation to be performed when fbb is influent is identical to
the case where fbb is in direct contention with fbnext (lines 4 to 7, which are similar to lines 10
to 13 in Algorithm 3). However, two cases must be considered when fbb is non influent.
The first case is when fbb has no influence over any remaining flow fx in Lf (fx ≠ fb and
fx ≠ fbnext) (line 11 and lines 17 to 22). Thus, at rbnext, fb gives the credit to fbnext, i.e. fb can
progress and leave the router rbnext, allowing the transmission of fbnext. The blocking delay of
fbnext, which waits for the progression of fb, is computed by applying Property 2. If a flow blocked
by f exists, its analysis must next be resumed. If not, the analysis of f must be continued with
fbb set to fbbnext. The second case is when fbb has an influence on a flow fx (line 13 and lines 23
to 37). In this case, the maximal delay is obtained by exploring two scenarios, as the impact of
future contentions is unknown at this point. In the first scenario (lines 24 to 29), all the flows
in Lf before fx are skipped when further analyzing f . Beforehand, the analysis continues similarly
to the direct case (lines 11 and 12 in Algorithm 3) but with fbnext = fx. In the second scenario
(lines 30 to 35), the computation skips fbb and fb gives the credit to fbnext, so the analysis
is similar to the first case.
Algorithm 5 check_influence(Lf, fb, fbb, fbnext, L)
1: delay_ind = 0
2: delay_d = 0
3: Lind = L
   #Case 1: Property 3 indicates that fbb is influent to fbnext
4: if (fbb.is_influent(fbnext)) then
5:   delay += fb.progress_up(rbb)
6:   L.add(fb)
7:   return delay + Compute_Delay(fb, Lfb, rbnext)
   #Case 2: Property 3 indicates that fbb is not influent to fbnext
8: else
9:   for all fx ∈ ]fbnext, f] in Lf do
10:    if (fbb.is_non_influent(fx)) then
11:      influent = false
12:    else
13:      influent = true
14:      break
15:    end if
16:  end for
     #Case 3: Property 3 indicates that fbb is not influent to fx
17:  if influent = false then
18:    delay += fb.give_credit(rbnext)
19:    f.setTo(fbnext)
20:    L.remove(fb)
21:    f = L.previous_tail()
22:    return delay + Compute_Delay(f, Lf, rbnext)
     #Case 4: Property 3 indicates that fbb is influent to fx
23:  else
24:    fbnext_init = fbnext
25:    fbnext = fx   ▷ fbnext in Lf is set to fx
26:    delay_d = fb.progress_up(rbb)
27:    L.add(fb)
28:    delay_d += Compute_Delay(fb, Lfb, rbnext)
29:    fbnext = fbnext_init
30:    delay_ind = fb.give_credit(rbnext)
31:    f.setTo(fbnext)
32:    L.remove(fb)
33:    f = L.previous_tail()
34:    delay_ind += Compute_Delay(f, Lf, rbnext)
35:  end if
36:  return delay + max(delay_ind, delay_d)
37: end if
Example. Let us illustrate how this algorithm is applied to the first scenario of the example
proposed in section 5.3.1 (nf2 = 2).
First, let us recall the lists of blocking flows:
• Lf1 = {< f2, r2 > < f3, r2 > < f1, r2 >}
• Lf2 = {< f4, r6 > < f2, r6 > < f5, r7 > < f2, r7 >}
• Lf3 = {< f3, r2 >}
• Lf4 = {< f5, r7 > < f4, r7 >}
• Lf5 = {< f5, r7 >}
We start with Lf1 as f1 is the studied flow. Initially, L contains f1. Let us compute fb, fbb, fbnext
and fbbnext (lines 3 to 6 in Algorithm 3):
• < fb, rb > = < f2, r2 >: f2 directly blocks f1.
• < fbb, rbb > = < f4, r6 >: f4 directly blocks f2.
• < fbnext, rbnext > = < f3, r2 >: f3 is the next flow that progresses after f2.
• < fbbnext, rbbnext > = < f2, r6 >: f2 is the next flow that progresses after f4.
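The four variables computed at lines 3 to 6 of Algorithm 3 can be reproduced with a few lines of
code. This is only a sketch: each list of blocking flows is stored as a plain Python list of
(flow, router) pairs taken from the example, reading always starts at the head of a list, and the
names LISTS, entry and first_step are hypothetical.

```python
# Lists of blocking flows of the example (first scenario).
LISTS = {
    "f1": [("f2", "r2"), ("f3", "r2"), ("f1", "r2")],
    "f2": [("f4", "r6"), ("f2", "r6"), ("f5", "r7"), ("f2", "r7")],
    "f3": [("f3", "r2")],
    "f4": [("f5", "r7"), ("f4", "r7")],
    "f5": [("f5", "r7")],
}


def entry(lst, index):
    """Return the entry at the given position, or None past the end of the list."""
    return lst[index] if index < len(lst) else None


def first_step(flow, lists):
    """Return (fb, fbb, fbnext, fbbnext) when the analysis of `flow` starts."""
    lf = lists[flow]
    fb = entry(lf, 0)          # first direct blocking flow of `flow`
    fbnext = entry(lf, 1)      # next entry in Lf
    lfb = lists[fb[0]]
    fbb = entry(lfb, 0)        # first blocking flow of fb
    fbbnext = entry(lfb, 1)    # next entry in Lfb
    return fb, fbb, fbnext, fbbnext


print(first_step("f1", LISTS))
```

For f1 this prints the pairs < f2, r2 >, < f4, r6 >, < f3, r2 > and < f2, r6 >, i.e. the values
listed above.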
As fb ≠ fbb, we have to check the influence of fbb, i.e. f4, on fbnext, i.e. f3. f4 is an
indirect flow over f3 (line 15 in Algorithm 3), thus the function check_influence() is used to
apply Property 3 and identify the influence of f4 over f3. In the first scenario
of this example, Property 3 establishes that f4 is a non influent flow for f3. We must then check
this influence over fx ∈ ]fbnext, f ] in Lf (line 9 in Algorithm 5), i.e. fx ∈ ]f3, f1] in Lf1 . In this
case, fx corresponds only to f1. In the same scenario, f4 does not have an influence over f1
(influent = false). Then, f2 can progress and leave the router r3, allowing the transmission
of f3 by applying Property 2: f3 waits only nf2 × 2dflit.
fb is now equal to f3 in Lf1 . We iterate the function Compute_Delay() on f1 as it is the
last flow blocked in L. Here:
• < fb, rb > = < f3, r2 >
• < fbb, rbb > = < f3, r2 >
• < fbnext, rbnext > = < f1, r2 >
• < fbbnext, rbbnext > = NULL
fb = fbb means that fb, i.e. f3, can progress. This is the case described in lines 7 to 9
of Algorithm 3. Thus, the function check_Next() is applied to verify whether f3 is blocked or not.
fbbnext is NULL in this example. As f3 ≠ fa, the flow f3 can progress and leave the router r2,
allowing the transmission of f1 (lines 11 to 16 in Algorithm 4). This means that Property 2 is
applied: f1 waits only nf3 × 2dflit, which is computed by the function give_credit(r2).
As f1 is still the last blocked flow in L, the function Compute_Delay() is iterated on Lf1
and fb is set to fbnext, i.e. f1. Now:
• < fb, rb > = < f1, r2 >
• < fbb, rbb > = < f1, r2 >
• < fbnext, rbnext > = NULL
• < fbbnext, rbbnext > = NULL
In this iteration, fb = fbb, fbnext = NULL and fb = fa (line 8 in Algorithm 4). This
indicates that f1 can progress to its destination r4. This delay is computed by the function
progress_leave(r2) and is equal to:
(r4 − r2 + 1) × dflit + 2dflit × (nfa − 1) = 3dflit + 2dflit = 5dflit.
We note that (r4 − r2 + 1) × dflit corresponds to the transmission delay of the header from r2
to r4. Considering that the packets are made of 2 flits, we obtain:
WCTT(f1) = 5dflit + (nf2 + nf3) × 2dflit = 13 cycles.
This value corresponds to the timeline obtained in Figure 5.18.
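The arithmetic of this example can be checked with a short sketch. It covers only the situation
of the first scenario, in which every flow preceding f1 in Lf1 ends up unblocked or non influent,
so that each of them contributes n × 2dflit (Property 2). Routers are given as integer indices,
dflit is assumed to be one cycle, and the name wctt_unblocked_case is hypothetical.

```python
def wctt_unblocked_case(lf, n_flits, r_destination, d_flit=1):
    """WCTT of the analyzed flow when all its blockers are unblocked.

    lf            -- list of (flow, router index) pairs, ending with the analyzed flow
    n_flits       -- packet size in flits of each flow
    r_destination -- router index of the destination of the analyzed flow
    """
    *blockers, (fa, r_blocked) = lf
    # Property 2: each unblocked blocker delays the next flow by n x 2 dflit.
    waiting = sum(2 * d_flit * n_flits[f] for f, _ in blockers)
    # Header transmission to the destination, then the remaining flits.
    leave = d_flit * (r_destination - r_blocked + 1) + 2 * d_flit * (n_flits[fa] - 1)
    return waiting + leave


LF1 = [("f2", 2), ("f3", 2), ("f1", 2)]   # Lf1 with routers as indices (r2 -> 2)
N_FLITS = {"f1": 2, "f2": 2, "f3": 2}     # packets of 2 flits

print(wctt_unblocked_case(LF1, N_FLITS, r_destination=4))  # 13
```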
5.4 Unitary evaluation of the properties
This section presents an evaluation of our algorithm on a synthetic benchmark. This
evaluation reports how Properties 2 and 3 reduce the pessimism
of the WCTT compared to the classical RC method. To quantify the improvement
over the RC method, we compute the following metric: (dRC − dA) × 100 / dRC , where dRC is
the WCTT value returned by the RC method and dA the value returned by RCNoC . We also report on
how the modification of some parameters, such as the number of flits or the size of the NoC,
affects our results.
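The metric is a relative reduction expressed in percent, as the following sketch shows. The
function name gain_percent and the two input values are hypothetical and are not results of the
evaluation.

```python
def gain_percent(d_rc: float, d_a: float) -> float:
    """Gain of RCNoC over RC: (dRC - dA) x 100 / dRC."""
    if d_rc <= 0:
        raise ValueError("d_rc must be strictly positive")
    return (d_rc - d_a) * 100 / d_rc


# Hypothetical WCTT values, in cycles, used only to illustrate the formula.
print(gain_percent(40, 13))  # 67.5
```

A gain of 0 means that both methods return the same bound, and the gain gets closer to 100 as the
bound returned by RCNoC becomes small compared to the one returned by RC.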
We used a SpaceWire application described in [FFF09a]. Since NoCs used within many-cores
are larger than a SpaceWire network, we vary the destination of the flows of the initial SpaceWire
application. We consider the mapping shown in Figure 5.19a, extracted from the SpaceWire
application. We compute the WCTT of f1, i.e. fa = f1, and present several configurations to
explain the gain due to Properties 2 and 3.
[Figure: (a) mapping of flows f1 to f6 over routers r1 to r12; (b) plot "latency of flow f1": gain
over RC (%) versus number of hops, for packets of 2, 3, 6, 12 and 19 flits.]
Figure 5.19: Mapping of flows in the first configuration and normalized gain compared to the
RC method.
Impact of Property 2. The goal of this first configuration is to quantify, for specific
mappings of flows over a NoC, how Property 2 improves the computation of the WCTT. As
shown in Figure 5.19a, we generate different mappings of flows by modifying the destination of
some of the direct flows, represented in dotted lines. For each mapping, we compute the WCTT
in order to study how the number of hops these flows make after the destination of f1 affects the
results. For instance, a number of hops equal to 0 means that these direct flows have the same
destination as f1, while a number of hops equal to 3 corresponds to the dotted lines.
Figure 5.19b shows the gain obtained for various numbers of hops and different packet lengths.
In the classical RC method, a flow cannot progress until its direct blocking flows reach
their destinations. The WCTT computed by RC therefore increases with the number of hops.
Conversely, Property 2 states that this delay only depends on the number of flits of the direct
flow. Thus, with RCNoC , for packets of a given number of flits, the WCTT remains
constant when the number of hops increases: the blocked flow always waits the
same amount of time for the credit from the direct blocking flow, independently of its destination.
This difference between the classical RC and RCNoC explains the shape of each curve.
For a given number of hops, the gain decreases when the number of flits of the
packet increases. Indeed, the blocking delay in both methods increases as more flits of the direct
blocking flows have to be transmitted by routers. Also, the larger the packet is, the closer the
maximal blocking delay of a flow blocked by fi gets to the delay for fi to reach its destination.
Impact of Property 3. The goal of this second configuration is to quantify the gain due to
Property 3 for specific mappings of flows. Compared to the previous configuration, we
therefore add an indirect flow f7 that blocks f5, a direct flow of f1, as shown in Figure 5.20a.
We also set the destination of f5 21 hops after the destination of f1. The value of Er is one
parameter that determines whether an indirect flow is influent or not. The router where f7
blocks f5 ranges from r5 to r23.
[Figure: (a) mapping of flows f1 to f7 over routers r1 to r6 and r24 to r29; (b) plot "latency of
flow f1": gain over RC (%) versus number of empty routers, for packets of 2, 3, 6, 12 and 19 flits;
(c) gain due to Property 3 only.]
Figure 5.20: Mapping of flows in the second configuration and normalized gain compared to
the RC method.
Figure 5.20b shows the gain for various values of Er and for different packet lengths. For
a given packet length, the two constant gains obtained are explained by the fixed 21 hops
that f5 performs to reach its destination, which are accounted for in the WCTT computed by the
RC method. The existence of two constant gains comes from whether Property 3 applies
or not, i.e. whether f7 is influent or not. This graph does not show only the gain due to Property
3: it also includes the constant gain due to Property 2. In order to isolate the gain due
to Property 3, we subtracted from the initial gain the one obtained when the number of empty
routers is less than or equal to the number of flits. Figure 5.20c presents the gain due only to
Property 3.
When Er < nfi (left part of the curves), f7 indirectly blocks f1 and thus influences its WCTT.
However, when Er ≥ nfi , the RC method still assumes that f7 influences f1 and thus includes it
in the computation of the WCTT of f1. RCNoC instead identifies that f7 is not influent, as
Property 3 applies. This difference increases the gain obtained and produces the second
plateau of the curves in Figure 5.20b, and hence the curves in Figure 5.20c.
Discussion
Considering the pipeline transmission of flits in the WCTT computation reduces the
pessimism introduced by the classical RC method. This improvement depends on the type of
contentions, the size of the NoC and the size of packets. Property 2 improves WCTT values
when the NoC is large and packets are small, or when the number of hops
performed by the flows is greater than the size of packets in flits. Property 3 helps when the
NoC is large enough for indirect contentions to occur with a minimum spacing between
analyzed flows and indirect ones. Let us stress that while our examples assume that all packets
have the same size, the properties we stated do not. Besides, these properties could also be
generalized in order to overcome assumption A10.
5.5 Conclusion
In this chapter, we have shown that the classical RC method leads to pessimistic values for
the WCTT of the flows. This is explained by the fact that the pipelined transmission of flits
in wormhole networks is not taken into account. Existing methods consider
that an analyzed flow can be blocked by other flows until they have reached their destinations.
Besides, they consider all indirect flows as blocking flows. Thus the transmission delays of all
blocking flows, i.e. direct and indirect, are added to the transmission delay of the analyzed
flow fa, leading to an over-approximation of the WCTT of fa. However, we have shown that
the pipeline transmission of flits over a NoC can be used to reduce the maximal blocking delay
of flows, thus tightening computed WCTTs.
To this end, we introduced three properties to either reduce the number of scenarios to be explored
when performing a WCTT analysis or to reduce the pessimism when computing WCTTs:
1. Reducing the number of scenarios: The first property states the worst-case scenario
when computing the maximal blocking delay a flow can suffer when a round-robin arbiter
is used.
2. Computing maximal delays: The second property states that the maximal blocking
delay a flow can suffer is bounded. This eliminates the need to wait until the blocking flow
reaches its destination.
3. Reducing the delays: The third property allows checking whether an indirect flow
influences the transmission delay of an analyzed flow.
These properties are implemented in a recursive algorithm in order to
improve the WCTT, as shown in the evaluation we report. We compared our algorithm, noted
RCNoC , to the RC method on a synthetic benchmark. We showed how some
parameters affect the gain obtained over RC thanks to the use of these properties. The
evaluation shows a significant reduction of the WCTT due to Properties 2 and 3, especially
with small packets. The reduction of the WCTT is of utmost importance as it allows flows
to respect their constraints. Besides, as shown in Chapter 4, the computation of the WCTT has
an impact on the behavior of the core-to-I/O flows. The evaluation of RCNoC on the problem
stated in Chapter 4 is presented in Chapter 7.
Chapter 6
MapIO: an I/O contention-aware mapping technique
Contents
6.1 How can the WCTT of core-to-I/O flows be reduced?
6.2 Overview of our strategy of mapping: MapIO
6.3 Phase 1: Core-to-I/O flows distance minimization
6.3.1 Filling NoC regions with applications
6.3.2 Application shape and its assignment to regions
6.3.3 Reclaiming unused cores within shapes
6.4 Phase 2: Core-to-I/O flows contention minimization
6.4.1 I/O critical paths within applications
6.4.2 Mapping tasks on the critical paths
6.4.3 Mapping tasks around the critical path
6.4.4 Minimizing the contention of outgoing flows
6.5 Conclusion
Chapter 4 has shown that even if we reduce the pessimism in the values computed for the
WCTT, the problem of rejecting critical frames remains. A mapping strategy that reduces
the WCTT of the core-to-I/O flows is thus needed. However, as mentioned in Chapter 4, the
existing mapping methods consider only core-to-core and core-to-memory communications, with
the objective of minimizing their contentions. Thus, these methods do not reduce the contention
on the core-to-I/O flows, which leads to dropped Ethernet frames. In order to avoid this problem,
we propose a new mapping strategy, called MapIO. This strategy takes into account the position
of the I/O interfaces and aims to reduce the contention on the path of the core-to-I/O flows.
6.1 How can the WCTT of core-to-I/O flows be reduced?
In Chapter 4, we have shown on case study A (section 4.3.2) that existing mapping
strategies do not consider where I/O interfaces are located within a NoC architecture when
mapping applications. We recall that this case study is made of 1 critical application and 3
non-critical applications. Let us explain how the allocation of a critical application, relative
to the location of the I/O interfaces, may impact the WCTT of the core-to-I/O flows.
When a critical application is allocated far from Ethernet, a core-to-I/O flow may suffer
from the large payload of the non-critical flows. Besides, the distance, in number of hops, of
the path of the core-to-I/O flow from DDR to Ethernet increases. Thus, the probability
that the core-to-I/O flow is blocked on its path increases, and so may its WCTT.
Similarly, if a critical application is allocated far away from the DDR,
the distance of the path of the core-to-I/O flow between the DDR and the destination core
increases. This allocation also increases the distance of the path of
the core-to-I/O flow coming from Ethernet to the DDR. Therefore, the WCTT of the core-to-I/O
flow may increase, as the probability that this flow is in contention, on different routers,
with the flows of non-critical applications increases.
However, we have shown that even if we map an application near both DDR and Ethernet
controllers, the internal mapping of the application may not be appropriate for
reducing the contention on the path of the core-to-I/O flow. Therefore, a static
mapping strategy of critical and non-critical real-time flows that reduces the WCTT of
core-to-I/O communications is needed.
This mapping strategy has two objectives:
• Objective 1: Minimizing the distance separating the placement of critical applications from both the DDR and Ethernet controllers.
Let d denote the total distance, in number of hops, of the path taken by a core-to-I/O flow. We have d = d1 + 2d2 + d3, with d1 the distance from the Ethernet to the memory controller, d2 the distance between the DDR controller and the destination core of the core-to-I/O flow, and d3 the distance of the outgoing I/O flow from the DDR to the Ethernet controller. The first goal is then: min(d1 + d2 + d3), as minimizing the distance of a flow reduces the number of times it can be in contention with other flows. The value of d1 depends on the Ethernet interface from which the core-to-I/O flow comes. We note that it depends on the region where the applications are mapped on the NoC (A14). Both d2 and d3 depend on where the destination core is mapped on the NoC, as well as on the memory port that is used.
• Objective 2: Reducing the contention on the core-to-I/O flow.
Once the distance of a core-to-I/O flow is minimized, the second goal is to minimize the contention that this flow can experience on its path. As the path taken by a core-to-I/O flow is induced by the XY routing, the allocation of specific tasks on and around this path can reduce its contention.
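As a rough illustration of the first objective, the sketch below evaluates d = d1 + 2d2 + d3 for one flow. It assumes that, with XY routing on a mesh, the hop count between two positions is their Manhattan distance. The function names and all coordinates are hypothetical and are not taken from the case studies.

```python
def hops(a, b):
    """Hop count between two (x, y) positions, assumed to be the Manhattan distance."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def total_distance(eth_in, ddr, core, eth_out):
    """Total distance d = d1 + 2*d2 + d3 of a core-to-I/O flow (hypothetical helper)."""
    d1 = hops(eth_in, ddr)    # Ethernet controller to DDR controller
    d2 = hops(ddr, core)      # DDR controller to destination core (weighted by 2 in d)
    d3 = hops(ddr, eth_out)   # DDR controller to Ethernet controller
    return d1 + 2 * d2 + d3


# Hypothetical positions: one core close to the controllers, one far from them.
near = total_distance(eth_in=(0, 2), ddr=(2, 0), core=(2, 2), eth_out=(0, 2))
far = total_distance(eth_in=(0, 2), ddr=(2, 0), core=(6, 6), eth_out=(0, 2))
print(near, far)
```

Under these assumptions, only d2 changes between the two placements, which is why the position of the destination core weighs twice in d.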
6.2 Overview of our strategy of mapping: MapIO
Figure 6.1 illustrates the different steps of MapIO which allocates the different applications of
the case study A presented in Chapter 4. This mapping is divided into two phases following
the two objectives introduced previously.
Figure 6.1: The different steps of our approach MapIO for mapping FADEC, HM and FFT applications of case study A (panels a to d).
In the first phase, illustrated in steps a to d, we allocate the applications to the different regions of the NoC. The idea is to allocate critical applications in the corners of the NoC where both Ethernet and DDR controllers are available. This placement reduces the distance d, as it especially reduces d2 and d3. Besides, it reduces the contention with flows of non-critical applications. However, arbitrarily allocating critical applications in the corners could fragment the remaining regions of the NoC. For this reason, we have to follow a single direction in the allocation. Thus, we begin from a corner of the NoC on the side of the DDR and Ethernet controllers and we follow one circular direction in order not to fragment the regions of the NoC. The applications are ordered in such a way as to allocate the critical applications near both Ethernet and DDR controllers to minimize the distance d. In order to avoid contention between the applications, we use rectangular shapes when mapping the applications in the regions, due to the XY routing. The shapes are chosen so that they correspond to the size of the region where they are allocated.
In the example of Figure 6.1, we begin with the non-critical application FFT16 by choosing a 4 × 4 shape, as shown in step a. On the next region, of size 3 × 4, we choose the non-critical application whose size is closest to the size of this region, i.e. HM12. Similarly, in step c, on the region 7 × 3, HM11 is allocated with a rectangular shape 4 × 3. Finally, the critical application FADEC9 is allocated near the Ethernet and DDR controllers.
After allocating the applications in the different regions of the NoC, we now focus on reducing the contention on the core-to-I/O flows. These core-to-I/O flows follow a fixed path (A13 and A14). Thus, the second phase consists in allocating the tasks of the applications placed near the Ethernet controllers in such a way as to reduce this contention.
As illustrated in the task mapping of the FFT application, the task tff15 is allocated to the core located at (1,6). Indeed, this task does not send any flow to the other tasks, thus it does not directly block the path of the core-to-I/O flow. Also, FADEC9 has a task tf6 which only receives data from the other tasks. Then, if we allocate this task to the core (2,2), it will not generate a flow to the task located at (2,1), i.e. tf2, and will thus not block the Y path of the core-to-I/O flow at the router (2,2).
The following sections give the details of each phase.
6.3 Phase 1: Core-to-I/O flows distance minimization
This section focuses on the first phase of the MapIO strategy over a Tilera-like architecture. This first phase allocates critical applications in priority in a dedicated region close to the memory and Ethernet controllers. To this end, we first have to order the applications in a list, noted Lapp, to be followed when allocating the applications. This phase is divided into three steps to reach our first objective. The first step consists in filling the different regions of the NoC with applications. It matches each region to a corresponding application, by following both Lapp and a circular direction. Then, the second step assigns a rectangular shape to each application. However, when this is not possible, we allow the definition of an arbitrary shape, possibly leading to inter-application contention due to the XY routing policy used within the NoC. Finally, the third step reclaims the cores left unused within the assigned shapes and gives them to the regions where critical applications are allocated, so that they can be used by the second phase of this strategy. These unused cores can be allocated on the path of the core-to-I/O flow in order to reduce its contention. These steps are described in Algorithm 6, which is explained progressively in the following sections.
In this section, we consider another case study, noted B, made of two instances of FADEC:
FADEC8, FADEC9 and 3 instances of HM: HM14, HM9, HM8. This case study is used to
explain the different steps of the first phase of MapIO.
6.3.1 Filling NoC regions with applications
We note Lregion the list of rectangular subsets of the NoC, called regions. Lregion initially contains a single region whose size is equal to the size of the NoC, i.e. L × W. Lapp is the list of applications that remain to be mapped. As mentioned in the previous sections, the objective is to allocate critical applications in the corner close to the memory and Ethernet controllers. Besides, we have to follow a circular direction in order to avoid the fragmentation of the different regions of the NoC. The organization of the applications within Lapp is essential to reach this objective.
How are the applications in Lapp organized?
Non-critical applications are inserted in Lapp first, and critical applications are then distributed between the head and the tail. All applications are inserted in decreasing order of their size, noted SAi. SAi is the number of cores an application uses. Then, the larger a critical application is, the closer to the memory and Ethernet controllers it is located. Therefore, starting the allocation from a corner of the NoC (on the side of the Ethernet interfaces) and following both Lapp and the circular direction on the NoC ensures the placement of the critical applications near both DDR and Ethernet controllers.
However, another objective of our mapping is to reclaim the unused cores from regions assigned to non-critical applications in order to give them to the regions that will later be assigned to critical applications. We note that the UNused cores value (UN) corresponds to the number of cores remaining free if we allocate all the applications on the NoC, and thus it is equal to $UN = (L \times W) - \sum_{i}^{n} SA_i$. These unused cores will be useful to reach our “objective 2”, i.e. reducing the contention on the path of the core-to-I/O flow, and thus for the second phase. However, this cannot be done if we do not correctly choose which application should be placed at the head or at the tail of the list Lapp. For example, in the case of one critical application and UN > 0, inserting the critical application at the head and beginning by mapping this application prevents the unused cores from being given to the critical application. Indeed, if we choose to reclaim the unused cores at the beginning, we cannot ensure a regular shape for the remaining applications. For this reason, we choose to insert the critical application at the tail and thus allocate it at the end. In this way, when we can reclaim unused cores from non-critical applications, they can be used within the region of the critical application.
How are critical applications inserted between the head and the tail of Lapp in order to reclaim unused cores?
How critical applications are distributed between the head and the tail of Lapp depends on the value of UN. If UN = 0, no free cores can be reclaimed: the critical applications are inserted at the head of Lapp, except one which is inserted at the tail when there is more than one critical application. Thus, in this case, the critical applications are allocated first, which ensures rectangular shapes for these applications. If however UN > 0, then it is the opposite: critical applications are inserted at the tail except one at the head. In this case, critical applications are allocated at the end while trying to reclaim unused cores from the non-critical applications. In both cases, we ensure that at least two critical applications are allocated in the two corners near the DDR and Ethernet interfaces.
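The ordering rules above can be sketched as follows. This is a minimal illustration and not the author's implementation: the function name, the dictionary layout and the choice of which critical application goes to the head (the largest one) are assumptions. The 49-core NoC size is also an assumption, chosen so that UN = 1 for the application sizes of both case studies.

```python
def build_lapp(critical, non_critical, noc_cores):
    """Order the applications into Lapp and return (lapp, un).

    critical and non_critical map an application name to its size SAi.
    """
    def by_decreasing_size(apps):
        return sorted(apps, key=apps.get, reverse=True)

    # UN: cores that remain free once every application is allocated.
    un = noc_cores - sum(critical.values()) - sum(non_critical.values())
    if un < 0:
        raise ValueError("the applications do not fit on the NoC")

    crit = by_decreasing_size(critical)
    if un == 0:
        # Critical applications at the head, except one at the tail when there are several.
        head, tail = (crit[:-1], crit[-1:]) if len(crit) > 1 else (crit, [])
    else:
        # Critical applications at the tail, except one at the head when there are several.
        head, tail = (crit[:1], crit[1:]) if len(crit) > 1 else ([], crit)
    return head + by_decreasing_size(non_critical) + tail, un


case_b = build_lapp({"FADEC9": 9, "FADEC8": 8}, {"HM14": 14, "HM9": 9, "HM8": 8}, 49)
case_a = build_lapp({"FADEC9": 9}, {"FFT16": 16, "HM12": 12, "HM11": 11}, 49)
print(case_b)
print(case_a)
```

With these inputs the sketch places FADEC9 at the head and FADEC8 at the tail for case study B, and FADEC9 at the tail for case study A.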
How to fill the applications into the regions?
Once Lapp is built, we can thus fill the applications into the different regions on the NoC.
We start to fill regions from one corner of the NoC. We then proceed following a clockwise or
counterclockwise direction, depending on whether the initial corner is located respectively on
the bottom or the top of the NoC. The function Search_App_to_Map() (line 9 of Algorithm 6)
searches for the corresponding application to allocate to the current region.
Algorithm 6 Map_IO(Lapp)
1:  Region = Lregion.head()
2:  while (Region ≠ NULL) do
3:      Appi = head(Lapp)
        # There is only one application in Lapp, which is allocated on the remaining regions on the NoC
4:      if (Lapp.Size == 1) then
5:          Appi.Accepted_Shape.Insert(Lregion)
6:          Lapp.Delete(Appi)
7:          Lregion = NULL
8:      end if
        # Lapp contains more than one application
        # Step 1: Search the corresponding application to the current region on the NoC
9:      Appi = Search_App_to_Map(Lapp)
        # There is no application that has a size less than the size of the current region, then aggregate regions
10:     if (Appi == NULL) then
11:         Aggregate_Regions(Lregion)
        # An application is found to be allocated in the current region
12:     else
            # Step 2: Choose the corresponding rectangular shape for the selected Appi
13:         min_xy = min_X_Y(Appi, Lapp)
14:         Appi.Selected_shape = Choose_Shape(min_xy, Region, Appi)
            # Step 3: Reclaim unused cores within the selected shape
15:         Map_Update(Region, Lregion, Appi, min_xy, min_app_size, UN)
            # The application Appi is allocated
16:         Appi.Remove(Lapp)
17:         Region.Remove(Lregion)
18:     end if
        # Update Lapp and Lregion depending on the UN value
19:     if ((UN == 0) && (Initial_UN ≠ 0)) then
20:         Lapp.Inverse()
21:         Lregion.Inverse()
22:     end if
23: end while
The current head of Lapp defines the criticality level that is considered when performing this lookup to select an application. The application that has this criticality level and the highest SAi value less than or equal to the size of the considered region is selected to fill this region. Whenever a part of a region is filled by an application, after performing the second and third steps (lines 13 to 18, which are explained in the next sections), we say that this application is mapped and it is removed from Lapp. Lregion is then updated with the remaining rectangular regions of the NoC. If the size of the considered region is strictly smaller than the size of every candidate application (lines 10 to 12), the head of Lregion and the next region are merged by the function Aggregate_Regions(). The size of this aggregated region is computed and the lookup process is repeated until a region of sufficient size is built. Nevertheless, when only one application remains to be allocated in Lapp (lines 4 to 8), this application is assigned to the remaining regions in Lregion.
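The lookup performed by Search_App_to_Map() can be illustrated with the sketch below. The tuple layout of Lapp, the Python function name and the tie-breaking between applications of equal size are assumptions made for the example. Returning None stands for the case where regions must be aggregated.

```python
def search_app_to_map(lapp, region_size):
    """Return the name of the application to map in the region, or None.

    lapp is an ordered list of (name, size, is_critical) tuples (hypothetical layout).
    """
    if not lapp:
        return None
    wanted_criticality = lapp[0][2]  # criticality level given by the head of Lapp
    candidates = [
        (size, name)
        for name, size, is_critical in lapp
        if is_critical == wanted_criticality and size <= region_size
    ]
    if not candidates:
        return None  # no application fits: the caller would aggregate regions
    return max(candidates)[1]  # largest size that fits in the region


lapp_b = [
    ("FADEC9", 9, True),
    ("HM14", 14, False),
    ("HM9", 9, False),
    ("HM8", 8, False),
    ("FADEC8", 8, True),
]
first = search_app_to_map(lapp_b, 7 * 7)       # whole NoC, critical head
second = search_app_to_map(lapp_b[1:], 4 * 3)  # next region, non-critical head
print(first, second)
```

The two calls use the list and region sizes of case study B: the first region considered is the whole NoC, the second one is the 4 × 3 region.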
Finally, as rectangular shapes are used to avoid inter-application contention, an application may use additional unused cores (when UN > 0) to define such a rectangular area. Thus, the value of UN changes dynamically when mapping an application. However, if these cores are used by non-critical applications and UN thus drops to 0, they can prevent the mapping of critical applications. To avoid this situation, when UN reaches 0 (lines 19 to 22), we reverse the lists Lregion and Lapp and therefore continue from the other corner. We also reverse the direction followed to fill the regions.
Example. In our new case study B, we have UN = 1 and we have more than one critical
application. FADEC9 is thus inserted at the head of Lapp while FADEC8 is inserted at the
tail. Therefore, Lapp contains {FADEC9, HM14, HM9, HM8, FADEC8}. However, in the case
study A illustrated in Figure 6.1, there is one critical application FADEC9 which is inserted
at the tail of Lapp as UN = 1. Now, let us see in the next section in which shapes these
applications are allocated on the NoC.
6.3.2 Application shape and its assignment to regions
Filling a part of a region with an application requires defining the actual rectangular shape taken by this application. The choice of this rectangle is not arbitrary, as we will see in this second step of the first phase of MapIO.
Each application Appi is described by its size SAi, assuming each core is dedicated to a task. For each shape, we note UNi the number of cores left UNused by Appi, and we therefore have UNi = (Xi × Yi) − SAi. We arbitrarily compute three possible rectangular shapes Xi × Yi of minimal UNi values and whose sizes are greater than or equal to SAi. These computed shapes are as close as possible to squares to avoid inter-application contention. We thus forbid Xi and Yi from being equal to 1 if SAi > 3.
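One possible way to enumerate such candidate shapes is sketched below. The enumeration rule is an assumption: areas are tried in increasing order and, for each area, the factorization closest to a square that fits on the NoC is kept. The function name and the 7 × 7 default NoC size are also assumptions. Shapes are returned with X ≤ Y, so the orientation is not decided here.

```python
import math


def candidate_shapes(size, noc_l=7, noc_w=7, count=3):
    """Return up to `count` pairs ((x, y), un_i) with minimal un_i = x * y - size."""
    min_side = 2 if size > 3 else 1  # sides of length 1 are forbidden when SAi > 3
    short_max, long_max = min(noc_l, noc_w), max(noc_l, noc_w)
    shapes = []
    area = size
    while len(shapes) < count and area <= noc_l * noc_w:
        best = None
        for x in range(min_side, math.isqrt(area) + 1):
            if area % x == 0:
                y = area // x
                if x <= short_max and y <= long_max:
                    best = (x, y)  # a larger x gives a shape closer to a square
        if best is not None:
            shapes.append((best, area - size))
        area += 1
    return shapes


print(candidate_shapes(9))
print(candidate_shapes(14))
print(candidate_shapes(8))
```

For sizes 9, 14 and 8 this rule yields shapes with UNi values 0, 1, 3, then 0, 1, 2, then 0, 1, 2, respectively.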
How to choose the appropriate rectangular shape for each application?
When an application is mapped on the NoC, its most appropriate shape is selected. However, this selection depends on the possible shapes of the other applications, to avoid generating a remaining region in which no other application can be mapped. The goal is to optimize the shape of these newly created regions so that the next applications can be mapped without being fragmented, i.e. assigned to two regions. To this end, the minimum value of the Xi and Yi of all applications, noted minxy, is computed by the function min_X_Y() (line 13). minxy represents the minimum size for either the row or the column of an application. Thus, the function Choose_Shape() (line 14) includes the following constraints on minxy and therefore selects the shape that avoids an application being fragmented over several regions.
$((L - X_i \geq min_{xy}) \lor (L - X_i = 0)) \; \land$ (6.1)
$((W - Y_i \geq min_{xy}) \lor (W - Y_i = 0))$ (6.2)
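Constraints (6.1) and (6.2) can be checked with the short sketch below. It assumes that L and W are taken as the dimensions of the region currently being filled and that candidate shapes are tried in their listed order. The Python names are hypothetical, and the candidate lists are those of Table 6.1.

```python
def min_x_y(shapes_per_app):
    """Smallest X or Y over all candidate shapes of all applications."""
    return min(min(x, y) for shapes in shapes_per_app.values() for x, y in shapes)


def choose_shape(region, shapes, min_xy):
    """Return the first candidate (x, y) that fits in the region and leaves either
    no cores or at least min_xy cores in each dimension, or None."""
    l, w = region
    for x, y in shapes:
        if x > l or y > w:
            continue  # the shape does not fit in the region
        cols_ok = (l - x >= min_xy) or (l - x == 0)  # constraint (6.1)
        rows_ok = (w - y >= min_xy) or (w - y == 0)  # constraint (6.2)
        if cols_ok and rows_ok:
            return (x, y)
    return None


table_6_1 = {
    "FADEC9": [(3, 3), (2, 5), (3, 4)],
    "HM14": [(2, 7), (3, 5), (4, 4)],
    "HM9": [(3, 3), (5, 2), (4, 3)],
    "HM8": [(2, 4), (3, 3), (2, 5)],
    "FADEC8": [(2, 4), (3, 3), (2, 5)],
}
m = min_x_y(table_6_1)
print(m, choose_shape((7, 7), table_6_1["FADEC9"], m), choose_shape((4, 3), table_6_1["HM9"], m))
```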
Example. Figure 6.2 illustrates the different steps of the application mapping for this case
study. Table 6.1 shows the computed rectangular shapes for each application of the case study
B as well as the value of UNi.
FADEC9        HM14          HM9           HM8           FADEC8
X × Y   UN    X × Y   UN    X × Y   UN    X × Y   UN    X × Y   UN
3 × 3   0     2 × 7   0     3 × 3   0     2 × 4   0     2 × 4   0
2 × 5   1     3 × 5   1     5 × 2   1     3 × 3   1     3 × 3   1
3 × 4   3     4 × 4   2     4 × 3   3     2 × 5   2     2 × 5   2
Table 6.1: Different shapes and their free cores for each application.
Figure 6.2: Mapping of FADEC and HM applications of the case study B using our approach MapIO (panels a to g).
The application at the head of Lapp is a critical application. We thus have to consider only critical applications. The first matching critical application, i.e. the one with the highest SAi value that is less than or equal to the size of the NoC, is FADEC9 (SAFADEC9 = 9). To select the shape of FADEC9, we compute minxy, which is equal to 2. Since the first shape 3 × 3 for FADEC9 keeps more than 2 columns and 2 rows free, this shape is selected. We start by filling the NoC from its top right corner and therefore map FADEC9 on this corner, as shown in Figure 6.2a. We then follow a counterclockwise direction as explained in step 1, as shown by the other steps of Figure 6.2. The next region that is considered is of size 4 × 3. HM14 is now at the head of the list Lapp, thus a non-critical application should be selected. SAHM14 is greater than the size of this next region, so HM14 does not correspond to this region. Therefore, HM9 is selected to fill this region as its SAi value is less than the size of this region. At this step, minxy is still equal to 2. The first shape that is considered for HM9 is 3 × 3. However, it does not keep minxy columns, i.e. 2, free in order to be able to map a next application with a rectangular shape, i.e. it does not verify equation 6.1 ((7 − 3) − 3 = 1, which is less than 2). In fact, the remaining region 1 × 3 has a size that is less than the minimum size of the remaining applications. It is then impossible to allocate an application within this region without fragmenting it. On the other hand, the shape 2 × 5 does not fit in the selected region 4 × 3. The selected shape is thus 4 × 3, which verifies equation 6.1, as shown in Figure 6.2b. However, this shape has a size greater than the size of the application, and thus it generates a number of unused cores. Let us see, in the next section, the next step (step 3) to be performed when selecting a shape having UNi > 0.
6.3.3 Reclaiming unused cores within shapes
A selected shape can generate a number of unused cores UNi, whose value indicates whether these cores can be reclaimed or not. The function Map_Update() of the third step (line 15) determines which case applies to the shape selected in the second step. Algorithm 7 describes this function. There are two cases:
• Case 1: The unused cores are eliminated from the shape of an application when UNi > UN or UNi ≥ minxy.
When UNi ≥ minxy, these unused cores could in fact form a contiguous region with that of the critical application. Thus, we reclaim these cores to allocate them to a critical application. On the other hand, when UNi > UN, these unused cores must be eliminated in order to be able to allocate all the applications on the NoC. The shape must be modified by eliminating a number of unused cores from either the column or the row. This choice depends on whether the remaining region, after the removal of cores from the column, corresponds to the size of an existing application. For this reason, we first compute the remaining region when eliminating the unused cores from the column (lines 2 and 3). The size of the remaining region is thus col_remaining_region.col × col_remaining_region.row. If this computed region could correspond to the size of an application having the same level of criticality, then the unused cores are eliminated from the columns (lines 4 to 7). The new shape, noted Accepted_Shape, is recomputed using the function Compute_Shape_Col. Otherwise, we remove these cores from the row and the new shape is recomputed using the function Compute_Shape_row (lines 9 to 12). This choice is in fact due to the clockwise or counterclockwise direction chosen to progress between the regions. This direction increases the probability of having the maximum number of cores next to the remaining region when eliminating the cores from the row. Thus, it avoids allocating the remaining applications into fragmented regions.
• Case 2: The unused cores cannot be eliminated from the shape of an application when UNi ≤ UN and UNi < minxy.
When UNi < minxy, no application can benefit from these unused cores as they cannot form a contiguous region. There is then no need to eliminate these unused cores. However, we have to verify that UNi ≤ UN in order to guarantee a sufficient number of remaining cores to allocate all applications. The definitive shape is then the one that was selected in the previous part, i.e. the second step (line 14).
Finally, Lregion is updated by computing the remaining regions (line 16).
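The decision between the two cases, and between the column and the row in Case 1, can be summarized by the sketch below. It follows the tests of lines 1 to 4 of Algorithm 7. The function name, the returned labels and the flat parameter list are assumptions made for the example.

```python
def reclaim_decision(un_i, un, min_xy, region_rows, shape_rows, min_app_size):
    """Return 'keep', 'eliminate_from_column' or 'eliminate_from_row'."""
    # Case 2: UNi <= UN and UNi < min_xy, the selected shape is kept as it is.
    if not (un_i >= min_xy or un_i > un):
        return "keep"
    # Case 1: region that would remain if UNi were removed from the column
    # (lines 2 and 3 of Algorithm 7).
    remaining_rows = region_rows - shape_rows + 1
    remaining_cols = un_i
    if remaining_rows * remaining_cols >= min_app_size:
        return "eliminate_from_column"
    return "eliminate_from_row"


# HM9 in case study B: shape 4 x 3 in a region with 3 rows, UNi = 3, UN = 1,
# minxy = 2; the smallest remaining application is assumed to use 8 cores.
hm9 = reclaim_decision(un_i=3, un=1, min_xy=2, region_rows=3, shape_rows=3, min_app_size=8)
print(hm9)
```

For HM9, the region left by a removal from the column would be 1 × 3, which is smaller than any remaining application, so the sketch selects the removal from the row.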
Example. In our example, the shape selected for HM9, i.e. 4 × 3, has 3 unused cores. This number exceeds the allowed number of unused cores, which is UN = 1. We therefore have to remove these 3 unused cores from the shape. This removal is done from the row. Indeed, the elimination from the column generates a region of size 1 × 3, which does not correspond to the size of any remaining application. Figures 6.2b and 6.2c present these steps to select the definitive shape for HM9, which is 4 × 2 + 1 × 1. Lregion therefore contains two remaining regions, which are {< 3 × 1 >;< 7 × 4 >}. These two remaining regions are then aggregated to accommodate the size of the remaining non-critical application HM14. The region 3 × 1 handles a part of HM14. The size of the remaining part of HM14 is equal to 11, which fits in the region 7 × 4. A new computation of candidate shapes is done at this step by considering the size 11, in the same way as described in section 6.3.2. The selected shape is 3 × 4 since it verifies equations 6.1 and 6.2. These steps are shown in Figures 6.2d and 6.2e. This last shape does not need to be redefined as UNi < minxy. The definitive shape of HM14 is an aggregated shape of < 3 × 1 > and < 3 × 4 >. Lregion contains a single region, which is {< 4 × 4 >}, and we now have UN = 0. We therefore have to reverse both Lregion and Lapp, leading to Lapp = {FADEC8, HM8}. We then continue from the bottom right corner of the NoC and follow a clockwise direction. This ensures that the critical FADEC8 application is mapped near both the Ethernet and DDR controllers. Its first candidate shape 2 × 4 is the definitive shape as it verifies equations 6.1 and 6.2. Besides, it does not generate any unused cores. Finally, the remaining region of 2 × 4 is left to the remaining application HM8. These steps are illustrated in parts f and g of Figure 6.2.
Algorithm 7 Map_Update(Region, Lregion, Appi, min_xy, min_app_size, UN)
    # Case 1: UNi can be reclaimed
1:  if ((App.Selected_shape.UNi ≥ minxy) || (App.Selected_shape.UNi > UN)) then
        # Compute the remaining regions if we eliminate UNi from the column of the selected shape
2:      col_remaining_region.row = region.row − shape.row + 1
3:      col_remaining_region.col = App.shape.UN
        # Eliminate UNi from the column of the selected shape
4:      if (col_remaining_region.Size ≥ min_app_size) then
5:          Eliminate_unused_cores_col(region, App, Lregion)
6:          App.Accepted_Shape = Compute_Shape_Col(App)
7:          App.Selected_shape.UNi = 0
        # Eliminate UNi from the row of the selected shape
8:      else
9:          Eliminate_unused_cores_row(region, App, Lregion)
10:         App.Accepted_Shape = Compute_Shape_row(App)
11:         App.Selected_shape.UNi = 0
12:     end if
13: else
        # Case 2: UNi cannot be reclaimed
14:     App.Accepted_shape.Insert(App.shape)
15: end if
    # Update Lregion
16: Compute_remaining_regions(region, App, Lregion)
6.4 Phase 2: Core-to-I/O flows contention minimization
The second phase of MapIO maps the tasks of each application within its assigned subset of the Tilera-like NoC. It reduces the WCTT of the core-to-I/O flows by applying specific rules for mapping tasks on and around the paths taken by these flows. As XY routing is used, these rules favor communications that are perpendicular to the core-to-I/O flow. In this section, we present these rules: we first begin the allocation on the path of the core-to-I/O flow, called the critical path. Then, we allocate the tasks around this path and finally, we apply some rules to reduce the contention on the outgoing I/O flow, i.e. from the DDR to the Ethernet interface. We note that the following rules are described by considering core-to-I/O flows going to the DDR memory located south of the NoC. However, to apply these rules to the core-to-I/O flows using the DDR memory located north of the NoC, we just have to invert the Y axis.
On the other hand, we apply the SHiC method (described in Chapter 3, section 3.2.2) to allocate the tasks within the applications that are mapped far away from the Ethernet interfaces and whose region is not crossed by core-to-I/O flows.
Before explaining the rules used in the second phase, let us first describe how to determine the critical path.
6.4.1 I/O critical paths within applications
Let us assume an application Appi whose bottom right corner is located at (1,1) and whose associated region, obtained from the first phase, is of size m × n. A core-to-I/O flow traversing this region follows a path from (0, yi), where the Ethernet controller in use is located, to (xi, 0), where the DDR port in use is located. This path may thus be located entirely or partially within the region of Appi, depending on the values of m and n. The part of the path located within this region is called the critical path of the core-to-I/O flow. When xi < m and yi < n, the path from (0, yi) to (xi, 0) belongs to Appi, as shown in Figure 6.3a. It is therefore the critical path of this flow. In the other cases, only a subset of either the X part (xi > m and yi < n) or the Y part (xi < m and yi > n) of the path of the core-to-I/O flow is located within the region of Appi. These other cases are illustrated in Figures 6.3b and 6.3c. In these latter cases, the remaining part of the path belongs to another application, and the same principle is applied to that application. Note also that an application can thus be crossed by several core-to-I/O flows and can therefore have several critical paths associated with it.
Figure 6.3: Different cases of a possible critical path, shown in blue, for a core-to-I/O flow (path shown in red) of an application (whose mapping is shown in yellow). Panels (a), (b) and (c).
Example. Let us focus, in the case study B illustrated in Figure 6.2, on the Ethernet controllers located at:
• (0, 2), which is shared between FADEC8 and HM8,
• (0, 4), which is used by HM14.
Let us first consider the mapping of FADEC8, whose bottom right corner is located at (1,1) and whose shape is of size 2 × 4. There are two critical paths, as shown by step 0 at the bottom of Figure 6.4. The first one belongs to its own core-to-I/O flow or to the core-to-I/O flow of HM8 (as they follow the same path), which comes from the Ethernet controller located at (0, 2) and goes to the DDR port located at (2, 0). The second critical path is a subset of the core-to-I/O flow of HM8, which comes from (0, 4) and goes to (3, 0). This critical path goes from (0, 4) to (2, 4). The remaining part, from (3, 4) to (3, 0), is the critical path of HM8.
Figure 6.4: The different steps of internal mapping for FADEC8 and HM8 (step 0: determine the critical path; then mapping on the critical path, mapping around the critical path, and change of DDR port).
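The definition of a critical path can be sketched as follows, using the two flows of the FADEC8 example. The sketch assumes that the flow first travels along the row yi and then along the column xi, and that the region covers the cores (x, y) with 1 ≤ x ≤ m and 1 ≤ y ≤ n. The function name is hypothetical.

```python
def critical_path(xi, yi, m, n):
    """Cores of the path from (0, yi) to (xi, 0) that lie inside an m x n region
    whose corner core is (1, 1)."""
    x_part = [(x, yi) for x in range(1, xi + 1)]      # along the row yi
    y_part = [(xi, y) for y in range(yi - 1, 0, -1)]  # then down the column xi
    return [(x, y) for x, y in x_part + y_part if 1 <= x <= m and 1 <= y <= n]


# FADEC8 region of size 2 x 4 in case study B.
first = critical_path(xi=2, yi=2, m=2, n=4)   # flow from (0, 2) to (2, 0)
second = critical_path(xi=3, yi=4, m=2, n=4)  # flow from (0, 4) to (3, 0)
print(first)
print(second)
```

The first flow is entirely inside the region of FADEC8, whereas only the two cores (1, 4) and (2, 4) of the second flow are, which matches the critical path described as going from (0, 4) to (2, 4).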
Impact of the critical path on the WCTT
The critical path is the path taken by the core-to-I/O flow going from the Ethernet to the DDR controller. A strategy to avoid the initial problem of dropping Ethernet frames, stated in chapter 4, is to reduce the WCTT of the core-to-I/O flows. A solution consists in reducing the contention experienced by the core-to-I/O flows. Thus, it is important to allocate the tasks in such a way that they do not generate flows in direct contention with the core-to-I/O flow. For this reason, we divide the solution into two parts. First, we have to choose the tasks to be allocated on the critical path. Second, as these tasks on the critical path could generate different communications with other tasks, it is also important to choose the placement of these remaining tasks. Let us see how to allocate the tasks on and around the critical path.
6.4.2 Mapping tasks on the critical paths
Let us consider an application Appi with several critical paths. We sort the critical paths by increasing length. Let us consider a critical path from (0, yi) to (xi, 0). The contention level on a critical path can be reduced by limiting the number of flows in direct contention with the corresponding core-to-I/O flow. Note that, due to the XY routing policy, the minimum value is simply obtained by not mapping tasks on the critical path. We thus alternately select a core from the row yi and from the column xi in order to map tasks. Figure 6.5 illustrates the order to follow in the different cases of the critical path of Appi. For the row yi and the column xi, we start respectively from the cores located at (1, yi) and (xi, 1) and stop when the limit of the critical path within Appi is reached. When the end of the row or the column is reached, and xi ≠ yi, we continue in sequence with the cores on the remaining direction of the critical path, as illustrated in Figure 6.5a. This alternation stems from the fact that the cores taken in this order have the largest blocking effect on the path of the core-to-I/O flow. To explain this alternation, we first have to show how beneficial it is, for reducing the contention, to leave cores unused by following this order when UNi ≠ 0.
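One possible reading of this alternation is sketched below. It is an interpretation rather than the exact order of Figure 6.5: the corner core (xi, yi) is assumed to belong to the row, the row is assumed to be visited first at each step, and the function name is hypothetical.

```python
from itertools import zip_longest


def mapping_order(xi, yi, m, n):
    """Order in which the cores of a critical path are considered, alternating
    between the row yi (from (1, yi)) and the column xi (from (xi, 1))."""
    # Row cores inside the m x n region, starting next to the Ethernet controller.
    row = [(x, yi) for x in range(1, min(xi, m) + 1)] if yi <= n else []
    # Column cores inside the region, starting next to the DDR port.
    col = [(xi, y) for y in range(1, min(yi - 1, n) + 1)] if xi <= m else []
    order = []
    for row_core, col_core in zip_longest(row, col):
        if row_core is not None:
            order.append(row_core)
        if col_core is not None:
            order.append(col_core)
    return order


print(mapping_order(xi=2, yi=2, m=2, n=4))  # first critical path of FADEC8
print(mapping_order(xi=3, yi=4, m=2, n=4))  # second one: only a part of the row
```

When one of the two lists is exhausted, zip_longest simply continues with the remaining direction, which mirrors the rule stated for the end of the row or the column.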
Specific mapping rules when unused cores exist
When UNi > 0 for an application Appi, leaving cores on one of its critical paths unused, i.e. without tasks being mapped on them, reduces the contention that the core-to-I/O flow experiences. Let us first focus on the X part of a critical path. The first reason is that the number of flows in direct contention with the core-to-I/O flow is indeed reduced. This is illustrated by Figures 6.6a and 6.6b, which show that if we do not map a task on (1, yi), the number of Egress communications (EC) blocking the critical path is reduced. This example can easily be adapted to the Y part of a critical path.
Figure 6.5: The order of mapping the tasks by considering the different cases of a critical path. Panels (a), (b) and (c).
When allocating a task on (1, yi) while keeping the core (2, yi) unused, all flows sent from this task
to any task on the NoC, except those on its own column (the first one), directly block the critical
path, since they share the same links. We therefore define a mapping rule for leaving cores unused
on the row yi: the cores closest to (0, yi) are left unused if possible (UNi > 0). This task mapping
rule also impacts the outgoing core-to-I/O flows, i.e. from DDR to Ethernet controllers. These
flows are also directly blocked by flows received from the tasks located on the first row
till yi−1. The number of blocking flows is however reduced by simply inverting the mapping, as
shown in Figure 6.6b. The task on (2, yi) can indeed 1) send to columns 1 and 2 and 2)
receive from any task, except those located on the first row with x > 2, without blocking the
outgoing core-to-I/O flow in either case.
Figure 6.6: The different communications that could block the X and Y parts of a critical path,
depending on where free cores are located.
A similar strategy applies to the Y part of the critical path. Figures 6.6c and 6.6d justify the
order in which the unused cores are chosen on the column of the critical path, by reasoning now
on the flows received by the tasks. Thus, when we allocate a task on (xi, 1), all flows received
from the tasks allocated on the NoC, except those on the first row, block the critical path, either
in its Y part or in its XY part. Furthermore, the flows sent from this task to all columns having
x < xi block the outgoing I/O flow. However, the number of blocking flows is reduced when
we allocate a task on (xi, 2) and keep the core (xi, 1) free, as illustrated in Figure 6.6d.
This explains the order to follow when allocating the tasks on the cores of the critical path.
The same order is used when we choose to keep a core unused on this path. Note that the
number of cores of the critical path (both the X and Y parts) with no tasks mapped on them
depends on the value of UNi. If UNi is greater than the number of cores on the critical path,
then all cores on the critical path are unused. Besides, we keep unused the cores in the region
where x < xi and y < yi, as this region could directly affect the path of the outgoing I/O flow.
Let us now see which tasks are chosen to be allocated on the critical path.
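The choice of unused cores described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in this work: the function names are hypothetical, and the order of Figure 6.5a is assumed to alternate between the X part and the Y part of the critical path, starting from the cores closest to the I/O controllers and ending with (xi, yi).

```python
from itertools import zip_longest


def critical_path_order(xi, yi):
    """Cores of the critical path in the assumed mapping order of Figure 6.5a."""
    x_part = [(x, yi) for x in range(1, xi)]   # (1, yi) .. (xi - 1, yi)
    y_part = [(xi, y) for y in range(1, yi)]   # (xi, 1) .. (xi, yi - 1)
    order = []
    # Alternate between both parts; when one part is exhausted,
    # continue in sequence with the remaining one.
    for x_core, y_core in zip_longest(x_part, y_part):
        if x_core is not None:
            order.append(x_core)
        if y_core is not None:
            order.append(y_core)
    order.append((xi, yi))                     # assumption: the corner comes last
    return order


def unused_cores(xi, yi, un):
    """First `un` cores of the order; all of them if `un` exceeds the path length."""
    return critical_path_order(xi, yi)[:max(un, 0)]


print(unused_cores(2, 2, 1))
```

With xi = yi = 2 and UNi = 1, only the core (1, 2) is left unused in this sketch; when UNi exceeds the number of cores of the critical path, the slice simply returns all of them.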
Mapping rules when no unused cores exist
When UNi = 0, all the cores on the critical path should be allocated by following the order
illustrated in Figure 6.5a. Once a core is selected, we distinguish whether it is located in the X
part or the Y part of the critical path when we apply the mapping rules.
Allocation on the X part. On the X part of the critical path, we allocate tasks from (1, yi)
to (xi − 1, yi). We map a task on (j, yi) only if it does not directly block the core-to-I/O flow,
i.e. if it satisfies the following constraints:
• when j = 1, the task on (1, yi) must not send data to DDR controllers: a task on (1, yi) that
communicates with DDR generates a flow in contention with the X part of the critical path.
This condition is specific to the Tilera architecture, as the first column of the NoC is not
directly connected to the first DDR port.
• it must not send any data to the tasks allocated on (xi, k), with k ∈ [1, j − 1]: a task
sending data to (xi, k) generates a flow that is in direct contention with the X and Y parts
of the path of the core-to-I/O flow.
• it must not receive any data from the tasks allocated on (j − 1, yi) (when j ∈ [2, xi]): a task
on (2, yi) sending data to (3, yi) generates a flow in direct contention with the X path of
the core-to-I/O flow. For this reason, the task to be allocated on (3, yi) must
not receive data from the tasks allocated on (2, yi) and (1, yi).
We order the list of tasks that fulfill these constraints by increasing number of EC. Let us
suppose that the task with the highest EC is allocated on (1, yi). The probability that it sends
data to tasks allocated on columns other than its own then increases. Only the
flows sent to the tasks belonging to its column, i.e. the first one, do not block the X part of
the critical path. For this reason, we allocate on (1, yi) the task that satisfies these rules
and has the minimum number of EC. However, a task on (2, yi) could send to two columns, i.e.
x = 1 and x = 2, without blocking this X path. Thus, this constraint weakens as the column
index of the core increases.
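The three constraints and the EC ordering for the X part can be put together as follows. This is a minimal sketch, assuming that each task is described by the names of the tasks it sends to and receives from; the `Task` class, the function name and the four-task application are hypothetical and are not taken from the case studies. The third constraint is applied to all previously placed cores of the row, as in the (3, yi) example above.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    sends_to: set = field(default_factory=set)        # names of destination tasks
    receives_from: set = field(default_factory=set)   # names of source tasks
    sends_to_ddr: bool = False

    @property
    def ec(self):
        # Egress communications, counted here as the number of destination tasks.
        return len(self.sends_to)


def select_x_part_task(j, xi, yi, candidates, placed):
    """Pick the task for core (j, yi); `placed` maps a core (x, y) to a task name."""
    # Tasks already placed on the Y part, on cores (xi, 1) .. (xi, j - 1).
    y_part = {placed[(xi, k)] for k in range(1, j) if (xi, k) in placed}
    # Tasks already placed on the X part, on cores (1, yi) .. (j - 1, yi).
    x_before = {placed[(col, yi)] for col in range(1, j) if (col, yi) in placed}
    eligible = [
        t for t in candidates
        if not (j == 1 and t.sends_to_ddr)       # rule 1: no DDR traffic from (1, yi)
        and not (t.sends_to & y_part)            # rule 2: no flow towards the Y part
        and not (t.receives_from & x_before)     # rule 3: no flow along the X part
    ]
    # Among the eligible tasks, keep the one with the fewest egress communications.
    return min(eligible, key=lambda t: t.ec, default=None)


# Hypothetical four-task application.
tasks = [
    Task("t0", receives_from={"t1"}, sends_to_ddr=True),
    Task("t1", sends_to={"t0", "t3"}, receives_from={"t2", "t3"}),
    Task("t2", sends_to={"t1"}, receives_from={"t3"}),
    Task("t3", sends_to={"t1", "t2"}, receives_from={"t1"}),
]
first = select_x_part_task(1, 3, 2, tasks, {})
print(first.name)
```

In this toy application, t0 has the smallest EC but sends to the DDR, so it is rejected for (1, yi) and the next task in EC order is selected, which mirrors the reasoning applied to tf6 and tf4 in the example of this section. The function returns `None` when no candidate satisfies the rules.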
Allocation on the Y part. On the Y part of the critical path, similar rules can be defined
when we allocate tasks from (xi, 1) to (xi, yi − 1). Thus, a task allocated on (xi, j) must satisfy
the following rules:
• it must not send data to the tasks allocated on (xi, k), with k ∈ [1, j − 1], to avoid the Y
contention with the critical path;
• it must not receive data from the tasks allocated on (l, yi), with l ∈ [1, j], to avoid the
contention on the XY part of the critical path.
The list of tasks that fulfill these constraints is ordered by increasing number of Ingress
Communications (IC). Let us suppose that the task with the highest number of IC is allocated on
(xi, 1). The probability that it receives data from tasks allocated on rows other than
its own then increases, and such flows could directly block the Y part of the critical path.
Therefore, we allocate on (xi, 1) the task that satisfies these rules and has the minimum
number of IC, as it can only receive from tasks belonging to its row without blocking the
critical path.
Finally, note that if other critical paths exist within an application, coming from other Ethernet
interfaces (0, ym), we must take all the critical paths into consideration. For example, the tasks
on the cores from (1, ym) to (xi, ym), with ym > yi, must not send to the tasks allocated
on the cores from (xi, 1) to (xi, yi − 1). This avoids contention with the Y path of the core-to-I/O
flow.
Allocation on the XY part. The tasks mapped on (xi, yi) must not receive data from tasks
mapped on their rows and must not send data to those on their columns, to avoid the XY contention
with the critical path.
Example. The two steps 1 of Figure 6.4 show, for both HM8 and FADEC8, the task mapping
on their critical paths. We start with the shortest critical path, i.e. the one, among those of
FADEC8 and HM8, that comes from (0, 2). We first have to select a task to be mapped on
(1, 2). The task must have the minimum EC value, but it must also not send data to DDR.
tf6 is the task with the minimum EC value; however, it sends data to the DDR. We therefore have
to choose another task. As the remaining tasks have the same characteristics, tf4 is arbitrarily
chosen. We must map on (2, 1) the task that has the minimum IC value and does not receive data
from tf4. The matching task is tf2. Finally, tf6 is mapped on (2, 2) as it is the sole task that does
not send to tf2. Let us now consider the second critical path from (0, 4) to (2, 4), i.e. the one
of HM8. The task on (1, 4) must have the minimum EC value and must not send data to the
tasks on the column of the first critical path. Moreover, the task on (2, 4) must not receive
from the one allocated on (1, 4). Arbitrary tasks are however chosen, since all the tasks have
the same characteristics. The mapping on the critical path in HM8 is shown in the upper part
of Figure 6.4. th0 is mapped on (3, 1) as it is the task having the minimum IC value. On the
cores above, we have to ensure that we allocate tasks that do not send data to the tasks
allocated below.
6.4.3 Mapping tasks around the critical path
Once the task mapping is performed on the critical path, we have to map the remaining tasks
around the critical path. The objective is that the remaining tasks, i.e. those not yet
allocated, that communicate with the tasks allocated on the critical path do not generate flows in
direct contention with the core-to-I/O flow. To reduce the number of contentions in the X or Y
part of the critical paths, flows from these tasks should cross the critical path perpendicularly.
For this reason, we consider different regions around the critical path, as shown in Figure 6.7,
where, in each region, we allocate the tasks that generate flows perpendicular to the critical
path.
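Why perpendicular flows are harmless can be checked on a small model. The sketch below is an illustration under assumptions that are not stated in this section: routing is dimension-ordered (X first, then Y), each link is represented as a directed pair of router coordinates, and two flows are considered in direct contention when they share a directed link. The function names and the example coordinates are hypothetical.

```python
def xy_route_links(src, dst):
    """Directed links used by a flow routed along X first, then along Y."""
    (x, y), (xd, yd) = src, dst
    links = []
    while x != xd:                      # X part of the route
        nx = x + (1 if xd > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != yd:                      # Y part of the route
        ny = y + (1 if yd > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links


def in_direct_contention(flow, critical_flow):
    """True when both flows share at least one directed link."""
    return bool(set(xy_route_links(*flow)) & set(xy_route_links(*critical_flow)))


# Critical path with xi = yi = 3: from the Ethernet side (0, 3) to the DDR side (3, 0).
critical = ((0, 3), (3, 0))
for flow in [((1, 1), (3, 1)), ((1, 3), (1, 1)), ((1, 2), (3, 1)), ((1, 3), (2, 3))]:
    print(flow, in_direct_contention(flow, critical))
```

In this model, the flow from (1, 1) to (3, 1) reaches the column of the critical path without using any of its links, and the flow from (1, 3) to (1, 1) leaves the row of the critical path along another column: neither is in direct contention. In contrast, the flow from (1, 2) to (3, 1) descends the column xi and the flow from (1, 3) to (2, 3) follows the row yi, so both share a link with the critical path.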
Figure 6.7: Perpendicular communications with the critical path avoid contention with
the core-to-I/O flow.
For each configuration of a critical path, Figure 6.8 illustrates the different areas that must
then be considered. In the general case (Figure 6.8a), i.e. when the entire critical path belongs
to the current application, four areas must be considered: I) x < xi and y < yi, II) x > xi and y ≤ yi,
III) x < xi and y > yi and IV) x > xi and y > yi. However, the number of regions can be
reduced to two (Figures 6.8b and 6.8c). Note that if several critical paths exist, the one with
the highest values for both xi and yi is used to define the areas to consider.
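The four areas of the general case can be expressed as a simple classification of a core (x, y) with respect to (xi, yi). The function below is a hypothetical helper written for illustration; it reports the cores of the critical path separately and returns `None` for the cores that none of the four conditions covers (x = xi and y > yi).

```python
def classify_core(x, y, xi, yi):
    """Region of core (x, y) around the critical path ending at (xi, yi)."""
    on_x_part = y == yi and 1 <= x <= xi
    on_y_part = x == xi and 1 <= y <= yi
    if on_x_part or on_y_part:
        return "critical-path"
    if x < xi and y < yi:
        return "I"
    if x > xi and y <= yi:
        return "II"
    if x < xi and y > yi:
        return "III"
    if x > xi and y > yi:
        return "IV"
    return None  # x == xi and y > yi: not covered by the four conditions


for core in [(1, 1), (5, 2), (1, 5), (5, 5), (2, 3)]:
    print(core, classify_core(*core, xi=3, yi=3))
```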
Mapping in region I (x < xi and y < yi)
Let us first focus on the case of Figure 6.8a, where the cores in this region share
rows and columns with the critical path. We have to take into consideration the characteristics
of the tasks allocated on the critical path. We recall that the tasks allocated on the row yi
have a minimum EC, while those on the column xi have a minimum IC. Therefore, each task
that might be mapped on (x, y) in this region should send data to the task on its row
belonging to the critical path, i.e. on (xi, y). It must also receive data from the task
on its column belonging to the critical path, i.e. on (x, yi). However, it should
not send to any of the tasks mapped on the column of the critical path, from (xi, 1) to (xi, y − 1).
This last condition avoids a Y contention with the critical path. In the other cases, shown
in Figures 6.8b and 6.8c, one of these conditions is applied, depending on the part of the critical
path that belongs to the current application. Note that if we cannot find any task satisfying
at least one condition, we leave the cores unused for the moment and then apply the rules that
reduce the contention on the outgoing I/O flow, which are explained in the next section.
Figure 6.8: For each possible configuration of a critical path, the defined areas and their order
in the tasks mapping.
Example. Step 2 for FADEC8, shown in the bottom part of Figure 6.4, places in
region I the task tf7 on (1, 3), which receives from tf1 and tf4. A similar condition is applied to
the task mapped on (1, 1). The tasks on the third row must also not send to tf6 and tf2, while
the one on (1, 1) has to send to tf2.
Mapping in region II (x > xi and y ≤ yi)
As the cores in this region share only rows with the critical path (Figures 6.8a
and 6.8b), we only have to apply the conditions related to the row of the critical path.
Therefore, in this region, each task mapped on the core (x, y) should send data to the tasks
on (xi, y). Besides, it should not send data to any task mapped on the column of the critical
path, from (xi, 1) to (xi, y − 1).
Example. For instance, step 2 for HM8, shown in Figure 6.4, places in region II a
task on (4, 4) that sends to th7 and does not send to th5, th3 and th0, i.e. th6. Similarly, th4 is
mapped on (4, 3) as it sends to th5 and does not send to th3 and th0. The same rules are applied to
allocate the tasks on the remaining cores, i.e. (4, 2) and (4, 1).
Mapping in region III (x < xi and y > yi)
Since x < xi, each task on (x, y) must receive data from the tasks on the column of its critical
path, i.e. (x, yi). However, in order to reduce the contention on the Y part with the critical
path, it must also not send data to tasks on the column of the critical path starting from (xi, 1)
till (xi, y − 1), as in the case of Figure 6.8a.
Mapping in region IV (x > xi and y > yi)
The cores in this region do not share any row or column with the critical path. However, we have
to ensure that the tasks in this region do not send any data to the tasks on the column of the
critical path, from (xi, 1) to (xi, y − 1).
6.4.4 Minimizing the contention of outgoing flows
The rules detailed above minimize the WCTT of the flow incoming from Ethernet. We also
define task mapping rules that reduce the latency of outgoing core-to-I/O flows, i.e. from the
DDR to the Ethernet controller. These rules are applied when tasks remain to be mapped.
The first rule is to allocate the tasks sending to Ethernet on the cores nearest to both the DDR
and Ethernet controllers, as this minimizes d2 and d3 and thus the delay from the task to the
DDR and from the DDR to the Ethernet. In fact, the tasks sending to Ethernet first send
the data to the DDR controller, which then transfers the data to Ethernet. Thus, we
identify some conditions for mapping the remaining tasks. Where these currently unmapped
tasks are allocated can indeed affect the outgoing flows on the NoC, which leave from the
port xo and use the first row then the first column to reach the Ethernet controller.
Figure 6.9: Flows blocking directly the path of the outgoing flow.
Remaining tasks that might be allocated on (x, 1) must not send data to the tasks allocated
on (xj, 1), with xj < min(x, xo). Indeed, when we allocate a task on (x, 1) with x < xo,
this task must not send to the tasks allocated on (xj, 1) with xj < x. This condition avoids
contention with the X part of the path of the outgoing I/O flow. However, the task can send to
the tasks on (xj, 1) with xj > x, thanks to the bidirectional links, which eliminate the
possible contention with the outgoing I/O flow. On the other hand, when x > xo, the task on
(x, 1) should not send to any task allocated on (xj, 1) with xj < xo, in order to avoid this X
contention with the outgoing I/O flow.
A task on (x, 1) must also not send to the tasks allocated on the first column, in order to
avoid contention with the Y part of the path of the outgoing I/O flow. These
conditions minimize the number of flows that could block the flows going from the DDR to the
Ethernet controller. As shown in Figure 6.9, when a task on (x, 1) sends data
to tasks allocated on the first row and/or the first column, it generates flows directly blocking
the path of the outgoing I/O flow.
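The two conditions on a task placed on the first row can be written as a single predicate. The function name and the example values below are hypothetical; the sketch only encodes the rules stated above (destination on the first row with xj < min(x, xo), or destination on the first column) and does not model the routing itself.

```python
def blocks_outgoing_flow(x, dst, xo):
    """True if a flow from the task on (x, 1) to `dst` violates the rules above.

    x   -- column of the sending task, which sits on the first row
    dst -- (xj, yj), core of the receiving task
    xo  -- column of the DDR port from which the outgoing I/O flow leaves
    """
    xj, yj = dst
    # X contention: destination on the first row, left of both the sender and xo.
    if yj == 1 and xj < min(x, xo):
        return True
    # Y contention: destination on the first column.
    return xj == 1


print(blocks_outgoing_flow(2, (3, 1), xo=3))   # sending to the right on the first row
print(blocks_outgoing_flow(5, (2, 1), xo=3))   # crossing the X part of the outgoing path
```

A mapping step could use such a predicate to discard, for a given core (x, 1), every candidate task for which one of its destinations returns `True`.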
After applying all rules and allocating the tasks satisfying these rules, the remaining tasks are
mapped arbitrarily on the remaining cores: in this work, we do not consider any further
conditions for them.
Figure 6.10: The configuration in (a) shows how the communications with DDR interfere with the
critical path, while this interference is avoided by our mapping, as shown in (b).
Choosing the DDR port to which tasks should communicate
Once all tasks have been mapped, we have to check the tasks communicating with the DDR.
In this work, we consider communications with DDR controllers either as a final
destination or as an intermediate destination. The flows coming from the tasks communicating with
DDR and located on column xi, or on the first column when xi = 2, interfere with the critical
path, as shown in Figure 6.10a. For this reason, we have to choose, for each task communicating
with DDR, the port to which it should send data without interfering with the
critical path. Therefore, we consider virtual tasks allocated on the xi + 1 port as
a destination port for the tasks located on column xi, and on the first column when xi = 2,
as illustrated in Figure 6.10b.
We note that when more than one critical path exists in the application, we consider the
one with the highest xi value. However, we also modify the destination of tasks from other
applications: we modify the port from x = xi + 1 to xi + 2.
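The port selection can be sketched as follows. This is an illustration under assumptions: ports are identified by their position x as in the text, the task's original destination is passed as `default_x`, and the shift from xi + 1 to xi + 2 is generalized into skipping every position already used by another critical path. The function and parameter names are hypothetical, and the number of available ports is not modeled.

```python
def ddr_destination_column(task_x, xi, default_x, busy_columns=()):
    """Position x of the DDR port that a task communicating with DDR should target.

    task_x       -- column of the core hosting the task
    xi           -- column of the critical path
    default_x    -- position the task would target if no rule applied
    busy_columns -- positions already used by other critical paths (assumption)
    """
    interferes = task_x == xi or (xi == 2 and task_x == 1)
    if not interferes:
        return default_x
    target = xi + 1
    while target in busy_columns:   # e.g. move from xi + 1 to xi + 2
        target += 1
    return target


print(ddr_destination_column(2, xi=2, default_x=2))
print(ddr_destination_column(2, xi=2, default_x=2, busy_columns={3}))
```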
Example. For FADEC8 of our case study B, the task tf6 should send data to port 1 of
the DDR. However, this port is used by the critical path coming from the Ethernet interface
located at (0,2). Thus, we modify the DDR destination port for this task. Port 2 is used
by the critical path coming from the Ethernet interface at (0,6). To avoid contention with
these critical paths, port 3 is chosen as the destination port for the task tf6, as illustrated
in step 3 of Figure 6.4.
Figure 6.11: Mapping of FADEC9 of the case study B. (Step labels shown in the figure: determine
the critical path; mapping on the critical path; mapping on x < xi and y < yi; mapping on x > xi and
y < yi; mapping on x < xi and y > yi; mapping on x > xi and y > yi; change DDR port.)
Figure 6.11 illustrates the internal mapping of FADEC9, which shares the Ethernet interface (0, 6)
with HM9. This mapping applies the same rules as the mapping of FADEC8, where tf6 is allocated
on (2, 6). This task sends to DDR, and thus cannot be allocated at (1, 6) even though it has
a minimum number of EC. However, as it is the sole task that does not receive data from tf4,
allocated at (1, 6), tf6 is allocated at (2, 6), as illustrated at step 1. Besides, step
6 shows that tf6 sends data to port 2 to avoid direct contention with the critical path.
Example: Case study A. Now let us go back to the case study illustrated in Figure 6.1 to
explain how the tasks are allocated within FFT. tff15 is a task that does not send to DDR and
also has the minimum number of EC. Thus, this task is allocated on the core at (1, 6).
Besides, on the core located at (2, 7), we choose a task that has a minimum number of IC and does
not receive from tff15, i.e. tff8. Finally, a task that does not receive from tff15 and does not
send to tff8 is chosen to be allocated at (2, 6), i.e. tff12. The remaining tasks are
allocated arbitrarily on the remaining regions, as they have the same characteristics.
6.5 Conclusion
Chapter 4 has shown that reducing the pessimism when computing the WCTT is a strategy to
avoid dropping Ethernet frames. However, this strategy is not always sufficient to solve
the problem, as illustrated in the considered case study A. Existing contention-aware mapping
strategies aim at minimizing the inter-core congestion without taking into account the
requirements of the I/O communications of applications. However, the WCTT of core-to-I/O flows
depends on the congestion generated by the mapping of both critical and non-critical applications.
This chapter has presented MapIO, our static mapping strategy of critical and non-critical
real-time flows, which reduces the WCTT of core-to-I/O communications over Tilera-like NoCs.
MapIO splits the NoC into regions and then allocates critical applications in priority
in a dedicated region close to the memory and Ethernet controllers. The paths of the core-to-I/O
and outgoing I/O flows of critical applications are thus shortened. These regions are contiguous
and non-fragmented, which avoids contention between applications. Besides, this allows
us to allocate a number of applications whose total size is equal to the size of the NoC.
Then, the second part of MapIO is the mapping of the tasks within the regions in order to
reduce the contention that core-to-I/O flows can experience over their paths. The evaluation
of the MapIO method compared to the SHiC method, a state-of-the-art method, is presented
in the next chapter.
Chapter 7
Evaluation of RCNoC and MapIO on
case studies
Contents
7.1 Impact of RCNoC on the core-to-I/O flows
7.2 Impact of MapIO on the core-to-I/O flows
7.3 Combining our MapIO and RCNoC is necessary
7.4 Impact of MapIO on both core-to-I/O and core-to-core flows
7.4.1 Impact of MapIO rules
7.4.2 Impact of unused cores
7.4.3 Discussion
7.5 Conclusion
Chapter 5 has presented RCNoC, a method that reduces the pessimism of the computed
WCTT. Its evaluation on a synthetic benchmark shows a significant reduction of the WCTT
values compared to the RC method. Besides, Chapter 6 has described MapIO, an application
placement strategy that takes into account the communications between cores and I/O
interfaces. In this chapter, we evaluate these two methods. We first evaluate
the impact of these methods, applied independently and then together, on the initial problem:
the dropping of Ethernet frames for critical or non-critical applications. Then, we evaluate the
impact of these methods, applied together, on the WCTT of both core-to-I/O and core-to-core
flows. We thus explain the impact of the MapIO rules and of the number of unused cores generated
by this strategy on these WCTT values.
Figure 7.1: Mapping of flows of case studies A and B by applying MapIO mapping and SHiC
mapping. (Panels: (a) Case study A, (b) Case study B, each with (i) our mapping and (ii) the SHiC
mapping.)
Figure 7.2: Mapping of flows of case studies C and D by applying MapIO mapping and SHiC
mapping. (Panels: (a) Case study C, (b) Case study D, each with (i) our mapping and (ii) the SHiC
mapping.)
For these evaluations, we use different case studies: A and B (introduced in the previous
chapter), and C and D (new case studies introduced in this chapter). To make
these case studies easier to refer to, we illustrate, at the beginning of this chapter, the mappings
of case studies A and B in Figure 7.1, and those of C and D in Figure 7.2. In these figures,
we show the mappings obtained by the MapIO and SHiC methods.
7.1 Impact of RCNoC on the core-to-I/O flows
In order to evaluate the impact of RCNoC, proposed in Chapter 5, on the core-to-I/O flows, we
refer to the case study A introduced in Chapter 4 and illustrated in Figure 4.10. We recall that
FFT and HM11 share the same Ethernet interface at (0,6), while FADEC and HM12 share
the Ethernet interface at (0,2). As shown in Section 4.3, the RC method computes the
WCTT of a packet of the HM core-to-I/O flow by adding the WCTT of all direct and indirect
flows blocking this packet. This computation leads to the FFT frame being dropped. We recall
that the WCTT of a packet of the HM core-to-I/O flow is equal to 400.775 ns with the RC
method. By applying RCNoC, however, we obtain a reduction of 31%, which corresponds to
a WCTT of 277.725 ns. This delay is equal to the delay obtained with the timeline illustrated
in Figure 4.11b, which models the real pipeline behavior of flit transmission.
This result shows that RCNoC models this pipeline behavior, where Property 3 is
applied to remove the impact of the indirect flows, as presented in the timeline. In fact, when
f1 is no longer blocked by f2, f3 is the sole indirect flow that impacts the analyzed flow fa,
i.e. the HM core-to-I/O flow. Indeed, Property 3 requires a separation of nf1 − 1, i.e. one
router corresponding to (3,6), between fa and fid to classify fid as non-influent. As f3 is
not separated from the HM core-to-I/O flow by any router, f3 is an influent indirect
flow (see Figure 4.11a). Thus, when f1 arrives at the router located at (3,6), an HM packet
progresses without being affected by the flows blocking f1, i.e. the indirect flows f4, f5, f6, f7
and f8. Besides, Property 2 reduces the pessimism by computing the progression of flows without
waiting for the blocking flows to reach their destination. For example, f1 does not wait for f3 to
reach its destination at (4,4) before it progresses, and its maximal blocking delay is computed
on that basis.
Therefore, the WCTT of the 21 packets corresponding to the HM core-to-I/O flow is equal to
5.8 µs, which is less than the transmission delay of the FFT frame, i.e. 6 µs. In this case, the
FFT frame is no longer dropped.
Reducing the pessimism in the computation is not sufficient
Let us increase the number of flits composing the FFT core-to-core flows, for example to
3 flits. The WCTT of a packet of the HM core-to-I/O flow then increases to 490.47 ns. In this
case, additional indirect flows are considered as influent on the core-to-I/O flow. Thus, this
core-to-I/O flow takes 10.29 µs to send all the payload to the DDR. This delay is greater than
the transmission delay of the FFT frame on Ethernet, so the frame is dropped.
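The drop condition used throughout this section can be reproduced with a short computation. The sketch below assumes that the global WCTT of a core-to-I/O flow is the per-packet WCTT multiplied by the number of packets (21 here), which matches the values reported above up to rounding; the function names are hypothetical.

```python
def total_wctt_us(per_packet_ns, n_packets):
    """Global WCTT of a flow in microseconds, assuming identical per-packet WCTT."""
    return per_packet_ns * n_packets / 1000.0


def frame_dropped(per_packet_ns, n_packets, frame_delay_us):
    """True when the flow is still being transferred when the next frame arrives."""
    return total_wctt_us(per_packet_ns, n_packets) > frame_delay_us


FFT_FRAME_DELAY_US = 6.0   # transmission delay of the FFT frame
N_PACKETS = 21             # packets of the HM core-to-I/O flow

for label, wctt_ns in [("RC", 400.775), ("RCNoC", 277.725), ("RCNoC, 3 flits", 490.47)]:
    total = total_wctt_us(wctt_ns, N_PACKETS)
    print(f"{label}: {total:.3f} us, dropped={frame_dropped(wctt_ns, N_PACKETS, FFT_FRAME_DELAY_US)}")
```

Under this assumption, only the RCNoC value of 277.725 ns per packet keeps the global WCTT below the 6 µs transmission delay of the FFT frame.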
On the other hand, a packet of the HM12 core-to-I/O flow coming from the Ethernet interface at
(0,2) takes 790.05 ns to reach the DDR memory. Compared to the RC method, the
WCTT value is reduced by only 2%. Indeed, Property 3 cannot be applied due to the absence of
indirect blocking flows. Besides, the flows are made of 19 flits that cross at most four
routers. The maximal blocking delay is therefore close to the waiting delay for the blocking
flow to reach its destination. Thus, Property 2 does not provide a significant gain over the
RC method. This gain of 2% is not sufficient to avoid dropping the FADEC frame, as the global
WCTT of the HM core-to-I/O flow is equal to 16.5 µs and thus greater than the transmission
delay of the FADEC frame.
7.2 Impact of MapIO on the core-to-I/O flows
Problem solved: packets are no longer dropped
As seen previously, reducing the pessimism when computing the WCTT is not sufficient in
some cases to avoid dropping Ethernet frames for case study A. In this section, we compare
the mapping generated by MapIO, called the MapIO mapping, against the mapping generated by
the SHiC method, called the SHiC mapping. We also show that the problem mentioned above is
solved.
We recall that Figure 7.1a.i illustrates the MapIO mapping, while Figure 7.1a.ii shows the SHiC
mapping. Table 7.1 summarizes the WCTT of the core-to-I/O flows obtained when applying the MapIO
and SHiC methods to case study A. The second line of this table reports the
WCTT values of the HM core-to-I/O flow blocking FFT16 and the third one reports those of
the HM core-to-I/O flow blocking FADEC9. For both mapping strategies, we indicate whether
the problem of dropping Ethernet frames is solved or not.

WCTT of flows | MapIO | SHiC approach
HM11 in SHiC and HM12 in MapIO blocking FFT16: ETH → DDR | 0.98 µs ⇒ FFT16 received | 6.704 µs ⇒ FFT16 dropped
HM12 in SHiC and HM11 in MapIO blocking FADEC9: ETH → DDR | 11.8 µs ⇒ FADEC9 received | 16.5 µs ⇒ FADEC9 dropped

Table 7.1: WCTT of the core-to-I/O flows of case study A sharing the Ethernet interfaces (0,6)
and (0,2), using the MapIO and SHiC approaches.
As mentioned in Section 4.3, when we increase the size of the packets of the FFT core-to-core
flows to 15 flits, the FFT frame is dropped when allocating the FFT tasks using the SHiC
mapping. Indeed, the HM11 core-to-I/O flow takes 6.704 µs to send all the payload
to the DDR with RCNoC and 6.762 µs with the RC method. In both cases,
the delay of the HM core-to-I/O flow is greater than the arrival delay of the FFT frame, i.e.
6 µs. In the MapIO mapping, however, FFT is still near the Ethernet interface, but we apply the
MapIO rules to change the mapping of the tasks within FFT. In this mapping, FFT shares the
Ethernet interface with HM12. The HM12 core-to-I/O flow is not blocked. Its global WCTT is equal
to 0.98 µs and thus the FFT frame is not dropped. We note that in this case, this WCTT is
independent of the number of flits making up the FFT core-to-core flows, unlike with the SHiC
mapping. In the SHiC mapping, the HM11 core-to-I/O flow is blocked by the FFT core-to-core
flows, and so depends on the size of their packets. Thus, when increasing the size of these
packets, the indirect flows become influent, and the blocking delay of the HM11 core-to-I/O
flow increases.
On the other hand, for the Ethernet interface at (0,2), when considering the SHiC mapping,
the HM12 core-to-I/O flow takes 16.5 µs (16.944 µs with the RC method) to reach the DDR
controller. This delay is greater than the arrival delay of the FADEC frame, i.e. 12.336 µs,
and causes the FADEC frame to be dropped. Starting from the SHiC mapping, let us simply permute
the HM12 and FADEC9 applications. The SHiC mapping of tasks in FADEC9 leads to a WCTT
for the HM12 core-to-I/O flow of 15.5 µs. This delay is reduced compared to the initial one.
However, it is still higher than the transmission delay of the FADEC9 frame on Ethernet, i.e.
12.336 µs, which causes the FADEC frame to be dropped. This demonstrates the need for a task
mapping within applications that further reduces the WCTT of core-to-I/O flows.
WCTT of flows | MapIO | SHiC approach
HM12 in SHiC and HM11 in MapIO blocking FADEC9: | 0.98 µs ⇒ FFT16 received | 6.704 µs ⇒ FFT16 dropped
HM8 in MapIO blocking FADEC8: ETH → DDR | 6.48 µs ⇒ FADEC8 received |

Table 7.2: WCTT of the core-to-I/O flows of case study B sharing the Ethernet interfaces (0,6)
and (0,2), using the MapIO and SHiC approaches.
The MapIO mapping of FADEC9, shown in Figure 7.1a, leads to a WCTT for the HM11 core-to-I/O
flow of 11.8 µs. We note that this core-to-I/O flow uses the same Ethernet controller as
FADEC9, i.e. the one located at (0, 2). This delay is lower than the transmission delay of the
FADEC9 frame on Ethernet. Therefore, the FADEC9 frame reaches the Ethernet interface
after the removal of the HM11 frame from the Ethernet buffer.
We also evaluate MapIO on the case study B introduced in the previous chapter, in order to
compare it to the SHiC mapping illustrated in Figure 7.1b. Table 7.2 reports the WCTT values of
the HM core-to-I/O flows blocking FADEC9 (second line) and FADEC8 (third line). The results
show that the WCTT values obtained with MapIO avoid the
132 Chapter 7. Evaluation of RCNoC and MapIO on case studies
dropping of FADEC8 and FADEC9 Ethernet frames. Also, we can notice that the WCTT
value of the HM core-to-I/O flow in the third line is not reported for the SHiC mapping.
Indeed, the SHiC mapping cannot allocate the critical application FADEC8, as illustrated in
Figure 7.1b.ii.
7.3 Combining our MapIO and RCNoC is necessary
The previous section has shown that applying MapIO on the case studies A and B solves the
problem of dropped Ethernet frames. But is it sufficient to apply only the mapping strategy,
without considering the computing method? To answer this question, we consider the following
case study, noted C, where Figures 7.2a.i and 7.2a.ii illustrate respectively the MapIO mapping
and the SHiC mapping.
This case study is made of the following applications: FFT12, FADEC9 and two instances of
HM: HM16 and HM12. We consider that the HM12 Ethernet frame, transmitted before the
HM16 frame, holds a payload of 684 Bytes. Besides, we suppose that the core-to-core flows
exchanged in the HM applications are made of 16 flits.
In this section, we focus on the Ethernet interface (0,6), which is shared between HM12 and
HM16 in both mappings. Table 7.3 shows the WCTT of the HM12 core-to-I/O flow computed
with RCNoC and with the RC method in both mappings illustrated in Figure 7.2a.
The results show that with the SHiC mapping, whichever computing method is used,
the HM16 Ethernet frame is dropped. Indeed, the WCTT of an HM12 packet to reach the DDR is
672.75 ns (RCNoC). However, the HM12 frame is divided into 9 NoC packets.
Thus, the global WCTT of the HM12 core-to-I/O flow to reach the DDR is 6.05 µs, which is greater
than the transmission delay of HM16 on the Ethernet link (5.808 µs). This delay is due to the
fact that the HM12 core-to-I/O flow in the SHiC mapping is blocked by 3 direct flows coming from
th7, th1 and th2 (see Figure 7.2a.ii).
On the other hand, with the MapIO mapping, the WCTT of the core-to-I/O flow computed by the
RC method still causes the HM16 frame to be dropped. The WCTT of a packet of HM12 computed by the
WCTT of flows | MapIO, RCNoC | MapIO, RC method | SHiC, RCNoC | SHiC, RC method
HM12 blocking HM16: ETH → DDR | 5.7 µs ⇒ HM16 received | 5.86 µs ⇒ HM16 dropped | 6.05 µs ⇒ HM16 dropped | 6.21 µs ⇒ HM16 dropped
Table 7.3: Table illustrating the WCTT of HM12 blocking HM16 of the case study C and
sharing the Ethernet (0,6) using the different mapping strategies and computing methods.
RC method is 651.47 ns and thus its global WCTT to reach the DDR is 5.86 µs. This WCTT is
greater than 5.808 µs. However, RCNoC reduces the WCTT of the HM12 packet by 2.55%, so that
the global WCTT of the HM12 core-to-I/O flow is 5.7 µs. This WCTT
is less than 5.808 µs, so the HM12 payload is removed from the Ethernet interface buffer
before the HM16 Ethernet frame arrives.
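The comparison above can be replayed with a small sketch. It assumes, as the text suggests, that the global WCTT is the per-packet WCTT multiplied by the number of NoC packets, and that the next frame is dropped when this global WCTT exceeds its transmission delay on the Ethernet link. The function and variable names are illustrative, and the per-packet value for MapIO with RCNoC is derived here from the stated 2.55% reduction rather than quoted from the text.

```python
# Per-packet WCTT values (ns) quoted in the text for the HM12 core-to-I/O flow.
PACKET_WCTT_NS = {
    ("SHiC", "RCNoC"): 672.75,
    ("MapIO", "RC"): 651.47,
}
# Assumption: MapIO + RCNoC value derived from the 2.55 % reduction stated in the text.
PACKET_WCTT_NS[("MapIO", "RCNoC")] = 651.47 * (1 - 0.0255)

N_PACKETS = 9               # the HM12 frame is divided into 9 NoC packets
HM16_TX_DELAY_NS = 5808.0   # transmission delay of the HM16 frame on Ethernet


def global_wctt_ns(packet_wctt_ns, n_packets=N_PACKETS):
    """Global WCTT of the frame, assuming each packet incurs the per-packet WCTT."""
    return packet_wctt_ns * n_packets


def next_frame_dropped(global_ns, next_tx_delay_ns=HM16_TX_DELAY_NS):
    """The next frame is dropped if the buffer is still occupied when it arrives."""
    return global_ns > next_tx_delay_ns


for (mapping, method), wctt in PACKET_WCTT_NS.items():
    g = global_wctt_ns(wctt)
    print(f"{mapping:5s} {method:5s} {g / 1000:.2f} us dropped={next_frame_dropped(g)}")
```

Under these assumptions, only the MapIO mapping analysed with RCNoC stays below the 5.808 µs threshold, which matches the outcome reported in Table 7.3.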
This difference between the values obtained by RCNoC and the RC method is explained
by the applicability of Property 2. This property computes the maximal blocking delay of each
blocking flow (either direct or indirect) without waiting for it to reach its destination. We note
that in the MapIO mapping, the HM12 core-to-I/O flow is blocked directly by a single flow,
generated by th15, which explains the lower values obtained with the MapIO mapping compared to
the SHiC mapping (see Figure 7.2a.i).
As this case study shows the need to combine RCNoC and MapIO, in the following evaluations
we use RCNoC to compute the WCTT values of the different flows.
7.4 Impact of MapIO on both core-to-I/O and core-to-core flows
The previous evaluations show the positive impact of MapIO, combined with RCNoC, on the
WCTT of the core-to-I/O flows, which solves the problem of dropped Ethernet frames. In
this section, we illustrate on different case studies the impact of the internal mapping rules and
of the unused cores generated by MapIO, not only on the core-to-I/O flows but also
on the core-to-core flows.
7.4.1 Impact of MapIO rules
We consider the case studies A and B and provide the WCTT values of the core-to-core flows
for the critical applications. Besides, we explain how the MapIO rules lead to the different WCTT
values for the core-to-I/O and core-to-core flows. We compare these values to those obtained
when applying the SHiC method. The second column of Tables 7.4 and 7.5 reports the WCTT values
when the mapping is built using MapIO, while the third column reports the values obtained with
the SHiC mapping.
WCTT of flows | MapIO | SHiC approach
HM11 and HM12: ETH → DDR | 0.98 µs | 6.704 µs
FFT16: core-to-cores | 1136.81 ns | 1616.9 ns
Table 7.4: Table illustrating the different results for applications, of case study A, sharing the
Ethernet (0,6) using MapIO and SHiC approach.
Case study A: FFT16 and HM applications
Table 7.4 refers only to the case study A, and reports the WCTT value of the HM core-to-I/O
flow (second line) and the average WCTT value of the core-to-core flows within FFT16
(third line). We recall that in the SHiC mapping, HM11 shares the same Ethernet interface, i.e.
the one located at (0,6), with FFT16, while in the MapIO mapping, HM12 shares this interface with
the FFT application.
Impact on the core-to-I/O flows. MapIO reduces the WCTT of the HM core-to-I/O flow
by 85 %. This gain is only due to the internal mapping of the tasks within the FFT application, as
FFT is allocated in the same region in both mappings, i.e. MapIO and SHiC. The
characteristic of the FFT application is that all tasks from tff0 to tff14 send data to
one destination, i.e. tff15. The MapIO rules put the task tff15, which has the minimum EC,
at the core (1,6). This placement ensures that no flow is in direct contention with the
core-to-I/O flow, as tff15 does not send data. However, the SHiC mapping allocates this task
at (3,6), as it is the task with the maximum number of communications. Thus, there is one
WCTT of flows | MapIO | SHiC approach
HM9, HM14, HM11 and HM12: ETH → DDR | 11.8 µs | 16.58 µs
FADEC9: tf6 → ETH | 3341.9 ns | 7028.8 ns
FADEC9: core-to-cores | 3344.522 ns | 2872.4 ns
Table 7.5: Table illustrating the different results for HM and FADEC in both case studies A
and B using MapIO and SHiC approach.
flow in direct contention with the core-to-I/O flow: the one generated by the task allocated at
(1,6) and having tff15 as destination, i.e. the flow from tff1 to tff15 (see Figure 7.1a.ii). This
direct blocking flow is itself blocked at the core (2,6) by the flow coming from tff12, as both
share the link between (2,6) and (3,6). Besides, it is also blocked at (3,6) by the other flows
going to tff15, as they share the same destination. This explains the increased
value of the WCTT of the HM core-to-I/O flow when considering the SHiC mapping.
Impact on the core-to-core flows. On the other hand, MapIO has a positive effect on
the internal congestion, as it reduces the average WCTT by 29.7 %. Although the number
of routers traversed by the flows increases when the task tff15 is allocated at (1,6), this
placement reduces the blocking delay of each flow at the destination. Indeed, when tff15 is
located at (3,6), i.e. in the SHiC mapping, a flow coming to this core is blocked by 3 flows having
tff15 as destination and coming from 3 different ports. This blocking is due to the Round-Robin
arbitration: for example, a flow coming to the core (3,6) from the east port is
blocked before reaching the destination core by flows coming from the north, south and
west ports. However, when this task is allocated at (1,6), a flow can only be blocked at the
destination by 2 flows coming from 2 ports. For example, a flow coming to the core (1,6)
from the west is blocked by 2 flows entering respectively from the north and the south ports.
Therefore, each of the 15 flows having tff15 as destination has its blocking delay reduced
at the destination core.
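The counting argument used above can be written down as a minimal sketch. It assumes that, under Round-Robin arbitration, a flow reaching its destination router waits for at most one flow per other input port that carries traffic to the same destination. The function name and the port sets are illustrative and only encode the two situations described in the text.

```python
def max_blockers_at_destination(incoming_port, ports_carrying_flows):
    """Upper bound on the flows served before ours in one round-robin turn:
    one per other input port carrying a flow to the same destination."""
    return len(set(ports_carrying_flows) - {incoming_port})


# tff15 at (3,6) in the SHiC mapping: flows arrive from four ports.
shic_ports = {"north", "south", "east", "west"}
# tff15 at (1,6) in the MapIO mapping: flows arrive from three ports.
mapio_ports = {"north", "south", "west"}

print(max_blockers_at_destination("east", shic_ports))   # 3
print(max_blockers_at_destination("west", mapio_ports))  # 2
```

This only bounds the number of competing flows at the destination router; the blocking delay itself still depends on the size of each competing packet.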
Case studies A and B: FADEC9 and HM applications
Table 7.5 refers to both case studies. For the case study B, HM14 shares the Ethernet
interface (0,2) with FADEC9 when considering the SHiC mapping. In the MapIO mapping, it is HM9
that shares this interface with FADEC9. For the case study A, however, HM11 and HM12 are
respectively the applications that share the Ethernet interface with FADEC in the MapIO mapping
and in the SHiC mapping (second line of Table 7.5). The third line of this table reports the
WCTT of the outgoing flow from the task tf6 of FADEC9 to Ethernet via the DDR controller in both
case studies. The internal mapping of FADEC9 is the same in both case studies, thus the last line
of Table 7.5 reports the average WCTT value of the FADEC9 core-to-core communications. As seen in
Figure 7.1b.ii, the SHiC mapping is unable to map FADEC8, so we do not report any WCTT
result for this application.
Impact on the core-to-I/O flows. MapIO reduces the WCTT of the Ethernet to DDR
flow by 28% and that of the flow from the task to Ethernet by 52%. These gains are mainly due to
the mapping of the critical applications near both the Ethernet and DDR controllers. Besides, the
task mapping of FADEC9 using MapIO maps tf6 in the center of its 3 × 3 region.
This reduces the Y contention on the critical path, as tf6 does not send any data, in particular
to the task located on the Y part of the critical path (i.e. tf2). Thus, in this mapping, illustrated
in Figure 7.1b.i, the core-to-I/O flow is blocked directly by two flows, generated by tf4 and tf1.
Besides, the outgoing I/O flow crosses 4 routers to reach the Ethernet interface, as shown in
Figure 7.1b.i. However, the SHiC mapping, shown in Figure 7.1b.ii, allocates the FADEC application
far from the Ethernet controller, so the outgoing I/O flow crosses more routers before reaching
this interface. Furthermore, the core-to-I/O flow is blocked directly by the HM core-to-memory
communications at three routers, i.e. at (1,2), (2,2) and (2,1).
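The percentages quoted for the core-to-I/O flows and for FFT16 can be recomputed from the values of Tables 7.4 and 7.5. The sketch below assumes the gain is the relative reduction with respect to the SHiC value; the helper name and the row labels are illustrative.

```python
def reduction_percent(shic, mapio):
    """Relative reduction of the WCTT obtained with MapIO w.r.t. SHiC, in percent.
    A negative result would mean that MapIO increases the WCTT."""
    return 100.0 * (shic - mapio) / shic


# (SHiC value, MapIO value) pairs copied from Tables 7.4 and 7.5.
rows = {
    "Table 7.4, HM: ETH -> DDR (us)": (6.704, 0.98),
    "Table 7.4, FFT16: core-to-cores (ns)": (1616.9, 1136.81),
    "Table 7.5, HM: ETH -> DDR (us)": (16.58, 11.8),
    "Table 7.5, FADEC9: tf6 -> ETH (ns)": (7028.8, 3341.9),
}

for label, (shic, mapio) in rows.items():
    print(f"{label}: {reduction_percent(shic, mapio):.1f} %")
```

The results agree with the figures given in the text (about 85%, 29.7%, 28% and 52%) up to rounding.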
Impact on the core-to-core flows. MapIO increases the internal congestion of FADEC9 by
16% compared to the congestion obtained when applying the SHiC method. Mapping the task
with the maximum number of communications at the center reduces the internal congestion, as
this core is the nearest to all the cores around it. The center of the FADEC9 region is, in the
MapIO mapping, the core located at (2, 6) (see Figure 7.1b.i). The MapIO mapping allocates tf6 on
this core. However, tf6 has the lowest number of communication links with other tasks, as it only
receives data from other tasks. SHiC maps tf0, instead of tf6, at the center of the region of
FADEC9, as illustrated in Figure 7.1b.ii, which explains its lower internal congestion.
Figure 7.3: The different steps (a to d) for mapping the applications of the case study D using MapIO.
7.4.2 Impact of unused cores
The goal of this variant of the case studies introduced before is to show the impact of unused
cores on the WCTT values, i.e. when UN > 0. Besides, we show the impact of MapIO on a
less-constrained critical application. The modified case study, noted D, is made of two critical
and two non-critical applications. The critical applications are now ROSACE10 in addition to
FADEC9, while the non-critical applications are HM14 and HM10.
Figures 7.3 and 7.4 illustrate the application and task mapping steps of MapIO. We assume
that only two Ethernet interfaces are used. The applications are allocated in this order:
ROSACE10, HM14, HM10 and FADEC9, as shown in Figure 7.3, due to the presence of
two critical applications and of 6 unused cores.
Figure 7.4: The different steps of MapIO task mapping for ROSACE and FADEC applications
of the case study D. Step labels recoverable from the figure: mapping on critical path; mapping
tasks sending to DDR; mapping on x>xi and y<yi; mapping on x<xi and y<yi; mapping remaining
tasks; change DDR port.
Impact on the core-to-I/O flows. Table 7.6 presents the WCTT of core-to-I/O flows from
Ethernet controllers for both the MapIO and SHiC mappings. Note that the SHiC mapping allocates
the HM applications near the Ethernet controllers. The second line of Table 7.6 reports the
WCTT of flows | MapIO | SHiC
Both HM14 and HM10: ETH → DDR | 0.98 µs | 16.58 µs
FADEC9: tf6 → ETH | 1262.125 ns | 7028.8 ns
ROSACE10: elev → ETH | 399 ns | 621 ns
ROSACE10: eng → ETH | 335 ns | 1620.925 ns
Table 7.6: Table reporting the WCTT of core-to-I/O flows for the applications of the case study
D using MapIO and the SHiC method.
WCTT of both the core-to-I/O flows of: 1) HM10 that shares the Ethernet controller located
at (0, 2) with FADEC9, and 2) HM14, which shares the Ethernet controller located at (0, 6)
with ROSACE10. Using MapIO, the non-blocked core-to-I/O flow of HM10 takes advantage
of the unused cores, which are placed in priority on its critical path. This explains the decrease
from a WCTT of 16.58 µs in case studies A and B to 0.98 µs. Moreover, thanks to the MapIO task
mapping rules, the WCTT value of HM14 is also equal to 0.98 µs, even though no unused cores
are available in the region of the ROSACE application. Indeed, as shown in step 1 of
Figure 7.4, and by referring to the ROSACE task graph in Figure 4.6, we first allocate on the
core (1,6) the task with the minimum EC, i.e. Vzc. As this task generates only one flow, to the
task elev, the latter is allocated in the same column as Vzc (see step 2 of Figure 7.4), avoiding
contention with the X path of the core-to-I/O flow. Besides, on the core (2,7), we allocate
the task with the minimum IC, i.e. hf. Thus, no flows come to this task, which
avoids Y contention with the critical path. Finally, for the core (2,6), we choose a task that
does not send to hf and does not receive from Vzc, i.e. azf. Therefore, there
are no flows in direct contention with the core-to-I/O flow. This explains why the HM14
core-to-I/O flow is not blocked in this case. However, in the SHiC mapping illustrated in
Figure 7.2b.ii, the core-to-I/O flow is blocked directly by the HM10 core-to-memory
communications at 3 routers, as in the previous case study.
We note that the WCTT of the outgoing I/O flows is also reduced thanks to the MapIO application
mapping, which allocates the critical applications near both the Ethernet and DDR controllers.
Besides, we allocate the tasks sending to Ethernet via the DDR controller, i.e. elev and eng,
at the available cores near both interfaces, i.e. (1,7) and (3,6). Thus, the outgoing I/O
flow crosses fewer routers to reach the Ethernet interface compared to the SHiC mapping (see
Figure 7.2b.i).
Impact on ROSACE core-to-core flows. The average WCTT of core-to-core flows for
ROSACE10 increases by 28.8% compared to that obtained when using the SHiC mapping. The
arbitrary mapping of the tasks of ROSACE10 in the 3 × 2 region (see step 2 of Figure 7.4),
whose right corner is located at (3, 7), explains this increased value. In the 6th row, Vaf, qf
and ah send data to the same task, Vzc. Each flow generated by these tasks is thus blocked
at each router before reaching Vzc. However, in the SHiC mapping, Vzc, which has the maximum
number of communications, is allocated at the center, thus reducing the number of cores crossed
by the flows to reach Vzc and so decreasing the number of contentions on their path.
Impact on FADEC core-to-core flows. On the other hand, the average WCTT of core-to-core
flows for FADEC9 decreases by 7.4% compared to SHiC. Our task mapping,
also shown in step 3 of Figure 7.4, leaves unused the core located at (2, 4). This
reduces the number of contentions for the other flows. Let us take as an example a flow going
from tf4 to tf3. In the MapIO mapping, this flow is not blocked at the core (2, 4) by any flow, as
it is an unused core. However, in the SHiC mapping illustrated in Figure 7.2b.ii, this flow is
blocked at the core (6, 2) by a flow generated from tf0 to tf3. Besides, tf6 is mapped at (3, 1)
when using MapIO. Due to the XY routing policy, all incoming flows towards this task can
only come from the north port of the core located at (3, 1). There is thus no more contention at
the destination core when a flow arrives at the router located at (3, 1). Then, the arbitration
for accessing the next link has fewer input requests to consider, as they are not coming from
all directions (we assume an RRA strategy).
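The effect of XY routing on the number of ports by which flows can reach a destination can be illustrated with a short sketch. It is only a generic model, not the thesis tool: coordinates are taken as (x, y), packets travel along X first and then along Y, and the 3 × 3 region is hypothetical. The exact ports involved in the case study depend on the actual placement and on the coordinate convention of the figures.

```python
def xy_route(src, dst):
    """Routers crossed under XY routing: move along X first, then along Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def entry_directions(sources, dst):
    """Distinct last-hop directions (dx, dy) by which flows enter the router at dst."""
    directions = set()
    for src in sources:
        path = xy_route(src, dst)
        if len(path) >= 2:
            (px, py), (qx, qy) = path[-2], path[-1]
            directions.add((qx - px, qy - py))
    return directions


# Hypothetical 3 x 3 region; every other core sends one flow to the destination.
region = [(x, y) for x in range(1, 4) for y in range(1, 4)]
for dst in [(2, 2), (3, 1)]:
    sources = [core for core in region if core != dst]
    print(dst, len(entry_directions(sources, dst)))
```

In this model, a destination at the center of the region can be reached through four input ports, while a destination in a corner can be reached through at most two, so the arbiter at a corner core has fewer competing requests to serve.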
What happens when modifying the number of unused cores?
Let us now show the influence of UNi in the FADEC application on the WCTT of the core-to-I/O
flow of HM10. We increase the size of the FADEC application from 11 up
to 15 tasks, so that the value of UNi ranges respectively from 4 to 0. The second line
of Table 7.7 reports the computed WCTT in each configuration. We recall that the MapIO task
mapping rules put the unused cores on the critical path in the following order of
priority: (1, 2), (2, 1), (2, 2) and (1, 1). Figure 7.5 illustrates the internal mapping of FADEC
tasks when varying the number of unused cores from 4 to 0.
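The priority rule can be read as follows: for a given UNi, the first UNi cores of the priority list stay unused and the remaining ones receive tasks. The sketch below encodes this reading, which is an interpretation of the text rather than the thesis implementation; the function names are illustrative.

```python
# Priority order of the critical-path cores kept unused, as listed in the text.
UNUSED_PRIORITY = [(1, 2), (2, 1), (2, 2), (1, 1)]


def unused_cores(un_i):
    """Critical-path cores left unused for a given UNi, following the priority order."""
    if not 0 <= un_i <= len(UNUSED_PRIORITY):
        raise ValueError("UNi must be between 0 and 4 in this example")
    return UNUSED_PRIORITY[:un_i]


def cores_taken_by_tasks(un_i):
    """Critical-path cores that receive a task for a given UNi."""
    return UNUSED_PRIORITY[len(unused_cores(un_i)):]


for un_i in range(4, -1, -1):
    print(un_i, unused_cores(un_i), cores_taken_by_tasks(un_i))
```

With UNi = 2 the cores (1, 2) and (2, 1) remain unused, with UNi = 1 a task is placed on (2, 1), and with UNi = 0 a task is also placed on (1, 2), which is the sequence discussed for Table 7.7.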
UNi (MapIO) | 4 | 3 | 2 | 1 | 0
WCTT of ETH → DDR flow | 0.98 µs | 0.98 µs | 0.98 µs | 6.45 µs | 23.7 µs
Average core-to-core impact | +11.8% | +16.4% | +7.4% | −14.5% | −20.8%
Table 7.7: Table reporting the WCTT of the core-to-I/O flow of HM when varying the number
of unused cores (UNi) by FADEC9.
Impact on HM core-to-I/O flow. When the flow is not blocked by any other flow, its
WCTT is equal to 0.98 µs (cases from UNi = 4 down to UNi = 2, where the unused cores include
those located at (1, 2) and (2, 1)). When we allocate tasks on the cores (2, 2) and (1, 1), i.e.
in the cases where UNi = 3 and UNi = 2, the core-to-I/O flow is not in direct contention with
other flows on its critical path, even if these tasks generate or receive flows: these flows do
not directly impact the path of the core-to-I/O flow. However, if we allocate a task on the core
(2, 1), as in the case where UNi = 1, i.e. tf13, then this task introduces a direct contention, as
it receives data from the tasks allocated above the first row. Thus, the core-to-I/O flow is blocked
by one flow at the core (2, 2), i.e. the one going from tf2 to tf13. We note that when UNi = 1,
we allocate tf6 at the core (2, 2), as it does not send data to tf13 and thus reduces the number
of flows blocking the core-to-I/O flow. The WCTT further increases significantly when
UNi = 0 and reaches 23.7 µs. In this case, the core-to-I/O flow is blocked by a flow generated by
the task allocated at (1, 2), i.e. tf14. Besides, it is blocked by the flows going to tf13. Indeed,
these blocking flows are assumed to be blocked at each router of their path, which increases
the blocking delay of the core-to-I/O flow. To summarize, with only a few unused cores
(2 here), the WCTT of the core-to-I/O flow is significantly reduced.
Impact on FADEC core-to-core flows. Let us now evaluate how the value of UNi, in
the various instances of the FADEC application, influences the average value of the WCTT
of its core-to-core flows, i.e. its internal congestion. We note $AWCTT_{SHiC}$ and $AWCTT_{O}$
the average values of the WCTT of core-to-core flows computed using respectively the SHiC
method and our approach, MapIO. The last line of Table 7.7 thus reports the impact of MapIO
compared to the SHiC method using the following equation:
$\frac{AWCTT_{SHiC} - AWCTT_{O}}{AWCTT_{SHiC}}$.
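As an illustration of the sign convention of this ratio, it can be instantiated with the FADEC9 core-to-core averages of Table 7.5 (2872.4 ns with SHiC, 3344.522 ns with MapIO). These values belong to case studies A and B, not to the configurations of Table 7.7; they are used here only as a worked example.

```latex
\[
\frac{AWCTT_{SHiC} - AWCTT_{O}}{AWCTT_{SHiC}}
  = \frac{2872.4 - 3344.522}{2872.4}
  \approx -0.164
\]
```

A negative value thus means that MapIO increases the internal congestion (here by about 16%), while a positive value means that MapIO reduces it.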
When reducing the value of UNi: 1) the absolute values of the internal congestion obviously
increase, as the number of tasks increases, and 2) the impact of MapIO on the internal
congestion changes from a positive effect to a negative one compared to SHiC (with the exception
of UNi = 3, where the positive effect is larger than for UNi = 4). For instance, for FADEC11
having UNi = 4, MapIO has a positive effect of
11.8% on the internal congestion compared to SHiC. For FADEC15 having UNi = 0, MapIO
has however a negative effect of 20.8%. Reducing the number of unused cores indeed simply
Figure 7.5: MapIO mapping of FADEC when varying the number of unused cores UNi. Panels:
UNi = 4 / FADEC11, UNi = 3 / FADEC12, UNi = 2 / FADEC13, UNi = 1 / FADEC14 and
UNi = 0 / FADEC15, each over the region from (1,1) to (3,5).
Figure 7.6: SHiC mapping of FADEC when varying the number of unused cores UNi referring
to MapIO mapping. Panels: UNi = 4 / FADEC11 and UNi = 3 / FADEC12 over the region from
(5,1) to (7,4); UNi = 2 / FADEC13, UNi = 1 / FADEC14 and UNi = 0 / FADEC15 over the
region from (4,1) to (7,4).
increases the number of contentions both core-to-I/O and core-to-core flows can experience.
Figure 7.6 shows the internal task mapping of FADEC with the SHiC mapping,
when varying the number of unused cores from UNi = 4 to UNi = 0, where UNi refers to the MapIO
mapping. Thus, when mapping FADEC11, i.e. UNi = 4, we can see that the MapIO mapping in
Figure 7.5 allocates FADEC within a region of 3 × 5, thus including these 4 unused cores in its
region, while the SHiC mapping allocates FADEC within a region of 3 × 4, thus including
only one unused core.
In the MapIO mapping, the 2 unused cores located at (2, 2) and (1, 2) reduce the contention
between flows. If we take the flows going from tf10 respectively to tf2 and tf5, these flows are
not blocked at these cores by any flow. But in the SHiC mapping, these flows are blocked
at (6, 4) by the flows generated by tf9, and the contention is only reduced by one unused core,
i.e. (7, 4), for the flow going to tf5. However, when moving from FADEC11 to FADEC12,
the positive effect of MapIO increases, while the value of UNi decreases. We explain this
discrepancy by the inability of SHiC to include unused cores in the region allocated to
FADEC12. As seen in Figure 7.6 when UNi = 3, SHiC indeed allocates a 3 × 4 region,
with thus no unused cores. However, MapIO still uses a 3 × 5 region, as in the FADEC11 case,
with thus 3 unused cores. This positive impact is reduced for FADEC13, but MapIO still performs
better than the SHiC mapping. Even though the SHiC mapping has more unused cores than the MapIO
mapping, the placement of tf6 at the corner in the MapIO mapping, i.e. at (3, 1), reduces the
number of blocking flows at this core. Indeed, each flow arriving at the core (3, 1) is
blocked at this core by two flows, coming respectively from the east and the north ports. In
the case of the SHiC mapping, tf6 is allocated at the core (5, 3) and each flow received by this
task is blocked at this core by 3 flows, coming respectively from the north, south and west
ports.
When decreasing the number of unused cores to UNi = 1 and UNi = 0, tf6 is now allocated at
the core (2, 2), as we have to apply the MapIO mapping rules, i.e. tf6 does not send any flow to
tf13, allocated at (2, 1). Thus, each flow received by tf6 incurs a contention with two to three
flows at the core (2, 2). These flows come from the different ports of the router due to the
RRA. This contention at tf6 is the same in the case of the SHiC mapping, as tf6 is also allocated
approximately at the center of the region, i.e. at (5, 3), as illustrated in Figure 7.6. However,
in these cases, the SHiC mapping provides respectively 2 and 3 unused cores when allocating
FADEC14 and FADEC15, while the MapIO mapping includes respectively 1 and 0 unused cores.
This explains the negative impact on the core-to-core internal congestion.
We have to note that in the SHiC mapping, FADEC14 and FADEC15 could not be allocated on
the NoC when considering all the applications of the case study D. Thus, the unused cores
reported for the SHiC mapping correspond to the case where FADEC is allocated alone on the
NoC.
7.4.3 Discussion
The results in this chapter show the positive impact of MapIO, combined with RCNoC, on
the WCTT values of the core-to-I/O flows, which solves the problem of dropped Ethernet
frames. We have seen that MapIO leads to a significant reduction of 94% of the WCTT values
in two cases: 1) when including unused cores in the region of the critical applications, and 2)
when allocating a less-constrained critical application. Moreover, this mapping has a reasonably
limited impact on the core-to-core congestion. We have shown that the presence of unused cores
has a positive impact on the congestion of core-to-core flows. However, in the
case of UNi = 0, this impact could be negative.
7.5 Conclusion
This chapter presents an evaluation of both RCNoC and MapIO. We have first shown on the
case study A how RCNoC impacts the WCTT of the core-to-I/O flows, avoiding in some cases
the dropping of Ethernet frames. However, this reduction of the WCTT values is not always
sufficient to avoid this problem. Applying MapIO on this case study solves the problem
by reducing the contention on the core-to-I/O flows.
However, applying the mapping strategy without reducing the pessimism in the computation of
the WCTT may not be sufficient. We have illustrated on the case study C the need
to combine RCNoC with MapIO to avoid dropping Ethernet frames. The results show on this
case study that, even when applying MapIO, the WCTT values computed by the recursive method
lead to dropping the Ethernet frame. Although the reduction of this WCTT when applying
RCNoC is small (2.5%), it is sufficient to solve the problem.
Finally, we evaluate the impact of the rules of MapIO and of the unused cores
generated by this strategy on both core-to-I/O and core-to-core flows. Our results show on
realistic avionics case studies that the core-to-I/O transmission delays are significantly reduced,
by up to 94%. Meanwhile, the internal congestion for the core-to-core flows can increase by up to
28.8%, but is only slightly impacted in the other cases we have considered. Besides,
we have noticed on some case studies the inability of the existing mapping strategy to allocate
all applications even though their total size does not exceed the size of the NoC, unlike MapIO.
Chapter 8
Conclusion
Contents
8.1 Summary of Thesis Contributions
8.2 Future Work
8.1 Summary of Thesis Contributions
In this thesis, we are interested in the use of NoCs in real-time systems interconnected to
sensors and actuators via Ethernet. In this context, congestion in the NoC, due to wormhole
switching, delays the core-to-I/O flow (coming from Ethernet), leading to an overflow of the
buffer of the Ethernet interface, which is of limited capacity. Therefore, incoming Ethernet
frames could be dropped. A real-time packet schedulability analysis must then be done, taking
into account all the types of flows. The objective of this thesis was to analyze the WCTT of
the different types of flows and to reduce the WCTT of the core-to-I/O flow in order to avoid
dropping Ethernet frames. We have illustrated two main problems in the existing methods
when applying them in this context over a Tilera-like architecture:
1. Existing WCTT computing methods do not model the pipeline transmission of flits over
wormhole NoCs, thus leading to an over-approximation of the WCTT values. Actually,
these methods consider that an analyzed flow can be blocked by other flows as long as these
flows have not reached their destinations. Besides, they add the transmission delays
of all direct and indirect blocking flows to the transmission delay of the analyzed flow.
The pessimistic values of the WCTT, especially for the core-to-I/O flows, lead to dropping
Ethernet frames.
2. Existing contention-aware mapping strategies aim to minimize only the inter-core congestion,
without taking into account the requirements of the I/O communications of applications.
The WCTT of core-to-I/O flows then depends on the congestion generated by the applications
allocated next to the Ethernet interfaces. These mapping strategies cannot reduce
the congestion on the core-to-I/O flows, which leads to dropping Ethernet frames.
In order to reach our objective, i.e. to avoid the drop of Ethernet frames, two approaches have
been proposed:
1. A WCTT computing method, noted RCNoC, that models the pipeline transmission of
flits. For this purpose, we have defined three properties to reduce both the number of
scenarios to be explored when performing a WCTT analysis and the computed WCTT
values. Using these properties, we compute the maximal blocking delay a flow can suffer
from a blocking flow. This eliminates the need to wait until the blocking flow
reaches its destination. Besides, we identify the indirect flows that do not impact the
transmission of an analyzed flow. The delays of these flows are then not added to the
transmission delay of the analyzed flow. We have implemented these properties in an
algorithm based on a recursive method to compute the WCTT of the flows.
2. A static mapping strategy of critical and non-critical real-time flows, noted MapIO, that
reduces the WCTT of core-to-I/O communications over a Tilera-like NoC. This mapping
is divided into two phases. In the first phase, the NoC is split into regions where critical
applications are allocated in priority in dedicated regions close to the memory and Ethernet
controllers. The lengths of the core-to-I/O communications of critical and non-critical
applications are thus reduced. This phase ensures a mapping of all applications without
being fragmented over the NoC. In the second phase, the tasks of each application are
mapped within its region in such a way that the contention experienced by core-to-I/O
communications is reduced.
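The distinction between direct and indirect blocking flows used above can be illustrated with a minimal Python sketch. It assumes a 2D mesh with deterministic XY routing (X first, then Y); the names xy_route and classify_blocking and the example flows are hypothetical and are not taken from the RCNoC implementation. The sketch only identifies candidate blocking flows from shared links; it does not apply the three RCNoC properties that discard non-influent indirect flows.

```python
from typing import Dict, List, Set, Tuple

Node = Tuple[int, int]      # (x, y) router coordinates on the mesh
Link = Tuple[Node, Node]    # directed link between two adjacent routers


def xy_route(src: Node, dst: Node) -> List[Link]:
    """Links crossed by a packet routed along X first, then along Y (assumed routing)."""
    links: List[Link] = []
    x, y = src
    while x != dst[0]:
        nxt = x + (1 if dst[0] > x else -1)
        links.append(((x, y), (nxt, y)))
        x = nxt
    while y != dst[1]:
        nxt = y + (1 if dst[1] > y else -1)
        links.append(((x, y), (x, nxt)))
        y = nxt
    return links


def classify_blocking(analyzed: str,
                      flows: Dict[str, Tuple[Node, Node]]
                      ) -> Tuple[Set[str], Set[str]]:
    """Return (direct, indirect) candidate blocking flows of the analyzed flow."""
    paths = {name: set(xy_route(s, d)) for name, (s, d) in flows.items()}
    # Direct blocking flows share at least one link with the analyzed flow.
    direct = {n for n in flows if n != analyzed and paths[n] & paths[analyzed]}
    # Indirect blocking flows share no link with the analyzed flow but share
    # a link with a direct flow, or with another indirect flow (chains).
    indirect: Set[str] = set()
    frontier = set(direct)
    while frontier:
        nxt = {n for n in flows
               if n != analyzed and n not in direct and n not in indirect
               and any(paths[n] & paths[b] for b in frontier)}
        indirect |= nxt
        frontier = nxt
    return direct, indirect


# Hypothetical flows on a small mesh, for illustration only.
example_flows = {
    "f1": ((0, 0), (3, 0)),   # analyzed flow
    "f2": ((2, 0), (3, 1)),   # shares link (2,0)->(3,0) with f1
    "f3": ((3, 0), (3, 2)),   # shares link (3,0)->(3,1) with f2 only
    "f4": ((0, 1), (1, 1)),   # shares no link with the others
}
direct, indirect = classify_blocking("f1", example_flows)
```

In this example, f2 is a direct blocking flow of f1 and f3 is an indirect one. A WCTT analysis would then still have to decide whether f3 can actually delay f1.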
We first evaluated the impact of RCNoC on the WCTT of core-to-core flows by comparing
its results against an existing recursive method (RC) on a synthetic benchmark. These results
showed a significant improvement of this WCTT. The last chapter showed the impact of
RCNoC and MapIO on the core-to-I/O flows over several realistic case studies. We evaluated
how they address the initial problem compared to a current state-of-the-art method. The
results showed that both methods must be combined in order to avoid dropping Ethernet
frames. Besides, these methods lead to a significant improvement of the WCTT of the
core-to-I/O flows. This WCTT is reduced by up to 29% when a critical application with an
all-to-all communication pattern is allocated in the region near the I/O interfaces. This reduction
reaches 94% when unused cores are allocated in the region of this critical application. Besides,
we find a similar result when a less-constrained critical application is allocated in this region
without the presence of unused cores.
The impact of our approach on the core-to-core flows is also evaluated. The internal congestion
for the core-to-core flows can increase by up to 28.8%, but it is only slightly impacted in
most of the other cases we have considered.
In the end, the work in this thesis addresses a new problem where the I/O constraints are
integrated within NoC communications. This work has led to an approach that avoids
dropping Ethernet frames and computes tighter WCTT values. The
study done during this thesis opens several perspectives that we detail in the next
section.
8.2 Future Work
The analysis of case study D, which shows a negative impact of MapIO on the internal
congestion of the NoC, leads to the first perspective: consider a mapping strategy where a
trade-off is made between reducing the WCTT of the core-to-core flows and that of the core-to-I/O flows.
Indeed, in real-time applications, the WCTT of each flow must be lower than a predetermined
deadline. Thus, we have to merge our mapping method, MapIO, with the congestion-aware
strategies for core-to-core flows. We would first allocate a number of tasks using MapIO in the
region impacting the core-to-I/O flows, in such a way that no Ethernet frame is
dropped. The remaining tasks would then be allocated in the remaining region by applying the
congestion-aware strategies for core-to-core flows. Besides, we have shown the positive impact
of the unused cores not only on the core-to-I/O flows but also on the core-to-core flows. Thus,
we can exploit the presence of unused cores in order to reach the objective of this perspective.
The evaluation of case study D has shown an increase in the internal congestion
of the ROSACE application. The average WCTT of core-to-core flows for ROSACE increases by
28.8% compared to the mapping generated by a current state-of-the-art method (SHiC). Our
mapping method, MapIO, has allocated tasks on the 2 × 2 square near the Ethernet interface
in order to reduce the contention on the path of the core-to-I/O flow (see steps 1 and 2
of Figure 7.4). These tasks are chosen arbitrarily among the tasks satisfying the MapIO
rules. Besides, as explained in section 7.4.2, the remaining tasks are allocated arbitrarily in the
remaining 3 × 2 region (see steps 2 and 3 of Figure 7.4). The MapIO method can therefore generate
different mappings by changing the chosen tasks while still applying the MapIO
rules. Thus, during this work, we have considered another mapping of the
ROSACE tasks generated by MapIO, illustrated in Figure 8.1. This mapping leads to the same average WCTT
of core-to-core flows as the SHiC method. In this mapping, we apply our mapping
rules on the critical path, so the core-to-I/O flow is still unblocked. We reduce the internal
congestion by applying rules of the SHiC method. We then allocate the tasks with the
maximum number of communications, i.e. Vzc and Vac, at the center of the remaining 3 × 2
region.
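The idea of placing the most communicating tasks at the center of a region can be sketched as follows. This is a simplified illustration and not the SHiC or MapIO algorithm: the function place_by_degree, the task names t1 to t4 and the communication graph are hypothetical, and only the task names Vzc and Vac come from the text above. The sketch ranks tasks by their number of communications and assigns them to the cells of the region sorted by Manhattan distance to the region's center.

```python
from collections import Counter
from typing import Dict, List, Tuple

Cell = Tuple[int, int]  # (x, y) position of a core inside the region


def place_by_degree(tasks: List[str],
                    flows: List[Tuple[str, str]],
                    width: int, height: int) -> Dict[str, Cell]:
    """Map tasks to the cells of a width x height region, busiest tasks nearest the center."""
    if len(tasks) > width * height:
        raise ValueError("region too small for the task set")
    # Number of communications (incoming plus outgoing flows) of each task.
    degree: Counter = Counter()
    for src, dst in flows:
        degree[src] += 1
        degree[dst] += 1
    # Cells sorted by Manhattan distance to the geometric center of the region.
    cx, cy = (width - 1) / 2, (height - 1) / 2
    cells = sorted(((x, y) for y in range(height) for x in range(width)),
                   key=lambda c: (abs(c[0] - cx) + abs(c[1] - cy), c))
    # Tasks sorted by decreasing number of communications (name breaks ties).
    ranked = sorted(tasks, key=lambda t: (-degree[t], t))
    return dict(zip(ranked, cells))


# Hypothetical communication graph, for illustration only.
example_tasks = ["Vzc", "Vac", "t1", "t2", "t3", "t4"]
example_comms = [("t1", "Vzc"), ("t2", "Vzc"), ("Vzc", "t3"),
                 ("t1", "Vac"), ("t2", "Vac"), ("Vac", "t4")]
placement = place_by_degree(example_tasks, example_comms, width=3, height=2)
```

With this invented graph, Vzc and Vac have the most communications and end up in the two middle cells of the 3 × 2 region. A real strategy would also account for the direction and load of each flow.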
Figure 8.1: Another possible task mapping for ROSACE using our approach.
This work was based on a Tilera-like architecture. Thus, it is important to consider this second
perspective: generalize the problem addressed in this work regardless of the NoC architecture.
This first requires addressing the problem on a number of different architectures. Then, we plan
to study how to adapt our mapping strategy when assuming for instance a torus network and
non-blocking routers, as in the MPPA many-core from Kalray [dDvAPL14], or simply
arbitration mechanisms other than Round-Robin, such as priority arbitration [BHI14].
Besides, the RCNoC method could also be generalized to other NoC architectures.
Indeed, the properties of this method are based on the pipelined transmission of wormhole
networks, which is the most widely used switching technique in NoC architectures. Thus, we have to study the
WCTT when assuming routers with buffers of more than one flit and/or
support for multiple virtual channels (VCs).
As reducing the WCTT is important in real-time applications, our third perspective
is: consider different mechanisms to reduce the WCTT of the flows. For this purpose, an
extension of RCNoC is to add assumptions at the application level: the approach of [DNNP14] is thus
complementary to our work and could be used to further reduce the pessimism of the computed
WCTTs. Besides, the packet segmentation mechanism introduced by [LBBN16] could
be extended to further reduce the computed WCTTs. Indeed, segmenting specific
packets could turn some indirect flows into non-influent ones. For example, let us consider the scenario of
Figure 5.14, where f1 is the analyzed flow. If we suppose that f2 has a size of 6 flits, then
f3 is an indirect flow impacting the transmission of f1. However, if we segment the packet
of f2 into two packets of 3 flits each, then f3 becomes a non-influent indirect flow.
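The segmentation step of this example can be written as a short Python sketch. The function segment_packet is hypothetical and only splits a payload size into segment sizes; it assumes that segments carry no additional header flits, which a real segmentation mechanism such as the one of [LBBN16] may require. It does not decide whether an indirect flow becomes non-influent, which depends on the scenario analyzed.

```python
from typing import List


def segment_packet(size_flits: int, max_segment_flits: int) -> List[int]:
    """Split a packet into consecutive segments of at most max_segment_flits flits."""
    if size_flits <= 0 or max_segment_flits <= 0:
        raise ValueError("sizes must be positive")
    full, rest = divmod(size_flits, max_segment_flits)
    # 'full' segments of maximal size, plus one shorter segment if flits remain.
    return [max_segment_flits] * full + ([rest] if rest else [])


# The packet of f2 (6 flits) split into two packets of 3 flits each.
f2_segments = segment_packet(6, 3)
```

Each segment is then injected as an independent packet, so the time during which f2 holds a link without interruption is bounded by the segment size rather than by the full packet size.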
This work addresses the use of NoCs in real-time systems interconnected to sensors and actuators
via Ethernet. As a long-term perspective, we can consider the use of NoCs in the avionics
domain, where NoCs could be interconnected via an AFDX network. In this case, it is interesting
to study the end-to-end WCTT of flows. Besides, we can consider the problem of
optimizing the latency and the bandwidth in this context.
Finally, it is interesting to study and analyze the behavior of existing NoC architectures in
real-time systems. This analysis could identify the most suitable architecture for
a given real-time context with minimal hardware complexity.
Bibliography
[AG03] Adrijean Andriahantenaina and Alain Greiner. Micro-network for soc: Imple-
mentation of a 32-port spin network. In Proceedings of the conference on Design,
Automation and Test in Europe-Volume 1, page 11128. IEEE Computer Society,
2003.
[BB04] Davide Bertozzi and Luca Benini. Xpipes: a network-on-chip architecture for
gigascale systems-on-chip. IEEE circuits and systems magazine, 4(2):18–31, 2004.
[BC06] Luciano Bononi and Nicola Concer. Simulation and analysis of network on chip
architectures: ring, spidergon and 2d mesh. In Proceedings of the conference
on Design, automation and test in Europe: Designers’ forum, pages 154–159.
European Design and Automation Association, 2006.
[BCGK04] Evgeny Bolotin, Israel Cidon, Ran Ginosar, and Avinoam Kolodny. Qnoc: Qos
architecture and design process for network on chip. Journal of systems architec-
ture, 50(2):105–128, 2004.
[BDM02a] L. Benini and G. De Micheli. Networks on chips: a new soc paradigm. Computer,
35(1):70–78, Jan 2002.
[BDM02b] Luca Benini and Giovanni De Micheli. Networks on chip: a new paradigm for
systems on chip design. In Design, Automation and Test in Europe Conference
and Exhibition, 2002. Proceedings, pages 418–419. IEEE, 2002.
[BHI14] Alan Burns, James Harbin, and Leandro Soares Indrusiak. A wormhole noc
protocol for mixed criticality systems. In Proc. of the IEEE 35th Real-Time
Systems Symposium, RTSS, pages 184–195, Rome, Italy, December 2014.
[BM06] Tobias Bjerregaard and Shankar Mahadevan. A survey of research and practices
of network-on-chip. ACM Computing Surveys (CSUR), 38(1):1, 2006.
[BS04] Tobias Bjerregaard and Jens Sparso. Virtual channel designs for guaranteeing
bandwidth in asynchronous network-on-chip. In Norchip Conference, 2004. Pro-
ceedings, pages 269–272. IEEE, 2004.
[BS05] Tobias Bjerregaard and Jens Sparso. A router architecture for connection-oriented
service guarantees in the mango clockless network-on-chip. In Design, Automation
and Test in Europe, pages 1226–1231. IEEE, 2005.
[CCM07] Ewerson Carvalho, Ney Calazans, and Fernando Moraes. Heuristics for dynamic
task mapping in noc-based heterogeneous mpsocs. In 18th Intl Workshop on Rapid
System Prototyping (RSP), pages 34–40, 2007.
[CM08] Chen-Ling Chou and Radu Marculescu. Contention-aware application mapping
for network-on-chip communication architectures. In IEEE Intl. Conf. on Com-
puter Design (ICCD), pages 164–169, 2008.
[COM08] Chen-Ling Chou, Umit Y Ogras, and Radu Marculescu. Energy-and performance-
aware incremental mapping for networks on chip with multiple voltage levels.
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Sys-
tems, 27(10):1866–1879, 2008.
[CPC08] Nicola Concer, Michele Petracca, and Luca P Carloni. Distributed flit-buffer
flow control for networks-on-chip. In Proceedings of the 6th IEEE/ACM/IFIP
international conference on Hardware/Software codesign and system synthesis,
pages 215–220. ACM, 2008.
[CSG99] David E Culler, Jaswinder Pal Singh, and Anoop Gupta. Parallel computer ar-
chitecture: a hardware/software approach. Gulf Professional Publishing, 1999.
[DA93] William J. Dally and Hiromichi Aoki. Deadlock-free adaptive routing in multi-
computer networks using virtual channels. IEEE transactions on Parallel and
Distributed Systems, 4(4):466–475, 1993.
[Dal90] William J Dally. Virtual-channel flow control, volume 18. ACM, 1990.
[DBL05] Yves Durand, Christian Bernard, and Didier Lattard. Faust: On-chip distributed
architecture for a 4g baseband modem soc. Proceedings of Design and Reuse
IP-SOC, 5:51–55, 2005.
[dDvAPL14] Benoît Dupont de Dinechin, Duco van Amstel, Marc Poulhiès, and Guillaume
Lager. Time-critical computing on a single-chip massively parallel processor. In
Proc. of the Conf. on Design, Automation & Test in Europe (DATE’14), pages
97:1–97:6, 2014.
[DMB06] Giovanni De Micheli and Luca Benini. Networks on chips: technology and tools.
Academic Press, 2006.
[DNNP14] Dakshina Dasari, Borislav Nikolić, Vincent Nélis, and Stefan M Petters. Noc
contention analysis using a branch-and-prune algorithm. ACM Transactions on
Embedded Computing Systems (TECS), 13(3s):113, 2014.
[DPPB+12] Manel Djemal, François Pêcheux, Dumitru Potop-Butucaru, Robert De Simone,
Franck Wajsburt, and Zhen Zhang. Programmable routers for efficient mapping of
applications onto NoC-based MPSoCs. In Conf. on Design and Architectures for
Signal and Image Processing (DASIP), pages 1–8, Karlsruhe, Germany, October
2012.
[DRGR03] John Dielissen, Andrei Radulescu, Kees Goossens, and Edwin Rijpkema. Con-
cepts and implementation of the philips network-on-chip. In IP-Based SoC De-
sign, pages 1–6, 2003.
[dSCCM10] Ewerson Luiz de Souza Carvalho, Ney Laert Vilar Calazans, and Fernando Gehm
Moraes. Dynamic task mapping for mpsocs. Design & Test of Computers,
27(5):26–35, 2010.
[DT01] William J Dally and Brian Towles. Route packets, not wires: on-chip intercon-
nection networks. In Design Automation Conference, 2001. Proceedings, pages
684–689. IEEE, 2001.
[DT04] William James Dally and Brian Patrick Towles. Principles and practices of in-
terconnection networks. Elsevier, 2004.
[DYN03] Jose Duato, Sudhakar Yalamanchili, and Lionel M Ni. Interconnection networks:
an engineering approach. Morgan Kaufmann, 2003.
[FDLP13] Mohammad Fattah, Masoud Daneshtalab, Pasi Liljeberg, and Juha Plosila. Smart
hill climbing for agile dynamic mapping in many-core systems. In Proc. of the
50th Annual Design Automation Conference, page 39, 2013.
[FFF09a] Thomas Ferrandiz, Fabrice Frances, and Christian Fraboul. A method of com-
putation for worst-case delay analysis on SpaceWire networks. In Proc. of the
4th Intl. Symp. on Industrial Embedded Systems (SIES), pages 19–27, Lausanne,
Switzerland, July 2009.
[FFF09b] Thomas Ferrandiz, Fabrice Frances, and Christian Fraboul. A method of com-
putation for worst-case delay analysis on spacewire networks. In 2009 IEEE
International Symposium on Industrial Embedded Systems, pages 19–27. IEEE,
2009.
[FFF11] Thomas Ferrandiz, Fabrice Frances, and Christian Fraboul. Using network calcu-
lus to compute end-to-end delays in spacewire networks. ACM SIGBED Review,
8(3):44–47, 2011.
[FFF12] Thomas Ferrandiz, Fabrice Frances, and Christian Fraboul. A sensitivity analysis
of two worst-case delay computation methods for spacewire networks. In 2012
24th Euromicro Conference on Real-Time Systems, pages 47–56. IEEE, 2012.
[FRD+12] Mohammad Fattah, Marco Ramirez, Masoud Daneshtalab, Pasi Liljeberg, and
Juha Plosila. Cona: Dynamic application mapping for congestion reduction in
many-core systems. In 30th Intl. Conf. on Computer Design (ICCD), pages 364–370,
2012.
[FRX+14] Mohammad Fattah, Amir-Mohammad Rahmani, Thomas Canhao Xu, Anil Kan-
duri, Pasi Liljeberg, Juha Plosila, and Hannu Tenhunen. Mixed-criticality run-
time task mapping for noc-based many-core systems. In 22nd Euromicro Intl.
Conf. on Parallel, Distributed and Network-Based Processing (PDP), pages 458–
465. IEEE, 2014.
[Gai15] Pierre-Emmanuel Gaillardon. Reconfigurable Logic: Architecture, Tools, and Ap-
plications, volume 48. CRC Press, 2015.
[GDR05] Kees Goossens, John Dielissen, and Andrei Radulescu. Æthereal network on chip:
concepts, architectures, and implementations. IEEE Design & Test of Computers,
22(5):414–421, 2005.
[GDvM+03] Kees Goossens, John Dielissen, Jef van Meerbergen, Peter Poplavko, Andrei Ră-
dulescu, Edwin Rijpkema, Erwin Waterlander, and Paul Wielage. Guaranteeing
the quality of services in networks on chip. In Networks on chip, pages 61–82.
Springer, 2003.
[HDV+11] Jason Howard, Saurabh Dighe, Sriram R Vangal, Gregory Ruhl, Nitin Borkar,
Shailendra Jain, Vasantha Erraguntla, Michael Konow, Michael Riepen, Matthias
Gries, et al. A 48-core ia-32 processor in 45 nm cmos using on-die message-passing
and dvfs for performance and power scaling. IEEE Journal of Solid-State Circuits,
46(1):173–183, 2011.
[HO97] SL Hary and F Ozguner. Feasibility test for real-time communication using worm-
hole routing. IEE Proceedings-Computers and Digital Techniques, 144(5):273–278,
1997.
[HVS+07] Yatin Hoskote, Sriram Vangal, Arvind Singh, Nitin Borkar, and Shekhar Borkar.
A 5-ghz mesh interconnect for a teraflops processor. IEEE Micro, 27(5):51–61,
2007.
[IG13] Vaishali V Ingle and Mahendra A Gaikwad. Review of mesh topology of noc
architecture using source routing algorithms. International Journal of Computer
Applications, pages 30–34, 2013.
[KHM87] Mark Karol, Michael Hluchyj, and Samuel Morgan. Input versus output queue-
ing on a space-division packet switch. IEEE Transactions on Communications,
35(12):1347–1356, 1987.
[KJS+02] Shashi Kumar, Axel Jantsch, J-P Soininen, Martti Forsell, Mikael Millberg, Johny
Oberg, Kari Tiensyrja, and Ahmed Hemani. A network on chip architecture and
design methodology. In VLSI, 2002. Proceedings. IEEE Computer Society Annual
Symposium on, pages 105–112. IEEE, 2002.
[KKHL98] Byungjae Kim, Jong Kim, Sungje Hong, and Sunggu Lee. A real-time commu-
nication method for wormhole switching networks. In Parallel Processing, 1998.
Proceedings. 1998 International Conference on, pages 527–534. IEEE, 1998.
[KND02] Faraydon Karim, Anh Nguyen, and Sujit Dey. An interconnect architecture for
networking systems on chips. IEEE micro, 22(5):36–45, 2002.
[KPN+05] Jongman Kim, Dongkook Park, Chrysostomos Nicopoulos, Narayanan Vijaykr-
ishnan, and Chita R Das. Design and analysis of an noc architecture from per-
formance, reliability and energy perspective. In Proceedings of the 2005 ACM
symposium on Architecture for networking and communications systems, pages
173–182. ACM, 2005.
[LBBN16] Meng Liu, Matthias Becker, Moris Behnam, and Thomas Nolte. Using segmentation
to improve schedulability of real-time packets on nocs with mixed traffic.
In The 14th International Workshop on Real-Time Networks, July 2016.
[LBT01] Jean-Yves Le Boudec and Patrick Thiran. Network calculus: a theory of determin-
istic queuing systems for the internet, volume 2050. Springer Science & Business
Media, 2001.
[Lee03] Sunggu Lee. Real-time wormhole channels. Journal of Parallel and Distributed
Computing, 63(3):299–311, 2003.
[LJS05] Zhonghai Lu, Axel Jantsch, and Ingo Sander. Feasibility analysis of messages for
on-chip networks using wormhole routing. In Proceedings of the ASP-DAC 2005.
Asia and South Pacific Design Automation Conference, 2005., volume 2, pages
960–964. IEEE, 2005.
[LRV06] Anthony Leroy, Frédéric Robert, and Diederik Verkest. Optimizing the on-chip
communication architecture of low power systems-on-chip in deep sub-micron
technology. 2006.
[MMM+03] Fernando Gehm Moraes, Aline Mello, Leandro Möller, Luciano Ost, and Ney
Laert Vilar Calazans. A low area overhead packet-switched network on chip:
Architecture and prototyping. In VLSI-SOC, pages 318–323, 2003.
[MNTJ04] Mikael Millberg, Erland Nilsson, Rikard Thid, and Axel Jantsch. Guaranteed
bandwidth using looped containers in temporally disjoint networks within the
nostrum network on chip. In Design, Automation and Test in Europe Conference
and Exhibition, 2004. Proceedings, volume 2, pages 890–895. IEEE, 2004.
[Moh98] Prasant Mohapatra. Wormhole routing techniques for directly connected multi-
computer systems. ACM Computing Surveys (CSUR), 30(3):374–410, 1998.
[MTCM05] Aline Mello, Leonel Tedesco, Ney Calazans, and Fernando Moraes. Virtual chan-
nels in networks on chip: implementation and evaluation on hermes noc. In Pro-
ceedings of the 18th annual symposium on Integrated circuits and system design,
pages 178–183. ACM, 2005.
[Mut94] Matt W Mutka. Using rate monotonic scheduling technology for real-time communications
in a wormhole network. In Parallel and Distributed Real-Time Systems,
1994. Proceedings of the Second Workshop on, pages 194–199. IEEE, 1994.
[NM93] Lionel M. Ni and Philip K. McKinley. A survey of wormhole routing techniques
in direct networks. Computer, 26(2):62–76, 1993.
[NYP+14a] Vincent Nélis, Patrick Meumeu Yomsi, Luís Miguel Pinho, José Carlos Fonseca,
Marko Bertogna, Eduardo Quiñones, Roberto Vargas, and Andrea Marongiu. The
Challenge of Time-Predictability in Modern Many-Core Architectures. In 14th
Intl. Workshop on Worst-Case Execution Time Analysis, pages 63–72, Madrid,
Spain, July 2014.
[NYP14b] Borislav Nikolić, Patrick Meumeu Yomsi, and Stefan M Petters. Worst-case com-
munication delay analysis for many-cores using a limited migrative model. In
2014 IEEE 20th International Conference on Embedded and Real-Time Comput-
ing Systems and Applications, pages 1–10. IEEE, 2014.
[OHM05] Umit Y Ogras, Jingcao Hu, and Radu Marculescu. Key research problems in
noc design: a holistic perspective. In Proceedings of the 3rd IEEE/ACM/IFIP
international conference on Hardware/software codesign and system synthesis,
pages 69–74. ACM, 2005.
[PABB05] Antonio Pullini, Federico Angiolini, Davide Bertozzi, and Luca Benini. Fault tol-
erance overhead in network-on-chip flow control schemes. In 2005 18th Symposium
on Integrated Circuits and Systems Design, pages 224–229. IEEE, 2005.
[PJ06] Sandro Penolazzi and Axel Jantsch. A high level power model for the nostrum
noc. In 9th EUROMICRO Conference on Digital System Design (DSD’06), pages
673–676. IEEE, 2006.
[PSG+14] Claire Pagetti, David Saussié, Romain Gratia, Eric Noulard, and Pierre Siron.
The rosace case study: from simulink specification to multi/many-core execution.
In Proc. of Real-Time and Embedded Technology and Applications Symposium
(RTAS), pages 309–318. IEEE, 2014.
[QLD10] Yue Qian, Zhonghai Lu, and Wenhua Dou. Analysis of worst-case delay bounds
for on-chip packet-switching networks. IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems, 29(5):802–815, 2010.
[RDG+04] Andrei Radulescu, John Dielissen, Kees Goossens, Edwin Rijpkema, and Paul
Wielage. An efficient on-chip network interface offering guaranteed services,
shared-memory abstraction, and flexible network configuration. In Design, Au-
tomation and Test in Europe Conference and Exhibition, 2004. Proceedings, vol-
ume 2, pages 878–883. IEEE, 2004.
[RGR+03] Edwin Rijpkema, Kees Goossens, Andrei Radulescu, John Dielissen, Jef van Meer-
bergen, Paul Wielage, and Erwin Waterlander. Trade-offs in the design of a
router with both guaranteed and best-effort services for networks on chip. IEE
Proceedings-Computers and Digital Techniques, 150(5):294–302, 2003.
[RI12] Adrian Racu and Leandro Soares Indrusiak. Using genetic algorithms to map
hard real-time on noc-based systems. In 7th Intl. Workshop on Reconfigurable
Communication-centric Systems-on-Chip (ReCoSoC), pages 1–8, 2012.
[RMB+09] Dara Rahmati, Srinivasan Murali, Luca Benini, Federico Angiolini, Giovanni
De Micheli, and Hamid Sarbazi-Azad. A method for calculating hard qos guar-
antees for networks-on-chip. In Proceedings of the 2009 International Conference
on Computer-Aided Design, pages 579–586. ACM, 2009.
[RMB+13] Dara Rahmati, Srinivasan Murali, Luca Benini, Federico Angiolini, Giovanni
De Micheli, and Hamid Sarbazi-Azad. Computing accurate performance bounds
for best effort networks-on-chip. IEEE Transactions on Computers, 62(3):452–
467, 2013.
[SB08] Zheng Shi and Alan Burns. Real-time communication analysis for on-chip net-
works with wormhole switching. In Proceedings of the Second ACM/IEEE In-
ternational Symposium on Networks-on-Chip, pages 161–170. IEEE Computer
Society, 2008.
[Shi09] Zheng Shi. Real-time communication services for networks on chip. publisher not
identified, 2009.
[SKH08] Erno Salminen, Ari Kulmala, and Timo D Hamalainen. Survey of network-on-chip
proposals. white paper, OCP-IP, 1:13, 2008.
[SRM13] Dahule Suyog, Golhar Reetesh, and Ramteke Mangesh. The behavior of round
robin arbiter in noc architecture. International Journal of Engineering and Inno-
vative Technology (IJEIT), 3(5):312–314, 2013.
[STAN04] David Sigüenza-Tortosa, Tapani Ahonen, and Jari Nurmi. Issues in the devel-
opment of a practical noc: the proteo concept. Integration, the VLSI Journal,
38(1):95–105, 2004.
[Til11] Tilera corporation. Tile processor user architecture manual, November 2011.
UG101.
[TSSJ14] Konstantinos Tatas, Kostas Siozios, Dimitrios Soudris, and Axel Jantsch. Design-
ing 2D and 3D network-on-chip architectures. Springer, 2014.
[WL03] Daniel Wiklund and Dake Liu. Socbus: switched network on chip for hard real
time embedded systems. In Parallel and Distributed Processing Symposium, 2003.
Proceedings. International, pages 8–pp. IEEE, 2003.
[YGSP12] Bo Yang, Liang Guang, Tero Säntti, and Juha Plosila. Tree-model based
contention-aware task mapping on many-core networks-on-chip. Communications
in Information Science and Management Engineering, 2012.
[ZM12] Christopher Zimmer and Frank Mueller. Low contention mapping of real-time
tasks onto tilepro 64 core processors. In 18th Real-Time and Embedded Technology
and Applications Symposium (RTAS), pages 131–140, 2012.
