Many-Core Platforms in the Real-Time Embedded Computing Domain by Borislav Nikolic
FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO
Many-Core Platforms in the Real-Time
Embedded Computing Domain
Borislav Nikolić
Doctoral Programme in Electrical and Computer Engineering
Supervisor: Dr. Stefan Markus Ernst Petters
April 24, 2015
© Borislav Nikolić, 2015
Approved by:
President: Dr. José Silva Matos
External Referee: Dr. Petru Eles
External Referee: Dr. Leandro Soares Indrusiak
Internal Referee: Dr. Luís Miguel Pinho
FEUP Referee: Dr. Pedro Ferreira do Souto
FEUP Referee: Dr. Mário Jorge Rodrigues de Sousa
Supervisor: Dr. Stefan Markus Ernst Petters
April 24, 2015

Abstract
Over the past few decades, technological advancements have made our lives increasingly permeated
by and dependent on embedded systems. Today, these devices account for more than
98% of all produced computing systems, with applications that span a wide range of areas,
from medicine to avionics. Some embedded systems interact with the physical environment and
have to guarantee not only that a certain action will be performed correctly, but also that the action
will complete within a certain time. These devices are called real-time embedded systems, and
some notable examples are medical pacemakers, airbags in cars and autopilots in airplanes.
The process of analysing the temporal behaviour of a real-time embedded system is called
real-time analysis. In many cases, the purpose of the analysis is to derive guarantees that a device
will perform its functions correctly, while at the same time meeting all timing requirements. A
real-time analysis is mostly performed at design time, so its efficiency depends heavily on the
predictability of the entire system: any non-deterministic aspect of the system
behaviour has to be accounted for in the analysis with a certain degree of pessimism. A pessimistic
analysis may cause a significant resource over-provisioning in the design phase, and consequently
lead to a severe underutilisation of available resources at runtime. Therefore, reducing the analysis
pessimism is one of the ever-present objectives in the real-time embedded computing domain.
The first real-time embedded systems were predominantly single-core devices with limited
sets of functionalities. However, constantly increasing demands for more advanced and sophisticated
functionalities required more powerful computational devices. When faced with the same
challenge, other computing areas (e.g. general-purpose or high-performance computing) opted
for platforms consisting of several cores (multi-cores) or of more than a dozen cores (many-cores).
It comes as no surprise that the same trends, although with some delay, are noticeable in the
evolution of real-time embedded systems, where many-core platforms are the new frontier
technology.
Besides making it possible to implement more advanced functionalities, many-core platforms
offer other benefits as well. For instance, multiple functionalities that were previously
implemented on a set of single-core devices can be integrated within fewer many-core
platforms with significant design-cost reductions. Moreover, the abundance of available cores allows
efficient thermal and power management strategies to be implemented by deliberately performing
temporary shutdowns of idle cores. At the same time, the existence of idle cores, which can be
used if necessary, makes these devices more resilient to hardware failures.
Yet, despite the aforementioned benefits, the integration of many-cores into the real-time embedded
domain is a major challenge. The most notable reasons are (i) increasingly complex designs
of hardware components, which promote performance, often at the expense of predictability,
and (ii) more significant and hard-to-analyse contention patterns for accesses to shared resources.
These facts may contribute to a non-deterministic system behaviour, while, as explained above,
every non-deterministic aspect of the system behaviour has to be accounted for in the real-time
analysis with a certain degree of pessimism.
In this dissertation, the focus is on the analysis of real-time embedded systems deployed on
many-core platforms. Specifically, a comprehensive collection of techniques and design choices
is presented, with the common objective of making many-cores more amenable to real-time
analysis, and consequently more suitable for and applicable to the real-time embedded domain. The
proposed methods achieve this end in several ways: (i) by extending the state-of-the-art approaches
in order to reduce the analysis pessimism, (ii) by exploiting novel hardware features, as well as
enforcing constraints which lead to a more deterministic and analysable system behaviour, and
(iii) by elaborating on promising OS and workload paradigms, which have not been previously
considered in the real-time embedded computing domain.
The contributions of this dissertation can be classified into two groups. In the first set of
contributions the focus is on the interconnect medium, which is one of the most complex-to-analyse
resources in many-core platforms. Initially, the target interconnect is the network-on-chip (NoC)
with a 2-D mesh topology, which utilises the wormhole switching mechanism and the XY routing
technique. For such a generic model, which is present in most contemporary many-cores,
a novel worst-case communication delay analysis is proposed, and subsequently compared with
the state-of-the-art method. Then, assuming additional hardware support in the form of virtual
channels, improvements over the state-of-the-art approaches are proposed, which not only reduce
the analysis pessimism, but also significantly reduce the requirements for hardware resources.
Finally, a novel arbitration policy for NoC routers is proposed.
In the second set of contributions the focus is on a novel paradigm in the real-time embedded
domain, called the Limited Migrative Model. This model is inspired by the latest trends in
high-performance and general-purpose computing. First, the model is introduced and the cost of
maintaining it is analytically estimated, both in terms of computational and interconnect resources,
where, for the latter aspect, the findings from the first set of contributions are used (see the previous
paragraph). Then, three aspects of the application workload are studied, namely: (i) communication
requirements, (ii) memory requirements, and (iii) computation requirements. The
first aspect is addressed by imposing several constraints, which make the communication patterns
more predictable, and subsequently allow a communication delay analysis to be derived. Moreover,
the workload assignment to computational resources is investigated, but only from the communication
perspective, with the objective of spatially distributing the workload in such a way that all
timing constraints posed on communication delays are met. Then, the focus is shifted towards the
memory requirements, and a set of analysis techniques is proposed, which can be used to check
whether the memory traffic requirements are also fulfilled. In the final part, the computation
requirements of the application workload are studied. However, for this aspect only a coarse-grained
analysis with several simplifying assumptions is presented. The proposed method represents an
initial step towards the complete analysis related to the computation requirements. Subsequently,
assuming this initial analysis, the problem of the workload assignment to computational resources
is revisited, but this time with an orthogonal objective, which is to assure that the computational
requirements of the workload are fulfilled.
The findings suggest that the first set of contributions significantly improves over the state-of-the-art
methods in the real-time analysis of interconnects. The improvements manifest as
reduced analysis pessimism, as well as reduced hardware requirements. Both these aspects
are essential for mitigating the resource over-provisioning effects when designing a new system.
Additionally, the findings suggest that the Limited Migrative Model has considerable potential, and
represents a promising step towards the application of many-core platforms in the real-time
embedded computing domain.
Resumo
Over the last decades, technological advances have made people's lives increasingly
dependent on embedded systems. Currently, these devices account for more than 98% of
all computing systems produced and are used in a wide variety of areas, from
medicine to aviation. Some of these embedded systems interact with the surrounding environment and
have to guarantee not only that a given action is executed correctly, but also that the
action completes within a certain time limit. Devices of this kind are called
real-time embedded systems, and some examples are the pacemakers used in healthcare,
airbags in cars and autopilots in airplanes.
The process of analysing the temporal behaviour of a real-time embedded system
is called real-time analysis. In many cases, the purpose of the analysis is to guarantee that the
device will execute correctly and that the timing constraints will be respected. Real-time
analysis is essentially performed before execution, that is, in the design phase. Its
efficiency therefore depends heavily on the determinism of the system, since any non-determinism
results in pessimism being introduced into the analysis. A pessimistic analysis may cause
excessive use of resources in the design phase, which generally leads to
low utilisation of those resources at runtime. Therefore, reducing the pessimism of the analysis
is one of the pressing objectives in the area of real-time embedded systems.
The first real-time embedded systems were mostly devices with a
single execution unit (called "single-core") and a limited set of functionalities.
However, the growing need to support ever more advanced and
sophisticated functionalities came to demand devices with greater computational capacity. In other areas
of computing (for example, high-performance computing and even computing with
no particular specialisation, that is, general-purpose computing), this requirement for greater computational power
was addressed by resorting to systems composed of several execution units (called
"many-cores"). Consequently, and without any surprise, the same strategy has been
followed in the domain of real-time embedded systems, where many-core platforms
constitute the most recent technological challenge.
Besides the possibility of implementing more advanced functionalities, many-core platforms
offer other possibilities. For example, many functionalities that used to be implemented
on a set of single-core systems can be integrated into a smaller set
of many-core platforms, with a significant reduction in design costs. Moreover,
the abundance of execution units makes it possible to implement efficient energy and thermal
management strategies by temporarily switching off, or "putting to sleep", the execution units
that have no task to execute. At the same time, the existence of "sleeping" execution
units, which can be "woken up" whenever necessary, makes these devices
more resilient to hardware failures.
However, despite the benefits mentioned above, integrating many-core platforms
into real-time embedded systems is a great challenge. The reasons for
this are (i) the ever greater complexity of hardware components, which increases
performance, but often at the cost of the predictability of the systems, and (ii) the growing
difficulty of analysing contention for access to shared resources. These facts
may contribute to a less deterministic system behaviour, which, as mentioned
above, implies adopting a degree of pessimism in the analysis.
The focus of this dissertation is the analysis of real-time embedded systems for many-core
platforms. Specifically, a comprehensive collection of techniques and design options is
presented, with the common objective of making many-core platforms more amenable (or
less complex) to real-time analysis and, consequently, more suitable for and applicable
to the area of real-time embedded systems. The proposed methods achieve this objective
in several ways: (i) by extending existing approaches so as to reduce the pessimism of the
analysis, (ii) by exploiting new hardware features (or elements) and applying constraints
that increase the determinism and analysability of the systems and, finally, (iii) by exploring new
paradigms and functionalities present in some operating systems, as well as innovative strategies
not yet considered in the domain of real-time computing systems.
The contributions presented in this dissertation can be classified into two groups. The
first group of contributions relates to the interconnect medium, which is the most complex
to analyse in many-core platforms. Initially, the interconnect medium considered
was the Network-on-Chip (NoC) with a 2-D mesh topology, which uses the wormhole
switching mechanism and the XY routing technique. This dissertation proposes a new
analysis to determine the worst-case communication delay for this model, which is present in
most of today's many-core platforms. Then, assuming additional hardware support
in the form of virtual channels, improvements to the existing analyses were proposed which, besides
reducing the pessimism, also significantly reduce the hardware resource requirements.
Finally, a new arbitration policy for NoC routers was also proposed.
The second group of contributions relates to a new paradigm in the domain of
real-time embedded systems, called the Limited Migrative Model. This model is
inspired by the current trend in high-performance and general-purpose computing systems.
First, the model is described and the cost of maintaining it is estimated analytically,
both in terms of computational resources and in terms of interconnect resources; for
the latter, the first contributions, presented in the previous paragraph, are used. Then,
three requirements related to each task are studied, namely: (i) communication,
(ii) memory and (iii) computation requirements. Regarding the first requirement,
a set of constraints is imposed, which make the communication patterns more deterministic and
consequently allow the communication delay to be derived analytically. Moreover,
in order to guarantee that the timing requirements imposed on communications are met, the load
assigned to the various computational resources is treated only from the communication perspective.
Next, the focus shifts to memory, and a set of analysis methods is proposed
with the objective of verifying whether the requirements are also met here. Finally, the
computational requirements of a given workload are studied. However, for this last
aspect, the analysis presented is considerably simplified, since it is a first
approach towards a more complete analysis. Then, the problem of allocating the computational load
is revisited, this time with the orthogonal objective of guaranteeing that the computational
requirements are also met.
The results show that the first group of contributions substantially improves the
state of the art in the real-time analysis of interconnects. These improvements are evident above all
in the reduced pessimism of the analysis and also in the reduced hardware requirements. These
two aspects prove essential for mitigating the effects of over-provisioning when
designing a new system. Furthermore, the results suggest that the Limited Migrative
Model has considerable potential and is very promising for supporting the
application of many-core platforms in the domain of real-time embedded systems.
Acknowledgements
Working towards the Ph.D. Degree was the most demanding and challenging activity in my life. It
took me four hard working years to finish all the cool things described in this dissertation. During
that period, I was never alone, and I would like to express my most sincere gratitude to the people
that made me feel this way:
• Stefan M. Petters – my Supervisor, for everything he has done for me, from the day when
he selected me to be his Ph.D. student, until the day he finished the review of the last page
of this dissertation. Stefan, thank you for always believing in me, for pushing me when I
needed to be pushed, and for teaching me many valuable things, not only about research,
but also about life.
• Dobrica and Ivan Nikolić – my parents, and Özge Güngör – my girlfriend, for giving me
strength, courage and motivation to persist on this long journey, especially during the most
challenging period, when I was writing this dissertation. Mum, dad, Özge, you are the best
support that I could ever have, and with you by my side nothing is impossible.
• Leandro Soares Indrusiak, for giving me the opportunity to spend three months under his
supervision, with the Real-Time Systems Group of the University of York, and for always
being available and enthusiastic to discuss new ideas. Leandro, like we said several times
already: “This is just the beginning!”. I truly believe that.
• Hazem Ismail Ali, for being an excellent friend and colleague, and for always having words
of encouragement, priceless advice and a pack of Dutch waffles up his sleeve. Mr. President,
from the bottom of my heart I wish you to reach this stage very soon, mabrouk we
32bel 3endak.
• Patrick Meumeu Yomsi, for joining my crew during the second half of my Ph.D. trip, and
for becoming an invaluable part of it ever since. Patrick, you know which are my favourite
publications. As I always say during CISTER Seminars: “Mömö, you are the man!”.
• Muhammad Ali Awan, Dakshina Dasari and Konstantinos Bletsas, for very productive
collaborations, as well as very interesting and enjoyable discussions, especially those not
related to the work. “Guys, I want to tell you a story...”.
• Paulo Baltarejo Sousa and Ricardo Severino, for translating the abstract into Portuguese.
Paulo, Ricardo, muito obrigado pela vossa ajuda e força F.C. Porto!
• Eduardo Tovar, for creating an outstanding work environment in CISTER Research Center.
This work was partially supported by FCT (Fundação para a Ciência e a Tecnologia) under
the individual doctoral grant SFRH/BD/81087/2011.
List of Publications
The following publications and technical reports have been developed in the scope of the research
activities presented in this dissertation.
Journals (in chronological order)
• [27] Dakshina Dasari, Borislav Nikolić, Vincent Nélis and Stefan M. Petters, "NoC Contention
Analysis using a Branch and Prune Algorithm", ACM Transactions on Embedded
Computing Systems (TECS) - Special Issue on Design Challenges for Many-Core Processors,
Volume 13, Issue 3s, Article number 113, March 2014.
• [70] Borislav Nikolić and Stefan M. Petters, "Real-Time Application Mapping for Many-Cores
Using a Limited Migrative Model", Real-Time Systems, To appear.
Conferences (in chronological order)
• [73] Borislav Nikolić, Muhammad Ali Awan and Stefan M. Petters, "SPARTS: Simulator for
Power Aware and Real-Time Systems", In Proceedings of the 8th IEEE International Conference
on Embedded Software and Systems (IEEE ICESS-11), pages 999-1004, Changsha,
China, November 16-18, 2011.
• [68] Borislav Nikolić and Stefan M. Petters, "Towards Network-On-Chip Agreement Protocols",
In Proceedings of the 10th ACM International Conference on Embedded Software
(EMSOFT 2012), pages 207-216, Tampere, Finland, October 7-12, 2012.
• [75] Borislav Nikolić, Patrick Meumeu Yomsi and Stefan M. Petters, "Worst-Case Memory
Traffic Analysis for Many-Cores using a Limited Migrative Model", In Proceedings of the
19th IEEE International Conference on Embedded and Real-Time Computing Systems and
Applications (RTCSA 2013), pages 42-51, Taipei, Taiwan, August 19-21, 2013.
• [74] Borislav Nikolić, Hazem Ismail Ali, Stefan M. Petters and Luís Miguel Pinho, "Are
Virtual Channels the Bottleneck of Priority-Aware Wormhole-Switched NoC-Based Many-Cores?",
In Proceedings of the 21st International Conference on Real-Time Networks and
Systems (RTNS 2013), pages 13-22, Sophia Antipolis, France, October 16-18, 2013.
• [76] Borislav Nikolić, Patrick Meumeu Yomsi and Stefan M. Petters, "Worst-Case Communication
Delay Analysis for Many-Cores using a Limited Migrative Model", In Proceedings
of the 20th IEEE International Conference on Embedded and Real-Time Computing Systems
and Applications (RTCSA 2014), pages 1-10, Chongqing, China, August 20-22, 2014.
Short-listed for the best paper award and invited for the journal extension.
• [69] Borislav Nikolić and Stefan M. Petters, "EDF as an Arbitration Policy for Wormhole-Switched
Priority-Preemptive NoCs – Myth or Fact?", In Proceedings of the 12th ACM
International Conference on Embedded Software (EMSOFT 2014), pages 1-10, New Delhi,
India, October 12-17, 2014.
Technical Reports
• [71] Borislav Nikolić, Konstantinos Bletsas and Stefan M. Petters, "Priority Assignment and
Application Mapping for Many-Cores Using a Limited Migrative Model", 2013.
• [72] Borislav Nikolić, Leandro Soares Indrusiak and Stefan M. Petters, "A Tighter Real-Time
Communication Analysis for Wormhole-Switched Priority-Preemptive NoCs", 2014.
“The dawn cannot come before the morning.”
Nebojša Čović
“If you do not believe in miracles, then miracles will not happen to you.”
Dejan Radonjić
Contents
Abstract
Acknowledgements
List of Publications
1 Introduction
1.1 Real-Time Embedded Systems
1.2 Real-Time Analysis
1.3 Single-Core ⇒ Multi-Core ⇒ Many-Core
1.4 Benefits of Many-Cores
1.5 Many-Cores in Real-Time Embedded Computing Domain
1.5.1 Computation Process
1.5.2 Interconnect medium
1.5.3 Data Input and Output
1.5.4 Recapitulation
1.6 Thesis Statement
2 Several Steps Closer to Real-Time NoCs
2.1 Traffic Model
2.2 NoCs with the Round-Robin Arbitration Policy
2.2.1 State-of-the-Art Method
2.2.2 Analysis Pessimism
2.2.3 Branch and Prune (BP) Method
2.2.4 Branch, Prune and Collapse (BPC) - More Efficient Method
2.2.5 Experimental Evaluation
2.2.6 Discussion
2.3 NoCs with Priority-Preemptive Arbitration Policies
2.3.1 State-of-the-Art Method for NoCs with Fixed-Priority Arbitration Policy
2.3.2 Priority-Share Policy
2.3.3 Relaxing Hardware Requirements
2.3.4 EDF as Arbitration Policy
2.3.5 Reducing Analysis Pessimism
3 Limited Migrative Model - LMM
3.1 LMM in Detail
3.1.1 Operating System (OS)
3.1.2 Application Layer
3.1.3 LMM Benefits
3.2 Application Workload
3.3 Agreement Protocols
3.3.1 Master-Slave Protocol
3.3.2 List Protocol
3.3.3 Hybrid Protocol
3.3.4 Experimental Evaluation
3.4 Inter-application Communication
3.5 Towards More Deterministic Communication Patterns
3.5.1 Supermessages and Proxies
3.5.2 Maximum Number of Message Occurrences
3.5.3 Performing the Analysis
3.5.4 Discussion
3.5.5 Experimental Evaluation
3.6 Application Mapping
3.6.1 Problem Statement
3.6.2 Mapping Quality
3.6.3 Mapping Process
3.6.4 Experimental Evaluation
3.6.5 Discussion
3.7 Memory Traffic
3.7.1 Workload
3.7.2 Challenges and Inapplicability of Existing Techniques
3.7.3 Access Constraints and Bounding Messages
3.7.4 Solution to Mutually Exclusive Bounding Messages
3.7.5 Approach One: Per-Packet Analysis
3.7.6 Intermediate Step: Partial Per-Pattern Analysis
3.7.7 Approach Two: Full per-pattern analysis
3.7.8 Experimental Evaluation
3.8 Computation
3.8.1 Core Shutdowns
3.8.2 Workload
3.8.3 Problem Statement
3.8.4 Offline Schedulability Guarantees
3.8.5 Online Schedulability
3.8.6 Semi-Schedulability Guarantees
3.8.7 Blind Synchronisation
3.8.8 Parent-Child Relationship
3.8.9 Schedulability and Agreement Protocols
3.8.10 Priority Assignment and Mapping
3.8.11 Experimental Evaluation
3.8.12 Discussion
4 Conclusions and Future Work
Appendix
References
List of Figures
1.1 Fully partitioned scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Semi-partitioned scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Global scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4 Clustered scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5 Point-to-point communication . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.6 Network-on-Chip communication . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.7 Different NoC topologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.8 Packet transfer over NoC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.9 Minimal dimension-ordered routing algorithms . . . . . . . . . . . . . . . . . . 13
1.10 Simplified 2-D mesh NoC router architecture . . . . . . . . . . . . . . . . . . . 14
1.11 Store-and-forward switching . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.12 Wormhole switching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.13 Credit-based flow control mechanism . . . . . . . . . . . . . . . . . . . . . . . 16
1.14 Packet traversals without contentions . . . . . . . . . . . . . . . . . . . . . . . . 17
1.15 Packet traversals with contentions . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.16 Contending packets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.17 Indirect contentions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.18 Indirect interference with virtual channels . . . . . . . . . . . . . . . . . . . . . 22
1.19 Packet preemptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.20 Assumed NoC platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.21 Memory system of single-core devices . . . . . . . . . . . . . . . . . . . . . . . 28
1.22 Memory system of multi-core devices . . . . . . . . . . . . . . . . . . . . . . . 29
1.23 Assumed memory system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.1 Traffic flows (example 1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.2 Traffic flows (example 2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.3 Computation tree for flow f2 from Figure 2.2, in isolation (Ferrandiz et al. [32]
method) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.4 Computation tree for flow f2 from Figure 2.2 (Ferrandiz et al. [32] method) . . . 40
2.5 Computation subtrees for Figure 2.4 . . . . . . . . . . . . . . . . . . . . . . . . 40
2.6 Traversal of packets from two adjacent periods . . . . . . . . . . . . . . . . . . 41
2.7 Traversal of packets from several successive periods . . . . . . . . . . . . . . . . 42
2.8 Computation tree for flow f2 from Figure 2.2 (BP method) . . . . . . . . . . . . 44
2.9 Computation subtrees constructed after branching in Figure 2.8 . . . . . . . . . . 44
2.10 Traffic flows (revisited example 2) . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.11 Distribution of WCTT improvement across flows (legends represent improvement ranges) 54
2.12 BPC method with varying SIRL vs. Ferrandiz et al. [32] method . . . . . . . . . 55
2.13 Inter-SIRL ratios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
xv
xvi LIST OF FIGURES
2.14 All-to-one contending flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.15 Impact of MIRT on flow f1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
2.16 Traffic flows (example 3) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
2.17 Deadline miss for flow-set from Figure 2.16 and Table 2.2 . . . . . . . . . . . . 62
2.18 Assignment of virtual channels to flow-set with distinctive priorities . . . . . . . 66
2.19 Assignment of virtual channels to flow-set with priority-share policy . . . . . . . 66
2.20 Per-router assignment of virtual channels to flow-set with distinctive priorities . . 66
2.21 Core selection methodology proposed by Ali et al. [4] . . . . . . . . . . . . . . . 70
2.22 Virtual channels do scale with respect to number of flows . . . . . . . . . . . . . 75
2.23 Additional virtual channels do not help in schedulability guarantees . . . . . . . 76
2.24 Link bandwidths are bottlenecks . . . . . . . . . . . . . . . . . . . . . . . . . . 77
2.25 Traffic flows (example 4) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
2.26 Deadline miss for flow-set from Figure 2.25 and Table 2.5 . . . . . . . . . . . . 81
2.27 Chain of dependencies for Figure 2.25 . . . . . . . . . . . . . . . . . . . . . . . 83
2.28 Traffic flows (example 5) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
2.29 EDF vs RM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
2.30 EDF vs HSA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
2.31 EDF vs RM,HSA (1-hop) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
2.32 Analysis time variations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
2.33 Influence of ∆ on EDF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
2.34 Traffic flows (example 6) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
2.35 Detailed interference analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
2.36 Traffic flows (example 7) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
2.37 Traffic flows (example 8) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
2.38 Improvement in the worst-case traversal time of f2, for two contending flows f1
and f2, where P(f1) > P(f2), |L(f1)| = |L(f2)|, and f1 preempts f2 only once . 101
2.39 Improvements wrt flow sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
2.40 Improvements wrt path sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
2.41 Improvements wrt flow-set sizes . . . . . . . . . . . . . . . . . . . . . . . . . . 104
2.42 Improvements wrt flow and path sizes . . . . . . . . . . . . . . . . . . . . . . . . . 104
2.43 Improvements wrt priorities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
2.44 Analysis tightness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
3.1 Limited Migrative Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
3.2 Example of application’s computation, memory access and communication patterns 111
3.3 Agreement protocol messages are master-dependent . . . . . . . . . . . . . . . . 114
3.4 Analysis pessimism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
3.5 Impact of dispatchers on WCPD . . . . . . . . . . . . . . . . . . . . . . . . . . 135
3.6 Inter-application communication . . . . . . . . . . . . . . . . . . . . . . . . . . 137
3.7 Application whose shape and messages comply with Definitions 3-4 . . . . . . . 140
3.8 Supermessages for different application shapes . . . . . . . . . . . . . . . . . . 140
3.9 Inter-application communication . . . . . . . . . . . . . . . . . . . . . . . . . . 142
3.10 Intra-application messages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
3.11 Analysis improvements (1/2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
3.12 Analysis improvements (2/2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
3.13 Performance comparison (1/2) . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
3.14 Performance comparison (2/2) . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
3.15 Performance comparison for Hybrid protocol . . . . . . . . . . . . . . . . . . . 157
3.16 Analysis tightness across number of dispatchers . . . . . . . . . . . . . . . . . . 157
3.17 Analysis tightness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
3.18 Different dispatcher mappings for the same application-shape . . . . . . . . . . . 162
3.19 Shape comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
3.20 Rerouting penalty estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
3.21 Influence of application-set size on analysis time (1/2) . . . . . . . . . . . . . . . 173
3.22 Influence of application-set size on analysis time (2/2) . . . . . . . . . . . . . . . 174
3.23 Influence of grid size on analysis time . . . . . . . . . . . . . . . . . . . . . . . 175
3.24 Influence of parameter P on mapping process . . . . . . . . . . . . . . . . . . . 176
3.25 Influence of parameter G on mapping process . . . . . . . . . . . . . . . . . . . 177
3.26 Memory operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
3.27 Master volatility problem for memory operations . . . . . . . . . . . . . . . . . 180
3.28 Memory operations via proxy dispatchers . . . . . . . . . . . . . . . . . . . . . 180
3.29 Bounding messages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
3.30 Analyses comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
3.31 Improvement with single controller . . . . . . . . . . . . . . . . . . . . . . . . . 187
3.32 Different memory operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
3.33 Full per-pattern vs. per-packet . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
3.34 Penalty of multiple controllers . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
3.35 Semi-schedulability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
3.36 Non-synchronised semi-schedulability . . . . . . . . . . . . . . . . . . . . . . . 195
3.37 Future release schedulability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
3.38 Priority assignment upon dispatchers . . . . . . . . . . . . . . . . . . . . . . . . 199
3.39 Speculative mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
3.40 Impact of RTA priorities and SS on schedulability guarantees . . . . . . . . . . . 203
3.41 Impact of number of dispatchers on schedulability guarantees . . . . . . . . . . . 203
3.42 Impact of mapping strategies on schedulability guarantees . . . . . . . . . . . . 204
3.43 Efficiency of online tests when using remaining computation times . . . . . . . . 204
3.44 Efficiency of online tests without remaining computation times . . . . . . . . . . 205
3.45 Impact of number of dispatchers on runtime behaviour, without core shutdowns . 205
3.46 Impact of number of dispatchers on runtime behaviour, with core shutdowns . . . 206
3.47 Impact of mapping strategies on runtime behaviour, without core shutdowns . . . 206
3.48 Impact of mapping strategies on runtime behaviour, with core shutdowns . . . . . 207
3.49 Blind synchronisation mode (BSM) . . . . . . . . . . . . . . . . . . . . . . . . 207
4.1 General form of Bordered Hessian for n variables and one constraint . . . . . . . . . . 213
4.2 Bordered Hessian for λ = 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
4.3 Bordered Hessian for λ ≠ 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
List of Tables
1 List of symbols used in this dissertation (1/6) . . . . . . . . . . . . . . . . . . . xxi
2 List of symbols used in this dissertation (2/6) . . . . . . . . . . . . . . . . . . . xxii
3 List of symbols used in this dissertation (3/6) . . . . . . . . . . . . . . . . . . . xxiii
4 List of symbols used in this dissertation (4/6) . . . . . . . . . . . . . . . . . . . xxiv
5 List of symbols used in this dissertation (5/6) . . . . . . . . . . . . . . . . . . . xxv
6 List of symbols used in this dissertation (6/6) . . . . . . . . . . . . . . . . . . . xxvi
1.1 List of symbols related to the assumed platform . . . . . . . . . . . . . . . . . . 33
2.1 Analysis parameters for Section 2.2.5 . . . . . . . . . . . . . . . . . . . . . . . 53
2.2 Flow-set parameters for Figure 2.16 (example 1) . . . . . . . . . . . . . . . . . . 62
2.3 Flow-set parameters for Figure 2.16 (example 2) . . . . . . . . . . . . . . . . . . 64
2.4 Analysis parameters for Section 2.3.3.5 . . . . . . . . . . . . . . . . . . . . . . 74
2.5 Flow-set parameters for Figure 2.25 . . . . . . . . . . . . . . . . . . . . . . . . 80
2.6 Flow-set parameters for two contending flows . . . . . . . . . . . . . . . . . . . 86
2.7 Flow-set parameters for Figure 2.28 . . . . . . . . . . . . . . . . . . . . . . . . 87
2.8 Comparison of approaches (schedulability) . . . . . . . . . . . . . . . . . . . . 88
2.9 Analysis parameters for Section 2.3.4.5 . . . . . . . . . . . . . . . . . . . . . . 89
2.10 Flow-set parameters for Figure 2.34 (example 1) . . . . . . . . . . . . . . . . . . 98
2.11 Flow-set parameters for Figure 2.34 (example 2) . . . . . . . . . . . . . . . . . . 100
2.12 Analysis parameters for Section 2.3.5.5 . . . . . . . . . . . . . . . . . . . . . . 102
3.1 OS operations related to agreement protocols . . . . . . . . . . . . . . . . . . . 112
3.2 Analysis and simulation parameters for Section 3.3.4 . . . . . . . . . . . . . . . 133
3.3 OS operations related to inter-application communication . . . . . . . . . . . . . 136
3.4 Analysis and simulation parameters for Section 3.5.5 . . . . . . . . . . . . . . . 152
3.5 Analysis and simulation parameters for Section 3.6.4 . . . . . . . . . . . . . . . 172
3.6 Analysis parameters for Section 3.7.8 . . . . . . . . . . . . . . . . . . . . . . . 185
3.7 Speculative mapping computation for Figure 3.39 . . . . . . . . . . . . . . . . . 201
3.8 Analysis and simulation parameters for Section 3.8.11 . . . . . . . . . . . . . . 202
List of Symbols
Table 1: List of symbols used in this dissertation (1/6)
Symbol Description
HARDWARE
Ψ The assumed platform.
Π The set of processing elements (cores) of Ψ.
z The number of cores in Π.
πi The ith core of Π, where i ∈ {1, ...,z}.
∆ The maximum clock skew between any two cores of Π.
η A generic 2-D mesh NoC interconnect of Ψ.
ηRR η with the round-robin arbitration policy and a single channel.
ηPP η with the priority-preemptive arbitration policy and virtual channels.
x The horizontal dimension of η , where x · y = z.
y The vertical dimension of η , where x · y = z.
ρ The set of routers of η .
ρi The ith router of ρ , where i ∈ {1, ...,z}.
λ The set of links of η dedicated to the communication traffic.
λi, j The communication link from ρi (source) to ρ j (destination).
λc,i The communication link from πi (source) to ρi (destination).
λi,c The communication link from ρi (source) to πi (destination).
σflit The size of the flit and the width of each link of λ.
v̂lim The number of virtual channels of η .
δL The time it takes a flit to traverse a link.
δρ The time it takes a header flit to traverse a router.
δR The time it takes to reroute one packet.
νρ The frequency of routers.
Table 2: List of symbols used in this dissertation (2/6)
Symbol Description
HARDWARE
Blink Link bandwidth, Blink = σflit / (δρ + δL).
Ξ The non-coherent memory system of Ψ.
µ The set of memory controllers of Ξ.
µi The ith controller of µ , where i ∈ {1,2,3,4}.
κ The set of cache memories of Ξ.
κi The ith cache memory of κ , where i ∈ {1, ...,z}.
λM The set of links of η dedicated to the memory traffic.
λMi, j The memory link from ρi (source) to ρ j (destination).
λMc,i The memory link from πi (source) to ρi (destination).
λMi,c The memory link from ρi (source) to πi (destination).
λMm,i The memory link from the memory controller (source) to ρi (destination).
λMi,m The memory link from ρi (source) to the memory controller (destination).
OS
δ→P The time it takes to send the protocol message.
δ←P The time it takes to receive the protocol message.
δ→C The time it takes to send the execution context.
δ←C The time it takes to receive the execution context.
δQ The time it takes to get the info. from the kernel about the next job release.
δE The time it takes to elect the next master, or generate the sorted master list.
δ→I The time it takes to send the inter-application message.
δ←I The time it takes to receive the inter-application message.
M̂ The maximum number of concurrent masters on one core.
K The maximum number of concurrent core shutdowns.
SOFTWARE
F The set of flows.
w The number of flows in F.
fi The ith flow of F, where i ∈ {1, ...,w}.
src( fi) The source of fi.
dst( fi) The destination of fi.
L ( fi) The set of links of λ which fi traverses.
σ( fi) The size of fi.
Table 3: List of symbols used in this dissertation (3/6)
Symbol Description
SOFTWARE
T ( fi) The minimum inter-arrival period of fi.
D( fi) The deadline of fi.
P( fi) The priority of fi.
C( fi) The traversal time of fi, in isolation.
JR( fi) The release jitter of fi.
JN( fi) The network jitter of fi.
WCTT( fi) The analytically computed worst-case traversal time of fi.
WCTT∗( fi) The exact worst-case traversal time of fi.
U( fi) The utilisation of L ( fi).
FD( fi) The set of flows directly contending with fi.
FI( fi) The set of flows indirectly contending with fi.
W ( fi) The busy period of fi.
Tcrit( fi) The set of critical instants of fi within W ( fi).
L( fi, t) The abs. analytically computed worst-case traversal time of fi released at t.
WCTT( fi, t) The analytically computed worst-case traversal time of fi released at t.
f˚ The composite flow.
FC( f˚ ) The set of flows constituting f˚ .
I( fi→ f j) The interference that a single packet of fi causes to f j.
L^{pre-CD}_{i,j} The part of L ( fi) before fi and fj start contending.
L^{CD}_{i,j} The part of L ( fi) where fi and fj contend.
L^{post-CD}_{i,j} The part of L ( fi) after fi and fj stop contending.
γ^{pre-CD}_{i,j} The time it takes a header flit of fi to traverse L^{pre-CD}_{i,j}.
γ^{post-CD}_{i,j} The time it takes a tail flit of fi to traverse L^{post-CD}_{i,j}.
WCTT+( fi) The tighter analytically computed worst-case traversal time of fi.
F The set of functionalities.
Fi The ith functionality of F , where i ∈ {1, ...,z}.
FS(Fi) The set of flows sent by Fi.
FR(Fi) The set of flows received by Fi.
M The mapping of the workload onto the platform.
Table 4: List of symbols used in this dissertation (4/6)
Symbol Description
SOFTWARE
v̂(M) The minimum number of virtual channels required by M.
Ŝ(M) The fraction of flow-sizes for which M is schedulable, 0 < Ŝ(M) ≤ 1.
A The set of applications.
v The number of applications in A .
Q The quality of A .
AH The subset of A , representing the high-priority applications.
AL The subset of A , representing the low-priority applications.
P The parameter which decides the breakdown of A on AH and AL.
G The parameter which decides the allowed shape sizes for AH .
ai The ith application of A , where i ∈ {1, ...,v}.
P(ai) The priority of ai.
M(ai) The migration coefficient of ai.
qi The quality of the current mapping of ai.
Smin(ai) The surface of the narrow mapping of ai.
Smax(ai) The surface of the wide mapping of ai.
T (ai) The minimum inter-arrival period of ai.
D(ai) The deadline of ai.
D(ai) The set of dispatchers of ai.
Cτ(ai) The computation requirements of ai.
Dτ(ai) The computation deadline of ai.
Dµ(ai) The memory traffic deadline of ai.
Dτ+µ(ai) The sum of Dτ(ai) and Dµ(ai).
Dη(ai) The communication deadline of ai.
F µ(ai) The set of flows related to the memory operations of ai.
Fη(ai) The set of flows related to the intra- and inter-application comm. of ai.
FηD (ai) The set of directly interfering flows of any flow from Fη(ai).
FηS (ai) The set of inter-application messages sent by ai.
FηR (ai) The set of inter-application messages received by ai.
F̂ (ai) The set of supermessages of ai.
Table 5: List of symbols used in this dissertation (5/6)
Symbol Description
SOFTWARE
f̂_i^j The jth supermessage of F̂ (ai).
FP(ai) The set of inter-proxy messages of ai.
f Pi, j The inter-proxy message of FP(ai).
F (ai) The set of bounding messages belonging to ai.
fi The bounding message.
FE( fi) The set of mutually exclusive bounding messages of fi, including fi.
FR( fi) The reduced set of interfering messages of fi.
FR(ai) The reduced set of bounding messages belonging to ai.
ω( fi) The max. num. of occurrences of fi during one period of its application.
ωP( fi) The max. num. of occurrences of fi during one protocol execution.
ωC( fi) The max. num. of occurrences of fi during one context transfer.
ωI( fi) The max. num. of occurrences of fi during inter-application communication.
CP(f̂_i^j) The traversal time of f̂_i^j related to the protocol execution of ai, in isolation.
CC(f̂_i^j) The traversal time of f̂_i^j related to the context transfer of ai, in isolation.
CI(f̂_i^j) The traversal time of f̂_i^j related to inter-application comm. of ai, in isolation.
CηM(ai) The delay of comm.-related OS ops of the current master of ai, in isolation.
CηS (ai) The delay of comm.-related OS ops of the slave of ai, in isolation.
CηN(ai) The delay of comm.-related OS ops of the next master of ai, in isolation.
Cη# (ai) The delay of communication traffic of ai, in isolation.
Cη(ai) The communication delay of ai, in isolation.
Iη# (ai, t) The worst-case network interference that ai can suffer within t.
Rη(ai) The analytically computed worst-case communication delay of ai.
Rη∗ (ai) The worst-case communication delay of ai obtained via simulations.
CηR (ai) The rerouting delay of the communication traffic of ai, in isolation.
CηRP(ai) The rerouting delay of one protocol execution of ai, in isolation.
CηRS(ai) The rerouting delay of sent inter-application traffic of ai, in isolation.
CηRR(ai) The rerouting delay of received inter-application traffic of ai, in isolation.
Cη#P(ai) The delay of protocol-related traffic of ai, in isolation.
Cη#S(ai) The delay of sent inter-application traffic of ai, in isolation.
Table 6: List of symbols used in this dissertation (6/6)
Symbol Description
SOFTWARE
Cη#R(ai) The delay of received inter-application traffic of ai, in isolation.
IηR (ai, t) The worst-case rerouting interference that ai can suffer within t.
Rµ(ai) The analytically computed worst-case memory traffic delay of ai.
maxhops(ai) The max. distance (in hops) between any two dispatchers of ai.
maxhops(ai,a j) The max. distance (in hops) between any two dispatchers of ai and a j.
σprt The size of the protocol message.
σctx The size of the execution context message.
σiam The size of the inter-application message.
d_i^j The jth dispatcher of ai.
P(d_i^j) The priority of d_i^j.
π(d_i^j) The core of d_i^j.
ρ(d_i^j) The rerouter connected to π(d_i^j).
Dπ(d_i^j) The set of dispatchers residing on π(d_i^j), including d_i^j.
Iη(d_i^j, t) The worst-case on-core interference that d_i^j can suffer within t.
r(d_i^j) The max. num. of reroutings of ρ(d_i^j) due to d_i^j, during one period of ai.
rP(d_i^j) The max. num. of reroutings on ρ(d_i^j) due to d_i^j, during one protocol of ai.
rI(d_i^j) The max. num. of reroutings on ρ(d_i^j) due to d_i^j, during inter-app. comm. of ai.
Chapter 1
Introduction
In this chapter, the context and motivation for the research activities conducted in the scope of this
dissertation are provided.
1.1 Real-Time Embedded Systems
Over the last few decades, technological advancements have dramatically improved the
production process of electronic devices. As the capabilities of electronic systems continue
to grow, we delegate more and more of our daily activities to them. Nowadays, we can say
without exaggeration that our lives depend on electronic devices, and that these systems have
become an integral part of everyday life.
In many cases, electronic devices are designed to perform only a specific set of functions.
Usually, such devices are implemented on specialised hardware with limited capacity and/or
power consumption constraints. These systems are called embedded devices. At present,
embedded devices account for more than 98% of all produced electronic equipment [31],
and their applications span a wide range of areas, such as medicine, the automotive
industry and avionics.
In many applications, the only requirement is that the embedded system executes its
functionalities correctly. However, depending on the purpose of the system, additional
requirements may be present, for instance, to correctly execute the functionalities within a
certain time period. This is usually the case with embedded devices that interact closely with
the physical environment. The most notable examples are medical pacemakers, airbag systems in
cars and autopilot functionalities in aircraft, where a timely reaction to outside stimuli is of
crucial importance. For example, when a car accident occurs, the airbag system must commence
the inflation process exactly at the critical moment. If it fails to do so, or does it too early
or too late, the consequences may be catastrophic. Embedded systems of this kind, where both
the correct execution and the temporal behaviour are important, are called real-time embedded
systems, and they are central to this dissertation.
The importance of the temporal behaviour of a real-time embedded system highly depends
on its purpose. In some scenarios, not fulfilling the posed timing requirements has mild,
almost negligible effects, usually expressed as a degradation in the quality of service. Live
audio/video systems are examples of this category, and such devices are referred to as soft
real-time systems. Conversely, in other scenarios, fulfilling all timing requirements is
imperative, and any failure to do so may have catastrophic consequences. The aforementioned
airbag system falls into this category, and such devices are called hard real-time systems.
In this dissertation, the focus is on hard real-time systems.
1.2 Real-Time Analysis
Simulations and measurement-based techniques are some of the methods that can be used to
observe the temporal behaviour of a system. However, these techniques face several limitations.
For instance, they are often time-consuming. Moreover, they only allow the system behaviour
to be observed during a certain time interval, which is usually not enough to capture all
possible system states. These limitations become very obvious in the context of hard real-time
systems, where one of the requirements is to investigate whether a given system will always meet
all timing requirements, even under worst-case conditions. In such cases the only viable approach
is real-time analysis.
Real-time analysis is a technique where an analytic description of the system characteristics
is used to derive conclusions regarding the runtime behaviour of the system. The main advantage
of this technique is that it can analytically describe corner cases (worst-case scenarios), often
with a certain degree of pessimism; the analysis complexity and the pessimism are usually
in a trade-off relationship. If the focus of the analysis is on extracting the temporal properties
of an identified, or artificially generated, worst-case scenario, the analysis is called worst-case
analysis. This analysis is predominantly used in the hard real-time domain, where one of the
fundamental requirements is to derive guarantees that the system behaviour will never violate
any temporal constraint, not even in the worst-case scenarios.
Worst-case analysis is usually performed at design-time. Evidently, its efficiency highly
depends on an accurate prediction of the system behaviour at runtime. In order to prevent the
underestimation of the worst-case scenario, that is, to keep the worst-case analysis safe, any
unpredictable or non-deterministic aspect of the system behaviour has to be accounted for in the
analysis with a certain degree of pessimism. A more pessimistic analysis may cause significant
resource over-provisioning at design-time, and consequently lead to a severe underutilisation of
platform resources at runtime. Thus, the foremost objective in the hard real-time domain is to
propose analyses which (i) are safe, (ii) have acceptable computational complexity and (iii) have
a tolerable amount of pessimism.
In this dissertation, the focus is on worst-case analysis.
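As a minimal illustration of how an analytic description yields a safe worst-case bound, the
sketch below applies the classical response-time recurrence for fixed-priority preemptive
scheduling on a single core, R = C + Σ ⌈R/Tj⌉ · Cj, summed over all higher-priority tasks j.
This textbook analysis is not one of the analyses proposed in this dissertation; the task model
(C, T, D), the function name and the example values are assumptions chosen purely for illustration.

```python
def response_time(tasks, i):
    """Worst-case response time bound of tasks[i] on one core.

    tasks is a list of (C, T, D) tuples (computation time, period, deadline),
    sorted by decreasing priority. Returns None if the bound exceeds D.
    Hypothetical helper, for illustration only.
    """
    c_i, _, d_i = tasks[i]
    r = c_i
    while True:
        # Interference from every higher-priority task released within r.
        interference = sum(((r + t_j - 1) // t_j) * c_j
                           for c_j, t_j, _ in tasks[:i])
        r_next = c_i + interference
        if r_next > d_i:
            return None      # the deadline cannot be guaranteed
        if r_next == r:
            return r         # fixed point reached: safe upper bound
        r = r_next


# Assumed example task set, highest priority first.
tasks = [(1, 4, 4), (2, 6, 6), (3, 12, 12)]
bounds = [response_time(tasks, i) for i in range(len(tasks))]
print(bounds)
```

The bound is safe but may be pessimistic: it assumes that all higher-priority tasks are released
simultaneously and as often as possible, whether or not that pattern can occur at runtime.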
1.3 Single-Core⇒Multi-Core⇒Many-Core
As technological advancements provided an environment in which real-time embedded systems
could thrive, demands for more advanced and sophisticated functionalities kept emerging each
year. Until recently, steady progress in semiconductor technology was able to cope with the
constantly rising requirements, resulting in a vast number of transistors accommodated within
a single processing unit. However, the miniaturisation process hit a limit [60], due to the
inability to manage dissipated heat and the consequently increasing temperatures at high
frequencies. This hindered the continuation of the established trend of processing power
enhancements in single-core processors.
In order to continue the progress of computational devices, a paradigm shift in processor
design was needed. The chip manufacturers applied a different strategy: rather than enhancing
the abilities of a single-core device, the idea is to integrate multiple cores within a single
chip. From that moment onwards, the number of cores integrated within the same chip started to
grow, and that trend continues today. The first such platforms contained only a few cores, and
they were referred to as multi-cores. When the number of cores integrated within a single chip
outgrew ten, a new term was coined for such devices: many-cores. Even though it may appear at
this stage that multi-cores and many-cores differ merely in the number of cores, that is not
true. In fact, multi-core and many-core platforms differ significantly in numerous aspects, such
as the interconnect medium, the memory system and the process of maintaining a correct and
consistent system-wide state.
The scientific area which transitioned the fastest and the smoothest from single-cores to
multi- and many-cores is high-performance computing. This comes as no surprise, as the
development trends in chip design are predominantly driven by the need for more powerful and
efficient computational devices (better performance), which entirely coincides with the
requirements of that area. Very similar evolution trends, although with a small delay, are
visible in the general-purpose computing area. The real-time embedded computing area lags even
further behind. In fact, at present, single-cores and multi-cores constitute the majority of the
systems that are considered for practical implementations in the real-time embedded domain,
while many-core platforms are perceived only as an emerging technology that will be used in the
forthcoming years.
The integration of many-core platforms in the real-time embedded domain progresses slowly
because the advancements in the field of chip design differ significantly from, and to an extent
contradict, the requirements of that domain. Specifically, system predictability is one of the
essential prerequisites for efficient real-time analysis, while the topmost objective of chip
designers is to improve the efficiency and the performance of the system, usually at the expense
of system predictability. In fact, there is a common opinion among researchers that the many-core
domain presents a significantly more challenging environment for real-time research than the
single-core domain.
In this dissertation, the focus is on many-core platforms.
1.4 Benefits of Many-Cores
Many-core platforms have the potential to bring numerous benefits to the real-time embedded
domain, of which only a few are listed here.
• Functionality enhancements. The abundance of cores brings more computational power,
which makes it possible to extend the existing functionalities or implement new ones.
• Cost reductions. Integrating into fewer many-core devices the functionalities that were
previously executed on numerous single-core systems can significantly reduce design costs.
• Flexibility. The abundance of cores makes it possible to accommodate workload changes at
runtime. That is, admission tests can be performed and, subsequently, the existing
functionalities can be extended or new ones deployed. Similarly, obsolete functionalities can
be removed from the system, so as to free capacity for future extensions and deployments.
• Energy/thermal management. The abundance of cores makes it possible to perform load
balancing for energy and thermal management. That is, the workload can be efficiently migrated
around the platform to prevent thermal hotspots and/or contribute to the durability of the
system. Moreover, in cases where the capacity of the platform significantly exceeds the
requirements, idle cores can be temporarily shut down for two reasons: (i) to increase the
durability of the system, and (ii) to decrease the power consumption.
• Better fault tolerance. The abundance of cores makes it possible to recover from hardware
failures. That is, once a malfunction of a core has been suspected or detected, that core can be
preventively shut down and the platform can continue to work with the remaining capacity.
1.5 Many-Cores in Real-Time Embedded Computing Domain
Despite the aforementioned benefits, numerous challenges arise when researchers try to put
many-core platforms in the real-time embedded context. Perhaps the greatest challenge is the
complex design. In order to circumvent this problem, researchers usually do not analyse an
entire system, but rather focus on a subset of aspects, typically one type of resource (e.g.
processing elements, interconnect mediums, memory subsystems), and analyse it independently of
the rest of the platform. Consequently, the state-of-the-art approaches can be classified into
categories based on the analysed aspect.
In the rest of this section, an overview of many-cores and their perspective in the real-time
embedded domain is given. However, rather than focusing on the entire platform, the
individual aspects are emphasised and discussed independently.
1.5.1 Computation Process
Each functionality implemented within a many-core platform occasionally requires certain
computation processes to be performed. These activities are executed on the processing
elements, i.e. the cores. Besides the mere number of cores, which can range from a dozen to
hundreds, many-core platforms also differ from each other in other aspects, such as the types
of cores. Nowadays, the most established trend in chip design is platforms with cores which are
identical in terms of physical characteristics, called homogeneous many-cores. The most notable
examples are the Tilera family of processors [93], the Single-Chip-Cloud Computer (SCC)
manufactured by Intel [42] and the Epiphany processors designed by Adapteva [3]. One of the
implications of such a design is that the computation requirements of a functionality will
always be the same, irrespective of the actual core that has been selected to perform the
computation.
In recent years, chip manufacturers also explored the possibilities to integrate different types
of cores within the same many-core platform. Usually, cores of a certain type are specialised in
some activities, and the types are combined within the platform in such a way to complement each
other. These computers are referred to as the heterogeneous platforms, and one example is the
MPPA-256 Manycore Processor developed by Kalray [44]. The focus of the majority of works in
the real-time domain is on homogeneous platforms, and the same is true of this dissertation.
How to assign the available cores to the functionalities existing within the platform has been,
and still is, one of the most extensively studied problems in the real-time analysis of many-cores.
The area that covers this aspect is called the multiprocessor scheduling theory. If the functionality
does not have the ability to change cores, but always has to perform its computation on the same
core, in the scheduling theory it is referred to as non-migrative. Conversely, if the functionality
has the possibility to use multiple cores, it is considered as migrative. Depending on the level
of migrative freedom given to functionalities, all state-of-the-art approaches can be classified into
several groups.
1.5.1.1 Non-migrative (Fully Partitioned) Approaches
These approaches are in the scheduling theory also known as the fully partitioned approaches.
Each functionality is statically, at design-time, assigned to a specific core where it has to perform
its computation. Once assigned, the functionality cannot switch to another core (migrate), but in-
stead, it has to always perform the computation on the same core. The scheduling of the workload
on each core is performed by a local scheduler, in a single-core fashion, independent of the rest of
the system. An illustrative example of this scheme is given in Figure 1.1.
[Figure: functionalities, cores π1 to π4, one scheduling algorithm per core]
Figure 1.1: Fully partitioned scheduling
The advantage of this concept is that preventing functionality migrations makes the system
more predictable and hence more analysable, which is beneficial when deriving real-time guarantees.
Moreover, the problem of workload scheduling is significantly simplified, because each core can be
analysed as an independent single-core system, and the very advanced uniprocessor scheduling
theory [18, 58] can be applied in this context. However, the advantage of this concept is at the same
time its weakness. The inability to perform migrations makes these approaches rigid and inflexible,
which renders them ineffective in scenarios with dynamic load changes, or in scenarios where runtime
load balancing is required for energy/thermal management or fault tolerance reasons. These
limitations significantly narrow the application domain of the fully partitioned approaches.
1.5.1.2 Semi-Partitioned Approaches
These techniques (e.g. [15, 47]), to some extent, embrace the concept of workload migrations.
However, each migrative functionality still has its computation pattern defined at design-time.
That is, a migrative functionality may have its computation split among two (or more) cores, but
always has to perform a prescribed amount of work on each of the cores where it migrates, in a
given order. A semi-partitioned scheduling is illustrated in Figure 1.2.
[Figure: functionalities, cores π1 to π4, one scheduling algorithm per core]
Figure 1.2: Semi-partitioned scheduling
When compared to the fully partitioned concept, the semi-partitioned one allows for more
efficient workload assignment and consequently more effective resource utilisation [15]. How-
ever, these approaches suffer from the same limitations as the previous category, that is, they are
inapplicable in scenarios where dynamic load changes occur, or where the runtime load balanc-
ing is required. In terms of scheduling, both the aforementioned approaches tend to perceive the
many-core platform as a collection of independent single-core systems. Evidently, this strategy
removes the platform flexibility, which has been already identified as one of its greatest benefits
(see Section 1.4).
1.5.1.3 Fully Migrative (Global) Approaches
These approaches (e.g. [7, 9]), in the scheduling theory also known as the global approaches,
allow unconstrained workload migrations, which means that each functionality has the possibility
to migrate at any time to any core within the platform, and perform its computation there. All
migration decisions are made at runtime, by a single global scheduler. An illustrative example of
global scheduling is given in Figure 1.3.
[Figure: functionalities, cores π1 to π4, a single scheduling algorithm]
Figure 1.3: Global scheduling
Global approaches are much more flexible than the two aforementioned categories, and as such
do not suffer from the same limitations. That is, due to the fact that migration-related decisions are
made at runtime, the global approaches can successfully cope with dynamic load changes. More-
over, the load balancing is inherently supported by the design itself, which makes it possible to
implement some energy/thermal management policies by shutting down some of the cores at run-
time. However, the global approaches require a global notion of time, and also global, centralised
structures, such as a scheduler and a ready queue. As the number of cores and the workload within
the platform increase, the contentions for the global structures also increase, thus scalability issues
become more apparent. It is important to mention that these scalability issues are indeed some
of the key reasons why multi-core and many-core platforms are considered as different systems.
That is, in multi-cores some fully migrative scheduling approaches can be efficiently implemented,
because the scalability issues are mild due to the small number of cores. Conversely, in the many-
core domain, global scheduling is not considered as an efficient and viable approach. In fact,
numerous challenges arise when attempting implementations [10].
1.5.1.4 Clustered Approaches
Clustered approaches (e.g. [19]) present a concept which combines the properties of the non-
migrative and fully migrative approaches. Specifically, all cores are divided into disjoint groups,
where each group forms one cluster. A cluster is perceived and treated as an independent sys-
tem, with the global scheduling policy applied on the cluster-level. Moreover, each functionality
is assigned to exactly one cluster, and it has the possibility to freely migrate within its cluster.
Figure 1.4 depicts the system with the clustered scheduling.
[Figure: functionalities, cores π1 to π4, two scheduling algorithms]
Figure 1.4: Clustered scheduling
These approaches offer both runtime load balancing (the positive aspect of fully migrative
approaches) and scalability (the positive aspect of non-migrative approaches). However, they are
inefficient in scenarios where workload migrations are driven by fault tolerance or energy/thermal
management, where inter-cluster migrations (preferably spatially distant ones) are a necessary
option. This limitation narrows the scope of application of these approaches.
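The difference in migrative freedom between the fully partitioned, global and clustered approaches can be summarised with a small sketch. It is only an illustration: the names Approach and allowed_cores, the integer core identifiers and the two-cluster layout are assumptions made for this example. The semi-partitioned approaches are left out, since they prescribe an ordered sequence of cores and amounts of work rather than a set of permitted cores.

```python
from enum import Enum


class Approach(Enum):
    FULLY_PARTITIONED = "fully partitioned"
    GLOBAL = "global"
    CLUSTERED = "clustered"


def allowed_cores(approach, all_cores, home_core=None, clusters=None):
    """Return the set of cores on which a functionality may execute."""
    if approach is Approach.FULLY_PARTITIONED:
        # The functionality never leaves its statically assigned core.
        return {home_core}
    if approach is Approach.GLOBAL:
        # Unconstrained migrations: any core of the platform may be used.
        return set(all_cores)
    if approach is Approach.CLUSTERED:
        # Free migrations, but only inside the cluster that holds home_core.
        for cluster in clusters:
            if home_core in cluster:
                return set(cluster)
        raise ValueError("home_core does not belong to any cluster")
    raise ValueError("unknown approach")


# Hypothetical platform with four cores split into two disjoint clusters.
cores = [1, 2, 3, 4]
clusters = [{1, 2}, {3, 4}]
partitioned = allowed_cores(Approach.FULLY_PARTITIONED, cores, home_core=3)
fully_migrative = allowed_cores(Approach.GLOBAL, cores)
clustered = allowed_cores(Approach.CLUSTERED, cores, home_core=3, clusters=clusters)
```

In this sketch a functionality assigned to core 3 may use only core 3 under the fully partitioned approach, cores 3 and 4 under the clustered approach, and all four cores under the global approach.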
1.5.1.5 Discussion
So far, different workload execution techniques have been analysed solely from the perspective
of processing elements, i.e. cores. Yet, it is worth mentioning that the selection of the workload
execution method impacts not only cores, but also other system resources, such as the interconnect
medium. For example, in the non-migrative approaches, the core of each functionality is known
at design-time. This implies that the communication between any two functionalities will always
involve the same two cores, and with a deterministic routing mechanism it can be assured that
the exchanged data always uses the same resources of the interconnect medium, e.g. always
traverses the same path over the interconnect. This possibility to inject the predictability into
the communication between functionalities can significantly ease the real-time analysis of the
interconnect mediums.
Conversely, in fully migrative and clustered approaches it is impossible to predict which
functionality will reside on which core at runtime. This implies that the communication patterns
among functionalities are unpredictable and impossible to analyse at design-time. For example, any
two communicating functionalities may at some point during runtime share the same core, but later
migrate to two spatially very distant cores. Performing the real-time analysis of the interconnect
medium under these schemes is very hard, if not impossible.
The aforementioned discussion shows that the selection of the workload execution technique
has a broad effect and impacts the entire system, not just the cores. The benefits and limitations
of different techniques have been presented. One of the goals of this dissertation is to propose
a novel workload execution concept, that will make many-cores more amenable to the real-time
analysis. Specifically, such a concept should exploit the full flexibility of the many-core platform
by allowing workload migrations, like the fully migrative and clustered approaches, but also be
scalable and predictable, like the non-migrative approaches. For that purpose, a novel concept
called the Limited Migrative Model is proposed in Chapter 3.
1.5.1.6 Assumptions Regarding Processing Elements
The platform under consideration Ψ contains a set of processing elements Π, which is a collection
of z identical cores Π = {π1, π2, ..., πz−1, πz}. The value of z is not fixed, but it can be in the
range 50−150, which corresponds to the number of processors in currently available many-core
platforms, e.g. [3, 42, 93]. The number of cores is kept parametrised rather than constant because
this makes it possible to test the scalability potential of the concepts proposed in this
dissertation simply by varying the parameter z.
In this dissertation, the focus is on homogeneous many-core platforms.
1.5.2 Interconnect Medium
As soon as the idea to integrate multiple cores within the same platform emerged, the most
important question was how to efficiently interconnect them. In the multi-core domain that problem
is usually solved by providing direct point-to-point communication between all core-pairs. An
illustrative example of point-to-point communication is given in Figure 1.5, where a data packet
is sent from the core π1 to the core π2. Notice that this concept is not scalable, because the
number of necessary communication links grows quadratically with the number of cores, i.e. for
z cores, z·(z−1)/2 bidirectional links are needed in total. This is another major difference between
the multi-core and the many-core platforms: for the former, the point-to-point communication
is an efficient approach, while for the latter an alternative method is needed [26].
[Figure: a packet sent directly from the core π1 to the core π2]
Figure 1.5: Point-to-point communication
1.5.2.1 Network-on-Chip Paradigm
The quest for scalable interconnect mediums was highly popular around 15 years ago. Eventually,
the research efforts paid off, and a new interconnect paradigm emerged. It is called the Network-
on-Chip, or simply NoC [12, 26, 37]. The NoC paradigm abstracts away the communication
medium from the communicating entities. In other words, the functionalities that exchange data
packets are totally agnostic about the transfer process. This is possible with the additional logic
that is implemented within the NoC elements called the routers. Routers serve as a network
interface to the cores, and cores access them in order to send/receive data. Although there are
routers that are accessed by multiple cores, e.g. [42], usually in many-core platforms there is one
router per core, e.g. [3, 93]. Thus, the data transfer between two functionalities that are located
on different cores is performed in the following way: (i) the packet is sent by the sender’s core
to the local router, (ii) the packet is transferred through the network of routers until it reaches the
destination router, and (iii) the packet is sent by the receiver’s router to the local core.
[Figure: a packet travelling from the core π1, through its router ρ1 and the network, to the router ρ2 of the core π2]
Figure 1.6: Network-on-Chip communication
It is worth mentioning that not all pairs of routers have a direct link between each other. In fact,
in the majority of NoCs, each router is connected only with a small set of neighbouring routers.
In such cases, a packet may need to traverse multiple routers before reaching the destination. An
illustrative example of the communication over the NoC architecture is given in Figure 1.6, where
the data packet is exchanged between the two cores π1 and π2. The shaded rectangles ρ1 and ρ2
depict the routers belonging to π1 and π2, respectively. Notice that ρ1 and ρ2 do not have a direct
link connection, so the packet needs to traverse one or more intermediate routers before reaching
ρ2. For better clarity, the intermediate routers were omitted from the figure.
In this dissertation, the focus is on the NoC-based many-core platforms.
1.5.2.2 Topologies
How to efficiently arrange the routers inside the NoC architecture and how to organise the inter-
router connections is one of the most important questions in the NoC design theory. This is un-
derstandable, as the selection of the topology has a significant impact on numerous aspects, such
as the system performance, the design complexity, the design cost, etc. Some NoC topologies
are depicted in Figure 1.7. For clarity purposes, a single line was used to represent the connec-
tion between all directly connected routers, whereas the connection is typically implemented with
two unidirectional links of opposite directions. Notice, that the fully connected topology (Fig-
ure 1.7(d)) is very similar to the point-to-point approach.
To better understand the differences between topologies, a simple comparison is performed
between the 2-D mesh topology (Figure 1.7(c)) and the fully connected one. In terms of required
resources, the 2-D mesh design is much more scalable, as illustrated with the following numerical
example. Assuming a platform with 100 cores, the fully connected design requires in total
(100·99)/2 = 4950 bidirectional links, while the 10×10 2-D mesh requires only 2·10·9 = 180
bidirectional links, which is less than 4% of the number of links required by the former
design. However, it is also worth mentioning that in the fully connected approach all packets
traverse only two routers, irrespective of the locations of the sender and the receiver, while on a
2-D mesh platform that is not always the case. Specifically, on a 10×10 2-D mesh, a packet that is
exchanged by two diagonally placed corner routers must traverse 10+9 = 19 routers. This implies
that, as pointed out in the beginning of this section, different topologies have significantly different
properties, and the preference of one topology over another highly depends on the purpose of the
system. An interested reader may consult the work of Abba and Lee [1], where a comprehensive
comparison of numerous different NoC topologies is performed.
[Figure panels: (a) Star, (b) Ring, (c) 2-D Mesh, (d) Fully Connected]
Figure 1.7: Different NoC topologies
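The numerical comparison above can be reproduced with a few lines of code. The following is a minimal sketch. The function names and the (x, y) coordinates used to identify routers are assumptions made for illustration only.

```python
def fully_connected_links(z):
    """Bidirectional links needed to connect every pair of z routers."""
    return z * (z - 1) // 2


def mesh_links(cols, rows):
    """Bidirectional links in a cols x rows 2-D mesh."""
    horizontal = (cols - 1) * rows   # links inside each row
    vertical = cols * (rows - 1)     # links inside each column
    return horizontal + vertical


def mesh_routers_traversed(src, dst):
    """Routers visited on a minimal path between two (x, y) positions."""
    (sx, sy), (dx, dy) = src, dst
    return abs(dx - sx) + abs(dy - sy) + 1


full = fully_connected_links(100)                           # 4950 links
mesh = mesh_links(10, 10)                                   # 180 links
ratio = mesh / full                                         # below 0.04
corner_to_corner = mesh_routers_traversed((0, 0), (9, 9))   # 19 routers
```

Varying the arguments shows how quickly the gap between the two designs widens as the number of cores grows.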
Despite the variety of possible topologies, only a few of them have so far been used as
interconnect mediums in many-cores. For example, the 2-D mesh is one of the most popular choices,
and it is present in some currently available many-cores, e.g. [3, 42, 93]. Additionally, the ring
topology is used in the Xeon Phi family of processors [43], manufactured by Intel, while
the MPPA-256 Manycore Processor developed by Kalray [44] uses a 2-D torus topology.
In this dissertation, the focus is on the 2-D Mesh NoC interconnect medium.
1.5.2.3 Routing
As already described in the previous section, in many topologies, including the 2-D mesh, not
all router-pairs are directly connected. Therefore, depending on the positions
of the sender and the receiver, a packet may need to travel across multiple intermediate links and
routers. The set of traversed network elements (routers and links) is in the literature called the
packet route, while the number of traversed links is usually referred to as the cardinality of the
path, or the number of hops. An illustrative example of one packet transfer is given in Figure 1.8. A
packet is sent from the core π1 to the core π4. During its transfer, the packet traverses the links
λc,1, λ1,2, λ2,3, λ3,4, λ4,c, and the routers ρ1, ρ2, ρ3, ρ4, which jointly constitute a route of 5 hops.
Note that, for clarity purposes, the cores belonging to ρ2 and ρ3 have been omitted from the figure.
[Figure: a packet travelling from the core π1 to the core π4 over the links λc,1, λ1,2, λ2,3, λ3,4, λ4,c and the routers ρ1, ρ2, ρ3, ρ4]
Figure 1.8: Packet transfer over NoC
The process of transferring the packet from its source to its destination is called the routing, and
that action is the responsibility of the routers. Once a packet reaches the router, the router decides
in which direction the packet will be forwarded. The logic inside the router that is responsible for
making this decision is called the routing algorithm. Evidently, there are numerous criteria based
on which the routing decisions can be derived. For instance, one option is to minimise the path,
and hence derive routing decisions such that the packet always traverses the minimal possible
number of hops. This class of routing algorithms is called the minimal routing. Moreover, if
the packets between the same source and the same destination are always routed across the same
path, then it is referred to as the deterministic routing. Alternatively, the routing decisions can
be made at runtime, for example, based on the status and the load of individual links. Such
techniques are called the adaptive routing. The adaptive routing can improve the performance of
the system (the average case behaviour), however, at the expense of the predictability. Conversely,
the deterministic routing is predictable and much easier to implement, but may cause an inefficient
utilisation of the NoC resources, where some links are heavily congested while other
links are completely idle.
The selection of the routing mechanism depends on the purpose of the system. As already
mentioned, in the real-time embedded domain the predictability of the system is essential, because
it allows the temporal behaviour of the system to be analysed with significantly less pessimism.
Thus, in the real-time domain, the deterministic routing techniques are the preferable option.
Perhaps surprisingly, this coincides with the development trends of many-core platforms, because
deterministic routing techniques are indeed the most common choice in the currently available
many-cores, e.g. [3, 42, 93]. However, this coincidence did not arise from an intention of chip
manufacturers to enforce predictability, but solely because the deterministic routing is much simpler
to implement. Note that this is a very rare case where the trends in the development of many-core
platforms suit the real-time requirements.
One class of popular minimal deterministic routing algorithms in 2-D mesh NoCs is the
dimension-ordered routing. Under these schemes, the packets are first routed along one
dimension of the NoC and, after reaching the coordinate of the destination, continue the transfer
along the other dimension, if needed. One of the most popular routing algorithms of this class
is the X-Y routing, where the horizontal axis of the platform is usually denoted with the letter X,
while the vertical axis is denoted with the letter Y. The X-Y routing policy is deadlock-free and
livelock-free [40].
[Figure panels: (a) X-Y routing, (b) Y-X routing; axes: X-axis, Y-axis]
Figure 1.9: Minimal dimension-ordered routing algorithms
Figure 1.9 illustrates several packets routed with the X-Y routing algorithm (Figure 1.9(a))
and with the Y-X routing algorithm (Figure 1.9(b)). For clarity purposes, the links
were omitted from the figures. The X-Y routing is one of the most popular routing algorithms in
currently available many-cores, e.g. [3, 42, 93].
In this dissertation, the focus is on the X-Y routing algorithm.
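The X-Y routing policy described above can be sketched as follows. This is an illustrative sketch, not the routing logic of any particular platform: routers are identified by assumed (x, y) mesh coordinates, and the function names xy_route and hops are hypothetical. The hop count follows the convention of Figure 1.8, where the two links between the cores and their local routers are counted as well.

```python
def xy_route(src, dst):
    """Routers visited by a packet under X-Y routing, as (x, y) tuples."""
    x, y = src
    route = [(x, y)]
    # First travel along the X dimension until the destination column is reached.
    step_x = 1 if dst[0] > x else -1
    while x != dst[0]:
        x += step_x
        route.append((x, y))
    # Then, if needed, travel along the Y dimension.
    step_y = 1 if dst[1] > y else -1
    while y != dst[1]:
        y += step_y
        route.append((x, y))
    return route


def hops(route):
    """Traversed links: router-to-router links plus the two core links."""
    return (len(route) - 1) + 2


example_route = xy_route((0, 0), (2, 1))
four_router_hops = hops(xy_route((0, 0), (3, 0)))
```

Because the route depends only on the source and the destination, two packets between the same pair of routers always follow the same path, which is the deterministic property discussed earlier. A route across four routers in a row yields 5 hops, matching the example of Figure 1.8.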
1.5.2.4 Switching
When the NoC resources are free, a packet alternately traverses links and routers on its path
towards the destination. For example, the packet illustrated in Figure 1.8 traverses NoC elements
in the following order: λc,1 → ρ1 → λ1,2 → ρ2 → λ2,3 → ρ3 → λ3,4 → ρ4 → λ4,c. However,
in the presence of other traffic, it may happen that one of the links on the path of a packet is
busy transferring some other packet. In such cases, the former packet is stalled and it is stored
inside the router element called the port. Usually, the NoC router contains multiple ports, each
dedicated to the traffic to/from a specific link. A simplified illustration of the 2-D mesh NoC
router architecture is given in Figure 1.10. Towards each neighbouring router with which it has
a direct link connection (e.g. north, south, east and west direction), a router may have: (i) only
an input port, (ii) only an output port, or (iii) both an input and an output port. Input and output
ports are used to store incoming and outgoing packets in a given direction, respectively. Inside the
router, the switch crossbar is used to transfer the packets between different input and output ports.
Note that the router may have two more ports, namely the core input port and the core output port.
These ports are used to exchange the data with the local core. The core ports and some other router
elements (e.g. the control unit, the arbitration unit) have been omitted in Figure 1.10 for clarity
purposes.
[Figure: a crossbar switch connecting the input and output ports of the north, south, east and west directions]
Figure 1.10: Simplified 2-D mesh NoC router architecture
In the earliest router designs, one of the most popular choices was to construct the ports with
the capacity to store entire packets. This concept makes it possible to implement a data transfer
technique where each packet traverses NoC elements in a sequential manner. In other words, a packet
is first entirely transferred between two adjacent routers, and only then does it continue its progress.
This approach is called the store-and-forward switching [92]. An example of this data transfer
technique is depicted in Figure 1.11, where a data packet of 5 bytes is being sent from the core π1
to the core π4.
[Figure panels, packet 12345 travelling from π1 to π4: (a) Packet in the first router, (b) Packet in the second router, (c) Packet in the third router]
Figure 1.11: Store-and-forward switching
One of the greatest limitations of the store-and-forward switching is the port capacity require-
ment. Evidently, this technique can be applied only in cases where the port capacity is sufficient
to store entire packets. However, over the years, the demands for more sophisticated functional-
ities implemented on many-core platforms also required more frequent data transfers and larger
and larger packets. As the buffering within ports became a challenge, the focus shifted towards
alternative switching techniques.
The wormhole switching [67] is a data transfer technique that is conceptually the opposite of
the store-and-forward switching. Specifically, prior to sending, a data packet is divided into small
elements of fixed size, called flits. Usually the flit size is set to be equal to the link width, which
is in the range of 1 to 64 bytes in currently available many-cores. A flit is the basic transferable
unit across the NoC and it is indivisible. The first flit of the packet is called the header flit, while
the last flit of the packet is referred to as the tail flit. Once the packet is divided into flits, the
transfer commences. The flits are released into the NoC in an orderly fashion: the header flit is
injected first and the tail flit last. The header flit establishes the path, and the rest of the flits
follow in a pipeline manner. Figure 1.12 illustrates the wormhole switching, where a 5-byte
packet is transferred between the cores π1 and π4, while the size of the flit in this example is 1
byte.
[Figure panels, flits travelling from π1 to π4: (a) Flits 1-2 in the NoC, (b) Flits 1-4 in the NoC, (c) Flits 3-5 in the NoC]
Figure 1.12: Wormhole switching
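The division of a packet into flits can be sketched as follows. The helper name to_flits is hypothetical. Padding the last flit with zero bytes, so that all flits have the same fixed size, is an assumption of this sketch rather than a property stated for any particular NoC.

```python
def to_flits(packet: bytes, flit_size: int):
    """Split a packet into fixed-size flits, in the order of injection."""
    if flit_size <= 0:
        raise ValueError("flit_size must be positive")
    flits = []
    for start in range(0, len(packet), flit_size):
        chunk = packet[start:start + flit_size]
        # Assumption: the last flit is zero-padded up to the fixed flit size.
        flits.append(chunk.ljust(flit_size, b"\x00"))
    return flits


# The 5-byte packet of Figure 1.12 with a flit size of 1 byte.
flits = to_flits(b"12345", 1)
header_flit, tail_flit = flits[0], flits[-1]
```

With a flit size of 1 byte the example packet yields five flits, where the flit holding byte 1 is the header flit and the flit holding byte 5 is the tail flit.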
One benefit of this scheme is that the flits are allowed to travel in parallel, thus the throughput
significantly increases [48]. Moreover, the buffering requirements are considerably reduced, and it
is no longer necessary to have ports with capacities to store entire packets. In fact,
wormhole-switched NoCs can successfully operate with ports whose capacity is equal to the size of
only 1 flit. However, the storage buffers inside the ports of currently available many-cores are
usually larger than 1 flit. This is especially helpful in scenarios where the header flit is blocked
due to a busy link: the rest of the flits do not need to stall in their current ports, but can continue
their progress through the intermediate routers, and group themselves in the ports as close to the
busy link as possible. At present, the wormhole switching technique is one of the predominant
data transfer choices in many-core platforms, e.g. [42, 44, 93].
In order to avoid the overflow of port buffers, a mechanism has to be implemented to
prevent new flits from entering routers whose ports are already full of other flits. One of the
most popular techniques is called the credit-based flow control, and it works as follows. For each
empty space to store one flit, a port has one credit. When a flit enters the port, the credit is
consumed and sent to the previous port. Indeed, the credits always travel in the opposite direction
of the flits, as illustrated in Figure 1.13. Note that flits may progress only if there are available
credits in the receiving port.
[Figure: flits 1234 travelling in the data direction, credits travelling in the opposite direction]
Figure 1.13: Credit-based flow control mechanism
Notice that the number of credits directly influences the number of flits that can be concurrently
stored inside the port. As already stated, the wormhole switching mechanism can successfully
work even if each port has only a single credit (flit-sized buffer). However, the chip designers
usually dimension the port buffers and assign the credits in such a way that multiple flits can be
concurrently stored.
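The role of credits can be illustrated with a small software model. It is a simplified sketch and not a description of real router hardware: the class name Port, the function step and the chosen capacities are hypothetical, and the model assumes that a credit becomes available again as soon as a flit leaves the buffer.

```python
from collections import deque


class Port:
    """Port with a flit buffer guarded by credits."""

    def __init__(self, capacity):
        self.credits = capacity      # one credit per empty flit slot
        self.buffer = deque()

    def can_accept(self):
        return self.credits > 0

    def push(self, flit):
        if not self.can_accept():
            raise RuntimeError("no credit available: the flit must stall")
        self.credits -= 1            # the slot is now occupied
        self.buffer.append(flit)

    def pop(self):
        flit = self.buffer.popleft()
        self.credits += 1            # assumption: the freed slot is a credit again
        return flit


def step(upstream, downstream):
    """Move one flit downstream only if the receiving port has a credit."""
    if upstream.buffer and downstream.can_accept():
        downstream.push(upstream.pop())
        return True
    return False


upstream, downstream = Port(4), Port(1)
for flit in ["1", "2", "3", "4"]:
    upstream.push(flit)
moved_first = step(upstream, downstream)    # the single credit is consumed
moved_second = step(upstream, downstream)   # no credit left: the flit stalls
delivered = downstream.pop()                # the slot is freed again
moved_third = step(upstream, downstream)    # the next flit may progress
```

With a single credit in the receiving port, the flits still progress one by one, which mirrors the observation that wormhole switching can operate with flit-sized buffers.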
The credit-based scheme is not the only flow control mechanism for wormhole-switched NoCs. For
example, one alternative scheme is the back suction [28]. However, this technique requires
additional hardware support in the form of virtual channels, which will be explained later.
In this dissertation, the focus is on the wormhole switching technique with the
credit-based flow control mechanism.
1.5.2.5 Arbitration
In the previous sections, the transfer of a single packet over the NoC has been analysed. However,
it rarely happens at runtime that all the routers and the links on the path of a traversing packet are
free. In fact, due to the numerous functionalities existing within many-core platforms, the amount
of generated traffic can cause significant contentions inside the NoC medium, where two or more
packets concurrently try to access the same resource, e.g. a link.
Several approaches have been proposed with the same fundamental idea: to avoid traffic
contentions. These methods are referred to as the contentionless approaches. One way to achieve
this is by pre-allocating time slots at which packets may traverse, e.g. [36, 63, 77]. Note that this
strategy is very similar to the time-division-multiple-access method, a popular access technique
for shared mediums in networks. Another possibility is to pre-allocate an entire path of the packet
prior to sending, and subsequently transfer it without contentions. The pre-allocated path is called
the virtual circuit, and one implementation that exploits this concept is the MANGO NoC [14].
Conversely, the contention-aware approaches are the approaches in which the contentions among
traffic packets are allowed. In the rest of this section the focus is on such NoCs.
Note that having a common router is only a necessary, but not a sufficient condition for two
packets to contend, while having a common link is both a necessary and a sufficient condition for
the contention. This is explained with the illustrative examples given in Figures 1.14 and 1.15. For
clarity purposes only the routers and the packets are depicted, however, recall that between each
pair of directly connected routers two unidirectional links exist. First, consider the example given
in Figure 1.14(a), where 4 packets exist, namely p1, p2, p3 and p4. Notice, that p1 and p2 have 4
common routers and no common links. Similarly, p3 and p4 have 3 common routers but no com-
mon links. Thus, according to the aforementioned explanation, there are no contentions among
these packets. To understand that this is indeed the case, consider Figure 1.14(b), which depicts
the router that is traversed by p3 and p4 (the same router is in Figure 1.14(a) emphasised with
a darker color). Indeed, it is visible that these two packets require different resources and hence
can concurrently traverse the router without interfering with each other. A similar conclusion can
be reached for the rest of the routers in Figure 1.14(a), which are traversed by two packets. This
confirms the first part of the statement, that having a common router is not a sufficient condition
for the contention between packets.
[Figure panels: (a) Packets p1, p2, p3 and p4; (b) Common Router of p3 and p4]
Figure 1.14: Packet traversals without contentions
Now, consider the illustrative example given in Figure 1.15(a). Again, 4 packets exist: p1, p2, p3
and p4. However, in this example p2, besides sharing routers, also shares at least one common
link with each of the remaining packets. On the other hand, the remaining packets p1, p3 and p4
share neither routers nor links with any packet other than p2. Figure 1.15(b) depicts
the router which is common to p2 and p3 (the same router is emphasised with a darker color in
Figure 1.15(a)). It is visible that these two packets contend for the common resource (the south
output port) and hence interfere with each other.
The aforementioned example demonstrates that the common link is indeed a necessary and a
sufficient condition for the contention between two or more packets. Since the common link re-
quires at least one common router, it can be concluded that the common router is only a necessary,
but not a sufficient condition for the contention.
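The two conditions can be checked mechanically once the routes are known. The following is a minimal sketch. It assumes that each route is given as an ordered list of hypothetical (x, y) router coordinates, and it models only the unidirectional router-to-router links, leaving the links between the cores and their local routers out.

```python
def directed_links(route):
    """Unidirectional router-to-router links used by a route."""
    return set(zip(route, route[1:]))


def share_router(route_a, route_b):
    """True if the two routes visit at least one common router."""
    return bool(set(route_a) & set(route_b))


def contend(route_a, route_b):
    """True if the two routes use at least one common unidirectional link."""
    return bool(directed_links(route_a) & directed_links(route_b))


# Two packets crossing the same routers in opposite directions.
eastbound = [(0, 0), (1, 0), (2, 0)]
westbound = [(2, 0), (1, 0), (0, 0)]
# A third packet that joins the eastbound packet on the link (1, 0) -> (2, 0).
joining = [(1, 1), (1, 0), (2, 0)]

opposite_share_router = share_router(eastbound, westbound)
opposite_contend = contend(eastbound, westbound)
joining_contend = contend(eastbound, joining)
```

The two packets travelling in opposite directions share all three routers but use different unidirectional links, so they do not contend, whereas the third packet shares a link with the eastbound one and therefore contends with it.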
When contentions do occur, there exists a policy that decides which packet has the precedence
among all packets contending for the same output port (and hence the link connected to it). Only
the packet with the precedence will be able to progress, while the rest of the contending packets
will be stalled. The policy that decides which packet has the precedence is called the arbitration
policy. One of the most commonly used arbitration policies is called the round-robin, and it is
explained with the following example.
Figure 1.15: Packet traversals with contentions. Panels: (a) Packets; (b) Common Router of p2 and p3.
Assume that packets p1, p2, p3 and p4 enter the common router from the input ports west,
east, north and core respectively, as illustrated in Figure 1.16. Moreover, consider that all the
aforementioned packets need to progress in the south direction, hence try to access the south
output port of the depicted router. Evidently, these packets contend, so only one of them can
progress at any time instance, while the other packets will be stalled.
Figure 1.16: Contending packets
As already described, the arbitration policy decides which packet has the precedence among all
contending packets. The round-robin arbitration policy changes the precedence of packets in an
orderly fashion, based on the input port from which they entered the router. The change occurs
after each transfer, that is, if p1 had the precedence to progress, another packet from the west
input port, say p′1, will not be able to progress until p2, p3 and p4 have progressed. However, p′1
will have the precedence over the other packets from the input ports of p2, p3 and p4, say p′2, p′3
and p′4, respectively. Thus, one round-robin arbiter, applied to the aforementioned example, might
give the precedence to the packets in the following order: p1, p2, p3, p4, p′1, p′2, p′3, p′4. Note
that other round-robin implementations are also possible, and that the only fundamental property
of this scheme can be summarised as follows: any packet, coming from the input port of the last
transferred packet, will not be able to progress, until the packets from all other input ports have
been considered. Also note that if there are no packets coming from a certain port, that port is
immediately skipped, and the packets from the next port are considered. The round-robin arbitration
policy assures fairness among packets and prevents starvation, which makes it a very popular
choice in the currently available many-cores, e.g. [3, 93].
In this dissertation, the round-robin arbitration policy is analysed.
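One possible round-robin arbiter for a single output port can be sketched as follows. This is a minimal illustration under assumptions, not the arbiter of any particular platform: the function name round_robin_order, the cyclic port order and the queue contents are hypothetical, and precedence is granted per whole packet transfer. A primed name such as p1' stands for the second packet arriving at the same input port as p1.

```python
from collections import deque


def round_robin_order(queues, ports):
    """Return the order in which waiting packets are granted one output port.

    queues: dict mapping input port name -> deque of waiting packets.
    ports:  fixed cyclic order in which the input ports are considered.
    """
    order = []
    nxt = 0  # index of the input port that is considered first
    while any(queues[p] for p in ports):
        for step in range(len(ports)):
            idx = (nxt + step) % len(ports)
            port = ports[idx]
            if queues[port]:  # a port with no waiting packet is skipped
                order.append(queues[port].popleft())
                # The port that was just served is considered last from now on.
                nxt = (idx + 1) % len(ports)
                break
    return order


ports = ["west", "east", "north", "core"]
queues = {
    "west": deque(["p1", "p1'"]),
    "east": deque(["p2", "p2'"]),
    "north": deque(["p3", "p3'"]),
    "core": deque(["p4", "p4'"]),
}
print(round_robin_order(queues, ports))
# ['p1', 'p2', 'p3', 'p4', "p1'", "p2'", "p3'", "p4'"]
```

The sketch reproduces the order given in the text and exhibits the fundamental property: after a packet from some input port is transferred, that port is not served again until all other ports have been considered.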
1.5.2.6 Worst-Case Real-Time Analyses Applicable to Round-Robin-Arbitrated NoCs
In the early works related to the timing analysis of wormhole-switched networks, e.g. [23, 29], the
researchers attempted to apply the concepts from the queuing theory. The focus of these studies
is on the performance (the average-case behaviour), hence they are not suitable for the hard real-time
domain. Moreover, there was an attempt to employ the concepts of the network calculus theory [54]
in the worst-case analysis of wormhole-switched networks, assuming a weighted round-robin
arbitration policy [79]. However, one significant limitation of this approach is that it is overly
pessimistic [33]. An additional element called the wormhole section was introduced [34], with
the primary objective to decrease the pessimism of the worst-case analysis, which is based on the
network calculus theory. Yet, this approach targets SpaceWire networks, which are small-scale
interconnects with few links and flows, and it faces challenges, such as scalability and pessimism,
when applied to larger networks with more flows, like NoCs for many-cores [35]. Moreover, the
approach for the worst-case analysis called the recursive calculus method [32] was proposed, also
for SpaceWire networks. This method is scalable, and hence can be applied to NoCs. However,
its greatest limitation is that it does not take into account the inter-arrival times of packets and, as
comprehensively covered in Section 2, it can be very pessimistic. So far, there is a lack of adequate
techniques for the worst-case analysis of round-robin-arbitrated wormhole-switched NoCs,
and one of the objectives of this dissertation is to propose a novel method to compute safe and
tight upper-bound estimates on the worst-case delays of individual traffic packets.
1.5.2.7 Indirect Contentions
As already explained, the main advantages of the wormhole switching technique are good through-
put and small buffering requirements [48]. These benefits come from the fact that flits of a packet
may traverse in parallel. However, concurrent use of multiple resources (i.e. routers and links on
the path of the packet), may, at the same time, cause very complex contention scenarios. Specif-
ically, a traversing packet can suffer interference not only from the packets with which it shares
some resources, but also from other packets with which it does not. This is explained with an
illustrative example given in Figure 1.17.
As is visible from the figure, there exist 3 packets, p1, p2 and p3. Each packet has 8 flits,
and each port has the capacity to accept 2 flits. The ports of interest are depicted outside of their
respective routers. Packets p1 and p2 share 2 routers and 1 link, which is both a necessary and a
sufficient condition for the contention between them. The same is true for p2 and p3. However,
p1 and p3 do not have common resources, and hence they do not contend. Consider that all
packets start traversing at the same time (Figure 1.17(a)). When the flits of p1 reach the router
that is common for p1 and p2, they cannot progress further, because the output port of that router,
and its respective link, are busy transferring the flits of p2. Therefore, the further progress of
p1 is impossible, and it is said that p1 suffers blocking from p2 (Figure 1.17(b)). Because the
links on their paths are still free, p2 and p3 uninterruptedly progress (Figure 1.17(c)). However,
when the flits of p2 reach the router which is common for p2 and p3, the blocking occurs again
Figure 1.17: Indirect contentions. Panels: (a) p1, p2, p3 start traversing; (b) p1 blocked by p2; (c) p1 still blocked by p2; (d) p2 blocked by p3; (e) p2 still blocked by p3; (f) p2 continues traversing; (g) p1 still blocked by p2; (h) p1 continues traversing; (i) no more blocking.
(Figure 1.17(d)). Notice that p1 is blocked by p2, which is in turn blocked by p3. This implies that,
due to the traversal of p3, the progress of p1 will be delayed even more, despite the fact that p1 and
p3 do not have common resources. In other words, p3 indirectly blocks p1. The indirect blocking
will be revisited at the end of the example.
p1 and p2 remain blocked until the last flit of p3 leaves the common router of p2 and p3 (Fig-
ure 1.17(e)). When the last flit leaves the router, p2 continues traversing (Figure 1.17(f)). Similarly,
p1 remains blocked as long as there is at least one flit of p2 in their common router (Figure 1.17(g)).
After the last flit of p2 leaves the common router, p1 continues traversing (Figure 1.17(h)). Finally,
all packets finish their traversals uninterrupted (Figure 1.17(i)).
The previous example demonstrated that a packet may suffer the delay from packets with
which it contends directly, but may also suffer the delay from packets with which it contends
indirectly (no common resources). The indirect contentions can significantly affect the timing
properties of packets. This is demonstrated with a simple numerical computation for the afore-
mentioned example. For the sake of simplicity, consider that the traversal of each flit between each
pair of adjacent routers takes 1 time unit. The time instances at which p1, p2 and p3 complete their
traversals can be easily deduced from the illustrated example, and are 21, 16 and 9, respectively.
However, if p3 did not exist, the traversal of p1 would be completed at time instance 16. Thus, the
existence of p3 caused the additional delay of 5 time units to the traversal of p1. From this example
it is obvious that, in order to perform the worst-case analysis of NoC traffic, it is necessary to take
into account both direct and indirect contentions.
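As a rough sketch, the sets of direct and indirect contenders of a packet can be derived from the packet paths alone. The code below assumes that two packets contend directly when they share a link, and it treats every packet reachable through a chain of direct contentions as a candidate for indirect contention. This is a conservative simplification introduced here for illustration, and the paths and helper names are hypothetical.

```python
def links(path):
    """Directed links (ordered router pairs) along a path of router ids."""
    return set(zip(path, path[1:]))


def direct_contenders(paths):
    """Map each packet to the packets that share at least one link with it."""
    names = list(paths)
    direct = {n: set() for n in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if links(paths[a]) & links(paths[b]):
                direct[a].add(b)
                direct[b].add(a)
    return direct


def indirect_contenders(paths, packet):
    """Packets reachable via a chain of direct contentions, minus the direct ones."""
    direct = direct_contenders(paths)
    seen, stack = {packet}, [packet]
    while stack:
        for neighbour in direct[stack.pop()]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen - {packet} - direct[packet]


# Hypothetical paths on a 3x3 mesh with routers numbered 1..9 row by row.
paths = {"p1": [1, 2, 3], "p2": [2, 3, 6, 9], "p3": [5, 6, 9]}
print(direct_contenders(paths)["p1"])    # {'p2'}
print(indirect_contenders(paths, "p1"))  # {'p3'}

# Completion times of p1 quoted in the text, with and without p3.
print(21 - 16)  # 5 additional time units caused by the indirect contention
```

As in the example of Figure 1.17, p1 and p3 share no resources, yet p3 appears among the packets that can delay p1.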
1.5.2.8 Virtual Channels
The effects of indirect contentions can be, to some extent, mitigated by enforcing a rule on the
usage of port-buffers. Consider the example introduced in the previous section, but with one
additional rule, that each packet may store at most 1 flit in each port-buffer along its path. In other
words, inside each port-buffer, each packet has a private, dedicated slot of the size of 1 flit. This
is illustrated in Figure 1.18, where the slots of different packets are depicted with different colors.
Moreover, for clarity purposes, only the slots of interest are shown, which implies that for the
routers that are traversed by only one packet, only a single slot is shown (the corner routers in this
example).
Consider again the scenario where p1, p2 and p3 are released at the same time (Figure 1.18(a)).
Like in the previous example, p1 is blocked when its flits reach the router that is common for p1 and
p2 (Figure 1.18(b)). The links on the paths of p2 and p3 are free, so these packets uninterruptedly
progress (Figure 1.18(c)). When the flits of p2 reach the router that is common for p2 and p3, the
blocking occurs again (Figure 1.18(d)). Recall, that in the previous example this was the moment
when the indirect contention between p1 and p3 occurred, when only p3 could progress, while p1
and p2 were blocked. However, due to the existence of per-packet slots inside port buffers, this
problem can be mitigated. Specifically, once p2 becomes blocked, p1 can continue its traversal
because all resources on its path are free. Therefore, the moment when p2 becomes blocked by p3
is the moment when p1 continues traversing (Figure 1.18(d)).
The situation remains the same until the moment when the last flit of p3 leaves the router that
is common for p2 and p3 (Figure 1.18(e)). When that last flit leaves the router, p2 can continue
its traversal. At the same time the traversal of p2 causes another blocking to p1 (Figure 1.18(f)).
p1 remains blocked until the last flit of p2 leaves the router that is common for p1 and p2 (Fig-
ure 1.18(g)). When that happens, p1 can continue its progress (Figure 1.18(h)). Finally, all packets
finish their traversals without further contentions (Figure 1.18(i)).
The rule of allowing each packet to consume only 1 slot in each router it traverses can signifi-
cantly improve the throughput, as demonstrated here. Consider again that the traversal of each flit
between two adjacent routers takes 1 time unit. For the example illustrated in Figure 1.18 it can
be deduced that p1, p2 and p3 complete their traversals at time instances 17, 16 and 9, respectively.
Notice that the completion times of p2 and p3 remained the same as in the previous example, while
Figure 1.18: Indirect interference with virtual channels. Panels: (a) p1, p2, p3 start traversing; (b) p1 blocked by p2; (c) p1 still blocked by p2; (d) p2 blocked by p3, p1 continues traversing; (e) p2 still blocked by p3, p1 still traversing; (f) p2 continues traversing, p1 blocked by p2; (g) p1 still blocked by p2; (h) p1 continues traversing; (i) no more blocking.
the completion time of p1 decreased from 21 to 17. This is expected, because the newly introduced
rule of dedicated per-packet flit-sized slots within each port indeed reduces the effects of indirect
contentions. However, the aforementioned rule does not entirely remove the effects of indirect
contentions; it only makes them milder. This will receive additional attention in Chapter 2.
This possibility to improve the performance of wormhole-switched networks via dedicated
slots was recognised shortly after the wormhole switching mechanism was introduced. In
the literature, this strategy is called virtual channels [24, 25], and some currently available
many-cores support this concept, e.g. [42]. If a packet has a dedicated slot in each router along its
path, it is said that the packet has a virtual channel. A reader may notice that, counter-intuitively,
the expression virtual channel does not refer to the links, but to the slots in port-buffers.
As demonstrated with the examples in Figure 1.17 and Figure 1.18, virtual channels allow for
more efficient use of available resources. That is, in the former case, when p2 was blocked by p3,
p1 could not use the links along its path, despite the fact that all the links were idle. Conversely,
virtual channels always allow available links to be exploited, which was demonstrated in the latter
example by the traversal of p1 while p2 was blocked.
In this dissertation, the NoCs both with and without virtual channels are analysed.
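The difference between a port buffer with and without virtual channels can be illustrated with a deliberately simplified sketch. The two classes below are hypothetical and model only the question of whether a port buffer can accept a flit of a given packet; they ignore output port allocation, credits and timing.

```python
class SharedPortBuffer:
    """Port buffer without virtual channels: a single FIFO for all packets."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.flits = []  # owner packet of each stored flit, oldest first

    def can_accept(self, packet):
        # Any stored flit, whoever owns it, consumes the common capacity.
        return len(self.flits) < self.capacity

    def store(self, packet):
        if not self.can_accept(packet):
            raise RuntimeError("buffer full")
        self.flits.append(packet)


class VirtualChannelPortBuffer:
    """Port buffer with one dedicated flit-sized slot per packet."""

    def __init__(self, packets):
        self.occupied = {p: False for p in packets}

    def can_accept(self, packet):
        # Only the packet's own slot matters.
        return not self.occupied[packet]

    def store(self, packet):
        if not self.can_accept(packet):
            raise RuntimeError("slot occupied")
        self.occupied[packet] = True


# p2 is stalled further downstream, so its flits remain in the buffer.
shared = SharedPortBuffer(capacity=2)
shared.store("p2")
shared.store("p2")

with_vc = VirtualChannelPortBuffer(["p1", "p2"])
with_vc.store("p2")

print(shared.can_accept("p1"))   # False: p1 waits although the link is idle
print(with_vc.can_accept("p1"))  # True: p1 can use the idle link
```

In the shared buffer the stalled flits of p2 occupy the whole capacity, so p1 cannot advance. With a dedicated slot per packet, the stall of p2 does not prevent p1 from using the link.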
1.5.2.9 Flit-Level Preemptions and Priority-Preemptive Arbitration
Besides improving the throughput, the concept of virtual channels offers one more very beneficial
possibility – to enforce preemptions among packets. The preemption is a well-established term
in the scheduling theory, and it refers to the situation where the computation process of the lower
priority functionality on a core is temporarily suspended, so that the computation process of the
higher priority functionality can be performed. Similarly, in NoCs, the preemption is the scenario
where a lower priority packet is temporarily stalled, in order to allow the traversal of a higher
priority packet over the resources that these two packets have in common.
Note, in the example illustrated in Figure 1.18 the preemptions are implicitly assumed, and
one such scenario is visible in Figure 1.18(f). Due to the fact that p3 is not blocking p2 any more,
p2 continues its progress by reclaiming the resources (links) from p1, causing p1 to be blocked
again.
Figure 1.19: Packet preemptions. Panels: (a) p1 starts traversing; (b) p2 preempts p1; (c) p1 preempted by p2; (d) p1 still preempted by p2; (e) p1 continues traversing; (f) no more preemptions.
The concept of packet preemptions is explained in detail with the illustrative example given
in Figure 1.19. Consider two packets p1 and p2, where both packets consist of 8 flits, and where
the latter packet is considered to be of higher priority. Let p1 start traversing (Figure 1.19(a)).
At the moment when the first flit of p1 reaches the destination, the packet p2 is released
(Figure 1.19(b)). Due to the fact that p2 has a higher priority than p1, the former packet preempts the
latter, causing the third and the fourth flit of p1 to stall within their respective routers, while p2
progresses (Figure 1.19(c)). This scenario is referred to as the preemption, and p1 is called the
preempted packet, while p2 is called the preempting packet. The situation remains the same as
long as there are flits of p2 traversing the common link (Figure 1.19(d)). When the last flit of p2
traverses the common link, p1 can continue its transfer (Figure 1.19(e)). After that, p1 completes
its transfer and no more preemptions occur (Figure 1.19(f)).
Since the preemptions occur with the flit-level granularity, this concept is called the flit-level
preemptions [89]. Notice, that in order to be implemented, this concept requires an adequate
arbitration policy. That is, in Figure 1.19(b), when p2 was released, the arbitration mechanism
must compare the priorities of p1 and p2, and subsequently make a decision which packet will be
given the permission to progress. The arbitration policy which makes decisions based on packet
priorities is called the priority-preemptive arbitration.
In this dissertation, the priority-preemptive arbitration policy is analysed.
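A minimal sketch of priority-preemptive arbitration with flit-level granularity on one shared link is given below. It rests on several assumptions that are not taken from the text: each flit needs one time unit to cross the link, a smaller number denotes a higher priority, the release times are arbitrary, and the function name simulate_link is hypothetical.

```python
def simulate_link(packets):
    """Flit-level priority-preemptive arbitration of a single shared link.

    packets: dict name -> (release_time, priority, n_flits), where a smaller
             priority number means a higher priority (an assumption).
    Returns a dict name -> time at which the last flit has crossed the link.
    """
    remaining = {name: flits for name, (_, _, flits) in packets.items()}
    finish = {}
    t = 0
    while remaining:
        ready = [n for n in remaining if packets[n][0] <= t]
        if ready:
            # The arbiter compares priorities before every single flit transfer.
            winner = min(ready, key=lambda n: packets[n][1])
            remaining[winner] -= 1
            if remaining[winner] == 0:
                finish[winner] = t + 1
                del remaining[winner]
        t += 1
    return finish


# p1 starts at time 0; the higher-priority p2 is released at time 2.
packets = {"p1": (0, 2, 8), "p2": (2, 1, 8)}
print(simulate_link(packets))  # {'p2': 10, 'p1': 16}
```

Here p1 transfers two flits, is preempted for the eight flits of p2, and then transfers its remaining six flits. Without p2 it would have finished at time 8.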
1.5.2.10 Worst-Case Real-Time Analyses Applicable to Priority-Preemptive NoCs
Many researchers have studied the timing properties of priority-aware interconnects. Some of
these studies do not analyse the worst-case scenarios, e.g. [16, 38, 89], and hence are not applicable
to the hard real-time domain. Of studies that analyse the worst-case scenarios, some have been
classified as pessimistic [8, 49], while others have been classified as optimistic [59, 66]. The
limitations of the aforementioned approaches are covered in more detail in the dissertation of
Zheng Shi [83].
Later, a method for the worst-case analysis was proposed [85] with the following assumptions:
(i) flit-level preemptions, (ii) per-traffic-flow distinctive priorities, (iii) per-priority virtual
channels, and (iv) a traffic model with constrained deadlines. For this model, a priority assignment
algorithm for traffic flows was proposed [84]. For the same model, workload mapping
approaches based on genetic algorithms were proposed [62, 80]. Subsequently, the method for the
worst-case analysis was extended to cover the model with arbitrary traffic deadlines [88].
One limitation of the aforementioned concept is that the number of virtual channels should be
at least equal to the number of traffic-flows, which is a requirement that is not easy to fulfil, e.g. the
SCC platform [42] has only 8 virtual channels, and according to this method, it can accommodate
at most 8 traffic flows. As a solution to this problem, the novel concept called the priority share
policy and its accompanying analysis were introduced. This method reduces the total number of
required virtual channels by forcing some traffic flows to share the same channel [86]. Based on
the priority-share policy and its worst-case analysis, the application mapping process was proposed [87].
Note that both the aforementioned concepts require that each virtual channel has the
capacity to store at least one flit.
Very recently, a novel approach called the stage level analysis [46] was introduced. When com-
pared with the aforementioned approaches, this method indeed renders tighter estimates, however,
it has the same requirement regarding the number of virtual channels as the initial method [85].
Yet, the fundamental limitation of this approach is the fact that it requires unrealistically big port
buffers, where, in many cases, the capacity of virtual channels should be such that an entire packet
can be stored [45] (and not a single flit, like in the aforementioned approaches). Notice that this
requirement is very similar to the buffering requirements of the store-and-forward switching tech-
nique, which is the main reason why the wormhole-switching was introduced in the first place.
1.5.2.11 Discussion
The design choices and the evolution trends of NoCs are mostly driven by performance reasons,
usually at the expense of predictability. These strategies not only further increase the gap be-
tween the average-case and the worst-case behaviour of NoCs, but also make it increasingly hard
to perform any meaningful worst-case analysis. For example, some currently available NoCs
(e.g. [3, 93]) use a round-robin arbitration policy in routers, because it promotes fairness and
avoids starvation, both of which are beneficial from the performance perspective. However, the
round-robin arbitration policy displays a non-deterministic behaviour, where the precedence be-
tween two contending packets depends not only on the packets and their paths, but also on the
state of routers at the very moment of contention (i.e. which arbitration decisions immediately
preceded the moment of observation). Due to this fact, it is impossible to deduce the arbitration
decisions at design-time, which subsequently has to be reflected in the worst-case analysis with a
certain degree of pessimism. This further implies that the round-robin arbitration policy, despite
its wide presence in the currently available many-cores, may not be the most suitable option for
the real-time embedded domain.
The existence of virtual channels inside the NoC architecture (e.g. [42]) significantly changes
the perspective. Specifically, multiple virtual channels offer the possibility to introduce priority-
preemptive arbitration policies, and in that way make the NoC platform more predictable and
hence suitable for the worst-case analysis. For instance, assuming that packets have fixed priori-
ties, the one with the higher priority will always have the precedence over the other, irrespective
of the moment of observation. This implies that all arbitration decisions are known at design-time,
and consequently the worst-case analysis can be performed for priority-preemptive NoCs with
substantially less pessimism than the analysis for round-robin-arbitrated NoCs. It comes as no
surprise that the majority of the real-time works addressing NoCs indeed assume that the platform
provides virtual channels and priority-preemptive arbitration techniques.
In the next chapter the focus will be on improving over the state-of-the-art methods for both
types of platforms, with and without virtual channels. Moreover, assuming that multiple virtual
channels are available, a novel arbitration policy for routers will be proposed. These aspects are
covered in detail in Chapter 2.
1.5.2.12 Assumptions Regarding Interconnect Medium
Recall (Section 1.5.1.6), that the platform under consideration in this dissertation Ψ contains a set
of z processing elements Π = {π1, π2, ..., πz−1, πz}. Moreover, the platform contains an interconnect
medium η, which is a 2-D mesh NoC of dimensions x and y. The NoC contains a collection
of y · x routers ρ = {ρ1, ρ2, ..., ρy·x−1, ρy·x}, where the number of routers is equal to the number of
cores, i.e. y · x = z. Thus, one router corresponds to one core, like in Epiphany [3], or Tilera [93]
Figure 1.20: Assumed NoC platform (a mesh of routers ρ1, ..., ρyx with their local cores π; the detail shows router ρx with the links λc,x, λx,c, λx−1,x, λx,x−1, λx,2x and λ2x,x).
platforms. Figure 1.20 depicts the assumed NoC platform. Each router can have at most 10
ports, namely East Input, East Output, West Input, West Output, North Input, North Output, South
Input, South Output, Core Input and Core Output. A router uses core ports to exchange the data
with the local core, while the rest of the ports are used for the communication between the router
and its neighbouring routers. The number of ports inside a router depends on the position of the
router within the NoC. Central routers have all 10 ports (e.g. ρx+2 in Figure 1.20), routers on
the edges have 8 (e.g. ρ2 in Figure 1.20), whereas corner routers have only 6 ports (e.g. ρ1 in
Figure 1.20).
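The dependence of the number of ports on the router position can be sketched as follows. The sketch assumes that routers are numbered row by row, with ρ1 to ρx forming one row, as the router labels of Figure 1.20 suggest, and that both mesh dimensions are at least 2. The function name port_count is hypothetical.

```python
def port_count(i, x, y):
    """Number of ports of router rho_i in a mesh with x columns and y rows.

    Routers are assumed to be numbered 1..x*y row by row (an assumption).
    """
    if not 1 <= i <= x * y:
        raise ValueError("router index out of range")
    col = (i - 1) % x
    row = (i - 1) // x
    # Count the neighbouring routers in the four mesh directions.
    neighbours = (col > 0) + (col < x - 1) + (row > 0) + (row < y - 1)
    # Two core ports, plus one input and one output port per neighbour.
    return 2 + 2 * neighbours


x, y = 4, 4
print(port_count(1, x, y))      # 6  (corner router rho_1)
print(port_count(2, x, y))      # 8  (edge router rho_2)
print(port_count(x + 2, x, y))  # 10 (central router rho_{x+2})
```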
Each pair of communicating routers is connected with two unidirectional links, between their
closest ports. For instance, in Figure 1.20 the routers ρx−1 and ρx are connected with two unidirec-
tional links λx−1,x and λx,x−1, where the first number represents the source router of the link and
the second number denotes the destination router. Therefore, over λx,x−1, the packets are trans-
ferred from the west output port of ρx to the east input port of ρx−1. Moreover, the core ports of
each router ρi are connected with the local core πi via two unidirectional links, λi,c and λc,i. Note,
that for clarity purposes, the links between the cores and the routers have been omitted from the
left part of Figure 1.20.
The platform employs the X-Y routing policy and a wormhole switching technique with the
credit-based flow control mechanism. All links have identical physical characteristics. The width
of each link is σflit, which is equal to the size of one flit. All flits travel with the same speed, and
it takes δL clock cycles for a flit to travel between the output port of the sender and the input port
of the receiver router, i.e. to traverse one link. For example, it takes δL clock cycles for a flit from
the core πx to traverse the link λc,x and enter the core input port of ρx, and it also takes δL clock
cycles for a flit from the west output port of ρx to traverse λx,x−1 and enter the east input port of
ρx−1. Additionally, the flits travel through routers. A header flit contains the information about
the destination, which is used to pre-compute the path which will be followed by the remaining
flits. Therefore, the routing delay is suffered only by a header flit, and it takes δρ clock cycles for a
header to traverse one router. All routers operate at the same frequency νρ.
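Under these assumptions, the path of a packet and the contention-free latency of its header flit can be sketched as below. The row-by-row router numbering, the function names and the values chosen for δL and δρ are assumptions made for illustration. The latency covers only the header flit, from the source core to the destination core, and ignores any contention.

```python
def xy_route(src, dst, x):
    """Routers visited from rho_src to rho_dst under X-Y routing (X first, then Y).

    Routers are assumed to be numbered row by row, x routers per row.
    """
    col, row = (src - 1) % x, (src - 1) // x
    dcol, drow = (dst - 1) % x, (dst - 1) // x
    route = [src]
    while col != dcol:  # first travel along the X dimension
        col += 1 if dcol > col else -1
        route.append(row * x + col + 1)
    while row != drow:  # then travel along the Y dimension
        row += 1 if drow > row else -1
        route.append(row * x + col + 1)
    return route


def route_links(route):
    """Traversed links as (source, destination) pairs, including both core links."""
    hops = list(zip(route, route[1:]))
    return [("c", route[0])] + hops + [(route[-1], "c")]


def header_latency(route, delta_l, delta_rho):
    """Clock cycles needed by the header flit when no contention occurs."""
    return len(route_links(route)) * delta_l + len(route) * delta_rho


route = xy_route(1, 7, x=4)
print(route)               # [1, 2, 3, 7]
print(route_links(route))  # [('c', 1), (1, 2), (2, 3), (3, 7), (7, 'c')]
print(header_latency(route, delta_l=1, delta_rho=3))  # 5 * 1 + 4 * 3 = 17
```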
Moreover, the NoC η may or may not provide the support for virtual channels. In this disser-
tation, both types of NoCs will be analysed. Therefore, let ηRR and ηPP be the NoCs without and
with virtual channels, respectively. For ηRR, it is assumed that the round-robin arbitration policy
is used. Moreover, the capacity of the buffers inside ports should be such that at least 1 flit can be
stored, i.e. σ f lit . As explained in previous sections, assuming a single channel, it is not possible
to implement inter-packet preemptions. Note that the Epiphany [3] and Tilera [93] platforms, in
terms of their characteristics, can be classified as ηRR.
For ηPP, the number of virtual channels should be at least equal to the maximum number of
traffic-flows. This requirement guarantees that each packet can be (if needed) successfully stored
inside a dedicated virtual channel along its path. Subsequently, each virtual channel should be
able to accommodate at least 1 flit. Moreover, the priority-preemptive arbitration mechanism is
assumed, where a higher-priority packet can preempt a lower-priority packet with the flit-level
granularity.
1.5.3 Data Input and Output
In many cases, in order to perform the computation, functionalities need some input data. Depend-
ing on the purpose of the system, the required data can be (i) values sensed from the environment
with which the device is interacting, and/or (ii) the results from the previous computations of the
functionality itself, as well as other functionalities interacting with it. When the computation is
performed, in-core registers are used to manipulate the data. However, the capacity of the registers
is not enough to store the necessary data for all functionalities, not even for a single functionality.
Thus, there must exist an external medium where data can be stored, and there must exist an ac-
companying mechanism which performs the data transfer between the registers and the external
medium, i.e. to send data to the medium when it is not needed any more, and retrieve it when
it is needed again. The part of the hardware that is responsible for these operations is called the
memory system.
1.5.3.1 Memory Systems in Single-Core and Multi-Core Platforms
Over the years, the memory systems underwent a multitude of changes. Chip manufacturers de-
cided to implement changes in order to improve the performance of the system, usually at the
expense of a more complex design. These evolution trends of memory systems are very beneficial
for some areas, such as the high-performance or general purpose computing, where the average-
case behaviour (performance) is an important aspect. However, a more complex design, at the
same time, makes it increasingly hard to analyse the temporal behaviour of the memory system,
which is an important aspect in other areas, such as the real-time embedded domain. It is not
surprising that, as the memory design grows more complex, new challenges in the real-time analysis
of memory systems keep emerging.
Figure 1.21 depicts a typical memory system of a single-core platform. As already stated, the
computation process uses in-processor registers. When the data that is needed for the computation
Figure 1.21: Memory system of single-core devices (core π1 with registers and cache memory, main memory and hard disk; Arrows 1, 2 and 3 mark the successive lookups).
is not present in the registers, an in-processor memory called the cache memory is checked for the
needed content (see Arrow 1 in Figure 1.21). The cache memory is a fast memory with a limited
capacity, usually in the range of 1-64 kilobytes. If the required data is not in the cache memory,
it has to be looked for outside of the core. In such cases, the module called the main memory is
checked (see Arrow 2 in Figure 1.21). When compared with the cache memory, the main memory
can store much more data, e.g. in currently available single- and multi-core computers the capacity
of the main memory is in the range of gigabytes. However, the main memory is significantly
slower than the cache memory, and fetching the data also requires the communication between
the processor and the external memory module, usually over the part of the NoC dedicated to the
memory traffic, while fetching the data from the cache memory is performed inside the processor.
Finally, if the desired data is not in the main memory, the component called the hard disk is
used to fetch the data (see Arrow 3 in Figure 1.21). The comparison between the main memory
and the hard disk is very similar to that between the cache memory and the main memory. In other
words, the capacity of currently available hard disks is in the range of terabytes, thus the hard disk
can store much more data than the main memory. However, the access to the hard disk is more
time consuming than the access to the main memory.
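The lookup order described above can be sketched as a walk through the memory levels. The latency values below are arbitrary illustrative numbers without units, not measurements of any platform, and the names LEVELS and fetch are hypothetical.

```python
LEVELS = ["cache", "main memory", "hard disk"]


def fetch(address, contents, latency):
    """Search the levels outward from the core until the address is found.

    Returns the level that served the request and the accumulated latency.
    """
    total = 0
    for level in LEVELS:
        total += latency[level]  # every level that is checked costs time
        if address in contents[level]:
            return level, total
    raise KeyError(address)


# Illustrative numbers only: each level is larger and slower than the previous.
latency = {"cache": 1, "main memory": 100, "hard disk": 100000}
contents = {
    "cache": {0x10},
    "main memory": {0x10, 0x20},
    "hard disk": {0x10, 0x20, 0x30},
}

print(fetch(0x10, contents, latency))  # ('cache', 1)
print(fetch(0x20, contents, latency))  # ('main memory', 101)
print(fetch(0x30, contents, latency))  # ('hard disk', 100101)
```

The same request thus costs very different amounts of time depending on where the data currently resides, which is what makes the temporal behaviour of the memory system hard to bound.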
From the previous description and illustration it can be concluded that the cache memory and
the main memory are used as the means to bridge the gap between the latencies of accessing the
data that is (i) in the registers and (ii) on the hard disk. Evidently, the presence and absence of the
required data in the cache and the main memory has a significant impact on the performance of
the system. Thus, one of the main objectives of chip manufacturers is to organise the data transfer
such that the required data is as close to the core as possible, at the very moment when it is
needed. There are many techniques which are used to achieve this. The most popular ones employ
prediction heuristics, which are based on temporal and spatial locality principles. However, in this
dissertation, the data transfer policies are not of interest.
Figure 1.22 depicts a typical memory system for a multi-core platform. It is visible that
this design has one more element called the Level 2 cache memory. This memory module is usually
shared by two neighbouring cores, and it is often referred to as the shared cache, to distinguish it
from the in-core cache memories, which are, for the same reason, called the private caches. The
characteristics of the shared cache fall between the private cache and the main memory, both in
terms of the capacity and the access latency.
Figure 1.22: Memory system of multi-core devices (cores π1 and π2, each with registers and a private cache memory, a shared Level 2 cache, main memory and hard disk; Arrows 1 to 4 mark the successive lookups).
Notice, that Figures 1.21-1.22 are examples of memory systems in single-core and multi-core
platforms. However, there exist designs where a core has multiple levels of internal cache, not
only one. Similarly, cores may share multiple levels of shared cache, where the number of cores
that share the cache usually grows with the level, e.g. the first level shared cache is shared by two
cores, the second level shared cache is shared by four cores, etc.
1.5.3.2 Consistency and Coherence
Consider a case where two functionalities share some parameter. Moreover, assume that one
functionality modified that parameter. Subsequently, the other functionality must be instantly notified
about that change, so as to take it into account in its future computations. When such a mechanism
is in place, it is said that the system is in a consistent state. Having a consistent state is imperative
for a correct functioning of the system. Notice that, unless some mechanisms to enforce
parallelism are applied, in a single-core system, at any time instant, at most one functionality may
execute and hence manipulate the data. Therefore, maintaining a consistent state in the single-core
domain is usually trivial. However, in multi-core platforms, several functionalities may concurrently
perform computations on different cores. This implies that there must exist a mechanism
which will maintain the consistent system state.
Consider the multi-core case where two functionalities share some parameter. Moreover, as-
sume that the functionalities execute concurrently, and that the shared parameter is in both private
caches. Once the parameter has been modified by one functionality, the value of that parameter
in the private cache of the other functionality also has to be modified. Specifically, assuming the
memory architecture from Figure 1.22, once the functionality changes the value of the parameter,
the new value has to be propagated back into the shared cache, and then into the private cache of
the other functionality. Furthermore, if these two functionalities do not have a common shared
cache, the information has to be propagated back into the main memory, and then into the private
cache of the other functionality. This process is called the memory coherence mechanism.
Notice, that the coherence mechanism can generate significant traffic. In the corner cases,
for a single modified parameter, it might be necessary to propagate the information back to the
main memory, and subsequently broadcast that information to all remaining cores within the plat-
form. Evidently, the coherence mechanism is not scalable, and as the number of cores grows, the
overheads become more apparent. In fact, due to the coherence-related overheads, in some cases
adding more cores can have a negative impact on the system and cause performance drops [11].
Another limitation of this approach is the fact that it is very hard, if not impossible, to efficiently
analyse the temporal behaviour of coherence mechanisms at design-time, because the decisions
taken by the coherence mechanism highly depend on the system state at runtime.
1.5.3.3 Message-Passing
Despite the fact that the coherence mechanism can be used to assure a consistent system state, it
is not applicable to the many-core domain due to its poor scalability. In fact, yet another
key difference between multi- and many-core platforms lies in the fact that for the former, the
coherence mechanism presents an efficient approach, while for the latter, an alternative mechanism
is necessary.
An alternative to the coherence mechanism is called the message-passing technique. It was
shown long ago that coherence and message passing are dual approaches [53, 55],
and that the dominance of one over the other depends on the platform upon which it is
implemented. However, for many years the coherence mechanism was the predominant choice. Yet, as
many-core platforms emerged, the limitations of the coherence mechanism became evident, and
academia as well as industry focused on the message-passing technique [52].
The message-passing paradigm is built around principles that are opposite to those of
coherence mechanisms. For instance, the message-passing approach suggests that data should not
be shared between multiple processes, which is why it is also called the share nothing approach.
Specifically, instead of sharing the same parameter between two or more functionalities, the parameter
should independently exist within the scope of each functionality, while for any modification of
the parameter, explicit messages should be exchanged between the respective functionalities.
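The share nothing principle can be illustrated with a minimal Python sketch. The `Functionality` class, its `inbox` queue and the method names are hypothetical and serve illustration only; on a real platform the messages would travel over the interconnect rather than through an in-process queue.

```python
from queue import Queue


class Functionality:
    """Hypothetical functionality that keeps a private copy of a parameter."""

    def __init__(self, value):
        self.value = value      # private copy, never shared directly
        self.inbox = Queue()    # explicit messages from other functionalities
        self.peers = []         # functionalities that hold their own copy

    def modify(self, new_value):
        """Change the local copy and notify every peer with an explicit message."""
        self.value = new_value
        for peer in self.peers:
            peer.inbox.put(new_value)

    def process_messages(self):
        """Apply all pending updates to the local copy, in arrival order."""
        while not self.inbox.empty():
            self.value = self.inbox.get()


a, b = Functionality(0), Functionality(0)
a.peers.append(b)
a.modify(42)
stale = b.value          # b still holds its old copy until it reads the message
b.process_messages()
fresh = b.value          # after processing, b's copy matches a's
```

In this sketch the point at which `b` observes the change is decided by software (the call to `process_messages`), not by a hardware coherence protocol.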
Notice, that the problem of consistency is now elevated from the hardware (coherence mech-
anisms) to the software (exchange of messages). In other words, it is the responsibility of im-
plementers and system designers to assure that the execution of functionalities does not cause an
inconsistent system state. However, the greatest benefit of this technique is that it is scalable and
can be successfully applied to the many-core domain [11, 52]. Another benefit is that data de-
pendencies among functionalities are known at design-time, and so are the messages that must be
exchanged in order to maintain a consistent state. This implies that the message-passing approach
is also much more predictable and easier to analyse, which carries a significant importance in the
real-time domain.
1.5.3.4 Hardware Support for Message-Passing
Many existing many-core platforms rely on the message-passing mechanism to maintain the
consistent system state, e.g. Epiphany [3], SCC [42] and MPPA-256 [44], while some other
platforms support both the coherence and the message-passing mechanisms, e.g. Tile 64™ [93].
For instance, inside each router of the SCC platform there is a buffer called the message-passing buffer,
which provides hardware support for the message-passing mechanism. Tile 64™ has a
feature called the dynamic distributed cache, which allows access to the caches of remote
cores, also as a support for the message-passing mechanism. In the Epiphany platform, each core is
mapped to a specific region of the main memory, which eases inter-core communication and
promotes message passing.
Evidently, the trends of the chip manufacturers suggest that future many-core platforms will be
non-coherent systems where the consistent state will be assured via a message-passing mechanism.
Recall that the main reason to study the message-passing technique was the fact that it was the
only viable approach for many-cores. It is also worth mentioning that its predictable nature allows
the system behaviour to be efficiently analysed at design-time, and makes it suitable for real-time
analysis. Thus, this is another case where the trends in many-core design are also beneficial
for the real-time domain.
1.5.3.5 Assumptions Regarding Memory System
Recall that in Sections 1.5.1.6-1.5.2.1 it was mentioned that the platform under consideration Ψ
contains the set of processing elements Π and the interconnect medium η. In addition to that, the
platform contains the non-coherent memory system Ξ. The memory space of Ξ is divided among
4 different memory controllers µ = {µ1, µ2, µ3, µ4}, where each controller arbitrates the accesses
to a different part of the address space. Moreover, there are z = x · y private cache memories in the
system, κ = {κ1, κ2, ..., κz−1, κz}, where z is equal to the number of cores in the system and also to
the number of routers, i.e. each core has a private cache, as depicted in Figure 1.23. Note that,
for clarity purposes, the cores have been omitted from the figure.
A functionality gets its data from the cache of its core. If the required data is not in the cache,
it is fetched from the memory controller. The communication between the cache memory and
the memory controller is performed over the NoC interconnect η . The memory controllers are
accessible from the topmost or the bottommost row of routers (see Figure 1.23). Additionally,
each controller provides concurrent access to x/2 routers from its access row, e.g. the controller
µ1 to the routers ρ1, ρ2, ..., ρx/2−1, ρx/2, the controller µ2 to the routers ρx/2+1, ρx/2+2, ..., ρx−1, ρx, etc.
Memory read and write requests originate from the core of the functionality, and traverse the
X-Y routed path until reaching the memory controller. Similarly, responses originate from the
memory controller, and traverse the X-Y routed path to the core from which the request was sent.
Note that in this dissertation the focus is only on the delays of memory traffic over the NoC, while
the delays occurring inside the memory controller are out of scope; these have been extensively
studied elsewhere [78, 95]. Merging the contributions of this dissertation with those
of the mentioned works into a unified approach is a potential future work.
Figure 1.23: Assumed memory system
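The X-Y routed path mentioned above can be sketched as follows. The sketch assumes that routers are numbered row by row, starting from 1, as the labels of Figure 1.23 suggest; the function name `xy_route` and the exclusion of cores and memory controllers from the path are simplifications made for illustration only.

```python
def xy_route(src, dst, x):
    """Routers visited from router `src` to router `dst` on a mesh with x columns.

    Routers are numbered 1..z row by row (assumption). The packet first moves
    along the horizontal dimension, then along the vertical one.
    """
    col, row = (src - 1) % x, (src - 1) // x
    dst_col, dst_row = (dst - 1) % x, (dst - 1) // x
    path = [src]
    while col != dst_col:                      # X dimension first
        col += 1 if dst_col > col else -1
        path.append(row * x + col + 1)
    while row != dst_row:                      # then Y dimension
        row += 1 if dst_row > row else -1
        path.append(row * x + col + 1)
    return path


# 3 x 3 mesh: from router 1 to router 8, and from router 9 back to router 1.
forward = xy_route(1, 8, x=3)
backward = xy_route(9, 1, x=3)
```

Because the route is a function of the source and the destination only, the set of links used by a request and by its response is known at design-time.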
Memory traffic uses dedicated NoC resources: dedicated physical links and dedicated port
buffers, which have identical physical characteristics to those dedicated to the communication
traffic. For example, between the routers ρx−1 and ρx in Figure 1.23 there are two unidirectional links λMx−1,x
and λMx,x−1, and the link and routing delays are δL and δρ, respectively. Similarly, between the
memory controller µ2 and the router ρx there are two unidirectional links λMx,m and λMm,x. Also,
between the core πx and the router ρx there are two unidirectional links λMx,c and λMc,x. Note
that, for clarity purposes, the cores and their respective links are not depicted in Figure 1.23.
Moreover, the memory traffic is arbitrated in the same way as the communication traffic, that is,
if the latter is arbitrated with the round-robin policy, so is the former, while if the latter employs
the priority-preemptive policy, so does the former. Since the two traffic types use separate links and
buffers, the communication and memory traffic do not contend, and the respective analyses can be performed separately.
In this dissertation, the focus is on non-coherent many-core platforms with the
message-passing mechanism.
1.5.4 Recapitulation
In the previous sections, the elements of interest of the assumed platform Ψ were introduced. For
better readability, all symbols that have been previously introduced are summarised in Table 1.1.
Table 1.1: List of symbols related to the assumed platform
Symbol    Description
Ψ         The assumed platform.
Π         The set of processing elements (cores) of Ψ.
z         The number of cores in Π.
πi        The i-th core of Π, where i ∈ {1, ..., z}.
η         A generic 2-D mesh NoC interconnect of Ψ.
ηRR       η with the round-robin arbitration policy and without virtual channels.
ηPP       η with the priority-preemptive arbitration policy and with virtual channels.
x         The horizontal dimension of η, where x · y = z.
y         The vertical dimension of η, where x · y = z.
ρ         The set of routers of η.
ρi        The i-th router of ρ, where i ∈ {1, ..., z}.
λ         The set of links of η dedicated to the communication traffic.
λi,j      The communication link from ρi (source) to ρj (destination).
λc,i      The communication link from πi (source) to ρi (destination).
λi,c      The communication link from ρi (source) to πi (destination).
σflit     The size of one flit and the width of each link.
δL        The number of clock cycles that it takes for one flit to traverse one link.
δρ        The number of clock cycles that it takes for a header flit to traverse one router.
νρ        The frequency of routers.
Ξ         The non-coherent memory system of Ψ.
µ         The set of memory controllers of Ξ.
µi        The i-th controller of µ, where i ∈ {1, 2, 3, 4}.
κ         The set of cache memories of Ξ.
κi        The i-th cache memory of κ, where i ∈ {1, ..., z}.
λM        The set of links of η dedicated to the memory traffic.
λMi,j     The memory link from ρi (source) to ρj (destination).
λMc,i     The memory link from πi (source) to ρi (destination).
λMi,c     The memory link from ρi (source) to πi (destination).
λMm,i     The memory link from the memory controller (source) to ρi (destination).
λMi,m     The memory link from ρi (source) to the memory controller (destination).
1.6 Thesis Statement
Many-core platforms are the imperative and the sine qua non in the design of future real-time
embedded systems. With:
• adequate hardware support for (i) the message passing communication paradigm and (ii) vir-
tual channels,
• operating system design choices which promote scalability and predictability via message-
passing,
• mindful and thoughtful worst-case analyses,
it is possible to make many-cores amenable to real-time analysis and applicable to the real-time
embedded domain.
Chapter 2
Several Steps Closer to Real-Time NoCs
In this chapter, a set of contributions concerning the NoC interconnect η of the assumed many-
core platform Ψ is presented. The common aim of all contributions is to make the existing NoC
platforms more amenable to the real-time domain, by reducing the analysis pessimism and/or by
reducing the hardware requirements. This chapter is organised as follows. In Section 2.1 the traffic
model is introduced. This model is used to analytically describe the communication and memory
traffic. Then, in Section 2.2, the existing method for the worst-case traffic analysis of round-robin-
arbitrated NoCs is covered, and subsequently the novel, less pessimistic method called the Branch,
Prune and Collapse is proposed. In Section 2.3 the focus is on priority-preemptive NoCs. First,
the existing approach for the worst-case traffic analyses is introduced, and then the new method to
reduce the requirements for hardware resources is proposed. After that, in Section 2.3.4 the novel
arbitration policy for priority-preemptive NoCs is covered. This policy is based on the Earliest
Deadline First scheduling paradigm. Finally, in Section 2.3.5 an improvement over the existing
methods for priority-preemptive NoCs is presented. This improvement makes it possible to perform a less
pessimistic analysis.
2.1 Traffic Model
The communication (core-to-core) traffic, as well as the memory (core-to-memory) traffic, is
modelled by a sporadic flow-set F, which is a collection of w flows F = {f1, f2, ..., fw−1, fw}. Each
flow fi has a source src(fi) and a destination dst(fi), where the source and the destination can be
cores or memory controllers, i.e. src(fi) ∈ Π ∪ µ ∧ dst(fi) ∈ Π ∪ µ. Additionally, fi is
characterised by a set of traversed links L(fi), its size σ(fi), its minimum inter-arrival period T(fi),
its deadline D(fi), and its priority P(fi). Each flow fi generates a potentially infinite sequence of
packets. A packet released from src(fi) at the time instant t should be received by dst(fi) no later
than t + D(fi). Otherwise, it has missed its deadline. In this dissertation, it is assumed that all flows
have constrained or implicit deadlines, i.e. ∀fi ∈ F | D(fi) ≤ T(fi).
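The flow model can be written down as a small data structure. The following Python sketch is illustrative only: the `Flow` class, its field names, the string encoding of links and the concrete numbers are assumptions made here, not part of the model itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    src: str          # src(f): a core or a memory controller
    dst: str          # dst(f): a core or a memory controller
    links: tuple      # L(f): traversed links, in traversal order
    size: int         # sigma(f)
    period: int       # T(f): minimum inter-arrival period
    deadline: int     # D(f): relative deadline
    priority: int     # P(f)


def has_constrained_deadlines(flow_set):
    """True if D(f) <= T(f) holds for every flow of the flow-set."""
    return all(f.deadline <= f.period for f in flow_set)


def absolute_deadline(flow, release_time):
    """Latest instant by which a packet released at release_time must arrive."""
    return release_time + flow.deadline


# Made-up parameters for a single flow from core 1 to core 3.
f1 = Flow("core1", "core3", ("c-1", "1-2", "2-3", "3-c"),
          size=64, period=100, deadline=80, priority=1)
ok = has_constrained_deadlines([f1])
due = absolute_deadline(f1, release_time=250)
```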
Figure 2.1 depicts three flows, f1, f2 and f3, with the following characteristics:
src(f1) = π1, dst(f1) = π3, L(f1) = {λc,1, λ1,2, λ2,3, λ3,c}.
src(f2) = π2, dst(f2) = π4, L(f2) = {λc,2, λ2,3, λ3,4, λ4,c}.
src(f3) = π3, dst(f3) = π5, L(f3) = {λc,3, λ3,4, λ4,5, λ5,c}.
Figure 2.1: Traffic flows (example 1)
Notice that the flows f1 and f2 are directly competing because of the common link λ2,3. The
same is also true for the flows f2 and f3, because of the common link λ3,4. However, f1 and f3 are
not directly competing because they do not have any common link.
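The notion of direct competition reduces to a set intersection over the traversed links. A minimal Python sketch for the three flows of Figure 2.1 follows; the tuple encoding of links, with "c" standing for the local core, and the name `shared_links` are choices made here for illustration.

```python
def shared_links(path_a, path_b):
    """Links that two flows have in common; non-empty means direct competition."""
    return set(path_a) & set(path_b)


# L(f1), L(f2) and L(f3) from Figure 2.1; ("c", 1) encodes the link from core 1
# to router 1, and (3, "c") the link from router 3 to core 3.
L = {
    "f1": [("c", 1), (1, 2), (2, 3), (3, "c")],
    "f2": [("c", 2), (2, 3), (3, 4), (4, "c")],
    "f3": [("c", 3), (3, 4), (4, 5), (5, "c")],
}

f1_f2 = shared_links(L["f1"], L["f2"])
f2_f3 = shared_links(L["f2"], L["f3"])
f1_f3 = shared_links(L["f1"], L["f3"])
```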
2.2 NoCs with the Round-Robin Arbitration Policy
Irrespective of the type of the traffic (communication or memory), and irrespective of the arbi-
tration policy (round-robin or priority preemptive), if every packet of one flow meets its deadline,
then that flow is considered schedulable. Subsequently, if every flow of the flow-set is schedulable,
then the flow-set is considered schedulable. The schedulability can be investigated via simulations
or measurement-based techniques, however, as mentioned in Section 1.2, these methods have sev-
eral fundamental limitations. In the hard real-time domain the most common approach to test the
schedulability is to perform the worst-case traffic analysis. Specifically, it is investigated whether
a packet of the analysed flow can meet a deadline, while facing the worst-case occurrence patterns
of packets belonging to other flows. This case is called the worst-case scenario. If the packet of
the analysed flow can meet a deadline in the worst-case scenario, it means that every packet of that
flow will be able to meet its deadline, and hence the flow is schedulable.
One of the key challenges is to recognise and analytically describe the worst-case scenario.
Depending on the conditions, finding the worst-case scenario can be a relatively straightforward
activity, or it can be an intractable process, and both situations will be encountered throughout
this dissertation. In the latter case, as will be shown in this very section, the goal is to identify an
artificial worst-case scenario. Its advantages are that it is easier to capture than the actual worst-
case scenario and that it is safe, i.e. never leads to an underestimation. However, its disadvantage
is that it is more pessimistic than the actual worst-case scenario.
2.2.1 State-of-the-Art Method
Ferrandiz et al. [32] proposed a method to perform the worst-case analysis for round-robin-
arbitrated SpaceWire networks, and this method is also applicable to NoCs. This approach is
based on two observations:
Observation 1. A packet may, at any router, be blocked by at most one packet from each of the
other input ports.
Observation 2. A packet may be directly blocked by at most one packet of each flow.
Figure 2.2: Traffic flows (example 2)
Observations 1-2 are consequences of the round-robin arbitration policy. Figure 2.2 is used to
illustrate the implications of these observations. Consider that the flow f2 is under analysis. Upon
its release, it can be blocked by f1 and f3. However, after the blocking, when it progresses to ρ5, it
cannot suffer any more blocking from f1 and f3 (Observation 2). Moreover, if there exist other
flows f′1 and f′3, with the same paths as f1 and f3, respectively, only one flow from each direction
would be able to block f2 (Observation 1). That is, only one of f1 and f′1 would be able to block
f2, and the same is true for f3 and f′3.
Although a flow can be blocked directly by any flow at most once, it may happen that the analysed
flow is additionally blocked by the same flow multiple times indirectly. For example, in Figure 2.2, when
f2 is blocked by f1, then f1 can be blocked by f4 and f5 in ρ5, which causes indirect blocking to f2. However,
when f2 reaches ρ5, it can also be blocked by f4 and f5, but this time directly. In fact, in the
illustrated example, f2 can be blocked three times by f4 and f5: twice indirectly, when f1 and f3
reach ρ5, and once directly, when f2 itself reaches ρ5. This scenario is illustrated with the following
flow traversal sequence: f4, f5, f1, f4, f5, f3, f4, f5, f2.
Ferrandiz et al. [32] proposed to compute the worst-case traversal time (WCTT) of flows by
invoking Algorithm 1 with the parameters delay(fi, get(L(fi), 1)), where fi is the
flow under analysis, and get(L(fi), 1) denotes the first link on the path L(fi). First, it is checked
whether λm,n is null (line 1). If it is, this means that the header of fi reached the destination, so the
only remaining activity is to transfer the rest of the flits (line 3). Otherwise, it is checked whether
λm,n is the first link on the flow path (line 5). If it is, this means that fi can be blocked on λm,n only by
the flows with the same source as fi. All such flows are grouped into the set Fsrc (lines 7−9).
Then, the WCTT of fi is equal to the sum of: (i) the delay of all flows from Fsrc to traverse λm,n,
(ii) the delay of all flows from Fsrc to reach the destination, (iii) the delay of fi to traverse λm,n,
and (iv) the delay of fi to reach the destination (lines 10−13).
Conversely, if λm,n is not the first link of fi, then fi can suffer blocking only from flows which
also traverse λm,n, but enter the router ρm from a different link than the one from which fi does
(which is λk,m). Thus, all such links are added to the set of blocking links BL (lines 18−20).
For each blocking link λg,m ∈ BL, a set of flows Fg,m is created. Fg,m contains all flows that
traverse λg,m and subsequently compete with fi on λm,n (lines 21−26). For each λg,m ∈ BL, a
single flow from Fg,m is found, such that it can cause the maximum delay to fi. Subsequently,
for all such flows the following values are computed: (i) the delay to traverse ρm, (ii) the delay
to traverse λm,n, and (iii) the delay to reach the destination. All these delays jointly contribute to
the delay of fi (line 27). Then, the following values are computed: (i) the delay of fi to traverse
ρm, (ii) the delay of fi to traverse λm,n, and (iii) the delay of fi to reach the destination (line 28).
Finally, the worst-case traversal time of fi is equal to the sum of these terms (line 29).
Algorithm 1 delay(fi, λm,n)
Input: flow fi, link λm,n
Output: WCTT of fi, starting from λm,n, to the destination
 1: if (λm,n = null) then
 2:     // the header flit reached the destination, so the remaining flits have to be transferred
 3:     return ⌈σ(fi)/σflit⌉ · δL;
 4: end if
 5: if (get(L(fi), 1) = λm,n) then
 6:     // λm,n is the first link of fi, so only the flows with the same source as fi can block
 7:     for each (fj ∈ F | fj ≠ fi ∧ get(L(fj), 1) = λm,n) do
 8:         Fsrc ← Fsrc ∪ {fj}; // find all flows with the same source as fi
 9:     end for
10:     VAL1 ← Σ∀fj∈Fsrc (δL + delay(fj, get(L(fj), 2)));
11:     VAL2 ← δL + delay(fi, get(L(fi), 2));
12:     WCTT ← VAL1 + VAL2;
13:     return WCTT;
14: end if
15: // find λk,m, which fi traversed immediately before λm,n
16: λk,m ← getprev(L(fi), λm,n);
17: // fi can be blocked by flows which traverse λm,n, but have previous link different than λk,m
18: for each (g ∈ N | λg,m ∈ λ ∧ g ≠ k) do
19:     BL ← BL ∪ {λg,m};
20: end for
21: for each (g ∈ N | λg,m ∈ BL) do
22:     // find the set of flows Fg,m that traverse λg,m and λm,n, and hence can block fi
23:     for each (fj ∈ F | λg,m ∈ L(fj) ∧ λm,n ∈ L(fj)) do
24:         Fg,m ← Fg,m ∪ {fj};
25:     end for
26: end for
27: VAL1 ← Σ∀λg,m∈BL max∀fj∈Fg,m {δρ + δL + delay(fj, getnext(L(fj), λm,n))};
28: VAL2 ← δρ + δL + delay(fi, getnext(L(fi), λm,n));
29: WCTT ← VAL1 + VAL2;
30: return WCTT;
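The recursion of Algorithm 1 can be sketched in Python as shown below. This is an illustrative transcription, not the implementation of Ferrandiz et al. [32]: the dictionary layout, the tuple encoding of links ("c" stands for the local core) and all names are assumptions. Blocking links are derived from the flow paths instead of an explicit topology, and the paths of the example are inferred from the computation trees for Figure 2.2, with made-up flow sizes.

```python
from math import ceil


def next_link(path, link):
    """Link that follows `link` on `path`, or None (null) after the last link."""
    i = path.index(link)
    return path[i + 1] if i + 1 < len(path) else None


def delay(fi, link, flows, d_link, d_router, flit):
    """WCTT of flow `fi` from `link` to its destination (sketch of Algorithm 1).

    `flows` maps a flow name to (path, size); every link is assumed to occur
    at most once on a path.
    """
    path, size = flows[fi]
    if link is None:                                   # lines 1-4
        return ceil(size / flit) * d_link
    if link == path[0]:                                # lines 5-14
        same_src = [fj for fj, (p, _) in flows.items()
                    if fj != fi and p[0] == link]
        val1 = sum(d_link + delay(fj, next_link(flows[fj][0], link),
                                  flows, d_link, d_router, flit)
                   for fj in same_src)
        val2 = d_link + delay(fi, next_link(path, link),
                              flows, d_link, d_router, flit)
        return val1 + val2
    prev = path[path.index(link) - 1]                  # line 16
    worst = {}                                         # blocking link -> max delay
    for fj, (p, _) in flows.items():                   # lines 18-27
        if link not in p or p.index(link) == 0:
            continue
        g = p[p.index(link) - 1]                       # link over which fj enters
        if g == prev or not isinstance(g[0], int):     # g is a router and g != k
            continue
        d = d_router + d_link + delay(fj, next_link(p, link),
                                      flows, d_link, d_router, flit)
        worst[g] = max(worst.get(g, 0), d)
    val2 = d_router + d_link + delay(fi, next_link(path, link),
                                     flows, d_link, d_router, flit)   # line 28
    return sum(worst.values()) + val2                  # lines 29-30


# Inferred paths for Figure 2.2 (assumption) and made-up sizes.
FLOWS = {
    "f1": ([("c", 1), (1, 2), (2, 5), (5, 7), (7, "c")], 8),
    "f2": ([("c", 2), (2, 5), (5, 7), (7, "c")], 4),
    "f3": ([("c", 3), (3, 2), (2, 5), (5, 7), (7, "c")], 12),
    "f4": ([("c", 4), (4, 5), (5, 7), (7, "c")], 16),
    "f5": ([("c", 6), (6, 5), (5, 7), (7, "c")], 5),
}
wctt_f2 = delay("f2", FLOWS["f2"][0][0], FLOWS, d_link=1, d_router=2, flit=4)
alone_f2 = delay("f2", FLOWS["f2"][0][0], {"f2": FLOWS["f2"]},
                 d_link=1, d_router=2, flit=4)
```

The sketch recomputes shared subtrees on every call, which mirrors the depth-first traversal of the computation tree and is sufficient for small examples.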
Figure 2.3: Computation tree for flow f2 from Figure 2.2, in isolation (Ferrandiz et al. [32] method). The tree is a single chain:
delay(f2, λc,2) = δL + delay(f2, λ2,5)
delay(f2, λ2,5) = δρ + δL + delay(f2, λ5,7)
delay(f2, λ5,7) = δρ + δL + delay(f2, λ7,c)
delay(f2, λ7,c) = δρ + δL + delay(f2, null)
delay(f2, null) = ⌈σ(f2)/σflit⌉ · δL
The execution of Algorithm 1 is demonstrated by using the example given in Figure 2.2. Consider
that f1, f3, f4 and f5 do not exist, and that f2 is the only flow, which, due to the absence of other
flows, can progress uninterruptedly. By invoking Algorithm 1 for this example, the computation
process follows the steps illustrated in Figure 2.3, and returns the following result:
\[ WCTT(f_2) = 4 \cdot \delta_L + 3 \cdot \delta_\rho + \left\lceil \frac{\sigma(f_2)}{\sigma_{flit}} \right\rceil \cdot \delta_L . \]
This result is intuitive, because f2 traverses four links (four hops) and three routers, so its header
flit suffers the link traversal delay four times, and the routing delay three times, before reaching the
destination. Additionally, once the header flit is received, the rest of the flits are transferred, and the
last term in the aforementioned computation corresponds to that delay.
This example allows a more general observation to be made:
Observation 3. Consider a schedulable flow-set F. A packet of a flow fi ∈ F, when traversing in
isolation, will suffer a constant delay C(fi), expressed with Equation (2.1).
\[ C(f_i) = |\mathcal{L}(f_i)| \cdot \delta_L + \left( |\mathcal{L}(f_i)| - 1 \right) \cdot \delta_\rho + \left\lceil \frac{\sigma(f_i)}{\sigma_{flit}} \right\rceil \cdot \delta_L \tag{2.1} \]
Recall that |L(fi)| denotes the cardinality of the path, which is also called the number of
hops, and it represents the number of links that fi traverses, i.e. the number of elements in the
set L(fi).
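Equation (2.1) translates directly into a few lines of Python. The function and parameter names below are hypothetical, and the numeric values are made up for illustration.

```python
from math import ceil


def basic_network_latency(hops, size, flit, d_link, d_router):
    """C(f) from Equation (2.1).

    hops     -- |L(f)|, the number of traversed links
    size     -- sigma(f), the flow size
    flit     -- sigma_flit, the flit size (same unit as size)
    d_link   -- delta_L, clock cycles per link traversal
    d_router -- delta_rho, clock cycles per router traversal
    """
    return hops * d_link + (hops - 1) * d_router + ceil(size / flit) * d_link


# A flow with four hops, such as f2 from Figure 2.2, with made-up parameters.
c_f2 = basic_network_latency(hops=4, size=4, flit=4, d_link=1, d_router=2)
```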
C(fi) is in the literature also called the basic network latency [30]. When the flow traverses in
isolation, its worst-case traversal time is equal to its basic network latency: WCTT(fi) = C(fi),
which also holds for the previous example: WCTT(f2) = C(f2) = 4 · δL + 3 · δρ + ⌈σ(f2)/σflit⌉ · δL.
However, in the presence of other flows, fi may suffer additional delay, which can make WCTT(fi)
significantly greater than C(fi). Thus, C(fi) is the minimum delay that any packet of fi may
suffer while traversing the NoC, and C(fi) ≤ D(fi) is a necessary but not a sufficient condition for
the schedulability of fi.
Now, consider the example from Figure 2.2, but this time assuming all depicted flows. By
invoking Algorithm 1 to compute WCT T ( f2), the computation trees illustrated in Figures 2.4-2.5
will be generated. The computation process returns the following value:
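\[ WCTT(f_2) = 22 \cdot \delta_L + 21 \cdot \delta_\rho + 3 \cdot \left( \left\lceil \frac{\sigma(f_4)}{\sigma_{flit}} \right\rceil + \left\lceil \frac{\sigma(f_5)}{\sigma_{flit}} \right\rceil \right) \cdot \delta_L + \left( \left\lceil \frac{\sigma(f_1)}{\sigma_{flit}} \right\rceil + \left\lceil \frac{\sigma(f_3)}{\sigma_{flit}} \right\rceil + \left\lceil \frac{\sigma(f_2)}{\sigma_{flit}} \right\rceil \right) \cdot \delta_L \]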
Figure 2.4: Computation tree for flow f2 from Figure 2.2 (Ferrandiz et al. [32] method). The root expands as:
delay(f2, λc,2) = δL + delay(f2, λ2,5)
delay(f2, λ2,5) = [δρ + δL + delay(f1, λ5,7)] + [δρ + δL + delay(f3, λ5,7)] + [δρ + δL + delay(f2, λ5,7)]
Figure 2.5: Computation subtrees for Figure 2.4. For each fk ∈ {f1, f2, f3}:
delay(fk, λ5,7) = [δρ + δL + delay(f4, λ7,c)] + [δρ + δL + delay(f5, λ7,c)] + [δρ + δL + delay(fk, λ7,c)]
delay(fj, λ7,c) = δρ + δL + delay(fj, null), where delay(fj, null) = ⌈σ(fj)/σflit⌉ · δL
Notice, that in this case, due to the presence of other flows, the worst-case traversal time of f2
is significantly greater than its basic network latency, i.e. WCT T ( f2)>C( f2).
2.2.2 Analysis Pessimism
The method proposed by Ferrandiz et al. [32] renders safe upper-bound estimates on WCT T , and
the related computation terminates within a reasonable time, as demonstrated by the authors in
their work. However, this method does not take into account traffic characteristics when construct-
ing contention trees, which may lead to pessimistic results. This is illustrated with the example
from Figure 2.2, and the corresponding computation tree, given in Figure 2.4. It is visible from
the figure that both f4 and f5 occurred three times during the traversal of the analysed flow f2.
However, if the periods of f4 and f5 are greater than the worst-case delay of f2, then these flows
cannot appear that often, and hence cannot contribute that significantly to WCT T ( f2). Yet, the
state-of-the-art method does not have a mechanism to take into account this information, but in-
stead allows each flow to appear at the maximum possible rate, which clearly may lead to overly
pessimistic worst-case traversal time estimates.
In this dissertation, two sources of pessimism of the state-of-the-art method [32] are identified,
referred to as the packet-level pessimism and the flow-level pessimism.
The packet-level pessimism is related to the fact that, in the existing method, it is assumed
that two packets of a flow, released during adjacent periods, can reach the same router succes-
sively. However, the time interval between two packets from adjacent periods depends on their
characteristics, as demonstrated with Theorem 1.
Theorem 1. Consider a schedulable flow-set F . Two packets of a flow fi ∈F , related to two
adjacent periods, cannot traverse the same router in the time interval which is less than or equal
to T ( fi)−D( fi)+C( fi).
Proof. Proven directly. Consider the case when the first packet of fi starts traversing as late as
possible, while the second packet starts traversing as early as possible, as illustrated in Figure 2.6.
The traversal of the first packet can be delayed by at most D( fi)−C( fi), as delaying it any further
would cause a deadline miss, which contradicts the initial assumption of a schedulable flow-set.
The second packet can be released at earliest immediately after the period, i.e. at T ( fi). Let ε
be an infinitesimally small but finite value, representing the delay of the second packet to leave
the source and reach the NoC. Thus, the time interval between two packet occurrences within any
router of the NoC can not be less than T ( fi)− (D( fi)−C( fi))+ ε > T ( fi)−D( fi)+C( fi).
Figure 2.6: Traversal of packets from two adjacent periods (time axis marks: 0, D(fi) − C(fi), D(fi), T(fi), T(fi) + C(fi))
Theorem 1 can be used to reduce the packet-level pessimism in the following way. Within
each router ρi that the flow f j traverses, a pair < f j, t∗ > is stored, where t∗ denotes the time-
stamp of the last traversal of f j through ρi. Subsequently, during the analysis, the next traversal of
f j through ρi would be possible only at t∗+T ( f j)−D( f j)+C( f j). Of course, every new feasible
traversal of f j through ρi would be followed by an update of the time-stamp.
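The time-stamp bookkeeping described above can be sketched as follows. The `RouterLog` class and its method names are hypothetical; the sketch follows the text in treating a traversal exactly at t∗ + T(fj) − D(fj) + C(fj) as feasible, and assumes that time starts at 0.

```python
class RouterLog:
    """Hypothetical per-router record of the last traversal time of each flow."""

    def __init__(self):
        self.last = {}   # flow name -> time-stamp t* of its last traversal

    def earliest_next(self, flow, period, deadline, latency):
        """Earliest instant at which `flow` may traverse this router again."""
        if flow not in self.last:
            return 0     # no earlier traversal recorded
        return self.last[flow] + period - deadline + latency

    def try_traverse(self, flow, t, period, deadline, latency):
        """Record the traversal at time t if it is feasible; report the outcome."""
        if t < self.earliest_next(flow, period, deadline, latency):
            return False
        self.last[flow] = t
        return True


# Made-up flow f4 with T = 100, D = 60 and C = 10, i.e. a separation of 50.
log = RouterLog()
first = log.try_traverse("f4", 20, period=100, deadline=60, latency=10)
too_early = log.try_traverse("f4", 69, period=100, deadline=60, latency=10)
on_time = log.try_traverse("f4", 70, period=100, deadline=60, latency=10)
```

An analysis that consults such a log before adding a blocking flow to the computation tree would prune the occurrences that Theorem 1 rules out.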
Similarly, the flow-level pessimism is related to the fact that, in the existing approach, it is
assumed that an arbitrary number of packets of a flow, released during adjacent periods, can reach
the same router successively. However, Theorem 1 demonstrated that there indeed exists a time
interval between the occurrences of two adjacent packets, and the same is true for an arbitrary
number of successive packets, as proven with Theorem 2.
Theorem 2. Consider a schedulable flow-set F. Within the time interval t, the maximum number
of packets of a flow fi ∈ F that can reach the same router is ⌈(t + D(fi) − C(fi)) / T(fi)⌉.
Figure 2.7: Traversal of packets from several successive periods (time axis marks: 0, D(fi) − C(fi), D(fi), T(fi), T(fi) + C(fi), ..., n · T(fi), n · T(fi) + C(fi))
Proof. Two cases need to be investigated:
1. t ≤ T(fi) − D(fi) + C(fi): Proven directly. From Theorem 1 it follows that at most 1 packet
can reach the same router within t. By substituting the value of t into the expression of
this theorem, it follows that
\[ \left\lceil \frac{t + D(f_i) - C(f_i)}{T(f_i)} \right\rceil \le \left\lceil \frac{T(f_i)}{T(f_i)} \right\rceil = 1, \]
which coincides with the result of Theorem 1.
2. t > T(fi) − D(fi) + C(fi): Proven by contradiction. Assume that within the time interval
t, ⌈(t + D(fi) − C(fi)) / T(fi)⌉ + 1 packets can reach the same router. Since
t + D(fi) − C(fi) > T(fi), it holds that ⌈(t + D(fi) − C(fi)) / T(fi)⌉ + 1 ≥ 3. Consider that the
first packet reached the router as late as possible, while all the other packets reached it as early
as possible, as illustrated in Figure 2.7. Consider the ⌈(t + D(fi) − C(fi)) / T(fi)⌉ − 1 packets
that are surrounded by the first and the last. These packets are referred to as the inner packets.
All the inner packets contribute to t with their entire periods, and therefore require a time
interval of at least (⌈(t + D(fi) − C(fi)) / T(fi)⌉ − 1) · T(fi), during which only these can reach
the router. According to Theorem 1, the first packet could not reach the router later than
T(fi) − D(fi) + C(fi) time units before the first inner packet. Finally, the last packet can reach
the router only ε time units after its period. Recall that ε is an infinitesimally small but finite
value representing the delay of the packet to leave the source and reach the NoC. The interval
needed by all these packets, which by assumption does not exceed t, therefore satisfies:
\[ \begin{aligned}
& T(f_i) - D(f_i) + C(f_i) + \left( \left\lceil \frac{t + D(f_i) - C(f_i)}{T(f_i)} \right\rceil - 1 \right) \cdot T(f_i) + \varepsilon \\
&= \left\lceil \frac{t + D(f_i) - C(f_i)}{T(f_i)} \right\rceil \cdot T(f_i) - D(f_i) + C(f_i) + \varepsilon \\
&\ge \left( \frac{t + D(f_i) - C(f_i)}{T(f_i)} \right) \cdot T(f_i) - D(f_i) + C(f_i) + \varepsilon \\
&= t + \varepsilon \le t
\end{aligned} \]
The contradiction has been reached.
Theorem 2 can be used to decrease the flow-level pessimism in the following way. Within each router $\rho_i$ that the flow $f_j$ traverses, not only the time-stamp of the last traversal is stored, as proposed for the reduction of the packet-level pessimism, but also the time-stamp of each traversal of $f_j$ through $\rho_i$. Specifically, for each flow $f_j$ that traverses $\rho_i$, the traversal information is stored in the form $\langle f_j, t^*_0, t^*_1, \ldots, t^*_n \rangle$, where $t^*_0$ denotes the time-stamp of the traversal of the first packet of $f_j$ through $\rho_i$, $t^*_1$ that of the second packet, etc. Subsequently, during the analysis, the traversal of $f_j$ through $\rho_i$ at the time instant $t$ will be possible only if
\[ n + 1 \le \left\lceil \frac{t + D(f_j) - C(f_j)}{T(f_j)} \right\rceil, \]
where $n$ denotes the number of packets of $f_j$ that already traversed $\rho_i$.
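The flow-level test described above can be sketched in a few lines of Python. This is only an illustrative sketch: the `Flow` record, the function names `max_packets` and `compliant_thm2`, and the numeric values are assumptions made for the example, and the time unit is left abstract.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    T: float  # period T(f)
    D: float  # deadline D(f)
    C: float  # C(f), as used in Theorems 1-2


def max_packets(flow: Flow, t: float) -> int:
    """Bound of Theorem 2 on the packets of `flow` reaching one router within t."""
    return math.ceil((t + flow.D - flow.C) / flow.T)


def compliant_thm2(flow: Flow, past_timestamps: list, t: float) -> bool:
    """True if one more traversal at time t keeps the packet count within the bound."""
    n = len(past_timestamps)  # packets of this flow that already traversed the router
    return n + 1 <= max_packets(flow, t)


f = Flow(T=100, D=60, C=10)          # illustrative values only
print(compliant_thm2(f, [], 40))     # first packet within t = 40
print(compliant_thm2(f, [0], 40))    # a second packet within t = 40
print(compliant_thm2(f, [0], 60))    # a second packet within t = 60
```

With these sample values, a second packet is rejected while t + D − C does not exceed one period, and accepted once it does.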
2.2.3 Branch and Prune (BP) Method
In this section, a novel method to compute worst-case traversal times of flows will be presented.
This method is called the Branch and Prune (BP) Method, and it renders tighter estimates than the
state-of-the-art approach [32]. This method is developed from the same fundamental idea as the
existing work, which is to recursively track the progress of flows over the NoC. However, the BP
method additionally takes into account the traffic characteristics, so as to reduce the packet- and
flow-level pessimism.
2.2.3.1 Overview
As in the existing approach, in the BP method the tracking of the flow progress remains the same.
That is, within each router that the flow traverses, a set of input links from which blocking flows
may arrive is considered. Then, for each blocking link, the flow which can cause the greatest
delay is identified. However, in the existing method [32], the maximum delay from each blocking
link was computed before the next blocking link was considered. This causes the depth-first
traversal of the computation tree. For example, when computing the worst-case traversal time for
f2 from Figure 2.2, first the blocking delay caused by f1 is computed, then the delay by f3, and
then f2 progresses to the next router. Note that the order in which the blocking links and hence
blocking flows are considered is entirely irrelevant. In other words, the computed value is the
same, irrespective of whether the blocking delay was computed first for f1, and then for f3, or the
other way around. This is possible, because the existing method does not take into account flow
characteristics.
Conversely, in the BP method, before the computation process starts, all possible orderings of blocking flows at the current router are constructed. This has to be done because, as explained later, in the BP method the flow ordering does matter. One ordering of flow traversals through the router is called an interfering scenario, and all possible orderings constitute the list of interfering scenarios (LIS). For the example with the flow f2, illustrated in Figure 2.2, the LIS for the traversal of f2 through ρ2 contains the following elements: LIS( f2,ρ2) = { f1, f3, f2}, { f3, f1, f2}, { f1, f2}, { f3, f2}, { f2}. Notice that this is equivalent to branching into several independent computation trees, where each interfering scenario represents one computation tree, as illustrated in Figures 2.8-2.9. Then, each scenario (branch) is checked to determine whether the flows can arrive in the given order without violating the constraints of Theorems 1-2. If an investigated scenario is infeasible, the entire branch related to it is pruned. The tests for compliance with Theorems 1-2 are applied every time a new branching occurs, which makes it possible to detect and prune infeasible scenarios (branches) as early as possible. Notice that pruning a given scenario in fact prunes the entire sub-tree that would result from it, which vastly reduces the search space. This is especially beneficial for loaded networks, where, due to the complex contention patterns, the solution space may grow exponentially with the number of flows and path lengths. Therefore, efficient and timely pruning can significantly reduce the computational complexity.
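The construction of the list of interfering scenarios can be sketched as follows. This is a hedged Python illustration, not the implementation used in the thesis: the function name `build_lis` and the representation of blocking links as lists of flow names are assumptions. It enumerates every ordering with at most one flow per blocking link and appends the analysed flow to each.

```python
from itertools import combinations, permutations, product


def build_lis(analysed, blocking_links):
    """Enumerate interfering scenarios for `analysed`.

    blocking_links: one list of candidate blocking flows per blocking link
    (hypothetical representation). Each scenario holds at most one flow per
    blocking link, in every possible order, followed by the analysed flow.
    """
    scenarios = []
    for k in range(len(blocking_links), -1, -1):       # how many links actually block
        for links in combinations(blocking_links, k):  # which links block
            for chosen in product(*links):             # one flow from each such link
                for order in permutations(chosen):     # every arrival order
                    scenarios.append(list(order) + [analysed])
    return scenarios


# Example of f2 with f1 and f3 arriving from two different blocking links.
for scenario in build_lis("f2", [["f1"], ["f3"]]):
    print(scenario)
```

For the two single-flow blocking links of the example, the sketch yields the five scenarios listed above.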
Figure 2.8: Computation tree for flow f2 from Figure 2.2 (BP method)
Figure 2.9: Computation subtrees constructed after branching in Figure 2.8: (a) Subtree B1, (b) Subtree B2, (c) Subtree B3, (d) Subtree B4, (e) Subtree B5
2.2.3.2 Necessity to Investigate All Scenarios
After pruning infeasible scenarios, any of the remaining ones may lead to the worst-case delay,
and therefore all of them have to be investigated. Note that although it is counter-intuitive, it is
not safe to assume that a scenario S1, whose elements are only a subset of some other scenario
S2, will always lead to a smaller delay than S2. Specifically, for the given example, it is not
safe to assume that S1 = { f1, f3, f2} (Figure 2.9(a)) would always lead to a larger delay than
S3 = { f1, f2} (Figure 2.9(c)) and S5 = { f2} (Figure 2.9(e)), and therefore discard S3 and S5 for
further consideration. The explanation is as follows. Although S3 and S5 may lead to the smaller
delay at the given router (local maximum), these scenarios may in the future progression allow
for other subscenarios which would be infeasible in S1, and which could contribute to the worst-
case delay of f2 with the significant blocking delay (global maximum). Therefore, all feasible
interfering scenarios have to be investigated.
Scenarios from LIS are investigated sequentially. When considering S1 = { f1, f3, f2}, for example, first the worst-case traversal time of f1 is computed recursively, then the worst-case traversal time of f3, and finally f2 is allowed to progress to the next router on its path. However, before
computing the delay for f1 (and the same applies to f3 later), the pessimism of the existing method
is reduced by applying the two optimization mechanisms that determine whether f1 is "feasible".
Being infeasible implies that it is impossible for a given flow to release a packet at the given time,
considering that it either released a packet too close in time relative to its previous packet (violat-
ing Theorem 1 and thereby accounting for the packet-level pessimism) or it has already exceeded
the upper bound on the number of packets it could possibly generate from the beginning of the
computation (violating Theorem 2 and thereby accounting for the flow-level pessimism). If the
flow f1 is deemed infeasible, then the entire interfering scenario S1 is discarded, and the algorithm
moves on to the next interfering scenario. Conversely, if f1 is deemed feasible, then the analysis
of S1 continues, and the traversal of f3 is investigated. Similarly, if f3 is deemed infeasible, the
entire interfering scenario S1 is discarded. Otherwise, the flow f2 progresses to the next router ρ5
(see Figure 2.2), and new scenarios are generated and subsequently investigated: S6 = { f4, f5, f2},
S7 = { f5, f4, f2}, S8 = { f4, f2}, S9 = { f5, f2}, S10 = { f2}.
2.2.3.3 Need for Ordering
Since the BP method considers the input flow characteristics, the order of flows within each sce-
nario cannot be ignored, as it can lead to different results. This is illustrated with an example given
in Figure 2.2. Consider two possible flow traversal sequences:
Seq1 = { f4, f5, f3, f5, f4, f2, f4, f5, f1} ∧ Seq2 = { f4, f5, f3, f4, f5, f2, f4, f5, f1}.
These two sequences differ only in the order in which f4 and f5 block f2. However, notice that in
Seq1 the first and the second packet of f5 are separated only by f3, while the second and the third
packet of f4 are separated only by f2. Conversely, in Seq2, any two packets of the same flow are
separated by the packets of at least two other flows. Depending on flow characteristics, in some
cases the entire Seq2 might be feasible, while Seq1 would require the pruning of some packets
of f4 and/or f5. Thus, considering only Seq1 in the analysis may result in unsafe worst-case
estimates, and in order to capture the exact worst-case, it is necessary to investigate all possible
flow orderings (scenarios) at every traversed router.
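The difference between the two sequences can be checked mechanically. The following Python sketch (the helper name `min_gap` is an assumption made for illustration) counts how many packets of other flows separate two consecutive packets of the same flow, and reports the smallest such separation in a sequence.

```python
def min_gap(seq):
    """Smallest number of other packets between two packets of the same flow.

    Returns None if no flow appears more than once.
    """
    last_seen = {}
    best = None
    for idx, flow in enumerate(seq):
        if flow in last_seen:
            gap = idx - last_seen[flow] - 1
            best = gap if best is None else min(best, gap)
        last_seen[flow] = idx
    return best


seq1 = ["f4", "f5", "f3", "f5", "f4", "f2", "f4", "f5", "f1"]
seq2 = ["f4", "f5", "f3", "f4", "f5", "f2", "f4", "f5", "f1"]
print(min_gap(seq1), min_gap(seq2))
```

A smaller separation leaves less time between successive packets of the same flow, which is what makes such a sequence more likely to violate Theorems 1-2.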
Indeed, at a given router ρi along the path of the flow under analysis f j, the list of interfering
scenarios can be constructed as explained above. However, identifying which blocking flows are
infeasible within each of the scenarios requires the information about the flow traversal history.
That is, it is necessary to keep track of which flows have already progressed, in which order,
and through which routers, before f j reaches ρi. Without this knowledge, the pruning mechanisms
would not be able to determine whether a flow listed in an interfering scenario is feasible or not.
This leads to the concept of the context, which is the key component of the BP method.
2.2.3.4 Context
The context is a data structure which stores all the information that characterises the unique se-
quence of flow progressions throughout the network, before the analysed flow fi reaches a certain
router ρj at the time instant t. The context contains the following information: (i) the order in
which the flows have progressed over the network in the interval [0, t], (ii) the time-stamps of
the past traversals of all directly or indirectly contending flows of fi through the routers on their
respective paths, and (iii) the delay incurred by fi before reaching ρ j. The context has all the infor-
mation that is necessary to check whether a traversal of a certain flow would violate Theorems 1-2.
In other words, the context is the infrastructure that is needed to perform an efficient pruning of
infeasible scenarios.
Notice that the context contains the information about the unique system state, i.e. it corre-
sponds to a unique sequence of flow traversals. However, when the branching occurs, the context
must also be maintained for each of the newly generated sequences of flow traversals. Therefore,
each flow sequence that may potentially lead to the worst-case scenario must have its private context. At the very end, when the analysed flow fi reaches the destination, WCTT( fi) will be found
by comparing all the flow sequences (i.e., the contexts) in which fi reaches its destination.
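One possible shape for such a context is sketched below in Python. The class name `Context`, its field names and the `branch` and `record` helpers are assumptions made for illustration only; the sketch merely mirrors the three items listed above and the requirement that every branch owns a private copy.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Context:
    sequence: list = field(default_factory=list)    # (i) order of flow progressions
    timestamps: dict = field(default_factory=dict)  # (ii) (flow, router) -> past time-stamps
    delay: float = 0.0                              # (iii) delay incurred by the analysed flow

    def branch(self) -> "Context":
        """Private copy for a newly generated sequence of flow traversals."""
        return copy.deepcopy(self)

    def record(self, flow: str, router: str, t: float) -> None:
        """Store one traversal of `flow` through `router` at time t."""
        self.sequence.append(flow)
        self.timestamps.setdefault((flow, router), []).append(t)


root = Context()
child = root.branch()
child.record("f1", "rho2", 5.0)  # hypothetical traversal
child.delay += 4.0               # hypothetical delay contribution

# The WCTT is the largest delay over all contexts that reach the destination.
print(max(ctx.delay for ctx in (root, child)))
```

The deep copy keeps the history of one branch from leaking into another, at the cost of the memory growth discussed later for large flow-sets.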
2.2.3.5 Algorithm
Algorithm 2 delay( fi, λm,n)
Input: flow fi, link λm,n
Output: WCTT of fi, starting from λm,n, to the destination
1: ctx.sequence ← ∅; ctx.delay ← 0; // the computation starts with the initial (empty) context
2: ctxSet ← getContexts( fi, λm,n, ctx); // find the set of all possible contexts
3: WCTT ← max { ctx.delay | ctx ∈ ctxSet }; // find the maximum delay
4: return WCTT;
The computation process starts by invoking Algorithm 2 with the following input parameters: the flow under analysis fi and the first link on its path λm,n. First, an initial context is created
(line 1). The initial context contains an empty flow sequence and a delay of zero. Then, in order
to obtain all possible contexts for the traversal of fi, Algorithm 3 is invoked (line 2). The context
with the largest delay is found by enumerating all contexts (line 3). Finally, the largest delay is
returned as the worst-case traversal time of fi.
All contexts are generated with Algorithm 3, which takes as input parameters (i) the flow under
analysis fi, (ii) the first link on its path λm,n, and (iii) the newly created initial context currCtx.
First, it is checked whether λm,n is null. If it is, this means that the header of fi reached the
destination, so the only remaining activities are to transfer the rest of the flits and to append fi
to the sequence of traversed flows (line 3). Otherwise, it is checked whether λm,n is the first link
on the flow path. If it is, this means that fi can be blocked on λm,n only by the flows with the
same source as fi. All such flows are grouped into the set Fsrc (lines 8− 10). Then, the list of
interfering scenarios LIS( fi,ρn) for the flow fi at the router ρn is created (line 11). LIS( fi,ρn)
contains all possible combinations of flows from Fsrc (including an empty set) with the flow fi
appended to the end of each flow sequence. Consider the flow f2 in the example illustrated in
Figure 2.2. LIS( f2,ρ2) = { f2}, because f2 is the only flow originating from the router ρ2.
Algorithm 3 getContexts( fi, λm,n, currCtx)
Input: flow fi, link λm,n, current context currCtx
Output: A set of contexts ctxSet
1: if (λm,n = null) then
2:   // the header flit reached the destination, so the remaining flits have to be transferred
3:   currCtx.delay += ⌈σ( fi) / σflit⌉ · δL; currCtx.sequence.Append( fi);
4:   return currCtx;
5: end if
6: if (get(L( fi), 1) = λm,n) then
7:   // λm,n is the first link of fi, so only the flows with the same source as fi can block
8:   for each ( fj ∈ F | fj ≠ fi ∧ get(L( fj), 1) = λm,n) do
9:     Fsrc ← Fsrc ∪ { fj}; // find all flows with the same source as fi
10:   end for
11:   LIS( fi, ρn) ← Set of local interfering scenarios based on Fsrc;
12: else
13:   λk,m ← getprev(L( fi), λm,n); // find λk,m, which fi traversed immediately before λm,n
14:   // fi can be blocked by flows which traverse λm,n, but have previous link different than λk,m
15:   for each (g ∈ N | λg,m ∈ λ ∧ g ≠ k) do
16:     BL ← BL ∪ {λg,m};
17:   end for
18:   for each (g ∈ N | λg,m ∈ BL) do
19:     // find the set of flows Fg,m that traverse λg,m and λm,n, and hence can block fi
20:     for each ( fj ∈ F | λg,m ∈ L( fj) ∧ λm,n ∈ L( fj)) do
21:       Fg,m ← Fg,m ∪ { fj};
22:     end for
23:   end for
24:   LIS( fi, ρn) ← Set of local interfering scenarios based on BL;
25: end if
26: GCList ← ∅;
27: for each (Si ∈ LIS( fi, ρn)) do
28:   SCList ← {currCtx};
29:   for each ( fj ∈ Si) do
30:     while (SCList ≠ ∅) do
31:       ctxk ← SCList.pop();
32:       if (CompliantThm1( fj, ctxk) ∧ CompliantThm2( fj, ctxk)) then
33:         if (get(L( fj), 1) = λm,n) then
34:           ctxk.delay += δL;
35:         else
36:           ctxk.delay += δρ + δL;
37:         end if
38:         FCListk ← getContexts( fj, getnext(L( fj), λm,n), ctxk);
39:       end if
40:     end while
41:     SCList ← ⋃∀k FCListk;
42:   end for
43:   GCList ← GCList ∪ SCList;
44: end for
45: return GCList;
Conversely, if λm,n is not the first link of fi, then fi can suffer blocking only from flows which
also traverse λm,n, but enter the router ρm from a different link than the one from which fi does
(which is λk,m). Thus, all such links are added to the set of blocking links BL (lines 15− 17).
For each blocking link λg,m ∈ BL, a set of flows Fg,m is created. Fg,m contains all flows that
traverse λg,m and subsequently compete with fi on λm,n (lines 18−23). Then, the list of interfering
scenarios LIS( fi,ρn) is created (line 24). LIS( fi,ρn) contains all combinations of flow sequences
in which there is at most one flow from each link in BL, and fi is appended to the end of each flow
sequence. Consider the flow f2 in the example illustrated in Figure 2.2. As already mentioned
and illustrated in Figures 2.8-2.9, LIS( f2,ρ2) = { f1, f3, f2}, { f3, f1, f2}, { f1, f2}, { f3, f2}, { f2}.
Similarly, when f2 progresses to the next router ρ5, it is LIS( f2,ρ5) = { f4, f5, f2}, { f5, f4, f2},
{ f4, f2}, { f5, f2}, { f2}.
Now, the algorithm investigates all scenarios from LIS( fi,ρn) in a sequential manner (line 27).
For each scenario Si, a list SCList is created (line 28). Ultimately, SCList will contain all possible
contexts that are produced as a result of applying this scenario at the given router to all existing
contexts.
When a certain scenario Si ∈ LIS( fi,ρn) has been selected (line 27), the flows from its flow
sequence are considered consecutively (line 29). The traversal of the first flow from the list,
denoted f j, is attempted assuming all possible contexts. Specifically, all existing contexts are
considered sequentially (line 30), and it is checked whether the traversal of f j violates Theorems 1-
2 for each given context (line 32). If it does, the combination of the current context ctxk and the
current scenario Si is discarded, which is equivalent to pruning one branch of the computation
tree. Otherwise, f j is allowed to progress (lines 33−37). Then, all traversal sequences of f j in the
subsequent routers are generated via a recursive call, and stored within the list FCListk (line 38).
After testing the traversal of f j upon all contexts, the SCList is updated with all newly produced
contexts (line 41), and the next flow from the current scenario Si is considered. This process is
repeated for all the flows in the current scenario Si.
Once the scenario Si has been applied to all current contexts, the resulting contexts are added to
the final list of all contexts GCList (line 43), and the next scenario from LIS( fi,ρn) is considered.
After applying all scenarios to all contexts, GCList will contain the complete list of all possible
contexts for the analysed flow fi. Finally, GCList is returned (line 45), and Algorithm 3
terminates.
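The loop nest of lines 27-44 of Algorithm 3 can be illustrated with a heavily simplified Python sketch. It is not the thesis implementation: the function name `apply_scenarios` is hypothetical, `is_feasible` stands in for the compliance tests of line 32, and `progress` stands in for the delay update and the recursive call of lines 33-38. Both are passed in as callbacks so that the sketch stays self-contained, and contexts are reduced to (sequence, delay) tuples.

```python
def apply_scenarios(lis, curr_ctx, is_feasible, progress):
    """Apply every scenario of `lis` to `curr_ctx` and collect surviving contexts."""
    gc_list = []
    for scenario in lis:                      # line 27
        sc_list = [curr_ctx]                  # line 28
        for flow in scenario:                 # line 29
            next_list = []
            for ctx in sc_list:               # lines 30-31
                if is_feasible(flow, ctx):    # line 32; otherwise the branch is pruned
                    next_list.extend(progress(flow, ctx))  # lines 33-38
            sc_list = next_list               # line 41
        gc_list.extend(sc_list)               # line 43
    return gc_list


# Toy stand-ins (assumptions): a flow may appear only once per sequence,
# and every traversal adds one unit of delay.
def toy_feasible(flow, ctx):
    return flow not in ctx[0]


def toy_progress(flow, ctx):
    return [(ctx[0] + (flow,), ctx[1] + 1)]


lis_f2 = [["f1", "f3", "f2"], ["f3", "f1", "f2"], ["f1", "f2"], ["f3", "f2"], ["f2"]]
contexts = apply_scenarios(lis_f2, ((), 0), toy_feasible, toy_progress)
print(len(contexts), max(delay for _, delay in contexts))
```

If the initial context already records a traversal of f1, the toy feasibility rule prunes every scenario containing f1, which mirrors how one infeasible flow discards a whole branch.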
2.2.4 Branch, Prune and Collapse (BPC) - More Efficient Method
The existing method proposed by Ferrandiz et al. [32] scales well, because the contexts are not
constructed and maintained, but only the maximum delay incurred at each router is retained. How-
ever, due to that fact, this method may derive pessimistic worst-case estimates. Conversely, the
proposed BP method takes into account traffic characteristics and constructs contexts, which helps
to identify an exact flow sequence that leads towards the worst-case traversal time of an analysed
flow. This allows the exact worst-case traversal times to be computed for all traffic flows. However, the
benefits of the BP method come at the price of significant computational complexity (processing
power) and significant spatial complexity (memory consumption), which suggests that BP does
not scale, and that it may not be the most efficient approach for large flow-sets.
These two extreme approaches may be merged into a hybrid method in the following way. Periodically, certain context information can be dropped, in order to reduce the computational and
spatial complexity of the method. In such cases, the maximum delay observed until the dropping
moment should be retained, and the computation process may continue. This method is called the
Branch, Prune and Collapse (BPC).
2.2.4.1 Overview
As already observed, the identification of infeasible sequences in the BP method was possible
due to the explicit book-keeping of all possible contexts. However, the computational and spatial
complexity of the BP method may render it inapplicable to large flow-sets. Motivated by this
fact, the BPC method is introduced. BPC presents a more generalised method, of which the BP
method and the existing method of Ferrandiz et al. [32] are special cases. Specifically, BPC has a
configurable parameter called the Sequence Information Retention Limit (SIRL), which creates a
trade-off between the analysis tightness (the amount of pessimism) on one side, and the computational and spatial complexity on the other. The parameter SIRL acts as a threshold on the number
of flow sequences whose contexts are retained.
In the BP method, all investigated flow sequences and their contexts are combined with all
possible scenarios at the current router, so as to generate new flow sequences for the next router
on the path of the analysed flow. Conversely, in the BPC method, when the number of investigated
sequences reaches a pre-set limit of SIRL, a new flow sequence containing a single dummy flow
is created. The context of this sequence is populated as follows: (i) the delay field is set to the
maximum of the delays of all sequences investigated thus far, and (ii) the history information, i.e.
the time-stamps of past packet occurrences of each flow at each router, is set to null (or zero,
as appropriate). This process is called the collapse phase, because during it a set of investigated
sequences is "collapsed" (merged) into a single dummy sequence containing a single dummy flow
with the conservative delay estimate, and no history information regarding the traversal of other
flows. Now, instead of continuing the computation process by managing all the sequences and
their contexts (whose number is at least SIRL), as would be done in the BP method, only this
one dummy sequence is created, and the method continues to behave in the same way as the BP
approach. This holds until the number of investigated sequences again exceeds SIRL, and thereby
invokes another collapse phase.
Note, that in cases where SIRL → ∞, the collapse phase never occurs, and the behaviour of
the BPC method is identical to that of the BP method. Conversely, in cases where SIRL = 1, the
collapse occurs for every combination of sequences and scenarios at every router, which is identi-
cal to the behaviour of the existing method [32]. Also, notice that the parameter SIRL controls the
frequency of collapse phases, and the smaller the SIRL is, the more frequently collapses occur.
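The collapse phase can be sketched as follows. This Python fragment is an illustration under stated assumptions: the `Context` record, the name `maybe_collapse` and the dummy flow label `fX` are hypothetical, and the collapse is triggered as soon as the number of sequences reaches SIRL.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Context:
    sequence: list
    delay: float
    timestamps: dict = field(default_factory=dict)  # traversal history


DUMMY_FLOW = "fX"  # hypothetical label for the dummy flow


def maybe_collapse(contexts, sirl):
    """Merge all contexts into one dummy context once their number reaches SIRL."""
    if not contexts or len(contexts) < sirl:
        return contexts                        # below the limit: behave like BP
    worst = max(ctx.delay for ctx in contexts)
    # Keep only the conservative delay; the traversal history is dropped.
    return [Context(sequence=[DUMMY_FLOW], delay=worst)]


many = [Context(sequence=[f"s{i}"], delay=float(i)) for i in range(1, 26)]
print(len(maybe_collapse(many, sirl=5)))          # collapsed
print(len(maybe_collapse(many, sirl=math.inf)))   # never collapses
```

Passing `math.inf` reproduces the BP behaviour, while `sirl=1` collapses every non-empty list, matching the two special cases noted above.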
2.2.4.2 Necessity to Create Dummy Sequence
At any intermediate stage of the analysis, when the number of sequences reaches the value of the
parameter SIRL, a single sequence that will provably lead to the WCTT cannot be detected. This
has already been explained in Sections 2.2.3.2-2.2.3.3, and this is the main reason why in the BP
method all contexts need to be combined with all scenarios, and subsequently why all resulting
contexts need to be retained for future computations. Therefore, during the collapse phase, from
the set of the constituent sequences, a specific single sequence cannot be solely carried forward
in the computation. This is due to the fact that a single sequence, that was recognised as the
local maximum at the intermediate stage, may in the later stages of the computation be subject to
pruning, because of its flow history, and thereby may not contribute to the global maximum delay.
In order to prevent this, the history information regarding the previous flow traversals is entirely
dropped (thereby reducing the chances of the collapsed sequence to be subject to pruning during
the later stages of the computation process). In fact, only the local maximum delay is retained
within a new dummy sequence, consisting of a single dummy flow. To summarize, during the
collapse phase, the BPC method creates a dummy sequence which inherits the delay of the local
maximum, but entirely drops the traversal history of the flows constituting that sequence.
2.2.4.3 Illustrative Example
The behaviour of the BPC method is illustrated with an example of flows given in Figure 2.10,
which is identical to the example depicted in Figure 2.2, but has been repeated here for the reader’s
convenience.
Figure 2.10: Traffic flows (revisited example 2)
Consider the flow f2 between the cores π2 and π7. f2
traverses the routers ρ2,ρ5, and ρ7. As mentioned in Sec-
tion 2.2.1, the flows f4 and f5 can potentially block f2 three
times: twice indirectly by blocking f1 and f3 at ρ5 ( f1 and
f3 block f2 directly at ρ2), and finally once directly at ρ5
when f2 reaches it. Thus, f4 and f5 are the promising can-
didates for pruning (line 32 of Algorithm 3).
Assume that SIRL = 5. On the link λc,2 the analysed
flow f2 can suffer blocking only from flows originating
from the same core π2. However, there are no such flows,
so f2 uninterruptedly progresses to the router ρ2. For the
next link on the path of f2, i.e. λ2,5, the BPC method con-
structs the LIS for f2 as follows: LIS( f2,ρ2) = { f1, f3, f2},
{ f3, f1, f2}, { f1, f2}, { f3, f2}, { f2}. First, the scenario
{ f1, f3, f2} is explored. At this time, the list SCList is reset to the current (initial) context (at line 28 of Algorithm 3). Note that SCList will ultimately contain all generated contexts arising from the
analysis of this scenario { f1, f3, f2}, before being appended to the global list of contexts GCList
(line 43 of Algorithm 3). Now, a recursive call is invoked for the router ρ5 with f1 being the
analysed flow. This will result in a new LIS constructed at ρ5 as: LIS( f1,ρ5) = { f4, f5, f1},
{ f5, f4, f1}, { f4, f1}, { f5, f1}, { f1}. Similarly, LIS is generated for the flow f3 at the router ρ5
as: LIS( f3,ρ5) = { f4, f5, f3}, { f5, f4, f3}, { f4, f3}, { f5, f3}, { f3}.
These sequences are propagated back, and should be combined with the current (initial) con-
text before f2 progresses to ρ5. As both LIS( f1,ρ5) and LIS( f3,ρ5) contain 5 elements, the com-
bined set of sequences may contain 25 elements (i.e. 5 sequences from f1 combined with 5 from
f3). It is obvious that for large flow-sets this back-propagation may produce a large number of
sequences and even cause intractability, which is the main drawback of the BP method. Given
that SIRL = 5 in this example, the collapsing requirement is met. Those 25 sequences are col-
lapsed into a single one, which contains a dummy flow fX , while its delay is set to maximum
delay amongst all collapsed sequences. When f2 finally progresses to ρ5 and encounters the po-
tentially blocking flows f4 and f5, the method checks the context related to the current sequence
{ fX}. Since there is no prior information regarding f4, nor f5, the algorithm considers that these
two flows are arriving for the first time and thus allows them to pass and additionally block f2.
The resulting sequences would be { fX , f4, f5, f2}, { fX , f5, f4, f2}, { fX , f4, f2}, { fX , f5, f2} and
{ fX , f2}. Conversely, BP would have retained the contexts for all sequences, which would make
it possible to prune further traversals of f4 and f5, but at the expense of investigating
25 · 5 = 125 sequences. Note that this includes only the investigation of the first scenario
{ f1, f3, f2}. Similarly, the number of sequences that can arise from the scenario { f3, f1, f2} is also
$5^3$. The number of sequences from each of the scenarios { f1, f2} and { f3, f2} is $5^2$, while the number of
sequences from the scenario { f2} is $5^1$. Thus, the total number of possible flow sequences for this
example is $5^3 + 5^3 + 5^2 + 5^2 + 5^1 = 305$.
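The total can be reproduced with a short Python calculation. It assumes, as in the example, that every flow of a scenario meets a list of five interfering scenarios at ρ5 and that nothing is pruned; the constant name `LIS_SIZE_AT_RHO5` is introduced only for this sketch.

```python
LIS_SIZE_AT_RHO5 = 5  # five interfering scenarios per flow at rho5 (assumed, no pruning)

scenarios_at_rho2 = [
    ["f1", "f3", "f2"],
    ["f3", "f1", "f2"],
    ["f1", "f2"],
    ["f3", "f2"],
    ["f2"],
]

# Each flow in a scenario multiplies the number of sequences by the LIS size.
per_scenario = [LIS_SIZE_AT_RHO5 ** len(s) for s in scenarios_at_rho2]
total = sum(per_scenario)
print(per_scenario, total)
```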
2.2.4.4 Proof of Safety of BPC
From the previous example it is obvious that the BPC method, by collapsing upon reaching the
SIRL threshold and hence discarding some history information, may output more pessimistic
WCTT estimates than the BP method. In this section, it is proven that the values obtained by
the BPC method will under no circumstance lead to an unsafe WCTT estimate.
Let wcs (short for the worst-case sequence) be the flow sequence leading to the worst-case
delay of the analysed flow. By definition, wcs is a feasible flow sequence. In order to prove that
BPC is safe, it will be proven that it will never eliminate wcs from the set of investigated sequences,
i.e. BPC will never return a value which is smaller than the delay of wcs.
First, it should be noted that if SIRL→ ∞, then the BPC method behaves like the BP method:
it performs an exhaustive enumeration of all possible sequences at each router, and thus considers
all possible blocking patterns of other flows during the traversal of the analysed flow (a brute-force
approach, which is inherently safe). The pruning mechanisms (line 32 of Algorithm 3) exploit the
information regarding the previous flow occurrences, in order to identify infeasible flow sequences,
and in that way reduce the list of sequences that need to be investigated. By definition, wcs is a
feasible flow sequence and therefore, it will not be eliminated by these pruning techniques.
Given the loss of the history information during the collapse phase, the BPC method is unable
to identify as many infeasible flow sequences as the BP method. These infeasible sequences
contain flows that cannot traverse that frequently due to Theorems 1-2, and hence additionally
contribute to the worst-case traversal time of the analysed flow. Eventually, the set of explored
sequences will also include the set of feasible sequences, which includes wcs, but will also include
some infeasible sequences, which were not identified due to loss of history information during
the collapse phase. Finally, after computing the maximum traversal time of all the sequences, the
method will return a value which is greater than, or equal to the WCTT corresponding to wcs, and
therefore the method is safe.
To summarize, the BP method investigates all feasible flow sequences (including wcs) and all
infeasible ones. The infeasible sequences are pruned due to the retained contexts, and thereby the
method is safe. BPC investigates all feasible and infeasible sequences, prunes only a fraction of
infeasible ones, and as a consequence is still safe but may be pessimistic.
2.2.4.5 Proof of Termination of Algorithm 3
Let S be the set of flow-link pairs 〈 fi,λk,l〉, where fi ∈ F and λk,l ∈ λ , such that 〈 fi,λk,l〉 ∈
S ⇐⇒ λk,l ∈ L ( fi). Since λ and F are finite sets, it holds
that S has a finite number of elements as well. A progress of a flow fi from a link λk,l to a
subsequent link λl,m on its path is equivalent to the progress from the pair 〈 fi,λk,l〉 to the pair
〈 fi,λl,m〉. If some flow f j blocks the flow fi on the link λk,l , it corresponds to the progress from
the pair 〈 fi,λk,l〉 to the pair 〈 f j,λk,l〉. For a given flow fi, and a current link λk,l , the algorithm
progresses in a forward manner to the next link getnext(L ( fi),λk,l) on the path of the flow fi by
invoking the function getContexts() (line 38 of Algorithm 3). Starting from any pair 〈 fi,λk,l〉 ∈ S ,
Algorithm 3 investigates all flow-link pairs from S that are reachable, assuming the round-robin
arbitration policy. Then, the algorithm is recursively invoked for every transitioned pair. From the
deadlock-free property of the X-Y routing mechanism [40] it follows that the initial pair 〈 fi,λk,l〉
will never be revisited. As soon as all the explored contexts are popped (line 31 of Algorithm 3),
the algorithm will terminate.
2.2.5 Experimental Evaluation
In this section, the experimental evaluation is performed with the dual objective: to compare
the BP and BPC methods with the existing approach [32], and to study the impact of different
parameters on the tightness of derived WCTT estimates. The analysis parameters are summarized
in Table 2.1, where an asterisk sign denotes a randomly generated value, assuming a uniform
distribution.
2.2.5.1 Experiment 1: BPC vs Existing Method of Ferrandiz et al. [32]
The improvements of the BPC method over the existing one cannot be quantified in a general
sense, because the results largely depend on the characteristics of the flow-set upon which the
techniques are applied. Therefore, in order to observe the trends and the ranges of improvements
that are achieved by employing the proposed approach, the comparison is performed on a wide
range of different flow-sets.
Table 2.1: Analysis parameters for Section 2.2.5
Parameter                              Value
NoC topology and size                  2-D mesh with 8×8 routers
Router frequency νρ                    250 MHz
Routing delay δρ                       3 cycles (12 ns)
Link traversal delay δL                1 cycle (4 ns)
Flow packet size σ( fi), ∀ fi ∈ F      512 bytes
Link width = flit size σflit           16 bytes
Testing platform                       Intel dual-core desktop & Java (Max heap-size: 4 GB)
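The timing values of Table 2.1 follow directly from the router frequency, and can be cross-checked with a few lines of Python. The variable names are assumptions made for this sketch; the last quantity applies the term ⌈σ( fi)/σflit⌉ · δL from line 3 of Algorithm 3 to the packet and flit sizes of the table.

```python
import math

ROUTER_FREQ_HZ = 250_000_000   # 250 MHz
PACKET_BYTES = 512             # flow packet size
FLIT_BYTES = 16                # link width = flit size

cycle_ns = 1_000_000_000 / ROUTER_FREQ_HZ          # duration of one cycle
routing_delay_ns = 3 * cycle_ns                    # routing delay, 3 cycles
link_delay_ns = 1 * cycle_ns                       # link traversal delay, 1 cycle
flits_per_packet = math.ceil(PACKET_BYTES / FLIT_BYTES)
flit_transfer_ns = flits_per_packet * link_delay_ns  # term of Algorithm 3, line 3

print(cycle_ns, routing_delay_ns, link_delay_ns, flits_per_packet, flit_transfer_ns)
```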
Experiment 1a (NoC with moderate number of flows): 200 random flow-sets are generated,
each consisting of 64 flows. One flow originates from each core and terminates at a randomly
generated destination by following an X-Y routed path. The deadline and the period of each flow
are randomly generated values in the range [20−100] µs, assuming a uniform distribution, where
D( fi)≤ T ( fi),∀ fi ∈F . The upper-bounds on the worst-case traversal times (WCTT) of each flow
are computed using both approaches. Subsequently, the results are compared. For the BPC
method, the parameter SIRL is set to 10000.
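The flow-set generation described above can be illustrated with a minimal Python sketch. It is not the generator used for the experiments (which were run in Java). The function names, the exclusion of flows addressed to their own source core, the ordering of two uniform draws to enforce D ≤ T, and the modelling of a route as a list of router-to-router links are all assumptions made for illustration.

```python
import random

MESH = 8  # 8x8 mesh, as in Table 2.1


def xy_route(src, dst):
    """Return the directed router-to-router links of an X-Y route."""
    (x, y), (dx, dy) = src, dst
    links = []
    while x != dx:                       # first travel along the X dimension
        nx = x + (1 if dx > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dy:                       # then travel along the Y dimension
        ny = y + (1 if dy > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links


def random_flow_set(rng, flows_per_core=1, lo=20.0, hi=100.0):
    """One flow-set: `flows_per_core` flows per core, D and T in [lo, hi] us."""
    cores = [(x, y) for x in range(MESH) for y in range(MESH)]
    flows = []
    for src in cores:
        for _ in range(flows_per_core):
            # Assumption: the destination differs from the source core.
            dst = rng.choice([c for c in cores if c != src])
            # Assumption: draw two uniform values and order them so D <= T.
            a, b = rng.uniform(lo, hi), rng.uniform(lo, hi)
            flows.append({
                "src": src,
                "dst": dst,
                "route": xy_route(src, dst),
                "D": min(a, b),
                "T": max(a, b),
            })
    return flows


flow_set = random_flow_set(random.Random(1))   # 64 flows, as in Experiment 1a
```

Calling random_flow_set with flows_per_core=2 and the range [100, 1000] µs would correspond to the setup of Experiment 1b, under the same assumptions.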
In order to quantify the range of improvements, a metric called the Percentage Improvement
Ratio (PIR) is used:
\[ \mathit{PIR} = \frac{100 \cdot (\mathit{WCTT}_U - \mathit{WCTT}_O^{10000})}{\mathit{WCTT}_U} \]
where $\mathit{WCTT}_U$ denotes the upper-bound on WCTT computed by the existing approach, which is
also referred to as the unoptimized WCTT, and $\mathit{WCTT}_O^{10000}$ is the value computed by the BPC
method for SIRL = 10000, which is also referred to as the optimized WCTT. Therefore, PIR = 25%
implies that BPC derived a 25% smaller (tighter) WCTT upper-bound.
Figure 2.11(a) illustrates the results. For 31.84% of the flows, the bounds computed by both
methods are equal, that is $\mathit{WCTT}_O^{10000} = \mathit{WCTT}_U$, and hence PIR = 0%. For the remaining
68.16% of the flows, the BPC method renders tighter WCTT estimates. Specifically, 1.29% of the
flows have PIR in the range [1−10%], 13.22% in the range [11−20%], and so on. At the higher end
of the PIR scale, for 3.55% of the analysed flows the BPC method computed [71−100%) tighter
WCTT bounds.
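The PIR metric is simple enough to be stated as a short Python sketch. The function name and the three pairs of bounds below are hypothetical values chosen for illustration; they are not taken from the experiments.

```python
def pir(wctt_u, wctt_o):
    """Percentage Improvement Ratio of an optimized bound over an unoptimized one."""
    if wctt_u <= 0:
        raise ValueError("the unoptimized bound must be positive")
    return 100.0 * (wctt_u - wctt_o) / wctt_u


# Hypothetical (unoptimized, optimized) WCTT bounds in microseconds.
bounds = [(200.0, 150.0), (80.0, 80.0), (120.0, 30.0)]
ratios = [pir(u, o) for u, o in bounds]

# Share of flows for which both methods returned the same bound (PIR = 0%).
unchanged_share = sum(1 for r in ratios if r == 0.0) / len(ratios)
```

The first pair yields PIR = 25%, matching the interpretation given above: the optimized bound is 25% smaller than the unoptimized one.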
The WCTT* parameter: If the computation finishes with the number of investigated sequences
not exceeding SIRL, this means that no collapses occurred, and that the BPC method computed a
value of the traversal time as would be computed by the BP method. This further implies that all
possible flow sequences were investigated, and the one causing the greatest worst-case delay to
the analysed flow was identified. The delay of such a sequence is denoted WCTT∗.
Conversely, in cases where collapses do occur, the returned value represents only an upper-bound
on the worst-case traversal time of the analysed flow, without any additional information on
how tight that bound actually is. Viewed from that perspective, the existing approach [32]
is a special case of the BPC method where SIRL = 1. Therefore, the existing approach
obtains WCTT∗ only when the number of possible flow sequences is equal to 1.
Figure 2.11: Distribution of WCTT improvement across flows (legends represent improvement ranges).
(a) NoC with moderate number of flows. (b) NoC with high number of flows.
Of the 31.84% of the flows for which both methods returned equal WCTT values (i.e.
$\mathit{WCTT}_O^{10000} = \mathit{WCTT}_U$), for three quarters of them (23.96% of all the flows) the existing
method was able to capture WCTT∗, implying that these cases were simple and involved the
investigation of a single flow sequence. Therefore, in these cases there was no scope for
improvements.
Based on the computed bounds, it can be concluded that the BPC method either performs equally
well or dominates the existing method. Also, for the selected SIRL value, BPC managed to capture
WCTT∗ in 92.13% of the cases, implying that any additional increase in SIRL would not provide
significantly tighter WCTT bounds, but would require a factorially greater amount of computation
time.
The analysis completed within 24 hours, averaging slightly more than 7 minutes per flow-set
(each with 64 flows). The most complex flow-set took around an hour to compute all WCTT values.
This finding suggests that the execution times may vary drastically when analysing flow-sets
with identical characteristics but different flow routes. Indeed, for flow-sets with complex
contention patterns, the analysis time can be an order of magnitude greater than the average
analysis time for flow-sets with similar traffic characteristics.
Experiment 1b (NoC with high number of flows): The main purposes of this experiment
are to test the scalability potential of the BPC method, and to observe its efficiency when applied
to larger flow-sets. Again, 200 random flow-sets are generated, each consisting of 128 flows,
with two flows originating from each core, but terminating at a random destination by following
an X-Y routed path. Flow deadlines and periods are randomly generated values in the range
[0.1− 1] ms, assuming a uniform distribution, where D( fi) ≤ T ( fi),∀ fi ∈F . For all flows, the
values of $\mathit{WCTT}_U$ and $\mathit{WCTT}_O^{10000}$ were computed and compared. The analysis completed in
5 days, averaging 36 minutes per flow-set. The most complex ones consumed around 3 hours,
demonstrating that the BPC method is scalable and applicable to practical scenarios involving
hundreds of concurrent flows.
Figure 2.12: BPC method with varying SIRL vs. Ferrandiz et al. [32] method (x-axis: SIRL = 4, 20,
100, 200, 1,000, 2,000, 4,000 and 10,000; y-axis: percentage of flows, 0% to 100%; legend: PIR
ranges from "No improvement" and "1−10% improvement" up to "91−100% improvement")
As in the previous experiment, the PIR metric is used to quantitatively express the improvements
of the BPC method over the existing one [32]. Figure 2.11(b) illustrates the results. For
9.23% of the flows, no improvements were made (PIR = 0%). For most of the flows without any
improvements (8.11% of all the flows), the existing method indeed managed to capture WCTT∗,
which implies that for these simple sequences consisting of a single flow no improvements were
possible. For the rest, i.e. 90.77% of the analysed flows, it holds that
$\mathit{WCTT}_O^{10000} < \mathit{WCTT}_U$, that is, the BPC method rendered tighter estimates. It is
interesting to see that, for more than 13% of the flows, the improvements were in the range
[61− 70%], while for more than 8% of them the improvements were greater than 70%.
Due to more complex traffic patterns, resulting from the increased amount of traffic, BPC
with SIRL = 10000 identified WCTT∗ for 41.71% of the flows, which is significantly less than
the 92.13% obtained for the moderately loaded NoC. This suggests that in these cases further
improvements can be achieved by increasing SIRL beyond 10000, but at the expense of additional
computational and spatial complexity. Although the computation time of the BPC method is
longer than that of the existing method, BPC clearly dominates the existing approach in terms of
the tightness of the obtained bounds. The selection of the parameter SIRL creates a trade-off
between the computational and spatial complexity on one side, and the analysis tightness on the
other, as will be investigated in the next experiment.
2.2.5.2 Experiment 2: Impact of SIRL on Analysis
The objective of this experiment is to observe the impact of SIRL on the tightness of computed
WCTT estimates. Intuitively, retaining more information about flow sequences provides more
opportunities for pruning (eliminating infeasible flow sequences), and therefore leads to tighter
estimates. To validate this assumption, considering the flow-sets from Experiment 1b, the BPC
method is invoked and the WCTT estimates are obtained for different values of the parameter
SIRL. Subsequently, the obtained bounds are compared against the bounds obtained by the existing
method [32]. The results are illustrated in Figure 2.12. As in Experiment 1, the PIR metric is used.
Figure 2.13: Inter-SIRL ratios. (a) Identified worst-cases (WCTT*) across SIRLs (x-axis: limit
size, from 1 to 10000; y-axis: quantity of captured WCTT*, in % of total flow-set). (b) Improvements
across SIRLs (x-axis: improvement category: Anomalies, Equal, 1−10%, 11−20%, 21−30%, 31−100%;
y-axis: quantity of flows, in % of total flow-set; series SIRL1 / SIRL2: 100 / 40, 256 / 100,
640 / 256, 1600 / 640, 4000 / 1600, 10000 / 4000).
It is visible that as SIRL increases, the percentage of flows with no improvements decreases.
Thus, with SIRL = 4, the WCTT estimates computed for 43.6% of the flows exhibit no improvements,
while with SIRL = 2000 only 9.29% of the flows show no improvements (and the remaining
90.71% of the flows have tighter WCTT bounds). Notice that the percentage of flows within high
PIR categories increases as SIRL increases. This is in accordance with the method rationale that
the retention of contexts, and within them the information regarding past flow traversals, can
provide opportunities for pruning and tightening the WCTT estimates. However, as seen in the
shift from SIRL = 4000 to SIRL = 10000, the distribution of improvements across PIR categories
does not differ much, because almost all opportunities for pruning infeasible sequences are
exhausted. This further implies that choosing limits beyond a given SIRL will only burden the
system with retaining the contexts of flow sequences which have very small chances of leading to
a tighter WCTT estimate. So, a judicious decision must be taken by the system designer,
considering the desired tightness of results and the time in which the analysis must complete.
In that respect, the BPC method provides full flexibility via the parameter SIRL.
2.2.5.3 Experiment 3: Inter-SIRL Ratios
In the previous experiment, the BPC method with different values of the SIRL parameter was
compared against the existing approach [32]. In order to get a deeper insight into the impact of
SIRL on the tightness of derived bounds, in this experiment the bounds are obtained assuming
BPC with different SIRL values, where the flow-sets from Experiment 1b are used. The obtained
bounds are compared against each other, and subsequently the results are plotted in Figure 2.13.
The results coincide with intuition, suggesting that greater values of SIRL improve the chances
of capturing WCTT∗. This claim is confirmed by the logarithmic growth in the number of
non-collapsed sequences as SIRL increases (Figure 2.13(a)).
Figure 2.13(b) demonstrates that the relative improvements across SIRLs diminish as SIRL
increases. That is, in 60% of the cases the BPC method with SIRL = 100 shows improvements
over BPC with SIRL = 40, while improvements are reported in only 30% of the cases when
comparing the results of SIRL = 10000 and SIRL = 4000. Thus, the number of cases with no
improvements increases with SIRL. Conversely, as SIRL increases, the cases with improvements
decrease across all improvement ranges, suggesting that it may not be efficient to perform the
analysis with very high values of SIRL. The diminishing benefits are especially pronounced for
sequences comprising flows that can occur at most once within the observed interval.
Figure 2.14: All-to-one contending flows (flows f1 to f9)
As already stated, the value of SIRL influences the frequency of collapses. However,
one interesting and counter-intuitive observation is that a higher SIRL does not necessarily
lead to a tighter WCTT upper-bound. This is explained with the following example. Consider
the flow f2 from the example depicted in Figure 2.10. Assume that f4 and f5 are potential
candidates for pruning. Now, assume that the analysis with a greater SIRL performs a collapse
between the occurrences of f1 and f2. As the history information is lost, the flows f4 and f5
will contribute to the delays of both f1 and f2. On the other hand, the analysis with a smaller
SIRL might trigger a collapse, for example, before (and after) the appearance of f1 and f2. In
this case, it may successfully prune one appearance of f4 and f5, thereby causing the situation
(which is referred to as an anomaly) where, with a smaller value of SIRL, BPC returns a tighter
WCTT estimate. As is visible from the results, the number of anomalies never exceeds 8%, for all
the analysed flow-sets.
2.2.5.4 Experiment 4: All-to-One Case Study
Figure 2.14 shows one flow-set where all flows have the same destination. This example is used
to demonstrate some interesting properties. Consider the flow f1. The flows f2, f3, f4, f5, f6, f7,
f8 and f9 are the other flows which can block f1. By applying the existing approach [32], one of
the sequences which might lead to the worst-case scenario for f1 is: Seq1 = { f9, f8, f9, f5, f9,
f4, f9, f8, f9, f5, f9, f3, f9, f8, f9, f5, f9, f4, f9, f8, f9, f5, f9, f2, f9, f8, f9, f5, f9, f4, f9, f8,
f9, f5, f9, f3, f9, f8, f9, f5, f9, f4, f9, f8, f9, f5, f9, f1}. This sequence is a good example of
potential packet-level and flow-level pessimism, and illustrates the case of a highly over-estimated
WCTT when infeasible sequences are not pruned. Flows f9 and f8 are positioned in such a way
that they can frequently block the other flows. That is, in the aforementioned sequence, during
one traversal of f1, the flows f9 and f8 appear 24 and 8 times, respectively. However, due to their
characteristics, it may be impossible for the packets of f9 and f8 to appear so often, implying
that the computed sequence may be overly pessimistic. Yet, in order to assess whether the
aforementioned sequence is indeed pessimistic, and to what extent, flow characteristics must be
taken into account. In the remainder of the experiment the focus will be on this aspect.
Figure 2.15: Impact of MIRT on flow f1. (a) Number of blockings by f9 vs. MIRT (x-axis: MITR in
microseconds, 0 to 90; y-axis: blocking flow occurrences, 0 to 25). (b) WCTT of f1 vs. MIRT
(x-axis: MITR in microseconds, 0 to 90; y-axis: WCTT in microseconds, 90 to 190; series:
Unoptimized, Optimized).
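The occurrence counts quoted for Seq1 can be checked mechanically. The short Python sketch below transcribes the sequence by flow index; the decomposition into a repeated 11-element block is merely a compact way of writing it down and is an assumption of this sketch, not a claim about how the existing approach [32] constructs the sequence.

```python
from collections import Counter

# Seq1 written with flow indices (9 stands for f9, 8 for f8, and so on).
block = [9, 8, 9, 5, 9, 4, 9, 8, 9, 5, 9]
seq1 = block + [3] + block + [2] + block + [3] + block + [1]

counts = Counter(seq1)
length = len(seq1)          # number of elements in the sequence
f9_blockings = counts[9]    # appearances of f9 during one traversal of f1
f8_blockings = counts[8]    # appearances of f8 during one traversal of f1
```

The sequence has 48 elements, of which 24 are occurrences of f9 and 8 are occurrences of f8, in line with the figures stated in the text.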
As already observed, the flows originating closer to the destination have a higher tendency
to block f1 directly and indirectly (by blocking the other flows which are also on its path). To
verify this, the deadline and the period of each flow are assigned the same value, which is equal
to the worst-case traversal time of the respective flow, computed with the existing method [32].
For example, the deadline and period of the flow f9 are assigned the delay of the sequence Seq9 =
{ f1, f9}. Similarly, the deadline and period of f8 are assigned the delay of the following sequence:
Seq8 = { f9, f1, f9, f8}. This process is repeated for all flows, and it ensures that all flows can
generate their packets as frequently as possible, while still being schedulable.
Assuming this flow-set, the worst-case traversal times are computed for each flow. Then, the
periods of all flows are increased by the value of the newly introduced parameter MITR, while
the deadlines remain the same, i.e. Dnew( fi) = Dold( fi) ∧ Tnew( fi) = Told( fi)+MITR, ∀ fi ∈F .
The worst-case traversal times are computed again, and the obtained values are compared. The
objective of this experiment is to see how the obtained values change with the increase in flow
periods (parameter MITR).
Figure 2.15 shows that as MIRT increases, the number of times the other flows can block
f1 decreases, and thus the WCTT estimate of f1 decreases, as expected. In contrast, since the
existing approach [32] does not take into account flow characteristics, the infeasible sequences are
not pruned and as a result, irrespective of the change in the flow parameters, the obtained WCTT
bounds remain constant (see solid line in Figure 2.15(b)).
2.2.6 Discussion
Branch, Prune and Collapse (BPC) is a method for the worst-case analysis of wormhole-switched
round-robin-arbitrated NoCs. BPC uses a branch and prune technique, which improves over the
work of Ferrandiz et al. [32] by taking into account the flow characteristics, and thereby provides
tighter upper-bound estimates on the worst-case traversal times of traffic flows. In order to tackle
the complexity issues of the branch and prune technique, a collapse phase was introduced. The
concept of collapses allows the system designer to efficiently use the BPC method, and, via a
configurable parameter, control the trade-off between the computational and spatial complexity
on one side and the analysis tightness on the other. A large set of experiments demonstrated the
performance of the proposed method in comparison with the existing approach. In particular, BPC
dominates the state-of-the-art method by yielding tighter WCTT estimates at the cost of additional
computational and spatial resources, where these effects can be, to some extent, mitigated by the
right selection of the configurable parameter.
This inability of the existing methods to efficiently derive tight bounds for large complex
flow-sets is the consequence of two inherent properties of the round-robin arbitration policy itself,
namely the indirect contentions and the commutative property. As already noted, each flow can
be blocked not only by other flows with which it directly competes for some links on its path, but
also by other flows with which it does not. Consequently, the set of possible flow sequences that
needs to be investigated constitutes a vast solution space, which may be impossible to explore
within a reasonable time, even for small flow-sets. For example, for the flow-set of only
5 flows, illustrated in Figure 2.10, there are 305 possible flow sequences, while for the example
of 9 flows illustrated in Figure 2.14, a single flow sequence may contain 48 elements. Moreover,
the commutative property applied to flow blocking may hold in the context of the round-robin
arbitration policy. In other words, if two flows fi and f j compete for some link on their path, in
some cases fi would be given precedence, while in others f j would be allowed to progress.
This implies that fi can block f j, but the opposite is also true. Since the arbitration decisions are
based on the router state at runtime (previous routing decisions), it is impossible to predict routing
decisions at design-time, which means that this arbitration non-determinism has to be covered in
the worst-case analysis either by exploring all possible scenarios, or by making some pessimistic
assumptions. Thus, it can be concluded that, despite its high popularity and wide presence in
currently available many-core interconnects, the round-robin arbitration policy may not be the
most efficient arbitration technique for real-time many-cores.
2.3 NoCs with Priority-Preemptive Arbitration Policies
In this section, the focus is on priority-preemptive arbitration policies. As already discussed and
demonstrated in Chapter 1, by allowing priority-based preemptions among flows, the effects of
indirect contentions are significantly mitigated, which will be explored in detail later in this
section. Moreover, for schemes with fixed flow priorities, the commutative property applied to flow
preemptions does not hold. In other words, if two flows fi and f j contend for some link on their
paths, such that P( fi) > P( f j), then fi would always preempt f j. Both of the aforementioned facts
allow the worst-case analysis to be performed with much less pessimism and/or computational
complexity, which suggests that priority-preemptive NoCs may be a preferable interconnect medium
for real-time many-cores. Before the state-of-the-art methods for the worst-case analysis are
presented, several basic concepts are introduced.
Definition 1 (Directly contending flow). If a flow f j shares a part of the path with the flow under
analysis fi, and has a higher priority than fi, it is considered as the directly contending flow, and
it belongs to the set FD( fi). Formally:
\[ \forall f_j \in \mathcal{F} \mid P(f_j) > P(f_i) \wedge \mathcal{L}(f_j) \cap \mathcal{L}(f_i) \neq \emptyset \Rightarrow f_j \in \mathcal{F}_D(f_i) \]
Definition 2 (Indirectly contending flow). If a flow fk does not share a part of the path with the
flow under analysis fi, but shares it with some other flow f j, which is either directly or indirectly
contending flow of fi (recursive definition), and has a higher priority than f j, it is considered as
the indirectly contending flow, and it belongs to the set FI( fi). Formally:
\[ \forall f_k \in \mathcal{F} \mid f_k \notin \mathcal{F}_D(f_i) \wedge \exists f_j \in \mathcal{F} \wedge f_j \in \{\mathcal{F}_D(f_i) \cup \mathcal{F}_I(f_i)\} \wedge f_k \in \mathcal{F}_D(f_j) \Rightarrow f_k \in \mathcal{F}_I(f_i) \]
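Definitions 1 and 2 can be sketched in Python as follows. The data layout (a dictionary with a numeric priority and a set of link names per flow), the convention that a larger number means a higher priority, and the three flows f_high, f_mid and f_low with their link names are all hypothetical and serve only to illustrate the recursive definition.

```python
def direct_set(name, flows):
    """F_D(name): higher-priority flows sharing at least one link with `name`."""
    me = flows[name]
    return {other for other, f in flows.items()
            if f["prio"] > me["prio"] and f["links"] & me["links"]}


def indirect_set(name, flows):
    """F_I(name): flows outside F_D(name) that directly contend with a member
    of F_D(name) or of F_I(name) (recursive definition, unrolled as a worklist)."""
    fd = direct_set(name, flows)
    fi = set()
    pending = list(fd)
    while pending:
        fj = pending.pop()
        for fk in direct_set(fj, flows):
            if fk not in fd and fk not in fi:
                fi.add(fk)
                pending.append(fk)   # fk may in turn have its own contenders
    return fi


# Hypothetical chain of three flows; a larger number means a higher priority.
flows = {
    "f_high": {"prio": 3, "links": {"l1"}},
    "f_mid":  {"prio": 2, "links": {"l1", "l2"}},
    "f_low":  {"prio": 1, "links": {"l2"}},
}

fd_low = direct_set("f_low", flows)     # {"f_mid"}
fi_low = indirect_set("f_low", flows)   # {"f_high"}
```

In this example f_high shares no link with f_low, yet it directly contends with f_mid, which in turn directly contends with f_low; hence f_high is an indirectly contending flow of f_low.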
Recall that for NoCs with the round-robin arbitration policy, the analysed flow could be
blocked by both directly and indirectly contending flows. However, assuming priority-preemptive
NoCs, the flow under analysis can suffer interference only from directly contending flows (see
Theorem 3).
Theorem 3. In wormhole-switched NoCs with per-flow distinctive priorities, per-priority virtual
channels and flit-level preemptions, no flow fi can suffer interference from indirectly contending
flows.
Proof. Proven by contradiction. Consider three flows fi, f j and fk. Let fk be a directly contending
flow of f j, and f j be a directly contending flow of fi, where fk and fi do not share a common
part of the path. Thus, by Definition 2, fk is an indirectly contending flow of fi. Assume that fk
can cause interference to fi. By initial assumption, fi is preempted. Furthermore, as fk causes the
interference, it is traversing. Due to the traversal of fk, the flow f j is either preempted, or does not
exist at that time instant. In either case, it is not progressing. Thus, all routers on the path of fi are
idle. Due to per-priority virtual channels, fi can uninterruptedly reach its destination, even though
preempted packets of f j may exist on its path (which would not be the case in a scheme with a
single virtual channel, where fi would have to wait until f j passes). The contradiction has been
reached.
One implication of Theorem 3 is that in the example illustrated in Figure 2.16, the flow f1
cannot suffer interference from the flow f3. However, notice that f3 can preempt f2, influence
its occurrence patterns, and in that way indirectly contribute to the delay of f1. This is a very
important fact, and it will receive additional attention later, when the state-of-the-art methods for
the worst-case analysis are introduced.
Figure 2.16: Traffic flows (example 3) (routers ρ1 to ρ5; flows f1, f2 and f3)
Observation 4. The analysed flow fi can be preempted by any directly interfering flow f j ∈
FD( fi), and consequently suffer interference. The delay caused to fi by a single preemption of f j
is equivalent to its basic network latency – C( f j), which is computed by solving Equation 2.1.
2.3.1 State-of-the-Art Method for NoCs with Fixed-Priority Arbitration Policy
Shi and Burns [85] proposed a method for the worst-case analysis of NoCs with the fixed-priority
arbitration. This approach is based on the following assumptions: (i) flit-level preemptions,
(ii) per-flow distinctive priorities, (iii) per-priority virtual channels, and (iv) a traffic model with
constrained or implicit deadlines, i.e. D( fi)≤ T ( fi),∀ fi ∈F . The authors proposed to treat the
entire path of the flow under analysis as an indivisible resource, while any request for any of its
parts by directly contending flows is considered as interference¹. This allows the concepts from
the single-core scheduling theory to be reused. Specifically, the problem of computing the
worst-case traversal time of a flow translates into the problem of computing the worst-case
response time of a task scheduled upon a single-core platform. By straightforwardly applying the
concepts of the single-core scheduling theory, the worst-case traversal time of the flow fi in the
presence of its directly contending flows FD( fi) can be computed as follows:
\[ \mathit{WCTT}(f_i) = C(f_i) + \sum_{\forall f_j \in \mathcal{F}_D(f_i)} \left\lceil \frac{\mathit{WCTT}(f_i) + J_R(f_j)}{T(f_j)} \right\rceil \cdot C(f_j) \tag{2.2} \]
Equation 2.2 is interpreted as follows. The worst-case traversal time of a flow is equal to the
sum of its isolation delay and its interference delay. The interference is computed by summing up
the interferences from all higher-priority flows, where individual terms are obtained by calculating
the maximum load that each higher-priority flow can generate within the observed time interval,
augmented by the additional term JR( f j). JR( f j) is called the release jitter, which is defined as the
maximum deviation of successive packet releases from its period [6].
Notice that the term WCTT( fi) appears on both sides of Equation 2.2. This equation is
solved using an iterative technique, where the first iteration is performed by assuming that the
term WCTT( fi) on the right-hand side is equal to 0. The computed value (the term WCTT( fi)
on the left-hand side) is then fed back into the calculation as the right-hand side term, and the
computation is performed iteratively. The process terminates when WCTT( fi) does not change in
two successive iterations. The obtained value represents the smallest value of WCTT( fi) which
satisfies Equation 2.2. This is a well-known and widely used technique in the real-time domain [6].
¹ Notice the difference between this approach and the methods for the round-robin-arbitrated
NoCs, where the analysis was performed per-link, recursively.
Now, consider the example of flows illustrated in Figure 2.16 with the flow parameters given
in Table 2.2.
Table 2.2: Flow-set parameters for Figure 2.16 (example 1)
Flow   Priority             C(f)   JR(f)   D(f) = T(f)
f1     P( f1)               3      0       10
f2     P( f2) < P( f1)      2      0       6
f3     P( f3) < P( f2)      2      0       5
The worst-case traversal times of the flows can be obtained as follows:
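\[
\begin{aligned}
\mathit{WCTT}(f_1) &= C(f_1) = 3 \\
\mathit{WCTT}(f_2) &= C(f_2) + \left\lceil \frac{\mathit{WCTT}(f_2) + J_R(f_1)}{T(f_1)} \right\rceil \cdot C(f_1) = 5 \\
\mathit{WCTT}(f_3) &= C(f_3) + \left\lceil \frac{\mathit{WCTT}(f_3) + J_R(f_2)}{T(f_2)} \right\rceil \cdot C(f_2) = 4
\end{aligned}
\]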
Since the worst-case traversal times of all the flows are less than their respective deadlines, it
may be concluded that this flow-set is schedulable. However, that is not true: Figure 2.17
demonstrates that the packet of f3 can miss its deadline. The explanation is as follows. Even
though f1 cannot directly interfere with f3 because they do not have a common part of the path,
f1 can influence the occurrence pattern of f2 and in that way indirectly contribute to the delay
of f3. Notice in Figure 2.17 that f1 delayed the first packet of f2, causing its two successive
packets to be distanced by less than T( f2). Consequently, f3 experienced more interference from
f2 than what was computed with Equation 2.2. In particular, within the observed interval, f3
suffered the interference from two packets of f2, while the analysis considered the interference
from only one packet. Thus, in the presence of indirectly contending flows, assuming periodic
occurrences of higher-priority flows is not safe.
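The iterative solution of Equation 2.2 for the flow-set of Table 2.2 can be reproduced with the following minimal Python sketch. The function name, the representation of each directly contending flow as a (C, T, JR) triple, and the limit guard against non-convergence are assumptions of this sketch; only the recurrence itself and the flow parameters come from the text.

```python
import math


def wctt_eq_2_2(c_i, interferers, limit=10_000):
    """Smallest fixed point of Equation 2.2.

    c_i         -- basic network latency C(f_i) of the analysed flow
    interferers -- (C, T, J_R) triples, one per flow in F_D(f_i)
    limit       -- assumed safety cap; the recurrence need not converge
                   for overloaded flow-sets
    """
    w = 0  # first iteration assumes WCTT(f_i) = 0 on the right-hand side
    while True:
        nxt = c_i + sum(math.ceil((w + j) / t) * c for (c, t, j) in interferers)
        if nxt == w:          # unchanged in two successive iterations
            return w
        if nxt > limit:
            raise RuntimeError("iteration exceeded the assumed limit")
        w = nxt


# Table 2.2: f1 has no directly contending flow, F_D(f2) = {f1}, F_D(f3) = {f2}.
wctt_f1 = wctt_eq_2_2(3, [])
wctt_f2 = wctt_eq_2_2(2, [(3, 10, 0)])
wctt_f3 = wctt_eq_2_2(2, [(2, 6, 0)])
```

The sketch returns 3, 5 and 4 for f1, f2 and f3, respectively. As explained above, the value obtained for f3 is not a safe bound, because Equation 2.2 ignores the effect of f1 on the release pattern of f2.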
Figure 2.17: Deadline miss for flow-set from Figure 2.16 and Table 2.2 (timeline of f1, f2 and f3
over time instants 0 to 9)
Shi and Burns [85] noticed this effect, and proposed a modification to Equation 2.2, which
takes into account the impact of indirectly contending flows. The intuition behind their approach
is the following: for each higher-priority directly contending flow whose releases can be deferred
due to indirectly contending flows, assume that the first packet is delayed as much as possible,
while all the other packets are released as early as possible. The maximum delay that the first
packet may experience while still being schedulable is JN( fi) = WCTT( fi)−C( fi), which is
known in the literature as the maximum network jitter. Thus, the authors propose to assume the
maximum jitter for each higher-priority contending flow f j that can suffer interference from
at least one flow fk which is indirectly contending with the analysed flow fi. Conversely, if f j
cannot suffer interference from any flow which is indirectly contending with fi, then its jitter
is equal to zero, i.e. JN( f j) = 0. The jitter computation is expressed by Equation 2.3.
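$$
J_N(f_j) =
\begin{cases}
WCTT(f_j) - C(f_j) & \text{if } \exists f_k \in \mathcal{F} \mid f_k \in \mathcal{F}_D(f_j) \wedge f_k \in \mathcal{F}_I(f_i) \\
0 & \text{otherwise}
\end{cases}
\tag{2.3}
$$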
Notice from Equation 2.3 that the network jitter does not depend only on the higher-priority
flow f j, but also on the analysed flow fi and indirectly interfering flows (if any). This implies that
if two different flows fi and fi′ have the same directly interfering higher-priority flow f j, it may
happen that the network jitters of f j in these two cases have different values.
Now, a safe upper-bound on the worst-case traversal time of fi can be computed by iteratively
solving Equation 2.4.
$$
WCTT(f_i) = C(f_i) + \sum_{\forall f_j \in \mathcal{F}_D(f_i)} \left\lceil \frac{WCTT(f_i) + J_R(f_j) + J_N(f_j)}{T(f_j)} \right\rceil \cdot C(f_j)
\tag{2.4}
$$
Consider again the flow f3 from the previous example. The indirect interference that f1 causes
to f3 manifests itself as the jitter of f2, i.e. JN(f2) = WCTT(f2) − C(f2) = 3. And indeed, after
recomputing the worst-case traversal time of f3 with Equation 2.4, it is confirmed that f3 is in fact
unschedulable:
$$
WCTT(f_3) = C(f_3) + \left\lceil \frac{WCTT(f_3) + J_R(f_2) + J_N(f_2)}{T(f_2)} \right\rceil \cdot C(f_2) = 6 > D(f_3)
$$
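The iterative solution of Equations 2.3 and 2.4 can be sketched in a few lines of Python. The following is only an illustration under stated assumptions, not the implementation used in this work: the names `Flow`, `network_jitter` and `wctt` are hypothetical, the sets FD and FI are supplied by hand rather than derived from routes, and the numeric parameters are those of Table 2.3 (C = 2, JR = 0, D = T = 10), because these are stated explicitly in the text.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Flow:
    name: str
    c: int                      # basic network latency C(f)
    t: int                      # minimum inter-arrival period T(f)
    d: int                      # deadline D(f)
    jr: int = 0                 # release jitter JR(f)
    direct: list = field(default_factory=list)    # FD(f), higher-priority only
    indirect: list = field(default_factory=list)  # FI(f)


def ceil_div(a, b):
    """Integer ceiling of a / b for positive b."""
    return -(-a // b)


def network_jitter(fj, fi):
    """Equation 2.3: jitter of fj as seen by the analysed flow fi."""
    if any(fk in fi.indirect for fk in fj.direct):
        return wctt(fj) - fj.c
    return 0


def wctt(fi, limit=10_000):
    """Equation 2.4 solved by fixed-point iteration, starting from C(fi)."""
    r = fi.c
    while r <= limit:
        nxt = fi.c + sum(
            ceil_div(r + fj.jr + network_jitter(fj, fi), fj.t) * fj.c
            for fj in fi.direct
        )
        if nxt == r:
            return r
        r = nxt
    raise RuntimeError("fixed-point iteration did not converge")


# Flow-set of Figure 2.16 with the parameters of Table 2.3.
f1 = Flow("f1", c=2, t=10, d=10)
f2 = Flow("f2", c=2, t=10, d=10, direct=[f1])
f3 = Flow("f3", c=2, t=10, d=10, direct=[f2], indirect=[f1])

results = {f.name: wctt(f) for f in (f1, f2, f3)}
print(results)
```

With these parameters the sketch yields traversal times of 2, 4 and 4 for f1, f2 and f3, where the value for f3 already includes the jitter JN(f2) = WCTT(f2) − C(f2) = 2 induced by f1.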
2.3.2 Priority-Share Policy
The greatest limitation of the aforementioned method is the requirement that each flow must have
a distinctive priority, and that the platform must provide a number of virtual channels which is
at least equal to the number of flows in the flow-set. This requirement is in most cases
unrealistic. For example, the SCC platform [42] provides only 8 virtual channels. This implies that
only flow-sets with at most 8 flows can be accommodated, which is a severe limitation. Thus, there
is a need to organise the access of the flows to the existing virtual channels more efficiently.
64 Several Steps Closer to Real-Time NoCs
In order to reduce the number of needed virtual channels, Shi and Burns [86] proposed the
method where multiple flows can share the same priority, called a priority-share policy. How-
ever, the existence of flows with the same priority brings significant overheads and more complex
blocking and interference patterns, similar to those involving a single virtual channel and the
round-robin arbitration policy. That is, a flow can be blocked not only by the other same-priority
flows with which it shares some parts of the path, but also by others with which it does not. For
example, if the flows f1, f2 and f3 from Figure 2.16 share the same priority, then f3 can indirectly
block f1.
In order to circumvent this problem the authors propose to group all flows with the same
priority into a single entity called the composite flow – f˚ . Let FC( f˚ ) be a set of same-priority
flows constituting f˚ . Now, the basic network latency of the composite flow f˚ is computed as
follows:
$$
C(\mathring{f}) = \sum_{\forall f_i \in \mathcal{F}_C(\mathring{f})} C(f_i)
\tag{2.5}
$$
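Let FD(f˚) be a set of flows which can cause direct interference to any flow from FC(f˚), and
hence to f˚. Formally:
$$
\forall f_j \in \mathcal{F} \mid \exists f_i \in \mathcal{F}_C(\mathring{f}) \wedge f_j \in \mathcal{F}_D(f_i) \Rightarrow f_j \in \mathcal{F}_D(\mathring{f})
\tag{2.6}
$$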
f˚ can suffer interference from any f j ∈ FD( f˚ ). Now, the worst-case traversal time of the
composite flow can be computed by solving Equation 2.7.
$$
WCTT(\mathring{f}) = C(\mathring{f}) + \sum_{\forall f_j \in \mathcal{F}_D(\mathring{f})} \left\lceil \frac{WCTT(\mathring{f}) + J_R(f_j) + J_N(f_j)}{T(f_j)} \right\rceil \cdot C(f_j)
\tag{2.7}
$$
Subsequently, the worst-case traversal time of each flow fi ∈ FC(f˚) can be computed as follows:
$$
WCTT(f_i) = WCTT(\mathring{f}) + J_R(f_i)
\tag{2.8}
$$
Notice two potential sources of pessimism: (i) all same-priority flows are grouped into one
entity, while some of them may not be able to (in)directly block each other, and hence could be
treated independently, and (ii) not all directly interfering flows can cause interference during an
entire period WCT T ( f˚ ). To emphasise the limitations of the priority-share policy, consider the
example illustrated in Figure 2.16 with the flow parameters given in Table 2.3.
Table 2.3: Flow-set parameters for Figure 2.16 (example 2)
Flow   Priority         C(f)   JR(f)   D(f) = T(f)
f1     P(f1)            2      0       10
f2     P(f2) < P(f1)    2      0       10
f3     P(f3) < P(f2)    2      0       10
Assuming that each flow has a dedicated virtual channel, the worst-case traversal times are as
follows:
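$$
\begin{aligned}
WCTT(f_1) &= C(f_1) = 2 \\
WCTT(f_2) &= C(f_2) + \left\lceil \frac{WCTT(f_2) + J_R(f_1)}{T(f_1)} \right\rceil \cdot C(f_1) = 4 \\
WCTT(f_3) &= C(f_3) + \left\lceil \frac{WCTT(f_3) + J_R(f_2) + WCTT(f_2) - C(f_2)}{T(f_2)} \right\rceil \cdot C(f_2) = 4
\end{aligned}
$$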
However, when assuming that all three flows share the same priority (and hence the virtual
channel), the worst-case traversal times of the flows are:
$$
\begin{aligned}
WCTT(\mathring{f}) &= C(\mathring{f}) = C(f_1) + C(f_2) + C(f_3) = 6 \\
WCTT(f_1) &= C(\mathring{f}) + J_R(f_1) = 6 \\
WCTT(f_2) &= C(\mathring{f}) + J_R(f_2) = 6 \\
WCTT(f_3) &= C(\mathring{f}) + J_R(f_3) = 6
\end{aligned}
$$
The aforementioned example demonstrates that the priority-share policy can indeed reduce the
number of needed virtual channels to an arbitrary value, but only at the expense of a substantial
increase in the worst-case traversal times. Notice that this can have a significant impact on the
schedulability: many flows that are schedulable with per-flow priorities are no longer schedulable
when some flows start sharing the same priority. This finding suggests that the priority-share
policy may not be the most efficient method for the reduction of required virtual channels, and
that alternative techniques are desirable.
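The composite-flow computation of Equations 2.5, 2.7 and 2.8 can be sketched in Python as follows. This is a minimal illustration, not the original authors' implementation: the function name `composite_wctt` and the tuple layouts are hypothetical, and the jitters JN of the higher-priority flows are assumed to be given as inputs.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive b."""
    return -(-a // b)


def composite_wctt(members, higher, limit=10_000):
    """Worst-case traversal times of flows that share one priority.

    members: list of (C, JR) pairs, one per flow in the composite flow.
    higher:  list of (C, T, JR, JN) tuples, one per flow in FD of the
             composite flow (higher-priority, directly interfering).
    Returns one traversal time per member, in the order given.
    """
    c_comp = sum(c for c, _ in members)                 # Equation 2.5
    r = c_comp
    while r <= limit:
        nxt = c_comp + sum(                             # Equation 2.7
            ceil_div(r + jr + jn, t) * c for c, t, jr, jn in higher
        )
        if nxt == r:
            return [r + jr for _, jr in members]        # Equation 2.8
        r = nxt
    raise RuntimeError("fixed-point iteration did not converge")


# Table 2.3 with f1, f2 and f3 sharing one priority: no higher-priority flows.
shared = composite_wctt(members=[(2, 0), (2, 0), (2, 0)], higher=[])
print(shared)
```

For the parameters of Table 2.3 the sketch returns 6 for each of the three flows, compared with 2, 4 and 4 when every flow has a dedicated virtual channel.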
2.3.3 Relaxing Hardware Requirements
The state-of-the-art method for flows with distinctive priorities [85] poses an unrealistic
requirement regarding the number of virtual channels. The subsequently proposed priority-share
policy [86] makes it possible to relax that requirement (reduce the number of needed virtual
channels), but at the expense of schedulability. In this section, the focus is on providing an
alternative method, which reduces the number of needed virtual channels and at the same time does
not have a significant impact on the schedulability.
2.3.3.1 Inefficiency of Existing Methods
The common underlying assumption of the two aforementioned approaches is that each flow tra-
verses its entire path through the same, statically assigned virtual channel, which is dedicated to
its priority. In other words, the assignment of virtual channels to priorities has to be done in a
consistent manner across the entire platform, which means that many port-buffers belonging to a
certain virtual channel will remain unused, simply because there are no flows with that priority
which traverse them. The limitation of this approach is illustrated in Figure 2.18, where 4 flows
with distinctive priorities traverse 4 virtual channels, and where different virtual channels are
depicted with different colors. Notice that the virtual channel for f1 (the darkest color) is used only
in the first two routers, while it remains unused in the rest of the platform.
Figure 2.18: Assignment of virtual channels to flow-set with distinctive priorities
The priority-share policy suffers from the same limitation, which is illustrated in Figure 2.19.
Consider the depicted flow-set where flows f1 and f4 are grouped within the same priority (virtual
channel), and the same is true for f2 and f3. Notice that the virtual channel shared by f1 and f4
(the darker color) is used only in the first two and the last two routers, while it remains unused
in the middle two routers.
Figure 2.19: Assignment of virtual channels to flow-set with priority-share policy
2.3.3.2 Dynamically Changing Virtual Channels
The SCC platform [42] provides a feature that can be used to overcome the aforementioned limi-
tation. Specifically, SCC offers the possibility that a flow can dynamically change virtual channels
along its path. This feature was described using an analogy about how cars switch lanes on the
highway. This implies that the assignment of virtual channels to priorities does not need to be
done in a consistent manner across the entire platform, but can be performed individually, for each
contending element – link (port). The benefits of this feature are illustrated in Figure 2.20, where
for 4 flows with distinctive priorities only 2 virtual channels are needed. Notice that f2 changes
virtual channels twice along its path (the darkest color).
Figure 2.20: Per-router assignment of virtual channels to flow-set with distinctive priorities
This approach brings two benefits. First, within each port, the virtual channels are assigned to
a certain priority only if necessary (the flow or flows with that priority traverse that port). Second,
the number of required virtual channels within the entire platform is significantly reduced. Specif-
ically, assuming that each flow has a distinctive priority, the number of necessary virtual channels
decreases from the total number of flows in the flow-set to the maximum number of contentions
for any link within the platform. In the illustrated example, it dropped from 4 (Figure 2.18) to only
2 (Figure 2.20). Notice that, unlike the priority-share policy, this technique for reducing the number
of virtual channels does not have an impact on the schedulability, and that the analysis remains the
same as in the case with the per-flow distinctive priorities (Section 2.3.1), which makes it
preferable to the priority-share policy.
One overhead of this approach is that, for each flow, a list of traversed virtual channels has to
be specified. This can be implemented in the following way. The header of each flow fi is extended
with the list of virtual channels V(fi), where the number of elements in V(fi) is equal to the
number of ports on the path L(fi) of the flow fi, i.e. if routers have only input or only output ports,
|V(fi)| = |L(fi) − 1|, otherwise |V(fi)| = 2 · |L(fi) − 1|. In other words, for each port that fi
traverses, the header states the virtual channel in which the flow should be stored. An alternative
implementation is to keep the flows agnostic with respect to virtual channels and to store this
information within each router; upon the arrival of a flow, it is then the router's responsibility
to retrieve that information and select the corresponding virtual channel.
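A per-link assignment of virtual channels, and the resulting list V(fi) for each flow, can be sketched as follows. The sketch rests on several assumptions that are not taken from the text: routes are given as hand-written lists of link identifiers (one buffer per link, i.e. the case of routers with only input or only output ports), a lower numeric priority value means a higher priority, and within each link the virtual-channel index is simply the priority rank of the flow among the flows that traverse that link. The function name `assign_virtual_channels` and the example topology are hypothetical.

```python
from collections import defaultdict


def assign_virtual_channels(routes, priority):
    """Assign virtual channels independently on every link.

    routes:   dict flow -> ordered list of link identifiers on its path.
    priority: dict flow -> int, lower value means higher priority.
    Returns (vc_lists, needed): the per-flow list of virtual channels, one
    entry per traversed link, and the number of virtual channels needed.
    """
    flows_on_link = defaultdict(list)
    for flow, links in routes.items():
        for link in links:
            flows_on_link[link].append(flow)

    vc_on_link = {}
    for link, flows in flows_on_link.items():
        # Rank the contenders of this link by priority: rank 0 gets VC 0.
        for vc, flow in enumerate(sorted(flows, key=priority.get)):
            vc_on_link[(flow, link)] = vc

    vc_lists = {
        flow: [vc_on_link[(flow, link)] for link in links]
        for flow, links in routes.items()
    }
    needed = max(len(flows) for flows in flows_on_link.values())
    return vc_lists, needed


# Hypothetical chain of links A..E shared by four flows.
routes = {
    "f1": ["A", "B"],
    "f2": ["B", "C", "D"],
    "f3": ["C"],
    "f4": ["D", "E"],
}
priority = {"f1": 1, "f2": 2, "f3": 3, "f4": 4}

vc_lists, needed = assign_virtual_channels(routes, priority)
print(needed, vc_lists)
```

In this hypothetical example no link is shared by more than two flows, so two virtual channels suffice for four distinctive priorities, and f2 moves from the second to the first virtual channel after its first link.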
2.3.3.3 Importance of Mapping
As soon as the methods for the worst-case analysis of NoCs were proposed, the researchers re-
alised that the schedulability of flow-sets is highly dependent on the mapping process. This is
expected, because as observed in the previous sections, the flow can suffer the interference only
from directly contending flows, while flow relationships are in fact the consequence of the map-
ping process. In their studies, Mesidis and Indrusiak [62] and Racu and Indrusiak [80] focus
on deriving mappings which maximise the schedulability potential of the flow-sets, but without
any impact on the number of employed virtual channels, because the authors assumed the model
with distinctive per-flow priorities. Similarly, Shi and Burns [87] explore the flow-set mapping
problem, but for the model with the priority-share policy. Their primary objective is to improve
the schedulability, while the secondary objective is to decrease the number of employed virtual
channels by packing as many flows into the same priority (virtual channel) as possible.
So far, no work has considered that flows may dynamically change virtual channels. Notice
that under this assumption the mapping process becomes even more important, because the number
of necessary virtual channels is equal to the maximum number of contentions for any link,
which is directly dependent on the mapping process.
At this stage several questions can be raised: 1. What is the reduction in the number of virtual
channels that can be achieved by allowing flows to change virtual channels? 2. What is the further
reduction in the number of virtual channels that can be achieved through mapping? 3. Does
SCC (or any other platform with the desired feature) offer enough virtual channels to satisfy the
requirements of present and future real-time systems? 4. If not, how big is the gap between the
existing platforms and the platforms suitable for the real-time domain? In order to answer these
questions, a novel approach is proposed, which combines the dynamically changing flow
priorities with the workload mapping.
2.3.3.4 Proposed Mapping Method
Mapping Workload
Before the mapping process can be described, the mapping workload has to be introduced. Recall
that so far it has been assumed that each flow is specified with a source core and a destination
core. However, flows model the communication between functionalities, thus the source and
destination cores of each flow are in fact the cores of the sending and receiving functionality,
respectively. Therefore, the mapping workload is a collection of functionalities F = {F1, F2, ..., Fz−1, Fz},
where z is equal to the number of cores/routers. The assumption valid in this section is that each
functionality can be mapped to any core within the platform; however, each core can accept at
most one functionality. Each functionality Fi is characterised by a set of sent flows FS(Fi) and a
set of received flows FR(Fi).
Method Overview
The mapping method is based on the assumption that, while traversing, flows are allowed to
change virtual channels. The method derives a mapping plan (solution) M , for a given set of
functionalities F , on a given platform Ψ, i.e. M = F →Ψ. The primary objective is to minimise
the maximum number of contentions for any link on the platform. The motivation behind this
concept is to relax the requirements for virtual channels as much as possible. The secondary
objective is to maximise the schedulability, without increasing the number of necessary virtual
channels.
Such a mapping method makes it possible to identify the minimum platform characteristics (e.g. the
number of virtual channels, the link bandwidths) that are necessary to accommodate a given workload.
A mapping method should be used by a system designer, and should ideally provide a mapping
M of the given workload F to the given platform Ψ, such that the flow-set is schedulable, i.e.
WCTT(fi) ≤ D(fi), ∀fi ∈ F. In cases where this requirement cannot be fulfilled, the method
should provide information about the service that a given platform can provide to a given
workload, e.g. (i) the number of available virtual channels is enough to accommodate only half of the
flow-set, or (ii) assuming initial flow sizes the system is not schedulable, however, assuming that
each flow has half of its initial size the system is schedulable. This information is useful for
two reasons. First, it gives the system designer feedback on how close/far the current configuration is
from the one that can accommodate the entire workload and provide real-time guarantees.
Consequently, the system designer can reconfigure the platform characteristics and repeat the mapping
process until finding the configuration which guarantees the fulfilment of all constraints with as
few resources as possible. Second, if the platform is already specified, the designer can decide
(i) which least essential functionalities (and their respective flows) should be dropped from the
system, such that the constraint on the number of virtual channels is satisfied, and/or (ii) to which
sizes the flows should be reduced, such that the timing constraints of all flows can be met. Note that
variable flow sizes are acceptable in many scenarios, e.g. [13], and proportional to the provided
quality of service. Hence, when mapping such a workload, the system designer might find it more
convenient to use a platform where a workload is schedulable with a certain quality (flow sizes),
rather than buying a much more expensive platform which will guarantee the best quality
(schedulable system with initial flow sizes). This approach also makes it possible to study the effect
of different platform characteristics on the provided guarantees and will help to identify the
bottlenecks of NoCs in commodity many-cores.
Detailed Description of Mapping Method
The proposed method consists of three mapping stages: (i) initial phase, (ii) VC minimisation
phase and (iii) workload-exploration phase.
The initial phase is inspired by the mapping algorithm of dataflow applications on many-
core platforms [4], where the objective is to map the set of functionalities F on the set of cores
Π in such a way that heavily communicating functionalities are mapped as close to each other as
possible. This strategy should help in reducing contentions and provide a "good" starting point, i.e.
initial solution Min, which will undergo further optimisations during subsequent phases. While
mapping, the initial phase considers neither flow nor platform properties (i.e. flow sizes, timing
constraints, link bandwidths, available virtual channels).
The communicating functionalities are mapped near to each other by using a core selection
methodology proposed by Ali et al. [4], which was shown to decrease the communication overhead and
response times of applications. This methodology consists of two main functions, spiralMove and
findNearestCore, which are illustrated in Figure 2.21. The first function, spiralMove, defines a
fixed spiral path on the platform that is followed while mapping the set of functionalities F . The
spiralMove function returns the next core on the spiral path every time it is called, as shown in
Figure 2.21(a). The second function, findNearestCore, takes a reference core as an input and starts
searching for a free core one hop away from the reference core. If none is available, it searches for
a free core two hops away, and so on, until finding a core to which the functionality can be mapped.
The search examines candidate cores in this order: North, South, East and West. The
first free core that the findNearestCore function finds is returned for mapping. Figure 2.21(b) shows
the searching regions, classified according to the distance from the reference core.
The mapping algorithm of the initial phase (shown in Algorithm 4), starts by sorting the func-
tionalities in a non-increasing order of total number of flows (both sent and received). Then, it
picks the functionalities in that order and checks whether they are mapped or not. If the func-
tionality Fi was already mapped, the algorithm checks the children of Fi (the functionalities that
communicate with Fi, either via received or sent flows), and maps them near Fi by using the func-
tion findNearestCore. On the other hand, if the picked up functionality was not mapped yet, the
algorithm calls the spiralMove function to determine the next free core on the spiral path to map
it. Then, it maps its children functionalities using the findNearestCore function in the same way
Figure 2.21: Core selection methodology proposed by Ali et al. [4]: (a) spiralMove, (b) findNearestCore
Algorithm 4 mapInitialPhase(F, Ψ)
Input: set of functionalities F, platform Ψ
Output: initial mapping Min
Ford ← orderFunctionalities(F); // sort functionalities by the number of flows, non-increasingly
for each (Fi ∈ Ford) do
    if (Fi.mapped ≠ true) then
        πk ← spiralMove(); // find the best core for mapping
        map(Fi, πk);
        Fi.mapped ← true;
    end if
    Fch(i) ← findChildFunctionalities(Fi); // find all child functionalities of Fi
    if (Fch(i) ≠ ∅) then
        for each (Fj ∈ Fch(i)) do
            πm ← findNearestCore(Fi);
            map(Fj, πm);
            Fj.mapped ← true;
        end for
    end if
end for
return Min;
as explained previously.
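The initial phase can be sketched in Python as follows. This is a hedged illustration rather than the implementation of Ali et al. [4] or of this work. Several details are assumptions: the spiral starts at core (0, 0) and runs clockwise inwards, North is taken as decreasing y, candidates beyond one hop are ordered by an extension of the North, South, East, West rule chosen here for determinism, and a child that is already mapped is left in place. All function names are hypothetical, and the number of functionalities is assumed not to exceed the number of cores.

```python
from collections import defaultdict


def spiral_path(n):
    """Yield the cores of an n x n mesh along a clockwise inward spiral."""
    top, bottom, left, right = 0, n - 1, 0, n - 1
    while top <= bottom and left <= right:
        for x in range(left, right + 1):
            yield (x, top)
        for y in range(top + 1, bottom + 1):
            yield (right, y)
        if top < bottom:
            for x in range(right - 1, left - 1, -1):
                yield (x, bottom)
        if left < right:
            for y in range(bottom - 1, top, -1):
                yield (left, y)
        top, bottom, left, right = top + 1, bottom - 1, left + 1, right - 1


def find_nearest_core(ref, free, n):
    """Return the closest free core to ref, searching hop distance 1, 2, ..."""
    rx, ry = ref
    for d in range(1, 2 * n - 1):
        offsets = {(dx, s * (d - abs(dx))) for dx in range(-d, d + 1) for s in (1, -1)}
        # At one hop this key yields the order North, South, East, West.
        for dx, dy in sorted(offsets, key=lambda o: (-abs(o[1]), o[1], -o[0])):
            if (rx + dx, ry + dy) in free:
                return (rx + dx, ry + dy)
    return None


def map_initial_phase(flows, n):
    """flows: list of (sender, receiver) pairs. Returns functionality -> core."""
    neighbours = defaultdict(list)
    for sender, receiver in flows:
        neighbours[sender].append(receiver)
        neighbours[receiver].append(sender)
    # Non-increasing order of the total number of sent and received flows.
    order = sorted(neighbours, key=lambda f: len(neighbours[f]), reverse=True)

    free = set(spiral_path(n))
    spiral = spiral_path(n)
    mapping = {}
    for f in order:
        if f not in mapping:
            mapping[f] = next(core for core in spiral if core in free)
            free.discard(mapping[f])
        for child in neighbours[f]:
            if child not in mapping:
                mapping[child] = find_nearest_core(mapping[f], free, n)
                free.discard(mapping[child])
    return mapping


flows = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("E", "F")]
mapping = map_initial_phase(flows, n=3)
print(mapping)
```

In this hypothetical workload the most communicative functionality A is placed first on the spiral path, and its communication partners are then placed on the closest free cores around it.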
The VC minimisation phase takes the solution of the initial phase Min as an input, and
optimises it in such a way that the number of needed virtual channels is minimised. The VC
minimisation phase is implemented as a Simulated Annealing (SA) meta-heuristic [50]. The jus-
tification for this approach is the fact that the problem of workload mapping is equivalent to the
quadratic assignment problem, which is NP-Hard, hence searching for the optimal solution can be
Algorithm 5 mapVCMinPhase(F, Ψ, Min)
Input: set of functionalities F, platform Ψ, initial solution Min
Parameters: tmin – min SA temperature, tmax – max SA temperature, tstep – temperature step,
c – iterations in each step, Pw – probability of transitioning from the current state
Output: intermediate solution Mmin
Mmin ← Min;
for (tcurr ← tmax; tcurr > tmin; tcurr ← tcurr − tstep) do
    it ← 0; // number of iterations at the current temperature
    while (it ≤ c) do
        {Fi, Fj | (Fi ∧ Fj) ∈ F, Fi ≠ Fj} // select two random functionalities Fi and Fj
        Mcurr ← swap(Mmin, Fi, Fj); // create new solution by swapping the positions of Fi and Fj
        if (v̂(Mcurr) ≤ v̂(Mmin)) then
            if ((v̄(Mcurr) < v̄(Mmin)) ∨ (rnd() < Pw · tcurr/tmax)) then
                Mmin ← Mcurr; // new solution accepted
            end if
        end if
        it ← it + 1;
    end while
end for
return Mmin;
prohibitively expensive even for small NoCs [40], e.g. 4×4. Note that the majority of workload
mapping methods are also heuristic-based.
As the single objective of the VC minimisation phase is to minimise the number of virtual
channels, this phase also considers neither flow nor platform properties. This phase is described
with Algorithm 5. Through the exploration of the solution space, the algorithm makes a transition
from one solution Mmin to another Mcurr with respect to an acceptance test. This test consists of
two conditions, of which at least one must be satisfied in order to accept the new solution. These
two conditions are: (i) the average number of contentions (computed for all links) of the new
solution is less than the average number of contentions of the solution which currently requires the
minimal number of virtual channels, v̄(Mcurr) < v̄(Mmin), and (ii) the acceptance probability test
(rnd() < Pw · tcurr/tmax), where rnd() is a random number from the range [0,1], Pw is the
probability of accepting a new solution, while tcurr and tmax are the current and the maximum
temperature of the algorithm, respectively. As is visible, the acceptance probability test is a function
of tcurr. This means that at high temperatures the algorithm has a higher tendency to accept worse
solutions in an attempt to transition to a new solution space where it can find a better solution. As
the algorithm temperature cools down, this tendency decreases and the algorithm starts to lock on
the best solution in the current neighbourhood.
As shown in Algorithm 5, the algorithm starts its iterations by setting the current temperature
tcurr to a high value tmax. Then, two functionalities are randomly selected and swapped. Conse-
quently, the rerouting of their respective incoming and outgoing flows is performed. If the number
of needed virtual channels for the new solution v̂(Mcurr) is less than or equal to the number of
virtual channels for the currently best solution v̂(Mmin), the new solution is eligible to evaluate
its acceptance probability using the acceptance test previously described, otherwise, it is rejected.
In cases where the new solution progresses to the acceptance test and subsequently passes, the
algorithm accepts the new solution, i.e. Mmin←Mcurr. Otherwise, the best solution remains the
same.
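The acceptance logic of the VC minimisation phase can be sketched in Python as follows. The sketch is illustrative only and makes assumptions that are not taken from the text: flows are routed with dimension-ordered (X then Y) routing, links are directed, the average number of contentions is taken over all 4·n·(n−1) directed links of an n×n mesh, and each temperature step performs exactly c iterations. The function names, the default parameter values and the example workload are hypothetical.

```python
import random
from collections import Counter


def xy_route(src, dst):
    """Directed links visited by X-then-Y routing (an assumption of this sketch)."""
    (x, y), links = src, []
    while x != dst[0]:
        nx = x + (1 if dst[0] > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dst[1]:
        ny = y + (1 if dst[1] > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links


def contentions(mapping, flows, n):
    """Return (maximum, average) number of flows per link for a mapping."""
    load = Counter(
        link for s, r in flows for link in xy_route(mapping[s], mapping[r])
    )
    total_links = 4 * n * (n - 1)
    return max(load.values(), default=0), sum(load.values()) / total_links


def map_vc_min_phase(mapping, flows, n, t_max=100.0, t_min=1.0, t_step=1.0,
                     c=50, p_w=0.1, seed=0):
    """Simulated annealing over pairwise swaps, following Algorithm 5."""
    rng = random.Random(seed)
    best = dict(mapping)
    best_cost = contentions(best, flows, n)
    funcs = list(best)
    t = t_max
    while t > t_min:
        for _ in range(c):
            fi, fj = rng.sample(funcs, 2)
            cand = dict(best)
            cand[fi], cand[fj] = cand[fj], cand[fi]
            cost = contentions(cand, flows, n)
            # Never accept more virtual channels; otherwise accept a lower
            # average contention, or a random move that is more likely at
            # high temperature.
            if cost[0] <= best_cost[0] and (
                cost[1] < best_cost[1] or rng.random() < p_w * t / t_max
            ):
                best, best_cost = cand, cost
        t -= t_step
    return best


n = 3
initial = {i: (i % n, i // n) for i in range(n * n)}
flows = [(i, (i + 4) % 9) for i in range(9)] + [(0, 8), (1, 8), (2, 8)]
final = map_vc_min_phase(initial, flows, n)
print(contentions(initial, flows, n)[0], contentions(final, flows, n)[0])
```

Because a candidate is only eligible when it does not need more virtual channels than the current solution, the number of virtual channels of the returned mapping can never exceed that of the initial one, whereas the average contention may temporarily worsen at high temperatures.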
The workload-exploration phase is the last phase and its main role is to discover and quan-
tify the guarantees that can be provided for a given workload F by a given platform Ψ. It takes
the output of the VC minimisation phase Mmin as an input, which assures that the starting point
is the solution with the minimal number of virtual channels v̂(Mmin), and performs the optimisation
with the objective to derive a feasible solution Mwe (if possible), assuming specific platform
characteristics (i.e. link bandwidth $B_{link} = \frac{\sigma_{flit}}{\delta_\rho + \delta_L}$ and
available virtual channels v̂lim ≥ v̂(Mmin)) as well as workload temporal constraints. The
workload-exploration phase is also implemented as
a SA meta-heuristic, and it should ideally derive a feasible solution Mwe. In cases where it is not
possible, the workload-exploration phase investigates to what percentage of initial sizes all flows
have to be uniformly reduced, such that the feasibility can be guaranteed, and consequently tries
to find a solution where the reduction is the least possible.
The workload-exploration phase attempts to maximize the size of all flows Ŝ traversing the
NoC, such that flows’ timing constraints are satisfied, and that the number of employed virtual
channels does not exceed the limit – v̂lim, which represents the number of channels provided by
the platform Π. Similar to the VC minimisation phase, this phase is also based on a SA
meta-heuristic, but with a different goal, which is in this case to find a solution Mwe, such that Ŝ(Mwe)
is maximised. This algorithm also makes a transition from one solution to another with respect
to an acceptance test. However, the acceptance test of the workload-exploration phase differs in
the first condition, where it compares the new and the best values of maximum feasible flow sizes,
Ŝ(Mcurr)> Ŝ(Mwe), in order to accept new solutions.
As shown in Algorithm 6, the algorithm starts its iterations by setting the current temperature
tcurr to a high value tmax. Then, two functionalities Fi and Fj are randomly picked and swapped in
order to generate a new solution Mcurr. This also requires the rerouting of incoming and outgoing
flows of the remapped functionalities. If the number of needed virtual channels of the new solution
v̂(Mcurr) exceeds the number of virtual channels provided by the platform v̂lim, the new solution
Mcurr is rejected. Otherwise, the maximum schedulable flow sizes of the new solution Ŝ(Mcurr)
and the best one Ŝ(Mwe) are compared as a part of the acceptance test. These values are calculated
by incrementing sizes of all flows of the flow-set uniformly, from zero up to their original sizes, and
repeatedly testing whether all flows' timing constraints are still satisfied. Assuming the mapping
M, the value 0 < Ŝ(M) ≤ 1 represents the maximum flow sizes for which the flow-set is still
schedulable, where the value of 1 corresponds to the original flow sizes.
In cases where the new solution passes the acceptance test, the algorithm accepts the new
solution as the best one, i.e. Mwe ←Mcurr. Otherwise, the new solution is discarded. After
several iterations, the maximum flow size Ŝ(Mwe) can reach the value of 1. This means that the
current mapping Mwe of the set of functionalities F on the set of cores Π is schedulable with
original sizes of all flows, so the mapping process can terminate. Otherwise, if Ŝ(Mwe) is less
Algorithm 6 mapWorkloadExplorationPhase(F, Ψ, Mmin)
Input: set of functionalities F, platform Ψ, intermediate solution Mmin
Parameters: tmin, tmax, tstep, c, Pw, v̂lim – number of virtual channels available within the platform
Output: final solution Mwe
Mwe ← Mmin;
for (tcurr ← tmax; tcurr > tmin; tcurr ← tcurr − tstep) do
    it ← 0; // number of iterations at the current temperature
    while (it ≤ c) do
        {Fi, Fj | (Fi ∧ Fj) ∈ F, Fi ≠ Fj} // select two random functionalities Fi and Fj
        Mcurr ← swap(Mmin, Fi, Fj); // create new solution by swapping the positions of Fi and Fj
        if (v̂(Mcurr) ≤ v̂lim) then
            if ((Ŝ(Mcurr) > Ŝ(Mwe)) ∨ (rnd() < Pw · tcurr/tmax)) then
                Mwe ← Mcurr;
                if (Ŝ(Mwe) = 1) then
                    return Mwe;
                end if
            end if
        end if
        it ← it + 1;
    end while
end for
return Mwe;
than 1, this means that the system can guarantee the schedulability, but only assuming that sizes
of all flows are uniformly reduced to Ŝ(Mwe), so the mapping process will keep trying to find a
better solution in subsequent iterations.
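The computation of the maximum schedulable flow size Ŝ(M) for a fixed mapping can be sketched in Python as follows. The sketch is a simplification under explicit assumptions that do not come from the text: the basic network latency C of every flow is taken to scale linearly with the flow size, release and network jitters are taken to be zero, flows are listed from the highest to the lowest priority with their directly interfering flows given by index, and sizes are increased in steps of 10%. All names are hypothetical. Times are multiplied by 100 so that the arithmetic stays in integers.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive b."""
    return -(-a // b)


def schedulable(flows, percent):
    """Check all deadlines with every flow scaled to percent/100 of its size.

    flows: list of dicts with integer keys "C", "T", "D" and a list "hp" of
    indices of higher-priority, directly interfering flows.
    """
    for f in flows:
        c, d = f["C"] * percent, f["D"] * 100
        r = c
        while True:
            nxt = c + sum(
                ceil_div(r, flows[j]["T"] * 100) * flows[j]["C"] * percent
                for j in f["hp"]
            )
            if nxt > d:
                return False        # deadline exceeded, stop iterating
            if nxt == r:
                break               # fixed point reached within the deadline
            r = nxt
    return True


def max_flow_scale(flows, step=10):
    """Largest uniform scale, in steps of `step` percent, that is schedulable."""
    best = 0
    for percent in range(step, 101, step):
        if not schedulable(flows, percent):
            break
        best = percent
    return best / 100


# Two hypothetical flows sharing a link, the first with the higher priority.
flows = [
    {"C": 8, "T": 10, "D": 10, "hp": []},
    {"C": 8, "T": 10, "D": 10, "hp": [0]},
]
s_hat = max_flow_scale(flows)
print(s_hat)
```

For this hypothetical pair of flows the original sizes are not schedulable, but a uniform reduction to 60% of the original sizes is, so the sketch returns 0.6. A return value of 0 would indicate that not even the smallest tested size is schedulable.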
2.3.3.5 Experimental Evaluation
In this section, assuming the proposed mapping approach, several tests are performed in order to
observe how platform characteristics influence derived schedulability guarantees. Specifically, it is
investigated how guarantees change with varying number of virtual channels and link bandwidths.
In order to provide a complete and comprehensive study, and cover a wide range of scenarios rang-
ing from lightly to extremely loaded networks, different workloads are generated and consequently
analysed.
Experimental Setup
The analysis parameters are given in Table 2.4, where flow sizes are randomly generated values,
assuming a uniform distribution. For each flow, a source functionality FS and a destination
functionality FD are randomly selected.
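As a quick check on these parameters, the link bandwidth defined for the workload-exploration phase can be evaluated with the values of Table 2.4, assuming that both delays are expressed as times (1.5 ns and 0.5 ns at a router frequency of 2 GHz):

$$
B_{link} = \frac{\sigma_{flit}}{\delta_\rho + \delta_L} = \frac{16\ \text{B}}{1.5\ \text{ns} + 0.5\ \text{ns}} = 8\ \text{B/ns} = 8\ \text{GB/s}
$$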
Experiment 1: Do Virtual Channels Scale?
The proposed approach is applicable to the mapping solution M only if a platform provides
at least v̂(M ) virtual channels, where v̂(M ) is equal to the maximum number of contentions
for any port within the NoC. The objective of this experiment is to test how (un)realistic that
Table 2.4: Analysis parameters for Section 2.3.3.5
Parameter                           Value
NoC topology and size               2-D mesh with 10×10 routers
Router frequency νρ                 2 GHz
Routing delay δρ                    3 cycles (1.5 ns)
Link traversal delay δL             1 cycle (0.5 ns)
Link width = flit size σflit        16 bytes
Flow packet size σ(fi), ∀fi ∈ F     [32 B − 32 kB]*
Number of functionalities |F|       100
requirement is. This is performed as follows. First, for a given workload F, the mapping process
is performed until reaching the final solution Mwe, and subsequently the number of needed virtual
channels is identified. Then it is observed how the number of needed virtual channels changes
when the number of flows increases. For that purpose, different workload categories are generated,
ranging from 50 to 1000 flows for an entire set of functionalities. Each category consists of 1000
sets of functionalities. For each set of functionalities the mapping is performed (only the initial and VC
minimisation phases of the proposed approach, whose output is the intermediate solution Mmin),
and subsequently the minimum number of virtual channels needed to support it, v̂(Mmin), is
computed. The results are given in Figure 2.22, where the horizontal axis corresponds to the number
of flows and the vertical axis stands for the number of virtual channels. The whiskers are set to
the 25th and 75th percentiles.
As already known, the NoC architecture is widely accepted due to its scalability potential
related to traffic in general, hence an intuitive guess would be that the number of needed virtual
channels also scales. Figure 2.22 confirms that assumption with an almost-linear dependency
between the number of flows and virtual channels, and also gives a quantitative estimate of
how many channels are needed for various workloads. In particular, the currently available SCC
platform, with its 8 virtual channels, can accommodate around 300 flows. On average, the addition
of one channel allows the platform to accept 50 more flows. Recall that the
traditional distinctive-priority approach requires that the number of virtual channels is equal to the
number of flows, and notice the reduction achievable with the proposed approach where flows are
allowed to dynamically change virtual channels, i.e. for 1000 flows the proposed method requires,
on average, only 23 channels. The practical implications of these findings are further elaborated
in Section 2.3.3.6.
Experiment 2: Do Virtual Channels Help?
The previous experiment demonstrated that the proposed approach efficiently maps the work-
load, such that the number of needed virtual channels is minimised to the level beyond any rea-
sonable expectation. This implies that the mapping process evenly distributes the contentions
across the entire platform and avoids hotspots. However, as the minimisation of port-contentions
(i.e. virtual channels) is the central criterion, this can result in solutions where some flows are
forced to traverse long routes, which can in turn have an impact on the schedulability. Conversely,
placing highly communicative functionalities close to each other would minimise the flow routes
2.3 NoCs with Priority-Preemptive Arbitration Policies 75
50 100 150 200 250 300 350 400 450 500 550 600 650 700 750 800 850 900 950 1000
0
5
10
15
20
25
N
um
be
r o
f v
irt
ua
l c
ha
nn
el
s
Number of flows
Figure 2.22: Virtual channels do scale with respect to number of flows
and might potentially improve the schedulability, but would create hotspots and cause more con-
tentions for some ports. Thus, an intuitive guess would be that the schedulability highly depends
on the number of available virtual channels, and the objective of this experiment is to quantify that
trade-off, that is, to observe to what degree can the schedulability be improved by employing an
additional virtual channel.
Three workload categories are generated, containing 100, 300 and 500 flows with implicit deadlines, i.e. D(fi) = T(fi), ∀fi ∈ F. For each category there are 3 subcategories with the following flow minimum inter-arrival periods: T1 ∈ [1−5] µs, T2 ∈ [2−10] µs and T3 ∈ [5−20] µs. For each subcategory, 200 sets of functionalities are generated and subsequently mapped. The results are given in Figure 2.23, where the horizontal axis stands for the number of virtual channels provided by the platform, v̂lim, and the vertical axis stands for the provided schedulability guarantees, that is, the percentage of their initial sizes to which all flows have to be uniformly reduced such that the system is schedulable. The value VCmin corresponds to the number of virtual channels obtained by the VC minimisation phase, i.e. VCmin = v̂(Mmin).
[Figure 2.23: Additional virtual channels do not help in schedulability guarantees. Panels: (a) number of flows = 100, (b) number of flows = 300, (c) number of flows = 500. Horizontal axis: number of virtual channels (VCmin to VCmin + 4); vertical axis: schedulable flow size (in % of initial flow size); one series per period range T1 ∈ [1−5] µs, T2 ∈ [2−10] µs and T3 ∈ [5−20] µs.]
This experiment reported highly unexpected and unintuitive results, which suggest that adding virtual channels might negatively impact the efficiency of the mapping process and produce a solution which is worse than the one obtained with the minimum number of virtual channels v̂(Mmin). These surprising results may be related to specific workloads, or may be a consequence of inherent properties of the SA meta-heuristic itself. To rule out the former possibility, the experiment was repeated several times, as described above, with very diverse traffic loads. However, the results remained very similar and consistently demonstrated a systematic decrease in schedulability guarantees as the number of virtual channels increased. To rule out the latter possibility, more experimentation is needed, with different heuristic and meta-heuristic approaches. Thus, we can conclude that, when mapping the workload with the aforementioned method, virtual channels are not bottlenecks. This can be interpreted in the following way: minimising virtual channels (contentions) indeed contributes to the schedulability. As this objective tends to distribute the contentions on the grid as evenly as possible, it might cause longer traversal paths of some flows. However, as the load is equally spread across all the links, better guarantees can be provided even with those longer paths. Conversely, minimising flow distances causes hotspots, and even if some flow traverses a short distance, the links it uses might be heavily loaded, hence highly impacting its schedulability. Therefore, the conclusion is that (i) the solution obtained with the minimal number of virtual channels is one of the near-optimal solutions, and (ii) adding more virtual channels, in most cases, unnecessarily expands the solution space, which causes SA to drift away from the "good" region of the solution space and frequently conclude the search with a worse solution. Of course, this cannot be proven analytically and/or experimentally, as the workload mapping is an NP-hard problem, which is computationally intractable even for small grids, e.g. 4×4 [40]. However, small-scale experiments with exhaustive enumeration and different meta-heuristics could further support or refute the aforementioned claims, and these activities are potential future work. The implications of these surprising findings are further discussed in Section 2.3.3.6.
Experiment 3: Then, What is the Bottleneck?
As the previous experiment provided experimental evidence which suggests that virtual channels are not the bottleneck, the focus of this experiment, which tests the limits of the platform, is on another parameter: the link bandwidth Blink. The individual link bandwidth of the SCC platform is around 256 Gbps, while Tilera platforms provide around 166 Gbps. Assuming a variety of network loads, it is observed how different link bandwidths influence the provided schedulability guarantees.
Three workload categories are generated, consisting of 100, 300 and 500 flows with implicit deadlines. Each category has 3 subcategories, which differ in flow minimum inter-arrival periods: T1 ∈ [1−5] µs, T2 ∈ [2−10] µs and T3 ∈ [5−20] µs. For each subcategory, 1000 sets of functionalities are generated and subsequently mapped, assuming that the platform provides only the minimal number of virtual channels, i.e. v̂lim = v̂(Mmin). The results are given in Figure 2.24, where the horizontal axis stands for the link bandwidth, and the vertical axis represents the provided schedulability guarantees, expressed as the maximum percentage of the initial flow sizes to which all packets have to be reduced such that the system is schedulable.
[Figure 2.24: Link bandwidths are bottlenecks. Panels: (a) number of flows = 100, (b) number of flows = 300, (c) number of flows = 500. Horizontal axis: link bandwidth (64 Gbps, 128 Gbps, 192 Gbps, 256 Gbps); vertical axis: schedulable flow size (in % of initial flow size); one series per period range T1 ∈ [1−5] µs, T2 ∈ [2−10] µs and T3 ∈ [5−20] µs.]
This experiment reported results which are expected, suggesting that the derived guarantees depend linearly on link bandwidths. The trend is similar for scenarios which cover moderately and extremely loaded networks. The only exceptions are the cases with 100 flows and the T1 and T2 periods (Figure 2.24(a)), where, due to the lighter load, solutions can be found for which the schedulability of the initial flow sizes can be guaranteed even with smaller bandwidths. Thus, in general, link bandwidths can be perceived as the bottleneck of the system, which coincides with the intuition.
2.3.3.6 Discussion
The proposed technique, which employs an existing feature of the SCC platform to minimise the number of needed virtual channels, significantly relaxes the requirements on the platform characteristics which are needed for the real-time analysis. Through experiments, it has been demonstrated that (i) the number of needed virtual channels scales almost linearly with the increasing traffic and (ii) the number is not unreasonably high and should be achievable with forthcoming generations of many-core platforms. It has been shown that limiting the number of channels to the minimum while mapping the workload is, in most cases, a beneficial approach, and leads to a near-optimal solution. The aforementioned facts altogether answer the questions posed in Section 2.3.3.3. However, these answers raise another question: after all, is the priority-share policy needed? As demonstrated with the experiments, the distinctive-priority analysis can be performed with little overhead in terms of needed virtual channels, and thus it is very probable that future real-time-oriented interconnect mediums will provide sufficient virtual channels to afford per-flow distinctive priorities and avoid the priority-share policy and the pessimism related to it.
Furthermore, link bandwidths were recognised as the bottleneck of the system. As this characteristic is equally important in high-performance and general-purpose computing, which are the main drivers for the advancements in interconnect mediums, it is reasonable to expect that link bandwidths will continue to increase in future years, which will also be appreciated by the real-time community.
2.3.4 EDF as Arbitration Policy
All the methods for the worst-case analysis of NoCs that have been introduced so far have been
developed from the same underlying assumption that each flow has a fixed priority. In this section,
it will be investigated whether allowing flows to dynamically change their priorities can have a
positive impact on the schedulability. Specifically, a novel arbitration policy for NoC routers is
proposed, which is based on the EDF paradigm [58] – a well-established concept in the scheduling
theory. In this section, the following questions will be answered: 1. What are the prerequisites to
enforce the EDF arbitration policy within NoC routers? (Section 2.3.4.1) 2. How to perform the
worst-case traffic delay analysis? (Section 2.3.4.2) 3. Does this approach outperform the state-
of-the-art methods, under which conditions and by how much? (Section 2.3.4.5) 4. What are the
practical limitations of the approach? (Sections 2.3.4.5-2.3.4.6) 5. Ultimately, do its merits justify
the overhead of enforcing a novel arbitration policy? (Section 2.3.4.6).
2.3.4.1 Prerequisites
EDF (Earliest Deadline First) is a well-known concept in the single-core scheduling theory [58]. EDF has been proven optimal in the following sense: if any scheduling policy (including the previously mentioned fixed-priority ones) can render the task-set schedulable, EDF will also be able to do so. EDF has also been studied from the perspective of multi-core platforms [9]. However, it faces several challenges in that context: (i) it has been shown not to be very efficient, and (ii) due to the necessity to maintain global structures, e.g. a ready-queue, scalability issues may arise. So far, no work has considered EDF as an arbitration policy for NoC routers, and the work presented in this section is motivated by that fact.
Irrespective of whether it is applied to the scheduling or the NoC contention theory, the EDF policy arbitrates the access to the shared resource (e.g. a core, a link) based on the latest time instant until which contending entities (e.g. jobs, packets) have to complete their execution. Thus, one of the prerequisites to enforce such a policy in NoC routers is that, prior to sending, at time instant t, a packet of a flow fi is tagged with its deadline, expressed in absolute values: d(fi) = t + D(fi). In the ideal case, if two packets, belonging to two distinct flows fi and fj, with their respective deadlines d(fi) and d(fj), encounter each other and contend for some link on their paths, the one with the earlier deadline should have the precedence and win the arbitration. That is, if d(fi) < d(fj), then the packet of fi will be able to preempt the packet of fj, and vice versa.
What are the requirements to implement such a policy? With respect to the router's logic, very few changes are needed; instead of packet priorities, packet deadlines are compared. However, tagging packets with their deadlines might be a challenging activity, because the timestamps may consume a lot of space. Also, notice that the value of the deadline highly depends on the perception of time of the core releasing the packet. Due to the finite signal propagation speed, different temperatures and different physical compositions, it is quite common that two components of the same system receive the same signal at different time instants. This implies that two cores may perceive the same time instant at two different moments, which in the circuit design theory is called the clock skew. Ultimately, the clock skew is inevitable and chip manufacturers employ various techniques to mitigate its effects; however, that topic is beyond the scope of this dissertation. Here, it is assumed that the parameter ∆ denotes the maximum clock skew. The implications are explained with the following example. Consider that, in a hypothetical case, two cores release two packets with the identical relative deadline D, one at time t and the other slightly later at t + ε, where ε < ∆, such that their true absolute deadlines are t + D and t + D + ε, respectively. Due to the clock skew, the latter core may still tag its packet with a deadline value which is less than the one with which the former core tagged its packet. Consequently, if these packets contend, the latter will win the arbitration, despite the fact that it was indeed released ε time units after the former. This effect has to be considered in the worst-case analysis.
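The effect of the clock skew on the arbitration can be illustrated with a small sketch. It is not a router implementation: the `Packet` class, the `clock_offset` field (the error of the releasing core's local clock) and the tie-breaking rule are assumptions introduced here for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    flow: str
    release: int        # true release instant t
    rel_deadline: int   # relative deadline D(f)
    clock_offset: int   # hypothetical local clock error of the releasing core

    @property
    def tag(self):
        # Absolute deadline as perceived (and tagged) by the releasing core.
        return self.release + self.clock_offset + self.rel_deadline


def edf_winner(p, q):
    """Return the packet that wins the arbitration: the one with the earlier tag.

    Ties are resolved in favour of p, which is an arbitrary assumption.
    """
    return p if p.tag <= q.tag else q


# Two packets with the same relative deadline; the second is released
# eps = 2 time units later, but its core's clock lags behind. The two clocks
# differ by 4 time units, within an assumed maximum skew of 5.
first = Packet("fi", release=1000, rel_deadline=500, clock_offset=+3)
second = Packet("fj", release=1002, rel_deadline=500, clock_offset=-1)
```

With ideal clocks (both offsets zero) the first packet would carry the earlier tag; with the offsets above, the later-released packet carries the earlier tag and wins, which is the inversion that the parameter ∆ accounts for in the analysis.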
2.3.4.2 Worst-Case Analysis
As already described, the worst-case traversal time of a flow fi consists of two components, namely the isolation delay Ci and the interference Ii. Notice that the first term is independent of the arbitration policy. So, in order to be able to test the schedulability of the flow-set upon a NoC with the EDF arbitration policy, it is necessary to obtain the interference component.
[Figure 2.25: Traffic flows (example 4), showing the flows f1 to f5]
Consider the example illustrated in Figure 2.25 with the flow characteristics given in Table 2.5. Assume that the analysis for an EDF-scheduled single-core system with constrained or implicit deadlines has been straightforwardly applied to this example. For such a model, the necessary and sufficient condition for schedulability is that the total system utilisation does not exceed the value of one [58], i.e. U ≤ 1. In previous sections it was mentioned that the worst-case analysis for priority-preemptive NoCs is performed in such a way that the path of the flow under analysis is treated as an indivisible resource, and any request for any of its parts by other flows is considered as potential interference. This implies that, in this example, it has to be checked whether the utilisation of the path of each flow, treated as a single-core system, fulfils the schedulability condition, i.e. U(fx) ≤ 1, ∀fx ∈ {f1, f2, f3, f4, f5}.
Table 2.5: Flow-set parameters for Figure 2.25
Flow C(f) JR(f) D(f) = T(f)
f1 3 0 9.98
f2 3 0 9.99
f3 1 0 7
f4 6.02 0 11.01
f5 3 0 10
The computation renders the following values:
\begin{align*}
U(f_1) &= \frac{C(f_1)+J_R(f_1)}{T(f_1)} + \frac{C(f_2)+J_R(f_2)}{T(f_2)} \approx 0.6 \\
U(f_2) &= \frac{C(f_1)+J_R(f_1)}{T(f_1)} + \frac{C(f_2)+J_R(f_2)}{T(f_2)} + \frac{C(f_3)+J_R(f_3)}{T(f_3)} \approx 0.74 \\
U(f_3) &= \frac{C(f_2)+J_R(f_2)}{T(f_2)} + \frac{C(f_3)+J_R(f_3)}{T(f_3)} + \frac{C(f_4)+J_R(f_4)}{T(f_4)} \approx 0.99 \\
U(f_4) &= \frac{C(f_3)+J_R(f_3)}{T(f_3)} + \frac{C(f_4)+J_R(f_4)}{T(f_4)} + \frac{C(f_5)+J_R(f_5)}{T(f_5)} \approx 0.99 \\
U(f_5) &= \frac{C(f_4)+J_R(f_4)}{T(f_4)} + \frac{C(f_5)+J_R(f_5)}{T(f_5)} \approx 0.85
\end{align*}
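The path utilisations can be reproduced with a few lines of code. The sketch below takes the parameters from Table 2.5 and assumes, as the utilisation expressions indicate, that each flow contends only with its neighbours in the chain f1 to f5; the dictionary names are illustrative.

```python
# (C, JR, T) per flow, taken from Table 2.5.
FLOWS = {
    "f1": (3, 0, 9.98),
    "f2": (3, 0, 9.99),
    "f3": (1, 0, 7),
    "f4": (6.02, 0, 11.01),
    "f5": (3, 0, 10),
}

# Flows whose load appears on the path of each analysed flow (the flow itself
# plus its directly contending neighbours), as in the utilisation expressions.
ON_PATH = {
    "f1": ["f1", "f2"],
    "f2": ["f1", "f2", "f3"],
    "f3": ["f2", "f3", "f4"],
    "f4": ["f3", "f4", "f5"],
    "f5": ["f4", "f5"],
}


def path_utilisation(flow):
    """Sum of (C + JR) / T over all flows using the path of the given flow."""
    return sum((FLOWS[f][0] + FLOWS[f][1]) / FLOWS[f][2] for f in ON_PATH[flow])


UTIL = {f: path_utilisation(f) for f in FLOWS}

if __name__ == "__main__":
    for name, u in UTIL.items():
        print(f"U({name}) = {u:.2f}")
```

Every value stays below one, so the plain utilisation test would accept the flow-set.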
The results suggest that the flow-set is schedulable. However, Figure 2.26 demonstrates that missed deadlines may occur! The explanation is identical to that of the fixed-priority example (Section 2.3.1 and Figures 2.16-2.17): indirect interferences cause network jitters, which were not considered in the analysis and yet have an impact on the schedulability. In this particular example, f2 indirectly interferes with f4 in the sense that it causes jitter to f3, which eventually leads to a missed deadline of f4. Thus, in the presence of network jitters, having the utilisation of all paths less than or equal to 1 is not a sufficient condition for a flow-set to be schedulable. Therefore, in order to test the schedulability of a flow-set, network jitters must be included in the analysis.
[Figure 2.26: Deadline miss for flow-set from Figure 2.25 and Table 2.5. Timeline of the flows f1 to f5 over the time instants 0 to 17.]
In the scheduling theory, the model that bears the closest resemblance to this model is the one
that involves EDF-scheduled single-core systems with release jitters. For that model Spuri [90]
proposed the worst-case analysis. He defined a busy period, which represents the time interval of
maximal length with a continuous execution demand, assuming that the first jobs of all tasks were
released with the maximum jitters. Subsequently, he proved that if no missed deadlines occur
within the busy period, the system is schedulable.
Recall, that in the worst-case analysis of priority-preemptive NoCs, a path of each flow is
treated as a single-core system. Thus, unlike in the scheduling theory where only one busy period
is computed, here one busy period for each flow (path) has to be computed. For now, assume that
the jitters are known, and later it will be explained how to compute them. The length of the busy
period W ( fi) (Equation 2.9) is obtained by summing up the maximum load that can be generated
by the flow under analysis fi, and the maximum load that can be generated by all the other flows
that are directly contending with fi (share a part of the path with it).
\[
W(f_i) = \left\lceil \frac{W(f_i)+J_R(f_i)+J_N(f_i)}{T(f_i)} \right\rceil \cdot C(f_i)
+ \sum_{\forall f_j \in \mathcal{F}_D(f_i)} \left\lceil \frac{W(f_i)+J_R(f_j)+J_N(f_j)}{T(f_j)} \right\rceil \cdot C(f_j)
\tag{2.9}
\]
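Since W(fi) appears on both sides of Equation 2.9, it is typically obtained by fixed-point iteration. The following is a minimal sketch under stated assumptions: the starting value (the sum of isolation latencies), the `limit` used to detect divergence and the tuple layout are choices made here, not prescribed by the text.

```python
import math


def busy_period(flows, limit=10**6):
    """Iterate Equation 2.9 until a fixed point is reached.

    flows: list of (C, T, JR, JN) tuples for the analysed flow and every
           directly contending flow.
    Returns the busy-period length, or math.inf if the iteration exceeds
    `limit` (assumed here to indicate that no finite busy period exists).
    """
    w = sum(c for c, _, _, _ in flows)  # assumed starting point
    while w <= limit:
        nxt = sum(math.ceil((w + jr + jn) / t) * c for c, t, jr, jn in flows)
        if nxt == w:
            return w
        w = nxt
    return math.inf


# Two flows with (C, T) = (5, 10) and (6, 15), no jitters.
W_PAIR = busy_period([(5, 10, 0, 0), (6, 15, 0, 0)])

# Three flows with (C, T) = (2, 6), (3, 7), (2, 6): the total utilisation
# exceeds one, so the iteration never settles.
W_OVERLOAD = busy_period([(2, 6, 0, 0), (3, 7, 0, 0), (2, 6, 0, 0)])
```

The exact equality test is safe for integer parameters; with floating-point parameters a small tolerance may be needed.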
Once the busy period has been computed, it has to be checked whether the analysed flow fi can miss a deadline during that period. The time instants for which this has to be checked are those where the deadlines of fi and at least one of the potentially interfering flows coincide, i.e.
\[
\mathcal{T}_{crit}(f_i) = \bigcup_{\forall f_j \in \mathcal{F}_D(f_i)} \left\{ k \cdot T(f_j) - T(f_i),\ k \in \mathbb{N}_0 \right\} \cap [0, W(f_i)],
\]
hereafter referred to as the critical instants. Assuming that a packet is released at the critical instant t ∈ Tcrit(fi), its worst-case traversal time L(fi, t) (expressed in absolute values) is computed by solving Equation 2.10.
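\[
L(f_i,t) = \left( 1 + \left\lfloor \frac{t+J_R(f_i)+J_N(f_i)}{T(f_i)} \right\rfloor \right) \cdot C(f_i)
+ \sum_{\substack{\forall f_j \in \mathcal{F}_D(f_i) \\ T(f_j) \le t+T(f_i)+J_R(f_j)+J_N(f_j)+\Delta}}
\min\left\{ \left\lceil \frac{L(f_i,t)+J_R(f_j)+J_N(f_j)}{T(f_j)} \right\rceil ,
\left\lfloor \frac{t+T(f_i)+J_R(f_j)+J_N(f_j)+\Delta}{T(f_j)} \right\rfloor \right\} \cdot C(f_j)
\tag{2.10}
\]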
Equation 2.10 consists of the sum of the latencies of all packets of fi which were released in the interval [0, L(fi, t)], augmented by the interference that the packets of fi may suffer from the packets of other flows. Individual interference terms are obtained by finding the smaller of (i) the maximum number of releases of an interfering flow within the interval [0, L(fi, t)], and (ii) the maximum number of those releases whose deadline falls before, or coincides with, L(fi, t). Note that Equation 2.10 is similar to the one Spuri [90] proposed for EDF-scheduled single-core systems with release jitters. The differences are that this approach considers the maximum clock skew (parameter ∆), two types of jitters and constrained or implicit deadlines.
The worst-case traversal time of a packet of the flow fi, released at the critical instant t, denoted by WCTT(fi, t), is computed by subtracting its release t from the obtained value L(fi, t). The additional remark is that the result cannot be less than C(fi) (Equation 2.11).
\[ \mathit{WCTT}(f_i,t) = \max\{ C(f_i),\ L(f_i,t) - t \} \tag{2.11} \]
Finally, upon obtaining the traversal times for all the critical instants, the worst-case traversal time of a flow fi, denoted by WCTT(fi), can be computed by finding the maximum over all critical instants (Equation 2.12). Of course, the necessary and sufficient schedulability condition is WCTT(fi) ≤ D(fi).
\[ \mathit{WCTT}(f_i) = \max\{ \mathit{WCTT}(f_i,t),\ \forall t \in \mathcal{T}_{crit}(f_i) \} \tag{2.12} \]
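The steps from the critical instants to Equation 2.12 can be sketched as follows. This is an illustrative reading of the equations, not a reference implementation: the `Flow` class and the function names are hypothetical, implicit deadlines (D = T) are assumed, the busy period W is supplied by the caller, and the inner fixed-point iteration for L(fi, t) starts from the flow's own load.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    C: float          # isolation latency
    T: float          # minimum inter-arrival period (equal to the deadline here)
    JR: float = 0.0   # release jitter
    JN: float = 0.0   # network jitter


def critical_instants(fi, others, W):
    """Instants k*T(fj) - T(fi) that fall inside [0, W]."""
    points = set()
    for fj in others:
        k = 0
        while k * fj.T - fi.T <= W:
            t = k * fj.T - fi.T
            if t >= 0:
                points.add(t)
            k += 1
    return sorted(points)


def abs_time(fi, others, t, delta=0.0):
    """Solve Equation 2.10 for L(fi, t) by fixed-point iteration."""
    own = (1 + math.floor((t + fi.JR + fi.JN) / fi.T)) * fi.C
    L = own
    while True:
        total = own
        for fj in others:
            horizon = t + fi.T + fj.JR + fj.JN + delta
            if fj.T <= horizon:
                by_release = math.ceil((L + fj.JR + fj.JN) / fj.T)
                by_deadline = math.floor(horizon / fj.T)
                total += min(by_release, by_deadline) * fj.C
        if total == L:
            return L
        L = total


def wctt(fi, others, W, delta=0.0):
    """Equations 2.11 and 2.12: maximum over all critical instants."""
    worst = fi.C
    for t in critical_instants(fi, others, W):
        worst = max(worst, abs_time(fi, others, t, delta) - t)
    return worst


# Two mutually contending flows without jitters; W = 27 is the fixed point of
# Equation 2.9 for this pair and is passed in as a precomputed value.
fa, fb = Flow(C=5, T=10), Flow(C=6, T=15)
WCTT_A = wctt(fa, [fb], W=27)
WCTT_B = wctt(fb, [fa], W=27)
```

The iteration in `abs_time` terminates because the right-hand side is non-decreasing in L and bounded by the deadline-based term. As with the busy period, the exact equality test is intended for integer parameters.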
2.3.4.3 Network Jitter Computation
The analysis proposed in the previous section is applicable under the assumption that network jit-
ters of all flows are known in advance. Recall (Equation 2.3), network jitters are the way to model
the effects of indirect interferences, and have non-zero values only for those directly competing
flows which are involved in the indirect interference relationship (from the perspective of the anal-
ysed flow). Thus, it is trivial to see that the network jitter of the analysed flow is always zero, i.e.
in Equations 2.9-2.10 JN( fi) = 0.
Remember that in the analysis for the fixed-priority model (Section 2.3.1), it is safely assumed that each contending flow that is involved in the indirect interference relationship (from the perspective of the analysed flow) experiences the maximum possible jitter, while still being schedulable, i.e. JN(fi) = WCTT(fi) − C(fi). That approach was straightforwardly applicable, because in the fixed-priority model the commutative property applied to preemptions does not hold. In other words, for any two contending flows fi and fj it holds that, if P(fi) > P(fj), then fi can preempt fj, but fj cannot preempt fi. Thus, the existence of fj has no effect on the analysis of fi, and the worst-case traversal time and the jitter can be computed first for fi and then for fj. In fact, by sorting the flows in decreasing order of their priorities, and by performing the analysis in that order, the worst-case traversal times and jitters of all flows can be obtained in a single pass.
[Figure 2.27: Chain of dependencies for Figure 2.25, relating WCTT(f2), WCTT(f3) and WCTT(f4) to the jitters JN(f1) = 0, JN(f2), JN(f3), JN(f4) and JN(f5) = 0]
Conversely, in the model with the EDF arbitration policy, the commutative property applied to preemptions may hold, that is, two contending flows fi and fj may preempt each other. This implies that if the aforementioned approach to compute jitters (JN(fi) = WCTT(fi) − C(fi)) is applied to this model, circular dependencies may occur. This is demonstrated with an illustrative example.
Consider the flow-set from Figure 2.25. Moreover, assume that the objective is to compute the worst-case traversal time of the flow f3. It is visible that WCTT(f3) depends on both JN(f2) and JN(f4), which both have non-zero values due to the existence of f1 and f5, respectively. Both jitters JN(f2) and JN(f4) depend on the respective worst-case traversal times WCTT(f2) and WCTT(f4). Furthermore, WCTT(f2) depends on JN(f1) and JN(f3). JN(f1) has a value of zero, because f1 does not have directly competing flows which may cause indirect interference to f2. However, JN(f3) has a non-zero value, due to the existence of f4. JN(f3) depends on WCTT(f3). Notice that the computation of WCTT(f3) is reached for the second time, which implies that a circular dependency has been encountered. And indeed, by constructing the chain of dependencies for the given example (Figure 2.27), it can be noticed that
WCTT(f3) → JN(f2) → WCTT(f2) → JN(f3) → WCTT(f3)
and
WCTT(f3) → JN(f4) → WCTT(f4) → JN(f3) → WCTT(f3)
are circles of dependencies.
As discussed above, the commutative property with respect to preemptions may hold for EDF-arbitrated NoCs, and may cause circular dependencies. Therefore, jitters and the worst-case traversal times cannot be straightforwardly computed like in the fixed-priority model. One way to solve this problem is a nested iterative approach, where inner iterations correspond to individual flows, similar to the fixed-priority model, while outer iterations correspond to the entire flow-set. Algorithms 7-8 describe how to compute jitters and the worst-case traversal times in an interleaved fashion. The computation process starts by invoking the function described with Algorithm 7. For each flow fi of the flow-set, the previously computed value of the worst-case traversal time WCTTold(fi) and the newly computed value WCTT(fi) are kept. Initially, the computation starts by setting the worst-case traversal time of each flow to its isolation latency (line 2). Then, for each flow, the function described with Algorithm 8 is invoked, which computes the worst-case traversal time of the flow (line 8). While there exists at least one flow for which the new and the old value of the worst-case traversal time are different, the process is repeated (line 5). Finally, when the worst-case traversal times from two successive iterations are equal for every flow, the computation process terminates. That is, the stopping condition is: WCTTold(fi) = WCTT(fi), ∀fi ∈ F. If, at any stage, the computed worst-case traversal time of any flow exceeds its deadline, the entire flow-set is rendered unschedulable (lines 9-11).
Algorithm 7 isSchedulable(F)
Input: flow-set F
Output: the information whether F is schedulable
1: for each (fi ∈ F) do
2:   WCTT(fi) ← C(fi); // New WCTT
3:   WCTTold(fi) ← 0; // Old WCTT
4: end for
5: while (∃ fi ∈ F | WCTTold(fi) ≠ WCTT(fi)) do
6:   for each (fi ∈ F) do
7:     WCTTold(fi) ← WCTT(fi);
8:     WCTT(fi) ← compWCTT(fi, F);
9:     if (WCTT(fi) > D(fi)) then
10:      return false;
11:    end if
12:  end for
13: end while
14: return true
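The outer loop of Algorithm 7 can be sketched as follows. The per-flow analysis is passed in as a callable, here named `comp_wctt`, which is a hypothetical stand-in for Algorithm 8; the `max_rounds` guard and the decision to treat non-convergence as unschedulable are assumptions added for this sketch.

```python
def is_schedulable(flows, C, D, comp_wctt, max_rounds=1000):
    """Outer fixed-point loop in the spirit of Algorithm 7.

    flows:     iterable of flow identifiers
    C, D:      dicts with isolation latencies and deadlines
    comp_wctt: callable (flow, wctt_dict) -> new WCTT of that flow
               (hypothetical stand-in for Algorithm 8)
    Returns (schedulable, wctt_dict).
    """
    wctt = {f: C[f] for f in flows}          # start from isolation latencies
    for _ in range(max_rounds):
        changed = False
        for f in flows:
            old = wctt[f]
            wctt[f] = comp_wctt(f, wctt)     # uses the latest values of all flows
            if wctt[f] > D[f]:
                return False, wctt
            changed = changed or wctt[f] != old
        if not changed:                      # two successive rounds agree
            return True, wctt
    return False, wctt                       # assumption: no convergence = reject


# Toy per-flow analysis with a mutual dependency between two flows; it only
# exercises the loop and is not the analysis of Equations 2.9-2.12.
TOY_C = {"a": 2, "b": 3}


def toy_comp(f, wctt):
    other = "b" if f == "a" else "a"
    return TOY_C[f] + min(4, wctt[other])


RESULT_OK = is_schedulable(["a", "b"], TOY_C, {"a": 6, "b": 7}, toy_comp)
RESULT_MISS = is_schedulable(["a", "b"], TOY_C, {"a": 5, "b": 7}, toy_comp)
```

In the toy example the value of flow "a" keeps growing in the second round because it depends on the updated value of "b", which is the kind of circular dependency that makes more than one outer iteration necessary.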
The computation of the worst-case traversal time of an individual flow is described with Algorithm 8. The algorithm has three stages. The first one generates a list FD(fi), which contains all flows that are directly contending with the analysed flow fi (lines 1-5). Then, for each flow fj ∈ FD(fi), it is tested whether it has a contending flow fk which may indirectly interfere with fi. If at least one such flow exists, the network jitter JN(fj) has a non-zero value and is computed by subtracting the isolation latency of fj from its worst-case traversal time (lines 8-9). Conversely, if there exists no fk which can cause indirect interference to fi through fj, it follows that JN(fj) = 0 (lines 10-11). Once the jitters of all contending flows are obtained, WCTT(fi) is computed from Equations 2.9-2.12 (lines 15-22).
Algorithm 8 compWCTT(fi, F)
Input: analysed flow fi, flow-set F
Output: the worst-case traversal time WCTT(fi)
1: // 1. Find all flows contending with fi
2: FD(fi) ← ∅;
3: for each (fj ∈ F | fj ≠ fi ∧ L(fj) ∩ L(fi) ≠ ∅) do
4:   add (FD(fi), fj);
5: end for
6: // 2. Compute jitters of all contending flows
7: for each (fj ∈ FD(fi)) do
8:   if (∃ fk ∈ F | fk ≠ fi ∧ fk ∉ FD(fi) ∧ L(fk) ∩ L(fj) ≠ ∅) then
9:     JN(fj) ← WCTT(fj) − C(fj);
10:  else
11:    JN(fj) ← 0;
12:  end if
13: end for
14: // 3. Compute the WC traversal time of fi
15: W(fi) ← compBusyInterval(fi ∪ FD(fi)); // Equation 2.9
16: Tcrit(fi) ← findCritPts(fi ∪ FD(fi), W(fi)); // Section 2.3.4.2
17: WCTT(fi) ← C(fi);
18: for each (t ∈ Tcrit(fi)) do
19:   L(fi, t) ← compAbsTime(fi ∪ FD(fi), t); // Equation 2.10
20:   WCTT(fi) ← max{WCTT(fi), L(fi, t) − t}; // Equations 2.11-2.12
21: end for
22: return WCTT(fi);
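The first two stages of Algorithm 8 amount to set operations on the flow paths. The sketch below assumes that each path is given as a set of link identifiers; the link names and the example WCTT values are hypothetical, and the chain layout is an assumed encoding of the example in Figure 2.25. The analysed flow itself is explicitly excluded when searching for indirectly interfering flows.

```python
def contending(fi, paths):
    """Stage 1: flows that share at least one link with fi."""
    return {fj for fj in paths if fj != fi and paths[fj] & paths[fi]}


def network_jitters(fi, paths, wctt, C):
    """Stage 2: network jitter of every flow directly contending with fi.

    A contending flow fj gets a non-zero jitter only if some third flow fk
    (neither fi nor a member of the contending set) shares a link with fj.
    """
    fd = contending(fi, paths)
    jitters = {}
    for fj in fd:
        indirect = any(
            fk != fi and fk not in fd and paths[fk] & paths[fj]
            for fk in paths
        )
        jitters[fj] = wctt[fj] - C[fj] if indirect else 0
    return jitters


# Assumed encoding of the chain from Figure 2.25: neighbouring flows share
# exactly one (hypothetically named) link.
PATHS = {
    "f1": {"l1", "l2"},
    "f2": {"l2", "l3"},
    "f3": {"l3", "l4"},
    "f4": {"l4", "l5"},
    "f5": {"l5", "l6"},
}
C_ISO = {"f1": 3, "f2": 3, "f3": 1, "f4": 6.02, "f5": 3}   # Table 2.5
WCTT_GUESS = {"f1": 4, "f2": 5, "f3": 2, "f4": 8, "f5": 4}  # illustrative values

JN_FOR_F3 = network_jitters("f3", PATHS, WCTT_GUESS, C_ISO)
JN_FOR_F2 = network_jitters("f2", PATHS, WCTT_GUESS, C_ISO)
```

For the analysis of f2 this yields a zero jitter for f1 and a non-zero jitter for f3, matching the dependency chain described for this example.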
2.3.4.4 Discussion
In the previous two sections, the method to compute the worst-case traversal times of flows was proposed. This method is applicable to NoCs with the EDF arbitration policy. Consequently, if it holds that WCTT(fi) ≤ D(fi), ∀fi ∈ F, then the entire flow-set F is schedulable.
Despite the fact that the worst-case analysis of NoCs is similar to the single-core scheduling theory, there are some differences as well, which have several interesting implications. For instance, Shi and Burns [84] showed that the rate-monotonic priority assignment technique is not optimal in the NoC context where flows have fixed priorities, while the opposite is a well-known fact in the single-core scheduling theory [58]. Similarly, it is well-known that EDF is the optimal scheduling policy for single-core systems [58], so it is interesting to investigate whether there exist cases in the NoC context where the fixed-priority arbitration policy outperforms EDF. The following two case studies further explore these ideas.
Case-study 1 (EDF outperforms FP)
Consider the example of only two contending flows, with the characteristics given in Table 2.6.
Since only two flows are involved, there are no indirect interferences, i.e. both jitters JN( f1) and
JN( f2) are zero. Consider the schedulability test for this flow-set, assuming that the priorities have
been assigned in the rate-monotonic fashion: T ( f1)< T ( f2)⇒ P( f1)> P( f2).
Table 2.6: Flow-set parameters for two contending flows
Flow C(f) JR(f) D(f) = T(f)
f1 5 0 10
f2 6 0 15
\[ \mathit{WCTT}(f_1) = C(f_1) = 5 < D(f_1) \]
\[ \mathit{WCTT}(f_2) = C(f_2) + \left\lceil \frac{\mathit{WCTT}(f_2)+J_R(f_1)+J_N(f_1)}{T(f_1)} \right\rceil \cdot C(f_1) = 16 > D(f_2) \]
The flow f2 is unschedulable. Now consider again the schedulability test if the priorities are
assigned differently: P( f2)> P( f1).
\[ \mathit{WCTT}(f_2) = C(f_2) = 6 < D(f_2) \]
\[ \mathit{WCTT}(f_1) = C(f_1) + \left\lceil \frac{\mathit{WCTT}(f_1)+J_R(f_2)+J_N(f_2)}{T(f_2)} \right\rceil \cdot C(f_2) = 11 > D(f_1) \]
In this case, the flow f1 is unschedulable. Now, consider the schedulability test of this flow-set assuming the EDF arbitration policy. Since both network jitters JN(f1) and JN(f2) are zero, it is sufficient to test only whether the utilisations of their paths L(f1) and L(f2) do not exceed one.
\[ U(f_1) = U(f_2) = \frac{C(f_1)}{T(f_1)} + \frac{C(f_2)}{T(f_2)} = 0.9 \]
As U(f1) = U(f2) < 1, the flow-set is schedulable.
Case-study 2 (FP outperforms EDF)
Consider the example of flows given in Figure 2.28, with the flow characteristics given in
Table 2.7. By applying the rate-monotonic priority assignment policy it follows that P( f1) >
P( f2) and P( f3) > P( f2), thus, there are again no indirect interferences and jitters. Consider the
schedulability test for this example.
[Figure 2.28: Traffic flows (example 5), showing the flows f1, f2 and f3 and the routers ρ1 to ρ5]
\[ \mathit{WCTT}(f_1) = C(f_1) = 2 < D(f_1) \]
\[ \mathit{WCTT}(f_3) = C(f_3) = 2 < D(f_3) \]
Table 2.7: Flow-set parameters for Figure 2.28
Flow C(f) JR(f) D(f) = T(f)
f1 2 0 6
f2 3 0 7
f3 2 0 6
\[
\mathit{WCTT}(f_2) = C(f_2) + \left\lceil \frac{\mathit{WCTT}(f_2)+J_R(f_1)+J_N(f_1)}{T(f_1)} \right\rceil \cdot C(f_1)
+ \left\lceil \frac{\mathit{WCTT}(f_2)+J_R(f_3)+J_N(f_3)}{T(f_3)} \right\rceil \cdot C(f_3) = 11 > D(f_2)
\]
The flow f2 is unschedulable. Now consider the schedulability test for this example when the priorities are assigned differently: P(f2) > P(f1) ∧ P(f2) > P(f3). It is easy to see that again indirect interferences do not exist, and hence jitters are equal to zero.
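\[ \mathit{WCTT}(f_2) = C(f_2) = 3 < D(f_2) \]
\[ \mathit{WCTT}(f_1) = C(f_1) + \left\lceil \frac{\mathit{WCTT}(f_1)+J_R(f_2)+J_N(f_2)}{T(f_2)} \right\rceil \cdot C(f_2) = 5 < D(f_1) \]
\[ \mathit{WCTT}(f_3) = C(f_3) + \left\lceil \frac{\mathit{WCTT}(f_3)+J_R(f_2)+J_N(f_2)}{T(f_2)} \right\rceil \cdot C(f_2) = 5 < D(f_3) \]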
The flow-set is now schedulable. Finally, consider the schedulability test for the EDF arbitra-
tion policy. Notice, that in this case, indirect interferences do exist, i.e. when f1 is under analysis,
then f3 can indirectly interfere, and vice versa. First, consider the busy period for the flow f2.
\[
W(f_2) = \left\lceil \frac{W(f_2)+J_R(f_2)+J_N(f_2)}{T(f_2)} \right\rceil \cdot C(f_2)
+ \left\lceil \frac{W(f_2)+J_R(f_1)+J_N(f_1)}{T(f_1)} \right\rceil \cdot C(f_1)
+ \left\lceil \frac{W(f_2)+J_R(f_3)+J_N(f_3)}{T(f_3)} \right\rceil \cdot C(f_3) \rightarrow \infty
\]
As the busy period grows without bound and hence cannot be computed, the flow-set is unschedulable. The same conclusion can be reached by computing the utilisation of L(f2), because U(f2) ≈ 1.095.
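The numbers in both case studies can be re-derived with a short script. The sketch below uses the fixed-priority recurrence shown above with all jitters set to zero, as in the case studies, and a plain utilisation sum for the EDF check; the function names and the `limit` guard are assumptions made for this illustration.

```python
import math


def fp_wctt(c, higher, limit=10**6):
    """Fixed-priority worst-case traversal time by fixed-point iteration.

    c:      isolation latency of the analysed flow
    higher: list of (C, T) of directly contending higher-priority flows
            (release and network jitters are zero in both case studies)
    """
    w = c
    while w <= limit:
        nxt = c + sum(math.ceil(w / t) * cj for cj, t in higher)
        if nxt == w:
            return w
        w = nxt
    return math.inf


def utilisation(flows):
    """Sum of C / T over the flows sharing a path."""
    return sum(c / t for c, t in flows)


# Case-study 1 (Table 2.6): (C, T) per flow.
f1, f2 = (5, 10), (6, 15)
CS1_RM = fp_wctt(f2[0], [f1])     # P(f1) > P(f2): WCTT of f2
CS1_ALT = fp_wctt(f1[0], [f2])    # P(f2) > P(f1): WCTT of f1
CS1_U = utilisation([f1, f2])

# Case-study 2 (Table 2.7).
g1, g2, g3 = (2, 6), (3, 7), (2, 6)
CS2_RM = fp_wctt(g2[0], [g1, g3])                         # rate-monotonic: WCTT of f2
CS2_ALT = (fp_wctt(g1[0], [g2]), fp_wctt(g3[0], [g2]))    # f2 has the highest priority
CS2_U = utilisation([g1, g2, g3])

if __name__ == "__main__":
    print(CS1_RM, CS1_ALT, round(CS1_U, 3))
    print(CS2_RM, CS2_ALT, round(CS2_U, 3))
```

The path utilisation of f2 in the second case study exceeds one, which is why the busy-period recurrence has no finite solution there.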
These two case studies have shown that there are scenarios where the EDF arbitration policy can schedule a flow-set, while no priority assignment exists with which the fixed-priority arbitration policy can do the same (the first row in Table 2.8). Similarly, it has been demonstrated that the opposite is true as well, and the findings of Shi and Burns [84], regarding the existence of a priority assignment that can schedule a flow-set which is unschedulable with priorities assigned in the rate-monotonic fashion, have also been confirmed (the second row in Table 2.8). The case studies for which the third and the fourth row of Table 2.8 are true are omitted; however, in Section 2.3.4.5 it will be demonstrated that such cases indeed exist. Finally, it is trivial to see that there exist flow-sets for which the fifth and the sixth row are true. This makes it possible to identify the second major difference between the single-core scheduling theory and the worst-case analysis of priority-preemptive NoCs. Although in the former EDF is proven to be optimal [58], in the latter a flow-set can be schedulable with the fixed-priority arbitration policy, but not with EDF (see Case-study 2).
Table 2.8: Comparison of approaches (schedulability)
Possible scenario   EDF   Rate-monotonic   Exists priority assignment
1                   X     X                X
2                   X     X                X
3                   X     X                X
4                   X     X                X
5                   X     X                X
6                   X     X                X
2.3.4.5 Experimental Evaluation
Recall, in this dissertation, it is assumed that cores access local routers via a core link and a core
port. However, in many cases, the manufacturers identify these elements as bottlenecks, and in
order to improve the performance, allow cores to directly access the ports of the local router. This
is equivalent to eliminating the first and the last link on the path of each flow. The benefit of
this approach is that two flows that are originating from the same core, and going into opposite
directions, do not have a common link any more, and thus do not contend with each other. This
further implies that the flow between two adjacent cores does not traverse three links, but only
one.
In this section, it will be assumed that cores can directly access the ports in their respective
routers. The reason is that this approach makes it possible to study flow-sets where flow paths consist of a
single hop, and in such scenarios the EDF arbitration policy displays some interesting properties,
as will be covered later in this section. The evaluation is performed by comparing the proposed
analysis for EDF-arbitrated NoCs, hereafter referred to as the EDF method, against the two
existing state-of-the-art approaches for FP-arbitrated NoCs [84, 85]. In the former, the priorities
are assigned in the rate-monotonic fashion. In the latter, the heuristics-based search algorithm
(HSA) is used to find a priority ordering, if one exists, such that the flow-set is schedulable. These
methods are referred to as the RM method and the HSA method, respectively. Although HSA is
based on heuristics, it has one limitation: if it is unable to find a priority ordering with which
the flow-set is schedulable, it will exhaustively enumerate all possible priority orderings. This
implies that HSA has a factorial computational complexity (i.e. for flow-sets with $|\mathcal{F}|$ flows there
exist $|\mathcal{F}|!$ different priority orderings), which further implies that HSA can be inapplicable in
scenarios where flow-sets consist of 50 or more flows. Thus, in this section, a limit is imposed
on the maximum number of orderings that HSA can evaluate. Specifically, for a flow-set
of $|\mathcal{F}|$ flows, HSA is allowed to attempt at most $5 \cdot |\mathcal{F}|$ different priority orderings. If HSA fails
to find an ordering with which the flow-set is schedulable within that limit, the process terminates.
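To give a sense of the magnitudes involved, the following sketch compares the number of possible priority orderings with the imposed budget. The helper names are illustrative assumptions; only the factor 5 and the flow-set sizes come from the text.

```python
import math

def all_orderings(num_flows):
    """Number of distinct priority orderings of a flow-set."""
    return math.factorial(num_flows)

def hsa_budget(num_flows, factor=5):
    """Maximum number of orderings HSA may attempt under the imposed limit."""
    return factor * num_flows

for n in (50, 200):
    digits = len(str(all_orderings(n)))
    print(f"|F| = {n}: {digits}-digit number of orderings, budget {hsa_budget(n)}")
```

For 50 flows the number of orderings already exceeds 10^64, whereas the budget is 250 orderings; for the 200-flow sets used in Experiment 1 the budget is 1000 orderings.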
Evaluation Metrics and Parameters
The comparison of the approaches is performed through a sensitivity analysis with respect
to flow sizes. Specifically, if a flow-set is unschedulable with the initial flow sizes, then the sizes of
all flows are uniformly decreased until the flow-set becomes schedulable. Similarly, if a flow-set
is schedulable with the initial sizes, then the sizes of all flows are uniformly increased until the
flow-set becomes unschedulable. The maximum flow sizes for which a method can guarantee
the schedulability of a flow-set are called the schedulability threshold (ST). A higher ST
implies that the method is more efficient. Upon obtaining the STs for the same flow-set with all the
approaches, the comparison is performed. Let $ST_{EDF}$, $ST_{RM}$ and $ST_{HSA}$ be the STs obtained with
the EDF method and with the two state-of-the-art methods, respectively. The improvements of the proposed
approach over the existing ones are measured in the following way:

$$Imp_{EDF/RM} = \frac{ST_{EDF} - ST_{RM}}{ST_{RM}}, \qquad Imp_{EDF/HSA} = \frac{ST_{EDF} - ST_{HSA}}{ST_{HSA}}$$
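The sensitivity analysis and the improvement metric can be sketched as follows. This is a simplified illustration under stated assumptions: flow sizes are scaled uniformly in integer percentage steps, the schedulability test is passed in as an opaque callable `is_schedulable` (hypothetical), and the step size and upper bound are arbitrary choices not taken from the thesis.

```python
def schedulability_threshold(is_schedulable, start=100, step=1, upper=10_000):
    """Largest uniform scale (in % of the initial flow sizes) that is accepted.

    is_schedulable(scale) is assumed to run one schedulability test
    (EDF, RM or HSA) on the flow-set with all sizes scaled to `scale` percent.
    """
    scale = start
    if is_schedulable(scale):
        # Schedulable at the start: grow the sizes until the test fails.
        while scale + step <= upper and is_schedulable(scale + step):
            scale += step
        return scale
    # Unschedulable at the start: shrink the sizes until the test passes.
    while scale - step > 0:
        scale -= step
        if is_schedulable(scale):
            return scale
    return 0

def improvement(st_new, st_base):
    """Relative improvement, e.g. Imp_EDF/RM = (ST_EDF - ST_RM) / ST_RM."""
    return (st_new - st_base) / st_base

# Toy oracles standing in for real analyses (purely illustrative).
st_edf = schedulability_threshold(lambda s: s <= 120)
st_rm = schedulability_threshold(lambda s: s <= 100)
print(st_edf, st_rm, improvement(st_edf, st_rm))
```

A linear scan is used for readability; since schedulability is assumed to be monotonic in the flow sizes, a binary search over the scale would reach the same threshold with fewer tests.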
The flow-set and analysis parameters are summarised in Table 2.9, where an asterisk sign
denotes a randomly generated value assuming a uniform distribution.
Table 2.9: Analysis parameters for Section 2.3.4.5
Parameter                                                      Value
NoC topology and size                                          2-D mesh with 8×8 routers
Router frequency $\nu_\rho$                                    2 GHz
Routing delay $\delta_\rho$                                    3 cycles (1.5 ns)
Link traversal delay $\delta_L$                                1 cycle (0.5 ns)
Link width = flit size $\sigma_{flit}$                         16 bytes
Flow packet size $\sigma(f_i), \forall f_i \in \mathcal{F}$    [1−128]* kB
Flow periods $D(f_i) = T(f_i), \forall f_i \in \mathcal{F}$    [20−100]* µs
Testing platform                                               Intel dual-core desktop & Java (Max heap-size: 4 GB)
Experiment 1: Overall improvements
In this experiment, the improvement trends related to flow path lengths are observed. This
is done by imposing a constraint that the maximum path length (expressed in hops) of any flow
of the flow-set cannot exceed the value of the newly introduced parameter LIM, i.e.
$|\mathcal{L}(f_i)| \le LIM, \forall f_i \in \mathcal{F}$. The parameter LIM is varied in the range [1−14]. LIM = 1 means that only
single-hop paths are allowed. Conversely, LIM = 14 allows all possible paths, because, assuming
the XY routing, the maximum path length on an 8×8 platform is 14 hops. For each value
of the parameter LIM, 1000 flow-sets are randomly generated, each consisting of 200 flows. For
each flow, the initial size and the period are randomly generated, while the source and the destination
routers are generated in such a way that the path length does not exceed the imposed limit LIM.
Subsequently, $ST_{EDF}$, $ST_{RM}$ and $ST_{HSA}$ are computed for each flow-set, and the obtained values are
compared.
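One possible way to generate flows under the LIM constraint is sketched below. The thesis does not state how the endpoints were drawn, so the rejection sampling used here is an assumption, as are the function names and the dictionary layout. With direct port access, the hop count of an XY route equals the Manhattan distance between the source and destination routers.

```python
import random

MESH = 8  # 8x8 mesh, as in Table 2.9

def hops_xy(src, dst):
    """Path length in hops under XY routing (Manhattan distance)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

def random_endpoints(lim, rng):
    """Draw router pairs until one lies within `lim` hops (rejection sampling)."""
    while True:
        src = (rng.randrange(MESH), rng.randrange(MESH))
        dst = (rng.randrange(MESH), rng.randrange(MESH))
        if 1 <= hops_xy(src, dst) <= lim:
            return src, dst

def random_flow(lim, rng):
    """One flow with uniformly drawn size and period (ranges from Table 2.9)."""
    src, dst = random_endpoints(lim, rng)
    return {
        "src": src,
        "dst": dst,
        "size_kB": rng.uniform(1, 128),
        "period_us": rng.uniform(20, 100),
    }

rng = random.Random(0)  # fixed seed so the sketch is repeatable
flow_set = [random_flow(lim=1, rng=rng) for _ in range(200)]
print(max(hops_xy(f["src"], f["dst"]) for f in flow_set))  # 1 when LIM = 1
print(hops_xy((0, 0), (7, 7)))                             # 14, the longest XY path
```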
Figure 2.29 shows the improvements of EDF over RM. It is visible that, for cases where
LIM = 1, EDF dominates RM. The explanation is as follows. When LIM = 1 the transitivity
property always holds, and consequently indirect interferences do not exist. This implies that the
rules from the uniprocessor scheduling theory also hold in this context: (i) EDF is the optimal
policy and hence always outperforms RM, and (ii) RM is optimal among fixed-priority policies.
On the rest of the domain (LIM > 1), the transitivity property may not hold, which implies that
indirect interferences are possible. Thus, as exhibited in Case Studies 1-2, there exist scenarios
where EDF outperforms RM, but the opposite is also true. Yet, the cases in which EDF performs
better are more frequent, and on average EDF outperforms RM by 7%. As LIM grows, the average
improvements also grow; however, the increase is barely noticeable.

Figure 2.29: EDF vs RM (x-axis: maximum length of flow paths, in hops; y-axis: improvements, in %)
Figure 2.30: EDF vs HSA (x-axis: maximum length of flow paths, in hops; y-axis: improvements, in %)
Figure 2.30 illustrates the improvements of EDF over HSA. It is visible that, for LIM = 1,
EDF also dominates HSA; in fact, the improvements are the same as in the comparison with RM.
This is expected because, as stated in the previous paragraph, RM is optimal among fixed-priority
policies, and hence HSA should perform the same as RM. Thus, it can be concluded that EDF presents a very
promising approach for systems where flows mostly traverse single-hop distances. On the rest
of the domain (LIM > 1) HSA significantly outperforms EDF, and as the lengths of flow paths
increase, the dominance of HSA becomes more apparent. This is a very counter-intuitive finding
and the explanation is as follows. In EDF the commutative property holds, and in order for a flow-set
to be schedulable, the necessary condition is that the utilisation of the path of each flow is less
than or equal to 1. Conversely, in the fixed-priority scheme, the commutative property does not
hold, and a flow can be schedulable even though the utilisation of its path is greater than 1 (see
Case Study 2). In fact, it can be concluded that, when LIM > 1, there almost always exists a
priority ordering that can schedule a flow-set which is unschedulable with EDF. This is a crucial
finding which suggests that EDF (or any other arbitration policy with dynamically changing flow
priorities) may not be the most efficient arbitration technique for flow-sets with arbitrary path
lengths. Yet, before this can be discussed, another important question has to be answered: how
hard is it to find a priority ordering which will succeed in cases where EDF fails? This is covered
in Experiment 2.
Assuming that all flows traverse single-hop distances, Figure 2.31 illustrates the improvements
of EDF over the optimal fixed-priority scheme (i.e. RM). The same data as in Figures 2.29-2.30
was used, but the focus is on LIM = 1 only. It is visible that the average improvements are around
10%, while in some cases they can reach up to 30%.
Figure 2.31: EDF vs RM, HSA (1-hop) (x-axis: improvements, in %; y-axis: quantity of flow-sets, in %; legend: Equal, Better)
Experiment 2: Scalability
In this experiment, the objective is to observe how the analysis duration time changes as a
function of the flow-set size. In other words, the aim is to investigate how scalable EDF, RM and
HSA are. The number of flows constituting the flow-set is varied in the range [100−400], with an
incremental step of 50 flows. For each flow-set size 1000 flow-sets are randomly generated. For
each flow the initial size, the period, the source and the destination routers are randomly generated,
without the maximum path length constraint, i.e. LIM = 14. For each flow-set, the time it takes to
compute STEDF , STRM and STHSA is measured. Subsequently, the obtained values are compared.
Figure 2.32 demonstrates that the computational complexity of EDF is higher than that of RM.
This is expected, because in RM the commutative property does not hold, and hence the worst-
case traversal times and the jitters can be obtained in a single pass, if flows are ordered by their
priorities, decreasingly. Conversely, in EDF the commutative property holds, and hence several
passes are needed until the worst-case traversal times and jitters stabilise for two successive passes
(see Section 2.3.4.3 and Algorithms 7-8).
In spite of imposing the limit on the maximum number of orderings in HSA to only 5 · |F |, its
computational complexity significantly surpasses that of EDF and RM, and the duration time of the
analysis grows exponentially. This occurs because HSA exhaustively enumerates the maximum
allowed number of orderings before eventually rendering the flow-set unschedulable, while RM
and EDF reach that conclusion much faster. These findings imply that HSA is the least scalable of
the compared approaches, and that searching for STHSA may be prohibitively expensive for flow-
sets with more than 500 flows. Therefore, it can be concluded that, for massive flow-sets, RM and
EDF are preferable options.
Experiment 3: Applicability
The common underlying assumption of the previous experiments is that the clock skew does
not exist, i.e. ∆ = 0 in Equation 2.10. In this experiment, the objective is to quantify the sensitivity
of the EDF method with respect to the clock skew. Based on the findings, it will be possible to
derive some conclusions regarding the practical limitations of the EDF arbitration policy. The
experiment is conducted in the following way. First, through the sensitivity analysis, $ST_{EDF}$ of
one flow-set is obtained when the clock skew is equal to zero. That value serves as a baseline for
comparisons. Then, for a non-zero value of the clock skew, the sensitivity analysis is performed
again, and the value $ST_{EDF}'$ is obtained for the same flow-set. These values are then compared
and the penalty suffered due to the clock skew is expressed with the following metric:

$$penalty = \frac{ST_{EDF} - ST_{EDF}'}{ST_{EDF}}$$

The process is repeated for 1000 flow-sets with the fixed flow-set size $|\mathcal{F}| = 200$ and
LIM = 14. Subsequently, the clock skew value is changed (∆ ∈ [0.1−100 µs]), and the penalty is
computed again.

Figure 2.32: Analysis time variations (x-axis: flow-set size; y-axis: analysis duration, in seconds; legend: RM, EDF, HSA)
Figure 2.33: Influence of ∆ on EDF (x-axis: parameter ∆, in µs; y-axis: penalty, in %)
The results are depicted in Figure 2.33. It is noticeable that for ∆ ≤ 0.1 µs the effects are
negligible. In the range 0.1 µs < ∆ ≤ 100 µs the penalty grows logarithmically until reaching a
saturation point at ∆ = 80 µs. The conclusion is that the schedulability is very sensitive to the
values of the clock skew that are within the range of flow periods, which is intuitive. It is also
interesting that the saturation point exists. The explanation is that, even in the hypothetical case
of an infinite clock skew, the interference that one flow can cause to another ultimately has
an upper bound, which is equal to the maximum number of releases of the interfering flow within
the interval of interest (the first term in the min function of Equation 2.10). Regarding the most
important question about the applicability of EDF, it can be concluded that the clock skew values
that have an impact on the schedulability are several orders of magnitude greater than the ones in
real systems, implying that EDF, counter-intuitively, does not face practical limitations in this respect.
2.3.4.6 Discussion
The experiments demonstrated that EDF dominates all fixed-priority schemes in cases where traffic
flows traverse single-hop distances. In such cases the system inherits the properties of the
uniprocessor scheduling theory, where EDF is optimal, and where system resources can be utilised
in the most efficient way (even up to 100%). Can this finding be exploited on our quest towards
efficient and real-time oriented multiprocessors? In order to provide an answer to this question,
further research in the area of application mapping is needed, because flow paths are inevitably
dependent on the position of functionalities within the platform.
For longer flow paths, there are cases where EDF performs better than RM, but the opposite
is also true. However, the cases where EDF performs better than RM are more frequent,
and, on average, EDF outperforms RM by 7%. Moreover, HSA systematically outperforms
EDF for all cases where path lengths exceed 3 hops. This finding suggests that, for flow-sets with
arbitrary path lengths, EDF (or any other arbitration scheme with dynamically changing flow
priorities) may not be the most efficient arbitration policy, which is a negative but important finding.
However, the experiments also suggest that searching for the priority ordering that outperforms
EDF can indeed be prohibitively expensive. Therefore, the applicability of EDF to a specific
flow-set highly depends on the parameters of the flow-set.
Finally, the experiments demonstrated that EDF does not suffer practical limitations with re-
spect to the clock skew. Specifically, the values of the clock skew, for which the analysis becomes
sensitive, exceed the clock skews in the real systems by several orders of magnitude.
2.3.5 Reducing Analysis Pessimism
In Chapter 1, it was mentioned that the efficiency of the worst-case analysis highly depends on the
amount of predictability of the analysed system, whereas any non-deterministic system behaviour
has to be accounted for in the analysis with a certain degree of pessimism. A more pessimistic
analysis may lead to a significant resource overprovisioning and/or underutilisation of platform
resources. Conversely, a less pessimistic approach makes it possible to save on design costs, e.g. by
choosing a cheaper platform with fewer resources, which still guarantees the fulfilment of all timing
constraints. Moreover, with a less pessimistic analysis, the resources of the platform can be
exploited more efficiently, for example, by accommodating additional workload, or by
decreasing the power consumption via core shutdowns and lower router frequencies. Thus, deriving the
analysis with the least pessimism possible is of paramount importance in the real-time domain. In
this section, one source of pessimism of the existing methods to perform the worst-case analysis
of priority-preemptive NoCs is identified. Subsequently, the extension to the existing methods
is proposed, which overcomes the aforementioned limitation, and allows for a less pessimistic
worst-case analysis.
Irrespective of the arbitration policy, in previous methods the following assumption was used:
the interference that a flow f1 causes to a flow f2 is equal to the isolation delay of f1, i.e. C( f1) (as
mentioned in Observation 4). However, this assumption may be pessimistic, as described below.
2.3.5.1 Motivational Example
Consider the example of two flows, f1 and f2, illustrated in Figure 2.34. Without loss of generality
with respect to arbitration policies, consider that the packet of the flow f1 can preempt the packet
of f2. Recall that in the existing analyses, the entire traversal of f1 would be considered as the
interference to f2 (Observation 4). Notice that f1 and f2 share a common part of the path, which
consists of one link, and which is hereafter referred to as the contention domain, or CD (see footnote 2). Let the
path of f1 be divided into 3 parts: (i) before f1 and f2 start sharing the common part of the path
(pre-CD), (ii) while f1 and f2 share the common part of the path (CD), and (iii) after f1 and f2 stop
sharing the common part of the path (post-CD).
Figure 2.34: Traffic flows (example 6)
Notice that while the header flit of f1 traverses pre-CD, f1 does not cause any interference
to f2 (see Figure 2.35(a)). Thus, in the existing methods, the traversal of the header flit of f1
through pre-CD is unnecessarily considered as the interference that f1 causes to f2. If that delay
is excluded from the interference that f1 causes to f2, the analysis pessimism can be reduced, and
consequently a tighter upper-bound estimate on the worst case traversal time of f2 can be obtained.
Once the header flit of f1 reaches CD, f2 starts suffering the interference (see Figure 2.35(b)).
When the tail flit of f1 leaves CD, f2 stops suffering the interference from f1, and may continue
its progress (see Figure 2.35(c)). Thus, the traversal of the tail flit of f1 through post-CD is also
unnecessarily considered as the interference in the existing methods. By excluding this delay
from the interference that f1 causes to f2, the analysis pessimism can be further reduced, and an
even tighter upper-bound estimate on the worst-case traversal time of f2 can be obtained.
2.3.5.2 Proposed Method
In this section, an extension to the existing methods for the worst-case analysis of priority-preemptive
NoCs is proposed. This extension makes it possible to obtain a less pessimistic estimate of the interference
that directly interfering flows cause to the flow under analysis. Consequently, tighter upper-bound
estimates on the worst-case traversal times can be obtained.
Figure 2.35: Detailed interference analysis. (a) f2 does not suffer the interference when the header flit of f1 is in pre-CD. (b) f2 suffers the interference when the header flit of f1 is in either CD or post-CD, and the tail flit of f1 is in either pre-CD or CD. (c) f2 does not suffer the interference when the tail flit of f1 is in post-CD.

Footnote 2: The XY-routing mechanism assures that any two flows with a direct interference relationship have exactly one CD.

Consider again the example illustrated in Figure 2.34. Let the parts of $\mathcal{L}(f_1)$ constituting
sections pre-CD, CD and post-CD (with respect to flow f2) be $\mathcal{L}^{\text{pre-CD}}_{1,2}$, $\mathcal{L}^{\text{CD}}_{1,2}$ and $\mathcal{L}^{\text{post-CD}}_{1,2}$,
respectively. It is trivial to see that the union of these parts forms the entire path of f1, i.e.
$\mathcal{L}(f_1) = \mathcal{L}^{\text{pre-CD}}_{1,2} \cup \mathcal{L}^{\text{CD}}_{1,2} \cup \mathcal{L}^{\text{post-CD}}_{1,2}$. Now, as described in the previous section, the less
pessimistic estimate of the interference that a single traversal of f1 causes to f2, denoted as $I(f_1 \to f_2)$
(Equation 2.13), can be computed by subtracting (i) $\gamma^{\text{pre-CD}}_{1,2}$, which is the time it takes a header
flit of f1 to traverse pre-CD, and (ii) $\gamma^{\text{post-CD}}_{1,2}$, which is the time it takes a tail flit of f1 to traverse
post-CD, from the entire traversal time of f1.

$$I(f_1 \to f_2) = C(f_1) - \gamma^{\text{pre-CD}}_{1,2} - \gamma^{\text{post-CD}}_{1,2} \qquad (2.13)$$

where $\gamma^{\text{pre-CD}}_{1,2}$ and $\gamma^{\text{post-CD}}_{1,2}$ are computed as follows:

$$\gamma^{\text{pre-CD}}_{1,2} = |\mathcal{L}^{\text{pre-CD}}_{1,2}| \cdot \delta_L + \max\left\{0, \left(|\mathcal{L}^{\text{pre-CD}}_{1,2}| - 1\right)\right\} \cdot \delta_\rho \qquad (2.14)$$

$$\gamma^{\text{post-CD}}_{1,2} = |\mathcal{L}^{\text{post-CD}}_{1,2}| \cdot \delta_L \qquad (2.15)$$
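Equations 2.13-2.15 translate directly into a few lines of code. The sketch below is illustrative: the function names are hypothetical, the delays are the values listed in Table 2.9 (0.5 ns per link, 1.5 ns per router), and the sample call uses an assumed isolation delay of 14 ns with three pre-CD and three post-CD links.

```python
DELTA_L = 0.5    # link traversal delay in ns (Table 2.9)
DELTA_RHO = 1.5  # routing delay in ns (Table 2.9)

def gamma_pre(pre_links):
    """Equation 2.14: time for the header flit to traverse the pre-CD section."""
    return pre_links * DELTA_L + max(0, pre_links - 1) * DELTA_RHO

def gamma_post(post_links):
    """Equation 2.15: time for the tail flit to traverse the post-CD section."""
    return post_links * DELTA_L

def interference(c_j, pre_links, post_links):
    """Equation 2.13: interference of one traversal of the preempting flow."""
    return c_j - gamma_pre(pre_links) - gamma_post(post_links)

# Assumed sample values: C(f1) = 14 ns, 3 pre-CD links, 3 post-CD links.
print(gamma_pre(3), gamma_post(3), interference(14.0, 3, 3))  # 4.5 1.5 8.0
```

When the paths overlap entirely (no pre-CD and no post-CD links), both subtracted terms vanish and the function returns C(f1), i.e. the estimate of the existing methods.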
If this reasoning is applied for all directly interfering flows of a flow under analysis fi, a
tighter worst-case traversal time $\mathit{WCTT}^+(f_i)$ can be obtained. Specifically, for the fixed-priority
arbitration scheme, Equation 2.4 is substituted with Equation 2.16.

$$\mathit{WCTT}^+(f_i) = C(f_i) + \sum_{\forall f_j \in \mathcal{F}_D(f_i)} \left\lceil \frac{\mathit{WCTT}^+(f_i) + J_R(f_j) + J_N(f_j)}{T(f_j)} \right\rceil \cdot I(f_j \to f_i) \qquad (2.16)$$

Similarly, for EDF-arbitrated NoCs, Equations 2.9-2.12 are substituted with Equations 2.17-2.20.

$$W^+(f_i) = \left\lceil \frac{W(f_i) + J_R(f_i) + J_N(f_i)}{T(f_i)} \right\rceil \cdot C(f_i) + \sum_{\forall f_j \in \mathcal{F}_D(f_i)} \left\lceil \frac{W(f_i) + J_R(f_j) + J_N(f_j)}{T(f_j)} \right\rceil \cdot I(f_j \to f_i) \qquad (2.17)$$

$$L^+(f_i, t) = \left(1 + \left\lfloor \frac{t + J_R(f_i) + J_N(f_i)}{T(f_i)} \right\rfloor\right) \cdot C(f_i) + \sum_{\substack{\forall f_j \in \mathcal{F}_D(f_i) \\ T(f_j) \le t + T(f_i) + J_R(f_j) + J_N(f_j) + \Delta}} I(f_j \to f_i) \cdot \min\left\{ \left\lceil \frac{L(f_i, t) + J_R(f_j) + J_N(f_j)}{T(f_j)} \right\rceil, \left\lfloor \frac{t + T(f_i) + J_R(f_j) + J_N(f_j) + \Delta}{T(f_j)} \right\rfloor \right\} \qquad (2.18)$$

$$\mathit{WCTT}^+(f_i, t) = \max\left\{ C(f_i), L^+(f_i, t) - t \right\} \qquad (2.19)$$

$$\mathit{WCTT}^+(f_i) = \max\left\{ \mathit{WCTT}^+(f_i, t), \forall t \in \mathcal{T}_{crit}(f_i) \right\} \qquad (2.20)$$
2.3.5.3 Observations
When comparing the existing and the new methods to compute the worst-case traversal times of
individual flows, several interesting facts can be noticed:
Observation 5. When compared with the existing ones, the newly proposed methods never perform
worse. Indeed, the new methods always derive upper-bounds which are either the same, or
less pessimistic. In other words, it always holds that $\mathit{WCTT}^+(f_i) \le \mathit{WCTT}(f_i), \forall f_i \in \mathcal{F}$.
Theorem 4 provides the proof.
Theorem 4. Irrespective of the arbitration policy, the worst-case traversal time, of any flow of
the flow-set, computed with the new method, is either equal to, or less pessimistic than the one
obtained with the existing method.
Proof. Proven directly. Consider two flows fi and fj, where the packets of fj can preempt the
packets of fi. Let $\gamma_{j,i}$ be the difference in the interference caused by a single preemption of fj to
fi, computed with the existing and the proposed method (Equation 2.21).

$$\gamma_{j,i} = C(f_j) - I(f_j \to f_i) = \gamma^{\text{pre-CD}}_{j,i} + \gamma^{\text{post-CD}}_{j,i} = |\mathcal{L}^{\text{pre-CD}}_{j,i}| \cdot \delta_L + \max\left\{0, \left(|\mathcal{L}^{\text{pre-CD}}_{j,i}| - 1\right)\right\} \cdot \delta_\rho + |\mathcal{L}^{\text{post-CD}}_{j,i}| \cdot \delta_L \qquad (2.21)$$

Since all the terms of Equation 2.21 are non-negative values, it follows that $\gamma_{j,i} \ge 0$. Let $K_{j,i}$ be
the number of preemptions that fj can cause to fi during the worst-case traversal time of fi. Also,
let $\gamma_i$ be the difference in the total interference caused to fi by all its directly interfering flows,
computed with the existing and the proposed method (Equation 2.22).

$$\gamma_i = \sum_{\forall f_j \in \mathcal{F}_D(f_i)} \gamma_{j,i} \cdot K_{j,i} \qquad (2.22)$$

Since all the terms of Equation 2.22 have non-negative values, it follows that $\gamma_i \ge 0$. Moreover,
as the existing and the new method differ only in the way the interference is computed, it
follows that $\mathit{WCTT}(f_i) - \mathit{WCTT}^+(f_i) = \gamma_i \ge 0$.
Theorem 5 provides a proof that the obtained upper-bounds are safe.
Theorem 5. The traversal time of any packet belonging to the flow fi can not be greater than
WCT T+( fi) (Equation 2.16 or Equation 2.20), even in the worst-case conditions.
Proof. Proven directly. Depending on the arbitration policy, the worst-case traversal time of a flow
can be computed either by solving Equation 2.4 for fixed-priority-arbitrated NoCs (Section 2.3.1),
or Equation 2.12 for EDF-arbitrated NoCs (Section 2.3.4). If it can be proven that, irrespective of
the arbitration policy, there exists some discontinuous time interval γi which is a part of WCT T ( fi),
and during which fi does not progress, nor any of its interfering flows causes the interference to it,
that will prove that WCT T ( fi)− γi is also a safe upper-bound on the worst-case traversal time of
fi.
Consider two traffic flows fi and fj, where the packets of fj can preempt the packets of fi.
According to Observation 4, the entire traversal of fj is considered as the interference that fj
causes to fi, and hence entirely contributes to WCTT(fi). Let $\gamma_{j,i}$ (Equation 2.21) be the sum of:
(i) $\gamma^{\text{pre-CD}}_{j,i}$, which is the interval when the header flit of fj traverses pre-CD, and (ii) $\gamma^{\text{post-CD}}_{j,i}$,
which is the interval when the tail flit of fj traverses post-CD. By definition, both a necessary and
sufficient condition for the contention between fi and fj is that both of them attempt to traverse
the CD section at the same time. However, during $\gamma^{\text{pre-CD}}_{j,i}$ and $\gamma^{\text{post-CD}}_{j,i}$, the flow fj does not
traverse CD, hence fi can safely progress. Thus, the maximum interference that one packet of fj
can cause to fi has a safe upper-bound, which is $C(f_j) - \gamma_{j,i}$. Subsequently, if during the traversal
of fi, packets of fj can appear at most $K_{j,i}$ times, then the safe upper-bound on the interference
that fj can cause to fi is $K_{j,i} \cdot (C(f_j) - \gamma_{j,i})$. By extending this reasoning to all directly interfering flows, it can be concluded that
the worst-case traversal time of fi has a safe upper-bound $\mathit{WCTT}^+(f_i)$ (Equation 2.23).

$$\mathit{WCTT}^+(f_i) = \mathit{WCTT}(f_i) - \sum_{\forall f_j \in \mathcal{F}_D(f_i)} K_{j,i} \cdot \gamma_{j,i} = \mathit{WCTT}(f_i) - \gamma_i \qquad (2.23)$$
Observe the improvements of the proposed method over the existing one on a small-scale
example given in Figure 2.34, where the flow characteristics are given in Table 2.10, and the
NoC characteristics are given in Table 2.12. Without loss of generality in terms of arbitration
policies, in this and subsequent examples the fixed-priority arbitration policy will be assumed;
however, note that any conclusion reached for this scheme will also hold for the EDF arbitration
policy.
Table 2.10: Flow-set parameters for Figure 2.34 (example 1)
Flow Priority σ (f) JR(f) D(f) = T(f)
f1 P( f1) 48 B 0 1 µs
f2 P( f2)< P( f1) 48 B 0 1 µs
Since only two flows exist, there are no indirect interferences, thus network jitters are equal to
zero, i.e. $J_N(f_1) = J_N(f_2) = 0$. f1 is the higher-priority flow, so its worst-case traversal time will
be the same with both methods:

$$\mathit{WCTT}(f_1) = \mathit{WCTT}^+(f_1) = C(f_1) = |\mathcal{L}(f_1)| \cdot \delta_L + (|\mathcal{L}(f_1)| - 1) \cdot \delta_\rho + \left\lceil \frac{\sigma(f_1)}{\sigma_{flit}} \right\rceil \cdot \delta_L = 14\,\text{ns}$$

However, the worst-case traversal time of f2 is different. First, the value will be obtained with
the existing analysis. For that, C(f2) is needed.

$$C(f_2) = |\mathcal{L}(f_2)| \cdot \delta_L + (|\mathcal{L}(f_2)| - 1) \cdot \delta_\rho + \left\lceil \frac{\sigma(f_2)}{\sigma_{flit}} \right\rceil \cdot \delta_L = 6\,\text{ns}$$

Now, WCTT(f2) can be computed as follows:

$$\mathit{WCTT}(f_2) = C(f_2) + \left\lceil \frac{\mathit{WCTT}(f_2) + J_R(f_1) + J_N(f_1)}{T(f_1)} \right\rceil \cdot C(f_1) = 20\,\text{ns}$$

Now, $\mathit{WCTT}^+(f_2)$ will be obtained. To do so, first $I(f_1 \to f_2)$ has to be computed. Recall
that $I(f_1 \to f_2)$ denotes the interference that f1 causes to f2 with a single preemption. The lengths
of the relevant sections are: $|\mathcal{L}^{\text{pre-CD}}_{1,2}| = 3$, $|\mathcal{L}^{\text{CD}}_{1,2}| = 1$ and $|\mathcal{L}^{\text{post-CD}}_{1,2}| = 3$ (see Figure 2.34).
Remember that the first and the last link of each flow are the core links, which have been omitted
from Figure 2.34 for clarity purposes.

$$I(f_1 \to f_2) = C(f_1) - |\mathcal{L}^{\text{pre-CD}}_{1,2}| \cdot \delta_L - \max\left\{0, \left(|\mathcal{L}^{\text{pre-CD}}_{1,2}| - 1\right)\right\} \cdot \delta_\rho - |\mathcal{L}^{\text{post-CD}}_{1,2}| \cdot \delta_L = 8\,\text{ns}$$

Now, $\mathit{WCTT}^+(f_2)$ can be obtained as follows:

$$\mathit{WCTT}^+(f_2) = C(f_2) + \left\lceil \frac{\mathit{WCTT}^+(f_2) + J_R(f_1) + J_N(f_1)}{T(f_1)} \right\rceil \cdot I(f_1 \to f_2) = 14\,\text{ns}$$

It is visible that the worst-case traversal time of f2 obtained with the new approach is 14 ns,
while it was 20 ns with the existing method, which is an improvement of 6 ns, or in relative terms,
an improvement of 30%.
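The worked example can be replayed numerically. The following sketch makes several assumptions that should be read as such: the delays and flit size are taken from Table 2.9, the path length of f2 (3 links) is inferred from the reported C(f2) = 6 ns rather than stated in the text, and the helper names and the `horizon` cut-off are illustrative.

```python
import math

DELTA_L = 0.5     # ns per link (Table 2.9)
DELTA_RHO = 1.5   # ns per router (Table 2.9)
FLIT_BYTES = 16   # flit size (Table 2.9)

def isolation_delay(path_links, size_bytes):
    """C(f): header latency along the path plus serialisation of all flits."""
    flits = math.ceil(size_bytes / FLIT_BYTES)
    return path_links * DELTA_L + (path_links - 1) * DELTA_RHO + flits * DELTA_L

def wctt(c_i, interferers, horizon=1e6):
    """Fixed point of w = c_i + sum(ceil((w + jr + jn) / t) * per_hit).

    interferers: (per_hit, t, jr, jn) tuples; per_hit is C(f_j) for the
    existing analysis and I(f_j -> f_i) for the proposed one.
    """
    w = c_i
    while w <= horizon:
        nxt = c_i + sum(math.ceil((w + jr + jn) / t) * per_hit
                        for per_hit, t, jr, jn in interferers)
        if nxt == w:
            return w
        w = nxt
    return None

PERIOD_NS = 1000.0                 # T(f1) = 1 us (Table 2.10)
c1 = isolation_delay(7, 48)        # 3 + 1 + 3 links
c2 = isolation_delay(3, 48)        # assumed 3-link path, inferred from C(f2)
i12 = c1 - (3 * DELTA_L + 2 * DELTA_RHO) - 3 * DELTA_L  # Equation 2.13

existing = wctt(c2, [(c1, PERIOD_NS, 0, 0)])
proposed = wctt(c2, [(i12, PERIOD_NS, 0, 0)])
print(c1, c2, i12, existing, proposed, (existing - proposed) / existing)
```

Under these assumptions the script yields 14, 6, 8, 20 and 14 ns and a relative improvement of 0.3, matching the values derived above.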
Observation 6. The improvements of the proposed approach over the existing one depend on the
lengths of the pre-CD, CD and post-CD sections of interfering flows.
Consider again the example from Figure 2.34, where P(f1) > P(f2). Assuming that the path
of f1 is constant, i.e. $|\mathcal{L}(f_1)| = const$, from Equations 2.21-2.22 it straightforwardly follows that
the improvements in the worst-case traversal time of f2 are greater when both $|\mathcal{L}^{\text{pre-CD}}_{1,2}|$ and
$|\mathcal{L}^{\text{post-CD}}_{1,2}|$ are bigger. Since the interference relationship between f1 and f2 is possible if and
only if a contention between these two flows exists, i.e. $|\mathcal{L}^{\text{CD}}_{1,2}| \ge 1$, it follows that the necessary
condition for the maximum improvements is: $|\mathcal{L}^{\text{pre-CD}}_{1,2}| + |\mathcal{L}^{\text{post-CD}}_{1,2}| = |\mathcal{L}(f_1)| - 1$.
This is demonstrated with an illustrative example given in Figure 2.36. Consider the same
flow characteristics as in the previous example (Table 2.10). If the worst-case traversal times of
both flows are computed with both methods, the following results are produced: WCTT(f1) =
WCTT+(f1) = 14 ns, WCTT(f2) = 24 ns and WCTT+(f2) = 20.5 ns. When compared with the
previous example, it is visible that the improvements in the worst-case traversal time of f2 dropped
from 6 ns to 3.5 ns, or expressed relatively, from 30% to less than 15%.
Figure 2.36: Traffic flows (example 7)
Note that a special case occurs when the paths of interfering flows entirely overlap, i.e.
$|\mathcal{L}^{\text{pre-CD}}_{1,2}| = |\mathcal{L}^{\text{post-CD}}_{1,2}| = 0$, $|\mathcal{L}^{\text{CD}}_{1,2}| = |\mathcal{L}(f_1)|$. In such scenarios there are no improvements because both
approaches return the same values.
Observation 7. The improvements of the proposed approach over the existing one depend on the
position of the CD section of interfering flows.
Consider again the example given in Figure 2.34, where P(f1) > P(f2). Assuming that the path of f1
and the length of the CD section are constant, i.e. $|\mathcal{L}(f_1)| = const$, $|\mathcal{L}^{\text{CD}}_{1,2}| = const$, from
Equations 2.21-2.22 it straightforwardly follows that the improvements in the worst-case traversal time
of f2 are more influenced by the length of the pre-CD section than by that of the post-CD section. Thus, a necessary
and sufficient condition for maximum improvements is: $|\mathcal{L}^{\text{pre-CD}}_{1,2}| = |\mathcal{L}(f_1)| - 1$, $|\mathcal{L}^{\text{CD}}_{1,2}| = 1$ and
$|\mathcal{L}^{\text{post-CD}}_{1,2}| = 0$.
To demonstrate that, consider the example from Figure 2.37, which differs from the example
from Observation 5 (Figure 2.34) only in the position of the CD section. If the worst-case traversal
times for both flows are computed again, assuming the same flow characteristics (Table 2.10),
the following values are produced: WCTT(f1) = WCTT+(f1) = 14 ns, WCTT(f2) = 20 ns and
WCTT+(f2) = 12.5 ns. When compared with the example from Observation 5, it is visible that
the improvements in the worst-case traversal time of f2 increased from 6 ns to 7.5 ns, or expressed
relatively, from 30% to 37.5%.
Figure 2.37: Traffic flows (example 8)
Observation 8. The improvements of the proposed approach over the existing one do not depend
on flow-sizes.
Indeed, from Equations 2.21-2.22 it is visible that only the paths, but not the sizes of the flows,
influence the improvements. Therefore, with the increase in flow sizes, the worst-case traversal
times also grow, but the improvements remain the same in absolute values, and hence
decrease in relative terms. If the worst-case traversal times are computed for the example given in
Figure 2.34, but this time with bigger flow sizes (Table 2.11), the following results are produced:
WCTT(f1) = WCTT+(f1) = 17.5 ns, WCTT(f2) = 27 ns and WCTT+(f2) = 21 ns. As in the
equivalent case for σ(f1) = σ(f2) = 48 B, again WCTT(f2) − WCTT+(f2) = 6 ns; however, the
relative improvements drop from 30% to 22.2%. This implies that, as flow sizes increase, the
absolute improvements remain unaffected, whereas the relative improvements decrease. In fact, in
a hypothetical case with flows of infinitely large sizes, the relative improvements asymptotically
converge towards 0%.
Table 2.11: Flow-set parameters for Figure 2.34 (example 2)
Flow    Priority         σ(f)     JR(f)    D(f) = T(f)
f1      P(f1)            160 B    0        1 µs
f2      P(f2) < P(f1)    160 B    0        1 µs
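The arithmetic behind Observation 8 can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name is hypothetical, and the 48 B values for f2 (20 ns and 14 ns) are inferred from the 6 ns and 30% figures quoted in the text rather than read from a table.

```python
def relative_improvement(wctt_old_ns, wctt_new_ns):
    """Relative improvement (in %) of the new bound over the old one."""
    return 100.0 * (wctt_old_ns - wctt_new_ns) / wctt_old_ns

# (flow size in bytes, WCTT(f2) in ns, WCTT+(f2) in ns)
# 48 B row: inferred from the 6 ns / 30% figures; 160 B row: quoted in the text
cases = [(48, 20.0, 14.0), (160, 27.0, 21.0)]

for size_bytes, old, new in cases:
    absolute = old - new  # stays at 6 ns in both cases
    relative = relative_improvement(old, new)
    print(f"{size_bytes} B: absolute = {absolute} ns, relative = {relative:.1f} %")
```

The absolute gain is the same in both rows, while the relative gain shrinks as the denominator (the old bound) grows with the flow size.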
2.3.5.4 Numerical Example
In order to get a better insight into how different parameters influence the analysis improvements, a small-scale numerical example is used. Consider two contending flows $f_1$ and $f_2$, where $P_1 > P_2$, $F_D(f_2) = \{f_1\}$ and $f_1$ preempts $f_2$ only once. The length of the CD section, the sizes, and the paths of the flows are varied. For each particular scenario the worst-case traversal times of $f_2$ are computed with both approaches. The objective is to observe the relative improvements achieved by the proposed method. Figure 2.38 shows the results. In each subfigure, the lower surface covers the corner case where the length of the pre-CD section is equal to zero, i.e. $|\mathcal{L}^{pre\text{-}CD}_{1,2}| = 0$ and $|\mathcal{L}^{post\text{-}CD}_{1,2}| = |\mathcal{L}(f_1)| - |\mathcal{L}^{CD}_{1,2}|$, while the upper surface represents the opposite corner case, where the length of the post-CD section is equal to zero, i.e. $|\mathcal{L}^{pre\text{-}CD}_{1,2}| = |\mathcal{L}(f_1)| - |\mathcal{L}^{CD}_{1,2}|$ and $|\mathcal{L}^{post\text{-}CD}_{1,2}| = 0$. The trends in Figure 2.38 entirely coincide with the conclusions from Observations 5-8. Moreover, it is visible that the improvements are equal to zero in cases where the path of $f_1$ entirely belongs to the CD section, i.e. $|\mathcal{L}^{CD}_{1,2}| = |\mathcal{L}(f_1)|$, which has already been mentioned in Observation 6.

Figure 2.38: Improvement in the worst-case traversal time of f2, for two contending flows f1 and f2, where $P(f_1) > P(f_2)$, $|\mathcal{L}(f_1)| = |\mathcal{L}(f_2)|$, and f1 preempts f2 only once. Panels: (a) $|\mathcal{L}^{CD}_{1,2}| = 1$; (b) $|\mathcal{L}^{CD}_{1,2}| = \min\{5, |\mathcal{L}(f_1)|\}$; (c) $|\mathcal{L}^{CD}_{1,2}| = \min\{10, |\mathcal{L}(f_1)|\}$
2.3.5.5 Experimental Evaluation
In the previous section, the improvements of the proposed analysis over the existing one were
analysed on small-scale illustrative examples consisting of only two flows. In this section, a large-
scale comprehensive experimental evaluation is performed, with the main objective to quantify the
improvements of the new approach over the existing ones, but this time assuming flow-sets with
hundreds of flows. This will help us to investigate whether the improvement trends from a small-
scale example also hold for large flow-sets, and to what extent. Subsequently, scenarios (flow
characteristics) can be identified, for which the proposed improvement reports the best results.
Analysis parameters are given in Table 2.12 (see footnote 3). Note that the fixed-priority arbitration policy is again assumed; however, any conclusions reached for this scheme also hold in the context of the EDF arbitration policy.
Table 2.12: Analysis parameters for Section 2.3.5.5
Parameter                                     Value
NoC topology and size                         2-D mesh with 8×8 routers
Link width = flit size, σ_flit                16 bytes
Router frequency, ν_ρ                         2 GHz
Routing delay, δ_ρ                            3 cycles (1.5 ns)
Link traversal delay, δ_L                     1 cycle (0.5 ns)
Flow periods, D(f_i) = T(f_i), ∀ f_i ∈ F      [10-100]* µs
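As a quick consistency check of the delays in Table 2.12, the cycle counts convert to time as follows. The symbol $t_{cyc}$ for the router cycle time is introduced here only for illustration and does not appear in the original notation.

```latex
t_{cyc} = \frac{1}{\nu_\rho} = \frac{1}{2\,\mathrm{GHz}} = 0.5\,\mathrm{ns}, \qquad
\delta_\rho = 3 \cdot t_{cyc} = 1.5\,\mathrm{ns}, \qquad
\delta_L = 1 \cdot t_{cyc} = 0.5\,\mathrm{ns}
```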
Experiment 1: Improvements wrt flow sizes
In this experiment, the objective is to investigate how different flow sizes influence the improvements of the proposed method. Several categories of flow-sets are generated, with different flow size ranges: 1 B-16 B, 16 B-64 B, ..., 64 kB-256 kB. For each category 100 flow-sets are generated, where a flow-set consists of 200 flows. The size of each flow is randomly generated within the limits of the respective flow-set category. Also, for each flow, the priority, the source core and the destination core are randomly generated, where flow paths comply with the XY routing policy. Subsequently, for each flow of the flow-set, the worst-case traversal time is computed with both methods, and the improvements are measured in relative terms.
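The flow-set generation described above could be sketched as follows. This is a minimal illustration, not the generator used for the experiments: the function names, the dictionary layout, the use of unique priorities and the requirement that source and destination differ are all assumptions. Only the mesh size, the XY routing policy, the size categories and the period range come from the text.

```python
import random

MESH = 8  # 8x8 mesh, as in Table 2.12

def xy_route(src, dst):
    """Routers visited under XY routing: move along X first, then along Y."""
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

def generate_flow_set(n_flows, size_range, rng):
    """Generate n_flows flows with random priority, endpoints, size and period."""
    priorities = list(range(1, n_flows + 1))
    rng.shuffle(priorities)  # assumption: unique priorities, 1 = highest
    flows = []
    for prio in priorities:
        src = (rng.randrange(MESH), rng.randrange(MESH))
        dst = src
        while dst == src:  # assumption: source and destination cores differ
            dst = (rng.randrange(MESH), rng.randrange(MESH))
        flows.append({
            "priority": prio,
            "size_bytes": rng.randint(*size_range),  # within the category limits
            "period_us": rng.uniform(10, 100),       # initial period range
            "path": xy_route(src, dst),
        })
    return flows

# One flow-set of the 16 B-64 B category, with a fixed seed for repeatability
flow_set = generate_flow_set(200, (16, 64), random.Random(0))
```

Each category in Experiment 1 would call such a generator 100 times with its own size range before running both analyses on every flow.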
Figure 2.39 shows the results. Since the improvements do not depend on flow sizes (see
Observation 8), the absolute improvements of the new method are constant, irrespective of the
flow sizes. However, as the increase in the flow size causes a uniform increase in the worst-case
traversal times obtained with both methods, the relative improvements of the new approach (y-
axis) decrease as the flow sizes increase (x-axis). Another interesting finding is that, irrespective
of the flow sizes, there are always flows for which the proposed method does not yield better
results. These are the highest-priority flows which do not suffer any interference, hence for them
both methods return the same results.
Experiment 2: Improvements wrt path sizes
In this experiment, the lengths of flow paths are varied, and their influence on the improvements is observed. Again, several categories of flow-sets are generated, but this time with different ranges of flow path lengths: 3-4, 3-6, ..., 3-16. For each category 100 flow-sets are generated, where a flow-set consists of 200 flows. The priority, the source core and the destination core are generated randomly for each flow, but in accordance with the constraint on the maximum path size posed by the category to which the flow-set belongs. Each flow has a size which is randomly generated in the range [1 B-1 kB]. The worst-case traversal times of all flows are computed with both methods, and the improvements are measured in relative terms.

Footnote 3: The period of each flow is randomly generated within the given limits. If a generated flow-set is not schedulable, then the periods of all flows are uniformly increased (even beyond the limits) until the flow-set becomes schedulable.

Figure 2.39: Improvements wrt flow sizes [plot; x-axis: flow size (in bytes), categories 1-16 to 64k-256k; y-axis: improvement (in %), 0-80]
Figure 2.40: Improvements wrt path sizes [plot; x-axis: path size (in hops), categories 3-4 to 3-16; y-axis: improvement (in %), 0-60]
Figure 2.40 demonstrates that as the paths of the flows become longer (x-axis), the relative improvements also increase (y-axis). The explanation is that short paths substantially decrease the number of contentions and interferences, thus reducing the number of scenarios in which the new approach can yield improvements. Moreover, even in scenarios where contentions do occur, with short flow paths the CD sections cover large fractions of them, which significantly limits the improvements (Observation 6). Conversely, longer paths cause more interferences, but also longer pre-CD and post-CD sections, both of which have a positive effect on the improvements (see Observation 6).
Experiment 3: Improvements wrt flow and path sizes
The objective of this experiment is to get a better insight into how both the aforementioned flow characteristics (the flow size and the path size) together influence the improvements of the new method. Different flow-set categories are generated, where both parameters are varied. Again, each category consists of 100 flow-sets, each with 200 flows with randomly generated priorities, source and destination cores. For each flow the worst-case traversal times are computed with both methods. Subsequently, for each category the average improvements achieved with the new method are computed and expressed in relative terms.
Figure 2.41 shows the improvement trends (z-axis) associated with flow sizes (x-axis) and path sizes (y-axis). The improvement trends are identical to those from Experiments 1-2, implying that the increase in the flow sizes and the decrease in the path sizes both have a negative effect on the relative improvements. Consequently, small flows with long paths benefit the most from the proposed approach. Note that a similar conclusion was reached for the small-scale numerical example with two flows (see Section 2.3.5.4 and Figure 2.38).
Experiment 4: Improvements wrt flow-set sizes
The emphasis of this experiment is on the flow-set size. In other words, the objective is to
investigate how the improvement trends change with the number of flows constituting a flow-set.
Figure 2.41: Improvements wrt flow and path sizes [surface plot; x-axis: flow size (in bytes), 1-16 to 32k-64k; y-axis: path size (in hops), 3-4 to 3-16; z-axis: improvement (in %), 0-45]
Figure 2.42: Improvements wrt flow-set sizes [plot; x-axis: flow-set size, 100 to 500; y-axis: improvement (in %), 0-60]
Several flow-set categories are generated, where each category has the number of flows equal
to one of these values: 100,150, ...,500. For each category 100 flow-sets are generated, with
random priorities, source and destination cores. Moreover, each flow has a size which is randomly
generated in the range [1B− 1kB]. Subsequently, for each flow the worst-case traversal times
are obtained with both methods, and then compared. The improvements are measured in relative
terms.
Figure 2.42 demonstrates that as the flow-set size increases, so do the improvements. The explanation is that larger flow-sets have more substantial contentions, which favours the proposed approach. Yet, irrespective of the flow-set size, there always exist highest-priority flows which do not suffer interference, and hence for them no improvements can be achieved. However, as the flow-set size increases, the highest-priority flows constitute a smaller and smaller fraction of the entire flow-set; hence, for sets with more than 200 flows these cases are below the 25th percentile, and are therefore classified as outliers (depicted with red crosses in Figure 2.42).
Experiment 5: Improvements wrt priorities
The objective of this experiment is to investigate how the improvement trends change with
different flow priorities. 100 flow-sets are generated, each with 200 flows, where priorities, flow
sizes, source and destination cores are randomly generated, and flow sizes are in the following
range: [1B−1kB]. For each flow, the worst-case traversal times are computed with both methods,
and the improvements achieved by the new approach are measured in relative terms.
Figure 2.43 confirms that, as flow priorities decrease (bigger numbers on the x-axis), the rel-
ative improvements increase (y-axis). This confirms the initial assumption that the new approach
does not produce significant improvements for the highest-priority flows, because these flows suf-
fer very little interference (if at all). As flow priorities decrease, the interference that flows suffer
becomes more substantial, which favours the proposed approach.
Experiment 6: Analysis tightness
The objective of this experiment is to investigate the tightness of the obtained upper-bounds on the worst-case traversal times. To achieve this, a single flow-set consisting of 42 flows is generated, and subsequently mapped on a 6×6 platform with randomly generated source and destination cores. Each flow has a payload in the range [2-48] flits, and one additional header flit. Moreover, each flow has a period in the range [0.5-9] ms, and a unique priority. The router frequency is 100 MHz. The worst-case traversal times of all flows are computed with both analyses, and the execution is also simulated on a cycle-accurate simulator. The simulated time is 2 hyper-periods (see footnote 4). Subsequently, the analysis estimates are compared with the values obtained via simulations, namely (i) the worst-case, (ii) the average-case and (iii) the best-case traversal times.

Figure 2.43: Improvements wrt priorities [plot; x-axis: priority, categories 1-5 to 195-200; y-axis: improvement (in %), 0-60]
Figure 2.44: Analysis tightness [plot; x-axis: priority, 0-40; y-axis: time (in µs), 0-4; series: worst-case traversal time (old analysis), worst-case traversal time (new analysis), worst-case traversal time (measurement), average-case traversal time (measurement), best-case traversal time (measurement)]
Results are depicted in Figure 2.44. It is visible that for high-priority flows both the analyses
provide tight bounds. This is expected, given that these flows suffer very little interference, if at
all. With the decrease in the priority (bigger numbers on the x-axis), the differences between the
observed values and the analysis results become more noticeable (y-axis), because the analysis
pessimism accumulates. Also, notice that the difference between the estimates increases with
the decrease in priorities. In fact, in some cases the proposed approach provides significantly
tighter estimates, which entirely coincides with the conclusions from the previous experiments,
and further motivates this work. However, the results also suggest that there is still room for
improvement, and this area remains a potential topic for future work.
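The simulated time in Experiment 6 is expressed in hyper-periods, i.e. the least common multiple of all flow periods. A minimal sketch of that computation is given below; the period values are hypothetical (chosen within the [0.5-9] ms range and expressed in microseconds so that they are integers), and `math.lcm` requires Python 3.9 or newer.

```python
from math import lcm

def hyper_period(periods):
    """Least common multiple of all flow periods (integer time units)."""
    return lcm(*periods)

# Hypothetical flow periods in microseconds, within the [0.5 ms, 9 ms] range
periods_us = [500, 1000, 1500, 4500, 9000]

# The experiment simulates two hyper-periods
simulated_time_us = 2 * hyper_period(periods_us)
```

With arbitrary (e.g. co-prime) periods the hyper-period can become very long, which is one practical reason why simulations of this kind are usually restricted to small flow-sets.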
2.3.5.6 Discussion
In the previous section, the pessimism of the state-of-the-art methods for the worst-case analysis of wormhole-switched priority-preemptive NoCs was identified. Consequently, an improvement was proposed, which overcomes the identified limitations. Through the experimental evaluation, it has been observed that the proposed approach efficiently reduces the pessimism of the existing methods with respect to direct interference between contending traffic flows. The experiments demonstrate that the proposed approach yields significant improvements in almost all cases, while the greatest pessimism reductions are achieved in scenarios with large flow-sets, where flows have small sizes and traverse long paths. These traffic characteristics correspond to control core-to-core traffic, as well as to read requests and write responses in core-to-memory traffic. Given that these traffic types constitute a significant fraction of the entire NoC traffic, the proposed method can not only help to exploit the platform more efficiently and decrease resource over-provisioning, but can also render schedulable many flow-sets that the existing methods classify as unschedulable. These findings will most likely motivate further research efforts in the area of the worst-case analysis of wormhole-switched priority-preemptive NoCs, with the main objective of further decreasing the analysis pessimism, especially in the domain of indirect interferences, which still remains an unexplored topic.

Footnote 4: A hyper-period is the least common multiple of all flow periods.
Chapter 3
Limited Migrative Model - LMM
In this chapter, a new workload execution paradigm is presented. This approach is called the
Limited Migrative Model, LMM hereafter. There are two major differences between LMM and the
existing approaches:
1. In LMM, each functionality may be executed on an arbitrary number of cores. The candidate
cores for each functionality are selected at design-time, and during runtime the functionality
may freely migrate across its respective candidate cores.
2. The release/migration decisions of each functionality are made by the functionality itself,
which removes the requirement of a mandatory centralised scheduling entity. That is, the
functionality decides on which core it will perform its computation, while a local kernel on
that core is responsible to schedule the execution in a single-core fashion.
This design makes it possible to combine the positive characteristics of the existing techniques. For example, on each core a local kernel is responsible for scheduling the workload, which is a scalable approach, very similar to systems with fully-partitioned scheduling. At the same time, each functionality can migrate across its candidate cores, which makes the approach flexible, similar to systems with global scheduling. LMM therefore appears to be a promising approach for the integration of many-cores into the real-time embedded domain, and the research activities presented in the rest of this chapter are motivated by that reasoning.
The graphical representation of LMM is given in Figure 3.1, where dotted arrows symbolise
candidate cores for each functionality.
3.1 LMM in Detail
3.1.1 Operating System (OS)
As already mentioned, the LMM approach has been developed from the assumption that each core has an independent local kernel. A kernel is responsible for scheduling the execution on its core, i.e. for organising and arbitrating the computation requests of the functionalities which reside on that core. Moreover, a kernel exposes some of its features to local functionalities via system calls, and functionalities invoke those system calls in order to e.g. request processing resources for their computations, or communicate with other functionalities located on the same or other cores.

Figure 3.1: Limited Migrative Model [diagram: functionalities connected by dotted arrows to their candidate cores π1-π4, each core running its own scheduling algorithm]
Each kernel is able to communicate with other kernels over the NoC interconnect. The commu-
nication among kernels is necessary, among other things, to maintain the coherent and consistent
system-wide state. This means that LMM uses the message-passing technique as a communi-
cation primitive. By combining the concepts of independent kernels with the message passing,
a novel operating system paradigm is created, in the literature known as the multi-kernel. The
multi-kernel is a relatively new, yet very promising approach in the design of operating systems
for many-core platforms. Some notable examples of multi-kernels are Barrelfish [11], fOS [94]
and Quest-V [57].
3.1.2 Application Layer
As mentioned earlier, each functionality (application) may perform the computation on an arbi-
trary set of candidate cores, which are for each application selected at design-time. On each of
the candidate cores an application’s execution code exists, encapsulated within an entity called
the dispatcher. Thus, the number of dispatchers of each application is equal to the number of its
candidate cores.
All dispatchers of one application (each located on a different core) communicate with each
other. The communication of dispatchers is termed the agreement protocol. The purpose of the
agreement protocol is to elect one dispatcher, called the master dispatcher. After the election
process, the master is responsible to perform the computation on its core, on behalf of the entire
application. In order to do that, the master invokes a system call, by which it requests from the local
kernel to consider its requirements when deriving future scheduling decisions. The computation
process must be performed entirely on the core of the master. When the computation completes,
the master is responsible to initiate the next instance of the agreement protocol, so as to elect the
master dispatcher for the next computation process. If the elected dispatcher is not the current
master, a migration occurs.
All applications are single-threaded; therefore, at any time instant there can be only one master per application. The rest of the dispatchers are called the slave dispatchers, and their purpose is to participate in the agreement protocol. When the master initiates the protocol, slaves invoke system calls of their kernels, requesting information on whether the next computation can be performed on their cores. After receiving the answers from their respective kernels, dispatchers communicate with each other. Based on the information provided by the kernels, the next master is elected (see footnote 1). If the newly elected dispatcher is the existing master, nothing changes. Otherwise, the newly elected dispatcher becomes the master, while the old master becomes a slave. Additionally, the execution context has to be transferred from the old to the new master. Finally, after the computation is successfully completed on its core, the new master will initiate a new instance of the agreement protocol.
Notice that being the master is only a temporary role of a dispatcher. Perceived from the
application’s perspective, its dispatchers exchange one master token. This property is termed the
master volatility, and it has several interesting implications which will be covered later when
performing the timing analysis.
The agreement protocols are classified as the intra-application communication, because the
communication is performed only between the dispatchers of the same application. Additionally,
applications may communicate with each other for e.g. synchronisation or data sharing purposes.
This inter-application communication is implemented by a message exchange between current
master dispatchers of interacting applications.
3.1.3 LMM Benefits
• Configurability: The greatest power of LMM lies within its configurability. When allowing
each application to have only a single dispatcher, the behaviour of the system is identical to the one
with the fully-partitioned scheduling policy. Conversely, if every application has a dispatcher on
every core of the platform, the system acquires the properties of the ones with the global schedul-
ing policy, although at the expense of extensive communication related to agreement protocols.
Thus, by consciously choosing the number of dispatchers for each application, a desired trade-off
between the flexibility and the amount of protocol-related communication can be achieved to fit
the actual purpose.
• Scalability: Irrespective of the number of dispatchers, the scheduling decisions are always local, made by the kernels, which makes the approach scalable. This is possible because in LMM release/migration decisions are derived on the application level, and are explicitly detached from scheduling decisions, which is a novel concept in the real-time domain. The greatest benefit of this distributed decision making process is that no centralised entity (e.g. a ready queue) is needed.
Footnote 1: In this dissertation, the timing analysis of agreement protocols is of interest. The election policy problem (how to choose the master dispatcher) depends on the purpose of the system and has no effect on the timing analysis, which makes it immaterial for the discussion in this dissertation.
• Flexibility: Each application has the migrative freedom, based on the number of its dispatchers.
Thus, LMM is able to efficiently exploit the potential of the underlying many-core platform by
performing the energy/thermal management via runtime load balancing.
• Resilience: Failures of individual cores or clusters of cores can be overcome by excluding the
dispatchers located on those cores from agreement protocols. In a similar way, voluntary core
shutdowns can be implemented for various beneficial reasons (e.g. to save power, to prolong
hardware life).
3.2 Application Workload
The workload consists of an application-set $\mathcal{A}$, which is a collection of $v$ applications (functionalities): $\mathcal{A} = \{a_1, a_2, ..., a_{v-1}, a_v\}$. An application $a_i$ has a unique priority $P(a_i)$, a set of $u$ dispatchers $\mathcal{D}(a_i) = \{d_i^1, d_i^2, ..., d_i^{u-1}, d_i^u\}$, a minimum inter-arrival period $T(a_i)$ and an implicit deadline $D(a_i) = T(a_i)$.
Computation: The computation requirements of any application ai are modelled by a single
sporadic task (recall that applications are single-threaded), which is a source of an infinite number
of recurring jobs released with the minimum inter-arrival period equal to T (ai). A job is released
by the local kernel, upon a request from the current master dispatcher of ai. The released job
inherits the priority of its application, has a constant execution time Cτ(ai), and has a deadline
denoted by Dτ(ai) < D(ai). In other words, when analysing only the computation process, a job
released at the time instant t has to execute for Cτ(ai) time units until t +Dτ(ai). If it fails to do
so, it has missed a deadline. Conversely, if guarantees can be provided that every job of ai can
meet its deadline, then ai is considered schedulable with respect to its computation requirements.
Memory: During the computation process, jobs of the application ai may need to access and
manipulate data. This means that during a single job execution, multiple accesses to the memory
controllers might be necessary. Therefore, the memory operations of the jobs belonging to ai are
modelled by a flow-set F µ(ai). A detailed description of F µ(ai) will be given in Section 3.7,
when the timing analysis of the memory traffic will be performed. At this stage it is sufficient to
only mention that all flows belonging to F µ(ai) have a joint deadline Dµ(ai) < D(ai). In other
words, when considering only memory operations, all memory traffic of the application ai has to
complete its transfer over the NoC within Dµ(ai). If it fails to do so, it has missed a deadline.
Conversely, if guarantees can be provided that the memory traffic of ai will not miss a deadline,
the application is considered schedulable with respect to its memory requirements.
Notice that neither the computation process nor the memory access process occupies a continuous interval. In fact, these two processes are mutually interleaved, and that has to be taken into account when performing the timing analysis. To that aim, another deadline is defined, which is equal to the sum of the aforementioned deadlines, i.e. Dτ+µ(ai) = Dτ(ai) + Dµ(ai). In order for the application to be schedulable with respect to both its computation and memory requirements, it has to be proven that its computation takes no longer than Dτ(ai), and its memory operations take no longer than Dµ(ai), when considering Dτ+µ(ai) as the time interval of interest.
Figure 3.2: Example of application's computation, memory access and communication patterns [timeline over D(ai), divided into Dτ+µ(ai) and Dη(ai); legend: computation, memory accesses, communication, interference]
Communication: Each application may perform two types of communication: intra- and
inter-application. The former covers the message exchange by the dispatchers of the same appli-
cation and one example of such communication is the agreement protocol. Conversely, the inter-
application traffic models the message exchange between the current masters of two interacting
applications.
Let $\mathcal{F}^{\eta}(a_i)$ be a flow-set which models all intra- and inter-application communication of the application $a_i$. A detailed description of $\mathcal{F}^{\eta}(a_i)$ will be given in Sections 3.3-3.5, when the timing analysis of the communication will be performed. At this stage it is sufficient to only mention that
all flows belonging to F η(ai) have a joint deadline Dη(ai) < D(ai). In other words, all commu-
nication traffic of the application ai has to complete its transfer over the NoC within Dη(ai). If it
fails to do so, it has missed a deadline. Conversely, if guarantees can be provided that the commu-
nication traffic of ai will not miss a deadline, the application is considered schedulable with respect
to its communication requirements. If the application is schedulable with respect to its (i) com-
putation requirements, (ii) memory requirements, and (iii) communication requirements, then it
is considered totally schedulable. If all applications of the application-set are totally schedulable,
the application-set is totally schedulable.
Notice that while the computation process and the memory access process are mutually in-
terleaved and form discontinuous time intervals, the communication process of each application
forms a continuous time interval, and therefore, when performing the timing analysis of the com-
munication of ai, only the interval Dη(ai) needs to be considered. Also note that the sum of the
three aforementioned deadlines is equal to the inter-arrival period of the application and thus its
deadline, i.e. Dτ(ai)+Dµ(ai)+Dη(ai) = Dτ+µ(ai)+Dη(ai) = D(ai). These facts are illustrated
with Figure 3.2.
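The deadline decomposition and the notion of total schedulability can be summarised in a short sketch. It is a deliberate simplification: the class and function names are hypothetical, and the three worst-case demands are assumed to be supplied by the respective timing analyses (which must themselves account for the interleaving of computation and memory accesses and for interference).

```python
from dataclasses import dataclass

@dataclass
class AppDeadlines:
    """Deadline split of one application a_i; all values in the same time unit."""
    d_tau: float   # computation deadline
    d_mu: float    # memory deadline
    d_eta: float   # communication deadline
    period: float  # minimum inter-arrival period T(a_i) = D(a_i)

    def is_consistent(self, eps=1e-9):
        # The three partial deadlines must add up to D(a_i) = T(a_i)
        return abs(self.d_tau + self.d_mu + self.d_eta - self.period) <= eps

def totally_schedulable(app, wc_computation, wc_memory, wc_communication):
    """True if each worst-case demand fits within its own partial deadline.

    The three demands are hypothetical inputs; they are not computed here.
    """
    return (app.is_consistent()
            and wc_computation <= app.d_tau
            and wc_memory <= app.d_mu
            and wc_communication <= app.d_eta)

app = AppDeadlines(d_tau=6.0, d_mu=2.0, d_eta=2.0, period=10.0)
ok = totally_schedulable(app, wc_computation=5.0, wc_memory=2.0, wc_communication=1.5)
```

An application-set would then be totally schedulable if this check holds for every application in it.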
3.3 Agreement Protocols
When performing the agreement protocol, the dispatchers may invoke six different OS operations, as listed in Table 3.1, where the symbols denote the latencies of the respective operations. Note that these operations are executed on a first-come-first-served basis and always have the same (the highest) priority. Therefore, the protocol-related OS operations invoked by any application will always preempt the job execution of any other application, irrespective of their priorities.
Table 3.1: OS operations related to agreement protocols
Symbol                        Operation
$\delta^{\rightarrow}_{P}$    Send the protocol message (performed by all dispatchers)
$\delta^{\leftarrow}_{P}$     Receive the protocol message (performed by all dispatchers)
$\delta^{\rightarrow}_{C}$    Send the execution context (performed by the old master during migration)
$\delta^{\leftarrow}_{C}$     Receive the execution context (performed by the newly elected master during migration)
$\delta_{Q}$                  Get the information from the local kernel about the next job release (performed by all dispatchers)
$\delta_{E}$                  Elect the next master (performed by the old master)
In the following sections three agreement protocols will be introduced.
3.3.1 Master-Slave Protocol
3.3.1.1 Protocol Description
The dispatcher behaviour under this protocol is illustrated with Algorithm 9. Within every inter-arrival period of an application, after its computation and memory access deadlines expire, the agreement protocol begins. At that time instant, the master initiates the protocol by sending messages to all slaves (lines 4-7). When a slave receives the message from the master, it requests from the local kernel the information on whether the next job can be released on that core (lines 25-26). Upon receiving that information, the slave sends it back to the master (line 27). The master waits until it receives the replies from all slaves (lines 9-12). After that, the master gets the information from its own kernel (line 14), and compares it with the information received from all slaves, in order to elect the next master (line 15). If the elected master is the same as the old master, nothing changes; the master requests the next job release from the kernel, which is deferred until the communication deadline Dη(ai) expires (line 18). Then, the master waits for the time instant to start the next protocol (line 4).
Conversely, if the elected dispatcher is not the current master, a migration occurs. The old master sends the execution context to the new master, and demotes itself to the slave role (lines 20-21). At the same time, each slave waits for one of two events: (i) the context transfer from the old master, or (ii) the beginning of the new protocol (line 28). In the former case, the slave promotes itself to the master role, and requests the next job release from the kernel (lines 29-32).
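The message exchange described above can be traced with a small sequential simulation of one protocol round. This is only a sketch under stated assumptions: the function name, the log format and the election policy (pick the core that can release the next job earliest) are hypothetical, since the dissertation treats the election policy as immaterial for the timing analysis; real dispatchers run concurrently on different cores.

```python
def run_master_slave_round(dispatchers, master, release_info, elect):
    """Simulate one round of the Master-Slave protocol; return (new_master, log).

    dispatchers  : list of dispatcher ids (one per candidate core)
    master       : id of the current master
    release_info : dict id -> info returned by the local kernel (hypothetical format)
    elect        : callable(dict id -> info) -> id, the election policy (assumption)
    """
    log = []
    slaves = [d for d in dispatchers if d != master]
    # The master sends the protocol message to every slave
    for s in slaves:
        log.append(("send_msg", master, s))
    # Each slave receives the message, queries its kernel and replies
    replies = {}
    for s in slaves:
        log.append(("recv_msg", s, master))
        log.append(("query_kernel", s))
        replies[s] = release_info[s]
        log.append(("send_msg", s, master))
    # The master collects one reply per slave
    for s in slaves:
        log.append(("recv_msg", master, s))
    # The master queries its own kernel and elects the next master
    log.append(("query_kernel", master))
    replies[master] = release_info[master]
    new_master = elect(replies)
    log.append(("elect", master, new_master))
    if new_master != master:
        # Migration: the execution context moves to the new master
        log.append(("send_ctx", master, new_master))
        log.append(("recv_ctx", new_master, master))
    log.append(("release_deferred_job", new_master))
    return new_master, log

# Hypothetical policy: choose the core with the earliest possible next release
def earliest(info):
    return min(info, key=info.get)

new_master, log = run_master_slave_round(
    ["d1", "d2", "d3"], "d1", {"d1": 5, "d2": 2, "d3": 7}, earliest)
```

In this run the master role migrates from d1 to d2, so the log contains one context transfer in addition to the two request/reply pairs.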
Algorithm 9 run()
1: while (true) do
2:   if (isMaster = true) then
3:     // dispatcher is master
4:     wait(startTime); // wait for the time instant to start protocol
5:     for each (d_i^j ∈ D(a_i) | d_i^j ≠ this) do
6:       sendMsg(d_i^j); // send messages to all slaves
7:     end for
8:     rcvdMsgs ← 0;
9:     while (rcvdMsgs < |D(a_i)| − 1) do // one reply per slave
10:      wait(msgRcvd);
11:      rcvdMsgs++;
12:    end while
13:    // replies from all slaves received
14:    getNextReleaseInfo();
15:    nextMaster ← chooseNextMaster();
16:    if (nextMaster = this) then
17:      // master remains the same
18:      releaseDeferredJob();
19:    else
20:      sendCtx(nextMaster); // master changes (migration occurs)
21:      isMaster ← false;
22:    end if
23:  else
24:    // dispatcher is slave
25:    wait(msgRcvd);
26:    getNextReleaseInfo();
27:    sendMsg(master);
28:    wait(ctxRcvd, startTime);
29:    if (ctxRcvd = true) then
30:      isMaster ← true; // master changes (migration occurs)
31:      releaseDeferredJob();
32:    end if
33:  end if
34: end while

3.3.1.2 Timing Analysis
During this protocol, in the worst-case the master can perform the following OS operations: (i) send the protocol message $|\mathcal{D}(a_i)| - 1$ times, (ii) receive the protocol message $|\mathcal{D}(a_i)| - 1$ times, (iii) query the kernel for the information regarding the next release, (iv) elect the next master, and (v) transfer the context to the new master. Thus, the delay of the protocol execution on the core of the master, denoted by $C^{\eta}_{M}(a_i)$, can be expressed with Equation 3.1.
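$$C^{\eta}_{M}(a_i) = (|\mathcal{D}(a_i)|-1)\cdot\delta^{\rightarrow}_{P} + (|\mathcal{D}(a_i)|-1)\cdot\delta^{\leftarrow}_{P} + \delta_{Q} + \delta_{E} + \delta^{\rightarrow}_{C} \tag{3.1}$$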
Similarly, the delay of the protocol execution on the core of the slave that will not become the
next master, denoted by $C^{\eta}_{S}(a_i)$, can be expressed with Equation 3.2.
$$C^{\eta}_{S}(a_i) = \delta^{\leftarrow}_{P} + \delta_{Q} + \delta^{\rightarrow}_{P} \tag{3.2}$$
Finally, the delay of the protocol execution on the core of the slave that will become the next
master, denoted by $C^{\eta}_{N}(a_i)$, can be expressed with Equation 3.3.
$$C^{\eta}_{N}(a_i) = \delta^{\leftarrow}_{P} + \delta_{Q} + \delta^{\rightarrow}_{P} + \delta^{\leftarrow}_{C} \tag{3.3}$$
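As an illustration, Equations 3.1 to 3.3 can be evaluated with a few lines of Python. This is only a sketch: the class `OsCosts`, its field names and the numeric values are assumptions introduced for the example, not measurements or code from this work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OsCosts:
    """Worst-case durations of the protocol-related OS operations (hypothetical values)."""
    send_msg: float  # delta_P^->: send one protocol message
    recv_msg: float  # delta_P^<-: receive one protocol message
    query: float     # delta_Q: query the kernel about the next release
    elect: float     # delta_E: elect the next master
    send_ctx: float  # delta_C^->: send the execution context
    recv_ctx: float  # delta_C^<-: receive the execution context


def master_delay(n_dispatchers: int, c: OsCosts) -> float:
    """Equation 3.1: on-core delay of the current master."""
    slaves = n_dispatchers - 1
    return slaves * c.send_msg + slaves * c.recv_msg + c.query + c.elect + c.send_ctx


def slave_delay(c: OsCosts) -> float:
    """Equation 3.2: on-core delay of a slave that does not become the next master."""
    return c.recv_msg + c.query + c.send_msg


def next_master_delay(c: OsCosts) -> float:
    """Equation 3.3: on-core delay of the slave that becomes the next master."""
    return slave_delay(c) + c.recv_ctx


# Example with made-up costs and an application with 4 dispatchers.
costs = OsCosts(send_msg=2, recv_msg=3, query=5, elect=7, send_ctx=11, recv_ctx=13)
print(master_delay(4, costs), slave_delay(costs), next_master_delay(costs))
```

Note that only the master term grows with the number of dispatchers; the two slave terms are constant.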
Besides the aforementioned terms, in order to compute the total delay of the protocol exe-
cution, it is necessary to calculate the delay of the message transfer over the NoC interconnect.
However, due to the master volatility, that task is not trivial. This problem is depicted in Fig-
ure 3.3, where the master broadcast (the beginning of the protocol) of the same application is
captured at two different time instants. Notice that depending on which dispatcher is the current
master (emphasised circle), produced messages may traverse entirely different routes.
(a) Bottom-left dispatcher is the master (b) Top-right dispatcher is the master
Figure 3.3: Agreement protocol messages are master-dependent
This problem can be circumvented in the following way. Recall (Equation 2.1) that the traversal
delay of each flow (message) $f_j$ in isolation is a function of two properties: (i) its size
$\sigma(f_j)$ and (ii) its path length $|\mathcal{L}(f_j)|$. The size of the message is deterministic,
but the path is not. Let $maxhops(a_i)$ be the maximum distance between any two dispatchers of the
application $a_i$ to which $f_j$ belongs. As $maxhops(a_i)$ is an upper bound on the path length of
each protocol message $f_j$, the analysis covers the worst case by assuming that each $f_j$ traverses
that distance, i.e. $|\mathcal{L}(f_j)| = maxhops(a_i), \forall f_j \in \mathcal{F}^{\eta}(a_i)$.
With this assumption, the message transfer delay $C^{\eta}_{\#}(a_i)$ can be computed by solving
Equation 3.4. In total $2\cdot(|\mathcal{D}(a_i)|-1)$ protocol messages are exchanged between the
master and all slaves, followed by a context transfer to the new master.
$$C^{\eta}_{\#}(a_i) = 2\cdot(|\mathcal{D}(a_i)|-1)\cdot\left(maxhops(a_i)\cdot\delta_{L} + (maxhops(a_i)-1)\cdot\delta_{\rho} + \left\lceil\frac{\sigma_{prt}}{\sigma_{flit}}\right\rceil\cdot\delta_{L}\right) + maxhops(a_i)\cdot\delta_{L} + (maxhops(a_i)-1)\cdot\delta_{\rho} + \left\lceil\frac{\sigma_{ctx}}{\sigma_{flit}}\right\rceil\cdot\delta_{L} \tag{3.4}$$
In Equation 3.4, $\sigma_{prt}$ and $\sigma_{ctx}$ denote the size of one protocol message and one
context message, respectively.
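The structure of Equation 3.4 can be sketched in Python as follows. The function and parameter names are hypothetical, and the numbers in the example are arbitrary; the sketch only mirrors the per-message term and the message count of the Master-Slave protocol.

```python
import math


def message_delay(hops: int, size: int, flit_size: int,
                  delta_l: float, delta_rho: float) -> float:
    """Isolation delay of one message over `hops` links, as used inside Equation 3.4."""
    flits = math.ceil(size / flit_size)
    return hops * delta_l + (hops - 1) * delta_rho + flits * delta_l


def master_slave_network_delay(n_dispatchers: int, maxhops: int,
                               prt_size: int, ctx_size: int, flit_size: int,
                               delta_l: float, delta_rho: float) -> float:
    """Equation 3.4: 2*(|D|-1) protocol messages plus one context transfer."""
    protocol = message_delay(maxhops, prt_size, flit_size, delta_l, delta_rho)
    context = message_delay(maxhops, ctx_size, flit_size, delta_l, delta_rho)
    return 2 * (n_dispatchers - 1) * protocol + context


# Example with made-up parameters: 4 dispatchers, at most 3 hops apart.
print(master_slave_network_delay(4, maxhops=3, prt_size=6, ctx_size=16,
                                 flit_size=4, delta_l=2, delta_rho=1))
```

Because every message is charged the full $maxhops(a_i)$ distance, the result is independent of which dispatcher currently holds the master role.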
After these terms have been obtained, the total delay of the protocol execution $C^{\eta}(a_i)$ can
be computed by solving Equation 3.5. Notice that because all slaves perform their protocols in
parallel, it is sufficient to take into account only one of them (the next master).
$$C^{\eta}(a_i) = \overbrace{C^{\eta}_{M}(a_i)}^{\text{old master delay}} + \overbrace{C^{\eta}_{N}(a_i)}^{\text{next master delay}} + \overbrace{C^{\eta}_{\#}(a_i)}^{\text{network delay}} \tag{3.5}$$
Equation 3.5 represents the delay of the protocol execution of an application in isolation,
assuming that it does not suffer any interference from other applications. In order to obtain the
worst-case protocol delay, the interference from other applications has to be taken into account.
First, consider the interference that any dispatcher, master or slave, may suffer on its core
while performing the protocol-related OS operations. The dispatcher may suffer interference from
other masters or slaves residing on the same core and performing their agreement protocols. In
order to compute the maximum interference that a dispatcher may suffer, Theorem 6 is used.
Theorem 6. The number of protocol executions of any application $a_i$ within a time interval of
length $t$ can be at most $1 + \left\lceil\frac{t - D^{\tau+\mu}(a_i)}{T(a_i)}\right\rceil$.
Proof. Proven by contradiction. Assume that $2 + \left\lceil\frac{t - D^{\tau+\mu}(a_i)}{T(a_i)}\right\rceil$
protocol executions occurred within the time interval $t$. There are
$\left\lceil\frac{t - D^{\tau+\mu}(a_i)}{T(a_i)}\right\rceil$ protocol executions surrounded by the
first and the last one, and these are referred to as the inner executions. All the inner executions
contribute to $t$ with their entire application inter-arrival period $T(a_i)$, and therefore require
a time interval of at least $\left\lceil\frac{t - D^{\tau+\mu}(a_i)}{T(a_i)}\right\rceil\cdot T(a_i)$
in which only they can execute. Additionally, assume that $\varepsilon$ is an infinitesimally small
but finite value representing the shortest possible duration of the protocol, and that the first
protocol execution, with the duration of $\varepsilon$, was delayed as much as possible and hence
completed just before the interval of the inner executions started. Finally, the last protocol
execution could not start before the joint deadline of the computation and memory accesses,
$D^{\tau+\mu}(a_i)$, expired. Since all these executions occurred within $t$, it follows that
$$t \geq \varepsilon + \left\lceil\frac{t - D^{\tau+\mu}(a_i)}{T(a_i)}\right\rceil\cdot T(a_i) + D^{\tau+\mu}(a_i) \geq \varepsilon + \left(\frac{t - D^{\tau+\mu}(a_i)}{T(a_i)}\right)\cdot T(a_i) + D^{\tau+\mu}(a_i) = \varepsilon + t$$
As $\varepsilon > 0$, the inequality $t \geq \varepsilon + t$ cannot hold. The contradiction has been reached.
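The bound of Theorem 6 is straightforward to evaluate. The following Python sketch uses a hypothetical function name and arbitrary example values; it assumes that all quantities are expressed in the same time unit.

```python
import math


def max_protocol_executions(t: float, deadline: float, period: float) -> int:
    """Theorem 6: upper bound on protocol executions of one application within t.

    deadline is D^{tau+mu}(a_i) and period is T(a_i).
    """
    return 1 + math.ceil((t - deadline) / period)


# Example: T = 10, D^{tau+mu} = 6, window of length 25.
print(max_protocol_executions(25, deadline=6, period=10))
```

For a window shorter than the joint deadline the bound still evaluates to one execution, which accounts for a protocol instance that was already pending when the window opened.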
The worst-case interference that a dispatcher $d_i^j$ can suffer on its core within the time interval
$t$, termed $I^{\eta}(d_i^j, t)$, can be computed by solving Equation 3.6.
$$I^{\eta}(d_i^j, t) = \sum_{\forall d_k^m \in \mathcal{D}_{\pi(d_i^j)} \,\mid\, d_k^m \neq d_i^j} \left(1 + \left\lceil\frac{t - D^{\tau+\mu}(a_k)}{T(a_k)}\right\rceil\right)\cdot\begin{cases} C^{\eta}_{M}(a_k) & \text{if } d_k^m \text{ is master} \\ C^{\eta}_{N}(a_k) & \text{if } d_k^m \text{ is slave (next master)} \\ C^{\eta}_{S}(a_k) & \text{if } d_k^m \text{ is slave (not next master)} \end{cases} \tag{3.6}$$
In Equation 3.6 the term $\pi(d_i^j)$ represents the core of the dispatcher $d_i^j$, while
$\mathcal{D}_{\pi(d_i^j)}$ denotes all dispatchers residing on that core.
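A minimal Python sketch of Equation 3.6 is given below. The class `CoLocatedDispatcher` and its fields are assumptions made for the example; in particular, the role-dependent case distinction of Equation 3.6 is assumed to have been resolved beforehand into a single `cost` value per dispatcher.

```python
import math
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class CoLocatedDispatcher:
    """Another dispatcher on the same core (hypothetical container)."""
    deadline: float  # D^{tau+mu}(a_k) of its application
    period: float    # T(a_k) of its application
    cost: float      # C_M, C_N or C_S of a_k, selected by the dispatcher's role


def on_core_interference(others: Iterable[CoLocatedDispatcher], t: float) -> float:
    """Equation 3.6: sum, over co-located dispatchers, of executions times cost."""
    total = 0
    for d in others:
        executions = 1 + math.ceil((t - d.deadline) / d.period)  # Theorem 6
        total += executions * d.cost
    return total


# Example: one co-located master (cost 38) and one co-located slave (cost 10).
others = [CoLocatedDispatcher(deadline=6, period=10, cost=38),
          CoLocatedDispatcher(deadline=4, period=20, cost=10)]
print(on_core_interference(others, t=25))
```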
In addition to the on-core interference, when performing its agreement protocol, an application
may also suffer interference within the NoC interconnect, called the network interference. The
network interference that the application of interest $a_i$ suffers from other applications within
the time interval $t$, denoted by $I^{\eta}_{\#}(a_i, t)$, can be computed by solving Equation 3.7.
$$I^{\eta}_{\#}(a_i, t) = \sum_{\forall a_k \in \mathcal{A} \,\mid\, P(a_k) > P(a_i)} \left(1 + \left\lceil\frac{t - D^{\tau+\mu}(a_k)}{T(a_k)}\right\rceil\right)\cdot C^{\eta}_{\#}(a_k) \tag{3.7}$$
In other words, it is assumed that the protocol messages of each higher-priority application
$a_k$ will interfere with the protocol messages of $a_i$, irrespective of their potential paths.
This is indeed a conservative assumption; however, it is one way to circumvent the problem of
non-deterministic message paths caused by the master volatility property.
The worst-case protocol delay can be computed by summing up the aforementioned terms.
That is expressed with Equation 3.8, where the latest slave interference corresponds to the delay
of the slave which was the last to deliver its response to the master. Additionally, the terms $t^{*}$
and $t^{+}$ are sub-intervals of the worst-case protocol delay, and correspond to the time intervals
during which the latest slave and the next master can suffer the interference, respectively.
$$R^{\eta}(a_i) = \overbrace{C^{\eta}(a_i)}^{\text{isolation delay}} + \overbrace{I^{\eta}(d_i^j, R^{\eta}(a_i))}^{\text{master interference}} + \overbrace{I^{\eta}(d_i^k, t^{*})}^{\text{latest slave interference}} + \overbrace{I^{\eta}(d_i^m, t^{+})}^{\text{next master interference}} + \overbrace{I^{\eta}_{\#}(a_i, R^{\eta}(a_i))}^{\text{network interference}} \tag{3.8}$$
Notice that in Equation 3.8 the first and the last term are master-independent. However, the
remaining terms depend not only on the decision which dispatchers perform the roles of the master,
the latest slave and the next master of the analysed application $a_i$, but also on the roles that
dispatchers of other applications perform during the analysed period. Thus, in order to solve
Equation 3.8 it is necessary to identify the roles of all dispatchers that will lead to the worst
case (the biggest protocol delay).
Algorithm 10 demonstrates how to identify individual dispatcher roles on the core of the analysed
dispatcher, which will lead to its worst-case interference within the time interval $t$. The value
$\hat{M}$ denotes the maximum number of concurrent masters on a single core, and it is assumed that
this value has already been specified.
Algorithm 10 maxDispInf(d_i^j, isMaster, t)
Input: analysed dispatcher d_i^j, boolean variable isMaster, time interval of interest t
Output: worst-case interference of d_i^j within t
 1: D_M ← ∅; // initialise set of on-core masters
 2: D_S ← ∅; // initialise set of on-core slaves
 3: if (isMaster) then
 4:   D_M ← D_M ∪ {d_i^j};
 5: else
 6:   D_S ← D_S ∪ {d_i^j};
 7: end if
 8: for each (d_k^m ∈ D_π(d_i^j) | d_k^m ≠ d_i^j) do
 9:   if (|D_M| < M̂) then
10:     D_M ← D_M ∪ {d_k^m}; // add dispatcher to the list of masters
11:     master(d_k^m) ← true;
12:   else
13:     // find master which causes the minimum interference
14:     minMasterDelay ← ∞;
15:     for each (d_n^p ∈ D_M | d_n^p ≠ d_i^j) do
16:       currMasterDelay ← I^η(d_n^p, t); // Equation 3.6
17:       if (currMasterDelay < minMasterDelay) then
18:         minMasterDelay ← currMasterDelay;
19:         minMaster ← d_n^p;
20:       end if
21:     end for
22:     if (I^η(d_k^m, t) > minMasterDelay) then
23:       D_M ← D_M \ {minMaster}; // remove the minimum from current masters
24:       master(minMaster) ← false;
25:       D_S ← D_S ∪ {minMaster}; // add the minimum to current slaves
26:       D_M ← D_M ∪ {d_k^m}; // add dispatcher to current masters
27:       master(d_k^m) ← true;
28:     else
29:       D_S ← D_S ∪ {d_k^m}; // add dispatcher to current slaves
30:     end if
31:   end if
32: end for
33: // all roles assigned so compute maximum interference
34: maxDispInf ← 0;
35: for each (d_k^m ∈ D_π(d_i^j) | d_k^m ≠ d_i^j) do
36:   maxDispInf ← maxDispInf + I^η(d_k^m, t); // Equation 3.6
37: end for
38: return maxDispInf;
Algorithm 10 works as follows. First the lists of on-core masters and slaves are initialised
(lines 1−2). Then, if the analysed dispatcher $d_i^j$ is the master, it is added to the list of
masters (line 4). Otherwise, it is added to the list of slaves (line 6). After that, each dispatcher
$d_k^m$ that shares the core with $d_i^j$ is considered for assignment to one of the aforementioned
lists. If the current number of assigned masters is less than the maximum, $d_k^m$ is added to the
list of masters (lines 10−11). Otherwise, the master dispatcher which causes the minimum
interference is identified (lines 14−21). If the interference caused by the current dispatcher
$d_k^m$ is greater than the interference caused by the identified minMaster, then minMaster is
transferred to the list of slaves, while $d_k^m$ is added to the list of masters (lines 23−27).
Otherwise, $d_k^m$ is added to the list of slaves (line 29). This process is repeated for every
dispatcher. After the lists of masters and slaves have been populated with all on-core dispatchers,
the worst-case interference is computed (lines 34−37) and returned (line 38).
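The role assignment described above can be approximated with the following Python sketch. It is a simplification and not a transcription of Algorithm 10: each co-located dispatcher is assumed to be summarised by a hypothetical pair (contribution as master, contribution as slave), and the swap-based selection is replaced by an equivalent ranking by the master contribution.

```python
from typing import List, Tuple


def max_dispatcher_interference(others: List[Tuple[float, float]],
                                analysed_is_master: bool,
                                max_masters: int) -> float:
    """Greedy role assignment on one core.

    others: (cost_as_master, cost_as_slave) per co-located dispatcher (assumed inputs).
    max_masters: the bound on concurrent masters per core (M-hat).
    """
    # The analysed dispatcher occupies one master slot if it is itself a master.
    free_slots = max(max_masters - (1 if analysed_is_master else 0), 0)
    # Keep the dispatchers with the largest master contribution as masters.
    ranked = sorted(others, key=lambda cost: cost[0], reverse=True)
    masters, slaves = ranked[:free_slots], ranked[free_slots:]
    return sum(m for m, _ in masters) + sum(s for _, s in slaves)


# Example: three co-located dispatchers, at most two masters per core.
others = [(38, 10), (50, 12), (20, 8)]
print(max_dispatcher_interference(others, analysed_is_master=True, max_masters=2))
print(max_dispatcher_interference(others, analysed_is_master=False, max_masters=2))
```

Ranking by the master contribution keeps the same set of masters as repeatedly evicting the minimum, provided that the contributions are distinct.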
Algorithm 10 demonstrates how to identify on-core master and slave dispatchers of other
applications which lead to the worst-case delay of the analysed dispatcher within the observed time
interval. However, in order to compute the worst-case protocol delay, it is also necessary to
identify dispatcher roles for the application under analysis. For that, Algorithm 11 is used.
Algorithm 11 is divided into 4 parts. First, the master of the analysed application is identified
(lines 4−11) in the following way. For each dispatcher $d_i^k$, the maximum interference that it can
suffer within the observed interval is computed. That value is obtained by invoking the previously
described Algorithm 10 (line 7). The dispatcher with the maximum interference is identified, and
assigned the master role (line 9).
Then, the latest slave of the analysed application is identified in a similar way (lines 12−23).
The process is slightly more complex than identifying the master, because the latest slave can
suffer the interference only during $t^{*}$, which is a sub-interval of the entire worst-case protocol
delay. For each dispatcher $d_i^k$ that is not the master, the analysed interval $t^{*}$ is initially
computed for the minimal value, $\delta^{\leftarrow}_{P} + \delta_{Q} + \delta^{\rightarrow}_{P}$
(line 18). It is obtained by invoking Algorithm 10. After the new value of $t^{*}$ is obtained, it is
fed back into the computation as the input. The process is repeated until a fixed point is reached.
The stopping condition is the same value of $t^{*}$ for two consecutive iterations (line 19). The
process is repeated for every dispatcher, until the one with the biggest interval $t^{*}$ is
identified and assigned the latest slave role (line 21).
After that, the next master is identified (lines 24−35). Notice that it can also be the latest
slave, thus the same dispatcher may be identified for both roles. The computation process is very
similar to the previous one. The only difference is that the analysed interval $t^{+}$ is initially
computed for the minimal value, which represents a single OS operation: receiving the execution
context (line 30). Again, the stopping condition is that $t^{+}$ has the same value for two
successive iterations (line 31). Of all slave dispatchers, the one with the biggest interval $t^{+}$
is identified and assigned the next master role (line 33).
Algorithm 11 compWorstCaseDelay(a_i, A)
Input: application a_i, application-set A
Output: worst-case protocol delay of a_i
 1: R^η(a_i) ← 0; // initialise the worst-case protocol delay
 2: repeat
 3:   R^η_old(a_i) ← R^η(a_i);
 4:   // 1. identify the master
 5:   maxMasterInf ← 0;
 6:   for each (d_i^k ∈ D(a_i)) do
 7:     currMasterInf ← maxDispInf(d_i^k, true, R^η_old(a_i)); // Algorithm 10 and Equation 3.6
 8:     if (currMasterInf > maxMasterInf) then
 9:       maxMasterInf ← currMasterInf; master ← d_i^k;
10:     end if
11:   end for
12:   // 2. identify the latest slave
13:   maxSlaveInf ← 0;
14:   for each (d_i^k ∈ D(a_i) | d_i^k ≠ master) do
15:     t* ← 0;
16:     repeat
17:       t*_old ← t*;
18:       t* ← maxDispInf(d_i^k, false, t*_old + δ←_P + δ_Q + δ→_P); // Algorithm 10 and Equation 3.6
19:     until (t*_old = t*);
20:     if (t* > maxSlaveInf) then
21:       latestSlave ← d_i^k; maxSlaveInf ← t*;
22:     end if
23:   end for
24:   // 3. identify the next master
25:   maxNextMasterInf ← 0;
26:   for each (d_i^k ∈ D(a_i) | d_i^k ≠ master) do
27:     t+ ← 0;
28:     repeat
29:       t+_old ← t+;
30:       t+ ← maxDispInf(d_i^k, false, t+_old + δ←_C); // Algorithm 10 and Equation 3.6
31:     until (t+_old = t+);
32:     if (t+ > maxNextMasterInf) then
33:       nextMaster ← d_i^k; maxNextMasterInf ← t+;
34:     end if
35:   end for
36:   // 4. compute the worst-case protocol delay
37:   iso ← C^η(a_i); // compute isolation delay (Equation 3.5)
38:   netInf ← I^η_#(a_i, R^η_old(a_i)); // compute network interference (Equation 3.7)
39:   R^η(a_i) ← iso + netInf + maxMasterInf + maxSlaveInf + maxNextMasterInf; // Equation 3.8
40: until (R^η_old(a_i) = R^η(a_i));
41: return R^η(a_i);
Finally, the worst-case protocol delay can be computed. As shown in Equation 3.8, it consists
of five components: (i) the isolation delay, (ii) the master interference, (iii) the latest slave
interference, (iv) the next master interference, and (v) the network interference. In previous steps
the second, the third and the fourth component were obtained. Thus, the remaining components
are obtained by solving Equations 3.5 and 3.7 (lines 37−38). The worst-case protocol
delay is equal to the sum of these terms (line 39 and Equation 3.8). Similarly to intervals $t^{*}$
and $t^{+}$, the worst-case protocol delay $R^{\eta}(a_i)$ is also computed iteratively, until the
computed value is the same in two successive iterations. The schedulability condition is that the
obtained value is less than or equal to the communication deadline, i.e.
$R^{\eta}(a_i) \leq D^{\eta}(a_i)$.
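The iterative computation and the schedulability test can be illustrated with a generic fixed-point solver in Python. The sketch below is an assumption-laden simplification: the function `step` stands in for one pass of the outer loop and contains only an isolation delay and a single interfering application, with made-up constants.

```python
import math
from typing import Callable, Optional


def solve_fixed_point(step: Callable[[float], float], deadline: float,
                      start: float = 0, max_iterations: int = 10_000) -> Optional[float]:
    """Iterate r <- step(r) until two successive values coincide.

    Returns the converged value, or None once the deadline is exceeded
    (or the iteration budget is exhausted), i.e. when no guarantee can be given.
    """
    current = start
    for _ in range(max_iterations):
        following = step(current)
        if following > deadline:
            return None
        if following == current:
            return current
        current = following
    return None


ISOLATION = 20                     # stands in for the isolation delay (assumed value)
DEADLINE, PERIOD, COST = 6, 10, 5  # one interfering application (assumed values)


def step(r: float) -> float:
    """One recomputation of the delay for a window of length r (Theorem 6 bound)."""
    return ISOLATION + (1 + math.ceil((r - DEADLINE) / PERIOD)) * COST


print(solve_fixed_point(step, deadline=50))  # converges below the deadline
print(solve_fixed_point(step, deadline=30))  # exceeds the deadline
```

Stopping as soon as the candidate value exceeds the communication deadline avoids iterating on task sets for which the schedulability condition can no longer hold.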
3.3.1.3 Discussion
The decision made on the core of the master dispatcher is based on the information received from
every individual slave. However, at the moment when the master makes a decision, there are no
guarantees that the state of the system on the core of each slave is still identical to the one
observed when that slave replied. One extreme, yet possible scenario occurs when one slave
gives a positive reply to its master regarding the next job execution on its core, while
some other dispatchers from the same core have also sent positive replies to their respective
masters during their protocol executions. As a result, multiple applications might elect dispatchers
from that particular core as their next masters, and hence overload the core, which inevitably leads
to missed deadlines. Therefore, the race condition is identified as the greatest flaw of this
protocol. The performance of the protocol will receive additional attention in Section 3.3.4.
3.3.2 List Protocol
3.3.2.1 Protocol Description
In this protocol, dispatchers form a singly linked list, and each dispatcher knows its successor,
termed nextDisp. The dispatcher behaviour under this protocol is illustrated with Algorithm 12.
Within every inter-arrival period of application, after its computation and memory access deadlines
expire, the agreement protocol begins. At that time instant, the master initiates the protocol by
getting the information regarding the next job release from its kernel (line 5). If the kernel provides
a positive reply, nothing changes; the master requests the next job release from the kernel (line 8),
and waits for the time instant to start the next protocol (line 4).
In cases when the kernel provides a negative reply, the master sends the message to its suc-
cessor dispatcher (line 10). When that dispatcher receives the message (line 17), it requests the
information regarding the next job release from its kernel (line 18). If the kernel does not allow
a new job release, the dispatcher sends the message to its successor dispatcher (line 25). This
process continues until the kernel of one dispatcher provides a positive reply. That dispatcher will
become the next master. Therefore, it informs the old master about the outcome by requesting the
execution context (line 20), and waits for the context to be transferred (line 21). The old master
sends the context (line 12), and demotes itself to the slave role (line 13). After receiving the con-
text, the new master claims the master role (line 22), and requests the next job release from its
kernel (line 23).
Algorithm 12 run()
 1: while (true) do
 2:   if (isMaster = true) then
 3:     // dispatcher is master
 4:     wait(startTime); // wait for the time instant to start protocol
 5:     getNextReleaseInfo();
 6:     if (canReleaseDeferredJob() = true) then
 7:       // master remains the same
 8:       releaseDeferredJob();
 9:     else
10:       sendMsg(nextDisp); // send a message to the successor dispatcher
11:       wait(ctxReqRcvd);
12:       sendCtx(nextMaster); // master changes (migration occurs)
13:       isMaster ← false;
14:     end if
15:   else
16:     // dispatcher is slave
17:     wait(msgRcvd);
18:     getNextReleaseInfo();
19:     if (canReleaseDeferredJob() = true) then
20:       sendCtxReq(master); // send a context request to the master
21:       wait(ctxRcvd);
22:       isMaster ← true; // master changes (migration occurs)
23:       releaseDeferredJob();
24:     else
25:       sendMsg(nextDisp); // send a message to the successor dispatcher
26:     end if
27:   end if
28: end while
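The outcome of one round of the List protocol can be mimicked sequentially in Python. The sketch below makes two assumptions that are not stated in the text: the successor relation is circular, so that any dispatcher can act as the master, and the kernel replies are given up front in a hypothetical dictionary `can_release`. Message passing and timing are not modelled.

```python
from typing import Dict, Optional, Tuple


def list_protocol_round(successor: Dict[str, str], master: str,
                        can_release: Dict[str, bool]) -> Tuple[Optional[str], int]:
    """Return (next master, number of protocol messages forwarded along the list)."""
    if can_release[master]:
        return master, 0          # master remains the same, nothing is sent
    forwarded = 0
    current = successor[master]
    while current != master:
        forwarded += 1            # message delivered to `current`
        if can_release[current]:
            return current, forwarded  # `current` requests the context and takes over
        current = successor[current]
    return None, forwarded        # no dispatcher could accommodate the next job


# Example: four dispatchers A -> B -> C -> D -> A, with A as the current master.
successor = {"A": "B", "B": "C", "C": "D", "D": "A"}
replies = {"A": False, "B": False, "C": True, "D": True}
print(list_protocol_round(successor, "A", replies))
```

The traversal stops at the first positive reply, so dispatcher D is never consulted in this example even though it could also accept the job.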
3.3.2.2 Timing Analysis
The worst-case scenario occurs when only the last dispatcher of the list is granted the permission
to release the next job, and that case has to be considered when performing the analysis. Providing
guarantees that, at any time instant, at least one of the dispatchers will be able to accommodate the
next job on its core falls into the domain of the schedulability analysis, and that topic will receive
additional attention in Section 3.8.
During this protocol, in the worst case the master can perform the following OS operations:
(i) query the kernel for the information regarding the next release, (ii) send the protocol message,
(iii) receive the context request, and (iv) transfer the context to the next master. Thus, the delay
of the protocol execution on the core of the master, denoted by $C^{\eta}_{M}(a_i)$, can be expressed
with Equation 3.9.
$$C^{\eta}_{M}(a_i) = \delta_{Q} + \delta^{\rightarrow}_{P} + \delta^{\leftarrow}_{P} + \delta^{\rightarrow}_{C} \tag{3.9}$$
Similarly, the delay of the protocol execution on the core of the slave that will not become the
next master, denoted by $C^{\eta}_{S}(a_i)$, can be expressed with Equation 3.10.
$$C^{\eta}_{S}(a_i) = \delta^{\leftarrow}_{P} + \delta_{Q} + \delta^{\rightarrow}_{P} \tag{3.10}$$
The delay of the protocol execution on the core of the slave that will become the next master,
denoted by $C^{\eta}_{N}(a_i)$, can be expressed with Equation 3.11.
$$C^{\eta}_{N}(a_i) = \delta^{\leftarrow}_{P} + \delta_{Q} + \delta^{\rightarrow}_{P} + \delta^{\leftarrow}_{C} \tag{3.11}$$
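The message transfer delay $C^{\eta}_{\#}(a_i)$ can be computed by solving Equation 3.12. In total
$|\mathcal{D}(a_i)|$ protocol messages are exchanged when traversing the singly linked list of
dispatchers, followed by a context transfer to the new master.
$$C^{\eta}_{\#}(a_i) = |\mathcal{D}(a_i)|\cdot\left(maxhops(a_i)\cdot\delta_{L} + (maxhops(a_i)-1)\cdot\delta_{\rho} + \left\lceil\frac{\sigma_{prt}}{\sigma_{flit}}\right\rceil\cdot\delta_{L}\right) + maxhops(a_i)\cdot\delta_{L} + (maxhops(a_i)-1)\cdot\delta_{\rho} + \left\lceil\frac{\sigma_{ctx}}{\sigma_{flit}}\right\rceil\cdot\delta_{L} \tag{3.12}$$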
After these terms have been obtained, the total delay of the protocol execution $C^{\eta}(a_i)$ can
be computed by solving Equation 3.13. Notice that in this protocol the slaves perform their
protocol-related OS operations sequentially, and therefore all of them have to be taken into account
when computing the total protocol delay.
$$C^{\eta}(a_i) = \overbrace{C^{\eta}_{M}(a_i)}^{\text{old master delay}} + \overbrace{(|\mathcal{D}(a_i)|-2)\cdot C^{\eta}_{S}(a_i)}^{\text{delay of all slaves except next master}} + \overbrace{C^{\eta}_{N}(a_i)}^{\text{next master delay}} + \overbrace{C^{\eta}_{\#}(a_i)}^{\text{network delay}} \tag{3.13}$$
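Equations 3.9 to 3.13 combine into a single isolation delay, which the following Python sketch evaluates. All parameter names and numbers are hypothetical; `msg_net` and `ctx_net` are assumed to be the precomputed NoC delays of one protocol message and one context message over $maxhops(a_i)$ links.

```python
def list_isolation_delay(n_dispatchers: int, send_msg: float, recv_msg: float,
                         query: float, send_ctx: float, recv_ctx: float,
                         msg_net: float, ctx_net: float) -> float:
    """Isolation delay of the List protocol (Equation 3.13)."""
    master = query + send_msg + recv_msg + send_ctx   # Equation 3.9
    slave = recv_msg + query + send_msg               # Equation 3.10
    next_master = slave + recv_ctx                    # Equation 3.11
    network = n_dispatchers * msg_net + ctx_net       # Equation 3.12
    # All slaves except the next master act one after another.
    return master + (n_dispatchers - 2) * slave + next_master + network


# Example with made-up costs and 4 dispatchers.
print(list_isolation_delay(4, send_msg=2, recv_msg=3, query=5,
                           send_ctx=11, recv_ctx=13, msg_net=12, ctx_net=16))
```

In contrast to the Master-Slave protocol, the slave term here is multiplied by the number of traversed slaves, which reflects the sequential nature of the list traversal.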
Equation 3.13 represents the delay of the protocol execution of an application in isolation,
assuming that it does not suffer any interference from other applications. In order to obtain the
worst-case protocol delay, the interference from other applications has to be taken into account.
The interference that any dispatcher, master or slave, may suffer while performing its
protocol-related OS operations within the time interval $t$, termed $I^{\eta}(d_i^j, t)$, has already
been computed for the Master-Slave protocol (Equation 3.6). Since this term is protocol-independent,
it can be reused in this context. The same holds for the network interference that an application
might suffer on the NoC within the time interval $t$, denoted by $I^{\eta}_{\#}(a_i, t)$
(Equation 3.7).
After all the relevant terms have been identified, the worst-case protocol delay can be computed
by solving Equation 3.14.
$$R^{\eta}(a_i) = \overbrace{C^{\eta}(a_i)}^{\text{isolation delay}} + \overbrace{I^{\eta}(d_i^j, R^{\eta}(a_i))}^{\text{master interference}} + \overbrace{\sum_{\forall d_i^k \in \mathcal{D}(a_i) \,\mid\, d_i^k \neq d_i^j} I^{\eta}(d_i^k, t^{*})}^{\text{all slaves interference}} + \overbrace{I^{\eta}(d_i^m, t^{+})}^{\text{next master interference}} + \overbrace{I^{\eta}_{\#}(a_i, R^{\eta}(a_i))}^{\text{network interference}} \tag{3.14}$$
The term $t^{*}$ corresponds to the individual time interval during which each slave suffers the
interference, while $t^{+}$ represents an additional interval during which the next master suffers
the interference when receiving the context. Of course, both $t^{*}$ and $t^{+}$ are sub-intervals
of $R^{\eta}(a_i)$.
Notice that in Equation 3.14 the first and the last term are master-independent. However, the
remaining terms depend not only on the decision which dispatchers of the analysed application
$a_i$ perform the roles of the old and the new master, but also on the roles that dispatchers of
other applications perform during the analysed period. Thus, like in the previous case, in order to
solve Equation 3.14 it is necessary to identify the roles of all dispatchers that will lead to the
worst case (the biggest protocol delay).
Algorithm 10 was introduced before, and it was used to compute the maximum interference
that an individual dispatcher might suffer during the time interval $t$, due to the other on-core
dispatchers. Since that algorithm is protocol-independent, it will also be reused in this context.
Therefore, the only remaining activity is to identify the dispatcher roles for the application under
analysis. For that, Algorithm 13 is used.
Algorithm 13 is divided into 4 parts. First, the master of the analysed application is identified
(lines 4−11). This process is identical to the process of identifying the master for the Master-Slave
protocol.
Then, the joint interference suffered by all slaves of the analysed application is computed (lines
12−21). Each slave can suffer the interference only during its individual interval $t^{*}$. This term
represents an interval during which the slave performs its protocol-related OS operations, and it
is a sub-interval of the entire worst-case protocol delay. For each dispatcher $d_i^k$ that is not the
master, the analysed interval $t^{*}$ is initially computed for the minimal value,
$\delta^{\leftarrow}_{P} + \delta_{Q} + \delta^{\rightarrow}_{P}$ (line 18). It is obtained by invoking
Algorithm 10. After the new value of $t^{*}$ is computed, it is fed back into the computation as the
input. The process is repeated until a fixed point is reached. The stopping condition is the same
value of $t^{*}$ for two consecutive iterations (line 19). The process is repeated for every slave
dispatcher, and the obtained values of $t^{*}$ are summed up (line 20). The summation corresponds to
the maximum interference that all slaves might suffer during the protocol execution.
After that, the next master is identified (lines 22−33). This process is identical to the process
of identifying the next master of the Master-Slave protocol.
Algorithm 13 compWorstCaseDelay(a_i, A)
Input: application a_i, application-set A
Output: worst-case protocol delay of a_i
 1: R^η(a_i) ← 0; // initialise the worst-case protocol delay
 2: repeat
 3:   R^η_old(a_i) ← R^η(a_i);
 4:   // 1. identify the master
 5:   maxMasterInf ← 0;
 6:   for each (d_i^k ∈ D(a_i)) do
 7:     currMasterInf ← maxDispInf(d_i^k, true, R^η_old(a_i)); // Algorithm 10 and Equation 3.6
 8:     if (currMasterInf > maxMasterInf) then
 9:       maxMasterInf ← currMasterInf; master ← d_i^k;
10:     end if
11:   end for
12:   // 2. compute interference suffered by all slaves
13:   maxSlavesInf ← 0;
14:   for each (d_i^k ∈ D(a_i) | d_i^k ≠ master) do
15:     t* ← 0;
16:     repeat
17:       t*_old ← t*;
18:       t* ← maxDispInf(d_i^k, false, t*_old + δ←_P + δ_Q + δ→_P); // Algorithm 10 and Equation 3.6
19:     until (t*_old = t*);
20:     maxSlavesInf ← maxSlavesInf + t*;
21:   end for
22:   // 3. identify the next master
23:   maxNextMasterInf ← 0;
24:   for each (d_i^k ∈ D(a_i) | d_i^k ≠ master) do
25:     t+ ← 0;
26:     repeat
27:       t+_old ← t+;
28:       t+ ← maxDispInf(d_i^k, false, t+_old + δ←_C); // Algorithm 10 and Equation 3.6
29:     until (t+_old = t+);
30:     if (t+ > maxNextMasterInf) then
31:       nextMaster ← d_i^k; maxNextMasterInf ← t+;
32:     end if
33:   end for
34:   // 4. compute the worst-case protocol delay
35:   iso ← C^η(a_i); // compute isolation delay (Equation 3.13)
36:   netInf ← I^η_#(a_i, R^η_old(a_i)); // compute network interference (Equation 3.7)
37:   R^η(a_i) ← iso + netInf + maxMasterInf + maxSlavesInf + maxNextMasterInf; // Equation 3.14
38: until (R^η_old(a_i) = R^η(a_i));
39: return R^η(a_i);
Finally, the worst-case protocol delay can be computed. As shown in Equation 3.14, it consists
of five components: (i) the isolation delay, (ii) the master interference, (iii) the interference of all
slaves, (iv) the next master interference, and (v) the network interference. In previous steps the
second, the third and the fourth component were obtained. Thus, the remaining components are
obtained by solving Equations 3.13 and 3.7 (lines 35−36). The worst-case protocol
delay is equal to the sum of these terms (line 37 and Equation 3.14). Similarly to intervals $t^{*}$
and $t^{+}$, the worst-case protocol delay $R^{\eta}(a_i)$ is also computed iteratively, until the
computed value is the same in two successive iterations. The schedulability condition is that the
obtained value is less than or equal to the communication deadline, i.e.
$R^{\eta}(a_i) \leq D^{\eta}(a_i)$.
3.3.2.3 Discussion
The execution of the protocol stops at the moment when one of the dispatchers of the traversed
singly linked list announces the possibility to accommodate the next job release. In many cases
that dispatcher might not be the optimal choice, e.g. the core of some other dispatcher, not yet
traversed, might offer a better execution environment (less interference). Thus, the greatest
limitation of this protocol is that the dispatchers are traversed in a static, predefined order,
and the decision regarding the new release may be derived without considering all of them.
This means that it is not possible to implement selective scheduling policies, which are necessary
for energy/thermal management and other beneficial purposes. The performance of the protocol
will receive additional attention in Section 3.3.4.
3.3.3 Hybrid Protocol
3.3.3.1 Protocol Description
This protocol is the combination of the aforementioned two protocols and it consists of two phases.
The dispatcher behaviour is illustrated with Algorithm 14. Since the logic of the protocol is more
complex, the behaviour of the master dispatcher has been covered with the auxiliary Algorithm 15,
while the auxiliary Algorithm 16 describes the behaviour of slave dispatchers.
Algorithm 14 run()
1: while (true) do
2: if (isMaster = true) then
3: // dispatcher is master
4: runMaster(); // Algorithm 15
5: else
6: // dispatcher is slave
7: runSlave(); // Algorithm 16
8: end if
9: end while
Algorithm 15 runMaster()
 1: wait(startTime); // wait for the time instant to start protocol
 2: for each (d_i^j ∈ D(a_i) | d_i^j ≠ this) do
 3:   sendMsg(d_i^j); // send messages to all slaves
 4: end for
 5: rcvdMsgs ← 0;
 6: while (rcvdMsgs < |D(a_i)| − 1) do
 7:   wait(msgRcvd);
 8:   rcvdMsgs++;
 9: end while
10: // replies from all slaves received
11: getNextReleaseInfo();
12: orderedList ← sortDispatchers(); // create ordered list of all dispatchers
13: if (first(orderedList) = this) then
14:   // master remains the same
15:   releaseDeferredJob();
16: else
17:   sendMsg(first(orderedList));
18:   wait(msgRcvd, ctxReqRcvd);
19:   if (ctxReqRcvd) then
20:     sendCtx(nextMaster); // master changes (migration occurs)
21:     isMaster ← false;
22:   else
23:     getNextReleaseInfo();
24:     if (canReleaseDeferredJob() = true) then
25:       // master remains the same
26:       releaseDeferredJob();
27:     else
28:       sendMsg(next(orderedList)); // send a message to the successor dispatcher
29:       wait(ctxReqRcvd);
30:       sendCtx(nextMaster); // master changes (migration occurs)
31:       isMaster ← false;
32:     end if
33:   end if
34: end if
Within every inter-arrival period of the application, after its computation and memory access
deadlines expire, the master initiates the agreement protocol (line 1 of Algorithm 15). At that time
instant, the first phase begins. This phase is very similar to the Master-Slave protocol. The master
sends messages to all slaves (lines 2−4 of Algorithm 15). Upon receiving the message from the
master (line 1 of Algorithm 16), each slave requests from the local kernel the information whether
the next job can be released on that core (line 2 of Algorithm 16). After receiving that information,
each slave sends it back to the master (line 3 of Algorithm 16). The master waits until it receives
the replies from all slaves (lines 6−9 of Algorithm 15). Once all the replies are received, the
master requests the information from its kernel (line 11 of Algorithm 15), and together with the
information received from all slaves generates a list where all dispatchers are sorted according to
their likelihood of accommodating the next job release (line 12 of Algorithm 15). At this moment
the first phase finishes.
Algorithm 16 runSlave()
 1: wait(msgRcvd);
 2: getNextReleaseInfo();
 3: sendMsg(master);
 4: wait(msgRcvd, startTime);
 5: if (msgRcvd = true) then
 6:   getNextReleaseInfo();
 7:   if (canReleaseDeferredJob() = true) then
 8:     sendCtxReq(master); // send a context request to the master
 9:     wait(ctxRcvd);
10:     isMaster ← true; // master changes (migration occurs)
11:     releaseDeferredJob();
12:   else
13:     sendMsg(next(orderedList)); // send a message to the successor dispatcher
14:   end if
15: end if
The second phase is very similar to the List protocol. If the first dispatcher in the sorted list
is the same as the old master, nothing changes; the master requests the next job release from its
kernel (line 15 of Algorithm 15), and waits for the time instant to start the next protocol (line 1 of
Algorithm 15).
Otherwise, the master sends the message to the first dispatcher of the list (line 17 of
Algorithm 15). At this stage, each slave dispatcher waits for one of two possible events: (i) the arrival
of the protocol message, or (ii) the beginning of the new protocol (line 4 of Algorithm 16). In the
former case, the slave again requests the information regarding the next job release from its kernel
(line 6 of Algorithm 16). If the kernel does not allow a new job release, the dispatcher sends the
message to the next dispatcher from the list (line 13 of Algorithm 16). This process repeats until
the kernel of one dispatcher provides a positive reply, and that dispatcher will become the next
master.
If that dispatcher is the current master, nothing changes; the master requests the next job
release from its kernel (line 26 of Algorithm 15), and waits for the time instant to start the next
protocol (line 1 of Algorithm 15). Otherwise, if that dispatcher is a slave, it informs the old
master about the outcome by requesting the execution context (line 8 of Algorithm 16), and waits
for the context to be transferred (line 9 of Algorithm 16). Then, the old master sends the context
to the new master (line 20 or line 30 of Algorithm 15), and demotes itself to the slave role (line
21 or line 31 of Algorithm 15). After it receives the context, the next master promotes itself to the
master role and requests the next job release from its kernel (lines 10−11 of Algorithm 16).
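The two phases described above can be traced with a short simulation. The following Python sketch is an illustration only: the function name `hybrid_round`, the dictionaries `likelihood` and `can_release`, and the rule that a higher score means a more promising dispatcher are assumptions made here and are not part of Algorithms 15 and 16. It counts the messages exchanged in one round and reports whether a migration occurs.

```python
def hybrid_round(master, likelihood, can_release):
    """Simulate one round of the Hybrid protocol (illustrative sketch).

    master      -- identifier of the current master dispatcher
    likelihood  -- dict: dispatcher -> score reported in the first phase
                   (assumption: higher score = more promising)
    can_release -- dict: dispatcher -> kernel answer in the second phase
    Returns (new_master, messages, migrated).
    """
    dispatchers = list(likelihood)
    # First phase: one request and one reply per slave.
    messages = 2 * (len(dispatchers) - 1)
    ordered = sorted(dispatchers, key=lambda d: likelihood[d], reverse=True)

    # Second phase: if the old master heads the list, nothing changes.
    if ordered[0] == master:
        return master, messages, False

    for d in ordered:
        messages += 1  # protocol message handed over to dispatcher d
        if can_release[d]:
            if d == master:
                return master, messages, False  # master remains the same
            messages += 2  # context request + context transfer
            return d, messages, True
    raise RuntimeError("no dispatcher can accommodate the next job release")


scores = {"d0": 1, "d1": 3, "d2": 2, "d3": 0}
answers = {"d0": False, "d1": False, "d2": True, "d3": True}
print(hybrid_round("d0", scores, answers))  # ('d2', 10, True)
```

When only the last dispatcher of the ordered list answers positively, the sketch counts 2·(n−1) + n + 2 messages for n dispatchers, which is the case the timing analysis has to cover.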
3.3.3.2 Timing Analysis
The worst-case scenario occurs when only the last dispatcher of the ordered list is granted the
permission to release the next job, and that case has to be considered when performing the analysis.
During this protocol, in the worst-case the master can perform the following OS operations:
(i) send the protocol message |D(ai)|−1 times, (ii) receive the protocol message |D(ai)|−1 times,
(iii) query the kernel for the information regarding the next release, (iv) generate the sorted list of
all dispatchers, (v) send the protocol message to the first dispatcher from the list, (vi) receive the
protocol message, (vii) query the kernel for the information regarding the next release, (viii) send
the protocol message, (ix) receive the context request, and (x) transfer the context to the next
master. Thus, the delay of the protocol execution on the core of the master, denoted by CηM(ai),
can be expressed with Equation 3.15.
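\[
C^{\eta}_{M}(a_i) = (|\mathcal{D}(a_i)|-1)\cdot\delta^{\rightarrow}_{P} + (|\mathcal{D}(a_i)|-1)\cdot\delta^{\leftarrow}_{P} + \delta_{Q} + \delta_{E} + \delta^{\rightarrow}_{P} + \delta^{\leftarrow}_{P} + \delta_{Q} + \delta^{\rightarrow}_{P} + \delta^{\leftarrow}_{P} + \delta^{\rightarrow}_{C}
\tag{3.15}
\]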
Similarly, the delay of the protocol execution on the core of the slave that will not become the
next master, denoted by CηS(ai), can be expressed with Equation 3.16. Of interest for the analysis
is to identify the parts of CηS(ai) that were executed during the first and the second phase, termed
CηS1(ai) and CηS2(ai), respectively.
\[
C^{\eta}_{S}(a_i) = \overbrace{\delta^{\leftarrow}_{P} + \delta_{Q} + \delta^{\rightarrow}_{P}}^{C^{\eta}_{S1}(a_i)} + \overbrace{\delta^{\leftarrow}_{P} + \delta_{Q} + \delta^{\rightarrow}_{P}}^{C^{\eta}_{S2}(a_i)}
\tag{3.16}
\]
The delay of the protocol execution on the core of the slave that will become the next master,
denoted by CηN(ai), can be expressed with Equation 3.17.
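\[
C^{\eta}_{N}(a_i) = \delta^{\leftarrow}_{P} + \delta_{Q} + \delta^{\rightarrow}_{P} + \delta^{\leftarrow}_{P} + \delta_{Q} + \delta^{\rightarrow}_{P} + \delta^{\leftarrow}_{C}
\tag{3.17}
\]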
The message transfer delay Cη# (ai) can be computed by solving Equation 3.18. During the
first phase, 2 · (|D(ai)|− 1) protocol messages are exchanged between the master and all slaves.
During the second phase, |D(ai)| protocol messages are exchanged when traversing the singly
linked list of dispatchers, followed by a context request from the new master and the subsequent
context transfer.
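\[
\begin{aligned}
C^{\eta}_{\#}(a_i) ={} & 2\cdot(|\mathcal{D}(a_i)|-1)\cdot\Bigl(\mathit{maxhops}(a_i)\cdot\delta_{L} + (\mathit{maxhops}(a_i)-1)\cdot\delta_{\rho} + \Bigl\lceil\frac{\sigma_{prt}}{\sigma_{flit}}\Bigr\rceil\cdot\delta_{L}\Bigr) \\
& + (|\mathcal{D}(a_i)|+1)\cdot\Bigl(\mathit{maxhops}(a_i)\cdot\delta_{L} + (\mathit{maxhops}(a_i)-1)\cdot\delta_{\rho} + \Bigl\lceil\frac{\sigma_{prt}}{\sigma_{flit}}\Bigr\rceil\cdot\delta_{L}\Bigr) \\
& + \mathit{maxhops}(a_i)\cdot\delta_{L} + (\mathit{maxhops}(a_i)-1)\cdot\delta_{\rho} + \Bigl\lceil\frac{\sigma_{ctx}}{\sigma_{flit}}\Bigr\rceil\cdot\delta_{L}
\end{aligned}
\tag{3.18}
\]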
After these terms have been obtained, the total delay of the protocol execution Cη(ai) can
be computed by solving Equation 3.19. Notice that during the first phase all slaves perform their
protocols in parallel, thus it is sufficient to take into account only one of them, e.g. the next master.
During the second phase, slaves perform their protocol-related OS operations sequentially, so all
of them have to be taken into account when computing the total protocol delay. This is why CηS(ai)
had to be divided into CηS1(ai) and CηS2(ai) in Equation 3.16.
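\[
C^{\eta}(a_i) = \overbrace{C^{\eta}_{M}(a_i)}^{\text{old master delay}} + \overbrace{C^{\eta}_{N}(a_i)}^{\text{next master delay}} + \overbrace{(|\mathcal{D}(a_i)|-2)\cdot C^{\eta}_{S2}(a_i)}^{\text{2nd phase delay of all remaining slaves}} + \overbrace{C^{\eta}_{\#}(a_i)}^{\text{network delay}}
\tag{3.19}
\]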
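As a worked example, the isolation delay of Equations 3.15 to 3.19 can be evaluated numerically. The Python sketch below uses the latencies of Table 3.2 (every protocol-related OS operation costs 10000 cycles, δL = 1 cycle, δρ = 3 cycles, 16-byte flits and protocol messages, a 1-Kbyte execution context). The value maxhops = 6 and all function and parameter names are assumptions chosen for illustration.

```python
from math import ceil

def msg_delay(maxhops, size, flit=16, d_link=1, d_route=3):
    """Isolated NoC traversal delay of one message, in cycles."""
    return maxhops * d_link + (maxhops - 1) * d_route + ceil(size / flit) * d_link

def hybrid_isolation_delay(n, maxhops, d_os=10000, s_prt=16, s_ctx=1024):
    """Isolation delay of the Hybrid protocol for n dispatchers, in cycles.

    Every protocol-related OS operation is assumed to cost d_os cycles.
    """
    c_master = (2 * (n - 1) + 8) * d_os      # Equation 3.15
    c_slave2 = 3 * d_os                      # second-phase part of Equation 3.16
    c_next = 7 * d_os                        # Equation 3.17
    c_net = ((2 * (n - 1) + (n + 1)) * msg_delay(maxhops, s_prt)
             + msg_delay(maxhops, s_ctx))    # Equation 3.18
    return c_master + c_next + (n - 2) * c_slave2 + c_net  # Equation 3.19

cycles = hybrid_isolation_delay(n=5, maxhops=6)
print(cycles)  # 320393
```

With a 2 GHz clock this corresponds to roughly 160 µs, and the OS operations, not the NoC traversal, dominate the isolation delay under these assumptions.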
Equation 3.19 represents the delay of the protocol execution of an application in isolation,
assuming that it does not suffer any interference from other applications. In order to obtain the
worst-case protocol delay, the interference from other applications has to be taken into account.
The interference that any dispatcher, master or slave, may suffer while performing its protocol-
related OS operations within the time interval t, termed Iη(d_i^j, t), has already been computed for
the Master-Slave protocol (Equation 3.6). Since this term is protocol-independent, it can be reused
in this context. The same holds for the network interference that an application might suffer on
the NoC within the time interval t, denoted by Iη#(ai, t) (Equation 3.7).
After all the relevant terms have been identified, the worst-case protocol delay can be com-
puted by solving Equation 3.20.
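\[
\begin{aligned}
R^{\eta}(a_i) ={} & \overbrace{C^{\eta}(a_i)}^{\text{isolation delay}} + \overbrace{I^{\eta}(d_i^{j}, R^{\eta}(a_i))}^{\text{master interference}} + \overbrace{I^{\eta}(d_i^{k}, t^{*})}^{\text{latest slave interference in 1st phase}} \\
& + \overbrace{\sum_{\forall d_i^{m} \in \mathcal{D}(a_i) \mid d_i^{m} \neq d_i^{j}} I^{\eta}(d_i^{m}, t^{+})}^{\text{all slaves interference in 2nd phase}} + \overbrace{I^{\eta}(d_i^{n}, t^{\triangle})}^{\text{next master interference}} + \overbrace{I^{\eta}_{\#}(a_i, R^{\eta}(a_i))}^{\text{network interference}}
\end{aligned}
\tag{3.20}
\]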
The term latest slave interference in 1st phase corresponds to the delay of the slave which was
the last to deliver its response to the master during the first phase. Additionally, the terms t∗, t+
and t△ are the sub-intervals of the worst-case protocol delay. t∗ corresponds to the time interval of
the first phase during which the latest slave suffers the interference. t+ denotes the individual
time interval of the second phase during which each slave suffers the interference. t△ represents
an additional interval during which the next master suffers the interference while receiving the
context.
Notice that in Equation 3.20 the first and the last term are master-independent. However, the
remaining terms depend not only on the decision which dispatchers of the analysed application
ai perform the roles of the master, the latest slave and the next master, but also on the roles that
dispatchers of other applications perform during the analysed period. Thus, like in the previous
case, in order to solve Equation 3.20 it is necessary to identify roles of all dispatchers that will
lead to the worst-case (the biggest protocol delay).
Algorithm 10 was introduced before, and it was used to compute the maximum interference
that an individual dispatcher might suffer during the time interval t, due to the other on-core
dispatchers. Since that algorithm is protocol-independent, it will also be reused in this context.
Therefore, the only remaining activity is to identify the dispatcher roles for the application under
analysis. For that, Algorithm 17 is used, which is divided into 5 parts. First, the master of the anal-
ysed application is identified (lines 4−11). This process is identical to the process of identifying
the master for the Master-Slave and List protocols.
Then, the latest slave of the analysed application during the first phase is identified (lines
12−18). This process is identical to the process of identifying the latest slave for the Master-
Slave protocol.
After that, the joint interference suffered by all slaves of the analysed application during the
second phase is computed (lines 19−25). This process is identical to the process of computing
the joint delay of all slaves for the List protocol.
Then, the next master is identified (lines 26−32). This process is identical to the process of
identifying the next master of the Master-Slave and List protocols.
Finally, the worst-case protocol delay can be computed. As shown in Equation 3.20, it consists
of six components: (i) the isolation delay, (ii) the master interference, (iii) the interference of the
latest slave during the first phase, (iv) the interference of all slaves during the second phase, (v) the
next master interference, and (vi) the network interference. In previous steps the second, the third,
the fourth and the fifth component were obtained. Thus, the remaining components are obtained
by solving Equations 3.19 and 3.7 (lines 34−35). The worst-case protocol delay is
equal to the sum of these terms (line 36 and Equation 3.20). Similarly to intervals t∗, t+ and t△,
the worst-case protocol delay Rη(ai) is also computed iteratively, until the computed value is the
same in two successive iterations. The schedulability condition is that the obtained value is less
than or equal to the communication deadline, i.e. Rη(ai)≤ Dη(ai).
Algorithm 17 compWorstCaseDelay(ai, A)
Input: application ai, application-set A
Output: worst-case protocol delay of ai
1: Rη(ai) ← 0; // initialise the worst-case protocol delay
2: repeat
3:     Rη_old(ai) ← Rη(ai);
4:     // 1. identify the master
5:     maxMasterInf ← 0;
6:     for each (d_i^k ∈ D(ai)) do
7:         currMasterInf ← maxDispInf(d_i^k, true, Rη_old(ai));
8:         if (currMasterInf > maxMasterInf) then
9:             maxMasterInf ← currMasterInf; master ← d_i^k;
10:         end if
11:     end for
12:     // 2. identify the latest slave
13:     maxSlaveInf ← 0;
14:     for each (d_i^k ∈ D(ai) | d_i^k ≠ master) do
15:         t∗ ← 0;
16:         repeat t∗_old ← t∗; t∗ ← maxDispInf(d_i^k, false, t∗_old + δ←P + δQ + δ→P); until (t∗_old = t∗);
17:         if (t∗ > maxSlaveInf) then latestSlave ← d_i^k; maxSlaveInf ← t∗; end if
18:     end for
19:     // 3. compute interference suffered by all slaves
20:     maxSlavesInf ← 0;
21:     for each (d_i^m ∈ D(ai) | d_i^m ≠ master) do
22:         t+ ← 0;
23:         repeat t+_old ← t+; t+ ← maxDispInf(d_i^m, false, t+_old + δ←P + δQ + δ→P); until (t+_old = t+);
24:         maxSlavesInf ← maxSlavesInf + t+;
25:     end for
26:     // 4. identify the next master
27:     maxNextMasterInf ← 0;
28:     for each (d_i^n ∈ D(ai) | d_i^n ≠ master) do
29:         t△ ← 0;
30:         repeat t△_old ← t△; t△ ← maxDispInf(d_i^n, false, t△_old + δ←C); until (t△_old = t△);
31:         if (t△ > maxNextMasterInf) then nextMaster ← d_i^n; maxNextMasterInf ← t△; end if
32:     end for
33:     // 5. compute the worst-case protocol delay
34:     iso ← Cη(ai); // compute isolation delay (Equation 3.19)
35:     netInf ← Iη#(ai, Rη_old(ai)); // compute network interference (Equation 3.7)
36:     Rη(ai) ← iso + netInf + maxMasterInf + maxSlaveInf + maxSlavesInf + maxNextMasterInf; // Equation 3.20
37: until (Rη_old(ai) = Rη(ai));
38: return Rη(ai);
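The outer loop of Algorithm 17 is a fixed-point iteration: the delay is recomputed until two successive iterations agree. The following Python sketch shows only this pattern, under loudly stated assumptions: `interference` is a hypothetical stand-in for the interference terms (it is not Algorithm 10), modelled here as a set of periodic interferers, and the iteration cap is added purely as a guard against divergence.

```python
from math import ceil

def interference(t, interferers):
    """Hypothetical interference within an interval t.

    interferers -- list of (period, cost) pairs; each interferer is assumed
    to contribute ceil(t / period) * cost.
    """
    return sum(ceil(t / period) * cost for period, cost in interferers)

def worst_case_delay(isolation, interferers, deadline, max_iter=1000):
    """Iterate R = isolation + interference(R) until R stops changing.

    Returns the converged delay, or None if the deadline is exceeded
    (or no fixed point is found within max_iter iterations).
    """
    r = 0
    for _ in range(max_iter):
        r_old = r
        r = isolation + interference(r_old, interferers)
        if r > deadline:
            return None      # schedulability condition R <= D is violated
        if r == r_old:
            return r
    return None

print(worst_case_delay(95, [(100, 10)], deadline=250))  # 115
```

In this example the iterates are 95, 105 and 115, after which the value no longer changes. Stopping as soon as the deadline is exceeded is a design choice of the sketch; it avoids iterating on an application that is already unschedulable.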
3.3.3.3 Discussion
During the first phase of the protocol, all dispatchers are requesting the information regarding
the next job release from their respective kernels, and sending it to the current master. During
the second phase, the dispatchers are sequentially traversed with the objective to identify the one
which will be able to accommodate the release of the next job. The strategy of the protocol is to
first attempt to release the job on the cores of those dispatchers whose kernels, when queried during
the first phase, reported the most promising environment for the accommodation of that workload.
Due to its optimistic nature, when compared with the other two protocols, this one has a higher
probability to finish the protocol well before the computed worst-case protocol delay. In fact,
since the most promising dispatchers are traversed early during the second phase, it is reasonable
to expect that their kernels will be able to accommodate the next job and hence complete the
protocol with only a few dispatchers traversed during the second phase. However, this comes at
the expense of a more significant amount of traffic. The performance of the protocol will receive
additional attention in Section 3.3.4.
3.3.4 Experimental Evaluation
In this section, the experimental evaluation of the agreement protocols is performed. The
experiments are conducted on the extended version of the simulator SPARTS [73]. The protocol-related
OS operations (i.e. sending and receiving protocol messages, sending and receiving execution
contexts, requesting the next job release information from the local kernel, electing the master
dispatcher, and generating the sorted list of dispatchers) are assumed to have latencies that are
valid for present micro-kernels.
The aim of this evaluation is to observe the relation between the analytically obtained worst-
case protocol delay (WCPD) values – Rη(ai) (computed by the methods described in the previous
sections), and the WCPD values obtained during simulations – Rη∗(ai). Moreover, it will be
observed how these trends change for different protocols and different workloads. Finally, by
varying the number of dispatchers it will be investigated how this protocol parameter and the
amount of traffic influence the aforementioned relations and affect the overall protocol behaviour.
Workload, analysis and simulation parameters
When generating the workload for the evaluation, dispatchers of each application are randomly
assigned to the cores. The only restriction is that no two dispatchers of the same application can
be assigned to the same core. The mapping problem will be thoroughly investigated in
Section 3.6. The execution of each application-set will be simulated in two different scenarios: with
synchronous and asynchronous protocol releases. In the former case, the idea is to trigger and
observe the scenario where all applications start their protocols at the same time (i.e. to generate
significant contention), while the latter models a more realistic scenario.
The analysis and simulation parameters are given in Table 3.2, where an asterisk sign denotes
a randomly generated value assuming a uniform distribution.
Table 3.2: Analysis and simulation parameters for Section 3.3.4
Parameter                             Symbol                          Value
NoC topology and size                                                 2-D mesh with 10×10 routers
Link width = flit size                σflit                           16 bytes
Router frequency                      νρ                              2 GHz
Routing delay                         δρ                              3 cycles (1.5 ns)
Link traversal delay                  δL                              1 cycle (0.5 ns)
Protocol-related OS operations        δ←P, δ→P, δ←C, δ→C, δQ, δE      10000 cycles (5 µs)
Application periods                   D(ai) = T(ai), ∀ai ∈ A          [100−1000]∗ ms
Communication deadlines               Dη(ai), ∀ai ∈ A                 0.25 · D(ai)
Protocol message size                 σprt                            16 bytes
Execution context size                σctx                            1 Kbyte
Application-set size                  |A|                             200 applications
Maximum concurrent on-core masters    M̂                               10
Simulated time                                                        100 s
Experiment 1: Analysis pessimism
In this experiment the focus is on the analysis pessimism. In particular, the ratio between
Rη∗(ai) and Rη(ai) is observed. Each application is represented with 5 dispatchers, i.e.
|D(ai)| = 5, ∀ai ∈ A.
In Figure 3.4, the horizontal axis represents the Rη∗(ai) values, expressed relatively, as the
percentage of the corresponding Rη(ai) values. The vertical axis stands for the number of applications
which fall into a given category (a certain ratio between observed and calculated WCPD), expressed
as the percentage of the total application-set size.
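The categories plotted in Figure 3.4 can be obtained by binning the per-application ratio between the observed and the calculated WCPD. The sketch below shows one way to do this in Python; the bin width of 10 percentage points, the function name and the sample values are assumptions made for illustration and are not data from the experiments.

```python
from collections import Counter

def pessimism_histogram(measured, calculated, bin_width=10):
    """Share of applications (in %) per ratio bin.

    measured, calculated -- equally long lists with the observed and the
    analytically obtained WCPD of each application.
    Returns dict: lower bin edge (in %) -> percentage of the application-set.
    """
    ratios = [100.0 * m / c for m, c in zip(measured, calculated)]
    bins = Counter(int(r // bin_width) * bin_width for r in ratios)
    total = len(ratios)
    return {edge: 100.0 * count / total for edge, count in sorted(bins.items())}

# Made-up values for four applications.
print(pessimism_histogram([12, 30, 33, 55], [100, 100, 100, 100]))
# {10: 25.0, 30: 50.0, 50: 25.0}
```

A small ratio means that the observed delay is far below the analytical bound, i.e. the analysis is pessimistic for that application.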
Since the number of generated messages for the Master-Slave protocol is always constant,
this protocol exhibits the least amount of pessimism of all three protocols. As a consequence, the
Rη∗ (ai) values represent greater fractions of the corresponding Rη(ai) ones.
As expected, the timing analysis of the List protocol in most cases overestimates the number
of messages. That is, the analysis always takes into account the traversal of the entire list, while
during simulations such scenarios rarely occurred. Consequently, this protocol exhibits greater
pessimism than the Master-Slave protocol.
The Hybrid protocol is the combination of the two aforementioned approaches. Since the
messages exchanged during the first phase of the protocol follow the logic explained for the
Master-Slave protocol (their number is constant), and since they constitute at least 2/3 of the
total number of messages, it is reasonable to expect that the pessimism of the Hybrid protocol will
be greater than that of the Master-Slave protocol, but less than that of the List protocol. However,
that is not the case. The explanation for this surprising finding is as follows. During the second
phase, the Hybrid protocol traverses the dispatchers in such a way that the most promising candidates
are visited early. Consequently, the number of dispatchers traversed during the second phase in
simulations is very small. This implies that the protocol is efficient, which manifests as small
WCPD ratios, i.e. a high amount of pessimism. On the other hand, the List protocol pays the price
of a non-intelligent, pre-determined order in which the dispatchers are traversed. In other
words, more dispatchers have to be visited before eventually finding the next master, which
manifests as a smaller amount of pessimism. Due to these facts, the amount of pessimism is
greater in the Hybrid than in the List protocol.
[Figure 3.4: Analysis pessimism. Horizontal axis: ratio between measured and calculated WCPD, in % (0−60). Vertical axis: quantity of applications, in % of the application-set size (0−20). Series: MS−Sync, MSNon−Sync, List−Sync, ListNon−Sync, Hybrid−Sync, HybridNon−Sync.]
As expected, it holds for all three protocols that the synchronous releases cause higher WCPD
ratios, due to the extensive amount of traffic generated in short periods of time.
Experiment 2: Scalability
The objective of this experiment is to explore the scalability potential of the proposed
agreement protocols. To that end, the number of dispatchers of each application is varied in
the range 2−15, i.e. |D(ai)| ∈ {2, ...,15}, ∀ai ∈ A. Subsequently, it is observed how the ratio
between the WCPD values, obtained via simulations and via analysis, changes as a function of the
number of dispatchers.
In Figure 3.5, the horizontal axis represents the number of dispatchers per application. The
vertical axis in Figure 3.5(a) stands for the WCPD values obtained via simulations – Rη∗ (ai), ex-
pressed relatively, as the percentage of the analytically obtained WCPD ones – Rη(ai), while in
Figure 3.5(b) the values of the aforementioned terms are presented in a logarithmic scale.
The Master-Slave protocol with non-synchronised releases exhibits an almost constant amount
of pessimism over the entire observed domain. The visible increase of the pessimism in the left
side of Figure 3.5(a) is caused by the pessimistically obtained network interference component,
which leads to the noticeable discrepancy between the simulations and analysis; a small number of
dispatchers causes fewer NoC contentions during simulations, while the analysis considers that all
higher priority traffic existing within the NoC will cause interference to the analysed application.
As the number of dispatchers and hence messages increase, the pessimism slowly decreases.
Conversely, for the Master-Slave protocol with synchronised releases, the pessimism is the
least for a small number of dispatchers. This occurs for two reasons. First, due to the
synchronous releases, contentions always occur, irrespective of the number of per-application
dispatchers. Second, for fewer messages the difference between the predicted (analysis) and observed
(simulations) worst-case scenarios is the least, thus the obtained network interference component
is the least pessimistic. Until a certain point, the network successfully copes with the increased
number of dispatchers and messages, hence causing the rise in pessimism. Near the end of the
graph, the traffic congestion becomes more significant and a similar trend of a slight pessimism
decrease is noticeable.
As expected, both List protocol runs (with and without synchronous releases) display an increase
in pessimism as the number of dispatchers increases. The explanation is that in many cases the
dispatchers placed near the end of the list are not traversed, while the analysis always considers
the worst case (the traversal of the entire list). Figure 3.5(b) demonstrates that on most of the
observed domain additional dispatchers cause a barely noticeable increase of the WCPD, which
confirms the previous statement that the dispatchers positioned near the end of the list are in most
cases not visited. However, after a certain point, the protocol starts to pay the price of static,
non-intelligent traversal, which causes numerous futile visits to dispatchers that cannot
accommodate the next job release. Therefore, a significant decrease of the pessimism on the right
side of Figure 3.5(a) is visible, leading to a counter-intuitive conclusion that this protocol does
not scale when the number of dispatchers is more than a dozen.
[Figure 3.5: Impact of dispatchers on WCPD. (a) Ratio between measured and calculated WCPD, in % (0−60), against the number of dispatchers per application (2−15); series: MS−Sync, MSNon−Sync, List−Sync, ListNon−Sync, Hybrid−Sync, HybridNon−Sync. (b) Measured and calculated WCPD, in logarithmic scale (10^6−10^9), against the number of dispatchers per application (2−15); series: MS predicted, MS−Sync measured, MSNon−Sync measured, List predicted, List−Sync measured, ListNon−Sync measured, Hybrid predicted, Hybrid−Sync measured, HybridNon−Sync measured.]
Finally, when compared with the List protocol, the Hybrid protocol demonstrates a similar
behaviour. Due to the safe assumption of the entire list traversal, a steady increase in the pessimism
is noticeable as the number of dispatchers increases. However, after a certain point, the Hybrid
protocol pays the price of the extensive communication, which leads to network congestion and shows
as a steady but only temporary increase of the measured WCPD values for 7−10 dispatchers. One
surprising conclusion drawn from Figure 3.5(b) is that there exists an interval (between 3 and
8 dispatchers) where the Hybrid protocol has a shorter measured WCPD than the Master-Slave
protocol, despite the fact that it always induces more messages. The explanation is that the Hybrid
protocol efficiently selects the next master dispatcher, while the Master-Slave protocol, due to
race conditions, causes fragmentation and highly loaded cores, where the on-core protocol-related
interference becomes a predominant factor. An additional surprising fact is that the Hybrid protocol
successfully copes with the network congestion by efficiently finding the next master dispatcher,
and hence drastically minimising the duration of its second phase. As is visible in Figure 3.5,
the Hybrid protocol scales well and displays a good average and worst-case performance in both
relative and absolute terms; however, for the very same reasons, it exhibits a significant amount of
pessimism.
3.4 Inter-application Communication
The analyses proposed in the previous section consider only intra-application communication
(agreement protocols). This means that these analyses are applicable only to application-sets with
independent applications. To overcome this limitation, in this section the aforementioned analyses
will be extended, such that the inter-application communication is considered as well.
Recall that the inter-application messages are exchanged by the master dispatchers of the
interacting applications. Thus, the only components of the aforementioned analyses that are
affected are: the master’s communication-related OS operations, denoted by CηM(ai) (Equation 3.1
for the Master-Slave protocol, Equation 3.9 for the List protocol and Equation 3.15 for the Hybrid
protocol), and the delay of the messages over the NoC, termed Cη#(ai) (Equation 3.4 for the
Master-Slave protocol, Equation 3.12 for the List protocol and Equation 3.18 for the Hybrid protocol).
These terms should be extended to also cover the inter-application communication.
In order to do that, first, the new OS operations are introduced, as shown in Table 3.3.
Table 3.3: OS operations related to inter-application communication
Symbol    Operation
δ→I       Send the inter-application message (performed by the master dispatcher)
δ←I       Receive the inter-application message (performed by the master dispatcher)
Let FηS(ai) be a set of inter-application messages sent by an application ai to other
applications, such that a message fi,j ∈ FηS(ai) is sent from the master of the application ai to the
master of the application aj. Similarly, let FηR(ai) be a set of inter-application messages received
by ai, such that a message fj,i ∈ FηR(ai) is sent from the master of aj to the master of ai.
With these assumptions, Equation 3.1, which covers the communication-related OS operations
performed by the master of the application with the Master-Slave protocol, is substituted with
Equation 3.21.
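\[
C^{\eta}_{M}(a_i) = \overbrace{(|\mathcal{D}(a_i)|-1)\cdot\delta^{\rightarrow}_{P} + (|\mathcal{D}(a_i)|-1)\cdot\delta^{\leftarrow}_{P} + \delta_{Q} + \delta_{E} + \delta^{\rightarrow}_{C}}^{\text{protocol}} + \overbrace{|\mathcal{F}^{\eta}_{S}(a_i)|\cdot\delta^{\rightarrow}_{I} + |\mathcal{F}^{\eta}_{R}(a_i)|\cdot\delta^{\leftarrow}_{I}}^{\text{inter-application traffic}}
\tag{3.21}
\]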
[Figure 3.6: Inter-application communication. Labels: Application a1, Application a2.]
In a similar way, Equation 3.9, which corresponds to the List protocol is substituted with
Equation 3.22.
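\[
C^{\eta}_{M}(a_i) = \overbrace{\delta_{Q} + \delta^{\rightarrow}_{P} + \delta^{\leftarrow}_{P} + \delta^{\rightarrow}_{C}}^{\text{protocol}} + \overbrace{|\mathcal{F}^{\eta}_{S}(a_i)|\cdot\delta^{\rightarrow}_{I} + |\mathcal{F}^{\eta}_{R}(a_i)|\cdot\delta^{\leftarrow}_{I}}^{\text{inter-application traffic}}
\tag{3.22}
\]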
Finally, Equation 3.15, which covers the Hybrid protocol is substituted with Equation 3.23.
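\[
\begin{aligned}
C^{\eta}_{M}(a_i) ={} & \overbrace{(|\mathcal{D}(a_i)|-1)\cdot\delta^{\rightarrow}_{P} + (|\mathcal{D}(a_i)|-1)\cdot\delta^{\leftarrow}_{P} + \delta_{Q} + \delta_{E} + \delta^{\rightarrow}_{P} + \delta^{\leftarrow}_{P} + \delta_{Q} + \delta^{\rightarrow}_{P} + \delta^{\leftarrow}_{P} + \delta^{\rightarrow}_{C}}^{\text{protocol}} \\
& + \overbrace{|\mathcal{F}^{\eta}_{S}(a_i)|\cdot\delta^{\rightarrow}_{I} + |\mathcal{F}^{\eta}_{R}(a_i)|\cdot\delta^{\leftarrow}_{I}}^{\text{inter-application traffic}}
\end{aligned}
\tag{3.23}
\]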
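Equations 3.21 to 3.23 differ only in their protocol part, which the following Python sketch makes explicit. It assumes, as in Table 3.2, that every protocol-related OS operation has the same latency. The assumption that δ→I and δ←I share that latency as well, and all function and parameter names, are illustrative choices that are not taken from the analysis.

```python
def master_os_delay(protocol, n, sent, received, d_os=10000, d_inter=10000):
    """Master-side OS delay including inter-application traffic, in cycles.

    protocol       -- "master-slave", "list" or "hybrid"
    n              -- number of dispatchers of the application
    sent, received -- number of inter-application messages sent / received
    d_inter        -- assumed latency of one inter-application send or receive
    """
    protocol_ops = {
        "master-slave": 2 * (n - 1) + 3,   # protocol part of Equation 3.21
        "list": 4,                         # protocol part of Equation 3.22
        "hybrid": 2 * (n - 1) + 8,         # protocol part of Equation 3.23
    }[protocol]
    return protocol_ops * d_os + (sent + received) * d_inter

for name in ("master-slave", "list", "hybrid"):
    print(name, master_os_delay(name, n=5, sent=2, received=1))
```

For 5 dispatchers, 2 sent and 1 received message, the three protocols yield 140000, 70000 and 190000 cycles, respectively, under these assumptions.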
Due to the master volatility property, the analysis of the inter-application traffic is not trivial.
This problem is illustrated with Figure 3.6, where two applications a1 and a2 communicate. Notice
that depending on which dispatchers of both applications perform the master role, a single inter-
application message f1,2 may traverse any of the paths illustrated in Figure 3.6.
One way to circumvent this problem is to apply an approach which is similar to the one used
for the intra-application traffic. Specifically, let maxhops(ai,aj) be the maximum distance
between any two dispatchers of two interacting applications ai and aj. Subsequently, the
analysis of ai covers the worst-case by assuming that each of its inter-application messages
traverses its longest possible distance, that is: |L(fi,j)| = maxhops(ai,aj), ∀fi,j ∈ FηS(ai) and
|L(fj,i)| = maxhops(aj,ai), ∀fj,i ∈ FηR(ai). Moreover, it is assumed that each intra- and
inter-application message with the priority higher than that of ai: (i) also traverses its longest
possible distance, and (ii) causes interference to ai, irrespective of its potential path.
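The term maxhops(ai,aj) can be computed directly from the dispatcher placements. The Python sketch below assumes that dispatcher positions are given as (x, y) router coordinates on the 2-D mesh and that the distance between two routers equals their Manhattan distance, which is the path length under XY routing. The function name and the example coordinates are hypothetical.

```python
from itertools import product

def maxhops(app_a, app_b):
    """Maximum Manhattan distance between any dispatcher of app_a and any
    dispatcher of app_b; each argument is a list of (x, y) coordinates."""
    return max(abs(xa - xb) + abs(ya - yb)
               for (xa, ya), (xb, yb) in product(app_a, app_b))

a1 = [(0, 0), (2, 0), (0, 3)]   # made-up dispatcher positions of a1
a2 = [(5, 5), (7, 4)]           # made-up dispatcher positions of a2
print(maxhops(a1, a2))  # 11
```

Because the bound is taken over all dispatcher pairs, it holds regardless of which dispatchers currently perform the master role in either application.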
Note that inter-application messages inherit the priority of the sender application, i.e. P(fi,j) =
P(ai), and it is the sender’s responsibility to deliver the message to the receiver within its own deadline.
With these assumptions, Equation 3.4, which covers the traffic delay of the application with
the Master-Slave protocol, is substituted with Equation 3.24.
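\[
\begin{aligned}
C^{\eta}_{\#}(a_i) ={} & 2\cdot(|\mathcal{D}(a_i)|-1)\cdot\Bigl(\mathit{maxhops}(a_i)\cdot\delta_{L} + (\mathit{maxhops}(a_i)-1)\cdot\delta_{\rho} + \Bigl\lceil\frac{\sigma_{prt}}{\sigma_{flit}}\Bigr\rceil\cdot\delta_{L}\Bigr) \\
& + \mathit{maxhops}(a_i)\cdot\delta_{L} + (\mathit{maxhops}(a_i)-1)\cdot\delta_{\rho} + \Bigl\lceil\frac{\sigma_{ctx}}{\sigma_{flit}}\Bigr\rceil\cdot\delta_{L} \\
& + \sum_{\forall f_{i,j}\in\mathcal{F}^{\eta}_{S}(a_i)} \Bigl(\mathit{maxhops}(a_i,a_j)\cdot\delta_{L} + (\mathit{maxhops}(a_i,a_j)-1)\cdot\delta_{\rho} + \Bigl\lceil\frac{\sigma_{f_{i,j}}}{\sigma_{flit}}\Bigr\rceil\cdot\delta_{L}\Bigr) \\
& + \sum_{\forall f_{j,i}\in\mathcal{F}^{\eta}_{R}(a_i)} \Bigl(\mathit{maxhops}(a_j,a_i)\cdot\delta_{L} + (\mathit{maxhops}(a_j,a_i)-1)\cdot\delta_{\rho} + \Bigl\lceil\frac{\sigma_{f_{j,i}}}{\sigma_{flit}}\Bigr\rceil\cdot\delta_{L}\Bigr)
\end{aligned}
\tag{3.24}
\]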
In a similar way, Equation 3.12, which corresponds to the List protocol is substituted with
Equation 3.25.
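\[
\begin{aligned}
C^{\eta}_{\#}(a_i) ={} & |\mathcal{D}(a_i)|\cdot\Bigl(\mathit{maxhops}(a_i)\cdot\delta_{L} + (\mathit{maxhops}(a_i)-1)\cdot\delta_{\rho} + \Bigl\lceil\frac{\sigma_{prt}}{\sigma_{flit}}\Bigr\rceil\cdot\delta_{L}\Bigr) \\
& + \mathit{maxhops}(a_i)\cdot\delta_{L} + (\mathit{maxhops}(a_i)-1)\cdot\delta_{\rho} + \Bigl\lceil\frac{\sigma_{ctx}}{\sigma_{flit}}\Bigr\rceil\cdot\delta_{L} \\
& + \sum_{\forall f_{i,j}\in\mathcal{F}^{\eta}_{S}(a_i)} \Bigl(\mathit{maxhops}(a_i,a_j)\cdot\delta_{L} + (\mathit{maxhops}(a_i,a_j)-1)\cdot\delta_{\rho} + \Bigl\lceil\frac{\sigma_{f_{i,j}}}{\sigma_{flit}}\Bigr\rceil\cdot\delta_{L}\Bigr) \\
& + \sum_{\forall f_{j,i}\in\mathcal{F}^{\eta}_{R}(a_i)} \Bigl(\mathit{maxhops}(a_j,a_i)\cdot\delta_{L} + (\mathit{maxhops}(a_j,a_i)-1)\cdot\delta_{\rho} + \Bigl\lceil\frac{\sigma_{f_{j,i}}}{\sigma_{flit}}\Bigr\rceil\cdot\delta_{L}\Bigr)
\end{aligned}
\tag{3.25}
\]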
Finally, Equation 3.18, which covers the Hybrid protocol is substituted with Equation 3.26.
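\[
\begin{aligned}
C^{\eta}_{\#}(a_i) ={} & 2\cdot(|\mathcal{D}(a_i)|-1)\cdot\Bigl(\mathit{maxhops}(a_i)\cdot\delta_{L} + (\mathit{maxhops}(a_i)-1)\cdot\delta_{\rho} + \Bigl\lceil\frac{\sigma_{prt}}{\sigma_{flit}}\Bigr\rceil\cdot\delta_{L}\Bigr) \\
& + (|\mathcal{D}(a_i)|+1)\cdot\Bigl(\mathit{maxhops}(a_i)\cdot\delta_{L} + (\mathit{maxhops}(a_i)-1)\cdot\delta_{\rho} + \Bigl\lceil\frac{\sigma_{prt}}{\sigma_{flit}}\Bigr\rceil\cdot\delta_{L}\Bigr) \\
& + \mathit{maxhops}(a_i)\cdot\delta_{L} + (\mathit{maxhops}(a_i)-1)\cdot\delta_{\rho} + \Bigl\lceil\frac{\sigma_{ctx}}{\sigma_{flit}}\Bigr\rceil\cdot\delta_{L} \\
& + \sum_{\forall f_{i,j}\in\mathcal{F}^{\eta}_{S}(a_i)} \Bigl(\mathit{maxhops}(a_i,a_j)\cdot\delta_{L} + (\mathit{maxhops}(a_i,a_j)-1)\cdot\delta_{\rho} + \Bigl\lceil\frac{\sigma_{f_{i,j}}}{\sigma_{flit}}\Bigr\rceil\cdot\delta_{L}\Bigr) \\
& + \sum_{\forall f_{j,i}\in\mathcal{F}^{\eta}_{R}(a_i)} \Bigl(\mathit{maxhops}(a_j,a_i)\cdot\delta_{L} + (\mathit{maxhops}(a_j,a_i)-1)\cdot\delta_{\rho} + \Bigl\lceil\frac{\sigma_{f_{j,i}}}{\sigma_{flit}}\Bigr\rceil\cdot\delta_{L}\Bigr)
\end{aligned}
\tag{3.26}
\]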
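The two summations that Equations 3.24 to 3.26 add to the network delay can be evaluated as in the following Python sketch. The per-message delay follows the expression used throughout this section, with the link and routing delays of Table 3.2; the message list format, the message sizes and the hop counts in the example are assumptions for illustration only.

```python
from math import ceil

def msg_delay(hops, size, flit=16, d_link=1, d_route=3):
    """Isolated traversal delay of one message over `hops` links, in cycles."""
    return hops * d_link + (hops - 1) * d_route + ceil(size / flit) * d_link

def inter_app_delay(sent, received):
    """Sum of the isolated delays of all inter-application messages.

    sent, received -- lists of (maxhops, size_in_bytes) pairs, one per message.
    """
    return sum(msg_delay(h, s) for h, s in sent + received)

# Two sent messages and one received message (made-up sizes and distances).
print(inter_app_delay([(11, 64), (4, 16)], [(11, 128)]))  # 108
```

The three messages contribute 45, 14 and 49 cycles. This term is added to the protocol-specific part of the network delay, whichever agreement protocol is used.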
Now, depending on the employed agreement protocol, Algorithm 11, Algorithm 13 or
Algorithm 17 can be used to compute the worst-case delay of the application ai, termed Rη(ai).
However, the obtained delay does not represent the worst-case protocol delay (WCPD) any more,
but the entire worst-case communication delay (WCCD). Notice that the condition for the schedu-
lability with respect to the communication remains Rη(ai)≤ Dη(ai).
3.5 Towards More Deterministic Communication Patterns
Recent insights into priority-preemptive, wormhole-switched NoCs suggest that the dominant fac-
tor in the worst-case delay of a flow (message) is not the length of its path, but rather the interfer-
ence it suffers [74]. Thus, the pessimism related to the approach from the previous section can be
attributed mostly to the conservative method which is used to compute the network interference
component Iη# (by assuming that all higher priority traffic can cause interference, irrespective of
(im)possible contentions). Motivated by this reasoning, a novel approach is proposed in this
section. This approach relies on enforcing constraints, in order to make LMM traffic more deter-
ministic and predictable.
Definition 3 (Application shapes). Dispatchers of an application can be positioned only on the
edges of a rectangular x× y structure, such that no corner is left unoccupied and x,y ∈ N. The
special case is a line-like shape, where one or both dimensions of the shape are equal to “1”.
Definition 4 (Rerouting operations). Intra-application messages travel only on the edges of the
shape its application forms, and re-routing occurs where needed to comply with the global XY
routing policy. An individual message rotation (i.e. clockwise or counterclockwise) is chosen such
that the traversal distance is minimised.
Figure 3.7 demonstrates how dispatchers should be mapped and messages consequently routed.
Shaded dispatchers denote locations where reroutings occur. Reroutings are router routines,
performed in an interrupt-like manner, which can be implemented, for example, by instrumenting
the Hardwall™ technology of Tilera platforms [93]. It is assumed that routers contain sufficient
logic and information to perform reroutings on their own, without the need to consult local cores. The
latency of one rerouting operation is denoted by δR.
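Definition 3 can be checked mechanically for a candidate mapping. The Python sketch below assumes that dispatcher positions are given as (x, y) router coordinates and treats the bounding box of those positions as the rectangular structure; the function name is hypothetical.

```python
def complies_with_shape(positions):
    """Check Definition 3 for a collection of (x, y) dispatcher coordinates.

    All dispatchers must lie on the edges of their bounding rectangle and
    every corner of that rectangle must be occupied. A line-like shape is
    the special case where the rectangle has zero width or zero height.
    """
    pts = set(positions)
    if not pts:
        return False
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    corners = {(x0, y0), (x0, y1), (x1, y0), (x1, y1)}
    on_edge = all(x in (x0, x1) or y in (y0, y1) for x, y in pts)
    return on_edge and corners <= pts

print(complies_with_shape([(0, 0), (3, 0), (0, 2), (3, 2), (1, 0)]))  # True
print(complies_with_shape([(0, 0), (3, 0), (0, 2), (1, 1)]))          # False
```

The second mapping fails on both counts: the dispatcher at (1, 1) lies inside the rectangle and the corner (3, 2) is unoccupied.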
[Figure 3.7: Application whose shape and messages comply with Definitions 3-4. (a) Bottom-left dispatcher is the master. (b) Top-right dispatcher is the master.]
3.5.1 Supermessages and Proxies
Because of Definitions 3-4, the part of the network that intra-application traffic uses is now de-
terministic. In order to make the traffic master independent as well, two additional concepts are
introduced, namely the supermessages and the proxies.
Definition 5 (Supermessage). A supermessage is a message which connects (i) diagonally-placed
dispatchers if an application has a rectangular shape, or (ii) terminal dispatchers if an application
has a line-like shape.
According to Definition 5, an application $a_i$ with a line-like shape has 2 supermessages, $\hat{f}_i^{\ell 1}$ and
$\hat{f}_i^{\ell 2}$, and does not involve reroutings, while an application with a rectangular shape has 4
supermessages, of which 2 have the clockwise orientation, $\hat{f}_i^{cw1}$ and $\hat{f}_i^{cw2}$, and 2 the counter-clockwise
orientation, $\hat{f}_i^{cc1}$ and $\hat{f}_i^{cc2}$ (see Figure 3.8). Without any loss of generality, in the rest of this section only
rectangular shapes are analysed, because line-like shapes are a special case of rectangular shapes.
Indeed, any conclusion reached for a rectangular shape can be applied to a line-like shape by
considering only one supermessage of each orientation and treating the remaining supermessages as
non-existent (i.e. $\hat{f}_i^{\ell 1} = \hat{f}_i^{cw1}$; $\hat{f}_i^{\ell 2} = \hat{f}_i^{cc1}$; $\hat{f}_i^{cw2} = \text{null}$; $\hat{f}_i^{cc2} = \text{null}$). Moreover, applications with
line-like shapes do not need reroutings for the intra-application traffic (no shaded dispatchers in
Figure 3.8(b)).
(a) Rectangular shape yields 4 supermessages (b) Line-like shape yields 2 supermessages
Figure 3.8: Supermessages for different application shapes
Theorem 7. Any intra-application message of an application can be expressed by at most 2 dis-
tinct, same-orientation supermessages, and at most 1 rerouting.
Proof. Proven by contradiction. Any intra-application message $f_{i,i}$ of an application $a_i$ assumes
the orientation such that the distance between the dispatchers is minimised (see Definition 4). Let
$c$ denote the circumference of the application shape. Then, it holds that $|\mathcal{L}(f_{i,i})| \le \frac{c}{2}$. As each
supermessage of $a_i$ connects diagonal corners of its shape, the following holds:
$$|\mathcal{L}(\hat{f}_i^{cw1})| = |\mathcal{L}(\hat{f}_i^{cw2})| = |\mathcal{L}(\hat{f}_i^{cc1})| = |\mathcal{L}(\hat{f}_i^{cc2})| = \frac{c}{2}.$$
Assume that $f_{i,i}$ can be expressed with at least 3 same-orientation supermessages. Note, as
there are only two same-orientation supermessages (e.g. $\hat{f}_i^{k1}$ and $\hat{f}_i^{k2}$, where $k \in \{cc, cw\}$), one
of them has to appear twice, i.e. the sequence would be $\{\hat{f}_i^{k1}, \hat{f}_i^{k2}, \hat{f}_i^{k1}\}$ or $\{\hat{f}_i^{k2}, \hat{f}_i^{k1}, \hat{f}_i^{k2}\}$. In
either case, the middle supermessage entirely belongs to $f_{i,i}$, while the first and the last belong
with fractions $\varepsilon_1 > 0$ and $\varepsilon_2 > 0$, respectively. Hence:
$$|\mathcal{L}(f_{i,i})| = \varepsilon_1 + |\mathcal{L}(\hat{f}_i^{k1})| + \varepsilon_2 = \varepsilon_1 + |\mathcal{L}(\hat{f}_i^{k2})| + \varepsilon_2 = \varepsilon_1 + \frac{c}{2} + \varepsilon_2 > \frac{c}{2}$$
The contradiction has been reached. Additionally, as reroutings occur only on places where
supermessages meet, and since any message can be expressed by at most 2 distinct supermessages,
it can involve at most 1 rerouting.
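The counting argument of the proof can be checked numerically. The following Python sketch relies on a simplified model that is an assumption of this illustration, not part of the original analysis: the perimeter of the shape is treated as a ring of c unit links, and the two same-orientation supermessages are the two arcs of c/2 links that meet at the diagonal corners (ring positions 0 and c/2). The names `supermessage_sequence` and `max_supermessages` are hypothetical.

```python
def supermessage_sequence(start, length, c):
    """Return the run of same-orientation supermessages (0 or 1) covered by a
    message that starts at perimeter position `start` and traverses `length`
    unit links in one orientation.

    Assumed model: the perimeter is a ring of `c` links (c even); links
    0 .. c/2-1 belong to supermessage 0, links c/2 .. c-1 to supermessage 1.
    """
    half = c // 2
    sequence = []
    for k in range(length):
        arc = ((start + k) % c) // half
        # Record a supermessage only when the message enters it.
        if not sequence or sequence[-1] != arc:
            sequence.append(arc)
    return sequence


def max_supermessages(c, max_length):
    """Largest number of supermessage segments over all start positions and
    all message lengths from 1 to max_length."""
    return max(
        len(supermessage_sequence(s, n, c))
        for s in range(c)
        for n in range(1, max_length + 1)
    )


# Example: a 4 x 3 shape has a perimeter of 2*(4-1) + 2*(3-1) = 10 links.
c = 10
bounded = max_supermessages(c, c // 2)        # messages no longer than c/2
unbounded = max_supermessages(c, c // 2 + 2)  # longer messages can exceed the bound
```

Under this model, the number of reroutings of a message is the number of segments minus one, so messages of length at most c/2 never need more than one rerouting.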
Notice that supermessages are master-independent and their number is significantly smaller
than the number of possible message paths. Thus, the intuitive idea behind the novel approach is to
transform every intra-application message into the corresponding supermessage(s), together with a
possible rerouting, and perform the analysis on that model.
A similar method is applied to inter-application traffic:
Definition 6 (Proxy). A proxy dispatcher is a dispatcher which is selected at design-time, and
which participates in the inter-application communication. It mediates in the message exchange
between its master and the proxy dispatcher of the other (interacting) application.
An illustrative example of Definition 6 is given in Figure 3.9. In this scenario, an inter-
application message is divided into 5 different components: 1) a message from the master sender
to its proxy, 2) a rerouting on the router of the proxy sender, 3) a message between the proxies, 4)
a rerouting on the router of the proxy receiver, 5) a message from the proxy receiver to its master.
Proxies are decided at design-time, thus a message $f^P_{i,j}$ between proxy dispatchers of applications
$a_i$ and $a_j$ is also deterministic and master-independent. An application can have multiple proxies,
each responsible for the communication with a different application. If a proxy receives an inter-
application message during the agreement protocol of its application, the message is stalled inside
the proxy until the protocol completes and the master (destination) is decided.
Theorem 8. Any inter-application message can be expressed by (i) at most 2 distinct,
same-orientation supermessages on the sender's side, (ii) a message between a proxy sender and a
proxy receiver, (iii) at most 2 distinct, same-orientation supermessages on the receiver's side and
(iv) at most 4 reroutings.
Proof. Proven directly. A message exchanged between a master sender and its proxy complies
with the rules of the intra-application traffic, hence, according to Theorem 7, it can be expressed
by at most 2 distinct same-orientation supermessages. The same conclusion holds for the message
between a receiver proxy and its master. A message between proxies is a non-constrained
point-to-point message.
Messages between masters and their proxies on both sides (sender and receiver) each yield at
most 1 rerouting (Theorem 7). Finally, an inter-proxy message causes 1 rerouting on the router of
the proxy sender and 1 on the router of the proxy receiver.
Figure 3.9: Inter-application communication (applications $a_1$ and $a_2$, each with its proxy)
The introduction of placement and rerouting constraints, supermessages and proxies was a
conscious decision to potentially sacrifice performance (i.e. messages may traverse
longer distances and may involve reroutings). Nonetheless, this approach yields predictable and
deterministic message paths, which allows a more detailed and less pessimistic analysis to be performed
(covered later in Section 3.5.3), and tighter worst-case communication delay estimates to be derived.
According to Theorems 7-8, all intra- and inter-application traffic of all applications can be
expressed with (i) a set of supermessages $\hat{\mathcal{F}}$, (ii) a set of proxy-to-proxy messages $\mathcal{F}^P$, and
(iii) a set of reroutings $\mathcal{R}$, which are all master-independent and known at design-time. In order
to be able to perform the timing analysis of the application $a_i$, the maximum number of occurrences
$\omega(\hat{f}_i^k)$ of each of its supermessages $\hat{f}_i^k \in \hat{\mathcal{F}}(a_i)$, $k \in \{cw1, cw2, cc1, cc2\}$, and the maximum number
of occurrences $\omega(f^P_{i,j})$ of each of its inter-proxy messages $f^P_{i,j} \in \mathcal{F}^P(a_i)$ within its minimum
inter-arrival period have to be computed. $\hat{\mathcal{F}}(a_i)$ and $\mathcal{F}^P(a_i)$ denote the sets of supermessages and
inter-proxy messages of $a_i$, respectively. Additionally, the maximum number of rerouting occurrences
$r(d_i^j) \in \mathcal{R}$, caused by each dispatcher $d_i^j \in \mathcal{D}(a_i)$ on its router $\rho(d_i^j)$ within the same interval, has
to be computed.
3.5.2 Maximum Number of Message Occurrences
As is evident from the previous section, the maximum number of occurrences of both superme-
ssages and reroutings of each application depends on (i) the employed agreement protocol and
(ii) the amount of its inter-application traffic. In Section 3.3, three agreement protocols have been
proposed (Master-Slave, List and Hybrid). Since the first one suffers from race conditions, in this
section only the second and third will be considered.
3.5.2.1 List Protocol
Before going into details on how to analytically express the List protocol as a function of super-
messages and reroutings, two constructs which will aid in that cause are introduced.
Definition 7 (Simple/Complex message). An intra-application message is called a simple message,
$f^s$, if it can be expressed with a single supermessage. Otherwise, it is called a complex
message, $f^c$.
Figure 3.10: Intra-application messages. (a) Simple messages. (b) Complex messages.
An illustrative example of Definition 7 is given in Figure 3.10.
Theorem 9. The List protocol (including the context transfer) of an application ai with |D(ai)|
dispatchers can be expressed as a linear combination of supermessages, where clockwise super-
messages can exist at most |D(ai)|+1 times, and counter-clockwise can exist at most twice.
Proof. Proven directly. As neighbouring dispatchers always share the same edge, all messages
exchanged by neighbouring dispatchers are simple messages. Thus, they can be expressed by
only one clockwise supermessage and do not involve reroutings. Unless the exact position of
neighbouring dispatchers on the application shape is known, it is not possible to deduce which
of the clockwise supermessages will be used. Therefore, each of the clockwise supermessages might
appear in all $|\mathcal{D}(a_i)| - 1$ messages, starting from the master until reaching the last slave. The
message from the last slave to the master and the following context transfer are not necessarily
exchanged by neighbouring dispatchers. Therefore, in order to cover the worst-case, both have
to be considered as complex messages. Moreover, these two messages may be of an arbitrary
orientation, thus any supermessage may appear once in each of them (Theorem 7).
Equations 3.27-3.29 express the maximum number of occurrences of each supermessage dur-
ing one protocol execution. Since contexts and protocol messages may have different sizes and
hence different traversal delays, supermessage occurrences related to protocols and contexts have
to be counted separately.
$$\omega_P(\hat{f}_i^{cw1}) = \omega_P(\hat{f}_i^{cw2}) = |\mathcal{D}(a_i)| \tag{3.27}$$
$$\omega_P(\hat{f}_i^{cc1}) = \omega_P(\hat{f}_i^{cc2}) = 1 \tag{3.28}$$
$$\omega_C(\hat{f}_i^{cw1}) = \omega_C(\hat{f}_i^{cw2}) = \omega_C(\hat{f}_i^{cc1}) = \omega_C(\hat{f}_i^{cc2}) = 1 \tag{3.29}$$
Now, the maximum number of reroutings as a consequence of one protocol execution can be
computed by employing Theorem 10.
Theorem 10. The List protocol of an application ai with |D(ai)| dispatchers can involve at most
2 reroutings.
Proof. Proven directly. Until reaching the last slave, all messages are simple messages, which can be
expressed with one supermessage and hence do not involve reroutings (Theorem 7). A message
from the last slave to the master and a subsequent context transfer can be complex messages, and
each can contribute one rerouting (Theorem 7). The maximum number of reroutings is therefore 2.
From Theorem 10 it straightforwardly follows that the maximum rerouting delay of the List
protocol can be computed by solving Equation 3.30. Recall that δR denotes the latency of one
rerouting operation.
$$C^{\eta}_{RP}(a_i) = 2 \cdot \delta_R \tag{3.30}$$
Note that protocol-related reroutings can be performed only by $\rho(d_i^j)$, such that $d_i^j$ is located
in the corner of the application shape. The maximum number of protocol-related rerouting
operations that $\rho(d_i^j)$ may perform, due to $d_i^j$, is denoted by $r_P(d_i^j)$ (Equation 3.31). Also note
that a 4-dispatcher application does not involve reroutings, as all corner dispatchers are mutually
reachable via supermessages. In such cases, $C^{\eta}_{RP}(a_i) = 0 \wedge r_P(d_i^j) = 0, \forall d_i^j \in \mathcal{D}(a_i)$. The same
is true for applications with line-like shapes.
$$r_P(d_i^j) = 2 \tag{3.31}$$
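Note that the supermessage $\hat{f}_i^k$ may have two different sizes, one for the protocol message,
$\sigma_{prt}$, and one for the context transfer, $\sigma_{ctx}$. Therefore, the traversal delay of the supermessage
can be computed either with Equation 3.32 (applicable to protocol messages) or Equation 3.33
(applicable to contexts).
$$C_P(\hat{f}_i^k) = |\mathcal{L}(\hat{f}_i^k)| \cdot \delta_L + (|\mathcal{L}(\hat{f}_i^k)| - 1) \cdot \delta_\rho + \left\lceil \frac{\sigma_{prt}}{\sigma_{flit}} \right\rceil \cdot \delta_L \tag{3.32}$$
$$C_C(\hat{f}_i^k) = |\mathcal{L}(\hat{f}_i^k)| \cdot \delta_L + (|\mathcal{L}(\hat{f}_i^k)| - 1) \cdot \delta_\rho + \left\lceil \frac{\sigma_{ctx}}{\sigma_{flit}} \right\rceil \cdot \delta_L \tag{3.33}$$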
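As a worked illustration of Equations 3.32-3.33, the following Python sketch evaluates the traversal delay in cycles. The function name `supermessage_delay` and the example path length of 5 links are assumptions of this sketch; the default parameter values are those listed in Table 3.4 (link traversal delay of 1 cycle, routing delay of 3 cycles, flit size of 16 bytes).

```python
def supermessage_delay(hops, size_bytes, delta_l=1, delta_rho=3, flit_bytes=16):
    """Traversal delay, in cycles, of a supermessage that crosses `hops` links.

    Follows the structure of Equations 3.32-3.33: link traversals, routing
    at the intermediate routers, and one link delay per payload flit.
    """
    flits = -(-size_bytes // flit_bytes)  # integer ceiling of size / flit size
    return hops * delta_l + (hops - 1) * delta_rho + flits * delta_l


# Hypothetical supermessage spanning 5 links.
c_p = supermessage_delay(5, 16)    # 16-byte protocol message
c_c = supermessage_delay(5, 4096)  # 4-Kbyte execution context
```

For this example the protocol message needs 5 + 12 + 1 = 18 cycles and the context transfer 5 + 12 + 256 = 273 cycles, which shows why protocol and context occurrences are counted separately.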
Now the network delay of the List protocol, termed $C^{\eta}_{\#P}(a_i)$ (Equation 3.34), can be derived.
Until reaching the last slave, all requests are simple messages, thus are expressed by only one
clockwise supermessage (the first term in Equation 3.34). The response from the last slave and the
context transfer are complex messages of arbitrary orientation, and by Theorem 7 are expressed
as two same-orientation supermessages.
$$C^{\eta}_{\#P}(a_i) = \overbrace{(|\mathcal{D}(a_i)| - 1) \cdot C_P(\hat{f}_i^k)}^{\text{until the last slave}} + \overbrace{2 \cdot C_P(\hat{f}_i^k)}^{\text{last slave to master}} + \overbrace{2 \cdot C_C(\hat{f}_i^k)}^{\text{context transfer}} \tag{3.34}$$
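The List protocol bounds of Equations 3.30 and 3.34 can be evaluated as in the following Python sketch. The function names are hypothetical, and the input values (6 dispatchers, 18 and 273 cycles for the two supermessage delays) are illustrative assumptions rather than figures from the evaluation.

```python
def list_protocol_network_delay(n_dispatchers, c_p, c_c):
    """Network delay of one List protocol execution (structure of Eq. 3.34).

    c_p: traversal delay of a supermessage carrying a protocol message.
    c_c: traversal delay of a supermessage carrying an execution context.
    """
    until_last_slave = (n_dispatchers - 1) * c_p  # simple messages
    last_slave_to_master = 2 * c_p                # complex message
    context_transfer = 2 * c_c                    # complex message
    return until_last_slave + last_slave_to_master + context_transfer


def list_protocol_rerouting_delay(delta_r):
    """Maximum rerouting delay of the List protocol (structure of Eq. 3.30)."""
    return 2 * delta_r


network = list_protocol_network_delay(6, 18, 273)  # cycles
rerouting = list_protocol_rerouting_delay(10000)   # cycles, delta_R from Table 3.4
```

With these inputs the rerouting term (20000 cycles) dominates the network term (672 cycles), which reflects the large rerouting latency assumed in Table 3.4.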
3.5.2.2 Hybrid Protocol
Theorem 11. The Hybrid protocol (including the context transfer) of an application ai with
|D(ai)| dispatchers can be expressed as a linear combination of supermessages, where each su-
permessage can exist at most 3 · |D(ai)| times.
Proof. In the Hybrid protocol, messages are not necessarily exchanged by neighbouring dispatch-
ers. Therefore, in order to analyse the worst-case, all messages have to be considered as complex
messages of arbitrary orientation. By Theorem 7, each message can involve 2 supermessages of
the same orientation. Thus, each supermessage can exist once within each protocol message. The
first phase involves requests sent to all |D(ai)|− 1 slaves and their |D(ai)|− 1 replies, rendering
at most 2 · (|D(ai)|−1) occurrences of each supermessage. During the second phase |D(ai)|+1
messages are exchanged until reaching the master, resulting in additional |D(ai)|+1 appearances
of each supermessage. Finally, the context transfer yields one additional occurrence of each su-
permessage.
The maximum number of occurrences of supermessages during one protocol execution and
during one context transfer are given in Equations 3.35-3.36.
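$$\omega_P(\hat{f}_i^{cw1}) = \omega_P(\hat{f}_i^{cw2}) = \omega_P(\hat{f}_i^{cc1}) = \omega_P(\hat{f}_i^{cc2}) = \overbrace{2 \cdot (|\mathcal{D}(a_i)| - 1)}^{\text{first phase}} + \overbrace{|\mathcal{D}(a_i)| + 1}^{\text{second phase}} = 3 \cdot |\mathcal{D}(a_i)| - 1 \tag{3.35}$$
$$\omega_C(\hat{f}_i^{cw1}) = \omega_C(\hat{f}_i^{cw2}) = \omega_C(\hat{f}_i^{cc1}) = \omega_C(\hat{f}_i^{cc2}) = 1 \tag{3.36}$$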
Now, the maximum number of reroutings as a consequence of one protocol execution can be
computed by employing Theorem 12.
Theorem 12. The Hybrid protocol of an application ai with |D(ai)| dispatchers can involve at
most 3 · |D(ai)| reroutings.
Proof. Proven directly. Each message is treated as a complex message and, by Theorem 7, can
cause at most one rerouting. Thus, the maximum number of reroutings is equal to the maximum
number of messages, hence in total 3 · |D(ai)| reroutings might occur.
From Theorem 12 it straightforwardly follows that the maximum rerouting delay of the Hybrid
protocol can be computed by solving Equation 3.37.
$$C^{\eta}_{RP}(a_i) = 3 \cdot |\mathcal{D}(a_i)| \cdot \delta_R \tag{3.37}$$
A protocol-related rerouting can be performed by $\rho(d_i^j)$ only if $d_i^j$ is positioned in the corner
of the application shape. The maximum number of protocol-related rerouting operations that
$\rho(d_i^j)$ may perform, due to $d_i^j$, is denoted by $r_P(d_i^j)$ (Equation 3.38). As in the List protocol,
4-dispatcher applications and applications with line-like shapes involve no reroutings:
$C^{\eta}_{RP}(a_i) = 0 \wedge r_P(d_i^j) = 0, \forall d_i^j \in \mathcal{D}(a_i)$.
$$r_P(d_i^j) = 3 \cdot |\mathcal{D}(a_i)| \tag{3.38}$$
Now the network delay of the Hybrid protocol, $C^{\eta}_{\#P}(a_i)$, can be computed by solving
Equation 3.39. Every message has to be treated as a complex message, thus is represented as a sum
of two same-orientation supermessages. As in the List protocol, protocol messages and context
transfers are treated separately, because $C_P(\hat{f}_i^k) \neq C_C(\hat{f}_i^k)$.
$$C^{\eta}_{\#P}(a_i) = \overbrace{(3 \cdot |\mathcal{D}(a_i)| - 1) \cdot 2 \cdot C_P(\hat{f}_i^k)}^{\text{first and second phase}} + \overbrace{2 \cdot C_C(\hat{f}_i^k)}^{\text{context transfer}} \tag{3.39}$$
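The Hybrid protocol bounds (Equations 3.35, 3.37 and 3.39) can be gathered in one small Python sketch. The function name `hybrid_protocol_bounds` and the example inputs are assumptions made for illustration only.

```python
def hybrid_protocol_bounds(n_dispatchers, c_p, c_c, delta_r):
    """Worst-case bounds for one Hybrid protocol execution.

    Returns the protocol-related occurrences of each supermessage
    (structure of Eq. 3.35), the network delay (Eq. 3.39) and the
    rerouting delay (Eq. 3.37).
    """
    first_phase = 2 * (n_dispatchers - 1)  # requests and replies
    second_phase = n_dispatchers + 1       # messages until reaching the master
    occurrences = first_phase + second_phase
    network_delay = occurrences * 2 * c_p + 2 * c_c
    rerouting_delay = 3 * n_dispatchers * delta_r
    return occurrences, network_delay, rerouting_delay


occurrences, network, rerouting = hybrid_protocol_bounds(6, 18, 273, 10000)
```

For 6 dispatchers this gives 17 protocol occurrences per supermessage, against at most 6 in the List protocol, so the Hybrid protocol's bounds grow roughly three times faster with the number of dispatchers.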
3.5.2.3 Inter-application Traffic
Theorem 13. Any inter-application message, which is, by applying Theorem 8, expressed as
a function of supermessages and the inter-proxy message, can yield at most one occurrence of
(i) each supermessage and (ii) the inter-proxy message.
Proof. Follows directly from Theorems 7-8.
The implications of Theorem 13 are that each sent and received inter-application message
triggers one occurrence of the inter-proxy message and each of the supermessages (Equation 3.40).
As different inter-application messages can have different sizes, their occurrences should not be
summed up, but instead should be counted separately.
$$\omega_I(\hat{f}_i^{cw1}) = \omega_I(\hat{f}_i^{cw2}) = \omega_I(\hat{f}_i^{cc1}) = \omega_I(\hat{f}_i^{cc2}) = \omega_I(f^P_{i,j}) = 1 \tag{3.40}$$
The maximum number of reroutings induced by one inter-application message can be com-
puted by applying Theorem 14.
Theorem 14. Any inter-application message can involve at most 2 reroutings on the sender’s side,
of which at most one can appear on any router.
Proof. Proven directly. Consider an inter-application message from a master sender to its proxy
dispatcher $d_i^j$. By Theorem 7, it involves one rerouting at the router of the intermediate corner
dispatcher $d_i^k$, which is not $d_i^j$ since it is the destination of that message. When the router of
$d_i^j$ is reached, one additional rerouting occurs before the inter-proxy message is forwarded to
the receiver proxy $d_j^m$. Perceived from the sender's perspective, at most 2 reroutings may occur,
one performed on $\rho(d_i^j)$, and the other on the router $\rho(d_i^k)$ of any other corner-placed dispatcher
$d_i^k$.
Equation 3.41 presents the maximum number of reroutings occurring at the router of the same
dispatcher $d_i^j$, which is either a corner dispatcher or a proxy. Recall that $\mathcal{F}^{\eta}_S(a_i)$ and $\mathcal{F}^{\eta}_R(a_i)$ denote
the sets of sent and received inter-application messages of the application $a_i$, respectively.
$$r_I(d_i^j) = |\mathcal{F}^{\eta}_S(a_i)| + |\mathcal{F}^{\eta}_R(a_i)| \tag{3.41}$$
The proof for the receiver’s side is very similar and is therefore omitted.
Now, the maximum rerouting delay, induced by all sent inter-application traffic of an applica-
tion ai, is given in Equation 3.42, while Equation 3.43 holds for all received traffic.
$$C^{\eta}_{RS}(a_i) = 2 \cdot |\mathcal{F}^{\eta}_S(a_i)| \cdot \delta_R \tag{3.42}$$
$$C^{\eta}_{RR}(a_i) = 2 \cdot |\mathcal{F}^{\eta}_R(a_i)| \cdot \delta_R \tag{3.43}$$
Now consider the traversal delay of inter-application traffic. During the analysis, an
inter-application message from the application $a_i$ to the application $a_j$ will be decomposed into 2 disjoint
segments: the responsibility of $a_i$ is to deliver the message to the proxy of $a_j$, while delivering it
to its master is the responsibility of $a_j$. As already stated, the traversal latencies of supermessages
related to inter-application messages have to be computed separately, as their sizes $\sigma_{iam}$ may be
different from those of protocol messages and context transfers, i.e. $\sigma_{iam} \neq \sigma_{prt} \neq \sigma_{ctx}$. The
traversal delay of the supermessage involved in the inter-application communication is computed
with Equation 3.44.
$$C_I(\hat{f}_i^k) = |\mathcal{L}(\hat{f}_i^k)| \cdot \delta_L + (|\mathcal{L}(\hat{f}_i^k)| - 1) \cdot \delta_\rho + \left\lceil \frac{\sigma_{iam}}{\sigma_{flit}} \right\rceil \cdot \delta_L \tag{3.44}$$
Now, the traversal delay of all outgoing inter-application traffic of the application $a_i$ is
described with Equation 3.45. According to Theorem 7, each message between the master sender
and its proxy may involve two supermessages.
$$C^{\eta}_{\#S}(a_i) = \sum_{\forall f_{i,j} \in \mathcal{F}^{\eta}_S(a_i)} \Big( \overbrace{2 \cdot C_I(\hat{f}_i^k)}^{\text{master to proxy}} + \overbrace{C_I(f^P_{i,j})}^{\text{inter proxy}} \Big) \tag{3.45}$$
Equation 3.46 describes the traversal delay of all incoming inter-application traffic of the
application $a_i$.
$$C^{\eta}_{\#R}(a_i) = \sum_{\forall f_{j,i} \in \mathcal{F}^{\eta}_R(a_i)} \overbrace{2 \cdot C_I(\hat{f}_i^k)}^{\text{proxy to master}} \tag{3.46}$$
3.5.2.4 Summary
In summary, between any two consecutive job releases of $a_i$, its supermessage $\hat{f}_i^k$ can appear
$\omega_P(\hat{f}_i^k)$ times during the protocol execution, $\omega_C(\hat{f}_i^k)$ times during the context transfer and $\omega_I(\hat{f}_i^k)$
times for every sent and received inter-application message. Furthermore, each proxy-to-proxy
message $f^P_{i,j}$ may appear $\omega_I(f^P_{i,j})$ times due to inter-application traffic. Additionally, the maximum
number of reroutings that occur on the router $\rho(d_i^j)$ of each dispatcher $d_i^j$ is equal to the sum of
reroutings during the protocol execution and during the inter-application communication:
$r(d_i^j) = r_P(d_i^j) + r_I(d_i^j)$.
Moreover, the network delay of ai when performing all communication (the agreement pro-
tocol and the inter-application traffic) can be computed by solving Equation 3.47, where the in-
dividual terms present the network delay of protocol-related messages, outgoing inter-application
traffic and incoming inter-application traffic, respectively.
$$C^{\eta}_{\#}(a_i) = C^{\eta}_{\#P}(a_i) + C^{\eta}_{\#S}(a_i) + C^{\eta}_{\#R}(a_i) \tag{3.47}$$
Finally, the rerouting delay of ai when performing all communication (the agreement protocol
and the inter-application traffic), can be computed by solving Equation 3.48, where the individual
terms present the rerouting delay of protocol-related messages, outgoing inter-application traffic
and incoming inter-application traffic, respectively.
$$C^{\eta}_{R}(a_i) = C^{\eta}_{RP}(a_i) + C^{\eta}_{RS}(a_i) + C^{\eta}_{RR}(a_i) \tag{3.48}$$
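The composition of Equations 3.42-3.48 can be sketched in Python as follows. The function name `total_delays`, the representation of messages as plain delay values and all numeric inputs are assumptions of this illustration; each sent message is given as a pair (supermessage delay, inter-proxy delay) and each received message as a supermessage delay, all in cycles.

```python
def total_delays(c_net_protocol, c_rr_protocol, sent, received, delta_r):
    """Combine protocol and inter-application terms.

    sent:     list of (c_i_super, c_i_proxy) pairs, one per sent message.
    received: list of c_i_super values, one per received message.
    Returns (network delay, rerouting delay), following the structure of
    Equations 3.47 and 3.48.
    """
    c_net_sent = sum(2 * c_super + c_proxy for c_super, c_proxy in sent)  # Eq. 3.45
    c_net_recv = sum(2 * c_super for c_super in received)                 # Eq. 3.46
    c_rr_sent = 2 * len(sent) * delta_r                                   # Eq. 3.42
    c_rr_recv = 2 * len(received) * delta_r                               # Eq. 3.43
    network = c_net_protocol + c_net_sent + c_net_recv
    rerouting = c_rr_protocol + c_rr_sent + c_rr_recv
    return network, rerouting


# One sent and two received inter-application messages (illustrative values).
network, rerouting = total_delays(672, 20000, [(273, 300)], [81, 145], 10000)
```

Keeping one entry per message, rather than a single count, matches the remark that inter-application messages may differ in size and therefore cannot simply be multiplied by a common delay.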
3.5.3 Performing the Analysis
Recall, the worst-case communication delay of an application $a_i$, termed $R^{\eta}(a_i)$, computed by the
existing path-abstracting method (Equation 3.14 for the List protocol and Equation 3.20 for the
Hybrid protocol), consists of several components, which can be divided into three groups:
1. Communication delay in isolation, $C^{\eta}(a_i)$ (the first term in both Equation 3.14 and
Equation 3.20).
2. On-core interference (the sum of the second, third and fourth term in Equation 3.14, and
the sum of the second, third, fourth and fifth term in Equation 3.20). Let $I^{\eta}(a_i)$ denote the
aforementioned sums.
3. Network interference, $I^{\eta}_{\#}(a_i, R^{\eta}(a_i))$ (the last term in both Equation 3.14 and
Equation 3.20).
Thus, Equation 3.14 and Equation 3.20 can be rewritten as Equation 3.49.
$$R^{\eta}(a_i) = C^{\eta}(a_i) + I^{\eta}(a_i) + I^{\eta}_{\#}(a_i, R^{\eta}(a_i)) \tag{3.49}$$
As already mentioned, the approach presented in this section relies on enforcing placement
constraints and reroutings, in order to make the traffic more predictable and deterministic.
Therefore, assuming the new approach, the worst-case communication delay consists of two more
components, specifically the rerouting delay in isolation, termed $C^{\eta}_{R}(a_i)$, and the rerouting interference,
termed $I^{\eta}_{R}(a_i, R^{\eta}(a_i))$.
Now, the worst-case communication delay can be obtained by solving Equation 3.50.
$$R^{\eta}(a_i) = C^{\eta}(a_i) + C^{\eta}_{R}(a_i) + I^{\eta}(a_i) + I^{\eta}_{\#}(a_i, R^{\eta}(a_i)) + I^{\eta}_{R}(a_i, R^{\eta}(a_i)) \tag{3.50}$$
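Since $R^{\eta}(a_i)$ appears on both sides of Equation 3.50, one customary way of solving such a recurrence in response-time analysis is fixed-point iteration. The solver is not prescribed in this section, so the following Python sketch is an assumption: the function name `solve_worst_case_delay`, the deadline check and the iteration cap are all hypothetical, and the two interference terms are passed in as non-decreasing functions of the window length.

```python
def solve_worst_case_delay(c_iso, c_rr, i_core, i_net, i_rr, deadline,
                           max_iter=1000):
    """Iterate R = C + C_R + I + I_net(R) + I_rr(R) until it stabilises.

    i_net, i_rr: callables mapping a window length to an interference bound;
                 they are assumed to be non-decreasing.
    Returns the fixed point, or None if the deadline is exceeded or no fixed
    point is found within max_iter iterations.
    """
    base = c_iso + c_rr + i_core
    r = base
    for _ in range(max_iter):
        r_next = base + i_net(r) + i_rr(r)
        if r_next > deadline:
            return None  # treated here as unschedulable
        if r_next == r:
            return r
        r = r_next
    return None


# Toy interference: 10 cycles per release, with period 100 and one carry-in release.
i_net = lambda t: 10 * (1 + -(-t // 100))
i_rr = lambda t: 0
r_ok = solve_worst_case_delay(30, 10, 10, i_net, i_rr, deadline=200)
r_miss = solve_worst_case_delay(30, 10, 10, i_net, i_rr, deadline=60)
```

In the toy example the iteration starts at 50, moves to 70 and stays there, so 70 is returned when the deadline allows it.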
Note that $C^{\eta}(a_i)$ remains unaffected (Equation 3.13 for the List protocol, and Equation 3.19
for the Hybrid protocol), although its subcomponent corresponding to the message transfer delay
in isolation, $C^{\eta}_{\#}(a_i)$, is no longer computed with Equations 3.25-3.26, but with Equation 3.47.
This is expected, because in the previous approach the messages were free point-to-point
communication between dispatchers, while in the novel approach the message transfer has to
comply with Definition 4.
The rerouting delay in isolation $C^{\eta}_{R}(a_i)$ can be computed by solving Equation 3.48 (explained
in the previous section).
The on-core interference component $I^{\eta}(a_i)$ remains unaffected, as the novel approach only
affects the messages, not the communication-related OS operations that dispatchers perform.
The network interference component $I^{\eta}_{\#}(a_i, R^{\eta}(a_i))$ can now be computed in a different,
significantly less pessimistic way than Equation 3.7, due to the fact that the analysis will be performed
on supermessages and proxy-to-proxy messages, which are deterministic and master-independent.
Note that obtaining a significantly less pessimistic value of this component was in fact the main
incentive for these restrictions.
Finally, the rerouting interference component $I^{\eta}_{R}(a_i, R^{\eta}(a_i))$ is yet to be computed.
First, the focus will be on obtaining the network interference component. The application
under analysis $a_i$ can suffer the network interference from higher-priority supermessages and
inter-proxy messages. Let $\mathcal{F}^{\eta}_D(a_i)$ be a set of all higher-priority messages that can cause direct
interference to $a_i$. A message belongs to the set $\mathcal{F}^{\eta}_D(a_i)$ if it is directly contending with any message of $a_i$,
where a direct contention between two messages has been introduced with Definition 1. Formally,
$\mathcal{F}^{\eta}_D(a_i)$ is defined as follows:
$$\forall \hat{f}_q^w \in \hat{\mathcal{F}}, \exists \left( \hat{f}_i^j \in \hat{\mathcal{F}}(a_i) \vee f^P_{i,k} \in \mathcal{F}^P(a_i) \right) \mid \hat{f}_q^w \in \left\{ \mathcal{F}_D(\hat{f}_i^j) \cup \mathcal{F}_D(f^P_{i,k}) \right\} \Rightarrow \hat{f}_q^w \in \mathcal{F}^{\eta}_D(a_i)$$
$$\wedge$$
$$\forall f^P_{m,n} \in \mathcal{F}^P, \exists \left( \hat{f}_i^j \in \hat{\mathcal{F}}(a_i) \vee f^P_{i,k} \in \mathcal{F}^P(a_i) \right) \mid f^P_{m,n} \in \left\{ \mathcal{F}_D(\hat{f}_i^j) \cup \mathcal{F}_D(f^P_{i,k}) \right\} \Rightarrow f^P_{m,n} \in \mathcal{F}^{\eta}_D(a_i)$$
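Now, the network interference that $a_i$ might suffer from all messages from $\mathcal{F}^{\eta}_D(a_i)$ within the
time interval $t$ can be computed with Equation 3.51.
$$I^{\eta}_{\#}(a_i, t) = \sum_{\forall f^P_{m,n} \in \mathcal{F}^{\eta}_D(a_i)} \overbrace{\omega_I(f^P_{m,n}) \cdot C_I(f^P_{m,n})}^{\text{inter-application traffic}} \cdot \overbrace{\left( 1 + \left\lceil \frac{t - D^{\tau+\mu}(a_m)}{T(a_m)} \right\rceil \right)}^{\text{number of inter-arrivals}}$$
$$+ \sum_{\forall \hat{f}_q^w \in \mathcal{F}^{\eta}_D(a_i)} \Bigg( \overbrace{\omega_P(\hat{f}_q^w) \cdot C_P(\hat{f}_q^w)}^{\text{protocol}} + \overbrace{\omega_C(\hat{f}_q^w) \cdot C_C(\hat{f}_q^w)}^{\text{context transfer}} + \overbrace{\sum_{\forall f_{i,j} \in \mathcal{F}^{\eta}_S(a_i), \forall f_{j,i} \in \mathcal{F}^{\eta}_R(a_i)} \omega_I(\hat{f}_q^w) \cdot C_I(\hat{f}_q^w)}^{\text{inter-application traffic}} \Bigg) \cdot \overbrace{\left( 1 + \left\lceil \frac{t - D^{\tau+\mu}(a_q)}{T(a_q)} \right\rceil \right)}^{\text{number of inter-arrivals}} \tag{3.51}$$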
Recall that inter-proxy messages participate only in the inter-application traffic, hence there is
only one interference-related term in the first row of Equation 3.51. Conversely, supermessages
participate in the protocol messages, in the context transfers and in the inter-application traffic, and
hence there are three interference-related terms in the second row of Equation 3.51. Also recall
that $\mathcal{F}^{\eta}_S(a_i)$ and $\mathcal{F}^{\eta}_R(a_i)$ denote sets of sent and received messages of the application $a_i$. Note
that inter-application messages may have different sizes, so their occurrences cannot be summed
up, but have to be computed independently. Hence the inner summation in the second row of
Equation 3.51.
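The contribution of a single interfering supermessage to Equation 3.51 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: all times are integers in cycles, the quantity subtracted from the window length is passed in as the parameter `d_tau_mu` (its exact definition is given earlier in the thesis and is not restated here), and the function names and numeric inputs are hypothetical.

```python
def inter_arrivals(t, d_tau_mu, period):
    """Number-of-inter-arrivals factor: 1 + ceil((t - d_tau_mu) / period)."""
    return 1 + -(-(t - d_tau_mu) // period)  # integer ceiling division


def supermessage_interference(t, d_tau_mu, period, w_p, c_p, w_c, c_c,
                              inter_app_delays):
    """Interference caused by one higher-priority supermessage within t.

    w_p, w_c: protocol and context occurrences of the supermessage.
    c_p, c_c: its traversal delays for protocol messages and contexts.
    inter_app_delays: one traversal delay per inter-application message,
        kept separate because message sizes may differ (one occurrence each).
    """
    per_release = w_p * c_p + w_c * c_c + sum(inter_app_delays)
    return per_release * inter_arrivals(t, d_tau_mu, period)


releases = inter_arrivals(70, 30, 100)
interference = supermessage_interference(70, 30, 100, 6, 18, 1, 273, [81])
```

Here the window covers two releases of the interfering application, each contributing 108 + 273 + 81 = 462 cycles.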
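The last outstanding term is the rerouting interference, and it can be computed by solving
Equation 3.52.
$$I^{\eta}_{R}(a_i, t) = \sum_{\forall d_i^j \in \mathcal{D}(a_i)} \; \sum_{\forall d_k^m \in \mathcal{D}_{pi}(d_i^j) \,\mid\, d_k^m \neq d_i^j} r(d_k^m) \cdot \overbrace{\left( 1 + \left\lceil \frac{t - D^{\tau+\mu}(a_k)}{T(a_k)} \right\rceil \right)}^{\text{number of inter-arrivals}} \cdot \delta_R \tag{3.52}$$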
3.5.4 Discussion
The previous, path-abstracting method for the worst-case communication delay analysis, presented
in Sections 3.3-3.4, does not involve reroutings, and hence does not have the rerouting-related
components $C^{\eta}_{R}(a_i)$ and $I^{\eta}_{R}(a_i, t)$ (compare Equations 3.49-3.50). However, the limitation of that approach
is that it treats the path non-determinism in a pessimistic way, and computes the interference on the
application level. Conversely, the method presented in this section employs dispatcher placement
and rerouting constraints, in order to express the traffic as a function of supermessages, inter-proxy
messages and reroutings. This strategy imposes longer message distances and employs a rerouting
mechanism, which both additionally contribute to the worst-case communication delay. However,
this approach makes message paths deterministic and known at design-time. Consequently, the
fine-grained analysis can be performed on the message-level, and the interference component can
be computed with much less pessimism, which is a fundamental requirement for a tight
worst-case communication delay analysis. This claim is also backed up by recent insights into
priority-preemptive wormhole-switched NoCs [74], which suggest that the most dominant factor
in the worst-case analysis is indeed the interference component.
3.5.5 Experimental Evaluation
This section describes the evaluation of the newly proposed approach. Specifically, the objective
is to find answers to the following questions:
• How does the novel method for the worst-case analysis compare against the previous method,
presented in Sections 3.3-3.4? See Experiment-Set 1.
• What is the penalty for enforcing the traffic determinism via reroutings and supermessages,
how does it affect the performance, and is it justifiable? See Experiment-Set 2.
• How pessimistic is the novel method, i.e. how do the analytically obtained upper-bound
estimates compare against the corresponding values observed during simulations? See
Experiment-Set 3.
3.5.5.1 Evaluation Metrics, Analysis and Simulation Parameters
In Experiment-Set 1, the novel method for the worst-case analysis is compared against the existing
method, presented in Sections 3.3-3.4. Specifically, for each application $a_i$, the estimates on the
worst-case communication delay are obtained with the novel method, $R^{\eta}_{new}(a_i)$, and the existing
method, $R^{\eta}_{old}(a_i)$. Then, the improvements of the novel approach are computed with the following
metric: $imp = \frac{R^{\eta}_{old}(a_i) - R^{\eta}_{new}(a_i)}{R^{\eta}_{old}(a_i)}$. In cases where the new method underperforms, the penalty has a
negative value, and it is expressed with the following metric: $pen = \frac{R^{\eta}_{old}(a_i) - R^{\eta}_{new}(a_i)}{R^{\eta}_{new}(a_i)}$.
In Experiment-Set 2, the workload execution is simulated. The simulations are performed
on the extended version of the SPARTS [73] simulator. Two different scenarios are simulated:
(i) the communication is free point-to-point, which corresponds to the approach presented in
Sections 3.3-3.4, and (ii) the reroutings are employed, in order to comply with Definition 4. For each
application $a_i$, two worst-case delays are captured, one for the first simulated scenario (the
existing approach), $R^{\eta}_{old*}(a_i)$, and one for the second simulated scenario (the novel approach), $R^{\eta}_{new*}(a_i)$.
Similarly to Experiment-Set 1, the improvements and the penalty are expressed with the
following metrics: $imp^* = \frac{R^{\eta}_{old*}(a_i) - R^{\eta}_{new*}(a_i)}{R^{\eta}_{old*}(a_i)}$, and $pen^* = \frac{R^{\eta}_{old*}(a_i) - R^{\eta}_{new*}(a_i)}{R^{\eta}_{new*}(a_i)}$.
In Experiment-Set 3, assuming the novel approach, the analytic estimates that were obtained
with Experiment-Set 1 are compared against the corresponding worst-case delays observed during
simulations in Experiment-Set 2. Then, the following metric is used to express the tightness of the
new method: $tgh = \frac{R^{\eta}_{new*}(a_i)}{R^{\eta}_{new}(a_i)}$.
The analysis and simulation parameters are summarised in Table 3.4. An asterisk sign denotes
a randomly generated value, assuming a uniform distribution.
Table 3.4: Analysis and simulation parameters for Section 3.5.5
Parameter                                                        Value
NoC topology and size                                            2-D mesh with 10×10 routers
Link width = flit size $\sigma_{flit}$                           16 bytes
Router frequency $\nu_\rho$                                      2 GHz
Routing delay $\delta_\rho$                                      3 cycles (1.5 ns)
Link traversal delay $\delta_L$                                  1 cycle (0.5 ns)
Rerouting delay $\delta_R$                                       10000 cycles (5 µs)
OS operations $\delta_P^{\leftarrow}$, $\delta_P^{\rightarrow}$, $\delta_C^{\leftarrow}$, $\delta_C^{\rightarrow}$, $\delta_I^{\leftarrow}$, $\delta_I^{\rightarrow}$, $\delta_Q$, $\delta_E$    10000 cycles (5 µs)
Application periods $D(a_i) = T(a_i), \forall a_i \in \mathcal{A}$          $[100-1000]^*$ ms
Communication deadlines $D^{\eta}(a_i), \forall a_i \in \mathcal{A}$        $0.25 \cdot D(a_i)$
Protocol message size $\sigma_{prt}$                             16 bytes
Execution context size $\sigma_{ctx}$                            $[1-128]^*$ Kbytes
Inter-application message size $\sigma_{iam}$                    $[1-128]^*$ Kbytes
Application-set size $|\mathcal{A}|$                             200 applications
Maximum concurrent on-core masters $\hat{M}$                     10
Probability of inter-app. comm.                                  5%
Simulated time                                                   100 s
3.5.5.2 Experiment-Set 1: Analyses Comparison
Experiment 1a: Overall Improvements
In this experiment an overall analytic comparison of the novel method and the existing one
is conducted. Specifically, each application has [2− 10]∗ dispatchers, and is randomly mapped
on the grid with an arbitrary (rectangular or line-like) shape, assuming dispatcher placement con-
straints (Definition 3). Half of the applications execute the List protocol, and the other half Hy-
brid. Then, the analytic upper-bound on Rη(ai) is obtained for each application ai, with both
approaches. Finally, the obtained values are compared, where the improvements of the novel ap-
proach are computed with the metric described in Section 3.5.5.1. The process is repeated for
1000 application-sets.
Figure 3.11(a) shows the improvements of the new approach over the existing one. In only
9.63% of cases the novel method rendered worse results. This is further investigated in Ex-
periment 1c. In the remaining 90.37% of scenarios the novel approach reports improvements.
Specifically, in more than half of cases the improvements are greater than 50%, which means that
a derived upper-bound is at most half the one obtained with the existing approach. Finally, in
5.46% of scenarios the improvements are above 90%, corresponding to an estimate that is at most
one-tenth of the value against which it is compared! That is, $R^{\eta}_{new}(a_i) \le \frac{1}{10} \cdot R^{\eta}_{old}(a_i)$.
Experiment 1b: Improvements w.r.t. Number of Dispatchers
In order to test the scalability of the novel approach, and in order to see how the improvements
change with the number of dispatchers, the number of dispatchers constituting each application
is varied within the following range: $|\mathcal{D}(a_i)| = [2-x]^*, \forall a_i \in \mathcal{A}$. The parameter $x$ is varied in
the range $x \in \{4, 6, 8, 10, 12, 14\}$ (see the legend of Figure 3.11(b)). Assuming a certain range
(e.g. $|\mathcal{D}(a_i)| = [2-4]^*$), all applications are generated and randomly mapped. Again, half of the
applications execute each of the protocols. Then, for each application $a_i$, the analytic $R^{\eta}(a_i)$
upper-bound estimates are obtained with (i) the existing method and (ii) the novel method. Subsequently,
the obtained values are compared with the same metric as in the previous experiment. This is
repeated for 1000 application-sets, and for every range (Figure 3.11(b)).
Figure 3.11: Analysis improvements (1/2). (a) Overall. (b) W.r.t. number of dispatchers. Horizontal axis: improvement ranges (in %), with the categories Worse, Equal, 1−10, 11−20, ..., 91−100; vertical axis: amount of applications (in %); one series per range $2 \le |\mathcal{D}| \le x$, $x \in \{4, 6, 8, 10, 12, 14\}$.
As the number of dispatchers increases, the category with worse results and the categories with
smaller improvements (up to 60%) grow, while the categories with more significant improvements
(above 60%) shrink. The explanation is twofold. First, in the novel approach, more dispatchers
cause more reroutings. This in turn causes more significant rerouting interference, which has an
impact on the derived estimates. Conversely, in the existing approach reroutings do not occur.
Second, more dispatchers cause more messages, leading to more significant and complex message
interference scenarios. In such cases, the assumption that every higher-priority message existing
within the network will indeed cause interference might not be too pessimistic. Thus, as the
number of dispatchers increases, the existing method becomes less pessimistic and hence the
improvements achieved by the new method slowly decrease.
Experiment 1c: Improvements w.r.t. Priorities
In this experiment, the improvements of the novel method over the existing one are observed,
with an emphasis on application priorities. This experiment also helps to recognise and investigate
the cases where the new method underperforms. The values obtained in Experiment 1a are used,
but the comparison between the approaches is additionally performed per-priority, as depicted in
Figure 3.12(a). The metric is the same as in the previous experiments.
It is evident that the novel approach performs worse for applications with higher priorities
(smaller numbers on the x-axis of Figure 3.12(a)). As these applications do not suffer significant
interference, the additional delay in the new approach is caused by reroutings and longer message
distances (i.e. traversal constraints expressed by Definition 4). As priorities decrease, the
interference becomes a more dominant term in the derived upper-bounds, hence the penalty of the
novel approach slowly decays. At the priority level 15, both approaches provide similar results.
Any further decrease in the priority favours the new method, whose improvements exhibit
logarithmic growth. On the far right end of the domain, the average improvements of the novel
approach asymptotically converge towards 70%.
Figure 3.12: Analysis improvements (2/2). (a) W.r.t. priorities; x-axis: priority (1−5 to 195−200), y-axis: improvement (in %). (b) W.r.t. protocols; x-axis: improvement (in %), y-axis: amount of applications (in %); legend: List protocol, Hybrid protocol.
Experiment 1d: Improvements w.r.t. Protocols
In this experiment, the novel and the existing approach are compared with an emphasis on the
employed agreement protocols. The comparison is performed for 2 different scenarios: (i) all ap-
plications are utilising the List protocol, (ii) all applications are utilising the Hybrid protocol. For
each application ai, the Rη(ai) value is obtained with both approaches. The process is repeated for
1000 application-sets. The results are given in Figure 3.12(b), where the improvements achieved
by the novel approach are expressed with the same metric used in the previous experiments.
The conclusions are very similar to those for Experiment 1b. The Hybrid protocol involves
more messages, which in the novel approach induce more reroutings. Hence, larger improvements
are reported for the List protocol than for the Hybrid protocol. Additionally, as the improvement
metric is based on a ratio, similarly to a logarithmic scale, each additional percentage point of
improvement covers a larger part of the domain; e.g. for improvements of 49−50% the following holds:
$R^{\eta}_{old}(a_i) \in [1.96 \cdot R^{\eta}_{new}(a_i),\ 2 \cdot R^{\eta}_{new}(a_i)]$, while for improvements of 89−90% the following holds:
$R^{\eta}_{old}(a_i) \in [9.09 \cdot R^{\eta}_{new}(a_i),\ 10 \cdot R^{\eta}_{new}(a_i)]$. Consequently, improvement ranges around 90% cover
large parts of the domain, and cause local maxima (Figure 3.12(b)).
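The widening of the ranges can be checked numerically. The following is a minimal sketch, assuming that the improvement metric has the form (1 − Rnew/Rold) · 100; this form is consistent with the factors quoted above, but the authoritative definition is the one in Section 3.5.5.1. The function name is hypothetical.

```python
def old_to_new_factor(improvement_pct: float) -> float:
    """Return R_old / R_new for a given improvement percentage.

    Assumption (not taken verbatim from the text): the metric is
    improvement = (1 - R_new / R_old) * 100.
    """
    if not 0.0 <= improvement_pct < 100.0:
        raise ValueError("improvement must lie in [0, 100)")
    return 1.0 / (1.0 - improvement_pct / 100.0)


# Width of a one-percentage-point range, expressed as a factor of R_new
for low, high in [(49, 50), (89, 90)]:
    f_low, f_high = old_to_new_factor(low), old_to_new_factor(high)
    print(f"{low}-{high}%: R_old in [{f_low:.2f}, {f_high:.2f}] * R_new, "
          f"width {f_high - f_low:.2f}")
```

Under this assumption the 89−90% range is more than twenty times wider than the 49−50% range, which matches the explanation given for the local maxima.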
3.5.5.3 Experiment-Set 2: Performance Comparison
Experiment 2a: Overall Comparison
The purpose of this experiment is to perform an overall runtime performance comparison
of the novel and the existing approach. In order to do that, the application-sets generated for
Experiment 1a are used and the simulations are performed for two different scenarios: with and
without traversal constraints (Definition 4). Within each approach, Rη∗(ai) is measured for each
application ai of the application-set. Subsequently, the obtained values are compared with the
metric described in Section 3.5.5.1, and the results are presented in Figure 3.13(a).
Figure 3.13: Performance comparison (1/2). (a) Overall; x-axis: improvement (in %), y-axis: amount of applications (in %); legend: Worse, Equal, Better. (b) W.r.t. number of dispatchers; x-axis: number of dispatchers per application (2 to 10), y-axis: improvement (in %).
Since the novel approach causes longer message distances and employs the rerouting mech-
anism, an intuitive guess would be that it will systematically suffer a significant runtime perfor-
mance penalty, when compared with the existing method. However, a surprising conclusion is
reached: not only was the penalty negligible in almost all cases, but also for almost 40%
of the scenarios the novel approach outperformed the existing one! These unexpected findings
are interpreted in the following way. Unpredictable message paths may lead to corner cases
where traffic becomes heavily concentrated in certain links of the grid, resulting in significant
contention and (almost) unbounded interference delays. This reflects the importance of having
predictable message-paths, and further implies that the overall efficiency of the system heavily
depends on the application-mapping process, which will be covered in the next section.
Experiment 2b: Comparison w.r.t. Number of Dispatchers
In this experiment, the runtime performance of the two approaches is investigated, assuming
variable dispatcher numbers. The values obtained in the previous experiment are used, but before
performing the comparison, the applications are divided into different categories, based on the
number of dispatchers. Then, the comparison is performed for each category, independently. The
results are depicted in Figure 3.13(b).
It is visible that for applications with 2−4 dispatchers, where reroutings do not occur, in
almost half of the cases the novel approach dominates the existing one. However, for applications
with more dispatchers and rectangular shapes, reroutings are necessary, which is reflected by
a slight decrease of the average line in Figure 3.13(b) as the number of dispatchers increases.
Nonetheless, even for applications with 10 dispatchers, there are still numerous cases where the
new approach performs better. In cases where the novel approach underperforms, the performance
loss is far smaller than the gains achieved with the novel analysis method (see
Experiment-Set 1).
Figure 3.14: Performance comparison (2/2). (a) W.r.t. priorities; x-axis: priority (1−5 to 195−200), y-axis: improvement (in %). (b) List protocol; x-axis: improvement (in %), y-axis: amount of applications (in %); legend: Worse, Equal, Better.
Experiment 2c: Comparison w.r.t. Priorities
The objective of this experiment is to investigate how the runtime performance of the two
approaches changes with the application priorities. The results from Experiment 2a are used,
but before comparing the obtained Rη∗ values, applications are classified into different categories,
based on their priorities. Then, the comparison is performed for each category, independently.
Figure 3.14(a) illustrates the results.
Figure 3.14(a) shows that, in the majority of cases, irrespective of the priority, both
approaches display very similar performance. It is barely noticeable that, only for the highest
priorities, the novel approach underperforms slightly more than in other cases. This occurs because
in these scenarios the existing method is not pessimistic (no network interference), while the novel
method still pays the penalty of rerouting operations and longer message paths.
Experiment 2d: Comparison w.r.t. Protocols
In this experiment the runtime performance of the two approaches is observed, assuming different
agreement protocols. The application-sets generated for Experiment 1d are used, where, in
the first case, all applications executed the List protocol, and in the second case all applications
executed the Hybrid protocol. First, assuming the application-sets where all applications execute the
List protocol, the simulations are performed, and subsequently for each application ai the
$R^{\eta}_{old*}(a_i)$ and $R^{\eta}_{new*}(a_i)$ values are obtained. Then, the obtained values are compared. The results are
illustrated in Figure 3.14(b). Finally, the same process is repeated for the application-sets where all
applications execute the Hybrid protocol. The results for that case are illustrated in Figure 3.15.
The explanation is similar to that for Experiment 1d. The List protocol involves fewer messages
and fewer rerouting operations, which favours the novel approach. Indeed, from Figure 3.14(b)
it is visible that in more than half of the cases the new method outperforms the existing one.
Conversely, the Hybrid protocol involves more messages and more rerouting operations, both of which
have a negative impact on the worst-case delays obtained for the novel approach. Consequently,
the new approach performs better than the existing one in only 30% of the cases (see Figure 3.15).
Figure 3.15: Performance comparison for Hybrid protocol. x-axis: improvement (in %), y-axis: amount of applications (in %); legend: Worse, Equal, Better.
Figure 3.16: Analysis tightness across number of dispatchers. x-axis: number of dispatchers per application (2 to 10), y-axis: ratio between measured and computed WCCD (in %).
3.5.5.4 Experiment-Set 3: Analysis Tightness
Experiment 3a: Tightness w.r.t. Number of Dispatchers
In this experiment the objective is to observe the pessimism of the novel method for the
worst-case communication delay analysis. In order to do that, the results from Experiment 1a and
Experiment 2a are used, where for each application ai the values of $R^{\eta}_{new}(a_i)$ and $R^{\eta}_{new*}(a_i)$ were
computed. Before performing the comparison, the applications are classified into categories, based on
the number of dispatchers. Then, the comparison is performed for each category, independently.
The results are depicted in Figure 3.16, where the metric described in Section 3.5.5.1 is used.
It is visible that for the applications with fewer dispatchers, there exist cases where the ratio
between the observed and the analytically computed worst-case communication delay is very high.
Indeed, in the category of two-dispatcher applications, there exists an application with the ratio
around 85%, which suggests that the analysis is correct and renders very tight estimates. As
the number of dispatchers increases, the interference patterns become more complex, and the
scenarios considered by the worst-case analysis are less likely to be captured at runtime. This is
manifested as a slight decrease in the ratio for higher numbers of dispatchers. However, the
decrease is not significant, and there exist applications with 10 dispatchers for which the ratio
between $R^{\eta}_{new*}$ and $R^{\eta}_{new}$ is around 65%.
Experiment 3b: Tightness w.r.t. Priorities
The goal of this experiment is to observe how the ratio between $R^{\eta}_{new*}$ and $R^{\eta}_{new}$ changes with
application priorities. For that, the same values as in the previous experiment are used (the results
from Experiment 1a and Experiment 2a). However, this time the applications are divided into
categories based on their priorities. The comparison is performed for each category, independently.
Figure 3.17(a) illustrates the findings.
Figure 3.17: Analysis tightness. (a) Across priorities; x-axis: priority (1−5 to 195−200), y-axis: ratio between measured and computed WCCD (in %). (b) Across protocols; x-axis: ratio between measured and computed WCCD (in %), y-axis: amount of applications (in %); legend: List protocol, Hybrid protocol.
It comes as no surprise that the highest observed ratio from the previous experiment (around
85%) was in this experiment measured for the application with the highest priority (the smallest
numbers on the x-axis of Figure 3.17(a)). In other words, the ratio of 85% belongs to the two-
dispatcher application with the highest priority. This implies that the highest ratio occurs for
high-priority applications with few dispatchers. Furthermore, the results demonstrate an average
ratio of around 35% for applications with high priorities. As the priorities decrease, so does the
ratio, which asymptotically converges towards 2% for applications with very low priorities. The
explanation is as follows. As the priority decreases, the analysis considers more and more complex
interference (worst-case) scenarios, which are less and less likely to occur during simulations.
The experiments demonstrate that there is a small probability for lower-priority applications to
encounter analytically identified worst-case scenarios. Having that in mind, together with the fact
that in many practical scenarios occasional missed deadlines of lower-priority applications are in
fact tolerable, several questions could be raised. For instance, is the worst-case analysis the right
approach to treat the lower-priority applications? Or should some other (e.g. probabilistic)
techniques be applied? These questions represent good starting points for future work.
Experiment 3c: Tightness w.r.t. Protocols
In this experiment the objective is to investigate how the ratio between $R^{\eta}_{new*}$ and $R^{\eta}_{new}$ changes
with the agreement protocols. In order to do that, the results from Experiment 1d and
Experiment 2d are used, where $R^{\eta}_{new*}$ and $R^{\eta}_{new}$ are obtained for two different scenarios: (i) all applications
execute the List protocol, and (ii) all applications execute the Hybrid protocol. The comparison is
performed for each protocol, independently. The results are depicted in Figure 3.17(b).
It is visible that the ratio is higher for the List protocol, which entirely coincides with the
findings of Experiment 1d and Experiment 2d. Specifically, the Hybrid protocol involves more
messages and reroutings, both of which lead towards more complex interference patterns.
Consequently, the analytically identified worst-case scenarios are less likely to occur at runtime, leading
to lower ratios. Conversely, the List protocol involves fewer messages and reroutings. This implies
that the identified worst-case scenarios involve less complex interference patterns, which are more
likely to be captured at runtime, leading to higher ratios between $R^{\eta}_{new*}$ and $R^{\eta}_{new}$.
3.5.5.5 Discussion
In this section, a novel method for the worst-case communication delay analysis of LMM was
proposed. The proposed approach potentially "sacrifices" the performance (via supermessages and
proxies), in order to gain in predictability (i.e. deterministic message paths). The novel method
was compared with the existing one, both in terms of the analysis and the performance. The
experiments demonstrate that the novel approach not only renders tighter upper-bound estimates
in more than 90% of the cases, but also demonstrates a comparable runtime performance, which
reflects the importance of having deterministic traffic routes.
3.6 Application Mapping
Notice that the efficiency of the aforementioned analyses highly depends on dispatcher positions.
In other words, the positions of individual dispatchers affect all terms contributing to the
worst-case communication delay (Equations 3.49-3.50), irrespective of the employed analysis.
Specifically, the communication delay in isolation, the on-core interference and the network
interference all depend on the positions of the dispatchers of the analysed application. Therefore,
application mapping is an important aspect of the worst-case analysis of LMM and it should be addressed.
It comes as no surprise that application mapping for many-cores has been one of the most
investigated topics over the last decade, resulting in a vast body of work. In this dissertation
only a brief overview of the state-of-the-art methods is given, and only the works
that are closely related to the real-time embedded domain are covered. An interested reader
may consult the comprehensive survey of Sahu and Chattopadhyay [82], which covers scientific
publications related to the entire application mapping topic.
The problem of application mapping is equivalent to the quadratic assignment problem, which
is NP-Hard, hence searching for the optimal solution can be prohibitively expensive even for
small NoCs [40], e.g. 4×4. Therefore, the existing approaches are predominantly heuristics-based.
Lei and Kumar assumed different processor types and developed a two-stage genetic algorithm [56],
where the workload is first mapped to a specific processor type and, in the second
pass, to a particular processor. Moein et al. [64] use a modification of the aforementioned
heuristic, called the chaos-genetic algorithm. By employing the branch-and-bound technique, Hu and
Marculescu [40] investigate mappings which minimise the power consumption within the network.
The authors present a power model to calculate the energy spent by the NoC infrastructure,
which is used as an objective function to evaluate derived mappings. The concept of minimising
the energy consumption was further extended to include performance enhancements [22], network
contentions [21] and runtime mappings [20]. Murali and De Micheli [65] elaborate on bandwidth
constraints and minimise the average communication delay, while Hung et al. [41] study
thermal-aware placements. Marcon et al. [61] are the first to consider the ordering and dependencies among
traffic messages, while Srinivasan and Chatha [91] introduce per-message latency constraints.
Ascia et al. [5] present an approach where not one, but a group of mappings is considered (Pareto
mappings), so as to derive a solution to the multi-objective problem. Kreutz et al. [51] explore the
same topic for interconnect mediums other than NoC, while Hu and Marculescu [39] exploit the
routing flexibility in order to reduce the routing energy consumption and improve the performance.
As already mentioned in Chapters 1-2, Shi and Burns proposed two worst-case communi-
cation delay analyses for wormhole-switched priority-preemptive NoCs [85, 86]. The former
analysis was combined with genetic algorithms with the objective of mapping hard real-time ap-
plications [62, 80], while the latter approach was further combined with a priority assignment
algorithm with the same aim [87]. Note that the studies mentioned in this paragraph are the most
related to the topics covered in this dissertation. That is, the authors of these works analyse the
application-mapping problem from the real-time perspective, where the main emphasis is on de-
riving a mapping such that all temporal constraints posed on communication delays are always
fulfilled, even under the worst-case conditions. Also note that these approaches rely on the as-
sumption that each application is statically assigned to a specific core at design-time, and does not
have the possibility to migrate (see fully partitioned approaches in Section 1). Conversely, this
dissertation is concerned with the application mapping problem for LMM.
3.6.1 Problem Statement
The objective is to propose an application mapping method which finds a mapping of a given
application workload onto a given platform (M =A →Ψ) which is schedulable with respect to
communication requirements. A mapping M is schedulable if all applications are schedulable,
i.e. Rη(ai)≤ Dη(ai),∀ai ∈A .
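The schedulability condition can be expressed as a simple predicate. The following is a minimal sketch, assuming that the analytic upper-bounds Rη(ai) and the deadlines Dη(ai) are already available as plain numbers; the function name and the dictionary-based representation are hypothetical.

```python
from typing import Dict


def mapping_is_schedulable(wccd_bound: Dict[str, float],
                           deadline: Dict[str, float]) -> bool:
    """Return True if every application's upper-bound is within its deadline.

    wccd_bound[a] stands for R(a) and deadline[a] for D(a); both are
    assumed to be precomputed and expressed in the same time unit.
    """
    return all(wccd_bound[app] <= deadline[app] for app in deadline)


deadlines = {"a1": 10.0, "a2": 20.0}
print(mapping_is_schedulable({"a1": 8.0, "a2": 20.0}, deadlines))   # True
print(mapping_is_schedulable({"a1": 8.0, "a2": 21.5}, deadlines))   # False
```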
The secondary objective is to provide a schedulable mapping, such that the applications can
perform migrations across spatially distributed dispatchers. This objective is motivated by the
fact that far migrations are an efficient means to implement energy/thermal management and
to increase the resilience to core/cluster malfunctions. To address this aspect, the definition of
applications is extended with one additional property called the migration coefficient M(ai), which
reflects the importance of a distributed mapping of the application ai; its purpose will be
explained later.
3.6.2 Mapping Quality
The multi-objective nature of the proposed approach is reflected by the fact that the goal is not
just to provide a mapping which is schedulable, but to provide a schedulable mapping which
maximises the abilities of the application-set to perform runtime load balancing via application
migrations. Two aspects are identified as the most relevant for the migrative potential of an appli-
cation, namely the dimensions of the rectangular x× y shape its dispatchers are forming and the
distribution of the dispatchers on that shape. The strategy that is used to evaluate the mapping of
the application can be summarised with the following two observations:
Observation 9. The greater the shape of the application is, the greater its migrative abilities are.
Observation 10. Assuming a fixed shape, the more even the distances between the application
dispatchers are, the greater its migrative abilities are.
The reasoning is that, given that the migration has to occur, it is better to accommodate the
next job execution on some far core, rather than on some near core, because far migrations allow
efficient global load balancing, energy/thermal management and make the system more resilient to
clustered failures (i.e. a part of the chip starts malfunctioning). Conversely, near migrations only
partially solve these issues, or do not solve them at all. Thus, the intention behind this approach
is to prevent near migrations by maximising the shape of the application (Observation 9), and
distributing its dispatchers on the shape, such that the inter-dispatcher distances are as even as
possible (Observation 10).
Although it may seem that these objectives superficially contradict the schedulability require-
ment, this is not entirely true. If perceived as an optimisation problem, migrative abilities of
applications are the objective function which has to be maximised, and the schedulability is a con-
straint which must be fulfilled. Subsequently, the solution should be found such that (i) all the
applications are schedulable, (ii) the dimensions of the application shapes are as big as possible,
(iii) the distances between the dispatchers of the same application are as equal as possible.
In order to qualitatively evaluate different mappings, a proper metric has to be established such
that it implements the aforementioned reasoning. Let $(d_i^j, d_i^k)$ denote a pair of neighbouring
dispatchers of one application, and let $hops(d_i^j, d_i^k)$ be the distance between them, expressed in hops.
Moreover, let $DP(a_i)$ denote all such pairs of neighbouring dispatchers of the application $a_i$. The
quality of a mapping of the application $a_i$, denoted by $q_i$ (Equation 3.53), is equal to the product of
the distances between its neighbouring dispatchers, multiplied by its migration coefficient $M(a_i)$.
$$q_i = M(a_i) \cdot \prod_{\forall (d_i^j, d_i^k) \in DP(a_i)} hops(d_i^j, d_i^k) \qquad (3.53)$$
The migration coefficient M(ai) symbolises the importance of application’s spatially dis-
tributed mapping. In other words, the more the system would benefit from the application’s
spatially distributed mapping, the greater this parameter is. For example, a computationally de-
manding application may have a significant impact on the thermal properties of the core where it
is executing (and its surrounding), therefore, having the possibility to perform far migrations of
that application is desirable from the thermal perspective. Subsequently, for every such applica-
tion ai its coefficient M(ai) should be set high. Similarly, allowing far migrations for a critical
application may improve its resilience towards core/cluster failures. Thus, each such application
ai should have its M(ai) coefficient set high. In this dissertation it is assumed that the values of
migration coefficients have already been specified, and that they are used as a means to classify
the applications according to the importance of their spatially distributed mappings.
There are three reasons why a product of inter-dispatcher distances is used as the evaluation
metric:
• Primarily, because it is a computationally cheap operation. That is, due to the rectangular or
line-like structure of the application shape (Definition 3), the inter-dispatcher distances can be obtained
in a single traversal across the circumference of the application's shape in either clockwise or
counter-clockwise direction. Since many different shapes of all applications will be evaluated
during the mapping process, it is of paramount importance to limit the complexity of the evaluation.
• Second, the product of inter-dispatcher distances is monotonically increasing with the shape size
(see Observation 9).
• Finally, when assuming that the shape has been decided, the product of inter-dispatcher distances
reaches the maximum when all inter-dispatcher distances are as even as possible (see Observa-
tion 10 and for the formal proof see Theorem 17 in Appendix).
Note that a zero distance between two dispatchers represents a special case where both are
located on a common core. In the assumed model, such a mapping is meaningless, and the metric
expressed with Equation 3.53 penalises such mappings by returning qi = 0.
Assuming that the circumference of an application shape $c$ and the number of dispatchers
$|D(a_i)|$ are given, choosing and applying an optimal dispatcher placement would be trivial: if
possible, make all distances equal to $\frac{c}{|D(a_i)|}$; otherwise make each distance either
$\left\lceil \frac{c}{|D(a_i)|} \right\rceil$ or $\left\lfloor \frac{c}{|D(a_i)|} \right\rfloor$.
However, the corners of the application shape pose implicit constraints regarding the inter-dispatcher
distances and, in some cases, prevent optimal solutions. Also, not all the cores located on the
edges of the application shape might be available due to on-core schedulability reasons (Sec-
tion 3.8), thus further preventing optimal solutions. Therefore, the purpose of the aforementioned
evaluation metric is to provide a qualitative comparison between different possible suboptimal
dispatcher placements, in cases when optimal ones are not possible.
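The unconstrained optimal placement described above can be sketched in a few lines. The following is a minimal illustration, assuming a circumference c measured in hops and ignoring both the corner constraints and the unavailable cores mentioned in the text; the function name and the example values are hypothetical.

```python
from math import prod
from typing import List


def balanced_distances(c: int, n: int) -> List[int]:
    """Split a circumference of c hops into n distances that differ by at most 1.

    Corner constraints and unavailable cores are deliberately ignored
    (assumption for illustration only).
    """
    if n <= 0 or c < n:
        raise ValueError("need 0 < n <= c so that no distance is zero")
    base, extra = divmod(c, n)
    # 'extra' distances take ceil(c / n); the remaining ones take floor(c / n)
    return [base + 1] * extra + [base] * (n - extra)


distances = balanced_distances(16, 6)
print(distances)        # [3, 3, 3, 3, 2, 2]
print(prod(distances))  # 324
```

When c is divisible by the number of dispatchers, all returned distances are equal, which corresponds to the first case in the text.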
Figure 3.18: Different dispatcher mappings for the same application-shape (panels (a) and (b))
The example illustrated in Figure 3.18 is used to demonstrate how different mappings are
evaluated. Two possibilities are presented; in both of them the application claims the same shape, but
the placement of the dispatchers differs. For clarity purposes, assume that the migration
coefficient is equal to 1. After solving Equation 3.53, the value of qi for Figure 3.18(a) is 144, while for
the mapping illustrated in Figure 3.18(b), qi = 324. Thus, the mapping from Figure 3.18(b) is the
more favourable option. This coincides with the reasoning and intentions. Notice that for the given
example $c = 16$, $|D(a_i)| = 6$, and $\frac{c}{|D(a_i)|} = \frac{16}{6} \approx 2.67$. Therefore, the mapping from Figure 3.18(b)
is one of the optimal solutions, because all of its inter-dispatcher distances are either 2 or 3.
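The evaluation of Equation 3.53 in a single traversal of the circumference can be sketched as follows. This is an illustration under stated assumptions: dispatchers are identified by their hop offset along the circumference, distances are measured along the circumference, and the two placements below are hypothetical ones chosen so that they reproduce the values 144 and 324; the actual placements of Figure 3.18 are not reproduced here.

```python
from math import prod
from typing import List


def perimeter_gaps(offsets: List[int], c: int) -> List[int]:
    """Hop distances between neighbouring dispatchers in one walk around the shape.

    offsets are hypothetical positions (in hops) along a circumference of length c.
    """
    pts = sorted(offsets)
    return [(pts[(i + 1) % len(pts)] - pts[i]) % c for i in range(len(pts))]


def quality(offsets: List[int], c: int, migration_coeff: float = 1.0) -> float:
    """q_i = M(a_i) * product of neighbouring inter-dispatcher distances."""
    return migration_coeff * prod(perimeter_gaps(offsets, c))


uneven = [0, 4, 8, 11, 14, 15]   # gaps 4, 4, 3, 3, 1, 1
even = [0, 3, 6, 9, 12, 14]      # gaps 3, 3, 3, 3, 2, 2
print(quality(uneven, 16))       # 144.0
print(quality(even, 16))         # 324.0
print(quality([0, 0, 8], 16))    # 0.0, two dispatchers share a core
```

The last call shows the penalty for co-located dispatchers: a single zero distance makes the whole product zero.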
Upon defining the individual, per-application quality metric, the quality of the mapping of an
entire application-set can be defined. It is equal to the sum of individual qualities of all applications
comprising an application-set A (Equation 3.54).
$$Q = \sum_{\forall a_i \in \mathcal{A}} q_i \qquad (3.54)$$
3.6.3 Mapping Process
The proposed method consists of three mapping stages: the Initial Phase (IP), the Feasibility
Phase (FP) and the Optimisation Phase (OP).
Before the mapping process begins, based on their priorities, all applications are divided into
two groups: a set of high-priority applications AH, and a set of low-priority applications AL
(Equation 3.55).
$$\begin{aligned} P(a_i) > P_{min} + P \cdot (P_{max} - P_{min}) &\Rightarrow a_i \in \mathcal{A}_H \\ P(a_i) \le P_{min} + P \cdot (P_{max} - P_{min}) &\Rightarrow a_i \in \mathcal{A}_L \end{aligned} \qquad (3.55)$$
Pmax and Pmin denote the maximum and the minimum system priorities, respectively, while the
parameter P (0 ≤ P ≤ 1) is arbitrarily chosen by the system designer. All applications belonging to
the group AH will be mapped during IP and will not be subject to any changes during FP and
OP. The rationale for this decision is that any change in the mapping of an application ai ∈ AH
(e.g. its shape, the position of its dispatchers) requires a new schedulability check of both ai and
all lower-priority applications whose messages are directly interfered with by the messages of ai. Thus,
the recalculation triggered by a high-priority application might be computationally expensive, as
there might be a substantial number of directly interfered applications. Therefore, the parameter
P is introduced as a means to control the complexity of the entire mapping process.
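The split of Equation 3.55 can be sketched as follows. This is a minimal illustration, assuming that priorities are plain numbers for which a larger value means a higher priority (as the inequality in Equation 3.55 reads), and that Pmin and Pmax are passed in explicitly; the function and parameter names are hypothetical.

```python
from typing import Dict, List, Tuple


def split_by_priority(priority: Dict[str, float], p_min: float, p_max: float,
                      p: float) -> Tuple[List[str], List[str]]:
    """Divide applications into (high-priority, low-priority) groups.

    p is the designer-chosen parameter with 0 <= p <= 1; a larger priority
    value is assumed to denote a higher priority.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    threshold = p_min + p * (p_max - p_min)
    high = [app for app, value in priority.items() if value > threshold]
    low = [app for app, value in priority.items() if value <= threshold]
    return high, low


prios = {"a1": 200, "a2": 150, "a3": 100, "a4": 1}
print(split_by_priority(prios, p_min=1, p_max=200, p=0.5))
# (['a1', 'a2'], ['a3', 'a4'])   threshold = 1 + 0.5 * 199 = 100.5
```

With p = 1 the threshold equals Pmax, so the high-priority group is empty and every application remains open to changes in the later phases.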
3.6.3.1 Initial Phase (IP)
As already described, during this phase only the high-priority applications are mapped. The appli-
cations are sorted non-increasingly with respect to their priorities, and mapped sequentially in that
order. By performing the mapping in this manner, in most cases, the mapping of one application
will not have an impact on the schedulability of previously mapped (higher-priority) applications.
Exceptions are cases when there exists an inter-application message from some already mapped
higher-priority application aj to the application ai that is currently being mapped. In such cases, the
mapping of ai also maps the inter-proxy message $f^{P}_{j,i}$, which has the priority of aj. This invokes a
schedulability recheck of aj and all interfered applications with intermediate priorities. Depending on
the frequency of these situations, the complexity of IP ranges between O(|AH|) (no schedulability
rechecks necessary) and O(|AH|^2) (the mapping of every application invokes schedulability
rechecks), where |AH| denotes the number of applications in AH. As will be seen in Section 3.6.4,
the actual complexity is closer to the former estimate (a linear complexity).
When mapping an application, several mapping options are possible. A narrow mapping
is a mapping where the application shape covers the minimum possible surface, i.e. a mapping
where all dispatchers of an application occupy consecutive cores. For instance, the possible
narrow mappings for a 4-dispatcher application are the following shapes: 1×4, 4×1, 2×2.
Conversely, a wide mapping is a mapping where the application shape covers the maximum
possible surface, in most cases up to the boundaries of the NoC. The surfaces of the narrow and
the wide mapping are referred to as Smin(ai) and Smax(ai), respectively.
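The definition of a narrow mapping can be illustrated with a minimal sketch that enumerates the rectangular and linear shapes whose surface equals the number of dispatchers. The function name `narrow_shapes` and the representation of a shape as a (rows, columns) pair are illustrative assumptions and are not part of the described method.

```python
def narrow_shapes(n_dispatchers, grid_rows, grid_cols):
    """Return all (rows, cols) shapes with rows * cols == n_dispatchers that fit the grid."""
    shapes = []
    for rows in range(1, min(n_dispatchers, grid_rows) + 1):
        if n_dispatchers % rows == 0:
            cols = n_dispatchers // rows
            if cols <= grid_cols:
                shapes.append((rows, cols))
    return shapes


# A 4-dispatcher application on an 8x8 grid.
print(narrow_shapes(4, 8, 8))
```

For the 4-dispatcher example, the sketch yields the three shapes listed in the text.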
Since the applications from AH are mapped during IP and are not subject to any changes
in the subsequent mapping phases, it is essential to dedicate a proper shape to each of these
applications. Assuming wide mappings for most applications during IP will create significant
high-priority traffic within the network, which might leave the applications from AL unable
to reach schedulability. Conversely, restricting the applications from AH to narrow
mappings might leave the network resources unnecessarily underutilised, as there might
be very few applications in AL. In order to manage this design trade-off, the parameter G is
introduced. G presents an upper bound on the allowed application shapes, and it controls the
mapping "greediness" of high-priority applications. For instance, if G = 0, only the shapes whose
surface S is less than or equal to Smin(ai) are allowed. Similarly, if G = 1, shapes which fulfil
S ≤ Smax(ai) are allowed, which includes all rectangular/linear shapes. If 0 < G < 1, the allowed
shapes are calculated by Equation 3.56. Specifically, if the mapping of a 4-dispatcher application
has to be performed on an 8×8 platform with G = 0.5, then Smin(ai) = 4, Smax(ai) = 64
and S ≤ 34. Consequently, every shape whose surface exceeds 34 would be excluded from
consideration, for example: {8×8, 8×7, 7×8, 7×7, 8×6, 6×8, 7×6, 6×7, 6×6}.
S ≤ Smin(ai) + G · (Smax(ai) − Smin(ai)) (3.56)
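The effect of Equation 3.56 can be reproduced with a short sketch. It assumes, for illustration only, that a narrow mapping covers exactly as many cores as there are dispatchers and that a wide mapping covers the whole grid; the names `surface_threshold` and `allowed_shapes` are hypothetical.

```python
def surface_threshold(s_min, s_max, g):
    """Equation 3.56: upper bound on the surface of an allowed shape."""
    return s_min + g * (s_max - s_min)


def allowed_shapes(n_dispatchers, grid_rows, grid_cols, g):
    """Shapes (rows, cols) that can hold all dispatchers and respect the threshold."""
    s_min = n_dispatchers            # assumption: narrow mapping covers exactly n cores
    s_max = grid_rows * grid_cols    # assumption: wide mapping covers the whole grid
    limit = surface_threshold(s_min, s_max, g)
    shapes = [(r, c)
              for r in range(1, grid_rows + 1)
              for c in range(1, grid_cols + 1)
              if n_dispatchers <= r * c <= limit]
    # Largest surface first, mirroring the non-increasing order used during IP.
    return sorted(shapes, key=lambda s: s[0] * s[1], reverse=True)


print(surface_threshold(4, 64, 0.5))
```

With G = 0.5 on an 8×8 grid the threshold evaluates to 34, so a 6×6 shape is rejected while a 6×5 shape remains a candidate.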
Algorithm 18 illustrates how the IP stage is performed. First, the high-priority applications
are selected and sorted by their priorities, non-increasingly (lines 1−4). The applications are
treated sequentially; the values of Smin(ai) and Smax(ai) are obtained (line 6), and a shape surface
threshold S is calculated (line 7). Then, only the shapes whose surface is less than or equal to
the calculated threshold S are selected and sorted by the shape surface, non-increasingly (lines
8−12). The mapping is attempted on the entire grid with the selected application and the biggest
allowed shape (lines 15−16). Note that this process involves (i) the placement of the shape at
the particular location on the grid, (ii) the generation of the supermessages, (iii) the assignment
of the individual proxy roles for both the application under analysis and the other applications
communicating with it, (iv) the generation of the inter-application messages assuming the elected
proxies, (v) the schedulability check for the analysed application. Each inter-application message
sent to the currently mapping application has a higher priority, hence each such message triggers
the schedulability recheck of its application (lines 18−20). This may also trigger the rechecks of
other applications influenced by these updates, therefore a schedulability recheck is performed for
each such application (lines 22−25).
Algorithm 18 IP(A, Ψ, P, G)
Input: application-set A, platform Ψ, parameter P, parameter G
Output: initial mapping Mip
 1: for each (ai ∈ A | P(ai) > Pmin + P · (Pmax − Pmin)) do
 2:     AH ← AH ∪ {ai}; // select all high-priority applications
 3: end for
 4: sort(AH, P ↓); // sort by priority, non-increasingly
 5: for each (ai ∈ AH) do
 6:     Smin(ai) ← minArea(ai); Smax(ai) ← maxArea(ai); // compute Smin(ai) and Smax(ai)
 7:     S ← Smin(ai) + G · (Smax(ai) − Smin(ai)); // find a shape surface threshold
 8:     Allowed ← ∅;
 9:     for each (shape | shape.S() ≤ S) do
10:         Allowed ← Allowed ∪ {shape}; // select allowed shapes
11:     end for
12:     sort(Allowed, S ↓); // sort by shape surface, non-increasingly
13:     schedulable ← false;
14:     while (schedulable ≠ true ∧ Allowed ≠ ∅) do
15:         shape ← remove(Allowed); // get the biggest existing shape
16:         schedulable ← test(ai, shape);
17:         if (schedulable = true) then
18:             for each (aj ∈ Senders(ai)) do
19:                 update(aj); // update inter-application messages
20:             end for
21:             if (Senders(ai) ≠ ∅) then
22:                 for each (ak ∈ AH | recheck(ak) = true) do
23:                     schedulable ← schedulable ∧ test(ak, shape(ak)); // schedulability recheck
24:                     if (schedulable ≠ true) then break; end if
25:                 end for
26:             end if
27:         end if
28:         if (schedulable = true) then
29:             location ← findBestLocation(ai, shape); // find the best location
30:             map(ai, Ψ, shape, location);
31:             distributeDispatchers(ai); // maximise the mapping quality
32:         end if
33:     end while
34:     if (schedulable ≠ true) then
35:         mappingFailed(ai); // declare failure of IP
36:     end if
37: end for
38: mappingSuccess(); // declare success of IP
39: return Mip;
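The selection step in lines 1−4 of Algorithm 18 can be sketched as follows. The sketch assumes that priorities are plain numbers and that a larger number denotes a higher priority; the function name `split_by_priority` and the dictionary representation are hypothetical.

```python
def split_by_priority(priorities, p):
    """Split application ids into (high, low) groups using parameter p in [0, 1].

    priorities: dict mapping an application id to its numeric priority
    (assumption: a larger number means a higher priority).
    """
    p_min, p_max = min(priorities.values()), max(priorities.values())
    cut = p_min + p * (p_max - p_min)
    high = [a for a, prio in priorities.items() if prio > cut]
    low = [a for a, prio in priorities.items() if prio <= cut]
    # Both groups are sorted by priority, non-increasingly.
    high.sort(key=lambda a: priorities[a], reverse=True)
    low.sort(key=lambda a: priorities[a], reverse=True)
    return high, low


print(split_by_priority({"a": 1, "b": 5, "c": 9}, 0.5))
```

Applications whose priority exceeds the cut form AH; the remaining ones form AL.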
If the application can be mapped with the selected shape on multiple places of the grid, the
location is found such that the worst-case delay of the application is minimised. In this way the
algorithm searches for the position on the grid where the application suffers the least interference
from other traffic. Notice that this strategy forces interacting applications to be mapped close to
each other. A special case occurs when proxies of two interacting applications share the same core
and the inter-proxy message does not exist. Furthermore, if there are several locations on the grid
where the application can be mapped and for which the delay is minimised, the approach selects
the one for which the sum of the dispatcher distances from the center of the grid is minimised. The
intention behind mapping the application as close to the center of the grid as possible is as follows:
any lower-priority application which is yet to be mapped, and which performs the communication
with the said application, will have the possibility to evaluate more mapping options so as to
minimise the communication penalty, while that would not be the case if the said application was
mapped on the border of the grid. Note that these two location selection criteria are not explicitly
mentioned in Algorithm 18; this logic is assumed to be encapsulated within the method in
line 29.
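The two location selection criteria described above can be expressed as a lexicographic comparison. The following is a minimal sketch under stated assumptions: each candidate location is given as a pair of a precomputed worst-case delay and a list of dispatcher coordinates, and distances are measured in hops (Manhattan distance). The function name `best_location` is hypothetical and does not claim to reproduce the method encapsulated in line 29.

```python
def best_location(candidates, grid_rows, grid_cols):
    """Pick a location: smallest worst-case delay, ties broken by centrality.

    candidates: list of (delay, dispatcher_cores) pairs, where dispatcher_cores
    is a list of (row, col) coordinates (hypothetical representation).
    """
    centre_r, centre_c = (grid_rows - 1) / 2, (grid_cols - 1) / 2

    def centre_distance(cores):
        # Sum of Manhattan distances of all dispatchers from the grid centre.
        return sum(abs(r - centre_r) + abs(c - centre_c) for r, c in cores)

    return min(candidates, key=lambda cand: (cand[0], centre_distance(cand[1])))


corner = (10.0, [(0, 0), (0, 1)])
middle = (10.0, [(3, 3), (3, 4)])
print(best_location([corner, middle], 8, 8))
```

Here both candidates report the same delay, so the one closer to the centre of the grid is chosen.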
Once the best location is found, the application is considered mapped (line 30). Subsequently,
the rest of dispatchers (if any) are positioned on the edges of the shape, such that the mapping
quality qi is maximised (line 31).
If the application cannot be mapped on the grid with the current shape, the mapping is at-
tempted with the next shape from the collection of allowed shapes. The process is repeated until
a proper shape and location are found, such that the schedulability constraints are satisfied for
the currently mapping and the previously mapped applications. If the schedulability cannot be
reached with any of the shapes, the mapping process declares a failure (line 35). Conversely,
when all the applications from AH are mapped, IP declares a success (line 38) and returns the
initial mapping Mip (line 39).
3.6.3.2 Feasibility Phase (FP)
During this phase the mapping of low-priority applications is performed, with the primary
objective of deriving a schedulable mapping Mfp where the entire application-set is mapped.
Therefore, during FP, every application is mapped with the narrow mapping. FP is described by
Algorithm 19. First, all low-priority applications are grouped in AL and sorted by priority,
non-increasingly (lines 1−4). The applications are treated sequentially; the surface of a narrow
mapping Smin(ai) is found (line 6), and subsequently it is used to find all possible shapes which
correspond to the narrow mapping of that application (lines 7−9). The mapping is attempted with
the first shape from the list, without any preference, as all selected shapes represent narrow
mappings (lines 12−13). The same logic used in IP applies here: if needed, the inter-application
messages of higher-priority applications are updated (lines 15−17) and subsequently the
schedulability of affected applications is rechecked (lines 19−22). Similarly, if multiple grid
locations are available for the same shape, the one with the minimum delay is selected, while if
more than one location reports the same delay, the one closer to the center of the grid has the
precedence (line 26). Once the best location is found, the application is mapped (line 27).
Algorithm 19 FP(A, Mip, Ψ, P)
Input: application-set A, initial mapping Mip, platform Ψ, parameter P
Output: feasible mapping Mfp
 1: for each (ai ∈ A | P(ai) ≤ Pmin + P · (Pmax − Pmin)) do
 2:     AL ← AL ∪ {ai}; // select all low-priority applications
 3: end for
 4: sort(AL, P ↓); // sort by priority, non-increasingly
 5: for each (ai ∈ AL) do
 6:     Smin(ai) ← minArea(ai); // compute Smin(ai)
 7:     for each (shape | shape.S() = Smin(ai)) do
 8:         Allowed ← Allowed ∪ {shape}; // select narrow-mapping shapes
 9:     end for
10:     schedulable ← false;
11:     while (schedulable ≠ true ∧ Allowed ≠ ∅) do
12:         shape ← remove(Allowed); // get the first allowed shape
13:         schedulable ← test(ai, shape);
14:         if (schedulable = true) then
15:             for each (aj ∈ Senders(ai)) do
16:                 update(aj); // update inter-application messages
17:             end for
18:             if (Senders(ai) ≠ ∅) then
19:                 for each (ak ∈ A | recheck(ak) = true) do
20:                     schedulable ← schedulable ∧ test(ak, shape(ak)); // schedulability recheck
21:                     if (schedulable ≠ true) then break; end if
22:                 end for
23:             end if
24:         end if
25:         if (schedulable = true) then
26:             location ← findBestLocation(ai, shape); // find the best location
27:             map(ai, Ψ, shape, location);
28:         end if
29:     end while
30:     if (schedulable ≠ true) then
31:         mappingFailed(ai); // declare failure of FP
32:     end if
33: end for
34: mappingSuccess(); // declare success of FP
35: return Mfp;
If the attempted shape violates the schedulability of the currently mapping application, or
any other already mapped application, the mapping is attempted with the next one from the list
of allowed shapes. The process is repeated until a shape is found such that all applications are
schedulable. If none of the shapes satisfies this requirement, the mapping process declares a
failure (line 31). Conversely, once all the applications are mapped, FP declares a success (line
34) and returns the feasible mapping Mfp (line 35). The computational complexity of FP
also varies. If no schedulability rechecks are necessary, the complexity is equal to O(|AL|),
where |AL| denotes the number of low-priority applications. Conversely, if the mapping of every
application requires schedulability rechecks, the complexity is O(|AL| · |A|), where |A| denotes
the total number of applications in the application-set, i.e. |A| = |AH| + |AL|. As will be seen in
Section 3.6.4, the actual complexity is closer to the former estimate (linear). Note that the output
of FP is the first schedulable solution of an entire application-set.
3.6.3.3 Optimisation Phase (OP)
The objective of this phase is to improve on the feasible mapping Mfp and produce the final
mapping Mop. This is performed by attempting to extend the shapes of narrow mappings that
low-priority applications claimed during FP. The process ends when no further extensions are
possible. That also marks the end of the entire mapping process, and the reached mapping Mop
is the final output. OP is depicted by Algorithm 20.
Similarly to FP, applications from AL are selected and sorted non-increasingly (lines 1−4),
but during OP the sorting key is the migration coefficient M. As mentioned in Section 3.6.2, it
represents the significance of the spatially distributed mapping of an application, i.e. the more
the system would benefit from the spatially distributed mapping of an application ai, the greater
its parameter M(ai) is. The positive side is that applications for which distribution matters more
are given the possibility to claim resources before others, thus increasing the chances of the
mapping process deriving a good-quality solution. The downside is that the number of necessary
schedulability rechecks increases significantly; an expanding application can invoke not only the
schedulability rechecks mentioned in the previous mapping stages, but also those of all its directly
interfered lower-priority applications, which was not possible in the previous stages as
applications were sorted by their priorities, non-increasingly. Thus, unlike the previous mapping
stages, where schedulability rechecks were an exception, during OP they are performed regularly.
The computational complexity of OP is identical to that of FP; however, as will be seen in
Section 3.6.4, the actual complexity is closer to the higher (sub-quadratic) estimate.
The applications are treated sequentially; the expanded shape is found and the mapping
attempted (lines 8−9). The process consists of an attempt to stretch the application shape: (i) the
supermessages are re-generated, (ii) the proxy roles are re-assigned for both the application under
analysis and the other applications interacting with it, (iii) the inter-application messages are
re-generated assuming the newly elected proxies. As described above, this may require a significant
number of schedulability rechecks (lines 11−14). The application is expanded until any further
stretch would cause unschedulability of either itself or any other application. Every expansion is
followed by the rearrangement of the dispatchers of the expanding application, so as to equalise
the inter-dispatcher distances as much as possible and improve the mapping quality (line 17). Once
this process is performed for all applications from AL, the entire mapping process concludes (line
22), and the derived mapping Mop is returned as the final output (line 23).
Algorithm 20 OP(A, Mfp, Ψ, P)
Input: application-set A, feasible mapping Mfp, platform Ψ, parameter P
Output: final mapping Mop
 1: for each (ai ∈ A | P(ai) ≤ Pmin + P · (Pmax − Pmin)) do
 2:     AL ← AL ∪ {ai}; // select all low-priority applications
 3: end for
 4: sort(AL, M ↓); // sort by migration coefficient, non-increasingly
 5: for each (ai ∈ AL) do
 6:     schedulable ← true;
 7:     while (schedulable = true) do
 8:         newShape ← expand(shape(ai)); // find expanded shape to attempt
 9:         schedulable ← test(ai, newShape);
10:         if (schedulable = true) then
11:             for each (aj ∈ A | recheck(aj) = true) do
12:                 schedulable ← schedulable ∧ test(aj, shape(aj)); // schedulability recheck
13:                 if (schedulable ≠ true) then break; end if
14:             end for
15:             if (schedulable = true) then
16:                 ai.shape ← newShape; // claim expanded shape
17:                 rearrangeDispatchers(ai); // maximise mapping quality
18:             end if
19:         end if
20:     end while
21: end for
22: mappingSuccess(); // declare success of OP and entire mapping process
23: return Mop;
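The greedy nature of the expansion loop can be illustrated with a deliberately simplified sketch. It replaces the schedulability analysis with a toy test (the sum of all claimed surfaces must not exceed a fixed budget) and models an expansion as growing the surface by one core; both are assumptions made purely for illustration, as is the function name `optimise`.

```python
def optimise(apps, budget):
    """Greedy expansion in the spirit of OP, with a toy schedulability test.

    apps: dict mapping an application id to (migration_coefficient, surface).
    budget: total surface that keeps the toy system 'schedulable'
    (a stand-in for the real schedulability analysis).
    """
    surfaces = {a: s for a, (_, s) in apps.items()}
    # Sort by migration coefficient, non-increasingly.
    order = sorted(apps, key=lambda a: apps[a][0], reverse=True)
    for a in order:
        while True:
            trial = dict(surfaces)
            trial[a] += 1                      # stand-in for expand(shape(a))
            if sum(trial.values()) > budget:   # stand-in for test() and the rechecks
                break
            surfaces = trial                   # claim the expanded shape
    return surfaces


print(optimise({"x": (10, 2), "y": (40, 2), "z": (25, 2)}, budget=10))
```

In this toy setting the application with the largest migration coefficient claims all the spare surface, leaving nothing for the applications considered after it.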
3.6.4 Experimental Evaluation
In this section, the evaluation of the proposed mapping approach is performed. Specifically,
through different case studies, the overall efficiency and applicability of the proposed applica-
tion mapping method are investigated, as well as how different choices and trade-offs, achievable
through parameter manipulations, influence the quality of derived mapping solutions.
3.6.4.1 Case Study 1: Application Shapes and Overheads of Constrained Routes
In many situations, several shapes have similar or identical characteristics, especially during FP
when only narrow mappings are considered. In order to assess the differences between shape
types and reason about their applicability, their relevant characteristics are analysed, namely the
traversal distance of the average and the longest intra-application message. For easier calculation,
in all cases narrow mappings are assumed, while for rectangular shapes only an even number of
dispatchers is considered.
Equation 3.57 calculates the average traversal distance of the intra-application message,
assuming an n-dispatcher application with a line-like shape. If the i-th dispatcher of a horizontally
stretched application is the master, it communicates with i−1 dispatchers on its left and the
remaining n−i dispatchers on its right (the terms in the brackets of Equation 3.57). The outer
summation is performed so as to account for all possible masters. In order to obtain the distance
of the average message, the result is divided by the total number of messages (n masters send n−1
messages each). Similarly, the value is calculated for the rectangular shape, where the message
routes obey Definition 4 (Equation 3.58).
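\[
\text{Line-AVG} = \frac{\sum_{i=1}^{n}\left(\sum_{k=1}^{i-1} k + \sum_{j=1}^{n-i} j\right)}{n(n-1)} = \frac{n+1}{3} \tag{3.57}
\]
\[
\text{Rect-AVG} = \frac{\sum_{i=1}^{\lceil\frac{n-1}{2}\rceil} i + \sum_{i=1}^{\lfloor\frac{n-1}{2}\rfloor} i}{n-1} = \frac{n^2}{4(n-1)} \tag{3.58}
\]
Finding the maximum message distance for both shapes is trivial (Equations 3.59-3.60).
\[
\text{Line-MAX} = n-1 \tag{3.59}
\]
\[
\text{Rect-MAX} = \left\lceil \frac{n-1}{2} \right\rceil \tag{3.60}
\]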
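As a sanity check, the summations of Equations 3.57 and 3.58 can be evaluated term by term and compared against the closed forms. The sketch below uses exact rational arithmetic; the function names are illustrative, and the check is restricted to even numbers of dispatchers, in line with the assumption stated for rectangular shapes.

```python
from fractions import Fraction


def line_avg(n):
    """Equation 3.57 evaluated term by term."""
    total = sum(sum(range(1, i)) + sum(range(1, n - i + 1)) for i in range(1, n + 1))
    return Fraction(total, n * (n - 1))


def rect_avg(n):
    """Equation 3.58 evaluated term by term."""
    upper = -(-(n - 1) // 2)   # ceil((n - 1) / 2)
    lower = (n - 1) // 2       # floor((n - 1) / 2)
    total = sum(range(1, upper + 1)) + sum(range(1, lower + 1))
    return Fraction(total, n - 1)


# Compare the summations with the closed forms for even n in [4, 20].
for n in range(4, 21, 2):
    assert line_avg(n) == Fraction(n + 1, 3)
    assert rect_avg(n) == Fraction(n * n, 4 * (n - 1))
```

The assertions pass for every even n in the range plotted in Figure 3.19.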
[Figure 3.19: Shape comparison. Horizontal axis: number of dispatchers (4 to 20); vertical axis: distance (in hops); series: Line-AVG, Line-MAX, Rect-AVG, Rect-MAX, Free-AVG, Free-MAX]
[Figure 3.20: Rerouting penalty estimation. Horizontal axis: number of dispatchers (4 to 20); vertical axis: forced route penalty, in % of unconstrained message]
Figure 3.19 illustrates how the number of dispatchers influences the distance traversed by the
average and the longest intra-application message of these two shape types. The rectangular shape
type performs better than the line one, for both the average and the longest message, but at the
expense of potential reroutings, which do not occur in the latter case. The decision regarding
the shape type precedence can be made by the system designer, or left to the mapping process to
decide.
In order to estimate the penalty of constrained intra-application message routes, traversal dis-
tances of the average and the longest message are computed for a rectangular shape with the free
point-to-point communication between dispatchers (Equations 3.61-3.62), and subsequently plot-
ted in Figure 3.19. Equation 3.61 is derived by using the intermediate results of Equation 3.57.
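\[
\text{Free-AVG} = \frac{2\sum_{i=1}^{n/2}\left(\sum_{k=1}^{i-1} k + \sum_{j=1}^{n/2-i} j\right) + \frac{n^2}{4}}{\frac{n}{2}(n-1)} = \frac{n^2+3n-4}{6(n-1)} \tag{3.61}
\]
\[
\text{Free-MAX} = \left\lceil \frac{n-1}{2} \right\rceil \tag{3.62}
\]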
Figure 3.19 demonstrates that the removal of Definition 4 improves the performance for the
average message, but keeps the same longest message and, as already mentioned, makes the anal-
ysis pessimistic. Unlike Figure 3.19, which visualises the overhead of forced paths expressed in
absolute values, Figure 3.20 depicts the same overhead expressed relatively. In practical terms,
Definition 4 causes a 19% longer traversal path for the average intra-application message of a
10-dispatcher application with a rectangular shape.
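The 19% figure can be re-derived with a short sketch. It assumes that the narrow rectangular shape of an n-dispatcher application is a 2 × n/2 rectangle and that an unconstrained message travels the Manhattan distance between two cores; under these assumptions a brute-force average coincides with the closed form of Equation 3.61. The function names are illustrative.

```python
from fractions import Fraction
from itertools import product


def free_avg(n):
    """Average Manhattan distance between distinct cores of a 2 x (n/2) rectangle."""
    cores = list(product(range(2), range(n // 2)))
    # Pairs of identical cores contribute zero, so they do not affect the total.
    total = sum(abs(r1 - r2) + abs(c1 - c2)
                for (r1, c1), (r2, c2) in product(cores, repeat=2))
    return Fraction(total, n * (n - 1))


def rect_avg(n):
    """Closed form of Equation 3.58 (even n)."""
    return Fraction(n * n, 4 * (n - 1))


def penalty_percent(n):
    """Relative overhead of the constrained routes over free routing."""
    return float((rect_avg(n) / free_avg(n) - 1) * 100)


print(round(penalty_percent(10), 1))
```

For n = 10 the constrained average is 25/9 hops and the free average is 7/3 hops, a ratio of 25/21, i.e. roughly 19% more.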
3.6.4.2 Analysis Parameters
In the subsequent experiments the mapping process is performed, assuming the workload synthet-
ically generated by using the parameters from Table 3.5. An asterisk sign denotes a randomly
generated value, assuming a uniform distribution. To assure that all considered application-sets
are indeed schedulable, the individual per-application constraints on the worst-case communica-
tion delays were derived as follows. First, each application-set is mapped with the parameters
P = G = 0. This approach assures that all applications will be mapped during the IP phase
with shapes that correspond to narrow mappings, and will consequently suffer small worst-case
communication delays, called narrow shape delays (NSDs), hereafter. Thus, by verifying that the
generated communication deadline of each application is equal to, or greater than its respective
NSD (i.e. Dη(ai)≥NSD(ai)), it can be confirmed that the given application-set is schedulable with
at least one selection of the mapping parameters (P = G = 0). In the subsequent experiments, the
communication deadlines are set to be equal to multiples of NSDs of respective applications, i.e.
Dη(ai) = k·NSD(ai), ∀ai ∈ A. Notice that a larger value of k gives applications more "freedom"
to claim wider shapes, but also shortens the computation and memory deadlines of applications,
because Dη(ai) = D(ai) − Dτ+µ(ai).
Table 3.5: Analysis and simulation parameters for Section 3.6.4
Parameter                                                Value
NoC topology                                             2-D mesh with
Link width = flit size, σflit                            16 bytes
Router frequency, νρ                                     2 GHz
Routing delay, δρ                                        3 cycles (1.5 ns)
Link traversal delay, δL                                 1 cycle (0.5 ns)
Rerouting delay, δR                                      10000 cycles (5 µs)
OS operations, δ←P, δ→P, δ←C, δ→C, δ←I, δ→I, δQ, δE      10000 cycles (5 µs)
Application periods, D(ai) = T(ai), ∀ai ∈ A              [100−1000]* ms
Communication deadlines, Dη(ai), ∀ai ∈ A                 0.25·D(ai)
Protocol message size, σprt                              16 bytes
Execution context size, σctx                             [1−128]* Kbytes
Inter-application message size, σiam                     [1−128]* Kbytes
Application-set size, |A|                                200 applications
Application migration coefficient, Mi                    [0−50]*
Number of dispatchers per application, |D(ai)|           [2−10]*
Probability of inter-app. comm.                          5%
Testing platform                                         Intel dual-core desktop & Java (Max heap-size: 4 GB)
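The delays in Table 3.5 are given both in cycles and in time units; the conversion follows directly from the 2 GHz router frequency. The helper names below are hypothetical, and the flit count assumes that 1 Kbyte equals 1024 bytes, which the table does not state.

```python
import math


def cycles_to_ns(cycles, frequency_ghz=2.0):
    """Convert router clock cycles to nanoseconds (1 GHz = 1 cycle per ns)."""
    return cycles / frequency_ghz


def flits(size_bytes, flit_bytes=16):
    """Number of flits needed to carry a message of the given size."""
    return math.ceil(size_bytes / flit_bytes)


print(cycles_to_ns(3), cycles_to_ns(10000))
print(flits(128 * 1024))  # assumption: 1 Kbyte = 1024 bytes
```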
3.6.4.3 Case Study 2: Scalability
The objective of this case study is to test the scalability potential of the proposed approach. The
number of applications is varied in the range [2−200]. For each given value of the application-set
size, 1000 application-sets are generated and mapped on an 8×8 grid, in accordance with the values
from Table 3.5. The other parameters are: G = 0.3, P = 0.5 and Dη(ai) = 10·NSD(ai),∀ai ∈A .
The timing analysis is performed by capturing the duration of the mapping process of each run.
Figure 3.21 demonstrates the results. The horizontal axis stands for the application-set size.
For clarity, a reciprocal cumulative graph is used for the presentation, where the vertical
axis depicts the number of runs that had not yet completed the mapping process within a given
time interval (depth axis). For small application-sets, almost all the runs complete very fast, which
is visible from Figure 3.21(a). As the number of applications increases, the mapping process
takes more time, thus the slope of the curve flattens. At the end of the first observed period in
Figure 3.21(a) (0−3 seconds), almost all the smaller sets finished, while very few of the largest
ones report the completion of the mapping process.
[Figure 3.21: Influence of application-set size on analysis time (1/2); panels (a) and (b)]
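The reciprocal cumulative representation can be computed from raw run durations as sketched below. The function name `unfinished_runs` and the sample durations are purely illustrative and are not taken from the experiments.

```python
from bisect import bisect_right


def unfinished_runs(durations, time_points):
    """For each time point, count the runs whose duration exceeds it."""
    ordered = sorted(durations)
    return [len(ordered) - bisect_right(ordered, t) for t in time_points]


# Hypothetical durations (in seconds) of five mapping runs.
print(unfinished_runs([0.2, 0.5, 1.0, 3.5, 12.0], [0, 1, 3, 40]))
```

Each returned value corresponds to one point on the vertical axis for a given position on the time axis.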
The second observed period (3−40 seconds) is presented in Figure 3.21(b). Most of the runs
for application-sets up to 150 applications already finished, while the runs for the larger sets keep
the completion rate steady. Note that the observation period is larger, thus the slope looks steeper.
At the end of the observed period, only a small fraction of the largest application-sets did not
complete.
The results demonstrate an obvious trend regarding the duration of the mapping process across
application-set sizes; however, the mappings of two sets of the same size can report significantly
different durations, sometimes by an order of magnitude. To emphasize this fact, a different
representation of the same data is given in Figure 3.22. The horizontal axis stands for the
application-set size. The left and the right vertical axis represent the duration of the mapping
process, in the linear and logarithmic scale, respectively. Even though the whiskers extend 1.5
times the interquartile range beyond the 25th and 75th percentiles, which corresponds to
approximately 99.3% coverage for the normal distribution, it is visible that a non-negligible
number of runs falls outside the aforementioned area. This implies that the duration
of the mapping process may vary hugely, and highly depends on the properties of the
application-set to which it is applied.
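The 99.3% coverage figure is consistent with the common box-plot convention in which whiskers reach 1.5 times the interquartile range beyond the quartiles. Assuming that convention, the coverage for a normal distribution can be checked as follows; the function name is illustrative.

```python
from statistics import NormalDist


def whisker_coverage(iqr_factor=1.5):
    """Share of a normal distribution lying between box-plot whiskers."""
    std_normal = NormalDist()
    q3 = std_normal.inv_cdf(0.75)          # upper quartile, about 0.674 sigma
    reach = q3 + iqr_factor * (2 * q3)     # whisker end: Q3 + factor * IQR
    return 2 * std_normal.cdf(reach) - 1


print(f"{whisker_coverage():.1%}")
```

With a factor of zero the whiskers coincide with the quartiles and the coverage drops to 50%, which is why the quartiles alone cannot account for the quoted coverage.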
Overall, the duration of the mapping process increases exponentially with the number of
applications. However, it averages 12 seconds for the sets consisting of 200 applications, which
confirms that the proposed approach is applicable to most realistic scenarios in the real-time
embedded domain.
[Figure 3.22: Influence of application-set size on analysis time (2/2). Horizontal axis: application-set size (1 to 200); left vertical axis: analysis time (in seconds); right vertical axis: analysis time (in logarithmic scale)]
Another scalability test is performed by varying the grid size (from 8×8 to 16×16), while
keeping all the other parameters at the same values. The size of the application-set is 150. 1000
sets were generated and subsequently mapped. Figure 3.23(a) shows how the variation of the
grid size (horizontal axis) influences the analysis time (depth axis). Again a cumulative reciprocal
representation is used, where the vertical axis stands for the number of application-sets for which
the mapping did not finish within a given time.
It is visible that, within 6 seconds, in almost all cases the mapping completed for smaller grids,
while larger grids are more time consuming and report fewer completions in the same time interval.
In contrast to application-set size variations, the grid size variations cause a linear increase in the
duration of the mapping process, visible in Figure 3.23(b). Note that these results imply that, on
average, the mapping process on a platform with 144 cores takes only twice as long as on a
64-core one, assuming the same workload is mapped in both cases. The explanation is as follows.
Even though larger grids require more locations to be checked while mapping, at the same time this
gives the opportunity to map the applications in such a way that complex interference scenarios are
avoided, which decreases the duration of schedulability rechecks, since fewer contentions occur.
The number of outliers again confirms huge variations in the duration of the mapping process,
demonstrating that for some application-sets the duration of the mapping process can significantly
exceed the average time needed for that particular workload and platform size.
3.6.4.4 Case Study 3: Parameter P
The parameter P controls the number of applications which will be grouped in AH and hence
mapped during IP, while the rest of the applications will undergo a two-stage mapping process
in FP and OP. Note that FP and OP are computationally more intensive than IP, so
mapping more applications during FP + OP can cause a significant increase in the computation
time. However, this process is more thorough in search and potentially has higher chances of
finding better application mappings. Conversely, mapping most of the applications during IP
may save the computation time, but makes the efficiency of the mapping process highly dependent
on the right selection of the parameter G, as is shown in Case Study 4. Thus, an intuitive
assumption is that the selection of the parameter P creates a trade-off between the analysis time
and the solution quality.
[Figure 3.23: Influence of grid size on analysis time. Panels (a) and (b); panel (b) horizontal axis: grid size (8 to 16); vertical axis: analysis time (in seconds)]
Assuming G = 0.3 and Dη(ai) = 10·NSD(ai), ∀ai ∈ A, all the application-sets are schedulable
across the entire observed domain, where the parameter P is varied in the range [0−1]. The
application-set size is 150, and 1000 sets are created and mapped on an 8×8 platform, for each
incremental step of P, as shown in Figure 3.24(a). The timing analysis is performed (the right
vertical axis), but also the qualities of generated solutions are collected (the left vertical axis),
since the objective is to observe the effect of P on both the analysis time and the solution quality.
As expected, the increase of the parameter P causes the duration of the mapping process to
grow exponentially. As P increases, more applications undergo a more computationally intensive
mapping process, which results in longer analysis times. However, the results report a logarithmic
growth of the solution quality, almost on the entire domain. The only exception is the case when
P = 1, that is, when all applications are mapped during FP and OP.
The results suggest that putting more applications in AL and placing them during FP and
OP might not produce a solution with a significantly better quality. The explanation is that the
final mapping phase - OP , sorts the applications non-increasingly by the migration coefficient
and tries to optimise their placements in that order. Since OP does not have a greediness con-
trol mechanism, every application will greedily assume the widest possible mapping such that its
schedulability is preserved. Thus, most of the available network resources are consumed by the
applications that are considered early during OP , which leaves the ones considered later to be
able to barely optimise their placements, if at all. Another interesting conclusion is that mapping
some applications during IP (i.e. P < 1) may prevent the OP phase from optimising the mapping
in the most efficient way. This is confirmed by the substantial increase in the solution quality
when all applications undergo the optimisation phase (P = 1).
[Figure 3.24: Influence of parameter P on mapping process. (a) Solution quality (left vertical axis, ×10^4) and analysis time in seconds (right vertical axis) versus parameter P (0 to 1). (b) Fraction of total analysis time (in %) spent in IP mapping, FP mapping and OP mapping versus parameter P (0 to 1)]
In order to gain a more detailed insight into the influence of P on the duration of the mapping
process, the durations of the individual mapping phases are observed. Figure 3.24(b) shows
the fraction of the total analysis time that each phase consumes and compares them in relative
terms. For small values of P, almost all applications are mapped during IP, hence leaving less
workload for the subsequent phases. Until P = 0.25, IP witnesses an exponential decrease of
the analysis time, while the other two phases report a linear increase. Notice that although all
three stages have sub-quadratic complexity, as P increases the complexity of OP becomes
a dominant factor, while P > 0.75 produces scenarios where the analysis time is almost entirely
spent in OP. This is explained by the number of necessary schedulability rechecks, which in
the first two stages occur rarely and cause their almost linear complexity, while in the last mapping
stage they are performed regularly, thus contributing to the true sub-quadratic complexity.
3.6.4.5 Case Study 4: Parameter G
The parameter G controls the amount of greediness allowed to high-priority applications.
Intuitive reasoning suggests that a greater value of G causes a greater amount of traffic as a
consequence of spatially distributed mappings of applications from AH. Consequently, low-priority
applications, which get placed during FP and OP, will have fewer possibilities to assume well
distributed placements, since the network resources were greedily consumed by the applications
from AH. In some cases this may cause applications from AL to be unable to claim schedulability
even with narrow mappings, resulting in a failure of the mapping process. Conversely, a smaller
G may limit the distribution of the high-priority workload and unnecessarily reserve the network
for applications from AL, while there might be only a few of them. This can significantly impact
the solution quality.
The effects of the parameter G highly depend on the parameter P, since G applies only to
high-priority applications, whose number is directly controlled by the parameter P. For small
values of P, the influence of G is amplified the most. As P increases, the effects of G diminish and
at some point become negligible. To better observe the impact of G on the mapping process, the
following parameters were used: P = 0 and initially Dη(ai) = 10·NSD(ai), ∀ai ∈ A. Notice that
this parameter selection caused some application-sets not to be schedulable on the entire domain,
where the parameter G is varied in the range [0−1]. Thus, in this part of the experiment only the
application-sets which are schedulable on the entire domain are considered. The application-set
size is 150, and 1000 sets are created and mapped on an 8×8 platform, for each incremental step
of G, as shown in Figure 3.25(a). The timing analysis is performed (the right vertical axis), but
also the qualities of derived solutions are observed (the left vertical axis), so as to study the effect
of G on both the analysis time and the solution quality.
[Figure 3.25: Influence of parameter G on mapping process. (a) Solution quality (left vertical axis, ×10^4) and analysis time in seconds (right vertical axis) versus parameter G (0 to 1). (b) Feasible application-sets (in %) versus parameter G (0 to 1); series: Wi = 2×NSDi, Wi = 5×NSDi, Wi = 10×NSDi, Wi = 20×NSDi]
It is visible that the duration of the mapping process is constant and does not depend on the
parameter G. However, G has a significant effect on the solution quality. Specifically, as G
increases, high-priority applications gain more freedom to assume mappings wider than the narrow
ones, resulting in a steady increase of the solution quality. This effect is noticeable until
G = 0.75. For higher values of G, the solution quality starts to decrease, since a high G allows
greedy mappings of high-priority applications, and hence preserves fewer resources for the later
mapping stages. In fact, for G > 0.75, the network is so greedily consumed by high-priority
applications that low-priority ones barely claim schedulability with narrow mappings. Note that
high values of G not only significantly limit the optimisation of lower-priority workload during
OP, but also severely affect the schedulability, as explained below.
As already mentioned, Figure 3.25(a) considered only those application-sets which are
schedulable across the entire domain of G. Conversely, Figure 3.25(b) illustrates the impact of
the parameter G on the schedulability. The application-set size is 150, and 1000 sets are
created. For each generated application-set the individual per-application communication
deadlines are varied to be the following multiples of the respective NSDs: 2, 5, 10 and 20.
Subsequently, assuming again P = 0, the application-sets are mapped on an 8×8 platform, and of
interest is how the number of schedulable application-sets (the vertical axis) changes with the
parameter G (the horizontal axis).
It is visible that the parameter G significantly impacts the schedulability, and as G increases,
the number of schedulable application-sets decreases. This is expected, as giving more freedom to
high-priority applications allows them to consume the available network resources more greedily.
This inevitably leaves low-priority applications with fewer resources, which in many cases are
not enough to claim schedulability even with narrow mappings. Notice that even if the
communication deadlines are 20 times greater than the respective NSDs, when G = 1 only 30% of the
application-sets are schedulable.
3.6.5 Discussion
The efficiency of the mapping process depends significantly on the right selection of the
parameters P and G. More than that, it depends heavily on the characteristics of the
application-set to which it is applied. As observed through the experiments, there are no
individual values of P and G which yield the best solution for every application-set. These
experiments are useful to recognise and explain the general trends associated with these
parameters, but also to show that different mapping strategies can in some cases provide
competitive results and, conversely, that the same mapping strategy may vary substantially in
terms of efficiency when applied to different application-sets.
For instance, for sets with tight communication deadlines, setting low values of P and G will
quickly produce a schedulable mapping, but not one of significant quality. Increasing P in such
cases may result in a solution of better quality, but at the expense of additional time. If the
computational complexity is not a problem, setting P = 1 is the preferable option in such cases.
For sets where applications have similar communication deadlines, the best strategy is to invoke
balanced network consumption by keeping P low and G low or moderate, depending on the
tightness of the deadlines. Conversely, sets with significant differences in communication
deadlines will benefit the most from a spatial partialisation approach invoked by moderate or
high values of P and high values of G. This parameter selection will allow only a limited number
of applications to claim wide mappings and utilise the routes on the boundaries of the grid, and
at the same time keep the central cores free for the applications with tight communication
deadlines. Finally, for sets with relaxed communication deadlines, setting P low or moderate and
G high may result in a mapping with good solution quality, but may also deem the application-set
unschedulable with that particular parameter selection. In such cases, decreasing the parameter G
until the application-set becomes schedulable is a preferable option.
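The last guideline (lower G until the application-set becomes schedulable) can be sketched as a simple search. This is only an illustration under stated assumptions: the mapping process itself is not reproduced here and is represented by a hypothetical predicate `is_schedulable(p, g)`; the function name, the step size and the toy predicate in the usage note are likewise assumptions, not part of the proposed method.

```python
from typing import Callable, Optional


def largest_schedulable_g(
    is_schedulable: Callable[[float, float], bool],
    p: float,
    g_start: float = 1.0,
    step: float = 0.05,
) -> Optional[float]:
    """Decrease G from g_start towards 0 until the (hypothetical) mapper succeeds.

    `is_schedulable(p, g)` stands in for one run of the mapping process with
    parameters P = p and G = g; it is an assumption of this sketch.
    Returns the first (largest) G found schedulable, or None if none is.
    """
    steps = round(g_start / step)
    for k in range(steps, -1, -1):
        g = round(k * step, 10)  # avoid accumulating floating-point drift
        if is_schedulable(p, g):
            return g
    return None
```

For example, with a toy predicate `lambda p, g: g <= 0.6`, the call `largest_schedulable_g(pred, p=0.25)` returns 0.6. Starting from a high G and decreasing it reflects the observation that higher G tends to give better solution quality as long as schedulability is preserved.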
The advantage of the proposed method is that by setting P and G low, a schedulable solution
can be derived very quickly. This option may be useful in scenarios where limited computational
capacities are available, and the system designer is not very knowledgeable about the nature of
the workload. Alternatively, if additional computational capacities are available, the designer
might choose to increase P until reaching a desirable trade-off between the solution quality and
the computational complexity. Moreover, a designer who knows the workload characteristics (e.g.
the number of applications, the number of dispatchers, the tightness of the deadlines) can take
that knowledge into account and make an informed decision regarding the parameter G, which may
additionally improve the solution quality.
3.7 Memory Traffic
In this section, two worst-case analyses of the memory traffic are presented. The proposed
methods only consider the delays of memory operations within the NoC, while the delays occurring
within the memory controllers are out of the scope of this dissertation, and have been
extensively studied in other works (e.g. [78, 95]).
3.7.1 Workload
Each application ai performs a set of memory operations which are modelled as the flow-set
Fµ(ai). Each flow inherits the priority of its application, i.e. P(fj) = P(ai), ∀fj ∈ Fµ(ai),
and each flow fj can be one of the following memory operations: (i) a read request, (ii) a read
response, (iii) a write request, and (iv) a write response. Additionally, ω(fj) denotes the
maximum number of occurrences of a given memory operation within one inter-arrival of its
application. An illustrative example of memory operations is given in Figure 3.26, where the
flow f1 represents the read request sent to the memory controller µ1, while the flow f2
symbolises the subsequent read response. Similarly, f3 represents the write request sent to the
memory controller µ2, while f4 depicts the subsequent write response.
Figure 3.26: Memory operations
3.7.2 Challenges and Inapplicability of Existing Techniques
The problem of non-deterministic flow-paths, due to the master volatility property, is also
present in the memory traffic. This is illustrated in Figure 3.27, where a 4-dispatcher
application sends a read request to the memory controller µ1. As is visible, depending on which
dispatcher is the current master, the memory operation may involve any of the flows f1, f2, f3
and f4, which are mutually exclusive.
Given that the accessed memory controller is the same, irrespective of the current master
dispatcher, one way to solve the aforementioned problem is to use the proxy dispatcher for memory
operations. An example of such an approach is illustrated in Figure 3.28. In this scenario, a
master sends the read request via f1, f2 or f3 to its proxy (emphasised with a darker colour in
Figure 3.28). Then, the proxy forwards the request via f4 to the memory controller on behalf of
its master. Upon receiving the response, the proxy forwards it back to the master. However, the
number of memory requests that an application issues within its minimum inter-arrival period can
reach thousands, which would create a substantial overhead on proxy cores and lead to poor
performance. Thus, although applicable to inter-application communication, this approach does
not scale well enough to be applied to memory traffic as well.
Figure 3.27: Master volatility problem for memory operations
Figure 3.28: Memory operations via proxy dispatchers
3.7.3 Access Constraints and Bounding Messages
A more efficient way to solve the message non-determinism problem is to enforce the access
constraints described in Definition 8.
Definition 8 (Memory accesses). If an application accesses a memory controller, all dispatchers
access it from the same tile, which is (i) for the leftmost controllers (i.e. top and bottom)
chosen such that it is either in the column of the application's leftmost dispatcher, or to the
left of it, and (ii) for the rightmost controllers, chosen such that it is either in the column
of the application's rightmost dispatcher, or to the right of it.
In other words, on a platform with x× y routers, if an application’s leftmost and rightmost
dispatchers occupy columns i and j respectively, then the leftmost controllers can be accessed
via columns 1 : i and the rightmost via columns j : x. The intention behind this approach is to
cause overlapping paths of mutually exclusive messages (notice the visual difference between
Figure 3.29(a) and Figure 3.27). Now, a new construct called the bounding message is defined. It
represents a generalisation of overlapped, mutually exclusive messages.
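The column ranges implied by Definition 8 can be written down directly. The following is a minimal sketch, assuming columns are numbered 1..x from left to right as in the text; the function name and argument names are illustrative only.

```python
from typing import List, Tuple


def allowed_access_columns(
    leftmost_col: int, rightmost_col: int, num_cols: int
) -> Tuple[List[int], List[int]]:
    """Return the columns through which the leftmost and rightmost
    memory controllers may be accessed (Definition 8).

    leftmost_col  -- column i of the application's leftmost dispatcher
    rightmost_col -- column j of the application's rightmost dispatcher
    num_cols      -- number of columns x of the platform (1-based numbering)
    """
    if not (1 <= leftmost_col <= rightmost_col <= num_cols):
        raise ValueError("expected 1 <= i <= j <= x")
    via_left = list(range(1, leftmost_col + 1))            # columns 1 : i
    via_right = list(range(rightmost_col, num_cols + 1))   # columns j : x
    return via_left, via_right
```

For an application spanning columns 3 to 5 on an 8-column platform, the leftmost controllers may be reached via columns 1 to 3 and the rightmost ones via columns 5 to 8.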
Definition 9 (Bounding message). The bounding message f is a message that exists for each
memory operation that an application performs, and for each row and column of its shape in which
it has dispatchers. It connects a memory controller with the dispatcher from a given row/column
that is furthest from that controller.
The per-row bounding message is directed from the tile to the memory controller, and represents
a generalisation of overlapped, mutually exclusive requests from its row. An application with a
rectangular shape and 4 dispatchers has 2 bounding messages for requests towards the memory
controller (e.g. f1 and f2 towards µ1 in Figure 3.29(a)). Similarly, the per-column bounding
message is directed from the memory controller to the router, and represents a generalisation of
overlapped, mutually exclusive responses towards its column (see Figure 3.29(b)).
Figure 3.29: Bounding messages. (a) Requests; (b) Responses
The benefits of bounding messages can be seen by considering all possible read request messages
that might be generated from the bottom row in Figure 3.29(a). These read requests are
mutually exclusive, as they belong to different dispatchers of the same application. Additionally,
each of them shares part or all of its path with the bounding message f1. Thus, the maximum delay
of f1 is an upper bound on the worst-case delays of the respective memory operations originating
from the dispatchers of the bottom row. Therefore, a bounding message can substitute
all mutually exclusive read request messages from the same row, targeting the same controller.
The same applies to write requests. Similarly, a bounding message can substitute all mutually
exclusive read and write responses from the same controller, targeting the same column (e.g. in
Figure 3.29(b) the bounding message f2 can substitute responses towards the dispatchers located
in the right column). Notice that bounding messages have properties similar to those of
supermessages, which were used for the intra-application communication traffic. Specifically,
both constructs are master-independent and known at design-time. Therefore, the objective is to
express all memory traffic with the set of bounding messages, and perform the worst-case analysis
on such a model. It is trivial to see that if a bounding message fulfils the posed timing
constraint, all substituted messages also fulfil it.
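The selection of bounding-message endpoints from Definition 9 can be illustrated with a small sketch. It rests on assumptions that are not stated in the text: tiles are given as hypothetical (column, row) coordinates, the memory controller is represented by the coordinates of its attachment point, and "furthest" is measured as Manhattan distance.

```python
from typing import Dict, Iterable, Tuple

Tile = Tuple[int, int]  # (column, row); the coordinate convention is an assumption


def bounding_endpoints(
    dispatchers: Iterable[Tile], controller: Tile
) -> Tuple[Dict[int, Tile], Dict[int, Tile]]:
    """For every row and every column that hosts dispatchers, pick the
    dispatcher furthest (Manhattan distance, assumed) from the controller.

    Returns (per_row, per_col): per_row[r] is the source of the per-row
    (request) bounding message, per_col[c] the destination of the
    per-column (response) bounding message.
    """
    def dist(tile: Tile) -> int:
        return abs(tile[0] - controller[0]) + abs(tile[1] - controller[1])

    per_row: Dict[int, Tile] = {}
    per_col: Dict[int, Tile] = {}
    for d in dispatchers:
        col, row = d
        if row not in per_row or dist(d) > dist(per_row[row]):
            per_row[row] = d
        if col not in per_col or dist(d) > dist(per_col[col]):
            per_col[col] = d
    return per_row, per_col
```

For a rectangular application with 4 dispatchers this yields two per-row and two per-column endpoints, which matches the count of request and response bounding messages described above.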
Note that it may appear that the analysis of bounding messages inherently brings pessimism,
since many messages may share only a small fraction of the path with the bounding message, whose
delay they will assume as the corresponding upper bound. This is not the case. Timing
constraints are posed on groups of mutually exclusive messages and not on individual messages.
Thus, the only important thing is whether all possible mutually exclusive messages (arising from
different dispatcher positions) meet a certain constraint, while the tightness of the estimate
for individual messages is irrelevant.
Also note that a bounding message exists for every memory operation. Therefore, if an
application performs both read and write operations with a memory controller, then, just as
there are distinct read and write messages, there will also be distinct bounding messages for
both operations.
3.7.4 Solution to Mutually Exclusive Bounding Messages
Bounding messages substitute mutually exclusive requests from the same row, and mutually
exclusive responses to the same column. However, mutually exclusive bounding messages still
exist. For example, in Figure 3.29(a), f1 and f2 are mutually exclusive; the same is true for f1
and f2 in Figure 3.29(b).
In this section, a method to solve this problem is proposed. Let FE(fj) be the set of all
mutually exclusive bounding messages of fj, including fj. As only one dispatcher can be a master
during one minimum inter-arrival of ai, only one of the mutually exclusive bounding
messages may exist². Observe that all of them have the same flow properties (i.e. size, priority,
number of occurrences), and differ only in the traversal latency, and subsequently in the
interference delay they can cause per single occurrence (Equation 2.1). Thus, when a bounding
message has multiple mutually exclusive bounding messages in its list of directly interfering
messages, it is sufficient to assume one of them, namely the one with the maximum traversal
latency. The conclusion reached for FE(fj) is also valid for every set of mutually exclusive
bounding messages of every application.
3.7.5 Approach One: Per-Packet Analysis
In this approach, of interest is the maximum delay of a single occurrence (a single packet
traversal) of a bounding message fi. Let FD(fi) be the set of directly interfering bounding
messages of fi. As already discussed, when computing the interference, not all bounding messages
from FD(fi) should be considered, as some of them are mutually exclusive. Thus, a reduced set of
directly interfering bounding messages is defined as follows. If two or more bounding messages
from FD(fi) are mutually exclusive, only the one with the highest traversal latency is added to
the set FR(fi). Subsequently, only bounding messages from FR(fi) will be treated as interfering
messages.
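Formally, FR(fi) can be defined as follows:
\[
\forall f_j \in \mathcal{F}_D(f_i) \;\mid\; \not\exists f_k \in \mathcal{F}_D(f_i) \,\wedge\, f_k \in \mathcal{F}_E(f_j) \,\wedge\, C(f_k) > C(f_j) \;\Rightarrow\; f_j \in \mathcal{F}_R(f_i) \tag{3.63}
\]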
² It is assumed that memory operations are not needed during agreement protocols, and that the
data which is necessary for the protocol execution (if any) is available within local caches.
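The construction of the reduced set from Equation 3.63 can be sketched as follows. The data layout is an assumption made for illustration: each bounding message carries a hypothetical `group` label identifying its set of mutually exclusive messages FE, and ties in latency are broken arbitrarily (the first message seen is kept), whereas a literal reading of the formula would keep all tied messages.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class BoundingMessage:
    name: str
    app: str        # application the message belongs to
    group: str      # label of its mutually exclusive set F_E (assumed encoding)
    latency: float  # traversal latency C(f)


def reduced_interfering_set(direct: Iterable[BoundingMessage]) -> List[BoundingMessage]:
    """From the directly interfering set F_D, keep one message per mutually
    exclusive group: the one with the highest traversal latency."""
    best: Dict[Tuple[str, str], BoundingMessage] = {}
    for f in direct:
        key = (f.app, f.group)
        if key not in best or f.latency > best[key].latency:
            best[key] = f
    return list(best.values())
```

For instance, if two mutually exclusive requests of the same application with latencies 4 and 6 both interfere with the message under analysis, only the one with latency 6 remains in the reduced set.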
The worst-case traversal time of a single occurrence (a single packet traversal) of fi can be
computed by solving Equation 3.64, where a_{f_j} denotes the application to which fj belongs.
\[
WCTT(f_i) = C(f_i) + \sum_{\forall f_j \in \mathcal{F}_R(f_i)} C(f_j) \cdot \omega(f_j) \cdot \overbrace{\left( 1 + \left\lceil \frac{WCTT(f_i) - D^{\eta}(a_{f_j})}{T(a_{f_j})} \right\rceil \right)}^{\text{number of job releases}} \tag{3.64}
\]
Note that the term in the brackets represents the maximum number of job releases of the
application a_{f_j} within the observed time interval. The proof is omitted because it is very
similar to the proof of Theorem 6, which computes the number of protocol executions within the
observed time interval.
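Because WCTT(fi) appears on both sides of Equation 3.64, it has to be solved iteratively. A minimal sketch follows, assuming the usual fixed-point iteration known from response-time analysis (start from C(fi) and re-evaluate until the value stops changing); the text does not prescribe a particular solver, so this choice, the `limit` cut-off and all identifiers are assumptions. The sketch also assumes Dη ≤ T for every interfering application, so that the number of job releases never drops below one.

```python
import math
from typing import NamedTuple, Optional, Sequence


class Interferer(NamedTuple):
    latency: float    # C(f_j)
    occurrences: int  # ω(f_j)
    d_eta: float      # D^η of the application owning f_j
    period: float     # T of the application owning f_j


def wctt_per_packet(
    c_i: float, reduced: Sequence[Interferer], limit: float
) -> Optional[float]:
    """Solve Equation 3.64 by fixed-point iteration (assumed solver).

    Returns the converged WCTT, or None once the estimate exceeds `limit`
    (e.g. the relevant deadline), i.e. no useful bound was found.
    """
    w = c_i
    while w <= limit:
        nxt = c_i + sum(
            f.latency * f.occurrences * (1 + math.ceil((w - f.d_eta) / f.period))
            for f in reduced
        )
        if nxt == w:  # fixed point reached
            return w
        w = nxt
    return None
```

With C(fi) = 10 and a single interferer with C = 5, ω = 2, Dη = 20 and T = 100, the iteration settles at 20 after one step; with a heavily loaded interferer the estimate keeps growing and the function reports failure once `limit` is exceeded.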
Upon obtaining the worst-case traversal time of single occurrences of bounding messages, it
is possible to derive the worst-case memory traffic delay of an entire application. First, the
reduced set of bounding messages F^R(ai) is defined for an application ai as follows. From each
set of mutually exclusive bounding messages of ai, only the one with the maximum delay is added
to F^R(ai). Formally, F^R(ai) is defined as follows:
\[
\forall f_j \in \mathcal{F}(a_i) \;\mid\; \not\exists f_k \in \mathcal{F}_E(f_j) \,\wedge\, WCTT(f_k) > WCTT(f_j) \;\Rightarrow\; f_j \in \mathcal{F}^{R}(a_i) \tag{3.65}
\]
In Equation 3.65, the term F(ai) represents the set of all bounding messages belonging to ai.
Now the worst-case memory traffic delay of an application ai, termed Rµ(ai), can be computed
by solving Equation 3.66.
\[
R^{\mu}(a_i) = \sum_{\forall f_j \in \mathcal{F}^{R}(a_i)} WCTT(f_j) \cdot \omega(f_j) \tag{3.66}
\]
In this, and the subsequently proposed methods, the condition for the schedulability with re-
spect to the memory traffic is Rµ(ai)≤ Dµ(ai).
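Equations 3.65 and 3.66, together with the schedulability condition, can be combined into a short sketch. The input format is an assumption: each bounding message of the application is given as a tuple of a hypothetical group label (identifying its mutually exclusive set), its per-packet WCTT and its number of occurrences ω.

```python
from typing import Dict, Iterable, Tuple


def memory_delay_per_packet(messages: Iterable[Tuple[str, float, int]]) -> float:
    """Worst-case memory traffic delay R^µ of one application (Equation 3.66).

    messages -- (group, wctt, occurrences) per bounding message; `group`
                labels the mutually exclusive set the message belongs to.
    Per group only the message with the largest WCTT is kept (Equation 3.65).
    """
    worst: Dict[str, Tuple[float, int]] = {}
    for group, wctt, occurrences in messages:
        if group not in worst or wctt > worst[group][0]:
            worst[group] = (wctt, occurrences)
    return sum(wctt * occurrences for wctt, occurrences in worst.values())


def memory_schedulable(r_mu: float, d_mu: float) -> bool:
    """Schedulability condition with respect to memory traffic: R^µ <= D^µ."""
    return r_mu <= d_mu
```

As an example, two mutually exclusive read requests with WCTTs 20 and 54 and two mutually exclusive read responses with WCTTs 30 and 25, each occurring 3 times, give a delay of 54·3 + 30·3 = 252 time units.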
3.7.6 Intermediate Step: Partial Per-Pattern Analysis
Obtaining the worst-case delay of a single occurrence of a bounding message, and subsequently
multiplying it by the number of occurrences, might be a beneficial approach for
computation-intensive applications, where few memory requests occur. However, memory-intensive
applications have hundreds, if not thousands, of memory requests per job execution. Thus,
assuming the worst-case scenario for every occurrence of a memory operation might be very
pessimistic. In this section, a new method to compute the worst-case delay of bounding messages
is presented. The focus of this method is not on a single message occurrence, but rather on a
group of occurrences, i.e. on the pattern. Specifically, the goal is to compute the worst-case
delay of a bounding message fi, assuming all ω(fi) occurrences that happen within one minimum
inter-arrival period of an application.
Similarly to the previous approach, for the bounding message under analysis fi, a reduced set
of directly interfering messages FR(fi) is defined, i.e. a set where only one of the mutually
exclusive bounding messages exists. Subsequently, the worst-case delay of fi is computed by
Equation 3.67. Note that, as occurrences of fi might be spread within the interval corresponding
to the joint deadline of the computation and the memory traffic, the interference has to be
computed within that period: Dτ+µ(ai). For that reason, Equation 3.67 is not recursive.
\[
WCTT(\omega(f_i)) = C(f_i) \cdot \omega(f_i) + \sum_{\forall f_j \in \mathcal{F}_R(f_i)} C(f_j) \cdot \omega(f_j) \cdot \left( 1 + \left\lceil \frac{D^{\tau+\mu}(a_{f_i}) - D^{\eta}(a_{f_j})}{T(a_{f_j})} \right\rceil \right) \tag{3.67}
\]
Now, Equation 3.68 is used to compute the worst-case memory traffic delay of an application.
\[
R^{\mu}(a_i) = \sum_{\forall f_j \in \mathcal{F}^{R}(a_i)} WCTT(\omega(f_j)) \tag{3.68}
\]
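Since Equation 3.67 is a closed-form expression, it can be evaluated in a single pass. The sketch below is illustrative only; the `Interferer` record and all parameter names are assumptions, and time values are taken to be in one common unit.

```python
import math
from typing import NamedTuple, Sequence


class Interferer(NamedTuple):
    latency: float    # C(f_j)
    occurrences: int  # ω(f_j)
    d_eta: float      # D^η of the application owning f_j
    period: float     # T of the application owning f_j


def wctt_pattern(
    c_i: float, omega_i: int, d_joint: float, reduced: Sequence[Interferer]
) -> float:
    """Worst-case delay of all ω(f_i) occurrences of f_i (Equation 3.67).

    d_joint -- joint deadline D^{τ+µ} of the application owning f_i;
               the interference is accounted for once over this interval.
    """
    own = c_i * omega_i
    interference = sum(
        f.latency * f.occurrences * (1 + math.ceil((d_joint - f.d_eta) / f.period))
        for f in reduced
    )
    return own + interference
```

With C(fi) = 10, a joint deadline of 750 and one interferer (C = 5, ω = 2, Dη = 20, T = 100), the interference term is 5·2·9 = 90. For ω(fi) = 100 the pattern delay is therefore 1090, whereas a per-packet bound of 20 per occurrence, as obtained from Equation 3.64 for the same inputs, would accumulate to 2000. For ω(fi) = 1 the situation reverses (100 versus 20), which illustrates why the per-packet analysis can remain preferable when memory requests are few.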
3.7.7 Approach Two: Full per-pattern analysis
The previous method is referred to as the partial per-pattern analysis. Its estimates are never
tighter than those of the approach presented in this section, so it is considered only an
intermediate step. In this section, a method called the full per-pattern analysis is presented.
As already mentioned, distinct bounding messages exist for each memory operation, irrespective
of the fact that some of them may have overlapping paths. Therefore, if an application performs
read and write operations with a memory controller, distinct bounding messages will be generated
for a read request, a read response, a write request and a write response. Although bounding
messages of read and write requests have different sizes and numbers of occurrences, they have
the same priority and some of them share the same paths. In such cases they have the same list of
directly interfering messages. Therefore, the idea behind this approach is to merge bounding
messages of read and write requests which share the same path, and compute their grouped delay.
The same idea is applied to read and write responses.
Consider fi and fj, which are bounding messages of a read and a write request of one
application, and which share the same path. Their joint reduced list of directly interfering
messages FR(fi,j) is equivalent to their individual lists, i.e. FR(fi,j) = FR(fi) = FR(fj). Their
worst-case grouped delay WCTT(ω(fi,j)) can be expressed by Equation 3.69. For the same
reasons as for the partial per-pattern analysis, Equation 3.69 is not recursive.
Note that this approach behaves identically to the partial per-pattern analysis in cases where
no bounding messages can be merged, i.e. Equation 3.69 becomes Equation 3.67. On the
other hand, intuitively, this approach should provide tighter estimates than the partial
per-pattern analysis in cases where bounding messages can be merged, i.e. both read and write
operations are performed with the same controller. This is further investigated in Section 3.7.8.
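\[
WCTT(\omega(f_{i,j})) = C(f_i) \cdot \omega(f_i) + C(f_j) \cdot \omega(f_j) + \sum_{\forall f_k \in \mathcal{F}_R(f_{i,j})} C(f_k) \cdot \omega(f_k) \cdot \left( 1 + \left\lceil \frac{D^{\tau+\mu}(a_{f_{i,j}}) - D^{\eta}(a_{f_k})}{T(a_{f_k})} \right\rceil \right) \tag{3.69}
\]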
Now, Equation 3.68 can be used to compute the worst-case memory traffic delay of an application.
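The benefit of merging can be seen by evaluating Equation 3.69 next to two separate evaluations of Equation 3.67. The sketch below uses illustrative names and made-up numbers; only the structure of the formulas is taken from the text.

```python
import math
from typing import NamedTuple, Sequence


class Interferer(NamedTuple):
    latency: float    # C(f_k)
    occurrences: int  # ω(f_k)
    d_eta: float      # D^η of the application owning f_k
    period: float     # T of the application owning f_k


def interference(d_joint: float, reduced: Sequence[Interferer]) -> float:
    """Interference term shared by Equations 3.67 and 3.69."""
    return sum(
        f.latency * f.occurrences * (1 + math.ceil((d_joint - f.d_eta) / f.period))
        for f in reduced
    )


def wctt_merged(c_i, omega_i, c_j, omega_j, d_joint, reduced) -> float:
    """Grouped delay of two merged bounding messages (Equation 3.69):
    both own-traffic terms are summed, the interference is counted once."""
    return c_i * omega_i + c_j * omega_j + interference(d_joint, reduced)


def wctt_unmerged(c_i, omega_i, c_j, omega_j, d_joint, reduced) -> float:
    """Sum of two separate partial per-pattern delays (Equation 3.67):
    the same interference is counted once per message, i.e. twice."""
    return (c_i * omega_i + interference(d_joint, reduced)) + (
        c_j * omega_j + interference(d_joint, reduced)
    )
```

The difference between the two results is exactly one interference term, which is the pessimism that the full per-pattern analysis removes when a read request and a write request share a path. With no interfering messages both functions return the same value.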
3.7.8 Experimental Evaluation
In this section, the efficiency of the proposed approaches is evaluated. Specifically, the
tightness of the results derived with all three methods is compared, and it is investigated how
these trends change with different application parameters, e.g. the priority, the minimum
inter-arrival period, the number of memory operations. Furthermore, some practical conclusions
concerning routing policies and the distribution of memory accesses across memory controllers
are derived.
3.7.8.1 Analysis parameters
Analysis parameters are given in Table 3.6. An asterisk sign denotes a randomly generated value,
assuming a uniform distribution.
Table 3.6: Analysis parameters for Section 3.7.8
Parameter                                                  Symbol                   Value
NoC topology and size                                      -                        2-D mesh with 8×8 routers
Link width = flit size                                     σflit                    16 bytes
Router frequency                                           νρ                       2 GHz
Routing delay                                              δρ                       3 cycles (1.5 ns)
Link traversal delay                                       δL                       1 cycle (0.5 ns)
Application-set size                                       |A|                      200 applications
Number of dispatchers per application                      |D(ai)|                  [2−10]*
Application periods                                        D(ai) = T(ai), ∀ai ∈ A   [30−1000]* ms
Joint computation and memory operations deadlines          Dτ+µ(ai), ∀ai ∈ A        0.75·D(ai)
Memory operations deadlines                                Dµ(ai), ∀ai ∈ A          0.15·D(ai)
Different memory controllers accessed by one application   -                        [1−4]*
Memory requests of computation-intensive applications      ω(fi)                    [1−50]*
Memory requests of memory-intensive applications           ω(fi)                    [100−1000]*
Control message size (read request and write response)     σ(fi)                    32 bytes
Content message size (read response and write request)     σ(fi)                    1 Kbyte
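A few of the values in Table 3.6 are related by simple arithmetic, which the following sketch reproduces as a sanity check. It assumes that 1 Kbyte means 1024 bytes and that messages are split into whole flits; neither is stated explicitly in the table, and the helper names are illustrative.

```python
FLIT_BYTES = 16   # link width = flit size (Table 3.6)
ROUTER_GHZ = 2    # router frequency in GHz, i.e. cycles per nanosecond


def flits(message_bytes: int) -> int:
    """Number of flits needed for a message (rounded up to whole flits)."""
    return -(-message_bytes // FLIT_BYTES)


def cycles_to_ns(cycles: int) -> float:
    """Duration of a number of router cycles in nanoseconds."""
    return cycles / ROUTER_GHZ


control_flits = flits(32)     # read request / write response
content_flits = flits(1024)   # read response / write request (1 Kbyte = 1024 bytes assumed)
routing_delay_ns = cycles_to_ns(3)
link_delay_ns = cycles_to_ns(1)
```

Under these assumptions a control message occupies 2 flits and a content message 64 flits, while the routing and link traversal delays evaluate to 1.5 ns and 0.5 ns, in agreement with the table.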
3.7.8.2 Experiment 1: Overall Comparison
In this experiment, the overall comparison of the proposed approaches is performed. Each
application of an application-set is randomly mapped on the platform, subject to the dispatcher
placement constraints (Definition 3). Subsequently, bounding messages of memory operations are
generated. Then, the worst-case memory traffic delay is computed for each application, with all
three methods. Finally, the obtained values are compared. The process is repeated for 1000
application-sets.
Figure 3.30: Analyses comparison. (a) Partial per-pattern vs. per-packet; (b) Full per-pattern
vs. partial per-pattern
Figure 3.30(a) shows the improvements of the partial per-pattern analysis over the per-packet
analysis. It is visible that the per-packet analysis renders tighter estimates for only 3.64% of
applications. This fact will receive additional attention in Experiment 4. In 0.53% of the cases,
both methods derive the same results, while in all other cases the partial per-pattern analysis
performs better. Specifically, for more than half of the applications, the improvements are
greater than 90%, which means that the estimates are tighter by a factor of 10 or more.
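The step from "90% improvement" to "a factor of 10" can be made explicit. The following assumes that the improvement I is measured relative to the per-packet estimate; this definition, as well as the subscripts pkt and pat used to distinguish the two estimates, are assumptions introduced here for illustration.

```latex
\[
I = \frac{R^{\mu}_{pkt}(a_i) - R^{\mu}_{pat}(a_i)}{R^{\mu}_{pkt}(a_i)} \ge 0.9
\;\Longleftrightarrow\;
R^{\mu}_{pat}(a_i) \le 0.1 \cdot R^{\mu}_{pkt}(a_i)
\;\Longleftrightarrow\;
\frac{R^{\mu}_{pkt}(a_i)}{R^{\mu}_{pat}(a_i)} \ge 10
\]
```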
Figure 3.30(b) compares the full per-pattern analysis and the partial per-pattern analysis. For
4.20% of applications both methods derive the same values. As discussed when the analyses were
presented, these are the cases when an application performs only one (read or write) operation
with every memory controller that it communicates with. In the rest of the scenarios, the full
per-pattern analysis provides improvements, which in this case reach up to 50%. An additional
conclusion is that the full per-pattern analysis dominates the partial per-pattern analysis
(it is strictly better or equal). This coincides with the intuitive assumption from the previous
section.
3.7.8.3 Experiment 2: Distribution of memory accesses across controllers
In this experiment, it is investigated how the distribution of memory accesses across memory
controllers impacts the analysis. To that end, it is assumed that all data that an application
needs is fetched from only one memory controller. As before, the worst-case delays of each
application of the application-set are derived, and the process is repeated for 1000
application-sets, assuming the full per-pattern analysis. The obtained values are compared with
the results from the previous experiment for the full per-pattern analysis, where every
application may access multiple memory controllers.
The improvements achieved by a scheme where each application accesses only a single controller
are shown in Figure 3.31. Surprisingly, this approach does not always yield better
results. In fact, for 9.27% of applications this scheme renders worse estimates. The explanation
is that, when an application accesses only a single controller, the path is "greedily" consumed
by its entire traffic. Consequently, some links of the NoC infrastructure are extensively used,
causing substantial interference to any lower-priority application which utilises that path.
Conversely, when each application accesses multiple memory controllers, the traffic is more
evenly distributed, so lower-priority applications suffer less interference. However, the rest
of the applications benefit from the new scheme and have tighter estimates. Thus, it can be
concluded that the distribution of memory accesses plays an important role in the worst-case
analysis of memory traffic and is a potential topic for future research.
Figure 3.31: Improvement with single controller
Figure 3.32: Different memory operations (quantity of messages in % versus contribution to
cumulative delay in %, for read requests, write requests, read responses and write responses)
3.7.8.4 Experiment 3: Different memory operations and messages
In this experiment the objective is to quantify the fraction of the total delay spent in each individual
component, namely read requests, read responses, write requests and write responses. In order
to investigate that, the per-packet analysis is performed. Of interest are only applications that
contain all 4 components, i.e. perform both read and write operations with the memory controller.
The analysis is performed for each such application in the application-set, and repeated for 1000
application-sets. Upon obtaining the delays of single packet occurrences, the contributions of
individual delays in the cumulative delay of all 4 packets are estimated.
The results are presented in Figure 3.32. It is visible that the curves of read and write
request messages almost overlap. Both messages share the same path and differ only in the
message size. The same is also true for both response messages. Thus, the first conclusion is
that the message size has almost no influence on the delay. Additionally, responses have
significantly higher delays than the requests, each averaging 27% of the total cumulative delay.
Requests consume less, around 23% each. The explanation for this surprising finding is that the
X-Y routing mechanism is not very efficient for memory responses [2]. Specifically, each response
message is injected into the NoC either in the topmost or the bottommost row. In either case, a
message first traverses along the x-axis. However, all other responses do the same. This can
cause a large amount of contention within the topmost and the bottommost row. The effects in
Figure 3.32 are substantially mitigated by the existence of per-priority virtual channels, which
reduce the effects of indirect interference. However, in scenarios with a single virtual
channel, this can have a more significant impact on the delays of response messages, thus
further motivating research in the area of routing mechanisms. This topic is also potential
future work.
Figure 3.33: Full per-pattern vs. per-packet
Figure 3.34: Penalty of multiple controllers (penalty in % versus priority, for 2, 3 and 4
memory controllers)
3.7.8.5 Experiment 4: Computation-intensive applications
In this experiment the focus is on the applicability of the proposed methods to
computation-intensive applications. For each such application the number of memory operations
and the minimum inter-arrival periods are varied. Subsequently, the per-packet and the full
per-pattern analyses are performed, and the obtained values are compared. The computation is
repeated for 1000 application-sets.
The results are depicted in Figure 3.33. It is visible that the per-packet analysis renders
tighter estimates for applications with few memory operations and long periods. As the number of
operations increases, its improvements over the full per-pattern analysis decline. A further
increase favours the full per-pattern analysis, which outperforms the per-packet analysis on the
rest of the domain. Similar trends apply to the minimum inter-arrival periods of applications,
where any decrease also decreases the benefits of the per-packet analysis.
3.7.8.6 Experiment 5: Memory-intensive applications
It is obvious that the full per-pattern analysis is the most suitable method for memory-intensive
applications, thus that aspect is not investigated. In this experiment the focus is again on the
distribution of memory accesses across memory controllers. Experiment 2 demonstrated the positive
and negative sides of having the data of each application fetched from a single memory
controller. However, in some cases an application has to be mapped on a platform which already
hosts an existing application-set. In such cases, dedicating only one memory controller to it
might be a good decision from the perspective of that application, but might have a negative
impact on the already existing workload (as shown in Experiment 2). Thus, in order to minimise
the effects of the new application on the existing system, it might be more beneficial to
distribute its memory accesses across multiple memory controllers. Motivated by this reasoning,
the objective of this experiment is to quantify the individual, per-application overheads of
having its memory content fetched from multiple memory controllers. For each memory-intensive
application the priority is varied and the full per-pattern analysis is performed, assuming its
memory operations are equally spread across: (i) only one controller, (ii) two controllers,
(iii) three controllers and (iv) four controllers. Subsequently, the penalty of accessing
multiple controllers is estimated, relative to the scheme where only one controller is accessed.
The experiment is repeated for all 1000 application-sets.
Figure 3.34 shows the penalty of having the data fetched from multiple memory controllers.
Surprisingly, the priority has a very small impact on the results. For higher priorities (smaller
numbers), applications suffer fewer contentions, thus the penalty of accessing multiple controllers
is predominantly composed of increased traversal distances. As priority decreases, schemes with
multiple controllers suffer interference, because their traffic is spread across the grid. Conversely,
scenarios with single controller accesses consume less NoC infrastructure, and in many cases still
manage to avoid the interference. This causes the increase in the penalty. Around the priority level
10, the interference becomes predominant, resulting in significant delays even for single controller
accessing schemes, thus a slight drop in the penalty is visible. Finally, the penalty stays constant
on the rest of the domain. As is visible, the same trends apply for all scenarios involving accesses
to multiple controllers. Moreover, the results suggest that the distribution of an application’s data
accesses among multiple controllers is very "expensive". Thus, one of the strategies when adding
a new application might be to distribute its accesses among as many different memory controllers
as possible, such that its temporal constraints are still fulfilled. In this way, the impact of that
application on the existing system is minimised.
3.8 Computation
The focus of this section is on the computation requirements. As already defined, each application
ai releases a job with the execution time Cτ(ai) and the deadline Dτ(ai). Moreover, due to the
fact that the computation process and the memory requests are mutually interleaved, the observed
interval is Dτ+µ(ai).
In this section, only an initial step towards the worst-case computation delay analysis is pro-
posed. In particular, this approach is based on the simplifying assumption that the delays of
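communication-related OS operations on each core are incomparably smaller than the computation
delays and hence can be neglected, which can be expressed with Equation 3.70.
\[
\max\{\delta_P^{\rightarrow}, \delta_P^{\leftarrow}, \delta_C^{\rightarrow}, \delta_C^{\leftarrow}, \delta_Q, \delta_E, \delta_I^{\rightarrow}, \delta_I^{\leftarrow}\} \ll C_\tau(a_i), \quad \forall a_i \in \mathcal{A} \tag{3.70}
\]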
An extended approach, which would render the aforementioned assumption obsolete, is left as
potential future work. Moreover, for the presented coarse-grained analysis a mapping process is
proposed, which takes into account computation requirements.
3.8.1 Core Shutdowns
Unlike state-of-the-art methods, which assume that all cores are always operational,
in this approach occasional core shutdowns are allowed for various beneficial reasons, e.g. power
and thermal management. Note that the previously presented communication- and memory-related
worst-case analyses of LMM are not violated with this assumption, and the explanation is twofold:
(i) shutting down cores has no effect on the message transfer and rerouting operations, because the
same are performed by routers, and (ii) dispatchers located on cores which are shut down can be
treated as non-existent.
Once a core is selected for shutting down, it continues to execute the already assigned workload,
but rejects new job releases. When the last previously assigned job completes,
the core will become temporarily unavailable (e.g. sleep interval, cooling period). Depending on
the purpose, a system designer might choose to apply more/less aggressive load balancing strate-
gies, involving more/less frequent core shutdowns. As a means to control that, a parameter K is
introduced, which symbolises the maximum number of concurrent core shutdowns. The value of
K is assumed to be already specified and is not elaborated upon in this dissertation.
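The shutdown behaviour described above can be illustrated with a small state machine. This is only a sketch under stated assumptions: the class and method names are hypothetical, and it is assumed here that a core which is still finishing its assigned jobs already counts towards the limit K.

```python
from enum import Enum


class CoreState(Enum):
    OPERATIONAL = "operational"
    DRAINING = "draining"        # selected for shutdown, finishing assigned jobs
    UNAVAILABLE = "unavailable"  # e.g. sleep interval or cooling period


class Core:
    def __init__(self):
        self.state = CoreState.OPERATIONAL
        self.pending_jobs = 0

    def release_job(self):
        """Accept a new job only while the core is fully operational."""
        if self.state is not CoreState.OPERATIONAL:
            return False
        self.pending_jobs += 1
        return True

    def complete_job(self):
        """Finish one assigned job; a draining core becomes unavailable after its last job."""
        if self.pending_jobs == 0:
            return
        self.pending_jobs -= 1
        if self.state is CoreState.DRAINING and self.pending_jobs == 0:
            self.state = CoreState.UNAVAILABLE


class Platform:
    def __init__(self, n_cores, k):
        self.cores = [Core() for _ in range(n_cores)]
        self.k = k  # maximum number of concurrent core shutdowns

    def select_for_shutdown(self, index):
        """Refuse the request if K cores are already draining or unavailable (assumption)."""
        down = sum(c.state is not CoreState.OPERATIONAL for c in self.cores)
        core = self.cores[index]
        if down >= self.k or core.state is not CoreState.OPERATIONAL:
            return False
        core.state = CoreState.UNAVAILABLE if core.pending_jobs == 0 else CoreState.DRAINING
        return True
```

With K = 1, a second shutdown request is refused while the first core is still draining; how K itself is chosen is outside the scope of this sketch.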
3.8.2 Workload
Each dispatcher d_i^j of an application ai has its own priority, which can be less than or equal to the
application’s priority, i.e. P(d_i^j) ≤ P(ai), ∀d_i^j ∈ D(ai) ∧ ∀ai ∈ A. The priority assignment upon
dispatchers is also one of the contributions of this section. A job released by the dispatcher d_i^j
inherits its priority P(d_i^j). Note that allowing dispatchers of the same application to have different
priorities implies that job priorities of one application are not necessarily constant, but depend on
master dispatchers which are elected to release them. This is also a novel concept in the real-time
domain.
Based on their computation requirements, applications are classified into three categories:
Safety-Critical Applications (SCA), Real-Time Applications (RTA) and Best-Effort Applications
(BEA).
SCA are considered to be the highest-priority workload in the whole system. Therefore, strong
guarantees regarding their computation requirements should be provided, ensuring that no missed
deadlines of SCA will occur, even under circumstances that involve core shutdowns.
RTA represent the medium-priority workload, and also require schedulability guarantees. However,
the guarantees only need to hold when the system is working at full capacity (no core
shutdowns).
Finally, BEA represent the lowest-priority workload in the system. BEA can tolerate missed
deadlines; however, missed deadlines degrade the quality of service. Hence, the focus is not on
deriving schedulability guarantees for BEA, but instead on establishing a notion of fairness, e.g.
by spreading missed deadlines as evenly as possible across all BEA. This decision is motivated by
the fact that, in many practical scenarios, maintaining all non-critical functionalities with reduced
quality is more desirable than cancelling some of them (e.g. multimedia applications).
The assumed workload classification and respectively posed requirements are inspired by the
existing workload classification which is well established in the real-time embedded area [17].
3.8.3 Problem Statement
The objective can be summarised with the following statement. Given the application-set A , the
platform Ψ and the maximum allowed number of concurrent core shutdowns K, assign priorities
to dispatchers and map them onto the platform (M =A →Ψ), such that:
• No missed deadline of SCA occurs, assuming that the system exhibits at most K concurrent
core shutdowns.
• No missed deadline of RTA occurs when the system is working at full capacity (no
core shutdowns).
• The ratio of missed BEA deadlines is spread across all BEA as evenly as possible.
As already known, the application mapping for many-cores is an NP-Hard problem [40], hence
searching for the optimal solution is, in most cases, prohibitively expensive. Nonetheless, even
with infinite computational capacities, in this particular case the optimal solution could not be
found at design time, since the "optimality" directly depends on core shutdown decisions, which
are made at runtime. Therefore, in this section, an alternative heuristic-based approach is explored,
with the objective of finding a sub-optimal solution of acceptable quality.
3.8.4 Offline Schedulability Guarantees
If a dispatcher of an application passes an offline schedulability test, performed at design time,
it means that the application can perform the computation on its core as long as it is operational
and will never miss a deadline due to the other workload residing on the same core. The test is
performed by computing the worst-case response time for a fully preemptive fixed-priority sys-
tem [58], treating that core as an independent single-core device and assuming the workload that
can be generated by all dispatchers existing on that core.
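\[
R_\tau(a_i) = C_\tau(a_i) + \sum_{\substack{\forall d_k^m \in D_{\pi(d_i^j)} \\ \mid\, P(d_k^m) > P(d_i^j)}} \left\lceil \frac{R_\tau(a_i) + D_\mu(a_i)}{T(a_k)} \right\rceil \cdot C_\tau(a_k) \tag{3.71}
\]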
The worst-case response time of a job released by a dispatcher d_i^j on its core π(d_i^j) consists of
two terms (Equation 3.71). The first is the computation time of that job, Cτ(ai). Additionally, any
higher priority dispatcher d_k^m residing on the same core may release a job which can preempt the
execution of the job under analysis. Therefore, from every such dispatcher the maximum possible
interference has to be computed (the second term in Equation 3.71). Note that Equation 3.71 is
recursive, and is therefore solved iteratively. If the computed value is less than or equal to the
deadline, i.e. Rτ(ai) ≤ Dτ(ai), an application is considered offline schedulable with respect to its
computation requirements, with its dispatcher d_i^j on the core π(d_i^j).
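The iterative solution of Equation 3.71 can be sketched as follows. This is a minimal Python illustration rather than the implementation used in this work: the function name, the tuple layout of the higher-priority workload and the numeric values are assumptions made for the example, and the iteration is aborted as soon as the deadline is exceeded.

```python
import math


def offline_response_time(c_i, d_tau, d_mu, higher):
    """Fixed-point iteration over Equation 3.71 for one dispatcher on one core.

    c_i    -- computation time of the job under analysis
    d_tau  -- computation deadline the result is compared against
    d_mu   -- memory-related term added to the interference window
    higher -- (C_k, T_k) pairs of higher-priority dispatchers on the same core
    Returns the response time, or None if it exceeds d_tau.
    """
    r = c_i
    while True:
        nxt = c_i + sum(math.ceil((r + d_mu) / t_k) * c_k for c_k, t_k in higher)
        if nxt > d_tau:
            return None  # not offline schedulable on this core
        if nxt == r:
            return r     # fixed point reached
        r = nxt


# Hypothetical workload: two higher-priority dispatchers share the core.
print(offline_response_time(3, 10, 0, [(1, 4), (2, 6)]))  # 10
print(offline_response_time(3, 9, 0, [(1, 4), (2, 6)]))   # None
```

In the first call the iteration visits 3, 6, 7, 9 and settles at 10, which meets the deadline of 10; tightening the deadline to 9 makes the same dispatcher fail the test.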
3.8.5 Online Schedulability
If a dispatcher is not offline schedulable, it does not mean that a job of its application can never
perform the computation on that core without missing a deadline. It only implies that the ability to
do so depends on the higher priority workload residing on that core at the moment of observation,
i.e. which of the higher priority dispatchers are elected by their respective applications to release
the jobs on their cores. Thus, in order to determine whether it can release a job and provide the
guarantees that the deadline will not be missed, a dispatcher has to perform an online response
time test (Equation 3.72).
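\[
R_\tau(a_i) = C_\tau(a_i) + \sum_{\forall J_k^m \in RQ(\pi(d_i^j)) \,\mid\, P(J_k^m) > P(d_i^j)} \tilde{C}_\tau(a_k) + \sum_{\substack{\forall d_n^p \in D_{\pi(d_i^j)} \\ \mid\, P(d_n^p) > P(d_i^j)}} \left\lceil \frac{t + R_\tau(a_i) + D_\mu(a_i) - r_n^{p+1}}{T(a_n)} \right\rceil \cdot C_\tau(a_n) \tag{3.72}
\]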
The response time consists of the computation time Cτ(ai), augmented by the higher priority
content of the ready-queue RQ(π(d_i^j)), where C̃τ(ak) ≤ Cτ(ak) stands for the remaining computation
time of the higher-priority ready-queue job J_k^m belonging to the application ak, observed
at the time instant t. Moreover, a conservative assumption is that all higher priority dispatchers
will be elected for their future releases, thus the potential interference has to be considered
from every such dispatcher d_n^p. It is calculated as the maximum number of occurrences in the
interval between the next job release of d_n^p, occurring at r_n^{p+1} ≥ t, and the absolute response time
t + Rτ(ai) + Dµ(ai). Notice that r_n^{p+1} and t are absolute values, thus are represented with lower-case
characters. Equation 3.72 is also recursive, and if the computed value is less than or equal
to the deadline (i.e. Rτ(ai) ≤ Dτ(ai)), the application is online schedulable with respect to its
computation requirements at the instant t on the core of its dispatcher d_i^j.
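A possible way of evaluating Equation 3.72 at runtime is sketched below. The function name, the argument layout and the numbers are hypothetical. One detail is an assumption that the equation itself does not state: the number of future releases is clamped at zero, so that a dispatcher whose next release falls after the analysed window contributes no interference.

```python
import math


def online_response_time(t, c_i, d_tau, d_mu, ready_remaining, future):
    """Fixed-point iteration over Equation 3.72, evaluated at time instant t.

    ready_remaining -- remaining computation times of higher-priority ready-queue jobs
    future          -- (C_n, T_n, r_next) per higher-priority dispatcher on the core,
                       with r_next >= t the absolute time of its next release
    Returns the response time, or None if it exceeds d_tau.
    """
    backlog = sum(ready_remaining)
    r = c_i
    while True:
        nxt = c_i + backlog
        for c_n, t_n, r_next in future:
            # Assumption: releases after the window are ignored (count clamped at zero).
            releases = max(0, math.ceil((t + r + d_mu - r_next) / t_n))
            nxt += releases * c_n
        if nxt > d_tau:
            return None
        if nxt == r:
            return r
        r = nxt
```

For instance, with t = 6, Cτ = 7, Dτ = 15, a backlog of 3 time units and one higher-priority dispatcher (Cτ = 6, T = 15), the test passes with a response time of 10 if that dispatcher's next release is at 20, and fails if the release is at 15, because the release then falls inside the window.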
By passing the online schedulability test, the application gets guarantees from the system that
its next job, released at t, can be executed on the core of its dispatcher d_i^j without missing a deadline.
The guarantees are valid only for the next job; once its execution is completed, the test should
be performed again. At this stage, a reader may raise two valid concerns:
• Is it practical to perform a recursive computation online?
• Is it practical to manage remaining execution times?
These concerns are addressed in the following way. For cases where solving Equation 3.72
might be prohibitively expensive, a modification of the online schedulability test is proposed. The
modified test is agnostic with respect to remaining execution times and completes within a single
iteration, i.e. it is not recursive (Equation 3.73).
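\[
\begin{aligned}
R_\tau(a_i) = C_\tau(a_i)
&+ \overbrace{\sum_{\forall J_s^q \in RQ(\pi(d_i^j)) \,\mid\, P(J_s^q) > P(d_i^j)} \min\{C_\tau(a_s),\; r_s^q + R_\tau(a_s) - t\}}^{\text{interference from jobs with guarantees}} \\
&+ \overbrace{\sum_{\forall J_u^v \in RQ(\pi(d_i^j)) \,\mid\, P(J_u^v) > P(d_i^j)} C_\tau(a_u)}^{\text{interference from jobs without guarantees}} \\
&+ \overbrace{\sum_{\substack{\forall d_k^m \in D_{\pi(d_i^j)} \\ \mid\, P(d_k^m) > P(d_i^j)}} \left\lceil \frac{t + D_{\tau+\mu}(a_i) - r_k^{m+1}}{T(a_k)} \right\rceil \cdot C_\tau(a_k)}^{\text{interference from future releases}}
\end{aligned}
\tag{3.73}
\]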
When compared to Equation 3.72, the first term is the same, while the computation of the
interference due to future releases (the last term) is not recursive any more and is computed within
a single iteration for the interval Dτ+µ(ai). Conversely, the ready-queue content (the second and
the third term) is computed differently. First, for each higher-priority job J_s^q, released at r_s^q, which
has either offline or online guarantees that its computation will complete by r_s^q + Rτ(as), the
remaining execution time is computed by finding the smaller between (i) its total execution time
and (ii) the difference between its worst-case response time, expressed in absolute values, and the
current time instant t. For each higher-priority job J_u^v, which has been released without guarantees,
it has to be assumed that its remaining computation time is equal to its worst-case computation
time, because it may miss a deadline.
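The three interference terms of Equation 3.73 can be computed in a single pass, as the following sketch shows. Names, argument layout and values are hypothetical. As an assumption beyond the equation, the count of future releases is clamped at zero for dispatchers whose next release lies outside the interval.

```python
import math


def single_pass_response_time(t, c_i, d_total, guaranteed, unguaranteed, future):
    """Non-recursive bound following the structure of Equation 3.73.

    d_total      -- the interval D_{tau+mu}(a_i) used for future releases
    guaranteed   -- (C_s, r_s, R_s): ready jobs released at r_s with a response-time guarantee R_s
    unguaranteed -- C_u of higher-priority ready jobs released without guarantees
    future       -- (C_k, T_k, r_next) per higher-priority dispatcher on the core
    """
    # Jobs with guarantees: at most min(total time, time left until guaranteed completion).
    with_guarantees = sum(min(c_s, r_s + resp_s - t) for c_s, r_s, resp_s in guaranteed)
    # Jobs without guarantees: the full worst-case computation time is assumed.
    without_guarantees = sum(unguaranteed)
    # Future releases over a fixed interval, clamped at zero (assumption).
    releases = sum(max(0, math.ceil((t + d_total - r_next) / t_k)) * c_k
                   for c_k, t_k, r_next in future)
    return c_i + with_guarantees + without_guarantees + releases


bound = single_pass_response_time(6, 7, 15, [(6, 0, 8)], [4], [(3, 5, 10)])
print(bound)  # 22
```

Here the guaranteed job contributes min(6, 0 + 8 - 6) = 2, the job without guarantees contributes 4, and three future releases contribute 9, which gives 7 + 2 + 4 + 9 = 22. The caller compares this bound against the deadline Dτ(ai).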
Note, intermediate approaches are also possible, e.g. the remaining computation times are
available, but the computation should be performed within a single iteration, and vice versa. For
such approaches online schedulability tests can be derived by modifying Equations 3.72-3.73,
which is not covered in this dissertation. In Section 3.8.11 the focus is on how the knowledge about
the remaining computation times and the allowed number of iterations influence the efficiency of
online schedulability tests.
3.8.6 Semi-Schedulability Guarantees
Even though multiple dispatchers of an application might be offline schedulable, each job will
be executed by only one of them (an elected master dispatcher), thus leaving the remaining dis-
patchers idle. This fact can be exploited, so as to derive another form of guarantees, termed
the semi-schedulability. An application is semi-schedulable if none of its dispatchers is offline
schedulable, however, at any time instant at least one is online-schedulable.
Consider two applications ap and ac, where the first one has a higher priority, P(ap) > P(ac),
and each dispatcher inherits the priority of its respective application, i.e. P(d_p^i) = P(ap), ∀d_p^i ∈ D(ap)
∧ P(d_c^j) = P(ac), ∀d_c^j ∈ D(ac). Let ap and ac share two common cores, πx and πy, and let ap
be offline schedulable on both of them. Additionally, assume that ac is not offline schedulable
on either, but the absence of ap would make ac offline schedulable on both πx and πy. Since ap
can execute only on one core at any time instant, say πx, the core πy might be able to accommodate
ac, and vice versa. Figure 3.35 illustrates one such example. The fractions represent
computation times and minimum inter-arrival periods Cτ(ai)/T(ai), while for clarity purposes the
communication and memory operations have been omitted from this example, i.e.:
\[
C_\eta(a_1) = C_\eta(a_2) = D_\eta(a_1) = D_\eta(a_2) = C_\mu(a_1) = C_\mu(a_2) = D_\mu(a_1) = D_\mu(a_2) = 0
\]
\[
D_\tau(a_1) = T(a_1), \quad D_\tau(a_2) = T(a_2)
\]
Figure 3.35: Semi-schedulability (timeline over t = 0 to 20 for applications a1 and a2 on cores π1
and π2; job parameters 3/5 and 4/6)
Formally, this relation can be described as follows:
Definition 10 (Semi-schedulability). An application ac is considered semi-schedulable with re-
spect to a higher priority application ap, if ac and ap share at least two common cores, on which
ap is offline schedulable and ac is not, but the absence of ap would make ac offline schedulable on
both of them. Then, ac is called a semi-schedulable child, while ap is called a semi-schedulable
parent.
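Definition 10 can be checked mechanically with the offline test of Equation 3.71. The sketch below is only an illustration: the input layout is hypothetical, deadlines are assumed equal to periods, and memory and communication terms are set to zero. The example values 3/5, 6/15 and 7/15 are assumed here to belong to a parent, an intermediate-priority application and a child, respectively.

```python
import math


def response_time(c, d, higher):
    """Equation 3.71 with the memory term set to zero; None if the deadline d is exceeded."""
    r = c
    while True:
        nxt = c + sum(math.ceil(r / t_k) * c_k for c_k, t_k in higher)
        if nxt > d:
            return None
        if nxt == r:
            return r
        r = nxt


def semi_schedulable(child, parent, shared_cores):
    """child, parent: (C, T) pairs with deadline equal to the period (assumption).

    shared_cores: one (above_parent, between) tuple per common core, listing the
    (C, T) pairs of dispatchers with priority above the parent, and between the
    parent and the child (hypothetical layout).
    """
    (c_c, t_c), (c_p, t_p) = child, parent
    qualifying = 0
    for above_parent, between in shared_cores:
        parent_ok = response_time(c_p, t_p, above_parent) is not None
        with_parent = response_time(c_c, t_c, above_parent + [parent] + between)
        without_parent = response_time(c_c, t_c, above_parent + between)
        if parent_ok and with_parent is None and without_parent is not None:
            qualifying += 1
    return qualifying >= 2  # at least two common cores must satisfy the definition


core = ([], [(6, 15)])  # no dispatcher above the parent, one between parent and child
print(semi_schedulable((7, 15), (3, 5), [core, core]))  # True
```

With these values the child fails the offline test in the presence of the parent, while removing the parent yields a response time of 13, which meets the deadline of 15.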
The semi-schedulability provides guarantees that, even though an application does not have
offline schedulable dispatchers, at any time instant there will be at least one which can claim
online schedulability, thus will be able to accommodate the next job without missing a deadline.
The semi-schedulability creates a co-scheduling parent-child relationship, which implies that, at
certain points during runtime, some dispatchers may be prevented from being elected. Hence,
semi-schedulable applications will have less flexibility when electing dispatchers for their next
job releases. However, this does not violate the statement that the scheduling decision of each
application is made by the application itself. This is possible because all the information that is
necessary for deriving a release/migration decision (e.g. the workload state on candidate cores)
is available to the dispatchers from their local kernels and is communicated during the agreement
protocol. Semi-schedulability guarantees hold as long as both common cores remain operational
(neither one is selected for shutting down).
The relationship between a parent and a child application is not trivial to analyse, and if the
executions are not synchronised well, it can cause the child to miss deadlines. One such example
is given in Figure 3.36. An application a3 is semi-schedulable with respect to a1. Additionally,
an application a2 exists in the system, with a priority lower than that of a1, but higher than that
of a3, i.e. P(a1) > P(a2) > P(a3). On both cores, π1 and π2, the applications a1 and a2 are
offline schedulable. a3 is not, but the absence of a1 would also make it offline schedulable. These
conclusions can be reached by solving Equation 3.71 for this example, with the job parameters
given as fractions Cτ(ai)/T(ai) in Figure 3.36. As in the previous example, the communication
and memory operations have been omitted for better clarity.
Figure 3.36: Non-synchronised semi-schedulability (timeline over t = 0 to 22 for applications a1, a2
and a3 on cores π1 and π2; job parameters 3/5, 6/15 and 7/15)
a1 completed its first job on π1, but for the next release at t = 5 decided to migrate to π2 and
stays there until the end of the example. In order to release its job, a3 runs its agreement protocol
at t = 6 and realises that the semi-schedulable parent is on π2, hence migrates to π1. However,
the job of a3 cannot complete its execution before its deadline, due to the higher priority workload
of a2, and thus misses its deadline.
As seen, migrations of a parent application can make a child application unschedulable, even
when their executions are organised on different cores. In this example the first execution of a1 on
π1 delayed an execution of a2 and created a workload backlog which consequently delayed the
execution of a3. Due to this delay, a2 requested more computation time than what is exhibited in
the offline schedulability test performed for a3, thus making that test only a necessary but not a
sufficient condition for semi-schedulability (i.e. Equation 3.71 shows that a2 induces 6 time units of
interference to a3 in the interval between its release and completion, while in the given example it
sums up to 9 time units). Note that the way in which a1 affects a3 is very similar to the effect of
indirect interferences in the domain of traffic flows (see Chapter 2).
Therefore, before performing the migration, a parent application should check whether its actions
would cause the unschedulability of its child. In the aforementioned example, through its agreement
protocol at t = 5, a1 should have checked whether its migration to π2 would cause a3 to be
unschedulable on π1, and if so, act accordingly (i.e. either continue executing on π1, or go to some
other core).
Thus, during its agreement protocol at time instant t, a parent application should perform an
online schedulability test for the next job release of a child application ai occurring at ri. It is
equivalent to performing an online schedulability test for an artificial application a_i^*, with all the
properties of ai, except that the release is at r_i^* = t and the period is extended to T(a_i^*) = T(ai) + ri − t.
An illustrative example is given in Figure 3.37, and Theorem 15 provides the proof.
Figure 3.37: Future release schedulability (timeline relating the release r1 and the period T(a1) of a1
to the artificial application a_1^*, released at t = r_1^* with the extended period T(a_1^*))
Theorem 15. Consider an application ac, which is on the core πx a semi-schedulable child with
respect to a parent application ap. Also consider that after a time instant t, any new release of ap
will occur on cores other than πx. Observed at t, the next future release of ac, occurring at rc, will
be schedulable if and only if an artificial application a_c^* with the same priority and the computation
time, P(a_c^*) = P(ac) ∧ Cτ(a_c^*) = Cτ(ac), but with the extended period T(a_c^*) = T(ac) + rc − t,
released on πx at the time instant t, is online schedulable.
Proof. Proven by contradiction.
• Assume that a_c^* is online schedulable, but ac is not. As ac is not schedulable, it did not receive
the required Cτ(ac) computation time units during its inter-arrival period (the interval between rc
and rc + T(ac)). Since a_c^* is schedulable and has the same schedulability requirements (Cτ(a_c^*) =
Cτ(ac)), this means that it performed some computation before rc. The fact that it performed
some computation during the interval between r_c^* and rc suggests that an eventual busy interval
of higher priority backlog caused by ap completed before rc. Thus, the offline schedulability
condition (Equation 3.71) is sufficient, since it covers the worst-case. As ac is offline schedulable
on πx excluding ap, it has to be online schedulable as well. The contradiction has been reached.
• Assume that a_c^* is not online schedulable, but ac is. As ac is schedulable, it received the
required Cτ(ac) computation time units during its period, which is the interval between rc and
rc + T(ac). Since a_c^* has a larger period, which is the interval between r_c^* and rc + T(ac), and
of which the period of ac is just a subset, it should have received at least the same amount of the
computation time. As Cτ(a_c^*) = Cτ(ac), it straightforwardly follows that a_c^* has to be schedulable
as well. The contradiction has been reached.
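The construction used in Theorem 15 amounts to a small transformation of the child's parameters before the online test is run. A minimal sketch, with a hypothetical `App` record and illustrative numbers:

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class App:
    priority: int
    c_tau: float    # computation time
    period: float   # minimum inter-arrival period T
    release: float  # absolute time of the release under consideration


def artificial_child(child, t):
    """Build the artificial application from the child's next release: same priority
    and computation time, release moved to t, period extended by the gap (release - t)."""
    if child.release < t:
        raise ValueError("the child's next release must not precede t")
    return replace(child, release=t, period=child.period + child.release - t)


child = App(priority=1, c_tau=7, period=15, release=6)
star = artificial_child(child, t=5)
print(star.release, star.period)  # 5 16
```

The parent would then run the online schedulability test on the returned record at time t; which online test is used (Equation 3.72 or a cheaper variant) is left open in this sketch.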
The implications of Theorem 15 are that, during its release, a semi-schedulable parent appli-
cation can perform the online schedulability test for the next future release of its semi-schedulable
child and make a choice regarding its own future executions, such that the schedulability of the
child application is preserved on at least one of the semi-schedulable cores. By co-scheduling their
executions, semi-schedulable applications can safely co-exist within the system (see Theorem 16).
Theorem 16. Consider two semi-schedulable applications, a parent ap and a child ac, which
share only two cores, πx and πy. Assuming that πx and πy are fully operational, ap and ac can
safely co-exist within the same system without missed deadlines.
Proof. Proven by induction. Observe the time interval between two consecutive releases of ap,
termed the step.
• Step one. Since this is the first release of ap, no backlog workload exists which can jeopardise
the schedulability of ac, thus offline schedulability excluding ap (which holds) is sufficient. Two
scenarios are possible. If ac has already released a job, ap may safely select either the other core,
or one of the cores which ap and ac do not have in common. If ac has not yet released its first job, ap
can, due to the absence of backlog, safely choose any of the cores and leave the other core for
ac. Note that ac also has the possibility to choose one of the cores which it does not share with
ap. In either case, the next release of ac will be online schedulable.
• Step n+1. At the beginning of the step n+1, ap releases its job. As the assumption is
that until the end of the step n no missed deadlines occurred, assume that the previous completed
execution of the job of ac occurred on πx. If the new execution of ac already started before the step
n+1, it has either safely continued its execution on the same core, or has selected one of the
non-shared cores. Conversely, if the new release of ac is yet to occur, then by applying Theorem 15
and Equation 3.72, ap will check whether it can migrate to πx and force ac to go to πy (or other
cores), or not. In either case, the next release of ac will be online schedulable.
3.8.7 Blind Synchronisation
So far, it has been proven that as long as both semi-schedulable cores remain operational, semi-
schedulable applications may safely co-reside in the same system and organise their executions
in such a way that none of them misses a deadline. Yet, if the online schedulability test is not
performed by solving Equation 3.72, but rather by solving the more pessimistic Equation 3.73,
occasionally it may happen that the test reports the unschedulability of the child on both cores. In
such cases, both semi-schedulable applications temporarily enter a blind synchronisation mode.
By Definition 10, if a parent always executes on one semi-schedulable core, and the child on
another, none of them can miss a deadline. This fact is exploited during the blind synchronisation
mode, as follows:
Rule 1. If the parent and the child last used different cores of the semi-schedulable core pair,
then each of the applications should choose the same core as before over the other one.
Rule 2. If the parent and the child last used the same core of the semi-schedulable core pair,
then whichever of them was released last on that core should choose that core over the other one;
the other application should accordingly choose the other core of the pair.
Rule 3. Both semi-schedulable applications can freely execute on other cores which are not semi-
schedulable.
Since there were no missed deadlines prior to the blind synchronisation mode, and given that
during it no migrations of semi-schedulable applications across semi-schedulable cores will occur,
from Definition 10 and Rules 1-3 it follows that no missed deadlines of these applications can
occur.
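Rules 1 and 2 reduce to a simple decision on the semi-schedulable core pair, sketched below. The function and argument names are hypothetical; Rule 3 (free use of non-shared cores) is not modelled, so the result only states which core of the pair each application should prefer.

```python
def blind_sync_choice(pair, parent_last, child_last, last_released):
    """Preferred core of the semi-schedulable pair for each application.

    pair          -- the two semi-schedulable cores, e.g. ("x", "y")
    parent_last   -- core of the pair last used by the parent
    child_last    -- core of the pair last used by the child
    last_released -- "parent" or "child": who was released last on a commonly used core
    """
    first, second = pair
    if parent_last != child_last:
        # Rule 1: both applications stay where they were.
        return {"parent": parent_last, "child": child_last}
    # Rule 2: the most recently released application keeps the common core.
    same = parent_last
    other = second if same == first else first
    if last_released == "parent":
        return {"parent": same, "child": other}
    return {"parent": other, "child": same}
```

For example, if both applications last used core "x" and the child was released there last, the child keeps "x" and the parent is directed to "y".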
3.8.8 Parent-Child Relationship
The semi-schedulability was earlier defined in the context of application pairs, i.e. a child appli-
cation can have only one parent application, and vice versa. Yet, it is trivial to see that Theo-
rems 15-16 also hold even under an extended definition of semi-schedulability wherein a parent
application may have multiple semi-schedulable child applications but each child still has only
one parent. For example, a 4-dispatcher parent may share two cores with one child and two dif-
ferent cores with another child. One could also consider allowing the children of the same parent
to share cores with each other. However, such an extended model (i) would require additional
proofs and (ii) more importantly, would additionally reduce the flexibility of the approach, due to
the necessity to perform co-scheduling not just between a parent and its children, but also between
the children of a common parent. Due to the aforementioned reasons, at this point the multiple-
children approach is not employed, and during the experimental evaluation only a 1:1 parent-child
relationship is considered.
Note that the semi-schedulability property can be also studied from the perspective of mode
changes [81], where the (in)existence of semi-schedulable applications on respective cores can be
perceived as different system modes. Consequently, the analysis can be performed at design time,
in order to deduce under which conditions semi-schedulable applications can migrate between
cores. With this approach, when releasing its job, a parent would not need to perform an online
schedulability test for the next child’s release. This is a potential topic for future work.
3.8.9 Schedulability and Agreement Protocols
The aforementioned schedulability constructs are employed at runtime in the following way. When
an application runs its agreement protocol, all its dispatchers on currently operational cores
participate. For clarity purposes, the protocol can be simplified to the extent that every dispatcher
reports with a single variable s ∈ {⊤,⊥} about the service it can offer, regarding the next job
release on its core. ⊤ means that a dispatcher can guarantee the execution on its core without
missing a deadline, and ⊥ that no guarantees can be provided.
• If a dispatcher is offline schedulable and its core is not selected for shutting down, s = ⊤.
• If a dispatcher is a semi-schedulable parent, its execution will not cause the online
unschedulability of a semi-schedulable child on all common cores, and its core is not selected
for shutting down, s = ⊤.
• If a dispatcher is a semi-schedulable child, it passes the online schedulability test on its
core, and its core is not selected for shutting down, s = ⊤.
• If a dispatcher is neither offline schedulable nor a semi-schedulable parent or child, it passes
the online schedulability test, and its core is not selected for shutting down, s = ⊤.
• In all other cases, s = ⊥.
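The reporting rules above can be condensed into one decision function. The sketch is an interpretation rather than the protocol itself: all names are hypothetical, True stands for ⊤ and False for ⊥, and it is assumed that the parent rule takes precedence over plain offline schedulability, since a semi-schedulable parent is offline schedulable by definition.

```python
def dispatcher_report(core_shutting_down, offline_schedulable=False,
                      semi_parent=False, child_stays_schedulable=False,
                      passes_online_test=False):
    """Return True (guarantee offered) or False (no guarantee) for the next job release."""
    if core_shutting_down:
        return False
    if semi_parent:
        # Assumption: checked first, because a parent must not make its child
        # online unschedulable on all common cores.
        return child_stays_schedulable
    if offline_schedulable:
        return True
    # Semi-schedulable children and all remaining dispatchers rely on the online test.
    return passes_online_test
```

The two boolean inputs `child_stays_schedulable` and `passes_online_test` stand for the outcomes of the online schedulability tests, which would be evaluated separately.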
Figure 3.38: Priority assignment upon dispatchers (x-axis: dispatcher, #1 to #10; y-axis: priority,
1 to 100). Legend: 4-dispatcher SCA with priority 90; 3-dispatcher SCA with priority 85;
10-dispatcher RTA with priority 80; 6-dispatcher RTA with priority 60; 8-dispatcher RTA with
priority 40; 3-dispatcher BEA with priority 30
3.8.10 Priority Assignment and Mapping
The priority assignment and the mapping are mutually dependent activities, hence are presented
in an interleaved fashion. The process starts with the most critical workload – SCA.
3.8.10.1 Mapping SCA
Since these applications tolerate no missed deadlines even under the worst-case conditions (with
at most K concurrent core shutdowns), the fundamental mapping requirement for each SCA is to
have at least K+1 dispatchers, all of them mapped as offline schedulable. If any dispatcher
cannot be mapped as offline schedulable, the mapping process declares failure. Therefore, all
SCA dispatchers are assigned default priorities of their respective applications (see applications
with the priorities 90 and 85 in Figure 3.38).
SCA commence the mapping on an empty system, so most (if not all) of their dispatchers will
have the possibility to choose from multiple cores. An analogy can be made with the bin-packing
theory; dispatchers represent elements, cores symbolise bins, and the possibility of an element to
fit in the bin is equivalent to testing a dispatcher’s response-time on a core (Equation 3.71), where
a higher response-time is interpreted as a better mapping option (i.e. more efficiently packed
elements). As an additional constraint, dispatchers of the same application can not go to the
same core. Three possible mapping options of SCA are investigated: (i) Best-Fit, (ii) Worst-Fit,
(iii) Alternate-Fit (alternately mapping dispatchers of an application with the Best-Fit and Worst-
Fit techniques).
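The bin-packing analogy can be made concrete with the following sketch. It is an illustration under assumptions, not the evaluated implementation: deadlines are taken equal to periods, memory terms are ignored, applications are mapped in decreasing priority order (so every dispatcher already on a core has a higher priority), and all names and numbers are hypothetical.

```python
import math


def response_time(c, d, higher):
    """Equation 3.71 with the memory term set to zero; None if the deadline d is exceeded."""
    r = c
    while True:
        nxt = c + sum(math.ceil(r / t_k) * c_k for c_k, t_k in higher)
        if nxt > d:
            return None
        if nxt == r:
            return r
        r = nxt


def choose_core(app, cores, used, policy):
    """Best-Fit picks the feasible core with the highest response time, Worst-Fit the lowest."""
    c, t = app
    fits = {}
    for core_id, workload in cores.items():
        if core_id in used:
            continue  # dispatchers of one application may not share a core
        r = response_time(c, t, workload)
        if r is not None:
            fits[core_id] = r
    if not fits:
        return None
    pick = max if policy == "best" else min
    return pick(fits, key=fits.get)


def map_sca(app, n_dispatchers, cores, policy="best"):
    """Map all dispatchers of one SCA; policy is 'best', 'worst' or 'alternate'."""
    used = []
    for j in range(n_dispatchers):
        step = policy if policy != "alternate" else ("best" if j % 2 == 0 else "worst")
        core_id = choose_core(app, cores, used, step)
        if core_id is None:
            raise RuntimeError("mapping failed: dispatcher is not offline schedulable anywhere")
        cores[core_id].append(app)
        used.append(core_id)
    return used


cores = {0: [(3, 5)], 1: [], 2: [(1, 10)]}
print(map_sca((2, 10), 2, cores, "best"))  # [0, 2]
```

On the same initial platform, Worst-Fit would select cores 1 and 2, and Alternate-Fit cores 0 and 1. Note that the sketch mutates `cores`, so each policy has to be tried on a fresh copy.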
3.8.10.2 Mapping RTA
Once all SCA are mapped, the focus is on mapping RTA. As already stated, RTA require schedulability
guarantees when the system is fully operational. Therefore, the first dispatcher of each RTA
is assigned the application’s priority and subsequently mapped as offline schedulable. If this is not
possible, the second dispatcher is also assigned the application’s priority, and an attempt is made to
map both simultaneously as semi-schedulable children of some already mapped higher priority
application which is a potential semi-schedulable parent. If this is not possible either, the mapping
process declares failure. When choosing the core to map the first dispatcher (in the case of offline
schedulability) or when choosing the semi-schedulable parent to map the first and the second
dispatcher (in the case of semi-schedulability), possible mapping options are similar to those of SCA:
(i) Best-Fit, (ii) Worst-Fit and (iii) Alternate-Fit.
Figure 3.39: Speculative mapping (dispatchers d1 to d11 mapped onto cores π1 to π5)
Once this is done, there is no need to keep the priority of the rest of the application’s dispatchers
at the same level. Indeed, decreasing their priorities preserves more schedulability
resources for lower-priority RTA whose mapping has not yet started. Hence, after fulfilling the
mapping condition (offline schedulability or semi-schedulability), the rest of the dispatchers of
an application are assigned linearly decreasing priorities, with the last dispatcher having the least
possible system priority, and they all undergo speculative mapping (see the next section). An ex-
ample of RTA priority assignment is given in Figure 3.38, where the application with the priority
80 managed to claim offline schedulability, while applications with priorities 60 and 40 could only
claim semi-schedulability.
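The priority assignment described above can be illustrated with a minimal sketch. The text only states that the remaining priorities decrease linearly down to the least possible system priority; the exact interpolation formula, the helper name `dispatcher_priorities`, the parameter `guaranteed` (1 for offline schedulability, 2 for semi-schedulability) and the lowest system priority of 0 are assumptions made purely for illustration.

```python
from fractions import Fraction


def dispatcher_priorities(app_priority, n_dispatchers, guaranteed=1, lowest=0):
    """Return one priority per dispatcher of an RTA (hypothetical helper).

    The first `guaranteed` dispatchers keep the application's priority;
    the remaining ones decrease linearly, the last one ending at `lowest`.
    """
    if not 1 <= guaranteed <= n_dispatchers:
        raise ValueError("guaranteed must be between 1 and n_dispatchers")
    priorities = [Fraction(app_priority)] * guaranteed
    remaining = n_dispatchers - guaranteed
    for k in range(1, remaining + 1):
        # Assumed interpolation: equal steps from app_priority down to lowest.
        step = Fraction(app_priority - lowest, remaining)
        priorities.append(app_priority - k * step)
    return priorities


# An RTA that claimed offline schedulability (one dispatcher keeps the priority)
offline = dispatcher_priorities(80, 5, guaranteed=1)
# An RTA that claimed semi-schedulability (two dispatchers keep the priority)
semi = dispatcher_priorities(60, 5, guaranteed=2)
print([int(p) for p in offline])
print([int(p) for p in semi])
```

Exact fractions are used so that the last dispatcher lands precisely on the lowest priority even when the step is not an integer.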
3.8.10.3 Speculative mapping
Unlike previous mapping stages, where the focus was on providing schedulability guarantees,
when mapping speculatively the emphasis is on improving the system’s overall runtime behaviour.
In other words, dispatchers should not be necessarily mapped to cores where they can claim of-
fline or semi-schedulability, but rather to cores where they have a high chance of claiming online
schedulability at runtime. Therefore, the criterion for mapping is no longer the response-time, but
the per-core utilisation, which is calculated as the sum of the utilisations of all contained dispatchers.
The individual per-dispatcher utilisations are computed as follows: each dispatcher carries a
fraction of the application’s computation utilisation $U_\tau(a_i) = \frac{C_\tau(a_i)}{T(a_i)}$, but offline schedulable and
semi-schedulable ones carry double the weight, as they are likelier to be elected.
The speculative mapping is explained with an illustrative example given in Figure 3.39, where
dispatchers of the same color belong to the same application. For simplicity reasons, assume that
(i) all applications have the same utilisation $U_\tau(a_i) = \frac{C_\tau(a_i)}{T(a_i)} = u$, (ii) each dispatcher is either offline
or semi-schedulable, and (iii) each dispatcher inherits the priority of its application. Consider that a
dispatcher $d_6$, which belongs to the application with the lowest priority, will undergo a speculative
mapping. The cores $\pi_1$ and $\pi_2$ are not possible mapping options, because the application of $d_6$
already has dispatchers there ($d_4$ and $d_5$, respectively). From the remaining cores ($\pi_3$ to $\pi_5$), a
dispatcher is mapped to the one with the globally minimal utilisation. For example, the utilisation
of the core $\pi_3$ is equal to the sum of utilisations of $d_2$ and $d_8$. Their individual utilisations are $\frac{1}{3}u$
and $\frac{1}{2}u$, as their applications have 3 and 2 dispatchers, respectively. After computing the utilisation
of every core, the conclusion is that the best mapping option is the core $\pi_4$ (see Table 3.7).
Table 3.7: Speculative mapping computation for Figure 3.39
Core          $\pi_1$   $\pi_2$   $\pi_3$           $\pi_4$           $\pi_5$
Utilisation   N/A       N/A       $\frac{5}{6}u$    $\frac{1}{2}u$    $\frac{5}{6}u$
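The computation behind Table 3.7 can be reproduced with a short sketch. It covers only the simplified setting of the example (equal application utilisation u and equal weights for all dispatchers), so the double weighting of offline and semi-schedulable dispatchers is not modelled. The function names are hypothetical, and only the content of core $\pi_3$ is stated in the text; the contents of the other cores are assumed so that the totals agree with the table.

```python
from fractions import Fraction


def core_utilisation(dispatcher_counts, u=Fraction(1)):
    """Sum the shares u/n of the dispatchers hosted on one core.

    `dispatcher_counts` lists, for each hosted dispatcher, the number of
    dispatchers n of its application (equal weights, as in the example).
    """
    return sum((u / n for n in dispatcher_counts), Fraction(0))


def pick_core(cores, excluded):
    """Return the candidate core with minimal utilisation and all candidates."""
    candidates = {
        name: core_utilisation(counts)
        for name, counts in cores.items()
        if name not in excluded
    }
    best = min(candidates, key=candidates.get)
    return best, candidates


# Utilisations are expressed in units of u.
cores = {
    "pi1": [3],       # hypothetical content; excluded because it hosts d4
    "pi2": [3],       # hypothetical content; excluded because it hosts d5
    "pi3": [3, 2],    # d2 and d8, as stated in the text: u/3 + u/2
    "pi4": [2],       # assumed content giving u/2
    "pi5": [3, 2],    # assumed content giving u/3 + u/2
}
best, candidates = pick_core(cores, excluded={"pi1", "pi2"})
print(best, candidates[best])
```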
3.8.10.4 Mapping BEA
When mapping BEA the objective is to spread the ratios of BEA missed deadlines across all BEA
as evenly as possible, i.e. maintain this particular notion of fairness, as described in Section 3.8.3.
Thus, all BEA dispatchers are mapped speculatively. In order to further equalise the consumption
of schedulability resources, dispatchers of the same application are assigned linearly decreasing
priorities (see the application with the priority 30 in Figure 3.38). In many cases this allows the
first dispatcher of a lower priority application to have a higher priority than the second and sub-
sequent dispatchers of some higher priority applications (even RTA), which gives it a scheduling
precedence and contributes to the intended fairness.
During the entire mapping process an ordered list of all dispatchers that are still not mapped
is maintained. The ordering criterion is the non-increasing priority. The dispatchers are removed
from the list and subsequently mapped in that order (e.g. in the example given in Figure 3.38 first
the application with the priority 90 is entirely mapped, then also entirely the application with the
priority 85, then the dispatchers 1−3 of that with the priority 80, then the dispatchers 1−2 of that
with the priority 60, etc.). The rationale for this decision is that a currently mapped dispatcher does
not have an influence on already mapped (higher-priority) workload, which reduces the complexity
of the entire process from quadratic, $O(|D(\mathcal{A})|^2)$, to linear, $O(|D(\mathcal{A})|)$, where
$|D(\mathcal{A})| = \sum_{\forall a_i \in \mathcal{A}} |D(a_i)|$ denotes the total number of dispatchers in the application-set. Note that re-orderings
of the list may occasionally be required in cases where RTA claim semi-schedulability, and hence
the priorities of their respective unmapped dispatchers have to be elevated.
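The ordered list of unmapped dispatchers, including the occasional re-ordering, could be kept in a binary heap. The sketch below is one possible realisation rather than the implementation used in the experiments; the class name, the method names and the dispatcher labels are hypothetical, and stale heap entries left behind by a priority elevation are simply skipped when popped.

```python
import heapq


class UnmappedDispatchers:
    """Unmapped dispatchers, served in non-increasing priority order."""

    def __init__(self, dispatchers):
        # dispatchers: mapping from dispatcher name to its current priority
        self._priority = dict(dispatchers)
        # heapq is a min-heap, so priorities are stored negated.
        self._heap = [(-p, name) for name, p in self._priority.items()]
        heapq.heapify(self._heap)

    def elevate(self, name, new_priority):
        """Re-order after a priority change (e.g. semi-schedulability)."""
        self._priority[name] = new_priority
        heapq.heappush(self._heap, (-new_priority, name))

    def pop(self):
        """Remove and return the next (name, priority), or None if empty."""
        while self._heap:
            negated, name = heapq.heappop(self._heap)
            if self._priority.get(name) == -negated:
                del self._priority[name]
                return name, -negated
            # Otherwise the entry is stale (elevated or already mapped).
        return None


queue = UnmappedDispatchers({"a1.d1": 90, "a2.d1": 80, "a2.d2": 40, "a3.d1": 60})
first = queue.pop()
queue.elevate("a2.d2", 80)   # second dispatcher now needs the application's priority
order = [first]
while (item := queue.pop()) is not None:
    order.append(item)
print(order)
```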
3.8.11 Experimental Evaluation
In this section, the objective is to explore the impacts of different priority assignment and mapping
strategies on several important aspects, namely: (i) schedulability guarantees, (ii) the runtime
behaviour assuming no core shutdowns, (iii) the runtime behaviour assuming core shutdowns. The
simulations were performed on the extended version of the simulator SPARTS [73]. The simulation
parameters are summarised in Table 3.8. An asterisk sign denotes a randomly generated value,
assuming a uniform distribution.
Table 3.8: Analysis and simulation parameters for Section 3.8.11
Parameter                                                                         Value
NoC topology and size                                                             2-D mesh with 10×10 routers
Application-set size $|\mathcal{A}|$                                              200 applications
SCA application period $T(a_i)$                                                   [30−50]* ms
RTA application period $T(a_i)$                                                   [30−100]* ms
BEA application period $T(a_i)$                                                   [0.1−1]* s
Computation deadline $D_\tau(a_i), \forall a_i \in \mathcal{A}$                   $T(a_i)$
Comm. and memory deadlines $D_\eta(a_i) \wedge D_\mu(a_i), \forall a_i \in \mathcal{A}$   0 ms
Application utilisation $U_\tau(a_i) = \frac{C_\tau(a_i)}{T(a_i)}$                (0−0.7]*
Application breakdown (average) {SCA, RTA, BEA}                                   {10%, 20%, 70%}
Simulated time                                                                    100 s
Note that the application breakdown presents average values: in some generated
application-sets the share of SCA can be as low as 6% or as high as 15%, while the average
value is 10%, as stated in Table 3.8. These variations occur because the individual task
parameters are assigned after the generation of task types.
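A generator consistent with Table 3.8 might look as follows. This is only a sketch and not the SPARTS implementation: drawing the type of each application independently (which is one way the realised SCA share can deviate from 10%), the function name, the dictionary keys and the seed handling are all assumptions, and the scaling towards a target system utilisation is omitted.

```python
import random

# Period ranges in milliseconds and average type shares, taken from Table 3.8.
PERIOD_RANGES_MS = {"SCA": (30, 50), "RTA": (30, 100), "BEA": (100, 1000)}
TYPE_SHARES = {"SCA": 0.10, "RTA": 0.20, "BEA": 0.70}


def generate_application_set(size=200, seed=None):
    """Draw a random application-set (hypothetical helper)."""
    rng = random.Random(seed)
    # Assumption: each application's type is drawn independently.
    kinds = rng.choices(list(TYPE_SHARES), weights=list(TYPE_SHARES.values()), k=size)
    apps = []
    for kind in kinds:
        low, high = PERIOD_RANGES_MS[kind]
        period = rng.uniform(low, high)
        # rng.random() lies in [0, 1), so the utilisation lies in (0, 0.7].
        utilisation = 0.7 * (1.0 - rng.random())
        apps.append({
            "type": kind,
            "period_ms": period,
            "deadline_ms": period,               # computation deadline equals the period
            "utilisation": utilisation,
            "wcet_ms": utilisation * period,     # C = U * T
        })
    return apps


apps = generate_application_set(seed=1)
share_sca = sum(app["type"] == "SCA" for app in apps) / len(apps)
print(len(apps), share_sca)
```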
3.8.11.1 Experiment 1: RTA Priorities and Semi-Schedulability
The objective of this experiment is to observe the impacts of different priority assignment
techniques, as well as of semi-schedulability, on the provided schedulability guarantees. Recall that
the fundamental requirement is that: (i) every dispatcher of every SCA is offline schedulable, and
(ii) every RTA has at least one dispatcher as offline schedulable, or a pair of dispatchers as semi-
schedulable children. All application-sets, for which a mapping M can be found, such that the
aforementioned objectives are fulfilled, are referred to as schedulable.
In this experiment, each application consists of 8 dispatchers, and the mapping is performed
by employing the Best-Fit mapping technique. Three different approaches are analysed: (i) all
dispatchers of one application have the same priority, (ii) the priority assignment to dispatchers
is performed as described in Section 3.8.10, and (iii) the priority assignment is as in the previous
case, but the semi-schedulability property (SS) is employed. The system utilisation (x-axis) is
varied and the amount of schedulable application-sets (y-axis) is observed. Figure 3.40 shows that
assigning priorities as described in Section 3.8.10 is an efficient strategy. Specifically, allowing
dispatchers of RTA to have decreasing priorities helps to preserve more schedulability resources
for lower-priority RTA that are yet to be mapped. Additionally, the tests reported a noticeable
improvement when the SS technique is employed. As these strategies have proven to be beneficial,
in the rest of the experiments both the semi-schedulability and the priority assignment, as proposed
in Section 3.8.10, are employed.
Figure 3.40: Impact of RTA priorities and SS on schedulability guarantees (x-axis: System Utilisation (in %); y-axis: Schedulable Application-Sets (in %); series: RTA Same Priority, RTA Decreasing Priorities without SS, RTA Decreasing Priorities with SS)
Figure 3.41: Impact of number of dispatchers on schedulability guarantees (x-axis: System Utilisation (in %); y-axis: Schedulable Application-Sets (in %); series: SCA = RTA = BEA = 6, 8, 10, 12)
3.8.11.2 Experiment 2: Number of Dispatchers
In this experiment, the objective is to investigate how the number of SCA dispatchers impacts
provided schedulability guarantees. Again, the Best-Fit mapping technique is employed. Four
different cases are analysed, where each application has 6, 8, 10 or 12 dispatchers. The varied
parameter is the system utilisation (x-axis), while the observed value is the number of schedulable
application-sets (y-axis). Figure 3.41 demonstrates that having more SCA dispatchers
is expensive in terms of schedulability resources; at the same time, however, it improves the
resilience of the system and allows a higher number of concurrent core shutdowns without SCA
missed deadlines.
3.8.11.3 Experiment 3: Mapping Strategies
The focus of this experiment is on different mapping techniques. Each application has 8 dispatch-
ers. Three different approaches are analysed, where the applications were mapped with (i) the
Worst-Fit technique, (ii) the Alternate-Fit technique and (iii) the Best-Fit technique, as described
in Section 3.8.10. Again, the varied parameter is the system utilisation (x-axis), while the
observed value is the number of schedulable application-sets (y-axis). Figure 3.42 shows
that the Worst-Fit technique maps the fewest application-sets as schedulable. The other
two techniques perform similarly to each other, although the Alternate-Fit shows marginal
improvements, because it tends to distribute the dispatchers more diversely, which in some cases results in
better opportunities for semi-schedulability.
3.8.11.4 Experiment 4: Online Schedulability Tests w/o Remaining Computation Times
Since the online schedulability tests are performed frequently, the performance of LMM depends
on the trade-off between the complexity and the efficiency of these tests. In some scenarios per-
forming the test by solving Equation 3.72 may be undesirably expensive, as it requires a fixed-point
search algorithm and also the knowledge about remaining computation times. For that purpose,
a lighter test was proposed (Equation 3.73), where a single computation is performed without the
knowledge about the remaining computation times. In this experiment, it is investigated whether
the lighter tests are practical.
Figure 3.42: Impact of mapping strategies on schedulability guarantees (x-axis: System Utilisation (in %); y-axis: Schedulable Application-Sets (in %); series: Worst-Fit, Alternate-Fit, Best-Fit)
Figure 3.43: Efficiency of online tests when using remaining computation times (x-axis: Number of allowed iterations; y-axis: Successful online schedulability tests (in %))
Each application has 8 dispatchers, and the Best-Fit mapping technique is employed. The sys-
tem utilisation is fixed to 80%. The execution is simulated and the number of successful online
schedulability tests is measured. First, the tests are performed by taking into account the exact
remaining computation times. The number of allowed iterative computations is varied, and if
the value is not obtained within a given limit, a single computation is performed with the input
Rτ(ai) = Dτ(ai) in Equation 3.72. Figure 3.43 shows how the success ratio of the online schedu-
lability test (y-axis) changes with the number of allowed iterations (x-axis). As seen, even if only
two iterations are allowed, the efficiency of the test compared to the exact test (i.e. unconstrained
iterations) is barely affected. This is because the test converges fast anyway, almost always within
a few iterations. This is not surprising, as the applications contributing interference in the
recurrence relation are few (i.e. just those that have dispatchers on the considered core), unlike in
global scheduling. In this sense, the analysis of LMM is quite scalable.
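The idea of an iteration-bounded test with a single-computation fallback can be sketched as follows. Equation 3.72 is not reproduced in this section, so the sketch substitutes the classic fixed-priority response-time recurrence as a stand-in and ignores remaining computation times; the function name, the parameter names and the example values are hypothetical.

```python
def online_test(c, d, interferers, max_iterations=None):
    """Bounded fixed-point schedulability test (illustrative sketch).

    c, d: computation time and deadline of the analysed application;
    interferers: (c_j, t_j) pairs of higher-priority workload on the core;
    max_iterations: iteration budget, None meaning unconstrained.
    """

    def demand(window):
        # Integer ceiling of window / t_j, avoiding floating-point rounding.
        return c + sum(-(-window // t_j) * c_j for c_j, t_j in interferers)

    response = c
    used = 0
    while max_iterations is None or used < max_iterations:
        nxt = demand(response)
        used += 1
        if nxt > d:
            return False          # deadline exceeded: the test fails
        if nxt == response:
            return True           # fixed point reached within the deadline
        response = nxt
    # Budget exhausted: one computation with the window set to the deadline.
    return demand(d) <= d


interferers = [(1, 4), (2, 6)]
exact = online_test(2, 7, interferers)                       # converges to 6
limited = online_test(2, 7, interferers, max_iterations=2)   # falls back to the window d
single = online_test(2, 10, interferers, max_iterations=0)   # single computation only
print(exact, limited, single)
```

The fallback is safe but pessimistic: if the demand within a window of length d fits into d, a fixed point no larger than d must exist, whereas the converse does not hold, which is why the second call rejects an application that the exact test accepts.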
Similarly, Figure 3.44 shows the success ratio of the online schedulability tests, but this time
by being agnostic with respect to remaining computation times. The average value from the
previous figure is also plotted, so as to ease the visual comparison. The trends are similar to the
previous case: only a few iterations are needed. Moreover, being agnostic with respect to
remaining computation times has, as expected, a negative effect, although a very mild one. Thus,
a light online test which is agnostic and has the limit of 5 iterations manages to succeed in more
than 90% of the cases which were successful by the exact test (Equation 3.72). This crucial finding
further motivates the research related to LMM.
3.8.11.5 Experiment 5: Number of Dispatchers and Runtime
Figure 3.44: Efficiency of online tests without remaining computation times (x-axis: Number of allowed iterations; y-axis: Successful online schedulability tests (in %); series: Non-agnostic w.r.t. remaining execution times, Agnostic w.r.t. remaining execution times)
Figure 3.45: Impact of number of dispatchers on runtime behaviour, without core shutdowns (x-axis: System Utilisation (in %); y-axis: Distribution of BEA deadline miss ratios (in %); series: RTA = BEA = 2, 4, 6, 8, 10)
In this experiment, the objective is to investigate how the number of RTA and BEA dispatchers
influences the runtime behaviour of the system. In other words, is it beneficial to have more RTA
and BEA dispatchers? The following parameter setup is used: K = 7, i.e. a system should allow
at most 7 concurrent core shutdowns. Thus, all SCA have 8 dispatchers. The number of RTA
and BEA dispatchers is varied in the range [1− 10]. The Best-Fit mapping technique was used,
and only application-sets which are schedulable with this parameter setup are considered. First,
observe the behaviour of the system when no core shutdowns occur. Given that in these conditions
neither SCA nor RTA missed deadlines can occur, of interest is the distribution of BEA missed
deadline ratios. Note that despite missing their deadlines, BEA jobs continue their computation. The
execution is simulated for different system utilisations, and BEA missed deadlines are captured.
Figure 3.45 shows that schemes with fewer dispatchers are more rigid and concentrate all BEA
missed deadlines among very few applications. Conversely, schemes with more dispatchers clearly
benefit from their flexibility, in the sense that BEA missed deadlines are evenly distributed among
all applications. These trends do not reach a saturation point, but show systematic improvements
as the number of dispatchers increases. This also validates the efficiency of the priority assign-
ment techniques, and proves that by assigning priorities in a strategic manner one can benefit from
the high number of dispatchers per application, and yet efficiently avoid the "suffocation effect"
among applications. Note, for RTA = BEA = 1, all BEA missed deadline ratios are 100%. For
better clarity, this case is omitted from Figure 3.45.
Again, the runtime behaviour is investigated, but this time assuming core shutdowns. The
duration of each shutdown is 1 second. In this and the next experiment the parameter $P$ stands for
the individual per-core probability of being selected for at least one shutdown, $P^2$ for at least two
shutdowns, etc. All shutdowns of all cores must occur within the simulated interval. Time instants
at which each individual core will experience a shutdown are randomly generated, but without vi-
olating a constraint that at most K = 7 of them can be selected concurrently. The system utilisation
is fixed to 80%, the parameter P (x-axis) is varied, and the average number of RTA missed
deadlines (y-axis) is observed. Figure 3.46 shows a clear benefit of having more dispatchers, for every
value of P. A slight increase in the number of dispatchers may improve the resilience towards core
shutdowns by as much as one order of magnitude, while any additional increase clearly contributes to
the system flexibility to tolerate more frequent core shutdowns.
Figure 3.46: Impact of number of dispatchers on runtime behaviour, with core shutdowns (x-axis: Core Shutdown Probability Parameter P; y-axis: Average RTA missed deadlines (in %), log scale; series: RTA = BEA = 1, 2, 4, 6, 8, 10)
Figure 3.47: Impact of mapping strategies on runtime behaviour, without core shutdowns (x-axis: System Utilisation (in %); y-axis: Distribution of BEA deadline miss ratios (in %); series: Worst-Fit, Alternate-Fit, Best-Fit)
3.8.11.6 Experiment 6: Mapping strategies and Runtime
In this experiment, the objective is to investigate how different mapping techniques influence the
runtime behaviour of the system. Again, K = 7. Each application has 8 dispatchers, and again
only schedulable application-sets are considered. First, the system behaviour is observed when no
core shutdowns occur and the focus is on the distribution of BEA missed deadline ratios. All three
proposed mapping techniques for different system utilisations (x-axis) are simulated and the ratios
of BEA missed deadlines (y-axis) are captured. Figure 3.47 shows the results. It is noticeable
that the Best-Fit technique achieves the best results, although the differences are very subtle and
almost negligible.
Now, the runtime behaviour is observed again, but this time with core shutdowns. The system
utilisation is 80% and the parameter P is varied (x-axis). The focus is on RTA missed deadlines
(y-axis). Figure 3.48 suggests that all techniques demonstrate a comparable performance.
3.8.11.7 Experiment 7: Blind synchronisation
In this experiment, the focus is on the blind synchronisation mode (BSM) and the frequency of
its occurrences. The assumed setup is identical to that of Experiment 4, with the only difference
that the releases of semi-schedulable applications which cause the BSM are observed (y-axis of
Figure 3.49). The varied parameter is the allowed number of iterations in the schedulability test
recurrence (x-axis). For each value of the allowed number of iterations the simulations are per-
formed, assuming two types of online schedulability tests, ones which are agnostic with respect to
remaining execution times, and ones which are not. It comes as no surprise that the agnostic tests,
due to being more pessimistic, cause more frequent occurrences of the BSM, than the respective
non-agnostic ones. However, the differences are negligible. The explanation for this finding is
twofold. First, Experiment 4 demonstrated that the pessimism of the agnostic tests, when
compared to the respective non-agnostic ones, is not significant. Second, in many cases, the BSM
requires specific (worst-case) conditions, which do not occur frequently during runtime.
Figure 3.48: Impact of mapping strategies on runtime behaviour, with core shutdowns (x-axis: Core Shutdown Probability Parameter P; y-axis: Average RTA missed deadlines (in %); series: Worst-Fit, Alternate-Fit, Best-Fit)
Figure 3.49: Blind synchronisation mode (BSM) (x-axis: Number of allowed iterations; y-axis: Amount of releases triggering blind synchronisation (in %); series: Non-agnostic w.r.t. remaining execution times, Agnostic w.r.t. remaining execution times)
As expected, allowing more iterations decreases the occurrences of the BSM, because the
respective online schedulability tests become less pessimistic. This coincides with the findings of
Experiment 4. In any case, even when using the most pessimistic tests (e.g. Equation 3.73), the
BSM is triggered, on average, in just 1.5% of the releases of semi-schedulable applications. This
shows that the conditions leading to the BSM arise very rarely during runtime, and also shows
that for many semi-schedulable application pairs the BSM cannot occur, not even theoretically,
irrespective of the employed online schedulability test.
3.8.12 Discussion
Assigning priorities as proposed in Section 3.8.10 proved to be an efficient approach. Also, the
semi-schedulability exhibited a huge positive impact on schedulability guarantees. Assigning the
application’s default priority to all K+ 1 dispatchers of SCA is costly, in terms of schedulability
resources, but achieves the required schedulability guarantee (i.e. at up to K concurrent core
shutdowns). Understandably, providing strong guarantees to SCA, for the event of core shutdowns
(which is the main objective), commensurately "withholds" resources from RTA and BEA, but this
is mitigated to a large extent by the flexibility of LMM.
Having more RTA and BEA dispatchers proved to be beneficial in both schemes, with and
without core shutdowns. Due to the efficient priority assignment technique, schedulability guar-
antees for RTA are not influenced by the number of dispatchers per application. The additional
system flexibility, brought by multiple dispatchers, indirectly through RTA and directly through
BEA, contributes to the equal distribution of missed deadlines among BEA (assuming no core
shutdowns) and minimises the number of RTA missed deadlines (assuming core shutdowns). How-
ever, as the number of dispatchers increases, the benefits from additional dispatchers start to level
off, which may be an important factor when the communication delays (Sections 3.3-3.5) are taken
into account as well.
Mapping with different mapping strategies has almost negligible effects. The Alternate-Fit
approach is the most efficient in providing schedulability guarantees, while the Best-Fit technique
is the best in terms of runtime behaviour, both with and without core shutdowns. The Worst-Fit
approach performs worse than both the aforementioned techniques, in all investigated categories
and its use cannot be justified.
Performing a light online schedulability test (agnostic with respect to remaining execution
times, with at most 5 iterations) is in more than 90% of the cases as good as performing an exact
test (Equation 3.72), while in only 0.04% of the releases of semi-schedulable applications it causes
the blind synchronisation mode.
It is apparent that there exists no single strategy which yields the best results under all
circumstances. The purpose of the system, the amount and the nature of the workload, the
maximum number of concurrent core shutdowns K, core shutdown policies and the tolerable amount
of RTA/BEA missed deadlines are only a few of the many factors which a system designer should
take into account when choosing the strategy. One can perceive the mapping process as an adaptive
activity, where different strategies are attempted until reaching the solution with (i) the necessary
amount of schedulability guarantees, (ii) the acceptable level of flexibility and resilience towards
core shutdowns, and (iii) the satisfactory runtime performance.
Chapter 4
Conclusions and Future Work
During the last decade, many-core platforms became mainstream in many computing areas, e.g.
high-performance and general-purpose computing. This trend is not surprising, because, when
compared with their ancestors (single- and multi-core systems), many-cores offer numerous
beneficial possibilities. For instance, the abundance of processing elements makes it possible to enhance the
existing functionalities, as well as to integrate new ones. Furthermore, the transition to the many-core
domain gives the possibility to achieve significant design cost reductions, as functionalities
previously executed on numerous single- and multi-core devices can be accommodated within fewer
many-core platforms. Moreover, many-cores are highly flexible, and efficient thermal/power
management strategies can be implemented by configuring the system behaviour to fit current needs
and application workload, while unused cores can be temporarily shut down. Finally, the
abundance of processing elements makes it possible to develop efficient strategies for improving the platform’s
resilience to core failures.
However, despite all these benefits, many-core devices are still the next frontier technology in
the real-time embedded domain, and their application in this area can be expected in the forthcom-
ing years. The major drawback is the complex system design, which makes the real-time analysis
of many-cores a very challenging topic.
The ultimate objective of this dissertation is to make many-core platforms more amenable to
real-time analysis. As demonstrated in this thesis, this goal can be achieved by:
• adequate hardware support for (i) the message passing communication paradigm and
(ii) virtual channels,
• an efficient OS design that promotes scalability and message-passing,
• mindful and thoughtful worst-case analyses.
The contributions presented in this dissertation have been divided into two categories. The fo-
cus of the first category is on the NoC interconnect, which is one of the most complex-to-analyse
shared resources in many-core platforms. First, the novel method for the worst-case analysis was
proposed for the type of interconnects that are the most common choice in present many-core
platforms: 2-D mesh NoC with the round-robin arbitration policy and without the support for vir-
tual channels. Then, the focus was on the type of interconnects which are not yet commercially
available, however, they are currently considered to be the most suitable for the real-time analysis:
2-D mesh NoCs with the priority-preemptive arbitration policy and the support for virtual chan-
nels. Assuming these interconnects and the new hardware feature of the existing platforms, which
allows traffic flows to dynamically change virtual channels, a novel packet-routing technique was
proposed. With this technique, the worst-case analysis remains unaffected, and yet the require-
ments for hardware resources (i.e. the number of virtual channels) are significantly reduced.
Then, based on the EDF methodology, which is a well-established concept in the scheduling
domain, a novel arbitration policy for NoC routers and the accompanying method for the worst-
case analysis were proposed. It has been observed that there are cases where the new method
outperforms the state-of-the-art techniques, but there are also cases where the proposed approach is
less efficient. However, on average, the novel method yields better results, which further motivates
research activities in this domain. Finally, the improvement over the existing methods for the
worst-case analyses was proposed, which helps to derive less pessimistic worst-case traffic delay
estimates. The proposed improvement exploits the fact that traffic flows can impose interference
upon each other only while they are competing for the common NoC resources, i.e. links.
The second set of contributions has been developed around the novel workload execution
paradigm called the Limited Migrative Model (LMM). The LMM approach is inspired by the
latest trends in the general-purpose and high-performance computing areas. Specifically, it is
based on the fundamental concepts of the multi-kernel OS architecture, and uses the message-
passing technique as the communication primitive, which are promising steps towards scalable
and predictable many-core systems. First, the model itself was introduced, and the method for
the worst-case communication delay analysis was proposed. Then, assuming the aforementioned
approach, the three-staged application mapping method was proposed. After that, the focus was
on the memory requirements, and the method for the worst-case memory traffic analysis was
presented. Finally, the computation requirements of applications are addressed with a coarse-
grained method for the worst-case analysis, which has been developed with several simplifying
assumptions. This approach presents only an initial step towards the complete method for the
worst-case analysis of computation requirements, which is a potential future work. Consequently,
the application mapping algorithm was proposed for this coarse-grained method.
It has been demonstrated that LMM is a beneficial and promising approach, and as such,
represents one possible framework for the integration of many-cores into the real-time embed-
ded domain. The worst-case analysis of LMM is not nearly complete, especially in the domain of
workload computation requirements. Moreover, a unified application mapping approach is needed,
such that it takes into account all three identified aspects (communication, memory and computa-
tion requirements) and derives mappings where all posed timing constraints are satisfied. Finally,
implementing LMM concepts into existing operating systems and observing the runtime behaviour
would shed a different light on this model, and would likely elicit new research activities.
Appendix
Theorem 17. Let $X = \{x_1, x_2, \ldots, x_n\}$, $x_i \in \mathbb{N}$, be the distances between the neighbouring dispatchers
of an application, and let $c$ be the circumference of the application shape. The product of the
distances between the neighbouring dispatchers $f(X) = \prod_{\forall x_i \in X} x_i$ reaches the maximum when all
the distances are as even as possible.
Proof. Proven directly. Note that the distances between the dispatchers are natural numbers, which
is a subset of real numbers. There are two cases:
1) $\frac{c}{n} \in \mathbb{N}$: In this scenario the results of Theorem 18 hold, i.e. the maximum on the continuous
domain of real numbers (superset) is also the maximum on the discontinuous domain of natural
numbers (subset). Therefore, in these scenarios the function $f(X)$ reaches the maximum when all
the distances are equal, i.e. $x_i = \frac{c}{n}, \forall x_i \in X$.
2) $\frac{c}{n} \notin \mathbb{N}$: In this case the maxima are different. As proven in Theorem 18, $f(X)$ has a unique
maximum on a continuous real-number domain, inferring that the function is monotonically
increasing from the boundary to the extremum, with respect to each variable, when treating all other
variables as constants. Therefore, the maximum on a discontinuous natural-number domain is the
point geometrically the closest to the continuous maximum, which corresponds to a set of
solutions (multiple maxima) on a discontinuous domain where the distances between dispatchers are
either $\lceil \frac{c}{n} \rceil$ or $\lfloor \frac{c}{n} \rfloor$, i.e. $x_i \in \{\lceil \frac{c}{n} \rceil, \lfloor \frac{c}{n} \rfloor\}, \forall x_i \in X$.
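The statement of Theorem 17 can be checked numerically for small instances. The brute-force sketch below is a sanity check and not part of the proof; it assumes that the distances are natural numbers of at least 1 that sum to the circumference $c$, and the function name is hypothetical.

```python
from itertools import product
from math import prod


def best_distance_sets(c, n):
    """Return the maximal product and the multisets of distances attaining it.

    Enumerates all n-tuples of natural numbers (>= 1) that sum to c.
    """
    best, maximisers = 0, set()
    for head in product(range(1, c + 1), repeat=n - 1):
        last = c - sum(head)          # the last distance is fixed by the sum
        if last < 1:
            continue
        candidate = head + (last,)
        value = prod(candidate)
        if value > best:
            best, maximisers = value, {tuple(sorted(candidate))}
        elif value == best:
            maximisers.add(tuple(sorted(candidate)))
    return best, maximisers


uneven = best_distance_sets(10, 3)   # c/n is not a natural number
even = best_distance_sets(12, 4)     # c/n is a natural number
print(uneven)
print(even)
```

For $c = 10$ and $n = 3$ every maximiser uses only the distances 3 and 4, i.e. the floor and the ceiling of $\frac{c}{n}$, while for $c = 12$ and $n = 4$ all distances equal 3.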
Theorem 18. Let $X = \{x_1, x_2, \ldots, x_n\}$, $x_i \in \mathbb{R}$, be a set of real number variables, such that the following
holds:
\[ \sum_{\forall x_i \in X} x_i = c \tag{4.1} \]
\[ x_i \geq 0, \forall x_i \in X \tag{4.2} \]
The function $f(X) = \prod_{\forall x_i \in X} x_i$ has only one maximum on the domain, and that is the point:
$x_1 = x_2 = \ldots = x_n = \frac{c}{n}$.
Proof. Proven directly. This is a constrained optimisation problem, with one constraint expressed
by the equality (Equation 4.1), and n constraints expressed by the inequalities (Inequality 4.2). Let
the inequalities be temporarily excluded from consideration. The extreme values of the function $f(X)$,
subject to the equality constraint, can be found by the Lagrange Multipliers Method.
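\[ f(X) = \prod_{\forall x_i \in X} x_i, \quad g(X) = \sum_{\forall x_i \in X} x_i = c \;\Rightarrow\; L = \prod_{\forall x_i \in X} x_i + \lambda \cdot \left( c - \sum_{\forall x_i \in X} x_i \right) \tag{4.3} \]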
A new variable λ is called the Lagrange multiplier. According to the first derivative test, the
necessary condition for the extreme point is that the partial derivative of the Lagrange function
with respect to ∀xi ∈ X and λ is equal to 0 (see Equations 4.4-4.6).
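\[ \frac{\partial L}{\partial x_1} = \frac{\partial f(X)}{\partial x_1} - \lambda \cdot \frac{\partial g(X)}{\partial x_1} = 0 \;\Rightarrow\; \prod_{\forall x_i \in X \setminus \{x_1\}} x_i = \lambda \tag{4.4} \]
\[ \vdots \]
\[ \frac{\partial L}{\partial x_n} = \frac{\partial f(X)}{\partial x_n} - \lambda \cdot \frac{\partial g(X)}{\partial x_n} = 0 \;\Rightarrow\; \prod_{\forall x_i \in X \setminus \{x_n\}} x_i = \lambda \tag{4.5} \]
\[ \frac{\partial L}{\partial \lambda} = 0 \;\Rightarrow\; \sum_{\forall x_i \in X} x_i = c \tag{4.6} \]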
There are two cases: 1) $\lambda = 0$ and 2) $\lambda \neq 0$.
1) $\lambda = 0$: This is possible only if at least two of the variables are also equal to 0, that is
$\exists x_i \in X, \exists x_j \in X \mid x_i = x_j = 0 \wedge i \neq j$. These points are both critical and stationary, and therefore
should be further examined by the second derivative test.
2) $\lambda \neq 0$: There exists only one such point, namely $x_1 = x_2 = \ldots = x_n = \frac{c}{n}$. This point is also
critical and stationary, and it can likewise be examined by the second derivative test.
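The two cases can be checked on concrete points with exact arithmetic. The following is a minimal sketch under assumed names and values (`partials_of_f`, `stationary_lambda`, and the sample points are hypothetical): it evaluates the left-hand sides of Equations 4.4-4.5, i.e. the products of all variables except one, and reports the common value $\lambda$ when they all agree.

```python
from fractions import Fraction
from math import prod


def partials_of_f(x):
    """Partial derivatives of f(X) = prod(x): entry i is the product
    of all x_k with k != i."""
    return [prod(x[:i] + x[i + 1:]) for i in range(len(x))]


def stationary_lambda(x):
    """Return lambda if all partial derivatives agree (Equations 4.4-4.5
    can hold with a single multiplier), otherwise None."""
    grads = partials_of_f(x)
    return grads[0] if all(g == grads[0] for g in grads) else None


# Illustrative points (assumed values); each one sums to c = 12.
c, n = Fraction(12), 4
equal_split = [c / n] * n                                          # case 2
two_zeros = [Fraction(0), Fraction(0), Fraction(5), Fraction(7)]   # case 1
one_zero = [Fraction(0), Fraction(3), Fraction(4), Fraction(5)]    # single zero

lam_equal = stationary_lambda(equal_split)
lam_two_zeros = stationary_lambda(two_zeros)
lam_one_zero = stationary_lambda(one_zero)
```

With these points, the equal split yields $\lambda = (c/n)^{n-1} = 27$, the point with two zeros yields $\lambda = 0$, and the point with a single zero does not satisfy Equations 4.4-4.5 for any $\lambda$, so `stationary_lambda` returns `None` for it.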
Additionally, due to the inequality constraints ($x_i \geq 0, \forall x_i \in X$), it is necessary to check the
boundaries of the solution space as well. The boundaries are represented by the solutions where
only one of the variables is 0 (i.e. $\exists x_i \in X \wedge \nexists x_j \in X \mid x_i = x_j = 0 \wedge i \neq j$). Those are called
boundary points; no general test establishes their properties, so they have to be checked
individually. In this case the check is easy: those points represent minima on the domain,
since $f(X) = 0$ for all of them.
The next step in the analysis is the second derivative test for cases 1) and 2). It is conducted
in the form of the Bordered Hessian. The process consists of finding the first partial derivatives of
$g(X)$ with respect to every $x_i \in X$, then finding the second partial derivatives of $f(X)$, also with respect
to every $x_i \in X$, and finally putting them into the matrix called the Bordered Hessian. The general form
of the Bordered Hessian is represented by Figure 4.1.
$$
\begin{bmatrix}
0 & \frac{\partial g}{\partial x_1} & \frac{\partial g}{\partial x_2} & \cdots & \frac{\partial g}{\partial x_n} \\
\frac{\partial g}{\partial x_1} & \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\
\frac{\partial g}{\partial x_2} & \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\frac{\partial g}{\partial x_n} & \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2}
\end{bmatrix}
$$
Figure 4.1: General form of Bordered Hessian for n variables and one constraint
1) $\lambda = 0$: Figure 4.2 presents the Bordered Hessian for the case where $\lambda = 0$. The variable $z_{ij}$
stands for the second partial derivative of $f(X)$ with respect to the variables $x_i$ and $x_j$, and it is described
by Equation 4.7. In this case, $z_{ij}$ has a non-zero value only when exactly two variables, $x_i$ and $x_j$
themselves, are equal to zero; otherwise it is also equal to zero. A sufficient condition for a local maximum is that the
Bordered Hessian is negative definite, i.e. the determinants of its leading principal minors alternate
in sign (Equation 4.8). However, from Figure 4.2 it is visible that there always exists
some $|H_i| = 0$, thus making this test inconclusive. In such cases, each of the points should be
examined individually. Yet, in this case it is obvious that these points represent the minima on the
domain, since $f(X) = 0$ for all of them.

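$$
\begin{bmatrix}
0 & 1 & \cdots & 1 & \cdots & 1 & \cdots & 1 \\
1 & 0 & \cdots & 0 & \cdots & 0 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots & & \vdots & & \vdots \\
1 & 0 & \cdots & 0 & \cdots & z_{ij} & \cdots & 0 \\
\vdots & \vdots & & \vdots & \ddots & \vdots & & \vdots \\
1 & 0 & \cdots & z_{ji} & \cdots & 0 & \cdots & 0 \\
\vdots & \vdots & & \vdots & & \vdots & \ddots & \vdots \\
1 & 0 & \cdots & 0 & \cdots & 0 & \cdots & 0
\end{bmatrix}
$$
Figure 4.2: Bordered Hessian for $\lambda = 0$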

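$$
\begin{bmatrix}
0 & 1 & 1 & \cdots & 1 & \cdots & 1 \\
1 & 0 & z_{12} & \cdots & z_{1i} & \cdots & z_{1n} \\
1 & z_{21} & 0 & \cdots & z_{2i} & \cdots & z_{2n} \\
\vdots & \vdots & \vdots & \ddots & \vdots & & \vdots \\
1 & z_{i1} & z_{i2} & \cdots & 0 & \cdots & z_{in} \\
\vdots & \vdots & \vdots & & \vdots & \ddots & \vdots \\
1 & z_{n1} & z_{n2} & \cdots & z_{ni} & \cdots & 0
\end{bmatrix}
$$
Figure 4.3: Bordered Hessian for $\lambda \neq 0$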
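$$z_{ij} = z_{ji} = \prod_{\forall x_k \in X \setminus \{x_i, x_j\}} x_k \tag{4.7}$$
$$|H_1| < 0, \ |H_2| > 0, \ \ldots \ \Rightarrow \ \operatorname{sign}(|H_i|) = (-1)^i, \ \forall i \in \{1, \ldots, n\} \tag{4.8}$$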
2) $\lambda \neq 0$: The Bordered Hessian for this case is represented by Figure 4.3. Equation 4.7 also holds
for $z_{ij}$; however, in this case these are all non-zero values. Note that since $x_i = \frac{c}{n}, \forall x_i \in X$,
all the second partial derivatives are also equal (Equation 4.9).
$$z_{ij} = z_{ji} = z = \left( \frac{c}{n} \right)^{n-2}, \ \forall i \in \{1, \ldots, n\}, \forall j \in \{1, \ldots, n\} \mid i \neq j \tag{4.9}$$
When computed, with $|H_i|$ denoting the determinant of the leading principal minor of order $i+1$
(border included), the determinants are $|H_1| = -1$, $|H_2| = 2z$, $|H_3| = -3z^2$, $|H_4| = 4z^3$, ...,
$|H_n| = (-1)^n n z^{n-1}$. Since both $n$ and $z$ are strictly positive, the sign of each determinant
depends only on the first factor of the product, $(-1)^i$, and therefore alternates when
successive principal minors are considered, as required by Equation 4.8. This fulfils the sufficient condition for the maximum,
so it can be concluded that the function $f(X)$ has one maximum on the domain, which is located
at the point $x_1 = x_2 = \ldots = x_n = \frac{c}{n}$, and the value is $\max(f(X)) = \left( \frac{c}{n} \right)^n$.
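The sign pattern of the minors can be reproduced numerically for a small instance. The sketch below is an illustration under assumptions of its own: the helper names and the values `n = 4`, `c = 8` are hypothetical, and the minors are listed by matrix order (2, 3, ..., n+1) rather than by index, to avoid committing to an indexing convention. It builds the matrix of Figure 4.3 and evaluates the leading principal minors exactly with rational arithmetic.

```python
from fractions import Fraction


def det(matrix):
    """Exact determinant by Gaussian elimination over Fractions."""
    m = [[Fraction(v) for v in row] for row in matrix]
    size, sign, result = len(m), 1, Fraction(1)
    for col in range(size):
        # Find a row at or below the diagonal with a non-zero pivot.
        pivot = next((r for r in range(col, size) if m[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            sign = -sign  # a row swap flips the sign of the determinant
        result *= m[col][col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for k in range(col, size):
                m[r][k] -= factor * m[col][k]
    return sign * result


def bordered_hessian(n, c):
    """Matrix of Figure 4.3 at x_i = c/n: border of ones, zero diagonal,
    z = (c/n)^(n-2) everywhere else (Equation 4.9)."""
    z = Fraction(c, n) ** (n - 2)
    rows = [[0] + [1] * n]
    for i in range(n):
        rows.append([1] + [0 if i == j else z for j in range(n)])
    return rows, z


def leading_minors(matrix):
    """Determinants of the leading principal submatrices of order 2..len."""
    return [det([row[:k] for row in matrix[:k]])
            for k in range(2, len(matrix) + 1)]


# Illustrative instance (assumed values): z = (8/4)^2 = 4.
H, z = bordered_hessian(n=4, c=8)
minors = leading_minors(H)
```

For this instance the minors of order 2, 3, 4 and 5 evaluate to -1, 8, -48 and 256, i.e. -1, 2z, -3z^2 and 4z^3, so the signs alternate as the sufficient condition for a maximum requires.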
References
[1] Sani Abba and Jeong-A Lee. A parametric-based performance evaluation and design trade-
offs for interconnect architectures using fpgas for networks-on-chip. Microprocessors and
Microsystems, 2014.
[2] Dennis Abts, Natalie D. Enright Jerger, John Kim, Dan Gibson, and Mikko H. Lipasti.
Achieving predictable performance through better memory controller placement in many-
core cmps. In Proceedings of the 36th International Symposium on Computer Architecture,
2009.
[3] Adapteva. Epiphany Architecture.
www.adapteva.com/docs/epiphany_arch_ref.pdf.
[4] Hazem Ismail Ali, Luís Miguel Pinho, and Benny Akesson. Critical-Path-First Based Alloca-
tion of Real-Time Streaming Applications on 2D Mesh-Type Multi-Cores. In Proceedings of
the 19th IEEE Conference on Embedded and Real-Time Computing and Applications, 2013.
[5] Giuseppe Ascia, Vincenzo Catania, and Maurizio Palesi. Multi-objective mapping for mesh-
based noc architectures. In Proceedings of the 2nd International Conference on Hard-
ware/Software Codesign and System Synthesis, 2004.
[6] Neil Audsley, Alan Burns, Mike Richardson, Ken Tindell, and Andy Wellings. Applying new
scheduling theory to static priority pre-emptive scheduling. Software Engineering Journal,
1993.
[7] Theodore Baker. An analysis of fixed-priority schedulability on a multiprocessor. Real-Time
Systems Journal, 2006.
[8] Sundar Balakrishnan and Fusun Ozguner. A priority-driven flow control mechanism for
real-time traffic in multiprocessor networks. IEEE Transactions on Parallel and Distributed
Systems, 1998.
[9] Sanjoy Baruah and Theodore Baker. Schedulability analysis of global edf. Real-Time Sys-
tems Journal, 2008.
[10] Andrea Bastoni, Björn Brandenburg, and James Anderson. An empirical comparison of
global, partitioned, and clustered multiprocessor edf schedulers. In Proceedings of the 31st
IEEE Real-Time Systems Symposium, 2010.
[11] Andrew Baumann, Paul Barham, Pierre-Evariste Dagand, Tim Harris, Rebecca Isaacs, Si-
mon Peter, Timothy Roscoe, Adrian Schüpbach, and Akhilesh Singhania. The multikernel:
A new os architecture for scalable multicore systems. In ACM Symposium on Operating
Systems Principles, 2009.
215
216 REFERENCES
[12] Luca Benini and Giovanni De Micheli. Networks on chips: a new soc paradigm. The
Computer Journal, 2002.
[13] Bruno Bessette, Redwan Salami, Roch Lefebvre, Milan Jelinek, Jani Rotola-Pukkila, Janne
Vainio, Hannu Mikkola, and Kari Jarvinen. The adaptive multirate wideband speech codec
(amr-wb). IEEE Transactions on Speech and Audio Processing, 2002.
[14] Tobias Bjerregaard and Jens Sparso. Implementation of guaranteed services in the mango
clockless network-on-chip. IEE Proceedings - Computers and Digital Techniques, 2006.
[15] Konstantinos Bletsas and Björn Andersson. Preemption-light multiprocessor scheduling of
sporadic tasks with high utilisation bound. In Proceedings of the 30th IEEE Real-Time Sys-
tems Symposium, 2009.
[16] Evgeny Bolotin, Israel Cidon, Ran Ginosar, and Avinoam Kolodny. Qnoc: Qos architecture
and design process for network on chip. Journal of System Architecture, 2004.
[17] Scott Brandt, Scott Banachowski, Caixue Lin, and Timothy Bisson. Dynamic integrated
scheduling of hard real-time, soft real-time and non-real-time processes. In Proceedings of
the 24th IEEE Real-Time Systems Symposium, 2003.
[18] Alan Burns and Andy Wellings. Real-Time Systems and Programming Languages. Addison-
Wesley Educational Publishers Inc, 2009.
[19] John Calandrino, James Anderson, and Dan Baumberger. A hybrid real-time scheduling ap-
proach for large-scale multicore platforms. In Proceedings of the 19th Euromicro Conference
on Real-Time Systems, 2007.
[20] Chen-Ling Chou and Radu Marculescu. Incremental run-time application mapping for ho-
mogeneous nocs with multiple voltage levels. In Proceedings of the 5th International Con-
ference on Hardware/Software Codesign and System Synthesis, 2007.
[21] Chen-Ling Chou and Radu Marculescu. Contention-aware application mapping for network-
on-chip communication architectures. In Proceedings of the International Conference on
Computer Design, 2008.
[22] Chen-Ling Chou, Umit Ogras, and Radu Marculescu. Energy- and performance-aware in-
cremental mapping for networks on chip with multiple voltage levels. IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems, 2008.
[23] William Dally. Performance analysis of k-ary n-cube interconnection networks. IEEE Trans-
actions on Computers, 1990.
[24] William Dally. Virtual-channel flow control. IEEE Transactions on Parallel and Distributed
Systems, 1992.
[25] William Dally and Charles Seitz. Deadlock-free message routing in multiprocessor intercon-
nection networks. IEEE Transactions on Computers, 1987.
[26] William Dally and Brian Towles. Route packets, not wires: on-chip interconnection net-
works. In Proceedings of the 38th Design Automation Conference, 2001.
[27] Dakshna Dasari, Borislav Nikolic´, Vincent Nelis, and Stefan M. Petters. Noc contention
analysis using a branch and prune algorithm. ACM Transactions on Embedded Computing
Systems, 2013.
REFERENCES 217
[28] Jonas Diemer and Rolf Ernst. Back suction: Service guarantees for latency-sensitive on-chip
networks. In International Symposium on Networks-on-Chip, 2010.
[29] Jeff Draper and Joydeep Ghosh. A comprehensive analytical model for wormhole routing in
multicomputer systems. Journal of Parallel and Distributed Computing, 1994.
[30] Jose Duato, Sudhakar Yalamanchili, and Ni Lionel. Interconnection Networks: An Engineer-
ing Approach. M.K. Publishers, 2002.
[31] Christof Ebert and Capers Jones. Embedded software: Facts, figures, and future. IEEE
Computer, 2009.
[32] Thomas Ferrandiz, Fabrice Frances, and Christian Fraboul. A method of computation for
worst-case delay analysis on spacewire networks. In Proceedings of the IEEEInternational
Symposium on Industrial Embedded Systems, 2009.
[33] Thomas Ferrandiz, Fabrice Frances, and Christian Fraboul. Using network calculus to com-
pute end-to-end delays in spacewire networks. SIGBED Rev., 2011.
[34] Thomas Ferrandiz, Fabrice Frances, and Christian Fraboul. A network calculus model for
spacewire networks. In Proceedings of the 17th IEEE Conference on Embedded and Real-
Time Computing and Applications, 2011.
[35] Thomas Ferrandiz, Fabrice Frances, and Christian Fraboul. A sensitivity analysis of two
worst-case delay computation methods for spacewire networks. In Proceedings of the 24th
Euromicro Conference on Real-Time Systems, 2012.
[36] Kees Goossens, John Dielissen, and Andrei Radulescu. Aethereal network on chip: concepts,
architectures, and implementations. IEEE Design & Test of Computers, 2005.
[37] Pierre Guerrier and Alain Greiner. A generic architecture for on-chip packet-switched in-
terconnections. In Proceedings of the 3rd Conference on Design Automation and Test in
Europe, 2000.
[38] Mehmet Harmanci, Nuria Escudero, Yusuf Leblebici, and Paolo Ienne. Providing qos to
connection-less packet-switched noc by implementing diffserv functionalities. In Interna-
tional Symposium on System-on-Chip, 2004.
[39] Jingcao Hu and Radu Marculescu. Exploiting the routing flexibility for energy/performance
aware mapping of regular noc architectures. In Proceedings of the 6th Conference on Design
Automation and Test in Europe, 2003.
[40] Jingcao Hu and Radu Marculescu. Energy-aware mapping for tile-based noc architectures
under performance constraints. In Proceedings of the 8th Asia and South Pacific Design
Automation Conference, 2003.
[41] Wei-Lun Hung, Charles Addo-Quaye, Theocharis Theocharides, Yuan Xie, Narayanan Vi-
jaykrishnan, and Mary Irwin. Thermal-aware ip virtualization and placement for networks-
on-chip architecture. In Proceedings of the International Conference on Computer Design,
2004.
[42] Intel. Single-Chip-Cloud Computer, .
www.intel.com/content/dam/www/public/us/en/documents/
technology-briefs/intel-labs-single-chip-cloud-article.pdf.
218 REFERENCES
[43] Intel. Intel R© Xeon PhiTM , .
http://www.intel.com/content/www/us/en/processors/xeon/
xeon-phi-detail.html.
[44] Kalray. MPPA-256 Manycore Processor.
www.kalray.eu/products/mppa-manycore/mppa-256.
[45] Hany Kashif and Hiren Patel. Bounding buffer space requirements for real-time priority-
aware networks. In Proceedings of the 19th Asia and South Pacific Design Automation
Conference, 2014.
[46] Hany Kashif, Sina Gholamian, and Hiren Patel. Sla: A stage-level latency analysis for real-
time communication in a pipelined resource model. IEEE Transactions on Computers, 2014.
[47] Shinpei Kato, Nobuyuki Yamasaki, and Yutaka Ishikawa. Semi-partitioned scheduling of
sporadic task systems on multiprocessors. In Proceedings of the 21st Euromicro Conference
on Real-Time Systems, 2009.
[48] Nikolay Kavaldjiev and Gerard Smit. A survey of efficient on-chip communications for soc.
In Proceedings of the 4th Symposium on Embedded Systems, 2003.
[49] Byungjae Kim, Jong Kim, Sungje Hong, and Sunggu Lee. A real-time communication
method for wormhole switching networks. In Proceedings of the 1998 International Confer-
ence on Parallel Processing, 1998.
[50] Scott Kirkpatrick, Daniel Gelatt, and Mario Vecchi. Optimization by simulated annealing.
Science, 1983.
[51] Marcio Kreutz, Cesar Marcon, Luigi Carro, Ney Calazans, and Altamiro Susin. Energy and
latency evaluation of noc topologies. In Proceedings of the International Symposium on
Circuits and Systems, 2005.
[52] Rakesh Kumar, Timothy Mattson, Gilles Pokam, and Rob van der Wijngaart. The case for
message passing on many-core chips. In Multiprocessor System-on-Chip. Springer, 2011.
[53] Hugh Lauer and Roger Needham. On the duality of operating system structures. In Proceed-
ings of the 2nd International Symposium on Operating Systems, 1978.
[54] Jean-Yves Le Boudec and Patrick Thiran. Network Calculus: A Theory of Deterministic
Queuing Systems for the Internet. Springer-Verlag, 2001.
[55] Thomas LeBlanc and Evangelos Markatos. Shared memory vs. message passing in shared-
memory multiprocessors. In IEEE Parallel and Distributed Processing Symposium, 1992.
[56] Tang Lei and Shashi Kumar. A two-step genetic algorithm for mapping task graphs to a net-
work on chip architecture. In Proceedings of the Euromicro Symposium on Digital Systems
Design, 2003.
[57] Ye Li, Matthew Danish, and Richard West. Quest-v: A virtualized multikernel for high-
confidence systems. Technical report. http://www.cs.bu.edu/~richwest/quest.
html.
[58] Chang Liu and James Layland. Scheduling algorithms for multiprogramming in a hard-real-
time environment. Journal of the ACM, 1973.
REFERENCES 219
[59] Zhonghai Lu, Axel Jantsch, and Ingo Sander. Feasibility analysis of messages for on-chip
networks using wormhole routing. In Proceedings of the 10th Asia and South Pacific Design
Automation Conference, 2005.
[60] Mark Lundstrom. Moore’s law forever? Science, 2003.
[61] Cesar Marcon, Andre Borin, Altamiro Susin, Luigi Carro, and Flavio Wagner. Time and
energy efficient mapping of embedded applications onto nocs. In Proceedings of the 10th
Asia and South Pacific Design Automation Conference, 2005.
[62] Paris Mesidis and Leandro Soares Indrusiak. Genetic mapping of hard real-time applications
onto noc-based mpsocs – a first approach. In 6th International Workshop on Reconfigurable
Communication-centric Systems-on-Chip, 2011.
[63] Mikael Millberg, Erland Nilsson, Rikard Thid, and Axel Jantsch. Guaranteed bandwidth
using looped containers in temporally disjoint networks within the nostrum network on chip.
In Proceedings of the 7th Conference on Design Automation and Test in Europe, 2004.
[64] Fahime Moein-darbari, Ahmad Khademzade, and Golnar Gharooni-fard. Cgmap: a new
approach to network-on-chip mapping problem. IEICE Electronics Express, 2009.
[65] Srinivasan Murali and Giovanni De Micheli. Bandwidth-constrained mapping of cores onto
noc architectures. In Proceedings of the 7th Conference on Design Automation and Test in
Europe, 2004.
[66] Matt Mutka. Using rate monotonic scheduling technology for real-time communications in
a wormhole network. In Proceedings of the 2nd International Workshop on Parallel and
Distributed Processing, 1994.
[67] Lionel Ni and Philip McKinley. A survey of wormhole routing techniques in direct networks.
The Computer Journal, 1993.
[68] Borislav Nikolic´ and Stefan M. Petters. Towards network-on-chip agreement protocols. In
Proceedings of the 12th International Conference on Embedded Software, 2012.
[69] Borislav Nikolic´ and Stefan M. Petters. Edf as an arbitration policy for wormhole-switched
priority-preemptive nocs – myth or fact? In Proceedings of the 14th International Conference
on Embedded Software, 2014.
[70] Borislav Nikolic´ and Stefan M. Petters. Real-time application mapping for many-cores using
a limited migrative model. Real-Time Systems Journal, 2014.
[71] Borislav Nikolic´, Konstantinos Bletsas, and Stefan M. Petters. Priority assignment and
application mapping for many-cores using a limited migrative model. Technical report,
. Available at: http://www.cister.isep.ipp.pt/people/Borislav+Nikolic/
publications/.
[72] Borislav Nikolic´, Leandro Soares Indrusiak, and Stefan M. Petters. A tighter real-time
communication analysis for wormhole-switched priority-preemptive nocs. Technical report,
. Available at: http://www.cister.isep.ipp.pt/people/Borislav+Nikolic/
publications/.
220 REFERENCES
[73] Borislav Nikolic´, Muhammad Ali Awan, and Stefan M. Petters. SPARTS: Simulator for
power aware and real-time systems. In Proceedings of the 8th IEEEInternational Conference
on Embedded Software and Systems, 2011.
[74] Borislav Nikolic´, Hazem Ismail Ali, Stefan M. Petters, and Luís Miguel Pinho. Are virtual
channels the bottleneck of priority-aware wormhole-switched noc-based many-cores? In
Proceedings of the 21th International Conference on Real-Time Networks and Systems, 2013.
[75] Borislav Nikolic´, Patrick Meumeu Yomsi, and Stefan M. Petters. Worst-case memory traffic
analysis for many-cores using a limited migrative model. In Proceedings of the 19th IEEE
Conference on Embedded and Real-Time Computing and Applications, 2013.
[76] Borislav Nikolic´, Patrick Meumeu Yomsi, and Stefan M. Petters. Worst-case communication
delay analysis for many-cores using a limited migrative model. In Proceedings of the 20th
IEEE Conference on Embedded and Real-Time Computing and Applications, 2014.
[77] Christian Paukovits and Hermann Kopetz. Concepts of switching in the time-triggered
network-on-chip. In Proceedings of the 14th IEEE Conference on Embedded and Real-Time
Computing and Applications, 2008.
[78] Rodolfo Pellizzoni, Andreas Schranzhofer, Jian-Jia Chen, Marco Caccamo, and Lothar
Thiele. Worst case delay analysis for memory interference in multicore systems. In Pro-
ceedings of the 47th ACM/IEEE Conference on Design Automation Conference, 2010.
[79] Yue Qian, Zhonghai Lu, and Wenhua Dou. Analysis of worst-case delay bounds for best-
effort communication in wormhole networks on chip. In International Symposium on
Networks-on-Chip, 2009.
[80] Adrian Racu and Leandro Soares Indrusiak. Using genetic algorithms to map hard real-time
on noc-based systems. In 7th International Workshop on Reconfigurable Communication-
centric Systems-on-Chip, 2012.
[81] Jorge Real and Alfons Crespo. Mode change protocols for real-time systems: A survey and
a new proposal. Real-Time Systems Journal, 2004.
[82] Pradip Kumar Sahu and Santanu Chattopadhyay. A survey on application mapping strategies
for network-on-chip design. Journal of System Architecture, 2013.
[83] Zheng Shi. Real-Time Communication Services for Networks on Chip. PhD thesis, Depart-
ment of Computer Science, University of York, United Kingdom, 2009.
[84] Zheng Shi and Alan Burns. Priority assignment for real-time wormhole communication in
on-chip networks. In Proceedings of the 29th IEEE Real-Time Systems Symposium, 2008.
[85] Zheng Shi and Alan Burns. Real-time communication analysis for on-chip networks with
wormhole switching. In International Symposium on Networks-on-Chip, 2008.
[86] Zheng Shi and Alan Burns. Real-time communication analysis with a priority share policy in
on-chip networks. In Proceedings of the 21st Euromicro Conference on Real-Time Systems,
2009.
[87] Zheng Shi and Alan Burns. Schedulability analysis and task mapping for real-time on-chip
communication. Real-Time Systems Journal, 2010.
REFERENCES 221
[88] Zheng Shi, Alan Burns, and Leandro Soares Indrusiak. Schedulability analysis for real time
on-chip communication with wormhole switching. International Journal on Embedded and
Real-Time Communication Systems, 2010.
[89] Hyojeong Song, Boseob Kwon, and Hyunsoo Yoon. Throttle and preempt: a new flow
control for real-time communications in wormhole networks. In Proceedings of the 1997
International Conference on Parallel Processing, 1997.
[90] Marco Spuri. Analysis of deadline scheduled real-time systems. Technical report inria-
00073920, INRIA, France, 1996.
[91] Krishnan Srinivasan and Karam Chatha. A technique for low energy mapping and routing
in network-on-chip architectures. In Proceedings of the International Symposium on Low
Power Electronics and Design, 2005.
[92] Andrew Tanenbaum. Computer Networks. Prentice Hall Professional Technical Reference,
4th edition, 2002.
[93] Tilera. TILE64
TM
Processor.
www.tilera.com/products/processors/TILEPro_Family.
[94] David Wentzlaff and Anant Agarwal. Factored operating systems (fos): the case for a scal-
able operating system for multicores. SIGOPS Operating Systems Review, 2009.
[95] Heechul Yun, Gang Yao, R. Pellizzoni, M. Caccamo, and Lui Sha. Memory access control
in multiprocessor for real-time systems with mixed criticality. In Proceedings of the 24th
Euromicro Conference on Real-Time Systems, 2012.
