Performance Analysis of Priority-Aware NoCs with Deflection Routing
  under Traffic Congestion by Mandal, Sumit K. et al.
Performance Analysis of Priority-Aware NoCs with
Deflection Routing under Traffic Congestion
Sumit K. Mandal1, Anish Krishnakumar1, Raid Ayoub2, Michael Kishinevsky2, Umit Y. Ogras1
1Dept. of ECE, University of Wisconsin-Madison; 2Intel Corporation, Hillsboro, OR
ABSTRACT
Priority-aware networks-on-chip (NoCs) are used in industry to
achieve predictable latency under different workload conditions.
These NoCs incorporate deflection routing to minimize queuing
resources within routers and achieve low latency during low traffic
load. However, deflected packets can exacerbate congestion during
high traffic load since they consume the NoC bandwidth. State-of-
the-art analytical models for priority-aware NoCs ignore deflected
traffic despite its significant latency impact during congestion. This
paper proposes a novel analytical approach to estimate end-to-end
latency of priority-aware NoCs with deflection routing under bursty
and heavy traffic scenarios. Experimental evaluations show that
the proposed technique outperforms alternative approaches and
estimates the average latency for real applications with less than
8% error compared to cycle-accurate simulations.
1 INTRODUCTION
Pre-silicon design-space exploration and system-level simulations
constitute a crucial component of the industrial design cycle [1, 2].
They are used to confirm that new generation designs meet
power-performance targets before labor- and time-intensive RTL
implementation starts [3]. Furthermore, virtual platforms combine
power-performance simulators and functional models to enable
firmware and software development while hardware design is in
progress [4]. These pre-silicon evaluation environments incorporate
cycle-accurate NoC simulators due to the criticality of shared
communication and memory resources in overall performance [5, 6].
However, slow cycle-accurate simulators have become the major
bottleneck of pre-silicon evaluation. Similarly, exhaustive
design-space exploration is not feasible due to the long simulation times.
Therefore, there is a strong need for fast, yet accurate, analytical
models to replace cycle-accurate simulations to increase the speed
and scope of pre-silicon evaluations [7].
Analytical NoC performance models are used primarily for fast
design space exploration since they provide significant speed-up
compared to detailed simulators [8–11]. However, most existing
analytical models fail to capture two important aspects of
industrial NoCs [12]. First, they do not model routers that employ
priority arbitration. Second, existing analytical models assume that
the destination nodes always sink the incoming packets. In reality,
network interfaces between the routers and cores have finite (and
typically limited) ingress buffers. Hence, packets bounce (i.e., they
are deflected) at the destination nodes when the ingress queue is
full. Recently proposed performance models target priority-aware
This article will appear in the Proceedings of ICCAD 2020. This work was supported
partially by Strategic CAD Labs, Intel Corporation, USA.
Authors’ addresses: S. K. Mandal, A. Krishnakumar and U. Y. Ogras, Department of
Electrical and Computer Engineering, University of Wisconsin-Madison, WI, 53706.
Emails: {skmandal, anish.n.krishnakumar, uogras}@wisc.edu;
R. Ayoub and M. Kishinevsky, Intel Corporation, 2111 NE 25th Ave., Hillsboro, OR
97124; emails: {raid.ayoub, michael.kishinevsky}@intel.com .
[Figure 1 plot: Average Latency (cycles) versus Injection Rate (packets/cycle/source) and Deflection Probability.]
Figure 1: Cycle-accurate simulations on a 6×6 NoC show that
the average latency increases significantly with larger
deflection probability ($p_d$) at the sink.
NoCs [13, 14]. However, these models ignore deflection due to finite
buffers and use the packet injection rate as the primary input. This is a
significant limitation since the deflection probability ($p_d$) increases
both the hop count and traffic congestion. Indeed, Figure 1 shows
that the average NoC latency increases significantly with the
probability of deflection. For example, the average latency varies from
6 to 70 cycles for an injection rate of 0.25 packets/cycle/source when
$p_d$ varies from 0.1 to 0.5. Therefore, performance models for
priority-aware NoCs have to account for deflection probability at the
destinations.
This work proposes an accurate analytical model for priority-aware
NoCs with deflection routing under bursty traffic. In addition to
increasing the hop count, deflection routing also aggravates traffic
congestion due to extra packets traveling in the network. Since the
deflected packets also have a complex effect on the egress queues of
the traffic sources, analytical modeling of priority-aware NoCs with
deflection routing is challenging. To address this problem, we first
need to compute the probability distribution of inter-arrival times
of deflected packets. Specifically, the average number of deflected
packets, as well as the mean and second-order moment of their
inter-arrival times, are required since we consider bursty traffic. To this end,
the proposed approach starts with a canonical queuing system
with deflection routing. We first model the distribution of deflected
traffic and the average queuing delay for this system. However,
this methodology is not scalable when the network has multiple
queues with complex interactions between them. Therefore, we
also propose a superposition-based technique to obtain the waiting
time of the packets in arbitrarily sized industrial NoCs. This
technique decomposes the queuing system into multiple subsystems.
The structure of these subsystems is similar to the canonical
queuing system. After deriving the analytical expressions for the
parameters of the distribution model of deflected packets of
individual subsystems, we superimpose the results to solve the original
system with multiple queues. Thorough experimental evaluations
arXiv:2008.03904v1 [cs.PF] 10 Aug 2020
with industrial NoCs and their cycle-accurate simulation models
show that the proposed technique significantly outperforms prior
approaches [9, 13]. In particular, the proposed technique achieves
less than 8% modeling error when tested with real applications from
different benchmark suites. The major contributions of this work are
as follows:
• An accurate performance model for priority-aware NoCs
with deflection routing under bursty traffic,
• An algorithm to obtain end-to-end latency using the
proposed performance model,
• Detailed experimental evaluation with industrial
priority-aware NoCs under varying degrees of deflection.
The rest of this paper is organized as follows. Section 2 summarizes
the related research. Section 3 presents the background and
overview of the proposed work. Section 4 describes the proposed
methodology to construct the analytical model for priority-aware
NoCs, which considers deflection routing. Section 5 details
experimental evaluations, and Section 6 concludes the paper.
2 RELATED WORK
Deflection routing was first introduced in the domain of optical NoC
as hot-potato routing [15]. Later, it was adapted for the NoCs used
in high-performance SoCs to minimize buffer requirements and
increase energy efficiency [16–18]. This routing mechanism always
assigns the packets to a free output port of a router, even if the
assignment does not result in minimum latency. This way, the buffer
size requirement in the routers is minimized. The authors in [19]
perform a thorough study on the effectiveness of deflection routing for
different NoC topologies and routing algorithms. Deflection routing
is also used in industrial priority-aware NoCs [12]. Since arbitrary
deflections can cause livelocks and unpredictable latency, industrial
priority-aware NoCs deflect the packets only at the destination
nodes when the ingress buffer is full. Furthermore, the deflected
packets always remain within the same row or column, and they
are guaranteed to be sunk after a fixed number of deflections.
NoC performance analysis techniques have been used for design
space exploration and architectural studies such as buffer
sizing [8, 20, 21]. However, most of these techniques do not consider
NoCs with priority arbitration and deflection routing, which are
the key features of industrial NoCs [12]. Performance analysis of
priority-aware queuing networks has also been studied for off-chip
networks [22–24]. These analytical models consider the queuing
networks in continuous time. However, transactions in an NoC
happen at discrete clock cycles. Therefore, the underlying queuing
system needs to be considered in the discrete time domain. A
performance analysis technique for a priority-aware queuing network in
the discrete time domain is presented in [24]. However, this technique
suffers from high complexity for a complex queuing network; hence,
it is not applicable to industrial priority-aware NoCs.
A recent technique targets priority-aware NoCs [9], but it
considers only a single class of packets in each queue of the network.
In contrast, industrial priority-aware NoCs have multiple classes
of packets that can exist in the same queue. NoCs with multiple
priority traffic classes have recently been analyzed in [13]. However,
this analysis assumes that the input traffic follows a geometric
distribution. This technique has limited applications since industrial
NoCs can experience bursty traffic. Furthermore, it does not
consider deflection routing. Since deflection routing increases traffic
congestion, it is crucial to incorporate this aspect while constructing
performance models. An analytical bound on maximum delay
in networks with deflection routing is presented in [25]. However,
evaluating maximum delay is not useful since it leads to significant
overestimation. Another analytical model for NoCs with deflection
routing is proposed in [26]. The authors first compute the blocking
probability at each port of a router using an M/G/1 queuing model.
Then, they compute the contention matrix at each router port. The
average waiting time of packets at each port is computed using the
contention matrix. However, this analysis ignores different priority
classes and applies only to continuous-time queuing systems.
In contrast to prior work, we propose a performance analysis
technique that considers priority-aware NoCs with deflection
routing under both bursty and high traffic load. The proposed
technique applies the superposition principle to obtain the statistical
distribution of the deflected packets. Using this distribution, it
computes the average waiting time for each queue. To the best of
our knowledge, this is the first analytical model for priority-aware
industrial NoCs with deflection routing under high traffic load.
3 BACKGROUND AND OVERVIEW
3.1 Assumptions and Notations
Architecture: This work considers priority-aware NoCs used in
high-end servers and many-core architectures [12]. Each column of
the NoC architecture, shown in Figure 2, is also used in client
systems such as Intel i7 processors [27]. Hence, the proposed analysis
technique is broadly applicable to a wide range of industrial NoCs.
In priority-aware NoCs, the packets already in the network have
higher priority than the packets waiting in the egress queues of the
sources. Assume that Node 2 in Figure 2 sends a packet to Node 12
following Y-X routing (highlighted by red arrows). Suppose that
a packet in the egress queue of Node 6 collides with this packet.
The packet from Node 2 to Node 12 will take precedence since the
packets already in the NoC have higher priority. Hence, packets
experience a queuing delay at the egress queues but have predictable
latency until they reach the destination or turning point (Node 10
in Figure 2). Then, the packet competes with the packets already in the
corresponding row. That is, the path from the source (Node 2) to
[Figure 2 diagram: 4×4 mesh of routers numbered 1 to 16. Legend: Router links; Path of packet w/o deflection; Path of deflected packets; Sink / Junction; Routers.]
Figure 2: A representative 4×4 mesh with deflection routing.
the destination (Node 12) can be considered as two segments, each of which
consists of a queuing delay followed by a predictable latency.
Deflection in priority-aware NoCs happens when the ingress
queue at the turning point (Node 10) or final destination (Node 12)
becomes full. This can happen if the receiving node, such as a cache
controller, cannot process the packets fast enough. The probability
of observing a full queue increases with smaller queues (needed
to save area) and heavy traffic load from the cores. If the packet is
deflected at the destination node, it circulates within the same row,
as shown in Figure 2. Consequently, a combination of regular and
deflected traffic can load the corresponding row and pressure the
ingress queue at the turning point (Node 10). This, in turn, can lead
to deflection on the column and propagate the congestion towards
the source. Finally, if a packet is deflected more than a specific
number of times, it reserves a slot in the ingress queue. This bounds
the maximum number of deflections and avoids livelock.
Traffic: Industrial priority-aware NoCs can experience bursty
traffic, which is characteristic of real applications [11, 28]. This work
considers generalized geometric (GGeo) distribution for the input
traffic, which takes burstiness into account [29]. GGeo traffic is
characterized by an average injection rate ($\lambda$) and the coefficient
of variation of inter-arrival time ($C^A$). We define a traffic class as
the traffic of each source-destination pair. The average injection
rate and coefficient of variation of inter-arrival time of class-i are
denoted by $\lambda_i$ and $C^A_i$, respectively, as shown in Table 1. Finally,
the mean service time and coefficient of variation of service
time of class-i are denoted as $T_i$ and $C^S_i$ (see Table 1).
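The two traffic parameters can be illustrated with a minimal Python sketch that estimates an injection rate and a coefficient of variation of inter-arrival time from a packet trace. The function name `estimate_arrival_parameters` and the example traces are hypothetical and are not part of the paper; the sketch only applies the standard definitions (rate as the reciprocal of the mean inter-arrival gap, coefficient of variation as standard deviation over mean) and does not fit a GGeo distribution.

```python
from statistics import mean, pvariance


def estimate_arrival_parameters(arrival_cycles):
    """Estimate (injection rate, coefficient of variation of inter-arrival time).

    arrival_cycles: sorted clock cycles at which packets were injected.
    This is an illustrative helper, not the procedure used in the paper.
    """
    if len(arrival_cycles) < 3:
        raise ValueError("need at least three arrivals (two gaps)")
    # Inter-arrival gaps, measured in cycles.
    gaps = [b - a for a, b in zip(arrival_cycles, arrival_cycles[1:])]
    mean_gap = mean(gaps)
    rate = 1.0 / mean_gap                    # packets per cycle
    ca = pvariance(gaps) ** 0.5 / mean_gap   # std / mean of the gaps
    return rate, ca


# Hypothetical traces: one perfectly periodic, one bursty.
periodic_trace = [0, 4, 8, 12, 16]
bursty_trace = [0, 1, 2, 10, 11, 12, 20]

periodic_rate, periodic_ca = estimate_arrival_parameters(periodic_trace)
bursty_rate, bursty_ca = estimate_arrival_parameters(bursty_trace)
```

Both traces could have similar rates, yet the bursty one has a much larger coefficient of variation, which is the property a two-parameter traffic description is meant to capture.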
3.2 Overview of the Proposed Approach
Our goal is to construct an accurate analytical model to compute
the end-to-end latency for priority-aware NoCs with deflection
routing. The proposed approach can be used to accelerate full
system simulations and also to perform design space exploration. We
assume that the parameters of the GGeo distribution of the input
traffic to the NoC ($\lambda$, $C^A$) are known from the knowledge of the
application. The proposed model uses the deflection probability ($p_d$)
as the second major input, in sharp contrast to existing techniques
that ignore deflection. Its range is found from architecture
simulations as a function of the NoC architecture (topology, number
of processors, and buffer sizes). Its analytical modeling is left for
future work. The proposed analytical model utilizes the distribution
of the input traffic to the NoC ($\lambda$, $C^A$) and the deflection probability
($p_d$) to compute the average end-to-end latency as a sum of four
Table 1: Summary of the notations used in this paper.
Symbol                    Description
$\lambda_i$               Arrival rate of class-i
$p_{d_j}$                 Deflection probability at sink-j
$T_i$, $\hat{T}_i$        Original and modified mean service time of class-i
$\rho_i$                  Mean server utilization of class-i ($= \lambda_i T_i$)
$C^A_i$                   Coefficient of variation of inter-arrival time of class-i
$C^S_i$, $\hat{C}^S_i$    Coefficient of variation of original and modified service time of class-i
$C^D_i$                   Coefficient of variation of inter-departure time of class-i
$C^M_i$                   Coefficient of variation of inter-departure time of merged traffic of class-i
$W_i$                     Mean waiting time of class-i
components: (1) queuing delay at the source, (2) the latency from
the source router to the junction router, (3) queuing delay at the
junction router, and (4) the latency from the junction router to the
destination. Note that all these components account for deflection,
and it is challenging to compute them, especially under high traffic
load.
4 METHODOLOGY
This section presents the proposed performance analysis technique
for estimating the end-to-end latency for priority-aware NoCs with
deflection routing. We first construct a model for a canonical system
with a single traffic class, where the deflected traffic distribution is
approximated using a GGeo distribution (Section 4.1). Subsequently,
we introduce a scalable approach for a network with multiple traffic
classes. In this approach, we first develop a solution for the
canonical system. Then, we employ the principle of superposition to extend
the analytical model to larger and realistic NoCs with multiple
traffic classes (Section 4.2). Finally, we propose an algorithm that uses
our analytical models to compute the average end-to-end latency
for a priority-aware NoC with deflection routing (Section 4.3).
4.1 An Illustration with a Single Traffic Class
Figure 3(a) shows an example of a single class input traffic and
egress queue that inject traffic to a network with deflection routing.
The input packets are buffered in the egress queue $Q_i$ (analogous
to the packets stored in the egress queue of Node 2 in Figure 2). We
denote the traffic of $Q_i$ as class-i, which is modeled using GGeo
[Figure 3 diagrams. (a) Egress queue ($Q_i$) and ring buffers ($Q_d$) feeding a priority arbiter and server S; a port with a lower index has a higher priority; the sink signal ($p_{d_i}$) decides between the sink destination and deflected traffic. (b) Separate queue nodes for $Q_d$ and $Q_i$ with modified servers; merged departures go to the sink with probability $1 - p_{d_i}$ and to $Q_d$ with probability $p_{d_i}$.]
Figure 3: (a) Queuing system of a single class with deflection routing. (b) Approximate queuing system to compute $C^A_{d_i}$.
distribution with two parameters ($\lambda_i$, $C^A_i$). The packets in $Q_i$ are
dispatched to a priority arbiter and assigned a low priority, marked
with 2. In contrast, the packets already in the network have a high
priority and are routed to the port marked with 1. The packet
traverses a certain number of hops (similar to the latency from the
source router to the junction router in Figure 2) and reaches the
destination. Since the number of hops is constant for a particular
traffic class, we omit these details in Figure 3(a) for simplicity. If the
ingress queue at the destination is full (with probability $p_{d_i}$), the
packet is deflected back into the network. Otherwise, it is consumed
at the destination (with probability $1 - p_{d_i}$). Deflected packets
travel through the NoC (within the column or row as illustrated
in Figure 2) and pass through the source router, but this time with
higher priority. The profile of the deflected packets in the network
is modeled by a buffer ($Q_d$) in Figure 3(a), since they remain in
order and have a fixed latency from the destination to the original
source. This process continues until the destination can consume
the deflected packets.
Our goal is to compute the average waiting time $W_i$ in the source
queue, i.e., components 1 and 3 of the end-to-end latency described
in Section 3.2. To obtain $W_i$, we first need to derive the analytical
expression for the rate of deflected packets of class-i ($\lambda_{d_i}$) and the
coefficient of variation of inter-arrival time of the deflected packets
($C^A_{d_i}$) as follows.
Rate of deflected packets ($\lambda_{d_i}$): $\lambda_{d_i}$ is obtained by calculating
the average number of times a packet is deflected ($N_{d_i}$) until it is
consumed at the destination as:
$$N_{d_i} = p_{d_i}(1 - p_{d_i}) + 2 p_{d_i}^2 (1 - p_{d_i}) + \ldots + n p_{d_i}^n (1 - p_{d_i}) + \ldots = \sum_{n=1}^{\infty} n p_{d_i}^n (1 - p_{d_i}) = \frac{p_{d_i}}{1 - p_{d_i}} \tag{1}$$
Therefore, $\lambda_{d_i}$ can be expressed as:
$$\lambda_{d_i} = \lambda_i N_{d_i} = \lambda_i \frac{p_{d_i}}{1 - p_{d_i}} \tag{2}$$
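As a quick numerical check of Equations 1 and 2, the following Python sketch evaluates the closed form and compares it against a truncated version of the series. The function names and the truncation length are illustrative choices, not part of the paper, and the sketch ignores the bound on the number of deflections mentioned in Section 3.1, as Equation 1 does.

```python
def mean_deflections(p_d):
    """Equation 1 (closed form): expected number of deflections per packet."""
    if not 0.0 <= p_d < 1.0:
        raise ValueError("p_d must be in [0, 1)")
    return p_d / (1.0 - p_d)


def deflected_rate(lam, p_d):
    """Equation 2: rate of deflected packets for injection rate lam."""
    return lam * mean_deflections(p_d)


def truncated_series(p_d, terms=200):
    """Equation 1 (series form), truncated after `terms` terms.

    Term n is the probability of exactly n deflections, weighted by n.
    """
    return sum(n * p_d**n * (1.0 - p_d) for n in range(1, terms + 1))


closed_form = mean_deflections(0.3)
series_value = truncated_series(0.3)
rate_at_half = deflected_rate(0.25, 0.5)
```

With $p_{d_i} = 0.5$ every injected packet is deflected once on average, so the deflected rate equals the injection rate; this illustrates why deflected traffic cannot be neglected at high deflection probabilities.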
Coefficient of variation of inter-arrival time of deflected
packets ($C^A_{d_i}$): To compute $C^A_{d_i}$, the priority-related interaction
between the deflected traffic of $Q_d$ and new injections in $Q_i$ must
be captured. This computation is more involved due to the priority
arbitration between the packets in $Q_d$ and $Q_i$, which involves a
circular dependency. We tackle this problem by transforming the
system in Figure 3(a) into an approximate representation shown
in Figure 3(b) to simplify the computations. The idea here is to
transform the priority queuing with a shared resource into separate
queue nodes (queue + server) with a modified server process. This
transformation enables the decomposition of $Q_d$ and $Q_i$ and their
shared server into individual queue nodes with servers $\hat{S}_d$ and $\hat{S}_i$,
respectively. The departure traffic from these two nodes merges at
the destination, where it is consumed with probability $1 - p_{d_i}$ and
deflected otherwise.
The input traffic to the egress queue, as well as the deflected
traffic, may exhibit bursty behavior. Indeed, the deflected traffic
distribution can be bursty because of the server-process effect and
the priority interactions between the input traffic and the deflected
traffic, even when the input traffic is not bursty. Therefore, we
approximate the distribution of the deflected traffic via a GGeo
distribution. To compute the parameters of the GGeo traffic, we need
to apply the principle of maximum entropy (ME) as shown in [29].
To obtain the modified service process of class-i, we first calculate
the probability of no packets in $Q_i$ and in its corresponding server
(i.e., $p_{Q_i}(0)$) using ME as,
$$p_{Q_i}(0) = 1 - \rho_i - \rho_{d_i} \frac{n_i}{n_i + \rho_i + \rho_{d_i}} \tag{3}$$
where $\rho_i$ and $\rho_{d_i}$ denote the utilization of the respective servers,
and $n_i$ is the occupancy of class-i in $Q_i$. Next, we apply Little’s law
to compute the first-order moment of the modified service time ($\hat{T}_i$) as:
$$\hat{T}_i = \frac{1 - p_{Q_i}(0)}{\lambda_i} \tag{4}$$
Subsequently, we obtain the effective coefficient of variation $\hat{C}^S_i$ as:
$$(\hat{C}^S_i)^2 = \frac{(1 - \hat{\rho}_i)(2 n_i + \hat{\rho}_i) - \hat{\rho}_i (C^A_i)^2}{\hat{\rho}_i^2} \tag{5}$$
where $\hat{\rho}_i = \lambda_i \hat{T}_i$. We follow similar steps (Equations 3 to 5)
for the deflected traffic to obtain $\hat{T}_{d_i}$ and $\hat{C}^S_{d_i}$. With the modified
service process, the coefficients of variation of inter-departure time
of the packets in $Q_d$ ($C^D_{d_i}$) and $Q_i$ ($C^D_i$) are computed using the
process merging method [30]. Then, we find the coefficient of variation
($C^M_i$) of the merged traffic from queues $Q_d$ and $Q_i$ as:
$$(C^M_i)^2 = \frac{1}{\lambda_{d_i} + \lambda_i} \left( \lambda_{d_i} (C^D_{d_i})^2 + \lambda_i (C^D_i)^2 \right) \tag{6}$$
We note that $C^M_i$ is a function of the coefficient of variation for
the inter-arrival time of deflected traffic $C^A_{d_i}$. Since part of this merged
traffic is consumed at the sink, we apply the traffic splitting method
from [30] to approximate $C^A_{d_i}$ as:
$$(C^A_{d_i})^2 = 1 + p_{d_i} \left( (C^M_i)^2 - 1 \right) \tag{7}$$
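The arithmetic of Equations 4 to 7 can be sketched in Python as below. This is only a transcription of the formulas under stated assumptions: the empty-queue probability, the occupancy $n_i$, and the inter-departure coefficients $C^D_{d_i}$ and $C^D_i$ are taken as given inputs (their computation follows [29] and [30] and is not reproduced here), and all function and argument names are hypothetical. Squared coefficients of variation are passed directly to avoid repeated squaring.

```python
def modified_service(lam_i, p_empty, n_i, ca_i_sq):
    """Equations 4 and 5: modified mean service time and squared service CV.

    lam_i:   injection rate of class-i
    p_empty: probability of an empty queue and server, p_Qi(0), assumed given
    n_i:     occupancy of class-i in Q_i, assumed given
    ca_i_sq: squared coefficient of variation of inter-arrival time
    """
    t_hat = (1.0 - p_empty) / lam_i              # Equation 4 (Little's law)
    rho_hat = lam_i * t_hat                      # modified utilization
    cs_hat_sq = ((1.0 - rho_hat) * (2.0 * n_i + rho_hat)
                 - rho_hat * ca_i_sq) / rho_hat**2   # Equation 5
    return t_hat, cs_hat_sq


def merged_cv_sq(lam_d, cd_d_sq, lam_i, cd_i_sq):
    """Equation 6: rate-weighted merge of the two departure streams."""
    return (lam_d * cd_d_sq + lam_i * cd_i_sq) / (lam_d + lam_i)


def deflected_arrival_cv_sq(p_d, cm_sq):
    """Equation 7: splitting the merged stream with probability p_d."""
    return 1.0 + p_d * (cm_sq - 1.0)


# Illustrative numbers only; they are not taken from the paper.
t_hat, cs_hat_sq = modified_service(lam_i=0.2, p_empty=0.5, n_i=1.0, ca_i_sq=1.0)
cm_sq = merged_cv_sq(lam_d=0.1, cd_d_sq=2.0, lam_i=0.3, cd_i_sq=1.0)
ca_d_sq = deflected_arrival_cv_sq(p_d=0.3, cm_sq=cm_sq)
```

Equation 7 pulls the squared coefficient of variation towards 1 as the deflection probability shrinks, so a merged stream with $(C^M_i)^2 = 1$ yields $(C^A_{d_i})^2 = 1$ for any deflection probability.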
Finally, we extend the priority-aware formulations in the continuous
time domain [23] to the discrete time domain to obtain the average
waiting time of the packets in $Q_{d_i}$ and $Q_i$:
$$W_{d_i} = \frac{\rho_{d_i}(T_{d_i} - 1) + \rho_i (T_i - 1) + T_{d_i}\left((C^A_{d_i})^2 + \lambda_{d_i} - 1\right)}{2(1 - \rho_{d_i})} \tag{8}$$
$$W_i = \frac{\rho_{d_i}(T_{d_i} + 1) + 2 \rho_{d_i} W_{d_i} + \rho_i (T_i - 1) + T_i\left((C^A_i)^2 + \lambda_i - 1\right)}{2(1 - \rho_i - \rho_{d_i})} \tag{9}$$
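Equations 8 and 9 can be evaluated with a few lines of Python, sketched below. The function name, the argument names, and the numeric inputs are hypothetical. The sketch assumes $\rho_{d_i} = \lambda_{d_i} T_{d_i}$ by analogy with the definition of $\rho_i$ in Table 1, and it adds a stability check on the total utilization that is not stated in the text.

```python
def single_class_waiting_times(lam_i, t_i, ca_i_sq, lam_d, t_d, ca_d_sq):
    """Return (W_di, W_i) from Equations 8 and 9.

    lam_i, t_i, ca_i_sq: rate, mean service time, squared inter-arrival CV of class-i
    lam_d, t_d, ca_d_sq: the same quantities for the deflected traffic
    """
    rho_i = lam_i * t_i
    rho_d = lam_d * t_d   # assumption: utilization defined as for class-i
    if rho_i + rho_d >= 1.0:
        raise ValueError("total utilization must stay below 1")

    # Equation 8: waiting time of deflected (high-priority) packets.
    w_d = (rho_d * (t_d - 1.0) + rho_i * (t_i - 1.0)
           + t_d * (ca_d_sq + lam_d - 1.0)) / (2.0 * (1.0 - rho_d))

    # Equation 9: waiting time of newly injected (low-priority) packets.
    w_i = (rho_d * (t_d + 1.0) + 2.0 * rho_d * w_d + rho_i * (t_i - 1.0)
           + t_i * (ca_i_sq + lam_i - 1.0)) / (2.0 * (1.0 - rho_i - rho_d))
    return w_d, w_i


# Illustrative inputs only; they are not taken from the paper.
w_d, w_i = single_class_waiting_times(
    lam_i=0.2, t_i=2.0, ca_i_sq=0.8,
    lam_d=0.1, t_d=2.0, ca_d_sq=0.9,
)
```

The denominator of Equation 9 contains the utilization of both streams, so the low-priority waiting time grows quickly as the deflected traffic approaches the remaining server capacity.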
4.2 Queuing System with Multiple Traffic Classes
The analytical model for the system with a single class presented
in Section 4.1 becomes intractable with a higher number of traffic
classes. This section introduces a scalable approach based on the
superposition principle that builds upon our canonical system used
in Section 4.1.
Figure 4(a) shows an example with priority arbitration and N
egress queues, one for each traffic class. We note that this queuing
system is a simplified representation of a real system. The packets
routed to port i have higher priority than those routed to port j
for i < j. The deflected traffic in the network is buffered in $Q_d$,
which has the highest priority in the queuing system. The primary
goal is to model the queuing time of the packets of each traffic
[Figure 4 diagrams. (a) $Q_d$ and egress queues $Q_1$ to $Q_N$ connected to arbiter ports 1 to N+1 of a shared server S, with departures sunk with probability $1 - p_d$ and deflected with probability $p_d$. (b) Subsystem-1 to Subsystem-N, each pairing $Q_d$ with one egress queue. (c) Superposition with $\lambda_d = \sum_{i=1}^{N} \lambda_{d_i}$ and $C^A_d \approx \mathcal{M}(C^A_{d_1}, \ldots, C^A_{d_N})$.]
Figure 4: (a) Queuing system with N classes with deflection routing, (b) Decomposition into N subsystems to calculate GGeo
parameters of deflected traffic per class, (c) Applying superposition to obtain the GGeo parameters of overall deflected traffic.
$\mathcal{M}$ denotes the merging process.
class. Modeling the coefficient of variation of the deflected traffic
becomes harder since deflected packets interact with all traffic
classes rather than a single class. These interactions complicate the
analytical expressions significantly.
Priority arbitration enables us to sort the queues in the order in
which the packets are served. The queue of the deflected packets has
the highest priority, while the rest are ordered with respect to their
indices. Due to this inherent order between the priority classes, their
impact on the deflected traffic distribution can be approximated
as being independent of each other. This property enables us to
decompose the queuing system into multiple subsystems and model
each subsystem separately, as illustrated in Figure 4(b). Then, we
apply the principle of superposition to obtain the parameters of the
GGeo distribution of the deflected traffic. Note that each of these
subsystems is identical to the canonical system analyzed in Section 4.1.
Hence, we first compute $\lambda_{d_i}$ and $C^A_{d_i}$ of each subsystem-i following
the procedure described in Section 4.1. Subsequently, we apply the
superposition principle to $\lambda_{d_i}$ and $C^A_{d_i}$ for $i = 1 \ldots N$ to obtain the
GGeo distribution parameters of the deflected traffic ($\lambda_d$, $C^A_d$).
In general, we obtain the GGeo distribution parameters of the
deflected traffic corresponding to class-i by setting the injection rates
of all traffic classes except class-i to zero ($\lambda_j = 0$, $j = 1 \ldots N$, $j \neq i$).
The values of $\lambda_{d_i}$ and $C^A_{d_i}$ can be expressed as:
$$\lambda_{d_i} = \lambda_d \Big|_{\lambda_j = 0,\ j \neq i;\ \lambda_i > 0} \quad \text{and} \quad C^A_{d_i} = C^A_d \Big|_{\lambda_j = 0,\ j \neq i;\ \lambda_i > 0} \tag{10}$$
Subsequently, we apply the principle of superposition to obtain the
distribution parameters of $Q_d$ as shown in Figure 4(c). First, we
compute $\lambda_d$ by adding all $\lambda_{d_i}$ as:
$$\lambda_d = \sum_{i=1}^{N} \lambda_{d_i} \tag{11}$$
The value of $C^A_d$ is approximated by applying the superposition-based
traffic merging process [30] for each $C^A_{d_i}$, as shown below:
$$(C^A_d)^2 = \sum_{i=1}^{N} \frac{\lambda_{d_i}}{\lambda_d} (C^A_{d_i})^2 \tag{12}$$
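The superposition step of Equations 11 and 12 amounts to a sum and a rate-weighted average, as the following Python sketch shows. The function name and the per-class values are hypothetical; the per-class parameters are assumed to have been obtained beforehand from the single-class analysis of Section 4.1.

```python
def superpose_deflected_traffic(per_class):
    """Combine per-class deflected-traffic parameters (Equations 11 and 12).

    per_class: list of (lam_di, ca_di_sq) tuples, one per subsystem, where
               ca_di_sq is the squared coefficient of variation.
    Returns (lam_d, ca_d_sq) for the overall deflected traffic.
    """
    lam_d = sum(lam for lam, _ in per_class)            # Equation 11
    if lam_d <= 0.0:
        raise ValueError("total deflected rate must be positive")
    # Equation 12: each class contributes in proportion to its rate.
    ca_d_sq = sum((lam / lam_d) * ca_sq for lam, ca_sq in per_class)
    return lam_d, ca_d_sq


# Illustrative values for two subsystems; not taken from the paper.
lam_d, ca_d_sq = superpose_deflected_traffic([(0.1, 1.0), (0.3, 2.0)])
```

Because Equation 12 is a weighted average, the resulting squared coefficient of variation always lies between the smallest and largest per-class values.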
Next, we use these distribution parameters ($\lambda_d$, $C^A_d$) of the deflected
packets to calculate the waiting time of the traffic classes in the
system. The formulation of the priority-aware queuing system is
applied to obtain the waiting time of each traffic class-i ($W_i$) [22]:
$$W_i = \frac{\rho_d (T_d + 1) + 2 \rho_i W_d}{2\left(1 - \rho_d - \sum_{n=1}^{i} \rho_n\right)} + \frac{\sum_{n=1}^{i-1} \left(\rho_n (T_n + 1) + 2 \rho_i W_n\right)}{2\left(1 - \rho_d - \sum_{n=1}^{i} \rho_n\right)} + \frac{\rho_i (T_i - 1) + T_i\left((C^A_i)^2 + \lambda_i - 1\right)}{2\left(1 - \rho_d - \sum_{n=1}^{i} \rho_n\right)} \tag{13}$$
The first term in Equation 13 denotes the effect of deflected traffic
on class-i; the second term denotes the effect of higher priority
[Figure 5 plot: Average Latency (cycles) versus Injection Rate (packets/cycle/source). Legend: Simulation; Analytical (Proposed); Analytical (w/o Decomposition and w/o Deflection) [9]; Analytical (w/o Deflection Routing) [13].]
Figure 5: Comparison of average latency between simulation
and analytical model for the canonical example shown
in Figure 4 with $p_d$ = 0.3 and N = 5.
classes (class-j, j < i) on class-i; and the last term denotes the effect
of class-i itself. For more complex scenarios that include traffic
splits, we apply an iterative decomposition algorithm [14] to obtain
the queuing time of different classes.
Figure 5 shows the average latency comparison between the
proposed analytical model and simulation for the system in Figure 4.
In this setup, we assume the number of classes is 5 (N = 5), $p_d$ = 0.3,
and the input traffic distribution is geometric. The results show that
the analytical model performs well against the simulation, with
only 4% error on average. In contrast, the analytical model from [9]
highly overestimates the latency as it does not consider multiple
traffic classes. The performance model of the priority-aware NoC
in [13] accounts for multiple traffic classes, but it does not model
deflection. Hence, it severely underestimates the average latency.
4.3 Summary & End-to-End Latency Estimation
Summary of the analytical modeling: We presented a scalable
approach for the analytical model generation of end-to-end latency
that handles multiple traffic classes of priority-aware NoCs with
deflection routing. It applies the principle of superposition on
subsystems, where each subsystem is a canonical queuing system of
a single traffic class. This significantly simplifies the approximation of
the GGeo parameters of deflected traffic and, in turn, the latency
calculations.
End-to-End latency computation: Algorithm 1 describes the
end-to-end latency computation with our proposed analytical
model. The input parameters of the algorithm are the NoC topology,
Algorithm 1: End-to-end latency computation
1 Input: NoC topology, routing algorithm, service process,
  input distribution for each class ($\lambda$, $C^A$), deflection
  probability ($p_d$) for each sink
2 Output: Average end-to-end latency ($L_{avg}$)
3 S = set of all classes in the network
4 N = number of queues in the network
5 $S_n$ = set of classes in queue n
  /* Distribution of deflected traffic */
6 for i = 1:|S| do
7   Compute $\lambda_{d_i}$ and $C^A_{d_i}$ using Equation 10
8   Compute $\lambda_d$ and $C^A_d$ using Equation 11 and Equation 12
9 end
10 Compute $W_d$ using $\lambda_d$ and $C^A_d$
  /* Average waiting time of each class */
11 for n = 1:N do
12   for s = 1:$|S_n|$ do
13     Compute $W_{ns}$ using Equation 13 (if $|S_n| = 1$)
14     Compute $W_{ns}$ following the decomposition method in
       [14] (if $|S_n| > 1$)
15   end
16 end
17 $L_{avg} = \frac{\sum_{n=1}^{N} \sum_{s=1}^{|S_n|} (W_{ns} + L_{ns}) \lambda_{ns}}{\sum_{n=1}^{N} \sum_{s=1}^{|S_n|} \lambda_{ns}}$ (For mesh this term includes the
   latency both on the rows and the columns.)
routing algorithm, service process of each server, input traffic
distribution for each class, and deflection probability per sink. It outputs
the average end-to-end latency ($L_{avg}$). First, the queuing system is
decomposed into multiple subsystems as shown in Figure 4(b) and
$\lambda_{d_i}$ and $C^A_{d_i}$ for each subsystem-i are computed. Subsequently, the
proposed superposition methodology is applied to compute $\lambda_d$ and
$C^A_d$, shown in lines 6–9 of the algorithm. Then, $\lambda_d$ and $C^A_d$ are used
to compute the average waiting time of the deflected packets ($W_d$).
Next, the average waiting time for class-s in $Q_n$ ($W_{ns}$) is computed
as shown in lines 13–14. The service time combined with static
latency from source to destination ($L_{ns}$) is added to $W_{ns}$ to obtain the
end-to-end latency. Finally, the average end-to-end latency ($L_{avg}$)
is computed by taking a weighted average of the latency of each
class, as shown in line 17 of the algorithm.
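The final step of Algorithm 1, the rate-weighted average of per-class latencies, can be sketched in Python as follows. The function name and the tuple layout are illustrative assumptions; the per-class waiting times $W_{ns}$ and static latencies $L_{ns}$ are assumed to have been computed by the earlier steps, and the numbers in the example are made up.

```python
def average_end_to_end_latency(classes):
    """Rate-weighted average latency over all classes in all queues.

    classes: list of (lam_ns, w_ns, l_ns) tuples, where lam_ns is the
             injection rate, w_ns the mean waiting time, and l_ns the
             service time plus static source-to-destination latency.
    """
    total_rate = sum(lam for lam, _, _ in classes)
    if total_rate <= 0.0:
        raise ValueError("total injection rate must be positive")
    # Each class contributes its end-to-end latency in proportion to its rate.
    weighted = sum(lam * (w + l) for lam, w, l in classes)
    return weighted / total_rate


# Two hypothetical classes with rates 0.1 and 0.3 packets/cycle.
l_avg = average_end_to_end_latency([(0.1, 2.0, 8.0), (0.3, 4.0, 6.0)])
```

Weighting by injection rate means that heavily loaded classes dominate the reported average, which matches how an average over all packets in a simulation would behave.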
5 EXPERIMENTAL EVALUATIONS
This section validates the proposed analytical model against an
industrial cycle-accurate NoC simulator under a wide range of
traffic scenarios. The experiment scenarios include real applications
and synthetic traffic that allow evaluations with varying injection
rates and deflection probabilities. The evaluations include a 6×6
mesh NoC and a 6×1 ring as representative examples of high-end
server CPUs [12] and high-end client CPUs [27], respectively. In
both cases, the traffic sources emulate high-end CPU cores with a
100% hit rate on the shared last level cache (LLC) to load the NoCs.
The target platforms are more powerful than experimental [31]
and special-purpose [32] platforms with simple cores, although
the mesh size is smaller. To further demonstrate the scalability of
the proposed approach, we also present results with mesh sizes
up to 16×16. All cycle-accurate simulations run for 200K cycles,
with a warm-up period of 20K cycles, to allow the NoC to reach the
steady-state.
5.1 Estimation of Deflected Traffic
One of the key components of the proposed analytical model is
estimating the average number of deflected packets. This section
evaluates the accuracy of this estimation compared to simulation
with a 6×6 mesh. To perform evaluations under heavy load, we
set the deflection probability at each junction and sink to $p_d$ = 0.3
and injection rates at each source to 0.33 packets/cycle/source,
which are relatively large values seen in actual systems. We first
run cycle-accurate simulations to obtain the average number of
deflected packets at each row and column of the mesh. Then, the
analytical model estimates the same quantities for the 6×6 mesh.
[Figure 6 plot: Estimation Accuracy of Deflected Packets (%) for Row-1 to Row-6 and Col-1 to Col-6.]
Figure 6: Estimation accuracy of average number of packets
deflected for each row and column in a 6×6 mesh with
$p_d$ = 0.3.
[Figure 7 plots: (a) pd = 0.1, (b) pd = 0.3; x-axis: Injection Rate
(packets/cycle/source); y-axis: Average Latency (cycles); series:
Simulation, Analytical (Proposed), Analytical (w/o Decomposition and
w/o Deflection) [9], Analytical (w/o Deflection Routing) [13].]
Figure 7: Comparison of average latency between simulation, the analytical model proposed in this work, and analytical
models proposed in [9, 13] for a 6×6 mesh with deflection probability (a) 0.1 and (b) 0.3.
Figure 6 shows the estimation accuracy for all rows and columns.
The average estimation accuracy across all rows and columns is
96% and the worst-case accuracy is 92%. Overall, this evaluation
shows that the proposed model accurately estimates the average
number of deflected packets.
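The accuracy figures reported here can be reproduced from per-row and per-column counts with a short script. The sketch below is illustrative only: it assumes that accuracy is defined as 100% minus the relative estimation error, which the text does not define explicitly, and the function names and sample counts are hypothetical.

```python
def estimation_accuracy(estimated, simulated):
    """Accuracy in percent, assuming accuracy = 100 * (1 - relative error)."""
    if simulated <= 0:
        raise ValueError("simulated count must be positive")
    return 100.0 * (1.0 - abs(estimated - simulated) / simulated)


def summarize(estimated_by_group, simulated_by_group):
    """Return per-group accuracy, the average accuracy, and the worst case."""
    accuracy = {
        group: estimation_accuracy(estimated_by_group[group], simulated)
        for group, simulated in simulated_by_group.items()
    }
    values = list(accuracy.values())
    return accuracy, sum(values) / len(values), min(values)


# Hypothetical deflected-packet counts for one row and one column.
estimated = {"Row-1": 48.0, "Col-1": 110.0}
simulated = {"Row-1": 50.0, "Col-1": 100.0}
per_group, average, worst = summarize(estimated, simulated)
print(per_group, average, worst)
```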
5.2 Evaluations with Geometric Traffic Input
This section evaluates the accuracy of our latency estimation
technique when the sources inject packets following a geometric traffic
distribution. We note that our technique can also handle bursty
traffic, which is significantly harder. However, we start with this
assumption to make a fair comparison to two state-of-the-art
techniques from the literature [9, 13]. The model presented in [9] does
not incorporate multiple traffic classes and deflection routing. On
the other hand, the model presented in [13] considers multiple
traffic classes but does not consider bursty traffic and deflection
routing.
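To make the geometric traffic assumption concrete, the following sketch generates arrivals from a source that injects a packet in each cycle independently with probability equal to the injection rate. Under this assumption, inter-arrival times are geometrically distributed with mean 1/λ and squared coefficient of variation 1 − λ. The function names are hypothetical and the generator is not taken from the industrial simulator.

```python
import random


def geometric_arrivals(rate, num_cycles, seed=0):
    """Cycles at which a packet is injected, one Bernoulli trial per cycle."""
    rng = random.Random(seed)
    return [cycle for cycle in range(num_cycles) if rng.random() < rate]


def interarrival_stats(arrivals):
    """Mean inter-arrival time and its squared coefficient of variation."""
    gaps = [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]
    mean = sum(gaps) / len(gaps)
    variance = sum((gap - mean) ** 2 for gap in gaps) / len(gaps)
    return mean, variance / mean ** 2


arrivals = geometric_arrivals(rate=0.2, num_cycles=200_000, seed=1)
mean_gap, scv = interarrival_stats(arrivals)
# For rate 0.2, theory gives a mean gap of 5 cycles and a squared CV of 0.8.
print(round(mean_gap, 2), round(scv, 2))
```

Bursty traffic has a larger coefficient of variation than this geometric baseline at the same injection rate, which is why it is harder to model.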
The evaluations are performed first on the server-like 6×6 mesh
for deflection probabilities pd = 0.1 and pd = 0.3 while sweeping
the packet injection rates. Figure 7(a) and Figure 7(b) show that the
proposed technique follows the simulation results closely for all
injection rates. More specifically, the proposed analytical model has only
7% and 6% error on average for deflection probabilities
of 0.1 and 0.3, respectively. In sharp contrast, the analytical model
proposed in [9] significantly overestimates the latency starting
with moderate injection rates, since it does not consider multiple
traffic classes. Its performance degrades even further with larger
deflection probability, as depicted in Figure 7(b). We note that it
also slightly underestimates the latency at low injection rates since
it ignores deflection. Unlike this approach, the technique presented
in [13] considers multiple traffic classes in the same queue, but it
ignores deflected packets. Consequently, it severely underestimates
the latency impact of deflection, as shown in Figure 7.
We repeated the same evaluation on a 6×1 priority-aware ring
NoC modeled after a high-end industrial quad-core CPU with an
integrated GPU and memory controller [12]. The average errors
between the proposed analytical model and simulations are 7%
and 4% for deflection probabilities of 0.1 and 0.3, respectively. In
contrast, the model presented in [9] underestimates the latency at
low injection rates and significantly overestimates it under high
traffic load, similar to the 6×6 results in Figure 7. Similarly, the
analytical model presented in [13] severely underestimates the
average latency. It leads to an average 43% error with respect to
simulation. The plots of these results are not included for space
considerations since they closely follow the results in Figure 7.
5.3 Latency Estimation with Bursty Traffic Input
Since real applications exhibit burstiness, it is crucial to perform
accurate analytical modeling under bursty traffic. Therefore, this
section presents the comparison of our proposed analytical model
with respect to simulation under bursty traffic. For an extensive
and thorough validation, we sweep the packet injection rate (λ),
probability of burstiness (pbr), and deflection probability (pd). The
injection rates cover a wide range to capture various traffic
congestion scenarios in the network. Likewise, we report evaluation
results for two different burstiness probabilities (pbr = {0.2, 0.6}) and
three different deflection probabilities (pd = {0.1, 0.2, 0.3}). The coefficient
of variation for the input traffic (CA), the final input to the model,
is then computed as a function of pbr and λ [33]. We simulate the
6×6 mesh and 6×1 ring NoCs using their cycle-accurate models for
all input parameter values mentioned above. Then, we estimate the
average packet latencies using the proposed technique, as well as
the most relevant prior work [9, 13].
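The parameter sweep described above can be enumerated programmatically. The sketch below lists the configurations using the values of pd, pbr, and λ reported in Table 2; the function names and the tuple layout are hypothetical, and the calls to the simulator and the analytical model are intentionally left out.

```python
from itertools import product

DEFLECTION_PROBS = (0.1, 0.2, 0.3)   # pd
BURST_PROBS = (0.2, 0.6)             # pbr


def injection_rates(pd):
    """Injection rates per pd, as listed in Table 2.

    Table 2 uses 0.1, 0.3, 0.4 for pd = 0.1 and 0.2, and
    0.1, 0.2, 0.3 for pd = 0.3.
    """
    return (0.1, 0.2, 0.3) if pd == 0.3 else (0.1, 0.3, 0.4)


def sweep(topologies=("6x6 mesh", "6x1 ring")):
    """All (topology, pd, pbr, lambda) combinations of the sweep."""
    return [
        (topology, pd, pbr, lam)
        for topology, pd, pbr in product(topologies, DEFLECTION_PROBS, BURST_PROBS)
        for lam in injection_rates(pd)
    ]


configs = sweep()
print(len(configs))  # 18 configurations per topology
```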
The estimation error of all three performance analysis techniques
is reported in Table 2 for all input parameters. The mean and
median estimation errors of our proposed technique are 9.3% and 9.5%,
respectively. Furthermore, we do not observe more than 14% error
even with relatively higher traffic load, probability of deflection,
and burstiness than seen in real applications (presented in the
following section). In strong contrast, the analytical model proposed
in [9] severely overestimates the latency, similar to the results
presented in Section 5.2. The estimation error is more than 100% for
most cases since the impact of multiple traffic classes and deflected
packets becomes more significant under these challenging scenarios.
Similarly, the model proposed in [13] underestimates the latency
because it does not model bursty traffic.
Table 2: Validation of the proposed analytical model for 6×6 mesh and 6×1 ring
with bursty traffic arrival, and comparisons against prior work [9, 13].
All entries are estimation errors in %. 'E' signifies error >100%.
                   6×6 Mesh                     6×1 Ring
pd    pbr   λ      Prop.   Ref[9]   Ref[13]     Prop.   Ref[9]   Ref[13]
0.1   0.2   0.1    7.3     2.6      12          1.0     7.0      15.3
0.1   0.2   0.3    9.6     E        15          4.1     E        18
0.1   0.2   0.4    8.1     E        23          5.8     E        22
0.1   0.6   0.1    14      26       3.1         4.6     34       18
0.1   0.6   0.3    13      E        18          5.2     E        24
0.1   0.6   0.4    14      E        23          5.5     E        33
0.2   0.2   0.1    8.9     22       28          0.7     23       30
0.2   0.2   0.3    8.0     E        41          2.3     E        38
0.2   0.2   0.4    7.7     E        65          4.2     E        67
0.2   0.6   0.1    13      39       19          6.3     45       31
0.2   0.6   0.3    12      E        33          7.3     E        44
0.2   0.6   0.4    12      E        49          8.6     E        54
0.3   0.2   0.1    9.6     35       42          0.7     42       41
0.3   0.2   0.2    9.2     18       45          0.9     E        50
0.3   0.2   0.3    6.5     E        55          3.3     E        73
0.3   0.6   0.1    11      57       39          6.3     54       42
0.3   0.6   0.2    12      E        35          8.5     E        50
0.3   0.6   0.3    13      E        31          8.6     E        58
[Figure 8 plots: (a) pd = 0.1, (b) pd = 0.3; x-axis: applications
(SYSmark14, gcc, bwaves, mcf, GemsFDTD, OMNeT++, Xalan, Perlbench);
y-axis: Average Latency (Normalized); series: Simulation, Analytical
(Proposed), Analytical (w/o Decomposition and w/o Deflection) [9],
Analytical (w/o Deflection Routing) [13].]
Figure 8: Average latency comparison between simulation, the analytical model proposed in this work, and analytical models
proposed in [9, 13] for a 6×6 mesh with (a) pd = 0.1 and (b) pd = 0.3.
The right-hand side of Table 2 summarizes the estimation errors
obtained on the 6×1 ring NoC that follows high-end client systems.
In most cases, the error with the proposed analytical model is
within 10% of simulation, and the error is as low as 1%. With pd =
0.1, pbr = 0.6 and λ = 0.4, the error is 14%, which is acceptable,
considering that the network is severely congested. In contrast, the
analytical model proposed in [9] overestimates the latency, whereas
the model in [13] underestimates it. Both trends are consistent with the
geometric traffic results and with the 6×6 mesh results.
5.4 Experiments with Real Applications
In addition to the synthetic traffic, the proposed analytical model
is evaluated with applications from the SPEC CPU®2006 [34] and SPEC
CPU®2017 [35] benchmark suites, and the SYSmark®2014
application [36]. Specifically, the evaluation includes SYSmark14, gcc,
bwaves, mcf, GemsFDTD, OMNeT++, Xalan, and perlbench
applications. The chosen applications represent a variety of injection rates
for each source in the NoC and different levels of burstiness. Each
application is profiled offline to find the input traffic parameters.
Of note, the probability of burstiness for these applications ranges
from pbr = 0.25 to pbr = 0.55, which is aligned with the evaluations
in Section 5.3.
The benchmark applications are executed on both 6×6 mesh
and 6×1 ring architectures. The comparison of average latency
between simulation and proposed analytical model for the 6×6 mesh
is shown in Figure 8. The proposed model follows the simulation
results very closely for deflection probability pd = 0.1 and pd = 0.3,
as shown in Figure 8(a) and Figure 8(b), respectively. These plots
show the average packet latencies normalized to the smallest
latency from the 6×1 ring simulations due to the confidentiality of
the results. On average, the proposed analytical model achieves less
than 5% modeling error. In contrast, the analytical models which do
not consider deflection routing [9, 13] underestimate the latency,
since the injection rates of these applications are in the range of
0.02–0.1 flits/cycle/source (low injection region).
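The normalization used in these plots can be illustrated as follows. This is a minimal sketch with hypothetical function names and made-up latency values; it assumes that every latency is divided by the smallest average latency observed in the 6×1 ring simulations, as the text describes.

```python
def normalize_latencies(latencies, ring_sim_latencies):
    """Divide each latency by the smallest 6x1 ring simulation latency.

    latencies: application name -> average latency (cycles) to normalize.
    ring_sim_latencies: application name -> simulated ring latency (cycles).
    """
    base = min(ring_sim_latencies.values())
    if base <= 0:
        raise ValueError("reference latency must be positive")
    return {app: latency / base for app, latency in latencies.items()}


# Hypothetical values; real latencies are confidential in the paper.
mesh = {"gcc": 30.0, "mcf": 45.0}
ring = {"gcc": 15.0, "mcf": 20.0}
print(normalize_latencies(mesh, ring))
```

Normalizing by a single reference value preserves the relative differences between applications and between models while hiding absolute cycle counts.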
[Figure 9 plot: x-axis: Mesh Size (4×4 to 16×16); y-axis: Execution
Time of Analytical Model (s), logarithmic scale.]
Figure 9: Execution time of the proposed analytical model
(in seconds) for different mesh sizes.
We observe similar results for the 6×1 ring NoC. The average
estimation error of our proposed technique is less than 8% for all
applications. In contrast, the prior techniques underestimate the
latency by more than 2× since they ignore deflected packets, and
the average traffic loads are small. In conclusion, the proposed
technique outperforms the state of the art for real applications and a
wide range of synthetic traffic inputs.
5.5 Scalability Analysis
Finally, we evaluate the scalability of the proposed technique for
larger NoCs. We note that accuracy results for larger NoCs are not
available since they do not have detailed cycle-accurate
simulation models. We implemented the analytical model in C. Figure 9
shows that the analysis completes on the order of seconds for meshes
up to 16×16. In comparison, cycle-accurate simulations take hours
at this size, even considering linear scaling. When we scale the
mesh size aggressively to 16×16, the analysis completes in about 5
seconds, which is orders of magnitude faster than cycle-accurate
simulations of NoCs of this size.
6 CONCLUSIONS
Industrial NoCs incorporate priority arbitration and deflection
routing to minimize buffer requirements and achieve predictable latency.
Analytical performance modeling of these NoCs is needed to
perform design space exploration, fast system simulations, and tuning
of architectural parameters. However, state-of-the-art performance
analysis models for NoCs do not incorporate priority arbitration
and deflection routing together. This paper presented a performance
analysis technique for industrial priority-aware NoCs with
deflection routing under heavy traffic. Experimental evaluations with
industrial NoCs show that the proposed technique significantly
outperforms existing analytical models under both real application
and a wide range of synthetic traffic workloads.
REFERENCES
[1] Maurizio Palesi and Tony Givargis. Multi-objective Design Space Exploration
using Genetic Algorithms. In Proceedings of the tenth international symposium
on Hardware/software codesign, pages 67–72, 2002.
[2] Arijit Ghosh and Tony Givargis. Analytical Design Space Exploration of Caches
for Embedded Systems. In 2003 Design, Automation and Test in Europe Conference
and Exhibition, pages 650–655, 2003.
[3] Nathan Binkert et al. The Gem5 Simulator. SIGARCH Computer Architecture
News, May. 2011.
[4] Rainer Leupers, Lieven Eeckhout, Grant Martin, Frank Schirrmeister, Nigel
Topham, and Xiaotao Chen. Virtual Manycore Platforms: Moving Towards
100+ Processor Cores. In 2011 Design, Automation & Test in Europe, pages 1–6.
IEEE, 2011.
[5] Niket Agarwal, Tushar Krishna, Li-Shiuan Peh, and Niraj K Jha. GARNET: A
Detailed on-chip Network Model inside a Full-System Simulator. In 2009 IEEE
International Symposium on Performance Analysis of Systems and Software, pages
33–42, 2009.
[6] Nan Jiang et al. A Detailed and Flexible Cycle-accurate Network-on-chip Simu-
lator. In 2013 IEEE Intl. Symp. on Performance Analysis of Systems and Software
(ISPASS), pages 86–96.
[7] Sungjoo Yoo, Gabriela Nicolescu, Lovic Gauthier, and Ahmed Amine Jerraya.
Automatic Generation of Fast Timed Simulation Models for Operating Systems in
SoC Design. In Proceedings 2002 Design, Automation and Test in Europe Conference
and Exhibition, pages 620–627, 2002.
[8] Umit Y Ogras, Paul Bogdan, and Radu Marculescu. An Analytical Approach for
Network-on-Chip Performance Analysis. IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems, 29(12):2001–2013, 2010.
[9] Abbas Eslami Kiasari, Zhonghai Lu, and Axel Jantsch. An Analytical Latency
Model for Networks-on-Chip. IEEE Transactions on Very Large Scale Integration
(VLSI) Systems, 21(1):113–123, 2013.
[10] Hany Kashif and Hiren Patel. Bounding Buffer Space Requirements for Real-time
Priority-aware Networks. In 2014 19th Asia and South Pacific Design Automation
Conference (ASP-DAC), pages 113–118, 2014.
[11] Zhi-Liang Qian, Da-Cheng Juan, Paul Bogdan, Chi-Ying Tsui, Diana Marculescu,
and Radu Marculescu. A Support Vector Regression (SVR)-based Latency Model
for Network-on-Chip (NoC) Architectures. IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems, 35(3):471–484, 2015.
[12] James Jeffers, James Reinders, and Avinash Sodani. Intel Xeon Phi Processor High
Performance Programming: Knights Landing Edition. Morgan Kaufmann, 2016.
[13] Sumit K Mandal, Raid Ayoub, Michael Kishinevsky, and Umit Y Ogras. Analytical
Performance Models for NoCs with Multiple Priority Traffic Classes. ACM
Transactions on Embedded Computing Systems (TECS), 18(5s), 2019.
[14] Sumit K Mandal, Raid Ayoub, Michael Kishinevsky, Mohammad M Islam, and
Umit Y Ogras. Analytical Performance Modeling of NoCs under Priority
Arbitration and Bursty Traffic. IEEE Embedded Systems Letters, 2020.
[15] Allan Borodin, Yuval Rabani, and Baruch Schieber. Deterministic Many-to-
Many Hot Potato Routing. IEEE Transactions on Parallel and Distributed Systems,
8(6):587–596, 1997.
[16] Thomas Moscibroda and Onur Mutlu. A Case for Bufferless Routing in
On-chip Networks. In Proceedings of the 36th Annual International Symposium on
Computer Architecture, pages 196–207, 2009.
[17] Chris Fallin, Chris Craik, and Onur Mutlu. CHIPPER: A Low-Complexity
Bufferless Deflection Router. In 2011 IEEE 17th International Symposium on High
Performance Computer Architecture, pages 144–155, 2011.
[18] Chris Fallin, Greg Nazario, Xiangyao Yu, Kevin Chang, Rachata Ausavarungnirun,
and Onur Mutlu. MinBD: Minimally-buffered Deflection Routing for
Energy-efficient Interconnect. In 2012 IEEE/ACM Sixth International Symposium on
Networks-on-Chip, pages 1–10, 2012.
[19] Zhonghai Lu, Mingchen Zhong, and Axel Jantsch. Evaluation of On-chip
Networks using Deflection Routing. In Proceedings of the 16th ACM Great Lakes
Symposium on VLSI, pages 296–301, 2006.
[20] Yulei Wu, Geyong Min, Mohamed Ould-Khaoua, Hao Yin, and Lan
Wang. Analytical Modelling of Networks in Multicomputer Systems under
Bursty and Batch Arrival Traffic. The Journal of Supercomputing, 51(2):115–130,
2010.
[21] Michele Petracca, Benjamin G Lee, Keren Bergman, and Luca P Carloni. Photonic
NoCs: System-Level Design Exploration. IEEE Micro, 29(4):74–85, 2009.
[22] Dimitri P Bertsekas, Robert G Gallager, and Pierre Humblet. Data Networks,
volume 2. Prentice-Hall International New Jersey, 1992.
[23] Gunter Bolch, Stefan Greiner, Hermann De Meer, and Kishor S Trivedi. Queue-
ing Networks and Markov Chains: Modeling and Performance Evaluation with
Computer Science Applications. John Wiley & Sons, 2006.
[24] Joris Walraevens. Discrete-time Queueing Models with Priorities. PhD thesis,
Ghent University, 2004.
[25] Jack T Brassil and Rene L. Cruz. Bounds on Maximum Delay in Networks
with Deflection Routing. IEEE Transactions on Parallel and Distributed Systems,
6(7):724–732, 1995.
[26] Pavel Ghosh, Arvind Ravi, and Arunabha Sen. An Analytical Framework with
Bounded Deflection Adaptive Routing for Networks-on-Chip. In 2010 IEEE
Computer Society Annual Symposium on VLSI, pages 363–368, 2010.
[27] Efraim Rotem. Intel Architecture, Code Name Skylake Deep Dive: A New
Architecture to Manage Power Performance and Energy Efficiency. In Intel
Developer Forum, 2015.
[28] Paul Bogdan and Radu Marculescu. Workload Characterization and Its Impact on
Multicore Platform Design. In 2010 IEEE/ACM/IFIP International Conference on
Hardware/Software Codesign and System Synthesis (CODES+ ISSS), pages 231–240,
2010.
[29] Demetres D Kouvatsos. Entropy Maximisation and Queuing Network Models.
Annals of Operations Research, 48(1):63–126, 1994.
[30] Guy Pujolle and Wu Ai. A Solution for Multiserver and Multiclass Open Queueing
Networks. INFOR: Information Systems and Operational Research, 24(3):221–230,
1986.
[31] Sriram R Vangal et al. An 80-Tile Sub-100-W TeraFLOPS Processor in 65-nm CMOS.
IEEE Journal of Solid-State Circuits, 43(1):29–41, 2008.
[32] David Wentzlaff et al. On-chip Interconnection Architecture of the Tile Processor.
IEEE Micro, 27(5):15–31, 2007.
[33] DD Kouvatsos and PA Luker. On the analysis of queueing network models:
Maximum entropy and simulation. In UKSC 84, pages 488–496. 1984.
[34] John L Henning. SPEC CPU2006 Benchmark Descriptions. ACM SIGARCH
Computer Architecture News, 34(4):1–17, 2006.
[35] James Bucek, Klaus-Dieter Lange, and Jóakim v. Kistowski. SPEC CPU2017:
Next-Generation Compute Benchmark. In Companion of the 2018 ACM/SPEC
International Conference on Performance Engineering, pages 41–42, 2018.
[36] Business Applications Performance Corporation (BAPCo). Benchmark, sys-
mark2014. http://bapco.com/products/sysmark-2014, accessed 27 May 2020.
