Zero-Load Predictive Model for Performance Analysis in Deflection Routing NoCs by Weldezion, Awet Yemane et al.
                          Weldezion, A. Y., Grange, M., Jantsch, A., Tenhunen, H., & Pamunuwa, D.
(2015). Zero-Load Predictive Model for Performance Analysis in Deflection
Routing NoCs. Microprocessors and Microsystems, 39(8), 634-647. DOI:
10.1016/j.micpro.2015.09.002
Peer reviewed version
Copyright © 2015 Elsevier B.V. All rights reserved.
Zero-Load Predictive Model for Performance Analysis in
Deﬂection Routing NoCs
Awet Yemane Weldezion (a,∗), Matt Grange (b), Axel Jantsch (c),
Hannu Tenhunen (a), Dinesh Pamunuwa (d)
(a) The Royal Institute of Technology (KTH), SE 16440 Kista, Sweden
(b) Mentor Graphics, Oregon, USA
(c) Vienna University of Technology (TU Wien), Austria
(d) University of Bristol, UK
Abstract
We study a static model for 2-D and 3-D networks that accurately represents
the average distance travelled by packets under deﬂection routing, which is
a speciﬁc form of adaptive routing. The model captures static properties of
the network topology and the spatial distribution of traﬃc, but does not take
into account traﬃc loading and congestion. Even though this static model
cannot accurately predict packet latency under high load, we contend that it
is a perfect predictor of deﬂection routing networks’ relative performance un-
der any load condition below saturation, and thus always correctly predicts
the optimum network conﬁguration. This is veriﬁed through cycle-accurate
simulations of congested and uncongested networks with fully adaptive, de-
ﬂection routing for regular traﬃc patterns such as uniform random, localized,
bursty, and others, as well as irregular patterns in both regular and irregu-
lar networks. As the networks with minimal average distance perform best
even under high traﬃc load, the average distance model establishes a robust
relation between a static network property, average distance, and network
performance under load, providing new insight into network behaviour and
an opportunity to identify the optimal network conﬁguration without time-
consuming simulations.
Keywords: Alpha-model; average distance; b-model; hot-spot nodes; local
traﬃc pattern; network optimisation; NoC; zero-load predictive model;
∗Corresponding author
Email address: aywe@kth.se (Awet Yemane Weldezion)
Preprint submitted to Journal of Microprocessors and Microsystems September 3, 2015
1. Introduction
Analytical models of communication performance in networks are diﬃcult
to obtain because of the chaotic and complex nature of a congested commu-
nication system. The delicate balance between the switching, buﬀering, ﬂow
control, routing algorithm, and the traffic distribution across the network
and over time determines whether a network operates at peak eﬃciency or
exhibits overloaded and unbounded latencies. Predicting the expected packet
delay in a network when it is near its saturation point is notoriously diﬃcult.
In fact, the speciﬁc load level that causes a network to become saturated not
only depends on details of the spatial distribution (where packets are routed)
and the burstiness (data injection patterns over time) of the traﬃc, but also
on the history of the network’s congestion.
To understand worst case timing, analytic models are indispensable and
various methods have been applied to derive the worst case delay and
performance in Networks-on-Chip (NoC) [1]. For instance, scheduling theory [2],
network calculus [3], data ﬂow analysis [4], and models used in statistical
physics [5] are actively being pursued in the literature for networks that
use deterministic routing. However, these models derive the upper latency
bounds based on the worst possible interference patterns and congestion,
which often is far from the average case. The task is even more daunting for
adaptive routing. In deterministic routing networks, the delay of a packet
only depends on direct and indirect interference on the packet’s path. In
contrast, adaptive routing balances the load over the entire network, which
means that every packet may directly compete with any other packet.
However, adaptive routing is a popular technique in NoCs due to its high
performance [6, 7, 8], its load balancing capabilities [9, 10] and its fault-tolerant
properties [11]. There have been attempts to exploit such properties by using
a model-based approach in routing decisions [12], but there is no work known
to us that oﬀers an analytic delay model for average performance. Due to
the exceedingly complex spatial and temporal interference patterns of
packets across the network in adaptive routing networks, an accurate analytic
latency model seems to be out of reach.
Consequently, simulation has been the predominant tool to assess the
performance of networks for particular applications and application classes.
The shortcomings of simulations are obvious: high effort in setting up
realistic simulations; even higher effort in setting up realistic application scenarios;
very long simulation times; limited predictive value for application variants
that are not simulated; and diﬃculties in obtaining clues for improving per-
formance.
Given the challenges in formulating accurate analytic models and the
enormous eﬀort in setting up useful simulation scenarios, we ask the follow-
ing question: Are there static, analytic properties that can serve as reliable
predictors for network performance, even if their accuracy in predicting la-
tency is limited?
In this work, we study one candidate for such a predictor: the average
distance in hops that packets traverse in a network. In our model, a routing
node consists of a router with one or more processing elements connected
to it. Given any network topology, the geometric distance, expressed in the
number of hops, between two speciﬁc routing nodes can easily be computed.
For example, in a 3×4 2-D mesh network, the distance between any two
neighbouring routing nodes is 1, and the distance between two diagonally
opposite corner routing nodes is 5. If the probability of a speciﬁc node A
sending a packet to a speciﬁc node B is known for all routing nodes A and
B in the network, the average distance travelled by all packets for the given
network topology and set of probabilities can be computed. We denote the
average distance of a network by H(φ, ψ) (or H for short), where φ is the
spatial distribution of traﬃc and ψ represents the topology. We call this
metric the zero-load average distance model as it models the average latency
in networks completely free of congestion, or in other words networks with
zero loading. We use the terms average-distance model and zero-load model
interchangeably to mean the same thing.
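The definition above can be illustrated with a minimal Python sketch. It is not the tooling used in this study; the function names and the particular choice of φ (uniform traffic without self-addressed packets) are assumptions made for illustration, and ψ is restricted here to a mesh without wrap-around links.

```python
from itertools import product


def manhattan(a, b):
    """Hop distance between two mesh nodes given as coordinate tuples."""
    return sum(abs(p - q) for p, q in zip(a, b))


def average_distance(dims, prob):
    """Probability-weighted mean hop count over all source/destination pairs.

    dims : per-dimension mesh sizes, e.g. (3, 4)
    prob : function (src, dst) -> weight; it does not need to be normalised
    """
    nodes = list(product(*(range(k) for k in dims)))
    weighted = 0.0
    total = 0.0
    for src in nodes:
        for dst in nodes:
            p = prob(src, dst)
            weighted += p * manhattan(src, dst)
            total += p
    return weighted / total


def uniform_no_self(src, dst):
    """Hypothetical traffic distribution: every other node is equally likely."""
    return 0.0 if src == dst else 1.0


# The 3x4 mesh of the example: diagonally opposite corners are 5 hops apart.
corner_to_corner = manhattan((0, 0), (2, 3))
h_3x4 = average_distance((3, 4), uniform_no_self)
```

Any other spatial distribution can be plugged in by passing a different `prob` function; the topology enters only through the node list and the distance function.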
We demonstrate the predictive power of H by showing that for any topology
with deflection routing, whether homogeneous or heterogeneous, under
numerous realistic traffic scenarios, the model exhibits near perfect fidelity
for all investigated cases. We say that the model has fidelity if, whenever
the zero-load model predicts a smaller average distance H for network 'A'
than for network 'B', the average latency of network 'A' is consistently lower
than that of network 'B', regardless of the congestion level and traffic
pattern. We examine the fidelity of our model by considering the packet
latencies of networks that are equally sized in terms of total routing nodes,
but have unequal radices (for example 4×4×4 versus 8×8×1 versus 2×4×8) as
well as different configurations (different placements of specific traffic
generators and consumers), under various traffic patterns with increasing
injection rate. The zero-load model differentiates between networks when
other commonly used metrics, such as bisection channel bandwidth, Bc [13],
can be inconclusive. For example, Bc = 8 for both 8×8×1 and 2×4×8 meshes.
The main contribution of this work is to validate that the average dis-
tance model predicts relative network performance well for deﬂection rout-
ing networks, by means of a wide range of cycle accurate simulations using
spatio-temporal traffic generators. Additional experiments are performed for
placement of hot-spot nodes and IP-cores in irregular networks to demon-
strate the potential of the model in network architecture optimisation.
The paper is organised as follows: in section 2, we discuss related works.
In section 3 the different spatial traffic patterns are analysed and
corresponding expressions for the zero-load average distance formulated. Also the
basis for modelling bursty traﬃc is described. Section 4 describes the sim-
ulation environment and experimental methodology used in the study. In
section 5 we validate our model by showing experimental results based on
cycle-accurate simulations for regular network conﬁgurations under load for
all the regular traffic patterns investigated.
Then, in section 6 we present results for irregular traﬃc patterns for both
regular and irregular networks, and demonstrate the potential use of the
model in optimizing network configuration. After discussing the results in
section 7, we draw our conclusions in section 8.
2. Related Works
The performance of communication networks has been widely studied
and, in particular, there is a substantial body of work that deals with delay
models for deterministic routing and regular topologies [1, 14, 15, 16]. Much
less work has been done for adaptive routing networks, because the task is
inherently more difficult. Therefore, all previous approaches make
simplifying assumptions that make the task tractable but render the model less
general and restrict its scope.
One of the first delay models for adaptive routing networks was developed
by Boura et al. in 1994 [17] for hypercube topologies. In 1998 Ould-Khaoua
[18] reported a delay model for general k-ary n-cubes covering Duato's fully
adaptive routing algorithm for wormhole switched networks and two or more
virtual channels [19]. In 2000 Sarbazi-Azad et al. [20] proposed a
modification that improves accuracy but is computationally very expensive
because it recursively computes the packet blocking delays in each node for
every possible path a packet may take. In 2003 Khonsari et al. [21] provided
an alternative delay model based on the earlier work of Boura et al. [17] but
for general k-ary n-cubes. It is less accurate but significantly faster to
compute than the model of Sarbazi-Azad et al. [20].
These models assume a uniform spatial distribution and a Poisson process
to model the temporal distribution of packet generation. In 2007 Min
et al. [22] considered bursty traffic based on a compound Poisson process
that models bursts, burst lengths and inter-arrival times of bursts as well as
allowing exactly one hotspot. A model has also been proposed to predict the
formation of hotspot traffic for the use of congestion-aware routing in certain
networks [23].
All these delay models are fairly accurate only under the given assump-
tions, which are however, quite restrictive with regard to topology as well
as traffic distribution. In relation to topology, some are restricted to
hypercubes [17, 22], while all others target k-ary n-cubes [21, 20, 18]. No model
available in the literature considers meshes (i.e. links do not wrap around
peripheral nodes), which are popular for NoCs, or other regular or irregu-
lar topologies. All models except [22] assume and use Poisson processes for
packet generation under a uniform spatial distribution. Min et al. do
allow for bursty traffic and one hot-spot. However, self-similar traffic, found
by many to closely resemble traﬃc ﬂow in real applications [24, 25, 26], or
spatial distribution of traﬃc beyond a single hot-spot, cannot be modelled.
These are severe restrictions because real-world applications do not follow
these idealistic assumptions. Relaxing or changing some of these
assumptions requires a significant effort to adapt the delay model or develop a new
approach without any guarantee of success. In contrast, our approach works
for any topology and traﬃc pattern. We have collected evidence that it is
valid and useful over a wide range of regular and irregular topologies and
traﬃc patterns.
An even smaller number of works discuss the modelling and usage of the
average distance as a performance metric. General zero-load latency models
for different networks are described in [13, 27]. An approach based on
average distance has been used by [28] to formulate models for static latency
when accessing memory in large scale chip multiprocessors. In comparing
network topologies, Agarwal [29] analysed the network latency for 2-D, 3-D
and 4-D networks under localized traﬃc. The analysis is performed for zero-
load and disregards the eﬀect of congestion on the latency. It assumes that
routers and wires are the only constraints that aﬀect delay. They report the
following expression for the average distance in k-ary n-mesh networks:
$$H = \frac{n}{3}\left(k - \frac{1}{k}\right). \tag{1}$$
In practice, networks are rarely configured with equal radices. This is
especially true with the advent of 3-D integration technologies. For a given
network size the routing nodes are often arranged with diﬀerent k1, k2 and k3
radices. In formulating a simple adaptive partitioning strategy to minimise
the communication cost, Liu et al. in [30] derived an expression for average
distance in k1×k2 type 2-D mesh networks:
$$H = \frac{1}{3}\left(k_1 - \frac{1}{k_1}\right) + \frac{1}{3}\left(k_2 - \frac{1}{k_2}\right). \tag{2}$$
When k1 = k2, equation (2) is equivalent to Agarwal’s equation (1).
In previous work [31] we showed how the average distance depends on
the probability of transmission, pi,j, of a packet with source i and destination
j, and the actual source-destination Manhattan distance in terms of hops,
with the following formulation for a 1-D network:
$$H_{1\times k} = \frac{\sum_{i=1}^{k}\sum_{j=1}^{k} p_{i,j}\,|i-j|}{\sum_{i=1}^{k}\sum_{j=1}^{k} p_{i,j}}. \tag{3}$$
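As a quick consistency check, and under the assumption that every pair (including i = j) is equally likely so that p_{i,j} is a constant, equation (3) reduces to the one-dimensional term that appears in equations (1) and (2):

```latex
\begin{align*}
\sum_{i=1}^{k}\sum_{j=1}^{k} |i-j|
  &= 2\sum_{d=1}^{k-1} d\,(k-d) = \frac{k\,(k^{2}-1)}{3}, \\
H_{1\times k}
  &= \frac{1}{k^{2}}\cdot\frac{k\,(k^{2}-1)}{3}
   = \frac{1}{3}\left(k - \frac{1}{k}\right).
\end{align*}
```

The first line counts each hop distance d once per ordered pair of nodes that are d apart; there are 2(k − d) such pairs in a line of k nodes.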
We went on to discuss unequal radices and formulated an average distance
model for an n-D mesh that is the generalisation of Agarwal’s and Liu’s
formulation:
$$H_{urt} = \sum_{i=1}^{n} H_{urt}^{1\times k_i} = \frac{k_1}{3} - \frac{1}{3k_1} + \frac{k_2}{3} - \frac{1}{3k_2} + \dots + \frac{k_n}{3} - \frac{1}{3k_n} \tag{4}$$
Based on (3) we also derived average distance models for the spatial traﬃc
patterns of uniform random traﬃc and local random traﬃc and veriﬁed the
ﬁdelity of the model by simulating networks under loading for these traﬃc
patterns.
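The closed form of equation (4) can be checked numerically against a direct enumeration. The following Python sketch is illustrative only; the function names are hypothetical, and the enumeration averages over all ordered node pairs including self-pairs, which is the case of constant p_{i,j} in equation (3).

```python
from itertools import product


def h_closed_form(dims):
    """Equation (4): sum of (k_i - 1/k_i) / 3 over all mesh dimensions."""
    return sum((k - 1.0 / k) / 3.0 for k in dims)


def h_brute_force(dims):
    """Mean Manhattan distance over all ordered node pairs (self-pairs included)."""
    nodes = list(product(*(range(k) for k in dims)))
    total = sum(
        sum(abs(a - b) for a, b in zip(src, dst))
        for src in nodes
        for dst in nodes
    )
    return total / len(nodes) ** 2


# Three 64-node configurations with unequal radices.
configs = [(4, 4, 4), (2, 4, 8), (8, 8, 1)]
results = {c: (h_closed_form(c), h_brute_force(c)) for c in configs}
```

For these three 64-node meshes the closed form gives 3.75, 4.375 and 5.25 hops respectively, so the cube has the smallest average distance and the planar mesh the largest.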
A more comprehensive approach is to use spatio-temporal traffic patterns
that exhibit bursty characteristics, which is more representative of how real
applications communicate over networks. Several studies have already shown
that both system- and chip-level networks demonstrate properties of self-
similarity [24, 26]. However, to our knowledge, no latency models have been
published for spatio-temporal traffic patterns.
The network link bandwidth is dependent on the number of links available
in the network. Depending on the conﬁguration, the total number of links
varies, even though the total number of routing nodes may be identical. This
has been shown in [32] through comparative analysis of 2-D mesh and 3-D
cube networks having the same routing node size. The expression for the
total number of links is:
$$\begin{aligned} L_{2D} &= 4k_1k_2 - 2(k_1 + k_2) \\ L_{3D} &= 6k_1k_2k_3 - 2k_1k_2 - 2k_1k_3 - 2k_2k_3 \end{aligned} \tag{5}$$
which quantiﬁes the diﬀerences in link bandwidth in diﬀerent topologies.
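A short Python sketch can be used to evaluate equation (5) and to cross-check it by enumeration. The function names are hypothetical, and the enumeration assumes that the expressions count unidirectional router-to-router links (each bidirectional connection counted twice), which is our reading of the formula rather than something stated in [32].

```python
from itertools import product


def links_2d(k1, k2):
    """Equation (5), 2-D mesh."""
    return 4 * k1 * k2 - 2 * (k1 + k2)


def links_3d(k1, k2, k3):
    """Equation (5), 3-D mesh."""
    return 6 * k1 * k2 * k3 - 2 * (k1 * k2 + k1 * k3 + k2 * k3)


def count_directed_links(dims):
    """Enumerate every node and count its outgoing links to mesh neighbours."""
    count = 0
    for node in product(*(range(k) for k in dims)):
        for axis, k in enumerate(dims):
            for step in (-1, 1):
                if 0 <= node[axis] + step < k:
                    count += 1
    return count


link_counts = {
    (4, 4, 4): links_3d(4, 4, 4),
    (2, 4, 8): links_3d(2, 4, 8),
    (8, 8, 1): links_3d(8, 8, 1),
}
```

For 64 routing nodes this gives 288 links for 4×4×4, 272 for 2×4×8 and 224 for 8×8×1, the last agreeing with the 2-D expression for an 8×8 mesh.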
Most analytical models take congestion into account to predict absolute
network performance. The zero-load delay in such models can be deduced by
setting the congestion level down to zero. However, finding zero-load delay
in such an approach doesn’t guarantee the accurate prediction of the relative
performance of networks under load. Depending on the switching mechanism,
and routing algorithm of the network, ﬁdelity may not be maintained all the
time. Also, the application of such models is limited to regular network
topologies with regular traffic patterns.
In this paper, we broaden the scope of the study we presented in [31]
by considering signiﬁcantly more traﬃc pattern models including bursty and
irregular traﬃc, as well as irregular networks. We evaluate the ﬁdelity of
the model in each case by comparing against results obtained from cycle
accurate simulations, and demonstrate how it provides insight into the
relative performance of differently configured networks under dynamic loading
conditions. The study signiﬁcantly expands on our previous work in terms
of more experimentation and on understanding the underpinning theoretical
concepts.
3. Traffic Patterns and Hop-count Models
Synthetic traﬃc models play an important role in design space exploration
and veriﬁcation. When an application runs, packets injected into the network
tend to exhibit repetitive spatio-temporal patterns that can be captured in
a model [25]. The model should replicate both the temporal distribution,
i.e. the timing of release of packets in the period under consideration, and
the spatial distribution, i.e. the variation of destination addresses.
3.1. Spatial Distribution
Most works in the literature that propose synthetic traﬃc patterns discuss
their spatial distribution, which determines how destination addresses are
generated for packets. Spatial and temporal distributions are orthogonal to
each other, and any temporal distribution can be superimposed on any spa-
tial distribution. The set of destinations may contain only one node, resulting
in a deterministic pattern, or it may include all nodes in the network with
an associated probability. If the probability of transmitting to each node
is identical, the traffic pattern is uniform random, while a probability that
decreases with increasing distance results in localized traﬃc. In our experi-
ments we utilise the following commonly used deterministic and probabilistic
traﬃc patterns: uniform random (URT), bit reverse (BRT), bit complement
(BCT), and local random (LRT) traﬃc.
For the purpose of defining spatial traffic patterns, routing nodes are
assigned unique numbers S = 0 · · ·N−1 with N being the number of routing
nodes. In a 3-D mesh topology the x, y and z address components are mapped
from these routing node identiﬁers as follows:
$$\begin{aligned} x &= S \bmod N_x \\ y &= (S \,\mathrm{div}\, N_x) \bmod N_y \\ z &= S \,\mathrm{div}\, (N_x N_y) \end{aligned} \tag{6}$$
where div is integer division and Nx, Ny, Nz denote the size of the network
in each dimension. For a 2-D mesh the same equations hold except for the
third, which becomes irrelevant.
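The mapping of equation (6) translates directly into integer arithmetic. The sketch below is a minimal Python illustration; the function names and the inverse mapping are our additions and are not taken from the simulator.

```python
def id_to_xyz(s, nx, ny, nz=1):
    """Equation (6): map a node identifier S to its (x, y, z) address."""
    if not 0 <= s < nx * ny * nz:
        raise ValueError("node identifier out of range")
    x = s % nx
    y = (s // nx) % ny
    z = s // (nx * ny)
    return x, y, z


def xyz_to_id(x, y, z, nx, ny):
    """Inverse mapping (hypothetical helper, useful for round-trip checks)."""
    return x + nx * (y + ny * z)


# Node 27 in a 4x4x4 mesh.
address = id_to_xyz(27, 4, 4, 4)
```

For a 2-D mesh the default `nz=1` makes the z component always zero, in line with the remark that the third equation becomes irrelevant.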
For each traﬃc pattern, zero-load hop count models based on our original
deﬁnition are expressly derived or stated below.
3.1.1. Uniform Random Traﬃc (URT)
In URT, the destination addresses are generated randomly and can be any
processing element across the network other than the source itself¹. For a
given network size of N routing nodes, URT creates a uniformly distributed
spatial pattern, with equal destination probabilities for all source-destination
pairs:
$$P_D = \frac{1}{N-1} \tag{7}$$
The overall average over the Manhattan distances associated with all
source-destination pairs gives the average distance travelled by a packet across
the network. In our previous work [31], the average distance expression in
terms of hop count for a 3-D network under URT has been generalized as given
in (8), where x, y and z denote the number of routing nodes in each dimension:
$$H_{urt} = \frac{1}{3}\left(\left(x - \frac{1}{x}\right) + \left(y - \frac{1}{y}\right) + \left(z - \frac{1}{z}\right)\right). \tag{8}$$
¹ Our convention does not allow a source to generate packets to itself. This does not
detract from the generality, as simple modifications in the expressions can accommodate
the case with self-traffic. For example, equation (7) would have N instead of N − 1.
3.1.2. Bit-Reverse Traffic (BRT)
In BRT, the destination address is formed by reversing the binary format
of the source node identifier as defined in section 3.1 and equation (6). For
example, source node (001110) will send all its packets to destination node
(011100). Figure 1 shows this pattern for a 4×4 network.
Figure 1: The bit-reverse traffic pattern with the node identifier shown in binary.
Let ς_n denote the bit-reverse of n (e.g. ς_100 = 001), S the source node
identifier, D the destination node identifier, and Sx, Sy, Sz, Dx, Dy, Dz the
address components of the source and destination nodes respectively [13].
Then equation (6) results in the following dependencies:
$$\begin{aligned} S_x &= S \bmod N_x \\ S_y &= (S \,\mathrm{div}\, N_x) \bmod N_y \\ S_z &= S \,\mathrm{div}\, (N_x N_y) \\ D &= \varsigma_S \bmod N \\ D_x &= D \bmod N_x \\ D_y &= (D \,\mathrm{div}\, N_x) \bmod N_y \\ D_z &= D \,\mathrm{div}\, (N_x N_y). \end{aligned} \tag{9}$$
If N is not a power of 2, i.e. N ≠ 2^k, some bit-reversed values ς_S will be
N or greater. Therefore we define D = ς_S mod N. When N = 2^k, as in
Figure 1, the modulo operation has no effect.
The distance between a source S and a destination D is the sum of the
x, y and z differences:
$$H_{br,SD} = |D_x - S_x| + |D_y - S_y| + |D_z - S_z|. \tag{10}$$
For a network with N nodes, the average distance is expressed as the mean
over all source-destination distances:
$$H_{br} = \frac{1}{N}\sum_{S=0}^{N-1}\sum_{D=0}^{N-1} H_{br,SD}. \tag{11}$$
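Equations (9) to (11) can be evaluated numerically with a few lines of Python. The sketch below is illustrative only and rests on two assumptions of ours: the bit width used for the reversal is the number of bits needed to represent N − 1, and the double sum in (11) is taken over the pairs that the pattern actually generates, i.e. one destination per source. All function names are hypothetical.

```python
def bit_reverse(value, width):
    """Reverse the `width` least-significant bits of `value`."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def id_to_xyz(s, nx, ny):
    """Equation (6): node identifier to (x, y, z) address."""
    return s % nx, (s // nx) % ny, s // (nx * ny)


def average_distance_bit_reverse(nx, ny, nz=1):
    """Mean hop count when every source S sends to D = bitrev(S) mod N."""
    n = nx * ny * nz
    width = max(1, (n - 1).bit_length())  # assumed identifier width
    total = 0
    for s in range(n):
        d = bit_reverse(s, width) % n
        src, dst = id_to_xyz(s, nx, ny), id_to_xyz(d, nx, ny)
        total += sum(abs(a - b) for a, b in zip(src, dst))
    return total / n


h_br_4x4 = average_distance_bit_reverse(4, 4)
```

For the 4×4 network of Figure 1 this evaluates to 2.5 hops; nodes whose identifier is a bit palindrome send to themselves under this pattern and contribute zero hops to the mean.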
3.1.3. Bit-Complement Traﬃc (BCT)
The destination node identiﬁers in the bit-complement pattern are derived
by bit-wise complementing the source node identiﬁer [13]. Figure 2 shows
an example. If ¬n denotes the bit-wise complement operation on a bit string
n (e.g. ¬01011 = 10100), then equation (6) gives:
$$\begin{aligned} S_x &= S \bmod N_x \\ S_y &= (S \,\mathrm{div}\, N_x) \bmod N_y \\ S_z &= S \,\mathrm{div}\, (N_x N_y) \\ D &= \neg S \bmod N \\ D_x &= D \bmod N_x \\ D_y &= (D \,\mathrm{div}\, N_x) \bmod N_y \\ D_z &= D \,\mathrm{div}\, (N_x N_y). \end{aligned} \tag{12}$$
Figure 2: The bit-complement traffic pattern with the node identifier shown in binary.
As in the bit-reverse case, the distance between any source S and any
destination D is
$$H_{bc,SD} = |D_x - S_x| + |D_y - S_y| + |D_z - S_z|, \tag{13}$$
giving the average distance for a 3-D mesh as:
$$H_{bc} = \frac{1}{N}\sum_{S=0}^{N-1}\sum_{D=0}^{N-1} H_{bc,SD}. \tag{14}$$
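The bit-complement case of equations (12) to (14) can be sketched in the same way. As before, this is an illustration under our own assumptions: identifiers are taken to be as wide as the binary representation of N − 1, the sum runs over the one destination each source actually addresses, and the function names are hypothetical.

```python
def id_to_xyz(s, nx, ny):
    """Equation (6): node identifier to (x, y, z) address."""
    return s % nx, (s // nx) % ny, s // (nx * ny)


def bit_complement(value, width):
    """Bit-wise complement restricted to `width` bits."""
    return ~value & ((1 << width) - 1)


def average_distance_bit_complement(nx, ny, nz=1):
    """Mean hop count when every source S sends to D = (not S) mod N."""
    n = nx * ny * nz
    width = max(1, (n - 1).bit_length())  # assumed identifier width
    total = 0
    for s in range(n):
        d = bit_complement(s, width) % n
        src, dst = id_to_xyz(s, nx, ny), id_to_xyz(d, nx, ny)
        total += sum(abs(a - b) for a, b in zip(src, dst))
    return total / n


h_bc_4x4 = average_distance_bit_complement(4, 4)
h_bc_4x4x4 = average_distance_bit_complement(4, 4, 4)
```

When every radix is a power of two, complementing the identifier mirrors each coordinate within its dimension, which gives 4 hops on average for a 4×4 mesh and 6 hops for a 4×4×4 mesh.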
3.1.4. Localized Random Traﬃc (LRT) - the Alpha Model
In any architectural design, common sense dictates that components which
communicate frequently with each other are placed in close proximity to
avoid unnecessary delay and congestion, inasmuch as is possible within the
physical constraints on placement. Local traﬃc models capture such sensible
design decisions. Under a local traﬃc pattern, the probability of a given
routing node being the destination for a generated packet varies inversely
as the source-destination distance. Thus for any given source node, packets
with close-by destinations are more numerous than packets with far-away
destinations.
The level of localization can be explicitly speciﬁed in the model by the
locality coefficient, α. When α=0, localization does not exist, and every
node generates packets with equal probability to all nodes (always excluding
self-traﬃc), whether near or far; this is identical to URT. As α increases, the
localization eﬀect increases and the number of packets generated with nearby
destinations increases. As α→∞, the average packet distance approaches 1
hop.
Figure 3: Eﬀect of the variation of localization coeﬃcient α on the average distance
measured in hop counts. The hop count converges to 1 with increasing α.
For a given network size of N routing nodes the probability of sending a
packet from S to D is
$$P_D = \frac{1}{K_S}\cdot\frac{1}{|S-D|^{\alpha}} \tag{15}$$
for S ≠ D, where |S−D| is the geometric (Manhattan) distance and K_S is a
normalizing factor that limits the sum of all probabilities to 1. Its value is
different for each source S and is calculated as follows:
$$K_S = \sum_{D=0}^{N-1} \frac{1}{|S-D|^{\alpha}}. \tag{16}$$
Then the average hop count is derived as follows:
$$H_{\alpha,SD} = \frac{1}{N-1}\sum_{S=0}^{N-1}\sum_{D=0}^{N-1} \left(|S-D|\,P_D\right). \tag{17}$$
Substituting (15) in (17) results in:
$$H_{\alpha,SD} = \frac{1}{N-1}\sum_{S=0}^{N-1}\sum_{D=0}^{N-1} \frac{|S-D|^{1-\alpha}}{K_S}. \tag{18}$$
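The alpha model can be explored numerically with the following Python sketch, which is an illustration rather than the code used for Figure 3. Self-traffic (D = S) is excluded from both K_S and the distance sum, and the outer mean divides by the number of sources N; the latter is an assumption made here so that the result tends to exactly 1 hop for large α, as described in the text. Dividing by N − 1 instead would follow the prefactor of (17) literally. The function name is hypothetical.

```python
from itertools import product


def average_distance_alpha(dims, alpha):
    """Mean hop count under localized traffic with locality coefficient alpha."""
    nodes = list(product(*(range(k) for k in dims)))
    total = 0.0
    for src in nodes:
        # Manhattan distances to every other node (self-traffic excluded).
        dists = [
            sum(abs(a - b) for a, b in zip(src, dst))
            for dst in nodes
            if dst != src
        ]
        weights = [dist ** (-alpha) for dist in dists]  # 1 / |S - D|^alpha
        k_s = sum(weights)                              # equation (16)
        total += sum(d * w for d, w in zip(dists, weights)) / k_s
    return total / len(nodes)  # assumed normalisation over N sources


h_alpha0 = average_distance_alpha((4, 4), 0.0)   # uniform, no self-traffic
h_alpha2 = average_distance_alpha((4, 4), 2.0)   # moderately localized
h_alpha60 = average_distance_alpha((4, 4), 60.0) # almost purely nearest-neighbour
```

Sweeping `alpha` over a range of values for several `dims` tuples reproduces the qualitative behaviour described for Figure 3: different starting points at α = 0 and convergence towards 1 hop.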
The localization eﬀect varies according to the network size and topology.
Figure 3 shows the localization eﬀect on the average distance for a network
of 216 routing nodes arranged as 8×8×8 and 2×8×16 cuboids and a 16×16×1
mesh. When α=0, the average hop count is the same as with URT, though the
values are different for each configuration. As α increases, the localization of
traﬃc increases, and the average distance decreases until all curves converge
to a value of 1 hop count.
3.2. Temporal Distribution
The temporal distribution deﬁnes the timing of release of packets into the
network. Several studies have concluded that realistic network traffic
demonstrates the property of self-similarity over a long period of time [24, 25, 26].
As bursty traﬃc is very prevalent in real applications, we have established a
self-similar synthetic pattern as a bursty traﬃc model which emulates real-
istic streaming of data.
Discrete self-similar traffic can be modelled by the bursty model
(B-Model) as described in [25]. In the B-Model, a bias β (0 < β < 1) is
introduced to the streaming pattern. A bias β=0.5 indicates that packets
are streamed at a uniformly distributed rate throughout a time interval com-
prising, say, n cycles. When the bias is set below or above 0.5, the streaming
rate becomes skewed, with the n-cycle time interval being split into two equal
portions, and a speciﬁed fraction of packets being emitted in the ﬁrst half,
and the rest in the second half. For example, a bias of β=0.2 implies that
20% of the packets are streamed in the ﬁrst half and 80% in the second half of
the time interval under consideration, or vice versa. This process of halving
is continued for each generated half of the original interval, for a number of
times that is deﬁned as the depth d, resulting in some number of discrete time
intervals in which the packets are distributed. For an n-cycle time series, the
length of each such discrete interval in the final sequence is n/2^d cycles,
giving 2^d intervals in total. The maximum value for d is limited by the
inequality n/2^d ≥ 1 or d ≤ log2(n) (where the simulation cycle duration has
been normalised to 1), as a simulation cycle
is an indivisible, atomic unit of time. After each division, the choice of which
half is assigned 20%, and which 80% (in this example), is made randomly.
The total number of packets, φtotal, to be transmitted within the time-
series of n cycles depends on the injection rate, γ, (0 ≤ γ ≤ 1). At the
maximum injection rate (γmax = 1) a node can inject at most one packet per
clock cycle, i.e. n packets in an n-cycle time-series. In general the following
relationship holds:
φtotal(γ) = γn. (19)
The physical time at the beginning of period i in the sequence is given
by i·n/2^d. If the total number of packets allocated to each period in the final
sequence is x(i·n/2^d), the total traffic volume (total number of packets) is the
sum of the traffic volume in each period:
$$\phi_{total}(\gamma) = \sum_{i=0}^{2^d} x\left(i\,\frac{n}{2^d}\right). \tag{20}$$
Figure 4: Allocation of packets in the B-model. The original n-cycle time-series is divided
into a number of discrete time intervals by a process of continuously splitting the parent
sequence into two halves. In the first step, the sequence comprises the entire time series,
and hence there is one interval which contains all of the packets. Halving this interval for
a number of times, d (depth), and randomly allocating to each resulting interval a fraction
of either β or 1−β of the total number of packets in the parent sequence results in the
final distribution of packets over time.
The number of packets that a node injects into the network within any
period, x(i·n/2^d), can be expressed as a function of the bias, β, the division
depth, d, and the injection rate, γ:
$$x\left(i\,\frac{n}{2^d}\right) = \left(\{\beta, 1-\beta\}\right)^d \left(\gamma n - \sum_{j=0}^{i-1} x\left(j\,\frac{n}{2^d}\right)\right). \tag{21}$$
In (21), the traffic volume at a given point in the final time sequence is
deﬁned as a function of the traﬃc volume at the coarser time step, and has
a straightforward recursive implementation.
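One possible way to realise the recursive halving is sketched below in Python. It is an illustration under our own assumptions rather than the traffic generator used in this study: volumes are kept as real numbers (a generator would still have to round them to whole packets), the random choice of which half receives the fraction β is made with a fair coin, and the function and parameter names are hypothetical.

```python
import random


def b_model(total_packets, bias, depth, rng=None):
    """Split `total_packets` over 2**depth equal intervals by repeated halving.

    At every split one half receives the fraction `bias` of its parent's
    volume and the other half receives 1 - bias; which half gets which
    share is chosen at random.
    """
    rng = rng or random.Random(0)  # fixed seed only for reproducibility
    volumes = [float(total_packets)]
    for _ in range(depth):
        next_volumes = []
        for volume in volumes:
            first = bias if rng.random() < 0.5 else 1.0 - bias
            next_volumes.append(volume * first)
            next_volumes.append(volume * (1.0 - first))
        volumes = next_volumes
    return volumes


# gamma = 0.1 over n = 10,000 cycles gives 1,000 packets (equation (19)).
n_cycles, gamma = 10_000, 0.1
bursty = b_model(gamma * n_cycles, bias=0.2, depth=5)
smooth = b_model(gamma * n_cycles, bias=0.5, depth=5)
```

With β = 0.5 every interval receives the same volume, while β = 0.2 concentrates up to 0.8^d of all packets in a single interval; in both cases the interval volumes add up to γn.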
Figure 5 shows the distribution of 1,000 packets over 10,000 cycles with
a bias of β = 0.2 and an injection rate of γ = 0.1. If the total is increased to
2,000 packets (γ = 0.2), the only change is in the amplitude (y-axis). The
temporal distribution (x-axis) is identical.
It turns out that the temporal distribution of packet generation has no
impact on the average hop count. The distance is determined by locations of
source and destination of packets, but not when in time they travel.
Consequently, it has no effect on the average hop count metric. However, it is well
known that bursty traffic is unhealthy for networks. Given a certain amount
of traffic to be transmitted, networks handle smooth traffic flows much better
than bursty traffic with big spikes. Viewed differently, a network needs more
buffering resources to cope well with bursty traffic.
Figure 5: Distribution of 1,000 packets over 10,000 cycles according to the B-model with
β=0.2.
In this light one can suspect that the average distance model has
difficulty predicting the relative performance of networks if the traffic is very
unevenly distributed over time. Intuition would suggest that two network
conﬁgurations may have signiﬁcantly diﬀerent capabilities to handle bursts,
even if both of them exhibit the same average distance. It therefore came as
a surprise to us that this effect did not appear in any of our simulations, as
exempliﬁed by Figures (7, 8, 10, 11, 13, 14, 16, 17). Certainly, more bursty
traﬃc results in heavier loading and causes the network to saturate at a lower
injection rate. However, this aﬀects all alternative network conﬁgurations in
the same way, and any differences in the networks' ability to cope with bursts
are evened out by the load distribution capability of the deflection routing
algorithm.
4. Simulation Environment and Comparison Methodology
In this section we describe the simulation environment and experimental
setup. In our network simulator, a hop is counted when a packet traverses
the link between adjacent routers.
4.1. Traffic configuration
The average distance predicted by the zero-load model for the various
traﬃc patterns as described in section 3 is calculated numerically for three
network topologies having the same total number of routing nodes, 4×4×4,
2×4×8 and 8×8×1. Each spatial traffic model is then combined with a
self-similar temporal bursty traffic model with bias values of 0.1, 0.3 and 0.5,
and used to generate traﬃc for cycle-accurate register-transfer-level (RTL)
simulations. The zero-load case is emulated by having a very low injection
rate of 0.01 packets per node per cycle, and the ﬁdelity of the model in pre-
dicting the network performance is checked by increasing the injection rate
beyond the saturation point. For a range of injection rates within the
simulation period, the average latency values are calculated for packets collected
from a sample window deﬁned within the stable phase of the network (after
the warm-up phase and before the cool-down phase). For the speciﬁc router
micro-architecture considered, a single hop count is equivalent to ﬁve clock
cycles in simulation. The simulation window is always long enough that no
packets are dropped.
4.2. On-Chip Network Architecture
A hop count can be translated into network latency given that the phys-
ical constraints are known. Ideally, the router-to-router hop delay is equal
throughout the network. This assumes that the link sizes are the same and
that all routers are identical. If the network is not regular, router-to-router
length is not the same and hops cannot be directly converted to network la-
tency. An example is a 3-D cube network where through silicon vias (TSVs)
are used to connect the vertical layers. The TSVs are short and fast
compared to long global planar wires due to their lower electrical parasitics. As a
result, inter-layer communication is typically faster than intra-layer
communication. This means that horizontal network hops are slower than vertical
hops. Thus, vertical and horizontal hops are separated to calculate the net-
work latency.
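The separation of horizontal and vertical hops can be illustrated with a small Python sketch. It assumes uniform random traffic, so that the per-dimension hop counts are the terms of equation (8), and it takes the third dimension to be the vertical one. The five cycles per horizontal hop correspond to the router considered in section 4.1; the three cycles per vertical hop are a purely hypothetical value chosen for illustration, as are the function names.

```python
def per_axis_average_hops(dims):
    """Per-dimension terms of equation (8): (k - 1/k) / 3 for each radix k."""
    return [(k - 1.0 / k) / 3.0 for k in dims]


def zero_load_latency(dims, horizontal_cycles, vertical_cycles):
    """Zero-load latency in cycles for an (x, y, z) mesh, z being vertical."""
    hx, hy, hz = per_axis_average_hops(dims)
    return (hx + hy) * horizontal_cycles + hz * vertical_cycles


# Equal hop cost in all directions versus faster (hypothetical) vertical hops.
lat_equal = zero_load_latency((4, 4, 4), horizontal_cycles=5, vertical_cycles=5)
lat_fast_tsv = zero_load_latency((4, 4, 4), horizontal_cycles=5, vertical_cycles=3)
lat_planar = zero_load_latency((8, 8, 1), horizontal_cycles=5, vertical_cycles=3)
```

With equal hop costs the latency is simply the average distance times the hop delay; once vertical hops are cheaper, configurations with more layers gain an additional advantage beyond their shorter average distance.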
In this study we use a buffer-less switch and non-minimal, fully adaptive,
deﬂective routing, also known as hot-potato routing. Buﬀer-less routers have
an inherent advantage of simplicity, energy-eﬃciency, and cost-eﬀectiveness [33].
Diﬀerent implementations of buﬀer-less architectures have been reported [34]
[35]. Each router consists of control units and sorting units and utilise a
crossbar architecture, pipelined in three stages, with connectivity between375
input/output ports to six directions and to the resource. The six directional
ports are North, South, East, West, Up, and Down with the seventh port pro-
viding access to the resource (such as on-chip processing elements or oﬀ-chip
blocks such as memory or I/O).
Mesh networks are used throughout to connect routers in 2-D, and cube
networks in 3-D. Meshes are chosen because of their simplicity in configuration
and practicality in hardware implementation. Depending on its position,
a single router connects to between two and four routers in its Manhattan
neighbourhood in a 2-D mesh; a router connects to between three and six routers in
3-D cube networks.
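The neighbour counts quoted above can be checked by enumeration. The sketch below is illustrative only; `neighbour_counts` is a hypothetical helper that returns the smallest and largest number of directly connected routers in a mesh of the given dimensions.

```python
from itertools import product


def neighbour_counts(dims):
    """Return (min, max) number of mesh neighbours over all routers."""
    counts = set()
    for node in product(*(range(d) for d in dims)):
        n = 0
        for axis, size in enumerate(dims):
            if node[axis] > 0:
                n += 1  # neighbour in the negative direction of this axis
            if node[axis] < size - 1:
                n += 1  # neighbour in the positive direction of this axis
        counts.add(n)
    return min(counts), max(counts)
```

For an 8×8 mesh this gives between two and four neighbours, and for a 4×4×4 cube between three and six. A dimension of size two, as in 2×4×8, caps the maximum at five, since every router then has exactly one neighbour along that axis.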
A packet is a single ﬂit long containing both control and payload bits.
Once the destination address is provided by the source, the packetization
process is initiated in a network interface (NI) component. A relative ad-
dressing scheme is used to set the destination bits in the form of X, Y, Z.
For temporal traffic, a self-similar pattern is used; the spatial traffic patterns
comprise uniform, bit-reverse, bit-complement, hot-spot, and local
patterns. When running simulations, the injection rate is varied depending
on the traffic pattern in use.
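The two bit-permutation patterns can be illustrated on a linear node index. This is a minimal sketch using the common textbook definitions; how the paper maps X, Y, Z coordinates onto address bits is defined in section 3 and is not reproduced here, so the 6-bit linear index for a 64-node network is an assumption.

```python
def bit_reverse(addr, nbits):
    """Destination whose address bits are the source bits in reverse order."""
    out = 0
    for _ in range(nbits):
        out = (out << 1) | (addr & 1)  # move the lowest source bit into the result
        addr >>= 1
    return out


def bit_complement(addr, nbits):
    """Destination whose address bits are the inverted source bits."""
    return addr ^ ((1 << nbits) - 1)


# Assumed example: node 6 (000110) in a 64-node network with 6 address bits.
reverse_dst = bit_reverse(0b000110, 6)        # 011000
complement_dst = bit_complement(0b000110, 6)  # 111001
```

Both mappings are fixed permutations of the node set, so every source always sends to the same destination, unlike the uniform random pattern.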
When two or more packets compete for the same link, we honour an
oldest-packet-first priority scheme. No packets are dropped from the network.
Instead, when the network is congested, the packets are accumulated in a
FIFO buﬀer in the network interface (NI) situated between each router and
its local processing element. More details of the routing protocol and router
micro-architecture are given in [36].
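The oldest-packet-first arbitration with deflection can be sketched in a few lines. This is an illustrative model only, not the router micro-architecture of [36]: the packet fields (`id`, `age`, `preferred`), the port names, and the rule of deflecting to the first free port are assumptions, and ejection to the local resource is ignored.

```python
def allocate_ports(packets, ports):
    """Assign one output port per packet, oldest packet first.

    packets: list of dicts with keys 'id', 'age' and 'preferred'
             (output ports that move the packet closer to its destination).
    ports:   list of output port names.
    """
    if len(packets) > len(ports):
        raise ValueError("a buffer-less router needs one output port per packet")
    free = list(ports)
    assignment = {}
    for pkt in sorted(packets, key=lambda p: p["age"], reverse=True):
        choice = next((p for p in pkt["preferred"] if p in free), None)
        if choice is None:
            choice = free[0]  # deflection: no productive port is left
        free.remove(choice)
        assignment[pkt["id"]] = choice
    return assignment


PORTS = ["N", "S", "E", "W", "U", "D"]
contending = [
    {"id": "a", "age": 3, "preferred": ["E"]},
    {"id": "b", "age": 7, "preferred": ["E"]},
]
result = allocate_ports(contending, PORTS)
```

In this example the older packet `b` wins the East port and the younger packet `a` is deflected, which is the behaviour that lengthens the travelled distance under load.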
4.3. Simulation Methodology
Packet latencies are extracted by running cycle accurate RTL simulations,
collating packet injection, ejection and traversal data at each router over the
entire simulation and processing this data in Matlab. The latency in all
graphs is given in multiples of clock cycles. This allows a straightforward
comparison between regular and irregular networks where a hop comprises
diﬀerent numbers of clock cycles. More details of the simulation methodology
are given in [36].
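The averaging over a sample window can be sketched as follows. The post-processing in this work is done in Matlab; the Python sketch below is a stand-in that assumes each packet is recorded as an (injection cycle, ejection cycle) pair and that a packet belongs to the window if it was injected inside it. Both the record format and the window rule are assumptions.

```python
def average_latency(records, window_start, window_end):
    """Mean latency in cycles of packets injected in [window_start, window_end)."""
    latencies = [eject - inject
                 for inject, eject in records
                 if window_start <= inject < window_end]
    if not latencies:
        raise ValueError("no packets were injected inside the sample window")
    return sum(latencies) / len(latencies)


# Hypothetical trace: the first and last packets fall in warm-up and cool-down.
trace = [(5, 30), (100, 125), (150, 190), (990, 1100)]
mean_cycles = average_latency(trace, window_start=50, window_end=900)
```

Excluding the warm-up and cool-down phases keeps start-up transients and draining packets out of the reported average.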
5. Experiments on Regular Traﬃc Patterns
In this section and the next, we show how the zero-load predictive model
exhibits almost perfect fidelity for regular and irregular traffic patterns and
network topologies, and any deviation is within the limits of numerical ac-
curacy imposed by the simulations and calculations. We also show how the
model can be used to ﬁnd the optimum network node placement within the
limits allowed by the communication and physical constraints imposed by
the specifications. This section concentrates on regular traffic patterns and
regular networks, characterised by homogeneity across the network in both
cases. Section 6 looks at irregular traffic patterns and irregular networks
characterised by heterogeneity across the network.
Figure 6: Variation of average latency with increasing injection rate for non-bursty URT
with bias, β=0.5
5.1. Uniform Random Traﬃc (URT)
Figure 6 plots the simulation results for a 64 routing node network configured
as an 8×8×1 2-D mesh, and 2×4×8 and 4×4×4 3-D meshes. Packets are
injected under the URT model with no burstiness (i.e. bias β=0.5). At very
low injection rates, the average latency converges to the zero-load delay, i.e.
the zero-load average hop count expressed in clock cycles. The configuration
with the minimum average distance
is the 4×4×4 3-D mesh, as its geometry dictates that packets have to
traverse fewer links to reach their destinations. When the injection rate is
increased the network congestion levels increase, and as a consequence the
average delay grows for all configurations.
Interestingly, increasing injection rates increase the diﬀerences between
these configurations under load. We observe this phenomenon in many, but
not all traﬃc patterns, with hot-spot being a notable exception as discussed
in section 6.1.
Figures 7 and 8 show the growth of latency with increasing injection rate
for bursty URT with bias β=0.3 and β=0.1 respectively. While the saturation
injection rate drops, the zero-load average distance model exhibits perfect
ﬁdelity.
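The effect of the bias on burstiness can be illustrated with a recursive split of the traffic volume. The sketch below assumes that the bias β plays the role of the parameter of the b-model of [25]: at each level the volume of a time interval is divided between its two halves in the proportion β to 1-β, with the heavier half chosen at random. Whether this matches the generator used in the simulations is an assumption, and the function name `b_model` is illustrative.

```python
import random


def b_model(total_packets, levels, bias, rng):
    """Distribute total_packets over 2**levels time slots with the given bias."""
    volumes = [float(total_packets)]
    for _ in range(levels):
        split = []
        for volume in volumes:
            # Randomly decide which half of the interval receives the share `bias`.
            left_share = bias if rng.random() < 0.5 else 1.0 - bias
            split.append(volume * left_share)
            split.append(volume * (1.0 - left_share))
        volumes = split
    return volumes


uniform = b_model(64, 3, 0.5, random.Random(1))  # every slot gets the same volume
bursty = b_model(64, 3, 0.1, random.Random(1))   # volume concentrates in few slots
```

With β=0.5 every slot receives the same volume, which corresponds to the non-bursty case; as β moves towards 0.1 the same total volume is concentrated in fewer slots.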
Figure 7: Variation of average latency with increasing injection rate for bursty URT with
bias β=0.3.
Figure 8: Variation of average latency with increasing injection rate for bursty URT with
bias β=0.1.
5.2. Bit-Reverse Traﬃc (BRT)
The injection rate is varied from 0.01 up to 1.0 packets per node per cycle for
biases of β=0.5, β=0.3 and β=0.1 under the self-similar temporal model for
bit-reverse traffic. Figure 9 shows the result for β=0.5 which is equivalent
to the case of packets being uniformly distributed in time according to the
bit-reverse spatial pattern. When the bias is skewed to β=0.3, as shown
in Figure 10, the average distances start to increase in each case due to
the increased congestion in the network and the network exit points. This
worsens when the bias is set to β=0.1, as shown in Figure 11. In all cases,
the zero-load model predicts the relative performance of the conﬁgurations
correctly up to the saturation point.
Figure 9: Variation of average latency with increasing injection rate for non-bursty BRT
with bias, β=0.5.
Figure 10: Variation of average latency with increasing injection rate for bursty BRT with
bias, β=0.3.
5.3. Bit-Complement Traﬃc (BCT)
Figure 12 shows the results for unbiased traﬃc with the bit-complement
spatial distribution. For low injection rates the average latency converges
to the delay predicted by the zero-load model in terms of clock cycles for
each conﬁguration. When the injection rates are increased, the latency also
increases without the curves ever crossing each other. Similarly, we also
observe perfect fidelity for bursty traffic with bias β=0.3 and β=0.1 as shown in
Figures 13 and 14 respectively.
Figure 11: Variation of average latency with increasing injection rate for bursty BRT with
bias, β=0.1.
Figure 12: Variation of average latency with injection rate for non-bursty BCT when bias,
β=0.5.
5.4. Localized Random Traﬃc (LRT)
Figure 15 shows how the latency increases with injection rate when the lo-
calization coeﬃcient α=1 and the self-similar bias β=0.5, ensuring uniform
streaming of packets under a local traﬃc pattern. At low injection rates the
latency converges to the zero-load average distance as for the other cases.
When the bias is set to β=0.3 (Figure 16), or β=0.1 (Figure 17), the result-
ing temporally skewed traﬃc causes insigniﬁcant changes. This is because
strong localization in the traffic generation results in more packets with
destinations within a relatively short distance compared to the network
dimensions. Clearly, for each configuration, the average latency for local traffic is
less than that for the corresponding URT traffic shown in Figures 6, 7 and 8.
Figure 13: Variation of average latency with increasing injection rate for bursty BCT with
bias, β=0.3.
Figure 14: Variation of average latency with increasing injection rate for bursty BCT with
bias, β=0.1.
6. Experiments on Irregular Traﬃc Patterns
In this section, we further validate the zero-load predictive model for networks
with irregular traﬃc patterns as well as irregular networks. We also show
how such networks can be configured for optimal performance.
6.1. Networks with Hot-Spots
Nodes that generate or receive a greater proportion of traﬃc than other
nodes are called hot-spots. Typical hot-spot nodes are memory controllers,
a critical processing resource, or a system controller.
For instance, the wide-IO JEDEC standard specifies 512-bit-wide data
interfaces [37] from the logic plane to the DRAM memory plane in stacked
systems. DRAM layers can be physically stacked on top of (or below) logic
layers and connected by means of through-silicon vias (TSVs). Each wide
I/O access port requires 512 interconnects for data and additional lines for
addressing. The processing elements or cores in the logic layers typically
share the memory layers. This means that data access is made only through
the parallel TSV clusters, which in turn are accessed on the die through
a dedicated resource. Such shared access creates a hot-spot region in an
on-chip network architecture. Hot-spot regions should be designed in such
a way that there is sufficient link bandwidth to support worst-case traffic
congestion. This leads us to explore the optimal placement of hot-spot nodes
on a die to minimise congestion, given placement constraints.
Figure 15: Variation of average latency for non-bursty (bias, β=0.5) LRT with alpha α=1.
Figure 16: Variation of average latency for LRT with α=1 and bias β=0.3.
Figure 17: Variation of average latency for LRT with α = 1 and bias β = 0.1.
Figure 18: Placement of hot-spot nodes on top layer.
Figure 18 shows diﬀerent conﬁgurations of two resources serving as access
ports to DRAM either stacked in the same package or placed oﬀ-chip. The
memory access resources have to be on the top logic layer due to I/O
considerations. Each core in the network in any of the three layers that has access
to any block in the memory layers sends requests through the access ports.
The combined requests generate a hot-spot region with heavier traﬃc in the
area surrounding these tiles.
The optimal placement of these hot-spots that yields the best performance
is found through cycle accurate RTL simulations for networks under loading.
For this experiment, we examine networks of three different sizes, 4×4×4,
7×7×7, and 10×10×10. The different placements of the access tiles on the
top layer considered are shown in Figure 18. While some other arrangements
are possible, many can be eliminated through symmetry, and these
are carefully selected as being representative of most sensible configurations
to validate the case.
Figure 19: Variation of average latency with injection rate for hot-spot traffic in 4×4×4
network with HS1 and HS2 hot-spot placement on top layer
Figure 20: Variation of average latency with injection rate for hot-spot traffic in 7×7×7
network with HS1, HS2 and HS3 hot-spot placement
In this exercise hot-spot nodes receive 80% of the packets generated by the
non-hot-spot nodes, while the remaining 20% are sent to other non-hot-spot
destinations under a uniform random distribution. This spatial distribution
is then uniformly distributed over time (bias β=0.5 in the self-similar
model). Figures 19 to 21 show the results for each network configuration
with increasing injection rate from 0.001 up to 0.01 packets per node per
cycle.
Figure 21: Variation of average latency with injection rate for hot-spot traffic in 10×10×10
network with HS1, HS2 and HS3 hot-spot placement
Figure 22: The three configurations have an equal number of routing nodes and each has two
hot-spots located at their center. The Y-axis gives the saturation injection rate while the
X-axis denotes the fraction of overall traffic directed to the hot-spots. As this fraction
increases, the networks' saturation injection rates decrease and converge.
With increasing injection rate, the average packet latency in each configuration
increases without the curves crossing each other. The model again
exhibits perfect ﬁdelity for all tested hot-spot conﬁgurations.
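The 80%/20% spatial split used in this experiment can be sketched as a destination selector. The sketch is illustrative: the node numbering, the hot-spot indices, and the assumption that a hot-spot-bound packet picks uniformly among the hot-spots are ours and are not stated in the text.

```python
import random


def pick_destination(src, nodes, hot_spots, hot_fraction, rng):
    """Choose a destination for a packet generated by a non-hot-spot node."""
    if rng.random() < hot_fraction:
        return rng.choice(hot_spots)  # assumption: hot-spots are equally likely
    # Remaining traffic goes uniformly to other non-hot-spot nodes.
    others = [n for n in nodes if n != src and n not in hot_spots]
    return rng.choice(others)


rng = random.Random(42)
nodes = list(range(64))   # e.g. a 4x4x4 network, linear node indices
hot_spots = [27, 36]      # hypothetical hot-spot positions
draws = [pick_destination(0, nodes, hot_spots, 0.8, rng) for _ in range(20000)]
hot_share = sum(d in hot_spots for d in draws) / len(draws)
```

Over many draws `hot_share` approaches 0.8, which is why the links around the hot-spots become the first to congest as the injection rate rises.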
It is interesting to note that the diﬀerences in latency of diﬀerent conﬁgu-
rations decrease as the network load increases, unlike all the traﬃc patterns
studied earlier. For URT, LRT, BRT and BCT the differences grow because
the longer a packet has to travel the more it will suﬀer from increased con-
gestion simply because there is more time for it to be aﬀected. As the less
optimal topologies have on average more packets that travel longer, they are
aﬀected more by congestion and hence the diﬀerences in latency increase.
Figure 23: A 6× 6 network conﬁguration in 3 layers with node clustering
Figure 24: Percent diﬀerences for selected conﬁgurations
With hot-spot traffic, packets that travel to or near a hot-spot
suffer more from congestion than other packets. As load increases, the
congestion around hot-spots rises first, affecting all packets that travel
nearby indiscriminately. Therefore we see in Figures 19, 20, and 21 that
the latency curves of the different configurations increase roughly in parallel
until the congestion starts to dominate the delay, at which point the
lines converge. This convergence of saturation points is demonstrated in
Figure 22. It shows three network configurations with two hot-spots around
their respective centres. As the fraction of traffic directed to these hot-spots
increases, the injection rates at which these different configurations saturate
decrease and converge. Hence, strong hot-spot traffic tends to dominate a
loaded network and defines its saturation point, but does not, even under high
load, reverse the relative performance ranking of networks. Consequently, the
average distance model is a valid predictor for hot-spot dominated networks.
See section 7 for a discussion of when this trend is likely to be reversed.
Table 1: Traffic probabilities for MAP IP-cores
Source IP core | Probability to target IP-core | Relative injection rate
GPU | 68% L2 GPU, 2% CPU, 20% Display Interface, 9% total to all other interfaces, 1% System control | 1 IR
CPU | 40% L2 CPU, 8% All GPU, 10% Audio, 10% Video, 4% Camera, 5% Security, 22% all other Interface, 1% System control | 0.7 IR
Audio | 30% WideIO, 28% Security, 20% CPU, 15% Standard, 3% Ethernet, 3% User, 1% System control | 0.2 IR
Video | 50% WideIO, 9% Security, 20% CPU, 20% all interfaces, 1% System control | 0.8 IR
Camera | 30% WideIO, 60% Display, 5% CPU, 4% Security, 1% System control | 0.8 IR
Security | 60% WideIO, 20% Audio, 14% Video, 5% CPU, 1% System control | 0.3 IR
L2 GPU | 19% L3, 80% GPU, 1% System control | 0.8 IR
L3 | 26% L2 GPU A, 26% L2 GPU B, 26% L2 CPU, 21% WideIO, 1% System control | 1 IR
L2 CPU | 20% L3, 79% CPU, 1% System control | 0.8 IR
WideIO | 48% L3, 15% Audio, 26% Video, 5% Security, 5% Camera, 1% System control | 1 IR
System Control | 24% CPU, 24% GPU, 4% to each of the remaining 13 cores | 0.2 IR
Standard Interface | 16% GPU, 20% CPU, 46% Audio, 10% Video, 5% Security, 2% Camera, 1% System control | 0.5 IR
User | 24% GPU, 22% CPU, 24% Audio, 24% Video, 5% Security, 1% System control | 0.5 IR
Ethernet | 24% GPU, 25% CPU, 15% Audio, 30% Video, 5% Security, 1% System control | 0.5 IR
Display | 64% GPU, 12% Video, 15% CPU, 12% WideIO, 3% Camera, 5% Security, 1% System control | 0.5 IR
6.2. Configuration of Regular Networks with Irregular Traffic Patterns
In this example we attempt to identify the best placement conﬁguration in
a complex 3-D network by means of the zero-load predictive model. The
network has three layers with each layer having 6×6 nodes as shown in Fig-
ure 23. The network includes two hot-spot nodes. The ﬁrst provides access
to off-chip data inputs and outputs, and is placed at the periphery of the
bottom layer based on I/O considerations. The second provides access to a
wide-IO port that connects to a memory layer stacked on top of the three
layers. It is placed in the middle of the network based on manufacturing
considerations [38]. In order to simplify the traﬃc allocation, the routing
nodes are grouped into six different clusters defined by their traffic generation
probability as shown in Table 2.
Table 2: Traffic generation probabilities of cores in different clusters to hot-spot nodes
(memory & off-chip) and other nodes
Cluster   | Number of cores | To Memory | To Off-chip | To other cores
Cluster 1 | 18              | 7.14%     | 7.14%       | 85.71%
Cluster 2 | 18              | 14.29%    | 14.29%      | 71.43%
Cluster 3 | 18              | 21.43%    | 21.43%      | 57.14%
Cluster 4 | 18              | 28.57%    | 28.57%      | 42.86%
Cluster 5 | 18              | 35.71%    | 35.71%      | 28.57%
Cluster 6 | 18              | 42.86%    | 42.86%      | 14.29%
By permutation of the six clusters in the network, 720 possible configurations
(M001-M720) can be derived. The zero-load average distance model reveals
that configuration M451 with 4.6041 hops has the shortest average distance
whereas M245 with 4.888 hops has the longest average distance. Figure 24
shows the top five configurations (M451, M549, M470, M452, & M450)
with the shortest average distance, as well as the one with the longest (M245),
in the top row. The relative % difference between any two configurations is
calculated.
The differences in average distance between all six configurations are also
shown in Figure 24. For example, the relative difference of M245 and M451
is 6.17%. Given the large number of possible configurations, the difference
between any two consecutive configurations is quite small.
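The configuration count and the quoted relative difference can be reproduced with a few lines of Python. The sketch assumes that the relative difference is normalised by the shorter of the two distances, which is the choice that yields the 6.17% quoted above; the function name is illustrative.

```python
from itertools import permutations


def relative_difference_percent(longer, shorter):
    """Relative difference in percent, normalised by the shorter distance."""
    return (longer - shorter) / shorter * 100.0


# Six clusters placed in six regions: one configuration per permutation.
n_configurations = sum(1 for _ in permutations(range(6)))

# Zero-load average distances of M245 and M451 as reported in the text.
diff_m245_m451 = relative_difference_percent(4.888, 4.6041)
```

Here `n_configurations` is 720 and `diff_m245_m451` rounds to 6.17, so the whole design space spans only about six percent in zero-load average distance.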
In order to check the fidelity of the model and see how the predictions
hold up under increasing load, we carried out RTL simulations for all
configurations.
In this example we did find cross-overs in the latency curves for different
configurations, which are marked as red cells in Figure 24. It turns out that
cross-overs only occurred when the difference in the zero-load average
distance was less than or equal to 0.13%, which translates to an absolute
difference on the order of 0.006 for the zero-load average distance values prevalent
in this example. In the post-processing, latency values are rounded to two
decimal places and truncated, leading to a maximum absolute error in
individual readings of 0.005, which can accumulate over multiple transactions.
Also, the stochastic processes used to generate packets over time for the
simulations deviate in any finite time period from the ideal probability
distributions used in the zero-load model. Therefore these cross-overs seem to
be within the range of the numerical error introduced by the simulations and
subsequent calculations and appear not to represent a true violation of the
fidelity of the model.
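The orders of magnitude in this argument can be checked with one line of arithmetic. Taking the shortest zero-load average distance reported in this example (4.6041 hops) as a representative value, which is a choice made here for illustration, the 0.13% threshold corresponds to:

```latex
\[
  0.0013 \times 4.6041 \approx 0.0060 ,
  \qquad
  \text{maximum rounding error per reading} = 0.005 .
\]
```

The two quantities are of the same order, which is consistent with attributing the observed cross-overs to numerical error.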
6.3. Conﬁguration of Irregular Networks with Irregular Traﬃc Patterns
The relevance of the zero-load predictive model for diﬀerent applications is
further investigated with an irregular network, based on two conﬁgurations
of a generic mobile application processor (MAP) shown in Figure 25(a) and
25(b). The MAP is composed of processing elements of different sizes each
connected to a routing node and thus the network topology used to connect
them is an irregular one. We have used the following core descriptions to
obtain likely tile sizes for the network as well as representative (irregular)
traffic pattern models of communication between the different elements. The
MAP consists of GPU clusters each with four GPU cores, and a single CPU
cluster with eight cores. Each cluster accesses its own dedicated L2 cache.
There is a common L3 cache with direct access to a 3-D wide-IO port located
at the center. The wide-IO DRAM blocks are stacked on top of the MAP.
There are also application specific IP-cores such as an audio DSP and video
codec, a camera, and security and system controls. For off-chip accesses, a
standard interface such as USB, SPI or any user defined interface can be
used. A display interface and wired connectivity through Ethernet are also
included.
(a) MAP Model 1 (b) MAP Model 2
Figure 25: Two configurations of a generic Mobile Application Processor (MAP)
The traﬃc generated by individual IP-cores is non-uniform. Cores such
as GPUs stream packets at a higher rate while IP-cores such as system
controllers generate packets at a lower rate with a small contribution to the
overall traﬃc. Table 1 shows the spatial probability distribution of traﬃc
used to simulate the two MAP conﬁgurations, normalised to the GPU in-
jection rate. For example, the relative injection rate of 0.2 for the security
IP-core means that its traffic contribution is only 20% of the maximum traffic
contribution by a core (i.e. the GPU’s contribution). The simulation results
are shown in Figure 26, and the results conﬁrm 100% ﬁdelity of the model
for injection rates below saturation.
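The way a row of Table 1 drives traffic generation can be sketched as weighted sampling. The sketch uses the GPU row and three relative injection rates as they appear in Table 1; the dictionary layout, the function names, and the use of `random.choices` are illustrative assumptions and do not describe the simulator used in this work.

```python
import random

# GPU row of Table 1: target -> probability in percent.
GPU_TARGETS = {
    "L2 GPU": 68,
    "CPU": 2,
    "Display Interface": 20,
    "All other interfaces": 9,
    "System control": 1,
}

# Relative injection rates from Table 1, normalised to the GPU rate (1 IR).
RELATIVE_RATE = {"GPU": 1.0, "Audio": 0.2, "System Control": 0.2}


def effective_injection_rate(core, base_rate):
    """Injection rate of a core when the GPU injects at base_rate."""
    return RELATIVE_RATE[core] * base_rate


def pick_target(targets, rng):
    """Sample one target IP-core according to its probability."""
    names = list(targets)
    weights = [targets[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


sample = pick_target(GPU_TARGETS, random.Random(7))
```

Scaling every core by the same base rate keeps the traffic mix fixed while the overall load is swept, which is what allows a single static distance figure to be compared against a whole latency curve.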
7. Discussion
In the absence of reliable delay models for NoCs with adaptive routing,
cycle-accurate simulation is the only tool to assist system architects in deciding
upon network topology, mapping, and other critical early-phase design
choices. By focusing on a relative rather than absolute performance metric,
we have formulated a model that predicts with high fidelity whether one
configuration will exhibit better performance than another even under high load
with burstiness in packet injection. The zero-load model is a static property
of topology, mapping and traffic probabilities. Even though it does not
take into account congestion, interference or temporal variability in traffic,
it surprisingly shows almost perfect fidelity for deflection routing networks.
Figure 26: Average clock cycles with self-similar irregular traffic pattern for MAP
configurations
We studied the model under a wide range of loading and topological
conditions from uniform random to hot spots to irregular traffic and networks.
Under all these conditions we only observed the relative performance of dif-
ferent network conﬁgurations changing under load in a few cases when the
average distance of two alternative conﬁgurations diﬀered by 0.13% or less.
These differences fall within the numerical error introduced by rounding and
stochastic variations in the traﬃc generation.
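The notion of fidelity used here, that the ranking predicted at zero load is preserved at every measured load point, can be expressed as a simple check. The sketch below is illustrative: the function name is hypothetical and the numbers in the example are invented placeholders, not measurements from this work.

```python
def ranking_preserved(zero_load, curves):
    """True if the zero-load ranking holds at every sampled injection rate.

    zero_load: dict mapping configuration name -> zero-load average distance.
    curves:    dict mapping configuration name -> list of average latencies,
               sampled at the same injection rates for every configuration.
    """
    predicted = sorted(zero_load, key=zero_load.get)
    n_points = len(next(iter(curves.values())))
    for i in range(n_points):
        measured = sorted(curves, key=lambda name: curves[name][i])
        if measured != predicted:
            return False  # the latency curves cross at this load point
    return True


# Placeholder values for illustration only.
zero_load = {"A": 3.8, "B": 4.4, "C": 5.3}
no_crossing = {"A": [19, 22, 30], "B": [22, 26, 38], "C": [27, 33, 50]}
crossing = {"A": [19, 22, 40], "B": [22, 26, 38], "C": [27, 33, 50]}
```

With the placeholder data, `ranking_preserved(zero_load, no_crossing)` holds while the `crossing` set fails at the last load point, which is the kind of cross-over reported only for configurations whose zero-load distances differ by 0.13% or less.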
It is interesting to note that in all studied cases of regular traﬃc patterns
the diﬀerences in delay grow with increasing traﬃc load, as attested to by
the diverging delay curves in Figures 6-17. For hot-spot and other irregular
traffic patterns the curves run parallel (Figures 19, 21, 26) or even converge
(Figure 20). It seems that divergence occurs when congestion builds up
uniformly in the whole network, thus aggravating every initial diﬀerence.
However, if network behaviour is dominated by the congestion in a small
area, the saturation point is reached when this small area becomes heavily
congested, and thus any initial advantage in terms of the average distance
is lost. Thus, in these scenarios the delay curves converge towards the same
saturation point (Figure 22). A prime example is a pronounced hot spot
where the congestion in a single routing node’s exit link determines when
the network is saturated.
More generally, whenever some local congestion cannot be absorbed and
balanced over the whole network, it will dominate the network at high load.
If diﬀerent conﬁgurations still have the same (or similar) bottleneck channel
or channels (Figure 22), the zero-load predictive model holds. If they have
a diﬀerent bottleneck channel, as may happen with deterministic routing
algorithms, the average distance does not contain sufficient information to
predict relative performance under load.
Thus, it needs to be emphasised that the predictive power of the average
distance model relies on the load distribution capability of adaptive routing.
Our experiments have shown that it is less suitable for deterministic routing
because in such networks individual links may constitute bottlenecks
determining the limit of the network's load, even though the network as a whole
has abundant spare capacity. The average distance model is a global property
and it averages out local imbalances, thus mirroring closely the load distri-
bution of adaptive routing. We have validated the model only for deﬂection
routing, which, it can be argued, has a perfect load distribution capability.
We hypothesize that the model is well suited for other adaptive routing al-
gorithms to the extent that they have good load balancing capabilities; to
conﬁrm this hypothesis is future work.
Hence, it is ironic but understandable that deflection routing, together
with other adaptive routing algorithms, defies all attempts to formulate an
accurate analytic delay model but finds in the average distance model a very
good predictor of relative performance.
8. Conclusion
Delay models for NoCs with adaptive routing that can accommodate a
range of spatio-temporal traffic patterns and topologies do not exist, due to
the inherent complexity in capturing the eﬀect of packet interaction across
time and space. However, we have shown that a static, relative metric
that does not consider congestion is able to predict with remarkable ﬁdelity
whether a network will exhibit better or worse performance than another,
even under heavy loading and bursty traffic. This metric, the zero-load
average distance, is a good predictor of the relative performance of NoCs with
adaptive routing because it is a global property that captures the essence of
the load balancing capability of a network.
References
[1] A. E. Kiasari, A. Jantsch, Z. Lu, Mathematical formalisms for perfor-
mance evaluation of networks-on-chip, ACM Computing Surveys.
[2] N. e. a. Audsley, Applying new scheduling theory to static priority pre-
emptive scheduling, Software Engineering Journal 8 (5) (1993) 284–292.
[3] Z. L. Y. Qian, W. Dou, Analysis of worst-case delay bounds for
on-chip packet-switching networks, Computer-Aided Design of Integrated
Circuits and Systems, IEEE Transactions on 29 (5) (2010) 802 –815.
doi:10.1109/TCAD.2010.2043572.
[4] M. B. et al., Dataﬂow analysis for real-time embedded multiprocessor
system design, in: Dynamic and Robust Streaming in and between
Connected Consumer-Electronic Devices, Springer, 2005, pp. 81–108.
[5] P. Bogdan, R. Marculescu, Non-stationary traffic analysis and its implications
on multicore platform design, Computer-Aided Design of Integrated
Circuits and Systems, IEEE Transactions on 30 (4) (2011) 508–519.
doi:10.1109/TCAD.2011.2111270.
[6] J. Duato, P. López, Performance evaluation of adaptive routing algorithms
for k-ary n-cubes, in: K. Bolding, L. Snyder (Eds.),
Proceedings of the First International Workshop on Parallel Computer
Routing and Communication, Springer, 1994, pp. 45–59.
[7] J. Hu, R. Marculescu, Dyad: smart routing for networks-on-chip, in:
Proceedings of the 41st annual Design Automation Conference, DAC,
2004, pp. 260–263.
[8] J. K. et al, A low latency router supporting adaptivity for on-chip inter-
connects, in: Proceedings of the 42nd Design Automation Conference,
2005, pp. 559–564.
[9] E. N. et al, Load distribution with the proximity congestion awareness
in a network on chip, in: Proceedings of the Design Automation and
Test Europe (DATE), 2003, pp. 1126–1127.
[10] L. P. L. Shang, A. Kumar, N. K. Jha, Thermal modeling, characterization
and management of on-chip networks, in: Proceedings of the 37th
MICRO, 2004.
[11] C. F. et al., Addressing transient and permanent faults in NoC with eﬃ-
cient fault-tolerant deﬂection router, IEEE Transactions on Very Large
Scale Integration Systems (TVLSI) 21 (6) (2013) 1053–1066.
[12] P.-A. Tsai, Y.-H. Kuo, E.-J. Chang, H.-K. Hsin, A.-Y. Wu, Hybrid
path-diversity-aware adaptive routing with latency prediction model
in network-on-chip systems, in: VLSI Design, Automation, and Test
(VLSI-DAT), 2013 International Symposium on, 2013, pp. 1–4. doi:
10.1109/VLDI-DAT.2013.6533884.
[13] W. J. Dally, B. P. Towles, Principles and practices of interconnection
networks, Elsevier, 2004.
[14] Z. Qian, D.-C. Juan, P. Bogdan, C.-Y. Tsui, D. Marculescu, R. Marculescu,
A comprehensive and accurate latency model for network-on-chip
performance analysis, in: Design Automation Conference (ASP-DAC),
2014 19th Asia and South Pacific, 2014, pp. 323–328.
doi:10.1109/ASPDAC.2014.6742910.
[15] S. Foroutan, Y. Thonnart, R. Hersemeule, A. Jerraya, An analytical
method for evaluating network-on-chip performance, in: Proceedings of
the Conference on Design, Automation and Test in Europe, DATE ’10,
European Design and Automation Association, 3001 Leuven,
Belgium, 2010, pp. 1629–1632.
URL http://dl.acm.org/citation.cfm?id=1870926.1871319
[16] U. Ogras, P. Bogdan, R. Marculescu, An analytical approach for
network-on-chip performance analysis, Computer-Aided Design of
Integrated Circuits and Systems, IEEE Transactions on 29 (12) (2010)
2001–2013. doi:10.1109/TCAD.2010.2061613.
[17] C. D. Y. Boura, T. Jacob, A performance model for adaptive routing in
hypercubes, in: Proceedings of the International Workshop on Parallel
Processing, 1994, pp. 11–16.
[18] M. Ould-Khaoua, An analytical model of Duato’s adaptive routing
algorithm, IEEE Transactions on Computers 48 (12) (1999) 1–8.
[19] J. Duato, A new theory of deadlock-free adaptive routing in wormhole
routing systems, IEEE Transactions on Parallel and Distributed Systems
4 (12) (1994) 1320–1331.
[20] H. Sarbazi-Azad, M. Ould-Khaoua, L. Mackenzie, Performance analysis
of k-ary n-cubes with fully adaptive routing, in: Parallel and Distributed
Systems, 2000. Proceedings. Seventh International Conference on, 2000,
pp. 249–255. doi:10.1109/ICPADS.2000.857705.
[21] A. Khonsari, M. Ould-Khaoua, J. Ferguson, A general analytical model
of adaptive wormhole routing in k-ary n-cube interconnection networks,
SIMULATION SERIES 35 (2003) 547–554.
[22] G. M. et al, Performance modelling of adaptive routing in hypercubic
networks under non-uniform and batch arrival traﬃc, in: 32nd IEEE
Conference on Local Computer Networks, 2007, pp. 583–590.
[23] E. Kakoulli, V. Soteriou, T. Theocharides, Intelligent hotspot prediction
for network-on-chip-based multicore systems, Computer-Aided Design of
Integrated Circuits and Systems, IEEE Transactions on 31 (3) (2012)
418–431. doi:10.1109/TCAD.2011.2170568.
[24] J. H. Bahn, N. Bagherzadeh, A generic traﬃc model for on-chip in-
terconnection networks, in: NoCArc, First International Workshop on750
Network on Chip Architectures, 2008.
[25] M. W. et al., Data mining meets performance evaluation: Fast algo-
rithms for modeling bursty traﬃc, in: ICDE, 2002.
URL citeseer.ist.psu.edu/article/wang01data.html
[26] R. M. Girish Varatkar, On-chip traffic modeling and synthesis for
MPEG-2 video applications, IEEE Trans. on VLSI Syst 12 (1) (2004)
108–119.
[27] J. Duato, S. Yalamanchili, L. M. Ni, Interconnection networks: An en-
gineering approach, Morgan Kaufmann, 2003.
[28] N. Nikitin, J. de San Pedro, J. Cortadella, Architectural exploration of
large-scale hierarchical chip multiprocessors, Computer-Aided Design of
Integrated Circuits and Systems, IEEE Transactions on 32 (10) (2013)
1569–1582. doi:10.1109/TCAD.2013.2272539.
[29] A. Agarwal, Limits on interconnection network performance, IEEE
Transactions on Parallel and Distributed Systems 4 (6) (1991) 613–624.
[30] H. Liu, W. Lin, Y. Song, An eﬃcient processor partitioning and thread
mapping strategy for mesh-connected multiprocessor systems, in: Proc.
ACM symposium on Applied computing, 1997.
[31] M. G. et al., Optimal network architectures for minimizing average
distance in k-ary n-dimensional mesh networks, in: Proceedings of the
Networks on Chip Symposium (NoCS), Pittsburgh, Pennsylvania, USA,
2011.
[32] A. Y. W. et al., Scalability of network-on-chip communication architec-
ture for 3-d meshes, in: Proceedings of the International Symposium on
Networks-on-Chip, San Diego, CA, 2009.
[33] T. Moscibroda, O. Mutlu, A case for buﬀerless routing in on-chip net-
works, in: Proceedings of the 36th Annual International Symposium on
Computer Architecture, ISCA ’09, ACM, New York, NY, USA, 2009,
pp. 196–207. doi:10.1145/1555754.1555781.
URL http://doi.acm.org/10.1145/1555754.1555781
[34] C.-K. Hsu, K.-L. Tsai, J.-F. Jheng, S.-J. Ruan, C.-A. Shen, A low power
detection routing method for buﬀerless noc, in: Quality Electronic De-
sign (ISQED), 2013 14th International Symposium on, 2013, pp. 364–
367. doi:10.1109/ISQED.2013.6523636.
[35] N. Zhang, H. Gu, Y. Yang, D. Fan, Qbnoc: Qos-aware bufferless
noc architecture, Microelectronics Journal 45 (6) (2014) 751 – 758.
doi:http://dx.doi.org/10.1016/j.mejo.2014.04.015.
URL http://www.sciencedirect.com/science/article/pii/
S0026269214001050
[36] A. Y. W. et al., A scalable multi-dimensional NoC simulation model
for diverse spatio-temporal traffic pattern, in: Proceedings of the 3D
Systems Integration Conference (3DIC), San Francisco, California, USA,
2013.
[37] J. S. S. T. Association, et al., JEDEC standard: Wide I/O single data rate
specification (2011).
[38] P. Vivet, 3D integrated circuits: A memory-to-logic WideIO example, in:
Design Impacts and 3D CAD Design Perspectives, HIPEAC RAPIDO
Workshop, 2012.
Biography
Awet Yemane Weldezion is a researcher in Electronic
Systems Design at KTH - Royal Institute of Technology,
Stockholm, Sweden. He received a BSc (2000) in Electrical
and Computer Engineering from Addis Ababa University, an
MSc (2006) in SoC design from KTH, and an MBA (2012) in Innovation
and Growth from the University of Turku, Finland. Since 2008, he has been
pursuing Ph.D. studies in Electronic Systems Design at KTH in the area of
3D-NoC.
Matt Grange received his MEng and PhD degrees in Elec-
tronic Systems Design from Lancaster University in the UK
in 2007 and 2011 respectively. His PhD thesis focused on
high-speed digital circuit applications and physical modeling
for 3-D ICs. He currently works in the Calibre division of
Mentor Graphics in Wilsonville, Oregon. His main interests
are IC verification, thermal validation, RTL simulation and
synthesis, place and route, interconnect modeling, NoCs, and emerging tech-
nologies.
Axel Jantsch (M'97) received the Dipl.Ing. and Dr.Tech. degrees
from the Technical University of Vienna, Vienna, Austria,
in 1988 and 1992, respectively. He has been a Full Professor
of electronic system design with the Royal Institute of
Technology, Stockholm, Sweden, since December 2002. Currently
he is a chair professor at Vienna University of Technology
(TU Wien). His research interests include VLSI design
and synthesis, system-level specification, modeling and validation, HW/SW
co-design and co-synthesis, reconfigurable computing, and networks-on-chip.
Hannu Tenhunen received his MSc (’82) from Helsinki University
of Technology, Finland and PhD (’85) from Cornell
University, Ithaca, NY, USA. Since 1992, he has been a chair professor
at the Royal Institute of Technology. He was one
of the originators of the interconnect centric design, globally
asynchronous locally synchronous, and network-on-chip
(NoC) paradigms. He has supervised over 70 M.Sc. theses,
39 doctoral theses, and 8 post-docs, and has published over 700 reviewed
publications. During the last 20 years he has been actively involved in high
technology policies, technology impact studies, innovations and changing the
educational system.
Dinesh Pamunuwa (M'04–SM'09) received the B.Sc. degree
(with honors) in EE Eng’g from the University of Peradeniya,
Sri Lanka in 1997, and the Ph.D. degree in Electronic System
Design from the Royal Institute of Technology (KTH),
Stockholm, Sweden, in 2003. He was a Senior Lecturer (2004–2010)
at Lancaster University and has been a Reader in Microelectronics
at the University of Bristol since 2011. He has authored and
coauthored over 60 international peer-reviewed articles in areas ranging from
interconnect design and signal integrity issues, to methodologies and architectures
for electronic system design and networks-on-chip, to nanoelectronics
and nano-electro-mechanical (NEM) relay based circuit design.
