Performance Evaluation and Design Tradeoffs of On-Chip Interconnect Architectures by Bakhouya, Mohmed et al.
Performance Evaluation and Design Tradeoffs of
On-Chip Interconnect Architectures
Mohmed Bakhouya, Suboh Suboh, Jaafar Gaber, Tarek El-Ghazawi, Smail
Niar
To cite this version:
Mohmed Bakhouya, Suboh Suboh, Jaafar Gaber, Tarek El-Ghazawi, Smail Niar. Perfor-
mance Evaluation and Design Tradeoffs of On-Chip Interconnect Architectures. 2010. <hal-
00534521>
HAL Id: hal-00534521
https://hal.archives-ouvertes.fr/hal-00534521
Submitted on 9 Nov 2010
Accepted Manuscript
Performance Evaluation and Design Tradeoffs of On-Chip Interconnect Archi‐
tectures
M. Bakhouya, S. Suboh, J. Gaber, T. El-Ghazawi, S. Niar
PII: S1569-190X(10)00220-0
DOI: 10.1016/j.simpat.2010.10.008
Reference: SIMPAT 1023
To appear in: Simulation Modeling Practices and Theory
Received Date: 2 July 2010
Revised Date: 8 October 2010
Accepted Date: 14 October 2010
Please cite this article as: M. Bakhouya, S. Suboh, J. Gaber, T. El-Ghazawi, S. Niar, Performance Evaluation and
Design Tradeoffs of On-Chip Interconnect Architectures, Simulation Modeling Practices and Theory (2010), doi:
10.1016/j.simpat.2010.10.008
Performance Evaluation and Design Tradeoﬀs of
On-Chip Interconnect Architectures
M. Bakhouya1, S. Suboh2, J. Gaber1, T. El-Ghazawi2, S. Niar3
1Universite de Technologie de Belfort/Montbeliard
Rue Thierry Mieg, 90010 Belfort Cedex, France
{mohamed.bakhouya,gaber}@utbm.fr
2The George Washington University
Washington DC. 20052, USA
{suboh,tarek}@gwu.edu
3 Universite de Valenciennes et du Hainaut-Cambresis
59313 VALENCIENNES Cedex 9, France
smail.niar@univ-valenciennes.fr
Abstract
Network-on-Chip (NoC) has been proposed as an alternative to bus-based
schemes to achieve high performance and scalability in System-on-Chip (SoC)
design. Performance analysis and evaluation of on-chip interconnect archi-
tectures are widely based on simulations, which become computationally ex-
pensive, especially for large-scale NoCs. In this paper, a Network Calculus-
based methodology is presented to analyze and evaluate the performance and
cost metrics, such as latency and energy consumption. The 2D Mesh, Spidergon, and WK-recursive on-chip interconnect architectures are analyzed
using this methodology and results are compared with those produced using
simulations. The values obtained by simulations and by analysis show sim-
ilar trends in the same order of magnitude. Furthermore, WK outperforms
the other on-chip interconnects in all considered metrics.
Key words: Network-on-Chip, On-chip interconnect, Analytical modeling
and evaluation, Design Tradeoﬀs, Network calculus
1. Introduction
System-on-Chip (SoC) has recently emerged as a key technology behind
most embedded and smart miniaturized systems to provide high ﬂexibility
Preprint submitted to SMPT Journal October 21, 2010
  
and better performance. These systems must provide high-performance while
meeting system requirements, such as a low energy consumption and small
area. For example, future mobile communication terminals should support
many applications, which range from web browsing/navigation, to real-time
multimedia applications such as audio and video communication. Therefore,
the design of these systems should be highly ﬂexible, adaptable, and meet
stringent time-to-market constraints, while providing high-performance and
lower energy consumption.
A key element in the performance and energy consumption in SoCs is
the On-Chip Interconnect (OCI), which allows diﬀerent SoC components to
communicate eﬃciently. Network-on-chip has been proposed as an alterna-
tive to bus-based schemes to achieve high performance and scalability in SoC
design. Different OCI-based architectures using packet switching have been
recently studied and adapted for SoCs. Examples of these architectures are
Fat-Tree (FT) [1], 2D mesh [2], Ring [3], Butterfly-Fat Tree (BFT) [4], Torus [5], Spidergon [6], Octagon [7], and WK-Recursive [8,9]. However, their increasing complexity
makes their design extremely challenging. Furthermore, understanding and
studying the traffic that is generated between components and traverses the OCI is a
crucial task [10]. Therefore, it is useful to perform a traffic analysis in early
stages of the design process, such that the designer can select appropriate
parameters for the on-chip interconnect architecture. Indeed, the selection of
the on-chip interconnect architecture, based on traﬃc patterns that an appli-
cation speciﬁc SoC generates, allows designers to detect and locate network
contentions and bottlenecks.
The performance of NoC architectures is usually evaluated
using simulations [1,11,12,13,14,15]. However, simulation is generally extremely slow for
large systems and provides little insight into how different design parameters
affect the actual NoC performance [16]. Analytical models, in contrast, allow fast
evaluation of performance metrics in early stages of the design process. This
paper extends our earlier work on evaluating the performance (e.g., latency) of three on-chip interconnect architectures using Network Calculus [17].
We show how Network Calculus can be used to evaluate the performance
metrics, energy consumption and area requirements of on-chip interconnects
and their design tradeoﬀs. The main objective is to illustrate the eﬀective-
ness of this methodology in evaluating on-chip interconnect architectures. As
a case study, a detailed analysis and evaluation of three on-chip interconnect
architectures, the 2D mesh, WK-Recursive, and Spidergon, under diﬀerent
traﬃc loads is presented.
The rest of this paper is structured as follows. In section 2, we summarize
the existing work on performance analysis methods proposed for evaluating
on-chip interconnects. Section 3 provides a brief overview of Network Calcu-
lus concepts and features. In section 4, we present the on-chip interconnect
modeling methodology, and the results obtained using both simulations and
Network Calculus. Conclusions and future work are given in section 5.
2. Related Work
On-chip interconnect architectures adopted for SoCs are characterized
by trade-offs between latency, throughput, communication load, energy consumption, and silicon area requirements. Several works, such as the one presented
in [18], have demonstrated that there is a crucial need for system design tools
and methodologies to analytically evaluate and compare NoC architectures. The authors in [18] have pointed out that the current design tools and
methodologies are not suitable for NoC evaluation, and that simulation methods, despite their accuracy, are very expensive and time consuming. Therefore, techniques and tools are required to extract application communication
characteristics and to efficiently estimate the performance, energy
consumption, and area requirements of candidate communication
architectures.
Recently, there has been a great deal of interest in the development of
analytical performance models for NoC design. Approaches proposed in the
literature can be classiﬁed in four main categories: deterministic approaches,
probabilistic approaches, physics based approaches, and system theory based
approaches. In the ﬁrst category, approaches are mainly based on graph the-
ory used successfully in many software and computer engineering domains.
For example, in [19], a model using a cyclo-static dataflow graph was used for
buﬀer dimensioning for NoC applications. Deterministic approaches assume
that the designer has thorough understanding of the pattern of communica-
tion among cores and switches.
Most of the work to date using probabilistic approaches is based on
queuing theory. For example, an analytical model using queuing theory was
introduced in [20] to evaluate the traffic behavior in the Spidergon NoC. Simulation
results verifying the model for message latency under different traffic rates
and variable message lengths were reported. A queuing-theory-based
model for evaluating the average latency and energy consumption of on-chip
interconnects was proposed in [21]. The results from the analytical model were
validated against those obtained with a cycle-accurate simulator. Most
queuing approaches consider incoming and outgoing traffic as probability distributions (e.g., Poisson traffic) and allow designers to perform a statistical
analysis on the whole system in order to evaluate certain network metrics,
such as average buffer occupancy and average buffer delay in an equilibrium
state. However, NoC applications exhibit traffic patterns that are very different from the Poisson distribution used in queuing models [22,12]. More
precisely, the Poisson model fails to capture some important network characteristics such as self-similarity or long-range dependence [23].
In [24], the authors suggested statistical physics and information theory
for NoC design and evaluation. Unlike stochastic approaches that make
Markovian assumptions about the network behavior, statistical physics can
model the interactions among various components while considering long-term memory effects. A quantum-like approach was proposed in [24] to model
the information flow and buffer behavior in NoCs. The main concept in
this model is that packets in the network move from one node to another
in a manner that is similar to particles moving in a Bose gas and migrating
between various energy levels as a consequence of temperature variations.
The authors have focused on the buﬀer sizing issue, which is a major factor
that aﬀects the energy consumption and the silicon area requirements.
The fourth category uses system theory, which has been successfully applied to the design of
electronic circuits. Network Calculus features are derived from system
theory so that performance bounds (e.g., end-to-end delay) in networks such
as the Internet can be modeled and evaluated [25,26]. The attractive feature of
Network Calculus is its ability to capture all traffic patterns with the use of
bounds. In other words, based on the shapes of the traffic flows (by analogy, signals in system theory), designers are able to capture some dynamic features
of the network. For example, in [27], we presented a performance analysis
methodology using Network Calculus to analyze and evaluate performance
metrics of the 2D Mesh on-chip interconnect. Simulations were performed and the results
were compared with those from the Network Calculus-based methodology
in order to underline its usefulness for evaluating on-chip interconnects.
In this paper, the Network Calculus-based methodology is used to evaluate other performance metrics (e.g., load and throughput) as well as cost
metrics (e.g., energy consumption and area overhead). Three on-chip interconnects, namely the 2D mesh, WK-Recursive, and Spidergon, are evaluated
and compared under different traffic loads. Results show the effectiveness of
Network Calculus as a useful tool for NoC design and evaluation. It is worth
noting that we have selected the 2D mesh, WK-Recursive, and Spidergon because they outperform other on-chip interconnects, such as FT and Ring, in
all performance and cost metrics [9,28].
3. Network Calculus: an Overview
Network Calculus [25,26] is a modeling framework that allows designers to
specify a system as a mathematical model and evaluate main performance
bounds such as end-to-end delay. This theory is based on (min,+) algebra for deterministic network performance analysis, especially for worst-case
analysis [25]. Based on the shapes of the traffic flows, designers are able to capture
some dynamic features of the network. In this section, we brieﬂy introduce
Network Calculus, in particular service and arrival curves that represent
traﬃc patterns, as well as some performance bounds.
We consider that any system can be composed of one or several components that exchange traffic in order to accomplish a given task. The traffic
pattern of the system can be defined by arrival curves of incoming traffic flows
to each component of the system. Consider a data flow $f$ characterized
by an input function $R(t)$, which represents the cumulative number of data
units (e.g., packets, bits) of $f$ arriving at the component $C$ within the time
interval $[0, t]$. Let $R^*(t)$ be the output function (see Figure 1), which
represents the cumulative amount of data that leaves the component during
the time interval $[0, t]$, with $R(t) \geq R^*(t)$. Given the input and output functions, we can derive two quantities of interest, the backlog and
the virtual delay [25]. The backlog $x(t)$ is the amount of data
held inside the system, $x(t) = R(t) - R^*(t)$. The virtual delay $d(t)$ is the
delay that would be experienced by a data unit arriving at time $t$ if all units
received before it are served before it, $d(t) = \inf\{\tau \geq 0 : R(t) \leq R^*(t + \tau)\}$.
In order to calculate the delay and the backlog, the input and output functions have to be defined. Their definition is based on the (min,+) convolution
and deconvolution operations, defined as follows. Given two wide-sense
increasing functions $f$ and $g$ with $f(0) = g(0) = 0$, their convolution is defined as
$(f \otimes g)(t) = \inf_{0 \leq s \leq t}\{f(t - s) + g(s)\}$ and their deconvolution is defined as
$(f \oslash g)(t) = \sup_{s \geq 0}\{f(t + s) - g(s)\}$.
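As a rough illustration of these two operators, the following Python sketch evaluates them on curves sampled at integer time steps. The function names, the parameter values, and the finite sampling horizon are assumptions made for illustration only; in particular, the deconvolution is exact only at points where the supremum is reached inside the sampled window.

```python
def min_plus_conv(f, g):
    """(f ⊗ g)(t) = min over 0 <= s <= t of f(t - s) + g(s), on integer steps."""
    n = min(len(f), len(g))
    return [min(f[t - s] + g[s] for s in range(t + 1)) for t in range(n)]


def min_plus_deconv(f, g):
    """(f ⊘ g)(t) = max over s >= 0 of f(t + s) - g(s).

    The supremum is truncated to the sampled horizon (an assumption of this
    sketch), so values near the end of the window may be underestimated.
    """
    n = min(len(f), len(g))
    return [max(f[t + s] - g[s] for s in range(n - t)) for t in range(n)]


# Hypothetical parameters: leaky bucket (r, b) and rate-latency server (R, T).
r, b, R, T, horizon = 1, 4, 2, 3, 20
alpha = [r * t + b for t in range(horizon)]          # arrival curve samples
beta = [R * max(t - T, 0) for t in range(horizon)]   # service curve samples
alpha_out = min_plus_deconv(alpha, beta)             # output arrival curve
```

With these values, `alpha_out[t]` equals `r*t + b + r*T` for every `t` whose maximizing `s = T` still lies inside the window: the long-term rate is unchanged and the burst grows by `r*T`.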
Each input function can be characterized by an arrival curve as follows.
An arrival curve $\alpha(t)$ characterizes a traffic flow $R(t)$ iff it upper-bounds
the amount of data of this flow arriving during any time interval
of length $t$. More formally, given a wide-sense increasing function $\alpha(t)$ defined
for $t \geq 0$, we say that a flow $R(t)$ is constrained by $\alpha$ iff for all $s \leq t$:
$R(t) - R(s) \leq \alpha(t - s)$. It is also said that $R$ has $\alpha$ as an arrival curve,
or that $R$ is $\alpha$-smooth [25]. Using (min,+) convolution, $\alpha$ is an arrival
curve of an input function $R$ iff $R \leq R \otimes \alpha$. An example of an arrival
curve is the one enforced by a leaky bucket controller,
$\alpha(t) = rt + b$. It means that no more than $b$ data units can be sent at once
and no more than $r$ bit/s over the long term.
Figure 1: Arrival and service curves in Network Calculus with delay and backlog bounds
The output function $R^*(t)$ results from the transformation of the
input function $R(t)$ by the component $C$, which is described by its service curve $\beta(t)$. We say that $C$ offers to the flow $R$ a service curve $\beta$ (a non-decreasing function such that $\beta(0) = 0$) iff $\forall t \geq 0$, $R^*(t) \geq \inf_{0 \leq s \leq t}\{R(s) + \beta(t - s)\}$. Using (min,+) convolution, $\beta$ is a service
curve of flow $R$ iff $R^* \geq R \otimes \beta$. An example of a service curve is the rate-latency
function $\beta(t) = R(t - T)^+$, where the constant $R$ denotes a guaranteed service rate (not to be confused with the input function $R(t)$) and
$T$ is the maximum latency caused by the component [29]. The expression $(x)^+$
equals $x$ when $x > 0$ and 0 otherwise. Figure 1 shows a component with
input/output curves, service curve, delay and backlog.
Knowing the service curve $\beta(t)$ offered by a component $C$, the output
curve $\alpha^*(t)$ of $R^*(t)$ can be calculated as follows: $\alpha^*(t) = (\alpha \oslash \beta)(t)$. For
example, assuming that a flow is constrained by an arrival curve $\alpha(t) = rt + b$ and $C$ provides a guaranteed service curve $\beta(t) = R(t - T)^+$ to the
flow, the output bound is $\alpha^*(t) = \alpha(t) + rT = rt + b + rT$.
These curves, $\alpha(t)$ and $\alpha^*(t)$, act as bounds on the input and output traffic
flows respectively, and are used to compute the delay bound $D$ and the
backlog bound $B$ as follows. The delay for a data flow $R(t)$ constrained
by an arrival curve $\alpha(t)$ that receives the service $\beta(t)$ to produce a data flow
$R^*(t)$ constrained by the arrival curve $\alpha^*(t)$ is upper-bounded by $d(t) \leq \sup_{s \geq 0}(\inf\{\tau \geq 0 : \alpha(s) \leq \beta(s + \tau)\})$. The backlog $x(t)$ is upper-bounded by $x(t) \leq \sup_{s \geq 0}\{\alpha(s) - \beta(s)\}$, $\forall t$.
An example is illustrated in Figure 2 that shows the delay and the backlog
bounds of a component receiving a traﬃc ﬂow characterized by an arrival
curve α(t) = rt + b and providing a service curve β(t) = R(t − T )+, where
R ≥ r is the guaranteed bandwidth, and T is the maximum latency of the
service. Using these curves, the backlog B and delay bounds D can be
expressed as follows: B = b + rT and D = b/R + T .
Figure 2: Example of backlog bound B (a) and delay bound D (b)
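The closed-form bounds B = b + rT and D = b/R + T can be cross-checked numerically. The sketch below is only an illustration under stated assumptions: it uses integer time steps, hypothetical parameter values, and hypothetical function names, and it computes the vertical deviation (backlog) and horizontal deviation (delay) between the two curves by brute force.

```python
def alpha(t, r, b):
    """Leaky-bucket arrival curve alpha(t) = r*t + b."""
    return r * t + b


def beta(t, R, T):
    """Rate-latency service curve beta(t) = R * (t - T)^+."""
    return R * max(t - T, 0)


def backlog_bound(r, b, R, T, horizon):
    """Largest vertical distance alpha(s) - beta(s) over the sampled horizon."""
    return max(alpha(s, r, b) - beta(s, R, T) for s in range(horizon + 1))


def delay_bound(r, b, R, T, horizon):
    """Largest horizontal distance: smallest tau with alpha(s) <= beta(s + tau)."""
    if R <= 0:
        raise ValueError("the service rate R must be positive")
    worst = 0
    for s in range(horizon + 1):
        tau = 0
        while alpha(s, r, b) > beta(s + tau, R, T):
            tau += 1          # integer steps, so tau is rounded up
        worst = max(worst, tau)
    return worst
```

For example, with r = 1, b = 4, R = 2, and T = 3 (arbitrary units), the brute-force search agrees with the closed forms: a backlog of b + rT = 7 and a delay of b/R + T = 5.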
4. OCIs exploration
In this section, three on-chip interconnect architectures are selected for
analysis and evaluation; 16-node configurations are used. Figure 3 shows
these configurations with application data flows generated as a case study
(e.g., in 2D Mesh, f1 = (c8, s8, s12, c12)). As shown in this figure, there
are three important elements in a NoC: cores, routers (or switches), and bidirectional links. Each core can be either a source or a sink, in which flits
are constructed or consumed. Each ingress port in a switch has a buffer
for temporary storage of information. When a flit arrives at a switch, it
must go into the buffer, which is a drop-tail queue with a FIFO
queue management mechanism. The rest of this section presents the Network
Calculus-based model and how it is used to evaluate the performance and
cost metrics.
4.1. Network Calculus-based Model
In order to analyze and evaluate the performance of each OCI, we need to
build a model for the entire system. The NoC architecture can be viewed as
a distributed system composed of autonomous nodes that communicate by
exchanging messages through an on-chip interconnect [30]. The on-chip interconnect can be described as a graph $OCI(V, E)$ whose nodes $v \in V$ represent
switches or cores and whose edges $\ell \in E$ represent the communication links
between two neighboring nodes $u$ and $v$. For each node $v \in V$, $r_v$ is the
injection rate, and for each link $\ell \in E$, $R_\ell$ denotes the guaranteed service rate
or the link bandwidth. Similarly, an application can be represented by an
acyclic digraph, called Task Graph (TG), where each $v \in V$ represents a task
and each $\ell = (u, v) \in E$ is a communication flow edge having one attribute
$\alpha_\ell(t)$, the input arrival curve that represents the data flow sent by $u$ to $v$.
Figure 3: On-chip interconnects with data flows: (a) 2D Mesh, (b) Spidergon, (c) WK(4,2)-recursive. Legend: si = switch, ci = IP core, fi = data flow.
After a random mapping of the TG on the OCI, as illustrated in Figure 3, the cores (c6, c8, c11, c15) are selected to be traffic sources. Cores
(c1, c5, c12, c13), considered as sinks, are selected according to the following
communication locality principle: 25% of the traffic takes place between neighboring cores and 75% of the traffic is uniformly distributed among
the rest. In this traffic pattern, c8 is selected twice as a
traffic source and c12 is selected twice as a traffic sink. Data flows are
represented by sequences of hops from a source core ci to a destination core
cj. These data flows are computed using a deterministic routing protocol to
direct flits between switches.
Having these data flows, we can express the input and output arrival
curves, $\alpha_{s_i}(t)$, $\alpha_{c_i}(t)$, and $\alpha_\ell(t)$, of each switch $s_i$, core $c_i$, and link $\ell$ respectively. The maximum data flow sent to a switch $s_i$ is constrained by the
arrival curve $\alpha_i(t) = r_i t + b_i$, where $b_i$ is the maximum burst size of the
data flow and $r_i$ is its average rate. Under this arrival curve, a node can
send $b_i$ bits at once, but without exceeding $r_i$ bit/s over the long run. Each
switch also provides a guaranteed service described by the service curve
$\beta_i(t) = R_i(t - T_i)^+$, where $R_i$ denotes the guaranteed service rate and $T_i$ is
the maximum latency caused by the switch $s_i$. This service curve is called
the rate-latency service curve, in which data is delayed by a fixed time $T_i$
and then routed out at a rate $R_i$. These two curves are widely used in evaluating systems [31,32,33,34]. We use these curves to evaluate and compare the
considered OCIs.
After defining data flows and nodes participating in transmitting and/or
receiving data, the entire network can be described to obtain the performance model by merging all arrival and output flows. For example, Figure
3 (b) shows the 16-node configuration of the Spidergon on-chip interconnect. As shown in this figure, five data flows are selected as follows:
f1 = (c8, s8, s9, s10, s11, s12, c12),
f2 = (c8, s8, s7, s6, s5, c5),
f3 = (c6, s6, s5, s13, c13),
f4 = (c11, s11, s3, s2, s1, c1),
f5 = (c15, s15, s14, s13, s12, c12).
Based on these data flows, the input and output curves of each switch
are iteratively calculated. For example, $\alpha_{15}(t)$ and $\alpha^*_{15}(t)$ have
to be calculated first: $\alpha_{15}(t) = rt + b$ and $\alpha^*_{15}(t) = rt + b + rT$. The output bound of the switch s15 is the input to the switch s14, so
$\alpha_{14}(t) = rt + b + rT$ and $\alpha^*_{14}(t) = rt + b + 2rT$. In the second iteration,
the input and output curves of s8 are calculated as follows: $\alpha_8(t) = 2rt + 2b$ and
$\alpha^*_8(t) = 2rt + 2b + 2rT$. In the third iteration, the input and output curves
of s7 and s9 are calculated in the same manner
according to the data flows. The calculation is repeated for nodes s6, s5,
s13, s10, s11, s12, s3, s2, and s1, until we obtain the following equations:
\[
\begin{aligned}
\alpha_1(t) &= rt + b + \tfrac{9}{2} rT, & \alpha_9(t) &= rt + b + rT, \\
\alpha_2(t) &= rt + b + \tfrac{7}{2} rT, & \alpha_{10}(t) &= rt + b + 2rT, \\
\alpha_3(t) &= rt + b + \tfrac{5}{2} rT, & \alpha_{11}(t) &= 2rt + 2b + 3rT, \\
\alpha_5(t) &= 2rt + 2b + 4rT, & \alpha_{12}(t) &= 2rt + 2b + 6rT, \\
\alpha_6(t) &= 2rt + 2b + 2rT, & \alpha_{13}(t) &= 2rt + 2b + 5rT, \\
\alpha_7(t) &= rt + b + rT, & \alpha_{14}(t) &= rt + b + rT, \\
\alpha_8(t) &= 2rt + 2b, & \alpha_{15}(t) &= rt + b
\end{aligned}
\tag{1}
\]
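The iterative calculation can be sketched in Python as follows. This is an illustrative reconstruction rather than the authors' tool: it assumes that every flow enters the network with the curve rt + b, that a switch carrying n flows adds n·rT to the aggregate burst, and that the aggregate output burst is split equally among the flows leaving the switch. The last rule is inferred here because it reproduces Equation (1); the function and variable names are hypothetical. Each curve is stored as a pair (n, k), meaning α(t) = n·rt + n·b + k·rT.

```python
from fractions import Fraction

# Switch sequences of the five Spidergon flows (cores omitted).
FLOWS = {
    "f1": [8, 9, 10, 11, 12],
    "f2": [8, 7, 6, 5],
    "f3": [6, 5, 13],
    "f4": [11, 3, 2, 1],
    "f5": [15, 14, 13, 12],
}


def propagate(flows):
    """Return ({switch: (n, k)}, {flow: k at the sink}) for the given flows."""
    k = {f: Fraction(0) for f in flows}      # rT coefficient carried by each flow
    pos = {f: 0 for f in flows}              # index of the next switch on each path
    users = {}                               # switch -> flows traversing it
    for f, path in flows.items():
        for s in path:
            users.setdefault(s, []).append(f)
    curves = {}
    while len(curves) < len(users):
        progressed = False
        for s, fs in users.items():
            if s in curves:
                continue
            # A switch is ready once all of its flows have reached it.
            if all(pos[f] < len(flows[f]) and flows[f][pos[f]] == s for f in fs):
                n = len(fs)
                k_in = sum(k[f] for f in fs)
                curves[s] = (n, k_in)        # input curve of switch s
                for f in fs:                 # output burst grows by n*rT, shared equally
                    k[f] = (k_in + n) / n
                    pos[f] += 1
                progressed = True
        if not progressed:
            raise ValueError("flows have a cyclic dependency")
    return curves, k


curves, per_flow = propagate(FLOWS)
```

Under these assumptions, `curves[12]` is `(2, 6)`, i.e. 2rt + 2b + 6rT, and `per_flow["f4"]` is 11/2, which matches the curve rt + b + (11/2)rT reported for core c1 in the throughput discussion.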
In the same manner, the arrival curve $\alpha_{c_i}(t)$ of each core $c_i$ and the
arrival curve $\alpha_\ell(t)$ of each link $\ell$ can be calculated. One of the main advantages of using Network Calculus is that the designer can model the data
flows of an application and their interactions (i.e., flows depend on
each other), which is necessary for NoC design and evaluation.
SoC applications generally have broad computation and/or communica-
tions requirements. Understanding application communication patterns is
critical for eﬃcient use of SoC resources within a given set of constraints
such as area, power and performance. In the rest of this section, we will
show how to evaluate the performance, the energy consumption, and the
area requirements based on the OCI model describing the arrival curves of
each switch, core, and link. Analytical and simulation results are compared
using the same traﬃc pattern to conﬁrm the usefulness of Network Calculus
for NoC design and evaluation. Simulations are conducted using the simulator
developed in [14].
In the simulation, we consider that an application is represented as communicating parallel processes. Each process is linked with a traffic generator
that injects flits according to the CBR (Constant Bit Rate) model at a deterministic rate r, which is varied between 25 Mbps and 100 Mbps. It is worth
noting that, in this evaluation, we have used Network Calculus theory, which
is mainly intended for studying lossless systems, i.e., with the assumption that no
flits are ever lost: once a flit is injected into the NoC, it will eventually reach
its destination. The injection rate is capped at 100 Mbps because, above this rate, the network becomes congested, routers start dropping flits, and a large number of flits are
lost. The maximum service rate R is fixed to 200 Mbps
in this simulation and is the same for each switch. In NoCs, the maximum service rate is expected to be on the order of gigabits per second. However, because
of the limitations of real conditions, because an event-driven simulator rather than a cycle-accurate simulator is used (an event can represent many cycles), and because of processing power limitations, the maximum service
rate in our simulations is limited to 200 Mbps.
In the analytical evaluation, the arrival curve we have used for each node i
is that of a leaky bucket controller, which enforces the arrival curve constraint $\alpha(t) = rt + b$. Under this arrival curve, a node i can send b bits at once, but without
exceeding r bit/s over the long run. One application of arrival curves
is the Generic Cell Rate Algorithm (GCRA), which has two parameters: the target
inter-arrival time of packets $T$, and the tolerance $\tau$ that quantifies how early
packets may arrive with respect to the ideal spacing $T$ [25]. A CBR connection
is defined by one GCRA with parameters $(T, \tau)$, in which $b = S_f(\frac{\tau}{T} + 1)$ and
$r = \frac{S_f}{T}$, where $S_f$ is the flit size. In the simulator we have used, the CBR model
is implemented with $\tau$ equal to 0; therefore, $b$ is always equal to the fixed flit
size.
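As a small worked example of this mapping, the following sketch converts a CBR source into leaky-bucket parameters. The function name and its arguments are hypothetical, and deriving T from the target rate (T = S_f / r) is an assumption that simply inverts the formula for r; only the two formulas for b and r come from the text.

```python
def cbr_leaky_bucket(flit_size_bits, rate_bps, tau=0.0):
    """Return (r, b) for a CBR source described by GCRA(T, tau)."""
    if rate_bps <= 0:
        raise ValueError("rate must be positive")
    T = flit_size_bits / rate_bps          # ideal spacing between flits (s)
    r = flit_size_bits / T                 # long-term rate, r = S_f / T
    b = flit_size_bits * (tau / T + 1)     # burst, b = S_f * (tau / T + 1)
    return r, b


# 64-bit flits injected at 100 Mbps with zero tolerance, as in the simulator.
r, b = cbr_leaky_bucket(flit_size_bits=64, rate_bps=100e6)
```

With a tolerance of zero the burst reduces to one flit (64 bits); a tolerance equal to one spacing T would double it.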
The flit is the elementary unit of information exchanged in the communication network in a unit of time (e.g., clock cycle), whereas a packet is an
element of information that an IP core sends to another core and consists
of a variable number of flits. The size of flits can be 8, 16, 32 or 64 bits;
in our evaluation, we keep the flit size at 8 bytes (64 bits). The flit size has an influence on
the performance and cost metrics but not on the comparison results between
on-chip interconnects. It is worth noting that the packet length, the number
and size of flits, and the buffer size are all parameterized during the design
space exploration. More precisely, after comparing different on-chip interconnects, the designer can customize the suitable one by selecting appropriate
parameters, such as the maximum service rate, the buffer size for each input
port, and the flit size, given a specific application.
In this evaluation study, we have considered latency, throughput, and
communication load, which are the most important performance metrics used
in evaluating on-chip interconnects [1,3,11,13]. Another performance metric,
the loss rate, is not considered in this study because we are analyzing lossless
NoCs. In addition to these performance metrics, two cost metrics, energy
consumption and area requirements, are considered.
4.2. Performance Metrics
In this section, performance metrics, mainly the latency, throughput, and
communication load, will be evaluated using the input and output arrival
curves $\alpha_{c_i}(t)$, $\alpha_\ell(t)$, and $\alpha_{s_i}(t)$.
4.2.1. Latency
Latency is defined as the time that elapses between the start of the injection of
a flit into the network at the source core and its arrival at the destination
core. For a flit to reach the destination core (e.g., a processing element), it
must travel through a path consisting of a set of links and switches. Using
Network Calculus, the latency $L_{s_i}$ in each switch $s_i$ constrained by an arrival
curve $r_i t + b_i$ can be calculated as follows [25]:
\[
L_{s_i} = \frac{b_i}{R_i} + T_i \tag{2}
\]
where $R_i$ is the service bandwidth and $T_i$ is the maximum latency of the
service at a switch $s_i$. Therefore, the average latency can be calculated based
on equation 2. For example, as shown in the previous section (see eq. 1), in
Spidergon, since $\alpha_7(t) = rt + b + rT$, the burst seen by s7 is $b + rT$ and $D_7 = (\frac{r}{R} + 1)T + \frac{b}{R}$. If the injection
rate is r = 100 Mbps, R = 200 Mbps, b = 64 bits, and the flit size is $S_f$ = 8
bytes, then $D_7 = 0.8\,\mu s$, where $T = S_f/R = 0.32\,\mu s$. After computing the delay bound
of each switch, the total delay, called the end-to-end delay bound, $D_{f_i}$ of each
data flow $f_i$ (from the source to the sink) can be calculated by summing up
the delays of the participating switches; it bounds the latency defined above. For example, since $D_{f_3} = D_5 + D_6 + D_{13}$,
if r = 75 Mbps, then $D_{f_3} = 4.2\,\mu s$. The calculation continues in the same
manner with $D_{f_1}$, $D_{f_2}$, $D_{f_4}$, and $D_{f_5}$ to find the average end-to-end delay.
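The two numerical examples can be reproduced with a few lines of Python. The sketch below hard-codes the Spidergon curves of Equation (1) as pairs (n, k), meaning α(t) = n·rt + n·b + k·rT, and applies Equation (2) with the whole constant term of the curve as the burst. The names and the data layout are hypothetical; the parameter values are those quoted in the text.

```python
R_BPS = 200e6      # guaranteed service rate R
FLIT_BITS = 64     # flit size S_f (8 bytes)
BURST_BITS = 64    # burst b

# Equation (1) as (n, k): alpha_i(t) = n*r*t + n*b + k*r*T
EQ1 = {
    1: (1, 4.5), 2: (1, 3.5), 3: (1, 2.5), 5: (2, 4), 6: (2, 2),
    7: (1, 1), 8: (2, 0), 9: (1, 1), 10: (1, 2), 11: (2, 3),
    12: (2, 6), 13: (2, 5), 14: (1, 1), 15: (1, 0),
}

# Switches traversed by each Spidergon flow.
FLOW_SWITCHES = {
    "f1": [8, 9, 10, 11, 12],
    "f2": [8, 7, 6, 5],
    "f3": [6, 5, 13],
    "f4": [11, 3, 2, 1],
    "f5": [15, 14, 13, 12],
}


def switch_delay(switch, r_bps):
    """Delay bound of one switch in seconds: burst / R + T, with T = S_f / R."""
    T = FLIT_BITS / R_BPS
    n, k = EQ1[switch]
    burst = n * BURST_BITS + k * r_bps * T
    return burst / R_BPS + T


def flow_delay(flow, r_bps):
    """End-to-end delay bound of a flow: sum over its switches."""
    return sum(switch_delay(s, r_bps) for s in FLOW_SWITCHES[flow])


def average_delay(r_bps):
    """Average end-to-end delay bound over the five flows."""
    return sum(flow_delay(f, r_bps) for f in FLOW_SWITCHES) / len(FLOW_SWITCHES)
```

Here `switch_delay(7, 100e6)` evaluates to 0.8 µs and `flow_delay("f3", 75e6)` to 4.2 µs (up to floating-point rounding), in agreement with the values above.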
Figure 4 compares the average latency of the three on-chip interconnect
architectures under different injection rates using Network Calculus (analysis)
and simulation. As shown in this figure, when the injection rate increases,
the network becomes more congested with heavy traffic; queues
fill up, causing more flits to wait and therefore increasing the latency.
We can also see that the latency obtained using Network Calculus analysis
(i.e., a worst-case analysis) is in the same order of magnitude as the latency
obtained using simulations, with a deviation of less than 14%
on average. Furthermore, regardless of the injection rate used and in both
simulation and analysis results, the Spidergon has a higher average latency
than the Mesh and WK because of the higher average number of hops that flits
traverse. We can also see that WK is less sensitive to increases in the injection rate
and has a lower average latency.
Figure 4: The average latency (µs) as a function of the injection rate (Mbps) for WK, Mesh, and Spidergon; panels: Analysis and Simulation.
4.2.2. Network Load
Communication load is the value of the arrival rate relative to the departure
rate on all links. Let $D_r(t)$ be the maximum number of flits that
can possibly, under ideal circumstances, be transmitted over all links at time
$t$, and $A_r(t)$ the actual number of flits that have arrived over all links at
time $t$ [14]. The communication load $L(t)$ can be defined as the ratio of
the arrival rate $A_r(t)$ to the departure rate $D_r(t)$ as follows:
\[
L(t) = \frac{A_r(t)}{D_r(t)} = \frac{\sum_{i=1}^{N} \alpha_{\ell_i}(t)}{N R t} \tag{3}
\]
where $\alpha_{\ell_i}(t)$ is the number of flits that have arrived at link $\ell_i$, $R$ is the bandwidth
of each link, and $N$ is the number of unidirectional links involved in
transporting flits. We consider that all links have the same bandwidth, $R$.
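Equation (3) is straightforward to evaluate once the link arrival curves are known. The following sketch assumes that each link curve is a leaky bucket given as a (rate, burst) pair and that curves and bandwidth use the same data unit; the function name and the two link curves are hypothetical and are not taken from the case study.

```python
def communication_load(link_curves, R, t):
    """Equation (3): sum of alpha_i(t) over the N links, divided by N * R * t."""
    if t <= 0 or R <= 0 or not link_curves:
        raise ValueError("t and R must be positive and at least one link is needed")
    arrived = sum(rate * t + burst for rate, burst in link_curves)
    return arrived / (len(link_curves) * R * t)


# Two hypothetical links with (rate in bit/s, burst in bits).
links = [(50e6, 64.0), (100e6, 64.0)]
load_1s = communication_load(links, R=200e6, t=1.0)
```

For these values the load after one second is about 0.375, essentially the mean link rate divided by R; over very short intervals the burst terms dominate and the ratio is larger.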
The results depicted in Figure 5 show the variation of communication
load under different traffic rates for the three OCIs. The communication load
obtained using Network Calculus analysis is in the same order of magnitude
as the load obtained using simulations, with a deviation of less than 28%.
Regardless of the injection rate used, in both simulation and
analysis results, the Spidergon has a higher communication load than
the Mesh and WK. Moreover, WK is less sensitive to increases in the injection rate
and has a slightly lower load.
[Plot omitted. Panels: Analysis, Simulation; x-axis: Injection rate (Mbps); y-axis: Load; series: WK, Mesh, Spidergon]
Figure 5: The communication load
4.2.3. Throughput
The throughput of each core ci is the number of bits that arrive at that
core per second (bps). The aggregate throughput T(t) is the sum of the
throughputs of the destination cores ci during the interval [0, t]. It can be
calculated as follows:
\[ T(t) = \sum_{i=1}^{N_d} \alpha_{c_i}(t) \tag{4} \]
where Nd is the number of cores selected as destinations (i.e., sinks), and
αci(t) is the arrival curve that gives the accumulated number of bits that
have arrived at the destination core ci until time t.
In the example depicted in Figure 3, cores (c1, c5, c12, c13) are selected to
be sinks. Using the OCI model of the Spidergon, the arrival curve αci(t) of
each core ci can be calculated; for example, αc1(t) = rt + b + (11/2)rT and
αc5(t) = rt + b + 3rT. Figure 6 shows the variation of aggregate throughput
under different injection rates for the three OCIs. The throughput increases
linearly with the injection rate because more flits are generated.
Furthermore, the throughput obtained using analysis is similar for all OCIs
and is of the same order of magnitude as the throughput obtained using
simulations, with a deviation of less than 5%.
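The sum in Eq. (4) can be sketched in Python as follows. Only the two Spidergon arrival curves quoted above (for c1 and c5) are used, so the result is a partial sum, not the full aggregate over all four sinks. The helper names are hypothetical, the values of r, b, R, and the flit size are borrowed from the buffer-size example in Section 4.3.2, and the observation time of 1 ms is an arbitrary assumption.

```python
def affine_arrival_curve(rate_bps, burst_bits, delay_factor, latency_s):
    """Return alpha(t) = r*t + b + k*r*T, the form of alpha_c1 and alpha_c5."""
    def alpha(t_s):
        return rate_bps * t_s + burst_bits + delay_factor * rate_bps * latency_s
    return alpha


def aggregate_throughput(sink_curves, t_s):
    """Eq. (4): sum of the sink arrival curves evaluated at time t."""
    return sum(alpha(t_s) for alpha in sink_curves)


r = 75e6          # injection rate in bits per second (assumed, from Section 4.3.2)
b = 64.0          # burst size in bits (assumed, from Section 4.3.2)
T = 64 / 200e6    # T = Sf / R with Sf = 64 bits and R = 200 Mbps

alpha_c1 = affine_arrival_curve(r, b, 11 / 2, T)  # rt + b + (11/2) rT
alpha_c5 = affine_arrival_curve(r, b, 3, T)       # rt + b + 3 rT

# Accumulated bits at sinks c1 and c5 after a hypothetical 1 ms.
partial_bits = aggregate_throughput([alpha_c1, alpha_c5], t_s=1e-3)
```

Because both curves grow as rt, the sum grows linearly with the injection rate r, which matches the linear trend described for Figure 6.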
4.3. Cost Metrics
This section presents the analytical evaluation of cost metrics, mainly the
average energy consumption and area overhead. Analytical results are also
compared to those obtained using simulations.
[Plot omitted. Panels: Analysis, Simulation; x-axis: Injection rate (Mbps); y-axis: Throughput (Mbps); series: WK, Mesh, Spidergon]
Figure 6: The aggregated throughput
4.3.1. Energy
The total energy can be decomposed into the energy consumed in the
switches (traversal of input and output switches) and the energy consumed on
the wires or links between cores and switches. The total energy E(t) can be
calculated as follows:
\[ E(t) = \sum_{i=1}^{N} \alpha_i(t) E_i + \sum_{j=1}^{N_s} \alpha_{s_j}(t) E_{s_j} \tag{5} \]
where αi(t) and αsj (t) are the number of bits arrived until time t to the link i
and sj respectively. N and Ns are the number of links and switches involved
in transporting the application ﬂows. Therefore, the ﬁrst term represents
the energy consumed, at time t, on all links involved, and the second term
represents the energy consumed inside the switches28. Ei is the energy
consumed during transporting one bit on a link i, and Esj is the energy
consumed during buﬀering and routing operations of one bit inside each
switch sj .
The values of Ei and Esj depend mainly on the switch architecture and
the link characteristic such as the width, the length, etc. In this evaluation,
we use the values already estimated in the energy model proposed in35 in
which the average amount of energy required for a single bit to pass a switch
is equal to 0.9776pJ/bit and the average amount of energy required for a
single bit to cross a link  is (0.39 + 0.12L) pJ/bit, where L is the length
of the link . To calculate L, we consider that the link between each core
and its corresponding switch is of length 1mm. We consider that all links
(horizontal or vertical) between neighboring switches are of length 2mm. For
example, as shown in Figure 3, WK(4,2) has 16 links of length 1mm, 20 links
of length 2mm, and 10 links of length 4mm. However, only 5 links of length
1mm, 5 of length 2mm, and 5 of length 4mm are involved in transporting
15
  
ﬂits.
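To make Eq. (5) concrete, the following Python sketch evaluates it with the per-bit energy values quoted above from [35]. The function names and the example traffic (1000 bits crossing one 1 mm link, one 2 mm link, and two switches) are hypothetical and serve only to show how the two sums are assembled; link lengths are assumed to be in mm.

```python
E_SWITCH_PJ_PER_BIT = 0.9776  # energy for one bit to pass a switch, from [35]


def link_energy_pj_per_bit(length_mm):
    """Per-bit link energy (0.39 + 0.12 L) pJ/bit, with L assumed to be in mm."""
    return 0.39 + 0.12 * length_mm


def total_energy_pj(link_traffic, switch_traffic):
    """Eq. (5): energy on the links plus energy inside the switches, in pJ.

    link_traffic   -- iterable of (bits, length_mm) pairs, one per link
    switch_traffic -- iterable of bit counts, one per switch
    """
    e_links = sum(bits * link_energy_pj_per_bit(length_mm)
                  for bits, length_mm in link_traffic)
    e_switches = sum(bits * E_SWITCH_PJ_PER_BIT for bits in switch_traffic)
    return e_links + e_switches


# Hypothetical flow: 1000 bits cross a 1 mm core-to-switch link, a 2 mm
# switch-to-switch link, and two switches.
example_energy_pj = total_energy_pj([(1000, 1.0), (1000, 2.0)], [1000, 1000])
```

With these values, a 1 mm, 2 mm, and 4 mm link cost 0.51, 0.63, and 0.87 pJ/bit respectively, so longer links and additional hops both raise the total.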
Figure 7 shows the energy consumption obtained using analytical evaluation
and simulations. The energy consumption increases linearly with the
injection rate, which can be explained by the larger number of flits
generated. Furthermore, regardless of the injection rate, in both simulation
and analysis results, the Spidergon has a higher average energy consumption
than the Mesh and WK, which can be explained by the higher number of hops
traversed by flits. The energy obtained using analysis is also close to the
energy obtained using simulations: the difference between simulation and
analysis is about 1%.
[Plot omitted. Panels: Analysis, Simulation; x-axis: Injection rate (Mbps); y-axis: Energy (mJ); series: WK, Mesh, Spidergon]
Figure 7: The average energy consumption
4.3.2. Area
In NoC design, three sources of area overhead can be identified: switches,
cores, and links. Switches have two main components: the buffers that
temporarily store flits and the logic that implements the routing algorithm.
The area overhead of links depends on their lengths inside the chip [36]. The
total area can then be calculated as follows:
\[ A = \sum_{i=1}^{N_s} A_s(i) + \sum_{j=1}^{N_c} A_c(j) + \sum_{k=1}^{N} A(k) \tag{6} \]
where Ns is the number of switches, Nc is the number of IP cores, N is the
number of bidirectional links, and As(i), Ac(j), and A(k) are the area
requirements of switch i, core j, and link k, respectively. The average
on-chip interconnect area Av is determined by the average link area A, the
average switch area As, and the average IP core area Ac. We consider
averages because the resources (e.g., DSP, FPGA, memory) are heterogeneous,
the links have different lengths, and the size of each switch depends on its
placement in the on-chip interconnect (e.g., its degree). We use the
architecture layouts presented in [28] to determine these values, in
particular As and A. The average area Av can then be derived from Eq. (6) as
follows:
\[ A_v = N_s (R_s + a_s d_g S_f B_s) + N_c A_c + a N L \tag{7} \]
where Bs is the average buffer size, as is the area required for one byte, Sf
is the flit size in bytes, a and L are the average width and the average
length of each link, Rs is the remaining switch silicon area, such as the
routing table and the logic that implements the routing algorithm, and dg is
the average degree of the on-chip interconnect, which represents the average
number of buffers inside a switch.
Previous works, for example [36,37], have shown that buffers account for a
dominant part of the NoC area. To calculate the average buffer size Bs, we
first calculate the buffer size Bsi of each switch si as follows. As
described above, each switch si is constrained by an arrival curve of the
form αsi(t) = ri t + bi and provides a guaranteed service curve
βi(t) = Ri (t − Ti)^+ to each flow. Therefore, Bsi can be calculated as
follows [25]:
\[ B_{s_i} = b_i + r_i T_i \tag{8} \]
where ri is the core injection rate and Ti is the maximum latency of the
service at switch si. For example, in the Spidergon, since
αs1(t) = rt + b + (9/2)rT, the burst term is b1 = b + (9/2)rT and Eq. (8)
gives Bs1 = b + (11/2)rT, where T = Sf/R. With an injection rate
r = 75 Mbps, R = 200 Mbps, b = 64 bits, and a flit size Sf = 8 bytes, we
obtain T = 0.32 µs, rT = 24 bits, and Bs1 = 196 bits = 24.5 bytes (about 3
flits).
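The arithmetic of this example can be checked with a short Python sketch of Eq. (8). The function and variable names are hypothetical; the numerical values are those quoted in the example above.

```python
def buffer_size_bits(burst_bits, rate_bps, latency_s):
    """Eq. (8): B_si = b_i + r_i * T_i, in bits."""
    return burst_bits + rate_bps * latency_s


r = 75e6            # injection rate, bits per second
R = 200e6           # service rate, bits per second
b = 64.0            # source burst, bits
flit_bits = 8 * 8   # Sf = 8 bytes

T = flit_bits / R                # T = Sf / R = 0.32 microseconds
burst_at_s1 = b + 4.5 * r * T    # burst term of alpha_s1(t) = rt + b + (9/2) rT

bs1_bits = buffer_size_bits(burst_at_s1, r, T)   # b + (11/2) rT
bs1_bytes = bs1_bits / 8
bs1_flits = bs1_bits / flit_bits                 # slightly above 3 flits
```

The result is 196 bits, i.e., 24.5 bytes, in agreement with the value stated in the text.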
Figure 8 shows the area requirements (in mm²) for zero flit drops (i.e., a
lossless system) under different injection rates. In this evaluation, the
area required to store the routing table and other related logic is
considered constant, Rs = 1 mm², and as = 0.005 mm², a = 0.02 mm, and
Ac = 2 mm². We also consider a chip size of 20 mm × 20 mm. The value of L is
calculated based on the architecture layouts [28]. As shown in Figure 8,
when the injection rate increases, the area requirement increases because
the network becomes more congested with heavy traffic, so more buffer space
is needed to absorb differences in speed and burstiness between the IP cores
and to prevent flits from being dropped. The WK and the Spidergon require
more area than the Mesh because of additional links and larger buffers,
respectively. Furthermore, the area obtained using analytical evaluation is
close to the area obtained using simulations: the difference between
simulation and analysis is about 1.5%.
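Eq. (7) can be evaluated with a few lines of Python, as sketched below. The constants Rs, as, a, and Ac are the ones given above; the counts Ns, Nc, and N, the degree dg, the buffer size Bs (assumed here to be expressed in flits), and the link length L are hypothetical placeholders, so the printed number is only illustrative and is not expected to reproduce Figure 8.

```python
def average_area_mm2(n_switches, n_cores, n_links,
                     r_s, a_s, d_g, s_f, b_s, a_c, a, link_len):
    """Eq. (7): Av = Ns (Rs + as dg Sf Bs) + Nc Ac + a N L, in mm^2."""
    switch_area = n_switches * (r_s + a_s * d_g * s_f * b_s)
    core_area = n_cores * a_c
    link_area = a * n_links * link_len
    return switch_area + core_area + link_area


example_area = average_area_mm2(
    n_switches=16,   # hypothetical
    n_cores=16,      # hypothetical
    n_links=24,      # hypothetical
    r_s=1.0,         # mm^2, routing table and logic (from the text)
    a_s=0.005,       # mm^2 per byte (from the text)
    d_g=4,           # hypothetical average degree
    s_f=8,           # flit size in bytes
    b_s=3,           # hypothetical average buffer size, assumed in flits
    a_c=2.0,         # mm^2 per IP core (from the text)
    a=0.02,          # mm, average link width (from the text)
    link_len=2.0,    # mm, hypothetical average link length
)
```

In this form it is easy to see that only the buffer term as dg Sf Bs depends on the injection rate (through Bs), which is why the area grows as traffic increases.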
[Plot omitted. Panels: Analysis, Simulation; x-axis: Injection rate (Mbps); y-axis: Area; series: WK, Mesh, Spidergon]
Figure 8: The average silicon area
5. Conclusions and Future Work
In this paper, a Network Calculus-based methodology is presented to
evaluate on-chip interconnects in terms of performance (i.e., latency, com-
munication load, throughput) and cost metrics (i.e., energy consumption and
area requirements) based on a given traﬃc pattern. The main objective is
to illustrate the practical use of the Network Calculus approach to analyti-
cally evaluating on-chip interconnects. The 2D regular Mesh, Spidergon, and
WK on-chip interconnect architectures are compared and evaluated using a
given traﬃc pattern. The results show that this approach can provide the
designer with initial insight into on-chip interconnects and the relationship
between application traﬃc and performance. The results show that WK-
Recursive outperforms the 2D Mesh and Spidergon on-chip interconnects in
all considered metrics.
Future work concerns the development of a design space exploration software
tool that will be built around Network Calculus and integrated with a
simulation and experimental environment. This tool will allow designers to
rapidly explore design options over a wide range of energy budgets and
performance requirements. Its utility will be demonstrated via several
prototypes created using reconfigurable platforms based on FPGA technology,
on which actual performance can be measured. Combining application
characterization, performance simulation and analysis, and implementation in
one software tool will fill the gap between pure simulation, which may be
too slow, and analytic methods, which are not accurate enough to be used in
a design space exploration of SoCs.
References
[1] P. P. Pande, C. Grecu, M. Jones, A. Ivanov, R. Saleh, Performance
evaluation and design tradeoﬀs for network-on-chip interconnect archi-
tectures, IEEE Trans. on Computer 54 (8) (2005) 1025-1040.
[2] S. Kumar, A. Jantsch, J.-P. Soininen, M. Forsell, M. Millberg, J. Öberg,
K. Tiensyrjä, A. Hemani, A network on chip architecture and design
methodology, Proc. Int'l Symp. VLSI (ISVLSI) (2002) 117-124.
[3] L. Bononi, N. Concer, Simulation and analysis of network on chip archi-
tectures: Ring, spidergon, and 2d mesh, DATE Proc. (2006) 6 pages.
[4] P. Guerrier, A. Greiner, A generic architecture for on-chip packet-
switched interconnections, DATE Proc. (2000) 250-256.
[5] W. J. Dally, B. Towles, Route packets, not wires: On chip interconnec-
tion networks, DAC Proc. (2001) 683-689.
[6] M. Coppola, R. Locatelli, G. Maruccia, L. Pieralisi, A. Scandurra, Spi-
dergon: a novel on-chip communication network, Proc. International
Symposium on System-on-Chip (2004) 250-256.
[7] F. Karim, A. Nguyen, S. Dey, An interconnection architecture for net-
working systems on chip, IEEE Microprocessors 22 (5) (2002) 36-45.
[8] S. Suboh, M. Bakhouya, T. El-Ghazawi, Simulation and evaluation
of on-chip interconnect architectures: 2d Mesh, Spidergon, and WK-
recursive networks, NoCS Proc. (2008) 205-206.
[9] S. Suboh, M. Bakhouya, J. Gaber, T. El-Ghazawi, An interconnection
architecture for network-on-chip systems, Telecom. Systems 37 (1-3)
(2008) 137-144.
[10] K. Lahiri, A. Raghunathan, S. Dey, System-level performance analy-
sis for designing on-chip communication architectures, IEEE Trans. on
CAD of ICs and Systems 20 (6) (2001) 768-783.
[11] S. Suboh, M. Bakhouya, S. Lopez-Buedo, T. El-Ghazawi, Simulation-
based approach for evaluating on-chip interconnect architectures, SPL
Proc. (2008) 75-80.
[12] G. Varatkar, R. Marculescu, Traffic analysis for on-chip networks design
of multimedia applications, DAC Proc. (2002) 510-517.
[13] A. Hegedus, G. M. Maggio, L. Kocarev, A ns-2 simulator utilizing
chaotic maps for network-on-chip traﬃc analysis, Proc. of IEEE Inter-
national Symposium on Circuits and Systems (2005) 3375-3378.
[14] Y. R. Sun, S. Kumar, A. Jantsch, Simulation and evaluation of a network
on chip architecture using ns2, IEEE NorChip Proc.
[15] J. Xu, W. Wolf, J. Henkel, S. Chakradhar, A design methodology for
application-speciﬁc networks-on-chip, ACM Transactions on Embedded
Computing Systems 5 (2) (2006) 263-280.
[16] U. Y. Ogras, R. Marculescu, Analytical router modeling for networks-
on-chip performance analysis, DATE Proc. (2007) 1-6.
[17] M. Bakhouya, S. Suboh, J. Gaber, T. El-Ghazawi, Analytical perfor-
mance comparison of 2d mesh, wk-recursive, and spidergon nocs, The
19th IPDPS Conference, PMEO-UCNS workshop.
[18] K. Lahiri, S. Dey, A. Raghunathan, Evaluation of the traﬃc perfor-
mance characteristics of system-on-chip communication architectures,
VLSI Design Proc. (2001) 29.
[19] A. Hansson, M. Wiggers, A. Moonen, K. Goossens, M. Bekooij, Apply-
ing dataflow analysis to dimension buffers for guaranteed performance in
networks on chip, NOCS Proc. (2008) 211-212.
[20] M. Moadeli, A. Shahrabi, W. Vanderbauwhede, M. Ould-Khaoua, An
analytical performance model for the spidergon NoC, 21st AINA Proc.
(2007) 1014-1021.
[21] H. J. Kim, D. Park, C. Nicopoulos, V. Narayanan, C. Das, Design and
analysis of an NoC architecture from performance, reliability and energy
perspective, ACM SANCS Proc. (2005) 173-182.
[22] U. Y. Ogras, J. Hu, R. Marculescu, Key research problems in NoC de-
sign: A holistic perspective, Proc. of International Conference on Hard-
ware/Software Codesign and System Synthesis (2005) 69-74.
[23] R. Marculescu, P. Bogdan, The chip is the network: Toward a science
of network-on-chip design, Foundations and Trends in Electronic Design
Automation 2 (4) (2007) 371-461.
[24] P. Bogdan, R. Marculescu, Quantum-like eﬀects in network-on-chip
buﬀers behavior, Proc. of the 44th Design Automation Conference
(2007) 266-267.
[25] J.-Y. L. Boudec, P. Thiran, Network calculus: A theory of deterministic
queuing systems for the internet, Book, LNCS 2050 (2001) 265 pages.
[26] R. L. Cruz, A calculus for network delay, part ii: Network analysis,
IEEE Trans. on Information Theory 37 (1) (1991) 132-141.
[27] M. Bakhouya, S. Suboh, J. Gaber, T. El-Ghazawi, Analytical modeling
and evaluation of on-chip interconnects using network calculus, Proc.
of the 3rd ACM/IEEE International Symposium on Networks-on-Chip
(2009) 74-79.
[28] M. Bakhouya, Evaluating the energy consumption and the silicon area
of on-chip interconnect architectures, Journal of Systems Architecture
55 (7-9) (2009) 387-395.
[29] D. Stiliadis, A. Varma, Latency-rate servers: a general model for analysis
of traﬃc scheduling algorithms, IEEE/ACM Trans. Networking 6 (5)
(1998) 611-624.
[30] S. Suboh, M. Bakhouya, J. Gaber, T. El-Ghazawi, Analytical model-
ing and evaluation of network-on-chip interconnect architectures, Inter-
national Conference on High Performance Computing and Simulation
(HPCS) (2010) 491-497.
[31] A. Bouillard, B. Gaujal, S. Lagrange, E. Thierry, Optimal routing for
end-to-end guarantees using network calculus,
Performance Evaluation 64 (11-12) (2008) 883-906.
[32] V. Firoiu, J.-Y. L. Boudec, D. Towsley, Z.-L. Zhang, Theories and mod-
els for internet quality of service, Proceedings of the IEEE 90 (9) (2002)
1565-1591.
[33] J.-P. Georges, E. Rondeau, T. Divoux, Evaluation of switched ether-
net in an industrial context by using the network calculus, 4th IEEE
Workshop on Factory Communication Systems (2002) 19-26.
[34] J. B. Schmitt, F. A. Zdarsky, U. Roedig, Sensor network calculus with
multiple sinks, Proceedings of the Performance Control in Wireless Sen-
sor Networks Workshop at the IFIP Networking Conference (2006) 6-13.
[35] P. T. Wolkotte, G. Smit, N. Kavaldjiev, J. Becker, J. Becker, Energy
model of networks-on-chip and a bus, Proc. of International Symposium
on System-on-Chip (2005) 82-85.
[36] A. O. Balkan, G. Qu, U. Vishkin, A mesh-of-trees interconnection net-
work for single-chip parallel processing, Proc. of International Con-
ference on Application-speciﬁc Systems, Architectures and Processors
(2006) 73-80.
[37] M. Coenen, S. Murali, A. Rădulescu, K. Goossens, G. D. Micheli, A
buffer-sizing algorithm for networks on chip using TDMA and credit-
based end-to-end flow control, Proc. of International Conference on
Hardware/Software Codesign and System Synthesis (2006) 130-135.