Computing Accurate Performance Bounds for Best Effort Networks-on-Chip by Rahmati, Dara et al.
Computing Accurate Performance Bounds
for Best Effort Networks-on-Chip
Dara Rahmati, Student Member, IEEE, Srinivasan Murali, Member, IEEE,
Luca Benini, Fellow, IEEE, Federico Angiolini, Member, IEEE,
Giovanni De Micheli, Fellow, IEEE, and Hamid Sarbazi-Azad, Member, IEEE
Abstract—Real-time (RT) communication support is a critical requirement for many complex embedded applications which are
currently targeted to Network-on-chip (NoC) platforms. In this paper, we present novel methods to efficiently calculate worst case
bandwidth and latency bounds for RT traffic streams on wormhole-switched NoCs with arbitrary topology. The proposed methods
apply to best-effort NoC architectures, with no extra hardware dedicated to RT traffic support. By applying our methods to several
realistic NoC designs, we show substantial improvements (more than 30 percent in bandwidth and 50 percent in latency, on average)
in bound tightness with respect to existing approaches.
Index Terms—SoC, NoC, performance, QoS, best-effort analysis, real time, analytical model, wormhole switching
1 INTRODUCTION
THE Network-on-Chip [1], [2] paradigm has emerged in recent years to overcome the power and performance
scalability limitations of point-to-point signal wires, shared
buses, and segmented buses [1], [2], [3], [4], [5], [6]. While
the scalability and efficiency advantages of NoCs have been
demonstrated on many occasions, their timing predictability
and suitability to transport real-time communication are
still a source of technical concern.
Many applications have strict requirements on latency
and bandwidth of on-chip communication, which are often
expressed as real-time constraints on traffic flows. On a NoC
fabric, this translates to guaranteed quality of service (QoS)
requirements for packet delivery. Different approaches have
been used to support guaranteed QoS for NoCs: priority-
based switching schemes [7], time-triggered communication
[8], time-division multiple access [9], and many variations
thereof. All these approaches require the use of special
hardware mechanisms and often come with strict service
disciplines that limit NoC flexibility and penalize the average
performance to provide worst-case guarantees. In fact, NoC
prototypes are often classified as being either best-effort (BE)
or guaranteed-service (GS), depending on the availability of
hardware support for RT traffic.
Our work takes a new viewpoint. We consider best-effort
NoC architectures without special hardware support for
QoS traffic. We only assume that the traffic injected by the
network’s end-nodes is characterized in terms of worst case
behavior. We then formulate algorithms to find latency and
bandwidth bounds on end-to-end traffic flows transported by a
best-effort wormhole NoC fabric with no special hardware support
for RT traffic. For applications with traffic streams that have
RT latency and/or bandwidth constraints, it is critical to
be able to bound the maximum delay and minimum
injectable bandwidth for packets of such streams. This
helps in choosing topologies that meet the RT constraints
with minimum area, power overhead and optimum
utilization of resources. Our approach is inspired by the
work by Lee et al. [10] for traditional multiprocessor
networks, and extends it in several directions. We propose
two different methods for characterizing worst case
performance. The first method, Real-Time Bound for
High-Bandwidth traffic (RTB-HB), is conceived for NoCs
supporting workloads where injected flows have high
demands of average bandwidth and require a guaranteed
worst traffic minimum bandwidth (mBW) and maximum
upper bound NoC traversal latency (UB). In this case, we do
not assume any a priori regulation on the traffic injection
rate; a core can send packets at any time, as long as the
network has buffer capacity to accept them.
The second method considers applications with latency-
critical flows that require low and guaranteed UB values,
but have moderate bandwidth requirements, and thus can
send packets at intervals no shorter than a minimum
permitted interval—which obviously implies a maximum
bandwidth (MBW) limitation. This method, called Real-
Time Bound for Low-Latency traffic (RTB-LL) requires a
very simple traffic regulation at network injection points.
RTB-LL is a significant improvement to the WCFC bound
452 IEEE TRANSACTIONS ON COMPUTERS, VOL. 62, NO. 3, MARCH 2013
. D. Rahmati is with the HPCAN, the Department of Computer
Engineering, Sharif University of Technology, Room 701, Azadi Avenue,
Tehran, Iran. E-mail: d_rahmati@ce.sharif.edu.
. S. Murali and F. Angiolini are with the iNoCs Sarl, Lausanne and the
EPFL, INF 331, Station 14, Lausanne-CH 1015, Switzerland.
E-mail: {murali, angiolini}@inocs.com.
. L. Benini is with the Micrel Lab @ DEIS - Dipartimento di Elettronica,
Informatica e Sistemistica, Facoltà di Ingegneria, Università di Bologna,
Viale Risorgimento 2, Bologna 40136, Italy.
E-mail: luca.benini@unibo.it.
. G. De Micheli is with the Institute of Electrical Engineering and the
Integrated Systems Centre, Ecole Polytechnique Federale de Lausanne
(EPFL), INF 341, Station 14, Lausanne-CH 1015, Switzerland.
E-mail: giovanni.demicheli@epfl.ch.
. H. Sarbazi-Azad is with the Department of Computer Engineering, Sharif
University of Technology, Room 621, Azadi Avenue, Tehran, Iran.
E-mail: azad@sharif.edu.
Manuscript received 24 Jan. 2011; revised 15 Nov. 2011; accepted 29 Nov.
2011; published online 14 Dec. 2011.
Recommended for acceptance by Y. Yang.
For information on obtaining reprints of this article, please send e-mail to:
tc@computer.org, and reference IEEECS Log Number TC-2011-01-0051.
Digital Object Identifier no. 10.1109/TC.2011.240.
0018-9340/13/$31.00 © 2013 IEEE Published by the IEEE Computer Society
proposed in [10], while RTB-HB is completely new. Table 1
compares typical values for upper bound delay and
bandwidth of RTB-HB, RTB-LL, and WCFC methods. In
[11] we presented these methods in their basic modes of
operation. In this paper, we extend the methods to be more
comprehensive, considering more generic NoC models,
supporting various modes of operation and more experiments.
In particular, this paper makes several new and important
contributions beyond our earlier work in [11].
The remainder of the paper is organized as follows:
Section 2 summarizes related work. Section 3 gives
definitions and basic concepts. Section 4 describes RTB-
HB and RTB-LL methods. Section 5 focuses on experimental
results and quantitative comparisons. Section 6 describes
the time complexity of the proposed methods. Finally,
Section 7 concludes the paper.
2 RELATED WORK
The body of knowledge on macroscale RT networks is
extensive and an overview of the state of the art is beyond
the scope of this work. The interested reader is referred to
[12], [13], [14], [15], [16], [17], [18], [19], [20], [21]. Here, we
focus on RT-NoCs, which have often been called guaran-
teed-service or QoS-enabled NoCs.
QoS is an important issue in many application domains
such as multimedia, aerospace, healthcare, and military.
Many of these applications have one or more traffic flows
that have real-time requirements and need hard QoS
guarantees. Two major parameters that account for QoS
guarantees in NoC are worst case delay and worst case
bandwidth. They are sometimes referred to as upper bound delay
and lower bound bandwidth of flows. Historically, designers
have focused on extracting the average delay and average
bandwidth, and a large body of work to extract such
parameters exists [22], [23], [24], [25], [26], [27], [28], [29],
[30], [31], [32], [33]. Simulation and mathematical modeling
are two different approaches to do so. While simulation is
widely used in many situations, it is time consuming and it
gives limited insight on sensitivity to traffic parameters and
worst case conditions. In contrast, devising accurate
mathematical models of a system is complicated, but if
such models can be extracted, they are usually computa-
tionally efficient and insightful. Therefore, they can be used
within design tools, for example, to iterate in the NoC
synthesis process to tailor NoC architectures for specific
applications. Frequently used mathematical frameworks are
queuing theory and statistical timing analysis [25], [26], [28],
[31]. A node is modeled as a queuing system, which can be
M/G/1, M/G/1/t, G/G/1, etc. The NoC is then modeled by
interconnecting a number of queues and the parameters are
then extracted using standard solutions from queuing
theory. In these models, the applicability and accuracy are
the main concerns.
In order to provide QoS, some NoC architectures use
special hardware mechanisms. They are known as Guaran-
teed Service NoCs, as opposed to Best Effort NoCs. To
distinguish briefly, GS NoCs commit to a performance level
for one or more flows (typically latency or bandwidth). Hard
or soft QoS can be provided, depending on whether the
NoC actually strictly enforces the desired performance level
or merely strives to achieve it. GS NoCs can leverage
resource reservation or priority-based scheduling mechanisms.
The former technique usually achieves hard QoS, but the
resource utilization may be poor because reserved resources
are underutilized. The latter may achieve better resource
utilization as resources are used on demand, in a best-effort
fashion with priority, but generally only ensures soft
QoS and problems like starvation of low priority flows may
occur. GS NoCs require extra hardware complexity with
respect to BE NoCs to support redundant resources or
priority mechanisms. On the other hand, the performance of
flows in a GS NoC can be more easily characterized. In a
pure BE NoC, analyzing the temporal behavior of the flows
is very complex, due to the large number of contentions that
may block a packet of a flow several times along its journey
to the destination. Simulation is also very complicated and
time consuming, as identifying the worst case scenario, and
forcing the network to operate in such a worst case
situation, is extremely difficult, if not impossible. Thus,
the modeling approach may be an option; however, due to
extreme complexity, until recently there have been no
applicable approaches to model the performance para-
meters in worst case situations. Thus, most of the efforts to
provide QoS have been in the context of GS or combinations
of GS and BE NoCs.
In [9], Goossens et al. present the Æthereal NoC which
combines GS with BE. The MARS [34], aSoc [35], and
Nostrum [36] architectures use time division multiplexing
(TDMA) mechanisms to provide real-time guarantees on
packet-switched networks. The aelite NoC [37] provides a
GS and scalable TDMA-based architecture, using mesochronous
or asynchronous links. In Shi and Burns [7], a
priority-based wormhole switching for scheduling RT flows
is presented. In [8], Paukovits and Kopetz propose the
concept of a predictable Time-Triggered NoC (TTNoC) that
realizes QoS-based communication services. Diemer and
Ernst in [38] introduce Back Suction, a flow control scheme
to implement service guarantees using a prioritized
approach between BE and GS services. In [39], Hansson
et al. provide the latency and throughput guarantees based
on the approach of data flow analysis technique, determin-
ing required buffer size at network interfaces, which is
applicable to the Æthereal NoC. Many other works have
been published with variations over these basic ideas [40],
[41], [42], [43], [44], [45], [46], [47], [48], [49].
However, most NoC architectures are of best effort type
[50] and do not have special hardware mechanisms to
guarantee QoS. Today, to the best of our knowledge, there
are only a few works that calculate worst case bandwidth and
delay values for a BE NoC. In [51], the lumped link model
was proposed where the links a packet traverses are
lumped into a single link. This model does not distinguish
direct contention (due to arbitration losses) from indirect
contention (due to full buffers ahead along the path), thus
the estimated bounds are pessimistic. In [52], Qian et al.
provide a method based on network calculus [53], [54] to
RAHMATI ET AL.: COMPUTING ACCURATE PERFORMANCE BOUNDS FOR BEST EFFORT NETWORKS-ON-CHIP 453
TABLE 1
Typical Upper Bound Delay and Bandwidth in Different Methods
calculate real-time bounds for NoCs; the method uses
service curves and arrival curves that characterize the
service characteristics of switches and injected traffic.
Extracting arrival curves is not a straightforward task for
many applications. Thus, for an arbitrary injected traffic
load, traffic regulators may be needed to make sure that the
amount of injected traffic in a specified period does not
exceed a specified level. In [55] a buffer optimization
problem is solved under worst case performance con-
straints based on network calculus. In [56], Bakhouya et al.
also present a model based on network calculus to estimate
the maximum end-to-end delay and buffer size for mesh networks;
the delay bounds calculated for the flows are not
hard bounds and real values may be larger.
In [11], we proposed some methods for worst case
analysis that do not need traffic regulators. The bounds
presented by those methods are tighter than those reported
by previous studies. In this paper, we have extended these
methods in several directions: we provide a more detailed
switch model to differentiate stage delay and buffer depth; this
results in tighter bounds in the calculations of the RTB-LL method.
Our analysis also considers networks with virtual channels
and variable buffer lengths; the analysis for small buffers is
also extended to account for message lengths, which results
in tighter bounds with respect to our previous results.
3 THE NETWORK MODEL
A router model is essential to characterize network latency
and bandwidth. We consider the very general reference
architecture shown in Fig. 1 where a crossbar handles the
connections among input and output channels inside the
router. For more generality, we consider optional buffering
at input and output ports. We assume round-robin
arbitration in the switches, a commonly used arbitration
scheme in many NoCs. Each port is equipped with some
virtual channels sharing the bandwidth of the physical
channel associated with it. Links, which can be pipelined to
maximize the operating frequency, connect the output ports
to the input ports of adjacent routers. Note that due to
backpressure signaling, packet loss and packet dropping do
not happen in switches. Table 2 summarizes the parameters
used to describe the model. For the sake of simplicity, we
use a single parameter Freq for the operating frequency of
all cores and FlitWidth as the data width of all NoC links.
The buffer depth (Bd) parameter is used in the paper
frequently. As seen in Fig. 1, Bd is the summation of a
number of registers and/or of the number of slots of one or
two FIFO(s) (footnote 1), from the arbitration point (at the entry of the
crossbar) of switch j to the arbitration point of switch j + 1.
The input buffer depth is denoted by b1 (we assume at least
one register), b′1 is the minimum delay of the header flit of a
packet in the input buffer of a switch, b2 is the number of
cycles to traverse a switch crossbar (if pipelined), b3 is the
depth of the output buffer (if any), b′3 is the minimum delay
of the header flit of a packet in the output buffer of a switch
and a is the number of registers (if any) along a link to
compensate the propagation delay of the wires. Note that b2
and a represent latencies that packets face even in the
absence of congestion, while b1 and b3 become most
relevant in case of blocking, when buffers fill up; in the
absence of congestion, input and output buffers can be
traversed in a single cycle instead (b′1 and b′3). Blocking
always happens because of arbitration conflicts, either
directly in front of a switch crossbar, or indirectly due to
full buffers ahead. For simplicity, throughout the paper, we
consider the buffering between two adjacent switches to be
lumped, so we mention “output buffer of switch j” and “input
buffer of switch j + 1” equivalently, referring to the same
number of intermediate registers or FIFOs between the
arbitration points of switches j and j + 1, i.e., to Bd. The
stage delay (Sd) parameter instead describes the minimum
delay the header flit of a packet will face in the absence of
any contention with other packets, between arbitration
points of two adjacent switches. Please also note that the
switches along a path are indexed j = 1..m, but j = 0 can
Fig. 1. Switch model and parameters.
TABLE 2
Network Parameters, Symbols, and Used Notation
1. Throughout this paper we assume registers to be clocked pipeline
registers, in which to traverse a number of n cascaded registers, n clock
cycles are always required. A FIFO is the traditional First-In, First-Out
queue. In case a flit traverses an empty FIFO of length n, in the absence of
traffic contention, it will face one cycle and in case of a filled FIFO, will face
n cycles of delay. We use the general term buffer throughout the paper to
address both types of registers and FIFOs or a combination of them.
conveniently be seen as a virtual switch inside the source
node, which acts like a physical switch to model source
conflicts (i.e., sending more than one flow from a source
node). The parameters ts1 and ts2 model the setup time at
NoC sources and consumption time at NoC destinations to
inject and eject packets. Of course, to be able to use finite
parameters, we assume that the receiving nodes are able to
accept incoming data at the required rates.
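As a rough illustration of how the two per-hop parameters relate to the buffering elements described above, the sketch below assumes that Bd is the sum b1 + b2 + b3 + a of all storage between two arbitration points, and that Sd is its contention-free counterpart b′1 + b2 + b′3 + a. These closed forms are one reading of the description, not formulas quoted from Table 2, and the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HopBuffers:
    """Buffering between the arbitration points of two adjacent switches.

    Field names follow the symbols in the text; the class itself is hypothetical.
    """
    b1: int      # input buffer depth (slots), at least one register
    b1_min: int  # minimum header-flit delay in the input buffer (b'1)
    b2: int      # cycles to traverse the crossbar (0 if not pipelined)
    b3: int      # output buffer depth (slots), 0 if absent
    b3_min: int  # minimum header-flit delay in the output buffer (b'3)
    a: int       # pipeline registers along the link

    def buffer_depth(self) -> int:
        """Assumed Bd: every slot a blocked packet can occupy on this hop."""
        return self.b1 + self.b2 + self.b3 + self.a

    def stage_delay(self) -> int:
        """Assumed Sd: header-flit delay with no contention on this hop."""
        return self.b1_min + self.b2 + self.b3_min + self.a


# Hypothetical hop: 4-slot input FIFO, 1-cycle crossbar, 2-slot output FIFO,
# one link register; empty FIFOs are crossed in a single cycle.
hop = HopBuffers(b1=4, b1_min=1, b2=1, b3=2, b3_min=1, a=1)
print(hop.buffer_depth(), hop.stage_delay())
```

With these example values the hop can hold 8 flits when blocked, while an uncontended header flit crosses it in 4 cycles, which is the distinction between Bd and Sd drawn in the text.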
4 NETWORK TRAVERSAL DELAY ANALYSIS
Table 3 lists the parameters we use to describe traffic flows
across the network, while Table 4 summarizes the para-
meters that we use to model the performance of such flows.
Most notably, UBi represents the upper bound delay for a
packet of flow Fi traversing the network, and is a key factor
for the interconnect designer. We try to use a notation as close
as possible to that used in [10] for ease of comparison. We
first present a method called Real Time Bound for High-
Bandwidth traffic which calculates UBi in a completely
worst case traffic situation. Crucially, this includes the
possibility for other system cores to inject unregulated bandwidth,
i.e., any amount of traffic at irregular intervals. This is a key
property for the analysis of real-world interconnects, as most
available IP cores operate on an unregulated-injection basis.
In order to calculate UBi in such a case, we consider all
intermediate buffers along the route to be full, and we
assume arbitration loss at all switches where other flows
are contending for the same output port. As will be seen,
the calculation procedure is deterministic, so it always yields
a unique worst case solution for each flow. The calculated
values are tight in most network scenarios, as the worst
case situation can really happen. The bounds may be slightly
pessimistic in rare network scenarios where some of the
contending flows connect two switches over different routes,
which can prevent enough contending packets from being
available to create the worst case situation.
Deadlock and livelock do not occur as we assume the
routing path along the switches for all the flows are
deterministic and predefined (like the networks used in
[57]). As we are modeling the worst case behavior, we
consider that the flows send the packets at maximum
possible rate that the network permits; thus at this level of
design, knowing the flow behavior is not important. Since
switches are assumed to feature round-robin arbitration,
even though we assume the current flow to be serviced last,
the maximum delay is bounded, i.e., starvation cannot
occur. Therefore, the packets sent by the source Si are
eventually delivered. RTB-HB calculates the Maximum
Interval MIi, i.e., the number of cycles after which the
output buffer of Si is guaranteed to be free again for further
injection. From this value, the worst traffic minimum
injectable bandwidth (mBWi) can also be easily derived.
This analysis can be applied to most NoC architectures,
without any specific QoS hardware or software provision-
ing. We then move on to the description of another method,
called RTB-LL. In this scenario, we assume that traffic
injection can be regulated, as in some application scenarios.
Therefore, we also calculate a minimum permitted interval
(mLi) between two consecutive packets from the same
source, which can be translated into a maximum permitted
bandwidth (MBWi). This approach is similar to the
previously reported method [10] (Real-time wormhole channel
feasibility checking or WCFC, which will be briefly described
later) but provides much better results in terms of bound
tightness. For a proper operation, the system must then
respect MBWi bounds at runtime.
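The conversion from an injection interval to a bandwidth figure can be sketched as follows. It assumes that one packet of Li flits is injected per interval, so that the bandwidth is Li × FlitWidth × Freq divided by the interval in cycles; applied to MIi this gives mBWi, and applied to the minimum permitted interval it gives MBWi. This formula and the function name are assumptions consistent with the definitions above, not expressions quoted from the paper.

```python
def bandwidth_bits_per_s(packet_flits: int, flit_width_bits: int,
                         freq_hz: float, interval_cycles: int) -> float:
    """Bandwidth of one packet of packet_flits flits every interval_cycles cycles."""
    if interval_cycles <= 0:
        raise ValueError("interval must be a positive number of cycles")
    return packet_flits * flit_width_bits * freq_hz / interval_cycles


# Hypothetical flow: 4-flit packets, 32-bit flits, 500 MHz clock, MI_i = 8 cycles.
mbw = bandwidth_bits_per_s(4, 32, 500e6, 8)
print(mbw)  # guaranteed minimum injectable bandwidth in bit/s under these assumptions
```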
4.1 The Proposed Delay Model RTB-HB
The goal here is to calculate the parameters UBi (worst case
latency to traverse the network) and MIi (maximum worst
case interval). Let us first consider the case Bd = L.
4.1.1 The Case Bd = L
Note that Bd = L means that a packet fills exactly the
buffering resources between the arbitration points of two
adjacent switches. Considering the case where the network is
completely loaded (an unrealistic scenario just for visualization
purposes) and Bd = L, the network operates by
shuffling packets around in lockstep: all switches simultaneously
rearbitrate every L cycles and packets trail each
other, filling up the buffers as soon as they become free.
More formally, when Pi is generated in Si, we consider
all intermediate buffers along its route being full of packets
from different flows. In the worst case, for Pi to reach its
destination, all these packets must leave their buffers.
Focusing on hop j, Pi may have arbitration conflicts with a
TABLE 3
Traffic Model Parameters
TABLE 4
Performance Model Parameters
number $z_c(i, j)$ of other flows contending for the output
channel c, e.g., flows Fa and Fr. Since round-robin
arbitration is assumed, it is enough to consider all
contending flows to send a packet before Pi to guarantee
a worst case analysis. The order in which contending flows
obtain the arbitration is not important for the latency
calculation of Pi. So, Pa should make a one-hop forward
progress. While Pa frees the buffers at hop j, flit by flit, the
flits from Pr will smoothly replace the free buffer spaces.
Eventually, Pi also goes through. Section 4.1.3 presents a
simple example to visualize this. The parameter $u_i^0$
represents the time needed for Pi to be ejected from Si
and be placed in the output buffer of Si (or input buffer of
the first switch of Fi). $u_i^j$ then represents the time needed for
Pi to go from the input buffer of $SW_j$ to the input buffer of
$SW_{j+1}$, except for the last switch. At the last switch, Pi is
ejected, so it is instead the time needed to get into the input
buffer of destination Di. To calculate UBi, as shown in (1),
all these contributions must be aggregated, plus the fixed
overhead for the packet creation and ejection:
$$
UB_i = t_{s1} + t_{s2} + \sum_{j=0}^{h_i} u_i^j. \tag{1}
$$
The time needed for Si to inject the next packet is the time to
create such a packet, plus the time needed for this packet to
move on to the input buffer of the first switch. Thus,
$$
MI_i = t_{s1} + u_i^0. \tag{2}
$$
To be consistent with the notations from [10], we introduce
the uppercase $U_i^j$ symbol, which models the hop delay from
output buffer to output buffer (instead of from input buffer
to input buffer).
Let us consider a packet of flow Fi initiated at the source
Si. For this packet to reach the input buffer of the first
switch, all existing packets at that buffer have to leave. Such
an existing packet could be a packet from the same flow Fi
or any of the contending flows at the output channel of the
source. Thus, the worst case time taken for any existing
packet to leave the buffer is given by $\mathrm{MAX}_x(U_i^0, U_{I(x)}^0)$,
where $I(x)$ is the index of contending flows at the output
channel, with $x = 1..z_0(i, 0)$. Also, all other contending
flows of Fi may have to send a packet before this flow.
Thus, the total delay for a packet from Fi to reach the input
buffer of the first switch is given by:
$$
u_i^0 = \mathrm{MAX}_x\left(U_i^0, U_{I(x)}^0\right) + \sum_x U_{I(x)}^0, \qquad x = 1..z_0(i, 0). \tag{3}
$$
Similarly, for the subsequent hops, $u_i^j$ can be calculated as:
$$
u_i^j = \mathrm{MAX}_x\left(U_i^j, U_{I(x)}^j\right) + \sum_x U_{I(x)}^j, \qquad x = 1..z_c(i, j), \quad 1 \le j \le h_i. \tag{4}
$$
Please note that, if there is no contention for the flow, the
above equation reduces to $u_i^j = U_i^j$. This is again akin to a
packet moving in a pipeline fashion in the network.
In order to calculate $U_i^j$ values, let us consider the packet
from flow Fi moving from output buffer of the source to the
output buffer of the first switch. For the packet to move, any
existing packet from the output buffer of the first switch
should move to the output buffer of the second switch.
Similar to the above calculations, the maximum delay is
given by $\mathrm{MAX}_x(U_i^{j+1}, U_{I(x)}^{j+1})$. Please note the small
difference from the $u_i^j$ calculations: in this case, the
values of $U_i^j$ at a switch j depend on the values at the next
switch (j + 1) on the path. The $U_i^j$ values can be obtained as
$$
U_i^j = \mathrm{MAX}_x\left(U_i^{j+1}, U_{I(x)}^{j+1}\right) + \sum_x U_{I(x)}^{j+1}, \qquad x = 1..z_c(i, j+1), \quad 0 \le j \le h_i - 1. \tag{5}
$$
For the case of the last switch, from the output port, the
packet can be ejected in Li cycles (one flit per cycle). Thus,
$$
U_i^{h_i} = L_i, \qquad U_{I(x)}^{h_{I(x)}} = L_{I(x)}.
$$
Based on (3) and (4), now the problem of finding UBi
and MIi ((1) and (2)) is mapped onto a summation of $U_i^j$
values, which can be solved by (5). Please note that we
assume that the destination has enough buffers to eject the
packets at the rate at which the network delivers them. By
applying the above formulas recursively, we can obtain the
worst case delay (UB) and injection rate (MI) for the
different flows.
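The recursion defined by (1) to (5) can be sketched in a few lines of Python. This is a minimal illustration for the case Bd = L, not the authors' implementation: the two-flow network, the channel names, and all function names are hypothetical, and a contending flow is looked up at its own hop index on the shared channel. The recursion terminates only if the channel dependencies are acyclic, which is assumed here in line with deterministic, deadlock-free routing.

```python
from functools import lru_cache

# Hypothetical network: each flow lists the output channels it uses, hop by hop.
# Hop 0 is the source's output channel; the last hop ejects into the destination.
PATHS = {
    "F1": ("src1", "c1", "c2", "c3"),
    "F2": ("src2", "c1", "c4"),
}
LENGTH = {"F1": 4, "F2": 4}  # packet lengths L_i in flits (Bd = L assumed)
TS1, TS2 = 0, 0              # packet setup and consumption times


def contenders(flow, hop):
    """Other flows sharing the channel used by `flow` at `hop`, as (flow, hop) pairs."""
    channel = PATHS[flow][hop]
    return [(f, p.index(channel)) for f, p in PATHS.items()
            if f != flow and channel in p]


@lru_cache(maxsize=None)  # each intermediate U value is computed once and reused
def U(flow, hop):
    """Output-buffer-to-output-buffer hop delay, following (5)."""
    last = len(PATHS[flow]) - 1
    if hop == last:
        return LENGTH[flow]  # ejection at one flit per cycle
    rivals = [U(f, h) for f, h in contenders(flow, hop + 1)]
    return max([U(flow, hop + 1)] + rivals) + sum(rivals)


def u(flow, hop):
    """Input-buffer-to-input-buffer hop delay, following (3) and (4)."""
    rivals = [U(f, h) for f, h in contenders(flow, hop)]
    return max([U(flow, hop)] + rivals) + sum(rivals)


def upper_bound(flow):
    """UB_i from (1)."""
    return TS1 + TS2 + sum(u(flow, j) for j in range(len(PATHS[flow])))


def max_interval(flow):
    """MI_i from (2)."""
    return TS1 + u(flow, 0)


for name in PATHS:
    print(name, upper_bound(name), max_interval(name))
```

For this toy network the sketch yields UB = 24 and MI = 8 cycles for F1, and UB = 20 and MI = 8 cycles for F2. The single shared channel c1 doubles the delay of every hop at or before it, while the uncontended hops cost exactly one packet length.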
To illustrate the RTB-HB analytical method for the calculation
of the upper bound delay and interval, we apply it step by step
to an example NoC (shown in Fig. 2). The NoC contains four
switches and there are four message flows: from S1 to D1,
from S2,3 to D3, from S2,3 to D2,4, and from S4 to D2,4
(S2,3 and D2,4 are a source and a destination with two flows
originating from or finishing at them). We consider
Bd = L = 4 in this example.
As an example, we study the time needed for a packet $P^0$
of flow F1 to cross the network. In general, from (1), we can
write:
$$
UB_1 = t_{s1} + t_{s2} + \sum_{j} u_1^j, \qquad j = 0..3.
$$
To start, let us model the time $u_1^0$ needed to move from S1
to the input buffer of switch SW1. We start from the most
congested network possible, so there exists another
packet $P^1$ of the same flow ahead, and this packet needs
$U_1^0$ to move from the output buffer of the source (remember
that source nodes are tagged with superscript 0) to the
output buffer of SW1; so, $u_1^0 = U_1^0$. $U_1^0$ has to be calculated
recursively based on the delays of the contending packets
and delays of the packets ahead along the same route.
We observe that two factors mainly contribute when
calculating the delay: first, the possibility of losing arbitrations
at SW1; second, the fact that there may be no available
buffer space at the output of SW1 (due to arbitration losses
ahead), which also effectively stalls packets at the input of
SW1. As far as the arbitration loss is concerned, it can be seen
Fig. 2. A simple example network.
that flow F1 contends with flow F2 at the output of SW1.
Thus, a packet $P^2$ of F2 currently in the input buffer of SW1
could be arbitrated before $P^1$. As far as the output
buffer full condition is concerned, in the worst case, there will
be a single (Bd = L) packet $P^3$ in the output buffer of SW1. $P^3$ could
belong to either F1 or F2, where, respectively, $U_1^1$ or $U_2^1$
models the time for such a packet to move ahead, from the
output buffer of SW1 to the output buffer of SW2.
$\mathrm{MAX}(U_2^1, U_1^1)$ models the worst case delay affecting the
flow under study. During the time $\mathrm{MAX}(U_2^1, U_1^1)$, the
packet $P^3$ moves on to the output buffer of SW2, leaving
the output buffer at SW1 empty. However, in the worst
case, an arbitration loss occurs to $P^1$, so it is packet $P^2$
which will smoothly replace $P^3$ (Fig. 3a). Before $P^1$ can
move on by one hop, we must also consider the time for
packet $P^2$ to go from the output buffer of SW1 to the output
buffer of SW2 (Fig. 3b), which is $U_2^1$. Thus, we can write:
$$
u_1^0 = U_1^0 = \mathrm{MAX}\left(U_2^1, U_1^1\right) + U_2^1,
$$
which traces back to (3). As mentioned above, this is the
delay for $P^1$ to move one hop on, but equivalently is also the
delay for $P^0$ to replace it in the previous location (Fig. 3c).
Now, similarly, $P^0$ needs to move another hop on, from
the input buffer of SW1 to the input buffer of SW2, with a
delay which is defined as $u_1^1$. It is possible to use the relation
$u_1^1 = U_1^0$ based on the equations described in the previous
section, but for clarity, we always describe $u_i^j$ based on $U_i^j$.
As shown in Figs. 3d, 3e, and 3f, in the worst case, packet $P^0$
should wait for a packet $P^4$ of F2. A packet $P^1$, again either
from F1 or F2, should be considered at the output buffer of
SW1. So again, during the time $\mathrm{MAX}(U_2^1, U_1^1)$, while $P^1$
moves on to the output buffer of SW2, $P^4$ will replace it. $P^4$
itself then takes $U_2^1$ to move on, allowing $P^0$ to eventually get
to the input buffer of SW2. Thus,
$$
u_1^1 = \mathrm{MAX}\left(U_2^1, U_1^1\right) + U_2^1.
$$
In a similar manner $u_1^2$ can be calculated. Once $P^0$ is in
the input buffer of SW3, it is only one hop away from its
destination and as there is no contending flow at the
destination, the ejection time for the messages equals L1.
For the sake of uniformity of presentation, we can write
$$
u_1^3 = U_1^3 = L_1.
$$
Now, the target metric UB1 can be calculated recursively,
as a function of $U_i^j$ variables for the whole network starting
from the last hop of each flow. The calculation of all
intermediate values, e.g., $U_1^1$, $U_2^1$, $U_1^2$, $U_2^2$, $U_1^3$, and of the
relevant metrics UBi and MIi, is shown in Fig. 4. Please
note that to speed up the recursion steps, intermediate
values can be calculated only once and then stored and
used later.
When considering the source S2,3 it can be noticed that
two flows F2 and F3 can originate from it; therefore, source
conflicts may happen. As Fig. 4 shows, for example, when
analyzing flow F2, $u_2^0$ (the time to transfer a packet of F2
into the input buffer of the first switch) should include a
delay $\mathrm{MAX}(U_2^0, U_3^0)$, which accounts for a packet of either
F2 or F3 to move away from the input of SW1 toward the
input of SW2 (during which time we must assume, in the
worst case, that it is a packet of F3 which replaces it), and
then again the time $U_3^0$ for this latter packet to also move
on, finally letting a packet from F2 in. The calculation
for $u_3^0$ is done similarly.
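Written out with the notation of (3), and assuming that F2 and F3 are the only flows sharing the output channel of S2,3, the two source-hop delays described above take a symmetric form:

```latex
\begin{align*}
u_2^0 &= \mathrm{MAX}\left(U_2^0, U_3^0\right) + U_3^0,\\
u_3^0 &= \mathrm{MAX}\left(U_3^0, U_2^0\right) + U_2^0.
\end{align*}
```

For instance, with the purely hypothetical values $U_2^0 = 12$ and $U_3^0 = 8$ cycles, these expressions give $u_2^0 = 12 + 8 = 20$ and $u_3^0 = 12 + 12 = 24$ cycles: the flow with the shorter downstream delay pays more at the source, because it must wait for the slower sibling packet twice.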
4.1.2 The Case Bd < L
The same values calculated for the case Bd = L can be used
for the case Bd < L. Assume network A with Bd < L and a
completely equivalent network B but with Bd = L. Let us
consider UBi(A) and UBi(B) to be the upper bound delays
for these networks. It is possible to use the value of UBi(B)
instead of UBi(A) for network A. The reason is simply that
the number of contending flows along the path of a flow is
the same in both the networks A and B, but the number of
in-flight packets in the worst case situation is larger for
network B.
In order to tighten the performance bounds, we also
present equations to calculate UBi(A) directly, achieving
lower figures than those calculated above for network B.
For this purpose, we introduce a new parameter (Definition 1)
and extend the definitions of $u_i^j$ and $U_i^j$ to support the case
Bd < L for network A; the new definitions comply with
those used for the case Bd = L.
Fig. 3. (a, b, c) Packet $P'$ goes from the process that generates it (F1 in S1)
to the input buffer of SW1. (d, e, f) Packet $P'$ goes from the input buffer of
SW1 to the input buffer of SW2.
Fig. 4. The complete calculation of $UB_i$ and $MI_i$ for the example NoC of
Fig. 2 in method RTB-HB.
Definition 1. We denote by $\delta_i^j$ the worst case delay (in cycles)
elapsing from the moment when the header flit of $P_i$ enters the
arbitration point of switch j + 1 to the moment when the tail
flit leaves the arbitration point of switch j.
In particular, when Bd = L (as in network B), this time is
zero, since as soon as the header flit enters the arbitration
point of switch j + 1, the tail flit has left the arbitration point
of switch j.
In order to calculate the parameters for network A, the values
of $\delta_i^j$ must be calculated for the different values of i and j.
Here, for convenience, we assume that L is divisible by Bd;
thus, when there is a sufficiently long cascade of switches, a
whole packet can fit between the arbitration points of the first
and last switches. For this purpose, the number of required
switches is L/Bd + 1 (2 when Bd = L). To calculate $\delta_i^j$, since
the header flit is at the arbitration point of switch j + 1, L − Bd
flits of the packet have not passed the arbitration point of
switch j, and therefore the header flit needs to traverse S
more switches (S = (L − Bd)/Bd = L/Bd − 1) to ensure that
the tail flit passes the arbitration point of switch j. If the
number of remaining switches in the path of flow i is smaller
than S, the packet simply exits the network at the destination
node flit by flit, until the tail flit has left the arbitration point of
switch j; thus,
$$
\delta_i^j =
\begin{cases}
\sum_{k=1}^{S} u_i^{j+k}(A), & \text{when } j + S \le h_i, \\
\sum_{k=1}^{h_i - j} u_i^{j+k}(A) + (j + S - h_i)\, Bd, & \text{when } j + S > h_i,
\end{cases}
\qquad j = 0..h_i. \tag{6}
$$
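As a minimal sketch of how (6) can be evaluated for a single flow, the function below takes the per-hop delays as a plain list. The name `delta`, the list layout `u[k]`, and the sample numbers are assumptions made for illustration only.

```python
def delta(u, j, L, Bd):
    """Evaluate (6) for one flow.

    u[k] holds u_i^k(A) for k = 0..h_i, so h_i = len(u) - 1.
    L is assumed to be divisible by Bd, as in the text.
    """
    h = len(u) - 1
    S = L // Bd - 1  # extra switches the header must still traverse
    if j + S <= h:
        return sum(u[j + k] for k in range(1, S + 1))
    # The path ends early: the rest of the packet drains at the destination.
    return sum(u[j + k] for k in range(1, h - j + 1)) + (j + S - h) * Bd
```

With Bd = L the value of S is zero, so the function returns zero for every j, in agreement with the remark following Definition 1.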
Definition 2. We denote by $U_i^j(A)$ the worst case delay elapsing
from the moment when the header flit of $P_i$ enters the
arbitration point of switch j + 1 to the moment when the tail
flit leaves this point. When Bd = L (as in network B), this
definition matches the previous definition for $U_i^j(B)$ in (5).
$$
U_i^j(A) = \mathrm{MAX}_x\left( U_i^{j+1}(A) - \delta_i^{j+1},\; U_{I(x)}^{j+1}(A) - \delta_{I(x)}^{j+1} \right)
+ \sum_x U_{I(x)}^{j+1}(A) + \delta_i^{j+1},
\quad x = 1..zc(i, j+1),\; 0 \le j \le h_i - 1, \tag{7}
$$
and
$$
U_i^{h_i}(A) = L. \tag{8}
$$
In case Bd < L, a packet occupies more buffering space than
that available between the arbitration points of two adjacent
switches. Consider now a situation where the header flit of
a packet p of flow i is at the arbitration point of switch j + 1,
while an interfering packet q of flow t is also traversing
switch j + 1, but with its header already at the arbitration
point of switch j + 2 + S (S = L/Bd − 1). Also, a number
zc(i, j + 1) of packets at different input ports of switch j + 1
are contending with p for the same output port. Before the
tail flit of packet p can leave the arbitration point at switch
j + 1, the following must happen:
1. Packet q should proceed for Bd steps (registers), so
that its tail flit leaves the arbitration point of switch
j + 2. After this step, the header flit of one of the
zc(i, j + 1) packets should be at the arbitration point
of switch j + 2. This time is calculated as $u_t^{j+2+S}(A)$; it
is easy to see that $u_t^{j+2+S}(A) = U_t^{j+1}(A) - \delta_t^{j+1}$. As t
can be selected from any of i or the zc(i, j + 1) flows, the
term to calculate the required time for this step is:
$$
\mathrm{MAX}_x\left( U_i^{j+1}(A) - \delta_i^{j+1},\; U_{I(x)}^{j+1}(A) - \delta_{I(x)}^{j+1} \right),
\quad x = 1..zc(i, j+1).
$$
2. All the zc(i, j + 1) packets, to account for the worst
case, shall be arbitrated before p; so, after this step the
header of p is at the arbitration point of switch j + 2.
The required time for this step is $\sum_x U_{I(x)}^{j+1}(A)$ with
x = 1..zc(i, j + 1).
3. Packet p, whose header is at the arbitration point of
switch j + 2, must proceed for a number of steps
until its tail flit leaves the arbitration point of switch
j + 1. This time is calculated as $\delta_i^{j+1}$.
The summation of the time for these steps results in (7).
As can be seen in (8), in the last switch for a flow, L cycles
are required for the packet to be ejected to the destination.
Definition 3. We denote by $u_i^j(A)$ the worst case delay elapsing
from the moment when the header flit of $P_i$ enters the arbitration
point of $SW_j$ ($1 \le j \le h_i - 1$) to the time it enters the
arbitration point of $SW_{j+1}$ (for j = 0, it is the worst case delay
elapsing from when the header flit of the packet in the source
node is ready to be injected in the network to the time it enters
the arbitration point of the first switch; also, for $j = h_i$, it is the
worst case delay elapsing from the moment the header flit of $P_i$
enters the arbitration point of the last switch $h_i$ to when
Bd flits of the packet have been ejected from the network at $D_i$).
$$
u_i^j(A) = \mathrm{MAX}_x\left( U_i^j(A) - \delta_i^j,\; U_{I(x)}^j(A) - \delta_{I(x)}^j \right)
+ \sum_x U_{I(x)}^j(A),
\quad x = 1..zc(i, j),\; 0 \le j \le h_i - 1, \tag{9}
$$
and
$$
u_i^{h_i}(A) = Bd + \sum_x U_{I(x)}^{h_i}(A), \quad x = 1..zc(i, h_i). \tag{10}
$$
It is obvious that the calculation of $u_i^j(A)$ is like that of $U_i^{j-1}(A)$,
except that step 3 is not required and the term $\delta_i^j$ is omitted; thus,
we can write:
$$
U_i^{j-1}(A) = u_i^j(A) + \delta_i^j, \quad j \ge 1. \tag{11}
$$
Using the above definitions, the equations to calculate
$UB_i(A)$ and $MI_i(A)$ can be summarized as:
$$
UB_i(A) = ts1 + ts2 + \sum_j u_i^j(A) + (L - Bd), \quad j = 0..h_i. \tag{12}
$$
In other words, the time needed for a packet to move
from source to destination includes the source and destination
overheads (ts1 and ts2), the time needed for the header
flit to move across the network switch by switch, and L − Bd
(as for the last switch, (10) is used and the packet needs
L − Bd further steps to completely leave the network).
To calculate the value of the maximum interval between
two consecutive packets, it is enough to add the source
overhead (ts1), the time needed to send the header flit of the
packet to the arbitration point of the first switch ($u_i^0(A)$),
and the time needed for the last flit of the packet to leave the
source node, which is equal to $\delta_i^0$. That is,
$$
MI_i(A) = ts1 + u_i^0(A) + \delta_i^0. \tag{13}
$$
As seen in (9), in order to calculate $u_i^j(A)$, different values of
$\delta_i^j$ should be calculated, which in turn require $u_i^k(A)$ with
k > j; so, the $u_i^j(A)$ values should be calculated
from the larger switch indices toward the smaller ones, i.e.,
from destinations toward sources.
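The destination-to-source ordering can be sketched for the simplest possible case, a single flow with no contention at any hop (zc(i, j) = 0 everywhere), which is an assumption made here purely to keep the example short. Under that assumption (7) and (8) give a constant value L at every hop, (9) reduces to that value minus the Definition 1 parameter, and (10) reduces to Bd. The function name and its arguments are hypothetical.

```python
def bounds_no_contention(h, L, Bd, ts1=0, ts2=0):
    """Return (UB, MI) from (6)-(13) for a flow crossing h switches,
    assuming no contending flows and L divisible by Bd."""
    S = L // Bd - 1
    u = [0] * (h + 1)    # u_i^j(A)
    dlt = [0] * (h + 1)  # parameter of Definition 1, one value per hop
    U = [L] * (h + 1)    # (8) at j = h; (7) keeps it at L without contention

    for j in range(h, -1, -1):  # from the destination toward the source
        if j + S <= h:
            dlt[j] = sum(u[j + k] for k in range(1, S + 1))
        else:
            dlt[j] = sum(u[j + k] for k in range(1, h - j + 1)) + (j + S - h) * Bd
        u[j] = Bd if j == h else U[j] - dlt[j]  # (10) or (9)

    UB = ts1 + ts2 + sum(u) + (L - Bd)  # (12)
    MI = ts1 + u[0] + dlt[0]            # (13)
    return UB, MI
```

For example, with three switches and L = 4, the sketch returns a smaller upper bound for Bd = 2 than for Bd = 4, while the interval stays at L cycles in both cases.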
In order to show the implementation for the case Bd < L,
Fig. 5 shows the case of L = 4 and Bd = 2 for the example
scenario in Fig. 2. In order to calculate $UB_1$, the values $u_1^3$, $u_1^2$,
$u_1^1$, and $u_1^0$ should be calculated. As described in (10), $u_1^3$ is
equal to Bd, as there is no flow contention on the port
connected to D1 in SW3. Equation (9) is used to calculate $u_1^2$,
which is equal to the time needed for the header flit of a
packet $P_1$ of F1 to move from the arbitration point of SW2 to
the arbitration point of SW3. As there is no contention from
flows from different input ports of SW2 for the same output
port, the term $\sum_x U_{I(x)}^j(A)$ in (9) is zero; thus, only the value
of $\mathrm{MAX}(U_2^2 - \delta_2^2,\; U_1^2 - \delta_1^2)$ is calculated (the reason for using
the MAX operator is that the packet just ahead of
$P_1$ may be either from F1 or F2). Flow
contention happens between F1 and F2 for the output port of
SW1 which is connected to SW2; thus, to calculate $u_1^1$ this
contention should be considered and, based on (9), the value
is $\mathrm{MAX}(U_2^1 - \delta_2^1,\; U_1^1 - \delta_1^1) + U_2^1$. To calculate $u_1^0$, as there is
no source conflict, the term $\sum_x U_{I(x)}^j(A)$ is zero and the term
$\mathrm{MAX}(U_i^j(A) - \delta_i^j,\; U_{I(x)}^j(A) - \delta_{I(x)}^j)$ will result in $U_1^0 - \delta_1^0$. The
calculation for other flows is done in the same manner.
4.1.3 The Case Bd > L
For simplicity, let us first consider the simple case Bd = 2L.
In this case, exactly two packets fit between two adjacent
switches along the path of flow $F_i$. It is possible to break
the buffering space by imagining a dummy switch at the
middle of the input buffer of all the switches and rewrite the
above formulas. For example, in Fig. 6, with the introduction
of dummy switches, the buffering space between switches $k'$
and k becomes Bd = L.
In such a topology, we can always write $u_i^{k'} = u_i^k$,
independent of the flows of the packets in the buffer; thus,
$UB_i$ and $MI_i$ are given as:
$$
\begin{aligned}
UB_i &= ts1 + ts2 + 2 \sum_j u_i^j, \quad j = 0..h_i, \\
MI_i &= ts1 + u_i^0.
\end{aligned} \tag{14}
$$
Extending this approach to the case Bd = m × L, m − 1
dummy switches can now be inserted between each pair of
switches and thus:
$$
\begin{aligned}
UB_i &= ts1 + ts2 + m \sum_j u_i^j, \quad j = 0..h_i, \\
MI_i &= ts1 + u_i^0.
\end{aligned} \tag{15}
$$
In cases where Bd is not a multiple of L, i.e.,
Bd = m × L + k, 0 < k < L,
a straightforward conservative solution can round up Bd to
(m + 1) × L and consider m dummy switches; so, the
formula becomes
$$
\begin{aligned}
UB_i &= ts1 + ts2 + (m + 1) \sum_j u_i^j, \quad j = 0..h_i, \\
MI_i &= ts1 + u_i^0.
\end{aligned} \tag{16}
$$
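Equations (14) to (16) differ only in the factor that multiplies the sum of per-hop delays, which is the number of L-sized stages obtained after rounding Bd up to a multiple of L. A minimal sketch follows; the function name is hypothetical, and the list `u` is assumed to hold per-hop delays already computed for the Bd = L case.

```python
def bounds_deep_buffers(u, L, Bd, ts1=0, ts2=0):
    """Return (UB, MI) for Bd >= L using the dummy-switch argument.

    u[j] holds u_i^j for j = 0..h_i, computed as if Bd = L.
    """
    if Bd < L:
        raise ValueError("this sketch only covers Bd >= L")
    stages = (Bd + L - 1) // L  # m if Bd = m*L, otherwise m + 1
    UB = ts1 + ts2 + stages * sum(u)
    MI = ts1 + u[0]  # does not depend on the number of stages
    return UB, MI
```

Doubling the buffer depth doubles the delay bound and leaves the interval unchanged, which is the behavior discussed in the paragraph that follows.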
The above equation shows that $MI_i$, and thus $mBW_i$, is
not a function of m. In other words, the minimum
guaranteed bandwidth in RTB-HB does not depend on
the network’s buffering space when the buffer depth is larger
than the message length. Indeed, in this scenario, we assume
that a buffer of size m × L exists just before the arbitration
point of the inputs of the switches and, for calculation
purposes, we assume that m − 1 dummy switches are placed to
split the buffer into m stages; as the delay is equal for all the
stages, we can compare these stages in the network
with a pipeline in which increasing or decreasing the
number of stages does not affect the data injection rate into
the pipeline, although the delay will increase when m
increases. This is true for the buffers just at the outputs of
the source cores, and thus the injection rate to the network
is not affected when m changes. This is better shown in an
example in the next section. Also, since Section 4.1.2
proposes a solution for a parameterized packet
length L and Bd = 1, it is possible to combine this
solution with the dummy-switch solution to support
arbitrary Bd on different switches (i.e., $Bd_j$ on switch j).
It is possible to spread $Bd_j - 1$ dummy
switches among the elements of a buffer of depth $Bd_j$,
then reenumerate all the real and dummy switches in the
network, and solve the equations for Bd = 1 (in the whole
network) as suggested in Section 4.1.2.
Fig. 5. The complete calculation of $UB_i$ and $MI_i$ for the example NoC of
Fig. 2 with method RTB-HB when Bd < L (L = 4, Bd = 2).
Fig. 6. Insertion of a dummy switch.
The message length $L_i$ can vary on a per-flow basis. To
tackle this challenge, let us call the shortest message
length $L_{min}$. Let us first consider $Bd \le L_{min}$. In this case, it
is intuitively possible to use the same equations as in
Section 4.1.1. The only difference is that, for every
individual flow, $L_i$ must be used as a parameter instead
of L. Also, for $Bd \ge L_{min}$, it is possible to use exactly the
same approach used in Section 4.1.2. To further describe
this, when $Bd = m \times L_{min} + k$, with $0 < k < L_{min}$, it is
possible to consider m dummy switches (or alternatively
Bd − 1 dummy switches as described above) between each
two consecutive switches and do the calculations as
described in Section 4.1.2, using the $L_i$ values for the different
flows. Thus, based on the approach of variable message
length described here, throughout the paper, we have
used $L_i$ instead of L.
4.1.4 Virtual Channels
The extension of the proposed method to support virtual
channels is straightforward, provided that the index of the
virtual channel used for a flow in different switches is a
predefined parameter. For this purpose, all virtual channels
in intermediate switches can be considered as physical
channels in the proposed model. Arbitration conflicts
would be among all the incoming virtual channels, and
each output virtual channel would account for a separate
output port (so, e.g., the value for $U_i^{h_i}$ would change from
$L_i$ to $m \times L_i$, assuming m virtual channels per physical
channel). In the following equations, the modifications
needed to support virtual channels are shown. The
parameter $u_i^j[v_{i,j}]$ denotes the maximum delay to move
from the arbitration point of switch j to the arbitration
point of switch j + 1 when using the predefined input
virtual channel $v_{i,j}$ used for flow i at the input of switch j.
Similarly, $U_i^j[v_{i,j+1}]$ is the maximum delay to move from
the output buffer of switch j to the output buffer of switch
j + 1 using input virtual channel $v_{i,j+1}$ at the input of
switch j + 1. Also, $v_{i,h_i+1}$ is defined as the virtual channel
index used for flow i at the input port of the destination
($D_i$) IP core. The parameter $u_i^0[1]$ is used to calculate $MI_i$ as
shown below; in this case, the selection of virtual channel
number 1 is mandatory, as only one flow enters the virtual
switch inside an IP core:
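$$
\begin{aligned}
u_i^j[v_{i,j}] &= \mathrm{MAX}_x\left( U_i^j[v_{i,j+1}],\; U_{I(x)}^j[v_{I(x),j+1}] \right)
+ \sum_x U_{I(x)}^j[v_{I(x),j+1}], \\
&\qquad x = 1..zc(i, j),\; 0 \le j \le h_i, \\
U_i^j[v_{i,j+1}] &= \mathrm{MAX}_x\left( U_i^{j+1}[v_{i,j+2}],\; U_{I(x)}^{j+1}[v_{I(x),j+2}] \right)
+ \sum_x U_{I(x)}^{j+1}[v_{I(x),j+2}], \\
&\qquad x = 1..zc(i, j+1),\; 0 \le j \le h_i - 1, \\
U_i^{h_i}[v_{i,h_i+1}] &= m \times L_i, \\
UB_i &= ts1 + ts2 + \sum_j u_i^j[v_{i,j}], \quad j = 0..h_i, \\
MI_i &= ts1 + u_i^0[1].
\end{aligned} \tag{17}
$$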
To illustrate the support of virtual channels, let us consider
the example shown in Fig. 2 with two virtual channels per
physical channel. In this case, Table 5 shows the mapping
between flows and virtual channels at each switch input
port. Fig. 7 shows the calculation for this example. By
comparing the results with the case where no virtual
channels are used, it can be observed that virtual channels
can reduce the number of flow contentions in different
switches. At the same time, the bandwidth of a physical
channel is shared among its virtual channels, thus depend-
ing on the application and on the strategy that assigns
virtual channels to different flows, the worst case latency
and bandwidth values can be better or worse compared to
the case without virtual channels, regardless of extra
hardware cost and complexity of virtual channels.
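The statement that each output virtual channel counts as a separate output port can be illustrated by grouping flows according to the (switch, output port, virtual channel) triple they request. This is only a sketch of the bookkeeping, not of the full model in (17); the route tables, port names, and the function name are hypothetical and are not the mapping of Table 5.

```python
from collections import defaultdict


def contention_sets(routes):
    """Group flows by the output resource they request at each switch.

    routes maps a flow name to a list of (switch, output_port, vc) hops.
    Only groups with more than one flow (actual contention) are returned.
    """
    groups = defaultdict(set)
    for flow, hops in routes.items():
        for switch, port, vc in hops:
            groups[(switch, port, vc)].add(flow)
    return {key: flows for key, flows in groups.items() if len(flows) > 1}


# One virtual channel: F1 and F2 request the same output of SW1.
one_vc = {"F1": [("SW1", "east", 0)], "F2": [("SW1", "east", 0)]}
# Two virtual channels: F2 is assigned to the second one.
two_vc = {"F1": [("SW1", "east", 0)], "F2": [("SW1", "east", 1)]}
```

In this toy case the contention set disappears once the two flows are placed on different virtual channels, while the last-hop term in (17) grows from $L_i$ to $m \times L_i$ because the physical channel is shared.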
4.2 The Proposed Delay Model RTB-LL
We present a substantial improvement to the previously
published method WCFC [10]. WCFC also calculates the
upper bound propagation delays and permitted injection
intervals for the flows in a wormhole network. It considers
the arbitration contention packets face and the delay
incurred by other packets sharing some part of their route
due to such blockings. With a notation similar to that used
above, WCFC employs [10] the following equation to
calculate UBi and mIi:
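$$
\begin{aligned}
UB_i &= ts1 + ts2 + L_i + a + \sum_j u_i^j, \quad j = 0..h_i, \\
mI_i &= ts1 + L_i + \sum_j u_i^j - h_i \times Sd, \quad j = 0..h_i, \\
u_i^j &= Sd + \sum_x U_{I(x)}^j, \quad x = 1..zc(i, j),\; 0 \le j \le h_i.
\end{aligned} \tag{18}
$$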
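A minimal sketch of how (18) can be evaluated for one flow is given below. It assumes that the delays of the contending flows at each hop are already known, and it reads the interval expression as the sum of the per-hop delays minus $h_i \times Sd$; the function name, argument layout, and sample numbers are hypothetical.

```python
def wcfc_bounds(L, Sd, contender_delays, ts1=0, ts2=0, a=0):
    """Evaluate (18) for one flow.

    contender_delays[j] lists the U values of the flows contending at hop j,
    for j = 0..h_i (an empty list means no contention at that hop).
    """
    h = len(contender_delays) - 1
    u = [Sd + sum(rivals) for rivals in contender_delays]  # per-hop delay
    UB = ts1 + ts2 + L + a + sum(u)
    mI = ts1 + L + sum(u) - h * Sd
    return UB, mI
```

For instance, `wcfc_bounds(4, 2, [[], [6], []])` describes a flow with one contender of delay 6 at its middle hop.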
In the WCFC method, the calculations are based on the
assumption that each flow injects packets with a minimum
permitted interval. For the applications that can support
such an assumption, we present a method that provides
significant improvement in bound tightness over the
WCFC method, which we call RTB-LL. As RTB-LL is less
pessimistic than WCFC in evaluating worst case performance,
it enables the design of more hardware-efficient NoCs.
Fig. 7. The calculation of $UB_i$ and $MI_i$ for the example NoC in Fig. 2
using RTB-HB, considering two virtual channels per physical channel.
TABLE 5. Virtual Channel Selection for the Example Network in Fig. 2
To improve upon WCFC, a new concept, called
overlapping flows, is introduced. If two or more different
flows contend for the same output port at a switch, and
they also share the same input port, we call such flows
overlapping at the switch. This notion allows us to
significantly optimize the bound tightness.
When $F_i$ contends with multiple overlapping flows at a
switch, it is possible to locally coalesce all such overlapping
flows into a single one. This is because the arbitration
cannot be lost to more than one of those flows, as they cannot
physically produce contending packets simultaneously,
given that they enter the switch through the same input
port. If there exist, e.g., two overlapping contending flows
at hop j with delay parameters $U_{i1}^j$ and $U_{i2}^j$, it is possible to
consider $\mathrm{MAX}(U_{i1}^j, U_{i2}^j)$ as their representative delay,
instead of $U_{i1}^j + U_{i2}^j$, for calculating the parameters of $F_i$.
Moreover, whenever $F_i$ overlaps with other flows, those
other contending flows should be ignored. By applying
these optimizations, we have noticed a significant improvement
in bound tightness for RTB-LL, as shown in the next
section and summarized in Table 1.
Figs. 8 and 9 show the calculated $UB_i$ and $mI_i$ values for
both the WCFC and RTB-LL methods for the same example in
Fig. 2. Since flows F1 and F2 are overlapping at SW2, our
proposed RTB-LL improves the bound tightness compared
to WCFC. Consider, for the sake of exemplification, the NoC
variant shown in Fig. 10, with another contending flow at
SW2. In RTB-LL, the delay $u_3^2$ of F3 at SW2 can be modeled
as $Sd + \mathrm{MAX}(U_1^2, U_2^2)$ instead of the overly pessimistic
value $Sd + U_1^2 + U_2^2$ calculated by WCFC.
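The coalescing of overlapping flows can be sketched as a per-input-port maximum followed by a sum across ports. The sketch below assumes that contenders sharing the analyzed flow's own input port have already been filtered out, and that each contender is described by a hypothetical (input port, delay) pair; none of the names or numbers come from Fig. 10.

```python
from collections import defaultdict


def hop_delay(Sd, contenders, coalesce=True):
    """Delay at one hop given contenders as (input_port, U) pairs.

    coalesce=True  -> RTB-LL style: one representative (the MAX) per input port.
    coalesce=False -> WCFC style: the delays of all contenders are summed.
    """
    if not coalesce:
        return Sd + sum(U for _, U in contenders)
    worst_per_port = defaultdict(int)
    for port, U in contenders:
        worst_per_port[port] = max(worst_per_port[port], U)
    return Sd + sum(worst_per_port.values())


# Two overlapping contenders entering through the same (hypothetical) port.
overlapping = [("west", 6), ("west", 9)]
```

With these sample values the coalesced delay uses only the larger of the two contender delays, whereas the non-coalesced variant adds both.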
To show how virtual channels are supported in RTB-LL,
the same example for method RTB-HB is considered. Fig. 11
shows the results. Like in RTB-HB, using virtual channels
results in better latency and bandwidth bounds, since flow
contentions are resolved.
5 STUDIES ON APPLICATIONS
The proposed methods RTB-HB and RTB-LL can be used to
analyze the scheduling of traffic flows in real-world
applications. In this section, we present studies on a
multimedia application (four other multimedia and RT
applications are considered in the appendix, which can be
found on the Computer Society Digital Library at
http://doi.ieeecomputersociety.org/10.1109/TC.2011.240). We
compare methods RTB-HB and RTB-LL to the baseline
method WCFC, using the parameters listed in Table 6. In
these applications, we assume that NoC topologies are
predefined based on application communication require-
ments, but without any feedback from the proposed
algorithms to customize the network structure for better
upper bound delay and interval time results (considering
such a feedback is a possible extension for future work). In
particular, for many applications, it is possible to identify a
small subset of flows as critical, and then to optimize the
NoC based on feedback loops from RTB-HB and RTB-LL to
improve the performance of such critical flows. It is possible
to do this without dedicated hardware support or any
priority scheme.
5.1 Case Study: A Multimedia Application
In this section, we compare the results of applying RTB-HB,
RTB-LL, and WCFC to D26-media (Fig. 12), a real-time
multimedia application with 67 communication flows, some
of which are critical. The application is mapped onto different
NoC topologies, each with different switch counts and
switch radices. Fig. 14 shows the average flow latency for
different analysis methods and topologies. In particular, we
have shown the implementation for switch counts up to
seven, which are typical values, and for a topology with
20 switches, which is a reasonable point with many longer-hop
flows.
Fig. 8. The complete calculation of $UB_i$ and $mI_i$ for the example NoC of
Fig. 2 with method WCFC.
Fig. 9. The complete calculation of $UB_i$ and $mI_i$ for the example NoC of
Fig. 2 with method RTB-LL.
Fig. 10. NoC variant where flow F3 contends with two overlapping flows
F1 and F2 at SW2.
Fig. 11. Calculation of $UB_i$ and $mI_i$ for the example in Fig. 2 using
RTB-LL and WCFC, with two virtual channels per physical channel.
TABLE 6. Network Parameters for the Study
Topologies with one or few switches (e.g., 1-3
switches in this example) need them to be “fat” (high-
radix), while other cases need “medium” or “thin” (low-
radix) switches. It is important to note that the limitations in
physical implementation (like power consumption, area,
frequency, etc.) may limit the use of fat switches in practice;
we still present the results for these cases for the sake of
latency comparison, and assume that proper constraints can
be implemented at a higher level in the NoC synthesis flow.
Two detailed implementations with 5 and 20 switches are
shown in Fig. 13. Fig. 15 presents the results in terms of
latency, intervals and bandwidth for the whole set of flows
for 1-, 5- and 20-switch networks. Figs. 15a, 15b, and 15c
compare UBi. The RTB-LL model always provides the
tightest bounds. Compared to WCFC, the largely improved
tightness (more than 50 percent on average) is due to the
analysis of overlapping flows, a novelty of this paper.
Please note that the improved tightness comes without any
impact on the accuracy of the bounds, which are still under
worst case assumptions. For the topology with only one
switch and without overlapping flows in Fig. 15a, the
results for RTB-LL and WCFC are identical but increasing
switch counts triggers different performance profiles. RTB-
HB is intrinsically expected to return higher worst case
latencies, due to the assumption that no hardware traffic
injection regulation facilities are available. Still, due to the
more accurate calculation approach, the bounds are on
average 30 percent lower than those given by WCFC,
despite the less restrictive assumptions. There are, however,
a few flows for which WCFC predicts lower delays than
RTB-HB, due to the regulated injection assumption. In a
zero-load scenario (with no contention at all), the minimum
theoretical latency to traverse the 5-switch NoC for flows
spanning a single hop is eight cycles (Sd + L), while RTB-LL
gives a minimum upper-bound of 17 cycles in worst case
contention. The delays calculated for the 20-switch topol-
ogy, in this example, are higher, as a result of longer paths
(more hops) per flow, higher probability of contention, and
especially for RTB-HB, more in-flight packets. This sug-
gests, as intuitively expected, that NoCs with fewer hops
guarantee, on average, lower delay bounds. As described
earlier, physical implementation limitations may prevent
using fat switches in practice. On the other hand, increasing
the number of switches not only increases the system cost
but also requires careful consideration in the design process to
reduce flow contentions and obtain tighter upper bound
delays; so, a trade-off may be considered for the number of
switches. For RTB-LL, the number of contending in-flight
packets is unrelated to the NoC topology, so the delay
will not increase as a result of more hops.
Figs. 15d, 15e, and 15f show the maximum and minimum
injection intervals ($MI_i$ and $mI_i$). Intuitively, if traversal
delays are lower, new packets can be injected sooner, so the $MI_i$
($mI_i$) plots resemble the $UB_i$ trends: flows with lower latencies
can be injected more frequently. Thus, the $mI_i$ intervals are
always shorter in RTB-LL, and the $MI_i$ intervals are often shorter
in RTB-HB, when compared to $mI_i$ in WCFC (except when
using a few fat switches). These intervals can be directly
translated into $mBW_i$ and $MBW_i$ using the equations in
Table 3. Results are shown in Figs. 15g, 15h, and 15i. Compared
to WCFC, the maximum injectable bandwidths ($MBW_i$) are, on
average, 35 percent higher according to RTB-LL, and the minimum
bandwidths ($mBW_i$) in RTB-HB are, on average, 25 percent higher.
The maximum theoretical injectable bandwidth is 1,600 MB/s
(Freq × FlitWidth); according to RTB-LL, even under worst case
assumptions, some flows on the 5-switch NoC are guaranteed
injection rates of as much as 533 MB/s.
Fig. 12. Communication graph for D26-media.
Fig. 13. D26-media application mapped on 5-switch and 20-switch NoC.
Fig. 14. Average upper bound delay of the flows in D26-media for
methods WCFC, RTB-LL, and RTB-HB, applied to topologies with
different switch counts.
In the 20-switch network, the
higher contention likelihood affects injectable bandwidth
negatively, but the use of more resources has a positive effect
on many-hop flows, resulting in comparable injectable
bandwidths. In summary, NoCs with few hops exhibit
clearly better flow average upper bound traversal delays, but
in terms of injectable bandwidths, the mapping of the flows
(i.e., the contention patterns) and the amount of used
resources play a decisive performance role.
5.2 The Effect of Virtual Channels
The results of employing different numbers of virtual
channels in the network in Fig. 12 are reported in Figs. 16
and 17. Here we consider the D26-media application with the
RTB-HB and RTB-LL methods for one, two, and four virtual
channels per physical channel. The strategy that assigns
virtual channels to the flows shares the load of the input
ports among the virtual channels, thus minimizing the
contentions on switch output ports. The figures show that,
for both methods, increasing the number of virtual channels
results in better average RT metrics for different flows.
Fig. 15. (a), (d), (g) $UB_i$, $MI_i$ ($mI_i$), and $mBW_i$ ($MBW_i$) characterized with
WCFC, RTB-LL, and RTB-HB for D26-media mapped onto a 1-switch
NoC. (b), (e), (h) The same metrics for mapping onto a 5-switch NoC.
(c), (f), (i) The same metrics for mapping onto a 20-switch NoC. Horizontal
axes enumerate the communication flows.
Fig. 16. Worst case delay comparison for D26-media application, in
method RTB-HB with 1, 2, and 4 virtual channels.
Fig. 17. Worst case delay comparison for D26-media application, in
method RTB-LL with 1, 2, and 4 virtual channels.
5.3 Study with Variable Buffer Depths
Figs. 18 and 19 show the comparison of worst case delay
and maximum interval for the D26-media application for the
case Bd < L. The figures suggest that shallower buffers,
when the message length is fixed, result in smaller
worst case delay and maximum interval bounds. Fig. 20
illustrates the effects of increasing the buffer depth Bd, in all
switches of the NoC, as an integer multiple of the message
length L. The test is run for the D26-media application using
method RTB-HB. A linear relationship can be observed
between Bd and the upper bound delay $UB_i$. However, because of
the pipeline effect described in Section 4.1.3, there is no
such relation between Bd and $mBW_i$.
A contradiction may be perceived, since deeper buffers are
generally expected to improve performance, while Fig. 20a
reveals worse latency with deeper buffers. In fact, the average
latency is probably improved with a larger Bd, but the worst
case latency is not, as shown in Fig. 20a. To understand why
the worst case latency deteriorates, consider the following
explanation. When the basic case of Bd = L is considered,
RTB-HB calculates the maximum delay for a packet $P_i$ from
the time the packet is supposed to be injected into the
network until it is ejected at the destination. Since the
output buffer of the source core has a depth of Bd = L, it is
not possible to have more than one packet in this buffer
waiting to be serviced before $P_i$. In the worst situation, the
traffic generator can inject a packet every $MI_i$ cycles, at most;
if it injects more, the traffic generator may be stalled by NoC
backpressure until this interval elapses. In the more complex
case Bd > L, some packets may be queued in the source core
buffer ahead of Pi, and since RTB-HB imposes no restriction
on injection rates, they are expected to be there by a worst
case analysis. Thus, the calculated worst case delay for Pi
includes the extra time needed to service them. Exactly the
same reasoning applies to Bd > L in other intermediate
buffers in the network. As a consequence, deeper buffers
increase worst case latencies, but this is not incompatible
with the fact that increasing buffer size will decrease the
average delay. Indeed if we apply the same traffic pattern to
two identical networks, but with different buffer sizes, the
total average delay in the network with larger buffers will
typically be lower than that of the other network. Our
method RTB-LL does not consider the buffering space Bd to
calculate the worst case delay; instead it uses stage delay Sd.
Therefore, changing the buffer size will not affect the
calculated worst case delay.
5.4 Suitability to Critical Flows
Fig. 21 shows the average $UB_i$ traversal delay and the
average $mBW_i$ injectable bandwidth for the flows traversing
x hops of the 5-switch NoC, considering the D26-media
application and using RTB-HB. It can be seen that 1-hop flows
exhibit reasonably low latencies and high bandwidths, suitable for
critical traffic loads. Thus, the proposed methodology has a
clear applicability to industrial RT applications.
6 COMPLEXITY OF THE METHODS
To estimate the time complexity of the proposed methods,
we calculate the maximum number of required operations.
As (1), (3) and (5) show, the only operations are additions
and comparisons (for the MAX operator); we consider one
cycle to execute each of such operations. We call h the
maximum number of switches traversed by a flow, and k
the number of flows. We also pessimistically assume the
maximum number of contending flows at a switch output
to be k. For calculating one $U_i^j$ parameter, we need
(according to (5)) at most k comparisons and k additions,
thus a total of 2k operations. The number of $U_i^j$ parameters
to be calculated is hk; so, the maximum number of
operations is $2hk^2$. Each $u_i^j$ parameter (except for j = 0) can
be derived from the equality $U_i^j = u_i^{j+1}$; thus, we only
need to calculate $u_i^0$ for all flows. In this case, one
$u_i^0$ (3) needs 2k operations; for all $u_i^0$ parameters we need
$2k^2$ operations.
Fig. 18. Worst case delay comparison for D26-media application, L = 4,
and Bd = 1, 2, and 4.
Fig. 19. Maximum interval comparison for D26-media application, L = 4,
and Bd = 1, 2, and 4.
Fig. 20. (a) Average flow Upper Bound Delay versus Buffer Depth on left
and (b) Average flow Bandwidth versus Buffer Depth when it is a multiple
of Message Length on right in D26-media mapped on 5-switch NoC.
Fig. 21. (a) Average Upper Bound Delay on left and (b) Average
Minimum Bandwidth for x-hop flows on right in D26-media for RTB-HB
Method.
In RTB-HB, the outcome is k $UB_i$ values and k $MI_i$ values.
For calculating one $UB_i$ value (according to (1)), we need
h + 1 additions; so, for all k $UB_i$ values, we need (h + 1)k
operations, while k operations are needed in the case of
$MI_i$. The total number of operations is the summation of all
the above, i.e., $2hk^2 + 2k^2 + (h + 1)k + k$. Therefore, the
complexity of the algorithm is $O(hk^2)$.
For RTB-LL, using the same approach, we can show that
the complexity of the algorithm for calculating $UB_i$ and $mI_i$
is again $O(hk^2)$. Thus, both algorithms have time complexity
that is quadratic in the number of flows. The time complexity of
the WCFC algorithm is also similar to that of RTB-LL, as it exhibits
a similar recursive behavior [10]. In practice, the execution time
for all our test applications is very small (a few seconds on a
standard PC) and the modeling of delay and bandwidth parameters
does not pose significant runtime issues.
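The RTB-HB operation count derived above can be tabulated directly. The sketch below simply adds the four contributions named in the text; the function name is hypothetical, and the sample hop count used afterwards is an assumed value rather than a figure from the case study.

```python
def rtb_hb_operations(h, k):
    """Worst case operation count for RTB-HB with h hops and k flows."""
    U_ops = 2 * h * k**2   # h*k parameters U, 2k operations each
    u0_ops = 2 * k**2      # one u^0 per flow, 2k operations each
    UB_ops = (h + 1) * k   # h + 1 additions per UB value
    MI_ops = k             # one operation per MI value
    return U_ops + u0_ops + UB_ops + MI_ops
```

For k = 67 flows, as in D26-media, and an assumed h = 5, the count is 54,337 operations, which is consistent with the short execution times reported.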
7 CONCLUSION AND FUTURE WORK
We have proposed two different methods to characterize
bandwidth and latency for NoC-based real-time SoCs,
aiming at guaranteed QoS provisions. The choice of the
most suitable method depends on the performance
demands of the system and on whether dedicated hardware
facilities can be provided by the NoC. One method is aimed
at applications demanding the minimum latencies and
requires injection regulation, while the other is suitable for
applications where packet injection must be flexible to
accommodate higher average injected bandwidths and
no hardware regulation is possible. We have shown that
the proposed methods return the worst case metrics in a
much tighter way than existing approaches, rendering them
quite applicable for real-world SoC applications. The next
step is to use the results of this work as an input to NoC
synthesis and optimization tools whereby the QoS
demands of critical traffic flows are met.
ACKNOWLEDGMENTS
This work is partly supported by EU ARTIST-DESIGN and
ARTEMIS project SMECY.
REFERENCES
[1] W.J. Dally and B. Towles, “Route Packets, Not Wires: On-Chip
Interconnection Networks,” Proc. 38th Ann. Design Automation
Conf. (DAC), pp. 684-689, 2001.
[2] L. Benini and G.D. Micheli, “Networks on Chips: A New SoC
Paradigm,” IEEE Computer J., vol. 35, no. 1, pp. 70-78, Jan. 2002.
[3] P. Guerrier and A. Greiner, “A Generic Architecture for On-Chip
Packet-Switched Interconnections,” Proc. Design, Automation and
Test in Europe Conf. and Exhibition (DATE ’00), pp. 250-256, 2000.
[4] Z. Guz, I. Walter, E. Bolotin, I. Cidon, R. Ginosar, and A. Kolodny,
“Network Delays and Link Capacities in Application-Specific
Wormhole NoCs,” Proc. 17th Int’l Conf. VLSI Design, May 2007.
[5] J. Henkel, W. Wolf, and S. Chakradhar, “On-Chip Networks: A
Scalable, Communication-Centric Embedded System Design
Paradigm,” Proc. 17th Int’l Conf. VLSI Design (VLSID ’04),
vol. 17, pp. 845-851, Jan. 2004.
[6] S. Furber and J. Bainbridge, “Future Trends in SoC Interconnect,”
Proc. IEEE Int’l Symp. VLSI Design, Automation and Test, pp. 183-
186, 2005.
[7] Z. Shi and A. Burns, “Real-Time Communication Analysis for On-
Chip Networks with Wormhole Switching,” Proc. ACM/IEEE
Second Int’l Symp. Networks-on-Chip, pp. 161-170, 2008.
[8] C. Paukovits and H. Kopetz, “Concepts of Switching in the Time-
Triggered Network-on-Chip,” Proc. IEEE Int’l Conf. Embedded and
Real-Time Computing Systems, pp. 120-129, 2008.
[9] K. Goossens, J. Dielissen, and A. Radulescu, “The Æthereal
Network on Chip: Concepts, Architectures, and Implementations,”
IEEE Design and Test of Computers, vol. 22, no. 5, pp. 414-421,
Sept./Oct. 2005.
[10] S. Lee, “Real-Time Wormhole Channels,” J. Parallel and Distributed
Computing, vol. 63, pp. 299-311, 2003.
[11] D. Rahmati et al., “A Method for Calculating Hard QoS
Guarantees for Networks-on-Chip,” Proc. IEEE/ACM Int’l Conf.
Computer-Aided Design, ICCAD, pp. 579-586, 2009.
[12] D. Kandlur, K. Shin, and D. Ferrari, “Real-Time Communication
in Multihop Networks,” IEEE Trans. Parallel and Distributed
Systems, vol. 5, no. 10, pp. 1044-1056, Oct. 1994.
[13] M. Zhang, J. Shi, T. Zhang, and Y. Hu, “Hard Real-Time
Communication over Multi-Hop Switched Ethernet,” Proc. Int’l
Conf. Networking, Architecture, and Storage, pp. 121-128, 2008.
[14] S. Gopalakrishnan, S. Lui, and M. Caccamo, “Hard Real-Time
Communication in Bus-Based Networks,” Proc. IEEE 25th Int’l
Real-Time Systems Symp., pp. 405-414, 2004.
[15] A. Yiming and T. Eisaka, “A Switched Ethernet Protocol for Hard
Real-Time Embedded System Applications,” Proc. 19th Conf.
Advanced Information Networking and Applications, pp. 41-44, Mar.
2005.
[16] K. Watson and J. Jasperneite, “Determining End-to-End Delays
Using Network Calculus,” Proc. Fifth IFAC Int’l Conf. Fieldbus
Systems and Their Applications (IFAC-FET ’03), pp. 255-260, July
2003.
[17] J. Chen, Z. Wang, and Y. Sun, “Real-Time Capability Analysis for
Switch Industrial Ethernet Traffic Priority-Based,” Proc. Int’l Conf.
Control Applications, pp. 525-529, Sept. 2002.
[18] J. Jasperneite, P. Neumann, M. Theis, and K. Watson, “Deterministic
Real-Time Communication with Switched Ethernet,” Proc.
IEEE Fourth Int’l Workshop Factory Comm. Systems (WFCS ’02),
pp. 11-18, 2002.
[19] S. Lee, K.C. Lee, and H.H. Kim, “Maximum Communication
Delay of a Real-Time Industrial Switched Ethernet with Multiple
Switching Hubs,” Proc. IEEE 30th Conf. Industrial Electronics Soc.,
vol. 3, pp. 2327-2332, 2004.
[20] J. Loeser and H. Haertig, “Low-Latency Hard Real-Time Communication
over Switched Ethernet,” Proc. 16th Euromicro Conf.
Real-Time Systems (ECRTS), pp. 13-22, 2004.
[21] J. Kiszka, B. Wagner, Y. Zhang, and J. Broenink, “RTnet - A
Flexible Hard Real-Time Networking Framework,” Proc. IEEE
10th Int’l Conf. Emerging Technologies and Factory Automation,
pp. 450-456, 2005.
[22] J. Hu, U.Y. Ogras, and R. Marculescu, “System-Level Buffer
Allocation for Application-Specific Networks-on-Chip Router
Design,” IEEE Trans. Computer-Aided Design of Integrated Circuits
and Systems, vol. 25, no. 12, pp. 2919-2933, Dec. 2006.
[23] A.E. Kiasari, H. Sarbazi-Azad, and M. Ould-Khaoua, “An
Accurate Mathematical Performance Model of Adaptive Routing
in the Star Graph,” Future Generation Computer Systems, vol. 24,
no. 6, pp. 461-474, 2008.
[24] J. Kim and C.R. Das, “Hypercube Communication Delay with
Wormhole Routing,” IEEE Trans. Computers, vol. 43, no. 7, pp. 806-
814, July 1994.
[25] I. Cohen, O. Rottenstreich, and I. Keslassy, “Statistical Approach
to Networks-on-Chip,” IEEE Trans. Computers, vol. 59, no. 6,
pp. 748-761, June 2010.
[26] A.E. Kiasari, H. Sarbazi-Azad, and S. Hessabi, “Caspian: A
Tunable Performance Model for Multi-Core Systems,” Proc. 14th
Int’l Euro-Par Conf. Parallel Processing, vol. 5168, pp. 100-109, 2008.
[27] U.Y. Ogras and R. Marculescu, “Analytical Router Modeling for
Networks-on-Chip Performance Analysis,” Proc. Conf. Design,
Automation and Test in Europe (DATE), pp. 1-6, 2007.
[28] V. Soteriou, H. Wang, and L.-S. Peh, “A Statistical Traffic Model for
On-Chip Interconnection Networks,” Proc. IEEE Int’l Symp.
Modeling, Analysis, and Simulation of Computer and Telecomm.
Systems, pp. 104-116, 2006.
[29] W.J. Dally, “Performance Analysis of k-ary n-Cube Interconnection
Networks,” IEEE Trans. Computers, vol. 39, no. 6, pp. 775-785,
June 1990.
[30] J. Draper and J. Ghosh, “A Comprehensive Analytical Model for
Wormhole Routing in Multicomputer Systems,” J. Parallel and
Distributed Computing, vol. 23, no. 2, pp. 202-214, 1994.
[31] P. Hu and L. Kleinrock, “An Analytical Model for Wormhole
Routing with Finite Size Input Buffers,” Proc. 15th Int’l Teletraffic
Congress, pp. 549-560, 1997.
[32] R. Marculescu, U.Y. Ogras, L.-S. Peh, N.E. Jerger, and Y. Hoskote,
“Outstanding Research Problems in NoC Design: System, Microarchitecture,
and Circuit Perspectives,” IEEE Trans. Computer-Aided Design of
Integrated Circuits and Systems, vol. 28, no. 1, pp. 3-21, Jan. 2009.
[33] S. Foroutan et al., “An Analytical Method for Evaluating Network-
on-Chip Performance,” Proc. Conf. Design, Automation and Test in
Europe (DATE), pp. 1629-1632, 2010.
[34] H. Kopetz et al., “Distributed Fault-Tolerant Real-Time Systems:
The Mars Approach,” IEEE Micro, vol. 9, no. 1, pp. 25-40,
Feb. 1989.
[35] J. Liang, S. Swaminathan, and R. Tessier, “aSOC: A Scalable,
Single-Chip Communications Architecture,” Proc. Int’l Conf.
Parallel Architectures and Compilation Techniques, pp. 37-46, 2000.
[36] M. Millberg, R.T.E. Nilsson, and A. Jantsch, “Guaranteed
Bandwidth Using Looped Containers in Temporally Disjoint
Networks within the Nostrum Network on Chip,” Proc. Design,
Automation and Test in Europe (DATE) Conf. and Exhibition, pp. 890-
895, 2004.
[37] A. Hansson, M. Subburaman, and K. Goossens, “aelite: A Flit-
Synchronous Network on Chip with Composable and Predictable
Services,” Proc. Design, Automation and Test in Europe (DATE),
pp. 250-255, 2009.
[38] J. Diemer and R. Ernst, “Back Suction: Service Guarantees for
Latency-Sensitive on-Chip Networks,” Proc. Networks-on-Chip Int’l
Symp. (NOCS), pp. 155-162, 2010.
[39] A. Hansson, M. Wiggers, A. Moonen, K. Goossens, and M.
Bekooij, “Enabling Application-Level Performance Guarantees in
Network-Based Systems on Chip by Applying Dataflow Analysis,”
IET J. Computers and Digital Techniques, vol. 3, no. 5, pp. 398-412,
Sept. 2009.
[40] E. Bolotin et al., “QNoC: QoS Architecture and Design Process for
Network on Chip,” J. Systems Architecture, vol. 50, nos. 2/3,
pp. 105-128, 2004.
[41] T. Bjerregaard and J. Sparsoe, “A Router Architecture for
Connection-Oriented Service Guarantees in the MANGO Clockless
Network-on-Chip,” Proc. Design, Automation and Test in Europe
(DATE), vol. 2, pp. 1226-1231, 2005.
[42] A. Bouhraoua and M.E. Elrabaa, “A High-Throughput Network-
on-Chip Architecture for Systems-on-Chip Interconnect,” Proc.
Int’l Symp. System-on-Chip (SoC), pp. 1-4, Nov. 2006.
[43] F. Felicijan and S. Furber, “An Asynchronous on-Chip Network
Router with Quality-of-Service (QoS) Support,” Proc. IEEE Int’l
SOC Conf. (SOCC), pp. 274-277, Sept. 2004.
[44] N. Kavaldjiev et al., “A Virtual Channel Network-on-Chip for GT
and BE Traffic,” Proc. IEEE CS Ann. Symp. Emerging VLSI
Technologies and Architectures (ISVLSI), Mar. 2006.
[45] A. Leroy et al., “Spatial Division Multiplexing: A Novel
Approach for Guaranteed Throughput on NoCs,” Proc. IEEE/
ACM/IFIP Int’l Conf. Hardware/Software Codesign and System
Synthesis (CODES+ISSS), pp. 81-86, 2005.
[46] A. Mello, L. Tedesco, N. Calazans, and F. Moraes, “Evaluation of
Current QoS Mechanisms in Network on Chip,” Proc. Int’l Symp.
System-on-Chip (SoC), pp. 115-118, 2006.
[47] R. Mullins, A. West, and S. Moore, “The Design and Implementation
of a Low-Latency on-Chip Network,” Proc. Asia and South
Pacific Conf. Design Automation (ASP-DAC), 2006.
[48] A. Radulescu et al., “An Efficient on-Chip NI Offering Guaranteed
Services, Shared-Memory Abstraction, and Flexible Network
Configuration,” IEEE Trans. Computer-Aided Design, vol. 24,
no. 1, pp. 4-17, Jan. 2005.
[49] E. Rijpkema et al., “Trade Offs in the Design of a Router with Both
Guaranteed and Best-Effort Services for Network on Chip,” IEE
Proc. Computers and Digital Techniques, vol. 150, no. 5, pp. 294-302,
Sept. 2003.
[50] E. Salminen, A. Kulmala, and T. Hamalainen, “Survey of
Network-on-Chip Proposals,” www.ocpip.org, Mar. 2008.
[51] S. Balakrishnan and F. Ozguner, “A Priority-Driven Flow Control
Mechanism for Real-Time Traffic in Multiprocessor Networks,”
IEEE Trans. Parallel and Distributed Systems, vol. 9, no. 7, pp. 665-
678, July 1998.
[52] Y. Qian, Z. Lu, and W. Dou, “Analysis of Worst-Case Delay
Bounds for Best-Effort Communication in Wormhole Networks on
Chip,” IEEE Trans. Computer-Aided Design, vol. 29, no. 5, pp. 802-
815, May 2010.
[53] C. Chang, Performance Guarantees in Communication Networks.
Springer-Verlag, 2000.
[54] J.-Y. Le Boudec and P. Thiran, Network Calculus: A Theory of
Deterministic Queuing Systems for the Internet. Springer-Verlag,
2001.
[55] F. Jafari, Z. Lu, A. Jantsch, and M.H. Yaghmaee, “Optimal
Regulation of Traffic Flows in Networks-on-Chip,” Proc. Design,
Automation and Test in Europe Conf. and Exhibition (DATE),
pp. 1621-1624, 2010.
[56] M. Bakhouya et al., “Analytical Modeling and Evaluation of On-
Chip Interconnects Using Network Calculus,” Proc. ACM/IEEE
Int’l Symp. Networks-on-Chip (NOCS), pp. 74-79, 2009.
[57] S. Murali et al., “Design of Application-Specific Networks on
Chips with Floorplan Information,” Proc. IEEE/ACM Int’l Conf.
Computer-Aided Design (ICCAD), pp. 355-362, 2006.
Dara Rahmati received the BSc degree in
computer engineering from the University of
Tehran, Iran, in 1998, and the MSc degree in
computer engineering from the University of
Tehran in 2001. He is currently working toward
the PhD degree in computer engineering at
Sharif University of Technology, Tehran, Iran.
His research interests are in design and performance
evaluation of optimized networks-on-chip
architectures, including topology selection, routing
and switching algorithms, performance boundary evaluation, and
QoS support for real-time networks-on-chip. He is a student member of
the IEEE.
Srinivasan Murali received the MS and PhD
degrees in electrical engineering from Stanford
University, Palo Alto, CA, in 2007. He is a
cofounder and the chief technical officer with
iNoCs, Lausanne, Switzerland. He is also
currently a research scientist with Ecole Polytechnique
Federale de Lausanne. He has
authored a book and presented more than 40
publications in leading conferences and journals.
His current research interests include interconnect
design for systems-on-chips, thermal modeling, and reliability of
multicore systems. He is a recipient of the European Design and
Automation Association Outstanding Dissertation Award in 2007 for his
work on interconnect architecture design. He received the Best Paper
Award at the Design Automation and Test in Europe Conference in
2005. One of his papers has also been selected as one of “The Most
Influential Papers of 10 Years DATE”. He is a member of the IEEE and
the IEEE Computer Society.
466 IEEE TRANSACTIONS ON COMPUTERS, VOL. 62, NO. 3, MARCH 2013
Luca Benini received the PhD degree in
electrical engineering from Stanford University
in 1997. He is a full professor at the Department
of Electrical Engineering and Computer Science
(DEIS) of the University of Bologna. He also
holds a visiting faculty position at the Ecole
Polytechnique Federale de Lausanne (EPFL)
and he is currently serving as a chief architect
for the Platform 2012 project in STMicroelectronics,
Grenoble. His research interests are in
energy-efficient system design and multicore SoC design. He is also
active in the area of energy-efficient smart sensors and sensor networks
for biomedical and ambient intelligence applications. He has published
more than 500 papers in peer-reviewed international journals and
conferences, four books and several book chapters. He has been
general chair and program chair of the Design Automation and Test in
Europe Conference. He has been an associate editor of several
international journals, including the IEEE Transactions on Computer
Aided Design of Circuits and Systems and the ACM Transactions on
Embedded Computing Systems. He is a fellow of the IEEE, a member of
the Academia Europaea, and a member of the steering board of the
ARTEMISIA European Association on Advanced Research & Technology
for Embedded Intelligence and Systems.
Federico Angiolini received the MS degree
(summa cum laude) in electrical engineering
from the University of Bologna, Bologna, Italy,
in 2003, and the PhD degree from the
Department of Electronics and Computer
Science, University of Bologna, in 2008. He is
currently the vice president of engineering with
iNoCs, Lausanne VD, Switzerland. His
current research interests include memory
hierarchies, multiprocessor-embedded systems,
and networks-on-chip. He is a member of the IEEE and the IEEE
Computer Society.
Giovanni De Micheli (S’79-M’79-SM’80-F’94) is
currently a professor and the director of the
Institute of Electrical Engineering and the Integrated
Systems Centre, Ecole Polytechnique
Federale de Lausanne, Switzerland. He is also a
program leader of the Nano-Tera.ch program.
He was a professor of electrical engineering at
Stanford University. His research interests include
several aspects of design technologies for
integrated circuits and systems, such as synthesis
for emerging technologies, networks on chips, 3-D integration,
heterogeneous platform design including electrical components and
biosensors, as well as data processing of biomedical information. He
is the recipient of the 2003 IEEE Emanuel Piore Award. He is a fellow of
the Association for Computing Machinery. He was also the recipient of
the Golden Jubilee Medal for outstanding contributions to the IEEE CAS
Society in 2000 and the 1987 D. Pederson Award for the best paper in
the IEEE Transactions on Computer-Aided Design of Integrated Circuits
and Systems (TCAD/ICAS). He was the Division 1 Director (2008-2009),
cofounder and the president elect of the IEEE Council on EDA (2005-
2007), the president of the IEEE CAS Society 2003, the editor in chief of
the IEEE TCAD/ICAS (1987-2001). He has been the chair of several
conferences, including DATE (2010), pHealth (2006), VLSI SOC (2006),
DAC (2000), and ICCD (1989). He is a fellow of the IEEE.
Hamid Sarbazi-Azad received the BSc degree
in electrical and computer engineering from
Shahid-Beheshti University, Tehran, Iran, in
1992, the MSc degree in computer engineering
from Sharif University of Technology, Tehran, in
1994, and the PhD degree in computing science
from the University of Glasgow, United Kingdom,
in 2002. He is currently a professor of computer
engineering at Sharif University of Technology,
and heads the School of Computer Science of
the Institute for Research in Fundamental Sciences (IPM), Tehran, Iran.
His research interests include high performance computer architectures,
NoCs and SoCs, parallel and distributed systems, performance
modeling/evaluation, graph theory and combinatorics, and wireless/
mobile networks, on which he has published more than 200 refereed
conference and journal papers. He received the Khwarizmi International
Award in 2006 and the TWAS Young Scientist Award in engineering
sciences in 2007. He is a member of the managing board of the Computer
Society of Iran (CSI), and has been serving as the editor in chief for the
CSI Journal on Computer Science and Engineering since 2005. He is an
associate editor of IEEE Transactions on Computers and has guest
edited several special issues on high-performance computing architectures
and networks (HPCANs) in related journals. He is a member of the
ACM and the IEEE.
