Least Upper Delay Bound for VBR Flows in Networks-on-Chip with Virtual Channels
FAHIMEH JAFARI, KTH Royal Institute of Technology, Sweden
ZHONGHAI LU, KTH Royal Institute of Technology, Sweden
AXEL JANTSCH, Vienna University of Technology, Austria
Real-time applications such as multimedia and gaming require stringent performance guarantees, usually
enforced by a tight upper bound on the maximum end-to-end delay. For FIFO multiplexed on-chip packet
switched networks, we consider worst-case delay bounds for Variable Bit-Rate (VBR) flows with aggregate
scheduling, which schedules multiple flows as an aggregate flow. VBR flows are characterized by a maximum
transfer size (L), peak rate (p), burstiness (σ), and average sustainable rate (ρ). Based on network
calculus, we present and prove theorems to derive per-flow end-to-end Equivalent Service Curves (ESC),
which are in turn used for computing Least Upper Delay Bounds (LUDBs) of individual flows. In a realistic
case study we find that the end-to-end delay bound is up to 46.9% more accurate than a bound that does
not consider the traffic peak behavior. Results show similar improvements for synthetic traffic
patterns. The proposed methodology is implemented in C++ and has low run-time complexity, enabling
quick evaluation for large and complex SoCs.
Categories and Subject Descriptors: C.4 [Performance of Systems]: Modeling techniques
General Terms: Design, Performance
Additional Key Words and Phrases: Network-on-chip (NoC), performance evaluation, network calculus,
worst-case delay bound, FIFO multiplexing
ACM Reference Format:
Jafari, F., Lu, Z., and Jantsch, A. 2015. Least Upper Delay Bound for VBR Flows in Networks-on-Chip with
Virtual Channels. ACM Trans. Des. Autom. Electron. Syst. V, N, Article A (January YYYY), 34 pages.
DOI = 10.1145/0000000.0000000 http://doi.acm.org/10.1145/0000000.0000000
1. INTRODUCTION
In networks-on-chip, resources like wires, buffers, and switches are shared among
multiple communication flows to provide cost efficiency. At the same time many ap-
plications have real-time requirements and, consequently, delay and throughput con-
straints on the communication. To guarantee maximum delay and minimum through-
put for one given communication flow, the interference in the shared resources from
other flows has to be analyzed and bounded. We assume that all traffic can be well
characterized as flows and is scheduled in aggregate, which means that multiple flows are
scheduled as a single aggregate flow. For a given flow, we study the maximum interference
of all other flows based on network calculus theory [Le Boudec et al. 2004].
Author’s addresses: F. Jafari and Z. Lu, Department of Electronic and Computer Systems, School of
Information and Communication Technology, KTH Royal Institute of Technology, Stockholm, Sweden;
email: {fjafari, zhonghai}@kth.se; A. Jantsch, Vienna University of Technology, Vienna, Austria; email:
jantsch@ict.tuwien.ac.at.
© YYYY ACM 1084-4309/YYYY/01-ARTA $10.00
ACM Transactions on Design Automation of Electronic Systems, Vol. V, No. N, Article A, Pub. date: January YYYY.
In network calculus, flows are characterized by arrival curves, and the service offered
to flows by a network element such as a link or a switch is abstracted as a service curve.
Since network contention for shared resources includes not only direct contention
but also indirect contention, predicting the worst-case performance is extremely hard.
To calculate an accurate delay bound per flow, the main problem is to obtain the
end-to-end Equivalent Service Curve (ESC) and the internal output arrival curves of individual
flows in an arbitrary network of servers in terms of the latencies of the individual
schedulers in the network. Since the theorems required for calculating performance
metrics of VBR traffic transmitted in FIFO order and scheduled in aggregate had
not been presented before, we defined and proved them [Jafari et al. 2011; Jafari et al. 2012]
based on network calculus [Chang 2000; Le Boudec et al. 2004]. In [Jafari
et al. 2011], we proposed and proved the theorem required for deriving the output
characterization of VBR traffic under the defined system model, which gives an exact view
of the output metrics used for obtaining performance bounds. In [Jafari et al. 2012],
we defined and proved the theorems required for computing the end-to-end ESC and the
end-to-end delay bound, and presented a simple example to show how the proposed
theorems can be used in a network. The method presented in [Jafari et al.
2012] considers only direct contentions of a tagged flow. In this paper, we use these
theorems [Jafari et al. 2011; Jafari et al. 2012] to present a formal approach for
performance analysis that models both direct and indirect contentions.
VBR is a class of traffic in which the rate can vary significantly from time to time,
containing bursts. Real-time compressed voice and video and time-sensitive bursty
data traffic are examples of VBR traffic. Real-time VBR flows can be characterized by
a set of four parameters, (L, p, σ, ρ), where L is the maximum transfer size, p the peak rate,
σ the burstiness, and ρ the average sustainable rate [Le Boudec et al. 2004]. For instance, consider
a NoC with a link data width of 32 bits and a frequency of 500 MHz, which gives a link
bandwidth of 16 Gbits/s (32 bits × 500 MHz). An HDTV video stream can be characterized
with L = 32 bits, p = 16 Gbits/s, σ = 960 Kbits, ρ = 76 Mbits/s. Our assumption is
that the application-specific nature of the network makes it possible to characterize traffic with
sufficient accuracy.
For an individual flow, called a tagged flow, we first consider resource sharing sce-
narios (channel sharing, buffer sharing, and channel&buffer sharing) in the routers
and then build analysis models for different resource sharing components. We assume
that the routers employ round robin scheduling to share the link bandwidth. Based
on these models, we can derive the intra-router ESC for an individual flow. To con-
sider the contention which a flow may experience along its routing path, we present a
recursive algorithm to classify and analyze flow interference patterns. The algorithm
uses the proposed theorems to analyze the effect of contention flows on the tagged flow.
Based on this algorithm, we derive the end-to-end ESC and then Least Upper Delay
Bound (LUDB) for a tagged flow under the mentioned system model. To show the
potential of our method, we conduct three case studies to derive delay bounds and
compare them with simulation results. It is worth mentioning that the paper does not
model back-pressure; instead, it calculates buffer size thresholds that ensure
back-pressure does not occur in the network.
The remainder of this paper is organized as follows. Section 2 gives an account of
related works. In Section 3, we introduce the basics of network calculus. Section 4
discusses the underlying system model and notations in our analysis. Section 5 is de-
voted to the theorems required for computation of performance metrics. We present
our formal method for the performance analysis and computation of LUDB in Section
6. Numerical results are reported in Section 7. Finally, Section 8 gives the conclusions
and highlights directions for future work.
2. RELATED WORK
Recently, NoC designers have shown a great deal of interest in the development of analytical
performance models [Bakhouya et al. 2011]. Ogras et al. [2005] give a unified
representation of NoC architectures and applications and consider some major research
problems in the design area. As shown in that work, most of these research problems
require analytical evaluation of performance metrics in the network. We
[Kiasari et al. 2013] have surveyed four popular mathematical formalisms (dataflow
analysis, schedulability analysis, queueing theory, and network calculus) along with
their applications in NoCs. We have also reviewed the strengths and weaknesses of each
technique and its suitability for a specific purpose.
Dataflow analysis is a deterministic approach based on graph theory. As an example,
Hansson et al. [2008] present a model using a cyclo-static dataflow graph for buffer
dimensioning for NoC applications. In dataflow analysis, it is assumed that the
pattern of communication among cores and switches is deterministic and predefined.
Dataflow analysis must be used with restricted models such as DDF and CSDF to
capture dynamic behavior. In other words, expressiveness is typically traded off
against analyzability and implementation efficiency in this formalism.
Schedulability analysis is an analytical approach for investigating the timing
properties of real-time systems. It takes a set of tasks, their worst-case execution times, and
a scheduling policy as inputs and determines whether these tasks can be scheduled
such that deadline misses never occur. One example of this approach in NoCs is
presented by Shi and Burns [2008]. Schedulability analysis uses simpler event models
than the other mathematical formalisms; consequently, the performance
model is easy to extract but less accurate.
The models proposed by Lee [2003] and Rahmati et al. [2009; 2013] are inspired
by schedulability analysis. Lee [2003] presents a worst-case analysis model for real-time
communication and also proposes a feasibility test algorithm for a simplex virtual
circuit in wormhole networks. This work is extended by Rahmati et al. [2009] towards
NoCs, computing real-time bounds for high-bandwidth traffic. They also extend the
model [2013] to provide more detailed switch models and consider virtual channels
and variable buffer lengths. The key advantage of these methods is that they compute
the worst-case bounds with low time complexity without any special hardware support,
but their main limitation is that they do not leverage the input arrival patterns, which
leads to over-approximation in the performance analysis.
Most current works use queuing-theory-based approaches. For example, Moadeli
et al. [2007] analyze the traffic behavior in a NoC with the spidergon topology and
wormhole routing and then present a queuing-theory-based analytical model for
evaluating the average message latency in the network. Ben-Itzhak et al. [2011] propose
an analytical model for deriving the average end-to-end delay in a heterogeneous
wormhole-based NoC with heterogeneous traffic patterns, non-uniform link capacities, and a
variable number of virtual channels per link. Queuing approaches often use probability
distributions such as Poisson to model traffic in the network. However, the Poisson distribution
is not appropriate for characterizing traffic patterns in NoC applications
because it cannot model all significant features of NoC traffic.
Queuing theory generally evaluates average quantities of metrics in an equilibrium state, and
characterizing the transient behavior is a very difficult problem. An approach for
addressing this problem is suggested by Bogdan and Marculescu [2007], who
propose a statistical-physics-inspired framework to model the information flow
and buffer behavior in NoCs. They analyze the traffic dynamics in NoCs and
effectively capture the nonstationary effects of the system workload. Following up on this
work, Bogdan et al. [2010] propose the QuaLe model, based on statistical physics, which
can account for the nonstationarity observed in packet arrival processes. They also
investigate the impact of the packet injection rate and the data packet sizes on the multifractal
spectrum of NoC traffic.
Network calculus is a mathematical framework for deriving worst-case bounds on
maximum latency, backlog, and minimum throughput in network-based systems. It is
able to model all traffic patterns with bounds defined by arrival curves. In this respect,
designers can capture some dynamic features of the network based on the shapes of the
traffic flows [Bakhouya et al. 2011]. Network calculus can also abstract many scheduling
algorithms and arrival classes at a single queue with multiplexed arrival flows by means of
service curves. The service curves along a network path can be convolved into a single service
curve. Hence a multi-node network analysis can be simplified to a single-node
analysis. Owing to these two features, network calculus can analyze many scheduling
algorithms and arrival classes over a multi-node network in a uniform framework,
while classical queuing theory models different combinations of them separately [Ciucu
et al. 2012]. The probabilistic version of (deterministic) network calculus is stochastic
network calculus. In some networks, such as wireless networks, the service offered by
a communication channel may vary randomly over time due to channel contention and
impairment. Such networks can only provide stochastic services and guarantees. For
example, Rizk and Fidler [2012] use stochastic network calculus to derive per-flow
end-to-end performance bounds in a network of tandem queues under open-loop fBm
cross traffic, which is a model for self-similar and long-range dependent aggregate
Internet traffic. Since we employ deterministic network calculus, in the rest of this paper
network calculus refers to the deterministic type. A network-calculus-based methodology
[Bakhouya et al. 2011] analyzes and evaluates performance and cost metrics, such
as latency, energy consumption, and area requirements, in on-chip interconnects. Its
authors compare 2D mesh, spidergon, and WK-recursive topologies using
a given traffic pattern and show that WK-recursive outperforms mesh and spidergon
in all considered metrics. Their model is simple: it does not consider
virtual channel effects and does not model all interferences between flows sharing a
resource in the network. Moreover, the model does not investigate the peak behavior
of flows, which leads to less accurate bounds, whereas we consider performance analysis
for VBR traffic in on-chip networks employing aggregate resource management.
The performance evaluation of real-time services in networks employing aggregate
scheduling is particularly challenging because of its complexity. Aggregate scheduling
arises in many cases. In addition to NoCs, for example, it can also be applied to obtain
scalability in large-size networks. Differentiated Services (DiffServ) [Blake
et al. 1998] is an example of an architecture based on aggregate scheduling in the
Internet. Despite the research efforts, few results have appeared on this subject. A
survey on the subject can be found in [Bennett et al. 2002]. Charny et al. [2000]
derive a closed-form delay bound for a generic network configuration under the fluid
model assumption. It is extended by Jiang [2002] to consider packetization
effects. However, these works can derive bounds only for small utilization factors in a
generic network configuration.
Martin et al. [2006; 2003] and Bauer et al. [2010] employ the Trajectory Approach
(TA) to compute end-to-end delay bounds in FIFO systems. The Trajectory Approach
computes all the possible trajectories of a system under constraints and then takes
the maximum end-to-end delay over them. Bauer et al. [2010] compare the network calculus
and Trajectory approaches on a real avionics AFDX configuration and show that
the Trajectory Approach computes upper bounds which are tighter than the upper
bounds computed by network calculus. However, they derive delay bounds by
summing per-node bounds, which is not expected to yield tight bounds, although the bounds
are reported to be at least close under practical conditions.
The computation of delay bounds through network calculus in feed-forward networks
under arbitrary multiplexing has already been addressed in several works
[Schmitt et al. 2008; Kiefer et al. 2010; Bouillard et al. 2010]. One of these works
[Bouillard et al. 2010] describes the first algorithm which can compute the worst-case
end-to-end delay for a given flow for any feed-forward network under blind multiplexing,
with concave arrival curves and convex service curves. Although the problem is
intrinsically difficult (NP-hard), the authors show that in some cases, like tandem
networks with cross-traffic interfering along intervals of servers, the complexity becomes
polynomial. Then, the approach is refined [Bouillard et al. 2011] in order to take into
account fixed priorities. Bouillard et al. study networks with a fixed-priority service
policy, in which each flow is assigned a fixed priority, and try to take into account
the pay multiplexing only once (PMOO) phenomenon. This stream of works deals with
networks of arbitrary multiplexing, also known as general or blind multiplexing, which
means that no assumption is made about the service policy. By assuming an explicit
multiplexing scheme like FIFO, tighter bounds can be obtained.
A related stream of works [Lenzini et al. 2006; Lenzini et al. 2008; Bisti et al. 2010]
proposes a methodology which calculates delay bounds in tandem networks of rate-latency
nodes traversed by leaky-bucket-shaped flows. They also introduce a software
tool, called DEBORAH, which implements the algorithms employed in their methodology
to compute delay bounds. These works consider servers in tandem or sink trees, while
our proposed method computes the end-to-end delay in a generic NoC topology.
Moreover, these works compute delay bounds only for the average behavior of
flows and do not consider peak behavior, which results in less accurate bounds.
Boyer [2010] tries to model shaping for an end-to-end delay where each server is
shared by two flows. An applicative token bucket γ_{r,b} is shaped by the bit-rate of the
link λ_R, leading to a two-slope affine arrival curve that is similar
to the one we consider for double leaky buckets. The paper investigates a simple topology,
a sequence of rate-latency servers, each one shared by two flows with a FIFO policy,
and a simple case of nested contentions. Moreover, the authors state that their modeling
is incomplete: when computing the worst-case traversal time of a flow, they model
only the shaping of the flow under consideration, not that of the interfering ones (leading to the
title ‘half-modeling of shaping’). In this paper, we investigate both nested and crossed
contentions in general to model all flows (even interfering ones) with complex interferences
in on-chip networks.
All aforementioned works on aggregate resource management compute
delay bounds in various network infrastructures, but not in on-chip networks. In
NoC architectures, analytical models can be very close to the real system. For
instance, a router in an on-chip network is implemented in pure hardware, which means
its micro-architecture is amenable to analysis. Therefore, network calculus can provide
a more accurate analysis in on-chip networks.
Qian et al. [2010] present analytical models for traffic flows under strict priority
queueing and weighted round robin scheduling in on-chip networks. They then derive
per-flow end-to-end delay bounds using these models. Like most of the works mentioned above,
the model proposed by Qian et al. [2010] does not deal with the peak behavior of flows,
which results in less accurate bounds. The method proposed in this paper considers
performance analysis for VBR traffic characterized by (L, p, σ, ρ) in on-chip networks
employing aggregate resource management. As such, our method achieves more accurate
delay bounds.
3. NETWORK CALCULUS BACKGROUND
Network calculus is a mathematical framework to derive worst-case bounds and
analyze performance guarantees in networks. This paper uses Traffic SPECification
(TSPEC) [Wroclawski 1997] to model the average and peak characteristics of flow fj as
arrival curve αj(t) = min(Lj + pj t, σj + ρj t), in which Lj is the maximum transfer size,
pj the peak rate (pj ≥ ρj), σj the burstiness (σj ≥ Lj), and ρj the average (sustainable)
rate. We denote it as fj ∝ (Lj, pj, σj, ρj). As shown in Figure 1, θj = (σj − Lj)/(pj − ρj)
and αj(t) = Lj + pj t if t ≤ θj; αj(t) = σj + ρj t, otherwise.
Fig. 1. Arrival curve of flow fj with TSPEC (Lj, pj, σj, ρj). [Plot of data volume versus time: the curve
starts at Lj, follows Lj + pj t up to t = θj = (σj − Lj)/(pj − ρj), and then follows σj + ρj t.]
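The TSPEC arrival curve defined above can be evaluated with a few lines of code. The following is a minimal sketch, not the authors' C++ implementation; the class name `TSpec` and its method names are hypothetical, and the numeric values in the usage note are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TSpec:
    """TSPEC flow descriptor (L, p, sigma, rho); names are illustrative."""
    L: float      # maximum transfer size (flits)
    p: float      # peak rate (flits/cycle)
    sigma: float  # burstiness (flits)
    rho: float    # average sustainable rate (flits/cycle)

    def __post_init__(self):
        # The text requires p >= rho and sigma >= L; theta additionally needs p > rho.
        if not (self.p > self.rho and self.sigma >= self.L):
            raise ValueError("expected p > rho and sigma >= L")

    def theta(self) -> float:
        """Time at which the peak-rate line and the average-rate line intersect."""
        return (self.sigma - self.L) / (self.p - self.rho)

    def alpha(self, t: float) -> float:
        """Arrival curve alpha(t) = min(L + p*t, sigma + rho*t) for t >= 0."""
        return min(self.L + self.p * t, self.sigma + self.rho * t)
```

For example, with the assumed values `TSpec(L=1.0, p=1.0, sigma=4.0, rho=0.25)`, the two lines cross at `theta()`; before that point `alpha` follows the peak-rate line, and after it the average-rate line.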
In this paper, we also consider a class of curves, namely pseudoaffine curves [Lenzini
et al. 2006], which is a multiple affine curve shifted to the right and given by β =
δ_T ⊗ [⊗_{1≤x≤n} γ_{σx,ρx}]. In fact, a pseudoaffine curve represents the service received by
single flows in tandems of FIFO multiplexing rate-latency nodes. Since affine curves are
concave, it can be rewritten as β = δ_T ⊗ [∧_{1≤x≤n} γ_{σx,ρx}], where the non-negative term T
is called the offset, and the affine curves between square brackets are called leaky-bucket
stages. A rate-latency service curve is itself pseudoaffine, since it can
be expressed as β = δ_T ⊗ γ_{0,R}.
Given arrival curve α and service curve β, the delay is bounded by the horizontal
deviation between the arrival and service curves.
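For reference, the horizontal deviation can be written out explicitly. The following is a sketch using the definition that is standard in network calculus; the dummy variables s and d are notation assumed here. The closed form in the second line is worked out only for the special case of a single rate-latency server β = δ_T ⊗ γ_{0,R} with R ≥ ρ and a TSPEC arrival curve.

```latex
h(\alpha,\beta) = \sup_{s \ge 0}\, \inf \{\, d \ge 0 : \alpha(s) \le \beta(s + d) \,\}

% Special case: \alpha = \gamma_{L,p} \wedge \gamma_{\sigma,\rho}, \quad \beta(t) = R\,(t - T)^+, \quad R \ge \rho.
% The supremum of T + \alpha(s)/R - s is attained at s = 0 (if p \le R)
% or at s = \theta = (\sigma - L)/(p - \rho) (if p > R), which gives
h(\alpha,\beta) = T + \frac{L + \theta\,(p - R)^+}{R}
```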
4. SYSTEM MODEL AND NOTATIONS
As depicted in Figure 2, we consider an NoC architecture in which every node contains
a router and a core which performs its own computational, storage, or I/O processing
functionality and is equipped with a Network Interface (NI). As the figure shows,
buffers are arranged to construct VCs in each input channel. To characterize flows
based on their defined TSPEC, we assume unbuffered leaky bucket controllers (regulators),
which do not buffer the packets but stall the traffic producers or IPs [Jafari
et al. 2010].
Assumptions in this work are listed as follows:
— The NoC architecture can have different topologies.
— Packets have fixed length and traverse the network in a best-effort fashion, using
virtual-cut-through switching and deadlock-free deterministic routing.
— Routers have only input buffers and VCs.
— Buffers are bounded and the network is lossless.
— The router can have multiple VCs per in-port. VC allocation is deterministic and each
VC receives an aggregate service.
— All traffic consists of TSPEC flows f = TSPEC(L, p, σ, ρ) at the entry into the
network.
— In each node that guarantees the flow a pseudo affine service curve β =
δ_T ⊗ γ_{σx,ρx}, it is assumed that ρ ≤ ρx and p ≥ ρx.
— Flows are classified into a pre-specified number of aggregates.
— Traffic of each aggregate is buffered and transmitted in FIFO order, denoted as
FIFO multiplexing.
— Different aggregates are buffered separately and each aggregate is guaranteed a
rate-latency service curve.
— We use a concrete policy, in this case round-robin arbitration, to support the
assumption of a rate-latency service curve. Other arbitration policies can be used
as well. We also assume a fixed word length of Lw in all flows.
— The peak rate is limited by the hardware. It is always 1 flit/cycle.
Fig. 2. An example of an NoC with 16 nodes and 5 flows along with the structure of a single node.
[Figure: routers r1 to r16, each attached to a core, with flows f1 to f5; the node structure comprises a
core, a Network Interface (NI) with injection and ejection channels, input and output channels, DEMUX
and MUX stages, a crossbar switch, a routing control unit, and an arbiter.]
NoC designers can obtain per-flow end-to-end delay bounds in NoC architectures with
the method proposed in this paper under the assumptions above.
Most of the assumptions in this paper have been widely used by previous models
[Qian et al. 2009; Jafari et al. 2010]. The system model in this paper is more general
than those models [Qian et al. 2009; Jafari et al. 2010] because they consider
a Constant Bit Rate (CBR) flow in NoCs, defined by (σ, ρ), which is a special case of
TSPEC. Furthermore, we have relaxed a significant limitation of the previous analytical
model [Jafari et al. 2010], which presumes that the number of VCs for each physical channel (PC)
is the same as the number of flows passing through that channel.
We use the example depicted in Figure 2 to explain the terminology used in the paper.
The figure shows a network with 16 nodes, numbered 1, 2, ..., 16, connected by links.
There are 5 flows in the example, denoted as f1, ..., f5. Multiple flows that share the same
buffer and channel in a router are scheduled as one flow, called an aggregate flow. For
instance, f{1,2} in router 3 is an aggregate flow. A tagged flow is the flow for which we
derive the delay bound, and other flows that share resources with the tagged flow are
contention flows. In this example, f1 is the tagged flow, and f2, f3, and f4 are contention
flows. Notations in the paper are listed in Table I.
We use the sub-index "(fi, rj)" for notations to indicate that they are related to flow fi in
router rj. For example, α(f1,r2) denotes the arrival curve of flow f1 in router r2. We also
employ the sub-index "(si, rj)" to state that notations are related to fsi in router rj. In this case,
fsi can be one flow or an aggregate flow. For instance, β({1,2,3},r2) indicates the service
curve of aggregate flow f{1,2,3} in router r2.
Table I. The list of notations
fi                   Flow i
αi                   The arrival curve of fi
α*i                  The output arrival curve of fi
Li                   The maximum transfer size of fi (flits)
pi                   The peak rate of fi (flits/cycle)
σi                   The burstiness of fi (flits)
ρi                   The average rate of fi (flits/cycle)
Src(i)               The source node of fi
rj                   Router j
βj                   The service curve of rj
R                    The minimum service rate in a rate-latency service curve
T^l                  The maximum processing latency of the arbiter in the router (cycles)
T^HoL                The maximum waiting time in the FIFO queue of the router (cycles)
T^Total              The total processing delay which comes from contention flows and equals the sum of T^l and T^HoL (cycles)
Drouter              Time spent for packet routing decision (cycles)
Lw                   The word length in the flow (flits)
C                    The channel capacity (flits/cycle)
ρx                   The minimum service rate in a pseudo affine service curve
CFt                  The set of contention flows of tagged flow ft in the network
si                   The set of joint flows in an aggregate flow (when the number of elements of si is equal to 1, there is only a single flow)
fsi                  An aggregate flow of si
|si|                 The cardinality of set si, which is a measure of the "number of elements of the set"
S = {si}             A set of si's in a tandem of routers
sm                   A set which has the maximum cardinality among the sets in S: sm = { sx | |sx| = max(|si|); ∀si ∈ S }
fsm                  The flow related to sm
rm                   The router related to sm
βm                   The service curve related to sm
F^B_(si,rj)          The set of flows which share the same buffer in router rj with flow fsi
|V_(si,rj)|          The number of virtual channels whose passing flows share the same channel of router rj with flow fsi
F_(VCk,PCi,rj)       The set of flows passing through VCk in physical channel PCi of router rj
5. PROPOSED THEOREMS
In this section, we review the earlier proposed theorems [Jafari et al. 2011; Jafari
et al. 2012] which are required for analyzing performance of VBR flows in a FIFO
multiplexing network.
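We first present a theorem for computing the delay bound as follows.
Theorem 1. (Delay Bound) Let β be a pseudo affine curve with offset T and n leaky-bucket
stages γ_{σx,ρx}, 1 ≤ x ≤ n, that is,
\[ \beta = \delta_T \otimes \Big[ \bigotimes_{1 \le x \le n} \gamma_{\sigma_x,\rho_x} \Big] = \delta_T \otimes \Big[ \bigwedge_{1 \le x \le n} \gamma_{\sigma_x,\rho_x} \Big] \]
and let α = min(L + pt, σ + ρt) = γ_{L,p} ∧ γ_{σ,ρ}, with θ = (σ − L)/(p − ρ). If ρ*_β ≥ ρ (ρ*_β = min_{1≤x≤n} ρx), then the
maximum delay for the flow is bounded by
\[ h(\alpha, \beta) = T + \left[ \bigvee_{1 \le x \le n} \frac{L - \sigma_x + \theta\,(p - \rho_x)^+}{\rho_x} \right]^+ \tag{1} \]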
PROOF. We have proved it in [Jafari et al. 2012]. See Appendix A.
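Equation (1) translates directly into code. The sketch below is an illustration under stated assumptions rather than the authors' implementation: the function name `delay_bound` and the representation of the leaky-bucket stages as `(sigma_x, rho_x)` pairs are hypothetical choices.

```python
from typing import Sequence, Tuple


def pos(x: float) -> float:
    """Positive part [x]^+ = max(x, 0)."""
    return max(x, 0.0)


def delay_bound(T: float, stages: Sequence[Tuple[float, float]],
                L: float, p: float, sigma: float, rho: float) -> float:
    """Delay bound of Eq. (1) for a TSPEC flow (L, p, sigma, rho).

    The pseudo affine service curve has offset T and leaky-bucket stages
    given as (sigma_x, rho_x) pairs (an assumed encoding).
    """
    if not stages:
        raise ValueError("at least one leaky-bucket stage is required")
    if p <= rho:
        raise ValueError("theta is only defined for p > rho")
    if min(rho_x for _, rho_x in stages) < rho:
        raise ValueError("Theorem 1 requires min(rho_x) >= rho")
    theta = (sigma - L) / (p - rho)
    # Maximum over all stages of (L - sigma_x + theta*(p - rho_x)^+) / rho_x.
    worst = max((L - sigma_x + theta * pos(p - rho_x)) / rho_x
                for sigma_x, rho_x in stages)
    return T + pos(worst)
```

For a single rate-latency stage, `stages=[(0.0, R)]`, the result reduces to T + (L + θ(p − R)^+)/R.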
In the rest of the paper, we apply Theorem 1 to the end-to-end ESC to calculate the
LUDB for a tagged flow. In the method proposed in Section 6, to obtain the
end-to-end ESC, we should be able to subtract contention flows from a service curve. To this
end, we propose Proposition 1 and Theorem 2. In Proposition 1, we derive the ESC with
FIFO multiplexing where the service curve is a pseudo affine curve. We then use Corollary
1, which is an immediate consequence of Proposition 1, to propose Theorem 2. This
theorem is employed for deriving the ESC in the underlying system model.
In Proposition 1 and Theorem 2, we obtain the ESC with FIFO multiplexing under
different assumptions.
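Proposition 1. (Equivalent Service Curve) Let β be a pseudo affine curve with offset
T and n leaky-bucket stages γ_{σx,ρx}, 1 ≤ x ≤ n, that is,
\[ \beta = \delta_T \otimes \Big[ \bigotimes_{1 \le x \le n} \gamma_{\sigma_x,\rho_x} \Big] = \delta_T \otimes \Big[ \bigwedge_{1 \le x \le n} \gamma_{\sigma_x,\rho_x} \Big] \]
and let α = min(L + pt, σ + ρt) = γ_{L,p} ∧ γ_{σ,ρ}. If ρ*_β ≥ ρ (ρ*_β = min_{1≤x≤n} ρx) and
p ≥ ρ°_β (ρ°_β = max_{1≤x≤n} ρx), then the ESC obtained by subtracting arrival curve α is
{βeq(α, τ), τ = h(α, β)} ≡ βeq(α), with
\[ \beta^{eq}(\alpha) = \delta_{\,T + \bigvee_{1 \le i \le n} \left[ \frac{L - \sigma_i + \theta (p - \rho_i)^+}{\rho_i} \right]^+ + \theta} \otimes \left[ \bigotimes_{1 \le x \le n} \gamma_{\rho_x \left\{ \bigvee_{1 \le i \le n} \left[ \frac{L - \sigma_i + \theta (p - \rho_i)^+}{\rho_i} \right]^+ - \frac{\sigma - \sigma_x - (\rho_x - \rho)\theta}{\rho_x} \right\},\ \rho_x - \rho} \right] \tag{2} \]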
PROOF. We have proved it in [Jafari et al. 2012]. See Appendix B.
The following corollary is an immediate consequence.
Corollary 1. Let β = δ_T ⊗ γ_{σx,ρx} be a pseudo affine curve with offset T and one leaky-bucket
stage γ_{σx,ρx}, and let α = min(L + pt, σ + ρt) = γ_{L,p} ∧ γ_{σ,ρ}. If ρx ≥ ρ and p ≥ ρx,
then the ESC obtained by subtracting arrival curve α, βeq, is
\[ \beta^{eq} = \delta_{\,T + \left[ \frac{L - \sigma_x + \theta (p - \rho_x)^+}{\rho_x} \right]^+ + \theta} \otimes \gamma_{0,\ \rho_x - \rho} \tag{3} \]
PROOF. We can easily obtain this corollary by applying Proposition 1 for service
curve β when n = 1.
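As a small illustration of Corollary 1, the resulting ESC is fully described by its offset and its rate, because the burst of the remaining stage is zero. The sketch below assumes this `(offset, rate)` encoding; the function name `esc_single_stage` is hypothetical.

```python
def pos(x: float) -> float:
    """Positive part [x]^+ = max(x, 0)."""
    return max(x, 0.0)


def esc_single_stage(T: float, sigma_x: float, rho_x: float,
                     L: float, p: float, sigma: float, rho: float):
    """ESC of Eq. (3): returns (offset, rate) of delta_offset (x) gamma_{0, rate}.

    The service curve is delta_T (x) gamma_{sigma_x, rho_x}; the subtracted
    flow has TSPEC (L, p, sigma, rho).
    """
    if not (rho_x >= rho and p >= rho_x):
        raise ValueError("Corollary 1 requires rho_x >= rho and p >= rho_x")
    if p <= rho:
        raise ValueError("theta is only defined for p > rho")
    theta = (sigma - L) / (p - rho)
    offset = T + pos((L - sigma_x + theta * pos(p - rho_x)) / rho_x) + theta
    return offset, rho_x - rho
```

The returned pair can be fed back in as a new rate-latency curve (with `sigma_x = 0`) to subtract a further contention flow, which is the iteration that the following theorem formalizes.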
We can capitalize on Corollary 1 to obtain a parametric expression for
the ESC of a tagged flow passing through a rate-latency node. We assume the number
of flows passing through this node is K + 1. Therefore, to compute the equivalent service
curve for the tagged flow, we should subtract the arrival curves of the other K flows. It can
be calculated by iteratively applying Corollary 1 K times. Without loss of generality,
we presume that the tagged flow is flow K + 1. We now present the following theorem:
Theorem 2. (Equivalent Service Curve for Rate-Latency Service Curve with K + 1
Flows) Consider one node with a rate-latency service curve β_{R,T} = δ_T ⊗ γ_{0,R}. Let αi =
min(Li + pi t, σi + ρi t) = γ_{Li,pi} ∧ γ_{σi,ρi} be the arrival curve of flow i, with
pi ≥ R − Σ_{j=1, j≠i}^{K+1} ρj, where 1 ≤ i ≤ K + 1 and K + 1 is the number of flows passing
through that node, as shown in Figure 3. Assuming Σ_{j=1}^{K+1} ρj ≤ C, where C is the link rate, the ESC
for flow K + 1 in the node, obtained by subtracting K arrival curves, is:
\[ \beta^{eq}_{K+1} = \delta_{\,T + \sum_{i=1}^{K} \left( \left[ \frac{L_i + \theta_i \left( p_i - R + \sum_{j=1}^{i-1} \rho_j \right)^+}{R - \sum_{j=1}^{i-1} \rho_j} \right]^+ + \theta_i \right)} \otimes \gamma_{0,\ R - \sum_{j=1}^{K} \rho_j} \tag{4} \]
Fig. 3. Computation of equivalent service curve for flow K + 1 in a rate-latency node.
PROOF. We have proved it in [Jafari et al. 2012]. See Appendix C.
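The accumulation in Equation (4) can be sketched as a loop that applies Corollary 1 once per contention flow. This is only an illustration under the theorem's assumptions (it does not check the preconditions on $p_i$ and on the sum of sustained rates), and the names `esc_rate_latency` and `contenders` are hypothetical.

```python
def theta(L, p, sigma, rho):
    """theta = (sigma - L) / (p - rho), as used in Section 7.2."""
    return (sigma - L) / (p - rho)


def esc_rate_latency(R, T, contenders):
    """Equation (4): ESC left for the tagged flow at a rate-latency node
    delta_T (x) gamma_{0,R} after subtracting K contention flows.

    contenders: list of TSPEC tuples (L, p, sigma, rho) for flows 1..K,
    in the order in which they are subtracted.
    Returns (latency, rate) of the resulting delta_latency (x) gamma_{0,rate}.
    """
    latency, rate = T, R
    for L, p, sigma, rho in contenders:
        th = theta(L, p, sigma, rho)
        # `rate` is R minus the sustained rates removed so far,
        # i.e. the denominator of the i-th term in Equation (4).
        latency += max((L + th * max(p - rate, 0.0)) / rate, 0.0) + th
        rate -= rho
    return latency, rate


# Hypothetical use: two contention flows at a node with R = 1 and T = 0.
print(esc_rate_latency(1.0, 0.0, [(1, 1, 2, 0.032), (1, 1, 2, 0.008)]))
```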
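Theorem 3 states how the output arrival curve of a VBR flow in a FIFO multiplexing
node can be calculated.
Theorem 3. (Output Arrival Curve with FIFO) Consider a VBR flow, with TSPEC (L, p, ρ, σ), served in a node that guarantees to the flow a pseudo-affine service curve $\beta = \delta_T \otimes \gamma_{\sigma_x,\rho_x}$. The output arrival curve $\alpha^*$ is given by:
\[
\alpha^* =
\begin{cases}
\gamma_{(p \wedge \rho_x) T + \theta (p - \rho_x)^+ + L - \sigma_x,\ p \wedge \rho_x} \wedge \gamma_{\sigma - \sigma_x + \rho T,\ \rho} & \text{if } \theta > T \\
\gamma_{\sigma - \sigma_x + \rho T,\ \rho} & \text{if } \theta \le T
\end{cases} \tag{5}
\]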
PROOF. We have proved it in [Jafari et al. 2011]. See Appendix D.
We apply this theorem to calculate internal output arrival curves. For instance, in
Section 6.2, we obtain the output arrival curve of a crossed flow when it is split into
two nested flows.
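A small sketch of how Equation (5) could be evaluated is given below. Representing the output curve as a list of (burst, rate) leaky-bucket stages whose pointwise minimum is $\alpha^*$ is an assumption of this sketch, as are the function and parameter names.

```python
def output_arrival_curve(L, p, sigma, rho, T, sigma_x, rho_x):
    """Equation (5): output arrival curve of a VBR flow served by
    beta = delta_T (x) gamma_{sigma_x, rho_x}.

    Returns a list of (burst, rate) stages; alpha* is their minimum.
    """
    th = (sigma - L) / (p - rho)
    sustained = (sigma - sigma_x + rho * T, rho)
    if th > T:
        peak_rate = min(p, rho_x)
        peak_burst = peak_rate * T + th * max(p - rho_x, 0.0) + L - sigma_x
        return [(peak_burst, peak_rate), sustained]
    # theta <= T: the peak-rate stage disappears, leaving one leaky bucket
    return [sustained]


# Hypothetical use: a flow with TSPEC (L=1, p=1, sigma=2, rho=0.032).
print(output_arrival_curve(1, 1, 2, 0.032, T=10.0, sigma_x=0.0, rho_x=0.872))
print(output_arrival_curve(1, 1, 2, 0.032, T=0.5, sigma_x=0.0, rho_x=0.872))
```

When θ ≤ T the output reduces to a single leaky bucket, which is the case assumed for all flows in the worked example of Section 7.2.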
6. FORMAL METHOD FOR LUDB DERIVATION
We have presented and proved the required theorems for deriving the LUDB for VBR flows
in on-chip networks based on aggregate scheduling with multiple virtual channels. As
mentioned before, to calculate the LUDB per flow, we should first obtain the end-to-end
ESC which the tandem of routers provides to the flow. For calculating the end-to-end
ESC, we propose the following two steps:
•Step 1: Intra-router ESC
•Step 2: Inter-router ESC
In the first step, we consider resource sharing scenarios in the routers and then
build analysis models for different resource sharing components. Based on these models,
we can derive the intra-router ESC for an individual flow. In the second step, we
consider the contention which a flow may experience along its routing path. To this end,
we present the recursive algorithm End-to-End ESC, which classifies and analyzes resource
sharing models and flow interference patterns. Based on this algorithm, we can derive the
end-to-end ESC for a tagged flow passing through the tandem of routers.
[Figure 4: buffer&channel sharing in a router (input buffer, crossbar switch, DEMUX, output channel). The set of contention flows of tagged flow $f_i$ sharing both buffer and channel in router $r_j$ is denoted $F^{CB}_{(f_i,r_j)}$; flows $f_1$ and $f_2$ are treated as the aggregate flow $f_{\{1,2\}}$.]
Fig. 4. An example of channel&buffer sharing.
[Figure 5: channel sharing in a router. The set of contention flows of tagged flow $f_i$ sharing a channel in router $r_j$ is denoted $F^{C}_{(f_i,r_j)}$; flows $f_1$, $f_2$, and $f_3$ share one output channel.]
Fig. 5. An example of a channel sharing three flows.
6.1. Step 1: Intra-router ESC
To compute intra-router ESC for a tagged flow, it is necessary to investigate resource
sharing. At each router, we identify three types of resource sharing, namely, chan-
nel sharing, buffer sharing, and channel&buffer sharing. Channel sharing means that
multiple flows share the same out-port and thus the output channel bandwidth. Buffer
sharing means that multiple flows share the same buffer but not channel. In chan-
nel&buffer sharing, multiple flows share both buffers and channels. They are sched-
uled as a flow called aggregate flow.
6.1.1. Channel&Buffer Sharing. Figure 4 depicts an example of flows sharing both channel
and buffer in the router. As shown in the figure, we consider these flows as an
aggregate flow. When an aggregate flow includes the tagged flow, it is called the tagged
aggregate flow. In this case, we calculate the intra-router ESC for the tagged aggregate
flow in the router instead of the tagged flow. In Section 6.2, we show how the ESC of the
tagged flow is extracted from the ESC of the tagged aggregate flow by removing contention
flows one by one. For simplicity, in the rest of the paper, "tagged flow" refers to
both the tagged flow and the tagged aggregate flow.
6.1.2. Channel Sharing. Figure 5 depicts a channel shared between three flows f1, f2,
and f3. Since the arbitration policy determines how much the flows influence each
other, it has to be known. We assume that, while serving multiple flows, the routers
employ round robin scheduling to share the channel bandwidth. Assuming a fixed word
length of $L_w$ in all flows, round robin arbitration means that each flow $f_{s_i}$ in router
$r_j$ gets at least $C / |V_{(s_i,r_j)}|$ of the channel bandwidth, where $C$ is the channel capacity
and $|V_{(s_i,r_j)}|$ is the number of virtual channels whose flows share the same output channel
of router $r_j$ as flow $f_{s_i}$ (the virtual channel of $f_{s_i}$ included). A flow may get more if
other flows use less, but this gives a worst-case lower bound on its bandwidth. Round robin
arbitration has good isolation properties because the minimum bandwidth for each flow does
not depend on properties of the other flows.
Fig. 6. An example of a buffer sharing two flows.
Since network calculus uses the abstraction of service curve to model a network
element processing traffic flows [Le Boudec et al. 2004], we can also model a round
robin arbiter in router $r_j$ for flow $f_{s_i}$ as a rate-latency server [Gebali et al. 2009]
whose service curve is $\beta_{(s_i,r_j)}(t) = R_{(s_i,r_j)} (t - T^l_{(s_i,r_j)})^+$, where $R_{(s_i,r_j)}$ is the minimum
service rate and $T^l_{(s_i,r_j)}$ is the maximum processing latency of the arbiter in router $r_j$ for
flow $f_{s_i}$. $R_{(s_i,r_j)}$ and $T^l_{(s_i,r_j)}$ are defined as follows:
\[
R_{(s_i,r_j)} = \frac{C}{|V_{(s_i,r_j)}|} \tag{6}
\]
\[
T^l_{(s_i,r_j)} = \left( |V_{(s_i,r_j)}| - 1 \right) \times \left( \frac{L_w}{C} + D_{router} \right) \tag{7}
\]
where $D_{router}$ is the delay for the packet routing decision in a router.
As mentioned in Section 5, a rate-latency service curve is in fact a pseudo-affine curve.
Therefore, $\beta_{(s_i,r_j)}$ can be expressed as $\delta_{(|V_{(s_i,r_j)}| - 1) \times (L_w/C + D_{router})} \otimes \gamma_{0,\ C/|V_{(s_i,r_j)}|}$. Assuming
f1 is the tagged flow in the example of Figure 5, $\beta_{(f_1,r)} = \delta_{2 \times (L_w/C + D_{router})} \otimes \gamma_{0,\ C/3}$.
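Equations (6) and (7) translate directly into a small helper. The sketch below uses hypothetical names and assumes `n_vcs` counts the tagged flow's own virtual channel, as the three-flow example above implies.

```python
def round_robin_server(C, n_vcs, Lw, D_router):
    """Rate-latency parameters of a round robin arbiter, Equations (6)-(7).

    C        : channel capacity (flit/cycle)
    n_vcs    : number of virtual channels sharing the output channel,
               including the one carrying the tagged flow
    Lw       : word length (flit)
    D_router : routing-decision delay (cycles)
    Returns (rate, latency) of beta = delta_latency (x) gamma_{0, rate}.
    """
    if n_vcs < 1:
        raise ValueError("at least the tagged flow's own VC must be counted")
    rate = C / n_vcs
    latency = (n_vcs - 1) * (Lw / C + D_router)
    return rate, latency


# Three VCs on one channel with C = 1, Lw = 1, D_router = 1 (values chosen
# only for illustration): rate C/3 and latency 2 * (Lw/C + D_router).
print(round_robin_server(1.0, 3, 1.0, 1.0))
```

A flow that has the channel to itself (`n_vcs = 1`) keeps the full rate C with zero arbitration latency.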
6.1.3. Buffer Sharing. Figure 6 shows a buffer shared between two flows f1 and f2. In
this type of sharing, in addition to the maximum processing latency for link sharing, $T^l$,
we introduce the head-of-line delay for a tagged flow as below:
Head-of-Line delay (HoL): Given that a flow arrives at a router at time $t$, it waits in
the FIFO queue until at most time $t + T^{HoL}$.
Therefore, the total processing delay caused by contention flows for tagged
flow $f_{s_i}$ in router $r_j$, $T^{Total}_{(s_i,r_j)}$, is equal to $T^l_{(s_i,r_j)} + T^{HoL}_{(s_i,r_j)}$.
We assume f1 in Figure 6 is the tagged flow. According to Equation (7), $T^l_{(f_1,r)} = 0$.
From the figure, it is clear that $T^{HoL}_{(f_1,r)}$ is equal to the maximum delay for passing
packets of flow f2 in the buffer. According to network calculus theory [Le Boudec
et al. 2004], the maximum delay for flow $f_j$ is bounded by Equation (8).
\[
\bar{D}_{(f_j,r)} = T^l_{(f_j,r)} + \frac{L_j + \theta_j (p_j - R_{(f_j,r)})^+}{R_{(f_j,r)}} \tag{8}
\]
Therefore, expanding Equation (8) for flow f2 with $p_2 \ge R_{(f_2,r)}$, we formulate $T^{HoL}_{(f_1,r)}$ as follows:
\[
T^{HoL}_{(f_1,r)} = T^l_{(f_2,r)} - \theta_2 + \frac{L_2 + \theta_2 p_2}{R_{(f_2,r)}} \tag{9}
\]
Fig. 7. An example of a buffer sharing three flows.
If more than one flow shares the buffer with the tagged flow, as shown in
Figure 7, the HoL delay for tagged flow $f_{s_i}$ in router $r_j$ is given by
\[
T^{HoL}_{(s_i,r_j)} = \sum_{\forall f_c \in F^B_{(s_i,r_j)}} T^{HoL(f_c)}_{(s_i,r_j)} \tag{10}
\]
where $F^B_{(s_i,r_j)}$ is the set of flows which share the same buffer in router $r_j$ with tagged
flow $f_{s_i}$. $T^{HoL(f_c)}_{(s_i,r_j)}$ is calculated as follows.
\[
T^{HoL(f_c)}_{(s_i,r_j)} = T^l_{(f_c,r)} - \theta_c + \frac{L_c + \theta_c p_c}{R_{(f_c,r)}} \tag{11}
\]
Therefore, router $r_j$ can serve flow $f_{s_i}$ with the curve $\beta_{(s_i,r_j)} = \delta_{T^{Total}_{(s_i,r_j)}} \otimes \gamma_{0,R_{(s_i,r_j)}}$, where
$T^{Total}_{(s_i,r_j)} = T^{HoL}_{(s_i,r_j)} + T^l_{(s_i,r_j)}$ and $R_{(s_i,r_j)}$ is calculated by Equation (6).
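To make Equations (10) and (11) concrete, the following sketch sums the per-flow HoL contributions of the flows sharing a buffer with the tagged flow. The dictionary keys and function names are hypothetical, and Equation (11) is implemented literally, so it matches Equation (8) only when $p_c \ge R_{(f_c,r)}$.

```python
def hol_contribution(T_l, R, L, p, sigma, rho):
    """Equation (11): HoL delay caused by one buffer-sharing flow f_c.

    T_l and R are the arbiter latency and rate seen by f_c in this router;
    (L, p, sigma, rho) is the TSPEC of f_c at the router's input.
    """
    th = (sigma - L) / (p - rho)
    return T_l - th + (L + th * p) / R


def total_latency(T_l_tagged, sharers):
    """T_Total = T_l + T_HoL, with T_HoL summed over sharers (Equation (10))."""
    return T_l_tagged + sum(hol_contribution(**s) for s in sharers)


# Hypothetical sharer: arbiter latency 2 cycles, rate 0.5 flit/cycle.
f_c = dict(T_l=2.0, R=0.5, L=1, p=1, sigma=2, rho=0.032)
print(total_latency(0.0, [f_c]))
```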
We analyze the buffer space threshold for each VC based on the traffic specifications of
the flows passing through that VC, and also the interference between them. The buffer space
threshold for virtual channel $VC_k$ in physical channel $PC_i$ of router $r_j$ is given as
below:
\[
B_{(VC_k,PC_i,r_j)} = \sum_{\forall f_c \in F_{(VC_k,PC_i,r_j)}} \left( \sigma_c + \rho_c T^p_{(f_c,r_j)} + \left( \theta_c - T^p_{(f_c,r_j)} \right)^+ \left[ \left( p_c - R_{(f_c,r_j)} \right)^+ - p_c + \rho_c \right] \right) \tag{12}
\]
where $F_{(VC_k,PC_i,r_j)}$ is the set of flows passing through $VC_k$ in physical channel $PC_i$
of router $r_j$.
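A sketch of Equation (12) is shown below. It assumes that θ in the sum is the per-flow value $\theta_c = (\sigma_c - L_c)/(p_c - \rho_c)$ and that $T^p$ is the latency seen by the flow in the router; both readings, and all names, are assumptions of this illustration.

```python
def buffer_threshold(flows):
    """Equation (12): buffer space threshold of one virtual channel.

    flows: iterable of dicts with keys sigma, rho, p, L (TSPEC of flow f_c),
           R (service rate for f_c) and T_p (latency for f_c) in the router.
    """
    total = 0.0
    for f in flows:
        th = (f["sigma"] - f["L"]) / (f["p"] - f["rho"])
        total += (f["sigma"] + f["rho"] * f["T_p"]
                  + max(th - f["T_p"], 0.0)
                  * (max(f["p"] - f["R"], 0.0) - f["p"] + f["rho"]))
    return total


# Hypothetical flow with TSPEC (L=1, p=1, sigma=2, rho=0.032) served at R = 0.5.
late = dict(sigma=2, rho=0.032, p=1, L=1, R=0.5, T_p=2.0)   # theta <= T_p
early = dict(sigma=2, rho=0.032, p=1, L=1, R=0.5, T_p=0.0)  # theta > T_p
print(buffer_threshold([late]), buffer_threshold([early]))
```

With zero latency the second case reduces to the largest vertical gap between the arrival curve and the service line, reached at t = θ.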
6.2. Step 2: Inter-router ESC
We have analyzed and modeled three kinds of sharing to compute the intra-router
ESC. After analyzing per-router resource sharing (intra-ESC), the effects of buffer
sharing and channel sharing on the tagged flow have been accounted for, and we obtain an
analysis model which keeps only channel&buffer sharing for the tagged flow. This model is
called the aggregate analysis model. For example, suppose that a tagged flow f1 traverses
a tandem of routers and is multiplexed with contention flows as depicted in Figure
8(a). After analyzing the intra-router ESC, the aggregate analysis model is shown in Figure 8(b). In
this model, $\beta_{(s_i,r_j)}$ denotes the service curve offered to flow $f_{s_i}$ in router $r_j$.
For instance, $\beta_{(\{1,2\},r_3)}$ is the service curve of flow $f_{\{1,2\}}$ in router $r_3$, where $f_{\{1,2\}}$ denotes
the aggregate of flows f1 and f2. The set of $s_i$'s in a tandem of routers is denoted as
$S = \{s_i\}$. For example, in Figure 8(b), $S = \{\{1\}, \{1, 2, 3\}, \{1, 2\}, \{1\}\}$.
Now, we consider the aggregate analysis model to recognize interference patterns and
remove contention flows one by one. A tagged flow directly contends with contention
flows. Also, contention flows may contend with each other and then contend with the
tagged flow again. To consider inter-ESC in the aggregate analysis model, we decompose
a complex contention scenario into two basic contention patterns, namely, Nested
and Crossed. Figures 8, 9, 10, and 11 illustrate examples of different kinds of nested
contention, and an example of crossed contention is shown in Figure 12. In the following,
we describe these examples in more detail.
We use the algebra of sets to recognize the contention scenarios. To facilitate our
discussion, we define convenient notations by the example in Figure 8(b). In the example,
the tandem of servers is $\{\beta_{(\{1\},r_1)}, \beta_{(\{1,2,3\},r_2)}, \beta_{(\{1,2\},r_3)}, \beta_{(\{1\},r_4)}\}$ and
$S = \{s_i\} = \{\{1\}, \{1, 2, 3\}, \{1, 2\}, \{1\}\}$. We define $s^m = \{ s_x \mid |s_x| = \max(|s_i|); \forall s_i \in S \}$, where
$|s_x|$ is the cardinality (the number of elements) of set $s_x$. The flow, service curve, and
router related to $s^m$ are denoted as $f_{s^m}$, $\beta^m$, and $r^m$, respectively. Thus, in Figure 8(b),
$s^m = \{1, 2, 3\}$, $f_{s^m} = f_{\{1,2,3\}}$, $r^m = r_2$, and $\beta^m = \beta_{(\{1,2,3\},r_2)}$.
We denote the service curve placed before $\beta^m$ in the aggregate analysis model by
$\beta^{Prev}$ and the related aggregate flow and router as $f_{s^{Prev}}$ and $r^{Prev}$, respectively. Likewise,
$\beta^{Next}$ denotes the service curve placed after $\beta^m$. Therefore, since $\beta^m = \beta_{(\{1,2,3\},r_2)}$
in Figure 8(b), $\beta^{Prev} = \beta_{(\{1\},r_1)}$, $s^{Prev} = \{1\}$, $f_{s^{Prev}} = f_{\{1\}}$, $r^{Prev} = r_1$,
$\beta^{Next} = \beta_{(\{1,2\},r_3)}$, $s^{Next} = \{1, 2\}$, $f_{s^{Next}} = f_{\{1,2\}}$, and $r^{Next} = r_3$.
The contention recognition procedure in an aggregate analysis model can be generalized
as the following steps:
(1) Find $s^m = \{ s_x \mid |s_x| = \max(|s_i|); \forall s_i \in S \}$.
(2) If $s^{Prev} \subset s^{Next}$, then the contention is Nested; remove $f_{s^m - (s^m \cap s^{Next})}$ from $\beta^m$.
(3) If $s^{Next} \subset s^{Prev}$, then the contention is Nested; remove $f_{s^m - (s^m \cap s^{Prev})}$ from $\beta^m$.
(4) Else
(a) If $s^{Prev} \subset s^m$ and $s^{Next} \not\subset s^m$, then the contention is Nested;
— Remove $f_{s^m - (s^m \cap s^{Prev})}$ from $\beta^m$.
(b) If $s^{Next} \subset s^m$ and $s^{Prev} \not\subset s^m$, then the contention is Nested;
— Remove $f_{s^m - (s^m \cap s^{Next})}$ from $\beta^m$.
(c) Else, it is Crossed.
— The problem is transformed into a combination of two nested flows.
To remove a contention flow from a service curve and derive the new service curve,
we apply Corollary 1 from Section 5.
When $s^m$ is not unique, any of the candidates can be selected. In this paper, we choose the
first one from the left side in the aggregate analysis network.
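The recognition steps can be sketched with Python sets as follows. This is an illustration rather than the authors' implementation: it follows the branch order of Algorithm 1 (lines 13 to 28), reads the paper's ⊂ as "subset or equal", and does not handle the special case where the previous, next, and maximum sets are all equal. The names `first_largest` and `classify` are hypothetical.

```python
def first_largest(S):
    """Index of s^m: the first set of maximum cardinality, from the left."""
    return max(range(len(S)), key=lambda i: len(S[i]))


def classify(s_prev, s_m, s_next):
    """Return (pattern, flows to remove from beta^m) for one recognition step."""
    s_prev, s_m, s_next = set(s_prev), set(s_m), set(s_next)
    if s_prev <= s_next:                        # Algorithm 1, lines 13-14
        return "nested", s_m - (s_m & s_next)
    if s_next <= s_prev:                        # lines 15-16
        return "nested", s_m - (s_m & s_prev)
    if s_prev <= s_m and not s_next <= s_m:     # lines 18-19
        return "nested", s_m - (s_m & s_prev)
    if s_next <= s_m and not s_prev <= s_m:     # lines 20-21
        return "nested", s_m - (s_m & s_next)
    return "crossed", s_m - s_prev              # lines 22-28: this flow is split


S = [{1}, {1, 2, 3}, {1, 2}, {1}]               # tandem of Figure 8(b)
m = first_largest(S)
print(classify(S[m - 1], S[m], S[m + 1]))       # ('nested', {3})
```

For a maximum set at either end of the tandem, an empty set can stand in for the missing neighbour, as in the worked example of Section 7.2.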
In the case of $s^{Next} = s^{Prev}$, there are two possibilities:
(1) $s^{Next} = s^{Prev} \ne s^m$: Since $s^{Next} \subset s^{Prev}$ and $s^{Prev} \subset s^{Next}$, the contention is nested
as previously described in the contention recognition steps.
(2) $s^{Next} = s^{Prev} = s^m$: In this case, the three nodes $s^{Next}$, $s^{Prev}$, and $s^m$ should be combined
as a single server by applying the theorem of concatenation of network elements
[Le Boudec et al. 2004]. It will be discussed in Section 6.3.
[Figure 8, Nested Flows, Case 1: (a) flows f1, f2, and f3 over routers r1 to r4; (b) aggregate analysis model $\beta_{(\{1\},r_1)}, \beta_{(\{1,2,3\},r_2)}, \beta_{(\{1,2\},r_3)}, \beta_{(\{1\},r_4)}$; (c) the model after removing f3, with $\beta_{(\{1,2\},r_2)}$.]
Fig. 8. Analysis for the first type of nested flows.
[Figure 9, Nested Flows, Case 2: (a) flows f1, f2, and f3 over routers r1 to r4; (b) aggregate analysis model $\beta_{(\{1\},r_1)}, \beta_{(\{1,2\},r_2)}, \beta_{(\{1,2,3\},r_3)}, \beta_{(\{1\},r_4)}$; (c) the model after removing f3, with $\beta_{(\{1,2\},r_3)}$.]
Fig. 9. Analysis for the second type of nested flows.
[Figure 10, Nested Flows, Case 3: (a) flows f1 to f4 over routers r1 to r5; (b) aggregate analysis model $\beta_{(\{1\},r_1)}, \beta_{(\{1,2\},r_2)}, \beta_{(\{1,2,3\},r_3)}, \beta_{(\{1,4\},r_4)}, \beta_{(\{1\},r_5)}$; (c) the model after removing f3, with $\beta_{(\{1,2\},r_3)}$.]
Fig. 10. Analysis for the third type of nested flows.
In the following, we give examples for various contention patterns.
6.2.1. Nested Flows. Four different types of nested contention are exemplified in Figures
8, 9, 10, and 11. Flow f3 is nested in flow f2 in Figures 8, 9, and 10, and it is
nested in flow f4 in Figure 11.
[Figure 11, Nested Flows, Case 4: (a) flows f1 to f4 over routers r1 to r5; (b) aggregate analysis model $\beta_{(\{1\},r_1)}, \beta_{(\{1,2\},r_2)}, \beta_{(\{1,3,4\},r_3)}, \beta_{(\{1,4\},r_4)}, \beta_{(\{1\},r_5)}$; (c) the model after removing f3, with $\beta_{(\{1,4\},r_3)}$.]
Fig. 11. Analysis for the fourth type of nested flows.
[Figure 12, Crossed Flows: (a) flows f1, f2, and f3 over routers r1 to r5; (b) aggregate analysis model $\beta_{(\{1\},r_1)}, \beta_{(\{1,2\},r_2)}, \beta_{(\{1,2,3\},r_3)}, \beta_{(\{1,3\},r_4)}, \beta_{(\{1\},r_5)}$; (c) f3 split into f3 and f3′; (d) the model after removing f3 and f3′, with $\beta_{(\{1,2\},r_3)}$ and $\beta_{(\{1\},r_4)}$.]
Fig. 12. Analysis for crossed flows.
— Figure 8(b) shows the first type of nested flows after applying intra-ESC, in which
$s^m = \{1, 2, 3\}$, $s^{Prev} = \{1\}$, and $s^{Next} = \{1, 2\}$. In this case, $s^{Prev} \subset s^{Next}$ and, by
step 2 of the contention recognition procedure, we remove flow $f_{\{1,2,3\} - (\{1,2,3\} \cap \{1,2\})} = f_{\{3\}}$
from $\beta_{(\{1,2,3\},r_2)}$ and derive $\beta_{(\{1,2\},r_2)}$, as depicted in Figure 8(c).
— The second type of nested flows in the aggregate analysis model is depicted in Figure
9. From Figure 9(b), $s^m = \{1, 2, 3\}$, $s^{Prev} = \{1, 2\}$, and $s^{Next} = \{1\}$. In this case,
$s^{Next} \subset s^{Prev}$ and flow $f_{\{1,2,3\} - (\{1,2,3\} \cap \{1,2\})} = f_{\{3\}}$ is eliminated from $\beta_{(\{1,2,3\},r_3)}$
by step 3 of the contention recognition procedure. Figure 9(c) shows the aggregate
analysis model after removing f3.
— Figure 10 shows an example of the third type of nested contention. Based on the aggregate
analysis model depicted in Figure 10(b), $s^m = \{1, 2, 3\}$, $s^{Prev} = \{1, 2\}$, and
$s^{Next} = \{1, 4\}$. Since $s^{Next} \not\subset s^{Prev}$, $s^{Prev} \not\subset s^{Next}$, $s^{Prev} \subset s^m$, and $s^{Next} \not\subset s^m$, by
step 4.a) of the contention recognition procedure the case is a nested contention, and flow
$f_{\{1,2,3\} - (\{1,2,3\} \cap \{1,2\})} = f_{\{3\}}$ is removed from $\beta_{(\{1,2,3\},r_3)}$, as shown in Figure 10(c).
— Figure 11 shows a type of nested contention related to step 4.b) of the contention
recognition procedure. From Figure 11(b), $s^m = \{1, 3, 4\}$, $s^{Prev} = \{1, 2\}$, and
$s^{Next} = \{1, 4\}$. Since $s^{Next} \not\subset s^{Prev}$, $s^{Prev} \not\subset s^{Next}$, $s^{Next} \subset s^m$, and $s^{Prev} \not\subset s^m$, it
is a nested contention, and Figure 11(c) shows that flow $f_{\{1,3,4\} - (\{1,3,4\} \cap \{1,4\})} = f_{\{3\}}$
is eliminated from $\beta_{(\{1,3,4\},r_3)}$.
6.2.2. Crossed Flows. Figure 12 shows contention flow f2 crossed with f3. From
Figure 12(b), $s^m = \{1, 2, 3\}$, $s^{Prev} = \{1, 2\}$, and $s^{Next} = \{1, 3\}$. Since neither $s^{Prev}$ nor
$s^{Next}$ is a subset of the other and both of them are subsets of $s^m$, by step 4.c)
of the contention recognition procedure this case is a crossed contention. There are two
cross points, one between r2 and r3 and the other between r3 and r4. We cut f3 at the
second cross point, i.e., at the ingress of r4, so that f3 is split into two flows, $f_3$ and $f'_3$, as
shown in Figure 12(c). The problem is then transformed into a combination of
nested flows such that $f_3$ is nested in flow $f_2$ and $f'_3$ in $f_1$. The arrival
curve $\alpha_{(f_3,r_3)}$ equals $\alpha_3$ and the arrival curve $\alpha_{(f'_3,r_3)}$ equals $\alpha^*_{(f_3,r_3)}$. To compute
$\alpha^*_{(f_3,r_3)}$, we need the ESC of r3 for f3, $\beta_{(f_3,r_3)}$. Then, we calculate the output
arrival curve of f3 as $\alpha^*_{(f_3,r_3)} = \alpha_{(f_3,r_3)} \oslash \beta_{(f_3,r_3)}$ by applying Theorem 3
in Section 5. Now, nested flows $f_3$ and $f'_3$ can be removed from the tandem as shown in
Figure 12(d).
6.3. End-to-end ESC
We show a high-level analysis flow for deriving the end-to-end ESC in Figure 13 and
then present the end-to-end ESC algorithm with more details and one example.
To calculate the end-to-end ESC, we first obtain the intra-router ESC for the tagged flow in
each router. Then we use the theorem of concatenation of network elements [Le Boudec
et al. 2004] to model sequentially connected nodes, each offering a service curve
$\beta_{(s_i,r_j)}$, $j = 1, 2, ..., n$, to the same aggregate flow, as a single server as follows:
\[
\beta_{(s_i, r_{1,2,...,n})} = \beta_{(s_i,r_1)} \otimes \beta_{(s_i,r_2)} \otimes \dots \otimes \beta_{(s_i,r_n)}
\]
In the next step, we calculate the inter-router ESC by applying the contention recognition
stages and removing contention flows as described in Section 6.2. After that, the
concatenation theorem is applied again to find more equivalent servers and reduce the
number of service curves. For instance, after removing contention flow f3 in Figure
8(c), the service curve of sub-tandem $\{r_2, r_3\}$ for aggregate flow $f_{\{1,2\}}$ is computed as
$\beta_{(\{1,2\},r_{2,3})} = \beta_{(\{1,2\},r_2)} \otimes \beta_{(\{1,2\},r_3)}$. If we repeat the contention recognition steps, the next
contention flow is f2 nested in f1. If we similarly remove it from $\beta_{(\{1,2\},r_{2,3})}$ and calculate
the convolution $\beta_{(\{1\},r_{1,2,3})} = \beta_{(\{1\},r_1)} \otimes \beta_{(\{1\},r_{2,3})}$, the end-to-end ESC of tagged flow f1
is obtained.
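For servers of the form $\delta_T \otimes \gamma_{0,R}$, the concatenation step amounts to adding latencies and taking the minimum rate (Equation (19) in Section 7.2). The sketch below merges adjacent servers that serve the same aggregate flow; the tuple layout and the name `merge_same_aggregate` are assumptions made for illustration.

```python
from itertools import groupby


def merge_same_aggregate(tandem):
    """Concatenate adjacent servers that serve the same aggregate flow.

    tandem: list of (flow_set, latency, rate) describing
            delta_latency (x) gamma_{0, rate} for aggregate flow_set.
    Only neighbouring entries with equal flow_set are merged.
    """
    merged = []
    for flows, group in groupby(tandem, key=lambda server: server[0]):
        group = list(group)
        merged.append((flows,
                       sum(g[1] for g in group),    # latencies add up
                       min(g[2] for g in group)))   # slowest rate dominates
    return merged


# Hypothetical tandem in the spirit of Figure 8(c), with made-up parameters.
tandem = [
    (frozenset({1}), 0.0, 1.0),
    (frozenset({1, 2}), 1.0, 1.0),
    (frozenset({1, 2}), 2.0, 0.5),
    (frozenset({1}), 0.0, 1.0),
]
for server in merge_same_aggregate(tandem):
    print(server)
```

The two servers for aggregate {1, 2} collapse into one with latency 3.0 and rate 0.5, while the two servers for {1} stay separate because they are not adjacent.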
Algorithm 1 explains the procedure of calculating the end-to-end ESC in more detail.
— Joining node: In Lines 2−8, the algorithm checks whether the source node of a contention flow
$f_i$ is one of the nodes along the tagged flow's path. If it is not, we should calculate
the input TSPEC of contention flow $f_i$ at the point where it joins the
tagged flow's route (point A in Figure 14 when f1 is the tagged flow). We obtain this
point by function JoiningPoint($f_i$) and call it the joining node.
We give an example in Figure 15 to show how to derive an aggregate analysis model
and obtain the end-to-end ESC by following the proposed algorithm.
Assuming the tagged flow is f1, line 1 of the algorithm finds $CF_t$, which is $\{f_2, f_3, f_4\}$
in the example.
—Loop 1 in the algorithm (Lines 2−8): In Lines 3−4, the algorithm obtains the joining
node for each contention flow whose source node is not one of the nodes along the
tandem. Then, the end-to-end ESC of flow $f_j$ from its source node to the joining node is
derived by recursively calling ESC($f_j$, Src(j), joiningnode) in Line 5. Line 6
computes the output arrival curve, which is the input arrival curve to the joining node, and
the input TSPEC is extracted from it. In the example of Figure 15(a), all source nodes
of contention flows are on the tagged flow's route, so lines 4−6 are skipped for them.
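One plausible reading of JoiningPoint is the first router on a contention flow's route that also lies on the tagged flow's route. The sketch below implements that reading on router-name lists; the paths and helper names are hypothetical, and the paper does not specify how its C++ implementation represents routes.

```python
def joining_point(contention_path, tagged_path):
    """First router of the contention flow that is also on the tagged path."""
    on_tagged = set(tagged_path)
    for router in contention_path:
        if router in on_tagged:
            return router
    return None  # the flows never meet


def needs_upstream_esc(contention_path, tagged_path):
    """True if Src(j) is not on Path(t), i.e. lines 4-6 of Algorithm 1 apply."""
    return contention_path[0] not in set(tagged_path)


# Hypothetical routes, loosely modelled on Figure 14.
tagged = ["r1", "r2", "r3", "r5"]
f2 = ["r4", "r2", "r3"]
print(joining_point(f2, tagged), needs_upstream_esc(f2, tagged))
```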
[Figure 13 (flowchart): The inputs are the application (communication pattern, TSPEC of flows) and the architecture (topology, deadlock-free deterministic routing, tagged flow, service curve of routers). Steps: calculate the intra-router ESC $\beta_{(f_i,r_j)}$ for each router according to Section 6.1 until all routers are covered; concatenate adjacent service curves that serve the same aggregate flow; find the first maximum $s^m$; calculate the inter-router ESC based on Section 6.2; repeat until $|s^m| = 1$, when the one remaining $\beta_{(s_i,r_j)}$ is the end-to-end ESC for the tagged flow.]
Fig. 13. End-to-end ESC analysis flow.
Fig. 14. The example of joining point.
Algorithm 1 end-to-end ESC
1: Find the set of contention flows of tagged flow $f_t$, denoted by $CF_t$
2: for $\forall j \in CF_t$ do
3:   if $Src(j) \notin Path(t)$ then
4:     Find joiningnode = JoiningPoint($f_j$)
5:     Calculate X = ESC($f_j$, Src(j), joiningnode)
6:     $\alpha_j = \alpha_j \oslash X$
7:   end if
8: end for
9: Calculate intra-router ESC based on Section 6.1.
10: Calculate $\beta_{(s_{i_1},r_{j_1})} \otimes \beta_{(s_{i_2},r_{j_2})} \otimes \dots \otimes \beta_{(s_{i_n},r_{j_n})}$ if $i_1 = i_2 = \dots = i_n$.
11: Find $s^m = \{ s_x \mid |s_x| = \max(|s_i|); \forall s_i \in S \}$.
12: repeat
13:   if $s^{Prev} \subset s^{Next}$ then
14:     Remove $f_{s^m - (s^m \cap s^{Next})}$ from $\beta^m$.
15:   else if $s^{Next} \subset s^{Prev}$ then
16:     Remove $f_{s^m - (s^m \cap s^{Prev})}$ from $\beta^m$.
17:   else
18:     if $s^{Prev} \subset s^m$ and $s^{Next} \not\subset s^m$ then
19:       Remove $f_{s^m - (s^m \cap s^{Prev})}$ from $\beta^m$.
20:     else if $s^{Next} \subset s^m$ and $s^{Prev} \not\subset s^m$ then
21:       Remove $f_{s^m - (s^m \cap s^{Next})}$ from $\beta^m$.
22:     else
23:       Find joiningnode = JoiningPoint($f_{(s^m - s^{Prev})}$).
24:       Calculate X = ESC($f_{(s^m - s^{Prev})}$, joiningnode, $r^{Next}$).
25:       $\alpha'_{(s^m - s^{Prev})} = \alpha_{(s^m - s^{Prev})} \oslash X$
26:       Remove $f_{(s^m - s^{Prev})}$ from $\beta^m$.
27:       Remove $f'_{(s^m - s^{Prev})}$ from $\beta^{Next}$.
28:     end if
29:   end if
30:   Calculate $\beta_{(s_{i_1},r_{j_1})} \otimes \beta_{(s_{i_2},r_{j_2})} \otimes \dots \otimes \beta_{(s_{i_n},r_{j_n})}$ if $i_1 = i_2 = \dots = i_n$.
31:   Find $s^m$.
32: until $|s^m| = 1$
33: return end-to-end ESC for tagged flow $f_t$
Line 9 obtains the intra-router ESC for the tagged flow according to Section 6.1. Figure 15(b)
shows the aggregate analysis model for the example. By line 10, $\beta_{(\{1,2\},r_{3,4})} =
\beta_{(\{1,2\},r_3)} \otimes \beta_{(\{1,2\},r_4)}$. Figure 15(c) depicts the example in this step. By line
11, $s^m = \{1, 2, 3\}$.
—Loop 2 in the algorithm (Lines 12−32): In Lines 13−29, we consider different
contention scenarios along the route using the algebra of sets. In this step, we
remove contention flows one by one according to their effects on the tagged flow, as
mentioned in Section 6.2. Lines 13−21 handle nested contention and lines 22−28
handle crossed contention.
—Nested contention in the example: From Figure 15(c), $s^m = \{1, 2, 3\}$,
$s^{Prev} = \{1\}$, and $s^{Next} = \{1, 2\}$. Since $s^{Prev} \subset s^{Next}$, by lines 13−14, flow
$f_{\{1,2,3\} - (\{1,2,3\} \cap \{1,2\})} = f_3$ is removed from $\beta_{(\{1,2,3\},r_2)}$ as shown in Figure 15(d).
Lines 30−31 are the same as lines 10−11: they concatenate the nodes
serving the same aggregate flow and then obtain the new $s^m$, which results in $\beta_{(\{1,2\},r_{2,3,4})} =
\beta_{(\{1,2\},r_2)} \otimes \beta_{(\{1,2\},r_{3,4})}$ and $s^m = \{1, 2, 4\}$ (Figure 15(e)).
—Crossed contention in the example: If we repeat the contention recognition steps
in Loop 2, the next contention in the example is crossed. From Figure 15(e), $s^m =
\{1, 2, 4\}$, $s^{Prev} = \{1, 2\}$, and $s^{Next} = \{1, 4\}$. Since neither $s^{Prev} \subset s^{Next}$ nor $s^{Next} \subset
s^{Prev}$, and both $s^{Next} \subset s^m$ and $s^{Prev} \subset s^m$, the algorithm goes to the else part (lines 22−28).
As shown in Figure 15(e), contention flow f2 is crossed with
f4. There are two cross points, one between $r_{2,3,4}$ and $r_5$ and the other between $r_5$
and $r_6$. Following the algorithm, we cut f4 at the second cross point, i.e., at the
ingress of $r_6$, so that f4 is split into two flows, $f_4$ and $f'_4$, as shown in Figure 15(f).
Then, the problem is transformed into a combination of two nested scenarios.
The arrival curve $\alpha_{f'_4}$ of $f'_4$ is equal to the output arrival curve $\alpha^*_{f_4}$ of $f_4$. To compute $\alpha^*_{f_4}$, we
need the ESC of f4 from $r_5$ to $r_6$, which is derived by lines 23 and 24.
Then, line 25 calculates the output arrival curve $\alpha^*_{f_4}$ ($\alpha_{f'_4}$) by applying
Theorem 3 in Section 5. Then, $f_4$ and $f'_4$ are removed from $r_5$ and $r_6$ by lines
26 and 27, respectively, as shown in Figure 15(g).
Therefore, according to lines 30−31, $\beta_{(\{1,2\},r_{2,3,4,5})} = \beta_{(\{1,2\},r_{2,3,4})} \otimes \beta_{(\{1,2\},r_5)}$,
$\beta_{(\{1\},r_{6,7})} = \beta_{(\{1\},r_6)} \otimes \beta_{(\{1\},r_7)}$, and $s^m = \{1, 2\}$. We similarly repeat the contention
recognition and convolution steps until $|s^m| = 1$, at which point the end-to-end ESC
of tagged flow f1 is obtained.
[Figure 15: (a) flows f1 to f4 over routers r1 to r7; (b) aggregate analysis model $\beta_{(1,r_1)}, \beta_{(\{1,2,3\},r_2)}, \beta_{(\{1,2\},r_3)}, \beta_{(\{1,2\},r_4)}, \beta_{(\{1,2,4\},r_5)}, \beta_{(\{1,4\},r_6)}, \beta_{(1,r_7)}$; (c) r3 and r4 concatenated into $\beta_{(\{1,2\},r_{3,4})}$; (d) f3 removed, giving $\beta_{(\{1,2\},r_2)}$; (e) concatenation into $\beta_{(\{1,2\},r_{2,3,4})}$; (f) f4 split into f4 and f4′; (g) f4 and f4′ removed, giving $\beta_{(\{1,2\},r_5)}$ and $\beta_{(\{1\},r_6)}$.]
Fig. 15. An example of end-to-end ESC computation.
6.4. LUDB Derivation
To compute the delay bound for a flow passing a series of nodes, one simple way is
to sum the delay bounds at each node. However, this results in a
loose total delay bound. To tighten the worst-case delay bound along the network, the
end-to-end service curve of the flow is used, as stated in the "Pay Bursts Only Once" corollary
[Le Boudec et al. 2004]. Hence, we first calculate the end-to-end ESC of the tagged flow
based on the proposed algorithm and then obtain the LUDB according to Theorem 1. We
have implemented the algorithms employed in our methodology.
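As an illustration of the last step, the sketch below evaluates the delay bound of a VBR flow served by a rate-latency ESC $\delta_T \otimes \gamma_{0,R}$. It assumes the bound takes the same form as Equation (8), i.e. the standard network-calculus result for a TSPEC arrival curve with ρ ≤ R; the function name is hypothetical and the ceiling to whole cycles is left out.

```python
def delay_bound(L, p, sigma, rho, T, R):
    """Horizontal deviation between alpha = min(L + p*t, sigma + rho*t)
    and beta = delta_T (x) gamma_{0,R}, in the form of Equation (8)."""
    if rho > R:
        raise ValueError("the bound is finite only if rho <= R")
    th = (sigma - L) / (p - rho)
    return T + (L + th * max(p - R, 0.0)) / R


# Hypothetical end-to-end ESC with latency 2 cycles and rate 0.5 flit/cycle,
# and a tagged flow with TSPEC (L=1, p=1, sigma=8, rho=0.128).
print(delay_bound(1, 1, 8, 0.128, T=2.0, R=0.5))
```

Because the burst term appears once, against the end-to-end rate, this bound does not accumulate a burst penalty at every hop.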
[Figure 16: panels (a) and (b) show flows f1 to f4 over routers r1 to r4.]
Fig. 16. A synthetic example.
7. EXPERIMENTS
7.1. Experimental Setup
To evaluate the capability of our method, we applied it to a synthetic traffic pattern
and a realistic one. Throughout the experiments, we assume an SoC with 500 MHz
frequency in which packets traverse the network using the XY routing algorithm.
Flows follow TSPEC, $f_i \propto (L_i, p_i, \sigma_i, \rho_i)$, and each node guarantees the service curve
$\beta_{R,T}(t) = \delta_T \otimes \gamma_{0,R}$, where the serving rate $R$ is $C$ flit/cycle and the latency $T$ is
$L_w/C + D_{router}$ cycles. We have implemented the proposed analytical model in C++ to
automate the analysis steps.
7.2. Synthetic Traffic Pattern
We synthesize a simple traffic pattern as shown in Figure 16 to follow the analytical
approach step by step and derive numerical results. The figure depicts a network with
4 flows and 4 routers which serve flows in the FIFO order. f1 is the tagged flow and f2
and f4 are contention flows.
7.2.1. Computation of the end-to-end equivalent service curve.
Step 1: We first calculate the intra-router ESC for the tagged flow in each node. Then,
we can model a flow passing through a series of routers as a series of concatenated
pseudo-affine servers.
It is worth mentioning that the TSPEC of each flow $f_j$ mentioned above is the TSPEC of
the input flow to its source node, for example $f_2 \propto (L_2, p_2, \sigma_2, \rho_2)$, which means $\rho_{(f_2,r_1)} =
\rho_2$, and the other characteristics can be obtained likewise.
— In router r1: From Equations (6) and (7), the ESC for aggregate flow $f_{\{1,2\}}$ in node 1
is given by:
\[
\beta_{(f_{\{1,2\}},r_1)} = \delta_0 \otimes \gamma_{0,C} \tag{13}
\]
— In router r2: $F^B_{(f_1,r_2)} = \{f_2\}$ and, by Equations (6) and (7), $R_{(f_1,r_2)} = C$ and
$T^l_{(f_1,r_2)} = 0$. Furthermore, $T^{Total}_{(f_1,r_2)} = T^l_{(f_1,r_2)} + T^{HoL}_{(f_1,r_2)}$ and, by Equations (10)
and (11), $T^{HoL}_{(f_1,r_2)} = \max_{\forall f_c \in F^B_{(f_1,r_2)}} \left( T^{HoL(f_c)}_{(f_1,r_2)} \right) = T^{HoL(f_2)}_{(f_1,r_2)}$, where $T^{HoL(f_2)}_{(f_1,r_2)}$ is calculated
as follows:
\[
T^{HoL(f_2)}_{(f_1,r_2)} = T^l_{(f_2,r_2)} - \theta_{(f_2,r_2)} + \frac{L_{(f_2,r_2)} + \theta_{(f_2,r_2)} p_{(f_2,r_2)}}{R_{(f_2,r_2)}} \tag{14}
\]
where $R_{(f_2,r_2)} = \frac{C}{2}$ and $T^l_{(f_2,r_2)} = \frac{L_w}{C} + D_{router}$, because two VCs (one transmits f2
and the other f3) share the ejection channel of router r2. In Equation (14),
we need the TSPEC of input flow f2 to r2, which is the TSPEC of output flow
f2 from r1. Since the TSPEC is derived from the arrival curve, we obtain the arrival curve
of output flow f2 from r1 by applying Theorem 3 in Section 5. We
assume $\theta_i \le T_{(f_i,r_j)}$ for every flow $f_i$ passing through every router $r_j$. Thus, $\alpha^*_{(f_2,r_1)} = \alpha_{(f_2,r_2)} =
\gamma_{\sigma_{(f_2,r_1)} + \rho_{(f_2,r_1)} T_{(f_2,r_1)},\ \rho_{(f_2,r_1)}}$, where $\rho_{(f_2,r_1)} = \rho_2$ and $\sigma_{(f_2,r_1)} = \sigma_2$, so that
$\alpha_{(f_2,r_2)} = \gamma_{\sigma_2 + \rho_2 T_{(f_2,r_1)},\ \rho_2}$. To derive $T_{(f_2,r_1)}$, we first obtain the ESC for
flow f2 in router r1, $\beta_{(f_2,r_1)}$, as follows.
From Equation (13), $\beta_{(f_{\{1,2\}},r_1)} = \delta_0 \otimes \gamma_{0,C}$. We then remove f1 from aggregate flow
$f_{\{1,2\}}$ according to Corollary 1 in Section 5, and $\beta_{(f_2,r_1)}$ is given by:
\[
\beta_{(f_2,r_1)} = \delta_{\left[ \frac{L_1 + \theta_1 (p_1 - C)^+}{C} \right]^+ + \theta_1} \otimes \gamma_{0,C-\rho_1} = \delta_{\frac{L_1 + \theta_1 p_1}{C}} \otimes \gamma_{0,C-\rho_1} \tag{15}
\]
Hence $T_{(f_2,r_1)} = \frac{L_1 + \theta_1 p_1}{C}$ and $\alpha_{(f_2,r_2)} = \gamma_{\sigma_2 + \frac{\rho_2 (L_1 + \theta_1 p_1)}{C},\ \rho_2}$, which means
$\sigma_{(f_2,r_2)} = \sigma_2 + \frac{\rho_2 (L_1 + \theta_1 p_1)}{C}$, $\rho_{(f_2,r_2)} = \rho_2$, $L_{(f_2,r_2)} = L_{(f_2,r_1)} = L_2$, and $p_{(f_2,r_2)} = p_{(f_2,r_1)} =
p_2$. Therefore, Equation (14) is rewritten as below:
\[
T^{HoL(f_2)}_{(f_1,r_2)} = \frac{L_w}{C} + D_{router} - \theta_{(f_2,r_2)} + \frac{L_2 + \theta_{(f_2,r_2)} p_2}{C/2} \tag{16}
\]
where $\theta_{(f_2,r_2)} = \frac{\sigma_{(f_2,r_2)} - L_{(f_2,r_2)}}{p_{(f_2,r_2)} - \rho_{(f_2,r_2)}} = \frac{\sigma_2 C + \rho_2 L_1 + \rho_2 \theta_1 p_1 - L_2 C}{C (p_2 - \rho_2)}$.
As mentioned before, $R_{(f_1,r_2)} = C$, $T^l_{(f_1,r_2)} = 0$, and $T^{Total}_{(f_1,r_2)} = T^l_{(f_1,r_2)} + T^{HoL}_{(f_1,r_2)}$. Therefore,
the ESC for tagged flow f1 in router 2 is given by:
\[
\beta_{(f_{\{1\}},r_2)} = \delta_{0 + T^{HoL(f_2)}_{(f_1,r_2)}} \otimes \gamma_{0,C} \tag{17}
\]
— In router r3: Since the VC of f1 shares the ejection channel of r3 with the VC of f4, by
Equations (6) and (7), $R_{(f_1,r_3)} = \frac{C}{2}$ and $T^l_{(f_1,r_3)} = \frac{L_w}{C} + D_{router}$. Thus, the ESC for
tagged flow f1 in router 3 is given by:
\[
\beta_{(f_{\{1\}},r_3)} = \delta_{\left( \frac{L_w}{C} + D_{router} \right)} \otimes \gamma_{0,\frac{C}{2}} \tag{18}
\]
Step 2: Now, we are able to compute per-flow ESC provided by the tandem of routers
the tagged flow passes. Figure 17 depicts different steps of computing end-to-end ESC
for tagged flow f1. After calculating intra-router ESC as mentioned in Step 1, we have
an aggregate analysis model as shown in Figure 17(b). Since we have investigated the
effect of flow f2 on tagged flow f1 in router r2, when we calculated β(f1,r2) in step 1, f2
is removed from r2 in Figure 17(b). Similarly, f3 and f4 are eliminated from r2 and r3,
respectively. We then obtain end-to-end ESC for tagged flow f1 by following Algorithm
1. Due to the algorithm, β({1},r2,3) in Figure 17(c) is calculated as β({1},r2) ⊗ β({1},r3).
We use the theorem of concatenation of network elements [Le Boudec et al. 2004]. Given two sequentially connected nodes, each offering a latency-rate service curve $\beta_{R_i,T_i}$, $i = 1, 2$, the two nodes can be represented as a single latency-rate server as follows:
\[
\beta_{R_1,T_1} \otimes \beta_{R_2,T_2} = \beta_{\min(R_1,R_2),\,T_1+T_2} \tag{19}
\]

Fig. 17. Analysis steps for the example in Figure 15. Panels: (a) flows f1 to f4 over routers r1, r2, and r3; (b) β({1,2},r1), β({1},r2), β({1},r3); (c) β({1,2},r1), β({1},r2,3); (d) β({1},r1), β({1},r2,3); (e) β({1},r1,2,3).
Therefore, $\beta_{(\{1\},r_{2,3})}$ is given by:
\[
\beta_{(\{1\},r_{2,3})} = \delta_{\frac{L_w}{C}+D_{router}+T^{HoL(f_2)}_{(f_1,r_2)}} \otimes \gamma_{0,\frac{C}{2}}. \tag{20}
\]
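The concatenation rule of Equation (19) can be illustrated with a small numerical check. The sketch below compares the closed form against a brute-force min-plus convolution evaluated on a grid; the two servers are hypothetical values chosen only for illustration, and the helper names are not part of the paper.

```python
def latency_rate(R, T):
    """Latency-rate service curve: beta_{R,T}(t) = R * max(t - T, 0)."""
    return lambda t: R * max(t - T, 0.0)


def min_plus_conv(f, g, t, steps=2000):
    """Approximate (f conv g)(t) = inf over 0 <= s <= t of f(t - s) + g(s),
    sampled on a uniform grid of `steps` intervals."""
    return min(f(t - t * k / steps) + g(t * k / steps) for k in range(steps + 1))


def concatenate(a, b):
    """Closed form of Eq. (19) for two servers given as (R, T) pairs."""
    return min(a[0], b[0]), a[1] + b[1]


# Hypothetical servers (rate, latency), chosen only for illustration.
server_a, server_b = (1.0, 3.0), (0.5, 2.0)
R_cat, T_cat = concatenate(server_a, server_b)

closed_form = latency_rate(R_cat, T_cat)
max_gap = max(
    abs(closed_form(t)
        - min_plus_conv(latency_rate(*server_a), latency_rate(*server_b), t))
    for t in (1.0, 5.0, 10.0, 20.0)
)
print(R_cat, T_cat, max_gap)
```

The grid search is only an approximation of the infimum, so the comparison is meaningful up to the grid resolution; the closed form is what the analysis actually uses.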
In Figure 17(c), $s_m = \{1, 2\}$, $s_{Prev} = \{\}$, and $s_{Next} = \{1\}$. The algorithm then removes flow f2 from aggregate flow f{1,2} in router r1. To this end, we apply Corollary 1 to obtain the ESC $\beta_{(\{1\},r_1)}$ by subtracting the arrival curve $\alpha_2$ of flow f2 from $\beta_{(\{1,2\},r_1)}$, as follows:
\[
\beta_{(\{1\},r_1)} = \delta_{\frac{L_2+\theta_2(p_2-C)^{+}}{C}+\theta_2} \otimes \gamma_{0,C-\rho_2} \tag{21}
\]
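Figure 17(d) depicts the example after removing the arrival curve of flow f2 from $\beta_{(\{1,2\},r_1)}$. Now, the end-to-end ESC can be calculated by:
\[
\beta_{(\{1\},r_{1,2,3})} = \beta^{eq}_{f_1} = \beta_{(\{1\},r_1)} \otimes \beta_{(\{1\},r_{2,3})} = \delta_{\frac{L_w+L_2+\theta_2(p_2-C)^{+}}{C}+D_{router}+\theta_2+T^{HoL(f_2)}_{(f_1,r_2)}} \otimes \gamma_{0,\min\left(\frac{C}{2},\,C-\rho_2\right)} \tag{22}
\]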
Suppose that the flows follow TSPEC with parameters in the order (L, p, σ, ρ): f1 ∝ (1, 1, 8, 0.128), f2 ∝ (1, 1, 2, 0.032), f3 ∝ (1, 1, 2, 0.008), and f4 ∝ (1, 1, 4, 0.128). Therefore, θj is computed for each flow fj as θ1 = (σ1 − L1)/(p1 − ρ1) = (8 − 1)/(1 − 0.128) = 8.027, θ2 = 1.033, θ3 = 1.008, and θ4 = 3.44. Also, we assume serving rate C = 1 flit/cycle, Lw = 1 flit, and Drouter = 1 cycle. Substituting these values into Equation (22) gives:
\[
\beta^{eq}_{f_1} = \delta_{9.363} \otimes \gamma_{0,0.5} \tag{23}
\]
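As a cross-check of Equation (23), the per-router ESCs of Equations (17), (18), and (21) can be combined with the concatenation rule of Equation (19). This is a minimal sketch under the parameter values stated above; representing each ESC as a (latency, rate) pair is an illustrative choice, not the authors' implementation.

```python
def pos(x):
    """Positive part, [x]^+ = max(x, 0)."""
    return max(x, 0.0)


C, Lw, D_router = 1.0, 1.0, 1.0
f1 = dict(L=1.0, p=1.0, sigma=8.0, rho=0.128)
f2 = dict(L=1.0, p=1.0, sigma=2.0, rho=0.032)

theta1 = (f1["sigma"] - f1["L"]) / (f1["p"] - f1["rho"])
theta2 = (f2["sigma"] - f2["L"]) / (f2["p"] - f2["rho"])

# Router r2: head-of-line latency of Eq. (16), using f2's burstiness at r2.
sigma_f2_r2 = f2["sigma"] + f2["rho"] * (f1["L"] + theta1 * f1["p"]) / C
theta_f2_r2 = (sigma_f2_r2 - f2["L"]) / (f2["p"] - f2["rho"])
t_hol = Lw / C + D_router - theta_f2_r2 + (f2["L"] + theta_f2_r2 * f2["p"]) / (C / 2)

# Per-router ESCs for f1 as (latency, rate) pairs.
esc_r1 = ((f2["L"] + theta2 * pos(f2["p"] - C)) / C + theta2, C - f2["rho"])  # Eq. (21)
esc_r2 = (t_hol, C)                                                           # Eq. (17)
esc_r3 = (Lw / C + D_router, C / 2)                                           # Eq. (18)

# Concatenation, Eq. (19): latencies add up, the smallest rate dominates.
T_eq = sum(latency for latency, _ in (esc_r1, esc_r2, esc_r3))
R_eq = min(rate for _, rate in (esc_r1, esc_r2, esc_r3))
print(round(T_eq, 2), R_eq)
```

Evaluated in floating point without intermediate rounding, the latency is about 9.36 cycles and the rate is 0.5 flits/cycle, in line with Equation (23) up to rounding of the intermediate θ values.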
7.2.2. Computation of LUDB.
According to Theorem 1 and Equation (22), the maximum delay for flow f1 is bounded by
\[
h(\alpha_1, \beta^{eq}_{f_1}) = \left\lceil \frac{L_w+L_2+\theta_2(p_2-C)^{+}}{C} + D_{router} + \theta_2 + T^{HoL(f_2)}_{(f_1,r_2)} + \frac{L_1+\theta_1\left(p_1-\min\left(\frac{C}{2},\,C-\rho_2\right)\right)^{+}}{\min\left(\frac{C}{2},\,C-\rho_2\right)} \right\rceil \tag{24}
\]
Substituting the values above into this equation gives $h(\alpha_1, \beta^{eq}_{f_1}) = \lceil 19.39 \rceil = 20$.
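The structure of Equation (24) is a latency term plus a burst term divided by the equivalent rate. The sketch below evaluates it for f1, assuming the equivalent service curve of Equation (23) (latency 9.363, rate 0.5) and ρ ≤ R; the function name is hypothetical.

```python
import math


def tspec_delay_bound(L, p, sigma, rho, T, R):
    """Delay bound of a TSPEC flow (L, p, sigma, rho) served by a
    latency-rate curve with latency T and rate R, following the form of
    Eq. (24): T + (L + theta * [p - R]^+) / R. Assumes rho <= R."""
    theta = (sigma - L) / (p - rho)
    return T + (L + theta * max(p - R, 0.0)) / R


# Tagged flow f1 with the equivalent service curve of Eq. (23).
h = tspec_delay_bound(L=1.0, p=1.0, sigma=8.0, rho=0.128, T=9.363, R=0.5)
ludb = math.ceil(h)
print(round(h, 2), ludb)
```

The unrounded value is about 19.39 cycles, so the bound in whole cycles is 20, as stated above.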
In what follows, we assess the accuracy of our proposed analytical method with the BookSim simulator [Jiang et al. 2013] and then compare it with methods that do not consider the traffic peak rate behavior [Lenzini et al. 2006].
7.2.3. Computation of Buffer Size Thresholds.
As routers are assumed to be input-buffered, we derive the buffer size threshold for each input channel in each router by following Eq. (12). In the example of Figure 16, we have assumed one VC per PC. The resulting buffer size thresholds are presented in Table II.

Table II. Buffer size thresholds (in flits) in the case study with synthetic traffic pattern

           Injection  Western  Northern  Eastern  Southern
           Channel    Channel  Channel   Channel  Channel
Router 1   6          -        -         -        -
Router 2   -          11       -         -        3
Router 3   -          8        8         -        -
Router 4   6          -        -         -        -

The buffer size thresholds marked by "-" are not used by flows and are thus not relevant for the threshold calculation.
The value of buffer size thresholds per channel depends on the traffic load on that
channel which is affected by the number of flows passing through the channel, their
traffic specifications, and the contention between them.
It is worth mentioning that while the calculated delay is an upper bound, the calculated buffer size threshold gives the lower bound on the buffer size needed to avoid back-pressure and buffer overflow. Therefore, the buffer sizes in simulations are set to be equal to or larger than the corresponding calculated buffer thresholds. In more detail, Table III shows the delay bounds derived from both the analytical model and simulation for the tagged flow f1 for different values of the buffer size. As all routers are assumed to have the same buffer space, the buffer size in the table should be equal to or larger than the maximum calculated threshold so that no buffer size threshold is violated. According to Table II, the maximum threshold in this example is 11 flits. As can be seen from Table III, when the buffer size is equal to or larger than 11 flits, the delay bound obtained from the simulator is fixed and very close to the analytical result. Otherwise, the simulation results cannot be compared with the analytical results because back-pressure and buffer overflow may happen, in which case the delay bound calculated from the model becomes invalid.
Table III. End-to-end delay bound for tagged flow f1 under different buffer sizes

Buffer size (flits)                14  13  12  11   8   6   4   3   2
Simulation results (cycles)        19  19  19  19  19  19  19  17  16
Analytical model results (cycles)  20  20  20  20  20  20  20  20  20

Fig. 18. Comparing D̄_VBR and D̄_CBR with the same equivalent service curve (R^eq = 0.5).
7.2.4. Simulation Result.
We investigate the accuracy of the proposed analytical model with BookSim, a cycle-accurate simulator [Jiang et al. 2013]. The simulation uses the same assumptions as the analytical model. We consider a 2 × 2 mesh on-chip interconnect as shown in Figure 16 and input-buffered routers with 12 flits in each input channel. It takes 1 clock cycle to pass a flit within a router and 1 clock cycle to transmit a flit over the wires between neighboring routers. We also use the XY routing algorithm to route data packets among cores.
The simulation shows that the worst-case delay for tagged flow f1 in this system is 19 cycles, which is below the LUDB of 20 cycles predicted by our model.
As a further experiment, we change the value of σ2 from 2 to 4. The LUDB calculated by our analytical model for tagged flow f1 is then 24 cycles and the simulation result is also 24 cycles, which again does not exceed the analytical LUDB.
7.2.5. Comparison.
If we use (σ, ρ) instead of TSPEC, each flow j is constrained by the arrival curve $\alpha_j(t) = \sigma_j + \rho_j t = \gamma_{\sigma_j,\rho_j}(t)$. Therefore, the flows in the example are represented as f1 ∝ (8, 0.128), f2 ∝ (2, 0.032), f3 ∝ (2, 0.008), and f4 ∝ (4, 0.128). We then follow the stages of computing individual delay bounds for a tagged flow as stated before. For this purpose, we can easily adapt our proposed theorems to (σ, ρ) flows by substituting σ and ρ for L and p, respectively, in all formulas. We can also apply the method presented in [Lenzini et al. 2006]. Both approaches yield the same value for $h(\alpha_1, \beta^{eq}_{f_1})$, namely 26. Thus, our proposed method, which calculates $\bar{D}_{VBR}$, improves the accuracy of the delay bound by 23% over the method with CBR flows ($\bar{D}_{CBR}$).
To analyze delay sensitivity, Table IV lists the LUDB for tagged flow f1 in a network with CBR flows ($\bar{D}_{CBR}$) and with VBR flows ($\bar{D}_{VBR}$) for different values of the service rate R, along with the parameters of the end-to-end ESC, $R^{eq}_{f_1}$, $T^{eq}_{f_1,CBR}$, and $T^{eq}_{f_1,VBR}$. The table shows that the end-to-end equivalent service rate, $R^{eq}_{f_1}$, decreases as R is reduced, while the end-to-end processing delays and delay bounds increase. The improvement percentage (η) also decreases because of the reduction of $R^{eq}_{f_1}$ and the increase of $T^{eq}_{f_1,CBR}$ and $T^{eq}_{f_1,VBR}$.
This is due to the relation between these parameters, which we elaborate in the following.

Fig. 19. VOPD application (task graph of the decoder blocks with edge bandwidths, and the mapping of flows f1 to f21 onto nodes n1 to n16).
Figure 18 shows $\bar{D}_{CBR}$ and $\bar{D}_{VBR}$ for $R^{eq} = 0.5$, where $p \ge R^{eq}$ and the end-to-end ESCs are in the form of $\delta_{T^{eq}} \otimes \gamma_{0,R^{eq}}$. According to network calculus theory [Le Boudec et al. 2004], $\bar{D}_{CBR} = T^{eq}_{CBR} + \frac{\sigma}{R^{eq}}$, and with Theorem 1, $\bar{D}_{VBR} = T^{eq}_{VBR} + \frac{L+\theta(p-R^{eq})}{R^{eq}}$. η is calculated as follows.
\[
\eta = \frac{\bar{D}_{CBR}-\bar{D}_{VBR}}{\bar{D}_{CBR}} = \frac{\sigma-L-p\theta+R^{eq}\left(T^{eq}_{CBR}-T^{eq}_{VBR}+\theta\right)}{\sigma+T^{eq}_{CBR}R^{eq}} \tag{25}
\]
To analyze the behavior of η, we compute the derivative of η with respect to $R^{eq}$:
\[
\frac{d\eta}{dR^{eq}} = \frac{T^{eq}_{CBR}(L+p\theta)-\sigma\left(T^{eq}_{VBR}-\theta\right)}{\left(\sigma+T^{eq}_{CBR}R^{eq}\right)^{2}}
\]
From Figure 18, it is clear that $L + p\theta \ge \sigma$ and $T^{eq}_{CBR} \ge T^{eq}_{VBR} - \theta$, which results in $T^{eq}_{CBR}(L+p\theta) - \sigma(T^{eq}_{VBR}-\theta) \ge 0$. Thus, $\frac{d\eta}{dR^{eq}} \ge 0$ and η is an increasing function of $R^{eq}$, which means that η increases or decreases together with $R^{eq}$.
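The closed forms for η and its derivative can be sanity-checked numerically. The sketch below is an illustration only: it uses the f1 parameters of the synthetic example with the latencies of the R = 1 column of Table IV, and it assumes, as the derivative does, that the two latencies are held fixed while $R^{eq}$ varies and that $p \ge R^{eq}$.

```python
def eta(R, L, p, sigma, theta, T_cbr, T_vbr):
    """Closed form of Eq. (25)."""
    return (sigma - L - p * theta + R * (T_cbr - T_vbr + theta)) / (sigma + T_cbr * R)


def d_eta_dR(R, L, p, sigma, theta, T_cbr, T_vbr):
    """Closed-form derivative of eta with respect to R^eq."""
    return (T_cbr * (L + p * theta) - sigma * (T_vbr - theta)) / (sigma + T_cbr * R) ** 2


params = dict(L=1.0, p=1.0, sigma=8.0, theta=7.0 / 0.872, T_cbr=10.0, T_vbr=9.363)
R, step = 0.5, 1e-6

# Eq. (25) against the direct definition (D_CBR - D_VBR) / D_CBR, without ceilings.
d_cbr = params["T_cbr"] + params["sigma"] / R
d_vbr = params["T_vbr"] + (params["L"] + params["theta"] * (params["p"] - R)) / R
direct = (d_cbr - d_vbr) / d_cbr

# Closed-form derivative against a central finite difference.
finite_diff = (eta(R + step, **params) - eta(R - step, **params)) / (2 * step)
print(eta(R, **params), direct, d_eta_dR(R, **params), finite_diff)
```

Because the delay bounds in the tables are rounded up to whole cycles, the η values reported there differ slightly from the unrounded value computed here.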
Table V shows $R^{eq}_{f_1}$, $T^{eq}_{f_1,CBR}$, $T^{eq}_{f_1,VBR}$, $\bar{D}_{CBR}$, and $\bar{D}_{VBR}$ for tagged flow f1 for different values of the processing delay T. The table shows that the end-to-end processing delays and delay bounds decrease as T is reduced.
Table IV. End-to-end delay comparison for tagged flow f1 under different service rates

                R1 = 1    R2 = 0.7   R3 = 0.5
R^eq_f1         0.5       0.35       0.25
T^eq_f1,CBR     10        13.428     18
T^eq_f1,VBR     9.363     13.326     18.951
D̄_f1,CBR        26        37         50
D̄_f1,VBR        20        32         48
η_f1            23%       13.5%      4%
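The delay bounds in Table IV can be reproduced from its latency and rate rows. The sketch below assumes that both bounds are rounded up to whole cycles, as in Equation (24), and that η is computed from the rounded values; the latencies are copied from the table rather than derived.

```python
import math

# TSPEC of tagged flow f1: (L, p, sigma, rho).
L, p, sigma, rho = 1.0, 1.0, 8.0, 0.128
theta = (sigma - L) / (p - rho)

# Columns of Table IV: R -> (R_eq, T_eq_CBR, T_eq_VBR).
columns = {
    1.0: (0.5, 10.0, 9.363),
    0.7: (0.35, 13.428, 13.326),
    0.5: (0.25, 18.0, 18.951),
}

results = []
for R, (R_eq, T_cbr, T_vbr) in columns.items():
    d_cbr = math.ceil(T_cbr + sigma / R_eq)                               # (sigma, rho) flow
    d_vbr = math.ceil(T_vbr + (L + theta * max(p - R_eq, 0.0)) / R_eq)    # TSPEC flow
    eta_percent = 100.0 * (d_cbr - d_vbr) / d_cbr
    results.append((d_cbr, d_vbr, round(eta_percent, 1)))
    print(f"R = {R}: D_CBR = {d_cbr}, D_VBR = {d_vbr}, eta = {eta_percent:.1f}%")
```

Under these assumptions the computed bounds match the D̄ rows of the table, and the improvement percentages agree with the η row up to rounding.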
Fig. 20. Comparison of delay bounds from the proposed model and simulator for VOPD application (x-axis: flow index, 1 to 21; y-axis: delay bound in cycles; series: Delay_VBR and Delay_Simulation).
7.3. Realistic Traffic Pattern
We consider a real-time multimedia application with a random mapping to the tiles of a 4 × 4 mesh on-chip network. Figure 19 shows the task graph and flow mapping of a Video Object Plane Decoder (VOPD) [Bertozzi et al. 2005], in which each block corresponds to an IP and the numbers near the edges represent the bandwidth (in MBytes/sec) of the data transfer for a 30 frames/sec MPEG-4 movie with 1920 × 1088 resolution [Van der Tol et al. 2002]. There are 21 communication flows, which are characterized by TSPEC. We assume that Li and pi are the same for all flows and equal to 1 flit and 1 flit/cycle, respectively. ρi is determined in flits/cycle from the bandwidth associated with flow fi in Figure 19, and σi varies between 8 and 128 flits for different flows.
Figure 20 shows the delay bounds obtained from the proposed analytical model, $\bar{D}_{f_i,VBR}$, and from the BookSim simulator, $\bar{D}_{f_i,Sim}$, for the whole set of flows. To give better insight into the proposed model, we calculate the relative error of each obtained delay bound with respect to the simulation result. The maximum and average relative errors are about 12.1% and 6.8%, respectively, which confirms the accuracy of the proposed model.
As can be observed from Figure 20, a flow may have a larger (like f7) or smaller (like f14) worst-case delay bound than the other flows, depending on its traffic specification (TSPEC) and its situation in the network. For example, if the worst-case delay bound of a particular flow is large, 1) the flow is probably more limited by its TSPEC parameters for injecting into the network, 2) the flow may have a longer path from its source to its destination, or 3) the flow may have more contentions (both direct and indirect) with other flows along its path.
Figure 21 compares the results of our analytical model, $\bar{D}_{f_i,VBR}$, with those of the method with CBR flows, $\bar{D}_{f_i,CBR}$. As the figure shows, the proposed model is more accurate than the method that does not consider the traffic peak behavior. Figure 22 presents the improvement percentage for each flow fi, $\eta_{f_i}$, as defined in Eq. (25), to show the effectiveness of our model. Compared to previous models with two parameters, the proposed method improves the accuracy of the delay bounds by up to 46.9% and by more than 37% on average over all flows.

Table V. End-to-end delay comparison for tagged flow f1 under different processing delays

                T1 = 10   T2 = 2   T3 = 1   T4 = 0.5   T5 = 0.1
R^eq_f1         0.5       0.5      0.5      0.5        0.5
T^eq_f1,CBR     26        10       8        7          6.2
T^eq_f1,VBR     25.363    9.363    7.363    6.363      5.563
D̄_f1,CBR        42        26       24       23         23
D̄_f1,VBR        39        20       18       17         16
η_f1            7.1%      23%      25%      26.1%      30.4%

Fig. 21. Comparing D̄_VBR and D̄_CBR for VOPD application (x-axis: flow index, 1 to 21; y-axis: delay bound in cycles; series: Delay_VBR and Delay_CBR).

Fig. 22. Improvement percentage of D̄_VBR over D̄_CBR for VOPD application (x-axis: flow index, 1 to 21; y-axis: improvement in percent).
Table VI presents the buffer size thresholds of the input channels used by flows, calculated with Eq. (12).

Table VI. Buffer size thresholds for VOPD application

B(r1,I) = 1    B(r5,I) = 17   B(r ,W) = 1    B(r9,S) = 259   B(r12,I) = 1    B(r14,N) = 1
B(r1,S) = 275  B(r5,N) = 1    B(r7,N) = 68   B(r10,I) = 1    B(r12,W) = 4    B(r14,E) = 7
B(r2,I) = 1    B(r5,E) = 7    B(r7,E) = 7    B(r10,E) = 68   B(r13,I) = 204  B(r15,I) = 16
B(r2,E) = 1    B(r5,S) = 262  B(r7,S) = 1    B(r10,S) = 84   B(r13,N) = 1    B(r15,N) = 1
B(r3,I) = 1    B(r6,I) = 1    B(r8,I) = 1    B(r11,I) = 17   B(r13,E) = 68   B(r15,E) = 7
B(r3,W) = 1    B(r6,E) = 1    B(r8,W) = 1    B(r11,N) = 77   B(r14,I) = 2    B(r16,I) = 1
B(r3,E) = 1    B(r6,S) = 17   B(r9,I) = 68   B(r11,E) = 1    B(r14,W) = 84   B(r16,N) = 1
B(r4,I) = 1    B(r7,I) = 1    B(r9,N) = 16   B(r11,S) = 16
Sub-index (r, L) in Table VI refers to input channel L of router r, where r is the
number of the router and L is a letter assigned to the input port which is defined as
I: Injection channel, W : Western input channel, N : Northern input channel, E: East-
ern input channel, and S: Southern input channel. For example, B(r3,W ) indicates the
buffer size threshold in western input channel of router 3.
Fig. 23. Comparison of delay bounds from the proposed model and simulator under the transpose traffic pattern (x-axis: flow index, 1 to 56; y-axis: delay bound in cycles; series: Delay_VBR and Delay_Simulation).
7.4. Transpose Traffic Pattern
To investigate a larger network, we experiment with an 8 × 8 mesh network under the transpose traffic pattern with 56 communication flows characterized by TSPEC. In this traffic pattern, the node with binary address $a_{n-1}, a_{n-2}, \ldots, a_1, a_0$ communicates with the node $\bar{a}_{n/2-1}, \ldots, \bar{a}_0, \bar{a}_{n-1}, \ldots, \bar{a}_{n/2}$. For all traffic flows, we assume the same values for Li and pi, which are 1 flit and 1 flit/cycle, respectively. For different flows, ρi varies between 0.001 and 0.03 flits/cycle, and σi between 2 and 128 flits. Table VII presents the source and destination of the flows along with the index assigned to them.
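The address mapping of this traffic pattern can be written as a few bit operations. The following is a minimal sketch for 64 nodes (n = 6 address bits); the function name is hypothetical, and excluding nodes that map to themselves is an assumption that happens to be consistent with the 56 flows listed in Table VII.

```python
def transpose_destination(src: int, n_bits: int = 6) -> int:
    """Swap the upper and lower halves of the n-bit source address and
    complement every bit, as in the pattern described above."""
    half = n_bits // 2
    half_mask = (1 << half) - 1
    low = src & half_mask           # a_{n/2-1} ... a_0
    high = src >> half              # a_{n-1} ... a_{n/2}
    swapped = (low << half) | high  # low half moves to the top
    return ~swapped & ((1 << n_bits) - 1)


# Source/destination pairs, skipping nodes whose destination is themselves.
flows = [(s, transpose_destination(s)) for s in range(64)
         if transpose_destination(s) != s]
print(len(flows), flows[:3])
```

For instance, node 1 maps to node 55 and node 13 maps to node 22, matching flows f2 and f8 in Table VII.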
As in the previous case studies, delay bounds from the proposed analytical model, $\bar{D}_{f_i,VBR}$, and from the BookSim simulator, $\bar{D}_{f_i,Sim}$, are derived for all flows and presented in Figure 23. As can be seen from this figure, all delays observed in simulations are below the LUDB but not far from it, suggesting that the analytical bound is fairly tight, since the simulation typically does not exercise the worst case.
To assess the accuracy of the analytical model, the relative errors with respect to the simulation results are computed. The maximum and average relative errors are about 33.3% and 13%, respectively.
We also calculate per-flow delay bounds with our proposed method, $\bar{D}_{f_i,VBR}$, and with the CBR analytical model, $\bar{D}_{f_i,CBR}$, as depicted in Figure 24, and compare the results by computing the improvement percentage per flow, $\eta_{f_i}$. As shown in Figure 25, our proposed analytical model is up to 39.3% more accurate than the CBR analytical model and more than 31% more accurate on average over all flows.

Table VII. The list of flows

f1 : 0 → 63     f11 : 20 → 29   f21 : 25 → 52   f30 : 55 → 1    f39 : 29 → 20   f48 : 44 → 26
f2 : 1 → 55     f12 : 10 → 46   f22 : 24 → 60   f31 : 47 → 2    f40 : 46 → 10   f49 : 52 → 25
f3 : 2 → 47     f13 : 9 → 54    f23 : 34 → 43   f32 : 39 → 3    f41 : 54 → 9    f50 : 60 → 24
f4 : 3 → 39     f14 : 8 → 62    f24 : 33 → 51   f33 : 31 → 4    f42 : 62 → 8    f51 : 43 → 34
f5 : 4 → 31     f15 : 19 → 37   f25 : 32 → 59   f34 : 23 → 5    f43 : 37 → 19   f52 : 51 → 33
f6 : 5 → 23     f16 : 18 → 45   f26 : 41 → 50   f35 : 15 → 6    f44 : 45 → 18   f53 : 59 → 32
f7 : 6 → 15     f17 : 17 → 53   f27 : 40 → 58   f36 : 22 → 13   f45 : 53 → 17   f54 : 50 → 41
f8 : 13 → 22    f18 : 16 → 61   f28 : 48 → 57   f37 : 30 → 12   f46 : 61 → 16   f55 : 58 → 40
f9 : 12 → 30    f19 : 27 → 36   f29 : 63 → 0    f38 : 38 → 11   f47 : 36 → 27   f56 : 57 → 48
f10 : 11 → 38   f20 : 26 → 44

Fig. 24. Comparing D̄_VBR and D̄_CBR under the transpose traffic pattern (x-axis: flow index, 1 to 56; y-axis: delay bound in cycles; series: Delay_VBR and Delay_CBR).

Fig. 25. Improvement percentage of D̄_VBR over D̄_CBR under the transpose traffic pattern (x-axis: flow index, 1 to 56; y-axis: improvement in percent).
The run-time of the proposed method, implemented in C++, is typically on the order of a few seconds or less. It is about 0.58 sec for the VOPD application and 1.02 sec for the transpose traffic pattern.
7.5. Discussion About Other Metrics
Although the paper targets an analytical model for latency bound, we briefly consider
evaluating other metrics including throughput, communication load, energy consump-
tion, and area requirements.
The network throughput is the sum of the data rates delivered to all ejection channels in a network, and the communication load is estimated by the utilized bandwidth, calculated as the sum of the data rates injected into the network. As the paper models a network that is not saturated, the throughput and communication load have the same value. This value is 0.296 flits/cycle for the synthetic example in Section 7.2 and 0.73 flits/cycle for the VOPD application in Section 7.3.
Network calculus does not directly evaluate energy consumption and area requirements. However, we can present a comparative discussion between VBR and CBR analyses, which is the main contribution of this work. Since we study the classic input-queuing virtual-channel router, nothing is new or changed in the structure and design details of the routers. In terms of area, the difference lies in the calculated backlog, which determines the buffer size thresholds. In network calculus, the upper bound on backlog along the network is computed as the sum of the individual bounds on every element [Le Boudec et al. 2004]. Thus, the total required buffer for flow i is bounded by:
\[
\bar{B}_i = \sum_{j \in L_{f_i}} \bar{B}_{ij} \tag{26}
\]
where $\bar{B}_{ij}$ is the upper bound on the buffer size for flow i in each channel $j \in L_{f_i}$ and $L_{f_i}$ is the set of channels along the path of flow i. $\bar{B}_{ij}$ for VBR traffic flows, $\bar{B}^{VBR}_{ij}$, and for CBR traffic flows, $\bar{B}^{CBR}_{ij}$, are given by Eq. (27) and Eq. (28), respectively.
\[
\bar{B}^{VBR}_{ij} = \sigma_i + \rho_i T_j + \left(\frac{\sigma_i-L_j}{p_i-\rho_i} - T_j\right)^{+}\left[(p_i-R_j)^{+} - p_i + \rho_i\right] \tag{27}
\]
\[
\bar{B}^{CBR}_{ij} = \sigma_i + \rho_i T_j \tag{28}
\]
In Eq. (27), the term $[(p_i - R_j)^{+} - p_i + \rho_i]$ is non-positive: since $p_i \ge R_j$ by the assumption stated in Section 4, it reduces to $\rho_i - R_j$, and $R_j \ge \rho_i$ by the channel capacity constraint. Further, the term $((\sigma_i - L_j)/(p_i - \rho_i) - T_j)^{+}$ is always non-negative because $a^{+} = a$ if $a \ge 0$ and $a^{+} = 0$ otherwise. Therefore, $((\sigma_i - L_j)/(p_i - \rho_i) - T_j)^{+}[(p_i - R_j)^{+} - p_i + \rho_i] \le 0$, which means that $\bar{B}^{VBR}_{ij} \le \bar{B}^{CBR}_{ij}$. In Section 7.2.3, we calculated the required buffer size (buffer size threshold) in each input port of the routers for a synthetic example. The sum of these values is the total required buffer size, $\bar{B}^{VBR}$, which is equal to 42 flits. If we calculate the total required buffer size for CBR analysis, $\bar{B}^{CBR}$, by Eq. (26) and (28), it is equal to 51 flits, which is about 21.4% larger than $\bar{B}^{VBR}$. Similarly, $\bar{B}^{VBR}$ is calculated for the VOPD application as a realistic traffic pattern by summing the buffer size bounds derived in Section 7.3. The calculations show that $\bar{B}^{VBR} = 1673$ flits and $\bar{B}^{CBR} = 2827$ flits. Therefore, VBR analysis leads to a reduction of about 40.8% in the total required buffers. We have also derived $\bar{B}^{CBR}$ and $\bar{B}^{VBR}$ for the case study presented in Section 7.4, which is an 8 × 8 mesh network under the transpose traffic pattern. Here, $\bar{B}^{CBR} = 18256$ flits and $\bar{B}^{VBR} = 12556$ flits, which shows that VBR analysis reduces the total required buffers by about 31.2%. As a result, under the same network and application, VBR analysis gives a tighter backlog bound than CBR analysis and can thus give more accurate bounds on the buffer requirements. From the design perspective, the tighter backlog bounds lead to area savings in the router buffers.
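The relation between Eq. (27) and Eq. (28) can be illustrated with a short calculation. In the sketch below the flow is f1 from the synthetic example, while the channel latency T = 2 cycles and rate R = 0.5 flits/cycle are hypothetical values chosen only to exercise both branches of the positive part; the last lines recompute the reported percentages from the totals quoted in the text.

```python
def pos(x):
    """Positive part, [x]^+ = max(x, 0)."""
    return max(x, 0.0)


def backlog_cbr(sigma, rho, T):
    """Eq. (28): backlog bound of a (sigma, rho) flow at a channel with latency T."""
    return sigma + rho * T


def backlog_vbr(L, p, sigma, rho, T, R):
    """Eq. (27): backlog bound of a TSPEC flow at a channel with latency T and rate R."""
    theta = (sigma - L) / (p - rho)
    return sigma + rho * T + pos(theta - T) * (pos(p - R) - p + rho)


# f1 from the synthetic example; T and R are hypothetical channel parameters.
flow = dict(L=1.0, p=1.0, sigma=8.0, rho=0.128)
b_vbr = backlog_vbr(**flow, T=2.0, R=0.5)
b_cbr = backlog_cbr(flow["sigma"], flow["rho"], T=2.0)

# When T is at least theta, the correction term vanishes and both bounds coincide.
b_vbr_large_T = backlog_vbr(**flow, T=10.0, R=0.5)
b_cbr_large_T = backlog_cbr(flow["sigma"], flow["rho"], T=10.0)

# Percentages quoted in the text, recomputed from the reported totals.
synthetic_increase = round(100 * (51 - 42) / 42, 1)
vopd_reduction = round(100 * (2827 - 1673) / 2827, 1)
transpose_reduction = round(100 * (18256 - 12556) / 18256, 1)
print(b_vbr, b_cbr, synthetic_increase, vopd_reduction, transpose_reduction)
```

The VBR bound is never larger than the CBR bound, and the gap grows with the difference between θ and the channel latency, which is why flows with large bursts benefit most from the tighter analysis.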
Regarding power consumption, the network power comprises router power (buffer, switch, control circuit) and link power, which are traffic dependent. Although VBR analysis derives tighter delay bounds, it does not change the packet transfer behavior, because it only derives more accurate analytical delay bounds without any change in design features of the router such as switching, control, and link traversal. Therefore, the design decision of the router that our analysis affects is the buffer dimensioning. Assuming the same system model, VBR analysis can derive tighter bounds than CBR analysis on buffer requirements, leading to power savings. Following a power model for the buffers such as Orion [Wang et al. 2002], we can assume that the power consumption of the buffers decreases proportionally to the buffer size.
8. CONCLUSIONS
In this work, we have derived an analysis procedure to investigate per-flow delay bounds. To this end, we have given theorems to calculate the end-to-end ESC and internal output arrival curves in a FIFO multiplexing network. Based on the proposed analysis technique, we have conducted case studies of worst-case performance analysis, assessed the accuracy of the proposed model through simulation, and compared it with a method that does not consider the traffic peak behavior. We have developed algorithms to automate the analysis steps. The algorithms run very fast and can be applied to larger networks with more flows. In the future, we plan to develop network calculus models to investigate different scheduling policies and then compare them. We also plan to extend the proposed analytical method to the case of back-pressure in the network. There are some network-calculus-based analytical models [Qian et al. 2009; Zhao et al. 2013] which analyze worst-case delay bounds for CBR flows under back-pressure in the network. It would be interesting to derive possibly tighter delay bounds for VBR flows. To this end, we have to extend the analytical models to a given fixed buffer size rather than a to-be-determined bounded buffer size.
REFERENCES
BAKHOUYA, M., SUBOH, S., GABER, J., EL-GHAZAWI, T., AND NIAR, S. 2011. Performance evaluation and
design tradeoffs of on-chip interconnect architectures. Simulation Modelling Practice and Theory,
Elsevier. 19, 6, 1496–1505.
BAUER, H., SCHARBARG, J. L., AND FRABOUL, C. 2010. Improving the Worst-Case Delay Analysis of an
AFDX Network Using an Optimized Trajectory Approach. IEEE Transactions on Industrial Informat-
ics. 6, 4, 521–533.
BEN-ITZHAK, Y., CIDON, I., AND KOLODNY, A. 2011. Delay Analysis of Wormhole Based Heterogeneous
NoC. In Proceedings of the International Networks-On-Chip Symposium (NOCS). 161–168.
BENNETT, J. C. R., BENSON, K., CHARNY, A., COURTNEY, W. F., AND LE BOUDEC, J. -Y. 2002. Delay jitter
bounds and packet scale rate guarantee for expedited forwarding. IEEE/ACM Transactions on
Networking. 10, 4, 529–540.
BERTOZZI, D., JALABERT, A., MURALI, S., TAMHANKAR, R., STERGIOU, S., BENINI, L., AND DE MICHELI,
G. 2005. NoC synthesis flow for customized domain specific multiprocessor systems-on-chip. IEEE
Transactions on Parallel and Distributed Systems. 16, 2, 113–129.
BISTI, L., LENZINI, L., MINGOZZI, E., AND STEA, G. 2010. DEBORAH: A Tool for Worst-case Analysis of
FIFO Tandems. In Proceedings of ISoLA 2010, LNCS 6415. 152–168.
BLAKE, S., BLACK, D., CARLSON, M., DAVIES, E., WANG, Z., AND WEISS, W. 1998. An architecture for
differentiated services. IETF RFC 2475.
BOGDAN, P. AND MARCULESCU, R. 2007. Quantum-like effects in network-on-chip buffers behavior. In
Proceedings of the 44th Design Automation Conference (DAC). 266–267.
BOGDAN, P., KAS, M., MARCULESCU, R., AND MUTLU, O. 2010. QuaLe: A Quantum-Leap Inspired Model
for Non-stationary Analysis of NoC Traffic in Chip Multi-processors. In Proceedings of the International
Networks-On-Chip Symposium (NOCS). 241–248.
BOUILLARD, A., JOUHET, L., AND THIERRY, E. 2010. Tight performance bounds in the worst-case analysis
of feed-forward networks. In Proceedings of Infocom. 1316–1324
BOUILLARD, A. AND JUNIER, D. 2011. Worst-case delay bounds with fixed priorities using network calculus.
In Proceedings of Valuetools. 381–390
BOYER, M. 2010. Half-modelling of shaping in FIFO net with network calculus. RTNS 2010.
CHANG, C. 2000. Performance Guarantees in Communication Networks. Springer-Verlag.
CHARNY, A. AND LE BOUDEC, J. -Y. 2000. Delay Bounds in a Network with Aggregate Scheduling. In
Proceedings of QofIS. 1–13.
CIUCU, F., AND SCHMITT, J. 2012. Perspectives on Network Calculus - No Free Lunch but Still Good Value.
ACM Sigcomm. 42, 4, 311–322.
GEBALI, F. AND ELMILIGI, H., EDITORS 2009. Networks on Chip: Theory and Practice. Taylor and Francis
Group LLC - CRC Press.
JAFARI, F., LU, Z., JANTSCH, A., AND YAGHMAEE, M. H. 2010. Buffer Optimization in Network-on-Chip
Through Flow Regulation. IEEE Transactions on Computer-Aided Design of Integrated Circuits and
Systems (TCAD). 29, 12, 1973–1986.
JAFARI, F., JANTSCH, A., AND LU, Z. 2011. Output Process of Variable Bit-Rate Flows in On-Chip Networks
Based on Aggregate Scheduling. In Proceedings of the International Conference on Computer Design
(ICCD’11). Amherst, USA, 445–446.
JAFARI, F., JANTSCH, A., AND LU, Z. 2012. Worst-Case Delay Analysis of Variable Bit-Rate Flows in
Network-on-Chip with Aggregate Scheduling. In Proceedings of the Design, Automation and Test in
Europe Conference (DATE’12). Dresden, Germany, 538–541.
JIANG, Y. 2002. Delay bounds for a network of guaranteed rate servers with FIFO aggregation. Computer
Networks. 40, 6, 683–694.
JIANG, N., BECKER, D. U., MICHELOGIANNAKIS, G., BALFOUR, J., TOWLES, B., KIM, J., AND DALLY, W. J.
2013. A Detailed and Flexible Cycle-Accurate Network-on-Chip Simulator. In Proceedings of the IEEE
International Symposium on Performance Analysis of Systems and Software (ISPASS). 86–96.
KIASARI, A. E., JANTSCH, A., AND LU, Z. 2013. Mathematical formalisms for performance evaluation of
networks-on-chip. ACM Computing Surveys. 45, 3.
KIEFER, A., GOLLAN, N., AND SCHMITT, J.B. 2010. Searching for Tight Performance Bounds in Feed-
Forward Networks. In Proceedings of MMB/DFT. 227–241.
HANSSON, A., WIGGERS, M., MOONEN, A., GOOSSENS, K., AND BEKOOIJ, M. 2008. Applying dataflow
analysis to dimension buffers for guaranteed performance in networks on chip. In Proceedings of NOCS.
211–212.
LE BOUDEC, J. Y. AND THIRAN, P. 2004. Network Calculus: A Theory of Deterministic Queuing Systems for
the Internet. (LNCS, vol. 2050). Berlin, Germany: Springer-Verlag.
LEE, S. Real-Time Wormhole Channels. Parallel Distributed Computer. 63, 299–311.
LENZINI, L., MARTORINI, L., MINGOZZI, E., AND STEA, G. 2006. Tight end-to-end per-flow delay bounds in
fifo multiplexing sink-tree networks. Performance Evaluation. 63, 9, 956–987.
LENZINI, L., MINGOZZI, E., AND STEA, G. 2008. A Methodology for Computing End-to-end Delay Bounds
in FIFO-multiplexing Tandems. Elsevier Performance Evaluation. 65, 11-12, 922–943.
MARTIN, S., MINET, P., AND GEORGE, L. 2003. Deterministic End-to-End Guarantees for Real-Time Appli-
cations in a DiffServ-MPLS Domain. In Proceedings of SERA 2003, LNCS 3026. 51–73.
MARTIN, S. AND MINET, P. 2006. Schedulability analysis of flows scheduled with FIFO: Application to the
Expedited Forwarding class. In Proceedings of IPDPS. Rhodes Island, 25–29.
MOADELI, M., SHAHRABI, A., VANDERBAUWHEDE, W., AND OULD-KHAOUA, M. 2007. An analytical per-
formance model for the spidergon noc. In Proceedings of 21st AINA. 1014–1021.
OGRAS, U. Y., HU, J., AND MARCULESCU, R. 2005. Key research problems in noc design: A holistic perspec-
tive. In Proceedings of CODES+ISSS 2005. 69–74.
QIAN, Y., LU, Z., AND DOU, W. 2009. Analysis of Worst-case Delay Bounds for Best-effort Communication
in Wormhole Networks on Chip. In Proceedings of the 3rd ACM/IEEE International Symposium on
Networks-on-Chip (NOCS’09). ACM/IEEE, San Diego, CA, 44–53.
QIAN, Y., LU, Z., AND DOU, Q. 2010. QoS Scheduling for NoCs: Strict Priority Queueing versus Weighted
Round Robin. In Proceedings of the 28th International Conference on Computer Design (ICCD’10).
Amsterdam, the Netherlands, 52–59.
RAHMATI, D., MURALI, S., BENINI, L., ANGIOLINI, F., DE MICHELI, G., AND SARBAZI-AZAD, H. A method
for calculating hard QoS guarantees for Networks-on-Chip. In Proceedings of the IEEE/ACM Interna-
tional Conference on Computer-Aided Design (ICCAD’09). 579–586.
RAHMATI, D., MURALI, S., BENINI, L., ANGIOLINI, F., DE MICHELI, G., AND SARBAZI-AZAD, H. Comput-
ing Accurate Performance Bounds for Best Effort Networks-on-Chip. IEEE Transactions on Computers
(IEEE-TC). 62, 3, 452–467.
RIZK, A. AND FIDLER, M. 2012. Non-asymptotic End-to-end Performance Bounds for Networks with Long
Range Dependent fBm Cross Traffic. Computer Networks. 56, 1, 127–141.
SCHMITT, J. B., ZDARSKY, F. A., AND FIDLER, M. 2008. Delay bounds under arbitrary multiplexing: When
network calculus leaves you in the lurch .... In Proceedings of INFOCOM. 1669–1677.
SHI, Z. AND BURNS, A. 2008. Real-time communication analysis for on-chip networks with wormhole
switching. In Proceedings of the 2nd ACM/IEEE International Symposium on Networks-on-Chip (NOCS
2008). IEEE Computer Society. 161–170.
VAN DER TOL, E.B. AND JASPERS, E.G. T. 2002. Mapping of MPEG4 Decoding on a Flexible Architecture
Platform. SPIE. 4674, 1–13.
STILIADIS, D. AND VARMA, A. 1998. Latency-rate servers: A general model for analysis of traffic scheduling
algorithms. IEEE/ACM Transactions on Networking. 6, 5, 611–624.
WROCLAWSKI, J. 1997. The Use of RSVP with IETF Integrated Services. IETF RFC 2210.
WANG, H. S., ZHU, X., PEH, L. S., AND MALIK, S. 2002. Orion: A Power-Performance Simulator for In-
terconnection Networks. In Proceedings of the 35th annual ACM/IEEE international symposium on
Microarchitecture (MICRO).
ZHAO, X. AND LU, Z. 2013. Per-flow Delay Bound Analysis Based on a Formalized Micro-architectural Model.
In Proceedings of the 7th ACM/IEEE International Symposium on Networks-on-Chip (NOCS’2013),
Tempe Arizona, USA, April 2013.
Electronic Appendix
