NoC Design Flow for TDMA and QoS Management in a GALS Context by Evain, Samuel et al.
NoC Design Flow for TDMA and QoS Management in a
GALS Context
Samuel Evain, Jean-Philippe Diguet, Dominique Houzet
To cite this version:
Samuel Evain, Jean-Philippe Diguet, Dominique Houzet. NoC Design Flow for TDMA and QoS
Management in a GALS Context. EURASIP Journal on Embedded Systems, SpringerOpen,
2006, 2006, pp.63656. <10.1155/ES/2006/63656>. <hal-00321372>
HAL Id: hal-00321372
https://hal.archives-ouvertes.fr/hal-00321372
Submitted on 13 Sep 2008
Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2006, Article ID 63656, Pages 1–12
DOI 10.1155/ES/2006/63656
NoC Design Flow for TDMA and QoS Management
in a GALS Context
Samuel Evain,1 Jean-Philippe Diguet,1 and Dominique Houzet2
1 LESTER, UBS/CNRS, Centre de recherche, BP 92116, Lorient Cedex 56321, France
2 IETR, INSA/CNRS, 20 Avenue des Buttes de Coësmes, Rennes Cedex 35043, France
Received 15 December 2005; Revised 6 July 2006; Accepted 5 August 2006
Recommended for Publication by S. Ramesh
This paper proposes a new approach to the tedious problem of NoC guaranteed traffic under the GALS constraints imposed by upcoming large Systems-on-Chip with multiple clock domains. Our solution has been designed to adjust the trade-off between synchronous and clockless asynchronous techniques. By means of smart interfaces between synchronous sub-NoCs, Quality-of-Service (QoS) for guaranteed traffic is assured over the entire chip despite clock heterogeneity. This methodology can be easily integrated as an extension to traditional synchronous NoC design flows. We present a real implementation obtained with our tool for a 4G telecom scheme.
Copyright © 2006 Samuel Evain et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. INTRODUCTION
Network-on-Chip (NoC) provides designers with a systematic, flexible, and scalable framework to manage communication between a large set of intellectual property (IP) blocks [1].
It can also reduce IP connection wires and optimize their us-
age. The dynamic reconfigurability of communication paths
responds to the fluctuating processing needs of embedded
systems [2]. To comply with synchronous design rules, the
traditional NoC implementation is based on a global system
clock distributed with a limited skew. This design assump-
tion is inconsistent with large future SoCs where multicycle
communication paths and varying delays [3] are unavoid-
able. Thus, NoC designers must consider these issues in order
to allow for manageable and controllable communication on
such chips.
Another shortcoming of the traditional NoC implementation is that it cannot be adapted to the multiclock domain SoCs toward which large SoCs are converging. Different reasons can cause such situations, for instance (1) SoCs embedding dynamic voltage/frequency scaling to optimize power consumption under variable computing and communication loads, (2) reconfigurable and/or adaptive architecture chips, (3) chips with areas having distinct access rights for security purposes [4]. Finally, multicore chips or multiboard systems intrinsically mean disjoint clock domains. These technology-driven conditions complicate the task of designing NoCs for large chips, where the need for flexible, scalable, manageable, and controllable communications is increasing.
We discuss in the state-of-the-art section several relevant NoC solutions that cope with these issues. The rest of the paper is divided into four parts. In Section 3, we address the question of NoC guaranteed traffic (GT) with different time domains and describe how this new technique is inserted in our NoC synthesis framework. In Section 4, we detail our architecture solutions. In Section 5, we present a case study for different possible configurations.
Finally, we conclude with perspectives based on this
work.
2. STATE OF THE ART
NoCs can be divided into different categories regarding clock management, flexibility, and communication QoS. Three kinds of clock schemes can be proposed: synchronous SoC, GALS with heterogeneous clocks, and fully asynchronous SoC (clockless). Our main observations are related to the tradeoff between guaranteed traffic and heterogeneous clock management.
Traditional approaches have always considered syn-
chronous NoC as having a single clock domain. The con-
cept of virtual channel (VC) [5] has been introduced to
bring Quality-of-Service. QNoC [6], for instance, offers four traffic priority ranks. Guaranteed traffic is not flexible since it is based on traffic traces and paths that are computed according to simulation. The question of real latency and bandwidth has been solved by Æthereal [7] and Nostrum [8] with the use of the time division multiple access (TDMA) technique over pipelined circuits. This scheduling prevents any conflict between packets travelling through the network, but it means that all network elements are synchronous. Such an assumption is not compliant with large chips or with an NoC distributed over several chips (e.g., ASIC and FPGA). The usual techniques implemented to solve skew or multicycle path issues are based on pipelining [9]. These post-routing decisions require design revisions to resolve incoherencies in the time slot allocations of paths, which is a very time-consuming process. Furthermore, a single global TDMA table may lead to an oversized time slot table and consequently increase latency, design complexity, and buffer depth.
The second approach consists in using asynchronous communications with synchronous components (GALS). In [10], an asynchronous crossbar is presented but real-time constraints are not considered. An industrial solution is provided by Arteris [11], where intercluster communications are implemented as asynchronous packet exchanges across a network. The intercluster medium is a network that transports packets without guarantee; however, packets can be tagged with priorities. The main advantages of this solution are reliability and low-cost switches. ANOC [12] also presents a GALS solution, based on handshake protocols at flit level. The authors use two virtual channels to differentiate high and low priority communications. However, traffic can be truly guaranteed only with additional constraints. A first solution consists in using spatially distinct GT paths. A second solution is based on simulations, when unambiguous static and deterministic scheduling schemes can be extracted from the target applications.
The third approach is the ultimate stage of the GALS concept; it is based on clockless asynchronous communications. This solution is implemented in [12, 13]. In [12], the authors use virtual channels and low-latency routers; however, the NoC architecture does not offer latency and bandwidth guarantees. This shortcoming is solved in [13] with the ALG scheduling discipline. These guarantees are obtained under the condition that the maximum flit transmission delay is known and that the interflit intervals are bounded. This approach is promising, but the number of VCs seems to have a strong impact on the cost of the router implementation, which means that implementing several guaranteed traffic flows may lead to a very costly solution. Actually, there is probably a tradeoff between a synchronous TDMA-based solution and an asynchronous one, depending on the number of GT communications. The main advantages of clockless solutions are delay insensitivity, dynamic power optimization, and communication speed, since communication can run as fast as the local link characteristics allow. However, such solutions require a specific implementation, usually based on a four-phase dual-rail protocol where each bit is coded with two wires. This is the case in [12] and in [13], where the implementation examples are given for a single router. The drawback of asynchronous techniques is the increase in gate area and interconnect wires; since static power is directly related to area, this may be a serious issue with technologies below 90 nm.
Another point is the question of flexibility: the NoC features (e.g., paths, time slot allocation) may be programmable in order to adapt the characteristics of NoC links to variations in CPU load and in communications, as depicted in [2]. Source routing appears to be the appropriate solution to cope with this issue; it has been implemented in the approaches previously described, both synchronous TDMA [7] and asynchronous [12]. Our approach also provides flexibility, compliant with time division (TDMA) and path allocation, but also with sub-NoC composition.
Our solution is GALS-oriented since it enables communication between locally synchronous sub-NoCs with heterogeneous clock domains, which means that no global clock is required. The main motivation of our work is to find a solution that truly guarantees traffic (latency and throughput) within an NoC with heterogeneous clocks. TDMA techniques are interesting, firstly, for bandwidth allocation and latency control and, secondly, for FIFO sizing. Actually, the TDMA technique organizes communication in such a way that conflicts are avoided and, consequently, routers require minimum FIFO resources. On the network interface (NI) side, FIFO requirements strongly depend on TDMA size and path lengths. In practice, very short TDMA tables are required and a 3D (x, y, t) path search space increases the chances of finding the shortest paths. However, usual TDMA techniques require a global clock, which is not acceptable within future large chips. We also notice that asynchronous techniques require complex solutions [13] or leave the NoC underused because of constraints such as nonoverlapping paths [12]. We do not use asynchronous techniques [12, 13] but dual clock FIFOs for interfacing different clock domains. There are different reasons for this. First, the advantage in terms of power consumption does not appear to be that decisive, since the dynamic power reduction is balanced by the static power increase. Moreover, synchronous solutions are sufficient and cheap enough for medium-size designs, which can also benefit from available power management techniques based on dynamic VDD/Clock/Bias [14]. Finally, we only need protocol wrappers for standards such as OPB or AHB to reuse usual synchronous IPs, whereas asynchronous techniques need an additional wrapper at the physical level for asynchronous/synchronous interfacing.
Table 1 synthesizes the pros and cons of these different approaches. Our work is thus focused on TDMA techniques: we introduce time routing and TDMA synchronisers in order to extend the network interface concept to sub-NoC interfaces. The other advantage of the sub-NoC approach is related to programmability: paths within each sub-NoC can be coded independently from other sub-NoCs.
Table 1: GALS techniques pros and cons.
Technique | Skew | GT | Area | Power
GALS + dual clock FIFOs | + | − (paths must not overlap) | + | +
GALS + clockless | + | − (paths must not overlap) | − (asynchronous logic cost) | + dynamic com., − static
Synchronous NoC | − (delay insertion) | + scheduling | + | + dynamic, − static
GALS + TDMA | + (only on identified links) | + TDMA scheduling | + (TDMA enables accurate FIFO sizing) | − dynamic (clock gating), + static synchroniser
3. PROBLEM FORMULATION AND SOLUTIONS
3.1. µSpider NoC
Our NoC uses a wormhole packet switching technique to carry messages. Routers and network interfaces (NI) are connected by two unidirectional opposite links. A packet is a set of flits (flow control units); a flit is the elementary packet unit on which link-level flow control operations are performed, and flow control between routers is credit-based, counted in flits. The width, in bits, of the communication link is a phit (physical unit). A flit is measured in phits.
We use the source routing technique to route packets. Routing instructions are placed in the packet header and are processed by the routers crossed to determine the right output port.
NIs connect IP blocks to the NoC in a way that is quite comparable to the Æthereal approach [15]. Virtual channels are used to carry best effort (BE) and guaranteed traffic. To prevent contention between multiple GT flows, TDMA techniques are used for time slot allocation in NIs [16].
Moreover, our NoC is customizable through an associ-
ated CAD tool [17]. Our CAD tool is a decision and synthesis
tool to help designers obtain the ad hoc NoC depending on
the application and implementation constraints. It is able to
configure the various functionalities of our NoC, like topol-
ogy, routing technique, and so forth. Finally, this tool gener-
ates an optimized dedicated NoC VHDL code at RTL level.
In this paper, we focus on its ability to manage GT on
a global network composed of various clock domain sub-
NoCs.
3.2. Hierarchical NoC and µSpider flow
Today's chips may have multiple clocks with different frequencies. Each clock domain may have IPs that need to communicate through an NoC. An NoC spanning certain clock domains may suffer TDMA coherency problems due to variable delays on multiclock links. Even if such a problem does not exist, we may be interested in having more than one NoC in order to optimize the TDMA slot table.
We propose a hierarchical NoC structure. A sub-NoC is a synchronous network area having a single clock frequency and obeying its own TDMA table.
The interconnection of those heterogeneous (different frequencies and TDMA rules) sub-NoCs forms a global NoC.
[Figure 1: µSpider NoC design flow. Inputs: GT constraints (BW, latency) and architectural mapping constraints. Steps: topology computation and network partitioning analysis; sub-NoC clustering of IPs around NIs; TDMA synchroniser insertion; sub-NoC TDMA sizing; mapping, path, and slot time allocations; constraints check; interrouter delay estimation; test "delays < 1/clock frequency"; insertion of time routers; path and slot time allocations; constraints check.]
The causes of this network division may be the following.
(i) Multicore chips separating the networks into several areas, causing long distance wires between them. No classic solution can be used (relay station [9] or link pipelining) because those solutions cannot take place between cores.
(ii) Architectural mapping constraints.
(iii) Reuse of previously designed NoCs implemented as IPs.
(iv) Dynamic clock management for power optimization.
(v) Areas with different security levels.
Figure 1 shows the µSpider design flow modified to take into account different time domains.
[Figure 2: Solution tree. For an area with clock skew and/or uncertain link delay: time routing (Section 4.1), with a time coder, a time router, and 1 FIFO. For an NoC with dissociated TDMA subnetworks: TDMA synchroniser (Section 4.2). With end-to-end flow control at global NoC level: 1 FIFO if a coherent TDMA (ordering) is possible, otherwise one FIFO per destination queue in the sub-NoC. With end-to-end flow control at sub-NoC level: one FIFO per IP sender-destination pair of communications crossing the inter-sub-NoC link. Otherwise: classic connection technique.]
The flow provides two notable evolutions from the initial single-NoC design flow. First, the designer has the possibility of specifying different sub-NoCs. Sub-NoC management is supported by the synchronization techniques explained in Section 4. The sub-NoC allocation decisions are made jointly, to avoid solutions that are optimal locally but not globally. Second, if delays potentially larger than one cycle are noticed in the design flow, the synchronous design assumptions are not valid. In that case, time routers (see Section 4) are inserted and path/slot time allocations are recomputed with the initial mapping constraints.
3.3. Solution tree
Different solutions may be considered depending on the problem to be solved. Figure 2 shows the associated solution tree. The first question is to identify uncertain or long delay links. If they belong to a single TDMA domain, they are handled with the time routing technique described in Section 4.1. Otherwise, they are intrinsically handled by the TDMA synchroniser technique described in Section 4.2. In this second case, some additional characteristics lead to different implementations depending on the situation.
Case 1. The coherence between the TDMA tables of the sub-NoCs, namely whether the data are emitted and completely read in the same order.
Case 2. The implementation of the end-to-end flow control, which can be local to sub-NoCs or global.
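The decisions of the solution tree can be summarised as a small lookup. The sketch below is only an illustration in Python: the function name, its arguments, and the returned strings are hypothetical and are not part of the µSpider tool; the FIFO counts are those stated for each branch in Sections 4.1 and 4.2.

```python
def bridge_solution(same_tdma_domain, global_flow_control=False,
                    coherent_tdma=False, destination_queues=1,
                    crossing_pairs=1):
    """Return (technique, number_of_fifos) for one uncertain or long delay link.

    All names are illustrative assumptions, not identifiers from the paper.
    """
    if same_tdma_domain:
        # Same TDMA table, only skew or uncertain delay: time routing, 1 FIFO.
        return ("time routing", 1)
    if global_flow_control:
        if coherent_tdma:
            # Packets are emitted and consumed in the same order: 1 FIFO.
            return ("TDMA synchroniser", 1)
        # One FIFO per destination queue in the crossed sub-NoC.
        return ("TDMA synchroniser", destination_queues)
    # Local flow control: one FIFO per sender-destination pair on the link.
    return ("TDMA synchroniser", crossing_pairs)


choice = bridge_solution(same_tdma_domain=False, global_flow_control=True,
                         destination_queues=3)
```

The lookup only restates the branches of the tree; it does not model the cost tradeoffs discussed in Section 4.2.5.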
4. µSPIDER ARCHITECTURAL SOLUTIONS
Time routing is a bridge between portions of NoC having the
same TDMA but with a possible phase skew.
TDMA synchroniser is a bridge between sub-NoCs having different TDMA tables; moreover, a phase skew between the sub-NoC clocks is possible.
[Figure 3: Same clock but different clock skews. A common clock reaches locally synchronous area 1 as clock 1 (+skew 1) and locally synchronous area 2 as clock 2 (+skew 2); a data link connects the two areas.]
4.1. Time routing solution
In this first case, we consider router clusters having the same clock but different clock skews, and a link with an unpredictable but bounded delay.
Figure 3 shows a global clock distributed over both areas. Clock1 and clock2 frequencies are equal but have different skews. Moreover, the data transmission has a delay that is a function of the link's physical characteristics (length, capacity, ...). Skew and transmission delay can cause synchronisation problems in the time division multiplexing over the pipelined path across the considered link.
To be independent from variable skews and data link delays in a given TDMA of a sub-NoC, we consider that these two variables are bounded by known values. TDMA time slot allocations are computed using the worst case delay and clock skew between those routers. The time coherency between routers is controlled in order to impose this worst case delay on all packets. This control is implemented by integrating the time coder (TC) and the time router (TR) in the sending and the receiving routers, respectively (see Figure 4).
[Figure 4: Time coder and time router. On the clock1 side of the clock border, the time coder module in the sending part of router 1 computes TSI1,2 = (CTS1 + Delta1,2) % |S| from the current time slot CTS1 and inserts this time slot instruction in the packet header. On the clock2 side, the time router module in the receiving part of router 2 stores the packet in a dual clock FIFO; a time slot instruction reader compares TSI1,2 with CTS2 before the path instructions are passed on.]
The architecture of the dual clock FIFO is not presented here
since it is not the main topic of this paper.
The TC adds a time slot instruction in the header path
instruction field of passing packets. This time slot instruc-
tion indicates to the TR the appropriate time slot number at
which it must release this packet. The TR does not use the
TDMA table; however, it remains aware of the current time
slot with a simple counter. The received packet is stored in
a FIFO and is released once the current slot count and the
time slot instruction are identical.
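The release rule of the time router can be illustrated with a short behavioural model. This is a minimal sketch in Python, not the RTL of the router: the class and method names are hypothetical, one `tick` stands for one time slot, and at most one packet is released per slot by assumption.

```python
from collections import deque


class TimeRouter:
    """Behavioural model of the receiving side (names are illustrative)."""

    def __init__(self, table_size):
        self.table_size = table_size   # |S|, number of slots in the table
        self.current_slot = 0          # simple slot counter, no TDMA table
        self.fifo = deque()            # stands for the dual clock FIFO

    def receive(self, time_slot_instruction, packet):
        # Store the packet with the instruction written by the time coder.
        self.fifo.append((time_slot_instruction, packet))

    def tick(self):
        """Process one slot; return the released packet or None."""
        released = None
        if self.fifo and self.fifo[0][0] == self.current_slot:
            released = self.fifo.popleft()[1]
        self.current_slot = (self.current_slot + 1) % self.table_size
        return released


router = TimeRouter(table_size=4)
router.receive(2, "packet-0")
released = [router.tick() for _ in range(4)]
```

In this run the packet waits during slots 0 and 1 and leaves the FIFO in slot 2, whatever its actual arrival time inside that interval.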
Figure 4 shows the architecture of the time coder and
time router on both sides of an unpredictable delay unidi-
rectional link. Note that the TC and the TR processings are
performed in parallel with the usual router computations so
no additional delay is introduced.
The time slot instruction can be arbitrarily chosen; the maximum required FIFO depth is equal to the number of reserved slots of communications crossing this link during a complete TDMA table rotation. However, we can adjust the time slot instruction to reduce latency.
Hereafter, we show how to determine the maximum slot time difference and the required FIFO depth.
The following parameters are used to formalize the computation, for a communication from area $x$ to area $y$.
(i) $L_s$ = slot size, in phits.
(ii) $|S|$ = time slot table size.
(iii) $\mathrm{CTS}_i$ = current time slot in area $i$, with $0 \le \mathrm{CTS}_i < |S|$.
(iv) $\mathrm{TSI}_{x,y}$ = time slot instruction computed in area $x$ and processed in area $y$, with $0 \le \mathrm{TSI}_{x,y} < |S|$.
(v) $\mathrm{SKEW}^{up}_{x,y}$ = upper bound of the clock skew between the clocks of areas $x$ and $y$.
(vi) $\mathrm{SKEW}^{lo}_{x,y}$ = lower bound of the clock skew between the clocks of areas $x$ and $y$.
(vii) $T^{up}_{x,y}$ = the number of periods necessary for a flit to cross the link, considering the longest specified boundary delay.
(viii) $T^{lo}_{x,y}$ = the number of periods necessary for a flit to cross the link, considering the shortest specified boundary delay.
(ix) Hence the maximum cycle delay difference between area $x$ and area $y$ (in clock periods) is
\[ C^{up}_{x,y} = T^{up}_{x,y} + \mathrm{SKEW}^{up}_{x,y}. \tag{1} \]
(x) The minimum cycle delay difference between area $x$ and area $y$ is
\[ C^{lo}_{x,y} = T^{lo}_{x,y} + \mathrm{SKEW}^{lo}_{x,y}. \tag{2} \]
(xi) Thus, the maximum slot time difference between area $x$ and area $y$ (in slots) is
\[ \mathrm{Delta}_{x,y} = \left\lceil \frac{C^{up}_{x,y}}{L_s} \right\rceil \tag{3} \]
and the time slot instruction for the router in area $y$ is
\[ \mathrm{TSI}_{x,y} = \left(\mathrm{CTS}_x + \mathrm{Delta}_{x,y}\right) \bmod |S|. \tag{4} \]
(xii) Finally, the FIFO depth is given by Algorithm 1.

Algorithm 1:
If $C^{lo}_{x,y} < 0$,
then $\mathrm{FDEPTH} = |S| \times L_s$,
else $\mathrm{FDEPTH} = \min\bigl((\mathrm{Delta}_{x,y} \times L_s) - C^{lo}_{x,y},\ |S| \times L_s\bigr)$.
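Equations (1) to (4) and Algorithm 1 can be checked numerically with a few lines of Python. The sketch below is an illustration under stated assumptions: function and argument names are hypothetical, all delays and skews are integer numbers of clock periods, FDEPTH is counted in phits, and Algorithm 1 is read as min(Delta × Ls − Clo, |S| × Ls).

```python
def time_routing_parameters(t_up, t_lo, skew_up, skew_lo,
                            slot_phits, table_size):
    """Return (delta, fifo_depth) for a link from area x to area y."""
    c_up = t_up + skew_up                      # equation (1)
    c_lo = t_lo + skew_lo                      # equation (2)
    delta = -(-c_up // slot_phits)             # equation (3), integer ceiling
    full_rotation = table_size * slot_phits    # |S| x Ls
    if c_lo < 0:                               # Algorithm 1
        fifo_depth = full_rotation
    else:
        fifo_depth = min(delta * slot_phits - c_lo, full_rotation)
    return delta, fifo_depth


def time_slot_instruction(cts_x, delta, table_size):
    """Equation (4): slot at which the time router releases the packet."""
    return (cts_x + delta) % table_size


# Hypothetical link: 2 to 5 periods of delay, skew between -1 and +2 periods,
# slots of 4 phits, table of 8 slots.
delta, depth = time_routing_parameters(t_up=5, t_lo=2, skew_up=2, skew_lo=-1,
                                       slot_phits=4, table_size=8)
tsi = time_slot_instruction(cts_x=7, delta=delta, table_size=8)
```

With these values the worst case difference is 7 periods, so Delta is 2 slots, the FIFO needs 7 phits, and a packet sent in slot 7 is released in slot 1 of the next table rotation.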
The value of Deltax,y could be added in the time router module instead of the time coder module. Moreover, this value may be static and preconfigured in the time router module. However, keeping it configurable offers more flexibility in the time domain to find a TDMA solution.
[Figure 5: NoC made of 3 connected subnetworks. Sub-NoC 1 (TDMA 1), sub-NoC 2 (TDMA 2), and sub-NoC 3 (TDMA 3), with routers (R) and network interfaces (NI) A and B.]
[Figure 6: Synchroniser architecture. Two synchronisers (Synch 1 and Synch 2) sit on either side of the clock border between a router of sub-NoC 1 and a router of sub-NoC 2 and exchange packets and credits. Synch 2 is detailed: dual clock FIFO, header decoder (queue ID), queues, TDMA slot scheduler, space and credit modules, header and path fields, sending port, and forwarder.]
4.2. TDMA synchroniser
4.2.1. Concept
In this case, we consider independent clocks between local sub-NoCs having distinct TDMA tables. To illustrate our concepts, we consider the NoC shown in Figure 5. It is made of three sub-NoCs joined side by side. Sub-NoC 2 offers guaranteed traffic to the flows passing through it.
A TDMA synchroniser, detailed in Figure 6, is introduced between each pair of communicating sub-NoCs as a bridge between their TDMA slot tables. It is composed of two synchronisers. Each synchroniser is composed of two correlated parts: the first one synchronizes the traffic (traffic synchroniser) while the other one forwards the traffic flowing in the opposite direction (forwarder). The two synchronisers are connected head to tail, that is, the forwarder of synchroniser 1 is connected to the traffic synchroniser of synchroniser 2, and vice versa. The general synchroniser architecture is detailed only for synchroniser 2.
A synchroniser belongs to the sub-NoC to which it sends
data, and is seen as a NI to this sub-NoC. As for any classic
NI, the synchroniser TDMA table contents are coherent with
the sub-NoC TDMA table. Each sub-NoC sees the other one
as a classic NI.
A packet leaves a sub-NoC by traversing the forwarder module of the local synchroniser. When it arrives at the remote synchroniser, it is stored in the dual clock FIFO. The header decoder uses the queue ID in the packet header to find the right queue in which to store this packet. The exact number of FIFOs, and the synchroniser architecture in general, depend on problem parameters and designer choices that are discussed in the next paragraphs. The TDMA scheduler requests read operations from a given queue according to its TDMA time slot reservation table. Moreover, the path field in the packet header receives the correct path instructions to cross this sub-NoC, depending on the communicated queue ID.
This guaranteed traffic service can be offered if and only if the following property is respected:
\[ |S_i| \ge \frac{B_i}{F} \times |S| + \frac{N_h \times L_h}{L_s}, \tag{5} \]
where
(i) $|S_i|$ is the number of reserved slots for a communication $i$.
(ii) $B_i$ is the specified payload in phits/s for communication $i$.
(iii) $F$ is the channel link frequency in the sub-NoC (Hz).
(iv) $N_h$ is the number of headers during one slot table iteration.
(v) $L_h$ is the header size in phits.
(vi) $L_s$ is the slot size in phits.
[Figure 7: End-to-end flow control at global NoC level. NIs A, B, C, and D are attached to three sub-NoCs (TDM 1, TDM 2, TDM 3); a single end-to-end flow control loop spans the whole NoC.]
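Property (5) translates directly into a slot budget. The Python sketch below is a hedged illustration: the function and parameter names are hypothetical, the relation is read as "reserved slots greater than or equal to the right-hand side", and the result is rounded up because only whole slots can be reserved. Exact fractions are used to avoid floating point rounding at the boundary.

```python
import math
from fractions import Fraction


def min_reserved_slots(payload_phits_per_s, link_freq_hz, table_size,
                       headers_per_rotation, header_phits, slot_phits):
    """Smallest integer |Si| satisfying property (5)."""
    # Share of the table needed by the payload: (Bi / F) x |S|.
    payload_slots = Fraction(payload_phits_per_s, link_freq_hz) * table_size
    # Slots consumed by headers in one table rotation: (Nh x Lh) / Ls.
    header_slots = Fraction(headers_per_rotation * header_phits, slot_phits)
    return math.ceil(payload_slots + header_slots)


# Hypothetical communication: 25 Mphit/s on a 100 MHz link, table of 8 slots,
# one 1-phit header per rotation, slots of 4 phits.
slots = min_reserved_slots(25_000_000, 100_000_000, 8, 1, 1, 4)
```

Here the payload alone needs 2 slots and the header adds a quarter of a slot, so 3 slots must be reserved.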
4.2.2. End-to-end flow control
To ensure that no overflow can occur in the destination queue, we use end-to-end flow control. At connection setup between a sender and a receiver, the full space of the destination queue is allocated to the sender. This queue is called the round trip latency hiding FIFO. Then, the sender can only send data to the receiver when it has space credits; credits represent the amount of free queue space at the receiver. The sender decreases this value each time it sends data to this destination. The receiver grants credits to the sender when data have been consumed and new empty space is therefore available in the destination queue.
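The credit mechanism described above can be sketched as a small sender-side counter. This is an illustrative Python model, not the NI implementation: the class and method names are hypothetical, and credits are counted in flits by assumption.

```python
class CreditedSender:
    """Sender side of end-to-end flow control (illustrative names)."""

    def __init__(self, destination_queue_depth):
        # At connection setup the whole destination queue is allocated.
        self.credits = destination_queue_depth

    def send(self, flits):
        """Send only if enough credits remain; return True when accepted."""
        if flits > self.credits:
            return False
        self.credits -= flits
        return True

    def grant(self, flits):
        """Credits returned by the receiver after it consumed data."""
        self.credits += flits


sender = CreditedSender(destination_queue_depth=4)
first = sender.send(3)     # accepted, 1 credit left
second = sender.send(2)    # refused, the queue could overflow
sender.grant(2)            # the receiver consumed 2 flits
third = sender.send(2)     # accepted, 1 credit left
```

Because the sender never holds more credits than the free space of the destination queue, the queue cannot overflow; the queue depth only determines how long the sender can keep transmitting before the first credits return.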
End-to-end flow control policy can be introduced at dif-
ferent hierarchical levels, global or local. An end-to-end flow
control between NIs across the global NoC is global. Inde-
pendent end-to-end flow controls in each sub-NoC are local.
The choice of this policy depends on a lot of conditions such
as reusability and adaptability.
4.2.3. Global end-to-end flow control
If the designer selects global end-to-end flow control for FIFO optimization reasons, different communications sharing the same resources can be considered as a single communication from the point of view of the crossed sub-NoC, such as sub-NoC 2 in the example of Figure 7. In the sub-NoC TDMA synchroniser, only one FIFO queue is implemented per destination, independently of the original source; moreover, the credit and space modules of Figure 6 are removed. End-to-end flow control at global NoC level (across all the sub-NoCs) needs deep FIFO queues located in the destination NIs. The objective is to hide the round trip latency (of credits at end-to-end flow control level) due to the long distance that a message must travel to arrive and that credits must travel to return. These FIFO queues can be large in the case of long paths. Another pertinent issue is TDMA coherency: two sub-NoCs are coherent if packets travelling from one to the other are entirely emitted and consumed in the same order. From a sub-NoC point of view, if the designer is able to identify a coherency possibility between two sub-NoCs, then a single FIFO is needed to store data after the removal of the queue ID. This implies a worthwhile cost reduction.
4.2.4. Local end-to-end flow control
The other possibility is end-to-end flow control at sub-NoC level. In a TDMA synchroniser, there is one FIFO queue for each sender-destination pair of communications crossing the inter-sub-NoC link. Packets are depacketized when leaving a sub-NoC and repacketized when entering another sub-NoC.
End-to-end flow control at sub-NoC level means multiple small end-to-end flow controls (Figure 8). Round trips are short but numerous. The FIFO cost for round trip purposes is distributed over the crossed TDMA synchronisers. Moreover, the buffer dimensioning of each sub-NoC is independent of neighbouring sub-NoCs, so sub-NoCs can be seen as single IPs.
4.2.5. Local versus global end-to-end flow control
The choice between global and local flow controls mainly depends on the nature of the constraints. Actually, for a given application, the total FIFO size in the local case is equal to or slightly larger than in the global case. On the one hand, the local case uses a larger number of small distributed FIFOs and so induces a larger control cost (including counters); a further drawback is that the FIFO size must be distributed over the whole set of FIFOs. On the other hand, the local case brings a separation of concerns and consequently facilitates design reconfiguration.
We can extract two extreme cases for which the choice is clear. First, in the case of many different communications between two sub-NoCs, a global implementation is required. On the contrary, if very few communications are specified, then a local configuration is more appropriate since the number of FIFOs is reasonable. Moreover, the packet resizing overhead exists only in the global case, as explained in Section 4.3.
[Figure 8: End-to-end flow control at subnetwork level. NIs A, B, C, and D are attached to three sub-NoCs (TDM 1, TDM 2, TDM 3); a separate end-to-end flow control loop is closed inside each sub-NoC.]
[Figure 9: Sub-NoC interconnections and header path instructions. IP1 to IP4 are attached through NI1 to NI4 to three sub-NoCs (TDMA 1, TDMA 2, TDMA 3) made of routers (R) and joined by synchronisers (S). The packet header carries a queue ID followed by the path instructions to cross the current sub-NoC.]
4.2.6. Header path instructions
Figure 9 shows the packet header path instruction fields. The path used in a given sub-NoC is only related to that sub-NoC, which means that it can be reconfigured independently from the others. When a packet leaves a sub-NoC, the previously used path instructions are removed. When a packet reaches a synchroniser, its queue ID is analyzed and its destination is deduced. The sub-NoC automatically inserts the appropriate path to reach this destination inside this sub-NoC.
This distribution of path knowledge reduces path instruction lengths and solves the problem of the size of the path field in packet headers. With this solution, we keep the main advantages of an NoC, which must be scalable and reconfigurable.
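The header handling at a sub-NoC border can be pictured with a toy model. The Python sketch below is purely illustrative: packets are plain dictionaries, the function names are hypothetical, and the queue IDs and output port numbers in the path table are invented for the example.

```python
def leave_sub_noc(packet):
    """Forwarder side: drop the path instructions of the sub-NoC being left."""
    return {"queue_id": packet["queue_id"], "path": [],
            "payload": packet["payload"]}


def enter_sub_noc(packet, path_by_queue_id):
    """Synchroniser side: look up the local path from the queue ID."""
    local_path = list(path_by_queue_id[packet["queue_id"]])
    return {"queue_id": packet["queue_id"], "path": local_path,
            "payload": packet["payload"]}


# Hypothetical path table of one sub-NoC: queue ID -> list of output ports.
sub_noc_2_paths = {1: [3, 2], 2: [1]}

packet = {"queue_id": 1, "path": [1, 3, 2], "payload": "data"}
in_transit = leave_sub_noc(packet)
in_sub_noc_2 = enter_sub_noc(in_transit, sub_noc_2_paths)
```

Each sub-NoC only needs its own table, so changing the routes of one sub-NoC does not affect the headers built by the others.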
4.3. Packet resizing in case of global end-to-end
flow control
4.3.1. Problem formulation
The number of reserved slots (send window) for the same communication may be different in the crossed sub-NoCs because of different slot table sizes and frequencies.
This leads to the problem of packet resizing with split and merge operations. Actually, packets may be split or merged according to the available send windows, which implies a control cost for packet reorganisation.
When a packet must be split in two parts, its header is copied to be the header of the second part. The credit information field is not copied. This increases the number of headers and therefore decreases the available bandwidth. The cost of header insertion is not negligible in the case of small packets.
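The bandwidth cost of splitting can be estimated with a short calculation. The Python sketch below is an illustration under assumptions that are not in the text: a packet may be split into more than two fragments, each fragment (header plus payload) must fit in one send window, and all sizes are counted in phits. Function names are hypothetical.

```python
def split_payload(payload_phits, header_phits, window_phits):
    """Return the payload size of each fragment after splitting."""
    room = window_phits - header_phits   # payload that fits beside a header
    if room <= 0:
        raise ValueError("send window too small to carry any payload")
    return [min(room, payload_phits - start)
            for start in range(0, payload_phits, room)]


def transmitted_phits(fragments, header_phits):
    """Total phits on the link: payload plus one header copy per fragment."""
    return sum(fragments) + header_phits * len(fragments)


# Hypothetical packet: 10 payload phits, 1 header phit.
small_window = split_payload(10, 1, window_phits=4)    # four fragments
large_window = split_payload(10, 1, window_phits=16)   # no split needed
cost_small = transmitted_phits(small_window, 1)
cost_large = transmitted_phits(large_window, 1)
```

In this example the narrow send window turns an 11-phit packet into 14 phits on the link, which shows why the overhead matters most for small packets.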
Rebuilding a packet implies removing a header, after memorizing its credit so that it can be added to the header of the new packet belonging to the same transaction. However, packets interleaved with packets belonging to another transaction cannot be pasted together, except if distinct FIFOs are used to order packets belonging to the same transaction; in usual cases, this approach is too complex.
Table 2: Configuration associated rules.
Send window size in sub-NoC i compared to send window in sub-NoC i+1 | Rules
> | Bandwidth offered in sub-NoC i+1 must be sufficiently higher than bandwidth offered in sub-NoC i to carry the additional headers introduced by packet splitting.
= 1 | The packet cannot be split. Bandwidth offered in sub-NoC i+1 must be at least equal to bandwidth offered in sub-NoC i.
≤ | The packet header can be emitted only during the first slot of the send window, to be sure it will not be split. Bandwidth offered in sub-NoC i+1 must be at least equal to bandwidth offered in sub-NoC i.
Packer resizing can introduce bandwidth degradation. In
many cases, it may be acceptable; however, simple solutions
make avoiding it possible.
4.3.2. Solutions
Consider a packet going from sub-NoCi to sub-NoCi+1. In
Table 2, we compare the send window widths of these two
neighbouring sub-NoCs and give the rules to respect.
A solution to obtain the same number of reserved slots
and the same bandwidth for communications in each sub-NoC
consists in conserving the same (frequency/slot table
size) ratio. Figure 10 shows an example with two sub-NoCs
with different TDMA slot table sizes, running at different
frequencies.
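The ratio rule can be checked numerically. The sketch below uses the values of Figure 10 (100 MHz with a slot table of 8 entries, and 50 MHz with a slot table of 4 entries); the function names are hypothetical and exact fractions are used only to avoid rounding issues.

```python
from fractions import Fraction


def reserved_slot_frequency_mhz(slot_frequency_mhz, slot_table_size):
    """Rate at which one reserved slot recurs: slot frequency / slot table size."""
    return Fraction(slot_frequency_mhz) / slot_table_size


def slot_table_size_for(slot_frequency_mhz, target_ratio_mhz):
    """Slot table size that keeps the ratio, or None if it is not an integer."""
    size = Fraction(slot_frequency_mhz) / Fraction(target_ratio_mhz)
    return int(size) if size.denominator == 1 else None


ratio_a = reserved_slot_frequency_mhz(100, 8)   # first configuration of Figure 10
ratio_b = reserved_slot_frequency_mhz(50, 4)    # second configuration of Figure 10
same_ratio = ratio_a == ratio_b
```

Both configurations give 12.5 MHz per reserved slot, so a communication keeps the same number of reserved slots and the same bandwidth in both sub-NoCs. When `slot_table_size_for` returns `None`, no integer slot table size preserves the ratio for that frequency, and the rules of Table 2 apply instead.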
5. CASE STUDIES
5.1. Application context
To point out the costs of the different solutions, we consider a 4G
telecommunication application. This is a two-way transceiver
implementing MC-CDMA and MC-SS-MA baseband
communication techniques. The application constraint is
665.6 µs and a frame is composed of 32 symbols. The MC-SS-MA
part requires 18 IP ports and the MC-CDMA one needs
32 IP ports. IP ports are grouped into 22 clusters. This application
requires 29 unidirectional communications between
dedicated hardware blocks including local memories. We use
end-to-end flow control, so communications can be seen as
bidirectional.
Figure 10: Frequency and slot table size ratio. One reserved slot frequency = (slots frequency/slot table size). Example values for the two sub-NoCs: 100 MHz / 8 = 12.5 MHz and 50 MHz / 4 = 12.5 MHz.

Required communication bandwidths are very different.
Communications close to the antenna in the application
graph need bigger bandwidth than decoded data or not yet
coded data. This application is mapped on a topology composed
of two parts called areas. Mapping is made to group
communications with small bandwidth in left area 1, and
large bandwidth communications in right area 2. The topology
of our example is shown in Figure 11. Area 1 and area
2 are implemented as one single NoC or two sub-NoCs in
the cases described hereafter. Three communications go from
area 1 to area 2; they are named interarea communications.
Interarea communications represent approximately 10% of
the total. This proportion is representative of sub-NoC
composition. Only interarea communications are represented in
Figure 11.
In this NoC, a phit is a 32-bit-wide word, the flit size is two
phits, and a header is one phit.
5.2. Case study descriptions
We have used our tool to find design solutions according
to the application constraints and to usual real-life cases (different
clock frequencies and long delays).
Case 1. A single clock NoC without any delay larger than the
clock period. It corresponds to a classic NoC case and is our
reference for comparisons.
In the following Cases 2 to 5, we assume a maximum de-
lay between both areas equal to 50 nanoseconds.
Case 2. A single clock NoC with a time routing solution to
solve the long delay problem.
For the following three cases, area 1 and area 2 are in different
clock domains (9 and 100 MHz, respectively). They are
implemented as sub-NoC 1 and sub-NoC 2. The sub-NoC 1
slot table size is no longer 6 but 4; all other parameters
remain unchanged. Two TDMA synchronisers are introduced
on the link between the two sub-NoCs. The following three
cases correspond to different system conditions.
Figure 11: Topology and interarea communications. [Diagram: 22 numbered clusters with their network interfaces (NI), eight routers (R), area 1 and area 2, and communication C.]
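Table 3: Case studies (F: clock frequency in MHz; S: slot table size).

| Case | Area 1 F (MHz) | Area 1 S | Area 2 F (MHz) | Area 2 S | Comments and solutions |
| --- | --- | --- | --- | --- | --- |
| 1 | 100 | 6 | 100 | 6 | Same TDMA, 1 cycle delay |
| 2 | 100 | 6 | 100 | 6 | Same TDMA, 5 cycles delay; time routing |
| 3 | 9 | 4 | 100 | 6 | Synchronisers with coherent TDMA |
| 4 | 9 | 4 | 100 | 6 | Synchronisers with global NoC end-to-end flow control |
| 5 | 9 | 4 | 100 | 6 | Synchroniser with local end-to-end flow control |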
Case 3. Two sub-NoCs with heterogeneous clock, coherent
TDMA, and global end-to-end flow control.
Case 4. Two sub-NoCs with heterogeneous clock, noncoher-
ent TDMA, and global end-to-end flow control.
Case 5. Two sub-NoCs with heterogeneous clock, noncoher-
ent TDMA, and local end-to-end flow control.
Note finally that this application does not have any la-
tency constraints but only bandwidth limitations. Our tool
found some solutions that meet the bandwidth constraints
for the various schemes without any latency test. However,
latency can be easily guaranteed when it is necessary. In prac-
tice, the latency and bandwidth are computed and checked
simultaneously.
The considered cases are summarized in Table 3.
5.3. Result analysis
Table 4 shows the relative cost of each case.

Table 4: Latency and FIFO depth cost.

| Case studies | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| Number of FIFO in synchronisers (control cost, FIFO size distribution) | 0 | 2 | 2 | 3 | 3 |
| Sum of FIFO depths in words | 323 | 335 | 348 | 348 | 348 |
| FIFO depths increase for the three inter-area communications | 0 | 12 | 18 | 18 | 18 |
| Latency for communication C (ns) | 200 | 240 | 1938 | 1938 | 1938 |

In Case 1, the total FIFO depth is the sum of the decoupling
FIFOs and the round trip latency hiding FIFOs in NIs; the
global latency for communication C is 200 nanoseconds.
The time delay added in Case 2 increases the latency by
the duration of 2 slots. The increase of FIFO cost is due to the
following two reasons.
(i) The FIFO in the two TRs (12 words).
Two TRs are added (one for each direction). Each TR
has a FIFO. The FIFO depth is constant (6 words); it
does not depend on interarea communications.
(ii) The depth of round trip latency FIFOs in NIs depends
on interarea communications. These FIFOs are used to hide
the latency of the returned credit for end-to-end flow
control (2 words).
In Cases 3 to 5, the reduced frequency in sub-NoC 1 implies
an increase of latency (+1578 nanoseconds) for communications
travelling in sub-NoC 1. The latency increase is
also due to the resynchronization process between TDMAs.
Case 3 benefits from coherent TDMAs in sub-NoC 1
and sub-NoC 2, and thus each synchroniser uses only one
FIFO. Case 4 implements noncoherent TDMAs with global
end-to-end flow control. It needs more FIFOs but the sum
of all FIFO depths remains unchanged.
Case 5 is similar to the previous one but uses a local flow
control.
To conclude, we observe that the management of fluctu-
ating multicycle delays and/or heterogeneous clock domains
has an acceptable cost for the four cases compared to the ref-
erence one.
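The cost relative to the reference can be quantified from the figures of Table 4. The short sketch below only restates those published values; the dictionary layout and the helper name are assumptions made for the example.

```python
# Values copied from Table 4 (sum of FIFO depths in words, latency of communication C in ns).
TABLE_4 = {
    1: {"fifo_words": 323, "latency_ns": 200},
    2: {"fifo_words": 335, "latency_ns": 240},
    3: {"fifo_words": 348, "latency_ns": 1938},
    4: {"fifo_words": 348, "latency_ns": 1938},
    5: {"fifo_words": 348, "latency_ns": 1938},
}


def relative_increase(value, reference):
    """Increase of value over reference, as a fraction of the reference."""
    return (value - reference) / reference


reference = TABLE_4[1]
fifo_overhead_percent = {
    case: round(100 * relative_increase(row["fifo_words"], reference["fifo_words"]), 1)
    for case, row in TABLE_4.items()
}
```

The total FIFO depth grows by about 3.7% in Case 2 and about 7.7% in Cases 3 to 5 with respect to Case 1, which supports the observation that the storage cost stays moderate. The latency of communication C is a separate matter, since it is dominated by the reduced frequency of sub-NoC 1.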
Additionally, our experience shows that the frequency reduction
implies an area increase for latency hiding. An accurate
study is needed to determine whether the dynamic power reduction
is worthwhile compared to the increase of static power
due to the area overhead.
6. CONCLUSION AND PERSPECTIVES
In this paper, we have introduced alternative solutions offering
guaranteed throughput traffic in the context of NoCs
with different clock areas and skew. We have presented two
original techniques (time router and TDMA synchroniser)
included in a new design approach based on the concept of
sub-NoC composition. This approach meets the design
productivity constraints since it is compliant with a classic
synchronous single NoC methodology and can be easily inserted
in a usual SoC design flow. It has been implemented in our
µSpider design tool and applied to a real-life telecom application.
Moreover, the solution of sub-NoCs, with independent
or disjoint time domains, makes possible the implementation of a
local/global NoC manager for power management by means of
Vdd/clock dynamic selection and for security monitoring.
These two points are our current research directions.
REFERENCES
[1] W. J. Dally and B. Towles, “Route packets, not wires: on-chip
interconnection networks,” in Proceedings of the 38th Design
Automation Conference (DAC ’01), pp. 684–689, Las Vegas,
Nev, USA, June 2001.
[2] T. Marescaux, A. Bartic, D. Verkest, S. Vernalde, and R. Lauw-
ereins, “Interconnection networks enable fine-grain dynamic
multi-tasking on FPGAs,” in Proceedings of the 12th Interna-
tional Conference on Field-Programmable Logic and Applica-
tions (FPL ’02), pp. 795–805, Montpellier, France, September
2002.
[3] W. J. Dally, “Interconnect-limited VLSI architecture,” in Pro-
ceedings of IEEE International Conference Interconnect Technol-
ogy, pp. 15–17, San Francisco, Calif, USA, May 1999.
[4] S. Evain and J.-Ph. Diguet, “From NoC security analysis to de-
sign solutions,” in Proceedings of The IEEE Workshop on Signal
Processing Systems (SIPS ’05), Athens, Greece, November 2005.
[5] W. J. Dally, “Virtual-channel flow control,” IEEE Transactions
on Parallel and Distributed Systems, vol. 3, no. 2, pp. 194–205,
1992.
[6] E. Bolotin, A. Morgenshtein, I. Cidon, R. Ginosar, and A.
Kolodny, “Automatic hardware-ecient SoC integration by
QoS network on chip,” in Proceedings of the 11th IEEE In-
ternational Conference on Electronics, Circuits and Systems
(ICECS ’04), pp. 479–482, Tel-Aviv, Israel, December 2004.
[7] K. Goossens, J. Dielissen, J. van Meerbergen, et al., “Guaran-
teeing the quality of services in networks on chip,” in Networks
on Chip, A. Jantsch and H. Tenhunen, Eds., pp. 61–82, Kluwer
Academic, Dordrecht, The Netherlands, 2003.
[8] M. Millberg, E. Nilsson, R. Thid, and A. Jantsch, “Guaranteed
bandwidth using looped containers in temporally disjoint net-
works within the Nostrum network on chip,” in Proceedings of
Design, Automation and Test in Europe (DATE ’04), vol. 2, pp.
890–895, Paris, France, February 2004.
[9] L. P. Carloni and A. L. Sangiovanni-Vincentelli, “Coping with
latency in SOC design,” IEEE Micro, vol. 22, no. 5, pp. 24–35,
2002, special issue on systems on chip.
[10] A. Lines, “Nexus: an asynchronous crossbar interconnect for
synchronous system-on-chip designs,” in Proceedings of the
11th Symposium on High Performance Interconnects, pp. 2–9,
Stanford, Calif, USA, August 2003.
[11] Arteris, http://www.arteris.net.
[12] E. Beigne´, F. Clermidy, P. Vivet, A. Clouard, and M. Renaudin,
“An asynchronous NOC architecture providing low latency
service and its multi-level design framework,” in Proceedings
of the 11th IEEE International Symposium on Asynchronous
Circuits and Systems (ASYNC ’05), pp. 54–63, New York, NY,
USA, March 2005.
[13] T. Bjerregaard and J. Sparsø, “A scheduling discipline for la-
tency and bandwidth guarantees in asynchronous network-
on-chip,” in Proceedings of the 11th IEEE International Sympo-
sium on Asynchronous Circuits and Systems (ASYNC ’05), pp.
34–43, New York, NY, USA, March 2005.
[14] D. Hillman, “Using mobilize power management IP for dy-
namic & static power reduction in SoC at 130 nm,” in Pro-
ceedings of Design, Automation and Test in Europe (DATE ’05),
vol. 3, pp. 240–246, Munich, Germany, March 2005.
[15] E. Rijpkema, K. Goossens, A. Ra˘dulescu, et al., “Trade os in
the design of a router with both guaranteed and best-eort ser-
vices for networks on chip,” in Proceedings of Design Automa-
tion and Test Conference in Europe (DATE ’03), pp. 350–355,
Munich, Germany, March 2003.
[16] A. Radulescu, J. Dielissen, K. Goossens, E. Rijpkema, and P.
Wielage, “An ecient on-chip network interface oering guar-
anteed services, shared-memory abstraction, and flexible net-
work configuration,” in Proceedings of Design, Automation and
Test in Europe (DATE ’04), vol. 2, pp. 878–883, Paris, France,
February 2004.
[17] S. Evain, J.-Ph. Diguet, andD. Houzet, “A generic CAD tool for
ecient NoC design,” in Proceedings of International Sympo-
sium on Intelligent Signal Processing and Communication Sys-
tems (ISPACS ’04), pp. 728–733, Seoul, Korea, November 2004.
Samuel Evain is an Associate Professor at
the UBS University (France) and works at
the LESTER Laboratory. His research inter-
ests include Network-on-Chip concept and
design methodology. He is currently finish-
ing a Ph.D. degree in electronics from the
Institut National des Sciences Applique´es
(INSA) of Rennes, France.
Jean-Philippe Diguet received the M.S. de-
gree and the Ph.D. degree from Rennes Uni-
versity (France), in 1993 and 1996, respec-
tively. His thesis, within the LASTI labora-
tory (IRISA/R2D2) addressed the estima-
tion of hardware complexity and algorith-
mic transformations for high level synthe-
sis. Then he joined the IMEC in Leuven,
where he worked as a postdoctoral fellow
on memory hierarchy decisions for power
optimization. He has been a Member of the LESTER laboratory
(Lorient, France) since 1998, where he started a research project in
design space exploration at both algorithmic and system levels.
He was an Associate Professor at UBS University (France)
from 1998 until 2002. In 2003, he initiated a technology transfer
and cofounded the Dixip Company in the domain of wireless
embedded systems. Since 2004 he has been a CNRS Researcher.
His current work focuses firstly on managing the EDA framework
project Design Trotter for design space exploration in the domain
of heterogeneous real-time embedded systems. The second topic
is the definition of environment-aware and self-adaptive architectures
under QoS and power constraints; it includes new RTOS
services, security concerns, NoC, and architecture reconfiguration
control.
Dominique Houzet received the M.S. de-
gree in computer sciences, in 1989 from
Paul Sabatier University, Toulouse, France,
and the Ph.D. degree and HDR degree in
computer architecture, in 1992 and 1999,
both from INPT, ENSEEIHT, Toulouse,
France. He worked at IRIT Laboratory and
ENSEEIHT Engineering School from 1992
to 2002 as an Assistant Professor, and at
IETR Laboratory, INSA Engineering School,
in Rennes from 2002 to 2006; he has also worked as a Digital Design
Consultant with SMEs and large companies. He is now a Professor at
LIS-INPG, Grenoble. He has published a number of research papers
in the area of parallel computer architecture and SoC design and
a book on VHDL principles. His research interests include code-
sign and SoC design methodologies applied to image processing
and radiocommunications. He is a Member of the IEEE Computer
Society.
