Wormhole cut-through switching: Flit-level messages interleaving for virtual-channelless network-on-chip
Faizal Arya Samman a,b,⇑, Thomas Hollstein c, Manfred Glesner a
a Technische Universität Darmstadt, FG Integrierte Elektronische Systeme, Research Group on Microelectronic Systems, Merckstrasse 25, D-64283 Darmstadt, Germany
b Universitas Hasanuddin, Jl. Perintis Kemerdekaan Km. 10, Makassar 90245, Indonesia
c Tallinn University of Technology, Department of Computer Engineering, Dependable Embedded Systems Group, Estonia
Microprocessors and Microsystems 35 (2011) 343–358
journal homepage: www.elsevier.com/locate/micpro
Article info
Article history: Available online 4 February 2011
Keywords: Network-on-chip; VLSI router architecture; Synchronous parallel pipeline; Switching method; Wormhole cut-through switching; Link-level flit flow control
doi:10.1016/j.micpro.2011.01.003
0141-9331/$ - see front matter © 2011 Elsevier B.V. All rights reserved.
⇑ Corresponding author. Addresses: Technische Universität Darmstadt, FG Integrierte Elektronische Systeme, Forschungsgruppe Mikroelektronische Systeme, Merckstr. 25, D-64283 Darmstadt, and LOEWE-Zentrum AdRIA (Adaptronik-Research, Innovation, Application) Fraunhofer Institut LBF, Bartningstr. 53, D-64289 Darmstadt, Germany.
E-mail addresses: faizal.samman@mes.tu-darmstadt.de, faizal.samman@loewe-adria.de (F. Arya Samman).
Abstract
A VLSI microarchitecture of a network-on-chip (NoC) router with a wormhole cut-through switching method is presented in this paper. The main feature of the NoC router is that wormhole messages can be interleaved (cut through) at flit level in the same buffer pool and share communication links. Each flit belonging to the same message can track its routing path correctly because a local identity tag (ID-tag), which varies over communication resources, is attached to each flit to support the wire-sharing message transportation. Flits belonging to the same message have the same local ID-tag on each communication channel. The concept, on-chip microarchitecture, performance characteristics and interesting transient behaviors of the proposed NoC router that uses the wormhole cut-through switching method are presented. The routing engine module in the NoC architecture is an exchangeable module and must be designed in accordance with the user specification, i.e., a static or adaptive routing algorithm. For quality-of-service purposes, inter-switch data transfers are controlled by link-level overflow control to avoid data drops.
© 2011 Elsevier B.V. All rights reserved.
1. Introduction
Network-on-chip (NoC) is a bridging concept between system-on-chip (SoC) and multiprocessor system-on-chip (MPSoC). The MPSoC is a special case of the SoC, in which the system uses multiple processor or digital signal processor cores on a single chip to process several tasks. Interprocessor communication in SoC systems can be carried out using buses or direct (point-to-point) interconnections. However, buses can efficiently connect only 3 to 10 communication partners and do not scale to higher numbers [23]. Direct or point-to-point interconnection between processor systems is effective when the number of processors is less than five, because for a larger-radix switch the wires dominate the overall circuit area, which leads to a link interference problem. In the future, a large number of processors (more than 16 cores) will be implemented on a single chip consisting of billions of transistors in nanometer-scale technology. NoCs offer a promising solution to the scalability problem in MPSoC design.
Fig. 1 shows an example of a multiprocessor system on the NoC platform. The chip uses a 2D 4×4 mesh topology, in which each mesh router is connected through its Local port to one resource tile via a network interface (NI). The other ports (East, North, West and South) are connected to the neighboring mesh router nodes. A resource tile can be an ASIC core, a bus-based microprocessor or digital signal processor system (as the networked processing unit, NPU), a (small or medium-size) memory with a direct memory access (DMA) controller, a reconfigurable logic block, I/O components, or even a single memory component with a DMA controller. The NI is an important component that assembles and disassembles packets.
The selection of the switching method for a NoC router can affect the selection of design parameters such as buffer sizes and the need for virtual channels. This section briefly reviews the existing switching methods that have been used in the high performance computing (HPC) area (off-chip networks) and have so far also been implemented in on-chip network prototypes. The switching methods of the HPC area presented in the literature are well summarized in [12].
Fig. 1. Multiprocessor system-on-chip interconnected with NoC routers.
The Packet Switching method is also commonly called Store-and-Forward (SAF) switching. It is implemented by dividing data messages into a number of packets. Each packet is completely stored in a FIFO buffer before it is forwarded to the next router. Therefore, the size (depth) of the FIFO buffers in the router is set equal to the size of the packet so that the packet can be stored completely. Packet switching is the first switching method that was used in many parallel machines. Early parallel machines that use packet switching are, for example, the Denelcor HEP machine, which is well introduced in [13], the MIT Tagged Token Dataflow machine [2] and the Manchester Dynamic Dataflow computer [17]. As in the off-chip network area, most of the early NoC concepts and prototypes also use the packet switching method, such as Proteo [36], Nostrum [30], MESCAL [38], MicroNet [42], CLICHÉ [27] and Arteris [29].
In the Wormhole Switching method, messages are divided into a number of flow control digits, commonly called flits. Every flit may carry a data word. The main advantage of wormhole switching is that the buffer size can be made as small as possible to reduce the buffering area cost. In wormhole switching, messages flow through the network like a worm through holes in the ground. The header flit of each packet determines the routing direction at each node and reserves a set of routing paths, while the payload flits follow the path made by the header flit. The tail flit at the end of each packet then terminates the reservation. The main drawback of the wormhole switching method is the head-of-line blocking problem. The wormhole switching method was first introduced in [9]. The work in [8] also presents the performance of wormhole switching in k-ary n-cube interconnection networks. Parallel machines in the HPC area that use wormhole switching include, for example, the Intel Paragon XP/S [22], Cray T3D (Toroidal 3D) [33], IBM Power Parallel SP1 [40] and Meiko CS-2 (Computing Surface) [4]. In the NoC area, wormhole switching is also preferred and has been used in some recent NoC-based CMP system prototypes such as Tile64 [41], TRIPS [15], Teraflops [20] and SCC NoC [21].
In the store-and-forward packet switching method, the packet is completely stored before it is forwarded to the next router. The delay incurred by waiting for the complete packet to be stored can be reduced by forwarding the first lines of the packet to the next router as soon as routing has been made for the packet and there is enough space in the required FIFO buffer of the next router to store these first lines. This switching technique is known as Virtual Cut-Through (VCT) switching and was first introduced in [25]. The Alpha 21364 [31], with its on-chip router, is one of the multiprocessor systems that use the VCT switching method. The work in [26] presents the Chaos Router, which is one of the best VCT switching implementations. A few NoC prototypes such as SPIN [16] and IMEC NoC [3] use the VCT switching method.
The Circuit Switching method is commonly used in connection-oriented communication protocols. It is performed by establishing a connection and reserving some communication resources. Once a virtual circuit from the source to the destination node has been configured and the destination node has confirmed the successful connection by sending a response packet to the source node, the message can be transmitted through the network in a pipelined manner. At the end of the data transmission, a control packet is sent into the network to terminate the connection circuit. The circuit switching method is commonly used to provide a guaranteed-bandwidth or guaranteed-throughput communication protocol for quality of service. It was originally used in telephone networks. In the HPC area, parallel machines that have used the circuit switching method include the Intel iPSC/2 [32], which uses a Direct Connect Communications Technology, and the Motorola-based BBN GP 1000 [7], which uses a multistage interconnection network with a butterfly interconnect structure. In the NoC area, the circuit switching method is used to provide guaranteed-throughput service. NoCs that use the circuit switching method include DSPIN [34], PNoC [19], MANGO [6] and Æthereal [35].
A few hybrid data switching methods have also been proposed in the interconnection network community, such as pipelined circuit switching (PCS) [1,14], which combines the characteristics of the wormhole and circuit switching methods, and buffered wormhole switching (BWS), which is a variant of wormhole switching that incorporates characteristics of store-and-forward packet switching. BWS was first introduced in the IBM Power Parallel SP2 [18]. Switching strategies such as Mad Postman Switching [24] and Scouting Switching [11,10] have also been introduced. The scouting switching strategy was proposed to improve the performance of the PCS method and its capability to tolerate faulty links. The work in [12] also summarizes well the mechanisms and the history of the switching methodologies used so far in existing interconnection networks and in the high performance computing (HPC) area.
The analysis and implementation of another hybrid switching method used in off-chip networks is presented in [39]. That work dynamically combines VCT switching and wormhole switching to achieve better performance than traditional wormhole switching while requiring less buffer space than VCT switching. The work in [28] presents a layered switching method suitable for NoCs. The layered switching method conceptually implements wormhole switching on top of VCT switching, which is enabled by virtually partitioning the flits of a packet into logical groups. It uses virtual channels (VCs), where one VC is allocated to each packet. A link, however, is allocated group by group: as soon as a buffer slot in the next downstream router is free, a flit of each group is pipelined onto the link. Compared to VCT switching, it requires less buffer space, because when data blocking occurs, the buffer only needs to store an entire group of a packet instead of the entire packet.
The remainder of this paper is organized as follows. Section 2 gives a short description of the main contribution of this work. Section 3 describes how the wormhole cut-through switching method can solve the head-of-line blocking problem without using virtual channels. Section 4 presents the features and characteristics of our NoC in more detail. Section 5 shows the prototyping of our NoC using a CMOS standard-cell library as well as logic area comparisons with other existing NoCs. Section 6 presents an experimental result of a cycle-accurate RTL simulation. Finally, concluding remarks and future work on the current NoC implementation are given in Section 7.
2. Contribution
The main problem of the traditional wormhole switching method is head-of-line blocking. If the header flit is blocked, it keeps blocking the paths that are still held by the wormhole packet. Virtual channels can be used to solve this problem. However, the main criticism of virtual channels in the NoC context is their prohibitive area cost in terms of buffering. Virtual channels increase the total buffer count and can result in power consumption that exceeds the target constraint for a certain embedded application [21]. The silicon area of a NoC router is dominated by buffers; hence, power is mostly dissipated by these buffer components. Designing a NoC router with minimum buffer size is therefore always an important aspect for NoC-based embedded multiprocessor systems-on-chip (MPSoC), whose power supply capacity is limited by the battery life of the system. Minimum buffer size is also an interesting design parameter in the chip-level multiprocessor (CMP) systems domain, especially if the network is very large, so that the power dissipation scales up with the network size. Virtual channels also add more arbitration to the router's critical path, potentially affecting the cycle time or pipeline depth of the router [15].
This paper presents a parallel pipeline VLSI microarchitecture of a NoC with a unique switching method called the "wormhole cut-through (WormCT) switching method". The main contribution of our NoC architecture is the ability to interleave flits of different messages on the same link by using flit-by-flit rotating arbitration and a routing organization based on local variable ID-tag management. With the proposed wormhole cut-through switching method implemented in our NoC router, the head-of-line blocking problem of the traditional wormhole switching method can be solved without the need for virtual channels. In our proposed method, the virtual channel ID is replaced by a variable local ID. The functionality of the virtual channel controller units used in the virtual channel solution is implicitly replaced by the local ID-management units implemented in our NoC router.
3. Wormhole cut-through switching
3.1. Blocking problem in traditional wormhole switching
Fig. 2 shows two snapshots of a blocked data flow under traditional wormhole packet switching. We assume that the wormhole packets A, B, C and D have the same flow rate and are routed to node (4,1), consuming the maximum bandwidth capacity of the NoC links. Snapshot 1 of the figure shows that packet C is blocked at node (3,1) because the required link (East output port) has been acquired by the wormhole packet D. Packets A and B are also blocked at node (2,1) and (3,1) respectively, because the required outgoing links were previously reserved by packet C. If packet D is very large, the blocking situation will persist for a long time.
Fig. 2. Blocking problem in traditional wormhole switching (Snapshot 1 and Snapshot 2).
If the wormhole packet size is 4, for instance, then we can estimate from Snapshot 2 of Fig. 2 that packet C will escape from the blocking situation after a few cycles. As shown in Snapshot 2 of the figure, because the depth of the FIFO buffers is two, the 4-flit wormhole packet C occupies two FIFO buffers, at the West input ports of router nodes (2,1) and (3,1). If the FIFO depth were four, all flits of packet C would be stored in the West port FIFO at node (3,1), and packet A would then occupy the West port FIFO at node (2,1). Each wormhole packet can acquire the link only after the other packet has finished forwarding its last flit. A solution is to add virtual channels and introduce a virtual channel ID (VC-ID) organized by a virtual channel controller unit. However, this approach increases logic area and static power, and may also add more arbitration to the router's critical path, potentially affecting the cycle time or pipeline depth of the router [15]. The following subsection shows how a link can be shared by wormhole packets using local ID slots, which replace the VC-ID functionality, so that virtual channels are not needed.
3.2. Blocking problem solution
Fig. 3 presents six snapshots showing how four packets escape from a blocking situation caused by contention between packets that are injected to consume the maximum bandwidth capacity of the link. In general, the arbiter unit at every output port rotates its selection among the input ports that have sent routing request flags to the arbiter. In the blocking situation of Fig. 3 (Snapshots 2, 3, 4, 5 and 6), the arbiter at each port rotates its selection between the West and North input ports. At node (3,1), the arbiter at the East output port rotates its selection between packet D from the North port and packet C from the West port. Both packets then occupy 50% of the maximum link bandwidth capacity (Bmax). Initially, packets C and B as well as packets C and A occupy 50% of the Bmax of the East outgoing link at nodes (2,1) and (1,1), respectively. After a few cycles, however, because of the sequential blocking, packets C and B share the remaining 1/2 Bmax of the East outgoing link at node (2,1), i.e. 1/4 Bmax for each of packets C and B. Because packet C must also share the East outgoing link at node (1,1) with packet A, at the steady-state point packets A and C each finally use only 12.5% of the Bmax, i.e. 1/8 Bmax.
As shown in Fig. 3 (Snapshots 4, 5 and 6), the communication link connecting the East output port of router node (3,1) and the West input port of node (4,1) is shared by packets D, C and B, which are allocated to local ID slots 0, 1 and 2, respectively, as shown in the ID slot table at the East output port of node (3,1). At node (3,1), packets D, C and B arrive from the North, West and West input ports and have old ID-tags 0, 0 and 1, respectively. The ID slot number is then attached to each flit of a packet as its ID-tag, so that flits belonging to the same packet have the same ID-tag. To guarantee this, once an ID slot is reserved by the header of a packet, two important pieces of information, i.e. the old ID-tag of the packet and the port from which it comes, are stored in the ID slot table of the output port. The payload flits can then be allocated to the right ID slot by identifying both pieces of information. Likewise, by applying an ID-based routing reservation table at each input port, each flit of the interleaved packets can be routed correctly to its requested output path. The detailed mechanisms of ID-based routing and ID-tag updating are shown later in Sections 4.3.2 and 4.4.
4. NoC features and microarchitecture
The XHiNoC (eXtendable Hierarchical Network-on-Chip) router architecture is designed using a modular approach. The microarchitecture consists of four main components. Two components, a routing engine with data buffer (REB) and a FIFO queue (Q), are placed at each input port, while the other two, a crossbar multiplexor with ID-management unit (MIM) and an arbiter unit (A), are located at each output port. Fig. 4 presents the microarchitecture of the crossbar switch. For N I/O pairs, there are N routing request (rr) and N routing arbitration (ra) signals from and to each REB module, as well as from and to each A unit, at every input and output port.
Fig. 3. Blocking problem solution in wormhole cut-through switching (Snapshots 1 to 6). Legend: IdO = current local ID, IdN = new local ID, Rd = route direction, Fr = from input port.
Fig. 4. Microarchitecture of the crossbar switch (router).
4.1. Simultaneous parallel pipeline IO interconnects
Our NoC can switch up to N simultaneous intra-IO-port interconnects in parallel, because a routing engine and an arbiter unit are allocated at each input and output port, respectively. N is the number of I/O pairs in the router. This feature is certainly not new in the NoC research area, as it has been implemented in NoC architectures such as the Intel Teraflops NoC [20], SCC NoC [21] and Æthereal NoC [35]. If N simultaneous incoming data are to be switched, each to a different outgoing port, then a switch with a single routing machine and a switch with multiple routing machines both perform N routing and arbitration steps. The difference is that the switch with a single routing machine performs them sequentially, while the switch with multiple routing machines performs them concurrently. Clearly, the bandwidth capacity of the router with multiple routing machines can be N times higher. Since energy is power × time, the energy needed to hold data in the input buffers is also lower with multiple routing machines, because the switch with a single routing machine keeps the data in the input buffers for a longer time owing to the sequential routing delay. However, the extra routing machines in the router with multiple routing machines will certainly increase the static power dissipation of the switch because of the larger number of logic gates.
We now present how the NoC provides this advantageous feature of a modern NoC router design with high bandwidth capacity. The Intel Teraflops NoC [20] has a six-stage pipeline, i.e. Buffer Write, Buffer Read, Route Compute, Port/Lane Arbitration, Switch Traversal and Link Traversal. We use the same technique with fewer pipeline stages. Our NoC uses a 4-stage pipeline, i.e. Buffer Write, Buffer Read + Route Compute, Port Arbitration and Switch/Link Traversal. We can combine the Buffer Read and Route Compute pipeline stages to reduce the cycle delay from the input port to the output port of the router, while the Intel Teraflops NoC cannot, because it implements a double-pumped (dual-lane) crossbar switch. When the REB unit reads a flit from the FIFO queue, it computes the routing direction and stores the flit in its buffer in the same step. We also combine the Switch Traversal and Link Traversal stages, because once a flit is switched out, it can be written into the FIFO queue of the next router. Both pipeline stage reductions must, of course, be controlled carefully to avoid flit drops and unnecessary flit replications.
Fig. 5 presents how simultaneous parallel intra-port interconnects are performed, i.e. switch interconnects from Port 1 to Port 3 and from Port 3 to Port 1. Data switching in the XHiNoC router consists of three steps. The first step is the routing request. The second step is the routing acknowledge or routing grant step, which is followed by the data outgoing or data switching step. The steps are presented conceptually in Fig. 5a–c, respectively. Fig. 5d shows an architectural view of the steps. The control paths that are set during the routing request step (paths a1 and a2) and the routing acknowledge step (paths b1 and b2) are marked in the figure, while the data paths from input to output ports are marked c1 and c2. The path names correspond to those in Fig. 5a–c to make the proposed switching methodology easy to follow.
Fig. 5. Simultaneous parallel intra-port interconnects: (a) request snapshot, (b) grant/ack. snapshot, (c) accept snapshot, (d) architectural view.
Fig. 6. Packet format: (a) single-packet message; (b) XHiNoC's packet format (3-bit type, 4-bit id-tag and 32-bit data word; the data word of the header holds the 4-bit fields Xs, Ys, Zs, Xt, Yt, Zt, ext1 and ext2).
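The request and grant steps rely on an arbiter that rotates its selection among the requesting input ports. The sketch below is a minimal software model of such a rotating (round-robin) arbiter, written under the assumption that the port served last gets the lowest priority in the next round. The class name `RotatingArbiter` and its interface are hypothetical and are not taken from the XHiNoC RTL.

```python
class RotatingArbiter:
    """Software model of a rotating arbiter for one output port."""

    def __init__(self, num_inputs):
        self.num_inputs = num_inputs
        # Start so that input 0 has the highest priority in the first round.
        self.last_granted = num_inputs - 1

    def grant(self, requests):
        """Return the index of the granted input, or None if nobody requests.

        requests is a sequence of booleans, one routing request (rr) flag per
        input port. The search starts just after the input granted last time.
        """
        for offset in range(1, self.num_inputs + 1):
            candidate = (self.last_granted + offset) % self.num_inputs
            if requests[candidate]:
                self.last_granted = candidate
                return candidate
        return None


# Two inputs (1 and 3) keep requesting the same output port: the grants
# alternate, so each input obtains half of the link bandwidth.
arbiter = RotatingArbiter(5)
requests = [False, True, False, True, False]
print([arbiter.grant(requests) for _ in range(4)])
```

With two permanently requesting inputs the grants alternate between them, which is the flit-by-flit link sharing behavior described in Section 3.2.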
4.2. Packet format
For a unicast message in the XHiNoC, the packet has only one header flit, even if the message is very large. Hence, in this paper, the terms "packet" and "message" are used interchangeably. The single-packet-based message assembly is shown in Fig. 6a. The detailed packet format and the control bits used in the XHiNoC architecture are presented in Fig. 6b. The message is split into several flits, each 39 bits wide: 32 bits for the data word plus 7 extra bits, i.e. a 3-bit field that defines the flit type and a 4-bit field that holds the local identity label or ID-tag of the message. This single-packet-based message assembly is also suitable for streaming-based data communication, where the message is extremely large.
For our current NoC version with Best-Effort (BE) service, the flit type can be (1) a header flit (Head), (2) a databody (payload) flit (DBod), or (3) a tail (end of payload data) flit (Tail). The remaining possible flit types are reserved for our future extended NoC version with Guaranteed-Throughput (GT) service, or a combination of both BE and GT services. The routing direction at each router is determined only once, by the packet header. Afterwards, the payload flits (possibly a very long data stream) follow the routing path made by the header. With the packet format shown in Fig. 6b, there is no out-of-order problem, even when an adaptive routing algorithm is used.
The message is classified into three flit types, i.e. header flit (Head), databody or payload data flit (DBod) and tail flit (Tail). The source and target addresses of the message are defined as 3D addresses (x,y,z). The z address is not used in the current 2D mesh topology but is reserved for developing a hierarchical 2D or stacked 3D network-on-chip. Flits belonging to the same message have the same local identity number (ID-tag), which differentiates them from flits of other messages when they pass through a communication link of the NoC. The ID-tag of the data flits of one message varies over different communication links, allowing different messages to be interleaved with each other at flit level while being routed with wormhole switching.
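The 39-bit flit layout can be illustrated with a small packing routine. The sketch below assumes that the 3-bit type field occupies the most significant bits, followed by the 4-bit ID-tag and the 32-bit data word, and that the header word stores its eight 4-bit fields in the order Xs, Ys, Zs, Xt, Yt, Zt, ext1, ext2 from the most significant nibble down. The numeric type codes and all function names are hypothetical, since the encoding values are not given in the text.

```python
# Hypothetical type codes; the actual 3-bit encodings are not stated here.
HEAD, DBOD, TAIL = 1, 2, 3

HEADER_FIELDS = ("xs", "ys", "zs", "xt", "yt", "zt", "ext1", "ext2")


def pack_flit(flit_type, id_tag, word):
    """Pack a 39-bit flit: 3-bit type | 4-bit ID-tag | 32-bit data word."""
    if not (0 <= flit_type < 8 and 0 <= id_tag < 16 and 0 <= word < 2**32):
        raise ValueError("field out of range")
    return (flit_type << 36) | (id_tag << 32) | word


def unpack_flit(flit):
    """Return (flit_type, id_tag, word) of a 39-bit flit."""
    return (flit >> 36) & 0x7, (flit >> 32) & 0xF, flit & 0xFFFFFFFF


def pack_header_word(**fields):
    """Pack the eight 4-bit header fields; missing fields default to 0."""
    word = 0
    for name in HEADER_FIELDS:
        value = fields.get(name, 0)
        if not 0 <= value < 16:
            raise ValueError(f"{name} does not fit in 4 bits")
        word = (word << 4) | value
    return word


def unpack_header_word(word):
    """Return a dict with the eight 4-bit header fields."""
    return {
        name: (word >> (4 * (len(HEADER_FIELDS) - 1 - i))) & 0xF
        for i, name in enumerate(HEADER_FIELDS)
    }


# Header flit with ID-tag 1 travelling from (0,0,0) to (1,0,0).
header = pack_flit(HEAD, 1, pack_header_word(xt=1))
flit_type, id_tag, word = unpack_flit(header)
print(flit_type, id_tag, unpack_header_word(word))
```

Because only the ID-tag field changes from link to link, a router in this model can rewrite the four ID-tag bits without touching the type field or the data word.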
4.3. Packet interleaving and ID-based routing organization
4.3.1. Packet interleaving
Messages or packets in the NoC are routed based on their ID-tag. Each flit of a message has a local ID-tag held in the 36th to 33rd bits. Flits belonging to the same message always have the same ID-tag on a given communication resource. The local ID-tag is updated to manage ID slot allocation for messages that share a communication link and to allow message interleaving on the link. The upper and bottom views of Fig. 7 show how three messages can be interleaved on the same link based on the ID-tag identification of the messages.
As shown in Fig. 7, messages D, G and E share the link connecting routers R5 and R2. They reserve ID-tag numbers 0, 1 and 2, respectively. Each payload flit can be routed at the North input port of router R2 by reading the flag in the routing reservation table that was programmed when the header of each message was routed from that input port. As shown in the figure, messages D, G and E are routed to the East (E), Local (L) and South (S) output ports, respectively.
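The ID slot table kept at an output port can be modeled in a few lines. The sketch below is an illustration only: it assumes that a header reserves the lowest free slot, that payload flits are matched by the pair (old ID-tag, input port), and that the tail flit frees the slot. The allocation policy, the release on tail and the class name `IdSlotTable` are assumptions of this sketch; the actual ID-management mechanism is described in Section 4.4.

```python
class IdSlotTable:
    """Model of the ID slot table of one output port."""

    def __init__(self, num_slots=16):
        # Slot index = new local ID-tag; entry = (old ID-tag, from port).
        # 16 slots are assumed here because the ID-tag field is 4 bits wide.
        self.slots = [None] * num_slots

    def reserve(self, old_id, from_port):
        """Header flit: reserve the lowest free slot, return the new ID-tag."""
        for new_id, entry in enumerate(self.slots):
            if entry is None:
                self.slots[new_id] = (old_id, from_port)
                return new_id
        return None  # no free slot on this link

    def lookup(self, old_id, from_port):
        """Payload flit: find the slot reserved by the header of its message."""
        return self.slots.index((old_id, from_port))

    def release(self, old_id, from_port):
        """Tail flit: return the new ID-tag and free the slot (assumed)."""
        new_id = self.lookup(old_id, from_port)
        self.slots[new_id] = None
        return new_id


# East output port of node (3,1) in Fig. 3: packets D, C and B arrive from
# the North, West and West ports with old ID-tags 0, 0 and 1.
table = IdSlotTable()
print(table.reserve(0, "N"), table.reserve(0, "W"), table.reserve(1, "W"))
print(table.lookup(1, "W"))
```

In this model the three headers obtain slots 0, 1 and 2, in line with the allocation shown for packets D, C and B in Fig. 3.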
4.3.2. ID-based routing mechanism
The Routing Engine (RE) module consists of the routing reservation table and the routing state machine. Algorithm 1 presents a routing state machine using a static XY routing algorithm. F_type is the flit type, S_REB is the state of the REB module and ra is the routing arbitration signal. The number of registers in the routing reservation table is equal to the number of available ID slots on each link of the NoC routers. The ID slots are described as S ∈ {0, 1, 2, ..., M}. Thus, there are M + 1 available ID slots.
Fig. 7. ID-based link sharing and routing organization (ID slot table at the South port of R5; routing reservation table at the North port of R2).
Algorithm 1. Routing state machine with XY routing algorithm
while F_type = Header and (S_REB = free or (S_REB = busy and ra is set)) do
    X_offset = X_target - X_source
    Y_offset = Y_target - Y_source
    if X_offset = 0 and Y_offset = 0 then
        Routing = Local
    else if X_offset > 0 then
        Routing = East
    else if X_offset < 0 then
        Routing = West
    else if X_offset = 0 and Y_offset > 0 then
        Routing = North
    else if X_offset = 0 and Y_offset < 0 then
        Routing = South
    end if
end while
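The decision logic of Algorithm 1 can be written as a small Python function. This is a sketch under two assumptions that are not stated in the listing: the subtracted address (X_source, Y_source) is read as the address of the router at which the decision is made, and the y coordinate is taken to grow toward North. The helper `trace` and the `STEP` table are hypothetical additions used only to walk a path through the mesh.

```python
def xy_route(x_here, y_here, x_target, y_target):
    """Static XY routing: correct the x offset first, then the y offset."""
    x_offset = x_target - x_here
    y_offset = y_target - y_here
    if x_offset == 0 and y_offset == 0:
        return "Local"
    if x_offset > 0:
        return "East"
    if x_offset < 0:
        return "West"
    # Here x_offset is 0, so only the y offset remains.
    return "North" if y_offset > 0 else "South"


# Assumed coordinate convention: East is +x and North is +y.
STEP = {"East": (1, 0), "West": (-1, 0), "North": (0, 1), "South": (0, -1)}


def trace(source, target):
    """List the routing decisions taken hop by hop from source to target."""
    x, y = source
    decisions = []
    while True:
        direction = xy_route(x, y, *target)
        decisions.append(direction)
        if direction == "Local":
            return decisions
        dx, dy = STEP[direction]
        x, y = x + dx, y + dy


print(trace((1, 1), (4, 1)))
print(trace((2, 3), (1, 1)))
```

The first call follows the path of packet C in Fig. 2 and Fig. 3, from node (1,1) eastwards to node (4,1).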
When a header flit with ID-tag h, where h ∈ S, arrives from an incoming port and appears at the output of the queue module, the routing state machine in the routing engine computes a routing direction D based on the target address present in the header flit. The routing direction D is then stored in the routing reservation table, in register number h in accordance with the ID-tag. The routing direction D is a set of output port routing information that is represented in the routing reservation table as a binary set. For instance, in our mesh switch router, we define 10000 = East (E), 01000 = North (N), 00100 = West (W), 00010 = South (S), 00001 = Local (L) and 00000 = no direction.
In other words, the routing direction produced by the routing engine (RE) module is stored in compressed form, and the key used by databody or tail flits to retrieve it is the ID-tag. Therefore, when payload flits arrive from the incoming port with the same ID-tag h (i.e. they belong to the same message as the header flit), the routing direction D is read from the routing reservation table using the ID-tag of the payload flit as the key. With this routing technique, the payload flits can follow the correct paths even when they are interleaved in the same queue with flits of other messages.
Fig. 8 shows an example of the ID-based routing mechanism in detail. The figure presents how the combined routing state machine and routing reservation table route the flits of different interleaved messages to the correct output directions. In the example, the messages are routed with a static XY routing algorithm. Three flits of three different messages are buffered in the FIFO queue.
A message can be differentiated from the other messages by its local ID-tag. As presented in the figure, messages A, B and C have local ID-tags 0, 1 and 2, respectively. We assume that the headers of messages A and C have been routed before. Hence, the entries of the routing reservation table with index 0 and index 2 already contain the binary sets of routing information. The subfigures are explained in the following three phases.
1. Routing for header. As presented in Fig. 8a, the header flit of message B with ID-tag h = 1 and target address (xt,yt,zt) = (1,0,0) appears at the output port of the queue. Because the flit type is a header and the current address of the router is (xc,yc,zc) = (1,1,0), the routing state machine computes the routing direction D = South, and in the same cycle the header is buffered in the data buffer of the routing engine. The routing information is then stored in register number 1 of the routing table. As presented in the figure, the column S in row 1 is set to 1. Hence, the routing request signal rr is set to 00010.
2. Routing for data body. The data body flit of message C with local ID-tag 2 now appears at the output port of the queue, as shown in Fig. 8b. In this phase, because the flit of message C is a data body flit, the routing information is fetched directly from the routing reservation table based on the local ID-tag 2 of the flit. The routing state machine is not active in this case. As explained above, the header of message C has been routed before. Thus, the routing request signal rr is set to 00001, i.e. the Local outgoing direction.
3. Routing for data tail. As shown in Fig. 8c, the tail flit of message A with local ID-tag 0 is now at the output port of the queue. The routing state machine is again not active. In accordance with its local ID-tag 0, the routing information is fetched directly from the routing reservation table, i.e. D = North (01000). In this case, however, the "tail" flit type is identified in the type field of the flit. Thus, in the next cycle, when the flit is switched out to the North outgoing port, the content of the routing reservation table in register 0 is reset to 00000.
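The three phases above can be replayed with a small behavioral model. The sketch below is a hedged illustration in Python, not the RTL of the router: the class `RoutingEngine`, its method `route` and the helper `xy_direction` are hypothetical names, the z coordinate is ignored, and the table entry of a tail flit is cleared immediately rather than in the next clock cycle.

```python
# One-hot direction encoding as defined in the text.
DIRECTION_BITS = {
    "East": 0b10000, "North": 0b01000, "West": 0b00100,
    "South": 0b00010, "Local": 0b00001,
}
NO_DIRECTION = 0b00000


def xy_direction(current, target):
    """Static XY routing decision (X offset first, then Y offset)."""
    dx, dy = target[0] - current[0], target[1] - current[1]
    if dx == 0 and dy == 0:
        return "Local"
    if dx != 0:
        return "East" if dx > 0 else "West"
    return "North" if dy > 0 else "South"


class RoutingEngine:
    """Routing reservation table indexed by the local ID-tag (sketch)."""

    def __init__(self, current, num_slots=16):
        self.current = current
        self.table = [NO_DIRECTION] * num_slots

    def route(self, flit_type, id_tag, target=None):
        if flit_type == "header":
            # Only headers activate the routing state machine.
            direction = xy_direction(self.current, target)
            self.table[id_tag] = DIRECTION_BITS[direction]
        request = self.table[id_tag]          # databody/tail: table lookup
        if flit_type == "tail":
            self.table[id_tag] = NO_DIRECTION  # release the entry
        return format(request, "05b")


engine = RoutingEngine(current=(1, 1))
# Headers of messages A (ID 0) and C (ID 2) are assumed routed earlier.
engine.table[0] = DIRECTION_BITS["North"]
engine.table[2] = DIRECTION_BITS["Local"]

print(engine.route("header", 1, target=(1, 0)))  # 00010 (South)
print(engine.route("databody", 2))               # 00001 (Local)
print(engine.route("tail", 0))                   # 01000 (North)
print(format(engine.table[0], "05b"))            # 00000 after the tail
```

The point of the table is that only header flits carry a target address; every later flit of the same message needs nothing but its ID-tag to find its output port.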
Fig. 8. ID-based routing mechanism. (a) Routing a header flit. (b) Routing a databody flit. (c) Routing a tail (end of message) flit.
Fig. 9. Local ID-tag management.
4.3.3. Runtime local ID-based interconnect configuration
Fig. 7 presents traffic in the XHiNoC, which is switched and scheduled based on the variable local ID-tag routing organization. The figure shows a small 2D 3 × 2 mesh NoC topology. Seven messages, i.e. messages A, B, C, D, E, F and G, are routed in the network and share the communication resources. In general, flits belonging to the same wormhole packet always have the same local ID-tag on each NoC link. The ID-tag is updated every time a wormhole packet is switched out to an outgoing link, where its header flit reserves a new local ID-tag from the local ID slots by indexing the reserved ID with its old (current) ID-tag and the port from which it comes. By using this technique, we can guarantee that flits of different messages have different local ID-tags on each communication link, which allows a routing reservation table in the next input port to route the interleaved wormhole packets correctly based on their local ID-tags.
Message B, for instance, is injected from router node R4 with local ID-tag 1 and is accepted (ejected) at router node R3 with local ID-tag 0. During transmission over the intermediate links, its local ID-tag is updated and mapped properly to allow communication resources to be shared with flits of other messages in an interleaved manner. For example, on the link connecting the East port of R4 and the West port of R5, its local ID-tag is 2, while the other messages on that link, i.e. message A and message D, reserve local ID-tag 0 and local ID-tag 1, respectively. On the remaining intermediate downstream links, message B reserves local ID-tags 1 and 0, successively.
The bottom part of Fig. 7 shows in detail the contents (states) of the ID Slot Table at the South output port of router node R5 and of the Routing Reservation Table at the North input port of router node R2. The view of router nodes R5 and R2 is enlarged to show the routing paths of messages D, E and G in both nodes more clearly.
The detailed procedure for writing routing information into the routing reservation table has been explained in Section 4.3.2. The following Section 4.4 describes how the ID Slot Table is programmed by a header flit at runtime and how the ID-tag of a message is locally updated and managed.
4.4. Local identity-tag updating and management
Fig. 9 presents the mechanism used to update and organize the ID slot table in the MIM modules. The figure shows a packet header coming from the North port with ID-tag 0. The header flit has just been switched by the crossbar switch, and its ID-tag is updated by reserving a new ID-tag in the slot table of the MIM module. The ID update process can be described in four steps.
In the first step, the IDM detects a new incoming packet header and then looks for a free ID slot by checking the ID-state table. In this case, ID-tag 0 and ID-tag 1 have already been used by messages coming from the West and Local input ports, respectively. ID-tag 2 is detected as a free ID slot, and in the second step this ID is assigned as the new ID-tag for the new packet. In the third step, ID slot 2 is indexed based on two pieces of information, i.e. the old local ID-tag 0 and the port ("North") from which the header flit comes. Hence, when payload flits subsequently arrive from the North port with ID-tag 0, they are also assigned the new ID-tag 2.
In the fourth step, the state of ID-tag 2 is changed from "free" to "used", and the number of used IDs (UID) is incremented. When all ID slots have been used, the "empty free ID flag" is set. When the tail flit (the end of the databody) of the message with ID-tag 0 passes through the outgoing port, the state of the related ID-tag is changed from "used" back to "free", the UID is decremented and the information related to the ID number of the tail flit is deleted from the ID Slot Table.
In our current architecture with a 2D 4 × 4 mesh topology, 16 local ID slots are available on each outgoing link, of which one, i.e. ID slot 15 or "1111", is reserved for control purposes. This number equals the network size and is chosen to guard against running out of ID slots. However, if the free local ID slots do run out, a header that fails to reserve a free ID slot on a link is assigned the local ID-tag "1111". This header is then routed onward and always assigned the ID-tag "1111" until it reaches its target node. The data payload flits are dropped at the link. A higher-level protocol can be implemented in the Network-Interface, where the target node sends a status or response flit back to the source node. After receiving the status flit, the source node can decide whether or not to repeat the data transmission.
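The four update steps and the run-out fallback described above can be summarized by a small behavioral model. The sketch is written in Python and makes several assumptions that are not taken from the design: the class name `IdSlotTable`, its method names, the lowest-free-slot search order, and the old ID-tags used for the West and Local messages in the example are all hypothetical.

```python
CONTROL_ID = 15  # "1111", reserved for control purposes


class IdSlotTable:
    """Local ID slot table of one outgoing link (behavioral sketch)."""

    def __init__(self, num_slots=16):
        self.free = [True] * (num_slots - 1)  # usable IDs, control ID excluded
        self.mapping = {}                     # (from_port, old_id) -> new_id
        self.used_ids = 0

    def header(self, from_port, old_id):
        """Steps 1 to 4: find a free slot, assign, index and mark it used."""
        for new_id, is_free in enumerate(self.free):
            if is_free:
                self.free[new_id] = False
                self.mapping[(from_port, old_id)] = new_id
                self.used_ids += 1
                return new_id
        return CONTROL_ID                     # no free slot left on this link

    def payload(self, from_port, old_id):
        """Return the new ID-tag, or None if the flit has to be dropped."""
        return self.mapping.get((from_port, old_id))

    def tail(self, from_port, old_id):
        """Forward the tail flit and release the slot it was using."""
        new_id = self.mapping.pop((from_port, old_id), None)
        if new_id is not None:
            self.free[new_id] = True
            self.used_ids -= 1
        return new_id


table = IdSlotTable()
table.header("West", 3)            # occupies ID slot 0 (old ID is assumed)
table.header("Local", 0)           # occupies ID slot 1 (old ID is assumed)
print(table.header("North", 0))    # 2, the first free slot
print(table.payload("North", 0))   # 2, payload follows its header
print(table.used_ids)              # 3
print(table.tail("North", 0))      # 2, slot 2 becomes free again
```

Indexing the slot by the pair (input port, old ID-tag) is what keeps two messages apart even when they arrive with the same old ID-tag from different ports.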
In order to avoid packet dropping, the number of ID slots available on a communication link must be determined such that it is sufficient to cover the number of traffic flows that may use the communication link. Theoretically, if there are NPE processing elements (PEs), then the number of ID slots per link NSlot must be set equal to NPE to avoid packet dropping. In practice, however, when minimal routing is applied, not every link will be acquired by packets from all PEs. Furthermore, if the NoC is applied to an embedded multiprocessor running a limited number of applications, 16 local ID slots are enough to cover the traffic in a NoC with 16 compute cores. Indeed, the traffic in embedded SoC applications is predictable. Hence, we can run an application mapping program at compile time, whose mapping result balances the traffic and makes sure that no link is loaded with more traffic flows than the available local ID slots (15 slots in our current implementation).
4.5. Link level overflow control and flit-by-flit arbitration
Data flows in our NoC are controlled using a link-level flit flow control. The data flows are controlled at link level and at flit level because we use a flit-by-flit rotating arbitration to switch and schedule data flows between contending flits requesting the same outgoing link. Fig. 10 shows the timing diagram of the inter-switch data flow when no contention occurs. Flits of message A are switched from the West input link to the East output link of a router. A data flit is transmitted on the input link every two cycles. Every flit is then stored in the queue and switched to the outgoing link through two phases, i.e. the routing-request phase and the routing-acknowledge phase. Phases 2 and 3 shown in Fig. 10 present the request and the acknowledge (grant) phases for Flit A1, while phases 4 and 5 present the request and the grant phases for Flit A2. The data path signal flow of each flit is pipelined synchronously, while the control path signal flow proceeds in two-cycle phases. Although the flit flow in the router is delayed by two cycles, the flow rate of the flits on the input link equals the flow rate on the output link as long as the flits are transmitted on the link at 0.5 flit per cycle or slower.
Fig. 10. Timing diagram (without contention) of the data switching and control paths.
Fig. 11a shows four snapshots of the link bandwidth sharing situation as well as the local ID slot reservation when messages are injected such that the total bandwidth requirement does not exceed the link bandwidth capacity. As presented in the brackets, messages A, B, C and D are injected into the NoC with local ID-tag 0, each with 20% of the maximum bandwidth capacity of the NoC link, resulting in a total of 80% of the maximum link capacity when they share a link at the North output port of node 4, as presented in the figure. In this situation, the NoC is not saturated.
If a link is used by several messages whose total expected bandwidth exceeds the maximum capacity of the link, then the message flows are blocked for a while because of flit contention. This blocking situation is acceptable in our NoC. The flow of data flits on the congested link is constant at its maximum rate, so the contending flits must share this maximum rate. Therefore, the flow rates of the contending flits are lower than their expected rates. While the injection rates at their source nodes remain at the expected rates, which are larger than the actual rates on the congested link, the NoC becomes saturated. The Network-Interface (NI) at a source node then stops injecting new flits when the queue at the Local input port is full. Thanks to the flit interleaving and link sharing capability, the data flows are not blocked permanently. After a few cycles, there is free space again in the queue and the NI can inject the next flit. So, in the steady state, the actual injection rate at the source node follows the actual acceptance rate at the target node for each communication edge in the NoC.
Fig. 11. Four snapshots of link bandwidth sharing situations. The values in the brackets represent the actual percentage message bandwidth over the maximum NoC link capacity and the reserved ID slot (% of Max BW: local ID slot). (a) When messages are injected such that total bandwidth requirement does not exceed the link bandwidth capacity. (b) When messages are injected with maximum injection rates.
Fig. 11b shows another four successive snapshots of the actual bandwidth consumption and local ID slot reservation, where the messages are expected to be injected so as to consume 100% of the maximum link bandwidth capacity. Initially, all messages utilize the maximum link bandwidth capacity. Afterwards, when they start sharing a link, their data rates are automatically reduced such that the total actual bandwidth of all messages equals 100%. In this situation, the NoC is saturated. Because our NoC is equipped with link-level flit flow control, no flit is dropped due to congestion. The congestion state propagates back to the injection nodes such that the injection rates of all messages are reduced dynamically, following their steady data rates at the congested nodes, as presented in each snapshot in Fig. 11b. This phenomenon will also be shown later by observing the actual injection and acceptance rates in the simulation results.
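The automatic rate reduction can be illustrated with a simple steady-state model. The Python sketch below rests on assumptions that are not taken from the figure: messages are assumed to join one shared path one after another, each merge point is assumed to split the link capacity max-min fairly between the upstream port and the newly joining port, and the function names `max_min_share` and `merge_chain` are hypothetical.

```python
def max_min_share(demands, capacity):
    """Max-min fair split of `capacity` among the given port demands."""
    grants = [0.0] * len(demands)
    remaining = capacity
    pending = sorted(range(len(demands)), key=lambda i: demands[i])
    while pending:
        fair = remaining / len(pending)   # equal share of what is left
        i = pending.pop(0)                # smallest demand first
        grants[i] = min(demands[i], fair)
        remaining -= grants[i]
    return grants


def merge_chain(requests, capacity=100.0):
    """Steady-state rates (in % of link capacity) along a chain of merges.

    `requests` lists (message, requested %) in the order in which the
    messages join the shared path.
    """
    rates = {}
    for name, requested in requests:
        upstream = sum(rates.values())
        up_grant, local_grant = max_min_share([upstream, requested], capacity)
        scale = up_grant / upstream if upstream else 0.0
        rates = {m: r * scale for m, r in rates.items()}  # throttle upstream
        rates[name] = local_grant
    return rates


print(merge_chain([("A", 20), ("B", 20), ("C", 20), ("D", 20)]))
# {'A': 20.0, 'B': 20.0, 'C': 20.0, 'D': 20.0}
print(merge_chain([("A", 100), ("B", 100), ("C", 100), ("D", 100)]))
# {'A': 12.5, 'B': 12.5, 'C': 25.0, 'D': 50.0}
```

In the first call the total demand is 80% and nobody is throttled; in the second call every merge halves the rate of the traffic arriving from upstream, so the total on the last link is exactly 100%.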
Fig. 12 presents another timing diagram of the data flow, this time when contention occurs. The figure presents the flit flows of message A transmitted from the West input link and message B injected from the Local input port. They compete to acquire the same outgoing link (the East output link). We assume that each input port has a two-depth FIFO queue. The figure also presents the queue registers (R0 and R1) of each input port to show the contents and the full flag states of the queues during contention. Message A is transmitted on the West input link every two cycles, while message B is injected every cycle (at the maximum injection rate). Because of contention, the FIFO queues become full (the full flag is set) in some cycles. In the figure, the full flag of the Local FIFO queue is set in phases 3, 4 and 5, because the registers R0 and R1 of the Local FIFO queue are used to store Flit B2 and Flit B3. Flit B1 itself, held in the data buffer of the Routing Engine (RE), must wait for a few cycles because the arbiter at the East outgoing port has selected the flit (Flit A1) from the West input port in phase 3 as the winner to access the outgoing port. In phase 5, the arbiter selects the flit (Flit B1) from the Local input port to be switched to the East output link in the next cycle. Hence, in phase 6, Flit B1 is switched out to the East output link, and the full flag of the Local FIFO queue is reset. In the next cycle (phase 7), Flit B4, which has been waiting at the Local input port, can then be stored in the FIFO queue.
In general, the interesting characteristic of the link-level data flit flow control can be seen by observing the data flow on the West input link, at the Local input port and on the East outgoing link. On the East outgoing link, the flit flow rate is about 0.5 flit per cycle (fpc), or one flit every two clock cycles. We can also see that the arbiter unit performs a flit-by-flit rotating arbitration. Because the East outgoing link is shared in a fair manner by the flits of message A and message B, the flit flow rates of both messages at the West and the Local input ports become slower. They share the maximum bandwidth capacity (0.5 fpc) of the shared outgoing link, i.e. 0.25 fpc (half of the maximum capacity) for each message, or one flit transmitted every four cycles on both the West input link and the Local input port.
Fig. 12. Timing diagram (with contention) of the data switching and control paths.
The congestion on the West input link presented in Fig. 12 affects the flow rate of message A on the upstream links in successive clock cycles. The congestion soon reaches the source node from which message A is injected. The injection rate reduction experienced by message B therefore also occurs at the source node of message A. Globally, the injection rates of message A and message B at their source nodes become equal to their acceptance rates at their destination nodes, i.e. 0.25 flit per cycle, if we assume that there is no other traffic in the NoC.
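The 0.25 fpc figure can be checked with a toy simulation of the rotating arbiter. The Python sketch below is a deliberately simplified model and not the router's control logic: it assumes that both input ports always have a flit waiting, that the outgoing link accepts one flit every two cycles (the two-cycle request/grant phases), and the function name `simulate` is hypothetical.

```python
def simulate(cycles, ports=("West", "Local")):
    """Flit-by-flit rotating arbitration on one saturated outgoing link."""
    sent = {port: 0 for port in ports}
    last = len(ports) - 1            # index of the port served most recently
    for cycle in range(cycles):
        if cycle % 2:                # request phase, nothing leaves the link
            continue
        last = (last + 1) % len(ports)   # rotate to the next contender
        sent[ports[last]] += 1           # grant phase: one flit is switched
    return {port: count / cycles for port, count in sent.items()}


print(simulate(1000))  # {'West': 0.25, 'Local': 0.25}
```

With three or more permanently contending ports, the same loop yields 0.5 fpc divided by the number of ports for each of them.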
The same mechanism also applies on the input side of the West input link. With the flit flow regulation mechanism described above, the data flit flows are controlled automatically at link level. This mechanism not only makes it possible to reduce the buffer sizes of the FIFO queues but also avoids data drops in the NoC. Data dropping in the context of NoC-based multiprocessor computation can degrade the application performance.
5. CMOS prototyping
We have synthesized our XHiNoC router prototypes using a 130-nm CMOS standard-cell library from Faraday Technology Corporation. The router is targeted to work at a data frequency of about 1.1 GHz (a 0.9 ns clock period). Table 1 presents the synthesis results for the NoC prototypes with static XY and minimal adaptive routing algorithms. Table 1 also shows the area of the multiplexor with ID-management unit (MIM). The area of the MIM component is about three times the area of the 2-depth FIFO. Theoretically, if we wanted to interleave 15 messages using the VC approach, where a FIFO buffer is not shared by different messages, then 15 virtual channels would have to be implemented on each input or output port, together with 15 VC controllers and 15 virtual channel IDs on both input and output ports. Hence, our proposed architecture with the local ID-management unit is expected to be much more efficient than the VC approach.
Table 1
Synthesis results of the router with flit-level interleaved wormhole switching method using 130-nm CMOS technology with targeted working frequency of about 1.1 GHz (0.9 ns clock period).
NoC router component      With West-First (WF) routing (mm²)   With XY routing (mm²)
Total 5 FIFO buffers      0.016837                             0.016909
  % of total cell area    14.87                                16.0
Total 5 Arbiters          0.008071                             0.007489
  % of total cell area    7.13                                 7.0
Total 5 MIMs              0.056472                             0.056224
  % of total cell area    49.88                                53.0
Total 5 REBs              0.031834                             0.025810
  % of total cell area    28.12                                24.0
Total cell area           0.113214                             0.105877
Using the 130-nm CMOS standard-cell library, the total logic cell area of our NoC router with the XY routing algorithm is about 0.106 mm² (32-bit data + seven control bits). The net switching power and the cell internal power of the NoC router are about 15.99 mW and 44.25 mW, respectively. We compare the area of our NoC with other recently developed NoCs, although we realize that the reported logic areas may not be fairly comparable because of differences in design parameters such as buffer sizes, the use of virtual channels (VCs) and the width of the data word. Nevertheless, we can at least see the impact of the wormhole cut-through switching method, which does not need virtual channels and can reduce the depth of the FIFO buffer to two registers. Hence, the area cost of our on-chip router can be reduced. The comparisons are as follows.
Table 2 presents the impact of increasing the number of ID slots on the logic areas of the overall NoC router and of the REB and MIM modules. Table 3 presents the impact of increasing the FIFO buffer size on the logic areas of the overall NoC router and of the FIFO buffer. If we assume that each VC consists of two buffer slots, increasing our NoC buffer from 2 to 8 slots is comparable to adding 3 VCs. As shown in Table 3, the NoC area increases by 49% when the buffer size is increased from 2 to 8 slots. As shown in Table 2, the NoC area increases by 65% when the number of ID slots is increased from 16 to 32. The number of VCs (NVC) in a traditional wormhole router is comparable to the number of ID slots (NSlot) in our wormhole cut-through router, because both NoCs are able to mix NVC or NSlot different wormhole messages on the same link. Based on this indirect comparison, adding one VC has a more significant impact on the logic area than adding one ID slot.
Table 2
Gate-level synthesis (130-nm CMOS technology) for different number of available ID slots per link (2-depth FIFO buffer).
Num. of ID slots per link      8        16       32
Total logic cell area (mm²)    0.074    0.106    0.175
REB cell area (mm²)            0.0184   0.0258   0.0419
  % of total logic cell area   24.9     24.0     23.9
MIM cell area (mm²)            0.0314   0.0562   0.1084
  % of total logic cell area   42.7     53.0     62.0
Table 3
Gate-level synthesis (130-nm CMOS technology) (16 ID slots per link) for different FIFO buffer sizes (queue depth).
FIFO depth                     2        4        8
Total logic cell area (mm²)    0.102    0.119    0.152
FIFO cell area (mm²)           0.0164   0.0326   0.0656
  % of total logic cell area   16       28       43
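The indirect comparison drawn from Tables 2 and 3 can be reproduced with a few lines of arithmetic. The Python sketch below uses only the total logic cell areas quoted in the two tables; the per-VC and per-slot averages are a simple linear reading of those numbers (3 additional VCs versus 16 additional ID slots) and the helper name `increase_percent` is hypothetical.

```python
def increase_percent(before, after):
    """Relative area increase in percent."""
    return (after - before) / before * 100


fifo_growth = increase_percent(0.102, 0.152)  # FIFO depth 2 -> 8 (Table 3)
slot_growth = increase_percent(0.106, 0.175)  # 16 -> 32 ID slots (Table 2)

per_vc = fifo_growth / 3     # 6 extra buffer slots counted as 3 two-slot VCs
per_slot = slot_growth / 16  # 16 extra ID slots

print(round(fifo_growth), round(slot_growth))  # 49 65
print(round(per_vc, 1), round(per_slot, 1))    # 16.3 4.1
```

Under this reading, one additional VC costs roughly four times as much area as one additional ID slot.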
The TRIPS NoC [15], which uses VCs, contains two data networks, the OPN and the OCN, in which the logic areas of the OCN and OPN routers are 1.10 mm² and 0.43 mm², respectively, using 130-nm technology. The TRIPS NoC clock frequency is about 366 MHz. The Xpipes NoC [5] (which does not state specifically whether VCs are used) has a 0.19 mm² logic area at 800 MHz with a 4-IO-port, 64-bit-flit router implementation in 130-nm technology. The larger area of the TRIPS NoC is due to the use of virtual channels: four virtual channels per input port are implemented in the TRIPS router. The depth of the FIFO buffer in each channel is two flits.
The Teraflops [20] NoC router, which uses a double-pumped crossbar switch to reduce the routing area, has a compact 0.34 mm² router area in 65-nm technology (32-bit data + 6 control bits). For voltage levels ranging from 0.75 V to 1.2 V, the Teraflops NoC router can be clocked from 1.7 GHz to 5.1 GHz, resulting in power consumption ranging from 98 mW to 924 mW, respectively. The SCC [21] NoC router synthesized with a 65-nm standard-cell library occupies 0.097 mm² (32-bit flow control digits/flits) at a 250 MHz working frequency. The Teraflops NoC and the SCC NoC do not use VCs. Specifically, the SCC NoC does not use VCs because they increase the total buffer count and result in power consumption that would exceed the SCC NoC's target constraints [21] for a specific embedded application. In the Teraflops NoC [20], 22% of the total power dissipation is in the FIFO queues and data path at 4 GHz, 1.2 V operation. With the same motivation, we also do not implement virtual channels in our NoC, both to save area and power dissipation and to characterize specifically how our NoC can solve the head-of-line blocking problem without the use of VCs. Table 4 summarizes the comparison of our NoC with other NoC prototypes.
Compared with our previous NoC architecture [37], our current NoC CMOS implementation has a cell area overhead of about 10%, because in the current architecture we add a new buffer in the new routing pipeline stage and increase the number of available ID slots from 8 to 16. In the data output stage, we merge the ID-management unit stage into the crossbar data multiplexing stage. As a result, the maximum data frequency of the current VLSI architecture increases from 472 MHz to 1.1 GHz (a factor of about 2.3, i.e. a speed-up of more than 100%). In the current VLSI architecture, the critical path of our previous on-chip router, which was located in the routing stage, has been cut to increase the maximum data frequency of the on-chip router. The critical path in the current architecture is found in the Multiplexor with ID-Management Unit (MIM component).
6. Experimental results
In this experiment, we run simulations to compare the latencies and bandwidths of our XHiNoC prototypes when static XY routing and minimal adaptive West-First (WF) routing algorithms are used. In our flexible NoC architecture, the routing function can easily be reconfigured at design time by exchanging the routing state machine unit in the routing engine module. The objective of this simulation experiment is not to judge the advantages and disadvantages of the static XY and minimal adaptive WF routing algorithms. The main objective is to show how traffic in the router can share each communication link in the NoC under saturated and non-saturated conditions, where flits of different messages can be interleaved at flit level using the wormhole cut-through switching method.
The testbench programs are written in VHDL and perform a cycle-accurate RTL simulation. To perform the simulation, the following aspects are taken into account.
Table 4
NoC prototypes.
NoC name    Techn. size (nm)   Switch area (mm²)   Data freq. (GHz)
TRIPS       130                1.100 (OCN)         0.366
                               1.430 (OPN)         0.366
Xpipes      130                0.190               0.800
Teraflops   65                 0.340               1.700 (0.75 V)
                                                   5.100 (1.20 V)
SCC         65                 0.097               0.250
XHiNoC      130                0.106               1.100 (1.32 V)
Fig. 13. Latency and bandwidth measurements in bit complement traffic scenario. Panels (a) and (c): tail flit latency (clock cycles) and measured bandwidth (MB/s) versus workload (number of injected flits), for XY and WF routing at IR = 0.500, 0.200 and 0.125 fpc.
• On each network node, we implement a traffic pattern generator (TPG) and a traffic response evaluator (TRE).
• Each message is encoded by the TPG unit such that every message can be differentiated from other messages.
• Each flit of a message is numbered in order by the TPG unit. Thus, it is easy for the TRE unit to check whether any flits are lost in the NoC or are not accepted at the destination node.
• Each TRE unit at a destination node checks the header of a packet and analyzes whether the accepted packet is correct (i.e. the packet has reached its destination node correctly). The TRE unit also counts how many clock cycles the header needs to reach the destination node. In this simulation, we interpret the latency metric as the number of clock cycles needed to transfer a flit from its source to its destination node.
• For each accepted flit, the TRE unit checks, one by one, the order and the packet code of every accepted flit. 10,000 flits, equivalent to 10,000 × 4 Byte = 40 kBytes of packets, are injected from every TPG unit at each injector (data producer) node.
• In the experiment, the TRE unit writes the simulation results into an output text file. The TRE unit reports how many clock cycles are required to transfer the 500th, 1000th, 2000th, 3000th, 4000th, 5000th, 6000th, 7000th, 8000th, 9000th and 10,000th flits for each communication pair (source–destination pair).
• The TRE units also report the communication bandwidth in megabytes per second (MB/s) for each communication pair (source–destination pair).
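The checks performed by the TRE unit can be pictured with a short software model. The actual testbench is written in VHDL; the Python sketch below is only an illustration with a hypothetical interface (the class name, the `accept` method and its arguments are assumptions), and it assumes that flit numbering starts at 1.

```python
class TrafficResponseEvaluator:
    """Checks flit order per message and logs checkpoint cycle counts."""

    def __init__(self, checkpoints=(500, 1000, 2000, 3000, 4000, 5000,
                                    6000, 7000, 8000, 9000, 10000)):
        self.checkpoints = set(checkpoints)
        self.expected = {}   # message code -> next expected flit number
        self.errors = []     # (message code, expected number, received number)
        self.report = {}     # (message code, flit number) -> clock cycle

    def accept(self, code, number, cycle):
        """Register one accepted flit at the given clock cycle."""
        expected = self.expected.get(code, 1)
        if number != expected:
            # A gap or reordering means a flit was lost or misrouted.
            self.errors.append((code, expected, number))
        self.expected[code] = number + 1
        if number in self.checkpoints:
            self.report[(code, number)] = cycle


tre = TrafficResponseEvaluator(checkpoints=(2, 4))
for number, cycle in [(1, 10), (2, 14), (4, 22)]:   # flit 3 never arrives
    tre.accept("A", number, cycle)
print(tre.errors)   # [('A', 3, 4)]
print(tre.report)   # {('A', 2): 14, ('A', 4): 22}
```

Because the flits of a message are numbered in order at the source, a single counter per message code is enough to detect a missing or out-of-order flit.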
In this experiment, although the current NoC prototype can be
clocked at a data frequency of 1.1 GHz, we clock the NoC at 1 GHz.
Hence, the maximum bandwidth capacity of each communication
link is $B^{(1\,\mathrm{GHz})}_{\max\,\ell} = 4\ \mathrm{Byte} \times 1\ \mathrm{GHz} \times \frac{1}{2} = 2\ \mathrm{GB/s}$, or 2000 MB/s.
Expressed in flits per cycle (fpc), the maximum data rate of every
link is 0.5 fpc; thus 0.5 fpc corresponds to 2000 MB/s. A data rate $R$
in fpc therefore corresponds to a bandwidth of $B = 4000 \times R$ MB/s.
If we clocked our NoC at the maximum data frequency, the
maximum bandwidth capacity of each communication link would be
$B^{(1.1\,\mathrm{GHz})}_{\max\,\ell} = 4\ \mathrm{Byte} \times 1.1\ \mathrm{GHz} \times \frac{1}{2} = 2.2\ \mathrm{GB/s}$, or 2200 MB/s,
i.e. the maximum capacity of the NoC link would increase by 10%. Because
our NoC router in the standard mesh topology with five IO ports can
perform five simultaneous parallel intra-IO-port interconnects,
the maximum capacity of the router is
$B^{(1\,\mathrm{GHz})}_{\max\,R} = 5 \times 2000\ \mathrm{MB/s} = 10\ \mathrm{GB/s}$.
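The conversion between flits per cycle and MB/s used in this section can be sketched as follows. This is only a minimal illustration, assuming the 4-byte flit, the 0.5 fpc link limit and the 5-port router quoted above; the function and constant names are hypothetical and are not part of the XHiNoC tools.

```python
# Hypothetical helper names; the constants follow the figures quoted in the text.
FLIT_BYTES = 4            # bytes transferred per flit (4 Byte in the text)
MAX_LINK_RATE_FPC = 0.5   # maximum data rate of a link in flits per cycle


def fpc_to_mbps(rate_fpc, clock_mhz=1000.0):
    """Convert a data rate in flits per cycle (fpc) into MB/s."""
    return FLIT_BYTES * clock_mhz * rate_fpc


def link_capacity_mbps(clock_mhz=1000.0):
    """Maximum bandwidth of one link at the given clock frequency."""
    return fpc_to_mbps(MAX_LINK_RATE_FPC, clock_mhz)


def router_capacity_mbps(ports=5, clock_mhz=1000.0):
    """Aggregate capacity when all ports forward flits in parallel."""
    return ports * link_capacity_mbps(clock_mhz)


if __name__ == "__main__":
    for rate in (0.5, 0.2, 0.125):
        print(f"{rate} fpc -> {fpc_to_mbps(rate):.2f} MB/s at 1 GHz")
    print(f"link at 1.1 GHz: {link_capacity_mbps(1100.0):.2f} MB/s")
    print(f"router at 1 GHz: {router_capacity_mbps():.2f} MB/s")
```

With the default 1 GHz clock, `fpc_to_mbps` reduces to the relation B = 4000 × R MB/s given above.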
6.1. Bit complement trafﬁc scenario
[Fig. 13, panels (b) and (d), bit complement traffic scenario: average latency (clock cycles) and average actual bandwidth (MB/s) versus requested bandwidth (MB/s), for XY and WF routing with 10,000 injected flits.]
This subsection presents the performance of our XHiNoC under
the bit complement traffic pattern on a 4 × 4 planar mesh
topology, in which a message injected from a source node with a
given binary address is accepted at the target node whose address
is the bit complement of the binary source address. For example,
if a packet is injected from node (1,3), whose binary address is
(01,11), then the packet is accepted at node (10,00) in binary code
or node (2,0) in decimal code. In the 2D 4 × 4 mesh with
bit complement traffic, there are 16 communication pairs
(each of the 16 nodes acts as injector and as acceptor at the same time).
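The bit complement mapping described above can be sketched in a few lines. This is a minimal illustration, assuming two address bits per mesh dimension (4 × 4 mesh); the function names are hypothetical.

```python
MESH_BITS = 2  # two address bits per dimension for a 4 x 4 mesh


def bit_complement_target(x, y, bits=MESH_BITS):
    """Return the target node whose address is the bitwise complement of (x, y)."""
    mask = (1 << bits) - 1          # 0b11 for a 4 x 4 mesh
    return (~x & mask, ~y & mask)   # invert each coordinate within 'bits' bits


def bit_complement_pairs(bits=MESH_BITS):
    """List all (source, target) communication pairs of the traffic pattern."""
    size = 1 << bits
    return [((x, y), bit_complement_target(x, y, bits))
            for x in range(size) for y in range(size)]


if __name__ == "__main__":
    print(bit_complement_target(1, 3))   # node (01,11) is mapped to (10,00)
    print(len(bit_complement_pairs()))   # number of communication pairs
```

Every node is both a source and a target, and no node sends to itself, which is consistent with the 16 communication pairs counted above.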
Figs. 13–15 present the distinctive behavior of the XHiNoC in
response to the bit complement traffic pattern under saturated and
non-saturated conditions. Fig. 13a shows the measured average
latency to transfer the tail flit (end-of-payload flit) from
source to target node for different total numbers of injected
flits per data producer node and different injection rates (IR) in
flits per cycle (fpc). The average tail flit latency is
$d_{avg} = \frac{1}{16}\sum_{k=1}^{16} d_k$,
where $d_k$ is the latency of communication pair $k$ in the bit
complement traffic scenario.
Fig. 13c shows the measured average bandwidth for different
total numbers of injected flits per message. The average
bandwidth is $B_{avg} = \frac{1}{16}\sum_{k=1}^{16} B_k$,
where $B_k$ is the actual/measured bandwidth of communication
pair $k$ in the bit complement traffic scenario. For a given
injection rate, the average actual/measured bandwidth remains
constant although the communication volume changes from 500 to
10,000 flits per data producer node. The average transfer latency
also grows linearly as the total number of injected flits
increases in this scenario, even when the NoC is saturated. This
behavior is unique compared with the traditional wormhole
switching method; it results from the link sharing and flit
interleaving capability as well as from the mechanism that
dynamically controls the injection rate when the NoC is saturated.
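The two averages, and the reason a constant bandwidth implies a linearly growing latency, can be illustrated with a short sketch. The averaging follows the definitions of $d_{avg}$ and $B_{avg}$ above; the latency estimate is only a first-order assumption added here for illustration (it ignores the head-flit pipeline delay and is not a formula from the measurements), and all names are hypothetical.

```python
def mean_over_pairs(values):
    """Average a per-pair quantity (d_k or B_k) over all communication pairs."""
    values = list(values)
    return sum(values) / len(values)


def estimated_tail_latency_cycles(num_flits, accepted_rate_fpc):
    """First-order estimate (assumption): if flits are accepted at a constant
    rate R in fpc, the tail flit of an N-flit message arrives after roughly
    N / R clock cycles, so the latency grows linearly with N."""
    return num_flits / accepted_rate_fpc


if __name__ == "__main__":
    # Doubling the workload doubles the estimate when the rate stays constant.
    for n in (500, 5000, 10000):
        print(n, estimated_tail_latency_cycles(n, 0.25))
```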
Fig. 14. Measurement of the actual injection and acception rate at two selected communication pairs using static X-First routing. [Panels (a) and (c), Comm 1: data rate (fpc) versus clock cycles, showing the injection rate setpoint, the measured injection rate and the measured acception rate.]
Fig. 13b and d show the average latency and the average actual
bandwidth, respectively, for different requested bandwidth rates
when 10,000 flits per message are injected from every
data producer node. With static X-First routing, the latency
starts to saturate when the requested bandwidth rate
reaches 1000 MB/s (0.25 fpc), whereas with
adaptive West-First routing, the latency starts to saturate
when the requested bandwidth rate reaches
666.67 MB/s (0.1667 fpc). Because of the mechanism that
dynamically controls the injection rates, the expected/requested
bandwidth rate (the injection rate setpoint) differs from the
actual/measured injection rate. The former is assumed constant,
while the latter changes in accordance with the NoC saturation
condition. Therefore, under saturation, the injection rate
at a source node as well as the acception rate at its target node
change dynamically until they settle at a certain stable rate or
swing around a fixed acceptable rate.
Fig. 14a and b present the measured transient response
of the injection and acception rates of two selected
communication pairs, Com1 and Com2 respectively, using the
static X-First routing algorithm. Com1 is the communication edge
from node (0,0) to node (3,3), while Com2 is the communication
edge from node (2,3) to node (1,0). As presented in the figure,
the injection setpoint is 0.2 fpc, equal to 800 MB/s. If we look
again at the NoC latency and bandwidth behavior over different
requested bandwidth rates depicted in Fig. 13b and d, we can see
that the NoC is not yet saturated when messages are injected at a
bandwidth rate of 800 MB/s using static X-First routing. Hence, the
injection and acception rates simply follow the injection rate
setpoint. Meanwhile, if the messages are injected at 0.333 fpc,
equal to 1333.33 MB/s, then according to Fig. 13b this data rate
puts the NoC into saturation when static X-First routing is used.
Therefore, as presented in Fig. 14c and d, the injection and
acception rates of Com1 and Com2 are stable at the 0.25 fpc point,
which is lower than the requested injection rate setpoint.
[Fig. 14, panels (b) and (d), Comm 2 using static X-First routing: data rate (fpc) versus clock cycles, showing the injection rate setpoint, the measured injection rate and the measured acception rate.]
Fig. 15. Measurement of the actual injection and acception rate at two selected communication pairs using minimal adaptive West-First routing. [Panels (a) Comm 1, (b) Comm 2, (c) Comm 1, (d) Comm 2: data rate (fpc) versus clock cycles, showing the injection rate setpoint, the measured injection rate and the measured acception rate.]
Fig. 16. Latency and bandwidth measurements in hotspot traffic scenario. [Panels: (a) tail flit latency (clock cycles) and (c) measured bandwidth (MB/s) versus workload (number of injected flits), for XY and WF routing at IR = 0.100, 0.050 and 0.040 fpc; (b) average latency (clock cycles) versus requested bandwidth (MB/s), for XY and WF routing with 10,000 injected flits; (d) average actual bandwidth (MB/s) versus requested bandwidth (MB/s), for XY and WF routing with 10,000 and 500 injected flits.]
Fig. 15a and b present the same non-saturated condition
when using the adaptive West-First routing algorithm. As
presented in the figure, the requested injection rate setpoint is
0.125 fpc, or 500 MB/s. In accordance with Fig. 13b and d, at a
requested bandwidth rate of 500 MB/s the NoC is not yet saturated
when using adaptive West-First routing. Hence, both the injection
and acception rates of Com1 and Com2 are stable at 0.125 fpc.
However, if the requested communication rate setpoint is 0.2 fpc,
or 800 MB/s, as presented in Fig. 15c and d, then the NoC is saturated.
Therefore, the average injection and acception rates are lower
than the requested communication bandwidth. During the initial
clock cycles the injection rate follows the requested injection
rate, but eventually the injection rate at the source node follows
the actual measured acception rate at the target node and swings
around a fixed average rate.
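The way a blocked source ends up following the acception rate can be illustrated with a toy model. The sketch below is not the XHiNoC router or its link-level flow control: it assumes a single bounded FIFO between one source and one target, a hypothetical buffer depth of 4 flits, and a sustainable delivery rate of 0.25 fpc taken from the X-First measurements discussed above. All names are illustrative.

```python
from fractions import Fraction


def simulate_backpressure(setpoint, sustainable, cycles=12000, depth=4):
    """Toy model: return (measured injection rate, measured acception rate) in fpc.

    setpoint    -- requested injection rate of the source (fpc)
    sustainable -- rate at which the network can deliver flits (fpc)
    depth       -- hypothetical buffer capacity between source and target
    """
    inj_credit = Fraction(0)   # fraction of a flit the source is allowed to inject
    acc_credit = Fraction(0)   # fraction of a flit the network is able to deliver
    occupancy = 0              # flits currently buffered
    injected = accepted = 0
    for _ in range(cycles):
        # Target side: deliver one buffered flit once a full credit is available.
        acc_credit = min(acc_credit + sustainable, Fraction(1))
        if acc_credit >= 1 and occupancy > 0:
            occupancy -= 1
            accepted += 1
            acc_credit -= 1
        # Source side: inject at the setpoint, but stall while the buffer is full.
        inj_credit = min(inj_credit + setpoint, Fraction(1))
        if inj_credit >= 1 and occupancy < depth:
            occupancy += 1
            injected += 1
            inj_credit -= 1
    return injected / cycles, accepted / cycles


if __name__ == "__main__":
    # Setpoint below the sustainable rate: the source is never stalled.
    print(simulate_backpressure(Fraction(1, 5), Fraction(1, 4)))
    # Setpoint above the sustainable rate: the source is throttled by the full buffer.
    print(simulate_backpressure(Fraction(1, 3), Fraction(1, 4)))
```

Because flits are never dropped and the buffer is bounded, the injected and accepted flit counts can differ by at most the buffer depth, so under saturation the long-run injection rate in this model converges to the acception rate rather than to the setpoint.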
6.2. Hotspot trafﬁc scenario
In this subsection, the NoC performance is simulated under the
hotspot traffic pattern, in which all nodes send messages to a single
hotspot node, i.e. node (3,3). This node thus receives all messages
from the other 15 source nodes, so there are 15 communication
pairs in this scenario.
Fig. 16a and c show the NoC average latency and bandwidth
behavior for variable numbers of injected flits per data producer
node and for different injection rates. If the messages at each
source node are injected at a lower injection rate, the NoC
is not saturated. In non-saturated conditions, the performance
of the NoC with the static X-First and the adaptive West-First
routing algorithms is similar.
Fig. 17. Measurement of the actual injection and acception rate at two selected communication pairs using static X-First and minimal adaptive West-First routing. [Panels (a) Comm 1 (XY routing) and (c) Comm 1 (WF routing): data rate (fpc) versus clock cycles, showing the injection rate setpoint, the measured injection rate and the measured acception rate.]
Fig. 16b and d present the NoC average latency and bandwidth
responses over different requested injection or bandwidth rates.
Fig. 16d shows that the NoC characteristic differs when
the total number of injected flits per source node is different. We
select, for instance, totals of 500 and 10,000 flits per data
producer node. If the number of injected flits is 500, the
average bandwidth starts to saturate at a requested bandwidth
rate of 133.33 MB/s. We can observe that 15 messages share the
local output port of the target node (3,3). Because the maximum
capacity of the outgoing port is 2000 MB/s, the average
actual/measured bandwidth is $\frac{2000}{15} = 133.33$ MB/s.
However, if the number of injected flits is 10,000
per producer node, the saturation point moves to a higher
rate. Based on observation in the cycle-accurate RTL simulation,
the last flits of some producer nodes located near the target
node (3,3) are accepted early, while the other producer nodes far
from the target node (3,3) are still injecting their payload flits.
Therefore, the curves presented in Fig. 16c and d are exponentially
reduced from the 133.33 MB/s point until they reach the
bandwidth saturation point. We call the area of the exponentially
reduced curves the exponential region.
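The 133.33 MB/s figure is simply the capacity of the hotspot's local output port divided equally among the messages sharing it. A minimal sketch of that arithmetic follows, assuming the 2000 MB/s port capacity and the 4-byte flit at 1 GHz used in this section; the helper names are hypothetical.

```python
PORT_CAPACITY_MBPS = 2000.0  # capacity of the local output port at 1 GHz
FLIT_BYTES = 4               # bytes per flit
CLOCK_MHZ = 1000.0           # clock frequency used in the experiments


def fair_share_mbps(num_messages, capacity_mbps=PORT_CAPACITY_MBPS):
    """Bandwidth per message when the port is shared equally."""
    return capacity_mbps / num_messages


def mbps_to_fpc(bandwidth_mbps):
    """Convert MB/s back into flits per cycle at the assumed clock."""
    return bandwidth_mbps / (FLIT_BYTES * CLOCK_MHZ)


if __name__ == "__main__":
    share = fair_share_mbps(15)
    print(f"{share:.2f} MB/s per message, {mbps_to_fpc(share):.4f} fpc")
```

Under these assumptions, requested rates of 0.04 fpc (160 MB/s) and 0.05 fpc (200 MB/s) both lie above the equal share of the hotspot port.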
Fig. 17a and b present the injection and acception rates of
two selected communication pairs in the hotspot traffic scenario
using the static X-First routing algorithm. Com1 is a
communication edge that transfers data from node (2,2) to node (3,3),
while Com2 is a communication edge that transfers data from
node (1,1) to node (3,3). The requested injection setpoints at
the source nodes are 0.04 fpc, equal to 160 MB/s. According to
Fig. 16b and d, the NoC is in the exponential region (not
saturated) when the requested bandwidth is 160 MB/s. Therefore, as
presented in Fig. 17a, the injection rate of Com1 can follow the
expected injection rate setpoint, while its acception rate at the
target node swings around the expected injection rate setpoint.
Fig. 17b shows the transient responses of the injection and
acception rates of Com2: the rates are reduced and swing
around certain rates lower than the expected rate.
[Fig. 17, panels (b) Comm 2 (XY routing) and (d) Comm 2 (WF routing): data rate (fpc) versus clock cycles, showing the injection rate setpoint, the measured injection rate and the measured acception rate.]
Fig. 17c and d show the other transient responses, obtained with
adaptive West-First routing, in which the expected injection rate
is 0.05 fpc, equal to 200 MB/s. We can see that the injection
and acception rates of Com1, shown in Fig. 17c, are stable
at the expected rate, while the injection and acception rates of
Com2, shown in Fig. 17d, are reduced to about 0.025 fpc.
According to Fig. 16b and d, the expected data rate of 200 MB/s
also places the NoC in the exponential region.
7. Conclusions and future works
This paper has presented the VLSI architecture of a NoC with a
specific feature: wormhole packets can be interleaved
(cut-through) at flit level in the same queue to share
communication resources with other packets. Fair utilization of
communication resources, which is effective during non-saturating
conditions, is also supported by the flit-by-flit rotating
arbitration over wormhole packets requiring the same
output channel. Although the flits of the wormhole messages are
interleaved in the same communication channel, each flit
belonging to the same message can track its routing path correctly
because of the local identity tag (ID-tag) present on each flit,
which varies over communication resources to support flexible
wire-sharing message transportation.
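The flit-by-flit rotating arbitration summarized above can be sketched as a simple round-robin selection among messages that request the same output channel. This is a behavioral illustration only, under the assumption that each flit is modeled as a tuple of a local ID-tag and a sequence number; it does not reproduce the router's hardware arbiter or its ID-tag management, and all names are hypothetical.

```python
from collections import deque


def interleave_round_robin(requests):
    """Interleave flits of several messages on one output channel.

    requests -- list of flit lists, one per input; each flit is a
                (id_tag, sequence_number) tuple in this toy model.
    Returns the order in which flits leave on the shared output.
    """
    queues = [deque(flits) for flits in requests]
    output = []
    pointer = 0  # input that has the highest priority in the next round
    while any(queues):
        for offset in range(len(queues)):
            idx = (pointer + offset) % len(queues)
            if queues[idx]:
                output.append(queues[idx].popleft())
                pointer = (idx + 1) % len(queues)  # rotate priority after a grant
                break
    return output


def flits_of(stream, id_tag):
    """Recover one message from the interleaved stream by its ID-tag."""
    return [flit for flit in stream if flit[0] == id_tag]


if __name__ == "__main__":
    msg_a = [("A", 0), ("A", 1), ("A", 2)]
    msg_b = [("B", 0), ("B", 1)]
    stream = interleave_round_robin([msg_a, msg_b])
    print(stream)
    print(flits_of(stream, "A") == msg_a)  # per-message order is preserved
```

The tag on each flit is what allows the interleaved stream to be separated again, and the rotation of the priority pointer gives each requesting message an equal turn on the channel.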
This paper has also presented the link-level flit flow control
mechanism and the unique performance characteristics (in
saturation and non-saturation) of our NoC, which uses the wormhole
cut-through switching method. The transient response behavior of
our NoC has shown how the actual measured injection rates at the
source node and acception rates at the destination node change
dynamically to move the NoC to a steady-state point. There is no
flit dropping in our NoC, since link-level flit flow control is
implemented in our NoC router. As a consequence, automatic
injection rate control at the source node is feasible, in which
the actual injection rate follows the steady point of the actual
acception rate, especially during saturation.
A few variants of implementation techniques could be used to
design switch architectures with the VC-based method. The
performance as well as the complexity of every VC-based router
are also implementation-dependent. Therefore, this paper has not
yet reported direct performance and logic complexity comparisons
with a NoC that uses traditional wormhole switching with and
without VCs. This further work remains an open challenge to be
addressed in the future.
Acknowledgements
The authors gratefully acknowledge the comments and suggestions
made by the reviewers, and the DAAD (Deutscher Akademischer
Austausch-Dienst, German Academic Exchange Service) for awarding
a DAAD scholarship to Faizal Arya Samman to pursue a doctoral
degree at Darmstadt University of Technology in Germany. The
authors would also like to thank LOEWE-Zentrum AdRIA at
Fraunhofer Institute LBF Darmstadt for further cooperation and for
the possible implementation of the concept and the switch
architecture to design adaptive multiprocessing systems within Project AdRIA
(Adaptronik-Research, Innovation, Application) funded by the Hessian Ministry of Science and Arts with Grant number III L 4 – 518/14.004 (2008).
Faizal Arya Samman was born in Makassar, Indonesia.
He received his Bachelor of Engineering degree in
Electrical Engineering from Gadjah Mada University at
Yogyakarta, Indonesia in 1999. In 2002, he received his
Master of Engineering degree with Scholarship Award
from Indonesian Ministry of National Education in
Control and Computer System Laboratory and in
Inter-University Center for Microelectronics Research,
Bandung Institute of Technology in Indonesia. In 2002,
he was appointed to be a research and teaching staff at
Hasanuddin University, in Makassar, Indonesia. From
2006 until 2010, he received scholarship award from
Deutscher Akademischer Austausch-Dienst (DAAD, German Academic Exchange
Service) to pursue the engineering doctoral degree at Darmstadt University of
Technology, in Germany. He is now working toward the postdoctoral program in
LOEWE-Zentrum AdRIA (Adaptronik-Research, Innovation, Application) within the
research cooperation framework between Darmstadt University of Technology and
Fraunhofer Institute LBF in Darmstadt. His research interests include
network-on-chip (NoC) microarchitecture, NoC-based multiprocessor
system-on-chip application mapping, programming models for multiprocessor
systems, design and implementation of analog and digital electronic circuits
for control system applications on FPGA/ASIC, as well as energy harvesting
systems and wireless sensor networks.
Thomas Hollstein graduated from Darmstadt University
of Technology in Electrical Engineering/Computer
Engineering in 1991. In 1992 he joined the research
group of the Microelectronic Systems Lab at Darmstadt
University of Technology. He worked in several research
projects in neural and fuzzy computing and industrial
VHDL based design. Since 1995 he focused his research
on hardware/software codesign and in 2000 he received
his Ph.D. on ‘‘Design and interactive Hardware/Software
Partitioning of complex heterogeneous Systems’’ at
Darmstadt University of Technology. Since 2000 he has been
working as a senior researcher, leading a research group
focusing on System-on-Chip communication architectures, the design of reconfigurable
HW/SW Systems-on-Chip and integrated SoC test and debug methodologies. His
current research interests are in the fields of Networks-on-Chip, Hardware/Software
Co-Design, Systems-on-Chip design, printable organic and inorganic electronics, and
RFID circuit and system design. Furthermore, Dr.-Ing. Hollstein gives lectures on
VLSI design and CAD methods. From 2001 until now he has been a member of a leadership
team initiating and establishing a new international master programme in
"Information & Communication Engineering" at Darmstadt University of Technology.
In 2010, he was appointed as a professor at Tallinn University of Technology in the
Department of Computer Engineering, Dependable Embedded Systems Group.
Manfred Glesner received the diploma degree and the
Ph.D. degree from Saarland University, Saarbrücken,
Germany, in 1969 and 1975, respectively. His doctoral
research was based on the application of nonlinear
optimization techniques in computer-aided design of
electronic circuits. He received three Doctor Honoris
Causa degrees from Tallinn Technical University,
Tallinn, Estonia in 1996, Poly-technical University of
Bucharest, Bucharest, Romania in 1997, and Mongolian
Technical University, Ulan Bator, Mongolia in 2006.
Between 1969 and 1971, he carried out research work in
radar signal development for the Fraunhofer Institute in
Werthoven/Bonn, Germany. From 1975 to 1981, he was a Lecturer in the areas of
electronics and CAD with Saarland University. In 1981, he was appointed as an
Associate Professor in electrical engineering with the Darmstadt University of
Technology, Darmstadt, Germany, where, in 1989, he was appointed as a Full
Professor for microelectronic system design. His current research interests include
advanced design and CAD for micro- and nanoelectronic circuits, reconﬁgurable
computing systems and architectures, organic circuit design, RFID design,
mixed-signal circuit design, and process variations robust circuit design. With the
EU-based TEMPUS initiative, he built up several microelectronic design centers in
Eastern Europe. Between 1990 and 2006, he acted as a speaker of two DFG-funded
graduate schools. Dr. Glesner is a member of several technical societies and he is
active in organizing international conferences. Since 2003, he has been the vice-
president of the German Information Technology Society (ITS) in VDE and also a
member of the DFG decision board for electronic semiconductors, components, and
integrated systems. He was a recipient of the honor/decoration of ‘‘Palmes
Academiques’’ in the order of Chevalier by the French Minister of National
Education (Paris) for distinguished work in the ﬁeld of education in 2007/2008.
