Network-on-Chip with Guaranteed-Bandwidth Data Communication Service by Samman, Faizal Arya
Network-on-Chip with Guaranteed-Bandwidth Data
Communication Service
Interconnect Platform for Computer Vision and
Multimedia Applications on Many Core Processors
Faizal Arya Samman
Department of Electrical Engineering, University of Hasanuddin at Makassar
Kampus Gowa, Jl. Poros Malino Km. 20, Borongloe 92172, Bontomarannu, Gowa
Email: faizalas@unhas.ac.id
Abstract—A network-on-chip (NoC), having guaranteed-
throughput (GT) or guaranteed-bandwidth service based on a flexible method for establishing connection-oriented data communication at runtime, is presented in this paper. GT packets can share communication links in a flexible manner: flits belonging to the same packet carry the same local identity tag (ID-tag). The ID-tag of each packet varies locally from link to link and is organized by an ID-tag mapping management unit implemented at each output port of the on-chip routers. No specific algorithm is needed to find a conflict-free schedule, as is commonly required by TDM-based methods that rely on time-slot allocation. The contention problem is solved in hardware using the locally organized message identity (ID). This guaranteed-bandwidth/throughput service is intended as an interconnect platform that improves the performance of many-core processor systems running computer vision and multimedia applications.
Keywords—Guaranteed Bandwidth, Network-on-Chip, Multi-
media, Computer Vision, Many Core Processors
I. INTRODUCTION
Quality of service (QoS) has long been an important issue in internetworking data communication, where it is used to provide better service for certain data traffic. Some network-on-chip (NoC) prototypes have also considered this issue in the NoC context, providing a specific service for traffic that requires guaranteed throughput and latency, such as video stream data. Any effort to provide such a service faces the problem of managing contention (conflicts) between different types of packets so that they can share each communication link in the NoC while application performance is maintained. Quality of service can be provided through separate virtual channels or even completely partitioned subnetworks [1].
In the internetworking community, several communication protocols have been introduced to guarantee quality of service. In general, such protocols can be characterized as connectionless or connection-oriented. Connection-oriented services provide some level of delivery guarantee, whereas connectionless services do not [2]. A packet injected into the network with the connectionless protocol can also be interpreted as a best-effort (BE) packet, and a packet injected with the connection-oriented protocol as a guaranteed-throughput (GT) packet.
To provide a specific quality of service for certain traffic, two approaches have been proposed: resource reservation and priority-based scheduling. The resource reservation strategy can provide a hard performance guarantee, but communication resource utilization may be lower. The priority-based strategy can achieve better communication resource utilization, but its performance guarantee is soft [3]. In terms of the connectionless and connection-oriented services mentioned above, the resource reservation strategy is implemented using the connection-oriented service, while priority-based scheduling can be built on the connectionless service.
Data communication between IP cores in NoC-based multiprocessor systems can in general be realized using circuit switching, packet switching or wormhole switching. Virtual cut-through (VCT) switching is a special case of packet switching in which a packet can cut through, i.e., it can be switched out as soon as a routing decision has been made, without waiting for the whole packet to be stored in a buffer of the NoC router. Some recently published chip multiprocessor (CMP) systems with interconnected processing elements, such as Cell EIB [4], Tile64 [5], TRIPS [6], Teraflops [7] and SCC NoC [8], use the packet switching method. Circuit switching based on Time-Division Multiplexing (TDM) has been used by some NoC proposals, for example [3] and Æthereal [10].
The choice of switching method for a NoC-based system determines how feasibly a guaranteed service can be implemented in the NoC. Circuit switching, for instance, enables a feasible implementation of a TDM-based guaranteed service through the communication resource reservation approach. This paper presents an extended use of the wormhole switching method, in which wormhole packets can be interleaved (cut through) at flit level with different packets on the same link [16]. Instead of using time slots, we use local ID slots to resolve BE-GT packet conflicts.
The communication resource reservation approach can be used because it is supported by the local message ID organization technique. Our NoC uses both the connectionless (BE) and the connection-oriented (GT) protocols. The two protocols differ only slightly, as explained in the following. BE data flits are injected into the NoC immediately behind the BE header flit, while the header is still reserving communication resources. The source node does not wait for a response from the destination node to learn whether the connection has been successfully established. GT data streams are injected only after the GT header flit has successfully established the connection, which the source node learns by reading the information in the response flit sent by the destination node. The communication links are shared fairly by all GT packets, and can also be shared with BE packets by controlling the injection rate of the GT packets in such a way that time instants remain available for BE packets to be routed to the requested links.
The paper is organized as follows. Section II presents work related to our current research, and Section III introduces a novel methodology to implement and combine the connectionless and connection-oriented routing protocols with guaranteed-bandwidth service. Section IV presents the flexibility of our method for establishing connections at runtime (during application execution) compared with the existing time-slot-based TDM switching method. Features and characteristics of the proposed NoC microarchitecture, including the hardware solution that realizes the connection-oriented guaranteed service, are presented in Sections V to VIII. Section IX presents experimental results that show the effectiveness of our methodology for combining BE and GT traffic in our NoC. Finally, Section X gives concluding remarks and future research directions.
978-1-5090-5548-7/16/$31.00 © 2016 IEEE
CYBERNETICSCOM 2016 ISBN: 978-602-73589-1-1
II. RELATED WORKS
The work in [11] proposes a unified Mapping, Routing and Slot Allocation (UMARS+) algorithm for NoCs supporting best-effort and guaranteed services. The algorithm couples path selection, mapping of the application onto cores and channel time-slot allocation in order to minimize the network required to meet the constraints of the application. The work in [3] also presents an interesting algorithm and methodology for time-slot-based link configuration and scheduling. It focuses on TDM virtual circuits (VCs) and addresses a multi-node configuration problem. VC configuration is the underlying idea of a connection-oriented communication service between communicating cores in a packet-switched network. For the static case, the VC configurations are computed off-line. The authors mention that the methodology can also be used for dynamic and semi-static cases; however, that further work has not been presented in detail so far.
The design flow of the µSpidergon NoC [12] also proposes an alternative solution offering a TDMA-based guaranteed service for data traffic in the context of NoCs with different clock areas and clock skew (the GALS, Globally Asynchronous Locally Synchronous, context). A time router and a TDMA synchronizer are included in the design flow, which is based on the concept of sub-NoC composition. The main feature of this approach is its GALS orientation, which means that no global clock is required. The work in [13] implements the concept of spatial division multiplexing (SDM) for guaranteed-throughput NoCs. Besides the advantages mentioned in that work, the SDM-based method also has drawbacks, i.e., the difficulty of implementing end-to-end flow control, a larger area overhead for the BE service implementation, and the need for more complex switch control.
The aforementioned TDM-based switching methods, which use time slots to allocate each packet, must be implemented with conflict-free routing and scheduling. As a result, the utilization of the communication resources is not optimal. Link scheduling at runtime, during application execution, based on time-slot allocation is difficult, and the probability that a packet fails to establish a connection is high because of the conflict-free routing requirement. Therefore, a specific (and probably complex) time-slot allocation algorithm is required to achieve conflict-free routing and scheduling, as proposed in [11], [3] and [12]. These approaches incur some time overhead. Before an application is executed, the time-slot allocation algorithm must be run at compile time or possibly at design time. Afterwards, the results of the allocation must be distributed to the time slots of each link or router. Hence, a reconfiguration unit is needed in each router to store this information, which can also lead to an area overhead. Furthermore, the time-slot allocation algorithm can add complexity to the application task mapping algorithm.
In addition, the works in [11], [3] and [12] present off-line algorithms (high-effort software solutions) for finding conflict-free routing based on time-slot allocation on every communication link. In this paper, we propose instead a simple and flexible hardware solution based on the local ID slot organization technique, in which routing conflicts can be managed without off-line software.
III. CONTRIBUTION
Our previous paper presented a NoC that provides only the connection-oriented guaranteed-throughput service [18]. This paper contributes a way to combine connection-oriented (guaranteed-throughput) and connectionless (best-effort) services based on the locally organized message identity (ID).
The flexibility to combine two types of packets for quality of service comes from dynamic local ID management and control [14]. Instead of using time slots and preventing packet conflicts, we use local ID slots to optimize the utilization of communication resources and to ease runtime connection configuration. In our approach, conflicts between flits of different best-effort and guaranteed-throughput messages sharing a link are allowed, but they are controlled and managed by a local ID-slot mapping and management technique.
[Figure: 4x4 mesh of routers with node addresses (0,0) to (3,3); each router is attached through a network interface (NI) to a resource tile (e.g. uP/DSP, cache, RAM, DMA, I/O, reconfigurable logic); the links of the X+ and X− subnetworks are marked.]
Fig. 1. NoC-based multiprocessor system on mesh planar topology.
IV. RUNTIME CONNECTION SETUP METHODS
The Æthereal NoC [10] uses a slot table to avoid packet contention on a link. The slot table divides the bandwidth of each link between connections and switches data to the correct output. Every slot table T consists of S time slots and N router output ports. Synchronization is based on an incrementing time slot, with all routers in the network in the same fixed-duration slot. In a slot s, at most one block of data can be taken per input port or forwarded per output port. In the next slot, (s+1)%S, the taken packet blocks are written to their appropriate output ports. Hence, the packet blocks propagate in a store-and-forward fashion.
Nevertheless, the time-slot method has disadvantages. For instance, the TDM scheduling method has a higher probability of failing to set up a connection at runtime, even though there are still free time slots on the outgoing port considered. Such a failure happens, for example, when a contention occurs. To overcome the contention problem, NoCs that use time-slot-based TDM scheduling to provide a guaranteed service must provide a time-slot allocation algorithm that achieves contention-free routing. Such algorithms have been introduced, for instance, in [3], [11] and [12].
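The failure mode described above can be illustrated with a minimal Python sketch. It rests on a simplifying assumption of ours, not on the implementation of any cited NoC: a block that uses one link in slot s uses the next link of its path in slot (s+1)%S, so a connection needs an aligned chain of free slots. The function name `find_aligned_start_slot` and the boolean-list slot tables are hypothetical.

```python
from typing import List, Optional


def find_aligned_start_slot(path_tables: List[List[bool]]) -> Optional[int]:
    """Return a start slot s such that slot (s + h) % S is free on hop h.

    path_tables[h][t] is True when slot t on the h-th link of the path is
    already reserved. Assumption: every link uses the same table size S and
    data advances by exactly one slot per hop.
    """
    if not path_tables:
        return None
    num_slots = len(path_tables[0])
    for s in range(num_slots):
        if all(not table[(s + h) % num_slots]
               for h, table in enumerate(path_tables)):
            return s
    return None


# Two-hop path with S = 4. Each link still has two free slots (0 and 2),
# but no free slot on the first link is followed by a free slot on the second.
link0 = [False, True, False, True]
link1 = [False, True, False, True]
blocked = find_aligned_start_slot([link0, link1])

# With the reservations on the second link shifted by one slot, the chain
# slot 0 (first link) -> slot 1 (second link) becomes available.
link1_shifted = [True, False, True, False]
usable = find_aligned_start_slot([link0, link1_shifted])
```

In the first case the setup fails although half of the slots on each link are free, which is the situation an off-line slot allocation algorithm is meant to prevent.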
Instead of including a time-slot allocation algorithm in the design flow, which must be run at compile time (before application execution) and adds to the complexity of the functional task mapping algorithm, this paper proposes a more flexible approach: a runtime ID-slot-based scheduling method that uses ID-Tag Mapping Management (IDM) units on every link. It provides a guaranteed service and can dynamically optimize the link bandwidth utilization of the NoC.
V. XHINOC CHARACTERISTICS
A. Network Topology
Fig. 1 presents an example of a chip multiprocessor (CMP) system on a 2D 4x4 mesh planar NoC topology. Physically, the mesh planar NoC is divided into the X+ and the X− subnetworks. Compared with a standard mesh router, the router has two pairs of vertical links connecting its South and North sides: the South1 and North1 links on the left side, used to route packets through the X+ subnetwork, and the South2 and North2 links on the right side, used to route packets through the X− subnetwork. The additional links increase the bandwidth capacity of the router and allow us to implement a 2D planar adaptive routing algorithm. In a standard mesh of size M × N, there are N × (M − 1) + M × (N − 1) full-duplex links, where M is the width and N is the height of the mesh network. In a 2D mesh planar architecture such as the one in Fig. 1, there are N × (M − 1) + 2 × M × (N − 1) full-duplex links.
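As a quick check of the two link-count formulas, the short Python sketch below evaluates them for the 4x4 example of Fig. 1. The function names are ours and purely illustrative.

```python
def mesh_links(m: int, n: int) -> int:
    """Full-duplex links in a standard M x N mesh: N(M-1) + M(N-1)."""
    return n * (m - 1) + m * (n - 1)


def planar_mesh_links(m: int, n: int) -> int:
    """Full-duplex links in the mesh planar topology: N(M-1) + 2M(N-1).

    The vertical links are doubled (North1/South1 and North2/South2).
    """
    return n * (m - 1) + 2 * m * (n - 1)


standard_4x4 = mesh_links(4, 4)          # 4*3 + 4*3
planar_4x4 = planar_mesh_links(4, 4)     # 4*3 + 2*4*3
```

For M = N = 4 this gives 24 links for the standard mesh and 36 links for the mesh planar topology, i.e. 12 additional vertical links.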
Each mesh router presented in Fig. 1 can be connected to a resource tile through a network interface. The network interface (NI) is a component that packetizes data from the tile to the NoC, depacketizes packets from the NoC to the tile, and performs any other functionality required by additional communication protocol specifications. The tile can be a bus-based digital signal processor (DSP) or multiprocessor system, or an application-specific hardware component. The NoC presented in this paper uses the 2D mesh planar topology to enable the use of a minimal 2D planar adaptive routing algorithm and to increase the bandwidth capacity of the NoC.
[Figure: router with ports 1 to N connected through a crossbar; each input port has QBE and QGT queues, a routing engine (RE) and a grant unit (G); each output port has an arbiter (A) and a MIM unit.]
Fig. 2. Generic router architecture.
B. Planar Adaptive Routing Function
The planar adaptive routing algorithm has been described in our previous work [17]. If a message injected at (xsource, ysource) is to be sent to a target node (xtarget, ytarget), and the x-distance between the source and target nodes (xoffs = xtarget − xsource) is positive or zero, then the packets are routed through the physical channels of the X+ subnetwork. If xoffs is zero or negative, the packets are routed through the physical channels of the X− subnetwork; for xoffs = 0, either subnetwork can therefore be used. The ports connected to the vertical links of the X+ and X− subnetworks are denoted (North1, South1) and (North2, South2), respectively. Hence, a message transported via the X+ subnetwork can be routed adaptively to make West–North1, West–South1, North1–East and South1–East turns, as well as to follow the straight West–East, North1–South1 and South1–North1 directions. Packets communicated via the X− subnetwork can choose adaptively between East–North2, East–South2, North2–West and South2–West turns, as well as the straight directions East–West, North2–South2 and South2–North2 [17].
The packets are routed adaptively in the network. When a packet has two options for its outgoing port, it is routed to the outgoing port that has more free ID slots. More detail on this routing algorithm for the planar network topology can be found in [19].
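The two decisions described in this subsection, selecting a subnetwork from the sign of xoffs and preferring the output port with more free ID slots, can be sketched in Python as follows. This is only an illustration under our own assumptions: the function names, the string port labels and the tie-breaking rule are hypothetical, and the full turn model of [17] is not reproduced.

```python
from typing import Dict, List


def candidate_subnetworks(x_source: int, x_target: int) -> List[str]:
    """Subnetworks allowed by the sign of xoffs = x_target - x_source."""
    x_offs = x_target - x_source
    candidates = []
    if x_offs >= 0:
        candidates.append("X+")
    if x_offs <= 0:
        candidates.append("X-")
    return candidates


def pick_output_port(options: List[str], free_id_slots: Dict[str, int]) -> str:
    """Among the admissible output ports, prefer the one with more free ID slots.

    Ties go to the first option listed (an arbitrary choice for this sketch).
    """
    return max(options, key=lambda port: free_id_slots[port])


eastbound = candidate_subnetworks(0, 3)      # xoffs > 0
westbound = candidate_subnetworks(3, 0)      # xoffs < 0
same_column = candidate_subnetworks(2, 2)    # xoffs == 0
chosen = pick_output_port(["East", "North1"], {"East": 3, "North1": 9})
```

Here `chosen` is the North1 port because it has nine free ID slots against three on the East port.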
C. Router Architecture
The generic microarchitecture of the XHiNoC router is presented in Fig. 2. The router has a modular design, in which each component is instantiated regularly for each input and output port. Each incoming port of the XHiNoC router consists of three components: FIFO queues (comprising a Best-Effort Queue (QBE) and a Guaranteed-Throughput Queue (QGT)), a Routing Engine with multiplexed data buffering (RE), and a Grant request acknowledge (G) component. Each outgoing port contains a Multiplexor with ID-tag Management unit (MIM) and an Arbiter (A) component. To keep the router small, the depth of each virtual channel is set to only 2 slots.
VI. PACKET FORMAT FOR GT AND BE MESSAGES
Fig. 3 presents the packet format used in XHiNoC. The flexible connection setup relies on the specific format of the XHiNoC packets. A packet consists of a header (stream request) flit followed by payload data flits, and each flit is 39 bits wide (bits 0 to 38). Bits 38 to 36 encode the flit type, and bits 35 to 32 hold the ID (identity) tag. Table I shows the binary encoding of the 8 flit types, which differentiates the flits used for the best-effort and guaranteed-throughput services.
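The bit layout described above can be sketched in Python as a pair of pack and unpack helpers. The sketch assumes that the remaining bits 31 to 0 are handled as one opaque 32-bit word; the names `pack_flit`, `unpack_flit` and `FLIT_TYPES` are ours, and the type codes follow Table I.

```python
from typing import Tuple

# Flit type codes as listed in Table I.
FLIT_TYPES = {
    0b000: "not data",
    0b001: "BE header",
    0b010: "BE databody",
    0b011: "BE tail",
    0b100: "GT header",
    0b101: "GT databody",
    0b110: "GT tail",
    0b111: "response",
}

TYPE_SHIFT = 36   # flit type occupies bits 38..36
ID_SHIFT = 32     # ID-tag occupies bits 35..32


def pack_flit(flit_type: int, id_tag: int, word: int) -> int:
    """Assemble a 39-bit flit; bits 31..0 are treated as an opaque word."""
    if flit_type not in FLIT_TYPES:
        raise ValueError("flit type must fit in 3 bits")
    if not 0 <= id_tag <= 0xF:
        raise ValueError("ID-tag must fit in 4 bits")
    if not 0 <= word <= 0xFFFFFFFF:
        raise ValueError("word must fit in 32 bits")
    return (flit_type << TYPE_SHIFT) | (id_tag << ID_SHIFT) | word


def unpack_flit(flit: int) -> Tuple[int, int, int]:
    """Split a 39-bit flit into (flit type, ID-tag, 32-bit word)."""
    return (flit >> TYPE_SHIFT) & 0x7, (flit >> ID_SHIFT) & 0xF, flit & 0xFFFFFFFF


# A GT databody flit with ID-tag 2 carrying an arbitrary example word.
flit = pack_flit(0b101, 0b0010, 0xDEADBEEF)
fields = unpack_flit(flit)
```

How the 32-bit word is subdivided in header and response flits (address and info fields) is left out here, since the exact field positions are given only in Fig. 3.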
[Figure: flit layout with a 3-bit flit type field followed by 4-bit fields. Header flit (Head): id-tag, Xs, Ys, Zs, Xt, Yt, Zt. Databody (DBod) and Tail flits: id-tag, payload data. Response flit (Resp): id-tag, Info, Xt, Yt, Zt, Xs, Ys, Zs, payload data / specific information.]
Fig. 3. Packet format.
TABLE I
FLIT TYPES ENCODING.
Hex   Binary   Flit type
0     “000”    not data
1     “001”    header flit for BE packets
2     “010”    databody for BE packets
3     “011”    tail flit for BE packets
4     “100”    header for GT packets
5     “101”    databody for GT packets
6     “110”    tail flit for GT packets
7     “111”    response flit
A data message in XHiNoC is treated as a single packet, and the message is divided into several flits. In other words, the message is not divided into several packets, each consisting of a few flits and one header containing the routing information or destination address of the packet. Hence, the terms “packet” and “message” are used interchangeably in this paper. A unicast message in XHiNoC has only one header, even if its size is extremely large. The routing direction for each message is computed once on each router, when its header is routed. The payload flits, including the tail flit, then follow the routing path set up by the header. The message may consist of a very long flit stream. With this packet format, messages do not suffer from out-of-order delivery, even though an adaptive routing algorithm is used to route the stream.
The source address (XS, YS, ZS) and target address (XT, YT, ZT) of the packet are carried in the header flit. The source ZS and target ZT fields are reserved for addressing the resource tiles of hierarchical networks (e.g. tree-based topologies) if our NoC is extended to a hierarchical NoC, or for addressing the 3D locations of computing resources (tiles) if our NoC is extended to a stacked 3D NoC. The sub-hierarchical networks would be connected to the local port of each mesh node. Hence, each resource tile located in a subnetwork has an (X, Y, Z) address, where (X, Y) denotes the 2D address in the mesh network and Z represents the address of the tile in the sub-hierarchical network. All flits that belong to the same message have the same local identity number (ID-tag). The unique local ID is used to distinguish the flits of one packet from those of other packets passing through the same communication link.
VII. ROUTING SERVICES FOR DIFFERENT PACKETS
A. Guaranteed-Throughput (GT) Packets
The process of establishing and terminating a connection for a guaranteed-throughput message can be described in four phases, as depicted in Fig. 4, in which core A sends a data stream to core B via the on-chip network.
1) To initiate a connection, core A sends a request flit to core B, as shown in Fig. 4(a).
2) After receiving the request flit, core B analyzes it to find out whether the requested connection has been set up successfully. Core B then sends a response flit, as presented in Fig. 4(b), to inform core A of the outcome.
3) If the connection from core A to core B has been established successfully, as indicated by the response flit accepted by core A, then core A starts sending the data stream to core B, as depicted in Fig. 4(c).
4) If, however, the request flit failed to establish the connection from core A to core B, as indicated in the “info” field of the response flit, then core A terminates the connection by sending a tail flit that releases the reserved communication resources, as presented in Fig. 4(d). Afterwards, core A sends a new request to establish a connection to core B.
[Figure: four panels showing core A and core B connected through XHiNoC: (a) Connection Establishment (request header flit), (b) Response Analysis (response flits), (c) Data Transfer (databody flits), (d) Connection Termination (tail flit).]
Fig. 4. Connection setup with progressive approach for connection termination.
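The four phases above can be sketched from the point of view of core A as a small Python function. This is a behavioural illustration only: the function name, the string flit labels and the representation of response flits as booleans are our assumptions, and the sketch closes a successful stream with a tail flit as in Fig. 4(d).

```python
from typing import Iterable, List


def gt_source_session(responses: Iterable[bool], payload: List[str]) -> List[str]:
    """Return the sequence of flits core A injects for one GT message.

    Each element of `responses` stands for one response flit: True when it
    reports a successful connection, False when its info field reports a
    failure. One request header is sent per response consumed.
    """
    sent: List[str] = []
    for accepted in responses:
        sent.append("GT header")      # phase 1: connection request
        if accepted:                   # phases 2-3: response analysed, success
            sent.extend(payload)       # data stream
            sent.append("GT tail")     # terminate the connection after the stream
            return sent
        sent.append("GT tail")         # phase 4: release the partial reservation
    return sent


# First request fails, second request succeeds.
trace = gt_source_session([False, True], ["d0", "d1"])
```

The resulting trace is header, tail (failed attempt), then header, d0, d1, tail (successful attempt).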
A requested connection can fail because no ID slot is available on some communication resource in an intermediate router. Our current router implementation uses 4 bits for the ID-tag field, which means that there are 16 ID slots on each communication resource. However, only 15 ID slots are used for communication; the remaining slot is reserved to control the flow of header flits on links that have run out of ID slots. In the design, ID-tag “1111” is the control ID-tag. For instance, if a header passes through a link that has run out of ID slots, the header is assigned ID-tag “1111”. Once a header has been assigned ID-tag “1111”, it is assigned ID-tag “1111” on every subsequent communication link until it reaches its destination node.
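A minimal Python sketch of the ID-slot bookkeeping on one outgoing link is given below. It assumes that a reservation is keyed by the pair (input port, incoming ID-tag) and that the lowest free slot is handed out first; both choices, like the class and method names, are ours and are not taken from the hardware description.

```python
from typing import Dict, Optional, Tuple


class IdSlotTable:
    """ID slots of one outgoing link (4-bit tags: 15 usable + 1 control)."""

    def __init__(self, id_bits: int = 4) -> None:
        self.control_tag = (1 << id_bits) - 1         # "1111" for 4-bit tags
        self.usable = self.control_tag                # tags 0..14 carry traffic
        self.owner: Dict[int, Tuple[str, int]] = {}   # out tag -> (in port, in tag)

    def reserve(self, in_port: str, in_tag: int) -> int:
        """Handle a header flit and return the ID-tag it gets on this link."""
        if in_tag == self.control_tag:
            return self.control_tag       # stays marked until the destination
        for tag in range(self.usable):    # assumption: lowest free slot first
            if tag not in self.owner:
                self.owner[tag] = (in_port, in_tag)
                return tag
        return self.control_tag           # the link has run out of ID slots

    def lookup(self, in_port: str, in_tag: int) -> Optional[int]:
        """Outgoing ID-tag for payload flits of an already reserved message."""
        for tag, source in self.owner.items():
            if source == (in_port, in_tag):
                return tag
        return None

    def release(self, in_port: str, in_tag: int) -> None:
        """Handle a tail flit: free the slot held by this message, if any."""
        tag = self.lookup(in_port, in_tag)
        if tag is not None:
            del self.owner[tag]


table = IdSlotTable()
tags = [table.reserve("West", t) for t in range(15)]   # fills all usable slots
overflow = table.reserve("North1", 0)                  # no slot left
sticky = IdSlotTable().reserve("West", 0b1111)         # already marked upstream
table.release("West", 3)
reused = table.reserve("North1", 0)                    # freed slot is reused
```

With a 5-bit ID-tag the same sketch yields 31 usable slots plus the control tag, in line with the 32 ID slots mentioned for that configuration.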
B. Best-Effort (BE) Packets
Besides GT packets, the NoC can also route BE packets. Best-effort data communication is connectionless. In XHiNoC, the best-effort databody flits and the last databody flit (the tail flit) are sent right behind the header flit, which is injected first, without waiting for a response flit from the destination node. Hence, we provide a different mechanism to handle a message that cannot reserve an ID slot in some intermediate node. As explained in Subsection VII-A, a header entering a link that has run out of ID slots is assigned ID-tag “1111”, and keeps ID-tag “1111” on every subsequent communication resource until it reaches the destination node.
This rule also applies to the headers of BE messages. However, the BE payload flits that follow a header carrying ID-tag “1111” on the link considered are dropped at the outgoing port of that link. This data-dropping mechanism is provided for BE messages to avoid deadlock (due to the possible exhaustion of ID slots): BE payload data are injected right after the header flit has been injected at the source node, without waiting for a response flit from the target node that would tell the source node whether the header has successfully reserved one ID slot on every communication link between the source and the target node.
Furthermore, once the header flit has been accepted by the destination node, that node sends a response flit to the source node. If the response flit indicates that the reservation failed, the source node stops injecting the best-effort data, sends the tail flit to release the ID slots used in each router, and sends the message again. In our current XHiNoC prototype, each link provides 15 ID slots plus 1 ID slot reserved for link over-capacity control. The number of ID slots per link can still be increased at design time. For instance, if a 5-bit ID-tag field is used in each flit, 32 ID slots are available on each link.
VIII. ID-BASED DATA FLOW CONTROL
A. ID-based Routing Mechanism
The routing engine (RE) units in XHiNoC combine a routing state machine (RSM) and a routing look-up table (LUT) unit. This combination is useful to support runtime link interconnect configuration. Every flit carries its flit type and ID-tag together with a data word. The set of local ID slots is defined as Ω = {0, 1, 2, · · · , Nslot − 1}, where Nslot is the (maximum) number of local ID slots available on each communication link of the NoC. A flit flowing on a communication link can be classified as BEHeader, GTHeader, BEDatabody, GTDatabody, BETail, GTTail or Response. In the context of the mesh planar router structure presented in this paper, we assign the routing directions East, North1, West, South1, North2, South2 and Local to the output port directions 1, 2, 3, 4, 5, 6 and 7, respectively.
[Figure: routing engine (RE) at the North2 port, consisting of a routing state machine and a LUT indexed by ID-tag; a header flit and two payload flits with ID-tag 2 are routed using the stored direction.]
Fig. 5. ID-tag-based routing engine mechanism.
If the RE unit detects a packet header (request flit) with ID-tag Fid ∈ Ω at the input port, the RSM unit determines the routing direction from the destination address stated in the header flit and the current address of the router. The routing direction is then written to register number Fid of the LUT unit (the LUT is indexed by the ID-tag Fid of the packet). When the RE unit later detects payload flits with ID-tag Fid, their routing direction is read directly from the LUT unit, using their ID-tag as the index. Fig. 5 shows an example of the ID-based routing mechanism in a mesh router: the packet with ID-tag 2 is routed to the Local port, and its ID-tag is updated to a new ID-tag, i.e. 1.
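The header and payload paths through the routing engine can be sketched in Python as follows. The sketch assumes a caller-supplied routing function in place of the routing state machine; the class name, the `route_fn` parameter and the toy routing rule in the usage lines are hypothetical, while the direction codes follow the port numbering given above (East = 1, ..., Local = 7).

```python
from typing import Callable, List, Optional, Tuple

HEADER_TYPES = {"BEHeader", "GTHeader"}
EAST, LOCAL = 1, 7   # output port direction codes used in this paper


class RoutingEngine:
    """Routing LUT of one input port, indexed by the local ID-tag."""

    def __init__(self, route_fn: Callable[[Tuple[int, int]], int],
                 n_slot: int = 16) -> None:
        self.route_fn = route_fn                        # stands in for the RSM
        self.lut: List[Optional[int]] = [None] * n_slot

    def direction(self, flit_type: str, f_id: int,
                  dest: Optional[Tuple[int, int]] = None) -> int:
        """Return the output direction for a flit with ID-tag f_id."""
        if flit_type in HEADER_TYPES:
            if dest is None:
                raise ValueError("a header flit must carry a destination")
            self.lut[f_id] = self.route_fn(dest)        # store once per message
        stored = self.lut[f_id]
        if stored is None:
            raise LookupError("payload flit arrived before its header")
        return stored


# Toy routing rule for a router located at (1, 1): deliver locally or go East.
engine = RoutingEngine(lambda dest: LOCAL if dest == (1, 1) else EAST)
header_dir = engine.direction("GTHeader", 2, dest=(1, 1))
payload_dir = engine.direction("GTDatabody", 2)
```

Only the header triggers a routing computation; the payload flit with the same ID-tag simply reads the stored direction, which is what keeps all flits of a message on one path.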
B. Optimal Link Utilization
Fig. 6 shows how several packets (A, B, C, D and E) can be interleaved with each other to share the same communication link. In our NoC, flits belonging to the same packet have the same ID-tag on each link. Hence, the ID-tag attached to each flit enables each payload flit to follow its correct routing direction. In other words, the ID-tag represents a compressed form of the routing direction determined by the header flit. The header flit reserves communication resources by taking one ID slot in the ID Slot Table of each outgoing link, and uses this slot as its current ID-tag. Each flit is then routed in accordance with its current local ID-tag: header flits determine the routing direction and store it in a Routing Table, while payload flits retrieve the routing direction from the Routing Table using their current ID-tag as the table index. Because flits belonging to the same message always have the same local ID-tag on a given communication link, each interleaved flit can retrieve the required routing direction by looking it up in the routing table according to its ID-tag.
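The effect of flit-level interleaving with local ID-tags can be illustrated with the following Python sketch. It assumes round-robin arbitration between packets and hands out ID-tags in the order in which the packets are listed; both are simplifications of ours, and the function names are hypothetical.

```python
from itertools import zip_longest
from typing import Dict, List, Tuple


def interleave_on_link(packets: Dict[str, List[str]]) -> List[Tuple[int, str]]:
    """Interleave the flits of several packets on one link, tagging each flit.

    Assumption: ID-tags are assigned in the listing order of the packets and
    the link is arbitrated round-robin, one flit per packet per round.
    """
    tag_of = {name: tag for tag, name in enumerate(packets)}
    link: List[Tuple[int, str]] = []
    for flit_round in zip_longest(*packets.values()):
        for name, flit in zip(packets, flit_round):
            if flit is not None:
                link.append((tag_of[name], flit))
    return link


def demultiplex(link: List[Tuple[int, str]]) -> Dict[int, List[str]]:
    """Regroup the flits seen on the link by their local ID-tag."""
    streams: Dict[int, List[str]] = {}
    for tag, flit in link:
        streams.setdefault(tag, []).append(flit)
    return streams


packets = {"A": ["A.head", "A.data", "A.tail"], "B": ["B.head", "B.tail"]}
link = interleave_on_link(packets)
streams = demultiplex(link)
```

Although the flits of A and B alternate on the link, grouping by ID-tag recovers each message with its flits in the original order.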
[Figure: routers R1 to R6 with packets A to E sharing links, each packet labelled with its local ID-tag on every link.]
Fig. 6. ID-tag-based routing organization and connection scheduling.
[Figure: latency (clock cycles, 0 to 350) versus communication number k (comm1 to comm6) for the GT Request, GT Response, GT Tail, BE Header and BE Tail flits.]
Fig. 7. Transfer latency of the header, response and tail flits.
IX. EXPERIMENTAL RESULTS
This section evaluates the effectiveness of our methodology for combining the best-effort and connection-oriented guaranteed-throughput data delivery services. We use a transpose traffic scenario running on a 4x4 mesh planar topology, in which a message is injected at source node (i, j) and accepted at target node (j, i). Fig. 1 shows the 2D 4x4 mesh planar addresses in detail. In the 2D 4x4 mesh with matrix transpose traffic, there are 6 communication pairs. Communication 1 (Comm1) is the pair with node (1,0) as data injector and node (0,1) as data acceptor, written in this paper as Comm1|(1, 0) ⇒ (0, 1). The remaining communication pairs are written accordingly as Comm2|(2, 0) ⇒ (0, 2), Comm3|(3, 0) ⇒ (0, 3), Comm4|(2, 1) ⇒ (1, 2), Comm5|(3, 1) ⇒ (1, 3) and Comm6|(3, 2) ⇒ (2, 3).
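The six communication pairs listed above can be generated with a short Python sketch. The function name is ours; the enumeration order is chosen so that the list index matches the Comm1 to Comm6 numbering used in this section.

```python
from typing import List, Tuple

Node = Tuple[int, int]


def transpose_pairs(size: int = 4) -> List[Tuple[Node, Node]]:
    """Injector/acceptor pairs (i, j) => (j, i) with i > j on a size x size mesh."""
    return [((i, j), (j, i)) for j in range(size) for i in range(j + 1, size)]


pairs = transpose_pairs(4)
comm = {f"Comm{k}": pair for k, pair in enumerate(pairs, start=1)}
```

For a 4x4 mesh this yields 4 × 3 / 2 = 6 pairs, with Comm1 being (1, 0) ⇒ (0, 1) and Comm6 being (3, 2) ⇒ (2, 3).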
A traffic pattern generator (TPG) and a traffic response evaluator (TRE) are implemented on each network node. The TPG unit encodes each packet or message so that every message can be recognized and distinguished from other messages. Each flit of a packet is numbered in order by the TPG unit. The TRE unit checks every accepted flit and determines whether any flits are lost in the NoC or are not accepted at the destination node. The TRE unit at the destination node also analyzes the header, databody and tail flits of a packet, and checks whether the accepted packet has correctly reached its destination node. In addition, the TRE unit counts the number of clock cycles required by the header and the other flits to reach the destination node. In our simulations, latency is defined as the number of clock cycles needed to transfer a flit from its source node to its destination node, with the clock count starting at simulation start time. In other words, the clock count is not reset when a new flit is injected at its source node. Hence, in the simulation result figures, the transfer latency increases linearly. The slope of the curve reflects the communication bandwidth (flit acceptance rate) of each source-to-target communication: the smaller the slope, the higher the communication bandwidth.
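The relation between the slope of the latency curve and the flit acceptance rate can be made concrete with the Python sketch below, which fits a least-squares line to the arrival cycle of each flit and returns the reciprocal of its slope. The function name and the example numbers are ours and do not come from the reported simulations.

```python
from typing import Sequence


def acceptance_rate(arrival_cycles: Sequence[float]) -> float:
    """Estimate flits per clock cycle as 1 / slope of arrival cycle vs. flit index.

    arrival_cycles[k] is the clock count, measured from simulation start,
    at which the k-th flit of a message is accepted at its destination.
    """
    n = len(arrival_cycles)
    if n < 2:
        raise ValueError("at least two flits are needed to estimate a slope")
    mean_k = (n - 1) / 2
    mean_t = sum(arrival_cycles) / n
    covariance = sum((k - mean_k) * (t - mean_t)
                     for k, t in enumerate(arrival_cycles))
    variance = sum((k - mean_k) ** 2 for k in range(n))
    # slope = covariance / variance, so the rate is its reciprocal.
    return variance / covariance


# Hypothetical arrivals: one flit accepted every 4 clock cycles.
rate = acceptance_rate([12, 16, 20, 24])
```

A slope of 4 cycles per flit corresponds to an acceptance rate of 0.25 flits per cycle; a steeper curve would give a lower rate.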
In the first experiment, two independent simulations are run.
In the first simulation, all data injector nodes send BE
messages, while in the second simulation, all data injector nodes send
GT messages and receive response flits from consumer nodes. This
experiment is intended to show the effect of the connection mechanism
on transfer latency, where a data injector must wait for a response
flit from a data acceptor before injecting the streaming data. Fifty
flits are injected from each source node in both simulations, and the
flits are injected at 0.25 fpc (flits per clock cycle). In other words,
a new flit is injected every 4 clock cycles. Fig. 7 compares the tail
flit transfer latencies of all communication pairs. The transfer
latencies of the GT tail flits in the second simulation are
larger than the latencies of the BE tail flits in the first simulation,
because the BE payload data are injected into the NoC immediately after
the BE header flit (there is no waiting time for a response flit). The
transfer latency of a GT tail flit is approximately equal to the
transfer latency of the corresponding BE tail flit plus the transfer
latency of the response flit to the data injector (measured from the
moment the data injector starts injecting the header flit).
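The latency relation described above can be written as an idealized model. The sketch below ignores contention and router pipeline details; the function name `tail_arrival_cycle` and the values of `path_delay` and `setup_wait` are hypothetical and are not taken from the simulation results.

```python
def tail_arrival_cycle(num_flits, period, path_delay, setup_wait=0):
    """Idealized arrival cycle of the tail flit, ignoring contention.

    num_flits  : flits sent by the injector (50 in this experiment)
    period     : cycles between injections (4 cycles at 0.25 fpc)
    path_delay : assumed source-to-destination delay in cycles
    setup_wait : assumed cycles the injector waits for the response
                 flit before streaming (0 for BE, > 0 for GT)
    """
    last_injection = setup_wait + (num_flits - 1) * period
    return last_injection + path_delay


# Assumed delays, for illustration only.
be_tail = tail_arrival_cycle(50, 4, path_delay=10)
gt_tail = tail_arrival_cycle(50, 4, path_delay=10, setup_wait=30)
```

In this model the GT tail flit arrives later than the BE tail flit by exactly the assumed waiting time for the response flit, which is the behaviour the comparison is meant to expose.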
In the second experiment, BE and GT messages are
mixed in the matrix transpose traffic scenario. As explained before,
the transpose traffic pattern has 6 communication pairs. We
set Comm2, Comm4 and Comm6 as GT-type injector-acceptor
communication pairs, and Comm1, Comm3 and Comm5 as BE-type
injector-acceptor communication pairs. The transfer latency
of each tail flit increases linearly as the workload (number of flits
per producer node) is increased. In the simulations, the workload
is increased from 250 flits per data producer to 4000 flits per
producer. The injection rate per producer is also varied among
0.1, 0.125 and 0.25 fpc (flits per clock cycle), which means that a
data flit is injected into the NoC every 10, 8 and 4 clock cycles,
respectively.
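The conversion from injection rate to injection period, and the resulting injection time of the last flit, can be checked with a few lines. The helper names are hypothetical, only the two workload endpoints stated above are used, and the computed cycle is an ideal value for a producer that is never stalled, not a measured latency.

```python
def injection_period(rate_fpc):
    """Clock cycles between consecutive flit injections."""
    return round(1 / rate_fpc)


def last_injection_cycle(workload, rate_fpc):
    """Cycle, counted from simulation start, at which the last flit
    is injected if the producer is never stalled."""
    return (workload - 1) * injection_period(rate_fpc)


rates = [0.1, 0.125, 0.25]
periods = {r: injection_period(r) for r in rates}
last_cycle = {(w, r): last_injection_cycle(w, r)
              for w in (250, 4000) for r in rates}
```

Because the clock count is not reset between flits, the tail flit latency cannot be smaller than this last injection cycle, which is why the latency grows linearly with the workload.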
The simulation results shown in Fig. 8 reveal an
interesting characteristic of the XHiNoC: it performs flexible
runtime communication resource reservation even when the data injection
rates at the producer nodes are changed. Such a characteristic is
difficult to obtain with the time-slot-based TDM
method used by other NoC proposals. The data
communication is also lossless, i.e. all flits injected at the source
nodes are accepted at the target nodes. We run the simulations by
encoding each flit generated by a data producer. Hence, we can easily
recognize the data flits of each communication pair and ensure that
no data are lost.
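The loss check enabled by the flit encoding can be sketched as follows. This is an illustrative software analogue, not the TRE hardware: it assumes that each flit carries a message identifier and an in-order sequence number, and the function name `check_flits` and the example trace are hypothetical.

```python
from collections import defaultdict


def check_flits(injected_counts, accepted):
    """Report missing flits and ordering for each message.

    injected_counts : dict mapping message id to number of injected flits
    accepted        : list of (message id, sequence number) pairs in the
                      order they are accepted at the destination nodes
    """
    received = defaultdict(list)
    for msg_id, seq in accepted:
        received[msg_id].append(seq)
    report = {}
    for msg_id, count in injected_counts.items():
        seqs = received[msg_id]
        missing = sorted(set(range(count)) - set(seqs))
        in_order = seqs == sorted(seqs)
        report[msg_id] = {"missing": missing, "in_order": in_order}
    return report


# Hypothetical trace in which flit 1 of Comm2 never arrives.
accepted = [("Comm1", 0), ("Comm2", 0), ("Comm1", 1),
            ("Comm2", 2), ("Comm1", 2)]
report = check_flits({"Comm1": 3, "Comm2": 3}, accepted)
```

A lossless run corresponds to an empty `missing` list for every communication pair.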
X. CONCLUSION AND FUTURE WORKS
This paper has presented a novel approach to combining the
connectionless BE and the connection-oriented GT services in a NoC
by using a runtime dynamic ID-based routing method, in which
communication resources are reserved during application
execution (at runtime) based on dynamic local ID slot allocation
and an ID-based organization technique. The proposed method
does not require a specific slot allocation algorithm that must be run
at compile time.
With our runtime ID-slot-based routing and scheduling, the
injection rate of each GT packet can be freely determined, or
predetermined according to the required injection rate (required
communication bandwidth). This characteristic is difficult
to achieve with the counterpart time-slot-based scheduling when
routing and scheduling are performed at runtime during application
execution. This guaranteed-bandwidth service provides a good
interconnect platform for many core processor systems. Hence,
applications requiring high-bandwidth interprocessor communication,
such as computer vision and multimedia applications, can run with
better performance.
REFERENCES
[1] J.D. Owens, W.J. Dally, R. Ho, D.N. Jayasimha, S.W. Keckler and L.-
S. Peh, “Research Challenges for On-Chip Interconnection Networks,”
IEEE Micro, vol. 27, no. 5, pp. 96–108, Sep/Oct. 2007.
[2] Cisco, “Internetworking Technology Handbook,” available online 2009,
http://www.cisco.com.
[3] Z. Lu and A. Jantsch, “TDM Virtual-Circuit Configuration for Network-
on-Chip,” IEEE Trans. on Very Large Scale Integration Systems, vol. 16,
no. 8, pp. 1021–1034, Aug. 2008.
[4] T.W. Ainsworth and T.M. Pinkston, “Characterizing The Cell EIB On-
Chip Network,” IEEE Micro, vol. 27, no. 5, pp. 6–14, Sep/Oct. 2007.
CYBERNETICSCOM 2016 ISBN: 978-602-73589-1-1
Fig. 8. Simulation with variable workloads and injection rates (fpc = flit per cycle). Panels: (a) Comm1, (b) Comm2, (c) Comm3, (d) Comm4, (e) Comm5, (f) Comm6. Each panel plots latency (clock cycles) against workload (num. of flits/producer) for injection rates of 0.25, 0.125 and 0.10 fpc.
[5] D. Wentzlaff, P. Griffin, H. Hoffmann, L. Bao, B. Edwards, C. Ramey,
M. Mattina, et al., “On-Chip Interconnection Architecture of the Tile
Processor,” IEEE Micro, vol. 27, no. 5, pp. 15–31, Sep/Oct. 2007.
[6] P. Gratz, C. Kim, K. Sankaralingam, H. Hanson, P. Shivakumar,
S. W. Keckler and D. Burger, “On-Chip Interconnection Networks of
the TRIPS Chip,” IEEE Micro, vol. 27, no. 5, pp. 41–50, Sep/Oct. 2007.
[7] Y. Hoskote, S. Vangal, A. Singh, N. Borkar and S. Borkar, “A 5-GHz
Mesh Interconnect for a Teraflops Processor,” IEEE Micro, vol. 27,
no. 5, pp. 51–61, Sep/Oct. 2007.
[8] D. A. Ilitzky, J. D. Hoffman, A. Chun and B. P. Esparza, “Architecture of
the Scalable Communications Core’s Network on Chip,” IEEE Micro,
vol. 27, no. 5, pp. 62–74, Sep/Oct. 2007.
[9] M. Millberg, E. Nilsson, R. Thid and A. Jantsch, “Guaranteed Band-
width using Looped Containers in Temporally Disjoint Networks within
the Nostrum Network on Chip,” Proc. Design, Automation and Test in
Europe Conf. and Exhibition (DATE’04), pp. 890–895, 2004.
[10] E. Rijpkema, K. Goossens, A. Radulescu, J. Dielissen, J. van Meer-
bergen, P. Wielage and E. Waterlander, “Trade-offs in the design of a
router with both guaranteed and best-effort services for networks on
chip,” IEE Proc. Computers and Digital Techniques, vol. 150, no. 5,
pp. 294–302, Sep. 2003.
[11] A. Hansson, K. Goossens and A. Rădulescu, “A Unified Approach
to Mapping and Routing on a Network-on-Chip for Best-Effort and
Guaranteed Service Traffic,” VLSI Design, Journal of Hindawi Pub.,
vol. 2007, doi:10.1155/2007/68432, pp. 1–16, 2007.
[12] S. Evain, J.-P. Diguet and D. Houzet, “NoC Design Flow for TDMA and
QoS Management in a GALS Context,” EURASIP Journal on Embedded
Systems, Hindawi Pub., vol. 2006, doi:10.1155/ES/2006/63656, pp. 1–
12, 2006.
[13] A. Leroy, D. Milojevic, D. Verkest, F. Robert and F. Catthoor, “Concepts
and Implementation of Spatial Division Multiplexing for Guaranteed
Throughput in Networks-on-Chip,” IEEE Trans. Computers, vol. 57,
no. 9, pp. 1182–1195, Sep. 2008.
[14] F.A. Samman, T. Hollstein and M. Glesner, “Multicast Parallel Pipeline
Router Architecture for Network-on-Chip,” in Proc. Design Automation
and Test in Europe (DATE’08), pp. 1396–1401, 2008.
[15] F.A. Samman, T. Hollstein and M. Glesner, “New Theory for
Deadlock-Free Multicast Routing in Wormhole-Switched Virtual-Channelless
Networks-on-Chip,” IEEE Trans. on Parallel and Distributed Systems,
vol. 22, no. 4, pp. 544–557, April 2011.
[16] F.A. Samman, T. Hollstein and M. Glesner, “Wormhole Cut-Through
Switching: Flit-Level Messages Interleaving for Virtual-Channelless
Network-on-Chip,” Elsevier Journal, Microprocessors and
Microsystems–Embedded Hardware Design, vol. 35, no. 3, pp. 343–358,
May 2011.
[17] F.A. Samman, T. Hollstein and M. Glesner, “Planar adaptive
network-on-chip supporting deadlock-free and efficient tree-based multicast
routing method,” Elsevier Journal, Microprocessors and Microsystems–
Embedded Hardware Design, vol. 36, no. 6, pp. 449–461, Aug. 2012.
[18] F.A. Samman, “Runtime Connection-Oriented Guaranteed-Bandwidth
Network-on-Chip with Extra Multicast Communication Service,”
Elsevier Journal, Microprocessors and Microsystems–Embedded Hardware
Design, vol. 38, no. 2, pp. 170–181, March 2014.
[19] F.A. Samman, T. Hollstein and M. Glesner, “Runtime Contention- and
Bandwidth-Aware Adaptive Routing Selection Strategy for
Networks-on-Chip,” IEEE Trans. on Parallel and Distributed Systems, vol. 24,
no. 7, pp. 1411–1421, July 2013.
