The efficiency of buffer and buffer-less data-flow control schemes for congestion avoidance in Networks on Chip  by Aldammas, Ahmed et al.
Journal of King Saud University – Computer and Information Sciences (2016) 28, 184–198
King Saud University
* Corresponding author.
E-mail addresses: bindammas@student.ksu.edu.sa (A. Aldammas),
asoudani@ksu.edu.sa (A. Soudani), dhelaan@ksu.edu.sa
(A. Al-Dhelaan).
Peer review under responsibility of King Saud University.
Production and hosting by Elsevier
http://dx.doi.org/10.1016/j.jksuci.2015.11.002
1319-1578 © 2015 The Authors. Production and hosting by Elsevier B.V. on behalf of King Saud University.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Ahmed Aldammas *, Adel Soudani, Abdullah Al-Dhelaan
College of Computer and Information Sciences, King Saud University, Saudi Arabia
Received 27 May 2015; revised 24 November 2015; accepted 25 November 2015
Available online 15 December 2015
KEYWORDS
Network-on-Chip (NoC);
Congestion control;
Data ﬂow control;
Flit’s dropping;
Quality-of-service (QoS)
insurance
Abstract The design of efficient communication architectures for on-chip multiprocessor
systems involves many challenges regarding the internal router functions used in the
Network on Chip (NoC) infrastructure. The on-chip router should be designed to provide
per-flit processing with enhanced granularity. In fact, the quality of service experienced
at the application level depends on the capabilities of the router to avoid congestion and
to ensure efficient data-flow control. Consequently, an enhanced router architecture is
needed to achieve the requested QoS.
This paper proposes an internal router architecture, for on chip communication, implementing
ﬂow-control mechanism for congestion avoidance with QoS consideration. It describes the internal
functions of this router for optimal output ﬂit scheduling and its capability to apply per-class service
for inbound ﬂows. The paper focuses mainly on the description and performance analysis of two
proposed schemes for data ﬂow control that can be used with the proposed router architecture.
The results shown in this paper indicate that applying the proposed schemes in a NoC
improves the measured end-to-end QoS. We carried out an extensive comparison of the
proposed solutions with existing schemes published in the literature to show that the
proposed solution outperforms them while maintaining a favorable tradeoff with the hardware
characteristics when designed with 45 nm integration technology.
© 2015 The Authors. Production and hosting by Elsevier B.V. on behalf of King Saud University. This is
an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction
The communication aspect of Multi-Processor Systems-on-Chip (MP-SoC) is one of the
biggest challenges in the new generation of embedded systems. A lot of active research is
directed toward designing an efficient communication architecture for these systems. In
particular, the Networks-on-Chip (NoC) used to interconnect the cores are characterized by
high-bandwidth links between the routers, owing to the high operating frequency of the chip.
However, the area cost of the memory part of the integrated circuit strongly limits the
memory capacity of the router, which limits its ability to absorb heavy traffic bursts.
This lack of memory exposes the network to congestion, which induces substantial flit
losses and additional delays that heavily affect the quality of service perceived by the
application.
The efficient design of these systems requires taking into account and combining different
constraints. In particular, embedded hardware constraints have to be considered
simultaneously with networking constraints so that the NoC meets the requirements of
end-to-end QoS (e.g. end-to-end delay, jitter, number of lost flits, throughput at the
receiver, etc.). In this regard, the data-flow and congestion-control mechanisms designed
for routers in conventional computer networks are not applicable in our context.
Recently, very-large-scale integration (VLSI) technology
has offered a ﬂexible and powerful environment for the design-
ers of digital circuits (Chen et al., 2009). Indeed, it allows
maintaining a tradeoff between the complexity of used algo-
rithms and the cost of the designed circuits. Additional tasks
to manage QoS in NoC routers are, therefore, possible with
relatively acceptable hardware overhead. With this considera-
tion, the main challenge lies then, in the speciﬁcation and
implementation of an efﬁcient scheme for congestion control
applicable to on-chip communication.
High ﬂit arrival rate in NoC routers requires an enhanced
scheme for internal queuing of the incoming ﬂits that helps
to process and forward them according to their QoS con-
straints taking into account at the same time the router instant
load. To achieve this goal, a per-ﬂit management process
should be implemented according to the importance of the
payload data type of the transported ﬂit.
The main contribution of this paper is, then, to propose a
new architecture of NoC router with integrated mechanisms
for congestion control. It presents two schemes for congestion
control with respect to the end-to-end QoS requirements. The
ﬁrst proposed scheme is based on output buffering of ﬂits
before they are injected into the next hop that is experiencing
congestion. The second proposed scheme, called feedback sig-
naling mechanism, is a buffer-less approach based on the inter-
action between the source and router in congestion to notify the
source to reduce its sending window size. The paper evaluates
the efﬁciency of the proposed congestion control solutions for
different trafﬁc loads by comparing the performances of the
two proposed schemes regarding QoS.
The remainder of this paper is organized as follows. We first survey the main existing
contributions in the area of congestion and data-flow control in networks on chip. Second,
we present the proposed router architecture and the proposed schemes for data-flow and
congestion control. We then focus on the evaluation of the flow-control mechanisms in terms
of QoS and congestion control. Finally, we discuss the hardware characteristics of our
proposed router before ending with a conclusion and directions for future work.

2. Overview of related work
NoC communication, which is characterized by high band-
width, has recently become an active research area, mainly
regarding the problems of data-flow control and congestion management. The main goal is to improve the efficiency of
bandwidth allocation while avoiding router saturation. Trafﬁc
balancing based on a congestion-aware, adaptive routing pro-
cess is an interesting approach to achieve load fairness (Wang
et al., 2012). This solution often involves an exchange of load
states and link utilization between neighbors, optimizing the
routing process according to these exchanged data.
The authors in Lotﬁ-Kamran et al. (2010) studied a new
solution for congestion avoidance based on dynamic XY rout-
ing (DyXY). In their approach, called Enhanced Dynamic XY
(EDXY) routing, each router signals congestion to its neigh-
bors through a two-bit bus. The presented solution keeps all
data ﬂows from trying to share the same path, consequently
reducing data trafﬁc congestion. Performance analysis of this
solution has demonstrated good results for packet latency
across these routers, outperforming the direct XY routing
approach. However, this scheme did not address the problem
of congestion as a general phenomenon, which might affect a
network even with trafﬁc distribution.
In Wang et al. (2013), a scheme using an energy- and
buffer-aware adaptive routing algorithm (EBAR) was pro-
posed to distribute thermal energy in a NoC. The main idea
is to share data trafﬁc for optimal thermal distribution across
the NoC, while maintaining fair communication performance.
For that purpose, the routing of ﬂits is based on the thermal
state of the next hop, avoiding high-temperature routers in
the communications path. Thermal management is a critical
problem in the design of integrated systems on chips (SoCs),
but from a communications point of view, the proposed
approach does not address congestion problems.
Congestion-awareness techniques based on signaling have
been proposed as an efﬁcient method to share locally the state
of router buffers (Aci and Akay, 2010; Kaddachi et al., 2008;
Daneshtalab et al., 2012). The intention is to avoid congested
areas in the communications path. Applying this method
requires broadcasting congestion information using extra
buses or with additional data carried by the ﬂits. We think that
this method is interesting; however, we believe that it should be
enhanced with the application of an adequate ﬂow-control
mechanism.
Ebrahimi et al. (2012) have studied a new algorithm for the
routing process that uses local and non-local network informa-
tion. The proposed scheme, called the Congestion Aware
Trapezoid-Based Routing Algorithm (CATRA), deﬁnes a set
of nodes that are likely to be involved in the data-
communication path based on their congestion status. The
congestion information is diffused through an extra bus that
allows real-time updating without increasing trafﬁc. The paper
discussed the hardware implementation of this scheme,
demonstrating that the additional cost to implement the pro-
posed idea is reasonable in terms of circuit area and power
consumption. While this approach seems attractive for NoC
implementation, the paper did not show any enhancement to
end-to-end QoS at the application level, which would be the
main argument to justify its adoption.
An approach to ﬂow control was proposed by Becker et al.
(2012). The presented solution aims to control input-buffer
occupancy per data flow. The proposed scheme determines
the credit allowed for each virtual channel based on
performance observations. The authors demonstrated the efﬁ-
ciency of this approach, but they did not study its power
consumption.
The authors in Peh and Dally (2000) suggest a data-flow
control process using a ﬂit-reservation mechanism that
reserves buffer space and link bandwidth for the data stream.
The reservation mechanism increases the efﬁciency of buffer-
space utilization. In fact, the reserved buffer space is dynami-
cally changed, which means that the buffer will be free for the
next transmission for another data ﬂow. This method mini-
mizes latency during the ﬂit transaction.
A congestion-control approach for communication in
NoCs has been proposed in van den Brand et al. (2007). The
suggested method is based on the use of a new service deﬁned
by a strategy called Congestion-Controlled Best Effort
(CCBE). CCBE controls bandwidth usage to deliver ﬂits with
constant and reduced latency, using link utilization as a con-
gestion metric. In fact, a centralized predictive algorithm
(Model Predictive Controller, or MPC) constantly supervises
link utilization as a congestion metric to determine the load
of the connection. The measurements obtained by hardware-
analysis probes are sent to the MPC, which decides CCBE
loads based on the predictions it makes using this information.
Additional research has addressed the problem of conges-
tion avoidance and load sharing in NoCs (Mishra et al.,
2011; Ascia et al., 2006; Kumar and Mahapatra, 2005;
Rijpkema et al., 2003). However, we think that few contributions
have involved per-flit flow-management schemes that
might efﬁciently manage congestion while keeping the best
QoS granularity. We believe, then, that it is necessary to design
an enhanced router architecture that implements an adequate
data-ﬂow control approach along with a suitable internal-
processing scheme for each per-hop transaction. The remain-
der of this paper presents our contribution in this regard,
addressing the design of a micro-architecture for an on-chip
router with enhanced capabilities to avoid and solve congestion
while considering the end-to-end QoS.

3. Proposed router architecture
Fig. 1 presents the general internal architecture of the pro-
posed router for on-chip communication. It is an enhanced
version of the architecture studied in Adel et al. (2014). The
main new contribution concerns the integration of a ﬂow-
control mechanism and the adequacy of the architecture for
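efficient end-to-end QoS.

[Figure 1: Internal hardware architecture of the router. Blocks shown: In_Port FIFOs for the Local, North, South, East, and West ports; flit classifier; class-of-service memory; memory occupancy manager (WRED-like); scheduler; routing process with routing table; neighbor's state table; flow-control process; signaling block; Out_Ports data; flow-control signals.]

The proposed architecture is a modular, scalable system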
that integrates new control and data-flow management functions.
It performs per-flit processing with respect to the characteristics
of the data flow, which is a novel approach. The flit layout
includes a set of fields that make granular routing possible,
allowing the router to handle passing flits according to
their constraints.
In this architecture, communication flows are processed
according to their class of service, with classification performed
as flits are queued in the internal memory of the router. The scheduling
of flits into the routing process, as well as the mechanisms related to
congestion avoidance, are based on this classification. When
congestion is detected, a specific mechanism unloads the memory
according to the importance of the flits. For this purpose, a tag reflecting
the importance of the flit payload is carried in the packet (i.e.
priority tag: pr_tag). In addition, a router experiencing congestion
will instruct its neighbors to update their routing tables. This
approach delegates more autonomy to the router to make local
decisions regarding congestion management, thus optimizing
the use of network resources for efﬁcient data delivery.
The specified architecture comprises different blocks, each
implementing an internal task designed to process flits traversing
the hop. All of these internal components are synchronized
by an external clock signal.
• The input FIFO: these input queues hold flits that arrive
asynchronously from neighboring routers. Their
depths define the capability of the router to absorb bursts
of incoming flits. In terms of structure, the input FIFOs
are physically independent and managed separately.
• The flit classifier: this block is responsible for identifying
flits according to their class-of-service identifier (id_serv).
Every clock cycle, it extracts flits from the input FIFOs
that connect the router to its neighbors. The id_serv is then
used to in-queue the received flits in their corresponding
virtual in-queue in a common class-of-service memory used
for this purpose. This classification later helps to perform
flit-output delivery according to service constraints.
• The central memory: this memory in-queues flits in different
classes of service (see Fig. 1). Extraction of the flits for
forwarding to their destinations is performed by the output
scheduling process, proportionally to the weight of their
classes of service. The memory occupancy manager
permanently controls the depth of the in-queues in memory,
estimating the memory occupancy state. Above a certain
occupancy threshold, the memory is considered fully
loaded, and this process starts to drop the less significant flits
of each communication process, selected from the tails of the
in-queues, in order to avoid router congestion.
• The scheduler process: this block links the memory with the
routing process. It extracts flits from the memory outputs
and forwards them to the routing process. It functions
based on a Weighted Round Robin (WRR)-like algorithm:
it associates a weight (Wi) with each service class i, and
the flits are extracted from the four classes of service
proportionally to their weight (Wi). The idea is to handle the
incoming flits according to their quality-of-service constraints
while maximizing the router's output rate.
In the proposed architecture, in each clock cycle, the routing
block is able to process at most five flits with different
destination addresses. The scheduling mechanism aims to maximize
the number of flits processed in each clock cycle, since
optimizing the router's output delivery reduces the memory
load. In every clock cycle, the output scheduler extracts, from
the memory, the highest number of flits with different destination
addresses that can be forwarded in one clock cycle to the
different output ports of the router. In an output scheduling
cycle, the number of flits with different destinations
in-queued into the class of service (i) is denoted
Flit_class_i and given by Eq. (1):
\[ \mathit{Flit\_class\_i} = \sum_{j=0}^{\mathit{MaxClass}_i - 1} \mathit{flit}(j) \qquad (1) \]
This equation expresses the maximum number of flits that the
scheduler might extract from one class of service. MaxClass_i in
Eq. (1) is defined by the weight assigned to the class of service
(i) when performing the WRR-like algorithm.
\[ \mathit{MaxClass}_i = W_i, \quad \text{where } i \in \{1, \dots, 4\} \qquad (2) \]
In this proposed router architecture, the routing block
triggers the output scheduler process. When the scheduler is
triggered, the number of ﬂits it will transfer to the routing block
during one scheduling cycle is expressed by Eqs. (3) and (4). All
selected ﬂits should be assigned different output ports:
\[ \mathit{NbrFlit} = \sum_{i=1}^{4} \sum_{j=0}^{\mathit{MaxClass}_i - 1} \mathit{flit}(j) \qquad (3) \]
\[ \mathit{NbrFlit} \le 5 \qquad (4) \]
As expressed in this equation, in one output scheduling
cycle, all classes of service might be involved. So, while this
approach based on a WRR-like mechanism allows the router
to send out a variable number of ﬂits from different classes
of service according to their weights (Wi), it also maximizes
the total number of ﬂits that are managed by the router in
one routing cycle. Consequently, it minimizes the total in-
queue length of ﬂits waiting in memory, which in turn reduces
transaction time and avoids congestion.
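
The selection rule of Eqs. (1)-(4) can be illustrated with a short Python sketch of one scheduling cycle. This is a simplified software model under stated assumptions, not the hardware design: the names `schedule_cycle`, `WEIGHTS`, and the `(flit_id, output_port)` tuples are hypothetical, the classes are visited from the highest weight to the lowest, and a flit is allowed to bypass an earlier flit of its class whose output port is already taken.

```python
from collections import deque

# Weights W_i per class of service, taken from Table 1.
WEIGHTS = {1: 4, 2: 3, 3: 2, 4: 1}
# Eq. (4): at most five flits (one per output port) per scheduling cycle.
MAX_FLITS_PER_CYCLE = 5


def schedule_cycle(queues):
    """Select the flits forwarded to the routing block in one cycle.

    queues maps a class id to a deque of (flit_id, output_port) tuples.
    Selected flits are removed from their queue; the others keep their order.
    """
    selected, used_ports = [], set()
    for cls in sorted(WEIGHTS):          # assumption: class 1 is served first
        taken, kept = 0, deque()
        for flit in queues[cls]:
            port = flit[1]
            if (taken < WEIGHTS[cls]                      # Eq. (2): MaxClass_i = W_i
                    and len(selected) < MAX_FLITS_PER_CYCLE
                    and port not in used_ports):          # distinct output ports
                selected.append(flit)
                used_ports.add(port)
                taken += 1
            else:
                kept.append(flit)
        queues[cls] = kept
    return selected
```

In this sketch, two class-1 flits that target the same port are not sent in the same cycle: the second one stays queued and lower-weight classes can use the remaining ports.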
• The routing block: this is the main block in the NoC router.
It routes the different incoming ﬂits based on the wormhole
technique and allows channel multiplexing using the concept of virtual circuits, which improves bandwidth
allocation through the Time Division Multiple Access concept
(Ascia et al., 2006; Palesi et al., 2006).
The communication process starts with the transmission of
the head-data flit, which contains information about the process
identifier (id_proc), source address, and destination
address. The information in this flit is used to establish a
virtual circuit for the remaining data flits. Each router has an
internal representation of the traffic load of all its neighboring
routers. The routing algorithm favors a direct X–Y routing scheme
to forward the header flit to the next router in the path. If the
next hop is congested, an adaptive X–Y scheme based on the
neighbors' states is used to find an alternative next hop. The
current router then updates the routing table with the pair
(id_proc, output_port). All the following incoming data flits with the
same id_proc are forwarded through the same channel,
using the updated information in the routing table. After
the end_data_flit is received, the created virtual channel is
terminated.
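
A minimal Python sketch of this path-establishment logic is given below. It is only an illustration under assumptions: the class `Router`, the flit dictionaries, the coordinate convention (East increases x, North increases y), and the fallback to the direct X–Y port when every productive port is congested are all hypothetical choices that the text does not specify.

```python
def productive_ports(cur, dst):
    """Ports that move a flit closer to dst; the X direction comes first."""
    (x, y), (dx, dy) = cur, dst
    ports = []
    if dx != x:
        ports.append("East" if dx > x else "West")
    if dy != y:
        ports.append("North" if dy > y else "South")
    return ports


class Router:
    def __init__(self, pos):
        self.pos = pos
        self.routing_table = {}        # id_proc -> output_port
        self.congested_ports = set()   # learned from neighbors' signaling flits

    def route_head(self, dst):
        """Direct X-Y choice, or an alternative port if that hop is congested."""
        ports = productive_ports(self.pos, dst)
        if not ports:
            return "Local"
        for port in ports:
            if port not in self.congested_ports:
                return port
        return ports[0]                # assumption: keep the X-Y port as fallback

    def forward(self, flit):
        kind, id_proc = flit["nat"], flit["id_proc"]
        if kind == "head":             # establish the virtual circuit
            port = self.route_head(flit["dst"])
            self.routing_table[id_proc] = port
            return port
        port = self.routing_table[id_proc]
        if kind == "end":              # end-data flit releases the circuit
            del self.routing_table[id_proc]
        return port
```

Only the head flit consults the neighbors' states; continuation and end flits simply follow the (id_proc, output_port) entry created by the head flit.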
• The signaling block: this block implements some tasks
related to the signaling process in the network. It generates
and broadcasts asynchronous signaling flits in order
to update neighboring routers with new traffic load states.
This block also analyzes newly received signaling flits in order
to update the internal routing table with the neighboring
routers' states. The information stored in this table is
required to manage a head flit during the path-establishment
process. This block interacts with the memory
controller in order to obtain the updated state of the
memory.
• Memory occupancy manager block: this block uses an
approach based on Weighted Random Early Detection (WRED) to
drop low-importance incoming flits when required in order
to avoid memory congestion. When memory occupancy is
above a maximum threshold, it starts to drop low-importance
flits from memory and from the output of the
flit classifier based on their service classes (id_serv) and
importance tags (pr_tag). Priority is scaled out of four for
all classes of service and is attributed by the IP (intellectual
property) cores when the application data stream is generated.
The memory occupancy manager starts by dropping flits
with the lowest priority (pr_tag = 0). If all the flits with
the lowest priority are dropped and the router is still congested,
incoming flits of the next priority level begin to
be dropped to reduce the time needed to resolve congestion.
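
The priority-ordered unloading performed by the memory occupancy manager can be sketched in Python as follows. This is a deliberately simplified, deterministic stand-in for the WRED-like mechanism (no drop probabilities are modeled), and the function names, the `limit` parameter, the flit dictionaries, and the choice to scan the lowest-weight class first are assumptions made for illustration.

```python
def occupancy(queues):
    """Total number of flits currently held in the class-of-service memory."""
    return sum(len(q) for q in queues.values())


def unload_memory(queues, limit):
    """Drop flits until the occupancy is at most `limit`.

    queues maps a class id to a list of flits (dicts with a "pr_tag" in 0..3).
    Flits with pr_tag = 0 are dropped first, scanning each in-queue from its
    tail; higher priority levels are touched only if congestion persists.
    """
    dropped = []
    for pr in range(4):                               # lowest priority first
        for cls in sorted(queues, reverse=True):      # assumption: class 4 first
            q = queues[cls]
            for idx in range(len(q) - 1, -1, -1):     # from the in-queue tail
                if occupancy(queues) <= limit:
                    return dropped
                if q[idx]["pr_tag"] == pr:
                    dropped.append(q.pop(idx))
    return dropped
```

With a limit of two flits and four flits in memory, the sketch removes the two pr_tag = 0 flits and leaves the higher-priority ones untouched.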
4. General protocol structure and ﬂit types
In the proposed communication protocol we define two flit
types: data flits and signaling flits. In order to maintain
scalability and match the hardware data bus and internal router
buffer constraints, we have chosen a fixed 32-bit flit size for
all flit types.
Data flits are dedicated to the transport of data between
IPs. During the communication process, three types of data ﬂit
are required:
• Head-data flit: the wormhole switching technique is
based on route establishment (with a physical multiplexed
channel) (Ni and McKinley, 1993) between source and destination
IPs. In the proposed architecture, route establishment
takes into account some information related to the
expected process, such as the class-of-service identifier
(Id_serv). Furthermore, route establishment is mainly enhanced
by the study of the physical states of the expected in-path
NoC routers. This information, quantifying the available
memory space and the length of the in-queue of waiting flits
in the considered class of service, is provided through the
signaling process between neighboring routers. Fig. 2
depicts the head-data flit used in our approach.
• Continuation-data flit: this flit type is processed after the
establishment of the communication path between source
and destination during packet transmission. It transports
the data payload. In this flit, the process identifier Id_proc
is used to identify the output port of the connected routers,
using a local switching table that must be updated at the
start and end of each process.
• End-data flit: this has the same structure as the continuation-data
flit, but its arrival triggers the release of resources reserved for the
current communication process.
Signaling flits are generated by a router upon state change.
They carry useful information between neighboring routers in
order to provide each router with a virtual representation of its
neighbors.
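
To make the fixed 32-bit flit format more tangible, the following Python sketch packs and unpacks a head-data flit. The field names follow Fig. 2, but every bit width in `HEAD_FIELDS` is a hypothetical choice made only so that the fields sum to 32 bits (2 bits for Id_serv and pr_tag match the four classes and the four priority levels described in the text; the other widths are not given in this paper excerpt).

```python
# (field name, width in bits), most significant field first.
# All widths are illustrative assumptions; only the 32-bit total is from the text.
HEAD_FIELDS = [
    ("nat", 2), ("src", 6), ("dst", 6), ("id_proc", 6),
    ("id_serv", 2), ("nb_flit", 6), ("pr_tag", 2), ("reserved", 2),
]
assert sum(width for _, width in HEAD_FIELDS) == 32


def pack(fields, values):
    """Pack named field values into one integer word (missing fields are 0)."""
    word = 0
    for name, width in fields:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack(fields, word):
    """Recover the named field values from a packed word."""
    values = {}
    for name, width in reversed(fields):
        values[name] = word & ((1 << width) - 1)
        word >>= width
    return values
```

The same two helpers could describe the continuation-data and signaling flits by supplying a different field list.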
5. Approaches to ﬂow control and congestion avoidance
In a Multi-Processor System-on-Chip, the end-to-end QoS
(e.g. end-to-end delay, jitter, and number of lost ﬂits) depends
heavily on the capacity of the router to handle, with required
granularity, the ﬂits of a communication process (Bolotin
et al., 2004). The internal tasks associated with ﬂit manage-
ment during per-hop transition affect delay and jitter. The
depth of the waiting ﬂits is directly related to the gap between
the input and output throughputs. Therefore, the ﬁrst strategy
for avoiding a router bottleneck is to reduce as much as possi-
ble the per-ﬂit processing time. Even though the processing
time of SoCs is very short due to their high frequency, the
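demands in bandwidth for multimedia applications are very
high, which requires that internal processing tasks be adequately
efficient in order to achieve short responses.

[Figure 2: Flit structure in the proposed communication architecture (32-bit flit size). Data-head flit fields: Nat, @s, @d, Id_proc, Id_serv, Nb_flit, pr_tag, reserved. Continuation-data flit fields: Nat, Id_proc, Id_Serv, Data, pr_tag, Seq_num. Signaling flit fields: Nat, Id_SW, Mem_stat, Nb_inqueue, Optional, Nb_proc. Legend: Nat: flit type; Id_proc: process identifier; Id_serv: class service identifier; Nb_flit: flit number; pr_tag: priority tag; Seq_num: flit sequence number; Mem_stat: state of the memory; Nb_inqueue: number of the active in-queue; Nb_process: number of the active process; Data: payload data.]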
Guaranteeing QoS in NoCs must be considered at different
levels in the design of the router. Not only does it
involve dimensioning of hardware resources, but it also
requires a whole communication strategy for trafﬁc control
and congestion avoidance. The per-hop memory occupancy
during communication is highly dependent on the strategy
applied to balance the data-trafﬁc load based on the
communication-path state.
Our approach applies a ﬂow-control mechanism along with
a service-differentiation scheme suitable to schedule the ﬂits
for output according to application requirements. For that
purpose, the services in the proposed MP-SoC architecture
are divided into four classes according to QoS constraints
deﬁned by the application. Table 1 speciﬁes the service differ-
entiation and the associated scheduling-output weight.
Splitting the traffic into different classes of service allows
the scheduler to apply generic output scheduling policies to
control the output rate. This approach is suitable for the internal
processing of QoS constraints associated with classes. During
the communication process, flits are stored in the router's
internal memory based on their class of service. Upon their
arrival in the router, the Service class classifier reads the
injected flits from the input ports of the router in
round-robin order. Based on the service identifier
field (id_serv), the flit is in-queued in the appropriate class of
service in the internal memory, which is organized into four
waiting in-queues that are managed using specific pointers
and indexes.
For the queued outgoing ﬂits, the scheduler treats each
class of service using a corresponding weight (Wi) in order
to apply a WRR policy. Table 1 outlines the classes of service
deﬁned for this approach, along with their weights for the out-
put scheduling process.
Congestion control in the NoC is basically dependent on
the capability of the router to avoid overﬂow of its internal
memory. In the proposed scheme, the Memory occupancy manager
controls the memory load. When the lower threshold is
reached, the process interprets the state of the memory as overloaded
and starts selectively dropping incoming flits based on
their importance. Moreover, a signaling ﬂit is generated and
sent to alert its neighbors about its overloaded state, asking
them to avoid using it as part of their communication paths.

Table 1 Service classifications and their associated weights at the output scheduler.

| Classes of service | Service identifier | Description of the service | Associated weight at the output scheduler (Wi) |
|---|---|---|---|
| Signaling flits | Class_1 | Control and management of network IP cores | 4 |
| Real-time flits | Class_2 | Time-sensitive applications, such as multimedia applications | 3 |
| Short data flits | Class_3 | Used to transfer short information, like register content | 2 |
| Block memory transfer flits | Class_4 | Streaming large amounts of data between IP cores | 1 |

Low-importance ﬂits are discarded using a WRED-like algo-
rithm (Barbera et al., 2008; David et al., 2011). The dropping
process selects low-importance ﬂits according to the informa-
tion embedded in the ﬂit (pr_tag), which is generated by the
application based on the importance of the payload. Fig. 3
illustrates the general organization of the internal processes
during ﬂit processing.
The proposed mechanism for discarding low-priority ﬂits
helps a router experiencing congestion to unload its memory
with less impact on the QoS. Indeed, this selective process of
packet discarding helps to reduce per-ﬂit transaction time,
while keeping signiﬁcant data that is useful to the receiver
application. However, we think that neighboring routers using
the congested router as a next hop in a data-communication
path should also reduce their upstream data rates to avoid
burstiness. This paper presents, then, two mechanisms for
NoC ﬂow control that contribute to avoiding and solving con-
gestion. The ﬁrst proposed scheme for ﬂow control is based on
the use of an output buffer, slowing down the output stream
when the next hop experiences a congestion problem. The second
proposed scheme is a bufferless approach, based on a
feedback control scheme between the congested router and
the source core of the data in order to reduce the sending
window size.

[Figure 3: General approach for memory management. Flits from the Local, North, East, West, and South ports are stored per service class (Serv_class1 to Serv_class4) in the internal memory structure; the memory occupancy manager applies the WRED algorithm for low-priority flit dropping, and the flit scheduler feeds the routing process.]

5.1. Flow control based on the use of an output buffer
In the output buffer ﬂow-control approach, a buffer at the out-
put level helps to slow down the frequency of the upstream
data when the next router is congested (Fig. 4). For that pur-
pose, we deﬁne three levels that quantify the load of a router:
under-loaded, loaded, or congested. So, upon receipt of a sig-
naling ﬂit that indicates a loaded memory state in the next rou-
ter, a router activates the output buffer to control the output
stream rate. Flits are then injected into the ﬂow-control buffer
from either index_1 or index_2, representing two positions in
the output buffer. Flits are injected to index_1 when the next
router is over-loaded (congested), while index_2 is used when
it is in a loaded state. At every positive edge of the clock signal,
ﬂits are time-shifted to the output port. The ﬂits are injected at
index_3 if the ﬂow-control mechanism is not activated. The
idea is to introduce some clock cycles before sending out the
ﬂits with the hope that giving more time to the next router
to forward ﬂits from its memory to their destination would
avoid or resolve the congestion.
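
The delaying effect of the three injection positions can be illustrated with a small Python model of the output buffer as a shift register. It is a sketch under assumptions: the buffer depth of six slots, the concrete positions chosen for index_1 and index_2, and the refusal of an injection into an occupied slot are hypothetical; the text only states that index_1 is used for a congested next hop, index_2 for a loaded one, and index_3 when flow control is inactive.

```python
class OutputBuffer:
    """Shift-register model: flits move one slot toward the output per clock."""

    def __init__(self, depth=6, index_1=0, index_2=3):
        self.slots = [None] * depth
        # Entry slot per next-hop state; the last slot is next to the output port.
        self.entry = {
            "congested": index_1,        # longest delay
            "loaded": index_2,           # intermediate delay
            "under-loaded": depth - 1,   # index_3: flow control not activated
        }

    def inject(self, flit, next_hop_state):
        """Place a flit at the entry slot; return False if the slot is busy."""
        i = self.entry[next_hop_state]
        if self.slots[i] is not None:
            return False
        self.slots[i] = flit
        return True

    def clock(self):
        """Positive clock edge: emit the last slot and shift the others."""
        out = self.slots[-1]
        self.slots = [None] + self.slots[:-1]
        return out
```

With these assumed positions, a flit leaves after one clock edge when the next router is under-loaded, after three when it is loaded, and after six when it is congested.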
5.2. Flow control using a feedback-signaling mechanism to the
source core
In this approach, the congestion event is signaled to the source
of the data in order to reduce its throughput. The router expe-
riencing congestion determines ﬁrst from among all the carried
communication processes which one is injecting the largest
amount of data. It then informs its neighbors of its congested
state to help avoid the arrival of more, newly routed ﬂits. The
signaling information in this scheme is carried in an extra
band; in fact, a specific bus carries this information to the router's
neighbors and consequently also identifies the process (P_c)
that caused the congestion.
Among its neighbors, the router (R_{n-1}) that is sending the
flits of the communication process (P_c) sends back the signaling
event to the router R_0 and then to the source of data
(IP_source; see Fig. 5).
The router experiencing congestion calculates the input
injection rates for all carried flows at the time when congestion
is detected. It locates the flow with the highest input-injection
rate and notifies it to reduce its traffic load. In our approach, a
communication process is defined by a data flow F_i and identified
by (proc_id). The router maintains an internal table, the
structure of which is shown in Table 2, where the input-injection
rate is updated continuously for each communication
process.

[Figure 4: Architecture of output-buffer flow control. An output buffer controller, driven by clk, places flits coming from the routing block at Index 1, Index 2, or Index 3 of the buffer that feeds the output port.]

Let Fi denote the data flow of the communication process (i). We define the injection rate Proc_InjRat(i) as expressed in Eq. (5):
\[ \mathrm{Proc\_InjRat}(i) = \frac{\mathrm{Num\_flits}(i)}{T_{\mathrm{Clock\,cycle}}} \tag{5} \]
For process i, Num_flits(i) represents the number of flits injected into the router over the last period of time (TClock cycle).
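Eq. (5) and the selection of the flow to be throttled can be sketched as follows. The dictionary-based table and the function names are hypothetical; the router itself keeps this information in the table of Table 2.

```python
def injection_rate(num_flits, t_clock_cycles):
    """Eq. (5): flits injected during the last period divided by its length."""
    if t_clock_cycles <= 0:
        raise ValueError("the period must be a positive number of clock cycles")
    return num_flits / t_clock_cycles


def highest_injection_flow(flit_counts, t_clock_cycles):
    """Return (proc_id, rate) for the flow with the highest injection rate.

    flit_counts maps proc_id -> number of flits seen over the last period.
    """
    rates = {
        proc_id: injection_rate(count, t_clock_cycles)
        for proc_id, count in flit_counts.items()
    }
    worst = max(rates, key=rates.get)
    return worst, rates[worst]
```

For example, with a period of 100 clock cycles and flit counts of 10, 23 and 2 for processes 1, 2 and 3, process 2 is selected with a rate of 0.23 flits/clk.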
The congestion bus carries the signal all the way through
until it reaches the hardware interface of the IP source, as
shown in Fig. 5. The IP source interface then reduces its injec-
tion rate according to the algorithm shown in Fig. 7.
The IP source interface continuously checks the conges-
tion_bus in order to detect if it is the source of congestion
for a router in the communication path (Fig. 6). When this
happens, the interface adapts its inputs according to the con-
trol algorithm in Fig. 7. The source IP interface ﬁrst reduces
the injection rate by half (Fig. 7, line 3). After a period of
TClock cycle, the IP source keeps the same reduced injection rate
if the state of the router is still in congestion (Fig. 7, line 7).
Otherwise, the injection rate is increased slowly until it reaches
the maximum value (congestion_injection_rate; Fig. 7, line 10).
Every period of TClock cycle, one more ﬂit is injected in that cycle
than in the cycle before. When the injection rate reaches the
threshold, the interface increases the injection rate following
the congestion-avoidance mode as expressed in (Fig. 7, line
13).
In the pseudo-code, N × TClock cycle represents a multiple N of the time period over which we update the injection rate. The larger the value of N, the more slowly the injection rate increases, which helps avoid congestion. This value N can be a dynamic parameter used to fine-tune the injection rate. The optimal value of N should allow high throughput while avoiding congestion. In our study, we chose N = 4 as optimal based on simulations that tested the impact of N on the input-injection rate. We note that greater values of N can also be applied in this algorithm.
The proposed algorithm favors best-effort behavior in the routers, which maximizes bandwidth allocation. However, when congestion is detected, the proposed scheme obliges the main source of this congestion to reduce its injection rate sharply (to half of the current injection rate), and this reduced rate is kept until the end of the congestion. Once the congestion is resolved, the source periodically increases its input injection rate by one flit every TClock cycle until it reaches the initially measured congestion_injection_rate. This phase avoids quickly saturating the network again while aiming to maximize throughput. Beyond this threshold, the source applies the congestion-avoidance mode to increase its input-injection rate smoothly, as explained in the previous paragraph.
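The behavior described above can be sketched as a small controller that is evaluated once per TClock cycle period. This is a reconstruction from the prose, not the pseudo-code of Fig. 7: the class name, the integer rate expressed in flits per period, the floor of one flit, and the one-flit step used in congestion-avoidance mode are all assumptions.

```python
class SourceRateController:
    """Injection-rate control at the IP source interface (illustrative only)."""

    def __init__(self, max_rate, n=4, initial_rate=None):
        self.max_rate = max_rate          # upper bound, in flits per period
        self.n = n                        # avoidance step every N periods
        self.rate = max_rate if initial_rate is None else initial_rate
        self.threshold = None             # congestion_injection_rate
        self.was_congested = False
        self.periods_since_step = 0

    def on_period(self, congested):
        """Update the rate once per period and return the new rate."""
        if congested:
            if not self.was_congested:
                # New congestion event: remember the rate, then halve it.
                self.threshold = self.rate
                self.rate = max(1, self.rate // 2)
            # While congestion persists, the reduced rate is kept.
            self.periods_since_step = 0
        elif self.threshold is not None and self.rate < self.threshold:
            # Recovery: one more flit per period up to the threshold.
            self.rate += 1
        elif self.rate < self.max_rate:
            # Congestion avoidance: one more flit every N periods (assumed step).
            self.periods_since_step += 1
            if self.periods_since_step >= self.n:
                self.rate += 1
                self.periods_since_step = 0
        self.was_congested = congested
        return self.rate
```

Starting from 8 flits per period with N = 4, two congested periods followed by eight uncongested ones give the rates 4, 4, 5, 6, 7, 8, 8, 8, 8, 9 under these assumptions.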
The characteristics of the two proposed schemes for conges-
tion control in on-chip networks are different. The output-
buffer scheme is a local mechanism operating between a con-
gested router and its neighbors. It uses an in-bandwidth signal-
ing mechanism based on the exchange of ﬂits that carry
congestion information. The second ﬂow-control mechanism,
on the other hand, involves an interaction with the source core
to reduce its injection rate.
Table 3 summarizes the main characteristics of and differ-
ences between these two ﬂow-control modes.
In the following part of the paper, we evaluate the performance of a multimedia application transmitted with the application of these two proposed flow-control schemes for avoiding congestion in NoCs.
6. Performance analysis
The performance of the proposed router architecture was eval-
uated with both of the presented ﬂow-control schemes for con-
gestion avoidance. In a co-simulation environment based on
Simulink-ModelSim tools, we built a 3 * 3 mesh NoC using
the described router architecture. The Register Transfer Level
(RTL) – VHDL description of the router was developed, sim-
ulated with ModelSim, and used to link the different cores of
the network. For each core of this network, an adapter (inter-
face) was designed in order to shape the ﬂits and control the
injection rate.
Between the cores of this network, many communication
processes were deﬁned in order to test the performance of
the ﬂow-control schemes under injection at variable rates, with
changing data-trafﬁc loads to create congestion states. We
evaluated the schemes mainly based on their capabilities to
manage QoS and to provide better performance at the applica-
tion level when the network experiences congestion. The simu-
lation focused on the evaluation of QoS metrics for a speciﬁc,
real-time data-ﬂow that was transported over the network with
other, simultaneous communication processes. The speciﬁc test
application involved sending an image and retrieving it at the
receiver core.
6.1. Performance of the WRED-like algorithm
We first evaluated the performance of the designed WRED-like algorithm for discarding low-importance flits to unload memory and avoid congestion during heavy traffic. To test the capability of this algorithm to resist congestion under heavy loads, we injected into one router many communication processes on different input ports. Fig. 8 shows the output of the router for two different injection rates (0.23 flits/clk and 0.19 flits/clk). The instants T1 and T3 are the times when congestion is detected in the router memory for injection rates of 0.23 flits/clk and 0.19 flits/clk, respectively, without the use of the flit-discarding algorithm. When the WRED-like algorithm is applied, congestion occurrences are delayed for both input rates, respectively, to T2 and T4. By using the selective flit-discarding process, the router delayed the congestion by more than 180 and 200 clk cycles for input rates of 0.23 and 0.19 flits/clk, respectively.
Figure 5 Flow control with feedback to the source. [Diagram: the data path runs from the IP source through routers R(0), ..., R(n-2), R(n-1) to the congested router R(n) and its neighbors R(n+1) north, south and west. The Data_Bus carries the flits; the Signaling_bus signals the congestion back to the source, which has to reduce its injection throughput.]
Table 2 Structure of the routing table.
| Proc_id | In_port | Output_port | Class_id | Injection rate |
| --- | --- | --- | --- | --- |
| Id_proc_1 | Proc1_in_Port_id | Proc1_Out_port_id | Proc1_Class_id | Proc1_inj_rate |
| Id_proc_2 | Proc2_in_Port_id | Proc2_Out_port_id | Proc2_Class_id | Proc2_inj_rate |
| Id_proc_n | ProcN_in_Port_id | ProcN_Out_port_id | ProcN_Class_id | ProcN_inj_rate |
Figure 6 Structure of the congestion signaling bus. [Fields: a flag set to 1 by the router experiencing the congestion and set to '0' when the router forwards the signal back to the source; 2 bits to code the state of the router generating the congestion event (01: Loaded, 10: Overloaded, 11: Congested); 4 bits to identify the Proc_id of the highest injection rate.]
The ability of the WRED-like algorithm to delay congestion is higher with a lower injection rate. In fact, in this case, more time is allowed for the routing process to forward flits from memory to their destinations. The WRED-like algorithm is applied locally, in the internal memory of the router experiencing the congested state. Even if this process is able to delay the congestion (as in the case of 0.23 flits/clk), it cannot avoid injecting a high data load to the router. Thus, we think that an appropriate data-flow control mechanism has to be applied along with this approach.
Figure 7 Pseudo-code of the injection-rate control algorithm.
Table 3 General characteristics of the two proposed flow-control schemes.
| Characteristic | First scheme: output buffer | Second scheme: feedback control of source core |
| --- | --- | --- |
| NoC architecture | Mesh architecture but also scalable for other architectures | Mesh architecture but also scalable for other architectures |
| Buffer | With output buffer | Buffer-less approach |
| Signaling of congestion | In-bandwidth signaling | Usage of extra hardware bus |
| Congestion awareness | Fast broadcasting of in-bandwidth signaling flits to neighbors | Very fast signaling mechanism based on hardware signals generated between routers |
| WRED-like algorithm for selective packet discarding | Applicable | Applicable |
| Scalability | Scalable for the size of the network; does not depend on the number of routers in the network | Scalable for the size of the network; does not depend on the number of routers in the network |
To compare with similar work, the authors in Lee et al. (2012) implemented Globally-Synchronized Frames to ensure QoS in NoCs. In (Lee et al., 2012), congestion occurred for an input rate of 0.22 flit/clk while adopting different arbitration approaches. Our proposed solution without the application of a flow-control scheme became congested for a similar input rate (0.23 flits/clk). However, the WRED-like algorithm delayed the occurrence of congestion, which may facilitate the efficient delivery of QoS at the reception level. In addition, the authors of Lee et al. (2012) did not discuss a specific mechanism to manage or avoid congestion.
6.2. Performance of the router with the application of the flow-control mechanisms
The ability of the proposed router to avoid congestion in the
network was evaluated with the two proposed ﬂow-control
schemes. Performance was analyzed in terms of its capability
to enhance the QoS observed at the application level. For this
purpose, a communication process was dedicated to the trans-
mission of an image of 512 * 512 8bpp that was transformed
using the Haar wavelet transform (HWT) and injected into
the 4 * 4 NoC mesh architecture. The use of the HWT gener-
ates ﬂits of variable levels of importance for the retrieval pro-
cess (Fig. 9). The data ﬂow representing the image after
decomposition contains four sub-bands (LL, LH, HL, and HH). Accordingly, flits were generated and injected into the network with variable importance tags (pr_tag). The flits generated from the sub-band LL carry the low-frequency data most required for the retrieval process; therefore, they are assigned the highest importance. The two sub-bands LH and HL contain data that are less important than the LL sub-band data for reconstructing the image. The least-significant flits injected into the network are those transporting data from the HH sub-band.
The HWT decomposition tests the ability of the proposed WRED-like algorithm to select low-weight flits to discard from the data stream in order to avoid memory saturation in the routers. In addition, other communication processes were simultaneously carried on the network in order to increase the per-router loads and to cause congestion states in the data path. Table 4 sums up the features of the data streams injected on the network during the simulation.
The proposed flow-control schemes were applied, along with the WRED-like algorithm, to discard low-importance flits from transitional nodes when congestion symptoms were detected. We evaluated the capabilities of the two proposed flow-control schemes to enhance the QoS at the reception level. For the transmission of the Lena image, we estimated the variation in jitter, which expresses the capacity of the routers in the path to maintain periodic delivery of their incoming flits. Fig. 10 shows the maximum absolute values of jitter measured for the different flow-control schemes at the reception level under different network loads. Under low network load (0.12 flits/clk cycle), the network delivers the flits to the receiver with almost the same maximum jitter. However, under higher network load, which causes congestion (from 0.56 flits/clk cycle), the scheme based on a feedback-signaling mechanism to the source core ensures less jitter compared to the output buffer flow-control mechanism or compared to communication without any flow-control mechanism.
Figure 8 Capability analysis of the WRED-like algorithm for congestion delay in the router. [Plot: number of flits (340 to 460) versus time (1900 to 2600 clk) for input rates of 0.19 flits/clk and 0.23 flits/clk, with the instants T1, T2, T3 and T4 marked.]
Figure 9 Output stream after the decomposition with HWT for the Lena image: (a) Lena image; (b) frequency sub-bands obtained with HWT.
Table 4 Features of the test data flows.
| Process identifier | ID class of service | Input injection rate to the network (flits/clk) | Source–destination (X–Y) | Period of time (TClock cycle) | General description |
| --- | --- | --- | --- | --- | --- |
| 1 | 1 | Variable from 0.01 to 0.1 | (00, 00) to (01, 10) | 100 CLK cycles | HWT applied to Lena output data stream |
| 2 | 2 | Variable from 0.015 to 0.1 | (00, 01) to (10, 01) | | Extra communication processes injected to the mesh network during simulation to load the network |
| 3 | 3 | 0.02 | (01, 00) to (01, 10) | | |
| 4 | 4 | 0.02 | (10, 00) to (00, 10) | | |
| 5 | 1 | Variable from 0.01 to 0.1 | (10, 00) to (01, 10) | | |
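To make the sub-band prioritization described earlier in this section more concrete, the following sketch applies one level of a 2-D Haar transform to a small image and tags each coefficient with an importance value. It is only an illustration: the averaging normalization, the LH/HL naming convention and the numeric pr_tag values (higher meaning more important) are assumptions, and the flit format of the router is not modeled.

```python
# Assumed importance values: LL highest, LH and HL intermediate, HH lowest.
PR_TAG = {"LL": 3, "LH": 2, "HL": 2, "HH": 1}


def haar_level1(image):
    """One-level 2-D Haar transform of a list-of-lists image (even dimensions).

    Each 2x2 block [[a, b], [c, d]] yields one coefficient per sub-band.
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("image dimensions must be even")
    bands = {"LL": [], "LH": [], "HL": [], "HH": []}
    for r in range(0, rows, 2):
        line = {name: [] for name in bands}
        for c in range(0, cols, 2):
            a, b = image[r][c], image[r][c + 1]
            c2, d = image[r + 1][c], image[r + 1][c + 1]
            line["LL"].append((a + b + c2 + d) / 4)  # block average
            line["HL"].append((a - b + c2 - d) / 4)  # horizontal detail
            line["LH"].append((a + b - c2 - d) / 4)  # vertical detail
            line["HH"].append((a - b - c2 + d) / 4)  # diagonal detail
        for name in bands:
            bands[name].append(line[name])
    return bands


def tag_coefficients(bands):
    """Return (pr_tag, sub_band, value) tuples, most important band first."""
    tagged = []
    for name in ("LL", "LH", "HL", "HH"):
        for row in bands[name]:
            for value in row:
                tagged.append((PR_TAG[name], name, value))
    return tagged
```

A flat image region produces zero LH, HL and HH coefficients, which is why discarding HH flits first costs the least in reconstruction quality.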
At a load of 0.56 flits/clk cycle, the maximum measured value of jitter is higher when applying the feedback-signaling control mechanism than when applying either of the two other schemes. In fact, at this load the memory of the router is lightly loaded, and the output-buffer flow-control mechanism starts time-shifting the flits at the output (index 2) without reducing the injection rate of the source. This high input-injection rate favors injecting more flits of the same data flow into the congested router and consequently delivering more flits to the receiver. The feedback flow-control mechanism, however, reduces the injection rate of the source by half and then increases it slowly to avoid congestion. As a result, for this load of 0.56 flits/clk, the jitter measured with the feedback flow-control mechanism increases because of this considerable reduction of the input injection rate as well as the application of the WRED-like algorithm. Fig. 10 shows that this jitter is higher than the values measured with the two other flow-control schemes.
When the network is heavily loaded, the output buffer cannot avoid high values of jitter. It time-shifts the flits using position index 1 in the output buffer, which saturates quickly because of the high load of the router. Thus, this scheme cannot avoid a high number of discarded flits under the WRED-like algorithm (since the injection rate into the router remains high). The jitter then increases significantly compared to the feedback-signaling flow-control mechanism, which keeps the measured jitter low compared to the two other flow-control schemes.
Figure 10 The maximum absolute value of jitter with different flow-control schemes. [Bar chart: maximum absolute value of the jitter (clk, 0 to 200) versus network injection load (0.12, 0.56, 0.76, 1 and 1.05 flits/clk) for three cases: without flow control, feedback with the source core flow control, and output buffer flow control.]
The jitter values measured with the output-buffer-based flow-control scheme are slightly better than those for communication without flow control, but the two keep the same trend of increasing jitter as the data-traffic load grows. In fact, without reducing the data-source injection rate, the memory of the router becomes heavily loaded, which increases the probability of discarding flits. In that case, the output buffers used in the neighbors saturate after a short time, and their time-shifting of flits does not reduce the injection rate into the congested router.
In Fig. 10, the feedback ﬂow-control scheme was able to
efﬁciently avoid high jitter experienced at the receiving applica-
tion. This ﬁgure shows a gain of around 43% under a network
load of 1.05 ﬂits/clk cycle. This result attests to the efﬁciency of
the feedback-signaling mechanism to the source core.
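The paper does not state the exact formula used for jitter. As a hedged illustration, the sketch below takes jitter to be the deviation of each inter-arrival gap from a nominal delivery period and reports the maximum absolute value, which is one common reading of "periodic delivery"; the function name and this definition are assumptions.

```python
def max_abs_jitter(arrival_times, nominal_period):
    """Maximum absolute deviation of inter-arrival gaps from the nominal period.

    arrival_times: flit arrival instants at the receiver, in clock cycles,
    in increasing order. nominal_period: expected gap between two flits.
    """
    if len(arrival_times) < 2:
        raise ValueError("at least two arrivals are needed")
    gaps = [
        later - earlier
        for earlier, later in zip(arrival_times, arrival_times[1:])
    ]
    return max(abs(gap - nominal_period) for gap in gaps)
```

For arrivals at 0, 10, 20, 35 and 42 clk with a nominal period of 10 clk, the gaps are 10, 10, 15 and 7, so the maximum absolute jitter is 5 clk.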
End-to-end delay is one of the most interesting QoS metrics
used to evaluate the time efﬁciency of a communication sys-
tem. We have evaluated the time performance of the discussed
schemes and their capabilities to ensure delivery to the receiv-
ing application without delay. Fig. 11 presents the measured
end-to-end delays for the two ﬂow-control schemes, as well
as the results obtained without the application of a ﬂow-
control mechanism. The end-to-end delay increases with
increasing data-trafﬁc load, reﬂecting more buffering time in
the routers. The application of an output-buffer ﬂow-control
mechanism does not reduce end-to-end delay compared to
the results obtained using the feedback-signaling mechanism
to the source core. Buffering at the output of neighboring rou-
ters slightly reduces the measured end-to-end delay, but this
gain is less than with the application of the feedback-
signaling mechanism. Under network loads that cause conges-
tion (P0.56 ﬂits/clk cycle), the measured end-to-end delay
with the application of feedback signaling to the source core
remains less than the measured values with the scheme using
an output buffer.
We also noticed from Fig. 11 that the end-to-end delay is not significantly improved by the output-buffer flow-control mechanism compared to communication without flow control. In fact, as explained in the interpretation of Fig. 10, since the input-injection rate at the source is not reduced, the memory of the congested router remains heavily loaded, as does the memory of the other routers in the data path. Thus, the end-to-end delay is only slightly reduced with the application of the output-buffer-based flow-control approach.
The results shown in Fig. 11 suggest that the feedback-
signaling mechanism is a more scalable and efﬁcient ﬂow-
control mechanism compared with the output-buffer scheme.
The performance analysis also measured the number of ﬂits
lost through a congested router (Fig. 12). The application of
the output-buffer ﬂow-control scheme reduced the number of
lost ﬂits. The feedback-signaling scheme to the source core also
yielded fewer lost ﬂits in the congested router. When compared
with communication without a speciﬁc ﬂow-control mecha-
nism, the number of lost ﬂits is reduced by more than 40%,
and the feedback-signaling mechanism reduces lost flits by more than 20% when compared with the output-buffer scheme. As an important reminder, the WRED-like algorithm is applied in both of the studied schemes to drop from congested memory the flits of low importance to the retrieval of the image.
This selection of low-importance ﬂits increases the quality
of the received image. Fig. 13 shows the different measured
Peak Signal-to-Noise Ratios (PSNR) for the retrieved image
at the receiving application. The quality of the image is sub-
stantially enhanced by applying both proposed schemes of
data-ﬂow control. However, applying the feedback-signaling
scheme speciﬁcally as a ﬂow-control method obtained more
improvement in the measured PSNR when the network is
overloaded. In fact, the value of the PSNR remains higher
than 25 dB in such a scheme, a threshold that corresponds to
a good-quality received image. The gain in the PSNR of the
received images with the application of the feedback ﬂow-
control scheme is more than 34% for a data trafﬁc load of
1.05 ﬂits/clk, compared with transmission without ﬂow
control.
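For reference, the PSNR of an 8 bpp image is conventionally computed from the mean squared error between the original and the received image, as in the sketch below. The function name and the list-of-lists image representation are choices made for this example; missing pixels caused by dropped flits would first have to be filled in by the retrieval process, which is not modeled here.

```python
import math


def psnr(original, received, max_value=255):
    """Peak Signal-to-Noise Ratio in dB between two equally sized images."""
    if len(original) != len(received):
        raise ValueError("images must have the same number of rows")
    squared_error = 0
    count = 0
    for row_o, row_r in zip(original, received):
        if len(row_o) != len(row_r):
            raise ValueError("rows must have the same length")
        for pixel_o, pixel_r in zip(row_o, row_r):
            squared_error += (pixel_o - pixel_r) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)
```

With this definition, an error of one gray level on every pixel gives about 48.13 dB, and larger distortions push the value down toward the 25 dB level mentioned above.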
Fig. 14 shows the results of the throughput measured at the
reception core for the image-communication process. For an
under-loaded network (0.12 flits/clk), the throughput at the receiver level is almost the same with or without the two proposed flow-control schemes. As network load increases,
some routers become congested (loads of 0.56 ﬂits/clk and
above). The two proposed ﬂow-control schemes act differently
in this case. In the scheme with feedback-signaling to the
source core, the congested router asks the core source with
the highest injection rate to reduce its injection rate, as
described above. However, in the output-buffer ﬂow-control
scheme, the congested router just asks its neighbors to delay
forwarding the routed ﬂits. This explains why the reception
throughput is low under the feedback-control scheme. How-
ever, when the congestion is severe (under high trafﬁc loads),
arrival throughput to the application is better than under the
output-buffer flow-control scheme. In other words, combined with the other QoS metrics, even though the reception throughput is lower, the achieved QoS is better with the application of the feedback flow-control scheme.
7. Hardware characteristics of the proposed router
To estimate the hardware features of the studied router for on-chip communication, a NoC router has been designed for ASIC CMOS prototyping. We used a VHDL description at the RTL level to describe the functionalities of the router as sets of Extended Finite-State Machines. The Design Vision tool was used for synthesis with CMOS cell libraries, and the Silicon Encounter tool was used for placement and routing while laying out the circuit. We used the Silicon Encounter tool with a 45 nm standard-cell library in order to extract the main features of the proposed circuit, from which we estimated the area and power consumption of the NoC router.
Table 5 illustrates the main characteristics of the proposed router, assuming an internal memory size of 1024 bytes, which is able to queue a depth of eight flits in four internal classes of service. The results shown in that table were obtained for the router without the application of the two flow-control schemes (Fig. 1). The additional hardware cost of applying these two schemes is negligible, especially considering their QoS advantages. In fact, for a frequency of 500 MHz, they occupy about 0.0002 mm2 of the circuit area and around 0.1 mW of the dynamic power consumption.
Figure 11 End-to-end delay measured for different flow-control schemes. [Bar chart: end-to-end delay (clk × 10^2, 0 to 2) versus injection rate (0.12, 0.56, 0.76, 1 and 1.05 flits/clk) for buffer flow control, without flow control, and feedback with the source core flow control.]
Figure 12 Dropped flits during congestion with different flow-control schemes. [Bar chart: number of lost flits (× 10^3, 0 to 140) versus network injection load (0.12, 0.56, 0.76, 1 and 1.05 flits/clk) for without flow control, buffer-based flow control, and feedback control with the source core.]
These costs increase slightly with frequency: about 0.004 mm2 more is required between 330 MHz and 1 GHz using this integration technology. The change in circuit area with increasing frequency is therefore not significant. We thus think that the proposed circuit can provide high-bandwidth communication in MP-SoCs without compromising the area reserved for the processing cores (IPs).
A circuit's power consumption is often determined by its dynamic part. Table 5 illustrates the dynamic power consumption of the router for different frequencies, which increases with the frequency of the circuit. As the frequency determines the available link bandwidth between routers, it is important to note that a tradeoff has to be made between power consumption and the link communication characteristics, such as the delay and the per-router transaction time. For a frequency of 1 GHz, the proposed router will provide a link bandwidth of about 128 Gbit/s (frequency * link size * number of output ports).
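The bandwidth figure can be checked with a short worked calculation. The 32-bit link size corresponds to the flit size mentioned in the comparison with related work; the count of four output ports is not stated explicitly here and is inferred as the value that reproduces the quoted bandwidth.

```latex
\[
B = f \times w \times p
  = 10^{9}\,\mathrm{Hz} \times 32\,\mathrm{bit} \times 4
  = 128 \times 10^{9}\,\mathrm{bit/s}
  = 128\ \mathrm{Gbit/s}
\]
```

Here f is the clock frequency, w the link size in bits and p the assumed number of output ports; each port alone would then carry 32 Gbit/s.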
In conclusion, the proposed NoC router achieves high com-
munications throughput with low power consumption, which
makes it a very suitable solution for efﬁcient communication
in embedded systems.
8. Comparison with related work
Coupled with QoS demands at the application level, different researchers have considered the very important topic of congestion-control mechanisms in NoCs, aiming to avoid saturation and improve throughput. In this section of the paper, we compare our solution to different work from the literature.
Figure 13 PSNR of the received images using different flow-control schemes. [Chart: PSNR (dB, 0 to 40) versus network injection load (0.12, 0.56, 0.76, 1 and 1.05 flits/clk) for without flow control, feedback control with the source core, and output-buffer-based flow control.]
Figure 14 Received throughput at the application level for the image communication process. [Chart: received throughput (flits/clk, axis 0 to 100) versus network injection load (0.12, 0.56, 0.76, 1 and 1.05 flits/clk) for the feedback with the source core scheme, without flow control, and the output buffer scheme.]
Table 5 Circuit area and power estimation.
| Frequency (MHz) | Area (mm2) | Dynamic power (mW) |
| --- | --- | --- |
| 330 | 0.048 | 10.28 |
| 500 | 0.049 | 15.42 |
| 1000 | 0.052 | 30.96 |
Wang and Bagherzadeh (2014) propose a novel, QoS- and
congestion-aware NoC architecture to manage QoS and to
balance the trafﬁc load inside the network to enhance overall
throughput. The proposed approach differentiates the trafﬁc
into different service classes, and then bandwidth allocation
is managed accordingly to fulﬁll QoS requirements. The pro-
posed router incorporates a congestion-control scheme com-
prising dynamic arbitration and adaptive routing-path
selection. High-priority trafﬁc is directed to less-congested
areas and is given preference over available resources. The
simulated results of a network based on the proposed router
show interesting performance in terms of load balance and
per-node latency. When compared with our solution, we think that distributing the traffic load is an interesting approach, but
it still cannot avoid congestion under high trafﬁc loads. There-
fore, we think that a ﬂow-control mechanism such as the one we
propose here to prevent congestion could enhance these results.
The authors in Jafri et al. (2010) address the problem of
backpressureless and backpressured networks, charting the
tradeoff between performance and energy consumption related
to buffering. They propose a novel adaptive ﬂow control
(AFC) router that dynamically adapts between backpressured
and backpressureless ﬂow control using three new schemes:
local contention thresholds, gossip-induced mode-switch, and
lazy virtual circuit allocation. These different mechanisms are
dynamically applied in the router according to the load state
of the network. Their approach was evaluated in a 3 * 3 mesh
network with a ﬂit size of 32 bits, demonstrating through sim-
ulation that the AFC routers operate in backpressureless mode
at low loads and as backpressured routers at high loads in
order to avoid the signiﬁcant energy/performance penalties
that each of these two ﬂow-control policies incurs when oper-
ating outside their sweet spots. Their proposed ﬂow-control
scheme looks very interesting as a concept. However, the
authors did not study the hardware characteristics of an imple-
mentation of their proposed router, and they did not check its
real impact on the end-to-end QoS, both of which would attest
to the efﬁciency of the method.
The authors in Lee and Bagherzadeh (2009) designed a
NoC router using 90 nm technology. The proposed circuit,
using a FIFO depth of eight ﬂits, was able to reach a frequency
of 425.5 MHz with a circuit area of 0.143 mm2 while maintain-
ing relatively low power consumption (2 mW). The circuit
design was oriented to reduce power consumption by applying
the clock-boosting concept. The presented architecture was not
based on deployment per class of service, and the authors did
not explore the capacity of the router to avoid congestion
under heavy data trafﬁc.
Lotﬁ-Kamran et al. (2010) developed a routing scheme
called EDXY to share trafﬁc on the network and avoid over-
loading certain hops while others remain unloaded. This solu-
tion can avoid congestion and enhance the received QoS
metrics, such as latency, for high injection rates. The circuit
has a data link of 32 bits, the same ﬂit size as used in our
approach. It has an area of 0.089 mm2 and the dynamic power
consumption is 26.1197 mW. The approach presented in this
paper looks attractive, but the hardware features of our solu-
tion are more interesting for SoC design.
Comparing our solution to the XHiNoC circuit studied in
Samman et al. (2009), when using the same integration tech-
nology, our designed router occupies a bigger area. The XHi-
NoC router for a FIFO size of eight ﬂits of 38 bits requires an
area of 0.108 mm2 using 130 nm integration CMOS technol-
ogy and reaches a maximum frequency of 453 MHz. We think
this difference in router area is due to the adoption of central
memory for service classiﬁcation in our approach. We note
that in exchange for a larger area, our circuit offers more band-
width and ensures higher efﬁciency and granularity of ﬂow-
control for enhanced QoS delivery.
In Wiklund and Liu (2003), the SoCBuS NoC was presented
as a solution designed for real-time applications. The circuit
clocks at 1.2 GHz, within the same frequency range as our solu-
tion, and it offers high available bandwidth between routers and
enables real-time services. However, the proposed circuit imple-
ments no ﬂow control for congestion management.
In Jovanovic et al. (2007), the CuNoC solution was presented for communication in a Multi-Processor SoC. The circuit was evaluated for FPGA prototyping. Interestingly, it provides high throughput (more than 500 Mbits) using a data link of 16 bits. However, the proposed solution integrates
no speciﬁc mechanism for congestion management or QoS
enhancement, so we think that our approach remains more
interesting under high data trafﬁc.
The authors in Michelogiannakis and Dally (2009) detailed
the design of a router with an elastic buffer based on a master-
slave ﬂip–ﬂop architecture. They presented three different
hardware architectures (baseline two stage, enhanced two
stage, and single stage) to implement this elastic buffer in order
to optimize ﬂit transaction time in the router. The application
of a single-stage approach signiﬁcantly reduced the latency
compared with the two-stage approaches. The authors of this
paper did not apply a speciﬁc ﬂow-control mechanism and
did not study the end-to-end QoS achieved by their
approaches. Thus, we suggest that their proposed approaches
be enhanced with an efﬁcient ﬂow-control mechanism to man-
age QoS, such as the one presented in this paper.
9. Conclusions and future work
In this paper, we have addressed the problem of ﬂow con-
trol for congestion management and for efﬁcient QoS deliv-
ery in NoCs. We proposed an enhanced architecture that
allows per-ﬂit differentiation and offers more processing
granularity during ﬂit transactions. Two ﬂow-control
schemes were studied to avoid and solve congestion with
low impact on QoS.
The ﬁrst ﬂow-control scheme involves an output buffer to
delay ﬂit forwarding to a congested router based on a signal-
ing process between neighbors. The second studied scheme
applies a feedback-signaling mechanism to the source core
to reduce the injection rate. These two proposed ﬂow-
control schemes were applied alongside a WRED-like algo-
rithm that selects low-importance ﬂits to be dropped out of
memory during congestion avoidance and resolution. This
algorithm to unload the memory of congested routers is
intended to reduce distortion on the QoS at the reception
level. We measured the performance of the two proposed
ﬂow-control schemes when applied to a multiprocessor archi-
tecture, which demonstrated the efﬁciency of the two pro-
posed schemes. In particular, the feedback-signaling scheme
has shown attractive performance compared to the output
buffer ﬂow-control process.
The hardware features of the proposed router have been
checked for ASIC circuit prototyping with 45 nm integration
technologies. The obtained results show that our approach
outperforms many similar solutions for on-chip communica-
tion while ensuring attractive end-to-end QoS.
For future work, we think that a dynamic flow-control
mechanism that takes into account the constraints of the
specific data stream overloading the network could be an
interesting way to extend the scalability of the proposed
router. In addition, we believe that designing the router
with an asynchronous internal architecture could reduce its
power consumption, which might make it more attractive for use
in low-power applications.
References
Aci, C.I., Akay, M.F., 2010. A new congestion control algorithm for
improving the performance of a broadcast-based multiprocessor
architecture. J. Parallel Distrib. Comput. 70, 930–940.
Adel, Soudani, Aldammas, Ahmed, Al-Dhelaan, Abdullah, 2014.
Efﬁcient scheme for congestion control in network on chip with
QoS consideration. J. Circuits Syst. Comput. 23, 09.
Ascia, G., Catania, V., Palesi, M., Patti, D., 2006. A new selection
policy for adaptive routing in network on chip. In: Proceedings of
the 5th WSEAS International Conference on Electronics, Hard-
ware, Wireless and Optical Communications, pp. 94–99.
Barbera, M., Lombardo, A., Schembra, G., Trecarichi, A., 2008.
Improving fairness in a WRED-based DiffServ network: a ﬂuid-
ﬂow approach. Perform. Eval. 65, 759–783.
Becker, D.U., Jiang, N., Michelogiannakis, G., Dally, W.J., 2012.
Adaptive backpressure: efﬁcient buffer management for on-chip
networks. In: IEEE 30th International Conference on Computer
Design (ICCD), pp. 419–426.
Bolotin, E., Cidon, I., Ginosar, R., Kolodny, A., 2004. QNoC: QoS
architecture and design process for network on chip. J. Syst.
Architect. 50, 105–128.
Chen, Y.-J., Yang, C.-L., Chang, Y.-S., 2009. An architectural co-
synthesis algorithm for energy-aware network-on-chip design. J.
Syst. Architect. 55, 299–309.
Daneshtalab, M., Ebrahimi, M., Liljeberg, P., Plosila, J., Tenhunen,
H., 2012. A systematic reordering mechanism for on-chip networks
using efﬁcient congestion-aware method. J. Syst. Architect.
David, L., Sood, M., Kajla, M.K., 2011. Router based approach to
mitigate DOS attacks on the wireless networks. In: Proceedings of
the 2011 International Conference on Communication, Computing
& Security, pp. 569–572.
Ebrahimi, M., Daneshtalab, M., Liljeberg, P., Plosila, J., Tenhunen,
H., 2012. CATRA-congestion aware trapezoid-based routing
algorithm for on-chip networks. In: Design, Automation & Test
in Europe Conference & Exhibition (DATE), pp. 320–325.
Jafri, S.A.R., Hong, Y.-J., Thottethodi, M., Vijaykumar, T., 2010.
Adaptive ﬂow control for robust performance and energy. In:
Proceedings of the 2010 43rd Annual IEEE/ACM International
Symposium on Microarchitecture, pp. 433–444.
Jovanovic, S., Tanougast, C., Weber, S., Bobda, C., 2007. CuNoC: a
scalable dynamic NoC for dynamically reconﬁgurable FPGAs. In:
International Conference on Field Programmable Logic and
Applications, FPL 2007, pp. 753–756.
Kaddachi, M.L., Soudani, A., Tourki, R., 2008. Signaling approach
for NOC quality of service requirements. In: 2nd International
Conference on Signals, Circuits and Systems, SCS 2008, pp. 1–5.
Kumar, A., Mahapatra, R.N., 2005. An integrated scheduling and
buffer management scheme for input queued switches with ﬁnite
buffer space. Comput. Commun. 29, 42–51.
Lee, S.E., Bagherzadeh, N., 2009. A variable frequency link for a
power-aware network-on-chip (NoC). Integr. VLSI J. 42,
479–485.
Lee, J.W., Ng, M.C., Asanović, K., 2012. Globally synchronized
frames for guaranteed quality-of-service in on-chip networks. J.
Parallel Distrib. Comput. 72, 1401–1411.
Lotﬁ-Kamran, P., Rahmani, A.-M., Daneshtalab, M., Afzali-Kusha,
A., Navabi, Z., 2010. EDXY–A low cost congestion-aware routing
algorithm for network-on-chips. J. Syst. Architect. 56, 256–264.
Michelogiannakis, G., Dally, W.J., 2009. Router designs for elastic
buffer on-chip networks. In: Proceedings of the Conference on
High Performance Computing Networking, Storage and Analysis,
pp. 1–10.
Mishra, A.K., Yanamandra, A., Das, R., Eachempati, S., Iyer, R.,
Vijaykrishnan, N., Das, C.R., 2011. RAFT: a router architecture
with frequency tuning for on-chip networks. J. Parallel Distrib.
Comput. 71, 625–640.
Ni, L.M., McKinley, P.K., 1993. A survey of wormhole routing
techniques in direct networks. Computer 26, 62–76.
Palesi, M., Kumar, S., Holsmark, R., 2006. A method for router table
compression for application speciﬁc routing in mesh topology NoC
architectures. In: Embedded Computer Systems: Architectures,
Modeling, and Simulation. Springer, pp. 373–384.
Peh, L.-S., Dally, W.J., 2000. Flit-reservation ﬂow control. In:
Proceedings Sixth International Symposium on High-Performance
Computer Architecture, HPCA-6, pp. 73–84.
Rijpkema, E., Goossens, K., Rădulescu, A., Dielissen, J., van
Meerbergen, J., Wielage, P., Waterlander, E., 2003. Trade-offs in
the design of a router with both guaranteed and best-effort services
for networks on chip. IEE Proc. Comput. Digital Tech. 150, 294–
302.
Samman, F.A., Hollstein, T., Glesner, M., 2009. Networks-on-chip
based on dynamic wormhole packet identity mapping management.
VLSI Des. 2009, 2.
van den Brand, J.W., Ciordas, C., Goossens, K., Basten, T., 2007.
Congestion-controlled best-effort communication for networks-on-
chip. In: Proceedings of the Conference on Design, Automation
and Test in Europe, pp. 948–953.
Wang, Chifeng, Bagherzadeh, Nader, 2014. Design and evaluation of a
high throughput QoS-aware and congestion-aware router architec-
ture for network-on-chip. Microprocess. Microsyst. 38, 304–315.
Wang, C., Hu, W.-H., Bagherzadeh, N., 2012. Scalable load balancing
congestion-aware network-on-chip router architecture. J. Comput.
Syst. Sci.
Wang, J., Gu, H., Yang, Y., Wang, K., 2013. An energy-and buffer-
aware fully adaptive routing algorithm for network-on-chip.
Microelectron. J. 44, 137–144.
Wiklund, D., Liu, D., 2003. SoCBUS: switched network on chip for
hard real time embedded systems. In: Proceedings International on
Parallel and Distributed Processing Symposium, 8 pp.
