New Theory for Deadlock-Free Multicast Routing in Wormhole-Switched Virtual-Channelless Networks-on-Chip by Samman, Faizal Arya, Dr.-Ing. et al.
New Theory for Deadlock-Free Multicast
Routing in Wormhole-Switched
Virtual-Channelless Networks-on-Chip
Faizal Arya Samman, Member, IEEE, Thomas Hollstein, Member, IEEE, and
Manfred Glesner, Fellow, IEEE
Abstract—A new theory for deadlock-free multicast routing especially used for on-chip interconnection network (NoC) is presented in
this paper. The NoC router hardware solution that enables the deadlock-free multicast routing without utilizing virtual channels is
introduced formally. The special characteristic of the NoC is that, wormhole packets can cut-through at flit-level and can be interleaved
in the same channel with other flits of different packets by multiplexing it using a rotating flit-by-flit arbitration. The routing paths of each
flit can be guaranteed correct because flits belonging to the same packet are labeled with the same local Id-tag on every
communication channel. Hence, multicast deadlock problem can be solved at each router by further applying a hold-release tagging
mechanism to control and manage conflicting multicast requests.
Index Terms—Network-on-chip, tree-based and multipath-based multicast routing, Id-tag-based wormhole packet switching, runtime
adaptive routing and scheduling.
Ç
1 INTRODUCTION
EFFICIENT routing is a very important aspect to optimize theperformance and communication energy of data trans-
ports in the NoC-based multicore processor systems.
Historically, the first generation multicomputers supported
only unicast communication (a single PE sends a message to
a single PE unit). Collective communication services have
been nowadays implemented in the recent multicomputers.
The collective communication services can be in forms of
multicast (the same message is sent from a source node to an
arbitrary number of destination nodes), scatter (different
messages are sent from a source node to an arbitrary number
of destination nodes), and broadcast (the samemessage is sent
from a source node to all nodes in the network). The
implementation of the collective communication services
will not only simplify the programming model in software-
layer but also will perform efficient routing in terms of the
latency and energy of the data transmission.
Multicast service has been intensively used in large-scale
multiprocessor systems and is a standard service of some
data parallel computer languages. Data parallel languages
such as Fortran D [1], Distributed Fortran 90 [2], and High-
Performance Fortran [3] have included multicast data
transport routines for users to develop parallel computation
and algorithms. In the single-program multiple-data
(SPMD) programming model, multicast communication is
of benefit, where the same program is executed on different
processors with different data, and several data are
processed in parallel. In a data parallel programming
model, a variety of process control operations and global
data movement such as reduction, replication, permutation,
segmented scan, and barrier synchronization requires the
collective communication models.
In a distributed shared memory paradigm, multicast
services maybe used to efficiently support shared data
invalidation and updating. Some libraries commonly used
to develop parallel computer programs based on message
passing programming model and shared memory program-
ming model such as Message Passing Interface (MPI) [4], [5]
and Parallel Virtual Machine (PVM) [5], [6] provide also
multicast data transport routines and procedures to design
explicitly parallel computation and communications. Both
MPI and PVM libraries are available for C and Fortran
Computer Languages. Numerous parallel algorithms, e.g.,
parallel search and parallel graph algorithms, have taken
also the advantages of the multicast service [7]. By using
software implementation, a multicast message can be sent
to the network by replicating separate copies of the message
from source to every destination node (unicast-based
multicast delivery). However, this approach leads to
inefficient transmission time and workload energy.
In order to achieve the effectiveness of the multicast data
communication, the multicast service must be implemented
not only in the higher protocol layers (software layers) but
also in the lower layers (hardware layers). In the NoC
544 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 22, NO. 4, APRIL 2011
. F.A. Samman is with the FB Elektrotechnik und Informationstechnik,
Research Group on Microelectronic Systems, Technische Universita¨t
Darmstadt, Merckstr. 25, 64283 Darmstadt, and the Fraunhofer Institut
LBF, LOEWE-Zentrum AdRIA (Adaptronik-Research, Innovation, Appli-
cation), Bartningstr. 53, 64289 Darmstadt, Germany, and also with the
Deparment of Electrical Engineering, Hasanuddin University at Makassar,
Indonesia. E-mail: faizal.samman@mes.tu-darmstadt.de,
faizal.samman@loewe-adria.de.
. T. Hollstein is with Department of Computer Engineering, Dependable
Embedded Systems Group, Tallinn University of Technology, Estonia.
E-mail: thomas.hollstein@ies.tu-darmstadt.de.
. M. Glesner is with FB Elektrotechnik und Informationstechnik, Research
Group on Microelectronic Systems, Technische Universitaa¨t Darmstadt,
Merckstr. 25, 64283 Darmstadt, Germany.
E-mail: manfred.glesner@mes.tu-darmstadt.de.
Manuscript received 2 July 2009; revised 7 Jan. 2010; accepted 3 Apr. 2010;
published online 1 June 2010.
Recommended for acceptance by J. Flich.
For information on obtaining reprints of this article, please send e-mail to:
tpds@computer.org, and reference IEEECS Log Number TPDS-2009-07-0297.
Digital Object Identifier no. 10.1109/TPDS.2010.120.
1045-9219/11/$26.00  2011 IEEE Published by the IEEE Computer Society
context, these layers refer to network interface layer
(transport layer) and on-chip routing layer (network,
routing, and physical layers). The main problem of the
multicast service implementation in the on-chip router is
multicast dependencies between packets in the router that
can lead to multicast deadlock configuration.
The remaining sections of the paper is organized as
follows: Section 2 presents related work and our motivations
to introduce the new theory. Section 3 defines the main
problem with some general definitions. Section 4 proposes
the technical solution and formalism of the hardware-based
solution. The formal descriptions of some definitions,
lemmas followed by the proof of the new theory are shown
in Section 5. Section 6 finally presents the concluding
remarks about the new theory.
2 RELATED WORKS AND CONTRIBUTIONS
Some theories and methodologies to perform a deadlock-
free multicast routing have been presented in [7], [8], [9],
[10], [11], and [12]. The works in [7], [8], [9] specially present
the theory for deadlock-free path-based multicast routing,
while the works in [10], [11], and [12] present the theory for
deadlock-free tree-based multicast routing. However, the
multicast theory and routing protocol presented in the
aforementioned works are not dedicated for NoCs.
There have been some other works that have introduced
multicast routing services dedicated for NoCs. The work in
[13], for example, presents a Multicast Router Rotary (MRR).
The multicast routing algorithm in the MRR can be
classified into a distributed routing method. The multicast
contention in MRR is solved by implementing two single-
direction internal rings in the switch, one in clockwise
direction and the other one in counterclockwise direction.
Without careful data-flow rule, a dangerous permanent
deadlock can occur especially when packets come from all
different input ports, and each of them requests simulta-
neously all output ports. The proposed data-flow rule in the
MRR must even allow misrouting to avoid deadlock in a
case that a packet cannot find a free output port. In any
circumstance, misrouting can increase data communication
energy due to the overhead misrouting traffics, and can
lead to livelock situation. The work in [13] has not yet
addressed this livelock issue. Moreover, additional 10 in-
ternal buffers (5 for each ring) in the MRR will increase the
area overhead of the router.
The work in [14] presents a broadcast-multicast-enabled
Logic-based Distributed Routing (bLBDR). Another routing
approach called Recursive Partitioning Multicast (RPM)
method is also presented in [15]. The bLBDR and RPM
methods need global network view and preprocessing
algorithm for network partitioning. The bLBDR uses
configuration bits at each router and broadcast/multicast
operations are computed at every router. When using
wormhole switching method, concurrent broadcast using
the bLBDR can lead to deadlock. Since the bLBDR proposes
a tree-based multicast routing inside region, our contribu-
tion in this paper can be a complementary solution for the
deadlock problem. In the RPMmethod, a routing decision is
made based on the current network partitioning that has
been previously computed recursively, and the multicast
deadlock problem is solved by using virtual channels. The
whole network is divided into at most eight subnets by the
source node. The objective of partitioning the network is to
minimize packet replication time.
A Virtual Circuit Tree Multicasting (V CTM) method is
presented in [16]. In the VCTMmethod, a setup packet must
be sent in the network to configure a switched tree-based
multicast virtual circuit. Like the RPM method [15], the
VCTM solves the multicast deadlock problem by using
virtual channels (VCs). The main critics of the use of VCs in
theNoCcontext areprohibitive area cost in termsof buffering
and/or potential slower speed of router cycle time. VCs will
increase total buffer counts and result in power consumption
that would exceed the target constraint for an embedded
application [19]. The work in [17] has found that additional
VCs do not increase router cycle time because the router
complexity slows down the router working period. VCs can
increase delay in router’s critical path due to extra arbitra-
tions, thus it potentially affects the cycle time or pipeline
depth of the router [20]. The same result is presented in [18],
in which additional arbitration and multiplexing circuits for
VCs on physical channels introduce delay into the critical
path in implementing alternate routing algorithms.
The MRR [13], bLBDR [14], and VCTM [16] rely mainly
in virtual-cut-through networks to avoid multicast/broad-
cast deadlocks. The RPM [15] uses wormhole switching but
it uses virtual channels leading to higher area overhead. The
proposed solution in this paper is novel since it targets
wormhole-switching network without using virtual chan-
nels. This paper will present a new theory to solve the
multicast dependency problem suitable for wormhole-
switched NoCs without virtual channels, where the routing
algorithms used to route unicast and multicast packets are
the same, resulting in a very efficient routing function gate-
level implementation. The theory is practically applicable
and implementable directly in NoC routing hardware layer
using path-based and tree-based multicast routing. Runtime
local/distributed routing is used as the basis solution rather
than derive specific algorithms executed in software-level
(source routing). Hence, the solution based on our new
theory can be performed at runtime during application
execution time. The new theory must be supported by a
specific hardware infrastructure to back up the effectiveness
of the new theory. The formal descriptions of the hardware
base will be also described in this paper.
3 PROBLEM DEFINITION
Definition 3.1 (Network-on-Chip). A network-on-chip (NoC)
can be represented as a graph Gð<;Þ, where  is represented
as a set of edges (communication links) and < is represented as
a set of vertices (router nodes).
Definition 3.2 (Router). NoC consisting of Nnode number of
node will have a set of NoC Router < ¼ fR1; R2; . . . ; RNnodeg
or Rc 2 <jc ¼ f1; 2; . . . ; Nnodeg.
Definition 3.3 (Communication Link). Communication link
Li;j 2  is a communication link connecting router node Ri
and Rj where Ri;Rj 2 <, and i; j ¼ f1; 2; . . . ; Nnodeg.
The number of link components in the set < depends on
network-on-chip topology. We can describe that :8i; j such
SAMMAN ET AL.: NEW THEORY FOR DEADLOCK-FREE MULTICAST ROUTING IN WORMHOLE-SWITCHED VIRTUAL-CHANNELLESS... 545
that Li;j exists. Fig. 1 shows an example of a NoC in 3 2
mesh topology with full duplex links connection. Based on
Definition 3.1, we can see that the router set is
< ¼ fR1; R2; R3; R4; R5; R6g, while the set of the commu-
nication resources is defined as  ¼ fhorz;vertg, where
horz ¼ fL1;2; L2;1; L2;3; L3;2; L4;5; L5;4; L5;6; L6;5g and vert ¼
fL4;1; L1;4; L2;5; L5;2; L3;6; L6;3g.
Definition 3.4 (Router IO Port). For Ninp number of input
ports and Noutp number of output ports of a router R 2 <, we
can describe the set of input ports as I ¼ fI1; I2 . . . ; INinpg and
the set of output ports as O ¼ fO1; O2 . . . ; ONoutpg. Hence, if we
define ¼ f1; 2; . . . ; Ninpg and’ ¼ f1; 2; . . . ; Noutpg, then we
can define an input port of router as In 2 I jn 2 , and an
output port of a router as Om 2 Ojm 2 ’. We can further
define that 8n: In ¼ n and 8m: Om ¼ m.
Multicast deadlock configuration is a situation in which
multicast packets, which are switched in the network with
wormhole switching technique, cannot move further due to
multicast dependency occurs in some NoC routers. The
multicast dependency occurs because two or more multicast
packets are competing each other to access the same output
ports in any NoC router, while in other NoC routers, the
same situation occurs, i.e., the same competing multicast
packets compete also to acquire the same output ports. In
order to understand a better insight about the multicast
deadlock configuration, an example of the multicast dead-
lock configurations are presented in Fig. 2. The deadlock
configuration when using tree-based multicast routing with
wormhole switching method is shown in Fig. 2a. In node
(2,2), message A cannot move further because the East and
West output ports are acquired by message B. Meanwhile,
message B cannot advance in node (2,1), because the East
and West output ports are acquired by message A. The
multicast dependencies of the contenting multicast packets
inmany network nodes can lead to a deadlock configuration.
In a path-based multicasting, a source node arranges the
ordered list of headers containing destination address.
When a message is injected to the network, it will be routed
to a destination node according to the address attached in
the leading header flit. When the message arrives the
destination node, the leading header flit is removed. Hence,
the next header will be the leading flit and guides the
message into the next destination. The path-based multi-
casting is a mechanism to avoid branching in intermediate
nodes. Two branches of the paths are formed only in
destination nodes, i.e., one to Local port and one to another
port. In the source node, the number of branches can be
maximum two for dual-path and more than two for the
multipath-based multicasting. Deadlock configuration can
occur when a multicast router does tree branch (even by
using dual/multipath multicast) as shown in Fig. 2b.
4 TECHNICAL SOLUTION AND FORMAL
DESCRIPTION
4.1 Multicast Contention Control and Management
Deadlock configurations because of multicast contentions
can occur not only in tree-based but also in multipath-based
multicast routing. The deadlock occurs in an intermediate
node when one or more outgoing links are simultaneously
requested by the same multicast packet. We propose a novel
scheduling method called Hold-Release Tagging Mechanism
for fair multicast scheduling policy to manage the multicast
contention for a NoC without using virtual channels. The
methodology is suitable for NoC router supporting simul-
taneous parallel crossbar interconnection, where a routing
unit and an arbitration are distributed on every input port
and every output port, respectively.
The philosophy of the Hold-Release Tagging Mechanism is
as follows: “If a multicast flits from an input port n has an
Nreqs;n number of requests at any instant of time ts, then each
single request to an output port m can be forwarded from
546 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 22, NO. 4, APRIL 2011
Fig. 1. A NoC in a 2D 3 2 mesh network graph.
Fig. 2. Multicast deadlock configurations when using tree-based and path-based multicast routing in mesh networks. (a) Deadlock in tree-based
multicast routing. (b) Branching in dual/multipath multicasting.
the input port n to the output port m in the next time stage
only if it receives a grant by an arbitration unit at input port
n, while the other requests must be held in the input port if
it is not granted by their requested output ports. In each
next time stage, a single request, which has been granted
before, must be reset to prevent improper flit replication. If
all requests have been granted, then the multicast flit can be
released from the queue in input port n.”
In order to support such mechanism, the flits of the
wormhole packets must be able to be interleaved/multi-
plexed flit-by-flit to share the same link with other flits
belonging to different wormhole packets (cut-through at
flit-level). The “wormhole cut-through switching” is im-
plementable by attaching a local ID-tag on each flit of
packets, where flits belonging to the same packets will have
the same local ID-tag on every local communication channel
[24], [25], [26]. Furthermore, an arbitration unit distributed
at every output port m must rotate (circulate) its selection
among requests at instant time ts.
Fig. 3 presents successive snapshots on how the
proposed method solves the deadlock problem due to the
multicast dependency. In the Snapshots 2, 4, and 6, we can
see how the selection of each arbiter unit at every output
port is rotated among requests from input ports. In
Snapshot 3, we can see how requests, which are not granted
in the previous grant step (Snapshot 2), are not forwarded
to the output ports (depicted in dashed lines). Snapshot 5
shows how requests that have been granted are reset to
prevent improper flit replication. In order to guarantee the
correct routing path and proper replication in the hold-
release tagging mechanism, both issues will be discussed
formally in Sections 5.1 and 5.2, respectively.
4.2 Specific Packet Format
Fig. 4 presents the generic specific packet format that should
be used to perform a deadlock-free multicast routing and to
enable the hold-release tagging mechanism described in
Section 4.1. A message or a streaming datum is divided into
flits. The total bit-width of each flit of the message or
streaming data is Btotal ¼ Btype þBtag þBword, where Btype is
the bit-width of the flit-type field, Btag is the bit-width of the
id-tag field, and Bword is the bit-width of the data word.
Formally, we will give formal definition for unicast and
multicast message or streaming data.
Unicast message/stream is single packet consisting of
one header flit, databody flits, and one tail flit, while
multicast message/stream consists of more than one header
flits. Even if the size of the message/stream is extremely
large, it has only one tail flit. In other words, a message or
data stream is assembled in a single packet.
A number of header flit Nhf represents the number of the
multicast destination Ndest (Nhf ¼ Ndest). The header flit
contains information of the source address from which node
themessage is injected and the target address towhich nodes
the message will be sent. If Ndf number of data flits will be
sent to Ndest number of multicast target, then the total
number of flits injected to the NoC is NFlit ¼ Ndf þNhf .
Definition 4.1 (Data Flit). A data flit coming from input port n
is represented as Fnðtype; IdÞ that consists of a data word with
additional flit-type field (type) and local Id-tag field (Id). Each
flit will always bring a data word together with its type and its
local (see Fig. 4).
Definition 4.2 (Flit type). The type field represents the type of
each flit. The flit-type can be a header, a databody, a tail flit, or
a response flit (type 2 "typej"type ¼ fheader; databody; tail;
responseg).
Definition 4.3 (Flit Id-tag). Id-tag field present on each flit is a
local label (Id-tag) to indicate and differentiate the flit from
different flits. Flits belonging to the same message or streaming
data will always have the same local Id-tag on each
communication link Li;j 2 . The value of the local Id-tag is
defined as Id 2 j ¼ f0; 1; 2; . . . ; Nslot  1g where Nslot is
number of available Id slot on communication link Li;j 2 .
See also later the definition of the local Id slot in Definition 4.4.
Definition 4.4 (Local Id Slots). Each communication link
Li;j 2  has Nslot number of available local Id slots, which is
defined as a set i;j ¼
  (see Definition 4.3). If we assume that
all communication links has equal set of available Id slots then
the term i;j ¼ . We define a single local Id slot k 2 , where
k ¼ Nslot  1 is reserved for flow-control purpose, and the
usable Id slots are 8k 2  \ k 6¼ Nslot  1.
4.3 Specific Routing Organization
Definition 4.5 (Routing Engine). Routing Engine (RE) is a
component to make a routing direction rdir 2 D where
SAMMAN ET AL.: NEW THEORY FOR DEADLOCK-FREE MULTICAST ROUTING IN WORMHOLE-SWITCHED VIRTUAL-CHANNELLESS... 547
Fig. 3. Hold-Release tagging mechanism for a fair multicast flit
arbitration and scheduling policy.
Fig. 4. Specific packet format.
D ¼ f1; 2; . . . ; Noutpg. For Ninp number of router input ports,
then the number of RE per router is NRE ¼ Ninp. We can
describe that the RE at input port In 2 I as En; n 2 I . The
En provides a routing function fRE giving routing direction
D. We can defined fRE: rdir ¼ fREðtype; Id;AdestÞ, where
Adest is the destination address field present in the header flit.
The RE (En) at input port n 2  consists of combination
of a routing state machine M and routing reservation table T .
Accordingly, there will be pairs of ðMn; TnÞj8n 2 . Both
components are allocated at each input port. Operation in
the Routing Engine depends on the type of the data flit
Fnðtype; IdÞ. The type of flit will also determine which
component (M or T ) that will give a routing direction.
Definition 4.6 (Routing State Machine). The routing state
machine (RSM) provides a routing function fRSM where the
output of the function depends on the destination address
appear on the target address field Adest of a header flit. If a
header flit is detected by the routing state machine, then the
routing direction of the flit is computed with routing function
fRSM : fRSMðAdestÞ ) rdir, where Adest is the destination
address present on the header flit and rdir 2 D.
Definition 4.7 (Routing Reservation Table). A Routing
Reservation Table (RRT) of the RE unit at an input port is
defined as
T ðkjk 2 Þ ¼ rdir 2 D ¼ 1; 2; 3; . . . ; Noutp
 
; ð1Þ
or T ðkÞ 2 Djk 2 . Definition of D can be found in
Definition 4.6. So, we can define that 8k 2 , the value
of the Routing Table T ðkÞ is a routing direction rdir 2 D.
The array structure of the RRT T can be seen in Fig. 5. As
shown in Algorithm 1, when a header flit Fn;mðheader; IdÞ is
coming from an input port n, a routing direction rdir is
computed by the RSM, and it is concurrentlywritten (copied)
in the slot number k in the RRT,where k ¼ Id (equal to the Id-
tag of a header flit) such that T ðIdÞ ¼ rdir. When a databody
Fn;mðheader; IdÞ or tail flit Fn;mðheader; IdÞ is coming to the
input portn, then the rdir is taken from the slot number k ¼ Id
in the RRT. The operation rdir [ T ðIdÞ in the algorithm is the
union operation between the current content of the T ðIdÞ in
the slot number Id and the current value of the routing
direction rdir.
Algorithm 1. Runtime Id-based Routing Organization
Read Data Flit from Queue : Fnðtype; IdÞ
1: Ftype ( type
2: Adest is obtained from Header flits
3: if Ftype is header or response then
4: rdir ( fRSMðAdestÞ; T ðIdÞ ( rdir [ T ðIdÞ
5: else if Ftype is databody then
6: rdir ( T ðIdÞ
7: else if Ftype is tail then
8: rdir ( T ðIdÞ; T ðIdÞ ( ;
9: end if
4.4 Specific Switching and Arbitration Method
The hardware solution to overcome the multicast deadlock
configuration is realized by organizing locally the Id-tag of
each multicast message on each communication link. The
local Id-tag of each branch of the multicast tree is variable
and organized by an Id-tag management unit. Fig. 6 shows
how the local Id-tags of all multicast tree branches are
dynamically assigned, where a different message must be
allocated to different Id slot in the same one-directional
link. The number in each one-directional link is the Id slot,
to which the message is allocated. Every communication
link is shared fairly by applying a rotating flit-by-flit
arbitration as defined in Definition 4.11. Section 4.5 will
formally describe the other specific hardware requirement
to solve the multicast problem.
Definition 4.8 (Time-Varying Binary Request). A time-
varying input-output request from an input port n 2  to an
output port m 2 ’ is defined as rn;mðtÞ: 8n;mjrn;mðtÞ 2
f0; 1g. Hence, the time-varying input n binary request can
be defined as array of binary or rn;m¼1 to NoutpðtÞ or rn;m¼1:NoutpðtÞ
or rn;1:NoutpðtÞ, and the time-varying outputm binary request
is defined also as rn¼1 to Ninp;mðtÞ or rn¼1:Ninp;mðtÞ or r1:Ninp;mðtÞ.
Definition 4.9 (Set of Requests from an Input Port). A set of
requests from an input port n to output ports is defined as
548 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 22, NO. 4, APRIL 2011
Fig. 5. Local Id slots per link.
Fig. 6. Flexible runtime interconnect scheduling and switching based on
locally variable/dynamic Id-tagging.
’reqn , thus the number of set members N
req
s;n is defined as the
number of requests from an input port n to output ports at
time stage ts. Further, we can define the definitions in the
following equation:
’reqn  ’, Nreqs;n ¼
XNoutp
m¼1
rn;mðtsÞ < Noutp
’reqn  ’, Nreqs;n ¼
XNoutp
m¼1
rn;mðtsÞ ¼ Noutp
’reqn ¼ ; , Nreqs;n ¼
XNoutp
m¼1
rn;mðtsÞ ¼ 0
h 2 ’reqn , rn;hðtÞ ¼ 1:
ð2Þ
For example, if rn;1:NoutpðtsÞ ¼ ½1 0 1 1 0 or rn;1ðtsÞ ¼ 1,
rn;2ðtsÞ ¼ 0, rn;3ðtsÞ ¼ 1, rn;4ðtsÞ ¼ 1 and rn;5ðtsÞ ¼ 0, then
Nreqs;n ¼ 3 and ’reqn ¼ f1; 3; 4g.
Definition 4.10 (Set of Requests to an Output Port). A set of
requests to an output port m from input ports is defined as
reqn , thus the number of set members N
req
s;m is defined as the
number of requests to an output port m from input ports at
time stage ts. Further, we can define the definitions in the
following equation:
reqm  , Nreqs;m ¼
XNinp
n¼1
rn;mðtsÞ < Ninp
reqm  , Nreqs;m ¼
XNinp
n¼1
rn;mðtsÞ ¼ Ninp
reqm ¼ ; , Nreqs;m ¼
XNinp
n¼1
rn;mðtsÞ ¼ 0
l 2 reqm , rl;mðtÞ ¼ 1:
ð3Þ
For example, if r1:Ninp;mðtsÞ ¼ ½0 1 1 0 1 or
r1;mðtsÞ ¼ 0, r2;mðtsÞ ¼ 1, r3;mðtsÞ ¼ 1, r4;mðtsÞ ¼ 0 and
r5;mðtsÞ ¼ 1, then Nreqs;n ¼ 3 and reqm ¼ f2; 3; 5g.
Definition 4.11 (Rotating Flit-by-Flit Arbitration). A
rotating flit-by-flit arbitration is an arbitration strategy at
every output port m that circulates its selection among input
ports l 2 reqm according to Definition 4.10 and (3) in flit-by-
flit manner.
Algorithm 2 shows the rotating/circulating flit-by-flit
arbitration at every output port m. At initial time (no
request at all), the arbiter will first serve the first coming flit.
When many flits need arbitration service at the same cycle,
the arbiter will rotate selection like round-robin arbiter, but
the arbitration is done in flit-by-flit and active-port-by-
active-port manner. Active port refers to an input port
having request for routing. It means that the arbiter
circulates selection but will not select input ports having
no data flit.
Algorithm 2. Rotating Flit-by-Flit Arbitration
Binary Request : rn 2 f0; 1g : n ¼ 1; 2; . . . ; Ninp
Initial Value : RotStart ¼ Ninp, RotStop ¼ 1
1: while 9n : rn 6¼ 0 & the next queue is :Full do
2: for n ¼ RotStart down to n ¼ RotStop do
3: if rn ¼ 1 then
4: select( n; RotStart ¼ n 1; RotStop ¼ n
5: end if
6: end for
7: end while
8: Result: select) In
Definition 4.12 (Rotating Arbitration Time). Input port
selection in every output port m will be recirculated in every
Ts;m þ 1, where Ts;m ¼ Nreqs;m is a Rotating Arbitration Time.
Definition 4.13 (Binary Output Acknowledgment). We
have defined the time-varying outputm binary routing request
in Definition 4.8. Thus, we can now define the Request-
Dependent BinaryOutputAcknowledgment as a1:Ninp;mðtÞ.
For 8njan;mðtÞ 2 f0; 1g where only one element of a1:Ninp;mðtÞ
can be set to ’1’ because of natural hardware constraint of an
arbiter unit.WhenNreqs;m > 1, then physically it means that there
is contention between Nreqs;m number of flits between input ports
in the set reqm to access the same output portm. For example, if
the binary request r1:5;mðtsÞ ¼ ½0 1 0 1 1T , then accord-
ing toDefinition 4.12 and (3), Ts;m ¼ 3. The set of possible order
of the rotating arbitration at output port m is a1:5;mð1Þ¼
½0 0 0 0 1T , a1:5;mð2Þ ¼ ½0 0 0 1 0T , a n d
a1:5;mð3Þ ¼ ½0 1 0 0 0T . According to Definition 4.10,
reqm ¼ f2; 4; 5g. We can see that r1:5;mðtsÞ ¼
St¼3
t¼1 a1:5;mðtÞ.
4.5 Local Id-Tag Management
For each communication resource (link/channel) Li;j con-
necting an outport port Om 2 O of an on-chip router Ri
with an input port In 2 I of an adjacent (neighbor) router
Rj, then an amount of local identity (Id) slots N
i;j
slot is
implemented on the communication link Li;j 2 .
Definition 4.14 (ID Slot Table). ID Slot Table State is defined
as a set of Slot State S ¼ fS0; S1; S2; . . . ; SNslot1g and 8k 2
 : Sk 2 ftrue; falseg (See Definition 4.4). If Sk ¼ true, it
means that the Id Slot state is “free,” else if Sk ¼ false, then
the Id Slot is being “used” by any message. We can define the
ID Slot Table as
8k 2  : SðkÞ ¼ ðIdold; FfromÞ 2 ð;Þ: ð4Þ
The set  ¼ f1; 2; 3; . . . ; Ninpg is in accordance with Defnition
3.4. We define Ffrom as the selected flit from any input port as
the arbitration result.
Fig. 5 can help to comprehend Definition 4.14 and
describes the structure of the ID Slot Table S. The figure
presents the ID Slot Table in output port of Router Ri and
the Routing Table in the input port of Router Rj. A
communication link Li;j connecting the output and input
port of the Ri and Rj. For the sake of simplicity and for
maintaining a design regularity, we assume in this paper
that Nijslot ¼ Nslot. Therefore, the number of available Id-tags
on each communication link is uniform.
As shown in Algorithm 3, when a header flit
Fnðheader; IdÞ with ID-tag Id coming from input port n is
switched to an output portm then an Id-tag update function
fIDM : ðIdoldÞ 7! Idnew is made, and a free Id slot is searched
for the header.When a free Id slot k is found, the input port n
and the Id-tag Id arewritten in the slot number k in theS, and
the header uses this Id slot k as its new Id-tag (Idnew ¼ k).
SAMMAN ET AL.: NEW THEORY FOR DEADLOCK-FREE MULTICAST ROUTING IN WORMHOLE-SWITCHED VIRTUAL-CHANNELLESS... 549
When the header fails to find a free Id slot, it will be assigned
to Id slot Nslot  1. When a databody Fnðdatabody; IdÞ or tail
Fnðtail; IdÞ flit having the same Id-tag with the previously
routed header Fnðheader; IdÞ flit is switched to the output
portm, then they will be assigned with the same new Id-tag
k. The databody and tail flits will be dropped from the
network if its header having the same Id-tag fails to reserve
an Id Slot k 2  \ k 6¼ Nslot  1 on a certain link. Like
response flits, header flits having ID-tag Nslot  1 will be
always routed in the NoC with the ID-tagNslot  1, in which
their paths can be guaranteed correct even when many
Fnðheader;Nslot  1Þ and Fnðresponse;Nslot  1Þ flits flow in
the NoC, since these flits are only single-flit, in which
destination address is attached in their address field and
independent from the routing table contents.
Algorithm 3. Runtime Local Id-tag Update
Outgoing Data Flit : Fnðtype; IdÞ
Input Arbitration : n ¼ f1; 2; . . . ; Ninpg
NfreeId : number of free Id slots
SNslot1 : Slot reserved for control purpose
1: Idold ( Id
2: if Ftype is header then
3: if Idold ¼ Nslot  1 then
4: Idnew ( Nslot  1
5: else if Idold 6¼ Nslot  1 then
6: for k ¼ 0 to k ¼ Nslot  2 do
7: if 9k : Sk is true then
8: SðkÞ ( ðIdold; FfromÞ
9: Sk ( false /* the Id Slot is used now */
10: Idnew ( k; NfreeId ( NfreeId  1
11: else
12: Idnew ( Nslot  1
13: end if
14: end for
15: end if
16: else if Ftype is databody then
17: for k ¼ 0 to k ¼ Nslot  1 do
18: if 9k : k 6¼ Nslot  1 : SðkÞ ¼ ðIdold; FfromÞ then
19: Idnew ( k
20: else if 6 9k : k 6¼ Nslot  1:SðkÞ¼ðIdold; FfromÞ then
21: Idnew ( ;; The Databody flit is dropped
22: end if
23: end for
24: else if Ftype is tail then
25: NfreeId ( NfreeId þ 1
26: for k ¼ 0 to k ¼ Nslot  1 do
27: if 9k : k 6¼ Nslot  1 : SðkÞ ¼ ðIdold; FfromÞ then
28: Idnew ( k; SðkÞ ( ð;; ;Þ
29: Sk ( true /* the Id Slot is now free */
30: else if 6 9k : k 6¼ Nslot  1 : SðkÞ ¼ ðIdold; FfromÞ
then
31: Idnew ( ;; The Tail flit is dropped
32: end if
33: end for
34: else if Ftype is response then
35: Idnew ( Nslot  1
36: end if
37: Idnew ) Id
4.6 Issue Related to Local ID Slots Scalability
The main issue related to the local ID-based method for
flexible communication media share is a runout of local ID
problem. Therefore, it is important to set the number of
available Id slots on each link such that all considered traffics
can be covered to flow on every link. In a certain NoC
topology, therewill be a fact that there are some links that are
never be used by packets sent from some computing element
cores to some destination cores. Let us take a case of a NoC in
2DN M mesh topology, and let us also set the address and
port namesof eachnode as (x; y;Op),wherex is the horizontal
address such that 0  x  N  1, y is the vertical address
such that 0  y M  1, and Op 2 fEðEastÞ; NðNorthÞ;
WðWestÞ; SðSouthÞ; LðLocalÞg. Thus, When each commu-
nication edge of the 2D N M mesh architecture is
implemented with full-duplex link, then the number of
available local Id slots in theNorth, South, East andWest port
can be set to a minimum number, i.e., less than N M
number of local Id slots.
When a minimal fully adaptive (MFA) routing algorithm is
used to route packets, then the minimum number of the
available Id slots that can be set to each output port of every
NoC router is shown in (5).
NMFAx;y;Op ¼
Mðxþ 1Þ; 0p ¼ E; 0  x < N  1;
Nðyþ 1Þ; 0p ¼ N; 0  y < M  1;
MðN  xÞ; 0p ¼W; 0 < x  N  1;
NðM  yÞ; 0p ¼ S; 0 < y M  1;
NM  1; 0p ¼ L;
0; otherwise:
8>>><
>>>:
ð5Þ
When a Static X-First (SXF) routing algorithm is used to
route packets, then the minimum number of the available Id
slots that can be set to the East andWest Port will be less, i.e.,
NSXFx;y;E ¼ xþ 1 when 0  x < N  1, and NSXFx;y;W ¼ N  x
when 0 < x  N  1. Equation (5) is derived with assump-
tion that a router node will not send packets to itself. The
minimum number of Id slots at the Local output port is set to
(NM  1) to anticipate an all-to-one communication, a
collective communication mode, where all nodes send
message to one target node. By setting the number of Id
slots at each output port according to (5), we can guarantee
that the Id slot runout problem in the 2D N M mesh
architecture can be avoided. Therefore, packet dropping and
retransmission protocol can be neglected because both
mechanismsmay end up in unfairness for some traffic flows.
The NM quantity represents the number of participating
computing elements in the NoC. If the number of processing
elements connected to some mesh nodes is larger than one,
then the (5)must be reformulated to cover all possible traffics
flowing through each outgoing port of the mesh nodes.
5 FORMALISM OF THE NEW THEORY
This section presents several definitions and lemmas that
will be used to induce the proof of the new theorem for the
deadlock-free multicast routing.
5.1 Correctness of the Routing Path Establishment
Based on Local Id-Tag Management
Lemma 1. By organizing the Id-tag of each flit of packets using
Algorithm 3 with local ID Slot Table defined in Definition 4.14,
thenwe can guarantee that flits belonging to the same packetwill
550 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 22, NO. 4, APRIL 2011
always have the same local Id-tag k 2  on each communication
link Li;j 2 .
Proof of Lemma 1. Based on Definitions 4.4 and 4.14, we
can see that the local Id slot k 2  is indexed by using
two variables, i.e., the Id-tag Id 2  of a message from an
input link and from which port n 2  the message come.
Further, we define Fnðtype; IdÞ as a flit from input port n
with local Id-tag Id.
If we make a preassumption, that every single Id slot k is
allocated for every flit belonging to the same message on
every link Li;j connected to an input port n, then
ð8v; w 2 i;jÞ \ ðv 6¼ wÞ, a flit Fnðtype; vÞ will not belong
to the same message with Fnðtype; wÞ, where i;j is the set
of local Id slots on the link Li;j. 8p; q 2  \ p 6¼ q, there is a
probability that Fpðtype; vÞ and Fqðtype; wÞ have the same
local Id-tag (v ¼ w), because the sets of local Id Slot on each
communication link is the same. But certainly Fpðtype; vÞ
and Fqðtype; wÞ are flits of different packets, although
v ¼ w; : p 6¼ q. If every single Id slot k is assigned to a new
packet by identifying twoparameters, i.e., fromwhichport
the packet come and its current old Id-tag, we can make
sure that ð8x; y 2 Þ \ ðx 6¼ yÞ ) SðxÞ ¼ ðv; pÞ 6¼ SðyÞ ¼
ðw; qÞ according to (4), : 8p; q 2 : p 6¼ q, or if p ¼ q (ports
connected to the same link Li;j) then v 6¼ w; 8v; w 2 i;j
according to the preassumption. x and y is the new Id-tags
for flit Fpðtype; vÞ and Fqðtype; wÞ, respectively.
Therefore, by further applying a local ID-tag manage-
ment described in Algorithm 3, we can make sure that
each differentmessage can be allocated to one Id slot k 2 
and guarantee that flits belonging to the same packet can
be assigned to the same local Id-tag k 2  on each
communication link. Accordingly, the proof makes also
the aforementioned preassumption to be a valid assumption
continuously on every communication link. tu
Lemma 2. By using Id-based routing mechanism as described in
Algorithm 1 and organizing the Id-tag of each flit of packets as
described in Algorithm 3, then each flit belonging to the same
packet can be routed to a correct routing direction rdir 2 D
although the flits are interleaved with other flits that belong to
other different message by applying the rotating flit-by-flit
arbitration described in Definition 4.11.
Proof of Lemma 2. Based on the Proof of Lemma 1, if we
have proved that different packets can be allocated to
different local Id Slot such that flits belonging to the
same packet will always have the same local Id-tag,
then by implementing ID Slot Table at every outgoing
port such that if 9Sk; SðkÞjk 2  on each communication
link Li;j 2  connecting routing node Ri;Rj 2 <, then
we can also further implementing a Routing Table T ðkÞ
(Definition 4.7) at every incoming port, which routes
flits of packets based on their Id-tag k in such a way
that the interleaved different flits can be correctly
routed into their correct paths. tu
5.2 Correctness of the Proper Multicast Flit
Replication Based on Hold/Release Tagging
Mechanism
Definition 5.1. A Routing Request Matrix RðtÞ describes the
requests of all incoming flits to access the output ports at time
stage unit t. The elements of the routing request matrix RðtÞ
consist of the elements of the time-varying input n binary
request or the time-varying outputm binary request defined in
Definition 4.8 such that
RNinpNoutpðtÞ ¼ rn;1ðtÞ; rn;2ðtÞ; . . . ; rn;NoutpðtÞ
 
; ð6Þ
n ¼ 1; 2; . . . ; Ninp  1; Ninp
  ¼ 1 : Ninp 
RNinpNoutpðtÞ ¼
r1;m¼1:NoutpðtÞ
r2;m¼1:NoutpðtÞ
  
rNinp;m¼1:NoutpðtÞ
0
BBB@
1
CCCA: ð7Þ
We can also define that RðtÞ : rn;mðtÞ 2 f0; 1g, where n andm
represent the row and column coordinates of each matrix
element. According to Definition 3.4, n andm are interpreted as
the input and output port number of the router IO port,
respectively. The value of the rn;mðtÞ are either 0 or 1. The
element rn;mðtÞ ¼ 1 if there is a routing request from input port
n to output port m, else its value is rn;mðtÞ ¼ 0. For a unicast
request, 8n :PNoutpm¼1 rn;mðtÞ ¼ 1, and for multicast request
8n : 1 <PNoutpm¼1 rn;mðtÞ  Noutp.
If 0 PNinpn¼1 rn;mðtÞ  1, then there is no contention to
access the output port m. Equation (8) (left side) shows an
example of the RðtÞ for the Snapshot 1 in Fig. 3 where the
IO ports are represented as port numbers 1, 2, 3, 4, and 5,
respectively.
Rð1Þ ¼
0 1 1 0 0
0 0 0 0 0
1 0 0 1 0
0 1 1 0 1
0 0 0 0 0
0
BBBB@
1
CCCCA;Að1Þ ¼
0 0 1 0 0
0 0 0 0 0
1 0 0 1 0
0 1 0 0 1
0 0 0 0 0
0
BBBB@
1
CCCCA:
ð8Þ
Definition 5.2. An Arbitration Matrix A describes the grant
signal from an arbiter unit to select one flit from the input port
to access its requested output port at time stage t. The
Arbitration Matrix and its array elements are defined as
AðtÞ : an;mðtÞ 2 f0; 1g. The form of the Arbitration Matrix
AðtÞ is strongly dependent on RðtÞ.
ANinpNoutpðtÞ ¼ an;1ðtÞ; an;2ðtÞ; . . . ; an;NoutpðtÞ
 
n ¼ 1; 2; . . . ; Ninp  1; Ninp
  ¼ 1 : Ninp : ð9Þ
For example, if the output port m ¼ 2 has two requests from
input ports as shown in column 2 of matrix RðtÞ in (8), i.e.,
rn¼1:5;m¼2ðtÞ ¼ ½1 0 0 1 0T , then based on Defini-
tions 4.11, 4.12, and 4.13, the set of two possible combinations
of the column 2 of the arbitration matrix is an¼1:5;m¼2ð1Þ ¼
½0 0 0 1 0T and an¼1:5;m¼2ð2Þ ¼ ½1 0 0 0 0T . In
other words, in each time stage t, where the arbiter rotates the
selection among existing requests, the arbiter can only select
one flit from input port. This means, the sum of the column
elements in A must be either 0 or 1, or 0 PNinpn¼1 an;mðtÞ  1.
Equation (8) (right side) shows an example of the AðtÞ for the
Snapshot 2 in Fig. 3.
Definition 5.3. A Tagged Matrix R	ðtÞ: r	n;mðtÞ is a support
matrix that is useful to determine whether a flit must be “held”
or can be “released” from the input port, and to compute the next
routing request matrix Rðtþ 1Þ. For each time stage unit t, the
SAMMAN ET AL.: NEW THEORY FOR DEADLOCK-FREE MULTICAST ROUTING IN WORMHOLE-SWITCHED VIRTUAL-CHANNELLESS... 551
matrix request RðtÞ will be updated as presented in (10). The
function  contains two subfunctions, i.e., fFR	: RðtÞ; AðtÞ !
R	ðtÞ contains operator to form tagged matrix R	ðtÞ and
fUPR: R
	ðtÞ ! Rðtþ 1Þ to update each element in Rðtþ 1Þ.
R tþ 1ð Þ ¼  R tð Þ; R	 tð Þ; A tð Þð Þ
fFR	: R tð Þ; A tð Þ ! R	 tð Þ
fUPR: R
	 tð Þ ! R tþ 1ð Þ:
ð10Þ
According to (11) and (12), the form of tagged matrix R	ðtÞ
depends on the current form of the RðtÞ and AðtÞ.
if n ¼ const: and 9m : rn;mðtÞ 6¼ an;mðtÞ then
8m: rn;mðtÞ ¼
1 : rn;mðtÞ ¼ an;mðtÞ;
1	 : rn;mðtÞ 6¼ an;mðtÞ;
0 : rn;mðtÞ ¼ 0;
8><
>:
ð11Þ
if n ¼ const: and 8m : rn;mðtÞ ¼ an;mðtÞ then
8m: rþn;mðtÞ ¼
0 : rn;mðtÞ ¼ 0;
1þ : rn;mðtÞ ¼ an;mðtÞ:
 ð12Þ
If we compare matrices in (8), then according to (11), we
can see that there are two elements which do not match
each other, i.e., r1;2ð1Þ does not match with a1;2ð1Þ, and
r4;3ð1Þ does not match with a4;3ð1Þ. Therefore, these two
elements are tagged with the symbol ð	Þ as presented in
(14). If minimal one element of row n is tagged with ð	Þ,
then the other elements having the value 1 in same row n
will be marked with the symbol ðÞ. As presented in (14),
the elements r	1;3ð1Þ, r	4;2ð1Þ, and r	4;5ð1Þ are assigned with ðÞ.
The other elements of the row n having no 1-element, being
tagged with symbol ð	Þ or ðÞ, are assigned with symbol ðþÞ
in accordance with (12). As presented in (14), all elements in
row 3, i.e., r	3;1ð1Þ and r	3;4ð1Þ are assigned with ðþÞ, because
there is no element in the row 3 having tag symbol ð	Þ.
R	NinpNoutpðtÞ ¼
8m: r1;mðtÞ or rþ1;mðtÞ
8m: r2;mðtÞ or rþ2;mðtÞ
        
8m: rn;mðtÞ or rþn;mðtÞ
0
BB@
1
CCA; ð13Þ
R	ð1Þ ¼
r1;mð1Þ
rþ2;mð1Þ
rþ3;mð1Þ
r4;mð1Þ
rþ5;mð1Þ
0
BBBBB@
1
CCCCCA ¼
0 1	 1 0 0
0 0 0 0 0
1þ 0 0 1þ 0
0 1 1	 0 1
0 0 0 0 0
0
BBBB@
1
CCCCA: ð14Þ
Definition 5.4. The Hold/Release Tagging Policy can be
applied by observing whether the row array element of (13)
falls in the case according to (11), i.e., rn;1:NoutpðtÞ, or in the case
according to (12), i.e., rþn;1:NoutpðtÞ. Both (11) and (12) comprise
an antecedence or condition part and a consequence part.
Definition 5.5 (Data Hold Policy). If the array elements of the
row n of the Tagged Matrix R	ðtÞ are rn;1:NoutpðtÞ, then the flit
coming from an input port n (n ¼ const,) must be held in the
input port n, because at time stage t and 8m;n ¼ const:)
9m: rn;mðtÞ 6¼ an;mðtÞ or there is minimal one element of the
rn;1:NoutpðtÞ that has not been granted to be switched out to the
requested output port. According to (11), the nongranted
element is tagged with symbol ð	Þ, while the granted element is
tagged with symbol ðÞ. Therefore, according to (15), the next
request at time stage tþ 1 of granted element will be dropped
to avoid improper multicast flit replication.
if n ¼ const: and r	n;mðtÞ ¼ rn;mðtÞ then
8m: rn;mðtþ 1Þ ¼
0 : r	n;mðtÞ ¼ 0 or 1;
1 : r	n;mðtÞ ¼ 1	:
(
ð15Þ
Definition 5.6 (Data Release Policy). If the array elements of
the row n of the Tagged Matrix R	ðtÞ are rþn;1:NoutpðtÞ or all
elements of the rn;1:NoutpðtÞ has been granted to be switched out to
the requested output ports, then the flit coming from the input
port l ¼ n can be released from the input port n. In (16), we can
define that if the flit from input port n is released from the input
port and switched to the output port, then the next considered
flit at time stage tþ 1 maybe 1) a zero flit (not a data flit,
rn;1:Noutpðtþ 1Þ ¼ ;), 2) a flit of different message with different
unicast or multicast output routing direction (rn;1:Noutpðt þ
1Þ 6¼ rn;1:NoutpðtÞ), or 3) a flit that belongs to the flit that has been
released from the input port (rn;1:Noutpðtþ 1Þ ¼ rn;1:NoutpðtÞ).
if n ¼ const: and r	n;mðtÞ ¼ rþn;mðtÞ then
8m: rn;mðtþ 1Þ ¼ rFnextn;m ðtþ 1Þ;
ð16Þ
Rð2Þ ¼
0 1 0 0 0
0 0 0 0 0
1 0 0 1 0
0 0 1 0 1
0 0 0 0 0
0
BBBB@
1
CCCCA;Að2Þ ¼
0 1 0 0 0
0 0 0 0 0
1 0 0 1 0
0 0 1 0 1
0 0 0 0 0
0
BBBB@
1
CCCCA;
ð17Þ
R	ð2Þ ¼
rþ1;mð2Þ
rþ2;mð2Þ
rþ3;mð2Þ
rþ4;mð2Þ
rþ5;mð2Þ
0
BBBBB@
1
CCCCCA ¼
0 1þ 0 0 0
0 0 0 0 0
1þ 0 0 1þ 0
0 0 1þ 0 1þ
0 0 0 0 0
0
BBBB@
1
CCCCA: ð18Þ
Lemma 3. By using the “hold/release tagging policy” defined in
Definition 5.4, improper multicast flit replication on every
router can be avoided.
Proof of Lemma 3. If the number of requests of a flit coming
from input port n is defined as Nreqs;n such that at time
stage t ¼ ts, Nreqs;n ¼
PNoutp
m¼1 rn;mðtsÞ, then according to (15),
a routing request rn;mðtÞ that has been granted will be
reset at the next time stage t ¼ ts þ 1. Therefore, every
routing request rn;mðtÞwill be only forwarded once to the
output port. tu
5.3 Proof of the New Theory
Postulate 1. If unicast packets are routed in a certain router such
that 8n: PNoutpm¼1 rn;mðtÞ  1 (See Definition 5.1), or if multi-
cast packets are routed in a certain router such that
9n: PNoutpm¼1 rn;mðtÞ > 1, and there is no contention between
them to access output link such that 8m: PNinpn¼1 rn;mðtÞ  1,
then according to Definitions 5.2, 4.11, and 4.12, : 8m:
Ts;m ¼ 1) 8t: Aðtþ 1Þ ¼ AðtÞ. Thus, according to Defini-
tion 5.3, then 6 9r	n;mðtÞ ¼ 1	 :: 8t: RðtÞ ¼ AðtÞ or there is no
552 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 22, NO. 4, APRIL 2011
need to apply for the “Hold/Release Rule.” All unicast packets
(without contention) can be released from every input port, or
will not be withheld at every input port at each time stage t.
Fig. 7 shows an example of the multicast requests in a router
without contention.
Postulate 2. If unicast packets are routed in a certain router such
that 8n: PNoutpm¼1 rn;mðtÞ  1 (See Definition 5.1), and there are
one or more contentions between them to access output link such
that 9m: PNinpn¼1 rn;mðtÞ > 1, according to Definitions 5.2, 4.11,
and 4.12, the contention on each output portm can be solved at
t ¼ Ts;m where 8m: 9Ts;m: rn;mðtsÞ ¼
STs;m
t¼ts an;mðtÞ, where
8m: 1  Ts;m  Ninp. Furthermore according toDefinition 5.3,
: 8n:
PNoutp
m¼1 rn;mðtÞ  1) 8t; n;m: 6 9r	n;mðtÞ ¼ 1, o r
8t;m; n: r	n;m ¼ rþn;m according to (13). Therefore, the “Hold/
Release Tagging Policy” is only partially applied, i.e., a unicast
request rn;mðtÞ that has not been granted at time stage t must
wait in the input port l ¼ n. Fig. 8 shows an example of this
situation and how the unicast contention is solved.
Now, we can extend the problem in a situation where
multicast packets are routed in a router with multicast
requests such that 9n :PNoutpm¼1 rn;mðtÞ > 1, and there are one
or more contentions between the multicast requests to
acquire the same output ports or 9m :PNinpn¼1 rn;mðtÞ > 1. This
extended problem is actually the key problem that will be
solved by our new theory to perform deadlock-free multi-
cast routing without implementing virtual channels in a
wormhole-switched NoC. Therefore, instead of introducing
a new lemma and its proof, we will introduce new theorems
and their proofs as described in the following.
Theorem 1. The Id-field beingpart of every flit allows to implement
a flit-by-flit arbitration and an Id-based routing for interleaving
different packets in the same queue, where flits belonging to the
same packet have the same Id-tag on every local communication
link. Hence, multicast deadlock problem can be solved at each
router by further applying a “Hold/Release Tagging Policy” to
control and manage conflicting multicast requests.
Theorem 2. If the multicast dependency and deadlock problems
can be solved at each router as mentioned in Theorem 1, then
multicast deadlock configurations in the network can be solved,
if: 1) the routing algorithm used to route the unicast and
multicast packets does not perform cyclic dependency, and 2) a
data dropping mechanism at each outgoing communication link
is applied to packets that cannot be assigned to an Id slot on the
communication link, or the mechanism can be neglected if
sufficient number of Id slots is set per link for considered traffics.
Proof of Theorem 2. The Theorem 2 can be proved, if
Theorem 1 can be proved and the conditions mentioned
in Theorem 2 are fulfilled. Therefore, we will explain and
prove the need for the necessary conditions mentioned in
the Theorem 2 as follows:
1. The necessary condition mentioned by the item
1) in the Theorem 2 is needed because routing
algorithm used to route unicast and multicast
packets are the same according to the proposed
hardware solution in this paper. Therefore, if the
used routing algorithm does not perform cyclic
dependency, then the proposed tree-based multi-
cast routing is also free from deadlock configura-
tion. The proof of the deadlock-freeness in term of
the cyclic dependency problem is presented in
detail in [21], [22], and [23].
2. The necessary condition mentioned by the item
2) in the Theorem 2 is required because if the data
flits are not dropped then they will stall in the
router especially if they must wait other messages
to free one Id slot for very long time. In this case,
the data flits will be stagnant and occupy many
buffers in the upstream channels, and do not give
spaces for other messages to flow (chain block).
However, if the number of available Id slots per
link is set, for example, equal to (5) (when using
minimal fully adaptive routing), then the packet
stall can be avoided because each considered
traffic can reserve one Id slot to flow on every
communication link.
tu
Proof of Theorem 1. The circulating arbitration mechanism
can guarantee that one flit of unicast or multicast packets
can be forwarded to each outgoing link at each router
node,wheremulticast conflictmayoccur.After arbitration
process at each time t, a hold-release tagging mechanism
can also guarantee that improper replication of the
multicast packets can be avoided, because: 1) the granted
multicast bit-requests will be assigned and will not be
included again in the next arbitration process, and 2) the
flits having multicast bit-requests will be kept in the FIFO
queue until all its multiple bit-requests are granted.
SAMMAN ET AL.: NEW THEORY FOR DEADLOCK-FREE MULTICAST ROUTING IN WORMHOLE-SWITCHED VIRTUAL-CHANNELLESS... 553
Fig. 7. Scheduling multicast requests without contention.
Fig. 8. Scheduling unicast requests with contention.
The circulating selection result of the arbitration
process at each output port maybe random and not
uniform. Therefore, there are two possible configurations
after the arbitration process, i.e., 1) all requests of a
multicast flit from an input port n are granted at the same
time stage t (rn;1:NoutpðtÞ ¼ an;1:NoutpðtÞ), or 2) not all the
multicast requests from an input port n are granted
(8h 2 ’reqn : 9rn;hðtÞ 6¼ an;hðtÞ). In the situation 1), the multi-
cast flit can be released from FIFO queue, and in the
situation 2), the multicast flit must be held in FIFO queue,
and the hold-release tagging policy and the circulating/
rotating flit-by-flit arbitrationwill then cover the situation.
By circulating the bit-set selection in every column m
of AðtÞ at each time stage t, where the circulating
combinations of a1:Ninp;mðtÞ for 8m (all output ports) are
independent each other, then it is possible, in finite time
Tf to find AðTfÞ in such a way, that all conflicting
multicast flits can escape from multicast dependency.
1. If the amount of requests in every output portm at
t ¼ ts is Nreqs;m such that 1  Nreqs;m  Ninp (See (3)),
then the required number of circulating arbitration
time to grant the request from the input port l 2
reqm to output m appear at ts is Ts;m ¼ Nreqs;m. The
probability that the request rl;mðtsÞ is selected by
grant acknowledge al;mðtsÞ is Probðrl;mðtsÞ ¼
al;mðtsÞÞ ¼ 1Nreqs;m . According to Definitions 4.10,
4.11, 4.12, and 4.13, then at t ¼ Ts;m wewill achieve
that r1:Ninp;mðtsÞ ¼
STs;m
t¼ts a1:Ninp;mðtÞ. The maximum
number of requests to an output port m is Ninp.
Hence, ifreqm  , then r1:Ninp;mðtsÞ ¼
SNinp
t¼1 a1:Ninp;m
ðtÞjts ¼ 1.
2. If at t ¼ ts, there is multicast requests from an
input port n such that Nreqs;n > 1 (See Definition
4.9), then we obtain a set of input ports ’reqn in
such a way that rn;hðtsÞ ¼ 1 iff an output port h 2
’reqn (See (2)). The probability that every single
request of the multicast request rn;hðtsÞ from the
input port n is selected by the arbitration unit at
the requested output port h 2 ’reqn , depends also
on the number of requests from other input ports
in the set reqh to the same output port h.
In accordance with items 1 and 2 mentioned above,
then we can derive a conditional equation such that the
multicast deadlock problem is solved as described in the
following equation:
RðtsÞ ¼
[Ts;1
t¼ts
a1:Ninp;1ðtÞ   
[Ts;Noutp
t¼ts
a1:Ninp;NoutpðtÞ
0
@
1
A: ð19Þ
The typical contentionless switching situations pre-
sented in Fig. 7 can be solved at every single time stage
such that 8t;m; n : rn;mðtÞ ¼ an;mðtÞ. The typical problem
presented in Fig. 8 can be solved per output port basis,
where at every output port the problem is solved at t ¼
Ts;m such that 8m : r1:Ninp;mðtsÞ ¼
STs;m
t¼ts a1:Ninp;mðtÞ. The
multicast contention problem presented, e.g., in Fig. 3
must be solved per input-output basis because of the
existing multicast requests.
Because the circulating arbitration order at every
input port m is not uniform or independent from each
other, then there will be many possible combinations of
AðtÞ at every time stage t. The arbitration time solution
Ts;m at every outputm 2 ’may vary and depends on the
number of requests Nreqm to the output port m (Ts;m ¼
Nreqm ). However, we can guarantee that, at the maximum
time of t ¼ Tf such that 8m: Tf ¼ Tmaxs;m ¼ maxðTs;mÞ or
Tf ¼ maxðTs;1; Ts;2; . . . ; Ts;NoutpÞ, then the multicast dead-
lock dependency problem on each router is solved at
Tfif there is no congestion in the outgoing links,
: at t ¼ Tf ) 8m :
STf
t¼ts a1:Ninp;mðtÞ ¼
STs;m
t¼ts a1:Ninp;mðtÞ.
Therefore, the conditional equation (19) is fulfilled,
and by following the Proofs of Lemma 1, 2, and 3, then
the multicast deadlock and dependency problems on
each router is solved without improper multicast flit
replication. In other words, all requests depicted in RðtsÞ
(at initial time ts) will finally receive a grant acknowl-
edge AðTfÞ in such a way that all flits appear at t ¼ ts
from all input ports n 2  would have been switched out
to the output ports m 2 ’ or would have been rescued
from the multicast contention in the router in finite time
stage Tf where 1  Tf  Ninp. If congestion occurs at any
outgoing link then the solution is postponed for Twf time
stage. Hence, the problem is solved at t ¼ Tf þ Twf ,
where Twf is the number of time stages to wait for a free
data slot available in the queue of the congested link
connected directly to the most requested output port.
The descriptions given above have proved the
Theorem 1 because the multicast problem can be solved
in such a way that the contenting or conflicting multicast
packet can be rescued from the multicast dependency in
a router. If the multicast dependency (deadlock) problem
can be solved on every router Rc 2 <, then the network is
free from multicast deadlock problem as long as the
routing algorithm used to route unicast and multicast
packets does not form cyclic dependencies. The detailed
proof of the last statement can be found in [21], [22], [23].
Tf depends on the concrete multicast conflict situa-
tion in each router. For instance, in the multicast
conflict case presented in Fig. 3, the flit coming from
the PORT 3 port can be rescued from the multicast-
dependency after generating one in-column bit-set
combination of the arbitration matrix Að1Þ as shown
in Snapshot 2. While the flits coming from PORT 1 and
PORT 4 ports can be rescued after generating two in-
column bit-set combinations of the arbitration matrices
Að1Þ, and Að2Þ as shown in Snapshots 2 and 4,
respectively. From Fig. 3, we can see that for Ninp ¼
Noutp ¼ 5, then the numbers of request at every output
port m 2 f1; 2; 3; 4; 5g are Nreqs;1 ¼ Nreqs;4 ¼ Nreqs;5 ¼ 1 and
Nreqs;2 ¼ Nreqs;3 ¼ 2. Thus, 8m ¼ f1; 2; 3; 4; 5g then
Tf ¼ maxðTs;1; Ts;2; Ts;3; Ts;4; Ts;5Þ ¼ maxð1; 2; 2; 1; 1Þ ¼ 2:
Therefore, the multicast dependency deadlock depicted
in Fig. 3 can be solved in the next finite time stage Tf ¼ 2.
The rotating output selection per output port in four
successive time stages of the problem shown in Fig. 3 is
presented in the following tabular.
554 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 22, NO. 4, APRIL 2011
tu
5.4 Switch Architecture
The most interesting aspect of the new theory, is that the
VLSI architecture which implement the new theory can be
designed in a regular modular switch architecture. Fig. 9
shows us a typical five-port generic VLSI architecture of a
switch that represents the methodology. The upper part of
Fig. 9 presents modular units at the input port of the switch
router, while the lower part presents modular units at the
output port. In the figure, we can see that 8n;m ¼ 1; 2; 3;
4; 5: 9rnm; anm. The control paths of signals rnm and anm
are practically similar to the variables rn;mðtÞ and an;mðtÞ
defined theoretically in Definitions 5.1 and 5.2, respectively.
The logical function to form the Tagged Matrix defined in
Definition 5.3 are practically implemented in the RE unit at
every input port. The Arbiter (A) units at every input port
represent the Arbitration Matrix defined in Definition 5.2
and perform the rotating flit-by-flit arbitration described in
Algorithm 2. The Routing Table T ðkÞ defined in Definition
4.7 and the ID Slot Table SðkÞ defined in Definition 4.14 are
implemented in the RE unit at every input port and in the
MIM unit at every output port, respectively.
Fig. 9 is only a representative figure of many possible
architectures that has four-stage pipeline, i.e., input buffer-
ing, routing, arbitration, and link traversal stages. Pipeline
stages number can be freely determined by NoC designers.
The input buffering is implemented using a standard first-
in first-out algorithm. The main critic of the NoC router
design is the implementation of two tables on every one-
directional link that can lead to area overhead. However,
when a number of N messages is allowed in-flight in the
same physical link, then compared to a VC-based solution,
our concept requires theoretically less logic area and power,
since the size of the two tables having N slot registers will
be less than the size of N number of VC buffers.
This paper focuses on the proof of the new theory for
deadlock-free multicast routing in a wormhole-switched
virtual channelless NoC. The verification through RTL-
simulation as well as the early VLSI architecture implemen-
tations usingCMOS standard-cell technology can be found in
[24] in 2D 4 4Mesh NoC with static XY routing algorithm,
in [25] in 2D 8 10MeshNoCwith static andpartial adaptive
routingalgorithms, and in [26] in 2D8 10Mesh-PlanarNoC
with 2D planar adaptive routing algorithm.
6 CONCLUSIONS
Thispaperhaspresentedanewtheory andmethodology for a
deadlock-free tree-basedmulticast routing suitable forNoCs.
The multicast routing theory and methodology is supported
by allowing flits of different wormhole messages being
interleaved/multiplexed at flit-level on the same link. Multi-
cast deadlock problems are solved and improper multicast
replications are prevented by using the hold-release tagging
mechanism. The message routing is guaranteed correct
because flits belonging to the same wormhole message are
allocated to the same Id slot ona certain link. The introduction
of the local Id slots can be interpreted implicitly as a
mechanism to reduce the size of the RRT. The size of ID-tags
in theRRTcanbeset less thanthenumberofnodeentries in the
NoC to run specific data distribution scenarios. This case has
been experimented in [25] and [26] by setting 16 ID-tags per
link in the 80 cores (mesh 8 10) NoC.
Fortunately, the traffics in embedded MPSoCs are
predictable, and it is a rare case that more than 15 messages
are in-flight in the same link. However, the packet dropping
mechanism must be applied in this case to avoid data-flow
stall. If the number of ID-tags per link Nslot is set to cover all
considered traffics, e.g., equal to (5), then the packet
dropping mechanism can be neglected. There is a design
trade-off in this aspect. By setting the minimum Nslot per
link as discussed in Section 4.6, the size of the RRT and ID
Slot Table units would be larger, but there is no need for a
retransmission protocol. When data dropping is applied
and the number of entries in the tables units is reduced,
then router size will be smaller, but the retransmission
protocol must be applied leading to area overhead in the
network interface, and probably time overhead when the
data drop occurs.
SAMMAN ET AL.: NEW THEORY FOR DEADLOCK-FREE MULTICAST ROUTING IN WORMHOLE-SWITCHED VIRTUAL-CHANNELLESS... 555
Fig. 9. An example of typical five-port router/switch matrix structure based on the proposed hardware solution.
APPENDIX
(NOTATIONS)
Symbol Description
Nnode : Number of router node in a NoC
< : Set of routers fR1; R2; . . . ; RNnodeg
Rc : Router node c, where Rc 2 <
 : Set of communication links in a NoC
Li;j : Link connecting Ri to Rj, where Li;j 2 
Ninp : Number of input ports in a router
Noutp : Number of output ports in a router
 : Set with elements f1; 2; . . . ; Ninpg
’ : Set with elements f1; 2; . . . ; Noutpg
n : Input port number, n 2 
m : Output port number, m 2 ’
Nreqs;m : Number of request to acquire output port m
Nreqs;n : Number of request from input port n
Nslot : Number of Id Slot on every link
 : Set of elements f0; 1; 2; . . . ; Nslot  1g
"type : Set of element fheader; databody; tail,
responseg
type : Type of a flit, type 2 "type
 : Set of elements f0; 1; 2; . . . ; Nslot  1g
Id; k : Local Id-tag and Id slot, k 2 ,  ¼ 
RðtÞ : Routing Request Matrix
rn;mðtÞ : Matrix element of RðtÞ
AðtÞ : Arbitration (Routing Acknowledge) Matrix
an;mðtÞ : Matrix element of AðtÞ
R	ðtÞ : Tagged Request Matrix
r	n;mðtÞ : Matrix element of R	ðtÞ
Sk : State of Id Slot k, where k 2 
SðkÞ : ID Slot Table
T ðkÞ : Routing Table
ACKNOWLEDGMENTS
The authors gratefully acknowledge the insightful com-
ments and suggestions made by the reviewers. Special
thanks are given to Deutcher Akademischer Austausch-
Dienst (DAAD, German Academic Exchange Service)
awarding the DAAD-Scholarship for Faizal A. Samman to
pursue the engineering doctoral degree at Darmstadt
University of Technology in Germany. The authors would
also like to thank LOEWE-Zentrum AdRIA in Fraunhofer
Institute LBF Darmstadt for further cooperation and for
possible implementation of the concept and theory pre-
sented in this paper within Project AdRIA (Adaptronik-
Research, Innovation, Application) funded by Hessian
Ministry of Science and Arts with grant number IIIL4-
518/14.004 (2008).
REFERENCES
[1] G. Fox et al., “Fortran D Language Specification,” Technical
Report CRPC-TR 90079, Center for Research on Parallel Computa-
tion, Rice Univ., Dec. 1990.
[2] J. Merlin, “Techniques for the Automatic Parallelization of
Distributed Fortran 90,” Technical Report SNARC 92-02, South-
ampton Univ., Nov. 1991.
[3] High Performance Fortran Forum, “High Performance Fortran
Language Specification, Version 1.0,” Scientific Programming,
vol. 2, no. 1, May/June 1993.
[4] Message Passing Interface Forum, “MPI-2: Extensions to the
Message-Passing Interface,” technical report, Univ. of Tennessee,
Nov. 2003.
[5] G.A. Geist, J.A. Kohl, and P.M. Papadopoulos, “PVM and MPI: A
Comparison of Features,” Calculateurs Paralleles, vol. 8, no. 2,
pp. 137-150, May 1996.
[6] G.A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V.
Sunderam, PVM: Parallel Virtual Machine: A User’s Guide and
Tutorial for Networked Parallel Computing, MIT Press, http://
www.csm.ornl.gov/pvm, 1994.
[7] X. Lin, P.K. McKinley, and L.M. Ni, “Deadlock-Free Multicast
Wormhole Routing in 2-D Mesh Multicomputers,” IEEE Trans.
Parallel and Distributed Systems, vol. 5, no. 8, pp. 793-804, Aug.
1994.
[8] J. Duato, “A Theory of Deadlock-Free Adaptive Multicast Routing
in Wormhole Networks,” IEEE Trans. Parallel and Distributed
Systems, vol. 6, no. 9, pp. 976-987, Sept. 1995.
[9] R.V. Boppana, S. Chalasani, and C.S. Raghavendra, “Resource
Deadlocks and Performance of Wormhole Multicast Routing
Algorithms,” IEEE Trans. Parallel and Distributed Systems, vol. 9,
no. 6, pp. 535-549, June 1998.
[10] M. Barnett, D.G. Payne, R.A. van de Geijn, and J. Watts,
“Broadcasting on Meshes with Worm-Hole Routing,” J. Parallel
and Distributed Computing, vol. 35, no. 2, pp. 111-122, 1996.
[11] M.P. Malumbres, J. Duato, and J. Torrelas, “An Efficient
Implementation of Tree-Based Multicast Routing for Distributed
Shared-Memory Multiprocessors,” Proc. Eighth IEEE Symp. Parallel
and Distributed Processing, pp. 186-189, 1996.
[12] D.R. Kumar and W.A. Najjar, P.K. Srimani, “A New Adaptive
Hardware Tree-Based Multicast Routing in K-Ary N-Cubes,”
IEEE Trans. Computers, vol. 50, no. 7, pp. 647-659, July 2001.
[13] P. Abad, V. Puente, and J.-A. Gregorio, “MRR: Enabling Fully
Adaptive Multicast Routing for CMP Interconnection Networks”
Proc. 15th IEEE Int’l Symp. High Performance Computer Architecture
(HPCA ’09), pp. 355-366, Feb. 2009.
[14] S. Rodrigo, J. Flich, J. Duato, and M. Hummel, “Efficient
Unicast and Multicast Support for CMPs,” Proc. 41st IEEE/
ACM Int’l Symp. Microarchitecture (MICRO ’08), pp. 364-375,
Nov. 2008.
[15] L. Wang, Y. Jin, H. Kim, and E.J. Kim, “Recursive Partitioning
Multicast: A Bandwidth-Efficient Routing for Networks-on-Chip,”
Proc. Third ACM/IEEE Int’l Symp. Networks-on-Chip (NOCS ’09),
pp. 64-73, May 2009.
[16] N.E. Jerger, L.-S. Peh, and M. Lipasti, “Virtual Circuit Tree
Multicasting: A Case for On-Chip Hardware Multicast Support,”
Proc. 35th Int’l Symp. Computer Architecture (ISCA ’08), pp. 229-240,
June 2008.
[17] A.S. Vaidya, A. Sivasubramaniam, and C.R. Das, “Impact of
Virtual Channels and Adaptive Routing on Application Perfor-
mance,” IEEE Trans. Parallel and Distributed Systems, vol. 12, no. 2,
pp. 223-237, Feb. 2001.
[18] K. Aoyama and A.A. Chien, “The Cost of Adaptivity and Virtual
Lanes in a Wormhole Router,” J. VLSI Design, vol. 2, no. 4, pp. 315-
333, 1995.
[19] D.A. Ilitzky, J.D. Hoffman, A. Chun, and B.P. Esparza, “Archi-
tecture of the Scalable Communications Core’s Network on Chip,”
IEEE Micro, vol. 27, no. 5, pp. 62-74, Sep./Oct. 2007.
[20] P. Gratz, C. Kim, K. Sankaralingam, H. Hanson, P. Shivaku-
mar, S.W. Keckler, and D. Burger, “On-Chip Interconnection
Networks of the TRIPS Chip,” IEEE Micro, vol. 27, no. 5,
pp. 41-50, Sep./Oct. 2007.
[21] C.J. Glass and L.M. Ni, “The Turn Model for Adaptive
Routing,” Proc. 19th Int’l Symp. Computer Architecture, pp. 278-
287, 1992.
[22] W.J. Dally and C.L. Seitz, “Deadlock-Free Message Routing in
Multiprocessor Interconnection Networks,” IEEE Trans. Compu-
ters, vol. C-36, no. 5, pp. 547-553, May 1987.
[23] J. Duato, “A New Theory of Deadlock-Free Adaptive Routing in
Wormhole Networks,” IEEE Trans. Parallel and Distributed Systems,
vol. 4, no. 12, pp. 1320-1331, Dec. 1993.
[24] F.A. Samman, T. Hollstein, and M. Glesner, “Multicast Parallel
Pipeline Router Architecture for Network-on-Chip,” Proc. Design,
Automation and Test in Europe Conf. and Exhibition (DATE ’08),
pp. 1396-1401, 2008.
556 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 22, NO. 4, APRIL 2011
[25] F.A. Samman, T. Hollstein, and M. Glesner, “Adaptive and
Deadlock-Free Tree-Based Multicast Routing for Networks-on-
Chip,” IEEE Trans. Very Large Scale Integration Systems, vol. 18,
no. 7, pp. 1067-1080, July 2010.
[26] F.A. Samman, T. Hollstein, and M. Glesner, “Planar Adaptive
Router Microarchitecture for Tree-Based Multicast Network-on-
Chip,” Proc. First Int’l Workshop Network-on-Chip Architecture
(NoCArch ’08), in conj. with the IEEE/ACM Int’l Symp. Microarch-
itecture (MICRO-41), pp. 6-13, 2008.
Faizal Arya Samman received the bachelor of
engineering degree (with Honour) in electrical
engineering from Gadjah Mada University at
Yogyakarta, Indonesia in 1999 and the master of
engineering degree from Inter University Center
for Microelectronics Research, Bandung Insti-
tute of Technology in Indonesia in 2002 with
Scholarship Award from Indonesian Ministry of
National Education in Control and Computer
System Laboratory. In 2002, he was appointed
to be a research and teaching staff at Hasanuddin University in
Makassar, Indonesia. From 2006 until 2010, he received scholarship
award from Deutscher Akademischer Austausch-Dienst (DAAD, Ger-
man Academic Exchange Service) to pursue the engineering doctoral
degree at Darmstadt University of Technology, in Germany. He is now
working toward the postdoctoral program in LOEWE-Zentrum AdRIA
(Adaptronik-Research, Innovation, Application) within the research
cooperation framework between Darmstadt University of Technology
and Fraunhofer Institute LBF in Darmstadt. His research interests
include network on-chip (NoC) microarchitecture, NoC-based multi-
processor system-on-chip application mapping, programming models
for multiprocessor systems, design and implementation of analog and
digital electronic circuits for control system applications on FPGA/ASIC
as well as energy harvesting systems and wireless sensor networks. He
is a member of the IEEE.
Thomas Hollstein received the graduation in
electrical engineering/computer engineering
from Darmstadt University of Technology in
1991 and the PhD degree on “Design and
interactive Hardware/Software Partitioning of
Complex Heterogeneous Systems” from Darm-
stadt University of Technology in 2000. In 1992,
he joined the research group of the Microelec-
tronic Systems Lab at Darmstadt University of
Technology. He worked in several research
projects in neural and fuzzy computing and industrial VHDL-based
design. Since 1995, he focused his research on hardware/software
codesign. Since 2000, he is working as a senior researcher, leading a
research group focusing System-on-Chip communication architectures,
the design of reconfigurable HW/SW Systems-on-Chip and integrated
SoC test and debug methodologies. His current research interests are in
the fields of networks-on-chip, hardware/software co-design, systems-
on-chip design, printable organic and inorganic electronics, and RFId
circuit and system design. Furthermore, he is giving lectures on VLSI
design and CAD methods. From 2001 until now, he has been member of
a leader team initiating and establishing a new international master
program in “Information & Communication Engineering” at Darmstadt
University of Technology. In 2010, he was appointed as a professor at
Tallin University of Technology in Department of Computer Engineering,
Dependable Embedded Systems Group.He is a member of the IEEE.
Manfred Glesner received the diploma and PhD
degrees from Saarland University, Saarbru¨cken,
Germany, in 1969 and 1975, respectively, and
the three Doctor Honoris Causa degrees from
Tallinn Technical University, Estonia, in 1996,
Polytechnic University of Bucharest, Romania,
in 1997, and Mongolian Technical University,
Ulan Bator, in 2006. His doctoral research was
based on the application of nonlinear optimiza-
tion techniques in computer-aided design of
electronic circuits. Between 1969 and 1971, he has researched work
in radar signal development for the Fraunhofer Institute in Werthoven/
Bonn, Germany. From 1975 to 1981, he was a lecturer in the areas of
electronics and CAD with Saarland University. In 1981, he was
appointed as an associate professor in electrical engineering with the
Darmstadt University of Technology, Germany, where, in 1989, he was
appointed as a full professor for microelectronic system design. His
current research interests include advanced design and CAD for micro-
and nanoelectronic circuits, reconfigurable computing systems and
architectures, organic circuit design, RFID design, mixed-signal circuit
design, and process variations robust circuit design. With the EU-based
TEMPUS initiative, he built up several microelectronic design centers in
Eastern Europe. Between 1990 and 2006, he acted as a speaker of two
DFG-funded graduate schools. He is a member of several technical
societies and he is active in organizing international conferences. Since
2003, he has been the vice president of the German Information
Technology Society (ITS) in VDE and also a member of the DFG
decision board for electronic semiconductors, components, and inte-
grated systems. He was a recipient of the honor/decoration of “Palmes
Academiques” in the order of Chevalier by the French Minister of
National Education (Paris) for distinguished work in the field of education
in 2007/2008. He is a fellow of the IEEE.
. For more information on this or any other computing topic,
please visit our Digital Library at www.computer.org/publications/dlib.
SAMMAN ET AL.: NEW THEORY FOR DEADLOCK-FREE MULTICAST ROUTING IN WORMHOLE-SWITCHED VIRTUAL-CHANNELLESS... 557
