Louisiana State University

LSU Digital Commons
LSU Historical Dissertations and Theses

Graduate School

5-1994

A Dag Based Wormhole Routing Strategy
John Roy
Louisiana State University and Agricultural and Mechanical College


Recommended Citation
Roy, John, "A Dag Based Wormhole Routing Strategy" (1994). LSU Historical Dissertations and Theses.
8277.
https://digitalcommons.lsu.edu/gradschool_disstheses/8277

This Thesis is brought to you for free and open access by the Graduate School at LSU Digital Commons. It has
been accepted for inclusion in LSU Historical Dissertations and Theses by an authorized administrator of LSU
Digital Commons. For more information, please contact gradetd@lsu.edu.

A DAG BASED WORMHOLE ROUTING STRATEGY

A Thesis

Submitted to the Graduate Faculty of the
Louisiana State University and
Agricultural and Mechanical College
in partial fulfillment of the
requirements for the degree of
Master of Science in Electrical Engineering
in
The Department of Electrical and Computer Engineering

by
Kaushik Roy
B.S., Ranchi University, Bihar, India, 1991
May 1994

MANUSCRIPT THESES

Unpublished theses submitted for the Master's and Doctor's Degrees and deposited in the Louisiana State University Libraries are available for inspection. Use of any thesis is limited by the rights of the author. Bibliographical references may be noted, but passages may not be copied unless the author has given permission. Credit must be given in subsequent written or published work.

A library which borrows this thesis for use by its clientele is expected to make sure that the borrower is aware of the above restrictions.
LOUISIANA STATE UNIVERSITY LIBRARIES


Acknowledgements

I thank Dr. Suresh Rai, my thesis advisor, for his invaluable help and guidance

throughout this work. I would also like to thank Dr. Siqing Zheng and Dr. J.
Ramanujam, for serving as members of the examining committee. Hereby I also express
my gratitude to my friends Mr. Ram Sateesh Katta for his suggestions and Mr. Bruno

Gamon for his constant inspiration.


Contents

Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1. Introduction
    1.1 Architecture
    1.2 Topology
    1.3 Switching Techniques
    1.4 Problem Formulation and Thesis Layout

Chapter 2. Wormhole Routing - Concept and Technique
    2.1 Deadlock
    2.2 Wormhole Routing Algorithms
    2.3 Deterministic Routing
    2.4 Adaptive Routing
        2.4.1 Minimal Adaptive Routing
        2.4.2 Nonminimal Adaptive Routing
        2.4.3 The Turn Model
    2.5 Store-and-forward Revisited

Chapter 3. DAG Based Adaptive Routing
    3.1 Background
    3.2 The Algorithm
        3.2.1 Logical Basis of Deadlock Prevention
        3.2.2 Implementation
    3.3 Performance
        3.3.1 Simulation

Chapter 4. Results and Discussion
    4.1 Results
    4.2 The Costs of Virtual Channel
    4.3 Fault Tolerance
    4.4 Livelock

Chapter 5. Conclusion

Bibliography

Vita

List of Tables

1.1 Comparison of network latencies for various switching techniques
4.1 A comparison of various adaptive algorithms with regard to the saturation throughput
4.2 A comparison of various adaptive algorithms with respect to the number of virtual channels used

List of Figures

1.1 A generic multiprocessor based on a direct network
1.2 A generic node architecture
1.3 Comparison of different switching techniques
2.1 Wormhole routing
2.2 An example of channel deadlock involving four packets
2.3 A four-node network and the corresponding channel dependence graph
2.4 An illustration of the turn model in a 2D mesh
2.5 Examples of west first routing in an 8 x 8 2D mesh
3.1 A 2D mesh with bidirectional links
3.2 Construction of G1 from G
3.3 Flow diagram of the algorithm
4.1 Throughput vs latency, uniform traffic (8x8 mesh)
4.2 Throughput vs latency, matrix transpose traffic (8x8 mesh)
4.3 Throughput vs latency, bit reversal traffic (8x8 mesh)
4.4 Throughput vs latency, uniform traffic (16x16 mesh)
4.5 Throughput vs latency, matrix transpose traffic (16x16 mesh)
4.6 Throughput vs latency, bit reversal traffic (16x16 mesh)

Abstract

The wormhole routing (WR) technique is replacing the hitherto popular store-and-forward routing in message passing multicomputers, because the latter has speed and node size constraints. Wormhole routing is, on the other hand, susceptible to deadlock. A few WR schemes suggested recently in the literature concentrate on avoiding deadlock. This thesis presents a Directed Acyclic Graph (DAG) based WR technique. At low traffic levels the proposed method follows a minimal path, but the routing is adaptive at higher traffic levels. We prove that the algorithm is deadlock-free. The performance of this method is compared with that of a deterministic algorithm which is a de facto standard. We also compare its implementation costs with those of other adaptive routing algorithms, and the relative merits and demerits are highlighted in the text.

Chapter 1

Introduction

1.1 Architecture
Sequential computers are approaching a fundamental physical limit on their potential computational power. Electronic circuits are ultimately limited in their speed of operation by the speed of light, and many circuits already operate in the nanosecond range. Massively parallel computers are therefore attracting considerable interest.

Message passing multicomputers are usually organized as an array of nodes,
where each node consists of a processor with its own local memory and other supporting
devices. The processing nodes of a concurrent computer exchange data and synchronize

with one another by passing messages over an interconnection network [1]. The
interconnection network is often the critical component of a large parallel computer

because performance is very sensitive to network latency and throughput and because
the network accounts for a large fraction of the cost and power dissipation of the

machine.
An interconnection network is characterized by its topology, routing, and flow

control. The topology of a network is the arrangement of nodes and channels into a
graph. Routing specifies how a packet chooses a path in this graph. Flow control deals


with the allocation of the channel and buffer resources to a packet as it traverses this
path.
In an indirect network, there is no direct link between two nodes. Rather every
node has an input and output connection to the network, which may consist of several

stages of switching exchanges.

In a direct network each node has a direct connection to some other nodes, which are called its neighbors. Direct networks are popular because they scale well: as the number of nodes in the system increases, the total communication bandwidth, memory bandwidth, and processing capability of the system also increase. Figure 1.1 shows a generic multiprocessor and how the nodes are connected through a direct network [2].

The neighbors are defined by the topology of the network. A node communicates

with a node that is not its neighbor by sending a message through one of its neighbors.
To handle the complexities of routing messages in the direct network, each node often

has a router. Although a router’s function could be performed by the corresponding

local processor, dedicated routers are used to allow overlapped computation and
communication within each node. Figure 1.2 shows the architecture of a generic node
[2]. The router controls local input and output channels, which connect it to local

devices, and network input and output channels, which connect it to neighboring routers.

The time required to move data between nodes is critical to system performance,
as it effectively determines what granularity levels of parallelism are possible in

executing an application program. A metric commonly used to evaluate a direct network
system is communication latency. The communication latency is obtained as

Communication latency = startup latency + network latency + blocking time.
The startup latency is the time required for the system to handle the packet at both the
source and the destination nodes. The network latency is the time elapsed after the head

of a packet has entered the network at the source until the tail of the packet emerges

from the network at the destination.
Both the startup latency and the network latency are static for a given system. The blocking time reflects the dynamic behavior of the system; it includes delays due to channel contention, where two packets simultaneously require the same channel.
Furthermore, the channel width is the number of bits that can be transmitted

simultaneously on a physical channel between two adjacent nodes. The channel rate is

the peak rate at which bits can be transferred over a physical channel. Therefore
Channel bandwidth = channel width x channel rate.

1.2 Topology
The two most popular network topologies are n-dimensional meshes and k-ary n-cubes. Formally, an n-dimensional mesh has $k_0 \times k_1 \times \cdots \times k_{n-1}$ nodes, where $k_i$ is the number of nodes along dimension $i$ and $k_i \ge 2$. Each node $x$ is identified by $n$ coordinates, $s_{n-1}(x), \ldots, s_1(x), s_0(x)$, where $0 \le s_i(x) \le k_i - 1$ for $0 \le i \le n-1$. The nodes have n to 2n neighbors, depending on their location in the mesh [2].


In a k-ary n-cube, all nodes have the same number of neighbors. The definition of a k-ary n-cube differs from that of an n-dimensional mesh in that all the $k_i$ are equal to $k$. In a k-ary n-cube the lowest and the highest node in each dimension are connected by wraparound channels. A k-ary n-cube contains $k^n$ nodes. If $k = 2$, every node has n neighbors, one in each dimension; if $k > 2$, every node has 2n neighbors, two in each dimension [2].

A hypercube (binary n-cube) is the special case where $k = 2$. The torus (toroidal mesh) is formed when $n = 2$ in a k-ary n-cube. Both the hypercube and the torus are symmetric networks, since there is an automorphism that maps any node of the graph onto any other node.
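The neighbor counts for the two topologies can be checked with a short sketch. The function names and the tuple representation of node coordinates are illustrative assumptions, not part of the thesis. A mesh node has between n and 2n neighbors depending on its position, while a node of a k-ary n-cube has 2n neighbors when k > 2 and n neighbors when k = 2.

```python
def mesh_neighbors(node, dims):
    """Neighbors of `node` in an n-dimensional mesh.

    `node` is a coordinate tuple and dims[i] is the number of nodes along
    dimension i (hypothetical representation chosen for this sketch).
    """
    result = []
    for i, k in enumerate(dims):
        for step in (-1, 1):
            c = node[i] + step
            if 0 <= c < k:  # no wraparound channels in a mesh
                result.append(node[:i] + (c,) + node[i + 1:])
    return result


def cube_neighbors(node, k):
    """Neighbors of `node` in a k-ary n-cube (wraparound in every dimension)."""
    result = set()
    for i in range(len(node)):
        for step in (-1, 1):
            c = (node[i] + step) % k  # wraparound channel at the boundary
            result.add(node[:i] + (c,) + node[i + 1:])
    result.discard(node)  # only relevant for the degenerate case k = 1
    return sorted(result)


if __name__ == "__main__":
    print(len(mesh_neighbors((0, 0), (8, 8))))   # corner of an 8 x 8 mesh
    print(len(mesh_neighbors((3, 3), (8, 8))))   # interior node
    print(len(cube_neighbors((0, 0, 0), 2)))     # binary 3-cube
    print(len(cube_neighbors((0, 0), 8)))        # 8-ary 2-cube (torus)
```

For k = 2 the +1 and -1 steps in a dimension reach the same node, which is why the set collapses to one neighbor per dimension.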

Mesh networks, on the other hand, are asymmetric because the wraparound
channels are absent. Assuming uniform traffic between nodes, channels near the center

of the mesh are likely to experience higher traffic density than channels on the
periphery.
It is easier to provide deadlock-free routing in mesh networks than torus

networks. Most direct network topologies used in wormhole-routed systems are low
dimensional meshes and hypercubes. The Intel Touchstone Delta, the Intel Paragon, and

the Symult 2010 use a 2D mesh; the MIT J-machine and Caltech's Mosaic use a 3D mesh

[2]. We have therefore chosen a 2D mesh to simulate our routing technique.

Figure 1.1 A generic multiprocessor based on a direct network [2].

Figure 1.2 A generic node architecture [2].

1.3 Switching Techniques
The communication latency of a direct network depends on several architectural

characteristics; one of the most important is the type of switching technique used by
routers to transfer data from input channels to output channels. Four switching

techniques have been adopted for direct networks: store-and-forward, circuit switching,

virtual cut-through, and wormhole routing.
In the store-and-forward technique, when a packet reaches an intermediate node,

the entire packet is stored in a packet buffer. The packet is then forwarded to a

neighboring node when the next output channel is available and the neighboring node

has an available buffer. The drawbacks of store-and-forward routing are that every incoming packet must be stored in full, thereby consuming memory space, and that the network latency is proportional to the distance between the source and the destination nodes.
In circuit switching, a physical circuit is constructed between the source and destination nodes during the circuit establishment phase. The packets are then sent to the destination node in the packet transmission phase. Throughout the packet transmission phase the channels are reserved exclusively for the circuit. Therefore, buffers at the intermediate nodes are not necessary. After the transmission is over, the circuit is torn down.
In virtual cut-through, the packet header is examined upon arrival at an

intermediate node. The packet is stored at the intermediate node only if the next
required channel is busy; otherwise it is forwarded immediately without buffering.


Wormhole routing also uses a cut-through approach to switching. A packet is divided into a number of flits (flow control digits) for transmission. Refer to Chapter 2 for the details on WR techniques.

We compare the network latencies of various switching techniques in Table 1.1. Here, L is the packet length, B is the channel bandwidth, and D represents the length of the path between the source and destination nodes. Figure 1.3 compares the communication latency of wormhole routing with others in a contention-free network [2].

Wormhole routing is attractive because:
1. The network latency of message delivery is lower than in store-and-forward routing. As Table 1.1 shows, for wormhole routing the latency is nearly independent of the distance between the source and the destination nodes.
2. Large packet buffers at each intermediate node are obviated; only a small FIFO flit buffer is required. This gives it an edge over virtual cut-through.
3. It can share a physical channel between messages by using virtual channels. This makes it better than circuit switching.
Therefore the speed and node size constraints require the use of wormhole routing rather than store-and-forward, virtual cut-through, or circuit switching.

Table 1.1 Comparison of network latencies for various switching techniques.

Switching technique     Network latency       Notes
Store-and-forward       (L/B)D
Virtual cut-through     (L_h/B)D + L/B        L_h: length of header field; usually L >> L_h
Circuit switching       (L_c/B)D + L/B        L_c: length of control packet; usually L_c << L
Wormhole routing        (L_f/B)D + L/B        L_f: length of each flit; usually L_f << L
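As a rough numerical illustration of Table 1.1, the sketch below evaluates the four latency expressions. It assumes the subscripted forms implied by the notes in the table, with L_h, L_c, and L_f replacing L in the distance-dependent term. The function names and the parameter values are made up for illustration.

```python
def store_and_forward(L, B, D):
    """Latency (L/B) * D: the whole packet is retransmitted on each hop."""
    return (L / B) * D


def virtual_cut_through(L, B, D, L_h):
    """Latency (L_h/B) * D + L/B: only the header pays the per-hop cost."""
    return (L_h / B) * D + L / B


def circuit_switching(L, B, D, L_c):
    """Latency (L_c/B) * D + L/B: a control packet sets up the circuit."""
    return (L_c / B) * D + L / B


def wormhole(L, B, D, L_f):
    """Latency (L_f/B) * D + L/B: only one flit pays the per-hop cost."""
    return (L_f / B) * D + L / B


if __name__ == "__main__":
    L, B, L_f = 1024.0, 8.0, 8.0  # illustrative values: bits, bits per cycle, bits
    for D in (2, 8, 32):
        print(D, store_and_forward(L, B, D), wormhole(L, B, D, L_f))
```

With these values the store-and-forward latency grows by a factor of 16 between D = 2 and D = 32, while the wormhole latency changes only from 130 to 160 cycles, which is the sense in which it is nearly independent of distance when L_f << L.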

Figure 1.3 Comparison of different switching techniques: (1) store-and-forward switching, (2) circuit switching, and (3) wormhole routing [2].

1.4 Problem Formulation and Thesis Layout
Various algorithms for wormhole routing have been proposed in the literature. In this thesis we present an adaptive wormhole routing algorithm and prove that it is deadlock-free. Its performance is compared with that of a standard deterministic algorithm. We also make a comparative study of the merits and demerits of our algorithm and other adaptive routing algorithms.
Chapter 2 explains the concept of wormhole routing. We discuss the issue of deadlock.

We also present an overview of the various algorithms that have been suggested in the
literature.
Chapter 3 describes our algorithm, and proves that it is deadlock-free.

We present the simulation results and perform the cost analysis in Chapter 4. These

results help compare our method with existing techniques.
Finally, Chapter 5 concludes the thesis.

Chapter 2

Wormhole Routing - Concept and Technique

The wormhole routing has become quite popular in recent years. Wormhole
routing uses a cut-through approach to switching. A packet is divided into a number of
flits (flow control digits) for transmission. The bits constituting a flit are transmitted in
parallel between two routers. The header flit of a packet governs the route. As the

header advances along a specified route, the remaining flits follow the header in a

pipeline fashion. If the header flit finds a channel which is already in use, the header

is blocked until the channel becomes available. Rather than buffering the remaining flits
by removing them from the network channels, as in virtual cut-through, the flow control

within the network blocks the trailing flits and they remain in flit buffers along the

established route. Once a channel has been acquired by a packet, it is reserved for the

packet. The channel is released when the last flit, the tail flit, has been transmitted on the channel [2].
Typically, a single buffer is associated with each channel. Once a packet $P_i$ is allocated a buffer $b_i$, no other packet $P_j$ can use the associated channel $c_i$ until $P_i$ releases $b_i$. In networks that use flit-level flow control, packet $P_i$ may be blocked due to contention elsewhere in the network while still holding $b_i$. In this case, channel $c_i$ is idled even though there may be other packets in the network, e.g., packet $P_j$, that can make productive use of the channel [7]. Figure 2.1 shows the mechanism of wormhole routing [2].

2.1 Deadlock
The buffers are the resources in store-and-forward and virtual cut-through switching, while channels are the resources in circuit switching and wormhole routing. Since blocked packets holding channels (and their corresponding flit buffers) remain in the network, wormhole routing is particularly susceptible to deadlock. Deadlock is avoided by the routing algorithm. By ordering the network resources and requiring that packets request and use these resources in strictly monotonic order, circular wait, a necessary condition for deadlock, is avoided. Figure 2.2 shows how a channel deadlock involving four packets can occur [2].
If N represents the set of processing nodes and C represents the set of
communication channels, a routing function:

$$R : C \times N \to C \qquad (2.1)$$

maps the current channel $c_c$ and destination node $n_d$ to the next channel $c_n$ on the route from $c_c$ to $n_d$: $R(c_c, n_d) = c_n$. A channel is not allowed to route to itself. This definition of R precludes the route from being dependent on the presence or absence of other traffic in the network. R describes strictly deterministic and non-adaptive routing functions [3].

Figure 2.1 Wormhole routing [2].

Figure 2.2 An example of channel deadlock involving four packets [2].

A deadlocked configuration for a routing function R is a non-empty legal configuration of channel queues such that

$$\forall c_i \in C,\ (\forall n \ni \mathrm{member}(n, c_i),\ n \neq d_i \ \text{and} \qquad (2.2)$$

$$c_j = R(c_i, n) \Rightarrow \mathrm{size}(c_j) = \mathrm{capacity}(c_j)) \qquad (2.3)$$

where $\mathrm{size}(c_j)$ denotes the number of flits in the queue for channel $c_j$. A routing function R is deadlock-free if no deadlocked configuration exists for that function on that network [4].

A channel dependence graph can be used to develop a deadlock-free routing algorithm. The channel dependence graph for a direct network and a routing algorithm is a directed graph D = G(C, E), where the vertex set C(D) consists of all the unidirectional channels in the network, and the edge set E(D) includes all the pairs of connected channels, as defined by the routing algorithm. In other words, if $(c_i, c_j)$ is an element of E(D), then $c_i$ and $c_j$ are, respectively, an input channel and an output channel of a node, and the routing algorithm may route packets from $c_i$ to $c_j$.

Theorem 2.1: A routing function R is deadlock-free if and only if there are no cycles in the channel dependence graph D.

For a proof of this theorem refer to [4]. Figure 2.3 shows a four-node network and its corresponding channel dependence graph. It also illustrates a channel dependence graph for deadlock-free routing.
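Theorem 2.1 turns deadlock freedom into a cycle check on the channel dependence graph. The sketch below builds that graph from a routing function and searches it for a cycle. The two example networks are assumptions made for illustration (a four-node unidirectional ring and the same nodes with the wraparound channel removed); they are not claimed to be the networks of Figure 2.3, and all names are hypothetical.

```python
def dependency_graph(channels, nodes, route):
    """Edge (c_i, c_j) exists if route(c_i, n) == c_j for some destination n."""
    edges = {c: set() for c in channels}
    for c in channels:
        for dest in nodes:
            nxt = route(c, dest)
            if nxt is not None:
                edges[c].add(nxt)
    return edges


def has_cycle(edges):
    """Depth-first search with three colors; a grey successor closes a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {c: WHITE for c in edges}

    def visit(c):
        color[c] = GREY
        for nxt in edges[c]:
            if color[nxt] == GREY:
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[c] = BLACK
        return False

    return any(color[c] == WHITE and visit(c) for c in edges)


NODES = range(4)
# Channel name -> (tail node, head node).
RING = {f"c{i}": (i, (i + 1) % 4) for i in range(4)}  # with wraparound channel c3
LINE = {f"c{i}": (i, i + 1) for i in range(3)}        # wraparound channel removed


def ring_route(c, dest):
    head = RING[c][1]
    return None if dest == head else f"c{head}"  # keep going around the ring


def line_route(c, dest):
    head = LINE[c][1]
    return f"c{head}" if dest > head else None   # only forward hops exist


ring_edges = dependency_graph(RING, NODES, ring_route)
line_edges = dependency_graph(LINE, NODES, line_route)

if __name__ == "__main__":
    print(has_cycle(ring_edges), has_cycle(line_edges))
```

The ring yields the cycle c0, c1, c2, c3, c0, so deadlock is possible; removing the wraparound channel leaves an acyclic graph.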

Figure 2.3 A four-node network and the corresponding channel dependence graph: (a) a direct network with four nodes, (b) channel dependence graph, (c) channel dependence graph based on restricted minimal routing [2].

2.2 Wormhole Routing Algorithms
The routing of messages among nodes can be of two types: deterministic and adaptive. It can further be classified as either minimal or nonminimal. Various routing strategies have been proposed; each, however, has some drawbacks. Adaptive routing schemes are generally considered better than deterministic routing schemes since they can handle hot spots and hardware failures better. Furthermore, in most cases they perform better at high traffic levels.

However, adaptive routing algorithms are more complex than deterministic algorithms. Hence they increase the delay a flit suffers at each router. Furthermore, they are more difficult to implement in hardware.

2.3 Deterministic Routing
In deterministic routing the path is completely determined by the source and
destination addresses. Here the behavior of the algorithm is independent of current
network conditions. Two famous deterministic algorithms are the e-cube algorithm and
the XY algorithm [3].

In an n-cube, each node is represented by an n-bit binary number. Each node has n outgoing channels, and the i-th channel corresponds to the i-th dimension. In the e-cube routing algorithm, the packet header carries the destination node address d. When a node v in the n-cube receives a packet, the e-cube routing algorithm computes c = d XOR v. If c = 0, the packet is forwarded to the local processor. Otherwise, the packet is forwarded to the outgoing channel in the k-th dimension, where k is the position of the rightmost (alternatively, leftmost) 1 in c.
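The e-cube rule can be written in a few lines. This is a minimal sketch that assumes node addresses are plain integers and uses the rightmost-1 convention; the function names are illustrative.

```python
def ecube_next_dimension(v, d):
    """Dimension of the next hop from node v toward destination d.

    Returns None when v == d, i.e. the packet is delivered to the local
    processor.
    """
    c = v ^ d
    if c == 0:
        return None
    return (c & -c).bit_length() - 1  # position of the rightmost 1 in c


def ecube_path(src, dst):
    """Sequence of nodes visited by a packet routed with the e-cube rule."""
    path = [src]
    v = src
    k = ecube_next_dimension(v, dst)
    while k is not None:
        v ^= 1 << k  # traverse the channel in dimension k
        path.append(v)
        k = ecube_next_dimension(v, dst)
    return path


if __name__ == "__main__":
    print(ecube_path(0b000, 0b101))  # [0, 1, 5]
    print(ecube_path(0b110, 0b011))  # [6, 7, 3]
```

Because the differing dimensions are always corrected in increasing order, the path depends only on the source and destination addresses, which is what makes the algorithm deterministic.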

In a 2D mesh, each node is represented by its position (x,y) in the mesh. In the
XY routing algorithm, packets are first sent along the X dimension and then along the

Y dimension. In other words, at most one turn is allowed, and the turn must be from the

X dimension to the Y dimension.
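XY routing is equally short to sketch. The code below assumes nodes are (x, y) integer pairs and that every channel on the path is free; `xy_path` and `count_turns` are hypothetical helper names.

```python
def xy_path(src, dst):
    """Nodes visited by XY routing: correct x first, then y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def count_turns(path):
    """Number of changes of direction along a path."""
    steps = [(b[0] - a[0], b[1] - a[1]) for a, b in zip(path, path[1:])]
    return sum(1 for s, t in zip(steps, steps[1:]) if s != t)


if __name__ == "__main__":
    p = xy_path((1, 4), (3, 2))
    print(p)               # [(1, 4), (2, 4), (3, 4), (3, 3), (3, 2)]
    print(count_turns(p))  # 1
```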
The two algorithms stated above are minimal in nature. However, it has been shown that for k-ary n-cubes with k > 4, it is impossible to have a deadlock-free minimal deterministic algorithm. Nevertheless, by breaking cycles in the channel dependency graph, nonminimal deterministic algorithms can be developed [2].

2.4 Adaptive Routing
The main disadvantage of deterministic routing is that it cannot respond to dynamic network conditions such as congestion, so adaptive routing is used. Most adaptive routing algorithms use virtual channels, where the same physical medium is shared by several channels. Note that putting multiple pairs of physical channels between adjacent nodes is very expensive. One way to circumvent this problem is to multiplex several virtual channels on a physical channel. Each virtual channel has its own flit buffer, control, and data path [7]. The construction of the resource graph for a routing algorithm is still the same, except that virtual channels rather than physical channels are used. If a link is defined to be involved in deadlock, and at least one of the deadlocked messages waits for any of its virtual channels, then the above properties still hold true.

Dally [7] has shown that virtual channels increase the network throughput for certain types of networks and reduce the dependence of throughput on the depth of the network. They also provide an additional degree of freedom in allocating resources to packets in the network.

2.4.1 Minimal adaptive routing
One general adaptive routing technique works by partitioning the channels into disjoint subsets. Each subset constitutes a corresponding subnetwork. Packets are routed through different subnetworks depending on the location of the destination nodes. As an example, consider a 2D mesh that contains an additional pair of channels in the y dimension. The network can be partitioned into two subnetworks, called the +x subnetwork and the -x subnetwork, each having a pair of channels in the y dimension and a unidirectional channel in the x dimension. If the destination is to the right of the source, the packet is routed through the +x subnetwork. If the destination is to the left of the source, the -x subnetwork is used. If the destination is in the same column as the source, either subnetwork can be used.
For any pair of source and destination nodes, the channels will be traversed in a descending order, no matter which shortest path is taken. Hence, the occurrence of deadlock is prevented.

Providing deadlock-free minimal fully adaptive routing algorithms for the hypercube, the 2D torus, or the more general k-ary n-cube may require additional channels. Linder and Harden [5] have shown that a k-ary n-cube can be partitioned into $2^{n-1}$ subnetworks, with n+1 levels per subnetwork and $k^n$ channels per level. The number of additional channels increases rapidly with n. This approach does provide minimal fully adaptive routing, but it is impractical for large n because of the growing number of virtual channels required.

2.4.2 Nonminimal adaptive routing

In the static dimension reversal routing algorithm [7], there are r pairs of channels between any two adjacent nodes. The network is partitioned into r subnetworks. The class-i (0 <= i <= r-1) subnetwork consists of all the i-th pair channels. The packet header carries an additional class field c, initially set to 0. Packets with c < r-1 can be routed in any direction in the class-c subnetwork. However, each time a packet is routed from a high-dimensional channel to a low-dimensional channel, that is, in reverse of the dimension ordering, the c field is increased by 1. Once the value of c has reached r-1, the packet must use the deterministic dimension-ordered routing.
In the dynamic dimension reversal routing algorithm [8], the channels are divided

into two nonempty classes: adaptive and deterministic. Packets originate in the adaptive
channels, where they can be routed in any direction with no limit on the number of

times the packet can be routed in reverse dimension order. However, a packet with c = p is not allowed to wait on a channel currently occupied by a packet with c = q if p >= q. A packet which reaches a node where all permissible output channels are occupied by packets whose values of c are less than or equal to its own must switch to the deterministic class of channels.

2.4.3 The turn model
The turn model proposed by Glass and Ni [6] provides a systematic way to

develop maximally adaptive algorithms, both minimal and nonminimal for a given

network without adding channels.
The basic concept behind this model is to prohibit the smallest number of turns

such that cycles are avoided. In the case of a 2D mesh, there are eight possible turns
and two possible abstract cycles. This is shown in Figure 2.4. Cycles among packets

may result if turns are not restricted.

The following six steps can be used to develop adaptive routing algorithms for
n-dimensional meshes and k-ary n-cubes [6]:
1. Classify channels according to the direction in which they route packets.
2. Identify the turns that occur between one direction and another omitting 0-degree and

180-degree turns.
3. Identify the simple cycles these turns can form.
4. Prohibit one turn in each cycle.

5. In the case of k-ary n-cubes, incorporate as many turns as possible that involve

wraparound channels.

6. Add 180-degree and 0-degree turns, which are needed for nonminimal routing algorithms or if there are multiple channels in the same direction.

Theorem 2.2: The minimum number of turns that must be prohibited to prevent deadlock in an n-dimensional mesh is n(n-1), or a quarter of the total number of possible turns.
For a proof of this theorem refer to [6].

We will use this concept in Chapter 3 to supplement the proof for our algorithm. The two most popular algorithms based on the turn model are the west-first algorithm and the north-last algorithm. In the west-first algorithm a packet is first routed west, if necessary, and then adaptively south, east, and north. Figure 2.5 shows this method using three examples in an 8 x 8 mesh.
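The west-first rule can be illustrated with a small decision function. The sketch covers only the minimal variant (it never offers a hop that moves away from the destination) and assumes that x increases toward the east and y toward the north; both the coordinate convention and the function name are assumptions made for this example.

```python
def west_first_choices(cur, dst):
    """Output directions a minimal west-first router may offer at `cur`.

    While the destination lies to the west, west is the only choice, because
    no turn into the west direction is allowed later. Otherwise the header
    may adaptively pick any productive direction among east, north and south.
    """
    dx, dy = dst[0] - cur[0], dst[1] - cur[1]
    if dx < 0:
        return ["W"]
    choices = []
    if dx > 0:
        choices.append("E")
    if dy > 0:
        choices.append("N")
    if dy < 0:
        choices.append("S")
    return choices  # empty list: the packet has arrived


if __name__ == "__main__":
    print(west_first_choices((5, 5), (2, 7)))  # ['W']
    print(west_first_choices((2, 2), (5, 0)))  # ['E', 'S']
```

A router would pick any free channel among the returned directions; with two entries the header can route around a busy channel, which is where the adaptivity comes from.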

2.5 Store-and-forward Revisited
Recently Boppana and Chalasani have developed fully adaptive deadlock-free

wormhole routing algorithms from previous store-and-forward (SAF) algorithms. They
have shown [11] that certain types of store-and-forward algorithms can be used for

deadlock-free wormhole routing.
Figure 2.4 An illustration of the turn model in a 2D mesh: (a) abstract cycles in a 2D mesh, (b) four turns (solid arrows) allowed in XY routing, and (c) six turns (solid arrows) allowed in west-first routing [6].

Figure 2.5 Examples of west first routing in an 8 x 8 2D mesh [6].

Given a SAF routing algorithm, a corresponding wormhole routing algorithm can be derived as follows. If a message can occupy a buffer of class $b_0, b_1, \ldots,$ or $b_m$ at an intermediate node and go through a communication channel (or physical channel), then virtual channels named $c_0, c_1, \ldots, c_m$ are provided for wormhole routing on that communication channel. If a message occupies a buffer of class $b_i$ at the intermediate node j and takes the communication channel between nodes j and k in SAF routing, then, in the corresponding wormhole routing, the header flit of the message acquires virtual channel $c_i$ in the communication channel connecting j and k.
If there is a directed cycle in the resource graph of the SAF algorithm, then a
directed cycle can occur in the header graph of the corresponding wormhole routing

algorithm [11].
Our DAG based algorithm, which draws some concepts from a previous SAF

algorithm [12], is a limiting case where we have just one class of buffer and hence the
number of communication channels provided between the nodes is one.

Chapter 3

DAG Based Adaptive Routing

3.1 Background
Most adaptive routing algorithms use virtual channels to avoid deadlock and are considered to be better than deterministic algorithms. A fully adaptive routing scheme by Linder and Harden [5] needs $(n+1)2^{n-1}$ virtual channels. Another adaptive algorithm, presented by Boppana and Chalasani [11], uses $2^n$ virtual channels. Hence the number of virtual channels increases rapidly with the size of the network. Although adding a virtual channel to a physical channel is less expensive than adding a new physical channel, it is not free. A virtual channel involves adding buffer space and control logic to the routers at the ends of the physical channel so that the virtual channels can share the physical channel and routers. It also reduces the bandwidth available to the virtual channels already sharing the physical channel [6].

Here we present an adaptive routing technique which, at its minimum, does not require virtual channels. The turn model, which is partially adaptive, also does not require virtual channels as such. Nevertheless, extra channels can be added to it and its performance studied. Obviously, adding virtual channels will improve its performance.

The algorithm presented in this chapter is based on a Directed Acyclic Graph (DAG) model. Therefore circular wait, a necessary condition for deadlock, is avoided. In this algorithm, under certain conditions, packets are forcibly rerouted around potential deadlocks. Furthermore, one arbitrarily chosen node is required to accept, within finite time, any packet seeking entrance to it, thereby erasing the packet. It is argued nonetheless that this condition is imposed infrequently enough and is sufficiently well manageable by heuristic techniques.

3.2 The Algorithm
Let any network N be represented by an undirected graph G. Nodes in G correspond
to processors in N; edges in G correspond to communication lines in N. We assume
that the communication lines are bidirectional. We construct a directed graph G1 with
the following properties.

1. The nodes and directed edges of G1 correspond to the nodes and the undirected edges
of G.
2. G1 is acyclic and has exactly one node with no incoming edges. This unique node
will be called the root of G1.

We can produce a directed graph G1 from an undirected graph G in the following
manner:

a) Construct a directed spanning tree T of G with root r. All nodes in N can therefore
be reached from r via the edges of T.

b) Assign directions to the edges of G which are not part of T in such a manner that
G1 remains acyclic. To do this, we direct each edge connecting nodes at
different distances from the root so that it points away from the root.
An edge between nodes at the same distance from the root can be pointed in either
direction, provided that G1 remains acyclic.
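
The construction in steps (a) and (b) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis simulator: the function names `mesh_edges` and `build_g1` are hypothetical, the spanning tree is taken to be a breadth-first tree, the network is assumed connected, and ties between nodes at equal distance from the root are broken by comparing node labels.

```python
from collections import deque


def mesh_edges(rows, cols):
    """Undirected links of a rows x cols mesh; nodes are (row, col) tuples."""
    edges = []
    for r in range(rows):
        for c in range(cols):
            if r + 1 < rows:
                edges.append(((r, c), (r + 1, c)))
            if c + 1 < cols:
                edges.append(((r, c), (r, c + 1)))
    return edges


def build_g1(nodes, edges, root):
    """Orient every undirected edge so that the result is acyclic and
    the root is the only node without an incoming edge."""
    neighbors = {v: [] for v in nodes}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)

    # Breadth-first search: hop distance of every node from the root.
    # The BFS tree plays the role of the spanning tree T (an assumption).
    dist = {root: 0}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        for w in neighbors[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)

    def rank(v):
        # Distance first; the node label breaks ties at equal distance.
        return (dist[v], v)

    # Every edge points from lower rank to higher rank, i.e. away from
    # the root, so no directed cycle can exist.
    return [(a, b) if rank(a) < rank(b) else (b, a) for a, b in edges]


if __name__ == "__main__":
    nodes = [(r, c) for r in range(3) for c in range(3)]
    for edge in build_g1(nodes, mesh_edges(3, 3), root=(0, 0)):
        print(edge)
```

Because every edge is directed from a lower rank to a strictly higher rank, acyclicity follows directly, and every non-root node keeps the incoming edge from its breadth-first parent.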

Figure 3.1 A 2D mesh with bidirectional links. The links form the edges of the
undirected graph G.

Figure 3.2 Construction of the directed acyclic graph G1 from G.

Now we construct the graph G2. It is the inverse graph of G1 and is obtained by
reversing the direction of all edges of G1. Thus, every hop a header flit (of a packet)
makes in N corresponds to either a hop in G1 or a hop in G2. We say that a header is
a G1 header if the next hop it will take is a G1 link. Similarly, if a header is to take a G2
link, we call it a G2 header.
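
As a small illustration of these definitions, the sketch below reverses a list of directed edges to obtain G2 and labels a hop as a G1 or G2 hop. The function names and the four-node example graph are hypothetical and chosen only for the illustration.

```python
def build_g2(g1_edges):
    """G2 is G1 with the direction of every edge reversed."""
    return [(b, a) for a, b in g1_edges]


def header_kind(current, next_node, g1_edges):
    """Classify the hop current -> next_node as a 'G1' or a 'G2' hop."""
    g1 = set(g1_edges)
    if (current, next_node) in g1:
        return "G1"
    if (next_node, current) in g1:
        return "G2"
    raise ValueError("the two nodes are not adjacent in the network")


if __name__ == "__main__":
    # Hypothetical four-node DAG rooted at "r".
    g1 = [("r", "a"), ("r", "b"), ("a", "c"), ("b", "c")]
    print(build_g2(g1))
    print(header_kind("r", "a", g1), header_kind("c", "b", g1))
```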

The destination node address is contained in the header flit. At low traffic levels,
the header takes a minimal path. A routing is said to be minimal if the path selected is
one of the shortest paths between the source and the destination. That is, every
channel the header takes brings the packet closer to the destination. Here the header
tries to take the X-direction first and then the Y-direction (alternatively, the Y-direction first).
However, if it finds that the next channel in the minimal path is blocked, then it changes its
direction and may take a non-minimal path. In other words, at low traffic levels it takes
a minimal path, whose length equals the Manhattan distance between the source node and
the destination node. The Manhattan distance between two points (x1,y1) and (x2,y2) is
|x2-x1| + |y2-y1|. At higher traffic levels, it adaptively takes a path
whose length may be greater than the Manhattan distance between the source node and the
destination node.
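
The low-traffic behaviour can be illustrated with a short sketch of X-then-Y minimal routing on a mesh. The helper names are hypothetical and channel blocking is not modelled; the point is only that the resulting hop count equals the Manhattan distance.

```python
def manhattan(src, dst):
    """Manhattan distance |x2 - x1| + |y2 - y1| between two mesh nodes."""
    return abs(dst[0] - src[0]) + abs(dst[1] - src[1])


def xy_next_hop(cur, dst):
    """Next node on the X-first, then-Y minimal path, or None on arrival."""
    x, y = cur
    if x != dst[0]:
        return (x + (1 if dst[0] > x else -1), y)
    if y != dst[1]:
        return (x, y + (1 if dst[1] > y else -1))
    return None


def xy_path(src, dst):
    """All nodes visited from src to dst when no channel is blocked."""
    path = [src]
    while path[-1] != dst:
        path.append(xy_next_hop(path[-1], dst))
    return path


if __name__ == "__main__":
    path = xy_path((1, 2), (4, 0))
    print(path, "hops:", len(path) - 1)
```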

The deadlock-free property of the algorithm derives from the fact that if, at any
non-root node Nq, the packet header finds that the next required channel is a G1 link and
it is blocked, then it is forced to redirect itself over a G2 link.

This is also a pragmatic approach to congestion control. That is, if there is too
much congestion in a particular area, the packets will be forced to move on a G2 link.
In other words, they will move towards the root. A packet which is forced to travel to the
root because of congestion is erased when it finally reaches the root.

We note that the forced redirection of packets takes place only for G1 packets.
If a G2 packet finds the required channel to be blocked, then it has to wait.
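
The redirection rule can be written as a small decision function. This is a sketch of one possible reading, not the flow diagram of the thesis: the name `route_step`, its arguments, and the choice of the first free G2 link are assumptions. A blocked G1 header that finds no free G2 link is treated here as having become a G2 header that waits.

```python
def route_step(wanted_kind, wanted_free, free_g2_links):
    """Decide what a header does at a non-root node.

    wanted_kind   -- 'G1' or 'G2', the kind of the next required channel
    wanted_free   -- True if that channel is currently free
    free_g2_links -- free G2 links leaving this node (hypothetical input)
    """
    if wanted_free:
        return ("advance", None)
    if wanted_kind == "G1" and free_g2_links:
        # A blocked G1 header is forced onto a G2 link, towards the root.
        return ("redirect", free_g2_links[0])
    # A blocked G2 header is never redirected; it waits.
    return ("wait", None)


if __name__ == "__main__":
    print(route_step("G1", True, []))
    print(route_step("G1", False, ["west"]))
    print(route_step("G2", False, ["west"]))
```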

Theorem: The DAG algorithm is deadlock-free.
Proof:
1) No G2 header can be deadlocked.
Let P1, P2,...,Pm be packets with G2 headers. A deadlock can occur only if P1 (waits
for) P2 (waits for) P3 ... (waits for) Pm (waits for) P1. But G2 is acyclic, so such a
situation cannot arise. The only possible locking can take place if the root node is
involved in the locked path. But the root can delete excess packets, and hence locking
will not arise.

2) No G1 header can be deadlocked.
Deadlock can occur only if there is a circular wait amongst G1 packets. But since
a G1 header is forced to become a G2 header if it finds that its desired channels are
blocked, deadlock cannot occur. Thus a G1 header may be forced to wait for a G2
packet, but a situation where all G1 packets are blocked in a cycle is avoided.

Therefore, no packet, whether G1 or G2, can be deadlocked.
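
The proof relies on G2 being acyclic whenever G1 is. A quick way to check this for a concrete network is a topological-sort test (Kahn's algorithm), sketched below. The function name and the example graph are hypothetical; the check illustrates the acyclicity premise only and is not a proof of deadlock freedom.

```python
from collections import deque


def is_acyclic(nodes, edges):
    """Kahn's algorithm: a directed graph is acyclic exactly when every
    node can be removed in topological order."""
    indegree = {v: 0 for v in nodes}
    successors = {v: [] for v in nodes}
    for a, b in edges:
        successors[a].append(b)
        indegree[b] += 1

    queue = deque(v for v in nodes if indegree[v] == 0)
    removed = 0
    while queue:
        v = queue.popleft()
        removed += 1
        for w in successors[v]:
            indegree[w] -= 1
            if indegree[w] == 0:
                queue.append(w)
    return removed == len(nodes)


if __name__ == "__main__":
    nodes = ["r", "a", "b", "c"]
    g1 = [("r", "a"), ("r", "b"), ("a", "c"), ("b", "c")]
    g2 = [(b, a) for a, b in g1]  # reversing a DAG yields a DAG
    print(is_acyclic(nodes, g1), is_acyclic(nodes, g2))
```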


3.2.1 Logical basis of deadlock prevention
We have seen in the turn model, proposed by Glass and Ni [6], that deadlock-free
adaptive routing schemes can be developed by restricting the turns a packet can
take. Here also we see that two turns out of the eight possible turns are prevented, should
a deadlock situation arise. We refer back to Theorem 2.2 for this.
Another analogy can be drawn from the dimension reversal algorithm by Dally and Aoki
[8]. In their case, cycles in the wait-for graph are eliminated by not allowing a packet to
wait on a buffer held by a packet with a lower DR number.

3.2.2 Implementation

When traffic in the network is such that a deadlock is possible, excess packets
are drawn to the root and are eliminated. Such a situation may arise only if the network

is operating at very high throughput levels. In real systems this condition is rare. Actual
systems almost never operate at extremely high throughputs. Nevertheless, packet

deletion can be reduced if the root node has the capacity to store a few redirected

packets.
One way of implementing it is shown in the form of a flow diagram in Figure
3.3. This can be easily implemented in hardware or in software.

Figure 3.3 Flow diagram of the algorithm. In the diagram, -(+)x denotes direction.

3.3 Performance
We are interested in the average channel utilization $u$, average latency $l$, and
average wait time $w$ of a message. The average time taken for transmission of a
message is:

$$ w + (m_l + d - 1)\, t_f \tag{3.1} $$

where $m_l$, $t_f$, and $d$ represent the average length of the message in flits, the time to
transfer a flit between neighbors, and the average number of hops taken by a message,
respectively. The average number of hops, in uniform traffic, is the average diameter of
the network [11]. For a k-ary n-cube it is approximately nk/4. The average channel
utilization refers to the fraction of a link's bandwidth utilized in any time interval. This is
also the normalized throughput of the network, the ratio of the network bandwidth
utilized to the raw bandwidth available.
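
A worked example may help here. The sketch below takes Equation (3.1) as w + (m_l + d - 1) t_f and uses the nk/4 approximation for the average hop count quoted above. The function names and the sample values (an 8-ary 2-cube, four-flit messages, one cycle per flit, no waiting) are assumptions chosen for illustration.

```python
def avg_hops_kary_ncube(k, n):
    """Approximate average number of hops, n*k/4, as quoted in the text."""
    return n * k / 4


def transmission_time(w, m_l, d, t_f):
    """Average transmission time w + (m_l + d - 1) * t_f of Equation (3.1)."""
    return w + (m_l + d - 1) * t_f


if __name__ == "__main__":
    d = avg_hops_kary_ncube(k=8, n=2)  # 2 * 8 / 4 = 4 hops
    # No waiting, four-flit message, one cycle per flit: (4 + 4 - 1) * 1 = 7
    print(transmission_time(w=0, m_l=4, d=d, t_f=1))
```

With these values the header needs four hops and the remaining three flits follow in pipeline fashion, giving seven cycles in the absence of contention.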

The latency of a message is the elapsed time from when the message is initiated

until the message is completely received. Latency is measured by applying a constant
rate source to each input and measuring the time from packet creation until the last flit

of the packet is accepted at the destination [8]. Source queuing time is included in the
latency measurement.
Throughput is the number of messages the network can deliver per unit time. The

normalized throughput or the average channel utilization is computed as:

$$ u = \lambda\, m_l\, t_f \left( \frac{\text{Number of nodes}}{\text{Number of channels}} \right) \tag{3.2} $$

where $\lambda$ is the average message interarrival time. The numerator computes the average
traffic generated by a node, and the denominator gives the available bandwidth due to
the physical channels originating from a node.

The message length in flits depends on the width of the physical channels. For
low-dimensional networks, it is expected that soon 32-bit or 64-bit wide links will be
used. In that case, a message of about 256 bits is only 8 or 4 flits long, with each flit

being 32 or 64 bits [11].

3.3.1 Simulation
The simulator is a C program implemented on the IBM RS/6000. The
performance of the algorithm is measured using uniform, matrix transpose, and bit
reversal traffic patterns. For matrix transpose traffic, if the source node is (i, j), then the

destination node is

In the bit reversal traffic, each node p sends messages to node
q, where q is the bit reversal of p. For example, node (101011)2 sends messages to node
(110101)2.
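
The bit reversal mapping can be sketched as follows. The function name is hypothetical, and the node address is assumed to be a plain integer of fixed width (6 bits for a 64-node 8 x 8 mesh); how the thesis simulator numbers its nodes is not stated.

```python
def bit_reverse(p, width):
    """Return the integer whose width-bit binary form is that of p reversed."""
    q = 0
    for _ in range(width):
        q = (q << 1) | (p & 1)  # move the lowest bit of p onto the end of q
        p >>= 1
    return q


if __name__ == "__main__":
    src = 0b000011
    dst = bit_reverse(src, 6)
    print(format(src, "06b"), "->", format(dst, "06b"))
```

Applying the mapping twice returns the original address, so under this pattern nodes exchange messages in fixed pairs, and nodes with palindromic addresses send to themselves.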

The processors generate messages at time intervals chosen from an exponential
distribution. Messages that cannot enter the network after generation because of blocking
are queued at the source.
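
Message generation of this kind can be sketched with the Python standard library. The function name, the seed, and the mean interarrival time are hypothetical; the thesis simulator itself is a C program and its generator is not shown in the source.

```python
import random


def arrival_times(mean_interarrival, count, seed=0):
    """Generation times of `count` messages at one node, with exponentially
    distributed gaps between consecutive messages."""
    rng = random.Random(seed)  # seeded so that a run can be repeated
    t = 0.0
    times = []
    for _ in range(count):
        # expovariate takes the rate, i.e. 1 / mean interarrival time.
        t += rng.expovariate(1.0 / mean_interarrival)
        times.append(t)
    return times


if __name__ == "__main__":
    for t in arrival_times(mean_interarrival=10.0, count=5):
        print(round(t, 2))
```

Exponentially distributed gaps between messages correspond to a Poisson arrival process at each source.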


We have used 8x8 and 16x16 meshes for our simulations, under the
following assumptions:
1) Packet destinations are uniformly randomly distributed.
2) A packet that arrives at its destination is consumed without waiting.
3) All packets are of length L.
4) At each source, packets are created by a Poisson process.
5) Packet blocking probabilities are independent.

Chapter 4

Results and Discussion

4.1 Results
Figures 4.1 through 4.6 show the average latency (in cycles) versus normalized
throughput for the various cases. We find that for uniform traffic the deterministic
algorithm performs better than the DAG algorithm: deterministic routing has
a lower latency at high throughput, although at low throughputs both algorithms
have about the same latency. For non-uniform traffic, however, the DAG algorithm exhibits
superior performance.
For matrix transpose traffic, the DAG algorithm performs better than the
deterministic algorithm. The average path length for matrix transpose traffic (in hops)
is greater than that of uniform traffic.

We also observe that the DAG algorithm outperforms the deterministic
algorithm for the bit reversal traffic.

In every case the deterministic algorithm and the DAG algorithm exhibit equal latency
for low traffic. This is because both algorithms take a minimal, wait-free path. It is
only when the traffic load increases that the latencies vary.
Thus adaptive routing has an advantage over deterministic routing for traffic
patterns that load channels non-uniformly.

Figure 4.1 Average channel utilization vs average message latency for wormhole
routing of four-flit size messages on an 8 x 8 mesh for uniform traffic.

Figure 4.2 Average channel utilization vs average message latency for wormhole
routing of four-flit size messages on an 8 x 8 mesh for matrix transpose traffic.

Figure 4.3 Average channel utilization vs average message latency for wormhole
routing of four-flit size messages on an 8 x 8 mesh for bit reversal traffic.

Figure 4.4 Average channel utilization vs average message latency for wormhole
routing of four-flit size messages on a 16 x 16 mesh for uniform traffic.

Figure 4.5 Average channel utilization vs average message latency for wormhole
routing of four-flit size messages on a 16 x 16 mesh for matrix transpose traffic.

Figure 4.6 Average channel utilization vs average message latency for wormhole
routing of four-flit size messages on a 16 x 16 mesh for bit reversal traffic.

Each figure plots avg. latency (cycles) against normalized throughput.

That the deterministic algorithm outperforms this adaptive algorithm for uniform
traffic is not unusual. Most of the other adaptive routing algorithms saturate earlier than
the e-cube algorithm. The algorithm based on the turn model, for instance, saturates about
25% before the e-cube algorithm.

According to Glass and Ni [6], the reason the nonadaptive routing algorithms
perform better than the partially adaptive routing algorithms for uniform traffic is that

they happen to embody global, long-term information about this traffic pattern. From

a global, long-term point of view, the uniform traffic pattern starts with message traffic

spread evenly across the mesh, and the XY algorithm maintains that evenness. The
adaptive algorithms, on the other hand, select channels based on local, short-term

information. These selections tend to benefit just the routed packet and only for the

immediate future and tend to interfere with other packets. The result is that the evenness
of uniform traffic is not maintained as well as when global information is used.

Despite the superior performance of the nonadaptive algorithms for uniform

traffic, the adaptive algorithms probably provide better performance in real systems [6].

In fact we know of no real applications that generate uniform traffic. A traffic pattern
is determined by the application and how its processes are mapped to the nodes of the

network. For most applications, each node will communicate with some nodes much
more than others. Nonuniform traffic presents a problem for the deterministic
algorithms because they are nonadaptive. Just as they maintain the evenness of uniform

traffic, they blindly maintain the unevenness of nonuniform traffic. The result is poor

performance [6].


Even the Dimension Reversal adaptive routing by Dally and Aoki gives a higher
latency than deterministic routing at high load and uniform traffic. According to them,
this is because dimension-order (deterministic) routing concentrates traffic on the
through channels of each switch.
Furthermore, the collision probability, and hence the expected latency, is proportional
to the competing traffic rate. Thus concentrating traffic on the through channels results
in less contention and lower latency. With adaptive routing, the switch traffic is more
uniform, resulting in higher latency [8].
A comparison of the saturation throughput of various adaptive algorithms is
given in Table 4.1.
Out of the four adaptive routing algorithms that have been compared, only one
saturates after the e-cube algorithm. The superior saturation throughput of the algorithm
presented by Boppana and Chalasani, however, comes at a cost: the number of virtual
channels it uses increases exponentially with the number of dimensions of the network.
We discuss the disadvantages of virtual channels in the next section. We also note that
the performance of the DAG algorithm with regard to saturation throughput is
comparable to that of other well-known adaptive algorithms.

Table 4.1 A comparison of various adaptive algorithms with regard to the saturation
throughput.

For uniform traffic

Reference                      Deviation of saturation throughput
                               from e-cube algorithm
Boppana and Chalasani [11]     +25%
Dally and Aoki [8]             -15%
Glass and Ni [6]               -25%
DAG algorithm                  -15%

+(-) better(worse)

4.2 The Costs of Virtual Channels

The biggest advantage of this algorithm is that it does not require virtual
channels at its bare-bones level. The drawback of most adaptive algorithms is that they
use virtual channels.
Nevertheless, it has been shown by Dally and Seitz [4] that adding virtual
channels to certain types of schemes can, in fact, improve performance.
But the reality is that virtual channels are not free. They involve adding buffer
space and control logic to the two routers at the ends of the physical channel so that the
virtual channels can share the physical channel and routers [6].
Furthermore, each added virtual channel reduces the bandwidth of the virtual channels
already sharing the physical channel.
There is also a limit to which one can use virtual channels. For low traffic loads,
using more virtual channels may lead to performance degradation rather than
performance enhancement [9].
Therefore the fact that the algorithm presented by Boppana and Chalasani
saturates after other adaptive algorithms is of little significance, since the costs
associated with its virtual channels are forbidding.
A comparison of some well-known adaptive algorithms with regard to virtual
channels is given in Table 4.2.

Table 4.2 A comparison of various adaptive algorithms with respect to the number of
virtual channels used.

Reference                      Number of virtual channels
Linder and Harden [5]          (n+1)2^(n-1)
Boppana and Chalasani [11]     2^n
Dally and Aoki [8]             depends on the degree of adaptivity
Glass and Ni [6]               -
DAG algorithm                  -

As with other routing algorithms, this algorithm is not without drawbacks. If
there is too much congestion, then the packets are forced to travel to the root, and if
the congestion is not cleared within finite time, the packets are destroyed at the root.
The packet deletion can be reduced to a negligible fraction if the root has the facility
to store a few packets. A situation where packet deletion is necessary is, moreover,
extremely rare: it can arise only at very high traffic rates, under which the
latency also becomes very high. Most importantly, actual systems never operate under
such extreme traffic rates. Thus the chances of packet deletion are negligible.
In our simulations it was found that long after saturation was reached, less than 1% of
the packets had to be deleted. Moreover, not a single packet was deleted before saturation
was reached.
Thus the fact that the algorithm may delete packets has little or no significance
as far as implementing it on real systems is concerned.

4.3 Fault Tolerance
Adaptive routing algorithms are better able to cope with hot spots and hardware
failures than deterministic algorithms. If a node or channel fails, the packet will take a
roundabout route to reach its destination. In the DAG algorithm, a problem may arise if
the root node fails. In such a case, some node adjacent to the root may take over the
root's functions. Another situation is where the G2 channels from a node fail. In that
case, that node may have to initiate a network reconfiguration wherein that node is
avoided for through traffic.

4.4 Livelock
Livelock is a state in which some packet is prevented indefinitely from making
its next hop towards its destination because of unfavorable traffic patterns. It occurs

when the routing of a packet never leads to its destination. On the other hand indefinite

postponement is a situation similar to deadlock, which occurs when a packet waits for
an event that can happen but never does. For example, a packet may wait forever to
acquire a network resource for which other packets are always competing successfully.

Both deadlock and indefinite postponement can stop a packet from being injected into
the network. In contrast, livelock does not stop a packet’s movement, but rather its

progress toward the destination. Livelock is possible only when routing is adaptive and
non-minimal.

In theory, our algorithm may come to a situation akin to livelock. This may
happen if the traffic is fluctuating at a very high level. Nevertheless, this can be avoided
if we introduce a 'reroute bit' in the header. A packet which had to be rerouted will set
its reroute bit. If a router finds too many rerouted headers, it may choose to delete such a
packet.
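
One way the reroute bit might be handled is sketched below. The text does not specify how a router decides that it has seen "too many" rerouted headers, so the counter, the threshold `max_rerouted`, and the class and field names are all assumptions made for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Header:
    dest: tuple
    rerouted: bool = False  # the 'reroute bit', set when the packet is redirected


class Router:
    def __init__(self, max_rerouted):
        self.max_rerouted = max_rerouted  # hypothetical threshold
        self.rerouted_seen = 0            # rerouted headers observed so far

    def accept(self, header):
        """Return True to forward the packet, False to delete it."""
        if header.rerouted:
            self.rerouted_seen += 1
            if self.rerouted_seen > self.max_rerouted:
                return False
        return True


if __name__ == "__main__":
    router = Router(max_rerouted=2)
    for _ in range(3):
        print(router.accept(Header(dest=(3, 3), rerouted=True)))
```

In this sketch the counter is never reset; a practical router would presumably count over a sliding window of time, which the text leaves open.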

Chapter 5

Conclusion

A major goal in the design of direct networks is to minimize the constituent
elements of communication latency so that such systems can support a finer grain of
parallelism. Of the various methods of communication amongst processors, wormhole
routing has been a subject of intense research because of its low communication
latency.
A DAG based wormhole routing strategy was described and its performance evaluated.
The primary research tools used by various authors to study the performance of
wormhole routing algorithms have been analysis and simulation, in which either uniform
or generic parameterized workloads have been used to evaluate routing algorithms.

We have simulated a 2D mesh to measure the performance of the algorithm, since
most direct network topologies used in wormhole-routed systems are low-dimensional
meshes. The mesh-connected bidirectional k-ary n-cube, especially in two dimensions,
costs very little and is probably best suited for hardware implementations of routing
functions and virtual queues [5]. Nonetheless, this algorithm may as well be used for
a torus or a hypercube.


Furthermore, multicast and broadcast communications using this algorithm can
also be looked into. Multicast communication refers to the delivery of the same
message from a source node to an arbitrary number of destination nodes. A broadcast
is the delivery of the same message to all other nodes on the network.

No algorithm is without drawbacks, and this algorithm has its share of
disadvantages. But given its comparable performance and its simplicity, it makes a
good case for consideration, more so because it does not
require virtual channels. Nevertheless, virtual channels may be incorporated and its
performance studied.

Bibliography

[1] H.J. Siegel, Interconnection Networks for Large-Scale Parallel Processing,
Lexington, MA: Lexington Books, 1985.
[2] L.M. Ni and P.K. McKinley, "A Survey of Wormhole Routing Techniques in Direct
Networks", IEEE Computer, Feb. 1993, pp. 62-76.
[3] W.J. Dally and C.L. Seitz, "The Torus Routing Chip", J. Distributed Computing,
Vol. 1, No. 3, 1986, pp. 187-196.
[4] W.J. Dally and C.L. Seitz, "Deadlock-Free Message Routing in Multiprocessor
Interconnection Networks", IEEE Trans. Computers, Vol. C-36, No. 5, May 1987, pp.
547-553.
[5] D.H. Linder and J.C. Harden, "An Adaptive and Fault-Tolerant Wormhole Routing
Strategy for k-ary n-cubes", IEEE Trans. Computers, Vol. 40, No. 1, Jan. 1991, pp.
2-12.
[6] C.J. Glass and L.M. Ni, "The Turn Model For Adaptive Routing", Proc. 19th Int'l
Symp. Computer Architecture, IEEE CS Press, Los Alamitos, CA, 1992, pp. 278-287.
[7] W.J. Dally, "Virtual Channel Flow Control", IEEE Trans. Parallel and Distributed
Systems, Vol. 3, No. 2, Mar. 1992, pp. 194-205.
[8] W.J. Dally and H. Aoki, "Deadlock-Free Adaptive Routing in Multicomputer
Networks using Virtual Channels", Technical report, MIT Laboratory for Computer
Science, Sept. 1990.
[9] R.V. Boppana and S. Chalasani, "A Comparison of Adaptive Wormhole Routing
Algorithms", Proc. 20th Int'l Symp. Computer Architecture, IEEE CS Press, Los
Alamitos, CA, Order No. 3810-03, 1993, pp. 351-360.
[10] I.S. Gopal, "Prevention of store-and-forward deadlock in computer networks", IEEE
Trans. on Communications, COM-33, Dec. 1985, pp. 1258-1264.
[11] R.V. Boppana and S. Chalasani, "New Wormhole Routing Algorithms for
Multicomputers", Technical report, Univ. of Wisconsin-Madison, Dept. of Electrical and
Computer Engineering, Madison, WI, 1992.
[12] D. Gelernter, "A DAG-Based Algorithm for Prevention of Store-and-Forward
Deadlock in Packet Networks", IEEE Trans. Computers, Vol. C-30, No. 10, Oct. 1981,
pp. 709-715.
[13] W.C. Athas and C.L. Seitz, "Multicomputers: Message-Passing Concurrent
Computers", Computer, Vol. 21, No. 8, Aug. 1988, pp. 9-25.

Vita

Kaushik Roy, son of Dr. Somnath Roy and Mrs. Jyotsna Roy, was born in
Calcutta, India. He had his schooling at Ramakrishna Mission Vidyapith, Deoghar and
at Delhi Public School, New Delhi. He received his Bachelor's degree in Electrical
Engineering from Bihar Institute of Technology, India, and later joined the Steel
Authority of India Limited as Management Trainee-Technical. Thereafter he joined
Louisiana State University as a graduate student.

MASTER'S EXAMINATION AND THESIS REPORT

Candidate:

Kaushik Roy

Major Field:

Electrical Engineering

Title of Thesis:

A DAG Based Wormhole Routing Strategy

Approved:

Major Professor and Chairman

Dean of the Graduate School

EXAMINING COMMITTEE:

Date of Examination:


