Loyola University Chicago

Loyola eCommons
Computer Science: Faculty Publications and
Other Works

Faculty Publications and Other Works by
Department

3-1997

Universal Wormhole Routing
Ronald I. Greenberg
Rgreen@luc.edu

Hyeong-Cheol Oh

Follow this and additional works at: https://ecommons.luc.edu/cs_facpubs
Part of the Theory and Algorithms Commons, and the VLSI and Circuits, Embedded and Hardware
Systems Commons

Author Manuscript
This is a pre-publication author manuscript of the final, published article.
Recommended Citation
Ronald I. Greenberg and Hyeong-Cheol Oh. Universal wormhole routing. IEEE Trans. Parallel and
Distributed Systems, 8(3):254--262, March 1997.

This Article is brought to you for free and open access by the Faculty Publications and Other Works by Department
at Loyola eCommons. It has been accepted for inclusion in Computer Science: Faculty Publications and Other
Works by an authorized administrator of Loyola eCommons. For more information, please contact
ecommons@luc.edu.
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License.

Universal Wormhole Routing
Ronald I. Greenberg, Member, IEEE, and H.-C. Oh, Member, IEEE,

To appear in IEEE Transactions on Parallel and Distributed Systems.
Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or
promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any
copyrighted component of this work in other works must be obtained from the IEEE.


Abstract: In this paper, we examine the wormhole routing problem in terms of the “congestion” c and “dilation” d for a set of packet paths. We show, with mild restrictions, that there is a simple randomized algorithm for routing any set of P packets in $O(cd\eta + cL\eta \log P)$ time with high probability, where L is the number of flits in a packet, and $\eta = \min\{d, L\}$; only a constant number of flits are stored in each queue at any time. Using this result, we show that a fat-tree network of area $\Theta(A)$ can simulate wormhole routing on any network of comparable area with $O(\log^3 A)$ slowdown, when all worms have the same length. Variable-length worms are also considered. We run some simulations on the fat-tree which show that not only does wormhole routing tend to perform better than the more heavily studied store-and-forward routing in this context, but that performance superior to our provable bound is attainable in practice.

Index Terms: Wormhole routing, packet routing, randomized routing, greedy routing, area-universal networks, fat-tree interconnection network

I. Introduction

An efficient routing algorithm is critical to the design of most large-scale general-purpose parallel computers. One must move data between different locations in an appropriate routing network as quickly as possible and with as little queuing hardware as possible. Store-and-forward routing is the most extensively studied model and many asymptotically efficient algorithms have been proposed for this model (e.g., [15] and the references therein). Recently, increasing attention has been devoted to the wormhole routing model [3], since it can lead to a reduction in routing time and the storage requirements of intermediate nodes. In this model, packets (or worms) are composed of flits or flow control digits, and packets snake through the network one flit after another.

Few works have performed any theoretical analysis of wormhole routing or similar schemes. Leighton [17] performs average-case analysis of greedy cut-through routing on meshes. But cut-through routing [13] differs from wormhole routing in that it uses buffers that can store at least one full packet rather than a few flits. Makedon and Simvonis [20] give worst case bounds for cut-through routing of permutations on the mesh and the torus. Aiello, Leighton, Maggs, and Newman [1] give an efficient algorithm for wormhole routing of permutations on a dilated butterfly. Their algorithm is nonoblivious (may use information about other packets when routing a given packet). More recently, Felperin, Raghavan, and Upfal [6] have obtained a simple, oblivious algorithm for wormhole routing of permutations on the butterfly and the mesh. Some other works have described iterative methods for estimating the performance of certain networks under certain probabilistic models of message generation [4, 11, 2].

While the above analyses of wormhole routing have been applicable only to specific networks and/or specific message patterns, this paper takes a more general approach based on summary measures of the message traffic, as in [16, 15]. We require only that any two paths in the network intersect in at most a constant number of contiguous sequences of edges, a condition that is met by many networks used in practice. Recently, an intermediately broad class of networks has been considered by Ranade et al. with particular application to improved routing of certain message distributions on the butterfly [26].

This work was supported in part by the National Science Foundation
under grants CCR-9109550 and CCR-9321388 and by a Summer Research
Award from the University of Maryland Office of Graduate Studies and
Research.
R. I. Greenberg is with the Department of Mathematical and Computer
Sciences, Loyola University, 6525 N. Sheridan Rd., Chicago, IL 60626,
rig@math.luc.edu.
H-C. Oh is with the Department of Information Engineering, Korea
University, Chochiwon, Korea, hyeong@tiger.korea.ac.kr.

After deriving general bounds for wormhole routing, we
apply the results to the construction of area-universal networks. In particular, when worms have a fixed length,
a bounded-degree network (the butterfly fat-tree [9]) of
area $\Theta(A)$ using wormhole routing can simulate (on-line) any network of comparable area with $O(\log^3 A)$ slowdown. Though it has been proven that $O(\log A)$ slowdown suffices in the store-and-forward routing model [15], such an approach requires the universal network to queue full packets at each intermediate node and similarly limits the type of competing network that is considered. Also, the circuit-switching scheme of [9] could actually be used as a wormhole routing scheme, but with poorer overhead than we show here, since the earlier scheme locks down a routing path for more than the time required for a worm to pass.

We also extend the universality analysis to the case in which worms have varying lengths. In this case, each processor continuously generates and sends packets, where the packet length L is a random variable with mean $E[L] = \bar{L}$ and maximum value $L_M$. With mild restrictions, we show that a fat-tree network of area $\Theta(A)$ can simulate any network of comparable area with $O((L_M/\bar{L}) \log^3 A)$ slowdown.

Before proceeding with the promised results, we give more detail on the model and terminologies used throughout this paper. We consider the routing of a set of P packets, each consisting of L flits. We follow the usual graph-based terminology; processors and switches are nodes in the graph and communication channels are represented by edges. We make the usual assumption that unit time suffices for a flit to cross any edge in the network (though it would also be desirable to extend the analysis to general edge delays as done in [10] for the store-and-forward model). A flit is an atomic object, which at each time step, either waits in a queue, or crosses an edge and enters the edge queue at the end of that edge. (In store-and-forward routing, packets are the atomic objects.) We call this unit time step a flit-step, while the corresponding unit time step for store-and-forward routing is a packet-step. We restrict attention to bounded-degree networks, so the time to make routing decisions at any given node does not affect the asymptotic time bounds.

We may view the packet routing problem as being comprised of two tasks, selecting a path through the network for each packet and setting a schedule for when packets move and wait. In the next section of this paper, we focus on the second task. Of course, the selection of paths affects the required routing time. For example, the maximum distance d, in number of edges, traveled by any packet is a lower bound on the routing time; this distance is often referred to as the dilation in the literature. (It may be noted that the dilation is typically at most as large as the network diameter, but this is not necessarily so if some packets do not traverse shortest paths.) Similarly, the routing time is lower bounded by cL, where the congestion c is the maximum over all edges of the number of packets that must traverse the edge over the entire course of the routing.

Once the set of packet paths has been determined, we can define a graph, D, which has a vertex for each edge of the network and an edge (u, v) whenever there is a packet path in which network edge v immediately follows network edge u. We refer to this graph as the dependency graph. We ensure that deadlock cannot occur by assuming that the dependency graph of the paths is acyclic [3]. (Many networks, e.g., leveled networks [15], have no cycle in D for any set of packets, and there also are techniques for breaking cycles [3]. In addition, there are adaptive routing techniques for avoiding deadlock (e.g., see [5] and the references therein), and our analysis can be applied to the set of packet paths generated by such a technique.)

II. A simple wormhole routing algorithm

In this section, we give a simple delayed-greedy wormhole routing algorithm and its theoretical analysis, when all worms have the same length, L. Throughout this section, we only consider a set of paths such that the channel dependency graph is acyclic. We also assume that any two paths in the network intersect in at most one contiguous sequence of edges. (It will be easily seen that the results are also valid as long as any two paths intersect in at most a constant number of contiguous sequences of edges.) Each node has a queue, for each input edge, which can store at

most one flit. It is sufficient for our analysis to have each node scan its input queues in a fixed order and send out a flit whenever the relevant outgoing edge is not occupied by another worm.

We say that a worm W' blocks W at t if the edge to which the head of W has to proceed at t is taken by W'. Worm W' delays worm W at t, if at t, there is a delay chain of $r\ (\ge 1)$ worms $W = W_1, W_2, \ldots, W_r = W'$ such that worm $W_i$ is blocking worm $W_{i-1}$; worm W' is moving; and no other worm in the chain can move. Since we exclude any possibility of deadlock, any blockage will end at some time. Also, after worm W' delays worm W for at most L steps (not necessarily consecutive), W' will not delay W again because of our assumption that packet paths intersect in at most one contiguous sequence of edges.

The basic routing algorithm we use is a delayed-greedy approach similar to that of Felperin, Raghavan, and Upfal [6]. The analysis here simplifies, clarifies, and generalizes their argument. We begin with a worst-case bound on routing time that holds with high probability and then give a tighter bound on the expected time. (Throughout this paper we use the term “high probability” in the standard fashion that for any constant m, we can achieve probability $1 - O(1/S^m)$, where S is an appropriate measure of problem size. In particular we focus on achieving time bounds for routing P packets with probability $1 - O(1/P)$ and bounds for network simulation using networks of area A that hold with probability $1 - O(1/A)$.) The starting point for both results is the following core routing procedure, expressed in terms of parameters k and T to be determined later:

Algorithm A Assign each packet an integral delay randomly and uniformly from the interval [0, kcL − 1]. A packet that is assigned delay i waits in its initial queue for iT steps and then proceeds to its destination. We refer to the time between (i − 1)T and iT as the i-th phase. In what follows, we assume T is large enough that worms dispatched in different phases do not interfere with each other and we analyze how large T needs to be for the worms dispatched in a phase to actually get delivered before the end of the phase.

We begin with a key lemma for bounding the tail of a binomial distribution:

Lemma 1 Consider τ independent Bernoulli trials, each with probability p of success. The probability that the number of successes s is larger than the expectation pτ is at most $(ep\tau/s)^s$.

Proof. The probability is at most $\binom{\tau}{s} p^s$ and the Lemma follows through the use of Stirling’s approximation to the factorial.

Now, we can prove a general lemma about the number of delaying worms encountered during a specified set of edge traversals of some set of worms:

Lemma 2 Consider a set of worm paths comprising a total of y edge traversals (by worm heads) that need to be accomplished in some phase of Algorithm A, and let P(x, y) be the probability that delays by x worms interfere with that set of traversals. Then $P(x, y) \le (1/k')^{x - y/L}$, where $k' = \sqrt{k/(2e)} \ge 2$.

Proof. We use induction on x and prove the result for each value of x by using induction on y. The base cases are trivial. For the induction step, we first consider y > L and focus on the first L edge traversals of a particular worm; notice that they can be blocked by at most cL worms. If none of those worms is launched in the current phase, then P(x, y) is just equal to P(x, y − L). Otherwise, let S be the set of such worms launched in the current phase, and note that there is a probability of at most $(e/k)^i$ that S contains i worms, by using $\tau = cL$, $p = \frac{1}{kcL}$, and $s = i$ in Lemma 1. In the worst case, all the worms in S could act as delaying worms, and we must also worry about delays encountered by those worms during traversal of up to L − 1 edges that have not already been considered. (We need not consider more than L − 1 new edges for each worm in S, because once a worm has traversed so many edges not already under consideration, it has digressed far enough from the paths originally under consideration as to have no further effect.)
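As a numerical sanity check on this use of Lemma 1 (not part of the original analysis), the following Python sketch compares the exact binomial tail with the bound $(ep\tau/s)^s$ for $\tau = cL$ and $p = 1/(kcL)$, and with the simplified bound $(e/k)^i$. The function names and the sample values c = 4 and L = 8 are illustrative assumptions; k = 32e is the constant used in the proof of Theorem 4.

```python
from math import comb, e

# k = 32e is the constant from the proof of Theorem 4.
K = 32 * e


def binomial_tail(tau: int, p: float, s: int) -> float:
    """Exact probability of at least s successes in tau Bernoulli(p) trials."""
    return sum(comb(tau, j) * p**j * (1 - p) ** (tau - j) for j in range(s, tau + 1))


def lemma1_bound(tau: int, p: float, s: int) -> float:
    """The bound (e * p * tau / s)^s stated in Lemma 1."""
    return (e * p * tau / s) ** s


def check(c: int, L: int, k: float) -> bool:
    """Check, for every i from 1 to cL, that the exact tail stays below both
    the Lemma 1 bound and the simplified bound (e/k)^i, with tau = cL and
    p = 1/(kcL) as in the proof of Lemma 2."""
    tau = c * L
    p = 1.0 / (k * c * L)
    for i in range(1, tau + 1):
        tail = binomial_tail(tau, p, i)
        if tail > lemma1_bound(tau, p, i) or tail > (e / k) ** i:
            return False
    return True


if __name__ == "__main__":
    # Hypothetical small instance: congestion c = 4, worm length L = 8.
    print(check(c=4, L=8, k=K))
```

The check only confirms the inequality for one small instance; it is the union bound and Stirling's approximation in the proof of Lemma 1 that establish it in general.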

Thus, for the induction step, we have

$$
\begin{aligned}
P(x, y) &\le P(x, y - L) + \sum_{i=1}^{cL} \left(\frac{e}{k}\right)^{i} P(x - i,\; y - L + iL) \\
&\le \left(\frac{1}{k'}\right)^{x - y/L + 1} \left( 1 + \sum_{i=1}^{cL} \left(\frac{e}{k}\right)^{i} \left(\frac{1}{k'}\right)^{-2i} \right) \\
&\le \left(\frac{1}{k'}\right)^{x - y/L + 1} \left( 1 + \sum_{i=1}^{cL} \left(\frac{1}{2}\right)^{i} \right) \\
&\le \left(\frac{1}{k'}\right)^{x - y/L} .
\end{aligned}
$$

For y ≤ L, the additive term preceding the summation above disappears, and we replace P(x − i, y − L + iL) with P(x − i, y + iL); the final result still holds.

Now we can analyze the time after a worm is dispatched in Algorithm A that is required to traverse all d of its links:

Lemma 3 Let P'(z) be the probability that a given worm W requires at least 2d + Lz time to reach its destination under Algorithm A. Then, $P'(z) \le (1/k')^z$, where $k' = \sqrt{k/(2e)} \ge 2$.

Proof. For W to require 2d + Lz time to reach its destination, it must be delayed by $\frac{d}{L} + z$ worms since any one worm can delay W for at most L steps. The result follows from Lemma 2.

We are now ready for the main analytical result:

Theorem 4 With $\eta = \min\{d, L\}$, any set of P packets can be routed in $O(cd\eta + cL\eta \log P)$ flit-steps with high probability.

Proof. We consider first the case of L ≤ d. In this case, we simply run Algorithm A with k = 32e and $T = 2d + L \log_2 P$. By Lemma 3 (with k' = 4), the probability is $O(1/P^2)$ that any given worm is not delivered during the phase in which it is dispatched (under the assumption that all worms dispatched in previous phases have been delivered). This yields an overall probability of $O(1/P)$ that there exists any worm that does not get delivered. (The failure probability can be changed to any constant power of 1/P by changing the constant k.) The total time for the kcL phases of the algorithm is $O(cdL + cL^2 \log P)$. For L > d, we use a modified version of Algorithm A with kcd phases. Lemmas 2 and 3 still go through by replacing some appearances of L with d in the proofs, so we can proceed as above with $T = 2d + L \log_2 P = O(L \log P)$, and the total routing time is $O(cdL \log P)$ with high probability.

It is interesting to note that we can also obtain a better expected routing time as well as the high probability result above.

Theorem 5 Any set of packets can be routed in $O(cdL)$ expected time.

Proof. As in the proof of Theorem 4, we actually use $O(32ec\eta)$ phases in Algorithm A, where $\eta = \min\{d, L\}$. Also, we initially run the algorithm with $T = T_0 = 2(d + L)$, and then run the whole algorithm through again with $T = 2T_0$, and then with $T = 4T_0$, etc. The high probability result of Theorem 4 still holds because we at most double the routing time through the process of building up towards a high enough value of T. In addition, the expected time per phase is at most $\sum_{z=0}^{\infty} (d + L) z \left(\frac{1}{4}\right)^z$ by Lemma 3, which is $O(d + L)$. So the total expected time is $O(c\eta(d + L)) = O(cdL)$.

We also have the following corollary to Theorem 4, which is useful in Section A:

Corollary 6 When d ≤ log P, any set of P packets can be routed in $O(cL \log^2 P)$ flit-steps with high probability.

III. Wormhole routing on fat-trees

Fat-trees constitute a class of routing networks for hardware-efficient parallel computation [18, 9, 15]. Figure 1 shows a layout of one fat-tree variant using switches of constant size. A fat-tree in this style is usually referred to as a butterfly fat-tree, of which a variation has been adopted in the CM-5 supercomputer of Thinking Machines Corporation [19]. In Figure 1, a set of N processors are placed at the leaves, represented by circles; the squares are switches. Each connection drawn between a pair of switches or a processor and a switch represents a pair of oppositely directed

links, each capable of transmitting one flit in unit time. We call the link from parent to child a down link, and the other an up link. The underlying structure of Figure 1 is a complete 4-ary tree. Each edge of the underlying tree consists of a group of links, called a channel. We call the channel from parent to child a down channel, and the other an up channel. The number of links in a channel is called its capacity. An important measure of the difficulty of routing a set of packets on a fat-tree is the load factor, the maximum ratio of the number of packets traversing a channel to the capacity of the channel. The load factor λ is closely related to the congestion c. We can always choose packet paths so that $c = O(\lambda + \log N)$ [15, Lemma 9].

Fig. 1. A butterfly fat-tree.

We select a shortest path for each packet. The dependency graph for the paths selected in this way is free from cycles, because no shortest path proceeds from a down channel to any up channel. In fact, we can view the network as being partitioned into two halves, a network of up channels and a network of down channels, by duplicating each switch. In the network comprised of these two halves, any two paths intersect in at most two contiguous sequences of edges. Hence the result of Section II can be applied. (The bound on the number of steps during which a given worm can delay another given worm only increases from L to 2L.)

A. Area-universality of fat-trees

A.1 Worms with a fixed length

The algorithm analyzed in Section II allows us to extend to the wormhole routing problem universality theorems from [18, 9, 15, 7] which state that a universal fat-tree of a given area (volume) can simulate (using circuit switching or store-and-forward packet routing) any other routing network of equal area (volume) with only a polylogarithmic factor increase in the time required. Throughout this section, we assume that all worms have a fixed length, L. We construct a fat-tree on unit-size processors, which occupies area linear in the number of processors, as in [7]. (It is actually more reasonable to consider processors that are larger than constant-size, but we bypass this complication, since it can be handled as in [7, 8].) Then, a very simple one-to-one mapping of a competing network’s processors to those of the fat-tree guarantees that any set of packets delivered in one packet-step by a competing network of comparable area does not induce too great a congestion on the fat-tree, as is shown by the following lemma, adapted from [7, Lemma 2.1]. (For example, in area A, we can construct a mesh or H-tree on $O(A)$ processors or a butterfly on $O(\sqrt{A} \lg A)$ nodes and do a straightforward geometric mapping to a fat-tree with A processors and appropriate channel capacities.)

Lemma 7 Consider networks with unit-sized processors, and let R be the set of all networks of area A. Then, there exists a fat-tree F of area $\Theta(A)$ such that any set of packets delivered in one packet-step by a network in R induces a congestion of $O(\log A)$ on F.

We can immediately extend this lemma to the case in which the competing network uses wormhole routing; the set of packets that move during any window of L flit-steps in the competing network induce a congestion of $O(\log A)$.

Then we can state our universality result for wormhole routing:

Theorem 8 A fat-tree F of area $\Theta(A)$ can simulate any network of area A with a factor of $O(\log^3 A)$ loss of runtime efficiency, using on-line wormhole routing with high probability.

Proof. Consider the set of packets that moves during L flit-steps in a competing network of area A. By extending Lemma 7 as suggested above, we know that the congestion created by this set of packets on a fat-tree of area $\Theta(A)$ is $O(\log A)$. Next we can restate Corollary 6 by substituting A for P as long as the number of packets is polynomial in A, as is true here. For a fat-tree, $d = O(\log A)$, so the set of packets can be delivered by F in $O(L \log^3 A)$ flit-steps.

It should be noted that under some circumstances, we can obtain an asymptotic bound that appears better than the above by splitting each packet into flits and essentially treating these flits as independent packets. Of course, we must then attach complete addressing information to each flit. If a flit is big enough to carry a full address, then we can think of each flit as being transformed into a packet of two flits and we could use the store-and-forward routing scheme for leveled networks of Leighton et al. [15] to route the packets in $O(cL + d + \log P)$ time. This yields $O(\log A)$ overhead for fat-tree simulation. Of course, it is unfair to compare this result with Theorem 8, because this independent-flit approach would induce additional overhead, such as increased storage in the intermediate nodes and the overhead of splitting and reconstructing the packets.

A.2 Worms with variable lengths

In this section, we consider the situation in which each processor continuously generates and sends packets, where the packet length L is a random variable with mean $\bar{L}$, variance $\sigma_L^2$, and maximum value $L_M$.

The following lemma shows that, with realistic restrictions, the total length of a set of n packets is unlikely to greatly deviate from its expected value.

Lemma 9 Let X be a random variable, with mean µ and variance $\sigma^2$, and let $S_n$ be the sum of n independent random variables distributed as X. If $0 < \sigma \le \epsilon\mu$, where $\epsilon$ is a constant and $0 < \epsilon < 1$, then $S_n = \Theta(n\mu)$ with probability at least $1 - \frac{1}{n}$.

Proof. The proof follows from the Tchebycheff inequality [23, p 115].

Next, we note a simple corollary to Corollary 6:

Corollary 10 When d ≤ log P, any set of packets can be routed in $O(cL_M \log^2 P)$ flit-steps with high probability.

We assume that the standard deviation of the packet length satisfies $0 < \sigma_L \le \epsilon\bar{L}$, for some constant $\epsilon$ such that $0 < \epsilon < 1$. This assumption is satisfied by the packet-length distributions, generated in typical concurrent computing applications, presented in the literature, e.g. [21]. When packets have varying lengths, the simulation overhead becomes more complicated to analyze, because the number of packets crossing a wire in competing networks may vary from wire to wire during an interval of time. We consider the situation in which each processor continuously generates and sends packets during a time interval of length $T \gg \bar{L}$. In the following theorem, we extend the result of Theorem 8 to this general setting:

Theorem 11 Consider competing networks with area A in which processors continuously generate and send packets during a time interval of length at least $A\bar{L}$, and let $L_M$ be bounded above by a polynomial in A. Then a fat-tree F of area $\Theta(A)$ can simulate any such network with a factor of $O((L_M/\bar{L}) \log^3 A)$ loss of runtime efficiency, using on-line wormhole routing with high probability.

Proof. We break up the overall time period under consideration and consider separately the worms delivered by the competing network in consecutive time intervals of length $T = A\bar{L}$. Then we determine the overhead with which any set of messages routed by any network during an interval I of length T can be delivered by a fat-tree of comparable area.

We construct a fat-tree on unit-size processors, as in Section A.1. Next we recursively bisect the competing network in the straightforward geometric fashion, as in [7, 8], and

match the parts obtained in this bisection to pieces of F in accordance with the obvious recursive bisection of F. We consider a piece, C, of area A' in the competing network. Our construction of F guarantees that the channel capacity, in F, corresponding to the perimeter of C is $O(\sqrt{A'}/\log A)$.

We now consider the set of wires, S, connecting C to the remaining part of the competing network. Let P' denote the maximum, over all the wires in S, of the number of worms crossing the wire during I. Since the sum of the lengths of the P' packets cannot exceed T, we know from Lemma 9 that $P' = O(T/\bar{L})$ with probability at least $1 - \frac{1}{T/\bar{L}} = 1 - \frac{1}{A}$. Since the perimeter of C is $O(\sqrt{A'})$, the total number of worms crossing the wires in S during T is $O(\sqrt{A'}\, P')$. Thus the congestion induced, on the channel in F, by the worms crossing the wires in S is $O(P' \log A)$. By Corollary 10, all the worms routed by the competing network during I can be delivered by F in $O(L_M P' \log A \log^2 P)$ flit-steps, where P is the total number of worms to be delivered. Since $P = O(NT) = O(NAL_M)$, P is bounded above by a polynomial in A, from which the theorem follows.

B. Simulation

This section investigates the practical performance of wormhole routing algorithms on butterfly fat-trees. For simplicity, we focus on the case in which all worms have a fixed length L. (The overhead for fat-trees simulating other networks is not much worse with variable length worms in terms of provable bounds, and we expect the same to be true in practice.)

B.1 Description of the butterfly fat-tree

We use the butterfly fat-tree with N processors in the style of Figure 1. Each node has an address which is expressed as a pair (l, a) of integers, where l represents the level of the node in the butterfly fat-tree and a represents the address of the node in that level. Let the level of a node be its distance from the leaves. At the 0-th level (l = 0) are N processors which are addressed from 0 to N − 1. In Figure 1, we arrange the processors in a similar fashion to the shuffled row-major indexing in [27]. These processors are connected to N/4 switches at the 1-st level such that the processor at (0, a) is connected to the switch $(1, \lfloor a/4 \rfloor)$. At the l-th level, for $l = 2, \ldots, \log_4 N$, there are $m_l = \frac{m_{l-1}}{2}$ switches. The connections of a switch are determined by the switch’s address as follows: (l, a) is connected to $(l + 1, \lfloor a/2^{l+1} \rfloor \cdot 2^l + a \bmod 2^l)$ and $(l + 1, \lfloor a/2^{l+1} \rfloor \cdot 2^l + (a + 2^{l-1}) \bmod 2^l)$.

B.2 Routing algorithms and strategies

Algorithm STORE is a (delayed) greedy store-and-forward routing algorithm. Each packet chooses an integral delay randomly and uniformly from the interval [0, R − 1]. A packet that is assigned delay x waits in its initial queue for x time steps and then proceeds to its destination. At each step, each node scans its input queues once and sends out available packets greedily (whenever the corresponding output edge is idling and the queue at the end of that edge is not full).

Algorithm WORM is a (delayed) greedy wormhole routing algorithm. Each packet consists of L flits. Each packet chooses an integral delay randomly and uniformly from the interval [0, R − 1]. A packet that is assigned delay x waits in its initial queue for xL log N time steps and then proceeds to its destination. At each flit step, each node scans its input queues once. If the flit is a head flit, the node sends it out according to the flit’s path only when the output edge is not being used by any other packet and the queue at the end of that edge is not full. If the flit is not a head flit, the node sends it out to where the flit’s head was sent out, whenever the queue at the end of that edge is not full.

Algorithm UNIV is the universal store-and-forward routing algorithm of [15] for leveled networks. In this scheme packets choose a random priority from [1, R] that is used to order the passage of packets through any given switch.

Algorithm SPLIT uses the independent-flit approach.

Each packet is split into flits which are treated as independent packets and routed as in STORE. This approach is also called the multipacket routing approach [14]. Note that this approach requires a replication of addressing information so that each of the independent packets can be routed to the correct location.

In the butterfly fat-tree, there is more than one shortest path between a pair of leaves. More specifically, at a switch, a packet can take any one of two up links, when its destination is not one of the leaves of the subtree rooted at the switch. (There is no redundancy for down links.) We can use this redundancy in selecting paths.

• Fixed-Path (FP) selection: For each packet, we select a shortest path randomly and uniformly before the packet leaves its source.
• Random-Path (RP) selection: When a packet needs to go up, it selects an up link randomly. If the link is blocked, the packet waits. The selection is oblivious, i.e., each time a packet seeks to go up, it makes a selection randomly.
• Greedy-Path (GP) selection: The packet seeking to go up scans up links and chooses the first one which is not blocked.

When more than one incoming packet is to be routed to an outgoing link, the way of selecting one may affect the results. The following schemes have been tested:

• Fixed-Order (FO) scan: At each time step, a switch scans its incoming links in a fixed order and chooses the first pertinent packet for each outgoing link.
• Random Round-robin (RR) scan: This scheme is similar to FO scan, except that a switch selects the first incoming link randomly and scans around from that link.
• Farthest-First (FF) selection: In this scheme, a switch scans its input queues in a RR fashion, except that priority is given to packets heading to the farthest destinations for up links, and packets from the farthest sources for down links.

B.3 Simulation results

Since most real parallel computations tend to be dominated by communication time, and algorithms typically can be viewed as consisting of alternating phases of computation and communication, we focus on the static injection model. Here each processor has a fixed number of packets to send, and we measure the time to deliver the full set of messages. This scenario corresponds to the situation analyzed theoretically in Section A.1, to which we compare our empirical results, and constitutes an important general model as argued in [28], for example. We consider three communication patterns representative of the range of likely patterns in real parallel computations:

• Random Instance: Each packet chooses a destination randomly and uniformly.
• Complement Permutation: Each processor (0, a) sends a packet to processor (0, N − 1 − a). This permutation induces as high a congestion on the fat-tree as any other permutation. The congestion created by this permutation is $\sqrt{N}/2$.
• Many-to-1 Instance: Packets are sent from processors (0, 0), ..., (0, N/2 − 1) to processor (0, N − 1), and packets from processors (0, N/2), ..., (0, N − 1) are sent to processor (0, 0). This pattern gives us a high congestion (c = N/2) with the same number of packets as for a permutation.

The random pattern is probably the most common in other simulation studies and arises in many practical contexts. For example, a recently studied variation on sample sort begins by randomly redistributing the keys to be sorted [12]. We focus on this pattern in the graphical results presented below. Permutations have also been included in our studies, however, since they comprise a common communication primitive, but some are trivial, whereas the complement is a natural permutation that presents a high congestion. Finally, the many-to-1 pattern models the common operations of broadcast or census, which involve “hot-spot contention” that can give substantially different behavior

UNIV & STORE-RP-RR; 30 runs; 1 random instance/run; q=1 packet

than more uniform traffic patterns [25].
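The three communication patterns described in this section can be sketched in a few lines of Python. This is only an illustration, not the simulator used for the experiments: processor (0, a) is represented by the integer a, the function names are hypothetical, and the random instance is assumed to allow a packet to pick its own source as destination, which the text does not specify.

```python
import random


def random_instance(n, rng):
    """Each of the n sources picks a destination uniformly at random.

    Assumption: a source may pick itself; the paper does not say otherwise.
    """
    return [rng.randrange(n) for _ in range(n)]


def complement_permutation(n):
    """Processor a sends its packet to processor n - 1 - a."""
    return [n - 1 - a for a in range(n)]


def many_to_one(n):
    """Processors 0 .. n/2 - 1 send to n - 1; processors n/2 .. n - 1 send to 0."""
    half = n // 2
    return [n - 1 if a < half else 0 for a in range(n)]


def max_destination_load(dests):
    """Largest number of packets addressed to any single processor."""
    load = {}
    for d in dests:
        load[d] = load.get(d, 0) + 1
    return max(load.values())


n = 16
rng = random.Random(0)  # fixed seed so the sketch is repeatable
print("random:     ", random_instance(n, rng))
print("complement: ", complement_permutation(n))
print("many-to-1:  ", many_to_one(n))
print("max load in many-to-1:", max_destination_load(many_to_one(n)))
```

In the many-to-1 instance, N/2 packets converge on each of two processors, which is consistent with the congestion c = N/2 stated for that pattern; the complement pattern is a permutation, so every processor receives exactly one packet.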

[Figure 2: plot of max. latency [packet-steps] (axis 0 to 450) versus N (ticks at 16, 64, 256, 1024) for 'UNIV' and 'STORE'. Panel title: UNIV & STORE-RP-RR; 30 runs; 1 random instance/run; q=1 packet.]
Fig. 2. Performance of STORE (with RP and RR): Comparison with UNIV. Both algorithms used R = log2 N.

Five network sizes have been considered: N = 16, 64, 256, 1024, 4096. For each run, we measure the maximum communication latency, which is the time elapsed from the start of routing until the tail of the last packet arrives at its destination. In Figures 2–5, each point represents the average of 30 runs. Error bars showing the 99% confidence interval for the true average value of the maximum latency are also included in Figure 2, but they are omitted from the other plots to ease readability. In all cases where we draw distinctions in performance, there is little or no overlap of the error bars. We describe these plots and the principal conclusions below; additional plots including other combinations of routing strategies, different parameters (such as queue size and worm size), and error bars can be found in [22].

The queue size q of WORM was chosen experimentally. For random instances, little was gained by increasing the queue size beyond 2 flits, and this choice generally yielded better performance than STORE with queues of any size tested. In the following, we use queues for 2 flits in WORM, and queues for 1 packet in STORE. (STORE improves somewhat with larger queues, but we are already using more buffer space than for WORM.)

First, we compare STORE with UNIV. Even though UNIV is known to achieve an asymptotically optimal time of O(c + log N) on fat-trees, the delayed greedy routing algorithm STORE performed better than UNIV for all of the communication patterns considered. A comparison on random instances is shown in Figure 2. The marked difference here is somewhat surprising, since both algorithms use the same value of R and the use of random priorities in UNIV seems intuitively somewhat similar to imposing random initial delays.

We tested the effects of initial delays on the latency of STORE and WORM. We found that the initial random delays can decrease the latency, but we did not find any cases in which they provided much advantage, so we do not use them henceforth.

We also found that the average latency tends to depend linearly on the worm size L. This is consistent with the observation that the total number of packets which may delay a given packet is not a function of L once L ≥ d. Therefore, except where otherwise noted, we do experiments for only one worm size L = 32 flits, a typical value in the literature [24].

Figure 3 compares the path selection schemes for both store-and-forward routing and wormhole routing. Adaptive schemes significantly outperform the fixed-path scheme for the cases we considered. Similar results were obtained with the other packet selection schemes.

Using the best path selection scheme, RP, Figure 4 compares the packet selection schemes. It shows that RR slightly outperforms FO (by 4–8% for most cases). (FF performed similarly to FO.) We henceforth show most of our results with the RR and RP routing schemes. (The GP-FO combination may also be a good choice, though RP-RR outperforms it by 5–9% for STORE and 12–15% for WORM with N = 1024 and N = 4096. With GP-FO, we don’t have to worry about the difficulty of implementing good randomization schemes, and some programmers prefer deterministic systems.)

Figure 5 compares two approaches for treating the flits in a packet: the ordinary wormhole approach and the independent-flit (SPLIT) approach. The performance of SPLIT is quite sensitive to the selection of routing schemes.
For example, SPLIT with GP-FF uniformly outperforms SPLIT with RP-FO, which was not observed for STORE. We found that WORM with RP-RR outperforms SPLIT with RP-FO and SPLIT with GP-FF, and SPLIT with RP-RR performs slightly better than WORM with RP-RR. This comparison is, however, made without considering the addressing information to be added to each individual flit in the original packet. From Figure 5, we can expect that even a slight increase in the number of flits sent by SPLIT (due to the replication of addressing information) would cause WORM to outperform SPLIT.

Table I compares the average latencies of WORM and STORE for various conditions. For all cases considered, WORM outperforms STORE.

Finally, we investigated the experimental upper bound of the (maximum) latency for WORM with RP-RR. Table II summarizes, for the random instances, the average values over 30 runs of c and of latency divided by c. We sought the best least-squares fit to the latency divided by c, in the form of k log_4^p N for constants k and p. The best fit was obtained with p ≈ 0.22. It would be rash to conclude that latency fits very closely to Θ(cL log^{0.22} N), since we have neglected lower order terms, and the network sizes used may be too small to observe the proper asymptotic bound. Nonetheless, it appears that performance superior to our provable bound in Corollary 6 and close to the obvious lower bound (Ω(cL + d)) is attainable for random instances in practice.

[Figure 3: plot of max. latency [flit-steps] (axis 0 to 4000) versus N (ticks at 16, 64, 256, 1024, 4096) for 'STORE-FP', 'STORE-GP', 'STORE-RP', 'WORM-FP', 'WORM-GP', 'WORM-RP'. Panel title: FP, GP, & RP; w/ RR; 30 runs; 1 random instance/run; R=1.]
Fig. 3. Comparison of routing schemes on selecting paths in STORE and WORM with RR scan.

[Figure 4: plot of max. latency [flit-steps] (axis 500 to 3500) versus N (ticks at 16, 64, 256, 1024, 4096) for 'STORE-FF', 'STORE-FO', 'STORE-RR', 'WORM-FF', 'WORM-FO', 'WORM-RR'. Panel title: FF, FO & RR; w/ RP; 30 runs; 1 random instance/run; R=1.]
Fig. 4. Comparison of routing schemes on scanning input queues in STORE and WORM with RP selection.

[Figure 5: plot of max. latency [flit-steps] (axis 0 to 1200) versus L [flits] (ticks at 4, 8, 16, 32, 64) for 'SPLIT-RP-FO', 'SPLIT-GP-FF', 'WORM-RP-RR', 'SPLIT-RP-RR'. Panel title: N=256; 30 runs; 1 random instance/run; R=1; q=2 flits.]
Fig. 5. Comparison of routing schemes on treating flits: WORM (with RP and RR) and SPLIT (with various strategies).

               STORE-RP-RR                          WORM-RP-RR
   N     random  complement  many-to-1      random  complement  many-to-1
   16       269         198        544         125          68        258
   64       534         442       2144         233         161       1028
  256       944         829       8352         441         301       4102
 1024      1677        1565      32992         843         583      16392
 4096      3031        2896     131360        1592        1123      65546

Table I. Average latency, in flit-steps, of the greedy store-and-forward (STORE with R = 1 and q = 1) and the greedy wormhole (WORM with R = 1 and q = 2) algorithms. Each value represents an average of 30 independent experiments.

   N        c    Latency/c
   16     3.5         35.6
   64     5.6         41.9
  256    10.2         43.4
 1024    18.6         45.3
 4096    34.3         46.4

Table II. The average values of c and of the latency divided by c, for WORM with RP-RR. Each value represents 30 independent experiments with L = 32, R = 1, and q = 2 flits.

References

[1] B. Aiello, T. Leighton, B. Maggs, and M. Newman. Fast algorithms for bit-serial routing on a hypercube. In Proceedings of the 2nd Annual ACM Symposium on Parallel Algorithms and Architectures, pages 55–64. Association for Computing Machinery, 1990.
[2] W. J. Dally. Performance analysis of k-ary n-cube interconnection networks. IEEE Trans. Computers, 39(6):775–785, June 1990.
[3] W. J. Dally and C. L. Seitz. Deadlock-free message routing in multiprocessor interconnection networks. IEEE Trans. Computers, C-36(5):547–553, May 1987.
[4] J. T. Draper and J. Ghosh. A comprehensive analytical model for wormhole routing in multicomputer systems. Journal of Parallel and Distributed Computing, 23:202–214, 1994.
[5] J. Duato. A necessary and sufficient condition for deadlock-free adaptive routing in wormhole networks. In D. P. Agrawal, editor, Proceedings of the 1994 International Conference on Parallel Processing, pages I-142–I-149. CRC Press, 1994.
[6] S. Felperin, P. Raghavan, and E. Upfal. A theory of wormhole routing in parallel computers. IEEE Trans. Computers, 45(6):704–713, June 1996.
[7] R. I. Greenberg. The fat-pyramid: A robust network for parallel computation. In W. J. Dally, editor, Advanced Research in VLSI: Proceedings of the Sixth MIT Conference, pages 195–213. MIT Press, 1990.
[8] R. I. Greenberg. The fat-pyramid and universal parallel computation independent of wire delay. IEEE Trans. Computers, 43(12):1358–1364, Dec. 1994.
[9] R. I. Greenberg and C. E. Leiserson. Randomized routing on fat-trees. In S. Micali, editor, Randomness and Computation. Volume 5 of Advances in Computing Research, pages 345–374. JAI Press, 1989.
[10] R. I. Greenberg and H.-C. Oh. Packet routing in networks with long wires. Journal of Parallel and Distributed Computing, 31(2):153–158, Dec. 1995.
[11] F. T. Hady. A Performance Study of Wormhole Routed Networks Through Analytical Modeling and Experimentation. PhD thesis, University of Maryland Electrical Engineering Department, 1993.
[12] D. R. Helman, D. A. Bader, and J. JáJá. A parallel sorting algorithm with an experimental study. Technical Report UMIACS-TR-95-102, University of Maryland Institute for Advanced Computer Studies, Dec. 1995.
[13] P. Kermani and L. Kleinrock. Virtual cut-through: A new computer communication switching technique. Computer Networks, 3:267–286, Sept. 1979.
[14] M. Kunde and T. Tensi. Multi-packet-routing on mesh connected arrays. In Proceedings of the 1989 ACM Symposium on Parallel Algorithms and Architectures, pages 336–343. Association for Computing Machinery, 1989.
[15] F. T. Leighton, B. M. Maggs, A. G. Ranade, and S. B. Rao. Randomized routing and sorting on fixed-connection networks. Journal of Algorithms, 17(1):157–205, July 1994.
[16] F. T. Leighton, B. M. Maggs, and S. B. Rao. Packet routing and job-shop scheduling in O(congestion + dilation) steps. Combinatorica, 14(2):167–180, 1994.
[17] T. Leighton. Average case analysis of greedy routing algorithms on arrays. In Proceedings of the 2nd Annual ACM Symposium on Parallel Algorithms and Architectures, pages 2–10. Association for Computing Machinery, 1990.
[18] C. E. Leiserson. Fat-trees: Universal networks for hardware-efficient supercomputing. IEEE Trans. Computers, C-34(10):892–901, Oct. 1985.
[19] C. E. Leiserson, Z. S. Abuhamdeh, D. C. Douglas, C. R. Feynman, M. N. Ganmukhi, J. V. Hill, W. D. Hillis, B. C. Kuszmaul, M. A. S. Pierre, D. S. Wells, M. C. Wong, S.-W. Yang, and R. Zak. The network architecture of the connection machine CM-5. In Proceedings of the 4th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 272–285. Association for Computing Machinery, 1992.
[20] F. Makedon and A. Simvonis. On bit-serial packet routing for the mesh and the torus. In J. JaJa, editor, Proceedings of the Third Symposium on the Frontiers of Massively Parallel Computation, pages 294–302. IEEE Computer Society Press, 1990.
[21] J. Y. Ngai. A framework for adaptive routing in multicomputer networks. Technical Report CS-TR-89-09, Department of Computer Science, California Institute of Technology, 1989.
[22] H.-C. Oh. Efficient Communication Schemes for Massively Parallel Computers. PhD thesis, University of Maryland Electrical Engineering Department, 1993.
[23] A. Papoulis. Probability, Random Variables, and Stochastic Processes. McGraw-Hill, 1984.
[24] M. J. Pertel. A critique of adaptive routing. Technical Report CS-TR-92-06, Department of Computer Science, California Institute of Technology, 1992.
[25] G. F. Pfister and V. A. Norton. Hot spot contention and combining in multistage interconnection networks. IEEE Trans. Computers, C-34(10):943–948, Oct. 1985.
[26] A. Ranade, S. Schleimer, and D. S. Wilkerson. Nearly tight bounds for wormhole routing. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science, pages 347–355. IEEE Computer Society Press, 1994.
[27] C. Thompson and H. Kung. Sorting on a mesh-connected parallel computer. Communications of the ACM, 20(4):263–271, Apr. 1977.
[28] L. G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103–111, Aug. 1990.

Ronald I. Greenberg (S’87-M’90) received the A.B. degree in Mathematics, the B.S. degree in Computer Science and the B.S. and M.S. degrees in Systems Science and Mathematics all from Washington University, St. Louis, MO in 1983. He received the Ph.D. degree in Electrical Engineering and Computer Science from the Massachusetts Institute of Technology in 1989.
He is currently an Associate Professor in the Department of Mathematical and Computer Sciences at Loyola University Chicago. His research interests include parallel computation and algorithms for computer-aided design of integrated circuits.
Dr. Greenberg is a member of ACM and SIAM.

Hyeong-Cheol Oh (M’95) received the B.S. degree in Electronics Engineering from Seoul National University, Seoul, Korea in 1982 and the M.S. degree in Electrical and Electronic Engineering from Korea Advanced Institute of Science and Technology, Seoul, Korea in 1984. He received the Ph.D. degree in Electrical Engineering at the University of Maryland, College Park in 1993.
He is currently an Assistant Professor in the Department of Information Engineering, Korea University, Chochiwon, Korea. He also worked for three years at Goldstar Semiconductor Ltd, Korea, where he designed NMOS full-custom and CMOS Gate-Array ICs. His research interests include parallel computation and VLSI design.
Dr. Oh is a member of the Korea Information Science Society.
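As a supplementary illustration of the least-squares fit reported in Section B.3 for Table II, the sketch below fits latency/c ≈ k (log_4 N)^p to the five tabulated points. The paper does not state which fitting procedure was used; ordinary linear regression on logarithms is an assumption made here for simplicity, and the function name is hypothetical, so the exponent obtained need not coincide exactly with the reported p ≈ 0.22.

```python
import math

# Table II data (WORM with RP-RR, L = 32, R = 1, q = 2 flits).
N_VALUES = [16, 64, 256, 1024, 4096]
LATENCY_OVER_C = [35.6, 41.9, 43.4, 45.3, 46.4]


def fit_power_of_log(n_values, ys, base=4.0):
    """Fit y ~ k * (log_base n)**p by least squares in log-log space.

    Assumption: taking logs of both sides and running a straight-line
    regression is used in place of whatever procedure the paper applied.
    """
    xs = [math.log(math.log(n, base)) for n in n_values]
    ls = [math.log(y) for y in ys]
    x_mean = sum(xs) / len(xs)
    l_mean = sum(ls) / len(ls)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (l - l_mean) for x, l in zip(xs, ls))
    p = sxy / sxx                      # slope = exponent p
    k = math.exp(l_mean - p * x_mean)  # intercept gives the constant k
    return k, p


k, p = fit_power_of_log(N_VALUES, LATENCY_OVER_C)
print(f"k = {k:.1f}, p = {p:.2f}")
```

With only five network sizes the fit is weakly constrained, which is in line with the caution in the text against reading the exponent as an asymptotic bound.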

