Product-shuffle networks: toward reconciling shuffles and butterflies  by Rosenberg, Arnold L
Discrete Applied Mathematics 37/38 (1992) 465-488
North-Holland 
Product-shuffle networks: toward 
reconciling shuffles and butterflies* 
Arnold L. Rosenberg 
Department of Computer and Information Science, University of Massachusetts, Amherst,
MA 01003, USA 
Received 16 May 1989 
Revised 21 May 1990 
Abstract 
Rosenberg, A.L., Product-shuffle networks: toward reconciling shuffles and butterflies, Discrete 
Applied Mathematics 37/38 (1992) 465-488. 
We study product-shuffle (PS) networks, which are direct products of de Bruijn networks, as
interconnection networks for parallel architectures. PS networks can be viewed as generalizing
both butterfly-oriented networks (such as the butterfly and cube-connected cycles networks) and
shuffle-oriented networks (such as the de Bruijn and shuffle-exchange networks), in the sense that
• PS networks can emulate both butterfly-oriented and shuffle-oriented networks of any size,
via emulations that are work preserving, i.e., preserve the processor-time product;
• PS networks share many computationally valuable structural features of various butterfly-
and shuffle-oriented networks, including pancyclicity, logarithmic diameter, and large complete
binary tree subnetworks;
• PS networks overcome certain computational deficiencies of butterfly- and shuffle-oriented
networks, by containing as subnetworks moderate-size meshes and meshes of trees, networks
which butterfly- and shuffle-oriented networks cannot emulate efficiently.
Finally, PS networks attain their communication power at modest cost: they are 8-valent, and
they enjoy VLSI layouts that consume only modestly more area than the best layouts of like-sized
butterfly- and shuffle-oriented networks.
Keywords. Interconnection network, parallel architecture, network emulation, direct-product
network, butterfly network, cube-connected cycles network, shuffle-exchange network, de Bruijn
network.
1. Goals of the study 
The Boolean hypercube and its bounded-degree derivatives, such as the butterfly-oriented
butterfly and cube-connected cycles (CCC) networks and the shuffle-oriented
* A portion of this research was supported by NSF Grant CCR-88-12567.
0166-218X/92/$05.00 © 1992 - Elsevier Science Publishers B.V. All rights reserved
shuffle-exchange and de Bruijn networks, are among today’s dominant interconnection
networks for massively parallel architectures. Indeed, architectures based
on these networks have been built in both industry and academia.
Among these interconnection networks, the hypercube is the clear favorite
because of its efficiency on a broad class of algorithms [6,8,11,21] and its structural
uniformity that simplifies programming [19]. The major shortcoming of the hypercube
is its high valence.¹ The technological difficulties attendant on implementing
such high-valence networks have led to the development of several butterfly-oriented
bounded-degree “approximations” of the hypercube, most notably the
butterfly and CCC networks [16]. These networks were constructed with a certain
important genre of hypercube algorithm, called ascend-descend algorithms [16], in
mind and so can emulate the hypercube with little or no slowdown on a large, important
class of computational problems. Yet, in a sense, butterfly-oriented networks
just replace one implementational problem with another, since they use
N log_2 N nodes (processors) to emulate the N-node hypercube. Further, algebraic
transformations [3] of these large networks yield the smaller, shuffle-oriented
bounded-degree “approximations” of the hypercube, most notably the
shuffle-exchange² [22] and de Bruijn [9,20] networks. Shuffle-oriented networks have only
as many nodes as does the hypercube, yet they avoid its large valence; and, on certain
computational tasks (including ascend-descend algorithms) they afford one
computational efficiency (roughly) equal to that of the butterfly and CCC.
Butterfly- and shuffle-oriented networks are roughly equivalent approximations
of the hypercube on a broad class of computational tasks, but it is not clear whether
or not one of these network families majorizes the other on general computations.
Confusingly enough, there is evidence that butterfly- and shuffle-oriented networks
have incomparable strengths and weaknesses, and there is countervailing evidence
that the two families of networks are equivalent in power. Distinguishing the two
families are properties such as the following. The N-node de Bruijn network has the
computationally useful properties of being pancyclic³ [24], of having diameter
exactly log_2 N, and of containing an (N-1)-node complete binary tree as a subnetwork;
butterfly-oriented networks enjoy none of these. In contrast, the butterfly
network enjoys both node-transitivity and a recursive decomposition structure;
neither is shared by shuffle-oriented networks. The symmetry and decomposability
of butterfly networks are quite useful in developing efficient algorithms. For
instance, the efficient routing and sorting algorithms for the butterfly network, in
[17] and [18], respectively, exploit the symmetry and recursive structure of the butterfly,
hence are not easily transported to any shuffle-oriented network. Further,
circumstantial evidence separating the two families (and, indeed, suggesting that
¹ The N-node hypercube has valence (= maximum node-degree) log_2 N.
² The shuffle-exchange network can also be derived directly from the hypercube, via a geometric
transformation.
³ That is, the network contains cycles of all lengths ≤ N.
shuffle-oriented networks are more powerful than butterfly-oriented ones) arises
from studies in [3,11] of emulations of one interconnection network by another.
These studies use the same strong notion of emulation as we do in this paper.
(1) (a) Every even-order (respectively, odd-order) butterfly and CCC network can
be emulated with no time loss (respectively, with at most a factor of 2 time
loss) by the smallest hypercube that is big enough to hold it [11];
(b) every known emulation of an N-node shuffle-oriented network by a hypercube
incurs slowdown Ω(log N) (which is pessimal).
(2) (a) An N-node shuffle-oriented network can emulate a like-size butterfly-oriented
network with slowdown O(log log N) [3];
(b) every known emulation of an N-node shuffle-oriented network by a like-size
butterfly-oriented network incurs slowdown Ω(log N) (which is pessimal).
Blurring the apparent algorithmic distinctions between the two families of networks 
is the recent work in [12] which presents a computational framework wherein either 
of the butterfly or de Bruijn networks can emulate the other efficiently. In par- 
ticular, these emulations yield an algorithm for sorting on the de Bruijn network 
that rivals the efficiency of the Reif-Valiant [18] algorithm for butterfly networks. 
The present study is motivated in part by the unresolved questions implicit in the 
preceding paragraph: Which, if either, of the butterfly- and shuffle-oriented net- 
works is the more powerful? Under what circumstances are computationally more 
complicated emulations of the type introduced in [12] to be preferred to the purely 
structure-oriented emulations of [3,11], and vice versa? 
Remark. The major argument in favor of the structure-oriented framework is that 
its emulations yield algorithms that translate programs for the emulated network to
equivalent programs for the emulating network. The major argument in favor of 
the computationally more complicated framework is that it significantly expands the 
class of networks that a given network can emulate efficiently. 
Further motivating our study is the fact that the constructions in both [12] and 
[3] suggest that efficient emulations of either of these network families by the other 
are likely to be rather sophisticated and complicated. There would be value, therefore,
in having a family of networks which retains most of the structural simplicity
of both butterfly- and shuffle-oriented networks, but which can emulate both
families of networks simply (the emulation procedure should be simple to specify)
and efficiently (the host architecture should be able to emulate any of the guest
architectures in a work-preserving manner, i.e., so that the processor-time product
is preserved).⁴ In this paper, we study such a “least upper bound” family, the
product-shuffle (PS) networks; each PS network is a direct product of two de Bruijn 
networks. 
⁴ More precisely, an emulation of an N-processor architecture G by an (M ≤ N)-processor architecture
H is work preserving if H can emulate any sequence of T computational steps of G in at most c(N/M)T
steps, for some fixed overhead constant c.
The goals presented thus far can be satisfied by a variety of interconnection networks;
indeed, the simplest such network would just superpose a butterfly-oriented
network on a like-sized shuffle-oriented one. (Formally, this would amount to taking 
the union of the edge-sets of the two networks.) Why, then, should we bother with 
PS networks, which introduce the additional complication of the direct-product 
structure? The answer lies in our willingness to suffer a (very) modest amount of 
structural complication in return for a (very) considerable amount of algorithmic 
simplicity and efficiency. The following example should clarify our concerns. It is 
shown in [5] that neither butterfly- nor shuffle-oriented networks can emulate meshes 
(and a variety of other planar networks) efficiently, using the simple notion of 
emulation that we study here. This is a quite serious deficiency, given the importance 
of mesh-oriented parallel algorithms.⁵ PS networks overcome this deficiency, in
that every “reasonably shaped” PS network can emulate moderate-size meshes with
no slowdown, since it contains the meshes as subnetworks. Using the stronger
notion of emulation in [12], one can emulate meshes on butterfly- or shuffle-oriented
networks with only constant-factor slowdown; but, the constants are nontrivial, and
the emulation algorithm is rather complicated. There are further beneficial conse- 
quences of the direct-product structure of PS networks: 
• PS networks contain moderate-size copies of the computationally important
mesh-of-trees network (Theorem 3.7).
• PS networks admit simple and efficient load-balanced⁶ work-preserving emulations
of both butterfly- and shuffle-oriented networks (Theorems 4.2 and 4.4).
• PS networks admit compact VLSI layouts (Theorem 4.1).
Additionally, the direct-product structure of PS networks guarantees efficient deterministic
off-line permutation routing within the network [1]. Finally, the product
structure allows PS networks to be used as an ingredient in achieving algorithmic
enhancements to parallel architectures that would be prohibitively expensive to 
achieve in hardware [2]. It is not clear that any “superposed”, composite butterfly- 
plus-shuffle network could match all of the benefits that ensue from the direct- 
product structure of PS networks. 
The remainder of the paper is organized into three sections. In Section 2, we for- 
malize the topics of our study and present some simple basic results. Section 3 
verifies the large family of subnetworks of PS networks, that we have been alluding 
to. In Section 4, we compare PS networks with butterfly- and shuffle-oriented net- 
works, demonstrating efficient work-preserving emulations of the latter two families 
by the former family, and indicating the impossibility of efficient converse emula- 
tions. These results show that, within our framework of highly structured emulations 
(as opposed to the more general framework of [12]), PS networks are strictly more 
⁵ The parallel algorithm literature abounds with linear-algebraic and numerical algorithms that conform
naturally to the structure of a mesh.
⁶ We say that the load is balanced in an emulation when each processor of the emulating architecture
emulates the same number of processors of the emulated architecture.
powerful than either butterfly- or shuffle-oriented networks, by more than any constant
factor. Moreover, the added power comes at only moderate cost, in that PS
networks are 8-valent (in contrast to the 4-valence of butterfly- and shuffle-oriented
networks), and PS networks admit VLSI layouts that are only modestly bigger than 
the best layouts of like-sized butterfly- or shuffle-oriented networks. 
We have thus far used the term “shuffle-oriented networks” to refer ambiguously
to the de Bruijn and shuffle-exchange networks, and the term “butterfly-oriented 
network” to refer ambiguously to the butterfly and CCC networks. This ambiguity 
is justified by the fact that the de Bruijn and shuffle-exchange networks can each 
emulate the other with only a factor-of-2 slowdown, and the same is true of the but- 
terfly and CCC networks. We focus on the de Bruijn and butterfly networks in the 
sections that follow, since they yield smaller constant factors in our emulations, and 
they have richer families of subnetworks. 
2. The formal framework 
2.1. Interconnection networks as graphs
As is customary in structural studies of parallel architectures, we restrict attention 
to arrays of identical processing elements (PEs), and we view the architectures and 
their underlying interconnection networks as undirected graphs.
A directed graph G is specified by a set V_G of nodes and a multisubset A_G ⊆
V_G × V_G called arcs. One obtains an undirected graph G' from a directed graph G
by “symmetrizing” the set of arcs: one replaces each arc of G with a pair of mated
arcs having opposing directions. We refer to each mated pair of arcs as an edge of
the graph G'.⁷
The nodes of the graph represent the PEs of the architecture, and the edges of 
the graph represent the inter-PE communication links. We henceforth use the term
“graph” instead of “network”. 
2.1.1. Notation and terminology
• In phrases like “for all n”, n always ranges over the positive integers. For all
n, Z_n =def {0, 1, ..., n-1}, and λ(n) =def ⌈log n⌉. (All logarithms are to the base 2.)
• For any set S and nonnegative integer k: |S| denotes the cardinality of S; S^k
denotes the set of all length-k strings of elements of S; |x| denotes the length (= k)
of each x ∈ S^k.
• Given graphs G = (V_G, E_G) and H = (V_H, E_H), the (direct) product graph
G × H has node-set V_G × V_H. Let u and v be nodes of G, and let x and y be nodes
of H. Then ((u, x), (v, y)) is an edge of G × H just when either (u, v) is an edge of
G and x = y, or (x, y) is an edge of H and u = v.
⁷ Note that we allow self-loops and parallel edges.
Fig. 1. The (directed version of the) order-3 de Bruijn graph D_3.
• The degree of node u of graph G is the number of edges of G incident to (i.e.,
involving) u. The valence of G is the maximum degree of any of its nodes.
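The product-graph definition above can be illustrated with a short sketch. The Python below is only an illustration under stated assumptions: the function name `direct_product` and the edge-list representation are choices of this sketch, not notation from the paper.

```python
from itertools import product


def direct_product(nodes_g, edges_g, nodes_h, edges_h):
    """Return (nodes, edges) of the direct product G x H.

    Graphs are given as a list of nodes and a list of undirected
    edges (2-tuples). This representation is an assumption of the
    sketch.
    """
    nodes = list(product(nodes_g, nodes_h))
    edges = []
    # Case "(u, v) is an edge of G and x = y".
    for (u, v) in edges_g:
        for x in nodes_h:
            edges.append(((u, x), (v, x)))
    # Case "(x, y) is an edge of H and u = v".
    for (x, y) in edges_h:
        for u in nodes_g:
            edges.append(((u, x), (u, y)))
    return nodes, edges


# Example: the product of two single-edge graphs is a 4-cycle.
nodes, edges = direct_product([0, 1], [(0, 1)], ["a", "b"], [("a", "b")])
```

Each edge of one factor is copied once per node of the other factor, which is why the valence of a product is at most the sum of the valences of its factors.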
2.1.2. The graphs of interest
Let m, n be positive integers.
The order-n de Bruijn graph D_n and the order-n shuffle-exchange graph S_n are
the undirected versions of the following directed graphs. Both D_n and S_n have the
set of nodes Z_2^n. For all β ∈ Z_2 and x ∈ Z_2^{n-1}:
• In both D_n and S_n, each node of the form βx is connected via a shuffle arc to
node xβ.
Additionally:
- in D_n, each node of the form βx is connected via a shuffle-exchange arc to
node xβ̄;⁸
- in S_n, each node of the form xβ is connected via an exchange arc to
node xβ̄.
Figure 1 depicts the (directed version of the) order-3 de Bruijn graph D_3.
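As a concrete reading of these definitions, the following Python sketch lists the arcs of the directed versions of the de Bruijn and shuffle-exchange graphs. Nodes are represented as bit strings; the helper names (`binary_strings`, `flip`, `de_bruijn_arcs`, `shuffle_exchange_arcs`) are assumptions of this sketch.

```python
from itertools import product


def binary_strings(n):
    """All length-n strings over the alphabet {0, 1}."""
    return ["".join(bits) for bits in product("01", repeat=n)]


def flip(bit):
    """Complement of a single bit character."""
    return "1" if bit == "0" else "0"


def de_bruijn_arcs(n):
    """Arcs of the directed order-n de Bruijn graph."""
    arcs = []
    for w in binary_strings(n):
        beta, x = w[0], w[1:]
        arcs.append((w, x + beta))        # shuffle arc
        arcs.append((w, x + flip(beta)))  # shuffle-exchange arc
    return arcs


def shuffle_exchange_arcs(n):
    """Arcs of the directed order-n shuffle-exchange graph."""
    arcs = []
    for w in binary_strings(n):
        arcs.append((w, w[1:] + w[0]))          # shuffle arc
        arcs.append((w, w[:-1] + flip(w[-1])))  # exchange arc
    return arcs
```

For example, node 110 of the order-3 de Bruijn graph has arcs to 101 and 100, while in the shuffle-exchange graph it has arcs to 101 (shuffle) and 111 (exchange).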
The order-n butterfly graph B_n and the order-n cube-connected cycles graph
(CCC, for short) C_n have the set of nodes V_n = Z_n × Z_2^n. We call l the level of each
node in {l} × Z_2^n. For each l ∈ Z_n and each x = β_0 β_1 ... β_{n-1} ∈ Z_2^n:
• In both B_n and C_n, each level-l node v = (l, x) is connected via a straight-edge
with node ((l+1) mod n, x).
Additionally:
- in B_n, node v is connected via a cross-edge with node
((l+1) mod n, β_0 ... β_{l-1} β̄_l β_{l+1} ... β_{n-1});
- in C_n, node v is connected via a level-edge with node
(l, β_0 ... β_{l-1} β̄_l β_{l+1} ... β_{n-1}).
⁸ For β ∈ Z_2, β̄ = 1-β.
Fig. 2. The order-3 butterfly graph B_3, with level 0 replicated to aid visualization.
Figure 2 depicts the order-3 butterfly graph B_3.
The order-(m, n) product-shuffle graph (PS graph, for short) PS_{m,n} is the product
graph D_m × D_n.
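A small sketch may help fix the definition of the PS graph. The Python below builds the neighbor sets of the product of two de Bruijn graphs. It is an illustration under assumptions: the function names are hypothetical, and the sketch drops self-loops and merges parallel edges, so the degrees it reports can be smaller than the degrees counted in the paper, which allows both.

```python
from itertools import product


def de_bruijn_neighbors(n):
    """Undirected adjacency of the order-n de Bruijn graph.

    Self-loops are dropped and parallel edges merged (a
    simplification of this sketch).
    """
    words = ["".join(b) for b in product("01", repeat=n)]
    adj = {w: set() for w in words}
    for w in words:
        for bit in "01":
            t = w[1:] + bit
            if t != w:
                adj[w].add(t)
                adj[t].add(w)
    return adj


def product_shuffle_neighbors(m, n):
    """Adjacency of the product of the order-m and order-n de Bruijn graphs."""
    dm, dn = de_bruijn_neighbors(m), de_bruijn_neighbors(n)
    adj = {}
    for x in dm:
        for y in dn:
            # Move in the first coordinate or in the second, never both.
            adj[(x, y)] = {(x2, y) for x2 in dm[x]} | {(x, y2) for y2 in dn[y]}
    return adj


ps = product_shuffle_neighbors(3, 3)
max_degree = max(len(nbrs) for nbrs in ps.values())
```

With m = n = 3 the graph has 64 nodes, and a node such as (001, 001) has four distinct neighbors in each coordinate, giving eight neighbors in total.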
Remark. For brevity, we study only the base-2 versions of our graph families, by 
dint of our using Z_2 as the underlying alphabet of the graphs’ node-sets. One can
easily define arbitrary base-d versions of the graphs⁹ and extend our results with
only clerical changes. In [3], we deal with the general, base-d versions of our graph 
families. 
2.2. Emulation via graph embeddings
In defining the emulation of one architecture G by another architecture H, we
assume that the PEs of H are sufficiently powerful to emulate the PEs of G step
for step, so no delay is incurred because of computational steps. We restrict attention
to emulations that honor a pulsed computation regimen: Architecture H alternates
phases that emulate one computation step of architecture G, with phases that
emulate one communication step of architecture G.¹⁰ The slowdown incurred by
an emulation arises from two sources. First, we allow emulations that require one
PE of H to play the role of several PEs of G; second, architecture H must emulate
on its interconnection graph communication steps that are tailored to the (possibly
⁹ For illustration, the base-d order-n de Bruijn graph has node-set Z_d^n and edges connecting each node
of the form βx, where β ∈ Z_d and x ∈ Z_d^{n-1}, to all nodes of the form xγ ∈ Z_d^n.
¹⁰ This regimen of having H mimic the exact form of the computation by G motivates our using the
term “emulate” rather than “simulate”.
very different) structure of the interconnection graph underlying architecture G.
The second type of delay results both from mismatched adjacency structures and
from congested communication lines. Our study of emulations is based on the 
following notion of graph embeddings and their costs. 
2.2.1. Graph embeddings 
An embedding of the graph G in the graph H is specified by:
• a (possibly many-to-one) assignment α of the nodes of G to the nodes of H;
[A PE of H must emulate all of the PEs of G assigned to it via α.]
• a routing ρ of each edge (u, v) of G along a path in H connecting α(u) and
α(v).¹¹ [H must emulate each communication along edge (u, v) of G by transmitting
the “message” along the path ρ(u, v).]
2.2.2. Slowdown incurred by an emulation 
A number of factors induce slowdown when architecture H emulates architecture
G. We account for these factors very conservatively, by assuming that in each step
of an emulated computation, every PE of G performs a computation step followed
by a communication with all of its neighbors. Clearly, most algorithms will not exercise
the resources of G so exhaustively; hence, they might well be emulated by H
with less slowdown than our accounting procedures indicate. Say that we have an
embedding (α, ρ) of G in H.
• The load of the embedding is the maximum number of PEs of G assigned to
any one PE of H:
load(α, ρ) = max_{v ∈ V_H} |α^{-1}(v)|.
[Load induces slowdown because, in each computation phase of the emulation, each
PE of H must emulate a computation step by each PE of G assigned to it.]
• The dilation of the embedding is the maximum amount that the routing ρ
“stretches” any edge of G:
dilation(α, ρ) = max_{(u,v) ∈ E_G} Length(ρ(u, v)).
[Dilation incurs slowdown because every message that crosses link e in G must
traverse path ρ(e) in H.]
• The I/O-expansion of the embedding is the ratio of the valence of H to the
valence of G:
I/O-expansion(α, ρ) = valence(H) / valence(G).
¹¹ A length-l path in H from node x ∈ V_H to node y ∈ V_H is a sequence π of l+1 nodes x =
z_0, z_1, ..., z_l = y such that, for each 0 ≤ i < l, (z_i, z_{i+1}) ∈ E_H. By abuse of notation, we write “(z_i, z_{i+1}) ∈ π”.
[I/O-expansion incurs slowdown because at each computation step, G needs poll
only valence(G) I/O ports, while H must poll valence(H) ports.]
• The edge-congestion of the embedding is the maximum number of edges of G
that ρ routes over a single edge of H:
edge-congestion(α, ρ) = max_{e ∈ E_H} |{e' ∈ E_G : e ∈ ρ(e')}|.
[Edge-congestion incurs slowdown because the messages that want to cross a congested
edge must be queued up. (For simplicity, we are giving each edge of H the
same capacity as a single edge of G.)]
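The cost measures just defined can be computed mechanically from an assignment and a routing. The following Python sketch does so for a toy embedding; the function name `embedding_costs`, the dictionary-based representation, and the toy example are all assumptions of this sketch rather than anything taken from the paper.

```python
from collections import Counter


def embedding_costs(assignment, routing):
    """Compute (load, dilation, edge_congestion) of an embedding.

    assignment: dict mapping each guest node to a host node.
    routing: dict mapping each guest edge (u, v) to a list of host
             nodes forming a path from assignment[u] to assignment[v].
    """
    # Load: the largest number of guest nodes on one host node.
    load = max(Counter(assignment.values()).values())
    # Dilation: the longest routing path, measured in edges.
    dilation = max(len(path) - 1 for path in routing.values())
    # Edge-congestion: the largest number of guest edges whose
    # routing paths use the same (undirected) host edge.
    usage = Counter()
    for path in routing.values():
        host_edges = {frozenset(step) for step in zip(path, path[1:])}
        usage.update(host_edges)
    congestion = max(usage.values()) if usage else 0
    return load, dilation, congestion


# Toy example: embed the 4-cycle 0-1-2-3-0 in the path a-b-c-d.
assignment = {0: "a", 1: "b", 2: "c", 3: "d"}
routing = {
    (0, 1): ["a", "b"],
    (1, 2): ["b", "c"],
    (2, 3): ["c", "d"],
    (3, 0): ["d", "c", "b", "a"],
}
costs = embedding_costs(assignment, routing)
```

In the toy example the wrap-around edge (3, 0) must travel the whole path, so the embedding has load 1, dilation 3, and edge-congestion 2.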
The slowdown due to load and I/O-expansion seems to be unavoidable. In contrast,
one can avoid the slowdown due to edge-congestion by increasing the bandwidth of
H’s communication links, at the cost of increased hardware and increased layout
area. Another avenue for mitigating the effects of edge-congestion is to orchestrate
the communication phases of H, so that message traffic is spread uniformly along
the paths of H that are used to emulate the links of G; this ploy, which allows one
to amortize edge-congestion over the paths that create dilation, is used to decrease
the slowdown of the emulations in [3]. Another form of orchestrating the communication
phases of emulations leads to the following result, which guarantees that
load, dilation, and edge-congestion can always be made to combine additively, 
rather than multiplicatively (as a naive analysis would suggest). The main analysis 
leading to this result appears in [15]; its extension to the current framework appears 
in [12]. 
Proposition 2.1 [12,15]. Say that one can embed the graph G in the graph H, with
load L, dilation D, and edge-congestion C. Then the architecture based on H can
emulate T steps of the architecture based on G on a general computation in
O(L + C + D)T steps.
Aside from assuring us that proper scheduling can make the costs of an emulation 
combine additively rather than multiplicatively, Proposition 2.1 also demonstrates 
that our purely graph-theoretic formalism does, indeed, model the algorithmic 
situation that we want it to. 
Proposition 2.1 points out the importance of balancing the load of an emulation 
(cf. footnote 6), i.e., of keeping the quantity 
bounded by a constant. Every emulation we present here has balanced load. 
Since I/O-expansion is a property only of the structures of G and H, and not of
any embedding of G in H, we do not discuss it further.
Obviously, one strives to make emulations as “efficient” as possible. One can
argue (cf. [12]) that one type of efficiency that is very desirable in emulations is work
preservation: When the N-PE architecture G operates for T steps, it can perform
NT atomic operations. If the (M ≤ N)-PE architecture H emulates these T steps of
G, it requires at least ⌈N/M⌉T steps to perform the same amount of work. Allowing
a (hopefully small) constant factor leeway as overhead for the emulation, we say
that the emulation of G by H is work preserving if H can emulate any T steps of
G in at most O(⌈N/M⌉)T steps; cf. footnote 4. All of the emulations we present
in this paper are work preserving. 
2.2.3. Two quasi-isometries
We are finally in a position to formalize our discussion at the end of Section 1, 
concerning the “equivalence” of the de Bruijn and shuffle-exchange graphs, on the 
one hand, and the butterfly and CCC graphs, on the other hand. 
Proposition 2.2. For all n:
(a) One can embed either of the shuffle-exchange graph S_n or the de Bruijn
graph D_n in the other, with load 1, dilation 2, and edge-congestion 2.
(b) (i) One can embed the butterfly graph B_n in the CCC graph C_n, with load
1, dilation 2, and edge-congestion 2.
(ii) [10] The CCC graph C_n is a subgraph of the butterfly graph B_n; hence,
one can embed C_n in B_n with load 1, dilation 1, and edge-congestion 1.
Proof. (Sketch). For each of the three embeddings of part (a) and part (b)(i), we
employ the identity assignment. Ignoring edges that are shared by the respective
pairs of graphs in the embeddings (hence use the identity routing), we route
• edge (βx, xβ̄) of D_n along the following length-2 path in S_n:
βx ↔ xβ ↔ xβ̄;
• edge (xβ, xβ̄) of S_n along the following length-2 path in D_n:
xβ ↔ βx ↔ xβ̄;
• edge ((l, x), ((l+1) mod n, x')) of B_n along the following length-2 path in C_n:
(l, x) ↔ (l, x') ↔ ((l+1) mod n, x').
For the embedding of part (b)(ii), our assignment branches on the weight of the
string x in node (l, x) of C_n, i.e., the number of 1s in x. If x has even weight, then
node (l, x) of C_n is assigned to node (l, x) of B_n; if x has odd weight, then node
(l, x) of C_n is assigned to node ((l+1) mod n, x) of B_n. Details are left to the
reader. □
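The two routings of part (a) can be checked by brute force for small n. The Python sketch below does this under assumptions of its own: nodes are bit strings, undirected edges are stored as frozensets (so a self-loop becomes a one-element set), and the function names are hypothetical.

```python
from itertools import product


def flip(bit):
    return "1" if bit == "0" else "0"


def words(n):
    return ["".join(b) for b in product("01", repeat=n)]


def de_bruijn_edges(n):
    """Undirected edges of the order-n de Bruijn graph."""
    return {frozenset((w, w[1:] + bit)) for w in words(n) for bit in "01"}


def shuffle_exchange_edges(n):
    """Undirected edges of the order-n shuffle-exchange graph."""
    shuffles = {frozenset((w, w[1:] + w[0])) for w in words(n)}
    exchanges = {frozenset((w, w[:-1] + flip(w[-1]))) for w in words(n)}
    return shuffles | exchanges


def check_routings(n):
    """Check that both length-2 routings use only edges of the host."""
    d_edges, s_edges = de_bruijn_edges(n), shuffle_exchange_edges(n)
    for w in words(n):
        beta, x = w[0], w[1:]
        # Shuffle-exchange edge of the de Bruijn graph, routed in the
        # shuffle-exchange graph: shuffle first, then exchange.
        path = [beta + x, x + beta, x + flip(beta)]
        if not all(frozenset(e) in s_edges for e in zip(path, path[1:])):
            return False
        # Exchange edge of the shuffle-exchange graph, routed in the
        # de Bruijn graph through the common predecessor beta x.
        path = [x + beta, beta + x, x + flip(beta)]
        if not all(frozenset(e) in d_edges for e in zip(path, path[1:])):
            return False
    return True
```

Calling `check_routings(n)` for small n confirms that each routed edge is stretched to a path of length at most 2 in the host graph; the sketch does not measure edge-congestion.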
2.3. Structural characteristics of the graphs of interest
The diameter (maximum inter-node distance) of a graph H bounds above both
the dilation of any embedding into H and the time required for any single-node
broadcast in H. Therefore, the following table places our emulation results in
perspective and provides an interesting comparison of PS_{m,n} with its “competitors”.
One noteworthy point is that PS graphs share diameter (exactly) log_2 N with de
Bruijn graphs and hypercubes, although de Bruijn graphs acquire their small diameter
with valence 4, while PS graphs have valence 8 and hypercubes have valence
log_2 N.
Proposition 2.3. For all n:
     Graph      Size           Valence   Diameter
(a)  D_n        N = 2^n        4         log N
(b)  B_n        N = n 2^n      4         2 log N - 2 log log N
(c)  PS_{m,n}   N = 2^{m+n}    8         log N
Proof. Parts (a)-(b) being well known, we concentrate on part (c). One can proceed
from any node (x, y) of PS_{m,n} to any other node (x', y') by
(1) proceeding from node (x, y) to node (x', y) in at most |x| = m steps by
mimicking the way one would proceed from node x to node x' in D_m;
(2) proceeding from node (x', y) to node (x', y') in at most |y| = n steps by
mimicking the way one would proceed from node y to node y' in D_n. □
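The diameter entries for the de Bruijn and PS graphs can be confirmed on small instances by breadth-first search. The Python sketch below is an illustration only; the function names and the adjacency-dictionary representation are assumptions of this sketch.

```python
from collections import deque
from itertools import product


def de_bruijn_adj(n):
    """Undirected adjacency of the order-n de Bruijn graph."""
    ws = ["".join(b) for b in product("01", repeat=n)]
    adj = {w: set() for w in ws}
    for w in ws:
        for bit in "01":
            t = w[1:] + bit
            adj[w].add(t)
            adj[t].add(w)
    return adj


def product_adj(adj_g, adj_h):
    """Adjacency of the direct product of two graphs."""
    return {
        (u, x): {(v, x) for v in adj_g[u]} | {(u, y) for y in adj_h[x]}
        for u in adj_g
        for x in adj_h
    }


def eccentricity(adj, source):
    """Largest breadth-first distance from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return max(dist.values())


def diameter(adj):
    return max(eccentricity(adj, s) for s in adj)
```

For instance, the order-3 de Bruijn graph has diameter 3, and the product of the order-2 and order-3 de Bruijn graphs has diameter 2 + 3 = 5, matching the two-phase walk used in the proof.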
3. Computationally important subgraphs of PS_{m,n}
PS graphs contain a variety of computationally useful graphs as subgraphs, i.e.,
as graphs that can be embedded with unit load, edge-congestion, and dilation; an 
architecture based on PS graphs can emulate an architecture based on any of these 
subgraphs with no slowdown. 
3.1. Cycles
The N-node cycle is the graph whose nodes comprise the set Z_N and whose
edges connect each node v with node (v+1) mod N.
It is well known that D_n is Hamiltonian in that it contains the 2^n-node cycle as a
subgraph. In fact, D_n satisfies the following stronger property.
Lemma 3.1 [24]. For all n, the de Bruijn graph D_n is pancyclic; that is, for all
1 ≤ k ≤ 2^n, the k-node cycle is a subgraph of D_n.¹²
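For small n, the lemma can be confirmed by exhaustive search. The Python sketch below looks for a directed simple cycle of each length in the directed de Bruijn graph, which suffices because every directed cycle is also a cycle of the undirected version (with lengths 1 and 2 realized by a self-loop and a pair of parallel edges). The function names are assumptions of this sketch, and the search is exponential, so it is only meant for small n.

```python
from itertools import product


def de_bruijn_successors(n):
    """Successor lists of the directed order-n de Bruijn graph."""
    succ = {}
    for b in product("01", repeat=n):
        w = "".join(b)
        succ[w] = [w[1:] + bit for bit in "01"]
    return succ


def has_cycle_of_length(succ, k):
    """Search for a directed simple cycle on exactly k nodes."""

    def extend(start, node, visited):
        if len(visited) == k:
            # Close the cycle back to its starting node.
            return start in succ[node]
        for nxt in succ[node]:
            # Visit only nodes larger than start, so each cycle is
            # searched from its smallest node.
            if nxt > start and nxt not in visited:
                visited.add(nxt)
                if extend(start, nxt, visited):
                    return True
                visited.remove(nxt)
        return False

    return any(extend(s, s, {s}) for s in succ)


def is_pancyclic(n):
    succ = de_bruijn_successors(n)
    return all(has_cycle_of_length(succ, k) for k in range(1, 2 ** n + 1))
```

For n = 3, the search finds, among others, the length-5 cycle 000, 001, 011, 110, 100.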
PS graphs share this property, whose computational benefits are exploited in the 
emulations in [3] and in our Section 4.
Theorem 3.2. For all m, n except for m = n = 1, the PS graph PS_{m,n} is pancyclic.
¹² Since we allow self-loops and parallel edges, it makes sense to talk about cycles of lengths 1 and 2.
Proof. For any choice of m, n other than m = n = 1, and for any integer 1 ≤ c ≤ 2^{m+n},
we show algorithmically that the c-node cycle is a subgraph of PS_{m,n}. Our algorithm
assumes that the cycles promised by Lemma 3.1 can be produced algorithmically;
cf. [24]. (Note that PS_{1,1} is (essentially) a 4-cycle, whence its exclusion from the
theorem.)
Assume, with no loss of generality, that m ≤ n (or else, interchange the roles of
m and n in what follows). If the desired cycle-length c satisfies 1 ≤ c ≤ 2^n, then the
c-node cycle is a subgraph of PS_{m,n}, by Lemma 3.1. Let us restrict attention, therefore,
to values of c in the range 2^n < c ≤ 2^{m+n}, in which case we must have m > 0.
Now, every integer c in the indicated range admits a unique representation in the
form
c = a 2^n + b
with 0 < a ≤ 2^m and 0 ≤ b < 2^n. The overall strategy of our algorithm is to “hook
together” Hamiltonian cycles from a of the 2^m copies of D_n that comprise PS_{m,n},
together with a length-b cycle from one additional copy of D_n whenever b > 0. (In
fact, technical difficulties in “hooking up” these cycles will cause us to deviate from
this strategy slightly.) To the end of implementing this strategy, we invoke Lemma
3.1 to find a length-d cycle in D_m, where
d = a if b = 0,
d = a + 1 if b > 0,
and we use this cycle in the natural way to select and order d “consecutive”
copies of D_n from the 2^m copies that comprise PS_{m,n}; call the ordered copies
D^(0), D^(1), ..., D^(d-1).
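The arithmetic behind this decomposition is just integer division by 2^n. The following Python sketch computes a, b, and d for a given cycle length; the function name `decompose_cycle_length` is an assumption of this sketch.

```python
def decompose_cycle_length(c, m, n):
    """Write c as a * 2**n + b and return (a, b, d).

    Requires 2**n < c <= 2**(m + n). Here d is the number of copies
    of the order-n de Bruijn graph that the construction uses:
    d = a when b == 0, and d = a + 1 when b > 0.
    """
    if not (2 ** n < c <= 2 ** (m + n)):
        raise ValueError("c must satisfy 2**n < c <= 2**(m+n)")
    a, b = divmod(c, 2 ** n)
    d = a if b == 0 else a + 1
    return a, b, d
```

For example, with m = 2 and n = 3, the length c = 21 decomposes as 2·8 + 5, so two Hamiltonian cycles and one length-5 cycle are hooked together across d = 3 copies. In the allowed range d never exceeds 2^m, so a length-d cycle in the order-m de Bruijn graph always exists by Lemma 3.1.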
We describe the mechanism for “hooking the cycles together” via an analysis of
cases.
Case 1: b = 0, so d = a and a > 1. This is the easiest case, since we have only to
“hook together” a set C_0, C_1, ..., C_{a-1} of cycles, each C_i being a copy within D^(i)
of a Hamiltonian cycle C of D_n. We start by selecting any two independent edges
(x, y) and (u, v) of D_n¹³ that both lie on the cycle C; since n ≥ 2, we are sure that
these edges exist. Next, we let x_i, y_i, u_i, v_i (0 ≤ i < a) denote the instances of the
nodes x, y, u, v, respectively, in copy D^(i) of D_n. Assume that the nodes x, y, u,
v lie in (say, for definiteness) clockwise order around the cycle C in D_n, so that
each cycle C_i has the form
x_i ↔ y_i ↔ P_i ↔ u_i ↔ v_i ↔ Q_i ↔ x_i
where P_i and Q_i are the intermediate paths that define the cycle.
We are now ready to find a length-c cycle in PS_{m,n}.
(1) Trace the cycle C_0 in D^(0) in clockwise order, from node y_0 to node x_0, leaving
out the edge that connects the two nodes.
¹³ “Independence” implies that {x, y} ∩ {u, v} = ∅.
(2) Trace the following path to complete the cycle:
x_0 ↔ x_1 ↔ Q_1 ↔ v_1 ↔ v_2 ↔ Q_2 ↔ x_2 ↔ x_3 ↔ Q_3 ↔ v_3 ↔ ...
↔ q_{a-1} ↔ Q_{a-1} ↔ r_{a-1} ↔ s_{a-1} ↔ P_{a-1} ↔ t_{a-1} ↔ ...
↔ u_3 ↔ P_3 ↔ y_3 ↔ y_2 ↔ P_2 ↔ u_2 ↔ u_1 ↔ P_1 ↔ y_1 ↔ y_0
where
q, r, s, t = x, v, u, y respectively, if a is even,
q, r, s, t = v, x, y, u respectively, if a is odd.
The paths P_i and Q_i and the edges (x_i, y_i) and (u_i, v_i) come from the copies of D_n,
while the edges (x_i, x_{i+1}), (y_i, y_{i+1}), (u_i, u_{i+1}), and (v_i, v_{i+1}) come from the copy of
D_m we used to order the copies of D_n.
The reader should be able to fill in details, with the help of Fig. 3. 
Case 2: 0 < b < 2^n, so d = a + 1 and 0 < a < 2^m. The added challenge in this case
arises from the need to append a cycle of length b to the chain of a Hamiltonian 
cycles created in Case 1. The mechanism we use depends on the value of b. 
Fig. 3. A schematic view of the cycle constructed in Theorem 3.2.
Case 2(a): b ≥ 3. We must alter the procedure of Case 1 in two ways: we must find
a copy of a length-b cycle in copy D^(a) of D_n, and we must ensure that we can
“hook” this new cycle to the chain of Hamiltonian cycles. The first of these tasks
is trivial, by Lemma 3.1; let us call the length-b cycle B. In order to accomplish the
second task, we invoke a strong property of D_n:
Claim. For any path x ↔ y ↔ z in D_n involving three distinct nodes, there is a
Hamiltonian cycle of D_n that contains either the edge (x, y) or the edge (y, z).
Proof. The Claim is true by inspection when n = 2; when n > 2, it follows from
standard facts about de Bruijn graphs.
Fact 1. For all n, D_n is the line-graph of D_{n-1}.
Fact 2. As a consequence of Fact 1, one can construct a Hamiltonian cycle in D_n
from any Eulerian cycle in D_{n-1}.
Fact 3. Given any Eulerian graph G and any 2-edge path π in G whose removal
does not disconnect G, one can construct an Eulerian cycle in G which contains π.
Fact 4. The only 2-edge paths whose removal disconnects D_n are the paths both
of whose edges are incident to either node 00...0 or node 11...1.
Because of Fact I:, a 3-node path 
in gn results from a 3-edge path 
in %)“, 1. Since at most two of these edges can both be incident to either node 6 or 
node 1 in 9, _ 1, it follows by Facts 3 and 4 that there is an Eulerian cycle in 9,_ I 
passing through either the path 
or the path 
x0 Y-Z. 
Fact 2 assures us that, in the former case, there is a Hamiltonian cycle in %In passing 
through edge (x, y), while in the latter case, there is a Hamiltonian cycle passing 
through edge (y,z). 
By dint of the Claim and the fact that $b \ge 3$, we can find an edge $e$ of the length-$b$
cycle $B$ in $\mathcal{D}_n^{(a)}$ that lies on a Hamiltonian cycle of $\mathcal{D}_n$. Let us choose edge $e$ as the
edge $(x, y)$ of Case 1 if $a$ is even or as the edge $(u, v)$ of Case 1 if $a$ is odd. We then
alter the trajectory of the length-$c$ cycle after the initial path within $\mathcal{D}_n^{(0)}$, by replacing
the length-1 path
$$r_{a-1} \leftrightarrow s_{a-1}$$
with the length-$(b+1)$ path
$$r_{a-1} \leftrightarrow r_a \leftrightarrow S \leftrightarrow s_a \leftrightarrow s_{a-1}$$
where $r$, $s$ are as in Case 1, and $S$ is the length-$(b-1)$ path within cycle $B$ that connects
nodes $r_a$ and $s_a$ in $\mathcal{D}_n^{(a)}$ once edge $(r_a, s_a)$ is removed.
Case 2(b): $b = 2$. We proceed exactly as in Case 1, except that we alter the trajectory
of the length-$a2^n$ cycle of that case by replacing the length-1 path
$$r_{a-1} \leftrightarrow s_{a-1}$$
with the length-3 path
$$r_{a-1} \leftrightarrow r_a \leftrightarrow s_a \leftrightarrow s_{a-1}$$
where $r$, $s$ are as in Case 1.
Case 2(c): b = 1. We branch immediately on the value of n. 
Case 2(c)(i). When $n = 2$, we proceed exactly as in Case 1, until we have to deal
with copy $\mathcal{D}_n^{(a-1)}$ of $\mathcal{D}_n$. At that point we replace the length-3 path
from Case 1 with a length-4 path of one of the forms 
qa-1 ‘-‘&I *s,o to@ tp-_l 
or 
%-l”%“%*%-l ~to-I 
within copies $\mathcal{D}_n^{(a-1)}$ and $\mathcal{D}_n^{(a)}$ of $\mathcal{D}_n$. One verifies readily that one of these paths
exists.
Case 2(c)(ii). When $n > 2$, we alter Case 1 by insisting that at least one of the
independent edges $(x, y)$ and $(u, v)$ not touch either node $\bar{0}$ or node $\bar{1}$ of $\mathcal{D}_n$. (Note
that this is impossible when $n = 2$.) Say, without loss of generality, that node $\bar{0}$ is
not touched by either edge.
Having thus restricted the choice of these edges, we proceed exactly as in Case
2(b) ($b = 2$), with the following exception. Once having found the cycle produced in
Case 2(b) (which has length $c + 1$), we remove the instance of node $\bar{0}$ of $\mathcal{D}_n$ from
whichever of $P_{a-1}$ or $Q_{a-1}$ contains an instance of $\bar{0}$. (One must, because of our
restriction.) Since every Hamiltonian cycle in $\mathcal{D}_n$ contains the path
$$1\bar{0} \leftrightarrow \bar{0} \leftrightarrow \bar{0}1,$$
the elision of node $\bar{0}$ does not cut our cycle: it just shortens it, as desired.
This case analysis completes the proof. □
Very little of the proof of Theorem 3.2 depends on properties that are peculiar 
to de Bruijn graphs. In fact, F. Annexstein and M. Baumslag [personal communica- 
tion] have observed that an altered version of the proof will establish the following. 
Proposition 3.3. Let $\mathcal{G}$ and $\mathcal{H}$ be pancyclic graphs, one of which, say $\mathcal{G}$, has an
even number of nodes. Suppose that for every integer $2 \le l \le |\mathcal{G}|$, $\mathcal{G}$ has a length-$l$
cycle that shares an edge with a Hamiltonian cycle. Then the product graph $\mathcal{G} \times \mathcal{H}$
is pancyclic, except when $|\mathcal{G}| = |\mathcal{H}| = 2$.
3.2. Meshes 
The $m \times n$ (toroidal) mesh $\mathcal{M}_{m,n}$ is (isomorphic to) the product graph $\mathcal{R}_m \times \mathcal{R}_n$
of the length-$m$ and length-$n$ cycles.
One corollary of the main lower bound in [5] is that no butterfly- or shuffle-
oriented graph $\mathcal{G}$ can emulate meshes with only constant slowdown (using our
notion of emulation); it follows, of course, that $\mathcal{G}$ cannot contain a large mesh as
a subgraph. In contrast, PS graphs contain moderate-size meshes as subgraphs, as
indicated in the following corollary of Lemma 3.1.
Corollary 3.4. For all $m, n$ and all $p \le 2^m$ and $q \le 2^n$, the PS graph $\mathcal{P}_{m,n}$ contains
the mesh $\mathcal{M}_{p,q}$ as a subgraph.
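The corollary rests on finding a cycle of each length in each factor and taking the product of two such cycles. The following is a minimal sketch that checks this by brute force on small cases; it treats de Bruijn nodes as integers and uses the directed shift edges, and the helper names (`successors`, `find_cycle`, `mesh_in_product`) are illustrative choices, not notation from the paper.

```python
def successors(x, n):
    """Out-neighbours of node x in the n-bit de Bruijn graph (shift left, append a bit)."""
    mask = (1 << n) - 1
    return [((x << 1) & mask) | bit for bit in (0, 1)]


def find_cycle(n, length):
    """Depth-first search for a cycle on `length` distinct nodes; None if none is found."""
    def extend(path, seen):
        if len(path) == length:
            # The cycle closes if the last node has an edge back to the first.
            return path if path[0] in successors(path[-1], n) else None
        for nxt in successors(path[-1], n):
            if nxt not in seen:
                seen.add(nxt)
                found = extend(path + [nxt], seen)
                if found is not None:
                    return found
                seen.remove(nxt)
        return None

    for start in range(1 << n):
        found = extend([start], {start})
        if found is not None:
            return found
    return None


def mesh_in_product(m, n, p, q):
    """Map mesh node (i, j) to (rows[i], cols[j]) and check every toroidal mesh edge."""
    rows, cols = find_cycle(m, p), find_cycle(n, q)
    if rows is None or cols is None:
        return False
    for i in range(p):
        for j in range(q):
            # Row-direction mesh edges change only the first coordinate.
            if rows[(i + 1) % p] not in successors(rows[i], m):
                return False
            # Column-direction mesh edges change only the second coordinate.
            if cols[(j + 1) % q] not in successors(cols[j], n):
                return False
    return True
```

For instance, `mesh_in_product(3, 3, 5, 7)` looks for a 5 x 7 toroidal mesh inside the product of two 8-node de Bruijn graphs. The search is exponential in the worst case, so it is only meant for small parameters.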
3.3. Complete binary trees 
The height-$n$ complete binary tree $\mathcal{T}_n$ is the graph whose $2^{n+1} - 1$ nodes comprise
the set $\bigcup_{i=0}^{n} Z_2^i$ of binary strings of length $\le n$ and whose edges connect each node
$x$ of length $< n$ with nodes $x0$ and $x1$.
Complete binary trees are very useful computational structures, most obviously 
for broadcasting, but also for emulations [5]. Thus, the following obvious result 
points out one of the most useful properties of de Bruijn graphs; cf. [20]. 
Lemma 3.5. For all $n$, the de Bruijn graph $\mathcal{D}_n$ contains the complete binary tree
$\mathcal{T}_{n-1}$ as a subgraph, rooted at node $\bar{0}1$.
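One way to see the lemma concretely is to send the tree node $x$ to the $n$-bit string consisting of zeros, then a single 1, then $x$. The sketch below checks that this map is injective and carries tree edges to shift edges; the map is a standard construction assumed here for illustration (the paper does not spell it out), and the function names are hypothetical.

```python
def tree_to_de_bruijn(x, n):
    """Map tree node x (a binary string with len(x) <= n-1) to the n-bit string 0...01x."""
    return "0" * (n - 1 - len(x)) + "1" + x


def is_de_bruijn_edge(a, b):
    """True if b is obtained from a by dropping the first bit and appending one bit."""
    return a[1:] == b[:-1]


def check_tree_embedding(n):
    """Verify that the height-(n-1) complete binary tree embeds in the n-bit de Bruijn graph."""
    images = set()
    stack = [""]  # the empty string is the tree's root
    while stack:
        x = stack.pop()
        node = tree_to_de_bruijn(x, n)
        if node in images:
            return False  # two tree nodes collided
        images.add(node)
        if len(x) < n - 1:
            for bit in "01":
                if not is_de_bruijn_edge(node, tree_to_de_bruijn(x + bit, n)):
                    return False
                stack.append(x + bit)
    # The tree has 2^n - 1 nodes: every n-bit string except the all-zero one is used.
    return len(images) == 2 ** n - 1
```

Under this map the root lands on the string of $n-1$ zeros followed by a 1, which matches the root named in the lemma.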
While PS graphs cannot match the fact that the $N$-node de Bruijn graph contains the
$(N - 1)$-node complete binary tree, they do come within a factor of 2 of matching it.
Theorem 3.6. For all $m, n$, the PS graph $\mathcal{P}_{m,n}$ contains the complete binary tree
$\mathcal{T}_{m+n-2}$ as a subgraph.
Proof. We find an instance of $\mathcal{T}_{m+n-2}$ rooted at node $v_0 = (\bar{0}1, \bar{0}1)$ of $\mathcal{P}_{m,n}$ as
follows. We first invoke Lemma 3.5 to find a copy of $\mathcal{T}_{n-1}$ in "copy $\bar{0}1$" of $\mathcal{D}_n$,
rooted at node $v_0$ and having $2^{n-1}$ leaves of the form $(\bar{0}1, x)$ for some node $x$ of $\mathcal{D}_n$.
We then invoke Lemma 3.5 once for each "copy" of $\mathcal{D}_m$ that is "connected" to
one of these leaves, to find a copy of $\mathcal{T}_{m-1}$ rooted at each of these leaves. □
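The two-phase construction in this proof can be checked mechanically on small cases. The sketch below is one possible realization, assuming the zeros-then-1-then-$x$ map for Lemma 3.5 (an assumption of this example, not a detail given in the paper): the first $n-1$ bits of a tree node steer the second coordinate, and the remaining bits steer the first coordinate. The names `embed`, `is_ps_edge`, and `check` are illustrative.

```python
def embed(z, m, n):
    """Map tree node z (binary string, len(z) <= m+n-2) to a PS node (first, second)."""
    x, y = z[: n - 1], z[n - 1:]
    second = "0" * (n - 1 - len(x)) + "1" + x  # position inside the D_n factor
    first = "0" * (m - 1 - len(y)) + "1" + y   # position inside the D_m factor
    return first, second


def is_ps_edge(a, b):
    """A product edge changes exactly one coordinate along a de Bruijn shift edge."""
    def shift(s, t):
        return s[1:] == t[:-1]
    return (a[1] == b[1] and shift(a[0], b[0])) or (a[0] == b[0] and shift(a[1], b[1]))


def check(m, n):
    """Verify an injective, edge-preserving copy of the height-(m+n-2) tree."""
    height = m + n - 2
    seen, stack = set(), [""]
    while stack:
        z = stack.pop()
        node = embed(z, m, n)
        if node in seen:
            return False
        seen.add(node)
        if len(z) < height:
            for bit in "01":
                if not is_ps_edge(node, embed(z + bit, m, n)):
                    return False
                stack.append(z + bit)
    return len(seen) == 2 ** (height + 1) - 1
```

The final count, $2^{m+n-1} - 1$ tree nodes in a host with $2^{m+n}$ nodes, is the "factor of 2" mentioned before the theorem.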
To place Theorem 3.6 in perspective, the efficient embedding of complete binary
trees in butterfly graphs presented in [5] promises only constant (as opposed to unit)
dilation and utilizes only roughly one-eighth of the nodes of the host butterfly.
(These constants can be improved somewhat, but not to unity.)
3.4. Meshes of trees 
The $m \times n$ mesh of trees $\mathcal{MT}_{m,n}$ ($m$ and $n$ being powers of 2) is obtained from the
$m \times n$ mesh by
• eliminating all mesh edges,
• erecting a copy of the complete binary tree $\mathcal{T}_{\log m}$ along each column, using the
column-nodes as leaves of the tree,
• erecting a copy of the complete binary tree $\mathcal{T}_{\log n}$ along each row, using the
row-nodes as leaves of the tree.
Parallel architectures based on meshes of trees have been shown to be quite 
powerful computationally [14]. One can prove, using Lemma 3.5, that PS graphs 
contain at least moderate-size meshes of trees as subgraphs. 
Theorem 3.7. For all $m, n$, and for all $p, q$ such that $2^p \le m/2$, $2^q \le n/2$, the PS
graph $\mathcal{P}_{m,n}$ contains the mesh of trees $\mathcal{MT}_{2^p,2^q}$ as a subgraph.
Proof. Let us denote by $x_i$, $1 \le i \le 2^p$, the $i$th leaf of $\mathcal{T}_p$ and by $y_i$, $1 \le i \le 2^q$, the $i$th
leaf of $\mathcal{T}_q$. By Lemma 3.5, $\mathcal{P}_{m,n}$ contains the product graph $\mathcal{T}_p \times \mathcal{T}_q$ as a subgraph.
Consequently, $\mathcal{P}_{m,n}$ contains as a subgraph every tree of the form $\{x_i\} \times \mathcal{T}_q$ as well
as every tree of the form $\mathcal{T}_p \times \{y_i\}$. The union of all these subgraphs is $\mathcal{MT}_{2^p,2^q}$. □
4. PS vs. butterfly vs. de Bruijn networks
In this section we demonstrate that, relative to our notion of network emulation, 
PS graphs have strictly more communication power than either shuffle- or butterfly- 
oriented graphs. Our demonstration consists of efficient embeddings of both de 
Bruijn graphs (Section 4.1) and butterfly graphs (Section 4.2) in PS graphs, followed 
by a proof that PS graphs cannot be embedded efficiently in either of the other two 
families (Section 4.3). We close with a discussion in Section 4.4 of the price one pays 
for the additional power of PS graphs. 
Our embeddings of de Bruijn and butterfly graphs in PS graphs are presented in 
two stages, the first assuming that the guest and host graphs in the embedding have 
the same number of nodes and the second assuming that the host PS graph is smaller 
than the guest. 
4.1. PS networks emulating de Bruijn networks 
We consider first the (technically) easier problem of emulating de Bruijn graphs 
on PS graphs. 
Lemma 4.1. For all $m$ and $n$, one can embed the de Bruijn graph $\mathcal{D}_{m+n}$ in the PS
graph $\mathcal{P}_{m,n}$, with load 1, dilation 2, and edge-congestion 2.
Proof. Each node $x$ of $\mathcal{D}_{m+n}$ is a length-$(m + n)$ binary string. Let $\langle x \rangle_m$ denote the
length-$m$ prefix of this string, and let $[x]_n$ denote the length-$n$ suffix of this string.
The assignment function of the desired embedding is given by:
$$\alpha(x) = (\langle x \rangle_m, [x]_n)$$
for all $x \in Z_2^{m+n}$. Each edge of $\mathcal{D}_{m+n}$ has the form $(\beta w, w\delta)$ for $\beta \in Z_2$, $w \in Z_2^{m+n-1}$,
and $\delta \in \{\beta, \bar{\beta}\}$. Let us write each $w \in Z_2^{m+n-1}$ in the form $w = u\gamma v$, with $u \in Z_2^{m-1}$,
$\gamma \in Z_2$, and $v \in Z_2^{n-1}$. Then the routing function $\rho$ of the embedding realizes the edge
$$(\beta w, w\delta) = (\beta u\gamma v, u\gamma v\delta)$$
via the following length-2 path in $\mathcal{P}_{m,n}$:
$$\alpha(\beta w) = \alpha(\beta u\gamma v) = (\beta u, \gamma v) \leftrightarrow (u\gamma, \gamma v) \leftrightarrow (u\gamma, v\delta) = \alpha(u\gamma v\delta) = \alpha(w\delta).$$
This embedding clearly has dilation 2. The claimed edge-congestion follows from
the facts that the first edge in the length-2 path identifies the edge of $\mathcal{D}_{m+n}$ being
emulated, up to the identity of $\delta$, while the second edge identifies the edge being
emulated, up to the identity of $\beta$. □
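The embedding of Lemma 4.1 is simple enough to verify exhaustively for small $m$ and $n$. The sketch below encodes nodes as integers (most significant bit first), routes every guest edge along the two-step path from the proof, and counts how often each host step is used. The counting convention is an assumption of this example: each host step is counted in the direction it is traversed, and steps that stay at the same node are ignored.

```python
from collections import Counter
from itertools import product


def shift_neighbors(x, bits):
    """Nodes reachable from x by one de Bruijn shift in a `bits`-bit graph."""
    mask = (1 << bits) - 1
    return {((x << 1) & mask) | d for d in (0, 1)}


def alpha(x, m, n):
    """Assignment function: (length-m prefix, length-n suffix) of the (m+n)-bit string x."""
    return (x >> n, x & ((1 << n) - 1))


def route(x, delta, m, n):
    """Two-step route for the guest edge from x to (x shifted left, delta appended)."""
    y = ((x << 1) & ((1 << (m + n)) - 1)) | delta
    src, dst = alpha(x, m, n), alpha(y, m, n)
    mid = (dst[0], src[1])  # move in the first coordinate, then in the second
    return [src, mid, dst]


def max_congestion(m, n):
    """Check that every route is valid and return the largest use count of any host step."""
    usage = Counter()
    for x, delta in product(range(1 << (m + n)), (0, 1)):
        src, mid, dst = route(x, delta, m, n)
        assert mid[0] in shift_neighbors(src[0], m)  # first step is an edge of the D_m factor
        assert dst[1] in shift_neighbors(mid[1], n)  # second step is an edge of the D_n factor
        for a, b in ((src, mid), (mid, dst)):
            if a != b:
                usage[(a, b)] += 1
    return max(usage.values())
```

Because `alpha` merely splits a bit string in two, it is a bijection, which is the load-1 claim; every route has two steps, which is the dilation-2 claim.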
Theorem 4.2. For all $n$ and all $p$ and $q$ with $p + q \le n$, one can embed the de Bruijn
graph $\mathcal{D}_n$ in the PS graph $\mathcal{P}_{p,q}$, with load $2^{n-p-q}$, dilation 4, and edge-congestion
$2^{n-p-q+2}$. This leads to a work-preserving emulation of $\mathcal{D}_n$ by $\mathcal{P}_{p,q}$.
Proof. First, we use Lemma 4.1 to embed $\mathcal{D}_n$ in $\mathcal{P}_{n-p-q,p+q}$, with load 1, dilation
2, and edge-congestion 2.
Next, we use a projection embedding to embed $\mathcal{P}_{n-p-q,p+q}$ in $\mathcal{D}_{p+q}$, with load
$2^{n-p-q}$, dilation 1, and edge-congestion $2^{n-p-q}$. The projection embedding assigns
each node $(x, y)$ of $\mathcal{P}_{n-p-q,p+q}$ to node $y$ of $\mathcal{D}_{p+q}$ and routes edges in the naive
(edge-to-edge) way.
Finally, we use Lemma 4.1 a second time, to embed $\mathcal{D}_{p+q}$ in $\mathcal{P}_{p,q}$, with load 1,
dilation 2, and edge-congestion 2.
Since our cost measures multiply when embeddings are composed, we can invoke
Proposition 2.1 to complete the proof. □
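Written out term by term, and taking the three stage costs exactly as stated in the proof, the products are:

```latex
\begin{align*}
\text{load} &= 1 \cdot 2^{n-p-q} \cdot 1 = 2^{n-p-q},\\
\text{dilation} &= 2 \cdot 1 \cdot 2 = 4,\\
\text{edge-congestion} &= 2 \cdot 2^{n-p-q} \cdot 2 = 2^{n-p-q+2}.
\end{align*}
```

These agree with the three figures claimed in the statement of Theorem 4.2.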
4.2. PS networks emulating butterfly networks 
Lemma 4.3 [3]. For all $n$, one can embed the butterfly graph $\mathcal{B}_n$ in the PS graph
$\mathcal{P}_{\lceil \log n \rceil, n}$, with load 1, dilation 2, and edge-congestion 2.
Proof (Sketch). We sketch the proof from [3], which has recently been put in a much
more general context in [4]. By the pancyclicity of de Bruijn graphs (Lemma 3.1),
it suffices to embed $\mathcal{B}_n$ in the product graph $\mathcal{R}_n \times \mathcal{D}_n$, with unit load, dilation 2,
and edge-congestion 2.
We label the nodes of $\mathcal{B}_n$ with strings from $Z_2^n$ via the following inductive
procedure that is implicit in [3]; cf. Fig. 4.
(1) Label node $\langle 0, \bar{0} \rangle$ of $\mathcal{B}_n$ with the string $\bar{0}$.
(2) If level-$l$ node $v$ ($l \in Z_n$) is labeled with string $L(v)$, then label the straight-
edge (respectively, the cross-edge) neighbor of node $v$ on level $(l + 1) \bmod n$ with the
shuffle (respectively, the shuffle-exchange) of $L(v)$.
Fig. 4. $\mathcal{B}_3$ with the shuffle-oriented labelling of Lemma 4.3; level 0 is replicated to aid visualization.
Now, isolate any two consecutive levels of the labeled $\mathcal{B}_n$, together with the $2^{n+1}$
edges that connect the levels; cf. Fig. 5. Produce the $2^n$-node graph $\mathcal{G}$ from the
isolated levels by identifying like-labeled nodes and eliminating self-loops. Our
labeling procedure guarantees that:
Claim. For any two consecutive levels of $\mathcal{B}_n$, the graph $\mathcal{G}$ is isomorphic to $\mathcal{D}_n$.
The result is now direct: To embed $\mathcal{B}_n$ in $\mathcal{R}_n \times \mathcal{D}_n$:
• Assign level-$l$ node $v$ of $\mathcal{B}_n$ to node $L(v)$ of copy $l$ of $\mathcal{D}_n$, where $L(v) \in Z_2^n$ is
the label assigned to node $v$ by the indicated procedure.
• Route edge $(\langle l, x \rangle, \langle l', y \rangle)$ of $\mathcal{B}_n$ within $\mathcal{R}_n \times \mathcal{D}_n$ via the length-2 path:
$$(l, L(\langle l, x \rangle)) \leftrightarrow (l, L(\langle l', y \rangle)) \leftrightarrow (l', L(\langle l', y \rangle)).$$
Thus, we first route within a copy of $\mathcal{D}_n$ and then between copies.
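The labelling and the Claim can be checked on small butterflies. The sketch below assumes one common butterfly convention, namely that the cross-edge leaving level $l$ flips bit $l$ of the column string (counting from the left), and it assumes the closed form "rotate the column string left by $l$ places" for the inductive labelling; both are assumptions of this example rather than statements from the paper. The code confirms that the closed form obeys the shuffle and shuffle-exchange rules and that every labelled butterfly edge is a de Bruijn shift edge.

```python
def rotl(s, k):
    """Cyclically rotate the string s left by k positions (the k-fold shuffle)."""
    k %= len(s)
    return s[k:] + s[:k]


def flip(bit):
    return "1" if bit == "0" else "0"


def label(level, x):
    """Assumed closed form of the inductive labelling: rotate column string x by `level`."""
    return rotl(x, level)


def butterfly_edges(n):
    """Yield (level, x, next_level, y, kind) for all straight- and cross-edges."""
    for l in range(n):
        for i in range(1 << n):
            x = format(i, f"0{n}b")
            nxt = (l + 1) % n
            yield l, x, nxt, x, "straight"
            # Assumed convention: the cross-edge out of level l flips bit l.
            yield l, x, nxt, x[:l] + flip(x[l]) + x[l + 1:], "cross"


def check_labelling(n):
    """Check the shuffle rules and that labelled edges are de Bruijn shift edges."""
    for l, x, nxt, y, kind in butterfly_edges(n):
        src, dst = label(l, x), label(nxt, y)
        shuffled = rotl(src, 1)
        expected = shuffled if kind == "straight" else shuffled[:-1] + flip(shuffled[-1])
        if dst != expected:
            return False
        if dst[:-1] != src[1:]:  # drop the first bit, append a new last bit
            return False
    return True
```

Since a label and its two successors are exactly a de Bruijn node and its two shift successors, merging like-labelled nodes of two consecutive levels yields the de Bruijn edge set, which is the content of the Claim.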
Fig. 5. Two consecutive levels of $\mathcal{B}_3$ with the shuffle-oriented labelling; "columns" are permuted to
help visualize the identification.
The described embedding clearly has unit load and dilation 2. The edge-congestion
of the embedding is only 2 because each edge connecting levels $l$ and $l'$ in $\mathcal{B}_n$ is
routed first within the level-$l$ copy of $\mathcal{D}_n$ and only then between the level-$l$ and
level-$l'$ copies of $\mathcal{D}_n$; hence, the only edges that engender edge-congestion are pairs
of straight-edges and cross-edges in $\mathcal{B}_n$ that share an endpoint. □
Theorem 4.4. For all $n$ and all $p$ and $q$ with $q \le \min(n, 2^p)$, one can embed the
butterfly graph $\mathcal{B}_n$ in the PS graph $\mathcal{P}_{p,q}$, with load $L = 2^{n-q} \cdot \max(1, 2\lceil n2^{-p} - 1 \rceil)$,
edge-congestion $4L$, and dilation 2. This leads to a work-preserving emulation of
$\mathcal{B}_n$ by $\mathcal{P}_{p,q}$.
Proof. Our embedding proceeds in three stages.
Stage 1. We embed $\mathcal{B}_n$ in a graph $\mathcal{B}_{n,q}$ which is defined implicitly via the surjective
assignment and routing mappings $(\alpha_1, \rho_1)$ defined as follows. For each node
$\langle l, x \rangle$ of $\mathcal{B}_n$,
$$\alpha_1(\langle l, x \rangle) = \langle l, \langle x \rangle_q \rangle$$
(see footnote 14).
The mapping $\rho_1$ routes each edge of $\mathcal{B}_n$ naively, via an edge of $\mathcal{B}_{n,q}$. The embedding
thus defined has load $L_1 = 2^{n-q}$, edge-congestion $2L_1$, and unit dilation. In
order to verify these costs and to see what the next stage of our composite embedding
needs to accomplish, consider what the graph $\mathcal{B}_{n,q}$ looks like. From one
vantage point, one obtains $\mathcal{B}_{n,q}$ from $\mathcal{B}_n$ by removing all cross-edges except those
having both endpoints in levels $0, 1, \ldots, q$; from another vantage point, one constructs
$\mathcal{B}_{n,q}$ from the nodes of $\mathcal{B}_n$ by inserting edges that:
• make the induced subgraph of $\mathcal{B}_{n,q}$ on levels $0, 1, \ldots, q$ (isomorphic to) the
graph obtained from $\mathcal{B}_{q+1}$ by removing the edges connecting level $q$ to level 0;
• make the induced subgraph of $\mathcal{B}_{n,q}$ on levels $q, q + 1, \ldots, n - 1, 0$ (isomorphic
to) $2^q$ node-disjoint length-$(n - q)$ paths, each having the form
$$\langle q, x \rangle \leftrightarrow \langle q + 1, x \rangle \leftrightarrow \cdots \leftrightarrow \langle n - 1, x \rangle \leftrightarrow \langle 0, x \rangle$$
for some length-$q$ binary string $x \in Z_2^q$.
The only subtlety involving the costs of this embedding is that the edges within levels
$q, q + 1, \ldots, n - 1, 0$ of $\mathcal{B}_{n,q}$ are congested twice as much as the edges within levels
$0, 1, \ldots, q$, because within the higher-numbered levels, cross-edges share routing
paths with straight-edges.
Stage 2. Now we compare the values of $n$ and $2^p$. If $n \le 2^p$, then we do nothing.
If $n > 2^p$, then we embed $\mathcal{B}_{n,q}$ in $\mathcal{B}_{2^p,q}$, as follows.
(1) "Fold" the length-$(n - 2^p)$ "dangling paths" that start at level $2^p$ of $\mathcal{B}_{n,q}$,
and proceed to level 0, into the top $2^p$ levels of the graph.
(2) Eliminate all levels of $\mathcal{B}_{n,q}$ from level $2^p + 1$ through level $n - 1$.
(3) Identify levels 0 and $2^p$ of the resulting "pruned" graph.
(Footnote 14: Recall that $\langle x \rangle_q$ is the length-$q$ prefix of the string $x$.)
The "folding" of "dangling paths" is accomplished via the assignment function $\alpha_2$
defined by:
$$\alpha_2(\langle 2^p + k, x \rangle) = \alpha_2(\langle n - k - 1, x \rangle) = \langle k \bmod 2^p, x \rangle$$
for all $x \in Z_2^q$ and all
$$0 \le k \le \frac{n - 2^p - ((n - 2^p) \bmod 2)}{2}.$$
Once again, we employ the naive (edge-to-edge) routing to complete the specification
of Stage 2 of our embedding. One verifies easily that this embedding has load
$L_2 = \max(1, 2\lceil n2^{-p} - 1 \rceil)$, edge-congestion $2L_2$, and unit dilation.
Stage 3. Finally, we embed the host graph $\mathcal{B}_{m,q}$ from Stage 2, where $m =
\min(n, 2^p)$, into $\mathcal{P}_{p,q}$. This embedding can be specified indirectly, by invoking
the proof (rather than statement) of Lemma 4.3. In that proof, $\mathcal{B}_n$ is embedded
in $\mathcal{R}_n \times \mathcal{D}_n$ with unit load and with dilation and edge-congestion 2. Precisely the
same reasoning embeds $\mathcal{B}_{m,q}$ in $\mathcal{R}_m \times \mathcal{D}_q$ with the same costs. Our embedding
of $\mathcal{B}_{m,q}$ in $\mathcal{P}_{p,q}$ is completed by noting that $\mathcal{R}_m \times \mathcal{D}_q$ is a subgraph of $\mathcal{P}_{p,q}$
(Lemma 3.1).
Since our cost measures multiply when embeddings are composed, we can invoke
Proposition 2.1 to complete the proof. □
4.3. The converse emulations
In the framework of our strong notion of emulation, PS graphs are strictly more 
powerful than either butterfly or de Bruijn graphs, in the sense of the following 
result. Note how much stronger the result is for butterfly graphs than for de Bruijn 
graphs, both in terms of quantification and dilation. 
Theorem 4.5. (a) For all $m, n$, any embedding of the PS graph $\mathcal{P}_{m,n}$ in any butterfly
graph must have dilation $\Omega(\min(m, n))$.
(b) For all $m, n$, any embedding of the PS graph $\mathcal{P}_{m,n}$ in the de Bruijn graph
$\mathcal{D}_{m+n}$ must have dilation $\Omega(\log \min(m, n))$.
Proof. Let $M = 2^{\min(m,n)}$. By Corollary 3.4, $\mathcal{P}_{m,n}$ contains the $M \times M$ mesh $\mathcal{M}_{M,M}$ as
a subgraph. It is proved in [5] that any embedding of $\mathcal{M}_{M,M}$ in any butterfly graph
must have dilation $\Omega(\log M)$. It is also proved there that any embedding of $\mathcal{M}_{M,M}$
in a like-sized de Bruijn graph must have dilation $\Omega(\log \log M)$. □
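The step from the mesh bounds of [5] to the bounds stated in the theorem is a substitution of $M = 2^{\min(m,n)}$, with logarithms taken to base 2:

```latex
\begin{align*}
M = 2^{\min(m,n)}
  &\implies \log M = \min(m,n)
  &&\implies \Omega(\log M) = \Omega(\min(m,n)),\\
  &\implies \log\log M = \log\min(m,n)
  &&\implies \Omega(\log\log M) = \Omega(\log\min(m,n)).
\end{align*}
```

The first line gives part (a) and the second gives part (b).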
The lower bounds of Theorem 4.5 grow faster than any constant, thus justifying
our assertion about the power of PS graphs; however, each of these lower bounds
is smaller than the best-known corresponding upper bound. We do not know at this
point whether to believe that the upper bounds can be lowered or that the lower
bounds can be raised.
4.4. Area-efficient VLSI layouts of the networks 
The additional power of PS graphs over both butterfly and de Bruijn graphs 
comes at modest cost. First, and obviously, PS graphs are 8-valent while their com- 
petitors are 4-valent. Less obviously, PS graphs admit VLSI layouts which are only 
modestly more consumptive of area than the most efficient layouts of either of the 
other two graphs. We refer the reader to [7,23] for background on the formal 
framework and techniques of analysis for VLSI layouts. 
We begin with the layout requirements of de Bruijn and butterfly graphs. 
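Theorem 4.6. (a) [13] The de Bruijn graph $\mathcal{D}_{m+n}$ admits a VLSI layout in a "box"
of dimensions
$$O\!\left(\frac{2^{m+n}}{m+n}\right) \times O\!\left(\frac{2^{m+n}}{m+n}\right).$$
(b) [23] Any VLSI layout of the de Bruijn graph $\mathcal{D}_{m+n}$ consumes area
$$\Omega\!\left(\frac{4^{m+n}}{(m+n)^2}\right).$$
(c) [23] The area-minimal VLSI layouts of the butterfly graph $\mathcal{B}_n$, which has
$N = n2^n$ nodes, consume area
$$\Theta\!\left(\frac{N^2}{\log^2 N}\right) = \Theta(4^n).$$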
Now we turn to the layout area of PS graphs. 
Theorem 4.7. The PS graph $\mathcal{P}_{m,n}$ admits a VLSI layout of area
$$O\!\left(\frac{4^{m+n}}{mn}\right).$$
Proof. We use the layouts guaranteed by Theorem 4.6(a) and by the following
lemma to obtain a VLSI layout of $\mathcal{P}_{m,n}$ with the advertised area.
Lemma 4.8 [7]. The de Bruijn graph $\mathcal{D}_n$ admits a collinear VLSI layout in a "box"
of dimensions
$$O\!\left(\frac{2^n}{n}\right) \times O(2^n),$$
i.e., a layout in which all nodes are laid out in a line.
Assume, with no loss of generality, that $m \le n$. Stack $2^m$ copies of the area-
efficient layout of $\mathcal{D}_n$ from Theorem 4.6(a), aligned so that, for each node $v$ of
$\mathcal{D}_n$, all copies of $v$ are lined up in the same vertical track. For each node $v$ of $\mathcal{D}_n$,
allocate $2^m/m$ new vertical tracks, and use these tracks to route a collinear layout
of $\mathcal{D}_m$ via the layout of Lemma 4.8, using the copies of $v$ as nodes. Easily supplied
details turn this schematic description into a layout of $\mathcal{P}_{m,n}$ whose area satisfies
the bound of the theorem. □
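A rough accounting of the area, offered as a sketch of the "easily supplied details" rather than as the author's own calculation. It assumes that the layout of the $n$-dimensional de Bruijn graph taken from Theorem 4.6(a) fits in a box with both sides $O(2^n/n)$, and that the collinear layout of Lemma 4.8 needs $O(2^m/m)$ tracks alongside its line of $2^m$ nodes:

```latex
\begin{align*}
\text{height} &= 2^m \cdot O\!\left(\frac{2^n}{n}\right)
               = O\!\left(\frac{2^{m+n}}{n}\right),\\
\text{width}  &= O\!\left(\frac{2^n}{n}\right) + 2^n \cdot O\!\left(\frac{2^m}{m}\right)
               = O\!\left(\frac{2^{m+n}}{m}\right),\\
\text{area}   &= \text{height} \times \text{width}
               = O\!\left(\frac{4^{m+n}}{mn}\right).
\end{align*}
```

The height comes from the $2^m$ stacked copies, and the width from the original layout plus the new vertical tracks allocated for each of the $2^n$ nodes.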
5. Concluding remarks 
Permutation routing. Proposition 2.3 indicates that PS networks can match the
efficiency of de Bruijn networks and exceed the efficiency of butterfly networks on
single point-to-point communications and on single-source broadcasts. Recent work
[1] shows PS networks to be competitive with the other two networks also on
(deterministic, off-line) permutation routing.
Proposition 5.1 [1]. There is a deterministic algorithm that routes any permutation
on the PS network $\mathcal{P}_{m,n}$ in time
$$\le 2(m + n) + 2\min(m, n) - 3.$$
Further challenges. Among the unresolved problems in the study of hypercube- 
derivative networks, the most inviting seek definitive answers to the questions of 
how efficiently the hypercube and its derivatives (including PS graphs) can emulate 
one another. Although certain of these questions have been resolved within the more 
comprehensive framework of [12], there are practical, as well as intellectual, reasons 
to determine whether or not our simpler framework yields the same answers to these 
questions. Even after all of these individual questions have been answered, it will 
be an interesting challenge to adduce underlying principles that explain the answers 
(along the lines of the algebraic development in [3]). 
Acknowledgement 
It is a pleasure to thank Fred Annexstein, Marc Baumslag, Lenny Heath, Tom
Leighton, Seth Malitz, Bojana Obrenić, and the anonymous referees for helpful
comments and suggestions.
References 
[1] F.S. Annexstein and M. Baumslag, A unified approach to global permutation routing on parallel
networks, Math. Systems Theory 24 (1991) 233-251.
[2] F.S. Annexstein, M. Baumslag, M.C. Herbordt, B. Obrenić, A.L. Rosenberg and C.C. Weems,
Achieving multigauge behavior in bit-serial SIMD architectures via emulation, in: 3rd IEEE
Symposium on Frontiers of Massively Parallel Computation (1990) 186-195.
[3] F.S. Annexstein, M. Baumslag and A.L. Rosenberg, Group action graphs and parallel architectures,
SIAM J. Comput. 19 (1990) 544-569.
[4] M. Baumslag and A.L. Rosenberg, Processor-time tradeoffs for Cayley graph interconnection
networks, in: 6th Distributed Memory Computing Conference (1991) 630-636.
[5] S.N. Bhatt, F.R.K. Chung, J.-W. Hong, F.T. Leighton, B. Obrenić, A.L. Rosenberg and E.J.
Schwabe, Optimal emulations by butterfly-like networks, J. ACM, to appear.
[6] S.N. Bhatt, F.R.K. Chung, F.T. Leighton and A.L. Rosenberg, Efficient embeddings of trees in
hypercubes, SIAM J. Comput., to appear.
[7] S.N. Bhatt and F.T. Leighton, A framework for solving VLSI graph layout problems, J. Comput.
System Sci. 28 (1984) 300-343.
[8] R.M. Chamberlain, Gray codes, Fast Fourier Transforms and hypercubes, Parallel Comput. 6
(1988) 225-233.
[9] N.G. de Bruijn, A combinatorial problem, Nederl. Akad. Wetensch. Proc. Ser. A 49 (1946)
758-764.
[10] R. Feldmann and W. Unger, The cube-connected cycles network is a subgraph of the butterfly
network, University of Paderborn (1990).
[11] D.S. Greenberg, L.S. Heath and A.L. Rosenberg, Optimal embeddings of butterfly-like graphs in
the hypercube, Math. Systems Theory 23 (1990) 61-77.
[12] R. Koch, F.T. Leighton, B. Maggs, S. Rao and A.L. Rosenberg, Work-preserving emulations of
fixed-connection networks, in: 21st ACM Symposium on Theory of Computing (1989) 227-240.
[13] F.T. Leighton, Complexity Issues in VLSI: Optimal Layouts for the Shuffle-Exchange Graph and
Other Networks (MIT Press, Cambridge, MA, 1983).
[14] F.T. Leighton, Parallel computation using meshes of trees, in: 1983 Workshop on Graph-Theoretic
Concepts in Computer Science (Trauner, Linz, 1984) 200-218.
[15] F.T. Leighton, B. Maggs and S. Rao, Universal packet routing algorithms, in: 29th IEEE
Symposium on Foundations of Computer Science (1988) 256-269.
[16] F.P. Preparata and J.E. Vuillemin, The cube-connected cycles: a versatile network for parallel
computation, Comm. ACM 24 (1981) 300-309.
[17] A.G. Ranade, How to emulate shared memory, in: 28th IEEE Symposium on Foundations of
Computer Science (1987) 185-194.
[18] J.H. Reif and L.G. Valiant, A logarithmic time sort for linear graphs, J. ACM 34 (1987) 60-76.
[19] Y. Saad and M.H. Schultz, Topological properties of hypercubes, IEEE Trans. Comput. 37 (1988)
867-872.
[20] M.R. Samatham and D.K. Pradhan, The de Bruijn multiprocessor network: A versatile parallel
processing and sorting network for VLSI, IEEE Trans. Comput. 38 (1989) 567-581.
[21] C. Stanfill, Communications architecture in the Connection Machine system, Tech. Rept. HA87-3,
Thinking Machines Corporation (1987).
[22] H. Stone, Parallel processing with the perfect shuffle, IEEE Trans. Comput. 20 (1971) 153-161.
[23] C.D. Thompson, A complexity theory for VLSI, Ph.D. Thesis, CMU (1980).
[24] M. Yoeli, Binary ring sequences, Amer. Math. Monthly 69 (1962) 852-855.
