Rosenberg, A.L., Product-shuffle networks: toward reconciling shuffles and butterflies, Discrete Applied Mathematics 37/38 (1992) 465-488.
Goals of the study
The Boolean hypercube and its bounded-degree derivatives, such as the butterfly-oriented butterfly and cube-connected cycles (CCC) networks and the shuffle-oriented [...] shuffle-oriented networks are more powerful than butterfly-oriented ones) arises from studies in [3, 11] of emulations of one interconnection network by another. These studies use the same strong notion of emulation as we do in this paper.
(1) (a) Every even-order (respectively, odd-order) butterfly and CCC network can be emulated with no time loss (respectively, with at most a factor of 2 time loss) by the smallest hypercube that is big enough to hold it [11]; (b) every known emulation of an $N$-node shuffle-oriented network by a hypercube incurs slowdown $\Omega(\log N)$ (which is pessimal).
(2) (a) An $N$-node shuffle-oriented network can emulate a like-size butterfly-oriented network with slowdown $O(\log \log N)$ [3]; (b) every known emulation of an $N$-node shuffle-oriented network by a like-size butterfly-oriented network incurs slowdown $\Omega(\log N)$ (which is pessimal).
Blurring the apparent algorithmic distinctions between the two families of networks is the recent work in [12], which presents a computational framework wherein either of the butterfly or de Bruijn networks can emulate the other efficiently. In particular, these emulations yield an algorithm for sorting on the de Bruijn network that rivals the efficiency of the Reif-Valiant [18] algorithm for butterfly networks.
The present study is motivated in part by the unresolved questions implicit in the preceding paragraph: Which, if either, of the butterfly- and shuffle-oriented networks is the more powerful? Under what circumstances are computationally more complicated emulations of the type introduced in [12] to be preferred to the purely structure-oriented emulations of [3, 11], and vice versa?
Remark. The major argument in favor of the structure-oriented framework is that its emulations yield algorithms that translate programs for the emulated network to equivalent programs for the emulating network. The major argument in favor of the computationally more complicated framework is that it significantly expands the class of networks that a given network can emulate efficiently.
Further motivating our study is the fact that the constructions in both [12] and [3] suggest that efficient emulations of either of these network families by the other are likely to be rather sophisticated and complicated. There would be value, therefore, in having a family of networks which retains most of the structural simplicity of both butterfly- and shuffle-oriented networks, but which can emulate both families of networks simply (the emulation procedure should be simple to specify) and efficiently (the host architecture should be able to emulate any of the guest architectures in a work-preserving manner, i.e., so that the processor-time product is preserved).⁴ In this paper, we study such a "least upper bound" family, the product-shuffle (PS) networks; each PS network is a direct product of two de Bruijn networks.
The goals presented thus far can be satisfied by a variety of interconnection networks; indeed, the simplest such network would just superpose a butterfly-oriented network on a like-sized shuffle-oriented one. (Formally, this would amount to taking the union of the edge-sets of the two networks.) Why, then, should we bother with PS networks, which introduce the additional complication of the direct-product structure? The answer lies in our willingness to suffer a (very) modest amount of structural complication in return for a (very) considerable amount of algorithmic simplicity and efficiency. The following example should clarify our concerns. It is shown in [5] that neither butterfly- nor shuffle-oriented networks can emulate meshes (and a variety of other planar networks) efficiently, using the simple notion of emulation that we study here. This is a quite serious deficiency, given the importance of mesh-oriented parallel algorithms.⁵ PS networks overcome this deficiency, in that every "reasonably shaped" PS network can emulate moderate-size meshes with no slowdown, since it contains the meshes as subnetworks. Using the stronger notion of emulation in [12], one can emulate meshes on butterfly- or shuffle-oriented networks with only constant-factor slowdown; but the constants are nontrivial, and the emulation algorithm is rather complicated. There are further beneficial consequences of the direct-product structure of PS networks:
• PS networks contain moderate-size copies of the computationally important mesh-of-trees network (Theorem 3.7).
• PS networks admit simple and efficient load-balanced⁶ work-preserving emulations of both butterfly- and shuffle-oriented networks (Theorems 4.2 and 4.4).
• PS networks admit compact VLSI layouts (Theorem 4.7).
Additionally, the direct-product structure of PS networks guarantees efficient deterministic off-line permutation routing within the network [1]. Finally, the product structure allows PS networks to be used as an ingredient in achieving algorithmic enhancements to parallel architectures that would be prohibitively expensive to achieve in hardware [2]. It is not clear that any "superposed", composite butterfly-plus-shuffle network could match all of the benefits that ensue from the direct-product structure of PS networks.
The remainder of the paper is organized into three sections. In Section 2, we formalize the topics of our study and present some simple basic results. Section 3 verifies the large family of subnetworks of PS networks that we have been alluding to. In Section 4, we compare PS networks with butterfly- and shuffle-oriented networks, demonstrating efficient work-preserving emulations of the latter two families by the former family, and indicating the impossibility of efficient converse emulations. These results show that, within our framework of highly structured emulations (as opposed to the more general framework of [12]), PS networks are strictly more powerful than either butterfly- or shuffle-oriented networks, by more than any constant factor. Moreover, the added power comes at only moderate cost, in that PS networks are 8-valent (in contrast to the 4-valence of butterfly- and shuffle-oriented networks), and PS networks admit VLSI layouts that are only modestly bigger than the best layouts of like-sized butterfly- or shuffle-oriented networks.
⁵ The parallel algorithm literature abounds with linear-algebraic and numerical algorithms that conform naturally to the structure of a mesh.
⁶ We say that the load is balanced in an emulation when each processor of the emulating architecture emulates the same number of processors of the emulated architecture.
We have thus far used the term "shuffle-oriented networks" to refer ambiguously to the de Bruijn and shuffle-exchange networks, and the term "butterfly-oriented network" to refer ambiguously to the butterfly and CCC networks. This ambiguity is justified by the fact that the de Bruijn and shuffle-exchange networks can each emulate the other with only a factor-of-2 slowdown, and the same is true of the butterfly and CCC networks. We focus on the de Bruijn and butterfly networks in the sections that follow, since they yield smaller constant factors in our emulations, and they have richer families of subnetworks.
The formal framework

Interconnection networks as graphs
As is customary in structural studies of parallel architectures, we restrict attention to arrays of identical processing elements (PEs), and we view the architectures and their underlying interconnection networks as undirected graphs.
A directed graph $\mathcal{G}$ is specified by a set $V_{\mathcal{G}}$ of nodes and a multisubset $A_{\mathcal{G}} \subseteq V_{\mathcal{G}} \times V_{\mathcal{G}}$ of elements called arcs. One obtains an undirected graph $\mathcal{G}'$ from a directed graph $\mathcal{G}$ by "symmetrizing" the set of arcs: one replaces each arc of $\mathcal{G}$ with a pair of mated arcs having opposing directions. We refer to each mated pair of arcs as an edge of the graph $\mathcal{G}'$.⁷ The nodes of the graph represent the PEs of the architecture, and the edges of the graph represent the inter-PE communication links. We henceforth use the term "graph" instead of "network".
Notation and terminology
• In phrases like "for all $n$", $n$ always ranges over the positive integers. For all $n$, $Z_n \stackrel{\text{def}}{=} \{0, 1, \ldots, n-1\}$, and $\lambda(n) \stackrel{\text{def}}{=} \lceil \log n \rceil$. (All logarithms are to the base 2.)
• For any set $S$ and nonnegative integer $k$: $|S|$ denotes the cardinality of $S$; $S^k$ denotes the set of all length-$k$ strings of elements of $S$; $|x|$ denotes the length ($= k$) of each $x \in S^k$.
• Given graphs $\mathcal{G} = (V_{\mathcal{G}}, E_{\mathcal{G}})$ and $\mathcal{H} = (V_{\mathcal{H}}, E_{\mathcal{H}})$, the (direct) product graph $\mathcal{G} \times \mathcal{H}$ has node-set $V_{\mathcal{G}} \times V_{\mathcal{H}}$. Let $u$ and $v$ be nodes of $\mathcal{G}$, and let $x$ and $y$ be nodes of $\mathcal{H}$. Then $((u, x), (v, y))$ is an edge of $\mathcal{G} \times \mathcal{H}$ just when either $(u, v)$ is an edge of $\mathcal{G}$ and $x = y$, or $(x, y)$ is an edge of $\mathcal{H}$ and $u = v$.
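The product rule above can be sketched in a few lines of Python. This is only an illustration: the function name `product_graph` and the edge-list representation are assumptions made here, and the sketch treats graphs as simple (it does not track the parallel edges or self-loops that the paper allows).

```python
from itertools import product

def product_graph(nodes_g, edges_g, nodes_h, edges_h):
    """Direct product as defined in the text: an edge moves in exactly one coordinate."""
    nodes = set(product(nodes_g, nodes_h))
    edges = set()
    # (u, v) is an edge of G and x = y
    for (u, v) in edges_g:
        for x in nodes_h:
            edges.add(frozenset({(u, x), (v, x)}))
    # (x, y) is an edge of H and u = v
    for (x, y) in edges_h:
        for u in nodes_g:
            edges.add(frozenset({(u, x), (u, y)}))
    return nodes, edges

# Example: a 3-cycle times a single edge gives a triangular prism.
tri_nodes, tri_edges = [0, 1, 2], [(0, 1), (1, 2), (2, 0)]
k2_nodes, k2_edges = ["a", "b"], [("a", "b")]
prism_nodes, prism_edges = product_graph(tri_nodes, tri_edges, k2_nodes, k2_edges)
```

The prism has the 6 nodes and 9 edges one expects: two copies of the triangle plus three "rungs".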
⁷ Note that we allow self-loops and parallel edges.
The graphs of interest
Let $m, n$ be positive integers.
In both the de Bruijn graph $\mathcal{D}_n$ and the shuffle-exchange graph $\mathcal{S}_n$, each node of the form $\beta x$ is connected via a shuffle arc to node $x\beta$. Additionally:
- in $\mathcal{D}_n$, each node of the form $\beta x$ is connected via a shuffle-exchange arc to node $x\bar{\beta}$;
- in $\mathcal{S}_n$, each node of the form $x\beta$ is connected via an exchange arc to node $x\bar{\beta}$.
Remark. For brevity, we study only the base-2 versions of our graph families, by dint of our using $Z_2$ as the underlying alphabet of the graphs' node-sets. One can easily define arbitrary base-$d$ versions of the graphs⁹ and extend our results with only clerical changes. In [3], we deal with the general, base-$d$ versions of our graph families.
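As a concrete reading of the arc rules above, the following Python sketch lists the arcs of the base-2 de Bruijn and shuffle-exchange graphs. It assumes that nodes are the length-n binary strings and that the bar denotes bit complement; the helper names are hypothetical and chosen only for this illustration.

```python
from itertools import product

def binary_strings(n):
    """All length-n binary strings, used here as node names."""
    return ["".join(bits) for bits in product("01", repeat=n)]

def flip(bit):
    """Complement of a single bit character."""
    return "1" if bit == "0" else "0"

def de_bruijn_arcs(n):
    """Arcs of the order-n de Bruijn graph: shuffle and shuffle-exchange arcs."""
    arcs = []
    for w in binary_strings(n):
        beta, x = w[0], w[1:]
        arcs.append((w, x + beta))        # shuffle arc
        arcs.append((w, x + flip(beta)))  # shuffle-exchange arc
    return arcs

def shuffle_exchange_arcs(n):
    """Arcs of the order-n shuffle-exchange graph: shuffle and exchange arcs."""
    arcs = []
    for w in binary_strings(n):
        arcs.append((w, w[1:] + w[0]))          # shuffle arc
        arcs.append((w, w[:-1] + flip(w[-1])))  # exchange arc
    return arcs
```

For example, in the order-3 de Bruijn graph node `100` has arcs to `001` (shuffle) and `000` (shuffle-exchange), while in the shuffle-exchange graph it has arcs to `001` (shuffle) and `101` (exchange).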
Emulation via graph embeddings
In defining the emulation of one architecture $\mathcal{G}$ by another architecture $\mathcal{H}$, we assume that the PEs of $\mathcal{H}$ are sufficiently powerful to emulate the PEs of $\mathcal{G}$ step for step, so no delay is incurred because of computational steps. We restrict attention to emulations that honor a pulsed computation regimen: Architecture $\mathcal{H}$ alternates phases that emulate one computation step of architecture $\mathcal{G}$ with phases that emulate one communication step of architecture $\mathcal{G}$.¹⁰ The slowdown incurred by an emulation arises from two sources. First, we allow emulations that require one PE of $\mathcal{H}$ to play the role of several PEs of $\mathcal{G}$; second, architecture $\mathcal{H}$ must emulate on its interconnection graph communication steps that are tailored to the (possibly very different) structure of the interconnection graph underlying architecture $\mathcal{G}$. The second type of delay results both from mismatched adjacency structures and from congested communication lines. Our study of emulations is based on the following notion of graph embeddings and their costs.
⁹ For illustration, the base-$d$ order-$n$ de Bruijn graph has node-set $Z_d^n$ and edges connecting each node of the form $\beta x$, where $\beta \in Z_d$ and $x \in Z_d^{n-1}$, to all nodes of the form $x\gamma \in Z_d^n$.
¹⁰ This regimen of having $\mathcal{H}$ mimic the exact form of the computation by $\mathcal{G}$ motivates our using the term "emulate" rather than "simulate".
Graph embeddings
An embedding of the graph $\mathcal{G}$ in the graph $\mathcal{H}$ is specified by:
Slowdown incurred by an emulation
A number of factors induce slowdown when architecture $\mathcal{H}$ emulates architecture $\mathcal{G}$. We account for these factors very conservatively, by assuming that in each step of an emulated computation, every PE of $\mathcal{G}$ performs a computation step followed by a communication with all of its neighbors. Clearly, most algorithms will not exercise the resources of $\mathcal{G}$ so exhaustively; hence, they might well be emulated by $\mathcal{H}$ with less slowdown than our accounting procedures indicate. Say that we have an embedding [...] By abuse of notation, we write "$(z_i, z_{i+1}) \in \pi$".
[I/O-expansion incurs slowdown because at each computation step, $\mathcal{G}$ needs poll only valence($\mathcal{G}$) I/O ports, while $\mathcal{H}$ must poll valence($\mathcal{H}$) ports.]
• The edge-congestion of the embedding is the maximum number of edges of $\mathcal{G}$ that $\rho$ routes over a single edge of $\mathcal{H}$:
$$\text{edge-congestion}(\alpha, \rho) = \max_{e \in E_{\mathcal{H}}} \left|\{e' \in E_{\mathcal{G}} : e \in \rho(e')\}\right|.$$
[Edge-congestion incurs slowdown because the messages that want to cross a congested edge must be queued up. (For simplicity, we are giving each edge of $\mathcal{H}$ the same capacity as a single edge of $\mathcal{G}$.)]
The slowdown due to load and I/O-expansion seems to be unavoidable. In contrast, one can avoid the slowdown due to edge-congestion by increasing the bandwidth of $\mathcal{H}$'s communication links, at the cost of increased hardware and increased layout area. Another avenue for mitigating the effects of edge-congestion is to orchestrate the communication phases of $\mathcal{H}$, so that message traffic is spread uniformly along the paths of $\mathcal{H}$ that are used to emulate the links of $\mathcal{G}$; this ploy, which allows one to amortize edge-congestion over the paths that create dilation, is used to decrease the slowdown of the emulations in [3]. Another form of orchestrating the communication phases of emulations leads to the following result, which guarantees that load, dilation, and edge-congestion can always be made to combine additively, rather than multiplicatively (as a naive analysis would suggest). The main analysis leading to this result appears in [15]; its extension to the current framework appears in [12].
Aside from assuring us that proper scheduling can make the costs of an emulation combine additively rather than multiplicatively, Proposition 2.1 also demonstrates that our purely graph-theoretic formalism does, indeed, model the algorithmic situation that we want it to. Proposition 2.1 points out the importance of balancing the load of an emulation (cf. footnote 6), i.e., of keeping the quantity bounded by a constant. Every emulation we present here has balanced load.
Since I/O-expansion is a property only of the structures of $\mathcal{G}$ and $\mathcal{H}$, and not of any embedding of $\mathcal{G}$ in $\mathcal{H}$, we do not discuss it further.
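To make the cost measures concrete, here is a small Python sketch that computes load, dilation, and edge-congestion for an explicitly given embedding. It assumes the usual readings of the first two measures (load is the largest number of guest nodes assigned to one host node, dilation is the length of the longest routing path); the dictionary representation and the name `embedding_costs` are choices made for this illustration only.

```python
from collections import Counter

def embedding_costs(alpha, rho):
    """alpha: guest node -> host node.
    rho: guest edge -> routing path, given as a list of host nodes."""
    load = max(Counter(alpha.values()).values())
    dilation = max(len(path) - 1 for path in rho.values())
    usage = Counter()
    for path in rho.values():
        # count each host edge once per guest edge routed over it
        for a, b in zip(path, path[1:]):
            usage[frozenset((a, b))] += 1
    congestion = max(usage.values()) if usage else 0
    return load, dilation, congestion

# Guest: the 4-cycle 0-1-2-3-0.  Host: the 3-node path A-B-C.
alpha = {0: "A", 1: "B", 2: "C", 3: "B"}
rho = {
    (0, 1): ["A", "B"],
    (1, 2): ["B", "C"],
    (2, 3): ["C", "B"],
    (3, 0): ["B", "A"],
}
costs = embedding_costs(alpha, rho)
```

In this toy case host node B holds two guest nodes, every guest edge is routed over one host edge, and each host edge carries two guest edges, so the costs are load 2, dilation 1, and edge-congestion 2.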
Obviously, one strives to make emulations as "efficient" as possible. One can argue (cf. [12]) that one type of efficiency that is very desirable in emulations is work preservation: When the $N$-PE architecture $\mathcal{G}$ operates for $T$ steps, it can perform $NT$ atomic operations. If the ($M \le N$)-PE architecture $\mathcal{H}$ emulates these $T$ steps of $\mathcal{G}$, it requires at least $\lceil N/M \rceil T$ steps to perform the same amount of work. Allowing a (hopefully small) constant factor leeway as overhead for the emulation, we say that the emulation of $\mathcal{G}$ by $\mathcal{H}$ is work preserving if $\mathcal{H}$ can emulate any $T$ steps of $\mathcal{G}$ in at most $O(\lceil N/M \rceil)T$ steps; cf. footnote 4. All of the emulations we present in this paper are work preserving.
Two quasi-isometries
We are finally in a position to formalize our discussion at the end of Section 1, concerning the "equivalence" of the de Bruijn and shuffle-exchange graphs, on the one hand, and the butterfly and CCC graphs, on the other hand. [...] $\langle \ell, x \rangle \leftrightarrow \langle \ell, x' \rangle \leftrightarrow \langle (\ell + 1) \bmod n, x' \rangle$.
For the embedding of part (b)(ii), our assignment branches on the weight of the string $x$ in node $\langle \ell, x \rangle$ of $\mathcal{C}_n$, i.e., the number of 1s in $x$. If $x$ has even weight, then node $\langle \ell, x \rangle$ of $\mathcal{C}_n$ is assigned to node $\langle \ell, x \rangle$ of $\mathcal{B}_n$; if $x$ has odd weight, then node $\langle \ell, x \rangle$ of $\mathcal{C}_n$ is assigned to node $\langle (\ell + 1) \bmod n, x \rangle$ of $\mathcal{B}_n$. Details are left to the reader. □
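The weight-based assignment just described is easy to state in code. The sketch below assumes that nodes of both graphs are pairs (level, string) with level in {0, ..., n-1} and a length-n binary string; the function names are hypothetical. It also checks that the assignment is one-to-one, which is what unit load requires.

```python
from itertools import product

def ccc_to_butterfly(level, x, n):
    """Assign CCC node (level, x) to a butterfly node by the parity of the weight of x."""
    weight = x.count("1")
    if weight % 2 == 0:
        return (level, x)
    return ((level + 1) % n, x)

def is_one_to_one(n):
    """Check that no two CCC nodes are assigned to the same butterfly node."""
    nodes = [(l, "".join(b)) for l in range(n) for b in product("01", repeat=n)]
    images = {ccc_to_butterfly(l, x, n) for (l, x) in nodes}
    return len(images) == len(nodes)
```

For a fixed string x the assignment is either the identity on levels or a rotation of the levels by one, which is why it never collides.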
Structural characteristics of the graphs of interest
The diameter (maximum inter-node distance) of a graph $\mathcal{H}$ bounds above both the dilation of any embedding into $\mathcal{H}$ and the time required for any single-node broadcast in $\mathcal{H}$. Therefore, the following table places our emulation results in perspective and provides an interesting comparison of $\mathcal{PS}_{m,n}$ with its "competitors". One noteworthy point is that PS graphs share diameter (exactly) $\log_2 N$ with de Bruijn graphs and hypercubes, although de Bruijn graphs acquire their small diameter with valence 4, while PS graphs have valence 8 and hypercubes have valence $\log_2 N$.
One can proceed from any node $(x, y)$ of $\mathcal{PS}_{m,n}$ to any other node $(x', y')$ by
(1) proceeding from node $(x, y)$ to node $(x', y)$ in at most $|x| = m$ steps by mimicking the way one would proceed from node $x$ to node $x'$ in $\mathcal{D}_m$;
(2) proceeding from node $(x', y)$ to node $(x', y')$ in at most $|y| = n$ steps by mimicking the way one would proceed from node $y$ to node $y'$ in $\mathcal{D}_n$. □
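The two-phase routing argument can be checked by brute force on small instances. The following Python sketch computes the diameter of a PS graph by breadth-first search, assuming the undirected base-2 de Bruijn graph on binary strings as each factor; all function names are illustrative assumptions rather than notation from the paper.

```python
from collections import deque
from itertools import product

def de_bruijn_neighbors(w):
    """Undirected neighbors of w in the binary de Bruijn graph (self-loops dropped)."""
    successors = {w[1:] + b for b in "01"}
    predecessors = {b + w[:-1] for b in "01"}
    return (successors | predecessors) - {w}

def ps_neighbors(node):
    """Neighbors in the product: change exactly one coordinate along a de Bruijn edge."""
    x, y = node
    return ([(x2, y) for x2 in de_bruijn_neighbors(x)]
            + [(x, y2) for y2 in de_bruijn_neighbors(y)])

def eccentricity(start):
    """Largest breadth-first distance from start."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in ps_neighbors(u):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())

def ps_diameter(m, n):
    """Diameter of the product of the order-m and order-n de Bruijn graphs."""
    xs = ["".join(b) for b in product("01", repeat=m)]
    ys = ["".join(b) for b in product("01", repeat=n)]
    return max(eccentricity((x, y)) for x in xs for y in ys)
```

On small cases the search agrees with the bound m + n, which equals the base-2 logarithm of the number of nodes.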
Computationally important subgraphs of $\mathcal{PS}_{m,n}$
PS graphs contain a variety of computationally useful graphs as subgraphs, i.e., as graphs that can be embedded with unit load, edge-congestion, and dilation; an architecture based on PS graphs can emulate an architecture based on any of these subgraphs with no slowdown.
Cycles
The $N$-node cycle $\mathcal{R}_N$ is the graph whose nodes comprise the set $Z_N$ and whose edges connect each node $v$ with node $(v + 1) \bmod N$.
It is well known that $\mathcal{D}_n$ is Hamiltonian, in that it contains the cycle $\mathcal{R}_{2^n}$ as a subgraph. In fact, $\mathcal{D}_n$ satisfies the following stronger property.
Proof. For any choice of $m, n$ other than $m = n = 1$, and for any integer $1 \le c \le 2^{m+n}$, we show algorithmically that the cycle $\mathcal{R}_c$ is a subgraph of $\mathcal{PS}_{m,n}$. Our algorithm assumes that the cycles promised by Lemma 3.1 can be produced algorithmically; cf. [24]. (Note that $\mathcal{PS}_{1,1}$ is (essentially) a 4-cycle, whence its exclusion from the theorem.)
Assume, with no loss of generality, that $m \le n$ (or else, interchange the roles of $m$ and $n$ in what follows). If the desired cycle-length $c$ satisfies $1 \le c \le 2^n$, then $\mathcal{R}_c$ is a subgraph of $\mathcal{PS}_{m,n}$, by Lemma 3.1. Let us restrict attention, therefore, to values of $c$ in the range $2^n < c \le 2^{m+n}$, in which case we must have $m > 0$. Now, every integer $c$ in the indicated range admits a unique representation in the form $c = a2^n + b$ with $0 < a \le 2^m$ and $0 \le b < 2^n$. The overall strategy of our algorithm is to "hook together" Hamiltonian cycles from $a$ of the $2^m$ copies of $\mathcal{D}_n$ that comprise $\mathcal{PS}_{m,n}$, together with a length-$b$ cycle from one additional copy of $\mathcal{D}_n$ whenever $b > 0$. (In fact, technical difficulties in "hooking up" these cycles will cause us to deviate from this strategy slightly.) To the end of implementing this strategy, we invoke Lemma 3.1 to find a length-$d$ cycle in $\mathcal{D}_m$, where [...] We describe the mechanism for "hooking the cycles together" via an analysis of cases.
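The representation of the cycle length used above is ordinary division with remainder. A minimal Python sketch, assuming the reading $c = a2^n + b$ with $0 < a \le 2^m$ and $0 \le b < 2^n$ (the function name is hypothetical):

```python
def decompose_cycle_length(c, m, n):
    """Return (a, b) with c = a * 2**n + b, for 2**n < c <= 2**(m + n)."""
    if not (2**n < c <= 2**(m + n)):
        raise ValueError("c must satisfy 2^n < c <= 2^(m+n)")
    a, b = divmod(c, 2**n)
    return a, b
```

For instance, with m = 2 and n = 3, a cycle of length 20 decomposes as a = 2 full Hamiltonian cycles of length 8 plus a remainder b = 4, while length 32 gives a = 4 = 2^m and b = 0.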
Case 1: $b = 0$, so $d = a$ and $a > 1$. This is the easiest case, since we have only to "hook together" a set $C_0, C_1, \ldots, C_{a-1}$ of cycles, each $C_i$ being a copy within $\mathcal{D}^{(i)}$ of a Hamiltonian cycle $C$ of $\mathcal{D}_n$. We start by selecting any two independent edges $(x, y)$ and $(u, v)$ of $\mathcal{D}_n$¹³ that both lie on the cycle $C$; since $n \ge 2$, we are sure that these edges exist. Next, we let $x_i, y_i, u_i, v_i$ ($0 \le i < a$) denote the instances of the nodes $x, y, u, v$, respectively, in copy $\mathcal{D}^{(i)}$ of $\mathcal{D}_n$. Assume that the nodes $x, y, u, v$ lie in (say, for definiteness) clockwise order around the cycle $C$ in $\mathcal{D}_n$, so that each cycle $C_i$ has the form [...] where $P_i$ and $Q_i$ are the intermediate paths that define the cycle.
We are now ready to find a length-$c$ cycle in $\mathcal{PS}_{m,n}$.
(1) Trace the cycle $C_0$ in $\mathcal{D}^{(0)}$ in clockwise order, from node $y_0$ to node $x_0$, leaving out the edge that connects the two nodes. [...]
The paths $P_i$ and $Q_i$ and the edges $(x_i, y_i)$ and $(u_i, v_i)$ come from the copies of $\mathcal{D}_n$, while the edges $(x_i, x_{i+1})$, $(y_i, y_{i+1})$, $(u_i, u_{i+1})$, and $(v_i, v_{i+1})$ come from the copy of $\mathcal{D}_m$ we used to order the copies of $\mathcal{D}_n$. The reader should be able to fill in details, with the help of [...]
Case 2(a): $b \ge 3$. We must alter the procedure of Case 1 in two ways: we must find a copy of a length-$b$ cycle in copy $\mathcal{D}^{(a)}$ of $\mathcal{D}_n$, and we must ensure that we can "hook" this new cycle to the chain of Hamiltonian cycles. The first of these tasks is trivial, by Lemma 3.1; let us call the length-$b$ cycle $B$. In order to accomplish the second task, we invoke a strong property of $\mathcal{D}_n$:
Claim. For any path $x \leftrightarrow y \leftrightarrow z$ in $\mathcal{D}_n$ involving three distinct nodes, there is a Hamiltonian cycle of $\mathcal{D}_n$ that contains either the edge $(x, y)$ or the edge $(y, z)$.
Proof. The Claim is true by inspection when n = 2; when n>2, it follows from standard facts about de Bruijn graphs.
Fact 1. For all $n$, $\mathcal{D}_n$ is the line-graph of $\mathcal{D}_{n-1}$.
Fact 2. As a consequence of Fact 1, one can construct a Hamiltonian cycle in $\mathcal{D}_n$ from any Eulerian cycle in $\mathcal{D}_{n-1}$.
Fact 3. Given any Eulerian graph $\mathcal{G}$ and any 2-edge path $\pi$ in $\mathcal{G}$ whose removal does not disconnect $\mathcal{G}$, one can construct an Eulerian cycle in $\mathcal{G}$ which contains $\pi$.
Fact 4. The only 2-edge paths whose removal disconnects $\mathcal{D}_n$ are the paths both of whose edges are incident to either node $0^n$ or node $1^n$.
Because of Fact 1, a 3-node path in $\mathcal{D}_n$ results from a 3-edge path in $\mathcal{D}_{n-1}$. Since at most two of these edges can both be incident to either node $0^{n-1}$ or node $1^{n-1}$ in $\mathcal{D}_{n-1}$, it follows by Facts 3 and 4 that there is an Eulerian cycle in $\mathcal{D}_{n-1}$ passing through either the path or the path $x \leftrightarrow y \leftrightarrow z$ [...] within copies $\mathcal{D}^{(a-1)}$ and $\mathcal{D}^{(a)}$ of $\mathcal{D}_n$. One verifies readily that one of these paths exists.
Case 2(c)(ii). When $n > 2$, we alter Case 1 by insisting that at least one of the independent edges $(x, y)$ and $(u, v)$ not touch either node $0^n$ or node $1^n$ of $\mathcal{D}_n$. (Note that this is impossible when $n = 2$.) Say, without loss of generality, that node $0^n$ is not touched by either edge.
Having thus restricted the choice of these edges, we proceed exactly as in Case 2(b) ($b = 2$), with the following exception. Once having found the cycle produced in Case 2(b) (which has length $c + 1$), we remove the instance of node $0^n$ of $\mathcal{D}_n$ from whichever of $P_{a-1}$ or $Q_{a-1}$ contains an instance of $0^n$. (One must, because of our restriction.) Since every Hamiltonian cycle in $\mathcal{D}_n$ contains the path $10^{n-1} \leftrightarrow 0^n \leftrightarrow 0^{n-1}1$, the elision of node $0^n$ does not cut our cycle: it just shortens it, as desired.
This case analysis completes the proof. □
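Fact 2 in the proof above (a Hamiltonian cycle of the order-n de Bruijn graph from an Eulerian cycle of the order-(n-1) graph) can be sketched in Python as follows. The Eulerian cycle is found here with Hierholzer's algorithm, which is a standard choice made for this illustration and not necessarily the procedure of [24]; the function name is hypothetical and the sketch assumes n >= 2.

```python
from itertools import product

def hamiltonian_cycle_de_bruijn(n):
    """Cyclic ordering of all n-bit strings in which each string is joined to the
    next by a de Bruijn arc (drop the first bit, append a bit). Assumes n >= 2."""
    nodes = ["".join(b) for b in product("01", repeat=n - 1)]
    # Each node v of the order-(n-1) graph has two outgoing arcs, one per appended bit.
    unused = {v: ["1", "0"] for v in nodes}
    stack, circuit = [nodes[0]], []
    while stack:                      # Hierholzer's algorithm, iterative form
        v = stack[-1]
        if unused[v]:
            bit = unused[v].pop()
            stack.append((v + bit)[1:])
        else:
            circuit.append(stack.pop())
    circuit.reverse()                 # node sequence of an Eulerian circuit
    # The arc from v to w is the n-bit string v + (last bit of w); by the
    # line-graph property these arcs are the nodes of the order-n graph.
    return [v + w[-1] for v, w in zip(circuit, circuit[1:])]
```

Consecutive arcs of the Eulerian circuit share an endpoint, so consecutive n-bit strings in the output overlap in n-1 bits, which is exactly adjacency in the order-n graph.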
Very little of the proof of Theorem 3.2 depends on properties that are peculiar to de Bruijn graphs. In fact, F. Annexstein and M. Baumslag [personal communication] have observed that an altered version of the proof will establish the following.
Proposition 3.3. Let $\mathcal{G}$ and $\mathcal{H}$ be pancyclic graphs, one of which, say $\mathcal{G}$, has an even number of nodes. Suppose that for every integer $2 \le l \le |\mathcal{G}|$, $\mathcal{G}$ has a length-$l$ cycle that shares an edge with a Hamiltonian cycle. Then the product graph $\mathcal{G} \times \mathcal{H}$ is pancyclic, except when $|\mathcal{G}| = |\mathcal{H}| = 2$.
Meshes
The $m \times n$ (toroidal) mesh $\mathcal{M}_{m,n}$ is (isomorphic to) the product graph $\mathcal{R}_m \times \mathcal{R}_n$. One corollary of the main lower bound in [5] is that no butterfly- or shuffle-oriented graph $\mathcal{G}$ can emulate meshes with only constant slowdown (using our notion of emulation); it follows, of course, that $\mathcal{G}$ cannot contain a large mesh as a subgraph. In contrast, PS graphs contain moderate-size meshes as subgraphs, as indicated in the following corollary of Lemma 3.1.
Corollary 3.4. For all $m, n$ and all $p \le 2^m$ and $q \le 2^n$, the PS graph $\mathcal{PS}_{m,n}$ contains the mesh $\mathcal{M}_{p,q}$ as a subgraph.
Complete binary trees
The height-$n$ complete binary tree $\mathcal{T}_n$ is the graph whose $2^{n+1} - 1$ nodes comprise the set $\bigcup_{i=0}^{n} Z_2^i$ of binary strings of length $\le n$ and whose edges connect each node $x$ of length $< n$ with nodes $x0$ and $x1$.
Complete binary trees are very useful computational structures, most obviously for broadcasting, but also for emulations [5]. Thus, the following obvious result points out one of the most useful properties of de Bruijn graphs; cf. [20].
Lemma 3.5. For all $n$, the de Bruijn graph $\mathcal{D}_n$ contains the complete binary tree $\mathcal{T}_{n-1}$ as a subgraph, rooted at node $0^{n-1}1$.
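One natural way to realize the subgraph in Lemma 3.5 is sketched below in Python. The labeling (tree node s goes to the string consisting of leading 0s, then a 1, then s) is an assumption made for this illustration, since the source does not spell out its construction; the function names are likewise hypothetical.

```python
from itertools import product

def tree_to_de_bruijn(s, n):
    """Image in the order-n de Bruijn graph of tree node s (a binary string, len(s) <= n-1)."""
    return "0" * (n - 1 - len(s)) + "1" + s

def is_de_bruijn_edge(u, v):
    """True when u and v are joined by a shuffle or shuffle-exchange arc in either direction."""
    return u[1:] == v[:-1] or v[1:] == u[:-1]

def check_tree_embedding(n):
    """Check that the labeling is one-to-one and maps every tree edge to a de Bruijn edge."""
    tree_nodes = ["".join(b) for k in range(n) for b in product("01", repeat=k)]
    images = [tree_to_de_bruijn(s, n) for s in tree_nodes]
    injective = len(set(images)) == len(images)
    edges_ok = all(
        is_de_bruijn_edge(tree_to_de_bruijn(s, n), tree_to_de_bruijn(s + b, n))
        for s in tree_nodes if len(s) < n - 1
        for b in "01"
    )
    return injective and edges_ok
```

The root (the empty string) lands on node $0^{n-1}1$, and moving to a child corresponds to dropping a leading 0 and appending the child's bit, which is a single de Bruijn arc.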
While PS graphs cannot match the fact that the $N$-node de Bruijn graph contains the $(N-1)$-node complete binary tree, they do come within a factor of 2 of matching it.
Theorem 3.6. For all $m, n$, the PS graph $\mathcal{PS}_{m,n}$ contains the complete binary tree $\mathcal{T}_{m+n-2}$ as a subgraph.
Proof. We find an instance of $\mathcal{T}_{m+n-2}$ rooted at node $u_0 = (0^{m-1}1, 0^{n-1}1)$ of $\mathcal{PS}_{m,n}$, as follows. We first invoke Lemma 3.5 to find a copy of $\mathcal{T}_{n-1}$ in "copy $0^{m-1}1$" of $\mathcal{D}_n$, rooted at node $u_0$ and having $2^{n-1}$ leaves of the form $(0^{m-1}1, x)$ for some $x \in Z_2^n$. We then invoke Lemma 3.5 once for each "copy" of $\mathcal{D}_m$ that is "connected" to one of these leaves, to find a copy of $\mathcal{T}_{m-1}$ rooted at each of these leaves. □
To place Theorem 3.6 in perspective, the efficient embedding of complete binary trees in butterfly graphs presented in [5] promises only constant (as opposed to unit) dilation and utilizes only roughly one-eighth of the nodes of the host butterfly. (These constants can be improved somewhat, but not to unity.)
Meshes of trees
The $m \times n$ mesh of trees $\mathcal{MT}_{m,n}$ ($m$ and $n$ being powers of 2) is obtained from the $m \times n$ mesh by
• eliminating all mesh edges,
• erecting a copy of the complete binary tree $\mathcal{T}_{\lambda(m)}$ along each column, using the column-nodes as leaves of the tree,
• erecting a copy of the complete binary tree $\mathcal{T}_{\lambda(n)}$ along each row, using the row-nodes as leaves of the tree.
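The construction above can be made concrete with a short Python sketch that builds the node and edge sets of a mesh of trees. The node naming scheme (tagged tuples for leaves, row-tree nodes, and column-tree nodes) and the function name are assumptions of this illustration; it requires m and n to be powers of 2 that are at least 2.

```python
from itertools import product

def mesh_of_trees(m, n):
    """Node and edge sets of the m x n mesh of trees (m, n powers of 2, both >= 2)."""
    height_col = m.bit_length() - 1   # log2(m): a column has m leaves
    height_row = n.bit_length() - 1   # log2(n): a row has n leaves
    nodes = {("leaf", i, j) for i in range(m) for j in range(n)}
    edges = set()

    def add_tree(kind, index, height, leaf_of):
        # Internal tree nodes are binary strings of length < height.
        for depth in range(height):
            for bits in product("01", repeat=depth):
                s = "".join(bits)
                nodes.add((kind, index, s))
                for b in "01":
                    child = s + b
                    if depth + 1 == height:
                        target = leaf_of(int(child, 2))   # bottom level: a mesh node
                    else:
                        target = (kind, index, child)
                    edges.add(((kind, index, s), target))

    for i in range(m):   # one tree per row; its leaves are the n nodes of row i
        add_tree("row", i, height_row, lambda j, i=i: ("leaf", i, j))
    for j in range(n):   # one tree per column; its leaves are the m nodes of column j
        add_tree("col", j, height_col, lambda i, j=j: ("leaf", i, j))
    return nodes, edges
```

Counting gives $mn$ leaves plus $m(n-1)$ row-tree and $n(m-1)$ column-tree internal nodes, so the 4 x 4 mesh of trees has 40 nodes and 48 edges.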
Parallel architectures based on meshes of trees have been shown to be quite powerful computationally [14]. One can prove, using Lemma 3.5, that PS graphs contain at least moderate-size meshes of trees as subgraphs.
Proof. Let us denote by $x_i$, $1 \le i \le 2^p$, the $i$th leaf of $\mathcal{T}_p$ and by $y_i$, $1 \le i \le 2^q$, the $i$th leaf of $\mathcal{T}_q$. By Lemma 3.5, $\mathcal{PS}_{m,n}$ contains the product graph $\mathcal{T}_p \times \mathcal{T}_q$ as a subgraph. Consequently, $\mathcal{PS}_{m,n}$ contains as a subgraph every tree of the form $\{x_i\} \times \mathcal{T}_q$ as well as every tree of the form $\mathcal{T}_p \times \{y_i\}$. The union of all these subgraphs is $\mathcal{MT}_{2^p, 2^q}$.
PS vs. butterfly vs. de Bruijn networks
In this section we demonstrate that, relative to our notion of network emulation, PS graphs have strictly more communication power than either shuffle- or butterfly-oriented graphs. Our demonstration consists of efficient embeddings of both de Bruijn graphs (Section 4.1) and butterfly graphs (Section 4.2) in PS graphs, followed by a proof that PS graphs cannot be embedded efficiently in either of the other two families (Section 4.3). We close with a discussion in Section 4.4 of the price one pays for the additional power of PS graphs.
Our embeddings of de Bruijn and butterfly graphs in PS graphs are presented in two stages, the first assuming that the guest and host graphs in the embedding have the same number of nodes and the second assuming that the host PS graph is smaller than the guest.
PS networks emulating de Bruijn networks
We consider first the (technically) easier problem of emulating de Bruijn graphs on PS graphs. [...]
This embedding clearly has dilation 2. The claimed edge-congestion follows from the facts that the first edge in the length-2 path identifies the edge of $\mathcal{D}_{m+n}$ being emulated, up to the identity of $\alpha$, while the second edge identifies the edge being emulated, up to the identity of $\beta$. □
Theorem 4.2. For all $n$ and all $p$ and $q$ with $p + q \le n$, one can embed the de Bruijn graph $\mathcal{D}_n$ in the PS graph $\mathcal{PS}_{p,q}$, with load $2^{n-p-q}$, dilation 4, and edge-congestion [...]
Proof. First, we use Lemma 4.1 to embed $\mathcal{D}_n$ in $\mathcal{PS}_{n-p-q,\,p+q}$, with load 1, dilation 2, and edge-congestion 2.
Next, we use a projection embedding to embed $\mathcal{PS}_{n-p-q,\,p+q}$ in $\mathcal{D}_{p+q}$, with load $2^{n-p-q}$, dilation 1, and edge-congestion $2^{n-p-q}$. The projection embedding assigns each node $(x, y)$ of $\mathcal{PS}_{n-p-q,\,p+q}$ to node $y$ of $\mathcal{D}_{p+q}$ and routes edges in the naive (edge-to-edge) way.
Finally, we use Lemma 4.1 a second time, to embed $\mathcal{D}_{p+q}$ in $\mathcal{PS}_{p,q}$, with load 1, dilation 2, and edge-congestion 2.
Since our cost measures multiply when embeddings are composed, we can invoke Proposition 2.1 to complete the proof. □
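The bookkeeping for the three-stage composition can be sketched in Python. The helper names are hypothetical, and the congestion figure this produces is only the naive product of the three stage costs quoted in the proof; it is not offered as the edge-congestion stated in the theorem.

```python
def compose(*stages):
    """Multiply (load, dilation, edge-congestion) triples of composed embeddings."""
    load = dilation = congestion = 1
    for stage_load, stage_dilation, stage_congestion in stages:
        load *= stage_load
        dilation *= stage_dilation
        congestion *= stage_congestion
    return load, dilation, congestion

def three_stage_costs(n, p, q):
    """Stage costs quoted in the proof: Lemma 4.1, the projection, Lemma 4.1 again."""
    k = 2 ** (n - p - q)
    return [(1, 2, 2), (k, 1, k), (1, 2, 2)]

example = compose(*three_stage_costs(5, 2, 2))
```

With n = 5 and p = q = 2 the product gives load 2 and dilation 4, matching the load $2^{n-p-q}$ and dilation 4 in the statement, together with a naive congestion product of 8.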
PS networks emulating butterfly networks
Lemma 4.3 [3]. For all $n$, one can embed the butterfly graph $\mathcal{B}_n$ in the PS graph $\mathcal{PS}_{\lambda(n),n}$, with load 1, dilation 2, and edge-congestion 2.
Proof (Sketch). We sketch the proof from [3], which has recently been put in a much more general context in [4]. By the pancyclicity of de Bruijn graphs (Lemma 3.1), it suffices to embed $\mathcal{B}_n$ in the product graph $\mathcal{R}_n \times \mathcal{D}_n$, with unit load, dilation 2, and edge-congestion 2.
We label the nodes of $\mathcal{B}_n$ with strings from $Z_2^n$ via the following inductive procedure that is implicit in [3]; cf. [...]
Now, isolate any two consecutive levels of the labeled $\mathcal{B}_n$, together with the $2^{n+1}$ edges that connect the levels; cf. Fig. 5. Produce the $2^n$-node graph $\mathcal{G}$ from the isolated levels by identifying like-labeled nodes and eliminating self-loops. Our labeling procedure guarantees that:
Claim. For any two consecutive levels of $\mathcal{B}_n$, the graph $\mathcal{G}$ is isomorphic to $\mathcal{D}_n$.
The result is now direct: To embed $\mathcal{B}_n$ in $\mathcal{R}_n \times \mathcal{D}_n$:
• Assign level-$\ell$ node $v$ of $\mathcal{B}_n$ to node $L(v)$ of copy $\ell$ of $\mathcal{D}_n$, where $L(v) \in Z_2^n$ is the label assigned to node $v$ by the indicated procedure.
• Route edge $(\langle \ell, x \rangle, \langle \ell', y \rangle)$ of $\mathcal{B}_n$ within $\mathcal{R}_n \times \mathcal{D}_n$ via the length-2 path:
$$(\ell, L(\langle \ell, x \rangle)) \leftrightarrow (\ell, L(\langle \ell', y \rangle)) \leftrightarrow (\ell', L(\langle \ell', y \rangle)).$$
Thus, we first route within a copy of $\mathcal{D}_n$ and then between copies. The described embedding clearly has unit load and dilation 2. The edge-congestion of the embedding is only 2 because each edge connecting levels $\ell$ and $\ell'$ in $\mathcal{B}_n$ is routed first within the level-$\ell$ copy of $\mathcal{D}_n$ and only then between the level-$\ell$ and level-$\ell'$ copies of $\mathcal{D}_n$; hence, the only edges that engender edge-congestion are pairs of straight-edges and cross-edges in $\mathcal{B}_n$ that share an endpoint. □
[...]
• make the induced subgraph of $\mathcal{B}_{n,q}$ on levels $q, q+1, \ldots, n-1, 0$ (isomorphic to) $2^q$ node-disjoint length-$(n-q)$ paths, each having the form
$$\langle q, x \rangle \leftrightarrow \langle q+1, x \rangle \leftrightarrow \cdots \leftrightarrow \langle n-1, x \rangle \leftrightarrow \langle 0, x \rangle$$
for some length-$q$ binary string $x \in Z_2^q$.
The only subtlety involving the costs of this embedding is that the edges within levels $q, q+1, \ldots, n-1, 0$ of $\mathcal{B}_{n,q}$ are congested twice as much as the edges within levels $0, 1, \ldots, q$, because within the higher-numbered levels, cross-edges share routing paths with straight-edges.
Stage 2. Now we compare the values of $n$ and $p$. If $n \le 2^p$, then we do nothing. If $n > 2^p$, then we embed $\mathcal{B}_{n,q}$ in $\mathcal{B}_{2^p,q}$, as follows.
(1) "Fold" the length-$(n - 2^p)$ "dangling paths" that start at level $2^p$ of $\mathcal{B}_{n,q}$, and proceed to level 0, into the top $2^p$ levels of the graph.
(2) Eliminate all levels of $\mathcal{B}_{n,q}$ from level $2^p + 1$ through level $n - 1$.
(3) Identify levels 0 and $2^p$ of the resulting "pruned" graph.
¹⁴ Recall that $\lfloor x \rfloor_q$ is the length-$q$ prefix of the string $x$.
The "folding" of "dangling paths" is accomplished via the assignment function $\alpha_2$ defined by: for all $x \in Z_2^q$ and all [...]
Once again, we employ the naive (edge-to-edge) routing to complete the specification of Stage 2 of our embedding. One verifies easily that this embedding has load $L_2 = \max(1, 2\lceil n2^{-p} - 1 \rceil)$, edge-congestion $2L_2$, and unit dilation.
Stage 3. Finally, we embed the host graph $\mathcal{B}_{m,q}$ from Stage 2, where $m = \min(n, 2^p)$, into $\mathcal{PS}_{p,q}$. This embedding can be specified indirectly, by invoking the proof (rather than statement) of Lemma 4.3. In that proof, $\mathcal{B}_n$ is embedded in $\mathcal{R}_n \times \mathcal{D}_n$ with unit load and with dilation and edge-congestion 2. Precisely the same reasoning embeds $\mathcal{B}_{m,q}$ in $\mathcal{R}_m \times \mathcal{D}_q$ with the same costs. Our embedding of $\mathcal{B}_{m,q}$ in $\mathcal{PS}_{p,q}$ is completed by noting that $\mathcal{R}_m \times \mathcal{D}_q$ is a subgraph of $\mathcal{PS}_{p,q}$ (Lemma 3.1).
The converse emulations
In the framework of our strong notion of emulation, PS graphs are strictly more powerful than either butterfly or de Bruijn graphs, in the sense of the following result. Note how much stronger the result is for butterfly graphs than for de Bruijn graphs, both in terms of quantification and dilation. (m, n) ).
Proof. Let $M = 2^{\min(m,n)}$. By Corollary 3.4, $\mathcal{PS}_{m,n}$ contains the $M \times M$ mesh $\mathcal{M}_{M,M}$ as a subgraph. It is proved in [5] that any embedding of $\mathcal{M}_{M,M}$ in any butterfly graph must have dilation $\Omega(\log M)$. It is also proved there that any embedding of $\mathcal{M}_{M,M}$ in a like-sized de Bruijn graph must have dilation $\Omega(\log \log M)$. □
The lower bounds of Theorem 4.5 grow faster than any constant, thus justifying our assertion about the power of PS graphs; however, each of these lower bounds is smaller than the best-known corresponding upper bound. We do not know at this point whether to believe that the upper bounds can be lowered or that the lower bounds can be raised.
Area-efficient VLSI layouts of the networks
The additional power of PS graphs over both butterfly and de Bruijn graphs comes at modest cost. First, and obviously, PS graphs are 8-valent while their competitors are 4-valent. Less obviously, PS graphs admit VLSI layouts which are only modestly more consumptive of area than the most efficient layouts of either of the other two graphs. We refer the reader to [7, 23] for background on the formal framework and techniques of analysis for VLSI layouts.
We begin with the layout requirements of de Bruijn and butterfly graphs.
Proof. We use the layouts guaranteed by Theorem 4.6(a) and by the following lemma to obtain a VLSI layout of $\mathcal{PS}_{m,n}$ with the advertised area.
Further challenges. Among the unresolved problems in the study of hypercube-derivative networks, the most inviting seek definitive answers to the questions of how efficiently the hypercube and its derivatives (including PS graphs) can emulate one another. Although certain of these questions have been resolved within the more comprehensive framework of [12], there are practical, as well as intellectual, reasons to determine whether or not our simpler framework yields the same answers to these questions. Even after all of these individual questions have been answered, it will be an interesting challenge to adduce underlying principles that explain the answers (along the lines of the algebraic development in [3]).
