Abstract-We study the problem of network embeddings in 2-D array architectures in which each row and column of processors is interconnected by a bus. These architectures are especially attractive if optical buses are used that allow simultaneous access by multiple processors through either wavelength division multiplexing or message pipelining, thus overcoming the bottlenecks caused by the exclusive access of buses. In particular, we define X-trees to include both binary X-trees and pyramids, and present two embeddings of X-trees into 2-D processor arrays with spanning buses. The first embedding has the property that all neighboring nodes in X-trees are mapped to the same bus in the target array, thus allowing any two neighbors in the embedded X-trees to communicate with each other in one routing step. The disadvantage of this embedding is its relatively high expansion cost. In contrast, the second embedding has an expansion cost approaching unity, but does not map all neighboring nodes in X-trees to the same bus. These embeddings allow all algorithms designed for binary trees, pyramids, and X-trees to be executed on the target arrays.
I. INTRODUCTION AND PROBLEM DEFINITION
Although the computational power of parallel computers is potentially much larger than that of sequential ones, parallel computers suffer from inefficient interprocessor communications. Dealing with the communication inefficiency through either architecture or algorithm design, or both, has always been a key research issue in parallel computation. In order to improve the communication efficiency of parallel systems, buses, both electronic and optical, have been considered by many researchers for interconnecting processor arrays [1], [7], [9], [18], [19], [21]. In particular, 2-D processor arrays with broadcasting row and column buses have been suggested that allow for efficient solutions to various problems, including semigroup and prefix computations, image processing, computational geometry, and numerical computations [3], [16], [17]. Broadcasting buses, however, have low bandwidth because of their exclusive access mode of operation, and thus are not suited for tasks involving extensive interprocessor communications. This limitation may be alleviated if optical buses are used either with wavelength division multiplexing, which provides multiple channels on the same bus [7], or with space/time multiplexing, which allows message pipelining on a bus [9], [15]. In both cases, simultaneous bus access by multiple processors is permitted. As an illustration, in the following subsection, we briefly describe the principle of message pipelining on optical buses.

Manuscript received March 12, 1992; revised September 11, 1992. This work was supported by the U.S. Air Force under Grant AFOSR-89-0469, and by the National Science Foundation under Grant MIP-8901057. Z. Guo
A. Message Pipelining on Optical Buses
Pipelined optical buses take advantage of two unique properties of optical signal transmissions in waveguides: unidirectional propagation and predictable path delays. Consider Fig. 1, which shows an array of $N$ processors connected by an optical bus (waveguide). Each processor is coupled to the bus with two passive optical couplers, one for writing signals and the other for reading. Assume that the optical distance between each pair of adjacent processors is $D_0$ and that a message consists of a sequence of $b$ optical pulses, each having a width $w$ in seconds. In contrast to the case of an electronic bus, where writing access to the bus is exclusive, all of the processors may write their messages on the optical bus simultaneously. This may be accomplished if all processors write their messages at the same instant and if the length of each message is smaller than $D_0$.
Let $c_g$ be the speed of light in the waveguide. Then we may ensure that messages sent by distinct processors do not collide with one another on the optical bus if $b w c_g < D_0$. Here, by "colliding," we mean that two optical signals injected on the bus by any two distinct processors arrive at some location on the bus simultaneously. With this collision-free condition satisfied, every processor can, in parallel, send a message to some other processor, and the messages will all travel from left to right on the bus in a pipelined fashion. To transfer messages in both directions in the same bus cycle, either a second waveguide can be added so that messages are transmitted in opposite directions on separate buses [9], or a folded waveguide can be used to allow messages to be delivered to processors on both sides of the source processor on the same bus [10].
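The collision-free condition can be checked numerically. The following is a minimal sketch, not part of the original note: the function names and the numeric values (10 ps pulses, a propagation speed of 2e8 m/s, and 10 cm between processors) are illustrative assumptions.

```python
def message_length(b: int, w: float, c_g: float) -> float:
    """Spatial length on the waveguide of a message of b pulses, each w seconds wide."""
    return b * w * c_g


def is_collision_free(b: int, w: float, c_g: float, d0: float) -> bool:
    """True when the condition b * w * c_g < D_0 holds."""
    return message_length(b, w, c_g) < d0


# Illustrative values only (assumptions, not taken from the paper).
W = 10e-12      # pulse width in seconds
C_G = 2.0e8     # speed of light in the waveguide, metres per second
D0 = 0.10       # optical distance between adjacent processors, metres

if __name__ == "__main__":
    for b in (32, 64):
        print(b, is_collision_free(b, W, C_G, D0))
```

With these assumed values, a 32-pulse message occupies about 6.4 cm of waveguide and fits between adjacent processors, while a 64-pulse message does not.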
Connecting $N$ processors with a linear optical bus results in bus cycles of length $O(N)$. One solution to this problem is to extend the linear architecture of Fig. 1 to a 2-D architecture. Although the bus latency is significantly reduced to $O(\sqrt{N})$ in the 2-D architecture, in general it takes at least two steps, a row routing and a column routing, for two processors to communicate with each other. Such a two-step communication requires a message relay by an intermediate processor and involves an optical-electronic-optical information conversion in the middle of a message transfer. In the worst case, e.g., when all the messages from the same row are to be sent to the same column, up to $O(\sqrt{N})$ message relays by a single processor may be needed.
B. Alignment Cost
The problem studied in this short note is one of network embeddings in 2-D processor arrays with spanning buses, as shown in Fig. 2, where the processors in each row and each column are connected with a bus. A bus in Fig. 2 may be either optical or electronic and allows a processor to send a message to any other processor on the same bus in one routing step. It is clear from the above discussion that for efficient communications in such arrays, message relays should be minimized and preferably eliminated. This can be achieved by allocating communicating processes to processors that share a common bus, thus avoiding the relay of messages. Let $S = \{V_S, E_S\}$ be a process graph (called a source graph), where $V_S$ is a set of processes and $E_S$ is a set of edges between communicating processes. Let also $T$ be the target network, which is a processor array with row and column spanning buses, onto which $S$ is to be mapped. The quality of a mapping is often measured in terms of the dilation cost. Specifically, the dilation of an edge $(i, j) \in E_S$ that is mapped to a path $B$ in the target network is $|B| - 1$, where $|B|$ is the number of nodes on $B$. The dilation cost for mapping the source graph into the target network is then defined as the maximum of all of the edge dilations. This measure, however, is of little value when the source graph is mapped into a 2-D processor array with spanning buses, because the bus connections allow two nodes to communicate in one step if the two nodes are mapped to the same bus, and in two steps otherwise. If two neighboring nodes $i$ and $j$ in the source graph are mapped to the same bus in the target array, we say that $i$ and $j$ are aligned. Clearly, it is desirable to obtain a mapping in which every pair of neighboring nodes in the source graph is aligned in the target array.
In order to measure how well neighboring nodes in a source graph $S$ are aligned when mapped onto a target network with bus interconnections, we define a cost function, called alignment cost.
Let $i \in V_S$ and $j \in V_S$ be two neighboring nodes connected by an undirected edge $(i, j) \in E_S$, and let the cost of aligning $i$ and [...]

In the above definition, $|V_S|$ is used so that the maximum value of $\Phi_a$ is normalized to the degree of the source graph. This gives a maximum alignment cost independent of the size of the source graph, and makes it more convenient to evaluate embeddings of the same type of source graphs (e.g., X-trees) of different sizes. Note that the maximum alignment cost may result only when the degree of each node in the source graph is a constant. A mapping will be said to satisfy the alignment condition if its alignment cost is $\Phi_a = 0$, which is optimal. Another useful measure for evaluating graph embeddings is the expansion cost, denoted $\Phi_e$, which is defined as $\Phi_e = |V_T| / |V_S|$, where $|V_T|$ is the number of nodes in the target network. These costs will be used to evaluate our embeddings of X-trees as defined in the next subsection.
C. X-Trees
Consider an $L$-level tree in which each node has $C$ children, where $C = 2$ or 4. The tree is augmented with interconnections among the [...]

In the second class of X-trees that we consider, $C = 4$, and there are $4^l$ nodes on each level $l$. These nodes are interconnected as a $2^l \times 2^l$ mesh with wraparound connections. Specifically, let $(x, y)$, $0 \le x, y < 2^l$, be the row/column position of a node in the $2^l \times 2^l$ mesh representing level $l$ of the X-tree. Then the wraparound connections connect node $(0, y)$ to $(2^l - 1, y)$, and node $(x, 0)$ to $(x, 2^l - 1)$. (See Fig. 4(a), where wraparound connections are omitted for clarity.) The resulting architecture will be called an X-quad-tree, which is an extension of the X-binary-tree. Thus, an X-binary-tree is a binary tree with each level connected as a ring, and an X-quad-tree is a quad tree with each level connected as a torus. The term X-tree is used hereafter to refer to both X-binary-tree and X-quad-tree.
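The two classes of X-trees described above can be sketched as a neighbor function. This is only an illustration under stated assumptions: the node labeling `(level, x, y)`, the function name `xtree_neighbors`, and the rule that the children of position `(x, y)` sit at positions `(2x + dx, 2y + dy)` on the next level are all hypothetical choices, since the original figures are not available here.

```python
from itertools import product


def xtree_neighbors(node, C, L):
    """Neighbors of node = (level, x, y) in an L-level X-tree with C = 2 or 4.

    For C == 2 each level is a ring of 2**level nodes (x is always 0).
    For C == 4 each level is a 2**level x 2**level torus.
    Child placement (2x + dx, 2y + dy) is an assumption of this sketch.
    """
    level, x, y = node
    rows = 2 ** level if C == 4 else 1
    cols = 2 ** level
    nbrs = set()
    # Same-level links, including the wraparound connections.
    for dx, dy in ((0, 1), (0, -1), (1, 0), (-1, 0)):
        if C == 2 and dx != 0:
            continue  # a ring has no row direction
        nbrs.add((level, (x + dx) % rows, (y + dy) % cols))
    # Link to the parent on the level above.
    if level > 0:
        nbrs.add((level - 1, x // 2 if C == 4 else 0, y // 2))
    # Links to the C children on the level below.
    if level < L - 1:
        for cx, cy in product(range(2 if C == 4 else 1), range(2)):
            nbrs.add((level + 1, 2 * x + cx, 2 * y + cy))
    nbrs.discard(node)  # wraparound on a one-node level points back to itself
    return nbrs


if __name__ == "__main__":
    print(len(xtree_neighbors((2, 0, 1), C=2, L=4)))
    print(len(xtree_neighbors((2, 1, 1), C=4, L=4)))
```

On levels small enough that the two wraparound neighbors coincide (level 1), a node has fewer distinct neighbors; the sketch stores neighbors in a set so such duplicates collapse.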
In an $L$-level X-tree, each node has $2C + 1$ neighbors, except the root (at level 0), which has only $C$ children, and the leaf nodes (at level $L - 1$), each of which has $C$ neighbors at the same level and a parent at the level above. When embedding an X-tree into a 2-D array with spanning buses, all of the neighboring nodes in the X-tree must be mapped to either the same row or the same column of the target array, so that the alignment condition can be satisfied. To facilitate future discussions, we define the alignment condition for the embedding of an $L$-level X-tree in a 2-D array as follows.

H1) Every two neighboring nodes on the same level are mapped to the same row or the same column of the target array.

H2) Every parent node and each of its children are mapped to the same row or the same column of the target array.

Note that once an embedding of X-trees into a 2-D array is obtained, all algorithms designed for binary trees, pyramids, and X-trees can be run on the target array. In the case of running a binary tree algorithm, all connections at the same level are ignored. In the case of running a pyramid algorithm, the wraparound connections at each level are not used. Numerous applications will benefit from such embeddings because many efficient parallel algorithms have been designed for binary trees, X-binary-trees, and pyramids. Examples include selection, scheduling, sorting, prefix algorithms, matrix operations, graph theoretic problems, and various image and vision computations [2], [4]-[6], [12], [14].

In Section II, we define, for meshes, an indexing scheme, called the reflection index, and a coding scheme, called the reflection code. These schemes will be used to specify the X-tree embeddings, because each level $l$ of an X-tree is a $(C/2)^l \times 2^l$ mesh (with wraparound connections). The reflection code was first introduced in [8] for $2^i \times 2^i$ square meshes and used to obtain an embedding of pyramids in 2-D arrays with pipelined optical buses. The definition of the reflection code in this short note extends that of [8] to $2^i \times 2^j$ meshes for arbitrary integers $i$ and $j$. This allows us to embed binary trees, X-trees, and pyramids using the reflection code.
This result is presented in Section III as our first embedding, which is proved to achieve an optimal alignment cost at the expense of a relatively high expansion cost. The second embedding, presented in Section IV, is defined using the reflection index, and achieves an optimal expansion cost. The alignment cost of the second embedding, however, is not optimal.
Finally, these results are summarized in Section V.
II. REFLECTION INDEX AND REFLECTION CODE
First, the reflection index is defined. Consider a mesh of size $2^i \times 2^j$ for arbitrary integers $i \ge 0$ and $j \ge 0$. Let $(x, y)$ be the node at row $x$ and column $y$, $0 \le x < 2^i$ and $0 \le y < 2^j$. The reflection index for $(x, y)$ is defined by two transforms, $G$ and $R$, on $(x, y)$. The first transform, $G$, results in a Gray code [13], to which the second transform, $R$, is applied to obtain the reflection index. Let $a$ be an integer with binary representation $a_{k-1} \cdots a_0$. Then the transform $G$ is defined as follows:
[...] where $\oplus$ is the logical Exclusive-OR operator. The Gray code, denoted $(e, f)$, for $(x, y)$ is then given by the following equation:

$(e, f) = (G(x), G(y))$. (2)

To define the reflection index, the notion of binary numbers with null bits is now introduced. The notion is best illustrated by an example. Let $a$ be an integer with binary representation $a_3 a_2 a_1 a_0$, where $a_3$, $a_1$, and $a_0$ are ordinary bits and $a_2$ is a null bit, denoted by $*$. Then the actual value of integer $a$ is obtained by throwing away the null bit; that is, $a = a_3 * a_1 a_0 = a_3 a_1 a_0$.
For notational convenience in defining the reflection index, $e$ or $f$ is augmented with $|i - j|$ null bits at higher bit positions, so that $e$ and $f$ have the same number of bits in their binary representations.
Specifically, if $i < j$, the binary representation of $e$ is augmented with $j - i$ null bits, so that one can write $e = e_{j-1} \cdots e_i e_{i-1} \cdots e_0$, where $e_{j-1}$ through $e_i$ are null bits. If $i \ge j$, however, $f$ is augmented with $i - j$ null bits. If $k = \max\{i, j\}$, then the reflection index $r$ for $(x, y)$ in a $2^i \times 2^j$ mesh is obtained by shuffling the bits of $g$, which is the Gray code index of $(x, y)$. That is,

$r = R(e, f) = e_{k-1} f_{k-1} \cdots e_1 f_1 e_0 f_0$. (3)
It can be checked that after throwing away the null bits, the number of bits in $r$ is $i + j$, resulting in a binary representation $r = r_{i+j-1} \cdots r_0$. An example of the reflection index is shown in Fig. 5(a) for a $2 \times 8$ mesh. Two special cases of the mesh size are important to us. When $i = 0$ (or $j = 0$), the mesh degenerates to a linear array, and the reflection index is the same as the Gray code index. In the other special case, when $i = j$, the mesh is a square, and the reflection index is obtained by shuffling the bits (without null bits) of the Gray code index. Figs. 5(b) and 5(c) show the reflection indices for a linear array and a square mesh, respectively. The reflection indices (and the reflection codes, defined later) for these two special mesh sizes will facilitate our definitions of X-tree embeddings, because each level of an X-tree is a linear array with wraparound connections in the case of an X-binary-tree, and a square mesh with wraparound connections in the case of an X-quad-tree.
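The construction of the reflection index can be sketched in a few lines. This sketch rests on two assumptions that should be checked against the original equations: that $G$ is the standard binary reflected Gray code (each output bit is the Exclusive-OR of an input bit and the next higher input bit), and that the shuffle places each bit of $e$ immediately above the bit of $f$ at the same position, skipping null bits. The function names are hypothetical.

```python
def gray(a: int) -> int:
    """Standard binary reflected Gray code (assumed form of the transform G)."""
    return a ^ (a >> 1)


def reflection_index(x: int, y: int, i: int, j: int) -> int:
    """Reflection index of node (x, y) in a 2**i x 2**j mesh.

    e = G(x) has i bits and f = G(y) has j bits. Bits are shuffled from
    position max(i, j) - 1 down to 0; a position that does not exist in
    e (or f) plays the role of a null bit and is simply skipped.
    """
    e, f = gray(x), gray(y)
    r = 0
    for b in range(max(i, j) - 1, -1, -1):
        if b < i:                      # e_b is an ordinary bit
            r = (r << 1) | ((e >> b) & 1)
        if b < j:                      # f_b is an ordinary bit
            r = (r << 1) | ((f >> b) & 1)
    return r


if __name__ == "__main__":
    # Print the indices of a 2 x 8 mesh, one mesh row per line.
    for x in range(2):
        print([reflection_index(x, y, 1, 3) for y in range(8)])
```

Under these assumptions the index uses exactly $i + j$ bits, so every node of the mesh receives a distinct value, and for $i = 0$ the result reduces to the Gray code of $y$.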
The reflection index for a mesh of size $2^i \times 2^j$ can be obtained recursively by successive reflections, thus the name reflection index. Specifically, starting with a single node 0, first perform alternate column and row reflections until all the $2^i$ rows are exhausted (assuming $i < j$; the case where $i \ge j$ is similar), and then do only column reflections until all of the $2^j$ columns are exhausted. The reflection index has the properties of adjacency and squareness, as established in Lemmas 1 and 2 below.
Lemma 1: (Adjacency of the reflection index) The reflection indices for two adjacent nodes $(x, y)$ and $(x, (y + 1) \bmod 2^j)$ or $((x + 1) \bmod 2^i, y)$ in a $2^i \times 2^j$ mesh are at Hamming distance 1.
It can be proved that $G(x)G(y)$ gives the Gray code index $g$ for [...]

To prove the squareness, we need show only that bits $r_{i+j-1} \cdots r_{2h}$ depend only on $x_{i-1} \cdots x_h$ and $y_{j-1} \cdots y_h$, and are thus the same for the $2^{2h}$ nodes in each $2^h \times 2^h$ block. It can be shown from (1) and (2), however, that $e_{i-1} \cdots e_h$ and $f_{j-1} \cdots f_h$, which are the higher bits in $e$ and $f$ of the Gray code $(e, f)$ for $(x, y)$, depend only on $x_{i-1} \cdots x_h$ and $y_{j-1} \cdots y_h$, respectively. Thus, to prove the squareness, it is sufficient to show that bits $r_{i+j-1} \cdots r_{2h}$ depend only on $e_{i-1} \cdots e_h$ and $f_{j-1} \cdots f_h$. This can be easily shown to be true from (3).
where $n = r_{t-1} \cdots r_2$, and $t = Cl/2$ is the number of bits in $r$.
Corollary 1 tells us that the binary representations of the reflection indices for the $C$ children of any node in an X-tree have the same higher significant bits $r_{t-1} \cdots r_2$, and are different only in the two lowest bits $r_1 = e_0$ and $r_0 = f_0$, where $e_0$ and $f_0$ are the least significant bits in the Gray code $(e, f)$ of each node, as defined in (2). Note that $e$, and thus $e_0$, is null if $C = 2$.
The adjacency and squareness properties of the reflection index can be checked from the examples in Fig. 5. It is interesting to compare the reflection index with two well-known indexing schemes for meshes: the Gray code index and the shuffled row major index [20]. Like the reflection index, the Gray code index possesses adjacency, but it does not have squareness (except when 2-D meshes degenerate into linear arrays). Conversely, the shuffled row major index has squareness but does not possess adjacency. Thus, neither the Gray code index nor the shuffled row major index has both adjacency and squareness properties, which, as will be seen, are both crucial to our X-tree embeddings.
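The comparison of the three indexing schemes can be checked by brute force on small meshes. The sketch below is self-contained and makes the same assumptions as before: $G$ is taken to be the standard binary reflected Gray code, the reflection index interleaves the bits of $G(x)$ and $G(y)$, the Gray code index is the concatenation $G(x)G(y)$, and the shuffled row major index interleaves the bits of $x$ and $y$ directly. All function names are hypothetical.

```python
def gray(a):
    """Standard binary reflected Gray code (assumed form of G)."""
    return a ^ (a >> 1)


def interleave(e, f, i, j):
    """Shuffle the i bits of e with the j bits of f, skipping null bits."""
    r = 0
    for b in range(max(i, j) - 1, -1, -1):
        if b < i:
            r = (r << 1) | ((e >> b) & 1)
        if b < j:
            r = (r << 1) | ((f >> b) & 1)
    return r


def reflection_index(x, y, i, j):
    return interleave(gray(x), gray(y), i, j)


def gray_index(x, y, i, j):
    return (gray(x) << j) | gray(y)      # concatenation G(x)G(y)


def shuffled_row_major(x, y, i, j):
    return interleave(x, y, i, j)


def has_adjacency(index, i, j):
    """Mesh neighbors (with wraparound) must be at Hamming distance 1."""
    for x in range(2 ** i):
        for y in range(2 ** j):
            a = index(x, y, i, j)
            for nx, ny in (((x + 1) % 2 ** i, y), (x, (y + 1) % 2 ** j)):
                if (nx, ny) != (x, y) and bin(a ^ index(nx, ny, i, j)).count("1") != 1:
                    return False
    return True


def has_squareness(index, i, j):
    """All nodes of each aligned 2**h x 2**h block share the bits above position 2h."""
    for h in range(1, min(i, j) + 1):
        seen = {}
        for x in range(2 ** i):
            for y in range(2 ** j):
                high = index(x, y, i, j) >> (2 * h)
                if seen.setdefault((x >> h, y >> h), high) != high:
                    return False
    return True


if __name__ == "__main__":
    for scheme in (reflection_index, gray_index, shuffled_row_major):
        print(scheme.__name__, has_adjacency(scheme, 3, 3), has_squareness(scheme, 3, 3))
```

On an $8 \times 8$ mesh this check agrees with the text: only the reflection index passes both tests.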
Next, the reflection code is defined. The reflection code, denoted $(p, q)$, for $(x, y)$ in a $2^i \times 2^j$ mesh, is also defined by two transforms.
The first transform, $G$, is defined the same as before, which gives us the Gray code $(e, f)$ for $(x, y)$. Then, applying the second transform, $R'$, to $(e, f)$, one obtains the reflection code $(p, q)$. $R'$ is defined in (5) below. Note that $e$ or $f$ has been augmented with null bits at higher bit positions to make the number of bits in both $e$ and $f$ equal to $k$, where $k = \max\{i, j\}$ as defined before. Thus, given a node $(x, y)$ in a mesh of size $2^i \times 2^j$, its reflection code $(p, q)$ is defined as follows:

$(p, q) = R'(G(x), G(y))$. (6)

Therefore, like the reflection index, the reflection code for a mesh of size $2^i \times 2^j$ can also be obtained by successive reflections. Because the reflection code is a 2-tuple $(p, q)$, however, one needs to do reflections on the two numbers $p$ and $q$ alternately. Further, the reflection code also has the properties of adjacency and squareness. Lemmas 3 and 4 below are similar to Lemmas 1 and 2, respectively.

$(P, Q) = (P, \theta E_0 F_0)$, (7)

where $\theta = Q_{t-1} \cdots Q_2$, and $t = \lceil Cl/4 \rceil$ is the number of bits in $Q$.
It is clear from Corollary 2 that the first variable, $P$, in the reflection codes for the $C$ children of any node in an X-tree has the same value. The second variable, $Q$, in their reflection codes is different only in the two lowest bits $Q_1 = E_0$ and $Q_0 = F_0$, where $E_0$ and $F_0$ are the least significant bits in the Gray code $(E, F)$ of each node, and $E$ is null for $C = 2$.
The reflection code and the reflection index defined above allow us to define two embeddings, presented in the next two sections, with optimal alignment cost and optimal expansion cost, respectively.
III. AN X-TREE EMBEDDING WITH OPTIMAL ALIGNMENT COST
Consider an X-tree of $L$ levels, with the root at level 0. Level [...]

Lemma 5: In an $L$-level X-tree, the $C$ children of $(p, q, l)$ are $(q, ps, l + 1)$, where [...]

Note that if $(e, f, l)$ is the parent of $(E, F, l + 1)$, then: [...]
Proof:
Let $(p, q)$ and $(P, Q)$ be the reflection codes for $(e, f)$ and $(E, F)$, respectively. Assuming that $l$ is even (the case for odd $l$ follows similarly), then, from (5), we have: [...] (9)

Compare (9), noting that $s = E_0 F_0$, with (7) to see the effect of squareness on obtaining Lemma 5, which, together with adjacency, is crucial to the proof of the following theorem.
Theorem 1: The X-tree embedding $M_1$, defined in (8), satisfies the alignment conditions H1 and H2.
Proof: To prove H1, notice that from Lemma 3, all the neighboring nodes of $(p, q, l)$ on level $l$ are at Hamming distance 1 from $(p, q, l)$. Thus, it is sufficient to show that all nodes at Hamming distance 1 from $(p, q, l)$ on level $l$ are mapped to the same row or column as $(p, q, l)$. Consider two nodes $(p, q, l)$ and $(p, \bar{q}, l)$ (or $(p, q, l)$ and $(\bar{p}, q, l)$), where $q$ and $\bar{q}$ (or $p$ and $\bar{p}$) differ by exactly one bit in their binary representations. From (8), if $l$ is even, the two nodes are mapped to the same row, determined by $p$ (or to the same column, determined by $q$), and if $l$ is odd, the two nodes are mapped to the same column, determined by $p$ (or to the same row, determined by $q$), thus proving H1.
To prove H2, note that from Lemma 5, node $(p, q, l)$ has $C$ children $(P, Q, l + 1) = (q, ps, l + 1)$. When [...] Thus, the parent $(p, q, l)$ and its children $(P, Q, l + 1)$ are mapped to the same row. Therefore, H2 is satisfied by embedding $M_1$, completing the proof of Theorem 1.

Note that because each node knows the reflection code of its neighbors (see the discussion at the beginning of this section), it is clear from the proof of Theorem 1 that each node knows whether a neighbor is mapped to the same row or to the same column. For example, if two neighboring nodes on the same level $l$ have the same $p$ value in their reflection codes, then the two nodes will be mapped to the same row if $l$ is even, or to the same column if $l$ is odd.

An important issue in embeddings is how efficiently the routing in the source architecture can be emulated in the target architecture, and what the incurred overhead is. To see this, let us assume that in one routing step, every node in the source X-tree sends a message to one of its neighbors. In the case of optical buses, because all nodes on a bus can write their messages on the bus simultaneously and all neighboring nodes in the source X-tree are aligned, each routing step in the source X-tree can be emulated in one step in the target network. Thus, the emulation incurs no overhead. Note that if two neighboring nodes in the source X-tree are not aligned, then each message transfer between the two misaligned nodes requires a message relay by an intermediate node. As discussed in Section I, up to $O(\sqrt{N})$ messages may have to be relayed by the same node, which subsequently increases the number of steps that are required to emulate one X-tree routing step. Therefore, aligning neighboring nodes in the X-tree significantly improves the efficiency of the emulation task.

In the case of exclusive access buses, only one distinct message can be written on a bus at any step. Thus, more than one step in the target array will be required to emulate one routing step in the X-tree. The number of steps taken in order to emulate one X-tree routing step will, in the worst case, be equal to the maximum number, $N_M$, of messages to be transmitted on the same bus in the target array. To find $N_M$, we consider three types of routing steps in the X-tree.
1) Each child node sends one message to its parent.

2) Each parent node broadcasts the same message to all of its children or sends one message to one of its children.

3) Each node sends one message to any neighbor on the same level.

These are the most commonly used routing patterns in X-tree algorithms. Let $N_a$, $N_b$, and $N_c$ be the maximum number of messages to be transmitted on the same bus for the three types of routing, respectively. Then $N_M = \max\{N_a, N_b, N_c\}$. Clearly, in type 1 routing, a node may receive $C$ messages, and in types 2 and 3 routing, each node receives at most one message. It can be shown that [...] Therefore, $N_M = N_c = C^{(L-1)/2} + C^{(L-3)/2}$, which is equal to the maximum number of nodes mapped to the same bus.
In order to allow messages to be transmitted on an exclusive access bus without conflict, a simple timing mechanism can be designed using the reflection code. That is, the step at which a node should transmit its message is determined by its identification $(p, q, l)$. Here we use type 3 routing, which corresponds to the worst-case overhead, to illustrate how this can be achieved. The cases for types 1 and 2 are similar. Observe from Figs. 7, 8, and 9 that nodes of at most two consecutive levels in the X-tree are mapped to the same bus in the target array. Thus, the access of a bus has to be arbitrated only among these nodes. Specifically, in type 3 routing, to transmit a message to a neighbor in the same row (the case for columns is similar), node $(p, q, l)$, where $l$ is even, transmits its message at the $(q + 1)$th step, and node $(p', q', l - 1)$ transmits its message at the $(p' + 1 + C^{l/2})$th step. Note that $C^{l/2} \le C^{(L-1)/2}$ and $p' + 1 \le C^{(L-3)/2}$. Thus, [...] (One may check row 1 (second row) in Fig. 8(b) for an example.) Using this bus arbitration technique, one type 3 routing step in the X-tree can be emulated in at most $C^{(L-1)/2} + C^{(L-3)/2}$ steps in the target array with exclusive access buses.
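The timing rule for a shared row bus can be illustrated with a short sketch. It assumes, as a reading of the bounds quoted above rather than something stated explicitly, that a row holds $C^{l/2}$ nodes of the even level $l$ (one per value of $q$) and $C^{(l-2)/2}$ nodes of level $l - 1$ (one per value of $p'$). The function names are hypothetical.

```python
def transmit_step(p, q, level, even_level, C):
    """1-based step at which a node writes to a row bus shared by levels l and l - 1.

    even_level is the even level l mapped to the bus. A node of level l
    transmits at step q + 1; a node of level l - 1 transmits at step
    p + 1 + C**(l / 2), after all level-l nodes have finished.
    """
    if level == even_level:
        return q + 1
    if level == even_level - 1:
        return p + 1 + C ** (even_level // 2)
    raise ValueError("node is not mapped to this bus")


def worst_case_steps(C, L):
    """Upper bound C**((L-1)/2) + C**((L-3)/2) for odd L."""
    return C ** ((L - 1) // 2) + C ** ((L - 3) // 2)


def row_schedule(C, even_level):
    """Steps used on one row bus, under the node-count assumption stated above."""
    width_even = C ** (even_level // 2)        # nodes of level l on the row
    width_odd = C ** ((even_level - 2) // 2)   # nodes of level l - 1 on the row
    steps = [transmit_step(0, q, even_level, even_level, C) for q in range(width_even)]
    steps += [transmit_step(p, 0, even_level - 1, even_level, C) for p in range(width_odd)]
    return steps


if __name__ == "__main__":
    steps = row_schedule(C=4, even_level=4)
    print(len(steps), max(steps), worst_case_steps(C=4, L=5))
```

Under these assumptions no two nodes on the bus share a step, and for the top two levels of a 5-level tree the last step used coincides with the stated bound.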
The above analysis shows that X-trees can be emulated much more efficiently on optical buses than on exclusive access buses. Note that the emulation of each of the three types of X-tree routing requires message transfers on both row and column buses in the target array. If we assume that message transmissions on row and column buses must be done in separate steps in the target array, then the number of emulation steps computed above doubles.
In the following, we analyze the alignment and expansion costs for $M_1$. Because embedding $M_1$ aligns all neighboring nodes of X-trees, its alignment cost, denoted $\Phi_{a,1}$, is optimal. That is, $\Phi_{a,1} = 0$.

To compute the expansion cost, $\Phi_{e,1}$, of embedding $M_1$, note that $\Phi_{e,1}$ can be improved without affecting the alignment cost by flipping over levels $L - 4$ through 0 as shown in Fig. 7(b). It can be shown that in the case of optical buses, such flipping does not affect the efficiency for emulating X-trees on target arrays. In the case of exclusive access buses, however, the efficiency will be reduced, because in this case, [...]

Assume that the embedding of an $L$-level X-tree ($L$ odd) occupies $m$ rows and $n$ columns in Fig. 7(b). Then $m = C^{(L-1)/2} + C^{(L-3)/2} + C^{(L-5)/2}$ and $n = C^{(L-1)/2} + C^{(L-3)/2}$. Thus,

$\Phi_{e,1} = \frac{mn}{(C^L - 1)/(C - 1)}$,

from which it is easy to show that $\Phi_{e,1}$ is asymptotically equal to 1.31 for $C = 2$ and 1.23 for $C = 4$. It is noted that with proper node labeling, the optimal alignment cost can also be achieved by using the well-known H-embedding in the case of X-binary-trees. The corresponding expansion cost, however, will be asymptotically equal to 2, which is much higher than that of $M_1$.
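The asymptotic values 1.31 and 1.23 can be reproduced numerically. The sketch below assumes that the target array has $m = C^{(L-1)/2} + C^{(L-3)/2} + C^{(L-5)/2}$ rows and $n = C^{(L-1)/2} + C^{(L-3)/2}$ columns, and that the X-tree has $(C^L - 1)/(C - 1)$ nodes; the leading term of $m$ is a reconstruction chosen because it reproduces the quoted constants. The function name is hypothetical.

```python
def expansion_cost_m1(C: int, L: int) -> float:
    """Expansion cost m * n / |V_S| of the first embedding for odd L >= 5.

    m and n follow the row and column counts assumed in the lead-in;
    |V_S| = (C**L - 1) / (C - 1) is the number of nodes in the X-tree.
    """
    if L % 2 == 0 or L < 5:
        raise ValueError("this sketch expects an odd L >= 5")
    m = C ** ((L - 1) // 2) + C ** ((L - 3) // 2) + C ** ((L - 5) // 2)
    n = C ** ((L - 1) // 2) + C ** ((L - 3) // 2)
    nodes = (C ** L - 1) // (C - 1)
    return m * n / nodes


if __name__ == "__main__":
    for C in (2, 4):
        print(C, [round(expansion_cost_m1(C, L), 4) for L in (5, 9, 13, 21)])
```

As $L$ grows, the ratio tends to $(1 + 1/C + 1/C^2)(1 + 1/C)(C - 1)/C$, which is 1.3125 for $C = 2$ and about 1.2305 for $C = 4$.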
The above results indicate that embedding $M_1$ achieves an optimal alignment cost, but its expansion cost is not optimal. In the next section, we present our second embedding of X-trees, which achieves an optimal expansion cost. That embedding, however, does not align all neighboring nodes in X-trees.
IV. AN X-TREE EMBEDDING WITH OPTIMAL EXPANSION COST
In this section, our second embedding of $L$-level ($L$ may be either odd or even) X-trees into 2-D processor arrays with spanning buses is presented. For this embedding, we assume that the number of nodes in each row of the target array is $n = C^{\lambda}$, where $\lambda$ is an integer parameter in the range $1 \le \lambda < L$. Like the first embedding $M_1$ presented in the previous section, our second embedding is again obtained by mapping each level of an X-tree into a target subarray of proper size so that the alignment condition H1 is satisfied. The resulting target subarrays are then properly positioned to form the target array (see [...]

Theorem 2: The X-tree embedding $M_2$ defined in (10) satisfies the alignment conditions H1 and H2, except that the parents on level $\lambda - 1$ and their children are not aligned.
The formal proof of this theorem is rather involved and is given in [11]. The analysis for the overhead incurred in emulating X-trees using embedding $M_2$ is similar to the case of $M_1$. It is noted that the emulation will not be as efficient, because not all neighboring nodes in X-trees are aligned in $M_2$. But it can be shown that at most one message relay by each node in the target network is necessary to emulate the routing between any two misaligned nodes. Thus, the increase in emulation overhead compared with $M_1$ is not significant.
Such analysis is omitted because of space limitation. In the following, we analyze the expansion cost, denoted $\Phi_{e,2}$ [...] When $C = 2$, corresponding to the case of X-binary-trees, we obtain:

$\Phi_{e,2} = 2^L / (2^L - 1)$.

That is, only one node is wasted, regardless of the choice of $\lambda$. When $C = 4$, corresponding to the case of X-quad-trees, the expansion cost depends on the choice of $\lambda$. Because it is desirable to make the target [...] from which it is easy to show that $\Phi_{e,2} = 1.03$ for $L = 5$, $\Phi_{e,2} = 1.002$ for $L = 9$, and $\Phi_{e,2}$ approaches 1 as $L$ grows large.
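The quoted values can be reproduced under one simple assumption about the target array, which is a guess made for illustration and not the paper's stated formula: the array has $C^{\lambda}$ columns and the smallest number of rows that can hold all $(C^L - 1)/(C - 1)$ nodes. For $C = 4$ the sketch uses $\lambda = (L - 1)/2$, which makes the array nearly square. The function names are hypothetical.

```python
def target_shape_m2(C: int, L: int, lam: int):
    """(rows, columns) of the smallest array with C**lam columns holding the X-tree.

    Taking the row count as a ceiling is an assumption of this sketch.
    """
    nodes = (C ** L - 1) // (C - 1)
    cols = C ** lam
    rows = -(-nodes // cols)          # ceiling division
    return rows, cols


def expansion_cost_m2(C: int, L: int, lam: int) -> float:
    """Expansion cost |V_T| / |V_S| under the shape assumption above."""
    rows, cols = target_shape_m2(C, L, lam)
    nodes = (C ** L - 1) // (C - 1)
    return rows * cols / nodes


if __name__ == "__main__":
    print(round(expansion_cost_m2(4, 5, 2), 2))
    print(round(expansion_cost_m2(4, 9, 4), 3))
```

Under this assumption the binary case wastes exactly one node for every admissible $\lambda$, and the quad-tree case gives a $22 \times 16$ array for $L = 5$ and a $342 \times 256$ array for $L = 9$.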
It is noticed that when the target network is restricted to a square array, the unity expansion cost is not achievable by any embedding, because the number of nodes in an X-quad-tree is $(4^L - 1)/3$, which is not a perfect square of any integer.

V. SUMMARY

Two embeddings of X-trees in 2-D processor arrays with spanning buses are presented in this short note. The first embedding has an optimal alignment cost, but its expansion cost can be as high as 1.31 and 1.23 for binary X-trees and pyramids, respectively. The second embedding asymptotically achieves an optimal expansion cost, but does not align all neighbors in the embedded X-tree. This demonstrates a trade-off between the expansion cost and the alignment cost for graph embeddings in processor arrays with spanning buses.
