Abstract. Let G and H be two mesh-connected arrays of processors, where $G = g_1 \times g_2 \times \cdots$
1. Introduction
A network of processors H is said to m-simulate another network G if and only if every step of G can be simulated with O(m) steps of H. Note that, if H can m-simulate G, then any problem that G solves in time T can be solved by H in time O(mT). Establishing simulation results is an important research topic, since it enables the transport of algorithms written for one network to another [3, 5, 10]. This paper deals with the problem of simulating a mesh $G = g_1 \times \cdots \times g_t$ (called the guest) by another mesh $H = h_1 \times \cdots \times h_d$ (called the host), where $|G| = g_1 \cdots g_t \le h_1 \cdots h_d = |H|$. Since any mesh $x_1 \times \cdots \times x_s$ is the same as the mesh $x_{\sigma(1)} \times \cdots \times x_{\sigma(s)}$ for any permutation $\sigma$ of $\{1, \ldots, s\}$, for the rest of this paper we assume, without loss of generality, that $h_1 \ge \cdots \ge h_d$ and $g_1 \ge \cdots \ge g_t$. We also assume, without loss of generality, that both G and H are d-dimensional (i.e., $t = d$). This is justified because we can always introduce additional dimensions of length 1; for example, if $t < d$, then we consider G to be the mesh $g_1 \times \cdots \times g_d$, where $g_{t+1} = \cdots = g_d = 1$.
The model of a d-dimensional mesh is introduced in Section 2. Sections 3 and 4 establish that mesh H can $\alpha$-simulate mesh G, where $\alpha = \max_{1 \le i \le d} \left( \frac{g_{i+1} \cdots g_d}{h_{i+1} \cdots h_d} \right)^{1/i}$, and that this bound is optimal to within a constant factor. Throughout, we adopt the notational convention that, for $i = d$, $g_{i+1} \cdots g_d / h_{i+1} \cdots h_d = 1$, which simplifies many definitions. In previous work [2], Atallah considered a special case of this problem, that in which $|H| = |G|$ and either H is a cube (i.e., $h_1 = \cdots = h_d$) or G is a cube. It was shown in [2] that if G is a cube, then H can $h_1/g_1$-simulate G, and that if H is a cube, then H can 1-simulate G. Both of these results are special cases of our general simulation result, which constitutes a nontrivial generalization of the results of [2].
The result in Section 5 is relevant to the extensive work that has been done in the area of minimum-cost encodings of one graph in another graph [1, 4, 6-9]. The lower bound part of the simulation result of Sections 3 and 4 implies that the worst-case cost of encoding G in H is $\Omega(\alpha)$, where $\alpha$ is as defined above. It is then natural to ask whether there is an encoding of G in H whose average cost is $o(\alpha)$. We settle this issue by proving that any encoding of G in H must have average cost $\Omega(\alpha)$. Only special cases of this result were previously known: In [4], for the case of $G = \sqrt{n} \times \sqrt{n}$ and $H = n \times 1$, an optimal $\Omega(\sqrt{n})$ bound was established. In [9] this was generalized to the case $G = n^{1/d} \times \cdots \times n^{1/d}$ and $H = n \times 1 \times \cdots \times 1$, and an optimal $\Omega(n^{1-1/d})$ bound was established. Both of these results follow from our general result.
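As a quick numerical check of the definition of $\alpha$, the following minimal Python sketch evaluates the maximum directly. The function name `alpha` and the list-based representation of the dimensions are illustrative assumptions, not notation from the paper; the inputs are assumed to be sorted in nonincreasing order and to have equal length d.

```python
from math import prod


def alpha(g, h):
    """Evaluate max over i = 1..d of (g_{i+1}...g_d / h_{i+1}...h_d) ** (1/i).

    g and h are lists of the guest and host dimensions (hypothetical
    representation), both of length d and sorted in nonincreasing order.
    """
    assert len(g) == len(h)
    d = len(g)
    best = 0.0
    for i in range(1, d + 1):
        # With 0-based slicing, g[i:] holds g_{i+1}, ..., g_d. For i = d the
        # slices are empty and prod() returns 1, matching the convention
        # that the ratio equals 1 when i = d.
        ratio = prod(g[i:]) / prod(h[i:])
        best = max(best, ratio ** (1.0 / i))
    return best
```

For instance, with n = 16, `alpha([4, 4], [16, 1])` evaluates the case of a $\sqrt{n} \times \sqrt{n}$ guest and an $n \times 1$ host and yields $\sqrt{n} = 4$, while `alpha([16, 1], [4, 4])` (a cube host) yields 1.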
2. The d-Dimensional Mesh of Processors
In a d-dimensional mesh of processors, the processors operate synchronously and are positioned on an $h_1 \times \cdots \times h_d$ grid, one processor per grid point. A processor is denoted by its position in the grid, a typical one being denoted by $(i_1, \ldots, i_d)$ where $1 \le i_k \le h_k$ for every $k \in \{1, \ldots, d\}$. Processors $(i_1, \ldots, i_d)$ and $(j_1, \ldots, j_d)$ are neighbors if and only if $|i_1 - j_1| + |i_2 - j_2| + \cdots + |i_d - j_d| = 1$. Processors $(i_1, \ldots, i_d)$ and $(j_1, \ldots, j_d)$ are neighbors along dimension k if and only if they are neighbors and $|i_k - j_k| = 1$. Note that a processor cannot have more than 2d neighbors (processors at the boundary have fewer). A step of such a mesh consists either of each processor communicating with a neighbor by sending/receiving the contents of a register (a data movement step), or of each processor performing a computation within its own registers (a computation step). A data movement is a sequence of data movement steps. A processor has a fixed (i.e., O(1)) number of storage registers. Some researchers assumed that a register can store up to log n bits, whereas others limited the size of a register to O(1) bits; our results hold for either model. Two meshes A and B are equivalent if A can 1-simulate B and B can 1-simulate A. Note that "stretching" the dimensions of a mesh by constant factors results in an equivalent mesh; for example, an $h_1 \times h_2$ mesh and a $(c_1 h_1) \times (c_2 h_2)$ mesh can 1-simulate each other if $c_1$ and $c_2$ are constants.
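The neighbor relation just defined can be illustrated with a short Python sketch. The function name `neighbors` and the tuple representation of processor positions are assumptions made for illustration only; coordinates are 1-based, as in the text.

```python
def neighbors(p, h):
    """List the neighbors of processor p in an h[0] x ... x h[d-1] mesh.

    p is a tuple of 1-based coordinates (hypothetical representation).
    Two processors are neighbors exactly when their coordinates differ
    by 1 in a single dimension and agree in all the others.
    """
    result = []
    for k in range(len(h)):
        for step in (-1, 1):
            c = p[k] + step
            # Stay inside the grid: boundary processors have fewer neighbors.
            if 1 <= c <= h[k]:
                result.append(p[:k] + (c,) + p[k + 1:])
    return result
```

In a 3 x 3 mesh, the corner processor (1, 1) has two neighbors and the center processor (2, 2) has four, in line with the bound of 2d neighbors.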
Throughout this paper d, the number of dimensions, is assumed to be a constant; that is, d = O(1). Since the case d = 1 is trivial, we also assume $d \ge 2$.
If every processor is viewed as a vertex of a graph and every communication line between two neighboring processors is viewed as an undirected edge, then a mesh can alternatively be viewed as an undirected graph.
In order to establish the simulation upper bounds, we use the idea of embedding the guest G into the host H, that is, assigning every processor of G to a processor of H that will mimic its behavior (i.e., simulate it) during the complete computation.
A processor of H might be simulating more than one processor of G in this way, but (because of the storage limitation) it cannot simulate more than a constant number of processors of G. Although this type of embedding-based simulation will enable us to establish the desired upper bounds, our lower bound proof holds for any type of simulation, including simulations where the embedding changes dynamically during the computation, and simulations where more than one processor of H can simulate the same processor of G.
In this paper we have not considered the case $|H| = o(|G|)$ since, in such a case, H cannot even store as much information as G because each processor of H has O(1) storage registers. Relaxing the standard assumption of limited storage per processor gives rise to interesting questions that are not within the scope of this paper.
3. A Simulation Lemma
Here we establish a lemma that is crucial to the simulations of Section 4: (ii) $g_1 g_2 \cdots g_d = h_1 h_2 \cdots h_d$.
Then H can 1-simulate G.
PROOF. The proof is by induction on d:
Basis. d = 2. From (i) and (ii), we have $g_1 \ge h_1 \ge h_2 \ge g_2$. Embed G in H as follows: Partition H into $h_2/g_2$ ($= g_1/h_1 \ge 1$) rectangular slabs of dimensions $h_1 \times g_2$ each. Now, "snake" G through these slabs in the manner depicted in Figure 1b.
Conditions (b) and (c) guarantee that we can indeed partition $h_2 \times \cdots \times h_d$ into $(h_2/\hat{h}_2) \cdots (h_d/\hat{h}_d)$ pieces, each of which is $\hat{h}_2 \times \cdots \times \hat{h}_d$ and has the same volume ($= g_2 \cdots g_d$) as the base of G. (Condition (a) will be useful later, when we eventually require that a piece can 1-simulate the base of G.) This partition of $h_2 \times \cdots \times h_d$ into pieces induces a partition of H into host slabs each of which is $h_1 \times \hat{h}_2 \times \cdots \times \hat{h}_d$, where we view $h_1$ as being the host slab's depth and $\hat{h}_2 \times \cdots \times \hat{h}_d$ as being its base.
We would like the host slab base to be capable of 1-simulating the base of G; that is, $\hat{h}_2 \times \cdots \times \hat{h}_d$ must be capable of 1-simulating $g_2 \times \cdots \times g_d$. The above conditions, along with (d) below, will enable us to use the induction hypothesis and embed the base of G in the base of the host slab such that the latter 1-simulates the former.
(d) $g_2 \cdots g_i \ge \hat{h}_2 \cdots \hat{h}_i$ for every $i \in \{2, \ldots, d\}$.
We now begin describing the embedding of G in H, which enables the latter to 1-simulate the former. First embed the base of G in the base of a "corner" slab (e.g., in the lowest leftmost slab base in Figure 2b). Then, with its own base embedded in that of a slab, we snake G back and forth through the slabs until it is completely embedded in H. This snaking is an obvious generalization of that done in Figure 1b, and requires that the depth of a host slab ($= h_1$) be no smaller than the longest dimension of its base ($= \hat{h}_2$) in order for G to shift smoothly from one slab to another (this condition was satisfied in Figure 1b since we had $g_2 \le h_1$).
Therefore, we need (e) $\hat{h}_2 \le h_1$.
If we could find $\hat{h}_2, \ldots, \hat{h}_d$ to satisfy conditions (a)-(e) above, then the above embedding would clearly enable H to 1-simulate G. A data movement step in G along its first dimension can obviously be simulated in O(1) steps in H, while a data movement step along any of the remaining d - 1 dimensions of G can also be simulated in O(1) steps in H because, by the induction hypothesis, the base of a slab in H can 1-simulate the base of G.
We are left with the problem of finding $\hat{h}_2, \ldots, \hat{h}_d$ such that conditions (a)-(e) above are satisfied. We choose $\hat{h}_2, \ldots, \hat{h}_d$ as follows:
Now we prove that the above choice for $\hat{h}_2, \ldots, \hat{h}_d$ satisfies all the conditions (a)-(e), which completes the proof of Lemma 3.1. That every $\hat{h}_i$ is a power of 2 can easily be established by induction on i, since every $g_i$ and $h_i$ is also a power of 2. Since $\hat{h}_i = \min(h_i,\ g_2 g_3 \cdots g_i / \hat{h}_2 \hat{h}_3 \cdots \hat{h}_{i-1})$, we have $\hat{h}_i \le h_i$, establishing (b), and $\hat{h}_i \le g_2 g_3 \cdots g_i / \hat{h}_2 \hat{h}_3 \cdots \hat{h}_{i-1}$, establishing (d). Condition (e) holds since $\hat{h}_2 \le h_2$ and $h_2 \le h_1$.
PROOF OF (a). We prove that $\hat{h}_i \ge \hat{h}_{i+1}$ by a case analysis: Case 2. $\hat{h}_{i+1} = g_2 \cdots g_{i+1} / \hat{h}_2 \cdots \hat{h}_i$. Then $\hat{h}_2 \cdots \hat{h}_{i+1} = g_2 \cdots g_{i+1} \ge q\, h_2 \cdots h_{i+1}$, where we used (3.1).
Thus, property (c) holds, which completes the proof of Lemma 3.1. □
4. The Main Simulation Result
In this section we exploit the above simulation lemma to prove the following theorem:
THEOREM 1. Let $H = h_1 \times \cdots \times h_d$ and $G = g_1 \times \cdots \times g_d$, with $h_1 \ge \cdots \ge h_d$, $g_1 \ge \cdots \ge g_d$, and $h_1 \cdots h_d \ge g_1 \cdots g_d$. Then mesh H can $\alpha$-simulate mesh G, where $\alpha = \max_i \left( \frac{g_{i+1} \cdots g_d}{h_{i+1} \cdots h_d} \right)^{1/i}$. (Recall our notational convention that, for $i = d$, $g_{i+1} \cdots g_d / h_{i+1} \cdots h_d = 1$.) This bound is optimal to within a constant factor.
Basis. d = 2. We begin with the case in which $\delta g_1 \le n$; that is, $I = (\delta g_1) \times \beta$. G consists of $g_1$ adjacent columns of length $g_2$ each (see Figure 3a). Now, snake these columns one after the other in I, as depicted in Figure 3b. Note that each snaked column occupies a horizontal width of $g_2/\beta = \delta$ columns in I. A data movement step between adjacent processors in the same column of G can clearly be simulated with O(1) steps of I. It is trivial to design a data movement that takes $O(\delta)$ steps on I and simulates a data movement step between adjacent processors on the same row of G.
If $\delta g_1 > n$, then $I = (g_1 g_2) \times 1$, and we can go through the same simulation as for the first case, except that $g_2$ and 1 now play the roles that (respectively) $\delta$ and $\beta$ played in the previous case. Hence I can $g_2$-simulate G. Since $\delta g_1 > g_1 g_2$, we have $\delta > g_2$, and hence I can $\delta$-simulate G.
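The column snaking used in the two-dimensional basis can be sketched in a few lines of Python. This is one possible reading of the construction, under the assumptions that I is a $(\delta g_1) \times \beta$ mesh with $\delta \beta = g_2$ and that each column of G is folded boustrophedon-style into its own block of $\delta$ consecutive columns of I. The function name `snake_column` and the 0-based coordinates are illustrative choices, not the paper's.

```python
def snake_column(c, k, beta, delta):
    """Map the processor at position k of column c of G to (x, y) in I.

    All indices are 0-based (an assumption of this sketch). Column c of G,
    of length g2 = beta * delta, is folded into the block of I made of
    columns c*delta .. c*delta + delta - 1 and rows 0 .. beta - 1.
    """
    j, t = divmod(k, beta)          # j: which fold, t: offset inside the fold
    # Even folds run in increasing y, odd folds in decreasing y, so that
    # consecutive positions of the guest column stay adjacent in I.
    y = t if j % 2 == 0 else beta - 1 - t
    x = c * delta + j
    return (x, y)
```

Under this mapping, consecutive processors of a guest column land on adjacent processors of I, and processors at the same position of two adjacent guest columns land exactly $\delta$ apart, consistent with the O(1) and $O(\delta)$ costs stated above.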
Inductive step. We begin with case (2), $\delta g_1 \le n$. That is, we want to show that $I = (\delta g_1) \times \cdots \times (\delta g_m) \times \beta \times 1 \times \cdots \times 1$ can $\delta$-simulate $G = g_1 \times \cdots \times g_d$, where $g_1 \ge \cdots \ge g_d$, $\beta \le \delta g_{m+1}$, and $(\delta g_1) \cdots (\delta g_m) \beta = g_1 \cdots g_d$. Divide I along its first dimension (i.e., using hyperplanes orthogonal to the first dimension) into $g_1$ consecutive submeshes (which we call I-chunks), each of which is a $\delta \times (\delta g_2) \times \cdots \times (\delta g_m) \times \beta \times 1 \times \cdots \times 1$ mesh. Similarly divide G along its first dimension into $g_1$ consecutive submeshes (which we call G-chunks), each of which is a $1 \times g_2 \times \cdots \times g_d$ mesh. We use each I-chunk to $\delta$-simulate a G-chunk (more on how this is done is described later). Of course, for this simulation, the G-chunks are assigned to the I-chunks in consecutive order. First observe that two processors of G that are neighbors along G's first dimension are simulated by two processors of I that are in two consecutive I-chunks, and that one data movement step in G between such processors can be simulated in I by a data movement in time proportional to the width of an I-chunk along its first dimension, that is, $O(\delta)$ (we omit the detailed specification of this easy data movement). We still need to show that a data movement step of G along its second, third, ..., or dth dimension can also be simulated by $O(\delta)$ steps of I. Since each such data movement is between processors in the same G-chunk, it suffices to show that an I-chunk can $\delta$-simulate a G-chunk. Therefore it suffices to show that an I-chunk can 1-simulate J, and this follows from Lemma 3.1.
We now consider case (1), $\delta g_1 > n$. Partition G and I into chunks as in the prior case. An I-chunk is now an $(n/g_1) \times 1 \times \cdots \times 1$ mesh and hence can $n/g_1$-simulate a G-chunk (because a 1-dimensional array of x processors can always x-simulate any other x-processor mesh by circularly rotating the data in O(x) time for every step of the other mesh). Since the width of an I-chunk along its first dimension is also $n/g_1$, it follows that I can $n/g_1$-simulate G. Since $\delta > n/g_1$, I can $\delta$-simulate G. □
This proves Claim 4.2. Thus H can $\alpha$-simulate G when $|H| = |G|$ and every $g_i$ and $h_i$ is a power of 2.
Suppose not all the $g_i$'s and $h_i$'s are powers of two. It is easy to find a mesh $A = a_1 \times \cdots \times a_d$ such that (i) every $a_i$ is a power of two, (ii) $2^{-1} h_i \le a_i \le 2 h_i$ for every $i \in \{1, \ldots, d\}$, and (iii) $2^{-1} h_1 \cdots h_i \le a_1 \cdots a_i \le 2 h_1 \cdots h_i$ for every $i \in \{1, \ldots, d\}$.
(To obtain such a mesh A, for $i = 1, \ldots, d$, let $a_i$ be either the smallest power of two $\ge h_i$, or the largest power of two $\le h_i$, choosing the alternative that preserves property (iii).) Observe that (ii) implies that A and H are equivalent (i.e., can 1-simulate each other). Let $B = b_1 \times \cdots \times b_d$ be to G what A is to H. Since A and B are equivalent to H and G, respectively, it suffices to prove that A can $\alpha$-simulate B. First observe that the numbers of processors of A and B are equal to within a multiplicative constant of (at most) 4, so that one of them can be "scaled up" to have the same number of processors as the other (by multiplying its largest dimension by 1, 2, or 4). Suppose this has already been done (i.e., $|A| = |B|$). Then A and B are still equivalent to H and G, respectively, and moreover, every $a_i$ and $b_j$ is a power of two. Consequently, A can $\beta$-simulate B, where $\beta = \max_i \left( \frac{b_{i+1} \cdots b_d}{a_{i+1} \cdots a_d} \right)^{1/i}$. Since $\beta = O(\alpha)$, it follows that A can $\alpha$-simulate B, which completes the proof of Lemma 4. This completes the proof of the upper bound part of Theorem 1.
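The rounding of each dimension to a power of two can be made concrete with the following Python sketch. The text only says to choose the alternative that preserves property (iii); the specific rule used here (round down when the running product of the $a_i$ is at least the running product of the $h_i$, round up otherwise) is an assumption of this sketch, as is the function name `round_to_powers_of_two`.

```python
def round_to_powers_of_two(h):
    """Return a list a of powers of two with a[i] within a factor 2 of h[i]
    and every prefix product of a within a factor 2 of that of h.

    The tie-breaking rule below is one possible choice (an assumption),
    not necessarily the one intended in the text.
    """
    a = []
    prod_a, prod_h = 1, 1
    for hi in h:
        down = 1 << (hi.bit_length() - 1)        # largest power of two <= hi
        up = down if down == hi else 2 * down    # smallest power of two >= hi
        # If the prefix of a is currently too large (or equal), round down;
        # if it is too small, round up. Either way the prefix ratio stays
        # within [1/2, 2].
        ai = down if prod_a >= prod_h else up
        a.append(ai)
        prod_a *= ai
        prod_h *= hi
        assert prod_h <= 2 * prod_a and prod_a <= 2 * prod_h  # property (iii)
    return a
```

For example, `round_to_powers_of_two([12, 5, 3])` returns `[8, 8, 2]`, whose prefix products 8, 64, 128 are within a factor of two of 12, 60, 180.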
4.3 LOWER BOUND PROOF. Now we establish the lower bound part of Theorem 1 by proving that $\alpha$ is, to within a constant factor, the best bound achievable by any simulation of G by H. Note that the simulations we gave so far have the property that a processor q of G is simulated by exactly one processor p of H, and that p simulates q throughout the computation. Our lower bound proof holds not only for such "embedding-like" simulations, but also for any simulation where every processor p (there may be many such p's) that is simulating q's condition at a certain instant of time must store within its registers a complete description of q's condition at that instant of time. Thus, a processor of G can be "simulated" by many processors of H at any particular instant of time, and this assignment of processors can change dynamically for the class of simulations that the lower bound holds for.
Let there exist a $\beta$-simulation of G by H, and focus on the situation in G at some instant of time t during the simulation (any t will do). If the condition of a processor q in G at time t depends on the condition, at time $t - g_i$, of a set Q of processors in G, then $|Q| \ge (g_i)^i g_{i+1} \cdots g_d$ (because every processor located at a distance less than $g_i$ can influence q). Since H $\beta$-simulates G, the condition of a processor p in H, which simulates q's condition at time t, depends on the condition of a set P of processors in H, where $|P| = O((\beta g_i)^i h_{i+1} \cdots h_d)$ (because any processor located at a distance greater than this number of steps cannot influence p). In addition, for H to simulate G properly, we must have $|P| = \Omega(|Q|)$. Thus, $(\beta g_i)^i h_{i+1} \cdots h_d = \Omega((g_i)^i g_{i+1} \cdots g_d)$, which gives $\beta = \Omega\left( \left( \frac{g_{i+1} \cdots g_d}{h_{i+1} \cdots h_d} \right)^{1/i} \right)$ for every i.
and $\varepsilon(e_2)$ share an endpoint in H. The worst-case cost of the encoding $\varepsilon$ is $\mathrm{WCOST}(\varepsilon) = \max_{e \in E_G} \mathrm{length}(\varepsilon(e))$, where $\mathrm{length}(\varepsilon(e))$ denotes the length of the path $\varepsilon(e)$. The average-case cost of the encoding $\varepsilon$ is $\mathrm{ACOST}(\varepsilon) = |E_G|^{-1} \sum_{e \in E_G} \mathrm{length}(\varepsilon(e))$.
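To make the two cost measures concrete, the following Python sketch computes them for one simple encoding: the row-major placement of an s x s mesh on a line of $s^2$ vertices, with each guest edge mapped to the path between the images of its endpoints. The choice of encoding, the function name `row_major_costs`, and the 0-based indexing are illustrative assumptions; the sketch is not claimed to be an optimal encoding.

```python
def row_major_costs(s):
    """Return (WCOST, ACOST) of the row-major encoding of an s x s mesh
    in a line of s*s vertices (an illustrative encoding, assumed here).

    On a line, the path between two vertices is unique, so the length of
    the image of an edge is the absolute difference of the two positions.
    """
    def position(r, c):
        return r * s + c            # row-major index of guest vertex (r, c)

    lengths = []
    for r in range(s):
        for c in range(s):
            if c + 1 < s:           # edge to the right-hand neighbor
                lengths.append(abs(position(r, c + 1) - position(r, c)))
            if r + 1 < s:           # edge to the neighbor in the next row
                lengths.append(abs(position(r + 1, c) - position(r, c)))
    return max(lengths), sum(lengths) / len(lengths)
```

For s = 4 this gives a worst-case cost of 4 and an average cost of 2.5: half of the edges have length 1 and the other half have length s.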
Here we are interested in the case in which G is a $g_1 \times \cdots \times g_d$ mesh and H is an $h_1 \times \cdots \times h_d$ mesh, with $h_1 \ge \cdots \ge h_d$, $g_1 \ge \cdots \ge g_d$, and $|H| = |V_H| = h_1 \cdots h_d \ge g_1 \cdots g_d = |V_G| = |G|$. The embeddings of G in H that we used in the simulation results of Sections 3 and 4 are not encodings in the above sense, since we allowed them to map O(1) vertices (edges) of G into a single vertex (path) of H (whereas in an encoding the mapping is one to one). Of course, the embeddings of Sections 3 and 4 are quite appropriate for simulation purposes, because they capture the fact that (1) the dimensions of the mesh can be stretched/shrunk by a constant factor without any change in its computational power, and (2) for simulation purposes it is perfectly acceptable for one host processor to simulate more than one guest processor. Moreover, the lower bound part of the proof of Theorem 1 can easily be seen to hold for encodings, and it leads to a proof of the fact that, for any encoding $\varepsilon$ of G in H, we have $\mathrm{WCOST}(\varepsilon) = \Omega(\alpha)$,
where $\alpha$ is as in Theorem 1. In addition, the proof of the simulation result of and c, and the study of this trade-off is an interesting research issue, but one that is beyond the scope of this paper. Reference [1] investigates this trade-off for the problem of encoding a two-dimensional rectangle in a two-dimensional square.
The following theorem, which establishes a tight lower bound on the average cost, is the main result of this section.
PROOF. The number of such vertices v is no larger than the surface area of an $x \times \cdots \times x$ (k-dimensional) cube, which is $O(x^{k-1})$. □
PROOF. We begin by describing a transformation that, when applied to S, may modify its shape, but does not cause any increase in $|\Gamma(S)|$. We call this transformation COMPRESS.
The COMPRESS Transformation. Perform COMPRESS(1), followed by COMPRESS(2), ..., followed by COMPRESS(d), where COMPRESS(j) consists of "compressing" S along the jth dimension toward the lower end of that dimension (i.e., in the direction of lower values of the jth coordinate). In other words, for every segment of G parallel to the jth dimension (there are $g_1 \cdots g_{j-1} g_{j+1} \cdots g_d$ such segments, and each of them has length $g_j$), "slide" the points of S on that segment so that they occupy adjacent positions at the beginning of the segment. To see that COMPRESS(j) does not increase $|\Gamma(S)|$, simply note that (1) after compression along one segment has occurred, only one of the compressed points on that segment has a neighbor along the jth dimension that is not in S, whereas there were $\ge 1$ such points before the compression along that segment; (2) if $H_1$ and $H_2$ are adjacent segments containing $n_1$ and (respectively) $n_2$ points of S, then after COMPRESS(j) there are exactly $|n_2 - n_1|$ edges between $H_1$ and $H_2$ joining a point in S to one not in S, whereas there were $\ge |n_2 - n_1|$ such edges before COMPRESS(j). Also observe that, once COMPRESS(j) has been done, S remains "compressed" along the jth dimension after we perform COMPRESS(j + 1), ..., COMPRESS(d).
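The COMPRESS transformation lends itself to a short Python sketch. Here points are tuples of 0-based coordinates, and the boundary is measured by counting mesh edges with exactly one endpoint in S, following the edge-counting argument above; this measure, together with the function names `compress` and `boundary_edges`, is an assumption of the sketch rather than the paper's definition of $\Gamma(S)$.

```python
def compress(S, d):
    """Apply COMPRESS(1), ..., COMPRESS(d) to a set S of d-dimensional
    grid points (tuples of 0-based coordinates; hypothetical representation).
    """
    S = set(S)
    for j in range(d):
        # Count the points of S on every segment parallel to dimension j;
        # a segment is identified by the coordinates other than the jth.
        counts = {}
        for p in S:
            key = p[:j] + p[j + 1:]
            counts[key] = counts.get(key, 0) + 1
        # Slide the points of each segment to its lowest positions.
        S = {key[:j] + (t,) + key[j:]
             for key, n in counts.items() for t in range(n)}
    return S


def boundary_edges(S, dims):
    """Count mesh edges joining a point in S to a point not in S."""
    S = set(S)
    count = 0
    for p in S:
        for k in range(len(dims)):
            for step in (-1, 1):
                q = p[:k] + (p[k] + step,) + p[k + 1:]
                if 0 <= q[k] < dims[k] and q not in S:
                    count += 1
    return count
```

On any input set, `compress` preserves the number of points, and the value returned by `boundary_edges` for the compressed set is expected to be no larger than for the original set, which is what the two observations above assert.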
For the rest of this proof, we assume S has already been COMPRESSed. Now, partition G into $g_d$ slices each of which is $g_1 \times \cdots \times g_{d-1}$, and let $G_i$ denote the ith slice. That is, $G_i$ contains all the vertices of G whose coordinates are of the form $(j_1, \ldots, j_{d-1}, i)$. Let $S_i$ be the projection of $S \cap G_i$ onto hyperplane $G_1$; that is, $S_i = \{(i_1, \ldots, i_{d-1}, 1) : (i_1, \ldots, i_{d-1}, i) \in S\}$.
PROOF OF FACT 5.1. We show that $S_j - S_i = \emptyset$ for any j, i such that $g_d \ge j > i \ge 1$. Suppose to the contrary that vertex p is in $S_j$, but not in $S_i$. Then consider the segment of G that is parallel to the dth dimension (and hence has length $g_d$) and contains p: The jth vertex on that segment is p and belongs to S, while the ith vertex does not belong to S. Now i < j contradicts the fact that S is COMPRESSed along the dth dimension. This proves Fact 5. 
