The paper considers computational domains structured as a 3D grid of cells. It presents a cell-to-hypercube map that is useful for implementing the Alternating Direction Method (ADM). The map is shown to be perfectly load-balanced, and to optimally preserve adjacencies between cells in the computational domain.
Introduction
Twizell [11] notes, The ADM was introduced by Peaceman and Rachford [6] for the numerical solution of elliptic and parabolic partial differential equations.
These methods are used for the solution of the Dirichlet problem, the Neumann problem, and the Robbins problem, among others [11] . The Cartesian space is taken to be the physical problem domain and the solution is sought over a rectangular box. Other geometries are accommodated by transforming the cartesian domain into other curvilinear coordinates. The ADM's computational domain D consists of a set of points which are uniformly spaced over the problem domain. These points are called grid points and are indexed by positive integers (i, j, k) such that 1 ≤ i ≤ I, 1 ≤ j ≤ J, and 1 ≤ k ≤ K, where I, J, and K are positive constants.
Abstractly, each iteration of the procedure consists of 3 steps:
1. Perform I × J independent computations (e.g., solving tridiagonal systems) each of size K.
2. Perform J × K independent computations (e.g., solving tridiagonal systems) each of size I.
3. Perform I × K independent computations (e.g., solving tridiagonal systems) each of size J. Considering step 1 above, the I × J independent computations are indexed by (i, j) and can be carried out in parallel. For a fixed value of (i, j), the computation typically requires values associated with grid points with indices (i, j, k), for 1 ≤ k ≤ K, and values associated with neighboring grid points. For example, a typical computation involves solving a tridiagonal linear system of equations where the coefficients are associated with the grid points along the line (i, j, k), for 1 ≤ k ≤ K. The computation "sweeps" along the line from (i, j, 1) to (i, j, K) during the elimination phase, and then "sweeps" along the line from (i, j, K) to (i, j, 1) during the back substitution phase. Similar comments apply to steps 2 and 3 where the "sweeps" traverse coordinate directions i and j, respectively.
When implementing the ADM, an important, if not overriding, consideration is the processor communication resulting from the assignment of grid points to processors [2, 5, 8, 4] . Considering only step 1 of an iteration, it is desirable to map the computational domain onto the hypercube so that all grid points with equal i and j coordinates are assigned to the same node. If this were so, the computation associated with a particular i and j would be completed using little or no communication with other nodes. Considering only step 2 of an iteration, it is desirable to map the computational domain onto the hypercube so that all grid points with identical j and k coordinates are assigned to the same node. Finally, considering only step 3 of an iteration, we want to map all grid points with identical i and k coordinates to the same node.
In Fig. 1 , we show the 3 different partitions of the computational domain into "cells" which would accommodate the above computations. According to step 1, we prefer cells shaped like the one labeled z-sweep; according to step 2, we prefer cells shaped like the one labeled x-sweep; and lastly, for step 3, we prefer cells shaped like the one labeled y-sweep.
The 3 different partitions are not compatible, and it is costly to change the partition (and the assignment of cells to processors) between each step of an iteration. We thus are motivated to seek a cell size and shape and assignment of cells to processors that can be maintained throughout the computation and which accommodates an efficient implementation of each step of an iteration. In this section we discuss the effect of "cell-to-node" mappings on the efficiency of our implementation.
The efficiency depends on a number of factors [5] :
Load balancing: Assign an equal amount of computational work to each node. A d-dimensional hypercube is a graph of 2 d nodes(processors), each numbered with a distinct nonnegative integer less than 2 d , with an edge between a pair of nodes if the binary representations of their node numbers differ in exactly one bit position. Two nodes are said to be adjacent if there is an edge between them. The distance between two nodes is equal to the number of bit positions in which the representations of the node numbers differ. We are concerned about the distance between two nodes since it is equal to the minimum number of communication links that must be traversed by a message transmitted between the two nodes [1, 10] .
A cell-to-node mapping assigns a node number to each cell in the computational domain. We say that a cell-to-node mapping is adjacency preserving if it maps adjacent cells into adjacent nodes.
Cell-to-node mappings are represented by labeling each cell with the hypercube node number assigned to the cell. That is, if θ is a cell-to-node mapping then the cell with indices , and 7, respectively. When nodes 4, 5, 6, and 7 take over the elimination phase, the initially active nodes become idle. Indeed, it is easy to see that at least half of the nodes are idle at every point in time. This is also true for steps 2 and 3 of the ADM using this cell-to-node mapping.
Next consider a 2-dimensional hypercube, and a computational domain partitioned into 8 cells. A cell-to-node mapping for this case is shown in Fig. 3 . Since each node is assigned a cell with c = 0, all nodes have work to do, and are initially active during the first part of the elimination phase of step 1. Furthermore, all the nodes are assigned a cell with c = 1, so that all the nodes remain active after they are finished with cells having c = 0. The situation is the same on the back-substitution phase.
Step 2 is also favorable since all the nodes are assigned cells with a = 0 and a = 1.
Step 3 is not as favorable since only half of the nodes are assigned cells with b = 0. Therefore, at least half of the nodes are idle throughout step 3.
These examples demonstrate the nature of the relationship between the cell-to-node mapping and the potential efficiency of our implementation.
It follows that the best case would be to have every node assigned exactly one cell for each different value for a, b, and c and have adjacent cells map into adjacent nodes or the same node.
This would mean that in each step, every node would be busy except for the time during the transmission of intermediate results to neighboring cells. We have not achieved this condition in the previous example since nodes 0 and 1 are not assigned any cell with b = 1, and nodes 2 and 3 are not assigned any cell with b = 0. We prove in the next section that such a cell-to-node mapping does not exist.
We might relax the above conditions by requiring that each node be assigned at least one cell for each different value for a, b, and c and that adjacent cells map into adjacent nodes. We can find cell-to-node mappings which satisfy this condition and an example is shown in Fig. 4 . Mappings such as these are obtained by increasing N a · N b · N c . This is undesirable for the following reason. Normally, we associate a process with each cell, called a cell process, which is responsible for the computation associated with all the grid points within the cell. This means that if a cell-to-node mapping assigns r cells to a particular node, then the node will contain r cell processes, one corresponding to each cell, and the processor corresponding to the node will be multiplexed among the cell processes. The goal is to have the fewest number of cells consistent with keeping all the processors busy for the duration of the computation. By increasing N a · N b · N c we can often keep all processors busy (as we have done in this example) at the expense of additional overhead due to processor multiplexing among the cell processes and additional inter-processor communication [5] .
Another possibility is to relax the constraint that adjacent cells map into adjacent nodes. Let g s (r) denote a function which is defined for all integers r and s where 0 ≤ r < 2 s , whose range is the set of all binary strings of length s, and has the property that g s (r) and g s (r + 1 mod 2 s ) differ in exactly one bit position. There are many possible functions g s and such functions are called Gray codes [7, 9] . We define a cell-to-node mapping θ (a, b, c) , where 0 ≤ a, b, c < 2 d , as follows:
where denotes concatenation. The cell-to-node function θ maps a
2d nodes. The map's adjacency properties affect its communication efficiency. We now turn our attention to these properties. (2 2d ) does not divide the number of boundary cells, it is mathematically impossible to assign every processor the same number of boundary cells. The following theorem establishes an asymptotic approximation to this mathematically impossible, but desirable assignment of boundary cells.
Theorem 2.3 Let P denote the set of processors that are assigned 6 distinct boundary cells, one from each of the 6 boundaries. Under the cell-to-node mapping θ, |P |
Proof. It follows from Theorem 2.1 that each node labels exactly one cell on each of the six boundary faces. We want to count all those nodes that label a cell that is part of two or more boundary faces. If we simply count all the cells that are part of two or more boundary faces, this provides an upper bound on the number of processors that label such cells. It is not difficult to count these cells and we get 12
Except for O(2 d ) processors, this map has the remarkable property of assigning each of the 2 2d
processors the same number of boundary cells of each type: It is asymptotically balanced with respect to the assignment of boundary cells. Again, it is mathematically impossible to assign the same number of boundary cells to all processors.
An example of such a mapping is shown in Fig. 5 . In this case, cells which are adjacent in the z-direction are mapped into nodes which are at a distance 2 from each other. All other adjacencies are preserved.
A Negative Result
In the previous section, we noted that the best map would have:
• adjacent cells map to adjacent nodes;
• every node contain exactly one cell for each different value for a, b, and c. In this section, we prove that such a map does not exist. If such a map did exist, then it would have the following properties:
• The map would be from a 2 n × 2 n × 2 n mesh of cells to a 2n-dimensional hypercube of processors.
• Adjacent cells in the mesh would map to adjacent processors in the hypercube.
• A plane of the mesh has 2 n × 2 n cells. For every mesh plane along the x, y, or z axis, each cell in the plane would map to a distinct processor in the hypercube. This existence question is formalized as a mapping problem between graphs.
The Problem
, and
Does there exist a surjection, η : V M → V C satisfying the 2 conditions below?
2. η is injective when restricted to any mesh plane along the x, y, or z axis.
Mesh Edge Labels
Let ζ be a map that satisfies condition 1 of η. If {p, q} ∈ E M , then by condition 1, h(ζ(p), ζ(q)) = 1: ζ(p) differs from ζ(q) in exactly 1 bit position. This property is used to label the edges of the mesh. Definition ζ-labeling:
in the i th bit position (see, e.g., Fig. 6 ).
The Surjection Does Not Exist
Proof. (By contradiction.) Assume that if η exists then
Since q and r are adjacent to p, they share a plane. By condition 2 of η, their images are distinct. But Proof. This theorem asserts that a square in the mesh must have an η-labeling like that depicted in Fig. 7 .
Since q and r are adjacent to p in the mesh,
Case l η ({q, s}) = k ∈ {i, j}: Then η(r) differs from η(s) in 3 bit positions: i, j, and k. Since r is adjacent to s in the mesh, condition 1 on η is violated.
Case l η ({q, s}) = i: This case is precluded by the condition that Lemma 3.1.1 imposes on η. Therefore, l η ({q, s}) = j.
By an analogous case analysis, l η ({r, s}) = i. 
Proof. (By contradiction.) Assume that
and T = {e ∈ S|l η (e) = i}. We are given that {p, q} ∈ T . Let {r, s} ∈ S − T be an edge that is 'next to' an edge {t, u} ∈ T (see Fig. 8 ). Then l η ({t, u}) = i and l η ({r, s}) = i. The edges connecting mesh vertices r, s, t, and u form a square. The η-labeling of these edges however is precluded by Thm. 3.1.
From Corollary 3.1.1, one sees that if η exists, then all edges in a x-y 'plane' have the same η-label. This also is true for edges in a x-z, or y-z 'plane.' The definition of A x , A y , and A z are illustrated by Fig. 9 : A x is the set of edges along the x 'axis;' A y and A z are defined similarly: A y = {e 4 , e 5 , e 6 }, and A z = {e 7 , e 8 , e 9 }. Let P x = {i|∃e ∈ A x , l η (e) = i}. P y and P z are defined similarly.
The following lemma shows that since the extent of the mesh is 2 n in each dimension, |P x,y,z | ≥ n.
Proof. Let S = {η(v)|∃u, {u, v} ∈ A x }, the set of vertex labels of vertices on the x 'axis.'. By condition 2 of η, |S| = 2 n . In order to achieve 2 n distinct vertex labels, at least n different bit positions must vary.
Proof. (By contradiction.) Assume that ∃i ∈ P x ∩ P y . One sees (Fig. 10 ) that i ∈ P x ∩ P y produces squares whose edges all have the same label. This labeling, however, is precluded by Lemma 3.1.1. Proof. Assume that η exists. By Lemma 3.2.2,
But vertex names consist of only 2n bits.
Conclusion
In this paper, we have considered implementing the 3D ADM on a hypercube concurrent processor system. Such an implementation entails mapping the grid points of the computational domain onto the nodes of a hypercube concurrent processor. The best map would be to have adjacent cells map to adjacent nodes (or the same node), and to have every hypercube node contain exactly one mesh cell for each different x coordinate, one mesh cell for each different y coordinate, and one mesh cell for each different z coordinate. We prove that such a map does not exist, but show how to construct the next best thing: a map in which cells that are adjacent in the z axis map to hypercubes nodes that are of distance exactly two apart, and in which x and y axis adjacencies are preserved. Moreover, this optimally adjacent map is shown to be perfectly load balanced. Although this modulo-based map has been analyzed with respect to the hypercube concurrent processor, it is useful on any architecture that is capable of being configured as an m by n grid of processors with toroidal edge connections. If we assume N a is a multiple of n and N b is a multiple of m, then the mapping to an m by n torus is easy to specify and does not require the use of the Gray code transformation, namely, θ(a, b, c) = ((a + c) mod m, (b + c) mod n) where (r, s) denotes the processor located in the rth row and the sth column of the grid of processors. In order for each processor to be assigned an equal number of cells, it is necessary that N c be a multiple of lcm(m, n). Normally, we would partition the computational domain so that N a = m and N b = n and N c = lcm(m, n). The reason is that creating more domain cells results in additional overhead in processor multiplexing within a node and additional inter-processor communication.
For the class of algorithms under consideration, one could use this mapping on SIMD machines which provide 2D toroidal or hypercube connections, such as the Goodyear MPP, the Thinking Machines Connection Machine, or the Active Memory Technologies DAP [3] .
