Endowing a communication network with the ability to realize arbitrary communication patterns is an expensive proposition, both in hardware and in system software. One might instead ask whether, for a given application program, a simple network can be built that performs well for that particular program. In this paper, we model an application program by the set of communication patterns it uses. We then consider the problem of determining when such a set of communication patterns is suitable for fast realization on a simple network. We show that the question of whether there exists a simple, inexpensive network for an algorithm is closely related to the VLSI layout question. In particular, we show how the VLSI framework can be used to produce a simple test that tells how complex such a network must be. Within this context we show that, contrary to common wisdom, the communication necessary for block-matrix transpose does not require complex hardware; in fact, it is efficiently realizable on a mesh. However, other important patterns, such as perfect shuffle, do indeed require either expensive hardware or large amounts of message congestion.
Introduction
The holy grail of parallel computing is an interconnection network that is inexpensive to build and that can realize all communication patterns quickly. Until a researcher who is pure enough to find the grail comes along, we must be content with compromises. In this paper we will continue to strive for fast communication and ease of construction, but we will relax the requirement that arbitrary communication patterns be realized. Instead we will ask whether, given the communication patterns used by a particular parallel algorithm or program, it is possible to efficiently realize them with low hardware cost.
We do not mean to denigrate the efforts to produce a general-purpose parallel machine. Intel has moved from a hypercube [7] to a two-dimensional mesh [8].
Other researchers have suggested active messages, multi-threading, pre-fetching, and novel cache schemes to reduce overhead and make general-purpose communication more practical. Each method has its expenses and drawbacks. We only argue that, since flexibility comes at a cost, one might want to consider trading reduced generality for reduced cost.
In some sense, any machine topology that forms a single connected component is general-purpose; that is, it is possible to send a message from any processor to any other processor. It may, however, be quite difficult (i.e., expensive in terms of time or resources) to send messages between particular pairs of processors or to send several messages at the same time. The question of how to efficiently route the messages arising from common algorithms on particular networks has been extensively studied. For example, Bajwa [1] discusses routing on tori and grids, and Sibeyn [13] gives an extensive survey of routing on a mesh. The field of graph embedding (e.g., [6]) can be interpreted as providing mappings of algorithms' communication patterns onto various network topologies, or showing that they do not exist. Some of the limitations of graph embeddings for this purpose have been removed by looking at work-preserving emulations [9, 16].
Many researchers have attempted to classify those sets of communication patterns that can be implemented efficiently on particular architectures. Nassimi and Sahni [11] consider the class of BPC permutations, which includes many common communication patterns such as matrix reblocking, matrix transpose, and perfect shuffle. Their work has been followed by several examinations of BPC permutations on hypercubes [4, 17]. Attempts to route BPC permutations on meshes have been less successful, and we will show later that efficient routing of some BPC permutations is not possible on meshes.
In this paper, we will attempt to characterize those parallel algorithms (described by the collection of communication patterns among processes that must be realized by the algorithm) that can be efficiently implemented on easily-built networks. Given current technology, we consider an "easy to build" network (this is, of course, a relative term) to be one for which each processor has a small number of neighbors in the network, and for which no long wires are required in its construction. Given these guidelines, our focus turns naturally to low-dimensional meshes, particularly two-dimensional meshes, as they are the epitome of a low-cost interconnection network that is simple to construct. In fact, low-dimensional meshes are an option that currently appears to be enjoying increasing popularity relative to hypercube-based networks [7, 8].
The question of how to efficiently design a machine for an algorithm was previously discussed by Greenberg [5], and we have borrowed some of our model from this earlier paper, particularly as pertains to the description of the communication patterns of parallel algorithms. However, the focus of this earlier paper was on the efficient use of chip pins, and we are instead focusing on the efficient use of the wire length and wire area of the interconnection network. This focus leads us to comparisons with research on VLSI layout. The framework paper of Bhatt and Leighton [2] provides both important general theorems and an extensive bibliography of this area.
The remainder of the paper is organized as follows: In Section 2, we discuss in detail our models of parallel machines and the communication patterns that describe parallel algorithms, and formulate precisely the problem we are considering. In Section 3, we illustrate our model by designing a network that can efficiently implement matrix squaring. Section 4 discusses a self-mapping of the two-dimensional mesh that allows it to efficiently implement transpose communication in addition to the standard row and column shifts, thus solving the matrix-squaring problem on a simple, inexpensive network. In Sections 5 and 6, we relate the problem of determining which communication patterns can be implemented efficiently on two-dimensional meshes to the theory of graph bifurcators, and apply this framework to obtain bounds on realizations of higher-dimensional meshes and other common communication patterns. Finally, in Section 7, we summarize our results.
Our Model
Throughout this paper, we will be making claims such as "Communication pattern X can be performed quickly" and "Network N has low hardware cost." In order to do this in a precise and rigorous way, we must have clearly defined models of both parallel computers and communication patterns. In this section, we describe our models.
Modeling Parallel Machines
Since we want our results to hold for large-scale machines (i.e., thousands or even millions of processors), we begin with a fairly general model of a distributed-memory parallel computer. Our parallel machine will consist of N processing nodes, $p_0$ through $p_{N-1}$, and an interconnection network consisting of point-to-point links. Each node may contain several CPUs, memory banks, caches and other logic designed for fast computation and efficient reuse of data. While the details of these components will, of course, have a tremendous effect on program speed, they are not of concern to us in this paper. We will concentrate only on the hardware used for interprocessor communication.
Abstractly, a parallel computer can be thought of as a graph in which vertices represent processor nodes and edges represent bidirectional point-to-point communication links. This level of abstraction, however, obscures some very important hardware issues. In particular, the speed at which communication occurs is intimately related to the speed at which the communication links run, and the speed of the links depends on the topology of the graph. Two characteristics of the graph that have essential effects on link speed are its degree (i.e., the number of edges incident to each node), and its maximum wire-length when embedded in two-dimensional or three-dimensional space.
The degree affects link speed in several ways. The most direct effect is through the necessity of sharing a limited number of pins. A high performance processor node will generally reside on one chip (or perhaps a small number of chips). Such a chip will typically have at most a few hundred pins to which communication links can be attached. When implementing networks with higher degree, fewer pins can be assigned to each link, and thus each link will be proportionally slower.
A less obvious effect is that as the degree of the network increases, the complexity of the internal linkages on each chip increases. This is because signals arriving on one link may need to be sent out on another link, and distinguishing among more outgoing links will take additional time.
Higher degree can therefore mean greater delay in switching a signal from one link to another.
In addition, if the links require buffering then the presence of more incident links means that more chip area must be devoted to buffers. Clearly, the degree of the underlying graph of an interconnection network will have a profound effect on the speed of its communication links.
Once the number of pins assigned to each link (its width, as defined by Greenberg [5]) has been fixed, the speed of the link depends on how fast the line can be driven. The switching speed of a line depends on several factors. In early machines, the primary constraint was the speed of the on-chip driver, which was related to the clock speed. In newer machines, such as the Intel Paragon [8], the wires are driven at 100 MHz (and are 16 bits wide, yielding 200 Mbytes/sec). At these speeds it becomes increasingly important to keep the lengths of the wires short; lengths of a foot or two are considered quite satisfactory. In terms of the underlying graph of the network, it is important that the graph does not require long edges when embedded in physical space. Given today's technology, it is preferable that the edges are short when the graph is embedded in the plane (and thus the network can be realized in two-dimensional space).
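The arithmetic behind the quoted Paragon figure, and the way link width shrinks as degree grows, can be sketched as follows. This is only an illustration under simplifying assumptions: each wire is taken to carry one bit per clock cycle, the pins are shared equally among links, and the pin budget of 64 is a hypothetical value chosen so that degree four reproduces the 16-bit width mentioned above. The function names are ours.

```python
def link_bandwidth_mbytes_per_sec(width_bits, clock_mhz):
    """Peak link bandwidth, assuming one bit per wire per clock cycle."""
    return width_bits * clock_mhz / 8


def width_per_link(pin_budget, degree):
    """Pins available to each link when a fixed budget is shared equally."""
    return pin_budget // degree


PIN_BUDGET = 64  # hypothetical number of pins reserved for communication

if __name__ == "__main__":
    for degree in (4, 8, 16):
        width = width_per_link(PIN_BUDGET, degree)
        # Doubling the degree halves the width, and hence the bandwidth.
        print(degree, width, link_bandwidth_mbytes_per_sec(width, 100))
```

Under these assumptions, each doubling of the degree halves the bandwidth of every link, which is the proportional slowdown described in the text.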
The pinnacle of scalability for an interconnection network is for it to be realizable with constant-length wires in physical space. For this to be possible, it is necessary that processor nodes that are distant in physical space are also distant in the underlying graph. For example, if the processor nodes are embedded in the plane, then in an N-node graph some nodes must be a distance of $\sqrt{N}$ apart in the graph. Thus the engineering goal of having only short wires conflicts with the communications goal of requiring only local communications. The purpose of this paper is to explore this conflict.
One attempt at a technological amelioration of this conflict is the use of wormhole routing.
Wormhole routing attempts to mitigate the cost of forwarding messages over several wires (see Ni and McKinley's survey paper [12] for a detailed description). The need for forwarding is evident in any global communication among the processors, even a simple broadcast, since we have argued that some message must traverse many graph edges. The cost of forwarding is commonly addressed by special wormhole-routing hardware that causes each message to establish a path of links along which it travels, and then pipelines successive packets along this path (at some level, the message must be broken into pieces containing no more than the number of bits that can traverse a link simultaneously). Communication between distant nodes in the graph will still necessarily have greater latency (i.e., time for the first packet to arrive) than will transmission between adjacent nodes, but the two transmissions will have the same throughput; that is, they will proceed toward completion at the same rate. Thus broadcast need not always be an expensive operation, even relative to nearest-neighbor communication.
While the use of wormhole routing reduces the cost of forwarding messages (i.e., the cost of path length), it does not cure all ills. If many messages must travel over long paths, then the total number of links collectively used by the packets will be large. Since the number of links in the system is fixed, we will eventually encounter congestion on the links. Thus, while for single messages the use of wormhole routing makes the issue of wire length seem unimportant, it is nonetheless important to consider the effects of wire length on groups of concurrent messages.
Furthermore, the use of wormhole routing does not remove the constraint on the length of physical wires that are intended to run at high speeds. Thus, wormhole routing may allow fast communication between any two nodes in (for example) a large Intel Paragon, but each physical wire in the machine still needs to be short.
The issue of message congestion has its analog in the physical world in the routing of wires.
Even if other methods could be used to eliminate the speed limitations of long wires, the use of long wires would still lead to more complicated and perhaps physically impossible wire routing problems. As we shall see in the later sections of this paper, the twin problems of link congestion and wire routing congestion will form the primary obstacle to the creation of simple architectures that meet the needs of algorithms with complex communication requirements.
Modeling Communication Patterns
Given our model of a parallel computer's capabilities, we wish to be able to quantify how quickly the communication required by a parallel algorithm can be performed. This necessitates a description of the algorithm's communication needs. We have stated that we are unwilling to service arbitrary communication; instead, we demand that the algorithm specify some small

In order to run effectively on our parallel computer, an algorithm must be divided into subtasks, or processes. For the purposes of this paper, we will assume that the algorithm consists of a number of processes equal to the number of processor nodes in our parallel machine, and that all processes have equal computation load. This will not generally be true in practice (the division of large problems into appropriate subtasks, and assigning them to processors in such a way as to balance the computational load, is an entirely different area of inquiry), and is only assumed here for simplicity of exposition.

The communication between processes will be described via permutations on the set of processes. For example, if the processes form an N-element two-dimensional array labeled in row-major order, then communication of each array value to its right-hand neighbor is a subset of the permutation that maps each process $i$ to process $i + 1$, and communication of each array value to its neighbor immediately below is a subset of the permutation that maps each process $i$ to process $i + \sqrt{N}$. In this way, any communication step consisting of each process sending at most one message to some other process and each process receiving at most one message from some other process can be described by a single permutation on the processes. Doing this for every communication step of the algorithm, we can translate the communication needs of a parallel algorithm into a set of permutations.
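As a small illustration of the row-major example, the two neighbor communications can be written as partial maps on process indices. This is a minimal sketch; the function names are ours, and we assume a square array without wraparound, so processes in the last column (or last row) send nothing.

```python
import math


def right_neighbor_map(n):
    """Partial map i -> i + 1 for an n-element square array in row-major order.

    Processes in the last column have no right-hand neighbor and are omitted.
    """
    side = math.isqrt(n)
    if side * side != n:
        raise ValueError("n must be a perfect square")
    return {i: i + 1 for i in range(n) if i % side != side - 1}


def below_neighbor_map(n):
    """Partial map i -> i + sqrt(n); processes in the last row are omitted."""
    side = math.isqrt(n)
    if side * side != n:
        raise ValueError("n must be a perfect square")
    return {i: i + side for i in range(n) if i + side < n}


def is_partial_permutation(mapping):
    """Each process sends at most one message and receives at most one."""
    return len(set(mapping.values())) == len(mapping)
```

For a 3 x 3 array, `right_neighbor_map(9)` sends 0 to 1, 1 to 2, 3 to 4, and so on, and both maps satisfy the one-message-per-process condition described above.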
Measuring Communication Cost
Once we have represented the communication needs of an algorithm as a set of permutations on the processes, the problem of optimizing communication reduces to the problem of mapping the processes to the processor nodes so that each of the communication patterns (i.e., permutations) can be realized with paths between communicating processes that congest the network links as little as possible. Since we have assumed that the links are bidirectional, two messages that cross an edge in opposite directions do not interfere. We normalize the cost of a permutation to the rate of sending a single message between two nodes when it is the only message in the network.
(This is equivalent to the bandwidth of the links.) If a message travels over a single path then this is the best possible rate. On the other hand, if we can arrange multiple link-disjoint paths for a single message then it may be possible to exceed this rate. If the paths for several messages share a link then we will not be able to achieve this rate.
Formally: We are given a network with N processor nodes and a point-to-point interconnection network, and a set of permutations $\pi_j$ on the N processes of a parallel algorithm. We must supply both a mapping of the processes to the processor nodes and a mapping of each source-destination pair $(p_i, \pi_j(p_i))$ to a set of paths connecting $p_i$ to $\pi_j(p_i)$, where $p_i$ runs over the set of all processes and $\pi_j$ runs over the set of all communication patterns. In the common case in which each pair is mapped to a single path, the rate of communication for the mapping of each permutation $\pi_j$ is equal to the inverse of the congestion in the network of the embedding of all the paths needed to realize $\pi_j$ (congestion is defined to be the maximum number of paths using any one link).
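In the single-path case, congestion and rate can be computed directly from the chosen paths. The following is a minimal sketch, assuming each path is given as a list of processor nodes and each bidirectional link is treated as two unidirectional links, so that messages crossing an edge in opposite directions do not interfere. The function names are ours.

```python
from collections import Counter


def congestion(paths):
    """Maximum number of paths that use any one unidirectional link."""
    link_use = Counter()
    for path in paths:
        # Consecutive nodes on a path form one directed link.
        for u, v in zip(path, path[1:]):
            link_use[(u, v)] += 1
    return max(link_use.values(), default=0)


def rate(paths):
    """Inverse of the congestion; requires at least one link to be used."""
    c = congestion(paths)
    if c == 0:
        raise ValueError("no links are used by the given paths")
    return 1.0 / c
```

For example, on a three-node line the paths `[0, 1, 2]` and `[1, 2]` share the link from 1 to 2 and so have congestion two and rate one half, whereas `[0, 1]` and `[1, 0]` travel in opposite directions and have rate one.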
A more general definition of rate can be made for the case when each pair $(p_i, \pi_j(p_i))$ can be mapped to several paths, but it is rather technical, and since we will not be discussing multiple paths in this paper we will not formally define rate in this more general case.
Since parallel algorithms often use different communication patterns during different phases of their execution, we have defined the rate of each permutation separately. This is different from the usual graph embedding approach of considering the congestion of all the permutations together. Occasionally, we will be able to show that two permutations can be mapped in such a way that their combined congestion is less than the sum of their individual congestions. In this case, these two permutations can be combined into a single phase that will increase the overall efficiency beyond that which is achieved by routing them in two separate phases.
An Example: Implementing Matrix Squaring
We illustrate the use of the models described in the preceding section by discussing a simple matrix squaring algorithm. Suppose that for some matrix $M$, we want to compute $M^2$. One standard parallel algorithm for matrix-matrix multiplication, where each matrix is initially stored one element per processor, is described by Leighton [10]. In this algorithm, the two matrices are stored on a torus (i.e., a two-dimensional mesh with wraparound edges), one in row-major order and one in column-major order. After an initial skewing where each column i of one matrix is shifted up i positions and each row j of the other matrix is shifted to the left j positions, the computation proceeds by repeatedly accumulating the product of the elements within each processor and then shifting columns down and rows to the right.
When used for matrix squaring, this algorithm requires three communication patterns: transpose communication, row shift, and column shift. (In fact, the skewing operation requires row and column shifts of distance greater than one, and we might want to include the skew as an additional type of communication. For now we will consider only single row and column shifts, but we will discuss these more general shifts later, in Section 6.) The goal of finding an efficient architecture for this algorithm thus translates into the question of whether there exists a network topology that allows these three patterns and can be realized with low degree and short wires.
One natural candidate network is the degree-five graph that contains edges that directly implement row shifts, column shifts, and transpose communication. That is, a wire is placed between each pair of processors that are transposes of each other or that are neighbors in a row or column. There are two potential problems with this approach. First, since the transpose is only used in an initial phase of the algorithm, the hardware resources for the transpose edges are not used efficiently (in the sense discussed by Greenberg [5]).
Second, and more importantly, this network initially appears to be difficult to realize efficiently in hardware. Row shifts and column shifts are easy to implement, since together they exactly form the common two-dimensional mesh network. The mesh meets both our hardware criteria of having low (in this case, constant) degree and of being realizable in the plane with short (here, constant-length) wires. The problem occurs when we try to add the transpose edges.
We observe that simply adding edges to the standard mesh may not be a wise choice, since the

In Section 4, we will show that there is in fact a better solution: it is possible to build a network with constant degree such that row shifts, column shifts, and transpose communication are all implemented directly by short wires. In fact, we will show that it is possible to directly map this new network onto the standard mesh by mapping the processes onto processors in a new manner.
Implementing Transpose Communication on a Mesh
In the previous section, we introduced the problem of how to build a network that can efficiently implement row shifts, column shifts, and transpose communication among an array of processes.
We will now show that transpose communication can be implemented very efficiently on a two-dimensional mesh without seriously impacting its performance on row and column shifts. This is a somewhat surprising result, given that the standard embedding of an array of processes to a mesh network yields very poor performance for transpose, and it illustrates the usefulness of our framework in preventing the misclassification as "difficult" of a set of communication patterns that is in fact easy to implement on a simple network.
In order for transpose to have its usual meaning, we need the number of processes to be a perfect square; that is, $N = S^2$ for some integer S. In this case, we can identify each process $p_k$ with the pair $(i, j)$, $0 \le i, j < S$, such that $k = iS + j$. Now we can define three permutations, R, C, and T, that correspond to row shifts, column shifts, and transpose communication. The row permutation, R, maps process $(i, j)$ to process $(i, j+1)$ for each $0 \le i < S$, $0 \le j < S-1$; the column permutation, C, maps process $(i, j)$ to process $(i+1, j)$ for each $0 \le i < S-1$, $0 \le j < S$; the transpose permutation, T, maps process $(i, j)$ to process $(j, i)$ for each $0 \le i, j < S$.
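The three permutations can be written down directly from these definitions. The sketch below is illustrative only; the function names are ours, and R and C are represented as partial maps because the definitions above leave the last column and last row without an image.

```python
def process_index(i, j, S):
    """Index k = i*S + j of the process identified with the pair (i, j)."""
    return i * S + j


def row_shift(S):
    """R: (i, j) -> (i, j + 1) for 0 <= i < S, 0 <= j < S - 1."""
    return {(i, j): (i, j + 1) for i in range(S) for j in range(S - 1)}


def column_shift(S):
    """C: (i, j) -> (i + 1, j) for 0 <= i < S - 1, 0 <= j < S."""
    return {(i, j): (i + 1, j) for i in range(S - 1) for j in range(S)}


def transpose(S):
    """T: (i, j) -> (j, i) for 0 <= i, j < S."""
    return {(i, j): (j, i) for i in range(S) for j in range(S)}
```

One property that follows immediately from the definitions is that T is its own inverse and that applying T, then R, then T again gives C, so a network that realizes R and T can also realize C by composition.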
We now state our main theorem regarding the addition of transpose communication ability to a standard two-dimensional mesh:
Theorem 1 For the permutations R, C, and T as defined above, and N the square of an integer, it is possible to:
1. Map N processes to an N-node two-dimensional mesh network so that each of the permutations R, C, $R^{-1}$, $C^{-1}$, and T is individually routed by a set of edge-disjoint paths. That is, each can be realized with rate one. Furthermore, R and C (similarly, $R^{-1}$ and $C^{-1}$) can be realized together with rate one.
2. Build an N-node network that is laid out on a two-dimensional grid (i.e., the N processor nodes are mapped to grid points and wires are routed along grid edges) such that no wire has Manhattan length greater than two, no more than three bidirectional channels are routed along any grid edge, and pairs of processors that are mapped to each other by any of R, C, $R^{-1}$, $C^{-1}$, or T have a wire between them.
Proof We will demonstrate the mapping for (1.) above. The transformation from (1.) to (2.) is straightforward.
The intuition behind the mapping for (1.) is based on folding up a square to form a smaller square (see Figure 1). Picture the array of processes as an $S \times S$ square. First, expand the square into a $2S \times 2S$ square (each original point can be thought of as being mapped to the lower left of the four points to which it is expanded; see part (a) of the figure). If the square is then folded along its diagonal, the processes in each transpose pair are now close to each other (see part (b) of the figure). Finally, the triangle at the lower right is folded in along the vertical midline, and the square is folded up onto the triangle in the upper left (parts (c) and (d) of the figure). The end result is that each point of the original $S \times S$ square is mapped to a unique point of the resulting $S \times S$ square. Note that none of the folds ever disconnect horizontal lines (rows) or vertical lines (columns) in the square, though they may be bent. The initial mapping to a $2S \times 2S$ square induces dilation two; clearly, the folds add no further dilation. Formally, we define the mapping in six parts:
Region of Mesh                        (i, j) mapped to
$i < S/2$, $j < S/2$, $j > i$         $(2j, 2i+1)$
$i < S/2$, $j < S/2$, $j \le i$       $(2i+1, 2j)$
$i < S/2$, $j \ge S/2$,

An alternate definition is that each process $(i, j)$ is mapped to processor node
$$\bigl(\min\{\max\{2i+1, 2j\},\; 2S+1-\max\{2i+1, 2j\}\},\;\; \min\{\min\{2i+1, 2j\},\; 2S+1-\min\{2i+1, 2j\}\}\bigr).$$
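The closed-form expression can be transcribed almost literally into code. The sketch below follows the formula exactly as it appears in the text and makes no attempt to reconstruct the six-part table; the names `fold_map` and `fold` are ours. It is intended only as a way to experiment with the mapping, for example to check that distinct processes are sent to distinct processor nodes.

```python
def fold_map(i, j, S):
    """Processor node for process (i, j), per the closed-form expression."""
    a, b = 2 * i + 1, 2 * j
    hi, lo = max(a, b), min(a, b)

    def fold(x):
        # Reflect coordinates past the midline back toward the origin.
        return min(x, 2 * S + 1 - x)

    return (fold(hi), fold(lo))


def is_injective(S):
    """True if no two processes of the S x S array share a processor node."""
    image = {fold_map(i, j, S) for i in range(S) for j in range(S)}
    return len(image) == S * S
```

For processes with $i < S/2$ and $j < S/2$, this agrees with the first two rows of the table: `fold_map` returns $(2j, 2i+1)$ when $j > i$ and $(2i+1, 2j)$ otherwise.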
By examining the mapping, the following facts are easy to establish:
1. Process row $r < S/2$ (that is, processes $(r, j)$ for $0 \le j \le S-1$) is mapped only to nodes in processor row $2r+1$ or processor column $2r+1$.
2. Process row $r \ge S/2$ (that is, processes $(r, j)$ for $0 \le j \le S-1$) is mapped only to nodes in processor row $2S-2r$ or processor column $2S-2r$.
3. Process column $c < S/2$ (that is, processes $(i, c)$ for $0 \le i \le S-1$) is mapped only to nodes in processor row $2c$ or processor column $2c$.
4. Process column $c \ge S/2$ (that is, processes $(i, c)$ for $0 \le i \le S-1$) is mapped only to nodes in processor row $2S-2c+1$ or processor column $2S-2c+1$.
5. Processes $(i, j)$ and $(j, i)$ (for all $0 \le i, j \le S-1$ and $i \ne j$) are mapped to processor nodes that are diagonally or anti-diagonally adjacent.
The existence of disjoint paths for each of R, C, $R^{-1}$, $C^{-1}$, and T, and therefore their realizability with rate one, can be deduced directly from the above facts. We illustrate this for permutation R; identical reasoning establishes disjoint paths for C, $R^{-1}$, and $C^{-1}$. Note from facts (1.) and (2.) that each process row is mapped to a single processor row and column that are different from those to which any other row is mapped (although they will be shared with some process column). Thus if all the paths needed to realize each row of R are routed within the corresponding processor row and column then the paths for each row are disjoint from those for all other rows. Within the processor row and column for a process row r, we observe that the processes are mapped first to successive odd positions in increasing order, and then to successive even positions in decreasing order. Thus they can be connected by successive paths of length two that each use distinct unidirectional links. The only potential problem is at the point at which the mapping of the process row switches from its processor row to its processor column.
However, we see that this "turning point" occurs at the process that is on the diagonal of the array of processes. This process is mapped to the intersection of the designated processor row and column for that process row, so the mapping of the process row can continue uninterrupted.
Identical reasoning to the above establishes the existence of disjoint paths for C, $R^{-1}$, and $C^{-1}$. Disjoint paths for T are established by having all paths use first a row link and then a column link. In this manner, pairs of communicating processes mapped to diagonally adjacent processors will use paths that progress clockwise around a two-by-two square, while processes mapped to anti-diagonally adjacent processors will use paths that progress counterclockwise around the same square. It is easy to see that no unidirectional link will be used more than once by this set of paths.
To show that the paths chosen for R are edge-disjoint from those chosen for C, consider the paths chosen for a particular process row r and process column c. From

This completes the proof of part (1.) of the theorem. As was mentioned earlier, the transformation of this mapping to the layout required for part (2.) is straightforward.
Graph Bifurcators and Mesh Realizability
To address the question of which communication patterns can be efficiently realized on a two-dimensional mesh, we can make use of an elegant framework developed by Bhatt and Leighton for solving VLSI layout problems using graph bifurcators. The following definition is modified from their paper [2]:
A graph G is said to have an $(F, \alpha)$-bifurcator if either G has only one node, or G can be decomposed into two subgraphs by removing at most F edges and the two resulting subgraphs both have $(F/\alpha, \alpha)$-bifurcators. F is called the size of the bifurcator.
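The recursive definition can be checked mechanically once a candidate decomposition is given. The following sketch assumes the lost symbol in the definition is a shrink factor, here called `alpha`, and represents a decomposition tree as nested pairs whose leaves are integer node labels; both the representation and the function names are ours. It verifies a given tree rather than searching for the best one.

```python
def tree_nodes(tree):
    """Set of graph nodes contained in a decomposition (sub)tree."""
    if isinstance(tree, tuple):
        return tree_nodes(tree[0]) | tree_nodes(tree[1])
    return {tree}


def cut_size(edges, part_a, part_b):
    """Number of edges with one endpoint in each of the two parts."""
    return sum(
        1
        for u, v in edges
        if (u in part_a and v in part_b) or (u in part_b and v in part_a)
    )


def is_bifurcator_tree(edges, tree, F, alpha):
    """Check that every split at depth d removes at most F / alpha**d edges."""
    if not isinstance(tree, tuple):
        return True  # a single node trivially satisfies the definition
    left, right = tree
    if cut_size(edges, tree_nodes(left), tree_nodes(right)) > F:
        return False
    return is_bifurcator_tree(edges, left, F / alpha, alpha) and is_bifurcator_tree(
        edges, right, F / alpha, alpha
    )
```

For instance, the four-node cycle with edges `[(0, 1), (2, 3), (0, 2), (1, 3)]` and tree `((0, 1), (2, 3))` passes with `F = 2` and `alpha = 2 ** 0.5`, since the top split removes two edges and each lower split removes one, but fails with `F = 1`.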
Every graph bifurcator has associated with it a decomposition tree for the graph G, where the root of the tree corresponds to G itself, and the children of a node at level i in the tree are the subgraphs created by removing at most $F/\alpha^i$ edges from the subgraph corresponding to that level-i node. Bhatt and Leighton [2] showed that a graph's best bifurcator is closely related to its best layout in two dimensions. More specifically, they showed that:

These results together give a simple, though slightly imprecise, test for whether a given set of communication patterns is efficiently realizable on a two-dimensional mesh. If we let G be the graph corresponding to the set of all the communication patterns in question (i.e., G has a node for each process and an edge $(u, v)$ whenever process u and process v communicate in some communication pattern), our test requires only that we know the smallest F such that G has an

In the matrix-squaring example of Section 3, we mentioned that it may be desirable to be able to shift different rows of the mesh by different amounts. Sending each process $(i, j)$ to $(i, j+s)$ for some arbitrary integer s can of course be achieved via s applications of the permutation R, but we may wish to do the shift in one step for some particular s by adding the permutation $R^s$ to
From these two results we derive the interesting fact that transpose edges are not always "free" when communication patterns other than R and C are involved. In particular, adding the permutation T and the permutation $R^{N^{1/4}}$ to a mesh would allow us to easily realize both of the permutations $R^{N^{1/4}}$ and $C^{N^{1/4}}$, which we have just shown is strictly harder (in terms of congestion) than realizing just $R^{N^{1/4}}$.
Other Permutations
The multiple shifts discussed in the last subsection require diminished performance in terms of rate when realized on meshes, but they are far from the worst offenders in this regard.
One criticism of mesh networks that is sometimes voiced is that they have large diameter: some processor nodes are at distance $\Theta(\sqrt{N})$ from others in an N-node two-dimensional mesh, as compared to at most $O(\log N)$ in a hypercube network. (Of course, the advent of wormhole routing has dulled this criticism somewhat.) One approach to reducing the diameter of a mesh would be adding shuffle edges to the rows, thus reducing the row diameter to $O(\log N)$.
Assuming for simplicity that the number of columns in the mesh is a power of two, a row-shuffle edge would connect each process $(i, j)$ to process $(i, j')$, where $j'$ is obtained by rotating the

If we add column-shuffle edges as well, we obtain a communication graph that more or less contains the N-vertex shuffle-exchange graph. (A full shuffle corresponds to a row shuffle, a column shuffle, and perhaps a row exchange, a column exchange, or both.) Therefore, since the best bifurcator of an N-node shuffle-exchange graph is of size $\Theta(N/\log N)$, realizing the four permutations row shift, row shuffle, column shift, and column shuffle requires $\Omega(\sqrt{N}/\log N)$ congestion. To within constant factors, the same bounds can be established for adding FFT connections to the mesh, either globally or within rows or columns. The high congestion required for shuffle edges shows that not all BPC permutations can be realized on simple networks.
However, there are other permutations that do not require high congestion and therefore can be realized with high rate on a two-dimensional mesh without excessively impacting the rate of row and column shift. What these permutations have in common is that when added to the two-dimensional mesh, they do not cause a great increase in the size of its best (F,

(The ability to switch between the snake-ordering of a mesh's vertices and the usual row-major ordering can be useful for applications that use sorting.)
We note that adding either one of these permutations to a two-dimensional mesh only increases the size of the mesh's best bifurcator by a factor of two. Thus, the Bhatt-Leighton results discussed in Section 5 allow us to realize either one with only $O(\log N)$ congestion. We do not know how to improve this result for the Gray code permutation, but we can do significantly better for the snake ordering, as the same "folding" technique that can be used to add wraparound edges to the mesh also allows us to add snake-ordering edges with congestion two.
Conclusion
In this paper, we have taken the standard question of how best to map a specific algorithm onto a specific machine and turned it around, instead asking how to build the best machine for a specific algorithm. This question was previously addressed by looking at ways to make efficient use of wires. Instead, we have considered the feasibility of the resulting network in terms of maintaining low node degree and short wires. These criteria for the feasibility of networks led us to focus on mesh-based machines.
The general question of determining which sets of communication patterns correspond to buildable machines was addressed by applying a graph-theoretic framework of Bhatt and Leighton [2] that was originally designed for VLSI layout problems. Their framework showed that the layout area and wire length needed to realize a set of communication patterns (i.e., permutations) and the congestion that results from implementing them on a two-dimensional mesh are both intimately related to the size of the smallest bifurcator of the graph formed by the permutations.
For the particular problems of adding transpose communication, snake-ordering, and multiple row and column shifts to a two-dimensional mesh, we were able to explicitly construct mappings that have optimal behavior. The best solution for Gray codes remains open, though a solution that is within a logarithmic factor of optimal can be constructed by the Bhatt-Leighton approach.
On the other hand, some patterns, such as perfect shuffle, have optimal solutions with very high cost. Thus the class of BPC permutations, which is efficiently implementable on a hypercube, is not efficiently implementable on any simple, low-cost network.
Finally, we should point out that most of our discussions of optimality have not included the consideration of small constant factors. The transpose solution yields communication with rate one; that is, as fast as if all communicating processes were neighbors in the processor graph. However, as discussed in Section 2, the possibility of using multiple paths to increase the communication rate might remain. Since a mesh has only degree four, the number of paths clearly cannot be very high. In fact, because the mesh is bipartite, any use of two paths requires the use of at least four edges. Since the mesh has slightly fewer than four times as many edges as nodes, we conclude that even a unidirectional communication pattern cannot hope to achieve even rate two.
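The counting behind this last claim can be spelled out as follows. This is a sketch under assumptions that the text does not state explicitly: the mesh is taken to be a square $\sqrt{N} \times \sqrt{N}$ array without wraparound, "edges" are counted as unidirectional links (two per bidirectional wire), and the pattern is a full permutation in which each of the N processes sends one message.

```latex
\begin{align*}
\text{links available} &= 2 \cdot 2\sqrt{N}\,(\sqrt{N} - 1) = 4N - 4\sqrt{N}, \\
\text{links needed for rate two} &\ge 4 \cdot N = 4N > 4N - 4\sqrt{N}.
\end{align*}
```

The factor of four per message comes from bipartiteness: two link-disjoint paths between the same pair of nodes have lengths of equal parity, so their lengths are at least 1 and 3, or at least 2 and 2.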
