The complete exchange (or all-to-all personalized) communication pattern occurs frequently in many important parallel computing applications. We discuss several algorithms to perform complete exchange on a two-dimensional mesh connected computer with wormhole routing. We propose algorithms for both power-of-two and non power-of-two meshes, as well as an algorithm which works for any arbitrary mesh. We have developed analytical models to estimate the performance of the algorithms on the basis of system parameters. These models take into account the effects of link contention and other characteristics of the communication system. Performance results on the Intel Touchstone Delta are presented and analyzed.
Introduction
Low-dimension high-bandwidth interconnection networks, such as a mesh, have recently emerged as a popular alternative to the earlier high-dimension low-bandwidth networks, such as the hypercube, for distributed memory multicomputers. The Intel Touchstone Delta, the Intel Paragon and the Symult 2010 use a two-dimensional mesh, while the MIT J-machine and the Mosaic computer developed at Caltech use a three-dimensional mesh [5]. All these machines use wormhole routing, an important feature of which is that the network latency is almost independent of the path length when there is no link contention and the packet size is large. In this paper, we discuss four algorithms to perform complete exchange on a mesh connected computer with wormhole routing. The complete exchange or all-to-all personalized communication pattern is one in which all processors simultaneously need to communicate with all other processors. It occurs in many applications, such as parallel quicksort. Complete exchange algorithms for a hypercube architecture are described in [2, 4]. Ponnusamy, Thakur et al. [6] discuss complete exchange on the fat tree architecture of the CM-5. These algorithms assume that the number of processors is a power-of-two, which is a valid assumption for those architectures. The mesh architecture introduces different problems because of high contention and the fact that the user can allocate a mesh size which need not be a power-of-two and may even be an odd number (e.g., 5 x 5). Bokhari and Berryman [3] describe two algorithms for a circuit-switched mesh, which assume that the number of processors is a power-of-two. In this paper, we discuss algorithms for both power-of-two and non power-of-two meshes. We have developed analytical models to estimate the performance of the algorithms. We present performance results on the Intel Touchstone Delta.
Section 2 describes the architecture of the Delta and the performance model used for the algorithms. The algorithms are described in Section 3. The performance of the algorithms on the Delta is discussed in Section 4 followed by Conclusions in Section 5.
Architecture and Performance Model
The Intel Touchstone Delta is a 16 x 32 mesh of computational nodes, each of which is an Intel i860/XR microprocessor. The two-dimensional mesh interconnection network has bidirectional links with wormhole routing. It uses deterministic XY routing, in which packets are first sent along the X dimension and then along the Y dimension. In wormhole routing, a packet is divided into a number of flits (flow control digits) for transmission. The size of a flit is typically the same as the channel width. The header flit of a packet determines the route and the remaining flits follow in a pipelined fashion. The network latency for wormhole routing is [...]
To model the performance of the algorithms, we use an approach similar to that used by Barnett [...] we assume that conflicting messages share the bandwidth of a network link and that there exists some positive integer $\gamma$ such that $P = 2^{\gamma} P_{sat}$. For the Delta, $\gamma = 1$ is a good approximation [1]. In other words, even if two messages contend for a link, there is no increase in communication time. Note that since the Delta has bidirectional links, two messages contend for a link only if they need to travel in the same direction simultaneously.
The time taken for an exchange operation may be different from the time to send to and receive from different processors, because in the latter case the incoming and outgoing messages may traverse links with different amounts of contention. Hence, we use Per: or Par depending on the algorithm. We assume that the time taken is independent of distance, a property of wormhole routing. Thus, the time required for an exchange step i is given by [...]
Algorithms
Scott [7] has shown that $a^3/4$ is the lower bound on the number of phases required to perform a complete exchange on an a x a mesh such that there is [...]
[...] the pairwise exchange (PEX) algorithm described in [2, 7], as it guarantees no link contention in the hypercube at every step. This algorithm has also been shown to perform well on the fat tree architecture of the CM-5 [6].
The algorithm is described in Figure 1. It requires p - 1 steps and the communication schedule is as follows. In step i, 1 ≤ i ≤ p - 1, each processor exchanges a message with the processor determined by taking the exclusive-or of its processor number with i. Therefore, this algorithm has the property that the entire communication pattern is decomposed into a sequence of pairwise exchanges. The communication schedule of PEX for 8 processors is given in Table 1. The entry i ↔ j in the table indicates that processors i and j exchange data.
Since each step of PEX involves an exchange between pairs of processors, the maximum number of messages contending for a link at any step on an r x c mesh is limited by max(r, c)/2. An exact expression for the maximum number of messages contending for a link at step i is
$f(i) = 2^{\lfloor \lg(\max(\mathrm{mod}(i, c),\ i/c)) \rfloor}$
Hence, the time taken for step i is [...]
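As an illustration of the contention expression for PEX, the following Python sketch simulates a single step under XY routing and counts how many messages cross each directed link. It is a minimal sketch under stated assumptions, not the authors' code: processors are numbered in row major order, the X dimension runs along a row, i/c is read as integer division, both mesh dimensions are powers of two, and the function names are hypothetical.

```python
from collections import Counter

def xy_route(src, dst, cols):
    """Directed links crossed from src to dst: X (along the row) first, then Y."""
    row, col = divmod(src, cols)          # row-major numbering (assumption)
    dst_row, dst_col = divmod(dst, cols)
    links = []
    while col != dst_col:                 # X dimension
        nxt = col + (1 if dst_col > col else -1)
        links.append(((row, col), (row, nxt)))
        col = nxt
    while row != dst_row:                 # Y dimension
        nxt = row + (1 if dst_row > row else -1)
        links.append(((row, col), (nxt, col)))
        row = nxt
    return links

def pex_max_contention(rows, cols, step):
    """Largest number of messages sharing one directed link in PEX step `step`."""
    load = Counter()
    for src in range(rows * cols):
        # PEX partner: exclusive-or of the processor number with the step
        for link in xy_route(src, src ^ step, cols):
            load[link] += 1
    return max(load.values())

def f(step, cols):
    """Closed form from the text, reading i/c as integer division."""
    m = max(step % cols, step // cols)
    return 1 << (m.bit_length() - 1)      # 2 ** floor(lg m)
```

Links are stored as ordered pairs of nodes, so the two directions of a bidirectional link are counted separately. On a 4 x 8 mesh, the simulated maximum equals f(i) for every step i from 1 to 31 and never exceeds max(r, c)/2 = 4.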
Pairwise Exchange for General Mesh
The PEX algorithm cannot be directly used if the number of processors is not a power-of-two, as the exclusive-or function will not create all the required processor pairs in p - 1 steps. The Pairwise Exchange for General Mesh (PEX-GEN) algorithm described in Figure 2 is an extension of PEX for non power-of-two meshes. The algorithm first finds the smallest power-of-two (say q) greater than the number of processors and uses this number to schedule q - 1 steps of the pairwise exchange. In each step, every processor checks to see if the calculated destination processor number is less than the actual number of processors. If so, it exchanges data with that processor; otherwise it goes ahead to the next step. Thus, the algorithm requires q - 1 steps, where q is the nearest power-of-two larger than the number of processors. Clearly, the algorithm takes more steps than necessary and many processors remain idle in several of the steps. However, this reduces the link contention in each step. The maximum contention in each step is upper bounded by that in the PEX algorithm.
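The schedule described above can be sketched in Python as follows. This is an illustrative sketch rather than the code of Figure 2; the function name is hypothetical, and each exchanging pair is recorded once with the smaller processor number first.

```python
def pex_gen_schedule(p):
    """For each of the q - 1 steps, list the pairs (a, b), a < b, that exchange."""
    q = 1 << (p - 1).bit_length()         # q = 2 ** ceil(lg p)
    schedule = []
    for step in range(1, q):
        pairs = []
        for me in range(p):
            partner = me ^ step           # same pairing rule as PEX
            # Skip the step if the partner does not exist in the real mesh
            if partner < p and me < partner:
                pairs.append((me, partner))
        schedule.append(pairs)
    return schedule
```

For p = 20, the schedule has 31 steps. Step 1 keeps all 20 processors busy (10 pairs), whereas step 16 contains only 4 pairs, so 12 processors are idle in that step.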
PEX-GEN with Shift (PEX-GEN-SHIFT)
The motivation for the Pairwise Exchange for General Mesh with Shift (PEX-GEN-SHIFT) algorithm can be explained with the help of Figure 3. [...] The algorithm described in Figure 4 maintains a balance of the number of active and inactive processors in all steps. This is done by defining virtual processor numbers such that the real processors 0 to 19 are numbered 6 to 25, as shown in Figure 3(b). The processor numbers are shifted by an amount equal to half the absolute difference between the number of processors and the nearest higher power-of-two. Thus, the empty space which earlier existed only in the half 16-31 is now equally divided between the two halves. So, even in the first 15 steps of the algorithm, there are equal numbers of idle processors in both halves, which balances the contention among all the steps of the algorithm. This algorithm also takes q - 1 steps, where q is the smallest power-of-two larger than the number of processors. The maximum contention in each step is upper bounded by that in the PEX algorithm.
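The virtual numbering can be sketched in Python as follows. This is a hedged illustration, not the code of Figure 4: the function name is hypothetical, and rounding the shift down when q - p is odd is an assumption, since the text only says "half the absolute difference".

```python
def pex_gen_shift_schedule(p):
    """Return (schedule, shift); schedule[k] lists the real pairs active in step k + 1."""
    q = 1 << (p - 1).bit_length()         # q = 2 ** ceil(lg p)
    shift = (q - p) // 2                  # assumption: round down if q - p is odd
    schedule = []
    for step in range(1, q):
        pairs = []
        for real in range(p):
            virtual = real + shift        # virtual processor number
            partner = (virtual ^ step) - shift   # partner, mapped back to a real number
            if 0 <= partner < p and real < partner:
                pairs.append((real, partner))
        schedule.append(pairs)
    return schedule, shift
```

For p = 20, the shift is 6, so real processors 0 to 19 become virtual processors 6 to 25. The 190 exchanges then split into 90 in the first 15 steps and 100 in the last 16 steps. Without the shift, the split is 126 and 64, which is the imbalance the shift is meant to reduce.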
General Algorithm for any Mesh
The above algorithms require one less than a power-of-two number of steps, because they use the exclusive-or function to obtain processor pairs which exchange with each other. For non power-of-two meshes, it would be advantageous to have an algorithm which requires only p - 1 steps.
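To show that p - 1 steps suffice for arbitrary p, the Python sketch below uses a cyclic-shift schedule, a standard technique in which processor j sends to processor (j + i) mod p in step i. This is an illustrative assumption and is not claimed to be the GEN algorithm; the function name is hypothetical.

```python
def cyclic_shift_schedule(p):
    """For each step i in 1..p-1, list (sender, receiver) with receiver = (sender + i) mod p."""
    return [[(src, (src + i) % p) for src in range(p)] for i in range(1, p)]
```

In such a schedule, every processor sends one message and receives one message in each step, but generally to and from different processors, so a step is a send and a receive rather than a pairwise exchange.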
Experimental Results
We implemented all the algorithms on the Delta and studied their performance for different mesh configurations and message sizes. The performance of PEX is shown in Table 2. The number of processors is varied from 16 to 512, with message size varied from 256 bytes to 16 Kbytes. Message size refers to the amount of data communicated in each send and receive operation, so the total amount of data communicated increases as the number of processors is increased. Hence, the time taken increases almost linearly with the number of processors.
The performance of PEX-GEN is given in Table 3. We have chosen some mesh sizes which are non power-of-two. We observe that for mesh sizes which are only slightly less than the nearest higher power-of-two, the performance is close to that of PEX for that power-of-two. But if the mesh size is only slightly higher than the nearest smaller power-of-two, the time taken is almost twice the time taken by PEX for that power-of-two. For example, the time taken by PEX-GEN on a 16 x 9 mesh is much higher than the time taken by PEX on a 16 x 8 mesh, but the time taken by PEX-GEN on a 16 x 14 mesh is very close to the time taken by PEX on a 16 x 16 mesh. This is because of the difference in the number of steps required. Another interesting observation is that the time taken by PEX-GEN on a 16 x 30 mesh is in fact higher than the time taken by PEX on a 16 x 32 mesh. Since the processors are numbered in row major order, a change in the number of columns from a power-of-two to a non power-of-two completely changes the communication pattern in the mesh for an algorithm which uses the exclusive-or function to determine processor pairs. Hence, there is more contention in the 16 x 30 case than in the 16 x 32 case.
Table 4 shows the performance of PEX-GEN-SHIFT. In most cases, it performs better than PEX-GEN. Table 5 gives the performance of GEN on a power-of-two mesh. GEN performs better than PEX for small message sizes and small numbers of processors. However, for large numbers of processors (≥ 64) and large message sizes (> 1 Kbytes), PEX performs better.
Table 4: Performance of PEX-GEN-SHIFT
Table 5: Performance of GEN on power-of-two mesh
The performance of GEN on non power-of-two meshes is given in Table 6. GEN reduces the number of steps from q - 1 in PEX-GEN and PEX-GEN-SHIFT, where $q = 2^{\lceil \lg p \rceil}$, to p - 1. For small numbers of processors, GEN performs the best and the improvement in performance is higher when q - p is large.
However, if q - p is small and the number of processors is large, the performance of PEX-GEN-SHIFT tends to that of PEX and the performance of GEN tends to that for a power-of-two mesh. So in this case, PEX-GEN-SHIFT performs better than GEN.
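The step counts behind these observations can be checked with a short calculation. The Python sketch below computes p, q, and the q - 1 and p - 1 step counts for the mesh sizes mentioned above; the function name is hypothetical, and no timing data is implied.

```python
def step_counts(rows, cols):
    """Return (p, q, q - 1, p - 1) for a rows x cols mesh, with q = 2 ** ceil(lg p)."""
    p = rows * cols
    q = 1 << (p - 1).bit_length()
    return p, q, q - 1, p - 1

if __name__ == "__main__":
    for rows, cols in [(16, 8), (16, 9), (16, 14), (16, 16), (16, 30), (16, 32)]:
        p, q, xor_steps, min_steps = step_counts(rows, cols)
        print(f"{rows} x {cols}: p={p} q={q} steps(q-1)={xor_steps} steps(p-1)={min_steps}")
```

A 16 x 9 mesh has p = 144 and q = 256, so the exclusive-or based algorithms need 255 steps, about twice the 127 steps of PEX on a 16 x 8 mesh, while a schedule with p - 1 steps would need 143. A 16 x 14 mesh also needs 255 steps, the same as PEX on a 16 x 16 mesh.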
Conclusions
In this paper, we have discussed algorithms to perform complete exchange on a wormhole routed mesh with performance results on the Intel Touchstone Delta.
For power-of-two meshes, when the number of processors is small (< 64) and message size is small (< 1 Kbytes), the GEN algorithm performs the best. For larger message and mesh sizes, PEX performs better. For non power-of-two meshes, PEX-GEN-SHIFT performs better than PEX-GEN, but they both require q - 1 steps, where $q = 2^{\lceil \lg p \rceil}$. GEN reduces the number of steps to p - 1 and performs better than PEX-GEN-SHIFT when q - p is large. As p tends to q, the mesh tends to a power-of-two mesh and the performance of PEX-GEN-SHIFT tends to that of PEX, while the performance of GEN tends to that for a power-of-two mesh.
Acknowledgments
This work was supported in part by ARPA under contract no. DABT63-91-C-0028. The content of the information does not necessarily reflect the position or policy of the Government and no official endorsement should be inferred. Alok Choudhary's research is also supported by an NSF Young Investigator Award CCR-9357840 and a grant from Intel SSD. This research was performed in part using the Intel Touchstone Delta System operated by California Institute of Technology on behalf of the Concurrent Supercomputing Consortium. Access to this facility was provided by the Center for Research on Parallel Computation.
