Complete Exchange on a Wormhole Routed Mesh by Thakur, Rajeev et al.
Syracuse University 
SURFACE 
Northeast Parallel Architecture Center College of Engineering and Computer Science 
1993 
Complete Exchange on a Wormhole Routed Mesh 
Rajeev Thakur 
Syracuse University, Northeast Parallel Architectures Center, thakur@npac.syr.edu 
Alok Choudhary 
Syracuse University, Northeast Parallel Architectures Center 
Geoffrey C. Fox 
Syracuse University, Northeast Parallel Architectures Center 
Recommended Citation 
Thakur, Rajeev; Choudhary, Alok; and Fox, Geoffrey C., "Complete Exchange on a Wormhole Routed Mesh" 
(1993). Northeast Parallel Architecture Center. 65. 
https://surface.syr.edu/npac/65 
Complete Exchange on a Wormhole Routed Mesh

Rajeev Thakur, Alok Choudhary*, Geoffrey Fox†
Northeast Parallel Architectures Center
111 College Place, Rm. 3-228
Syracuse University
Syracuse, NY 13244-4100
{thakur, choudhar, gcf}@npac.syr.edu

* Also with the Dept. of Electrical and Computer Eng., Syracuse University
† Also with the Dept. of Computer and Information Science, Syracuse University

Abstract

The complete exchange (or all-to-all personalized) communication pattern occurs frequently in many important parallel computing applications. We discuss several algorithms to perform complete exchange on a two-dimensional mesh-connected computer with wormhole routing. We propose algorithms for both power-of-two and non power-of-two meshes, as well as an algorithm which works for any arbitrary mesh. We have developed analytical models to estimate the performance of the algorithms on the basis of system parameters. These models take into account the effects of link contention and other characteristics of the communication system. Performance results on the Intel Touchstone Delta are presented and analyzed.

1 Introduction

Low-dimension high-bandwidth interconnection networks, such as a mesh, have recently emerged as a popular alternative to the earlier high-dimension low-bandwidth networks, such as the hypercube, for distributed memory multicomputers. The Intel Touchstone Delta, the Intel Paragon and the Symult 2010 use a two-dimensional mesh, while the MIT J-machine and the Mosaic computer developed at Caltech use a three-dimensional mesh [5]. All these machines use wormhole routing, an important feature of which is that the network latency is almost independent of the path length when there is no link contention and the packet size is large. In this paper, we discuss four algorithms to perform complete exchange on a mesh-connected computer with wormhole routing. The complete exchange or all-to-all personalized communication pattern is one in which all processors simultaneously need to communicate with all other processors. It occurs in many applications such as parallel quicksort, some implementations of the 2D FFT, matrix transpose, and array redistribution. It is the densest form of communication and can result in a lot of link contention. Hence it is necessary to use efficient algorithms to perform complete exchange.

Complete exchange algorithms for a hypercube architecture are described in [2, 4]. Ponnusamy, Thakur et al. [6] discuss complete exchange on the fat tree architecture of the CM-5. These algorithms assume that the number of processors is a power-of-two, which is a valid assumption for those architectures. The mesh architecture introduces different problems because of high contention and the fact that the user can allocate a mesh size which need not be a power-of-two and may even be an odd number (e.g., 5 × 5). Bokhari and Berryman [3] describe two algorithms for a circuit-switched mesh, which assume that the number of processors is a power-of-two. In this paper, we discuss algorithms for both power-of-two and non power-of-two meshes. We have developed analytical models to estimate the performance of the algorithms. We present performance results on the Intel Touchstone Delta.

Section 2 describes the architecture of the Delta and the performance model used for the algorithms. The algorithms are described in Section 3. The performance of the algorithms on the Delta is discussed in Section 4, followed by conclusions in Section 5.

2 Architecture and Performance Model

The Intel Touchstone Delta is a 16 × 32 mesh of computational nodes, each of which is an Intel i860/XR microprocessor. The two-dimensional mesh interconnection network has bidirectional links with wormhole routing. It uses deterministic XY routing, in which packets are first sent along the X dimension and then along the Y dimension. In wormhole routing, a packet is divided into a number of flits (flow control digits) for transmission. The size of a flit is typically the same as the channel width. The header flit of a packet determines the route and the remaining flits follow in a pipeline
fashion. The network latency for wormhole routing is $(L_f/B)\,D + L/B$, where $L_f$ is the length of each flit, $B$ is the channel bandwidth, $D$ is the path length, and $L$ is the length of the message. Thus, if $L_f \ll L$, the path length $D$ will not significantly affect the network latency, provided there is no link contention. Details of wormhole routing techniques can be found in [5].

To model the performance of the algorithms, we use an approach similar to that used by Barnett et al. in [1]. The following notations are used in our models:

  $\alpha$        startup time per message
  $\beta_{ex}$    transfer time per byte for an exchange with no link conflicts
  $\beta_{sr}$    transfer time per byte to send to and receive from different processors with no link conflicts
  $\beta_{sat}$   transfer time per byte on a saturated link
  $L$             number of bytes to be exchanged per processor pair
  $f(i)$          maximum number of messages contending for a saturated link at step $i$
  $r$             number of rows in the mesh
  $c$             number of columns in the mesh
  $p$             total number of processors $= r \times c$

The time taken for an exchange operation may be different from the time to send to and receive from different processors, because in the latter case the incoming and outgoing messages may traverse links with different amounts of contention. Hence, we use $\beta_{ex}$ or $\beta_{sr}$ depending on the algorithm. We assume that the time taken is independent of distance, a property of wormhole routing. Thus, the time required for an exchange step $i$ is given by

  $T = \alpha + L \max(\beta_{ex},\, f(i)\,\beta_{sat})$

We assume that conflicting messages share the bandwidth of a network link and that there exists some positive integer $\gamma$ such that $\beta = 2^{\gamma}\beta_{sat}$, where $\beta$ is the transfer time per byte with no link conflicts. For the Delta, $\gamma = 1$ is a good approximation [1]. In other words, even if two messages contend for a link, there is no increase in communication time.
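
The latency expression and the per-step cost can be evaluated numerically with a short sketch. Python is used only for illustration; the function names are hypothetical, and every numeric value below is an assumption chosen for the example, not a Delta measurement. The sketch writes the startup time as `alpha`, the per-byte time without link conflicts as `beta`, and the per-byte time on a saturated link as `beta_sat`, and assumes `beta = 2 * beta_sat` (the case described above in which two contending messages cause no slowdown).

```python
def wormhole_latency(flit_len, bandwidth, path_len, msg_len):
    """Network latency (L_f / B) * D + L / B for an uncontended path."""
    return (flit_len / bandwidth) * path_len + msg_len / bandwidth


def step_time(alpha, beta, beta_sat, msg_len, contention):
    """Time of one step: alpha + L * max(beta, f * beta_sat)."""
    return alpha + msg_len * max(beta, contention * beta_sat)


# Illustrative values only (assumptions, not measurements).
FLIT, BW, MSG = 2.0, 10.0, 4096.0           # bytes, bytes per time unit, bytes
near = wormhole_latency(FLIT, BW, 1, MSG)    # destination one hop away
far = wormhole_latency(FLIT, BW, 30, MSG)    # destination 30 hops away
print(f"latency at D=1: {near:.1f}, at D=30: {far:.1f}")

ALPHA, BETA_SAT = 100.0, 0.05
BETA = 2 * BETA_SAT                           # two contenders cost nothing extra
for f in (1, 2, 4):
    print(f"f={f}: step time {step_time(ALPHA, BETA, BETA_SAT, 1024, f):.1f}")
```

With these made-up numbers the latency changes by less than 2 percent between 1 and 30 hops, and the step time is the same for one or two contending messages but grows once four messages share a link.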
Note that since the Delta has bidirectional links, two messages contend for a link only if they need to travel in the same direction simultaneously.

3 Algorithms

Scott [7] has shown that $a^3/4$ is the lower bound on the number of phases required to perform a complete exchange on an $a \times a$ mesh such that there is no link contention in any phase. However, if we allow link contention to exist, the operation can be performed in fewer steps. We have adopted this approach of allowing a small amount of link contention to exist, thereby reducing the number of steps and keeping all processors active at every step. This approach takes advantage of the fact that in machines like the Delta and the Paragon, the links have excess bandwidth, so that a small number of contending messages will not affect the communication time.

3.1 Pairwise Exchange for Power-of-Two Mesh (PEX)

The best algorithm for a hypercube architecture is the pairwise exchange algorithm described in [2, 7], as it guarantees no link contention in the hypercube at every step. This algorithm has also been shown to perform well on the fat tree architecture of the CM-5 [6]. The algorithm is described in Figure 1. It requires $p - 1$ steps and the communication schedule is as follows. In step $i$, $1 \le i \le p - 1$, each processor exchanges a message with the processor determined by taking the exclusive-or of its processor number with $i$. Therefore, this algorithm has the property that the entire communication pattern is decomposed into a sequence of pairwise exchanges. The communication schedule of PEX for 8 processors is given in Table 1. The entry i ↔ j in the table indicates that processors i and j exchange data.

    do i = 1, p - 1
        destination = xor(mynumber, i)
        Exchange with destination
    end do

Figure 1: Algorithm for PEX

Table 1: Communication Schedule for PEX on 8 Procs

  Step 1   Step 2   Step 3   Step 4   Step 5   Step 6   Step 7
  0 ↔ 1    0 ↔ 2    0 ↔ 3    0 ↔ 4    0 ↔ 5    0 ↔ 6    0 ↔ 7
  2 ↔ 3    1 ↔ 3    1 ↔ 2    1 ↔ 5    1 ↔ 4    1 ↔ 7    1 ↔ 6
  4 ↔ 5    4 ↔ 6    4 ↔ 7    2 ↔ 6    2 ↔ 7    2 ↔ 4    2 ↔ 5
  6 ↔ 7    5 ↔ 7    5 ↔ 6    3 ↔ 7    3 ↔ 6    3 ↔ 5    3 ↔ 4

Since each step of PEX involves an exchange between pairs of processors, the maximum number of messages contending for a link at any step is limited by $\max(r, c)/2$. An exact expression for the maximum number of messages contending for a link at step $i$ is

  $f(i) = 2^{\lfloor \lg\{\max(\mathrm{mod}(i, c),\; i/c)\} \rfloor}$

Hence, the time taken for step $i$ is

  $T(i) = \alpha + L \max(\beta_{ex},\, f(i)\,\beta_{sat})$

The cost of PEX can be determined by summing over all steps of the algorithm:

  $T_{PEX} = \sum_{i=1}^{p-1} \left[\alpha + L \max(\beta_{ex},\, f(i)\,\beta_{sat})\right] = (p - 1)\,\alpha + L \sum_{i=1}^{p-1} \max(\beta_{ex},\, f(i)\,\beta_{sat})$
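
The PEX schedule, its contention expression and the summed cost can be sketched in a few lines. This is an illustrative Python sketch, not the code run on the Delta: the function names are hypothetical, and the term i/c in the contention expression is assumed here to mean integer division, which fits row-major processor numbering.

```python
def pex_schedule(p):
    """Return one list of exchanging pairs (i, j), i < j, per PEX step."""
    if p < 1 or p & (p - 1):
        raise ValueError("PEX assumes a power-of-two number of processors")
    schedule = []
    for step in range(1, p):
        # Each processor n is paired with n xor step; store each pair once.
        pairs = sorted({(min(n, n ^ step), max(n, n ^ step)) for n in range(p)})
        schedule.append(pairs)
    return schedule


def pex_contention(i, c):
    """f(i) = 2 ** floor(lg(max(i mod c, i div c))) for step i >= 1."""
    if i < 1:
        raise ValueError("steps are numbered from 1")
    m = max(i % c, i // c)            # assumption: i/c is integer division
    return 1 << (m.bit_length() - 1)  # largest power of two not exceeding m


def pex_time(r, c, msg_len, alpha, beta_ex, beta_sat):
    """Sum of alpha + L * max(beta_ex, f(i) * beta_sat) over steps 1 .. p-1."""
    p = r * c
    return sum(alpha + msg_len * max(beta_ex, pex_contention(i, c) * beta_sat)
               for i in range(1, p))


for step, pairs in enumerate(pex_schedule(8), start=1):
    print(f"Step {step}:", "  ".join(f"{a}<->{b}" for a, b in pairs))
```

Each printed row corresponds to one column of Table 1. Using `bit_length` avoids floating-point logarithms when evaluating the floor of lg.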
3.2 Pairwise Exchange for General Mesh (PEX-GEN)

The PEX algorithm cannot be directly used if the number of processors is not a power-of-two, as the exclusive-or function will not create all the required processor pairs in $p - 1$ steps. The Pairwise Exchange for General Mesh (PEX-GEN) algorithm described in Figure 2 is an extension of PEX for non power-of-two meshes. The algorithm first finds the smallest power-of-two (say $q$) greater than the number of processors and uses this number to schedule $q - 1$ steps of the pairwise exchange. In each step, every processor checks to see if the calculated destination processor number is less than the actual number of processors. If so, it exchanges data with that processor; otherwise it goes ahead to the next step. Thus, the algorithm requires $q - 1$ steps, where $q$ is the nearest power-of-two larger than the number of processors. Clearly, the algorithm takes more steps than necessary and many processors remain idle in several of the steps. However, this reduces the link contention in each step. The maximum contention in each step is upper bounded by that in the PEX algorithm.

    q = 2^ceil(lg p)
    do j = 1, q - 1
        destination = xor(mynumber, j)
        if (destination < p) then
            Exchange with destination
        end if
    end do

Figure 2: Algorithm for PEX-GEN

3.3 PEX-GEN with Shift (PEX-GEN-SHIFT)

The motivation for the Pairwise Exchange for General Mesh with Shift (PEX-GEN-SHIFT) algorithm can be explained with the help of Figure 3(a). Assume that the user has allocated a mesh of 20 processors numbered 0 to 19. The nearest power-of-two larger than 20 is 32, so PEX-GEN will require 31 steps. In the first 15 steps of PEX-GEN, processors 0 to 15 exchange completely among themselves and processors 16 to 19 exchange completely among themselves. In the next 16 steps, processors 0 to 15 exchange with processors 16 to 19. Since there are only 4 processors numbered greater than 15, many of the processors 0 to 15 do not do any communication in many of the last 16 steps.
Hence there is high link contention in steps 1 to 15 and very little or no link contention in steps 16 to 31. In general, if there are $p$ processors where $p$ is not a power of two, PEX-GEN will require $q - 1$ steps, where $q = 2^{\lceil \lg p \rceil}$. In the first $\lfloor (q - 1)/2 \rfloor$ steps, the first $q/2$ processors are active and in the remaining steps, several of them are inactive.

[Figure 3: Processor Shift. (a) 20 processors allocated: processors 0 to 19 within the range 0 to 31. (b) Processor numbers shifted: the 20 processors are numbered 6 to 25 within the range 0 to 31.]

The Pairwise Exchange for General Mesh with Shift (PEX-GEN-SHIFT) algorithm described in Figure 4 maintains a balance of the number of active and inactive processors in all steps. This is done by defining virtual processor numbers such that the real processors 0 to 19 are numbered 6 to 25, as shown in Figure 3(b). The processor numbers are shifted by an amount equal to half the absolute difference between the number of processors and the nearest higher power of two. Thus the empty space, which earlier existed only in the half 16 to 31, is now equally divided between the two halves. So, even in the first 15 steps of the algorithm, there is an equal number of idle processors in both halves, which balances the contention among all the steps of the algorithm. This algorithm also takes $q - 1$ steps, where $q$ is the smallest power-of-two larger than the number of processors. The maximum contention in each step is upper bounded by that in the PEX algorithm.

    q = 2^ceil(lg p)
    shift = (q - p)/2
    myvirtual = mod(mynumber + shift, q)
    do j = 1, q - 1
        virtual_destination = xor(myvirtual, j)
        destination = virtual_destination - shift
        if (destination < 0) then
            destination = destination + q
        end if
        if (destination < p) then
            Exchange with destination
        end if
    end do

Figure 4: Algorithm for PEX-GEN-SHIFT
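
The effect of the shift can be examined by counting how many processors actually exchange in each step. The following Python sketch is illustrative only: the function names are hypothetical, the virtual number is taken to be `mynumber + shift` (so that processors 0 to 19 map to 6 to 25, as in the example), and integer division is assumed for the shift.

```python
def next_pow2(p):
    """q = 2 ** ceil(lg p), computed with integers."""
    return 1 << (p - 1).bit_length()


def pex_gen_partner(me, j, p):
    """Partner of processor `me` in step j of PEX-GEN, or None if idle."""
    dest = me ^ j
    return dest if dest < p else None


def pex_gen_shift_partner(me, j, p):
    """Partner of processor `me` in step j of PEX-GEN-SHIFT, or None if idle."""
    q = next_pow2(p)
    shift = (q - p) // 2          # assumption: integer division
    virtual = me + shift          # virtual numbers occupy shift .. shift + p - 1
    dest = (virtual ^ j) - shift
    if dest < 0:
        dest += q                 # wraps into the unused range q - shift .. q - 1
    return dest if dest < p else None


def active_counts(partner, p):
    """Number of processors that exchange in each of the q - 1 steps."""
    q = next_pow2(p)
    return [sum(partner(me, j, p) is not None for me in range(p))
            for j in range(1, q)]


P = 20
print("PEX-GEN      :", active_counts(pex_gen_partner, P))
print("PEX-GEN-SHIFT:", active_counts(pex_gen_shift_partner, P))
```

For p = 20 the unshifted counts are 20 in steps 1 to 3, 16 in steps 4 to 15 and 8 in steps 16 to 31, which is the imbalance described above; the shifted counts can be compared in the printed output. In both variants the counts sum to p(p − 1), since every ordered pair of processors is served exactly once.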
3.4 General Algorithm for any Mesh (GEN)

The above algorithms require one less than a power of two number of steps, because they use the exclusive-or function to obtain processor pairs which exchange with each other. For non power-of-two meshes, it would be advantageous to have an algorithm which requires only $p - 1$ steps. Figure 5 describes such an algorithm, which we call the General Algorithm for any Mesh (GEN), because it works for any number of processors. In the GEN algorithm, processor pairs do not exchange with each other. Instead, at step $i$, a processor $j$ sends data to processor $\mathrm{mod}(j + i, p)$ and receives data from processor $j - i$ if $j \ge i$, and $j - i + p$ if $j < i$. Clearly, this algorithm will require only $p - 1$ steps, for any value of $p$.

    do j = 1, p - 1
        destination = mod(mynumber + j, p)
        source = mynumber - j
        if (source < 0) then
            source = source + p
        end if
        send to destination
        receive from source
    end do

Figure 5: Algorithm for GEN

The maximum contention at step $i$ is given by

  $f(i) = \min[\mathrm{mod}(i, c),\; c - \mathrm{mod}(i, c)] + \min[i/c,\; (p - i)/c]$

The total time for all steps can be obtained as:

  $T_{GEN} = \sum_{i=1}^{p-1} \left[\alpha + L \max(\beta_{sr},\, f(i)\,\beta_{sat})\right] = (p - 1)\,\alpha + L \sum_{i=1}^{p-1} \max(\beta_{sr},\, f(i)\,\beta_{sat})$

4 Experimental Results

We implemented all the algorithms on the Delta and studied their performance for different mesh configurations and message sizes. The performance of PEX is shown in Table 2. The number of processors is varied from 16 to 512, with message size varied from 256 bytes to 16 Kbytes. Message size refers to the amount of data communicated in each send and receive operation, so the total amount of data communicated increases as the number of processors is increased. Hence, the time taken increases almost linearly with the number of processors.

The performance of PEX-GEN is given in Table 3. We have chosen some mesh sizes which are non power-of-two. We observe that for mesh sizes which are only slightly less than the nearest higher power-of-two, the performance is close to that of PEX for that power-of-two.
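
Because PEX-GEN and PEX-GEN-SHIFT always schedule q − 1 steps while GEN schedules p − 1, it can help to tabulate the step counts for the non power-of-two mesh sizes used in the measurements. The Python sketch below is illustrative arithmetic only; the function and dictionary key names are hypothetical.

```python
# Non power-of-two mesh configurations (rows, columns) used in the experiments.
MESHES = [(4, 5), (6, 8), (16, 9), (16, 14), (16, 30)]


def next_pow2(p):
    """q = 2 ** ceil(lg p), computed with integers."""
    return 1 << (p - 1).bit_length()


def step_counts(r, c):
    """Step counts of the exclusive-or based algorithms and of GEN."""
    p = r * c
    q = next_pow2(p)
    return {"p": p, "q": q, "pex_gen_steps": q - 1, "gen_steps": p - 1}


for r, c in MESHES:
    info = step_counts(r, c)
    print(f"{r} x {c}: p={info['p']}, q={info['q']}, "
          f"PEX-GEN steps={info['pex_gen_steps']}, GEN steps={info['gen_steps']}")
```

For a 16 × 9 mesh, p = 144 gives q = 256, so PEX-GEN runs 255 steps, against the 127 steps PEX needs on a 16 × 8 mesh. For a 16 × 14 mesh, p = 224 also gives 255 steps, the same number PEX needs on a 16 × 16 mesh.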
If, however, the mesh size is only slightly higher than the nearest smaller power-of-two, the time taken is almost twice the time taken by PEX for that power-of-two. For example, the time taken by PEX-GEN on a 16 × 9 mesh is much higher than the time taken by PEX on a 16 × 8 mesh, but the time taken by PEX-GEN on a 16 × 14 mesh is very close to the time taken by PEX on a 16 × 16 mesh. This is because of the difference in the number of steps required. Another interesting observation is that the time taken by PEX-GEN on a 16 × 30 mesh is in fact higher than the time taken by PEX on a 16 × 32 mesh. Because the processors are numbered in row-major order, a change in the number of columns from a power-of-two to a non power-of-two changes the communication pattern in the mesh completely for an algorithm which uses the exclusive-or function to determine processor pairs. Hence, there is more contention in the 16 × 30 case than in the 16 × 32 case.

Table 2: Performance of PEX

  Message Size   Time in sec. for a Mesh Configuration
  (bytes)        4 × 4     8 × 8     16 × 8    16 × 16   16 × 32
  256            0.004     0.022     0.045     0.094     0.203
  1K             0.008     0.064     0.120     0.290     0.860
  4K             0.023     0.114     0.355     0.999     3.218
  8K             0.034     0.228     0.692     2.068     6.794
  16K            0.064     0.441     1.413     4.145     13.61

Table 3: Performance of PEX-GEN

  Message Size   Time in sec. for a Mesh Configuration
  (bytes)        4 × 5     6 × 8     16 × 9    16 × 14   16 × 30
  256            0.008     0.019     0.085     0.092     0.211
  1K             0.017     0.038     0.191     0.270     0.899
  4K             0.037     0.091     0.576     0.977     3.588
  8K             0.073     0.174     1.188     2.007     7.616
  16K            0.138     0.333     2.403     4.056     15.82

Table 4 shows the performance of PEX-GEN-SHIFT. In most cases, it performs better than PEX-GEN. Table 5 gives the performance of GEN on a power-of-two mesh. GEN performs better than PEX for small message sizes and small numbers of processors. However, for large numbers of processors (≥ 64) and large message sizes (> 1 Kbytes), PEX performs better. The GEN algorithm has a certain amount of asymmetry in the communication, in the sense that each communication operation consists of a send to one processor and a receive from some other processor. Thus, the incoming and outgoing messages may traverse a different number of links with different amounts of contention, and the path which has the highest amount of contention adversely affects the communication time. On the other hand, in the PEX algorithm, processor pairs exchange with each other at every step, so the incoming and outgoing messages travel the same number of links with the same amount of contention.

Table 4: Performance of PEX-GEN-SHIFT

  Message Size   Time in sec. for a Mesh Configuration
  (bytes)        4 × 5     6 × 8     16 × 9    16 × 14   16 × 30
  256            0.008     0.019     0.085     0.092     0.211
  1K             0.017     0.038     0.188     0.263     0.894
  4K             0.036     0.091     0.543     0.933     3.526
  8K             0.071     0.170     1.111     1.948     7.515
  16K            0.129     0.333     2.242     3.844     15.74

Table 5: Performance of GEN on power-of-two mesh

  Message Size   Time in sec. for a Mesh Configuration
  (bytes)        4 × 4     8 × 8     16 × 8    16 × 16   16 × 32
  256            0.004     0.016     0.042     0.089     0.283
  1K             0.008     0.042     0.123     0.346     1.217
  4K             0.018     0.145     0.461     1.220     3.944
  8K             0.037     0.290     0.933     2.511     8.007
  16K            0.069     0.576     1.947     5.052     16.15

Table 6: GEN on non power-of-two mesh

  Message Size   Time in sec. for a Mesh Configuration
  (bytes)        4 × 5     6 × 8     16 × 9    16 × 14   16 × 30
  256            0.004     0.015     0.046     0.074     0.246
  1K             0.009     0.027     0.146     0.285     1.069
  4K             0.025     0.083     0.527     0.998     3.706
  8K             0.052     0.186     1.071     2.011     7.752
  16K            0.098     0.369     2.182     4.005     15.94

The performance of GEN on non power-of-two meshes is given in Table 6. GEN reduces the number of steps from $q - 1$ in PEX-GEN and PEX-GEN-SHIFT, where $q = 2^{\lceil \lg p \rceil}$, to $p - 1$. For a small number of processors, GEN performs the best, and the improvement in performance is higher when $q - p$ is large. However, if $q - p$ is small and the number of processors is large, the performance of PEX-GEN-SHIFT tends to that of PEX and the performance of GEN tends to that for a power-of-two mesh.
So in this case, PEX-GEN-SHIFT performs better than GEN.

5 Conclusions

In this paper, we have discussed algorithms to perform complete exchange on a wormhole routed mesh, with performance results on the Intel Touchstone Delta.

For power-of-two meshes, when the number of processors is small (< 64) and the message size is small (< 1 Kbytes), the GEN algorithm performs the best. For larger message and mesh sizes, PEX performs better. For non power-of-two meshes, PEX-GEN-SHIFT performs better than PEX-GEN, but they both require $q - 1$ steps, where $q = 2^{\lceil \lg p \rceil}$. GEN reduces the number of steps to $p - 1$ and performs better than PEX-GEN-SHIFT when $q - p$ is large. As $p$ tends to $q$, the mesh tends to a power-of-two mesh and the performance of PEX-GEN-SHIFT tends to that of PEX, while the performance of GEN tends to that for a power-of-two mesh.

Acknowledgments

This work was supported in part by ARPA under contract no. DABT63-91-C-0028. The content of the information does not necessarily reflect the position or policy of the Government and no official endorsement should be inferred. Alok Choudhary's research is also supported by an NSF Young Investigator Award CCR-9357840 and a grant from Intel SSD. This research was performed in part using the Intel Touchstone Delta System operated by California Institute of Technology on behalf of the Concurrent Supercomputing Consortium. Access to this facility was provided by the Center for Research on Parallel Computation.

References

[1] Barnett, M., Littlefield, R., Payne, D., and van de Geijn, R., "Global Combine on Mesh Architectures with Wormhole Routing", Proc. of 7th Int. Parallel Proc. Symp., April 1993.
[2] Bokhari, S., "Complete Exchange on the iPSC/860", ICASE Technical Report 91-4, 1991.
[3] Bokhari, S., and Berryman, H., "Complete Exchange on a Circuit Switched Mesh", Proc. of Scalable High Perf. Computing Conf., 1992, pp. 300–306.
[4] Johnsson, S. L., and Ho, C., "Optimum Broadcasting and Personalized Communication in Hypercubes", IEEE Trans. on Computers, September 1989, pp. 1249–1268.
[5] Ni, L., and McKinley, P., "A Survey of Wormhole Routing Techniques in Direct Networks", Computer, February 1993, pp. 62–76.
[6] Ponnusamy, R., Thakur, R., Choudhary, A., and Fox, G., "Scheduling Regular and Irregular Communication Patterns on the CM-5", Proc. of Supercomputing 92, November 1992, pp. 394–402.
[7] Scott, D., "Efficient All-to-All Communication Patterns in Hypercube and Mesh Topologies", Proc. of 6th Distributed Memory Computing Conf., 1991, pp. 398–403.
