Optimal communication channel utilization for matrix transposition and related permutations on binary cubes  by Johnsson, S.Lennart & Ho, Ching-Tien
ELSEVIER  Discrete Applied Mathematics 53 (1994) 251-274
Optimal communication channel utilization for matrix 
transposition and related permutations on binary cubes 
S. Lennart Johnsson*,a, Ching-Tien Ho b
a Harvard University and Thinking Machines Corp., Cambridge, MA, USA
b IBM Almaden Research Center, San Jose, CA 95120, USA
Received 27 August 1991; revised 22 July 1992 
Abstract 
We present optimal schedules for permutations in which each node sends one or several 
unique messages to every other node. With concurrent communication on all channels of every 
node in binary cube networks, the number of element transfers in sequence for K elements per 
node is K/2, irrespective of the number of nodes over which the data set is distributed. For 
a succession of s permutations within disjoint subcubes of d dimensions each, our schedules 
yield $\min(K/2 + (s-1)d,\ (s+3)d,\ K/2 + 2d)$ exchanges in sequence. The algorithms can be
organized to avoid indirect addressing in the internode data exchanges, a property that 
increases the performance on some architectures. 
For message-passing communication libraries, we present a blocking procedure that 
minimizes the number of block transfers while preserving the utilization of the communication 
channels. For schedules with optimal channel utilization, the number of block transfers for 
a binary d-cube is d. The maximum block size for K elements per node is $\lceil K/(2d) \rceil$.
1. Introduction 
We give simple, yet optimal, schedules for a class of permutations on multipro- 
cessors configured as binary cubes with concurrent communication on all channels of 
every processor, all-port communication. The class of permutations we consider is 
all-to-all personalized communication (AAPC), in which each node sends a unique 
message to every other node. The Connection Machine systems CM-2 and CM-200 
[20] are examples of binary cube configured multiprocessor architectures allowing 
concurrent communication on all channels of every node. Our schedules avoid 
indirect addressing in the data exchanges by performing a local alignment of the data 
in each processor prior to, and after, the data interchanges between processors. 
*Corresponding author. E-mail: johnsson@harvard.edu.
0166-218X/94/$07.00 © 1996 Elsevier Science B.V. All rights reserved
SSDI 0166-218X(93)E0052-Z
Examples of all-to-all personalized communication are bit-reversal, vector-reversal, 
matrix transposition and shuffle permutations. Conversion between cyclic and 
consecutive mapping [9] for array allocations, such that a number of memory address 
bits are exchanged with the same number of processor bits, also constitutes all-to-all 
personalized communication. Cyclic and consecutive data layout can be specified in 
Vienna Fortran [21] and in Fortran D [4]. Both layouts are also included in the 
emerging High Performance Fortran standard. Any one of the above permutations, 
combined with code conversion, such as conversion between binary code and
binary-reflected Gray code [10, 13], also constitutes all-to-all personalized communication.
A succession of AAPCs is necessary in some computations. For instance, a Fast 
Fourier Transform on 4096 data elements distributed evenly over 512 processors can 
be performed as a sequence of three local transforms, each on eight data elements. 
Between successive local transforms, an all-to-all personalized communication within 
3-cubes must be performed. Successive AAPCs are performed on successive sets of 
three processor dimensions. With data in cyclic order [9], the first communication 
exchanges the local memory address bits (maddr) with the bits AAPC-0 within the 
processor address field (paddr). The second AAPC exchanges the local memory 
address bits with the bits AAPC-1, etc. 
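$$(\underbrace{\underbrace{a_{4d-1} a_{4d-2} \ldots a_{3d}}_{\text{AAPC-2}}\ \underbrace{a_{3d-1} a_{3d-2} \ldots a_{2d}}_{\text{AAPC-1}}\ \underbrace{a_{2d-1} a_{2d-2} \ldots a_{d}}_{\text{AAPC-0}}}_{\text{paddr}}\ \underbrace{a_{d-1} a_{d-2} \ldots a_{0}}_{\text{maddr}})$$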
Each segment of the processor address field can be viewed as an axis encoded in d bits. 
The successive AAPCs in the example perform the axis exchanges $(l,k,j|i) \to (l,k,i|j) \to (l,j,i|k) \to (k,j,i|l)$, which constitute a generalized shuffle [5, 6, 11].
For the pipelining of a succession of AAPCs, the time elapsed from the first motion 
of an element until it reaches its destination affects the pipeline delay. We refer to the 
maximum elapsed time for any element as the span. We assume that each node in
a d-dimensional binary cube, d-cube, concurrently can send and receive one data
element on all its ports in one time step, i.e., all-port communication. The all-port
communication implies that all communication links are full duplex. The number of
elements per processor is K. 
The lower bound for the number of time steps for all-port, all-to-all personalized 
communication can be shown to be K/2 [1, 12]. The optimum span is d. We present
five algorithms, of which each of the first three has either an optimum number of element
transfers or optimum span. The fourth algorithm has both an optimum number of
element transfers and optimum span. The first three algorithms are very simple and 
provide some of the essential ideas for the fourth algorithm. The fifth algorithm is for 
multiple, pipelined AAPCs. 
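
The K/2 bound can be checked by a simple counting argument, sketched below under the assumptions K = 2^d (one element per destination) and minimum-length routing: an element with relative address i has to cross one link per nonzero bit of i, and each node can drive only d outgoing channels per step. The function name `aapc_lower_bound` is ours and does not appear in the paper.

```python
def aapc_lower_bound(d):
    """Counting bound on the time steps of one AAPC on a d-cube, assuming K = 2**d."""
    k = 2 ** d
    # An element with relative address i needs one hop per nonzero bit of i.
    hops_per_node = sum(bin(i).count("1") for i in range(k))
    # Every node offers d outgoing channels per step, and the traffic is
    # the same for every node, so the per-node quotient bounds the step count.
    assert hops_per_node % d == 0
    return hops_per_node // d


for d in range(1, 8):
    assert aapc_lower_bound(d) == 2 ** d // 2   # i.e. K/2
```

The sum of the bit counts is d·2^(d-1), so the quotient is 2^(d-1) = K/2 for every d.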
In many message-passing libraries for multiprocessors, there is a significant over- 
head for each message. It is often desirable to send few messages with a large amount 
of data in each message instead of many messages with a small amount of data in each 
message. We present a blocking procedure that minimizes the number of block 
transfers and the block size for a given schedule for a single AAPC. The blocking
preserves the number of time steps, i.e., the utilization of the communication links is
the same with blocking as without blocking. For Algorithm 4, which has the optimum
number of time steps and optimum span, the minimal number of block transfers is d,
which is optimal. The block size for this number of block transfers is $\lceil K/(2d) \rceil$, which
is minimal for d block transfers.
The procedure for minimizing the number of messages in a message-passing system 
can also be used to achieve optimal utilization of communication links when the 
width of the communication links is a multiple of the element size, or when there are 
multiple links between pairs of nodes. The Connection Machine systems CM-2 and 
CM-200 have two communication links between pairs of nodes forming a binary cube 
with up to 11 dimensions. For a channel width of b elements, our procedure yields
$d \lceil K/(2db) \rceil$ time steps for a single AAPC. Thus, the time is reduced in proportion to
the width of each link. 
Saad and Schultz [18] have suggested a recursive AAPC algorithm based on $2^d$
translated binomial trees. The algorithm requires $dK/2$ time steps; it is primarily of
interest for one-port communication, i.e., communication restricted to one send and 
one receive per node per time step. One-port communication algorithms have also 
been presented by Nassimi and Sahni [16,17] and Flanders [3]. Nassimi and Sahni 
discuss bit-permute complement (BPC) permutations on mesh or hypercube con- 
figured multiprocessor systems with one-port communication. Flanders [3] considers 
similar permutations on two-dimensional meshes with one-port communication. 
In [12] we presented one-port and all-port algorithms based on balanced trees and 
rotated binomial trees. For all-port communication, the algorithms attain the opti- 
mum number of time steps, when each node sends a multiple of d elements to every 
other node, i.e., $K = \alpha d 2^d$ for some integer $\alpha$. For other values of K, the algorithms in
[12] are optimal within a small constant factor ( < 24% for d > 4) [7]. Algorithms 2,
3, and 4 presented here all have the optimum number of time steps for any K. 
Algorithms 1 and 4 have the optimum span. Independent of the work reported in [12], 
Stout and Wagar [19] also gave an algorithm with the optimum number of time steps 
for all-port communication. Later, Bertsekas et al. [1] presented yet another
algorithm with an optimal number of time steps for any K. A detailed optimal scheduling
algorithm for all-port communication has also been presented by Edelman [2], who
implemented the algorithm on the Connection Machine system CM-2. The issue of
minimum span is not addressed in any of the previous algorithms. Indeed, the
schedule used by Edelman has a span that is greater than $2^{d-2}$. The schedule
computation time is proportional to $O(2^d)$. Our Algorithm 4 improves upon previous
algorithms by offering both the optimum number of time steps and optimum span, 
and by easily being amenable to blocking for message-passing communication libra- 
ries, while preserving the efficiency in using the communication links. Our Algorithms 
2 and 3 have very simple control and are easier to implement than the algorithms in 
[1, 2].
The outline of this paper is as follows. In Section 2 we define the concepts used and 
ideas common to the algorithms presented in this paper. We also discuss pipelining of 
several AAPCs on different sets of cube dimensions, but with a shared set of local 
memory dimensions. The need for efficient pipelining of a sequence of AAPCs was our 
motivation for devising Algorithm 4. In Section 3 we present our algorithms for 
a single AAPC. Section 4 discusses the use of the algorithms for a single AAPC in 
performing multiple AAPCs. An algorithm performing multiple concurrent AAPCs is 
presented. Section 5 presents an idea for minimizing the communications overhead in 
message-passing libraries, and Section 6 presents an idea for the efficient utilization of 
wide channels. Section 7 gives a summary of our results. 
2. All-to-all personalized communication on binary cubes 
2.1. Preliminaries 
A binary n-cube has $N = 2^n$ nodes. Each node has n neighbors which, with the
conventional binary addressing scheme, correspond to the n different single bit
complementations of the bits in a node address. For two nodes a and b with addresses
$(a_{n-1} a_{n-2} \ldots a_0)$ and $(b_{n-1} b_{n-2} \ldots b_0)$ the Hamming distance is $\sum_{i=0}^{n-1} (a_i \oplus b_i)$, where
$\oplus$ denotes modulo two addition. There are n edge-disjoint paths of length n between
any pair of nodes at distance n. The local address space in each processing node, or
processor for short, is X = $\{0, 1, \ldots, K-1\}$, and the global address is $(j|i)$, where j is
the node address and i the local memory address. The bits required for address
encoding are sometimes referred to as dimensions.
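
As a small illustration of these preliminaries, the following sketch computes the Hamming distance and the neighbor set of a node, with node addresses represented as Python integers. The function names are ours, chosen for illustration.

```python
def hamming_distance(a, b):
    """Number of bit positions in which the addresses a and b differ."""
    return bin(a ^ b).count("1")


def neighbors(node, n):
    """The n neighbors of a node in a binary n-cube: one per complemented bit."""
    return [node ^ (1 << i) for i in range(n)]


# Every neighbor is at Hamming distance one from the node itself.
assert all(hamming_distance(5, nb) == 1 for nb in neighbors(5, 3))
```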
Definition 2.1. Let $\mathcal{S}_m = \{0, 1, \ldots, d-1\}$ and $\mathcal{S}_p \subseteq \{d, d+1, \ldots, d+n-1\}$, where n is the
number of cube dimensions. d is the number of cube dimensions involved in the
permutation and also the number of local memory dimensions involved in the
permutation. Furthermore, let f be a bijection $\mathcal{S}_m \to \mathcal{S}_p$ and g be a bijection $\mathcal{S}_p \to \mathcal{S}_m$.
An AAPC is a permutation defined by $a_i \to a_{f(i)}$ for all $i \in \mathcal{S}_m$ and $a_j \to a_{g(j)}$ for all
$j \in \mathcal{S}_p$.
Note that in our definition of the AAPC $|\mathcal{S}_m| = |\mathcal{S}_p|$. $\mathcal{S}_m$ and $\mathcal{S}_p$ represent the set of
local memory dimensions and processor dimensions involved in the AAPC, respectively.
For simplicity in notation, we assume in the definition that all memory
dimensions are involved in the AAPC. An example of AAPCs satisfying Definition 2.1
is the bit-reversal $(a_7 a_6 a_5 a_4 a_3 | a_2 a_1 a_0) \to (a_7 a_6 a_0 a_1 a_2 | a_3 a_4 a_5)$, where n = 5, d = 3,
$\mathcal{S}_m = \{0, 1, 2\}$ and $\mathcal{S}_p = \{3, 4, 5\}$. This bit-reversal operation, in fact, consists of four
independent bit-reversals in three-dimensional subcubes. The subcubes are identified
by the two leading processor address bits. Another example is the axis exchange
$(a_6 a_5 a_4 a_3 | a_2 a_1 a_0) \to (a_2 a_1 a_0 a_3 | a_6 a_5 a_4)$, for which n = 4, d = 3, $\mathcal{S}_m = \{0, 1, 2\}$ and
$\mathcal{S}_p = \{4, 5, 6\}$. This axis exchange can formally be represented as $(k,j|i) \to (i,j|k)$,
where i initially is encoded in the three local memory dimensions {0, 1, 2} and
k initially is encoded in the processor dimensions {4, 5, 6}.
If $2^d < K$, then the AAPC is repeated $\lceil K/2^d \rceil$ times. For instance, if K = 256 and
the AAPC includes three processor dimensions (d = 3), then the permutation is
repeated $2^{8-3} = 32$ times. For clarity, we assume that $K = 2^d$ in the remainder of this
paper. Conversely, if there are processor dimensions not included in the AAPC, then, 
in fact, a number of AAPCs in disjoint subcubes is specified, as shown in the examples 
above. The subcube for each AAPC is identified by the processor address bits not 
included in the specification of the AAPC. 
The relative address of a local memory address i in processor j with respect to its 
destination is $j \oplus i$. A homogeneous communication schedule has schedules for each
node that only depend on the relative addresses. Thus, in a homogeneous schedule,
node i sends data to node $i \oplus j$ in the same dimension and step as node 0 sends data to
node j. The message path from node i to node $i \oplus j$ is a translation (modulo two
addition) with respect to node i of the path from node 0 to j. We only consider
homogeneous schedules. For such schedules, it is sufficient to consider schedules 
for node 0. All our schedules are based on minimum length path routing of all 
elements in a node and use all-ports of every node in every step, except possibly the 
last step. 
Definition 2.2. Let a data element i leave the source during step $t_1$ and arrive at the
destination during step $t_2 \ge t_1$. Then, $\mathrm{span}(i) = t_2 - t_1 + 1$. The span of an AAPC is
the maximum span for any data element, i.e., $\mathrm{span} = \max_i \mathrm{span}(i)$.
Thus, the span for element i is the number of exchanges required for it to move from 
source to destination, including exchanges during which the element may be waiting 
en route. 
In Algorithm 4 we use the notion of necklaces. A necklace [15] is a set of addresses 
derived from each other through rotations of some fixed bit string. A necklace is full if
it has d distinct addresses for a string of length d. Otherwise, it is degenerate. The
period of a bit string is the minimum number of rotations ( > 0) required to generate
a bit string identical to the unrotated string. An address in a degenerate necklace is
cyclic. Addresses in full necklaces are noncyclic. For instance, the address (0000) is
cyclic with period one and the address (0101) is cyclic with period two; the addresses 
(0001) and (0011) are noncyclic. The address with smallest value in the necklace is 
a distinguished address. A q-necklace is a necklace in which each address has q bits 
equal to 1. 
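
The necklace notions can be made concrete with a short sketch. Addresses are d-bit Python integers; all function names are ours and are introduced only for illustration.

```python
def rotate_left(x, d, r=1):
    """Cyclic left rotation of the d-bit string x by r positions."""
    r %= d
    mask = (1 << d) - 1
    return ((x << r) | (x >> (d - r))) & mask


def period(x, d):
    """Smallest number of rotations (> 0) that reproduces x."""
    return next(r for r in range(1, d + 1) if rotate_left(x, d, r) == x)


def necklace(x, d):
    """All addresses obtained from x by rotation, in increasing order."""
    return sorted({rotate_left(x, d, r) for r in range(d)})


def is_cyclic(x, d):
    """An address is cyclic when its necklace is degenerate (period < d)."""
    return period(x, d) < d


def distinguished(x, d):
    """The address with the smallest value in the necklace of x."""
    return min(necklace(x, d))


# The examples from the text, for d = 4.
assert period(0b0000, 4) == 1 and period(0b0101, 4) == 2
assert not is_cyclic(0b0001, 4) and not is_cyclic(0b0011, 4)
```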
2.2. Algorithm organization 
We first consider the organization of algorithms for axes exchanges of the form $(j|i) \to (i|j)$, then discuss permutations of the form $(j|i) \to (f(i)|g(j))$.
2.2.1. Axes exchanges: $(j|i) \to (i|j)$
The permutation $(j|i) \to (i|j)$, where j and i have the same number of bits, is
equivalent to the transposition of a matrix stored with one row per processor to the 
storage of the matrix with one column per processor. The permutation can also be 
viewed as changing the allocation of a one-dimensional array from consecutive 
storage to cyclic storage [9], two storage forms included in Vienna Fortran [21] and 
Fortran D [4], and adopted in the emerging High Performance Fortran standard. 
The routing algorithms we present use only direct addressing in the data exchanges 
between processors. All processors access the same local memory address during the 
same step. To accomplish this property, a local alignment precedes the interprocessor 
exchanges, and a realignment follows the exchanges. On the Connection Machine 
systems CM-2 and CM-200, avoiding indirect addressing in the exchange phase 
results in a significant speed up. 
Phase 1: Local alignment. Sort the local data by relative address, i.e., perform the
operation $(j|i) \to (j \,|\, j \oplus i)$. Table 1 shows the data distribution in a 3-cube after the
alignment. The processor addresses are given in the decimal number system and the
local memory addresses in the binary number system.
Phase 2: Interprocessor exchange. The interprocessor exchange phase implements
the operation $(j|i) \to (j \oplus i \,|\, i)$.
Phase 3: Local realignment. The realignment to restore the local memory order is 
identical to Phase 1. 
The alignment is performed on the memory dimensions included in the AAPC.
A memory address is subject to an exchange operation in the processor dimensions that
correspond to nonzero bits in its relative address. Thus, local memory address zero
retains its content throughout Phase 2, whereas local memory address (11...1) needs
to send its content across all cube dimensions (in any order). Similarly, local memory
address (00...01) only exchanges its content with the neighboring node in the least
significant cube dimension, while local memory address (11...10) needs to send its
content across all but the least significant cube dimension.
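
The three phases can be sketched as follows for a d-cube with K = 2^d. This is an illustration under simplifying assumptions rather than the paper's implementation: Phase 2 is carried out as one bulk move instead of hop by hop, the memory state is a list of lists indexed as `mem[processor][local address]`, and the function names are ours.

```python
def align(mem):
    """Phases 1 and 3: (j | i) -> (j | j xor i), a purely local permutation."""
    n = len(mem)
    return [[mem[j][j ^ r] for r in range(n)] for j in range(n)]


def exchange(mem):
    """Phase 2 as a single bulk move: (j | i) -> (j xor i | i)."""
    n = len(mem)
    # The content of address r in processor j ends up in processor j xor r,
    # at the same local address r.
    return [[mem[j ^ r][r] for r in range(n)] for j in range(n)]


def aapc(mem):
    """Alignment, interprocessor exchange, realignment."""
    return align(exchange(align(mem)))


d = 3
n = 1 << d
# Element (j, i) starts in processor j at local address i.
mem = [[(j, i) for i in range(n)] for j in range(n)]
aligned = align(mem)
result = aapc(mem)
# After the AAPC, processor i holds element (j, i) at local address j.
assert all(result[i][j] == (j, i) for i in range(n) for j in range(n))
```

The intermediate state `aligned` corresponds to the data distribution of Table 1; for instance, `aligned[1]` lists the entries of column P1.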
Table 1
Data distribution for AAPC in a 3-cube, after the alignment in Phase 1

rel-addr  P0       P1       P2       P3       P4       P5       P6       P7
0         (0|000)  (1|001)  (2|010)  (3|011)  (4|100)  (5|101)  (6|110)  (7|111)
1         (0|001)  (1|000)  (2|011)  (3|010)  (4|101)  (5|100)  (6|111)  (7|110)
2         (0|010)  (1|011)  (2|000)  (3|001)  (4|110)  (5|111)  (6|100)  (7|101)
3         (0|011)  (1|010)  (2|001)  (3|000)  (4|111)  (5|110)  (6|101)  (7|100)
4         (0|100)  (1|101)  (2|110)  (3|111)  (4|000)  (5|001)  (6|010)  (7|011)
5         (0|101)  (1|100)  (2|111)  (3|110)  (4|001)  (5|000)  (6|011)  (7|010)
6         (0|110)  (1|111)  (2|100)  (3|101)  (4|010)  (5|011)  (6|000)  (7|001)
7         (0|111)  (1|110)  (2|101)  (3|100)  (4|011)  (5|010)  (6|001)  (7|000)
Table 2
The initial data distribution viewed as complement pairs of local addresses after
alignment in a 3-cube

P0       P1       P2       P3       P4       P5       P6       P7
(0|000)  (1|001)  (2|010)  (3|011)  (4|100)  (5|101)  (6|110)  (7|111)
(0|111)  (1|110)  (2|101)  (3|100)  (4|011)  (5|010)  (6|001)  (7|000)
(0|001)  (1|000)  (2|011)  (3|010)  (4|101)  (5|100)  (6|111)  (7|110)
(0|110)  (1|111)  (2|100)  (3|101)  (4|010)  (5|011)  (6|000)  (7|001)
(0|010)  (1|011)  (2|000)  (3|001)  (4|110)  (5|111)  (6|100)  (7|101)
(0|101)  (1|100)  (2|111)  (3|110)  (4|001)  (5|000)  (6|011)  (7|010)
(0|011)  (1|010)  (2|001)  (3|000)  (4|111)  (5|110)  (6|101)  (7|100)
(0|100)  (1|101)  (2|110)  (3|111)  (4|000)  (5|001)  (6|010)  (7|011)
In an AAPC including all local memory dimensions, either local memory address
$i$ or its complement $\bar{i}$ needs to send its content across a given cube dimension. Thus, each complement
pair $(i, \bar{i})$ of local memory addresses must send one element across each cube dimension.
The pairing of local memory addresses is shown in Table 2 for d = 3.
2.2.2. Axes exchanges with permutations: $(j|i) \to (f(i)|g(j))$
The algorithm organization for axes exchanges $(j|i) \to (i|j)$ is easily generalized to
permutations of the form $(j|i) \to (f(i)|g(j))$, where f is a bijection from $\mathcal{S}_m$ to $\mathcal{S}_p$ and
g a bijection $\mathcal{S}_p \to \mathcal{S}_m$. The operation $i \to f(i)$ can be combined with the prealignment
operation (Phase 1) without a need for additional data motion. Similarly, the
permutation $j \to g(j)$ can be performed as a local memory operation by combining the
permutation with the postalignment operation (Phase 3). For instance, if the proces- 
sor addresses are encoded in a binary-reflected Gray code and the local memory 
addresses use the standard binary code, then an axes exchange preserving the 
encoding strategy requires no extra data motion. But, the pre- and postalignment 
phases must be modified to include the code conversion from binary code to
binary-reflected Gray code (f), and from binary-reflected Gray code to binary code
($g = f^{-1}$) [10, 13]. The functions f and g add to the complexity of the address
calculation for the local permutations in the pre- and postalignment.
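
For reference, the code conversions f and g = f^{-1} between binary code and binary-reflected Gray code can be sketched with the standard shift-and-xor formulas. How the conversion is folded into the alignment phases is not shown here; the function names are ours.

```python
def binary_to_gray(x):
    """f: binary code -> binary-reflected Gray code."""
    return x ^ (x >> 1)


def gray_to_binary(g):
    """g = f^{-1}: binary-reflected Gray code -> binary code."""
    x = 0
    while g:
        x ^= g
        g >>= 1
    return x


# The two conversions are inverses, and consecutive Gray codes differ in one bit.
for v in range(16):
    assert gray_to_binary(binary_to_gray(v)) == v
for v in range(15):
    diff = binary_to_gray(v) ^ binary_to_gray(v + 1)
    assert bin(diff).count("1") == 1
```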
2.3. Multiple AAPC 
A sequence of AAPCs can be represented as a sequence of axis exchanges. For 
instance, in the FFT example given in the introduction, the sequence of AAPCs was 
represented by the axis exchanges 
$$(l,k,j|i) \to (l,k,i|j) \to (l,j,i|k) \to (k,j,i|l).$$
Lemma 2.1. The alignments for AAPCs on different processor bits are independent, and 
can be performed at once. 
The sequence of AAPCs can therefore be organized into three phases as follows: 
Phase 1: Local alignment at once for all axes. 
Phase 2: Interprocessor exchanges for the different AAPCs. 
Phase 3: Local realignment at once for all axes. 
For the three-axes example the operations are 
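$$\begin{aligned}
(l,k,j \mid i) &\xrightarrow{1} (l,k,j \mid i \oplus j \oplus k \oplus l),\\
(l,k,j \mid i) &\xrightarrow{2} (l,k,\ i \oplus j \oplus k \oplus l \mid i),\\
(l,k,j \mid i) &\xrightarrow{3} (l,\ i \oplus j \oplus k \oplus l,\ j \mid i),\\
(l,k,j \mid i) &\xrightarrow{4} (i \oplus j \oplus k \oplus l,\ k,\ j \mid i),\\
(l,k,j \mid i) &\xrightarrow{5} (l,k,j \mid i \oplus j \oplus k \oplus l).
\end{aligned}$$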
The first step is Phase 1, the local prealignment. Phase 2 consists of steps two to 
four. Each step represents an AAPC within distinct subcubes defined by the differ- 
ent axes (sets of processor dimensions), starting with coordinate axis j. All the com- 
munications in the second phase are between elements with the same local memory 
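address. The element initially in location $(l,k,j|i)$ is routed to the final location $(k,j,i|l)$
according to the path:
$$(l,k,j \mid i) \xrightarrow{1} (l,k,j \mid i \oplus j \oplus k \oplus l) \xrightarrow{2} (l,k,i \mid i \oplus j \oplus k \oplus l) \xrightarrow{3} (l,j,i \mid i \oplus j \oplus k \oplus l) \xrightarrow{4} (k,j,i \mid i \oplus j \oplus k \oplus l) \xrightarrow{5} (k,j,i \mid l).$$
Table 3 shows the data motion in detail for the permutation $(k,j|i) \to (j,i|k)$, where
each axis is encoded in two bits. The row mod 2 add shows the index used for the
alignment operations, i.e., $k \oplus j$. Fig. 1 shows the motion of one set of data elements.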
The pairing of data elements is based on the local storage dimensions involved in the 
AAPCs. The pairing is the same as for a single AAPC. 
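
The routing path of a single element through the alignment, the successive AAPCs and the realignment can be traced with a few lines of code. The sketch below applies the step rules given above to coordinate values only (no data are moved); the function name `multi_aapc_path` and the list representation of the axes are our assumptions.

```python
from functools import reduce
from operator import xor


def multi_aapc_path(axes, mem):
    """Trace one element; axes = [a_s, ..., a_1] are processor axes, mem = a_0."""
    axes = list(axes)
    path = [(tuple(axes), mem)]
    mem ^= reduce(xor, axes)                  # Phase 1: local alignment
    path.append((tuple(axes), mem))
    for t in range(len(axes) - 1, -1, -1):    # Phase 2: one AAPC per axis, a_1 first
        axes[t] = mem ^ reduce(xor, axes)
        path.append((tuple(axes), mem))
    mem ^= reduce(xor, axes)                  # Phase 3: local realignment
    path.append((tuple(axes), mem))
    return path


# Three-axes example: (l, k, j | i) must end in (k, j, i | l).
l, k, j, i = 3, 5, 6, 1
path = multi_aapc_path([l, k, j], i)
assert path[-1] == ((k, j, i), l)
```

For s axes the trace has s + 3 entries: the initial location followed by the s + 2 steps.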
For a sequence of s AAPCs on d dimensions each, and a total of sd dimensions in 
the permutation, there are s coordinate axes (each encoded in d bits), and s steps in 
Phase 2. The steps are as follows: 
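$$\begin{aligned}
(a_s, a_{s-1}, \ldots, a_1 \mid a_0) &\xrightarrow{1} (a_s, a_{s-1}, \ldots, a_1 \mid a_s \oplus a_{s-1} \oplus \cdots \oplus a_1 \oplus a_0),\\
(a_s, a_{s-1}, \ldots, a_1 \mid a_0) &\xrightarrow{2} (a_s, a_{s-1}, \ldots, a_2,\ a_s \oplus a_{s-1} \oplus \cdots \oplus a_1 \oplus a_0 \mid a_0),\\
&\ \ \vdots\\
(a_s, a_{s-1}, \ldots, a_1 \mid a_0) &\xrightarrow{s+1} (a_s \oplus a_{s-1} \oplus \cdots \oplus a_1 \oplus a_0,\ a_{s-1}, a_{s-2}, \ldots, a_1 \mid a_0),\\
(a_s, a_{s-1}, \ldots, a_1 \mid a_0) &\xrightarrow{s+2} (a_s, a_{s-1}, \ldots, a_1 \mid a_s \oplus a_{s-1} \oplus \cdots \oplus a_1 \oplus a_0).
\end{aligned}$$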
The communication for successive AAPCs can be pipelined. Once a pair of local 
memory addresses has completed the communication for one AAPC, they are ready 
to proceed to the next AAPC. Minimizing the pipeline delay is equivalent to minimizing
the span. In Section 4 we also show how the pipeline delay can be reduced by
performing multiple AAPCs concurrently.

Table 3
The global memory state for an AAPC with explicit alignment

Phase            P0  P1  P2  P3  P4  P5  P6  P7  P8  P9 P10 P11 P12 P13 P14 P15
mod 2 add.       00  01  10  11  01  00  11  10  10  11  00  01  11  10  01  00

Initial           0   4   8  12  16  20  24  28  32  36  40  44  48  52  56  60
alloc.            1   5   9  13  17  21  25  29  33  37  41  45  49  53  57  61
                  2   6  10  14  18  22  26  30  34  38  42  46  50  54  58  62
                  3   7  11  15  19  23  27  31  35  39  43  47  51  55  59  63

after             0   5  10  15  17  20  27  30  34  39  40  45  51  54  57  60
mod 2 add.        1   4  11  14  16  21  26  31  35  38  41  44  50  55  56  61
                  2   7   8  13  19  22  25  28  32  37  42  47  49  52  59  62
                  3   6   9  12  18  23  24  29  33  36  43  46  48  53  58  63

after             0   5  10  15  20  17  30  27  40  45  34  39  60  57  54  51
first             4   1  14  11  16  21  26  31  44  41  38  35  56  61  50  55
AAPC              8  13   2   7  28  25  22  19  32  37  42  47  52  49  62  59
                 12   9   6   3  24  29  18  23  36  33  46  43  48  53  58  63

after             0  17  34  51  20   5  54  39  40  57  10  27  60  45  30  15
second           16   1  50  35   4  21  38  55  56  41  26  11  44  61  14  31
AAPC             32  49   2  19  52  37  22   7   8  25  42  59  28  13  62  47
                 48  33  18   3  36  53   6  23  24   9  58  43  12  29  46  63

after             0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
mod 2 add.       16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31
                 32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47
                 48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63
3. Algorithms for a single AAPC 
The first three algorithms in this section either have optimum data transfer time, or 
optimum span, but not both. Algorithm 4, the major contribution of this section, has 
both optimum data transfer time and optimum span, but is more complex. 
3.1. Algorithm 1
A straightforward algorithm for AAPC is to schedule d complement pairs
of addresses concurrently, by scheduling complement pair u in dimensions u,
(u + 1) mod d, (u + 2) mod d, ..., (u + d - 1) mod d. After d exchanges, the d pairs of
addresses have performed all necessary communications. In each step exactly one of
the two addresses in a pair must exchange its content with another processor. The
address sending its data to another processor has the relative address bit equal to one
for the dimension being exchanged. Table 4 shows the exchanges for the first three
complement pairs of addresses for d = 3. For the first row of complement pairs of
addresses in Table 4, the exchange sequence with respect to processor dimensions is 0,
1 and 2. The sequence for the second row of complement pairs of addresses is 1, 2 and
0. Every channel is used in every step; each item follows a minimum-length path.

Fig. 1. Tracing a set of memory locations for an AAPC with explicit alignment. [Panels for the pair (000, 111) over processors P0 to P7: alignment; dim. = 0; dim. = 1; dim. = 2; realignment; final distribution.]
The pseudocode below serves to illustrate the simplicity in implementing the 
algorithm for Phase 2 of an AAPC performed on the d least significant processor 
dimensions. The innermost loop performs the concurrent exchanges in all processor 
dimensions. Index u enumerates different complement pairs of local memory ad- 
dresses in a concurrent exchange. Index r enumerates the exchanges required for each 
complement pair of memory addresses. Loop index q enumerates different sets of 
d complement pairs of addresses, while the forall statement specifies all binary cube 
processors. The memory accesses over time and the loop ordering are illustrated in 
Fig. 2 for d = 5. 
By unrolling the loop on q in Algorithm 1, the exchanges corresponding to a few 
diagonal blocks in Fig. 2 can be made in the same exchange step, thereby reducing the 
number of exchange steps and potentially the overhead in a message-passing com- 
munication library. The number of element transfers in sequence is unaffected by such 
a schedule change. Blocking of element exchanges is discussed further in Section 5. 
Table 4
The locations of the contents of the first three complement pairs of local addresses after each exchange step
in a 3-cube. The position of the entry in the column "Dim" (top or bottom) indicates the address of the data
element subject to exchange

Time  Dim  P0       P1       P2       P3       P4       P5       P6       P7
1          (0|000)  (1|001)  (2|010)  (3|011)  (4|100)  (5|101)  (6|110)  (7|111)
      0    (1|110)  (0|111)  (3|100)  (2|101)  (5|010)  (4|011)  (7|000)  (6|001)
           (0|001)  (1|000)  (2|011)  (3|010)  (4|101)  (5|100)  (6|111)  (7|110)
      1    (2|100)  (3|101)  (0|110)  (1|111)  (6|000)  (7|001)  (4|010)  (5|011)
           (0|010)  (1|011)  (2|000)  (3|001)  (4|110)  (5|111)  (6|100)  (7|101)
      2    (4|001)  (5|000)  (6|011)  (7|010)  (0|101)  (1|100)  (2|111)  (3|110)

2          (0|000)  (1|001)  (2|010)  (3|011)  (4|100)  (5|101)  (6|110)  (7|111)
      1    (3|100)  (2|101)  (1|110)  (0|111)  (7|000)  (6|001)  (5|010)  (4|011)
           (0|001)  (1|000)  (2|011)  (3|010)  (4|101)  (5|100)  (6|111)  (7|110)
      2    (6|000)  (7|001)  (4|010)  (5|011)  (2|100)  (3|101)  (0|110)  (1|111)
           (0|010)  (1|011)  (2|000)  (3|001)  (4|110)  (5|111)  (6|100)  (7|101)
      0    (5|000)  (4|001)  (7|010)  (6|011)  (1|100)  (0|101)  (3|110)  (2|111)

3          (0|000)  (1|001)  (2|010)  (3|011)  (4|100)  (5|101)  (6|110)  (7|111)
      2    (7|000)  (6|001)  (5|010)  (4|011)  (3|100)  (2|101)  (1|110)  (0|111)
      0    (1|000)  (0|001)  (3|010)  (2|011)  (5|100)  (4|101)  (7|110)  (6|111)
           (6|000)  (7|001)  (4|010)  (5|011)  (2|100)  (3|101)  (0|110)  (1|111)
      1    (2|000)  (3|001)  (0|010)  (1|011)  (6|100)  (7|101)  (4|110)  (5|111)
           (5|000)  (4|001)  (7|010)  (6|011)  (1|100)  (0|101)  (3|110)  (2|111)
Fig. 2. Loop ordering for Algorithm 1. [Axes: memory (vertical) and exchange step (horizontal); the loop indices q and r are marked along the axes.]
forall p ∈ {0, 1, ..., 2^n − 1} do
    for q := 0 to d⌊2^d/(2d)⌋ − 1 step d do
        for r := 0 to d − 1 do
            forall u ∈ {0, 1, ..., d − 1} do
                c := (u + r) mod d
                if bit c of (q + u) = 1 then
                    (p_{n+d−1} p_{n+d−2} ... p_{d+c+1} p_{d+c} p_{d+c−1} ... p_d | q + u)
                        → (p_{n+d−1} p_{n+d−2} ... p_{d+c+1} ~p_{d+c} p_{d+c−1} ... p_d | q + u)
                else
                    (p_{n+d−1} p_{n+d−2} ... p_{d+c+1} p_{d+c} p_{d+c−1} ... p_d | ~(q + u))
                        → (p_{n+d−1} p_{n+d−2} ... p_{d+c+1} ~p_{d+c} p_{d+c−1} ... p_d | ~(q + u))
                endif
            endforall
        endfor
    endfor
    ...
    Similar exchanges for the remaining 2^d mod 2d local addresses.
    ...
endforall
Here ~x denotes the bitwise complement of x, and c is the cube dimension used by complement pair q + u in exchange r.
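The correctness of the above algorithm follows from the following consideration:
$$\begin{aligned}
\text{Phase 1: } (j \mid i) &\xrightarrow{1} (j \mid j \oplus i),\\
\text{Phase 2: } (j \mid i) &\xrightarrow{2} (j \oplus i \mid i),\\
\text{Phase 3: } (j \mid i) &\xrightarrow{3} (j \mid j \oplus i),
\end{aligned}$$
or
$$(j \mid i) \xrightarrow{1} (j \mid j \oplus i) \xrightarrow{2} ((j \oplus i) \oplus j \mid j \oplus i) = (i \mid j \oplus i) \xrightarrow{3} (i \mid (j \oplus i) \oplus i) = (i \mid j).$$
Phase 2 of Algorithm 1 proceeds in $\lfloor K/(2d) \rfloor$ subphases with d exchanges per
subphase. If d does not divide K/2, then an additional, last subphase (d communication steps)
is required, which does not fully use all communication channels.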
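
Phase 2 of Algorithm 1 can be simulated directly. The sketch below assumes K = 2^d, enumerates the complement pairs as (v, ~v) for v = 0, ..., 2^(d-1) - 1, and folds the incomplete last subphase into the same loop by leaving unused channels idle; it is meant as an illustration of the schedule, not as the paper's implementation, and all names are ours.

```python
def algorithm1_phase2(mem, d):
    """Run Phase 2 of Algorithm 1 in place; return the number of exchange steps."""
    n = 1 << d                     # processors in the d-cube, and K = 2**d
    full = n - 1                   # all-ones address, used to form complements
    pairs = n // 2                 # number of complement pairs of addresses
    steps = 0
    for q in range(0, pairs, d):   # one subphase per set of d pairs
        for r in range(d):         # d exchange steps per subphase
            used = set()
            for u in range(d):     # pairs handled concurrently in this step
                v = q + u
                if v >= pairs:     # the last subphase may leave channels idle
                    continue
                c = (u + r) % d    # cube dimension used by pair v in this step
                assert c not in used   # each channel carries one element
                used.add(c)
                # The member of the pair whose bit c is one is exchanged.
                addr = v if (v >> c) & 1 else full ^ v
                for p in range(n):
                    nb = p ^ (1 << c)
                    if p < nb:
                        mem[p][addr], mem[nb][addr] = mem[nb][addr], mem[p][addr]
            steps += 1
    return steps


d = 3
n = 1 << d
# State after Phase 1: address r of processor j holds element (j | j xor r).
mem = [[(j, j ^ r) for r in range(n)] for j in range(n)]
steps = algorithm1_phase2(mem, d)
# Phase 2 must realize (j | r) -> (j xor r | r).
assert all(mem[j][r] == (j ^ r, j) for j in range(n) for r in range(n))
```

For d = 3 the simulation takes two subphases of three steps each, in agreement with the count $d \lceil K/(2d) \rceil$.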
3.2. Algorithm 2 
Table 5 defines Algorithm 2. Complement pair u of local memory addresses is
exchanged in processor dimension r during time step $(r + u) \bmod (2^d/2)$, $0 \le r < d$
and $0 \le u < K/2$. The span is K/2. The AAPC schedule in [2] has a span of order
$O(2^d)$ and optimal channel utilization. The large span is a disadvantage when several
AAPCs on different processor address bits must be performed.
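
The schedule rule of Algorithm 2 is easy to tabulate. The following sketch builds the pair-by-step table for K = 2^d with steps numbered from zero, and computes the span of each pair; the function names are ours.

```python
def algorithm2_schedule(d):
    """table[u][t] is the dimension used by complement pair u in step t, or None."""
    pairs = (1 << d) // 2                  # K/2 pairs and K/2 time steps
    table = [[None] * pairs for _ in range(pairs)]
    for u in range(pairs):
        for r in range(d):
            table[u][(r + u) % pairs] = r
    return table


def pair_span(row):
    """Steps from the first to the last exchange of a pair, inclusive."""
    busy = [t for t, dim in enumerate(row) if dim is not None]
    return busy[-1] - busy[0] + 1


table = algorithm2_schedule(4)
# Every step uses each of the d dimensions exactly once.
for t in range(len(table)):
    dims = sorted(row[t] for row in table if row[t] is not None)
    assert dims == [0, 1, 2, 3]
```

For d = 4 the pairs whose exchanges wrap around the end of the schedule have span 8 = K/2, while the first pair has span 4 = d.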
3.3. Algorithm 3 
Optimal channel utilization with a span of $d + (K/2) \bmod d$ is easily obtained by
applying Algorithm 2 to $d + (K/2) \bmod d$ pairs of memory addresses and Algorithm 1
to all other pairs of memory addresses. We refer to this combined algorithm as
Algorithm 3.

Table 5
A schedule for optimal channel utilization (Algorithm 2) for an AAPC
on four processor dimensions

Memory          Exchange step
loc. pair       1  2  3  4  5  6  7  8
(0000, 1111)    0  1  2  3  -  -  -  -
(0001, 1110)    -  0  1  2  3  -  -  -
(0010, 1101)    -  -  0  1  2  3  -  -
(0011, 1100)    -  -  -  0  1  2  3  -
(0100, 1011)    -  -  -  -  0  1  2  3
(0101, 1010)    3  -  -  -  -  0  1  2
(0110, 1001)    2  3  -  -  -  -  0  1
(0111, 1000)    1  2  3  -  -  -  -  0

Table 6
The number of element transfers, the span, and the number of local memory addresses scheduled together in
Algorithms 1-3 on a d-cube

Algorithm  Element transfers   Span                           Local memory addresses in transit
                               Min  Max                       Min                 Max
1          d⌈K/(2d)⌉           d    d                         2((2^d/2) mod d)    2d
2          K/2                 d    K/2                       2d                  2d
3          K/2                 d    d + ((2^d/2) mod d)       2d                  2(d + ((2^d/2) mod d))
3.4. The communication complexity of Algorithms 1-3 
The characteristics of the above algorithms are summarized in Table 6. The 
algorithms either have optimum span or an optimum number of time steps, but not 
both. 
Performing s AAPCs in sequence by pipelining successive applications of
Algorithm 1 requires $d \lceil K/(2d) \rceil + (s-1)d$ time steps. For Algorithm 3, pipelining yields
$K/2 + (s-1)(d + (K/2) \bmod d)$ time steps. The schedule devised below (Algorithm 4)
has optimal channel utilization and a span of d. Pipelining s AAPCs based on this
algorithm yields $K/2 + (s-1)d$ time steps.
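
The three pipelining formulas are easy to compare numerically. The helper below simply evaluates them for given K, d and s, assuming K is even; the function and key names are ours.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def pipelined_steps(K, d, s):
    """Time steps for s pipelined AAPCs on d dimensions each, K elements per node."""
    half = K // 2
    return {
        "Algorithm 1": d * ceil_div(K, 2 * d) + (s - 1) * d,
        "Algorithm 3": half + (s - 1) * (d + half % d),
        "Algorithm 4": half + (s - 1) * d,
    }


counts = pipelined_steps(K=8, d=3, s=3)
```

With K = 8, d = 3 and s = 3, Algorithms 1 and 3 both need 12 time steps and Algorithm 4 needs 10; when d divides K/2 (for example K = 16, d = 4) the three counts coincide.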
3.5. Algorithm 4 
Algorithms 1, 2 and 3 schedule complement pairs of local addresses together. In 
Algorithm 4 we supplement this scheduling strategy with one based on whether or not 
an address is noncyclic. We first consider a possible scheduling of noncyclic addresses, 
then a possible scheduling of cyclic addresses. Considering noncyclic and cyclic 
addresses separately would yield an algorithm with optimum span, but in general 
nonoptimum data transfer time. By scheduling some cyclic addresses together with 
some noncyclic addresses, an algorithm with both optimum span and optimum data 
transfer time can be devised. 
3.5.1. Scheduling of noncyclic addresses 
The exchanges for the three local memory locations (001), (010) and (100) can be
done concurrently. Similarly, locations (011), (110) and (101) can be scheduled
concurrently with respect to the least significant bit of (011), the second least significant
bit of (110), and the most significant bit of (101). The communication for the remaining bit
that is one in each of these addresses can also be scheduled concurrently. In general,
for a full q-necklace with bits i_0, i_1, ..., i_{q-1} equal to one, the contents of the
distinguished address are exchanged in dimensions i_0, i_1, ..., i_{q-1}. Within the q-necklace,
address r, obtained through an r-step left cyclic rotation of the distinguished address, is
subject to exchanges in dimensions (i_0 + r) mod d, (i_1 + r) mod d, ..., (i_{q-1} + r) mod d.
For the 1-necklace with distinguished address (001) the described schedule yields
a single exchange in dimension i_0 = 0 for the distinguished address. For address
r in the 1-necklace, the exchange is in dimension r. For the 3-necklace
{0111, 1110, 1101, 1011}, the schedule yields the exchanges given in Table 7.
Lemma 3.1. The set of addresses forming a full q-necklace can be scheduled to complete 
the required permutation in q exchange steps, which is optimal. The span is q ≤ d, and all
communication channels are used in each exchange step. 
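The rotation rule for full necklaces can be illustrated with a short sketch. It assumes addresses are stored as d-bit integers with bit 0 as the least significant bit; the helper names `rotate_left` and `necklace_schedule` are hypothetical and not taken from the paper.

```python
def rotate_left(address, r, d):
    """Rotate a d-bit address left by r positions, cyclically."""
    r %= d
    mask = (1 << d) - 1
    return ((address << r) | (address >> (d - r))) & mask


def necklace_schedule(distinguished, d):
    """Map each address of the necklace to its list of exchange dimensions.

    Entry k of each list is the dimension used in exchange step k: the
    member obtained by an r-step rotation uses dimension (i_k + r) mod d.
    """
    one_bits = [i for i in range(d) if (distinguished >> i) & 1]
    return {
        rotate_left(distinguished, r, d): [(i + r) % d for i in one_bits]
        for r in range(d)
    }


if __name__ == "__main__":
    d = 4
    for addr, dims in necklace_schedule(0b0111, d).items():
        print(format(addr, "04b"), dims)
```

For the 3-necklace with distinguished address 0111 this reproduces the rows of Table 7, and in each of the q = 3 steps the d = 4 members use four different dimensions.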
The presented schedule for noncyclic addresses yields optimum span and full 
utilization of all communication channels. Cyclic addresses cannot be scheduled 
optimally using the same idea. In our Algorithm 4, we schedule one suitably chosen 
full necklace of addresses together with a set of complement pairs of cyclic addresses. 
All other noncyclic addresses are scheduled as described above. 
Before considering the scheduling of cyclic addresses, we compare the scheduling of 
noncyclic addresses based on the idea of necklaces with scheduling based on comple- 
ment pairs of addresses. 
Table 7

Address   Exchange dimensions
0111:     0  1  2
1110:     1  2  3
1101:     2  3  0
1011:     3  0  1
Lemma 3.2. The addresses obtained through bit-complementation of a full
(d - q)-necklace form a q-necklace that is also full, and that is distinct from the
original necklace if q ≠ d/2.
Corollary 3.3. The addresses in any full q-necklace, q ≠ d/2, and its complement
(d - q)-necklace, can be scheduled together to complete the required data motion in
d exchange steps.
Thus, for d odd, schedules of noncyclic addresses either based on necklaces, or 
complement pairs of addresses, both yield optimum utilization of all d communication 
channels. The maximum span for scheduling based on necklaces is d - 1, while the 
span for schedules based on complement pairs is d. For q = d/2 not every q-necklace 
has a matching complement necklace. For instance, the address (0011) and its 
complement (1100) belong to the same necklace. Thus, for d even, scheduling based on 
complement pairs of addresses may not yield full utilization of all communication 
channels for all noncyclic addresses.
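The difference between the cases q ≠ d/2 and q = d/2 is easy to check by enumeration. The following sketch assumes d-bit integer addresses; `necklace`, `complement` and `has_distinct_complement_necklace` are illustrative names only.

```python
def rotate_left(address, r, d):
    """Rotate a d-bit address left by r positions, cyclically."""
    r %= d
    mask = (1 << d) - 1
    return ((address << r) | (address >> (d - r))) & mask


def necklace(address, d):
    """All rotations of a d-bit address."""
    return frozenset(rotate_left(address, r, d) for r in range(d))


def complement(address, d):
    """Bit-complement of a d-bit address."""
    return address ^ ((1 << d) - 1)


def has_distinct_complement_necklace(address, d):
    """True if the complement of the address lies in a different necklace."""
    return necklace(address, d) != necklace(complement(address, d), d)


if __name__ == "__main__":
    # d = 4, q = d/2: 0011 and its complement 1100 share one necklace.
    print(has_distinct_complement_necklace(0b0011, 4))
    # d = 5: the complement of 00011 lies in the necklace of 00111.
    print(has_distinct_complement_necklace(0b00011, 5))
```

The first call returns False and the second True, matching the example of (0011) and (1100) given in the text.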
3.5.2. Scheduling of cyclic addresses 
Cyclic addresses are scheduled as complement pairs of addresses. The bit-comple- 
ment of a cyclic address is also cyclic. Scheduling cyclic addresses as complement pairs 
of addresses yields full utilization of all d communication channels for all such pairs 
only when the number of cyclic addresses is a multiple of 2d. 
Let the number of pairs of cyclic addresses be p and let c = p mod d. Then, all but 
c pairs of cyclic addresses are scheduled as blocks of d complement pairs in exchange 
sequences with span d, utilizing all d communication channels of every processor in 
every exchange step. The remaining c complement pairs of cyclic addresses are 
scheduled together with the addresses of one full (d - c)-necklace. For example, for
d = 3, c = 1 and the complement pair of cyclic addresses (000) and (111) is scheduled
together with the 2-necklace (011), (110) and (101). The remaining addresses all belong
to the 1-necklace (001), (010) and (100), which is scheduled as a full necklace with all
exchanges completed during a single cycle. The schedule for the cyclic addresses and
the 2-necklace is shown in Table 8.
Table 8

Address       Exchange dimensions
011:          0  1  -
110:          1  -  2
101:          -  2  0
(000, 111):   2  0  1
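The quantities p and c used above can be computed by brute force for small d. The sketch assumes that an address is cyclic when some rotation by fewer than d steps maps it onto itself, so that its necklace is not full; that reading and the function names are assumptions made for illustration.

```python
def rotate_left(address, r, d):
    """Rotate a d-bit address left by r positions, cyclically."""
    r %= d
    mask = (1 << d) - 1
    return ((address << r) | (address >> (d - r))) & mask


def is_cyclic(address, d):
    """Assumed definition: invariant under a rotation by 1..d-1 steps."""
    return any(rotate_left(address, r, d) == address for r in range(1, d))


def cyclic_pair_counts(d):
    """Return (p, c): the number of complement pairs of cyclic addresses
    and the number of pairs left over after forming blocks of d pairs."""
    cyclic = [a for a in range(1 << d) if is_cyclic(a, d)]
    p = len(cyclic) // 2
    return p, p % d


if __name__ == "__main__":
    for d in (3, 4, 5, 6):
        print(d, cyclic_pair_counts(d))
```

For d = 3 and d = 5 this gives p = 1 and c = 1, in line with the examples in the text, where the single pair (00...0, 11...1) is combined with a full (d - 1)-necklace.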
3.5.3. The complete algorithm 
We are now ready to define Algorithm 4 as follows: 
• schedule blocks of d complement pairs of cyclic addresses together,
• schedule the remaining c complement pairs of cyclic addresses with addresses of
  a full (d - c)-necklace,
• schedule all remaining noncyclic addresses by scheduling the d members of each
  remaining full necklace together.
It follows from the previous discussion that this algorithm fully utilizes all com- 
munication channels of every processor during every exchange step and has a span of d. 
We construct the schedule for the c pairs of cyclic addresses that are combined with 
the addresses in a full (d - c)-necklace from a d × d generating table. The columns of
the table correspond to exchange steps, and the table entries correspond to the 
dimensions scheduled for that time step. The task is to associate d of the d + 2c local 
memory addresses that must be scheduled in every exchange step to the d table entries 
in a column. The first row of the table consists of the numbers 0, 1, ..., d - 1. Each
other row is a one step left cyclic rotation of the previous row. Thus, every dimension 
is used for every exchange step. We arbitrarily associate the last c rows of the 
generating table with the c complement pairs of cyclic addresses. Each complement 
pair is clearly subject to an exchange in every dimension during d exchange steps, as 
required. The remaining d - c rows of the table are used to create a schedule for the 
d addresses of the full (d - c)-necklace. 
We choose the full (d - c)-necklace to be the necklace with distinguished address
(0^c 1^{d-c}), where 1^{d-c} denotes d - c consecutive 1-bits. There exist several schedules
with span d and an optimal number of element transfers. Any valid schedule must 
associate an address of the necklace with at most one entry in a column and precisely 
d - c columns. Furthermore, all table entries associated with an address must be 
unique, and the set of table entries for different addresses must be rotations of each 
other. Each table entry can only be assigned to one address. The schedules can be 
illustrated by drawing lines in the generating table. Each line intersects d column 
entries and represents all addresses in the full (d - c)-necklace. Each entry on the line 
represents an address, and the table entry gives the dimension in which that address 
is exchanged during the time step given by the table column. In total, d - c lines 
must be drawn since each address of the (d - c)-necklace must be scheduled in d - c 
dimensions. 
Fig. 3 gives the generating table for d = 5 with two different schedules, with arrows
representing the lines. In the left half of the figure, the lines start on the diagonal and
progress to the right within a row, cyclically. The dimension associated with entry j,
0 ≤ j < d, of line i, 0 ≤ i < d - c, is (2i + j) mod d, by construction of the generating
table and the lines. Thus, for any j, all d - c entries for 0 ≤ i < d - c are unique for
d odd. Member j of the necklace has the address bits set which correspond to the table
entries for j on the different lines. Thus, for d = 5, d - c = 4 and for j = 0, j = 1, etc.,
the addresses are {(10111), (01111), (11110), (11101), (11011)}. For d even, the same
scheme can be used for c ≥ d/2. For c < d/2 the first d/2 lines can be drawn as for
d odd, while a skew of two columns is used for line i = d/2 in order to preserve
uniqueness of all entries for j.

Fig. 3. Scheduling lines in the generating table for d = 5 (d - c = 4).
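The diagonal construction can be sketched as follows for odd d. The code assumes that line i lies in row i of the generating table and that entry j of line i sits in column (i + j) mod d, which gives the dimension (2i + j) mod d stated in the text; the function names are hypothetical.

```python
def generating_table(d):
    """Row 0 is 0..d-1; each later row is a one-step left rotation."""
    return [[(col + row) % d for col in range(d)] for row in range(d)]


def diagonal_line_schedule(d, c):
    """Return (necklace_rows, pair_rows) for the diagonal scheme.

    necklace_rows[j][t] is the dimension used by necklace member j in
    time step t (None when idle); pair_rows holds the last c rows of the
    generating table, one per complement pair of cyclic addresses.
    """
    table = generating_table(d)
    q = d - c
    necklace_rows = []
    for j in range(d):
        row = [None] * d
        for i in range(q):
            col = (i + j) % d
            row[col] = table[i][col]  # equals (2 * i + j) % d
        necklace_rows.append(row)
    return necklace_rows, [table[r] for r in range(q, d)]


def address_of(row):
    """Address whose 1-bits are the dimensions appearing in the row."""
    return sum(1 << x for x in row if x is not None)


if __name__ == "__main__":
    members, pairs = diagonal_line_schedule(d=5, c=1)
    for row in members:
        print(format(address_of(row), "05b"),
              " ".join("-" if x is None else str(x) for x in row))
    print("pair     ", " ".join(str(x) for x in pairs[0]))
```

For d = 5 and c = 1 the five members come out as 10111, 01111, 11110, 11101 and 11011, and every time step uses each of the five dimensions exactly once when the complement pair is included.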
The schedule corresponding to the right half of Fig. 3 is obtained by drawing the 
lines vertically in a “greedy” manner. Thus, the first line is drawn from the top of the 
first column down to the (d - c)th row, then c columns to the right. The second line 
starts in the second column, goes vertically through d - c - 1 rows, then turns right 
through c columns, followed by a vertical downturn to include one more row. 
Subsequent lines are drawn in a similar way by proceeding vertically one row less than 
the previous line, then proceeding horizontally through c columns, then proceeding 
vertically to the last row reserved for the full necklace. 
The schedules corresponding to Fig. 3 are shown in Table 9. The first row corres- 
ponds to j = 0, the second row to j = 1, etc. 
In summary, in Algorithm 4, local memory addresses are partitioned into cyclic and 
noncyclic addresses. The number of cyclic addresses is two if d is prime; otherwise it is 
of order O($?) [S]. Cyclic addresses are paired through bit-complementation and 
divided into blocks of d pairs. The remaining c pairs are scheduled together with the d
noncyclic addresses forming a full (d - c)-necklace. All remaining noncyclic local memory
addresses can be scheduled as blocks of d complement pairs of addresses. 
Theorem 3.4. An all-to-all personalized communication on d processor dimensions can 
be performed in K/2 time steps for all-port communication. The maximum span is d. 
Table 9
Exchange dimension for type-2 scheduling for d = 5 and q = 4

Address           Time step         Address           Time step
                  0  1  2  3  4                       0  1  2  3  4
10111             0  2  4  1  -     01111             0  1  2  3  -
01111             -  1  3  0  2     11110             1  2  3  -  4
11110             3  -  2  4  1     11101             2  3  -  4  0
11101             2  4  -  3  0     11011             3  -  4  0  1
11011             1  3  0  -  4     10111             -  4  0  1  2
(00000, 11111)    4  0  1  2  3     (00000, 11111)    4  0  1  2  3
4. Algorithms for multiple AAPCs 
For multiple AAPCs with the same set of local memory dimensions, pipelining can 
be applied to increase the utilization of the communication channels in all-port 
communication. If each AAPC fully utilizes the communication channels, then the
only source of inefficiency is the pipeline delay. If there are additional storage
dimensions, then the pipeline delay may be avoided by performing multiple, concurrent
AAPCs. Below we first comment on the pipelined approach, then outline an algorithm 
for multiple concurrent AAPCs (Algorithm 5). 
4.1. Pipelining of AAPCs
Pipelining the communications for a succession of AAPCs is conceptually straight- 
forward when all the AAPCs involve the same number of local memory and processor 
dimensions. The communication complexity for pipelined AAPCs follows from 
Theorem 3.4. 
Corollary 4.1. The number of time steps for s all-to-all personalized communications in
succession on different sets of d processor dimensions, and the same set of local memory
dimensions, is K/2 + (s - 1)d for a local data set of size K, and all-port communication.
If the succession of AAPCs is to be performed on nonoverlapping subcubes of
different order, then the situation is more complex. If the subcubes are of nonincreasing
order, and the local memory bits for AAPCs on smaller subcubes form a subset of
the d_1 bits for the first AAPC, then an extra delay is introduced whenever the number
of dimensions in an AAPC does not divide the number of dimensions in the previous
AAPC. Assume that the number of dimensions for the ith AAPC is d_i. Then, the extra
delay introduced by the ith AAPC in the case of Algorithm 1 is d_i - gcd(d_{i-1}, d_i). No
realignment or change of pairing is required for Algorithm 1. For Algorithm 4,
a change in the number of local memory dimensions involved in the AAPC affects the
classification of addresses as cyclic and noncyclic, and hence their schedules. The
regrouping for the ith AAPC may involve addresses from two blocks, each of which
requires d_{i-1} exchange steps. Slightly more efficient and simpler algorithms for this
case are contained in [14].
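The extra delay for Algorithm 1 is a one-line computation. The sketch below evaluates d_i - gcd(d_{i-1}, d_i) for a list of subcube orders; the function name `extra_delays` and the sample orders are illustrative assumptions.

```python
from math import gcd


def extra_delays(dims):
    """Extra delay of the i-th AAPC, i >= 2, for Algorithm 1.

    dims lists the subcube orders d_1, d_2, ... in nonincreasing order;
    element i - 2 of the result is d_i - gcd(d_{i-1}, d_i).
    """
    return [dims[i] - gcd(dims[i - 1], dims[i]) for i in range(1, len(dims))]


if __name__ == "__main__":
    # 3 divides 6, so no extra delay; 2 does not divide 3, so one step.
    print(extra_delays([6, 3, 2]))
```

The delay vanishes exactly when d_i divides d_{i-1}, which is the divisibility condition stated in the text.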
4.2. Multiple concurrent AAPCs 
Consider the exchange sequence (l, k, j | i, v) → (l, v, j | i, k) → (k, v, j | i, l) → (k, v, j | l, i)
→ (k, v, i | l, j) → (k, j, i | l, v). The resulting permutation is the same as for the exchange
sequence (l, k, j | i) → (l, k, i | j) → (l, j, i | k) → (k, j, i | l). Thus, if there are local memory
dimensions in addition to the ones included in the AAPCs, then the first AAPC can be
performed on a processor axis other than the first, at the expense of one local
memory reordering and of one repeated AAPC. The idea is illustrated in Fig. 4 for six
processor axes and one local memory axis.

Fig. 4. The exchange sequences for six AAPCs in sequence using arbitrary starting dimensions and
pipelining.
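The equivalence of the two exchange sequences can be checked mechanically by treating each exchange as a swap of two axis labels. In this sketch the first three positions stand for processor axes and the remaining ones for local memory axes; that encoding and the function name `apply_swaps` are assumptions made for illustration.

```python
def apply_swaps(axes, swaps):
    """Apply a sequence of position swaps to a list of axis labels."""
    axes = list(axes)
    for a, b in swaps:
        axes[a], axes[b] = axes[b], axes[a]
    return axes


# Positions 0-2: processor axes. Positions 3 and up: local memory axes.
LONG_START = ["l", "k", "j", "i", "v"]
# Four processor/memory exchanges plus one local exchange (3, 4).
LONG_SWAPS = [(1, 4), (0, 4), (3, 4), (2, 4), (1, 4)]

SHORT_START = ["l", "k", "j", "i"]
SHORT_SWAPS = [(2, 3), (1, 3), (0, 3)]

if __name__ == "__main__":
    print(apply_swaps(LONG_START, LONG_SWAPS))
    print(apply_swaps(SHORT_START, SHORT_SWAPS))
```

Both runs end with processor axes (k, j, i) and memory axis l, and in the longer sequence the extra memory axis v is back in its original position.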
In Fig. 4, each column represents d exchanges (as required for any pair of memory 
addresses). The number of memory addresses associated with a row in the figure 
depends upon the algorithm used for each AAPC. For Algorithm 1, 2d addresses are
associated with each table row (compare Fig. 2). For Algorithm 4, a table row 
corresponds to 2d memory addresses, except for the noncyclic addresses scheduled 
together with some of the cyclic addresses. The number of local memory addresses 
associated with this row is d + 2c. The data sets dp’i and djo.‘: have their first axes 
exchanges in dimensions other than the first. With the axes labeled from 0, the data 
sets dp’j perform their first exchange with processor axis s - 2, while the data sets 
dp’l perform their first exchanges on processor axis s - 4. The l represents the 
exchange between the two memory axes initially represented by i and v, and
corresponds to the exchange (k, v, j | i, l) → (k, v, j | l, i) in the example above.
Algorithm 5 for multiple AAPCs divides the local data sets into groups scheduled 
together according to a suitable algorithm for a single AAPC. Then, s - 2 such data 
sets for s even, and s - 1 data sets for s odd, have their first exchange on an axis other 
than the first. Pairs of data sets have their first exchange on the same axis, and perform 
a local exchange prior to the exchange on the first processor axis. Data sets having 
their first exchange on the first processor axis have their exchanges pipelined. 
For multiple AAPCs, either performed as multiple concurrent AAPCs as outlined 
above or by pipelining multiple AAPCs, the following communication complexity can 
be derived. 
Theorem 4.2. A sequence of s all-to-all personalized communications each on d
processor dimensions can be accomplished in a time of, at most,

\[
T = \begin{cases}
K/2 + (s - 1)d, & s \le 3 \text{ or } K \le 8d, \\
(s + 3)d,       & 8d < K < 2(s + 1)d, \\
K/2 + 2d,       & K \ge 2(s + 1)d,
\end{cases}
\]

time steps with all-port communication.
The first case above corresponds to pipelining of AAPCs; the other two cases apply 
to multiple, concurrent AAPCs. For a succession of AAPCs of different orders, 
algorithms are presented in [14].
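The bound of Theorem 4.2 translates directly into a small function. The sketch assumes K is even and uses the hypothetical name `multi_aapc_steps`; it only evaluates the bound and does not construct a schedule.

```python
def multi_aapc_steps(K, d, s):
    """Upper bound of Theorem 4.2 on the time steps for s AAPCs.

    K: local data set size (assumed even), d: dimensions per AAPC,
    s: number of AAPCs in sequence.
    """
    if s <= 3 or K <= 8 * d:
        return K // 2 + (s - 1) * d   # pipelined AAPCs
    if K < 2 * (s + 1) * d:
        return (s + 3) * d            # multiple concurrent AAPCs
    return K // 2 + 2 * d             # multiple concurrent AAPCs


if __name__ == "__main__":
    d, s = 4, 5
    for K in (32, 40, 48, 64):
        print(K, multi_aapc_steps(K, d, s))
```

For d = 4 and s = 5 the three expressions agree at the case boundaries K = 8d = 32 and K = 2(s + 1)d = 48, where all of them give (s + 3)d = 32 steps.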
5. Minimizing communications overhead 
The scheduling algorithms above attempt to maximize the utilization of the com- 
munications bandwidth, i.e., the algorithms strive to communicate on all channels in 
every exchange step. No attention was paid to a possible overhead in communicating 
elements. In many message-passing communication libraries, the overhead associated 
with each message is often substantial. It is of interest to organize the element transfers 
into block transfers, with each such block being subject to one communications 
overhead. Below we present a blocking procedure applicable to any algorithm. The 
procedure yields the minimal number of block transfers, and the minimal block size 
for this number of block transfers. Applying the blocking procedure to Algorithm
4 yields optimum span, optimum communication channel utilization, and
a minimum number of block transfers in sequence.
We construct the blocks in the form of a table of local memory addresses where all 
entries in a row can be communicated as one block. Thus, an address can only appear 
once in a row. A block cannot be extended beyond a single row. The number of 
different rows into which an address is entered is equal to the number of exchange 
steps required for the address. The minimum number of block transfers is equal to the 
number of rows. Whether or not the minimum number of block transfers can be 
realized in an implementation depends on the sizes of communications buffers. 
Clearly, it is of interest to minimize the maximum block size. 
Before defining the procedure, we give an example. For Algorithm 4, a table with 
d rows (the minimum) can be constructed as shown in Table 10. R(i) denotes the 
necklace of addresses derived from rotations of i, and C(i) denotes the complement 
pair of addresses {i, ī}. Thus, C(11111) denotes the addresses (00000) and (11111).
A group is a set of addresses that are scheduled together during d exchange steps in
a nonblocked algorithm. Group 1 consists of the c complement pairs of cyclic
addresses and the associated (d - c)-necklace of addresses scheduled together during
d exchange steps. The columns Group 2 and Group 3 each contain the addresses of
two full necklaces. The necklaces in a column form d complement pairs of addresses
scheduled during d exchange steps. The last column contains a single necklace for
which a single exchange suffices. The maximum block size is four in this example.

Table 10
The scheduling for block exchanges for d = 5

Comm. step   Group 1                  Group 2       Group 3       Group 4
0            {C(11111), R(01111)}     {R(00011)}    {R(00101)}    {R(00001)}
1            {C(11111), R(01111)}     {R(00011)}    {R(00101)}
2            {C(11111), R(01111)}     {R(00111)}    {R(01011)}
3            {C(11111), R(01111)}     {R(00111)}    {R(01011)}
4            {C(11111), R(01111)}     {R(00111)}    {R(01011)}
In Algorithm 1, d complement addresses are scheduled together during each of 
d exchange steps. Each such group of addresses may form one column of our table for 
blocking of data exchanges. The number of groups and the block size is ⌊K/(2d)⌋.
Applying our blocking strategy to Algorithm 2 yields a table of addresses with K/2 
rows, since the span is K/2. Fewer blocks yield bandwidth inefficiency for this 
algorithm. 
In general, we partition the local memory addresses X = {0, 1, ..., K - 1} into
disjoint subsets X_1, X_2, ..., X_σ, with

\[
\bigcup_{s=1}^{\sigma} X_s = X, \qquad X_s \cap X_r = \emptyset, \quad s \ne r.
\]

The addresses belonging to different subsets are scheduled independently, while addresses
within a subset are scheduled together. In the example above, the subsets can be defined
as follows: X_1 = {C(11111), R(01111)}, X_2 = {R(00011)}, X_3 = {R(00111)},
X_4 = {R(00101)}, X_5 = {R(01011)}, and X_6 = {R(00001)}.
Let T_s be the time required for subset X_s to complete the data motion. In our
example, T_1 = 5, T_2 = T_4 = 2, T_3 = T_5 = 3 and T_6 = 1. By creating a table of
addresses with T_max = max_{1 ≤ s ≤ σ} T_s rows, T_max block transfers are required for
a maximum block size of ⌈T_sum/T_max⌉, where T_sum = Σ_{s=1}^{σ} T_s. The strategy in
assigning subsets to table entries is that a subset X_i can appear in a row at most once,
but must appear in precisely T_i rows.
Note that if each subset fully uses the communications bandwidth, then the number
of element transfers in sequence is the same for the blocked and nonblocked algorithms,
i.e., T_sum.
Let |x| be the number of 1-bits in x. Clearly, if the schedule for each subset X_i fully
uses all channels, then T_i = (Σ_{x ∈ X_i} |x|)/d and T_sum = Σ_{i=1}^{σ} T_i = K/2. Thus,
we have the following lemma.
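The numbers in the d = 5 example can be recomputed from the definition of the per-subset time as the total count of 1-bits divided by d. The sketch below assumes d-bit integer addresses; `transfer_time`, `blocking_summary` and `example_subsets_d5` are hypothetical names.

```python
def rotate_left(address, r, d):
    """Rotate a d-bit address left by r positions, cyclically."""
    r %= d
    mask = (1 << d) - 1
    return ((address << r) | (address >> (d - r))) & mask


def necklace(address, d):
    """R(address): the set of all rotations of the address."""
    return {rotate_left(address, r, d) for r in range(d)}


def transfer_time(subset, d):
    """Exchange steps for a subset that keeps all d channels busy."""
    return sum(bin(x).count("1") for x in subset) // d


def blocking_summary(subsets, d):
    """Return (times, t_max, t_sum, max_block_size)."""
    times = [transfer_time(s, d) for s in subsets]
    t_max, t_sum = max(times), sum(times)
    return times, t_max, t_sum, -(-t_sum // t_max)  # ceiling division


def example_subsets_d5():
    """The six subsets of the d = 5 example in the text."""
    d = 5
    first = {0b00000, 0b11111} | necklace(0b01111, d)
    rest = [necklace(a, d) for a in (0b00011, 0b00111, 0b00101, 0b01011, 0b00001)]
    return [first] + rest


if __name__ == "__main__":
    print(blocking_summary(example_subsets_d5(), 5))
```

The result is five rows, a total of 16 = K/2 element transfers, and a maximum block size of four, as in Table 10.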
Lemma 5.1. Algorithm 3 can be organized into d + (K/2) mod d ≤ 2d - 1 block
exchanges with maximum block size ⌈K/(2(d + (K/2) mod d))⌉, which is optimal.
Algorithm 4 can be organized into d block exchanges with maximum block size ⌈K/(2d)⌉,
which is also optimal.
6. Wide channels 
If the channel width is a multiple b of the width of a data item, then b x d addresses 
can be scheduled concurrently. Determining the optimum channel utilization is 
related to minimizing the maximum block size in an algorithm that blocks the data 
transfers. 
Table 11
The scheduling for d = 5 and b = 3

Group 1                Time   Group 2      Time   Group 3      Time   Group 4      Time
{C(11111), R(01111)}   1      {R(00011)}   0      {R(00101)}   0      {R(00001)}   0
{C(11111), R(01111)}   2      {R(00011)}   1      {R(00101)}   1
{C(11111), R(01111)}   3      {R(00111)}   2      {R(01011)}   2
{C(11111), R(01111)}   4      {R(00111)}   3      {R(01011)}   3
{C(11111), R(01111)}   5      {R(00111)}   4      {R(01011)}   4
Table 11 shows a scheduling for d = 5 and b = 3. The columns headed by “Time” 
denote the time step during which an address is scheduled for exchange. The same 
time step appears precisely b = 3 times except for the last step. The time step is 
determined by labeling the table entries row by row and within each row from right to 
left. 
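The labeling rule can be sketched as follows. The table is represented, by assumption, as a list of rows, each listing the group numbers present in that row from left to right; `label_time_steps` is a hypothetical name.

```python
def label_time_steps(rows, b):
    """Assign a time step to every (row, group) table entry.

    Entries are visited row by row and from right to left within a row;
    each time step receives b consecutive entries.
    """
    labels = {}
    count = 0
    for r, groups in enumerate(rows):
        for g in reversed(groups):
            labels[(r, g)] = count // b
            count += 1
    return labels


if __name__ == "__main__":
    # Layout of Table 11: row 0 has four groups, rows 1-4 have three.
    rows = [[1, 2, 3, 4]] + [[1, 2, 3]] * 4
    labels = label_time_steps(rows, b=3)
    for r in range(5):
        print(r, {g: labels[(r, g)] for g in rows[r]})
```

With b = 3 this reproduces the Time columns of Table 11: Group 1 is scheduled in steps 1 to 5, Groups 2 and 3 in steps 0 to 4, and Group 4 in step 0.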
Lemma 6.1. If ⌊K/(2d)⌋ ≥ b, then an AAPC of order d with channel bandwidth b can
be completed in d⌈K/(2db)⌉ time steps, which is optimal.
Proof. Multiple occurrences of cyclic addresses appear within the same column
(group), since T_i = d for all such addresses. The blocking procedure guarantees that
multiple occurrences of cyclic addresses appear within blocks of d rows and in the
same column (group). Clearly, multiple occurrences in the same group cannot be
scheduled during the same time step, since b ≤ ⌊K/(2d)⌋. For noncyclic addresses,
T_i < d. If the same address appears in two adjacent groups, say in row i of group j and
row i' of group j + 1, then i' < i. □
When b ≥ ⌈K/(2d)⌉, the upper bound is the same as the lower bound: d.
7. Summary 
We have presented four schedules for a single all-to-all personalized communica- 
tion, three of which are simple to implement. One algorithm (Algorithm 4) requires the 
optimal number of element exchanges in sequence, K/2, with a span of d. Combining 
Algorithms 3 and 4 for s all-to-all personalized communications in sequence, each on
d dimensions, yields

\[
T = \begin{cases}
K/2 + (s - 1)d, & s \le 3 \text{ or } K \le 8d, \\
(s + 3)d,       & 8d < K < 2(s + 1)d, \\
K/2 + 2d,       & K \ge 2(s + 1)d,
\end{cases}
\]

time steps.
The algorithms can be organized such that indirect addressing is not required in 
interprocessor data exchanges by carrying out pre- and postalignment steps. By 
combining these alignments with other local permutations, the presented algorithms 
can perform permutations of the form (i | j) ← (g(j) | f(i)) with no increase in the
required data motion.
The presented blocking procedure preserves optimality with respect to element 
transfers for an optimal element-wise schedule. For instance, for Algorithm 4 the
number of block transfers is d for a single AAPC and the block size is ⌈K/(2d)⌉.
Applying the same blocking procedure to the scheduling of exchanges when the
channel width is a multiple b of the width of a message yields ⌈K/(2d)⌉ element
transfers in sequence, if ⌊K/(2d)⌋ ≥ b, otherwise d.
Algorithm 1 has been implemented on the Connection Machine model CM-2. The
exchanges require 40 µs per element compared to 62 µs for the schedule in [2]. The
total expense for alignment and realignment is about 0.9 µs per element (four bytes).
Hence, the simplified algorithms presented here may yield a speedup of up to 50% for
a single AAPC as well as a considerably reduced pipeline delay for multiple AAPCs.
Acknowledgement 
The Connection Machine implementation of Algorithm 1 was performed by Michel 
Jacquemin of Yale University, Department of Computer Science, in a collaborative 
effort between Thinking Machines Corporation and INRIA, Centre de Sophia-Antipolis.
Valuable comments on a draft were given by Roland Sweet of the University of
Colorado at Denver. We also gratefully acknowledge the comments and suggestions 
made by the referees, which helped improve the presentation significantly. 
References 
[1] D.P. Bertsekas, C. Ozveren, G.D. Stamoulis, P. Tseng and J.N. Tsitsiklis, Optimal communication
algorithms for hypercubes, J. Parallel Distrib. Comput. 11 (1991) 263-275.
[2] A. Edelman, Optimal matrix transposition and bit-reversal on hypercubes: all-to-all personalized
communication, J. Parallel Distrib. Comput. 11 (1991) 328-331.
[3] P.M. Flanders, A unified approach to a class of data movements on an array processor, IEEE Trans.
Comput. 31 (1982) 809-819.
[4] G. Fox, S. Hiranandani, K. Kennedy, C. Koelbel, U. Kremer, C. Tseng and M. Wu, Fortran
D language specification, Tech. Rept. TR90-141, Department of Computer Science, Rice University
(1990).
[5] C.-T. Ho and S.L. Johnsson, Optimal algorithms for stable dimension permutations on Boolean
cubes, in: Third Conference on Hypercube Concurrent Computers and Applications (ACM, New
York, 1988) 725-736.
[6] C.-T. Ho and S.L. Johnsson, Stable dimension permutations on Boolean cubes, Tech. Rept.
YALEU/DCS/RR-617, Department of Computer Science, Yale University (1988).
[7] C.-T. Ho and S.L. Johnsson, Spanning balanced trees in Boolean cubes, SIAM J. Sci. Statist. Comput.
10 (1989) 607-630.
[8] D. Hoey and C.E. Leiserson, A layout for the shuffle-exchange network, in: 1980 International
Conference on Parallel Processing, IEEE Computer Society (1980).
[9] S.L. Johnsson, Communication efficient basic linear algebra computations on hypercube architectures,
J. Parallel Distrib. Comput. 4 (1987) 133-172.
[10] S.L. Johnsson and C.-T. Ho, Matrix transposition on Boolean n-cube configured ensemble architectures,
SIAM J. Matrix Anal. Appl. 9 (1988) 419-454.
[11] S.L. Johnsson and C.-T. Ho, Shuffle permutations on Boolean cubes, Tech. Rept.
YALEU/DCS/RR-653, Department of Computer Science, Yale University (1988).
[12] S.L. Johnsson and C.-T. Ho, Spanning graphs for optimum broadcasting and personalized communication
in hypercubes, IEEE Trans. Comput. 38 (1989) 1249-1268.
[13] S.L. Johnsson and C.-T. Ho, Boolean cube emulation of butterfly networks encoded by Gray code,
Tech. Rept. YALEU/DCS/RR-764, Department of Computer Science, Yale University (1990).
[14] S.L. Johnsson and C.-T. Ho, Generalized shuffle permutations on Boolean cubes, J. Parallel Distrib.
Comput. 16 (1992) 1-14.
[15] F.T. Leighton, Complexity Issues in VLSI: Optimal Layouts for the Shuffle-Exchange Graph and
Other Networks (MIT Press, Cambridge, MA, 1983).
[16] D. Nassimi and S. Sahni, An optimal routing algorithm for mesh-connected parallel computers,
J. ACM 27 (1980) 6-29.
[17] D. Nassimi and S. Sahni, Optimal BPC permutations on a cube connected SIMD computer, IEEE
Trans. Comput. 31 (1982) 338-341.
[18] Y. Saad and M.H. Schultz, Data communication in hypercubes, Tech. Rept. YALEU/DCS/RR-428,
Department of Computer Science, Yale University (1985).
[19] Q.F. Stout and B. Wagar, Passing messages in link-bound hypercubes, in: M.T. Heath, ed., Hypercube
Multiprocessors 1987 (Society for Industrial and Applied Mathematics, Philadelphia, PA, 1987).
[20] Thinking Machines Corp., CM-200 Technical Summary (1991).
[21] H. Zima, P. Brezany, B. Chapman, P. Mehrotra and A. Schwald, Vienna Fortran - A language
specification version 1.1, Tech. Rept. ICASE, Interim Report 21 (1992).
