Generalized Shuffle Permutations on Boolean Cubes by Johnsson, S. Lennart & Ho, Ching-Tien
Citation Johnsson, S. Lennart and Ching-Tien Ho. 1991. Generalized Shuffle
Permutations on Boolean Cubes. Harvard Computer Science Group
Technical Report TR-04-91.
Citable link http://nrs.harvard.edu/urn-3:HUL.InstRepos:23597699
Generalized Shuffle Permutations on Boolean Cubes
S. Lennart Johnsson
Ching-Tien Ho
TR-04-91
February 1991
Parallel Computing Research Group
Center for Research in Computing Technology
Harvard University
Cambridge, Massachusetts
Air Force Office of Scientific Research grant: 44-752-9000-2.
National Science Foundation grant: 44-752-9705-2.
S. Lennart Johnsson
Division of Applied Sciences, Harvard University, Cambridge, MA 02138
and Thinking Machines Corp.
Johnsson@harvard.edu

Ching-Tien Ho
IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120-6099
Ho@ibm.com
Abstract. In a generalized shuffle permutation an address $(a_{q-1} a_{q-2} \ldots a_0)$ receives its content from an address obtained through a cyclic shift on a subset of the $q$ dimensions used for the encoding of the addresses. Bit-complementation may be combined with the shift. We give an algorithm that requires $\frac{K}{2} + 2$ exchanges for $K$ elements per processor, when storage dimensions are part of the permutation, and concurrent communication on all ports of every processor is possible. The number of element exchanges in sequence is independent of the number of processor dimensions $\sigma_r$ in the permutation. With no storage dimensions in the permutation our best algorithm requires $(\sigma_r + 1)\lceil \frac{K}{2\sigma_r} \rceil$ element exchanges. We also give an algorithm for the case in which $\sigma_r = 2$, or the real shuffle consists of a number of cycles of length two, that requires $\frac{K}{2} + 1$ element exchanges in sequence when there is no bit complement. The lower bound is $\frac{K}{2}$ for both real and mixed shuffles with no bit complementation. The minimum number of communication start-ups is $\sigma_r$ for both cases, which is also the lower bound. The data transfer time for communication restricted to one port per processor is $\sigma_r \frac{K}{2}$, and the minimum number of start-ups is $\sigma_r$. The analysis is verified by experimental results on the Intel iPSC/1, and for one case also on the Connection Machine.
1 Introduction
The main contributions of this paper are optimal algorithms for dimension permutations on Boolean cube configured distributed memory multi-processors, and lower bounds for such permutations with concurrent communication on all channels. Communication systems (packet or circuit switched) that allow communication on only one channel at a time per processor are treated briefly. The Connection Machine is an example of a computer allowing concurrent communication on all channels. Some computers allow concurrent communication on several, but not all, channels of every processor. The techniques for concurrent communication on all channels may be adapted to such architectures.
A dimension permutation is defined by permuting and/or complementing the bits of the logic address field. There are $M (\log_2 M)!$ possible dimension permutations, where $M$ is the number of elements in the address field. We consider stable permutations [2], i.e., permutations for which the data occupy the same machine address space before and after the data rearrangement. The machine address space consists of two parts: a processor address field and a local storage address field. Real dimension permutations are restricted to the processor address field, whereas mixed dimension permutations include parts of both the local storage and processor address fields. Virtual dimension permutations that only include local storage addresses require no communication, and are not considered here.
The reason for distinguishing between processor and local storage dimensions is that access to local storage is usually considerably faster than communication between processors. The width of the data paths to local memory is often equal to the word width of the architecture (32 or 64 bits), while the width of network channels is typically in the range of 1 to 16 bits. Furthermore, the contention for network channels is often a more serious issue than the contention for local memory, and the techniques for reducing network contention are quite different from the techniques for handling contention for local memory.
The difference in the width of the data paths to local memory and inter-processor communication channels, and the different protocols used for inter-processor communication and local memory references, often allow many local memory references to be performed in the time required for an inter-processor communication. For instance, in a Connection Machine model CM-2 each processor has a single 32-bit wide data path to memory, while inter-processor communication channels are 1-bit wide. The width of the communication channels is determined by a trade-off between many demands for off-chip communication. Pins on chips are a highly critical resource in many technologies. A memory read of 32 bits is a one-cycle operation, while a store is slightly longer. An exchange of 32 bits between a pair of processors requires about 185 cycles. On a fully configured Connection Machine model CM-2 22 exchanges can be performed concurrently. In the n-port communication model local memory references on a Connection Machine account for at most about 25% of the total time.
Examples of dimension permutations are k-shuffle/unshuffle permutations, matrix transposition, bit-reversal, vector-reversal, and conversion between various data allocation schemes, such as consecutive and cyclic storage [3, 4], reshaping of arrays [7], and multi-sectioning. A vector reversal expressed in binary code is the operation $(b_{n-1} b_{n-2} \ldots b_0) \to (\bar{b}_{n-1} \bar{b}_{n-2} \ldots \bar{b}_0)$, where $\bar{b}_i$ is the complement of $b_i$, i.e., $V(i) \leftarrow V(2^n - 1 - i)$ with elements numbered consecutively from zero. Bit-reversal is the operation $(b_{n-1} b_{n-2} \ldots b_0) \to (b_0 b_1 \ldots b_{n-1})$, and shuffle is the operation $(b_{n-1} b_{n-2} \ldots b_0) \to (b_{n-2} b_{n-3} \ldots b_0 b_{n-1})$. The transposition of a matrix with axis lengths being powers of two is the operation $(r_{p-1} r_{p-2} \ldots r_0 c_{q-1} c_{q-2} \ldots c_0) \to (c_{q-1} c_{q-2} \ldots c_0 r_{p-1} r_{p-2} \ldots r_0)$, which is a shuffle repeated $p$ times, or an unshuffle performed $q$ times. In the consecutive allocation scheme successive elements are allocated to the same processor, whereas in cyclic allocation successive elements are assigned to successive processors in a wraparound fashion. From the illustration below, it is clear that conversion between the two is a dimension permutation, which can be defined as an n-shuffle, or n-unshuffle.
Consecutive: $(\underbrace{b_{p-1} b_{p-2} \ldots b_{p-n}}_{\text{paddr}} \mid \underbrace{b_{p-n-1} b_{p-n-2} \ldots b_0}_{\text{maddr}})$

Cyclic: $(\underbrace{b_{p-1} b_{p-2} \ldots b_n}_{\text{maddr}} \mid \underbrace{b_{n-1} b_{n-2} \ldots b_0}_{\text{paddr}})$
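The conversion between the two allocation schemes can be checked on small cases. The following is a minimal sketch, assuming elements are numbered by a $p$-bit index and the machine address is written as (paddr | maddr) with $n$ processor bits; the function names are illustrative and not taken from the paper. It shows that the cyclic machine address of an element is a cyclic rotation of its consecutive machine address.

```python
def rotate_left(x: int, width: int, steps: int) -> int:
    """Cyclically rotate the low `width` bits of x to the left by `steps`."""
    steps %= width
    mask = (1 << width) - 1
    return ((x << steps) | (x >> (width - steps))) & mask


def consecutive_address(i: int, p: int, n: int) -> int:
    """Machine address (paddr | maddr) of element i, consecutive allocation.

    The high n bits of i select the processor and the low p - n bits the
    local slot, so the machine address equals the element index.
    """
    return i


def cyclic_address(i: int, p: int, n: int) -> int:
    """Machine address (paddr | maddr) of element i, cyclic allocation."""
    paddr = i & ((1 << n) - 1)   # low n bits of i select the processor
    maddr = i >> n               # remaining high bits select the local slot
    return (paddr << (p - n)) | maddr
```

With $p = 5$ and $n = 2$, `cyclic_address(i, 5, 2)` equals `rotate_left(consecutive_address(i, 5, 2), 5, 3)` for every element, i.e., the two layouts differ by a cyclic shift of the address bits.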
Nassimi and Sahni [10, 11] consider stable, real dimension permutations on multi-processors configured as meshes and hypercubes, while Flanders [1] focuses on stable, both real and mixed, dimension permutations on two-dimensional mesh-connected multi-processors. Swarztrauber [15] considers dimension permutations on Boolean cubes. Swarztrauber does not consider bit complementation, which is not required for the FFT, the main focus of [15]. In all of the previous work only one communication channel per processor is used in any step of the algorithms, or for lower bounds. We consider concurrent communication on all channels of all processors. Such communication is possible on the Connection Machine. The emphasis is on mixed, stable permutations. Real, stable permutations are treated briefly. Algorithms for real, stable dimension permutations with one-port communication can be found in [10, 11, 15].
The notation and definitions used throughout the paper are introduced in Section 2. In Section 3 we discuss lower bounds. Algorithms are described in Section 4, and results from implementations on the Intel iPSC/1 and the Connection Machine are described in Section 5. We conclude with a few remarks in Section 6.
2 Preliminaries
In one-port communication the communication is restricted to one exchange operation per processor at a time. In n-port communication exchange operations can be performed concurrently on all ports of every processor. With the processors configured as a Boolean n-cube each processor has $n$ channels (edges) to $n$ distinct processors (nodes). With a binary encoding there is an adjacent processor for every bit of the address, or dimension. The number of node-disjoint minimum length paths between any pair of nodes at real distance $d$ is $d$. There are $n - d$ node-disjoint paths of length $d + 2$ between any pair of nodes at real distance $d$ [12]. The machine address space is

\[ A = \{(\overbrace{a_{q-1} a_{q-2} \ldots a_{q-n}}^{\text{paddr}} \mid \overbrace{a_{q-n-1} a_{q-n-2} \ldots a_0}^{\text{maddr}})\}, \]

where $a_i \in \{0, 1\}$. The symbol "|" denotes concatenation. The rightmost dimension is dimension zero, the least significant one. We arbitrarily let the processor address field form the high order part of the global address, and the local storage address field form the low order part. The real distance between two locations in the address space is $\mathrm{Hamming}_r(a, a') = \sum_{i=q-n}^{q-1} (a_i \oplus a'_i)$ and the virtual distance is $\mathrm{Hamming}_v(a, a') = \sum_{i=0}^{q-n-1} (a_i \oplus a'_i)$. $\mathrm{Hamming}(a, a') = \mathrm{Hamming}_r(a, a') + \mathrm{Hamming}_v(a, a')$.

The set of machine dimensions is $D_A = \{q-1, q-2, \ldots, 0\}$. The processor dimensions are $D_A^p = \{q-1, q-2, \ldots, q-n\}$, and the local storage dimensions are $D_A^s = \{q-n-1, q-n-2, \ldots, 0\}$. The logic address space $L = \{(b_{m-1} b_{m-2} \ldots b_0)\}$ has the dimensions $D_L = \{m-1, m-2, \ldots, 0\}$, $m \leq q$. A dimension allocation is a mapping $D_L \to D_A$. The set of machine dimensions used for the data allocation is $D_U \subseteq D_A$. $D_U^p = D_U \cap D_A^p$ and $D_U^s = D_U \cap D_A^s$. $K = 2^k$ is the number of elements per processor, where $k$ is the cardinality of the set $D_U^s$, or $k = |D_U^s|$.
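The real and virtual distances can be computed directly from the bit encoding. The sketch below assumes a machine address is held as a $q$-bit integer whose high $n$ bits are the processor address; the function names are illustrative.

```python
def hamming_real(a: int, b: int, q: int, n: int) -> int:
    """Number of differing bits among the n processor dimensions q-1 .. q-n."""
    return bin((a ^ b) >> (q - n)).count("1")


def hamming_virtual(a: int, b: int, q: int, n: int) -> int:
    """Number of differing bits among the local storage dimensions q-n-1 .. 0."""
    low_mask = (1 << (q - n)) - 1
    return bin((a ^ b) & low_mask).count("1")


def hamming(a: int, b: int) -> int:
    """Total Hamming distance; equals the sum of the real and virtual parts."""
    return bin(a ^ b).count("1")
```

Only the real distance corresponds to channels that must be traversed; the virtual distance is resolved by local memory moves.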
Definition 1 A Stable Dimension Permutation (SDP), $\delta$, is a one-to-one mapping $D_U \to D_U$ such that $\alpha_i \leftarrow \beta_i$ or $\alpha_i \leftarrow \bar{\beta}_i$, where $\alpha_i, \beta_i \in D_U$, $i \in \{q-1, q-2, \ldots, 0\}$, and $\alpha_i \neq \alpha_j$, $\beta_i \neq \beta_j$, for any $i \neq j$. The index set $J$ of the permutation is the set $\{\alpha_i \mid \alpha_i \neq \beta_i\}$. The SDP is real if $J \subseteq D_U^p$, virtual if $J \subseteq D_U^s$, and mixed otherwise. The order $\sigma$ of the permutation is $|J|$, the real order $\sigma_r$ is $|J \cap D_U^p|$, and the virtual order $\sigma_s$ is $|J \cap D_U^s|$.
Definition 2 Let $J = \{\alpha_{\sigma-1}, \alpha_{\sigma-2}, \ldots, \alpha_0\} \subseteq D_A$, where $\sigma > 1$ and $\alpha_i \neq \alpha_j$, $i \neq j$. A stable generalized shuffle permutation (GSH) is an SDP such that $\alpha_i \leftarrow \alpha_{(i-1) \bmod \sigma}$ or $\alpha_i \leftarrow \bar{\alpha}_{(i-1) \bmod \sigma}$, $i \in \{0, 1, \ldots, \sigma - 1\}$.

Before (entries are the addresses (paddr|maddr) of the data held):
paddr    00       01       10       11
         (00|00)  (01|00)  (10|00)  (11|00)
         (00|01)  (01|01)  (10|01)  (11|01)
         (00|10)  (01|10)  (10|10)  (11|10)
         (00|11)  (01|11)  (10|11)  (11|11)

After:
paddr    00       01       10       11
         (00|00)  (01|00)  (00|10)  (01|10)
         (00|01)  (01|01)  (00|11)  (01|11)
         (10|00)  (11|00)  (10|10)  (11|10)
         (10|01)  (11|01)  (10|11)  (11|11)

Figure 1: Shuffle permutation of order two.
In a dimension permutation the content of address $a$ is assigned to address $\delta(a)$ (i.e., $\delta(a) \leftarrow a$). A GSH corresponds to a single cycle on the bits of the address field. We let the GSH correspond to a left cyclic shift on the address field. For example, the shuffle $(a_4 a_3 a_2 a_1 a_0) \leftarrow (a_3 a_2 a_1 a_0 a_4)$ has the index set $\{4, 3, 2, 1, 0\}$. For a 2-shuffle on the same dimensions $J = \{4, 2, 0, 3, 1\}$. In general, an SDP defines several cycles on the set $J$. There are $2^{m-\sigma}$ identical SDP of order $\sigma$ in an address space of size $2^m$.
A real GSH preserves the local address map. Data is moved between processors in subcubes defined by the set of dimensions $J$. A real GSH is a generalization of the collinear planar exchanges described in [1]. A virtual GSH (called vertical exchanges in [1]) implies the same local data movement in all processors. A mixed GSH (called planar-vertical exchanges in [1]) implies interprocessor communication and a different placement in local memory of communicated data. For instance, the GSH of order two defined by $(a_3 a_2 \mid a_1 a_0) \leftarrow (a_1 a_2 \mid a_3 a_0)$ results in the data motion in Figure 1. An i-cycle as defined in [15] does not include local memory reordering, and is further restricted to include only the most significant storage dimension.
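The data motion of Figure 1 can be reproduced with a few lines of code. This is a sketch under the assumption that an address is a small integer and that the permutation is given as a map from each destination bit position to the source bit position it is fed from; `source_address` and `permute` are illustrative names, not from the paper.

```python
def source_address(a: int, source_bit: dict) -> int:
    """Address whose content is assigned to address a.

    source_bit[i] = j means that bit i of the source address is bit j of a,
    matching the notation (a3 a2 | a1 a0) <- (a1 a2 | a3 a0).
    """
    src = 0
    for i, j in source_bit.items():
        src |= ((a >> j) & 1) << i
    return src


def permute(memory: list, source_bit: dict) -> list:
    """Return the memory image after the dimension permutation."""
    return [memory[source_address(a, source_bit)] for a in range(len(memory))]


# GSH of order two from Figure 1: dimensions 3 and 1 are exchanged.
FIGURE_1 = {3: 1, 2: 2, 1: 3, 0: 0}
after = permute(list(range(16)), FIGURE_1)
```

Here `after[0b0010]` is `0b1000`: local location 10 of processor 00 ends up holding the element that started at (10|00), as in the third row of the first column of Figure 1.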
Definition 3 A sub-cube permutation (SCP) is an algorithm for dimension permutation in which the routing is confined to the set of processors to which the data set is allocated. An extended-cube permutation (ECP) is an algorithm for dimension permutation in which the routing is extended to an $n_e$-cube in which the n-cube holding data is embedded.
3 Time complexity
The transmission time for each element is $t_c$, and the start-up time, or overhead, for each packet of $B$ elements is $\tau$. In a circuit-switched system $\tau$ corresponds to the time to set one switch in a path. In the time complexity $T(\mathit{ports}, \sigma_r, n, K)$ for a lower bound, or an algorithm, the first argument is the number of ports per processor used concurrently, the second argument the real order of the GSH, the third argument the number of processor dimensions used for data allocation, and the last argument the size of the local data set.
Lemma 1 The time complexity of an SDP of real order $\sigma_r < n$ cannot be improved by communication in the $n - \sigma_r$ processor dimensions not included in the index set, if a dimension permutation is required within all $\sigma_r$-cubes, and the SDP algorithm uses full bandwidth within each $\sigma_r$-cube.
Proof: We prove the lemma by contradiction. Let the lower bound for the SDP of real order $\sigma_r < n$ on an n-cube be $T = \frac{W}{L}$, where $W$ is the total bandwidth required and $L$ the available bandwidth per unit time. Now, map $2^{n-\sigma_r}$ nodes of the n-cube to a single node of a $\sigma_r$-cube by identifying all nodes with the same values of the address bits for dimensions in the set $J$ of the SDP. Furthermore, increase the communications bandwidth of each channel in the $\sigma_r$-cube by a factor of $2^{n-\sigma_r}$. Then, every algorithm for the SDP performed on the n-cube can be converted to an algorithm on the $\sigma_r$-cube with a running time $T''$ that is at most the same. Hence, $T \geq T'' = \frac{2^{n-\sigma_r} W}{2^{n-\sigma_r} L} = T$.
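Lemma 2 Lower bounds for a sub-cube GSH of real order $\sigma_r > 0$ are

\[
T^{lb}_{gsh}(1, \sigma_r, n, K) =
\begin{cases}
\max\left(\sigma_r \frac{K}{2} t_c,\ (\sigma_r - 1)\tau\right), & \text{if (number of dimensions that are moved and not complemented) is odd,} \\
\max\left(\sigma_r \frac{K}{2} t_c,\ \sigma_r \tau\right), & \text{otherwise,}
\end{cases}
\]

\[
T^{lb}_{gsh}(\sigma_r, \sigma_r, n, K) =
\begin{cases}
\max\left(\frac{K}{2} t_c,\ (\sigma_r - 1)\tau\right), & \text{if (number of dimensions that are moved and not complemented) is odd,} \\
\max\left(\frac{K}{2} t_c,\ \sigma_r \tau\right), & \text{otherwise.}
\end{cases}
\]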
Proof: The minimum number of start-ups, or switch settings in a path, is equal to the maximum number of channels that must be traversed, which is $\max_{a \in A} \mathrm{Hamming}_r(a, \delta(a))$.

The minimum data transfer time is bounded from below by the required bandwidth divided by the available bandwidth. By Lemma 1 it suffices to consider a $\sigma_r$-cube. For each dimension $j \in J \cap D_U^p$ such that $j \leftarrow i$, where $i \in J \cap D_U^p$, only nodes for which $a_i \neq a_j$ need to send elements across dimension $j$. If instead $i \in J \cap D_U^s$ then all nodes send half of their data. Therefore, the bandwidth requirement for each $\sigma_r$-cube is $\sigma_r 2^{\sigma_r} \frac{K}{2}$. The available bandwidth per routing cycle is $2^{\sigma_r}$ for one-port and $\sigma_r 2^{\sigma_r}$ for n-port communication.
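Dividing the bandwidth requirement by the available bandwidth gives the data transfer terms of Lemma 2. Writing $W$ for the requirement and $L_1$, $L_n$ for the one-port and n-port bandwidths (the subscripted names are introduced here only for this worked step):

```latex
\[
\frac{W}{L_1} = \frac{\sigma_r 2^{\sigma_r} \frac{K}{2}}{2^{\sigma_r}} = \sigma_r \frac{K}{2},
\qquad
\frac{W}{L_n} = \frac{\sigma_r 2^{\sigma_r} \frac{K}{2}}{\sigma_r 2^{\sigma_r}} = \frac{K}{2}.
\]
```

Multiplying by the element transfer time $t_c$ yields $\sigma_r \frac{K}{2} t_c$ for one-port and $\frac{K}{2} t_c$ for n-port communication.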
The lower bounds are not tight when some dimensions in the permutation are complemented. For instance, consider the case $(\alpha_0 \alpha_1) \leftarrow (\bar{\alpha}_1 \alpha_0)$. In this case the permutation is a cyclic shift in a loop of length four. All non-minimum length paths are of length three. There is only one minimum length path between any pair of processors, and only one non-minimum length path as well. For a routing time of $\frac{K}{2}$ element transfers in sequence at most $\frac{K}{2}$ elements can be routed along minimum length paths. The total routing requirement for elements routed along non-minimum length paths is at least $4 \cdot 3 \cdot \frac{K}{2}$. With four available links a total routing time of at most $\frac{K}{2}$ is impossible.
Corollary 1 The data transfer time for a sub-cube, real GSH, performed by any fixed packet size algorithm requiring $c$ routing cycles per packet is at least $\frac{cK}{2(c-1)} t_c$ for n-port communication, and at least $\frac{c \sigma_r K}{2(c-1)} t_c$ for one-port communication in which all processors use the same dimension during the same routing cycle.
Proof: For the lower bound data transfer time in Lemma 2, all channels are used evenly in every routing cycle, and all elements are routed through a shortest path. But, during the first and last routing cycles, at least half of the channels are not used for a real GSH. For one-port communication, the same argument applies to the first and last routing cycles for any cube dimension used by all processors.
A dimension permutation on a data set allocated to an n-dimensional subcube of an $n_e$-cube can be realized by subcube expansion from the n-cube to the $n_e$-cube, full cube permutation, and compression to the original n-cube. The permutation is performed with local data sets reduced by a factor of $2^{n_e - n}$. The subcube expansion (compression) is of type one-to-all (all-to-one) personalized communication [6].
Corollary 2 Lower bounds for a GSH of order $\sigma$ extended from an n-cube to an $n_e$-cube are

\[
T^{lb}_{gsh}(1, \sigma_r, n, K) =
\begin{cases}
\max\left(\left(2\left(K - \frac{K}{2^{n_e - n}}\right) + \sigma_r \frac{K}{2^{n_e - n + 1}}\right) t_c,\ (2(n_e - n) + \sigma_r - 1)\tau\right), & \text{if (number of dimensions that are moved and not complemented) is odd,} \\
\max\left(\left(2\left(K - \frac{K}{2^{n_e - n}}\right) + \sigma_r \frac{K}{2^{n_e - n + 1}}\right) t_c,\ (2(n_e - n) + \sigma_r)\tau\right), & \text{otherwise,}
\end{cases}
\]

\[
T^{lb}_{gsh}(n_e, \sigma_r, n, K) =
\begin{cases}
\max\left(\left(2\frac{K}{n_e - n} + \frac{K}{2^{n_e - n + 1}}\right) t_c,\ (2(n_e - n) + \sigma_r - 1)\tau\right), & \text{if (number of dimensions that are moved and not complemented) is odd,} \\
\max\left(\left(2\frac{K}{n_e - n} + \frac{K}{2^{n_e - n + 1}}\right) t_c,\ (2(n_e - n) + \sigma_r)\tau\right), & \text{otherwise.}
\end{cases}
\]

The proof follows from Lemma 2 and the lower bounds for one-to-all personalized communication [6].
4 Algorithms
A left cyclic shift of the set $J = \{\alpha_{\sigma-1}, \alpha_{\sigma-2}, \ldots, \alpha_0\}$ can be achieved in $\sigma - 1$ exchanges by: 1) exchanging adjacent pairs of dimensions starting with any pair and progressing to the right cyclically, terminating with the pair immediately to the left of the starting pair, 2) exchanging an arbitrary fixed dimension with successively higher dimensions modulo $\sigma$. The left cyclic shift can also be accomplished in $\sigma + 1$ exchanges by augmenting the index set with one dimension $v$, and using it as a fixed exchange dimension: $v \leftrightarrow \alpha_i,\ v \leftrightarrow \alpha_{(i+1) \bmod \sigma},\ \ldots,\ v \leftrightarrow \alpha_{(i+\sigma-1) \bmod \sigma},\ v \leftrightarrow \alpha_{(i+\sigma) \bmod \sigma}$. The first sequence uses all dimensions twice except for the first and last dimension in the sequence. The second sequence uses every dimension once, except the fixed exchange dimension. The last sequence uses all dimensions once, except the starting dimension which is used twice, and the fixed exchange dimension. These three exchange sequences are the basis for our Boolean cube algorithms. A shuffle permutation with bit-complementation can be accomplished by complementing the appropriate bits in the exchange operation.
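The three exchange sequences can be checked on a small model. The sketch below is an illustration only: it represents the dimensions as positions of a Python list (position $i$ holds a label for what currently occupies dimension $\alpha_i$) and a dimension exchange as a swap of two positions. In this reading the first sequence consists of $\sigma - 1$ adjacent swaps beginning at the pair (`start`, `start - 1`); all function names are assumptions of the sketch.

```python
def left_cyclic_shift(d):
    """Reference result: position i receives what was at position i - 1."""
    s = len(d)
    return [d[(i - 1) % s] for i in range(s)]


def by_adjacent_pairs(d, start):
    """Sequence 1: s - 1 swaps of adjacent pairs, moving right cyclically."""
    d, s = list(d), len(d)
    for k in range(s - 1):
        hi, lo = (start - k) % s, (start - k - 1) % s
        d[hi], d[lo] = d[lo], d[hi]
    return d


def by_fixed_dimension(d, fixed):
    """Sequence 2: swap one fixed position with successively higher ones."""
    d, s = list(d), len(d)
    for k in range(1, s):
        other = (fixed + k) % s
        d[fixed], d[other] = d[other], d[fixed]
    return d


def by_extra_dimension(d, start):
    """Sequence 3: s + 1 swaps against an extra position v (index s)."""
    s = len(d)
    ext = list(d) + ["v"]
    for k in range(s + 1):
        pos = (start + k) % s
        ext[s], ext[pos] = ext[pos], ext[s]
    assert ext[s] == "v"          # the extra dimension is restored
    return ext[:s]
```

For any starting position all three functions return the same list as `left_cyclic_shift`, using $\sigma - 1$, $\sigma - 1$ and $\sigma + 1$ swaps respectively.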
The essential ideas used to achieve concurrency in communication are: 1) pipelining, 2) starting the cyclic shifts in several dimensions, 3) factoring of cycles into several cycles. Independent concurrent exchange sequences are created by partitioning the local data set, and defining one sequence per partition. By properly choosing the sequences a uniform load on the communication system is achieved. Dimension permutations can also be performed as recursive matrix transpositions [4, 13, 14], or by performing all-to-all personalized communication twice [6, 14].

The main focus below is on algorithms for mixed, stable dimension permutations. We first consider the case with a single mixed GSH, then consider shuffle permutations that can be factored into several mixed GSH. We conclude by considering two algorithms for real GSH.
4.1 Mixed, generalized shuffle algorithms
4.1.1 A single GSH
A mixed GSH with an index set $J$ consisting of a single block of processor dimensions followed by one virtual dimension can be performed in $\sigma_r$ exchanges by using $\alpha_0$ as the fixed exchange dimension, where the GSH is defined by $(\alpha_{\sigma_r} \alpha_{\sigma_r - 1} \ldots \alpha_1 \mid \alpha_0)$. If the index set $J$ consists of one block of processor dimensions and several storage dimensions, then the GSH is factored into two cycles: one on the block of processor dimensions and the memory dimension immediately to the right of it, and one on all memory dimensions. For instance, a cyclic shift on the set $(\alpha_{\sigma-1} \alpha_{\sigma-2} \ldots \alpha_{\sigma-\sigma_r} \mid \alpha_{\sigma-\sigma_r-1} \alpha_{\sigma-\sigma_r-2} \ldots \alpha_0)$ is factored into a cyclic shift on $(\alpha_{\sigma-1} \alpha_{\sigma-2} \ldots \alpha_{\sigma-\sigma_r} \mid \alpha_{\sigma-\sigma_r-1})$ followed by a cyclic shift on $(\alpha_{\sigma-\sigma_r-1} \alpha_{\sigma-\sigma_r-2} \ldots \alpha_0)$. The second GSH is virtual. For n-port communication we will first describe a pipelined algorithm in which all data items start their exchange sequence in the first processor dimension, then an algorithm for which some of the exchange sequences are initiated in processor dimensions other than the first.
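The factoring into two cycles can be verified on a list model in which position $i$ stands for dimension $\alpha_i$. This is a minimal sketch; `shift_on` and `factored_shift` are illustrative names, and the order of the two partial shifts follows the text (the mixed cycle first, then the virtual cycle).

```python
def shift_on(d, positions):
    """Left cyclic shift restricted to `positions`, listed from low to high.

    Each listed position receives the content of the previous listed one;
    the lowest receives the content of the highest.
    """
    out = list(d)
    for k, pos in enumerate(positions):
        out[pos] = d[positions[k - 1]]
    return out


def factored_shift(d, sigma_r):
    """Mixed cycle on the processor block plus one storage dimension,
    followed by a virtual cycle on all storage dimensions."""
    sigma = len(d)
    top_storage = sigma - sigma_r - 1          # highest storage dimension
    mixed = shift_on(d, list(range(top_storage, sigma)))
    return shift_on(mixed, list(range(0, top_storage + 1)))
```

For $\sigma = 7$ and $\sigma_r = 3$, `factored_shift(list(range(7)), 3)` equals the full shift `shift_on(list(range(7)), list(range(7)))`. Reversing the order of the two partial shifts does not give the same result.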
Pipelining
The mixed GSH with a single local storage dimension and $K \geq 2$ elements per processor requires that $\frac{K}{2}$ elements be exchanged in each dimension. For n-port communication the data transfers can be pipelined. The number of element transfers in sequence is $\frac{K}{2} + \sigma_r - 1$. Figure 2 shows the scheduling of local memory locations paired with respect to the exchange dimension. The table entries indicate the processor dimension in which one element in a local pair is subject to exchange. Figure 3 shows an example of memory locations exchanging data in the case of eight processors. Two memory locations marked by the same symbol, for instance X0, are exchanged with each other. For more than two local memory locations the exchange pattern is repeated for all local pairs. The exchange algorithm for two local memory locations is as follows:
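for $i := 1$ to $\sigma_r$ do
    forall $(\alpha_{\sigma_r} \alpha_{\sigma_r - 1} \ldots \alpha_1 \mid \alpha_0)$ do
        if $\alpha_0 \oplus \alpha_i = 1$ then
            $(\alpha_{\sigma_r} \alpha_{\sigma_r - 1} \ldots \alpha_{i+1} \alpha_i \alpha_{i-1} \ldots \alpha_1 \mid \alpha_0) \to (\alpha_{\sigma_r} \alpha_{\sigma_r - 1} \ldots \alpha_{i+1} \bar{\alpha}_i \alpha_{i-1} \ldots \alpha_1 \mid \bar{\alpha}_0)$
        endif
    endforall
endfor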
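The exchange loop can be simulated sequentially to check the final allocation of Figure 3. The sketch below assumes the machine address $(\alpha_{\sigma_r} \ldots \alpha_1 \mid \alpha_0)$ is stored as an integer with the local storage bit as bit 0, and models the whole machine as one Python list indexed by address; the names are illustrative.

```python
def exchange_step(memory, i):
    """One pass of the forall loop for processor dimension i.

    An element whose address has alpha_0 != alpha_i moves to the address
    with both bits complemented, which swaps the two bit values.
    """
    new = list(memory)
    for a in range(len(memory)):
        if ((a >> i) ^ a) & 1:                 # alpha_i XOR alpha_0 == 1
            new[a ^ ((1 << i) | 1)] = memory[a]
    return new


def mixed_gsh(memory, sigma_r):
    """Exchange dimension 0 with processor dimensions 1 .. sigma_r in turn."""
    for i in range(1, sigma_r + 1):
        memory = exchange_step(memory, i)
    return memory


# Eight processors with two elements each, consecutive allocation:
# processor p initially holds elements 2p and 2p + 1.
final = mixed_gsh(list(range(16)), 3)
```

After the three steps processor `p` holds elements `p` and `p + 8` in its two local locations (`final[2 * p]` and `final[2 * p + 1]`), which is the cyclic allocation shown as the final allocation in Figure 3.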
Memory pair    Exchange dimension in successive time steps (each row starts one step after the previous row)
d_0            1  2  3
d_1               1  2  3
d_2                  1  2  3
d_3                     1  2  3

Figure 2: Exchange dimensions for a generalized shuffle permutation through pipelining.
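                     P0   P1   P2   P3   P4   P5   P6   P7
Initial allocation
  location 0          0    2    4    6    8   10   12   14
  location 1          1    3    5    7    9   11   13   15
Step 1
  location 0          .   X0    .   X1    .   X2    .   X3
  location 1         X0    .   X1    .   X2    .   X3    .
Step 2
  location 0          .    .   X0   X1    .    .   X2   X3
  location 1         X0   X1    .    .   X2   X3    .    .
Step 3
  location 0          .    .    .    .   X0   X1   X2   X3
  location 1         X0   X1   X2   X3    .    .    .    .
Final allocation
  location 0          0    1    2    3    4    5    6    7
  location 1          8    9   10   11   12   13   14   15

Figure 3: Data exchanges for a mixed generalized shuffle permutation of real order three.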
A local memory reordering before and after the exchange sequence allows the sending and receiving local addresses in an exchange to be the same. Indirect addressing can be avoided for all data interchanges. Half of the processors use the lower address in a local pair, half of the processors the upper address. The initial and final reordering, and the pairs of memory locations involved in the exchanges, are illustrated in Figure 4. The reordering and exchange algorithm is defined by
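Align
forall $(\alpha_{\sigma_r} \alpha_{\sigma_r - 1} \ldots \alpha_1 \mid \alpha_0)$ do
    if $\alpha_{\sigma_r} \oplus \alpha_{\sigma_r - 1} \oplus \cdots \oplus \alpha_1 = 1$ then
        $(\alpha_{\sigma_r} \alpha_{\sigma_r - 1} \ldots \alpha_1 \mid \alpha_0) \to (\alpha_{\sigma_r} \alpha_{\sigma_r - 1} \ldots \alpha_1 \mid \bar{\alpha}_0)$
    endif
endforall

Permute
for $i := 1$ to $\sigma_r$ do
    forall $(\alpha_{\sigma_r} \alpha_{\sigma_r - 1} \ldots \alpha_1 \mid \alpha_0)$ do
        if $\alpha_{\sigma_r} \oplus \alpha_{\sigma_r - 1} \oplus \cdots \oplus \alpha_{i+1} \oplus \alpha_{i-1} \oplus \cdots \oplus \alpha_1 \oplus \alpha_0 = 1$ then
            $(\alpha_{\sigma_r} \alpha_{\sigma_r - 1} \ldots \alpha_{i+1} \alpha_i \alpha_{i-1} \ldots \alpha_1 \mid \alpha_0) \to (\alpha_{\sigma_r} \alpha_{\sigma_r - 1} \ldots \alpha_{i+1} \bar{\alpha}_i \alpha_{i-1} \ldots \alpha_1 \mid \alpha_0)$
        endif
    endforall
endfor

Realign
forall $(\alpha_{\sigma_r} \alpha_{\sigma_r - 1} \ldots \alpha_1 \mid \alpha_0)$ do
    if $\alpha_{\sigma_r} \oplus \alpha_{\sigma_r - 1} \oplus \cdots \oplus \alpha_1 = 1$ then
        $(\alpha_{\sigma_r} \alpha_{\sigma_r - 1} \ldots \alpha_1 \mid \alpha_0) \to (\alpha_{\sigma_r} \alpha_{\sigma_r - 1} \ldots \alpha_1 \mid \bar{\alpha}_0)$
    endif
endforall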
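The correctness follows from the following consideration.

Step 1:
$(\alpha_{\sigma_r} \alpha_{\sigma_r - 1} \ldots \alpha_1 \mid \alpha_0) \to (\alpha_{\sigma_r} \alpha_{\sigma_r - 1} \ldots \alpha_1 \mid \alpha_{\sigma_r} \oplus \alpha_{\sigma_r - 1} \oplus \cdots \oplus \alpha_1 \oplus \alpha_0)$.

Steps $2 \leq j \leq \sigma_r + 1$:
$(\alpha_{\sigma_r} \alpha_{\sigma_r - 1} \ldots \alpha_1 \mid \alpha_0) \to (\alpha_{\sigma_r} \alpha_{\sigma_r - 1} \ldots \alpha_j \{\alpha_{\sigma_r} \oplus \alpha_{\sigma_r - 1} \oplus \cdots \oplus \alpha_1 \oplus \alpha_0\} \alpha_{j-2} \ldots \alpha_1 \mid \alpha_0)$.

Step $\sigma_r + 2$:
$(\alpha_{\sigma_r} \alpha_{\sigma_r - 1} \ldots \alpha_1 \mid \alpha_0) \to (\alpha_{\sigma_r} \alpha_{\sigma_r - 1} \ldots \alpha_1 \mid \alpha_{\sigma_r} \oplus \alpha_{\sigma_r - 1} \oplus \cdots \oplus \alpha_1 \oplus \alpha_0)$.

Tracing the steps yields

\[
\begin{array}{rl}
(\alpha_{\sigma_r} \alpha_{\sigma_r - 1} \ldots \alpha_1 \mid \alpha_0)
\to & (\alpha_{\sigma_r} \alpha_{\sigma_r - 1} \ldots \alpha_1 \mid \alpha_{\sigma_r} \oplus \alpha_{\sigma_r - 1} \oplus \cdots \oplus \alpha_1 \oplus \alpha_0) \\
\to & (\alpha_{\sigma_r} \alpha_{\sigma_r - 1} \ldots \alpha_2 \alpha_0 \mid \alpha_{\sigma_r} \oplus \alpha_{\sigma_r - 1} \oplus \cdots \oplus \alpha_1 \oplus \alpha_0) \\
\to & (\alpha_{\sigma_r} \alpha_{\sigma_r - 1} \ldots \alpha_3 \alpha_1 \alpha_0 \mid \alpha_{\sigma_r} \oplus \alpha_{\sigma_r - 1} \oplus \cdots \oplus \alpha_1 \oplus \alpha_0) \\
& \vdots \\
\to & (\alpha_{\sigma_r} \alpha_{\sigma_r - 2} \ldots \alpha_2 \alpha_1 \alpha_0 \mid \alpha_{\sigma_r} \oplus \alpha_{\sigma_r - 1} \oplus \cdots \oplus \alpha_1 \oplus \alpha_0) \\
\to & (\alpha_{\sigma_r - 1} \alpha_{\sigma_r - 2} \ldots \alpha_2 \alpha_1 \alpha_0 \mid \alpha_{\sigma_r} \oplus \alpha_{\sigma_r - 1} \oplus \cdots \oplus \alpha_1 \oplus \alpha_0) \\
\to & (\alpha_{\sigma_r - 1} \alpha_{\sigma_r - 2} \ldots \alpha_2 \alpha_1 \alpha_0 \mid \alpha_{\sigma_r}).
\end{array}
\]

Note that precisely one local storage location in a pair defined by the exchange dimension is subject to an exchange in any step. Moreover, after the first exchange even and odd data are separated into two distinct subcubes identified by the first exchange dimension.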
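The align, permute and realign phases can be simulated in the same way as the plain exchange loop. This is a sketch under the assumption that the machine address is an integer with the local storage bit as bit 0 and that the machine is modelled as one list indexed by address; the function names are illustrative. The point of interest is that `permute_step` changes only a processor bit, so both partners of an exchange use the same local address.

```python
def parity(x: int) -> int:
    """XOR of all bits of the non-negative integer x."""
    return bin(x).count("1") & 1


def align(memory):
    """Swap the local pair in every processor whose address has odd parity.

    The realign phase applies exactly the same operation.
    """
    new = list(memory)
    for a in range(len(memory)):
        if parity(a >> 1):                  # parity of the processor bits
            new[a ^ 1] = memory[a]
    return new


def permute_step(memory, i):
    """Send across processor dimension i when the XOR of all other bits is 1."""
    new = list(memory)
    for a in range(len(memory)):
        if parity(a & ~(1 << i)):           # all bits except alpha_i
            new[a ^ (1 << i)] = memory[a]   # local address unchanged
    return new


def aligned_gsh(memory, sigma_r):
    memory = align(memory)
    for i in range(1, sigma_r + 1):
        memory = permute_step(memory, i)
    return align(memory)                    # realign
```

For eight processors with two elements each, `aligned_gsh(list(range(16)), 3)` leaves processor `p` holding elements `p` and `p + 8`, the same result as the exchange algorithm without alignment.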
Pipelining with arbitrary starting dimension

With multiple local elements it is possible to initiate exchange sequences in different dimensions. All such sequences can be executed concurrently with n-port communication. For an arbitrary real starting dimension $\sigma_r + 1$ exchanges are required. The fixed local exchange dimension, $v$, cannot be $\alpha_0$. The local memory is partitioned into sets of four elements identified by $v$ and $\alpha_0$. The pairs in a set of four elements are denoted $d_j^{\alpha_0 = 0}$ and $d_j^{\alpha_0 = 1}$, $j \geq 0$, where $d_j^{\alpha_0 = 0}$ and $d_j^{\alpha_0 = 1}$ identify pairs. The set $j$ starts the exchange in dimension $\alpha_{\sigma_r - j}$. A set that starts the exchange with real dimension $\alpha_1$ does not need to use dimension $v$ for the exchange, and grouping into four elements is unnecessary. The grouping into sets of four elements is due to the exchange between dimensions $\alpha_0$ and $v$. This exchange need not be performed explicitly, but enforces a synchronization between exchange sequences. Exchanges subsequent to the one in which the exchange should have taken place can simply use $\alpha_0$ instead of $v$ as the fixed exchange dimension. The maximum number of local elements that can be handled concurrently using sets of four elements is $2(\sigma_r - 1)$ for $\sigma_r$ odd, and $2(\sigma_r - 2)$ for $\sigma_r$ even. For sufficiently many local elements we combine the concurrent exchange scheme with pipelining, Figure 5.

                     P0   P1   P2   P3   P4   P5   P6   P7
Initial reordering
  location 0          .   X0   X1    .   X2    .    .   X3
  location 1          .   X0   X1    .   X2    .    .   X3
Step 1
  location 0          .    .   X1   X1   X2   X2    .    .
  location 1         X0   X0    .    .    .    .   X3   X3
Step 2
  location 0          .   X1    .   X1   X2    .   X2    .
  location 1         X0    .   X0    .    .   X3    .   X3
Step 3
  location 0          .   X1   X2    .    .   X1   X2    .
  location 1         X0    .    .   X3   X0    .    .   X3
Final reordering
  location 0          .   X0   X1    .   X2    .    .   X3
  location 1          .   X0   X1    .   X2    .    .   X3

Figure 4: Alignment/realignment and data exchanges for a mixed generalized shuffle permutation of real order three.
In Figure 5 the first eight memory locations have been scheduled to start in a dimension other than the first, while the following 10 locations start the exchange in the first dimension. A bullet in the table illustrates a synchronization cycle for an exchange with the next row. A synchronization cycle must succeed the exchange $v \leftrightarrow \alpha_{\sigma_r}$ and precede the exchange $\alpha_0 \leftrightarrow \alpha_1$, but can otherwise take place at an arbitrary time. Row $d_2$ through row $d_6$ are subject to pipelined exchanges starting in the first processor dimension. The unlabeled rows in the table are included to illustrate what the total time would have been if all exchanges had started in the first processor dimension. Then, rows $d_2$ through the end of the table would apply. Conceptually, one can view the schedule represented by rows $d_0^0$ through $d_6$ as being constructed by removing the bottom four unlabeled rows, and instead scheduling these rows as defined by rows $d_0^0$ through $d_1^1$. Moving the bottom two rows above row $d_2$ saves two communications. Moving an additional two rows again saves two exchanges in the bottom part. But the net savings is only one exchange due to the length of the exchange sequence in the top part.

Memory row               Exchange pairs, in time order (• marks a synchronization cycle)
$d_0^0$                  v,5  v,6  •  0,1  0,2  0,3  0,4  0,5
$d_0^1$                  v,5  v,6  0,1  0,2  0,3  0,4  0,5
$d_1^0$                  v,3  v,4  v,5  v,6  •  0,1  0,2  0,3
$d_1^1$                  v,3  v,4  v,5  v,6  0,1  0,2  0,3
$d_2$ through $d_6$      0,1  0,2  0,3  0,4  0,5  0,6  (one row each)
four unlabeled rows      0,1  0,2  0,3  0,4  0,5  0,6  (one row each)

Figure 5: The exchange sequences for $J = \{\alpha_6 \alpha_5 \alpha_4 \alpha_3 \alpha_2 \alpha_1 \mid \alpha_0\}$ and $K = 18$.
The communication complexity for a single mixed GSH

The pipelined algorithm with all elements starting their exchange in the first processor dimension requires $\frac{K}{2} + \sigma_r - 1$ element exchanges in sequence for n-port communication. Starting some sequences in dimensions other than the first yields $\sigma_r + 3$ exchanges, if $K \leq 2(\sigma_r + 1)$. For $K \geq 2(\sigma_r + 1)$ the number of element exchanges in sequence is $\frac{K}{2} + 2$. The minimum number of start-ups is $\sigma_r + 3$ for the second type of algorithm. The block size $B$ for this number of start-ups is $\lceil \frac{K}{2(\sigma_r + 1)} \rceil$.
Theorem 1 A mixed GSH of real order $\sigma_r = \sigma - 1$ and $K$ elements per processor requires at most

\[
T =
\begin{cases}
\frac{K}{2} + \sigma_r - 1, & \sigma_r \leq 3 \text{ or } K \leq 8, \\
\sigma_r + 3, & 8 \leq K \leq 2(\sigma_r + 1), \\
\frac{K}{2} + 2, & K \geq 2(\sigma_r + 1),\ \sigma_r \geq 3
\end{cases}
\]

element exchanges in sequence. The minimum number of start-ups is at least $\sigma_r$, and at most $\sigma_r + 3$.
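The case analysis of Theorem 1 can be written as a small function. This is a sketch assuming the first case applies when either $\sigma_r \leq 3$ or $K \leq 8$, as the discussion following the theorem suggests, and assuming an even $K$; the function name is illustrative.

```python
def exchanges_in_sequence(K: int, sigma_r: int) -> int:
    """Upper bound of Theorem 1 on element exchanges in sequence (n-port)."""
    if sigma_r <= 3 or K <= 8:
        # Pure pipelining from the first processor dimension.
        return K // 2 + sigma_r - 1
    if K <= 2 * (sigma_r + 1):
        # Concurrent exchange sequences started in several dimensions.
        return sigma_r + 3
    # Concurrent exchange combined with pipelining.
    return K // 2 + 2
```

For the example of Figure 5 ($K = 18$, $\sigma_r = 6$) the function returns 11, compared with $\frac{K}{2} + \sigma_r - 1 = 14$ exchanges for pipelining alone.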
For $K \leq 8$, or $\sigma_r \leq 3$, all elements should have their first exchange in the first processor dimension of the GSH. If $K > 8$, but less than $2(\sigma_r + 1)$, then some elements should start their exchange sequence in a dimension other than the first processor dimension in the GSH. The time complexity is determined by elements that start their exchange in a dimension other than $\alpha_1$. The partitioning of the $K/\sigma_r$ space with respect to scheduling algorithms is illustrated in Figure 6. The communication complexities are summarized in Table 1. The expressions in the table include the effects of finite sizes of the communication packets (buffers) $B$, and a start-up time $\tau$ for each such packet. The overhead and the transmission times are assumed additive.
Remarks:
• The same local memory dimension can be used for different concurrent exchange sequences, but memory locations must be distinct. Four memory locations are needed per sequence, if the starting dimension is different from the first in the cycle.
• The alignment is made in the fixed exchange dimension, and is controlled by the parity of all processor dimensions in the shuffle, except the fixed exchange dimension. Hence, the alignment is made on the local memory dimension in the shuffle, if the exchange sequence starts in the first real dimension in the shuffle. Otherwise, the alignment is made on the extra memory dimension used for the exchanges.
• The alignment and exchanges are controlled entirely by the dimensions in the shuffle.
• The local storage is partitioned into two blocks with respect to exchange schedules: one for exchange sequences starting in a dimension other than the first real dimension, and consisting of $2(\sigma_r - 1)$ locations for $\sigma_r$ odd, or consisting of $2(\sigma_r - 2)$ locations for $\sigma_r$ even, and one block for exchange sequences starting in the first real dimension and consisting of the remainder of the local storage. The first block is aligned on dimension $v$, and the second on dimension $\alpha_0$.

Figure 6: The partitioning of the $K/\sigma_r$ space. (Horizontal axis: $K$, 0 to 60; vertical axis: $\sigma_r$, 0 to 20; regions: pipelining, concurrent exchange, concurrent exchange with pipelining.)

Comm. model       Algorithm                     Time
one-port comm.                                  $\sigma_r \frac{K}{2} t_c + \sigma_r \lceil \frac{K}{2B} \rceil \tau$
n-port comm.      pipelining                    $(\frac{K}{2} + (\sigma_r - 1)B) t_c + (\lceil \frac{K}{2B} \rceil + \sigma_r - 1) \tau$
n-port comm.      concurrent exchange           $(\sigma_r + 3)(B t_c + \tau)$
n-port comm.      conc. exch. with pipel.       $(\frac{K}{2} + 2B) t_c + (\lceil \frac{K}{2B} \rceil + 2) \tau$

Table 1: The communication complexity for a single mixed GSH.
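The expressions of Table 1 can be evaluated numerically to compare the schedules for given machine parameters. The following is a minimal sketch; it assumes the start-up time per packet is the parameter `tau` and the transfer time per element is `t_c`, and the function names are illustrative rather than taken from the paper.

```python
import math


def one_port_time(K, sigma_r, B, t_c, tau):
    """One-port communication: sigma_r exchanges of K/2 elements each."""
    return sigma_r * (K / 2) * t_c + sigma_r * math.ceil(K / (2 * B)) * tau


def n_port_pipelined_time(K, sigma_r, B, t_c, tau):
    """n-port, all sequences start in the first processor dimension."""
    packets = math.ceil(K / (2 * B))
    return (K / 2 + (sigma_r - 1) * B) * t_c + (packets + sigma_r - 1) * tau


def n_port_concurrent_time(sigma_r, B, t_c, tau):
    """n-port, sequences started concurrently in several dimensions."""
    return (sigma_r + 3) * (B * t_c + tau)


def n_port_concurrent_pipelined_time(K, B, t_c, tau):
    """n-port, concurrent exchange combined with pipelining."""
    return (K / 2 + 2 * B) * t_c + (math.ceil(K / (2 * B)) + 2) * tau
```

With `K = 16`, `sigma_r = 5`, `B = 2`, `t_c = 1` and `tau = 10`, the one-port expression evaluates to 240, pipelining to 96, and concurrent exchange with pipelining to 72 time units.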
4.1.2 Multiple GSH
In general, the index set $J$ for a GSH can be factored into a number of mixed GSH, each for a block of unique processor dimensions and one unique storage dimension, and one virtual GSH, as illustrated in Figure 7. We refer to the mixed GSH's in the factored GSH as constituting GSH. The number of such GSH is $\beta$. All constituting, mixed GSH can be performed concurrently. For each location in local storage the exchange dimensions for different constituting GSH can be interleaved in any order. Only the order within each GSH is fixed. The techniques for scheduling of exchanges described for a single mixed GSH can be applied to each constituting GSH. The scheduling algorithms presented below are optimal within two element exchanges.
The set of exchanges defined by one dimension exchange for all constituting GSH consists of a number of independent 2-cycles. This permutation is equivalent to matrix transposition, or bit-reversal. We refer to it as all-to-all personalized communication [6]. Each processor holds a unique piece of data for every other processor. Any algorithm for all-to-all personalized communication (AAPC) can be used repeatedly to accomplish the required permutation. If all constituting GSH are of the same order, then the use of any AAPC algorithm is straightforward. In [8] AAPC algorithms are presented that allow for a pipeline delay of $\beta$ cycles between each new AAPC application for $\beta$ mixed GSH.

[Figure 7: Factoring of a cycle into independent cycles. The diagram shows blocks of real (r) and storage (s) dimensions indexed $\beta - 1, \beta - 2, \ldots, 0$.]

The difference between the AAPC based algorithms and the concurrent/pipelined algorithms is in the scheduling of exchanges for the local storage, or in the ordering of the loops. This difference results in a difference in the pipeline filling time, which is slightly longer for the AAPC based schedules. The alignment and realignment are the same for both types of algorithms.
Alignment and realignment
Let the constituting GSH be $(\xi^j_{\sigma^j_r} \xi^j_{\sigma^j_r - 1} \ldots \xi^j_1 \,|\, \xi^j_0)$ for $0 \le j < \beta$, where $\sigma^j_r$ is the real order of the $j$th constituting GSH. Furthermore, let $p_j = \xi^j_{\sigma^j_r} \xi^j_{\sigma^j_r - 1} \ldots \xi^j_1$ and $x_j = \xi^j_{\sigma^j_r} \oplus \xi^j_{\sigma^j_r - 1} \oplus \cdots \oplus \xi^j_1 \oplus \xi^j_0$. Then, the alignment operation is the local storage reordering defined by
\[ (p_{\beta-1} p_{\beta-2} \ldots p_0 \,|\, \xi^{\beta-1}_0 \xi^{\beta-2}_0 \ldots \xi^0_0) \rightarrow (p_{\beta-1} p_{\beta-2} \ldots p_0 \,|\, x_{\beta-1} x_{\beta-2} \ldots x_0). \]
The realignment is the exact same operation.
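The two properties claimed for the alignment can be checked on a small case. The sketch below is an illustration only: it assumes the alignment replaces the storage bit by the parity (exclusive-or) of the processor bits of one constituting GSH and that storage bit, and it models an exchange as a swap of the storage bit with one processor bit. The names `aligned` and `exchange` are hypothetical.

```python
from itertools import product

def aligned(p, s):
    """Aligned local address bit: parity of the processor bits p and storage bit s."""
    x = s
    for bit in p:
        x ^= bit
    return x

def exchange(p, s, i):
    """Swap storage bit s with processor bit p[i]; returns the new (p, s)."""
    q = list(p)
    q[i], s = s, q[i]
    return tuple(q), s

order = 3  # assumed real order of the constituting GSH for this check
for p in product((0, 1), repeat=order):
    for s in (0, 1):
        # Realignment is the same operation: applying it twice restores s.
        assert aligned(p, aligned(p, s)) == s
        for i in range(order):
            q, t = exchange(p, s, i)
            # Sender and receiver see the same aligned local address.
            assert aligned(p, s) == aligned(q, t)
```

An element that actually moves has its storage bit and the exchanged processor bit both complemented, so the parity is unchanged; that is why the aligned address agrees on both sides of the exchange.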
If an extra storage dimension $v$ is used for the $j$th constituting GSH, then $x^j_v = \xi^j_{\sigma^j_r} \oplus \xi^j_{\sigma^j_r - 1} \oplus \cdots \oplus \xi^j_1 \oplus v$. Moreover, if $\xi^j_0$ is used for exchanges for some of the storage locations, and $v$ for others, then the same alignment can be performed on both local dimensions, $(p_j \,|\, \xi^j_0 v) \rightarrow (p_j \,|\, x_j x^j_v)$. This alignment guarantees that for both types of exchanges the sending and receiving processors use the same local storage address.
Lemma 3 The alignment for different constituting GSH can be made on the same exchange dimension $v$ preserving the property that exchanges are always made between locations with the same local address.
The lemma follows from the fact that the exchanges for each constituting GSH take place within subcubes for which all other processor address bits are the same. However, by performing the alignment for different GSH on the same storage dimension the local addresses are not the same in all subcubes.

[Figure 8: The exchange sequences for $J = \{\xi_5 \xi_4 \xi_3 \xi_2 \xi_1 \,|\, \xi_0\}$, $J' = \{\xi'_5 \xi'_4 \xi'_3 \xi'_2 \xi'_1 \,|\, \xi'_0\}$, and $K = 24$ using pipelining. Rows: memory locations $d_0, \ldots, d_{11}$. Columns: time steps. Locations $d_0$ to $d_4$ perform the exchanges $(0,1), (0,2), \ldots, (0,5)$ followed by $(0',1'), (0',2'), \ldots, (0',5')$; locations $d_5$ to $d_{11}$ perform the primed exchanges first.]
Pairing of local memory locations
In the exchanges in any dimension only half of the local data is exchanged. For a single GSH one local memory bit ($v$ or $\xi_0$) is used to control the exchange, and either all the local data with the bit set or all the local data with the bit not set is exchanged. It is convenient to view storage locations in pairs, where one location or the other in a pair is exchanged. When more than one local storage dimension is used to control the exchanges, the pairing of locations shall be made with respect to all such dimensions, i.e., the pairing shall be made on $v$, $\xi^0_0$, $\xi^1_0$, $\ldots$, and $\xi^{\beta-1}_0$. For locations with a first exchange for each constituting GSH in its first processor dimension, $v$ need not be included in the alignment and the pairing. The values of the local storage dimensions in a pair are complements of each other. For instance, for three control dimensions the pairs are (000, 111), (001, 110), (010, 101), and (011, 100). If there are additional local storage dimensions the pairing is simply repeated as many times as necessary.
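The pairing rule can be sketched in a few lines, assuming the control dimensions are the low-order bits of the local address; the function name `complement_pairs` is hypothetical.

```python
def complement_pairs(n_control):
    """Pair each value of the n_control control bits with its bitwise complement."""
    mask = (1 << n_control) - 1
    # Enumerating the values with the top control bit clear yields each pair once.
    return [(a, a ^ mask) for a in range(1 << (n_control - 1))]

def as_bits(pairs, n_control):
    """Format the pairs as bit strings of width n_control."""
    return [(format(a, f"0{n_control}b"), format(b, f"0{n_control}b")) for a, b in pairs]

pairs3 = as_bits(complement_pairs(3), 3)
# pairs3 == [("000", "111"), ("001", "110"), ("010", "101"), ("011", "100")]
```

For three control dimensions this reproduces the four pairs listed in the text.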
Pipelined Algorithms
Concurrent pipelined single GSH algorithms
The algorithm for a single mixed GSH can be generalized to multiple GSH. With n-port communication and sufficiently many local data elements, the different constituting GSH can be initiated and performed concurrently. Figure 8 illustrates the case where all exchange sequences start in the first processor dimension of each GSH. The number of element exchanges in sequence is $\frac{K}{2} + \frac{\sigma_r}{\beta} - 1$, if all constituting GSH are of the same real order, and $K \ge 2\sigma_r$. If the constituting GSH are of different order, then the number of exchanges is $\frac{K}{2} + \max_j \sigma^j_r - 1$, where $\sum_{j=0}^{\beta-1} \sigma^j_r = \sigma_r$. If $K \le 2\sigma_r$ then the number of exchanges in sequence is $\lceil \frac{K}{2} \rceil + \sigma_r - 1$, whether or not all constituting GSH are of the same order. The local storage is partitioned into blocks of $\lfloor \frac{K}{2\beta} \rfloor$ and $\lceil \frac{K}{2\beta} \rceil$ locations. The data in partition $j$, $0 \le j < \beta$, is permuted according to constituting GSH $j, (j + 1) \bmod \beta, \ldots, (j - 1) \bmod \beta$.
Figure 9 illustrates the case in which some exchange sequences start in a dimension other than the first of any constituting GSH. The number of element exchanges in sequence is $\frac{K}{2} + 2$, if $K \ge 2(\sigma_r + 1)$. For $K \le 2(\sigma_r + 1)$ the number of exchanges is $\sigma_r + 3$. The algorithm does not require that all constituting GSH are of the same order.

[Figure 9: The exchange sequences for $J = \{\xi_6 \xi_5 \xi_4 \xi_3 \xi_2 \xi_1 \,|\, \xi_0\}$, $J' = \{\xi'_6 \xi'_5 \xi'_4 \xi'_3 \xi'_2 \xi'_1 \,|\, \xi'_0\}$, and $K = 28$. Rows: memory locations. Columns: time steps. Some sequences begin with exchanges on the extra dimension $v$, such as $(v,5)$ and $(v,6)$.]
Theorem 2 Any generalized shuffle of real order $\sigma_r > 0$ can be realized in at most
\[
T = \begin{cases}
\lceil \frac{K}{2} \rceil + \sigma_r - 1, & \max_i \sigma^i_r \ge 4,\ K \le 8, \text{ or } \max_i \sigma^i_r \le 3,\ K \le 2\sigma_r, \\
\frac{K}{2} + \max_i \sigma^i_r - 1, & K \ge 2\sigma_r,\ \max_i \sigma^i_r \le 3, \\
\sigma_r + 3, & \max_i \sigma^i_r \ge 4,\ 8 < K \le 2(\sigma_r + 1), \\
\frac{K}{2} + 2, & K \ge 2(\sigma_r + 1),\ \max_i \sigma^i_r > 3
\end{cases}
\]
element exchanges in sequence with n-port communication. The minimum number of communication start-ups is $\sigma_r$ for a block size of $\lceil \frac{K}{2} \rceil$ and $\sigma_r + 3$ for a block size of $\lceil \frac{K}{2(\sigma_r + 1)} \rceil$.
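The case analysis of Theorem 2 can be written as a small function. This is a sketch under stated assumptions: the comparison operators are read as in the case conditions above, $K$ is taken to be even, and where two cases overlap at a boundary the smaller value is returned. The name `theorem2_bound` and the argument `orders` (the list of real orders $\sigma^i_r$ of the constituting GSH) are hypothetical.

```python
def theorem2_bound(K, orders):
    """Upper bound on element exchanges in sequence (n-port), per Theorem 2.

    K      -- number of elements per processor (assumed even, K >= 2)
    orders -- real orders of the constituting GSH; their sum is sigma_r
    """
    sigma_r = sum(orders)
    m = max(orders)
    half = -(-K // 2)  # ceil(K / 2)
    candidates = []
    if (m >= 4 and K <= 8) or (m <= 3 and K <= 2 * sigma_r):
        candidates.append(half + sigma_r - 1)
    if m <= 3 and K >= 2 * sigma_r:
        candidates.append(K // 2 + m - 1)
    if m >= 4 and 8 < K <= 2 * (sigma_r + 1):
        candidates.append(sigma_r + 3)
    if m > 3 and K >= 2 * (sigma_r + 1):
        candidates.append(K // 2 + 2)
    # Boundary values of K can satisfy two cases; either bound is valid there.
    return min(candidates)

# Parameters of Figure 9: one GSH with six processor dimensions, K = 28.
bound_fig9 = theorem2_bound(28, [6])
```

For the parameters of Figure 9 the last case applies, giving $K/2 + 2$ exchanges in sequence.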
All-to-all personalized communication algorithms
Employing an AAPC algorithm with a delay of $\beta$ exchanges between successive AAPCs on different dimensions [8] yields $\frac{K}{2} + \sigma_r - \beta$ exchanges in sequence, if all constituting GSH are of the same order, $\frac{\sigma_r}{\beta}$. As in previous algorithms some storage locations can start their exchange sequences in dimensions other than the first of any constituting GSH, thereby reducing the pipeline delay [9]. For the AAPC algorithms blocks of storage locations are scheduled together. The block sizes depend upon the AAPC algorithm chosen.
T =
8
>
<
>
:
K
2
+ 
r
   
r
 3, or   4;K  8 (pipelining)

r
+ 3   4; 8  K  2(
r
+ ) (concurrent exchange)
K
2
+ 2 
r
 4;K  2(
r
+ ) (concurrent exchange with pipelining)
The use of AAPC algorithms for a succession of AAPC's of different orders is not entirely straightforward. If the orders are non-increasing and the number of dimensions in one AAPC does not divide the number of dimensions in the preceding AAPC, then a delay of up to one less than the number of dimensions in the AAPC to be performed is introduced for some algorithms, while other algorithms may require an even greater delay [9].
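One-port communication
  $\sigma_r \frac{K}{2} t_c + \sigma_r \lceil \frac{K}{2B} \rceil \tau$

n-port communication, constituting GSH of the same order
  Pipelined AAPC:
    $(\lceil \frac{K}{2B} \rceil + \sigma_r - \beta)(B t_c + \tau)$
  Pipelined GSH:
    $(\lceil \frac{K}{2} \rceil + (\sigma_r - 1)B) t_c + (\lceil \frac{K}{2B} \rceil + \sigma_r - 1)\tau$, for $K \le 2B\sigma_r$
    $(\lceil \frac{K}{2} \rceil + (\frac{\sigma_r}{\beta} - 1)B) t_c + (\lceil \frac{K}{2B} \rceil + \frac{\sigma_r}{\beta} - 1)\tau$, for $K \ge 2B\sigma_r$
  Concurrent pipelined GSH:
    $(\lceil \frac{K}{2} \rceil + (\sigma_r - 1)B) t_c + (\lceil \frac{K}{2B} \rceil + \sigma_r - 1)\tau$, for $\sigma_r \ge 4$, $K \le 8B$, or $\sigma_r \le 3$, $K \le 2B\sigma_r$
    $(\lceil \frac{K}{2} \rceil + (\frac{\sigma_r}{\beta} - 1)B) t_c + (\lceil \frac{K}{2B} \rceil + \frac{\sigma_r}{\beta} - 1)\tau$, for $\sigma_r \le 3$, $K \ge 2B\sigma_r$
    $(\sigma_r + 3)(B t_c + \tau)$, for $8B < K \le 2B(\sigma_r + 1)$
    $(\frac{K}{2} + 2B) t_c + (\lceil \frac{K}{2B} \rceil + 2)\tau$, for $K \ge 2B(\sigma_r + 1)$, $\sigma_r > 3$

n-port communication, constituting GSH of different order
  Pipelined GSH:
    $\lceil \frac{K}{2} \rceil + \sigma_r - 1$, for $K \le 2\sigma_r$
    $\frac{K}{2} + \max_i \sigma^i_r - 1$, for $K \ge 2\sigma_r$
  Concurrent pipelined GSH:
    $\lceil \frac{K}{2} \rceil + \sigma_r - 1$, for $\max_i \sigma^i_r \ge 4$, $K \le 8$, or $\max_i \sigma^i_r \le 3$, $K \le 2\sigma_r$
    $\frac{K}{2} + \max_i \sigma^i_r - 1$, for $\max_i \sigma^i_r \le 3$, $K \ge 2\sigma_r$
    $\sigma_r + 3$, for $\max_i \sigma^i_r \ge 4$, $8 < K \le 2(\sigma_r + 1)$
    $\frac{K}{2} + 2$, for $K \ge 2(\sigma_r + 1)$, $\max_i \sigma^i_r > 3$

Table 2: The number of element transfers in sequence for concurrent/pipelined algorithms for multiple mixed GSH.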
Comparison of algorithms
The pipelined GSH algorithm is always preferable to the pipelined AAPC algorithm. Similarly, the strategy of combining concurrent exchange sequences with pipelining always yields a better result for the algorithms that complete each constituting GSH for each pair of data elements before initiating another constituting GSH than the same strategy applied to AAPC based algorithms. The pipeline filling time is longer for the latter algorithms. The complexity estimates are summarized in Table 2. For all constituting GSH of the same order we have included the effects of limited communication buffers and a start-up overhead for each communication action.
4.2 Real shuffle algorithms
A mixed GSH has at least one local storage dimension as part of the permutation. In a real GSH no local storage dimension is part of the permutation. But, if there are at least two data elements per processor, then using an algorithm with the storage dimension as a fixed exchange dimension allows each exchange to involve only one processor dimension, as in the algorithms above for mixed GSH. The real shuffle $(\xi_{\sigma_r - 1} \xi_{\sigma_r - 2} \ldots \xi_0 \,|\, v) \rightarrow (\xi_{\sigma_r - 2} \xi_{\sigma_r - 3} \ldots \xi_0 \xi_{\sigma_r - 1} \,|\, v)$ performed through a sequence of exchanges between $v$ and a processor dimension requires a minimum of $\sigma_r + 1$ exchanges [15], Algorithm A1, Figure 10. The number of element exchanges in sequence is $(\sigma_r + 1)\frac{K}{2}$ for one-port communication. For n-port communication pipelining or concurrent exchange sequences can be used. Pipelining requires $\frac{K}{2} + \max(\frac{K}{2}, \sigma_r)$ element exchanges in sequence, whereas the use of up to $\sigma_r$ concurrent exchange sequences leads to $(\sigma_r + 1)\lceil \frac{K}{2\sigma_r} \rceil$ element exchanges in sequence. The pipelined algorithm always requires a larger number of element transfers in sequence, and more communication start-ups, than the algorithm based on concurrent exchange sequences.

maddr  paddr
       000    001    010    011    100    101    110    111
0      0|000  0|001  0|010  0|011  0|100  0|101  0|110  0|111
1      1|000  1|001  1|010  1|011  1|100  1|101  1|110  1|111
⇓
0      0|000  1|000  0|010  1|010  0|100  1|100  0|110  1|110
1      0|001  1|001  0|011  1|011  0|101  1|101  0|111  1|111
⇓
0      0|000  1|000  0|001  1|001  0|100  1|100  0|101  1|101
1      0|010  1|010  0|011  1|011  0|110  1|110  0|111  1|111
⇓
0      0|000  1|000  0|001  1|001  0|010  1|010  0|011  1|011
1      0|100  1|100  0|101  1|101  0|110  1|110  0|111  1|111
⇓
0      0|000  0|100  0|001  0|101  0|010  0|110  0|011  0|111
1      1|000  1|100  1|001  1|101  1|010  1|110  1|011  1|111
Figure 10: A shuffle permutation of order $\sigma_r$ using $\sigma_r + 1$ communications.
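The exchange sequence of Figure 10 can be replayed in a few lines. The sketch below is a toy model, not the implementation used in the experiments: it assumes two elements per processor, labels each element by its original (memory bit, processor address), and models an exchange as a swap of the storage bit with one processor address bit. The function names are hypothetical.

```python
def exchange(state, d):
    """Swap the storage bit with processor address bit d for every element.

    state maps (memory bit, processor address) -> label of the element held there.
    """
    new_state = {}
    for (m, p), item in state.items():
        bit = (p >> d) & 1
        q = (p & ~(1 << d)) | (m << d)  # processor bit d takes the old memory bit
        new_state[(bit, q)] = item      # memory bit takes the old processor bit
    return new_state

def algorithm_a1(sigma_r):
    """Exchange on dimensions 0, 1, ..., sigma_r - 1, then on dimension 0 again."""
    state = {(m, p): (m, p) for m in (0, 1) for p in range(1 << sigma_r)}
    dims = list(range(sigma_r)) + [0]
    for d in dims:
        state = exchange(state, d)
    return state, len(dims)

def rotate_left(p, sigma_r):
    """Cyclic left shift of a sigma_r-bit processor address by one position."""
    mask = (1 << sigma_r) - 1
    return ((p << 1) & mask) | (p >> (sigma_r - 1))

final, n_exchanges = algorithm_a1(3)
# final[(0, 0b001)] == (0, 0b100): processor 001 now holds the element from 100.
```

Each element ends in the processor whose address is the one-step cyclic left shift of its original address, with its memory bit unchanged, after $\sigma_r + 1$ exchanges, which matches the last panel of Figure 10.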
The algorithm outlined above using an extra local memory dimension for all exchanges is non-optimal by one exchange. For architectures with a high start-up time relative to the data transfer time, it may be desirable to find an algorithm with the optimal number of start-ups. In algorithm A1 half of the data is at its final destination after $\sigma_r$ exchanges. If, in the example of Figure 10, the initial content in location zero in processors four through seven, and in location one in processors zero through three did not matter, then the permutation would be completed in $\sigma_r$ steps. In general, at the expense of doubling the memory requirements, and the data transfer time, the minimal number of communication start-ups can be achieved. The first step of the modified algorithm can be accomplished through the exchange
if 

r
 1
6= 
0
then (

r
 1


r
 2
: : :
0
j0)! (

r
 1


r
 2
: : :
0
j1)
The result is (

r
 1


r
 2
: : :
0
j
0
); 
0
= 

r
 1
with the other processors being empty. During
the communication in dimensions $\xi_1$ through $\xi_{\sigma_r - 2}$, subcubes $(0\,\xi_{\sigma_r - 2} \ldots \xi_1\,1)$ and $(1\,\xi_{\sigma_r - 2} \ldots \xi_1\,0)$ are empty. In the last communication data is sent from subcube $(0\,\xi_{\sigma_r - 2} \ldots \xi_1\,0)$ to subcube $(1\,\xi_{\sigma_r - 2} \ldots \xi_1\,0)$, and from subcube $(1\,\xi_{\sigma_r - 2} \ldots \xi_1\,1)$ to subcube $(0\,\xi_{\sigma_r - 2} \ldots \xi_1\,1)$. $K$ elements are sent in the first and last communication, and exchanged in the $\sigma_r - 2$ other communications. This is algorithm A2.
The case with $\sigma_r = 2$ represents a special case for algorithm A2. Each channel is only used in one direction. Splitting the data set $K$ into two parts allows both minimum length paths to be used. For two equal parts the complexity is $(\frac{K}{2} + B) t_c + (\lceil \frac{K}{2B} \rceil + 1)\tau$. This is algorithm A2$_2$. With an insignificant communication start-up, such as on the Connection Machine, the optimal value of $B$ is one, and only one element exchange in excess of the lower bound is required. The result generalizes to the case where the real shuffle consists of a number of independent cycles of length two.

Comm. | Alg. | $B_{opt}$ | Mem | Communication complexity | $t_c$ factor/l.b. | $\tau$ factor/l.b.
one-port comm. | A1 | $\frac{K}{2}$ | $K$ | $(\sigma_r + 1)\frac{K}{2} t_c + (\sigma_r + 1)\lceil \frac{K}{2B} \rceil \tau$ | (1, 1.5] | (1, 1.5]
one-port comm. | A2 | $K$ | $2K$ | $\sigma_r K t_c + \sigma_r \lceil \frac{K}{B} \rceil \tau$ | 2 | 1
n-port comm. | A1 | $\lceil \frac{K}{2\sigma_r} \rceil$ | $K$ | $(\sigma_r + 1)\lceil \frac{K}{2\sigma_r} \rceil t_c + (\sigma_r + 1)\lceil \frac{K}{2\sigma_r B} \rceil \tau$ | (1, 1.5] | (1, 1.5]
n-port comm. | A2 | $\lceil \frac{K}{\sigma_r} \rceil$ | $2K$ | $K t_c + \sigma_r \lceil \frac{K}{\sigma_r B} \rceil \tau$ | 2 | 1
n-port comm. | A2$_2$ | $\sqrt{\frac{K\tau}{2 t_c}}$ | $K + B$ | $(\frac{K}{2} + B) t_c + (\lceil \frac{K}{2B} \rceil + 1)\tau$ | $1 + \frac{2B}{K}$ | 1
Table 3: The memory requirements and communication complexities for a real GSH of order $\sigma_r$.
The complexity expressions are summarized in Table 3. The break-even point between algorithms A1 and A2 for unbounded buffer sizes $B$ and $\sigma_r > 2$ is $\tau = (\sigma_r - 1)\frac{K}{2} t_c$ for one-port communication, and $\tau = (1 - \frac{1}{\sigma_r})\frac{K}{2} t_c$ for n-port communication.
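The break-even values can be checked directly from the A1 and A2 rows of Table 3. The sketch below assumes an unbounded buffer, so that each communication costs one start-up ($\sigma_r + 1$ for A1 and $\sigma_r$ for A2), and assumes $K$ is divisible by $2\sigma_r$ so that the ceilings drop out. The function names are hypothetical.

```python
from fractions import Fraction

def a1_one_port(K, s, t_c, tau):
    """A1, one-port: (s + 1) * K/2 * t_c + (s + 1) * tau."""
    return (s + 1) * Fraction(K, 2) * t_c + (s + 1) * tau

def a2_one_port(K, s, t_c, tau):
    """A2, one-port: s * K * t_c + s * tau."""
    return s * K * t_c + s * tau

def a1_n_port(K, s, t_c, tau):
    """A1, n-port: (s + 1) * K/(2s) * t_c + (s + 1) * tau."""
    return (s + 1) * Fraction(K, 2 * s) * t_c + (s + 1) * tau

def a2_n_port(K, s, t_c, tau):
    """A2, n-port: K * t_c + s * tau."""
    return K * t_c + s * tau

K, s, t_c = 24, 3, 1
tau_one_port = (s - 1) * Fraction(K, 2) * t_c            # (sigma_r - 1) K/2 t_c
tau_n_port = (1 - Fraction(1, s)) * Fraction(K, 2) * t_c  # (1 - 1/sigma_r) K/2 t_c

# At the break-even start-up time both algorithms cost the same.
assert a1_one_port(K, s, t_c, tau_one_port) == a2_one_port(K, s, t_c, tau_one_port)
assert a1_n_port(K, s, t_c, tau_n_port) == a2_n_port(K, s, t_c, tau_n_port)
```

For start-up times above the break-even value A2 is cheaper, since it saves one start-up at the cost of more data transfer; below it A1 is cheaper.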
5 Experiments
We have implemented the one-port version of algorithm A1 for real GSH on the Intel iPSC/1, and algorithm A2$_2$ on the Connection Machine (in the context of a bit-reversal routine). On the Intel iPSC/1 we also performed shuffle permutations by using the routing logic, by using all-to-all personalized communication twice, and by an algorithm that requires precisely $\sigma_r$ start-ups for a real shuffle of order $\sigma_r$ (algorithm A1 in [5]). Algorithm A2 presented above should behave similarly for small values of $K$, and be superior for large values of $K$, but was not known at the time of the implementation. For a real, sub-cube, shuffle permutation with a message size less than a few hundred bytes, algorithm A2 is the fastest, but for larger message sizes algorithm A1 is preferable, Figure 11. The best time of either algorithm A1 or A2 is a factor of 5–10 lower than that of the router. All one-port algorithms have a complexity that is linear in the number of dimensions for shuffle permutations with a real order equal to the number of cube dimensions. The deviation from the linear dependence exhibited in Figure 11 is due to a hybrid implementation that optimizes the sum of start-up time and the time for local data movement [4].
For the Connection Machine implementation of algorithm A2$_2$ the measured time complexity is $270 + (3.2 + 3.9\beta)K$ $\mu$sec, where $\beta$ is the number of 2-cycles in the real GSH.
[Figure 11: The measured shuffle times as a function of message lengths on an iPSC/1 5-cube. Horizontal axis: $K$ in bytes, $10^1$ to $10^4$. Vertical axis: time in msec, $10^1$ to $10^3$. Curves: AAPC, router, A1, A2.]

6 Summary and conclusions
The algorithms for mixed shuffle permutations are optimal within two element exchanges for concurrent communication on all channels of every processor. The number of communication start-ups for an unlimited buffer size is either exactly optimal, or requires three excess start-ups, depending upon the size of the local data set relative to the number of processor dimensions in the generalized shuffle, and the number of blocks of contiguous processor dimensions. The communication complexity for n-port communication is summarized in Table 2. For one-port communication the minimum number of start-ups is $\sigma_r$, and the minimum number of element transfers in sequence is $\sigma_r \frac{K}{2}$. The mixed shuffle algorithms are also valid when bit-complementation is required, by appropriately modifying which data in a pair is exchanged. We also show that by performing a local alignment before and after the data exchanges for mixed GSH, all data exchanged have the same relative address.
The communication complexities of a real GSH of order $\sigma_r$ are summarized in Table 3. The last two columns contain the values of the data transfer times and start-up times relative to the respective lower bound. With an unbounded buffer size the number of start-ups is either exactly optimal, or suboptimal by one start-up. The number of element transfers in sequence exceeds the lower bound by a factor of $\frac{1}{\sigma_r}$ for the best algorithm, except if $\sigma_r = 2$, or there are a number of 2-cycles. Then, only one element transfer in excess of the lower bound is required. This result is not valid if bit-complementation is required.
Performing a GSH through all-to-all personalized communication [6, 14] is always inferior to the pipelined/concurrent algorithms presented here. Likewise, performing the permutation by recursively applying an optimal matrix transposition algorithm [4] yields higher complexity.
Finally we note that the control is identical for all processors, and can easily be distributed.
Acknowledgement
The authors would like to thank the anonymous referees for several valuable suggestions that helped improve the manuscript, and for bringing important references to our attention.
The research reported here was in part supported by the Office of Naval Research under Contract No. N00014-86-K-0310, by the AFOSR under contract AFOSR-89-0382 and by NSF/DARPA under contract CCR-8908285. The Connection Machine implementation was made by Alan Edelman, Steve Heller and Mark Bromley, and is part of the CM System Software.
References
[1] Peter M. Flanders. A unified approach to a class of data movements on an array processor. IEEE Trans. Computers, 31(9):809–819, September 1982.
[2] Ching-Tien Ho and S. Lennart Johnsson. Optimal algorithms for stable dimension permutations on Boolean cubes. In The Third Conference on Hypercube Concurrent Computers and Applications, pages 725–736. ACM, 1988.
[3] S. Lennart Johnsson. Communication efficient basic linear algebra computations on hypercube architectures. J. Parallel Distributed Computing, 4(2):133–172, April 1987.
[4] S. Lennart Johnsson and Ching-Tien Ho. Matrix transposition on Boolean n-cube configured ensemble architectures. SIAM J. Matrix Anal. Appl., 9(3):419–454, July 1988.
[5] S. Lennart Johnsson and Ching-Tien Ho. Shuffle permutations on Boolean cubes. Technical Report YALEU/DCS/RR-653, Department of Computer Science, Yale University, October 1988.
[6] S. Lennart Johnsson and Ching-Tien Ho. Spanning graphs for optimum broadcasting and personalized communication in hypercubes. IEEE Trans. Computers, 38(9):1249–1268, September 1989.
[7] S. Lennart Johnsson and Ching-Tien Ho. The complexity of reshaping arrays on Boolean cubes. In The Fifth Distributed Memory Computing Conference, pages 370–377. IEEE Computer Society, April 1990.
[8] S. Lennart Johnsson and Ching-Tien Ho. Maximizing channel utilization for all-to-all personalized communication on Boolean cubes. In The Sixth Distributed Memory Computing Conference, pages 299–304. IEEE Computer Society Press, 1991.
[9] S. Lennart Johnsson and Ching-Tien Ho. Optimal communication channel utilization for matrix transposition and related permutations on Boolean cubes. Discrete Applied Mathematics, 1992.
[10] David Nassimi and Sartaj Sahni. An optimal routing algorithm for mesh-connected parallel computers. JACM, 27(1):6–29, January 1980.
[11] David Nassimi and Sartaj Sahni. Optimal BPC permutations on a cube connected SIMD computer. IEEE Trans. Computers, C-31(4):338–341, April 1982.
[12] Yousef Saad and Martin H. Schultz. Topological properties of hypercubes. Technical Report YALEU/DCS/RR-389, Dept. of Computer Science, Yale Univ., New Haven, CT, June 1985.
[13] Quentin F. Stout and Bruce Wagar. Intensive hypercube communication I: prearranged communication in link-bound machines. Technical Report CRL-TR-9-87, Computing Research Lab., Univ. of Michigan, Ann Arbor, MI, 1987.
[14] Quentin F. Stout and Bruce Wagar. Passing messages in link-bound hypercubes. In Michael T. Heath, editor, Hypercube Multiprocessors 1987. Society for Industrial and Applied Mathematics, Philadelphia, PA, 1987.
[15] Paul N. Swarztrauber. Multiprocessor FFTs. Parallel Computing, 5:197–210, 1987.
