BMMC Permutations on a DECmpp 12000/sx 2000 by Bruhl, Kristin
Dartmouth College 
Dartmouth Digital Commons 
Dartmouth College Undergraduate Theses Theses and Dissertations 
6-8-1994 
BMMC Permutations on a DECmpp 12000/sx 2000 
Kristin Bruhl 
Dartmouth College 
Recommended Citation 
Bruhl, Kristin, "BMMC Permutations on a DECmpp 12000/sx 2000" (1994). Dartmouth College 
Undergraduate Theses. 169. 
https://digitalcommons.dartmouth.edu/senior_theses/169 
BMMC Permutations on a
DECmpp 12000/Sx 2000
Kristin Bruhl
June 8, 1994
Computer Science Honors Thesis
Contents
1 Introduction
    1.1 Contributions of this Thesis
    1.2 Outline
2 Background
    2.1 Permutations
    2.2 Model
    2.3 Matrix Terminology
    2.4 BMMC Permutations
    2.5 BPC Permutations
3 The DECmpp Network
    3.1 The Machine
    3.2 Network Communication
    3.3 Expanded Delta Network Model
4 Approach
    4.1 Cormen's BMMC Permutation Algorithm
        4.1.1 Finding a 1-permutable set
        4.1.2 Creating a schedule, given a 1-permutable set
    4.2 Using Cormen's Algorithm in the PE Array
    4.3 Coding Issues
    4.4 Forming a 1-permutable Set of Clusters
5 Evaluation of Results
    5.1 Congestion
    5.2 Overhead
    5.3 Timing
6 Conclusions
Bibliography
I would like to thank Isaac Scherson, Raghu Subramanian, and Brian Alleyne at
UCI for letting us use their MasPar to test our algorithms, and for answering our
questions about the EDN network model. Also, Len Wisniewski for spending hours
working out various properties of BMMC permutations to help explain our results.
Finally, I am especially grateful to my advisor, Tom Cormen, for putting up with me
for four terms, offering constant guidance, criticism, and patience, and not getting
too mad when I didn't speak to him for three weeks.
Chapter 1
Introduction
Increasingly, modern computing problems, including many scientific and business
applications, require huge amounts of data to be examined, modified, and stored.
Parallel computers can be used to decrease the time needed to operate on such large
data sets, by allowing computations to be performed on many pieces of data at
once. For example, on the DECmpp machine used in our research, there are 2048
processors in the parallel processor array. The DECmpp can read data into each of
these processors, perform a computation in parallel on all of it, and write the data
out again, theoretically decreasing the execution time by a factor of 2048 over the
time required by one of its processors.
Often, the computations that occur after the data is in the processors involve
rearranging, or permuting, the data within the array of parallel processors. Information
moves between processors by means of a network connecting them. Communication
through the network can be very expensive, especially if there are many
collisions (simultaneous contentions for the same network resource) between items
of data moving from one processor to another. When a program performs hundreds
or even thousands of these permutations during its execution, a bottleneck can occur,
impeding the overall performance of the program.
Effective algorithms that decrease the time required to permute the data within a
parallel computer can yield a significant speed increase in running programs with large
data sets. Cormen [Cor92, Cor93] has designed algorithms to improve performance
when the data movement is defined by certain classes of permutations. This thesis will
examine the performance of one of these classes, the bit-matrix-multiply/complement
(BMMC) permutation, when implemented on the DECmpp. Although Cormen's
algorithm was designed for parallel disk systems, this thesis adapts it to permutations
of data residing in the memory of the parallel processors.
The DECmpp network follows the model of an expanded delta network (EDN).
One characteristic of an EDN is that it has a set of input and output ports to the
network, each of which can carry only one item of data at a time. If more than one
item needs to travel over a given port, a collision occurs. The data must access the
port serially, which slows down the entire operation. Cormen's algorithm reduces
these collisions by computing a schedule for sending the data over the network.
For small data sets, it is not worthwhile to perform the extra operations to generate
such a schedule, because the overhead associated with computing the schedule
outweighs the time gained by preventing collisions at the network ports. As the size
of the data set increases, eliminating collisions becomes more and more valuable. On
the DECmpp, when the data permutation involves more than 128 elements per processor,
our algorithm beats the more naive and obvious method for permuting in the
parallel processor array.
1.1 Contributions of this Thesis
This thesis provides
• a routing algorithm for BMMC permutations which decreases the number of
  collisions at the network ports, speeding up the routing of large data sets,
• an understanding of the issues involved in modifying and implementing Cormen's
  algorithm so that it works on data residing in the memory of the DECmpp
  parallel processors, and
• a notion of the relative speeds of built-in routing in the network in comparison
  to our higher level routing implementation.
1.2 Outline
The rest of this thesis is organized as follows. Chapter 2 defines and explains the
BMMC class of permutations and gives terminology helpful in understanding the rest
of the thesis. Chapter 3 describes the DECmpp 12000/Sx 2000, including the parallel
processor array and the serial processors we also use. Chapter 3 also discusses a model
of the network given by Scherson and Subramanian [SS93] and their work in routing
permutations using this network model. Chapter 4 presents Cormen's algorithm for
routing BMMC permutations. It explains how we modified this algorithm to make
it compatible with the DECmpp environment and to maximize the benefit we can
derive from running it on a parallel system. Chapter 5 compares network collisions
within the array of processors when Cormen's algorithm is used and when it is not,
as well as timing results for the two situations. Finally, Chapter 6 contains some
concluding remarks.
Chapter 2
Background
In this chapter, we describe our model of a parallel processor array and compare it to
the parallel disk model for which Cormen designed his algorithm. We explain some
basic matrix terminology, which aids in understanding the material in the following
chapters. Finally, we define the class of BMMC permutations and the subclass of
BPC permutations, which includes many commonly used permutations.
2.1 Permutations
A permutation is a perfect rearrangement of the elements of a set. In the context of an
array of parallel processors, we use the term permutation to refer to an interprocessor
communication in which each processor sends one piece of data and receives one piece
of data [SS93].
Frequently, a program will need to permute a vector of data larger than the
number of processors on the machine. In this case, we can imagine a machine with
enough processors to successfully complete the permutation. We call this a system of
virtual processors. These virtual processors can be simulated by storing the contents
of several virtual processors on each physical processor, and running the actions of
each virtual processor in serial. The ratio of the number of virtual processors to the
number of physical processors is called the virtual processor ratio (VPR). In cases
involving virtual processors, we are less interested in permutations than in virtual
permutations (where each virtual processor sends and receives exactly one piece of
data).
To specify a permutation of N records within an array of processors, we must
give the source address of each record $x = (x_0, x_1, \ldots, x_{n-1})$, where $n = \lg N$, and
its corresponding target address $y = (y_0, y_1, \ldots, y_{n-1})$, for each x and y in the range
$0, 1, \ldots, N-1$. By definition, the set of target addresses must be a rearrangement of
the elements of the set of source addresses.
2.2 Model
The model we use throughout this thesis is as follows. N records are stored on D
physical processor elements (PEs) $D_0, D_1, \ldots, D_{D-1}$. The VPR is N/D, equal to the
number of records stored in the memory of each PE. Figure 2.1 shows this layout of
records. Within our model, record indices will vary most rapidly within each PE, and
less rapidly between PEs. (DECmpp terminology refers to this layout as "hierarchical
virtualization.") Figure 2.2(a) demonstrates how to parse an address using our model.

    D_0   D_1   D_2   D_3   D_4   D_5   D_6   D_7
      0     4     8    12    16    20    24    28
      1     5     9    13    17    21    25    29
      2     6    10    14    18    22    26    30
      3     7    11    15    19    23    27    31

Figure 2.1: The layout of N = 32 records in a parallel processor array with D = 8. The
VPR is (N/D) = 4. Numbers indicate record indices.

The following notations are used here and throughout this thesis:
$$d = \lg D, \qquad v = \lg(N/D), \qquad n = \lg N.$$
The least significant v bits specify the offset of a record within its PE, and the most
significant d bits specify on which PE the record resides.
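The address parsing just described can be sketched in a few lines of Python. This is only an illustration, assuming that N and D are powers of 2 and that an address is held in an ordinary integer whose bit i is $x_i$; the names `parse_address` and `layout` are hypothetical and are not part of the DECmpp programming environment.

```python
def parse_address(x, v):
    """Split a record address into (PE number, offset within that PE).

    Assumes the layout of Figure 2.1: the least significant v bits are the
    offset, and the remaining most significant d bits are the PE number.
    """
    offset = x & ((1 << v) - 1)  # low v bits
    pe = x >> v                  # high d bits
    return pe, offset


def layout(N, D):
    """Return, for each PE, the list of record indices stored on it."""
    v = (N // D).bit_length() - 1  # v = lg(N/D); N/D assumed to be a power of 2
    pes = [[] for _ in range(D)]
    for x in range(N):
        pe, _ = parse_address(x, v)
        pes[pe].append(x)
    return pes


if __name__ == "__main__":
    # Reproduces the columns of Figure 2.1 (N = 32, D = 8, VPR = 4).
    for pe, records in enumerate(layout(32, 8)):
        print(f"D_{pe}: {records}")
```

With N = 32 and D = 8, each PE holds four consecutive record indices, matching the columns of Figure 2.1.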
This model is similar to the Vitter-Shriver scheme [VS90, NV91] for the layout
of data on a parallel disk system, used by [Cor92, Cor93, CSW93]. In the Vitter-Shriver
model, the independent processors are disks, rather than the PEs that we
use. Accordingly, N records are stored on D disks.¹ The layout of records is shown
in Figure 2.2(b). The least significant d bits specify the number of a disk, and the
most significant s = n − d bits specify on which "stripe" the record resides. The
most notable difference between the two models is that in our model the independent
devices correspond to the most significant bits of the address, whereas in the Vitter-Shriver
model, they correspond to the least significant bits of the address.

¹ The Vitter-Shriver model also allows for blocks of records on each disk. Reading and writing
operations are performed on entire blocks. A block size of 1 allows manipulation of individual
records.

Figure 2.2: Parsing the address $x = (x_0, x_1, \ldots, x_{n-1})$ of a record. (a) On a system of
parallel processor elements. Here, n = 13, v = 3, and d = 10. The least significant v
bits contain the offset of a record within a PE (the virtual processor within the physical
processor) and the most significant d bits contain the number of the PE. (b) On a parallel
disk system, using the Vitter-Shriver model. Here, n = 13, d = 7, and s = 6. The least
significant d bits contain the number of the disk and the most significant s bits contain the
stripe number.

2.3 Matrix Terminology
In order to understand the discussion of the following permutation classes, some basic
matrix terminology is necessary.² A matrix is a two dimensional array of numbers.
For example,
$$A = \begin{bmatrix} a_{00} & a_{01} & a_{02} \\ a_{10} & a_{11} & a_{12} \end{bmatrix}
    = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}$$
is a 2 × 3 matrix $A = (a_{ij})$. The element of the matrix in row i and column j
is denoted $a_{ij}$. Uppercase letters denote matrices, and the corresponding lowercase
letters with subscripts denote elements of these matrices. We address matrix elements
beginning with 0.

² The definitions and examples used in this chapter are taken from [CLR90, Section 31.1].
A vector is a matrix with just one column. For example,
$$x = \begin{bmatrix} 2 \\ 3 \\ 5 \end{bmatrix}$$
is a vector of length 3. We use lowercase letters to denote vectors. The ith element
of an n-vector is denoted $x_i$. We sometimes view an m × n matrix A as a set of n
column vectors $A_0, A_1, \ldots, A_{n-1}$, each of length m.
In the following chapters, we will often be concerned with whether or not a matrix
can be inverted. The inverse of an n × n matrix A, written as $A^{-1}$, has the property
that $A A^{-1} = I_n = A^{-1} A$, where $I_n$ is an n × n matrix with 1s along the main diagonal
and 0s in every other position:
$$I_4 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}.$$
For example,
$$\begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}^{-1} = \begin{bmatrix} 0 & 1 \\ 1 & -1 \end{bmatrix}.$$
Not all nonzero matrices have inverses. If a matrix does not have an inverse, it
is called singular, or noninvertible. A matrix which does have an inverse is called
nonsingular, or invertible.
A set of vectors $\{x_1, x_2, \ldots, x_n\}$ is said to be linearly dependent if there exist
coefficients $c_1, c_2, \ldots, c_n$, at least one of which is non-zero, such that
$c_1 x_1 + c_2 x_2 + \cdots + c_n x_n = 0$. If vectors are not linearly dependent, they are linearly independent.
The identity matrix is an example of a matrix with linearly independent columns and
rows.
A fundamental property of matrices will be very useful in the following chapters:
If all of the columns and rows of a square matrix are linearly independent,
then that matrix is nonsingular.
We shall rely on this property extensively.
2.4 BMMC Permutations
The class of permutations with which we are concerned in this thesis is the class of
bit-matrix-multiply/complement, or BMMC permutations. A BMMC permutation is
defined by an n × n characteristic matrix $A = (a_{ij})$ whose entries are drawn from
{0, 1} and which is nonsingular over GF(2),³ and a complement vector
$c = (c_0, c_1, \ldots, c_{n-1})$ of length n. Treating a source address of a record x as an n-bit vector, we
perform matrix-vector multiplication over GF(2) and then complement some subset
of the resulting bits to form the corresponding target address y. In other words,
$y = Ax \oplus c$, or
$$
\begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ \vdots \\ y_{n-1} \end{bmatrix}
=
\begin{bmatrix}
a_{00} & a_{01} & a_{02} & \cdots & a_{0,n-1} \\
a_{10} & a_{11} & a_{12} & \cdots & a_{1,n-1} \\
a_{20} & a_{21} & a_{22} & \cdots & a_{2,n-1} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
a_{n-1,0} & a_{n-1,1} & a_{n-1,2} & \cdots & a_{n-1,n-1}
\end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_{n-1} \end{bmatrix}
\oplus
\begin{bmatrix} c_0 \\ c_1 \\ c_2 \\ \vdots \\ c_{n-1} \end{bmatrix}.
$$

³ Matrix multiplication over GF(2) is like standard matrix multiplication over the reals but with
all arithmetic performed modulo 2. Equivalently, multiplication is replaced by logical-and, and
addition is replaced by exclusive-or.
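As a concrete illustration of $y = Ax \oplus c$, the following Python sketch computes target addresses and checks nonsingularity over GF(2) by Gaussian elimination. It assumes that addresses are integers whose bit i is $x_i$ and that A is given as a list of rows of 0/1 entries; the function names `bmmc_target` and `is_nonsingular_gf2` are hypothetical, and the code is not the implementation used in this thesis.

```python
def bmmc_target(A, c, x):
    """Return y = A x XOR c over GF(2), with addresses stored as integers."""
    n = len(A)
    y = 0
    for i in range(n):
        bit = c[i]
        for j in range(n):
            bit ^= A[i][j] & ((x >> j) & 1)  # multiply = AND, add = XOR
        y |= bit << i
    return y


def is_nonsingular_gf2(A):
    """Check invertibility over GF(2) by Gaussian elimination."""
    n = len(A)
    # Pack each row into an integer: bit j of rows[i] is A[i][j].
    rows = [sum(bit << j for j, bit in enumerate(row)) for row in A]
    for col in range(n):
        pivot = next((r for r in range(col, n) if (rows[r] >> col) & 1), None)
        if pivot is None:
            return False  # no pivot in this column, so the matrix is singular
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(n):
            if r != col and (rows[r] >> col) & 1:
                rows[r] ^= rows[col]  # eliminate the column from other rows
    return True


if __name__ == "__main__":
    A = [[1, 1, 0],
         [0, 1, 0],
         [0, 0, 1]]
    c = [1, 0, 0]
    targets = [bmmc_target(A, c, x) for x in range(8)]
    print(is_nonsingular_gf2(A), targets)
```

Because A is nonsingular, the eight targets are a rearrangement of the eight source addresses, which is what makes the mapping a permutation.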
This class includes the permutations for Gray code and inverse Gray code. If
y = Gray(x), then
$$y_i = \begin{cases} x_i \oplus x_{i+1} & \text{if } 0 \le i < \lg N - 1, \\ x_i & \text{if } i = n-1. \end{cases}$$
For example, if $N = 2^6$, the corresponding BMMC permutation is given by
$$
\begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \end{bmatrix}
=
\begin{bmatrix}
1 & 1 & 0 & 0 & 0 & 0 \\
0 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 \\
0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{bmatrix}
\oplus
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}.
$$
If $y = \mathrm{Gray}^{-1}(x)$, then
$$y_i = \bigoplus_{i \le j \le n-1} x_j,$$
and as a BMMC permutation for $N = 2^6$, we have
$$
\begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \end{bmatrix}
=
\begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 \\
0 & 1 & 1 & 1 & 1 & 1 \\
0 & 0 & 1 & 1 & 1 & 1 \\
0 & 0 & 0 & 1 & 1 & 1 \\
0 & 0 & 0 & 0 & 1 & 1 \\
0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{bmatrix}
\oplus
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}.
$$
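The two Gray-code matrices can be checked mechanically. The sketch below is illustrative only: it assumes addresses are integers whose bit i is $x_i$, builds both characteristic matrices for a given n, and confirms that the Gray matrix agrees with the usual bitwise formula `x ^ (x >> 1)` and that the inverse matrix undoes it. The helper names are hypothetical.

```python
def gray_matrix(n):
    """Characteristic matrix of Gray code: row i has 1s in columns i and i+1."""
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = 1
        if i + 1 < n:
            A[i][i + 1] = 1
    return A


def inverse_gray_matrix(n):
    """Characteristic matrix of inverse Gray code: 1s on and above the diagonal."""
    return [[1 if j >= i else 0 for j in range(n)] for i in range(n)]


def apply_gf2(A, x):
    """Multiply the bit vector of x by A over GF(2); the complement vector is 0."""
    y = 0
    for i, row in enumerate(A):
        bit = 0
        for j, a in enumerate(row):
            bit ^= a & ((x >> j) & 1)
        y |= bit << i
    return y


def check_gray(n=6):
    """Return True if both matrices behave as Gray and inverse Gray on n bits."""
    G, G_inv = gray_matrix(n), inverse_gray_matrix(n)
    for x in range(1 << n):
        if apply_gf2(G, x) != x ^ (x >> 1):
            return False
        if apply_gf2(G_inv, apply_gf2(G, x)) != x:
            return False
    return True


if __name__ == "__main__":
    print(check_gray(6))
```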
2.5 BPC Permutations
If we restrict the characteristic matrix of a BMMC permutation to be a permutation
matrix (each row and each column contains exactly one 1), we obtain the subclass
of bit-permute/complement (BPC) permutations. We can also think of BPC permutations
as being defined by a permutation
$\pi : \{0, 1, \ldots, n-1\} \xrightarrow{1\text{-}1} \{0, 1, \ldots, n-1\}$
on the address bits, and a complement vector c. The characteristic matrix $A = (a_{ij})$
and the address-bit permutation $\pi$ have the relationship
$$a_{ij} = \begin{cases} 1 & \text{if } i = \pi(j), \\ 0 & \text{otherwise} \end{cases}$$
for $i, j = 0, 1, \ldots, n-1$.
Many common permutations fall into the BPC class. For example, matrix transposition
is a BPC permutation. Matrix transposition consists of mapping the (i, j)
entry of an R × S matrix to the (j, i) position. The n-bit address of each entry is
made up of a (lg R)-bit row number in the most significant bits followed by a (lg S)-bit
column number in the least significant bits. Thus, the transposition mapping
swaps the upper lg R bits and the lower lg S bits of the source address to produce the
target address. The permutation of bits, therefore, is a cyclic rotation by either lg R
positions in one direction or lg S positions in the other (no bits are complemented).
The equation for this permutation is
$$\pi(j) = (j + \lg R) \bmod n = (j - \lg S) \bmod n$$
and $c_j = 0$ for $j = 0, 1, \ldots, n-1$.
Bit-reversal permutations are also BPC permutations. A bit-reversal consists of
reversing the bits of the source address to form the target address. If the source
address of a record is $(x_0, x_1, \ldots, x_{n-1})$, its target address will be $(x_{n-1}, x_{n-2}, \ldots, x_0)$.
The permutation equation is
$$\pi(j) = (n-1) - j$$
and $c_j = 0$ for $j = 0, 1, \ldots, n-1$.
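The two examples above can be exercised with a short Python sketch. It assumes the convention implied by the relationship between A and π in this section, namely that source bit j moves to target position π(j), and it stores addresses as integers whose bit i is $x_i$. The function names are hypothetical, chosen only for this illustration.

```python
def bpc_target(pi, c, x):
    """Move source bit j to target position pi[j], then complement by c."""
    y = 0
    for j, i in enumerate(pi):
        y |= ((x >> j) & 1) << i
    for i, ci in enumerate(c):
        y ^= ci << i
    return y


def transpose_pi(lg_r, lg_s):
    """Bit permutation for transposing an R x S matrix (R, S powers of 2)."""
    n = lg_r + lg_s
    return [(j + lg_r) % n for j in range(n)]


def bit_reversal_pi(n):
    """Bit permutation that reverses an n-bit address."""
    return [(n - 1) - j for j in range(n)]


if __name__ == "__main__":
    # Transpose a 4 x 8 matrix stored in row-major order.
    R, S = 4, 8
    pi = transpose_pi(2, 3)
    zero = [0] * 5
    for row in range(R):
        for col in range(S):
            source = row * S + col
            target = bpc_target(pi, zero, source)
            assert target == col * R + row  # position (col, row) of the S x R result
    print(bpc_target(bit_reversal_pi(3), [0, 0, 0], 0b110))
```

In the transposition check, the source address is `row * S + col` and the target address is `col * R + row`, which is the row-major address of entry (col, row) in the S × R result.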
A final example of a BPC permutation is matrix reblocking. In a matrix reblocking
permutation, the bits of a source address are partitioned into four groups, and the
middle two groups are swapped. If a source address is $(\alpha, \beta, \gamma, \delta)$, where $\alpha$, $\beta$, $\gamma$, and
$\delta$ each represent zero or more consecutive bits of the address, the target address will
be $(\alpha, \gamma, \beta, \delta)$. The equation for the permutation is
$$\pi(j) = \begin{cases} j + |\gamma| & \text{if } j \in \beta, \\ j - |\beta| & \text{if } j \in \gamma, \\ j & \text{otherwise} \end{cases}$$
and $c_j = 0$ for $j = 0, 1, \ldots, n-1$.
We have explained the class of permutations on which this thesis focuses. We now
proceed to examine the specific machine, the DECmpp 12000/Sx 2000, on which we
perform the routing of these permutations.
Chapter 3
The DECmpp Network
In this chapter we describe the DECmpp 12000/Sx 2000, paying special attention to
its array of parallel processors [Dig92b]. Then we will discuss a model for this network
presented in [SS93], along with their work on routing permutations using this model.
3.1 The Machine
The DECmpp 12000/Sx 2000 is a massively parallel processing system, made up of a
console system and a data parallel unit (DPU). The console is a processor providing
standard I/O devices. In our case, the console system is a DECstation 5000/240.
The DPU is made up of an array control unit (ACU), an array of PEs (2048
in our case), and a PE communication system. All of the parallel processing takes
place within the DPU. The ACU controls the PE array and performs any operations
on singular (i.e., non-parallel) data within modules of code written for the parallel
processors. When the ACU encounters instructions dealing with parallel data, it
sends data and instructions to each PE simultaneously.
Each PE within the PE array has its own processor and data memory. When
the ACU sends an instruction to the PEs, each PE carries it out only on variables
that reside physically in that PE. If a computation requires data from two or more
PEs, the PE communication system must send the data to a common PE so that the
PE can perform the operation. We refer to any piece of data sent between PEs in
this fashion as a message. It is the function of the global message router in the PE
communication system to send these messages between PEs.
The PEs are arranged in a two-dimensional matrix, 32 × 64 in our case. Each
nonoverlapping 4 × 4 square matrix within the PE array forms a cluster, so our
machine has a total of 128 clusters, as shown in Figure 3.1.
3.2 Network Communication
The global message router can communicate with all PE clusters at the same time.
It is limited, however, to communicating with one PE per cluster at a time, because
each cluster has only one input port and one output port to the network. Every
inter-PE communication must take place over this network. Even if one PE connects
to another PE in the same cluster, the message must travel over the global router,
and thus use the input and the output ports for that cluster. To communicate with
more than one PE in a cluster, the router must make serial accesses to that cluster.

Figure 3.1: The PE Array of the DECmpp 12000/Sx 2000 [Dig92a, page 1-23]. The 2K PE
array is divided into PE clusters of 4 × 4 PEs; each PE has a processor and data memory.

This network property is significant in our research, since to achieve the fastest
results routing our data, we would like to minimize the number of serial accesses to
a cluster. We are concerned, then, not only with sending one record to or from each
PE at one time, but also with sending only one record from and to each PE cluster.
We will discuss this issue in depth in Chapter 4.
3.3 Expanded Delta Network Model
Scherson and Subramanian [SS93] have examined the DECmpp global network¹ with
respect to this restriction on accessing multiple PEs per cluster. They classify the
network as an expanded delta network (EDN) and give an algorithm for routing permutations
on the PEs through EDNs. The algorithm is guaranteed to completely
route a permutation on the DECmpp in 32 iterations of the global message router.
This algorithm is off-line, meaning that computing the routing pattern for a given
permutation is not included in the running time of the algorithm.

¹ This work was actually done on a MasPar MP-2 network, which is the same as the DECmpp
network. For consistency, we will use the name DECmpp throughout this discussion.

Figure 3.2 shows an example EDN. The thin lines represent thin wires, which are
capable of carrying only one message at a time. The thin wires on the far left of
Figure 3.2 are the input ports to the network, and those to the right are the output
ports. The bold lines represent thick wires, and can each carry up to K messages at
once. For the DECmpp, K = 2. There is a unique path through the network from
each input wire to each output wire.

Figure 3.2: An example of an expanded delta network [SS93, page 285]. T = 16.
We define T to be the number of thin-wire ports on the network. Let
$P = \{P_0, P_1, \ldots, P_{T-1}\}$ denote a set of processors, each connected to a thin-wire port. A message
is specified by an ordered pair of these processors: the first component is the source
of the message, and the second component is the destination. A set of messages (a
relation over P) is called a traffic pattern.
Scherson and Subramanian classify a traffic pattern as EDN passable if
• for each thin wire w in the EDN, there is at most one message $\langle P_i, P_j \rangle$ in the
  pattern such that w lies on the unique path from input port i to output port j, and
• for each thick wire w in the EDN, there are at most K messages $\langle P_i, P_j \rangle$ in the
  pattern such that w lies on the unique path from input port i to output port j.
A permutation is a traffic pattern consisting of a set of messages for which each
processor is the source of exactly one message and each processor is the destination
of exactly one message. Scherson and Subramanian prove that every permutation can
be expressed as the composition of three EDN passable permutations, each of which
corresponds to one stage in the network. Thus, every permutation of T elements
can be routed in exactly three complete passes over the data. First, we route the
data through the first stage using the first permutation, which sends the messages to
intermediate destination processors. Then, we use the second permutation to send
each message from here to a second intermediate destination. Finally, to route the
messages through the third network stage, we permute them according to the third
permutation, and the messages reach their final destinations. Some EDNs, including
the DECmpp, have only two stages. In these cases, every T-element permutation can
be routed in only two passes.
When the number of elements N to be permuted is greater than T, we must
deal with the issue of restricted access. More than one element must pass through
each port to access the network. This idea of a virtual permutation was described in
Section 2.1. Let $V = \{V_0, V_1, \ldots, V_{N-1}\}$ represent a set of virtual processors. There
are N/T virtual processors which must share each port to the EDN. A virtual message
is an ordered pair of virtual processors, similar to the message described above. A
virtual permutation is a traffic pattern in which each virtual processor is the source
of exactly one virtual message and the destination of exactly one virtual message.
Since each port can only carry one message at a time, the virtual permutation must
be sent in N/T serial accesses.
Scherson and Subramanian prove that every virtual permutation can be partitioned
into N/T conflict-free sets of messages by computing an edge coloring on a
bipartite graph, an expensive operation which they perform off-line. Since every permutation
of T elements can be routed on an EDN in three passes over the data, it
follows that every virtual permutation of N elements can be routed on an EDN in
3N/T passes. If the EDN has only two stages, a virtual permutation can be routed
in 2N/T passes.
As mentioned above, the EDN within our DECmpp has only one network input
and output port per PE cluster, for a total of 128 ports. This configuration
is analogous to Scherson and Subramanian's model with T = 128 processors (the
clusters) and N = 2048 virtual processors (the PEs). These parameters give us
N/T = 2048/128 = 16 sets of messages that must be routed serially through the
two-stage DECmpp network for every PE permutation. Since each set of messages
requires 2 router iterations (1 for each pass over the data), all 16 sets require 32
iterations of the router.
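The arithmetic behind the 32-iteration figure can be written out as a small Python sketch. The function name `router_iterations` is hypothetical; it only encodes the counting argument above, one pass per network stage for each of the N/T serially routed message sets.

```python
def router_iterations(num_virtual, num_ports, stages):
    """Passes needed to route a virtual permutation on an EDN.

    num_virtual: number of virtual processors N (the PEs on the DECmpp)
    num_ports:   number of thin-wire ports T (the clusters on the DECmpp)
    stages:      number of network stages (one pass over the data per stage)
    """
    if num_virtual % num_ports != 0:
        raise ValueError("N must be a multiple of T")
    serial_sets = num_virtual // num_ports  # N/T conflict-free message sets
    return serial_sets * stages


if __name__ == "__main__":
    # DECmpp: N = 2048 PEs, T = 128 cluster ports, two-stage network.
    print(router_iterations(2048, 128, 2))
```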
In coding this routing algorithm, Subramanian encounters a problem which we also
face in dealing with the ordering of PEs within clusters. The DECmpp Sx parallel
programming environment provides a PE numbering system to indicate each PE's
relative position within the PE array. The PEs are numbered as one large 32 × 64
matrix, in the same fashion we described in Section 2.3. We call this numbering
system a matrix-oriented addressing scheme. Figure 3.3 shows this addressing of the
PEs.

       0     1     2     3 |     4     5     6     7 | ...    63
      64    65    66    67 |    68    69    70    71 | ...   127
     128   129   130   131 |   132   133   134   135 | ...   191
     192   193   194   195 |   196   197   198   199 | ...   255
    -----------------------+-------------------------+----------
     ...                   |   ...                   | ...   ...
    1984  1985  1986  1987 |  1988  1989  1990  1991 | ...  2047

Figure 3.3: The matrix-oriented PE numbering given by the DECmpp programming environment.
Bold lines represent cluster boundaries.
Unfortunately, this PE numbering system is not easily compatible with the EDN
model. The EDN model treats each cluster as a physical processor with a network
port. The 16 PEs within that cluster are virtual processors. Ideally, the PEs
within a cluster would be numbered sequentially. This way, one cluster would contain
$D_0, D_1, \ldots, D_{15}$, the next cluster would contain $D_{16}, D_{17}, \ldots, D_{31}$, and so on, neatly
dividing the PEs into clusters corresponding to the network ports. We call this
numbering system a cluster-oriented addressing scheme. Figure 3.4 gives a possible
cluster-oriented addressing.
(a)
       0     1     2     3 |    16    17    18    19 | ...  1363
       4     5     6     7 |    20    21    22    23 | ...  1367
       8     9    10    11 |    24    25    26    27 | ...  1371
      12    13    14    15 |    28    29    30    31 | ...  1375
    -----------------------+-------------------------+----------
     ...                   |   ...                   | ...   ...
     684   685   686   687 |   700   701   702   703 | ...  2047

(b)
      1    2    5    6   17   18   21   22   65   66   69   70   81   82   85   86
      3    4    7    8   19   20   23   24   67   68   71   72   83   84   87   88
      9   10   13   14   25   26   29   30   73   74   77   78   89   90   93   94
     11   12   15   16   27   28   31   32   75   76   79   80   91   92   95   96
     33   34   37   38   49   50   53   54   97   98  101  102  113  114  117  118
     35   36   39   40   51   52   55   56   99  100  103  104  115  116  119  120
     41   42   45   46   57   58   61   62  105  106  109  110  121  122  125  126
     43   44   47   48   59   60   63   64  107  108  111  112  123  124  127  128

Figure 3.4: A cluster-oriented PE numbering, partitioning clusters into groups of sequential
PEs. (a) The ordering of PEs within a cluster and (b) the ordering of all the clusters
within the PE array.
Subramanian uses two mapping functions to switch back and forth between the
two numbering systems in his code. (He refers to the cluster-oriented system as the
"real" addressing and the matrix-oriented system as the "fake" addressing.) This
mapping is actually a BPC permutation on bits of the PE addresses. For example,
the permutation for the transformation from matrix-oriented addresses to cluster-oriented
addresses on our M = 2048 network (m = 11 bits in the address of a PE)
is
$$\begin{array}{l|ccccccccccc}
x      & 0 & 1 & 2 & 3 & 4 & 5  & 6 & 7 & 8 & 9 & 10 \\ \hline
\pi(x) & 0 & 1 & 4 & 6 & 8 & 10 & 2 & 3 & 5 & 7 & 9
\end{array},$$
and the permutation for the reverse transformation, from cluster-oriented addresses
to matrix-oriented addresses is
$$\begin{array}{l|ccccccccccc}
x           & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9  & 10 \\ \hline
\pi^{-1}(x) & 0 & 1 & 6 & 7 & 2 & 8 & 3 & 9 & 4 & 10 & 5
\end{array}.$$
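The two bit permutations above can be applied directly to PE numbers. The following Python sketch is an illustration rather than Subramanian's code: it assumes that source bit x moves to target position π(x), with bit 0 the least significant bit, and the names `MATRIX_TO_CLUSTER`, `CLUSTER_TO_MATRIX`, `permute_bits`, and `cluster_of` are hypothetical.

```python
# Bit permutations copied from the two tables above (index = source bit).
MATRIX_TO_CLUSTER = [0, 1, 4, 6, 8, 10, 2, 3, 5, 7, 9]
CLUSTER_TO_MATRIX = [0, 1, 6, 7, 2, 8, 3, 9, 4, 10, 5]


def permute_bits(pi, x):
    """Move bit j of x to position pi[j] (assumed convention)."""
    y = 0
    for j, i in enumerate(pi):
        y |= ((x >> j) & 1) << i
    return y


def cluster_of(matrix_pe):
    """Cluster number (0-based) of a PE given its matrix-oriented number."""
    # A cluster-oriented address has 4 offset bits below the cluster number.
    return permute_bits(MATRIX_TO_CLUSTER, matrix_pe) >> 4


if __name__ == "__main__":
    # Corners of the PE array: matrix-oriented PEs 0, 63, 1984, and 2047.
    for pe in (0, 63, 1984, 2047):
        print(pe, permute_bits(MATRIX_TO_CLUSTER, pe), cluster_of(pe))
```

Under this convention the corner PEs map to 0, 1363, 684, and 2047, which agrees with the corner entries of Figure 3.4(a).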
In the next chapter, we discuss the implementation of Cormen's algorithm for
routing BMMC permutations on the DECmpp. We deal with several of the same
issues in coding (i.e., matrix-oriented versus cluster-oriented PE addressing and contention
for the network ports) discussed above. The algorithm differs from that of
Scherson and Subramanian in that it is able to compute the routing schedule on-line
for the class of BMMC permutations.
Chapter 4
Approach
Now that we have seen the organization of the DECmpp network, we turn to Cormen's
algorithm for routing BMMC permutations, which can deal with the issue of restricted
access to the network ports. We discuss the algorithm as it applies to Vitter-Shriver
parallel disk systems and then present modifications which allow it to work on a
parallel processor array. We also mention a few coding issues which arose during
implementation of the algorithm.
4.1 Cormen's BMMC Permutation Algorithm
Cormen has developed an algorithm to perform any block BMMC permutation for
which the block size is 1 on a Vitter-Shriver parallel disk system in only one pass over
the data [Cor93]. To describe this algorithm, he relies on a technique of partitioning
the data into disjoint sets and permuting these sets sequentially. We describe this
decomposition, and then show how Cormen's algorithm uses this method.
The method assumes that the records to be permuted reside in the source portion
of the disks of the disk array, and that each disk has sufficient available room to hold
another set of records the same size. Each disk originally holds N/D records, and
there must be room for another set of N/D records. This storage space is called the
target portion of the disks.
The algorithm has two components, which partition the data records into sets of
data which we can route serially. First, we find a 1-permutable set of the records.
Then, we devise a schedule for the entire permutation, given the first 1-permutable
set.
The 1-permutable set is a special case of a k-permutable set, which Cormen defines
as follows. For a permutation of source addresses to target addresses and a positive
integer k, we define a k-permutable set of records as a set of kD source records such
that
1. each disk contains exactly k of these source records, and
2. each disk has exactly k target records mapped to it [Cor93].
We sequentially permute sets of kD records at a time until the entire set of source
records has been permuted. We call this sequence of sets a schedule. As long as we
choose a k small enough that the entire set of kD records will fit into memory, we
can perform the permutation looking at each data record only once.
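The two conditions in the definition translate directly into a check. The sketch below is only illustrative: it assumes the Vitter-Shriver layout in which the low-order lg D bits of an address give the disk number (so the disk is `address % num_disks`), and the function name `is_k_permutable` and the callable `permute` are hypothetical.

```python
from collections import Counter


def is_k_permutable(sources, permute, num_disks, k):
    """Check the two conditions of the definition for a set of source addresses.

    Assumes disk = address % num_disks, and that `permute` maps a source
    address to its target address.
    """
    if len(set(sources)) != k * num_disks:
        return False
    src = Counter(x % num_disks for x in sources)
    dst = Counter(permute(x) % num_disks for x in sources)
    disks = range(num_disks)
    return all(src[d] == k for d in disks) and all(dst[d] == k for d in disks)
```

For example, with D = 4 and a permutation that exchanges the low and high two bits of a 4-bit address, the set {0, 1, 2, 3} fails the test (all four targets land on disk 0), while {0, 5, 10, 15} passes.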
4.1.1 Finding a 1-permutable set
Cormen notes that in finding a 1-permutable set of records, we need only look at the
first d rows of the characteristic matrix A. The other $n - d$ rows of A determine the
stripe numbers of the target addresses, which do not affect 1-permutability. Likewise,
only the first d positions of the complement vector c must be taken into consideration.
Using this observation, the algorithm finds a 1-permutable set of blocks in the
following four steps:
1. Find a set S of d "basis" columns for the first d rows of A.
2. Given S, define three sets of columns, T, U, and V.
3. Based on T and U, define a permutation R on the set of disks $\{0, 1, \ldots, D-1\}$.
4. From all of the above, define a set of source addresses $\{x^{(0)}, x^{(1)}, \ldots, x^{(D-1)}\}$,
which constitutes a 1-permutable set of blocks [Cor93].
We describe each step of the process, using a running example (from [Cor92]) to help
in understanding.
Finding a set of basis columns
We first find a set S of d columns such that the submatrix $A_{0..d-1,\,S}$ is nonsingular. As
we stated in Section 2.3, since A is nonsingular, all of its rows are linearly independent.
In particular, the first d rows are linearly independent. Thus, there must exist a subset
of column indices $S \subseteq \{0, 1, \ldots, n-1\}$ such that $|S| = d$ and the $d \times d$ submatrix
$A_{0..d-1,\,S}$ is nonsingular. We define
\[
Q = A_{0..d-1,\,S} ,
\]
so that $Q^{-1}$ exists.
Example: Let n = 5 and d = 3, and use the characteristic matrix
\[
A = \begin{bmatrix}
0 & 1 & 1 & 1 & 0 \\
0 & 0 & 0 & 1 & 1 \\
0 & 1 & 1 & 0 & 0 \\
1 & 1 & 0 & 0 & 1 \\
1 & 0 & 0 & 1 & 0
\end{bmatrix} .
\]
We choose the basis $S = \{1, 3, 4\}$, giving
\[
Q = A_{0..d-1,\,S} = \begin{bmatrix}
1 & 1 & 0 \\
0 & 1 & 1 \\
1 & 0 & 0
\end{bmatrix}
\quad\text{and}\quad
Q^{-1} = \begin{bmatrix}
0 & 0 & 1 \\
1 & 0 & 1 \\
1 & 1 & 1
\end{bmatrix} .
\]
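The example's inverse can be reproduced with ordinary Gauss-Jordan elimination in which addition is XOR. This is a minimal sketch for checking the numbers, not the thesis's MPL code; the function name `gf2_inverse` is illustrative.

```python
def gf2_inverse(matrix):
    """Invert a square 0/1 matrix over GF(2) by Gauss-Jordan elimination."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    rows = [list(row) + [int(i == j) for j in range(n)]
            for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if rows[r][col]), None)
        if pivot is None:
            raise ValueError("matrix is singular over GF(2)")
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(n):
            if r != col and rows[r][col]:
                rows[r] = [a ^ b for a, b in zip(rows[r], rows[col])]
    return [row[n:] for row in rows]


# Characteristic matrix and basis from the running example.
A = [[0, 1, 1, 1, 0],
     [0, 0, 0, 1, 1],
     [0, 1, 1, 0, 0],
     [1, 1, 0, 0, 1],
     [1, 0, 0, 1, 0]]
d, S = 3, [1, 3, 4]
Q = [[A[r][c] for c in S] for r in range(d)]   # first d rows, columns in S
Q_inv = gf2_inverse(Q)
```

Here `Q` and `Q_inv` come out equal to the two 3 x 3 matrices shown in the example.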
Defining T, U, and V
Given S, we define
\[
\begin{aligned}
T &= \{0, 1, \ldots, d-1\} - S , \\
U &= S \cap \{0, 1, \ldots, d-1\} , \\
V &= \{0, 1, \ldots, n-1\} - (S \cup T) .
\end{aligned}
\]
The sets T and U form a partition of $\{0, 1, \ldots, d-1\}$, and the sets S, T, and V form a
partition of $\{0, 1, \ldots, n-1\}$. We can think of T as the subset of the first d columns
that are not in the basis S and of U as the set of basis columns that are also in the
first d columns. Cormen sorts T and U into increasing order so that
\[
\begin{aligned}
T &= \{t_0, t_1, \ldots, t_{|T|-1}\}, & t_j &> t_{j-1} \text{ for } j = 1, 2, \ldots, |T|-1 ; \\
U &= \{u_0, u_1, \ldots, u_{|U|-1}\}, & u_j &> u_{j-1} \text{ for } j = 1, 2, \ldots, |U|-1 .
\end{aligned}
\]
Example: With $S = \{1, 3, 4\}$, we have $T = \{0, 2\}$, $U = \{1\}$, and $V = \emptyset$.
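As a small illustration of these definitions, the sketch below computes the three sets from a basis. The function name `split_columns` is hypothetical, and the sets are returned as sorted lists so that the ordering of T and U is explicit.

```python
def split_columns(S, n, d):
    """Return the ordered lists T, U, V defined above for a basis S."""
    basis = set(S)
    low = range(d)                               # the first d column indices
    T = [j for j in low if j not in basis]       # low columns outside the basis
    U = [j for j in low if j in basis]           # basis columns among the low ones
    V = [j for j in range(n) if j not in basis and j not in T]
    return T, U, V
```

With the example's values, `split_columns([1, 3, 4], 5, 3)` gives T = [0, 2], U = [1], and an empty V.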
Defining the permutation R
In his discussion of this step, Cormen introduces a new notation to allow more explicit
differentiation between integers and their binary representations. He denotes the
d-bit representation of an integer $i \in \{0, 1, \ldots, D-1\}$ by $\mathrm{bin}(i)$, which is also a vector
of length d. The jth bit of this representation, for $j = 0, 1, \ldots, d-1$, is $\mathrm{bin}_j(i)$.
We define the permutation R on $\{0, 1, \ldots, D-1\}$ in the following way. Let
$i \in \{0, 1, \ldots, D-1\}$, and let the binary representation of i be
$\mathrm{bin}(i) = (i_0, i_1, \ldots, i_{d-1})$. Then $R(i) = k$, where the binary representation of k is
$\mathrm{bin}(k) = (k_0, k_1, \ldots, k_{d-1})$ and
\[
\begin{aligned}
k_j &= i_{u_j} & &\text{for } j = 0, 1, \ldots, |U|-1 , \\
k_{|U|+j} &= i_{t_j} & &\text{for } j = 0, 1, \ldots, |T|-1 .
\end{aligned}
\]
The permutation R is a BPC permutation on $\mathrm{bin}(i)$, and so it defines a permutation
on $\{0, 1, \ldots, D-1\}$.
Example: Since d = 3, we have D = 8. We can see how to form the permutation R
by writing out 0 through D - 1 in binary and labeling each of the d columns of bits
as to whether it belongs in T or U. Then we can rearrange the columns so that all
the columns of U precede the columns of T:

      T U T             U T T
  i   0 1 2             1 0 2   R(i)
  0   0 0 0             0 0 0    0
  1   1 0 0             0 1 0    2
  2   0 1 0             1 0 0    1
  3   1 1 0     ==>     1 1 0    3
  4   0 0 1             0 0 1    4
  5   1 0 1             0 1 1    6
  6   0 1 1             1 0 1    5
  7   1 1 1             1 1 1    7
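The column rearrangement in the table is a plain bit shuffle, sketched below with bit 0 as the least significant bit. The function name `make_R` is illustrative; it returns R as a lookup table indexed by i.

```python
def make_R(T, U, d):
    """Build the permutation R on {0, ..., 2^d - 1} as a lookup table.

    Bit j of R(i) is bit U[j] of i for j < |U|, and bit |U| + j of R(i)
    is bit T[j] of i, with bit 0 the least significant bit.
    """
    order = list(U) + list(T)        # source bit feeding each target bit
    table = []
    for i in range(1 << d):
        k = 0
        for j, src in enumerate(order):
            k |= ((i >> src) & 1) << j
        table.append(k)
    return table
```

Calling `make_R([0, 2], [1], 3)` reproduces the R(i) column of the table above.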
Defining the set of source addresses
Finally, we define the set $\{x^{(0)}, x^{(1)}, \ldots, x^{(D-1)}\}$ in the following way:
\[
\begin{aligned}
x^{(i)}_S &= Q^{-1} A_{0..d-1,\,T} \, \mathrm{bin}_T(i) \oplus \mathrm{bin}(R(i)) \oplus c_{0..d-1} , \\
x^{(i)}_T &= \mathrm{bin}_T(i) , \\
x^{(i)}_V &= 0
\end{aligned}
\]
for $i = 0, 1, \ldots, D-1$.
The source addresses specified by the above method yield a 1-permutable set, as
Cormen proves [Cor93].
Example: We let the complement vector c be all 0s to keep the example simple. Here
we compute $x^{(6)}$. We start by computing the bits in positions 1, 3, and 4, since
$S = \{1, 3, 4\}$. Writing lower-order bits on the left, and indexing from zero, we have
that $\mathrm{bin}(6) = 011$ and, taking bits 0 and 2 because $T = \{0, 2\}$, we have $\mathrm{bin}_T(6) = 01$.
We know $\mathrm{bin}(R(6)) = 101$ (from above), so
\[
\begin{aligned}
x^{(6)}_{\{1,3,4\}} &= Q^{-1} A_{0..2,\,T} \, \mathrm{bin}_T(6) \oplus \mathrm{bin}(R(6)) \\
&= \begin{bmatrix} 0 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 1 \end{bmatrix}
   \begin{bmatrix} 0 & 1 \\ 0 & 0 \\ 0 & 1 \end{bmatrix}
   \begin{bmatrix} 0 \\ 1 \end{bmatrix}
   \oplus \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix} \\
&= \begin{bmatrix} 0 & 1 \\ 0 & 0 \\ 0 & 0 \end{bmatrix}
   \begin{bmatrix} 0 \\ 1 \end{bmatrix}
   \oplus \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix} \\
&= \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}
   \oplus \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix} \\
&= \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} .
\end{aligned}
\]
Filling these values into the positions indexed by set S in the source address, we get
$x^{(6)} = {?}0{?}01$, where the positions marked by question marks correspond to set T. For
these remaining positions, we use the bits of $\mathrm{bin}_T(6) = 01$, and so $x^{(6)} = 00101$. We
also have that $y^{(6)} = A x^{(6)} = 11110$.
If we do this calculation for i = 0, 1, ..., 7, we compute the following source and
target addresses:

          0 1 2 3 4               0 1 2 3 4
  x^(0)   0 0 0 0 0       y^(0)   0 0 0 0 0
  x^(1)   1 0 0 1 0       y^(1)   1 1 0 1 0
  x^(2)   0 1 0 0 0       y^(2)   1 0 1 1 0
  x^(3)   1 1 0 1 0       y^(3)   0 1 1 0 0
  x^(4)   0 1 1 0 1       y^(4)   0 1 0 0 0
  x^(5)   1 1 1 1 1       y^(5)   1 0 0 1 0
  x^(6)   0 0 1 0 1       y^(6)   1 1 1 1 0
  x^(7)   1 0 1 1 1       y^(7)   0 0 1 0 0

Note that for both the set of source addresses and the set of target addresses, the bits
that represent the disk numbers (in the three leftmost columns) are permutations of
$\{0, 1, \ldots, 7\}$.
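The whole table can be regenerated from the defining equations. The sketch below hard-codes Q^{-1} and R from the running example and follows the operator placement used in the worked computation of x^(6), in which Q^{-1} multiplies only the A bin_T(i) term. All function and variable names are illustrative, and bit vectors are written least significant bit first.

```python
A = [[0, 1, 1, 1, 0],
     [0, 0, 0, 1, 1],
     [0, 1, 1, 0, 0],
     [1, 1, 0, 0, 1],
     [1, 0, 0, 1, 0]]
n, d = 5, 3
S, T = [1, 3, 4], [0, 2]
Q_INV = [[0, 0, 1], [1, 0, 1], [1, 1, 1]]   # from the example
R = [0, 2, 1, 3, 4, 6, 5, 7]                # from the example
c = [0] * n                                 # complement vector, all 0s


def bits(value, width):
    """Bit vector of value, least significant bit first."""
    return [(value >> j) & 1 for j in range(width)]


def matvec(M, v):
    """Matrix-vector product over GF(2)."""
    return [sum(a & b for a, b in zip(row, v)) % 2 for row in M]


def source_address(i):
    """Bit vector x^(i), following the three defining equations."""
    bin_T = [bits(i, d)[t] for t in T]
    A_T = [[A[r][t] for t in T] for r in range(d)]
    x_S = matvec(Q_INV, matvec(A_T, bin_T))
    x_S = [a ^ b ^ cc for a, b, cc in zip(x_S, bits(R[i], d), c[:d])]
    x = [0] * n                             # positions in V stay 0
    for pos, bit in zip(S, x_S):
        x[pos] = bit
    for pos, bit in zip(T, bin_T):
        x[pos] = bit
    return x


def target_address(x):
    """y = A x XOR c over GF(2)."""
    return [a ^ b for a, b in zip(matvec(A, x), c)]


sources = [source_address(i) for i in range(1 << d)]
targets = [target_address(x) for x in sources]
```

The first d entries of each vector are the disk bits, so 1-permutability amounts to the d-bit prefixes of `sources`, and likewise of `targets`, being pairwise distinct.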
4.1.2 Creating a schedule, given a 1-permutable set
Once we have a 1-permutable set, we can partition the BMMC permutation into
1-permutable sets of records. This sequence of sets forms a schedule.
To create the schedule, Cormen defines a set of N/D n-vectors
$\{p^{(0)}, p^{(1)}, \ldots, p^{(N/D-1)}\}$ such that $p^{(j)}_{0..d-1} = 0$ and $p^{(j)}_{d..n-1}$ is the binary
representation of j for $j = 0, 1, \ldots, N/D-1$. Given a 1-permutable set of blocks
$\{x^{(0)}, x^{(1)}, \ldots, x^{(D-1)}\}$, we form the following N/D sets:
\[
\begin{array}{l}
\{x^{(0)} \oplus p^{(0)},\; x^{(1)} \oplus p^{(0)},\; \ldots,\; x^{(D-1)} \oplus p^{(0)}\} , \\
\{x^{(0)} \oplus p^{(1)},\; x^{(1)} \oplus p^{(1)},\; \ldots,\; x^{(D-1)} \oplus p^{(1)}\} , \\
\{x^{(0)} \oplus p^{(2)},\; x^{(1)} \oplus p^{(2)},\; \ldots,\; x^{(D-1)} \oplus p^{(2)}\} , \\
\qquad\vdots \\
\{x^{(0)} \oplus p^{(N/D-1)},\; x^{(1)} \oplus p^{(N/D-1)},\; \ldots,\; x^{(D-1)} \oplus p^{(N/D-1)}\} .
\end{array}
\]
Through this procedure, we XOR the most significant s bits of each address with
0, 1, ..., N/D - 1 in turn, to form the entire schedule. Cormen proves that this method
does in fact generate a schedule given any 1-permutable set [Cor93].
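A short sketch makes the construction concrete for the running example (n = 5, d = 3, so N/D = 4 sets). Addresses are stored as integers with bit 0 equal to x_0, so the first 1-permutable set from the table becomes the list `BASE`; the names `BASE`, `apply_A`, and `schedule` are illustrative, and the disk number is taken to be the low-order d bits as in the Vitter-Shriver layout.

```python
A_ROWS = [[0, 1, 1, 1, 0],
          [0, 0, 0, 1, 1],
          [0, 1, 1, 0, 0],
          [1, 1, 0, 0, 1],
          [1, 0, 0, 1, 0]]
n, d = 5, 3
D = 1 << d
BASE = [0, 9, 2, 11, 22, 31, 20, 29]   # x^(0), ..., x^(7) as integers


def apply_A(x):
    """Target address y = A x over GF(2) (complement vector is all 0s)."""
    y = 0
    for r, row in enumerate(A_ROWS):
        bit = sum(a & ((x >> j) & 1) for j, a in enumerate(row)) % 2
        y |= bit << r
    return y


def schedule(base, n, d):
    """Yield the N/D sets {x XOR p(j)}, where p(j) holds j in bits d..n-1."""
    for j in range(1 << (n - d)):
        mask = j << d
        yield [x ^ mask for x in base]


sets = list(schedule(BASE, n, d))
```

Each of the four sets touches every source disk once and every target disk once, and together the sets cover all 32 addresses exactly once.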
4.2 Using Cormen's Algorithm in the PE Array
Cormen's algorithm for BMMC permutations was designed specifically for permuting
data residing on a parallel disk array. However, we can use the same algorithm, with
very little modification, to perform BMMC permutations on data residing on PEs
within a PE array. We saw in Section 2.2 the differences in parsing an address from
the Vitter-Shriver parallel disk model and our PE array. In the Vitter-Shriver model,
the least significant d bits of an address specify the disk number, whereas in our case,
it is the most significant d bits which define the PE number. Thus, we must form a
1-permutable set using the upper d bits, rather than the lower d bits that Cormen
uses. Here, we briefly summarize the changes made to the algorithm for this reason.
• In computing the set of basis columns, we define
\[
Q = A_{n-d..n-1,\,S} .
\]
• In defining T, U, and V, we use
\[
\begin{aligned}
T &= \{n-d, n-d+1, \ldots, n-1\} - S , \\
U &= S \cap \{n-d, n-d+1, \ldots, n-1\} , \\
V &= \{0, 1, \ldots, n-1\} - (S \cup T) .
\end{aligned}
\]
• In defining the source addresses, we use
\[
\begin{aligned}
x^{(i)}_S &= Q^{-1} A_{n-d..n-1,\,T} \, \mathrm{bin}_T(i) \oplus \mathrm{bin}(R(i)) \oplus c_{n-d..n-1} , \\
x^{(i)}_T &= \mathrm{bin}_T(i) , \\
x^{(i)}_V &= 0 .
\end{aligned}
\]
• In creating the schedule of 1-permutable sets, we define the set of N/D
n-vectors $\{p^{(0)}, p^{(1)}, \ldots, p^{(N/D-1)}\}$ such that $p^{(j)}_{n-d..n-1} = 0$ and $p^{(j)}_{0..n-d-1}$ is the
binary representation of j for $j = 0, 1, \ldots, N/D-1$.
In this way, we can create a schedule of 1-permutable sets in the PE array.
Running this modified version of Cormen's algorithm on a SIMD machine allows
us to improve the performance of the algorithm by parallelizing some of its steps. When
we perform a computation for each element in a large set, we benefit from doing that
computation in parallel within the PE array.
The first two steps in creating a 1-permutable set (finding the set of basis columns
S and, from that, creating the sets T, U, and V) do not require multiple calculations
which depend on the values of the PE numbers. Hence, these steps will not take
advantage of the capabilities of the DPU and are better run on the console, a much
faster scalar processor. However, the third step, defining a permutation R on the
set of PEs, does need to perform operations with different data for each PE.
Specifically, this step must rearrange the bits of each PE number. If we had to perform
this calculation serially for 2048 processors, we would spend a significant amount of
time on this one operation. By running this step in the PEs, we perform the entire
computation in parallel; each PE refers to its own number and permutes its own
address. Similarly, the fourth step in creating the 1-permutable set, defining the source
addresses, lends itself easily to running in parallel. The PEs can refer to the relevant
bits of their own addresses, and the source addresses are all computed in parallel.
Once we have computed the 1-permutable set, we can also create the schedule in
parallel. We now have the advantage that every PE already contains an element of the
1-permutable set. All we must do is XOR the least significant v bits of that element
with 0, 1, ..., N/D - 1 in turn to determine which record within each PE is part of
the current 1-permutable set. (Remember from Chapter 2 that v is the number of bits
giving the offset of a record within a PE.)
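The per-PE step of the schedule is small enough to sketch. In the code below the list index stands in for the PE number and the list comprehension stands in for what the PE array would do in one parallel step; `base_offsets`, `offsets_for_set`, and `all_sets` are hypothetical names, not identifiers from the MPL implementation.

```python
def offsets_for_set(base_offsets, j):
    """Offset, within each PE, of the record that belongs to set j.

    base_offsets[p] is the offset (low-order v bits) of the element of the
    first 1-permutable set that resides on PE p; XORing it with j gives the
    element of the j-th set on the same PE.
    """
    return [offset ^ j for offset in base_offsets]


def all_sets(base_offsets, v):
    """All N/D = 2^v sets of the schedule, as lists of per-PE offsets."""
    return [offsets_for_set(base_offsets, j) for j in range(1 << v)]
```

Since XOR with j is a bijection on v-bit values, every PE sends each of its N/D records in exactly one set of the schedule.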
We see more thoroughly in the next section how Cormen's algorithm¹ takes
advantage of properties of the parallel network.
4.3 Coding Issues
To test the effectiveness of Cormen's algorithm at speeding up BMMC permutations
in the PE array, we use two methods to route the permutations through the network:
a straightforward "naive" algorithm and Cormen's algorithm.
Implementing the naive algorithm
The simplest way to route a BMMC permutation in the PE array is to cycle through
the records in each PE, sending them in order regardless of their target addresses.
Each PE contains N/D records. Thus, we can keep a counter i, which increments from
0 to N/D - 1, and with each iteration send the ith record in each PE to the storage
location given by its target address. This can be thought of as a row-by-row routing.
Referring back to Figure 2.1, we see that sending the first element from each processor
is comparable to sending the first "row" of data in the PE array.
¹ Throughout the rest of this thesis, we will use the version of Cormen's algorithm modified to
run on the PE array rather than a parallel disk system. The changes we had to make to the original
algorithm are not significant enough to warrant continued differentiation between the two versions;
hence we will refer to the modified version simply as Cormen's algorithm.
This naive method computes a schedule of N/D sets of records to route through
the network. This is the same number of sets that Cormen's algorithm generates.
With this row-by-row method, however, we have no guarantees about the permutability
of each set of records we send. We know that each PE contains exactly one source
record per set, but we know nothing about how many target records are mapped to
each PE. This mapping may be significant when we examine this routing in light of
the expanded delta network model we presented in Section 3.3.
Let us assume for the moment that we are not dealing with a restricted access
EDN; that is, we have one input and one output port to the network for each PE.
When we send one source record from each PE, then, we should have no problem
getting the data out into the network. It is likely, however, that the target addresses
for two or more of these records are contained on the same PE.² Those records need
to share the output port from the network to that destination PE. Since only one
record can use the port at once, we have contention for the network ports, causing
these records to be routed serially to their destinations. This is the condition we are
able to avoid by using Cormen's schedule of 1-permutable sets.
2
In fact, one can use a standard high-probability argument for random permutations to show
that with probability 1 (1=D), there exists a PE that is the target of (lgD= lg lgD) source PEs.
For BMMC permutations, the number of source PEs that route to one target PE is an increasing
function of the rank of a particular submatrix of the characteristic matrix.
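To see how uneven a single row-by-row send can be, one can count target records per PE for a given row. The sketch below assumes the PE-array layout described earlier (high-order d bits give the PE, low-order n - d bits give the offset within the PE) and an unrestricted network with one port per PE. The names `max_records_per_target_pe` and `swap_halves` are illustrative, and `swap_halves` is just a convenient transpose-like permutation chosen for the demonstration.

```python
from collections import Counter


def max_records_per_target_pe(permute, n, d, row):
    """Largest number of records from one row that land on the same PE."""
    v = n - d                                # offset bits within a PE
    counts = Counter()
    for pe in range(1 << d):
        source = (pe << v) | row             # record `row` on PE `pe`
        counts[permute(source) >> v] += 1    # tally by target PE
    return max(counts.values())


def swap_halves(x, half=2):
    """Exchange the low and high `half` bits of a 2*half-bit address."""
    low = x & ((1 << half) - 1)
    return (low << half) | (x >> half)
```

For n = 4 and d = 2, the identity permutation gives a maximum of 1 for every row, whereas `swap_halves` sends all four records of a row to the same PE.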
Implementing Cormen's algorithm
Using Cormen's algorithm, we guarantee that each PE will only send and receive
one record for each set of records in the schedule. Therefore, we can avoid the port
contention that our control (naive) algorithm suffers within the non-restricted EDN.
However, Cormen's algorithm involves several computations to create the 1-permutable
set before we can send any data. To maximize our benefit from the congestion-free
routing, then, we must minimize the amount of time needed to compute the first
1-permutable set. We used several optimizations in coding both Cormen's algorithm
and the naive algorithm, in an attempt to speed up the first set generation and also
to ensure that we were not using poor methods for either algorithm. Using the same
optimizations in both implementations makes our comparison of the algorithms more
fair. We list some of these "coding tricks" here.
• We stored our matrices in column-major order, with each column packed into
a word. (We assumed that all addresses in our system fit in one word.) This
organization simplifies matrix-vector multiplication (such as we perform every
time we compute a record's target address from its source address) by allowing
columns of the matrix to be XORed in one step.
• We stored our sets S, T, and U packed into single words as well. This approach
decreases the amount of storage space required. It also allows us to determine
easily what elements are in a given set by shifting the bits of the word, which is
much faster than accessing elements of an array as we would have done had we
not packed the sets.
• We used a method developed by Len Wisniewski [Wis94] for computing the set
of basis columns, S. This method works as follows:
1. Create a submatrix $A_{\mathrm{bottom}}$ of A, containing the lower d rows of A.
2. Find the leftmost column, j, of $A_{\mathrm{bottom}}$ that contains a 1 and is not already
in S.
3. Choose a row i containing a 1 in column j.
4. Add column j into every column of $A_{\mathrm{bottom}}$ to the right which has a 1 in
row i.
5. Add column j to S.
6. Repeat steps 2 through 5 until the basis contains d columns.
• We ran as many routines as possible on the console. A large part of the
generation of the first 1-permutable set (finding the basis columns S and defining
the sets T, U, and V) does not require parallel computations, and so we copy
these variables out to the console, compute their values there, and then copy
them back to the DPU.
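The first and third tricks can be sketched together. The code below packs each matrix column into an integer (bit r holds row r), multiplies by XORing the selected columns, and follows the listed elimination steps to pick basis columns. It is a reading of the description above, not Wisniewski's code; the function names are illustrative, row i is taken to be the lowest-numbered row with a 1 (the text allows any such row), and the left-to-right scan relies on the fact that columns to the left of the current one are never modified again.

```python
def pack_columns(rows):
    """Pack each column of a 0/1 matrix into an int; bit r holds row r."""
    return [sum(row[j] << r for r, row in enumerate(rows))
            for j in range(len(rows[0]))]


def matvec(cols, x):
    """y = A x over GF(2): XOR together the columns selected by bits of x."""
    y = 0
    for j, col in enumerate(cols):
        if (x >> j) & 1:
            y ^= col
    return y


def find_basis(cols, d):
    """Pick d linearly independent columns of a d-row packed submatrix.

    Returns the basis S as an increasing list of column indices.
    """
    cols = list(cols)                    # work on a copy
    S = []
    for j in range(len(cols)):
        if len(S) == d:
            break
        if cols[j] == 0:
            continue                     # column j depends on earlier picks
        pivot = cols[j] & -cols[j]       # lowest row holding a 1 in column j
        for k in range(j + 1, len(cols)):
            if cols[k] & pivot:
                cols[k] ^= cols[j]       # clear row i in columns to the right
        S.append(j)
    return S
```

Applied to the first three rows of the example matrix from Section 4.1.1, `find_basis` selects columns 1, 3, and 4, the same basis chosen there. On the PE array the submatrix would instead hold the lower d rows of A.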
Until now, we have been assuming that the EDN within the PE array is
non-restricted. We saw in Section 3.3 that this is not really the case. The router network
within the DECmpp has only one network port per cluster rather than one per PE.
Because Cormen's algorithm guarantees that each PE will only send and receive one
record at a time, each cluster receives 16 records from each set of the schedule. Those
16 records have to be routed serially to their destination PEs. This method produces
less contention than we expect with our naive method since, as we have seen, a
row-by-row send may route more than 16 records to one cluster. We would like a way,
however, to eliminate the port contention inherent in the 1-permutable set of PEs.
4.4 Forming a 1-permutable Set of Clusters
To alleviate the problem of contention for network ports, we devise a method to route
a 1-permutable set containing only one record per network port at a time. We can
use Cormen's algorithm to do so, by thinking of the clusters rather than the PEs as
the independent devices involved. We compute a 1-permutable set of the clusters and
then create a schedule to send all of the records within each cluster in sequential sets.
Figure 4.1 shows how we parse a cluster-oriented address using Cormen's algorithm
on clusters.
When we try to access records within a cluster, we encounter the same problem
described in Section 3.3. We would like a cluster-oriented number scheme for
addressing the PEs, rather than the matrix-oriented scheme inherent in the DECmpp's native
language, MPL. We use the BPC transformation between the two addressing systems
discussed in that section to permute the address bits of the PEs, thus renumbering
them according to the addressing scheme we want.

[Figure 4.1 shows the address bits x0 through x12 divided into fields.]
Figure 4.1: Parsing the address $x = (x_0, x_1, \ldots, x_{n-1})$ of a record using the cluster rather
than the PE as the independent device. This address layout uses cluster-oriented addressing.
Here we have n = 13, v = 2 and c = 7. The cluster size is 16, giving 4 bits for the offset of
a PE within the cluster.
Since our characteristic matrix A and complement vector c give us a BMMC
permutation on the actual, matrix-oriented layout of PEs, we must adjust this
permutation to yield the same results on our new layout. To do so, we create two
characteristic matrices: one for the transformation from matrix-oriented addressing
to cluster-oriented addressing and one for the inverse transformation, cluster-oriented
to matrix-oriented.
From the original BMMC permutation, which A and c give us, we create a new
characteristic matrix $A_{\mathrm{cluster}}$ and complement vector $c_{\mathrm{cluster}}$, which simulate the same
BMMC permutation on a cluster-oriented layout of PEs. Let us call the characteristic
matrix of the permutation of the PE addresses from matrix-oriented to
cluster-oriented G, and the inverse permutation from cluster-oriented to matrix-oriented $G^{-1}$.
Through multiplication of A, c, G, and $G^{-1}$, we create $A_{\mathrm{cluster}}$ and $c_{\mathrm{cluster}}$ using the
equations
\[
\begin{aligned}
A_{\mathrm{cluster}} &= G A G^{-1} , \\
c_{\mathrm{cluster}} &= G c .
\end{aligned}
\]
This is the BMMC permutation we perform on the cluster-oriented PEs.
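The change of layout is a conjugation, which a few lines of GF(2) arithmetic can confirm: if z = Gx is the cluster-oriented form of x, then $A_{\mathrm{cluster}} z \oplus c_{\mathrm{cluster}} = G(Ax \oplus c)$. The sketch below stores matrices as lists of rows and builds G from a bit map `pi`, assuming that bit j moves to position pi[j]; all function names are illustrative.

```python
def matmul(X, Y):
    """Product of two 0/1 matrices (lists of rows) over GF(2)."""
    return [[sum(X[i][k] & Y[k][j] for k in range(len(Y))) % 2
             for j in range(len(Y[0]))]
            for i in range(len(X))]


def bit_permutation_matrix(pi):
    """Matrix G with (G x)[pi[j]] = x[j]: bit j moves to position pi[j]."""
    n = len(pi)
    return [[int(pi[j] == i) for j in range(n)] for i in range(n)]


def to_cluster_layout(A, c, pi):
    """Return (A_cluster, c_cluster) = (G A G^-1, G c) for the bit map pi."""
    G = bit_permutation_matrix(pi)
    G_inv = [list(col) for col in zip(*G)]   # inverse of a permutation matrix
    A_cluster = matmul(matmul(G, A), G_inv)
    c_cluster = [row[0] for row in matmul(G, [[bit] for bit in c])]
    return A_cluster, c_cluster


def apply_affine(M, x, c):
    """y = M x XOR c over GF(2), with x, c, and y as bit lists."""
    return [(sum(a & b for a, b in zip(row, x)) % 2) ^ ci
            for row, ci in zip(M, c)]
```

The transpose serves as the inverse here only because G is a permutation matrix; a general nonsingular G would need a true GF(2) inversion.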
By using a 1-permutable set of clusters, we can remove the port contention in
routing the permutation. We are, however, introducing more overhead into the
computations. We must calculate the transformations back and forth between
matrix-oriented addressing and cluster-oriented addressing. Also, we have reduced the
number of records in each 1-permutable set from one per PE, or D records, to one per
cluster, or C records, where C = D/16. Using D records per set, we can route all
N records in N/D sets. Using C records per set, we need 16N/D sets to route all
N records. We have increased the number of sets in our schedule by a factor of 16.
Thus, the loop overhead associated with sending each set also increases.
In the next chapter, we examine the results we obtained from all three methods
for routing BMMC permutations: the 1-permutable set of PEs, the 1-permutable set
of clusters and our naive row-by-row scheduling. We shall look at both the
intra-network congestion each of the methods produced and the actual time in which each
method could perform the entire permutations.
Chapter 5
Evaluation of Results
In Chapter 4, we described our algorithm for routing BMMC permutations. We would
like to perform these permutations more quickly by using this algorithm than we can
by using the naive row-by-row routing of the records. We hope that by reducing
the number of collisions that occur at the network ports through the calculation
of a schedule of 1-permutable sets, we can lower the total running time (including
the schedule calculation) to route the permutation. In this chapter, we examine the
empirical data gathered for routing BMMC permutations using 1-permutable sets of
PEs, 1-permutable sets of clusters, and the naive algorithm.
There are two main factors that figure into the total time each algorithm requires:
congestion at the network ports, and the overhead time to set up the routing schedule
before we can send any records. The 1-permutable-set algorithm reduces the number
of intra-network collisions from the number we observe under the naive method. There
is, however, virtually no setup time associated with the naive method, whereas the
1-permutable-set algorithm must perform several computations to create a schedule.
We analyze each of these factors separately and then report overall running times for
each of the three methods.
5.1 Congestion
We can measure the amount of contention at the network ports by counting the
number of iterations that the global message router uses in sending a given set of
records from their source locations to their target locations. Since it can only send
one message through a network port at a time, the number of iterations should equal
the number of messages trying to enter each cluster. We call this number of iterations
the port serialization. For example, if we send a record from one PE within each
cluster, and each record's target address guarantees that only one PE per cluster will
receive a record, the port serialization is 1. No serial accesses of network ports need
to be made. If, on the other hand, we send one record from a particular PE to every
PE in the PE array, the port serialization is D, because all D messages must queue
up at the sending PE's input port to the network.
In this section, we predict the port serialization we expect to see for our method
using PEs as the independent devices, our method using clusters as the independent
devices, and the naive method, based on our discussion of the algorithms in
Chapter 4. Then, we compare these predictions to empirical data observed with the three
methods.
If we use PEs as the independent devices, our algorithm generates a schedule of
1-permutable sets of PEs. By definition, then, when we route each set, each PE sends
and receives exactly one record. Each cluster (and each network port) must send
and receive exactly 16 records. These 16 records are accessed serially at the network
ports, and so we expect the port serialization for this method to be 16.
Running our algorithm with clusters as the independent devices yields a schedule
of 1-permutable sets of the clusters. For each set routed, each cluster sends and
receives only one record. This situation is analogous to our best case example described
above, and thus the port serialization should be 1.
We cannot predict an exact number of collisions that we expect when using the
naive row-by-row method. The number of records sent to each cluster during a given
routing step varies depending on the permutation we perform. If, for a given row, the
target addresses of the records in that row all lie in different PEs, the port serialization
will be 16, just as in the PE-permutable-set method. The identity permutation, for
example, where the source address and target address for each record are the same,
yields a port serialization of 16. If all the target addresses of the records in a row
lie within the same cluster, we should observe a port serialization of D, since all the
records must serially enter that cluster's network port. An example of this case is a
matrix transpose permutation.
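These predictions amount to counting, for one routing step, how many records enter the busiest cluster. The sketch below is a simplified model: it assumes cluster-oriented PE numbering, so that PE p lies in cluster p // 16, and it ignores congestion inside the network switches. The name `predicted_port_serialization` is hypothetical.

```python
from collections import Counter

CLUSTER_SIZE = 16


def predicted_port_serialization(target_pes, cluster_size=CLUSTER_SIZE):
    """Largest number of records of one routing step entering any cluster.

    target_pes lists the destination PE (cluster-oriented number) of every
    record sent in the step.
    """
    per_cluster = Counter(pe // cluster_size for pe in target_pes)
    return max(per_cluster.values())
```

On a 2048-PE array this gives 16 when every PE receives one record, 1 when one PE per cluster receives a record, and 2048 when all 2048 records of a row are bound for the same cluster.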
We gathered data on the port serialization using the system-defined routerCount
variable in MPL, which counts the number of iterations of the global message router.
These iterations are caused by two types of occurrences in the network. One is
contention for network ports. The other is congestion in the switches of the network,
shown in Figure 3.2 as the lines between the three interior stages of the EDN [Sub93].
When we gathered data on port serialization under the three algorithms, we noted
that in a small percentage of the runs, routerCount exceeds what we expect the port
serialization to be under the PE-permutable-set and cluster-permutable-set
methods. The values that we observed, both on our 2048 PE machine and a 4096
PE MasPar at University of California Irvine,¹ are given in Figure 5.1. This
discrepancy arises because not all BMMC permutations are EDN-passable. When we route
a permutation that is not EDN-passable, routerCount measures congestion within
the switches as well as port serialization. The port serialization we observed under
the naive method is, as expected, significantly higher than under the permutable-set
methods. Figure 5.2 shows this data.

  __routerCount   2048 PE machine   4096 PE machine
  1               88.1%             82.7%
  2               11.9%             17.0%
  4                0.0%              0.3%

Figure 5.1: Frequencies of router counts observed under the cluster-permutable-set
method. The 4096 PE machine is lilliput.ics.uci.edu.

  VPR    Min. Port       Max. Port       Avg. Port
         Serialization   Serialization   Serialization
  1      16              16              16.0
  2      16              32              18.0
  4      16              64              19.3
  8      16              128             19.7
  16     16              256             20.1
  32     16              512             20.1
  64     16              1024            21.6
  128    16              2048            22.4
  256    16              4096            24.6
  512    16              8192            29.5
  1024   16              16384           34.1

Figure 5.2: Port serialization under the naive routing algorithm for various sizes of source
vectors.

The method of Subramanian and Scherson (given in [SS93]) which we discussed in
Section 3.3 guarantees that every permutation can be routed in two passes through
the DECmpp network. We implemented a specialized routing algorithm for BMMC
permutations using their EDN model of the network. The algorithm computed a
schedule of sets using Cormen's method, and routed each element of the set to an
intermediate destination and then its final destination, according to the two stages of
the EDN. We found that, as Subramanian and Scherson state, we could route all the
permutations through the network in two passes with this algorithm. The setup time
was significant, however, especially given that when we omitted the routing to the
intermediate destinations and instead routed directly to the final destinations, we
observed the results in Figure 5.1.
¹ We thank Isaac Scherson, whose work is supported by NSF/MASPAR grant number
MIP-9205737, for access to the MasPar at UCI.
5.2 Overhead
We have seen that using either the 1-permutable-sets-of-clusters or the
1-permutable-sets-of-PEs method results in significantly fewer collisions at the network ports.
This property is clearly a benefit of routing BMMC permutations using one of these
algorithms rather than sending row-by-row. Both these algorithms, however, require
us to calculate the first 1-permutable set. Even when we run the scalar portions of
this code on the console, as we discussed in Section 4.3, the time taken to execute this
portion of the algorithm is significant. Certainly, for very small VPRs, the overhead
associated with the calculation of the first 1-permutable set will be higher than the
time to resolve the intra-network collisions of the naive method. We hope that, as the
VPR increases (and the number of sends needed to route all of the data also increases),
the one-time cost of generating the first 1-permutable set will be compensated by the
time saved from low port serialization.
There is also a measurable overhead inherent in creating each 1-permutable set in
the schedule, even once we have calculated the first set. We must XOR each source
address with the current value of the mask to determine whether a given record is
part of the current set. Even the simple loop statement we use to increment the
set number within the schedule eats up a little bit of time. When we send only one
record per cluster rather than one record per PE in each set, we must create 16 times
more sets in the schedule to route every record. By doing so, we ensure that the port
serialization will usually be 1 rather than the 16 we obtain by sending one record
from each PE. Does the time saved by avoiding 16 times as many serial accesses
within the network offset the overhead of running 16 times as many iterations of
our main loop?
In the next section, we report the actual timing figures obtained by routing BMMC
permutations using each of the three algorithms. We shall analyze these figures in
light of the tradeoffs discussed above to determine which of the methods yields the
most favorable results.
5.3 Timing
Two essential measures in determining the value of an algorithm are the overall time
and space it requires. We know that running either of the permutable-set algorithms
Chapter 5. Evaluation of Results 53
requires extra space to hold infomation for creating the schedule of 1-permutable sets.
The naive method does not require this extra space. A partially redeeming factor of
the permutable-set algorithms is that they require only (D) space for computation,
not (N). Thus, as the VPR increases, for a given size PE array the space needed
remains constant.
We now examine the time taken to actually run BMMC permutations on the
DECmpp using each method and show that the permutable-set algorithms require
less time than the naive method for suciently large VPRs. We tested each algorithm
on characteristic matrices which produce various amounts of congestion within the
network, to determine the threshold VPR for which the cost of computing the rst
1-permutable set outweighs the cost of resolving intra-network collisions.
To test the case in which there is as little network congestion as possible, we
ran the three algorithms on the identity matrix. As expected, our naive algorithm
performed much better than either the PE-permutable-set or the cluster-permutable-
set methods, as shown in Figure 5.3. Since there is no contention in the network to
start with, computing contention-free sets will not help routing time and will increase
overhead computation time.
VPR    Cluster   PE      Naive
1      0.011     0.007   0.001
2      0.016     0.038   0.022
4      0.025     0.069   0.063
8      0.043     0.012   0.005
16     0.08      0.017   0.011
32     0.155     0.028   0.021
64     0.308     0.049   0.043
128    0.619     0.092   0.065
256    1.251     0.178   0.121
512    2.53      0.352   0.295
1024   5.128     0.702   0.654
Figure 5.3: Routing times (in seconds) for the identity permutation, using each of three
algorithms. We give times for permuting data sets with VPRs of 1 to 1024.
At the other end of the spectrum, we ran the three algorithms on a transpose
matrix to maximize the amount of congestion in the network. Referring to Figure 2.1
and viewing the PE array as a matrix, we can intuitively understand how this
permutation creates maximum congestion. A matrix transpose swaps the rows and the
columns of a matrix. In the PE array, this pattern means sending the first record
in each processor to the first PE. The 1-permutable-set algorithms will compute a
schedule to guarantee that only one record is sent to each PE in each pass, but the
naive algorithm will route the entire first row at once. Since all the first-row records
map to the first processor, they must all be routed serially. This situation is the
worst case for the naive method or, equivalently, the case in which our 1-permutable-set
algorithm can beat the naive method by the greatest margin. The running times
of the three algorithms on a matrix transpose permutation confirm this intuition, as
shown in Figure 5.4.
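The contrast between the two extremes can be illustrated with a small counting model. The following is only a sketch under stated assumptions, not the MPL code used in this thesis: it assumes the square case in which the VPR equals the number of PEs, assumes that PE p belongs to cluster p // 16, and uses hypothetical function names. For each row routed by the naive method, it counts how many records are bound for the same PE and for the same cluster.

```python
from collections import Counter

PES_PER_CLUSTER = 16  # from the text: 16 PEs share one network port


def identity_target(row, pe):
    """Identity permutation: every record stays on its own PE."""
    return pe


def transpose_target(row, pe):
    """Square transpose (assumes VPR == number of PEs):
    record (row, pe) moves to (pe, row), so it lands on PE `row`."""
    return row


def worst_row_load(target_of, num_pes, vpr, group_size=1):
    """Largest number of records in any single row bound for one group of PEs.

    group_size=1 counts records per target PE; group_size=PES_PER_CLUSTER
    counts records per target cluster (assumption: cluster = pe // 16).
    """
    worst = 0
    for row in range(vpr):
        load = Counter(target_of(row, pe) // group_size
                       for pe in range(num_pes))
        worst = max(worst, max(load.values()))
    return worst
```

With a hypothetical array of 64 PEs, the identity gives at most 1 record per target PE and 16 per target cluster in each row, while the transpose sends all 64 records of a row to a single PE, which is the serial routing described above.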
VPR    Cluster   PE      Naive
1      0.011     0.008   0.001
2      0.017     0.011   0.005
4      0.027     0.013   0.010
8      0.043     0.013   0.029
16     0.080     0.023   0.075
32     0.169     0.044   0.125
64     0.336     0.108   0.457
128    0.620     0.194   1.779
256    1.251     0.301   7.022
512    2.532     0.680   27.936
1024   5.130     1.136   55.876
Figure 5.4: Routing times (in seconds) for the matrix transpose permutation, using each
of the three algorithms.
Perhaps more interesting than either of these extreme situations is timing the
"average case" matrix with each of the algorithms. Any nonsingular matrix over
GF(2) characterizes a BMMC permutation and can be routed using these algorithms.
Almost every nonsingular matrix falls somewhere between the two cases described
above with respect to port contention in the network and consequently the overall
running time. To test the average-case performance of the algorithms, we ran them
on randomly generated nonsingular matrices.
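One way to produce such test matrices is rejection sampling: draw random bit matrices until one has full rank over GF(2). The sketch below is an illustration only, not the generator used for these experiments. It assumes each matrix row is stored as an integer bitmask, omits any complement vector so that a target address is simply y = Ax, and all function names are hypothetical.

```python
import random


def gf2_rank(rows):
    """Rank over GF(2) of a matrix whose rows are integer bitmasks."""
    rows = list(rows)
    rank = 0
    for i in range(len(rows)):
        pivot = rows[i]
        if pivot == 0:
            continue  # row was a combination of earlier rows
        rank += 1
        low = pivot & -pivot  # lowest set bit serves as the pivot column
        for j in range(i + 1, len(rows)):
            if rows[j] & low:
                rows[j] ^= pivot  # eliminate the pivot bit from later rows
    return rank


def random_nonsingular(n, rng):
    """Draw random n x n bit matrices until one is nonsingular over GF(2)."""
    while True:
        rows = [rng.getrandbits(n) for _ in range(n)]
        if gf2_rank(rows) == n:
            return rows


def bmmc_target(rows, x):
    """Target address y = A x over GF(2); bit i of y is parity(row_i AND x)."""
    y = 0
    for i, row in enumerate(rows):
        y |= (bin(row & x).count("1") & 1) << i
    return y
```

Because the matrix is nonsingular, mapping every n-bit source address through bmmc_target visits every target address exactly once, which is what makes the mapping a permutation.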
It is much less obvious which algorithm will win out on these matrices and, as we
see in Figures 5.5 and 5.6, the size of the source vector determines which method is
faster. For VPRs less than 2^7, the naive method beats the 1-permutable-set-of-PEs
method. Once we reach a VPR of 2^7, or 128 records per PE, the 1-permutable-set-of-PEs
method becomes faster than routing the records row-by-row.
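The random-matrix timings behind Figures 5.5 and 5.6 are given only as plots, but the same notion of a crossover VPR can be read mechanically from the tabulated transpose times of Figure 5.4. The following sketch copies that table and finds the first VPR at which one method is strictly faster than another; the helper name and the table layout are assumptions made for illustration.

```python
# Times in seconds copied from Figure 5.4 (matrix transpose permutation).
# Each entry is (cluster method, PE method, naive method).
FIGURE_5_4 = {
    1: (0.011, 0.008, 0.001),
    2: (0.017, 0.011, 0.005),
    4: (0.027, 0.013, 0.010),
    8: (0.043, 0.013, 0.029),
    16: (0.080, 0.023, 0.075),
    32: (0.169, 0.044, 0.125),
    64: (0.336, 0.108, 0.457),
    128: (0.620, 0.194, 1.779),
    256: (1.251, 0.301, 7.022),
    512: (2.532, 0.680, 27.936),
    1024: (5.130, 1.136, 55.876),
}
CLUSTER, PE, NAIVE = 0, 1, 2  # column indices into each tuple


def first_faster_vpr(table, method, baseline):
    """Smallest VPR at which `method` is strictly faster than `baseline`,
    or None if it never is."""
    for vpr in sorted(table):
        if table[vpr][method] < table[vpr][baseline]:
            return vpr
    return None
```

On the transpose data, the PE method first beats the naive method at a VPR of 8 and the cluster method first beats it at a VPR of 64, while the cluster method never beats the PE method.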
A certain degree of optimization is built into the PE communication system. Given
that these optimizations are implemented at a low level, whereas we must code in
higher level MPL, it is impressive that our algorithm is able to outperform the naive
method at all. If we could implement our permutable-set algorithm at a lower level,
the crossover point would likely be at much lower VPRs.
[Plot: lg time (sec) against lg VPR, for lg VPR from 0 to 10. Series: Cluster Method, PE Method, Naive Method.]
Figure 5.5: Scatter plot of running times for three methods, routing randomly generated
nonsingular matrices.
[Plot: lg time (sec) against lg VPR, for lg VPR from 0 to 10. Series: Average Cluster, Average PE, Average Naive.]
Figure 5.6: The average running times for the three methods, routing randomly generated
nonsingular matrices.
Interestingly, we can observe that the algorithm computing the 1-permutable set
of clusters is not competitive with the other two methods. Although this algorithm
assures that there is no congestion within the network, the duty cycle is too low
for this method to compete. The overhead caused by the higher number of sets in
the schedule more than cancels the benefit of the total lack of collisions.
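The size of this tradeoff follows from simple counting. The sketch below is illustrative only: it takes the duty cycle to mean the fraction of PEs that send a record in each set, which is an interpretation rather than a definition given here, and the array size in the usage note is hypothetical.

```python
PES_PER_CLUSTER = 16  # from the text: 16 PEs per cluster


def schedule_stats(num_pes, vpr, one_record_per_cluster):
    """Return (number of sets in the schedule, fraction of PEs sending per set).

    Assumes every set routes one record per PE, or one record per cluster
    when one_record_per_cluster is True.
    """
    num_records = num_pes * vpr
    if one_record_per_cluster:
        records_per_set = num_pes // PES_PER_CLUSTER
    else:
        records_per_set = num_pes
    num_sets = num_records // records_per_set
    duty_cycle = records_per_set / num_pes
    return num_sets, duty_cycle
```

For a hypothetical array of 2048 PEs at a VPR of 128, the PE method needs 128 sets with every PE sending in each one, whereas the cluster method needs 2048 sets with only one PE in 16 sending.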
Chapter 6
Conclusions
We have adapted Cormen's algorithm for performing BMMC permutations on parallel
disk systems to work in the PE array of a DECmpp. This algorithm computes a
schedule of 1-permutable sets of records, so that exactly one record is sent from and
received into each independent device for each set of data records. In the PE array,
each PE sends and receives one record for each set of the schedule. No more than
16 (the number of PEs per cluster) records can ever be routed to the same cluster
or equivalently, the same network port, at once. When permuting data using a naive
row-by-row routing method, the number of records through a single network port can
be much higher. Thus, by limiting the contention for network ports, our algorithm
decreases the total routing time for BMMC permutations. We can reduce the port
contention even further by creating the 1-permutable set with only one record per
cluster. This guarantees that there will be no contention for network ports during
routing. However, the overhead inherent in routing 16 times as many sets overshadows
the time we gain from reduced contention.
Our algorithm must compute the first 1-permutable set of the schedule before
we can do any routing. Since the naive method does not need to perform these
calculations, on small source vectors it runs faster overall than our algorithm,
despite the increased contention for network ports.
For VPRs of at least 2^7, or 128 records per PE, our algorithm beats the naive method
in overall running time. Our algorithm requires extra space to hold the information
on the schedule of 1-permutable sets; however, this space depends only on the number
of PEs in the array and is independent of the number of records in the source vector.
For routing BMMC permutations, our algorithm compares favorably with that
given by Scherson and Subramanian. They must compute a bipartite edge coloring
to determine the routing schedule before the beginning of their routing algorithm, an
expensive calculation that they perform off-line. We are able to perform our entire
algorithm on-line.
This thesis has provided
• a routing algorithm for BMMC permutations that beats the naive method on
large VPRs,
• an understanding of the coding issues involved in implementing Cormen's
algorithm, and
• an understanding of the relative speeds of built-in routing in comparison to
MPL-coded routing.
We have seen the power of efficient algorithms. For sufficiently large VPRs, our
adaptation of Cormen's algorithm for BMMC permutations, coded in MPL, can
outperform the naive method, even though the naive method's optimizations are
implemented at a much lower level.
Bibliography
[CLR90] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduc-
tion to Algorithms. The MIT Press, Cambridge, Massachusetts, 1990.
[Cor92] Thomas H. Cormen. Virtual Memory for Data-Parallel Computing. PhD
thesis, Department of Electrical Engineering and Computer Science, Mas-
sachusetts Institute of Technology, 1992. Available as Technical Report
MIT/LCS/TR-559.
[Cor93] Thomas H. Cormen. Fast permuting in disk arrays. Journal of Parallel and
Distributed Computing, 17(1-2):41-57, January and February 1993.
[CSW93] Thomas H. Cormen, Thomas Sundquist, and Leonard F. Wisniewski.
Asymptotically tight bounds for performing BMMC permutations on par-
allel disk systems. Submitted to IEEE Transactions on Parallel and Dis-
tributed Systems. Preliminary version appeared in Proceedings of the 5th
Annual ACM Symposium on Parallel Algorithms and Architectures, 1993.
[Dig92a] Digital Equipment Corporation, Maynard, Massachusetts. DECmpp Pro-
gramming Language (ANSI) Reference Manual, Version 3.1 Field Test edi-
tion, December 1992.
[Dig92b] Digital Equipment Corporation, Maynard, Massachusetts. DECmpp Pro-
gramming Language (ANSI) User's Guide, Version 3.1 Field Test edition,
December 1992.
[NV91] Mark H. Nodine and Jeffrey Scott Vitter. Large-scale sorting in parallel
memories. In Proceedings of the 3rd Annual ACM Symposium on Parallel
Algorithms and Architectures, pages 29-39, July 1991.
[SS93] Isaac D. Scherson and Raghu Subramanian. Efficient off-line routing of
permutations on restricted access expanded delta networks. In Proceedings of
the 7th International Parallel Processing Symposium, pages 284-290, April
1993.
[Sub93] Raghu Subramanian. Private Communication, December 1993.
[VS90] Jeffrey Scott Vitter and Elizabeth A. M. Shriver. Optimal disk I/O with
parallel block transfer. In Proceedings of the Twenty Second Annual ACM
Symposium on Theory of Computing, pages 159-169, May 1990.
[Wis94] Leonard Wisniewski. Private Communication, April 1994.
