Multiplication of Matrices of Arbitrary Shape on a Data Parallel Computer by Mathur, Kapil K. & Johnsson, S. Lennart
Multiplication of Matrices of Arbitrary
Shape on a Data Parallel Computer
The Harvard community has made this
article openly available.  Please share  how
this access benefits you. Your story matters
Citation Mathur, Kapil K. and S. Lennart Johnsson. Multiplication of Matrices
of Arbitrary Shape on a Data Parallel Computer. Harvard Computer
Science Group Technical Report TR-01-92.
Citable link http://nrs.harvard.edu/urn-3:HUL.InstRepos:23017262
Terms of Use This article was downloaded from Harvard University’s DASH
repository, and is made available under the terms and conditions
applicable to Other Posted Material, as set forth at http://
nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-
use#LAA
Multiplication of Matrices of Arbitrary
Shape on a Data Parallel Computer
Kapil K. Mathur
S. Lennart Johnsson
TR-01-92
January 1992
Parallel Computing Research Group
Center for Research in Computing Technology
Harvard University
Cambridge, Massachusetts
Multiplication of Matrices of Arbitrary Shape
on a Data Parallel Computer
Kapil K. Mathur and S. Lennart Johnsson
1
Thinking Machines Corp.
245 First Street
Cambridge, MA 02142
Abstract
Some level{2 and level{3 Distributed Basic Linear Algebra Subroutines (DBLAS) that
have been implemented on the Connection Machine system CM{200 are described. No
assumption is made on the shape or size of the operands. For matrix{matrix multiplica-
tion, both the nonsystolic and the systolic algorithms are outlined. A systolic algorithm
that computes the product matrix in{place is described in detail. We show that a level{3
DBLAS yields better performance than a level{2 DBLAS. On the Connection Machine
system CM{200, blocking yields a performance improvement by a factor of up to three
over level{2 DBLAS. For certain matrix shapes the systolic algorithms oer both improved
performance and signicantly reduced temporary storage requirements compared to the
nonsystolic block algorithms.
We show that, in order to minimize the communication time, an algorithm that leaves
the largest operand matrix stationary should be chosen for matrix{matrix multiplication.
Furthermore, it is shown both analytically and experimentally that the optimum shape
of the processor array yields square stationary submatrices in each processor, i.e., the
ratio between the length of the axes of the processing array must be the same as the
ratio between the corresponding axes of the stationary matrix. The optimum processor
array shape may yield a factor of ve performance enhancement for the multiplication of
square matrices. For rectangular matrices a factor of 30 improvement was observed for
an optimum processor array shape compared to a poorly chosen processor array shape.
1
Also aliated with the Division of Applied Sciences, Harvard University, Cambridge, MA 02138.
1
1 Introduction.
This article describes the algorithms used for matrix{vector, vector{matrix multiplica-
tion, rank{1 updates, and matrix-matrix multiplication on matrices distributed across
the memory of the Connection Machine system CM{200. This system has up to 2048
oating{point processors that support operations in both 32{bit and 64{bit precision.
The memory is distributed among the processing units, with a maximum of 4 Mbytes of
memory per unit and a total memory of 8 Gbytes. Each processing unit has a single 32-bit
wide data path to its memory. Data paths internal to the oating{point unit are 64{bits
wide. The processing units are interconnected as an 11{dimensional Boolean cube, with
two channels between every pair of nodes. Data may be exchanged on all 22 (11  2)
channels concurrently. For some phases of the algorithms described below this property
is explored.
Throughout this article, the axis enumerating the rows is referred to as the row axis, and
the axis enumerating the columns is referred to as the column axis. For a two{dimensional
data array A(i; j), the left index refers to the row axis, and the right index refers to the
column axis. A processing node, or simply node, refers to a processor with associated local
memory and communications facilities. Throughout this presentation, it is assumed that
there are N nodes, which for a two{dimensional processor conguration, have N
r
nodes
along the row axis, and N
c
nodes along the column axis (N = N
r
N
c
).
The algorithms described here have no restriction on the shape of the matrices or the
number of processing nodes, other than that each operand is assumed to have at least one
element assigned to each node. The processors may be congured as a one{dimensional
array or a two{dimensional array of arbitrary shape. The algorithms presented here
achieve perfect arithmetic load balance. For operand matrices assigned to a subset of
processors, load balancing is an important issue. Algorithms that address load balancing
issues in greater detail are discussed in [17].
The algorithms presented here are data parallel adaptations of the standard matrix mul-
tiplication algorithm requiring 2PQR arithmetic operations for the multiplication of a
P  Q matrix by a Q  R matrix. The index space for these operations is depicted in
Figure 1. R = 1 corresponds to matrix{vector multiplication, P = 1 to vector{matrix
multiplication, and Q = 1 to rank{1 updates. The algorithms described here provide
schedules for the operations in space and time that maintain perfect load balance both
with respect to communication and computation whenever there is one data element per
processor.
The data motion requirements and the performance depend strongly on the data allocation
of the operands. The Connection Machine compilers support a global address space and
allocate arrays based on their shapes. The processors are congured for each array such
that the rank of the data array and the processor array are the same. The ordering of the
axes is also the same. When there are more matrix elements than processors, consecutive
elements along each data array axis (a block) are assigned to a processor.
The outline of this paper is as follows. The next section discusses data allocation issues,
2
-6
 
 
 
 
 
 
 
 
 	
 
 
 
 
 
 
 
 
 
 
 
 
m
n
`
j
k
i
R
Q
P
A(i; ; k)
B(; j; k)
C(i; j; )
6
 
 
 
 
 
 	

Figure 1: The index space for matrix multiplication.
and their consequences for the multiplication of matrices of arbitrary shapes. Matrix{
vector, vector{matrix multiplication, and rank{1 updates on distributed data structures
are discussed in Section 2. Algorithms for matrix-matrix multiplication are described in
Section 4. A systolic algorithm that keeps the matrix C stationary is described in detail.
The optimum shapes of the processor array for the dierent algorithms are also discussed
in this section. Performance data for the Connection Machine system CM{200 are given
in Section 5.
2 Data allocation.
All Connection Machine system compilers support a global address space and allocate
data evenly among the memory modules. Multiple elements are assigned to the same
memory module based on the consecutive (also known as block) data allocation scheme
[7, 12].
For multidimensional arrays, the default mode of the Connection Machine compilers con-
gures the processors such that the average number of local references for each remote
reference is maximized when the data references are equally frequent along all array axes
[26]. The area of the faces for a subarray of a given size is minimized. The user can control
the shape of the processor array, that is, the data layout, through compiler directives. An
axis can be forced to be local to a memory module by the directive SERIAL, if there is
sucient local memory or if the length of the local axis segment is changed by assigning
weights to the axes. High weights are used for axes with frequent references and low
weights for axes with infrequent references. A relatively high weight for an axis increases
the length of the local segment of that axis at the expense of the length of the segments of
3
the other axes. The total size of the subarray is independent of the assignment of weights.
Only the shape of the subarray changes.
As an example of the data allocations possible under the model used by the Connection
Machine compilers, consider a 64  128 array assigned to 64 processors. Each processor
receives 128 elements. With the processors congured as an 8  8 array, each processor
receives a subarray of 8  16 elements. In the consecutive allocation, the elements of
the subarray are selected as successive elements along the axes. With the processors
congured as a one-dimensional array, each processor is assigned a subarray that is either
of shape 1  128, or of shape 64  2, depending upon the orientation of the processor
array with respect to the data array. The shape of the processor array clearly does
not aect the number of arithmetic operations to be performed, but it may aect both
the communication requirements and the eciency in utilizing the arithmetic units, for
example, by changing the vector lengths.
The consecutive allocation scheme selects elements to be assigned to the same processor.
Compiler directives, such as axis weights and SERIAL, address the issue of choosing
the processor array shape. Another data layout issue is the assignment of data sets
made up of consecutive elements along the dierent data array axis to processors. The
network topology and the data reference pattern are two important characteristics in this
assignment. In a mesh-connected machine it is natural to assign subarrays of one and
two{dimensional data arrays to processors such that the adjacency in the data array is
preserved when mapped to the processor array. Preserving adjacency in mapping data
arrays of at least three dimensions to two{dimensional processor arrays is impossible [24].
The processors in the Connection Machine system, CM{200, are interconnected as a
Boolean cube with two channels between each pair of processors. Meshes of any dimension
up to the dimension of the Boolean cube are subgraphs thereof. A Boolean cube network
of n dimensions has 2
n
nodes. The number of dimensions in Connection Machine system
CM-200 ranges from 7 to 11 depending upon system size (the largest system having 16
times more processors than the smallest system). The nodes of a Boolean cube can
be given addresses such that adjacent nodes dier in precisely one bit in their binary
encoding. Assigning subarrays to processors using the standard binary encoding of the
subarray index along an axis, i.e., the node address, does not preserve adjacency along
an axis. For instance, 3 and 4 dier in all three bits in the encoding of the addresses of
8 processors, and are at a distance of three apart. In general, 2
n 1
  1 and 2
n 1
dier in
n bits in their binary encoding and are at a distance of n. The number of bits in which
two indices dier translates directly into distance in a Boolean cube architecture.
Binary{reected Gray codes [23] generate embeddings of arrays into Boolean cube net-
works that preserve adjacency [12]. Gray codes have the property that the encoding of
successive integers dier in precisely one bit. In a Boolean cube network successive in-
dices are assigned to adjacent processors. The binary{reected Gray code is ecient both
in preserving adjacency, and in processor utilization, when the length of the axes of the
data array is a power of two [9]. For arbitrary data array axes' length, the Gray code
may be combined with other techniques to generate ecient embeddings [3, 10]. Matrix
multiplication algorithms can be formulated for one, two [2, 8, 12], and three{dimensional
4
[13] meshes, and Boolean cubes [4, 11, 28].
3 Level{2 Distributed BLAS (DBLAS).
All level{2 BLAS [19, 5, 6] involve operations on arrays of dierent shape. Functions,
such as matrix{vector multiplication, vector{matrix multiplication, and rank{1 updates,
involve both vectors and matrices. With data array allocation based on the shape of the
array, an alignment of the operand arrays with respect to each other is necessary before the
operations can be carried out and the result stored as required. In this presentation, BLAS
operating on distributed data structures are referred to as Distributed BLAS (DBLAS).
The data motion issues for some level{2 DBLAS are discussed briey below. For a more
extensive discussion the reader is referred to [17].
3.1 Matrix{vector and vector{matrix multiplication.
The evaluation of the matrix-vector product y  Ax requires the operations:
1. Aligning the vector x with the column axis of A.
2. Spreading the vector x along the row axis of A, such that there is one copy of the
appropriate segment of x for every node.
3. Performing a matrix{vector multiplication concurrently on each node.
4. Performing a reduction along the column axis to form y, aligned with the row axis
of A.
5. Aligning y with its original allocation.
In Step 1, the vector x is placed along a processor row of A, such that the range of inner
indices (corresponding to the Q-axis) in a node is identical for A and x. Step 2 replicates
x such that every node has a segment of the vector x that corresponds to the inner indices
of the rows of A. Step 3 denes N concurrent matrix{vector products followed in Step 4
by N
r
concurrent summations of N
c
vectors of length
P
N
r
.
The Connection Machine system CM{200 implementation is based on the processing node
description. Local level{2 BLAS [18] are used for the operations in each node. Vector{
matrix multiplication is treated similarly. In [17] it is shown that the data motion for the
alignment of the vectors with the matrix constitutes a shue on a suitably dened index
set. Optimal implementations on mesh-connected networks is described in [22] and on
Boolean cube networks in [16]. The communication eciency can be improved further by
combining the alignment and spread operations (Steps 1 and 2), and the reduction and
alignment operations (Steps 4 and 5) [17].
5
3.2 Rank{1 updates.
The evaluation of the rank{1 update A xy
T
requires the operations:
1. Aligning the vector x with the row axis of A.
2. Aligning the vector y with the column axis of A.
3. Spreading the vector x along the column axis of A, such that there is one copy of
the appropriate segment of x for every node.
4. Spreading the vector y along the row axis of A, such that there is one copy of the
appropriate segment of y for every node.
5. Performing a rank{1 update on each node.
The above description is made with respect to the processing array. A similar description
in terms of the data array is also possible.
The Connection Machine system CM{200 implementation uses the processing array de-
scription and local level{2 BLAS for the operations on each node. As with matrix{vector
and vector{matrix multiplication, the alignment can be combined with the spread (Steps
1 and 3 and Steps 2 and 4).
4 Matrix{matrix multiplication (level{3 DBLAS).
Matrix{matrix multiplication is a part of level{3 BLAS. For distributed data structures it
can be constructed out of the level{2 DBLAS described above. In [17] it is shown that for
many distributed memory architectures, including the Connection Machine system CM{
200, level{3 DBLAS is required for high eciency with respect to data motion. A level{3
DBLAS also allows for the use of local level{3 BLAS, which may enhance the arithmetic
eciency. This section generalizes the matrix{vector and vector{matrix level{2 DBLAS
to level{3 DBLAS. The systolic versions of these algorithms are then outlined. One impor-
tant advantage of the systolic algorithms is that unlike the straightforward generalizations
of the level{2 DBLAS, the systolic algorithms preserve the memory requirements.
4.1 Matrix multiplication based on level{2 BLAS functions.
The algorithm for matrix{vector multiplication described in the previous section can be
extended for the matrix{matrixmultiplication,C  AB+D. By extracting one column
of B (and D) and performing a matrix{vector multiplication, one column of the product
matrix C is generated. This column must be deposited into the matrix C. The extraction
of the column of B (and D) implies a change of layout, if the default data allocation
strategy of the Connection Machine compilers is used: B (D) is a two-dimensional object,
6
and the extracted column is one{dimensional. Similarly, the column of C obtained by a
matrix{vector multiplication is one{dimensional, but C is a two{dimensional object. A
detailed description of the required data motion can be found in [17].
Using a matrix{vector multiplication algorithm from a level{2 DBLAS package requires
that the matrix B be transposed, aligned and spread along the row axis one column
at a time. R calls to the matrix{vector multiplication routine are required. After each
such call, the computed product vector must be aligned and deposited in the appropriate
column of C. The arithmetic operations local to a node can be based on level{1 or level{2
BLAS, but not level{3 BLAS.
The number of calls to the level{2 DBLAS can be reduced by blocking the columns of
B (and C). For blocks consisting of b columns, blocking reduces the overhead for both
arithmetic and data motion by a factor of b. The arithmetic eciency may also be
improved by the use of level{3 BLAS in each node. The communication eciency also
increases because of improved load balance in the communication system [17].
With a blocking of b columns, each processing node requires temporary storage corre-
sponding to
Q
N
c
b data elements for the b columns of B. When b = R, that is, the entire
matrix B is transposed and aligned with A, then the temporary storage requirements,
due to the blocking, increase by a factor of R compared to the unblocked algorithm. Such
an increase in the temporary storage requirement is often unacceptable. The increased
demand for the temporary storage is due to the spread operation (Step 2 of the matrix{
vector multiplication algorithm). Its function is to assure that each column of B can
interact with every row of A with no motion of A. By modifying the implementation
of the spread such that it is performed in a stepwise manner, the memory requirements
can be preserved. For example, if there is at least one column of B allocated to each of
the N
r
processor rows after the extraction and alignment of b > N
r
columns, then the
spread function realizes an all{to{all broadcast [8, 14, 25], which can be implemented as
N
r
 1 cyclic shifts. The all{to{all broadcast can utilize the full communications capacity
of Boolean cube networks [1, 14]. The reduction operation required in computing C can
be expressed similarly.
The level{3 DBLAS matrix{matrixmultiplication routine based on extraction of b columns
of B, transposition and alignment, all{to{all broadcasting, all{to{all reduction and align-
ment using a memory preserving all{to{all implementation will be referred to as a sys-
tolic matrix{matrix multiplication algorithm with A stationary. Analogous algorithms
for matrix{matrix multiplication can be dened using either a straightforward general-
ization of the vector{matrix multiplication algorithms, or the rank{1 update algorithms,
or systolic versions thereof. In a vector{matrix type algorithm, B is stationary, while in
a rank{1 update algorithm, C is stationary.
The arithmetic requirements for the three algorithms is the same. There may be dier-
ences in the arithmetic eciency due to the dierent shapes of the submatrices of the
dierent operands assigned to each processing node. For the systolic matrix{matrix mul-
tiplication algorithm with A stationary and b = R, the transposition and alignment of B
can be performed as one operation. On a Boolean cube network, the optimal implemen-
7
tation of this operation requires a time proportional to
QR
2N
. The communication time is
independent of the shape of the processing array [16]. The all-to-all broadcast operation
for B requires a time proportional to
QR
N log
2
N
r
(N
r
  1) 
QR
N
c
log
2
N
r
[14]. Similarly, the
reduction for C requires a time proportional to
PR
N
r
log
2
N
c
[14]. Finally, the alignment of C
requires a time proportional to
PR
2N
. Therefore, the optimal conguration of the processing
array satises
P
Q
=
N
r
log
2
N
c
N
c
log
2
N
r
, or N
r

q
NP
Q
and N
c

q
NQ
P
for P  Q.
The optimal shape of the processing array is approximately congruent to the shape of
the stationary matrix A. Similarly, the optimum processing array shape of the systolic
algorithm with B stationary is congruent to B, and the optimum processing array shape
of the array with C stationary is congruent to C.
In the Connection Machine Scientic Software Library (CMSSL) [27] nonsystolic algo-
rithms are used for the cases with A and B stationary, while either a nonsystolic or sys-
tolic algorithm is used for the case with C stationary. The performance depends strongly
upon the shape of the operands. In [17], it is shown that as a rst approximation, the
optimal algorithm keeps the matrix with the largest number of elements stationary.
Next, the systolic matrix{matrix multiplication algorithm with C stationary that is used
in CMSSL is presented in detail.
4.2 Systolic matrix multiplication with C stationary.
4.2.1 Square processor arrays, N
c
= N
r
.
This algorithm assumes that each of the three operands has at least one element assigned
to each processor. The processors are congured as a two{dimensional array. A binary{
reected Gray code encoding is used for each axis, such that adjacency in the data array
is preserved in the distributed memory organized as a Boolean cube. A consecutive
data allocation scheme is assumed. This section describes a block algorithm for square
processor arrays. The next section generalizes the algorithm for rectangular processor
arrays. Blocking reduces the number of local memory moves and allows for the use of
level{3 BLAS on each node.
With the product matrix C stationary, for each element c
ij
 
P
Q 1
k=0
a
ik
b
kj
of C, the
corresponding elements of A, a
ik
, and B, b
kj
, for k 2 f0; 1; 2; : : : ; Q 1g, must be moved to
the processing node where c
ij
resides, for all i 2 f0; 1; 2; : : : ; P 1g and j 2 f0; 1; 2; : : : ; R 
1g. The set of processing nodes to which row i of C is assigned must all receive every
element of row i of A. Similarly, the set of processing nodes to which column j of C is
assigned must all receive every element of column j of B. In the index space, this is an
all{to{all broadcast [14] within the rows of A and the columns of B. It is assumed that
matrix rows are aligned with processor rows and matrix columns with processor columns.
This assumption is consistent with the compiler generated data layout.
Recall that the nonsystolic rank{1 update algorithm realizes the all{to{all broadcast as a
sequence of spreads. The temporary storage requirements for the blocked version of this
8
algorithm increase by a factor of b compared to the nonblocked algorithm. To reduce the
temporary storage, the all{to{all broadcast operation is performed stepwise through cyclic
shifts [21]. This introduces an alignment requirement between A and B. In evaluating
the expression c
ij
 
P
Q 1
k=0
a
ik
b
kj
, only processors where the inner index k is identical for
both A and B can participate in the computation. For a square processing array, the
inner indices initially are the same for A and B only in the processors on the diagonal
of the array. This property is true for matrices of any shape and any square processor
array
2
. To increase the processor utilization, the matrices must be \aligned" such that
the inner indices are identical in all nodes. A transposition of B or A would clearly align
the inner indices, but partial products for C would then have to be accumulated in space.
The following algorithm aligns A and B such that all processors can participate in the
evaluation of C, without any data motion for C.
Let 0  i < P , 0  k < Q, and 0  j < R denote matrix element indices, and 0  ` < N
r
and 0  m < N
c
denote indices for the processor array elements. Since N
r
= N
c
, the
partitioning of the inner axis Q is the same for A and B. Assume that P = N
r
, Q = N
r
,
and R = N
c
. An alignment such that processor (`;m) is assigned matrix elements
A: (` + ; (`+m) mod Q+ ), where 0   < , 0   < 
B: ((`+m) mod Q+ ; m+  ), where 0   < , and 0   < 
C: (` + ; m+  )
ensures that the range of inner indices for A and B are identical on each processor. This
property is true for arbitrary values of  and . Moreover, the range of indices for the P
axis is the same for A and C, and the range of indices for the R axis is the same for B
and C in each processor.
The stepwise all{to{all broadcast operation can be performed by using cyclic shifts. The
data motion for the multiplication of A with B at each step may be expressed as
A: (`+ ; (`+m) mod Q+) (`+ ; (`+m+1) mod Q+), where 0   < ,
0   < 
B: ((`+m) mod Q+; m+ ) ((`+m+1) mod Q+; m+ ), where 0   < ,
and 0   < .
The shift operation must be repeated N
r
  1 = N
c
  1 times. Clearly, the inner indices
of the two matrices are identical for each step of the algorithm. The correctness of the
algorithm follows. After the alignment, and after each cyclic shift, matrices of shape 
and    are multiplied on each processor. In the Connection Machine system CM{200
implementation described here, level{2 BLAS are used for the local matrix multiplication.
For P = N
r
and R = N
c
, the algorithm described above degenerates to the algorithm in
[2]. For certain high degree networks, such as Boolean cubes, multiple exchange sequences
can be used to make eective use of the communications bandwidth [11].
Remark 1. No local data motion is required between the cyclic shifts moving data
between processors. Emulating a large virtual processing array naively on the physical
2
Note, however, that if the layout rule is to minimize the surface area for a given subarray, then for a
rectangular matrix the processors will not be congured as a square array.
9
60
70
40
50
20
30
00
10
61
71
41
51
21
31
01
11
62
72
42
52
22
32
02
12
63
73
43
53
23
33
03
13
64
74
44
54
24
34
04
14
65
75
45
55
25
35
05
15
66
76
46
56
26
36
06
16
67
77
47
57
27
37
07
17
60
70
40
50
20
30
00
10
61
71
41
51
21
31
01
11
62
72
42
52
22
32
02
12
63
73
43
53
23
33
03
13
64
74
44
54
24
34
04
14
65
75
45
55
25
35
05
15
66
76
46
56
26
36
06
16
67
77
47
57
27
37
07
17
A B
Figure 2: Allocation of 8 8 matrices to a 4  8 processor array.
array of shape N
r
 N
c
would result in excessive local data motion. The data motion
between the processing nodes would be the same.
Remark 2. With the positive axis direction coinciding with increasing column indices
and decreasing row indices, A is shifted in the negative direction and B in the positive
direction. Shifting A in the positive axis direction and B in the negative direction also
yields a valid algorithm. Further, the submatrices for A and B can be split into two parts,
such that dierent parts are shifted in dierent directions. This observation is useful on
architectures where the primitive communication operation is an exchange, which is the
case for the Connection Machine system CM{200. Moreover, the data motion of A and
B can be performed concurrently.
Remark 3. The correctness of the above algorithm relies on the range of the inner indices
being identical for A and B. If N
r
6= N
c
, this property is not true. This restriction is
relaxed in the next section.
4.2.2 Rectangular processor arrays, N
r
6= N
c
.
Figure 2 shows the allocation of square 8  8 matrices to 32 processors congured as a
48 array. The length of the segment of the inner axis assigned to a processor is dierent
for A and B. Figure 3 shows the result of an alignment and the rst two steps of the
multiplication phase. For the example, in Figures 2 and 3, the all{to{all broadcast of the
multiplication phase requires 8 cyclic rotation steps for A and 4 steps for B, since there
are 8 processor columns and 4 processor rows. Figure 3 shows the locations of elements
after the rst and second cyclic shift of A and the rst shift of B. After the alignment,
all elements of A and half of the elements of B participate in the local multiplication.
After the rst cyclic shift of A, all its elements are again participating in local matrix
multiplications, with the previously unused elements of B. After the second cyclic shift of
A and the rst cyclic shift of B, all elements of A and the \rst" half of the elements of
B are used in the same way as after the alignment. After this sequence is repeated four
times, the matrix C is computed.
10
66
76
44
54
22
32
00
10
67
77
45
55
23
33
01
11
60
70
46
56
24
34
02
12
61
71
47
57
25
35
03
13
62
72
40
50
26
36
04
14
63
73
41
51
27
37
05
15
64
74
42
52
20
30
06
16
65
75
43
53
21
31
07
17
60
70
40
50
20
30
00
10
71
01
51
61
31
41
11
21
02
12
62
72
42
52
22
32
13
23
73
03
53
63
33
43
24
34
04
14
64
74
44
54
35
45
15
25
75
05
55
65
46
56
26
36
06
16
66
76
57
67
37
47
17
27
77
07
Matrix A aligned Matrix B aligned
Rotate A, Multiply and Add
66
76
44
54
22
32
00
10
67
77
45
55
23
33
01
11
60
70
46
56
24
34
02
12
61
71
47
57
25
35
03
13
62
72
40
50
26
36
04
14
63
73
41
51
27
37
05
15
64
74
42
52
20
30
06
16
65
75
43
53
21
31
07
17
60
70
40
50
20
30
00
10
71
01
51
61
31
41
11
21
02
12
62
72
42
52
22
32
13
23
73
03
53
63
33
43
24
34
04
14
64
74
44
54
35
45
15
25
75
05
55
65
46
56
26
36
06
16
66
76
57
67
37
47
17
27
77
07
Matrix A shifted left 1 step Matrix B aligned
Rotate A and B, Multiply and Add
66
76
44
54
22
32
00
10
67
77
45
55
23
33
01
11
60
70
46
56
24
34
02
12
61
71
47
57
25
35
03
13
62
72
40
50
26
36
04
14
63
73
41
51
27
37
05
15
64
74
42
52
20
30
06
16
65
75
43
53
21
31
07
17
60
70
40
50
20
30
00
10
71
01
51
61
31
41
11
21
02
12
62
72
42
52
22
32
13
23
73
03
53
63
33
43
24
34
04
14
64
74
44
54
35
45
15
25
75
05
55
65
46
56
26
36
06
16
66
76
57
67
37
47
17
27
77
07
Matrix A shifted left 2 steps Matrix B shifted up 2 steps
Figure 3: Matrix multiplication on a 4  8 array.
11
In Figure 3 all operations requiring a given submatrix are carried out before the entire
submatrix is moved to the adjacent processor. No local data motion is required. When the
inner axis extent per processor is dierent for A and B, which is the case for a rectangular
processor array, then only a fraction of the local submatrix with the largest inner axis
segment is used in a local matrix multiplication for each rotation step of the submatrix
with the shortest inner axis segment. The submatrices are fully used for each rotation
step in which it participates. If the number of processors assigned to one axis is a multiple
of the number of processors along the other axis, for example, N
c
> N
r
as in Figure 3,
then
N
c
N
r
rotation steps are performed along the longer axis for every rotation step along
the shorter axis. A more general case is shown in Figures 4 and 5.
Let P;Q  N
r
and Q;R  N
c
, and N = N
r
N
c
, with no other restriction on N
r
and N
c
,
and let 
c
=
Q
N
c
and 
r
=
Q
N
r
. For arbitrary values of N
r
and N
c
, dene a square virtual
processor array of shape
N
r
N
c
gcd(N
r
;N
c
)

N
r
N
c
gcd(N
r
;N
c
)
. Let `
v
;m
v
identify a block in the virtual
array: (`
v
;m
v
) 2 f0; 1; : : : ;
N
r
N
c
gcd(N
r
;N
c
)
 1gf0; 1; : : : ;
N
r
N
c
gcd(N
r
;N
c
)
 1g. Let 
v
= d
Qgcd(N
r
;N
c
)
N
r
N
c
e.
After the alignment of the operands with respect to each other and the establishment of
a shared processor array shape, the index assignment for physical processor (`;m) is:
A: (`+; 
v
((m
N
r
gcd(N
r
;N
c
)
+`
N
c
gcd(N
r
;N
c
)
) mod
N
r
N
c
gcd(N
r
;N
c
)
)+) = (`+; (
c
m+
r
`) mod Q+
), where 0   <  and 0   < 
c
.
B: (
v
((`
N
c
gcd(N
r
;N
c
)
+m
N
r
gcd(N
r
;N
c
)
) mod
N
r
N
c
gcd(N
r
;N
c
)
) + ; m+  ) = ((
r
` + 
c
m) mod Q +
; m+  ), where 0   < 
r
and 0   < .
C: (` + ; m+  ).
Note that after the alignment, all arrays are allocated to the processors assuming the
same processor array conguration.
Clearly, for every processor, the smallest inner index for A and B is identical. But, the
range of inner indices, , is dierent for A and B. The number of identical indices is
min(
r
; 
c
). The number of local blocks along the inner axis of A is
N
r
gcd(N
r
;N
c
)
, and of B
is
N
c
gcd(N
r
;N
c
)
.
For the multiplication phase, N
c
shifts are required for A and N
r
shifts for B. When
N
c
> N
r
, b
N
c
N
r
c shifts of A are performed without any shift of B. For each such shift, a
rank-
c
update is performed concurrently on all nodes, consuming the entire submatrix
of A on a node. For each shift the indices of A on a node changes as: (` + ; (
c
m +

r
`) mod Q + )  (` + ; (
c
(m + 1) + 
r
`) mod Q + ). The next shift of A, shift
b
N
c
N
r
c + 1, brings in the indices of A that correspond to the remaining inner indices of
B (if N
c
is not a multiple of N
r
) and some additional indices. A rank-mod

r

c
update
is performed rst to consume all inner indices of B, followed by a move of B, and a
rank-(
c
  mod

r

c
) before A is moved again. The cyclic shift of B brings about the index
change: ((
r
`+ 
c
m) mod Q+ ; m+  ) ((
r
(`+ 1) + 
c
m) mod Q+ ; m+  ). If
N
r
> N
c
, then B is moved more often, and a similar analysis applies. The index sets for
the blocks each processor receives are monotonically increasing, contiguous and periodic.
12
30 31
40 41
50 51
00 01
10 11
20 21
32 33
42 43
52 53
02 03
12 13
22 23
34 35
44 45
54 55
04 05
14 15
24 25
30 31
40 41
50 51
00 01
10 11
20 21
32 33
42 43
52 53
02 03
12 13
22 23
34 35
44 45
54 55
04 05
14 15
24 25
Matrix A Matrix B
Align
33 34
43 44
53 54
00 01
10 11
20 21
35 30
45 40
55 50
02 03
12 13
22 23
31 32
41 42
51 52
04 05
14 15
24 25
30 31
40 41
50 51
00 01
10 11
20 21
52 53
02 03
12 13
22 23
32 33
42 43
14 15
24 25
34 35
44 45
54 55
04 05
Matrix A aligned Matrix B aligned
Rotate A, Multiply and Add
33 34
43 44
53 54
00 01
10 11
20 21
35 30
45 40
55 50
02 03
12 13
22 23
31 32
41 42
51 52
04 05
14 15
24 25
30 31
40 41
50 51
00 01
10 11
20 21
52 53
02 03
12 13
22 23
32 33
42 43
14 15
24 25
34 35
44 45
54 55
04 05
Matrix A left shifted 1 step Matrix B aligned
Rotate B, Multiply and Add
33 34
43 44
53 54
00 01
10 11
20 21
35 30
45 40
55 50
02 03
12 13
22 23
31 32
41 42
51 52
04 05
14 15
24 25
30 31
40 41
50 51
00 01
10 11
20 21
52 53
02 03
12 13
22 23
32 33
42 43
14 15
24 25
34 35
44 45
54 55
04 05
Matrix A left shifted 1 step Matrix B shifted up 1 step
Figure 4: Matrix multiplication on a 2  3 array.
13
Rotate A, Multiply and Add
33 34
43 44
53 54
00 01
10 11
20 21
35 30
45 40
55 50
02 03
12 13
22 23
31 32
41 42
51 52
04 05
14 15
24 25
30 31
40 41
50 51
00 01
10 11
20 21
52 53
02 03
12 13
22 23
32 33
42 43
14 15
24 25
34 35
44 45
54 55
04 05
Matrix A left shifted 2 steps Matrix B shifted up 1 step
Figure 5: Matrix multiplication on a 2 3 array, last step.
The pseudocode for the above algorithm is given in the Appendix.
Claim 1. By performing the alignment along the longest axis as if the processor array was
square with the number of processors along each axis equal to the number of processors
along the shortest axis, and the alignment along the shortest axis as if the processor
array was square with the number of processors along each axis equal to the number of
processors along the longest axis, the multiplication of two matrices can be accomplished
by performing the minimum number of rotation steps along each axis.
The correctness of the claim follows from the algorithm outlined above. Figures 4 and 5
show an example where the least common divisor of N
r
and N
c
is 1.
Remark 4. There is no local data motion (except what is required by the local BLAS)
as was the case for a square processing array. Entire submatrices are moved between
processors when necessary.
Remark 5. If one of the axes is not a multiple of the other, then not all local matrix
multiplications are of the same rank. A uniform rank can be achieved without extra local
memory moves, but at the expense of having some shifts use dierent block sizes and
more complex memory management.
Remark 6. As in the case of a square processing array, A and B can be split into two
halves, such that an exchange operation can be performed along each axis. Note also that
moves on the two axes may not always be performed concurrently, since the longer axis
requires more cyclic shifts than the shorter axis. Furthermore, in general the submatrices
of A and B on each node are of dierent size, and a complete overlap of the motion of
A and B is impossible even for a square processor array. If the lengths of the axes are
relatively prime, then no communication is overlapped between A and B.
14
4.3 Optimum shape of the processor array.
Ignoring the eects of the ceiling functions, the speedup of the arithmetic operations
is perfect and independent of the shape of the processor array. With the inner axis
entirely instantiated in time, the matrix product requires 2d
P
N
r
ed
R
N
c
ed
Q gcd(N
r
;N
c
)
N
r
N
c
e
N
r
N
c
gcd(N
r
;N
c
)
arithmetic operations in sequence. The arithmetic time is proportional to the number of
matrix elements of C per physical processor, the order of the rank d
Q gcd(N
r
;N
c
)
N
r
N
c
e updates,
and the number of such updates in sequence.
The communication time for the multiplication phase is proportional to d
P
N
r
ed
Q
N
c
e(N
c
 1)
for A and to d
Q
N
r
ed
R
N
c
e(N
r
  1) for B. The communication time is entirely due to the all{
to{all broadcast. Reference [14] shows that the above broadcast times are optimum when
communication is restricted to one channel at a time per processor. With concurrent com-
munication on all channels of every node of a Boolean cube network, the communication
time can be reduced by a factor of log
2
N
c
for A and a factor of log
2
N
r
for B [11, 14].
The alignment phase for the Connection Machine system CM-200 implementation com-
bines establishing a shared processor conguration for the three arrays with the alignment
of the matrices with respect to each other. This operation is performed by the router.
Reshaping the processor array for an operand, if necessary, constitutes a shue operation
on a suitably dened index set. For the Connection Machine system, CM{200, reshaping
the processor array implies that the number of processors along one axis is reduced by
some power of two, while the number of processors along the other axis is increased by
the same factor. Figure 6 gives an example in which a 2  4 processor array is reshaped
into a 42 array. The number of partitions along the rst axis doubles, while the number
of partitions along the second axis is reduced by a factor of two. To verify that this oper-
ation is a shue, introduce a second index for the partitioning of the set of matrix rows
initially in each processor. Then the content of the rst two processor columns before
the reshape operation is formed by blocks labeled 00,01,10,11,20,21,30 and 31, where the
rst index denotes the processor to which the block of rows is assigned. After reshaping
the processor array, processor 0 contains blocks 00 and 20, processor 1 blocks 01 and 21,
processor 2 blocks 10 and 30, and processor 3 blocks 11 and 31. This reallocation can
be described as the reordering of the sequence 00,01,10,11,20,21,30,31 into the sequence
00,20,01,21,10,30,11,31. This reordering is known as a shue. Optimum Boolean cube
algorithms are given in [15, 16]. With concurrent communication on all channels of every
processor, as in the case of the Connection Machine system CM{200, the optimum time
for the reshaping is independent of the axes extent and is proportional to the number of
elements per processor.
The alignment of the matrices with respect to each other implies a skewing of rows with
respect to each other for A and a skewing of columns with respect to each other for
B, if a shared processor array shape has already been established. The skewing can be
performed as a sequence of cyclic shifts, with the number of cyclic shifts depending upon
the processor row for A and the processor column for B. On a Boolean cube network,
it is conjectured [12] that the data motion for this operation can be pipelined, and that
the communication time is proportional to the number of elements per processor and
15
10
3
2
5
4
7
6
=)
3
2
1
0
7
6
5
4
Figure 6: Reshaping a processor array.
essentially is independent of the shift length and the number of processors. Hence, it is
expected that the time for the combined processor array reshaping and matrix alignment
is proportional to the size of the submatrices assigned to each processor, and almost
independent of the machine size. The measurements reported in Section 5 show that this
assumption is valid for the Connection Machine system CM-200 router.
The number of arithmetic operations in sequence is independent of the processor array
shape. To a rst approximation, the alignment/reshaping is also independent of the array
shape. Hence the optimum shape is determined solely by the time for the all{to{all
broadcast in the multiplication phase.
Claim 2. The optimum aspect ratio of the physical processor array is the same as the
aspect ratio of the product matrix C, i.e., N
r
=
q
PN
R
and N
c
=
q
RN
P
, or
N
r
N
c
=
P
R
.
The claim follows directly from the expressions for the time required for all{to{all broad-
cast, and is veried experimentally in Section 5. Note that the optimum two-dimensional
processor array shape is the default processor array shape for C (assigned by the Con-
nection Machine compilers).
The communication times for the multiplication phase may be reduced further by con-
guring the processors as a three{dimensional array [13] and by using all communication
channels in Boolean cubes [11]. During the multiplication phase, the algorithm described
here and used for the performance data reported in the next section, concurrently uses at
most only four channels per processor.
5 Performance.
Table 1 and Figures 7, 8 and 9 give the performance of the matrix{vector, vector{matrix,
and rank{1 update routines for square matrices. For real operands, the peak performance
of the three algorithms in 64{bit precision is 4500 Mops/s, 5825 Mops/s, and 3266
Mops/s, respectively. The corresponding peak performance of the three algorithms for
complex operands is 9200 Mops/s, 11950 Mops/s, and 4900 Mops/s respectively. The
improvement in performance, for a xed number of processors, is primarily due to an
increased eciency of the local level{2 BLAS.
Table 2 gives the performance for the local matrix{vector multiplication kernels [18]. The
16
Matrix Mops/s Time (millisec)
shape Number of processors Number of processors
P  P 256 512 1024 2048 256 512 1024 2048
Matrix{vector multiplication
512 107 104 128 143 4.9 5.0 4.1 3.7
1024 233 279 405 411 9.0 7.5 5.2 5.1
2048 443 596 908 1122 18.9 14.0 9.2 7.5
4096 697 1067 1643 2330 48.2 31.4 20.4 14.4
8192 | | 3056 4541 | | 43.9 29.6
Vector{matrix multiplication
512 105 124 125 147 5.0 4.2 4.2 3.6
1024 266 325 406 474 7.9 6.5 5.2 4.4
2048 496 838 918 1246 16.9 10.0 9.1 6.7
4096 790 1462 2121 2709 42.5 22.9 15.8 12.4
8192 | | 3913 5829 | | 34.3 23.0
Rank{1 update
512 99 111 119 129 5.3 4.7 4.4 4.1
1024 213 278 380 426 9.9 7.5 5.5 4.9
2048 346 577 763 1078 24.3 14.5 11.0 7.8
4096 453 823 1450 2007 74.1 40.8 23.1 16.7
8192 | | 1901 3266 | | 70.6 41.1
Table 1: Performance data for matrix{vector, vector{matrix, and rank{1 updates on
dierent Connection Machine system CM{200 congurations, 64{bit precision.
17
-N
256 512 1024 2048
6
Gops/s
0
1
2
3
4
5

~
}
|

~
}
|

~
}
|
4

~
}
|
4
: P= 512
~: P=1024
}: P=2048
|: P=4096
4: P=8192
Figure 7: Execution rates for multiplication of a P P matrix by a vector on Connection
Machine system CM{200 with N processors, 64{bit precision.
-
N
256 512 1024 2048
6
Gops/s
0
1
2
3
4
5
6

~
}
|

~
}
|

~
}
|
4

~
}
|
4
: P= 512
~: P=1024
}: P=2048
|: P=4096
4: P=8192
Figure 8: Execution rates for multiplication of a vector by a P P matrix on Connection
Machine system CM{200 with N processors, 64{bit precision.
18
- N
256 512 1024 2048
6
Gops/s
0
1
2
3
4

~
}
|

~
}
|

~
}
|
4

~
}
|
4
: P= 512
~: P=1024
}: P=2048
|: P=4096
4: P=8192
Figure 9: Execution rates for rank{1 updates on P P matrices on Connection Machine
system CM{200 with N processors, 64{bit precision.
performance of the local kernels for 256512 submatrices is almost three times higher than
for 48 submatrices. Note that the performance numbers in this table correspond to one
oating point unit of the Connection Machine system CM{200. The peak performance for
the rank{1 update kernels is approximately 25% of the peak performance of the matrix{
vector multiplication kernels.
When there is at least one data element per processor for each operand, almost 30% of
the total time of the matrix{vector multiply is spent in the local level{2 BLAS. Another
35% of the total time is spent in the two align operations (Steps 1 and 5 of the algorithm
in Section 3). The remaining 35% is spent in the spread (Step 2) and reduction (Step 4)
operations. The main reason for the dierence in the peak performance of the matrix{
vector and the vector{matrix implementations is the asymmetry in the communication
times for the align operations in Steps 1 and 5 are signicantly dierent. Moreover, for
rectangular processor layouts, the time for the spread and the reduction operations are
considerably dierent. The peak performance of the rank{1 update is signicantly less
than the peak performance of the matrix{vector and the vector{matrix functions primarily
because of the lower eciency of the corresponding local level{2 BLAS.
Tables 3 { 4 and Figures 10 { 14 give some performance data for the systolic matrix{
matrix multiplication algorithm with the product matrix stationary and with processors
congured as a two{dimensional array. For a given machine size, the performance in-
creases as the size of the submatrices on the processing nodes increases. The overall
performance for the square matrix multiplication increases by a factor of nearly 15 as the
19
DGEMV
Number of Number of columns
Rows 2 3 4 5 8 16 32 64 128 512 2048
2 0.64 0.89 1.10 1.29 2.03 3.04 3.93 4.67 5.17 5.62 5.74
3 0.91 1.25 1.53 1.78 2.71 4.20 5.15 5.80 6.21 6.56 6.66
4 1.13 1.54 1.88 2.17 3.22 4.44 5.51 6.26 6.72 7.12 7.23
5 1.33 1.80 2.18 2.51 3.64 5.23 6.21 6.84 7.22 7.54 7.63
8 1.80 2.39 2.87 3.26 4.47 5.82 6.85 7.51 7.90 8.11 8.22
16 2.54 3.29 3.79 4.25 5.58 6.89 7.80 8.35 8.65 8.90 8.97
32 2.79 3.60 4.11 4.69 5.92 7.14 7.95 8.44 8.70 8.92 |
64 3.19 4.10 4.60 5.12 6.37 7.52 8.26 8.69 8.93 9.11 |
128 3.33 4.20 4.74 5.22 6.49 7.60 8.30 8.70 8.92 | |
512 3.45 4.33 4.86 5.30 6.53 7.57 8.22 8.60 | | |
2048 3.49 4.89 4.91 5.34 6.57 7.60 | | | | |
Table 2: Execution rate in Mops/s for matrix{vector multiplication on one Connection
Machine system CM{200 oating{point processor, 64{bit precision.
size of the submatrices increases from 48 to 256512 (Figure 12). For a given processor
conguration the communication time for A increases in proportion to P and Q, and for
B in proportion to Q and R. The number of arithmetic operations increases in proportion
to P , Q, and R. Hence, the relative inuence of the communication decreases with an
increased matrix size. However, the inuence decreases less rapidly than predicted by this
simple analysis, because the eective oating point rate of the local BLAS increases with
the size of the submatrices. It is clear from Table 4 that there are no size restrictions on
the extents of the matrix axes.
Table 5 provides some additional insights into the behavior of the systolic matrix{matrix
multiplication routine with the product matrix stationary. As a rst approximation, the
alignment time is proportional to the number of matrix elements per processor. There is
a slight increase in the alignment time with the machine size (about 6 { 7% for a factor
of 4 increase in machine size (not reported in the tables)). The time for cyclic shifts is
proportional to the number of elements per processor and the number of shifts along an
axis. The dependence of the time for cyclic shifts upon the number of elements per node
is veried by the measurements in Table 5. The Connection Machine model CM{200
communication system allows data to be exchanged concurrently on all channels of every
processor. Each of the operands A and B is split into two halves, one half moving in
the positive direction of an axis and the other half in the negative direction. Moreover,
equal{sized parts of A and B are moved concurrently whenever possible. Recall that for
rectangular arrays more cyclic shifts are required for the longer axis and that for dierent
sized matrices the submatrices are of dierent size.
The overall performance as a function matrix sizes and the number of processors is shown
in Figures 10 and 11. Increasing the number of processors by a factor of four and cong-
uring the machine as a square array reduces the number of matrix elements per processor
20
by a factor of four, but doubles the processor axes' lengths. The time for the alignment
is reduced by almost a factor of two, and the time for the cyclic shifts is reduced by a
factor of two. The number of arithmetic operations per processor for square matrices
and processing node congurations decreases by a factor of eight for each local matrix
multiply. However, the number of calls to the local matrix multiply function increases by
a factor of two. The loss in eciency in the local level{2 BLAS because of a reduced size
of the submatrices, does not quite reduce the arithmetic time by a factor of four. The
relative importance of the communication increases by about a factor of two.
For double precision operands, the systolic multiplication phase (local matrix multiply and
cyclic shifts in Table 3) has a peak oating{point rate of approximately 11 Gops/s. After
accounting for the alignment and reconguration phases, the overall algorithm peaks at
nearly 10 Gops/s. The overall peak performance of the algorithm for complex matrices
(N = 8192) is 12.5 Gops/s.
Figures 13 and 14 show how the performance of the systolic algorithm for the multi-
plication of a rectangular matrix and a square matrix varies with the machine size (C
stationary). The speedup with the machine size is almost identical for the dierent ma-
trix shapes and is linear in the machine size. The eciency for matrices of a given size
decreases with machine size, as explained above. The speedup is approximately 5 for an
increase in machine size by a factor of 8.
Figure 15 shows the dependence of the performance upon the shape of the product ma-
trix for the systolic algorithm with C stationary. For a given inner dimension Q, the
performance is approximately independent of the values of P and R for P  R = const.
The performance increases with the value of the constant and increased values of Q. The
deviation from perfect symmetry is due mostly to the fact that a Connection Machine
system CM{200 with 2048 processors cannot be congured as a square array. In this case,
one of the pairs in each symmetric case (P R and RP ) may incur a reallocation of the
product matrix, while the other case does not. The asymmetry is very small for machine
sizes for which the number of oating-point processors is a square, as seen from Table
3. Note that in Figure 15 P and R are expressed relative to Q. Hence, the increased
performance as Q increases, since all three local axes increases with an increased value of
Q for xed ratios of P=Q and R=Q.
Figures 16 and 17 illustrate the inuence of the shape of the shared processing node
conguration on the overall performance of the systolic matrix multiply algorithm with
C stationary. For square operand matrices, when the processing nodes are congured as
a linear array (N
r
= 1 or N
c
= 1), the number of cyclic shifts that are required for the
all{to{all broadcast in the multiplication phase completely dominates the algorithm. For
the square matrices considered in Figure 16, the total number of cyclic shifts is minimum
when N
r
= N
c
. For the rectangular shapes illustrated in Figure 17, performance is optimal
when the shared processor conguration is of the same shape as the product matrix when
P < R. For the case when P > R, the default processor layout for the product matrix
makes it essential for the Connection Machine system CM-200 implementation to do one
extra route operation to recongure the product matrix. This extra communication step
shifts the optimum oating point rate versus shared processor conguration to the left.
21
Matrix shape Mops/s Time (sec)
P Q R Number of processors Number of processors
256 512 1024 2048 256 512 1024 2048
1024 1024 128 300 365 704 838 0.895 0.735 0.381 0.320
1024 1024 256 377 548 826 1339 1.424 0.980 0.650 0.401
1024 1024 512 559 736 1289 1729 1.921 1.459 0.833 0.621
1024 1024 1024 744 1079 1788 2647 4.022 2.544 1.745 1.054
512 1024 1024 534 844 1231 2038 2.011 1.272 0.872 0.527
256 1024 1024 403 597 913 1366 1.332 0.899 0.588 0.393
128 1024 1024 271 389 609 987 0.991 0.690 0.441 0.272
2048 2048 128 521 674 1244 1608 2.061 1.593 0.863 0.668
2048 2048 256 476 622 1236 1520 4.512 3.453 1.737 1.413
2048 2048 512 649 815 1515 2507 6.618 5.270 2.835 1.713
2048 2048 1024 908 1257 2252 3177 9.460 6.834 3.814 2.704
2048 2048 2048 1128 1724 2975 4646 15.230 9.965 5.775 3.698
1024 2048 2048 871 1417 2194 3687 9.862 6.062 3.915 2.330
512 2048 2048 685 1039 1686 2509 6.270 4.134 2.547 1.712
256 2048 2048 470 740 1167 1849 4.569 2.902 1.840 1.161
128 2048 2048 481 879 1189 1909 2.232 1.222 0.903 0.562
4096 4096 256 743 1074 2044 2797 11.560 8.000 4.203 3.071
4096 4096 512 706 1033 1967 2608 24.334 16.631 8.734 6.587
4096 4096 1024 1020 1336 2546 4351 33.686 25.718 13.500 7.897
4096 4096 2048 | 1892 3513 5365 | 36.321 19.561 12.809
4096 4096 4096 1516 2348 4327 7318 90.650 58.534 31.763 18.781
2048 4096 4096 1263 2031 3486 6106 54.410 33.835 19.713 11.254
1024 4096 4096 1050 1636 2816 4552 32.724 21.002 12.202 7.548
512 4096 4096 786 1152 2025 3425 21.857 14.913 8.484 5.016
256 4096 4096 748 1392 2004 3539 11.484 6.171 4.286 2.427
256 256 256 240 320 497 668 0.140 0.105 0.068 0.050
512 512 512 439 609 987 1400 0.612 0.441 0.272 0.192
1024 1024 1024 744 1079 1788 2647 2.888 1.991 1.201 0.811
2048 2048 2048 1128 1724 2975 4646 15.232 9.965 5.776 3.698
4096 4096 4096 1517 2348 4327 7318 90.650 58.547 31.767 18.781
8192 8192 8192 | | | 9929 | | | 110.737
Table 3: Performance of the systolic matrix multiplication algorithm with C stationary
on some Connection Machine system CM{200 congurations, 64{bit precision.
22
Matrix shape Gops/s
P  P  P Real Complex
6400 9.3 12.5
6656 9.5 12.7
6912 9.6 12.8
7168 9.8 12.9
7424 9.9 13.0
7680 10.1 13.2
7936 10.3 13.3
8192 10.6 13.6
8448 10.6 13.6
8704 10.7 13.6
8960 10.8 13.7
9216 11.0 13.9
9472 11.1 14.0
9728 11.3 14.1
9984 11.3 14.1
10240 11.5 14.3
10496 11.6 14.3
10752 11.7 14.4
11008 11.8 14.5
11264 11.9 14.6
11520 11.9 14.5
11776 12.1 14.7
12032 12.2 14.7
12288 12.3 14.8
12544 12.3 14.9
12800 12.4 14.9
13056 12.5 14.9
13312 12.6 15.0
13568 12.7
13824 12.8
14080 12.8
14336 12.9
14592 12.9
14848 13.0
15104 13.1
15360 13.2
15616 13.2
15872 13.3
Table 4: Performance of the systolic matrix multiplication algorithm with C stationary
on a 64K Connection Machine system CM{200 conguration, 64{bit precision.
23
Matrix size 256 256 512 512 1024 1024 2048 2048 4096 4096
Operation Time % Time % Time % Time % Time %
Alignment 0.030 22 0.117 19 0.473 16 1.914 13 7.758 9
Local matmul 0.020 14 0.133 22 0.968 34 7.529 49 59.738 66
Cyclic shifts 0.089 64 0.362 59 1.447 50 5.789 38 23.153 26
Total 0.139 100 0.613 100 2.889 100 15.232 100 90.650 100
Table 5: Relative execution times for the global matrix multiplication algorithm on a 256
processor Connection Machine system CM{200, 64{bit precision.
- N
256 512 1024 2048
6
Time (s)
0
10
20
30
40
50
60
70
~
}
|

~
}
|
4

~
}
|
4

~
}
|
4
: Matrix size 256
~: Matrix size 512
}: Matrix size 1024
|: Matrix size 2048
4: Matrix size 4096
Figure 10: Execution time for multiplication of square matrices, 64{bit precision. N is
the size of the Connection Machine system CM{200.
24
- N
256 512 1024 2048
6
Gops/s
0
2
4
6
8
10

~
}
|
4

~
}
|
4

~
}
|
4

~
}
|
4

: Matrix size 256
~: Matrix size 512
}: Matrix size 1024
|: Matrix size 2048
4: Matrix size 4096
 : Matrix size 8192
Figure 11: Execution rates for multiplication of square matrices, 64{bit precision. N is
the size of the Connection Machine system CM{200.
-
P
0 2K 4K 6K 8K
6
Gops/s
0
2
4
6
8
10
t
t
t
t
t
t
Figure 12: Performance of square matrix multiplication as a function of matrix size P on
a 2048 processor Connection Machine system CM{200, 64{bit precision.
25
- N
256 512 1024 2048
6
Time (s)
0
10
20
30
40
50
60
70
80
90

~
}
|
4

~
}
|
4

~
}
|
4
~
}
|
4
: P= 256
~: P= 512
}: P=1024
|: P=2048
4: P=4096
Figure 13: Execution time for a P  4096  4096 matrix multiplication, 64{bit precision.
N is the size of the Connection Machine system CM{200.
- N
256 512 1024 2048
6
Gops/s
0
2
4
6
8

~
}
|
4

~
}
|
4

~
}
|
4

~
}
|
4
: P= 256
~: P= 512
}: P=1024
|: P=2048
4: P=4096
Figure 14: Execution rates for P  4096  4096 matrix multiplication, 64{bit precision.
N is the size of the Connection Machine system CM{200.
26
-P
RQ=16 Q=8 Q=4 Q=2 Q Q Q Q Q
Q Q Q Q Q Q=2 Q=4 Q=8 Q=16
6
Gops/s
0
2
4
6
8
10









~
~
~
~
~
~
~
~
~
}
}
}
}
}
}
}
}
}
|
|
|
|
|
|
|
|
|
: Q=1024
~: Q=2048
}: Q=4096
|: Q=8192
Figure 15: Execution rates for multiplication of rectangular matrices on a 2048 processor
Connection Machine system CM{200, 64{bit precision.
-
log
2
(N
r
)
log
2
(N
c
)0 1 2 3 4 5 6 7 8
8 7 6 5 4 3 2 1 0
6
Mops/s
0
200
400
600
800
1000
1200









~
~
~
~
~
~
~
~
~
}
}
}
}
}
}
}
}
}
: P= 512
~: P=1024
}: P=2048
Figure 16: Inuence of shared processor conguration on the performance for multipli-
cation of square matrices of size P , 64{bit precision. The shape of the 256 processor
Connection Machine system CM{200 is N
r
N
c
= 256.
27
-log
2
(N
r
)
log
2
(N
c
)0 1 2 3 4 5 6 7 8
8 7 6 5 4 3 2 1 0
6
Mops/s
0
200
400
600
800
1000
1200
1400









~
~
~
~
~
~
~
~
~
}
}
}
}
}
}
}
}
}
|
|
|
|
|
|
|
|
|
: P = 512;Q = 128;R = 2048
~: P = 1024;Q = 256;R = 4096
}: P = 2048;Q = 512;R = 8192
|: P = 8192;Q = 512;R = 2048
Figure 17: Inuence of shared processor conguration on multiplication of rectangular
matrices of shape P QR, 64{bit precision. The shape of the 256 processor Connection
Machine system CM{200 is N
r
N
c
= 256.
Finally, Table 6 and Figure 18 show the performance for the multiplication of a square
matrixA by a rectangular matrixB. The number of columns in B varies from one to 8192.
The eect of the blocking introduced for the nonsystolic \matrix{vector" multiplication
algorithm, and the selection of algorithm for matrix{matrix multiplication as a function
of the shape of the operands is clear. When
P
R
 8, a matrix{vector type algorithm is used
with a block size of min(R; 64) (A is stationary). When
R
P
 16, a vector{matrix algorithm
is used to evaluate the product (B is stationary). The block size is min(P; 32). For the data
reported in Table 6 and Figure 18, all other shapes use the systolic matrix multiplication
algorithm with C stationary. For small matrices, blocking of vectors for the nonsystolic
algorithm, yields more than a three-fold performance increase for square matrices, while
for large matrices the performance increase is about 5%. For small matrices, blocking
improves the performance of the local level{2 BLAS by about 50%.
6 Summary.
Level{2 and level{3 DBLAS used in the Connection Machine Scientic Software Library,
CMSSL, are described briey. Both the nonsystolic and the systolic algorithms for matrix{
matrix multiplication on distributed data structures, with A, B, or C stationary are
outlined. A systolic matrix{matrix multiplication algorithm keeping C stationary is de-
scribed in detail. The systolic algorithms perform the all{to{all broadcast and reduction
operations as a sequence of cyclic shifts, thereby reducing the need for temporary storage.
28
Matrix shape Mops/s Time (millisec)
P  P R Number of processors Number of processors
256 512 1024 2048 256 512 1024 2048
256 256 1 33 35 43 49 4.0 3.7 3.0 2.7
256 256 4 62 74 104 105 8.5 7.1 5.0 5.0
256 256 16 74 93 154 171 28.3 22.6 13.6 12.3
256 256 64 104 145 206 303 80.7 57.9 40.7 27.7
256 256 256 240 320 497 668 139.8 104.9 67.5 50.2
256 256 1024 398 564 834 1151 337.2 238.0 160.9 116.6
256 256 4096 182 346 397 738 2949.8 1551.7 1352.3 727.5
256 256 8192 | | 698 795 | | 1538.3 1350.6
512 512 1 107 104 128 143 4.9 4.0 4.1 3.7
512 512 4 133 164 251 297 15.8 12.8 8.4 7.1
512 512 16 165 186 329 386 50.8 45.1 25.5 21.7
512 512 64 167 204 367 405 200.9 164.5 91.4 82.9
512 512 256 318 398 689 890 422.1 337.2 194.8 150.8
512 512 1024 528 826 1219 1948 1016.8 650.0 440.4 275.6
512 512 4096 796 1205 1981 2902 2697.8 1782.1 1084.0 740.0
512 512 8192 | | 755 1435 | | 5688.7 2993.0
1024 1024 1 233 279 405 411 9.0 5.2 5.1 5.1
1024 1024 4 253 344 570 652 33.2 24.4 14.7 12.9
1024 1024 16 308 390 621 800 108.9 86.0 54.0 41.9
1024 1024 64 294 391 708 865 456.5 343.3 189.6 155.2
1024 1024 256 377 548 826 1339 1424.1 979.7 650.0 400.9
1024 1024 1024 744 1079 1788 2647 2886.4 1990.3 1201.1 811.3
1024 1024 4096 1066 1652 2771 4478 8058.1 5199.7 3099.9 1918.3
1024 1024 8192 | | 3219 5383 | | 5337.0 3191.5
4096 4096 1 697 1067 1913 2330 48.1 31.4 17.5 14.4
4096 4096 4 719 1187 1995 2734 186.7 113.1 67.3 49.1
4096 4096 16 794 1217 2262 2966 676.2 441.1 237.3 181.0
4096 4096 64 787 1160 2258 2962 2728.7 1851.3 951.1 725.0
4096 4096 256 743 1074 2044 2797 11561.2 7998.1 4202.5 2883.5
4096 4096 1024 1020 1336 2546 4351 33686.0 25718.4 13495.6 7897.0
4096 4096 4096 1517 2348 4327 7318 90599.2 58534.5 31763.1 18780.9
4096 4096 8192 | | 4952 8929 | | 55508.5 30784.8
8192 8192 1 | | 3056 4541 | | 43.9 29.6
8192 8192 4 | | 3354 4643 | | 160.1 115.6
8192 8192 16 | | 3416 5011 | | 628.7 428.6
8192 8192 64 | | 3416 4993 | | 2514.6 1720.4
8192 8192 256 | | 3164 4756 | | 10859.6 7224.5
8192 8192 1024 | | 2906 4308 | | 47294.9 31903.2
8192 8192 4096 | | | 8027 | | | 68488.3
8192 8192 8192 | | | 9929 | | | 110737.4
Table 6: Performance for the multiplication of a square matrix A by a rectangular matrix
B on some Connection Machine system CM{200 congurations using the matrix{matrix
multiplication function in CMSSL, 64{bit precision.
29
-log
2
(R)
0 1 2 3 4 5 6 7 8 9 10 11 12 13
6
Gops/s
0
2
4
6
8
10














~
~
~
~
~
~ ~ ~
~
~
~
~
~
~
}
}
}
}
} } }
}
}
}
}
}
}
}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
: P = 1024
~: P = 2048
}: P = 4096
|: P = 8192
Figure 18: Performance of the matrix multiplication function in the Connection Machine
Scientic Software Library for the multiplication of a P  P matrix by a P  R matrix
on Connection Machine system CM{200, 64{bit precision.
All algorithms require that the operands rst be aligned with respect to each other, and
that a common machine conguration is established for all operands.
For the matrix{vector multiplication routine, about 35% of the time is spent in the align-
ment of the two vectors, while the spread{and{reduce operations account for another 30%
of the total time. The matrix{vector and vector{matrix multiplication routines have com-
parable performance. The dierence is due to the asymmetry in the route operation. The
performance of the rank{1 update routine is always less than for the matrix{vector and
vector{matrix routines. For small matrices, the performance dierence is about 10%, but
for large matrices the performance dierence is close to a factor of two. The performance
for the local level{2 BLAS used in the level{2 DBLAS is up to a factor of four higher for
matrix{vector (vector{matrix) multiplication than for rank-1 updates.
It has been shown experimentally that introducing blocking in the matrix{vector and
vector{matrix multiplication routines in their use for matrix{matrix multiplication yields
a performance improvement by over a factor of three for small matrices. For large matrices,
the performance gain by blocking is about 5%. Most of the performance gain for small
matrices is due to improved performance of the local level{2 BLAS. For the same blocking
factor, the performance of the matrix{vector and vector{matrix multiplication routines is
always better than the performance of the rank-b update, primarily because of the better
performance of the local level{2 BLAS functions.
For the systolic matrix{matrix multiplication routine with C stationary, about 20% of
the time is spent for the alignment of small matrices. The alignment of large matrices
accounts for nearly 10% of the total time. The fraction of time spent for the cyclic
30
rotation decreases from about 65% for small matrices to about 25% for large matrices.
For submatrices of shape pq, qr, and pr, the time for cyclic rotation is proportional to
the number of data elements moved in sequence, which for A is
pq
2
N
c
, and for B is
qr
2
N
r
.
These shifts are partially overlapped. The required number of oating point operations,
2p  q  r increases with the size of the local matrices. The total time increases at a
slower rate, because the oating point rate increases with an increase in the local matrix
size. The increase in the oating{point rate is a factor of 14 from 2  2 submatrices to
large submatrices.
As a rst approximation, the algorithm which keeps the matrix with the largest number
of elements stationary should be used to compute the matrix product. Deviations from
this rule are primarily due to particular implementation issues for communication routines
and variations in arithmetic eciency due to dierent vector lengths for the dierent local
submatrices.
The aspect ratio of the shape of the processor array for optimum communication per-
formance shall be the same as the aspect ratio of the shape of the stationary matrix.
This result has also been veried experimentally. For square matrices, the performance
varies by a factor of over ve as a function of the shape of the processing array. For the
rectangular matrices used in the experiments reported here, the inuence of the shared
processing node conguration is much more severe (18 Mops/s versus the optimal 559
Mops/s { a factor of more than 30).
The local level{2 BLAS used in the algorithm are well{optimized for the node architec-
ture, having a peak eciency in excess of 90%. For improved eciency in the global
matrix multiplication algorithm, a decrease in the need for communication is desirable, in
particular, in the multiplication phase of the algorithm. Conguring the processors as a
three{dimensional array may yield lower communication requirements for the multiplica-
tion phase [13, 20]. A decrease in communication time can also be achieved by exploiting
more of the communications bandwidth of the Boolean cube network by using multiple
concurrent exchange sequences as suggested in [11].
Acknowledgment
Many people at Thinking Machines Corp. have provided support for the distributed
BLAS functions that are described here. These include Mark Bromley, Alan Edelman,
Tim Harris, Steve Heller, Doug MacDonald, and Luis Ortiz.
References
[1] Jean-Philippe Brunet and S. Lennart Johnsson. All-to-all broadcast with applications
on the Connection Machine. International Journal of Supercomputer Applications,
6(3):241{256, 1992.
[2] L.E. Cannon. A Cellular Computer to Implement the Kalman Filter Algorithm. PhD
thesis, Montana State Univ., 1969.
31
[3] M.Y. Chan. Embedding of grids into optimal hypercubes. SIAM J. Computing,
20(5):834{864, 1991.
[4] Eliezer Dekel, David Nassimi, and Sartaj Sahni. Parallel matrix and graph algo-
rithms. SIAM J. Computing, 10:657{673, 1981.
[5] Jack J. Dongarra, Jeremy Du Croz, Iain Du, and Sven Hammarling. A Set of Level
3 Basic Linear Algebra Subprograms. Technical Report Reprint No. 1, Argonne
National Laboratories, Mathematics and Computer Science Division, August 1988.
[6] Jack J. Dongarra, Jeremy Du Croz, Iain Du, and Sven Hammarling. A Set of Level
3 Basic Linear Algebra Subprograms: Model implementation and test programs.
Technical Report Reprint No. 2, Argonne National Laboratories, Mathematics and
Computer Science Division, August 1988.
[7] Georey C. Fox and Wojtek Furmanski. Optimal communication algorithms on the
hypercube. Technical Report CCCP-314, California Institute of Technology, July
1986.
[8] Georey C. Fox, Mark A. Johnson, Gregory A. Lyzenga, Steve W. Otto, John K.
Salmon, and DavidW.Walker. Solving Problems on Concurrent Processors. Prentice-
Hall, 1988.
[9] I. Havel and J. Moravek. B-valuations of graphs. Czech. Math. J., 22:338{351, 1972.
[10] Ching-Tien Ho and S. Lennart Johnsson. Embedding meshes in Boolean cubes by
graph decomposition. J. of Parallel and Distributed Computing, 8(4):325{339, April
1990.
[11] Ching-Tien Ho and S. Lennart Johnsson. Matrix multiplication on hypercubes using
full bandwidth and constant storage. In The Sixth Distributed Memory Computing
Conference, pages 447{451. IEEE Computer Society Press, 1991.
[12] S. Lennart Johnsson. Communication ecient basic linear algebra computations
on hypercube architectures. J. Parallel Distributed Computing, 4(2):133{172, April
1987.
[13] S. Lennart Johnsson and Ching-Tien Ho. Algorithms for multiplying matrices of
arbitrary shapes using shared memory primitives on a Boolean cube. Technical
Report YALEU/DCS/RR-569, Dept. of Computer Science, Yale University, October
1987.
[14] S. Lennart Johnsson and Ching-Tien Ho. Spanning graphs for optimum broad-
casting and personalized communication in hypercubes. IEEE Trans. Computers,
38(9):1249{1268, September 1989.
[15] S. Lennart Johnsson and Ching-Tien Ho. Maximizing channel utilization for all-to-
all personalized communication on Boolean cubes. In The Sixth Distributed Memory
Computing Conference, pages 299{304. IEEE Computer Society Press, 1991.
32
[16] S. Lennart Johnsson and Ching-Tien Ho. Generalized shue permutations on
Boolean cubes. J. Parallel and Distributed Computing, 16(1):1{14, 1992.
[17] S. Lennart Johnsson and Kapil K. Mathur. Distributed BLAS. Technical report,
Thinking Machines Corp., 1992. In preparation.
[18] S. Lennart Johnsson and Luis F. Ortiz. Local Basic Linear Algebra Subroutines
(LBLAS) for distributed memory architectures and languages with an array syntax.
The International Journal of Supercomputer Applications, 6(4):322{350, 1992.
[19] C.L. Lawson, R.J. Hanson, D.R. Kincaid, and F.T. Krogh. Basic Linear Algebra
Subprograms for Fortran Usage. ACM TOMS, 5(3):308{323, September 1979.
[20] Kapil K. Mathur and S. Lennart Johnsson. Matrix multiplication on a three-
dimensional array. Technical report, Thinking Machines Corp., November 1991.
[21] Kapil K. Mathur and S. Lennart Johnsson. All{to{all communication algorithms
for distributed BLAS. In 6th SIAM Conference on Parallel Processing for Scientic
Computing. SIAM, 1993. Harvard University Technical Report TR-07-93.
[22] David Nassimi and Sartaj Sahni. Optimal BPC permutations on a cube connected
SIMD computer. IEEE Trans. Computers, C-31(4):338{341, April 1982.
[23] E.M. Reingold, J. Nievergelt, and N. Deo. Combinatorial Algorithms. Prentice-Hall,
Englewood Clis. NJ, 1977.
[24] Arnold L. Rosenberg. Preserving proximity in arrays. SIAM J. Computing, 4:443{
460, 1975.
[25] Quentin F. Stout and Bruce Wagar. Passing messages in link-bound hypercubes. In
Michael T. Heath, editor, Hypercube Multiprocessors 1987. Society for Industrial and
Applied Mathematics, Philadelphia, PA, 1987.
[26] Thinking Machines Corp. CM Fortran optimization notes: slicewise model, version
1.0, 1991.
[27] Thinking Machines Corp. CMSSL for CM Fortran, Version 3.1, 1993.
[28] Walter Tichy. Parallel matrix multiplication on the Connection Machine. In Horst D.
Simon, editor, Scientic Applications of the Connection Machine, pages 174{187.
World Scientic, 1989.
33
7 Appendix
This section describes the systolic algorithm where the product matrix C is kept station-
ary. The shape of the operand matrices is arbitrary. The processing nodes are assumed
to be congured as a two{dimensional array of shape N
r
N
c
for any N
r
and N
c
.
The alignment of each operand is expressed using two forall loops in the pseudocode
below. The inner loop denes the maximum data set that is guaranteed to have the
same data motion during alignment. For the matrix A, the alignment is performed on
submatrices of shape   
v
( = d
P
N
r
e; 
v
= d
Qgcd(N
r
;N
c
)
N
r
N
c
e). For the matrix B the same
properties hold for submatrices of shape 
v
 ( = d
R
N
c
e). All submatrices can be aligned
concurrently, should there be sucient communications bandwidth available.
The rst three loops for the multiplication phase in the pseudocode below dene a matrix
multiplication local to a processor. It is the multiplication immediately succeeding the
alignment. The main loop performs the remaining multiplications and a single cyclic
shift. The local matrix multiplication is performed as a block outer product, or rank-
d
Q gcd(N
r
;N
c
)
N
r
N
c
e update. This update is dened in terms of a matrix{vector multiplication,
where the matrix is of shape d
P
N
r
ed
Qgcd(N
r
;N
c
)
N
r
N
c
e. The matrix{vector product is expressed
as a sequence of AXPY operations [19] with a vector length of d
P
N
r
e. This loop is for
illustration only. The actual code calls a (local) matrix{vector routine. The local indices
for matrix C are  and ; for matrix A the indices are  and  , and for the matrix B the
local indices are  and . The local submatrix of A is treated in blocks of d
Q gcd(N
r
;N
c
)
N
r
N
c
e
columns, and the local submatrix of B in blocks of d
Qgcd(N
r
;N
c
)
N
r
N
c
e rows. The inner index of
the matrices A and B is 
v
^
Q+  
v
, where  
v
is the index local to a block, and
^
Q is the
block index. The local block index for A is
^
Q mod
N
r
gcd(N
r
;N
c
)
and the local block index for
B is
^
Q mod
N
c
gcd(N
r
;N
c
)
.
The organization of the computations corresponds to a reshaping of the data arrays to
arrays with six axes. Two of the axes dene the processor array. Two more axes are
necessary for the enumeration of the virtual processors being emulated by a physical
processor (one virtual axes for each of the physical processor array axis). Finally, two
axes dene the number of rows and columns for each virtual processor. For a square
processor array a processor will emulate only one virtual processor, and four axes suce.
Algorithm AM(C,A,B)
forall `;m 2 f0; 1; : : : ; N
r
  1g  f0; 1; : : : ; N
c
  1g do
Alignment:
forall
^
Q := 0 to
N
r
gcd(N
r
;N
c
)
  1 do
forall ; 
v
2 f0; 1; : : : ;    1g  f0; 1; : : : ; 
v
  1g do
a(` + ; 
v
(m
N
r
gcd(N
r
;N
c
)
+
^
Q) +  
v
) 
 a(` + ; 
v
(m
N
r
gcd(N
r
;N
c
)
+
^
Q+ `
N
c
gcd(N
r
;N
c
)
) mod
N
r
N
c
gcd(N
r
;N
c
)
+  
v
)
34
endforall ; 
v
endforall
^
Q
forall
^
Q := 0 to
N
c
gcd(N
r
;N
c
)
  1 do
forall  
v
;  2 f0; 1; : : : ; 
v
  1g  f0; 1; : : : ; 
1
  1g do
b(
v
(`
N
c
gcd(N
r
;N
c
)
+
^
Q) +  
v
; 
1
m+ ) 
 b(
v
(`
N
c
gcd(N
r
;N
c
)
+
^
Q+m
N
r
gcd(N
r
;N
c
)
) mod
N
r
N
c
gcd(N
r
;N
c
)
+  
v
; 
1
m+ )
endforall  
v
; 
endforall
^
Q
Multiplication:
for  := 0 to    1 do
for  
v
:= 0 to 
v
  1 do
for  := 0 to    1 do
c(`+ ; m+ ) c(`+ ; m+ ) + a(` + ; 
v
) b( 
v
; m+ )
endfor 
endfor  
v
endfor 
for
^
Q:=1 to
N
r
N
c
gcd(N
r
;N
c
)
  1 do
if
^
Q mod
N
r
gcd(N
r
;N
c
)
= 0 then
forall ; 2 f0; 1; : : : ;   1g  f0; 1; : : : ; 
c
  1g do
a(`+ ; 
c
m+  ) a(` + ; 
c
((m+ 1) mod N
c
) +  )
endforall ; 
endif
if
^
Q mod
N
c
gcd(N
r
;N
c
)
= 0 then
forall ;  2 f0; 1; : : : ; 
r
  1g  f0; 1; : : : ;    1g do
b(
r
`+ ; m+ ) b(
r
((` + 1) mod N
r
) + ; m+ )
endforall ; 
endif
for  := 0 to    1 do
for  
v
:= 0 to 
v
  1 do
for  := 0 to   1 do
c(` + ; m+ ) c(` + ; m+ )+
+a(`+ ; 
v
^
Q mod
N
r
gcd(N
r
;N
c
)
+  
v
) b(
v
^
Q mod
N
c
gcd(N
r
;N
c
)
+  
v
; m+ )
endfor 
endfor  
v
endfor 
endfor
^
Q
endforall `;m
35
