Hyper-Systolic Matrix Multiplication by Lippert, Thomas et al.
ar
X
iv
:c
s/9
80
91
05
v1
  [
cs
.M
S]
  2
4 S
ep
 19
98
HLRZ1998-59
Hyper-Systolic Matrix Multiplication
Th. Lippert a,b N. Petkov c P. Palazzari d K. Schilling a,b
aDepartment of Physics, University of Wuppertal, D-42097 Wuppertal, Germany
bHLRZ, c/o Research Center Ju¨lich, D-52425 Ju¨lich, Germany
cInstitute of Mathematics and Computing Science, University of Groningen, PO
Box 800, 9700 AV Groningen, The Netherlands
dENEA, HPCN Project, C. R. Casaccia, Via Anguillarese, 301, S.P. 100, 00060
S.Maria di Galeria, Rome, Italy
Abstract
A novel parallel algorithm for matrix multiplication is presented. The hyper-systolic
algorithm makes use of a one-dimensional processor abstraction. The procedure
can be implemented on all types of parallel systems. It can handle matrix-vector
multiplications as well as transposed matrix products.
Key words: matrix multiplication, hyper-systolic, parallel computer
1 Introduction
Matrix multiplication is a fundamental operation in most numerical linear
algebra applications. Its efficient implementation on parallel high performance
computers, together with the implementation of other basic linear algebra
operations, is therefore an issue of prime importance when providing these
systems with scientific software libraries [1]. Consequently, considerable effort
has been devoted in the past to the development of efficient parallel matrix
multiplication algorithms, and this will remain a task in the future as well.
A general rule, which applies not only to matrix multiplication is, that the
choice of a proper parallel algorithm strongly depends on the architecture
of the parallel computer on which the algorithm is to run. System aspects,
such as SIMD or MIMD mode of operation, distributed or shared memory
organization, cache or memory bank structure, construction and latency of
the communication network, processor performance and size of local memory,
etc., may render an algorithm which is highly efficient for one system rather
Preprint submitted to Elsevier Preprint 19 January 2014
impractical for another system. Even on a given system it may be necessary
to switch algorithms in different problem size domains.
As a consequence, one needs a diversity of algorithms for one and the same
operation, as well as systematic design approaches which allow to construct
new algorithms or to modify existing ones in such a way that they suit both
a given implementation system and problem size domain. In the following,
one such design approach is presented and applied to matrix multiplication.
The procedure will lead us to a novel class of parallel matrix multiplication
algorithms which are applicable to distributed memory computers whose inter-
connection pattern includes a ring as a subset of the system connectivity. This
novel scheme is called the hyper-systolic matrix multiplication, as it is based
on the hyper-systolic parallel computing concept [2]. The latter is generaliz-
able to any kind of commutative and associative operation on abstract data
types [3]. The communication complexity of the hyper-systolic matrix multi-
plication is O(n2 p
1
2 ), with n being the matrix dimension and p the number of
processors, and thus, it is comparable to best parallel standard methods.
The work presented is part of a program to develop a novel practicable form
of distributed BLAS-3 (PBLAS-3). In a forthcoming publication, we will give
algorithms for transposed matrix products and level-2 linear algebra compu-
tations.
The paper is organized as follows: in Section 2, we shortly comment on systolic
algorithms and the origin of hyper-systolic algorithms and review in brief
the problem of distributed matrix multiplication. In Section 3, we develop
the one-dimensional (1D) hyper-systolic matrix multiplication algorithm in a
systematic way, starting from Fox’ algorithm on a two-dimensional (2D) mesh
[4]. In Section 4, the concept of the hyper-systolic algorithm involving two
data arrays is introduced. The algorithmic presentation of the hyper-systolic
matrix multiplication is given in Section 5, Section 6 deals with the mapping of
the problem onto parallel systems and finally, Section 7 presents some results
from implementations on a SIMD computer.
2 Background
2.1 Systolic arrays
Systolic arrays are cellular automata models of parallel computing structures
in which data processing and transfer are pipelined and the cells carry out
functions of equal load between consecutive communication events. Systolic
algorithms are parallel algorithms which, as far as abstract automata models
2
are concerned, make efficient use of systolic arrays. For more precise definitions
of systolic algorithms and arrays and for many examples, the reader is referred
to the monographs under references [5] and [6] (for a number of systolic ma-
trix multiplication algorithms see Chapter 3 of [6]). The original motivation
behind the systolic array concept was its suitability for VLSI implementation
[7,8]. Only a few systolic algorithms, however, have been implemented in VLSI
chips or hardware devices. With the advent of commercially available paral-
lel computers, systolic algorithms have found an attractive implementation
medium for they match the local regular interconnection structure as present
in or easily implementable on many parallel architectures.
Systolic algorithms can efficiently be implemented in the SIMDmodel, where—
being synchronous and regular—they avoid time consuming synchronization
operations. Apart from these implementation issues, a very attractive feature
is the availability of methodologies for the systematic design of systolic al-
gorithms. Projection of regular dependence graphs has evolved as one such
technique [5,6,9–16].
As shown elsewhere [17–19], such structures can easily be transformed into
data-parallel programs. The pattern of systolic arrays induces characteris-
tic features into the respective data-parallel programs. In particular, a data-
parallel program realizing a systolic algorithm consists of a sequence of iden-
tical steps organized in a loop whose counter corresponds to the clock of the
underlying systolic array automaton model. Further, the local regular intercon-
nection pattern of a systolic array results in the use of only local synchronized
communication in the respective data-parallel program as exemplified by the
shift-type operations (e.g. cshift and eoshift).
2.2 Hyper-systolic algorithms
Hyper-systolic algorithms have been introduced for various numerical appli-
cations in order to further reduce the communication expense of systolic al-
gorithms. The advantage of the hyper-systolic over the systolic data flow first
has been demonstrated for the case of so-called n2-problems, i.e., numerical
problems that involve O(n2) computation events on pairs of elements in a
system of n elements [2].
The systolic computation of n2-problems on a parallel computer equipped
with p processors involves O(np) communication events. The hyper-systolic
algorithm can reduce the communication complexity to O(np
1
2 ), as has been
shown for a prototype n2-problem, the computation of all n2 two-body forces
for a system of n gravitatively interacting bodies [20]. This success makes us
confident that hyper-systolic processing can be applied to a variety of numer-
3
ical problems which lead to n2 computation events. An important application
is found in astrophysics where the investigation of the dynamics and evolution
of globular clusters is of prime importance [21]. Further examples of applica-
tions are protein folding, polymer dynamics, polyelectrolytes, global and local
all-nearest-neighbours problems, genome analysis, signal processing etc. [22].
Hyper-systolic algorithms are an extension of the systolic concept [2]. Similar
to systolic algorithms, data processing and transfer are pipelined and the cells
carry out functions of equal load. The three main differences are: (i) use of a
changing interconnection pattern throughout the execution of the algorithm,
(ii) use of multiple auxiliary data arrays for storage of intermediate results,
and (iii) the possible separation of communication and computation. The
combination of these three features leads to reduction of the communication
overhead.
The use of a changing communication pattern is due to the communication of
data by different strides along a 1D ring in different stages of a hyper-systolic
algorithm. The regularity of the communication pattern is retained, however.
As an example of a regular but changing communication pattern one can think
of an algorithm in which each processor of a ring communicates data to its
first, second, fourth, etc., neighbours in the first, second, etc., steps of the
algorithm, respectively.
Auxiliary data arrays are needed for temporary storage of intermediate results.
In conventional systolic algorithms such results are accumulated by shifting
them from cell to cell. In the hyper-systolic algorithm they are generated and
accumulated in place for many cycles, using multiple auxiliary data arrays,
and are subsequently used to compute the final results.
Use of a regular but changing communication pattern can be found in some
(conventional) systolic algorithms, such as the systolic implementation [6] of
Eklund’s matrix transposition algorithm on a hyper-cube [23]. Use of multiple
data arrays for solving specific tasks, e.g. problem partitioning, can also be
found in systolic algorithm literature (see Chapter 12 in [23].), however, for
purposes different of those aimed at in hyper-systolic algorithms. The unique
combination of the two features mentioned above in the hyper-systolic concept
aims at achieving new quality: substantial reduction of the communication
overhead.
2.3 Matrix multiplication on parallel systems
In order to carry out matrix multiplications on distributed systems one has to
take care for communication efficiency, parallelism and scalability of the im-
plementation. For scalability of the implementation it appears very important
4
that the data layout chosen is preserved in the course of the computation with-
out need for reordering of initial or result matrices. Furthermore, the data flow
organization should not hinder the efficient usage of register, vector or cache
facilities that largely dominates the overall performance. A third requirement
is the general applicability in linear algebra tasks such as block factorization
algorithms, where, additionally, matrix-vector multiplication must be carried
out effectively 1 . It is a general observation that many parallel matrix mul-
tiplication algorithms [24,25] meet the criteria of efficiency, parallelism and
scalability only partially. For illustration, let us for instance consider one of
the favorite matrix multiplication strategies, Cannon’s algorithm [26], applied
to n× n-matrix multiplication, A×B.
Elements ai,k and bi,k of the matrices A and B are assigned to cells on a 2D
grid, indexed by i, j. In a first step, Cannon’s method involves a pre-skewing
of A and B along the rows and columns, respectively. Data movement with
different strides for each row/column is required, cf. Fig. 1. We note that this
pre-skewing cannot be organized in form of a regular communication pattern.
The second systolic step of Cannon’s method consists of circular shifts of A
a1,1 a1,2 a1,3
a2,1 a2,2 a2,3
a3,1 a3,2 a3,3
b1,1 b1,2 b1,3
b2,1 b2,2 b2,3
b3,1 b3,2 b3,3
a1,1 a1,2 a1,3
a2,2 a2,3 a2,1
a3,3 a3,1 a3,2
b1,1 b2,2 b3,3
b2,1 b3,2 b1,3
b3,1 b1,2 b2,3
normal order skew order
Fig. 1. Preskewing of A and B.
along the rows and ofB along the columns, followed by the computation event,
cf. Fig. 2. We note that one can use blocks of elements as matrix entries.
The merit of Cannon’s algorithm lies in its memory efficiency, as it is possible
to arrange the computation in such a way that no cell holds more than one
element (block) of each matrix. Disadvantages of Cannon’s algorithm are pre-
skewing and the fact that the optimal data layout for matrix multiplication
requires one-to-all broadcast operations applied to matrix-vector multiplica-
1 Unfortunately it often happens that the matrix-to-processor mapping, as chosen
for optimal BLAS-3 performance, fails in the case of BLAS-2 applications.
5
a1,1 a1,2 a1,3
a2,2 a2,3 a2,1
a3,3 a3,1 a3,2
b1,1 b2,2 b3,3
b2,1 b3,2 b1,3
b3,1 b1,2 b2,3
c1,1 c1,2 c1,3
c2,1 c2,2 c2,3
c3,1 c3,2 c3,3
a1,2 a1,3 a1,1
a2,3 a2,1 a2,2
a3,1 a3,2 a3,3
b2,1 b3,2 b1,3
b3,1 b1,2 b2,3
b1,1 b2,2 b3,3
c1,1 c1,2 c1,3
c2,1 c2,2 c2,3
c3,1 c3,2 c3,3
cycle 0 cycle 1
a1,3 a1,1 a1,2
a2,1 a2,2 a2,3
a2,3 a3,3 a3,1
b3,1 b1,2 b2,3
b1,1 b2,2 b3,3
b2,1 b3,2 b1,3
c1,1 c1,2 c1,3
c2,1 c2,2 c2,3
c3,1 c3,2 c3,3
cycle 2
Fig. 2. Systolic phase of Cannon’s algorithm.
tion. Furthermore the layout of the result vector differs from that of the input
vector [27].
3 Design of a 1D Matrix Multiplication
In this section, we develop a 1D hyper-systolic matrix multiplication starting
from a 2D algorithm that is related to the well known matrix product scheme of
Fox [4]. In our systematic design approach the 2D scheme will be transformed
to a 1D systolic array representation.
We have chosen a skew ordering as the fundamental representation of the
matrices A, B and C. As a consequence, we will be able to carry out the com-
putation fully in parallel without even the requirement for indexed addressing
6
functionality of the target parallel computer. Furthermore, no reordering is
required in the course of the computation because the skew representation is
preserved.
3.1 Parallel Computational Problem
Given the n×m matrixA and the m×n matrix B, the matrix-matrix product
reads
ci,j =
m∑
k=1
ai,k bk,j, i, j = 1, . . . , n. (1)
On a distributed memory machine, the elements of A and B have to be spread
appropriately across the different processor nodes.
In our design approach, we will first show how to multiply 4 × 4 matrices A
and B on a 2D grid of nodes of size 4×4. Then, the algorithm is mapped onto
a 1D processing array with 4 nodes.
In Section 5, we present the hyper-systolic algorithm for matrix multiplication
of p× p matrices on an array of p processors.
In Section 6, we consider the general case Eq. 1 of multiplying a n×m matrix
A and a m×n matrix B on a 1D system of p processors. In order to map the
systolic model onto a parallel implementation machine we choose hierarchy
mapping with two different strategies, block and cyclic assignment.
3.2 2D Matrix Multiplication
3.2.1 Data alignment
In the following we make use of the concept of abstract processor arrays (APA)
as defined in HPF [28].
The grid of boxes shown in Fig. 3 is such a 2D APA on which the matrices
A, B and C involved in the operation C = AB are aligned in a column-skewed
fashion.
7
11b
22
22
22
a
b
c
33
33
33
a
b
c
44
44
44
a
b
c
21
21
21
a
b
c
32
32
32
a
b
c
43
43
43
a
b
c
14
14
14
a
b
c
31
31
31
a
b
c
42
42
42
a
b
c
13
13
13
a
b
c
24
24
24
a
b
c
41
41
41
a
b
c
12
12
12
a
b
c
23
23
23
a
b
c
34
34
34
a
b
c
11c
11a
Fig. 3. Column-skewed distribution of the matrices A, B and C on a 2D APA.
3.2.2 Semi-systolic algorithm
Let the p × p matrices A, B and C be distributed on a p × p processor array
and let for simplicity p = 4.
The algorithm consists of p (4) steps as follows: in the first step, the matrix
elements b1,1, b2,2, . . . , bp,p of the matrix B in the first row of the APA are
broadcast along the corresponding columns and subsequently are multiplied
with the elements of the matrix A in the respective columns of the APA. The
products (see Fig. 4a) are accumulated in the corresponding elements of the
matrix C, according to the arrangement shown in Fig. 3.
At the end of the step, the elements of the matrix A are circularly shifted by
one position to the left along the rows and by one position downwards along
the columns (compare the positions of the elements of matrix A in Fig. 4a and
Fig. 4b).
In the second step, the processors of the second row of the APA broadcast
their corresponding elements of the matrix B in the respective columns of
the APA, Fig.4b. This operation is followed by the same multiplication and
accumulation operations and circular shifting of the elements of the matrix A
as in the first step.
The algorithm proceeds with similar steps in which the processors of the third,
through p-th row of the array broadcast in turn their elements of the matrix B,
and the elements of matrix A are circularly shifted to the left and downwards
by one position, cf. Fig.4c-d. After a total number of p such steps all partial
products which belong to the elements of the matrix C are accumulated in the
corresponding processors.
The algorithm is called semi-systolic as it involves a broadcast operation of
8
11a 22b22a 33b33a 44b44a
11b21a 32a 43a 14a
11b31a 22b42a 33b13a 44b24a
11b41a 22b12a 33b23a 44b34a
11b
11b
22b
22b
33b
33b
44b
44b
13a 42b24a 13b31a 24b42a
31b23a 34a 41a 12a
31b33a 42b44a 13b11a 24b22a
31b43a 42b14a 13b21a 24b32a
31b
31b
42b
42b
13b
13b
24b
24b
14a 12b21a 23b32a 34b43a
41b24a 31a 42a 13a
41b34a 12b41a 23b12a 34b23a
41b44a 12b11a 23b22a 34b33a
41b
41b
12b
12b
23b
23b
34b
34b
12a 32b23a 43b34a 14b41a
21b22a 33a 44a 11a
21b32a 32b43a 43b14a 14b21a
21b42a 32b13a 43b24a 14b31a
21b
21b
32b
32b
43b
43b
14b
14b
a) t=1 b) t=2
c) t=3 d) t=4
Fig. 4. Computation of partial products on a 2D APA.
the row elements of B in each step.
3.2.3 Semi-hyper-systolic algorithm
We next consider the semi-hyper-systolic variant which corresponds to the
semi-systolic algorithm just described. The initial distribution of data is shown
in Fig. 5. The distribution of the matrices A and C is the same as the one shown
in Fig. 3. The distribution of the matrix B is obtained from the distribution
shown in Fig. 3 by circular shifting of the elements of B in the i-th row of
the processor array by a stride of (i− 1)modK , with K = 2 for p = 4. In the
particular example with p = 4, the second and the fourth row of B are shifted
by one position.
The algorithm consists of p (4) steps as shown in Fig. 6. In every step, each
9
11b
22
22
22
a
b
c
33
33
33
a
b
c
44
44
44
a
b
c
21
14
21
a
b
c
32
21
32
a
b
c
43
32
43
a
b
c
14
43
14
a
b
c
31
31
31
a
b
c
42
42
42
a
b
c
13
13
13
a
b
c
24
24
24
a
b
c
41
34
41
a
b
c
12
41
12
a
b
c
23
12
23
a
b
c
34
23
34
a
b
c
11c
11a
Fig. 5. Initial distribution of data for the semi-hyper-systolic matrix product on a
2D APA.
processor multiplies the elements of the matrix A with the elements of the ma-
trix B which it receives via the associated broadcast line. The partial product
thus computed is accumulated in one of two local variables. The local vari-
ables represent elements of two auxiliary arrays C(1) and C(2) which distributed
across the processor array. They are used to accumulate partial results for the
computations of the elements of the matrix C. Similar to the original systolic
algorithm the processors in the first, second, . . . , p-th row broadcast the ele-
ments of the matrix B they contain to all the processors of the corresponding
columns in the first, second, . . . , p-th step of the algorithm, respectively, see
Fig. 6.
At the end of each step the matrix A is cyclically shifted downwards by one
position. Unlike the original algorithm the horizontal shift of the matrix A is
performed only every second step by a stride of two elements.
The algorithm is completed by elemental addition of the auxiliary arrays C(1)
and C(2) which is preceded by circular shifting of C(2) by one position to the
left.
3.2.4 Comparison to Cannon’s algorithm
The semi-hyper-systolic algorithm defined above can be compared with Can-
non’s algorithm. Each of the two algorithms has merits and shortcomings:
Cannon’s algorithm needs pre-skewing of the elements of the matrix A along
the rows and the elements of matrix B along the columns of the 2D processor
array which can be considered as a disadvantage since it gives rise to additional
interprocessor communication that is not evenly distributed across the proces-
sors. On the other hand, the operations to be executed by each processor in a
10
11b 22b 33b 44b
31b 42b 13b 24b 34b 41b 12b 23b
14b 21b 32b 43b
a) t=1 b) t=2
c) t=3 d) t=4
11a 22b22a 33b33a 44b44a11b
11b21a 32a 43a 14a22b 33b 44b
11b31a 22b42a 33b13a 44b24a
a a a a11b41 22b12 33b23 44b34
11a 21b22a 32b33a 43b44a14b
14b21a 32a 43a 14a21b 32b 43b
14b31a 21b42a 32b13a 43b24a
a a a a14b41 21b12 32b23 43b34
31b23a 34a 41a 12a42b 13b 24b
13a 42b24a 13b31a 24b42a31b
31b43a 42b14a 13b21a 24b32a
31b33a 42b44a 13b11a 24b22a
13a 41b24a 12b31a 23b42a34b
34b23a 34a 41a 12a41b 12b 23b
34b33a 41b44a 12b11a 24b22a
34b43a 41b14a 12b21a 23b32a
Fig. 6. The products shown in the upper and the lower halves of the processors are
accumulated in the auxiliary arrays C(1) and C(2).
given step of the algorithm are rather simple, comprising one multiplication,
one addition and two shift operations. Also, Cannon’s algorithm needs only
local processor interconnections.
In contrast, the semi-hyper-systolic algorithm as illustrated in Fig. 6 needs
broadcasting lines and more complex control of the operations which have
to be carried out by the processors: in a given step, each processor has to
broadcast an element to all other processors of the same column, and the
matrix A is circularly shifted in two directions, where in horizontal direction a
stride of 2 is required for every second step while in vertical direction a stride
of 1 is carried out in each step. On the other hand, this algorithm avoids
pre-skewing of the matrices.
11
3.3 Matrix Multiplication on a 1D Processor Array
Next we transform the semi-systolic algorithm discussed above into a fully
systolic one for a 1D processor array, in which a given processor executes in
sequence the tasks assigned so far to the processors working on a given column
of the 2D processor array. This procedure will eliminate the disadvantages of
the semi-hyper-systolic 2D array algorithm since no broadcasting lines are
needed, and moreover, control structures become simpler. Downward shifts
of A merely amount to a re-assignment within the systolic cell and do not
involve actual communication between cells. Note that the application of such
a transformation on the algorithm of Cannon will not have the same effect as
it would still be required to pre-skew one of the matrices.
3.3.1 Data alignment on a 1D processor array
In the case of a 1D processor array, we assign each column of the 2D array,
Fig. 3, to one processor. The resulting layout of the matrices A, B and C is
shown in Fig. 7. The dotted lines indicate the location of the elements in the
cells of the previous 2D array.
11b
22
22
22
a
b
c
33
33
33
a
b
c
44
44
44
a
b
c
21
21
21
a
b
c
32
32
32
a
b
c
43
43
43
a
b
c
14
14
14
a
b
c
31
31
31
a
b
c
42
42
42
a
b
c
13
13
13
a
b
c
24
24
24
a
b
c
41
41
41
a
b
c
12
12
12
a
b
c
23
23
23
a
b
c
34
34
34
a
b
c
11c
11a
Fig. 7. APA with 1D systolic array alignment.
3.3.2 1D systolic algorithm
Again we start with the systolic computation. The algorithm needs p (4) steps.
It does not require a broadcast of the matrix elements of the matrix B. In the
first step, the matrix elements b1,1, b2,2, . . . , bp,p of the matrix B which reside
in the first, second, . . . , p-th processor, respectively, are multiplied with the
elements of the respective columns of the matrix A. The products (see Fig. 8a)
are accumulated in the corresponding elements of the matrix C, according to
12
the arrangement shown in Fig. 7. At the end of the first step, the elements of
the matrix A are circularly shifted by one position to the left along the rows.
11a 22b22a 33b33a 44b44a
11b21a 32a 43a 14a
11b31a 22b42a 33b13a 44b24a
11b41a 22b12a 33b23a 44b34a
11b
22b 33b 44b
13a 42b24a 13b31a 24b42a
31b23a 34a 41a 12a
31b33a 42b44a 13b11a 24b22a
31b43a 42b14a 13b21a 24b32a
31b
42b 13b 24b
14a 12b21a 23b32a 34b43a
41b24a 31a 42a 13a
41b34a 12b41a 23b12a 34b23a
41b44a 12b11a 23b22a 34b33a
41b
12b 23b 34b
12a 32b23a 43b34a 14b41a
21b22a 33a 44a 11a
21b32a 32b43a 43b14a 14b21a
21b42a 32b13a 43b24a 14b31a
21b
32b 43b 14b
a) t=1 b) t=2
c) t=3 d) t=4
Fig. 8. APA for the systolic computation of partial products on a !D array.
In the second step, the second row of the matrix B is involved, see Fig. 8b.
Its elements are multiplied by the elements of A and the partial products are
accumulated in their proper locations, i.e., in the second step the products
are copied to the elements of C which are circularly assigned one position
downwards. Subsequently, A is circularly shifted to the left by one position.
The algorithm proceeds with similar steps in which the elements of the third,
through p-th row the matrix B are multiplied with the elements of A, the
products being assigned to a row of C at locations i− 1 positions downwards
the row in the i-th step, with the elements of matrix A being circularly shifted
to the left, cf. Fig. 8c-d. After a total number of p (4) steps all partial prod-
13
ucts which belong to the elements of the matrix C are accumulated in the
corresponding processors.
3.3.3 1D hyper-systolic algorithm
Let us turn now to the 1D realization of the hyper-systolic algorithm. The
initial distribution—as in the 2D case—requires a partial skewing of B along
the rows, see Fig. 9. The distribution of the matrix B is obtained from the
distribution shown in Fig. 7 by circular shifting of the elements of B in the
i-th row of the processor array by a stride of (i − 1)modK , with K = 2 for
p = 4.
11b
22
22
22
a
b
c
33
33
33
a
b
c
44
44
44
a
b
c
21
14
21
a
b
c
32
21
32
a
b
c
43
32
43
a
b
c
14
43
14
a
b
c
31
31
31
a
b
c
42
42
42
a
b
c
13
13
13
a
b
c
24
24
24
a
b
c
41
34
41
a
b
c
12
41
12
a
b
c
23
12
23
a
b
c
34
23
34
a
b
c
11c
11a
Fig. 9. Initial distribution of data for the hyper-systolic matrix product on a 1D
APA.
Step by step each processor multiplies the element of the matrix A it contains
with the corresponding element of the matrix B. The partial product thus
computed is accumulated in one of two local variables (K = 2). The products
computed in alternate steps are again stored alternately in one of the two local
variables, in an analogous fashion as for the systolic algorithm, i.e., in the i-
th step, the product is assigned to a row located i − 1 elements downwards,
cf. Fig. 10. We emphasize again that A is shifted only every second step in
horizontal direction to the left by two elements. For the general case with p
processors, we have to carry a shift by K˜ elements in every K-th step.
The algorithm is completed by elemental addition of the auxiliary arrays C(1)
and C(2) which is preceded by circular shifting of C(2) by one position to the
left.
14
a) t=1 b) t=2
c) t=3 d) t=4
11a 22b22a 33b33a 44b44a11b
11b21a 32a 43a 14a22b 33b 44b
11b31a 22b42a 33b13a 44b24a
a a a a11b41 22b12 33b23 44b34
11a 21b22a 32b33a 43b44a14b
14b21a 32a 43a 14a21b 32b 43b
14b31a 21b42a 32b13a 43b24a
a a a a14b41 21b12 32b23 43b34
31b23a 34a 41a 12a42b 13b 24b
13a 42b24a 13b31a 24b42a31b
31b43a 42b14a 13b21a 24b32a
31b33a 42b44a 13b11a 24b22a
13a 41b24a 12b31a 23b42a34b
34b23a 34a 41a 12a41b 12b 23b
34b33a 41b44a 12b11a 24b22a
34b43a 41b14a 12b21a 23b32a
Fig. 10. Hyper-systolic algorithm on a 1D array.
3.3.4 Summary
In a systematic design approach, we developed a matrix multiplication algo-
rithm on a 1D abstract processor array, starting from an algorithm on a 2D
APA. The algorithmic transformations led to a 1D hyper-systolic scheme
• that avoids broadcast lines required in the 2D case,
• that, given p processors, shows a complexity of interprocessor communica-
tion on the 1D APA which is equal to that of Cannon’s 2D algorithm,
• that avoids skewing operations and reordering.
So far, we illustrated the scheme using 4×4 matrices distributed column-wise
over 4 systolic cells. In section 5, we generalize the 4 × 4 problem to a p × p
system, distributed on a p processor array. In the case of the 4×4 matrices, the
15
shift constant was 2. For the general case, we shall introduce a stride K which
is carried out K˜ times, where the integer constants K and K˜ fulfil KK˜ = p.
4 Hyper-Systolic Bases
Before we present the general hyper-systolic matrix product, we define the
hyper-systolic algorithm for two data streams. Such a situation is given in the
computation of convolution and correlation problems. The scheme will later
be adapted to the matrix multiplication [22].
4.1 Hyper-Systolic Recipe
Let x and z be two 1D arrays both of length n. Assume that functions
Fi =
n⊕
j=1
f(xi, zj) (2)
for each i have to be computed, with ⊕ being an associative and commutative
operator. The computation can readily be carried out by usage of systolic
algorithms on a ring of systolic cells [20]. However, one can observe redundant
interprocessor communication in this process that can be removed in hyper-
systolic processing. The general recipe to find optimal hyper-systolic bases
reads:
Let the numerical problem be computable on two 1D systolic processor arrays
by use of a 1D systolic algorithm. Let the two 1D data streams x and z both
of length n be mapped onto themselves by two sequences of (circular) shifts
on the systolic array. The hyper-systolic scheme to compute the problem is
constructed as follows:
(1) For the systolic array x of length n, k replicas are generated by shifting
the array x k times by strides at, 1 ≤ t ≤ k, where all shifted arrays are
stored as intermediate elements on the cells as arrays xˆt, 1 ≤ t ≤ k.
For the systolic array z of length n, k′ replicas are generated by shifting
the array x k′ times by strides bt′ , 1 ≤ t′ ≤ k′, where all shifted arrays
are stored as intermediate elements on the cells as arrays zˆt′ , 1 ≤ t′ ≤ k′.
(2) The sequences of strides {at}, 1 ≤ t ≤ k and {bt′}, 1 ≤ t′ ≤ k′ are
determined such that
(a) all pairings of data elements are present at least once,
(b) the total communication cost is minimized.
16
(3) After each communication event the computations can be carried out and
the results are assigned to the corresponding intermediate result arrays
yˆt, 1 ≤ t ≤ k + 1. If elements occur more than once they are accounted
for by a multiplicity table in order to avoid multiple counting.
(4) One collector array y is moved by strides that follow the inverse of the
sequence Ak of strides of the initial phase. In each step of the back-shift
phase the required intermediate result arrays yˆt are added to y.
4.2 The Hyper-Systolic Optimization Problem
Parallel machines support logical 1D chains of processors in form of linear
arrays or rings. However, circular shifts along the 1D ring in general lead to
different hardware communication expenses for different strides.
The optimal sequence of strides for minimal interprocessor communication will
depend on the interprocessor communication cost for a given stride. In order
to minimize the communication cost effect on a given machine, we introduce
a cost function C(ai), as a function of the stride ai.
For the sake of argumentation, let us first assume the costs of communication
for each array x and z on the systolic ring to be constant for any stride ai, bi.
C(ai) = C(bˆi) = const.
Definition 1 - Optimization Problem for C(ai) = C(bˆi) = const.
Let I be the set of integers m = {0, 1, 2 . . . , n− 1} ∈ Nn0 , n ∈ N. Find the two
ordered multi-sets Ak = (a0 = 0, a1, a2, a3, . . . , ak) ∈ Nk+10 of k+1 integers and
Bk′ = (b0 = 0, b1, b2, b3, . . . , bk′) ∈ NK+10 of k′ + 1 integers, with k + k′ being
a minimum, where each m ∈ I, (0 ≤ m ≤ n− 1), can be represented at least
once as the sum of two ordered partial sums
m = (ai + ai+1 + . . .+ ai+j) + (bˆi + bˆi+1 + . . . + bˆi+jˆ), (3)
with
0 ≤ i+ j ≤ k, i, j ∈ N0, 0 ≤ iˆ+ jˆ ≤ k′, iˆ, jˆ ∈ N0. (4)
4.2.0.1 Lower bound on k + k′ A lower bound for the minimal number
of non-zero elements of Ak can be derived that will deliver optimal complexity.
Theorem 1 Let Ak and Bk′ be two bases solving the optimization problem for
the hyper-systolic algorithm with 2 arrays. Then the minimal length k + k′ is
17
given by
k = k′ =
√
n− 1. (5)
PROOF. The total number of combinations required is n2 as each element of
the first array must come into contact with the n elements of the second array.
Let the matrices H1 and H2 be realized by k−1 and k′−1 shifts respectively.
In that case each element of the first matrix can be combined with k′ elements
of the second matrix, therefore the possible number of combinations will be
nkk′. Given n = kk′, the minimum number of circular shifts k+ k′ is attained
for k = k′ =
√
n− 1. ✷
Therefore, the complexity for the interprocessor communication of a hyper-
systolic algorithm forC(ai) = C(bi) = const. is bounded from below by 3(
√
n−
1) shifts, where we have already included the costs for the back-shifts.
4.3 C(ai) and C(bˆi) 6= const
We now assume that the cost for a circular shift is a function of the strides ai.
The optimization problem of definition 1 is modified only slightly, however,
the construction of an optimal base can be quite complicated.
Definition 2 - Optimization Problem for C(ai) = C(bˆi) 6= const.
Let I be the set of integers m = {0, 1, 2 . . . , n− 1} ∈ Nn0 , n ∈ N. Find the two
ordered multi-sets Ak = (a0 = 0, a1, a2, a3, . . . , ak) ∈ Nk+10 of k + 1 integers
and Bk′ = (b0 = 0, b1, b2, b3, . . . , bk′) ∈ NK+10 of k′ + 1 integers, with the total
cost
Ctotal =
k∑
i=1
C(ai) +
k′∑
iˆ=1
C(bˆi) (6)
being a minimum, where each m ∈ I, (0 ≤ m ≤ n− 1), can be represented at
least once as the sum of two ordered partial sums
m = (ai + ai+1 + . . .+ ai+j) + (bˆi + bˆi+1 + . . . + bˆi+jˆ), (7)
with
0 ≤ i+ j ≤ k, i, j ∈ N0, iˆ + jˆ ≤ k′, iˆ, jˆ ∈ N0. (8)
18
4.4 Regular Bases
The 4 × 4 problem presented in Section 3 uses so-called regular bases. This
prescription turns out to be optimal for equal cost of any stride executed in
circular shift operations on the ring. Regular hyper-systolic bases are advan-
tageous as they require only two distinct strides.
Definition 3 - Regular Bases The regular bases are given by
Ak=K−1:=
(
0,1, 1, . . . , 1
)
︸ ︷︷ ︸
K − 1
,
Bk′=K˜−1:=
(
0,K, K, . . . , K
)
︸ ︷︷ ︸
K˜ − 1
, K × K˜ = n.(9)
The completeness of a base pair is defined in terms of the h-range of the base,
a notion borrowed from additive number theory [29,30]:
Theorem 2 The h-range of a regular base is n.
PROOF. Let
r := mmodK˜ → r < K, (10)
as KK˜ = n. There are K − 1 elements ai = 1 ∈ Ak. Thus any r with
0 ≤ r ≤ K − 1 r ∈ N0 can be represented as partial sum by the elements
ai = 1, ai ∈ Ak. The partial sums of Bk′ ,
j∑
l=i
bl = K
j∑
l=i
1 = (j − i+ 1)K < n, (11)
are integer multiples of K. Adding the partial sums to r we can therefore
represent any element m ∈ I. Thus, the h-range of the base pair Ak=K−1 and
Bk′=K˜−1 is n, i.e., the base pair is complete. ✷
Theorem 3 The lower bound to the minimal length of the regular bases for
a given h-range n is K = K˜ =
√
n.
PROOF. The regular base Ak is complete.
k = K + K˜ − 1→ k = K + n
K
− 1. (12)
19
Differentiation gives K =
√
n. ✷
Theorem 4 The communication gain factor R that compares the regular hyper-
systolic to the systolic algorithm is:
R =
n− 1
2K + K˜ − 3 ≈
√
n
3
. (13)
PROOF. Let K = K˜ =
√
n. One needs K − 1 shifts by 1 and K˜ − 1 shifts
by K in forward direction and again K − 1 shifts by 1 in backward direction,
respectively; therefore, the total number of shifts required turns out to be
T = (2K + K˜ − 3). (14)
The standard systolic computation requires n− 1 shifts altogether. ✷
5 Hyper-Systolic Matrix Product
Next we present the general formulation of the systolic and hyper-systolic
matrix product in terms of a pseudo-code formulation. The size of the matrices
is p× p and the 1D processor array consists of p nodes.
5.1 Systolic Algorithm
The systolic version of the matrix product of two matrices A and B is given
in Algorithm 1. The matrices are represented in skew order.
Algorithm 1 Systolic matrix-matrix multiplication.
foreach processor i = 1 : p ∈ systolic array
for j = 1 : p
for l = 1 : p
cl,i = cl,i + cshift-col
1−j
p (al,i) bj,i
end for
for k = 1 : p
ak,i = cshift-row
1
p(ak,i)
end for
end for
end foreach
20
We have simplified the representation of data movements and assignments by
introduction of two functions:
cshift-row: horizontal circular shift of data by a stride of k on a ring of cells
numbered from 1 to n, cshift-row involves interprocessor communication.
cshift-rowkn(aj,i) := aj,(i+k−1+n)modn+1. (15)
cshift-col: vertical circular shift by a stride of k for the vector of n elements
within the systolic cells. cshift-col only amounts to memory assignments.
cshift-colkn(aj,i) := a(j+k−1+n)modn+1,i. (16)
The algorithm is completely regular. Each cell executes one compute operation
together with an assignment followed by a circular shift of the matrix A in
each systolic cycle. The skew order is not destroyed during execution of the
algorithm. Note that for each processor inner cell assignment operations (col-
cshift) are executed using equal strides in a given step of the parallel algorithm.
Hence, one global address suffices and further address computations are not
required!
5.2 Hyper-systolic Algorithm
It has been already noticed in section 3.3, that the complexity of the sys-
tolic computation is not competitive with Cannon’s algorithm. However, the
hyper-systolic computation belongs to the same complexity class as Cannon’s
algorithm. It is fully pipelined and parallel, and does not require any skewing
steps to align or re-align matrix elements.
5.2.1 Regular Bases
We employ the regular bases constructed for the hyper-systolic system. We
add a third base C to account for the back shifts:
Ak=K˜−1 =
(
0, K, K, . . . , K
)
Bk′=K−1 =
(
0, −1, −1, . . . ,−1
)
Ck′=K−1 =
(
0, 1, 1, . . . , 1
)
. (17)
21
5.2.2 Hyper-systolic matrix multiplication
The hyper-systolic matrix multiplication, as given in Algorithm 2, proceeds
within three steps. In the first part of the algorithm, matrix B is shifted K−1
times by strides of 1 along the systolic ring and stored as Bi, 0 ≤ i ≤ K − 1.
However, as motivated above, for the case of matrix products, we can spare
communication: it suffices to shift B in K˜ row blocks of K rows each, where
within each block the first row is shifted by a stride of 0 and the last by a
stride of K − 1.
Algorithm 2 Hyper-systolic matrix multiplication.
foreach processor i = 1 : p ∈ systolic array
for j = 1 : p ! pre-shift of matrix B
bj,i = cshift-row
[(1−j)modK ]
p (bj,i)
end for
for j = 1 : K˜ − 1 ! multiplication and shift of matrix A
for l = 1 : K
for n = 1 : p
cln,i = c
l
n,i + cshift-col
[1−(j−1)K−l]
p (an,i) b[(j−1)K+l],i
end for
end for
for l = 1 : p
al,i = cshift-row
K
p (al,i)
end for
end for
for l = 1 : K
for n = 1 : p
cln,i = c
l
n,i + cshift-col
[1−(K˜−1)K−l]
p (an,i) b[(K˜−1)K+l],i
end for
end for
for j = 1 : K − 1 ! back shift and accumulation
for l = 1 : p
c
K−j
l,i = c
K−j
l,i + cshift-row
1
p(c
K−j+1
l,i )
end for
end for
end foreach
After the preparatory shifts of B, the computation starts. K˜ times, the multi-
plication of A with K rows of the pre-shifted matrix B is carried out. After each
step, A is moved to the left by a shift of stride K. The result is accumulated
within K matrices Ci.
22
Finally, the K intermediate result matrices Ci are shifted back according to
base Ck′ while summed up to the final matrix C. The algorithm is very regular.
The skew order is not destroyed during execution, and in any stage, only global
addresses are required.
5.2.3 Complexity
The gain factor for the matrix product reads (note that matrix B is only
partially shifted):
Theorem 5 The gain factor R that compares the regular hyper-systolic matrix
multiplication to the systolic algorithm is
R =
p− 1
K + K˜ − 1 ≈
√
p
2
. (18)
PROOF. One needs 1 shift of the full matrix B, K˜− 1 shifts by K of matrix
A and again K − 1 shifts by 1 of matrix C. Therefore, the total number of
shifts required is
T = (K + K˜ − 1). (19)
The standard systolic computation requires p− 1 shifts of the matrix A. For
the K = K˜ =
√
p, R ≈
√
p
2
. ✷
5.2.4 Comparison to Cannon’s algorithm
In order to compare Cannon’s algorithm and the hyper-systolic matrix multi-
plication, we consider a 2D
√
p×√p processor array, where Cannon’s algorithm
is carried out, and a 1D ring array of p processors where the hyper-systolic
matrix product algorithm is computed.
The total number of shift operations of Cannon’s algorithm is 2
√
p − 2 =
2K − 2, while the number of shift operations for the hyper-systolic algorithm
was 2K−1. Thus, the complexities in terms of circular shift operations of both
algorithms are equal for large p. We will see below how this fact translates
into execution times on mesh and grid based machines.
23
6 Mapping on Parallel Systems
So far we discussed the generic situation of the matrix dimension p being
equal to the number of processors p. We now turn to the general case of
square n×m-matrices with n, m > p.
In order to map the systolic system onto the parallel implementation machine
we choose hierarchy mapping of the systolic array onto the processors with the
option for two different strategies, block and cyclic assignment.
6.1 Block Mapping
The block assignment is applied in all standard algorithms, as it allows to
exploit local BLAS-3 routines, like dgemm, by which a very high efficiency
of local computations can be achieved. While a small block of the matrix A
is hold in the cache or in the registers (thus avoiding cache-to-memory data
transfer), in turn, only the columns of B and C must be exchanged, and all
computations in which the given part of A is involved can be carried out.
In this way, the ratio between the number of computations and the cache-to-
memory traffic is minimized to nearly
2l3
(3nn + n2)
=
l
2
(20)
floating point operations per word for real data, with l being the dimension of
the sub-block. Asymptotically, the full speed of the CPU should be exploitable.
A n×m-matrix M is divided into p× p blocks of size
(
n
p
× m
p
)
or
(
m
p
× n
p
)
,
M→ Mi,j, i = 1, . . . , p, j = 1, . . . , p. (21)
The multiplication ofA andB proceeds via sub-matrix multiplication denoted
as (⊗):
Ci,j =
p∑
k=1
Ai,k ⊗ Bk,j, i = 1, . . . , p, j = 1, . . . , p,
C=AB. (22)
Altogether a system of p × p of such blocks is assigned to the p processor
array. Now we can use each sub-matrix in the same manner as the scalar
24
matrix elements before. Therefore, the p× p system of sub-matrices has to be
row-skewed for A and column-skewed for B.
6.2 Cyclic Mapping
Each block is distributed across the processors as described above for the
generic case. The blocking of the n×m-matrix M into
(
n
p
× m
p
)
blocks, leads
to blocks of size p× p,
M→ Mi,j, i = 1, . . . , n
p
, j = 1, . . . ,
m
p
. (23)
The multiplication of A and B proceeds via block-multiplication, (⊗):
Ci,j =
m
p∑
k=1
Ai,k ⊗ Bk,j, i = 1, . . . , n
p
, j = 1, . . . ,
n
p
,
C=AB. (24)
A skew representation is required for all blocks Mi,k separately. Cyclic mapping
leads to a system of n
p
× m
p
systolic processes that run in parallel.
Cyclic assignment allows us to reduce the memory overhead of hyper-systolic
computations. In general, K full intermediate matrices C are necessary. Using
cyclic mapping, one can organize the computation in such a way that only
one row of the blocks of the intermediate matrix C must be stored in a given
phase of the algorithm. All the required shifts of the given part of A can be
carried out while this part of A will not be involved in a further computation.
Eventually, the corresponding row of C is shifted back and accumulated.
A second interesting feature of cyclic mapping is the possibility to distribute
the very blocks. This approach is interesting for machines with a hybrid ar-
chitecture like the proposed Italian PQE2000 system [31]. In this approach,
the rows of A and the columns of B are assigned to different processors each.
6.3 Block-Cyclic Mapping
One can combine block and cyclic mapping in a hybrid scheme that combines
the advantages of both approaches. A good strategy is to choose the block
size of the block mapping such that it is optimal for “local” BLAS-3. For the
cyclic part one ends up with blocks of size p × p, with the entries being the
BLAS-3 blocks.
25
7 Implementation Issues
The class of algorithm presented can be useful on any type of massively parallel
system with distributed memory. Mesh and grid-based connectivities might
benefit as well as large work-station clusters.
In the previous section, we have presented complexities of interprocessor com-
munication in terms of circular shifts, irrespective of the actual communica-
tion time of a shift. This time is a constant on workstation clusters usually
connected via Ethernet, therefore complexities in terms of circular shifts will
translate into real time in a straightforward way. On mesh and grid based
machines 1D rings in general can be realized as a subset of their system con-
nectivity.
As an example taken from real life, we have implemented a level-3 PBLAS
code on the APE100 parallel computer. So far, on APE100 lack of indexed
addressing has hindered an effective scalable implementation of PBLAS [32].
In our approach, we made use of a combination of block and cyclic mapping.
Thus, we are able to use local BLAS, exploiting the CPU with high efficiency,
and to reduce the memory overhead of the hyper-systolic algorithm.
On APE100, for real data, the optimal elementary blocks are of size 6×6. For
complex data, the size is 4 × 4. The full matrix is blocked to p × p matrices
which are distributed on the ring and the elements of which are the elementary
blocks. Only the p × p matrices are skew, the elementary matrices remain in
performance/peak performance [%]
0
10
20
30
40
50
60
0 500 1000 1500 2000 2500 3000 3500
matrix dimension
sys
hysreal
0
10
20
30
40
50
60
70
0 500 1000 1500 2000 2500 3000
performance/peak performance [%]
matrix dimension
hys
sys
Fig. 11. Performance of systolic and hyper-systolic PBLAS-3 (block-cyclic mapping)
on a 128-node APE100, for real and complex data.
26
normal order.
We have benchmarked a 128-node APE/100Quadrics QH1 using real and com-
plex data. Fig. 11 shows the performance results.
The theoretical peak performances (single node!) for Quadrics are 63 % for real
data and 88 % for complex data, as can be inferred from the maximal ratio of
computation vs. memory-to-register data transfer times. Hyper-systolic matrix
multiplication leads to a peak performance of 65 % of peak speed, which
translates into 75 % of the theoretical performance.
8 Summary
The 1D hyper-systolic matrix multiplication algorithm is a promising alter-
native to 2D matrix product algorithms. With equal complexity as standard
methods, the hyper-systolic algorithm avoids non-regular communication and
indexed local addressing. Hence, the hyper-systolic matrix product scheme is
universal: it is applicable on any type of parallel system, even on machines
that cannot compute local addresses. The method preserves the alignment of
the matrices in the course of the computation. Besides the fact that trans-
posed matrix products can be carried out on the same footing, alignments for
the optimal hyper-systolic algorithm are efficient for matrix-vector operations
as well.
So far, in a feasibility study, the hyper-systolic matrix product has been imple-
mented on APE100/Quadrics systems [33,34]. We will present details of the
implementation on this system for matrix and transposed matrix products in
Ref. [35].
References
[1] J. Choi, J. J. Dongarra, and D. W. Walker: “The Design of Scalable Software
Libraries for Distributed Memory Concurrent Computers”, in: J. J. Dongarra
and B. Tourancheau (eds.): Environments and Tools for Parallel Scientific
Computing (Elsevier, 1992).
[2] Th. Lippert, A. Seyfried, A. Bode, K. Schilling: ‘Hyper-Systolic Parallel
Computing’, IEEE Trans. on Parallel and Distributed Systems 9 (1998) 1.
[3] A. Galli: ‘Generalized Hyper-Systolic Parallel Computing’, preprint server
hep/lat, URL: http://xxx.lanl.gov/ps/hep-lat/9509011
27
[4] G. C. Fox and S. W. Otto: ‘Matrix Algorithms on a Hypper-Cube I: Matrix
Multiplication’, Parallel Computing 4 (1987) 17-31.
[5] N. Petkov: Systolische Algorithmen und Arrays (Berlin: Akademie-Verlag,
1989).
[6] N. Petkov: Systolic Parallel Processing (Amsterdam: North-Holland, 1993).
[7] H.T. Kung and C. E. Leiserson: “Systolic arrays (for VLSI)”, Sparse Matrix
Proc. 1978 (Society for Industrial and Applied Mathematics, 1979) pp.256-282;
the same as “Algorithms for VLSI processor arrays”, in C. Mead and L. Conway:
Introduction to VLSI Systems (Reading, MA: Addison-Wesley, 1980) sect.8.3.
[8] H.T. Kung: “Why systolic architectures”, Computer 15 (1981) pp.37-47.
[9] P.R. Cappello and K. Steiglitz: “Unifying VLSI design with geometric
transformations”, Proc. Int. Conf. Parallel Processing (1983) 448-457.
[10] P.R. Cappello: “Space time transformation of cellular algorithms”, in E.E.
Swartzlander (ed.), Systolic Signal Processing Systems (N.Y., Basel: Dekker,
1987), 161-208.
[11] P. Quinton: “Automatic synthesis of systolic arrays from uniform recurrent
equations”, Proc. 11th Annual Int. Symp. Comput. Archit., Ann Arbor, Mich.,
1984 (IEEE, N.Y., 1984) pp. 208-214.
[12] D.I. Moldovan: ”On the analysis and synthesis of VLSI algorithms”, IEEE
Trans. on Computers C-31 (1982) 1121-1126.
[13] D.I. Moldovan: “On the design of algorithms for VLSI systolic arrays”, Proc.
IEEE 71 (1983) 113-120.
[14] P. Clauss, G. R. Perrin: “Optimal Mapping of Systolic Algorithms by Regular
Instruction Shifts”, IEEE International Conference on Application-Specific
Array Processors, ASAP (1994) 224-235.
[15] A. Darte, Y. Robert: “Affine-by-Statement scheduling of uniform and affine loop
nests over parametric domains”, Journal of Parallel and Distributed Computing,
29 (1995) 43-59.
[16] P. Clauss , V. Loechner, “Parametric analysis of polyhedral iteration spaces”,
IEEE International conference on Application Specific Array Processors, ASAP,
1996.
[17] T. Dontje, Th. Lippert, N. Petkov and K. Schilling: “Statistical analysis of
simulation-generated time series: Systolic vs. semi-systolic correlation on the
Connection Machine”, Parallel Computing 18 (1992) 575-588.
[18] N. Petkov: ‘Fuzzy number subtraction convolution on the CM-2’, Int. J. of Mod.
Phys. C4 (1993) 181-196.
[19] N. Petkov: ‘Fuzzy number subtraction convolution on the CM-2’, in: Th.
Lippert, K. Schilling and P. Ueberholz (eds.) Science on the Connection
Machine (Singapore: World Scientific, 1993), pp. 181-196.
28
[20] Th. Lippert, U. Glaessner, H. Hoeber, G. Ritzenho¨fer, K. Schilling, and
A. Seyfried: ‘Hyper-Systolic Processing on APE100/Quadrics, I. n2-loop
computations’, Int. Jour. Mod. Phys. C 7 (1996) 485.
[21] G. Meylan and D. C. Heggie: ‘Internal Dynamics of Globular Clusters’,
preprint HEP-ASTRO, http://xxx.lanl.gov/multi, in press in The Astronomy
and Astrophysics Review.
[22] Th. Lippert: ‘Hyper-Systolic Parallel Computing—Theory and Applications’,
PhD-thesis, University of Groningen, 1998.
[23] J. O. Eklundth: ‘A fast Computer Method for Matrix Transposing’, IEEE Trans.
on Computers C 21 (1972) 801-803.
[24] F. DePrez and M. Pourzandi: ‘A Comparison of three Parallel Matrix Product
Algorithms’, Proc. of the Int. Conf. Advances in Numerical Methods &
Applications, Sofia (1994) 234-244.
[25] H. Gupta and P. Sadayappan: ‘Communication Efficient Matrix Multiplication
on Hypercubes’, Parallel Computing 22 (1996) 75-99.
[26] L. E. Cannon: ‘A Cellular Computer to Implement the Kalman Filter
Algorithm’, PhD Thesis, Montana State University, 1969.
[27] V. Kumar, A. Grama, A. Gupta, and G. Karypis: Introduction to Parallel
Computing (Redwood City: Benjamin/Cummings, 1994).
[28] ‘High Performance Fortran Language Specification,’ Rice University, version 1.1
November 1994. ‘High Performance Fortran’, Scientific Programming, 2 (1993).
[29] M. Djawadi and G. Hofmeister: ‘The Postage Stamp Problem’, Mainzer
Seminarberichte, Additive Zahlentheorie 3 (1993) 187.
[30] R. K. Guy, Unsolved Problems in Number Theory, (Springer-Verlag, Berlin,
New-York, 1994).
[31] URL:(http://www.sede.enea.it/∼hpcn/moshpce/hpcn01e.html)
[32] M. Beccaria, G. Cella, A. Ciampa, G. Curci, and A. Vicere´: ‘Matrix Inversion
on APE100 Machines’, Preprint IFUP-TH 17/95.
[33] Th. Lippert and K. Schilling: ‘Hyper-Systolic Matrix Multiplication’, in: H.
R. Arabnia (ed.), Proceedings of the International Conference on Parallel and
Distributed Processing Techniques and Applications, PDPTA ’96, Sunnyvale,
California, USA, - August 9 - 11, 1996, (CSREA, 1996), pp. 919-930.
[34] Th. Lippert, N. Petkov, and K. Schilling: ‘BLAS-3 for the Quadrics Parallel
Computer’, in: B,. Hertzberger and P. Sloot (eds.), Proceedings of the
International Conference on High Performance Computing and Networking,
HPCN ’97, Vienna, Austria, April 1997, (Springer, Berlin, 1997) pp. 332-341.
919-930.
[35] M. Coletta, Th. Lippert, P.Palazzari, N. Petkov, and K. Schilling: to appear.
29
