The Parallelization of Level 2 and 3 BLAS Operations on Distributed Memory Machines by Aboelaze, M. et al.
Purdue University 
Purdue e-Pubs 
Department of Computer Science Technical 
Reports Department of Computer Science 
1991 
The Parallelization of Level 2 and 3 BLAS Operations on 
Distributed Memory Machines 
M. Aboelaze 
N. P. Chrisochoides 
Elias N. Houstis 
Purdue University, enh@cs.purdue.edu 
C. E. Houstis 
Report Number: 
91-007 
Aboelaze, M.; Chrisochoides, N. P.; Houstis, Elias N.; and Houstis, C. E., "The Parallelization of Level 2 and 
3 BLAS Operations on Distributed Memory Machines" (1991). Department of Computer Science Technical 
Reports. Paper 856. 
https://docs.lib.purdue.edu/cstech/856 
This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. 
Please contact epubs@purdue.edu for additional information. 
The Parallelization of Level 2 and 3 BLAS Operations on Distributed Memory Machines





M. Aboelaze*
North York, Ontario, Canada M3J 1P3
N. P. Chrisochoides†, E. N. Houstis‡, C. E. Houstis§
Purdue University
Computer Science Department





*This work was supported in part by a grant from the Natural Sciences and Engineering Research Council of Canada, number NSERC-OGP0043688.
†This work was supported in part by AFOSR 88-0234, ARO grant DAAG29-83-K-0026.
‡This work was supported in part by NSF grant CCF-8619817 and ESPRIT project GENESIS.
§This work was completed while on sabbatical at the C.S. Department of Purdue University. It was partly supported by a NATO grant and NSF grant CCF-8619817.

We present the parallelization of band matrix-vector and band matrix-matrix product operations on distributed memory multiprocessor systems that support mesh and ring interconnection topologies. Our approach eliminates synchronization delay and minimizes the communication overhead among processors. Three of these operations have been implemented on the NCUBE-6400 with a 64-processor configuration. The high efficiency and scalability of the algorithms have been demonstrated. Efficiencies of 98% for matrix-vector multiplication and 94% for matrix-matrix multiplication have been observed for the case
of dense matrices. For the case of the band matrix-vector multiplica-




1 Introduction
The most important objectives in designing algorithms/software for multiprocessor systems include the minimization of i) the so-called edge contention (more than one path sharing one or more links, [Bokh 90], [Chri 90]), ii) the amount of data transferred between processors, and iii) the synchronization delay. It has been observed that the minimization of the cost functions corresponding to the above three design objectives depends on the way the associated computation graph is decomposed. For general graphs, this problem is NP-complete [Gare 79].
For the case of well-structured computations, special-purpose algorithm/architecture pairs known as systolic arrays were suggested [Kung 82], [Mold 82], [Mira 84], [Chen 87]. These architectures consist of simple processing elements (PEs) which are capable of performing one arithmetic operation. In systolic computations, the decrease of edge contention and synchronization is achieved by mapping the computation graph onto a systolic array such that the correct data are in the correct place at the appropriate time. This scheduling strategy usually results in minimizing the time any PE spends waiting to receive the required data.
In this paper we propose to develop systolic-type techniques to design faster algorithms/software for MIMD coarse-grain computation/architecture pairs. To test these approaches we have chosen to parallelize some primitive linear algebra operations, mainly band matrix operations. The dense matrix operations have been studied extensively in [Fox 87], [Cher 88], [Bern 89], and others. In section 2, we review some of the known matrix multiplication algorithms and their complexity for various architectures. Section 3 briefly describes the essential characteristics of the NCUBE-6400. For completeness, in section 4 we present the matrix multiplication algorithms for dense matrices and their performance. In section 5 the proposed matrix multiplication algorithms for banded matrices and their performance are presented. The results indicate an efficiency of up to 98% for matrix-vector operations and 94% for matrix-matrix operations on a 64-processor NCUBE-6400 configuration with one Mbyte of memory per processor. Moreover, they indicate the scalability of the band matrix-vector multiplication. We expect to have performance data from the iPSC2/860 by the time of publication.
2 Overview of Parallel Matrix Multiplication Algorithms
Matrix and vector operations are at the core of many important scientific
computations. Many problems in physics, mathematics, engineering and
chemistry can be formulated as matrix-vector operations. A lot of effort is
dedicated to finding an efficient method for multiplying matrices and vectors.
In this section we review some attempts in this field. Our list is by no means
complete. We only summarize work closely related to our work.
Fox et al. [Fox 87] proposed techniques for matrix multiplication. Their method depends on partitioning the matrix into square or rectangular subblocks. These blocks are distributed among the processors. At the end of the matrix multiplication operation, the product matrix is distributed among the processors in the same fashion. The algorithms exploit the mesh architecture embedded in any hypercube architecture. They also use broadcasting for communicating some of these data blocks. In this paper, we try to avoid any broadcasting and make all communication local between neighboring processors.
Dekel, Nassimi, and Sahni [Deke 81] proposed a matrix multiplication algorithm for cube-connected and perfect shuffle computers. They used $N^2 m$ processors to multiply two $N \times N$ matrices in $O(\frac{N}{m} + \log m)$ time. They also showed how $m^2$, $1 \le m \le N$, processors can be used to multiply two $N \times N$ matrices in $O(\frac{N^2}{m} + m(\frac{N}{m})^{2.61})$ time. This method is efficient for multiplying dense matrices, but it will not be very efficient for a vector or a band matrix.
Johnsson [John 85] presented algorithms for dense matrix multiplication and for Gauss-Jordan and Gaussian elimination. His algorithms can run on any boolean cube or torus computer. They achieve 100% processor utilization except for a latency period $T_{latency} = O(n)$ on an $n$-cube system. In [John 89], Johnsson et al. presented a data-parallel matrix multiplication algorithm. Their algorithm was implemented on the Connection Machine CM-2; their implementation has a peak overall performance of 5.8 GFLOPS.
Independently, Cherkassky et al. [Cher 88], Berntsen [Bern 89], and Aboelaze [Aboe 89] improved Fox's algorithm for dense matrix multiplication.
$$T = \frac{N^3}{P}\,\tau + \frac{N^2}{\sqrt{P}}\,t_{transf} + (\sqrt{P} - 1)\,t_{start}$$
where $P$ is the number of processors, $\tau$ is the time for one addition and multiplication, and $t_{transf}$, $t_{start}$ are machine-dependent communication parameters. Berntsen's second idea was to partition the hypercube into a set of subcubes and to use the cascaded sum algorithm to add up the contributions to the final matrix. His idea also reduced the asymptotic communication cost, at the expense of extra memory per processor.
Most of the previous work on this subject is not efficient for band matrix operations. The algorithms for dense matrices presented in [Fox 88], [Cher 88], [Bern 89], and [Aboe 89] require $P$ and $\sqrt{P}$ iteration steps to compute $c = Ab$ and $C = C + AB$ respectively; each iteration step requires $\frac{N}{P} t_{transf} + t_{start}$ and $\frac{N^2}{P} t_{transf} + t_{start}$ communication time respectively. In this paper, we present two algorithms for operations on a band matrix $A \in R^{N \times N}$ with bandwidth $w$. The first algorithm multiplies $A$ by $b$, where $b \in R^N$. The second algorithm multiplies $A$ by $B$, where $B \in R^{N \times N}$ has bandwidth $\delta$. The first algorithm requires $w$ iteration steps, with each iteration requiring $\frac{N}{P} t_{transf} + t_{start}$ communication time. The second algorithm requires $w + \delta - 1$ iteration steps, with each iteration step requiring $\frac{N}{P} t_{transf} \min(w, \delta)$ communication time. Both algorithms communicate only between neighboring processors and minimize synchronization delay.
3 The NCUBE-6400
In this section we review the characteristics of the NCUBE-6400 [NCUBE 90] that are relevant to our analysis. We also give performance measurements taken on the NCUBE. The NCUBE-6400 is a hypercube-interconnected multiprocessor system in which a unique binary ID is assigned to each processor of the network. Two vertices in the network are called neighbors iff their binary representations differ in exactly one bit. Each processor has its own memory and works independently of the others. Processors exchange data by sending messages to each other. The exchange of messages is based on circuit-switching logic. When two nodes have to communicate, a fixed path is set up between them. The message flows through this path without interrupting the intervening processors. The path between the two nodes (source and destination) is established by a fixed routing strategy. The source node sends an address packet of 32 bits to a channel. The address is passed from node to node. Each processor compares its own binary ID with the address packet; if the processor is the destination, the path has been established, otherwise it forwards the packet to the neighbor processor whose binary ID most closely matches the binary ID of the destination. The time (in $\mu$sec) to communicate a zero-byte message in a network without edge contention is:
$$T_{initial} = 2.84 \times d + 132.87 \tag{3.1}$$
where $d$ is the length of the communication path [Chri 90]. The claimed [NCUBE 90] time to transfer $k$ real numbers through an established path without edge contention is:
$$T_{transf} = k \times T_{packed} \tag{3.2}$$
where $T_{packed}$ = port speed of a link in the path = 36 machine cycles. With a 40 MHz external clock [NCUBE 90], each cycle is about 50 ns. Hence
$$T_{transf} = 1.8\ \mu\text{sec} \tag{3.3}$$
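The communication model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the path length $d$ is taken to be the Hamming distance between the two binary IDs (which is what a routing scheme that corrects one differing bit per hop would give), the 1.8 $\mu$sec of (3.3) is read as the cost per real number, and the total message time is taken as the sum of (3.1) and (3.2). The function names are hypothetical.

```python
def hop_distance(src: int, dst: int) -> int:
    """Number of differing bits between two node IDs (assumed path length d)."""
    return bin(src ^ dst).count("1")


def are_neighbors(src: int, dst: int) -> bool:
    """Two nodes are neighbors iff their binary IDs differ in exactly one bit."""
    return hop_distance(src, dst) == 1


def message_time_usec(src: int, dst: int, k: int) -> float:
    """Estimated time in microseconds to send k real numbers from src to dst.

    Assumption: total = T_initial(d) from (3.1) + k * 1.8 from (3.2)/(3.3),
    in a network without edge contention.
    """
    d = hop_distance(src, dst)
    t_initial = 2.84 * d + 132.87
    t_transfer = 1.8 * k
    return t_initial + t_transfer


if __name__ == "__main__":
    # Nodes 5 (000101) and 7 (000111) differ in one bit, so they are neighbors.
    print(are_neighbors(0b000101, 0b000111))
    # Ten real numbers sent across the full diameter of a 64-node hypercube.
    print(message_time_usec(0, 0b111111, 10))
```

Under this model the start-up term dominates short messages, which is why the algorithms in the following sections try to keep every transfer between direct neighbors.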
The floating-point performance of the node was measured to be in the range
$$t_{flop} = 1.28\ \mu\text{sec} \rightarrow 1.40\ \mu\text{sec} \tag{3.4}$$
The time to perform the operation c(i) = c(i) + a(i,j) * b(j) was measured to be
$$t_{1oper} = 3.852\ \mu\text{sec} \tag{3.5}$$
The time to perform the operation c(i) = c(i) + a(i,j) * b(id(i,j)) was measured to be
$$t_{2oper} = 7.915\ \mu\text{sec} \tag{3.6}$$
The time to perform the operations rsum = rsum + a(i,j) * b(j,k) and c(i,j) = c(i,j) + rsum was measured to be
$$t_{3oper} = 3.933\ \mu\text{sec} \quad \text{and} \quad t_{4oper} = 7.632\text{E-}03\ \mu\text{sec} \tag{3.7}$$
The time to perform the operations rsum = rsum + a(i,k) * b(ida(i,k), j) and c(i,j) = c(i,j) + rsum was measured to be
$$t_{5oper} = 7.639\ \mu\text{sec} \quad \text{and} \quad t_{6oper} = 7.632\text{E-}03\ \mu\text{sec} \tag{3.8}$$
For distributed memory multiprocessor systems, fixing the problem size creates a constraint, since large data sets cannot fit on a single processor. In such cases the scaled speedup can be computed [Gust 88] either as:
$$SclSpUp_1 = \frac{\text{Mflops using } P \text{ processors}}{\text{Mflops using a single processor}} \tag{3.9}$$
or as:
$$SclSpUp_2 = \frac{P \times T_{\text{work done by } P \text{ processors}} - T_{\text{work that would not be done by a serial processor}}}{T_{\text{work done by } P \text{ processors}}} \tag{3.10}$$
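As a small worked illustration of the two scaled speedup definitions, the following Python sketch evaluates (3.9) on the Mflops figures reported in Table 1 and provides (3.10) as a function. The function and variable names are hypothetical, and the efficiency helper uses the usual definition (speedup divided by the number of processors), which is an assumption rather than a formula stated in the text.

```python
def scaled_speedup_1(mflops_p: float, mflops_1: float) -> float:
    """Equation (3.9): Mflops on P processors over Mflops on one processor."""
    return mflops_p / mflops_1


def scaled_speedup_2(p: int, t_parallel: float, t_extra: float) -> float:
    """Equation (3.10): (P * T_parallel - T_extra) / T_parallel.

    t_extra is the time spent on work a serial processor would not do
    (for example communication); both times must use the same unit.
    """
    return (p * t_parallel - t_extra) / t_parallel


def efficiency(speedup: float, p: int) -> float:
    """Assumed standard definition: speedup per processor."""
    return speedup / p


if __name__ == "__main__":
    # (N, Mflops on 1 processor, Mflops on 64 processors) as listed in Table 1.
    table_1 = [(320, 0.439, 7.359), (640, 0.446, 17.048), (1600, 0.447, 28.292)]
    for n, one, many in table_1:
        s = scaled_speedup_1(many, one)
        print(n, round(s, 2), round(efficiency(s, 64), 3))
```

The ratios computed this way agree with the SclSpUp1 column of Table 1 to two decimals.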
4 Parallelization of level 2 and 3 BLAS for dense matrices
The algorithms we are describing in this paper are suitable for multiprocessor
systems with the following properties:
1. P identical processors, each with local memory,
2. the interconnection of the processors supports at least mesh and ring
topology with wrap around,
3. the time to communicate $k$ real words between two processors is $t_{start} + t_{transf}\,k$, where $t_{start} = t_0 + t_1 d$ and $d$ is the distance between the processors in the interconnection topology,
4. the time to perform a floating-point multiply or add is $\tau$.
4.1 Dense Matrix x Vector Multiplication
In this section we examine the matrix-vector multiplication $c = \alpha c + \beta A b$, where $A \in R^{N \times N}$, $b \in R^N$ and $c \in R^N$ are column vectors, and $\alpha$, $\beta$ are scalars. The PEs are interconnected as a wrap-around linear array. The interconnection of the PEs and the distribution of the input are shown in Figure 1. Each PE $i$ computes the corresponding $c_i$. As can be seen, the matrix $A$ is divided into submatrices $A_{i,j} \in R^{\frac{N}{P} \times \frac{N}{P}}$, where $P$ is the number of processors, and the vector $b$ is divided into subvectors $b_i \in R^{\frac{N}{P}}$. Each processor contains one row of submatrices and one subvector.
Figure 1: The interconnection network and distribution of the input
Algorithm





Without any loss of generality we assume $\alpha = \beta = 1$.
The algorithm performs $P$ iterations. In each iteration a partial sum of $c_i = \sum_{j} A_{i,j} b_j$ (equation (4.1)) is accumulated. The algorithm starts by multiplying $A_{i,i}$ by $b_i$. Then every processor sends the part of the vector $b$ it stores to processor $i-1$, receives the next part of $b$ from processor $i+1$, and finally multiplies it by $A_{i,(i+1) \bmod P}$. The algorithm can be expressed as follows:
For all PE i = 0 to P-1 do in parallel
begin
  For k = 0 to P-1 do
  begin
    Send b to PE (i-1) mod P
    c(i) = c(i) + A(i, (i+k) mod P) * b
    Receive b from PE (i+1) mod P
  end
end
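To make the data movement concrete, the ring algorithm can be simulated sequentially. The sketch below is an illustration only: it assumes $P$ divides $N$, numbers the PEs from 0, models the simultaneous send/receive as a single rotation of the held blocks, and uses plain Python lists. The name `ring_matvec` is hypothetical.

```python
def ring_matvec(A, b, P):
    """Simulate c = A b on a wrap-around linear array of P PEs.

    PE i owns block row i of A and initially block i of b.
    """
    N = len(b)
    assert N % P == 0, "this sketch assumes P divides N"
    m = N // P                                        # rows of A and entries of b per PE
    held = [b[i * m:(i + 1) * m] for i in range(P)]   # block of b currently on PE i
    c = [[0.0] * m for _ in range(P)]                 # block of c accumulated on PE i
    for k in range(P):                                # P iterations
        for i in range(P):                            # conceptually all PEs in parallel
            col0 = ((i + k) % P) * m                  # PE i now holds block (i+k) mod P of b
            for r in range(m):
                row = A[i * m + r]
                c[i][r] += sum(row[col0 + s] * held[i][s] for s in range(m))
        # Every PE sends its block to PE (i-1) mod P and receives from PE (i+1) mod P.
        held = [held[(i + 1) % P] for i in range(P)]
    return [x for block in c for x in block]


if __name__ == "__main__":
    A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    b = [1, 0, -1, 2]
    print(ring_matvec(A, b, P=2))
```

Each PE only ever talks to its two ring neighbors, and after $P$ rotations every block of $b$ has visited every PE exactly once.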




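Assume that each multiply and add takes $\tau$ seconds. Assume also that transferring $w$ words in a network without edge contention takes $\alpha + \beta \times w$, where $\alpha = \alpha(d)$ and $d$ is the length of the message path. Both $\alpha$ and $\beta$ are machine-dependent parameters. Under the above assumptions the execution time for the algorithm is:
$$T_P = P \times \left\{ \frac{N^2}{P^2} \times \tau + \alpha + \beta \times \frac{N}{P} \right\} \tag{4.2}$$
For a single processor (4.2) becomes:
$$T_1 = N^2 \times \tau \tag{4.3}$$
and we get a speedup equal to:
$$S(N,P) = \frac{N^2 \times \tau}{P \times \left\{ \frac{N^2}{P^2} \times \tau + \alpha + \beta \times \frac{N}{P} \right\}} \tag{4.4}$$
The space required for each processor is $O(\frac{N^2}{P} + 2 \times \frac{N}{P})$.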
From equations (4.4) and (3.1) - (3.5) we estimate the speedup for various problem sizes and different configurations. Figure 2 depicts the estimated speedup for problem sizes N = 640, 3200, 32000 and processors $P = 2^i$, for i = 0, 4, 6, 7, 8, and 9.
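The estimate can be reproduced with a short Python sketch of (4.2) and (4.4). The parameter choices are assumptions made for illustration: $\tau$ is set to the measured $t_{1oper}$ of (3.5), $\alpha$ to (3.1) with path length $d = 1$ (neighboring PEs), and $\beta$ to the 1.8 $\mu$sec per real number of (3.3). The function names are hypothetical.

```python
TAU_USEC = 3.852    # time of c(i) = c(i) + a(i,j) * b(j), from (3.5)
BETA_USEC = 1.8     # transfer time per real number, from (3.3)


def alpha_usec(d: int = 1) -> float:
    """Message start-up time from (3.1) for a path of length d."""
    return 2.84 * d + 132.87


def dense_matvec_time(n: int, p: int, d: int = 1) -> float:
    """Equation (4.2): P iterations of one block product plus one block transfer."""
    return p * ((n / p) ** 2 * TAU_USEC + alpha_usec(d) + BETA_USEC * n / p)


def dense_matvec_speedup(n: int, p: int, d: int = 1) -> float:
    """Equation (4.4): T_1 / T_P with T_1 = N^2 * tau."""
    return n * n * TAU_USEC / dense_matvec_time(n, p, d)


if __name__ == "__main__":
    for n in (640, 3200, 32000):
        print(n, [round(dense_matvec_speedup(n, 2 ** i), 1) for i in (0, 4, 6, 7, 8, 9)])
```

Because the communication term per iteration does not shrink as fast as the computation term, the estimated speedup for a fixed $P$ improves as $N$ grows.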
4.2 Dense Matrix Matrix Multiplication
In this section we examine the matrix-matrix multiplication $C = \alpha C + \beta AB$, where $A, B, C \in R^{N \times N}$ and $\alpha$, $\beta$ are scalars. The PEs are interconnected as a wrap-around grid. The interconnection of the PEs into a grid topology and the distribution of the input are shown in Figure 3. PE $(i,j)$ computes $C_{i,j}$. As can be seen, the matrices $A$, $B$ are divided into submatrices $A_{i,j}, B_{i,j} \in R^{\frac{N}{\sqrt{P}} \times \frac{N}{\sqrt{P}}}$, where $P$ is the number of processors. There are two paths for moving data: across the c-path the algorithm moves the submatrices $C_{i,j}$, and across the b-path it moves the submatrices $B_{i,j}$.
Figure 2: Estimated SpeedUp for P = 1, 4, 16, 64, 128, 256, and 512 processors (speedup versus number of processors; curves for y = x, N = 640, N = 3200, N = 32000)
Figure 3: The interconnection network and distribution of the input
Algorithm
The matrix $C = (C_{i,j})$ can be expressed as:
$$C_{i,j} = \sum_{k=1}^{N} A_{i,k} B_{k,j} \tag{4.5}$$
For square matrices the interconnection is organized as a folded square grid $\sqrt{P} \times \sqrt{P}$. In each processor $(i,j)$ we store the submatrices $A_{i,j}, B_{j,i} \in R^{\frac{N}{\sqrt{P}} \times \frac{N}{\sqrt{P}}}$ and initialize $C_{i,j}$ to zero; throughout this section we will refer to them as $A$, $B$, $C$. In processor $(i,j)$ the submatrix $C_{i,j}$ is computed after $\sqrt{P}$ iterations. Each iteration consists of the following three steps: (1) Send $B$, $C$ across the b/c-paths respectively to processors $(i, (j-1) \bmod \sqrt{P})$ and $((i+1) \bmod \sqrt{P}, j)$. (2) Compute $C = C + A \times B$. (3) Receive $B$, $C$ from processors $(i, (j+1) \bmod \sqrt{P})$ and $((i-1) \bmod \sqrt{P}, j)$ respectively. The algorithm can be expressed as follows:
For each PE (i,j) do in parallel
  For iter := 1 to sqrt(P) do
  begin
    Send B across b-path to (i, (j-1) mod sqrt(P))
    Send C across c-path to ((i+1) mod sqrt(P), j)
    C := C + A * B
    Receive B from processor (i, (j+1) mod sqrt(P))
    Receive C from processor ((i-1) mod sqrt(P), j)
  end



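Under the same assumptions on the time required to communicate and multiply/add a datum we get:
$$T_P = \sqrt{P} \times \left\{ \frac{N^3}{P^{3/2}} \times \tau + 2 \times \left( \alpha + \beta \times \frac{N^2}{P} \right) \right\} \tag{4.6}$$
and a speedup equal to:
$$S(N,P) = \frac{N^3 \times \tau}{\sqrt{P} \times \left\{ \frac{N^3}{P^{3/2}} \times \tau + 2 \times \left( \alpha + \beta \times \frac{N^2}{P} \right) \right\}} \tag{4.7}$$
The space required is $O(3 \times \frac{N^2}{P})$.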
From equations (4.7) and (3.1) - (3.4), (3.7) we get the speedup for various problem sizes and different configurations. Figure 4 depicts the estimated speedup for problem sizes N = 160, 560, 1200 and processors $P = 2^i$, for i = 0, 4, 6, 7, 8, and 9.
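A similar Python sketch evaluates the matrix-matrix estimate (4.7). As before, the parameter values are assumptions chosen for illustration: $\tau$ is taken as $t_{3oper}$ from (3.7), $\alpha$ as (3.1) with $d = 1$, and $\beta$ as 1.8 $\mu$sec per real number. The function names are hypothetical.

```python
import math

TAU_MM_USEC = 3.933   # time of rsum = rsum + a(i,j) * b(j,k), from (3.7)
BETA_USEC = 1.8       # transfer time per real number, from (3.3)


def alpha_usec(d: int = 1) -> float:
    """Message start-up time from (3.1) for a path of length d."""
    return 2.84 * d + 132.87


def dense_matmul_time(n: int, p: int, d: int = 1) -> float:
    """Execution time: sqrt(P) iterations, each one block product and two block transfers."""
    per_iteration = n ** 3 / p ** 1.5 * TAU_MM_USEC + 2 * (alpha_usec(d) + BETA_USEC * n * n / p)
    return math.sqrt(p) * per_iteration


def dense_matmul_speedup(n: int, p: int, d: int = 1) -> float:
    """Equation (4.7): N^3 * tau divided by the parallel execution time."""
    return n ** 3 * TAU_MM_USEC / dense_matmul_time(n, p, d)


if __name__ == "__main__":
    for n in (160, 560, 1200):
        print(n, [round(dense_matmul_speedup(n, 2 ** i), 1) for i in (0, 4, 6, 7, 8, 9)])
```

The computation term grows as $N^3$ while the communication term grows as $N^2$, so the model predicts speedups close to $P$ once $N$ is moderately large.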
Figure 4: Estimated speedup for N = 160, 560, 1200, and P = 1, 16, 64, 128, 256, 512 processors, and Dense Matrix x Dense Matrix Multiplication
4.3 Performance on NCUBE-6400
Table 1: Measured Mflops and SpeedUp for dense matrix-vector multiplication using P = 64 processors.
N       Mflops (1 proc.)   Mflops (P proc.)   S(N,P)    SclSpUp1   SclSpUp2
320     0.439              7.359              16.46     16.76      35.12
640     0.446              17.048             38.14     38.22      46.63
1600    0.447              28.292             -         63.29      59.50

Table 2: Measured Mflops and SpeedUp for matrix-matrix multiplication using P = 64 processors.
N       Mflops (1 proc.)   Mflops (P proc.)   S(N,P)    SclSpUp1   SclSpUp2
160     0.440              22.870             51.991    51.977     55.150
280     0.441              25.117             55.966    56.954     58.873
360     0.441              25.794             58.437    58.489     59.976
560     0.441              26.664             60.351    60.462     61.373
5 Parallelization of level 2 and 3 BLAS for band matrices
Band matrix operations have significant applications in the solution of Partial Differential Equations (PDEs). Iterative methods for the solution of the resulting linear algebraic systems can be viewed as sequences of matrix-vector multiplication operations. In the following, we present algorithms for these operations.
5.1 Band Matrix x Vector Multiplication
In this section we investigate the operation $c = c + A \times b$, where $c, b \in R^N$ and $A \in R^{N \times N}$ is a banded matrix with upper bandwidth $w_1$ and lower bandwidth $w_2$. We will explain this algorithm for the case $N = P$. Usually in practice $N \gg P$. However, the algorithm can easily be generalized to the case $N \gg P$ by replacing each element $a_{i,j}$ by a submatrix $A_{i,j} \in R^{\frac{N}{P} \times \frac{N}{P}}$. The PEs are interconnected as a linear array. The interconnection of the PEs and the distribution of the input are shown in Figure 5. Each PE $i$ computes $c_i$. As can be seen, the matrix $A$ is split into two submatrices, the strictly lower triangular submatrix of $A$, which we call $L$, and the upper triangular submatrix of $A$, which we call $U$, such that $A = L + U$. Each processor contains one row of elements (in the general case a strip of rows) and one element of the vector $b$ (in the general case a strip of rows).
Figure 5: The interconnection network and distribution of the input
Algorithm
The vector $c$ can be expressed as $c = c + L \times b + U \times b$. The algorithm consists of 2 phases. The first phase computes $U \times b$ and requires $w_1 + 1$ iterations; the second phase computes $L \times b$ and requires $w_2$ iterations. In the first phase each processor $i$ multiplies $a_{i,i} \times b_i$, sends $b_i$ to processor $i-1$, and receives the new part of the vector $b$ from processor $i+1$. Processor $i = 1$ sends nothing during the sending stage, while processor $i = P$ receives nothing during the receiving stage. At the $k$th iteration, processors $i$ with $i > P - k + 1$ remain idle. In the second phase each processor restores its original part of $b$ from temporary storage, hence processor $i$ restores $b_i$ and sends it to processor $i+1$, then multiplies $a_{i,i-1} \times b_{i-1}$. Processor $i = P$ sends nothing during the sending stage, while processor $i = 1$ receives nothing during the receiving stage. The algorithm can be expressed as follows:
Phase 1: Multiply the Upper triangular U by b
For each PE i do in parallel
begin
  temp := d
  For j := 0 to w1
    if (i + j <= P) then
    begin
      if ( i = 1 ) then do nothing
      else Send d to PE i-1
      c := c + a(i, i+j) * d
      if ( i = P ) then do nothing
      else Receive d from PE i+1
    end
end

Phase 2: Multiply the Lower triangular L by b
For each PE i do in parallel
begin
  d := temp
  For j := 1 to w2
    if (i > j) then
    begin
      if ( i = P ) then do nothing
      else Send d to PE i+1
      if ( i = 1 ) then do nothing
      else Receive d from PE i-1
      c := c + a(i, i-j) * d
    end
end
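The two phases can be checked with a sequential simulation for the case $N = P$. The following Python sketch is an illustration under stated assumptions: PEs are numbered from 0 instead of 1, the matrix is passed as a dense list of lists of which only the band is read, and the neighbor exchanges are modelled as list shifts. The name `band_matvec` is hypothetical.

```python
def band_matvec(a, b, w1, w2):
    """Simulate c = A b for a band matrix on a linear array, one row per PE (N = P).

    w1 is the upper bandwidth and w2 the lower bandwidth of A.
    """
    n = len(b)
    c = [0.0] * n

    # Phase 1: diagonal and upper part, w1 + 1 iterations.
    d = list(b)                          # d[i] is the element of b currently held by PE i
    for j in range(w1 + 1):
        for i in range(n - j):           # PEs whose column i + j does not exist stay idle
            c[i] += a[i][i + j] * d[i]
        d = d[1:] + [0.0]                # PE i sends to PE i-1; the last PE receives nothing

    # Phase 2: strictly lower part, w2 iterations.
    d = list(b)                          # restore b from temporary storage
    for j in range(1, w2 + 1):
        d = [0.0] + d[:-1]               # PE i sends to PE i+1; the first PE receives nothing
        for i in range(j, n):            # PEs whose column i - j does not exist stay idle
            c[i] += a[i][i - j] * d[i]
    return c


if __name__ == "__main__":
    tridiagonal = [[2, -1, 0, 0], [-1, 2, -1, 0], [0, -1, 2, -1], [0, 0, -1, 2]]
    print(band_matvec(tridiagonal, [1, 2, 3, 4], w1=1, w2=1))
```

Only $w_1 + w_2 + 1$ shift steps are needed regardless of $N$, which is the source of the saving over the dense ring algorithm with its $P$ steps.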






Without any loss of generality we assume $A$ has $K$ non-zero elements and $N \gg w_1 + w_2 + 1$. Under the above assumptions on the time required to communicate and multiply/add a datum we get:
$$T_P = \frac{K}{P} \times \tau + (w_1 + w_2 + 1) \times \left\{ t_{start} + \frac{N}{P}\, t_{transf} \right\} \tag{5.1}$$
Since
$$T_1 = K \times \tau \tag{5.2}$$
we get a speedup equal to:
$$S(N,P) = \frac{K \times \tau}{\frac{K}{P} \times \tau + (w_1 + w_2 + 1) \times \left\{ t_{start} + \frac{N}{P}\, t_{transf} \right\}} \tag{5.3}$$
The space required for each subdomain is $O(\frac{K}{P} + 3\frac{N}{P})$.
From equations (5.3) and (3.1) - (3.4), (3.6) we get the speedup for various problem sizes and different configurations. Figures 6 and 7 depict the estimated speedup and the number of iterations for problem sizes N = 160, 560, 1200, and bandwidths equal to 3, 5, 17, with processors $P = 2^i$, for i = 0, 4, 6, 7, 8, and 9.
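Equation (5.3) can be evaluated with the following Python sketch. The choices are assumptions for illustration: $\tau$ is set to $t_{2oper}$ from (3.6), $t_{start}$ to (3.1) with $d = 1$, $t_{transf}$ to 1.8 $\mu$sec per real number, and $K$ is counted exactly for a matrix whose band is completely full. The function names are hypothetical.

```python
T_OPER_USEC = 7.915     # time of c(i) = c(i) + a(i,j) * b(id(i,j)), from (3.6)
T_TRANSF_USEC = 1.8     # transfer time per real number, from (3.3)


def t_start_usec(d: int = 1) -> float:
    """Message start-up time from (3.1) for a path of length d."""
    return 2.84 * d + 132.87


def band_nonzeros(n: int, w1: int, w2: int) -> int:
    """Number of entries in a full band with upper bandwidth w1 and lower bandwidth w2."""
    return n * (w1 + w2 + 1) - w1 * (w1 + 1) // 2 - w2 * (w2 + 1) // 2


def band_matvec_speedup(n: int, p: int, w1: int, w2: int, d: int = 1) -> float:
    """Equation (5.3): K * tau over the parallel time of (5.1)."""
    k = band_nonzeros(n, w1, w2)
    t_parallel = k / p * T_OPER_USEC + (w1 + w2 + 1) * (t_start_usec(d) + n / p * T_TRANSF_USEC)
    return k * T_OPER_USEC / t_parallel


if __name__ == "__main__":
    # Total bandwidths 3, 5 and 17 correspond to w1 = w2 = 1, 2 and 8.
    for half in (1, 2, 8):
        print(2 * half + 1, round(band_matvec_speedup(1200, 64, half, half), 2))
```

With only a few operations per row between messages, the start-up cost weighs much more heavily here than in the dense case, so the estimated speedup for a fixed $N$ saturates well below $P$.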
Application
For example, in two dimensions the 5-point star operator for the Poisson equation on an $N \times N$ mesh gives a band matrix $A$ whose upper and lower bandwidths are both equal to $N$. The matrix $A$ can be viewed as a block tridiagonal matrix with blocks in $R^{mN \times mN}$, where $m \in \mathbb{N} - \{0\}$. For a given $N$ and a linear array of $P$ processors, w.l.o.g. assume $N = \lambda P$; the matrix-vector multiplication of $A$ times a vector $x$ can be achieved by applying the algorithm for band matrices. First we partition the matrix into sub-blocks of size $\lambda N \times \lambda N$, and then we allocate a row of the sub-blocks to each processor, as we did above.
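The bandwidth claim is easy to verify numerically. The Python sketch below builds the 5-point matrix for a small mesh and measures its bandwidths. It assumes the usual row-by-row numbering of the mesh points and the common coefficient convention (4 on the diagonal, -1 for each neighbor), neither of which is spelled out in the text; the function names are hypothetical.

```python
def five_point_matrix(n):
    """Dense matrix of the 5-point star operator on an n x n mesh, numbered row by row."""
    size = n * n
    A = [[0.0] * size for _ in range(size)]
    for row in range(n):
        for col in range(n):
            p = row * n + col
            A[p][p] = 4.0
            if col > 0:
                A[p][p - 1] = -1.0       # west neighbor
            if col < n - 1:
                A[p][p + 1] = -1.0       # east neighbor
            if row > 0:
                A[p][p - n] = -1.0       # south neighbor
            if row < n - 1:
                A[p][p + n] = -1.0       # north neighbor
    return A


def bandwidths(A):
    """Return (upper, lower): the largest j - i and i - j over non-zero entries a[i][j]."""
    upper = lower = 0
    for i, row in enumerate(A):
        for j, value in enumerate(row):
            if value != 0.0:
                upper = max(upper, j - i)
                lower = max(lower, i - j)
    return upper, lower


if __name__ == "__main__":
    print(bandwidths(five_point_matrix(4)))
```

The north and south couplings sit exactly $N$ positions from the diagonal, which is what fixes both bandwidths at $N$.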
5.2 Band Matrix x Band Matrix Multiplication
In this section we investigate the operation $C = \alpha C + \beta AB$, where $A, B \in R^{N \times N}$ with upper bandwidths $w_1$, $\delta_1$ and lower bandwidths $w_2$, $\delta_2$ respectively, $C \in R^{N \times N}$ with upper bandwidth $w_1 + \delta_1$ and lower bandwidth $w_2 + \delta_2$, and $\alpha$, $\beta$ are scalars. The PEs are interconnected as a linear array. The interconnection of the PEs and the distribution of the input are shown in Figure 8. PE $i$ holds the column $C_i$ of matrix $C$. In each processor we store one row of elements (in the general case a strip of rows) from matrix $A$, and one column of elements (in the general case a strip of columns) from matrix $B$.
Figure 6: Estimated Speedup of the algorithm for different configurations
Figure 7: Iterations required by the algorithm, for the computation of the vector c = c + A x b
Figure 8: The interconnection network and distribution of the input.
Algorithm
The algorithm consists of two phases, as in band matrix-vector multiplication. In the first phase, each PE starts by calculating $C_{ii} = A_i \times B_i$; then each PE $i$ passes $B_i$ to PE $i-1$. This phase is repeated $w_1 + \delta_1 + 1$ times.




For each PE i do in parallel   /* each PE contains a = Ai, b = Bi */
begin
  temp := b
  For j := 0 to w1 + delta1
    if (i + j <= N) then
    begin
      if (i = 1) then do nothing
      else Send b to PE i-1
      c(i,i+j) := c(i,i+j) + a * b
      if (i = P) then do nothing
      else Receive b from PE i+1
    end
end

For each PE i do in parallel
begin
  b := temp
  For j := 1 to w2 + delta2 do
    if (i > j) then
    begin
      if (i = P) then do nothing
      else Send b to PE i+1
      if (i = 1) then do nothing
      else Receive b from PE i-1
      c(i,i-j) := c(i,i-j) + a * b
    end
end





To calculate the speedup, assume that $K_1$, $K_2$ denote the number of non-zero elements of the matrices $A$, $B$ respectively, and $N \gg bw_1 + bw_2$, where $bw_1 = w_1 + w_2 + 1$ and $bw_2 = \delta_1 + \delta_2 + 1$. To calculate $T_P$, we have to perform $bw$ iterations; each iteration consists of sending and receiving a strip of $\frac{N}{P}$ vectors, each with $\min(bw_1, bw_2)$ elements, and multiplying $\frac{N}{P}$ vectors by $\frac{N}{P}$ vectors. Under the above assumptions on the time required to communicate and multiply/add a datum we get:
$$T_P = \frac{\min(K_1 \delta, K_2 w)}{P}\,\tau + \left\{ t_{start} + \frac{N}{P}\, t_{transf} \min(w, \delta) \right\} \tag{5.4}$$
where $bw = w + \delta$. Since
$$T_1 = \min(K_1 \delta, K_2 w)\,\tau \tag{5.5}$$
we get a speedup equal to
$$S(N,P) = \frac{\min(K_1 \delta, K_2 w)\,\tau}{\frac{\min(K_1 \delta, K_2 w)}{P}\,\tau + \left\{ t_{start} + \frac{N}{P}\, t_{transf} \min(w, \delta) \right\}} \tag{5.6}$$
From equations (5.6) and (3.1) - (3.4), (3.8) we get the speedup for various problem sizes and different configurations. Figure 9 depicts the estimated speedup for problem sizes N = 160, 560, 1200, and bandwidths equal to 3, 5, 17, with processors $P = 2^i$, for i = 0, 4, 6, 7, 8, and 9.
5.3 Performance on NCUBE-6400
By fixing the size of the matrix for each processor, the resulting performance curves should ideally show constant Mflops as a function of the matrix size and constant time (in seconds) as a function of the number of processors [Gust 88]. Tables 3 and 4 demonstrate the scalability of the algorithm for band matrices with upper bandwidth equal to lower bandwidth equal to 8, 16, 32 and 64; the time (in seconds) varies very little with the hypercube dimension.
Figure 9: Estimated speedup for N = 160, 560, 1200, and P = 1, 16, 64, 128, 256, 512 processors.

Table 3: Measured total elapsed time (in seconds) for block tridiagonal matrices; each block is of size n x n, where n = 8, 16, 32, 64.
matrix size / node   4 nodes    8 nodes    16 nodes   32 nodes   64 nodes
8 x 24               3.9E-3     4.1E-3     4.1E-3     4.2E-3     4.1E-3
16 x 48              1.18E-2    1.24E-2    1.27E-2    1.28E-2    1.2E-2
32 x 96              4.33E-2    4.52E-2    4.62E-2    4.67E-2    4.692E-2
64 x 192             0.1680     0.1756     0.1794     0.1813     0.1822

Table 4: Measured Mflops for block tridiagonal matrices; each block is of size n x n, where n = 8, 16, 32, 64.
matrix size / node   4 nodes    8 nodes    16 nodes   32 nodes   64 nodes
8 x 24               0.239      0.494      1.004      2.0265     4.122
16 x 48              0.309      0.635      1.284      2.589      5.198
32 x 96              0.334      0.688      1.393      2.802      5.620
64 x 192             0.343      0.704      1.425      2.868      5.752

References
[Aboe 89] Mokhtar Aboelaze, Unpublished manuscript, June 1989.
[Bern 89] J. Berntsen, Communication efficient matrix multiplication on hypercubes, Parallel Computing, Vol. 12, No. 3, Dec. 1989, pp. 335-342.
[Bokh 90] S. H. Bokhari, Communication overhead on the Intel iPSC-860 hypercube, ICASE Interim Report 10, NASA Langley Research Center, Hampton, Virginia 23665.
[Cher 88] V. Cherkassky and R. Smith, Efficient mapping and implementation of matrix algorithms on a hypercube, The Journal of Supercomputing, Vol. 2, 1988, pp. 7-27.
[Chri 90] N. P. Chrisochoides, Communication overhead on the NCUBE-6400 hypercube.
[Deke 81] E. Dekel, D. Nassimi, and S. Sahni, Parallel matrix and graph algorithms, SIAM J. Computing, Nov. 1981, pp. 657-675.
[Fox 87] G. C. Fox, S. W. Otto, and A. J. Hey, Matrix algorithms on a hypercube I: Matrix multiplication, Parallel Computing, 1987, pp. 17-31.
[Gare 79] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness.
[John 85] S. L. Johnsson, Communication efficient basic linear algebra computations on hypercube architectures, Technical Report YALE/CSD/RR-361, Dept. of Computer Science, Yale University, 1985.
[John 89] S. L. Johnsson, T. Harris, and Kapil K. Mathur, Matrix multiplication on the Connection Machine, Proc. Supercomputing 89, Nov. 13-17, 1989, Reno, Nevada, ACM Press, pp. 326-332.
[Kung 82] H. T. Kung, Why systolic architectures?, Computer, Vol. 15, No. 1, Jan. 1982, pp. 37-46.
[Mira 84] W. L. Miranker and A. Winkler, Space-time representations of computational structures, Computing, Vol. 32, 1984, pp. 93-114.
[Mold 82] D. I. Moldovan, On the analysis and synthesis of VLSI algorithms, IEEE Trans. on Computers, Vol. C-31, No. 11, Nov. 1982, pp. 1121-1126.
