Explicit parallel block Cholesky algorithms on the CRAY APP by Nool, M. (Margreet)
ELSEVIER Applied Numerical Mathematics 19 (1995) 91-114 
~ APPLIED 
.NUMERICAL 
MATHEMATICS 
Explicit parallel block Cholesky algorithms on the CRAY APP 
Margreet N ool * 
CW!, P. 0. Box 94079, 1090 GB Amsterdam, Netherlands 
Abstract 
In this paper we consider the CRAY APP, the Attached Parallel Processor of the CRAY S-MP, which consists of 
seven buses with each bus supporting up to 12 processing elements. Processing elements on different buses can 
communicate simultaneously with the shared main memory, but processing elements sharing the same bus can not, 
since only one processing element per bus can access memory at a given time. Applications with a high level of data 
reuse, or, with a high computation intensity, and applications being highly parallel are very suitable to run on the 
APP. An example of such an algorithm is matrix-matrix multiplication. We illustrate how the data traffic's 
restriction influences the performance and we discuss a performance model of the bus architecture, considering a 
change in processor speed, data traffic speed and cache contents. 
Furthermore, two different algorithms for Cholesky factorization are discussed: a block left-looking algorithm and 
a block right-looking algorithm. The maximum achievable speed on the CRAY APP is mainly determined by the 
performance of the matrix-matrix multiplication. Parallelism is applied explicitly over the blocks, which makes it 
possible to concatenate different block operations in cache. The results obtained on CWI's APP (a machine having 
twenty-eight processing elements) indicate how block algorithms can be parallelized on machines with hundreds or 
thousands of processors. 
Keywords: Software; Parallelization; Vectorization 
1. Introduction 
The Attached Parallel Processor APP of the CRAY S-MP is a system of 1, 3, 5 or 7 buses, 
each consisting of several processing elements. In the case of one processing element per bus, it 
is possible to obtain a speed-up which is close to the number of buses involved. However, in the 
case of two or more processing elements sharing the same bus, the speed-up is restricted by bus 
traffic, since only one processing element per bus can access main memory at a given time. For 
that reason, only parallel algorithms with a high computation intensity compared to data traffic 
are suitable to run efficiently on the APP. In earlier papers [8,9,12], which report on APP 
performance, the speed-up was mainly limited by the number of buses rather than by the 
• E-mail: greta@cwi.nl. 
0168-9274/95/$09.50 © 1995 Elsevier Science B.V. All rights reserved 
SSDI 0168-9274(95)00078-X 
92 M. Noa// Applied Numerical Mathematics 19 (1995) 91-114 
number of processing elements. Here, we concentrate on such parallel applications, for which 
we actually obtain speed-ups higher than the number of buses. For that purpose we introduce 
the bus speed-up being a function of the execution time on B buses with one processor and that 
on B buses with P processors. Moreover, we consider a performance model of the APP, and we 
discuss the influence on the performance in case the speed of the processing elements or the 
speed of data transfer would increase. Finally, we consider the bus performance as a function 
of the cache contents. The CRAY APP is a rather exotic architecture, although its communica-
tion structure on one bus is similar to that in a network of workstations (which becomes an 
increasingly popular parallel computing environment). CRAY quotes several applications of 
the APP in image, sonar, and radar processing with speeds of 1-2.3 Gflop/s for one APP [13]. 
In [3] a speed of 548 Mflop/s is reported for the solution of a full linear system of order 1000. 
The matrix-matrix multiplication serves as a key in the first part of the paper dealing with 
the APP and its suitability for this operation. The Cholesky factorization, considered in the 
second part, is based on the matrix-matrix multiplication as well. Its performance is mainly 
determined by the speed of multiplication. The factorization can easily be parallelized, but 
during the progress of the factorization the level of parallelism decreases. We have found that 
it is difficult to predict how a given matrix should be divided into blocks to achieve optimal 
performance. Two Cholesky factorizations are compared: a left- and a right-looking algorithm. 
Though solving the same decomposition within the same number of steps, their performance 
differs significantly on a 28-processor CRAY APP. 
The organization of the paper is as follows. In Section 2, we describe the CRAY APP 
configuration. In Section 3, we focus on the block matrix-matrix multiplication; it is performed 
in such a way that data traffic is minimal resulting in a high performance. A performance 
model of the CRAY APP is discussed in Section 4. In Section 5, two block Cholesky 
factorizations are described; much attention is paid to reducing the number of synchronization 
points leading to an increase of the performance. Finally, in Section 6, some conclusions are 
drawn. 
2. The CRAY APP configuration 
The CRAY APP, a multiple-bus, parallel processor, can be used in two modes: the client 
server programming model and the computer model. In the first model, the main program runs 
on the front end machine and highly parallel time-consuming parts of the program are run on 
the CRAY APP. In the latter model, which we consider in this paper, all operations are 
performed on the CRAY APP. 
Its architecture, as illustrated in Fig. 1, consists of 4 to 84 processing elements and one to 
seven buses. The buses are connected to the main memory by a crossbar. This main memory is 
shared by all of the processing elements. An important restriction in bus traffic is that a given 
tim~ only one processing element on each bus can access memory. As a consequence, for 
optimal use of the CRAY APP it is necessary to take care of a balanced bus traffic. An 
example ~f ef~icie~t bus usage is shown in Fig. 2. Due to the short load time compared to the 
computation time m cache, seven processing elements can be used efficiently. The addition of 
I 
Up to7 
Buses 
M. Nool /Applied Numerical Mathematics 19 (1995) 91-114 
Up to 12 Processing E/stn'1nts per bus 
14 MotrtxBu•ea 
BXS 
Fig. 1. CRAY APP architecture. 
Memory 
Up to 
256 Mbytes 
93 
one or more processing elements to the bus will not result in a higher performance; several 
elements must wait to use the bus for loading. Obviously, the desired number of processing 
elements on a bus depends on the application. 
There are two ways in which parallelism can be exploited on the CRAY APP: 
• by using the auto-parallelizing feature of the compiler and insert directives for parallel 
processing wherever possible; 
• by "bus-handoff" computing: hardware features are available for loading data into caches 
and for storing updated values into the main memory. 
In this paper, we focus on the optimization of bus traffic and the reuse of data wherever 
possible. To this aim, we will investigate the application of "bus-handoff" computing. 
2.1. Some characteristics of the CRAY APP 
The CRAY APP at CWI, Amsterdam consists of seven buses, each with four processing 
elements. The data cache of each processing element is an 8-Kbyte memory chip so that for 
co.mputa~ion 
II .. ...: .. 
Pi .. ~ .. 
~ .. .. .. 
lJ .. ... 
Ps 
~ 
~ 
.. . 
... .... 
.... data transfer 
... 
' ... 
.... .. . 
..... .... 
... ... ... 
... ..... ... 
... 
Fig. 2. Efficient bus usage of seven processors sharing the same bus. 
94 M. Nool/Applied Numerical Mathematics 19 (1995) 91-114 
DOUBLE PRECISION arithmetic only 1 Kwords can be stored. This implies that for a 
matrix-matrix multiplication in cache involving two square operand matrices and one result 
matrix, the block size is restricted to 18. The processing elements, based on the Intel i860 64-bit 
microprocessor, can reach a peak performance of 60 Mflop/s for DOUBLE PRECISION 
operations. However, when performing an equal number of DOUBLE PRECISION multiplies 
and adds, as in the matrix-matrix multiplication case, the maximum possible performance is 40 
Mflop/s (not including data transfer). In addition, each bus provides a peak bandwidth of 160 
Mbyte/s, i.e., 20 Mwords DOUBLE PRECISION words can be transferred per second. So, 
two flops and one data transfer can be done simultaneously. 
2.2. Some remarks on the software available on the CRAY APP 
From the previous section it seems easy to predict the performance of a parallel application, 
since everything needed for a good and reliable analysis is available, such as the data cache 
size, a performance peak of 40 Mflop/s as well as the data transfer time. Moreover, the CRAY 
APP has a very accurate clock, and timing results for a parallel application on the APP appear 
to be reproducible. On the CWI's configuration a peak Mflop rate of more than 1 Gflop/s 
should be possible. However, a reasonable performance can be obtained only by using cache 
programming and for this it is necessary to use the Extended Math Library routines [l]. 
Unfortunately, this library has only a small set of appropriate routines for numerical program-
ming. It would be helpful if at least a set of Level 1 BLAS [10] routines were included in this 
library, but even commonly-used operations like inner products and daxpy operations are 
missing. In fact, only a limited set is available, and its usability is restricted by incomprehensible 
rules. 
Consider, for instance, the routine _ dvsvma, which performs the following operation on 
DOUBLE PRECISION data: 
D(i)=A(i)+B*C(i), for i=l, ... ,n. (2.1) 
The first restriction is that the vectors A and D may not share the same memory locations as 
they do in a daxpy operation. Therefore, this routine cannot be used as a simple daxpy 
operation without an extra vector copy and neither can any other Extended Math. Library 
routine. Secondly, the elements of A, C and D must be consecutive elements in memory; an 
increment parameter is not allowed. The third restriction gives rise to most complications: A, C 
and D must be quad-word aligned. Operations can only be carried out in pipelined mode if the 
initial elements of the input data are stored on quad-word boundaries in cache. A word in the 
wth position in cache is quad-word aligned, or on a quad-word boundary if w mod 4 = 1, as is 
described in [12]. For the DOUBLE PRECISION case either the odd or the even vector 
elements can be quad-word aligned; for the SINGLE PRECISION case only each fourth 
element of a vector satisfies this restriction. 
Finally, ·we mention the DOUBLE PRECISION matrix-matrix multiplication routine 
_dmmm which has a speed of more than 25 Mflop/s on a single processor. This routine 
computes in cache 
C=r*C+s*A*B, 
M. Nool j Applied Numerical Mathematics 19 (1995) 91-114 95 
where r and s may be - 1, 0, or 1. Besides the restriction that the memory occupied by C 
should not intersect that occupied by A or B, we must take into account that this routine is 
parameterized: all cache loads and stores are internal to the routine and all preloaded data in 
the cache will be lost. As a consequence, it does not seem to be possible to reuse the data, since 
each time the routine is called the cache will be refreshed. However, in Section 3.4 we show 
results of experiments where in certain cases data are still reused by this routine. 
The absence of a good compiler for cache programming and the large number of restrictions 
makes programming very complicated and reduces the readability of programs considerably. 
Most executions do not achieve the performances claimed in the specifications. In Section 
5.1.3, we return to this point. 
3. The matrix-matrix multiplication 
The matrix-matrix multiplication can be considered as one of the most suitable applications 
to run on the CRAY APP. Firstly, the multiplication is highly parallel; it can easily be divided 
into (many) parts which can run in parallel. Secondly, the (computation) intensity, defined by 
the ratio of the number of floating point operations to the number of words of data, is high. For 
a real matrix-matrix multiplication of the form C = C +A* B, the intensity is given by 
#Flops 2n3 1 
Intensity= = - = -n. 
#Data Words Transferred 4n2 2 
(3.1) 
3.1. Parallelism 
Let the real matrices A, Band C of Fig. 3 have orders m x k, k x n, m X n and assume that 
KB, MB and NB are proper divisors of k, m and n, respectively, such that K X KB= k, 
M x MB = m and N x NB = n. Then, A can be partitioned into a block matrix of M X K blocks 
of order MB x KB. Matrices B and C can be partitioned analogously. An example of a block 
algorithm to perform the matrix-matrix multiplication C = C +A * B is given by Fig. 4. The 
K N N 
M * M 
MB NB 
KB 
A(m,k) B(k,n) C(m,n) 
Fig. 3. Matrix-matrix multiplication A * B = C; the matrices A, B and C are partitioned into blocks of order 
MBXKB, KBxNB and MBXNB, respectively. 
~~~~~~~~----------------------------------------------~~~~~~~~~-· 
96 M Nool/Applied Numerical Mathematics 19 (1995) 91-114 
DO i=l,M 
DO j= 1,N 
C;,j = O.OdO 
DO l=l,K 
c1,j = c1.j + A;,1B1J 
END DO 
END DO 
END DO 
Fig. 4. Block matrix-matrix algorithm. Each block Au, B/,j and C;,j is of order MB X KB, KB X NB and MB X NB, 
respectively. 
inner loop is not particularly suitable for parallel processing, since for each I the same 
submatrix Ci,j is updated. Introducing parallelism on the level of the j loop results in updating 
the N block columns of C in parallel. The introduction of parallelism on the outer loop level 
leads to concurrent computation of the block rows of C and a level of parallelism equal to M. 
Note, that if loops i and j are interchanged, parallelism over the block columns changes to 
parallelism over the block rows. However, there will be a significant difference in how 
operations are scheduled to the processing elements. The order of the operations will differ as 
well as the idle pattern if M or N are not a multiple of the number of processing elements. In 
[9], a block row and a block column scheme are considered and experiments on the CRAY 
APP display that the idle time becomes significant due to a lack of parallelism. 
A higher level of parallelism and a possible reduction of idle time can be achieved by 
collapsing the middle and the outer loop. As a consequence, all computations on submatrices 
CiJ• i = 1, ... , M, j = 1, ... , N, can be carried out independently, resulting in a level of 
1arallelism of M · N. Since each submatrix Ci,j is stored once, the number of stores is minimal. 
)n the other hand, the number of loads is high, since all submatrices of A and B are loaded K 
cimes. 
3.2. Granularity 
Assume that parallelism is exploited on submatrix level rather than on block row or block 
column level. One way to enlarge the degree of parallelism is to change the block size. A 
reduction of the block size leads to a higher level of parallelism, but not automatically to a 
higher performance on a parallel machine. A reason for this is that the intensity per block 
operation-which is of order NB for square blocks-decreases, whereas Mand N increase. 
Let the cycle time for one floating point operation be Tr, then for each block C;,j the 
computation time Tc_block will be 
Tc_block(k, MB, NB)= 2. K. KB. MB. NB. Tf = 2k. MB. NB. Tf. 
Define T 1 as the time to transfer one datum word from main memory to cache or vice versa. To 
compute one block C;,j requires K · KB · MB + K · KB · NB + MB · NB loads and MB · NB 
stores. So, the transfer time TL block will be 
Ti_block(k, MB, NB)= [k(MB +NB)+ 2 ·MB· NB]T1 • 
B, 
>Y 
~s 
)f 
ll. 
K 
:k 
A 
a 
:k 
~o 
B 
M. Nool I Applied Numerical Mathematics 19 (1995) 91-IJ4 97 
The total time T block needed for computing one block, including loads and stores, is given by 
Tbtock(k, MB, NB)= Tc_block + Tt_block 
= 2k ·MB· NB· Tr + [k(MB +NB)+ 2 ·MB· NB]r1• 
The APP's uniprocessor time needed to carry out the matrix-matrix multiplication C =A * B 
will be 
Tmatrix = M · N · Tblock(k, NB, MB)= 2kmnrr+ k(m · N + M ·n)r1 +2mnr1 • (3.2) 
A change of block size affects (i) the time required to load the matrices A and B and (ii) the 
maximum degree of parallelism. For the given algorithm, halving the block size causes a 
doubling of the load time for A and B and a change of 2 2 in the degree of maximum 
parallelism. From (3.2) we may conclude that the uniprocessor time for a matrix-matrix 
multiplication is minimal for Mand N as small as possible. In other words, as the cache content 
increases, the processing time will decrease. Given the cache size, we can create a matrix-ma-
trix multiplication of maximal performance. On the other hand, (3.2) does not indicate for a 
fixed m and n, and a given number of processors, for which block size the lowest wall clock 
time can be reached. 
3.3. Bus configuration 
Until now, we have not considered the important restnction on the bus configuration: 
processors on the same bus cannot communicate with the main memory concurrently. If the 
computation time is much larger than the transfer time this restriction will have a very limited 
effect. If not, processors will become idle because the required data are not available. For 
simplicity, we assume that there are p processors P1> p 2, ••• , Pp on one bus and that to each 
processor exactly one block c .. is assigned. The process of computation and load/store traffic 
'·' on a processor is shown in Fig. 5. 
The process on processor p 1 computing C; ,. , can be described as follows: I• I 
CAB 
Bus configuration 
AB 
l= 1 
Ill Load or store submatrix 
II Compute C=C+A*B 
AB c 
l= 2 l= K 
Fig. 5. The load/store and computation process for the matrix-matrix computation C = C + A* B on fom 
processors sharing the same bus. 
98 M. Nool /Applied Numerical Mathematics 19 (1995) 91-114 
• load block matrix cl. ;'' 
l• 1 
• load block matrices A;i.1 and B 1,J1, 
•compute C-. =C .. +A. 1 *B1 ·. l1,Ji Z1,Ji ll, ,Ji 
In the next step, data of C; ,. can remain in cache, and therefore, only A,. 2 and B 2 . must 1, 1 }l ,Ji 
be loaded. After K steps the process ends by storing the final C;1.J1 data. The computation of 
submatrix C;2,h on processor p 2 requires three blocks to be loaded, too. The best solution is to 
load C; 1·, A; 1 and B 11. on p 1 and next, simultaneously, p 1 starts executing c .. = C, .. + l' l 11 ' 1 l 1,J1 1,J1 
A; 1,i * Bl,j1 and p 2 is loaded with data of C;2.J2 , A;2 , 1 and Bu2 • Processor p 2 starts updating 
C; 1· as soon as all data required are available in cache. The overhead of computing C . and 2, 2 l1,Ji 
C; 2,h simultaneously, on two processors sharing the same bus, is equal to the time required for 
loading A; 1, B 11. and C,. ,. under the restriction that the time needed for loading is less than 11 '1 }7 1 
the computation time for C,. 1. = C; 1. +A; 1 * B 1 , .• li I I• I 2' ' 1 
We can extend this process to p processors sharing the same bus. The overhead can be 
expressed by 
(p - l)[KB ·MB+ KB· NB+ MB· NB]· T 1• (3.3) 
If this time exceeds 
2 ·KB· MB· NB· Tt, (3.4) 
then more than one processor will try to communicate with the main memory at the same time 
and processors will become idle. Note the significance of the ratio Tf: T 1• 
The maximum useful number of processors on a bus can be reflected in the leverage value, 
which is the ratio of the compute time to data transfer time. A leverage value l indicates that 
l + 1 processors can share the same bus, effectively. In other words, the wall clock time cannot 
be reduced using more than l + 1 processors. For the matrix-matrix multiplication we obtain 
Leverage 
Time spent computing in cache 
Time spent transferring across the bus 
2 . k . NB . MB . T f 
[k(MB +NB)+ 2 ·MB· NB]· T 1 • 
(3.5) 
We remark that for the example shown in Fig. 5, four processors can share the same bus 
efficiently. The addition of a fifth processing element on the bus, will not improve efficiency. It 
is worth noting that, when the computation starts, all values must be available on the on-chip 
cache; for instance not only the submatrix elements, but also the block size must be known. 
Since the CRAY APP can be considered as a shared memory machine, all processors have 
access to the main memory. In case of values being not available on the on-chip cache, a 
processor will try to load them from main memory. This action will disturb the bus traffic, and 
processors can become idle resulting in a longer execution time. 
3.4. Numerical experiments with matrix-matrix multiplication 
In this section, the effect of data traffic on the performance is illustrated by means of the 
matrix-matrix multiplication. The multiplication has a high computation intensity and, there-
fore, it is possible to increase the performance by using several processors on the same bus. 
M. Nao!/ Applied Numerical Mathematics 19 (1995) 91-114 99 
It is not our intention to develop the most efficient matrix-matrix multiplication-there 
exists already a very efficient implementation, i.e. rgmmul, which is appropriate for all kinds of 
matrix shapes-but in this paper we focus on the performance loss due to bus traffic for some 
fixed block sizes. A high computation speed is needed in order to demonstrate accurately the 
influence of data traffic. Therefore, we would like to develop a fast implementation for the 
matrix-matrix multiplication and to control data traffic explicitly, using APP routines to move 
data from memory into processor cache and vice versa. It appears that the highest possible 
performance for a matrix-matrix multiplication in cache can be obtained using the routine 
_ dmmm. Unfortunately, all data traffic is governed internally by the routine (see Section 2.2). 
We use the routine in order to achieve a high computational efficiency and can only speculate 
on how the routine manages bus traffic. 
To get a clear view on the efficiency we have chosen timing experiments on matrices with 
sizes depending on the configuration chosen. This enables us to avoid idle time due to a lack of 
parallelism. For a partitioning of the matrices A, B and C into square blocks of order b we 
choose 
m = b X #processors /bus, 
n = b X #buses, 
k =bx K, 
(3.6a) 
(3.6b) 
(3.6c) 
for K = 1, ... , 10 (cf. Fig. 3). This implies that, for a fixed K, the wall clock time for each 
configuration will be approximately the same. However, the more processing elements involved 
the higher the Mflop rate will be. Since all cache loads and stores are internal to _ dmmm we 
(a) 
&"···A··· .. lli· ·A 1 processor/bus 
400 •·····•····•···· • 2 processor/bus 
f ... + .. ·+··+ 3 processor/bus 
................ 4 processor/bus 
300 
Mflop/s 
200 
100 
10 
K=l 
• 
.1 ..... 
(b) 
A·· .. ·&·· ... A .... ··A l processor/bus 
...................... 2 processor/bus 
!····+····+ .... ·i 3 processor/bus 
........... •4 processor/bus .J ..... ·::: .. i·/ 
/)~:" ..... ··· 
200 
100 
30 o+J-~~~10,--~-~,o;;--~~~Jo 
# Processors 
K=5 
200 
100 
(c) 
...................... 2 processor/bus 
f ..... + ... + .... + 3 processor/bus 
................. 4 processor/bus .x·" ..... 
............. / 
·y~/ 
'j/ .. i 
~, ... /.,l 
~/;:'./ 
fj:':f 
i#$;;· 
tl 
10 # ?f&essors 
K = 10 
30 
Fig. 6. Floating point rate in Mflop/s achieved for the matrix-matrix product for orders defined by (3.6a)-(3.6c), 
where b = 18, the maximum attainable block size. K corresponds with the number of block matrix multiplications per 
resulting block C;J· 
100 M Nool /Applied Numerical Mathematics 19 (1995) 91-114 
A-···-A .. ···A ..... ,A l processor/bus 
25 
.................... 2 processor/bus 
f ..... + .. + ...... \ 3 processor/bus 
20 ................ 4 processor/bus 
Processor 15 
Speed 
up 
10 
(a) 
0+D~~~l~O~~#-Pf0=2~-es-so-rs~~30 
Bus 
Speed 2 
up 
A .. ··6· .. -6 .. ··A 1 processor/bus (b) 
.................. 2 processor/bus 
f .... +·+·· .. ·I 3 processor/bus 
............. 4 processor/bus 
..................................... 
·~ ............. ·•···········• 
\ ........... ..j. ............ j ............................ j ............. j ............. j 
................................................................................ 
6········· .. ·f/a ............ 4 ............ 4'•·• ......... f!J ......• ··.dr··········-A 
Fig. 7. Processor and bus speed-up for the matrix-matrix product with K = 10 and a block size b = 18. 
did not expect any performance improvement by increasing K. Instead of one load and one 
store for each submatrix Ci,j• K loads and K stores are required when preloaded data are lost 
after a call of _dmmm. Fig. 6, however, shows that in practice the performance depends on K. 
Probably, it is recognized during execution, that the elements of Ci,j are already in cache. 
We have chosen to present the speed-up in two ways: 
Execution time on 1 processor 
S - (3.7a) 
processor - Execution time on B X P processors ' 
Execution time on B buses with 1 processor 
s = (3.7b) 
bus Execution time on Bx P processors 
For K = 10 and block size b = 18 a processor speed-up close to 18 is reached for the full 
configuration. The bus speed-up lines in Fig. 7 are nearly horizontal, which implies that the 
gain of adding more processing elements to a bus does not depend on the number of buses 
already involved. In other words, in that direction the configuration is scalable. The effect of 
adding more processing elements to a bus is less obvious. Fig. 8 clearly displays that an increase 
in cache contents corresponds with an increase of bus speed-up and thus the performance will 
grow. For this application, it does not pay to use more than three processing elements per bus 
in case b = 10. In view of the peak rates for Tr and Tt as mentioned in Section 2.1, which 
deliver a= 2, we could have expected here a leverage value of 5. 
4. A performance model of the CRAY APP 
From the previous section it is known that when the number of buses increases the speed-up 
will increase linearly. In contrast, adding more processing elements to a bus will not lead to the 
M. Nool I Applied Numerical Mathematics 19 (1995) 91-114 101 
6 • 6 ·'- J processor/bus (a) (b) (c) (d) 
• .. • • l processor/bus 
I l 1 1 J processor/bus 
• • • • 4 processor/bus 
. . ·•··- . 
•···--·· . . . . 
' ! T ,. '.,.,.., .... ·•····' 
. 
I T T ' 
I ' • 
#Buse~" 
Fig. 8. Influence of bus traffic for the matrix-matrix product (K = 10). From left to right the results of the bus 
speed-up for block sizes b = 10, 12, 14, 16, 18. 
same speed-up. Only a few applications have such a high computation intensity that they can 
use the APP configuration with four processors per bus efficiently. At least three other 
characteristics of the APP can be distinguished which determine the ultimate performance 
result: 
• the speed of a processing element, 
•the speed of data transfer, 
• the cache size of a processing element. 
Let Tt be the (measured) transfer time for one floating point number and let Tc be the 
(measured) time needed for one floating point operation in SINGLE or DOUBLE PRECI-
SION arithmetic. Assume that Tt is a constant independent of the length of the vector to be 
transferred and that most of the floating point operations can be performed at speed Tc. In 
addition, we assume that a sufficient-at least the leverage value (3.5)-number of processing 
elements is available on a bus. In this section we illustrate how the performance of a bus will 
change when one of the above mentioned machine quantities is modified. For that purpose, we 
define a as the ratio of Tt to Tc. The transfer time can then be written as 
(4.1) 
As an example we take the matrix-matrix multiplication (cf. Fig. 3) with m = n = k, block size 
KB= MB= NB= b and K X b = k. The number of floating point operations (flops) per block 
for this application is given by 
#flops = 2Kb3 = 2kb 2 , 
and the number of data transfers by 
#transfers= 2(K + l)b 2 • 
The processor performance can be expressed by 
#flops 2kb 2 k 1 
1/f = = ---=-------::--
processor execution time 2kb2Tf + 2(K + l)b 27't k + (K + l)a Tc 
(4.2) 
(4.3) 
(4.4) 
102 M. Nool /Applied Numerical Mathematics 19 (1995) 91-114 
The leverage of the matrix-matrix multiplication (3.5) is denoted by 
2kb 2Tf k 
Leverage= 2(K + l)b2Tt = (K + l)a . ( 4.5) 
The maximum performance per bus 1ft bus is defined by 
k 2 1 
P bus= Leverage· 1/lprocessor = [ k ( ) ] { ) · - · + K + 1 a K + 1 a Tf (4.6) 
If the block size b is chosen such that the data cache is (nearly) fully occupied, the smallest 
possible value of K is obtained. This value results in a maximum processor performance and the 
leverage is maximal, too, and so is the bus performance 1Jibus· 
Let us return to the APP characteristics which influence the bus performance. First, we 
assume that it is possible to increase the speed of a processing element by a factor (3, (3 > 1. At 
the same time we assume that the data transfer time remains constant. In formula, 
(4.7) 
The ratio of the improved computation speed's bus performance and the actual speed's bus 
performance becomes 
ijr bus 
1Ji bus 
k + (K + l)a 
------ < 1, for f3 > 1. 
k + (K + l)a/3 ( 4
.8) 
As a very undesirable effect, we get a decrease in bus performance and, consequently, a 
decrease in APP performance when the speed of the individual processing element increases. 
Fig. 9 shows an example of an application running efficiently on five processors sharing the 
same bus. If the computation time is reduced to half the original time, not more than three 
processors on a bus can be used and the new bus time becomes larger than for the "slow" case. 
Secondly, we assume that the transfer rate can be accelerated by a factor y, y > 1, and the 
computation speed will be unaltered. Then we get 
a= a/y, (4.9) 
and the maximum bus performance ifr bus becomes 
- k2 ,,2 
1Jr = ·-bus [yk+(K+l)a](K+l)a Tf. (4.10) 
Compared to the original situation we get an acceleration of 
ifrbus k + (K + l)a 
-- = y > y, for y > 1. 
1Jr bus k + ( K + 1 ) a/ y ( 4.11) 
Without doubt a reduction of the transfer time will improve the performance of the APP. 
M. Nool /Applied Numerical Mathematics 19 ( 1995) 91-ll4 
If ... ~ ........ ~ ......... ~ ... --__, ...... '8::• 111':1 .. ;..  ; .. 
Pi 
It 
I: 
i; 
.. 
.. 
liu ...... 
.... ... . 
.. 
--... computation 
.... data transfer 
..... .. .. 
ilab .... 
]t •:• Bra:• ... , .... 
.. '41' Irr ......... .. 
103 
Fig. 9. In the upper part five processors share the same bus efficiently. A reduction in computation time reduces the 
number of useful processors on the bus. In the new situation only three processors can be used resulting in an 
increase of the bus time. 
Thirdly, we consider the maximum bus performance as a function of the cache contents. If 
the data cache size is enlarged by o, then the block size b can be enlarged to 
h=f8·b 
and for a fixed problem, K will become 
K = 0 112 K. 
The maximum bus performance will increase then by a factor 
']_/'lws k + (K + 1 )a K + 1 
-·-- = - · > 1 for o > 1, 
1fr hus k + ( o 112 K + 1 ) a o ·- 112 K + 1 ' 
(4.12) 
(4.13) 
( 4.14) 
since both the leverage and the performance per processing element grow. Notice that the gain 
is less spectacular than obtained for a reduced transfer time. 
Summarizing, from the possibilities to improve the bus performance described above the 
largest gain can be expected by an increase of the transfer speed, whereas a decrease of the 
cycle time for a floating point operation will even result in a lower bus performance. 
5. The Cholesky factorization 
The block Cholcsky matrix factorization has a computational complexity which is comparable 
to that of the matrix-matrix multiplication. In this section we describe two variants: the block 
left-looking factorization and the block right-looking factorization. A third variant, the block 
top-looking, is not considered: for this we expect a similar behavior as for the left-looking 
104 M. Nool /Applied Numerical Mathematics 19 (1995) 91-114 
variant. All variants have exactly the same number of floating point operations. The maj 
are described in terms of the BLAS [5,6,10] and the organization is such that each p1 
will perform BLAS operations on single block matrices only. 
The Cholesky factorization of a symmetric, positive-definite matrix A is given by 
A=L·LT, 
where L is a unique lower triangular matrix. Assume that A can be partitioned into K; 
blocks of order b such that k, the order of A, is equal to K X b. We first derive tt. 
left-looking variant which is block-column-oriented. Next, the right-looking algorithm is 
ered. 
5.1. Left-looking algorithm 
This well-known factorization can be derived from the following block-matrix produ 
[ A11 1 [L 11 1 ·[Lfi L~1 L~11 Az1 Az2 = Lz1 Lz2 L~2 L~2 · 
A31 A32 A33 L31 L32 L33 L~3 
We assume that in the previous l - 1 steps the block matrices Lw L 21 and L 3 
consisting of (l - 1) x b columns and (l - 1) X b rows, b rows and (K - l) X b rows, resi: 
-have been computed. In the current step l, the block matrices L 22 and L 32 , with a 
width of b columns, must be updated. Then from the block matrix equality, we obtain 
Lz1 ·L11 +L22 ·L12 =A22• 
L31 ·L11 +L32 ·L12 =A32· 
. 11e computation of the diagonal block L 22 consists of two steps: 
A'22 ~A22 -L21 ·L11, 
or, in Level 3 BLAS terms, a DSYRK operation, followed by 
L 22 ~ Cholesky(A'22 ), 
which can be performed by a Level 2 BLAS algorithm. We refer to this algorithm a~ 
The block column matrix L 32 can be obtained by first computing 
A'32 ~ A32 - L31. L11• 
a straightforward matrix-matrix multiplication and subtraction, and followed by L 32 I 
L32 ~ A'32. Lz/. 
In (5.7) a triangular system is solved with a multiple right-hand side, which can be perfc 
the Level 3 BLAS routine DTRSM. In the next step another block column of L is upda 
after precisely K steps the factorization is completed. This algorithm is known as a lef1 
algorithm because the data referred to is mainly on the left-hand side of the curre 
column. 
r steps 
>Cessor 
(5.1) 
square 
: block 
~onsid-
(5.2) 
-each 
ctively 
:olumn 
(5.3) 
(5.4) 
(5.5) 
)LLT. 
(5.6) 
comes 
(5.7) 
aed by 
d, and 
)Oking 
block 
M. Nao// Applied Numerical Mathematics 19 (1995) 91-114 105 
5.1.1. Explicit parallelism by operating on single blocks 
The BLAS operations can be parallelized either implicitly or explicitly. In case of implicit 
parallel processing the BLAS are parallelized such that within a BLAS operation the work is 
equally divided over several processors. In case of explicit parallel processing, calculations on a 
single block-both input and output matrices consists of single blocks of dimension b-are not 
parallelized; however, these block operations themselves, scheduled more or less arbitrarily 
(taking into account the data dependencies) can be performed in parallel. 
In [11] parallelism is exploited in a dynamic way. Before the computation starts, the 
dependencies between single block operations are uniquely described and when a processor has 
accomplished some single block operation then it "looks" for another operation in a queue of 
"ready-to-start" processes. After completion of a block operation other processes can become 
"ready-to-start". This asynchronous approach makes it possible to reduce idle time substan-
tially and, therefore, a highly efficient method is obtained. An important disadvantage of this 
method, however, is that an extensive administration of the state of the processes and the 
dependencies is required. SCHEDULE [4], the package that was used for that purpose in [11], 
is not available on the CRAY APP. Therefore, parallelism is considered at a lower level: a 
simpler static scheduling of operations is applied, although reduction of idle time is still one of 
the main issues. 
5.1.2. Parallel implementation 
To start with we discuss how the multiple block operations (5.4), (5.6) and (5.7) can be 
written as single block operations and how parallelism can be introduced explicitly. The 
symmetric rank-k update (5.4) can be rewritten as / - 1 single block DSYRK operations. Since 
all of the operations write to the same data block A'22 , they cannot be carried out concurrently. 
Furthermore, the matrix-matrix multiplication (5.6) can be split up into (K - l) x (I - 1) single 
block DGEMM multiplications of which the explicit level of parallelism is only K- l. Finally, 
the multiple block operation DTRSM (5.7) can be partitioned into K-1 single block opera-
tions to be performed simultaneously. 
Obviously, the load balancing of such a straightforward implementation is far from optimal. 
A better balancing is achieved by performing the matrix-matrix multiplication DGEMM and 
the symmetric rank-k update DSYRK simultaneously. These processes are data-independent: 
DSYRK updates the diagonal blocks whereas DGEMM writes to off-diagonal blocks (cf. Fig. 
10). This approach is satisfactory if the execution time of a block multiplication is comparable 
with the execution time for a single block symmetric rank-k update. If not, then a better 
solution can be achieved when a "WHILE" construction, illustrated in Fig. 11, is applied, 
especially when the number of blocks in the Ith column is larger than the number of processing 
elements p. A critical section is used to allow processors to update the shared variable JJ. The 
MCP _BARRIER is called to guarantee that all operations are finished when the final update 
of the off-diagonal blocks, performed by DTRSM, starts. This "WHILE" construction makes it 
possible to execute processes asynchronously. As soon as a sequence of DSYRK or DGEMM is 
terminated, the processor starts with another operation. Moreover, the sequence of DSYRK 
operations can immediately be followed by a call of DLLT, since its input is independent of the 
operations on the off-diagonal blocks in column /.We remark that the idle time can be reduced 
even more by performing the DTRSM immediately after the (l - l)th DGEMM contribution 
106 
CPCF PARALLEL 
CPCF PRIVATE I, J 
CPCF PDO 
DO J = L, K 
IF( J.EQ.L )THEN 
DO I = 1, L-1 
M. Noa// Applied Numerical Mathematics 19 (1995) 91-114 
A[L,LJ = A[L,LJ-A[L,l].Transpose(A[L,I]) 
END DO 
c 
ELSE 
DO I = 1, L-1 
A[J,LJ = A[J,L]-A[J,I].Transpose(A[L,l]) 
END DO 
END IF 
END DO 
CPCF SINGLE PROCES 
A[L,L] = Cholesky(A[L,L]) 
CPCF END SINGLE PROCES 
c 
CPCF PRIVATE J 
CPCF PDQ 
c 
DO J = L+1, K 
A[J,LJ = A[J,LJ.Inverse(Transpose(A[L,L])) 
END DO 
CPCF END PARALLEL 
/* DSYRK(5.4) */ 
/* DGEMM(5.6) */ 
/* DLLT(5.5) */ 
I* DTRSMCS.7) */ 
Fig. 10. A pseudo-algorithmic description of the parallel implementation for the lth column of left-looking Cholesky 
factorization. While one processor updates the diagonal block, other processors perform single block matrix-matrix 
multiplications. 
under the condition that the DLLT on the diagonal block has been completed. However, a 
comprehensive registration of the status of the processes will be required then. 
5.1.3. Implementation and performance of single block BLAS 
Programming the CRAY APP has been reduced to implementing the four basic single block 
operations: DSYRK, DGEMM, DTRSM and DLLT. As mentioned in Section 2.2, local BLAS 
implementations as described in [7] are not available, but it is possible to get pipeline rather 
than scalar performance by means of a very restricted set of Extended Math. Library routines 
(1]. 
Both the matrix-matrix product DGEMM and the symmetric rank-k update DSYRK are 
1erformed by a call to _ dmmm (see Section 2.2). For the latter this implies that the symmetry 
,f the resulting matrix is not taken into account and twice as many floating point operations are 
1erformed as actually required. Other attempts to get a better performance for DSYRK failed. 
:onsiderable effort has been put in the efficient implementation of DTRSM and DLLT. Table 
shows the performance in Mflop/s for the single block operations achieved for the optimum 
>lock size b = 18. We consider the performance given in Table 1 as rather disappointing. 
--,1 
I 
I 
JJ = L - 1 
c 
CPCF PARALLEL 
CPCF PRIVATE I, J 
c 
WHILE( JJ.LE.K ) DO 
CPCF CRITICAL SECTION 
JJ = JJ + 1 
J = JJ 
M. Nao// Applied Numerical Mathematics 19 (1995) 91-114 
CPCF END CRITICAL SECTION 
IF( J.EQ.L )THEN 
c 
c 
DO I = 1, L-1 
A[L,L] = A[L,L]-A[L,IJ.Transpose(A[L,IJ) 
END DO 
A[L,LJ 
ELSE 
Cholesky(A[L,LJ) 
DO I = 1, L-1 
A[J,L] =A[J,L]-A[J,l].Transpose(A[L,l]) 
END DO 
END IF 
END WHILE DO 
CALL MCP BARRIER( 
CPCF PRIVATE J 
CPCF PDQ 
c 
DO J = L+1, K 
A[J,LJ =A[J,LJ.Inverse<Transpose(A[L,L])) 
END DO 
CPCF END PARALLEL 
107 
/* DSYRK(5.4) */ 
/* DLLT(5.5) */ 
/* DGEMM(S.6) */ 
/* DTRSM(5.7) */ 
Fig. 11. A pseudo-algorithmic description of the parallel implementation for the Ith column of the left-looking 
Cholcsky factorization with the "WHILE" construction. The update of the diagonal block and those of the 
off-diagonal blocks arc carried out concurrently. As soon as either a diagonal or an off-diagonal block update is 
terminated, the processor may continue with updating another off-diagonal block. 
Table I 
Performance results and number of floating point operations of local BLAS operations for the optimum block size 
h- 18 
Block routine 
DGEMM 
DSYRK 
DTRSM 
DLLT 
Performance 
25.6 Mflop/s 
13.6 Mflop/s 
6.7 Mflop/s 
4.0 Mflop/s 
# flops 
2h' 
b' 
b' 
(2bJ + 3h 2 + b )/6 
I 
108 M. Nool /Applied Numerical Mathematics 19 (1995) 91-114 
400~--------
processor/bus 
(a) 
G··O.··EJ. .. Q2 processor/bus 
300 l·-····f .. ···.f-·-··.f 3 processor/bus 300 
................... 4 processor/bus 
Mflopls 200 Mflopls 200 
100 100 
#Buses #Buses 
Fig. 12. Performance results of two left-looking block Cholesky factorizations. The left figure corresponds to the 
implementation presented in Fig. 10, whereas the right figure shows the Mflop rate achieved for the "WHILE" 
construction of Fig. 11. 
5.1.4. Performance of left-looking implementations 
Fig. 12 shows the performance results obtained for two left-looking variants: the left figure 
for the implementation presented in Fig. 10, the right figure for the implementation presented 
in Fig. 11. So, for the left-looking block Cholesky factorization, with its decreasing amount of 
computational work per block column, asynchronous job scheduling can improve the perfor-
mance substantially. Moreover, the performance gain obtained by going from 3 to 4 processors 
per bus is less obvious than from 1 to 2 processors per bus. However, an increase of the total 
number of processors may not result in a performance increase. For instance, the execution 
time achieved on 14 processors (viz., 7 buses X 2 processors/bus) is less than the time achieved 
on 15 processors (viz., 5 buses x 3 processors/bus) for both implementations. In Fig. 13, we 
present the bus efficiency, defined by the ratio of bus speed-up (3.7) to the number of 
processors per bus. Although the same steps are performed, we observe that due to a better job 
scheduling the bus efficiency for the full configuration has been raised from less than 30% to 
nearly 40%. 
5.2. Right-looking algorithm 
The block right-looking algorithm has an updating pattern which significantly differs from 
the left-looking algorithm. The latter does not perform operations before they are actually 
needed; information from the past-at the left side of the current column-is used to update 
that column. The right-looking algorithm, however, processes information as soon as it becomes 
available. If, for instance, the first block column of L has been computed, then all blocks at the 
M. Nool /Applied Numerical Mathematics 19 (1995) 91-114 
0.8 
0.6 
Efficiency 
0.4 
0.2 
·····~ 
........ ,. 
'·· .. 
·······:a 
···., 
······n. 
··· .. 'A. 
· ......... ,, .... 
'\ .... ·"!.... .. .... El 
....... ..... .. .......... t.. '· .. El. 
. ··.. ··a 
................ ""f.... . .. , ........... El 
..... 
·· ...... . 
··· ...... . 
··········• 
(a) 
o~~,~--,~,~_,__.~,~-r--o~,~ 
#Buses 
··· ......... A 
··-...... '!!, •• (b) 
·········A·~ ... 
·····-A ... 
....... LI ............ ,. 
0.8 
Gl 
........... B. 
1.. · ........ El ... 
',J... 
.......... ; 
··· .•... 
..... "•k 
a ... 
""B., 
0.6 
...... ""i 
..... "•... .. ............ !-.. ......... , 
·· ........ . 
"········-.......... 
Efficiency 
0.4 
0.2 
0~~,~--,-,~.---~,____,0,__~I~ 
#Buses 
Fig. 13. Bus efficiency for the two left-looking block Cholesky factorizations of Fig. 12. 
109 
right-hand side can be updated with contributions from the first block column: the diagonal 
blocks by a DSYRK operation and the off-diagonal matrices by a matrix-matrix multiplication. 
We prefer to describe the right-looking implementation in terms of single block operations 
directly. Again it is assumed that A (and L) are of order k and k = K X b, where b denotes the 
block size. In the first step the factorization can be performed analogously to the left-looking 
algorithm: 
L[l, 1] - Cholesky(A[l, 1]), 
L[i, 1) -A[i, 1) ·L(l, 1)-T, 
(5.8) 
(5.9) 
for i = 2, ... , K. For the remaining part of A the diagonal blocks A[i, i], i = 2, ... , K, are 
updated with respect to the first block column of L: 
A[i, i] =A[i, i] -L[i, 1) ·L[i, l]T, (5.10) 
and the off-diagonal elements by a matrix-matrix multiply and add operation 
A[i, j) =A[i, j] -L[i, 1] ·L[j, l]T, (5.11) 
for i = 3, ... , K, j = 2, ... , i - 1. As soon as the first step is completed, the same operations can 
be applied on the recently updated submatrix of A. The operations of the first two steps are 
illustrated by Fig. 14. In Fig. 16 the implementation of the computation of the lth block column 
and the update of its remaining right-hand part is given. Note that some idle time can be saved 
by combining DTRSM, DSYRK and the Cholesky factorization of the current block as is shown 
in Fig. 15. Actually, the reduction of idle time will be larger than shown, because relatively 
expensive operations are combined. 
110 M. Nool /Applied Numerical Mathematics 19 (1995) 91-114 
• Cholesky 
• DTRSM 
• DGEMM 
i 
l • DSYRK 
j- j-
Fig. 14. Update scheme for the first two steps of right-looking block Cholesky factorization. 
5.2.1. Performance of right-looking implementation 
The main difference in performance for the left- and right-looking implementations comes 
from the level of parallelism. For the left-looking algorithm the number of independent 
processes for the lth block column is equal to the number of blocks in that column, viz., K - l, 
in case diagonal and off-diagonal blocks are updated concurrently. For the right looking 
algorithm this number is much higher, viz., 
t{K- l) · (K- l + 1), (5.12) 
1ince all DSYRK and DGEMM updates are data-independent. 
Fig. 17 displays performance results of the block right-looking variant. The results for a small 
umber of processors are comparable to the left-looking implementations, but for a larger 
-- -- --- -
-- -- --- -
-- --- -
--- - -
- - - ----· - -
- - -
- ---
---
• Cholesky 
• DSYRK 
• DTRSM 
• DGEMM 
Fig. 15. Right-looking block Cholesky factorizations on five processors. The upper pattern illustrates the parallel 
execution of Fig. 14's implementation. A combination of DSYRK and DTRSM and the Cholesky operations can 
reduce idle time as is illustrated by the lower pattern. 
s comes 
;>en dent 
., K-1, 
looking 
(5.12) 
a small 
t larger 
-
I 
parallel 
ons can 
IF( L.EG.1 >THEN 
CPCF SINGLE PROCES 
M. Nool /Applied Numerical Mathematics 19 ( 1995) 91-114 
A[L,L] = Cholesky(A[L,L]) 
CPCF END SINGLE PROCES 
END IF 
c 
IF( L.LT.K )THEN 
CPCF PARALLEL 
CPCF PRIVATE J 
CPCF PDQ 
c 
DO J = L+1, K 
A[J,LJ = A[J,LJ.Inverse<Transpose(A[L,L])) 
A[J,JJ = ACJ,JJ-A[J,L].TransposeCA[J,LJ) 
IF( J.EQ.L+1 )THEN 
A[L+1,L+1J = Cholesky(A[L+1,L+1J) 
END IF 
END DO 
CPCF SINGLE PROCESS 
C NUMBER is number of matrix-matrix multiplications for column L 
c 
c 
CPCF 
c 
CPC F 
CPCF 
c 
NUMBER = (K-L-1) * (K-L)/2 
JJ = L 
KK = K 
END SINGLE PROCESS 
PRIVATE I, IJ, J 
PDQ 
DO IJ=1, NUMBER 
CPCF CRITICAL SECTION 
ll1 
/* DLLT(5.5) */ 
/* DTRSM(5.7) */ 
/* DSYRK(5.4) */ 
I* DLLT(5.5) */ 
C Determine the indices I, J of the matrix on which the matrix-matrix multiply rou-
C tine DGEMM must be performed 
IF( KK.EQ.K )THEN 
JJ = JJ + 1 
KK = JJ + 1 
I = KK 
J = JJ 
ELSE 
KK = KK + 1 
I = KK 
J = JJ 
END IF 
CPCF END CRITICAL SECTION 
A[I,JJ = A[I,JJ-A[I,LJ.Transpose(A[J,LJ) 
END DO 
CPCF END PARALLEL 
c 
END IF 
/* DGEMM(5.6) */ 
Fig. 16. A pseudo-algorithmic description of the parallel implementation of the right-looking block Cholesky 
factorization. 
112 
300 
Mfiop/s 200 
100 
I I. 
M. Nool /Applied Numerical Mathematics 19 (1995) 91-114 
.j "+ :) 0 I 
#Buses 
Bus 
Speed 2 
up 
(b) 
................... 
......................... ............ .. 
, ...... j .......... + ........... j-........ + ......... + .......... ! 
G·" ...... Q ... M•·- ... Q. .......... g. .......... G .......... g. .......... g 
4 ............ IJ. ·········6 ......... A,.. .. ..... IJ. ........... -A········ .... A 
L. .1 .. ;;i D 
#Buses 
0.8 
0.6 
Bus 
Efficiency 
0.4 
0.2 
(c) 
[} ....... i;; .......... g ......... g .......... g. .... , ... g ........ g 
.......... 
. ........ . 
. ........ .. 
"······· .......... 
"" .J "' .. 0 #Buses 
Fig. 17. Results of the right-looking block Cholesky factorization. From left to right: the Mflop/s, the bus speed-up 
and the bus efficiency for b = 18, respectively. 
number of processors we see that the right-looking implementation performs much better due 
to the higher level of parallelism. The speed-up pattern approximates the one achieved for the 
matrix-matrix multiplication (cf. Fig. 8). We also refer to [2] and [11] for similar experiments. 
It is worth noting, however, that for the left-looking algorithm the data reuse is better: only 
data from the current block is updated and can be kept in cache and as we have seen in Section 
3.4: this will probably happen. For the right-looking algorithm no data can be reused and, 
consequently, more bus traffic takes place. Apparently, this disadvantage does not play an 
mportant role. 
Finally, we discuss performance results for different block sizes. In Fig. 18 all lower lines 
correspond to a block size of 10, the upper lines to the optimum block size of 18, the middle 
400 
300 -
-a 
0 200 -<;::: 
::E 
(a) (b) (c) 
200 400 600 800 0 200 400 600 800 
. ··.. 
..... ·· 
" ...... ·· 
,,." .. ·· 
4/j, •• •••• 
(d) ••••• 
.. 
.. 
.. 
.. 
....... 
:'..-~ .. · --
... 
. :
~~· ....... 
/:..-·· .. 
.;..!/ 
:,· 
I 
,.,/ I I I I 
0 200 400 600 800 
Order Matrix Order Matrix Order Matrix Order Matrix 
Fig. 18. Performance of the right-looking Cholesky factorization for block sizes: b = 10, 16, 18. From left to right the 
Mflop/s obtained for runs on 7Xl, 7X2, 7X3, 7X4 processors. 
M. Nool /Applied Numerical Mathematics 19 (1995) 91-114 113 
line in each picture corresponds to a block size of 16. Horizontally, the order of the matrix is 
displayed. For a fixed matrix order a decrease of the block size delivers a higher level of 
parallelism, which is not translated into a higher efficiency. From Fig. 18 it is clear that for a 
block size of 10 the bus traffic negatively influences the performances, whereas a larger block 
size with a corresponding higher matrix-matrix multiplication's leverage value takes care of a 
high speed on the full bus configuration. 
6. Conclusions 
We have shown the influence of bus traffic on the APP performance by means of simple 
block matrix-matrix products. A bus speed-up of nearly three was achieved for matrix-matrix 
multiplication and optimally filled cache. We do not expect that better results can be obtained 
with the addition of more processing elements on the same bus. 
For the Cholesky decomposition a reasonable efficiency was achieved, not only because of 
the high matrix-matrix multiplication performance but also because of the choice of algorithm. 
The right-looking implementation with its higher level of parallelism performs significantly 
better than the left-looking implementations. For both algorithms the number of synchroniza-
tion points is significant. By means of taking different jobs together with the so-called 
"WHILE" construction considerable execution time reduction can be achieved. The best 
results for the APP configuration were obtained for partitionings with cache contents as large 
as possible. A partitioning into smaller blocks with a corresponding higher level of parallelism 
does not lead to a better performance (as is shown by Fig. 18) because of increased data traffic. 
Acknowledgements 
The author wishes to thank Alan Stewart and Herman J.J. te Riele for their constructive 
comments. 
References 
[I] Cray Research Superservers, Inc., CRAY APP Programmer's Reference (1992) Extended Math. Routines. 
[2] K. Dackland, E. Elrnroth, B. K~gstrom and C. Van Loan, Parallel block matrix factorizations on the 
shared-memory multiprocessor IBM 3090 vf/600j, lntemat. J. of Supercomputer Appl. 6 (1992) 69-97. 
[3] J.J. Dongarra, Performance of various computers using standard linear equations software in a Fortran 
environment, Technical Report CS-89-85, Oak Ridge National Laboratory (1992). 
[4] F.B. Hanson and D.C. Sorensen, The SCHEDULE parallel programming package with recycling job queues 
and iterated dependency graphs, Technical Report ANL-MCS-P22-0!89, Argonne National Laboratory (1989). 
[5] J.J. Dongarra, J. Du Croz, I. Duff and S. Hammerling, A set of level 3 Basic Linear Algebra Subprograms, 
ACM Trans. Math. Software 16 (1) (1990) 1-17. 
[6) J.J. Dongarra, J. Du Croz, S. Hammerling and R.J. Hanson, An extended set of Fortran Basic Linear Algebra 
Subprograms, ACM Trans. Math. Software 14 (I) (1988) 1-32. 
[7] S.L. Johnsson and L.F. Ortiz, Local Basic Linear Algebra Subroutines LBLAS for distributed memory 
architectures and languages with array syntax, Internal. J. Supercomputer Appl. 6 ( 1992) 322-350. 
114 M. Nool /Applied Numerical Mathematics 19 (1995) 91-114 
[8) C.-H. Lai and H.J.J. te Riele, Some experiences of solving 1-D semiconductor device equations on a matrix 
coprocessor by a domain decomposition method, Technical Report NM-R9304, CWI, Amsterdam (1993). [9] C.-H. Lai, H.J.J. te Riele and A. Ualit, Parallel experiments with simple linear algebra operations on a Cray 
S-MP System 500 matrix coprocessor, Technical Report NM-N9301, CWI, Amsterdam (1993). 
[10) C.L. Lawson, R.J. Hanson, D. Kincaid and F.T. Krogh, Basic Linear Algebra Subprograms for FORTRAN 
usage, ACM Trans. Math. Software 5 (1979) 308-323. 
[11) M. Louter-Nool, Block-Cholesky for parallel processing, Appl. Numer. Math. 10 (1992) 37-57. [12) A. Stewart, M. Louter-Nool, H.J.J. te Riele and D.T. Winter, An investigation of data reuse on the Cray S-MP 
System 500, Supercomputer XI (1) 59 (1994/95). 
[13) A.J. van der Steen. Overview of recent supercomputers, Technical Report, 5th rev. ed., Stichting Nationale 
Computer Faciliteiten, The Hague (1995). 
