Implementation and Performance Analysis of Numerical Algorithms on the MPP, Flex/32, and Cray/2 by Fatoohi, Raad A.
Old Dominion University 
ODU Digital Commons 
Electrical & Computer Engineering Theses & 
Dissertations Electrical & Computer Engineering 
Winter 1987 
Implementation and Performance Analysis of Numerical 
Algorithms on the MPP, Flex/32, and Cray/2 
Raad A. Fatoohi 
Old Dominion University 
Follow this and additional works at: https://digitalcommons.odu.edu/ece_etds 
 Part of the Electrical and Computer Engineering Commons 
Recommended Citation 
Fatoohi, Raad A.. "Implementation and Performance Analysis of Numerical Algorithms on the MPP, Flex/
32, and Cray/2" (1987). Doctor of Philosophy (PhD), Dissertation, Electrical & Computer Engineering, Old 
Dominion University, DOI: 10.25777/hgf5-ka95 
https://digitalcommons.odu.edu/ece_etds/202 
This Dissertation is brought to you for free and open access by the Electrical & Computer Engineering at ODU 
Digital Commons. It has been accepted for inclusion in Electrical & Computer Engineering Theses & Dissertations 
by an authorized administrator of ODU Digital Commons. For more information, please contact 
digitalcommons@odu.edu. 
IMPLEMENTATION AND PERFORMANCE ANALYSIS OF NUMERICAL 
ALGORITHMS ON THE MPP, FLEX/32, AND CRAY/2
Raad A. Fatoohi 
B.S. June 1976, Mosul University 
M.S. December 1982, Syracuse University
A Dissertation Submitted to the Faculty of 
Old Dominion University in Partial Fulfillment o f the 
Requirements for the Degree of
DOCTOR OF PHILOSOPHY 
IN
ENGINEERING




ohn W. Stoughton (Director)
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
Abstract
IMPLEMENTATION AND PERFORMANCE ANALYSIS OF NUMERICAL 
ALGORITHMS ON THE MPP, FLEX/32, AND CRAY/2
Raad A. Fatoohi 
Old Dominion University, 1987
This dissertation presents the results o f the implementation o f a number o f numerical 
algorithms on three parallel/vector computers. The object o f this research is to determine 
how well, or poorly, a number of numerical algorithms would map onto three different 
architectures and to analyze the performance of these architectures using these algorithms. 
These algorithms are: a relaxation scheme for the solution of the Cauchy-Riemann equa­
tions, an ADI method for the solution of the diffusion equation, and a compact difference 
scheme for the solution o f two-dimensional Navier-Stokes equations. The computers were 
chosen so as to encompass a variety of architectures. They are: the MPP, an SIMD 
machine with 16K bit serial processors; Flex/32, an MIMD machine with 20 processors; 
and the Cray/2. The machine architectures are briefly described. The implementation of 
these algorithms is discussed in relation to these architectures and measures o f the perfor­
mance on each machine are given. The basic comparison is among SIMD instruction paral­
lelism on the MPP, MIMD process parallelism on the Flex/32, and vectorization of a serial 
code on the Cray/2. Simple performance models arc used to describe the performance. 
These models highlight the bottlenecks and limiting factors for these algorithms on these 
architectures. Finally conclusions arc presented.
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
Dedication
For Rakiya and Wisam
i i
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
Acknowledgements
I would like to take this opportunity to extend a special thanks to Professor Chester 
Grosch, without his unique offer for supervision this research would not have been possible. 
In addition to the technical expertise that he provided, I am grateful to Professor Grosch for 
his patience, encouragement, and expeditious feedback.
I would also like to thank the other members of my committee, Dr. John Stoughton, 
Dr. Stephen Zahorian, Dr. Sharad Kanetkar, and Dr. Robert Voigt for their help and advice.
I would like to express my appreciation to NASA Langley Research Center, NASA 
Goddard Space Flight Center, and NASA Ames Research Center for allowing me access to 
the Flex/32, MPP, and Cray/2. I am also grateful to Tom Crockett o f ICASE, Tor Opsahl 
and David Wildenhain o f Science Applications Research, and Dan Nagle o f Cray Research, 
Inc., for their help in using these machines.
Finally, I would like to thank Dr. Roland Mielke, Chairman o f the Electrical and 
Computer Engineering Department, and Dr. Robert Voigt, Director o f ICASE, for their 
valuable support during my Ph.D. study. This research was supported by the National 
Aeronautics and Space Administration under NASA Contracts No. NAS1-17070 and 
NAS 1-18107 while the author was in residence at ICASE, NASA Langley Research Center, 
Hampton, VA 23665.
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
Table of Contents
Page
List of Tables vi
List o f Figures viii
List of Symbols ix
Chapter
1. Introduction 1
2. The Architectures and Programming Languages 5
2.1. The MPP architecture and MPP Pascal 5
2.2. The Flex/32 architecture and Concurrent Fortran 8
2.3. The Cray/2 architecture and CFT/2 compiler 11
3. Performance Models 13
3.1. The MPP 13
3.2. The Hex/32 15
3.3. The Cray/2 17
4. Solving the Cauchy-Riemann Equations 19
4.1. The numerical method 19
4.2. Implementation on the MPP 25
4.3. Implementation on the Rex/32 31
4.4. Implementation on the Cray/2 34
5. Solving the Diffusion Equation 39
5.1. The numerical method 39
5.2. Implementation on the MPP 44
5.3. Implementation on the Hex/32 47
5.4. Implementation on the Cray/2 50
i v
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
6. Solving Navier-Stokes Equations 54
6.1. The numerical method 54
6.2. Implementation on the MPP 58
6.3. Implementation on the Flex/32 61
6.4. Implementation on the Cray/2 64
7. Comparisons and Concluding Remarks 67
List o f References 73
V
Reproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
List of Tables
Table Page
1 Measured execution times (in machine cycles) of the elementary opera- 15
tions on the MPP.
2 Measured execution times for one iteration and processing rates for the 29
relaxation algorithm on the MPP.
3 Estimated times (in milliseconds) o f  the relaxation algorithm on the 30
MPP.
4 Speedup and efficiency o f the relaxation algorithm on the Flex/32 using 33
the MMOS locks.
5 Speedup and efficiency o f the relaxation algorithm on the Eex/32 using 33
the Local locks.
6 Measured execution times for one iteration and processing rates for the 33
relaxation algorithm using 16 processors o f  the Flex/32.
7 Measured execution times for one iteration and processing rates for the 36
relaxation algorithm on one processor o f the Cray/2.
8 Estimated and measured execution times (in milliseconds) o f the re’ax- 38
ation algorithm on the Cray/2.
9 Measured execution time per time step and processing rate for the ADI 45
procedure on the MPP.
10 Operation counts for one pass o f the ADI method using the cyclic el- 46
imination algorithm for solving the tridiagonal systems.
11 Estimated and measured times (in milliseconds) o f the ADI method on 46
the MPP.
12 Speedup and efficiency o f the ADI method on the Eex/32. 48
Reproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
13 Measured execution times per time step and processing rates for the 
ADI method using 16 processors o f the Flex/32.
14 Estimated and measured times (in milliseconds) o f the ADI method on 
the Flex/32.
15 Measured execution times per time step and processing rates for the
ADI method on one processor o f the Cray/2.
16 Operation counts for one pass o f the ADI method using the Gaussian
elimination algorithm for solving the tridiagonal systems.
17 Estimated and measured execution times (in milliseconds) o f the ADI
method on one processor o f the Cray/2.
18 Measured execution time and processing rate o f the Navier-Stokes sub­
programs on the MPP.
19 Estimated times (in milliseconds) o f the Navier-Stokes subprograms on
the MPP.
20 Speedup and efficiency o f the Navier-Stokes algorithm on the Flex/32.
21 Measured execution times for ten time steps and processing rates for
the Navier-Stokes algorithm using 16 processors of the Flex/32.
22 Measured execution time and processing rate o f the Navier-Stokes sub­
programs on the Cray/2.
23 Estimated and measured execution times (in milliseconds) o f the
Navier-Stokes subprograms for the 64 x 64 problem on one processor 
o f the Cray/2.
24 Estimated and measured execution times (in milliseconds) of the
Navier-Stokes subprograms for the 128 x 128 problem on one processor 
o f the Cray/2.
25 Summary  o f  the results for the 128 x  128 problem.
















R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
List of Figures
Figure Page
1 Block diagram o f the MPP. 6
2 Block diagram o f the Flex/32 architectures. 9
3 Typical computational cell and the data associated with it. 21
4 Computational domain and the four color cells. 24
5 Spectral radius as a function o f the acceleration parameter. 26
v i i i
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
List of Symbols
Ca Number of cycles in add operation on the MPP (section 3.1),
Cn Number o f cycles in multiply operation on the M PP (section 3.1),
Cd Number o f cycles in divide operation on the MPP (section 3.1),
CA  Startup cost o f shift operation on the MPP (section 3.1),
C„ Number o f cycles for each step in shift operation on the M PP (section 3.1),
Cij Matrix of coefficients (section 4.1),
CP G ock  Period o f  the Cray/2 (section 3.3),
ftcfp) Busses contention factor on the Flex/32 (section 3.2),
f u  Load distribution factor o f an algorithm on the Flex/32 (section 3.2),
Fij Forcing term o f tridiagonal equations (section 5.1),
Gij Forcing term o f tridiagonal equations (section 5.1),
kctd Number o f times a vector is stored in common memory on the Flex/32 (section 3.2),
kci„ Number o f times a vector is copied to local memory on the Flex/32 (section 3.2),
kcma Number o f times a shared vector is referenced on the Flex/32 (section 3.2),
it*  Number of times a shared variable is locked on the Flex/32 (section 3.2),
Lf Length o f a floating point functional unit of the Cray/2 (section 3.3),
Lm Length o f data path between main memory and registers on the Cray/2 (section 3.3),
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
Lw  Length o f  the vector (section 3.3),
n Number o f  grid points in each dimension o f  the computational domain (section 3.2), 
Na Number o f  additions o f  an algorithm on the MPP (section 3.1),
Nd Number o f  divisions o f  an algorithm on the MPP (section 3.1),
Number o f floating point operations with stride o f  1 on the Cray/2 (section 3.3),
Nfi Number o f  floating point operations with stride o f 2 on the Cray/2 (section 3.3),
Nm Number o f  multiplications o f  an algorithm on the MPP (section 3.1),
Nnl Number o f memory access operations with stride o f 1 on the Cray/2 (section 3.3),
Nn2 Number o f  memory access operations with stride o f 2 on the Cray/2 (section 3.3),
Number o f  shift operations o f an algorithm on the M PP (section 3.1),
N„ Number o f steps in shift operations of an algorithm on the MPP (section 3.1),
p  Number o f  processors (section 3.2),
f  Box variables (section 4.1),
/?! Data transfer rate with stride o f 1 through main memory on the Cray/2 (section 3.3),
R2 Data transfer rate with stride o f 2 through main memory on the Cray/2 (section 3.3),
R(k) Residual after jfc’th iteration (section 4.1),
Re Reynolds number (section 6.1),
tc Machine cycle time for the MPP (section 3.1),
Time to access an element of shared vector on the Flex/32 (section 3.2), 
tkk Total time to lock ar.d unlock a shared variable on the Flex/32 (section 3.2),
ty^  Time to access an element of vector in local memory of the Flex/32 (section 3.2),
Reproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
Time to spawn one process on the Flex/32 (section 3.2),
TcU Time difference between common and local memories on the Flex/32 (section 3.2),
Tclm Total time to copy shared variables to local memory on the Flex/32 (section 3.2),
Tcma Total common memory access time o f an algorithm on the Flex/32 (section 3.2),
T*  cmm Communication cost o f an algorithm on the MPP (section 3.1),
T1 cmo Total common memory overhead time of an algorithm on the Flex/32 (section 3.2),
T* cmp Computation cost o f an algorithm (section 3.1),
Time to do floating point operations with stride o f 1 on the Cray/2 (section 3.3),
Tfi Time to do floating point operations with stride o f  2 on the Cray/2 (section 3.3),
Tm i Time to do memory access operations with stride of 1 on the Cray/2 (section 3.3),
T * Time to do memory access operations with stride of 2 on the Cray/2 (section 3.3),
Tovr Overhead time o f an algorithm on the Flex/32 (section 3.2),
T‘ p Execution time o f an algorithm on the Flex/32 (section 3.2),
TA spn Spawning time o f p  processes on the Flex/32 (section 3.2),
T^n Total synchronization time o f an algorithm on the Flex/32 (section 3.2),
i? Velocity field (section 4.1),
3 , Velocity field in discrete approximation notation (section 4.1),
Zv Matrix o f  coefficients (section 4.1),
a Coefficient of tridiagonal equations (section 5.1),
0 Coefficient of tridiagonal equations (section 5.1),
7 Coefficient of tridiagonal equations (section 6.1),
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
8, Centered difference operator (section 4.1),
5* Centered second difference operator (section 5.1),
Ar Full time step (section 5.1),
£ Vorticity (section 4.1),
Xij Matrix of coefficients (section 4.1),
p , Average operator (section 4.1),
o  Spectral radius (section 4.1),
to Acceleration parameter (section 4.1),
1x1 Next integer greater than or equal to x,
x i i
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
CHAPTER ONE
Introduction
The objective o f the research reported here is to determine how well, or poorly, a 
number of numerical algorithms would map onto three different parallel architectures and to 
analyze the performance of these architectures using these algorithms.
It appears that single processor computers, whether scalar or vector, are nearing the 
ultimate limit o f their performance. Certainly, the circuit clock period will decrease and cir­
cuit density will increase in the future, but it appears unlikely that major and rapid gains are 
in prospect. The latest supercomputer, Cray/2, has a clock period of 4.1 nanoseconds, and 
the Cray/1, introduced about 10 years ago, had a clock period of 12.5 nanoseconds. Thus 
there has been only a factor o f 3 increase in clock speed in the last decade. A reduction of 
the clock period to one nanosecond seems possible soon. This development, while increas­
ing the processing rate, will impose rather stringent constraints on the packaging density 
and architecture of a single processor computer. An alternative way of achieving greater 
processing power is to use computers consisting of multiple processors.
There are a number o f important, unresolved questions concerning multiprocessor 
computers. Among these issues are: should they consist o f a few, rather powerful proces­
sors or many, simple processors, or something in between? Should the new computers be 
SIMD (Single Instruction Multiple Data) or MIMD (Multiple Instruction Multiple Data)? 
There is a natural expectation that the multiprocessors with a few, powerful processors will 
have an MIMD architecture and that the others will have SIMD architectures. Another
1
Reproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
2
issue is the communication among the processors; how the memory is connected to the pro­
cessors and how these processors are connected to each other. Should the interconnection 
scheme be a lattice, a bus, a switch, or something else? It is sterile at this point to argue 
what is the "best" combination o f number o f  processors and power per processor and what 
is the "best” interconnection scheme. Rather, carrying out experiments with existing mul­
tiprocessor computers would appear to be of greater value.
If the parallel computers can be used effectively, very large gains in  overall processing 
power are possible. There are three conditions which must be met if  an algorithm is to exe­
cute at high efficiency on a multiprocessor com puter (1) it must have many operations 
which are executable in parallel, (2) the amount of communication required between the 
processors must be small compared to the amount of calculations which are required, and 
(3) each processor must have roughly the same amount o f work to do. High performance 
will actually be obtained from parallel architectures when the algorithms executed map 
efficiently to the architecture. Efficient mapping must be based on a thorough and detailed 
understanding o f the resource requirements o f the algorithms and the ability o f a given 
architecture to deliver these resources. Mappings of algorithms onto parallel architectures is 
a problem of extensive dimensionality and great complexity.
Despite the fact that the most powerful existing parallel/vector computers can perform 
at peak rates o f several hundred MFLOPS (million floating point operations per second), the 
average processing rates of many codes are in the range o f 20 to 30 MFLOPS [9]. In part, 
this can be explained by invoking Amdahl’s law [2],
S  --------- ^ ------- ,
1 + / ( * - • - 1)
where S is the speedup, /  is the fraction of the code that can be parallelized/vectorized, and 
R is the parallel/vector to scalar speed ratio. Calculations with this formula show that
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
3
nearly all the code must be parallelized/vectorized in order to achieve a substantial speedup. 
It is also clear that increasing R  will only give a very modest improvement for a fixed / .
The problem o f measuring the performance o f parallel computers is a difficult one 
and, as yet, does not have a solid theoretical foundation [22]. Performance is highly depen­
dent on the architecture o f the multiprocessor, the computational algorithm, and the software 
environment, particularly the programming language used. One abstract approach would be 
to use a model of the concurrent processor and analyze the execution o f a particular algo­
rithm or class o f algorithms [12], [16]. The degree o f abstraction o f the model and the 
depth to which the algorithm is analyzed might very well influence the results. Another 
approach is to calculate an upper bound on performance by considering the time to perform 
a single arithmetic operation, together with the number o f processors. Such a measure is 
widely held to be unrealistic because it does not include any o f the omnipresent overhead.
A different approach is to program in some language and run a specific algorithm or 
class o f algorithms on a particular multiprocessor computer. Despite the fact that this 
would be a very specific experiment, this approach has some distinct advantages. Measure­
ment of the performance gives an objective measure of the cost of computation, although 
for a specific class of algorithms, expressed in a specific language and executed on a 
specific architecture. This kind of experiment also can yield subjective evidence as to how 
well the class o f algorithms fits the architecture, how difficult it was to program in the par­
ticular language, and so on. This approach has been considered by Grosch [17] in adapting 
a Navier-Stokes code to the 1CL-DAP (an SIMD machine with 4096 one-bit processors) 
and it will be adapted in this work.
There have been a substantial number of theoretical studies of the performance of 
algorithms on parallel computers but far fewer actual experimental studies [19], [21]. Until
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
4
recently, most o f  the experimental studies had been concentrated on the use o f  the Cyber 
200 and Cray series of computers. Beside the works by Gallopoulos [11] on the MPP and 
Bokhari [6] and Crockett [8] on the Hex/32, there have been a few actual studies in using 
the machines chosen for this research.
In this research we describe the implementation o f three numerical algorithms on three 
different parallel architectures and analyze the performance of these architectures using 
these algorithms. These algorithms are: a relaxation scheme for the solution of the 
Cauchy-Riemann equations, an ADI method for the solution of the diffusion equations, and 
a compact difference scheme for the solution of the incompressible, two-dimensional, time- 
dependent Navier-Stokes equations. Both the relaxation scheme and the ADI method are 
used in the solution o f  the Navier-Stokes equations. These algorithms are described in 
Gatski et al. [13] and Grosch [16]. The architectures chosen for this study are: the MPP, an 
SIMD machine with 16K serial one-bit processors; the Hex/32, an MIMD machine with 20 
processors based on 32-bit NSC 32032 microprocessor, and Cray/2, an MIMD machine 
with four powerful vector processors. The basic comparison is between SIMD instruction 
parallelism on the MPP, MIMD process parallelism on the Hex/32, and vectorization o f a 
serial code on the Cray/2.
The basic features of the three architectures and the programming languages used are 
described in Chapter two. Chapter three presents general performance models of these 
architectures. The implementation o f the three algorithms on these architectures is 
described in Chapters four through six; each chapter contains a brief description of the algo­
rithm and the implementation and performance analysis o f that algorithm on each machine. 
Finally, Chapter seven contains a comparison of performance o f these machines at the prob­
lem solving level and some concluding remarks.
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
CHAPTER TWO
The Architectures and Programming Languages
A brief description o f the three architectures and the programming languages used is 
given in this chapter in  order to clarify the way in which the three algorithms, described in 
the Chapters four through six, were adapted to these architectures. Both the hardware and 
software o f  each computer are described in order that one can appreciate some of the most 
important features o f each machine that contribute to its limitations and advantages.
2.1. The MPP architecture and MPP Pascal
The Massively Parallel Processor (MPP) is a large-scale SIMD processor developed 
by Goodyear Aerospace Co. for NASA Goddard Space Flight Center [5], [15]. The MPP 
is a back-end processor for a VAX-11/780 host, which supports its program development 
and I/O needs.
The block diagram o f the hardware elements o f the MPP is shown in Fig. 1. The 
Array Unit (ARU) consists of a square array o f 128 x 128 bit-serial Processing Elements 
(PE’s). Each PE has a local 1024 bit random access memory and is connected to its four 
nearest neighbors with programmable edge connections. Arithmetic in each PE is per­
forated in bit serial fashion using a serial-by-bit adder. The PE also contains a shift register 
which is used in multiplication and division. The ARU is controlled by the Array Control 
Unit (ACU). The ACU supervises the PE array processing, performs scalar arithmetic, and 
shifts data across the PE array. Items of data are sent to the ARU and taken from the ARU
5







Fig. 1. Block diagram of the MPP.
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
7
through the staging memory. The staging memory can buffer arrays o f data transmitted 
over this path and it can also reformat them. The MPP has a cycle time o f 100 nsec.
With 16,384 PE’s operating in parallel, the array has very high processing speed. 
Despite the bit-slice nature o f each PE, the floating-point speeds compare favorably with 
other high performance machines. For example, the processing rate for the addition o f  two 
128 x 128 arrays, each consisting o f 32 bit floating-point numbers, is 430 MFLOPS and for 
multiplication it is 216 MFLOPS.
Three programming languages are currently implemented on the MPP. They are MPP 
Pascal and two assembly languages, Main Control Language (MCL) and PE Array 
Language (PEARL). MPP Pascal [14] is a machine-dependent language which has evolved 
directly from the language Parallel Pascal defined by Reeves [24], Parallel Pascal is an 
extended version of the conventional serial Pascal programming language with a convenient 
syntax for specifying array operations.
MPP Pascal provides a new intrinsic type of data structure termed a parallel array. 
This type directs the compiler to store the array in the array memory. The last two dimen­
sions o f a parallel array must be 128 x 128. In addition to arithmetic operations and func­
tion evaluations on parallel arrays, MPP Pascal provides two fundamental classes o f opera­
tions on array data which are not available in conventional programming languages. The 
first class provides operations which reduce a parallel array to a scalar. These arithmetic 
reduction functions are: maximum, minimum, sum, and product. The second class provides 
operations which permute a parallel array. The permutation functions are: shift (end off 
shift), rotate (end around shift), and transpose. The extensions also provide a single parallel 
control statement, the where-do-otherwise statement. It is similar to an if-then-else state­
ment but with an array control variable. MPP Pascal also includes two system-defined
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
arrays, row-index and col-index, that give each PE its location in the array. Their major 
use is in masking out a particular set o f  PE ’s for a given operatioa M PP Pascal programs 
can execute on the host, on the MPP, o r on a combination o f  both machines. Through the 
use o f  compiler switches, the programmer specifies, at the procedure level, the system on 
which the code will execute.
MPP Pascal’s I/O system consists o f several different modules that handle each o f the 
VO  communication links on the MPP/VAX system. There are two techniques for control­
ling data transfer to and from the array memory through the staging memory. These are 
virtual channel I/O and bit-plane I/O. In virtual channel I/O, data exist in the stager in an 
"unknown" address; retrieving data from the stager depends on knowing how it was stored. 
Virtual channel VO  software views the staging memory as a permuting channel through 
which data move and are reformated. Bit-plane I/O treats the stager as a memory not as a 
permuting channel. Bit-plane I/O allows users to access data in the stager by variable 
name, simply by specifying a b it plane address. For this reason, the stager is configured to 
look like the array memory, i.e., a 16K array of a 128 x 128 bit planes (the staging memory 
size is 32 Mbytes).
2 2 . T he Flex/32 a rch itec tu re  and  C oncurren t F o rtran
The Flex/32 is an M IMD shared memory multiprocessor based on 32 bit National 
Semiconductor 32032 processor [10]. The Flex/32 cabinet can hold up to 20 o f  any combi­
nation of processor and memory cards. The results presented in this research were obtained 
using the 20 processor machine that is now installed at NASA Langley Research Center.
As shown in Fig. 2, there are ten local buses; each connects two processors. These 
local buses are connected together and to the common memory by a common bus. The 
2.25 Mbytes o f  the common memory is accessible to all processors. Each of processors 1,











Fig. 2. Block diagram of the Flex/32 architecture.
Reproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
10
2, and 3 contains 4 Mbytes of local memory. All other processors contain 1 Mbyte each. 
Each processor has a cycle time o f 100 nsec.
The UNIX operating system is resident in processors 1 and 2. These processors are 
also used for software development and for loading and booting the other processors. Pro­
cessors 3 through 20 run the Multicomputing Multitasking Operating System (MMOS) and 
are available for parallel processing.
The Flex/32 software provides two methods o f synchronizing communications 
between processes on separate processors. The first method is via the common memory, 
8192 locks are provided to lock variables in the common memory. The second method is a 
message sending technique; processes can communicate by sending and receiving messages.
The Flex/32 system software has special concurrent versions of C and Fortran 77. 
Concurrent C and Concurrent Fortran are extensions to C and Fortran 77 programming 
languages with all the standard definitions and features of the languages preserved. Both 
introduce new constructs for implementing parallel programs. Among these constructs are:
lock a shared variable can be locked if  it is not locked by any other process. The
locking process will then be able to access that variable while other processes 
attempting to lock it or access it will wait until the lock is released.
process define and start the execution of a code segment on a specified processor,
shared variables defined as shared are common data items located in the common
memory and are used by several processes and/or processors.
unlock release a locked variable.
Reproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
11
23 . The Cray/2 architecture and CFT/2 compiler
The Cray/2 is an MIMD supercomputer with four Central Processing Units (CPU), a 
foreground processor which controls I/O, and a central memory. The central memory has 
256 million 64 bit words organized in four quadrants o f 32 banks each. Each CPU has 
access to one quadrant during each clock cycle. Each CPU has an internal structure very 
similar to the Cray/1, see [19], with the addition of 16K words of local memory available 
for storage o f vector and scalar data. Within each CPU there are eight vector registers (64 
words each), eight scalar registers, special purpose registers (vector length and vector mask) 
and nine pipelined functional units, four o f which support vector processing. The clock 
cycle is 4.1 nanoseconds.
The Cray/2 runs the UNICOS operating system which is based on UNIX system V. 
The four processors can operate independently on separate jobs, multiprogramming, or con­
currently on a single job, multitasking.
The Cray/2 Fortran compiler (CFT/2) [7] attempts to vectorize the innermost DO 
loops. This is the only place where vectorization is attempted. This process is automatic, 
but certain loops can not be vectorized and programmer intervention is frequently required. 
Among the conditions preventing vectorization are I/O, CALL, IF, and GOTO statements; 
dependency involving an array; and ambiguous index expressions in the innermost DO 
loop. By default, the compiler generates ‘safe’ code; it assumes the worst about ambiguous 
situations. Some o f these situations can be resolved by inserting compiler directives, using 
system libraries, or rewriting a program segment.
A vector operation in Cray/2 is performed by loading a group o f up to 64 elements 
into a vector register and moving it, one element per clock period, to the functional unit 
performing the operation. Once it is loaded in the vector register, none o f the elements can
Reproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
12
be changed; all the elements are treated the same.
All Crays interleave words in  memory so that consecutive elements o f an array are 
stored in consecutive banks in memory. The bank cycle for the Cray/2 is 57 clock periods, 
i.e., accessing any bank in  memory creates a ‘bank busy’ condition for that bank for 57 
cycles. This problem is called ‘memory bank conflict’. In addition to the bank conflict, 
array accesses with even-numbered strides (stride means the memory increment between 
successive elements stored or fetched) will suffer quadrant delays, which are a consequence 
of the four CPU’s o f the Cray/2 taking turns accessing the four quadrants o f  memory. 
Even-numbered strides that are not divisible by four will result in more than 50% slowdown 
in data transfer rate and strides that are divisible by four will result in more than 75% slow­
down [4],
Reproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
CHAPTER THREE
Performance Models
In this chapter, general performance models for each o f the three computers, the MPP, 
R ex/32, and Cray/2, are presented. These models reflea  the architecture o f  the computers, 
as described in Chapter two. In subsequent chapters these models are applied to the perfor­
mance o f  a number of algorithms on the three computers.
3.1. The MPP
The following model is based on counting the number of arithmetic and data transfer 
operations of an algorithm and multiplying these numbers by the measured cost of each 
operation in order to estimate the execution time of that algorithm. Only operations on 
parallel arrays are considered here.
The execution time of an algorithm on the MPP, T, can be modeled as follows:
(3.1)
Tcmp = tc (Na Ca + Nm Cm + Nd CA (3-2)





R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
14
tc Machine cycle time = 100 nanoseconds,
Na Number o f additions,
Nm Number o f  multiplications.
Nd Number o f divisions,
N* Number o f shift operations,
N„ Number o f steps involved in all shift operations,
Ca Number o f cycles to add two arrays o f 32 bit floating point numbers.
cn Number o f cycles to multiply two arrays of 32 bit floating point numbers,
Cd Number o f cycles to divide two arrays of 32 bit floating point numbers.
Csk Startup cost (in cycles) o f shifting an array of 32 bit floating point numbers,
Cs, Number o f cycles to perform a one step shift within a shift operation.
Table 1 contains the estimated values o f Ca, Cn, Csk, and C„ for two sets of primitives, 
IBM format and VAX form at The IBM format primitives are provided by the Goodyear 
Aerospace Co. and used mainly in the assembly language programs. The IBM format prim­
itives were unavailable for MPP Pascal programs at the time this research was conducted. 
The peak performance rates o f the arithmetic operations are computed based on the IBM 
format primitives. The VAX format primitives called by the MPP Pascal compiler version 
2 were in use until July 1987. The ones called by the MPP Pascal compiler version 3 are 
currently used and supposed to be improved. Both o f these VAX format primitives were 
programmed at NASA Goddard. The values corresponding to the VAX format were 
obtained using a simple test problem; the execution time of each operation was measured by 
using a loop of length 1000. Note that the VAX format primitives take considerably longer 
than the IBM format primitives to perform an operation. The ratio of execution times
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
15
ranges from 1.07 for multiplication to 2.53 for addition.
Table 1. Measured execution times (in machine cycles) of the elementary operations on the 
MPP.
Operation IBM format primitives VAX format primitives
version 2  compiler version 3 compiler
addition 381 824 965
multiplication 758 877 811
division 1031 1130 1225
one step shift 96 166 168
k step shift 64 + 32 k 134 + 32 k 136 + 32 k
3 2 . The Flex/32
The following model is based on estimating the values of various overheads resulting 
from running an algorithm on more than one processor. Also, the time to do the real com­
putation on each processor is estimated.
The execution time o f an algorithm on p  processors o f the Flex/32, Tp, can be modeled 
as follows:
Tp = Tcmp + Tmn (3.4)
where Tcmp is the computation time and T0„ is the overhead time. Let / w be a load distribu­
tion factor where f u  = 1 if  the load is distributed evenly between the processors and f u  > 1 if 
at least one processor has less work to do than the other processors. Then the computation 
time on p  processors can be computed by
Tcmp =fld T\ / p, (3.5)
where Tt is the computation time using a single processor.
The overhead time can be modeled by:
T0,r = T ^  + Tcnw + T ^  (3.6)
where
Reproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
16
7 ^ . Spawning time o f p  processes,
Tcmo Total common memory overhead time, 
Tm  Total synchronization time.
These times can be estimated as follows:
Tlpn = p  V ,, (3.7)
T̂ yii = P^lcktldo (3-8)
Tcmo = 7'om + Tclm + Tcn, (3.9)
Tcmo = * k ^ i p )  tcma, (3.10)
Tclm = n Kim ( fbc(P) fcmd + llma ). (3.11)
Tcld = nkdd (fbcip) tcmo ~ llma ). (3.12)
where
n Length o f the vector,
Tcmo Total common memory access time,
Telm Total time required for copying shared vectors to local memory,
Tcld Total time difference between storing vectors in common and local memories; i.e..
Overhead time o f storing vectors in common memory instead o f local memory,
tsp* Time to spawn one process; a reasonable value is 13 msec,
v£k Total time to lock and unlock a shared variable; a reasonable value is 47 josec,
tcmo Time to access an element of a vector in common memory; a reasonable value is 6
psec,
tima Time to access an element of a vector in local memory; a reasonable value is 5 psec,
kte Number o f times a shared variable is locked and unlocked for each process.
Reproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
17
k ^ ,  Number o f times a shared vector is referenced,
kclm Number of times a shared vector is coped  to local memory,
kcU Number o f times a vector is stored in common memory instead o f  local memory,
ftJp )  Busses contention factor. This contention results from having more than one proces­
sor trying to access the common memory at the same time; it is a function o f p.
The values o f tv „ t&, rcma, and are estimated based on timing experiments performed by 
Bokhari [6 ] and Crockett [8]. It is assumed that all common memory access operations are 
performed on vectors o f length n.
3 3 . The C ray/2
In order to estimate the cost of arithmetic and memory access operations on the 
Cray/2, the following timing values are used:
Clock Period (CP) = 4.1 nanoseconds,
Length of data path between the main memory and the registers, L„ = 56 CPs,
Length of each floating point functional unit, Lf  = 23 CPs,
Data transfer rate with stride of 1 through main memory, Rx = 1 CP/word,
Data transfer rate with stride of 2 through main memory, R2 = 2 CPs/word.
A lower bound on the values of R { and R2 is assumed here. Competition for memory banks 
from other processors causes a lower transfer rate and hence increased values o f  R-i and R2. 
The actual values are difficult to estimate.
Based on the fact that Cray vector operations are "stripmined" in sections of 64 ele­
ments, the time required to perform arithmetic and memory access operations on vectors of 
length Lvcr can be modeled as follows:
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
18
= At + Lycr) Nfl CP, (3.13)
(3.14)
Tmi = ( ~  Lm + R :L m )N m lCP, (3.15)
(3.16)
where
Ctl Next integer greater than or equal to x,
TA Time to perform floating point operations on vectors with stride o f  1,
Tp Time to perform floating point operations on vectors with stride o f  2,
Tn i Time to perform main memory access operations on vectors with stride of 1,
Tra Time to perform main memory access operations on vectors with stride of 2,
Nfl Number o f floating point operations on vectors with stride of 1,
Np  Number of floating point operations on vectors with stride of 2,
Nnl Number of main memory access operations on vectors with stride o f 1,
Nm2 Number of main memory access operations on vectors with stride o f 2.
The divide operation on the Cray/2 is performed by using the reciprocal approximation 
look-up table and the floating point vector multiply unit (Newton approximation method). 
It is assumed in this model that each division takes three times the multiplication time.
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
CHAPTER FOUR
Solving the Cauchy-Riemann Equations
This chapter is the first of three chapters dealing with implementing numerical algo­
rithms on the three architectures. The algorithm considered in this chapter is a relaxation 
scheme for the solution of the Cauchy-Riemann equations. The numerical method and the 
adaptation o f the algorithm to parallel computers are described in section 4.1. The imple­
mentation o f the algorithm on the MPP, Flex/32, and Cray/2 is described in sections 4.2 
through 4.4; each section contains details o f the implementation, the results, and the appli­
cation of the performance model o f the machine, developed in Chapter three, to this algo­
rithm.
4.1. The num erical m ethod
Consider the following differential equations, the Cauchy-Riemann equations:
The numerical method used to approximate these equations is based on compact 
differencing schemes developed by Rose [25] and Philips and Rose [23]. The method is 
briefly described here for the sake of completeness.
Consider approximating the solution of equations (4.1) and (4.2) in a rectangular 






R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
20
the domain into rectangular cells. This array of cells can be nonuniform. A typical cell
and the location of the variables on that cell are shown in E g . 3. A variable associated
with the side o f a cell is to be interpreted as the average of that variable over the side of the 
cell and one associated with the center o f a cell is an average over the cell. The centered 
difference and average operators are defined on a cell by:
5J ] . . «  ( > (4.3)
M /,v ,  (4.4)
Suppose that is prescribed. Then equations (4.1) to (4.2) are approximated
by,
Sx^h-1/2^1/2 +  &yV MI2j*-\/2 ~  0- (4 -5)
&xV h-m j+ lf2 ~  &yUi+l/2J+l/2 =  Crt-l/2J+l/2’ (4 -6)
M* î'+1/2^+1/2 ~  = (4 -7)
M-i^H-l/2,/+l/2 _  =  0- (4 -8)
The adaptation o f this algorithm to different parallel architectures can be simplified by 
the introduction o f box variables to represent U. The center of a cell is at (i+V2j+U2). The
box variables, P, are defined at the comers o f the cells, as shown in E g . 3. They are
related to €  by:
(4.9)
( ^  +  ^  ) (4.10)u  t+\/2j 2  ’
and similarly for 1/fl/2 and
It is easy to see that equations (4.7) and (4.8) are satisfied identically for any set of 
box variables. For the cell (/+1/2J+1/2), equations (4.5) and (4.6) become,
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
21
Fig. 3. Typical computational cell and the data associated with i t
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
22




P ~  ( ’ ^ f l ^-1 ’ ^ij+l ) »
Z=(0,Zij)T,
?ij = ( p ij • Qij )■
Zy =  2(Ay)£,i+iaj+in'
,  _  (Ay).-
‘v (A ^)/
Equation (4.11) is solved by an iteration scheme which was originally proposed by 
Kaczmarz [20] and was generalized by Tanabe [26]. If P(i) is the value after the k ’th itera­
tion, then the residual after the k ’th iteration, R(k\  is given by:
R (k) _  ^ k ) _  z
The next iteration is
(4.12)
P{M) =  P®  -  coA7'(A4r )_1/?(t), (4.13)
where co is an acceleration parameter. In detail, the kernel o f  the relaxation process has the
following equations:
RfJ = h j  ( P $ * i  + P %  -  P% i -  t i? )  +  + 2 & i "  ~  2 ® (4-14)
= hj  (fi^Ui + Q&J -  2 $ i  -  2 ^  -  + ^ + Pf) ~ z ir (4-15)
p W )  =  j m  + CiJ (k u  R<$ -  S $ ) ,  (4.16)
e ^ +1) =  Q?) +  CtJ (/?«  + Xij S \f) ,  (4.17)
^  -  ClV (XlV +  S<$, (4-18)
=  2 ^ 1  + C,v  (*<? -  ^  tf? ) . (4-19)







R  — ( Rij  , S i j ),
This relaxation scheme is equivalent to an SOR method. On a serial computer the array of 
computational cells is swept over, applying equation (4.13) to each, until the maximum resi­
dual is reduced to the desired level.
The key to the adaptation of this relaxation scheme to parallel computers is the reali-
fact is utilized on parallel computers by using the concept of reordering to achieve parallel­
ism [1], [27]; operations are reordered in order to increase the percentage of the computa­
tion that can be done in parallel. As shown in Fig. 4, the computational cells are divided 
into four sets of disjoint cells so that the cells o f each set can be processed in parallel. A 
particular f  which lies on the comer of a cell (see Fig. 4) is changed during the relaxation 
of set R first, then o f set B, then of set 0 ,  and finally set G. In each o f these cases, ^  lies 
at a different comer of the cell being relaxed. It is therefore clear that the cell iteration for 
the box variables is a four "color" scheme. Also, different linear combinations of the resi­
duals are used to update each f  and all of the P's are updated in each step. Thus the four 
steps are necessary for a complete relaxation sweep. This is due to the fact that this is a 
multicolor cell relaxation scheme in contrast to multicolor point relaxation scheme in which 
only a fraction o f the values are updated at each stage.
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
zation that each f  is updated four times in a sequential sweep over the array o f cells. This
24
3 u______ 5________________M- 2  H - l  M
.
R G R G •  •  • G R
B 0 B 0 0 B
R G R G G R




B 0 B 0 0 B
R G R G
! G R
Fig. 4. Computational domain and the four color cells 
assuming that M and N are even numbers.
Reproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
25
In brief, the relaxation algorithm is implemented by computing the residuals, R(k), 
using equation (4.12) for each set o f cells, followed by updating the P's using equation
(4.13). This sequence must be completed four times in order to complete a sweep. Finally, 
the maximum residual is computed and tested against the convergence tolerance. The 
whole process is repeated until the iteration procedure converges.
A test problem, based on predefined values o f C and boundary values o f  u and v, is 
used to study the behavior o f the numerical method. Equations (4.12) and (4.13) are solved 
with given Ch-i/2j+i/2 and the value o f one o f the box variables on each side prescribed. In 
order to illustrate the behavior o f this relaxation scheme the variation of the spectral radius, 
a , with the acceleration parameter is shown in Fig. 5. This shows measured values o f cr for 
three cases in which the number of grid points in the computational domain was increased 
from 32 x 32 to 128 x 128.
4 2 . Im plem entation on the  M PP
The computational cells, as described in section 4.1, are mapped onto the array so that 
the comers of the cells correspond to the processors. The storage pattern used on the MPP 
is to store C+i/2>fi/2. and in the memory o f processor (ij) . Thus with a 128 x 128 
array o f processors there is an array of 127 x 127 cells.
The relaxation scheme has been implemented on the MPP for problems which fit on 
the array, 128 x 128 grid points, and for problems which are larger than the array, 128 x 255 
grid points.
In detail, the MPP algorithm for a 128 x 128 problem is implemented as follows:
(1) All of the initialization is done on the VAX. This includes computing the matrices Ẑ -,
and C,j, calculating the boundary conditions, and initialization o f the box variables







0 . 9 2
1.9< ai .oI1.5 1-61 . 2 1.51\i
CO
Fig. 5. Spectral radius as a function o f the acceleration parameter.
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
27
(fij> Qij matrices) by setting all values to zero except those determined by the boundary 
values.
(2) The five matrices, computed in step (1), are moved to the staging memory, then to the 
array using the bit-plane I/O technique.
(3) The relaxation process is carried out entirely on the MPP. A set o f  temporary matrices 
are generated to store the P's for each color. The residuals are computed and the P's o f the 
same color are updated; each o f the four comers of the computational cells is updated by 
masking out the other comers. This is easily done using the where statement and boolean 
masks. The boundary values are also masked. The relaxation sequence is implemented 
four times to complete a sweep. This is followed by the computation o f the maximum resi­
dual. This step is repeated until the process converges; that is the maximum residual is 
reduced to the desired value. Note that in updating the P's for each color only one forth of 
the processors do useful work.
(4) Finally, the box variables are moved back to the staging memory and then to the host 
using the bit-plane I/O method.
The relaxation procedure requires 22 parallel arrays of floating point numbers, all but 
5 of which are temporary, and 19 parallel arrays of boolean variables. Each floating point 
array uses 32 bits and each boolean array uses 1 bit o f the array memory. Together these 
arrays use 723 bits o f the 1024 bit PE memory. Most o f the remaining bits hold system 
functions and primitives. If one solves problems which are larger than 128 x 128, and thus 
do not "fit" on the array unit, additional memory is required. A total o f 5 floating point 
arrays o f data must be stored for each 128 x 128 sheet. Thus 160 bits of additional memory 
are needed for each sheet This additional memory is not available in the PE memory so 
we must use the staging memory as a backup and move the data arrays in and out of the
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
28
array memory when we deal with 128 x 255 and larger problems.
A 128 x 255 problem is implemented on the MPP as follows:
(1) Initialization o f the whole domain is performed on the VAX.
(2) The domain is decomposed into two regions, left and righ t This means that the data is 
divided into two sheets; each sheet contains five matrices, g,v, Zij, Q j. The two 
sheets are moved to the staging memory using bit-plane I/O method.
(3) The relaxation process is implemented on the MPP for each region separately. For each 
region, the following steps are performed: (a) A sheet o f  data is moved from the staging 
memory to the array, (b) the interface points o f the box variables are updated from previous 
iteration o f the other region, (c) the relaxation sequence and computation of the maximum 
residual are implemented as for the 128 x 128 problem, (d) the box variables are updated
and boundary conditions are reset on all sides of the region except the interface side, (e) the
interface points are saved in temporary arrays to be used for the next iteration of the other 
region, and (f) the box variables are moved back to the staging memory. These steps are 
repeated until convergence is achieved on both regions.
(4) After convergence the box variables o f both regions are moved back to the host using 
bit-plane I/O method.
The host program as well as the MPP relaxation procedure arc written in  MPP Pascal 
and run through the MPP Pascal compiler version 2. Bit-plane I/O procedures are MCL 
routines, and are called from the MPP. A Fortran subroutine is used to initialize the buffer 
on the VAX for the parallel array transfers. These routines, declared as external pro­
cedures, are compiled as separate units and linked with the main program unit for execu­
tion, since, unlike standard Pascal, MPP Pascal provides the capability o f compiling rou­
tines separately.
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
29
The relaxation algorithm mapped well onto the MPP because it can be implemented 
almost entirely with matrix operations. There are no vector operations and only two scalar 
operations per iteratioa The amount o f time spent on data transfers is quite small because 
nearly all data transfers are only between nearest neighbors. This type o f transfer is gen­
erally inexpensive in machines like the MPP. The local nature of the data transfers is due 
to the fact that the differencing scheme is a compact second order scheme.
Table 2 contains the execution time and the processing rate for one iteration for a 
128 x 128 and a 128 x 255 problem. The amount o f time spent in the host program is not 
measured, because there is only a small amount of computation involved in it. The pro­
cessing rate is determined by taking the ratio o f the number o f effective arithmetic opera­
tions to the total execution time o f the relaxation routine. In counting the number of 
effective arithmetic operations, only pure arithmetic operations, addition and multiplication, 
are counted. Data transfers as well as computing the absolute and the maximum values are 
not counted as floating point operations.
Table 2. Measured execution times for one iteration and processing rates for the relaxation 
algorithm on the MPP.
Problem size (grid points) Execution time (msec) Processing rate (MFLOPS)
128 x  128 13.56 175
128 x  255 50.55 94
If the addition and the multiplication operations are counted separately and the max­
imum processing rates are considered, the maximum possible rate for this 128 x 128 prob­
lem will be 365 MFLOPS. The measured processing rate is, therefore, about 48% o f the 
maximum possible rate.
In order to use the model developed in section 3.1 for the cost of implementing the 
relaxation algorithm on the MPP, the arithmetic and data transfer operations o f the algo­
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
30
rithm are counted. The values o f Na, Nm, A/*, N„ for one iteration are 119, 26, 63, 84 
respectively. The computation cost (equation (3.2)) and the communication cost (equation
(3.3)) o f the relaxation algorithm are listed in Table 3 using IBM format primitives and 
VAX format primitives version 2 compiler. We used the VAX format primitives called by 
the MPP Pascal compiler version 2 in our implementation and their usage in the model 
gives a reasonable measure of the performance o f the algorithm on the MPP. Based on the 
measured values o f these VAX format primitives, the computation cost contributes about 
89% o f the total cost and the communication cost contributes about 8 % of the total cost 
The costs of computing the absolute and the maximum values as well as performing the 
two scalar operations are not included in this model. However, it is estimated that these 
costs represent less than 3% o f the total cost and these operations may overlap with the 
array operations. Note that this algorithm achieves only about 50% o f the peak perfor­
mance rate o f the MPP because o f the relative inefficiency o f the VAX format primitives.
Table 3. Estimated times (in milliseconds) o f the relaxation algorithm on the MPP.
Primitives Computation Communication Total estimated Measured
time time time time
IBM format 6.50 0.67 7.17 -
VAX format 12.09 1.11 13.20 13.56
For the 128 x 255 problem there is an overhead for transferring the data to and from 
the staging memory. A total of seven data swaps between the stager and the array are 
required for each sheet. This swapping adds 11.7 msec to the time for each iteration; yield­
ing an I/O overhead of 8 6 %. This reduces the efficiency o f the MPP for oversize problems.
The code can be easily expanded to larger problems using the same numerical algo­
rithm. It is expected that the execution times will be multiples o f that for the 128 x 255 
problem.
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
31
4.3. Implementation on the Flex/32
The four color cell relaxation scheme, as described in section 4.1, was implemented 
on the Flex/32 using 64 x  64 cells (65 x 65 grid points) and 128 x 128 cells (129 x  129 grid 
points). The main program as well as the relaxation subroutine are written in  Concurrent 
Fortran.
One obvious way to partition the relaxation routine is by color using four processors; 
each processor handles one color. Although the method is implemented easily, it has a 
slow convergence rate and no gain is achieved. This is because all o f the processors are 
operating on the same initial data every iteration yielding a relaxation method which is 
equivalent to the Jacobi method. This method has a slow convergence rate compared to the 
SOR method.
In order to implement the algorithm in parallel, the domain is decomposed into 1, 2, 
4, 8 , or 16 strips; each strip contains 64 x 64, 64 x 32, 64 x 16, 64 x 8, or 64 x 4 cells for the 
65 x 65 problem. Each strip is given to a process. At the beginning the main program, 
which is process ‘m ain’ running on processor 3, creates and starts (spawns) the execution of 
the processes on specified processors with each process assigned to a separate processor. 
Processors 4 to 19 are used for parallel processing.
Data is distributed between the common and local memories, with the intention of 
doing most of the work locally. The matrices Ptr  Qij, Zij, \ j ,  for each strip are stored in 
the local memory. These matrices are initialized in parallel by all processes. The values of 
the boundary points, interface points, and error flags are stored in the common memory. 
The boundary points are computed for the whole domain in ‘main’, and used by all 
processes.
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
32
The relaxation process for each strip is performed locally by: fetching the variables 
from the local memory, computing the residuals, then updating the variables, and finally 
storing them back in the local memory. After doing these steps four times, the maximum 
residual is computed and tested against the convergence tolerance. If  the iteration has con­
verged the convergence flag of that process is set to unity. After relaxing each set o f cells 
(each color), each process exchanges the values o f the interface points with its two neigh­
bors through the common memory. A set of flags are used here to ensure that the updated 
values of the interface points are used for the next color.
Synchronization is accomplished by setting a variable, ‘countr’, in the common 
memory and assigning a lock to i t  At the beginning o f each iteration, ‘countr’ is set to 
zero by process 1. This is a signal to the processes to proceed. When each process com­
pletes a sweep it signals back to process 1 by incrementing ‘countr’. Finally process 1 tests 
for global convergence and resets ‘countr’ if  the iteration has not converged.
The performance of the parallel algorithm on the Flex/32 is evaluated by using the 
speedup and efficiency measures. The speedup is defined as the ratio of the time to solve 
the problem using one processor to the time to solve the same problem using p processors. 
Knowing the speedup, we determined the efficiency by taking the ratio o f the speedup using 
p processors to p. Thus in the ideal situation the speedup is p and the efficiency is unity. 
The speedups and efficiencies as functions o f the number of processors o f  both problems 
using two types of locks, MMOS and Local, are shown in Tables 4 and 5. The execution 
time for one iteration and the processing rate for both problems using 16 processors and 
local locks are listed in Table 6 . These results were obtained using a tim er with a 20 msec 
resolution.
Reproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
33
Table 4. Speedup and efficiency o f  the relaxation algorithm on the Flex/32 using the 
MMOS locks.
Number o f 
processors
65 x  65 grid points 129 x  129 grid points


























Table 5. Speedup and efficiency of the relaxation algorithm on the Flex/32 using the Local 
locks.
Number o f 65 x  65 grid points 129 x  129 grid points
processors speedup efficiency speedup efficiency
1 1.000 1.000 1.000 1.000
2 1.991 0.996 1.978 0.989
4 3.955 0.989 3.941 0.985
8 7.825 0.978 7.834 0.979
16 15.294 0.956 15.437 0.965
Table 6. Measured execution times for one iteration and processing rates for the relaxation 
algorithm using 16 processors of the Flex/32.
Problem size (grid points) Execution time (msec) Processing rate (MFLOPS)
6 5 x 6 5 246.53 1.10
129 x  129 964.85 1.12
The results shown in Table 4 were obtained using the MMOS locks and the results 
shown in Table 5 were obtained using the Local locks. The MMOS locks, used to lock and 
unlock variables in the common memory, are provided by Flexible Computer Co. while the 
Local locks were programmed at NASA LaRC. The Local locks are based on the SBITI 
instruction o f the NSC 32032 microprocessor while the MMOS locks are based on some 
expensive MMOS system calls. As shown in Tables 4 and 5, the Local locks are very 
efficient compared to the MMOS locks. For the 65 x 65 problem using 16 processors, for 
example, the speedup is 10.977 using the MMOS locks while it is 15.294 when using the 
Local locks. This is an increase o f  about 39%, and shows the impact o f the design of
Reproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
34
parallel processing primitives on the performance o f the parallel machines. It was found 
that when the MMOS locks were used the synchronization cost o f the algorithm represents 
more than 70% o f the overhead cost for large number o f processors.
In order to apply the model developed in section 3.2, equations (3.4) to (3.12), in 
estimating the computation and overhead times o f the relaxation algorithm, the parameters 
fu , kkk, and were computed. Since the load is distributed evenly between the proces­
sors, the load distribution factor is unity. The values o f kkk and kCKa per iteration are one 
and eight, respectively. Using these values, it was found that the overhead time represents 
at most 4% o f the execution time o f the algorithm. The synchronization time was 
insignificant because the routines that provide the locking mechanism are very efficient and 
overlap with the memory access. The spawning time has a minor impact because the 
processes are spawned only once at the beginning o f the program. The busses contention 
factor is very small even for a large number of processors. The memory access cost dom­
inates the overhead cost.
As the number o f processors in use is increased from 2 to 16 the computation cost per 
processor is decreased while the overhead cost is increased. This causes a degradation in 
the efficiency of the algorithm. Increasing the number of cells causes an increase by the 
same ratio in the computation cost; an increase by a smaller ratio in the memory access 
cost; and no change in the spawning and synchronization costs. This resulted in a slight 
improvement on the performance o f the algorithm for the 129 x 129 problem using a large 
number o f processors.
4.4. Implementation on the Cray/2
The relaxation algorithm, described in section 4.1, was implemented on the Cray/2 for 
computational domains o f sizes ranging from 64 x 64 to 1024 x 1024 grid points. The code.
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
35
in each case, is executed as a single job by one o f the processors; multitasking was not 
attempted.
The algorithm is mapped onto the architecture so that columns o f each color of the 
computational cells arc processed separately. This mapping removes any recursion because 
each o f these columns contains a disjoint set o f  cells. The implementation was quite sim­
ple. The code, using the reordered form of the algorithm and written in standard Fortran, 
was run through the CFT/2 compiler. Not all inner loops were vectorized. Two steps were 
taken to ensure vectorization o f all inner loops o f  the code. First, CFT/2 was told to ignore 
apparent vector dependencies by using the compiler directive, IVDEP. Second, a segment 
of the code that computes the maximum value o f  an array was rewritten in order to be used 
with an optimized library routine, ISMAX. This resulted in the vectorization of all o f the 
inner loops of the code.
The use o f the main memory can be reduced by using scalar temporaries, instead of 
array temporaries, within inner DO loops. When scalar temporaries are used the CFT/2 
compiler stores these variables in the local, rather than the main memory. This reduces 
memory conflicts and speeds up the calculation. The residuals, equations (4.14) and (4.15), 
are stored in scalar temporaries.
The measured serial rate o f the 64 x 64 problem is 18 MFLOPS. This rate was 
obtained when running a serial version of the algorithm on one processor. With this ver­
sion o f the code the compiler was not able to vectorize the inner loops in the relaxation ker­
nel. However, some o f the loops in the section where the maximum residual is computed 
were vectorized. The 18 MFLOPS rate is thus a measure o f the processing rate o f  a serial 
code; i.e., a code that runs on a serial machine. This result shows that existing codes, writ­
ten for serial machines, may produce modest performance when they are transferred to
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
36
Cray/2.
Table 7 contains the execution time and the processing rate for one iteration using the 
vectorized code when the domain size is varied from 64 x 64 through 1024 x  1024 grid 
points. Only one processor of the Cray/2 is used. There is up to 20% offset on the results 
depending on the memory traffic and the number of the active processes on the system. 
The processing rate is computed by counting the additions and multiplications only, as in 
section (4.3). As the number of grid points is increased, the processing rate is slightly 
improved. This is due to the fact that the Cray/2 runs more efficiently on longer vectors; 
the number o f instructions required to complete the loops is relatively smaller for long vec­
tors [4],
Table 7. Measured execution times for one iteration and processing rates for the relaxation 
algorithm on one processor of the Cray/2.
Problem size (grid points) Execution time (msec) Processing rate (MFLOPS)
64 x  64 2.62 100
128 x  128 10.47 102
256 x  256 41.83 103
5 1 2 x 5 1 2 164.40 105
1024 x  1024 639.49 108
The major problem in implementing the relaxation algorithm on the Cray/2 was found 
to be accessing the main memory. The Cray/2 is a memory bound machine and one of the 
general rules for writing efficient programs for the Cray/2 is to maximize the number of 
arithmetic operations per memory access. If this is done the compiler can optimize the use 
of the functional units, vector registers, and the local memory, thus minimizing the use of 
the main memory. It has been our experience that a main memory access operation costs at 
least three times a floating point operation. Another related problem is using a memory 
stride o f 2. This is inherent in the vectorized relaxation algorithm and cannot be avoided. 
A stride o f 2 causes, as described before, a slowdown o f more than 50% in data transfer
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
37
rate and about a 30% slowdown in the overall algorithm processing rate.
In general the CFT/2 compiler will, if  possible, overlap the operation o f the addition 
and multiplication pipeline units. However if there are, as is the case for this algorithm, 
more additions than multiplications, then only the addition pipeline can be active a portion 
o f the time. Thus one can estimate that most o f the time both o f these pipelines are active 
and a portion o f the time only the addition pipeline is active. These proportions can be 
estimated by counting the number o f additions and multiplications o f the algorithm.
The maximum processing rate of one pipeline in a single processor is about 244 
MFLOPS, ignoring the startup time. From this we can compute, using the relative propor­
tions o f the time when one or two pipelines are active, the maximum possible processing 
rate for this algorithm on one processor. It is approximately 350 MFLOPS. The measured 
processing rate on a single processor (Table 7) is about 27% of the peak rate o f one proces­
sor. If one includes the startup time for a pipeline, the peak processing rate drops to about 
257 MFLOPS. Thus the measured rate is about 40% of the possible peak rate.
The relaxation algorithm has two main costs, the computation cost and the memory 
access cost. These costs are estimated using equations (3.13) to (3.16). The values of N^, 
Np., N„,i, and for the relaxation routine are 15 Lxr additions and 2 Lm  multiplications, 
62 Lxr additions and 36 Lvcr multiplications, 12 Lxr, and 46 LKr respectively; where LKT is 
equal to the number o f cells in each dimension. Table 8 contains the results o f applying 
these values to equations (3.13) through (3.16) for different problem sizes. Also, the meas­
ured times are included in the table for comparison. The main memory access cost 
represents about 55% of the total estimated cost and about 70% o f the measured value 
although a lower bound on the data transfer rates is considered. The total estimated costs 
exceed the measured values because o f  the overlapping between memory access and arith­
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
38
metic operations. Most o f the multiplication operations are running in parallel with other 
operations so that the multiplication cost has minor impact on the overall cost It is 
estimated that about one half of the addition operations can be issued while the system is 
fetching operands from the main memory.
Table 8. Estimated and measured execution times (in milliseconds) o f the relaxation algo­













6 4 x  64 1.78 1.21 0.55 3.54 2.62
128 x  128 7.22 4.14 1.80 13.16 10.47
256 x  256 29.05 16.69 7.26 53.00 41.83
5 1 2 x 5 1 2 116.53 66.98 29.12 212.63 164.40
1024 x  1024 466.83 268.38 116.68 851.89 639.49
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
CHAPTER FIVE
Solving the Diffusion Equation
This chapter presents the implementation o f  an ADI method for solving the diffusion 
equation on the MPP, Flex/32, and Cray/2. The Gaussian elimination algorithm is used to 
solve a set o f tridiagonal systems on the Flex/32 and Cray/2 while the cyclic elimination 
algorithm is used to solve these systems on the MPP. The numerical method and both 
algorithms are described in section 5.1. The implementation on each machine including the 
application o f  the performance models, developed in Chapter three, is described in sections
5.2 through 5.4. No reference to Chapter four is made in this chapter.
5.1. The num erical m ethod
Consider the diffusion equation,
# ■  = V2*, (5.1)
ot
to be solved in 0 < t < T and the square region R: 0 < x < 1, 0 < y < 1 with boundary 
values at u{0,y), u(l,y), u(x,0), u(x,l). Consider replacing R by a net whose mesh points are 
denoted by x, = (:-l)Ax, y, = 0-1)Ay where * = 1, 2 , . . . ,  jV+1; j  = 1 .2 ............ M+1. The numer­
ical method used to approximate equation (5.1) is based on an Alternating Direction Impli­
cit (ADI) method for solving parabolic equations [3]. This method consists o f two half 
steps to advance the solution one full step in time. Let Ar be the full time step and apply 
the forward difference operator to equation (5.1) for the time derivative, giving
39
Reproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
40
U $v2- U b  = Y ( & U $ V2 + % U & . (5.2)
= l U Z ^  + ^ U Z 1), (5.3)
where only one of the space derivatives is evaluated at the advanced time level; this restric­
tion is imposed in order to produce a set o f tridiagonal equations. Define the centered 
second difference operator by:
= (.Um j - X J ij + Uh J)
J (Ax)2
Applying the centered second difference operator to equations (5.2) and (5.3) for the space 
derivatives, we get
a  U t l f  ~ (l+2a) U *m  + a  C/ft'j2 = F #  (5.5)
P £ # ,  -  (l+2p) U *' + p U $ l = GlV, (5.6)
where
Fij = -P Uti-1 -  ( l - 2p) Ulj -  p UtM , (5.7)
Gij = -a  U tl?  ~  ( l-2 a )  t/£ I/2 -  a  (5.8)
a  = A t  I 2(Ax)2, (5.9)
p = A t  / 2(Ay)2, (5.10)
for i = 1 ,2  N+l; j  = 1, 2 , . . .  , M+l. Equation (5.5) represents a set of M+1 indepen­
dent tridiagonal systems (one for each vertical line o f the net) each o f size N+1. Similarly, 
equation (5.6) represents a set of N+l independent tridiagonal systems (one for each hor­
izontal line o f the net) each of size M+l. Equation (5.5) is solved for the set {U?j112} using 
the values of U*j while equation (5.6) is solved for the set {GJ?1} using the computed values
of U *m . In order to solve these two sets of equations, the following boundary conditions
are incorporated:
F i j  = U ?ja = t/(0 j,), F m i  = U&Tj = fo r ;  = 1 , 2  M+l, and
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
41
Gu  s  Wu1 = tffeO), Gij^+i — t/ # +1 = U(xit 1), for f = 1 ,2  Af+l.
In brief, the ADI method is implemented in two steps. In the first step, the set {F^} 
is computed, using equation (5.7), followed by solving Af-1 tridiagonal systems (excluding 
the boundary systems), using equation (5.5). With £/*+1/2 known, the set {G,v} is computed, 
using equation (5.8), followed by solving N- 1 tridiagonal systems (excluding the boundary 
systems), using equation (5.6). These two steps are needed to advance the solution one full 
time step.
The main issue in implementing the ADI method on a parallel computer is choosing 
an efficient algorithm for the solution of tridiagonal systems. The selection o f the algorithm 
depends on the amount of hardware parallelism available on the computer, storage require­
ments, and some other factors. Two algorithms are considered here: Gaussian elimination 
and cyclic elimination. Although these algorithms are described in the literature (see [19] 
for details), they are briefly described here for the sake o f completeness. Only the standard 
forms o f these algorithms are considered here.
Consider solving a single tridiagonal system o f n equations,
a,- + bi Xi +CiXM = dj, (5.11)
for i = 2, 3...........n-1, and the boundary conditions,
= du (5.12)
x„ = dn. (5.13)
The Gaussian elimination algorithm, based on an LU decomposition o f  the tridiagonal
matrix, has two stages: the forward elimination and the back substitution. In the forward 
elimination stage two auxiliary vectors, w and g, are computed as follows:
= 0.0, (5.14a)
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
42
w,- = d  I (bi -  a, wM), i = 2, 3 , ,  n—l\ (5.14b)
8i = du (5.15a)
gi = (di ~ a; &-i) / fa  -  a. wM), t = 2, 3............n-1. (5.15b)
The values o f x, are obtained in the back substitution stage as follows:
(5.16a)
= g i ~ *4 xM , i = n-1, n -2 ............ 1 (5.16b)
Gaussian elimination is an inherently serial algorithm because o f the recurrence rela­
tions in both stages o f the algorithm. However, if  one is faced with solving a set of 
independent tridiagonal systems, then Gaussian elimination will be the best algorithm to use 
on a parallel computer. This means that all systems o f the set are solved in parallel. In this 
case we obtain both the minimum number o f arithmetic operations and the maximum paral­
lelism.
The cyclic elimination algorithm, also called odd-even elimination [18] or parallel 
cyclic reduction [19], is a variant of the cyclic reduction algorithm [19] applying the reduc­
tion procedure to all o f the equations and eliminating the back substitution phase o f the 
algorithm. This makes cyclic elimination most suitable for machines with a large natural 
parallelism, like the MPP. It can be described as follows. Assume that n = 2r where r is an 
integer and
x, -  0, for i < 0 and i > n. (5.17)
Solving equation (5.11) for x, and the corresponding equations for x ^  and xi+u we have:
1) b  t_ \ ) Xj_2 (^i—i/^j—i ) X/, (5.18)
X; = (di / bi) -  ( a i l  bi )  x̂ _i -  ( d  / bi ) xi+1, (5.19)
%i+1 — fa+ifa+i) (^fifa+i) Xj (cL+]lb1+l) Xi+2* 
Substitute for x ^  and xi+1 in equation (5.19), to get
(5.20)




d P  =  - ( d i l b d ( d i. 1 l b i.  i ) ,
bP = 1 -  fa / bd (c;-i / -  (aM / *i+i) (Ci / &;),
C,(1) = -  (Ci / W (C*i / fcri-l),




The above process eliminates the odd variables in the even equations and the even 
variables in the odd equations by perforating elementary row operations. The resulting sys­
tem is again tridiagonal o f  the same form as equation (5.11) but with different coefficients 
(di, bi, c/) and forcing terms ( d i ) . This process can be repeated for r  steps ( r  = log2n) until 
one set o f equations remains. These equations are
The terms xh r  and x^r  o f  equation (5.26) are really zero because they refer to values 
outside the range 1 < i  < n  and by equation (5.17) are zero. Thus equation (5.26) becomes
The cyclic elimination procedure therefore requires the computation o f new sets o f 
coefficients and forcing terms, for levels * = 1 ,2  r ,  from
(5.26)
x ,  =  d tp  / t i p ,  for i  = 2, 3.......... n-1. (5.27)
(5.28)
t i p  = 1 -  A, C i-K  -  A i+K (5.29)
(5.30)
t i p  = A  -  A  A-vr -  Q  D i+K, (5.31)
where
(5.32)
Reproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
44
followed by computing the values o f  x,, using equation (5.27). Note that for k  =  1 equa­
tions (5.28) to (5.31) are equivalent to equations (5.22) to (5.25) with a[0) = a„ b f> = bi.
The ADI procedure to solve equation (5.1) can be applied for any set o f  boundary and 
initial conditions. A test problem is developed by setting V  to zero everywhere at t =  0 on 
the interior and
5 2 . Im plem entation on the M PP
The ADI method, described in section 5.1, was implemented on the MPP for a 
128 x  128 mesh point problem. In each processor, data corresponding to one mesh point is 
stored. The main program as well as the ADI procedure were written in M PP Pascal and 
run through the MPP Pascal compiler version 2. The main program, run on the VAX, han­
dles input and output, initialization o f the computational domain, and calling the ADI pro­
cedure. The ADI procedure, which was executed entirely on the array, computes the 
coefficients, forcing tenns, and solves two sets o f 128 tridiagonal systems. The tridiagonal 
systems are solved by the cyclic elimination algorithm, described in section 5.1, for all rows 
and all columns. This is done in parallel on the array with a tridiagonal system o f 128 
equations being solved on each row or column. In this case the vectors x„ a„ cit d, of 
equation (5.11) become the matrices x,v, bij, c{j, dij. After solving each set o f the tridi­
agonal systems, all points o f the domain are updated except the boundary points. This is
Reproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
= c„ and d ^  = dL.
1/(0,y) = 1 + 0.25(y -  3^), 1/(1 j )  = 1 + 025(2y -  3 /  -  3),
U(x,0) = 1 + 0.25(3x2 -  6x), U(x, 1) = 1 + 025(3x? -  5x -  2), 
for all time. The steady state solution to this problem is,
(5.33)
45
implemented by masking out the boundary columns (or the boundary rows) o f the array.
The ADI procedure is reasonably efficient and requires only five parallel arrays of 
floating point numbers. The procedure contains mostly matrix operations with a few scalar 
operations (for computing the coefficients o f U^, equations (5.5) to (5.8)) and has no vector 
operations. However, the cyclic elimination algorithm has some hidden defects. For each 
level o f the elimination process, a set o f data is shifted off the array and an equal set of 
zeros is shifted onto the array. Since all o f the processors are executing the same instruc­
tion at every cycle, some o f these processors may not be doing useful work; here they are 
either multiplying by zero or adding a zero. This is a problem with many algorithms on 
SIMD machines.
Table 9 contains the execution time and the processing rate of the ADI procedure for 
the 128 x 128 problem. The processing rate is determined by counting only the arithmetic 
operations (addition, multiplication, and division). Data transfer operations, vital as they are 
in this work, are not counted as floating point operations. However, all the arithmetic 
operations including those operations which do not contribute to the solution are counted.
Table 9. Measured execution time per time step and processing rate for the ADI procedure 
on the MPP.
Problem size (mesh points) Execution time (msec) Processing rate (MFLOPS)
128 x  128 23.698 134
In order to use the model developed in section 3.1 for the cost o f implementing the 
ADI method on the MPP, the data transfer as well as the arithmetic operations of the 
method are counted. Table 10 contains the operation counts for one pass of the ADI 
method using the cyclic elimination algorithm. These operations are also required for the 
second pass o f the method and the total number of operations required for both passes will 
be twice the number that is given in Table 10. The computation cost (equation (3.2)) and
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
46
the communication cost (equation (3.3)) of the ADI method on the MPP are listed in Table 
11 using VAX format primitives version 2 compiler (Table 1). The computation cost con­
tributes about 75% o f the total cost and the communication cost contributes about 25% of 
the total co s t The scalar operations have very little impact on the performance o f the 
method since they are inexpensive and may overlap with the array operations. It is also 
found that the cost o f solving the tridiagonal systems represents more than 95% o f the total 
cost o f the method.
Table 10. Operation counts for one pass o f  the ADI method using the cyclic elimination 
algorithm for solving the tridiagonal systems, where / = 1, 2 , . . .  , n, j  = 1, 2 , .  . . ,  n, and 
r  = log2n.
Operation Add Multiply Divide Shift Steps shifted
1. Compute 
eq. (5.7)
2 3 - 2 2
2. Compute A tJ, ClV, %  
eqs. (5.32)
” “ 3 r - “
3. Compute aff], 
eq. (5.28)
r r 2r -  1
4. Compute b\r), 
eq. (5.29)
2  r 2 r 2 r 2  (2r -  1)
5. Compute c\rJ , 
eq. (5.30)
- r “ r 2r -  1
6 . Compute d f), 
eq. (5.31)
2  r 2 r “ 2 r 2  (2r -  1)
7. Compute x-. „ 
eq. (5.27)
- ~ 1 • •
Total operations 4 r  + 2 6 r  + 3 3 r + 1 6  r  + 2 6 (2r -  1) + 2
Total operations for r -  7 30 45 22 44 764
Table 11. Estimated and measured times (in milliseconds) of the ADI method on the MPP.
Computation time Communication time Total estimated time Measured time
17.809 6.069 23.878 23.698
If we use the maximum processing rates of the arithmetic operations on the MPP (see 
section 2.1) and the operation counts from Table 10, the maximum possible rate o f  the ADI 
method will be 236 MFLOPS. Therefore, the measured processing rate is about 57% of the
Reproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
47
maximum possibie rate. The peak performance rate is not achieved because o f  the data 
transfer costs and the inefficiency o f  the VAX format primitives.
5 3 . Im plem entation on the  Flex/32
The ADI method, described in section 5.1, was implemented on the Flex/32 using p  
processors with p = 1, 2, 4, 8 , and 16 and for problems o f sizes n x n mesh points with n = 
64, 128, and 256. The main program as well as the ADI subroutine were written in Con­
current Fortran.
In order to implement the method in parallel, the domain is decomposed first vertically 
into n by n l  p  strips, for the first pass o f the method, and then horizontally into n / p  by n 
strips, for the second pass o f  the method. In the first pass, a set o f n / p  tridiagonal systems 
(each o f n equations) corresponding to the vertical lines o f  the net are solved for each strip 
using the Gaussian elimination algorithm, described in section 5.1, In the second pass, like­
wise, a set of n i p  tridiagonal systems (each of n equations) corresponding to the horizontal 
lines o f the net are solved for each strip using the same algorithm. In our implementation 
each strip is given to a process. In addition to initialization o f  the domain and input and 
output operations, the main program creates and starts the execution of the processes on 
specified processors with each process assigned to a separate processor. The processes are 
spawned only once at the beginning o f the program and are used for both passes o f  the 
method. The data corresponding to the domain is stored in the common memory. Also, 
the values of the boundary points, computed in the main program, are stored in the common 
memory. The forcing terms and the temporary matrices wtJ and gtJ, equations (5.14) and 
(5.15), for each strip are computed and stored in the local memory o f  each processor.
The ADI method was implemented by: computing the coefficients and forcing terms 
o f the tridiagonal systems and solving these systems for all columns first and then for all
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
48
rows. Upon completing a pass, each process signals to the other processes by incrementing 
a counter. All o f the processes will wait until each o f them has finished the current pass. 
A lock is assigned to the counter to ensure this sequence o f the events.
The speedup and efficiency as functions o f  the number o f processors for problems o f 
sizes 64 x 64,128 x 128, and 256 x 256 are listed in Table 12. The efficiency o f the method 
ranges from 35%, for the 64 x 64 problem using 16 processors, to 95%, for the 256 x 256 
problem using two processors. The measured execution times and the processing rates for 
these problems are listed in Table 13.
Table 12. Speedup and efficiency o f the ADI method on the Flex/32.
Number o f 64 x  64 points 128 x  128 points 256 x  256 points
processors speedup efficiency speedup efficiency speedup efficiency
1 1.000 1.000 1.000 1.000 1.000 1.000
2 1.827 0.913 1.874 0.937 1.900 0.950
4 3.393 0.848 3.660 0.915 3.722 0.931
8 5.278 0.660 6.807 0.851 7.271 0.909
16 5.588 0.349 10.486 0.655 13.414 0.838
Table 13. Measured execution times per time step and processing rates for the ADI method 
using 16 processors of the Hex/32.
Problem size Execution time Processing rate
(mesh points) (msec) (MFLOPS)
64 x  64 340 0.36
128 x  128 740 0.66
256 x  256 2320 0.85
Table 14 contains the estimated values of the computation time and the overhead 
times o f the ADI method using equations (3.4) to (3.12). Since the ADI method is applied 
to the interior points o f  the region only, the first and last processors have less work to do 
than the other processors. This means that load is not distributed evenly between the pro­
cessors except when we use two processors only. Therefore, the load distribution factor is
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
Table 14. Estimated and measured times (in milliseconds) o f the ADI method on the 
Flex/32.
No. o f Computation Spawning Memory Total estimated Measured
procs. time time time time time
64 x  64 points
1 - - - - 1900
2 981 26 16 1023 1040
4 490 52 8 550 560
8 245 104 4 353 360
16 123 208 2 333 340
128 x 128 points
I - - - - 7760
2 3942 26 65 4033 4140
4 1971 52 32 2055 2120
8 985 104 16 1105 1140
16 493 208 8 709 740
256 x 256 points
1 - - - - 31120
2 15683 26 260 15969 16380
4 7841 52 130 8023 8360
8 3921 104 65 4090 4280
16 1960 208 33 2201 2320
The overhead time of storing vectors in the common memory instead of the local 
memories is the only overhead that contributes to accessing the common memory. This 
overhead results from having all variables stored in the local memory when we used one 
processor while some o f these variables were stored in the common memory in the mul­
tiprocessing environment. The values o f kkk (equation (3.8)) and kdd (equation (3.12)) are 
two and 8 n i p ,  respectively. The synchronization time is insignificant because the routines 
that provide the locking mechanism are very efficient The spawning time dominates the 
overhead time for large number of processors. This resulted in the degradation on the per­
formance of the method for large number of processors. However, the overhead time is 
dominated by the common memory access time for a small number of processors. The total
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
50
estimated times are less than the measured times by less than 4%. The busses contention 
factor, jk ,  is not included in the common memory access time. However, it is expected that 
it has a minor impact on the results. Finally, it was realized that the common memory 
access operations may overlap with the spawning operations during the first pass o f the 
method.
5.4. Implementation on the Cray/2
The ADI method, described in section 5.1, was implemented on one processor o f  the 
Cray/2 for domains o f sizes ranging from 64 x  64 to 1024 x  1024 mesh points. The code, in 
each case, was written and run through the CFT/2 compiler. Each set o f the tridiagonal 
systems is solved by the Gaussian elimination algorithm, described in section 5.1, for all 
systems o f the set in parallel. This means that each element o f  w,- (equations (5.14)) is 
computed for all the tridiagonal systems before moving to the next element. This process is 
repeated in computing the elements of & (equations (5.15)) and x,- (equations (5.16)). The 
implementation requires that the vectors w;, &, x,- be changed to the matrices w,v, gtj< x,v. 
When these are done, all statements o f the code vectorize fully and the recursion problem 
of the algorithm is eliminated.
The ADI method requires that the data is first referenced by vertical lines and then by 
horizontal lines. The CFT/2 compiler stores arrays by incrementing the leftmost index first 
(column major order). This means that referencing the data by vertical lines causes no 
problems because the increment between data elements is unity. However, in referencing 
the data by horizontal lines the increment between the elements will be equal to the number 
of variables in each column. In our implementation, this number ranges between 64 and 
1024. This could cause a major problem on the Cray/2. As described in section 2.3, the 
machine has 128 memory banks and consecutive elements o f an array are stored in
Reproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
51
consecutive banks. For a memory stride o f 128, for example, words are drawn from the 
same bank and each word must wait for 57 clock periods to be moved from main memory, 
even if  there is no competition for resources from other processors. As a result, strides that 
are divisible by 128 or any large power o f  two result in major performance reductions. 
This problem is overcome by storing the data as though it had a column length one greater 
than its actual length. Thus for a stride o f 129 data elements are stored in sequential banks 
with a stride of one (stride o f 129 mod 128 banks = 1).
The number o f memory access operations can be reduced by using the local memory 
whenever that is possible. Once an array is stored in the local memory, none o f its ele­
ments can be changed; all the elements are treated the same. The forcing terms and G^, 
equations (5.7) and (5.8), are stored in the local memory.
Table 15 contains the execution time and the processing rate o f the ADI method when 
the domain size is varied from 64 x 64 through 1024 x 1024 mesh points. The processing 
rate is computed by counting the additions, multiplications, and divisions only. Division is 
counted as a single arithmetic operation. The number o f arithmetic operations for one pass 
of the ADI method, using the Gaussian elimination algorithm for solving the tridiagonal 
systems, are listed in Table 16. The total number of each operation will be twice the 
number that is given in Table 16.
Table 15. Measured execution times per time step and processing rates for the ADI method 







64 x  64 1.684 69
128 x 128 6.474 74
256 x 256 24.270 80
512 x 512 94.037 83
1024 x 1024 364.248 86
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
52
Table 16. Operation counts for one pass of the ADI method using the Gaussian elimination 
algorithm for solving the tridiagonal systems.
Operation Add Multiply Divide












Total operations 6 7 2
Table 17 contains the results of applying equations (3.13) to (3.16) for the ADI 
method. The number o f floating point operations is obtained by multiplying the values 
given in Table 16 by (Lxr = n -  2) while the number of main memory access operations 
is 12 Lm  for each pass o f the method. The multiplication time includes the time required 
for division; each division is estimated as three multiplications in this model. The cost of 
scalar operations are not included in the model. The memory access time represents about 
47% o f the total estimated time. The estimated time for memory access does not take into 
account the competition for memory banks from other processors which causes a lower data 
transfer rate. The total estimated time exceeds the measured time for large domains. This 
is because for large problems more overlapping between different operations is expected 
and the impact o f scalar operations is reduced. For these reasons, the performance rate of 
the method, Table 15, increases for large problems.
Because the ADI method has more multiplications (including divisions) than additions, 
the time to implement the method can be considered as a summation o f  two portions; a por­
tion with one operational pipeline and a portion with two pipelines. By using Table 16 and 
assuming that the peak performance rate o f one pipeline is 244 MFLOPS, ignoring the vec­
tor startup times, the maximum processing rate of the ADI method will be 357 MFLOPS.
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
53
Therefore, the measured processing rate for the 128 x 128 problem is about 20% of the peak 
performance rate. If  the startup time o f the floating point units is included, the measured 
processing rate for the 128 x 128 problem will be about 28% of the peak processing rate of 
262 MFLOPS. The major problems are accessing the main memory and the scalar portion 
of the code.
Table 17. Estimated and measured execution times (in milliseconds) of the ADI method on 













6 4 x 6 4  
128 x  128 
256 x  256 
5 1 2 x 5 1 2  


























R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
CHAPTER SIX
Solving Navier-Stokes Equations
In this chapter, the implementation o f Navier-Stokes equations on the MPP, Flex/32, 
and Cray/2 is presented. The cell relaxation scheme, described in Chapter four, and the 
ADI method, described in Chapter five, are used to solve these equations. The numerical 
method is presented in section 6.1. The implementation on each machine including the 
application o f the performance models, described in Chapter three, is described in sections
6.2 through 6.4.
6.1. The num erical m ethod
The Navier-Stokes equations for the two-dimensional, time dependent flow o f a 
viscous incompressible fluid may be written, in dimensionless variables, as:
where I t = (u,v) is the velocity, C, is the vorticity and Re is the Reynolds number.
The numerical method used to approximate equations (6.1) to (6.3) is based on the 
compact differencing schemes, described in section 4.1. Gatski et al. [13] applied the com­
pact scheme to solve these Navier-Stokes equations. Grosch [17] adapted the Navier-Stokes 






Reproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
55
applying the compact differencing scheme to equations (6 .1) and (6 .2 ) are solved using the 
cell relaxation scheme, see section 4.1. The compact difference approximation to equation
(6.3) results in an implicit set of equations for £ at the next time step. This set o f equations 
are solved by an ADI method which, as described in section 5.1, requires the solution of 
tridiagonal systems on each sweep. The approximation to equation (6.3) is briefly described 
here.
Consider the problem of approximating the solution o f equations (6.1) to (6.3) in the 
square domain 0  < x < 1, 0 < y < 1 with the boundary conditions u = 1 and v = 0  at y  = 1 and 
u = v = 0 elsewhere. Subdivide the computational domain into rectangular cells, as in sec­
tion 4.1. Apply the centered difference and average operators, equations (4.3) and (4.4), to 
get equations (4.5) to (4.8) which are solved using the cell relaxation scheme, as described 
in section 4.1. Let Ar be a full time step and apply the forward difference operator to equa­
tion (6.3) for the time derivative, giving
Applying the centered difference operator, equation (4.3), and the centered second 
difference operator, equation (5.4), for the space derivatives, we get
•n+l/2
Pi? Crtf -  (1 + 2of) + yW xgff = Fijt 




FU -  -PS? <&., -  a  -  2a?>) Vj -  iS?
o,j -  - P S ?  -  ( i  -  c i f 2 -  • » »  &!?.
(6.8)
(6.9)






' 2(Ay)f R e '
(6 .11)
RW =  a W +  u- 1 ■
Pw J 4 (Ax)j ‘- ' s
(6 .12)





The velocity field is not defined at the comers o f  the cells in  this scheme; however, it can
be computed as follows:
tfij  = ^ 0 M I 2 j  +  tfi-l/zj)- 
Using equation (4.10), to get
(6.16)
+ + (6-17) 
Equations (6 .6 ) and (6.7), which are similar to equations (5.5) and (5.6), represent two sets
o f independent tridiagonal systems. These systems can be solved using one of the algo-
rithms for solving the tridiagonal systems, described in section 5.1.
The ADI method for equation (6.3) is applied to all interior points of the domain. 
The values o f £ on the boundaries are computed using equation (6.2). On x  = 0, for exam­
ple, we have
0 <«> 
but u = 0  at x  =  0 , so (du / dy)x = o = 0  and
Reproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
57
(Q x = 0 =  ( | j ) x  = 0- (6-19)
Assume that the length, Ax, o f  the first two cells bordering the boundary are equal and
approximate the derivative o f v, to get
( J j ) x  = 0 = Oo (v)x = 0 +  «l (v)x = Ax +  “ 2 (v)x = 2Ax- (6.20)
Using the Taylor series for v at x = Ax, 2Ax about x  -  0, we get
( f j ) x  =  o =  flb  (v)x =  o  +  |  (v)x = o  +  Ax ( | j ) ,  =  0 +  = o +  0 [(Ax)3] |
+  |̂(v ) x  = o +  2A x ( - ^ ) z = o +  j ( 2 A x ) 2( - 0 ) x = o  +  0 [(Ax)3]  j .  (6 .21)
Solving equation (6 .21) for the unknowns ao, au a2, we have
(° x = 0 =  (i ) [” 3 (V)-  = 0 +  4  (V)x = AX -  (V)x = 2Ax]• (6 .22)
The value o f  v on x = 0 is zero while its value on x = Ax, 2Ax is computed using equation 
(6 .17). Similarly,
(O x . i =  ( J j ) x =  i = (“ )[3 (v)x= i -  4 (v)z= , _ ^  + (v)z= l _ 2A*]. (6.23)
( 0 ,  = 1 =  ( - | 4 ,  = 1 = ( ^ )  [-3 H  = 1 + 4 H  = 1 - A > - ( “) > = ! - 2Ay]> (6.24)
while
(v )y  = 0, x = Ax ( v)y = 0, x = 2Ax (6.25)
The solution procedure for the Navier-Stokes equations can be summerized as follows:
step (1) Assume that C, is zero everywhere at t = 0. The box variables for the relaxation
process are initialized and their boundary values are computed.
step (2) The vorticity at the comers of the domain is not defined in this scheme. These
values are approximated using the values o f  their neighboring points. The
Reproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
58
values of Cw/2̂ -1/2 are computed using the values o f £ at the comers o f the 
cells.
step (3) The relaxation process for the box variables, as described in section 4.1, is per­
formed.
step (4) The coefficients ajj), a,w, (i,^, p®, yW, -ybp (equations (6.10) to (6.15)) for both
passes o f  the ADI method are computed. This includes computing Uy and Vijt 
equation (6.17).
step (5) The values o f C on the boundaries, equations (6.22) to (6.25), are computed.
step (6) The ADI method for £ is applied to all interior points o f the domain, equations
(6 .6) and (6.7).
The repetition of steps (2) through (6) yields the values o f the velocity and vorticity at any 
later time. These steps were implemented using the following subprograms: setbc, step (1); 
zcntr, step (2); relaxd, step (3); cof, step (4); zbc, step (5); and triied and trijed, step (6). 
The subprogram triied computes F,v, equation (6 .8), and solves tridiagonal equations distri­
buted over columns, equation (6 .6), while trijed computes G^, equation (6.9), and solves tri­
diagonal equations distributed over rows, equation (6.7).
6 2 . Im plem entation on the M PP
The Navier-Stokes equations, described in section 6.1, were solved on the MPP using 
128 x 128 grid points (127 x 127 cells). The computational cells are mapped onto the array 
so that each comer o f  the cell corresponds to a processor, as described in section 4.2. The 
seven subprograms required to solve these equations (see section 6.1) were written in MPP 
Pascal and run through the MPP Pascal compiler version 3 (this is the only experiment 
where this version of the compiler was used). These subprograms were executed entirely
Reproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
59
on the MPP; only the input and output routines were run on the VAX. The four color 
relaxation scheme was implemented on the array, as in section 4.2. The tridiagonal sys­
tems, equations (6 .6) and (6.7), were solved by the cyclic elimination algorithm, described 
in section 5.1, for all rows and all columns.
One of the problems in solving Navier-Stokes equations on the MPP is the size o f the 
PE memory. As indicated in section 4.2, the relaxation procedure uses almost all o f the 
1024 bit PE memory. Although the staging memory can be used as a backup memory, this 
causes an I/O overhead and reduces the efficiency o f the machine. This problem was 
solved by declaring all o f the parallel arrays as global variables and using them by different 
procedures for more than one purpose. This means that temporary arrays were redefined in 
different parts o f the code. Beside this hardware problem, there are problems in using MPP 
Pascal to perform vector operations and to extract elements o f parallel arrays. Operations 
on vectors are performed on the MPP by expanding them to matrices and performing matrix 
operations. This means that vector processing rate is 1/128 o f that for matrix operations. 
M PP Pascal does not permit extracting an element o f a parallel array on the MPP. This 
means that scalar operations involving elements o f parallel arrays need to be expanded to 
matrix operations or they should be performed on the VAX.
Table 18 contains the execution time for each subprogram o f the Navier-Stokes algo­
rithm, that for one iteration in the case o f  relaxd; the percentage o f the total time spent in 
that subprogram; and the processing rate. The percentage o f time spent in each routine 
determines which routines in the program are using the most time for a given run. It is 
clear, from Table 18, that the majority o f  the time was spent in relaxd for this particular 
run. This is because the average time step requires about 270 iterations and the total time 
spent in the other routines ( zcntr, cof, zbc, triied, tr ijed ) is only about the time to do two
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
60
iterations o f relaxd. The number o f  iterations in  relaxd per time step depends on the partic­
ular data used during a given run. A different input data set could result in a smaller 
number of iterations per time step and relatively less time spent in the relaxation routine.
Table 18. Measured execution time and processing rate o f the Navier-Stokes subprograms 







setbc 0.587 0 .00 84
zcntr 2.694 0.06 24
relaxd 15.265* 99.23 156
co f 1.933 0.05 136
zbc 1.833 0.04 1.1
triied 12.717 0.31 125
trijed 12.725 0.31 125
overall# 41.597 100.00 155
* per iteration.
# for ten time steps (execution time is in seconds for this row).
The processing rates in Table 18 are determined by counting only the arithmetic 
operations which truly contribute to the solution. Scalar and vector operations which were 
implemented as matrix operations are counted as scalar and vector operations. This is the 
reason why the routines zbc and zcntr have low processing rates; zbc has only vector opera­
tions while zcntr has some scalar operations implemented as matrix operations. The routine 
setbc has mostly scalar and data assignment operations which reduce its processing rate. 
Beside these three routines, the processing rate ranges from 125 to 155 MFLOPS with an 
average rate of about 140 MFLOPS. The processing rates for relaxd, triied, trijed are 
slightly less than the rates listed in Tables 2 and 9 because two versions o f the MPP Pascal 
compiler, which call different sets o f VAX format primitives, were used in these implemen­
tations (see table 1). This causes a difference in the processing rate o f about 12% for the 
relaxation routine and about 7% for the tridiagonal solver.
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
61
Table 19 contains the estimated computation and communication times o f  the Navier- 
Stokes subprograms. These times arc computed using equations (3.2) and (3.3) and VAX 
format primitives version 3 compiler. The cost o f scalar operations is not included in this 
model; this explains the differences between the estimated and measured times for setbc and 
cof. The cost o f  the arithmetic operations is estimated the way these operations were imple­
mented; i.e., scalar and vector operations (in zcntr and zbc ) which were implemented as 
matrix operations are considered here as matrix operations. The amount of time spent on 
data transfers is quite modest for these subprograms; from 6 % for relaxd  to 25% for triied 
and trijed.









setbc 0.300 - 0.300 0.587
zcntr 2.177 0.348 2.525 2.694
relaxd 13.592 0.840 14.432 15.265
co f 1.421 0.134 1.555 1.933
zbc 1.540 0.144 1.684 1.833
triied 9.239 3.043 12.283 12.717
trijed 9.239 3.043 12.283 12.725
6 3 . Im plem entation on the Flex/32
The Navier-Stokes algorithm, described in section 6.1, was implemented on the 
Rex/32 using 64 x 64 grid points (63 x 63 cells) and 128 x 128 grid points (127 x 127 cells). 
The main program as well as the seven subprograms of the algorithm were written in Con­
current Fortran.
The parallel implementation o f the Navier-Stokes algorithm is done by assigning a 
strip o f  the computational domain to a process and performing all the steps o f the algorithm 
by each process. The main program performs only the input and output operations and 
creates and spawns the processes on specified processors. In our implementation, we used
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
62
1, 2, 4, 8 , and 16 processors o f the machine. The domain is decomposed first vertically for 
the first six subprograms o f the algorithm ( setbc, zcntr, relaxd, caf, zbc, and triied ) and 
then horizontally for the subprogram trijed. The four color cell relaxation scheme was 
implemented as described in section 4.3. The tridiagonal systems were solved by the Gaus­
sian elimination algorithm for both passes o f the ADI method, as described in section 5.3.
Data is stored in the common memory, in the local memory of each processor, or in 
both o f them. For the relaxation routine, as described in section 4.3, the matrices Qij, 
Z^, Xij, and Cij for each strip are stored in the local memory while the interface points and 
the error flags are stored in the common memory. A copy of the matrix Qq is also stored 
in the common memory to be used in computing the matrix V^, equation (6.17). The vorti- 
city at the comers of the cells (£•,), the velocity field (t7^), and the coefficients o f the tridi­
agonal equations ( a $ , a,^, p $ , yW, and are all stored in the common memory; this 
is required in order to implement the ADI method. The forcing terms and the temporary 
matrices for the tridiagonal solvers (see section 5.3) are stored in the local memory o f each 
processor.
In order to satisfy data dependencies between segments o f the code running many 
processes, a counter is used. This counter, which is a shared variable with a lock assigned 
to it, can be incremented by any process and be reset by only one process. It is imple­
mented as a "barrier" where all processes pause when they reach it. In addition, conver­
gence flag and other flags, described in section 4.3, are used for synchronization in the 
relaxation routine.
Table 20 contains the speedups and efficiencies as functions of the number of proces­
sors for the 64 x 64 and 128 x 128 problems for two time steps. The measured execution 
times for ten time steps and processing rates for these problems using 16 processors are
Reproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
63
listed in Table 21.
Table 20. Speedup and efficiency of the Navier-Stokes algorithm on the Flex/32.
Number of 
processors
64 x  64 points 128 x  128 points
speedup efficiency speedup efficiency
1 1.000 1.000 1.000 1.000
2 1.959 0.980 1.976 0.988
4 3.893 0.973 3.941 0.985
8 7.715 0.964 7.850 0.981
16 15.027 0.939 15.483 0.968
Table 21. Measured execution times for ten time steps and processing rates for the 
Navier-Stokes algorithm using 16 processors o f the Flex/32.
Problem size (grid points) Execution time (sec) Processing rate (MFLOPS)
64 x  64 268.7 1.09
128 x  128 2587.1 1.13
The performance of the Navier-Stokes algorithm is heavily influenced by the perfor­
mance o f the relaxation routine; about 98% o f the total time was spent in this routine for 
the two time step run. The only difference between the implementation of the relaxation 
routine in this algorithm and in solving Cauchy-Riemann equations, section 4.3, is that here 
we used domains of sizes 63 x 63 and 127 x 127 cells while in section 4.3 we used domains 
of sizes 64 x 64 and 128 x 128 cells. This means that here the number of cells is not divisi­
ble by the number o f processors used. This also means that the last processor has less 
work to do than the other processors. Therefore, the load distribution factor, equation (3.5), 
can be computed by
Using the performance model, developed in section 3.2, the overhead time represents at 
most 5% of the execution time of the algorithm. The overhead time of the relaxation rou­
tine dominates the total overhead time. As described in section 4.3, the spawning and syn­
(6.26)
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
64
chronization times are insignificant and the common memory access time dominates the 
overhead time. The other components of the common memory overhead time, Tclm and TcU, 
have a negligible impact on the total overhead time.
6.4. Implementation on the Cray/2
The Navier-Stokes algorithm, described in section 6.1, was implemented on one pro­
cessor o f  the Cray/2 using 64 x 64 and 128 x 128 grid points. The codes were written and 
run through the CFT/2 compiler. The four color cell relaxation scheme was implemented as 
described in section 4.4. The tridiagonal systems were solved by the Gaussian elimination 
algorithm for both passes of the ADI method, as described in section 5.4. The inner loops 
o f all o f  the seven subprograms of the Navier-Stokes algorithm were fully vectorized.
Table 22 contains the execution time for each subprogram of the algorithm, the per­
centage of the total time spent in that subprogram, and the processing rate for the 64 x 64 
and 128 x 128 problems. As described in section 6.2, most o f the time was spent in relaxd. 
The average time step requires about 110 iterations for the 64 x 64 problem and about 270 
iterations for the 128 x 128 problem. The subprogram setbc has a low processing rate 
because it has mostly memory access and scalar operations; however, this routine is called 
only once during the lifetime o f the program. Beside this subprogram, the processing rate 
ranges from 57 to 97 MFLOPS with an average rate o f about 70 MFLOPS for the subpro­
grams o f both problems. The processing rates for triied and trijed are slightly less than the 
rates for the ADI method for solving the diffusion equation (see Table 15) because here the 
parameters of the tridiagonal systems are matrices (see equations (6 .6) and (6.7)) while in 
solving the diffusion equation these parameters are scalars (see equations (5.5) and (5.6)). 
This causes an increase o f about 40% in the number o f main memory access operations and 
a decrease o f at least 16% in the method processing rate.
Reproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
Table 22. Measured execution time and processing rate o f  the Navier-Stokes subprograms 
on the Cray/2.
Routine 64 x 64 grid points 128 x  128 grid points
Exec, time Perc. o f Proc. rate Exec, time Perc. of Proc. rate
(msec) time (%) (MFLOPS) (msec) time (%) (MFLOPS)
setbc 0.480 0 .0 2 25 1.651 0.01 29
zcntr 0.252 0.08 63 1.059 0.03 61
relaxd 2.719* 99.02 96 11 .0 0 1 * 99.60 97
co f 0.720 0.24 85 3.036 0 .10 84
zbc 0.015 0.01 66 0.034 0 .00 59
triied 1.007 0.33 57 4.014 0.13 59
trijed 0.928 0.30 62 3.870 0.13 62
overall# 3.048 100 .00 96 30.286 100.00 97
* per iteration.
# for ten time steps (execution times are in seconds for this row).
Tables 23 and 24 contain the estimated times o f  the Navier-Stokes subprograms for 
the 64 x 64 and 128 x 128 problems. These times are obtained using equations (3.13) to 
(3.16). The main memory access time for each subprogram represents about 50% to 70% 
of the total estimated time and the measured time. The difference between the total 
estimated values and the measured values can be contributed to several reasons. Among 
these reasons are: the memory access and arithmetic operations can overlap, specially for 
large routines; the time to perform scalar operations is not included in this model; and, as 
mentioned before, there is up to 2 0 % offset on the results depending on the memory traffic 
and the number of the active processes on the system.
Reproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
66
Table 23. Estimated and measured execution times (in milliseconds) o f the Navier-Stokes 











setbc 0.277 0.022 0.089 0.388 0.480
zcntr 0.154 0.067 0 .022 0.243 0.252
relaxd 1.783 1.206 0.551 3.540 2.719
co f 0.480 0.173 0.173 0.826 0.720
zbc 0 .010 0.002 0.006 0.018 0.015
triied 0.510 0.130 0.281 0.921 1.007
trijed 0.510 0.130 0.281 0.921 0.928
Table 24. Estimated and measured execution times (in milliseconds) of the Navier-Stokes 











setbc 1.120 0.090 0.360 1.570 1.651
zcntr 0.622 0.270 0.090 0.982 1.059
relaxd 7.218 4.144 1.802 13.164 11.001
co f 1.967 0.711 0.711 3.389 3.036
zbc 0.019 0.004 0.013 0.036 0.034
triied 2.090 0.533 1.155 3.778 4.014
trijed 2.090 0.533 1.155 3.778 3.870
Reproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
CHAPTER SEVEN
Com parisons and  Concluding Rem arks
The architectures, the performance models, and the implementation of the three algo­
rithms on the architectures along with the results were presented in previous chapters. A 
comparison of the performance o f these architectures using the three algorithms is presented 
in this chapter. Some concluding remarks are also given here.
There are a number o f measures that one can use to compare the performance o f these 
parallel computers using a particular algorithm. One is the processing rate and another is 
the execution time. However it must be borne in mind that both of these measures depend 
on the architectures of the computers, the overhead required to adapt the algorithm to the 
architecture, and the technology, that is, the intrinsic processing power of each o f  the com­
puters.
If we consider a single sized problem for all algorithms, that on a 128 x 128 grid, then, 
as shown in Table 25, the processing rate is a maximum for the M PP using the three algo­
rithms; the range o f the processing rate for these algorithms is 134 to 175 MFLOPS for the 
MPP, 74 to 102 MFLOPS for the Cray/2, and only 0.66 to 1.13 MFLOPS on 16 processors 
of the Flex/32. The low processing rates o f these algorithms on the 16 processors of the 
Hex/32 are simply due to the fact that the National Semiconductor 32032 microprocessor 
and 32081 coprocessor are not very powerful. Although these algorithms have higher per­
formance rates on the MPP than on the Cray/2, it takes less time to solve these problems on
67
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
68
the Cray/2 than on the MPP. For the relaxation algorithm, this is due to the algorithm 
overhead involved in adapting the algorithm to the MPP; for each iteration the MPP algo­
rithm has 145 arithmetic operations per grid point compared to 66  operations per grid point 
on the Cray/2. For the ADI method, the Cray/2 outperformed the MPP with respect to the 
execution time because the cyclic elimination algorithm, used to solve tridiagonal systems 
on the MPP, has 97 arithmetic operations per grid point (see Table 10) while the Gaussian 
elimination algorithm, used on the Cray/2, has only 15 operations per grid point (see Table 
16); this is a factor o f about 6.5 which makes cyclic elimination the less efficient algorithm. 
The performance of the Navier-Stokes algorithm is heavily influenced by the performance 
of the relaxation algorithm; at least 98% o f the execution time was spent in the relaxation 
routine.
Table 25. Summary o f the results for the 128 x 128 problem.
Computer Cauchy-Riemann eqs. Diffusion eq. Navier-Stokes eqs.
Exec. Processing Exec. Processing Exec. Processing
time rate time rate time rate
(msec) (MFLOPS) (msec) (MFLOPS) (sec) (MFLOPS)
MPP 13.56 175 23.70 134 41.60 155
Flex/32 964.85 1.12 740.00 0 .66 2587.10 1.13
Cray/2 10.47 102 6.47 74 30.29 97
The implementation o f these algorithms on the Flex/32 has the same number o f arith­
metic operations per grid point as on the Cray/2; there is only a reordering o f the calcula­
tions and no additional arithmetic operations in the overhead. The algorithmic overhead for 
the Flex/32 versions is the cost o f exchanging the values o f the interface points and setting 
the synchronization counters for the relaxation algorithm and accessing the common 
memory for the ADI method. This means that the code on each processor is the serial code 
plus the overhead code. When the code is run on one processor, it is just the serial code 
with the overhead portion removed.
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
69
These two performance measures can also depend on problem size. For the 128 x 128 
problem, the times to complete one iteration of the relaxation algorithm on the Cray/2 and 
the M PP are comparable. However, for problems which are laiger than the M PP array 
there is a very large overhead cost because the MPP array memory is not large enough to 
contain the data and the staging memory must be used. Thus expanding the computational 
domain caused a degradation on the performance o f  the M PP for the relaxation algorithm. 
However, expanding the domain has m inor impact on the performance o f both the Cray/2 
and Flex/32 for the three algorithms. In fact the processing rate of the Cray/2 slightly 
increased as the problem size increased.
Comparing the measured processing rate with the peak processing rate o f  an algorithm 
on each o f the computers is also a measure of how well the algorithm has been mapped 
onto the architecture. This measure is also relatively independent o f  the basic technology of 
the implementation o f the architecture. For the relaxation algorithm, these relative perfor­
mance rates are 40% for the Cray/2, 48% for the MPP, and at least 96% for the Flex/32. 
For the ADI method, these rates are 28% for the Cray/2, 57% for the MPP, and between 
64% and 94% for the Flex/32. Accessing the main memory is the major problem on the 
Cray/2. The poor performance of the VAX format primitives is the major cause of 
inefficiency on the MPP. I f  the VAX format primitives were as efficient as IBM format 
primitives (see Table 1), the performance of the MPP for the relaxation algorithm, for 
example, could be 315 MFLOPS, about 72% of the peak processing rate. Finally, we have 
found no major overhead on the Flex/32, except for small problems where the spawning 
cost could be a factor.
Another measure o f performance is the number of machine cycles required to solve a 
problem. This measure reduces the impact of technology on the performance of the
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
70
machine. As shown in Table 26, the MPP outperformed the Cray/2 (by a factor o f 7 to 19) 
and the latter outperformed the Flex/32 (by a factor o f about 4) in this measure. This 
means that one processor o f the Cray/2 outperformed 16 processors o f the Flex/32 even if 
we assume that both machines have the same clock cycle. The problem with the Flex/32 is 
that, although each processor has a cycle time o f 100 nsec, the memories (local and com­
mon) have access times o f about 1 psec.
Table 26. Clock cycles (in million cycles) for the 128 x  128 problem.
Computer Cauchy-Riemann eqs. Diffusion eq. Navier-Stokes eqs.
MPP 136 237 415970
Rex/32 9649 7400 25871000
Cray/2 2554 1579 7386859
One simple comparison between the MPP and Cray/2 is the time to perform a single 
arithmetic operation using the models developed in sections 3.1 and 3.3. Using equation 
(3.13), the time to perform a single floating point operation (addition or multiplication) on 
an array of size 128 x 128 elements on the Cray/2, excluding the memory access cost, is 
91.3 psec. The time to perform the same operation on the MPP using the MPP Pascal ver­
sion 3 compiler ranges from 81.1 psec (for multiplication) to 96.5 psec (for addition). This 
shows that the processing power of a single functional unit o f the Cray/2 is comparable to 
the processing power o f 16384 processors o f  the MPP. However, much o f the overhead is 
not included in this comparison: memory access cost on the Cray/2, data transfers on the 
MPP, and so on.
The experiment o f solving the Cauchy-Riemann equations showed that by reordering 
the computations we were able to implement the algorithm on three different architectures 
with no major modifications. The experiment o f solving the diffusion equation showed that 
two different algorithms, Gaussian elimination and cyclic elimination, were used to solve
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
71
the same numerical problem on the three architectures. These two algorithms were chosen 
to exploit the parallelism available on these architectures. The three algorithms also exploit 
multiple granularities o f parallelism. These algorithms vectorized quite well on the Cray#. 
A fine grained parallelism, involving sets o f single arithmetic operations executed in paral­
lel, is obtained on the MPP. Parallelism at higher level, large grained, is exploited on the 
Rex/32 by executing several program units in parallel.
The performance model on the MPP was fairly accurate on predicting the execution 
times o f the algorithms when we used the measured times o f the VAX format primitives. 
The performance model on the Rex/32 showed the impact of the common memory access 
and spawning overheads on the performance o f these algorithms. The performance model 
on the C ray# was based on predicting the execution costs of separate operations. This 
model is used to identify the major costs o f these algorithms and reproduced the measured 
results with an error of at most 30%.
These experiments showed the impact of software on the performance of parallel 
machines. The MPP has two sets o f primitives and the ratio of execution times of these 
primitives ranges from 1.07 for multiplication to 2.53 for addition (see Table 1). The 
Rex/32 has two sets of locks to lock and unlock variables in the common memory. There 
was an increase of about 39% in the efficiency o f the relaxation algorithm when the MMOS 
locks were replaced by the Local locks (see Tables 4 and 5). The performance o f register- 
to-register vector computers, like the Cray/2, is strongly compiler dependent. The compiler 
is responsible for overlapping the addition, multiplication and memory access operations.
The ease and difficulty in using a machine is always a matter o f interest. The Cray# 
is relatively easy to use as a vector machine. Existing codes that were written for serial 
machines can always run on vector machines. Vectorizing the unvectorized inner loops will
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
72
improve the performance of the code. Unlike parallel machines, vector machines do not 
have the problem o f "either you get it or not”. The Flex/32 is not hard to use, except for 
the unavailability o f debugging tools which is a problem for many MIMD machines (a syn­
chronization problem could cause a program to die). On the other hand, the MPP is not a 
user-friendly system. The size o f the PE memory is almost always an issue. M PP Pascal 
does not permit vector operations on the array nor does it allow extraction o f an element of 
a parallel array. The MCU has 64 Kbytes o f program memory. This memory can take up 
to about 1500 lines o f MPP Pascal code. This means that larger codes can not run on the 
MPP. Finally, input/output is somewhat clumsy on the MPP. However, other machines 
with architectures similar to the MPP may not have the same problems that the MPP does.
There is one further observation o f interest These algorithms can be implemented 
concurrently on four processors of the Cray/2 (multitasking). The codes will be similar to 
the Flex/32 versions except that all o f the variables should be stored in the main memory; 
the local memory on the Cray/2 is used only to store scalar temporaries. Adapting these 
algorithms to a local memory multiprocessor with a hypercube topology should be relatively 
easy. A high efficiency is predicted in this case because all data transfers are to nearest 
neighbors and their cost should be very small compared to the computation cost
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
List of References
[1] Adams, L „  "Reordering Computations for Parallel Execution," Comm. Appl. Numer. 
Meths., Vol. 2, No. 3, May-June 1986, pp. 263-272.
[2] Amdahl, G. M., "Validity o f the Single Processor Approach to Achieving Large 
Scale Computing Capabilities," AFIPS Conf. Proc., Vol. 30, Washington, DC, 1967, 
pp. 483-485.
[3] Ames, W. F„ "Numerical Methods for Partial Differential Equations," 2nd ed., 
Academic Press, New York, 1977, pp. 251-255.
[4] Ames Research Center, "NAS User Guide," Version 3.0, Moffett Field, CA, 1987.
[5] Batcher, K. E., "Design of a Massively Parallel Processor," IEEE Trans. Computers, 
Vol. C-29, Sept. 1980, pp. 836-840.
[6] Bokhari, S. H„ "Multiprocessing the Sieve o f Eratosthenes," Computer, Vol. 20, No. 
4, April 1987, pp. 50-58.
[7] Cray Research Inc., "Fortran (CFT2) Reference Manual," Publication SR-2007, 
Mendota Heights, MN, 1986.
[8] Crockett, T. W„ "Performance of Fortran Floating-Point Operations on the Flex/32 
Multicomputer," ICASE Interim Rep. No. 4, NASA Langley Research Center, 
Hampton, VA, August 1987.
[9] Dongarra, J. J., "Performance o f Various Computers Using Standard Linear Equa­
tions Software in a Fortran Environment," Tech. Memo. 23, Argonna National Lab., 
Argonna, IL, Nov. 1985.
[10] Flexible Computer Co., "Flex/32 Multicomputer System Overview,” Publication No. 
030-0000-002, 2nd ed., Dallas, TX, 1986.
[11] Gallopoulos, E. J., "The Massively Parallel Processor for Problems in Fluid Dynam­
ics," Computer Physics Communications, No. 37, 1985, pp. 311-315.
Reproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
74
[12] Gannon, D. B. and Van Rosendale, J., "On the Impact o f Communication Complex­
ity on the Design o f Parallel Numerical Algorithms," IEEE Trans. Computers, Vol. 
C-33, No. 12, Dec. 1984, pp. 1180-1194.
[13] Gatski, T. B., Grosch, C. E., and Rose, M. E„ "A Numerical Study o f the Two- 
Dimensional Navier-Stokes Equations in Vorticity-Velocity Variables," J. CompuL 
Phys., Vol. 48, No. 1, 1982, pp. 1-22.
[14] Goddard Space Flight Center, "MPP Pascal Programmer’s Guide," Greenbelt, MD, 
June 1987.
[15] Goodyear Aerospace Co., "General Description of the MPP," Tech. Report GER- 
17140, Akron, OH, April 1983.
[16] Grosch, C. E., "Performance analysis o f Poisson solvers on array computers," in 
Infotech State of the Art Report: Supercomputers, Jesshope C. R. and Hockney R. 
W „ ed., 2, Infotech International, Maidenhead, England, 1979, pp. 147-181.
[17] Grosch, C. E., "Adapting a Navier-Stokes code to the ICL-DAP,” SIAM J. Scientific 
& Statistical Computing, Vol. 8 , No. 1, 1987, pp. s96-sl 17.
[18] Heller, D„ "A Survey o f  Parallel Algorithms in Numerical Linear Algebra," SIAM 
Review, Vol. 20, No. 4, 1978, pp. 740-777.
[19] Hockney, R. W. and Jesshope, C. R., "Parallel Computers: Architecture, Program­
ming and Algorithms," Adam Hilger, Bristol, England, 1981.
[20] Kaczmarz, S., "Angenaherte auflosung von systemen linearer gleichungen," Bull. 
Acad. Polon, Sci Lett. A, 1937, pp. 355-357.
[21] Ortega, J. M. and Voigt, R. G., "Solution o f Partial Differential Equations on vector 
and Parallel Computers," SIAM Review, Vol. 27, No. 2, June 1985, pp. 149-240.
[22] Parkinson, D. and Liddell, H. M., "The Measurement of Performance on a Highly 
Parallel System," IEEE Trans. Computers, Vol. C-32, No. 1, Jan. 1983, pp. 32-37.
[23] Philips, R. B. and Rose, M. E„ "Compact Finite-Difference Schemes for Mixed 
Initial-Boundary Value Problems," SIAM J. Numerical Analysis, Vol. 19, No. 4, 
1982, pp. 698-720.
[24] Reeves, A. P., "Parallel Pascal: An Extended Pascal for Parallel Computers," J. 
Parallel & Distributed Computing, Vol. 1, 1984, pp. 64-80.
R eproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
75
[25] Rose, M  E., "A ‘Unified’ Numerical Treatment of the Wave Equation and the 
Cauchy-Riemann Equations," SIAM J. Numerical Analysis, Vol. 18, No. 2, 1981, 
pp. 372-376.
[26] Tanabe, K„ "Projection Method for Solving a Singular System o f Linear Equations 
and its Applications," Numer. Math., VoL 17, 1971, pp. 203-214.
[27] Voigt, R. G., "Where are the Parallel Algorithms?,” 1985 National Computer 
Conference Proceedings, AFIPS Press, pp. 329-334.
Reproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
Autobiographical Statement
The author was bom  in Mosul, Iraq on July 8 , 1954. He received his B.S. degree in 
Electrical Engineering from Mosul University, Mosul, Iraq, in  1976 and his M.S. degree in 
Computer Engineering from Syracuse University, Syracuse, New York, in 1982. He 
worked as an Electrical Engineer with the Organization o f Technical Industrial, Baghdad, 
Iraq, from 1977 to 1980.
The author was granted a lull scholarship from the Iraqi government for his M.S. 
study. Also, he was awarded one year doctoral fellowship from Old Dominion University. 
He is a member o f Phi Kappa Phi and the IEEE Computer Society.
76
Reproduced  with permission of the copyright owner. Further reproduction prohibited without permission.
