Parametric Micro-level Performance Models for Parallel Computing by Kim, Youngtae et al.
Computer Science Technical Reports Computer Science
12-5-1994
Parametric Micro-level Performance Models for
Parallel Computing
Youngtae Kim
Iowa State University
Mark Fienup
Iowa State University
Jeffrey S. Clary
Iowa State University
Suresh C. Kothari
Iowa State University
Recommended Citation
Kim, Youngtae; Fienup, Mark; Clary, Jeffrey S.; and Kothari, Suresh C., "Parametric Micro-level Performance Models for Parallel
Computing" (1994). Computer Science Technical Reports. 64.
http://lib.dr.iastate.edu/cs_techreports/64
Parametric Micro-level Performance Models for Parallel Computing
Abstract
Parametric micro-level (PM) performance models are introduced to address the important issue of how to
realistically model parallel performance. These models can be used to predict execution times, identify
performance bottlenecks, and compare machines. The accurate prediction and analysis of execution times is
achieved by incorporating precise details of interprocessor communication, memory operations, auxiliary
instructions, and effects of communication and computation schedules. Parameters are used for flexibility to
study various algorithmic and architectural issues. The development and verification process, parameters and
the scope of applicability of these models are discussed. A coherent view of performance is obtained from the
execution profiles generated by PM models. The models are targeted at a large class of numerical algorithms
commonly implemented on both SIMD and MIMD machines. Specific models are presented for matrix
multiplication, LU decomposition, and FFT on a 2-D processor array with distributed memory. A case study
is done on MasPar MP-1 and MP-2 machines to validate PM models and demonstrate their utility.
Disciplines
Systems Architecture | Theory and Algorithms
Parametric Micro-level Performance 
Models for Parallel 
Computing
TR94-23
Youngtae Kim, Mark Fienup, Jeffrey S. Clary & Suresh C. Kothari
December 5, 1994
Iowa State University of Science and Technology
Department of Computer Science
226 Atanasoff
Ames, IA 50011
Parametric Micro-level Performance Models for Parallel
Computing
Youngtae Kim, Mark Fienup, Jeffrey S. Clary, Suresh C. Kothari
Department of Computer Science
Iowa State University
Ames, Iowa 50011
kim@cs.iastate.edu
fienup@cs.iastate.edu
clary@cs.iastate.edu
kothari@cs.iastate.edu
Keywords: Performance Model, Parallel computing, Numerical Algorithms, Memory
Access Optimization
1 Introduction
How to model parallel computation has been an important topic of research in high-
performance computing. Performance models have been extensively investigated through
theoretical and empirical studies. One important issue is how to make models realistic.
The papers [19, 3] discuss shortcomings of earlier theoretical research, and propose new
models called BSP and LogP for parallel computation. An important aspect of both
models is the incorporation of communication parameters which were ignored in earlier
theoretical research. The studies [2, 7, 8, 18] address several pragmatic issues and pro-
vide insights into important attributes of parallel performance. A good introduction
to performance and scalability of parallel systems is provided in recent books [10, 12].
This paper is about parametric micro-level (PM) performance models for parallel
computation. While BSP and LogP models [19, 3] focus on what is a realistic ab-
straction for modeling parallel performance, our emphasis is on pragmatic models to
accurately predict and analyze execution times. Our goal is to develop performance
models that can actually be used to predict performance on existing and future generation
machines, compare machines, and facilitate efficient implementations of algorithms
by identifying performance bottlenecks. To develop such models, we adopt a micro-level
approach which incorporates precise details of interprocessor communication, memory
operations, miscellaneous overheads due to auxiliary instructions, and effects of
communication and computation schedules.
Execution times can be predicted by fitting timing curves to experimental data,
as discussed in [8]. The basic approach is to determine an algebraic expression for
the fitting formula by analysis of the algorithm and then determine the coefficients by
experiments. This approach is closely aligned with our goals; it can accurately predict
execution times. A fitting formula expresses execution time as a function of problem
size and number of processors. It does not describe how architectural parameters
affect performance. Also, it is not possible to identify performance bottlenecks using
the fitting formula. We address these shortcomings with PM models. First, instead
of predicting execution time as a scalar quantity, PM models predict a vector that
represents significant components of execution time. This is useful for analysis of
performance. Secondly, the formulas are parametric. Architectural and algorithmic
parameters are incorporated as variables. The parameters provide flexibility to study
a variety of architectural and algorithmic issues. For example, the impact of changing
processor speed, communication speed, or memory access speed can be studied by
varying the parameters of the model.
A tradeoff is to be expected between realistic modeling and its applicability in
the absence of specific information about the parallel algorithm or the architecture. It
is desirable that performance models are not unnecessarily specific with respect to
algorithms and architectures. Models need to be designed with a set of parameters
applicable to a wide class of parallel algorithms and architectures. Specifics enter into
the picture when parameter values have to be determined. There is an example in [3]
where two implementations of FFT are considered. The experimental results show a
dramatic difference in communication costs of those two implementations. If a model is
to predict the difference, it is inevitable that details of the implementation of the
algorithm have to be considered. In order to accommodate these conflicting requirements, our
approach is to design the parameters and the process of model development with general
applicability in mind, and follow it up with complete examples of models which get into
specifics.
We consider execution time as the principal measure of performance. The models
generate execution profiles to provide a picture of how computation, memory operations,
communication, and miscellaneous overheads together account for the total
execution time. The execution profiles can be used to view the performance in different
ways. Other metrics such as speedup, efficiency, and MFLOPS are defined on the basis of
execution profiles. It is well known that performance metrics can provide different and
sometimes misleading views of performance [8, 9]. We correlate various performance
metrics to provide coherent views of parallel performance.
PM models are appropriate for a large class of data parallel numerical algorithms,
described later in the paper. This class is of interest since it includes a large number
of algorithms encompassing many of the scientific and engineering applications. These
algorithms are typically implemented on MIMD machines, but many of them can also
be implemented quite efficiently on SIMD machines. Separate PM models are needed
for different algorithms. Each model includes a complete representation of the parallel
algorithm, determined by key parts of the algorithm. As concrete illustrations, we
present models for matrix multiplication, LU decomposition, and fast Fourier transform
(FFT), all implemented on a 2-D processor array. These algorithms are of considerable
interest in practice; individually, they have been used as examples in many empirical
and theoretical studies [19, 3, 1, 16, 6]. Together, the algorithms represent varying
degrees of computation, communication, and memory requirements, and serve well as
test cases.
PM models are validated, and their utility is demonstrated in a case study on MasPar
MP-1 and MP-2. Two implementations of each algorithm are studied to illustrate
the analysis and impact of memory operations. The case study provides interesting
examples of how architectural differences affect performance. For example, the choice
between a small number of powerful processors or a large number of less powerful
processors is often a point of debate in parallel computing. To study this issue, we
present a concrete example of performance comparison of a 16K processor MP-1 and a 4K
processor MP-2 using three algorithms with different computational characteristics.
The models are discussed in Section 2, the performance analysis is described in
Section 3, a case study is presented in Section 4, and conclusions are in Section 5.
2 Parametric Micro-level Models
In this section, the model development and verification process is described using the
examples of three parallel algorithms on a 2-D processor array. Parameters of the
model and its applicability are also discussed.
2.1 Model Development
Each PM model is based on a precise analytical formula that captures essential
operations of a given parallel algorithm. The formula has four components to predict the
execution time as a vector. These components are computation time, communication
time, memory access time, and the time for auxiliary instructions. Architectural
parameters of the model are determined by experimental measurements. In hypothetical
cases such as the study of a futuristic machine, the parameters are extrapolated. We
will first provide an overview of model development and follow it with details.
2.1.1 Overview
The development process can be described as follows:
Step 1: Derive analytical formulas $f_{comp}$, $f_{comm}$, and $f_{mem}$ for the parts of the
execution time due to computation, communication, and memory operations, respectively.
Step 2: Do experimental measurements of sample cases to determine model parameters
and also the time for computation, communication, accessing the memory,
and the miscellaneous time for auxiliary instructions.
Step 3: Select the template for regression analysis to estimate the miscellaneous
overhead time. Determine the regression coefficients based on experimentally
measured values. The regression formula for miscellaneous overhead time is denoted
by $f_{misc}$.
Step 4: Based on the experimental measurements, modify the analytical expressions
$f_{mem}$ and $f_{comm}$ so that the predictions match the experimental timings
determined in Step 2. The modifications to $f_{mem}$ take into account
cache effects and overlap of memory accesses with other operations. The
modifications to $f_{comm}$ take into account overlap of communication with
computation.
Step 5: Finally, the following formula is obtained to predict the execution time:
$f_{comp} + f_{comm} + f_{misc} + f_{mem}$
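As a minimal sketch of Step 5, the prediction can be held as a four-component vector whose sum is the predicted execution time. The class name `Prediction` and the sample values are hypothetical and are not taken from the report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    """Predicted execution time, kept as its four components."""
    comp: float  # f_comp: floating point computation
    comm: float  # f_comm: interprocessor communication
    misc: float  # f_misc: auxiliary instructions (regression estimate)
    mem: float   # f_mem: LOAD/STORE memory operations

    def total(self) -> float:
        # Step 5: f_comp + f_comm + f_misc + f_mem
        return self.comp + self.comm + self.misc + self.mem


# Hypothetical component times, for illustration only.
p = Prediction(comp=1.0, comm=0.25, misc=0.5, mem=0.25)
print(p.total())
```

Keeping the components separate, rather than storing only the sum, is what allows the later analysis to attribute time to each source.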
2.1.2 Details
The analytical formulas are given for the three parallel algorithms in Appendix A.
In analyzing practical scenarios for parallel machines, the lower order terms can be
significant. These formulas are carefully derived by examining the parallel algorithm
to capture all its essential details. The formulas are complex, but the advantage is that
the performance predictions are very accurate.
The three algorithms used in the study are well-known. The LU decomposition is
described in [5]. The details of the FFT algorithm can be found in [4]. Cannon's parallel
algorithm for matrix multiplication is described in [12]. The LU decomposition uses a
2-D scattered data layout for the coefficient matrix, and it includes partial pivoting.
Different communication patterns are used by the three algorithms. The matrix
multiplication uses nearest-neighbor communication where elements are shifted from one
processor to the next along either a row or a column, with wrap-arounds at the end. In
the case of the LU decomposition, communication is needed for pivoting and for
broadcasting a pivot row and a multiplier column. A one-to-all broadcast is used along
either a row or a column of processors. To implement butterfly operations, the FFT
algorithm requires communication between processors in a row or a column where the
distance between the communicating processors is a power of two.
Depending on whether the routing is pipelined or non-pipelined, the cost of a
communication operation varies. Table 1 summarizes different communication schemes
and their costs, and it also lists costs on MasPar machines. The Xnet[d] primitive
on MasPar is a version of non-pipelined routing, and the Xnetp[d] and Xnetc[d] are
for pipelined routing, where d is distance. Typically, to send a large message from one
processor to another, multiple individual messages may be required. There may also
be a limit on the number of messages that can be pipelined together. On MasPar, each
message has to be either one, four or eight bytes, and it has to be loaded in a register
first. The pipelining is done at the bit level for each message. In our case study, single
precision arithmetic is used, and the messages are four bytes each. The communication
cost formulas in Table 1 are simplified in accordance with [15] to show the cost on
MasPar when the message size is four bytes.

Table 1: Communication Costs

General description                              MasPar specific description
Routing scheme   Communication cost              Primitive          Communication cost (32-bit message)
                                                                    MP-1        MP-2
Pipelined        $T_{Xs} + d T_{Xp} + k T_{Xt}$  Xnetp[d]           58 + d      48 + d
                                                 Xnetc[d] (Copy)    84 + d      48 + d
Non-pipelined    $T_{Xs} + d k T_{Xt}$           Xnet[d], d = 1     43          40
                                                 Xnet[d], d > 1     19 + 35d    13 + 33d

$T_{Xs}$: startup time                   d: distance
$T_{Xp}$: time to fill the pipeline      k: number of messages
$T_{Xt}$: transmission time
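The cost formulas of Table 1 can be sketched as small functions. This is an illustrative sketch only: the function names and the dictionary layout are assumptions introduced here, while the constants are the MasPar entries of Table 1 for a four-byte message.

```python
def pipelined_cost(t_xs, t_xp, t_xt, d, k):
    """General pipelined routing cost: T_Xs + d*T_Xp + k*T_Xt."""
    return t_xs + d * t_xp + k * t_xt


def nonpipelined_cost(t_xs, t_xt, d, k):
    """General non-pipelined routing cost: T_Xs + d*k*T_Xt."""
    return t_xs + d * k * t_xt


# MasPar entries of Table 1, stored as (constant, per-distance slope).
MASPAR = {
    "MP-1": {"xnetp": (58, 1), "xnetc": (84, 1), "xnet": (19, 35), "xnet_d1": 43},
    "MP-2": {"xnetp": (48, 1), "xnetc": (48, 1), "xnet": (13, 33), "xnet_d1": 40},
}


def maspar_cost(machine, primitive, d):
    """Cost of one four-byte transfer over distance d, per Table 1."""
    table = MASPAR[machine]
    if primitive == "xnet" and d == 1:
        return table["xnet_d1"]  # Table 1 lists the d = 1 case separately
    const, slope = table[primitive]
    return const + slope * d


# Pipelined versus non-pipelined transfer over distance 4 on MP-1.
print(maspar_cost("MP-1", "xnetp", 4), maspar_cost("MP-1", "xnet", 4))
```

The comparison at the end shows why the choice of primitive matters: the non-pipelined cost grows much faster with distance than the pipelined one.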
Examples are cited in [8] to point out that simple overhead-type operations should
not be neglected, no matter how trivial they may seem. PM models consider
miscellaneous overheads arising from auxiliary instructions to implement loops in the
machine language, register moves, etc. A regression formula is used to predict the
miscellaneous overhead time. The template for the regression formula is determined by
examining the loop structure of the parallel program. The templates for the three
algorithms are listed in Table 2. A simple algebraic manipulation of the templates shows
that miscellaneous overhead is a function of two variables: the local problem size and
the number of processors. The coefficients $\alpha_0$, $\alpha_1$, $\alpha_2$, $\alpha_3$, and $\alpha_4$ shown in
Table 2 are determined on the basis of experimental measurements of sample cases with
different local problem sizes and using 1K and 4K processors.
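The regression templates of Table 2 can be evaluated directly once the coefficients are known. The sketch below is illustrative: the function names are assumptions, `c[i]` stands for the i-th regression coefficient of a template, and the tuples hold the MP-1 column of Table 2. The choice P = 128 and M = 8 corresponds to N = 1024 on a 128 x 128 array under M = N/P.

```python
import math

# Regression coefficients from the MP-1 column of Table 2.
MM_MP1 = (1.20000e-7, 2.17535e-5, 5.28270e-6, 1.08510e-6)
LU_MP1 = (1.19000e-7, 4.24202e-5, 1.78508e-5, 2.22000e-7)
FFT_MP1 = (3.52811e-5, 2.99546e-5, 1.57158e-5)


def f_misc_mm(c, P, M):
    """Matrix multiplication: P (c0 + c1 M + c2 M^2 + c3 M^3)."""
    return P * (c[0] + c[1] * M + c[2] * M**2 + c[3] * M**3)


def f_misc_lu(c, P, M):
    """LU decomposition: P (c0 M + c1 (log2 P) M + c2 M^2 + c3 M^3)."""
    return P * (c[0] * M + c[1] * math.log2(P) * M + c[2] * M**2 + c[3] * M**3)


def f_misc_fft(c, P, M):
    """FFT: c0 M + c1 (log2 P^2) M + c2 M log2 M."""
    return c[0] * M + c[1] * math.log2(P * P) * M + c[2] * M * math.log2(M)


# Miscellaneous overhead estimate for matrix multiplication, P = 128, M = 8.
print(f_misc_mm(MM_MP1, P=128, M=8))
```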
The architecture parameters include individual timings for floating point
instructions, communication primitives, and LOAD and STORE operations. It is assumed
that memory accesses are only through LOAD and STORE instructions. The
architecture parameters can be obtained from the machine manual, but it is a good idea
to actually measure these timings. The architecture parameters are listed in Table 3
along with the values for the MasPar MP-1 and MP-2 machines. The other parameters
include problem size, PE array size, and the timing for algorithm specific primitives
such as computing the twiddle factor for FFT.

Table 2: Regression Templates for Miscellaneous Overheads

Regression templates:
  $f^{MM}_{misc} = P(\alpha_0 + \alpha_1 M + \alpha_2 M^2 + \alpha_3 M^3)$
  $f^{LU}_{misc} = P(\alpha_0 M + \alpha_1 (\log_2 P) M + \alpha_2 M^2 + \alpha_3 M^3)$
  $f^{FFT}_{misc} = \alpha_0 M + \alpha_1 (\log_2 P^2) M + \alpha_2 M \log_2 M$

Regression coefficients:
  Template   Coefficient   MP-1          MP-2
  MM         $\alpha_0$    1.20000e-7    5.40000e-7
  MM         $\alpha_1$    2.17535e-5    1.10000e-8
  MM         $\alpha_2$    5.28270e-6    5.11250e-6
  MM         $\alpha_3$    1.08510e-6    7.88700e-7
  LU         $\alpha_0$    1.19000e-7    6.88000e-5
  LU         $\alpha_1$    4.24202e-5    1.56110e-5
  LU         $\alpha_2$    1.78508e-5    1.10677e-5
  LU         $\alpha_3$    2.22000e-7    2.25800e-7
  FFT        $\alpha_0$    3.52811e-5    0.97346e-6
  FFT        $\alpha_1$    2.99546e-5    8.72500e-6
  FFT        $\alpha_2$    1.57158e-5    5.49030e-6

For $f^{MM}_{misc}$ and $f^{LU}_{misc}$:
  $P \times P$: processor array size
  $N \times N$: matrix size
  $M \times M$: local problem size per processor ($M = N/P$)
For $f^{FFT}_{misc}$:
  $P \times P$: processor array size
  $N$: number of elements
  $M$: local problem size per processor ($M = N/P^2$)

Table 3: Architecture Parameters

Parameter       Operation                              MP-1 Cycles   MP-2 Cycles
$T_{load}$      Load                                   85            40
$T_{store}$     Store                                  74            35
$T_{mult}$      Floating Point Multiply                225           41
$T_{div}$       Floating Point Division                325           75
$T_{add}$       Floating Point Addition                127           26
$T_{neg}$       Floating Point Negation                36            10
$T_{cmp}$       Floating Point Comparison              84            33
$T_{twiddle}$   Twiddle Factor Calculation for FFT     9540          2845
2.2 Verification of Model
A number of features are built into the models to ensure that the execution times
are predicted accurately. First, the precise details of computation, communication,
memory operations, and miscellaneous overheads are included in the models. Secondly,
the model parameters are carefully determined by experiments. However, PM models
are complex, and it is important to verify each model systematically. The procedure
for such a verification is described here. This procedure was used in our case study to
verify the models on the MasPar MP-1 and MP-2 machines.
We describe the necessary experimental measurements to be obtained by running
the parallel programs for sample problem sizes. The experimental measurements
include: (i) total execution time ($T_{exec}$), (ii) computation time ($T_{comp}$), (iii)
communication time ($T_{comm}$), (iv) miscellaneous overhead time ($T_{misc}$), and (v) the
time for memory operations ($T_{mem}$). The experimental measurements for (ii), (iii),
and (iv) were done after deleting appropriate instructions from the compiler generated
assembly code. First, $T_{misc}$ is measured by deleting all the computation and
communication instructions plus the associated LOAD and STORE instructions. Next, only
the communication and the memory instructions are omitted, and the computation time
($T_{comp}$) is determined by subtracting $T_{misc}$ from the resulting execution time.
Finally, only the memory instructions are omitted, and the communication time
($T_{comm}$) is determined by subtracting $T_{comp} + T_{misc}$ from the resulting execution
time. The time for memory operations is based on the previous measurements using the
equation
$T_{mem} = T_{exec} - T_{comp} - T_{comm} - T_{misc}$.
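The subtraction procedure described above can be sketched as follows. The function and argument names are hypothetical; the four inputs are the run times of the full program and of the three variants with instructions deleted, and the sample numbers are made up for illustration.

```python
def decompose(t_exec, t_misc_only, t_comp_misc, t_no_mem):
    """Split measured run times into the four components.

    t_exec      : run time of the full program
    t_misc_only : computation, communication, and their LOAD/STOREs deleted
    t_comp_misc : only communication and memory instructions deleted
    t_no_mem    : only memory instructions deleted
    """
    t_misc = t_misc_only
    t_comp = t_comp_misc - t_misc
    t_comm = t_no_mem - (t_comp + t_misc)
    t_mem = t_exec - t_comp - t_comm - t_misc
    return {"comp": t_comp, "comm": t_comm, "misc": t_misc, "mem": t_mem}


# Hypothetical timings in seconds, for illustration only.
print(decompose(t_exec=10.0, t_misc_only=1.0, t_comp_misc=5.0, t_no_mem=7.0))
```

By construction the four returned values add up to `t_exec`, so any measurement error or overlap effect ends up in the memory component.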
The accuracy of models is based on the following observations:
• The computation and communication timings predicted by the analytical formulas
  $f_{comp}$ and $f_{comm}$ are checked individually against the experimental values
  $T_{comp}$ and $T_{comm}$.
• Only a part of the experimental data is used to determine the regression
  coefficients, and the remaining data is used as the test data to verify the regression
  formula.
• The memory model is checked separately.
Experimental measurements are sometimes tricky, especially due to the fact that
overlaps have to be taken into account. In some cases, we had to modify the assembly
code to get the experimental data since the compiler introduced major transformations
into the code and making changes in the high-level language did not produce the effect
we wanted. For example, this was the case in an instance where we wanted to selectively
omit certain instructions to measure their effect. There may be problems arising from
data dependencies where omitting certain instructions can have side effects. For
example, omitting a LOAD can cause a subsequent division instruction to raise a
division by zero exception. These issues have to be addressed in experimental procedures.
Our experience is that a LOAD-STORE architecture makes experimental procedures
simpler; at the least, it avoids complications resulting from complex addressing modes
where it is not possible to separate memory accesses. A systematic development of
experimental procedures is an important and complex topic by itself. For example, a timing
procedure suitable for programs that use message passing is described in [11]. To do
complete justice to this topic is beyond the scope of this paper.
2.3 Scope and Applicability
PM models are applicable to a class of numerical algorithms described as follows. First,
the work done by the algorithm is characterizable as a set of floating point operations.
Secondly, the parallel execution proceeds as a succession of steps with synchronization
points in between. Each step consists of computation followed by communication. The
same program is executed by all processors, but different data is processed. Within
each step, some processors in a MIMD machine may finish their computations earlier
and remain partly idle till the next synchronization point. The concept of tight
synchronization is inherent in the BSP model [19]. The BSP model considers an algorithm
as a sequence of supersteps. Each superstep combines computation and communication.
Many of the numerical algorithms from scientific and engineering applications
fall in the category to which PM models can be applied. There are also important
exceptions; for example, sorting algorithms, where it is the data movements and not the
floating point operations that characterize work.
The parallel algorithms considered in this paper are used on both SIMD and MIMD
machines. We have implemented these algorithms on MasPar, a SIMD architecture
and nCUBE, a MIMD architecture. PM models with some changes can be applied to
different machines. Experimental measurements may pose a problem on some machines.
For example, in some cases it may not be possible to arrive at a cycle time for an
individual instruction because it may vary depending on the adjacent instructions.
This was observed to be the case on nCUBE. We have found it is easier to make
experimental measurements on machines that have processors with LOAD-STORE
architecture where the only instructions to access memory are LOAD and STORE
operations. Fortunately, this is the case with several recent parallel machines including
MasPar MP-1 and MP-2, Intel Paragon, IBM SP-1 and SP-2.
A PM model is useful in many ways. The case study in Section 4 provides
an illustration of how it is useful to identify performance bottlenecks, analyze
performance, and compare machines. Constants are important in practice. For example,
a better design that increases performance by 50% is not something that a computer
manufacturer can afford to ignore. In such situations, PM models provide a viable tool
to accurately analyze performance of different designs. For a new generation of
machines, an important consideration is cost effective improvement in performance. The
alternatives could be either faster processors, faster communication hardware or faster
memory. Such alternatives can be evaluated by PM models.
3 Performance Analysis
The execution profiles generated by models are used as the basis for performance
analysis. We derive quantitative relationships that are useful for the class of algorithms
discussed in Section 2.3.
3.1 Execution Profiles
PM models predict the execution time as a sum of four components corresponding to
computation, communication, miscellaneous overheads and memory operations. The
model can be used to predict the total execution time, and each of its components
separately. The execution profile for an algorithm is presented in the form of a table
that shows percentages attributed to each component of the execution time for a range
of problem sizes. The computation component represents the useful work, and the other
three components should be as small as possible. It becomes clear from the execution
profile how significant communication, memory operations, or miscellaneous overheads
are as performance bottlenecks.
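One row of such a table can be computed from the four predicted components as sketched below. The function names and the sample values are hypothetical; treating the largest non-computation share as the main bottleneck is a simple reading of the paragraph above, not a rule stated in the report.

```python
def execution_profile(comp, comm, misc, mem):
    """Return each component as a percentage of the total execution time."""
    total = comp + comm + misc + mem
    parts = {"comp": comp, "comm": comm, "misc": misc, "mem": mem}
    return {name: 100.0 * t / total for name, t in parts.items()}


def main_bottleneck(profile):
    """Name the largest component other than computation."""
    overheads = {k: v for k, v in profile.items() if k != "comp"}
    return max(overheads, key=overheads.get)


# Hypothetical component times for one problem size.
profile = execution_profile(comp=4.0, comm=2.0, misc=1.0, mem=1.0)
print(profile, main_bottleneck(profile))
```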
Performance can be viewed in different ways using various metrics. Execution
profiles provide a basis to correlate different views in order to provide a coherent picture
of parallel performance. Speedup, efficiency, and MFLOPS are defined on the basis of
execution profiles in ways that reveal precisely the roles of key factors such as load
balance.
3.2 Load Balance
Load balance is an important attribute of performance in parallel computing. For
the class of algorithms considered in this analysis, load balance can be thought of as
the degree of utilization of processors averaged over all "compute only" steps after the
memory and miscellaneous overheads are factored out. The following definition of the
Load Balance Factor ($LB_f$) is such that $LB_f$ ranges between zero and one, with one
corresponding to the best utilization of processors.

$$LB_f(N) = \frac{\mathit{nflops}(N)\, t_{flop}}{P^2\, f_{comp}(N)}$$

$\mathit{nflops}(N)$: number of normalized floating point operations for sequential
computation ($P = 1$)
$t_{flop}$: time for a single normalized floating point operation
$f_{comp}(N)$: total time for floating point operations done in parallel
$N$: problem size parameter
$P \times P$: processor array size

To deal with the mixture of fast and slow floating point operations, normalized floating
point operations are used in this paper. For example, on MasPar MP-1 where the ADD
operation takes 127 cycles, and the MULT operation takes 225 cycles, the normalized
FLOPs for these operations are counted as 1 and 1.77 respectively.
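A small sketch of the normalization and of the load balance factor follows. The function names and the sample arguments to `load_balance_factor` are hypothetical; the cycle counts 127 and 225 are the MP-1 values quoted above, with ADD taken as the unit operation.

```python
def normalized_weight(cycles, base_cycles):
    """Weight of one operation in normalized FLOPs, relative to the base operation."""
    return cycles / base_cycles


def load_balance_factor(nflops, t_flop, P, f_comp):
    """LB_f = nflops * t_flop / (P^2 * f_comp) for a P x P processor array."""
    return (nflops * t_flop) / (P * P * f_comp)


# MP-1: MULT (225 cycles) relative to ADD (127 cycles).
print(round(normalized_weight(225, 127), 2))  # 1.77

# Hypothetical case: 1600 normalized flops of unit time on a 4 x 4 array.
# A perfectly balanced run would need f_comp = 100; here it takes 200.
print(load_balance_factor(nflops=1600, t_flop=1.0, P=4, f_comp=200.0))  # 0.5
```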
3.3 Efficiency Based on Work
Traditionally, efficiency is calculated based on the work done. However, in parallel
computing, efficiency is commonly defined as the speedup divided by the number of
processors. The isoefficiency analysis [12, 13] is based on this definition. It has been
argued in [2] that instead of relying on time as a measure of work, efficiency should be
defined by using unit counts based on the size of an indivisible task as the measure of
work. The ratio of work accomplished (wa) to the work expended (we) is proposed in [2]
as the alternative definition of efficiency. Following these ideas, consider a normalized
FLOP as the unit of work. There are some objections to using FLOP as a unit of
work in general [8]. In our case, however, we are considering numerical algorithms
and taking into account memory and other operations separately. Another objection
is that operation count is an imperfect measure of computational work since it does
not standardize across computers [8]. We agree and address this point later in the
context of comparing two machines. With a normalized FLOP as the unit of work, wa
is proportional to MFLOPS and we is proportional to peak MFLOPS. Assuming a
normalization is used, the ratio of MFLOPS to peak MFLOPS can be considered as the
alternate definition for efficiency ($\mathit{Eff}(N)$). As shown below, the inefficiency resulting
from communication and other overheads is captured by the fractional term, and the
inefficiency due to idle processors is represented by the load balance factor.

$$\mathit{Eff}(N) = \frac{f_{comp}(N)}{f_{comp}(N) + f_{comm}(N) + f_{misc}(N) + f_{mem}(N)} \times LB_f(N)$$

Interestingly, for examples provided in [2], the commonly used definition and the
alternate definition of efficiency both led to the same results. The following observation
may explain why it is so. On resubstituting for $LB_f$ and using the traditional definition
of speedup, it becomes clear that both definitions of efficiency lead to the same formula.
This can be verified by using the following formula for speedup as the ratio of the
sequential execution time to the parallel execution time.

$$\mathit{Speedup}(N) = \frac{\mathit{nflops}(N)\, t_{flop}}{f_{comp}(N) + f_{comm}(N) + f_{misc}(N) + f_{mem}(N)}$$

The overheads due to memory operations and miscellaneous operations are also present
in sequential processing. We have not factored those out and are in effect measuring
the overall efficiency by accounting for all sources of inefficiency.
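The equivalence of the two definitions can be written out in one line by substituting the definition of the load balance factor from Section 3.2. The shorthand $T(N)$ for the sum of the four components is introduced here only for brevity and does not appear in the report.

```latex
\begin{align*}
T(N) &= f_{comp}(N) + f_{comm}(N) + f_{misc}(N) + f_{mem}(N) \\
\mathit{Eff}(N)
  &= \frac{f_{comp}(N)}{T(N)} \times \frac{\mathit{nflops}(N)\, t_{flop}}{P^2\, f_{comp}(N)}
   = \frac{\mathit{nflops}(N)\, t_{flop}}{P^2\, T(N)}
   = \frac{\mathit{Speedup}(N)}{P^2}
\end{align*}
```

Since $P^2$ is the number of processors in the $P \times P$ array, the last expression is the commonly used definition of efficiency as speedup divided by the number of processors.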
3.4 MFLOPS, Efficiency and Execution time
First, consider the MFLOPS measure. The normalized MFLOPS are given by:

$$\mathit{MFLOPS}(N) = \frac{\mathit{nflops}(N) \times 10^{-6}}{T_{exec}}$$

$T_{exec}$: experimentally measured parallel computation time for size $N$

Based on our earlier discussions, MFLOPS can also be calculated by:

$$\mathit{MFLOPS}(N) = \text{Peak MFLOPS} \times \mathit{Eff}(N)$$

$$\text{Peak MFLOPS} = \frac{\text{clock rate}}{\text{number of cycles per normalized flop}} \times P^2$$
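As an illustrative sketch, the two MFLOPS formulas can be evaluated with the values given elsewhere in the report: the 12.5 MHz clock rate of Section 4.1 and the ADD cycle counts of Table 3 (ADD being the unit normalized flop). The function names are assumptions, the clock rate is assumed to be expressed in MHz so that the result is in MFLOPS, and P = 128 and P = 64 are inferred from the 16K and 4K processor counts of a P x P array.

```python
def peak_mflops(clock_mhz, cycles_per_nflop, P):
    """Peak MFLOPS = clock rate / cycles per normalized flop * P^2."""
    return clock_mhz / cycles_per_nflop * P * P


def mflops(nflops, t_exec):
    """Normalized MFLOPS from an operation count and a run time in seconds."""
    return nflops * 1e-6 / t_exec


def mflops_from_efficiency(peak, eff):
    """MFLOPS = Peak MFLOPS * Eff."""
    return peak * eff


mp1_peak = peak_mflops(12.5, 127, 128)  # 16K processor MP-1
mp2_peak = peak_mflops(12.5, 26, 64)    # 4K processor MP-2
print(round(mp1_peak, 1), round(mp2_peak, 1))
```

Under these assumptions the 4K processor MP-2 has the higher peak rate even though it has a quarter of the processors, which is the kind of comparison the rest of this section examines.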
The question is what is a good measure of performance to compare different
machines based on a given algorithm. A reasonable way is to interpret higher performance
as accomplishing more useful work in the same amount of time. Intuitively, one may
think that the efficiency could serve the purpose. However, efficiency can be a
misleading measure for comparison of different machines. A machine may be less efficient, but
could still perform more work because it is faster than the other machine. This suggests
that one should really consider the product of the efficiency and the rate of work of a
machine. If a normalized FLOP is considered as the unit of work, then MFLOPS is such
a measure.
MFLOPS also has a problem. The difficulty lies in using a normalized FLOP as a
unit of work across different machines. In spite of normalization, the same work (for
example the multiplication of two matrices of a given size) can translate into different
FLOPS on different machines. This problem can be addressed in a couple of different
ways. One solution is to consider a unit of work that depends on the application, not
on the machine. For example, an addition plus a multiplication is a viable unit of
work to compare performance of matrix multiplication on different machines. Another
solution can be to require a conversion rate to convert a FLOP from one machine to
another machine. The conversion is done so that the number of FLOPs corresponding
to the same work is unchanged in going from one machine to another machine. The
definition of work and the conversion rate depend on the algorithm.
Thus, in comparing different machines with respect to a given algorithm, there are
really three issues: the efficiency, the rate of work, and the unit of work. The bottom
line is always the execution time, assuming the accuracy of the calculations is
satisfactory. With the use of an appropriate conversion rate, higher MFLOPS numbers do
indeed mean lower execution times. As a concrete example, for matrix multiplication one
normalized FLOP on the MasPar MP-1 should be converted to $2.58/2.77$ normalized FLOPs
on the MP-2. This is because the number of normalized FLOPs for an addition plus a
multiplication is 2.58 on MP-2 and 2.77 on MP-1. The LU decomposition kernel uses the
same floating point operations as matrix multiplication, thus the same conversion rate
is applicable for both. An analysis of the FFT kernel shows that the conversion rate is
one for that algorithm. Incidentally, without the conversion the MFLOPS numbers on MP-1
are inflated when used for measuring the performance of matrix multiplication and LU
decomposition.
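The conversion described above amounts to a single rescaling. The following is a minimal sketch, not part of the original study: the function and constant names are hypothetical, and only the two FLOP counts (2.77 on MP-1, 2.58 on MP-2 for one addition plus one multiplication) are taken from the text.

```python
# Normalized FLOPs charged for one addition plus one multiplication
# (values quoted in the text for matrix multiplication and LU decomposition).
MP1_FLOPS_PER_MULT_ADD = 2.77
MP2_FLOPS_PER_MULT_ADD = 2.58


def mp1_to_mp2_flops(flops_mp1):
    """Convert a normalized FLOP count measured on MP-1 into MP-2 FLOPs.

    The same amount of work (a number of add-plus-multiply units) must map
    to the same converted FLOP count on both machines.
    """
    return flops_mp1 * MP2_FLOPS_PER_MULT_ADD / MP1_FLOPS_PER_MULT_ADD


def mp1_to_mp2_mflops(mflops_mp1):
    """Rescale an MP-1 MFLOPS figure into MP-2 units (same factor as above)."""
    return mp1_to_mp2_flops(mflops_mp1)


if __name__ == "__main__":
    # One add-plus-multiply unit: 2.77 FLOPs on MP-1 corresponds to 2.58 on MP-2.
    print(mp1_to_mp2_flops(MP1_FLOPS_PER_MULT_ADD))
```

Because the factor is below one, an unconverted MP-1 MFLOPS figure overstates the work done relative to MP-2, which is the inflation noted in the text.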
4 Case Study
This study was done on a 16K processor MasPar MP-1 with 16K bytes of memory per
processor and a 4K processor MP-2 machine with 64K bytes of memory per processor.
PM models of matrix multiplication, LU decomposition, and FFT are considered. Two
implementations of each algorithm were studied to illustrate the analysis and impact
of memory operations. The second implementation included software pipelining to
reduce the time for memory operations. The highest level of compiler optimization
was used with both implementations. First, a pre-analysis was done assuming the memory
overlap ratio to be zero in the model. Second, a post-analysis was done by including a
non-zero overlap ratio based on the experimental data from the second implementation,
which introduced significant memory overlap as a result of software pipelining.
4.1 Parallel Machines
MasPar MP-1 and MP-2 machines are based on a single-instruction stream, multiple-data
stream (SIMD) architecture with processors arranged in a two-dimensional toroidal
grid. A parallel program runs on the array control unit (ACU), which broadcasts
instructions to the processors. The communication operations on MasPar and their
costs were discussed earlier.
The MP-1 and MP-2 machines have a clock rate of 12.5 MHz and identical instruction
sets. However, the MP-1 uses 4-bit processors while the MP-2 uses 32-bit
processors. The MP-2 processor can perform floating point operations four to five
times faster than the MP-1 processor. Measured cycle times for several instructions
are shown in Table 3. There is no cache memory on either machine, and each processor
has forty 32-bit registers. Memory accesses are done only through LOAD and
STORE instructions. Other instructions, including interprocessor communication, are
all register based.
Table 4: Accuracy of Execution Time Predictions
(model = PM model prediction; exper. = experimental measurement; diff. = difference)

                              16K MP-1                      4K MP-2
Algorithm          N      model   exper.   diff.     model   exper.   diff.
                          (sec)    (sec)    (%)      (sec)    (sec)    (%)
Matrix             1024     2.47     2.56   3.27      2.33     2.36   1.27
Multiplication     1536     7.90     8.10   2.49      7.46     7.52   0.76
                   2048    18.22    18.58   1.91     17.26    17.34   0.50
                   2560    34.98    35.54   1.56     33.13    33.23   0.28
                   3072    59.73    60.56   1.37     56.69    56.78   0.17
                   3584    94.13    95.26   1.18     89.31    89.45   0.16
                   4096   139.58   141.03   1.02    132.58   132.72   0.11
LU                 1024     2.05     2.09   1.60      1.77     1.80   1.74
Decomposition      2048    10.00    10.11   1.06      9.44     9.58   1.45
                   3072    28.17    28.40   0.81     27.58    27.91   1.20
                   4096    60.82    61.23   0.66     60.66    61.31   1.05
                   5120   112.27   112.89   0.55    113.15   114.23   0.95
                   6144   186.77   187.66   0.48    189.56   191.23   0.88
                   7168   288.60   289.82   0.42    294.33   296.75   0.82
Fast Fourier       2^19    0.147    0.147   0.53     0.181    0.181   0.08
Transform          2^20    0.303    0.302   0.50     0.369    0.369   0.11
                   2^21    0.623    0.621   0.46     0.754    0.755   0.14
                   2^22    1.281    1.276   0.41     1.539    1.541   0.17
                   2^23    2.632    2.623   0.37     3.140    3.147   0.20
                   2^24    5.404    5.387   0.33     6.407    6.421   0.23
4.2 Validation of Models
To validate PM models, their predictions are compared with experimental results on
MasPar MP-1 and MP-2 machines. We compared the model and the experimental
results for the four parts of the execution time separately. Instead of presenting the
individual comparison for each part, the comparison of the total execution time is
presented in Table 4. The results show that the models are accurate in all cases: the
largest difference is 3.27%, and most differences are below 2%.
4.3 Pre-Analysis: Identifying Performance Bottleneck
PM models yield execution profiles that can provide clues for improving performance.
The profiles for the three algorithms are shown in Tables 5, 6, and 7. The execution
profiles include the total execution time and its breakdown into computation,
communication, miscellaneous overheads, and memory operations. Note that the components
other than the computation should be as small as possible for high performance.
The pre-analysis tables 5 and 6 show that memory operations account for a significant
portion of the execution time. For matrix multiplication, miscellaneous overheads
and interprocessor communication together constitute only a small part (about 10% or
less) of the execution time, but memory operations account for as much as 37% on MP-1
and 52% on MP-2. It gets worse with LU decomposition, as it is a more memory-access
intensive algorithm. Miscellaneous overheads for LU decomposition are significant for
smaller problem sizes, but they decrease for larger problems. The performance profile
for FFT (Table 7) is quite different. It is clear that memory operations are not the
problem. The performance loss with FFT is mainly due to interprocessor communication.
The pre-analysis suggests that the performance of matrix multiplication and LU
decomposition could be significantly improved by using techniques that minimize the
time for memory operations.
4.4 Memory Access Optimization
The performance loss due to memory operations can be minimized by exploiting the
organization of the memory and how it works. We used blocking and software pipelining.
Since there is no cache memory on MasPar, blocking was implemented using the
registers. Software pipelining was found to be more critical for performance improvement
on MasPar machines.
Table 5: Pre-Analysis by Model: Matrix Multiplication

       N                 1024   1536   2048   2560   3072   3584   4096
MP-1   comp %            57.0   59.7   61.1   62.0   62.6   63.0   63.3
       comm %             1.7    1.2    0.9    0.8    0.6    0.6    0.5
       mem access %      37.2   35.6   34.7   34.1   33.8   33.5   33.3
       misc %             4.1    3.5    3.3    3.1    3.0    2.9    2.9
MP-2   comp %            37.2   39.0   39.8   40.3   40.7   40.9   41.1
       comm %             2.8    1.9    1.5    1.2    1.0    0.9    0.8
       mem access %      52.3   51.9   51.7   51.6   51.5   51.4   51.4
       misc %             7.7    7.2    7.0    6.9    6.8    6.8    6.7
Table 6: Pre-Analysis by Model: LU Decomposition

       N                 1024   2048   3072   4096   5120   6144   7168
MP-1   comp %            36.6   44.8   48.6   50.7   52.1   53.0   53.7
       comm %            13.6    9.0    6.6    5.2    4.3    3.7    3.2
       mem access %      31.6   36.6   38.2   38.9   39.3   39.5   39.7
       misc %            18.2    9.6    6.6    5.2    4.3    3.8    3.4
MP-2   comp %            23.9   28.4   30.3   31.4   32.1   32.5   32.9
       comm %            14.2    8.9    6.4    5.0    4.1    3.5    3.0
       mem access %      45.7   52.6   55.1   56.4   57.2   57.7   58.1
       misc %            16.2   10.1    8.2    7.2    6.6    6.3    6.0
Table 7: Pre-Analysis by Model: Fast Fourier Transform

       N                 2^19   2^20   2^21   2^22   2^23   2^24
MP-1   comp %            52.9   53.3   53.9   54.4   54.8   55.2
       comm %            31.0   30.0   29.0   28.1   27.2   26.4
       mem access %       4.8    5.4    6.0    6.5    7.0    7.5
       misc %            11.3   11.2   11.1   11.0   11.0   10.9
MP-2   comp %            34.3   34.7   35.1   35.4   35.7   36.1
       comm %            46.2   44.8   43.6   42.4   41.3   40.2
       mem access %       9.5   10.4   11.2   12.0   12.7   13.4
       misc %            10.0   10.1   10.1   10.2   10.3   10.3
register a, b, c;
for i = 0 to M-1
begin
    for j = 0 to M-1
    begin
        c = C(i,j);
        for k = 0 to M-1
        begin
            a = A(i,k);
            b = B(k,j);
            c += a * b;
        end
        C(i,j) = c;
    end
end
(basic version)

register a0, a1, b0, b1, c;
for i = 0 to M-1
begin
    for j = 0 to M-1
    begin
        c = 0.0;
        a0 = A(i,0);
        b0 = B(0,j);
        for k = 0 to M-2
        begin
(1)         a1 = A(i,k+1);
(2)         b1 = B(k+1,j);
(3)         c += a0 * b0;
            a0 = a1;
            b0 = b1;
        end
        c += a0 * b0;
        C(i,j) += c;
    end
end
(software pipelined version)
Figure 1: An example of software pipelining applied to matrix multiplication
4.4.1 Software Pipelining Technique
Software pipelining is used to reduce the overhead of accessing the memory. This
technique has been previously studied [14, 17] for VLIW and other architectures. The
technique is commonly used on RISC workstations. On MasPar, we had to apply the
technique by hand to source level programs, changing the order of operations in
successive iterations of a loop so that data could be prefetched. Software pipelining
helps if the hardware can overlap prefetching of data with computation and communication.
We applied software pipelining to computation loops with floating point operations and
also to communication loops that move a block of data from the local memory of one
processor to another processor.
The software pipelining technique is illustrated in Figure 1 by the example of matrix
multiplication. In the basic version of the matrix multiplication loop,
elements of the A and B arrays are used for floating point operations immediately after
they are accessed. As a result, floating point operations cannot start until the memory
accesses are complete. In the pipelined loop, on the other hand, the array elements
needed by the next iteration are prefetched in lines (1) and (2). This prefetching is
overlapped with the floating point computation done in line (3). Software pipelining can
be combined with loop unrolling for further improvement in performance.
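The loop restructuring can be checked for correctness independently of the hardware. The sketch below is an illustration in Python, not the authors' MasPar code: it transcribes the two loop orderings and confirms that they compute the same product. Python does not overlap loads with arithmetic, so this shows only that the reordering preserves the result, not the timing benefit. The function names are hypothetical.

```python
def matmul_basic(A, B, C):
    """Accumulate A*B into C; each operand is loaded right before its use."""
    M = len(A)
    for i in range(M):
        for j in range(M):
            c = C[i][j]
            for k in range(M):
                a = A[i][k]
                b = B[k][j]
                c += a * b
            C[i][j] = c
    return C


def matmul_pipelined(A, B, C):
    """Same computation, with operands for step k+1 fetched during step k."""
    M = len(A)
    for i in range(M):
        for j in range(M):
            c = 0.0
            a0 = A[i][0]          # prologue: fetch operands for k = 0
            b0 = B[0][j]
            for k in range(M - 1):
                a1 = A[i][k + 1]  # (1) prefetch next element of A
                b1 = B[k + 1][j]  # (2) prefetch next element of B
                c += a0 * b0      # (3) compute with already-loaded operands
                a0, b0 = a1, b1
            c += a0 * b0          # epilogue: last pair has no successor
            C[i][j] += c
    return C


if __name__ == "__main__":
    A = [[1.0, 2.0], [3.0, 4.0]]
    B = [[5.0, 6.0], [7.0, 8.0]]
    zeros = lambda: [[0.0, 0.0], [0.0, 0.0]]
    print(matmul_basic(A, B, zeros()) == matmul_pipelined(A, B, zeros()))
```

The inner loop of the pipelined version runs one iteration fewer than the basic version, with the final multiply-add moved into an epilogue, so that no element beyond the end of a row or column is ever fetched.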
4.4.2 Measurement of Memory Overlap
This section illustrates how memory access optimization is accounted for, and how
significant its impact on performance is. The impact of overlapping memory operations
is measured by the overlap ratio ($O_r$), based on the equation

$$ f_{mem} \times (1 - O_r) = T_{mem}. $$

As defined originally, $f_{mem}$ gives the time for memory operations in the absence of
overlap, and $T_{mem}$ is the experimentally measured time for memory operations in the
presence of software pipelining. The overlap ratio plays a role similar to the hit ratio
in analyzing cache memory performance. Like the cache hit ratio, the overlap ratio has a
value between 0 and 1, and the closer it is to 1, the higher the performance.
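Solving the defining equation for the overlap ratio gives $O_r = 1 - T_{mem}/f_{mem}$. A minimal sketch of both directions of that relation follows; the function names and the sample times are hypothetical and are not measurements from the study.

```python
def overlap_ratio(f_mem, t_mem):
    """Return O_r from the no-overlap model time f_mem and measured time t_mem.

    Rearranges f_mem * (1 - O_r) = t_mem.
    """
    if f_mem <= 0:
        raise ValueError("f_mem must be positive")
    return 1.0 - t_mem / f_mem


def effective_memory_time(f_mem, o_r):
    """Return the memory time left visible after overlap: f_mem * (1 - O_r)."""
    if not 0.0 <= o_r <= 1.0:
        raise ValueError("overlap ratio must lie between 0 and 1")
    return f_mem * (1.0 - o_r)


if __name__ == "__main__":
    # Hypothetical example: the model predicts 1.0 s of memory time without
    # overlap, and 0.3 s is measured with software pipelining.
    o_r = overlap_ratio(1.0, 0.3)
    print(o_r, effective_memory_time(1.0, o_r))
```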
The overlap resulting from software pipelining is expected to increase up to a point
with an increasing number of pipelined iterations of the for loop. In a pipelined
operation, the efficiency increases with the number of jobs until it levels off at a
maximum value. The same trend is observed for the overlap ratio. The overlap ratio
depends on the algorithm, the architecture of the machine, and the problem size. It
increases with the local problem size until it levels off, as shown in Figures 2 and 3.
Note that each figure refers to the total problem size and not the local size at each
processor. The corresponding local problem sizes are larger on MP-2, as it has only 4K
processors compared to 16K processors on MP-1.
[Figure 2: Memory Overlap Ratio on 16K MP-1. Overlap ratio (axis range 0.65 to 0.90)
versus problem size (N) for Matrix Multiplication, LU Decomposition, and Fast Fourier
Transform.]
[Figure 3: Memory Overlap Ratio on 4K MP-2. Overlap ratio (axis range 0.65 to 0.90)
versus problem size (N) for Matrix Multiplication, LU Decomposition, and Fast Fourier
Transform.]

LU decomposition shows higher overlap than matrix multiplication on MP-1, and
it is the other way around on MP-2. The LU decomposition kernel is more memory
intensive; it requires an additional STORE operation compared to the matrix
multiplication kernel. We verified that if an additional STORE operation is included
(redundantly) in the matrix multiplication kernel, then its overlap ratio matches
closely with that of LU decomposition.
Memory accesses need to be factored into realistic models of parallel computing.
Memory accesses can have a significant impact on performance even in parallel
computing. For two out of the three algorithms in our study, the memory access cost in
fact turns out to be substantially higher than the interprocessor communication cost.
Memory access times can vary significantly due to the memory hierarchy and the overlap
of memory accesses with other operations. We have addressed memory overlap, which is
the relevant issue on MasPar machines, where there is no cache memory but the overlap
is a significant factor. As future research, it will be worthwhile to do case studies on
other machines with cache memory. There is extensive literature on performance
analysis of cache memory which needs to be explored in the context of realistic modeling
for parallel machines with distributed memory.
4.5 Post-Analysis
A "post-analysis" was done to study performance after it was improved by software
pipelining. To account for the memory overlap, $f_{mem}$ is replaced by
$f_{mem} \times (1 - O_r)$ in the post-analysis. A comparison of execution times
between pre-analysis and post-analysis (Table 8) shows that a significant improvement
in performance is possible on MasPar machines by overlapping memory operations with
other operations.

Table 8: Improvements by Overlapping of Memory Operations

                                 16K MP-1                         4K MP-2
Algorithm          N      pre-anal.  post-anal.  improve-   pre-anal.  post-anal.  improve-
                            (sec)      (sec)     ment (%)     (sec)      (sec)     ment (%)
Matrix             1024      3.24       2.56       26.7        3.77       2.36       60.0
Multiplication     1536     10.43       8.10       28.8       12.18       7.52       62.0
                   2048     24.16      18.58       30.0       28.24      17.34       62.8
                   2560     46.50      35.54       30.9       54.42      33.23       63.8
                   3072     79.59      60.56       31.4       93.18      56.78       64.1
                   3584    125.52      95.26       31.8      147.00      89.45       64.3
                   4096    186.39     141.03       32.2      218.27     132.72       64.5
LU                 1024      2.64       2.09       26.6        2.59       1.80       43.8
Decomposition      2048     13.70      10.11       35.5       15.09       9.58       57.5
                   3072     39.55      28.40       39.2       45.52      27.91       63.1
                   4096     86.51      61.23       41.3      101.88      61.31       66.2
                   5120    160.93     112.89       42.6      192.14     114.23       68.2
                   6144    269.14     187.66       43.4      324.30     191.23       69.6
                   7168    417.48     289.82       44.0      506.35     296.75       70.6
Fast Fourier       2^19     0.152      0.147        4.1       0.194      0.181        7.5
Transform          2^20     0.315      0.302        4.6       0.400      0.369        8.3
                   2^21     0.652      0.621        5.1       0.823      0.755        9.1
                   2^22     1.347      1.276        5.5       1.693      1.541        9.8
                   2^23     2.778      2.623        5.9       3.477      3.147       10.5
                   2^24     5.727      5.387        6.3       7.138      6.421       11.2

The post-analysis tables 9, 10 and 11 provide a quantitative picture of how different
overheads impact performance. The following trends are observed for the three
algorithms when overheads are considered as percentages of the total execution time.
For matrix multiplication, memory is the dominant overhead. For LU decomposition,
miscellaneous and communication overheads are also high, but only for smaller
problems. As the problem size increases, the other two overheads diminish and memory
becomes the dominant overhead. Software pipelining helps substantially, and more so
in the case of MP-2, but the cost of memory operations still remains relatively high.
For FFT, the communication overhead is the most significant, followed by the
miscellaneous overhead; memory overhead is very low.
Table 9: Post-Analysis by Model (a): Matrix Multiplication

       N                 1024   1536   2048   2560   3072   3584   4096
MP-1   comp %            74.6   78.9   81.1   82.4   83.5   84.1   84.7
       comm %             2.3    1.6    1.2    1.0    0.8    0.7    0.6
       mem access %      17.7   14.9   13.4   12.5   11.7   11.3   10.9
       misc %             5.4    4.6    4.3    4.1    4.0    3.9    3.8
MP-2   comp %            60.4   63.5   65.2   66.3   66.9   67.5   67.8
       comm %             4.5    3.2    2.4    2.0    1.7    1.4    1.3
       mem access %      22.7   21.6   21.0   20.4   20.2   20.0   19.9
       misc %            12.4   11.7   11.4   11.3   11.2   11.1   11.0

Table 10: Post-Analysis by Model (a): LU Decomposition

       N                 1024   2048   3072   4096   5120   6144   7168
MP-1   comp %            47.1   61.4   68.2   72.0   74.7   76.4   77.8
       comm %            17.5   12.3    9.3    7.5    6.2    5.3    4.6
       mem access %      12.0   13.2   13.2   13.1   12.9   12.8   12.7
       misc %            23.4   13.1    9.3    7.4    6.2    5.5    4.9
MP-2   comp %            34.9   45.5   50.1   52.8   54.5   55.7   56.7
       comm %            20.8   14.2   10.6    8.4    7.0    6.0    5.2
       mem access %      20.5   24.1   25.8   26.7   27.2   27.6   27.8
       misc %            23.8   16.2   13.5   12.1   11.3   10.7   10.3

Table 11: Post-Analysis by Model (a): Fast Fourier Transform

       N                 2^19   2^20   2^21   2^22   2^23   2^24
MP-1   comp %            54.7   55.6   56.4   57.1   57.8   58.4
       comm %            32.1   31.2   30.3   29.5   28.7   28.0
       mem access %       1.5    1.6    1.7    1.8    1.9    2.0
       misc %            11.7   11.6   11.6   11.6   11.6   11.6
MP-2   comp %            36.9   37.6   38.3   39.0   39.6   40.2
       comm %            49.7   48.7   47.6   46.6   45.7   44.8
       mem access %       2.6    2.8    3.0    3.2    3.4    3.5
       misc %            10.8   10.9   11.1   11.2   11.3   11.5

(a) Post-Analysis shows performance after memory access optimization is done.

Next, we analyze efficiency, which is affected by overheads and the load balance.
The efficiency curves on MP-1 and MP-2 are shown in Figures 4 and 5. Matrix
multiplication has the least overhead plus the best possible load balance, thus it
achieves the highest efficiency among the three algorithms. After software pipelining,
the overall overhead for LU decomposition becomes smaller compared to FFT, especially
for large problems. The load balance for LU decomposition is low for small size
problems, but it improves due to 2-D scattered decomposition as the problem size
increases. Using the formula from Section 3.2, it was checked that the load balance
factor ($LB_f$) for LU decomposition changed from 0.64 to 0.94 on MP-1 and from 0.76 to
0.96 on MP-2. The net result is that the efficiency curve for LU decomposition
eventually takes off, and is much higher than the FFT curve. For matrix multiplication
and FFT, it is easy to see from the parallel algorithm itself that $LB_f = 1$, i.e.,
processors are fully utilized when the problem size is a multiple of the PE array size.

[Figure 4: Efficiency on 16K MP-1. Efficiency (axis range 0.3 to 1.0) versus problem
size (N) for Matrix Multiplication, LU Decomposition, and Fast Fourier Transform.]
[Figure 5: Efficiency on 4K MP-2. Efficiency (axis range 0.3 to 1.0) versus problem
size (N) for Matrix Multiplication, LU Decomposition, and Fast Fourier Transform.]
4.6 Comparison of Two Machines
We will use PM models to compare two machines. This comparison provides a concrete
example to study an important issue in parallel computing, namely, "which choice is
better: a small number of powerful processors or a large number of less powerful
processors?" MP-1 has 16K simple 4-bit processors whereas MP-2 has 4K 32-bit
processors. Each MP-2 processor is four to five times faster than an MP-1 processor in
terms of floating point computation. The peak ratings of the 16K processor MP-1 and the
4K processor MP-2 are 1613 and 1969 normalized MFLOPS, respectively. The two machines
have the same amount of total memory, thus it is possible to compare problems of the
same size on both machines. The three algorithms used for the comparison are useful
for getting different perspectives. We will compare the two machines in terms of
overheads, efficiency, MFLOPS, and execution times.
The impact of overheads turns out to be significantly different on MP-1 and MP-2.
For each algorithm, we compare the data for the same size problems on MP-1 and
MP-2. As seen from the post-analysis tables 9, 10 and 11, all overheads including
memory, communication, and miscellaneous are significantly higher on MP-2. This can
be understood on the basis of two factors related to differences in architectural
parameters. First, only certain operations are faster on MP-2, and those are also not
in the same proportion. For example, floating point operations are four to five times
faster, but memory operations are only twice as fast compared to MP-1. The
communication operations and auxiliary instructions leading to miscellaneous overheads
are not faster at all. Secondly, MP-1 is a larger machine where more processors are
connected to each other, thus its communication bandwidth is higher.

[Figure 6: Performance Comparison based on MFLOPS. Percentage difference between the
two machines (axis range -10% to 30%, with regions marked "MP-1 > MP-2" and
"MP-2 > MP-1") versus problem size (N) for Matrix Multiplication, LU Decomposition,
and Fast Fourier Transform.]
[Figure 7: Performance Comparison based on Execution time. Percentage difference
between the two machines (axis range -10% to 30%, with regions marked "MP-1 > MP-2"
and "MP-2 > MP-1") versus problem size (N) for Matrix Multiplication, LU
Decomposition, and Fast Fourier Transform.]
Next, we compare the two machines in terms of efficiency and MFLOPS. In all
cases, the efficiency on MP-2 is lower (compare Figures 4 and 5). A smaller machine
can achieve higher load balance, which helps efficiency. In this case, however, the main
factor is overheads, which are significantly higher on MP-2. Although MP-2 is less
efficient, it is faster and has a higher peak MFLOPS rating than MP-1. So we compare
the two machines in terms of MFLOPS. The 4K processor MP-2, in all instances,
achieves lower MFLOPS than the 16K processor MP-1 (see Figure 6).
The comparison based on overheads, efficiency, and MFLOPS implies that MP-2 is a
worse machine than MP-1. Before we accept that conclusion, let us compare execution
times. A comparison of execution times is shown in Figure 7. It is seen that MP-2
is better than MP-1 for matrix multiplication in all cases; it is also better for LU
decomposition with smaller problems. These results are not surprising in light of the
earlier discussion in Section 3.4. It was pointed out that the MFLOPS numbers on
MP-1 are inflated for matrix multiplication and LU decomposition. To get a coherent
picture of performance, we need to convert MFLOPS from one machine to another. The
conversion rates are given in Section 3.4. If the MFLOPS comparison is redone using the
proper conversion, then it turns out to be exactly the same as the execution time
comparison. In fact, for FFT, the conversion rate is one, which is consistent with the
observation that the MFLOPS and the execution time comparison curves are almost
identical for that algorithm (compare Figures 6 and 7).
4.7 Predictions for a Future Machine
We illustrate how PM models can be used to make performance predictions for a future
generation machine. For a new machine, many different alternatives may be of interest.
For example, it may be necessary to consider the impact of increasing processor speed,
improving memory access times, enhancing communication hardware, or increasing the
number of processors. We use PM models to predict performance when the number of
processors is increased from 4K to 16K in a future MP-2 machine.

Table 12: Speedup Predictions for 16K MP-2 over 4K MP-2

                           4K MP-2      16K MP-2     relative
Algorithm          N      time (sec)   time (sec)    speedup
Matrix             1024       2.36         0.81         2.9
Multiplication     1536       7.52         2.19         3.4
                   2048      17.34         4.83         3.6
                   2560      33.23         9.18         3.6
                   3072      56.78        15.24         3.7
                   3584      89.45        23.94         3.7
                   4096     132.72        35.16         3.8
LU                 1024       1.80         0.99         1.8
Decomposition      2048       9.58         3.89         2.5
                   3072      27.91         9.98         2.8
                   4096      61.31        20.42         3.0
                   5120     114.23        36.24         3.2
                   6144     191.23        58.38         3.3
                   7168     296.75        88.42         3.4
Fast Fourier       2^19      0.181        0.067         2.7
Transform          2^20      0.369        0.136         2.7
                   2^21      0.755        0.275         2.7
                   2^22      1.541        0.558         2.8
                   2^23      3.147        1.133         2.8
                   2^24      6.421        2.297         2.8

The speedup predictions for increasing the number of processors from 4K to 16K on
MP-2 are given in Table 12. Execution profiles are provided in Table 13 to give an idea
of how overheads due to interprocessor communication, memory accesses, and auxiliary
instructions are expected to change. If Tables 9, 10, and 11 are compared with Table 13,
it is seen that overheads increase. The increase is most significant in the case of FFT.
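The relative speedup column of Table 12 is the ratio of the two predicted execution times. The sketch below recomputes it for two rows of the table. The function names are hypothetical, and the "scaling efficiency" (speedup divided by the fourfold increase in processors) is an added illustrative quantity, not one reported in the paper.

```python
def relative_speedup(time_small_machine, time_large_machine):
    """Speedup of the larger machine over the smaller one for the same problem."""
    return time_small_machine / time_large_machine


def scaling_efficiency(speedup, processor_ratio):
    """Fraction of the ideal speedup achieved (illustrative, not from the paper)."""
    return speedup / processor_ratio


if __name__ == "__main__":
    # (4K MP-2 time, 16K MP-2 time) in seconds, taken from Table 12.
    rows = {
        "matrix multiplication, N=4096": (132.72, 35.16),
        "LU decomposition, N=1024": (1.80, 0.99),
    }
    for name, (t_4k, t_16k) in rows.items():
        s = relative_speedup(t_4k, t_16k)
        print(name, round(s, 1), round(scaling_efficiency(s, 4), 2))
```

A speedup well below 4, as for the small LU decomposition problem, indicates that overheads rather than computation dominate on the larger machine.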
Table 13: Prediction of Execution Profiles on 16K MP-2

Algorithm          N      comp %   comm %   memory %   misc %
Matrix             1024    43.2      6.5      25.9      24.4
Multiplication     1536    54.1      5.4      26.3      14.2
                   2048    58.3      4.3      25.3      12.1
                   2560    59.8      3.6      25.2      11.4
                   3072    62.3      3.1      23.2      11.4
                   3584    62.9      2.7      23.1      11.3
                   4096    64.0      2.4      22.4      11.2
LU                 1024    20.7     31.3      18.6      29.4
Decomposition      2048    31.7     27.6      18.8      21.9
                   3072    37.9     22.9      21.4      17.8
                   4096    42.0     19.3      23.3      15.4
                   5120    45.2     16.8      24.0      14.0
                   6144    47.3     14.8      24.9      13.0
                   7168    49.0     13.2      25.6      12.2
Fast Fourier       2^19    24.9     66.1       1.4       7.6
Transform          2^20    25.6     65.1       1.6       7.7
                   2^21    26.2     64.2       1.7       7.9
                   2^22    26.8     63.4       1.8       8.0
                   2^23    27.4     62.4       2.0       8.2
                   2^24    28.0     61.6       2.1       8.3
It is dicult to check validity of future predictions, but it may be possible to check
validity of the approach. To check validity of the approach, hypothetical predictions
were made for 16K processor MP-1, and they were checked using the real machine.
A couple of things are worth mentioning about the validation. The memory overlap
ratio depends on the local size of problem, and we veried that it is fairly accurate
to extrapolate the overlap ratio on that basis. The regression formulas for predicting
miscellaneous overheads were developed using test cases on 1K and 4K processors on
MP-1, and their validity was checked on 16K processors.
31
5 Conclusions
This paper presents pragmatic models for analyzing and predicting the performance of
parallel computing. These models, called PM models, are based on a parametric and
micro-level approach to modeling. Software developers can use such models for
analyzing and improving the performance of parallel programs. Hardware designers can
use the models to understand the implications of changing processor, communication, and
memory parameters in order to design a cost-effective and well balanced parallel
machine. The paper discusses various aspects of PM models and demonstrates the utility
of these models through concrete examples.
As future research, the idea of PM models could be extended to study not just the
computation kernels but entire application programs. Automation is desirable to deal
with the complexity of large application programs. One could extend the existing
compiler technology to scan programs in order to develop the precise analytical formulas
needed in the micro-level models. So far, we have done the scanning by hand and found
that it is prone to human errors that can be avoided by automation. Another area for
automation is the modification of assembly programs in order to measure timings for
different parts of the program.
Acknowledgments
We would like to thank the Scalable Computing Laboratory at Iowa State University
for providing the MP-1 and MP-2 machines. We are grateful to numerous friends who
read the manuscript and suggested several changes to improve the presentation and
clarity of the paper.
Appendix A: Analytical formulas for three algorithms
Cannon's Matrix Multiplication: $C = A \times B$

• Theoretical time in each step (assuming A and B are pre-skewed)

  $N \times N$ : matrix size
  $P \times P$ : PE array size
  $M$ : $N/P$
  $m$ : (word size in bits) / (communication channel bandwidth in bits)
  $L$ : the maximum number of messages that can be pipelined together

  1. Dot product calculation of a row of A and a column of B
     $$ M^2 P \,\bigl(M(2T_{load} + T_{mult} + T_{add}) + T_{load} + T_{store}\bigr) $$
     - The $2T_{load}$ term covers the loads of both A and B.
  2. Shift the A matrix to West
     $$ M^2 P \,(T_{load} + mT_{Xt} + T_{store}) + \lceil M^2/L \rceil P\, T_{Xs} $$
  3. Shift the B matrix to North
     $$ M^2 P \,(T_{load} + mT_{Xt} + T_{store}) + \lceil M^2/L \rceil P\, T_{Xs} $$
  Note: For the shift communication on MasPar, xnet[1] is used.

• Computation time
  $$ f_{comp} = P M^3 (T_{mult} + T_{add}) $$

• Communication time
  $$ f_{comm} = 2M^2 P\, mT_{Xt} + 2\lceil M^2/L \rceil P\, T_{Xs} $$

• Memory access time
  $$ f_{mem} = P\,[\,2M^3 T_{load} + M^2(3T_{load} + 3T_{store})\,] $$

• Total time
  $$ f_{exec} = f_{comp} + f_{comm} + f_{mem}
     = P\,[\,M^3(T_{mult} + T_{add} + 2T_{load}) + M^2(2mT_{Xt} + 3T_{load} + 3T_{store})
       + 2\lceil M^2/L \rceil T_{Xs}\,] $$
  - If $L = 1$:
    $$ f_{exec} = P\,[\,M^3(T_{mult} + T_{add} + 2T_{load})
       + M^2(2(T_{Xs} + mT_{Xt}) + 3T_{load} + 3T_{store})\,] $$
  - If $L \ge M^2$:
    $$ f_{exec} = P\,[\,M^3(T_{mult} + T_{add} + 2T_{load})
       + M^2(2mT_{Xt} + 3T_{load} + 3T_{store}) + 2T_{Xs}\,] $$
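The formulas for Cannon's matrix multiplication translate directly into a small evaluator. The following is a sketch under stated assumptions: the function name, the dictionary keys, and the unit cycle counts in the example are hypothetical (they are not the measured MasPar values of Table 3), and the miscellaneous overhead, which the study estimates by regression, is not modeled. The optional `overlap_ratio` argument applies the substitution of $f_{mem}$ by $f_{mem} \times (1 - O_r)$ used in the post-analysis.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def cannon_model(N, P, t, m=1, L=1, overlap_ratio=0.0):
    """Evaluate f_comp, f_comm, f_mem and f_exec (in cycles) for Cannon's algorithm.

    N: matrix dimension, P: PE array dimension (P x P processors),
    t: dict of per-operation times with keys mult, add, load, store, xt, xs,
    m: word size / channel bandwidth, L: messages pipelined together.
    """
    if N % P != 0:
        raise ValueError("N must be a multiple of P")
    M = N // P
    f_comp = P * M**3 * (t["mult"] + t["add"])
    f_comm = 2 * M**2 * P * m * t["xt"] + 2 * ceil_div(M**2, L) * P * t["xs"]
    f_mem = P * (2 * M**3 * t["load"] + M**2 * (3 * t["load"] + 3 * t["store"]))
    f_mem_effective = f_mem * (1.0 - overlap_ratio)
    return {
        "comp": f_comp,
        "comm": f_comm,
        "mem": f_mem_effective,
        "exec": f_comp + f_comm + f_mem_effective,
    }


if __name__ == "__main__":
    # Hypothetical unit costs, chosen only to make the arithmetic easy to follow.
    unit = {"mult": 1, "add": 1, "load": 1, "store": 1, "xt": 1, "xs": 1}
    print(cannon_model(4, 2, unit))
    print(cannon_model(4, 2, unit, overlap_ratio=0.5))
```

With unit costs, N = 4 and P = 2, the result agrees with the closed form for L = 1, since P[M^3 * 4 + M^2 * 10] = 2 * (32 + 40) = 144.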
LU Decomposition

• Theoretical time in each step

  $N \times N$ : matrix size
  $P \times P$ : PE array size
  $M$ : $N/P$
  $m$ : (word size in bits) / (communication channel bandwidth in bits)
  $L$ : the maximum number of messages that can be pipelined together

  1. Find pivot (local - sequential comparison)
     $$ \Bigl(\sum_{i=M}^{1} i\Bigr) P\,(T_{load} + T_{cmp} + T_{neg} + T_{cmp})
        = \frac{M(M+1)}{2} P\,(T_{load} + 2T_{cmp} + T_{neg}) $$
     - The $\sum_{i=M}^{1} i$ term shows that the problem size becomes smaller.
     - The $(T_{cmp} + T_{neg})$ term is for calculating the absolute value.
  2. Find pivot (global - logarithmic comparison)
     $$ N\bigl((\log_2 P)(T_{Xs} + T_{cmp}) + (P-1)mT_{Xp}\bigr)
        + N\bigl((\log_2 P)T_{Xs} + (P-1)mT_{Xp}\bigr) $$
     $$ = 2N\bigl((\log_2 P)T_{Xs} + (P-1)mT_{Xp}\bigr) + N(\log_2 P)T_{cmp} $$
     - The $N((\log_2 P)T_{Xs} + (P-1)mT_{Xp})$ term is for exchanging the location
       of the maximal element.
     - On MasPar, Xnetp[d], a pipelined communication, is used for the above
       communication.
  3. Interchange the pivot row with the current row and broadcast
     $$ 2M^2 P\,(T_{load} + T_{store} + mT_{Xt})
        + 2\lceil M/L \rceil M P\,\bigl(T_{Xs} + \tfrac{P}{2}T_{Xp}\bigr) \qquad (1) $$
     $$ {} + \Bigl(\sum_{i=M}^{1} i\Bigr) P\,(T_{load} + T_{store} + mT_{Xt})
        + \Bigl(\sum_{i=M}^{1} \lceil i/L \rceil\Bigr) P\,(T_{Xs} + PT_{Xp}) \qquad (2) $$
     $$ = 2NM(T_{load} + T_{store} + mT_{Xt})
        + 2N\lceil M/L \rceil\bigl(T_{Xs} + \tfrac{P}{2}T_{Xp}\bigr) $$
     $$ {} + \frac{M(M+1)}{2} P\,(T_{load} + T_{store} + mT_{Xt})
        + \Bigl(\sum_{i=M}^{1} \lceil i/L \rceil\Bigr) P\,(T_{Xs} + PT_{Xp}) $$
     - The $\bigl(\sum_{i=M}^{1} i\bigr) P\,T_{store}$ term is to store the part of the
       pivot array for later calculation.
     - On MasPar, Xnetp[d] and Xnetc[d] are used for (1) and (2) respectively.
  4. Coefficient inversion
     $$ N T_{div} $$
  5. Save inverted coefficient
     $$ M T_{store} $$
  6. Calculate and broadcast multiplier
     $$ \Bigl(\sum_{i=M}^{1} i\Bigr) P\,(T_{load} + T_{store} + T_{mult} + mT_{Xt})
        + \Bigl(\sum_{i=M}^{1} \lceil i/L \rceil\Bigr) P\,(T_{Xs} + PT_{Xp}) $$
     $$ = \frac{M(M+1)}{2} P\,(T_{load} + T_{store} + T_{mult} + mT_{Xt})
        + \Bigl(\sum_{i=M}^{1} \lceil i/L \rceil\Bigr) P\,(T_{Xs} + PT_{Xp}) $$
     - On MasPar, Xnetc[d] is used for the above communication.
  7. Update submatrix
     $$ \Bigl(\sum_{i=M}^{1} i^2\Bigr) P\,(2T_{load} + T_{store} + T_{mult} + T_{add})
        = \frac{M(M+1)(2M+1)}{6} P\,(2T_{load} + T_{store} + T_{mult} + T_{add}) $$
     - The multipliers are kept in a register so that a $T_{store}$ is saved.

• Computation time
  $$ f_{comp} = \frac{M(M+1)}{2} P\,(2T_{cmp} + T_{neg}) + N(\log_2 P)T_{cmp} + NT_{div}
     + \frac{M(M+1)}{2} P\,T_{mult} + \frac{M(M+1)(2M+1)}{6} P\,(T_{add} + T_{mult}) $$
  $$ = P\,\bigl[\,M^3\bigl(\tfrac{1}{3}T_{add} + \tfrac{1}{3}T_{mult}\bigr)
     + M^2\bigl(\tfrac{1}{2}T_{add} + T_{mult} + T_{cmp} + \tfrac{1}{2}T_{neg}\bigr) $$
  $$ \qquad {} + M\bigl(((\log_2 P) + 1)T_{cmp} + \tfrac{1}{2}T_{neg} + T_{div}
     + \tfrac{1}{6}T_{add} + \tfrac{2}{3}T_{mult}\bigr)\bigr] $$
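• Communication time
  $$ f_{comm} = 2N\bigl((\log_2 P)T_{Xs} + (P-1)mT_{Xp}\bigr)
     + 2NM\,mT_{Xt} + 2N\lceil M/L \rceil\bigl(T_{Xs} + \tfrac{P}{2}T_{Xp}\bigr) $$
  $$ \qquad {} + \frac{M(M+1)}{2} P\,mT_{Xt}
     + \Bigl(\sum_{i=M}^{1} \lceil i/L \rceil\Bigr) P\,(T_{Xs} + PT_{Xp}) $$
  $$ \qquad {} + \frac{M(M+1)}{2} P\,mT_{Xt}
     + \Bigl(\sum_{i=M}^{1} \lceil i/L \rceil\Bigr) P\,(T_{Xs} + PT_{Xp}) $$
  $$ = P\,\bigl[\,3M^2 mT_{Xt} + M\bigl(2(\log_2 P)T_{Xs} + 2(P-1)mT_{Xp} + mT_{Xt}\bigr) $$
  $$ \qquad {} + 2M\lceil M/L \rceil\bigl(T_{Xs} + \tfrac{P}{2}T_{Xp}\bigr)
     + 2\Bigl(\sum_{i=M}^{1} \lceil i/L \rceil\Bigr)(T_{Xs} + PT_{Xp})\bigr] $$

• Memory access time
  $$ f_{mem} = \frac{M(M+1)}{2} P\,T_{load} + 2NM(T_{load} + T_{store})
     + \frac{M(M+1)}{2} P\,(T_{load} + T_{store}) + MT_{store} $$
  $$ \qquad {} + \frac{M(M+1)}{2} P\,T_{load} + \frac{M(M+1)}{2} P\,T_{store}
     + \frac{M(M+1)(2M+1)}{6} P\,(2T_{load}) + \frac{M(M+1)(2M+1)}{6} P\,T_{store} $$
  $$ = P\,\bigl[\,M^3\bigl(\tfrac{2}{3}T_{load} + \tfrac{1}{3}T_{store}\bigr)
     + M^2\bigl(\tfrac{9}{2}T_{load} + \tfrac{7}{2}T_{store}\bigr)
     + M\bigl(\tfrac{11}{6}T_{load} + (\tfrac{7}{6} + \tfrac{1}{P})T_{store}\bigr)\bigr] $$

• Total time
  $$ f_{exec} = f_{comp} + f_{comm} + f_{mem} $$
  $$ = P\,\bigl[\,M^3\bigl(\tfrac{1}{3}T_{add} + \tfrac{1}{3}T_{mult} + \tfrac{2}{3}T_{load}
     + \tfrac{1}{3}T_{store}\bigr) $$
  $$ \qquad {} + M^2\bigl(\tfrac{1}{2}T_{add} + T_{mult} + T_{cmp} + \tfrac{1}{2}T_{neg}
     + 3mT_{Xt} + \tfrac{9}{2}T_{load} + \tfrac{7}{2}T_{store}\bigr) $$
  $$ \qquad {} + M\bigl(((\log_2 P) + 1)T_{cmp} + \tfrac{1}{2}T_{neg} + T_{div}
     + \tfrac{1}{6}T_{add} + \tfrac{2}{3}T_{mult} + 2(\log_2 P)T_{Xs} + 2(P-1)mT_{Xp}
     + mT_{Xt} + \tfrac{11}{6}T_{load} + (\tfrac{7}{6} + \tfrac{1}{P})T_{store}\bigr) $$
  $$ \qquad {} + 2M\lceil M/L \rceil\bigl(T_{Xs} + \tfrac{P}{2}T_{Xp}\bigr)
     + 2\Bigl(\sum_{i=M}^{1} \lceil i/L \rceil\Bigr)(T_{Xs} + PT_{Xp})\bigr] $$
  - If $L = 1$:
    $$ f_{exec} = P\,\bigl[\,M^3\bigl(\tfrac{1}{3}T_{add} + \tfrac{1}{3}T_{mult}
       + \tfrac{2}{3}T_{load} + \tfrac{1}{3}T_{store}\bigr) $$
    $$ \qquad {} + M^2\bigl(\tfrac{1}{2}T_{add} + T_{mult} + T_{cmp} + \tfrac{1}{2}T_{neg}
       + 3T_{Xs} + 2PT_{Xp} + 3mT_{Xt} + \tfrac{9}{2}T_{load} + \tfrac{7}{2}T_{store}\bigr) $$
    $$ \qquad {} + M\bigl(((\log_2 P) + 1)T_{cmp} + \tfrac{1}{2}T_{neg} + T_{div}
       + \tfrac{1}{6}T_{add} + \tfrac{2}{3}T_{mult} + (2(\log_2 P) + 1)T_{Xs}
       + (3P-2)mT_{Xp} + mT_{Xt} + \tfrac{11}{6}T_{load}
       + (\tfrac{7}{6} + \tfrac{1}{P})T_{store}\bigr)\bigr] $$
  - If $L \ge M$:
    $$ f_{exec} = P\,\bigl[\,M^3\bigl(\tfrac{1}{3}T_{add} + \tfrac{1}{3}T_{mult}
       + \tfrac{2}{3}T_{load} + \tfrac{1}{3}T_{store}\bigr) $$
    $$ \qquad {} + M^2\bigl(\tfrac{1}{2}T_{add} + T_{mult} + T_{cmp} + \tfrac{1}{2}T_{neg}
       + 3mT_{Xt} + \tfrac{9}{2}T_{load} + \tfrac{7}{2}T_{store}\bigr) $$
    $$ \qquad {} + M\bigl(((\log_2 P) + 1)T_{cmp} + \tfrac{1}{2}T_{neg} + T_{div}
       + \tfrac{1}{6}T_{add} + \tfrac{2}{3}T_{mult} + (2(\log_2 P) + 4)T_{Xs}
       + (5P-2)mT_{Xp} + mT_{Xt} + \tfrac{11}{6}T_{load}
       + (\tfrac{7}{6} + \tfrac{1}{P})T_{store}\bigr)\bigr] $$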
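The LU decomposition formulas rest on a few summation identities: the closed forms for the sums of $i$ and $i^2$, and the two special cases of the pipelined-message sum $\sum \lceil i/L \rceil$ (which becomes $M(M+1)/2$ for $L = 1$ and $M$ for $L \ge M$). The sketch below checks these identities numerically by brute force. The helper names are hypothetical, and the check covers only the sums, not the full timing model.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def sum_i(M):
    """Sum of i for i = 1..M (shrinking local problem size over the steps)."""
    return sum(range(1, M + 1))


def sum_i_squared(M):
    """Sum of i^2 for i = 1..M (submatrix update work)."""
    return sum(i * i for i in range(1, M + 1))


def sum_ceil(M, L):
    """Sum of ceil(i / L) for i = 1..M (number of pipelined message groups)."""
    return sum(ceil_div(i, L) for i in range(1, M + 1))


def check_identities(M):
    """Return True if all closed forms used in the formulas hold for this M."""
    return (
        sum_i(M) == M * (M + 1) // 2
        and sum_i_squared(M) == M * (M + 1) * (2 * M + 1) // 6
        and sum_ceil(M, 1) == M * (M + 1) // 2   # case L = 1
        and sum_ceil(M, M) == M                  # case L >= M
        and sum_ceil(M, 2 * M) == M
    )


if __name__ == "__main__":
    print(all(check_identities(M) for M in range(1, 65)))
```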
Fast Fourier Transform

• Theoretical time in each step

  $N$ : number of elements
  $P \times P$ : PE array size
  $M$ : $N/P^2$
  $m$ : (word size in bits) / (communication channel bandwidth in bits)
  $L$ : the maximum number of messages that can be pipelined together

  1. Calculate initial twiddle factors (see note 1)
     $$ (M/2)(T_{twiddle} + 2T_{store}) $$
  2. Perform $\log_2 M$ in-memory stages
     $$ (M/2)(\log_2 M)\,[\,8(T_{add} + T_{mult}) + 6(T_{load} + T_{store})\,] $$
  3. Perform $\log_2 P^2$ communication stages
     $$ (M/2)\,[\,8 \log_2 P^2\,(T_{add} + T_{mult}) + 6T_{load} + 4T_{store} + 8mPT_{Xt}\,]
        + \lceil M/L \rceil\, 4(\log_2 P)T_{Xs} $$
     - On MasPar, Xnet[d] is used for the communication.

  Note 1: Initial twiddle factor calculation takes 9540 cycles on MP-1 and 2845 cycles
  on MP-2.

• Computation time
  $$ f_{comp} = (M/2)\,[\,T_{twiddle} + 8(\log_2 N)(T_{add} + T_{mult})\,] $$

• Communication time
  $$ f_{comm} = (M/2)\,8mPT_{Xt} + \lceil M/L \rceil\, 4(\log_2 P)T_{Xs} $$

• Memory access time
  $$ f_{mem} = (M/2)\,[\,6(T_{load} + T_{store}) + 6(\log_2 M)(T_{load} + T_{store})\,] $$

• Total time
  $$ f_{exec} = f_{comp} + f_{comm} + f_{mem} $$
  $$ = (M/2)\,[\,T_{twiddle} + 8(\log_2 N)(T_{add} + T_{mult}) + 8mPT_{Xt}
     + 6(T_{load} + T_{store}) + 6(\log_2 M)(T_{load} + T_{store})\,]
     + \lceil M/L \rceil\, 4(\log_2 P)T_{Xs} $$
  - If $L = 1$:
    $$ f_{exec} = (M/2)\,[\,T_{twiddle} + 8(\log_2 N)(T_{add} + T_{mult})
       + 8((\log_2 P)T_{Xs} + mPT_{Xt})
       + 6(T_{load} + T_{store}) + 6(\log_2 M)(T_{load} + T_{store})\,] $$
  - If $L \ge M$:
    $$ f_{exec} = (M/2)\,[\,T_{twiddle} + 8(\log_2 N)(T_{add} + T_{mult}) + 8mPT_{Xt}
       + 6(T_{load} + T_{store}) + 6(\log_2 M)(T_{load} + T_{store})\,]
       + 4(\log_2 P)T_{Xs} $$
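The FFT formulas can be evaluated the same way, which also makes it easy to confirm that the two special cases follow from the general expression. This is a sketch under stated assumptions: the function name, the dictionary keys, and the unit cycle counts are hypothetical, sizes are assumed to be powers of two, and miscellaneous overhead is not modeled.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def log2_exact(x):
    """Return log2(x) for a positive power of two, otherwise raise ValueError."""
    if x < 1 or x & (x - 1):
        raise ValueError("expected a positive power of two")
    return x.bit_length() - 1


def fft_model(N, P, t, m=1, L=1):
    """Evaluate f_comp, f_comm, f_mem and f_exec (in cycles) for the FFT model.

    N: number of elements, P: PE array dimension (P x P processors),
    t: dict with keys twiddle, add, mult, load, store, xt, xs.
    """
    M = N // (P * P)
    half = M / 2
    f_comp = half * (t["twiddle"] + 8 * log2_exact(N) * (t["add"] + t["mult"]))
    f_comm = half * 8 * m * P * t["xt"] + ceil_div(M, L) * 4 * log2_exact(P) * t["xs"]
    f_mem = half * (6 + 6 * log2_exact(M)) * (t["load"] + t["store"])
    return {"comp": f_comp, "comm": f_comm, "mem": f_mem,
            "exec": f_comp + f_comm + f_mem}


if __name__ == "__main__":
    # Hypothetical unit costs; N = 64 elements on a 2 x 2 array gives M = 16.
    unit = {"twiddle": 1, "add": 1, "mult": 1, "load": 1, "store": 1,
            "xt": 1, "xs": 1}
    print(fft_model(64, 2, unit, L=1)["exec"])
    print(fft_model(64, 2, unit, L=16)["exec"])
```

For L = 1 the start-up term becomes M * 4(log2 P) T_Xs = (M/2) * 8(log2 P) T_Xs, which is how it is absorbed into the bracket in the first special case.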
References
[1] Bronson, E. C., and Casavant, T. L. Experimental Application-Driven Architec-
ture Analysis of an SIMD/MIND Parallel Processing system. IEEE Trans. Parallel
Distrib. Systems 1, 2 (Apr. 1990), 195-205.
[2] Carmona, E. A., and Rice, M. D. Modeling the Serial and Parallel Fractions of a
Parallel Algorithm. J. Parallel Distrib. Comput. 13 (1991), 286-298.
[3] Culler, D., Karp, R., Patterson, D., Sahay, A., Schauser, K. E., Santos, E., Sub-
ramonian, R., and Eicken, T. V. LogP: Towards a Realistic Model of Parallel
Computation. Proc. 4th ACM SIGPLAN symp. Principles and Practices of Par-
allel Programming. 1993, pp. 1-12.
[4] Fienup, M. A., and Kothari, S. C. Implementations of Fast Fourier Transform
on the MasPar MP-1 and MP-2. Tech. Rep. TR 93-01, Department of Computer
Science, Iowa State University, Jan. 1993.
[5] Fox, G. C., Johnson, M. A., Lyzenga, G. A., Otto, S. W., Salmon, J. K., and
Walker, D. W. Solving Problems on concurrent Processors Vol. 1, Prentice Hall,
Englewood Clis, NJ, 1988.
[6] Gupta, A., and Kumar, V. The Scalability of FFT on Parallel Computers. IEEE
Trans. Parallel Distrib. Systems 4 (Aug. 1993), 922-932.
[7] Gustafson, J. L., and Snell, Q. O. HINT: A New Way To Measure Computer
Performance. Tech. Rep. IS-5109, Ames Laboratory, July 1994.
[8] Gustafson, J. L. The Consequences of Fixed Time Performance Measurement.
Proc. Twenty-fth Hawaii Internat. Conf. System Sciences Vol.3. 1992, pp. 113-
124.
[9] Hennessy, J. L., and Patterson, D. A. Computer Architecture A Quantitative
Approach, Morgan Kaufmann Publishers, 1990.
[10] Hwang, K. Advanced Computer Architecture Parallelism, Scalability, Programma-
bility, McGraw-Hill, NY, 1993.
40
[11] Karonis, N. T. Timing Parallel Programs That Use Message Passing. J. Parallel
Distrib. Comput. 14 (1992), 29-36.
[12] Kumar, V., Grama, A. Y., Gupta, A., and Karypis, G. Introduction to Parallel
Computing, Design and Analysis of Algorithms, Benjamin/Cummings, CA, 1994.
[13] Kumar, V., and Gupta, A. Analyzing Scalability of Parallel Algorithms and Ar-
chitectures. Tech. Rep. TR91-18, Computer Science Department, University of
Minnesota, June 1991.
[14] Lam, M. S. Software Pipelining: An Eective Scheduling Technique for VLIW
Machines. Proc. ACM SIGPLAN Conf. Prog. Lang. Design and Implementation.
1988, pp. 318-328.
[15] MasPar Assembly Language Reference Manual, MasPar Com puter Corporation,
Sunnyvale, CA, 1990.
[16] Ponnusamy, R., Thakur, R., Choudhary, A., Velamakanni, K., Bozkus, Z., and
Fox, G. Experimental Performance Evaluation of the CM-5. J. Parallel Distrib.
Comput. 19 (1993), 192-202.
[17] Rau, B. R., Lee, M., Tirumalai, P. P., and Schlansker, M. S. Register Allocation
for Software Pipelined Loops. Proceedings of the ACM SIGPLAN '92 Conference
on Programming Language Design and Implementation, 1992.
[18] Sun, X. H., and Rao, V. N. Scalable Problems and Memory-bounded Speedup.
SIAM J. Scientic and Statistical Computing 11 (May 1990), 838-858.
[19] Valiant, L. G. A Bridging Model for Parallel Computation. Comm. ACM 33,
8(Aug. 1990), 103-111.
Iowa State University of Science and Technology
Science with Practice
DEPARTMENT OF COMPUTER SCIENCE
Tech Report: TR94-23
Submission Date: December 5, 1994
