Algorithms for parallel flow solvers on message passing architectures by Van der Wijngaart, Rob F.
NASA-CR-197758
MCAT Institute
Final Report
95-15
(NASA-CR-197758) ALGORITHMS FOR PARALLEL FLOW SOLVERS ON MESSAGE PASSING ARCHITECTURES, Final Report (MCAT Inst.), 32 p, N95-26588, Unclas, G3/34 0048493
Algorithms for parallel flow solvers
on message passing architectures
Rob F. Van der Wijngaart
January 1995
MCAT Institute
3933 Blue Gum Drive
San Jose, CA 95127
NCC2-752
Algorithms for parallel flow solvers
on message-passing architectures
Rob F. Van der Wijngaart
1 Introduction
The purpose of this project has been to identify and test suitable technologies
for implementation of fluid flow solvers--possibly coupled with structures and
heat equation solvers--on MIMD parallel computers. In the course of this
investigation much attention has been paid to efficient domain decomposition
strategies for ADI-type algorithms. In references [1, 2, 3] the near-optimal
properties of the multi-partition strategy were explained, and efficient im-
plementations were presented. These included solving the heat equation on
rectilinear and curvilinear grids on the Intel iPSC/860, and solving the NAS
scalar penta-diagonal parallel benchmark on the iPSC/860, Paragon, IBM
SP2, and on a network of Silicon Graphics workstations connected through
Ethernet. Multi-partitioning derives its efficiency from the assignment of
several blocks of grid points to each processor in the parallel computer. A
coarse-grain parallelism is obtained, and a near-perfect load balance results.
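To make the multi-partition idea concrete, the following minimal sketch (illustrative only; the two-dimensional setting and the function name are assumptions of this example, not part of the cited work) shows a diagonal block-to-processor mapping of the kind that underlies the strategy.

    /* Illustrative sketch: a 2-D multi-partition mapping.  The P x P array of
       grid blocks is assigned to P processors such that each processor owns
       exactly one block in every block-row and in every block-column.  During
       an implicit (ADI-type) sweep along either grid direction each processor
       then always has one block of the active row or column to work on, which
       yields coarse-grain parallelism and a near-perfect load balance.        */
    int multipartition_owner(int bi, int bj, int P)
    {
        return (bi + bj) % P;   /* block (bi, bj), 0 <= bi, bj < P */
    }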
By contrast, in the uni-partitioning strategy every processor receives re-
sponsibility for exactly one block of grid points instead of several. This neces-
sitates fine-grain pipelined program execution in order to obtain a reasonable
load balance. Although fine-grain parallelism is less desirable on many sys-
tems, especially high-latency networks of workstations, uni-partition methods
are still in wide use in production codes for flow problems. Consequently,
it remains important to achieve good efficiency with this technique that has
essentially been superseded by multi-partitioning for parallel ADI-type algo-
rithms.
Another reason for the concentration on improving the performance of
pipeline methods is their applicability in other types of flow solver kernels
that have a stronger implied data dependence than ADI, such as SSOR [4],
or flux-vector-split methods [5].
2 Results and conclusions
An important feature of fine-grain parallel applications on MIMD distributed-
memory computers is the fact that there is no global hardware-supported syn-
chronization. This means that communications may be initiated on certain
processors, while the result is not expected (yet) by the receiving proces-
sors. The trap mechanism for unexpected messages incurs certain overheads
which may lead to significant dynamic load imbalances when implementing
pipeline algorithms. The effect of these overheads gets aggravated by fast
feeding of the pipeline by the first processor, and substantial improvements
can be obtained by artificially slowing down this first processor.
Analytical expressions can be derived for the size of the dynamic load
imbalance incurred in traditional pipelines. From these it can be determined
what is the optimal first-processor retardation that leads to the shortest total
completion time for the pipeline process. Theoretical predictions of pipeline
performance with and without optimization match experimental observations
on the iPSC/860 very well.
Analysis of pipeline performance also highlights the effect of uncareful grid
partitioning in flow solvers that employ pipeline algorithms. If grid blocks
at boundaries are not at least as large in the wall-normal direction as those
immediately adjacent to them, then the first processor in the pipeline will
receive a computational load that is less than that of subsequent processors,
magnifying the pipeline slowdown effect. Extra compensation is needed for
grid boundary effects, even if all grid blocks are equally sized.
The results of the above investigations are described in references [6] and
[7], which are attached to this report as appendices.
References
[1] R.F. Van der Wijngaart, Efficient implementation of a 3-dimensional
ADI method on the iPSC/860, Proceedings Supercomputing '93, Port-
land, OR, November 1993
[2] R.F. Van der Wijngaart, T. Phung, E. Barszcz, Three implementations
of the NAS scalar penta-diagonal benchmark, submitted for presentation
at Supercomputing '94, November 1994
[3] M.H. Smith, R.F. Van der Wijngaart, Granularity and the parallel ef-
ficiency of flow solution on distributed computer systems, 25th AIAA
Fluid Dynamics Conference, Colorado Springs, CO, June 20-23, 1994
[4] S.E. Rogers, D. Kwak, Steady and unsteady solutions of the incom-
pressible Navier-Stokes equations, AIAA Journal, Vol. 29, No. 4, 1991,
pp. 603-610
[5] G.H. Klopfer, G.A. Molvik, Conservative multizonal interface algorithm
for the 3-D Navier-Stokes equations, AIAA Paper 91-1601, AIAA 10th
Computational Fluid Dynamics Conference, Honolulu, HI, June 1991
[6] R.F. Van der Wijngaart, S.R. Sarukkai, P. Mehra, Analysis and Opti-
mization of Software Pipeline Performance on MIMD Parallel Comput-
ers, submitted for publication in Institute of Electrical and Electronics
Engineers (IEEE) Journal of Parallel and Distributed Computing
[7] R.F. Van der Wijngaart, S.R. Sarukkai, P. Mehra, The Effect of In-
terrupts on Software Pipeline Execution on Message-passing Architec-
tures, submitted for presentation at Fifth ACM SIGPLAN Symposium
on Principles and Practice of Parallel Programming, Santa Barbara, CA,
July 19-21, 1995

Analysis and Optimization of Software Pipeline Performance on
MIMD Parallel Computers
Rob F. Van der Wijngaart*, Sekhar R. Sarukkai†, Pankaj Mehra†
NASA Ames Research Center, Moffett Field, CA 94035
Abstract
Pipelining is a common strategy for extracting parallelism from a collection of
independent computational tasks. Filling the pipeline creates an inevitable per-
formance penalty. When implemented on MIMD parallel computers that transfer
messages asynchronously, pipeline algorithms suffer an additional slowdown; pro-
cessor interrupts cause a wave-like propagation of delays. This phenomenon, which
has been observed experimentally using the AIMS performance monitoring system,
is investigated analytically, and an optimal correction is derived to eliminate the
wave. Increased efficiency through the correction is verified experimentally.
1 Introduction
Pipelining is a common strategy for extracting parallelism from a set of independent tasks,
each of which is sequential in nature but is executed by multiple processors. A well-
known example is the solution of large numbers of banded-matrix equations resulting
from the discretization and approximate factorization of partial differential equations
using the alternating-direction implicit (ADI) algorithm (e.g. [4]). Irrespective of the
hardware platform on which a pipeline algorithm is implemented, a delay proportional
to the number of processors that share the sequential data dependence is incurred. In
the case of dedicated hardware pipelines such as those employed in traditional vector
supercomputers, a result is produced during every clock cycle once the pipeline is full.
We show that software pipelines implemented on MIMD (Multiple Instruction/Multiple
Data) distributed-memory parallel computers with nonzero process interrupt times also
suffer nonlinear delays that make them unattractive when the computational work per
pipe segment is small. Such machines (e.g. IBM SP/2, Intel iPSC/860 and Paragon)
constitute an important class of parallel computers.
*MCAT Institute
†Recom Technologies
Here we investigate the structure of those delays based on a simple parallel performance
model. Building on the results of our investigation, an optimal strategy for reducing the
delays is proposed and verified.
2 Algorithm and performance model
We consider a model problem consisting of a set of N identical independent tasks,
{w^k | k = 1, ..., N}, to be completed by p identical processors. Each task is divided uniformly
into a set of p subtasks, {w_j^k | j = 1, ..., p}. Every subtask w_j^k is assigned to processor j. A
data dependence exists among the subtasks; subtask w_j^k cannot be started before subtask
w_{j-1}^k has been finished.
The pipeline algorithm is constructed as follows. Processor 1 completes subtask w_1^1 and
sends a message to processor 2 indicating that subtask w_2^1 can commence. Subsequently,
processor 1 completes subtask w_1^2 while processor 2 completes subtask w_2^1. After both
subtasks are completed, processor 1 sends another message to 2, and processor 2 signals
processor 3. This pattern is repeated, and after p - 1 subtasks have been completed by
processor 1 all processors are active, provided that N > p.
Assumptions:

1. The message length is zero (similar to assuming infinite network bandwidth). This
   is a reasonable approximation for fine-grained pipelines where messages are usually
   very short.

2. The computational work associated with subtask w_j^k requires a constant period of
   c time units.

3. When a message is sent by a processor a constant non-overlappable send overhead
   of s time units is incurred immediately by that processor. For short messages a
   significant part of the send overhead may be due to the construction of the path to
   the receiving node. This has been observed on the Intel iPSC/860. Nearest-neighbor
   communication assures that s is indeed constant.

4. When a message arrives at a processor a constant non-overlappable receive-interrupt
   overhead of r_i time units is incurred immediately by that processor.

5. When a message is used by a processor a constant non-overlappable receive-handling
   overhead of r_h time units is incurred by that processor.
Assumptions 1 and 2 will be relaxed later. Previous pipeline analyses (e.g. [1]) assume
that messages always arrive before the receive has been posted, so that processors never
need to wait for data. Our observations (see Figure 1) reveal that this is not true in
practice, so we drop this assumption. Certain communication protocols (CMMD, MPI)
may actually support withholding messages until a request for them has been posted
in order to avoid copying of message buffers [3]. This kind of communication delay
automatically provides the type of pipeline optimization described in section 4, but at
the cost of implicit synchronization between all pipeline segments.
Assumption 4 is what distinguishes our model most prominently from previous models,
such as the one presented in [2], which uses s and r_h, but not r_i. Interrupts generate
dynamic load imbalances because they consume cpu time during intermediate pipeline
stages.
3 Performance analysis
We use the performance-monitoring package AIMS [5] (Automated Instrumentation and
Monitoring System) to visualize the pipeline behavior of a problem where p = 4 and
N = 100, implemented on an Intel iPSC/860 hypercube computer. In Figure 1, horizontal
striped bars indicate processor status.
[Figure 1: AIMS processor activity status. Horizontal bars show, per processor (1-4) and over time, idle periods, periods active in a subtask, message lines between processors (e.g., from processor 2 to 3), and varying subtask times.]
Dark sections signify that a processor is performing
a subtask, whereas white space within a bar indicates that a processor is not doing any
computational work, but is sending a message or waiting for one instead. Black lines
connecting bars denote messages being passed among processors. Message lines originate
on the AIMS time line where the sender blocks (suspends program execution), waiting for
local completion of communication, and terminate where the receiver unblocks (resumes
program execution), having received the message. Therefore, the end of a message line
should not be confused with the instant when the message was actually delivered to the
receiving processor. Although the amount of computational work per subtask is constant,
the amount of time spent within subtasks varies (see inset), as does the amount of time
spent waiting in between subtasks. This variation is not due to variations in subtask
execution times, but shows because AIMS does not explicitly monitor processor interrupts
due to arriving messages.
A clear fan-out of message transfer lines is visible between processor 1 and processor
2, and to a lesser extent between 2 and 3. This implies that some subtasks take longer
to complete on a certain processor than on its predecessor in the pipeline. But there
are also phases in the pipeline algorithm during which message transfer lines are parallel,
indicating that communicating subtasks of successive processors in the pipeline take equal
amounts of time. In general, each processor experiences four pipeline phases, which are
identified schematically in Figure 2. They are discussed below in reverse chronological
order.
[Figure 2: Four pipeline processor phases. Schematic time lines for processors 1-5 over the periods t_1 through t_5, indicating the blocking, waiting, interrupted, and flushing phases and the number of subtasks executed per period (m_1 Z_1, m Z_2, m Z_3, ..., Z_k).]
FLUSHING. The processor is not interrupted by its predecessor in the pipeline, either be-
cause there is no predecessor (processor 1), or because the predecessor has already
completed all its subtasks and has no more messages to be issued. However, all pro-
cessors in this phase still need to handle already arrived messages (except processor
1), perform the remaining subtasks, and send messages to their successor; incoming
message transfer lines are parallel, since the preceding processor has fired messages
from the same no-interrupt state, so subtasks take equally long (except on processor
2, see below).
INTERRUPTED. The processor features long subtask times because it is interrupted at
high frequency by its predecessor. The predecessor is in the flush phase, and sends
messages rapidly. Processor 1 does not exhibit the interrupt phase.
WAITING. This phase is dominated by waiting for messages to arrive from the predeces-
sor. Subtasks on the predecessor take a long time to complete, either because that
processor is in the wait phase itself, or because it is being interrupted frequently;
incoming and outgoing message transfer lines are parallel, which means that sub-
tasks take equally long on successive processors in the pipeline. Processors 1 and 2
do not exhibit this phase.
BLOCKING. The processor is blocked while waiting for a message signaling that its first
subtask can be started (pipeline fill). Processor 1 does not exhibit this phase.
We now analyze in detail the durations of the different phases on processor k (k > 2).
The phases can be grouped into two major modes, the blocked mode and the active
mode. The blocked mode coincides with the blocking phase; no subtasks can be started
yet due to the pipeline fill. The active mode contains the remaining three phases. Its
total length is the sum of all subtask durations. Subtasks can be grouped together into
three classes of equal subtask lengths. These correspond to the periods t_1, t_2 through
t_{k-1}, and t_k, respectively, which are indicated in Figure 2. Periods are defined recursively;
they equal the amount of time needed to finish all the subtasks whose messages have been
received (i.e. whose receive calls have been cleared) during the corresponding period of
the predecessor. If there is no such corresponding period, then the period equals the time
needed to flush all remaining subtasks.
The number of subtasks executed during the flushing period on processor k is Z_k.
Clearly, Z_1 = N, since processor 1 has only a single period, during which all subtasks
are flushed. The number of subtasks executed by processor 2 during period t_1 is m_1 Z_1.
m_1 is the ratio of subtask duration on processor 1 over that on 2 during the first period.
Its inverse 1/m_1 is the frequency with which processor 2 gets interrupted by 1. This
frequency may be fractional, since it is an average over many subtasks. It is determined
as follows. A single subtask on processor 2 during t_1 is interrupted by an average of 1/m_1
messages from processor 1. Hence, it lasts c + s + r_h + r_i/m_1 time units. Since every
subtask on processor 1 issues exactly one message, it takes (c + s)/m_1 time units for that
processor to generate the 1/m_1 interrupts of processor 2. Equating the two lapses yields

    c + s + r_h + r_i/m_1 = (c + s)/m_1 ,                                   (1)

so

    m_1 = (c + s - r_i)/(c + s + r_h) .                                     (2)

Note that the number of subtasks executed during t_1 by all processors numbered higher
than 2 is m_1 Z_1, just as on processor 2; they are wait-dominated due to the slowdown of
the frequently interrupted processor 2.
The interrupt frequency 1/m during flushes other than in period t_1 is slightly lower,
because the subtasks on the flushing processor last longer than those on processor 1. A
flushing processor is no longer interrupted by its predecessor, but it does need to handle
the already arrived messages, which incurs an overhead of r_h time units per subtask. So
now we equate the subtask duration c + s + r_h + r_i/m on the interrupted processor to
(c + s + r_h)/m time units on the interrupter, and obtain

    m = (c + s + r_h - r_i)/(c + s + r_h) .                                 (3)
Notice that the interrupt frequency 1/m is the same for all flushing periods other than that
on processor 1, as the sender states and receiver states are the same for all subsequent
nodes. So the number of subtasks executed on the interrupted and waiting processors
during t_j equals m Z_j, where j ≥ 2.
Equation 2 implies that r_i < c + s. The meaning of violation of this inequality is
that processor 2 will be constantly interrupted until processor 1 exhausts all its subtasks.
In this case the parallel pipeline breaks down, and (partial) serial execution results. If
c + s < r_i < c + s + r_h, then m_1 is zero and the first processor finishes completely before
the second has finished even one subtask, but all subsequent processors are properly
pipelined. If c + s + r_h < r_i, then both m_1 and m are zero, and execution is completely
serial. Although this is possible in principle, it is not of interest for this analysis.
Evidently, each processor ultimately has to execute all Z_1 subtasks, so the number of
subtasks Z_k remaining during the flushing phase on processor k is

    Z_k = Z_1 - m_1 Z_1 - m Σ_{j=2}^{k-1} Z_j ,   with  Z_2 = (1 - m_1) Z_1 ,  Z_1 = N .    (4)

This equation constitutes a simple recursion, which is most easily solved by computing
Z_{k+1} - Z_k. It immediately follows that Z_{k+1} = (1 - m) Z_k, so

    Z_k = N (1 - m_1)(1 - m)^{k-2} ,   k ≥ 2 .                              (5)
The total duration T_k^a of the active mode on processor k is now computed as

    T_k^a = t_1 + Σ_{j=2}^{k-1} t_j + t_k
          = (c + s) Z_1 + (c + s + r_h + r_i/m) m Σ_{j=2}^{k-1} Z_j + (c + s + r_h) Z_k
          = N (c + s) + N (1 - m_1) (c + s + r_h + (1 - (1 - m)^{k-2}) r_i/m) ,   k ≥ 2 .    (6)
The total duration T_k^b of the blocked mode on processor k is determined as follows.
We assume that the first subtask w_k^1 on processor k is interrupted exactly once (by the
message announcing subtask w_k^2), so that

    T_k^b = T_{k-1}^b + c + s + r_i + r_h .                                 (7)

This may not be true for k = 2, because processor 1 can potentially send multiple messages
arriving during subtask w_2^1. But the error that follows from this simplifying assumption
is small, and results only in a uniform shift of the time lines of all subsequent processors,
if any. Since T_1^b is zero, we find:

    T_k^b = (k - 1)(c + s + r_i + r_h) .                                    (8)
Summing the durations of the idle and active modes, we finally obtain for the total
pipeline duration T_k on node k:

    T_k = (k - 1)(c + s + r_h + r_i) + N (c + s) +
          N (1 - m_1) (c + s + r_h + (1 - (1 - m)^{k-2}) r_i/m) ,   k ≥ 2 .    (9)
Substituting equations 3 and 2 into 9 and defining the two relative interrupt and
handling overheads

    α = r_h / (c + s) ,     β = r_i / (c + s) ,                             (10)

we obtain the scaled delay time:

    δt_k ≡ (T_k - N(c + s)) / (N(c + s))
         = (1 + α + β)(k - 1)/N + (α + β)/(1 + α - β) [1 + α - β (β/(1 + α))^{k-2}] ,   k ≥ 2 .    (11)
The first term on the right hand side of equation 11 signifies the expected scaled
completion delay time due to pipelining for processor k without nonlinear interrupt effects.
The second term is the additional nonlinear delay due to the wave-like propagation of the
effects of message interrupt and handling overheads in the interrupt states (Figure 2).
This nonlinear delay is depicted in Figure 3 for a fixed scaled handling overhead of 0.3,
and for a range of scaled interrupt overheads.
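To aid interpretation of equations 9 through 11, the following short C function (a sketch only; the numeric values in the example driver are hypothetical and are not measurements reported in this paper, and equation 9 applies to intermediate nodes, the last node being treated separately below) evaluates the predicted total pipeline duration T_k and the scaled delay δt_k directly from the model parameters.

    #include <stdio.h>
    #include <math.h>

    /* Sketch: evaluate the pipeline model of equations 2, 3, 9 and 11.
       All numeric values in main() are hypothetical examples.            */
    double pipeline_time(double cs, double ri, double rh, int N, int k)
    {
        double m1 = (cs - ri) / (cs + rh);            /* equation 2 */
        double m  = (cs + rh - ri) / (cs + rh);       /* equation 3 */
        return (k - 1) * (cs + rh + ri) + N * cs +    /* equation 9 */
               N * (1.0 - m1) * (cs + rh + (1.0 - pow(1.0 - m, k - 2)) * ri / m);
    }

    int main(void)
    {
        double cs = 100e-6, rh = 39e-6, ri = 57e-6;   /* hypothetical c+s; rh, ri as measured in section 5 */
        int N = 256, k;
        for (k = 2; k <= 16; k++) {
            double Tk  = pipeline_time(cs, ri, rh, N, k);
            double dtk = (Tk - N * cs) / (N * cs);    /* scaled delay, equation 11 */
            printf("k = %2d   T_k = %8.5f s   delta t_k = %6.3f\n", k, Tk, dtk);
        }
        return 0;
    }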
For large k or small β, the asymptotic scaled delay time is:

    δt_k = (1 + α + β)(k - 1)/N + (α + β)(1 + α)/(1 + α - β) .              (12)
This expression is not valid for the last node of the pipeline, since no more send operations
are required, but the error is negligible for pipelines involving many processors.
If the number of processors is small, then the timing for the last processor may differ
significantly from the asymptotic value, or even from equation 11. This situation occurs
often with pipelines resulting from three-dimensional domain decompositions (e.g. [4]).
The number of processors in a pipeline in this case is the cube root of the total number
of processors, which is usually a small number.
Since the subtask durations for the last processor, p, in the waiting phase are deter-
mined by processor p - 1, the number of subtasks performed during t_1 through t_{p-2} is
the same as before. But the fan-out in the interrupt phase will be reduced, and actually
[Figure 3: Scaled nonlinear delay for α = 0.3, plotted against processor number for a range of scaled interrupt overheads β.]
completely vanishes if s ≥ r_i; in the latter case t_p = 0, and T_p = T_{p-1} + (c + s + r_i + r_h),
so that

    δt_p = (T_{p-1} + (c + s + r_i + r_h) - N(c + s)) / (N(c + s))
         = δt_{p-1} + (1 + α + β)/N
         = (1 + α + β)(p - 1)/N + (α + β)/(1 + α - β) [1 + α - β (β/(1 + α))^{p-3}] ,   s ≥ r_i .    (13)
If s < r_i there will still be a fan-out, and t_p ≠ 0. Rewriting equation 1 for node p, we
obtain:

    r_i/m_p + r_h + c = (r_h + c + s)/m_p ,                                 (14)

so that

    m_p = (c + s + r_h - r_i)/(c + r_h) .                                   (15)
Modifying Z_p accordingly (see equation 5), we obtain for the number of subtasks remaining
during the flushing phase:

    Z_p = N (1 - m_1)(1 - m_p)(1 - m)^{p-3} .                               (16)
The total time spent on processor p is now easily calculated as:

    T_p = T_{p-1} + (c + s + r_h + r_i) + Z_p (c + r_h) .                   (17)
Introducing a third relative overhead:

    γ = s / (c + s) ,                                                       (18)
we find the following scaled delay time:

    δt_p = (1 + α + β)(p - 1)/N +
           (α + β)/(1 + α - β) {1 + α - [β - (β - γ)(1 + α - β)/(1 + α)] (β/(1 + α))^{p-3}} ,   s < r_i .    (19)
Equations 13 and 19 can be combined to form:

    δt_p = (1 + α + β)(p - 1)/N +
           (α + β)/(1 + α - β) {1 + α - [β - max(0, β - γ)(1 + α - β)/(1 + α)] (β/(1 + α))^{p-3}} .    (20)
3.1 Finite message length
If the message length is nonnegligible and the communication bandwidth between nodes
is finite, the model has to be changed slightly; the message send overhead may increase
if data gets copied locally, but this can be absorbed in the definition of s, since it adds
to the overhead on the sending processor. Similarly, copy costs on the receiving end
can be incorporated in rh, and longer processor tie-up with interrupts results in larger
ri. If the network is not autonomous, the sending processor needs to participate in the
entire transfer of the message, which can again be absorbed in the definition of s; in
that case the model as presented above remains the same. If the network is autonomous,
the sending processor can resume subtask execution once the message is placed on the
network. Here the finite bandwidth results in an increased blocking time. If we assume
a constant message bandwidth of b bytes per second and a constant message size of B
bytes between subtasks, then the message transfer time q per message on a contention-free
network is:
    q = B / b .                                                             (21)
Consequently, the block time becomes:
    T_k^b = (k - 1)(c + s + q + r_i + r_h) .                                (22)
The finite message size has no effect on the rest of the phases, since the frequency of
departing and arriving messages on an autonomous network is not affected by the linear
shift q. Introducing the scaled message travel time
    σ = q / (c + s) ,                                                       (23)
we find in general that
_tk = (1 -}- a + _ +/3)_--_ + a +/3 1 + a --/3 , (24)l+a-/3
and for the last processor in the pipeline:

    δt_p = (1 + σ + α + β)(p - 1)/N +
           (α + β)/(1 + α - β) {1 + α - [β - max(0, β - γ)(1 + α - β)/(1 + α)] (β/(1 + α))^{p-3}} .    (25)
4 Optimization
It follows from the analysis in section 3 that the reason for the nonlinear slowdown of
the pipeline algorithm is the high frequency of interruption of certain sets of subtasks.
All delays are incurred during the interrupt phase whose fan-out propagates as a wave
through the pipeline. No such wave pattern would be observed if the first processor
were to send messages at a lower rate. This suggests that the pipeline algorithm can be
optimized by artificially increasing the amount of work performed by the first processor
for each subtask. Another type of optimization can be obtained if the computational cost
of a subtask is not fixed, but is manipulated by grouping several (independent) subtasks
together. This coarsening of the granularity reduces the number of messages--and hence
the communication overhead--at the expense of increased blocking time. Moreover, the
relative overheads α and β decrease, reducing the nonlinear slowdown as well.
4.1 Optimal padding
We replace c by c' + c on processor 1, where c' is a padding amount, and redo the analysis,
keeping the computational work for the subtasks on other processors the same. We use
the symbol ' to indicate perturbations due to padding. Note that no benefits can be
obtained by increasing the padding beyond the sum of handling and interrupt overheads,
since at that time processor 2 is forced to wait for data from processor 1. Consequently,

    0 ≤ c' ≤ r_i + r_h .                                                    (26)
If we assume that the padding operations within each subtask on processor 1 take
place after each send operation, then the processor block times stay uniformly the same
as before, i.e.,
    (T_k^b)' = T_k^b .                                                      (27)
In addition, we find that
    t'_1 = N(c' + c + s) .                                                  (28)
Expressions for the numbers of subtasks per period stay the same, but the definition of
m_1 changes:

    m'_1 = (c' + c + s - r_i)/(c + s + r_h) .                               (29)
The total time spent on node k becomes:

    T'_k = (k - 1 + N)(c + s) + (k - 1)(r_i + r_h) + N c' +
           N (1 - m'_1) (c + s + r_h + (1 - (1 - m)^{k-2}) r_i/m) ,   k ≥ 2 .    (30)
Introducing a scaled padding:

    τ = c' / (c + s) ,                                                      (31)
we obtain the following expression for the scaled completion delay:

    δt'_k = (1 + α + β)(k - 1)/N + τ + (α + β - τ)/(1 + α - β) [1 + α - β (β/(1 + α))^{k-2}]
          = δt_k + τ β/(1 + α - β) [(β/(1 + α))^{k-2} - 1] ,   k ≥ 2 .      (32)
Interestingly, for k = 2 (the second processor in the pipeline) we find that δt'_k = δt_k,
independent of the value of τ; different amounts of padding cause differences in the dura-
tions of the interrupt and flush phases, respectively, but the sum of these times will be the
same, since processor 2 never needs to wait for data from processor 1. This implies that
optimization of the pipeline can never reduce the increase in execution time that proces-
sor 2 incurs over processor 1, and savings can only be obtained starting with processor 3.
Equation 32 represents a linear function in τ, whose extremal values are attained at the
boundaries of the interval defined by equation 26. As before, we will assume that β < 1
(i.e. r_i < c + s), so the coefficient of τ is negative. Consequently, the total delay time is
minimized by maximizing τ, yielding:

    τ = α + β ,   or equivalently   c' = r_i + r_h .                        (33)
Note that the optimal τ is independent of the processor number k and the amount of
work per subtask c. Optimal padding results in the equality

    m'_1 = 1 ,                                                              (34)
which implies that subtasks on processor 1 now take equally long as on 2. In fact, all
subtasks are now performed within the waiting phase for all processors, and the other
phases (except for the initial pipeline blocking) are totally eliminated. Since the optimal
τ does not depend on the number of processors, the suggested padding strategy is globally
optimal; no padding of subtasks on other processors can improve the completion time.
We finally find:

    δt'_k = (1 + α + β)(k - 1)/N + α + β .                                  (35)
This expression has the appearance of a synchronized pipeline result with a slightly mod-
ified subtask duration. In the case of finite-message-length communications on an au-
tonomous network the expected correction term σ appears, but the optimal amount of
padding remains the same:
    δt'_k = (1 + σ + α + β)(k - 1)/N + α + β ,   with c' = r_i + r_h .      (36)
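As an illustration of how this result can be applied, the optimal padding can be converted into units of the padding parameter of the test program described in section 5 below (the per-unit time used in the example is a hypothetical calibration value, not a measurement from this paper):

    /* Sketch: convert the optimal padding c' = r_i + r_h (equation 33) into
       units of the 'padding' parameter of the test program of section 5.
       unit_time, the cost of one padding unit, must be calibrated on the
       target machine; the value in the example comment is hypothetical.     */
    int padding_units(double ri, double rh, double unit_time)
    {
        double c_prime = ri + rh;                  /* equation 33           */
        return (int)(c_prime / unit_time + 0.5);   /* round to nearest unit */
    }

    /* Example (hypothetical): with rh = 39e-6 s, ri = 57e-6 s and
       unit_time = 7.4e-6 s this yields 13 units, comparable to the optimal
       padding observed experimentally in section 5.                         */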
4.2 Optimal grain size
The total amount of computational work for each processor is N c, as before. But now
instead of dividing this into N subtasks of size c each, we allow for grouping into N^opt =
N/n^opt subtasks of size c^opt = n^opt c. Again we minimize the effect of the wave-like delay
propagation, so the optimal padding of the subtasks on the first processor in the pipeline
keeps the same form, and the final completion time for node p is as in equation 36.
Some assumptions about the impact of message and subtask consolidation have to
be made. Most notably, if copying of message buffers takes place on the sending and
receiving processors, then the send and receive handling overheads have to be functions
of the subtask grouping size n. A simple linear model is adopted in which copying and
computing speeds and network bandwidth are constants. A processor interrupt is assumed
to take a fixed amount of time. The following substitutions are made:
    N    ←  N / n
    c    ←  n c
    s    ←  s + s_l n B
    r_i  ←  r_i
    q    ←  n B / b
    r_h  ←  r_h + r_l n B

(Here s_l and r_l denote the per-byte copy costs on the sending and receiving processor,
respectively.)
This leads to the following final completion time:

    T_p = (N/n) {n c + s + s_l n B + r_h + r_l n B + r_i} +
          {n c + s + s_l n B + r_h + r_l n B + r_i + n B/b} (p - 1)
        = N {c + (s_l + r_l) B} + (N/n) {s + r_h + r_i} +
          {s + r_h + r_i} (p - 1) + {c + (s_l + r_l + 1/b) B} n (p - 1) .   (37)
The optimal subtask grouping size is obtained by setting

    ∂T_p / ∂n = 0 ,                                                         (38)

which yields:

    n^opt = sqrt[ N/(p - 1) · (s + r_h + r_i) / (c + (s_l + r_l + 1/b) B) ] ,    (39)
and the corresponding optimal padding per subtask is:

    c' = (r_i + r_h) / n^opt + r_l B .                                      (40)
It follows from the last equation that padding vanishes in the limit of large grouping sizes
if no copying of message buffers on the receiving end is performed.
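The grouping result translates directly into code; the following sketch (illustrative only, using the parameter names of the substitutions above) evaluates equations 39 and 40.

    #include <math.h>

    /* Sketch: optimal subtask grouping size (equation 39) and the corresponding
       optimal padding per original subtask (equation 40).  s_l and r_l are the
       per-byte copy costs on the sending and receiving processor, b is the
       network bandwidth, and B the message size per original subtask.          */
    double optimal_grain(int N, int p, double c, double s, double rh, double ri,
                         double s_l, double r_l, double b, double B)
    {
        return sqrt((double)N / (p - 1) * (s + rh + ri)
                    / (c + (s_l + r_l + 1.0/b) * B));
    }

    double optimal_padding(double n_opt, double ri, double rh, double r_l, double B)
    {
        return (ri + rh) / n_opt + r_l * B;        /* equation 40 */
    }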
5 Verification
In order to verify the validity of our performance model we use a simple test program
in which the amount of computational work per pipeline segment can be varied. The
program is run on the Intel iPSC/860 and uses blocking send and receive calls of the NX
message passing library to pass zero-length messages. The node numbering in the pipelines
is such that only nearest-neighbor communication occurs, which eliminates contention.
The number of pipeline tasks does not have an important qualitative influence on the
performance; it is kept fixed (N = 256) in this experimental investigation.
The primary data set obtained consists of final completion times on all 16 nodes
involved in the program execution. The C routine used to produce an entry of the data
set is printed below.
    void pipeline(work, padding)
    int work, padding;
    {
        int s, p, my_segment, next, first, last;
        double *in, *out, q;

        /* gray-code node ordering: pipeline neighbors are hypercube neighbors */
        my_segment = ginv(mynode()); next = gray(my_segment+1);
        first = 0; last = numnodes()-1;
        if (my_segment != first)
            for (p = 0; p < 256; p++) {
                crecv(p, in, 0);                     /* zero-length message     */
                for (s = 0; s < 10*work; s++) q = 1.0/(s+1.0);
                if (my_segment != last) csend(p, out, 0, next, 0);
            }
        else
            for (p = 0; p < 256; p++) {
                /* first node: padded subtask */
                for (s = 0; s < 10*(work+padding); s++) q = 1.0/(s+1.0);
                csend(p, out, 0, next, 0);
            }
    }
The variable work determines the amount of arithmetic work for all pipeline segments,
whereas padding determines the additional work per segment for the first node. The
multiplication factor of 10 is used to scale the work to measurable amounts. The main
program is a double loop over the pipeline routine for work = 0 step 1 to 16, and
padding = 0 step 1 to 32, and computes ensemble averages for each parameter pair
over 20 independent runs.
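For concreteness, the structure of this driver can be sketched as follows (the timing and averaging details shown are assumptions of the sketch, not code reproduced from the experiment):

    /* Sketch of the main driver: sweep the (work, padding) parameter space
       and average completion times over 20 independent runs per pair.
       dclock() is the NX wall-clock timer; results[][] holds the per-node
       ensemble averages.                                                    */
    #define RUNS 20
    extern double dclock();
    extern void pipeline();

    main()
    {
        int work, padding, run;
        double t0, results[17][33];

        for (work = 0; work <= 16; work++)
            for (padding = 0; padding <= 32; padding++) {
                results[work][padding] = 0.0;
                for (run = 0; run < RUNS; run++) {
                    t0 = dclock();
                    pipeline(work, padding);
                    results[work][padding] += (dclock() - t0) / RUNS;
                }
            }
    }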
[Figure 4: Completion times for increasing arithmetic loads; completion time (s) versus node number for varying work per segment.]
Figure 4 shows the final completion time versus node number for zero padding and varying
amounts of work per pipeline segment (higher curve generally
means more work per segment). Interestingly, the completion time is not a monotonically
increasing function of the work per segment; this also follows from the theoretical
model. From the data displayed in Figure 4 we compute the interrupt and handling
overheads as follows. The first node provides the value of the scale factor c + s, since all it
does is compute and send. From equation 9 we derive that r_i + r_h = T_2/(N + 1) - T_1/N,
[Figure 5: Measured receive overheads; handling overhead (left) and interrupt overhead (right), in μs, versus node number.]
which can be combined with equation 12 for larger values of k and k + 1 to compute α
and β (and hence r_i and r_h) separately. The results of these computations are depicted in
Figure 5. Again, different curves pertain to executions with different amounts of work per
pipeline segment, starting with work = 4. The first few curves with very small amounts
of work per pipeline segment (work < 4) are left out, since the overhead of the extra test
in the case of receiving processors introduces significant skew in comparison to the work
performed. It can also be seen that the overhead results are rather noisy for the small node
numbers. By averaging the results for node 15, we obtain the values of r_h = 39 μs and
r_i = 57 μs. Qualitative validity of the performance model derived above is demonstrated
in Figure 6, which shows completion times on nodes 2 and 16, respectively, for the entire
(work,padding) parameter space. As predicted by equation 32, the completion time on
[Figure 6: Completion times (s) on the second (left) and last (right) pipeline nodes over the entire (work, padding) parameter space.]
node 2 stays roughly constant while the padding is below the optimal amount. Completion
times on node 16 show that optimal padding is a more or less fixed quantity (padding =
13), independent of the amount of computational work per pipeline segment.
[Figure 7: Comparison of receive overheads and optimal padding; time (μs) versus work.]
In Figure 7 we show the actual cpu time involved with optimal padding (c'). The upper curve
is obtained by subtracting the non-padded completion times from those at padding =
13 for the first node. The lower curve depicts the computed sum of handle and receive
overheads. Theory predicts that the two curves be horizontal and coincident. Although
they deviate and exhibit some scatter, they are close in absolute value.
The achieved improvements in execution time for optimal padding versus no padding
are plotted in Figure 8, alongside the theoretically predicted improvements based on
the values of r_h, r_i and c + s determined above. Note that the quantity 'work' (per pipeline
segment computation) in this figure is now given in microseconds instead of in number of
times the loop body in the test program is executed. Agreement between predicted and
measured speed-ups is quite good.
[Figure 8: Predicted (dashed) and measured (solid) completion times with and without
padding, and traditional model (dotted); completion time (s) versus work (μs).]
In the same figure we also depict expected pipeline performance using traditional
analysis (dotted line), which takes latency into account, but ignores the effect of processor
interrupts. The traditional model, which assumes that all message overheads are borne
by the communication network, significantly underpredicts completion times, even when
compared to the case of optimal padding.
6 Conclusions
In this paper we have investigated the performance of software pipelines on MIMD parallel
computers with message passing based on interrupts. Based on observations obtained
using the performance monitoring and visualization tool AIMS, an analytical model for
the pipeline behavior was formulated. A salient feature of the model is that it considers
handling and interruption overheads associated with receive operations. Four phases
of pipeline operation are identified, each with different communication characteristics.
Recurrences identified in some phases give rise to nonlinear delays. Each phase is analyzed
independently, and predicted total completion time is obtained by summing the individual
contributions. Comparison of predicted execution with measured execution is favorable.
In addition to providing accurate predictions, the model also points to possible opti-
mization of the implementation of parallel software pipelines. The optimization involves
reducing the frequency with which the first processor in the pipeline sends messages to its
successor. This artificial slow-down of the first processor actually improves overall pro-
gram performance. The predicted magnitude of optimal first-processor retardation and
the resulting total speed-up are again verified experimentally.
Finally, it should be noted that the model also serves as an investigative tool for
determining detailed communication characteristics of MIMD parallel computers, whose
effects get magnified in software pipelines.
Acknowledgements
The authors would like to thank Rob Block from the University of Illinois for his early
investigations and observations of software pipelines, and Bill Saphir from Computer
Sciences Corporation for his many useful suggestions and insight, and for his insistence
on dropping large amounts of extra redundant unneeded notation and equations.
References
[1] King, C.-T., Chou, W.-H., and Ni, L.M. Pipelined data-parallel algorithms: Part I --
Concept and modeling. IEEE Trans. on Parallel and Distrib. Systems 1, 4, (October
1990), 470-485
[2] Adve, V. et al. Integrated performance analysis of data-parallel programs, Workshop
on Debugging and Performance Tuning for Parallel Computing Systems, Sponsor:
Los Alamos National Laboratory, Chatham, MA, 1994
[3] Saphir, W.C. Message buffering and its effect on the communication performance
of parallel computers, NASA Ames Tech. Rep. RNS-94-004, NASA Ames Research
Center, Moffett Field, CA, April 1994
[4] Van der Wijngaart, R.F. Efficient implementation of a 3-dimensional ADI method
on the iPSC/860, Proc. Supercomputing '93, Portland, OR, 1993, pp. 102-111
[5] Yan, J.C. Performance tuning with AIMS--An automated instrumentation and mon-
itoring system for multicomputers, Proc. 27th Hawaii International Conference on
System Sciences II, Wailea, HI, 1994, pp. 625-633
The Effect of Interrupts on Software Pipeline Execution on Message-passing
Architectures
Rob F. Van der Wijngaart*, Sekhar R. Sarukkai†, Pankaj Mehra†
NASA Ames Research Center, Moffett Field, CA 94035
Abstract
Observations show that fine-grain software pipelines on MIMD parallel computers with asynchronous
communication suffer from dynamic load imbalances which cause delays in addition to the expected pipeline
fill time. An analytical model is presented that fully explains these load imbalances, and that allows for their
removal. The results of applying this optimization to a tri-diagonal equation solver on the Intel iPSC/860
and Paragon are presented.
1 Introduction
It is well-known that software pipeline performance concerns a trade-off between granularity and exploited
parallelism; pipeline fill time may be decreased at the expense of a larger number of communications. Indeed,
experiments show that an optimum is usually achieved at less than the maximum exploitable parallelism.
Here we demonstrate that part of the delay of fine-grain software pipelines implemented on MIMD distributed-
memory parallel computers with nonzero process interrupt times (e.g. IBM SP/2, Intel iPSC/860) is attributable
to dynamic load imbalances created by interrupts. We investigate the structure of those delays based on a simple
parallel performance model. Building on the results of our investigation, an optimal strategy for reducing the
delays is proposed and verified.
The same strategy is subsequently employed to optimize the solver phase of an important class of algorithms
for the solution of partial differential equations whose parallel implementation necessitates fine-grain pipelines.
An example is the Alternating Direction Implicit method in the widely used parallel flow solver POVERFLOW
[6]. Since processor speeds are generally increasing faster than network bandwidths, it is envisioned that fine-
grain pipelines will keep gaining prominence as medium-grain pipeline applications move toward the fine end
of the granularity spectrum.
In order to describe previous and the current work on software pipelines, we consider a model problem
consisting of N identical independent tasks, {w^k | k = 1, ..., N}, to be completed by p identical processors. Each
task is divided uniformly into p subtasks, {w_j^k | j = 1, ..., p}. Every subtask w_j^k is assigned to processor j. A
data dependence exists among the subtasks; subtask w_j^k cannot start before subtask w_{j-1}^k has finished. The
pipeline algorithm is constructed as follows. Processor 1 completes subtask w_1^1 and sends a message to processor
2 indicating that subtask w_2^1 can commence. Subsequently, processor 1 completes subtask w_1^2 while processor
*MCAT Institute, †Recom Technologies
2 completes subtask w_2^1. After both subtasks are completed, processor 1 sends another message to 2, and
processor 2 signals processor 3. This pattern is repeated, and after p - 1 subtasks have been completed by
processor 1, all processors are active, provided that N ≥ p.
An intuitively appealing description of this pipeline process assumes that each subtask requires a fixed,
constant amount of cpu time, and that a network-borne message latency creates a delay between issuance of a
message on a processor and reception of that message on its successor in the pipeline. This simple model results
in a perfectly synchronized pipeline operation in which no processor ever waits for data once the first message
is received; pipeline fill times and total completion times are linear functions of the processor number p.
We use the performance-monitoring tool AIMS [5] (Automated Instrumentation and Monitoring System)
to visualize the pipeline behavior of a problem where p = 4 and N = 100, implemented on an Intel iPSC/860
hypercube computer. In Figure 1, horizontal striped bars indicate processor status.
[Figure 1: AIMS processor activity status for pipeline algorithm; per-processor bars (1-4) over time, showing idle time, time active in a subtask, and a message from processor 2 to 3.]
Dark sections signify that a
processor is performing a subtask, whereas white space within a bar indicates that a processor is not doing any
computational work, but is sending a message or waiting for one instead. Black lines connecting bars denote
messages being passed among processors. Message lines in the AIMS time line originate where the sender blocks
(suspends program execution), waiting for local completion of communication, and terminate where the receiver
unblocks (resumes program execution), having received the message.
Obviously, total completion time is not linear in p, and not all subtasks last equally long, since not all
message transfer lines are parallel. Moreover, in contrast with previous studies (e.g. King et al. [1]) that assume
that messages always arrive before the receive has been posted so that processors never need to wait for data,
AIMS probes show that in many instances processors are actually waiting for messages from their predecessors
in the pipeline; a refinement of the simple model is needed.
A more realistic model such as that by Adve et al. [2] ascribes certain communication overheads to the
processors in the parallel machine, rather than to the communication network. More specifically, they assume
that a fixed amount of cpu time is spent by a processor that receives and processes ('handles') a message. Since
the first processor never receives a message, this model can account for a disparity in completion times between
the first and the second processor, but not for nonlinearities in completion time on higher numbered processors.
Introduction of an additional overhead incurred by the processor that sends a message does not change the
model qualitatively.
2 Performance analysis
Here we postulate a communication model that completely explains the observed pipeline behavior, and that
offers a means for optimization described in section 3. Its most important feature is the occurrence of interrupt
events that generate dynamic load imbalances.
Assumptions:

- The message length is zero (similar to assuming infinite network bandwidth). This is a reasonable approxi-
  mation for fine-grained pipelines where messages are usually very short. It is not essential for the analysis,
  but leads to somewhat simpler algebra.

- The computational work associated with subtask w_j^k requires a constant period of c time units.

- When a message is sent by a processor a constant non-overlappable send overhead of s time units is incurred
  immediately by that processor.

- When a message arrives at a processor a constant non-overlappable receive-interrupt overhead of r_i time units
  is incurred immediately by that processor.

- When a message is used by a processor a constant non-overlappable receive-handling overhead of r_h time
  units is incurred by that processor.
In Figure 1 we observe a clear fan-out of message transfer lines between processor 1 and processor 2, and
to a lesser extent between 2 and 3. This implies that some subtasks take longer to complete on a certain
processor than on its predecessor in the pipeline. But there are also phases in the pipeline algorithm during
which message transfer lines are parallel, indicating that communicating subtasks of successive processors in
the pipeline take equal amounts of time. In general, each processor experiences four pipeline phases, which are
identified schematically in Figure 2. They are discussed below, using the new communication model.
FLUSHING. The processor is not interrupted, either because there is no predecessor (processor 1), or because
the predecessor has already completed all its subtasks. Processors in this phase still need to handle
already arrived messages (except processor 1), perform the remaining subtasks, and send messages to their
successor; since the preceding processor has fired messages from the same no-interrupt state, subtasks take
equally long on successive processors in the pipeline (except on processor 2, see below).
INTERRUPTED. Long subtask duration is caused by high-frequency interruption by the predecessor. The pre-
decessor is in the flush phase, and sends messages rapidly. Processor 1 does not exhibit the interrupt
phase.
WAITING. This phase is dominated by waiting for messages to arrive from the predecessor. Subtasks on the
predecessor take a long time to complete, either because that processor is in the wait phase itself, or
because it is being interrupted frequently; subtasks take equally long on successive processors in the
pipeline. Processors 1 and 2 do not exhibit this phase.
[Figure 2: Four pipeline processor phases. Schematic time lines for processors 1-5 over the periods t_1 through t_5, indicating the blocking, waiting, interrupted, and flushing phases and the number of subtasks executed per period.]
BLOCKING. The processor is blocked while waiting for a message signaling that its first subtask can be started
(pipeline fill). Processor 1 does not exhibit this phase.
We now analyze in detail the durations of the different phases on processor k (k > 2). The phases can be
grouped into two major modes, blocked and active. The blocked mode coincides with the blocking phase; no
subtasks can be started yet due to the pipeline fill. The active mode contains the remaining three phases. Its
total length is the sum of all subtask durations. Subtasks can be grouped together into three classes of equal
subtask lengths. These correspond to the periods tl, t2 through tk-1, and tk, respectively, which are indicated
in Figure 2. Periods are defined recursively; they equal the amount of time needed to finish all the subtasks
whose messages have been received (i.e. whose receive calls have been cleared) during the corresponding period
of the predecessor. If there is no such corresponding period, then the period equals the time needed to flush all
remaining subtasks.
The number of subtasks executed during the flushing period on processor k is Z_k. Clearly, Z_1 = N, since
processor 1 has only a single period, during which all subtasks are flushed. The number of subtasks executed
by processor 2 during period t_1 is m_1 Z_1. m_1 is the ratio of subtask duration on processor 1 over that on 2
during the first period. Its inverse 1/m_1 is the frequency with which processor 2 gets interrupted by 1. This
frequency may be fractional, since it is an average over many subtasks. It is determined as follows. A single
subtask on processor 2 during t_1 is interrupted by an average of 1/m_1 messages from processor 1. Hence, it
lasts c + s + r_h + r_i/m_1 time units. Since every subtask on processor 1 issues exactly one message, it takes
(c + s)/m_1 time units for that processor to generate the 1/m_1 interrupts of processor 2. Equating the two lapses
yields c + s + r_h + r_i/m_1 = (c + s)/m_1, so

    m_1 = (c + s - r_i)/(c + s + r_h) .                                     (1)

Note that the number of subtasks executed during t_1 by all processors numbered higher than 2 is m_1 Z_1,
just as on processor 2; they are wait-dominated due to the slowdown of the frequently interrupted processor 2.
The interrupt frequency 1/m during flushes other than in period t_1 is slightly lower, because the subtasks on
the flushing processor last longer than those on processor 1. A flushing processor is no longer interrupted by its
predecessor, but it does need to handle the already arrived messages, which incurs an overhead of r_h time units
per subtask. We equate the subtask duration c + s + r_h + r_i/m on the interrupted processor to (c + s + r_h)/m
time units on the interrupter, and obtain

    m = (c + s + r_h - r_i)/(c + s + r_h) .                                 (2)

The interrupt frequency 1/m is the same for all flushing periods other than that on processor 1, as the sender
states and receiver states are the same for all subsequent nodes. So the number of subtasks executed on the
interrupted and waiting processors during t_j equals m Z_j, where j ≥ 2.
Evidently, each processor ultimately has to execute all Z_1 subtasks, so the number of subtasks Z_k remaining
during the flushing phase on processor k is

    Z_k = Z_1 - m_1 Z_1 - m Σ_{j=2}^{k-1} Z_j ,   with  Z_2 = (1 - m_1) Z_1 ,  Z_1 = N .    (3)

This simple recursion is most easily solved by computing Z_{k+1} - Z_k. It follows that Z_{k+1} = (1 - m) Z_k, so
Z_k = N (1 - m_1)(1 - m)^{k-2}, k ≥ 2. The total duration T_k^a of the active mode on processor k is now computed
as

    T_k^a = t_1 + Σ_{j=2}^{k-1} t_j + t_k
          = (c + s) Z_1 + (c + s + r_h + r_i/m) m Σ_{j=2}^{k-1} Z_j + (c + s + r_h) Z_k
          = N (c + s) + N (1 - m_1) (c + s + r_h + (1 - (1 - m)^{k-2}) r_i/m) ,   k ≥ 2 .    (4)
The total duration T_k^b of the blocked mode on processor k is easily determined:

    T_k^b = (k - 1)(c + s + r_i + r_h) .                                    (5)

Summing the durations of the idle and active modes, we finally obtain for the total pipeline duration T_k on
node k:

    T_k = (k - 1)(c + s + r_h + r_i) + N (c + s) +
          N (1 - m_1) (c + s + r_h + (1 - (1 - m)^{k-2}) r_i/m) ,   k ≥ 2 .    (6)

Substituting equations 2 and 1 into 6 and defining the two relative interrupt and handling overheads α =
r_h/(c + s) and β = r_i/(c + s), we obtain the scaled delay time:

    δt_k ≡ (T_k - N(c + s)) / (N(c + s))
         = (1 + α + β)(k - 1)/N + (α + β)/(1 + α - β) [1 + α - β (β/(1 + α))^{k-2}] ,   k ≥ 2 .    (7)
The first term on the right hand side of equation 7 represents the expected pipeline fill. The second term is
the additional nonlinear delay due to the wave-like propagation of the effects of message interrupts (see Figure
2).
For large k or small β, the asymptotic scaled delay time is:

    δt_k = (1 + α + β)(k - 1)/N + (α + β)(1 + α)/(1 + α - β) .              (8)
3 Optimization
It follows from the analysis in section 2 that the reason for the nonlinear slowdown of the pipeline algorithm
is the high frequency of interruption of certain sets of subtasks. All delays are incurred during the interrupt
phase whose fan-out propagates as a wave through the pipeline. No such wave pattern would be observed if the
first processor were to spread its messages evenly during the second processor's execution. This suggests that
the pipeline algorithm can be optimized by artificially increasing the amount of work performed by the first
processor for each subtask.
We replace c by c' + c on processor 1, where c' is a padding amount, and redo the analysis, keeping the
computational work for the subtasks on other processors the same. The symbol ' indicates perturbations due
to padding. No benefits can be obtained by increasing the padding beyond the sum of handling and interrupt
overheads, since then processor 2 is forced to wait for data from processor 1. Consequently,

    0 ≤ c' ≤ r_i + r_h .                                                    (9)
If we assume that padding operations within each subtask on processor 1 take place after each send operation,
then processor block times remain unchanged, i.e., (T_k^b)' = T_k^b. In addition, we find that t'_1 = N(c' + c + s).
Expressions for the numbers of subtasks per period stay the same, but the definition of m_1 changes:

    m'_1 = (c' + c + s - r_i)/(c + s + r_h) .                               (10)
The total time spent on node k becomes:

    T'_k = (k - 1 + N)(c + s) + (k - 1)(r_i + r_h) + N c' +
           N (1 - m'_1) (c + s + r_h + (1 - (1 - m)^{k-2}) r_i/m) ,   k ≥ 2 .    (11)
Introducing a scaled padding, τ = c'/(c + s), we obtain the following expression for the scaled completion delay:

    δt'_k = (1 + α + β)(k - 1)/N + τ + (α + β - τ)/(1 + α - β) [1 + α - β (β/(1 + α))^{k-2}]
          = δt_k + τ β/(1 + α - β) [(β/(1 + α))^{k-2} - 1] ,   k ≥ 2 .      (12)
Equation 12 represents a linear function in τ, whose extremal values are attained at the boundaries of the
interval defined by equation 9. We can assume that β < 1 (i.e. r_i < c + s), so the coefficient of τ is negative.
Consequently, the total delay time is minimized by maximizing τ, yielding:

    τ = α + β ,   or equivalently   c' = r_i + r_h .                        (13)
Note that the optimal τ is independent of the processor number k and the amount of work per subtask c.
Optimal padding results in the equality m'_1 = 1, which implies that subtasks on processor 1 now take equally
long as on 2. In fact, all subtasks are now performed within the waiting phase for all processors, and the other
phases (except for the initial pipeline blocking) are totally eliminated. Since the optimal τ does not depend on
the number of processors, the suggested padding strategy is globally optimal; no padding of subtasks on other
processors can improve the completion time. We finally find:

    δt'_k = (1 + α + β)(k - 1)/N + α + β ,                                  (14)
which has the appearance of a synchronized pipeline result with a slightly modified subtask duration.
4 Verification and numerical application
In order to verify the validity of our performance model we use a simple test program in which the amount
of computational work per pipeline segment can be varied. The program is run on the Intel iPSC/860 and
uses blocking send and receive calls of the NX message passing library to pass zero-length messages. The node
numbering is such that only nearest-neighbor communication occurs, which eliminates contention. The number
of pipeline tasks does not have an important qualitative influence on the performance; it is kept fixed (N = 256)
in this experimental investigation.
The primary data set obtained consists of final completion times on all 16 nodes involved in the program
execution. The C routine used to produce an entry of the data set is printed below.
    void pipeline(work, padding)
    int work, padding;
    {
        int s, p, my_segment, next, first, last;
        double *in, *out, q;

        /* gray-code node ordering: pipeline neighbors are hypercube neighbors */
        my_segment = ginv(mynode()); next = gray(my_segment+1);
        first = 0; last = numnodes()-1;
        if (my_segment != first)
            for (p = 0; p < 256; p++) {
                crecv(p, in, 0);                     /* zero-length message     */
                for (s = 0; s < 10*work; s++) q = 1.0/(s+1.0);
                if (my_segment != last) csend(p, out, 0, next, 0);
            }
        else
            for (p = 0; p < 256; p++) {
                /* first node: padded subtask */
                for (s = 0; s < 10*(work+padding); s++) q = 1.0/(s+1.0);
                csend(p, out, 0, next, 0);
            }
    }
[Figure 3: Completion times for increasing arithmetic loads; completion time (s) versus node number for varying work per segment.]
The variable work determines the amount of arithmetic work for all pipeline segments, whereas padding
determines the additional work per segment for the first node. A multiplication factor of 10 is used to scale
the work to measurable amounts. The main program is a double loop over the pipeline routine for work =
0 step 1 to 16, and padding = 0 step 1 to 32, and computes ensemble averages for each parameter pair
over 20 independent runs.
Figure 3 shows the final completion time versus node number for zero padding and varying amounts of work
per pipeline segment (higher curve generally means more work per segment). Interestingly, the completion time
is not a monotonically increasing function of the work per segment; this also follows from the theoretical
model. From the data displayed in Figure 3 we compute the interrupt and handling overheads as follows. The
first node provides the value of the scale factor c + s, since all it does is compute and send. From equation 6
we derive that ri + rh = T2/(N + 1) - T1/N, which can be combined with equation 8 for larger values of k and
k + 1 to compute c_ and _ (and hence ri and rh) separately. By averaging the results of these computations
for node 15, we obtain the values of ra = 39ps and ri = 57/zs. Qualitative validity of the performance model
derived above is demonstrated in Figure 4, which shows completion times on nodes 2 and 16, respectively, for
the entire (work,padding) parameter space. As predicted by equation 12, the completion time on node 2 stays
[Figure 4: Completion times (s) on the second (left) and last (right) pipeline nodes over the (work, padding) parameter space.]
roughly constant while the padding is below the optimal amount. Completion times on node 16 show that
optimal padding is a more or less fixed quantity (padding = 13), independent of the amount of computational
work per pipeline segment.
The achieved improvements in execution time for optimal padding versus no padding are plotted in Figure
5, alongside with the theoretically predicted improvements based on the values of rh, ri and c + s determined
above. Note that the quantity 'work' (per pipeline segment) in this figure is now given in microseconds instead of
in test program loop count. Agreement between predicted and measured speed-ups is quite good. In the same
figure we also depict expected performance using traditional analysis (dotted line), which takes latency into
account but ignores the effect of interrupts. The traditional model, which assumes that all message overheads
are borne by the communication network, significantly underpredicts completion times, even when compared
[Figure 5: Predicted (dashed) and measured (solid) completion times with and without padding, and traditional
model (dotted); completion time (s) versus work (μs).]
to the case of optimal padding.
[Figure 6: Optimization of tri-diagonal solver on Intel Paragon; AIMS activity profiles before and after padding.]
We now apply padding to a problem derived from the solution of 3-dimensional partial differential equations
(e.g. [4]). It concerns a set of 256 independent tri-diagonal matrix equations, each of rank 128. The problem
is solved on 8 nodes of an Intel Paragon XP/S without message co-processor. Pipelining is obtained by having
every node solve 1/8th of each matrix equation using Gaussian elimination without pivoting. We only consider
the forward elimination, which requires 6 multiply/adds and 1 division per matrix row. Each node passes only
two 8-byte numbers for each matrix equation to its successor. Figure 6 shows the AIMS performance profile
before and after padding. Although the padding is slightly more than optimal (execution time on node 2 has
increased), the performance gain is already more than 30%. Improvements will be more dramatic for the even
finer-grained backsubstitution phase (only 2 multiply/adds and no divisions per matrix row).
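A minimal sketch of the pipelined forward-elimination kernel just described is given below (the array layout, names, and exact message pattern are assumptions of this sketch; the implementation used for the experiment may differ in detail). Per system, the two 8-byte numbers passed to the successor are taken here to be the last local row's normalized super-diagonal and right-hand side.

    #define NEQ  256                      /* number of independent systems      */
    #define NLOC 16                       /* 128 rows spread over 8 nodes       */

    /* Sketch: pipelined forward elimination for NEQ independent tridiagonal
       systems; a, b, c hold the sub-, main- and super-diagonals of the local
       rows, d the right-hand side.                                             */
    void forward_elim(double a[NEQ][NLOC], double b[NEQ][NLOC],
                      double c[NEQ][NLOC], double d[NEQ][NLOC],
                      int my_segment, int next, int first, int last)
    {
        int k, i;
        double carry[2], w;               /* carry[0] = c/b, carry[1] = d/b     */

        for (k = 0; k < NEQ; k++) {
            if (my_segment != first) {    /* couple to the predecessor's block  */
                crecv(k, (char *) carry, 2*sizeof(double));
                b[k][0] -= a[k][0] * carry[0];
                d[k][0] -= a[k][0] * carry[1];
            }
            for (i = 1; i < NLOC; i++) {  /* eliminate within the local block   */
                w = a[k][i] / b[k][i-1];
                b[k][i] -= w * c[k][i-1];
                d[k][i] -= w * d[k][i-1];
            }
            carry[0] = c[k][NLOC-1] / b[k][NLOC-1];
            carry[1] = d[k][NLOC-1] / b[k][NLOC-1];
            if (my_segment != last)
                csend(k, (char *) carry, 2*sizeof(double), next, 0);
        }
    }

The backsubstitution phase would be pipelined analogously in the reverse node order.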
These exercises also indicate potential for improvement of pipeline algorithms in application programs such as
POVERFLOW [6], which solves a collection of penta-diagonal matrix equations derived from the discretization
of the compressible Navier-Stokes equations. Here every processor receives a block of a structured grid, and is
responsible for solving the part of each matrix system that corresponds to the grid points in the block. Obviously,
POVERFLOW can benefit from the optimization derived above, but it should also be noted that equal-sized
grid blocks aggravate the dynamic load imbalance, since processors assigned blocks at a grid boundary have less
work to do than others. Such processors are first in the pipeline during either the forward elimination or the
backsubstitution, and so some additional padding is needed for an optimal algorithm.
5 Conclusions
We have investigated the performance of software pipelines on message passing architectures. Based on ob-
servations obtained using the performance monitoring and visualization tool AIMS, an analytical model for
the pipeline behavior was formulated. Salient features of the model are the handling and interrupt overheads
associated with receive operations. Four phases of pipeline operation are identified, each with different com-
munication characteristics. Recurrences identified in some phases give rise to nonlinear delays. Each phase is
analyzed independently, and predicted total completion time is obtained by summing the individual contribu-
tions. Comparison of predicted and measured execution is favorable.
The model also points to possible optimization of the implementation of parallel software pipelines by
reducing the frequency with which the first processor in the pipeline sends messages to its successor. This
artificial slow-down actually improves overall program performance. The predicted magnitude of optimal first-
processor retardation and the resulting total speed-up are again verified experimentally.
References
[1] C.-T. King, W.-H. Chou, L.M. Ni, Pipelined data-parallel algorithms: Part I -- Concept and modeling,
IEEE Transactions on Parallel and Distributed Systems, Vol. 1, No. 4, October 1990
[2] V. Adve et al., Integrated performance analysis of data-parallel programs, Workshop on Debugging and
Performance Tuning for Parallel Computing Systems, Chatham, MA, October 3-5, 1994
[3] W.C. Saphir, Message buffering and its effect on the communication performance of parallel computers,
NASA Report RNS-94-004, NASA Ames Research Center, Moffett Field, CA, April 1994
[4] R.F. Van der Wijngaart, Efficient implementation of a 3-dimensional ADI method on the iPSC/860,
Proceedings of Supercomputing '93, pp. 102-111, Portland, OR, November 15-19, 1993
[5] J.C. Yan, Performance tuning with AIMS--An automated instrumentation and monitoring system for
multicomputers, Proceedings of the 27th Hawaii International Conference on System Sciences, Vol. II,
pp. 625-633, Wailea, HI, January 4-7, 1994
[6] J.S. Ryan, S.K. Weeratunga, Parallel computation of 3-D Navier-Stokes flowfields for supersonic vehicles,
AIAA 93-0064, 31st Aerospace Sciences Meeting and Exhibit, Reno, NV, January 11-14, 1993