Parallelization of the Pipelined Thomas Algorithm by Povitsky, A.
NASA/CR- 1998-208736
ICASE Report No. 98-48
Parallelization of the Pipelined Thomas Algorithm
A. Povitsky
ICASE, Hampton, Virginia
Institute for Computer Applications in Science and Engineering
NASA Langley Research Center
Hampton, VA
Operated by Universities Space Research Association
National Aeronautics and
Space Administration
Langley Research Center
Hampton, Virginia 23681-2199
Prepared for Langley Research Center
under Contract NAS 1-97046
November 1998
https://ntrs.nasa.gov/search.jsp?R=19990018954 2020-06-15T22:34:39+00:00Z
Available from the following:
NASA Center for AeroSpace Information (CASI)
7121 Standard Drive
Hanover, MD 21076-1320
(301) 621-0390
National Technical Information Service (NTIS)
5285 Port Royal Road
Springfield, VA 22161-2171
(703) 487-4650
PARALLELIZATION OF THE PIPELINED THOMAS ALGORITHM
A. POVITSKY *
Abstract. In this study the following questions are addressed. Is it possible to improve the
parallelization efficiency of the Thomas algorithm? How should the Thomas algorithm be formulated
in order to get solved lines that are used as data for other computational tasks while processors are
idle?
To answer these questions, two-step pipelined algorithms (PAs) are introduced formally. It is shown
that the idle processor time is invariant with respect to the order of backward and forward steps in PAs
starting from one outermost processor. The advantage of PAs starting from two outermost processors
is small. Versions of the pipelined Thomas algorithms considered here fall into the category of PAs.
These results show that the parallelization efficiency of the Thomas algorithm cannot be improved
directly. However, the processor idle time can be used if some data has been computed by the time
processors become idle. To achieve this goal the Immediate Backward pipelined Thomas Algorithm
(IB-PTA) is developed in this article. The backward step is computed immediately after the forward
step has been completed for the first portion of lines. This enables the completion of the Thomas
algorithm for some of these lines before processors become idle. An algorithm for generating a static
processor schedule recursively is developed. This schedule is used to switch between forward and
backward computations and to control communications between processors. The advantage of the
IB-PTA over the basic PTA is the presence of solved lines, which are available for other computations,
by the time processors become idle.
Key words. Thomas algorithm, band matrix, pipelined algorithms, parallel computations,
implicit numerical methods, MIMD computer
Subject classification. Computer Science
1. Introduction. Implicit methods are widely used in computational mechanics. Usually, the
systems of linearized and discretized equations obtained are of a banded type.
A very efficient direct solver, known as the Thomas algorithm, is used for solution of these systems
of equations in computational mechanics [1]. The Thomas algorithm represents a version of the Gauss
Elimination method for band matrix systems. For multi-dimensional cases, the Alternating Direction
Implicit (ADI) methods are based on solution of linear systems with band matrices corresponding to
finite-difference discretization schemes in each direction.
The usual way of parallelizing numerical schemes for MIMD computers is to partition the
computational field into multiple subdomains, then to allocate each subdomain to a different processor.
This is done so as to minimize communication, delay and processor idle time.
Parallelization of implicit solvers that use the Thomas algorithm for the solution of large banded
linear systems of equations is hindered by global spatial data dependencies. Parallel versions of the
Thomas algorithm are of the pipelined type. A pipeline in a parallel program involves each processor
performing the same set of operations on a successive (continuous) stream of data. Pipelines occur
*ICASE, NASA Langley Research Center, Hampton, VA 23681-2199, e-mail: aeralpo@icase.edu. This research was
supported by the National Aeronautics and Space Administration under NASA Contract No. NAS1-97046 while the
author was in residence at the Institute for Computer Applications in Science and Engineering (ICASE), NASA Langley
Research Center, Hampton, VA 23681-2199.
due to the recurrence of data within a loop [2]. The main disadvantage is that during the pipelined
process some processors will be idle when we switch from the forward to the backward steps of the
Thomas algorithm.
In order to avoid pipelining, some parallel Thomas algorithms include the reduction of a banded
system of equations on P "slave" processors, the solution of the reduced system of size O(P) on the
"master" processor, the broadcast of this solution to the "slaves", and simultaneous computation of
the final solution on P processors. This algorithm includes global communications of "slave-master"
type and additional computations on each "slave" processor. F. Gustavson and A. Gupta [3] recently
have proposed such an improved parallel algorithm for tri-diagonal systems. This algorithm has a
redundancy of two: it requires twice as many computational operations per grid point as the serial
Thomas algorithm.
Implementation of internal boundary conditions eliminates far-field data dependencies, allowing
band matrix systems to be solved independently on each processor. However, modification of either the
finite-difference approximation or the implicitness of the scheme due to interface boundary conditions
can cause accuracy, stability and convergence properties to deteriorate relative to the original serial
method [4].
Naik et al. [5] reduced the parallelization penalty of the solution of tri-diagonal systems by sending
necessary information to neighboring processors for groups of rendered lines at forward and backward
steps of the Thomas algorithm. The authors of [5] derived the optimal number of lines to be solved
per message as a function of computation time per grid point and communication time. This pipelined
Thomas algorithm (PTA) does not need global communication and performs the same computations
as the serial Thomas algorithm; however, at two points in this algorithm each processor has to wait
for all processors ahead.
There is no available systematic study of the various formulations of the pipelined Thomas
algorithms; therefore we begin with theoretical proofs about processor idle time for these algorithms,
using the unified approach of two-step pipelined algorithms.
In this paper a new pipelined Thomas algorithm is developed. This algorithm is designed for
parallel multi-domain solution of band matrix linear systems. We call it the Immediate Backward
pipelined Thomas Algorithm (IB-PTA): the backward step is performed line-by-line immediately after
the forward step has been completed for these lines. This algorithm provides exactly the same solution
as the serial Thomas algorithm.
The advantage of the IB-PTA over the basic PTA is that some lines have been completed by
the backward step of the Thomas algorithm before the processors are idle. In non-linear and multi-
dimensional problems the IB-PTA may be used for other computational tasks at the time when
processors are idle. These tasks include computation of the Jacobians of the original non-linear
systems, computation of right hand sides of ADI equations and computation by the Thomas algorithm
in the next spatial direction. Use of the IB-PTA should lead to a considerable reduction of idle
processor time.
Practical use of the proposed IB-PTA is based on a static processor schedule, which has been
created before computations by the Thomas algorithm are executed. The rest of the article describes
the recursive algorithm for the assignment of this schedule.
Implementation of the IB-PTA within a 3-D PDE solver is the scope of another study [6].
The article contains four sections. In section 2, various formulations of the Thomas pipelined
algorithms are discussed. In section 3, a formal definition and study of two-step pipelined algorithms
(PA) is presented. In section 4, the procedure to generate the computational and communicational
schedule of processors is presented.
2. The pipelined Thomas algorithms. The theory of pipelined Thomas algorithms developed
in this paper can be applied to various types of matrices, including block bandwidth and n-width band
matrices. In numerical analysis we usually have to solve the system of N × L equations which is split
into L banded systems of N equations. Each banded system corresponds to one line of numerical grid,
and commonly N _ L. The set of L systems where each system is a scalar tridiagonal matrix N × N
is taken here as an example:
(1) $a_{k,l} x_{k-1,l} + b_{k,l} x_{k,l} + c_{k,l} x_{k+1,l} = f_{k,l}, \quad k = 1, \ldots, N, \quad l = 1, \ldots, L,$
where $a_{k,l}$, $b_{k,l}$, $c_{k,l}$ are the coefficients, $x_{k,l}$ are the unknown variables, and L is the number of lines.
The version of the Thomas algorithm for the scalar tridiagonal case is as follows. The forward (the
first) step of the serial Thomas algorithm is
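(2) $d_{1,l} = b_{1,l}, \quad d_{k,l} = b_{k,l} - a_{k,l} \dfrac{c_{k-1,l}}{d_{k-1,l}}, \quad k = 2, \ldots, N,$
$g_{1,l} = \dfrac{f_{1,l}}{d_{1,l}}, \quad g_{k,l} = \dfrac{-a_{k,l} g_{k-1,l} + f_{k,l}}{d_{k,l}}, \quad k = 2, \ldots, N.$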
The backward (the second) step is
(3) $x_{N,l} = g_{N,l}, \quad x_{k,l} = g_{k,l} - x_{k+1,l} \dfrac{c_{k,l}}{d_{k,l}}, \quad k = N-1, \ldots, 1.$
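The forward and backward steps (2) and (3) can be sketched for a single line as follows. This is a minimal illustration, not the author's code: the function name `thomas_solve` and the zero-based list layout (with `a[0]` and `c[-1]` unused) are assumptions made for the example.

```python
def thomas_solve(a, b, c, f):
    """Serial Thomas algorithm for one tridiagonal line.

    a, b, c are the sub-, main and super-diagonal coefficients and f is the
    right-hand side, all of length N. a[0] and c[-1] are ignored.
    """
    n = len(b)
    d = [0.0] * n
    g = [0.0] * n
    # Forward step, Eq. (2).
    d[0] = b[0]
    g[0] = f[0] / d[0]
    for k in range(1, n):
        d[k] = b[k] - a[k] * c[k - 1] / d[k - 1]
        g[k] = (f[k] - a[k] * g[k - 1]) / d[k]
    # Backward step, Eq. (3).
    x = [0.0] * n
    x[n - 1] = g[n - 1]
    for k in range(n - 2, -1, -1):
        x[k] = g[k] - x[k + 1] * c[k] / d[k]
    return x
```

For a set of L lines the same routine would simply be applied to each line l in turn; no pivoting is performed, so the sketch assumes a system for which the Thomas algorithm is stable (for instance, a diagonally dominant one).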
For pipelined Thomas parallel algorithms considered in this article, the coefficients of Eq. (1) are
mapped onto P processors so that the subset
$\{a_{k,l}, b_{k,l}, c_{k,l} \mid k = N(p-1)/P + 1, \ldots, Np/P, \ l = 1, \ldots, L\}$ belongs to the pth processor.
We now briefly describe the pipelined Thomas Algorithm (PTA) [5]. Rendering the lth line,
the pth processor receives coefficients $d_{N(p-1)/P,l}$, $g_{N(p-1)/P,l}$ from the (p - 1)th processor; computes
the forward step coefficients $d_{k,l}$ and $g_{k,l}$, where k = N(p - 1)/P + 1, ..., Np/P; sends coefficients
$d_{Np/P,l}$, $g_{Np/P,l}$ to the (p + 1)th processor; and repeats computations (2) for the next lines until all
the forward step computations are completed. The order of computation of lines by the forward and
backward steps of the PTA is shown in Fig. 1a. The three upper lines in Fig. 1a are computed
by the forward step computations. After completion of all forward step computations specific to a
single processor, the pth processor (except the last) has to wait for the completion of the forward step
computations by all processors ahead of it. The last outermost (Pth) processor starts the backward
step computations (3) first. Other processors proceed with the backward step computations in a
manner similar to the forward step computations. The three lower lines in Fig. 1a represent the
backward step computations. The coefficients d and g of the forward step at interface boundaries are
required to begin computations in the (p + 1)th processor; similarly, the computed x values of
the backward step are required to begin computations in the (p - 1)th processor.
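The data flow of the PTA for one line can be emulated sequentially, as in the sketch below. It is only an illustration under stated assumptions: the P processors are replaced by a loop over P equal blocks, the "messages" are plain Python variables, the name `pta_emulated` is hypothetical, and the interface value of c is assumed to be readable by the receiving block (the text only mentions d and g being sent).

```python
def pta_emulated(a, b, c, f, P):
    """Sequential emulation of the pipelined Thomas algorithm for one line.

    The N unknowns are split into P equal blocks; each block stands for
    one processor. Zero-based indices are used throughout.
    """
    n = len(b)
    if n % P != 0:
        raise ValueError("N must be divisible by P in this sketch")
    size = n // P
    d, g, x = [0.0] * n, [0.0] * n, [0.0] * n

    # Forward step: blocks 0, 1, ..., P-1 in pipeline order.
    fwd_msg = None  # (d, g) at the last point of the previous block
    for p in range(P):
        lo, hi = p * size, (p + 1) * size
        for k in range(lo, hi):
            if k == 0:
                d[k] = b[k]
                g[k] = f[k] / d[k]
                continue
            # At a block interface the values come from the message.
            d_prev, g_prev = fwd_msg if k == lo else (d[k - 1], g[k - 1])
            d[k] = b[k] - a[k] * c[k - 1] / d_prev
            g[k] = (f[k] - a[k] * g_prev) / d[k]
        fwd_msg = (d[hi - 1], g[hi - 1])  # "sent" to block p + 1

    # Backward step: blocks P-1, ..., 0.
    bwd_msg = None  # x at the first point of the following block
    for p in range(P - 1, -1, -1):
        lo, hi = p * size, (p + 1) * size
        for k in range(hi - 1, lo - 1, -1):
            if k == n - 1:
                x[k] = g[k]
                continue
            x_next = bwd_msg if k == hi - 1 else x[k + 1]
            x[k] = g[k] - x_next * c[k] / d[k]
        bwd_msg = x[lo]  # "sent" to block p - 1
    return x
```

Because each block performs exactly the operations of (2) and (3) on its own indices, the result does not depend on P; only the order in which blocks may start is constrained by the two messages.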
Processors are idle between the forward and backward steps (see Fig. 3a), and there is no data
available for other computational tasks by this time.
The algorithm denoted as the Immediate Backward PTA (IB-PTA) is shown in Fig. 1b. It is
assumed that the computational time per part of a line, corresponding to the steps performed by a
single processor, is equal to unity for both forward and backward steps. Actually these times are
different for forward and backward steps of the Thomas algorithm. This issue will be discussed in the
next sections. First, lines are computed by the forward step until the first line is completed on the last
processor (the two upper lines in Fig. 1b). Then the backward step computations for each line start
immediately after the completion of the forward step on the last processor (the next four lines in Fig.
1b). Each processor switches between the forward and backward steps of the Thomas algorithm as
shown in Fig. 2a at two sequential time units. Each processor communicates with its neighbors to get
necessary data for beginning of either the forward or backward computations for next line. Finally,
remaining lines are computed by the backward step computations and there are no available lines for
the forward step computations (the two lower lines in Fig. 1b).
Two other versions of the pipelined Thomas algorithms are shown in Fig. 1c-d. These formulations
of the PTA, denoted as the first-last pipelined Thomas algorithms (FL-PTAs), start the forward step
computations from the two outermost processors (indeed, each line starts from one side). Each
processor switches between pipelined computations in increasing (from the first to the Pth processor)
and decreasing directions. Four upper lines in Fig. 1c are computed by forward step computations.
Outermost processors start their forward step computations simultaneously, and middle processors
are idle waiting for available data. Then outermost processors start the backward computations
simultaneously and perform them in the same way as the forward step computations (the four lower lines
in Fig. 1c). Combination of the FL-PTA and the IB-PTA leads to the First-Last Immediate Backward
Pipelined Thomas algorithm (FL-IB-PTA), where four fluxes of data are treated simultaneously (Fig.
1d). Outermost processors start the forward step computations simultaneously as in the case of the
FL-PTA. The backward step computations for each line start immediately after completion of this
line by the forward step computations.
Now we can calculate the processor idle time for the basic PTA and the IB-PTA. The idle time
for the pth processor is time between the start of the forward step computations and completion of
the backward step computations on the pth processor, when this processor is idle. The maximum
processor idle time for the system of P processors is equal to the maximum of the idle times of
individual processors.
In the basic PTA, the pth processor waits for the completion of the forward step computations for
the last line by P-p processors ahead of it, and for the completion of the backward step computations
for the first line by the same P - p processors. Thus, the processor idle time is equal to 2(P - p) for
the pth processor (see Fig 3a).
Performing the IB-PTA, the pth processor has computed 2(P - p) + 1 lines by the forward step
before it starts the backward step computations for the first line. At that point, each processor spends
one half of its time computing new lines by the forward step and the other half computing the previous
lines by the backward step. Once all lines are completed by the forward step, processors are busy
only one half of the time executing the backward step computations (see Fig. 3b). At this time there
are 2(P - p) + 1 lines uncompleted by the backward step. Thus, the idle time of the pth processor
2(P - p) is the same as for the PTA.
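The idle-time count for the IB-PTA can be checked on small cases with a unit-time simulation. The sketch below is an assumption-laden illustration, not the schedule generator of this paper: it uses a simple greedy rule (a processor runs a backward task whenever one is ready, otherwise the lowest-numbered ready forward task), and the name `simulate_ib_pta` is hypothetical.

```python
def simulate_ib_pta(P, L):
    """Greedy unit-time simulation of an immediate-backward pipeline.

    Returns (elapsed, idle), where idle[p - 1] is the number of idle time
    units of processor p between its first and last busy time unit.
    """
    # fwd[p][l] / bwd[p][l]: time unit in which the step ran (0 = not yet).
    fwd = [[0] * (L + 1) for _ in range(P + 2)]
    bwd = [[0] * (L + 1) for _ in range(P + 2)]
    first = [0] * (P + 1)
    last = [0] * (P + 1)
    remaining = 2 * P * L
    t = 0
    while remaining:
        t += 1
        for p in range(1, P + 1):
            task = None
            # Backward work has priority (assumed greedy policy).
            for l in range(1, L + 1):
                if bwd[p][l]:
                    continue
                src = fwd[P][l] if p == P else bwd[p + 1][l]
                if 0 < src < t:  # input finished in an earlier time unit
                    task = (bwd, l)
                    break
            if task is None:
                for l in range(1, L + 1):
                    if fwd[p][l]:
                        continue
                    if p == 1 or 0 < fwd[p - 1][l] < t:
                        task = (fwd, l)
                        break
            if task is not None:
                table, l = task
                table[p][l] = t
                first[p] = first[p] or t
                last[p] = t
                remaining -= 1
    idle = [(last[p] - first[p] + 1) - 2 * L for p in range(1, P + 1)]
    return t, idle
```

For P = 3 and L = 3 the simulation gives an elapsed time of 10 time units and idle times 4, 2 and 0 for processors 1, 2 and 3, which agrees with 2(P - p) in this small case.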
Obviously, there are a lot of possible versions of PTAs. In order to predict the processor idle time
for any PTA, a unified approach is developed in the following section.
3. Theoretical estimation of processor idle time for the PTA.
3.1. Two-step pipelined algorithms. Suppose there are L lines of data. Each line is split
equally between P processors and is treated as follows: The first (forward) step computations start
from the first outermost processor and run in a pipelined way to the last outermost processor. The
second (backward) step for any line is performed from the Pth to the first processor after the first
step has been completed for this line.
This algorithm is fully characterized by the $I_1(p, l)$ and $I_2(p, l)$ functions that are the times of the
beginning of the first and second step computations, respectively, for the lth line on the pth processor.
The computational time per fraction of a line assigned to a single processor is equal to unity for both
(forward and backward) steps on a single processor.
To make the algorithm valid, the following relations must be satisfied for the $I_1$ and $I_2$ functions:
(4) $I_r(p,i) \neq I_s(p,j), \quad \forall r,s \in \{1,2\}, \ i \neq j;$
(5) $I_1(p,i) \geq I_1(p-1,i) + 1, \quad p = 2, \ldots, P;$
(6) $I_2(P,i) \geq I_1(P,i) + 1;$
(7) $I_2(p,i) \geq I_2(p+1,i) + 1, \quad p = P-1, \ldots, 1,$
where i, j = 1, ..., L are line numbers, p = 1, ..., P are processor numbers, and $r \in \{1, 2\}$ and $s \in \{1, 2\}$
are indices specifying either the $I_1$ or $I_2$ function.
Condition (4) states that each processor computes one step (forward or backward) for a single
line per time unit. The pipelined nature of the algorithms considered here leads to inequalities (5,
7) for the first and second steps, respectively. Inequality (6) corresponds to the condition that the
second step starts after completion of the first step for each line.
Additionally, if there are some available lines that have been completed by a previous processor
or that are ready to start from the outermost processor, the line with minimum sequential number
is computed first. Thus, the outermost processors execute the forward and backward steps for lines
in increasing order. This requirement, denoted as A-requirement, is made for convenience in the
following discussion.
The maximum elapsed time is defined for a system of processors as the time interval between the
beginning of computation of the first line and the completion of computation of the last line. For the
PA, the elapsed time is equal to
(8) $T_e = I_2(1, L) - I_1(1, 1) + 1.$
The useful time of each processor is equal to 2L, as computational time is equal to unity per
fraction of line assigned to a single processor. The maximum processor idle time for the system of
processors is defined as the maximum idle time over all processors performing the PA, and is equal to
$T_e - 2L$.
The following theorem is valid for the PA.
THEOREM 3.1. For the PA, the maximum elapsed time is greater than or equal to 2L + 2(P - 1).
There exist functions $I_1$ and $I_2$ satisfying conditions (4-7) such that $T_e = 2L + 2(P - 1)$ for these $I_1$
and $I_2$.
Proof. The proof of the first part of the theorem is done by the method of mathematical induction
on the number of processors P.
For P = 1 (a single processor is involved) the elapsed time is equal to 2L for any number of lines
L.
Suppose for $p \leq P$ the theorem is satisfied. Let the (P + 1)th processor render the same L lines.
Assume, to show a contradiction, that for P + 1 processors the elapsed time is
(9) $T_e(P+1) < 2L + 2((P+1) - 1)$
for some particular $I_1$ and $I_2$ satisfying (4-7).
Now switch off the first processor and consider the system of P processors 2, ..., P + 1 with $I_1^{new}$
and $I_2^{new}$ as follows:
(10) $I_r^{new}(p,i) = I_r(p+1,i), \quad p = 1, \ldots, P, \ r \in \{1,2\}.$
Note that the functions $I_1(p, l)$ and $I_2(p, l)$, where p = 2, ..., P + 1, remain the same as for the assumed
system of P + 1 processors. This set ($I_1^{new}$, $I_2^{new}$) satisfies (4-7). The elapsed time for this system of
P processors is equal to
(11) $T_e(P) = I_2^{new}(1,L) - I_1^{new}(1,1) + 1 = I_2(2,L) - I_1(2,1) + 1.$
According to conditions (5,7)
(12) $I_2(2,L) \leq I_2(1,L) - 1, \quad I_1(2,1) \geq I_1(1,1) + 1.$
Therefore,
(13) $T_e(P) \leq T_e(P+1) - 2 < 2L + 2(P-1).$
This contradiction of the induction hypothesis ($T_e(P) \geq 2L + 2(P - 1)$) proves the first statement of
the theorem.
One particular distribution of $I_1$ and $I_2$
(14) $I_1(p,l) = p + l - 1, \quad I_2(p,l) = P - p + l + (P + L - 1) = 2P + L + l - p - 1$
meets conditions (4-7) and
(15) $T_e = I_2(1,L) - I_1(1,1) + 1 = 2L + 2(P-1).$
This proves the second statement of Theorem 3.1. []
The distribution (14) of $I_1$ and $I_2$ corresponds to the basic PTA (see the previous section).
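The claims about distribution (14) can be checked numerically for any given P and L. The sketch below builds the two start-time tables, tests conditions (4)-(7) and evaluates the elapsed time (8) and the per-processor idle time. The function names are hypothetical, and the check of condition (4) is slightly stronger than stated in the text: it requires all 2L start times on a processor to be distinct.

```python
def basic_pta_schedule(P, L):
    """Start times I1(p, l) and I2(p, l) of distribution (14)."""
    I1 = {(p, l): p + l - 1
          for p in range(1, P + 1) for l in range(1, L + 1)}
    I2 = {(p, l): 2 * P + L + l - p - 1
          for p in range(1, P + 1) for l in range(1, L + 1)}
    return I1, I2


def satisfies_conditions(I1, I2, P, L):
    """Check conditions (4)-(7) for a one-way two-step pipelined algorithm."""
    lines = range(1, L + 1)
    for p in range(1, P + 1):
        # (4): at most one step of one line per time unit on each processor.
        times = [I1[p, l] for l in lines] + [I2[p, l] for l in lines]
        if len(set(times)) != len(times):
            return False
    for i in lines:
        if any(I1[p, i] < I1[p - 1, i] + 1 for p in range(2, P + 1)):  # (5)
            return False
        if I2[P, i] < I1[P, i] + 1:                                    # (6)
            return False
        if any(I2[p, i] < I2[p + 1, i] + 1 for p in range(1, P)):      # (7)
            return False
    return True


def elapsed_time(I1, I2, L):
    """Elapsed time of the system, Eq. (8)."""
    return I2[1, L] - I1[1, 1] + 1


def idle_time(I1, I2, p, L):
    """Idle time of processor p: its own elapsed time minus useful time 2L."""
    return (I2[p, L] - I1[p, 1] + 1) - 2 * L
```

With P = 4 and L = 5, for example, the elapsed time evaluates to 16 = 2L + 2(P - 1), and the idle times of processors 1 to 4 are 6, 4, 2 and 0.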
There is the following corollary from Theorem 3.1.
COROLLARY 3.2. If the maximum idle time for P processors performing PA is equal to 2(P - 1)
then the idle time of the pth processor is equal to 2(P - p).
Proof. The pth processor starts to render its fraction of the first line p - 1 time units later than the
first processor, and the pth processor starts to render its fraction of the last line p - 1 time units earlier
than the first processor. Thus, the following relations hold for the $I_1$ and $I_2$ functions corresponding
to the first and the last lines:
(16) $I_1(p,1) = I_1(1,1) + p - 1, \quad I_2(1,L) = I_2(p,L) + p - 1.$
The maximum elapsed time for the system of P processors is calculated as follows:
(17) $T_e(P) = I_2(1,L) - I_1(1,1) = I_2(p,L) + (p-1) - (I_1(p,1) - (p-1)) = T_e(p) + 2(p-1).$
Therefore,
(18) $T_e(p) = T_e(P) - 2(p-1).$
According to Theorem 3.1, $T_e(P) = 2L + 2(P - 1)$; the elapsed time of the pth processor is given
by
(19) $T_e(p) = 2L + 2(P-1) - 2(p-1) = 2L + 2(P-p).$
The useful time of each processor is equal to 2L; therefore, the idle time is equal to 2(P - p). []
3.2. Two-step and first-last pipelined algorithms. We define the two-step and first-last
pipelined algorithms (FL-PA) starting from two outermost processors. Subsets $A_1$ and $A_2$ include
lines starting from the first and the Pth processors, respectively. There are two more fluxes of data
than for the one-way PAs defined in the previous subsection. The functions $I_3$ and $I_4$ are starting
times of the forward and backward computations, respectively, for lines belonging to $A_2$. These
functions have to satisfy the following relations similar to (5-7):
(20) $I_3(p,i) \geq I_3(p+1,i) + 1, \quad p = P-1, \ldots, 1;$
(21) $I_4(1,i) \geq I_3(1,i) + 1;$
(22) $I_4(p,i) \geq I_4(p-1,i) + 1, \quad p = 2, \ldots, P;$
where $i \in A_2$.
Conditions (5-7) must be satisfied for $i \in A_1$. The condition (4) with $r, s \in \{1,2,3,4\}$ and the
A-requirement must be satisfied for all functions $I_r$, where $r \in \{1,2,3,4\}$. According to the definition of
the maximum elapsed time for P processors, this time for FL-PAs is given by
(23) $T_e = \max(I_2(1, L_1), I_4(P, L_2)) - \min(I_1(1, F_1), I_3(P, F_2)) + 1,$
where $F_1$, $F_2$, $L_1$, $L_2$ are the first and last lines in subsets $A_1$ and $A_2$, respectively.
We define a passage as a set of starting times of the pipelined forward or backward computations
on P processors for the ith line:
(24) $\{I_r(p,i) \mid p = 1, \ldots, P\}, \quad \{I_s(p,i) \mid p = P, \ldots, 1\},$
where $r \in \{1,3\}$, $s \in \{2,4\}$.
The FL-PAs drive the $|A_1| + |A_2| = L$ passages from the first to the last processor governed by
$I_1$ and $I_4$ functions and the L passages from the last to the first processor governed by $I_2$ and $I_3$
functions.
Now we will define the following reformulation of the FL-PA. This reformulated algorithm uses
the same passages as the original FL-PA, i.e.,
(25) $\forall i, r_1 \ \exists j, r_2: \quad I_{r_1}^{new}(p,i) = I_{r_2}(p,j);$
(26) $\forall i, s_1 \ \exists j, s_2: \quad I_{s_1}^{new}(p,i) = I_{s_2}(p,j);$
where p = 1, ..., P, $r_1, r_2 \in \{1, 3\}$, $s_1, s_2 \in \{2, 4\}$.
The lines start from the first processor in the following order: $F_1, \ldots, L_1$ (the forward step) then
$F_2, \ldots, L_2$ (the backward step), and from the last processor, $F_2, \ldots, L_2$ (the forward step), and then
$F_1, \ldots, L_1$ (the backward step). Here indices 1 and 2 correspond to subsets $A_1$ and $A_2$, respectively.
Now all backward computations starting from an outermost processor are performed later than the
forward computations starting from the same outermost processor:
(27) $\forall i_1 \in A_1 \ \forall i_2 \in A_2: \quad I_1^{new}(p, i_1) < I_4^{new}(p, i_2), \quad I_3^{new}(p, i_2) < I_2^{new}(p, i_1).$
Obviously, conditions (5,7,20,22) are satisfied for each passage of the FL-PA. The reformulated
algorithm meets these conditions as it uses the same passages as the original one. Conditions (6,21)
are satisfied because each processor completes forward step computations for each line no later than
and begins backward step computations for each line no earlier than in the original algorithm.
The forward step computations for the first lines ($F_1$ and $F_2$) for the reformulated and original
FL-PA start at the same time. The backward step computations for the last lines for the reformulated
and original FL-PA ($L_1$ and $L_2$) are completed at the same times. Thus, the maximum elapsed time
for this algorithm is equal to that for the original algorithm for the same number of processors.
LEMMA 3.3. The reformulation of the FL-PA described above does not change the maximum
elapsed time.
We define the symmetric FL-PA (SFL-PA) as a particular case of the FL-PA so that
(28) $I_1(p, i_1) = I_3(P-p+1, i_2), \quad I_2(p, i_1) = I_4(P-p+1, i_2),$
where $|A_1| = |A_2| = L/2$, $i_1 \in A_1$, $i_2 \in A_2$, p = 1, ..., P. This definition is valid for an even number
of processors P and an even total number of lines L.
All first-last pipelined Thomas algorithms described in the previous section fall into the class of
SFL-PA (see Fig. 1c,d).
In a more general sense, one may reformulate any FL-PA in such a way that the lines belonging
to $A_1$ start from the last outermost processor and the lines belonging to $A_2$ start from the first
processor. Obviously, this reformulation does not change the maximum elapsed time. Therefore, $T_e$ is
a symmetric function of its arguments ($T_e(I_1, I_2, I_3, I_4) = T_e(I_3, I_4, I_1, I_2)$), and the case of symmetric
FL-PA, $I_1 = I_3$, $I_2 = I_4$, corresponds to the local minimum of elapsed time of FL-PAs.
THEOREM 3.4. For the SFL-PA, the maximum elapsed time is greater than or equal to 2L + 2(P - 2).
There exist functions $I_1$, $I_2$, $I_3$ and $I_4$ meeting conditions (4-7, 20-22) such that $T_e = 2L + 2(P - 2)$
for these I functions.
Proof. The method of mathematical induction on the number of processors P with the induction
step equal to two is used to prove the first part of the theorem.
For P = 2 and L = 2 the processors can perform computations with zero idle time. The first and
the second processor start the forward step computations simultaneously for the first and the second
line. At the end of the first time unit, they exchange the interface data and complete the forward step
(the first processor renders the second line and vice versa). Processors compute the backward step
for two lines the same way. The functions $I_r$, r = 1, 2, 3, 4 are given by:
(29) $I_1(1,1) = 1, \ I_1(2,1) = 2, \ I_3(2,2) = 1, \ I_3(1,2) = 2,$
$I_2(1,1) = 4, \ I_2(2,1) = 3, \ I_4(1,2) = 3, \ I_4(2,2) = 4.$
For P = 2 and L = 2m, m > 1, the lines are coupled and treated as described above couple by
couple. The elapsed time is $T_e = 2L = 2L + 2(2 - 2)$.
Suppose that the statement of induction is valid for $p \leq P$. Assume that for p = P + 2 the elapsed
time is $T_e(P + 2) < 2L + 2((P + 2) - 2)$ for some set of I satisfying conditions (4-7, 20-22).
Let's reformulate this algorithm according to Lemma 3.3. This procedure does not change the
elapsed time. Obviously, the reformulated algorithm is also symmetrical.
Now switch off the two outermost processors. Define the new set $I_r^{new}$, $r \in \{1,2,3,4\}$ on P
processors
(30) $I_1^{new}(p, i_1) = I_1(p+1, i_1),$
$I_3^{new}(p, i_2) = I_3(p+1, i_2),$
$I_2^{new}(p, i_1) = I_2(p+1, i_1) - 2,$
$I_4^{new}(p, i_2) = I_4(p+1, i_2) - 2,$
where p = 1, ..., P, $i_{1,2} = 1, \ldots, L/2$.
The SFL-PA characterized by the set $I^{new}$ starts the backward computations two time units earlier
than the assumed SFL-PA on P + 2 processors. There are no forward or backward step computations
on switched-off processors; therefore, the backward step computations on the second and on the
(P + 1)th processors can start immediately after completion of the forward computations for the last
lines ($L_1 \in A_1$, $L_2 \in A_2$).
According to Lemma 3.3, forward computations for $i_1 \in A_1$ are executed earlier than backward
computations for $i_2 \in A_2$. According to the definition of symmetrical FL-PA, all forward computations
are performed earlier than all backward computations on all processors. Therefore, condition (4) is
satisfied for I corresponding to different, forward and backward, pipelined computations. The set
$I^{new}$ meets conditions (4-7, 20-22) as $I_{1,3}^{new}$ are equal to $I_{1,3}$ and $I_{2,4}^{new}$ are equal to $I_{2,4} - 2$.
The elapsed time for this SFL-PA on P processors is derived in the same manner as in the proof
of Theorem 3.1 (see (11)):
(31) $T_e(P) = I_2^{new}(1, L_1) - I_1^{new}(1, F_1) + 1 = (I_2(2, L_1) - 2) - I_1(2, F_1) + 1$
$\leq (I_2(1, L_1) - 3) - (I_1(1, F_1) + 1) + 1 = T_e(P+2) - 4 < 2L + 2(P-2).$
This contradiction of the induction hypothesis proves the first statement of the theorem.
One example with the minimum elapsed time is presented here for P = 2q and L = 2m:
(32) $I_1 = \begin{cases} p + l - 1 & \text{if } p \leq q, \\ m + q - 1 + (p - q) + l - 1 & \text{if } p > q; \end{cases} \quad 1 \leq l \leq m,$
(33) $I_3 = \begin{cases} P - p + l - m & \text{if } p > q, \\ m + q - 1 + q - p + l - m & \text{if } p \leq q; \end{cases} \quad m + 1 \leq l \leq L.$
The functions $I_2$ and $I_4$ are analogous to $I_3$ and $I_1$, respectively. One half of the lines is rendered by
the forward step beginning from the first processor and the other half is computed starting from the
last processor. After completion of the first half of lines by the processors 1, ..., q and simultaneous
completion of the second half by the processors P, P - 1, ..., q + 1, processors q and q + 1 exchange
the interface data and then continue to render lines by the forward step. The first q processors now
render the second half of lines and vice versa. After a simultaneous completion of the forward step,
processors compute the backward step the same way. For both groups of processors, it takes m + q - 1
units of time to render m lines either by backward or forward step.
These subsets of q processors are referred to as single super-processors, and subsets $A_1$ and $A_2$ of
m lines are referred to as single super-lines. Thus, there are two super-processors and two super-lines,
as in the basis of induction, which has been considered in the proof of this Theorem. Thus, the elapsed
time is equal to 4(m + q - 1) = 2L + 2(P - 2). []
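The forward-step part of the example (32)-(33) can be tabulated and checked directly. The sketch below is an illustration with hypothetical function names: it builds $I_1$ and $I_3$, verifies that no processor is assigned two tasks in one time unit and that the pipelining conditions (5) and (20) hold, and returns the length of the forward phase. Doubling that length to obtain the elapsed time relies on the statement above that the backward step is computed the same way.

```python
def sfl_forward_schedule(q, m):
    """Forward start times of Eqs. (32)-(33) for P = 2q, L = 2m."""
    P, L = 2 * q, 2 * m
    I1, I3 = {}, {}
    for p in range(1, P + 1):
        for l in range(1, m + 1):          # lines of the first subset
            if p <= q:
                I1[p, l] = p + l - 1
            else:
                I1[p, l] = m + q - 1 + (p - q) + l - 1
        for l in range(m + 1, L + 1):      # lines of the second subset
            if p > q:
                I3[p, l] = P - p + l - m
            else:
                I3[p, l] = m + q - 1 + (q - p) + l - m
    return I1, I3


def forward_phase_length(q, m):
    """Return the last forward time unit, or None if the schedule is invalid."""
    P, L = 2 * q, 2 * m
    I1, I3 = sfl_forward_schedule(q, m)
    for p in range(1, P + 1):
        times = [I1[p, l] for l in range(1, m + 1)]
        times += [I3[p, l] for l in range(m + 1, L + 1)]
        if len(set(times)) != L:           # one task per time unit
            return None
    for l in range(1, m + 1):              # condition (5)
        if any(I1[p, l] < I1[p - 1, l] + 1 for p in range(2, P + 1)):
            return None
    for l in range(m + 1, L + 1):          # condition (20)
        if any(I3[p, l] < I3[p + 1, l] + 1 for p in range(1, P)):
            return None
    return max(max(I1.values()), max(I3.values()))
```

For q = 3 and m = 4 (P = 6, L = 8) the forward phase takes 12 = 2(m + q - 1) time units, so forward plus backward take 24 = 2L + 2(P - 2) time units.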
Therefore, SFL-PAs have only a small advantage in terms of the maximum idle processor time
over the basic PTA.
3.3. Applications to the pipelined Thomas algorithms. All pipelined Thomas algorithms
described in the second section belong to PAs. Explicit derivation of the I functions can be a
cumbersome task for some versions of PTA. It is difficult to do this for the First-Last Immediate Backward
pipelined Thomas Algorithm (Fig. 1d) since it treats four fluxes of data simultaneously. Instead, one
can use Theorems 3.1 and 3.4 to estimate the elapsed time and the processor idle time.
The first two algorithms, the basic PTA (Fig. 1a) and the Immediate Backward pipelined Thomas
Algorithm (IB-PTA) (Fig. 1b) have equal minimum elapsed times (Theorem 3.1). The first-last
algorithms (Fig. 1c,d) have a small advantage of two time units idle time for an even number of
lines (Theorem 3.4). Thus, these pipelined Thomas algorithms do not reduce the processor idle time
considerably. However, for the IB-PTA and for the FL-IB-PTA there are completed lines by the time
processors become idle. These processors can be used for other computational tasks while they are
idle from the Thomas algorithm.
For real MIMD computers, there is a latency time of communications between processors.
Therefore, it is advantageous to solve a number of lines per single message, i.e., to transfer coefficients (for
the forward step) or the solution (for the backward step) for a number of lines.
The computational times per node may not be equal for forward and backward steps. For example,
there are three multiplication operations per grid point for the forward step and two multiplication
operations per grid point for the backward step of the tridiagonal Thomas algorithm. The number of
portions of lines may be different for forward and backward steps of the pipelined Thomas algorithm.
Those lines that have been completed by the forward step are gathered in groups of lines on the last
processor (see the next section for details). Sets $A_f$ and $A_b$ include portions of lines (not single lines)
for the forward and backward computations, respectively. The same number of lines is treated by
forward and backward steps:
(34) $|A_f| K_1 = |A_b| K_2,$
where $K_1$ and $K_2$ are numbers of lines solved per message for forward and backward steps, respectively.
To avoid idle time, the amount of computational work should be equal for the forward and
backward steps of the Immediate Backward PTAs. Thus, numbers of lines solved per one message for
forward and backward steps must satisfy the following equation:
(35) $N K_1 g_1 = N K_2 g_2,$
where $g_1$, $g_2$ are computational times per grid node for forward and backward steps. Therefore, the
computational time per fraction of a portion of lines assigned to a single processor is taken to be
equal to unity for forward and backward step computations. As in the previous cases of the
line-by-line treatment, processors switch between forward and backward steps of the IB-PTA. However,
processors may treat several portions of lines by the forward step computations prior to switching to
the backward step computations (compare Fig. 2a and b).
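Equation (35) fixes only the ratio $K_1 / K_2 = g_2 / g_1$. A small sketch of how the smallest integer pair could be obtained is given below; the function name `lines_per_message` is hypothetical, and using the multiplication counts (three and two) as a stand-in for the measured times $g_1$ and $g_2$ is an assumption made only for the example.

```python
from fractions import Fraction


def lines_per_message(g1, g2):
    """Smallest integers (K1, K2) with K1 * g1 == K2 * g2, as in Eq. (35).

    g1 and g2 are the forward and backward costs per grid node, given as
    integers or Fractions so that the ratio is exact.
    """
    ratio = Fraction(g2) / Fraction(g1)  # K1 / K2 in lowest terms
    return ratio.numerator, ratio.denominator


# Assumed costs: 3 multiplications forward, 2 multiplications backward.
K1, K2 = lines_per_message(3, 2)
print(K1, K2)
```

With these assumed costs the sketch yields two lines per forward message and three lines per backward message. In practice the values would be multiples of this pair chosen against the communication latency.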
To analyze these versions of the pipelined Thomas algorithms which render lines portion-by-portion,
we define the extended pipelined algorithms (EPA) that are analogous to the PAs defined in
subsection 3.1. The I functions must satisfy relations analogous to (4-7), where i, j are the numbers
of portions of lines. The condition (6) is given by
(36) $I_1(P,i) < I_2(P,j),$
where $iK_1 < jK_2$. This condition states that all lines belonging to the ith portion have been completed
by the forward step computations prior to the start of the backward step computations for any of
these lines.
The first-last portion-by-portion algorithms are analogous to FL-PTAs (subsection 3.2). The $I$
functions should satisfy conditions analogous to (4-7) and (20-22). We define a symmetric FL-EPA,
denoted as SFL-EPA, as follows:
(37) $I_1(p, i_{1f}) = I_3(P - p + 1, i_{2f})$, $I_2(p, i_{1b}) = I_4(P - p + 1, i_{2b})$,
where $|A_{1f}| = |A_{2f}| = L_f/2$, $|A_{1b}| = |A_{2b}| = L_b/2$, $i_{1f} \in A_{1f}$, $i_{2f} \in A_{2f}$,
$i_{1b} \in A_{1b}$, $i_{2b} \in A_{2b}$, $p = 1, \ldots, P$.
The proofs of Theorems 3.1 and 3.4 can be used straightforwardly to prove the following theorem.
THEOREM 3.5. For the EPA, the maximum elapsed time of the system of processors is greater
than or equal to
$T = L_f + L_b + 2(P - 1)$.
For the SFL-EPA, the maximum elapsed time is greater than or equal to
$T = L_f + L_b + 2(P - 2)$.
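As a worked instance of these two bounds, take the values quoted in the caption of Fig. 6, $P = 5$, $L_f = 9$, $L_b = 6$. Applying them to the symmetric first-last variant as well is an assumption made here purely for illustration, since Fig. 6 itself shows one-way algorithms.

```latex
\begin{align*}
\text{EPA:}     \quad & L_f + L_b + 2(P - 1) = 9 + 6 + 2 \cdot 4 = 23, \\
\text{SFL-EPA:} \quad & L_f + L_b + 2(P - 2) = 9 + 6 + 2 \cdot 3 = 21.
\end{align*}
```

For these values the two bounds differ by two time units.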
The various formulations of the PTA considered in the previous section can be formulated in this
case where portions of lines are treated per single interface message. If the same number of lines, say,
$K_1$ for the forward step and $K_2$ for the backward step, are solved per message for the basic EPA and
the IB-EPA, the minimum of processor idle times is equal for these algorithms.
4. Generation of processor schedule. At each time unit each processor either performs forward
or backward step computations or is idle. For the basic PTA the processor scheduling is simple:
each processor executes the forward step, becomes idle and then executes the backward step. Each processor
starts computations immediately after the corresponding data are available from its neighbor.
In this case, communications govern computational tasks and there is no need to schedule the
processors before computations.
For other considered versions of the Thomas pipelined algorithms, there are more simultaneous
fluxes of data in a system of processors and the order of processor tasks is more complicated. For
example, completed forward step coefficients from the previous processor are not transferred to the
current processor immediately after they have been computed, as this processor may execute the
backward step computations at the next time unit. Our earlier attempts to govern processors by
communications have led to cumbersome code patterns. The static processor schedule should be
computed prior to running the PTA. This schedule governs the sequence of computations and
communications on each processor.
To set up this schedule, let us define the variable $J(p, i)$ that governs the sequence of the forward
and backward computations and the idle state:
(38) $J(p, i) = \begin{cases} +1 & \text{forward step computations,} \\ 0 & \text{processor is idle,} \\ -1 & \text{backward step computations,} \end{cases}$
where $p$ is the processor number and $i$ is the serial number of the time unit. The first time unit
on each processor corresponds to computation of the first portion of lines by the forward step on this
processor. Therefore, $J(p, i)$ and $J(p + 1, i)$ correspond to two consecutive time units. This
variable is more convenient for computer programming of the pipelined Thomas algorithms than
the explicit use of the $I(p, l)$ functions.
We confine ourselves to one-way PTAs in this section.
THEOREM 4.1. For any given pipelined Thomas algorithm on $P$ processors with the minimum
elapsed time there exists a PTA on $P + 1$ processors with the same processor schedule for all processors
except the first (outermost) processor such that the elapsed time of this PTA is minimum.
Proof. Consider the pipelined Thomas algorithm with minimum elapsed time ($2L + 2(P - 1)$) on
$P$ processors. We construct here a schedule for the additional processor meeting conditions of this
theorem.
Renumber processors $1, \ldots, P$ to $2, \ldots, P + 1$. First, we formulate the schedule for the first (new)
processor as follows:
(39) $J(1, i) = 1$ if $J(2, i) = 1$,
     $J(1, i + 2) = -1$ if $J(2, i) = -1$,
     $J(1, i) = 0$ otherwise,
where index $i$ runs from 1 to $2L + 2(P - 1)$ on the second processor.
There are two more elapsed time units on the first processor than on the second one; therefore,
this algorithm has the minimum elapsed time, $2L + 2(P - 1) + 2$, on $(P + 1)$ processors.
Now we have to check that the first processor computes no more than one portion of lines at
each time unit (condition (4) in terms of $I$ functions). This means that the assignment (39) does not
assign 1 and -1 to the same $i$th time unit on the first processor. This situation may occur only as
a result of a switch from the backward step to the forward step on the second processor such that
$J(2, i_c - 2) = -1$ and $J(2, i_c) = 1$. Note that the switch from the backward step to the forward step,
such that $J(2, i_c - 2) = -1$, $J(2, i_c - 1) = 1$ and $J(2, i_c) = -1$, does not lead to this collision.
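The collision described above can be reproduced with a literal implementation of assignment (39). This is a minimal sketch: the function name `assign_by_39` and the two short test schedules are illustrative assumptions, and because Python lists are 0-based, list index `i` stands for time unit $i + 1$.

```python
def assign_by_39(j2):
    """Apply assignment (39) literally to the schedule of processor 2.

    Returns the schedule of processor 1 and the 1-based time units on
    processor 1 that were assigned more than one task (collisions).
    """
    j1 = [0] * (len(j2) + 2)          # two extra time units on processor 1
    collisions = []
    for i, state in enumerate(j2):    # list index i is time unit i + 1
        if state == 1:
            target = i                # J(1, i) = 1
        elif state == -1:
            target = i + 2            # J(1, i + 2) = -1
        else:
            continue
        if j1[target] != 0:
            collisions.append(target + 1)
        j1[target] = state
    return j1, collisions


# Backward, forward, backward in consecutive time units: no collision.
print(assign_by_39([1, -1, 1, -1]))     # ([1, 0, 1, -1, 0, -1], [])
# J(2, 2) = -1 and J(2, 4) = 1 both map to time unit 4 on processor 1.
print(assign_by_39([1, -1, 1, 1])[1])   # [4]
```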
Consider the first switch leading to the collision. The assignment (39) maps one-to-one the set
$\{1 \le i \le i_c - 1 \mid J(2, i) \ne 0\}$ into the set $\{1 \le i \le i_c + 1 \mid J(1, i) \ne 0\}$. There exist at least two
$J(1, i)$, $i = 1, \ldots, i_c + 1$, that are equal to zero. As $J(1, i_c) = 1$, there exists at least one $1 \le i \le i_c - 1$
such that $J(1, i) = 0$. Consequently, there is an idle time unit on the first processor which may be
used for the forward step computations preceding those at the $i_c$th time unit on the second processor.
After every switch from the backward to the forward step, there is a switch from the forward to
the backward step, as the backward step computations are always performed last. Consider a switch from
the forward to the backward step, i.e., $J(2, i_e) = 1$ and $J(2, i_e + 1) = -1$. The value $J(1, i_e + 1)$ is
equal to zero unless $J(2, i_e - 1) = -1$ (see (39)). In this case there is no collision due to switching
from the backward to the forward step (see above). After each collision, an additional vacant time
unit appears due to the forward-to-backward switch, and it might be used to resolve the next collision.
Therefore, we adopt the following single-valued assignment:
(40) $J(1, i_{\min}) = 1$ if $J(2, i) = 1$,
     $J(1, i + 2) = -1$ if $J(2, i) = -1$,
     $J(1, i) = 0$ otherwise,
where $i_{\min} = \min\{1 \le j \le i \mid J(1, j) = 0\}$.
The obtained schedule meets the pipelined nature of PTAs (5, 7), as for any portion of lines the
forward computations are performed earlier on the first processor than on the second one, and the
backward computations are performed earlier on the second processor than on the first one. Obviously,
the inequality (6) is satisfied, as there are no changes of the schedule of the last outermost processor.
This proves the theorem. $\Box$
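Assignment (40) translates almost directly into code. The sketch below is an illustration under stated assumptions: the function name `derive_previous`, the list encoding of $J$, and the 0-based indexing are choices made here and are not taken from the report. It derives the schedule of the new first processor from that of the second one and raises an error if no vacant time unit exists.

```python
def derive_previous(j_next):
    """Schedule of processor p obtained from processor p + 1 by assignment (40).

    j_next[k] holds J(p + 1, k + 1); the result has two more time units.
    """
    j_prev = [0] * (len(j_next) + 2)
    for i, state in enumerate(j_next):
        if state == -1:
            # Backward step follows two time units later: J(p, i + 2) = -1.
            j_prev[i + 2] = -1
        elif state == 1:
            # Forward step goes to the earliest vacant unit i_min <= i.
            vacant = next((j for j in range(i + 1) if j_prev[j] == 0), None)
            if vacant is None:
                raise ValueError(f"no vacant time unit for forward step {i + 1}")
            j_prev[vacant] = 1
    return j_prev


# Basic order: the forward steps keep their place, the backward steps shift by two.
print(derive_previous([1, 1, 1, -1, -1, -1]))   # [1, 1, 1, 0, 0, -1, -1, -1]
# Mixed order: the third forward step moves one unit earlier to avoid a collision.
print(derive_previous([1, -1, 1, 1, -1, -1]))   # [1, 1, 1, -1, 0, 0, -1, -1]
```

Processing the time units in increasing order is sufficient here because a backward step is always written to a later unit than any forward step placed before it.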
Let us define a variable that governs communications between processors:
(41) $\mathrm{Com}(p, i) = \begin{cases} 0 & \text{processors } p \text{ and } p + 1 \text{ do not communicate,} \\ 1 & \text{send forward-step coefficients from } p \text{ to } p + 1, \\ 2 & \text{receive solution from } p + 1 \text{ to } p, \\ 3 & \text{send forward-step coefficients and receive solution,} \end{cases}$
where $p$ refers to communications between the $p$th and the $(p + 1)$th processors, and $i$ is the end of the $i$th
time unit (or beginning of the $(i + 1)$th time unit) on the $p$th processor.
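If $J(p, i)$ is given, the variable $\mathrm{Com}(p, i)$ is computed by
(42) $\mathrm{Com}(p, i) = \begin{cases} 1 & \text{if } J(p + 1, i) = 1, \\ 2 & \text{if } J(p + 1, i - 1) = -1 \text{ and } J(p + 1, i) \ne 1, \\ 3 & \text{if } J(p + 1, i - 1) = -1 \text{ and } J(p + 1, i) = 1, \\ 0 & \text{otherwise.} \end{cases}$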
The end of the $i$th time unit on the $p$th processor corresponds to the beginning of the $i$th time
unit on the $(p + 1)$th processor. If $J(p + 1, i) = 1$, the $(p + 1)$th processor receives the forward step
coefficients. The first line in (42) defines the transfer of the forward step coefficients from the $p$th to
the $(p + 1)$th processor.
The $(p + 1)$th processor sends the backward step solution to the $p$th processor immediately after the
completion of their computations. The second line of (42) defines the solution transfer for backward
pipelined computations.
The third line of (42) corresponds to the case of simultaneous transfer of the forward step
coefficients from the $p$th to the $(p + 1)$th processor and the backward step solution from the $(p + 1)$th
to the $p$th processor. If the send and receive operations are performed simultaneously, parallelism of
communications appears as an additional advantage of the IB-PTA.
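Rule (42) can be sketched as a small lookup over two consecutive entries of the neighbor's schedule. The function name `communication_schedule` and the zero padding outside the schedule are assumptions made for this illustration. The combined case 3 is tested before case 1, which is one way to resolve the overlap between the first and third lines of (42).

```python
def communication_schedule(j_next):
    """Evaluate Com(p, i) from (42) for i = 1, ..., len(j_next) + 1.

    j_next[k] holds J(p + 1, k + 1); time units outside the schedule count as idle.
    """
    padded = [0] + list(j_next) + [0]       # padded[i] == J(p + 1, i)
    com = []
    for i in range(1, len(padded)):
        previous, current = padded[i - 1], padded[i]
        if current == 1 and previous == -1:
            com.append(3)                   # send coefficients and receive solution
        elif current == 1:
            com.append(1)                   # send forward-step coefficients
        elif previous == -1:
            com.append(2)                   # receive backward-step solution
        else:
            com.append(0)                   # no communication
    return com


print(communication_schedule([1, 1, -1, 1, -1]))   # [1, 1, 0, 3, 0, 2]
```

In the example, the fourth entry is 3 because processor $p + 1$ finishes a backward step in time unit 3 and starts a forward step in time unit 4, so both transfers fall on the same interface moment.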
One recursive algorithm that assigns the schedule of computations to the $p$th processor and the
schedule of communications between the $p$th and the $(p + 1)$th processors for a given schedule of the
$(p + 1)$th processor is shown in Fig. 4. First, this algorithm assigns zero values to the $J$ and $\mathrm{Com}$
variables corresponding to the $p$th processor. For negative $J(p + 1, i)$ the values $J(p, i + 2) = -1$ and
$\mathrm{Com}(p, i + 1) = 2$ are assigned (see (40) and (42)).
For positive $J(p + 1, i)$, the value of $\mathrm{Com}(p, i)$ is equal either to 1 or to 3. The latter case is
realized if $J(p + 1, i - 1) = -1$ (see (42)), and the previous $\mathrm{Com}(p, i) = 2$ is reassigned. Then the
search for a vacant time unit $i_{\min}$ (see (40)) is realized and the assignment $J(p, i_{\min}) = 1$ is completed.
The entire loop is repeated $2(P - p) + L_f + L_b$ times according to Theorem 3.5.
Theorem 4.1 states that the algorithm is correct. The only reason that the algorithm is unable
to find a vacant time unit for the forward step computations (error warning) is an erroneous schedule
($J(P, i)$) on the last outermost processor (see below).
Thus, a valid schedule should be assigned to the $P$th outermost processor before recursive
computations of the time schedule for other processors are executed. This schedule must satisfy condition
(4). To keep the minimum idle time for the system of $P$ processors, the $P$th processor should execute
forward and backward step computations in a contiguous way. If the outermost processor is idle some
time, we cannot obtain a schedule corresponding to the minimum elapsed time on other processors.
The assignment of $J(P, i)$ for the basic PTA is simple:
(43) $J(P, i) = 1$ if $i = 1, \ldots, L_f$,
     $J(P, i) = -1$ if $i = L_f + 1, \ldots, L_f + L_b$.
The algorithm for assignment of $J(P, i)$ for the IB-PTA is shown in Fig. 5. First, the forward step
computations are performed for the first portion of lines. The variable $KS$ is equal to the number of
available lines which have been completed by the forward step and not yet rendered by the backward
step.
Scheduling the IB-PTA for the $P$th processor, the backward step computations begin for the next
portion of $K_2$ lines once these lines have been completed by the forward step ($KS \ge K_2$). Otherwise,
one more portion of lines must be completed by the forward step computations.
For example, consider the tri-diagonal Thomas algorithm with $K_2/K_1 = g_1/g_2 = 1.5$ (see footnote 1). Once
the first two portions of $K_1$ lines have been completed by the forward step, $1.5 K_1$ lines from these two
portions form the first portion for the backward step computations. The remaining $0.5 K_1$ lines and
the third portion of lines form the second portion of $K_2$ lines for the backward step computations.
This loop repeats until the completion of lines.
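The two last-processor schedules described above can be sketched as follows. The function names are hypothetical, and the integer values $K_1 = 2$, $K_2 = 3$ are an assumed realization of the ratio $K_2/K_1 = 1.5$; the loop follows the verbal description of the $KS$ counter rather than a transcription of Fig. 5.

```python
def last_processor_basic(lf, lb):
    """Schedule (43): all forward portions first, then all backward portions."""
    return [1] * lf + [-1] * lb


def last_processor_ib(lf, lb, k1, k2):
    """Immediate-backward schedule J(P, i) for the last processor."""
    if lf < 1 or lf * k1 != lb * k2:
        raise ValueError("both steps must cover the same number of lines, see (34)")
    schedule = [1]          # the first portion is always a forward step
    ks = k1                 # lines completed forward and not yet treated backward
    backward_done = 0
    while backward_done < lb:
        if ks >= k2:        # enough lines are available for one backward portion
            schedule.append(-1)
            ks -= k2
            backward_done += 1
        else:               # otherwise complete one more forward portion
            schedule.append(1)
            ks += k1
    return schedule


print(last_processor_basic(3, 2))      # [1, 1, 1, -1, -1]
print(last_processor_ib(9, 6, 2, 3))
# [1, 1, -1, 1, -1, 1, 1, -1, 1, -1, 1, 1, -1, 1, -1]
```

The resulting schedule has no idle time units and repeats the pattern of two forward portions followed by alternating backward and forward portions, as in the example with $1.5 K_1$ lines per backward portion.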
The processor schedules for the IB-PTA and the basic PTA are shown in Fig. 6a,b respectively.
The $p$th column of values of $J$ corresponds to the $p$th processor (from 1 to $P$). Columns are shifted so
that each horizontal line corresponds to a single moment of wall-clock time. Arrows --->,
<--- and <--> denote the send, receive and send-receive communications that correspond to
the values of the variable $\mathrm{Com}$ (see (42)). One can find that the maximum elapsed time (length of the
first column) is the same for the PTA and the IB-PTA. Of course, the idle time (the number of zeros)
is also the same. The advantage of the IB-PTA is that the processors become idle after completion of
some lines; therefore, the processors may be used for other computational tasks using these data.
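The equality of elapsed and idle times can be checked for the configuration of Fig. 6 ($P = 5$, $L_f = 9$, $L_b = 6$). The sketch below is self-contained and rests on assumptions: the helper names are hypothetical, the recursion applies assignment (40) as described in the text, and the immediate-backward order on the last processor is written out for the assumed integer values $K_1 = 2$, $K_2 = 3$.

```python
def derive_previous(j_next):
    """One step of assignment (40): schedule of processor p from processor p + 1."""
    j_prev = [0] * (len(j_next) + 2)
    for i, state in enumerate(j_next):
        if state == -1:
            j_prev[i + 2] = -1
        elif state == 1:
            # Earliest vacant unit not later than i; raises StopIteration
            # if the given schedule is not valid.
            vacant = next(j for j in range(i + 1) if j_prev[j] == 0)
            j_prev[vacant] = 1
    return j_prev


def all_schedules(j_last, processors):
    """Schedules of processors 1..P, built recursively from the last one."""
    schedules = [j_last]
    for _ in range(processors - 1):
        schedules.insert(0, derive_previous(schedules[0]))
    return schedules


P, LF, LB = 5, 9, 6
basic_last = [1] * LF + [-1] * LB
# Immediate-backward order on the last processor, assuming K1 = 2 and K2 = 3.
ib_last = [1, 1, -1, 1, -1] * 3

for name, last in (("PTA", basic_last), ("IB-PTA", ib_last)):
    schedules = all_schedules(last, P)
    elapsed = len(schedules[0])                  # length of the first column
    idle = sum(s.count(0) for s in schedules)    # number of zeros
    print(name, elapsed, idle)
# PTA 23 20
# IB-PTA 23 20
```

Both runs give an elapsed time of $L_f + L_b + 2(P - 1) = 23$ time units on the first processor and the same total number of idle units; only the position of the zeros within each column differs.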
Footnote 1. The exact ratio is compiler- and computer-dependent.
5. Conclusions. The ways of efficient parallelization of two-step (forward-backward) pipelined
algorithms (PAs) originated from the solution of banded matrix linear systems on parallel computers
are discussed. The recursive algorithm for the assignment of the processor computation and
communication schedule for PAs is derived. This scheduling algorithm can be used to execute multiblock
computations where different lines come to a processor from different neighbors, to reduce idle time
by scheduling other computational tasks while processors are idle from PA computations, and to
implement those versions of PAs which can improve the parallelization efficiency.
It is shown theoretically that the processor idle time of the pipelined Thomas algorithms cannot
be improved directly.
To remedy the problem of the processor idle time, a new parallel version of the Thomas algorithm,
denoted as the Immediate Backward Pipelined Thomas Algorithm (IB-PTA), is proposed.
The backward step is processed immediately after the forward step has been completed for the
first portion of lines. Thus, part of the lines have been computed by the Thomas algorithm before
processors become idle, and the idle processors can perform other computational data-dependent
tasks. The obtained processor schedules for the proposed IB-PTA and basic PTA are presented. These
schedules confirm our theoretical results about PAs.
6. Acknowledgment. The author wishes to thank Prof. Alex Pothen (ODU and ICASE) and
Dr. Stephen Guattery (ICASE) for careful reading of the manuscript and useful suggestions.
REFERENCES
[1]HmSCH,CH. Numerical computation of internal and external flows, Vol. 1: Fundamentals of
numerical discretization, John Wiley and Sons, Chichcstcr, 1994.
[2] JOHNSON, S.P, LEGCETT, P.F., IEROTHEOU, C.S. ET AL., Computer Aided Par-
aUeIization Tools (CAPTools). User Manual, University of Greenwich, UK, 1996,
http://www.grc.ac.uk/-captool/.
[3] CUSTAFSON, F.C. AND CUPTA, A., A New Parallel Algorithm for Tridiagonal Symmetric Pos-
itive Definite Systems of Equations, Proceedings of the Third International Workshop of
Applied Parallel Computing, PARA'96, pp. 341-349.
[4] POVITSKY, A. AND WOLFSHTEIN, M., Multi-domain Implicit Numerical Scheme, International
Journal for Numerical Methods in Fluids, 25 (1997), pp. 547-566.
[5] NAIK, N.H., NAIK, V.K. AND NICOULES, M., Parallelization of a Class of Implicit Finite
Difference Schemes in Computational Fluid Dynamics, International Journal of High Speed
Computing, 5, 1993, pp. 1-50.
[6] POVITSKY, A., Parallel Directionally Split Solver Based on Reformulation of PipeIined Thomas
Algorithm, ICASE Report No. 98-45.
15
[Figure 1: four schematic panels, each showing processor activity from the 1st to the last processor as regions of forward step, backward step and idle time. Panel titles: a. Basic PTA; b. Immediate backward PTA; c. Two-way PTA; d. Two-way immediate backward PTA.]
FIG. 1. Examples of pipelined Thomas algorithms: (a) basic Pipelined Thomas Algorithm (PTA); (b) Immediate
Backward Pipelined Thomas Algorithm (IB-PTA); (c) First-Last Pipelined Thomas Algorithm (FL-PTA); (d) First-Last
Immediate Backward Pipelined Thomas Algorithm (FL-IB-PTA), where - - - > denotes the forward step and ---->
denotes the backward step
[Figure 2: two panels (a, b) showing, for processors I to I+5 at time units T and T+1, which step (F or B) each processor performs.]
FIG. 2. Sequence of forward and backward steps for the IB-PTA at two sequential time units, where I, ..., I+5
are the numbers of processors, F is the forward step, B is the backward step, arrows denote data transfer between
processors; (a) equal computational times for the F and B steps ($g_1 = g_2$); (b) non-equal computational times for the
F and B steps, corresponding to the tri-diagonal Thomas algorithm ($g_1 = 1.5 g_2$)
[Figure 3: two panels, PTA and IB-PTA, showing regions labeled F, B, F+B and I between the 1st and the Pth processor; a label 2(P-1) appears in the PTA panel.]
FIG. 3. Pipelined Thomas algorithms: F, the forward step; B, the backward step; I, the idle state of processors; (a)
basic Pipelined Thomas Algorithm (PTA); (b) Immediate Backward Pipelined Thomas Algorithm (IB-PTA)
[Figure 4: flowchart. Legible boxes: I=1, II=0; zero initialization of J and Com; tests on J(p+1,I)=1, J(p+1,I)=-1 and J(p+1,I-1); assignments Com(p,I)=1, Com(p,I+1)=2, J(p,I+2)=-1; search loop II=II+1 over J(p,II); error exit.]
FIG. 4. Assignment of processor schedule on the $(p + 1)$th processor based on a given schedule on the $p$th processor
[Figure 5: flowchart. Legible boxes: J(P,1)=1, KS=K1, I=2; I=I+1; J(P,I)=1, KS=KS+K1; J(P,I)=-1, KS=KS-K2.]
FIG. 5. Assignment of the schedule on the last processor ($P$th) for the IB-PTA
[Figure 6: two panels (a, b), each listing the values of J (1, 0, -1) for processors 1 to 5 in shifted columns, with arrows --->, <--- and <--> marking communications between neighboring processors.]
FIG. 6. Schedule of processors for the IB-PTA (a) and PTA (b), $P = 5$, $L_f = 9$, $L_b = 6$
REPORT DOCUMENTATION PAGE (Form Approved, OMB No. 0704-0188)
2. Report date: November 1998
3. Report type and dates covered: Contractor Report
4. Title and subtitle: Parallelization of the pipelined Thomas algorithm
5. Funding numbers: C NAS1-97046; WU 505-90-52-01
6. Author(s): A. Povitsky
7. Performing organization name(s) and address(es): Institute for Computer Applications in Science and Engineering, Mail Stop 403, NASA Langley Research Center, Hampton, VA 23681-2199
8. Performing organization report number: ICASE Report No. 98-48
9. Sponsoring/monitoring agency name(s) and address(es): National Aeronautics and Space Administration, Langley Research Center, Hampton, VA 23681-2199
10. Sponsoring/monitoring agency report number: NASA/CR-1998-208736; ICASE Report No. 98-48
11. Supplementary notes: Langley Technical Monitor: Dennis M. Bushnell. Final Report. Submitted to the Journal of Parallel and Distributed Computing.
12a. Distribution/availability statement: Unclassified Unlimited; Subject Category 60, 61; Distribution: Nonstandard; Availability: NASA-CASI (301) 621-0390
14. Subject terms: Thomas algorithm; band matrix; pipelined algorithms; parallel computations; implicit numerical methods; MIMD computer
15. Number of pages: 26
17. Security classification of report: Unclassified
18. Security classification of this page: Unclassified
NSN 7540-01-280-5500. Standard Form 298 (Rev. 2-89), prescribed by ANSI Std. Z39-18, 298-102
