Efficient multitasking of Choleski matrix factorization on CRAY supercomputers by Overman, Andrea L. & Poole, Eugene L.
|https://ntrs.nasa.gov/search.jsp?R=19920001084 2020-03-17T14:47:11+00:00Z
Z
NASA Technical Memorandum 4259
Efficient Multitasking
of Choleski Matrix
Factorization on
CRAY Supercomputers
Andrea L. Overman
Langley Research Center
Hampton, Virginia
Eugene L. Poole
Analytical Services & Materials, Inc.
Hampton, Virginia
National Aeronautics and
Space Administration
Office of Management
Scientific and Technical
Information Program
1991
The use of t:ademarks or names of manufacturers in this
report is for accurate reporting and does not constitute an
official endorsement, cither expressed or implied, of such
products or manufacturers by the National Aeronautics and
Space Administration.
Report Documentation Page
Nat_onfl[ Aeronautics and
aace Administralbn
1. ReportNAsANO.TM_4259 2. Government Accession No. 3. Recipient's Catalog No.
4. Title and Subtitle 5. Report Date
Efficient Multitasking of Choleski Matrix Factorization on CRAY September 1991
Supercomputers
7. Author(s)
Andrea L. Overman and Eugene L. Poole
9. Performing Organization Name and Address
NASA Langley Research Center
Hampton, VA 23665-5225
12. Sponsoring Agency Name and Address
National Aeronautics and Space Administration
Washington, DC 20546-0001
6. Performing Organization Code
8. Performing Organization Report. No.
L-16901
10. Work Unit No.
505-90-52-02
11. Contract or Grant No.
13. Type of Report and Period Covered
Technical Memorandum
14. Sponsoring Agency Code
15. Supplementary Notes
Andrea L. Overman: Langley Research Center, Hampton, Virginia.
Eugene L. Poole: Analytical Services & Materials, Inc., Hampton, Virginia.
16. Abstract
This paper describes a Choleski method used to solve linear systems of equations that arise in
large-scale structural analyses. The method uses a novel variable-band storage scheme and is
structured to exploit fast local memory caches while minimizing data access delays between main
memory and vector registers. Several parallel implementations of this method arc described for
the CRAY-2 and CRAY Y-MP computers demonstrating the use of microtasking and autotasking
directives. A portable parallel language, FORCE, is used for comparison with the microtasked and
autotasked implementations. Results are presented comparing the matrix factorization times for
three representative structural analysis problems from runs made in both dedicated and multiuser
modes on both the CRAY-2 and CRAY Y-MP computers. Both CPU (central processing unit)
times and wall clock times are given for the parallel implementations and are compared with
single-processor times of the same algorithm. Computation rates over 1 GIGAFLOP (1 billion
floating point operations per second) on a four-processor CRAY-2 and over 2 GIGAFLOPS on
an eight-processor CRAY Y-MP are demonstrated as measured by wall clock time in a dedicated
environment. Reduced wall clock times for the parallel implementations relative to the single-
processor implementation of the same Choleski algorithm are also demonstrated for runs made in
a multiuser environment.
17. Key Words (Suggested by Author(s))
Choleski factorization
CRAY-2/CRAY Y-MP
Parallel processing/vectorization
Macrotasking, microtasking, and autotasking
FORCE preprocessor
19. Security Classif. (of this report)
20. Security Classif. (of this page)Unclassified
18. Distribution Statement
Unclassified Unlimited
Subject Category 60
21. 60N°' of Pages 22. A04PriceUnclassified
NASA-Langley, 1991NASA FORM 1626 OCT 86
For sale by the National Technical Information Service, Springfield, Virginia 22161-2171

Abstract
This paper describes a Choleski method used to solve linear system,
of equations that arise in large-scale structural analyses. The method
uses a novel variable-band storage scheme and is structured to exploit
fast local memory caches while minimizing data access delays between
main memory and vector registers. Several parallel implementations
of this method are described for the CRA Y-2 and CRA Y Y-MP com-
puters demonstrating the use of microtasking and autotasking direc-
tives. A portable parallel language, FORCE, is used for comparison with
the microtasked and autotasked implementations. Results are presented
comparing the matrix factorization times for three representative struc-
tural analysis problems from runs made in both dedicated and multiuser
modes on both the CRAY-2 and CRAY Y-MP computers. Both CPU
(central processing unit) times and wall clock times are given for the
parallel implementations and are compared with single-processor times
of the same algorithm. Computation rates over 1 GIGAFLOP (1 billion
floating point operations per second) on a four-processor CRA Y-2 and
over 2 GIGAFLOPS on an eight-processor CRAY Y-MP are demon-
strated as measured by wall clock time in a dedicated environment. Re-
duced wall clock times for the parallel implementations relative to the
single-processor implementation of the same ChoIeski algorithm are also
demonstrated for runs made in a multiuser environment.
1. Introduction
The computational cost of computerized struc-
tural analysis is often dominated by the cost of solv-
ing a very large system of algebraic equations associ-
ated with a finite element model. This linear system
of equations has the form
a forward-and-backward solution using the triangular
systems
Lz = f (forward) (2a)
LTu = z (backward) (2b)
Ku=f (1)
where K is the symmetric positive-definite stiffness
matrix, f is the load vector, and u is the vector of
generalized displacements. The linear systems can
be as large as several hundred thousand degrees of
freedom and often require significant computing re-
sources, in terms of both memory and execution time.
The development of algorithms to solve these linear
systems on existing and future high-performance su-
percomputers that have multiple parallel high-speed
vector CPU's (central processing units) can signifi-
cantly reduce the computer time required for struc-
ttrral analysis. Efficient algorithms that exploit both
vectorization and parallelization gain the maximum
benefit on such supercomputers.
One widely used method for solving equation (1)
is the Choleski method (ref. 1). In the Choleski
method the stiffness matrix K is factored into the
product of the triangular matrices LL T, followed by
The factorization of K, which is by far the most com-
putationally expensive part of the Choleski method,
can be carried out in many ways. The particular
implementation of the Choleski method that is the
most effective for a class of problems depends on both
the structure of the linear systems and the architec-
tural features of the computer used for the analy-
sis. Though the structure of K for many structural
analysis applications is initially very sparse, usually
20 to 50 nonzero coefficients in a given row of the
matrix, the factored matrix L contains many more
coefficients produced during the factorization step.
For many structural analysis problems, a reordering
of the nodes in the finite element model can be car-
ried out that minimizes the bandwidth of the fac-
tored matrix, thereby significantly reducing matrix
storage requirements and the number of computa-
tions required. A novel variable-band matrix storage
scheme is used with the Choleski implementations
described in this paper to reduce the computational
cost of the factorization. The reverse Cuthill-McKee
reorderingalgorithm(ref.2) is usedto minimizethe
bandwidthof the linearsystems.
In thispapera vectorizedvariable-bandCholeski
method,denotedasVBAND,isdescribedfor single-
processorimplementation.A paraIIclCholeski equa-
tion solver, based on the VBAND method and
denoted as PVBAND, is also presented. Several
implementations of the PVBAND method are de-
scribed demonstrating three different approaches for
implementing parallel algorithms on CRAY multi-
processor supercomputers (manufactured by Cray
Research, Inc.). The performance of the parallel
implementations is demonstrated by comparing fac-
torization times for several large representative struc-
tural analysis problems. In a dedicated comput-
ing environment, maximum computation rates over
1 GIGAFLOP (1 billion floating point operations
per second) on a four-processor CRAY-2 and over
2 GIGAFLOPS on an eight-processor CRAY Y-MP
are achieved for the matrix factorization portion of
tile computerized structural analyses. In a multiuser
environment, reduced wall clock times arc demon-
strated for tile parallel methods as compared with
the wall clock times for the single-processor VBAND
Choleski algorithm.
The outline of this paper is as follows: In sec-
tion 2, architectural features of the CRAY multipro-
eessor supercomputers arc described. Key hardware
features of the CRAY-2 and CRAY Y-MP computers
are compared and contrasted, and implementation
issues related to exploiting vectorization as welt as
parallel processing oil CRAY supercomputers are dis-
cussed. CRAY multitasking alternatives (macrotask-
ing, microtasking, and autotasking) are introduced.
In section 3, both single-processor and parallel im-
plementations of the variable-band Choleski method
are described. Detailed discussions of the differences
between the parallel implementations on the CRAY
computers are included. The implementation details
provide an aid for understanding the differences in
performance across the parallel implementations. In
section 4, the role of equation solvers in computer-
ized structural analysis is briefly discussed. Three
representative structural analysis problems that are
used to compare the parallel implementations of the
variable-band Choleski method are described. In sec-
tion 5, numerical results are presented from nmltiple
runs in both dedicated and multiuser environments,
and in section 6, concluding remarks are given. Fi-
nally, an appendix is given containing FORTRAN
listings for several of the subroutines described in this
paper.
2. Parallel-Vector Computer
Architectures
The experiments described in this report were
performed on a CRAY-2 at the NASA Langley
Research Center (LaRC) and a CRAY Y-MP at
the NASA Ames Research Center (ARC). This sec-
tion briefly describes the key architectural feature_
of these computers including a discussion of the
major hardware differences of the CRAY-2 and
CRAY Y-MP machines that affect algorithm de'-
sign. In addition, optimization issues related to
exploiting both the vector and parallel processing
capabilities of CRAY supercomputers are briefly
introduced. Differences between macrotasking,
microtasking, and autotasking that affect algorithm
performance are discussed at the end of the parallel
processing subsection.
CRAY Hardware
The CRAY-2 and tile CRAY Y-MP are shared-
memory multiprocessor computers that have a max-
imum of four and eight central processing units
(CPU's), respectively. Each CPU has multiple vector
arithmetic and logic functional units that access very
large main memories through eight high-speed vector
registers. The maximum computation rate, com-
monly measured in units known as MFLOPS (mil-
lions of floating point operations per second), is at-
tained when both addition and multiplication vector
fimctional units are operating simultaneously, thus
producing two results every machine clock period.
The CRAY-2 has a clock period of 4.1 nanoseconds
(nsec) that results in a maximum theoretical rate of
488 MFLOPS per processor. The CRAY Y-MP used
for the experiments in this paper has a clock period of
6.0 nsec that results in a maximum theoretical rate of
333 MFLOPS per processor. These theoretical peak
rates arc seldom achieved because of several factors,
two of which are discussed next.
Memory access delays. One factor that re-
duces the computation rate is memory access delay.
Some delay is always incurred if the operands for
a vector addition or multiplication must be trans-
ferred from main memory to the vector registers. As
soon as the elements are in the vector registers they
Can be accessed at the maximum computation rat}?:
Both the CRAY Y-MP and the CRAY-2 have 128
million words of main memory arranged in banks.
Since memory locations in the same bank cannot be
accessed in successive clock periods, contiguous array
elements are stored in an interleaved pattern across
the banks. This storage pattern allows contiguous
array elements to be accessed at successive clock pe-
riods after an initial access delay. In general, the
2
:=
does not have chaining, the memory references for
a given set of operands must be completed before
computations can begin. This problem is greatly
reduced for FORTRAN DO loops containing many
addition and multiplication instructions and several
different operands since memory references on one
set of operands can be carried out while multiplica-
tions or additions are taking place on another set of
operands. One technique used to increase the ratio of
computations to memory references is referred to as
loop unrolling. In this technique, nested loops con-
taining single-vector instructions, such as SAXPY's
or inner products, are combined within a single loop
containing several vector computations. This tech-
nique allows vectors to remain in the vector registers
longer before results arc stored back to main memory,
thus resulting in a more efficient use of the multiple
functional units.
Parallel Processing
In addition to the hardware features of the
CRAY-2 and CRAY Y-MP that improve perfor-
mance on a single CPU, parallel processing is avail-
able. The individual CPU's share a large main mem-
ory, and synchronization is accomplished using both
the shared-memory and special hardware semaphore
registers. Because current CRAY computers have
small numbers of very powerful processors, the best
overall speedups are obtained by parallelizing highly
vectorized codes. The speedup from parallel process-
ing is always less than p, where p is the number of
processors, wherea_s the speedup duc to vectorization
often exceeds 20 or more on a single processor for
many codes. Three types of multitasking are avail-
able on the CRAY computers and are described be-
low. The PVBAND method was implemented using
all three types of multitasking to determine the best
approach for both multiuser and dedicated modes.
Details of the parallel implementations are given in
section 3.
Macrotasking. The earliest form of parallel pro-
ccssing developed for CRAY computers was macro-
tasking. Macrotasking is intended to be used with
large codes running in a dedicated environment.
Macrotasked codes perform best when separat e par-
allel tasks, requiring little synchronization, are de=
fined at the subroutine level. Macrotasking is the
most difficult form of multitasking used on CRAY
machines. Since parallelism must bc defined at a
higher level, macrotasking typically requires more
code changes and data analysis than microtasking
and autotasking. Parallelism is defined through the
insertion of library subroutines, making the code un-
4
portable to other machines. (See refs. 3 and 4 for
detailed discussions on macrotasking.)
Macrotasking requires extensive task manage-
ment that is handled by the library scheduler, a col-
lection of library subroutines. Parallel tasks, inde-
pendent portions of work, are initiated through calls
to the TSKSTART and are synchronized by events,
locks, or the TSKWAIT subroutine. TSKSTART cre-
ates the logical CPU's, or processes, if they have not
been created previously. When a task changes state,"
the library scheduler places the task in the appropri-
ate library queue. For example, the scheduler peri-
odically checks the Ready queue for tasks that can be
attached to logical CPU's. Attachment to a logical
CPU designates a task as schedulable for execution
on a physical CPU. On the other hand, if a task must
wait for another task to be completed, the library
scheduler disconnects the waiting task from the log-
ical CPU and places it in the Suspended queue. In
addition, if no new tasks are ready, the logical CPU
is "put to sleep," marking that process as unschedu-
lable. That logical CPU must be reactivated later.
A portable parallel language, FORCE, devel-
oped for shared-memory parallel computers (refs. 5
and 6) has been implemented on the CRAY com-
puters using macrotasking. FORCE is a machine-
independent parallel language that allows program-
mers to implement parallel algorithms using a fixed
syntax. Both parallel constructs (such as barriers
and critical regions) and data types (i.e., private and
shared variables) can be programmed without de-
tailed knowledge of the implementation of the paral-
lel constructs and data-type definitions on each com- r
puter. FORCE uses the UNIX sed editor and a set
of script files to substitute machine specific code for
each of the FORCE statements. In addition to re-
placing FORCE statements in the user's program,
the FORCE preprocessing phase converts the user's
main program to a subroutine that is called by a stan-
dard driver routine at run time. The driver routine E
is not seen by the user and serves to create, through
calls to TSKSTART, the fixed number of tasks spec-
ified by the user at run time. The entire user's pro-
gram is called from the main driver program by each
task. Sequential portions of the code must be placed
within FORCE barrier statements or other explicit L
code statements. At the end of the program, the i
FORCE driver routine terminates the tasks with the _
TSKWAIT subroutine.
Microtasking. Microtasking was designed to
lower the overhead for CPU synchronization and
to allow for more efficient multitasking in a mul-
tiuser environment. In a multiuser environment,
maximumtransferratebetweenmainmemoryand
thevectorregistersoccurswhencontiguousarrayel-
ementsareaccessedfrommainmemory.
Therearetwokeydifferencesin thearchitectures
oftheCRAY-2andCRAYY-MPthat affectmemory
accessdelays. Thefirst is the numberof pathsto
memory. On the CRAY-2,thereis only a single
pathbetweenmainmemoryandthevectorregisters.
On the CRAY Y-MP, there are four paths to main
memory; two can be used to read from memory, one
to write from the registers to main memory, and
one for input/output (I/O) operations. The second
key difference is the use of local memory caches on
the CRAY-2. Each processor on the CRAY-2 has
a 16384-word local memory cache. Local memory
is accessible only through the vector registers, and
only contiguous access of array elements is possible.
However, the time delay incurred when accessing
array elements stored in local memory is significantly
less than that when accessing array elements stored
in main memory. As a result, vectors that are
used repeatedly in computations may be stored in
local memory in order to significantly reduce memory
access delays.
Floating point operations and chaining.
Another key factor that affects the computation rate
is the type of operations performed to carry out an
algorithm. For example, vector SAXPY operations
(single precision a x X + Y) are efficient on CRAY
computers since they contain both addition and nml-
tiplication instructions that can be carried out nearly
: simultaneously. However, inner-product instructions
(Sum = _n 1 xiYi) arc somewhat less efficient since
they require a summation of many vector elements
: into a single scalar value.
An additional feature that improves the perfor-
: mance of floating point operations on the CRAY
: Y-MP is the chaining of operations and memory ac-
cesses. Chaining occurs whenever vector instructions
on a set of operands arc allowcd to overlap, thus
reducing the start-up overhead associated with in-
dividual vector instructions. For example, on the
CRAY Y-MP, a vector add of 2 arrays multiplied by
a constant begins by issuing a load instruction for the
first 64 elements of the first array that is followed, on
the very next clock period, by a second load instruc-
tion for the first 64 elements of the second array. As
: soon as the first opcrands of the two arrays are in
the vector registers, the addition operation begins.
The multiplication operation begins as soon as the
first result is available from the add functional unit.
Finally, the store instruction to transfer the first 64
results back to main memory is issued as soon as the
first results are available from the multiply functional
unit. If the length of the arrays is n, then this exam-
ple requires O(n) clock periods on the CRAY Y-MP.
On the CRAY-2, the same example would require
O(3n) clock periods to load the two vector operands
and store the result using the single path to memory
and an additional O(2n) clock periods to complete
the addition and nmltiplication instructions.
Although flmctional unit chaining, or tailgat-
ing, is available on some CRAY-2 machines, most
CRAY-2 computers havc no chaining capability.
However, the CRAY-2 computers can benefit from si-
multaneous memory access and computations if these
operations take place on different sets of operands.
Loops that contain several addition and multipli-
cation operations with several independent sets of
opcrands are oftcn able to benefit from simultaneous
memory acecss and computations. In general, loops
that execute at high vector spccds on thc CRAY-2
also perform well on the CRAY Y-MP, but the con-
verse is not necessarily true.
Vector Optimization
In order to achieve high computation rates on
vector computers, algorithms must be designed to ex-
ploit the key hardwarc features discussed above. The
challenge to the software developer is to design algo-
rithms that minimize the memory access delays while
utilizing the full computing power of multiple vector
functional units. Three key vector optimization is-
sues are vector length, vector stride, and the ratio
of memory references to computations. Long vector
operations (i.e., operations on hundreds or even thou-
sands of array elements in a single loop) are desirable
since thc cost due to loop overhead and initial mem-
ory access is amortized over a largcr number of com-
putations. Vector stride refers to the memory access
pattern for arrays. Stride one vector operations occur
when arrays are accessed contiguously. This occurs,
for example, in FORTRAN DO loops where the first
dimension of an array reference is incremented by one
for each loop iteration. Although much improvement
has been made on vector computers in recent years
to improve memory access for strides other than one,
stride one vector operations are still faster on both
the CRAY-2 and CRAY Y-MP. On the CRAY-2, the
arrangement of memory banks into quadrants makes
even stride accesses particularly bad, especially when
the stride is a multiple of four.
The ratio of computations to memory references
for vector computations is the third key issue on vec-
tor optimization. This ratio is particularly important
for the CRAY-2 since there is only one path between
memory and vector registers. Because the CRAY-2
microtaskedcodesdynamicallyadjustto the avail-
ableresources,thusmakingefficientuseof proces-
sors that are availablefor short periodsof time.
Sincesmallergrainedparallelismcanbeefficiently
exploitedusingmicrotasking,parallelismis typically
definedat the DO loop level. Outermostloops
areusuallyparallelized,whereasinnerloopscontain
vectorcomputations.Microtaskingis implemented
throughthe useof compilerdirectivesthat appear
ascommentsto othercompilers,thus maintaining
- portability to othercomputers.Discussionsof mi-
crotaskingcanbe foundin references3and4.
Microtaskingis designedfor multiuserenviron-
mentsand doesnot requirethe task management
usedby macrotasking.In microtasking,additional
processesreferredto asslave processes are created at
the beginning of the program and immediately enter
a library subroutine called PARK where they remain
until summoned by the master process. Meanwhile,
the master process executes all serial code outside
the microtasked subroutines. Upon entering a mi-
crotasked subroutine, the master process signals the
slaves to enter the parallel region. If a slave process
is active, i.e., if the process is executing on a physi-
cal CPU, the slave process enters the parallel region.
The code within microtasked subroutines is executed
by all active processes unless it is contained within
control structures. Within a control structure, par-
allel work is distributed to active processes, and all
work within a control structure must be completed
before any CPU can proceed past that point. A
unique number associated with each control structure
as well as the number of active processes within each
control structure are stored in special shared vari-
ables. These variables are updated by several special
library routines that are called from the user's pro-
gram. The calls to these routines are automatically
inserted by the microtasking preprocessor at com-
pilation time. All variables in COMMON blocks,
SAVE statements, DATA statements, or in the ar-
gument list of a microtasked subroutine are shared
variables and can be accessed by all CPU's. All other
variables, such as loop indices, are private variables,
and each CPU accesses a separate copy.
Autotasking. Autotasking is the most recent
parallel programming tool available to programmers
on CRAY computers. (See ref. 7 for a complete de-
scription of autotasking.) Autotasking expands the
features of microtasking by adding the capability to
automatically detect some forms of parallelism at the
DO loop level and by improving the efficiency of cer-
tain parallel constructs. One major difference be-
tween microtasking and autotasking is the definition
of parallel regions. In microtasking, the parallel re-
gion extends to the subroutine boundaries; whereas
in autotasking, multiple parallel regions may exist
within a subroutine. The master process executes all
code outside the parallel regions and must wait for
the slave processes to finish before proceeding past
the boundary of a parallel region. Compiler dircc-
tives, similar to microtasking directives, may bc used
to specify parallcl regions within the code. In this
paper, the automatic parallelizing feature of auto-
tasking was not used because it could not detect the
parallelism in VBAND. Expanded capabilities for au-
tomatic paralle!ization are planned in future releases
of the compiling system.
Multiuser and dedicated environments. The
maximum benefit of parallel processing on the CRAY
computers to a single user is experienced in dedi-
cated modc. On dedicated systems, parallel work is
executed on independent CPU's and all the CPU's
are available for a single user. However, users also
benefit from parallel processing in multiuser batch
modes in many cases. On multiuser systems, multi-
tasked programs typically receive more of the avail-
able computing resources than a single-processor im-
plementation of the same code. Multiuser computing
environments for the CRAY-2 and CRAY Y-MP de-
fine multitasking runs in terms of parallel work that
may or may not run concurrently on independent
CPU's. In macrotasking, a fixed number of tasks are
explicitly defined, and the synchronization of these
parallel tasks can often result in significant delays on
heavily loaded systems. On thc other hand, micro-
tasking and autotasking, rather than defining a fixed
number of tasks, distribute the work only to the pro-
cesses that are active at execution time. Those active
processes arc synchronized at the end of each con-
trol structure and at the end of each parallel region.
In this paper, both multiuser and dedicated modes
of operation are discussed with results presented for
both modes.
3. Description of Methods
This section briefly discusses the variable-band
data structure and follows with descriptions of the
sequential vectorized VBAND Choleski method and
the parallel PVBAND Choleski method. The im-
plementations of the sequential VBAND method il-
lustrate the techniques used to obtain computation
rates of greater than 200 MFLOPS on the CRAY-2
and CRAY Y-MP. The description of the PVBAND
method illustrates the general approach used to par-
allelize the VBAND method. Details of the three par-
allel implementations are discussed in a separate sub-
section. The details presented for the macrotasking,
microtasking,and autotaskingimplementationsare
essentialto understandingthedifferencesin perfor-
mancefortheparallelimplementations.Someresults
aregivenforthesmallestexampleproblem,theHigh-
SpeedCivil Transport(HSCT)aircraftdiscussedin
section4,to compareseveralimplementationsof the
VBANDandPVBANDmethods.
Variable-Band Data Structure
Thevariable-bandatastructureis describedin
references8 and9. Thelowertriangularpart of the
symmetricinputstiffnessmatrixK inequation(1) is
storedbycolumns,beginningwith themaindiagonal
downto thelast nonzeroentry in eachcohmm,in-
cludingzeros.Twoimportantadjustmentsaremade
to thelengthsof eachcolumnof K to accountforfill-
in duringfactorizationandto ensurethat colunms
ingroupscorrespondingto the levelofloopunrolling
endin the samerow. This datastorageschemeis
differentfrom traditionalskylincor profileCholeski
methodsthat storethe upper triangular part of K
by columns beginning with the main diagonal and
storing all coefficients up to the first nonzero coeffi-
cient in each column. The skyline scheme is equiv-
alent to storing the lower triangular part of K by
rows, whereas the variable-band scheme storcs the
lower triangular part of K by columns. The advan-
tagc of the skyline scheme is that it does not require
additional adjustments for fill-in and, in some cases,
requires considcrably less storage than the variable-
band storage scheme. The big disadvantage of the
skyline scheme is that the main computations in the
associated Choleski methods arc inner products that
are much slower than the SAXPY operations used by
the VBAND method on vector computers.
Variable-band storage:
Lower triangnltar part by columns
from diagonal down to last nonzero
coefficient in each column, with
additional adjustnmnts fi)r fill
a a a,.s
a._ az_ a_._ o
o a a,._ a_., o
o o a 4,3 a4, 4 a4.s
as.' 0 0 as.4 as.s
Skyline storage:
Upper triangular part by columns
from diagonal up to last nonzero
coefficient in each column
(a) Symmetric matrix.
I,,I I I I I 13 1 I I I I ! J i ra a_, 0 0 as, a_2 a 0 0 a_ a4_ 0 a,,,, as, ass
(b) Storage of variable-band scheme in single-dimensioned array.
a_,,., a_.,, as_ a4s o o a,.s
(c) Storage of skyline scheme in single-dimensioned array.
Figure 1. Example of variable-band and skyline matrix storage schemes. In parts (b) and (c), arrows indicate pointers to starting
coefficient in each column; zeros indicate memory locations allocated for ill] elements during factorization.
Variable-band example. An example of the
variable-band storage scheme and skyline storage
scheme is given in figure 1. The symmetric input
matrix has nonzero coefficients in the positions shown
in figure l(a). The position of each coefficient within
a single-dimensioned array is shown for each storage
scheme in figures l(b) and l(e). Auxiliary arrays
store the starting location of each diagonal element
in the matrix, denoted in fignlre 1 by arrows. The
zeros in figure 1 represent zero coefficients that are
stored by both schemes to allow for possible fill-in
coefficients. During factorization the coefficients of
K are modified, and many of the zero coefficients are
replaced with nonzero values. The factored matrix L
overwrites K. In the skyline storage scheme, no space
is allocated for elements al, 3 and al, 4 in the first
row of the upper triangular part of the input matrix
since they lie above the last nonzero coefficient in
columns 3 and 4, respectively. However, in the
variable-band scheme, the corresponding elements of
column 1 of the lower triangular part of the input
matrix, a3,1 and a4,1, lie between nonzero coefficients
in column 1 of L and extra storage is allocated.
Additional adjustments are made, if necessary, to
the column lengths in both the skyline and variable-
band storage schemes if loop unrolling is added to
the factorization implementation. In general, loop
unrolling requires that groups of columns end in the
same row. For example, if loop unrolling to level 3 is
used, then columns are arranged in groups of three
and extra zero coefficients are allocated as required to
ensure that each group of columns ends in the same
row. In practice, the adjustment for loop unrolling
increases the overall nlatrix storage requirement very
little.
Matrix assembly. The specialized variable-
band data structure used for the Choleski methods
described in this paper requires some interface with
the matrix assembly process when used in an exist-
ing finite element code. Matrix assembly routines
can be written to directly assemble finite element
stiffness matrices into the variable bandwidth form,
but this approach may not be desirable in realistic
applications. The variable-band storage scheme is
required only for factorization. For portions of the
_analysis computations, which may require matrix-
vector multiplications or the addition of two ma-
trices, the variable-band storage scheme would be
very inefficient because it would require a great many
"additional operations. This is also true for very
large problems with high bandwidths, where iterative
solvers may be used in place of direct Choleski meth-
ods. For these reasons, a more practical approach
is to assemble the stiffness matrices into a general
sparse matrix format that can be quickly convcrtcd
to the variable-band format when the variable-band
Choleski solver is to be used.
The conversion from a general sparse matrix stor-
age scheme, in which the lower triangular nonzero co-
efficients are stored by columns, to the variable-band
data storage scheme can be carried out efficiently as
follows: The lengths of each column in the variable-
band matrix are initially determined from the row
indices stored for each nonzero coefficient. Then,
the row lengths are adjusted to account for fill and
loop unrolling as previously described. Next, the row
index pointers for each nonzero coefficient are con-
vcrted to location pointers for each coefficient within
the variable-band matrix array. Finally, the variable-
band array is initialized to zero and tile nonzero cocf-
ficients are assigned using the integer location point-
crs. Most of the operations used to convert from
sparse to variable-band data storage schemes vcctor-
ize, and the time required to convert from sparse to
variable-band storage is small compared with the fac-
torization time. In the appendix, FORTRAN listings
for two subroutines used for this conversion are given.
Sequential VBAND Choleski Method
This section describes a basic Choleski factor-
ization method and several modifications that sig-
nificantly improve the computation rate on vector
computers. In the Choleski method a symmetric,
positive-definite matrix K is factored into the prod-
uct LL T. The factorization can be carried out in
many ways but, in general, elements of L are com-
puted from K a column (or row) at a time begin-
ning with column (or row) 1 of K and proceeding
to the last column (or row'). Only the lower part of
K is stored, using the variable-band storage format
described above, and L, which is computed by mod-
ifying K, is stored in the same space as the original
matrix K.
Choleski faetorization in ijk forms. Ref-
erence 10 describes Choleski factorization for vector
computers in terms of the so-called ijk forms. Tile
ijk forms describe Choleski factorizations in terms
of the nesting order of three loops that compute the
modifications of K into L. These modifications are
computed in the innermost loop in Choleski factor-
ization implementations and are of the form
Ki, j _-- Ki, j - Li,kLk, j (3)
The notation _, used in equation (3) and here-
after, indicates that the variable is updated and
overwritten with a new value specified by the right-
hand expression. The order of the three letters i, j_
7
z=,
=
and k indicates the order of the three loops. For ex-
ample, traditional skyline or profile Choleski meth-
ods are of the ijk form where the inner vectorized
loop varies the index k in equation (3) to compute
the inner product of rows i and j of L. Tile middle
loop, denoted by j, determines that row j of L is
used in the inner-product computation with row i of
L to update element Ki, j, After the inner-product
update of Ki,j, element Li,j is finished by dividing
Kid by tile diagonal element Lj,j. The outer loop is
the i loop, specifying that the computations proceed
by rows beginning with row 1 and ending with the
last row.
A second example, the kji form, is better suited
for the variable-band data structure. In the basic kji
form, the outermost loop in the factorization is the
k loop. This means that for each k, a new column of
L is finished, and then each j in tile middle loop of
the factorization corresponds to column j of K that
is modified by the SAXPY operation using column
k of L in the innermost loop. Because the primary
computation is a vector SAXPY operation, the kji
form is well suited for vector computers such as the
CRAY-2 and CRAY Y-MP. An additional benefit of
tile kji form is that the kth column of L, which is
used for many SAXPY updates, can be stored in the
fast local cache memory on the CRAY-2.
Modified kji VBAND method. The VBAND
Choleski method, described in reference 8, is a mod-
ified form of the basic kji Choleski method. The
VBAND method is shown in figure 2 in pseudoeode
instructions. In this method the basic kji form is
used for the cohmm updates in LOOP3, but the jki
form is used to finish groups of r columns of L. The
jki form also uses SAXPY operations, but the or-
der of the SAXPY column updates is different. The
combination of these two forms works as follows:
1. jki portion: Columns k, k+ 1, ..., k+r- 1 of L
are completed (where r corresponds to the level
of loop unrolling) at each iteration of the outer
loop k.
2. kji portion: Appropriate columns of K are mod-
ified using columns k, k + 1, ..., k + r - 1 of L,
beginning with column k + r and ending with the
column corresponding to the last nonzero coeffi-
cient in the groups of r columns of L.
The jki portion corresponds to the code required
between LOOP1 and LOOP2 in figure 2. The first
column k, computed in the jki portion, requires only
a vector divide. Column k + 1 is modified using the
just-completed column k.in a single SAXPY update
8
LODPI k= I ton-r-i in steps of r
{Complete columns k, k+ i, ..., k+r
1 of L}
lastrk = k + len(k) - 1
L00P2 j = k + r to lastrk
IF Lj,k,..., Lj,k+r_ 1 are not all
zero THEN
{Update column j of L using
r columns of L via L00P3}
L00P3 i = j to lastrk
Kid +- Ki,j - Li,k * Lj,k
-Li,k+ 1 * Lj,k+l-..-
-hi,k+r_ 1 * Lj,k+r-1
END L00P3
ENDIF
END L00P2
END L00PI
{Finish any remaining columns of L
(if n is not a multiple of r)}
Figure 2. Sequential VBAND Choleski method.
and is then divided by the diagonal element of col-
umn k + 1. Column k + 2 is modified next by using
both columns k and k+ 1 (a double SAXPY loop) and
then is finished by a vector divide. Each successive
column uses one additional column in the update por-
t.ion up to loop unrolling level r- 1. The columns arc
copied into temporary arrays that are stored in local
memory on the CRAY-2. On the CRAY Y-MP, the
r columns are also copied into temporary arrays to
reduce loop indexing overhead. The kji portion, cor-
responding to LOOP3 in figure 2, consists entirely of
the multiple SAXPY updates. The SAXPY updates
are skipped if all r scalar multipliers are zero. This
zero-checking option reduces the computation rate
slightly but reduces overall CPU time considerably
by eliminating many unnecessary operations. This
feature is especially important in the variable-band
method for problems where the increase in storage
requirements for the variable-band scheme is large
compared with the requirements for a skyline stor-
age scheme. If the zero-checking option is not used
in these cases, the number of computations for the
VBAND method can be much greater than the num-
ber of computations required for a skyline Choleski
method.
The key computation in the VBAND algorithm
is the multiple SAXPY update loop. Over 97 per-,
cent of the computations in the factorization oc-
cur in this loop for each of the three problems
solved for this study. The multiple SAXPY com-
putations are carried out efficiently on both the
CRAY-2 and CRAY Y-MP computers because they
Table1. VBANDComparisonsfor High-SpeedCivil Transport Problem
[16 146 equations; Maximum semibandwidth = 594; Average semibandwidth = 319]
Method a
LL1
LL6
LL6L
LL6LZ
1.0
2.6
3.3
4.5
[ CPUtime b _,Vall timeb [ Rate, l(second) (timed MFLOPS Speedup c
NASA Langley CRAY-2 computer in dedicated mode
27.9 ] 28.0 73
10.8 ] 10.8 188
8.5 8.6 238
6.2 6.2 216
NASA Ames CRAY Y-MP computer in dedicated mode
LL1
LL6
LL6L
LL6LZ
14.7
8.5
7.7
5.6
14.7
8.6
7.7
5.6
138
236
263
242
1.0
1.7
1.9
2.6
aMethods:
LL1 variable-band Choleski, single SAXPY update
LL6 variable-band Choleski, multiple (6) SAXPY update
LL6L variable-band Choleski, multiple (6) SAXPY update, uses temporary storage for six
update columns (stored in local memory on CRAY-2)
LL6LZ variable-band Choleski, multiple (6) SAXPY update, uses temporary storage, skips
SAXPY updates if all six scalar multipliers are zero
bCRAY FORTRAN timing routines second and timer are used for CPU and wall clock times,
respectively, which are given in seconds.
CSpeedup is computed using wall clock times relative to LL1.
allow overlapping of memory accesses with simulta-
neous use of both the add and multiply functional
units. All columns of L that are accessed in LOOP3
of figure 2 on the CRAY-2 are stored in local mem-
ory. Various values for r were tried with this method
and the value of 6 was found to be best in most cases.
As the level of loop unrolling increases, the amount
of serial computation increases and the performance
gains level off. If r is too large, the columns may
not fit in the local memory of the CRAY-2. In prac-
tice, only approximately 10K of the 16K words can
be used because the operating system and compiler
also use the local memory.
VBAND examples. Table 1 gives a compari-
son of the basic kji Choleski method, denoted as LL1,
with several implementations that add the improve-
ments described above. Runs were made using the
H;SCT example problem, described in section 4, on
both the CRAY-2 and the CRAY Y-MP computers
in dedicated mode. The speedup of the vectorized
LL1 method relative to LL1 compiled with vector-
ization inhibited is much greater than the speedups
shown in table 1. All methods that begin with the
symbols LL6 use the combined jki and kji Choleski
forms already described with the level of loop un-
rolling set to 6. Tile L designation after the 6 de-
notes the use of temporary arrays to store the six
columns of L used at each iteration of the outermost
loop. On the CRAY-2 these arrays are stored in local
memory by using the compiler directive cdir$ reg-
file. Even though tile CRAY Y-MP does not have
local memory, the use of temporary arrays for the
six columns reduced the indexing overhead and im-
proved the computation rate by approximately 9 per-
cent. (Compare LL6 and LL6L for the Y-MP runs.)
The addition of zero checking, denoted by the letter
"Z," decreased the computation rate somewhat but
reduced the overall time substantially. For the High-
Speed Civil Transport problem, the factorization re-
quired 2.033 billion operations without zero checking
and 1.346 billion operations with zero checking.
A comparison of the CRAY-2 and CRAY Y-MP
times in table 1 shows that the techniques used to
improve performance result in a greater improve-
ment for the CRAY-2 than for the CRAY Y-MP.
The LL1 method is nearly twice as fast on the
CRAY Y-MP compared with the CRAY-2 initially.
9
==
However, the local-memory implementations LL6L
and LL6LZ run at nearly the same rate on both
CRAY computers. The VBAND Choleski method,
denoted by LL6LZ in table 1, is 4.5 times faster
than the vcctorized basic kji method, LL1, on the
CRAY-2, but it is only 2.6 times faster than the LL1
method on the CRAY Y-MP. This difference in re-
sults for the two CRAY computers shows that al-
gorithms that arc carefully designed to exploit key
features of tile CRAY-2 architecture also run well
on the CRAY Y-MP, but the converse is not nec-
essarily true. The following paragraphs describe
the approach used to parallelize this method on the
CRAY-2 and CRAY Y-MP.
Parallel PVBAND Choleski Method
The sequential VBAND Choleski method is well
suited for parallel implementation on multiproccssor
computers such as the CRAY-2 and CRAY Y-MP.
FORTRAN listings of an autotasked subroutine and
a FORCE subroutine for tile PVBAND method are
given in the appendix. The basic PVBAND Choleski
method is shown in figure 3 in pseudocode instruc-
tions. The jki part of the PVBAND method, i.e.,
the completion of colunms k, k + 1, ..., k + r- 1 of
L in figure 2, must be finished before the parallel up-
dates using those columns begin. In this paper, the
region surrounding tile jki portion of the PVBAND
method is referred to as the barrier region. A bar-
rier region begins at a point at which all CPU's must
arrive before any can proceed; then the work within
the barrier region is completed by one CPU before
the remaining CPU's resume execution. The number
of barrier regions required for the PVBAND method
is n/r, where n is the number of equations and r is
the level of loop unrolling. The number of computa-
tions required within the barrier region is very small
(between 0.6 and 2.3 percent for the three example
problems considered in this paper) compared with
the total number of computations requircd for the
factorization. The multiple SAXPY updates can bc
computed in parallel with no synchronization or com-
munication required. The following two subsections
describe general differences in the implementation of
the barrier regions and parallel loops using FORCE
(macrotasking), microtasking, and autotasking. A
separate section is devoted to detailed descriptions
of the different parallel implementations.
Barrier regions. In general, the FORCE im-
plementations used in the PVBAND method assume
a fixed number of tasks and must wait for all tasks to
arrive at a barrier region, let one task complete the
work within the barrier, and then exit sequentially.
An excellent discussion of barrier algorithms used in
L00Plk=lto n-r-1 in steps of r
Begin barrier region
{Complete columns k, k + 1, ..., k +
r - 1 of L}
End barrier region
lastrk = k + fen(k) - 1
Parallel LOOP2 j ----k + r to lastrk
IF Lj,k,..., Lj,k+r_ 1 are not all zero
THEN
{Update columns j of L using r
columns of L via LOOPS}
LOOPS i----j to lastrk
Kid _ Kid- Li, k * Lj,k-
Li,k+l * Lj,k+l--...
--Li,k+r_ 1 * Lj,k+r_l
END LOOP3
ENDIF
END Parallel LOOP2
END LOOP1
Begin barrier region
{Finish any remaining columns of L
(if n is not a multiple of r)}
End barrier region
Figure 3. Parallel PVBAND Choleski method.
the FORCE language is given in reference 11. On
the other hand, microtasking and autotasking imple-
mentations use a less restrictive form of the barrier
synchronization. In this case, the first active pro-
cess that reaches a barrier region completes the work
within that region and leaves the barrier after the
work is completed. Other active processes that arrive
at the barrier region must wait until the work is com-
pleted or, if the work has already bccn completed,
they immediately skip the barrier region and con-
tinue execution. Each type of barrier implementation
requires critical regions, that is, regions where vari-
ables in shared memory must be incremented by one
(and only one) CPU at a time. The shared variable
must be locked to all other CPU's during the time
that it is being changed by a single CPU. The cost of
locking and unlocking variables is a major part of the
parallel overhead in the PVBAND implementations.
Parallel loops. The implementations of the par-
allel loops also differ substantially for the FORCE,
microtasking, and autotasking versions. In the
FORCE implementation, the parallel loop is distri_o-
uted to the fixed number of tasks in a wrap-around
fashion. That is, task 1 completes the SAXPY
updates associated with loop indices 1, 1 + p, 1 + 2p,
etc., where p is the number of tasks. This approach
has the advantage of evenly dividing the work in cases
10
c **** begin top of FORCE Barrier
call lockon(barlck)
if (ffnbar .lt. np- 1) then
ffnbar = ffnbar + 1
call lockoff(barlck)
call lockon(barwit)
endif
if (ffnbar .eq. np- 1) then
c **** end top of FORCE Barrier
c **** execute code inside barrier
c **** begin bottom of FORCE Barrier
endif
if (ffnbar .eq. O) then
call lockoff(barlck)
else
ffnbar = ffnbar -i
call lockoff(barwit)
endif
c **** end bottom of FORCE Barrier
Figure 4. FORTRAN code for a standard force barrier.
where the amount of work for each task is nearly the
same. The disadvantage of a prescheduled assign-
ment is that a large potential delay may bc incurred
on busy systems if all tasks are not active. In the
microtasking and autotasking implementations, the
parallel work is distributed only to active processes.
This approach has the advantage of dynamic alloca-
tion of the computing resources in a multiuscr envi-
ronment. The disadvantage of this approach is that
each CPU must determine its next loop iteration in-
side a critical region. A self-scheduled parallel loop
is also available in FORCE which uses the same ap-
proach as microtasking and autotasking. However,
the FORCE critical regions, which use maerotasking
lock subroutines described in the next section, are
much more expensive than the critical regions imple-
mented using microtasking and autotasking.
Parallel Implementation Details
Details of the various parallel implementations of
the PVBAND method are discussed next. These de-
tails show that autotasking and microtasking best
exploit the parallel architecture of the CRAY com-
puters, particularly in multiuser environments. The
FORTRAN code used to implement the parallel con-
structs is not usually seen by algorithm developers
since it is inserted automatically at compile time.
However, to illustrate the differences in the imple-
mentation of the barrier region and parallcl loop
using FORCE, microtasking, and autotasking, the
actual FORTRAN statements substituted for the
FORCE statements and various compiler directives
are examined. These details help to explain the
sometimes large differences in performance that are
observed in the results for the example problems us-
ing the same basic parallel algorithm but with differ-
ent parallel implementations.
FORCE implementations. The FORCE lan-
guage is described in detail in reference 6. Details
of the use of FORCE will not be presented here,
but the FORCE statements Barrier, End barrier,
and Presched DO are discussed. In FORCE, crit-
ical regions are implemented through macrotasking
lock subroutines. The locks are identified as shared
integer variables. The lock subroutines do not make
use of the fast hardware semaphore registers although
there are macrotasking subroutines that do use the
semaphore registers to implement certain types of
critical regions (refs. 3 and 4). A barrier subroutine
is also available in macrotasking but is not used by
the current version of FORCE.
The FORCE statements Barrier and End bar-
rier were used to implement the barrier region in
the PVBAND method. The Barrier and End barrier
statements are transformed by the FORCE prepro-
cessor into the FORTRAN code shown in fignlre 4.
The code listed in figure 4 uses subroutine calls to
lock tile shared variables barlck and barwit. These
calls ensure that only one task at a time can incre-
ment the counting variable ffnbar and prevent tasks
from exiting the barrier region before the work inside
the barrier is completed. When the routine lockon is
called, a task attempts to lock the argument variable.
If the argument variable is already locked, the task
incrementing the counting variable is disconnected
from the process and placed in the Blocked on a Lock
queue. When the argument variable becomes un-
locked, the library scheduler will move the tasks from
the Blocked on a Lock queue into the Ready queue.
Tasks in the Ready queue may then be assigned to
logical CPU's that are executed on physical CPU's
as they become available. If the argument vari-
able is not locked, the calling task locks the variable
and continues execution. Estimates of lock overhead
for both the CRAY-2 (ref. 3) and the CRAY Y-MP
(ref. 4) are described below. The CRAY-2 requires
200 clock cycles when the variable to be locked is un-
locked, whereas the CRAY Y-MP requires 400 clock
cycles. If the variable is locked, the CRAY-2 over-
head is 4000 clock cycles plus the time that a task
must wait for the variable to become unlocked. The
11
c **** begin top of alternate FORCE
Barrier
if (me .eq. i) then
ist = 2
i0 continue
do 20 i = ist, np
if (icord(i) .ne. k) then
ist = i
goto I0
endif
20 continue
c **** end top of alternate FORCE
Barrier
execute code inside barrier
3O
c **** begin bottom of alternate FORCE
Barrier
Icord(1) = k
else
continue
Icord(me) = k
if (icord(1) .ne. k) then
goto 30
endif
endif
c **** end bottom of alternate FORCE
Barrier
Figure 5. FORTRAN code for an alternate FORCE barrier.
corresponding CRAY Y-MP overhead is 1500 clock
cycles plus the wait time.
The FORCE barrier is implemented using two
locks, barlck and barwit. The initial state, before each
program call to the barrier, has barlck unlocked and
barwit locked. At the beginning of the FORCE bar-
ricr, the first task to reach the call lockon statement
locks the variable barlck and increments the counter
ffnbar. That task then unlocks variable barlck and
attempts to lock the second barrier shared variable
barwit, but since barwit is initialized in the locked
condition, that task must wait for barwit to be un-
locked. As soon as barlck is unlocked, another task
can lock it, increment the counting variable, unlock
variable barlck, and wait for barwit to become un-
locked. When the last task arrives at the barrier
statement (i.e., ffnbar = np 1, where np is the num-
ber of tasks), the variable barlck is left locked and
the FORTRAN code between the Barrier and End
barrier statements is executed while the remaining
np- 1 tasks arc still either suspended or trying to
12
lock variable barwit. When the single task finishes
the sequential FORTRAN code, tile counting vari-
able ffnbar is decremented and bar'wit is unlocked.
One of the remaining tasks now relocks barwit (which
keeps the remaining tasks still trying to lock bar-
wit) and continues execution by first decrementing
the counting variable ffnbar, unlocking barwit, and
continuing execution. The last task to lock barwi[
(i.e., when ffnbar = 0) unlocks barIck and continues
execution past the barrier. This method ensures that
all tasks arrive at a barrier region before the work in-"
side the barrier work is started, and no task leaves
the barrier region until the work inside the barrier
region is completed.
The standard FORCE barrier described in the
preceding paragraph is expensive on any parallel
system where the cost of locking shared variables
is high or where the specified number of tasks may
not be concurrently executing. The cost of locking
shared variables is eliminated by using an alternate
FORCE barrier, shown in figure 5. This barrier was
developed by Jones (ref. 12). In this barrier a shared
array, lcord(), is initialized once outside of LOOP1 _--
in figure 3 by using a standard FORCE barrier. The
shared array is then used for the barrier region within :
LOOP1 in figure 3. Each task is associated with one
array location (i.e., 1 through p) and writes only to
that location. At the beginning of the barrier region,
tasks 2 through p write the value of the LOOP1
iteration variable k to their respective locations in
array lcord. Task 1 checks locations 2 through p in
lcord and executes the code within the barrier region
only after locations 2 through p in Icord arc equal
to k. Tasks 2 through p check location 1 of array
Icord, continuing in a loop until task 1 completes
the barrier region and writes the value of the loop
iteration variable k to location 1 of lcord. Tasks
are not automatically suspended while waiting for
appropriate values in the Icord array since no calls to
lock routines are used. However, excessive CPU time
may accumulate when the number of tasks specified
at run time is greater than the number actually
running concurrently. For this reason, this barrier is -
used only on lightly loaded systems or in dedicated
mode. This type of barrier synchronization is useful -
within a loop such as in the PVBAND method, where
the outer-loop index k can be used for the test value
of the lcord array. It is not a general-purpose barrier.
The parallel loop in the PVBAND method is im-
plemented in FORCE using the Presched DO state-
ment. This statement is translated into a FORTRAN
DO loop by the FORCE preprocessor. For example,
the FORCE statement
Presched DO I I=ISTART,IEND
c **** begin top of process control
structure
cmic$process
cdir$ suppress(list of variables)
if (utqbcs (-I)) then
utqitr = utqifa(l,utqitr)
if (utqitr .eq. -I) then
c **** endtop of processcontrol struc-
ture
c **** execute code inside control
structure
c **** begin bottom of process control
structure
cmic$endprocess
endif
cdir$ suppress(list of variables)
call utqecs
endif
c **** endbottom of process control
structure
Figure 6. Microtasking coder or the process and end process
directives.
is translated to the following FORTRAN DO loop:
DO 1 I=ISTART+ME-I,1END,NP
The variable NP is the number of tasks specified
by the user at run time, and the task identification
variable ME is an integer between 1 and NP. This
FORTRAN loop assigns each iteration of the DO
loop to the specified number of tasks in a wrapped
fashion.
Microtasking implementations. Microtask-
ing is accomplished by inserting compiler directives
in FORTRAN source code to identify control struc-
tures. The barrier regions in figure 3 are implemented
using the cmic$ process and cmic$ end process direc-
tives, and the parallel loop is implemented using the
cmie$ do global directive. The code containing the
microtasking directives is preprocessed so that the
FORTRAN code implementing each control struc-
ture is inserted. In the following paragraphs, the
code generated by the preprocessor for both control
structures is discussed.
The process and end process control structure di-
rectives bound a barrier region that is to be com-
pleted by only one CPU before any additional CPU's
may proceed past the barrier region. In tile micro-
tasked algorithm, one CPU completes the r columns
of L at each outer loop iteration, and only one CPU
finishes the leftover work at the end. Only processes
currently active must wait at the end of the process
control structure. This approach differs significantly
from the FORCE barrier described in the previous
paragraphs where all tasks must reach the barrier
before the sequential code can begin and all tasks
must leave the barrier sequentially to begin execut-
ing the next section of code after the sequential code
is completed. As soon as the sequential code is com-
pleted, all processes may proceed to the parallel loop,
including any additional processes that may become
active. The FORTRAN code inserted by the micro-
tasking preprocessor for the process and end process
directives is discussed next.
Figure 6 shows the FORTRAN code generated
by premult, the microtasking prcproccssor, for the
process control structure. The preprocessor replaces
the process directive with a comment and inserts the
multitasking code. The cdir$ suppress directive is in-
serted by the preprocessor at the beginning and end
of the barrier region to force all variables referenced
within the barrier region from registers into shared
memory. This ensures that all CPU'_s have acccss to
the most current values of the shared variables. The
subroutine utqbcs is called by each CPU to determinc
if the control structure is eligible to bc proccsscd.
The control structure is eligible for processing if the
control structure has not been started or if it has been
started but not yet completed. A shared variable is
used to indicate which control structure is currently
active. An additional shared variable is used to up-
date the number of active processes within the con-
trol structure. The code within a process directive is
cxccuted by only one CPU, but all othcr CPU's must
wait until the sequential code is completed. For ex-
ample, if two of four CPU's are available, and hence
two CPU's attempt to cxccute one of the barrier re-
gions, the first CPU to reach the barrier region will
begin executing the code. The second CPU will call
routine utqecs which either suspends the process tem-
porarily or just delays the second CPU until the sin-
gle task is completed. As soon as the barrier region is
completed, both CPU's proceed to the next section
of code. If a third CPU becomes available and arrives
at the same barrier region after the work within the
barrier region is completed, routine utqbcs will return
a value of false and that CPU will skip the entire bar-
rier region and go on to the next section of code. The
implementation of this scheme uses special shared
and private variables, hardware semaphore registers
to lock appropriate variables, and assembly coded
13
2O
c **** begin top of do global control
structure
cmic$ do global
utqtrips = comp@((lastr) - (k+6))
if (utqbcs (utqtrips) then
if (utqtrips.lt.O) then
utqitr = utqifa(1,utqitr)
utqdummy = comp©(utqitr)
utqoff = (lastr)+l
j = utqoff + utqitr
if (utqdummy .ge. O) then
continue
c **** end top of do global control
structure
c **** execute code inside control
structure
c **** begin bottom of do global control
structure
utqitr = utqifa(l,utqitr)
utqdummy = comp_(utqitr)
j = utqoff + utqitr
if (utqdummy .ge. O) goto
20
endif
endif
call utqecs
endif
c **** end bottom of do global control
structure
Figure 7. Microtasking code for the do global directive.
routines such ,as utqbcs, utqifa, and utqecs to effi-
ciently carry out the necessary synchronization steps.
The do global directive is used to implement the
parallcl loop in figure 3. Each iteration of LOOP2,
the multiple SAXPY updates, is assigned one at a
time to cach active process. This mode of parallel
execution differs from the FORCE implementation
that distributcs tile iterations of LOOP2 to each task
in a predetermined sequence. The total number of
loop iterations is computed from the limit variables in
the FORTRAN DO loop. The particular loop indcx
j executed by cach CPU is determined by reading
a shared variable within a critical region and thcn
incrementing that variable.
Figure 7 shows the code generated by premult for
the do global control structure. CPU's that reach the
parallel DO loop will begin execution of the loop as
long as routine utqbcs returns a value of true. This
14
means that even CPU's that may become available
after much of the work within the parallel region is
completed may still be used in the parallel loop. Each
CPU determines its current iteration value utqitr
through calls to the routine utqifa. The routine
utqifa maintains a shared variable used to determine
iteration values. The shared variable does not appear
in the FORTRAN code in figure 7. All CPU's
that enter the parallel DO loop continue to execute
different loop iterations until the loop counter utqitr
gets to zero or to a positive value. Each CPU finishes
its last available loop iteration and then calls routine
utqecs, which delays all CPU's until all work in the
parallel region is completed.
Thc code displayed in figures 6 and 7 was pro-
duced oil the CRAY-2. The corresponding code gen-
erated on the CRAY Y-MP is different because of
hardware differences and the greater use of intrin-
sic functions that are used to test semaphore regis-
ters and to read from and write to shared registers.
These differences can cause noticeable changes in per-
formance even though the same basic strategy is fol-
lowed on both machines in implemcnting the barrier
regions and parallel loops.
Autotasking implementations. The depen-
dency phase fpp of the autotasking compilation was
not able to detect the parallelism in the VBAND
method. Therefore, directives that were functionally
equivalent to microtasking directives were inserted
in the FORTRAN subroutine used to carry out the
PVBAND method. Two autotasking implementa-
tions for the PVBAND method wcrc explored.
In the first autotasking implementation, LOOP2
in the PVBAND algorithm is preceded by the cmic$
do all directive. The do all directive defines the start
of a parallel region, and any code outside this region
is to bc executed only by the master process. The
advantage of this approach is simplicity (i.e., only
a single directive is required to parallelize the entire
method), but the disadvantage is that each outer loop
iteration requires additional overhead to start up a
new parallel region.
In the second autotasking implementation, the
entire subroutine is declared a single parallel region
using the cmic$ parallel and cmic$ end parallel di-
rectives. This approach is identical to the microtask-
ing implementation already discussed, but different
directives are used. All variables within the subrou-
tine are explicitly declared to be shared or private
within the parallel directive. The autotasking di-
rectives cmic$ case and cmic$ end case perform the
same function as the microtasking process and end
processdirectives, whereas the cmic$ do parallel di-
rective corresponds to the microtasking do global di-
rective. All code outside the control structures, but
inside the parallel region, is executed by all active
processes. In experiments, the second autotasking
method was found to be more efficient and was used
for all autotasking results presented in this paper.
The autotasking translation phase, generated by
fmp, produces code very similar to the code produced
by premult using microtasking. The autotasked sub-
routine has additional code generated to handle the
creation of parallel regions. In addition, separate
code segments are created for the master and the
slave processes. The code generated for the auto-
tasking do parallel directive is the same as the code
for the microtasking do global directive. Likewise, the
code generated for the autotasking case directive is
the same as the code for the microtasking process di-
rective, except for some differences related to the sup-
press directive. In microtasking, the suppress direc-
tive, at the beginning of the process control structure,
forces all variables that will be read inside the control
structure to be written from registers to memory. At
the end of the process control structure, all variables
that have been modified are again written to main
memory. In autotasking, only shared variables are
subject to the suppress directive that precedes and
follows the case directive.
CRA Y-2 local-memory considerations. The
local-memory version of the PVBAND method re-
quires some additional considerations. The columns
of L, completed inside the barrier region in the
PVBAND method, are computed by only one CPU
but arc used by all other CPU's. Since the CRAY-2
and CRAYY-MP are shared-memory machines,
thesc columns are accessible to all CPU's. How-
ever, the fast local memory cache on the CRAY-2
is not shared between CPU's and so each CPU must
copy the columns into local memory before begin-
ning the parallel SAXPY updates. Two approaches
were tested to minimize the extra cost of copying the
columns of L into local memory. No noticeable dif-
ference was observed for the two approaches in the
microtasked and autotasked implementations. How-
ever, for the FORCE implementations, a significant
-difference was observed.
The first approach was to complete all r columns
inside a single barrier, or control structure. The CPU
-completing the r columns copies them into its lo-
cal memory as soon as each column of L is com-
pleted. The remaining CPU's copy all six columns
in one loop as soon as the single CPU finishes the
barrier region. The second approach was to use r
barrier or control structures, allowing the idle CPU's
to copy each column into local memory as soon as
each column is completed. This approach reduces
the time required at the end of the barrier to copy
the columns into local memory, but at the expense of
more barriers and an increased number of DO loops
for the copying operation. Because of the high cost
of FORCE barrier statements, the second approach
was much slower than the first. In all the results prc_
sented in section 5 for the CRAY-2, the first approach
was used for the FORCE implementations, whereas
the second approach was used for the microtasking
and autotasking implementations.
AmdahI's law. The degree of parallelism for any
method is always limited by the number of sequen-
tial computations relative to the number of parallel
computations. A much-used measure of the effect
of sequential operations on the parallel efficiency of
a given method is Amdahl's law, which states that
the theoretical maximum speedup S possible for a
parallel algorithm is given by
S= P
1 + ws(p- 1) (4)
where p is the number of processors and ws is the
percentage of time required by sequential compu-
tations. (See ref. 13 for a thorough discussion of
several measures of the degree of parallelism.) For
the PVBAND method, the jki portion at each outer
loop iteration is the sequential part of the method,
whereas all remaining computations are carried out
in parallel in LOOP3 (fig. 3). Maximum theoretical
speedup estimates for the PVBAND method for each
of the three example structural analysis problems are
given in table 2 by using equation (4). The percent-
ages used for the computations in table 2 were ob-
tained by inserting counting variables in the barrier
region of code in the autotasked and microtasked im-
plementations. The maximum speedup estimations
assume that the sequential computations are carried
out at the same rate as the multiple SAXPY up-
dates in LOOP3 and also that no overhead exists for
the parallel loop and the barrier regions. The actual
speedups attained will always be less than the esti-
mates in table 2, but values close to these estimates
indicate good parallel implementations.
PVBAND examples. Table 3 gives a compar-
ison of parallel and sequential timings for the High-
Speed Civil Transport problem. The PVBAND im-
plementations are denoted by appending the letters
A, M, F, and J to the VBAND implementation sym-
bols in table 3. The letters correspond to the au-
totasking, the microtasking, the FORCE using the
15
Table2. Effect of Amdahl's Law on Maximum Theoretical Speedup for PVBAND Method
lber of
_ssors, p
2
4
8
16
Maximum theoretical speedup for problem
HSCT a
1.96
3.74
6.89
11.90
SRB b
1.96
3.75
6.93
12.03
3-D panel c
1.99
3.93
7.68
14.70
aHSCT: High-Speed Civil Transport, sequential operations 2.3 percent of total operations
for factorization.
bSRB: Space Shuttle solid rocket booster, sequential operations 2.2 percent of total opera-
tions for factorization.
e3-D panel: Cross-ply composite laminate, sequential operations 0.59 percent of total
operations for factorization.
standard barrier, and the FORCE using the special-
purpose barrier implementations of the PVBAND
method, respectively. Runs were made on the
CRAY-2 in dedicated mode comparing tile paral-
lel PVBAND implementations with the correspond-
ing sequential VBAND implementations described
previously. Parallel speedups arc computed using
the PVBAND wall clock times and the sequential
VBAND wall clock times for the same method.
Speedups are often computed by dividing the parallel
CPU time by the parallel wall clock time, but such
speedups do not accurately reflect the cost of par-
allcl processing. A comparison of the CPU time
for each PVBAND implementation with the CPU
time for the sequential VBAND implementation of
the same method shows that the total CPU time is
always greater for the parallel implementations than
for a single-processor implementation of the same al-
gorithm. If the CPU time for the parallel implemen-
tations is used to compute parallel speedups, imple-
mentations with the largest CPU time may appear to
have the greatest parallel efficiency. For example, in
table 3 the parallel speedup for the LL6LF1 FORCE
implementation would be i_9 if computed by divid-
ing the parallel CPU time by the parallel wall clock
time. However, the actual wall clock time for this
implementation is greater than the wall clock time
for the sequential VBAND implementation.
The FORCE PVBAND implementations, which
used the special alternate barrier described previ-
ously, were the fastest in dedicated mode and have
the highest parallel speedups. The faster times arc
expected since no locking of variables is required
for the special FORCE barriers and no tasks are
ever suspended while waiting for sequential work to
be completed. However, the special barrier works
effectively only in a dedicated environment. The
FORCE implementations that used the standard bar-
riers were the slowest of the PVBAND implementa-
tions. The LL6LF1 implementation, which uses one
barrier for each column computed in the jki portion
of the PVBAND method, is very inefficient on the
CRAY-2. The wall clock time for this method was
greater than the single-processor time of the corre-
sponding VBAND implementation. CPU times were
also much higher for the FORCE implementations on
the CRAY-2 than for the microtasking and autotask-
ing implementations. This difference reflects the high
cost of barriers in the FORCE implementations. The
differences between autotasking and microtasking are
small, with autotasking slightly faster for these runs.
The parallel speedups shown in table 3 are
affected by both the number of sequential compu-
tations and the overhead associated with the im-
plementation of barrier regions. Table 4 shows the
percentage of sequential operations for this problem.
The LL1 methods have a lower number of sequential
computations compared with the LL6 methods but
they also require six times as many barrier regions.
As a result, the parallel speedups in table 3 are, in
some cases, higher for the LL6 methods than for the
LL1 methods. This is especially true for the FORCE
implementations that use the standard FORCE bar-
rier. The results in table 3 also show that the use
of temporary arrays for local memory versions and
zero-checking reduces parallel speedups. However,
both local memory and zero checking reduce total
wall clock time and are therefore beneficial.
r
16
Table 3. PVBAND Method Timings for High-Speed Civil Transport Problem
"NASA Langley CRAY-2 computer in dedicated mode;]
16 146 equations; Maximum semibandwidth = 594; |
Average semibandwidth = 319 J
CPU time W_all time
Method a (second) (timer) Speedup b
Single-processor VBAND Choleski method
LL1 27.87
LL6 10.79
LL6L 8.53
LL6LZ 6.20
Parallel PVBAND method
27.95
10.82
8.55
6.22
using autotasking
1.0
1.0
1.0
1.0
LL1A
LL6A
LL6LA
LL6LZA
32.85
12.33
9.91
7.77
8.29
3.13
2.67
1.96
3.37
3.46
3.20
3.17
Parallel PVBAND method using microtasking
LL1M
LL6M
LL6LM
LL6LZM
33.53
12.90
10.06
7.85
8.54
3.67
2.89
2.15
3.27
2.95
2.96
2.89
Parallel PVBAND method using standard FORCE barrier
LLIF
LL6F
LL6LF1
LL6LF2
LL6LZF
50.86
14.26
54.52
13.37
11.42
Parallel PVBAND method using
17.18
3.98
28.52
4.25
3.94
1.63
2.72
2.01
1.58
special FORCE barrier c
LLIJ 30.26 7.64 3.66
LL6J 12.17 3.06 3.54
LL6LJ 9.56 2.41 3.55
LL6LZJ 7.15 1.79 3.47
aVBAND method names appended by letters A, M, F, and J to denote autotasking,
microtasking, FORCE with standard barrier, and FORCE with special alternate barrier
implementations of PVBAND method, respectively.
LL6LF1 - uses six barrier statements for each outer loop iteration in jki portion of
PVBAND method
LL6LF2 - uses only one barrier statement to finish all six columns in jki portion of
PVBAND method
bpVBAND speedups calculated using the PVBAND wall clock time and the corresponding
wall clock time for the single-processor method.
eSpecial FORCE barrier avoids use of critical regions that require calls to routines that lock
the shared variables.
Solution of Triangular Systems
The solution of the triangular systems is a for-
ward solution (eq. (2a)), followed by a backward so-
lution (eq. (2b)). As in the factorization of K, there
are several possible implementations of the triangu-
lar solutions. Since the factorization is more costly
than the triangular solutions, the data structure used
for the factorization often determines the implemen-
tations of the forward and backward solutions. In
17
Table 4. Number of Computations and Memory Requirements for Example Problems
Factorization Sequential, MemoryMethod (total operations) percent of total (64-bit words)
High-Speed Civil Transport aircraft
(16 146 equations; Max. semibandwidth = 594; Av. semibandwidth = 319)
LL1 2033 084 710
LL6 2 033 161 281
LL6LZ 1 345 866 285
0.09
1.5
2.3
5 160 501
5 160 591
5 160 591
Composite cross-ply laminate with hole
(11929 equations; Max. semibandwidth = 1597; Av. semibandwidth = 1081)
LL6LZ 13 098 674 109 I 0.59 12 901 825
Space Shuttle solid rocket booster
(54 870 equations; Max. semibandwidth = 558; Av. semibandwidth = 355)
LL6LZ 5 427 998 155 I 2.2 19 522 323
the variable-band data structure, the lower trian-
gular part of L is stored by columns. As a result,
the forward solution is carried out using SAXPY op-
erations. This so-called column-sweep algorithm is
shown in figure 8(a). For the backward solution, the
columns of L are now the rows of LT; therefore, a
different approach that uses inner products is used
to ensure stride one memory accesses. Figure 8(b)
illustrates this inner product algorithm.
LOOPI j = 1 to n
fj = 0 rW N LOOP2
zj ----fj/Lj,j
lastr = j + len(j) - 1
LOOP2 i = j + 1 to lastr
fi = fi - Li,j * zj
END LOOP2
END LOOPI
(a) Column-sweep forward solution, Lz = f.
LDOPi j = n to i
lastr = j + len(j) - 1
LOOP2 i = j + 1 to lastr
zj = zj - Li, j * u i
End L00P2
uj = zj/Lj,j
End LOOP1
(b)Innerproductbackward solution,LTu = z.
Figure 8. Forward-and-backward triangular solution
algorithms.
The zero-checking option used to reduce oper-
ations in the factorization can also be applied to
the forward solution, as shown in figure 8(a). For
problems where the right-hand side vector has few
nonzero values, this strategy can significantly reduce
tile operation counts for the forward solution. The
zero-checking strategy cannot be used effectively for
the inner products in the backward solution. The
loop-unrolling techniques described for the factoriza-
tion were also applied to both the forward and back-
ward solutions. The computation rates for the tri-
angular solutions were improved significantly on the
CRAY-2 as a result of loop unrolling, but on the
CRAY Y-MP the improvement was very small. This
difference is due to the positive effect of chaining on
the CRAY Y-MP which caused the initial triangu-
lar solutions to be nearly two times faster than the
CRAY-2 versions. Loop unrolling made up much
of the difference on the CRAY-2, but the overhead
of additional sequential computations required for
the loop-unrolling version limited the amount of the
improvement. A FORTRAN listing of the forward
and backward solutions using loop unrolling and zero
checking is given in the appendix.
For the parallel implementations of the PVBAND
method, the triangular solutions were carried out on
a single processor. Times for the triangular solutions
are given in section 5 for the three structural analysis
problems.
Out-of-Core Extensions
An out-of-core version of both the VBAND and
PVBAND methods is under development. The out-
of-core version permits the solution of problems
that are too large to fit in main memory or al-
lows users to solve larger problems interactively on
machines where interactive memory limits are im-
posed. The out-of-core method is designed to take
advantage of the large memories available on today's
18
Finite element model
(Model size is half of sketch shown)
2851 nodes
5189 two-noded rod elements
4329 four-noded quadrilateral elements
114 three-noded triangular elements
Stiffness matrix
16 146 equations
499 505 coefficients
594 max. semibandwidth
319 av. semibandwidth
Figure 9. Finite element model of High-Speed Civil Transport aircraft.
supercomputers. This is accomplished by using the
buffer in and buffer out CRAY FORTRAN I/O state-
ments and three buffers to store blocks of the factored
matrix. The input matrix is stored in large blocks by
records in an unformatted file. The I/O required is
carried out using one of the three buffers in a rotat-
ing fashion, and computations required for the fac-
torization proceed in the remaining buffers while I/O
operations are underway. The data are read in once
and written out after an entire buffer is finished, and
no movement of data in memory is required. The
overlapping of I/O and computations in this method
results in factorization times for the out-of-core ver-
sion that are very close to the factorization times for
the in-core version of the factorization. The main
difference in wall clock times for the in-core meth-
ods compared with the out-of-core methods occurs
in the triangular solutions. The triangular solutions
require two passes through the factored matrix, and
the number of computations per coefficient is only
two. As a result, the time required for the triangular
solutions is increased dramatically for the out-of-core
methods.
4. Structural Analysis Applications
In this paper, three representative structural
analysis problems are used to compare tile differ-
ent parallel implementations of the variable-band
Choleski method. The three problems represent full-
scale problems that can all be stored in main memory
on the CRAY-2 and CRAY Y-MP computers. The
example structural analysis problems described in
this work were carried out using the COmputational
MEchanics Testbed (COMET), a large-scale struc-
tural analysis software system developed by the Com-
putational Mechanics Branch at the LaRC (refs. 14
and 15). Although the structural response determi-
nation is the primary goal of structural analysis, the
three problems considered herein are used primarily
to assess the performance of the parallel/vector equa-
tion solvers. Hence, only a brief description of each
structural problem is provided.
High-Speed Civil Transport Aircraft
Projected trends in travel to the Pacific Basin
have led to a renewed national interest in the
19
Finite element model
4114 nodes
3328 eight-noded hexahedral
elements arranged in 16 layers
Stiffness matrix
11 929 equations
397 139 coefficients
1597 max. semibandwidth
1081 av. semibandwidth
Cutout view showing stacking
of elements through thickness
Figure 10. Finite element model of cross-ply laminate with a hole.
development of a High-Speed Civil Transport (HSCT)
aircraft (refs. 16 and 17). The design of such a vehicle
will require an integrated analysis approach involv-
ing both structures and aerodynamics. To accelerate
the development of mathematical models of various
structural configurations, a parameterized finite ele-
ment model has bccn developed. The finite element
model (fig. 9) used herein represents the symmet-
ric half of the ttSCT aircraft. The model involves
2851 nodes, 5189 two-noded rod elements, 4329 four-
noded quadrilateral elements, and 114 threc-noded
triangular elements. The linear static response is
determined for the case of a wingtip loading. The
linear system for this problem has 16 146 equations
with a maximum semibandwidth of 594 and an av-
erage scmibandwidth of 319. The nodes in the finite
element model were ordered using the reverse Cuthill-
McKee algorithm implemented in the COMET soft-
ware to minimize bandwidth.
Cross-Ply Composite Laminate With a
Hole
A detailed stress analysis of composite structures
is often required to accurately determine through-
the-thickness (or interlaminar) stress distributions.
Some sources of interlaminar stress gradients include
free-edge effects, holes, and ply drop-offs (e.g., ta-
pered stiffener attachment flanges). To study these
effects, three-dimensional finite element models are
frequently used. To study the performance of these
solvers for three-dimensional finite element models,
the overall structural response of an 8-ply cross-ply
composite laminate with a central circular hole was
considered. The finite clement model (fig. 10) used
herein has 4114 nodes and 3328 eight-noded hexa-
hedral solid elements arranged in 16 layers. This fi-
nite element model is adequate for overall response
characteristics but must be refined in order to deter-
mine the interlaminar stress state accurately. The
linear system for this problem has 11 929 equations
with a maximum semibandwidth of 1597 and an av-
erage semibandwidth of 1081. The larger bandwidth
of this problem is characteristic of three-dimensional
models. The nodes in the finite clement model were
ordered using the reverse CuthilI-McKee algorithm
implemented in the COMET software to minimize
bandwidth.
Space Shuttle Solid Rocket Booster
A preliminary assessment of the Space Shuttle
solid rocket booster (SRB) global shell response to se-
lected prelaunch loads was presented in reference 18.
The two-dimensional shell finite element model used
in this study was translated into a format compatible
m
2O
Finite element model
9205 nodes
1273 two-noded beam elements
9 i 56 four-noded quadrilateral elements
90 three-noded triangular elements
Stiffness matrix
54 870 equations
1 310973 coefficients
558 max. semibandwidth
355 av. semibandwidth
Figure 11. Finite element model of Space Shuttle solid rocket booster.
with the COMET software. The finite element model
(fig. 11) involves 9205 nodes with 1273 two-noded
beam elements, 90 three-noded triangular elements,
and 9156 four-noded quadrilateral elements. The
linear static response to the internal pressure load-
ing in the SRB was obtained. The linear system for
this problem has 54 870 equations with a maximum
semibandwidth of 558 and an average semibandwidth
of 355. The nodes in the finite element model were
ordered using the bandwidth minimization options
in PATRAN software developed by PDA Engineer-
ing (ref. 19). The PATRAN bandwidth optimization
reduced the bandwidth for this problem more than in
previously reported results (refs. 8 and 9) which used
the reverse Cuthill-McKee algorithm implemented in
the COMET software. As a result, the number of
operations required for the matrix factorization was
reduced approximately 25 percent.
5. Numerical Results and Comparisons
In this section, timing results are presented
for the matrix factorization stage of three struc-
tural analysis problems comparing the sequential
VBAND Choleski method and the parallel PVBAND
Choleski implementations. The experiments were
performed on the CRAY-2 at the LaRC and on
the CRAY Y-MP at the ARC. Results are pre-
sented from computer runs on both the CRAY-2
and CRAY Y-MP computers made first during dedi-
cated single-user mode and second from runs made in
21
Method CPU time Wall time Rate, Speedup
(second) (timej9 MFLOPS
High-Speed Civil Transport aircraft (16 146 equations;
Max. semibandwidth = 594; Av. semibandwidth -- 319)
VECTOR
AUTO
MICRO
SFORCE
JFORCE
6.20
7.77
7.85
11.42
7.15
6.22
1.96
2.15
3.94
1.79
216
687
626
342
752
3.17
2.89
1.58
3.47
Composite cross-ply laminate with hole (11 929 equations;
Max. semibandwidth = 1597; Av. semibandwidth = 1081)
VECTOR
AUTO
MICRO
SFORCE
JFORCE
47.92
5O.52
51.22
55.34
51.27
48.07
12.84
13.04
17.57
12.95
272
1020
1004
746
1011
3.74
3.69
2.74
3.71
Space Shuttle solid rocket booster (54 780 equations;
Max. semibandwidth = 558; Av. semibandwidth = 355)
VECTOR
AUTO
MICRO
SFORCE
JFORCE
24.83
30.25
31.07
43.87
29.20
24.91
7.78
8.01
15.74
7.40
218
698
678
345
734
3.20
3.11
1.58
3.37
(a) Comparison of vector and parallel implementations of
Choleski factorization for three example problems.
i VECTOR - no paralMization
li_ AUTO - autotasking
B MICRO - microtasking
L_ SFORCE - macrotasking, standard FORCE barrier
JFORCE - macrotasking, special FORCE barrier
I , I , i I _ I
0 .25 .50 .75 1.00
(b) Comparison of wall clock times (normalized).
Figure 12. CRAY-2 factorization times in dedicated mode with a maximum of four CPU's.
normal multiuser environments. The runs made
in dedicated mode show the maximum computa-
tion rates possible for the three problems using the
PVBAND method and measure parallel overhead
without additional delays caused by other programs
executing at the same time. The runs made in mul-
tiuser mode measure the wail clock times for both
the VBAND and PVBAND methods when other
programs are executing. The results show that the
PVBAND method is efficient on the CRAY comput-
ers and significantly reduces wall clock time in both
multiuser and dedicated modes.
In the results that follow, the names VECTOR,
AUTO, MICRO, SFORCE, and JFORCE are used
to denote the vector implementation of the VBAND
method and the autotasking, microtasking, and two
FORCE implementations of the PVBAND method,
respectively. SFORCE denotes the use of the
FORCE Barrier statement, and JFORCE denotes
the FORCE implementation that uses the alternate
barrier described in section 3. All parallel methods
compared in this section use loop unrolling to level 6
and the zero-checking option described in section 3.
Table 4 shows the number of computations, percent-
age of sequential operations during factorization, and
memory requirements for each of the example struc-
tural analysis problems.
Timing Considerations
The measure of time for the comparison of algo-
rithms is an important consideration, particularly on
multiuser parallel computers like the CRAY comput-
ers which offer several timing options. For sequential
algorithms, CPU time is the most often quoted mea-
sure of time and is usually measured on the CRAY
computers by the FORTRAN function second(). For
parallel runs, second() returns a cumulative CPU
time, summed for all processes and as such provides
no useful measure of parallel speedup. A different
timing function, tsecnd(), may be used to measure
individual CPU time for a given task. The CPU time
for a routine executed by p processes may be taken as
the largest CPU time returned for any single process.
This routine can be used as an effective measure of
22
Method CPUtime Walltime Rate, Speedup
(second) (time]) MFLOPS
High-Speed Civil Transport aircraft (16 146 equations;
Max. semibandwidth = 594; Av. semibandwidtl- = 319)
VECTOR 5.50 5.50
AUTO 6.46 1.62
MICRO 6.29 1.57
SFORCE 6.98 1.75
JFORCE 6.33 1.58
Composite cross-ply laminate with
Max. semibandwidth =
245
831 3.40
857 3.50
769 3.14
852 3.48
hole (11 929 equations;
1597; Av. semibandwidth = 1081)
VECTOR 46.33 46.34 283
AUTO 48.71 12.18 1075 3.80
MICRO 48.11 12.03 1089 3.85
SFORCE 48.69 12.19 1075 3.80
JFORCE 49.14 12.29 1066 3.77
,, ,, r
Space Shuttle solid rocket booster (54 780 equations;
Max. semibandwidth = 558; Av. semibandwidth = 355)
VECTOR
AUTO
MICRO
SFORCE
JFORCE
21.92
25.56
24.83
27.19
25.10
21.93
6.39
6.21
6.80
6.28
248
849
874
798
864
3.43
3.53
3.23
3.49
(a) Comparison of vector and parallel implementations of
Choleski factorization for three example problems.
VECTOR - no parallelization
AUTO - autotasking
MICRO - microtasking
SFORCE - macrotasking, standard FORCE barrier
[_ JFORCE - macrotasking, special FORCE barrier
, I i I i I t I
.25 .50 .75 1.00
(b) Comparison of wall clock times (normalized).
Figure 13. CRAY Y-MP factorization times in dedicated mode with a maximum of four CPU's.
the load balance for a given algorithm executed on a
fixed number of processors, as in FORCE implemen-
tations described in this paper. However, tsecnd()
often does not provide an accurate measure of the
total time required for parallel algorithms. Even on
dedicated systems, the tsecnd() time is often sub-
stantially less than the wall clock time measured for
the same process. This difference may lead to overly
optimistic performance projections that cannot ac-
tually be obtained even in a dedicated environment.
The most reliable measure of actual achievable com-
putation rates for given parallel algorithms on the
CRAY computers is obtained using wall clock time
(function time f) in a dedicated, single-user environ-
ment. In this paper, only wall clock times are used
to evaluate the performance of the parallel implemen-
" tations. Additional results are given that show the
times reported by second(), and they are referred to
as CPU time. The CPU times show in general that
parallel runs accumulate more total CPU time than
the corresponding CPU time for a single-processor
run, and these times generally increase as the num-
ber of processors increases.
Dedicated Runs
Dedicated, single-user runs were obtained for the
three example problems to measure the maximum
performance of the VBAND and PVBAND imple-
mentations. For each run, the wall clock time
was used to compute the MFLOP rate and paral-
lel speedups were measured relative to the wall clock
time for the VBAND method VECTOR.
CRAY-2 results. Figure 12 shows CRAY-2 re-
sults for all three structural analysis problems using
the VBAND and PVBAND implementations. The
JFORCE method, which uses the special FORCE
barrier with no calls to macrotasking lock subrou-
tines, is the fastest PVBAND implementation on two
of the three problems. The SFORCE method, which
uses the standard FORCE barrier, is the slowest
PVBAND implementation, demonstrating the high
23
Method CPUtime Walltime Rate, Speedup
(second) (timef) MFLOPS
High-Speed Civil Transport aircraft (16 146 equations;
Max. semibandwidth = 594; Av. semibandwidth = 319)
VECTOR
AUTO
MICRO
SFORCE
JFORCE
Composite
5.50 5.50 245
7.64 .96 1402
7.39 .93 1447
10.93 1.42 948
7.43 .93 1447
cross-ply laminate with hole (11 929
Max. semibandwidth = 1597; Av. semibandwidth
5.73
5.91
3.87
5.91
equations;
= 1081)
VECTOR 46.33
AUTO 51.32
MICRO 50.55
SFORCE 55.00
JFORCE 52.11
Space Shuttle solid
Max. semibandwidth = 558; Av.
VECTOR 21.92 21.93
AUTO 29.69 3.72
MICRO 28.90 3.64
SFORCE 39.44 4.95
JFORCE 29.10 3.64
46.34 283
6.43 2037 7.21
6.34 2066 7.31
7.96 I646 5.82
6.52 2008 7.11
rocketbooster(54 780equations;
semibandwid_ = 355)
248
1459 5.90
1491 6.02
1097 4.43
1491 6.02
(a) Comparison of vector and parallel implementations of
Choleski factorization for three example problems.
VECTOR - no parallelization
AUTO - autotasking
k_ MICRO - microtasking
SFORCE - macrotasking, standard FORCE barrier
JFORCE - macrotasking, special FORCE barrier
i I J I i ! ' I
.25 50 .75 1.00
(b) Comparison of wall clock times (normalized).
Figure 14. CRAY Y-MP factorization times in dedicated mode with a maximum of eight CPU's.
cost of the lock subroutines used by the standard
FORCE implementation, Over 1 GIGAFLOP was
achieved on the largest bandwidth problem, the com-
posite cross-ply laminate, for all but the SFORCE
method. This problem has the lowest percentage
of sequential operations in the barrier region and
the longest vector lengths in tile parallel update sec-
tion. On the CRAY-21 autotasking (method AUTO)
is slightly faster than microtasking (method MICRO)
for all three problems.
CRAY Y-MP results. Figures 13 and 14
give timing results from runs on the CRAY Y-MP
using four and eight processors in a dedicated mode.
The eight-processor CRAY Y-MP results in figure 14
show rates of over 1 GIGAFLOP for M1 three prob-
lems and over 2 GIGAFLOPS for the cross-ply lam-
inate problem. Parallel spccdups for each of the
problems compare favorably with the theoretical
maximums computed in table 2.
A comparison of the CRAY-2 results in figure 12
and the CRAY Y-MP results in figure 13 shows that
both the computation rates and parallel speedups are
higher on the CRAY Y-MP than on the CRAY-2.
These results may be due to the fact that the
CRAY Y-MP has more hardware available than the
CRAY-2 for multitasking purposes. There are nine
clusters of shared registers for interprocessor com-
munication and synchronization, each of which has
8 shared address, 8 shared scalar, and 32 semaphore
registers. In contrast, the CRAY-2 has only eight
semaphore registers that synchronize common mem-
ory references in multitasked jobs.
The CRAY Y-MP performance of the FORCE
implementations relative to the autotasking and mi- .
crotasking implementations is significantly improved
over the CRAY-2 results. The FORCE times in fig-
ure 13 from CRAY Y-MP runs using four CPU's are
very close to the autotasking and mierotasking times.
However, the FORCE times are noticeably slower
than the autotasking and microtasking times as the
number of processors is increased from four to eight
(fig. 14).
m
24
OntheCRAYY-MP,themicrotaskedPVBAND
implementationisslightlyfasterthantheautotasked
implementation.Thisresultisreversedfromtherela-
tiveperformanceofautotaskingandmicrotaskingon
theCRAY-2.ThePVBANDimplementationsonthe
CRAYY-MP werecompiledusingthe CFT77ver-
sion4.0compiler.Earlierresultsthat useda previ-
ouscompilerhadshownfastertimesforautotasking.
TheCRAY-2compilerusedfortheresultspresented
in thispaperisCFT77version3.1.
The times in figures 12-14 show that the
PVBANDmethodisefficientandachievesahighper-
centageof themaximumpossiblecomputationrate
onbothCRAYcomputers.Thetimesfor theSpace
ShuttleSRBproblemarethe lowestwallclockand
CPU timesyet achievedfor this problem(refs.8
and 9). The ratesachievedin the VBAND and
PVBANDimplementationsuseonlyFORTRANsub-
routines. Further improvementsin the computa-
tion ratescanbe achievedby writing partsof the
PVBANDimplementationsin assemblycode.In ad-
dition, if the nextgroupof r columns of L is com-
pleted as part of the parallel update region in the
PVBAND method at each stage, or outer loop it-
eration, the barrier region can be eliminated. This
change eliminates the barrier region and should im-
prove the performance of the PVBAND method as
the number of processors increases. These improve-
ments are the subject of current and future studies.
Multiuser Runs
Although the dedicated runs indicate the maxi-
mum benefit of parallel processing to a single user,
the typical computing environment on multiproces-
sor CRAY supercomputers is a multiuscr mode with
many users sharing the four to eight CPU's. Many
users may be logged in at any given time and may bc
interactively executing jobs, editing files, or submit-
ting batch runs to one or more of several job queues.
In this environment, users rarely have control of all
the CRAY CPU's for an entire job execution. The
performance of multitasked jobs will differ across ma-
chines according to the system parameters defined by
each computing center. In order to see the effect of
multitasking in a busy environment on the CRAY-2
and the CRAY Y-MP, each problem was run multiple
times using the VBAND and PVBAND implemen-
tations. Multiple runs were submitted using script
.files during normal operating hours. The two largest
problems were run in batch queues that are normally
run after the normal working hours along with other
large batch runs. The results shown in figures 15 17
were taken from 10 runs of each problem. Both the
CRAY-2 and CRAY Y-MP were running the CRAY
UNICOS 5.1.9 operating system at the time of this
study. The CRAY UNICOS 6.0 operating system in-
corporates changes that should improve the perfor-
mance of multitasked jobs in a batch environment.
CRAY-2 results. Figure 15 shows the best,
median, and worst elapsed times for all three prob-
lems from runs on the CRAY-2. One factor that
significantly affects the performance of the VBAND
and PVBAND implementations in multiuser mode
is the memory-swapping size linfit. For programs
below the memory-swapping size limit, the entire
matrix is swapped out of main memory more of-
ten than for larger problems when tasks arc inter-
rupted. Larger problems are left in memory even
though the CPU may be interrupted and dedicated
to another user for a time period. This effect can
be seen in figure 15 by comparing the CPU and wall
clock times for the SRB problem, which is over the
swapping size limit, with the HSCT and composite
panel problems, which are under the swapping size
limit. The best-case wall and CPU times for the
SRB problem are nearly identical to the dedicated
times for the VBAND method VECTOR. For the
smaller problems, the best-case wall times were be-
tween two and four times longer than tile dedicated
wall times for method VECTOR, indicating that the
cost of memory swapping significantly increased the
wall clock times. The wall clock times for the auto-
tasking and microtasking PVBAND implementations
indicate that these parallel implementations signifi-
cantly reduce wall clock time when compared with
the wall clock time for the VBAND VECTOR im-
plementation for all three problems. The FORCE
implementations did not reduce CPU time in the
multiuser mode, and in most cases they significantly
increased wall clock times relative to the VBAND
method. The very poor performance of FORCE on
the CRAY-2 is the result of the interaction between
the high cost of standard FORCE barriers and the
scheduling of a fixed number of tasks in a busy en-
vironment. This apparently leads to more frequent
swapping between memory and disks, thus greatly in-
creasing the wall clock times. The multiuser times for
the FORCE implementations on the CRAY-2 were
as much as 35 times greater than the corresponding
method VECTOR times in the worst case. The auto-
tasking and microtasking implementation times were
very close on all three problems, with the autotask-
ing implementations giving the fastest best-case wall
clock times on all three problems.
CRAY Y-MP results. On the ARC CRAY
Y-MP, the maximum number of CPU's used for the
single user's program is set by default to four. Users
25
!Method Best times: Median times: I Worst times:
Wall / CPU Wall / CPU [ Wall / CPU
High-Speed Civil Transport aircraft (16 146 equations;
Max. semibandwidth = 594; Av. semibandwidth = 319)
VECTOR 25.64 / 6.21 32.79/6.22 38.27/6.23
AUTO 8.92 / 7.06 11.51 / 6.97 14.09/6.99
MICRO 9.35 / 7.41 13.34/7.29 15.54/7.06
FORCE 51.17 / 13.88 253.95/15.10 1345.16/16.20
Composite cross-ply laminate with hole (11 929 equations;
Max. semibandwidth = 1597; Av. semibandw_th = 1081)
VECTOR 98.41/47.64 145.80147.68 357.79/47.72
AUTO 38.79/49.35 51.01/49.15 71.16/48.95
MICRO 39.23/50.03 49.42/49.85 70.93/49.87
FORCE 79.42/55.17 350.31/56.95 1318.95/54.56
Space Shuttle solid rocket booster (54 780 equations;
Max. semibandwidth = 558; Av. semibandwidth = 355)
VECTOR 24.90/24.87 33.14/24.86 50.86/24.85
AUTO 12.38/29.17 16.64/28.28 25.74/27.88
MICRO 12.99/29.95 18.86/29.01 24.52/28.62
FORCE 40.17/51.98 62.54/50.07 100.71/54.78
(a) _,Vall clock and CPU times for Choleski factorization
for 10 runs each on three example problems.
Wall clock time
Best
[---'1 Median
[_ Worst
6.6 35
_-_--i_!_!i:!ili!;iiiiiiiiiiiii;iiiiiii!iiiiill_<_ "7
I i I I I
0 1 2
(b) Comparison of wall clock times (normalized).
Figure 15. CRAY-2 factorization times in multiuser mode with a maximum of four CPU's.
I
Method Best times: ] Median times: Worst times:
Wall / CPU [ Wall / CPU Wall / CPU
High-Speed Civil Transport aircraft (16 146 equations;
Max. semibandwidth = 594; Av. semibandwidth = 319)
VECTOR 5.60 / 5.58 5.62/5.60 5.84 / 5.59
AUTO 1.64 / 6.54 1.66/6.55 1.78/6.52
MICRO 1.59 [ 6.33 1.61 / 6.35 2.76 / 6.28
FORCE 1.77 / 7.05 1.80/7.12 2.84/7.04
Composite cross-ply laminate with hole (11 929 equations;
Max. semibandwidth = 1597; Av. semibandwidth = 1081)
VECTOR 65.32/47.15 96.06/47.62 120.87/47.39
AUTO 28.35/48.92 38.88/49.13 64.76/48.78
MICRO 25.56/48.52 37.84/47.91 67.94/48.06
FORCE 26.10 / 50.17 31.74/49.94 42.71/49.84
Space Shuttle solid rocket booster (54 780 equations;
Max. semibandwidth = 558; Av. semibandwidth = 355)
VECTOR 22.15 / 22.10 34.49/22.45 58.79/22.36
AUTO 7.62/25.53 13.72/25.54 42.30/24.46
MICRO 7.38 / 25.08 14.99/24.86 19.76/24.40
FORCE 10.26 / 28.99 14.73 / 28.38 28.39/28.07
(a) Wall clock and CPU times for Choteski factorization
for 10 runs each on three example problems.
Wall clock time
Best
[""] Median
Worst
_.__ _.'.-;_:?_:.::!.::._;:;._j?;.':_y;::_,'_&[-.',.%:-:_:_:_.::___._, [
:.x.:-x..-:.x'.x_:._x:_._:!:,"'_.._-'-.-'
, I
0 .5 1.0
(b) Comparison of wall clock times (normalized).
Figure 16. CRAY Y-MP factorization times in multiuser mode with a maximum of four CPU's.
l,
26
Best times: Median times: Worst times:Method
Wall / CPU Wall / CPU Wall / CPU
High-Speed Civil Transport aircraft (16 146 equations;
Max. semibandwidth = 594; Av. semibandwidth = 319)
VECTOR 5.60/5.58 5.62/5.60 5.84/5.59
AUTO 1.04 [ 7.57 1.31 / 7.33 6.21 / 6.86
MICRO 1.03/7.31 1.49/7.04 14.42/6.60
FORCE 1.64/11.45 2.27/11.50 13.0919.04
Corn mslte cross-ply Iaminate with hole (11 929 equations;
Max. semibandwidth = 1597; Av. semibandwidth = 1081)
VECTOR 65.32 / 47.15 96.06 / 47.62 120.87/47.39
AUTO 28.46 / 50.00 34.62/49.61 93.09/49.18
MICRO 19.94 / 49.94 39.93 / 48.65 49.86 [48.36
FORCE 19.50 [ 54.01 26.55 / 53.81 70.46 [52.99
Space Shuttle solid rocket booster (54 780 equations;
Max. semibandwidth = 558; Av. semibandwidth = 355)
VECTOR 22.15 / 22.10 34.49 / 22.45 58.79/22.36
AUTO 6.92/27.07 11.32/27.16 29.66/25.22
MICRO 6.46[26.87 11.86[ 26.87 26.93/25.02
FORCE 10.33 / 38.45 18.31 / 33.91 31.93/38.20
(a) Wall clock and CPU times for Choleski factorization
for 10 runs each on three example problems.
Figure 17. CRAY Y-MP factorization times in
can request a maximum of eight CPU's at run time.
Figure 16 gives results from multiuser runs for all
three example problems using the default maximum
number of four CPU's, and figure 17 gives results
for the same problems using a maximum of eight
CPU's. Very little improvement in wall clock time
is seen in figure 17 compared with the results in
figure 16. This result indicates the difficulty of
obtaining all eight processors in a busy multiuser
environment. The results in both figures i6 and 17
show a substantial reduction in wall clock time by all
the PVBAND implementations. A major difference
between the CRAY-2 and CRAY Y-MP results is
that the CRAY Y-MP wall clock times are much
better than the CRAY-2 wall clock times for the
FORCE PVBAND implementations. This difference
between the implementation of macrotasking on the
two CRAY computers is reflected in a comparison of
the wall clock times in figures 15, 16, and 17 for the
FORCE implementations.
Triangular Solutions
Table 5 shows the CPU times for the solution of
the triangular systems for the three structural anal-
Wall clock time
Best
r-] Median
Worst
_i:i_i_i:i:i:i:i:i::.:i_:.:i_:_:i_.:i:i_:_:h`._:_!:_:!:!_{:!:_i:i:i:i_:i_
:::::::::::::::::::::::::::::::::::::::::::::::: |_v_v.i
::_:::;_:_ .....
l I
0 1.0
I
2.0
(b) Comparison of wall clock times (normalized).
multiuser mode with a maximum of eight CPU's.
ysis problems used in this study. The improvement
in computation rate is demonstrated for the HSCT
problem by comparing the basic triangular solution
times LLIS with the triangular solutions using loop
unrolling to level 6, i.e., LL6S. This improvement is
largest on the CRAY-2, where the bottleneck due to
the single path to memory causes single SAXPY or
inner product computations to be much slower than
the same operations on the CRAY Y-MP. The re-
duction in operations due to the zero-checking strat-
egy LL6SZ is also shown in table 5 for the HSCT
problem. For all three problems, the percentage of
CPU time for the triangular solutions as compared
with the factorization time is small. The potential
benefit from parallelizing the triangular solutions is
small because of the small percentage of total time
required for the triangular solutions. For analyses
requiring multiple triangular solutions, the benefit of
a parallel triangular solution would be far greater.
6. Concluding Remarks
A Choleski method that effectively exploits key
architectural features of the CRAY-2 and CRAY
Y-MP computers has been described. This method
is well suited for parallel processing, and several
27
Table5. Forward-and-BackwardSolutionTimesfor ThreeExampleProblems
CPUtime Numberof Rate, Factortime,b
Problem Method a (second) operations MFLOPS percent
NASA Langley CRAY-2 computer runs
LL1S 0.27 20 610 072
HSCT LL6S .19 20 610 072
LL6SZ .17 16 210 224
Panel
SRB
LL1S
LL6S
LL6SZ
LL1S
LL6S
LL6SZ
0.45
.37
.38
0.86
.71
.68
51583442
51583442
51583442
77979552
77979552
77494 584
NASA Ames CRAY Y-MP computer runs
76 4.3
107 3.1
94 2.7
114 0.9
138 .8
136 .8
91 3.5
110 2.9
105 2.7
HSCT
Panel
SRB
LL1S
LL6S
LL6SZ
LL1S
LL6S
LL6SZ
LL1S
LL6S
LL6SZ
0.14
.14
.12
0.24
.26
.25
0.50
.50
.49
20610072
20 61O072
16 210 224
51583442
51583.442
51583442
77979552
77979552
71494 584
147
147
94
215
198
2O6
156
156
146
2.5
2.5
2.2
0.5
.6
.5
2.3
2.3
2.2
aLLIS: forward-backward solution, no loop unrolling.
LL6S: forward-backward solution, loop-unrolling level 6
LL6SZ: forward-backward solution, loop-unrolling level 6 with zero checking.
bpercentage calculation uses CPU time for VECTOR method.
implementations of the PVBAND method have been
presented comparing the use of macrotasking, auto-
tasking, and microtasking. The differences between
implementing the PVBAND method with a fixed
number of t_ks (i.e., as in using FORCE) and im-
plementing the PVBAND method using processors
as they become available at the time of program ex-
ecution were discussed.
Three structural analysis example problems were
used to compare the parallel implementations of the
PVBAND Choleski method. The example prob-
lems were all generated using the COMET finite
element software system and are representative of
a wide class of large.scale problems for which the
variable-band Choleski method is very efficient. The
PVBAND implementations that use autotasking and
microtasking were shown to bc more effective on
the CRAY multiprocessor supercomputers than the
FORCE implementations that used macrotasking.
Dedicated single-user runs on both the CRAY-2
and CRAY Y-MP computers demonstrated maxi-
mum computation rates of over 250 MFLOPS per
processor and maximum rates of over 1 GIGAFLOP
on the CRAY-2 using four processors and over
2 GIGAFLOPS on the CRAY Y-MP using eight pro-
cessors. Parallel spcedups for the PVBAND imple-
mentations compared favorably with maximum the-
oretical speedups in all cases except for the FORCE
implementations on the CRAY-2. Though com-
putation rates and parallel speedups were slightly
higher on the CRAY Y-MP than on the CRAY-2,
the VBAND and PVBAND implementations are ef-
ficient on both computers.
The results presented for both dedicated and mul-
tiuser environments indicate that parallel process-
ing is effective in both cases in reducing wail clock
time. The cost of parallel processing runs will al-
ways be higher if based on total CPU (central pro-
cessing unit) time, as indicated by increased CPU
times for the PVBAND implementations compared
with the VBAND implementations. The comparison
of timing results for autotasking and microtasking
implementations with the FORCE implementations
demonstrates the importance of dynamically using
available processors in a multiuser environment.
28
Appendix
VBAND and PVBAND FORTRAN
Listings
In this appendix,FORTRANlistingsof several
subroutinesusedto implementthe VBAND and
PVBANDmethodsaregivenandbrieflydescribed.
FiguresA1 andA2 containthe listingsfor twosub-
routinesthat areusedto convertasymmetric,sparse
matrix datastructureto the variable-bandmatrix
datastructureusedfor theVBANDandPVBAND
Choleskimethods.FiguresA3 andA4 containlist-
ingsfor the autotaskedand FORCEimplementa-
tions of the PVBAND method,respectively.The
twoPVBANDimplementationsillustratethetwoap-
proachesdescribedin section3 underthesubsection
"CRAY-2localmemoryconsiderations."FigureA5
containsalistingof theforward-and-backwardsolu-
tionsusedin this study. Eachfigureincludesline
numberswith commentsto explainmajor sections
andkeyvariableswithin eachsubroutine.A brief
descriptionof eachsubroutinefollows.
Subroutinesp2vb, shown in figure A1, accepts as
input the sparse matrix pointer arrays and computes
new pointer arrays for the variable-band data struc-
ture. The sparse matrix pointer arrays contain the
number of nonzero coefficients in each column of the
lower triangular matrix as well as an integer row in-
dex for each nonzero coefficient. The variable-band
pointer arrays contain the lengths of each column in
the lower triangular part of the variable-band ma-
trix and the starting position of each column in a
single-dimensioned array. In addition, the row in-
dex values from the sparse data structure are con-
verted to offsets that give the location of each nonzero
coefficient from the input sparse matrix within the
variable-band matrix array. The column lengths for
the variable-band matrix data structure are initially
computed in lines 56 61 in figure A1. The initial
lengths are increased where necessary in lines 67 69
to account for possible additional fill-in during factor-
ization. A second adjustment is made to the column
lengths in lines 77-92 to allow for loop unrolling. Af-
ter adjustments to the column lengths, the starting
position of each column within a single-dimensioned
array is determined in lines 103 105. Finally, the
row index pointers for each nonzero coefficient, stored
in array ia(), are changed to location pointers in
lines 113 118. These location pointers give the po-
sition of each nonzero coefficient within the single-
dimensioned variable-band matrix array. Subrou-
tine sp2vb also determines the amount of memory
required for the variable-band matrix array and re-
turns that value along with the maximum bandwidth
and average bandwidth as arguments.
Subroutine convert, shown in figure A2, is called
after subroutine sp2vb to build the single-dimensioned
array used to store the variable-band matrix. The
array is initially filled with zeros in lines 29 and 30
in figure A2. Then, in lincs 34 41, the nonzero co-
efficients from the sparse matrix storage array are
assigned to the variable-band matrix array as deter-
mined by the locations stored in array ia(). Though
initially many zeros are stored in the variable-band
array, many are changed to nonzero values during
matrix factorization. The use of two subroutines al-
lows for efficient dynamic memory allocation of the
variable-band matrix array after the exact length of
the array is determined by subroutine sp2vb.
Subroutine pvband, shown in figure A3, is the au-
totasked version of method PVBAND used to ob-
tain the results presented in this paper. The sub-
routine uses loop unrolling, locM memory (on the
CRAY-2), and zero checking as described in section 3
along with autotasking compiler directives to imple-
ment a fast, efficient parallel/vector version of the
PVBAND method. The FORTRAN listing of the se-
quential vectorized version of the VBAND method is
not given in this appendix, but it is easily obtained
from the listing in figure A3 by deleting both the
parallel directives and the sections of code used by
additional processes to copy completed columns into
local memory. For example, lines 53 61 copy a com-
pleted column of the variable-band matrix into local
memory and are not required for the sequential FOR-
TRAN version. Variable idone is used to facilitate
the copying of columns into local memory in the au-
totasking implementation and is not required in the
sequential FORTRAN version. In figure A3, lines 33
244 correspond to the barrier region in the PVBAND
method described in section 3. Each of six columns,
k through k + 5, are completed within an autotask-
ing control structure (case directive). This sequen-
tial portion of the algorithm accounts for a very small
percentage of the total computations. (See table 4.)
Lines 253-284 comprise the parallel portion of the
PVBAND method. The outer loop iterations of DO
loop 61 (line 254) are completed in parallel, each per-
forming a column update using the six-term SAXPY
shown in lines 274 279. Thc zero-checking option
(lines 257-271) skips the SAXPY update whenever
all six scalar multipliers in DO loop 610 are all equal
to zero.
The FORCE subroutine Forcesub pvband, shown
in figure A4, illustrates the use of the FORCE lan-
guage syntax to implement the PVBAND method.
The variable NPROC in line 1 of figure A4 is the
29
fixednumberof tasks,determinedat run time,that
executestheparallelsubroutine.VariableME,also
in line l, is tile taskidentificationvariableandwill
havea uniquevaluebetween1 and NPROC.In
theFORCEvariabledeclarationportionendingwith
line 13,all variablesusedwithin subroutinepvband
are declared private and all argument variables are
declared as shared or private in the calling program.
The sequential portion of the PVBAND method in
lines 40 159 is computed within a single FORCE bar-
rier. The six columns computed within the barrier
region by only one task are copied into the local mem-
ory of all remaining tasks in lines 162-179. The par-
allel update portion in lines 183-222 is implemented
by the FORCE Presched DO statement in line 190.
The FORCE subroutine in figure A4 will run as is on
any parallel computer for which FORCE is installed.
Figure A5 contains the FORTRAN listing of the
forward and backward solutions used for the VBAND
and PVBAND methods. The subroutine in figure A5
uses loop unrolling and zero checking. The main
computation in the forward solution (lines 17-130)
is the six-term SAXPY loop (lines 76-84). The
main computations in the backward solution (lines
132-253) are the six inner product computations
(lines 217-224). The remaining computations are
sequential computations.
3O
1:
2:
3:
4:
5:
6:
7:
8:
9:
10:
11:
12:
13:
14:
15:
16:
17:
18:
19:
20:
21:
22:
23:
24:
25:
26:
27:
28:
29:
30:
31:
32:
33:
34:
35:
36:
37:
38:
39:
40:
41:
42:
43:
44"
45:
46:
subroutine sp2vb(spptr,splen,ia,n,nroll,vbptr,vblen,
+ ibw,iavbw,ialth)
integer spptr(*),splen(*),ia(*),row,colm
integer vbptr(*),vblen(*)
c
c
c
c
c
c
c.
c
c
c
c
c
c
c
c
c
c
c
c
c
c
c
c
c
c
c
c
c
c
c
c
c
c
c
c
c
c
c *** This routine is called to create pointer arrays which are
c *** subsequently used by routine convert to convert an input
*** matrix stored in a sparse data structure to a variable-band
*** matrix data structure. Major steps are: 1) Create arrays vblen and
*** vbptr, 2) Change row index values stored for each lower triangular
*** coefficient to location pointers for each coefficient.
*** INPUT ARGUMENTS :
*** spptr() - starting position for each column of sparse
*** matrix within a single-dimensioned array
*** splen() - number of nonzero coefficients in each column of
*** sparse matrix
*** ia() - row index for each nonzero below the main diagonal
*** in input sparse matrix. The row index values are
*** arranged in the same order as the nonzero
*** coefficients; by columns.
*** n - number of equations in the sparse matrix
*** nroll - level of loop unrolling used in the factorization
*** routine (nroll > I requires adjustments to column
*** lengths)
***
*** OUTPUT AKGUMENTS :
*** ia() - return values in this array contain an offset for
*** each nonzero coefficient in sparse matrix which
*** will be used to expand the sparse storage into
*** variable-band storage.
*** vbbtr() - starting position for each column of variable-band
*** matrix within a single-dimensioned array
*** vblen() - length of each column of variable-band matrix;
*** includes diagonal and allows for zeros down to the
*** last nonzero coefficient in each column plus
*** possible adjustments for fill past the last nonzero
*** and loop-unrolling adjustments.
*** ibw - maximum semibandwidth (includes diagonal) after
*** all adjustments.
*** iavbw - average semibandwidth (total storage required
*** divided by number of equations)
*** ialth - total memory required for variable-band matrix
*** coefficient array, includes diagonal
Figure A1. FORTRAN listing for subroutine sp2bv,
31
47: c *** Compute the lengths of each variable-band column
48: c *** from the main diagonal to the last nonzero coefficient
49: c *** in the lower triangular part of the matrix. The main
50: c *** diagonal is part of the column length.
51: c *** Loop i assumes that nonzero coefficients within each
52: c *** column of sparse input data structure are ordered by increasing
53: c *** row indexes. Pointer irx points to the last nonzero row index
54: c *** in each column.
55:
56: vblen(n)=l
57: do 1 colm=l,n-1
58: irx=spptr(colm+l)-i
59: row=ia(irx)
60: vblen(colm)=row-colm+l
61: 1 continue
62:
63: c *** Adjust column lengths for possible fill during factorization;
64: c *** each column must extend at least to the last row
65: c *** of the previous column.
66:
67: do 2 colm=2,n
68: vblen(colm)=max(vblen(colm-l)-l,vblen(colm))
69: 2 continue
70:
71: c *** Adjust for loop unrolling if nroll > 1 ;
72: c *** groups of nroll columns must end in the same row.
73: c *** Zeros are effectively added to the ends of columns
74: c *** as required to enforce this condition by increasing
75: c *** the length of the column (vblen(colm)).
76:
77:
78:
79:
80:
81:
82:
83:
84:
3O
if (nroll.gt.l) then
lastc=n-mod(n,nroll)
do 30 colm=nroll,lastc,nroll
lth=vblen(colm)
do 30 i=1,nroll-1
vblen(colm-i)=Ith+i
continue
85: c *** Fix up the end columns whenever n mod nroll <> 0
86: c *** The remainder columns are full.
87:
88: iadd=2
89: do 31 row=n-1,1astc+1,-i
90: vblen(row)=iadd
91: 31 iadd=iadd+l
92: end if
Figure A1. Continued.
32
93:
94: c *** Add up the column lengths and form vbptr array;
95: c *** also compute maximum semibandwidth (including diagonal).
96:
97:
98:
99:
100:
101: 40
102:
103:
104:
105: 41
106:
107:
108: c ***
109: c ***
110: c ***
111: c ***
112:
113:
114:
115:
116:
117:
118: 51
119:
120:
121:
ialth=O
ibw=O
do 40 colm=l,n
ibw=max(ibw,vblen(colm))
ialth=ialth+vblen(colm)
vbptr(1)=l
do 41 colm=2,n
vbptr(colm)=vbptr(colm-1)+vblen(colm-1)
iavbw=ialth/n
Change the row index values from
the sparse column storage in ia() to the
locations where each nonzero coefficient is stored
within the variable-band matrix single-dimensioned array.
do 51 colm=1,n-1
do 51 k=spptr(colm),spptr(colm+l)-I
row = ia(k)
loc = vbptr(colm) + row - colm
ia(k) = loc
continue
return
end
Figure A1. Concluded.
33
i:
2:
3:
4:
5:
6:
7:
8:
9:
10:
11:
12:
13:
14:
15:
16:
17:
18:
19:
20:
21:
22:
23:
24:
25:
26:
27:
28:
29:
30:
31:
32:
33:
34:
35:
36:
37:
38:
39:
40:
41:
42:
43:
44:
subroutine convert(coefs,diag,ia,vbptr,ncoef,ialth,n,a)
integer ia(*),row,colm
integer vbptr(*)
real a(*),coefs(*),diag(*)
C _**
C _@_
C _*_
C *_
C @_
C _**
C _
C )))
C )))
C *_
C _*
C *_
C )))
C _*)
C )_)
C _
C *_
This routine converts sparse data structure into variable-band
data structure. Routine sp2vb must be called first to
create input array vbptr() and modify array ia(*).
INPUT ARGUMENTS:
coefs() - single-dimensioned array containing nonzero
coefficients of sparse matrix, arranged by columns,
lower triangular coefficients only
diag() - main diagonal of sparse input matrix
ia() - for each coefficient in array coefs() contains an offset
into variable-band matrix array, a()
vbptr() - starting position of each column in the variable-band
matrix coefficient array a().
number of nonzero in array coefs()
length of the variable-band array
number of equations in sparse and variable-band matrices
ncoef -
ialth -
n
OUTPUT ARGUMENTS:
a() - single-dimensioned array containing the coefficients
of the variable-band output matrix including zeros
added within columns as appropriate.
c *** Zero out space for vband storage of coefficients.
do i0 i=l,ialth
10 a(i)=O.O
c *** Assign nonzero coefficients to locations in variable-band
c *** matrix array as designated by array ia().
do 20 i=l,ncoef
20 a(ia(i))=coefs(i)
c *** Place main diagonal coefficients at beginning of each column
c *** in variable-band array.
do 30 colm=l,n
a(vbptr(colm))=diag(colm)
30 continue
return
end
Figure A2. FORTRAN listing for subroutine convert.
34
I:
2:
3:
4:
5:
6:
7:
8:
9:
10:
11:
12:
1,3:
14:
15:
16:
17:
18:
19:
20:
21:
22:
23:
24:
25:
26:
27:
28:
29:
30:
31:
32:
33:
34:
35:
36:
37:
38:
39:
40:
41:
42:
43:
44:
subroutine pvband(a,alth,colptr,collth,n,nopsf)
integer k,i,j,lastr,lth,tl,iadd
integer istl,ist2,ist3,ist4,ist5,ist6
integer ii,i2,i3,i4,i5,i6
integer n,alth,nopsf
integer colptr(n),collth(n)
real a(alth)
parameter (Imsize=9600)
parameter (16=imsize/6)
real ml(16),m2(16),m3(16),m4(16),m5(16),m6(16)
common /Imem/ idone,ml,m2,mS,m4,m5,m6
cdir$ regfile imem
cmic$ parallel shared(a,alth,colptr,collth,n,nopsf) private(k,i,j,
cmic$1 lastr,lth,tl,iadd,istl,ist2,ist3,ist4,ist5,ist6,
cmic$2 il,i2,i3,i4,i5,i6,idone,ml,m2,m3,m4,m5,m6,ict,itst)
C ***
C ***
C ***
C ***
C ***
C ***
C ***
C ***
Variable idone is a private variable and is used to
determine which processors must copy multipliers into
local memory. The processor which computes a multiplier
already has that column in local memory and proceeds
to the next code section, skipping the copy segment.
idone=O
k=l
Counting variable nopsf is shared and must be zeroed prior
to calling this routine. The variable is used to count
only serial operations.
999 continue
C ***
C ***
C ***
Begin jki portion: Complete 6 columns of L; k through k+5.
The 6 columns are used as multiplier columns in the
update section.
C _ Finish the kth multiplier.
istl=colptr(k)
il=istl+l
cmic$
cdir$
case
a(istl)=l.O/sqrt(a(istl))
tl=l
ivdep
Figure A3. FORTRAN listing for autotasked subroutine pvband.
35
45: do i0 i=il,il+collth(k)-2
46: a(i)=a(i)*a(istl)
47: ml(tl)=a(i)
48: tl=tl+l
49: i0 continue
50: idone=k
51: cmic$ end case
52:
53: c *** Everyone else copy the kth column into local memory.
54:
55:
56:
57:
58:
59:
60:
61:
62:
63:
64:
65:
66:
67:
68:
69:
70:
71:
72:
73:
74:
75:
76:
77:
78:
79:
80:
81:
82:
83:
84:
85:
86:
111
if (idone.lt.k) then
tl=l
do 111 i=il,il+collth(k)-2
ml(tl)=a(i)
tl=tl+l
continue
idone=k
end if
C **_ Update, then finish column k+l.
ist2=colptr(k+l)
i2=ist2+l
cmic$ case
tl=l
ict=O
cdir$ ±vdep
do 11 i=ist2,ist2+collth(k+l)-I
a(i)=a(i)-ml(tl)*mi(tl+ict)
ict=ict+1
11 continue
c *** Finish the (k+l)st multiplier.
a(ist2)=l.O/sqrt(a(ist2))
t1=2
cdir$ ivdep
do 20 i=i2,i2+collth(k+1)-2
a(i)=a(i)*a(ist2)
m2(tl)=a(i)
tl=tl+1
20 continue
idone=k+1
cmic$ end case
Figure A3. Continued.
36
87:
88: c *** Everyone else copy column k+l into local memory.
89:
90:
91:
92:
93:
94: 211
95:
96:
97:
98: c ***
99:
100:
101:
102:
103:
104:
105:
106:
107:
108:
109:
ii0
Iii
112
113
114
115:
116:
117:
118:
119: 30
120:
121
122:
123:
124:
125:
126:
127:
128:
129: 311
130:
131:
132:
133: c ***
if (idone.lt.k+l) then
t1=2
do 211 i=i2,i2+collth(k+l)-2
m2(tl) =a(i)
tl=t1+1
continue
idone=k+l
end if
Update, then finish column k+2.
ist3=colptr(k+2)
i3=ist3+1
cmic$ case
t1=2
ict=O
cdir$ ivdep
do 21 i=ist3,ist3+collth(k+2)-I
a(i)=a(i)-ml(tl)*ml(t1+ict)-m2(tl)*m2(tl+ict)
ict=ict+l
21 continue
: c *** Finish the (k+2)nd multiplier.
: a(ist3)=l.0/sqrt(a(ist3))
: t1=3
: cdir$ ivdep
do 30 i=i3,i3+collth(k+2)-2
a(i)=a(i)*a(ist3)
m3(tl)=a(i)
t1=tl+l
continue
idone=k+2
: cmic$ end case
c *** Everyone else copy the column k+2 into local memory.
if (idone.lt.k+2) then
t1=3
do 311 i=i3,i3+collth(k+2)-2
m3(tl)=a(i)
t1=tl+1
continue
idone=k+2
endif
Update, then finish column k+3.
Figure A3. Continued.
37
ist4=colptr(k+3)
i4=ist4+l
case
t1=3
ict=O
ivdep
do 31 i=ist4,ist4+collth(k+3)-i
a(i)=a(i)-ml(tl)*ml(tl+ict)-m2(tl)*m2(tl+ict)-m3(tl)*mS(tl+ict)
ict=ict+l
continue
134:
135:
136:
137: cmic$
138:
139:
140: cdir$
141:
142:
143:
144: 31
145:
146: c *** Finish the (k+3)rd multiplier.
147: a(ist4)=l.O/sqrt(a(ist4))
148: ti=4
149: cdir$ ivdep
150: do 40 i=i4,i4+collth(k+3)-2
151: a(i)=a(i)*a(ist4)
152: m4(tl)=a(i)
153: tl=tl+l
154: 40 continue
155: idone=k+3
156: cmic$ end case
157:
158: c *** Everyone else copy the column k+3 into local memory.
159:
160:
161:
162:
163:
164:
165:
166:
167:
168:
169:
170:
171:
172:
173:
174:
175:
176:
177:
178:
179:
180:
181:
411
if (idone.lt.k+3) then
t1=4
do 411 i=i4,i4+collth(k+3)-2
m4(tl)=a(i)
tl=tl+l
continue
idone=k+3
endif
C _ Update, then finish column k+4.
ist5=colptr(k+4)
i5=ist5+l
cmic$ case
tl=4
ict=O
cdir$ ivdep
do 41 i=ist5,ist5+collth(k+4)-i
a(i)=a(i)-m1(tl)*ml(tl+ict)-m2(tl)*m2(tl+ict)-m3(tl)*m3(t1+ict)
+ -m4(tl)*m4(t1+ict)
ict=ict+l
41 continue
Figure A3. Continued.
38
182:
183:
184:
185:
186:
187:
188:
189:
190:
191:
192:
193:
194:
195:
196:
197:
198:
199:
200:
201:
202:
203:
204:
205:
206:
207:
208:
209:
210:
211
212
213
214
215
216
217
218
219:
220:
221:
222:
223:
224:
225:
226:
c *** Finish the (k+4)th multiplier.
a(ist5)=l.0/sqrt(a(ist5))
tl=5
cdir$ ivdep
do 50 i=i5,i5+collth(k+4)-2
a(i)=a(i)*a(ist5)
mS(tl)=a(i)
tl=tl+l
50 continue
idone=k+4
cmic$ end case
C ***
511
Everyone else copy the column k+4 into local memory.
if (idone.lt.k+4) then
t1=5
do 511 i=i5,i5+collth(k+4)-2
m5(tl)=a(i)
tl=tl+l
continue
idone=k+4
endif
C *** Update, then finish column k+5.
ist6=colptr(k+5)
i6=ist6
cmic$
: cdir$
: +
: 51
: C ***
case
tl=5
ict=O
ivdep
do 51 i=ist6,ist6+collth(k+5)-i
a(i) =a(i) -ml (t i) *ml (t l+ict)-m2 (t i) *m2 (t l+ict) -m3 (t l)*m3 (t l+ict)
-m4(tl)*m4(tl+ict)-m5(tl)*m5(tl+ict)
ict=ict+l
continue
Finish the (k+5)th multiplier.
a(ist6)=l.O/sqrt(a(ist6))
tl=6
cdir$ ivdep
do 60 i=ist6+l,ist6+collth(k+5)-I
a(i)=a(i)*a(ist6)
m6(tl)=a(i)
tl=tl+1
60 continue
Figure A3. Continued.
39
227: idone=k+5
228:
229: c *** Count all floating point vector operations for finishing
230: c *** this group of 6 multiplier columns.
231: nopsf=nopsf+collth(k+5)*36 + 55
232: cmic$ end case
233:
234: c *** Everyone else copy column k+5 into local memory.
235: if (idone.lt.k+5) then
236: t1=6
237: do 6111 i=ist6+l,ist6+collth(k+5)-I
238: m6(tl)=a(i)
239: t1=t1+1
240: 6111 continue
241: idone=k+5
242: endif
243:
244: c *** End jki portion: 6 columns of L completed.
245:
246: c *** Begin kji portion: update columns k+6 thru lastr
247: c *** of matrix using the 6 completed multiplier columns.
248:
249: c *** Variable lastr corresponds to the last row in column k
250: c *** that is stored in the variable-band format.
251: lastr=k+collth(k)-i
252:
253: cmic$ do parallel
254: do 61 j=k+6,1astr
255: tl=j-k
256: itst=tl
257: c *** Begin full zero checking
258: if (ml(itst).eq.O.O) then
259: if (m2(itst).eq.O.O) then
260: if (m3(itst).eq.O.O) then
261: if (m4(itst).eq.O.O) then
262: if (m5(itst).eq.O.O) then
263: if (m6(itst).eq.O.O) then
264: goto 611
265: endif
266: endif
267: endif
268: endif
269: endif
270: endif
271: c *** End zero checking.
272: lth=lastr-j
Figure A3. Continued.
4O
273:
274:
275:
276:
277:
278:
279:
280:
281:
282:
283:
284:
285:
286:
287:
288:
289:
290:
291:
292:
293:
294:
295:
296:
297:
298:
299:
300:
301:
302:
303:
304:
305:
306:
307:
308:
309:
310:
311:
312:
313:
314:
315:
316:
317:
318:
319:
cdir$ ivdep
do 610 i=colptr(j),colptr(j)+ith
a(i)=a(i)-ml(itst)*ml(tl)-m2(itst)*m2(tl)
+ -m3 (it st) *m3 (t i) -m4 (it st) *m4 (t I)
+ -m5 (it st )*m5 (t 1)-m6 (i tst) *m6 (t i)
t1=t1+1
610 continue
611 continue
61 continue
c *** End kji portion: increment k and begin next jkl portion
c *** if k is less than n-6.
k=k+6
if (k.lt.n-6) goto 999
c *** Finish any remaining columns of L. There are up to 7
c *** columns remaining. The last 7 columns are explicitly
c *** computed without loop overhead, etc.
cmic$ case
goto (100,110,120,130,140,150,160),(n-k+1)
continue
6 columns left to update
istl=colptr(k)
a(istl)=1.0/sqrt(a(istl))
a(istl+l)=a(ist1+l)*a(istl)
a(istl+2)=a(ist1+2)*a(istl)
a(istl+3)=a(ist1+3)*a(istl)
a(istl+4)=a(ist1+4)*a(istl)
a(istl+5)=a(ist1+5)*a(istl)
a(istl+6)=a(ist1+6)*a(istl)
ist2=colptr(k+1)
a(ist2)=a(ist2)-a(istl+l)*a(ist1+l)
a(ist2+l)=a(ist2+l)-a(ist1+l)*a(istl+2)
a(ist2+2)=a(ist2+2)-a(istl+l)*a(istl+3)
a(ist2+3)=a(ist2+3)-a(ist1+l)*a(istl+4)
a(ist2+4)=a(ist2+4)-a(istl+1)*a(istl+5)
a(ist2+5)=a(ist2+5)-a(ist1+l)*a(istl+6)
ist3=colptr(k+2)
a(ist3)=a(ist3)-a(istl+2)*a(ist1+2)
a(ist3+l)=a(ist3+l)-a(istl+2)*a(istl+3)
a(ist3+2)=a(ist3+2)-a(istl+2)*a(istl+4)
a(ist3+3)=a(ist3+3)-a(ist1+2)*a(istl+5)
a(ist3+4)=a(ist3+4)-a(ist1+2)*a(istl+6)
Figure A3. Continued.
41
320:
321:
322:
323:
324:
325:
326:
327:
328:
329:
330:
331:
332:
333:
334:
335:
336:
337:
338:
339:
340:
341:
342:
343:
344:
345:
346:
347:
348:
349:
350:
351:
352:
353:
354:
355:
356:
357:
358:
359:
360:
361:
362:
363:
ist4=colptr(k+3)
a(ist4)=a(ist4)-a(istl+3)*a(istl+3)
a(ist4+l)=a(ist4+1)-a(istl+3)*a(istt+4)
a(ist4+2)=a(ist4+2)-a(istl+3)*a(istl+5)
a(ist4+3)=a(ist4+3)-a(istl+3)*a(istl+6)
ist5=colptr(k+4)
a(ist5)=a(ist5)-a(istl+4)*a(istl+4)
a(ist5+l)=a(ist5+l)-a(istl+4)*a(istl+5)
a(ist5+2)=a(ist5+2)-a(istl+4)*a(istl+6)
ist6=colptr(k+5)
a(ist6)=a(ist6)-a(istl+5)*a(istl+5)
a(ist6+l)=a(ist6+l)-a(istl+5)*a(istl+6)
a(colptr(k+6))=a(colptr(k+6))-a(ist1+6)*a(istl+6)
k=k+l
nopsf=nopsf+49
continue
5 columns left to update
ist1=colptr(k)
a(istl)=l.0/sqrt(a(istl))
a(istl+l)=a(ist1+l)*a(istl)
a(istl+2)=a(istl+2)*a(istl)
a(istl+3)=a(istl+3)*a(istl)
a(istl+4)=a(istl+4)*a(istl)
a(ist1+5)=a(istl+5)*a(istl)
ist2=colptr(k+l)
a(ist2)=a(ist2)-a(istl+l)*a(istl+l)
a(ist2+l)=a(ist2+l)-a(istl+l)*a(istl+2)
a(ist2+2)=a(ist2+2)-a(isti+l)*a(istl+3)
a(ist2+3)=a(ist2+3)-a(istl+l)*a(istl+4)
a(ist2+4)=a(ist2+4)-a(istl+l)*a(istl+5)
ist3=colptr(k+2)
a(ist3)=a(ist3)-a(ist1+2)*a(ist1+2)
a(ist3+l)=a(ist3+1)-a(istl+2)*a(ist1+3)
a(ist3+2)=a(ist3+2)-a(isti+2)*a(ist1+4)
a(ist3+3)=a(ist3+3)-a(istl+2)*a(istl+5)
ist4=colptr(k+3)
a(ist4)=a(ist4)-a(istl+3)*a(ist1+3)
a(ist4+l)=a(ist4+l)-a(istl+3)*a(istl+4)
a(ist4+2)=a(ist4+2)-a(istl+3)*a(ist1+5)
ist5=colptr(k+4)
a(ist5)=a(ist5)-a(ist1+4)*a(istl+4)
a(ist5+t)=a(ist5+l)-a(ist1+4)*a(istl+5)
a(colptr(k+5))=a(colptr(k+5))-a(istl+5)*a(istl+5)
r
r
Figure A3. Continued.
42
364:
365:
366:
367:
368:
369:
370:
371:
372:
373:
374:
375:
376:
377:
378:
379:
380:
381:
382:
383:
384:
385:
386:
387:
388:
389:
390:
391:
392:
393:
394:
395:
396:
397:
398:
399:
400:
401:
402:
403:
404:
405:
406:
407:
408:
k=k+i
nopsf=nopsf+36
continue
4 columns left to update
istl=colptr(k)
a(istl)=l.0/sqrt(a(istl))
a(istl+l)=a(istl+l)*a(istl)
a(istl+2)=a(istl+2)*a(istl)
a(istl+3)=a(istl+3)*a(istl)
a(ist1+4)=a(istl+4)*a(istl)
ist2=colptr(k+1)
a(ist2)=a(ist2)-a(istl+l)*a(istl+1)
a(ist2+l)=a(ist2+1)-a(istl+l)*a(ist1+2)
a(ist2+2)=a(ist2+2)-a(istl+1)*a(istl+3)
a(ist2+3)=a(ist2+3)-a(ist1+1)*a(istl+4)
ist3=colptr(k+2)
a(ist3)=a(ist3)-a(istl+2)*a(ist1+2)
a(ist3+l)=a(ist3+l)-a(ist1+2)*a(istl+3)
a(ist3+2)=a(ist3+2)-a(istl+2)*a(ist1+4)
ist4=colptr(k+3)
a(ist4)=a(ist4)-a(ist1+3)*a(istl+3)
a(ist4+1)=a(ist4+l)-a(istl+3)*a(istl+4)
a(colptr(k+4))=a(colptr(k+4))-a(istl+4)*a(istl+4)
k=k+l
nopsf=nopsf+25
continue
3 columns left to update
istl=colptr(k)
a(istl)=l.O/sqrt(a(istl))
a(istl+l)=a(istl+1)*a(istl)
a(ist1+2)=a(istl+2)*a(istl)
a(istl+3)=a(istl+3)*a(istl)
ist2=colptr(k+l)
a(ist2)=a(ist2)-a(istl+l)*a(istl+l)
a(ist2+l)=a(ist2+l)-a(istl+l)*a(istl+2)
a(ist2+2)=a(ist2+2)-a(istl+l)*a(istl+3)
ist3=colptr(k+2)
a(ist3)=a(ist3)-a(istl+2)*a(istl+2)
a(ist3+l)=a(ist3+l)-a(istl+2)*a(istl+3)
a(colptr(k+3))=a(colptr(k+3))-a(istl+3)*a(istl+3)
k=k+l
nopsf=nopsf+16
Figure A3. Continued,
43
409: 120
410: c ***
411:
412:
413:
414:
415:
416:
417:
418:
419:
420:
421:
422: 110
423: c ***
424:
425:
426:
427:
428:
429:
430:
431: 100
432: c ***
433:
434:
435: cmic$
436: cmic$
437:
438:
439:
continue
2 columns left to update
istl=colptr(k)
a(istl)=1.0/sqrt(a(istl))
a(ist1+1)=a(istl+l)*a(istl)
a(istl+2)=a(ist1+2)*a(istl)
ist2=colptr(k+l)
a(ist2)=a(ist2)-a(istl+l)*a(istl+l)
a(ist2+l)=a(ist2+l)-a(istl+1)*a(ist1+2)
a(colptr(k+2))=a(colptr(k+2))-a(ist1+2)*a(istl+2)
k=k+l
nopsf=nopsf+9
continue
1 column left to update
istl=colptr(k)
a(istl)=l.0/sqrt(a(istl))
a(istl+l)=a(istl+l)*a(istl)
ist2=colptr(k+l)
a(ist2)=a(ist2)-a(istl+l)*a(istl+l)
nopsf=nopsf+4
continue
0 columns left to update
a(colptr(n))=l.0/sqrt(a(colptr(n)))
nopsf=nopsf+l
end case
end parallel
return
end
Figure A3. Concluded.
44
1:
2:
3:
4:
5:
6:
7:
8:
9:
10:
11:
12:
13:
14:
15:
16:
17:
18: c ***
19: c ***
20: c ***
21: c ***
22: c ***
23:
24:
25:
26: c ***
27: c ***
28:
29: 999
30:
31: c ***
32: c ***
33: c ***
34:
35: c ***
36:
37:
38:
39:
40:
41:
42: cdir$
43:
44:
45:
46:
47: i0
Forcesub pvband(a,alth,colptr,collth,n,nopsf) of NPKOC ident ME
Private integer k,i,j,lastr,lth,tl,iadd
Private integer istl,ist2,ist3,ist4,ist5,ist6
Private integer il,i2,i3,i4,i5,i6,ict,itst
integer n,alth,nopsf
integer colptr(n),collth(n)
real a(alth)
parameter (Imsize=9600)
parameter (16=Imsize/6)
Private real ml(16),m2(16),m3(16),m4(16),m5(16),m6(16)
common /Imem/ idone,ml,m2,m3,m4,m5,m6
End declarations
cdir$ regfile Imem
Variable idone is a private variable and is used to
determine which processors must copy multipliers into
local memory. The processor which computes a multiplier
already has that column in local memory and proceeds
to the next code section, skipping the copy segment.
idone=O
k=l
nopsf =0
Counting variable nopsf is private and is used to count
the number of operations used by each process.
continue
Begin jki portion: Complete 6 columns of L; k through k+5.
The 6 columns are used as multiplier columns in the
update section.
Finish the kth multiplier.
istl=colptr(k)
il=istl+l
Barrier
a(istl)=l.O/sqrt(a(istl))
tl=l
ivdep
do 10 i=il,il+collth(k)-2
a(i)=a(i)*a(istl)
ml(tl)=a(i)
tl=t1+1
continue
Figure A4. FORTRAN listing for FORCE subroutine pvband.
45
48:
49:
50:
51:
52:
53:
54:
55:
56:
57:
58:
59:
60:
61:
62:
63:
64:
65:
66:
67:
68:
69:
70:
71:
72:
73:
74:
75:
76:
77:
78:
79:
80:
81:
82:
83:
84:
85:
86:
87:
88:
89:
90:
91:
92:
93:
94:
95:
c *** Update, then finish column k+l.
ist2=colptr(k+l)
i2=ist2+l
tl=l
ict=O
cdir$ ivdep
do Ii i=ist2,ist2+collth(k+l)-i
a(i)=a(i)-ml(tl)*ml(tl+ict)
ict=ict+l
Ii continue
c *** Finish the (k+l)st multiplier.
a(ist2)=l.O/sqrt(a(ist2))
tl=2
cdir$ ivdep
do 20 i=i2,i2+collth(k+l)-2
a(i)=a(i)*a(ist2)
m2(tl)=a(i)
tl=tl+l
20 continue
c *** Update, then finish column k+2.
ist3=colptr(k+2)
i3=ist3+l
tl=2
ict=O
cdir$ ivdep
do 21 i=ist3,ist3+collth(k+2)-I
a(i)=a(i)-ml(tl)*ml(tl+ict)-m2(tl)*m2(tl+ict)
ict=ict+l
21 continue
c *** Finish the (k+2)nd multiplier.
a(ist3)=l.O/sqrt(a(ist3))
t1=3
cdir$ ivdep
do 30 i=i3,i3+collth(k+2)-2
a(i)=a(i)*a(ist3)
m3(tl)=a(i)
tl=tl+l
30 continue
C _ Update, then finish column k+3.
ist4=colptr(k+3)
i4=ist4+l
tl=3
ict=O
Figure A4. Continued.
46
96:
97:
98:
99:
100:
101:
102:
103:
104:
105:
106:
107:
108:
109:
110:
111:
112:
113:
114:
115:
116:
117:
118:
119:
120:
121:
122:
123:
124:
125:
126:
127:
128:
129:
130:
131:
132:
133:
134:
135:
136:
137:
138:
139:
140:
141:
cdir$ ivdep
do 31 i=ist4,ist4+collth(k+3)-i
a(i)=a(i)-ml(tl)*ml(tl+ict)-m2(tl)*m2(tl+ict)-m3(tl)*m3(tl+ict)
ict=ict+l
31 continue
c *** Finish the (k+3)rd multiplier.
a(ist4)=l.O/sqrt(a(ist4))
tl=4
cdir\$ ivdep
do 40 i=i4,i4+collth(k+3)-2
a(i)=a(i)*a(ist4)
m4(tl)=a(i)
tl=tl+l
40 continue
C ***
cdirk$
4-
41
Update, then finish column k+4.
ist5=colptr(k+4)
i5=ist5+l
t1=4
ict=O
ivdep
do 41 i=ist5,ist5+collth(k+4)-i
a(i)=a(i)-ml(tl)*ml(tl+ict)-m2(tl)*m2(tl+ict)-m3(tl)*m3(tl+ict)
-m4(tl)*m4(tl+ict)
ict=ict+l
continue
c *** Finish the (k+4)th multiplier.
a(ist5)=l.0/sqrt(a(ist5))
tl=5
cdir\$ ivdep
do 50 i=i5,i5+collth(k+4)-2
a(i)=a(i)*a(ist5)
m5 (tl)=a(i)
tl=tl+l
50 continue
C ***
cdir\$
Update, then finish column k+5.
ist6=colptr(k+5)
i6=ist6
t1=5
ict=O
ivdep
do 51 i=ist6,ist6+collth(k+5)-I
a(i)=a(i)-ml(tl)*ml(tl+ict)-m2(tl)*m2(tl+ict)-m3(tl)*m3(tl+ict)
Figure A4. Continued.
47
142:
143:
144: 51
145:
146:
147:
i48:
i49:
i50:
i5i:
152:
153:
154: 60
i55:
156:
157: c ***
158: c ***
159:
160:
161:
162: c ***
163:
164:
165:
166:
167:
168:
169:
170:
171:
172:
173:
174:
175:
176:
177: 6111
178:
179:
180:
181: c ***
182:
183: c ***
184: c ***
185:
186: c ***
187: c ***
188:
-m4(tl)*m4(tl+ict)-m5(tl)*m5(tl+ict)
ict=ict+l
continue
c *** Finish the (k+5)th multiplier.
a(ist6)=l.O/sqrt(a(ist6))
t1=6
cdir\$ ivdep
do 60 i=istS+l,ist6+collth(k+5)-1
a(i) =a(i) *a (ist6)
m6(tl)=a(i)
tl=tl+l
continue
idone=k+5
Count all floating point vector operations for finishing
this group of 6 multiplier columns.
nopsf=nopsf+collth(k+5)*36 + 55
End barrier
Everyone else copy columns k thru k+5 into local memory.
if (idone.lt.k+5) then
il=colptr(k)+5
i2=colptr(k+l)+4
i3=colptr(k+2)+3
i4=colptr(k+3)+2
i5=colptr(k+4)+l
i6=colptr(k+5)
do 6111 ict=1,collth(k+5)-i
ml(5+ict)=a(i1+ict)
m2(5+ict)=a(i2+ict)
m3(5+ict)=a(i3+ict)
m4(5+ict)=a(i4+ict)
m5(5+ict)=a(i5+ict)
m6(5+ict)=a(i6+ict)
continue
idone=k+5
endif
End jki portion: 6 columns of L completed.
Begin k3i portion: update columns k+6 thru lastr
of matrix using the 6 completed multiplier columns.
Variable lastr corresponds to the last row in column k
that is stored in the variable-band format.
lastr=k+collth(k)-I
Figure A4. Continued.
L_
48
189:
190:
191:
192:
193: c ***
194:
195:
196:
197:
198:
199:
200:
201:
202:
203:
204:
205:
206:
207: c ***
208:
209:
210:
211:
212:
213:_
214:
215:
216:
217:
218:
219: 611
220: 61
221:
222:
223:
224:
225:
226:
227:
228: c ***
229: c ***
230: c ***
231:
232:
233:
234:
Presched do 61 j=k+6,1astr
tl=j-k
itst=tl
Begin full zero checking
if (ml(itst).eq.O.O) then
if (m2(itst).eq.O.O) then
if (m3(itst).eq.O.O) then
if (m4(itst).eq.O.O) then
if (m5(itst).eq.O.O) then
if (m6(itst).eq.O.O) then
goto 611
endif
endif
endif
endif
endif
endif
End zero checking.
lth=lastr-j
cdir\$ ivdep
do 610 i=colptr(j),colptr(j)+lth
a(i)=a(i)-m1(itst)*ml(tl)-m2(itst)*m2(tl)
+ -m3(itst)*m3(tl)-m4(itst)*m4(tl)
- + -m5(itst)*m5(tl)-m6(itst)*m6(tl)
t1=tl+l
610 continue
c *** Count floating point vector operations in kji update loop
nopsf=nopsf+12*(Ith+l).
continue
End presched do
c *** End kji portion: increment k and begin next jki portion
c *** if k is less than n-6.
k=k+6
if (k.lt.n-6) goto 999
Finish any remaining columns of L. There are up to 7
columns remaining. The last 7 columns are explicitly
computed without loop overhead, etc.
Barrier
goto (lO0,110,120,130,140,150,160),(n-k+1)
Figure A4. Continued.
49
235:
236:
237:
238:
239:
240:
241:
242:
243:
244:
245:
246:
247:
248:
249:
250:
251:
252:
253:
254:
255:
256:
257:
258:
259:
260:
261:
262:
263:
264:
265:
266:
267:
268:
269:
270:
271:
272:
273:
274:
275:
276:
277:
278:
279:
160
C ***
150
C ***
continue
6 columns left to update
istl=colptr(k)
a(istl)=l.O/sqrt(a(istl))
a(istl+l)=a(istt+l)*a(istl)
a(istl+2)=a(istl+2)*a(istl)
a(istl+3)=a(istl+3)*a(istl)
a(istl+4)=a(istl+4)*a(istl)
a(istl+5)=a(istl+5)*a(istl)
a(istl+6)=a(ist1+6)*a(istl)
ist2=colptr(k+l)
a(ist2)=a(ist2)-a(istl+l)*a(istl+l)
a(ist2+l)=a(ist2+l)-a(istl+l)*a(istl+2)
a(ist2+2)=a(ist2+2)-a(istt+t)*a(istl+3)
a(ist2+3)=a(ist2+3)-a(istl+1)*a(istl+4)
a(ist2+4)=a(ist2+4)-a(istl+l)*a(istl+5)
a(ist2+5)=a(ist2+5)-a(istl+l)*a(ist1+6)
ist3=colptr(k+2)
a(ist3)=a(ist3)-a(istl+2)*a(ist1+2)
a(ist3+l)=a(ist3+l)-a(istl+2)*a(istl+3)
a(ist3+2)=a(ist3+2)-a(istl+2)*a(istl+4)
a(ist3+3)=a(ist3+3)-a(istl+2)*a(istl+5)
a(ist3+4)=a(ist3+4)-a(istl+2)*a(istl+6)
ist4=colptr(k+3)
a(ist4)=a(ist4)-a(istl+3)*a(istl+3)
a(ist4+l)=a(ist4+l)-a(istl+3)*a(istl+4)
a(ist4+2)=a(ist4+2)-a(istl+3)*a(ist1+5)
a(ist4+3)=a(ist4+3)-a(istl+3)*a(ist1+6)
ist5=colptr(k+4)
a(ist5)=a(ist5)-a(ist1+4)*a(istl+4)
a(ist5+i)=a(ist5+l)-a(istl+4)*a(ist1+5)
a(ist5+2)=a(ist5+2)-a(istl+4)*a(istl+6)
ist6=colptr(k+5)
a(ist6)=a(ist6)-a(istl+5)*a(istl+5)
a(ist6+l)=a(ist6+l)-a(ist1+5)*a(isti+6)
a(colptr(k+6))=a(colptr(k+6))-a(istl+6)*a(ist1+6)
k=k+l
nopsf=nopsf+49
continue
5 columns left to update
ist1=colptr(k)
a(istl)=l.O/sqrt(a(istl))
a(ist1+l)=a(istl+l)*a(istl)
a(ist1+2)=a(istl+2)*a(istl)
Figure A4. Continued.
50
280:
281:
282:
283:
284:
285:
286:
287:
288:
289:
290:
291:
292:
293:
294:
295:
296:
297:
298:
299:
300:
301:
302:
303:
304:
305:
306:
307:
308:
309:
310:
311:
312:
313:
314:
315:
316:
317:
318:
319:
320:
321:
322:
323:
324:
325:
326:
327:
328:
a(istl+3)=a(istl+3)*a(istl)
a(istl+4)=a(istl+4)*a(istl)
a(istl+5)=a(istl+5)*a(istl)
ist2=colptr(k+l)
a(ist2)=a(ist2)-a(istl+l)*a(istl+l)
a(ist2+l)=a(ist2+1)-a(istl+1)*a(istl+2)
a(ist2+2)=a(ist2+2)-a(ist1+l)*a(ist1+3)
a(ist2+3)=a(ist2+3)-a(istl+l)*a(istl+4)
a(ist2+4)=a(ist2+4)-a(istl+l)*a(ist1+5)
ist3=colptr(k+2)
a(ist3)=a(ist3)-a(istl+2)*a(istl+2)
a(ist3+l)=a(ist3+l)-a(istl+2)*a(ist1+3)
a(ist3+2)=a(ist3+2)-a(ist1+2)*a(istl+4)
a(ist3+3)=a(ist3+3)-a(istl+2)*a(ist1+5)
ist4=colptr(k+3)
a(ist4)=a(ist4)-a(istl+3)*a(istl+3)
a(ist4+1)=a(ist4+l)-a(ist1+3)*a(istl+4)
a(ist4+2)=a(ist4+2)-a(istl+3)*a(ist1+5)
ist5=colptr(k+4)
a(ist5)=a(ist5)-a(istl+4)*a(istl+4)
a(ist5+l)=a(ist5+l)-a(istl+4)*a(ist
a(colptr(k+5))=a(colptr(k+5))-a(ist
k=k+l
nopsf=nopsf+36
1+5)
l+5)*a(istl+5)
continue
4 columns left to update
istl=colptr(k)
a(istl)=l.O/sqrt(a(istl))
a(ist1+l)=a(istl+l)*a(istl)
a(istl+2)=a(istl+2)*a(istl)
a(istl+3)=a(istl+3)*a(istl)
a(istl+4)=a(istl+4)*a(istl)
ist2=colptr(k+l)
a(ist2)=a(ist2)-a(istl+l)*a(istl+l)
a(ist2+l)=a(ist2+1)-a(istl+l)*a(istl+2)
a(ist2+2)=a(ist2+2)-a(istl+l)*a(istl+3)
a(ist2+3)=a(ist2+3)-a(istl+l)*a(istl+4)
ist3=colptr(k+2)
a(ist3)=a(ist3)-a(istl+2)*a(istl+2)
a(ist3+l)=a(ist3+l)-a(istl+2)*a(istl+3)
a(ist3+2)=a(ist3+2)-a(istl+2)*a(istl+4)
ist4=colptr(k+3)
a(ist4)=a(ist4)-a(istl+3)*a(istl+3)
a(ist4+l)=a(ist4+1)-a(istl+3)*a(istl+4)
a(colptr(k+4))=a(colptr(k+4))-a(istl+4)*a(istl+4)
k=k+l
nopsf=nopsf+25
Figure A4. Continued.
51
329: 130
330: c ***
331:
332:
333:
334:
335:
336:
337:
338:
339:
340:
341:
342:
343:
344:
345:
346:
347: 120
348: c ***
349:
350:
351:
352:
353:
354:
355:
356:
357:
358:
359:
360: 110
361: c ***
362:
363:
364:
365:
366:
367:
368:
369: 100
370: c ***
371:
372:
373:
374:
375:
376:
continue
3 columns left to update
istl=colptr(k)
a(istl)=l.0/sqrt(a(istl))
a(istl+l)=a(ist1+l)*a(istl)
a(ist1+2)=a(istl+2)*a(istl)
a(istl+3)=a(istl+3)*a(istl)
ist2=colptr(k+l)
a(ist2)=a(ist2)-a(istl+1)*a(istl+1)
a(ist2+l)=a(ist2+1)-a(istl+1)*a(ist1+2)
a(ist2+2)=a(ist2+2)-a(ist1+1)*a(ist1+3)
ist3=colptr(k+2)
a(ist3)=a(ist3)-a(istl+2)*a(ist1+2)
a(istS+l)=a(istS+1)-a(istl+2)*a(ist1+3)
a(colptr(k+3))=a(colptr(k+3))-a(istl+3)*a(istl+3)
k=k+1
nopsf=nopsf+16
continue
2 columns left to update
ist1=colptr(k)
a(istl)=l.O/sqrt(a(istl))
a(istl+l)=a(istl+l)*a(istl)
a(ist1+2)=a(istl+2)*a(istl)
ist2=colptr(k+l)
a(ist2)=a(ist2)-a(ist1+1)*a(istl+l)
a(ist2+l)=a(ist2+l)-a(istl+l)*a(istl+2)
a(colptr(k+2))=a(colptr(k+2))-a(istl+2)*a(istl+2)
k=k+l
nopsf=nopsf+9
continue
1 column left to update
istl=colptr(k)
a(istl)=l.0/sqrt(a(istl))
a(istl+l)=a(istl+l)*a(istl)
ist2=colptr(k+l)
a(ist2)=a(ist2)-a(istl+l)*a(ist1+1)
nopsf=nopsf+4
continue
0 columns left to update
a(colptr(n))=l.0/sqrt(a(colptr(n)))
nopsf=nopsf+1
End barrier
return
end
Figure A4. Concluded.
52
i:
2:
3:
4:
5:
6:
7:
8:
9:
subroutine vbsolve(a,alth,colptr,collth,n,b,nopss)
integer alth,n,nopss
integer i,j,ict,lastr,left
integer tl,t2,t3,t4,t5,t6,t7
integer istl,ist2,istB,ist4,ist5,ist6
integer colptr(n),collth(n)
real a(alth),b(n),t
c *** This routine performs forward-and-backward substitution
I0: c *** of triangular factored matrices. It assumes a variable-band
ii: c *** data structure with adjustments for loop unrolling to level 6.
12: c *** NOTE: This solver assumes that the diagonal elements of
c *** the triangular systems are the reciprocals of the actual
c *** diagonal elements
nopss=O
13:
14:
15:
16:
17:
18:
19:
20:
21:
22:
23:
24:
25:
26:
27:
28
29
30:
31:
32:
33:
34:
35:
36:
37:
38:
39:
40:
41:
42:
c *** Solve Ly=b (column sweep)
left=mod(n,6)
last=n-left
do I i=1,1ast,6
istl=colptr(i)
ist2=colptr(i+l)
ist3=colptr(i+2)
ist4=colptr(i+3)
ist5=colptr(i+4)
: ist6=colptr(i+5)
: c *** Begln zero-checking option.
if (b(i).eq.O.O) then
if (b(i+l).eq.O.O) then
if (b(i+2).eq.O.O) then
if (b(i+3).eq.O.O) then
if (b(i+4).eq.O.O) then
if (b(i+5).eq.O.O) then
goto II
endif
endif
endif
endif
endif
endif
43: c *** End zero-checking option.
44: b(i)=b(i)*a(istl)
45: b(i+l)=(b(i+l)-a(istl+l)*b(i)
46: + )*a(ist2)
47: b(i+2)=(b(i+2)-a(istl+2)*b(i)
Figure A5. FORTRAN listing for subroutine vbsolve.
53
48:
49:
50:
51:
52:
53"
54:
55:
56:
57"
58:
59:
60:
61:
62:
63:
64:
65:
66"
67:
68:
69:
70:
71"
72:
73:
74:
75:
76:
77:
78:
79:
80"
81:
82:
83:
84:
85"
86:
87:
88:
89:
90:
91:
92:
93:
+
+
+
+
+
+
+
+
+
-a(ist2+l) *b(i+l)
)*a(ist3)
b (i+3) = (b(i+3) -a (ist 1+3) *b(i)
-a(ist2+2) *b(i+l)
-a(ist3+l) *b(i+2)
)*a(ist4)
b (i+4) = (b(i+4) -a (ist I+4) *b (i)
-a (ist 2+3) *b (i+l)
-a (ist 3+2) *b (i+2)
-a(ist4+l) *b(i+3)
)*a(ist5)
b (i+5) = (b(i+5)-a (ist l+5)*b (i)
-a(ist 2+4) *b (i+l)
-a(ist3+3)*b (i+2)
-a (ist4+2) *b (i+3)
-a (istS+l) *b (i+4)
)*a(ist6)
nopss=nopss+36
istl=istl+6
ist2=ist2+5
ist3=ist3+4
ist4=ist4+3
ist5=ist5+2
ist6=ist6+l
Ith=collth(i)-6
jrow=i+6
cdir\$ ivdep
do I0 ict=O,ith-I
b(3row)=b(jrow)-a(istl+ict)*b(i)
+ -a(ist2+ict)*b(i+l)
+ -a(ist3+ict)*b(i+2)
+ -a(ist4+ict)*b(i+3)
+ -a(ist5+ict)*b(i+4)
+ -a(ist6+ict)*b(i+5)
jrow=jrow+t
I0 continue
nopss=nopss+12*Ith
Ii continue
I continue
goto (lO0,101,102,103,104,105),(left+l)
105 c ont inue
i=n-4
istl=colptr (i)
Figure A5. Continued.
54
94:
95:
96:
97:
98:
99:
100:
101:
102:
103:
104:
105:
106:
107:
108:
109:
110:
111:
112:
113:
114:
115:
116:
117:
118:
119:
120:
121:
122:
123:
124:
125:
126:
127:
128:
129:
130:
131:
132:
133:
134:
135:
136:
137:
138:
139:
104
103
102
101
100
205
b(i)=b(i)*a(istl)
b(i+l)=b(i+l)-a(istl+l)*b(i)
b(i+2)=b(i+2)-a(istl+2)*b(i)
b(i+3)=b(i+3)-a(ist1+3)*b(i)
b(i+4)=b(i+4)-a(ist1+4)*b(i)
nopss=nopss+9
continue
i=n-3
ist l=colptr (i)
b(i):b(i) *a(istl)
b (i+l) =b (i+l)-a (ist 1+1)*b (i)
b (i+2) =b (i+2) -a (ist 1+2) *b (i)
b (1+3) =b (i +3) -a (i st 1+3) *b (i)
nopss=nopss+7
continue
i=n-2
istl=colptr (i)
b(i)=b(i)*a(istl)
b (i+l) =b(i+l)-a(ist l+l)*b (i)
b (i+2) =b (i+2) -a (ist 1+2) *b (i)
nopss=nopss+5
continue
i=n-i
ist1=colptr(i)
b(i)=b(i)*a(istl)
b(i+l)=b(i+l)-a(istl+l)*b(i)
nopss=nopss+3
continue
istl=colptr(n)
b(n)=b(n)*a(istl)
nopss=nopss+l
continue
Solve L(t)x=y
goto (200,201,202,203,204,205) left+l
continue
istl=colptr(n)
ist2=colptr(n-1)
ist3=colptr(n-2)
Figure AS. Continued.
55
140:
141:
142:
143:
144:
145:
146:
147:
148:
149 :
150:
151:
152:
153:
154:
155:
156:
157:
158:
159:
160:
161:
i62:
163:
164:
165:
166:
167:
i68:
169:
170:
i7i:
172:
173:
174:
i75:
i76:
177:
178:
i79:
180:
181:
182:
183:
184:
185:
186:
187:
204
ist4=colptr(n-3)
ist5=colptr(n-4)
b (n) =b (n) *a (ist 1 )
b (n- 1) = (b (n- 1) -a (ist 2+ 1) *b(n)
+ )*a(ist2)
b (n-2) = (b (n-2) -a (ist3+2) *b (n)
+ -a(ist3+i)*b(n-1)
+ )*a(ist3)
b(n-3)=(b(n-3)-a(ist4+3)*b(n)
+ -a(ist4+2)*b(n-i)
+ -a(ist4+l)*b(n-2)
+ )*a(ist4)
b(n-4)=(b(n-4)-a(ist5+4)*b(n)
+ -a(ist5+3)*b(n-1)
+ -a(ist5+2)*b(n-2)
+ -a(ist5+l)*b(n-3)
+ )*a(ist5)
nopss=nopss+25
goto 200
+
+
+
203
+
+
+
continue
ist l=colptr (n)
ist2=colptr (n- 1)
ist3=colptr (n-2)
ist4=colptr(n-3)
b(n)=b(n)*a(istl)
b(n-l)=(b(n-1)-a(ist2+l)*b(n)
)*a(ist2)
b(n-2)=(b(n-2)-a(ist3+2)*b(n)
-a(ist3+l)*b(n-l)
)*a(ist3)
b(n-3)=(b(n-3)-a(ist4+3)*b(n)
-a(ist4+2)*b(n-1)
-a(ist4+l)*b(n-2)
)*a(ist4)
nopss=nopss+16
goto 200
continue
istl=colptr(n)
ist2=colptr(n-1)
ist3=colptr(n-2)
b(n)=b(n)*a(istl)
b(n-1)=(b(n-1)-a(ist2+l)*b(n)
)*a(ist2)
b(n-2)=(b(n-2)-a(ist3+2)*b(n)
-a(ist3+l)*b(n-l)
)*a(ist3)
Figure A5. Continued.
56
188:
189:
190:
191:
192:
193:
194:
195:
196:
197:
198:
199:
200:
201:
202:
203:
204:
205:
206:
207:
208:
209:
210:
211:
212:
213:
214:
215:
216:
217:
218:
219:
220:
221:
222:
223:
224:
225:
226:
227:
228:
229:
230:
231:
232:
233:
nopss=nopss+9
goto 200
202
+
continue
istl=colptr(n)
ist2=colptr(n-l)
b(n)=b(n)*a(istl)
b(n-l)=(b(n-l)-a(ist2+l)*b(n)
)*_(ist2)
nopss=nopss+4
goto 200
201 continue
istl=colptr(n)
b (n) =b (n) *a (ist I)
nopss=nopss+l
200 continue
do 2 i=last,1,-6
ith=collth(i)-i
istl=colptr(i-5)+5
ist2=colptr(i-4)+4
ist3=colptr(i-3)+3
ist4=colptr(i-2)+2
ist5=colptr(i-1)+1
ist6=colptr(i)
if (lth.eq.O) goto 21
cdir\$ ivdep
do 20 ict=l,lth
b(i)=b(i)-a(ist6+ict)*b(i+ict)
b(i-1)=b(i-l)-a(ist5+ict)*b(i+ict)
b(i-2)=b(i-2)-a(ist4+ict)*b(i+ict)
b(i-3)=b(i-3)-a(ist3+ict)*b(i+ict)
b(i-4)=b(i-4)-a(ist2+ict)*b(i+ict)
b(i-5)=b(i-5)-a(istl+ict)*b(i+ict)
20 continue
nopss=nopss+12*lth
21 continue
b(i)=b(i)*a(ist6)
b (i-i) = (b (i-l) -a (ist5) *b (i)
+ )*a(ist5-1)
b(i-2)=(b(i-2)-a(ist4)*b(i)
+ -a(ist4-1)*b(i-l)
+ )*a(ist4-2)
b (i-3) = (b (i-3)-a(ist3)*b (i)
Figure A5. Continued.
57
234:
235:
236:
237:
238:
239:
240:
241:
242:
243:
244:
245:
246:
247:
248:
249:
250:
251:
252:
+ -a(ist3-t)*b(i-1)
+ -a(ist3-2)*b(i-2)
+ )*a(ist3-3)
b(i-4)=(b(i-4)-a(ist2)*b(i)
+ -a(ist2-1)*b(i-l)
+ -a(ist2-2)*b(i-2)
+ -a(ist2-3)*b(i-3)
+ )*a(ist2-4)
b(i-5)=(b(i-5)-a(istl)*b(i)
+ -a(istl-1)*b(i-1)
+ -a(istl-2)*b(i-2)
+ -a(istl-3)*b(i-3)
+ -a(istl-4)*b(i-4)
+ )*a(istl-5)
nopss=nopss+36
continue
return
end
Figure A5. Concluded.
58
References
1. Ortega, James M.: Matrix Theory A Second Course.
Plenum Press, c.1987.
2. Everstine, Gordon C.: The BANDIT Computer Program
for the Reduction of Matrix Bandwidth for NASTRAN.
Rep. 3827, Naval Ship Research _z Development Center,
Mar. 1972.
3. Cray Research, Inc.: Cray Multitasking Programmer's
Reference Manual. Central Scientific Computing Com-
plex Document CR-18a, SN-2026C, c.1989.
4. Cray Research, Inc.: Cray Y-MP, Cray X-MP EA and
Cray X-MP Multitasking Programmer's Manual. Central
Scientific Computing Complex Document CR-33, SR-
0222 F-01, c.1989.
5. Jordan, Harry F.: Programming Language Concepts
for Multiprocessors. Parallel Comput., vol. 8, no. 1-3,
Oct. 1988, pp. 31 40.
6. Jordan, Harry F.; Benten, Muhammad S.; Arenstorf,
Norbert S.; and Ramanan, Aruna V.: Force User's
Manual A Portable, Parallel FORTRAN. NASA CR-
4265, 1990.
7. Cray Research, Inc.: Cray UNICOS Autotasking User
Guide. Central Scientific Computing Complex Document
CR,17, SN-2088 CFT77 3.0, c.1989.
8. Poole, Eugene L.; and Overman, Andrea L.: The Solution
of Linear Systems of Equations With a Structural Analy-
sis Code on the NAS CRAY-2. NASA CR-4159, 1988.
9. Agarwal, Tarun K.; Storaasli, Olaf O4 and Nguyen, Duc
T.: A Parallel-Vector Algorithm for Rapid Structural
Analysis on High-Performance Computers. A Collection
of Technical Papers, Part 2 AIAA/ASME/ASCE/AHS/
ASC 31st Structures, Structural Dynamics and Materials
Conference, Apr. 1990, pp. 662 672. (Available as AIAA-
90-1149-CP.)
10. Ortega, J. M.: The ijk Forms of Factorization Methods I.
Vector Computers. Parallel Comput., vol. 7, no. 2, Junc
1988, pp. 135 147.
11.
12.
13.
14.
15.
16.
17.
18.
19.
Arenstorf, Norbert S.; and Jordan, Harry F.: Comparing
Barrier Algorithms. Parallel Comput., vol. 12, no. 1,
Oct. 1989, pp. 157 170.
Jones, Mark T.; and Patrick, Merrell L.: The Use of
Lanczo's Method To Solve the Large Generalized Symmet-
ric Eigenvalue Problem in Parallel. NASA CR-182072,
ICASE Rep. No. 90-48, 1990.
Ortega, James M.: Introduction to Parallel and Vector
Solution of Linear Systems. Plenum Press, c.1988.
Knight, N. F., Jr.; Gillian, R. E.; McCleary, S. L.;
Lotts, C. G.; Poole, E. L.; Overman, A. L.; and Macy,
S. C.: CSM Testbed Development and Large-Scale Struc-
tural Applications. Science and Engineering on Cray
Supercomputers Proceedings of the Fourth International
Symposium, Cray Research, Inc., 1988, pp. 359 387.
(Available as NASA TM-4072, 1989.)
Knight, Norman F., Jr.; McCleary, Susan L.; Macy,
Steven C.; and Aminpour, Mohammad A.: Large-Scale
Structural Analysis: The Structural Analyst, the CSM
Testbed, and the NAS System. NASA TM-100643, 1989.
Robins, A. Warner; Dollyhigh, Samuel M.; Beissner,
Fred L., Jr.; Geiselhart, Karl; ._Iartin, Glenn L.; Shields,
E. W.; Swanson, E. E.; Coen, Peter G.; and Morris,
Shelby J., Jr.: Concept Development of a Mach 3.0 High-
Speed Civil Transport. NASA TM-4058, 1988.
Kao, Pi-Jen; VCrenn, Gregory A.; and Giles, Gary L.:
Comparison of Equivalent Plate and Finite Element Anal-
ysis of a Realistic Aircraft Structural Configuration.
AIAA-90-3293, Sept. 1990.
Knight, Norman F., Jr.; Gillian, Ronnie E.; and Nemeth,
Michael P.: Preliminary 2-D Shell Analysis of the Space
Shuttle Solid Rocket Boosters. NASA TM-100515, 1988.
PATRAN Plus User Manual Volume II, Release 2.4.
Publ. No. 2191024, PDA Engineering, c.1990.
59

