Scalability study of parallel spatial direct numerical simulation code on IBM SP1 parallel supercomputer by Hanebutte, Ulf R. et al.
NASA Contractor Report 194975
ICASE Report No. 94-80
SCALABILITY STUDY OF PARALLEL SPATIAL
DIRECT NUMERICAL SIMULATION CODE ON
IBM SP1 PARALLEL SUPERCOMPUTER
Ulf R. Hanebutte
Ronald D. Joslin
Mohammad Zubair
Contract NAS 1-19480
August 1994
Institute for Computer Applications in Science and Engineering
NASA Langley Research Center
Hampton, VA 23681-0001
Operated by Universities Space Research Association
https://ntrs.nasa.gov/search.jsp?R=19950012169 2020-06-16T09:19:25+00:00Z
Scalability Study of Parallel Spatial Direct Numerical Simulation Code on IBM
SP1 Parallel Supercomputer
Ulf R. Hanebutte 1
Institute for Computer Applications in Science and Engineering
NASA Langley Research Center
Hampton, VA 23681, USA
Ronald D. Joslin
NASA Langley Research Center
Hampton, VA 23681, USA
Mohammad Zubair 1
International Business Machines Corporation
Thomas J. Watson Research Center
Yorktown Heights, NY 10598, USA
Abstract
The implementation and the performance of a parallel spatial direct numerical simulation (PSDNS) code are
reported for the IBM SP1 parallel supercomputer. The spatially evolving disturbances that are associated
with laminar-to-turbulent transition in three-dimensional boundary-layer flows are computed with the PSDNS
code. By remapping the distributed data structure during the course of the calculation, optimized serial
library routines can be utilized that substantially increase the computational performance. Although the
remapping incurs a high communication penalty, the parallel efficiency of the code remains above 40 percent
for all performed calculations. By using appropriate compile options and optimized library routines, the
serial code achieves 52-56 Mflops on a single node of the SP1 (45 percent of theoretical peak performance).
The actual performance of the PSDNS code on the SP1 is evaluated with a "real world" simulation that
consists of 1.7 million grid points. One time step of this simulation is calculated on eight nodes of the SP1
in the same time as required by a Cray Y/MP supercomputer. The 32-node SP1 is 2.9 times faster than the
Cray Y/MP for the same simulation. The scalability information provides estimated computational costs
that match the actual costs relative to changes in the number of grid points.
1This research was supported by the National Aeronautics and Space Administration under NASA Contract No.
NAS1-19480 while the authors were in residence at the Institute for Computer Applications in Science and Engineering
(ICASE), NASA Langley Research Center, Hampton, VA 23681.
1 Introduction
In a recent review article, Fischer and Patera [4] summarize current work in the area of parallel
simulation of viscous incompressible flows. In their work, they discuss parallel solution strategies for
Poisson, Stokes, and Navier-Stokes problems. Iterative and direct-solution techniques that utilize
domain decomposition for structured and unstructured finite-element grids are presented. The
coupled approach (based on a full-implicit temporal treatment) and the decoupled approach with
a semi-implicit time step are compared for the Navier-Stokes equations. Examples of solutions to
turbulent flow problems with either transpose-based fast Fourier transform (FFT) techniques or
distributed FFT algorithms are given. These Fourier examples include the first parallel computation
of a viscous incompressible flow by Moin and Kim [12] in 1982 on a 64-processor ILLIAC IV.
However, a discussion of parallel three-dimensional (3D) spatial direct numerical simulation (DNS)
algorithms for laminar-to-turbulent transition (the subject of this paper) is not included in reference
[4].
This report can be considered a follow-up to the work by Joslin and Zubair [9], in which
the performance of the parallel spatial direct numerical simulation (PSDNS) code on the relatively
small and slow INTEL iPSC/860 computer was analyzed. The limited local memory of the
iPSC/860 seriously restricted the analysis. Because the IBM SP1 is a newer generation of parallel
computer with an increased local memory capacity and improved computation and communication
performance, a direct comparison of these two machines is not meaningful. However, the general
statements given in reference [9] in regard to the scaling of major kernels of the PSDNS code remain valid.
2 Performance
2.1 The Parallel Computing Environment
The IBM SP1 [5] scalable parallel computer utilized in the presented performance study consists
of 128 processing nodes. Each node is essentially an IBM RS/6000 model 370 workstation with
a clock rate of 62.5 MHz. The local memory is 128 Mb, and the processor data and instruction
caches are 32 kb each. The individual nodes are connected by a multistage network that consists
of high-performance switches (50 μsec latency, 8.5 Mb/sec bandwidth); each switch can support up to
16 nodes. The peak performance obtained by performing one multiplication and one addition on
64-bit floating point numbers per clock cycle is 125 Mflops for each processing node. However, in
practice, a FORTRAN code delivers 15-75 Mflops. Although the next-generation parallel computer
from IBM, called the SP2 [8, 13], is architecturally identical to the SP1, its node performance has
more than doubled and the communication network bandwidth has increased fourfold. For the
SP2, the increase in communication bandwidth relative to the computing performance will provide
a better balanced system, which should further improve the performance results of the presented
code. The access to Argonne's SP1 is controlled by a scheduler, which ensures that the requested
node partition is operated in a dedicated mode. Thus, only the user application and some necessary
Unix daemons are executed on the assigned processor partition.
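The peak rate quoted above follows directly from the clock rate and the two floating-point operations per cycle. The short Python sketch below reproduces that arithmetic and expresses a measured rate as a fraction of peak; the function names are illustrative and are not part of the PSDNS code.

```python
# Node parameters quoted in the text for one SP1 processing node.
CLOCK_MHZ = 62.5        # clock rate of the RS/6000 model 370 node
FLOPS_PER_CYCLE = 2     # one multiplication and one addition per clock cycle


def peak_mflops(clock_mhz=CLOCK_MHZ, flops_per_cycle=FLOPS_PER_CYCLE):
    """Theoretical peak rate in Mflops (MHz times operations per cycle)."""
    return clock_mhz * flops_per_cycle


def fraction_of_peak(measured_mflops):
    """Measured rate divided by the theoretical peak of one node."""
    return measured_mflops / peak_mflops()


if __name__ == "__main__":
    print("peak:", peak_mflops(), "Mflops")
    for measured in (52.0, 56.0):   # serial PSDNS range reported in the abstract
        print(measured, "Mflops ->", round(100 * fraction_of_peak(measured), 1), "percent")
```

With these inputs the upper end of the reported serial range (56 Mflops) corresponds to roughly 45 percent of peak, in line with the figure stated in the abstract.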
2.2 The Parallel Application
The PSDNS code developed by Joslin and Zubair [9] has been ported to the SP1 with only minor
changes. The original parallel code is based on the message-passing paradigm with explicit data
distribution, which enables good portability among a broad class of parallel computers. The interested
reader is referred to references [9] and [10] for algorithmic details of the spatial DNS code. In
the PSDNS code, the data are distributed among the np processors in block form with a z-mapping.
That is, the 3D data are partitioned into np blocks that contain nz/np two-dimensional (2D) planes
of nxny data items each (Figure 1). To perform local FFT's in the spanwise direction nz, the data
must be remapped. As indicated in Figure 2, an x-mapping allows the utilization of optimized
serial FFT library routines [7] in the z direction. The INTEL implementation of the PSDNS code
relies on the xor algorithm [2] for the global data exchange; the IBM implementation makes use
of a global index routine provided by the AIX-parallel environment [6]. As shown in the study by
Joslin and Zubair, a significant performance gain can be achieved by utilizing a machine-specific
basic linear algebra subprogram (BLAS) level 3 routine [3] for the matrix by matrix multiplication.
Because this routine is also available on the IBM as part of the ESSL library [7], the same advantage
is obtained in the present implementation. The performance of the application code is further
improved through appropriate selection of the compile options. As a result, the run time of the
serial code can be reduced by a factor of 2.3 compared with a compilation without any options.
For a small test problem (for which Joslin and Zubair [9] obtained 189 Mflops on the Cray Y/MP
and 5 Mflops on a single node of the iPSC/860), a single node of the SP1 delivers 52.5 Mflops for
the double-precision (i.e., 64-bit) computation.
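The remapping from the z-mapping to the x-mapping can be illustrated with a small serial emulation. The Python sketch below is not the PSDNS implementation (which is FORTRAN with message passing); it models each processor as a dictionary keyed by global (x, y, z) indices, assumes that nx and nz are divisible by the number of processors, and uses hypothetical function names. It shows that after the complete exchange every processor holds full spanwise lines, which is what allows a serial FFT routine to be applied locally.

```python
from itertools import product


def z_mapping(nx, ny, nz, n_proc):
    """Block distribution in z: processor p owns nz/n_proc full (x, y) planes."""
    planes = nz // n_proc
    return [
        {(x, y, z): float(x + 10 * y + 100 * z)   # dummy data for illustration
         for x, y, z in product(range(nx), range(ny),
                                range(p * planes, (p + 1) * planes))}
        for p in range(n_proc)
    ]


def remap_z_to_x(local, nx, n_proc):
    """Complete exchange: each processor sends one block to every other one."""
    cols = nx // n_proc
    remapped = [dict() for _ in range(n_proc)]
    messages = 0
    for src, data in enumerate(local):
        for dst in range(n_proc):
            block = {k: v for k, v in data.items() if k[0] // cols == dst}
            remapped[dst].update(block)
            if src != dst:
                messages += 1   # the src == dst block never leaves the node
    return remapped, messages


def owns_full_z_lines(data, nz):
    """True if every locally held (x, y) pair comes with all nz spanwise points."""
    lines = {}
    for (x, y, z) in data:
        lines.setdefault((x, y), set()).add(z)
    return all(zs == set(range(nz)) for zs in lines.values())


if __name__ == "__main__":
    nx, ny, nz, n_proc = 8, 3, 8, 4
    before = z_mapping(nx, ny, nz, n_proc)
    after, messages = remap_z_to_x(before, nx, n_proc)
    print("messages:", messages)
    print("full z-lines after remap:", all(owns_full_z_lines(d, nz) for d in after))
```

In this emulation the exchange consists of np(np - 1) messages, each carrying nx ny nz / np^2 data items, which is why the message size shrinks quadratically as processors are added.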
2.3 Performance Study
To document the performance of the simulation code for a wide range of problem sizes and number
of processing nodes, three test suites are considered here. Further, a scaling analysis is presented in
which each of the three problem dimensions is scaled individually and the number of processors
is kept constant. The performance study is concluded with a discussion of a real-world large-scale
simulation. Although thousands of time steps are required for a single simulation, performance
figures for only one time step of the PSDNS code are presented here. Performance figures for
one time step are sufficient because the workload for each time step is constant throughout the
simulation.
For all three test suites, performance data are collected for the serial code on a single node of
the SP1 and for the parallel code on up to 64 processing nodes. The chosen problem dimensions are
representative of actual simulations that are currently performed on Cray-class supercomputers.
The wall-normal dimension is fixed at 41 grid points for all three test suites. The number of grid
points in the streamwise direction determines whether the cases are considered to be small, medium,
or large. For small problems, the streamwise direction consists of 64 grid points; the medium and
large suites have 128 and 256 grid points, respectively. For each test suite, the spanwise dimension
is varied from 8 to 128 Fourier modes in powers of 2. Thus, experimental performance data can be
obtained for problems that range from as small as 64 streamwise, 41 wall-normal, and 8 spanwise
grid points (20 992 grid points) to a problem that is 64 times larger and contains 256 streamwise,
41 wall-normal, and 128 spanwise grid points.
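The problem sizes covered by the three test suites can be enumerated directly from the dimensions given above. The following Python sketch (names are illustrative) lists all 15 cases and confirms the stated smallest size and the factor of 64 between the smallest and largest problems.

```python
NY = 41                                             # wall-normal points, fixed
STREAMWISE = {"small": 64, "medium": 128, "large": 256}
SPANWISE = [8, 16, 32, 64, 128]                     # powers of 2 from 8 to 128


def grid_points(nx, ny, nz):
    """Total number of grid points of an nx x ny x nz problem."""
    return nx * ny * nz


cases = {
    (suite, nz): grid_points(nx, NY, nz)
    for suite, nx in STREAMWISE.items()
    for nz in SPANWISE
}

if __name__ == "__main__":
    for (suite, nz), points in cases.items():
        print(f"{suite:6s} nz={nz:3d}  {points:9d} grid points")
    print("largest / smallest:", cases[("large", 128)] // cases[("small", 8)])
```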
The PSDNS code is instrumented with a set of timers to record separate performance data
for different parts of the computation (the total and four dominating algorithmic kernels) and the
communication. These measurements are wall-clock time. By including the idle time that results
from the necessary synchronization points of the code in the time data, processor-independent
performance figures can be obtained. Processor idle time is discussed below in conjunction with the
large simulation for which the small serial fraction of the PSDNS code is experimentally determined.
In Figures 3(a), 4(a), and 5(a), the computational times for a single time step of the small,
medium, and large test suites, respectively, are given in double logarithmic graphs. The associated
communication times are given in Figures 3(b), 4(b), and 5(b). The excellent scaling of the code
on the SP1 can be observed immediately. However, large communication costs relative to the
computation costs are incurred because of the unbalanced architecture of the current SP1 (i.e.,
network performance lags behind compute performance of processing nodes) on the one hand and
the algorithmic communication penalty on the other hand. The communication penalty must be
incurred in order to utilize highly optimized serial FFT routines in the spanwise direction. The
good scaling of the communication cost with respect to the number of processors is noteworthy
because the communication that occurs in the PSDNS code involves a complete exchange, which
represents a stringent test of the communication network.
The speedup of a parallel code for fixed-size problems is an important performance metric. In
Figures 6(a), 7(a), and 8(a), the actual speedup of the complete calculation for each case is given.
The performance of the algorithm can be improved by scaling the problem size by either increasing
the number of spanwise grid points or increasing the number of streamwise grid points. However,
the code is less sensitive to changes in the size of the streamwise dimension than it is to changes
in the number of spanwise grid points. For all test cases, the parallel efficiency of the PSDNS code
stays above 40 percent, even when 64 processing nodes are utilized.
A theoretical speedup metric can be obtained by ignoring all communication costs. These
metrics are given in Figures 6(b), 7(b), and 8(b). For large problems with 64 and 128 spanwise
grid points, a superlinear speedup is observed. The superlinear theoretical speedup seen for the
large problems is not a surprise. The good scalability of the algorithm, combined with the better
memory access of the local portion of the distributed data structure is an obvious explanation. For
a discussion of superlinear speedup, the reader is referred to reference [14].
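For reference, the speedup and efficiency measures used in this discussion can be written out as follows. The symbols $T(n_p)$ (wall-clock time of one time step on $n_p$ processors, including communication and idle time) and $T_{\mathrm{comp}}(n_p)$ (the same time with all communication costs ignored) are notation assumed here for illustration; they do not appear in the original text.

```latex
S(n_p) = \frac{T(1)}{T(n_p)}, \qquad
E(n_p) = \frac{S(n_p)}{n_p}, \qquad
S_{\mathrm{comp}}(n_p) = \frac{T_{\mathrm{comp}}(1)}{T_{\mathrm{comp}}(n_p)}
```

Under these definitions, a parallel efficiency above 40 percent on 64 processing nodes corresponds to an actual speedup above $0.4 \times 64 = 25.6$, and a superlinear theoretical speedup means $S_{\mathrm{comp}}(n_p) > n_p$.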
To further examine the performance of the simulation algorithm, a cost breakdown and an
itemized speedup are given for each of the three test cases with 64 spanwise grid points in Fig-
ures 9, 10, and 11. The four dominating kernels of the algorithm, for which the operation count
and a normalized count are given in Table 1, are the matrix-matrix multiply (an ny, ny matrix is
multiplied by an ny, n_ matrix), the one-dimensional FFT in the spanwise direction, a tridiagonal
solver, and a pentadiagonal solver. The cost of each kernel relative to the computational cost shows
that both the FFT and the matrix-matrix multiply each require roughly 30 percent of the total
computing time. When the number of processors is small, the cost for the FFT routine is higher
than for the matrix-matrix multiply. The tridiagonal and pentadiagonal systems remain nearly
constant at about 10 and 5 percent, respectively. The cost for communication is relatively high,
and for a large number of processors the communication costs are equivalent to 80-90 percent of
the computational cost. The slight drop in the communication costs from 32 to 64 processors
for the medium and large test cases can be attributed to the fact that idle time is included in the
overall computational cost.
The itemized speedup curves presented in Figures 9(b), 10(b), and 11(b) show the following: a
superlinear speedup for the FFT kernel; an ideal linear speedup for the tridiagonal solver; a nearly
ideal speedup for the matrix-matrix multiply; and a moderate speedup for the pentadiagonal solver.
The overall speedup of the computational fraction of the algorithm is close to that of the matrix-
matrix multiply (the kernel that dominates the algorithm). In addition, notice the speedup in
communication results with 64 processors in comparison with the 2-processor results.
2.4 Complete Data Exchange
The performance of the complete data exchange portion of the algorithm deserves a closer anal-
ysis. The startup latency and bandwidth are the two most important quantities for evaluating a
communication network. However, to analyze an application code, obtaining only an experimen-
tal bandwidth that includes the startup costs is sufficient. Before a meaningful discussion of the
bandwidth achieved per processor can be given, both the actual message volume and the message
size must be determined. Because the communication pattern is regular and fixed throughout the
simulation, these quantities are deterministic. The data-exchange routine is called 51 times
during one time step of the algorithm. Thus, the message volume during one time step of the
double-precision code (i.e., 8 b per data item) is given by $51 \times 8 \times n_x n_y n_z (n_p - 1)/n_p$. For the small
test suite, the message volume is given in Figure 12(a) for 32, 64, and 128 spanwise grid points
and up to 32 processing nodes. The message volume quickly approaches its asymptotic value of
$51 \times 8 \times n_x n_y n_z$. The message size of each individual message drops rapidly with the number
of processors, as can be seen in Figure 12(b). The formula for determining the message size is
$8 \times n_x n_y n_z / n_p^2$. To obtain the experimental bandwidth, the message volume must be divided by
one-half of the required wall-clock time. The factor of one-half is used because the data must be sent
as well as received. The experimental bandwidth achieved per processor is shown in Figure 13 as a
function of the message size, which is given on a logarithmic scale. For large messages, the experi-
mental bandwidth is close to the maximum value of the network bandwidth (8.5 Mb/sec). However,
the startup latency reduces the observed bandwidth for smaller message sizes. The startup latency
prevents the communication from being ideally scalable with the number of processors. A fixed-size
problem distributed among a larger set of processors necessitates the exchange of more messages
of shorter message size. Hence, network contention does not slow down the data exchange.
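The message volume, message size, and experimental bandwidth described in this section can be evaluated with a few lines of Python. The sketch below applies the formulas to the large simulation (896 × 61 × 32 grid points) with the communication times listed in Table 3. It assumes that "Mb" denotes 2**20 bytes, an interpretation that reproduces the tabulated volumes and sizes; the function names are illustrative.

```python
MB = 2 ** 20            # assumption: "Mb" in the tables means 2**20 bytes
CALLS_PER_STEP = 51     # data-exchange calls per time step
BYTES_PER_ITEM = 8      # double precision


def message_volume_bytes(nx, ny, nz, n_proc):
    """Total data exchanged during one time step."""
    return CALLS_PER_STEP * BYTES_PER_ITEM * nx * ny * nz * (n_proc - 1) / n_proc


def message_size_bytes(nx, ny, nz, n_proc):
    """Size of one individual message of the complete exchange."""
    return BYTES_PER_ITEM * nx * ny * nz / n_proc ** 2


def bandwidth_mb_per_sec(volume_bytes, comm_seconds, n_proc):
    """(total, per-processor) bandwidth; half the time is attributed to sending."""
    total = (volume_bytes / MB) / (comm_seconds / 2.0)
    return total, total / n_proc


if __name__ == "__main__":
    nx, ny, nz = 896, 61, 32                   # large simulation
    comm_seconds = {8: 19.0, 16: 11.0, 32: 7.5}  # communication times from Table 3
    for n_proc, seconds in comm_seconds.items():
        volume = message_volume_bytes(nx, ny, nz, n_proc)
        size = message_size_bytes(nx, ny, nz, n_proc)
        total, per_proc = bandwidth_mb_per_sec(volume, seconds, n_proc)
        print(n_proc, round(volume / MB), round(size / MB, 3),
              round(total), round(per_proc, 2))
```

The per-processor values computed this way differ from the tabulated ones only in the last digit, which suggests that the table entries were derived from the rounded total bandwidth.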
2.5 Scaling Analysis of the PSDNS Code
The scalability study is summarized in Figures 14, 15, and 16. The test case with 64 streamwise, 41
wall-normal, and 32 spanwise grid points is used as the pivot point for this study, which is carried
out on 16 processors. Figure 14(a) depicts the computational costs for varying the streamwise
dimension. The slowdown that occurs when the number of grid points in the streamwise direction
is doubled or quadrupled is given in Figure 14(b). The FFT, the matrix-matrix multiply, and
the communication cost are slightly superlinear; the overall computation time doubles for 128 grid
points and quadruples for 256 points.
A variation in the wall-normal dimension of the problem from 41 to 61 and, finally, to 81
grid points has a greater effect on the costs associated with the simulation. Figure 15(a) gives
the computational costs for the individual kernels, and Figure 15(b) plots the slowdown. The
normalized count for the matrix-matrix multiply is O(ny) (Table 1), which can readily be seen in
the slowdown curve for this major kernel. If ny is doubled, an execution time that is four times
larger results for the matrix-matrix multiply. The pentadiagonal solver also shows a slowdown that
is more than linear; the FFT, the tridiagonal solver, and the communication costs scale linearly. The
scaling of the overall computational cost follows the dominating kernel (matrix-matrix multiply);
thus, it exhibits a slowdown of 2.8 when the wall-normal dimension is doubled.
Figure 16(a) shows how the costs are affected when the number of spanwise grid points is increased.
A nearly linear scaling of the overall execution time (computation plus communication time) is
observed for the range of spanwise dimensions (32 to 128 grid points). A closer look at the individual
slowdown given in Figure 16(b) confirms the normalized count of O(log2 nz) (Table 1) for the FFT
kernel. Although all other kernels scale linearly, the FFT kernel causes the computational cost to
follow the logarithmic increase of the FFT. We can expect that the overall time will increase at a
rate that is greater than linear as the number of spanwise grid points is increased.
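The slowdowns expected from the operation counts in Table 1 can be estimated with a simple model. The Python sketch below treats the order-of-magnitude counts as exact (all constant factors are dropped, which is an assumption made for illustration) and computes the ratio of operation counts between a scaled problem and the pivot case of 64 × 41 × 32 grid points.

```python
import math


def matmat_ops(nx, ny, nz):
    """Matrix-matrix multiply: O(nx ny^2 nz), i.e. normalized count O(ny)."""
    return nx * ny * ny * nz


def fft_ops(nx, ny, nz):
    """Spanwise FFT: O(nx ny nz log2 nz)."""
    return nx * ny * nz * math.log2(nz)


def linear_ops(nx, ny, nz):
    """Tridiagonal and pentadiagonal solvers: O(nx ny nz)."""
    return nx * ny * nz


def slowdown(ops, base, scaled):
    """Ratio of modeled operation counts, scaled problem over base problem."""
    return ops(*scaled) / ops(*base)


if __name__ == "__main__":
    pivot = (64, 41, 32)
    print("mat-mat, ny 41 -> 81:", round(slowdown(matmat_ops, pivot, (64, 81, 32)), 2))
    print("FFT,     nz 32 -> 128:", round(slowdown(fft_ops, pivot, (64, 41, 128)), 2))
    print("solvers, nx 64 -> 256:", round(slowdown(linear_ops, pivot, (256, 41, 32)), 2))
```

In this model the matrix-matrix multiply slows down by nearly a factor of 4 when the wall-normal dimension goes from 41 to 81 points, and the FFT slows down by a factor of 5.6 (rather than 4) when the spanwise dimension is quadrupled from 32 to 128 points.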
2.6 Memory Requirement of the PSDNS Code
The memory requirement of the PSDNS code seriously limited the previous implementation on the
INTEL Hypercube [9]; however, the SP1, with 128 Mb of core memory on each node, allows the
larger simulations to be carried out. The largest grid that can be calculated on a single node of
the SP1 contains 671 744 grid points (128 streamwise, 41 wall-normal, and 128 spanwise points).
The executable code for this calculation requires 110 Mb of core memory. A rough estimate for the
memory requirement for the serial code then can be given as 170 b per grid point. However, because
the memory requirement is a function of both the problem size and the number of processors, no
simple relationship between executable code size and problem size can be given. In Figure 17, the
memory sizes of the executable code for the large test suite are presented. Due to limited space,
only the curves for 8 and 128 spanwise grid points are labeled. The reduction in the executable code
size caused by the distributed data structure is clearly visible. The total memory requirement (i.e.,
the size of the parallel executable code times the number of processors) is larger than the memory
needed by the serial code (due to the additional overhead that results from the data remapping
routine).
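The rough per-grid-point memory estimate for the serial code can be checked with the numbers given above. In the Python sketch below, "Mb" is assumed to mean 2**20 bytes; with that assumption the estimate comes out slightly above the rounded value of 170 b quoted in the text.

```python
MB = 2 ** 20                    # assumption: "Mb" means 2**20 bytes
NX, NY, NZ = 128, 41, 128       # largest grid that fits on a single node
EXECUTABLE_MB = 110             # size of the serial executable for that grid


def bytes_per_grid_point(executable_mb, nx, ny, nz):
    """Executable size divided by the number of grid points."""
    return executable_mb * MB / (nx * ny * nz)


if __name__ == "__main__":
    print("grid points:", NX * NY * NZ)
    print("bytes per grid point:", round(bytes_per_grid_point(EXECUTABLE_MB, NX, NY, NZ), 1))
```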
2.7 A Large-Scale Simulation
The nonlinear evolution of a crossflow vortex packet on a swept wing has been computed with
the spatial DNS code described by Joslin and Street [11]. Because this study required substantial
computational resources (i.e., approximately 125 CPU hours on a Cray-2 with a single processor),
it is representative of a large-scale simulation. For the SP1-compatible simulation, a grid with 896
streamwise, 61 wall-normal, and 32 spanwise grid points was used. Thus, the computational grid
contains over 1.7 million grid points. The Cray Y/MP performs one time step of this simulation
in 54 seconds and delivers 240 Mflops. Therefore, the computational expense of one time step is
12 960 Mflop.
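The work per time step derived from the Cray Y/MP measurement provides the basis for all Mflops figures of the large simulation. The Python sketch below repeats that arithmetic and converts the measured SP1 wall-clock times reported in Table 2 into sustained rates; the variable and function names are illustrative.

```python
NX, NY, NZ = 896, 61, 32
CRAY_YMP_SECONDS = 54.0         # one time step on the Cray Y/MP
CRAY_YMP_MFLOPS = 240.0         # sustained rate on the Cray Y/MP
SP1_ACTUAL_SECONDS = {8: 53.75, 16: 29.75, 32: 18.75}   # measured times, Table 2

grid_points = NX * NY * NZ
work_mflop = CRAY_YMP_SECONDS * CRAY_YMP_MFLOPS         # work of one time step


def sustained_mflops(seconds):
    """Sustained rate implied by the fixed work per time step."""
    return work_mflop / seconds


def speed_ratio_vs_ymp(seconds):
    """How many times faster than the Cray Y/MP a given time step is."""
    return CRAY_YMP_SECONDS / seconds


if __name__ == "__main__":
    print("grid points:", grid_points, " work per step:", work_mflop, "Mflop")
    for n_proc, seconds in SP1_ACTUAL_SECONDS.items():
        print(n_proc, "nodes:", round(sustained_mflops(seconds)), "Mflops,",
              round(speed_ratio_vs_ymp(seconds), 2), "x Cray Y/MP")
```

With these inputs, eight SP1 nodes essentially match the Cray Y/MP, and 32 nodes are about 2.9 times faster, as stated in the abstract.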
The large core memory of the SP1 allows a problem of the same size to be computed on as
few as eight processing nodes. The computational costs of the PSDNS algorithm for 8, 16, and 32
nodes of the SP1 are presented in Figure 18. The dashed line gives the total time required by the
algorithm to perform one time step. If we compare these SP1 timings with the times required by
a single node of the Cray Y/MP and Cray C-90 (marked with solid squares in the same plot), we
see that the PSDNS code is highly competitive with these serial supercomputer performances for
as few as 8 and 32 processing nodes of the SP1, respectively.
The actual measured execution time, which includes communication and idle time, is given
in Table 2 for 8, 16, and 32 nodes of the SP1. An idealized execution time can be obtained by
subtracting those times that each processing node spends idle or in communication. Because the
serial part of the algorithm is performed only on the first node, two idealized times must be recorded;
one value for the first node and another for the remaining processing nodes. The performance of
the PSDNS code (in Mflops), based on the actual time and the idealized time, is given in Table 2.
The idealized performance of 55 Mflops per processor is noteworthy. Recall that even though the
peak performance of a single node is 125 Mflops, 15-75 Mflops are generally observed for actual
applications. The last column of Table 2 shows the memory requirements of the executable code;
these numbers show that the code is far from reaching the local memory limit of 128 Mb.
A performance summary for the communication part of the algorithm is given in Table 3. The
presented values for the message volume and the message size are calculated with the formulas
presented in section 2.4. The measured experimental bandwidth agrees well with the values shown
in Figure 13 for the large test case.
By using the idealized execution times for node 1 and for nodes 2-np in Table 2 (the idealized
execution time excludes all idle time and communication costs), one can determine experimentally
the serial and parallel fractions of the PSDNS algorithm. The difference between the two execution
times is the time spent in the serial part of the parallel algorithm. If we multiply the execution
time of node 2 by the number of processors, we obtain the execution time of the parallel portion.
In this context, total execution time is equal to the time spent by all processing nodes combined.
If we normalize the time spent in the serial and parallel portions of the algorithm with the total
execution time, we obtain the serial fraction s and the parallel fraction p, respectively. Surprisingly,
the serial fraction is only 1.4 percent, and the parallel fraction is 98.6 percent of the total. Amdahl's
law [1] provides a theoretical speedup that is derived from these two quantities:
$$ S_p = \frac{1}{s + \dfrac{p}{n_p}} $$
For 8, 16, and 32 processing nodes, the theoretical speedup Sp of the PSDNS code is 7.29, 13.22,
and 22.32, respectively. In the limit of np → ∞, the speedup asymptotically reaches the value
1/s. Even though the parallel granularity of the PSDNS code is restricted for this problem to 32
processing nodes, the theoretical maximum speedup is 71.
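The experimental determination of the serial fraction and the resulting Amdahl speedups can be retraced with the idealized times of Table 2. The Python sketch below follows the procedure described above (serial time as the difference between node 1 and the other nodes, parallel time as the node-2 time multiplied by the number of processors); the function names are illustrative.

```python
# Idealized execution times from Table 2: np -> (node 1, nodes 2-np), in seconds.
IDEALIZED_SECONDS = {8: (32.4, 29.1), 16: (17.7, 14.5), 32: (10.2, 7.1)}


def serial_fraction(t_node1, t_other, n_proc):
    """Serial time divided by the time spent by all nodes combined."""
    serial = t_node1 - t_other          # work done only on the first node
    parallel = t_other * n_proc         # work shared by all nodes
    return serial / (serial + parallel)


def amdahl_speedup(s, n_proc):
    """Amdahl's law with serial fraction s and parallel fraction p = 1 - s."""
    return 1.0 / (s + (1.0 - s) / n_proc)


if __name__ == "__main__":
    for n_proc, (t1, t2) in IDEALIZED_SECONDS.items():
        print(n_proc, "nodes: serial fraction",
              round(100 * serial_fraction(t1, t2, n_proc), 2), "percent")
    s = 0.014                           # value quoted in the text
    for n_proc in (8, 16, 32):
        print(n_proc, "nodes: theoretical speedup", round(amdahl_speedup(s, n_proc), 2))
    print("limit 1/s:", round(1.0 / s))
```

All three processor counts yield a serial fraction close to 1.4 percent, and with s = 0.014 the formula returns the speedups quoted in the text.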
3 Conclusions
The expectations raised in reference [9] for the performance of the PSDNS code on a larger and
more powerful distributed memory machine may be realized with its implementation on the SP1.
In reference [9], due to hardware limitations, only a vague estimate for the performance of a large-
scale simulation on a 32-node INTEL iPSC/860 with sufficient core memory was given. Joslin and
Zubair concluded that the execution time for the PSDNS code on the 32-node iPSC/860 would be
twice the time required by a Cray supercomputer. In this work, we have shown that only eight
nodes of the more powerful SP1 are needed to perform such a large-scale simulation in the same
amount of time as required by a Cray Y/MP. Furthermore, the utilization of 32 processing nodes on
the SP1 reduces the execution time to roughly one-third. Both the parallel efficiency of the PSDNS
code (above 40 percent for all performed calculations) on the SP1 and the high serial performance
of 52-56 Mflops on a single SP1 node (45 percent of theoretical peak performance) contribute to
this success. On 32 processing nodes of the SP1, the PSDNS code is also highly competitive in
comparison with the advanced Cray C-90 on large-scale simulations.
The scalability of the computation and communication parts of the PSDNS code has been
documented for three test suites. A performance gain can be realized with a more balanced ar-
chitecture, in which the network bandwidth is increased relative to the processing performance
of the nodes. The next generation of IBM's parallel computer, the SP2, which has been recently
introduced, has a greater balance between communication and computational performance. Thus,
a higher parallel efficiency can be expected for the PSDNS code on the SP2, in addition to the
higher serial performance.
The scalability information obtained by independently varying the number of grid points in
each of the three problem dimensions confirms the theoretical scaling analysis and is in agreement
with the results obtained on the iPSC/860 [9]. The actual time required for the large simulation
on 16 processors can be predicted correctly. In Figure 14(a), the total time for a problem of size
nx = 64, ny = 41, nz = 32 is 1.3 sec. The scaling coefficient for nx can be determined from
Figure 14(b) to be 14.3. For scaling the wall-normal dimension, Figure 15(b) gives a factor of
1.6. Therefore, the estimated time for the large simulation is 14.3 × 1.6 × 1.3 sec = 29.74 sec,
which is close to the value recorded in Table 2. Estimates for both 8 and 32 processors can also
be obtained with the speedup information provided in Figure 8(a) for the large test suite. The
obtained execution times are 56 and 17 sec, respectively, which again are reasonably close to the
actual measured time.
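The cost estimate for 16 processors is a product of the pivot-case time and the two scaling factors read from Figures 14(b) and 15(b). The Python sketch below repeats the calculation and compares it with the measured time of Table 2; the names are illustrative, and the spanwise dimension needs no factor because it is 32 in both the pivot case and the large simulation.

```python
PIVOT_SECONDS = 1.3         # total time for nx = 64, ny = 41, nz = 32 (Figure 14(a))
STREAMWISE_FACTOR = 14.3    # scaling nx from 64 to 896 (Figure 14(b))
WALL_NORMAL_FACTOR = 1.6    # scaling ny from 41 to 61 (Figure 15(b))
MEASURED_SECONDS = 29.75    # actual time on 16 processors (Table 2)


def estimated_seconds(pivot, *factors):
    """Pivot time multiplied by the scaling factor of each changed dimension."""
    estimate = pivot
    for factor in factors:
        estimate *= factor
    return estimate


if __name__ == "__main__":
    estimate = estimated_seconds(PIVOT_SECONDS, STREAMWISE_FACTOR, WALL_NORMAL_FACTOR)
    error = abs(estimate - MEASURED_SECONDS) / MEASURED_SECONDS
    print("estimate:", round(estimate, 2), "sec; relative error:", round(100 * error, 3), "percent")
```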
Acknowledgments
The authors gratefully acknowledge use of the Argonne National Laboratory High-Performance
Computing Research Facility (HPCRF). The HPCRF is funded principally by the U.S. Department
of Energy Office of Scientific Computing.
References
[1] AMDAHL, G. (1967). Validity of the single-processor approach to achieving large scale com-
puting capabilities. Proceedings of the AFIPS Conference, 483-485.
[2] BOKHARI, S.H. (1991). Complete Exchange on the iPSC/860. ICASE Report No. 91-4.
[3] DONGARRA, J.J., DU CROZ, J., HAMMARLING, S., AND DUFF, I. (1990). A set of level 3
basic linear algebra subprograms. ACM Transactions on Mathematical Software, 16(1), 18-28.
[4] FISCHER, P.F., AND PATERA, A.T. (1994). Parallel simulation of viscous incompressible
flows. Annu. Rev. Fluid. Mech., 26, 483-527.
[5] GROPP, W., LUSK, E., AND PIEPER, S.C. (1994). Users Guide for the Argonne National
Laboratory IBM SP1. Argonne National Laboratory.
[6] IBM PARALLEL PROGRAMMING REFERENCE, AIX Parallel Environment, Release 1.0. SH26-
7228-00, (1993).
[7] IBM GUIDE AND REFERENCE, Engineering and Scientific Subroutine Library, Version 2.
SH23-0526-00, (1992).
[8] IBM PRESS RELEASE, Cornell Theory Center First To Receive IBM's Newest High Perfor-
mance Power Parallel System. Obtained via World Wide Web, April 5, 1994.
[9] JOSLIN, R.D., AND ZUBAIR, M. (1993). Parallel Spatial Direct Numerical Simulations on the
Intel iPSC/860 Hypercube. ICASE Report No. 93-53.
[10] JOSLIN, R.D., STREET, C., AND CHANG, C.-L. (1993). Spatial DNS of Boundary-Layer
Transition Mechanisms: Validation of PSE Theory. Theor. and Comp. Fluid Dyn. 4(6), 271-
288.
[11] JOSLIN, R.D., AND STREET, C. (1994). The Role of Stationary Crossflow Vortices in
Boundary-Layer Transition on Swept Wings. Accepted for publication in Phys. Fluids A.
[12] MOIN, P., AND KIM, J. (1982). Numerical investigation of turbulent channel flow. J. Fluid
Mech. 118, 341-377.
[13] SAINI, S. (1994). The IBM SP2: Hardware, Software, Porting and Optimization Overview. Numerical
Aerodynamics Simulation Program, NASA Ames Research Center, NAS User Seminar,
July 27, 1994.
[14] SUN, X.-H., AND ZHU, J. (1994). Shared Virtual Memory and Generalized Speedup. ICASE
Report No. 94-2. Also to appear in the Proceedings of the International Parallel Processing
Symposium, 1994.
Table 1: Operation Counts for Major Kernels

Kernel       Operation count (OC)     Normalized count = OC/(nx ny nz)
MAT-MAT      O(nx ny^2 nz)            O(ny)
FFT          O(nx ny nz log2 nz)      O(log2 nz)
TRIDIAG      O(nx ny nz)              O(1)
PENTADIAG    O(nx ny nz)              O(1)
Table 2: Performance of large simulation on 8, 16, and 32 Nodes of SP1

Number of    Time, sec                        Performance, Mflops                     Executable
processors   Actual   Idealized               Actual              Idealized           code,
np                    Node 1    Nodes 2-np    Per proc.   Total   Per proc.   Total   Mb
 8           53.75    32.4      29.1          30          241     55           440    79
16           29.75    17.7      14.5          27          436     55           880    60
32           18.75    10.2       7.1          22          691     56          1760    50
Table 3: Total Data Exchange for Single Iteration Step of Large Simulation

Number of    Comm.,   Message              Bandwidth
processors   sec      Volume,   Size,      Total,    Per processor,
np                    Mb        Mb         Mb/sec    Mb/sec
 8           19.0     595       0.208       63       7.9
16           11.0     638       0.052      116       7.3
32            7.5     659       0.013      176       5.5
[Figure 1 image not reproduced. Panels: (a) Entire domain. (b) Mapped onto four processors. Axis labels: streamwise, wall normal, spanwise.]
Figure 1: Computational domain.
[Figure 2 image not reproduced. Axis labels: x, y, z; processor blocks labeled P1, P3.]
Figure 2: Global remapping results in local FFT's in spanwise direction.
[Figure 3 image not reproduced. Plots for nx: 64; ny: 41; nz: variable, logarithmic axes, horizontal axis: Number of Processors. Panels: (a) Computational costs. (b) Communication costs.]
Figure 3: Computational and communication costs for small test suite.
[Figure 4 image not reproduced. Plots for nx: 128; ny: 41; nz: variable, logarithmic axes, horizontal axis: Number of Processors. Panels: (a) Computational costs. (b) Communication costs.]
Figure 4: Computational and communication costs for medium test suite.
[Figure 5 image not reproduced. Plots for nx: 256; ny: 41; nz: variable, logarithmic axes, horizontal axis: Number of Processors. Panels: (a) Computational costs. (b) Communication costs.]
Figure 5: Computational and communication costs for large test suite.
Figure 6: Speedup for small test suite.
[Panels: (a) Complete calculation. (b) Computation only. Both are plotted against Number of Processors together with the ideal speedup line; nx: 64; ny: 41; nz: variable.]
Figure 7: Speedup for medium test suite.
[Panels: (a) Complete calculation. (b) Computation only. Both are plotted against Number of Processors together with the ideal speedup line; nx: 128; ny: 41; nz: variable.]
Figure 8: Speedup for large test suite.
[Panels: (a) Complete calculation. (b) Computation only. Both are plotted against Number of Processors together with the ideal speedup line; nx: 256; ny: 41; nz: variable.]
13
0El
r_
E
0
(J
10C_
80
60
4o
20
nx: 64; ny: 41; nz: 64
_,_-_O: 0 :0 i
i : :
" i..........i ....oic_mh.............
/i i i _iCom_.i
.....i.................. _...... oi_tdf_at] ....
i i _i Tridiag ::
:: ! x! PenlOdiagi
......."'" ...... ! .......... i.......... ! .......... .,".......... :---
:A : '
20 40 60
Number of Processors
(a) Cost breakdown.
100
E
E 80
0
(9
i..
o 60
N--
p.-
_ 4o
I---
o
= 20
I,--
nx: 64; ny: 41 nz: 64
0 Coml_ i
[] Comm i
.....Z_TFF'r -! .......... ! ....... !....... i ..................
o i Maf-iMaf
.....*...i..T.ridi.ag.................i.........
]
20 40 60
Number of Processors
(b) Speedup.
Figure 9: Cost breakdown and speedup for small test case with 64 spanwise grid points.
Figure 10: Cost breakdown and speedup for medium test case with 64 spanwise grid points.
[Panels: (a) Cost breakdown. (b) Speedup. Both are plotted against Number of Processors; nx: 128; ny: 41; nz: 64. Legend: Comp, Comm, FFT, Mat-Mat, Tridiag, Pentadiag.]
Figure 11: Cost breakdown and speedup for large test case with 64 spanwise grid points.
[Panels: (a) Cost breakdown. (b) Speedup. Both are plotted against Number of Processors; nx: 256; ny: 41; nz: 64. Legend: Comp, Comm, FFT, Mat-Mat, Tridiag, Pentadiag.]
Figure 12: Message volume and size for one time step of small test suite.
[Panels: (a) Message volume. (b) Message size (logarithmic vertical axis). Both are plotted against Number of Processors; nx: 64; ny: 41; nz: variable. Curves: nz = 32, nz = 64, nz = 128.]
Figure 13: Communication bandwidth per processor as function of message size.
[Horizontal axis: Message Size, Mbytes (logarithmic); nx: 64; ny: 41; nz: variable. Curves: nz: 32, nz: 64, nz: 128.]
Figure 14: Performance of PSDNS code with varying streamwise dimension.
[Panels: (a) Computational costs. (b) Slowdown. Both are plotted against nx for 16 processors; nx: variable; ny: 41; nz: 32. Legend: Comp, Comm, FFT, Mat-Mat, Tridiag, Pentadiag; panel (a) also shows the comp + comm curve.]
Figure 15: Performance of PSDNS code with varying wall normal dimension.
[Panels: (a) Computational costs. (b) Slowdown. Both are plotted against ny for 16 processors; nx: 64; ny: variable; nz: 32. Legend: Comp, Comm, FFT, Mat-Mat, Tridiag, Pentadiag; panel (a) also shows the comp + comm curve.]
Figure 16: Performance of PSDNS code with varying spanwise dimension.
[Panels: (a) Computational costs. (b) Slowdown. Both are plotted against nz for 16 processors; nx: 64; ny: 41; nz: variable. Legend: Comp, Comm, FFT, Mat-Mat, Tridiag, Pentadiag; panel (a) also shows the comp + comm curve.]
Figure 17: Memory size of executable codes in large test suite.
[Horizontal axis: Number of Processors; logarithmic vertical axis; nx: 256; ny: 41; nz: variable.]
Figure 18: Computational costs for large simulation: IBM SP1 versus Cray Y/MP and Cray C-90.
[Horizontal axis: Number of Processors; nx: 896; ny: 61; nz: 32. Legend for the SP1 curves: Comp, Comm, FFT, Mat-Mat, Tridiag, Pentadiag; the Cray Y/MP and C-90 results are marked for comparison.]
REPORT DOCUMENTATION PAGE
Form Approved, OMB No. 0704-0188
Public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources,
gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this
collection of information, including suggestions for reducing this burden, to Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson
Davis Highway, Suite 1204, Arlington, VA 22202-4302, and to the Office of Management and Budget, Paperwork Reduction Project (0704-0188), Washington, DC 20503.
1. AGENCY USE ONLY (Leave blank)
2. REPORT DATE
August 1994
3. REPORT TYPE AND DATES COVERED
Contractor Report
4. TITLE AND SUBTITLE
Scalability Study of Parallel Spatial Direct Numerical Simulation
Code on IBM SP1 Parallel Supercomputer
5. FUNDING NUMBERS
C NAS1-19480
WU 505-90-52-01
6. AUTHOR(S)
Hanebutte, Ulf R., Ronald D. Joslin, and Mohammad Zubair
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES)
Institute for Computer Applications in Science
and Engineering
Mail Stop 132C, NASA Langley Research Center
Hampton, VA 23681-0001
8. PERFORMING ORGANIZATION REPORT NUMBER
ICASE Report No. 94-80
9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES)
National Aeronautics and Space Administration
Langley Research Center
Hampton, VA 23681-0001
10. SPONSORING/MONITORING AGENCY REPORT NUMBER
NASA CR-194975
ICASE Report No. 94-80
11. SUPPLEMENTARY NOTES
Langley Technical Monitor: Michael F. Card
Final Report
To be submitted to the Journal of Scientific Computing
12a. DISTRIBUTION/AVAILABILITY STATEMENT
Unclassified-Unlimited
Subject Category 60, 61
12b. DISTRIBUTION CODE
13. ABSTRACT (Maximum 200 words)
The implementation and the performance of a parallel spatial direct numerical simulation (PSDNS) code are reported
for the IBM SP1 supercomputer. The spatially evolving disturbances that are associated with laminar-to-turbulent
transition in three-dimensional boundary-layer flows are computed with the PSDNS code. By remapping the distributed data
structure during the course of the calculation, optimized serial library routines can be utilized that substantially
increase the computational performance. Although the remapping incurs a high communication penalty, the parallel
efficiency of the code remains above 40 percent for all performed calculations. By using appropriate compile options
and optimized library routines, the serial code achieves 52-56 Mflops on a single node of the SP1 (45 percent of
theoretical peak performance). The actual performance of the PSDNS code on the SP1 is evaluated with a "real
world" simulation that consists of 1.7 million grid points. One time step of this simulation is calculated on eight
nodes of the SP1 in the same time as required by a Cray Y/MP for the same simulation. The scalability information
provides estimated computational costs that match the actual costs relative to changes in the number of grid points.
14. SUBJECT TERMS
spatial direct numerical simulation, parallel computing, three-dimensional boundary-layer flow
15. NUMBER OF PAGES
20
16. PRICE CODE
A03
17. SECURITY CLASSIFICATION OF REPORT
Unclassified
18. SECURITY CLASSIFICATION OF THIS PAGE
Unclassified
19. SECURITY CLASSIFICATION OF ABSTRACT
20. LIMITATION OF ABSTRACT
NSN 7540-01-280-5500
Standard Form 298 (Rev. 2-89)
Prescribed by ANSI Std. Z39-18
298-102
U.S. GOVERNMENT PRINTING OFFICE: 1994 - 628-064/23061
