Early benchmark results on the NEC SX-4 by Potma, K. et al.
DOCUMENT CONTROL SHEET 
ORIGINATOR'S REF. 
NLR TP 96464 U 
SECURITY CLASS. 
unclassified 
ORIGINATOR 
National Aerospace Laboratory NLR, Amsterdam, The Netherlands 
TITLE 
Early benchmark results on the NEC SX-4.
PRESENTED AT
Parallel CFD '96, Capri, Italy, 20-23 May 1996. 
AUTHORS 
K.  Potma, G.J. Hameetman, W. Loeve, 
and G. Poppinga 
DESCRIPTORS 
Acceptability Parallel processing (computers) 
Computational fluid dynamics Performance tests 
Computer aided design Requirements 
NEC computers Supercomputers 
ABSTRACT 
In the spring of 1995, NLR decided to replace its supercomputer within a 
year by a more powerful system. After some preliminary benchmarks, an NEC 
SX-4 shared memory vectorcomputer with 16 processors and 4 GByte of main 
memory was purchased, providing a peak performance of 32 GFlop/s. This 
system will be extended to a 32 processor system with 8 GByte main memory 
in January 1999. 
This article describes the results of the acceptance tests related to 
performance, which have been executed at NLR in the period June 17 - July 
15, 1996. 
DATE 
960724 
PP ref 
12 6 
NLR TECHNICAL PUBLICATION 
TP 96464 U 
EARLY BENCHMARK RESULTS 
ON THE NEC SX-4 
by 
K. Potma, G.J. Hameetman, 
W. Loeve, G. Poppinga 
This paper has been prepared for presentation at Parallel CFD '96, Capri, Italy, 20-23 May 1996. 
Prepared : 
Completed : 960724 
Order number : 051.334 
TYP. : LMT 
Contents 
1 Introduction 
2 Motivations to upgrade the NLR supercomputer 
2.1 Current trends in computer aided engineering 
2.2 Current trends in the supercomputer area 
2.3 Requirements 
2.4 Motivations to purchase an NEC SX-4 supercomputer 
2.5 Configuration 
3 The acceptance benchmark 
3.1 Benchmark codes 
3.2 Single job performance 
3.3 Additional performance issues 
4 The NAS parallel benchmark kernels 
5 Concluding remarks 
References 
3 Tables 
(12 pages in total) 
Early benchmark results on the NEC SX-4 supercomputer 
G.J. Hameetman, W. Loeve, G. Poppinga, K. Potma
Informatics Division, National Aerospace Laboratory NLR,
Anthony Fokkerweg 2, 1059 CM Amsterdam, the Netherlands 
In the spring of 1995, NLR decided to replace its supercomputer within a year by a 
more powerful system. After some preliminary benchmarks, an NEC SX-4 shared memory 
vectorcomputer with 16 processors and 4 GByte of main memory was purchased, providing 
a peak performance of 32 GFlop/s. This system will be extended to a 32 processor system 
with 8 GByte main memory in January 1999. 
This article describes the results of the acceptance tests related to performance, which 
have been executed at  NLR in the period June 17 - July 15, 1996. 
1. INTRODUCTION 
The National Aerospace Laboratory NLR is the central institute in the Netherlands for 
aerospace research. Its principal mission is to provide expert contributions to activities in 
aerospace and related fields. Since 1988, NLR has had supercomputer facilities at its disposal
to support organizations in the field of aircraft operations, space technology, aircraft 
utilization and aircraft development. Due to ever growing computational demands, NLR 
decided in the spring of 1995, to replace its supercomputer by a system which should be 
at least twenty times as powerful. After some preliminary investigations, NLR decided to 
purchase an NEC SX-4 shared memory vectorcomputer. 
The upgrade of the supercomputer capabilities is executed in two phases. A 16 processor
NEC SX-4 supercomputer, providing a peak performance of 32 GFlop/s, was installed
in June 1996, and will be extended to a 32 processor system with 8 GByte of main
memory in January 1999, providing a peak of 64 GFlop/s.
As part of the acceptance procedure, performance tests have been executed in June 1996 
at the National Aerospace Laboratory NLR. This article describes the preparations and 
performance results of these tests. 
In April 1996, a factory acceptance test was executed in Tokyo, Japan. This
test was executed on a benchmark system which was not completely similar to the
system ordered by NLR. As a result, performance figures had to be translated to NLR's
NEC SX-4/16 configuration. These results have been presented at Parallel CFD'96.
Acceptance tests have been executed to determine the single processor performance, the 
parallelization performance, the throughput performance, the interactive response times, 
the data-communication performance and the disk I/O performance. The results of these
final acceptance tests at NLR are reported in this paper. 
Section 2 describes the motivations to purchase a more powerful shared memory vector- 
computer, the NEC SX-4. Section 3 describes the actual acceptance benchmark. Section 4 
presents some additional results of the NAS Parallel Benchmark Kernels [2] which sup- 
ported the decision to purchase an NEC SX-4 supercomputer. Finally some concluding 
remarks are presented in the last section. 
2. MOTIVATIONS TO UPGRADE THE NLR SUPERCOMPUTER
At NLR supercomputing is applied for a range of compute intensive problems.
Supercomputing is integrated with desktop computation at NLR and in the organizations of
NLR's customers [6]. As far as required computing power is concerned, computational
fluid mechanics is the most demanding application at NLR. NLR's supercomputer also is
the kernel of the infrastructure for treating CFD aspects in projects that form part of the
Dutch HPCN program, which started in the beginning of 1996.
2.1. Current trends in computer aided engineering
In first instance CFD in computer aided engineering requires powerful computers for 
batch processing as well as for interactive processing via workstations. Engineering of 
advanced products is characterized by inter-disciplinary cooperation of engineers. This 
cooperation requires exchange of information between participants. As a result of these 
considerations, powerful computers for support of engineering have to be integrated in 
computer infrastructures, together with workstations and other servers such as those for 
information management. The powerful computers must also allow easy installation of 
commercially available software for support of engineering. 
The computational effort for flow simulation in computer aided engineering by industrial 
partners, provides the most demanding requirements for the NLR computer infrastructure. 
2.2. Current trends in the supercomputer area
Two major architectural classes can be distinguished for supercomputers. 
• Shared memory systems
These systems traditionally consisted of vectorprocessors connected to a shared
memory. The state of the art shared memory vectorcomputers contain up to 32
vectorprocessors connected to the same shared memory. The manufacturers supply
automatically parallelizing compilers and analysis tools to explore the parallelization
capabilities of an application.
• Distributed memory systems
Distributed memory systems often consist of a large number of scalar processors 
each with their private memory, connected by means of a communication network. 
Efficient programming models to support parallelization of codes on these architec- 
tures are not yet available. 
Recently, standard communication libraries, such as PVM and MPI, have been de- 
fined and are supported by most manufacturers. Usage of these systems can also 
be eased by usage of HPF, which is an extension of the Fortran 90 language with 
directives for distribution of data and operations. Given the data and operation 
distributions, the required communications are inserted by the HPF compiler. 
The major advantages of shared memory systems compared to distributed memory systems
are the high single processor performance, the relative programming ease and the
availability of automatic parallelizing compilers and analysis tools. The major disadvan- 
tages are the limitations on the total number of processors caused by the access speed of 
the shared memory, resulting in a limitation of their aggregate computing power, and the 
relatively expensive components. 
Several supercomputer manufacturers (NEC, SGI, Cray) also offer hybrid combinations 
of the above architectures. To relieve the limitations on the number of processors and
thus computing power, they couple clusters of shared memory machines by means of a 
fast communication network. 
2.3. Requirements 
By means of an inventory of the current and future NLR applications, requirements on 
the following characteristics of the new supercomputer have been determined: 
• Multi-disciplinary simulations
Due to the increased usage of multi-disciplinary simulations, the new supercomputer 
should support the integration of process- and data-management and visualization 
tools for computer aided design with other NLR computer platforms and with com- 
puter systems in customer organizations. 
• Computational power
The required computational power is determined by the following types of applica- 
tions: 
1. Calculation of flows around complete aircraft configurations with software
based on Euler and Reynolds averaged Navier-Stokes models, for the
calculation of aerodynamic designs of aircraft.
2. Development of software based on Large Eddy Simulation to generate infor- 
mation to improve the turbulence modeling in flow simulation methods. 
3. Design optimization by means of repeated usage of the first application steered 
by an optimization algorithm, to accelerate and improve the design. 
To execute these problems overnight, the computing power of a single
processor of the NEC SX-3/22 must be increased by a factor of 20 [5].
From the user's point of view and based on the current situation, the following
requirements are imposed on the new supercomputer:
• Single processor performance
Not all codes for which NLR's supercomputer is used are already parallelized or
easy to parallelize. The single processor performance of these non-parallelized codes
should not deteriorate.
• Memory capacity
The memory capacity should be significantly larger than the current memory ca- 
pacity (1 GByte), to avoid the situation where a non-parallelized application uses 
all memory but only one processor, and is thus blocking all other processors. More- 
over, most CFD applications on the NEC SX-3/22 were bounded by the available 
memory. 
• Memory access speed
Most CFD applications are memory intensive (the amount of memory references 
is of the same order as the amount of floating point operations). Access to main 
memory should not be a bottleneck to provide the processors with the required data. 
• Porting effort and supporting tools
Existing CFD codes have to be ported to the new supercomputer without requiring
much effort. The new machine should provide a user friendly development environment,
including tools and debuggers. Maintenance, adaptations and parallelization
of CFD codes should require little effort.
• Handling of short vectors
Current and future CFD codes use multi-grid algorithms to drive the CFD solver 
to convergence. Usage of multi-grid results in short vectorlengths at the coarser 
gridlevels, which should be handled efficiently. 
• Load balancing
Current and future CFD codes use multi-block grids to model complicated config- 
urations. Multi-block codes can cause load imbalance when the processing of the 
blocks is distributed over the processors. Means for dynamic load balancing should 
be provided. 
• Handling of indirect addressing
The generation of structured grids around a complicated configuration such as a 
complete aircraft is very time and man-power consuming. This process can be 
automated when unstructured grids are used. The usage of unstructured grids 
results in indirect addressing. Efficient instructions should be provided for gathering 
and scattering of indirectly addressed data. 
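The gather and scatter operations named in the last requirement can be illustrated with a minimal sketch in Python. The edge list, the node values and the flux formula are hypothetical and are not taken from any NLR code; the sketch only shows the access pattern (reading and accumulating through an index array) that unstructured grids produce.

```python
# Hypothetical unstructured connectivity: each edge joins two node indices.
edges = [(0, 2), (2, 1), (1, 3), (3, 0)]
node_value = [1.0, 2.0, 4.0, 8.0]


def gather(values, index):
    """Indirect read: collect values[i] for every i in the index list."""
    return [values[i] for i in index]


def scatter_add(target, index, contributions):
    """Indirect write: accumulate each contribution into target[i]."""
    for i, c in zip(index, contributions):
        target[i] += c


left = [a for a, _ in edges]
right = [b for _, b in edges]

# Gather node data onto the edges, form an illustrative edge flux, and
# scatter the flux back to both end nodes with opposite signs.
flux = [r - l for l, r in zip(gather(node_value, left), gather(node_value, right))]
residual = [0.0] * len(node_value)
scatter_add(residual, left, flux)
scatter_add(residual, right, [-f for f in flux])
```

On a vector machine the two list comprehensions and the accumulation loop correspond to hardware gather and scatter instructions; the contributions cancel when summed over all nodes because each edge adds and subtracts the same flux.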
2.4. Motivations to purchase an NEC SX-4 supercomputer 
Based upon the requirements in the previous section, NLR decided that a distributed 
memory system is not yet suited to serve as a general purpose supercomputer for the 
following reasons. The current (non-parallelized) applications can only be handled fast 
enough by a vectorprocessor. Development of applications on a distributed memory sys- 
tem requires so much manpower that it is only acceptable if they can not be executed 
within a reasonable amount of time on a shared memory machine. 
From the shared memory supercomputers, only the NEC SX-4 and the Cray T90 con- 
tain a processor which is powerful enough to avoid performance degradation for non- 
parallelized codes. 
After some preliminary benchmarks on several shared memory machines, NLR finally 
decided to purchase an NEC SX-4 supercomputer because it was superior to the Cray T90 
with respect to its price-performance ratio for typical NLR codes, which is also illustrated 
by the NAS parallel benchmark kernels (see section 4).
2.5. Configuration of the NEC hardware 
This subsection presents the initial and final configuration of the new NEC SX-4
supercomputer of NLR. For reference purposes, the NLR configuration of the NEC SX-3/22
supercomputer is also presented.
• Configuration of the NEC SX-3/22
The NEC SX-3/22 is a 2 processor shared memory vectorcomputer, with a clock
cycle time of 2.9 ns and 2 sets of vector pipes per processor (each set consisting of 
4 vector pipelines, capable of processing one floating point operation per machine 
cycle), leading to a peak performance of 2.75 GFlop/s per processor. 
The 1 GByte main memory unit of the NEC SX-3/22 has a peak bandwidth of 
24 GByte/s, regardless of the number of processors. The peak bandwidth of the 
4 GByte extended memory unit is equal to 2.75 GByte/s. 
• Initial configuration of the NEC SX-4/16 (phase 1)
The NEC SX-4/16 is a 16 processor shared memory vectorcomputer, with a clock 
cycle time of 8 ns and 8 sets of vector pipes per processor, leading to a peak perfor- 
mance of 2 GFlop/s per processor. 
The 4 GByte main memory unit of the NEC SX-4/16 has a peak bandwidth of 
16 GByte/s per processor. The peak bandwidth of the 8 GByte extended memory 
unit is equal to 8 GByte/s. 
• Final configuration of the NEC SX-4/32 (phase 2)
The NEC SX-4/32 is a 32 processor shared memory vectorcomputer, with a clock 
cycle time of 8 ns and 8 sets of vector pipes per processor, leading to a peak perfor- 
mance of 2 GFlop/s per processor. 
The 8 GByte main memory unit of the NEC SX-4/32 has a peak bandwidth of 
16 GByte/s per processor. The peak bandwidth of the 16 GByte extended memory 
unit is equal to 8 GByte/s. 
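The peak figures quoted above follow from the clock cycle time and the number of floating point results per cycle. The following Python sketch reproduces that arithmetic. The function name is illustrative, and the figure of 16 results per cycle for the SX-4 is an assumption derived here from the stated 8 ns cycle and 2 GFlop/s peak; it is not stated in the text.

```python
def peak_gflops(cycle_ns, flops_per_cycle):
    """Peak rate in GFlop/s: results per cycle divided by cycle time in ns."""
    return flops_per_cycle / cycle_ns


# SX-3/22: 2 sets x 4 pipelines x 1 operation per cycle, 2.9 ns cycle time.
sx3_per_proc = peak_gflops(2.9, 2 * 4 * 1)

# SX-4: 8 ns cycle time and 2 GFlop/s per processor imply 16 results per
# cycle (derived assumption, see the lead-in).
sx4_flops_per_cycle = 2.0 * 8.0
sx4_per_proc = peak_gflops(8.0, sx4_flops_per_cycle)

sx4_16_total = 16 * sx4_per_proc  # phase 1 configuration
sx4_32_total = 32 * sx4_per_proc  # phase 2 configuration
```

The SX-3/22 value evaluates to roughly 2.76 GFlop/s, which the text quotes as 2.75 GFlop/s; the SX-4 totals agree with the 32 GFlop/s and 64 GFlop/s mentioned in the introduction.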
3. THE ACCEPTANCE BENCHMARK 
In this section, the results of several performance tests are reported. 
Section 3.1 describes the benchmark codes used for the single-processor, multi-processor 
and throughput tests. In the other subsections, the results of the performance tests are 
reported, including reference figures for the NEC SX-3/22. 
3.1. Benchmark codes
As basic material for the single-processor, multi-processor and throughput benchmarks, 
the following codes have been used: 
• The code D2EUL [1] is the implementation of a 2D simulation method for the
solution of the Euler equations on unstructured grids. The convective terms are
discretized by using upwind flux differencing according to P. Roe. The scheme is
embedded in a multigrid method, which drives the solution to steady state. Second
order spatial accuracy is obtained by defect correction techniques.
• The code HEXADAP [3] describes the implementation of an unstructured, adaptive,
3D Euler solver. The grid is based on hexahedrons. The grid adaptation can add
cells by dividing a cell in one of the three directions as well as remove cells. The code
is developed for parallel vector machines and uses a coloring algorithm to improve
vectorization and parallelization.
• The code SOLEQS [4] describes a flow solver which computes the solution of the
Navier-Stokes equations on structured multi-block grids. It uses a multi-zone ap- 
proach. Turbulence is modeled by the Baldwin-Lomax turbulence model. The 
simulation method is suited for both stationary and time-dependent problems and 
uses a multigrid algorithm to improve convergence. 
3.2. Single job performance
Tests concerning the vectorization performance of the benchmark codes on the NEC
SX-3/22 and the vectorization and parallelization performance on the NEC SX-4/16 have
been performed and are reported in this section.
Single processor performance tests have been executed on the NEC SX-3/22 of NLR to
produce reference figures for the acceptance benchmark of the NEC SX-4/16.
Single and multi-processor runs have been executed on the NEC SX-4/16. The results
are presented in Table 1. The speed-up figures for the single processor runs on the
NEC SX-4/16 are relative to the NEC SX-3/22 and are based on the runtimes of the
complete code on both machines. The multi-processor speed-up figures are based upon
the runtimes of the compute intensive solver parts (excluding pre- and post-processing)
and are given relative to a single processor NEC SX-4/16 run.
Table 1
Performance of the single job runs on the NEC SX-3/22 and the NEC SX-4/16
code     machine  procs  Runtime   Runtime  Compute  speed-up
                         complete  solver   speed
D2EUL    SX-4         1       744      621   1081.7       1.5
D2EUL    SX-4         8       238       86   3414.8       7.2
D2EUL    SX-4        16       195       45   4124.1      13.8
HEXADAP  SX-3         1      6853     6249    331.8
HEXADAP  SX-4         1      3947     2950    618.2       1.7
HEXADAP  SX-4         8       848      452   2430.3       6.5
HEXADAP  SX-4        16       715      312   2915.1       9.5
SOLEQS   SX-3         1      8575     7969    551.9
SOLEQS   SX-4         1      3908     3442    896.7       2.2
SOLEQS   SX-4         8      1279      667   2743.0       5.2
SOLEQS   SX-4        16      1095      483   3207.5       7.1
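The speed-up column of Table 1 can be recomputed from the runtime columns. The Python sketch below does so using the two definitions given in the text: complete runtimes for the comparison with the NEC SX-3/22, and solver runtimes for the multi-processor runs. The parallel efficiency (speed-up divided by the number of processors) is an additional derived quantity that does not appear in the table; all function and variable names are illustrative.

```python
# (code, processors) -> (runtime complete, runtime solver) on the SX-4, Table 1.
sx4 = {
    ("D2EUL", 1): (744, 621), ("D2EUL", 8): (238, 86), ("D2EUL", 16): (195, 45),
    ("HEXADAP", 1): (3947, 2950), ("HEXADAP", 8): (848, 452),
    ("HEXADAP", 16): (715, 312),
    ("SOLEQS", 1): (3908, 3442), ("SOLEQS", 8): (1279, 667),
    ("SOLEQS", 16): (1095, 483),
}
# Complete runtimes on one SX-3/22 processor (Table 1 lists no D2EUL row).
sx3_complete = {"HEXADAP": 6853, "SOLEQS": 8575}


def single_processor_speedup(code):
    """SX-3/22 versus SX-4 on one processor, based on complete runtimes."""
    return sx3_complete[code] / sx4[(code, 1)][0]


def solver_speedup(code, procs):
    """Multi-processor speed-up relative to one SX-4 processor (solver part)."""
    return sx4[(code, 1)][1] / sx4[(code, procs)][1]


def parallel_efficiency(code, procs):
    """Fraction of the ideal speed-up that is actually obtained."""
    return solver_speedup(code, procs) / procs
```

For example, `solver_speedup("D2EUL", 16)` gives 621/45 = 13.8, which corresponds to a parallel efficiency of about 86% on 16 processors.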
3.3. Additional performance issues 
The NEC SX-4 is used at NLR as a general purpose compute server, which imposes 
additional performance requirements on the throughput performance, interactive response 
times, data-communication performance and the disk I/O performance.
A mixture of the benchmark codes defined in the previous section has been used to 
determine the throughput performance of the NEC SX-4/16. The mixture consisted of 
both single processor and multi-processor jobs. The ideal throughput time is defined as
the sum of CPU times of the jobs run in a dedicated machine, divided by 16 (the number 
of processors used for the throughput benchmark). The throughput benchmark showed a 
degradation of 24% with respect to this ideal throughput time. 
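The definition of the ideal throughput time can be written out as a short Python sketch. The CPU times and the measured elapsed time below are hypothetical values chosen for illustration, since the text does not list the job mix, and the degradation is interpreted here as the relative excess of the measured time over the ideal time, which is an assumption.

```python
def ideal_throughput_time(cpu_times, processors=16):
    """Sum of dedicated-machine CPU times divided by the processor count."""
    return sum(cpu_times) / processors


def degradation(measured_time, ideal_time):
    """Relative excess of the measured throughput time over the ideal time."""
    return (measured_time - ideal_time) / ideal_time


# Hypothetical job mix (CPU seconds) and hypothetical measured elapsed time.
cpu_times = [1600.0, 3200.0, 4800.0, 6400.0]
ideal = ideal_throughput_time(cpu_times)
loss = degradation(1240.0, ideal)
```

With these illustrative numbers the ideal time is 1000 s and a measured time of 1240 s corresponds to a degradation of 24%.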
An interactive stimulator has been developed which simulates the behaviour of several
interactive users logged on to the NEC SX-4/16. Requirements were imposed upon the
interactive response times of basic unix commands cat, cd, chmod, cp, csh, hostname, ls,
ksh, mkdir, mv, pwd, rm, rmdir, sh, vi, wc, who and upon the maximum degradation
of a concurrently running throughput benchmark. The interactive response times were 
comparable with the response times on the NEC SX-3/22 and the throughput degradation 
caused by the interactive stimulation was approximately 10%. 
Data communication tests have been performed between the NEC SX-4/16 and the 
general file server of NLR, resulting in a performance of up to 50 Mb/s for the transfer of 
a 100 MByte file with ftp. For NFS a transfer speed of up to 20 Mb/s has been obtained. 
The NEC SX-4/16 supercomputer is connected to two NEC RAID disks. With two 
I/O jobs executed in parallel, a disk I/O speed of 136 MByte/s was measured.
4. THE NAS PARALLEL BENCHMARK KERNELS
A well known benchmark is the set of NAS parallel benchmark kernels [2], selected by 
NASA Ames after evaluation of a number of large scale CFD and computational aero- 
sciences applications. From the NAS parallel benchmark data-base, performance results of 
the most relevant kernels are reported in Table 2. Price performance ratios are presented
in Table 3. In the following tables (FT) represents the Fast Fourier Transform, (LU) a 
CFD kernel based on LU decomposition, (SP) a CFD kernel based on a pentadiagonal 
solver and (BT) a CFD kernel based on a block-tridiagonal solver. 
Table 2 
NAS Parallel Benchmark: performance scaled to single processor Cray C90 
benchmark procs CRAY T90 NEC SX-4 SGI PC XL 
(FT) 8 11.7 13.8 1.0 
16 23.3 1.2 
(LU) 8 10.4 1.1
16 15.9 12.8 1.5
(SP) 8 11.4 12.1 1.6
16 17.0 21.7 2.3 
(BT) 8 11.1 12.3 1.5 
16 16.0 22.1 2.6 
Table 3
NAS Parallel Benchmark: sustained performance per dollar
machine                (LU)  (SP)  (BT)
NEC SX-4/16            1.57  3.11  3.26
SGI PC XL/16 (90 MHz)  1.51  2.28  2.56
CRAY T916              1.06  1.13  1.07
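To compare the machines in Table 3 directly, the figures can be normalized to one of them. The Python sketch below divides each entry by the corresponding CRAY T916 entry; the choice of the T916 as reference and all names are illustrative, and the column labels are read as LU, SP and BT.

```python
# Sustained performance per dollar, copied from Table 3.
table3 = {
    "NEC SX-4/16": {"LU": 1.57, "SP": 3.11, "BT": 3.26},
    "SGI PC XL/16": {"LU": 1.51, "SP": 2.28, "BT": 2.56},
    "CRAY T916": {"LU": 1.06, "SP": 1.13, "BT": 1.07},
}


def relative_to(reference, table):
    """Divide every entry by the entry of the reference machine."""
    ref = table[reference]
    return {
        machine: {kernel: value / ref[kernel] for kernel, value in row.items()}
        for machine, row in table.items()
    }


relative = relative_to("CRAY T916", table3)
```

For the NEC SX-4/16 this gives ratios of roughly 1.5 (LU), 2.8 (SP) and 3.0 (BT) relative to the CRAY T916.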
5. CONCLUDING REMARKS 
With only a couple of weeks of porting and optimizing we were able to obtain a reasonable
parallelization performance on the new NEC SX-4/16 supercomputer for three typical
NLR applications, mainly due to the availability of automatically parallelizing compilers 
and optimization tools. Porting these codes to a distributed memory machine would have 
required a significant effort (several months of manpower per code). 
Especially the D2EUL code shows a good scalability, mainly due to its large paral- 
lelization grainsize. The loop over all gridnodes containing most calculations has been 
parallelized using stripmining. 
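Stripmining, as used for D2EUL, splits one long loop into contiguous strips that can be handed to different processors while each strip remains a long vector loop. The following Python sketch illustrates the index bookkeeping only; the node count, the summation and all names are hypothetical and do not reproduce the D2EUL source.

```python
def strip_mine(n, strips):
    """Split range(n) into `strips` contiguous chunks of near-equal length."""
    base, extra = divmod(n, strips)
    bounds = []
    start = 0
    for s in range(strips):
        size = base + (1 if s < extra else 0)  # first `extra` strips get one more
        bounds.append((start, start + size))
        start += size
    return bounds


# Hypothetical loop over 1000 grid nodes, divided over 16 processors.
values = list(range(1000))
strips = strip_mine(len(values), 16)

# Each strip would run on its own processor; the inner sum stands in for
# the vectorizable work on the nodes of one strip.
partial = [sum(values[lo:hi]) for lo, hi in strips]
total = sum(partial)
```

Because each strip covers many grid nodes, the work per parallel task is large compared with the cost of starting it, which is what the text refers to as a large parallelization grainsize.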
The HEXADAP code has been parallelized by distributing the different colored loops 
used for vectorization of the code over the different processors, also resulting in a sufficient 
parallelization grainsize. But due to the limited number of colors the performance is 
degraded by load imbalance. 
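The effect of a limited number of colors on load balance can be illustrated with a simple model in Python. The model assumes that each colored loop is executed as a whole by one processor and that loops are assigned greedily to the least loaded processor; both the assignment strategy and the work sizes are hypothetical and are not taken from HEXADAP.

```python
import heapq


def max_load(work, procs):
    """Greedy assignment: largest loop first, always to the lightest processor."""
    loads = [0.0] * procs
    heapq.heapify(loads)
    for w in sorted(work, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + w)
    return max(loads)


def speedup_bound(work, procs):
    """Upper bound on the speed-up: total work over the heaviest processor load."""
    return sum(work) / max_load(work, procs)


# Hypothetical work per colored loop (8 colors, arbitrary units).
color_work = [40, 35, 30, 25, 20, 20, 15, 15]
bound_4 = speedup_bound(color_work, 4)
bound_8 = speedup_bound(color_work, 8)
bound_16 = speedup_bound(color_work, 16)
```

In this model the bound stops improving once the number of processors reaches the number of colors: with 8 colors, 16 processors give the same bound as 8, because the largest colored loop determines the elapsed time.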
SOLEQS has been optimized for the NEC SX-4 by collapsing loops and routines to 
obtain a larger grain size and enable re-use of previously calculated data. The two outer 
loops of the 3D nested loop structures are collapsed and parallelized, the inner loop is 
vectorized, resulting in a moderate parallelization grainsize and moderate speed-up figures. 
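Collapsing the two outer loops of a three-dimensional loop nest replaces two loop indices by a single combined index, which gives the parallelizer more iterations to distribute while the inner loop stays available for vectorization. The Python sketch below shows the index mapping with hypothetical grid dimensions and names; it is not an excerpt from SOLEQS.

```python
def nested_order(nk, nj, ni):
    """Original 3D loop nest: k and j outer loops, i inner loop."""
    return [(k, j, i) for k in range(nk) for j in range(nj) for i in range(ni)]


def collapsed_order(nk, nj, ni):
    """Outer loops collapsed into one index kj; the i loop is left unchanged."""
    order = []
    for kj in range(nk * nj):      # single loop to distribute over processors
        k, j = divmod(kj, nj)      # recover the original indices
        for i in range(ni):        # inner loop, candidate for vectorization
            order.append((k, j, i))
    return order


# Hypothetical block dimensions.
nk, nj, ni = 3, 4, 5
same_order = nested_order(nk, nj, ni) == collapsed_order(nk, nj, ni)
```

The collapsed loop has nk * nj iterations instead of nk, so the work can be divided more evenly over 16 processors even when nk alone is small; the grain of each parallel task is still limited to a few inner loops, which is consistent with the moderate speed-up reported for SOLEQS.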
The NAS parallel benchmark results presented in the previous section support the 
conclusion that the NEC SX-4 provides the best price performance ratio for CFD type 
problems (Table 3). Moreover, the NEC SX-4 shows the best scalability (Table 2). 
REFERENCES 
1. K. de Cock, High-lift system analysis method using unstructured meshes, AGARD- 
CP-515, 1992. 
2. S. Saini, D.H. Bailey, NAS Parallel Benchmark Results 12-95, Report NAS-95-021, 
December 1995. 
3. J.J.W. v.d. Vegt, H. v.d. Ven, Hexahedron based grid adaptation for future large eddy 
simulation, AGARD-CP-578, April 1996. 
4. M.E.S. Vogels, Design of SOLEQS, a program for the solution of the flow equations 
on a three-dimensional block-structured grid, NLR TR 92418 L, 1992.
5. W. Loeve, From R&D in Parallel CFD to a tool for Computer aided Engineering in 
Industry, Proceedings Parallel CFD'95 Pasadena, 1995. 
6. E.H. Baalbergen, W. Loeve, "SPINE: Software Platform for Computer Supported
Cooperative Work in Heterogeneous Computer and Software Environments", NLR TP
94359 L, August 1994.
