Recently, researchers at NASA Ames have de ned a set of computational benchmarks designed to measure the performance of parallel supercomputers. In this paper, we describe the parallel implementation of the ve kernel benchmarks from this suite on the IBM SP2, a scalable, distributed-memory parallel computer. High-performance implementations of these kernels have been obtained by mapping the computation of these kernels to the underlying architecture of the SP2 machine. Performance results for the SP2 are compared with publicly available results for other high-performance computers.
Introduction
Recently, researchers at NASA Ames have de ned a set of computational benchmarks for the performance evaluation of parallel supercomputers for large scienti c applications 3, 4] . Known as the NAS parallel benchmarks, this set has become an increasingly recognized means of quantifying performance of high-performance computers on a range of algorithms of interest to many users of such machines. A key feature of these benchmarks is that the choice of data structures, algorithms, processor allocation and memory usage are left open to the discretion of the implementer. In other words, an implementer has the exibility to design an algorithm which matches the target machine. This is an important feature which motivates researchers working in the area of high performance parallel algorithms to investigate e cient ways of implementing the benchmarks.
The NAS Kernel Benchmarks
The NAS benchmark suite consists of ve kernel benchmarks and three simulated computational uid dynamics application benchmarks. In this paper we focus on the implementation of kernel benchmarks on IBM SP2; the implementation of the simulated application benchmarks is discussed in 9, 10, 11] .
The ve kernel benchmarks vary in their computation and communication requirements:
EP: This is an Embarrassingly Parallel kernel designed to measure primarily oatingpoint compute performance. It requires minimal interprocessor communication.
IBM POWER Parallel Systems, Neighborhood Rd., Kingston, NY 12401 MG: This is a 3-D multigrid kernel requiring highly structured interprocessor communication.
CG: This is a conjugate gradient kernel to compute an approximation to the smallest eigenvalue of a large, sparse matrix. This kernel tests irregular long-distance communication.
FT: This kernel solves a 3-D partial di erential equation using FFTs and is a rigorous test of long-distance communication performance.
IS: This is a large integer sort operation testing both integer computation speed and interprocessor communication. This kernel stresses the integer performance of the underlying node.
The IBM SP2 Parallel System
The SP2 is the second o ering in IBM's Scalable POWERparallel family of parallel systems based on IBM's RS/6000 processor technology. The SP2 is a distributed-memory system consisting of up to 128 processor nodes connected by a High-Performance Switch. Three di erent processor nodes are available, based on RS/6000 Model 370, 390, and 590 CPU planars. (These nodes are also known as Thin 62, Thin 66, and Wide nodes, respectively). The Model 370 processor is based on the original POWER architecture while the 390 and 590 processors are based on POWER2 architecture. Each compute processor has at least 64 MB of local memory (wide nodes can have up to 2 GB of local memory per node). Each compute node has a locally attached disk.
All performance results reported in this paper are made on a wide-node SP2 system with POWER2 (Model 590) compute nodes. POWER2 processors have two oating point units and two xed point units, and therefore can perform two xed-point instructions and two oatingpoint instructions every cycle, if no dependencies exist. The SP2 wide nodes are clocked at 66.7 MHz, and thus each node has a peak oating-point performance of 266 MFLOP/s based on two multiply-add instructions per cycle. This model has a 256 KB four-way set-associative data cache with a cache line size of 256 bytes. An important feature of the POWER2 architecture is the availability of oating-point quad-word load/store operations. Using both xed-point units, this results in an e ective bandwidth of four double-words per cycle between cache and the oating-point registers. Furthermore, all POWER2 nodes can fetch a complete cache line from memory in eight cycles after the rst word of the line arrives; on the SP2 wide nodes, this represents a local memory bandwidth of 32 bytes per cycle, or 2134 MB/s. The very high data access rates between cache and registers and between memory and cache make possible the very high sustained performance of the POWER2 nodes in the SP2. Many of the optimizations discussed in this paper are designed to exploit these capabilities.
The High-Performance Switch is a multi-stage packet switch providing a peak point-topoint bandwidth of 40 MB/s in each direction between any two nodes in the system. For the wide-node system, the sustained application bu er to application bu er transfer rate is approximately 35 MB/s for a uni-directional transfer measured as one half of the time necessary for a round-trip \ping" operation between two compute nodes. The latency (i.e. the time for a zero-byte message) measured in the same manner is approximately 40 microseconds on the SP2. In the case where a compute processor simultaneously sends and receives di erent messages, the aggregate (incoming plus outgoing) bandwidth at this node is approximately 48 MB/s on the wide-node system. This is the transfer rate observed when two nodes exchange long messages, a common communication operation in many parallel algorithms.
The Embarrassingly Parallel Benchmark
In this benchmark, two-dimensional statistics are accumulated from a large number of Gaussian pseudo-random numbers, which are generated according to a particular scheme that is well suited for parallel computation. This problem is typical of many "Monte-Carlo" applications. Since it requires very little communication, this benchmark measures the compute performance of the underlying node. The problem has been de ned in two sizes: class A, whose size is n = 2 28 , and class B problem size is four times bigger. Because the problem scales very nicely, it is su cient to restrict ourselves to the class B problem on a single processor.
Statement of the EP Problem
Generate pairs of Gaussian random deviates, also called two independent normally distributed variables, according to a speci c scheme, see 7] , and tabulate the number of pairs in successive square annuli. (?2 log t j )=t j , where log denotes the natural logarithm. Then X k and Y k are independent normally distributed variables with zero mean and unit variance. Approximately n =4 pairs will be constructed in this manner. Finally, for 0 l 9 tabulate Q l as the count of the pairs (X k ; Y k ) that lie in the square annulus l max(jX k j; jY k j) < l + 1, and output the ten Q l counts.
On a p processor machine, every processor generates the statistics for a set of n=p points. This is done in parallel without any inter-processor communication. The only communication in this problem is to add the 10 sums from various processors at the end, which is insigni cant. Thus, the only optimization we did was to improve the performance of the single node. We now summarize some of the major techniques employed to improve the single node performance. The details can be found in 2].
An Improved Random Number Generator
We used an improved random number generator which utilizes the fused multiply-add unit of RS/6000. On RS/6000 and POWERPC machines, during the multiply-add operation, all 106 bits of the product are added to the 53 bit operand. This results in the best possible accuracy. On a POWER2 node, the new random number generator generates approximately 40 million random numbers per second.
A Table Based Algorithm for Generating Gaussian deviates and their classi cation
The Gaussian deviates are generated using the function f(t) = p (?2 log t)=t. We subdivide the interval (0,1) into a set of discrete points t i = ih where h is the sub-interval size. We construct the table of points f(t i ). The function f(t) is a monotonic function in the range of interest 0 < t < 1. Now t = 0 is a singularity of f(t). However, only a very small fraction of the random numbers have t close to zero. Therefore, we handle the t values in rst interval separately as a rare event.
The tables f(t i ) is used to decide the bin number l where this random pair lies. Thus, unless we are close to a bin boundary, there is no need to compute f(t) with high precision.
Thus for most of the cases when we are not close to a bin boundary we use the table values f(t i ) and f(t i+1 ) bordering f(t) to get the bin number.
Performance Tuning for Power Architecture
A large amount of ine ciency in the generic code supplied by NASA is due to the overheads associated with the conversion from a oating point number to an integer. This conversion is required to get the table index and the bin index. The compiler calls a function routine which takes many cycles. In this conversion, the function routine has to take into account all possibilities including negative integers and over ows. However, in our case, the integers are ?1. These bits are then loaded in a xed point register. This store/load combination could introduce several cycles of delay and the code must be scheduled to do useful work during this period.
We also did some additional tuning which is generally applicable to high performance RISC workstations 2].
In the NAS parallel benchmark results report, 3], changes were made to the EP benchmark. In Section 2.1 of 3] the benchmark authors state that \the intent of the EP benchmark is to provide an accuracy and performance check on the FORTRAN LOG and SQRT intrinsic ..." and thus they made two changes. Brie y, the changes disallows the use of table look-up and also the construction of the composite function SQRT(-LOG(X)). In Tables 2a and 2b of 3] EP timing based on the table look-up approach are given for Cray C90, Cray T3D, IBM SP1, and IBM RS6000-590. Here, for brevity, we only mention that the table based approach is about 4.5 times faster on SP2 machine than the non-table based approach. The results reported in this paper (See Table 2 Figure 1 illustrates the serial implementation of the MG algorithm. Here, k denotes the grid level such that the k-th grid has 2 k points in each dimension, and z k M k r k . The coe cients of the operators A, P, Q, and S shown in Figure 1 are given in 3]. These operators represent three-dimensional, nearest-neighbor, 27-point stencils, but with some coe cients set to zero. A single layer of "ghost points" is added around the exterior boundary of the computational mesh to facilitate evaluation of these stencils for points on the external boundary. Values of the solution at these ghost points are updated using explicit copies of active data points consistent with the speci ed periodic boundary conditions. These copies are performed after each update (e.g. for r k ; z k ; ) on each grid level.
Parallel Implementation
The parallel implementation of the MG kernel is a data parallel algorithm applied at each grid level. The computational grid is subdivided into subdomains using a three-dimensional block data decomposition, with a one-to-one mapping of subdomains to processors. Processors are logically con gured in a three-dimensional grid; for example, 32 processors are con gured as a 2 4 4 processor grid. Each processor applies the stencil operations to grid points in its subdomain, requiring interprocessor communication to access data needed for the evaluation of the points on the boundary of the subdomain. This decomposition is applied only on ( ner) grids above a speci ed grid level. For grids at or below this level (i.e. on coarser grids), the distributed data is combined and replicated in each processor, and each processor then does the computation for all points in these coarser grids. Figure 2 summarizes the parallel implementation of the down cycle (residual restriction) included in Figure 1 . In Figure 2 , r k denotes the residual for points for the speci ed subdomain, and R k is the residual for all points on level k. words for each operation. Note that single "cornerpoint" values are moved in successive communication steps as part of the much longer messages. Furthermore, in certain phases of the algorithm, it is necessary only to communicate in the positive (or negative) coordinate directions, therefore requiring only 3 distinct communication steps. All such communication is implemented using non-blocking, double-bu ered, point-to-point send and receive operations. This communication operation also serves to enforce periodic boundary conditions, and is invoked at precisely the same phases in the parallel algorithm as the copy operation is done in the serial implementation.
The gather operation requires interprocessor communication in order to concatenate the distributed data and then replicate the concatenated data in each processor. In practice, we use k cutoff = 3, and we observe that the time necessary to perform the gather operation is negligible since k cutoff << K (= 8). 
Our tuning of this kernel had three components:
1. selecting an approach that minimizes communication and computation costs, 2. using the memory hierarchy of the processing nodes e ectively, and 3. striving for maximal use of the processors on the inner loop. The choice of a general approach was considerably simpli ed by the decision of the NAS Benchmark Committee to disallow approaches based on factoring the sparse matrix A. The choice narrows to either a one-dimensional or a two-dimensional decomposition of this matrix. Each decomposition requires the same amount of computation but the one-dimensional de- Nk multiply-adds, communication of N(P 1 + P 2 ) values, and P log 2 P messages.
The key to e ective use of the memory hierarchy on each node is keeping x j in cache during the matrix-vector multiplication. This can be achieved by an appropriate factoring of the number of processors P into P 1 and P 2 . This argues for making P 2 large, but if P 1 P 2 then the communication cost will mount. In addition, there is a measurable loop overhead for the inner loop that gets amortized over k=P 2 multiply-adds. This also argues for keeping P 2 from getting too big. Careful balancing of the various costs results in the choices for P 2 given in the following table:
The total number of processors, P. Table 1 . The number of processors in a row, P 2 for the class B problem.
The inner loop computes the dot product of x j with one row of the local submatrix. This entails about k=P 2 multiply-adds. Since the matrix is sparse, each multiply-add requires three loads: a value >from A ij , the value's column index, and the corresponding value from x j . If the column index were used to subscript the vector directly then it must be shifted (multiplied by the size of a doubleword) in order to be used. This additional xed-point cycle can be eliminated using the Fortran 90 pointer construct. Thus, each multiply-add requires 3 loads, each taking one cycle on the POWER architecture. On the POWER2 architecture, which has two xed-point units, the three loads take 1.5 cycles. Each oating-point unit is capable of executing a multiply-add every cycle, but the result is not ready until the third cycle. Thus, accumulation of a dot-product into a single register requires 2 cycles per multiply-add. This would be a bottleneck on the POWER2 architecture, so the alternate terms of the dot-product are accumulated in two separate registers which are added together at the end of the loop. Assuming the input vector ts in cache, the resulting code can run on the POWER (POWER2) architecture at about 3 cycles (1.5 cycles) per multiply-add, plus the cost of cache misses on the data structures representing the A matrix.
This cost depends on the cache size, but ranges from 3=8 cycles (on a wide-node SP2) to 3=2 cycles (on an SP1) per multiply-add.
The 3-D Fourier Transform Benchmark
In this benchmark a Poisson partial di erential equation (PDE) is solved using 3-D forward and inverse discrete Fourier transform (DFT). Basically, this benchmark rst requires a forward 3-D t computation. Then in a loop the transformed data is multiplied by a coe cient array followed by an inverse 3-D t computation. We now give a brief description of the benchmark for the Class B problem size. For details one can refer to 4].
Statement of the FFT Problem
Initialization phase:
Set n 1 = 512, n 2 = 256, n 3 = 256, and = 10 ?6 . Generate 2 n 1 n 2 n 3 64-bit real numbers using the pseudorandom generator outlined in 4]. Assign these real numbers to a complex array, U(0 : n 1 ? 1; 0 : n 2 ? 1; 0 : n 3 ? 1) such that two consecutive real elements are assigned to an element of U. end for
Parallel Implementation
The major e ort in parallelizing the above benchmark is to e ciently implement the 3-D inverse DFT computation inside the for loop. An over all description of our algorithm is as follows. Initially the 3-D data is assumed to be distributed across the processors along the third dimensions. We rst do FFT computation along the rst dimension. This computation does not require any communication. Second, we move data between the processors such that the new distribution across the processors is along the rst dimension. Third, we do FFT computation along the second and third dimensions locally on a node without any communication.
The implementation details can be found in 1].
On a single node we use FFT routines of the Engineering and Scienti c Subroutine Library (ESSL) 13]. We implemented the complete benchmark using 1-D FFT routines along with some data movement routines blocked for cache. This results in a better cache behavior of the RS/6000 node. 
The central idea of the algorithm is as follows. At each node, we sort keys into a certain number of buckets based on some key bits. For many reasonable distribution of keys, sorting on the middle bits assures nearly uniform key density distribution across all buckets. If there are nb buckets, nb/p of them are sent to each processor using a global transpose communication routine. Here, we assume p to be a power of two. Each processor receives buckets from all processors corresponding to some of the middle key bits. At this stage, for each key value, there are on the average n=2 m keys. For the NAS IS problems this average is 16. Therefore, for e ciency, we implement a distribution count sort 6, 5] . At the same time, at no additional cost, we implement sub-bucket sorting on the high order bits. Each processor assigns subbucket ranks to all keys received by it. These ranks along with all the sub-bucket counts are sent back to the originating processor using another global transpose of the same size. The originating processor computes the nal rank by adding the sub-bucket rank to the global rank o set for that sub-bucket. In the region of interest, i.e., n=2 m 1, both computing and communication scale with the number of processors.
Results
To date, two di erent speci cations of the NAS parallel benchmarks have been de ned. Class A refers to the original problem dimensions, while Class B denotes a more recent speci cation in which problem sizes have been increased by a factor of 4 in most of the kernels 3]. Table 2 summarizes the performance results for the Class B kernels. As is the convention, results are reported as ratios to the respective single-processor Cray C-90 time 3]. However, in the caption of Table 2 we have added the single-processor Cray C-90 times for each of the benchmarks. This allows one to compute wall clock times for each of the benchmarks. The SP2 results are for a wide-node system with 128 MB of memory per node, and were obtained using the user-space communication protocol in the IBM Message Passing Library 14] included as part of the IBM Parallel Environment 15] . For comparison, we have also included the performance results for other scalable parallel machines; these results are taken from 3, 16] . Note that the IBM SP2 EP results are based on the latest NAS rules which also require additional checksums 3]. The Intel Paragon performance results correspond to the better of the two environments, OSF1.2 or SunMos, wherever applicable.
In general, the SP2 performance per processor exceeds that of the other parallel machines shown in Tables 2 for all processor con gurations for which SP2 results are available. This performance is due primarily to the very high sustained performance of the SP2 wide nodes, combined with the relatively high sustained interprocessor communication bandwidth for these problems. One of the referee's noticed that for the EP benchmark (Table 2) , which requires very little communication, the Intel Paragon performance is almost as good as that of the SP2, whereas for the more communication-intensive benchmarks in Table 2 , the Paragon's performance is much worse than that of the SP2. Here is an explanation. First, in 16, p.5] the NAS benchmark authors note that the SunMos-turbo operating system for the for the Paragon allows both i860 processors on the node to be used for computation (in regular SunMos and OSF the second processor is used purely for communication). Additionally, Intel Paragon results in Table 2 do not include the computation of the two new check sums which adds extra time to the IBM numbers. Also, NAS ground rules require that only o cial vendor libraries be used for the intrinsic functions such as square root and log. Recently, IBM has made available an alternate library. With the new library we have been able to decrease the IBM SP2 EP numbers in Table 2 by nearly a factor of two.
Conclusion
In this paper, we have discussed the novel techniques used to implement the NAS kernel benchmarks on the IBM SP2 scalable parallel system. The central idea of these techniques is to match the implementation of these kernels with the underlying architecture of the SP2. Performance results have shown that the SP2 implementations compare very well with results currently available on other scalable parallel machines.
