We describe the implementation of a fluid dynamical benchmark on the 256 node SUPRENUM-1 parallel computer. The benchmark, the Shallow Water Equations, is frequently used as a model for both oceanographic and atmospheric circulation. We describe the steps involved in implementing the algorithm on the SUPRENUM-1 and we provide details of performance.
Optimal SUPRENUM performance requires algorithms that compile into vector instructions with long vector lengths and that, as on many other MIMD systems, need relatively few communication operations. For such algorithms the system delivers a very impressive fraction of its theoretical peak rate. SUPRENUM software is excellent, including communication facilities and a fully vectorizing compiler for Fortran 77 which was used in this study.
We have measured 5.33 Mflops (64-bit arithmetic) for single node performance, and 1280 Mflops aggregate performance with 256 nodes, at efficiencies up to 95%. This compares well with vector and MIMD supercomputers and shows that SUPRENUM was among the fastest MIMD computers during 1992. Performance of 1530 Mflops was measured for the same algorithm on the CRAY YMP/8, and 543 Mflops was measured on the 128-node Intel iPSC/860. The SIMD Thinking Machines CM-200 delivers 5.25 Gflops (64-bit) and 8.09 Gflops (32-bit) for the benchmark. We also discuss the influence of physical cluster interconnection topology and asynchronous communication on SUPRENUM performance.
INTRODUCTION
The Shallow Water Equations are a standard model for atmospheric and oceanographic processes. Implementations of the algorithm have been used as benchmarks for vector and parallel supercomputer performance for many years [1] [2] [3] [4] [5]. The Shallow Water algorithm is very memory intensive, involving 14 variables per grid point, and accesses these using nine-point stencils, non-linear expressions and essential divisions. The combined effect provides a decidedly non-trivial test of any computer system. We have recently implemented the benchmark on the 256-node SUPRENUM-1 MIMD parallel supercomputer and report on the results in this paper (see also [5] [6] [7]).
The tests were run on the SUPRENUM-1 hardware at the GMD in Sankt Augustin, Germany, which was running the PEACE 3.0 operating system software. The Shallow Water code ran on a single node using 64-bit arithmetic at 5.33 Mflops and on a 256-node system at up to 1280 Mflops. Performance of 5.33 Mflops per node was quite impressive, especially as the code was not explicitly vectorized in any way.
In fact this single-node performance exceeds the typical performance we have seen for the same algorithm on the Intel iPSC/860 systems, despite the fact that the latter system's nodes have several times the peak performance of SUPRENUM's. We conclude that the SUPRENUM compiler is doing an excellent job of locating vectorizable statements, and in generating efficient pipelined vector instructions to implement such statements. Numerical results agreed to high precision with those from other machines. We expect that even higher per-node performance could be achieved by utilizing explicit optimizations, and by coding computationally intensive segments using SUPRENUM Fortran's array extensions (Fortran 90).
The multi-node performance compares well with the iPSC/860 hypercube, where an optimized Fortran version of the Shallow Water Equations runs at 543 Mflops (64-bit precision) on 128 processors [5], with the CRAY XMP, which solves the equations at a rate of 560 Mflops on 4 processors, and with the CRAY YMP, where 1530 Mflops has been attained using 8 processors. The SIMD Thinking Machines CM-200, however, is substantially faster, delivering 5.25 Gflops (64-bit) and 8.09 Gflops (32-bit) [8]. Some single-node SUPRENUM measurements reported here differ slightly from those reported a year ago [5] because of compiler and operating system changes. In that paper measurements were restricted to 16 nodes, as only a SUPRENUM prototype consisting of a single cluster was available. The main content of this paper is the measurement of performance on up to 256 nodes and the demonstration of good scaling behavior of the complete system.
THE SUPRENUM-1 SUPERCOMPUTER
The German SUPRENUM-1 computer couples up to 16 processor clusters with a network of 200 Mbit/sec busses. The busses were intended to be arranged as a rectangular grid with 4 horizontal and 4 vertical busses, although other configurations have also been employed (see below). Each cluster consists of 16 processors connected by a fast bus, along with I/O devices for communication to the global bus grid and to disk and host computers. There can be a dedicated disk for each cluster. Individual processors can deliver up to 20 Mflops (64-bit chained) or 10 Mflops (64-bit unchained) of computing power and support 8 Mbytes of memory, upgradable to 32 Mbytes. The high bandwidth of the bus network makes this an interesting machine for a wide range of applications, including those requiring long-range communication. No more than four communication steps are ever required between remote nodes (with four steps needed only if both a horizontal and a vertical bus must be traversed).
While SUPRENUM clusters are well defined by their interconnection bus, the connectivity between clusters can be modified by rewiring the connections. In principle this is simple, although in practice it is a major undertaking because there are severe physical constraints on the length of the busses involved and because each bus must close to form a ring. Each ring must visit from 4 to 6 clusters. During 1991, the SUPRENUM-1 clusters were connected in a simple ring (actually four parallel rings, although it was not possible to fully utilize the parallelism). In January 1992 the SUPRENUM-1 was re-configured as a full double matrix of busses. In June 1992 the topology was changed again so that each cluster has a direct connection to every other one and all communication operations require at most three steps. The results reported in this paper deal primarily with this last interconnection network, as it provided the best overall performance of the three interconnection schemes studied.

SUPRENUM software is characterized by the best support for MIMD scientific applications to be found among the various distributed memory MIMD vendors. The effort invested in development of libraries of high-level grid and communication primitives greatly eases the effort of moving applications to the computer, and also provides substantial high-level portability to other systems, since the communication library can be implemented in terms of low level primitives on any distributed system.
The first 16-processor prototype system was delivered in 1989 and the first operational 256 processor system became available in November 1991. The full system has a 5 Gflops peak rating and should have high realizable efficiency in appropriate applications, namely those where communication is relatively infrequent and where long vector lengths predominate.
THE SHALLOW WATER EQUATIONS BENCHMARK
As an example of the current capabilities of the SUPRENUM system we describe the implementation of a standard two-dimensional atmospheric model, the Shallow Water Equations, on the SUPRENUM-1. These equations provide a primitive but useful model of the dynamics of the atmosphere. Because the model is simple, yet captures features typical of more complex codes, it is frequently used in the atmospheric sciences community to benchmark computers [1, 2]. Furthermore, the model has been extensively analyzed mathematically and numerically [9, 10].
The Shallow Water Equations, without a Coriolis force term, take the form
where u and v are the velocity components in the x and y directions, P is pressure, ζ is the vorticity, ζ = ∂v/∂x − ∂u/∂y, and H, related to the height field, is given by:
It is required to solve these equations in a rectangle a ≤ x ≤ b, c ≤ y ≤ d.
Periodic boundary conditions are imposed on u, v, and P, each of which satisfies f(x+b, y) = f(x+a, y), f(x, y+d) = f(x, y+c).
A scaling of the equations results in a slightly simpler format. Introduce mass fluxes U = Pu and V = Pv and the potential vorticity Z = ζ/P, in terms of which the equations reduce to:
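For orientation, the vorticity form of the Shallow Water Equations that is standard in the benchmark literature can be written with the symbols defined above as follows. This is a sketch under that assumption: the expression for H and the arrangement of signs are taken from the standard form, and the scaled system follows from it by substituting U = Pu, V = Pv and Z = ζ/P.

```latex
% Unscaled system, no Coriolis term (assumed standard vorticity form)
\begin{align}
\frac{\partial u}{\partial t} - \zeta v + \frac{\partial H}{\partial x} &= 0, \\
\frac{\partial v}{\partial t} + \zeta u + \frac{\partial H}{\partial y} &= 0, \\
\frac{\partial P}{\partial t} + \frac{\partial (P u)}{\partial x}
                              + \frac{\partial (P v)}{\partial y} &= 0,
\end{align}
\[
\zeta = \frac{\partial v}{\partial x} - \frac{\partial u}{\partial y},
\qquad
H = P + \tfrac{1}{2}\left(u^{2} + v^{2}\right).
\]

% Scaled system with U = P u, V = P v, Z = \zeta / P, so that Z V = \zeta v
\begin{align}
\frac{\partial u}{\partial t} - Z V + \frac{\partial H}{\partial x} &= 0, \\
\frac{\partial v}{\partial t} + Z U + \frac{\partial H}{\partial y} &= 0, \\
\frac{\partial P}{\partial t} + \frac{\partial U}{\partial x}
                              + \frac{\partial V}{\partial y} &= 0.
\end{align}
```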
DISCRETIZATION
We have discretized the above equations on a rectangular staggered grid with periodic boundary conditions. The variables P and H have integer subscripts, Z has half-integer subscripts, U has integer and half-integer subscripts, and V has half-integer and integer subscripts respectively.
Initial conditions are chosen to satisfy ∇ · v = 0 at all times. We difference in time using the Leap-frog method. We then apply a time filter to avoid weak instabilities inherent in the Leap-frog scheme:
where α is a filtering parameter. The filtered values of the variables at the previous time-step are used in computing new values at the next time-step. For a complete description of the discretization we refer to [1].
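The interplay of the Leap-frog step and the time filter can be illustrated on a scalar equation dy/dt = f(y). The sketch below is written in Python rather than the paper's Fortran 77, and it assumes the commonly used Robert-Asselin form of the filter, in which the current value is nudged by α times the second time difference; the function names and the forward Euler start-up step are illustrative choices, not taken from the benchmark code.

```python
def asselin_filter(old, cur, new, alpha):
    """Assumed filter form: cur + alpha * (new - 2*cur + old)."""
    return cur + alpha * (new - 2.0 * cur + old)


def leapfrog_filtered(f, y0, dt, nsteps, alpha):
    """Advance dy/dt = f(y) by nsteps steps of size dt (nsteps >= 1).

    The first step is forward Euler (Leap-frog needs two time levels);
    every later step is Leap-frog followed by filtering of the middle level.
    """
    old = y0
    cur = y0 + dt * f(y0)
    for _ in range(nsteps - 1):
        new = old + 2.0 * dt * f(cur)                 # Leap-frog: centred in time
        old = asselin_filter(old, cur, new, alpha)    # filtered value becomes "previous"
        cur = new
    return cur
```

Note that the filtered middle level is what is carried forward as the "previous" value, matching the statement that filtered values at the previous time-step are used to compute the next one. A solution that is linear in time has zero second difference, so the filter leaves it unchanged.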
SERIAL FORTRAN IMPLEMENTATION
The Fortran code implementing the above algorithm involves a 2D rectangular grid with variables:
There are three main loops, two corresponding to the Leap-frog time propagation of various quantities, and one for the filtering step. Execution of these three loops completes a single time step, which is then repeated until the desired temporal simulation interval has been achieved. A typical code sequence, used in the updating of the U , V and P variables, is:
continue
Each such loop is followed by code to implement the periodic boundary conditions. In the above case, the corresponding boundary code takes the form:
(1,j+1)
pnew(m+1,j) = pnew(1,j)
Note that there are such loops for both the horizontal and vertical boundaries, and in addition some corner values are copied as single items.
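The periodic continuation described above can be sketched for a single field as follows. The sketch uses Python with 0-based indexing, so the Fortran statement pnew(m+1,j) = pnew(1,j) becomes pnew[m][j] = pnew[0][j]. The wrap in the second index and the single corner copy are assumptions modeled on that statement and on the description of the horizontal, vertical and corner copies; the function name is hypothetical.

```python
def periodic_continuation(pnew, m, n):
    """Fill the extra row and column of an (m+1) x (n+1) field by periodicity.

    pnew is a list of m+1 rows, each holding n+1 values. Rows 0..m-1 and
    columns 0..n-1 hold computed values; row m and column n are copies.
    """
    for j in range(n):              # wrap in the first index
        pnew[m][j] = pnew[0][j]
    for i in range(m):              # wrap in the second index
        pnew[i][n] = pnew[i][0]
    pnew[m][n] = pnew[0][0]         # corner value copied as a single item
    return pnew
```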
Excluding the boundary computations, the three major loops in a time step involve 65 arithmetic operations per grid point. Furthermore 14 physical variables must be stored per grid point, which significantly limits the largest grid size that can be accommodated in a single node.
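The operation count gives a direct way to convert between run time and flop rate. The following Python sketch uses only the figure of 65 operations per grid point per time step quoted above; the function names are illustrative, and boundary work is ignored as in the text.

```python
FLOPS_PER_POINT = 65  # arithmetic operations per grid point per time step (from the text)


def mflops(nx, ny, steps, seconds):
    """Flop rate in Mflops for `steps` time steps on an nx x ny grid."""
    return FLOPS_PER_POINT * nx * ny * steps / seconds / 1.0e6


def seconds_per_step(nx, ny, rate_mflops):
    """Time for one step on an nx x ny grid at the given flop rate."""
    return FLOPS_PER_POINT * nx * ny / (rate_mflops * 1.0e6)
```

Under this count, a 32×1024 grid running at the measured 5.33 Mflops takes roughly 0.4 seconds per time step.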
SUPRENUM IMPLEMENTATION
To speed the implementation effort we decided to test the idea of porting a generic MIMD parallel version of the Shallow Water Equations to the SUPRENUM-1. The work was based on a parallel code developed by McBryan and Pozo [8, 11]. The code was developed for a generic class of MIMD parallel computers, based on the assumption of a single process per node model. The code was developed and tested using a simulator for the generic model developed previously [12, 13]. The simulator supports versions of the Intel iPSC communication protocols.
SUPRENUM supports a library interface allowing both Intel iPSC1 and iPSC2 communication interfaces to be utilized. It suffices to declare the main program of both the host and node processes to be SUPRENUM tasks, while the rest of each program may remain as a pure Intel iPSC program. This approach greatly eased code modifications that would have been required to develop a complete SUPRENUM-1 implementation from scratch. In fact the code was ported and fully working within hours. The program ran immediately and gave correct results on the first try. This demonstrates the advantages of developing MIMD codes initially using simulators, and transferring to hardware only when the simulations are running correctly.
Since the code involves rectangular grid arrays and a nine-point stencil, the parallelization of the code is straightforward. A logical mapping of the processors to a two-dimensional array is selected. Thus if P = p_x p_y is a factorization of the number of processors P, then we regard the processors as arranged in a p_x × p_y logical grid. The large arrays representing physical variables (u, v, etc.) are then decomposed into equal sized blocks, with one block assigned to each processor. For simplicity we assume that the x and y grid dimensions are exact multiples of the corresponding processor numbers p_x and p_y. Each such block is then stored in an array of the same shape, but which has an extra boundary row or column provided on each of the four sides. These extra boundary points are used to maintain copies of the true (i.e. interior) boundary points of the four neighboring processors.

The three main loops of the time step are decomposed into equivalent loops performed by each processor on the interior points of the block assigned to that processor. Prior to each loop, the boundary values are updated by exchanging appropriate values between neighboring processors, following a synchronization to ensure that all neighbors have completed changes. Such exchanging generally requires communication, which was implemented by communicating large packets for each of the four sides of a block.
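The bookkeeping behind this decomposition can be sketched as follows. This is a minimal Python illustration, not the authors' code: the function name, the direction labels and the use of (ix, iy) processor coordinates are assumptions, and the periodic wrap of neighbors reflects the periodic boundary conditions of the benchmark.

```python
def local_block(mx, my, px, py, ix, iy):
    """Describe the block owned by processor (ix, iy) in a px x py logical grid.

    Returns (interior_shape, padded_shape, neighbours), where padded_shape
    adds one boundary row or column on each of the four sides and
    neighbours maps a direction to the coordinates of the adjacent processor.
    """
    if mx % px != 0 or my % py != 0:
        raise ValueError("grid dimensions must be exact multiples of px and py")
    nx, ny = mx // px, my // py
    neighbours = {
        "west": ((ix - 1) % px, iy),
        "east": ((ix + 1) % px, iy),
        "south": (ix, (iy - 1) % py),
        "north": (ix, (iy + 1) % py),
    }
    return (nx, ny), (nx + 2, ny + 2), neighbours
```

For a 512×1024 domain on a 16×1 processor line, each processor holds a 32×1024 interior block in a 34×1026 array. In that layout the "north" and "south" neighbors of a processor are the processor itself, so the wrap in that dimension needs no communication.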
There is an essential simplification that occurs in the case that either p_x or p_y is 1, in which case the logical rectangular processor array reduces to a line of processors. In this case two of the four communications required within each main loop are not needed, substantially reducing the communication overhead. As mentioned previously, the Shallow Water code uses periodic boundary conditions in each dimension. Normally periodic boundary conditions require copying data between processors at opposite edges of the processor array. In the case that one or other of p_x or p_y is 1, the periodic boundary condition in the corresponding dimension may be implemented by in-memory copying, rather than by communication.
A final optimization of the communication structure was required to get the peak performance. Before each of the main loops in the algorithm, the boundary data for the various physical variables (P, U, V, Z, H) used in that loop need to be copied from neighboring processors. Typically two or three variables are needed from a specific direction, although the number needed may depend on the direction. Because of the high communication startup cost of SUPRENUM-1 (at least 2 msec), it is essential to limit the number of individual communication requests. This was accomplished by packaging several communications of different physical variables in a single direction into one large communication package. For some steps this reduced startup overhead by a factor of three. In the final implementation we also replaced the Intel iPSC communication calls for this one exchange operation by explicit calls to SUPRENUM Fortran equivalents, thereby saving an extra copying of each data array to a communication buffer. SUPRENUM Fortran supports explicit communication operations using a standard Fortran I/O control list syntax.
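The packaging idea can be sketched in a few lines of Python. The helper names are hypothetical and plain lists stand in for the boundary strips of the physical variables; only the 2 msec startup figure comes from the text, and it is a lower bound there, so the cost function gives a lower bound as well.

```python
STARTUP_MS = 2.0  # per-message startup cost; the text quotes "at least 2 msec"


def pack(strips):
    """Concatenate several boundary strips into one message buffer."""
    buffer, lengths = [], []
    for strip in strips:
        lengths.append(len(strip))
        buffer.extend(strip)
    return buffer, lengths


def unpack(buffer, lengths):
    """Split a received buffer back into the original strips."""
    strips, pos = [], 0
    for n in lengths:
        strips.append(buffer[pos:pos + n])
        pos += n
    return strips


def startup_cost_ms(n_messages):
    """Lower bound on the startup overhead of n_messages separate sends."""
    return n_messages * STARTUP_MS
```

Sending three variables in one direction as a single package costs one startup instead of three, which is the factor of three mentioned above.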
There is potential in the Shallow Water Equations to overlap communication with computation, provided the underlying hardware supports asynchronous communication modes. In this case one would begin each major loop with an asynchronous exchange of boundary data. Following this one executes the main body of the loop, iterating only over the "interior points" of the subgrid. It is then necessary to await completion of the exchange operation, after which the loop iteration may be completed on the outermost rows and columns. In principle such an approach can yield 100% computational efficiency, i.e. communication effects become negligible. We implemented such an algorithm on SUPRENUM-1. However, due to inherent design aspects of the PEACE operating system, we were unable to use asynchronous communication effectively in the current version of PEACE.
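The overlap pattern can be illustrated with a one-dimensional analogue. In the Python sketch below a background thread stands in for the asynchronous exchange, `fetch_halos` is a hypothetical callable returning the two neighbor values, and a three-point sum stands in for the real stencil; none of this reflects the PEACE or SUPRENUM Fortran interfaces.

```python
from concurrent.futures import ThreadPoolExecutor


def stencil(left, centre, right):
    """Stand-in for the real update: a simple three-point sum."""
    return left + centre + right


def step_overlapped(u, fetch_halos):
    """One update of the local strip u (len(u) >= 2) with overlapped exchange."""
    n = len(u)
    new = [0] * n
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_halos)          # start the boundary exchange
        for i in range(1, n - 1):                   # interior points need no halo data
            new[i] = stencil(u[i - 1], u[i], u[i + 1])
        left_halo, right_halo = pending.result()    # await completion of the exchange
    new[0] = stencil(left_halo, u[0], u[1])         # finish the outermost points
    new[n - 1] = stencil(u[n - 2], u[n - 1], right_halo)
    return new
```

The result is identical to a synchronous update; only the ordering of work changes, so that the exchange time is hidden behind the interior computation when the hardware and operating system allow it.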
PERFORMANCE RESULTS: SUPRENUM-1
All measurements were performed on a 256-processor SUPRENUM-1 system at the GMD, in Schloss Birlinghoven, Germany. The Shallow Water code was exactly the standard sequential code, modified only to take account of communication. No attempt was made to introduce Fortran 90 vectorization constructs, or to otherwise adapt the code to known features of the SUPRENUM compiler. The code was compiled with both the vectorizer and optimizer switches on.
Because SUPRENUM nodes are vector processors, there is a substantial advantage to arranging the subgrids in each node such that the grid columns are as long as possible. In practice, Fortran columns longer than about 1024 words are not an advantage. This is because the vector registers are limited to a total of 7K words, and Shallow Water requires 7 registers for efficient code generation, which leaves about 1K (1024) words per register.
In order to maximize computational efficiency (by minimizing communication words sent per Mflop), it is desirable to solve as large a problem as will fit in each node. This turns out to be a problem with 32K grid points which consumes approximately 6 MBytes of node memory. All measurements presented here utilize subgrids of maximal size, although their rectangular shape may vary. We maximize both vector performance and computational efficiency on a node by using a 32×1024 subgrid in each processor. To indicate the importance of preserving a long vector length we note that performance on a single node goes from 2.69 Mflops on a 128×256 grid to 5.33 Mflops on a 32×1024 grid, essentially a factor 2 improvement (see Table 1 below).
As discussed earlier, the number of communications per node can be reduced by a factor of two by choosing a one-dimensional processor grid, which may be aligned with either the X or Y axis. If the processors are in a line in the X direction, then the communication packets will be of size 1024 words (the Y dimension of the subgrids) per variable, while if aligned along the Y axis, only 32 words are communicated per physical variable.
More generally we can expect lower performance as the subgrids tend towards a square shape, such as 128×256, due to the shorter vector lengths. Also, using fully two-dimensional processor grids such as a 16×16 grid will double the number of communications per node, resulting in poorer performance. All of these phenomena are illustrated in the measured results.
The final effect which we have studied is the influence of cluster interconnection topology on performance. The SUPRENUM-1 has been interconnected in three different ways as described in section 2 (ring, full matrix and full interconnect), and we have measured Shallow Water Equations performance in all cases. There is a significant dependence of performance on the topology used. For example, the worst-case efficiency measured with the double matrix topology was 39%, while with the full interconnect topology the worst-case efficiency is 70%. On the other hand, for the most efficient (linear) cases, performance with the full interconnect topology is slightly worse, dropping from 96% to 94% efficiency. Clearly the advantages of the full interconnect topology outweigh the disadvantages. We give only the measured data for the full interconnect topology.
We present the measured results in Tables 1-4. The tables indicate the number of processors P, their arrangement as a logical Px × Py rectangular processor array, the computational domain size Mx × My, the resulting computational efficiency and the Mflops generated. The computational efficiency in all cases is defined as
where T(P) is the solution time with P processors and T_best(1) is the best possible single-node performance with a subgrid of the same size but optimal shape.

Table 1 presents the effect of varying the grid shape in a single node. This demonstrates clearly the importance of maximizing vector length. Indeed the almost square 256×128 grid provides only 77% of the performance of the elongated 32×1024 grid with the same number of grid points. At the other extreme, the 1024×32 grid delivers only 43% of the performance of the 32×1024 grid.

Table 2 describes the performance of Shallow Water on grids of optimal shape for the system. Each node contains an optimal 32×1024 grid and the processors are arranged in a line parallel to the Y direction in order to minimize communication. Table 3 is similar except that the processors are arranged in a line parallel to the X axis, resulting in more square grids, and slightly increased communication cost.
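The quoted efficiencies can be cross-checked from the flop rates. The Python sketch below assumes that every node holds a subgrid of the same size, so that efficiency can be expressed as the aggregate rate divided by P times the best single-node rate; this flop-rate form and the function name are assumptions made for illustration.

```python
def efficiency(aggregate_mflops, nodes, best_single_node_mflops):
    """Fraction of ideal scaling achieved, assuming equal subgrid sizes per node."""
    return aggregate_mflops / (nodes * best_single_node_mflops)
```

With the figures reported in this paper, 1280 Mflops on 256 nodes against 5.33 Mflops on one node, this gives an efficiency of about 0.94.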
To relate SUPRENUM-1 to other systems, it is fair to say that for most of 1992 this system was among the most powerful available MIMD systems. However the SIMD CM-200 is far more powerful. With the arrival of vector nodes for the Thinking Machines CM-5 computer, SUPRENUM will no longer be in this position in late 1992. The performance measurements also should be qualified by the cost per Mflops of the different systems, which we have not considered in detail. However it does appear that SUPRENUM loses much of its performance advantage relative to the iPSC/860 if pricing is considered. This is due to the fact that the SUPRENUM nodes involve essentially more complex hardware (e.g. vector nodes) than the iPSC/860. The SUPRENUM node design was formulated long before the much cheaper i860 processor appeared.
CONCLUSIONS
The SUPRENUM-1 system is shown to deliver excellent performance per node for problems which are vectorizable and which also have a long vector length. This performance scales well to large systems. While communication overheads are greater than on many competing systems (e.g. Intel iPSC/860), this is more than counterbalanced by the higher achievable node performance. With the successful demonstration of the 256-node SUPRENUM-1, the SUPRENUM project may be regarded as a scientific success. Based on this success of the initial SUPRENUM-1 prototype, it is unfortunate that a successor system based on newer technology is not being developed.
