The results of implementing a 3-D production MacCormack explicit Navier-Stokes code with damping on a 1024-node NCUBE hypercube are presented.
Indeed, success on such architectures has inspired other investigators to study the distribution of data as well as computation on so-called massively parallel machines such as the hypercube. In contrast, access conflicts in shared-memory multiprocessors are assumed to occur randomly and presently limit the number of processors to eight. Fortunately, the same distributed CFD algorithms can be the basis of many production codes and are applicable to several current MIMD commercial architectures, so that significant programming effort can be justified. This paper extends the distributed CFD results of Catherasoo and others to include a full Navier-Stokes (N-S) code with damping and, more importantly, shows the efficiency of distributed N-S algorithms up to 1024 nodes (processors) 3. Detailed timing analysis identifies sources of the modest parallelization overhead.
Grid Partitioning For Distributed
The code on which this is based is a MacCormack explicit solution of the Navier-Stokes equations with damping 6,7.
In physical coordinates, data for a right circular cylinder of radius 1.0 were used (see Figure 1); these were transformed to a rectangular uniform coordinate system. A simple grid generator was written to produce the x, y, z coordinates for a grid of specified size. The common characteristic among all grids used in timing cases was that grid points were evenly spaced at increments of 0.2 radially (outward from the cylinder's surface) and 0.2 axially (along the length of the cylinder). See Figure 1 for two views of a typical grid.
The same data set was used for all timing cases. The data set had the following characteristics:
To achieve this, grid points are blocked in three-dimensional cubes 3. Grid points near the block periphery must be shared with another node so that second-order differencing (due to damping) and grid transformations can be performed in both nodes. This creates a 3-wide shell of shared points, illustrated in Figure 3. These will be termed computation grid points.
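The blocking scheme described above can be illustrated with a short sketch. This is not the authors' code: the function names `local_extent` and `point_counts` are hypothetical, and the sketch assumes a node with neighbors on every side, so that the 3-wide shell surrounds the node's private cube on all six faces. Only the cubic blocking and the shell width of 3 are taken from the text.

```python
def local_extent(block_edge, shell_width=3):
    """Edge length of the array one node stores.

    Assumption: the shell of shared points lies outside the node's private
    cube on both sides of each axis (an interior node with all neighbors).
    """
    return block_edge + 2 * shell_width


def point_counts(block_edge, shell_width=3):
    """Return (private points, shared shell points) for one cubic block."""
    private = block_edge ** 3
    stored = local_extent(block_edge, shell_width) ** 3
    return private, stored - private


if __name__ == "__main__":
    # Hypothetical example: a block of edge 7 stored with a 3-wide shell.
    private, shared = point_counts(7)
    print("private points:", private)
    print("shared shell points:", shared)
```

Under these assumptions the shell holds considerably more points than the private cube for small blocks, which is why the amount of node memory determines how large a partition can be.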
Study of the resultant code shows the memory requirements in the nodes are (in bytes)
Based on these formulae, it is possible to calculate the largest grid partition that can be accommodated by each processor. Table 1 shows the results, including the 512 kbyte/node case used for measured data. With recent increases in memory density, the 2048-kbyte case should be easily achievable by the next generation of 1024-node machines. 
IV. Performance
The initial release of hypercube hardware suffered the liabilities of limited individual node processor capabilities and too few processors to approach state-of-the-art vector processor performance, even for efficient parallel algorithms. Table 2 shows that 1024 processors allow the parallel code to compete with (uniprocessor) CRAY-class machines. Two caveats should be noted: (1) the NCUBE results are in 32-bit arithmetic and the CRAY results in 64-bit arithmetic, the hardware word lengths of each machine, and (2) in the grids studied, each node processed the largest cubic grid partition permitted; a total grid was then constructed from these cubes.
Both of these caveats favor the parallel performance, although the next hardware generation of nodes will likely perform 64-bit arithmetic. It should be possible to process an (8x8x8) partition on a 512-kbyte NCUBE node. However, only a (7x7x7) partition would complete a sufficient number of iterations to obtain repeatable timings on successive iterations.
Parallel speedup is conventionally defined as S = T1 / Tn, where T1 is the uniprocessor execution time and Tn the execution time on the same problem with n processors. With limited local memory, however, T1 cannot be determined for representative grid sizes. Therefore, a reference problem of size (Ic, Jc, Kc) = (7, 7, 7), devoid of message-passing or synchronization, is run on a single node. The choice of grid size for this reference problem is arguable and affects efficiency calculations, since a node processes a larger partition more efficiently (i.e., less time per grid point).
If the execution time of this reference problem is Tr for one iteration with a total of Gr computation grid points, then with n processors and Gn total computation grid points, the speedup with per-iteration time Tn is defined as
S = (Tr / Gr) / (Tn / Gn)
and the measured efficiency is
Em = S / n.
The speedups and efficiencies for grids chosen to optimize the parallel performance are given in Table 3.
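As a worked illustration of these definitions, the sketch below computes a speedup and efficiency from per-iteration timings. It assumes the speedup is the ratio of time per grid point on the single-node reference problem to time per grid point on the n-node run, which is the natural reading of the definitions in the text. The function names and all numerical values are hypothetical and are not measurements from the paper.

```python
def scaled_speedup(t_ref, g_ref, t_n, g_n):
    """Speedup relative to a single-node reference problem.

    Assumption: speedup is (reference time per grid point) divided by
    (parallel time per grid point), i.e. (t_ref / g_ref) / (t_n / g_n).
    """
    return (t_ref / g_ref) / (t_n / g_n)


def measured_efficiency(speedup, n):
    """Efficiency is the speedup divided by the number of processors."""
    return speedup / n


if __name__ == "__main__":
    # Hypothetical timings: every node holds a 7x7x7 partition (343 points).
    n = 1024
    g_ref = 7 ** 3          # grid points in the single-node reference problem
    g_n = n * g_ref         # total grid points across all nodes
    t_ref = 1.00            # seconds per iteration, reference problem
    t_n = 1.25              # seconds per iteration, parallel run
    s = scaled_speedup(t_ref, g_ref, t_n, g_n)
    print("speedup:", s)
    print("efficiency:", measured_efficiency(s, n))
```

Normalizing by grid points is what allows the comparison when the full problem does not fit in a single node's memory.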
V. Interprocessor Communication
Communication occurs for two reasons within each iteration.
(1) Plane swap. Data from the shell of shared grid points in adjacent cells (Figure 3) are exchanged; this occurs twice within each iteration, after both the predictor and corrector steps, in simultaneous exchanges between all nodes. A corresponding timing diagram is shown in Figure 4. All nodes participate in this exchange, although nodes processing grid points on the periphery of the grid, with some missing neighbors, exchange less data. Nodes without missing neighbors are termed internal nodes.
(2) Time-step calculation. Stability limitations are calculated for the cube of grid points private to each node; these are communicated to node 0, where an overall time step is determined and communicated back to all processors.
(It should be noted that nodes on the k-edge wraparound are processed as internal nodes; i.e., with a full set of neighbors.)
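The distinction between internal and peripheral nodes can be sketched as follows. This is an illustrative model only: it assumes the partitions form a three-dimensional array of nodes indexed (i, j, k), that plane swaps occur only with the six face neighbors, and that the k direction wraps around. The names `face_neighbors` and `is_internal` are hypothetical.

```python
def face_neighbors(p, dims, wrap_k=True):
    """List the face neighbors of node p = (i, j, k) in a dims = (ni, nj, nk) array.

    Assumption: only the k axis (index 2) wraps around; a neighbor that would
    fall outside the array in i or j is treated as missing.
    """
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            q = list(p)
            q[axis] += step
            if axis == 2 and wrap_k:
                q[axis] %= dims[axis]   # periodic in k
            if 0 <= q[axis] < dims[axis]:
                neighbors.append(tuple(q))
    return neighbors


def is_internal(p, dims, wrap_k=True):
    """A node is internal when none of its six face neighbors is missing."""
    return len(face_neighbors(p, dims, wrap_k)) == 6


if __name__ == "__main__":
    dims = (4, 4, 4)  # hypothetical 64-node arrangement
    for node in [(1, 1, 0), (0, 1, 1), (0, 0, 0)]:
        print(node, "internal" if is_internal(node, dims) else "peripheral",
              len(face_neighbors(node, dims)), "neighbors")
```

With the wraparound, a node at k = 0 still has a full set of neighbors provided it is not on an i or j boundary, which matches the remark above.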
Since all internal nodes carry out the same computation, their efficiencies are equal and can be calculated from Ec = Numerical computation time / Total time for each iteration. These are shown in Table 3; they are found to be in excellent agreement with measured efficiencies (Em).
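The calculated efficiency of an internal node can be illustrated with a small sketch. It assumes the total time per iteration is the sum of the numerical computation time, two plane swaps (one after the predictor step and one after the corrector step), and one time-step exchange, with no overlap between communication and computation. That additive model, the function name, and the timings are assumptions for illustration rather than values from the paper's timing diagram.

```python
def calculated_efficiency(t_compute, t_swap, t_timestep, swaps_per_iteration=2):
    """Efficiency = numerical computation time / total time per iteration.

    Assumption: communication and computation do not overlap, so the total
    is a plain sum of the three contributions.
    """
    total = t_compute + swaps_per_iteration * t_swap + t_timestep
    return t_compute / total


if __name__ == "__main__":
    # Hypothetical per-iteration timings in seconds for one internal node.
    e_c = calculated_efficiency(t_compute=0.90, t_swap=0.04, t_timestep=0.02)
    print("calculated efficiency:", round(e_c, 3))
```

Because every internal node performs the same work, a single evaluation of this ratio characterizes all of them.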
Noting that more local memory permits less data flow per floating-point computation, it is possible to extrapolate the timing model of Figure 4 to hardware with more local memory. The resultant calculated efficiency Ec is shown in Table 1, showing only modest gains in efficiency for more local memory.
Thus, future machines with similar numerical computation and data transfer rates could trade off local memory size for more nodes (parallelism) 8.
VI. Conclusions
The above results show the viability of both massive parallelism and small (512-kbyte) local memory for 3-D production explicit Navier-Stokes codes.
Performance of this code is also being studied as a function of processor characteristics other than memory size, such as MFLOP rate, interprocessor data-transfer rate, and message startup time (latency). Such a generic model will be able to predict performance for other message-passing commercial multiprocessors expected in the near term.
The authors are indebted to Sandia Corporation (Albuquerque) for access to their 1024-node hypercube. John Holm's contribution in development of an early version of the code is acknowledged; Paul Kominsky's assistance is also noted. The code was provided by Dr. J. Shang of WRDC, WPAFB. 
