INTRODUCTION
The NRL Layered Ocean Model, NLOM, has been under continuous development for 20 years [8] , [15] , [12] . The Users Guide [15] describes the model at the time the guide was released, and is still the primary reference for the model formulation. A new implementation of horizontal diffusion and a discussion of the model formulation in spherical coordinates can be found in Moore and Wallcraft [12] . Here we describe the techniques used to make the model's implementation (source code) portable to all major computer architectures and to several scalable programming styles. Many of the techniques described will be applicable to other ocean models.
The original implementation was targeted to SMP (shared memory parallel) systems, and in particular to PVP (parallel vector processors) where each CPU is a vector processor. This was the dominant supercomputer architecture throughout the 1980's, and is still used in machines such as Cray's C90 today. The model was written in standard Fortran 77 using a coding style that allowed compilers to automatically run loop nests in parallel across multiple processors, while still automatically vectorizing inner loops. The compiler technology involved is known as autotasking in Cray compilers. It is available in the Fortran compilers from all major vendors of SMP systems, and this has allowed the original ocean model implementation to run efficiently on a wide range of shared memory machines, from workstations to supercomputers. SMP systems are limited by the need to have uniform access from each processor to each element in memory, which is only cost effectively implementable for a maximum of 16, or perhaps 32, processors. Distributed memory architectures make memory local to the processors. This allows scalability to hundreds of processors, but at the expense of all or part of hardware support for the uniform shared memory programming style. On non-uniform memory access (NUMA) systems, all memory is accessible (without remote processor involvement) from all processors, but access speeds are faster (perhaps much faster) for local memory. Many distributed systems do not implement a shared memory at all, requiring messages to be transfered over a network to access another processor's memory typically with the cooperation of the remote processor. It has so far proved impossible, in general, to take existing "dusty deck" shared memory source code and have a compiler automatically produce an efficient distributed memory version. This is the case even if the original source was optimized for SMP systems and the target is a cluster of SMP machines with a NUMA global shared memory, and these are the most similar shared and distributed memory designs. The problem is that compilers are best at local optimizations (e.g. DO-loop or subroutine level), but efficient use of a distributed memory requires global decisions about how memory is used.
Since distributed memory systems are almost certain to be the fastest available for ocean Manuscript approved September 4, 1997. modeling over the next 10 years, an implementation of NLOM that can run efficiently on these machines has been produced. Just as the original vector/parallel implementation ran with reasonable efficiency on any single processor or SMP system, an important design goal for the scalable implementation is that it be reasonably efficient on any computer system likely to be used for ocean modeling: workstations, workstation clusters, SMP systems, SMP clusters, vector, vector-parallel, and massively parallel computer systems. This allows a single source code to be used across all machine types, greatly simplifying maintenance and the development of new features.
DISTRIBUTED MEMORY PROGRAMMING STYLES

Data Parallel
One approach for distributed memory machines that retains the idea of a global shared memory is data parallel programming. High Performance Fortran (HPF) is an extension to Fortran 90 implementing the data parallel approach [9] . It consists of a standard set of functions, compiler directives, and a very few language extensions. The language extensions have been included in Fortran 95, so a HPF program is also a Fortran 95 program (i.e. is portable to any machine with a Fortran 95 compiler). HPF is based on CM Fortran, which is a data parallel language for the CM5. The first distributed memory implementation of NLOM was written in CM Fortran, and could also run on any machine with a Fortran 90 compiler. The code was completely different from the original SMP Fortran 77 version, because CM Fortran requires all parallel operations be expressed using Fortran 90 array expressions. CM Fortran also requires the addition of directives that tell the compiler how to layout arrays across the distributed memory. These directives do not change the meaning of the program, but they have a very large effect on performance. Optimization for the CM5 (or any data parallel machine/language) is largely involved with getting memory layout right. In principle, the required array syntax could be automatically generated from the Fortran 77 model code. However, the current generation of tools cannot do this, despite the basic self similarity of finite difference codes. Consider the following simple code fragment: The arrays A and DA have been extended by a one column "halo" to allow a clean implementation of a periodic boundary. On entry A(IH+1,:) must be identical to A(l,:), and at some later phase in the program this condition would be explicitly applied to DA also. The equivalent CM Fortran Automatic converters might be able to handle this simple example, but in an actual code the use of a halo to handle periodic boundaries would not be correctly converted to a CSHIFT on an array without a halo. The arrays A and DA are no longer extended by one column, because the CSHIFT (periodic, or circular, shift) array operation is a one to one mapping to periodic boundary conditions. The data parallel approach is similar to autotasking, in that parallelization is a local process. This implies that autotasking compilers could certainly be extended to parallelize arbitrary array expressions. Therefore, HPF (or CM Fortran) codes are in principle portable to almost any machine design. The CM Fortran implementation of NLOM is in fact a legal Fortran 90 code, so it can be run on a Cray C90 (say) or a workstation with a Fortran 90 compiler. The problem with using HPF for a portable code is efficiency:
• HPF requires communication for almost every array expression.
• Different LAYOUT configurations may be optimal on different machines, i.e. difficult to write portable optimized HPF.
• Data parallel codes tend to use many array temporaries.
• Data parallel codes are typically written to be scalable to a large number of processors. This almost always involves additional floating point operations, and these reduce performance on machines with a small number of processors.
• A single nested DO-loop is typically equivalent to several array operations, reducing optimization opertunities.
• Current Fortran 90 compilers produce better code for DO loops than for array expressions.
Some of these limitations will be significantly reduced as compilers improve. On SMP machines, a smart compiler might eventually be able to make a collection of data parallel array expressions as efficient as a similar set of nested DO loops. However, the data parallel version will probably still require more memory. On distributed memory machines, a simplistic HPF compiler only works well if remote memory access is very fast (as it can be on a data parallel machine such as the CM5, or on most NUMA systems). More sophisticated compilers can try to cache selected remote memory values to reduce network traffic.
After producing a working data parallel version of NLOM [14] , it became clear that it was too slow and memory hungry to replace the existing version on SMP systems. The data parallel version worked well on the CM5, but maintaining two separate versions of an ocean model that is itself under continuous development was a serious maintenance headache. Moreover, HPF compilers for new distributed memory systems (such as the T3D) were incomplete and not sufficiently robust to allow easy and early migration to each new system.
Tiled Message Passing
Message passing libraries can be considered the assembly language of distributed memory machines. For ocean models, message passing is typically used via the domain decomposition SPMD (Single Program Multiple Data) programming style. The domain is divided into approximately equally sized pieces, one per processor, and each processor runs the same program over its piece of the domain, calling a message passing library to obtain values from other sub-domains. All scalar calculations, i.e. those not acting directly on model domain fields, are calculated separately on each processor. This implies that in practice SPMD must use a homogeneous set of processors, otherwise the scalar calculations could diverge. Efficiency and program transparency is enhanced by including a halo (or fake zone, or ghost points) of nearby points, which allows existing finite difference code to be used almost without change provided the halo is updated, via message passing, at appropriate intervals. Consider our simple code fragment: Provided the halo is up to date, the code fragment calculates the required values over the subdomain owned by this processor. Note that no knowledge is required about which subdomain is being calculated. This is all handled by the message passing and I/O routines which would need to be added to the original code. This is actually a special case of domain decomposition, tiling, that is applicable to all finite difference ocean models. The sub-regions are rectangular and halos have a simple mapping onto nearby processors. If the tiles are all the same size (as they are here) this gives a mapping of arrays to processors that is very similar to the data parallel approach. Data parallel codes don't have to explicitly include a halo, but halos (invisible to the programmer) are one technique a data parallel compiler can use to cache frequently accessed off-processor memory locations.
Converting an existing autotasked code to use the tiled message passing approach has a minimal effect on SMP performance, since when compiled for a single tile covering the entire region the original code is effectively used except for a few calls to update the tile halos. Note that in the single tile mode, halo updates do not involve message passing and can include periodic boundary conditions. However, if an existing code cannot autotask efficiently, then the only SMP parallelization option with this approach is to message pass between multiple tiles on the same SMP machine, and current message passing libraries are not efficient when running in a multi-user SMP environment.
Overall, tiled message passing is a viable approach to portability. Certainly more viable today than the data parallel approach. It does have some disadvantages:
Scalable NLOM
• Message passing is inefficient on data parallel machines, such as the CM5.
• All non-local operations require a subroutine call, e.g. AIJ = A(I,J) becomes CALL XCEGET(AIJ, A,I, J), and XCEGET must be coded using message passing.
• Debugging is difficult. In particular, the correctness of a single tile code does not imply that the corresponding multi-tile code is correct. Data parallel codes can in principle be fully debugged on a single processor.
• Message passing is not efficient on a multi-user SMP system, although autotasking may reduce the importance of this fact.
Tiled Data Parallel
It is possibile to combine the tiled message passing and data parallel methods. Consider our simple code fragment: If MP and NP are both 1, this is functionally identical to the message passing code fragment. If MPxNP represents the number of processors (or, on the CM5, the number of vector units), this is data parallel and the compiler does not need to generate any off-chip communication. In principle, a tiled data parallel code should be as fast or faster than the corresponding native data parallel code, because it does less communication unless the compiler is caching off-chip values. However, the speed of a tiled data parallel code depends on how well the compiler handles serial, i.e. on-chip, operations and CM Fortran's serial optimizations are currently quite limited. Since CM Fortran requires array syntax, the CMAX preprocessor must be invoked on the CM5, as part of the compile phase, to convert the DO-loop nest to an array expression: 
Wallcraft and Moore
This is unnecessary in HPF, which will directly parallelize the DO loops.
The tiled data parallel programming style adds more boiler plate to the original code, to handle the tile number dimensions, but can use message passing to update halos etcetera and is in every way functionally equivalent to the standard tiled message passing style. It can also run on data parallel machines/compilers, using data parallel constructs in communication subroutines. Disadvantages include:
• Boiler plate reduces clarity, when compared to either native data parallel or tiled message passing codes.
• All non-local operations require a subroutine call, e.g. AIJ = A(I,J) becomes CALL XCEGETCAIJ, A,I, J), and XCEGET must be coded using both data parallel and message passing.
On SMP machines autotasking can occur at the block level (outermost loop) even if the original code did not autotask, and the multi-tile code can be debugged on a single processor (except for the message passing communication routines).
The NLOM is now implemented in the tiled data parallel style, and this single code has replaced the original separate SMP and data parallel versions. The boundary conditions imposed at the end of each time step will allow for land points, and therefore the presence of absence of shrinkwrapping does not affect correctness. The variable inner loop extent slows down vectorization, and causes some load inbalance when autotasking (i.e. it tends to reduce the substainable Mfiop/s rate). However, it is almost always a net win in overall performance on SMP machines to shrinkwrap land, because the time saved by skipping calculations is larger than the time lost by less efficient vectorization and parallelization. The /ZILOOP/ COMMON contains four sets of shrinkwrap limits, for the four separate staggered grids used in the model calculation.
Some ocean models skip all land calculations, either by using a mask or by splitting the inner loop into many loops that are land-free. This allows them to run with simplified boundary corrections at the end of each time step, but they take a performance hit on every finite difference calculation (particularly on vector machines). NLOM typically spends less than 5% of its time in boundary routines, and this is easily made up for by increased finite difference code performance.
Example of Tiled Data Parallel Code
Portability often requires different code to be executed on different machine types. Since the code for machine X may not even be legal Fortran syntax on machine Y, the appropriate code fragment needs to be selected before the Fortran compiler sees the source code. The C preprocessor, cpp, is an integral part of standard C and, among other things, allows conditional compilation. There are alternatives, but cpp is commonly used as a Fortran preprocessor. In fact, many Fortran compilers follow the convention that files ending in .F or .F90 are first passed through the C preprocessor before entering the standard compilation phases. Macro subsitution, i.e. the replacement of one text string with another, is also a feature of cpp that can be useful in Fortran programs, although it sometimes produces counter-intuitive results. NLOM makes extensive use of cpp capabilities. The following are a set of global C preprocessor macros, that may be referenced by all NLOM subroutines. This is more complicated than the example used to introduce the tiled data parallel technique. The halo consists of one row/column on one side and two rows/columns on the other, because an extra row/column is required to handle some operations on the staggered grids. Arrays are dimensioned (1: IHP+3,1: JHP+3, IP, JP), rather than (0: IHP+2,0: JHP+2, IP, JP). The latter would simplify array indexing, but Fortran compilers are not consistant in how they handle arrays that don't start at (1,1,1,1), and so the less transparent dimensioning scheme is used to enhance portability. The array size is slightly different than the original. IH no longer includes a halo to handle the global ocean periodic boundary (i.e. 512 rather than 515), and JH has been increased to a value rich in powers of two (288 vs 287, but much larger differences in size are possible). Array dimensions rich in powers of two allow the number of equally sized tiles to be a large power of two, which is optimal on some massively parallel machines. The non-standard data types REAL*4 and REAL*8 are used because they are the most portable way to get 32-bit REAL and 64-bit DOUBLE PRECISION. Data parallel compiler directives for array layout are included via macros, which allows both CM Fortran and HPF to be supported, and makes it possible to use different layouts on different machines. On SPMD machines, there is no advantage to shrinkwrapping because overall speed is controlled by the tile with the least land. When shrinkwrapping, "ilf, ill" is expanded to "ILF(J,M,N),ILL(J,M,N)" but otherwise it is "2.IHP+1". Tiling actually simplifies shrinkwrapping. All four staggered grids have the same extent, specified in ILF and ILL, but intermediate values may need to be calculated at halo points in which case MLF and/or MLL are used. Note that in the shrinkwrapped case, MLF and MLL are not necessarily ILF-1 and ILL+1 because, for example, a halo point can be a sea point even if the adjacent non-halo point is land.
The data parallel style requires all tiles to be the same size and there must be a one to one mapping between tiles and a 2-D mesh of processors. This can lead to some inefficiencies in handling land. On a SMP machine with relatively few processors, shrinkwrapped tiles don't all perform the same amount of work. Better load balance can be obtained by using variable sized tiles, with tile sizes chosen so that each tile contains about the same number of ocean points. NLOM reduces SMP load inbalance by the alternative technique of having more tiles than processors, e.g. 12 tiles for 2, or 3, or 4 processors. On a large message passing machine, some tiles will contain only land and hence do no useful work. There is no need to include these empty tiles in the configuration at all. This is relatively simple to implement, since the bulk of the model code does not know about other tiles in any case. A tiled data parallel code in data parallel mode would have to include all tiles, but some could still be missing when running the same code with message passing. NLOM does not use this optimization, because the Capacitance Matrix Technique it uses to solve Helmholtz's equations performs most calculations at all points and because parallel I/O is more difficult with missing tiles. Also, some massively parallel machines have explicit support for, and run optimally over, a 2-D grid of processors. On the other hand, most machines can handle missing tiles with little communication overhead and the potential saving in time is significant (20% to 40% of points are typically over land). NLOM may eventually be extended to skip land tiles, and this is certainly Scalable NLOM H an optimization worth considering for any message passing ocean model. The scalable version of the MICOM model [2] , includes both the variable tile size and empty tile optimizations.
Communication Subroutines
The following are one line descriptions of all NLOM 2-D array communication routines:
XCAGET -convert an array from tiled to non-tiled layout XCAPUT -convert an array from non-tiled to tiled layout XCBGET -extract A(IA:IA+1,JA:JA+1) from a tiled array XCEGET -extract the element A(IA.JA) from a tiled array XCEWGT -extract the element A(IA,JA) from a tiled wind array XCHALT -emergency stop all processes, called by one process XCLGET -extract a line from a tiled array XCLGT2 -extract a double line from a tiled array XCLPUT -place a line of values into a tiled array XCMASS -sum a tiled array XCMAS1 -sum a sub-region of a tiled array XCMINH -find the minimum value in a tiled array XCNORM -find the L2 norm of a tiled array XCRANG -find value and location of min and max from tiled array XCSHFT -periodic shift of a tiled array XCSPMD -initialize processor data structures, called once XCSTOP -stop all processes, called by all processes XCTILB -update a tiled array's halo XCTILI -initialize for halo communication, called once XCTILR -update a tiled array's halo, for red-black SOR XCTILV -partial update of U,V tiled array's halos XCTILW -update a tiled wind array's halo
With the exception of XCHALT, all these routines are assumed to be called with identical argument lists by all processors when using SPMD message passing. This is not difficult to arrange, since by default all routines are called in this manner in a SPMD run. Most communication routines act as implicit barriers that synchronize processor state, i.e. when a processor exits a communication routine at the very least all processors that must communicate with it have entered the same subroutine. In addition the macro "barrier" is provided for cases where all processors must enter a critical section of code before the first processor exits. This is implemented as a macro, rather than a subroutine, to reduce overhead for autotasking and data parallel cases which never need user invoked barriers (since the compiler synchronizes the processors, whenever necessary).
NLOM is designed to produce identical results on a given system no matter how many processors are used. This greatly improves robustness and simplifies debugging. Global sum routines (XCMASS and XCNORM), cannot use the fast algorithms typically implemented by data parallel compilers and by message passing libraries because these are not independant of the number of processors involved in the sum. Instead, a pipelining implementation is used. Pipelining is described in the context of a tridiagonal solver in section 3.7.
Three versions of each communication subroutine are provided; one for message passing, one for autotasking/data parallel using Fortran 77 syntax, and one for data parallel using This version is longer than the others, because it allows multiple tiles per process and therefore includes the autotasking version's logic as well as message passing logic. Multiple tiles per process might be used on a SMP cluster, for example, with each process on a separate SMP machine communicating with other processes on other machines via message passing but parallelizing across multiple processors internally via autotasking. We will follow the usual convention of treating process and processor as interchangable terms to mean a single message passing entity, even though such an entity may in fact be muliple autotasked processors. Message passing uses either the T3D's SHMEM library or the standard MPI library [13] . Several macros are used to enhance portability. The macro "header" allows for a machine specific include file, "mtyper" allows for machine specific REAL data type, "barrier" blocks until all processors execute it, and macro "shmem_getr" allows for different subroutine names in T3D and SGI versions of the library. The variables MPROC and NPROC identify the local processor, with respect to a 2-D mesh or processors, and the array IDPROC contains the mapping from the mesh to processor number. IDPROC includes a halo to simplify periodic shifts. So, IDPROC (MPROC, NPROC) is the processor number of this processor, and its northern neighbour is processor number IDPROC (MPROC, NPR0C+1), etcetera. The processor array is configured to periodic wrap in both dimensions. On small workstation clusters it might be more efficient to skip the wrap when it isn't needed. SHMEM implements one sided communication, and can get values in any static array from another processor, the SAVE statement makes AI and AJ static. NLOM requires IHP and JHP to be even, so the first dimensions of AI and AJ are also even. This avoids performance problems that occur when the local and remote buffers are misaligned in memory. The barrier before each shmem.getr call ensures that the buffer, AI or AJ, is up to date on the remote processor before it is copied. In general, a barrier would also be required after the copies to ensure the remote memory is not altered before it is copied. This is not required here, because a total of two barriers are sufficient to protect two buffers (either a classic double buffer or, as here, two independent buffers called in sequence). SHMEM is a particularly good match to the tiled data parallel programming style, because global barriers can always be placed between each stage of a data parallel algorithm allowing safe access to remote memory. The MPI version is not necessarily optimal on a given machine, but it is a portable starting point for a machinespecific MPI implementation. An optimized version might not use the buffers AI and AJ, which are essential for SHMEM but represent potentialy unnecessary copy operations under MPI. The subroutine MPI.SENDRECV is a simple way to implement a circular shift. It sends one set of values to one processor while receiving another set from a second processor. Since both ends cooperate in passing messages, there is no need for a global barrier. The default SPMD global communicator MPI_C0MM_W0RLD is being used here. On some machines a communicator based explicitly on a 2-D mesh might be more efficient.
Parallel I/O
NLOM is used on many machine types, so it is important that NLOM files be portable. This is acomplished, in part, by using IEEE 754 32-bit floating point values for all I/O. Integer values are converted to floating point, because integers can be either 32-bit or 64-bit and are hence nonportable. To allow some degree of parallel I/O on distributed memory machines, separate files are used for scalars and arrays. Scalar files are formatted or unformatted sequential, and array files are Fortran direct access with one 2-D array (representing a single field for a single layer) per record. Array I/O is performed via subroutine calls, which can do I/O in parallel provided the resulting file is configured as required. On the Cray C90, an assign statement is used (actually an ASNUNIT subroutine call) to configure for IEEE I/O. On DEC machines, the f77 compiler switch "-convert big.endian" is used to make the files compatable with the majority of Unix systems (DEC is little endian, but most others are big endian by default).
Our implementation of I/O is somewhat unusual, but a natural one for message passing SPMD codes. All files are opened either READONLY or WRITEONLY. Scalar files are read independently by all processors, but written by the first processor only. All processors can write to STDOUT, Fortran unit 6, but reading from STDIN, Fortran unit 5, is not allowed because on some machines it is not available to all processors. Fortran units 0 and 7 are not used, because they can have default bindings, to STDERR, which are not portable. Under a data parallel or autotasking compiler, scalar code and scalar I/O do not see multiple processors and so all I/O is from the single program instance. How array files are handled depends on the machine. Writing the entire array from the first processor using direct access I/O always works, and is the only choice for single node and autotasked cases. However, on distributed memory machines it requires all tiles to be message passed to one processor. An alternative is to use direct access I/O from the first processor in each row, reading/writing the entire rows's set of tiles as a single record. Each active processor is then accessing a different, non-overlapping, record in the file. This is always safe for parallel reading and is usually safe for parallel writing if the record length is a multiple of the disk block size. On a data parallel machine, copying from a tiled (4-D) array to a native (2-D) array and then using data parallel I/O is an option. In general, any method that reads/writes the standard file is allowed. There are several efforts underway to produce standards for parallel I/O, but it is not yet clear if any of them will be compatable with serial I/O. The NLOM technique maintains portability, but does not necessarily allow maximum parallel I/O performance. This is acceptable, because NLOM spends much more time computing than doing I/O. Using two files, scalar and array, to hold values is less robust than mixing them in the same file. We must rely on file name association to tell us which scalar file goes with which array file, but it is not generally possible to do portable parallel I/O on files than mix scalars and arrays. Array files contain all land points. They can be reduced in size, after the fact, either by standard file compression techniques, e.g. via the gzip command, or run length encoding of land points via the NLOM-specific command lzip, or by converting them to the original NLOM history file format (a 24-bit fixed point value file containing no land values at all).
Serial Optimizations
The original implementation of NLOM was optimized for multi-processor vector machines, such as the Cray C90. Its performance on RISC microprocessor based workstations and SMP servers was of less importance. RISC-based systems now represent a much more significant fraction of the high performance computing field. So, the scalable NLOM implementation must be nearly optimal over a wider range of machines, including both vector and RISC processors. Vector machines can be characterized as having no cache and a very high memory to processor bandwidth. Arrays are often used to hold temporaries, and a non-unit stride through an array can be efficient unless the stride is a large power of two (e.g. 64 or 128). RISC-based machines always include a cache, often a multi-level cache, and have a low memory to cache bandwidth. Non-unit strides through an array can be expensive, unless the required elements are already in cache.
With the exception of the Helmholtz's equation solvers, described in section 3.7, the original NLOM code structure was already sufficiently clean to optimize well on RISC systems. There were no unnecessary non-unit strides through arrays, and the loop nests were simple enough for the compiler to automatically unroll most inner loops. Loop unrolling is essential for good performance on a RISC machine, and a few loops had to be unrolled by hand, with the original code retained for vector machines via cpp logic, when one or more compilers failed to recognize the possibility of unrolling. A few array temporaries have been replaced by scalars, but the original NLOM code typically used array temporaries, rather than scalars, only when they reduced the operation count. So most array temporaries have been retained. Using 32-bit REAL's effectively doubles the cache size and the memory to cache bandwidth. This provides a performance boost even on machines that run at the same speed for 32-bit and 64-bit floating point arithmetic. All ocean models should be able to use 32-bit IEEE 754 REALs for most calculations.
Division is always expensive, but it is particularly costly on some RISC systems. NLOM already performed the minimum possible number of divisions. The IEEE 754 floating point arithmetic standard does not allow division to be replaced by multiplication by a reciprocal. A compiler that strictly enforces this requirement could not replace: even though the latter might be ten times faster. This is one of the few situations where it is still advantageous for a programmer to promote loop independent calculations to outside the loop.
Codes optimized for vector, or data parallel, machines tend to make liberal use of simple vector operations, such as Y=X, Y=0.0, and Y=Y+X. These are relatively inexpensive on vector machines, but can take a significant amount of time on machines with a low memory bandwidth. The obvious first step is to remove unnecessary operations of this sort. However, good programming practice and numerical accuracy considerations have lead to many such operations being retained in NLOM. So they have optionally been replaced by calls to optimized BLAS, Basic Linear Algebra Subroutines [10] , where available. The BLAS were designed only to support the needs of linear algebra, and not all the required vector operations are included. Moreover, many vendors do not optimize the original level-1 (vector-vector) BLAS, because vector lengths are short in linear algebra. Level-2 (matrix-vector) BLAS [4] and level-3 (matrix-matrix) BLAS [5] have more room for optimization and are more important to modern linear algebra applications. There is a need for a standardized vector library to facilitate the porting of codes from vector to RISC architectures. This could be implemented in assembly language, if necessary, for each RISC architecture. For ocean models, a level-2 BLAS extension, subroutine SGESUMC, implementing the matrix (2-D array) operation "B = alpha*op(A) + beta*B + gamma", where op(A) is either A or its transpose, would be sufficient to cover most simple vector operations (provided cases where the scalars alpha, beta, and gamma are zero or one are treated optimally). Cray has a similar BLAS extension, SGESUM, on the T3D only. To illustrate that Fortran compilers do not produce optimal code for simple vector operations, consider two implementations of the same subroutine. First, the obvious Fortran version. 
END
The second version uses the non-standard LOC function to return the address of S(l) in bytes, and hence to select S(l) or S(2) as a legal REAL*8 pointer. It then calls a REAL*8 version of the subroutine with a REAL*8 constant containing two copies of W and about half the original vector length. This version is not standard Fortran 77, but it is significantly faster than the original on many machines because the memory bandwidth is higher for 8-byte memory writes than for 4-byte writes. Presumably, an assembly language version of R4WSET could be faster still. The scalable NLOM code includes the option of using optimized BLAS and 64-bit vs 32-bit optimizations when appropriate, but a standard vector library optimized for each machine would simplify NLOM logic and further improve performance.
Cache Management
All ocean models are intrinsically memory intensive. The ratio of floating point operations to memory accesses is small, and all array elements are accessed at least once every one or two time steps. The previous section discussed essentially local optimizations. This section is concerned with global code restructuring for better cache performance. On a SGI Power Challenge, with a 4 MB secondary cache, a very small NLOM test case runs at 103 Mflops/s but a more realisticly sized case runs at only 39 Mflop/s. The two cases were not exactly comparable, but since the small case is running almost entirely cache contained it does place an upper bound on what cache optimizations can do. The NLOM code slabs on layers, i.e. all lower level routines deal with a single layer and the loop over all layers is outside these routines. Moreover, each low level routine implements a relatively small and self contained operation, for example there are separate routines for advection and diffusion, and there are several involved in calculating pressure gradients. This programming style is flexible and easy to maintain, but it does require more passes though each set of layer arrays than is strictly necessary. Layer slabs are large and don't typically fit in cache, so low level single layer subroutines are not optimal for cache management. A monolithic programming style where all the finite difference operations for a time step were performed by one big loop (or a small number of loops), would maximize locality of reference, but compilers don't always perform well when optimizing very complicated loops and large loops are difficult to maintain.
In principle, the message passing version of NLOM should already be configurable for optimal cache management. All that is required is to set the tile size so that a complete set of arrays for a single layer fit in cache. This would typically result in more tiles, and therefore processes, than physical processors. If these processes could be optimally scheduled by the operating system and if the per process message passing overhead were small, we would have a near optimal configuration with good cache performance. However, most message passing libraries assume there is one dedicated processor for each process and Unix time slices all processes and schedules them in a round robin (which is far from optimal). Optimal scheduling would consist of each process continuing to run on a physical processor until it blocks, at a barrier or a message passing operation, at which point it sleeps and another process takes over until it blocks, etcetera. A similar scheduling scheme is available, and works well, for some thread libraries [11] , but schemes that work for light weight threads may be impractical for heavy weight processes. The direct use of threads, with one thread per tile, would allow better scheduling and might in fact lead to the fastest possible implementation on those SMP machines that support the mapping of many threads onto fewer processors. However, threads are not a natural fit to the SPMD programming style, and a combination of threads and message passing would be required when running across a cluster of SMP systems. Autotasking makes use of threads on many machines and, even if autotasking uses heavy weight processes, the master-slave algorthm used to parallelize DO-loops provides near optimal scheduling. What is required is to move the tile loops out of low level subroutines, and call as many subroutines as possible within each autotasked tile loop. However tile loops cannot contain calls to communication routines. In pseudo-code, for IP=1, this might look like: DO N= l.JP HENCE (1,1,1,N) . CALL SUBKDUCl, 1,1,N) , U(l, 1,1,N. NTO.K), ) CALL SUB2(DV (1,1,1,N), H(l,1,1,N.NTO.K) , .... ) CALL SUB3(VV (1,1,1,N), V(l,1,1,N. NTO.K), ) CALL SUB4(UV (1,1,1,N), U(1,1,1,N,NT0,K 1,1,1,N), W(l,1,1,N),V(1,1,1,N.NTO.K) ) CALL SUB6 (DV(1,1,1,N), VV(1,1,1,N),H(1,1,1,N.NTO.K) , .... ) ENDDO As it stands, this is not data parallel unless SUB1 to SUB6 are treated as HPF EXTRINSIC subroutines and these, while part of the standard language, are unlikely to be portable. However, compatability with data parallel can be maintained by using cpp macros to configure the same code to put tile loops either inside or outside low level routines. Compiler directives would typically be needed to force autotasking on these N loops, because the compiler would otherwise have to assume the subroutine calls were not thread safe.
! LOW LEVEL ROUTINES HANDLE A SINGLE TILE,
Tile loops outside low level routines can also improve cache performance on a single processor. So a generic advantage of tiled data parallel over tiled message passing is that each clustered system can run a single program that itself handles multiple tiles, either to improve cache performance on a single processor, or to use multiple shared memory processors via autotasking. Hybrid parallelization techniques, such as autotasking plus message passing, are likely to be increasing important in the future, due to the growing popularity of SMP clusters.
The NLOM implementation does not currently support tile loops outside low level subroutines. A test case that simulated such loops without the synchronization required for communication routines, i.e. a best case test, ran 35% faster on a SGI Power Challange, but only 19% faster on a Sun SPARCstation 20, compared to the standard scalable code. A test case that was entirely cache contained but performed equal work, ran 55% and 29% faster than the standard case on SGI and Sun machines respectively. So there is some overhead whenever the model is larger than the cache, but a version with high level autotasked loops might perform significantly better than autotasking at the lowest loop nest level. The difference in speedup on the Sun and SGI is due to the relatively better balance between memory and cache bandwidth on the Sun. The existing generation of multi-processor servers, like the SGI Power Challange, typically use a shared memory bus that does not scale well as more processors are added. The next generation of servers are more likely to use memory crossbar switches, which have better bandwidth and scalability properties, and should therefore have less need for agressive cache optimization.
Parallelizing the Capacitance Matrix Technique
The NLOM gets its intrinsic efficiency from the use of Lagrangian layers in the vertical and a semi-implicit treatment of gravity waves. A semi-implicit model treats a linearized pressure gradient implicitly within an otherwise explict model. The time step is then independent of gravity wave speed, but the continuity equation becomes 3-D elliptic. The NLOM is designed so that the 3-D elliptic equation can be transformed to modal space, where it becomes a decoupled set of 2-D Helmholtz's equations (one per mode) [15] . The internal modes equations can each be solved with 4 to 10 iterations of Red-Black SOR, a classical iterative method with good scalabilty properties, but the single external mode equation is less well conditioned and is solved by the Capacitance Matrix Technique, CMT. Since explicit finite difference code is highly scalable, the overall scalability of NLOM depends on how scalable the CMT solver is. The fully scalable alternative to a semi-implicit time step is a split-explicit scheme that uses a smaller time step for the external gravity mode only. The split-explicit scheme is not competitive with semi-implicit on a single processor, because of its much higher operation count, but its superior scalability makes it relatively more efficient on a large number of processors. Thus our initial design goal was to make the CMT sufficiently scalable so that the semi-implicit model would out perform an equivalent split-explicit version. This goal has been exceeded, and NLOM is certainly still the most efficient existing basin-scale ocean model on any machine with up to hundreds of processors or even, if the problem is large enough, a few thousand processors.
The CMT [3] , is a method for extending fast direct solvers for Helmholtz's equation over a separable region, such as a rectangle, to arbitrary bounded regions. Perhaps the simplest fast direct method is now usually called FACR(O) [6] . For an IH by JH rectangle, it involves JH independent forward FFT's each of length IH, followed by IH independent solves of tridiagonal systems of length JH, and finally JH independent reverse FFT's of length IH. The CMT involves a FACR(O) solve, followed by a boundary correction via the solution of a dense set of linear equations at boundary points, and finally a FACR(O) solve to apply the correction to the entire region. The Capacitance matrix contains every boundary node that is inside the rectangle, and so it can be very large. For a l/32nd degree global region it is about an 100,000 by 100,000 matrix. However the matrix is not time dependent, so it can be inverted once per region and then one matrix-vector multiply used to solve the system every time step. The memory required for the Capacitance matrix is large, but its per time step cost is low and highly scalable. Thus the scalability of CMT depends on the Figure 1 plots wall time against number of processors for the test case on a SGI Power Challenge. This is a close fit to Amdahl's Law for 98% parallelization. The measured performance drops below the theoretical curve for 16 processors. This could either be due to load imbalance or to a lack of memory bandwidth. Figure 2 plots total time against number of processors for the test case on a Cray T3D. Divide total time by the number of processors to get wall time. Perfect scalability would produce a horizontal curve. The model is 99.2% parallel. The explicit finite difference code is essentially 100% parallel, as expected. The SOR and CMT solvers take about 18% and 9%, respectively, of the total time and show good scalability. All the rest of the model, the overhead, represents 5% of the total on 4 nodes but 24% on 64 nodes. Figure 3 plots total time against number of processors for the test case CMT solver on a Cray T3D. The large jump in total time on 64 nodes is primarily due to the fact that this solver is based on a 512 by 384 element array, rather than the model's 512 by 288 array. The original array can't be used on 64 nodes because 288 is not a multiple of 64. As expected, the matrix-vector multiply and FFT phases are 100% parallel. The transposes between ID and 2D are a relatively small percentage of the total solver time, and show good scalability. The pipelined tridiagional solver has some load inbalance (from the pipelining) and many short off-chip transfers. Even so, scalability is quite good. The rest of the CMT solver has a very small operation count, but requires communication and is the factor that limits the overall scalability of the solver as a whole. Figure  4 plots total time against number of processors for the test case SOR solver on a Cray T3D. As usual, the finite difference code is 100% parallel. The halo updates are not perfectly scalable, but are a small fraction of the total even on 64 nodes. The calculation of the L2 norm of the residual requires global communication, and is the factor that limits the overall scalability of the solver as a whole. The requirement that the model be bit-for-bit reproducable on an arbitrary number of processors prevents us from using the fastest global sum algorithms. Figure 5 plots the percentage of total time against number of processors for the major overhead tasks on a Cray T3D. These are all involved in boundary condition processing. Conventional boundary condition processing, for no-slip coastlines and zeroing out land areas, scales well and is only about 1% of the total time. Halo updates scale less well, but are only 3% of the total time on 64 nodes. The global region has many friction patches to control flow through straits that are not well resolved on the model grid. These are less scalable than halo updates, but also represent only 3% of total time on 64 nodes. By far the largest factor in overall scalability are the port boundary conditions. These are applied at 65N in the Atlantic to simulate the effects of the GINSea, which is not included in the model, and in particular to drive the global thermohaline circulation. The operation count for the ports is very small, but it is all "serial" code and therefore 0% scalable. Additional optimization of port routines is planned for the next release of NLOM. Figure 6 plots the percentage of total time against number of processors for the overhead tasks on a Cray T3D that are not on figure 5 . The halo update curve is on both figures as a point of reference. The model's formulation of mixing requires the calculation of the region-wide mass for each layer every time step. This is similar to the L2 norm calculation in the SOR solver, with similar scalability properties, but is only 1.4% of the total time on 64 nodes. Data sampling is performed every few days throughout the simulation, to produce statistics on transport across sections, mixing in subregions, etcetera. The operations are not very scalable, but are relatively infrequent and represent only 2.5% of the total time on 64 nodes. The model writes out a restart record containing all prognostic variables every month. In principle, this I/O is performed in parallel on the T3D but scalability is poor. The decrease in time from 4 to 8 processors is due to the fact that smaller buffers had to be used on four processors because of memory constaints.
SUMMARY
The scalable NLOM implementation has achieved its design goals of running exactly the same model source code and model data files on a wide range of computer systems from many vendors. Scalability is based primarily on the tiled data parallel parallel programming paradigm. This is sufficiently general that the actual technique used on a given machine to obtain scalability can be selected at compile time from: (i) data parallel, (ii) SPMD message passing, (iii) autotasking, or (iv) SPMD message passing between multi-processor autotasked systems. Times from actual practical model runs presented in this report demonstrate good scalability across the range of processor complexes available within DoD today. As larger machines become available, problem sizes (i.e. model resolution) will increase and scalability should be possible up to at least 2048 processors.
