Traditionally, symmetric multiprocessors have used modest numbers of processors. Since many of them were bus-based systems, they inherently lacked scalability to what might be referred to as moderate-sized systems. With

Introduction
Several supercomputer architectures are viable today. MPPs, such as the Cray T3E, offer a large number of processors, each with its own nonshared memory. In MPP machines, when one processor needs to access data in the memory of another processor, the processor that "owns" the data must explicitly send the data to the requesting processor.*
In contrast to distributed memory architectures, symmetric multiprocessor (SMP) machines, such as the Sun E10000, share memory among all the processors. In between these two examples is the SMP cluster (such as an IBM SP). Here, a small number of processors (e.g., 2-16 in the various implementations of the IBM SPs configured with SMP nodes) share memory, and the machine is made up of a large number of these SMP nodes. As in more traditional MPPs, explicit cooperation between two processors is required to transfer data from one SMP node to another.
Another intermediate architecture is the cc-NUMA machine, such as the SGI Origin 2000, where all the memory is logically shared but physically distributed. Here, two processors (one node) share local memory, but any processor can access all memory locations in the machine without the aid of any other processor.
There can be significant differences in the designs and implementations of this class of system from vendor to vendor. As a result, some systems are much better suited for certain classes of problems; systems from SGI are heavily marketed in the scientific computing market, while systems from HP, Sequent, and Data General are more frequently marketed to the commercial/database market.
Several programming models exist today, and each is supported on one or more computer architectures. While MPI was developed for distributed memory machines (MPPs), it can be and has been implemented on SMP and shared memory machines. Writing shared memory code is perhaps easier than writing MPI code, but many codes today are written in MPI due to the popularity of MPP machines over the last several years. When an MPI version of a code already exists, the programmer might as well consider using it, even if it would not be their choice if writing the code from scratch. It then becomes a performance question as to whether a shared memory version or an MPI version of the code is most suitable on a non-MPP machine that provides efficient support for MPI, as almost all machines now do.

* When using SHMEM (or equivalent) calls on systems that support them, programs may explicitly instruct processors to either put data into the memory of other nodes, or get data from the memory of remote nodes. However, this is very different from cache-coherent shared-memory symmetric multiprocessors, where the data resides in a globally accessible/coherent memory system accessed automatically using standard load and store instructions.
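The difference between the two programming models can be illustrated with a small Python analogy. This is only a sketch under stated assumptions: Python threads and a queue stand in for processors and an interconnect, the function names are hypothetical, and nothing here uses MPI or SHMEM.

```python
import queue
import threading


def message_passing_sum(data):
    """Message-passing style: the owner must explicitly send its data."""
    channel = queue.Queue()  # stands in for the interconnect (assumption)

    def owner():
        channel.put(list(data))  # explicit send by the worker that owns the data

    worker = threading.Thread(target=owner)
    worker.start()
    received = channel.get()  # explicit receive by the requesting side
    worker.join()
    return sum(received)


def shared_memory_sum(data):
    """Shared-memory style: the requester reads the data directly."""
    result = []

    def requester():
        # Ordinary reads of shared data; the owner takes no action at all.
        result.append(sum(data))

    worker = threading.Thread(target=requester)
    worker.start()
    worker.join()
    return result[0]


if __name__ == "__main__":
    values = [1.0, 2.0, 3.0, 4.0]
    print(message_passing_sum(values), shared_memory_sum(values))
```

Both functions compute the same result; the point of the sketch is only who has to act. In the first, the owning worker takes part in every transfer, while in the second the data is reached with ordinary reads.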
As the performance runs presented in this report show, no single machine has a monopoly on the best performance with all programming models. While the Cray T3E does very well on MPI codes, it cannot run most shared memory codes. While some other machines can run all programming models, their performance varies, with each machine performing best on one code or another.
The purpose of this report is not to explain the results or conclude that one machine is better than another. Rather, its sole purpose is to document the results that different groups have reported, so readers are better equipped to come to their own conclusions about the merits of the hardware, programming paradigms, and other related issues. Furthermore, while some of the codes mentioned in this report were tuned for one or more of these machines, tuning can be a major undertaking. As a result, for HPC codes that are commercially available and/or maintained by other sites, the authors of this report have little or no ability to tune them for the specific machines. Instead, the developers of those codes tuned their own codes.
The authors made these measurements with as little bias as possible. In fact, many of these results came from benchmarking associated with procurement efforts (all such data in this report came from runs done in-house). Additionally, the selection of results to report was based on the perceived importance and merits of the codes in question; no results were excluded because they violated a preconceived notion. As such, there are examples of different machines excelling for different codes. Some readers may wish to consider issues such as cost effectiveness, but this report does not include any cost data. Most likely, the faster machine is not always the most cost effective.
Other issues not addressed in this report or only briefly addressed are as follows:
(1) the stability of systems, (2) the scalability of systems to very large numbers of processors, (3) problems with the compilers and/or the operating systems, (4) the relative merits of the input/output (I/O) systems, (5) issues involving the queuing of jobs, (6) the requirements of the highly varied user community that uses the resources provided by the DOD HPCM Program, and (7) performance, profiling, and debugging tools.
Brief Observations
The following observations have been collected from a number of sources.
• HPF runs better on the SGI Origin than on the IBM SP (Wierschke 1997).
• HPF runs best on the Cray T3E since the Portland Group first implements new ideas on it (Shires 2000).
• In theory, jobs that run well on the SGI Origin should run well on the Sun HPC 10000. In practice, some codes would not compile, others would not run (at first), and many required some degree of tuning.
• Care should be taken to avoid "overloading" (more processes/threads actively running than there are processors) any of the shared memory systems, since overloading can result in significant performance degradation and a significant increase in CPU time.
• By itself, automatic parallelization is frequently of limited value; however, it may improve the performance of some programs parallelized using compiler directives.
• Many codes run well on either the Sun or SGI systems, showing reasonable levels of performance and scalability.
• Some codes will show significantly better per processor and/or overall performance on the SGI Origin than on either the Cray T3E or the IBM SP with P2SC processors.
• The performance of the Sun HPC 10000 is frequently reported to be between that of the SGI Origin 2000 with 300-MHz R12000 processors and the SGI Origin 2000 with 195-MHz R10000 processors.
• For some vectorizable codes, the shared memory programming paradigm is an excellent choice for parallelizing programs that are difficult to parallelize.
• For some codes, HPF is still the most natural programming paradigm (Mohan 1999).
• For projects requiring high levels of scalability (e.g., 128 or more processors), the IBM SP or the Cray T3E are better choices (Namburu 1999).
• Large MPPs tend to have stability problems; 128-processor Origins are particularly susceptible to periods of instability.
• Some performance differences are caused by design tradeoffs. The data show that some of these design tradeoffs sacrifice efficiency for peak speed and vice versa. Both approaches are of value and need to be considered when evaluating the merits of different systems.
Performance
Figures 1-6 and Tables 1-10 show performance results from various sources. Some of these runs were made explicitly for benchmarking the performance of a particular system, others were made as part of a porting/tuning effort, and a few were made for other reasons. As such, no systematic attempt has been made to identify why a particular code runs faster on one machine than another. The authors assume that in some cases, additional tuning could improve the performance of a particular code on a particular machine. However, such tuning is beyond the scope of this report. Furthermore, when a code is not locally written/maintained, there may be little or no opportunity for the user to tune it.
In the following CTH benchmark runs for Figure 1 and Table 2, the number of computational cells was increased by a factor of 2 each time the number of processors was doubled. This was done to maintain a constant number of computational cells per processor, which keeps the computation to communication ratio constant. In this set of benchmarks, the number of iterations was fixed, meaning that perfect scaling results in constant benchmark run times. The difference in the run time on the 64-processor Origin 2000 and the 128-processor Origin 2000 is the direct result of the increase in the average memory latency as one increases the size of an Origin 2000.
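Because perfect scaling in this setup means constant run time, the scaling efficiency of such a run is simply the ratio of the smallest-configuration run time to each larger one. The following is a minimal sketch of that calculation, not code from the report; the function name and the timings are hypothetical and chosen only for illustration.

```python
def weak_scaling_efficiency(run_times):
    """Return efficiency per processor count for a scaled-size benchmark.

    run_times maps processor count -> wall-clock seconds, where the problem
    size grows in proportion to the processor count. An efficiency of 1.0
    means the run time stayed constant (perfect scaling).
    """
    base_procs = min(run_times)  # smallest configuration is the reference
    base_time = run_times[base_procs]
    return {procs: base_time / run_times[procs] for procs in sorted(run_times)}


if __name__ == "__main__":
    # Hypothetical timings in seconds; NOT measurements from this report.
    timings = {8: 100.0, 16: 102.0, 32: 105.0, 64: 110.0, 128: 125.0}
    for procs, eff in weak_scaling_efficiency(timings).items():
        print(f"{procs:4d} processors: efficiency {eff:.2f}")
```

An efficiency that drops only at the largest processor count would be consistent with the kind of latency effect described above, although the calculation by itself cannot identify the cause.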
For the runs in Figure 2 and Table 3, the grid was incrementally refined by decreasing the characteristic cell length in each direction by the cube root of two each time the number of processors doubled. In these runs, the number of iterations was not fixed. Instead, the number of iterations increased by a factor of approximately the cube root of two each time the number of processors doubled, since the time step decreases as a result of the finer mesh. When ideal scaling occurs, the Grind Time will decrease by half every time the number of processors is doubled. This results from the units of Grind Time being microseconds/zone/cycle. Since the time per cycle is expected to remain constant and the amount of work per cycle doubles, the amount of time/zone/cycle should be halved. The amount of time/cycle should remain constant, as in Table 2. It is worthwhile noting how closely the performance of these runs matches the ideal performance. Additionally, the performance of the 300-MHz Origin 2000 and the 400-MHz Sun HPC 10000 is very similar for both these runs and those involving F3D (see Figures 3 and 4 and Tables 4 and 5).
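The arithmetic behind the ideal Grind Time can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, and the formulas assume a three-dimensional grid and the definition of Grind Time as microseconds per zone per cycle given above.

```python
import math


def grind_time_us(wall_seconds, zones, cycles):
    """Grind time in microseconds per zone per cycle."""
    return wall_seconds * 1e6 / (zones * cycles)


def refined_zone_count(base_zones, base_procs, procs):
    """Zone count after shrinking the cell length by 2**(1/3) per doubling.

    Shrinking the cell length by the cube root of two in each of three
    directions doubles the number of zones.
    """
    doublings = math.log2(procs / base_procs)
    length_factor = 2.0 ** (doublings / 3.0)  # per-direction refinement
    return base_zones * length_factor ** 3


def ideal_grind_time_us(base_grind_us, base_procs, procs):
    """Ideal grind time: halves each time the processor count doubles."""
    return base_grind_us * base_procs / procs


if __name__ == "__main__":
    # Hypothetical reference point; NOT a measurement from this report.
    base = grind_time_us(wall_seconds=200.0, zones=1_000_000, cycles=100)
    for procs in (8, 16, 32, 64):
        zones = refined_zone_count(1_000_000, 8, procs)
        ideal = ideal_grind_time_us(base, 8, procs)
        print(f"{procs:3d} processors: {zones:12.0f} zones, ideal {ideal:.3f} us")
```

With the time per cycle held constant and the zone count doubling, dividing by twice as many zones is what halves the ideal value at each step.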
Figures 3 and 4 and Tables 4-6 show the performance of two different versions of the implicit CFD code F3D for three problem sizes. The problem sizes range from 1-million to 206-million grid points without a significant decrease in the per processor performance. This is an indication that it is possible to tune an HPC code for a cache-based architecture. Tables 7 and 8 contain results for two other CFD codes.
Figures 5 and 6 and Tables 9 and 10 demonstrate the effect on performance and the waste of CPU time that can occur when an SMP becomes overloaded. The program used for these measurements was the shared memory version of F3D. It ran the 1-million grid point test case.
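A simple guard against this kind of overloading is to compare the number of workers a job requests with the number of processors available before launching it. The sketch below is a hedged illustration, not a tool used in this report: the function names are hypothetical, and `os.cpu_count()` reports the processors in the machine, not the ones left idle by other users' jobs.

```python
import os


def overload_factor(active_workers, processors=None):
    """Ratio of active processes/threads to processors (above 1.0 = overloaded)."""
    if processors is None:
        processors = os.cpu_count() or 1  # fall back to 1 if undetermined
    return active_workers / processors


def capped_worker_count(requested, processors=None):
    """Limit a requested worker count to the number of processors."""
    if processors is None:
        processors = os.cpu_count() or 1
    return max(1, min(requested, processors))


if __name__ == "__main__":
    # Hypothetical job: 96 threads requested on a 64-processor system.
    print(overload_factor(96, processors=64))
    print(capped_worker_count(96, processors=64))
```

On a shared system, the processor count passed in should reflect what the job can actually use, for example the size of its queue allocation, since other jobs also contribute to the load.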
Summary
We have provided a number of observations and performance data from a variety of sources for a number of representative codes. These codes were run on the SGI Origin 2000 and the Sun HPC 10000. In many cases, there were also runs made on other commonly used HPC systems. Additionally, some of the tables provide comparisons of the performance achievable when using various programming paradigms. The last two tables demonstrate the inefficiency of allowing an SMP to become overloaded. It is hoped that this report and, in particular, the figures and data tables will enable the reader to better evaluate the merits of these systems in relation to his or her needs.

a The job size was scaled in proportion to the number of processors (Kimsey et al. 1998; Schraml and Kimsey 2000).
b The ideal values are extrapolated from the performance of runs using eight processors.
Hisley et al. (1998).
b The shared memory implementation combined compiler directives and the automatic parallelization facility (-pfa).
c The 31-processor PVM run was not made because it was too difficult to decompose the grids with the available tools.
