Timings have been obtained for several benchmark test cases using the same code on a variety of common hardware platforms and compilers. The results indicate surprisingly little variation in performance given the considerable number of vendors, architectures and compilers. The largest differences amounted to less than 30% in run-time. For the single-processor runs, an increase in cache size reduced run-times, though not dramatically (approximately 10%). Going from 32 bits to 64 bits (with the same clock speed, cache size, memory and compiler) in most cases produced a gain of 10%, although in some cases no gain was recorded. Overall, the chip with the slowest clock speed (Intel Itanium II at 1.50 GHz) achieved the best performance. In some cases, it beat the 64-bit, 3.40 GHz P4/Xeon machines by a considerable margin. For shared-memory parallelism, the best scaling was achieved by the SGI Altix. This should not come as a surprise, as SGI's CCNUMA technology has matured over the last decade. For the AMD Opteron, the SUN compiler exhibited the best scaling.
I. INTRODUCTION
The continuous advance and update of computer hardware (microchips, memory, motherboards) and software (compilers, field solvers) make it almost impossible to compare the large variety of options available in the market. It is often difficult for a user of Computational Fluid Dynamics (CFD) solvers to assess the relative merit of these different offerings from published BLAS [2], LAPACK [5], SPEC [17] or NAS Parallel Benchmarks [14] figures. In order to gain some insight into the state of hardware and software options for scientific computing at the end of 2006, a series of CFD benchmarks was conducted on commonly used hardware and compilers.
II. CFD CODE
The CFD code used for the benchmarks was FEFLO. FEFLO was conceived as a general-purpose CFD code based on the following principles:
- Use of unstructured grids (automatic grid generation and mesh refinement);
- Finite element discretization of space;
- Separate flow modules for compressible and incompressible flows;
- ALE formulation for moving grids;
- Embedded or immersed formulation for dirty CAD/cracks/shock-structure interaction;
- Edge-based data structures for speed;
- Optimal data structures for different architectures;
- Bottom-up coding from the subroutine level to assure an open-ended, expandable architecture;
- Standard Fortran-77 for maximal portability.
The code has had a long history of relevant development and applications [1, 6-13, 15, 16]. As far as timing is concerned, the CPU-intensive loops are over edges (fluxes, Riemann solvers, artificial viscosity, etc.). The code renumbers points, elements and edges to minimize cache misses while avoiding memory contention for pipelining and shared-memory parallel execution [6].
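The role of edge loops and renumbering can be illustrated with a minimal Python sketch. This is not FEFLO code (FEFLO is written in Fortran-77): the function names, the simple difference flux, and the breadth-first renumbering are assumptions chosen for illustration, and the grouping of edges that avoids memory contention is not shown.

```python
from collections import deque


def edge_loop(edges, u, coeff):
    """Accumulate a simple edge flux coeff*(u[j] - u[i]) into point residuals."""
    rhs = [0.0] * len(u)
    for (i, j), c in zip(edges, coeff):
        flux = c * (u[j] - u[i])
        rhs[i] += flux  # both end points of the edge are read and written
        rhs[j] -= flux
    return rhs


def bfs_renumber(npoin, edges, start=0):
    """Return new_id[old_id] from a breadth-first traversal of the mesh graph."""
    nbrs = [[] for _ in range(npoin)]
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    new_id = [-1] * npoin
    count = 0
    for seed in [start] + list(range(npoin)):  # also covers disconnected parts
        if new_id[seed] != -1:
            continue
        new_id[seed] = count
        count += 1
        queue = deque([seed])
        while queue:
            p = queue.popleft()
            for q in sorted(nbrs[p]):
                if new_id[q] == -1:
                    new_id[q] = count
                    count += 1
                    queue.append(q)
    return new_id


def renumber_edges(edges, new_id):
    """Apply the point renumbering and sort edges by their lowest point."""
    return sorted(
        (min(new_id[i], new_id[j]), max(new_id[i], new_id[j])) for i, j in edges
    )


def max_edge_span(edges):
    """Largest index distance between the two end points of any edge."""
    return max(abs(i - j) for i, j in edges)


# Hypothetical 5-point chain with scrambled point numbers.
edges = [(0, 3), (3, 1), (1, 4), (4, 2)]
new_edges = renumber_edges(edges, bfs_renumber(5, edges))
```

A smaller index distance between the end points of an edge means that the data touched by consecutive edges lies closer together in memory, which is the effect the renumbering aims for.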
III. BENCHMARKS
In order to cover a variety of solver techniques and the associated algorithms, the following benchmarks were selected: NACA0012 Wing (incompressible, Euler), Generic Missile (compressible, Euler), and LPD 17 Ship.
Figures 1c,d Lift Force and Residual History
The projection scheme [7] with three explicit stages per timestep [10, 12] for the advective velocity prediction was run for 50 timesteps. At this point, for the coarser mesh, the lift force is converged to better than 2%, as seen from Figure 1c, and the residual for the pressures has dropped by more than 3 orders of magnitude. The main CPU-intensive operations are the limiting of advective fluxes and the Poisson solver for the pressure.

[...] with local timesteps, a Courant number of C = 10 and 10 inner sweeps was run for 100 timesteps. At this point, the residual for the coarser grid has been lowered by five orders of magnitude (see Figure 2c). The lift force converges much more quickly, as seen from Figure 2d. The main CPU-intensive operations are the limiting of fluxes, the HLLC solver and the LU-SGS sweeps.
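For readers unfamiliar with local timestepping, the sketch below shows the textbook definition for compressible flow, in which each point advances with its own timestep dt = C h / (|v| + c). Whether the benchmark code uses exactly this form is an assumption; the function name and the element size, velocity and speed-of-sound inputs are hypothetical.

```python
def local_timesteps(h, speed, sound, courant=10.0):
    """Per-point timestep dt_i = C * h_i / (|v_i| + c_i).

    h      : characteristic element size at each point
    speed  : flow speed |v| at each point
    sound  : speed of sound c at each point
    courant: Courant number C (10 for the runs described above)
    """
    return [courant * hi / (abs(vi) + ci) for hi, vi, ci in zip(h, speed, sound)]


# Hypothetical values: a small, fast-flow point and a larger, stagnant one.
dt = local_timesteps([0.1, 0.2], [1.0, 0.0], [1.0, 2.0])
```

Small elements and high wave speeds then no longer dictate the timestep of the whole mesh, which is why local timestepping is commonly used when only the steady state is of interest.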
IV. MACHINES
The machines that were used for the benchmarks, together with their characteristics, are listed in Table 1. Note that all the Intel P4/Xeon CPUs have hyperthreading, and that the AMD Opterons are dual core. We tried running in shared-memory mode (OpenMP) on the hyperthreaded machines, but the gains in speed were only of the order of 15%. However, on the same machines, running two separate jobs at the same time produced negligible degradation for each, i.e. the gain in throughput was almost 100%. The dual-core AMD Opterons scaled very well in shared-memory mode (OpenMP).
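The difference between the two hyperthreading figures is one of throughput, and the arithmetic can be sketched as follows. The function names are hypothetical; the 15% and "negligible degradation" inputs are the ones quoted above.

```python
def openmp_gain(speedup):
    """Throughput gain when a single job runs `speedup` times faster."""
    return speedup - 1.0


def concurrent_gain(njobs, slowdown):
    """Throughput gain when `njobs` independent jobs run at the same time and
    each takes (1 + slowdown) times its stand-alone run-time."""
    return njobs / (1.0 + slowdown) - 1.0


# One OpenMP job on two hyperthreads, about 15% faster than on one thread.
gain_openmp = openmp_gain(1.15)

# Two separate jobs with negligible (here: zero) degradation for each.
gain_two_jobs = concurrent_gain(2, 0.0)
```

With these inputs the OpenMP run yields a gain of 0.15 and the two concurrent jobs yield a gain of 1.0, i.e. the 15% and almost 100% figures reported in the text.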
V. COMPILERS
The following compilers and compiler flags were used for the benchmarks:
flags: -fsloppy-char -i4 -r8 -fzero -O3
All the multiprocessor cases were run in shared-memory (OpenMP) mode. This implied that the user did not have to go through the usual distributed-memory steps of splitting the domain, running in parallel and assembling the results at the end. The benchmarks shown are of fixed mesh topology, so they could have been run in distributed-memory mode (the CFD code has such options). However, many of the applications usually run with this CFD code involve adaptive mesh refinement, remeshing, or regions of the flow where CPU requirements change drastically (e.g. particles or chemically reacting flows). For these cases, the distributed-memory option becomes rather involved, and the shared-memory option has been the preferred choice of users.
VI. TIMINGS
We have decided to show the timings in tabular form, as this is the only way to have all details present simultaneously. The list starts with runs carried out on one processor using the different hardware configurations and compilers, and then proceeds, in ascending number of processors, to the OpenMP shared-memory runs. These same tables have also been compiled in a different format at the end of the paper.
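When reading the multiprocessor entries, it may help to convert run-times into speedup and parallel efficiency. The sketch below uses the usual definitions S(n) = T(1)/T(n) and E(n) = S(n)/n; the function name and the timings in the example are hypothetical and are not taken from the benchmark tables.

```python
def scaling_table(timings):
    """Return rows (nproc, seconds, speedup, efficiency).

    timings: dict mapping number of processors to wall-clock seconds;
             it must contain the 1-processor run used as reference.
    """
    t1 = timings[1]
    rows = []
    for n in sorted(timings):
        speedup = t1 / timings[n]
        rows.append((n, timings[n], speedup, speedup / n))
    return rows


# Hypothetical run-times in seconds, for illustration only.
rows = scaling_table({1: 800.0, 2: 420.0, 4: 230.0})
for nproc, seconds, speedup, efficiency in rows:
    print(f"{nproc:3d} {seconds:8.1f} {speedup:6.2f} {efficiency:6.2f}")
```

An efficiency close to 1 indicates near-ideal scaling, while values well below 1 point to memory contention or serial sections.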
NACA0012 Wing (Incompressible, Euler)
We note that for this coarse mesh the largest differences in recorded run-times are due to compilers, cache size (10% gain) and 32 bits vs. 64 bits (another 10% gain). The SUN compiler, while worst on the uniprocessor P4 machine, performs very well on the AMD Opteron, and also exhibits the best scaling for the OpenMP runs.
We note that for this fine mesh the results mirror, to a large extent, the runs on the coarse mesh. The 10% gain due to cache size increase is again present. The switch from 32 bits to 64 bits does not produce a performance increase. The reason is that the code has all real variables declared as real*8. As before, the SUN compiler, while worst on the uni-processor P4 machine, performs admirably on the AMD Opteron, and also exhibits the best scaling for the OpenMP runs. The SGI Altix, while powered by the (slow) Itanium II chip, achieves the best performance when using more than 2 processors.
Generic Missile (Compressible, Euler)
For this coarse mesh the largest differences in recorded run-times are due to compilers, cache size (10% gain) and 32 bits vs. 64 bits (another 10% gain). As in the first benchmark, the SUN compiler, while worst on the uni-processor P4 machine, performs very well on the AMD Opteron, and also exhibits the best scaling for the OpenMP runs.
For this fine mesh the results mirror, to a large extent, the runs on the coarse mesh. The 10% gain due to cache size increase, as well as the 10% gain from the 32/64 bits switch, are again present. As before, the SUN compiler, while worst on the uni-processor P4 machine, performs remarkably well on the AMD Opteron, and also exhibits the best scaling for the OpenMP runs. The SGI Altix, while powered by the (slow) Itanium II chip, achieves the best performance for any number of processors.
LPD 17 Ship
This large case did not fit into the machines with smaller memory. Therefore, we only report the results for the machines with larger memory.
For this fine mesh the Intel compiler performed best on the AMD Opteron, even for the multiprocessor runs. The larger-memory SGI Altix, while powered by the (slow) Itanium II chip, achieves the best performance, no matter how many processors are used. We also remark that this case clearly shows the (obvious) importance of large memory for large problems.
VII. CONCLUSIONS
Typical compressible and incompressible solvers/schemes were tested on a number of platforms using a variety of compilers. In general, the benchmarks indicate surprisingly little variation in the performance achieved across vendors, architectures and compilers. The largest differences amounted to less than 30% in run-time. The low dependency of performance on the compiler used indicates that most compilers take advantage of the hardware features offered, and that most of them have achieved near-optimal performance on these architectures. For the single-processor runs, an increase in cache size reduced run-times, though not dramatically (approximately 10%). Similar findings were reported by Guler and Ali [4]. Going from 32 bits to 64 bits (with the same clock speed, cache size, memory and compiler) in most cases produced a gain of 10%, although in some cases no gain was recorded. Overall, the chip with the slowest clock speed (Intel Itanium II at 1.50 GHz) achieved the best performance. In some cases, it beat the 64-bit, 3.40 GHz P4/Xeon machines by a considerable margin. For shared-memory parallelism, the best scaling was achieved by the SGI Altix. This should not come as a surprise, as SGI's CCNUMA technology has matured over the last decade. For the AMD Opteron, the SUN compiler exhibited the best scaling. The dual-core AMD Opterons scaled very well in shared-memory mode (OpenMP), even for the cases that required considerable memory (LPD 17 Ship).
VIII. ACKNOWLEDGMENTS
It is a pleasure to acknowledge the help of Stan Posey and John Gorski from SGI, as well as A.J. Jackson from Western Scientific, Inc., not only for providing computer resources, but also for guidance in setting compiler directives and system/run-time environment variables, which are sometimes less than obvious to the novice user.

 4  ----------  270  8.2  --
32  49  16.7  --------------
64  34  24.1  --------------
96  24  34.1  --------------
