This paper introduces an industry-strength, multi-purpose benchmark: Shamrock. Developed at the Atomic Weapons Establishment (AWE), Shamrock is a two-dimensional (2D) structured hydrocode. One of its aims is to assess the impacts of a change in hardware, and (in conjunction with a larger HPC Benchmark Suite) to provide guidance in the procurement of future systems.
INTRODUCTION
There are currently a number of High Performance Computing (HPC) architecture options available, or becoming available, to the high-end HPC application user. Intel's Westmere has been available since the beginning of the second quarter of 2010 and, together with Nehalem, accounts for 54.6% of all the machines on the current (November 2010) Top500 list [1]. The POWER7 processor is the latest evolution of IBM's long-running POWER series. Released in 4Q09 in its server variant, it follows on from the successful POWER5 and POWER6 ranges, which account for 18 machines currently on the Top500. Although no POWER7-based HPC system is yet on the Top500 listing, 2011 will see the POWER7 used as the building block for the 10 PetaFlop (PFlop) Blue Waters system at the National Center for Supercomputing Applications (NCSA) [2]. The third generation of IBM's BlueGene architecture, BG/Q, will be at the heart of the 1.6 million core Sequoia platform. Sequoia, a 20 PFlop platform, will be based at Lawrence Livermore National Laboratory (LLNL), and is due into production in 2012. This follows the first and second BlueGene generations, BG/L and BG/P, with ten combined platforms occupying the Top500, which, in conjunction with the POWER series, accounts for 8% of the Top500 systems. AMD's 8- and 12-core fourth generation Opteron processor, 'Magny-Cours', was introduced to the market in March 2010. In its second, third and fourth generations, the Opteron accounts for 11.4% of the Top500 list, the most prevalent being its third generation Barcelona chip in its quad-core form. In addition to these, Graphics Processing Units (GPUs) offer a way to build extremely large HPC platforms with an increased Flops/Watt ratio. GPU-accelerated machines already occupy Number One and Number Four on the November 2010 Top500 list [1], with 11 systems in total being GPU accelerated.
With such a diverse range of heterogeneous architectures coming to the fore, the challenge for HPC practitioners is complex. Understanding which hardware is best suited to their application mix involves weighing a range of competing concerns, including application performance, algorithm re-coding, effective scheduling and associated running costs (e.g. power and cooling).
This paper introduces an industry strength benchmark: Shamrock. Developed at the Atomic Weapons Establishment (AWE), Shamrock is a two dimensional (2D) structured hydrocode. One of its aims is to assess the impacts of a change in hardware, and (in conjunction with a larger HPC Benchmark Suite) to provide guidance in procurement of future systems. The code has been ported to 97.8% of the processor families that make up the Top500 list, and executed on 42% of the exact processor types.
A representative test case and problem size is discussed, and subsequently executed on a high-end workstation for a range of compilers and MPI implementations. Based on these observations, specific configurations are built and executed on a selection of HPC processor generations. These include Intel's Nehalem and Westmere micro-architectures, IBM's POWER-5, POWER-6, POWER-7, BG/L, BG/P, and AMD's Opteron chip set. Comparisons are made between these architectures for the Shamrock benchmark, and the relative compute resources that deliver a similar time to solution are specified, along with their associated power budgets.
Additionally, performance comparisons are made for a port of the benchmark to a Nehalem X5550-based, 100MbE-connected cluster accelerated with Tesla C1060 GPUs. Details of the port are given, along with extrapolations of the performance that could be achieved by fuller exploitation of the GPU.
The remainder of this paper is organised as follows: Section 2 provides background on the Shamrock benchmark, the purpose of the benchmark, a description of the test case of interest, description of related work, and the uniqueness of the benchmarking and predictions of this study. Section 3 introduces the performance of the benchmark on a local, high-end, workstation, followed by a discussion of the HPC platforms the code has been ported to. Section 4 looks at the performance characteristics on these HPC platforms, and details the port of the benchmark to a GPU cluster, and its relative performance. Section 5 concludes the paper.
SHAMROCK
Prediction of the dynamic behaviour of materials as they flow under the influence of high pressure and stress is a key field of investigation at AWE. As a result, hydrodynamic simulations account for a large proportion of compute cycles on AWE's HPC systems. Representative benchmarks have existed for many years: 2D hydrodynamic code fragments were part of the original Livermore Loops [3], and have been used in earlier performance studies [4]. More recently, primarily due to the large HPC resources required to execute them, focus has been on three-dimensional (3D) benchmark codes, such as SAGE from LANL [5] and Hydra from AWE [6]. However, although not in the capability regime, 2D hydrodynamics accounts for a significant amount of capacity computing, with finer resolution and growth in the number of CPU hours continually increasing.
To reflect this, Shamrock was developed at AWE as an industry-strength, domain-decomposed, multi-purpose benchmark. It is a 2D structured hydrocode, written predominantly in Fortran 90, using the Message Passing Interface (MPI) as its means of communication between remote domains. The code has been designed for a number of purposes: (i) to assess the impact on code performance of system upgrades to an incumbent architecture; (ii) to be run as part of a larger HPC Benchmark Suite to help assess platform suitability during a procurement cycle; and (iii) to be used to assess current and emerging technologies. Shamrock can be configured to run in Adaptive Mesh Refinement (AMR) and Uniform modes, but for the purposes of this paper only Uniform executions are considered.
The benchmarking and predictive modelling in this paper differs from earlier studies. It is the first benchmarking of the POWER7 platform using a 2D hydrodynamics benchmark; it encompasses the most diverse range of current architectures, compilers, and MPI invocations for such a benchmark, and is distinct in its comparison of these. A number of studies have investigated the use of GPUs as accelerators in the field of 2D hydrodynamics [7], [8], and have reported speed-ups of factors of 70 over a single-threaded CPU. However, this paper not only looks at speedups of the GPU over a single-threaded CPU, but also makes comparisons when running in an accelerated, distributed MPI mode.
A representative test case for this code is an interacting shock wave simulation. In this problem, square inner, middle, and outer regions of ideal gas are initialised at differing densities and energies, such that the inner and outer regions compress the middle region. This gives rise to shock fronts which collide and send the problem Rayleigh-Taylor unstable.
Typical problem sizes are in the range of 300k to 5M cells. A representative problem of approximately 1.05M cells (1024 x 1024) was chosen as a problem size in the middle of this typical range, and by measuring the time to solution for 10 iteration time steps a workable turnaround time for benchmarking purposes was achieved.
As with many hydrodynamics applications, the cell quantities are updated each timestep, so there is little data reuse; the code is therefore known to be memory bound rather than CPU bound.
ARCHITECTURES
Initially the code was developed and tested on a local, high-end workstation for a range of compilers and MPI implementations. Based on these observations, specific configurations were subsequently built and executed on a selection of HPC architectures, including Nehalem, Westmere, POWER-5, POWER-6, POWER-7, BG/P, BG/L, and AMD's Barcelona. The following section describes these architectures, and observations from benchmarking.
Initial Benchmarking
An initial performance assessment was carried out on an Intel Xeon E5405 [9], dual socket, quad core desktop workstation. The hardware has a 12MB L2 cache, a 2.00 GHz clock speed, and a Front Side Bus (FSB) speed of 1333 MHz. Four compilers were chosen: Sun's SunStudio 12.1 [10], Portland Group's PGI 10.1.0 [11], Intel's 11.0.073 Compiler Toolkit [12], and GNU's g95 4.0.3 [13]. For the MPI, two variants were chosen, MPICH2 [14] and OpenMPI [15], although builds were not available for all compiler and MPI permutations. Those that were available, and their respective versions, can be found in Table 1.
The Shamrock build has some self-imposed restrictions on compilation options. These are in place to ensure that the results between architectures and compilers are as numerically comparable as possible. A full list of compiler options used in this study is specified in Table 2. Figure 2 shows the high memory watermark (HMW) of the application during its execution, for the four compilers tested. SUN and GNU show a 5.26% increase in memory over the PGI compiler. However, it would appear that Intel's optimisations require a greater amount of memory to be executed, with a memory footprint 36.84% greater, at 418MB. This is a significant difference, and will impact the problem size that can be fitted into local memory.
When considering the same compiler with a different MPI implementation, in both cases where this is possible (SUN and Intel), the OpenMPI build outperforms the equivalent MPICH2 build, by 4.5% and 8.5% respectively, when averaged over multi-core runs. To gain best performance, an MPI implementation should be optimally tuned for the system in question; this can be achieved by disabling error checking in the MPI builds, or setting configuration options for a specific system. However, in this study both MPI implementations are default builds, so it would appear that OpenMPI has the edge over MPICH2 in handling the shared-message MPI queue that is used by Shamrock when running locally on the workstation.
One POWER7 755 server was benchmarked. Named 'p90', it contains four POWER7 chips, with a clock speed of 3.3 GHz and 128 GB of accessible memory. The system has 4-way SMT enabled by default. This is denoted as Pwr7 (3.3GHz) throughout the remainder of this paper. For all the POWER platforms, MPI is provided via IBM MPI.
From IBM's BlueGene architecture series, access was available to both the first and second generations of the hardware: BG/L and BG/P. BG/L hardware was available via 'uBG/L' (BG/L) based at the Lawrence Livermore National Laboratory (LLNL). uBG/L's building block is the 700 MHz, 32-bit PowerPC (450). 40,960 dual core compute nodes combine to provide 81,920 cores, each with 256 MB / core. This gives 'uBG/L' a peak performance of 229.4 TFlops. Fortran is provided by XL Fortran 10.1.0.4. 'DawnDev' (BG/P), is a BG/P architecture, again based at LLNL. BG/P is built from the 850 MHz 32-bit PowerPC (450d) processor, with four cores per node, and 1 GB/core. As a test system, DawnDev contains 1,024 nodes, giving a peak of 13.9 TFlops. Fortran is provided by XL Fortran 11.1.0.5. For both BlueGene systems, IBM BlueGene MPI is the MPI implementation.
'Hera' is an Appro [19] integrated cluster based at LLNL. Consisting of 790 nodes of AMD quad-core Opteron 8356 (Barcelona) processors, it contains 13,824 cores, with a clock speed of 2.3 GHz, and 32 GB/node, or 2 GB/core. Hera's peak performance equates to 127.2 TFlops. MPI is provided by a build of OpenMPI 1.3.2. The available Fortran compilers are Intel 11.1 and PGI 8.0-1. These are denoted in the text as 8356 Intel and 8356 PGI. Each HPC platform is summarised in Table 3.
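The quoted peak figures for uBG/L, DawnDev and Hera can be cross-checked with a short calculation. The sketch below is illustrative only: it assumes that peak performance is core count times clock speed times four floating-point operations per cycle per core. The factor of four (`FLOPS_PER_CYCLE`) is an assumption, not stated in the text, although it reproduces all three quoted values; the function name `peak_tflops` is hypothetical.

```python
# Theoretical peak = cores x clock (GHz) x flops per cycle per core.
# FLOPS_PER_CYCLE = 4 is an assumption (not given in the text); it is
# chosen because it reproduces the peak figures quoted for each system.
FLOPS_PER_CYCLE = 4


def peak_tflops(cores: int, clock_ghz: float,
                flops_per_cycle: int = FLOPS_PER_CYCLE) -> float:
    """Return the theoretical peak performance in TFlops."""
    return cores * clock_ghz * flops_per_cycle / 1000.0


# (core count, clock in GHz), taken from the platform descriptions.
systems = {
    "uBG/L": (81_920, 0.70),       # 40,960 dual-core nodes at 700 MHz
    "DawnDev": (1_024 * 4, 0.85),  # 1,024 quad-core nodes at 850 MHz
    "Hera": (13_824, 2.3),         # quad-core Opteron 8356 at 2.3 GHz
}

for name, (cores, clock_ghz) in systems.items():
    print(f"{name}: {peak_tflops(cores, clock_ghz):.1f} TFlops")
```

Under this assumption the three results round to 229.4, 13.9 and 127.2 TFlops, matching the figures given for each platform.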
HPC PERFORMANCE
The problem set which was executed locally on the E5405 workstation was run on the HPC architectures and compiler/MPI configurations as detailed in Section 3.2. As with the local builds, to ensure equality of results, restrictions were applied to the compilations. Again, Table 1 can be referred to for compiler specific details. Figure 3 shows the runtimes, on a logarithmic scale, for all of the systems previously described. A number of observations can be deduced from this chart:
Platform Comparisons
The improvement seen on the E5405 for the SunStudio compiler over the Intel compiler carries over to the Nehalem L5530, with runtimes for single-core L5530 Intel BullX 18.8% slower than the L5530 Sun OpenMPI.
The use of IBM's SMT is demonstrated on the Pwr6 (4.2GHz). Using all 16 of the possible SMT threads gives little gain over an 8 core, non-SMT run. However, enabling SMT but utilising only a subset of the total SMT threads gives a speedup factor of 1.42 on 12 threads over 8 threads, which is close to the maximum of 1.5.
Of most interest to the science end-user is the fastest time to solution for the problem. Of all the architectures benchmarked in this study, this is achieved on the X5660 Intel BullX Westmere, on 128 cores for the 1024 x 1024 problem size. To enable additional platform comparisons, a time to solution of 8.24 seconds was selected. This matches the execution time on 64 cores running on the 8356 Intel. Hence, the equivalent number of cores required from each of the architectures benchmarked to match this 8.24 second turnaround can be inferred from Figure 2. This is captured in Table 3, along with the respective power consumption for that number of cores on a given architecture.
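The SMT observation can be restated as a ratio of observed to ideal speedup. The sketch below is a minimal illustration, assuming that the ideal speedup when going from 8 to 12 threads is simply the thread ratio; the function name `smt_efficiency` is hypothetical.

```python
def smt_efficiency(observed_speedup: float, threads: int,
                   baseline_threads: int) -> float:
    """Fraction of the ideal (linear) speedup that was achieved."""
    ideal_speedup = threads / baseline_threads
    return observed_speedup / ideal_speedup


# Values from the text: a factor of 1.42 on 12 threads relative to 8.
efficiency = smt_efficiency(1.42, threads=12, baseline_threads=8)
print(f"ideal speedup: {12 / 8:.2f}, efficiency: {efficiency:.1%}")
```

On these figures, roughly 95% of the ideal gain is realised when 12 of the 16 SMT threads are used.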
The power consumption figures given are extrapolations based on full system runs of LINPACK [20] for the systems specified in the Top500. In the case of the POWER-7, as no systems as yet make the list, this figure is based on the maximum possible power draw for a system, as specified in IBM's p755 Redbook specification [21] . This figure will most certainly be higher than an equivalent LINPACK run.
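One plausible reading of this extrapolation is a linear scaling of the full-system power figure by core count. The sketch below illustrates that reading only; the linear model, the function name `power_for_cores`, and the input values are all hypothetical and are not taken from the study's tables.

```python
def power_for_cores(system_power_kw: float, system_cores: int,
                    cores_needed: int) -> float:
    """Scale a full-system power figure linearly to a subset of cores.

    Assumes power is proportional to core count, which ignores shared
    infrastructure such as interconnect and cooling.
    """
    return system_power_kw / system_cores * cores_needed


# Hypothetical inputs for illustration only: a 10,000 core system that
# draws 1,000 kW under LINPACK, with 64 cores needed for the target time.
example_kw = power_for_cores(1000.0, 10_000, 64)
print(f"estimated power for 64 cores: {example_kw:.1f} kW")
```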
The power consumption figures clearly show an improvement from those architectures of the previous generation (POWER-6 and AMD's Barcelona) to today's latest hardware (POWER-7, Nehalem and Westmere). The BlueGene solution is between these two ranges, although the imminent arrival of BG/Q is expected to greatly reduce this figure.
GPU Comparisons
AWE has a modest GPU test bed architecture, codenamed 'Dexter', consisting of four nodes: one master and three compute. The master node contains two quad-core Nehalem X5550 [22] 2.67 GHz processors and four NVIDIA Tesla C1060 [23] GPUs, where each C1060 contains 240 streaming processor cores with a frequency of 1.3 GHz. The compute nodes each consist of one quad-core X5550; two of the three have four NVIDIA GeForce GTX 285 [24] GPUs, and the third has three GTX 285s and one ATI Radeon HD5870. For the purposes of this study, only the NVIDIA cards were considered, and the C1060 and GTX 285s were treated as one. Dexter runs OpenSUSE as its operating system, with the GNU compiler and OpenMPI providing the Fortran and MPI environments respectively.
To enable a port of the Shamrock benchmark to the GPU, compute intense sections of the code were identified and subsequently turned into kernels. In total eight kernels were identified, accounting for approximately 95% of compute time.
The method deployed for creating the kernels was first to rewrite the original Fortran routines in a simplified form, and then to port these simplified routines to C. Finally, by use of AWE's Acrylic wrapper code parser, data management and placeholder code for the ported C routines were generated, allowing fast utilisation of the OpenCL (Open Computing Language) [26] framework to create GPU executable kernels from the C routines.
At the time of this study, four of these possible kernels have been developed: lagren calculates adiabatic heating on a cell, based on the volume change in the cell and its pressure, using a predictor/corrector method; lagrac calculates nodal accelerations due to pressure gradients and subsequently updates the nodal velocities; lagrqq calculates an artificial viscous pressure around shock waves to smooth out discontinuities and reduce oscillations; and lagrvf calculates the volume fluxes across cell faces, which are later used to carry out the advective remap. These account for 13.55%, 8.06%, 10.78%, and 1.5% of total compute respectively. Although this gives 33.89% of the code's compute resident on the GPU, there are some caveats which preclude claiming a fully distributed, GPU-enabled version of the code. Currently, for each kernel call, all data associated with the kernel is transferred to and from the accelerator device. Although logic has been added enabling data to be shared between kernels, there is still a significant overhead from the data copies. Additionally, a number of boundary conditions are assumed for GPU execution, which severely restricts the range of problems that Shamrock is capable of running. However, due to careful problem selection, the test case described in Section 2, and analysed later, is unaffected by this restriction. Hence some direct comparisons between non-accelerated and accelerated runs of the code can be made.
Figure 4 compares the average execution time for each of the kernels when run in their original (Orig), simplified Fortran (F), C, and OpenCL forms. By re-writing the original Fortran, performance gains are apparent for each of the kernels, with the C versions providing further factors of two for all but one of the kernels. For the OpenCL version, despite no optimisations yet taking place, further performance gains are observed for the same three kernels over the C equivalents.
In the case of the OpenCL kernels, two average iteration times are given for each kernel: the former includes data transfer to and from the device, and the latter is the compute time only of the kernel on the device.
The benchmark was run MPI distributed, in non-accelerated and accelerated modes (the latter including the current transfer (CT) scheme), on Dexter, using the GNU compiler. These figures were then normalised to the performance figures seen for the Intel compiler, as captured in Section 3.1. The resultant figures are captured in Figure 5 as X5550 and X5550 Acc, CT. These show no gain from using the OpenCL kernels, and indeed performance is worse for the GPU runs when more than 8 GPUs are deployed. This is not unexpected; as previously stated, the data copies to and from the device each time a kernel is called begin to dominate. When the remaining four kernels are ported, the current data transfer will be eliminated and replaced with a one-off initial copy to (and final copy from) the device. A worst-case overhead for such a copy can be calculated. The total amount of data necessary for the entire test case stands at 0.45GB/N per processing core, where N is the number of processing cores used. This is well within the memory constraints of today's devices: NVIDIA's Tesla C2070 has 5.25GB of user available memory. The memory bandwidth test, oclBandwidthTest, which is shipped with NVIDIA's CUDA Toolkit [27], can be executed to measure the data transfer speeds for a range of transfer sizes. Table 5 shows these transfer speeds, using direct access and paged memory, averaged over ten runs.
Using these bandwidth figures, an estimate of the data transfer overheads can be made for the code if the data transfer per kernel is replaced with an initial copy to, and final copy from, the device. X5550 Acc, ST, in Figure 5, shows the estimated distributed runtime for the benchmark with this change of data transfer from the current transfer (CT) scheme to a single transfer (ST) scheme.
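The single transfer estimate amounts to dividing the per-core data volume by the measured transfer rates in each direction. The sketch below is an illustration under stated assumptions: the 0.45GB total is taken from the text, whereas the function name `single_transfer_overhead_s` and the bandwidth values of 3.0 GB/s are hypothetical placeholders, not the measured figures.

```python
TOTAL_DATA_GB = 0.45  # total data for the whole test case, from the text


def single_transfer_overhead_s(n_cores: int, host_to_device_gb_s: float,
                               device_to_host_gb_s: float) -> float:
    """Estimate the one-off copy-in plus copy-out time per core, in seconds."""
    per_core_gb = TOTAL_DATA_GB / n_cores
    copy_in = per_core_gb / host_to_device_gb_s
    copy_out = per_core_gb / device_to_host_gb_s
    return copy_in + copy_out


# Hypothetical bandwidths (3.0 GB/s each way) used purely for illustration.
for n_cores in (1, 2, 4, 8):
    overhead = single_transfer_overhead_s(n_cores, 3.0, 3.0)
    print(f"N={n_cores}: {overhead:.3f} s")
```

Because the data volume per core falls as 1/N, the one-off transfer cost shrinks as more cores are used, in contrast to the per-kernel copies of the CT scheme, which recur on every call.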
The four kernels already ported to the GPU account for 33.89% of the total code compute time, and their average speedup over the original coding is a factor of 18.97. If it is assumed that the remaining 61.11% of compute, contained in the remaining four kernels, achieves a similar average performance gain, Amdahl's law can be used to calculate an estimated speedup for a distributed, 95% GPU resident version of the benchmark. Taking the 5% of the application which is not accelerated, each of the kernels with their percentage of compute and performance gains, and the estimated gain for the remaining 61.11%, an overall relative runtime of 0.1006 is calculated, that is, a speedup factor of approximately 9.9. This gives an estimated order of magnitude gain over a non-accelerated X5550 run. Figure 6 shows a subset of those runs depicted in Figure 4. Additionally, runtimes of the benchmark with the four OpenCL kernels and the CT scheme (X5550 Acc, CT), and estimated runtimes of the benchmark with 95% resident and the ST scheme, are also plotted. This shows that an equal time to solution of 8.24 seconds could theoretically be achieved using two X5550 cores, each accelerated with one Tesla C1060 GPU card. The resultant power consumption, as detailed in Table 5, shows savings compared to alternative platforms.
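The Amdahl's law estimate can be reproduced approximately with a short calculation. The sketch below is a simplification: it applies the average speedup of 18.97 uniformly to all accelerated code, whereas the figure of 0.1006 quoted in the text is derived from the individual per-kernel gains, which are not listed here. The function name `relative_runtime` is hypothetical.

```python
# Fractions of total compute for the four ported kernels, from the text.
ported_kernels = {"lagren": 0.1355, "lagrac": 0.0806,
                  "lagrqq": 0.1078, "lagrvf": 0.0150}
ported_fraction = sum(ported_kernels.values())  # 0.3389
remaining_fraction = 0.6111   # the four kernels not yet ported
serial_fraction = 0.05        # code that is not accelerated
AVG_SPEEDUP = 18.97           # average gain of the ported kernels


def relative_runtime(parts, unaccelerated: float) -> float:
    """Amdahl's law: runtime as a fraction of the original runtime.

    `parts` is an iterable of (fraction_of_compute, speedup) pairs.
    """
    return unaccelerated + sum(frac / gain for frac, gain in parts)


# Simplification: apply the average speedup to every accelerated part.
runtime = relative_runtime(
    [(ported_fraction, AVG_SPEEDUP), (remaining_fraction, AVG_SPEEDUP)],
    serial_fraction,
)
print(f"relative runtime: {runtime:.4f}, speedup: {1 / runtime:.2f}x")
```

With the uniform speedup the relative runtime comes out at about 0.1001, close to the quoted 0.1006, and either value corresponds to a gain of roughly one order of magnitude. Half of the remaining runtime is the 5% unaccelerated portion, which bounds the achievable speedup at a factor of 20.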
CONCLUSIONS
This paper introduced the industry-strength, multi-purpose 2D benchmark code Shamrock. Using a suitably defined test problem, tests on a high-end workstation show that the SunStudio compilers give the best performance when compared with Intel, PGI, and GNU, when compiled in a restricted, but numerically accurate, mode. Lifting these restrictions, however, gives a 25% improvement for Intel. This implies that up to 25% could be gained if restrictions are placed only on those routines whose optimisation relaxations cause reductions in accuracy. The memory footprint shows a significant overhead from the execution of the Intel compiled binary, with a runtime memory HMW 36.84% greater than the PGI equivalent. For the MPI implementation, OpenMPI outperforms MPICH2 for the benchmark, although both are default builds and both would need to be tuned for the particular hardware in question. SMT shows some merit when used on POWER6, if a restricted number of threads is utilised: a speedup factor of 1.42 over 8 threads was observed when using 12 of a possible 16 threads. This prompts future investigations into 4-way SMT on POWER7 and Intel's 2-way SMT on Nehalem architectures.
This paper also presents comparisons of Nehalem, Westmere, POWER-5, POWER-6, POWER-7, BlueGene/L, BlueGene/P, and Barcelona hardware, showing that the fastest time to solution for the benchmark code was achieved on 128 Westmere cores. Taking a set time to solution of 8.24s, the POWER7 achieved this with the lowest core count, and a relatively low power consumption figure.
An OpenCL port to a Tesla C1060 accelerated, Nehalem-based cluster was described for four of eight possible kernels within the benchmark, accounting for 33.89% of the total code compute time. Running with the accelerated kernels in a distributed mode, and comparing with the non-accelerated distributed code, experimental data showed increasing overheads due to the data transfer to and from the device. A predicted execution time for a distributed, 95% resident version of the benchmark was given, obtained by replacing the current data transfer scheme with a larger, single data transfer scheme, and by applying an average speedup factor of 18.97 to the remaining four kernels (based on the average speedup observed for the ported kernels). This predicts that an equal time to solution of 8.24s is achievable with two C1060 accelerated Nehalem cores.
Future publications will document how closely these predictions are met, once the remaining kernels are accelerated. Early tests on NVIDIA's Tesla C2050 ('Fermi') GPU have given approximate factors of three on the kernels' compute time over the C1060. Further comparisons will be made once the current C1060s are augmented with C2050s and C2070s [28] in the 'Dexter' cluster. An additional approach is to consider the use of directive-based pragmas [29] to achieve accelerated code, and to compare these against the OpenCL kernels. Ultimately, Shamrock's AMR mode will be enabled, and a similar hardware benchmarking study will be performed.
ACKNOWLEDGMENTS

