Abstract. An upgrade from dual-core to quad-core AMD processors on the Cray XT system at the Oak Ridge National Laboratory (ORNL) Leadership Computing Facility (LCF) has resulted in significant changes in the hardware and software stack, including a deeper memory hierarchy, SIMD instructions and a multi-core aware MPI library. In this paper, we evaluate the impact of a subset of these key changes on large-scale scientific applications. We provide insights into the application tuning and optimization process and report on how different strategies yield varying rates of success and failure across different application domains.
Introduction
Scientific productivity on the emerging Petascale systems is widely attributed to the system balance in terms of processor, memory, network capabilities and the software stack. The next generations of these Petascale systems are likely to be composed of processing elements (PE) or nodes with 8 or more cores on single or multiple sockets, deeper memory hierarchies and a complex interconnection network infrastructure. Hence, the development of scalable applications on these systems cannot assume a uniformly balanced system; it requires application developers to adopt a hierarchical view in which memory and network performance follow a regular but non-uniform access model. Even the current generation of systems with peak performance of hundreds of Teraflops, such as the Cray XT and IBM Blue Gene series systems, offer 4 cores or execution units per PE, multiple levels of unified and shared caches and a regular communication topology, along with support for distributed (message-passing MPI) and hybrid (MPI and shared-memory OpenMP or pthreads) programming models [Dagnum98, Snir98, BGL05, BGP08, XT3a-b]. As a result, it has become extremely challenging to sustain, let alone improve, performance efficiencies or scientific productivity on the existing systems, as we demonstrate in this paper. At the same time, however, these systems serve as test-beds for applications targeting Petascale generation systems that are composed of hundreds of thousands of processing cores.
We have extensive experience of benchmarking and improving performance efficiencies of scientific applications on the Cray XT series systems, beginning from the first-generation, single-core AMD based ~26 Teraflops Cray XT3 system to the latest quad-core based ~263 Teraflops Cray XT4 system. During these upgrades, a number of system software and hardware features were modified or replaced altogether, such as the migration from the Catamount to the Compute Node Linux (CNL) operating system, network capabilities, support for hybrid programming models and, most importantly, multi-core processing nodes [Kelly05]. A complete discussion of individual features is beyond the scope of this paper; however, we do attempt to provide a comprehensive overview of the features updated in the latest quad-core upgrade and how these features impact the performance of high-end applications. We use a combination of micro-benchmarks that highlight specific features, and then assess how these features influence the overall performance of complex, production-level applications and how performance efficiencies are improved on the target platform.
In this paper, we focus on micro-architectural characteristics of the quad-core system, particularly the new vectorization units and the shared level 3 cache. This study also enables us to identify features that are likely to influence scientific productivity on the Petascale Cray XT5 system. Hence, a unique contribution of this paper is that it not only evaluates the performance of a range of scientific applications on one of the most powerful open-science supercomputing platforms but also discusses how the performance issues are addressed during the quad-core upgrade. The Cray XT5 system shares a number of features, including the processor and the network infrastructure, with its predecessor, the quad-core XT4 system. However, the XT5 system has some distinct characteristics, most importantly hierarchical parallelism within a processing node: an XT5 PE is composed of two quad-core processors, thereby yielding additional resource contention for memory, network and file I/O operations.
The outline of the paper is as follows: background and motivation for this research along with a description of target system hardware and software features is provided in section 2. In section 3, we briefly outline micro-benchmarks and high-end scientific applications that are targeted for this study. Details of experiments and results relating to each feature that we focus on in this paper are presented in section 4. Conclusions and future plans are outlined in section 5.
Motivation and Background
The Cray XT system located at ORNL is the second most powerful computing capability for the Department of Energy's (DOE) Office of Science, and in fact represents one of the largest open science capability platforms in the United States. Named Jaguar, it is the primary leadership computer for the DOE Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program, which supports computationally intensive, large-scale research projects. The 2008 program awarded over 140 million processor hours on Jaguar to groups investigating a broad set of science questions, including global climate dynamics, fusion, fission, and combustion energy, biology, astrophysics, and materials.
In order to support this scale of computing, Jaguar has been upgraded from a 119 Teraflops capability to 262 Teraflops (TFLOPS). Several fundamental characteristics of the architecture have changed with this upgrade, which have a wide-ranging impact across different application domains. This motivates our research of identifying and quantifying the impact of these new architectural and system software stack features on leadership scale applications.
The current incarnation of Jaguar is based on an evolutionary improvement beginning with the XT3, Cray's third-generation massively parallel processing system, building on the T3D and T3E systems. The platform is based on commodity AMD Opteron processors (most recently the quad-core Barcelona), a Cray custom interconnect, and a light-weight kernel (LWK) operating system; the XT3 was delivered in 2005. Each node consisted of an AMD Opteron model 150 (single core) processor, running at 2.4 GHz with 2 GBytes of DDR-400 memory. The nodes were connected by a SeaStar router through HyperTransport, in a 3-dimensional torus topology, and ran the Catamount operating system. With 5,212 compute nodes, the peak performance of the XT3 was just over 25 TFLOPS. Jaguar processors were upgraded to dual-core Opteron model 100 2.6 GHz processors in 2006, with memory per node doubled in order to maintain 2 GBytes per core. It was again upgraded in April 2007, with three major improvements: 6,296 nodes were added; memory on the new nodes was upgraded to DDR2-667, increasing memory bandwidth from 6.4 GBytes per second (GB/s) to 10.6 GB/s; and the SeaStar2 network chip connected the new nodes, increasing network injection bandwidth (of those nodes) from 2.2 GB/s to 4 GB/s and increasing the sustained network performance from 4 GB/s to 6 GB/s. Thus, with 23,016 processor cores, this so-called XT3/XT4 hybrid provided a peak performance of 119 TFLOPS.
In spring 2008, Jaguar was again upgraded: 7,832 quad-core processors replaced the 11,508 dual-core processors (illustrated in Figure 1), the interconnect is now fully SeaStar2, and the LWK is a customized version of Linux named Compute-Node Linux (CNL). Each compute node now contains a 2.1 GHz quad-core AMD Opteron processor and 8 GBytes of memory (maintaining the per-core memory at 2 GBytes). As before, nodes are connected in a 3-dimensional torus topology, now with the full SeaStar2 router through HyperTransport (see Figure 1(b)). This configuration provides 262 TFLOPS with 60 TBytes of memory.
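The peak figures quoted for each Jaguar configuration can be approximately reproduced from core counts and clock rates. The sketch below is illustrative only: the flops-per-cycle values (2 for the single- and dual-core Opterons, 4 for the quad-core) are assumptions not stated in the text, and the function name is hypothetical. It also treats all 23,016 cores of the XT3/XT4 hybrid as running at 2.6 GHz.

```python
# Peak-performance arithmetic for the Jaguar configurations described above.
# ASSUMPTION (not stated in the text): 2 double-precision flops per cycle per
# core for the single- and dual-core Opterons, and 4 for the quad-core.

def peak_tflops(cores, ghz, flops_per_cycle):
    """Theoretical peak in TFLOPS: cores x clock (GHz) x flops per cycle / 1000."""
    return cores * ghz * flops_per_cycle / 1000.0

xt3_2005 = peak_tflops(cores=5212, ghz=2.4, flops_per_cycle=2)
hybrid_2007 = peak_tflops(cores=23016, ghz=2.6, flops_per_cycle=2)
quad_2008 = peak_tflops(cores=7832 * 4, ghz=2.1, flops_per_cycle=4)

print(f"XT3 (2005):            {xt3_2005:.1f} TFLOPS")
print(f"XT3/XT4 hybrid (2007): {hybrid_2007:.1f} TFLOPS")
print(f"Quad-core XT4 (2008):  {quad_2008:.1f} TFLOPS")
```

Under these assumptions the three values come out at roughly 25, 120 and 263 TFLOPS, close to the figures quoted in the text.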
Micro-benchmark and Application Details
HPCC Benchmark Suite
We used the High Performance Computing Challenge (HPCC) benchmark suite to confirm micro-architectural characteristics of the system. The HPCC benchmark suite [HPCCa-b] is composed of benchmarks measuring network performance, node-local performance, and global performance. Network performance is characterized by measuring the network latency and bandwidth for three communication patterns. The node-local and global performance are characterized by considering four algorithm sets, which represent four combinations of minimal and maximal spatial and temporal locality: DGEMM/HPL for high temporal and spatial locality, FFT for high temporal and low spatial locality, Stream/Transpose (PTRANS) for low temporal and high spatial locality, and RandomAccess (RA) for low temporal and spatial locality. The performance of these four algorithm sets is measured in single/serial process mode (SP), in which only one processor is used; embarrassingly parallel mode (EP), in which all of the processors repeat the same computation in parallel without communicating; and global mode, in which each processor provides a unique contribution to the overall computation, requiring communication.
Application Case Studies
The application test cases are drawn from the workload configurations that are expected to scale to a large number of cores and that are representative of Petascale problem configurations. These codes are large, with complex performance characteristics and numerous production configurations that cannot be captured or characterized adequately in the current study. The intent is rather to provide a qualitative view of system performance, using these test cases to highlight how the quad-core system upgrade has influenced performance as compared to the preceding system configurations.
Fusion Application (AORSA)
The two- and three-dimensional All-ORders Spectral Algorithm (AORSA [AORSA08]) code is a full-wave model for radio frequency heating of plasmas in fusion energy devices such as the International Thermonuclear Experimental Reactor (ITER) and the National Spherical Torus Experiment (NSTX) [Jaeger06-07]. AORSA operates on a spatial mesh, with the resulting set of linear equations solved for the Fourier coefficients. A Fast Fourier Transform algorithm converts the problem to a frequency space, resulting in a dense, complex-valued linear system. Parallelism is centered on the solution of the dense linear system, currently accomplished using a locally modified version of HPL [Dongarra90, Langou07]. Quasi-linear diffusion coefficients are then computed, which serve as input to a separate application (a Fokker-Planck solver) that models the longer-term behavior of the plasma.
Turbulent Combustion Code (S3D)
Direct numerical simulation (DNS) of turbulent combustion provides fundamental insight into the coupling between fluid dynamics, chemistry, and molecular transport in reacting flows. S3D is a massively parallel DNS solver developed at Sandia National Laboratories. S3D solves the full compressible Navier-Stokes, total energy, species, and mass continuity equations coupled with detailed chemistry. It is based on a high-order accurate, non-dissipative numerical scheme and has been used extensively to investigate fundamental turbulent chemistry interactions in combustion problems including auto-ignition [Chen06] , premixed flames [Sankaran07] , and nonpremixed flames [Hawkes07] .
The governing equations are solved on a conventional three-dimensional structured Cartesian mesh. The code is parallelized using a three-dimensional domain decomposition and MPI communication. Spatial differentiation is achieved through eighth-order finite differences along with tenth-order filters to damp any spurious oscillations in the solution. The differentiation and filtering require nine- and eleven-point centered stencils, respectively. Ghost zones are constructed at the task boundaries by non-blocking MPI communication among nearest neighbors in the three-dimensional decomposition. Time advance is achieved through a six-stage, fourth-order explicit Runge-Kutta (R-K) method [Kennedy00].
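To make the stencil description concrete, the following minimal sketch applies an eighth-order, nine-point centered first derivative on a one-dimensional periodic grid. It is not taken from the S3D source: the weights are the standard textbook eighth-order central-difference coefficients, the function name is hypothetical, and periodic wrap-around stands in for the ghost-zone exchange. A nine-point centered stencil reaches four points to each side, so a ghost zone supporting it must be at least four points deep (five for the eleven-point filter).

```python
import math

# Standard eighth-order central-difference weights for a first derivative,
# for offsets 1..4 (textbook values; ASSUMED here, not taken from S3D).
COEFFS = (4.0 / 5.0, -1.0 / 5.0, 4.0 / 105.0, -1.0 / 280.0)

def ddx_8th_order(f, h):
    """First derivative on a periodic 1-D grid using a 9-point centered stencil."""
    n = len(f)
    out = []
    for i in range(n):
        acc = 0.0
        for k, c in enumerate(COEFFS, start=1):
            # Antisymmetric stencil: +c on the right neighbor, -c on the left.
            acc += c * (f[(i + k) % n] - f[(i - k) % n])
        out.append(acc / h)
    return out

n = 64
h = 2.0 * math.pi / n
f = [math.sin(i * h) for i in range(n)]
df = ddx_8th_order(f, h)

# Compare against the exact derivative cos(x).
max_err = max(abs(d - math.cos(i * h)) for i, d in enumerate(df))
ghost_width = len(COEFFS)  # points needed from each neighbor for this stencil
print(f"max error = {max_err:.2e}, ghost width = {ghost_width}")
```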
Quantitative Evaluation and Analysis of Selected Features

Vector (SSE) Instructions
The Cray XT4 system was upgraded from the dual-core Opteron to a single-chip, native quad-core processor called Barcelona. One of the main features of the quad-core processor is the quadrupling of floating-point performance: it uses a wider, 32-byte instruction fetch, and its floating-point units can execute 128-bit SSE operations in a single clock cycle (including the Supplemental SSE3 instructions Intel included in its Core-based Xeons). In addition, the Barcelona core has relatively higher bandwidth in order to accommodate the higher throughput: internally between units on the chip, between the L1 and L2 caches, and between the L2 cache and the north bridge/memory controller.
In order to measure the impact of the new execution units with 128-bit vectorization support, we ran two HPCC benchmarks that represent scientific computation, DGEMM and FFT, in single processor (SP) mode, where a single instance of an application runs on a single core, and in embarrassingly parallel (EP) mode, where all cores execute an application without communicating with each other. We also measure performance per socket to estimate overall processor efficiencies. The quad-core XT4 has four cores per socket while the dual-core XT4 has two cores per socket. Results are shown in Figure 2. We observe a significant increase in per-core performance for the dense-matrix computation benchmark (DGEMM), which is able to exploit the vector units. The FFT benchmark, on the other hand, showed a modest increase in performance. Results in the EP mode, when all four cores execute the same program, revealed the impact of the shared L3 cache: FFT performance slows down at a much higher rate on the quad-core system than on the dual-core and single-core XT platforms. The L3 behavior is detailed in the next section. Although there is a slowdown for FFT in the EP mode, we observe that the per-socket performance of the quad-core processor is significantly higher than that of the dual-core processor. We conclude that the significant performance boost per core brings additional requirements for code development and generation on the quad-core processors. In other words, code dominated by misaligned or non-vector instructions could achieve less than a quarter of the total achievable performance. Our two target applications highlighted the need for optimizing these vector operations.
AORSA, a fusion application, has a distinguished history of running on the XT series, allowing researchers to conduct experiments at previously unattainable resolutions and at unprecedented computational scales. For example, the first simulations of mode conversion in ITER were run on the single-core XT3 [Jaeger06] on a 350 x 350 grid. On the dual-core XT3/XT4, this feat was again achieved at increased resolution (500 x 500 grid), with the linear solver achieving 87.5 TFLOPS (74.8% of peak) on 22,500 cores [Jaeger07]. The same problem run on the quad-core XT increased this performance to 116.5 TFLOPS, and when run on 28,900 cores performance increased to 152.3 TFLOPS. Performance results for this scale are shown in Figure 4. Results are shown for the dual-core (DC) and quad-core (QC) processors with ScaLAPACK (Scal) and HPL (hpl) based solvers. Experimental mixed-precision (mp) results are also shown in the figure. While these results are impressive, performance relative to the theoretical peak has decreased from 74.8% to 61.6%. Although this is not unexpected, given the decreased clock speed and other issues associated with the increased number of cores per processor, we are pursuing further improvements. Nevertheless, the time-to-solution (the relevant metric of interest) dropped from 73.2 minutes to 55.0 minutes, a decrease of about 25% (a speedup of roughly 1.33x).
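The efficiency and time-to-solution figures above can be checked with a short calculation. This is a sketch under stated assumptions: the per-core peak uses 2 flops per cycle at 2.6 GHz for the dual-core system and 4 flops per cycle at 2.1 GHz for the quad-core system (values assumed here, not given in the text), and the function name is hypothetical.

```python
# Percent-of-peak and time-to-solution arithmetic for the AORSA solver runs.
# ASSUMPTION: 2 flops/cycle/core (dual-core, 2.6 GHz) and 4 flops/cycle/core
# (quad-core, 2.1 GHz); these rates are not stated in the text.

def percent_of_peak(achieved_tflops, cores, ghz, flops_per_cycle):
    """Achieved performance as a percentage of theoretical peak."""
    peak_tflops = cores * ghz * flops_per_cycle / 1000.0
    return 100.0 * achieved_tflops / peak_tflops

dc = percent_of_peak(87.5, cores=22500, ghz=2.6, flops_per_cycle=2)
qc = percent_of_peak(116.5, cores=22500, ghz=2.1, flops_per_cycle=4)
qc_large = percent_of_peak(152.3, cores=28900, ghz=2.1, flops_per_cycle=4)

t_dc, t_qc = 73.2, 55.0                            # minutes, from the text
time_reduction = 100.0 * (t_dc - t_qc) / t_dc      # percent decrease in time
speedup = t_dc / t_qc                              # ratio of old to new time

print(f"dual-core: {dc:.1f}% of peak, quad-core: {qc:.1f}% of peak")
print(f"quad-core, 28,900 cores: {qc_large:.1f}% of peak")
print(f"time reduced by {time_reduction:.1f}% (speedup {speedup:.2f}x)")
```

With these assumptions the calculation reproduces the quoted 74.8% and 61.6% efficiencies, and shows that the drop from 73.2 to 55.0 minutes is a reduction of about 25% in runtime, equivalent to a speedup of about 1.33x.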
We expect performance of the solver phase to increase based on planned improvements to the BLAS library and the MPI implementation. In addition, we are experimenting with a mixed-precision approach [Langou07]. This capability is currently included in the Cray math library (-libsci) as part of the Iterative Refinement Toolkit (IRT). While this technique shows promise, it is not providing an improvement at the relevant problem scales. Although the condition number of the matrix increases with resolution, this does not appear to be the issue. A more likely cause is the use of the ScaLAPACK factorization routine within IRT in place of the HPL version: at 22,500 cores on the dual-core Jaguar, ScaLAPACK achieved 48 TFLOPS, whereas HPL achieved 87 TFLOPS.
The turbulent combustion application, S3D, is parallelized using a three-dimensional domain decomposition and MPI communication. Each MPI process is responsible for a piece of the three-dimensional domain. All MPI processes have the same number of grid points and the same computational load. Inter-processor communication is only between nearest neighbors in a logical three-dimensional topology. A ghost zone is constructed at the processor boundaries by non-blocking MPI sends and receives among the nearest neighbors in the three-dimensional processor topology. Global communications are only required for monitoring and synchronization ahead of I/O. A comparison of dual-core and quad-core performance is shown in Table 1.
The initial port (Table 1) showed a decrease in performance, though smaller than the decrease in clock speed alone would explain. This suggests that vectorization is occurring, though not as aggressively as desired. Special effort was applied to the computation of reaction rates, which consumed approximately 60% of overall runtime. Table 2 shows the effect when the compiler is able to vectorize the code. Although the number of operations increases in each category, the proportion of operations occurring in vector mode increased by 233%, resulting in a decrease in the runtime of this computation of over 20%.
Table 1: S3D single processor performance (weak scaling mode). The amount of work is constant for each process. "MPI mode" refers to the number of MPI processes and how they are assigned to each node: -n is the total number of processes, -N is the number of processes assigned to each quad-core processor node. Time is wall clock in units of seconds; "cost" is defined as micro-sec per grid point per time step. The "vec" columns show the performance after the code was reorganized for stronger vectorization.
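The "cost" metric defined in the Table 1 caption can be written out as a small function. This is a sketch, assuming the metric is computed from per-process wall-clock time and the per-process grid size (natural in weak scaling, where each process holds a fixed block). The function name and the input values are hypothetical and are not taken from Table 1.

```python
def cost_us_per_point_per_step(wall_seconds, grid_points_per_process, time_steps):
    """Cost in microseconds per grid point per time step."""
    return wall_seconds * 1.0e6 / (grid_points_per_process * time_steps)

# HYPOTHETICAL inputs, for illustration only (not measured values):
# a 50 x 50 x 50 block per process advanced for 10 time steps in 68.75 s.
points = 50 * 50 * 50
cost = cost_us_per_point_per_step(wall_seconds=68.75,
                                  grid_points_per_process=points,
                                  time_steps=10)
print(f"cost = {cost:.1f} microseconds per grid point per time step")
```

Because the work per process is constant in weak scaling, a change in this cost between MPI modes reflects contention for shared resources rather than a change in problem size.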
Deeper Memory Hierarchy (L3 Cache)
Another distinctive feature of the quad-core processor is the availability of an L3 cache that is shared among all four cores. There was no L3 cache in the predecessor Opteron processors. L3 serves as a victim cache for L2. The L2 caches (not shared) are in turn filled with victims from the L1 caches (not shared); that is, when L1 fills up, evicted data sits in L2 for reuse rather than being sent to memory. Hence, data-intensive applications can benefit from the L3 cache only if the working set fits within cache range. Two HPCC memory performance benchmarks, stream and random access, were targeted to quantitatively evaluate the performance of the memory sub-system. We compared the quad-core performance with the dual-core (XT4-DC) and single-core (XT3) AMD processors that preceded the latest quad-core (XT4-QC) processing nodes (Figure 5). Both memory benchmarks highlight the effect of using a single core (SP mode) as compared to using all four cores simultaneously (EP mode), for regular, single-strided (stream) access as well as for random access. The random memory access benchmarks in particular highlight this cache behavior. We note that the shared resources in the memory sub-system do account for a slowdown in the EP mode; however, this slowdown is less than a factor of 4. In fact, the quad-core system has a relatively high per-socket performance as compared to the dual-core system, which can be attributed to the shared L3 cache.
Multiple applications show a slowdown in quad-core, or virtual node mode (VNM), as compared to single-core (SMP) mode. In VNM, 16 XT nodes are used, while in SMP mode 64 XT nodes (256 cores) are reserved but only one core per node is used, for 64 MPI tasks altogether. The S3D application has about a 25% slowdown in the mode where all four cores contribute to a calculation as compared to using only a single core per processor. We collected hardware counter data using the PAPI library that confirms our findings [PAPI00]. L3 cache (shared between all four cores) behavior is measured and computed using PAPI native events. The L3 miss rate shows how frequently a miss occurs for a set of retired instructions. The L3 miss ratio indicates the portion of all L3 accesses that result in misses. Our results confirm that the L3 cache miss and request rates increase by a factor of two when using 4 cores per node versus 1 core per node.
The most distinctive feature of the Petaflops XT5 system is its dual-socket, quad-core nodes as compared to a single-socket node. In other words, there could be an additional level of memory and communication hierarchy exposed to application developers who are familiar with the quad-core XT4 memory subsystem. Although optimization for the wide vector units would remain beneficial on the XT5 system, the issues of the memory sub-system are likely to become more complex, since there will be 8 cores sharing the HyperTransport link on the XT5 node as compared to 4 cores on the XT4 node.
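The two derived L3 metrics can be expressed directly from raw counter values. The sketch below is illustrative: the function names are hypothetical, the counter values are invented to mirror the factor-of-two trend described in the text, and no specific PAPI event names are assumed.

```python
def l3_miss_rate(l3_misses, retired_instructions):
    """L3 misses per retired instruction."""
    return l3_misses / retired_instructions

def l3_request_rate(l3_requests, retired_instructions):
    """L3 requests (accesses) per retired instruction."""
    return l3_requests / retired_instructions

def l3_miss_ratio(l3_misses, l3_requests):
    """Fraction of all L3 accesses that miss."""
    return l3_misses / l3_requests

# HYPOTHETICAL counter values per process, chosen only to mirror the
# factor-of-two trend reported in the text (not measured data).
one_core = {"misses": 2_000_000, "requests": 10_000_000, "instr": 1_000_000_000}
four_core = {"misses": 4_000_000, "requests": 20_000_000, "instr": 1_000_000_000}

rate_1 = l3_miss_rate(one_core["misses"], one_core["instr"])
rate_4 = l3_miss_rate(four_core["misses"], four_core["instr"])
req_1 = l3_request_rate(one_core["requests"], one_core["instr"])
req_4 = l3_request_rate(four_core["requests"], four_core["instr"])
ratio_1 = l3_miss_ratio(one_core["misses"], one_core["requests"])
ratio_4 = l3_miss_ratio(four_core["misses"], four_core["requests"])

print(f"miss rate:    x{rate_4 / rate_1:.1f}")
print(f"request rate: x{req_4 / req_1:.1f}")
print(f"miss ratio:   {ratio_1:.2f} -> {ratio_4:.2f}")
```

Note that when the miss and request rates grow by the same factor, as in these invented numbers, the miss ratio is unchanged; the two metrics therefore answer different questions and are worth reporting together.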
Conclusions and Future Plans
We have demonstrated how individual features of a system's hardware and software stack can influence the performance of high-end applications on a supercomputing platform of over 250 Teraflops. Our capability of comparing and contrasting the performance and scaling of applications on multiple generations of Cray XT platforms, which share many system software and hardware features, enables us not only to identify strategies to improve efficiencies on the current generation system but also to prepare for the next-generation Petascale system. Since only a selection of features is examined in detail in this study, we plan to expand the scope of this research by including applications that use hybrid programming models, in order to study the impact of parallelism within and across nodes and sockets. We are working with application groups that use flat MPI models to explore and incorporate alternate work decomposition strategies on the XT5 platform.
