Abstract. Trends in chip technology and system design are causing a revolution in high-performance computing. The emergence of multicore processor chips, the construction of very large computing systems, and the increasing need to deal with power and energy issues in these systems are three of the most significant changes. We focus on the way that these trends have created a new set of challenges in the area of performance engineering: the measurement, analysis, and tuning of computing systems and applications. We discuss these changes and outline recent work at the Renaissance Computing Institute (RENCI) to meet these challenges.
Trends in high-performance computing
As it has over its entire brief history, computing technology continues to evolve at a rapid pace. Since the emergence of microprocessor-based clusters, this has generally been felt as incremental improvements in single-thread performance. Recently, however, several trends are introducing qualitative changes, especially in high-performance computing.
Instead of continuing to exploit increased clock rates and an increased degree of implicit, instruction-level parallelism (ILP), the improved performance of the current generation of chips comes from increased explicit, multicore and multithread parallelism. Technology improvements make it feasible to build high-end systems with many tens to hundreds of thousands of chips. Combined with new generations of processor chips, these systems will have a degree of thread-level parallelism in the millions. The sheer scale of these systems forces everything, including performance analysis, to be done in parallel. Power and energy considerations impinge on every decision. Power density, efficiency, and cooling issues are driving the trend towards explicit parallelism. Power and cooling infrastructure are dominating facility costs. Energy costs are a large, and increasing, component of the total cost of system ownership.
Each of these trends is having a substantial impact on the general area of "performance engineering" in high-performance computing. We describe ongoing work at RENCI to address the first two of these issues. Power and energy are overarching problems that are driving the other trends, but specifics of dealing with them are beyond the scope of this paper. Within the present context they are a matter of running parallel systems as efficiently as possible.
2. Resource-centric performance tools and analysis
2.1. Processors in an era of single-thread performance gains
The impressive increase in microprocessor performance has been a consequence of several related trends. The linear feature size on commodity chips has shrunk from tens of microns to a range of 45 to 65 nanometers; this has led to a quadratic increase in the number of devices per unit area. The size of commodity dies has also increased. The increase in devices per chip, usually known as Moore's law, has permitted designers to improve performance by adding instruction-level parallelism (ILP), building deeply pipelined, out-of-order, superscalar processors. The decreasing feature size has also improved transistor switching time and enabled dramatically higher clock rates. Memory density has increased at a similar pace, and large, complex caches have mitigated the much slower rate of increase of memory speed.
We are emerging from a period in which "performance engineering" of applications was relatively simple and effective. For many applications, it was sufficient to "ride the Moore's law wave"; why waste effort optimizing code if computers will double in speed in the next eighteen months anyway? For other classes of application, this was a "golden age of optimization" with program transformations applied manually by "performance specialists" or automatically by optimizing compilers [1]. Even applications that were bound by memory bandwidth could achieve some ILP by exploiting memory interface protocols to overlap concurrent memory operations. An average memory concurrency of only 6.25 can yield a bandwidth of about 4 GB/sec on systems that would otherwise achieve only 640 MB/sec (one 64-byte block per 100 ns).
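The bandwidth figures in the preceding paragraph follow from Little's law: sustained bandwidth equals the number of overlapped block transfers times the block size, divided by the latency of one transfer. The following is a minimal sketch of that arithmetic; the function name is hypothetical, and the 64-byte block and 100 ns latency are the values quoted above.

```python
def sustained_bandwidth(block_bytes, latency_s, concurrency):
    """Bytes per second delivered when `concurrency` block transfers overlap.

    This is Little's law rearranged: throughput = items in flight / latency.
    """
    return concurrency * block_bytes / latency_s


BLOCK_BYTES = 64      # one cache block, as quoted in the text
LATENCY_S = 100e-9    # 100 ns per memory access, as quoted in the text

serial = sustained_bandwidth(BLOCK_BYTES, LATENCY_S, 1.0)
overlapped = sustained_bandwidth(BLOCK_BYTES, LATENCY_S, 6.25)

print(f"one access at a time: {serial / 1e6:.0f} MB/s")
print(f"6.25 accesses in flight: {overlapped / 1e9:.1f} GB/s")
```

The model ignores bus and DRAM bank limits, so it gives an upper bound on what overlap alone can deliver.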
On high-ILP systems, performance issues were a matter of how all of the instructions in flight interacted with one another in the pipelines and the memory system. For example, whether a particular memory reference is a cache miss is not as important as how much of the miss latency can be hidden by concurrent instructions or by a hardware prefetcher.
The era of explicit parallelism
The overhead incurred to achieve ILP increased much faster than the performance gains realized. At the same time, the amount of circuit-level parallelism and the power consumed have increased commensurately. Hence, the processor industry has embraced a more efficient, explicitly parallel design strategy. Clock rates temporarily decreased, and now grow much more slowly.
With the advent of explicitly parallel, multicore, multithread processor chips, the execution context of each instruction now includes all the concurrent instructions in all the other threads and cores. In such environments, the interactions of threads (cores) competing for shared resources will dominate performance.
In a single application, some interthread interactions will be explicit scheduling and synchronization. Implicit contention for shared resources is as much of a problem. To illustrate this, we ran independent copies of the NAS parallel benchmarks on quad-core Intel and AMD processors and measured the throughput [2]. While throughput increased from one to four threads, total run time increased by as much as 80% for MG and SP, yielding a parallel efficiency of 1/1.8 ≈ 0.56 for nominally independent executions. On Intel Clovertown, parallel efficiency was as low as 0.43 for four copies of MG. Other measurements on AMD chips show dramatic increases in the number of DDR memory operations (especially DDR page conflicts). The impact of inter-thread interactions is clearly large, and it can affect performance more than local code tuning within a single thread.
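The efficiency figures above can be reproduced with a few lines of arithmetic. The sketch below assumes, as the text does, that the copies are independent, so that ideal behavior is no increase in run time at all; the function names are hypothetical.

```python
def parallel_efficiency(slowdown):
    """Per-copy efficiency when co-scheduled copies run `slowdown` times longer."""
    return 1.0 / slowdown


def aggregate_throughput(copies, slowdown):
    """Throughput of `copies` concurrent runs relative to one copy running alone."""
    return copies * parallel_efficiency(slowdown)


# An 80% increase in run time corresponds to a slowdown factor of 1.8.
eff = parallel_efficiency(1.8)
throughput = aggregate_throughput(4, 1.8)
print(f"efficiency {eff:.2f}, relative throughput {throughput:.2f}x")

# Conversely, an efficiency of 0.43 implies this slowdown per copy.
implied_slowdown = 1.0 / 0.43
print(f"slowdown implied by efficiency 0.43: {implied_slowdown:.2f}x")
```

Four copies at a slowdown of 1.8 therefore deliver little more than twice the throughput of a single copy, even though no copy communicates with any other.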
Since costs are increasingly less attributable to a single thread, tools that focus on "first-person" performance measurement [3, 4] are less able to measure and diagnose crucial performance problems in this domain. At RENCI, we are pursuing the approach of developing "resource centric" measurement and analysis tools. Our focus is on the use of shared resources and interactions among active threads during periods of high utilization.
Information generated by tools needs to be available for off-node analysis and reporting tools. We are pursuing on-node, on-line performance introspection. The current state of our prototype will be presented with the poster accompanying this paper at the SciDAC 2008 conference.
Challenges of system scale
The degree of parallelism in top systems is increasing dramatically as a result of simultaneous trends toward much larger node counts, multisocket node architectures, and multicore/multithread chips. At such scales, Amdahl's law will be a harsh master for applications, system software, and even for performance tools. Sequential algorithms for measurement, data collection, analysis, and presentation of performance data on these systems are already untenable. Current popular performance tools [3, 4, 5] focus primarily on measurement, storing large volumes of data to disk and doing post-mortem analyses on other systems. Profiling tools [3, 4] reduce data volume by compiling summary statistics of metrics over time, but they lose valuable temporal information that could be used for root cause analysis. Tracing tools [3, 5] preserve this information, but the output is too large to store at petascale without significant perturbation. Further, sequential analysis of such data on workstations or small clusters would be too slow to be usable, even if petascale data volumes could be practically stored. Developers could instead measure, analyze, and tune on small-scale systems and problem instances. This approach can certainly help in some cases, but small-scale performance is no guarantee of large-scale behavior.
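To make the reference to Amdahl's law concrete, the sketch below evaluates the standard speedup bound 1 / (s + (1 - s) / p) for a serial fraction s and p workers. The serial fraction of 1e-5 is an illustrative assumption, not a measurement from any tool or application discussed here.

```python
def amdahl_speedup(serial_fraction, workers):
    """Upper bound on speedup when `serial_fraction` of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)


# Illustrative assumption: only 0.001% of the work is sequential.
SERIAL_FRACTION = 1e-5

for workers in (1_000, 100_000, 1_000_000):
    speedup = amdahl_speedup(SERIAL_FRACTION, workers)
    print(f"{workers:>9,d} workers: speedup {speedup:,.0f}, "
          f"efficiency {speedup / workers:.1%}")
```

Even with this very small serial fraction, the bound falls below one tenth of ideal at a million-way parallelism, which is why sequential collection or analysis steps in a performance tool cannot be tolerated at that scale.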
Almost by definition, driving problems for very large computer systems use the entire system and run for days or weeks. Otherwise, it would be preferable to run longer but more efficient jobs on smaller systems. Furthermore, the codes to solve these problems use the most advanced algorithms, which are often adaptive. Addressing the problem of performance measurement and analysis, if only to confirm that the code is running as well as it can, requires operating at very large scale for long times. Cost prohibits conducting such runs only to measure performance, so tool overheads must be acceptable in a production environment.
Approach
To address these issues, we are conducting research and development on scalable tools that filter and analyze data in-situ. At runtime, the tools have access to the same computing power and parallelism as the application. Using low-overhead parallel filtering algorithms such as multiscale wavelet compression, we can reduce and analyze the data before it is stored to disk or even moved out of the computing fabric. These techniques also allow hierarchical, interprocessor analyses in real time, facilitating adaptive measurement and even steering.
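As an illustration of wavelet-based data reduction, the sketch below applies a multilevel Haar transform to a short load trace and keeps only the largest coefficients. It is a sequential toy under stated assumptions: the Haar basis is chosen for simplicity (the wavelet actually used by the tools is not specified here), the trace is synthetic, and all function names are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_forward(signal):
    """Multilevel orthonormal Haar transform; len(signal) must be a power of two."""
    coeffs = list(signal)
    n = len(coeffs)
    while n > 1:
        half = n // 2
        averages = [(coeffs[2 * i] + coeffs[2 * i + 1]) / SQRT2 for i in range(half)]
        details = [(coeffs[2 * i] - coeffs[2 * i + 1]) / SQRT2 for i in range(half)]
        coeffs[:n] = averages + details   # coarse part first, details after it
        n = half                          # recurse on the coarse part only
    return coeffs


def haar_inverse(coeffs):
    """Invert haar_forward."""
    out = list(coeffs)
    n = 1
    while n < len(out):
        merged = []
        for a, d in zip(out[:n], out[n:2 * n]):
            merged.append((a + d) / SQRT2)
            merged.append((a - d) / SQRT2)
        out[:2 * n] = merged
        n *= 2
    return out


def compress(signal, keep):
    """Zero all but the `keep` largest-magnitude wavelet coefficients."""
    coeffs = haar_forward(signal)
    ranked = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    kept = set(ranked[:keep])
    return [c if i in kept else 0.0 for i, c in enumerate(coeffs)]


# Synthetic trace: uniform load with one transient spike.
trace = [1.0] * 64
trace[20] = 9.0

sparse = compress(trace, keep=8)
restored = haar_inverse(sparse)
max_error = max(abs(a - b) for a, b in zip(trace, restored))
print(f"stored {sum(c != 0.0 for c in sparse)} of {len(trace)} coefficients, "
      f"max error {max_error:.2e}")
```

A smooth signal with a few localized spikes concentrates its energy in a handful of coefficients, so most of them can be dropped before the data leaves the node.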
Work to Date
AMPL [6] is a tracing tool that uses population sampling techniques to reduce trace data volumes in large applications. AMPL's output size scales sublinearly with the number of nodes in the parallel system, and sample overhead can be further reduced by monitoring homogeneous groups of processes separately.
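The sublinear scaling of a sampled trace can be illustrated with the textbook minimum sample size for estimating a mean over a finite population. The sketch below is an assumption made for illustration: this paper does not state which estimator AMPL uses, and the function name, the error bound, and the variance are hypothetical.

```python
import math
from statistics import NormalDist


def min_sample_size(population, sigma, error_bound, confidence=0.95):
    """Smallest sample estimating a population mean to within `error_bound`.

    Uses the finite-population formula n = N / (1 + N * (d / (z * sigma))**2),
    where z is the two-sided normal quantile for the requested confidence.
    """
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    n = population / (1.0 + population * (error_bound / (z * sigma)) ** 2)
    return min(population, math.ceil(n))


# Hypothetical setting: unit standard deviation, mean wanted to within 0.1.
for processes in (128, 1_024, 16_384, 131_072):
    n = min_sample_size(processes, sigma=1.0, error_bound=0.1)
    print(f"{processes:>7,d} processes: trace {n} of them")
```

For a fixed variance and error bound, the required sample approaches a constant as the population grows, so the fraction of processes that must be traced shrinks with system size. Lower variance within a homogeneous group reduces the sample further, which is consistent with the benefit of monitoring such groups separately.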
Load balance is an important problem for petascale performance but a difficult challenge for tool designers, as it requires system-wide, inter-process measurements over the duration of program execution. The reason is that adaptive solvers and irregular data partitioning methods can dynamically introduce imbalances. It is thus important not to record and store data that scales linearly with problem size and run length. Furthermore, to diagnose the causes of imbalance, it is important to attribute observed problems to the region of application code that caused them. Without knowledge of application semantics, this task is difficult, so we have developed a model for characterizing the sources of computational load in SPMD codes, and a tool that uses the model to dynamically extract and store multidimensional load data efficiently [7]. This tool uses parallel wavelet transforms to reduce data volume by 2-3 orders of magnitude with very low error. Compression is fast enough for online, production use.
We have proposed, with our collaborators, a scalable component framework for rapid development of application-specific petascale performance tools [8]. Our framework enables existing scalable components (measurement, data aggregation, analysis) to be used modularly in conjunction with each other and to be reconfigured dynamically to fit the needs of applications.
Figure 1. Transient load in ParaDiS collision computation for 128 processes and 256 time steps.
Dynamic load-balance case study
We used our load-balance tool to monitor ParaDiS [9] , a crystal dislocation dynamics code. ParaDiS includes a data-dependent computation phase in which dislocations are split or merged when they collide. Figure 1 shows transient load spikes in this phase for a 128-processor run. Figure 2 shows elapsed time to gather over 100 such plots for 256 time steps of ParaDiS on Blue Gene/L for increasing processor counts, both with and without our tool. Without our tool, the time to collect and compress the data scales linearly with the number of processors in the system. With it, data compresses to a volume handled well by Blue Gene/L's I/O backbone, and we achieve constant scaling.
Case study: LQCD codes on multicore
Quantum chromodynamics (QCD) is the theory of the strong force in the Standard Model of subatomic physics. Since planned Lattice QCD (LQCD) calculations are expected to consume hundreds of teraflop-years, it is vital that LQCD code be as efficient as possible. As part of the SciDAC LQCD project, we are engaged in the evaluation and tuning of LQCD codes.
In common LQCD simulations, one phase involves a global matrix inversion. For a fixed size problem, the cost of the inversion phase increases super-linearly with the number of computational nodes used. Restated, for a fixed global problem size, inverter efficiency improves as each node's local problem size is increased.
The other phases are dominated by local computation and nearest neighbor communication. Implementations are extremely sensitive to memory traffic. Hence, highly tuned implementations of these phases perform very efficiently when they can be done in cache. Said differently, local kernel efficiency decreases when local problem size becomes larger than the cache size. We discussed the opposing efficiency trends in a previous article [10] in this series.
A key computational kernel for LQCD is the Wilson Dslash operator. Researchers at JLab, in collaboration with EPCC at Edinburgh, UK, have been adding multithread support to Dslash to better utilize multicore hardware. A comparison of several multicore implementations is presented in a related paper in these proceedings [11]. Prototypes compared include "MPI everywhere" and hybrid implementations using MPI and either OpenMP or QMT, a multithreading library specifically for QCD applications.
In conjunction with JLab, RENCI researchers are endeavoring to explain the details of the performance of these multithreaded implementations [11]. The performance differences between the hybrid versions are due to differences in thread scheduling. Each iteration of the Dslash test involves multiple parallel regions separated by MPI operations in sequential regions. In the OpenMP version, we observed that CPU utilization for each core averaged only about 60%, and this is confirmed by CPU cycle measurements within each thread. Hence, the computational cores block and go to a halt state between parallel regions, requiring relatively heavyweight operations to restart. The amount of work done in the parallel regions, particularly for smaller local problem sizes, is insufficient to amortize this overhead. The OMP_WAIT_POLICY control in OpenMP version 3 will mitigate this problem.
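The amortization argument can be expressed as a simple model in which each parallel region costs a fixed restart time in addition to its useful work. The sketch below is a deliberate simplification: it attributes all lost utilization to the restart cost, which ignores the time worker threads spend idle during the sequential MPI regions, and its function names are hypothetical.

```python
def core_utilization(work_s, restart_s):
    """Fraction of time a core does useful work, given a fixed cost per region."""
    return work_s / (work_s + restart_s)


def restart_cost_for(utilization, work_s):
    """Per-region overhead that would produce the given utilization."""
    return work_s * (1.0 / utilization - 1.0)


# Overhead, in units of useful work per region, implied by 60% utilization.
overhead = restart_cost_for(0.60, work_s=1.0)
print(f"overhead per region: {overhead:.2f} x useful work")

# The same fixed overhead is amortized better as the region's work grows.
for scale in (1, 4, 16):
    print(f"work x{scale:>2}: utilization {core_utilization(scale, overhead):.0%}")
```

Under this model, halving the per-region overhead or doubling the work per region has the same effect on utilization, which is why small local problem sizes are the most sensitive to thread wake-up cost.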
Of particular concern are issues of data and thread affinity, sharing data vs. competing for shared cache space, and general issues of cache locality. In the QMT version, we have diagnosed two areas for improvement. The version used in [11] uses explicit AMD "nontemporal" prefetches and stores. Eliminating extra prefetch instructions and using the native caching policies results in a 4-core, 1-chip speedup of about 10 percent. This is reflected as a greatly reduced L3 cache miss rate. About one quarter of the memory operations are now initiated by the hardware prefetcher.
QMT uses spin locks for synchronization and task dispatching. Hardware performance counter measurements show that four QMT threads running an 8^4 problem on a single 2.1 GHz AMD Barcelona chip generate an average of approximately 2.1 GBytes/second of memory traffic, attributed mostly to the spin lock. This memory traffic is uniform across threads, which gives no indication of imbalance. The memory traffic is, however, all generated during the sequential phases in which MPI operations occur and may thus be a source of interference. We are continuing our efforts to diagnose and improve the performance of this code.
Concluding remarks
In this brief paper, we have endeavored to present an overview of some of the challenges of today's high-performance computing environments and to discuss recent activities at the Renaissance Computing Institute to meet those challenges.
