Summary
The work for this project comprised the following major areas:
• Provide a focused research and development program, creating new generations of high end programming benchmarks in order to realize a new vision of high end computing, high productivity computing systems (HPCS).
• Expose the issues of low efficiency, scalability, software tools and environments, and growing physical constraints.
• Architecture performance characterization of parallel systems being developed for the DARPA High Productivity Computing Systems Program.
• Development of software for the benchmarking and performance evaluation of key components of high performance systems.
• Development of methods for guiding the collection of performance data and for analyzing and abstracting from measured performance data.
• Helping to promote this effort in the community.
The objectives of this effort are:
• To establish a comprehensive set of parallel benchmarks that is generally accepted by both users and vendors of parallel systems.
• To provide a focus for parallel benchmark activities and avoid unnecessary duplication of effort and proliferation of benchmarks.
• To set standards for benchmarking methodology and result-reporting together with a control database/repository for both the benchmarks and the results.
• To make the benchmarks and results freely available in the public domain.
• To engage the high performance community in helping define the future expansion of the benchmark collection.
• To run HPC Challenge over a range of parameters.
• To collect and make available performance results in a standard web-based format.
• To compute software and hardware metrics.
• To apply the run-time tools being studied by the Execution Time modeling group to HPC Challenge.
Introduction
Unfortunately, much of the literature focuses on ad hoc approaches to evaluation of systems rather than on potential standardization of the benchmark process. If benchmarking is to mature sufficiently to meet the requirements of system architects as well as application and algorithm developers, it must address the issue of standardization.
A number of projects such as Perfect, NPB, ParkBench, and others have laid the groundwork for what we hope will be a new era in benchmarking and evaluating the performance of computers. The complexity of these machines requires a new level of detail in measurement and comprehension of the results. The quotation of a single number for any given advanced architecture is a disservice to manufacturers and users alike, for several reasons. First, there is a great variation in performance from one computation to another on a given machine; typically the variation may be one or two orders of magnitude, depending on the type of machine. Secondly, the ranking of similar machines often changes as one goes from one application to another. So, for example, the best machine for circuit simulation may not be the best machine for computational fluid dynamics. Finally, the performance depends greatly on a combination of compiler characteristics and the human effort that was expended on obtaining the results.
Methods, Assumptions, and Procedures
This first phase of the project developed, hardened, and reported on a number of benchmarks. The collection included tests on a single processor (local) and tests over the complete system (global). Each test examined performance with respect to spatial locality and temporal locality. The local tests included DGEMM, STREAM, RandomAccess, and FFT; the global tests included High Performance Linpack, PTRANS, RandomAccess, and FFT.
The most reliable technique for determining the performance of a program on a computer system is to run and time the program (multiple times), but this can be very expensive and it rarely leads to any deep understanding of the performance issues. It also does not provide information on how performance will change under different circumstances (e.g., scaling the problem or system parameters, or porting to a different machine).
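The run-and-time technique described above can be illustrated with a minimal sketch. The function names and the `workload` stand-in are hypothetical and are not part of the HPCC code; the sketch only shows why several repetitions are collected and summarized rather than trusting a single measurement.

```python
import statistics
import time


def time_repeated(func, repeats=5):
    """Run func() `repeats` times and return the wall-clock times in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Return the minimum, median, and maximum of the timing samples."""
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


def workload():
    # Hypothetical stand-in for the program being measured.
    return sum(i * i for i in range(100_000))


if __name__ == "__main__":
    print(summarize(time_repeated(workload)))
```

The spread between the minimum and the maximum gives a rough idea of run-to-run variation, but, as noted above, it says nothing about how the time would change on a different problem size or machine.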
An alternative approach to running the actual application codes is to develop a set of representative benchmark programs and to run these benchmarks on various systems with various problem and system sizes. The difficulty with this approach is that a quantitative analysis of the measured data is necessary to allow a deeper understanding and interpretation (i.e., abstraction) of the measured results. Statistical analysis techniques require a large amount of data to be collected. However, collecting data for all possible system and problem parameter settings is impractical. Hence, a determination needs to be made of what and how much data needs to be collected to provide an adequate basis for sound analysis.
Another approach is to generate a model of the program and the computer system and use the model to make performance predictions, varying model parameters to simulate varying program and computer system parameters. The difficulty with this approach is in generating and validating the model. The performance of production-level application codes is a result of complex interactions between processor architecture, memory access patterns, the memory hierarchy, the communication subsystem, and the system software. Modeling each of the complex components of the system alone is a challenge. Still more challenging is the task of accurately modeling the interactions between components and the performance of complex application codes on the entire system. Our approach in the second phase of this effort was to investigate the performance modeling problem by combining benchmarking, statistical analysis, and hierarchical modeling techniques to produce accurate models that can predict performance of complex applications on today's and tomorrow's high performance systems.
Although this work did not directly address performance engineering of complex application codes, our work laid the basis for the construction of parallel libraries that allow the reconstruction of application codes on several distinct architectures so as to assure performance portability. Once the requirements of applications are well understood, one can construct a library in a layered fashion.
The overall objective of this effort was to survey a number of DARPA-related applications to determine which metrics exist and which metrics need to be developed. In the course of this effort we helped define the metrics for future productivity, in particular:
• Temporal data locality measures the reuse of data by memory access patterns in the CPU time domain. In other words, it measures the likelihood that a datum is used at two points close in time.
• Spatial data locality measures the reuse of data by memory access patterns in the memory address space domain. In other words, it measures the likelihood that two data are both used, given that they are close to each other in memory.
• SLOC count is a simple yet very effective measure of code complexity and in turn characterizes very well the human effort involved in writing, maintaining, and refactoring a piece of code.
Using the above metrics, a set of representative application kernels was selected that reveals system performance and productivity under workloads with varying values of the metrics, so that bounds can be established for end-user applications.
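To make the two locality definitions concrete, the following sketch scores a recorded sequence of memory addresses. The scoring functions, the `window` and `distance` parameters, and the synthetic traces are assumptions chosen for illustration; they are not the locality measures used by the project.

```python
from collections import deque


def temporal_score(trace, window=8):
    """Fraction of accesses that reuse an address seen in the previous `window` accesses."""
    recent = deque(maxlen=window)
    hits = 0
    for addr in trace:
        if addr in recent:
            hits += 1
        recent.append(addr)
    return hits / len(trace) if trace else 0.0


def spatial_score(trace, window=8, distance=8):
    """Fraction of accesses within `distance` of a different, recently used address."""
    recent = deque(maxlen=window)
    hits = 0
    for addr in trace:
        if any(0 < abs(addr - prev) <= distance for prev in recent):
            hits += 1
        recent.append(addr)
    return hits / len(trace) if trace else 0.0


if __name__ == "__main__":
    unit_stride = list(range(64))               # neighbors in memory, no reuse
    same_datum = [0] * 64                       # one address reused repeatedly
    scattered = [i * 4096 for i in range(64)]   # far apart, no reuse
    for name, trace in [("unit_stride", unit_stride),
                        ("same_datum", same_datum),
                        ("scattered", scattered)]:
        print(name, temporal_score(trace), spatial_score(trace))
```

Under these assumed definitions, a unit-stride sweep scores high on spatial locality only, repeated use of one datum scores high on temporal locality only, and widely scattered accesses score low on both.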
Integrity of the Benchmark Code
The HPCC benchmark includes existing and well known benchmark codes as well as not so well known codes that were not intended for benchmarking by their authors. In both cases one may argue that it is possible to obtain HPCC-equivalent functionality by running each of the included tests separately or some subset thereof. Based on our extensive experience in high performance benchmarking, utilization of the entire suite, up to this point, offers as complete an analysis of benchmarking as has ever been done. There are a few important reasons why performing individual or subset tests would be neither practical nor complete:
1. Uniform verification. Each code was examined and, as necessary, augmented with a robust verification procedure that ensures numerical correctness of the result. This is in sharp contrast to traditional forms of benchmarking that focus only on best performance (or best time).
2. Reasonable optimization. In the hands of a skillful benchmarking engineer, each of HPCC's individual tests alone can be optimized beyond any skillful user's comprehension. Such a scenario is made unlikely by the HPCC framework, which encapsulates all the tests in a single runtime and thus excludes the possibility of switching the tested system into a special mode that would benefit only a single test. Another contribution of HPCC is the transfer of knowledge from the benchmarking engineer to the user, as the optimization techniques are meant to be disclosed by the party submitting results.
3. Convenience and correctness. Running each of HPCC's tests separately requires manual effort, which is cumbersome, costly, and error prone if it is to be done in a robust and reliable manner. HPCC eliminates this by automating the process of gathering performance data on widely applicable hardware characteristics.
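The idea of uniform verification in item 1 can be sketched for a linear solve: a timing is accepted only if the computed solution also passes a numerical check. The scaled-residual formula and the threshold of 16.0 below are assumptions modeled on Linpack-style checks, and all function names are hypothetical; the sketch does not reproduce the HPCC verification code and does not handle singular matrices.

```python
import sys


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (works on copies)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


def scaled_residual(a, x, b):
    """Infinity-norm residual of a x = b, scaled by machine epsilon and problem size."""
    n = len(b)
    eps = sys.float_info.epsilon
    residual = max(abs(sum(a[i][j] * x[j] for j in range(n)) - b[i]) for i in range(n))
    norm_a = max(sum(abs(v) for v in row) for row in a)
    norm_x = max(abs(v) for v in x)
    norm_b = max(abs(v) for v in b)
    return residual / (eps * (norm_a * norm_x + norm_b) * n)


def verified(a, x, b, threshold=16.0):
    """Accept a result only if the solution passes the residual check (assumed threshold)."""
    return scaled_residual(a, x, b) < threshold


if __name__ == "__main__":
    a = [[4.0, 1.0, 2.0], [1.0, 5.0, 1.0], [2.0, 1.0, 6.0]]
    b = [7.0, 7.0, 9.0]
    x = solve(a, b)
    print(x, verified(a, x, b))
```

A run that is fast but returns a perturbed solution fails the check, which is the contrast with benchmarking that reports only the best time.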
Results Overview
As a result of this project, software was developed as open source and distributed to the community via the normal channels for open source software: a publicly available web page is used for downloading stable releases of the software, while read-only CVS access may be used for development snapshots. Occasionally, the source code was distributed via email if other means were inaccessible. There were monthly phone calls and/or Access Grid meetings of participants to exchange ideas and progress, as well as an electronic mailing list to assist with participant communication. Finally, the lead participants took part in face-to-face meetings as needed, and there has been at least one such meeting annually.
The problem area may be characterized by the most common memory access patterns and is defined by seven benchmarks: HPL, DGEMM, STREAM, PTRANS, RandomAccess, FFT, and Latency/Bandwidth:
1. HPL is the Linpack Toward Peak Performance (TPP) benchmark. The test stresses the floating point performance of a system.
2. DGEMM measures the floating point rate of execution of double precision real matrix-matrix multiplication.
3. STREAM is a benchmark that measures sustainable memory bandwidth (in GB/s) of simple vector application kernels.
PTRANS (from the

Benchmark Data
To ensure broad impact and scientific value of the benchmark suite, a wide range of data is collected for each submission. The data is gathered in a database that has read-only public access. The data in the database can be divided into the following categories:
• CPU parameters such as floating-point execution rate (measured in Gflop/s) of various computational kernels
• Memory subsystem parameters such as data transfer rates (measured in GB/s) for various CPU workloads and memory bus sharing scenarios
• Communication subsystem parameters such as transfer rates (measured in GB/s) across the system interconnect and message latencies (measured in micro-seconds)
• Hardware, software, and productivity optimizations including complete description of the hardware configuration, software environment and tools that were used to produce the executable, and all the specific changes applied to the optimized run.
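The units named in the categories above follow from simple arithmetic on a measured time. The sketch below shows one common way to derive them; the operation count of 2n^3 for matrix multiplication, the three-words-per-element count for a triad kernel, the use of decimal prefixes, and the half-round-trip convention for latency are assumptions for illustration, and the function names are hypothetical.

```python
def dgemm_gflops(n, seconds):
    """Rate for an n x n x n matrix multiply, counting 2*n**3 floating-point operations."""
    return 2.0 * n**3 / seconds / 1e9


def triad_gbytes_per_s(length, seconds, word_bytes=8):
    """Rate for a[i] = b[i] + s*c[i]: two loads and one store per element."""
    return 3.0 * word_bytes * length / seconds / 1e9


def latency_us(round_trip_seconds):
    """One-way latency in microseconds, taken as half of a ping-pong round trip."""
    return round_trip_seconds / 2.0 * 1e6


if __name__ == "__main__":
    print(dgemm_gflops(1000, 2.0))            # matrix order 1000 multiplied in 2 s
    print(triad_gbytes_per_s(10**9, 24.0))    # 1e9-element triad in 24 s
    print(latency_us(4e-6))                   # 4 microsecond round trip
```

Because each rate depends on the assumed operation or byte count, results are comparable only when every submission uses the same counting convention.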
Benchmark Optimization and Result Database
An integral part of HPCC is the database that stores various optimization techniques applicable to the HPCC tests as well as the results of applying these optimizations. All of the data is time stamped and consistently gathered from every system submitted to the HPCC website. As such, it constitutes an invaluable resource for hardware, software, and application vendors alike. The optimization portion of the database stores two types of information:
1. Hardware/software optimization. This portion includes reports on how the programming environment and system libraries influence hardware and its performance.
2. Productivity optimization. This part of the database shows how the human factor is included in the overall system design. In particular the result submitters describe changes made for the optimized run of the benchmark. Based on this information, conclusions may be drawn about feasibility and actual performance gains for end-applications.
The results portion of the database includes various computer system parameters gathered during the benchmark run. In particular, the data may be divided into three groups:
Processor parameters
These include floating-point execution rates for computational kernels such as global linear equation solving, local matrix multiplication and local/global FFT. Numerical capabilities of the processor are noted as well by measuring relevant numerical norms that assess correctness and quality of the delivered solution.
Memory parameters
These include transfer rates of multiple sorts that show performance of the simplest data movement scenarios (CPU-memory transfer) and more elaborate schemes that involve multiple CPU calculations combined with simultaneous accesses to multiple memory modules.
Interconnect parameters
Essentially, two types of parameters are considered: latency and bandwidth. But the measurement scenarios used to obtain them vary greatly, from the simple polling-driven scheme with only two interconnect end-points exchanging data in a synchronous fashion, to network-capacity-limited tests of raw communication and computation-interleaved probes that rely on communication system throughput and tolerance of high volume traffic.
2. Exchange of the used mathematical algorithm. Any change of algorithms must be fully disclosed and is subject to review by the HPC Challenge Committee. Passing the verification test is a necessary condition for such an approval. The substituted algorithm must be as robust as the baseline algorithm. For the matrix multiply in the HPL benchmark, the Strassen algorithm may not be used, as it changes the operation count of the algorithm.
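The remark about operation count can be illustrated with a small count of scalar multiplications. The sketch assumes a matrix order that is a power of two and a Strassen recursion carried all the way down to 1 x 1 blocks; it ignores additions, and the function names are hypothetical.

```python
def classical_multiplications(n):
    """Scalar multiplications in the classical n x n matrix product."""
    return n**3


def strassen_multiplications(n):
    """Scalar multiplications when Strassen's recursion is applied down to 1 x 1 blocks.

    Assumes n is a power of two; each level replaces 8 half-size products with 7.
    """
    if n == 1:
        return 1
    return 7 * strassen_multiplications(n // 2)


if __name__ == "__main__":
    n = 1024
    classical = classical_multiplications(n)
    strassen = strassen_multiplications(n)
    print(classical, strassen, classical / strassen)
```

For n = 1024 the classical count is 1024^3 while the Strassen count is 7^10, so a rate computed from the nominal operation count and the shorter run time would overstate the work actually performed.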
Project's Website
In order to provide easy access to the results of this project, a publicly available website was developed. The website can be accessed at http://icl.cs.utk.edu/hpcc/ (the site has nearly 2000 visitors per month and has been used to download the benchmark code by almost 1000 visitors). It consists of the following components:
• Rules for running the benchmark and reporting results.
• News items from external media outlets.
• Download page offering the benchmark code in various forms and versions.
• Frequently Asked Questions page with an extensive list of questions frequently encountered in benchmarking and pertaining to the HPCC Suite.
• Resource page with (mostly external) links related to the benchmark suite.
• Pages with sponsors and collaborators that made the project possible.
• Web form for submitting information about the tested system and the output of the benchmark.
• Database read-only interface that allows users to interactively obtain various views of the submitted data or export the contents of the database for archiving or more thorough analysis on the user's system.
The entire contents of the website, including the source code (in PDF format) for all seven releases of the benchmark suite, have been placed on a CD accompanying this report.
Conclusions
The impact of this work on the community is the availability of an easy mechanism to test, evaluate and compare high productivity systems. The applications of performance modeling are numerous, including evaluation of algorithms, optimization of code implementations, parallel library development, comparison of system architectures, parallel system design, and procurement of new systems.
The main components of the HPC Challenge Benchmark Suite are based on existing codes that are well known and used in the HPC community. This fact greatly contributes to wider adoption of our effort. In addition, our framework combines these existing codes in a unique way by defining very specific rules about the conditions under which they should be run, and some of the included tests are used in new scenarios. We also greatly expanded the benchmark deployment and results data to facilitate fairness of performance assessment.
