The HPCChallenge suite of benchmarks examines the performance of HPC architectures using kernels with memory access patterns more challenging than those of the High Performance Linpack (HPL) benchmark used in the Top500 list. The suite is designed to augment the Top500 list, to provide benchmarks that bound the performance of many real applications as a function of memory access characteristics (e.g., spatial and temporal locality), and to provide a framework for including additional benchmarks. The HPCChallenge benchmarks are scalable, with the size of data sets being a function of the largest HPL matrix for a system. The suite has been released by the DARPA HPCS program to help define the performance boundaries of future Petascale computing systems. It is composed of several well-known computational kernels (STREAM, High Performance Linpack, DGEMM matrix multiply, matrix transpose, FFT, RandomAccess, and bandwidth/latency tests) that attempt to span the space of high and low spatial and temporal locality.
High Productivity Computing Systems
The DARPA High Productivity Computing Systems (HPCS) program [1] is focused on providing a new generation of economically viable high productivity computing systems for national security and for the industrial user community. HPCS program researchers have initiated a fundamental reassessment of how we define and measure performance, programmability, portability, robustness and, ultimately, productivity in the HPC domain.
The HPCS program seeks to create trans-Petaflop systems of significant value to the Government HPC community. Such value will be determined by assessing many additional factors beyond just theoretical peak flops (floating-point operations). Ultimately, the goal is to decrease the time-to-solution, which means decreasing both the execution time and development time of an application on a particular system. Evaluating the capabilities of a system with respect to these goals requires a different assessment process. The goal of the HPCS assessment activity is to prototype and baseline a process that can be transitioned to the acquisition community for 2010 procurements.
The most novel part of the assessment activity will be the effort to measure or predict the ease or difficulty of developing HPC applications. Currently, there is no quantitative methodology for comparing the development time impact of various HPC programming technologies. To achieve this goal, the HPCS program is using a variety of tools, including:
• Application of code metrics on existing HPC codes,
• Several prototype analytic models of development time,
• Interface characterization (e.g. programming language, parallel model, memory model, communication model),
• Scalable benchmarks designed for testing both performance and programmability,
• Classroom software engineering experiments,
• Human validated demonstrations.
These tools will provide the baseline data necessary for modeling development time and allow the new technologies developed under HPCS to be assessed quantitatively.
As part of this effort we are developing a scalable benchmark for the HPCS systems.
The basic goal of performance modeling is to measure, predict, and understand the performance of a computer program or set of programs on a computer system. The applications of performance modeling are numerous, including evaluation of algorithms, optimization of code implementations, parallel library development, comparison of system architectures, parallel system design, and procurement of new systems.
Motivation
The DARPA High Productivity Computing Systems (HPCS) program has initiated a fundamental reassessment of how we define and measure performance, programmability, portability, robustness and, ultimately, productivity in the HPC domain. With this in mind, a set of kernels was needed to test and rate a system. The HPCChallenge suite of benchmarks consists of four local kernel benchmarks (matrix-matrix multiply, STREAM, RandomAccess and FFT) and four global ones: High Performance Linpack (HPL), parallel matrix transpose (PTRANS), RandomAccess and FFT. HPCChallenge is designed to approximately bound computations of high and low spatial and temporal locality (see Figure 1). In addition, because HPCChallenge kernels consist of simple mathematical operations, the suite provides a unique opportunity to look at language and parallel programming model issues. In the end, the benchmark is to serve both the system user and designer communities [2].
The Benchmark Tests
This first phase of the project has developed, hardened, and reported on a number of benchmarks. The collection of tests includes tests on a single processor (local) and tests over the complete system (global). In particular, to characterize the architecture of the system we consider three testing scenarios:
1. Local: only a single processor is performing computations.
2. Embarrassingly Parallel: each processor in the entire system is performing computations, but they do not communicate with each other explicitly.
3. Global: all processors in the system are performing computations and they explicitly communicate with each other.
The HPCChallenge benchmark consists at this time of 7 performance tests: HPL [3], STREAM [4], RandomAccess, PTRANS, FFT (implemented using FFTE [5]), DGEMM [6, 7] and b_eff Latency/Bandwidth [8, 9, 10]. HPL is the Linpack TPP (toward peak performance) benchmark; the test stresses the floating-point performance of a system. STREAM is a benchmark that measures sustainable memory bandwidth (in GB/s). RandomAccess measures the rate of random updates of memory. PTRANS measures the rate of transfer for large arrays of data from the multiprocessor's memory. Latency/Bandwidth measures (as the name suggests) the latency and bandwidth of communication patterns of increasing complexity between as many nodes as is time-wise feasible.
Many of the aforementioned tests were widely used before HPCChallenge was created. At first, this may seem to make our benchmark merely a packaging effort. However, almost all components of HPCChallenge were augmented from their original form to provide a consistent verification and reporting scheme. We should also stress the importance of running these very tests on a single machine and having the results available at once. The tests were useful separately to the HPC community before, and with the unified HPCChallenge framework they create an unprecedented view of the performance characterization of a system: a comprehensive view that captures the data under the same conditions and allows for a variety of analyses depending on end user needs.
Each of the included tests examines system performance for various points of the conceptual spatial and temporal locality space shown in Figure 1, with the aim of measuring performance bounds on metrics important to HPC applications. Applications are expected to pass through various points of the locality space during runtime. Consequently, an application may be represented as a point in the locality space that is an average (possibly time-weighted) of its various locality behaviors. Alternatively, a decomposition can be made into time-disjoint periods in which the application exhibits a single locality characteristic. The application's performance is then obtained by combining the partial results from each period. Another aspect of performance assessment addressed by HPCChallenge is the ability to optimize benchmark code. For that we allow two different runs to be reported:
• Base run done with the provided reference implementation.
• Optimized run that uses architecture specific optimizations.
The base run, in a sense, represents the behavior of legacy code because it is conservatively written using only widely available programming languages and libraries. It reflects a commonly used approach to parallel processing, sometimes referred to as hierarchical parallelism, that combines the Message Passing Interface (MPI) with threading from OpenMP. At the same time we recognize the limitations of the base run and hence we allow (or even encourage) optimized runs to be made. The optimizations may include alternative implementations in different programming languages using parallel environments available specifically on the tested system. To stress the productivity aspect of the HPCChallenge benchmark, we require that information about the changes made to the original code be submitted together with the benchmark results. While we understand that full disclosure of optimization techniques may sometimes be impossible to obtain (due to, for example, trade secrets), we ask at least for some guidance for users who would like to use similar optimizations in their applications.
Benchmark Details
Almost all tests included in our suite operate on either matrices or vectors. We denote the size of the former by $n$ and of the latter by $m$. The following holds throughout the tests:

$$n^2 \simeq m \simeq \text{Available Memory}$$

In other words, the data for each test is scaled so that the matrices or vectors are large enough to fill almost all available memory.

HPL is the Linpack TPP (Toward Peak Performance) variant of the original Linpack benchmark, which measures the floating-point rate of execution for solving a linear system of equations. HPL solves a linear system of equations of order $n$:

$$Ax = b; \quad A \in \mathbb{R}^{n \times n}; \quad x, b \in \mathbb{R}^{n}$$

by first computing the LU factorization with row partial pivoting of the $n$ by $n + 1$ coefficient matrix:

$$P[A, b] = [[L, U], y].$$

Since the row pivoting (represented by the permutation matrix $P$) and the lower triangular factor $L$ are applied to $b$ as the factorization progresses, the solution $x$ is obtained in one step by solving the upper triangular system:

$$Ux = y.$$
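The lower triangular matrix $L$ is left unpivoted and the array of pivots is not returned. The operation count is $\frac{2}{3}n^3 - \frac{1}{2}n^2$ for the factorization phase and $2n^2$ for the solve phase. Correctness of the solution is ascertained by calculating the scaled residuals:

$$\frac{\|Ax - b\|_\infty}{\varepsilon \|A\|_1 n}, \quad \frac{\|Ax - b\|_\infty}{\varepsilon \|A\|_1 \|x\|_1}, \quad \text{and} \quad \frac{\|Ax - b\|_\infty}{\varepsilon \|A\|_\infty \|x\|_\infty},$$

where $\varepsilon$ is machine precision for 64-bit floating-point values.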
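The HPL procedure described above can be illustrated with a small serial sketch in plain Python. It is not the reference implementation: the function names are hypothetical, the residual scalings (infinity norm of $Ax - b$ divided by $\varepsilon$ and norms of $A$ and $x$) are an assumption based on the HPL documentation, and `sys.float_info.epsilon` is assumed to stand in for $\varepsilon$.

```python
import sys


def lu_solve_augmented(a, b):
    """Solve Ax = b by LU with row partial pivoting applied to [A b]."""
    n = len(a)
    # Build the n by (n + 1) augmented matrix so that pivoting and
    # elimination are applied to b as the factorization progresses.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for k in range(n):
        # Row partial pivoting: bring the largest entry of column k up.
        p = max(range(k, n), key=lambda i: abs(m[i][k]))
        m[k], m[p] = m[p], m[k]
        for i in range(k + 1, n):
            f = m[i][k] / m[k][k]  # no guard against a singular matrix
            for j in range(k, n + 1):
                m[i][j] -= f * m[k][j]
    # Back substitution on the upper triangular system Ux = y.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def hpl_flops(n):
    """Factorization count (2/3 n^3 - 1/2 n^2) plus solve count (2 n^2)."""
    return (2.0 / 3.0) * n**3 - 0.5 * n**2 + 2.0 * n**2


def scaled_residuals(a, x, b, eps=sys.float_info.epsilon):
    """Three scaled residuals of Ax - b (assumed scalings, see lead-in)."""
    n = len(a)
    r_inf = max(abs(sum(a[i][j] * x[j] for j in range(n)) - b[i])
                for i in range(n))
    a_one = max(sum(abs(a[i][j]) for i in range(n)) for j in range(n))
    a_inf = max(sum(abs(v) for v in row) for row in a)
    x_one = sum(abs(v) for v in x)
    x_inf = max(abs(v) for v in x)
    return (r_inf / (eps * a_one * n),
            r_inf / (eps * a_one * x_one),
            r_inf / (eps * a_inf * x_inf))
```

Dividing `hpl_flops(n)` by the measured wall-clock time gives the floating-point rate; a run would be accepted only if the residuals stay below a small threshold.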
DGEMM measures the floating-point rate of execution of double-precision real matrix-matrix multiplication. The exact operation performed is:

$$C \leftarrow \beta C + \alpha AB; \quad A, B, C \in \mathbb{R}^{n \times n}; \quad \alpha, \beta \in \mathbb{R}.$$

The operation count for the multiply is $2n^3$, and correctness of the operation is ascertained by calculating a scaled residual of the computed $C$ against $\hat{C}$, the result of a reference implementation of the multiplication.
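As a rough illustration of the DGEMM test, the sketch below assumes the standard BLAS definition of the operation, $C \leftarrow \beta C + \alpha AB$, and converts the $2n^3$ operation count into a rate. The function names are hypothetical, and a triple loop in Python only shows the arithmetic; it says nothing about how an optimized library performs.

```python
def dgemm(alpha, a, b, beta, c):
    """Return beta*C + alpha*A*B for square n-by-n lists of lists."""
    n = len(a)
    out = [[beta * c[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        for k in range(n):
            aik = alpha * a[i][k]
            # Inner loop walks a row of B and a row of the result.
            for j in range(n):
                out[i][j] += aik * b[k][j]
    return out


def dgemm_gflops(n, seconds):
    """Computational rate implied by the 2n^3 operation count."""
    return 2.0 * n**3 / seconds / 1e9
```

Timing a call to `dgemm` (for example with `time.perf_counter`) and passing the elapsed seconds to `dgemm_gflops` yields the reported figure.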
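STREAM is a simple synthetic benchmark program that measures sustainable memory bandwidth (in GB/s) and the corresponding computation rate for four simple vector kernels:

COPY: $c \leftarrow a$
SCALE: $b \leftarrow \alpha c$
ADD: $c \leftarrow a + b$
TRIAD: $a \leftarrow b + \alpha c$

where:
$a, b, c \in \mathbb{R}^m$; $\alpha \in \mathbb{R}$.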
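The four vector kernels can be sketched as follows, assuming the conventional STREAM definitions (Copy, Scale, Add, Triad) and 8-byte words. The function names are hypothetical; the per-kernel word counts simply tally one word per vector element read or written.

```python
def stream_kernels(a, b, c, alpha):
    """Run Copy, Scale, Add, Triad in place; return words moved per kernel."""
    m = len(a)
    moved = {}
    for i in range(m):          # Copy:  c <- a        (1 read, 1 write)
        c[i] = a[i]
    moved["copy"] = 2 * m
    for i in range(m):          # Scale: b <- alpha*c  (1 read, 1 write)
        b[i] = alpha * c[i]
    moved["scale"] = 2 * m
    for i in range(m):          # Add:   c <- a + b    (2 reads, 1 write)
        c[i] = a[i] + b[i]
    moved["add"] = 3 * m
    for i in range(m):          # Triad: a <- b + alpha*c (2 reads, 1 write)
        a[i] = b[i] + alpha * c[i]
    moved["triad"] = 3 * m
    return moved


def bandwidth_gbs(words, seconds, bytes_per_word=8):
    """Sustained bandwidth in GB/s for a given number of words moved."""
    return words * bytes_per_word / seconds / 1e9
```

In a real measurement each kernel is timed separately and the best of several repetitions is usually reported.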
As mentioned earlier, we try to operate on large data objects. The size of these objects is determined at runtime, in contrast to the original version of the STREAM benchmark, which uses static storage (determined at compile time) and size. The original benchmark gives the compiler more information (and control) over data alignment, loop trip counts, etc. The benchmark measures GB/s, and the number of items transferred is either $2m$ or $3m$ depending on the operation. The norm of the difference between reference and computed vectors is used to verify the result: $\|x - \hat{x}\|$.

PTRANS (parallel matrix transpose) exercises the communications where pairs of processors communicate with each other simultaneously. It is a useful test of the total communications capacity of the network. The performed operation sets a random $n$ by $n$ matrix to the sum of its transpose and another random matrix:

$$A \leftarrow A^T + B; \quad A, B \in \mathbb{R}^{n \times n}.$$

The data transfer rate (in GB/s) is calculated by dividing the size of the $n^2$ matrix entries by the time it took to perform the transpose. The scaled residual of the form $\frac{\|A - \hat{A}\|}{\varepsilon n}$ verifies the calculation.
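A serial stand-in for the PTRANS operation, assuming it has the form $A \leftarrow A^T + B$ and that each matrix entry occupies 8 bytes, might look like the sketch below. The function names are hypothetical, and the real test distributes the matrices across processors, which is where the communication cost arises.

```python
def ptrans(a, b):
    """Return A^T + B for n-by-n lists of lists (no communication modeled)."""
    n = len(a)
    return [[a[j][i] + b[i][j] for j in range(n)] for i in range(n)]


def transfer_rate_gbs(n, seconds, bytes_per_entry=8):
    """Size of the n^2 matrix entries divided by the transpose time, in GB/s."""
    return n * n * bytes_per_entry / seconds / 1e9
```

For instance, transposing a matrix of order 1000 in 8 ms corresponds to roughly 1 GB/s under these assumptions.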
RandomAccess measures the rate of integer random updates of memory (GUPS). The operation being performed on an integer array of size $m$ is:

$$x \leftarrow f(x); \quad f: x \mapsto (x \oplus a_i); \quad a_i \text{ a pseudo-random sequence},$$

where $f: \mathbb{Z}^m \to \mathbb{Z}^m$ and $x \in \mathbb{Z}^m$.

The operation count is $m$, and since all the operations are on integral values over a Galois field, they can be checked exactly with a reference implementation. The verification procedure allows 1% of the operations to be incorrect (either skipped or done in the wrong order), which allows loosening concurrent memory update semantics on shared memory architectures.
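The flavor of the RandomAccess update can be conveyed by the sketch below. It assumes that each update XORs a pseudo-random 64-bit value into a table slot selected by that value's low bits, and that the values come from a shift-register sequence over GF(2); the feedback constant `POLY`, the seed, and all function names are illustrative assumptions rather than a statement of the reference code.

```python
MASK64 = (1 << 64) - 1
POLY = 0x7  # assumed feedback polynomial for the pseudo-random stream


def next_random(ran):
    """Advance a 64-bit shift-register sequence over GF(2)."""
    high_bit_set = ran >> 63
    ran = (ran << 1) & MASK64
    return ran ^ POLY if high_bit_set else ran


def random_access(table, n_updates, seed=1):
    """XOR pseudo-random values into pseudo-random slots, in place."""
    size = len(table)  # assumed to be a power of two
    ran = seed
    for _ in range(n_updates):
        ran = next_random(ran)
        table[ran & (size - 1)] ^= ran
    return table


def gups(n_updates, seconds):
    """Giga-updates per second."""
    return n_updates / seconds / 1e9
```

Because XOR is its own inverse, applying the same update stream a second time restores the initial table, which gives one exact way to count how many updates went wrong.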
For FFT, the operation count is taken to be $5m \log_2 m$ for the calculation of the computational rate (in GFlop/s). Verification is done with a residual $\frac{\|x - \hat{x}\|}{\varepsilon \log(m)}$, where $\hat{x}$ is the result of applying a reference implementation of the inverse transform to the outcome of the benchmarked code (in infinite-precision arithmetic the residual should be zero).
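The FFT rate and verification can be sketched with a textbook recursive radix-2 transform. This is an assumption-laden illustration, not the FFTE-based code: the function names are hypothetical, the maximum norm and the natural logarithm are assumed for the residual, and the length $m$ must be a power of two greater than one.

```python
import cmath
import math
import sys


def fft(z, inverse=False):
    """Recursive radix-2 DFT (unscaled) of a power-of-two-length sequence."""
    m = len(z)
    if m == 1:
        return [complex(z[0])]
    sign = 1.0 if inverse else -1.0
    even = fft(z[0::2], inverse)
    odd = fft(z[1::2], inverse)
    out = [0j] * m
    for k in range(m // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / m) * odd[k]
        out[k] = even[k] + w
        out[k + m // 2] = even[k] - w
    return out


def fft_gflops(m, seconds):
    """Computational rate implied by the 5 m log2(m) operation count."""
    return 5.0 * m * math.log2(m) / seconds / 1e9


def fft_residual(z, eps=sys.float_info.epsilon):
    """Scaled residual between z and the inverse transform of fft(z)."""
    m = len(z)
    back = [v / m for v in fft(fft(z), inverse=True)]
    err = max(abs(u - v) for u, v in zip(z, back))
    return err / (eps * math.log(m))
```

Here the inverse transform plays the role of the reference implementation; in a real run it would be an independent code path.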
Communication bandwidth and latency is a set of tests that measure the latency and bandwidth of a number of simultaneous communication patterns. The patterns are based on b_eff (the effective bandwidth benchmark), although they differ slightly from the original b_eff. The operation count is linearly dependent on the number of processors in the tested system, and the time the tests take depends on the parameters of the tested network. The checks are built into the benchmark code: data is checked after it has been received.
Rules for Running the Benchmark
There must be one baseline run submitted for each computer system entered in the archive. There may also exist an optimized run for each computer system.
Baseline Runs
Optimizations as described below are allowed.
(a) Compile and load options. Compiler or loader flags which are supported and documented by the supplier are allowed. These include porting, optimization, and preprocessor invocation.
(b) Libraries. Linking to optimized versions of the following libraries is allowed:
Acceptable use of such libraries is subject to the following rules:
• All libraries used shall be disclosed with the results submission. Each library shall be identified by library name, revision, and source (supplier). Libraries which are not generally available are not permitted unless they are made available by the reporting organization within 6 months.
• Calls to library subroutines should have equivalent functionality to that in the released benchmark code. Code modifications to accommodate various library call formats are not allowed.
• Only complete benchmark output may be submitted; partial results will not be accepted.
Optimized Runs
(a) Code modification
Provided that the input and output specification is preserved, the following routines may be substituted:
• However the substitution of algorithms is allowed (see Exchange of the used mathematical algorithm).
ii. Exchange of the used mathematical algorithm. Any change of algorithms must be fully disclosed and is subject to review by the HPC Challenge Committee. Passing the verification test is a necessary condition for such an approval. The substituted algorithm must be as robust as the baseline algorithm. For the matrix multiply in the HPL benchmark, the Strassen algorithm may not be used, as it changes the operation count of the algorithm.
iii. Using the knowledge of the solution. Any modification of the code or input data sets which uses the knowledge of the solution or of the verification test is not permitted.
iv. Code to circumvent the actual computation. Any modification of the code to circumvent the actual computation is not permitted.
Software Download, Installation, and Usage
The reference implementation of the benchmark may be obtained free of charge at the benchmark's web site: http://icl.cs.utk.edu/hpcc/. The reference implementation should be used for the base run. The installation of the software requires creating a script file for Unix's make(1) utility. The distribution archive comes with script files for many common computer architectures. Usually, a few changes to one of these files will produce the script file for a given platform.
After a successful compilation the benchmark is ready to run. However, it is recommended that changes be made to the benchmark's input file, which describes the sizes of data to use during the run. The sizes should reflect the available memory on the system and the number of processors available for computations.
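One way to pick such sizes is sketched below, assuming that the matrix order $n$ and vector length $m$ satisfy $n^2 \approx m \approx$ available memory measured in 8-byte words. The helper name and the default fraction of memory (80%, leaving room for the operating system and buffers) are illustrative assumptions, not values prescribed by the benchmark.

```python
import math


def problem_sizes(mem_bytes, fraction=0.8, bytes_per_word=8):
    """Return (n, m): matrix order and vector length for the given memory."""
    # Number of 64-bit words that fit in the chosen share of memory.
    m = int(mem_bytes * fraction) // bytes_per_word
    # Largest matrix order whose n*n entries still fit in those words.
    n = math.isqrt(m)
    return n, m
```

For example, `problem_sizes(8 * 10**6, fraction=1.0)` gives a matrix order of 1000 and a vector length of one million; on a multi-node system `mem_bytes` would be the total memory across all nodes.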
We have collected a comprehensive set of notes on the HPCChallenge benchmark. They can be found at http://icl.cs.utk.edu/hpcc/faq/. Figure 2 shows a sample rendering of the results web page: http://icl.cs.utk.edu/hpcc/hpcc_results.cgi. Figure 3 shows a sample Kiviat diagram generated using the benchmark results.
Example Results

Conclusions
No single test can accurately compare the performance of HPC systems. The HPCChallenge benchmark test suite stresses not only the processors, but also the memory system and the interconnect. It is a better indicator of how an HPC system will perform across a spectrum of real-world applications. Now that the more comprehensive, informative HPCChallenge benchmark suite is available, it can be used in preference to comparisons and rankings based on single tests. The real utility of the HPCChallenge benchmarks is that architectures can be described with a wider range of metrics than just Flop/s from HPL. When looking only at HPL performance and the Top500 List, inexpensive build-your-own clusters appear to be much more cost effective than more sophisticated HPC architectures. Even a small percentage of random memory accesses in real applications can significantly affect the overall performance of that application on architectures not designed to minimize or hide memory latency. HPCChallenge benchmarks provide users with additional information to justify policy and purchasing decisions. We expect to expand the collection, and perhaps remove some existing benchmark components, as we learn more about it.
[10] Rolf Rabenseifner. Hybrid parallel programming on HPC platforms. 
