The Scalable HeterOgeneous Computing (SHOC) benchmark suite was released in 2010 as a tool to evaluate the stability and performance of emerging heterogeneous architectures and to compare different programming models for compute devices used in those architectures. Since then, high-performance computing (HPC) system architectures have increasingly incorporated both discrete and fused multi-core and many-core processors. The TOP500 list illustrates this trend: heterogeneous systems grew from a 3.4% to an 18.0% share of the list between June 2010 and June 2015. Not only are there more heterogeneous systems on the TOP500 list today, but those machines are also responsible for a disproportionately large percentage of the list's aggregate performance: as of June 2015, the performance share for heterogeneous systems had grown to 33.7%.
INTRODUCTION
Due to the growing interest in heterogeneous computing during the past few years, there is a large variety in accelerators being used for general scientific computing applications. Add to this the variety of programming models which support these devices, and one can imagine the difficulty in getting a consistent view of the performance characteristics within this landscape. In particular, the following types of questions are becoming more and more difficult to answer:
• What are the performance differences between devices from different architecture generations due to changes in peak compute capability, memory architecture, memory bandwidth, etc.?
• Are there additional performance costs for using a low-power accelerator aside from the difference in maximum theoretical peak FLOPS?
• How much of a performance trade-off is involved when moving from a low-level accelerator programming model to a directive-based approach, and what does the code actually look like for the same or an equivalent algorithm?
The Scalable HeterOgeneous Computing benchmark suite (SHOC) [10] was designed to answer these kinds of questions, in addition to providing traditional benchmarking capabilities like the comparison of the same algorithm on different devices and the evaluation of development tool chains for a variety of common scientific computing algorithms. Sec. 3 demonstrates some examples of these types of comparisons, and provides pointers to the raw data that has been made publicly available for a variety of platforms. Since SHOC's introduction in 2010, the heterogeneous computing landscape has changed with respect to both hardware and software. This paper presents additions and modifications to SHOC that address some of these changes. In Sec. 2.2, we discuss the OpenACC directive-based programming model as well as the implications of this approach on the SHOC benchmark kernel implementations. Sec. 2.3 presents similar issues regarding SHOC support for the Intel Xeon Phi accelerator. In addition to extending SHOC to support new programming models, we also applied our extended version of SHOC to several new architectures. We give an overview of these architectures in Sec. 3.2. To improve the stock SHOC distribution's application domain coverage, our extensions include new benchmark kernels that more thoroughly cover traditional scientific computing algorithms. In Sec. 2.4 we present our benchmark extensions in the areas of integer operation-intensive hashing, sparse linear algebra, computational fluid dynamics, and clustering algorithms. Our extensions to the stock SHOC also include core "big data" algorithms such as breadth-first search, relational database primitives, and deep-learning neural nets as discussed in Sec. 2.4.1.
ALIGNING SHOC WITH RECENT HETEROGENEOUS TRENDS
Since SHOC's introduction in 2010, its developers have made periodic improvements to the benchmark suite. Besides support for new programming models, hardware, and algorithms, the developers made small functionality improvements, including improvements to the benchmark execution scripts and results-gathering capabilities, and deployed a results repository that is hosted along with the SHOC source code at GitHub, a widely used, publicly available Git repository hosting site [37]. Table 1 summarizes the SHOC benchmarks, highlighting the new benchmarks and programming model support to distinguish them from those provided by the original SHOC developers. The original SHOC benchmark suite was designed to support comparisons using only CUDA and OpenCL. We have added benchmarks and support for new programming models. When describing our augmented version of SHOC, we adopt the terminology from the original developers, whereby Level 0 benchmarks support basic "speeds and feeds" characterization of hardware, Level 1 benchmarks represent application kernels used in a variety of computing applications, and Level 2 benchmarks represent the core computation of large scientific applications of importance to the Department of Energy. Many of the Level 1 and Level 2 benchmarks can be categorized according to the application domains and patterns they represent. As shown in Table 1, we provide a mapping to Berkeley's 13 "Dwarf Mine" classifications [2] (in parentheses under "Characteristics"). The Berkeley dwarves are an extension of Colella's original seven computational dwarves [8] that describe the communication and computation characteristics of parallel algorithms. For example, in this classification scheme, Monte Carlo is considered to be a "MapReduce" algorithm due to limited inter-processor communication and repeated, independent trials.
Our extended version of SHOC has good coverage of both Colella's original seven dwarves (1; 3-7) and the extended Berkeley set (1-5; 7-10; 12). Some algorithms like sorting are considered too fundamental to be classified in this way, and so the primary benchmark characteristic (local memory bandwidth) is listed in the table instead.
Comparison With Other Benchmarks
SHOC's initial 2010 release was bracketed by two other benchmark suites focused on accelerators, Rodinia [5] and then Parboil [34] . Rodinia included CUDA and OpenCL versions of applications that exhibited features of Colella's seven algorithmic "dwarves" [8] , and the current Rodinia suite includes kernels similar to SHOC's implementations of Sort, MD, and BFS. It also includes more novel algorithms like Needleman-Wunsch (bioinformatics) and backpropagation (machine learning). Parboil contains fewer application kernels than Rodinia or SHOC but provides more implementations of each, with optimized CUDA kernels, OpenMP, and OpenCL versions of benchmarks that include SAD (H.264 encoding) and Lattice Boltzmann (fluid dynamics). Parboil also contains benchmarks like BFS, GEMM, SpMV, and 3D Stencil for which SHOC has similar implementations. The three benchmark suites SHOC, Rodinia, and Parboil have varying degrees of application coverage, and each provides CUDA and OpenCL implementations. Our extended version of SHOC differs from the other benchmark suites in its support for directive-based programming languages (OpenACC and LEO as opposed to OpenMP). Because implementations of even the common benchmarks vary across suites, the appropriate choice of suite may depend on the user's benchmarking goal.
Applications from both Rodinia and Parboil have been adopted into the SPEC Accel 1.0 benchmark suite along with the OpenACC implementation of the NAS Parallel benchmarks [41] and several SPEC OpenMP benchmarks. Unlike SHOC, Rodinia, and Parboil, the SPEC benchmark suite is not free, even for research purposes.
Finally, LonestarGPU [3] and Pannotia [4] are a benchmark suite and research study, respectively, that focus more on irregular applications like graph analytics and which also include implementations of BFS.
Additional Programming Models
In recent years, OpenACC gained wide exposure as a path forward for application developers who wished to take advantage of accelerators without the need for drastic modifications to their legacy codes. However, the concept of programming accelerators using directives predates OpenACC. As early as 2009, vendors such as CAPS and The Portland Group (PGI) had their own, disparate implementations of directive-based programming approaches [29]. In the fall of 2010, CAPS and PGI joined with NVIDIA and Cray to form the OpenACC working group, and a year later the OpenACC 1.0 specification was released [27]. There are now more than 20 OpenACC members with representation from academia, industry, and government institutions. An updated 2.0 specification was released in June 2013, and a draft of the version 2.5 specification is available for public comment. Updates are driven largely by the user community, and they continue to be integrated for future versions. The utility of OpenACC is quite broad, with the intent of the specification itself extending beyond graphics processing units (GPUs) to a wide variety of accelerator device vendors and architectures including so-called fused GPUs, many-core, and digital signal processors. Because OpenACC specifications are open (i.e., not tied to a particular type of device or device vendor), once an OpenACC specification is released it is up to the community to provide compilers that implement the specification. These implementations come from a variety of sources, including not only industry (e.g., PGI and Cray) but also the research [20, 40, 28] and open source [12] communities.
Because of the current level of interest in OpenACC, its quick evolution, and broad applicability, most of the SHOC benchmark kernels have been updated so that they can be built using the OpenACC directive programming model, with the remainder in development. However, in most cases the SHOC OpenACC versions do not attempt to reproduce exactly the algorithms used in the corresponding CUDA/OpenCL versions, for two reasons. First, the OpenACC specification does not consider architecture-specific features that are exposed by CUDA and/or OpenCL. For example, NVIDIA GPUs contain shared memory that can be explicitly managed using CUDA and OpenCL, but the same level of control of this memory is not exposed to the developer when using OpenACC. Second, OpenACC is not targeted at users who are porting code bases that are already accelerated using a lower-level programming model. Instead, the typical workflow when programming with directives is to parallelize an existing implementation of a serial algorithm, or (less frequently) to retarget a directive-based implementation that only supports traditional CPUs so that it can use accelerators. SHOC currently only uses features from the OpenACC 1.x series of specifications; as more compilers offer reliable support for 2.x features, further benchmark optimizations will be possible. Similarly, as OpenMP's emerging support for accelerated offload directives solidifies, our version of SHOC will also incorporate support for OpenMP accelerator directives.
Additional Hardware Platforms
The Intel Xeon Phi discrete accelerator (sometimes also called Many Integrated Core or MIC) was officially released in 2012 [7]. As opposed to current discrete GPU accelerators, the Xeon Phi is built around power-efficient x86 cores with extended support for vector operations. Intel's "Knight's Corner" Xeon Phi accelerator contains 57-61 Atom-like (P1) cores on a PCIe form-factor board with eight to sixteen GB of GDDR5 memory that provides bandwidth of 240-352 GB/s. Xeon Phi cores support the x86 instruction set, and reward effective use of integrated 512-bit SIMD vector units and large, coherent L1 and L2 caches that are not present on current GPU architectures.
Many developers find the Xeon Phi attractive because it supports the familiar and widely-supported x86 instruction set. However, as with discrete GPUs, the Phi's unique vector-based architecture means that codes must often be vectorized manually to achieve maximum performance [18]. Also, the Phi has been shown to excel at regular, strided accesses and is somewhat easier to program for object segmentation and feature computation algorithms, whereas GPUs currently have much better performance for random accesses to device memory [35]. The Xeon Phi accelerator runs a Linux operating system on the device, and code compiled for the Phi can be launched by transferring the code to the device via ssh or a shared file system. As an alternative to these established but low-level mechanisms, Intel provides proprietary Language Extensions for Offload (LEO) [16] directives that allow for offloading computation to a Xeon Phi accelerator from a program that is running on the host. Similar to OpenACC, LEO uses directives to target data and functions for transfer to the Xeon Phi. In addition, LEO supports features like asynchronous data transfers and fall-back execution on the CPU in case an offload operation fails. OpenMP 4.0's targeting directives [22] are expected to provide some functionality similar to that which LEO provides.
In order to provide code that could be run natively on the Knight's Corner as well as in "offload" mode, a combination of OpenMP 3.0 and LEO was used as the target language for the MIC versions of the SHOC benchmarks. New Level 0 and 1 benchmarks were designed in partnership with Intel, and a port of the Level 2 benchmark S3D was recently completed, allowing for comparisons with the CUDA and OpenCL versions. Due to the Phi's flat memory hierarchy (when compared to GPUs), certain benchmarks like Device Memory are more similar to the OpenACC implementations than to the original SHOC's CUDA/OpenCL versions.
Additional Application Domains
SpMV. Sparse matrix-vector multiply (SpMV) has traditionally been one of the most heavily used kernels in HPC, dominating the performance of diverse applications in computational science and engineering, economic modeling and information retrieval. However, it is a frequent bottleneck, notorious for sustaining a low fraction of peak processor performance [38] . It is included in popular high-performance numerical libraries, such as MKL [15] and CUSPARSE [26] , and it is the subject of ongoing research seeking good SpMV performance from modern heterogeneous architectures [1, 38] .
Our extensions to SHOC include an SpMV implementation which tests the operation using two of the most common sparse matrix representation schemes, ELLPACK [17] and compressed sparse row (CSR). The benchmark uses well-known optimizations when the hardware architecture and programming model support them, such as reading the dense vector out of texture memory in the CUDA and OpenCL implementations when running on devices that expose access to such memory.
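To make the two storage schemes concrete, the following pure-Python sketch computes y = Ax from a CSR representation and from an ELLPACK representation derived from it. It is an illustration only: the function names, the padding convention (zero value, column 0), and the 3×3 example matrix are assumptions made here and are not taken from the SHOC sources, which implement the kernels in CUDA and OpenCL.

```python
# Minimal SpMV sketch in pure Python: y = A * x for CSR and ELLPACK storage.
# All names here are illustrative and are not taken from the SHOC sources.

def spmv_csr(row_ptr, col_idx, vals, x):
    """CSR: row i owns the nonzeros vals[row_ptr[i]:row_ptr[i + 1]]."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y

def csr_to_ell(row_ptr, col_idx, vals):
    """Pad every row to the longest row; padded slots hold value 0.0, column 0."""
    n_rows = len(row_ptr) - 1
    width = max(row_ptr[i + 1] - row_ptr[i] for i in range(n_rows))
    ell_cols = [[0] * width for _ in range(n_rows)]
    ell_vals = [[0.0] * width for _ in range(n_rows)]
    for i in range(n_rows):
        for slot, k in enumerate(range(row_ptr[i], row_ptr[i + 1])):
            ell_cols[i][slot] = col_idx[k]
            ell_vals[i][slot] = vals[k]
    return ell_cols, ell_vals

def spmv_ell(ell_cols, ell_vals, x):
    """ELLPACK: every row performs the same number of multiply-adds."""
    return [sum(v * x[c] for c, v in zip(cols, row_vals))
            for cols, row_vals in zip(ell_cols, ell_vals)]

# 3x3 example matrix:
# [[4, 0, 1],
#  [0, 2, 0],
#  [3, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals = [4.0, 1.0, 2.0, 3.0, 5.0]
x = [1.0, 2.0, 3.0]

y_csr = spmv_csr(row_ptr, col_idx, vals, x)
y_ell = spmv_ell(*csr_to_ell(row_ptr, col_idx, vals), x)
```

The uniform row width in ELLPACK trades extra storage and arithmetic on padded entries for regular access patterns, whereas CSR stores only the nonzeros but gives each row a different amount of work.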
MD5 Hashing. MD5 is a hash function, or message digest algorithm, generating a 128-bit hash for an arbitrarily long sequence of input bytes [23]. Although it was designed as a cryptographic hash function for which inversion and collision finding are difficult, it is now known to have weaknesses [33] which limit its current practicality in the cryptographic domain. However, MD5 has other attributes that make it a useful complement to the various memory-bandwidth and floating-point intensive algorithms already in SHOC. In particular, like many hash functions, MD5 contains many integer and bit-wise operations. In addition, it has a small memory footprint (requiring few table lookups), and is designed to be reasonably fast. By including a test of these attributes, SHOC is able to probe an additional dimension of performance of heterogeneous computing algorithms and devices.
MD5 itself is not a parallel-friendly algorithm. Also, because MD5 is not highly compute intensive, using it to digest a large quantity of text would stress bus bandwidth more than the compute operations of MD5. As such, the SHOC "MD5Hash" benchmark is designed to hash many input keys simultaneously, which is similar to the behavior one would observe in generating indices in a standard hash table algorithm. These input keys are not fed to the MD5Hash algorithm from an input data set, but generated programmatically from a well-defined keyspace. Specifically, each parallel thread generates and hashes a consecutive segment of keys from this keyspace. This design makes it easy to stress the integer and bit operations of heterogeneous devices. In fact, this specific benchmark scales quickly in performance, nearly approaching its asymptotic peak hash rate within only a few seconds of computation even on today's high-end accelerators.
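The keyspace partitioning described above can be sketched as follows, using Python's standard `hashlib` in place of a device kernel. The alphabet, key length, worker count, and the digest search used as a check are all hypothetical choices made for this illustration; the layout of SHOC's actual keyspace may differ.

```python
import hashlib

# Hypothetical keyspace: fixed-length keys over a small alphabet, enumerated
# in base-len(ALPHABET) order. SHOC's real keyspace layout may differ.
ALPHABET = "abcd"
KEY_LEN = 4
KEYSPACE_SIZE = len(ALPHABET) ** KEY_LEN  # 256 keys

def index_to_key(index):
    """Map an integer in [0, KEYSPACE_SIZE) to a key, least significant digit first."""
    chars = []
    for _ in range(KEY_LEN):
        index, digit = divmod(index, len(ALPHABET))
        chars.append(ALPHABET[digit])
    return "".join(chars).encode("ascii")

def hash_segment(worker_id, num_workers):
    """Generate and hash the consecutive block of keys owned by one worker."""
    per_worker = (KEYSPACE_SIZE + num_workers - 1) // num_workers
    start = worker_id * per_worker
    stop = min(start + per_worker, KEYSPACE_SIZE)
    return {i: hashlib.md5(index_to_key(i)).hexdigest() for i in range(start, stop)}

def find_key(target_digest, num_workers=8):
    """Scan every worker's segment for the key whose digest matches."""
    for worker_id in range(num_workers):
        for index, digest in hash_segment(worker_id, num_workers).items():
            if digest == target_digest:
                return index_to_key(index)
    return None

target = hashlib.md5(b"dcba").hexdigest()
found = find_key(target)
```

Because keys are computed from an index rather than read from memory, each worker needs only its segment bounds as input, which keeps data transfer negligible relative to the integer and bit-wise work.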
Level 2. The SHOC Level 2 benchmarks present a workload that constitutes a substantial kernel from a full application. SHOC currently contains two Level 2 benchmarks, S3D and QTC. The S3D benchmark measures the performance of the most computationally intensive function from the S3D [13] turbulent combustion simulation application. Mapping this function, getrates, to a GPU has been shown to provide substantial performance benefits compared to the traditional implementation [32] . SHOC provides both CUDA and OpenCL implementations of this benchmark, both of which break the getrates function into several smaller kernels of straight line code. The QTC benchmark implements the quality threshold data clustering algorithm [14] , originally developed for clustering time course gene expression data. Quality threshold clustering has an advantage over the commonly used k-means clustering approach in that the number of clusters need not be specified a priori. Despite its reputation as being computationally demanding, a parallel version of quality threshold clustering has been shown to perform well on systems with heterogeneous architectures after overcoming several implementation challenges [11] . SHOC currently includes a CUDA version of this threshold clustering algorithm, and extending support for these new benchmarks to other programming models is planned for a future release.
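As a rough illustration of why quality threshold clustering needs no preset cluster count, the sketch below grows one candidate cluster per remaining point, keeps the largest, removes its members, and repeats. It is a serial, one-dimensional simplification written for this text; the distance measure, tie-breaking, and function names are assumptions and do not reflect the CUDA implementation in SHOC.

```python
# Quality threshold (QT) clustering sketch on 1-D points, pure Python.
# The 1-D distance and the tie-breaking rules are simplifying assumptions.

def farthest_member_distance(cluster, candidate, points):
    """Largest distance from `candidate` to any current cluster member."""
    return max(abs(points[candidate] - points[m]) for m in cluster)

def candidate_cluster(seed, remaining, points, threshold):
    """Grow a cluster from `seed`, adding the closest-fitting point each step."""
    cluster = [seed]
    pool = [i for i in remaining if i != seed]
    while pool:
        best = min(pool, key=lambda i: (farthest_member_distance(cluster, i, points), i))
        if farthest_member_distance(cluster, best, points) > threshold:
            break  # adding any further point would exceed the diameter threshold
        cluster.append(best)
        pool.remove(best)
    return cluster

def qt_cluster(points, threshold):
    """Return clusters as sorted lists of point indices, largest found first."""
    remaining = list(range(len(points)))
    clusters = []
    while remaining:
        # Build one candidate per remaining point and keep the largest.
        best = max((candidate_cluster(s, remaining, points, threshold)
                    for s in remaining), key=len)
        clusters.append(sorted(best))
        remaining = [i for i in remaining if i not in best]
    return clusters

points = [1.0, 1.2, 1.1, 5.0, 5.3, 9.0]
clusters = qt_cluster(points, threshold=0.5)
```

The candidate clusters in each round are independent of one another, which is the source of parallelism, while the repeated rebuilding of candidates is what gives the algorithm its reputation for being computationally demanding.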
Data Analytics
As data sets continue to grow due to increased computational ability, increased connectivity, and collection through sources like social media, there has been a shift in emphasis within the HPC community from generating data (e.g., by running a simulation) to processing extremely large data sets. Accordingly, so-called "big-data" issues have gotten increased attention from the heterogeneous computing community as well. Companies like MapD and SQLSTREAM have begun to focus on accelerators for their products. Although accelerators like GPUs and Xeon Phi are well-suited for fast processing, they can be hindered by data-transfer bottlenecks. To be useful to those benchmarking outside the traditional HPC problem domains, our extended version of SHOC includes benchmarks incorporating algorithms used by big-data applications. Although software like Hadoop and LexisNexis HPCC can also be used for benchmarking, their bar to entry is high because they normally require significant database software infrastructure that SHOC does not require.
Neural Network. Big-data applications like graph analysis, sequence analysis, relational frameworks, and deep learning neural networks were heavily emphasized at NVIDIA's GTC14 [31] and GTC15 [36] conferences. Deep learning is a type of machine learning, and can be used, among other things, for image classification, voice recognition, spam filtration, and character recognition. Interest in implementing deep learning neural networks on accelerators exploded when the Google Brain project which required 16,000 CPUs and 600kW of power was reproduced on a workstation with two GPUs and 4kW of power. NVIDIA's release of the cuDNN library with GPU support for deep learning further spurred community interest.
SHOC now includes a deep learning neural network that recognizes handwritten digits (0-9), based on Michael Nielsen's Python implementation [25] of Andrew Ng's [24] deep learning algorithm. This Python code was used as a baseline to produce both a CUDA version, using CUBLAS and handwritten CUDA, and an Intel Xeon Phi version, using OpenMP and offload primitives. The neural net has 784 input neurons, ten output neurons, and one hidden layer with thirty neurons. The network is trained with images from the Modified National Institute of Standards and Technology (MNIST) database of handwritten digits [19]. After being trained with 50,000 training sets, the neural net can recognize digits with 95% accuracy.
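The network shape given above (784 inputs, 30 hidden neurons, 10 outputs) can be sketched as a single forward pass in pure Python. The sigmoid activation, the Gaussian initialization, and the all-zero placeholder image are assumptions made for this illustration; the weights are untrained, so the predicted digit carries no meaning.

```python
import math
import random

# Layer sizes from the text: 784 inputs, 30 hidden neurons, 10 outputs.
SIZES = [784, 30, 10]

def init_network(sizes, seed=0):
    """Random Gaussian weights and biases; the seed is only for reproducibility."""
    rng = random.Random(seed)
    weights = [[[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
               for n_in, n_out in zip(sizes[:-1], sizes[1:])]
    biases = [[rng.gauss(0.0, 1.0) for _ in range(n_out)] for n_out in sizes[1:]]
    return weights, biases

def sigmoid(z):
    """Numerically stable logistic function (assumed activation)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def feedforward(weights, biases, activation):
    """Propagate one input vector through every layer."""
    for layer_w, layer_b in zip(weights, biases):
        activation = [sigmoid(sum(w * a for w, a in zip(row, activation)) + b)
                      for row, b in zip(layer_w, layer_b)]
    return activation

weights, biases = init_network(SIZES)
n_params = (sum(len(row) for layer in weights for row in layer)
            + sum(len(b) for b in biases))

image = [0.0] * 784  # placeholder for a flattened 28x28 MNIST image
output = feedforward(weights, biases, image)
predicted_digit = output.index(max(output))
```

With these layer sizes the network has 784×30 + 30×10 weights and 30 + 10 biases, 23,860 trainable parameters in total, so each per-layer matrix operation is small.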
Data Analytics Microbenchmarks. Data analytics is represented in our extended version of SHOC by relational algebra kernels based on the TPC-H [9] industry standard "decision support" benchmark suite. The TPC-H benchmark suite currently contains 21 unaccelerated queries that perform operations like SQL "select," "group by," and "order by" to return the number and pricing of a certain item in the data warehouse over a certain time period. As discussed in related work [39], this type of query can be resolved into multiple relational algebra operators including sort, join, select, and aggregate. These relational algebra operators can then be implemented using CUDA or OpenCL kernels to accelerate data warehousing queries.
The SHOC implementation of data analytics kernels includes OpenCL-based standalone tests for ten relational algebra operators, and three common patterns that are similar to those that can be found as part of TPC-H queries. These operators include select, project, inner join, product, unique, union, intersection, reduction, reduce by key, and difference, in addition to basic mathematical functions (add, subtract, multiply). The patterns included in SHOC consist of serial operations like multiple selects, multiple joins, and combinations of other relational algebra operators.
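A few of these operators, and a pattern that chains them, can be sketched on in-memory tuples as shown below. The table contents, column positions, and function signatures are invented for this illustration and are not the OpenCL kernels shipped with SHOC.

```python
# Rows are plain tuples; the table contents and column layout are made up.
orders = [  # (order_id, item_id, quantity)
    (1, 10, 2), (2, 11, 1), (3, 10, 5), (4, 12, 7),
]
items = [  # (item_id, unit_price)
    (10, 3.0), (11, 20.0), (12, 0.5),
]

def select(rows, predicate):
    """Keep the rows for which the predicate holds."""
    return [r for r in rows if predicate(r)]

def project(rows, columns):
    """Keep only the listed column positions."""
    return [tuple(r[c] for c in columns) for r in rows]

def inner_join(left, right, left_col, right_col):
    """Hash join: concatenate rows whose key columns match."""
    index = {}
    for rrow in right:
        index.setdefault(rrow[right_col], []).append(rrow)
    return [lrow + rrow for lrow in left for rrow in index.get(lrow[left_col], [])]

def reduce_by_key(rows, key_col, value_fn):
    """Sum value_fn(row) over all rows sharing the same key."""
    totals = {}
    for r in rows:
        totals[r[key_col]] = totals.get(r[key_col], 0.0) + value_fn(r)
    return totals

# Pattern: select, then join, then reduce by key.
big_orders = select(orders, lambda r: r[2] >= 2)
joined = inner_join(big_orders, items, left_col=1, right_col=0)
revenue = reduce_by_key(joined, key_col=1, value_fn=lambda r: r[2] * r[4])
order_ids = project(big_orders, [0])
```

Each operator consumes and produces a flat list of rows, which is what allows them to be composed into the multi-operator patterns described above.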
EXAMPLE CASE STUDIES
In this section, we present case studies of using our extended version of SHOC to investigate the performance of hardware, software systems, and application areas that have become relevant in the HPC space since the original release of SHOC. On the hardware side, new architectures have been introduced by different vendors, and established architectures have undergone some significant changes in areas like capability and energy efficiency. This section shows how SHOC can be used to gain insight into the relative strengths and weaknesses of these expanded options. Directive-based programming is being promoted as a high-productivity way to use accelerators, and so we used SHOC to evaluate the efficacy of these models compared to the lower-level programming models (e.g. CUDA and OpenCL) that were considered in the original SHOC study [10] . 1 
New Benchmarks
Neural Net Results. On an Intel Xeon E5-2670 CPU with 16 cores, the OpenMP version of the neural net benchmark scaled to 8 cores, achieving a performance of 25,000 training sets per second. The same OpenMP version of the neural net benchmark with offload directives scaled to 14 cores, achieving a performance of 4,000 training sets per second on an Intel Xeon Phi 5110P (61 cores). The CUDA version of the neural net benchmark achieved a performance of 35,000 training sets per second on an NVIDIA M2090. It is counter-intuitive that the GPU version performed significantly better than the Phi version, since the 5110P has a theoretical double precision peak of > 1 TFLOP/s, while the M2090 has a theoretical double precision peak of only 665 GFLOP/s. The high GPU performance relative to the Phi is due primarily to the fact that the lion's share of the neural net processing time occurs inside a long sequence of relatively tiny, highly rectangular ((30 × 784) × (784 × 10)), matrix-matrix products, executed sequentially. It is well known that matrix-matrix products of this size and shape do not scale to large numbers of cores. In fact, threaded MKL will ignore the threads given to it for this problem, and use a single thread to compute the matrix-matrix products. For this reason, we parallelized the matrix-matrix products using OpenMP on the Phi. On the GPU, the matrix-matrix products are parallelized using CUBLAS. The GPU architecture is better suited for the special processing required to scale tiny, highly rectangular matrix-matrix products, and the CUBLAS developers have had years to perfect support for this kind of operation.
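A back-of-the-envelope calculation suggests why products of this shape are hard to spread across many cores. The sketch below uses only the matrix dimensions, the core count, and the lower bound on peak rate quoted above; treating the peak as exactly 1 TFLOP/s and dividing output elements evenly across cores are simplifying assumptions.

```python
# Work in one (30 x 784) by (784 x 10) matrix-matrix product.
M, K, N = 30, 784, 10

flops = 2 * M * K * N        # one multiply and one add per inner-product term
output_elements = M * N      # independent dot products available to share out

PHI_CORES = 61               # Xeon Phi 5110P core count quoted in the text
PHI_PEAK_FLOPS = 1.0e12      # lower bound on the quoted double-precision peak

elements_per_core = output_elements / PHI_CORES
ideal_seconds = flops / PHI_PEAK_FLOPS
```

Each product involves 470,400 floating-point operations and only 300 output elements, so an even split leaves about five dot products per core and an ideal compute time under half a microsecond. At that scale, per-product threading overhead could plausibly outweigh the arithmetic itself.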
A second reason for the low Phi performance relative to the GPU is the fact that the Intel compilers are not fully vectorizing the Phi code. In order to achieve the full power of the Xeon Phi, applications must both scale to 61 cores, and vectorize well (making use of the wide Xeon Phi vector unit). It is also counter-intuitive that the 5110P performance was significantly lower than the E5-2670, since the E5-2670 has a theoretical double precision peak of 332 GFLOP/s. In order for the performance of an application running on the Phi to exceed the performance of the same application running on the CPU, that application must scale well beyond 16 cores (due to the fact that the Phi cores are less powerful). This is not the case for the neural net code. Figure 1 shows that the Phi neural net code ran significantly slower with a single thread per core than with two or four threads per core. This is because a Phi core swaps between threads every other cycle. When the core has only a single thread, it sits idle half the time.
1 Due to space limitations, we provide a sampling of our results in Section 3. The full results from these experiments are freely available from the new SHOC data repository that accompanies the SHOC source code repository [37].
Full results from the analytics microbenchmarks are published in [30], and as with other SHOC benchmarks, the analytics kernel benchmarks illustrate the speed at which a particular architecture can perform a computation kernel as well as the total number of operations per second when taking into account data transfer. For example, Figure 2 demonstrates that the Kepler GPU had the best performance on the project compute kernel with 7.54 Giga-operations per second (GOPS) for its largest input, 1024 MB, while the Xeon Phi computed at 2.17 GOPS. In addition, fused parts like the AMD Trinity A10-5500K showed mixed performance for the CPU and GPU components.
The Trinity CPU executed at only 0.38 GOPS due to low thread count and clock frequencies, while the integrated GPU slightly outperformed other CPUs like the Haswell i7-4770 (0.89 GOPS for Trinity vs. 0.85 GOPS for the Haswell, both with 128 MB input). As further described in the related paper, this benchmark also demonstrates the performance impacts of zero-copy mechanisms that are becoming more common in integrated CPU and GPU parts like the Intel Haswell GPU (result not shown), which can perform the project operation at 1.17 GOPS for a 256 MB input set, including all data transfer costs.
SpMV, S3D, MD5Hash, and QTC Results. The new SHOC benchmark kernels SpMV, S3D, MD5Hash, and QT Clustering represent more traditional, compute-oriented HPC application domains. We discuss them in the context of hardware and programming model case studies in the following sections.
Hardware

Figure 3: Speedups of the full K40 card relative to the Jetson TK1, shown for GPU-only timings and for timings that include PCIe transfers. Due to clock speed and SMX count differences, we expect speedups in the 10x-15x range.
Figure 3 shows the architectural scaling within the Kepler series of GPUs from NVIDIA. Specifically, the Jetson TK1 has one Kepler SMX and the K40 has 15. We do not expect perfect scaling with a 15-fold improvement, however, because of clock speed differences; the actual FLOPS rate ratio is closer to 12:1 (depending on the impact of clock boost), and the memory bandwidth ratio is about 14:1. The Kepler architecture scaled well with increased SMX count; most benchmarks showed the improvement we expect, in the 10×-15× range for K40 versus TK1. The few noteworthy cases of lower improvement are due to the limited effective bandwidth between the host and device memory on the K40 (< 10 GB/s) and between the GPU and global memory on the Jetson, which is limited to 14.8 GB/s [6]. Since both devices showed similar bus speed rates for memory accesses, the benchmarks which include CUDA memory copy timings (labeled "PCIe" in the figures) showed a smaller improvement. A few cases, such as SGEMM, MD, and BFS, showed higher than expected improvement; we note that the different host CPU, library optimization, and other platform differences could potentially cause these results.
Kepler (K40). First, with the exception of the benefit provided by the PCIe bus speed on the M2090 platform and its impact upon a few of the PCIe-heavy benchmarks (Triad, ScanPCIe and ReductionPCIe), the Kepler improved upon the Fermi in almost every way. Improvements in the synthetic tests generally carried over to the benchmarks; for example, benchmarks such as Scan, Sort, and Reduction improved by about the same amount as memory bandwidth, while more FLOPS-dependent benchmarks like SGEMM improved in a manner more similar to the improvement in FLOPS rate.
The MD benchmark provided an unexpected result: it improved more than either the FLOPS rate or the memory rate, showing that the differences in architecture between these generations are more complex than raw numbers would indicate.
Figure 5: Relative performance of current high-end GPUs from AMD and NVIDIA, shown as the speedup of the W9100 over the K40 (log scale, 0.1x to 10x) for GPU-only timings and for timings that include PCIe transfers. Numbers greater than 1x show the AMD FirePro W9100 as faster, while values less than 1x show the NVIDIA Tesla K40 as faster.
Figure 5 shows relative performance of the AMD FirePro W9100 versus the NVIDIA Tesla K40c; bars above the 1.0 line show the AMD card as faster, and bars below the line show the NVIDIA card as faster. Both results were obtained using OpenCL; in some cases, the NVIDIA results using OpenCL were slower than those using CUDA. For synthetics such as maximum floating point rate and memory bandwidth, the W9100 surpassed the K40, and that often translated into better performance in benchmarks such as FFT, GEMM, and MD, although the higher bus speed on the K40 platform can sometimes offset these deficits. Other data-parallel patterns like Scan and Sort were still faster on the K40, while an integer-heavy operation like MD5Hash was much faster on the W9100.
Figure 6 shows relative performance of the K20 GPU when compared to MIC, using CUDA 6.5 on the K20 and the best MIC result on a Phi 5110P using offload mode with Intel 2013 or 2015 compilers. Both devices have similar MaxFlops results (K20 had a speedup of 1.6 with 3120 SP MaxFlops vs. 1936 for the Phi) and global memory bandwidths (148.87 vs. 175.70 GB/s for global read bandwidth on the K20 and MIC, respectively), but each device excelled at specific benchmarks and has different local memory/cache characteristics. The K20 benefited from a more mature software stack in CUDA and hand-tuned code that enabled speedups of 2.90, 2.56, 7.64, 1.70, and 15.80 for the SGEMM, DGEMM, MD (SP FLOPS), Scan, and SpMV (CSR, SP, vec) benchmarks, respectively. The 5110P performed best on Sort, Stencil, and S3D, with speedups of 1.62, 1.62, and 1.58, respectively.
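The speedups reported in these comparisons are ratios of higher-is-better metrics. As a worked check, the short sketch below recomputes two of them from the MaxFlops and global read bandwidth values quoted above; the variable names and the assumption that MaxFlops is reported in GFLOP/s are ours.

```python
# Values quoted in the text for the K20 and the Xeon Phi 5110P.
k20_maxflops_sp = 3120.0   # SP MaxFlops result (GFLOP/s assumed)
phi_maxflops_sp = 1936.0
k20_read_bw = 148.87       # global read bandwidth, GB/s
phi_read_bw = 175.70

def speedup(a, b):
    """How many times faster a is than b, for higher-is-better metrics."""
    return a / b

k20_over_phi_flops = speedup(k20_maxflops_sp, phi_maxflops_sp)
phi_over_k20_bw = speedup(phi_read_bw, k20_read_bw)
```

The MaxFlops ratio works out to about 1.6 in favor of the K20, and the read bandwidth ratio to about 1.18 in favor of the Phi, so neither synthetic result alone accounts for the much larger per-benchmark differences.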
While some of the benchmark results can be explained by differences in caching effects on each device and differences in local memory bandwidth or cache size (K20 has faster local memory while Phi has a larger cache), these last three benchmarks seem to vary in performance based on the implementation and underlying complexity of the accelerated code portion. Sort and Stencil both have finely tuned kernels for the CUDA and MIC versions, but it seems likely that the Phi's larger cache, slightly faster global memory read speed, and possibly even the ring bus connection between Phi cores contributed to speedup in favor of the MIC version. In addition, the CUDA version of Stencil makes use of an optimized template scheme to access and manage shared memory. While this type of optimization improved performance within a single generation of NVIDIA GPUs, we found it to be susceptible to performance portability differences across GPU generations; the cumulative effect could be a code-related advantage to the MIC version. Finally, while GPGPU is also a good candidate for accelerating S3D, examination of the larger production code for use with OpenMP and OpenACC has shown that the majority of the computation is performed in well-defined loops that iterate over the 3D grid and can be easily parallelized with directives [21] .
Discussion
This selection of results provided an example of the types of hardware comparisons that SHOC supports. For more in-depth insight into these platforms, there is a data repository for many architectures that contains results for all benchmarks, which is hosted along with the SHOC source code at GitHub [37].
The new benchmarks in SHOC can help analyze and understand the basic design and performance of recent hardware devices. For example, we observed that AMD cards exhibited far greater integer performance than one would expect given the ratio of floating point compute performance relative to an NVIDIA card. Something straightforward like a different selection of available integer operations can swing the advantage toward one device or another; this is seen sometimes in specialized usage scenarios occurring in the HPC and consumer spaces. In another example, the K20 versus MIC comparisons exposed relative differences between memory bandwidth and peak compute capability.
In other cases, however, the performance relationship between these cards can be complex and dependent on a variety of factors. The new SHOC benchmarks can help to determine the relationship in even these cases. Thread launch parameters provide a well-known example: AMD GPUs will generally perform at their peak when using different kernel dimensions than those that give the best performance with NVIDIA GPUs. The host platform represents another example: in our experiments, both the AMD and NVIDIA GPUs were connected to their host using PCIe gen 2, and so transfer rate and bandwidth measurements will not reflect the full capability of the latest accelerators, which can take advantage of PCIe gen 3. Besides the supporting hardware environment, the software environment can also have an effect when comparing different platforms and their software stacks. As a concrete example, we generally find that the NVIDIA BLAS implementation in CUBLAS is updated more frequently than Intel's MKL.
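Because the best kernel dimensions differ between vendors, a fair comparison usually tunes the launch size per device. The sketch below shows one simple way to do this: an exhaustive sweep over candidate work-group sizes. It is an illustration, not the procedure SHOC uses; `measure_seconds` is a hypothetical callback that would launch the kernel and return its elapsed time, and the two "device" cost functions use made-up numbers so that the example runs without a GPU.

```python
from typing import Callable, Iterable, Tuple


def best_launch_size(
    candidates: Iterable[int],
    measure_seconds: Callable[[int], float],
    repeats: int = 3,
) -> Tuple[int, float]:
    """Return (work-group size, best time) over the candidate sizes.

    measure_seconds is a hypothetical callback that runs the kernel once
    with the given work-group size and returns the elapsed time in seconds.
    """
    best_size, best_time = None, float("inf")
    for size in candidates:
        # Keep the fastest of several repeats to reduce timing noise.
        elapsed = min(measure_seconds(size) for _ in range(repeats))
        if elapsed < best_time:
            best_size, best_time = size, elapsed
    if best_size is None:
        raise ValueError("no candidate sizes were given")
    return best_size, best_time


# Stand-in cost models for two devices (made-up numbers, illustration only).
def device_a(size: int) -> float:
    return 1.0 + abs(size - 64) / 64.0


def device_b(size: int) -> float:
    return 1.0 + abs(size - 256) / 256.0


sizes = [32, 64, 128, 256, 512]
print(best_launch_size(sizes, device_a))
print(best_launch_size(sizes, device_b))
```

In this toy setup the two devices prefer different sizes (64 and 256), which is the situation described above: reusing one device's best configuration on the other would understate its performance.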
Programming Models - Directives
One of SHOC's strengths is that it allows users to compare operations implemented using different programming models being used for heterogeneous HPC applications. The original SHOC developers compared CUDA and OpenCL [10] .
In this section, we compare directive-based approaches to their alternatives for both GPU and Intel MIC platforms.
Results
Figure 7: Speedup of OpenACC (PGI 13.10, 14.6, and 14.7) relative to CUDA 6.5, plotted on a logarithmic axis. In all cases, a Fermi M2090 GPU was targeted.
The two leftmost benchmarks in the figure, MaxFlops and Bus BW (download), show that there is no intrinsic penalty for either raw compute or bandwidth when switching from the lower-level CUDA to the directive-based OpenACC programming model. Scan and Sort showed significant slowdown when using OpenACC; both of these algorithms involve unavoidably non-coalesced memory accesses, and so make heavy use of the on-chip shared memory, which is not exposed to the programmer via OpenACC. The OpenACC implementation of Reduction and Stencil showed a less dramatic slowdown, but it is exacerbated by the across-the-board drop in performance when moving to newer versions of the compiler, possibly reflecting a regression problem in the compiler's OpenACC support.
Figure 9 shows the relative performance of the OpenMP plus LEO implementation run in offload mode (from the host) versus running it in native mode on the MIC itself. For most benchmarks, the performance was almost the same, with offload mode showing a performance advantage. However, any of the benchmarks that include PCIe bandwidth as a measurement metric showed that the native version, which reads from global device memory, outperformed the much slower PCIe 2.0 interconnect; for example, Triad is 12.9x faster when using the local memory than when using PCIe 2.0. On the other hand, many of the native runs for the benchmarks without data transfer time showed that performance is, in general, slower by 2% (SGEMM) to 42% (Scan). Most surprisingly, the global memory read bandwidth was almost 50% slower in native mode. One possible reason for this slowdown is that the compilation flags used for compiling OpenMP code differ between native and offload mode, and offload pragmas are not optimized to the same performance level with the native compilation flags.
for the compute portion of the benchmark but still had better transfer performance and so had a speedup of 1.6x for MD BW (SP) w/PCIe (10.81 vs. 6.70 GFLOPS). While the versions are designed to mirror each other as closely as possible, this variation is likely due to a limitation in how OpenMP parallelizes the core MD kernel, compute_lj_force.
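The gap between a benchmark's kernel-only rate and its rate "w/PCIe" can be approximated with a simple additive model: total time is the kernel time plus the time needed to move the data across the bus. The Python sketch below is a back-of-the-envelope model under the assumption that transfers and computation do not overlap; it is not how SHOC measures these metrics, and the function name and example numbers are hypothetical.

```python
def rate_with_transfer(work: float, kernel_seconds: float,
                       bytes_moved: float, link_bytes_per_second: float) -> float:
    """Effective rate (work units per second) when bus transfers are included.

    Assumes transfers do not overlap with computation, so the times add.
    """
    if kernel_seconds <= 0 or link_bytes_per_second <= 0:
        raise ValueError("times and bandwidths must be positive")
    transfer_seconds = bytes_moved / link_bytes_per_second
    return work / (kernel_seconds + transfer_seconds)


# Hypothetical example: 10 GFLOP of work, a 0.1 s kernel,
# and 4 GB moved over a link that sustains 6 GB/s.
work_gflop = 10.0
kernel_only = work_gflop / 0.1
with_pcie = rate_with_transfer(work_gflop, 0.1, 4e9, 6e9)
print(f"kernel only: {kernel_only:.1f} GFLOPS, with transfer: {with_pcie:.1f} GFLOPS")
```

Under this model, a device with a slower kernel can still report the higher "w/PCIe" rate if its transfers are sufficiently faster, which is the pattern seen in the MD result above.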
Discussion
Using SHOC's OpenACC benchmarks illuminated some key points about productivity and performance portability when using this relatively new programming model.
Despite the fact that the 1.0 version of the specification was released more than two years ago, compiler support for OpenACC is still quite immature. Commercial availability is limited to PGI and Cray, and GCC support for OpenACC on a few devices (Knights Landing and PTX backends) has just been introduced with GCC 5.0. We used the PGI compiler for all of the development and results collection of the OpenACC SHOC benchmarks presented in this paper, and observed wide functionality and performance variability between compiler versions. This is illustrated in Figure 7, which shows some results comparing CUDA and several versions of the PGI compiler. For example, when using version 13.10 of the compiler, MaxFlops performed near the device peak flops rate as expected, but the MD5Hash benchmark would not complete at all, with no warnings or errors provided by the compiler's built-in diagnostics. However, when using the newer 14.6 version of the compiler, MD5Hash would run as expected, but performance of MaxFlops dropped to less than 50% of the device peak. We also observed that the compiler had trouble correctly translating macros and standard C89 language constructs when OpenACC functionality was enabled.
Another major limitation of the OpenACC programming model is due not to compiler support, but rather is inherent to the specification. In an attempt to be device agnostic, the specification tries to avoid constructs that are aimed at device-specific features. An example of such a feature is the shared memory on NVIDIA GPUs, which is not available to the developer solely via OpenACC.
There is a broader result of these limitations beyond difficulty of use or performance portability between software stacks. The lack of access to specific hardware capabilities makes it difficult to develop high-performance data parallel primitives as building blocks for more complex algorithms. For example, primitives such as parallel scan and radix sort depend on the use of shared memory in GPUs to deal with unavoidably non-coalesced memory access patterns. Figure 7 shows some of these performance penalties in benchmarks such as Reduction, Scan, and Sort.
Another situation where SHOC demonstrated that a lack of support for device-specific features can have a significant performance impact involved the use of LEO to take advantage of Xeon Phi features. Using OpenMP and LEO versus using OpenCL to implement and run the Xeon Phi benchmarks showed that while most Level 0 benchmarks have similar performance across languages, the custom code written for the MIC versions generally outperformed OpenCL by a significant margin. Benchmarks like MD and Triad showed similar performance due to their simplicity, but Sort, GEMM, and Scan had large differences in performance, on the order of 6x to 30x, in part due to the OpenMP pragmas that allow scaling to large numbers of cores and in part due to a lack of any specific Phi optimizations for workgroup and vector size in OpenCL. Related work for the TPC-H microbenchmarks [30] also demonstrated this same performance penalty for OpenCL code running on the Phi, and this could be one reason that Intel has not fully committed to supporting OpenCL for Phi devices through their SDK. Even within the OpenMP-based framework for the Xeon Phi, the seemingly minor differences in compilation between compiler versions and flags used for offload mode versus native mode have been shown to cause large variations in performance.
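To illustrate why primitives such as scan lean on fast local storage, the following serial Python sketch models the blocked structure that parallel scans commonly use: each block is scanned in a small local buffer, and the result is then shifted by the total of all preceding blocks. This is a conceptual model only, not SHOC's CUDA or OpenACC implementation; the function name and block size are our own choices for illustration.

```python
from itertools import accumulate
from typing import List


def blocked_inclusive_scan(values: List[int], block_size: int) -> List[int]:
    """Inclusive prefix sum computed block by block.

    Each block is scanned in a small local buffer (the role on-chip shared
    memory plays on a GPU), then shifted by the running total of all
    preceding blocks.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    result: List[int] = []
    carry = 0  # sum of all elements in earlier blocks
    for start in range(0, len(values), block_size):
        local = values[start:start + block_size]      # stage the block locally
        local_scan = list(accumulate(local))          # scan within the block
        result.extend(carry + x for x in local_scan)  # add the preceding total
        carry += local_scan[-1]
    return result


data = [3, 1, 4, 1, 5, 9, 2, 6]
print(blocked_inclusive_scan(data, block_size=3))
```

On a GPU, the per-block step is where repeated, irregular accesses occur; when the programming model gives no way to stage that block in on-chip memory, those accesses fall through to global memory, which is consistent with the Scan and Sort penalties discussed above.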
SUMMARY
Heterogeneous HPC has gone from mostly experimental to mainstream since the SHOC benchmark suite was first released in 2010, but new hardware, software stacks, and algorithmic developments have emerged since then. To continue to aid researchers, software developers, platform procurement efforts, and hardware vendors, we augmented the original SHOC benchmark suite to better support a wide variety of accelerator architectures such as recent discrete and fused GPUs as well as many-core accelerators. New algorithms, including a few from the burgeoning field of "big data," are now included as SHOC benchmarks to help evaluate a system for a wider variety of possible application spaces. We also extended SHOC to support directive-based approaches in addition to its original low-level accelerator programming models (CUDA, OpenCL). In our case studies, we have shown that the immaturity of some of these tool chains can be a serious impediment to performance despite the programming model's goal of improved productivity.
