Spatter: A Tool for Evaluating Gather / Scatter Performance by Lavin, Patrick et al.
Spatter: A Tool for Evaluating Gather / Scatter
Performance
Patrick Lavin, Jeffrey Young, Jason Riedy, Rich Vuduc
Georgia Institute of Technology
Email: plavin3,jyoung9,jason.riedy,richie@gatech.edu
Aaron Vose, Dan Ernst
Cray Inc.
Email: avose,dje@cray.com
This paper describes a new benchmark tool, Spatter, for
assessing memory system architectures in the context of a
specific category of indexed accesses known as gather and
scatter. These types of operations are increasingly used to
express sparse and irregular data access patterns, and they
have widespread utility in many modern HPC applications
including scientific simulations, data mining and analysis
computations, and graph processing. However, many traditional
benchmarking tools like STREAM, STRIDE, and GUPS focus
on characterizing only uniform stride or fully random accesses
despite evidence that modern applications use varied sets of
more complex access patterns.
Spatter provides a tunable and configurable framework to
benchmark a variety of indexed access patterns, including
variations of gather / scatter that are seen in HPC mini-apps
evaluated in this work. The design of Spatter includes tunable
backends for OpenMP and CUDA, and experiments show
how it can be used to evaluate 1) uniform access patterns for
CPU and GPU, 2) prefetching regimes for gather/scatter, 3)
compiler implementations of vectorization for gather/scatter,
and 4) trace-driven “proxy patterns” that reflect the patterns
found in multiple applications. The results from Spatter exper-
iments show that GPUs typically outperform CPUs for these
operations, and that Spatter can better predict the performance
of some cache-dependent mini-apps than traditional STREAM
bandwidth measurements.
I. INTRODUCTION
New CPU architectures have begun to incorporate advanced
vector functionality like AVX-512 and the Scalable Vector
Extension (SVE) for improved SIMD application performance.
In addition, many of these new vector specifications include
explicit support for indexed accesses like gather and scatter
(G/S). These types of memory operations involve a load
or store through a level of indirection, such as reg ←
base[idx[k]], and they appear commonly in scientific
and data analysis applications.
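As a minimal illustration of the two operations, the following Python sketch shows an indexed load (gather) and an indexed store (scatter). The function names `gather` and `scatter` are ours, chosen for illustration; they are not part of Spatter or of any vector ISA.

```python
def gather(base, idx):
    """Indexed load: return [base[idx[0]], base[idx[1]], ...]."""
    return [base[i] for i in idx]


def scatter(base, idx, values):
    """Indexed store: write values[k] into base[idx[k]] (modifies base in place)."""
    for i, v in zip(idx, values):
        base[i] = v
    return base


data = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
print(gather(data, [4, 0, 2]))               # reads elements 4, 0, and 2
print(scatter(data, [1, 5], [-1.0, -2.0]))   # overwrites elements 1 and 5
```

On hardware with G/S support, the loop inside each function would ideally map to a single vector instruction per group of indices.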
While many memory-focused microbenchmarks [1] are
available today, a gap exists in the evaluation of indexed
accesses including gather and scatter. We are motivated to
design a benchmarking tool that assesses system performance
on gather / scatter workloads for three different types of users:
1) vendors and hardware architects might wonder how new ISAs
(such as AVX-512) and their implementation choices actually
impact memory system performance, 2) application developers
may consider how the data structures they choose impact
the G/S instructions their code compiles to, and 3) compiler
writers might require better data on real-world memory access
patterns to decide whether to implement a specific vectorization
optimization for sparse accesses.
In considering these needs, we have formulated one such
tool, called Spatter. It evaluates indexed access patterns based
on gather and scatter operations, which represent a variety
of applications across different language and architecture
platforms. More importantly, we believe Spatter can help to
answer a variety of system, application, and tool evaluation
questions, some of which include: 1) What application gather /
scatter patterns exist in the real world, and how do they impact
memory system performance? 2) How does prefetching affect
performance of indexed accesses on modern CPU platforms?
3) How does the performance of G/S change when dealing
with sparse data on CPUs and GPUs?
We show in this work that the Spatter tool suite can address
these questions with the following key features. At a basic level,
Spatter provides tunable gather and scatter implementations.
These include CUDA and OpenMP backends with knobs for
adjusting thread block size and ILP on GPUs and work-per-
thread on CPUs. Spatter also includes a scalar, non-vectorized
backend that can serve as a baseline for evaluating the benefits
of vector load instructions over their scalar counterparts.
Finally, Spatter includes support for running built-in, param-
eterized memory access patterns, or custom patterns. We show,
for instance, how one can collect G/S traces from Department
of Energy (DoE) mini-apps to gain insights or make rough
predictions about performance for hot kernels that depend on
indexed accesses.
Results from Spatter show that newer GPU architectures
perform best for both gather and scatter operations in part
due to memory coalescing and faster memories. AMD Naples
performs best of all the CPU-based platforms (Broadwell,
Skylake, TX2) for strided accesses. A study of prefetching with
Spatter further shows how gather / scatter benefits from modern
prefetching across Broadwell and Skylake CPUs. Spatter’s
scalar backend is also used to demonstrate how compiler
vectorization can improve gather / scatter (Section V-C) with
large improvements for both Skylake and Knight’s Landing.
Experiments for three DoE mini-apps show G/S performance
improvements enabled by caching on CPU systems and by
fast HBM memory on GPUs. Surprisingly, these parameterized
access pattern studies also show that STREAM bandwidth does
not correlate well with specific mini-apps (Nekbone) that are
cache-dependent.
arXiv:1811.03743v4 [cs.PF] 1 Nov 2019
II. GATHER / SCATTER IN REAL-WORLD APPLICATIONS
To motivate our interest in gather / scatter performance, we
studied several prominent DOE mini-apps from the CORAL
and CORAL-2 procurements [2], [3]. Such software provides a
rich source of information about the computational and memory
behavior requirements of critical scientific workloads in both
governmental as well as academic environments. Many of
these workloads contain important kernels which stress gather
/ scatter performance. Indeed, one aim of Spatter is to leverage
such mini-apps as a source of real-world G/S patterns.
In particular, this work considers mini-apps from CORAL
and CORAL-2, including AMG [4], LULESH [5], and Nek-
bone [6]. We built these mini-apps targeting ARMv8-A with
support for Arm’s Scalable Vector Extension (SVE) [7] at a
vector length of 1024 bits. The resulting executables were
run through an instrumented version of the QEMU functional
simulator [8] to extract traces of all instructions accessing
memory along with their associated virtual addresses. We then
extracted G/S patterns, along with their frequency, from these
instruction and address streams from rank 0 for selected kernels.
The problem sizes are chosen so as to prioritize a realistic
working set with 64 MPI ranks per node with one thread per
rank, while the number of iterations is less emphasized. For
these apps, it is expected that multiple kernel iterations will
have many patterns in common.
Table I shows the gather / scatter characteristics extracted
from several kernels selected from the aforementioned mini-
apps, along with the amount of data motion performed by
these gather / scatter operations. The reported G/S data motion
percentages are conservative, as the current data records all scalar
loads and stores as being 64 bits wide, while a significant
fraction of 32-bit scalar data types is expected.
Examination of the gather / scatter behavior results in the
observation of a small number of common G/S pattern classes:
uniform-stride, where each element of a gather is a fixed
distance from the preceding element; broadcast, where some
elements of a gather share the same index; mostly stride-1, in
which some elements of a gather are a single element away
from the preceding element; and more complex strides, in which
elements of a gather have a complicated pattern containing
many different strides.
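To make these classes concrete, the sketch below labels an index buffer by looking at the differences between consecutive indices. The function `classify_pattern` and its `stride1_fraction` threshold are hypothetical and are not taken from the trace-extraction tooling used in this work; in particular, the cutoff for "mostly stride-1" is an assumption.

```python
def classify_pattern(idx, stride1_fraction=0.5):
    """Label an index buffer using the G/S pattern classes described in the text.

    stride1_fraction is an assumed threshold: the share of consecutive
    differences that must equal 1 for a buffer to count as "mostly stride-1".
    """
    if len(idx) < 2:
        raise ValueError("need at least two indices to compute strides")
    strides = [b - a for a, b in zip(idx, idx[1:])]
    if len(set(idx)) < len(idx):
        return "broadcast"            # some elements share the same index
    if len(set(strides)) == 1:
        return "uniform-stride"       # every element is a fixed distance apart
    if strides.count(1) >= stride1_fraction * len(strides):
        return "mostly stride-1"
    return "complex strides"


print(classify_pattern([0, 8, 16, 24]))
print(classify_pattern([0, 0, 1, 1]))
print(classify_pattern([0, 1, 2, 3, 8, 9, 10, 11]))
print(classify_pattern([0, 6, 7, 30, 41]))
```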
We can make a few high-level remarks about Table I. First,
gathers are more common than scatters. Secondly, gather /
scatter can account for high fractions of total load / store
traffic (last column; up to 67.6%, or just over two-thirds, in
these examples). Thirdly, the appearance of differing categories
of stride types suggests that there are multiple opportunities for
runtime (inspector / executor) and hardware memory systems
to optimize for a variety of gather / scatter use-cases, which
Spatter can then help evaluate.
Fig. 1: An overview of the Spatter benchmark with inputs and
outputs.
III. DESIGN OF THE SPATTER BENCHMARK
We have developed Spatter because existing benchmark
suites like STREAM [1] and STRIDE [9] focus on uniform
stride accesses and are not configurable enough to handle
non-uniform, indirect accesses or irregular patterns. For more
information on related benchmarks, see Section VII. Figure 1
shows a conceptual view of the design of the Spatter benchmark.
The use of the benchmark is described further below.
Kernels: Spatter contains Gather and Scatter kernels for
three backends: OpenMP, CUDA, and Scalar. A high-level
view of the gather kernel is in Figure 2, but the different
programming models require that the implementation details
differ significantly between backends. Both the OpenMP and
CUDA backends expose parameters like work-per-thread
and block-size to allow for performance tuning.
OpenMP: The OpenMP backend is designed to make it
easy for compilers to generate G/S instructions. Each thread
will perform some portion of the iterations shown in Figure 2.
To ensure high performance when gathering, each thread will
gather into a local destination buffer (vice-versa for scattering).
This avoids the effects of false sharing.
CUDA: Whereas in the OpenMP backend, each thread will
be assigned its own set of iterations to perform, in the CUDA
programming model, an entire thread block must work together
to perform an iteration of Figure 2 to ensure high performance.
These backends are similar though, in that each thread block
gathers into thread local memory to allow for high performance.
Scalar: The Scalar backend is based on the OpenMP
backend, and is intended to be used as a baseline to study
the benefits of using CPU vector instructions as opposed to
scalar loads and stores. The major difference between this and
the OpenMP backends is that the Scalar backend includes a
compiler pragma to prevent vectorization, namely #pragma
novec.
A. Benchmark Input
Spatter accepts either a single pattern and run configuration
as input, or a JSON file containing many such patterns and
configurations.
Pattern Specification: In Spatter, a memory access pattern
is described by specifying either gather or scatter, and providing
a short index buffer and a delta. The user should also specify
a number of gathers to do so that the data buffer does not fit
in cache. Spatter includes two built-in, parameterized patterns,
TABLE I: High-Level Characterization of Application G/S Patterns.
Application (Extracted Patterns) Selected Kernels Gathers Scatters G/S MB (%)
AMG (partially stride-1)
hypre_CSRMatrixMatvecOutOfPlace 1,696,875 0 217 (17.8)
LULESH (fixed-stride)
IntegrateStressForElems 828,168 382,656 155 (22.4)
InitStressTermsForElems 1,121,844 1,153,827 291 (67.6)
Nekbone (fixed-stride)
ax_e 2,948,940 0 377 (33.3)
PENNANT (fixed-stride, partially stride-0, complex strides)
Hydro::doCycle 728,814 0 93 (13.9)
Mesh::calcSurfVecs 324,064 0 41 (39.5)
QCS::setForce 891,066 0 114 (45.5)
QCS::setQCnForce 1,214,318 323,800 197 (64.5)
TABLE II: Details for Selected Applications and Kernels Used for G/S Pattern Extraction.
Columns: Application – Version; Problem Size / Changes; Kernel Notes.
AMG – github.com/LLNL/AMG commit 09fe8a7
  Problem Size / Changes: Arguments -problem 1 -n 36 36 36 -P 4 4 4, also mg_max_iter in amg.c set to 5 to limit iterations.
  Kernel Notes: Entirety of each of the functions listed in Table I.
LULESH – 2.0.3
  Problem Size / Changes: Arguments -i 2 -s 40, also modifications to vectorize the outer loop of the first loop-nest in IntegrateStressForElems.
  Kernel Notes: The first loop-nest in IntegrateStressForElems. Arrays [xyz]_local[8] as well as B[3][8] give stride-8 and stride-24. Also, the entirety of the InitStressTermsForElems function.
Nekbone – 2.3.5
  Problem Size / Changes: Set ldim = 3, ifbrick = true, iel0 = 32, ielN = 32, nx0 = 16, nxN = 16, stride = 1, internal np and nelt distribution. Also, niter in driver.f set to 30 to limit CG iterations.
  Kernel Notes: First loop in ax (essentially a wrapped call to ax_e) contains the observed stride-6.
PENNANT – 0.9
  Problem Size / Changes: Config file sedovflat.pnt with meshparams 1920 2160 1.0 1.125 and cstop 5.
  Kernel Notes: Entirety of each of the functions listed in Table I.
uniform stride and mostly stride-1, to simplify input. For
example, one could run Spatter with
./spatter -k Gather -p UNIFORM:8:1 -d 8 -l $((2**24))
to run 2^24 (-l) gathers (-k), each one 8 doubles beyond the
last (-d), and each using an index buffer of length 8 and
uniform stride-1 (-p). This will produce a STREAM Copy-
like number, but it will only be read bandwidth, as a gather
reads data from memory to a register. Spatter includes further
options for choosing backends and devices and performance
tuning that are described in its README.
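The roles of the index buffer, the delta, and the gather count can be sketched in a few lines of Python. This is only a serial model of the access pattern, not Spatter's implementation; the helper names are ours, and we assume that a uniform pattern of length n and stride s is the buffer [0, s, 2s, ...].

```python
def uniform_pattern(length, stride):
    """Assumed form of a uniform index buffer: [0, stride, 2*stride, ...]."""
    return [i * stride for i in range(length)]


def run_gather(src, pattern, delta, count):
    """Perform `count` gathers; gather i reads src[delta * i + pattern[j]] for each j."""
    dst = [None] * len(pattern)
    for i in range(count):
        base = delta * i
        for j, offset in enumerate(pattern):
            dst[j] = src[base + offset]
    return dst  # contents of the destination buffer after the last gather


def required_source_length(pattern, delta, count):
    """Smallest source buffer (in elements) that the whole run can index safely."""
    return delta * (count - 1) + max(pattern) + 1


# Stride-2 index buffer of length 4 with a delta of 1, run for two gathers:
# the first gather reads elements 0, 2, 4, 6 and the second reads 1, 3, 5, 7.
print(run_gather(list(range(16)), uniform_pattern(4, 2), delta=1, count=2))

# The command-line example in the text (length 8, stride 1, delta 8, 2**24 gathers)
# touches this many doubles:
print(required_source_length(uniform_pattern(8, 1), delta=8, count=2**24))
```

Because the delta in the command-line example equals the length of the stride-1 index buffer, consecutive gathers touch adjacent, non-overlapping blocks of the source buffer, which is what makes the result comparable to a streaming read.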
JSON Specification: When running tests, it is common to
run many different patterns. To support this, Spatter accepts a
JSON file as input that contains many different configurations.
Spatter will parse this file and allocate memory once for all
tests, greatly speeding up test sets with many different patterns,
and easing data management.
B. Benchmark Output
For each pattern specified, Spatter will report the minimum
time taken over 10 runs to perform the given number of
gathers or scatters. It will also translate this into a bandwidth,
with the formula Bandwidth = (sizeof(double) *
len(index) * n) / time, where n is the number of
gathers or scatters. This is the amount of data that is moved to
or from memory, and does not count the bandwidth used by
the index buffer, as it is assumed to be small and resident
in cache.
Fig. 2: The first two iterations of the gather kernel with a uniform
stride-2 index buffer of length 4, and a delta of 1.
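The bandwidth formula can be written out directly. The sketch below is not Spatter's source code; the function name and the timing values are hypothetical, and it only reproduces the arithmetic described above (minimum time over the runs, index buffer traffic ignored).

```python
SIZEOF_DOUBLE = 8  # bytes


def reported_bandwidth(index_len, n, run_times_s):
    """Bytes per second: sizeof(double) * len(index) * n / (minimum run time)."""
    bytes_moved = SIZEOF_DOUBLE * index_len * n
    return bytes_moved / min(run_times_s)


# Hypothetical timings (seconds) for 2**24 gathers with an index buffer of length 8.
times = [0.012, 0.010, 0.011]
bw = reported_bandwidth(index_len=8, n=2**24, run_times_s=times)
print(f"{bw / 1e9:.1f} GB/s")
```

With an index buffer of length 8 and 2^24 gathers, the data moved is 8 * 8 * 2^24 bytes, or exactly 1 GiB.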
Optionally, PAPI [10] can be used to measure performance
counters, as shown in Section VI.
For JSON inputs, Spatter will also report stats about all
of the runs, such as the maximum and minimum bandwidths
observed across configurations, as well as the harmonic mean
of the bandwidths.
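A short sketch of these summary statistics follows, using Python's standard `statistics.harmonic_mean`. The `summarize` helper and the example bandwidths are hypothetical. The harmonic mean is a natural summary for rates: if every pattern moves the same amount of data, it equals the total data moved divided by the total time.

```python
from statistics import harmonic_mean


def summarize(bandwidths):
    """Minimum, maximum, and harmonic mean of a list of bandwidths."""
    return {
        "min": min(bandwidths),
        "max": max(bandwidths),
        "hmean": harmonic_mean(bandwidths),
    }


# Two hypothetical patterns that each move 1 GB, at 40 GB/s and 60 GB/s.
stats = summarize([40.0, 60.0])
total_time_s = 1 / 40.0 + 1 / 60.0   # seconds needed to move 2 GB in total
print(stats["hmean"], 2 / total_time_s)  # both expressions give the same rate
```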
TABLE III: Experimental Parameters and Systems (OMP Denotes OpenMP, and OCL Denotes OpenCL).
System description             Abbreviation  System Type                         STREAM (MB/s)  Threads, Backends
Knight’s Landing (cache mode)  KNL           Intel Xeon Phi                      249,313        272 threads, OMP
Broadwell                      BDW           32-core Intel CPU (E5-2695 v4)      43,885         16 threads, OMP
Skylake                        SKX           32-core Intel CPU (Platinum 8160)   97,163         16 threads, OMP
Cascade Lake                   CSX           24-core Intel CPU (Platinum 8260L)  66,661         12 threads, OMP
ThunderX2                      TX2           28-core ARM CPU                     120,000        112 threads, OMP
Kepler K40c                    K40c          NVIDIA GPU                          193,855        CUDA
Titan XP                       Titan XP      NVIDIA GPU                          443,533        CUDA
Pascal P100                    P100          NVIDIA GPU                          541,835        CUDA
Broadwell (ICC)                BDW2          12-core CPU (E5-2650)               85,750         24 threads, OMP
Skylake (ICC)                  SKX2          6-core CPU (Gold 6128)              66,661         12 threads, OMP
IV. EXPERIMENTAL SETUP
Table III describes the different configurations and backends
tested for our initial evaluation using the Spatter benchmark
suite. We pick a diverse set of systems based on what is
currently available in our lab and our collaborators’ research labs,
including a Knight’s Landing system, and a prototype system
with ARMv8 ThunderX2 chips designed by Cavium. We also
include a server-grade and desktop-grade Intel CPU system
and several generations of NVIDIA GPUs. Unfortunately
we currently do not have access to a recent AMD GPU,
CPU, or APU system for testing but hope to include this
in future experiments. Note that the two Intel boxes listed at
the bottom (BDW2 and SKX2) are machines used in Section VI for
experiments with the Intel ICC toolchain and PAPI rather than
with Cray compilers (9.0) used on the other listed platforms.
OpenMP: To control for NUMA effects, CPU systems
are tested using all the cores on one socket or one NUMA
region if the system has more than one CPU socket. Some
systems like the KNL on Cori have an unusual configuration
where the entire chip is listed as 1 NUMA region with
272 threads. For all the OpenMP tests, Spatter is bound
to one socket and run using one thread per core on that
socket. The following settings are used for OpenMP tests:
1) OMP_NUM_THREADS = <num_threads_single_socket>
2) OMP_PROC_BIND = master
3) OMP_PLACES = sockets
4) KMP_AFFINITY = compact (only for KNL)
CUDA: When testing on GPUs, the block size for Spatter
is set at 1024 and an index buffer of length 256 is used. These
settings allow Spatter to reach bandwidths within 20% of the
vendor reported theoretical peak for both gather and scatter
kernels. These bandwidths are slightly different than what is
typically reported, as gather is designed to only perform reads,
and scatter should only perform writes.
PAPI: The PAPI tests for the Spatter proxy investigation
in Section VI are run on a Broadwell 12-core Intel CPU and
Skylake 6-core Intel CPU system using Intel 19.0 compiler
tools (version 19.0.3.199) and the same JSON inputs that are
investigated in I. PAPI 5.7.0 is used for all testing, and up to
three native counters are measured for the selected inputs. One
PAPI counter value for each configuration is reported for the
fastest (in terms of time) iteration, so a run with 10 iterations
would report the PAPI counter values for the fastest iteration.
Experimental Configurations: Runs of Spatter use the
maximum bandwidth out of 10 runs for the platform compar-
ison uniform stride and application pattern tests. STREAM
results are generated using 2^25 elements with either STREAM
for CPU or BabelStream for GPU, while all Spatter uniform
stride tests read or write at least 8GB of data on the GPU and
16GB on the CPU. The difference between CPU and GPU
data sizes results from most GPUs having less than 16 GB of
on-board memory. The application-specific pattern tests read
or write at least 2GB.
V. RESULTS
Spatter is designed to be flexible, and allow the user to run
many different memory access patterns and expose many knobs
used for tuning. We have used Spatter to investigate several
questions regarding CPU and GPU memory architecture. In
this section, we use Spatter to investigate A) uniform stride
access on CPUs, B) uniform stride access on GPUs, C) the
effectiveness of gather/scatter over scalar load/store, and D)
the performance of real-world gather/scatter patterns on CPU.
Section VI investigates a different set of configurations focused
on comparing Spatter as a proxy app on two CPU systems
with PAPI and Intel compilers.
A. CPU Uniform Stride
We start our investigation by performing a basic test: running
Spatter with the uniform stride pattern, and increasing the
stride by 2x (so that all strides are powers of 2) until
performance flattens. At stride 1, this is analogous to the STREAM
benchmark (see footnote 1), except for the fact that Spatter will only generate
read instructions (gathers) for the gather kernel and write
instructions (scatters) for the scatter kernel, meaning the bandwidths
should be slightly different. Fig. 3 shows the results of our
uniform stride tests on CPUs. Fig. 3a has the results for
the Gather kernel and Fig. 3b has the Scatter kernel results.
We would expect that as stride increases by a factor of 2,
bandwidth should drop by half; the entire cache line is read
in but only every other element is accessed. This should
continue until about stride 8, as we are then using one double
from every cache line. This is what we see on Naples, but
performance continues to drop on TX2, Skylake, and Broadwell.
Interestingly, Broadwell performance increases at stride-64,
even out-performing Skylake. We can further use Spatter to
investigate these two points: 1) Why does Broadwell outperform
Skylake at high strides, and 2) why does TX2 performance
drop so dramatically past 1/16?
Footnote 1: On CPU, we use an index buffer of length 8 and fill it with indices [1*stride,
2*stride, ...]. We set the delta to be 8*stride, so that there is no data reuse
and indeed stride-1 matches the STREAM pattern.
Fig. 3: CPU Gather (a) and Scatter (b) Bandwidth Comparison. With
CCE 9.0, we increase the stride of memory access and show how
performance drops as stride increases from 1 to 128 (doubles) on
Skylake, Broadwell, Naples, and Thunder X2 systems. Cascade Lake
is omitted as it overlaps closely with Skylake. Log bandwidth is
displayed to make differences apparent.
[Plots omitted: both panels have x-axis "Stride (Doubles)", 2^0 to 2^7; y-axis "Log(Bandwidth)", 10^3 to 10^5; series BDW, KNL, Naples, SKX, TX2.]
1) Disabling Prefetching. To get an idea of what is causing
Broadwell to outperform Skylake, we turn prefetching off with
Model Specific Registers (MSRs) and re-run the same uniform
stride patterns. Figs. 4a and 4b show the results from this test.
For Broadwell, performance doesn’t show the same increase
for stride-64 with prefetching off and it instead bottoms out
after stride-8. We conclude that one of Broadwell’s prefetchers
pulls in two cache lines at a time for small strides but switches
to fetching only a single cache line at stride-64 (512 bytes).
We can understand the performance discrepancy between
Broadwell and Skylake by looking at Fig. 4b. Performance
drops to 1/16th of peak, as Skylake always brings in two cache
lines, no matter the stride. We did not get the opportunity to run
on the Thunder X2 without prefetching since it does not have
a similar MSR equivalent, but we suspect similar effects are
at play: one of the prefetchers likely always brings in the next
line, although that only helps to explain performance dropping
through stride-16, not through stride-64.
Fig. 4: Broadwell Gather (a) and Skylake Gather (b). We show the
performance of gather for various strides, with prefetching on and off.
On the right, normalized bandwidth is shown to display the regularity
of the decrease in bandwidth.
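The expected halving of bandwidth up to stride-8, and the 1/16th-of-peak floor attributed to a prefetcher that always brings in two cache lines, can both be captured in a toy model. The sketch below is our own simplification, not a measurement tool: it assumes 64-byte cache lines, 8-byte doubles, power-of-two strides, and a prefetcher described by a single hypothetical parameter `lines_per_fetch`. It does not model behavior such as Broadwell's change at stride-64.

```python
LINE_BYTES = 64    # assumed cache line size
DOUBLE_BYTES = 8   # sizeof(double)


def fraction_of_peak(stride, lines_per_fetch=1):
    """Fraction of fetched bytes that a uniform-stride gather actually uses.

    stride is in doubles and is assumed to be a power of two.
    lines_per_fetch models a prefetcher that pulls in that many consecutive
    cache lines whenever one line is touched.
    """
    doubles_per_line = LINE_BYTES // DOUBLE_BYTES        # 8 doubles per line
    used_per_line = max(1, doubles_per_line // stride)   # doubles used in each touched line
    line_spacing = max(1, stride // doubles_per_line)    # distance between touched lines
    # Extra fetched lines are wasted only when the pattern skips over them.
    fetched = min(lines_per_fetch, line_spacing)
    return used_per_line / (doubles_per_line * fetched)


for s in (1, 2, 4, 8, 16, 64):
    print(s, fraction_of_peak(s), fraction_of_peak(s, lines_per_fetch=2))
```

With `lines_per_fetch=1` the model bottoms out at 1/8 from stride-8 onward, and with `lines_per_fetch=2` it bottoms out at 1/16 from stride-16 onward, matching the two regimes described in the text.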
B. GPU Uniform Stride
As the memory architecture of CPUs and GPUs is quite
different, it is worthwhile to see how GPUs handle these
uniform stride patterns. Fig. 5a shows how a K40c, a Titan Xp, and
a P100 perform on the same tests (see footnote 2). As with the CPUs, we see
bandwidth drop by half for stride-2 and by another half for
stride-4. However, for the P100 and the Titan Xp, from stride-4
to stride-8, we see that bandwidth stays the same (illustrated
by the dotted lines). This is due to the fact that GPUs are able
to coalesce some loads. The older K40 hardware shows less
ability to do so. In the scatter kernel plot, Fig. 5b, the effect of
coalescing is less pronounced, but still visible from stride 4
to stride 8. Instead of plateauing at 1/4th of peak bandwidth,
however, it plateaus at 1/8th. Regardless of the effect being
less pronounced in scatter vs. gather, we still see the benefit of
a memory architecture that is able to coalesce accesses, as
the bandwidth curves of these GPU platforms take
longer to fall off than their CPU counterparts.
Footnote 2: To get high performance on GPUs, the threads within a block all work
together to read a pattern buffer into shared memory. This buffer must be much
longer than the CPU index buffer (256 indices vs 8) so that each thread has
enough work to do.
Fig. 5: GPU Gather (a) and Scatter (b) Uniform Stride Bandwidth
comparison
C. SIMD vs. Scalar Backend Characterization
Spatter can also be used to test the effectiveness of different
hardware implementations of single instruction, multiple data
(SIMD) instruction set architectures (ISAs). Vector versions of
indexed load and store instructions help compilers to vectorize
loops and can also help avoid unnecessary data motion between
scalar and vector registers that might otherwise be required. We
can use Spatter to investigate whether these vector instructions
are indeed superior to scalar load instructions and whether
compiler writers should prioritize vectorized gather / scatter
optimizations.
To demonstrate the effectiveness of SIMD load / store
instructions, we run Spatter using the gather kernel on multiple
platforms with the scalar backend as a baseline. This scalar
baseline is then compared to the OpenMP backend as vectorized
by the Cray compiler (CCE 9.0) with the resulting percent
improvement from vectorization reported in Figure 6 for strides
1-128, as before. The Broadwell CPU performs the worst of all
the tested CPUs, showing worse performance with vectorized
code in many cases for both Gather and Scatter. Thus, for a
memory-heavy kernel on Broadwell, it would likely be better to use scalar
instructions than G/S instructions, although this difference may
be mitigated as G/S instructions remove the need to move data
between regular and vector registers.
Fig. 6: Improvement of SIMD Gather Kernel (a) and Scatter Kernel
(b) Compared to Serial Scalar Backend.
[Plots omitted: both panels have x-axis "Stride (Doubles)", 2^0 to 2^7; y-axis "Percent Improvement"; series BDW, KNL, Naples, SKX, TX2.]
On the other hand, Skylake, KNL, and Naples have better
gather performance in the vectorized case. The use of gather
instructions on these platforms is clearly justified. Of these
three, however, Naples is the only one to not improve in
the scatter case as well. This is due to the lack of scatter
instructions on Naples. TX2 has no G/S support at all, so it
stays close to 0% difference (save for a single outlier in the
gather chart). Interestingly, our three processors with useful
G/S instructions all gather best in different regions, with
KNL best at small strides, Naples for medium strides, and
Skylake best at large strides. While we are not able to explain
the reason for this performance artifact, we have demonstrated
the benefit of G/S instructions over their scalar counterparts.
At least for KNL, anecdotal evidence has suggested that using
vectorized instructions at lower strides reduces overall unique
instruction count and overall request pressure on the memory
system.
TABLE IV: Spatter Results for Mini-apps
Platform  AMG (n=36)     Nekbone (n=6)  Lulesh         PENNANT        STREAM
          GB/s (H-Mean)  GB/s (H-Mean)  GB/s (H-Mean)  GB/s (H-Mean)  GB/s
BDW       123            121            20             6              43
SKX       328            309            12             35             96
CLX       234            215            9              28             94
Naples    140            323            3              11             97
TX2       270            247            232            28             241
KNL       201            190            19             4              249
R-value   0.26           0.03           0.50           -.04
K40c      108            99             88             14             193
TitanXP   496            320            175            21             443
P100      703            673            165            19             541
R-value   0.66           0.62           0.62           0.57
D. Real-World G/S Patterns on CPU
While the three previous sections have focused on uniform
stride patterns, Spatter is also able to run much more complex
patterns. To demonstrate Spatter’s ability to emulate patterns
found in real applications, we take the top patterns from several
DoE mini-apps (as described in Section II) and run them in
Spatter. Table IV shows the harmonic mean of the bandwidth
of the patterns taken from the respective application traces.
Pearson’s correlation coefficient, R, is calculated across all
CPU and GPU patterns for a given mini-app (X) as follows:
$$R = \frac{\mathrm{cov}(X, \mathrm{STREAM})}{\mathrm{std}(X)\,\mathrm{std}(\mathrm{STREAM})} \tag{1}$$
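For readers who want to recompute the coefficient, the following standard-library sketch evaluates the same ratio of covariance to the product of standard deviations. The function name `pearson_r` is ours. The sample inputs are the rounded CPU values for Nekbone and STREAM from the mini-app results table, so the printed value may differ from the reported R-value, which was presumably computed from unrounded measurements.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: cov(x, y) / (std(x) * std(y))."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # The 1/n (or 1/(n-1)) normalization cancels between numerator and denominator.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Rounded CPU rows (BDW, SKX, CLX, Naples, TX2, KNL), in GB/s.
nekbone = [121, 309, 215, 323, 247, 190]
stream = [43, 96, 94, 97, 241, 249]
print(round(pearson_r(nekbone, stream), 2))
```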
What we note from Table IV is the following: Lulesh shows
poor performance on most CPU platforms because it includes
a delta-0 scatter operation that we believe triggers cache
invalidations for multi-core writebacks. AMG and Nekbone
show higher performance than STREAM in general due to the
effects of caching for the included patterns while GPU systems
without HBM have lower performance.
More interestingly, we see that the CPU runs of the Nekbone
and PENNANT patterns show poor correlation (close to 0)
with STREAM. In the case of AMG, the patterns perform
much better than STREAM, whereas in PENNANT, the
patterns perform much worse. This means that Spatter is
indeed capturing different behavior than STREAM, and that
the patterns Spatter generates are not well approximated
by STREAM on CPUs. For GPU systems, however, the R
coefficient shows that STREAM is well correlated (close to 1)
with the Spatter results.
VI. USING PROXY G/S TO CHARACTERIZE MINI-APPS
We next look at an additional set of experiments to investigate
whether the application patterns we have generated can be used
to help predict the performance of hot kernels with G/S better
than other metrics like STREAM or LINPACK.
Experimental Setup: This set of tests uses Spatter’s optional
PAPI support with PAPI 5.7.0, and it compares two of the
previously discussed mini-app kernels, AMG and NEKBONE,
with native and Spatter execution on the Broadwell and Skylake
test platforms, BDW2 and SKX2 (Table III). All mini-apps,
STREAM, LINPACK, and Spatter are compiled using the
locally installed Intel 19.0 tools, as opposed to the Cray
compiler tools used for the earlier experiments. The OpenMP
backend and one thread are used for all tests, and Spatter,
STREAM, and LINPACK results are compared with single-
threaded results from the native hot kernels. The Spatter patterns
that are used for this test attempt to move the same amount of
data as the real mini-app’s hot kernels and use the same exact
ratio of specific gather / scatter patterns. In a sense, Spatter’s
pattern inputs attempt to provide a spatial-locality proxy for
the actual memory trace from the native mini-app.
Table V shows results for running the AMG and Nekbone
hot kernels natively and as proxy patterns in
Spatter. Timing is reported in milliseconds for the Native and
Spatter execution. Interestingly, the Skylake core is 40% slower
than the Broadwell core at running STREAM but 132% faster
at running LINPACK. The Skylake system is 27.7% faster
than the Broadwell system at running the native AMG kernel
(labeled Percent Improvement) and 26.1% faster at running
the Nekbone kernel. The Spatter proxy trace run predicts that
AMG would be 35.65% faster on the Skylake system and
that Nekbone would be 35.8% faster on Skylake. These results
indicate that while STREAM and LINPACK can provide rough
indicators of memory system performance and FLOPS, they
are not necessarily good figures of merit (FoM) for kernels
that include large numbers of gather / scatter operations.
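The percent-improvement figures quoted above can be recomputed from the Broadwell and Skylake columns of Table V. The two formulas below are our inference of how the column is defined (one for timings, where lower is better, and one for rates, where higher is better); they reproduce the STREAM, LINPACK, and native-kernel entries to the reported precision.

```python
def improvement_from_times(bdw_ms, skx_ms):
    """Percent improvement of Skylake over Broadwell for a timing (lower is better)."""
    return (bdw_ms - skx_ms) / bdw_ms * 100.0


def improvement_from_rates(bdw_rate, skx_rate):
    """Percent improvement of Skylake over Broadwell for a rate (higher is better)."""
    return (skx_rate - bdw_rate) / bdw_rate * 100.0


print(round(improvement_from_rates(17.6, 10.6), 1))    # STREAM
print(round(improvement_from_rates(40.8, 94.5), 1))    # LINPACK
print(round(improvement_from_times(149.3, 108.0), 1))  # AMG, native
print(round(improvement_from_times(440.0, 325.0), 1))  # Nekbone, native
```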
In addition to timing results for Spatter and the mini-
app kernels, we look at PAPI counters for both native and
Spatter execution, with a focus on the following events:
PAPI_L1_TCM, PAPI_L2_TCM, PAPI_L3_TCM. Our experimentation
focuses on whether Spatter can represent the
ratio of cache misses for a specific test platform, where 1
is the ideal result (i.e., no difference in cache misses between
Spatter and native execution). For both AMG and Nekbone,
the results for Spatter cache miss ratios vary widely across
both platforms with over-predictions that range from one order
of magnitude (e.g., L2 miss ratio for Nekbone at 8.10x and
6.4x) to multiple orders of magnitude (e.g., L1 miss ratios for
SKX for both mini-apps at 1115.1x and 828.5x).
TABLE V: Proxy Trace Comparison for the AMG and Nekbone Mini-apps
Application (FoM)  Size    Test Case  Broadwell  Skylake  Percent Improvement  L1 Miss (BDW / SKX)  L2 Miss (BDW / SKX)  L3 Miss (BDW / SKX)
STREAM (GB/s)      -       -          17.6       10.6     -39.8                -                    -                    -
LINPACK (FLOPS)    -       -          40.8       94.5     131.6                -                    -                    -
AMG (ms)           n = 36  Native     149.3      108.0    27.7                 -                    -                    -
AMG (ms)           n = 36  Spatter    23.6       16.6     35.6                 39.1 / 1115.1        67.3 / 35.7          824.62 / 11.6
Nekbone (ms)       n = 6   Native     440        325      26.1                 -                    -                    -
Nekbone (ms)       n = 6   Spatter    15.4       9.9      35.8                 14.4 / 828.5         8.10 / 6.40          5.9e−4 / 4.6e−3
The results from this set of experiments illustrate two
important points. First, Spatter can potentially predict
some relative timing improvements across systems that may
be hard to capture with just STREAM. Secondly, the widely
varied PAPI cache miss results with Spatter indicate that more
complex, higher-order patterns-of-patterns may be needed to
accurately represent spatial locality when representing a pattern
with Spatter. That is, if an application kernel performs a gather
/ scatter pattern and then begins the next iteration of the pattern
within the data set already established in the caches, it may have
added locality that is not currently represented by repeating
the first-order patterns (described in Section II) across new
data each time. The resulting added locality can contribute
to a significant overestimation of cache misses. Additionally,
Spatter does not yet represent temporal locality, which is a
key component of cache misses. Both of these two sources of
error are components of improving Spatter for use as a proxy
and are discussed in Section VIII.
VII. RELATED WORK
Our primary aim for Spatter is to measure at a low level the
effects of sparsity and indirect accesses on effective bandwidth
for a particular application or algorithm. While a number of
bandwidth-related benchmarks exist, there are no current suites
that explicitly support granular examinations of sparse memory
accesses. The closest analogue to our work is APEX-Map [11],
which allows for varying sparsity to control the amount of
spatial locality in the tested data set. However, APEX-Map
has not been updated for heterogeneous devices and does not
allow for custom G/S patterns.
In terms of peak effective (i.e., real-world achievable)
bandwidth, STREAM [12] provides the most widely used
measurement of sustained local memory bandwidth using a regular,
linear data access pattern. Similarly, BabelStream [13] provides
a STREAM interface for heterogeneous devices using backends
like OpenMP, CUDA, OpenCL, SYCL, Kokkos, and RAJA.
Intel’s Parallel Research Kernels [14] also supports an nstream
benchmark that is used for some platforms here. The CORAL 2
benchmarks also include a STREAM variant called STRIDE [9],
which includes eight different memory-intensive linear algebra
kernels written in C and Fortran. STRIDE includes dot product
and triad variations but still utilizes uniform stride inputs and
outputs. None of these suites support any access pattern aside
from uniform stride, which underlines the need for a benchmark
like Spatter that includes configurable and indirect access
patterns.
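To make the term "uniform stride" concrete, the following minimal C sketch shows a triad kernel with a fixed stride. It is an illustration only and is not taken from STREAM or STRIDE; the function name strided_triad and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Triad a = b + q * c over n elements spaced `stride` apart.
   Every address is a linear function of the loop counter i, so the
   kernel needs no index buffer and cannot express irregular accesses. */
void strided_triad(double *a, const double *b, const double *c,
                   double q, size_t n, size_t stride) {
    assert(stride > 0);
    for (size_t i = 0; i < n; i++) {
        a[i * stride] = b[i * stride] + q * c[i * stride];
    }
}
```

An indexed kernel replaces the computed offset i * stride with a lookup into an index buffer, which is the case that uniform-stride suites do not cover.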
Whereas STREAM focuses on a single access pattern,
pointer chasing benchmarks [15] and RandomAccess [16] use
randomness in their patterns. Pointer chasing benchmarks
measure memory latency but are limited in scope to that
purpose, and RandomAccess is only able to produce
random streams. Spatter cannot model dependencies like pointer
chasing, but it contains kernels for modeling random access.
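The difference between a dependent access chain and an indexed access can be illustrated with a short C sketch. This is a simplified illustration under our own naming (chase and gather are hypothetical helpers), not code from Spatter or from the cited benchmarks.

```c
#include <assert.h>
#include <stddef.h>

/* Pointer chasing: the address of each load is the value returned by the
   previous load, so the loads are serialized and the loop is bound by
   memory latency. */
size_t chase(const size_t *next, size_t start, size_t steps) {
    assert(next != NULL);
    size_t p = start;
    for (size_t s = 0; s < steps; s++) {
        p = next[p];
    }
    return p;
}

/* Indexed gather: all addresses are known from idx[] before any element
   of src is loaded, so the loads are independent of one another and can
   be overlapped or vectorized. */
void gather(double *dst, const double *src, const size_t *idx, size_t n) {
    assert(dst != NULL && src != NULL && idx != NULL);
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[idx[i]];
    }
}
```

Filling idx with random values turns the second loop into a random-access gather, whereas no choice of idx reproduces the load-to-load dependency of the first loop.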
A. Heterogeneous Architectural Benchmarking
Memory access patterns have been studied extensively on
heterogeneous and distributed memory machines, where data
movement has been a concern for a long time. Benchmarks
such as SHOC [17], Parboil [18], and Rodinia [19] provide
varying levels of memory access patterns that are critical
to HPC applications. For example, SHOC contains “Level
0” DeviceMemory and BusSpeedDownload benchmarks that
can be used to characterize GPUs and some CPU-based
devices. Likewise, other recent Department of Energy work has
investigated vectorization support with hardware and compiler
suites [20] for next-generation applications for the SIERRA
supercomputer. The design of Spatter is intended to create
a new benchmark suite with a more focused set of access
patterns to supplement these existing benchmark suites and
studies and to provide a simpler mechanism for comparing
scatter and gather operations across programming models and
architectures.
Other work focuses on optimizing memory access patterns
for tough-to-program heterogeneous devices like GPUs. Recent
work by Lai et al. [21] evaluates the effects of TLB caching
on GPUs, develops an analytical model to predict the caching
characteristics of gather / scatter, and then develops a multi-pass
technique to improve the performance of G/S on modern GPU
devices. Dymaxion [22] takes an API approach to transforming
data layouts and data structures and looks at scatter and
gather as part of a sparse matrix-vector multiplication kernel
experiment. Jang et al. [23] characterize random and complex
memory access patterns in loop bodies and attempt to resolve them
into simpler and regular patterns that can be easily vectorized
with GPU programming languages. Finally, CuMAPz [24]
provides a tool to evaluate different optimization techniques
for CUDA programs with a specific focus on access patterns
for shared and global memory.
B. Extensions to Other Architectures
One additional motivation for this work is to better implement
sparse access patterns on nontraditional accelerators
like FPGAs and the Emu Chick. For FPGAs, the Spector
FPGA Suite [25] provides several features that have influenced
the design of our benchmark suite including user-defined
parameters for block size, work item size, and delta settings.
Spector uses OpenCL-based High-Level Synthesis and
compiles a number of different FPGA kernels with various
parameters and then attempts to pick the best configuration to
execute on a specific FPGA device. While this process can be
time-consuming for FPGAs due to routing heuristics, it does
provide some motivation for the design of Spatter. As shown
in Section III, we design scripts that run multiple tests and
then pick the best result for a given work item size, block size,
and vector length to plot as the “best” result for a particular
gather / scatter operation.
Finally, there is also work in computer architecture that
explores the area of adding more capabilities to vector units.
SuperStrider [26] and Arm’s Scalable Vector Extension [7]
both aim to implement gather / scatter operations in hardware.
Similarly, the Emu system [27] focuses on improving random
memory accesses by moving lightweight threads to the data in
remote DRAM. Spatter complements these hardware designs
and associated benchmarking by allowing users to test how
their code can benefit from dedicated data rearrangement units
or data migration strategies. These projects primarily focus on
architectural simulation and emulation, while we are looking at
approaches to create effective sparse kernels that can be tested
on FPGA prototypes or with these new architectures.
VIII. FUTURE WORK
We envision that the Spatter benchmark will be a tool that
can be used to examine any memory performance artifact that
exists in sparse codes. The current model we use, a single
index buffer and delta for each pattern, is descriptive of a
wide range of patterns that we have seen in DoE mini-apps as
well as related benchmarks like STREAM and RandomAccess.
However, certain aspects of the memory hierarchy cannot be
properly examined by the current version of Spatter, especially
those relating to temporal locality.
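The model of a single index buffer plus a delta can be sketched in C as follows. This is a simplified illustration of the idea and not Spatter's actual kernel code; the function names, the dense layout of the non-indexed side, and the parameter names are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Gather: iteration i reads the elements named by idx[], shifted by
   i * delta, and packs them densely into dst. The caller must ensure
   that src holds at least max(idx) + (count - 1) * delta + 1 elements
   and that dst holds idx_len * count elements. */
void gather_pattern(double *dst, const double *src,
                    const size_t *idx, size_t idx_len,
                    size_t delta, size_t count) {
    assert(dst != NULL && src != NULL && idx != NULL);
    for (size_t i = 0; i < count; i++) {
        for (size_t j = 0; j < idx_len; j++) {
            dst[i * idx_len + j] = src[idx[j] + i * delta];
        }
    }
}

/* Scatter: the mirror image, writing densely packed input to the
   indexed locations. */
void scatter_pattern(double *dst, const double *src,
                     const size_t *idx, size_t idx_len,
                     size_t delta, size_t count) {
    assert(dst != NULL && src != NULL && idx != NULL);
    for (size_t i = 0; i < count; i++) {
        for (size_t j = 0; j < idx_len; j++) {
            dst[idx[j] + i * delta] = src[i * idx_len + j];
        }
    }
}
```

With idx = {0, 2, 4, 6} and delta = 8, two iterations of gather_pattern read elements 0, 2, 4, 6 and then 8, 10, 12, 14, which is a uniform stride-2 pattern. Nothing in this model says when an earlier address is revisited, which is why temporal locality falls outside it.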
To increase Spatter’s ability to model memory access patterns,
we plan to expand the benchmark suite with the following
features: 1) model temporal locality for accesses using time
delta patterns to better represent cacheable access patterns,
2) investigate mathematical and AI techniques for modeling
more complex access patterns than can be represented with
combinations of stride and delta parameters, and 3) develop new
open-source techniques for extracting sparse memory access
patterns from applications in a timely fashion. Other features
that we are investigating for inclusion into Spatter are kernels
written with intrinsics as well as new backends for Kokkos,
SYCL, and novel architectures like FPGAs or the Emu Chick.
Our goal is also to make Spatter as easy to use as possible,
and useful for a wide audience. To aid in this effort, we plan
to make the following upgrades to the codebase: 1) support for
OpenMP 4.5 and SYCL backends, 2) automation of parameter
selection, 3) optimized CPU backends that make use of
prefetching and streaming accesses, and 4) release of as much
of our tracing and trace analysis infrastructure as possible
alongside our codebase, which is open-source.
IX. CONCLUSION
This work has demonstrated the growing importance of
indexed accesses in modern HPC applications and has specifically
examined the use of gather and scatter operations in modern
applications like the DoE mini-apps investigated in Section II. Spatter
is introduced as a configurable benchmark suite that can be
used to better evaluate these types of indirect memory accesses
by using pattern-based inputs to generate a wide variety of
patterns including uniform, fixed-stride, broadcast, and more
complex strides. The presented Spatter experiments demonstrate
how this tool could be used by architects to evaluate new
prefetching hardware or instructions for gather and scatter, how
compiler writers can inspect the performance implications of
their generated code, and potentially how application developers
could profile representative portions of their application that
rely on gather / scatter. We anticipate that future research into
improving this tool will focus on better supporting application
profiling and generating better pattern inputs so that Spatter
can be used to help generate more realistic access behaviors
for larger kernels and application traces.
REFERENCES
[1] J. McCalpin, “Notes on “non-temporal” (aka “streaming”) stores.”
http://sites.utexas.edu/jdm4372/tag/cache/, 2018.
[2] “CORAL RFP b604142.” https://asc.llnl.gov/CORAL/, February 2014.
Accessed: 2019-04-02.
[3] “CORAL-2 acquisition, RFP no. 6400015092.”
https://procurement.ornl.gov/rfp/CORAL2/, May 2018. Accessed: 2019-04-02.
[4] U. Yang, R. Falgout, and J. Park, “Algebraic multigrid benchmark, version
00,” 8 2017.
[5] I. Karlin, J. Keasler, and J. Neely, “LULESH 2.0 updates and changes,”
tech. rep., Lawrence Livermore National Lab.(LLNL), Livermore, CA
(United States), 2013.
[6] P. Fischer and K. Heisey, “Nekbone: Thermal hydraulics mini-application,”
Nekbone Release, vol. 2, 2013.
[7] N. Stephens, S. Biles, M. Boettcher, J. Eapen, M. Eyole, G. Gabrielli,
M. Horsnell, G. Magklis, A. Martinez, N. Premillieu, A. Reid, A. Rico,
and P. Walker, “The ARM scalable vector extension,” IEEE Micro, vol. 37,
pp. 26–39, Mar 2017.
[8] F. Bellard, “QEMU, a fast and portable dynamic translator,” in USENIX
Annual Technical Conference, FREENIX Track, vol. 41, p. 46, 2005.
[9] M. K. Seager, “STRIDE CORAL 2 benchmark summary.”
https://asc.llnl.gov/coral-2-benchmarks/downloads/STRIDE_Summary_v1.0.pdf, 2019.
[10] D. Terpstra, H. Jagode, H. You, and J. Dongarra, “Collecting performance
data with PAPI-C,” in Tools for High Performance Computing 2009
(M. S. Müller, M. M. Resch, A. Schulz, and W. E. Nagel, eds.), (Berlin,
Heidelberg), pp. 157–173, Springer Berlin Heidelberg, 2010.
[11] E. Strohmaier and H. Shan, “Apex-Map: A global data access benchmark
to analyze HPC systems and parallel programming paradigms,” in
Proceedings of the 2005 ACM/IEEE Conference on Supercomputing,
SC ’05, (Washington, DC, USA), pp. 49–, IEEE Computer Society,
2005.
[12] J. D. McCalpin, “Memory bandwidth and machine balance in current high
performance computers,” IEEE Computer Society Technical Committee
on Computer Architecture (TCCA) Newsletter, pp. 19–25, Dec. 1995.
[13] T. Deakin, J. Price, M. Martineau, and S. McIntosh-Smith, “GPU-
STREAM v2.0: Benchmarking the achievable memory bandwidth of
many-core processors across diverse parallel programming models,” in
High Performance Computing (M. Taufer, B. Mohr, and J. M. Kunkel,
eds.), (Cham), pp. 489–507, Springer International Publishing, 2016.
[14] J. R. Hammond and T. G. Mattson, “Evaluating data parallelism in C++
using the parallel research kernels,” in Proceedings of the International
Workshop on OpenCL, IWOCL’19, (New York, NY, USA), pp. 14:1–14:6,
ACM, 2019.
[15] E. Hein, T. Conte, J. Young, S. Eswar, J. Li, P. Lavin, R. Vuduc, and
J. Riedy, “An initial characterization of the Emu Chick,” in 2018 IEEE
International Parallel and Distributed Processing Symposium Workshops
(IPDPSW), pp. 579–588, May 2018.
[16] P. Luszczek, J. J. Dongarra, D. Koester, R. Rabenseifner, B. Lucas,
J. Kepner, J. McCalpin, D. Bailey, and D. Takahashi, “Introduction to
the HPC Challenge benchmark suite,” tech. rep., 2005.
[17] A. Danalis, G. Marin, C. McCurdy, J. S. Meredith, P. C. Roth, K. Spafford,
V. Tipparaju, and J. S. Vetter, “The scalable heterogeneous computing
(SHOC) benchmark suite,” in Proceedings of the 3rd Workshop on
General-Purpose Computation on Graphics Processing Units, GPGPU-3,
(New York, NY, USA), pp. 63–74, ACM, 2010.
[18] J. A. Stratton, C. Rodrigues, I.-J. Sung, N. Obeid, L.-W. Chang,
N. Anssari, G. D. Liu, and W.-m. W. Hwu, “Parboil: A revised benchmark
suite for scientific and commercial throughput computing,” Center for
Reliable and High-Performance Computing, vol. 127, 2012.
[19] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. H. Lee, and
K. Skadron, “Rodinia: A benchmark suite for heterogeneous computing,”
in 2009 IEEE International Symposium on Workload Characterization
(IISWC), pp. 44–54, Oct 2009.
[20] M. Rajan, D. W. Doerfler, M. Tupek, and S. Hammond, “An investigation
of compiler vectorization on current and next-generation Intel processors
using benchmarks and Sandia’s SIERRA applications,” 2015.
[21] Z. Lai, Q. Luo, and X. Jia, “Revisiting multi-pass scatter and gather on
GPUs,” in Proceedings of the 47th International Conference on Parallel
Processing, ICPP 2018, (New York, NY, USA), pp. 25:1–25:11, ACM,
2018.
[22] S. Che, J. W. Sheaffer, and K. Skadron, “Dymaxion: Optimizing memory
access patterns for heterogeneous systems,” in Proceedings of 2011
International Conference for High Performance Computing, Networking,
Storage and Analysis, SC ’11, (New York, NY, USA), pp. 13:1–13:11,
ACM, 2011.
[23] B. Jang, D. Schaa, P. Mistry, and D. Kaeli, “Exploiting memory access
patterns to improve memory performance in data-parallel architectures,”
IEEE Transactions on Parallel and Distributed Systems, vol. 22, pp. 105–
118, Jan 2011.
[24] Y. Kim and A. Shrivastava, “CuMAPz: A tool to analyze memory access
patterns in CUDA,” in Proceedings of the 48th Design Automation
Conference, DAC ’11, (New York, NY, USA), pp. 128–133, ACM, 2011.
[25] Q. Gautier, A. Althoff, P. Meng, and R. Kastner, “Spector: An OpenCL
FPGA benchmark suite,” 12 2016.
[26] S. Srikanth, T. M. Conte, E. P. DeBenedictis, and J. Cook, “The
Superstrider architecture: Integrating logic and memory towards non-
Von Neumann computing,” in 2017 IEEE International Conference on
Rebooting Computing (ICRC), pp. 1–8, Nov 2017.
[27] T. Dysart, P. Kogge, M. Deneroff, E. Bovell, P. Briggs, J. Brockman,
K. Jacobsen, Y. Juan, S. Kuntz, and R. Lethin, “Highly scalable
near memory processing with migrating threads on the Emu system
architecture,” in Irregular Applications: Architecture and Algorithms
(IA3), Workshop on, pp. 2–9, IEEE, 2016.
