ALTIS: Modernizing GPGPU Benchmarking by Hu, Bodun & Rossbach, Christopher J.
Altis: Modernizing GPGPU Benchmarks
Bodun Hu
University of Texas at Austin
Austin, USA
bodunhu@utexas.edu
Christopher J. Rossbach
University of Texas at Austin and VMware Research Group
Austin, USA
rossbach@cs.utexas.edu
Abstract—This paper presents Altis, a benchmark suite for
modern GPGPU computing. Previous benchmark suites such as
Rodinia [1] and SHOC [2] have served the research commu-
nity well, but were developed years ago when hardware was
more limited, software supported fewer features, and produc-
tion hardware-accelerated workloads were scarce. Since that
time, GPU compute density and memory capacity have grown
exponentially, programmability features such as unified memory,
demand paging, and HyperQ have matured, and new workloads
such as deep neural networks (DNNs), graph analytics, and
crypto-currencies have emerged in production environments,
stressing the hardware and software in ways that previous bench-
marks did not anticipate. Drawing inspiration from Rodinia
and SHOC, Altis is a benchmark suite designed for modern
GPU architectures and modern GPU runtimes, representing a
diverse set of application domains. By adopting and extending
applications from Rodinia and SHOC, adding new applications,
and focusing on CUDA platforms, Altis better represents modern
GPGPU workloads to support GPGPU research in both
architecture and system software.
I. INTRODUCTION
GPUs have become popular hardware accelerators in many
computing domains. Originally used primarily for 3D rendering,
their use as General Purpose Graphics Processing Units (GPG-
PUs) has grown rapidly. Heterogeneous computing using both
CPUs and GPUs is now the dominant practice in many domains
such as ML [3], [4], graph analytics [5]–[7], video analytics [8], and
crypto-currencies [9]. At the same time, dramatic improvements
in programmability for GPGPU workloads have been driven
by rapidly evolving high-level languages and runtimes such
as CUDA [10]. Unfortunately, current GPGPU benchmark
suites have not co-evolved with the hardware, programming
frameworks and runtimes, or production workloads, so their
ability to drive relevant research findings has become limited.
Existing benchmark suites such as Rodinia [1] and SHOC [2]
were designed to enable researchers to better understand
the characteristics and demands of heterogeneous systems.
Applications were curated such that each benchmark exhibits
unique behaviors to exercise important GPU subsystems and
components. This allows architects to design better hardware
and enables system software developers to improve efficiency
and programmability. However, current GPGPU benchmark
suites have not kept up with the evolution of hardware and
programming frameworks: in general, they do not use or
stress newer features introduced over multiple generations of
CUDA, such as HyperQ, dynamic parallelism, NVLink [11],
preemptive context switch [12], and UVM [10]. While the
hardware has evolved to provide more raw computing power,
memory capacity, and memory bandwidth, the task of scaling
benchmarks to adapt to that growth has either been left to the
user (e.g., with Rodinia) or entirely neglected (SHOC).
Fig. 1: The Pearson correlation matrix for Rodinia (left)
and SHOC (right). The x axis represents var1 and the y
axis represents var2. Darker colors represent greater positive
correlation value. Rodinia axes: backprop, bfs, b+tree, cfd,
dwt2d, gaussian, heartwall, hotspot, hotspot3D, huffman,
hybridsort, kmeans, lavaMD, leukocyte, lud, myocyte, nn, nw,
particlefilter, pathfinder, srad_v1, srad_v2, streamcluster.
SHOC axes: bfs, fft, gemm, md, md5hash, neuralnet, reduction,
scan, sort, spmv, stencil2d, triad, s3d, qtclustering.
GPUs are the de facto platform of choice for training deep
neural networks. Frameworks like TensorFlow [13] and PyTorch
[14] are widely adopted in research and industry environments.
While these platforms are available to researchers,
they are not designed to study hardware behaviors or system
software design. Targeted toward production environments, they
introduce large memory footprints, making them difficult for
hardware architects to use with simulators. DNNs have become
arguably the most important domain driving GPGPU technology
trends, but the minimal neural network workloads in SHOC
and Rodinia are not representative of the current state of the
art in DNN models or algorithms.
We present Altis to address emerging needs of GPGPU
researchers working in both architecture and systems. Altis
preserves what is useful in current suites, extending them to
use new features and exploit new hardware capabilities. Altis
introduces new workloads and feature support to reflect the
evolution of the GPGPU application landscape. We limit our
focus to CUDA, unlike previous suites which also support
OpenCL or other frameworks, because CUDA has become
the dominant platform in production settings, and because
production OpenCL support for advanced features like demand-
paging has been slow to emerge. We make the following
contributions:
arXiv:1906.10347v2 [cs.PF] 27 Aug 2020
• We demonstrate areas in which existing benchmark suites
no longer meet researchers’ needs, specifically in workload
diversity, problem sizes, and programming features.
• We demonstrate improvements made by Altis and show
capabilities of emerging GPUs more comprehensively. We
implement support for new features up to CUDA 10.0.
• We introduce new benchmarks to characterize emerging
application domains, emphasizing neural networks by
adding commonly used DNN kernels to Altis.
II. BACKGROUND AND RELATED WORK
Rodinia [1] is a suite of applications designed for hetero-
geneous systems released in 2009. It comprises applications
representing different behaviors of the Berkeley dwarfs [15].
The dwarfs are 13 major categories of computation conjectured
to describe most types of parallel problems. Rodinia relies on
that conjecture to comprise a diverse set of applications that
include both communication and synchronization. However,
Rodinia does not take advantage of new features like unified
memory, HyperQ, Cooperative Groups, and Dynamic Paral-
lelism [16], because it still depends on CUDA 4.0. It does
not utilize programming constructs or architectural support
introduced in newer CUDA versions that can significantly
improve performance: for example, new hardware capabilities
such as half precision and tensor core computation are unused.
To analyze how well Rodinia exercises and stresses GPUs,
we profiled performance for each benchmark using nvprof
[17]. The tool reports metrics that quantify utilization of
a component relative to peak on a scale of 0 to 10: 0 is
idle, 10 is full utilization. Many applications run multiple
kernels: for these cases, we collect average utilization rate for
all kernels, and report the maximum of those averages. To
characterize how diverse applications in the suite are, we apply
Principal Component Analysis (PCA) to the top 69 metrics
supported by nvprof that contribute most to dissimilarity
across all benchmarks in the suite using the default run
configuration. The metrics included are detailed in Table I
and methodology is described in §V. Our goal is to assess
similarity/dissimilarity among benchmarks and show how well
they exercise different components of the GPU and its runtime.
The result is shown in Figure 2. The first three PCs represent
55% of total variance, which accounts for more than half of the
diversity across all benchmarks. A correlation matrix is shown
in Figure 1: an ideal matrix would be dark along the diagonal
but light elsewhere, and would demonstrate that applications
have different behaviors and exercise the system differently.
However, 41% and 70% of applications have correlation values
greater than 0.8 and 0.6 respectively, showing most applications
in Rodinia are highly correlated with a few outliers. An
improved suite in which applications are more diverse is
urgently needed.
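As a rough illustration of the diversity analysis described above, the following Python sketch computes the variance explained by each principal component and a Pearson correlation matrix between benchmarks. It is not the authors' tooling: the function names, the z-score normalization, and the pairwise reading of "fraction of correlation values above a threshold" are assumptions, and the input is assumed to be a benchmarks-by-metrics matrix of profiler counters.

```python
import numpy as np


def standardize(metrics):
    """Z-score each metric column; constant columns become all zeros."""
    metrics = np.asarray(metrics, dtype=float)
    centered = metrics - metrics.mean(axis=0)
    std = metrics.std(axis=0)
    std[std == 0.0] = 1.0  # avoid division by zero for constant metrics
    return centered / std


def pca_explained_variance(metrics):
    """Fraction of total variance captured by each principal component."""
    z = standardize(metrics)
    singular_values = np.linalg.svd(z, compute_uv=False)
    variance = singular_values ** 2
    return variance / variance.sum()


def benchmark_correlation(metrics):
    """Pearson correlation between benchmarks (rows) across their metrics."""
    return np.corrcoef(standardize(metrics))


def fraction_above(corr, threshold):
    """Share of distinct benchmark pairs whose correlation exceeds threshold."""
    upper = corr[np.triu_indices_from(corr, k=1)]
    return float((upper > threshold).mean())


if __name__ == "__main__":
    # Hypothetical data: 6 benchmarks, 10 profiler metrics each.
    rng = np.random.default_rng(0)
    metrics = rng.random((6, 10))
    ratios = pca_explained_variance(metrics)
    print("variance explained by first three PCs:", ratios[:3].sum())
    print("pairs with correlation > 0.8:",
          fraction_above(benchmark_correlation(metrics), 0.8))
```

With real profiler output, the sum of the first three entries of `pca_explained_variance` corresponds to the quantity reported as 55% for Rodinia, and `fraction_above` corresponds to the 0.8 and 0.6 thresholds quoted in the text.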
Fig. 2: Rodinia PCA
Figure 3 shows the average utilization rate and standard
deviation of the different functional components and the
memory hierarchy for each application in Rodinia. A key
observation is that many components have low utilization, even
in applications comprising multiple kernels. Moreover, average GPU
component utilization looks very similar for many benchmarks,
such as gaussian, huffman, nw, and myocyte. Many
hardware components are not stressed to achieve maximum
utilization, stifling research into improving those components.
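The per-application utilization figures rely on the aggregation rule stated earlier: average the utilization of each kernel, then report the maximum of those averages. A minimal Python sketch of one reading of that rule follows. The function name and the input layout (kernel name mapped to a list of 0 to 10 utilization levels, one per invocation) are assumptions for illustration, not the format produced by nvprof.

```python
from statistics import mean


def reported_utilization(samples_by_kernel):
    """Average each kernel's utilization levels, then take the maximum average.

    samples_by_kernel maps a kernel name to a list of utilization levels
    (0 = idle, 10 = full utilization), one entry per kernel invocation.
    """
    averages = [mean(levels) for levels in samples_by_kernel.values() if levels]
    if not averages:
        raise ValueError("no utilization samples provided")
    return max(averages)


if __name__ == "__main__":
    # Hypothetical DRAM utilization samples for a three-kernel application.
    dram = {"kernel_a": [2, 4], "kernel_b": [7, 9], "kernel_c": [1]}
    print(reported_utilization(dram))  # kernel_b averages 8, the maximum
```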
SHOC. Developed in 2010, SHOC [2] differs from Ro-
dinia by dividing workloads into two primary categories:
stress/functional tests and performance tests. Stress tests use
computationally demanding kernels to identify devices with
bad memory, insufficient cooling, and other device component
problems. Performance tests, on the other hand, focus on
synthetic kernels and a handful of common parallel algorithms.
SHOC applications run in a framework which accepts user-
specified parameters (e.g. number of iterations), and metrics,
such as floating point operations per second, are recorded.
Many programs in SHOC are basic parallel algorithms,
reflecting an ever-shrinking subset of algorithms actually used
in more modern applications. SHOC covers a variety of
the Berkeley dwarfs, but does not capture the evolution or
complexity of modern applications. Like Rodinia, SHOC was
written when software and hardware features were more limited.
Figure 1 includes a correlation matrix for SHOC that shows
overall less correlation than Rodinia (12% and 31% values
over 0.8 and 0.6 respectively), but a handful of benchmarks
are very highly correlated with most others (e.g. scan and
neuralnet). However, while SHOC suffers less from lack
of diversity than Rodinia, it significantly under-utilizes modern
hardware. Figure 3 shows GPU resource utilization and standard
deviation for each application in SHOC, measured using the
largest preset data size available. Unlike Rodinia, the utilization
rate no longer exhibits a fixed pattern but varies over a wide
range. This is because SHOC consists of microbenchmarks
that target specific hardware components. Importantly, most
components are not fully exercised to maximum capacity. The
PCA result is shown in Figure 4 based on the same metrics used
above, profiled with both the smallest and the largest predefined
data sizes. Like the PCA analysis for Rodinia (Figure 2), with
the exception of a very small number of outliers, workloads are
mostly tightly clustered together. Tight clustering suggests that
the workloads do not exercise the GPU in significantly different
ways. As the data size increases, the workloads become even
more clustered, showing that a key technology trend (increased
memory capacity) is making these workloads even less useful
for capturing a variety of important behaviors. This highlights
the need for workloads to better cover the underlying metric
space for modern and future GPGPU hardware and applications.
Fig. 3: GPU resource utilization for Rodinia and SHOC. Each
panel plots utilization per benchmark for DRAM, L2, Shared,
Unified Cache, Control Flow, Load/Store, Tex, Special,
Single P., and Double P.
Fig. 4: SHOC PCA. The black dots represent results from
running the smallest data size and the red ones represent the
benchmarks using the largest data size.
Other Benchmarks. More recent benchmarks such as Lonestar
[18], Pannotia [19], and LonestarGPU [20] have addressed
irregular parallelism, rather than stressing overall heterogeneous
performance. Not all new CUDA features are supported. Sarita,
Alsop, and Sinclair [21] focus on benchmarking the effects
of different levels of synchronization (from coarse to fine),
but do not address trends in GPGPU software and hardware
evolution. Parboil [22] provides a set of throughput computing
applications useful for studying the performance of throughput
computing architecture and compilers. The MAFIA framework
[23] is designed to target multi-application execution in GPUs.
These efforts are motivated by similar challenges to GPGPU
computing research: lack of workloads that actually exercise
the features under test. Unfortunately, MAFIA remains tied
to older versions of CUDA by the need to support an in-simulator
implementation of the runtime, which both stifles
co-evolution and limits the fidelity with which it can capture real
system behaviors. Chai [24] addresses many of the same concerns
motivating Altis and incorporates new benchmarks up to CUDA
8.0, but focuses more on task and data partitioning rather than
emerging domains such as DNNs and relational algebra. Unlike
Altis, Chai’s introduction of a new programming front end
limits its ability to keep up with CUDA evolution.
Due to the rapid growth in the popularity of machine learning,
there has been significant focus on characterizing DNN
behavior on GPUs [25]–[27]. Popular frameworks such
as TensorFlow include primitive tools for users to analyze
the computational demands of their models. Tango [28] is a
framework to study behaviors of specific neural network models.
MLPerf is a benchmark suite for measuring the performance of
machine learning (ML) software frameworks, ML hardware
platforms and ML cloud platforms. These systems could help
GPGPU researchers, but because they focus on high-level
and end-to-end behaviors, lack of control over scale and poor
visibility into system-level and architecture-level phenomena
make them hard for systems and architecture researchers to
use for GPGPU research. A better benchmark suite targeted to
this audience is long overdue.
III. MOTIVATION AND GOALS
A. Represent Emerging GPGPU Application Domains
GPUs are used in many new domains not represented by
workloads in Rodinia or SHOC, e.g., data analytics, graph
processing, relational algebra, and DNNs. While GPUs are
the standard for DNN training, benchmark suites either do
not include neural network based kernels, or include minimal
workloads based on outdated techniques. Our goal is to include
emerging domains, while preserving the ability of researchers
to study low level behaviors of system software and individual
kernels to explore improvements at those stack layers.
B. Flexible Dataset Sizes
A key aspect of existing benchmark suites that requires
significant improvement is management of dataset and working
set sizes. SHOC provides four preset data sizes. This lack
of flexibility makes it hard for SHOC to stay relevant in the
future, as advancing technology will eventually cause even
the largest data size to be too small to stress GPU resources.
Rodinia has the opposite problem, where benchmarks have
no preset data size, forcing the user to specify problem sizes,
and making research results based on the suite difficult to
interpret and compare. Users must run data generation scripts
even if they do not know what input size may be appropriate
for the platform being measured. Altis is designed for forward
compatibility by supporting both updated default sizes and
tooling to create arbitrary data sizes. A consequence of
user-parameterizable data sizes is that real-world datasets cannot
always be used, so new datasets are synthesized at whatever
size the user specifies. While Altis’s
support for modern default sizes and flexible non-defaults
avoids shortcomings of SHOC (no flexibility) and Rodinia (no
dataset size guidance), the problem of updating defaults for
future hardware is fundamental. In future work, we plan to
explore providing feedback to help the user choose new default
sizes based on utilization.
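The combination of preset defaults and user-specified sizes can be sketched in a few lines of Python. Everything named here is hypothetical: the size classes, the element counts, and the function names are illustrative assumptions and do not reflect the actual Altis command-line interface or its default sizes.

```python
import random

# Hypothetical element counts per preset size class (not Altis's real defaults).
PRESET_SIZES = {1: 1 << 16, 2: 1 << 20, 3: 1 << 24, 4: 1 << 28}


def resolve_problem_size(size_class=1, custom_size=None):
    """Return the element count: a custom size wins over the preset class."""
    if custom_size is not None:
        if custom_size <= 0:
            raise ValueError("custom_size must be positive")
        return custom_size
    if size_class not in PRESET_SIZES:
        raise ValueError(f"unknown size class {size_class}")
    return PRESET_SIZES[size_class]


def generate_input(num_elements, seed=0):
    """Synthesize a reproducible random dataset of the requested size."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(num_elements)]


if __name__ == "__main__":
    print(resolve_problem_size(size_class=2))                    # preset class
    print(resolve_problem_size(size_class=2, custom_size=1000))  # user override
    print(len(generate_input(resolve_problem_size(custom_size=8))))
```

Seeding the generator keeps synthesized inputs reproducible, which helps when results from different platforms need to be compared.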
C. Focus on CUDA and Recent CUDA Versions
Supporting multiple GPGPU programming frameworks is
a goal with considerable conceptual appeal, but one which
increases maintenance burden and engineering effort involved
in keeping pace with hardware and software evolution. Because
CUDA has become overwhelmingly the dominant framework
for GPGPU, we limit our focus to CUDA and to maintaining
feature parity with recent releases. In addition to performance
improvements, each new release version of CUDA typically
supports new programming features that can be used to more
easily write more efficient code. It is essential to include
these new features to understand their impact on performance,
particularly as it is usually newer features that are of interest
to systems researchers and architects.
IV. THE ALTIS BENCHMARK SUITE
Altis is available at: https://github.com/utcs-scea/Altis.
In Altis, like SHOC, benchmarks are divided into levels.
Each level represents benchmarks whose behaviors of interest
range from low level characteristics such as memory bandwidth
to end-to-end performance on real world applications. While
determining a set of benchmarks for Altis, consideration was
given to both the Berkeley dwarfs and emerging application
domains such as analytics and DNNs. Our approach is to draw
representative subsets from Rodinia and SHOC and augment
those with workloads that represent emerging domains.
First, Altis integrates a new set of benchmarks representing
neural network layers commonly used in popular DNN models.
Inclusion of DNN benchmarks creates a single suite that
covers modern domains so that the current research practice of
cobbling together superset measurements from Rodinia, SHOC,
Parboil, and others can be avoided. Altis DNN workloads
are parallelized with CUDA APIs and powered by libraries
including cuBLAS and cuDNN (NVIDIA CUDA Deep Neural
Network library). We use Darknet [29], an open-source neural
network framework, to construct neural networks. Darknet
enables us to construct real-world models using existing
building blocks. However, most of the kernels in Darknet do
not utilize the cuDNN library, which causes poor performance
compared to industrial standards like TensorFlow. Thus, we
re-implement the most commonly used kernels with cuDNN, and
remove extra memory operations to reduce memory footprints.
In contrast to NN-focused suites like MLPerf, which focus on
end-to-end measurements, Altis isolates individual layers from
DNN models, enabling finer-grained analysis. Critically, we
believe that because Altis uses cuDNN implementations, and
extracts them from an end-to-end setting, it enables researchers
to engage in detailed analysis based on state-of-the-art code.
Altis aims to strike a balance between predetermined input
sizes available in SHOC and customizable input sizes available
in Rodinia. Benchmarks contain preset sizes optimized for
systems with different compute capabilities, as well as a
mechanism through which users can specify the size and
other aspects of their input. This feature merges the favorable
qualities from both Rodinia and SHOC.
Characterizing new datasets. For all benchmarks adopted
from Rodinia, users can either use the randomly generated
data provided by Altis tools, or the data originally provided in
Rodinia (randomly generated). All data generated for additional
benchmarks are randomly generated. While flexible sizing and
random generation do mean Altis benchmarks do not use “real
world” datasets, we believe this is not a significant limitation
for the type of architecture and systems research we envision
it being used for. Importantly, while DNN research on end-to-
end performance does require real world models and inputs to
accurately reflect time-to-convergence for training, research on
the behavior of individual kernels is not sensitive to the same
concern.
While not all features are applicable in all workloads, Altis
includes support for each new CUDA feature in every workload
where the feature is meaningful. UVM and CUDA Event
support are present in all workloads. Dynamic parallelism and
Cooperative Group support change the programming model
sufficiently that new workloads (detailed below) are required
to exercise them. HyperQ is meaningful only for workloads
that actually have independent kernels that
can be run concurrently without compromising application
semantics. All benchmarks in Altis support the most recent
releases of CUDA (version 10 at the time of writing). Newly
supported features are detailed below.
Unified Memory is a programmability feature supporting a
shared address space across device and host, with
demand paging support to eliminate the need for explicit data
transfer code. When applications access data currently absent
on the running device, the hardware automatically pages it in.
This feature is supported in all Altis workloads.
CUDA Events enable accurate timing of functions and kernel
calls. This is an improvement over previous suites which still
use system time, risking low fidelity measurements, particularly
in the presence of asynchrony. All Altis workloads support
CUDA events.
HyperQ allows multiple independent CUDA kernels to execute
in parallel on the same GPU if the resources are available.
HyperQ uses 32 independent Work Distributor Queues to detect
opportunities for parallelism, whereas older architectures use a
single Work Distributor Queue. We implement HyperQ support
in all benchmarks that feature kernels that can execute indepen-
dently (DWT, LavaMD, SRAD) and extend Pathfinder
with a mode that runs independent but duplicate versions of
the same kernels on different streams.
Dynamic Parallelism enables currently executing CUDA
kernels to launch child CUDA kernels (nested parallelism). This
feature is useful when running algorithms with hierarchical
data structures and recursive algorithms with parallelism at
each level. Previous suites do not feature workloads that use
such idioms; we added the Mandelbrot workload to provide
coverage for the feature.
Cooperative Groups (Grid Sync) provides another granularity
of synchronization for kernel threads running on a GPU.
Grid Sync allows users to synchronize all threads in the entire grid
before beginning the next section of computation. This feature is
useful for programs with disjoint phases of computation running
right after one another. This provides a synchronization
granularity beyond that of previous CUDA versions, which expose only
__syncthreads() to synchronize all threads in a single
block. Like Dynamic Parallelism, using the feature effectively
requires significant code change and is only advantageous for
workloads that make frequent data-dependent kernel launches
for which predicates would otherwise require data transfer.
Altis workloads that support it are SRAD and kmeans.
CUDA Graphs present a new model for submitting jobs in
CUDA. This feature allows work to be defined as graphs instead
of single operations. A graph consists of a series of operations,
such as kernel launches and memory operations, defined as
nodes. Connections between these nodes specify corresponding
dependencies. This feature enables launching multiple GPU
operations with one CPU operation, hence reducing overheads.
Graphs also enable the CUDA driver to perform optimizations
because the whole workflow is visible.
A. Level 0 Workloads
Level 0 benchmarks are designed to measure low level
characteristics of the hardware. These benchmarks do the
simple task of measuring a single capability of the GPU and
therefore do not represent any dwarfs or application domains.
BusSpeedDownload measures the speed of the PCIe bus by
repeatedly transferring data of various sizes from the host to
the device. The data sizes are varied from 1 KB to 500 KB.
BusSpeedReadback also measures the speed of the PCIe bus, but
in the opposite direction. Here, data is transferred from the
device to the host.
DeviceMemory measures the bandwidth of different compo-
nents of the memory hierarchy on the device. This includes
global, constant, and shared memory.
MaxFlops measures the maximum achievable floating point
operations per second on the device. This is adopted from
SHOC for single and double precision arithmetic but extended
with support for half-precision floating point arithmetic.
B. Level 1 Workloads
Level 1 benchmarks include basic parallel algorithms which
are common tasks in parallel computing and often used in
kernels of real applications. While these applications represent
a subset of the Berkeley dwarfs, they are complex enough to
represent real application domains.
GUPS stands for giga-updates per second. It measures how
frequently a computer can issue updates to randomly generated
RAM locations. This benchmark stresses the latency and
bandwidth of the device. This test is important because
random memory access performance directly affects application
performance. This is adapted from the HPCC [30] benchmark
suite, extended to simplify the tuning of DRAM footprint.
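To make the metric concrete, here is a minimal CPU-side Python sketch of a GUPS-style measurement using NumPy. It is an illustration under assumptions, not the Altis or HPCC kernel: the XOR update follows the style of the HPCC RandomAccess test, while the function names, table initialization, and timing method are chosen here for brevity.

```python
import time

import numpy as np


def make_updates(table_size, num_updates, seed=0):
    """Generate random table indices and random 64-bit update values."""
    rng = np.random.default_rng(seed)
    indices = rng.integers(0, table_size, size=num_updates)
    values = rng.integers(0, 2**63, size=num_updates, dtype=np.uint64)
    return indices, values


def apply_updates(table, indices, values):
    """XOR each value into its table slot; repeated indices are all applied."""
    np.bitwise_xor.at(table, indices, values)


def measure_gups(table_size, num_updates, seed=0):
    """Return giga-updates per second for one pass of random updates."""
    table = np.arange(table_size, dtype=np.uint64)
    indices, values = make_updates(table_size, num_updates, seed)
    start = time.perf_counter()
    apply_updates(table, indices, values)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against zero
    return num_updates / elapsed / 1e9


if __name__ == "__main__":
    print(f"{measure_gups(1 << 20, 1 << 22):.4f} GUPS (CPU, NumPy)")
```

Because XOR is its own inverse, applying the same update stream twice restores the original table, which gives a simple correctness check. Growing `table_size` is the knob that scales the memory footprint.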
Breadth First Search measures the performance for breadth-
first search, a common graph traversal algorithm. This appli-
cation was included because it is control-flow intensive. This
benchmark is adapted from Rodinia, extended with modern
CUDA feature support.
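For readers unfamiliar with the access pattern, the following Python sketch shows a level-synchronous breadth-first search, the formulation GPU implementations typically parallelize by processing one frontier per step. It is a sequential illustration only. The adjacency-list format and function name are assumptions and do not mirror the Rodinia or Altis data layout.

```python
def bfs_levels(adjacency, source):
    """Return a dict mapping each reachable node to its BFS depth from source.

    adjacency maps a node to an iterable of its neighbors.
    """
    levels = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        # On a GPU, each frontier node (or edge) would map to a thread; the
        # visited check below is the data-dependent branch that makes BFS
        # control-flow intensive.
        for node in frontier:
            for neighbor in adjacency.get(node, ()):
                if neighbor not in levels:
                    levels[neighbor] = depth
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return levels


if __name__ == "__main__":
    graph = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: [], 5: [0]}
    print(bfs_levels(graph, 0))  # node 5 is unreachable from 0 and is omitted
```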
General Matrix Multiply is an application that measures
the performance for different types of matrix multiplications.
The types of matrix multiplications include single and double
precision tests with and without transposing the input matrices.
This benchmark is adapted from SHOC with added support
for half precision arithmetic, Tensor Cores, and new CUDA
features.
Pathfinder performs a shortest-path algorithm which serves
as a test of irregular parallelism. While most conventional
parallel algorithms have uniform behaviors across the different
threads, irregular algorithms are characterized by different
threads performing different executions. Depending on graph
connectivity, different threads can experience unique behaviors.
In addition to this, pathfinder will experience much higher
control flow unit utilization compared to regular parallelism
algorithms as each thread needs to decide how to execute
independently. This benchmark is adapted from Rodinia,
extended with new CUDA feature support.
Sort performs fast radix sort [31] on an array of integers.
This benchmark was originally included in SHOC. In Altis,
the workload is extended to simplify dataset size tuning and
to add new CUDA feature support.
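As a reference for the algorithm being benchmarked, below is a minimal sequential least-significant-digit radix sort in Python. It is a sketch under assumptions (non-negative integers, a 4-bit digit by default) and is not the SHOC or Altis kernel. GPU versions parallelize the per-digit histogram, prefix sum, and scatter steps shown here.

```python
def radix_sort(values, bits_per_pass=4):
    """Sort non-negative integers with an LSD radix sort."""
    values = list(values)
    if any(v < 0 for v in values):
        raise ValueError("this sketch handles non-negative integers only")
    if not values:
        return values
    radix = 1 << bits_per_pass
    mask = radix - 1
    max_value = max(values)
    shift = 0
    while (max_value >> shift) > 0:
        # 1. Histogram: count how many keys fall in each digit bucket.
        counts = [0] * radix
        for v in values:
            counts[(v >> shift) & mask] += 1
        # 2. Exclusive prefix sum: first output slot of each bucket.
        offsets = [0] * radix
        total = 0
        for digit in range(radix):
            offsets[digit] = total
            total += counts[digit]
        # 3. Stable scatter: place each key at its bucket's next free slot.
        output = [0] * len(values)
        for v in values:
            digit = (v >> shift) & mask
            output[offsets[digit]] = v
            offsets[digit] += 1
        values = output
        shift += bits_per_pass
    return values


if __name__ == "__main__":
    print(radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))
```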
C. Level 2 Workloads
Level 2 benchmarks are macro-benchmarks: real-world
application kernels. Benchmarks in this level are applications
that can be found in industry, and therefore represent a variety
of GPU application domains. These applications exhibit
diverse performance characteristics.
CFD Solver is a computational fluid dynamics benchmark.
This application solves the three-dimensional Euler equations
for compressible flow. This workload optimizes effective GPU
memory bandwidth by reducing total global memory accesses
and overlapping computation. This is adapted from Rodinia,
extended to add new CUDA feature support.
GPUDWT implements the discrete wavelet transform, an image and
video compression algorithm that is also a widely used digital
signal processing technique. This benchmark implements both
forward and reverse transforms, in both 9/7 and 5/3 variants. The
9/7 transform uses floats while the 5/3 transform uses integers,
so it is important to measure the performance for both. This
benchmark is adapted from Rodinia, extended to add new
CUDA feature support.
KMeans is a popular clustering algorithm used in data mining.
This algorithm shows a high degree of data parallelism. At the
beginning, K centers are chosen randomly. In each iteration,
each data point is assigned to a center, and at the end of each
iteration, each center is recomputed as the mean of all the data
points in its cluster until the two converge. This benchmark
provides 11 different implementations, including both CPU
and GPU side aggregation. It is extended to add new CUDA
feature support.
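The iteration described above can be sketched with NumPy as follows. This is a minimal CPU illustration of the assign-then-recompute loop, assuming Euclidean distance and random initial centers drawn from the data. It does not reproduce any of the benchmark's implementation variants, and the function name and parameters are chosen here.

```python
import numpy as np


def kmeans(points, k, max_iters=100, seed=0):
    """Cluster points into k groups; return (centers, labels)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Choose k distinct data points as the initial centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(max_iters):
        # Assignment step: nearest center for every point (data parallel).
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points.
        new_centers = centers.copy()
        for c in range(k):
            members = points[labels == c]
            if len(members) > 0:
                new_centers[c] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break  # centers stopped moving: converged
        centers = new_centers
    return centers, labels


if __name__ == "__main__":
    data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
    centers, labels = kmeans(data, k=2)
    print(centers)
    print(labels)
```

The assignment step is independent per point, which is the source of the data parallelism noted in the text. The update step is a reduction, which is where CPU-side versus GPU-side aggregation choices arise.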
LavaMD calculates N-body particle interaction. The code
calculates particle potential and relocation due to mutual forces
between particles within a large 3D space. This space is divided
into cubes, or large boxes, that are allocated to individual
cluster nodes. The large box at each node is further divided
into cubes, called boxes. 26 neighbor boxes surround each box
(the home box). Home boxes at the boundaries of the particle
space have fewer neighbors. Particles only interact with those
other particles that are within a cutoff radius since ones at
larger distances exert negligible forces. Thus the box size is
chosen so that the cutoff radius does not span beyond any
neighbor box for any particle in a home box, thus limiting the
reference space to a finite number of boxes. This benchmark is
implemented from scratch and provides 11 different variants,
extended to add new CUDA feature support.
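The neighbor structure described above (26 neighbors for an interior home box, fewer at the boundaries) can be checked with a short Python sketch. The cubic grid of `boxes_per_dim` boxes per side and the function name are assumptions for illustration, not taken from the LavaMD source.

```python
from itertools import product


def neighbor_boxes(box, boxes_per_dim):
    """Return coordinates of the boxes adjacent to `box` in a cubic grid."""
    x, y, z = box
    neighbors = []
    for dx, dy, dz in product((-1, 0, 1), repeat=3):
        if (dx, dy, dz) == (0, 0, 0):
            continue  # skip the home box itself
        candidate = (x + dx, y + dy, z + dz)
        # Boxes outside the particle space do not exist, so boundary
        # home boxes end up with fewer neighbors.
        if all(0 <= c < boxes_per_dim for c in candidate):
            neighbors.append(candidate)
    return neighbors


if __name__ == "__main__":
    print(len(neighbor_boxes((1, 1, 1), 3)))  # interior box: 26
    print(len(neighbor_boxes((0, 0, 0), 3)))  # corner box: 7
```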
Mandelbrot computes an image of a Mandelbrot fractal, a
self-repeating geometric pattern that loops back on itself at ever
decreasing sizes. A commonly used algorithm is the Escape
Time Algorithm, which calculates the value for different pixels
on a per pixel basis. This benchmark was added specifically
to test Dynamic Parallelism. With Dynamic Parallelism, the
benchmark switches to using the Mariani-Silver Algorithm. Unlike Escape Time,
this procedure starts out coarse grained, and only iterates at
a finer resolution if necessary for certain subsections. The
implementation is adapted from [32].
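For reference, the per-pixel Escape Time Algorithm can be written in a few lines of Python. This sketch covers only the baseline algorithm. The Mariani-Silver refinement and Dynamic Parallelism are not shown, and the viewport bounds, iteration cap, and function names are assumptions chosen for illustration.

```python
def escape_time(c, max_iters=256):
    """Iterations until |z| exceeds 2 under z -> z*z + c, capped at max_iters."""
    z = 0j
    for i in range(max_iters):
        if abs(z) > 2.0:
            return i
        z = z * z + c
    return max_iters


def render(width, height, max_iters=64,
           x_range=(-2.0, 1.0), y_range=(-1.5, 1.5)):
    """Compute the escape time for the center of every pixel."""
    x_min, x_max = x_range
    y_min, y_max = y_range
    image = []
    for row in range(height):
        y = y_min + (row + 0.5) * (y_max - y_min) / height
        line = []
        for col in range(width):
            x = x_min + (col + 0.5) * (x_max - x_min) / width
            # Each pixel is independent, so a GPU maps one thread per pixel.
            line.append(escape_time(complex(x, y), max_iters))
        image.append(line)
    return image


if __name__ == "__main__":
    for line in render(40, 20, max_iters=30):
        print("".join("#" if n == 30 else "." for n in line))
```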
Needleman-Wunsch is a nonlinear global optimization method
for DNA sequence alignments. The potential pairs of sequences
are organized in a 2D matrix. In the first step, the algorithm
fills the matrix from top left to bottom right, step-by-step.
The optimum alignment is the pathway through the array with
maximum score, where the score is the value of the maximum
weighted path ending at that cell. Thus, the value of each
data element depends on the values of its northwest-, north-
and west-adjacent elements. In the second step, the maximum
path is traced backward to deduce the optimal alignment. The
benchmark is adapted from Rodinia, extended to add new
CUDA feature support.
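The two steps described above (matrix fill, then traceback) are sketched below in Python. The scoring values (match +1, mismatch -1, gap -1) and the tie-breaking order in the traceback are assumptions for illustration. The benchmark's actual scoring scheme is not specified in this section.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def substitution(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Step 1: fill from top left to bottom right. Each cell depends on its
    # northwest, north, and west neighbors.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + substitution(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Step 2: trace the maximum path backward to recover the alignment.
    aligned_a, aligned_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            aligned_a.append(a[i - 1])
            aligned_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(aligned_a)), "".join(reversed(aligned_b))


if __name__ == "__main__":
    print(needleman_wunsch("GATTACA", "GCATGCU"))
```

Cells on the same anti-diagonal have no mutual dependencies, which is the parallelism GPU implementations typically exploit.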
ParticleFilter is a statistical estimator of the location of a target
object given noisy measurements of that target’s location and an
idea of the object’s path in a Bayesian framework. The PF has a
plethora of applications ranging from video surveillance in the
form of tracking vehicles, cells and faces to video compression.
This particular implementation is optimized for tracking cells,
particularly leukocytes and myocardial cells. The benchmark
is adapted from Rodinia, extended to add new CUDA feature
support.
SRAD is a computer vision application used for reducing
noise, or speckles, in images without destroying important
image features. This is done using partial differential equations.
Since each stage of this application operates on the entire
image, SRAD requires synchronization after each stage. This
makes SRAD the ideal benchmark to test the performance of
using cooperative groups in CUDA. This benchmark is adopted
from Rodinia with added support for Cooperative Groups.
Where is a new relational algebra benchmark developed for
Altis. GPUs are increasingly popular for data analytics because
relational algebra operations are amenable to efficient GPU
parallelization [33]. This benchmark implements a filter for a
set of records, returning a subset of the input records that meet
a set of conditions. It first maps each entry to a 1 or 0, before
running a prefix sum and using both of these auxiliary data
structures to reduce the input data to just the matching entries.
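The map, prefix-sum, and compaction steps can be illustrated with a short sequential sketch. The function name `where` and the example predicate are hypothetical; the Altis kernel performs these steps in parallel on the GPU, whereas this sketch only shows the data flow between the auxiliary structures.

```python
from itertools import accumulate


def where(records, predicate):
    """Filter records with the map / prefix-sum / scatter pattern."""
    # Map: 1 if the record matches the conditions, else 0.
    flags = [1 if predicate(r) else 0 for r in records]

    # Exclusive prefix sum: offsets[i] is the output slot of record i
    # if it matches (the number of matches strictly before it).
    inclusive = list(accumulate(flags))
    offsets = [total - flag for total, flag in zip(inclusive, flags)]

    # Scatter: every matching record writes itself to its own slot,
    # so no two records ever target the same output position.
    out = [None] * (inclusive[-1] if inclusive else 0)
    for record, flag, slot in zip(records, flags, offsets):
        if flag:
            out[slot] = record
    return out
```

For example, `where([5, 12, 7, 20, 3], lambda v: v > 6)` produces flags `[0, 1, 1, 1, 0]` and offsets `[0, 0, 1, 2, 3]`, so the three matching records land in slots 0, 1, and 2.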
Raytracing. Ray tracing is a rendering technique that
generates images by tracing the paths of light rays through
the pixels of an image plane and simulating the effects of their
interaction with virtual objects. This workload is new in Altis
and the implementation is adapted from a publicly available
implementation [34]. In addition, an OptiX version [35]
is supplied for RT Core benchmarking.
D. DNN Kernel Workloads
All benchmarks in this section represent neural network
layers commonly seen in popular DNN models. All layers in
this section include both forward and backward passes.
Activation layer is used to decide whether a neuron should be
activated, given the weighted sum of its inputs plus a bias.
It introduces non-linearity into the output of a neuron. Some
of the most commonly used activation functions include ReLU,
sigmoid, tanh, and LeakyReLU. Here we present ReLU since
it is the simplest one to understand. The ReLU activation function
can be represented as y_i = max{0, x_i}, where x_i represents the
input to the neuron and y_i is the output.
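As a minimal illustration of the ReLU forward and backward passes, the following sketch applies the formula elementwise to a flat list. The function names are hypothetical, and treating the gradient at exactly zero input as zero is a common convention assumed here rather than something stated in the text.

```python
def relu_forward(xs):
    """y_i = max(0, x_i) for every input x_i."""
    return [max(0.0, x) for x in xs]


def relu_backward(xs, grad_out):
    """Pass the upstream gradient through where x_i > 0, else 0.

    The gradient at x_i == 0 is taken to be 0 (an assumed convention).
    """
    return [g if x > 0 else 0.0 for x, g in zip(xs, grad_out)]
```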
Fig. 5: Per-resource type Utilization of Altis workloads using P100, GTX 1080, and M60 GPUs.
(Each panel plots utilization on a 0 to 10 scale for every workload; resource types: DRAM, L2, Shared, Unified Cache, Control Flow, Load/Store, Tex, Special, Single P., Double P.)
Pooling is commonly used between successive convolution
layers in a ConvNet architecture. Its main function is to reduce
the spatial dimensions in a convolutional neural network. For
example, applying a maxpool kernel of size 2×2 to a 2×2
matrix yields the largest number in the matrix, while an
average pool kernel of the same size produces the average
of the values in the matrix. For simplicity, we include only
the average pool layer in the report.
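The difference between the two pooling variants can be made concrete with a small sketch. The helper names `pool2d` and `average` are hypothetical, and non-overlapping windows (stride equal to the kernel size) are assumed.

```python
def pool2d(matrix, k, reducer):
    """Apply a k x k pooling window with stride k to a 2D list.

    reducer maps the list of values in one window to a single value,
    e.g. max for max pooling or average for average pooling.
    """
    rows, cols = len(matrix), len(matrix[0])
    out = []
    for r in range(0, rows - k + 1, k):
        out_row = []
        for c in range(0, cols - k + 1, k):
            window = [matrix[r + i][c + j] for i in range(k) for j in range(k)]
            out_row.append(reducer(window))
        out.append(out_row)
    return out


def average(values):
    return sum(values) / len(values)
```

On the 2×2 matrix `[[1, 2], [3, 4]]`, `pool2d(m, 2, max)` gives `[[4]]` and `pool2d(m, 2, average)` gives `[[2.5]]`.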
Batch normalization is a technique proposed to address internal
covariate shift [36] in DNNs. When parameters in the preceding layer
change, the input to the current layer changes accordingly,
forcing the current layer to adjust to the new distribution. The
main goal of batch normalization is to limit this shift to a
certain range in order to speed up training and produce reliable
models.
Connected layers are those whose neurons are connected to
every neuron in the next layer. The connected layer can be
seen as a feature vector that holds aggregated information from
the previous layer. For example, a connected layer can directly
follow a convolution layer that provides a low-dimensional
invariant feature space. The fully connected layer can then
learn a function from that space to produce more useful or
abstract knowledge.
Convolution layer is mostly used to extract important features
from images by assigning learnable weights to various objects
in those images. For example, given an RGB image of size 228×228
with 3 channels, we can train a convolution kernel of size 3×3
with 3 channels and stride 1 to produce an output tensor of
size 226×226 with 1 channel. The output tensor represents
one feature in the image, such as the presence of curves in
different parts of the input image.
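The 228 to 226 reduction in the example follows from the standard output-size formula for a convolution. The sketch below assumes no padding by default; the function and parameter names are illustrative.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension.

    Uses the standard formula floor((n + 2p - k) / s) + 1.
    """
    span = input_size + 2 * padding - kernel_size
    if span < 0:
        raise ValueError("kernel is larger than the padded input")
    return span // stride + 1
```

With `input_size=228`, `kernel_size=3`, and `stride=1`, the result is 228 - 3 + 1 = 226, matching the 226×226 output above.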
Dropout is a regularization technique used to prevent neural
networks from over-fitting [37]. The key idea is to randomly
drop units from the neural network during training. When
large neural networks are trained on small data sets, over-fitting
can severely degrade accuracy when the model is evaluated on the
test data set. Dropout mitigates this problem by stochastically
introducing noise to prevent units from co-adapting too much, thus
making the model more robust.
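A minimal sketch of dropout's forward and backward passes is given below. It assumes the "inverted dropout" variant, which rescales the surviving units during training; this is a common implementation choice and is not claimed to be what the Altis kernel does. The function names and the `rng` parameter are hypothetical.

```python
import random


def dropout_forward(xs, p, rng):
    """Drop each unit with probability p (inverted-dropout scaling).

    Returns (outputs, mask). Surviving units are scaled by 1 / (1 - p)
    so the expected activation is unchanged (an assumed convention).
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    keep_scale = 1.0 / (1.0 - p)
    mask = [keep_scale if rng.random() >= p else 0.0 for _ in xs]
    return [x * m for x, m in zip(xs, mask)], mask


def dropout_backward(grad_out, mask):
    """Gradients flow only through the units that were kept."""
    return [g * m for g, m in zip(grad_out, mask)]
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which is convenient when comparing forward and backward passes.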
RNN stands for Recurrent Neural Network. It is widely adopted
in learning tasks dealing with sequential data, such as speech
recognition and text generation. RNNs have proven
to be successful in capturing the dynamics of sequences by
keeping internal state (memory) that tracks information from
previous time steps. Among the most commonly used RNNs
are GRU and LSTM. In our benchmark, we only show results
for LSTM for simplicity.
Fig. 6: Contribution of the top 10 (out of 69) Variables to PCA
dimensions 1, 2, 3, and 4.
(First panel, "Contribution of variables to Dim-1-2", contributions in %: issued_ipc, issue_slot_utilization, ipc, eligible_warps_per_cycle, ldst_executed, ldst_issued, inst_bit_convert, inst_integer, inst_executed_global_stores, inst_executed_shared_loads. Second panel, "Contribution of variables to Dim-3-4", contributions in %: double_precision_fu_utilization, flop_count_dp_fma, flop_count_dp, tex_cache_hit_rate, tex_utilization, inst_fp_64, tex_fu_utilization, inst_per_warp, inst_executed_global_loads, flop_count_dp_mul.)
Softmax layer is typically seen as the final output layer in a
neural network to perform multi-class classification. It takes an
input, usually a vector of score values (z_i, i = 1...K), and converts
it into probabilities. The outputs of the layer therefore represent
a true probability distribution, in which the individual
outputs sum to 1. The calculation is shown below:
\sigma(z_c) = \frac{e^{z_c}}{\sum_{k=1}^{K} e^{z_k}} \qquad (1)
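Equation (1) can be evaluated directly, as in the following sketch. Subtracting the maximum score before exponentiating is a standard numerical-stability step assumed here; it leaves the result of Equation (1) unchanged because the common factor cancels between numerator and denominator.

```python
import math


def softmax(z):
    """Softmax of a non-empty list of scores z_1..z_K, per Equation (1)."""
    shift = max(z)  # stability shift; cancels in the ratio
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]
```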
LRN (Local Response Normalization) is intended to simulate
a form of lateral inhibition [38] inspired by the type found in
real neurons. It diminishes responses that are uniformly large
across a neighborhood and creates higher contrast in the activation
map. This is especially useful with unbounded activation
functions such as ReLU. The original formula is written as
b^{i}_{x,y} = a^{i}_{x,y} \Big/ \left( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left( a^{j}_{x,y} \right)^{2} \right)^{\beta} \qquad (2)
where b^i_{x,y} is the regularized output for kernel i at position x,y,
a^i_{x,y} is the source output of kernel i applied at position x,y, N
is the number of kernels, n is the size of the normalization
neighbourhood, and α, β, k are hyperparameters of LRN.
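The LRN formula can be sketched for a single spatial position as follows. The input is the list of activations a^i_{x,y} for all N kernels at one fixed (x, y). The default hyperparameter values are those reported in [38]; interpreting n/2 as integer division is an assumption of this sketch, as are the function and parameter names.

```python
def lrn(a, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Local response normalization across kernels at one position.

    a[i] is the activation of kernel i at a fixed (x, y). n // 2 is
    used for the half-width of the neighbourhood (an assumption).
    """
    num_kernels = len(a)
    half = n // 2
    out = []
    for i in range(num_kernels):
        lo = max(0, i - half)
        hi = min(num_kernels - 1, i + half)
        # Sum of squared activations over the neighbouring kernels j.
        sq_sum = sum(a[j] ** 2 for j in range(lo, hi + 1))
        out.append(a[i] / (k + alpha * sq_sum) ** beta)
    return out
```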
V. EVALUATION
In this section, we evaluate the applications in Altis in terms
of runtime characteristics, diversity, and performance.
A. Experimental Setup
All measurements are obtained by executing the applications
on real hardware. The benchmarks are evaluated on an NVIDIA
Tesla P100 GPU with a 1.48 GHz shader clock. The CPU is an
Intel Xeon E5-2650 running at 2.2 GHz. We use this setup as
the standard platform to collect all metrics. We used NVIDIA
driver version 418.87 and CUDA version 10.0. The benchmarks
are also executed on an NVIDIA Tesla M60 running at 1.18
GHz and an NVIDIA GeForce GTX 1080 running at 1.85 GHz
to collect utilization metrics. The operating system is Ubuntu
18.04.3 LTS.
Fig. 7: Pearson correlation matrix for Altis
(Both axes list the workloads in the same order: bfs, gemm, pathfinder, sort, cfd, dwt2d, gups, kmeans, lavamd, mandelbrot, nw, particlefilter, srad, where, raytracing, followed by the forward (_fw) and then the backward (_bw) versions of activation, avgpool, batchnorm, connected, convolution, dropout, normalization, rnn, and softmax. The color scale ranges from -1.0 to 1.0.)
B. Benchmark Performance
We use nvprof to collect the metrics of individual kernels.
Note that a number of benchmarks involve multiple kernels, some
of which may be redundant. In such cases, we select the maximum
utilization of each kernel. The memory and computational unit
utilization rates are provided in Figure 5. The utilization rates
of different GPU components show a diverse set of behaviors
for both forward and backward passes in the DNN section of the
benchmark. We observe that the most utilized components are
DRAM and the single precision floating point functional unit.
Figure 8 and Figure 6 show the Altis workloads in PCA
space, as well as the contribution of the top 10 metrics to
the first through fourth dimensions of that space. We selected
counters by performing PCA over all the counters
supported by nvprof, and selecting with preference for the
most dominant metrics. Because nvprof and GPGPU-sim
provide different metrics, we preferred the metrics in nvprof
for which there was a clear mapping to a corresponding metric
in GPGPU-sim. The major GPGPU kernel characteristics can
be categorized into several types: efficiency and utilization,
arithmetic features, stall behavior, instruction mix, and memory
hierarchy behaviors. The complete set of metrics used is shown
in Table I.
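To make the notion of a metric's "contribution" to a principal component concrete, the sketch below standardizes a benchmark-by-metric matrix, finds the first principal component, and reports each metric's share of it as a percentage (the squared loading of a unit-length component). This is an illustrative sketch, not the analysis pipeline used for the figures: power iteration stands in for a full eigendecomposition, it recovers only the first component, it assumes no metric is constant across benchmarks, and all function names are hypothetical.

```python
import math


def standardize(rows):
    """Z-score each column (metric) across rows (benchmarks)."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [
        math.sqrt(sum((v - mu) ** 2 for v in c) / (n - 1))
        for c, mu in zip(cols, means)
    ]
    return [
        [(v - mu) / sd for v, mu, sd in zip(row, means, stds)] for row in rows
    ]


def first_component(z, iters=200):
    """Leading eigenvector of the covariance of z via power iteration."""
    n, m = len(z), len(z[0])
    cov = [
        [sum(z[r][i] * z[r][j] for r in range(n)) / (n - 1) for j in range(m)]
        for i in range(m)
    ]
    # Non-uniform start vector to avoid trivially orthogonal starts.
    v = [float(j + 1) for j in range(m)]
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def contributions(component):
    """Percentage contribution of each metric to a unit-length component."""
    return [100.0 * x * x for x in component]
```

Ranking the values returned by `contributions` and keeping the ten largest mirrors the kind of "top 10 variables" view discussed above, restricted here to a single dimension.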
The PCA data show that use of new CUDA features and
larger inputs can significantly affect the position of a benchmark
in the space, reflecting the fact that bottleneck components change
when emerging features are exercised. For example, lavaMD
is an outlier in all cases because it uses double-precision units
rarely exercised in other workloads, but use of UVM shifts
the bottleneck to pipeline stalls. The raytracing and nw
workloads behave similarly.
The IPC-related metrics contribute the most to the variance
in PC1, while double precision functional unit metrics are more
prevalent in PC2. This is because only a few benchmarks use
double precision data types, as indicated in Figure 5. The new
workloads and extended versions of previous workloads enable
Altis to better cover the PCA space. Altis's new workloads,
raytracing, and many of the DNN kernels are clearly at
extrema of the space.
Fig. 8: Altis PCA. Blue and gray represent Altis running on
small and large input datasets respectively.
Category | Metrics
Util & Efficiency | branch_efficiency, warp_execution_efficiency, warp_nonpred_execution_efficiency, inst_replay_overhead, gld_efficiency, gst_efficiency, ipc, issued_ipc, issue_slot_utilization, sm_efficiency, achieved_occupancy, eligible_warps_per_cycle, ldst_fu_utilization, cf_fu_utilization, tex_fu_utilization, special_fu_utilization
Arithmetic | inst_integer, inst_fp_32, inst_fp_64, inst_bit_convert, flop_count_dp, flop_count_dp_add, flop_count_dp_fma, flop_count_dp_mul, flop_count_dp_mul, flop_count_sp, flop_count_sp_add, flop_sp_efficiency, flop_count_sp_fma, flop_count_sp_mul, flop_count_sp_special, single_precision_fu_utilization, double_precision_fu_utilization
Stall | stall_inst_fetch, stall_exec_dependency, stall_memory_dependency, stall_texture, stall_sync, stall_constant_memory_dependency, stall_pipe_busy, stall_memory_throttle, stall_not_selected
Instructions | inst_executed_global_loads, inst_executed_local_loads, inst_executed_shared_loads, inst_executed_local_stores, inst_executed_shared_stores, inst_executed_global_reductions, inst_executed_tex_ops, l2_global_reduction_bytes, inst_executed_global_stores, inst_per_warp, inst_control, inst_compute_ld_st, inst_inter_thread_communication, ldst_issued, ldst_executed
Cache & Mem | local_load_transactions_per_request, global_hit_rate, local_hit_rate, tex_cache_hit_rate, l2_tex_read_hit_rate, l2_tex_write_hit_rate, ram_utilization, shared_efficiency, shared_utilization, l2_utilization, tex_utilization, l2_tex_hit_rate
TABLE I: Metrics used to create Altis' PCA metric space.
Utilization of the floating point functional unit is closely related to
IPC for most kernels. For example, convolution is compute
intensive, which results in high IPC, as shown in Figure 9. Low
utilization of the single precision functional unit results in low
IPC for batch normalization. The number of eligible warps
per cycle is also high for convolution and low
for batch normalization. This is because
convolution has relatively good data locality, whereas batch
normalization requires more memory operations, which reduces
the number of warps eligible to issue the next instruction. This
is a sign that the convolution operation is compute bound and
batch normalization is memory bound.
Figure 10 shows the eligible warps per cycle in Altis. Eligible
warps tend to show how often the benchmark makes data
requests, and it correlates with IPC. For example, gemm and
connected_fw are heavily computation bound since they
are essentially matrix-matrix multiplication. In contrast, gups
always requests a single (randomly chosen) unit of data from
DRAM for each read, and the resulting stalls result in very
low eligible warps per cycle.
Fig. 9: IPC for Altis measured with the largest supported data
set size. (y-axis: IPC; x-axis: workload)
Fig. 10: Altis Average Eligible Warps Per Cycle (y-axis: eligible warps per cycle; x-axis: workload)
The utilization of the remaining benchmarks in Altis shows
a diverse range of values. The utilization of each GPU component is
increased compared to SHOC. This can be explained by the
increase in input data size, which demonstrates the importance
of having user-defined input problem sizes to stress hardware
performance. These benchmarks also differ from the DNN kernels.
DNN kernels tend to stress DRAM and single precision functional
units heavily, while the conventional benchmarks exhibit a more
diverse utilization of each component. Compared to Figure 1,
Figure 7 shows a great deal of variance in the correlation
between benchmarks. The strong correlation between gemm and
convolution kernels indicates they share similar characteristics,
as both are compute bound. On the other hand, gups has almost
no correlation with convolution, since random memory access
is heavily memory bound. The correlation matrix shows
many pairs of applications with little correlation, indicating diverse
GPU behaviors. In addition, relative to the utilization reported
in §II for Rodinia and SHOC, the hardware is much more
fully utilized, and the majority of workloads have at least one
resource whose utilization is a significant fraction of peak.
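The pairwise correlations discussed above are Pearson coefficients computed between per-benchmark metric vectors. A minimal sketch of that computation follows; the function names and the idea of passing a dictionary from benchmark name to metric vector are assumptions for illustration, and no measured values from the paper are reproduced here.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(profiles):
    """profiles: dict mapping benchmark name -> list of metric values."""
    names = sorted(profiles)
    return {
        a: {b: pearson(profiles[a], profiles[b]) for b in names}
        for a in names
    }
```

A coefficient near 1 indicates two benchmarks stress the hardware in similar proportions, while a value near 0 indicates largely unrelated behavior.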
Discussion: Insights from Altis. Overall, the data illustrate
Altis’s primary value in providing new workloads sized and
crafted to actually utilize the hardware and exercise new fea-
tures. While DRAM and floating point are known to be critical
to GPU performance, and Altis will not change that, exercising
new features at meaningful scale has revealed new insights. For
example, the first order bottlenecks for some workloads have
changed, as Figure 8 and Figure 5 illustrate. UVM may decrease
performance for some workloads, but increases utilization under
several metrics: while this observation has been made by others,
a shared suite that exercises the feature in an understandable
way prevents researchers from having to directly modify suites
in potentially different ways to model UVM, as was necessary
for works such as [39], [40]. DRAM utilization, single precision
functional unit utilization, and unified cache utilization are all
metrics that show significant changes in behavior in Altis
compared to the original versions. As we show in per-feature
analysis below, new features such as Cooperative Groups have
unpredictable and undesirable performance artifacts: supporting
them in Altis enables research into how to improve them.
C. CUDA Feature Analysis
In this section, we analyze a subset of benchmarks with new
CUDA features to explore each feature’s impact.
Unified Memory : We measure kernel time plus transfer
time of BFS without UVM and compare to kernel time with
UVM, since there is no explicit transfer time with UVM.
Three different versions of BFS using UVM were measured
and compared to the baseline with no new CUDA features.
The first version uses UVM without cudaMemAdvise()
or cudaMemPrefetchAsync(). The second uses
only cudaMemAdvise(), and the last uses both
cudaMemAdvise() and cudaMemPrefetchAsync().
BFS with UVM is faster than the baseline version only
with prefetching enabled. Additionally, the speedup was
inconsistent and did not scale with the input size. This
is because the execution path is highly dependent on the
generated graph. Although the data is randomly generated, the
irregular access patterns typified by graph workloads are a challenge
for UVM because hiding frequent demand paging latency
depends on prefetching, which is not always effective.
Fig. 11: BFS Speedup Using Unified Memory (x-axis: Number of Nodes (Power of 2); y-axis: Speedup; series: UM, UM+Advise, UM+Advise+Prefetch)
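The speedup metric used for the unified memory comparison divides the baseline's kernel time plus explicit transfer time by the UVM version's kernel time alone. The following sketch expresses that ratio; the function and parameter names are hypothetical, and any timing values passed to it are the caller's own measurements.

```python
def uvm_speedup(baseline_kernel_s, baseline_transfer_s, uvm_kernel_s):
    """Speedup of a UVM run over a baseline with explicit transfers.

    The baseline cost includes its explicit transfer time; the UVM run
    has no explicit transfers, so only its kernel time is counted.
    """
    if uvm_kernel_s <= 0:
        raise ValueError("uvm_kernel_s must be positive")
    return (baseline_kernel_s + baseline_transfer_s) / uvm_kernel_s
```

A value above 1 means the UVM variant is faster than the baseline under this accounting, and a value below 1 means it is slower.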
HyperQ : We explore HyperQ impact on Pathfinder. HyperQ
increases SM utilization when multiple independent kernels
can execute concurrently. Our updated pathfinder runs multiple
instances on different streams. Figure 12 shows the speedup
as the number of concurrent kernels increases. Transfer time
is elided because it is constant.
Speedup from HyperQ increases with the number of parallel
kernels, and levels out around 32 instances, when it saturates
all 32 work queues supported by the physical hardware. We
see speedup starting at a little under 1x for a single instance,
and up to 4x thereafter. This occurs because increasing the
number of instances increases queue occupancy, but aggregate
throughput becomes limited by available SMs.
Fig. 12: Pathfinder Speedup Using HyperQ (x-axis: Number of Instances (Power of 2); y-axis: Speedup)
Cooperative Groups : We measure and compare SRAD
using cooperative kernels to the kernel time for the original
implementation. Transfer time is elided. A key concern for
cooperative groups is the limit on the number of blocks the
GPU is able to launch. SRAD using a cooperative kernel could
not be run on image sizes greater than 256x256. We vary the
problem size by multiples of 16. The data in Figure 13 show
that the feature provides minimal performance benefit in a
handful of cases, and can harm performance significantly in
others. Supporting this feature in a way that enables developers
to reason about and predict the utility of the feature is a possible
direction for research exposed by Altis.
Fig. 13: SRAD Speedup Using Cooperative Groups (x-axis: Image Dimension (Multiple of 16); y-axis: Speedup)
Dynamic Parallelism : For this feature, the speedup was
measured using the kernel times for Mandelbrot with and
without Dynamic Parallelism. Transfer time was not included.
The benchmark shows a smooth increase in speedup as problem
sizes increase. This is due to the more efficient algorithm
and the more efficient coding of the problem that
Dynamic Parallelism enables. While the traditional Escape
Time algorithm is forced to calculate values for every pixel,
Mariani-Silver can subdivide and thus ignore ever increasing
swaths of the image without requiring synchronization or
relaunch of multiple kernels.
Fig. 14: Mandelbrot Speedup using Dynamic Parallelism (x-axis: Image Dimension (Power of 2); y-axis: Speedup)
CUDA Graph : For this feature, the speedup was measured
using the frame processing times for particlefilter with and
without CUDA Graph. The frame dimension is set to 30×30 and
the frame count is set to 40. Data transfer time and data initialization
time are not included. The data in Figure 15 demonstrate a
slight speedup as the number of points increases. By defining work
as graphs, all the kernel work descriptors are pre-initialized
and can be launched repeatedly as rapidly as possible, thus
reducing launch overheads. As the data size increases, the
kernel launch time is overshadowed by the computation time,
so the speedup diminishes.
Fig. 15: Particlefilter Speedup Using CUDA Graph (x-axis: Number of Points (power of 2 times 100); y-axis: Speedup)
VI. CONCLUSION
Our goal with Altis is to modernize GPGPU benchmark
suites. We improved the diversity over existing benchmarks,
introducing new workloads from different domains, adapting
problem sizes to better match modern hardware, and adding
support for new CUDA features. We hope Altis will serve as
a more complete GPGPU benchmark suite for modern GPGPU
research.
VII. ACKNOWLEDGEMENTS
We thank the PC reviewers and our shepherd for their
valuable feedback. We thank Christopher Denny, Sarah Wang,
and Vance Miller, who contributed to the engineering effort
to build many of the workloads in Altis. This research was
supported by NSF grants CNS-1618563 and CNS-1846169.
REFERENCES
[1] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. Lee, and
K. Skadron, “Rodinia: A benchmark suite for heterogeneous computing,”
in 2009 IEEE International Symposium on Workload Characterization
(IISWC), pp. 44–54, Oct 2009.
[2] A. Danalis, G. Marin, C. McCurdy, J. S. Meredith, P. C. Roth, K. Spafford,
V. Tipparaju, and J. S. Vetter, “The scalable heterogeneous computing
(shoc) benchmark suite,” in Proceedings of the 3rd Workshop on General-
Purpose Computation on Graphics Processing Units, GPGPU-3, (New
York, NY, USA), pp. 63–74, ACM, 2010.
[3] N. Hemsoth, “Medical Imaging Drives GPU Accelerated Deep Learning
Developments,” Nov. 2017. Accessed: September 19, 2019.
[4] E. Smistad, T. L. Falch, M. Bozorgi, A. C. Elster, and F. Lindseth,
“Medical image segmentation on GPUs A comprehensive review,”
Medical Image Analysis, vol. 20, no. 1, pp. 1–18, 2015.
[5] ThinCI, “ThinCI.” Accessed: September 25, 2019.
[6] ThinCI, “Startup Unveils Graph Processor at Hot Chips.” Accessed:
September 25, 2019.
[7] NVIDIA, “nvGRAPH :: CUDA Toolkit Documentation.” Accessed:
September 25, 2019.
[8] NVIDIA, “Driving Innovation: Building AI-Powered Self-Driving Cars.”
Accessed: September 19, 2019.
[9] D. Altavilla, “AMD And NVIDIA Shares Surge As Blockchain Demand
For GPUs Spurs Rally,” 2018. [Online; accessed September 2018].
[10] “Cuda toolkit documentation v10.1.243.” https://docs.nvidia.com/cuda/
index.html. Accessed: 2019-10-15.
[11] D. Foley, “Ultra-Performance Pascal GPU and NVLink Interconnect,” in
HotChips, 2016.
[12] I. Tanasic, I. Gelado, J. Cabezas, A. Ramirez, N. Navarro, and M. Valero,
“Enabling Preemptive Multiprogramming on GPUs,” in ISCA, 2014.
[13] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S.
Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow,
A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur,
J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah,
M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker,
V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden,
M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale
machine learning on heterogeneous systems,” 2015. Software available
from tensorflow.org.
[14] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin,
A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in
pytorch,” 2017.
[15] K. Asanović, R. Bodik, B. Catanzaro, J. Gebis, P. Husbands, K. Keutzer,
D. Patterson, W. Plishker, J. Shalf, S. Williams, and K. Yelick, “The
landscape of parallel computing research: A view from berkeley,” tech.
rep., 2006.
[16] “Cuda c++ programming guide.” https://docs.nvidia.com/cuda/
cuda-c-programming-guide/index.html. Accessed: 2019-10-15.
[17] “Profiler user’s guide.” https://docs.nvidia.com/cuda/profiler-users-guide/
index.html. Accessed: 2019-10-15.
[18] M. Kulkarni, M. Burtscher, C. Cascaval, and K. Pingali, “Lonestar:
A suite of parallel irregular programs,” in 2013 IEEE International
Symposium on Performance Analysis of Systems and Software (ISPASS),
(Los Alamitos, CA, USA), IEEE Computer Society, apr 2009.
[19] S. Che, B. M. Beckmann, S. K. Reinhardt, and K. Skadron, “Pannotia:
Understanding irregular gpgpu graph applications,” in 2013 IEEE
International Symposium on Workload Characterization (IISWC), pp. 185–
195, Sep. 2013.
[20] M. A. O’Neil and M. Burtscher, “Microarchitectural performance
characterization of irregular gpu kernels,” in 2014 IEEE International
Symposium on Workload Characterization (IISWC), pp. 130–139, Oct
2014.
[21] M. D. Sinclair, J. Alsop, and S. V. Adve, “Heterosync: A benchmark
suite for fine-grained synchronization on tightly coupled gpus,” in 2017
IEEE International Symposium on Workload Characterization (IISWC),
pp. 239–249, Oct 2017.
[22] J. A. Stratton, C. I. Rodrigues, I.-J. Sung, N. Obeid, L.-W. Chang,
N. Anssari, G. D. Liu, and W. mei W. Hwu, “Parboil: A revised
benchmark suite for scientific and commercial throughput computing,”
2012.
[23] A. Jog, O. Kayiran, T. Kesten, A. Pattnaik, E. Bolotin, N. Chatterjee,
S. W. Keckler, M. T. Kandemir, and C. R. Das, “Anatomy of gpu
memory system for multi-application execution,” in Proceedings of the
2015 International Symposium on Memory Systems, MEMSYS ’15, (New
York, NY, USA), pp. 223–234, ACM, 2015.
[24] J. Gomez-Luna, I. Hajj, L. Chang, V. Garcia-Flores, S. de Gonzalo, T. B.
Jablin, A. J. Pena, and W. Hwu, “Chai: Collaborative heterogeneous
applications for integrated-architectures,” in 2017 IEEE International
Symposium on Performance Analysis of Systems and Software (ISPASS),
(Los Alamitos, CA, USA), pp. 43–54, IEEE Computer Society, apr 2017.
[25] H. Zhu, M. Akrout, B. Zheng, A. Pelegris, A. Jayarajan, A. Phanishayee,
B. Schroeder, and G. Pekhimenko, “Benchmarking and analyzing deep
neural network training,” in 2018 IEEE International Symposium on
Workload Characterization (IISWC), pp. 88–100, Sep. 2018.
[26] J. Lew, D. A. Shah, S. Pati, S. Cattell, M. Zhang, A. Sandhupatla, C. Ng,
N. Goli, M. D. Sinclair, T. G. Rogers, and T. M. Aamodt, “Analyzing
machine learning workloads using a detailed gpu simulator,” in 2019
IEEE International Symposium on Performance Analysis of Systems and
Software (ISPASS), pp. 151–152, March 2019.
[27] S. A. Mojumder, M. S. Louis, Y. Sun, A. K. Ziabari, J. L. Abellán,
J. Kim, D. Kaeli, and A. Joshi, “Profiling dnn workloads on a volta-based
dgx-1 system,” in 2018 IEEE International Symposium on Workload
Characterization (IISWC), pp. 122–133, Sep. 2018.
[28] A. Karki, C. P. Keshava, S. M. Shivakumar, J. Skow, G. M. Hegde, and
H. Jeon, “Tango: A deep neural network benchmark suite for various
accelerators,” jan 2019.
[29] J. Redmon, “Darknet: Open source neural networks in c.” http://pjreddie.
com/darknet/, 2013–2016.
[30] P. R. Luszczek, D. H. Bailey, J. J. Dongarra, J. Kepner, R. F. Lucas,
R. Rabenseifner, and D. Takahashi, “The hpc challenge (hpcc) benchmark
suite,” in Proceedings of the 2006 ACM/IEEE Conference on
Supercomputing, SC ’06, (New York, NY, USA), ACM, 2006.
[31] N. Satish, M. Harris, and M. Garland, “Designing efficient sorting
algorithms for manycore gpus,” in 2009 IEEE International Symposium
on Parallel Distributed Processing, pp. 1–10, May 2009.
[32] “Adaptive parallel computation with cuda dynamic parallelism.”
https://devblogs.nvidia.com/introduction-cuda-dynamic-parallelism/. Accessed:
2019-10-15.
[33] C. J. Rossbach, Y. Yu, J. Currey, J.-P. Martin, and D. Fetterly, “Dandelion:
A compiler and runtime for heterogeneous systems,” in Proceedings of
the Twenty-Fourth ACM Symposium on Operating Systems Principles,
SOSP ’13, (New York, NY, USA), pp. 49–68, ACM, 2013.
[34] “Ray tracing in one weekend in cuda.”
https://github.com/rogerallen/raytracinginoneweekendincuda/tree/ch12_where_next_cuda.
Accessed: 2019-10-15.
[35] “Rtow-optix.” https://github.com/ingowald/RTOW-OptiX. Accessed:
2020-05-15.
[36] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network
training by reducing internal covariate shift,” in ICML, 2015.
[37] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdi-
nov, “Dropout: A simple way to prevent neural networks from overfitting,”
Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
[38] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
with deep convolutional neural networks,” in Advances in Neural
Information Processing Systems 25 (F. Pereira, C. J. C. Burges, L. Bottou,
and K. Q. Weinberger, eds.), pp. 1097–1105, Curran Associates, Inc.,
2012.
[39] R. Ausavarungnirun, J. Landgraf, V. Miller, S. Ghose, J. Gandhi,
C. J. Rossbach, and O. Mutlu, “Mosaic: A GPU Memory Manager
with Application-transparent Support for Multiple Page Sizes,” in
Proceedings of the 50th Annual IEEE/ACM International Symposium on
Microarchitecture, MICRO-50 ’17, (New York, NY, USA), pp. 136–150,
ACM, 2017.
[40] T. Zheng, D. Nellans, A. Zulfiqar, M. Stephenson, and S. W. Keckler,
“Towards High Performance Paged Memory for GPUs,” in HPCA, 2016.
