Buddy Compression: Enabling Larger Memory for Deep Learning and HPC
  Workloads on GPUs by Choukse, Esha et al.
Buddy Compression: Enabling
Larger Memory for Deep Learning and HPC Workloads on GPUs
Esha Choukse
University of Texas at Austin
esha.choukse@utexas.edu
Michael B. Sullivan
NVIDIA
misullivan@nvidia.com
Mike O’Connor
NVIDIA
moconnor@nvidia.com
Mattan Erez
University of Texas at Austin
mattan.erez@utexas.edu
Jeff Pool
NVIDIA
jpool@nvidia.com
David Nellans
NVIDIA
dnellans@nvidia.com
Stephen W. Keckler
NVIDIA
skeckler@nvidia.com
ABSTRACT
GPUs offer orders-of-magnitude higher memory bandwidth than
traditional CPU-only systems. However, their memory capacity tends
to be relatively small and cannot be increased by the user. This
work proposes Buddy Compression, a scheme to increase both the
effective GPU memory capacity and bandwidth, while avoiding the
downsides of conventional memory-expansion techniques. Buddy
Compression splits each compressed memory-entry between
high-speed GPU memory and a slower-but-larger disaggregated
memory pool or host-CPU memory, such that highly-compressible
memory-entries are accessed completely from GPU memory, while
incompressible entries source some of their data from off-GPU
memory. We show that Buddy Compression achieves an average
compression ratio of 1.9x for representative HPC applications and
1.5x for deep learning workloads, with a performance within 2%
of an ideal system containing 100% high speed and high capacity
memory. This makes Buddy Compression an ideal candidate for
developers who require some additional memory capacity and can
tolerate a minimal performance penalty.
1 INTRODUCTION
GPUs are widely used for many high-memory-footprint applica-
tions, including those for High Performance Computing (HPC) and
Deep Learning (DL). HPC applications like weather prediction and
the modeling of fluid and molecular dynamics have grown to re-
quire very large models [1, 2]. DL networks are also developing in
a direction where either the model sizes are too big to run on GPUs,
or they are large enough that only a small batch size can fit
on the GPU, resulting in low utilization and accuracy issues [3–5].
Today, applications with large memory footprints must: (i) scale out
to many GPUs for capacity purposes (inefficient resource utiliza-
tion) [6, 7], (ii) explicitly orchestrate data movement between the
host CPU and the GPU to stay within device memory limitations
(adding algorithmic complexity) [4, 8, 9], or (iii) rely on off-GPU
memory accesses or Unified Memory [10] to automatically oversub-
scribe device memory (limiting performance) [11, 12]. In this paper,
we explore memory compression as a solution to this challenge.
Main memory compression has been studied in detail for
CPUs [13–17]. However, GPU-specific workloads and architectural
Fig. 1: In Buddy Compression, if a memory-entry (128B) does not
compress sufficiently, part of it is accessed from the buddy-memory.
Fig. 2: A target system for Buddy Compression. Any larger NVLink-
connected memory is used as buddy storage. Overall organization
is like NVIDIA DGX-2 [20].
[Figure labels: GPU (e.g. V100) with compressed memory; NVSwitch; NVLink2, 150 GB/s full duplex; uncompressible traffic; buddy storage alternatives: CPU memory (e.g. Power9), unused peer GPU memory, disaggregated memory.]
details pose very different trade-offs. For instance, the CPU
solutions assume the compressed pages to be of different sizes
and that they can be re-allocated as the compression ratios
change [14–17]. Due to the immense device memory bandwidth,
relying on such on-the-fly page re-allocations has a huge impact
on throughput in GPUs [11].
While domain-specific compression [18, 19] has been explored
to help with large workloads on GPUs, hardware memory
compression to increase memory capacity for general purpose
applications in GPUs remains unexplored.
In Buddy Compression, we compress the data and divide the
compressed memory allocations between the GPU device memory
and a larger-but-slower buddy-memory connected with a high-
bandwidth interconnect (Figure 1). If a memory-entry is sufficiently
compressed, it is sourced completely from device memory. If not,
it is sourced from both device and buddy-memory. This design
requires no re-allocations within the device memory as the data
changes compressibility over time. A high-bandwidth interconnect
like NVLink [20], OpenCAPI [21] or PCIe5.0 [22] is the enabling
feature for this design, since it ensures low overhead accesses to the
buddy-memory, as long as most of the data is in GPU device memory.
Any remote memory that is connected to the GPU with a high-
bandwidth interconnect is suitable for being used as buddy-memory
(Figure 2). As we demonstrate in the rest of the paper, this design
arXiv:1903.02596v2 [cs.AR] 15 Apr 2019
maintains a good compression ratio and high performance, while
avoiding the complexity and performance concerns of mapping
CPU memory compression approaches on GPUs. To summarize the
research contributions of this work:
• We provide an in-depth analysis of the data of representative GPU
workloads and derive insights for effective GPU compression.
• We introduce the first design to use general-purpose compression
to increase the memory capacity of GPUs. Buddy Compression is
unique, since it does not require any additional data movement
as the compressibility of the data changes.
• We show that Buddy Compression achieves 1.9x compression
and performs within 2% of an ideal, high-memory-capacity GPU.
• Finally, we present a case study on DL training to understand
the benefits and trade-offs of using Buddy Compression.
2 OVERVIEW
2.1 Target Workloads and Motivation
HPC Workloads. Several important HPC applications, ranging
from fluid dynamics to weather prediction, have found GPUs to
be the accelerator of choice. These models have outgrown GPU
device memory [1, 2]. Today, scientists use either Unified Memory
or multiple GPUs to scale the models. Device memory compression
can be very useful for such cases. We use a subset of SpecAccel and
DOE FastForward benchmarks to represent these HPC applications.
The subset is chosen based on the confidence in the representative-
ness of the data values used in the benchmarks. All the discarded
benchmarks seemed to have large portions of their working sets
be zero, thereby having extremely high compression ratios.
DL Workloads. GPUs are currently the most popular choice
for training deep neural networks. As these networks grow deeper
and wider, they require more data and are inevitably hitting
the memory-capacity wall, as we discuss in detail in Section 4.4.
While many domain-specific solutions across the stack have
been proposed to deal with this memory capacity challenge in
deep learning training [6, 9, 19, 23–27], our proposal requires no
algorithm-level changes, and applies to other classes of workloads
as well. We use a set of convolutional neural networks (CNNs)
and one recurrent neural network (RNN) to represent this class
of workloads in our evaluation.
2.2 Related Work
There are two approaches to tackle memory limitations: compres-
sion, and domain-specific techniques. The graphics pipeline of
most GPUs includes texture memory that is lossily compressed
Tab. 1: Details of the GPU Benchmarks Used
HPC SpecAccel              HPC FastForward
351.palm      2.89GB       FF_HPGMG-FV      2.32GB
352.ep        2.75GB       FF_Lulesh        1.59GB
354.cg        1.23GB       DL Training
355.seismic   2.83GB       BigLSTM          2.71GB
356.sp        2.83GB       AlexNet          8.85GB
357.csp       1.44GB       Inception_V2     3.21GB
360.ilbdc     1.94GB       SqueezeNetv1.1   2.03GB
370.bt        1.21MB       VGG16            11.08GB
                           ResNet50         4.50GB
offline using tailored compression algorithms [18, 28] in order to
reduce the footprint of these textures. The deep learning space
has domain-specific solutions across the stack [19, 24, 25]. Buddy
Compression is orthogonal to most of these proposals. For instance,
vDNN [25] proposes running large DL networks using manual
offloading of data layer-by-layer. However, there are still cases
where it fails, due to the inability to fit data required for just one
layer [29]. Buddy Compression can enable running larger networks
with vDNN. To our knowledge, hardware compression is not
currently used for general-purpose compute workloads on GPUs.
In CPUs, for decades, memory compression in various forms has
been proposed and used as a solution to the memory capacity chal-
lenge. Most modern operating systems compress the swap space to
reduce paging to disk [30]. There have been numerous proposals
for hardware-accelerated main memory compression [13–17].
2.3 Relevant Modern GPU Technology
Unified Memory (UM). Unified Memory (UM), introduced in
CUDA 8 for Pascal-class GPUs [10], allows sharing a single
memory space across a heterogeneous node. Non-local UM
requests either remotely access data through the GPU interconnect
or result in transparent data migration with the placement of any
piece of data being determined by a variety of heuristics [10, 31].
UM supports memory over-subscription, allowing UM-managed
regions that are larger than the GPU device memory to be
accessed without explicit data movement. This has not been widely
adopted, since applications with large hot working sets experience
frequent page faults and thrashing with Unified Memory, causing
high performance overheads [29, 31, 32]. Similar solutions are
available for AMD and ARM GPUs using Heterogeneous System
Architecture (HSA) and CCIX [33, 34], and very recently, for Intel
GPUs using Compute eXpress Link (CXL) [22].
High Bandwidth Interconnects. In recent years, high
bandwidth interconnects like NVLink [35], OpenCAPI [21], and
NVLink2 [20] have been used to alleviate the communication
bottleneck in multi-GPU systems. Buddy Compression is made
possible due to these high-bandwidth interconnects. NVLink2
provides 25GBps of full-duplex unidirectional bandwidth per brick.
Modern compute-class V100 GPUs support six NVLink2 bricks
per GPU, offering a bidirectional bandwidth of up to 150GBps
(full-duplex), much higher than the 16GBps ×16 PCIe3.0 full-duplex
connection. The NVIDIA DGX-2 [20] workstation has sixteen V100
GPUs connected through an NVLink2 switch that supports the full
150GBps traffic between any two GPUs. IBM Power9 CPUs also
support six NVLink2 bricks, allowing high-speed remote access
to system memory [36].
Buddy Compression Target System. Given the trends in modern
GPU nodes, the future-facing system we envision for Buddy
Compression is shown in Figure 2. It is composed of NVSwitch-
connected multi-GPU nodes with NVLink2-based access to a larger
source of remote memory. In currently available systems, this re-
mote memory could be the system memory of a Power9 CPU, or
unused peer GPU memory. While we know of no current NVLink-
connected disaggregated memory appliance, such a device is a natural
extension of the technology that is being explored for servers [37, 38].
The high-bandwidth interconnect is what enables Buddy Compression.
So, as long as the remote memory sources operate at the
full NVLink2 bandwidth, Buddy Compression applies equally well.
Any high bandwidth interconnect can be used in place of NVLink2.
2.4 Compression Design Considerations
There are some important design choices and challenges that need
to be addressed in a compressed memory proposal. We present
these design points in brief.
Compression Algorithms. A hardware memory compression
algorithm should be fast and require little energy, yet result in
high compression rates. After comparing several algorithms [39–43],
we choose Bit-Plane Compression (BPC) [43] for Buddy Compression.
BPC has been shown to have high compression ratios for GPU
benchmarks when applied for DRAM bandwidth compression.
Compression Granularity. This is the unit of data that is com-
pressed or decompressed together. A higher compression granu-
larity requires less metadata, and generally results in higher com-
pression. On the other hand, a lower compression granularity does
not require as many read-modify-write operations. Most CPU main
memory compression work uses a cache-block sized compression
granularity to avoid these overheads. We share this design deci-
sion, and, following the results of microbenchmarks [44], use a 128B
memory-entry as the compression granularity for GPUs.
Translation Metadata. The compressed size of each 128B
memory-entry depends on its compressibility. This requires ad-
ditional translation metadata to access compressed data. This meta-
data lookup generally lies on the critical path for every access to the
memory. The layout of a compressed main-memory is somewhat
similar in all previous work on main memory compression in CPUs.
There is some space dedicated for the compression metadata, and
the rest is arranged as variable-sized pages. Page size is decided by
the compressibility of the data within a page.
Compressed Data Movement and Allocation. As data is written
back to a compressed memory, value changes can lead to
changes in compressibility. This means that a memory-entry can
grow or shrink over time, leading to additional data movement [17].
Allocation granularity is closely related to the data movement over-
head. For example, current systems allocate memory at page gran-
ularity. Changes in the size of one memory-entry can lead to data
movement within and across the pages. However, if each memory-
entry is separately allocated and translated, its size changes do not
affect other data.
3 BUDDY COMPRESSION
3.1 Compressibility of Workloads
To estimate the possible gains from compression, it is imperative to
first find how compressible the high-footprint GPU workloads are.
To this end, we take memory dumps of the workloads running on a
Tesla P100 GPU with an Intel Xeon E5 host. We intercept each GPU
malloc and free API call (including variants for pinned and Unified
Memory-managed memory) to dynamically track the current
allocated regions in device memory. We divide the entire runtime
of the workload into 10 regions, and at kernel boundaries of each
region, take a memory dump of the allocated device memory.
Figure 3 shows the compression ratio of each benchmark
using BPC compression [43] over its entire run. Note that these
compression ratios are quite optimistic for capacity compression,
since they assume eight different compressed memory-entry
sizes are available (0B, 8B, 16B, 32B, 64B, 80B, 96B, and 128B) and
assume no page-packing overheads. That is, each memory-entry
is individually compressed and allowed to occupy any of the above
mentioned sizes. The geometric mean of the compression
ratio is 2.51 for the HPC benchmarks and
1.85 for the DL benchmarks. This is a higher average as compared
to prior work on CPU workloads [17], and can be attributed to
the higher percentage of homogeneous data allocations (with
a single uniform datatype). Prior work has established that
Bit-Plane Compression works well for homogeneous data, and
such homogeneity is prevalent in GPU workloads [43].
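As a rough illustration of how such ratios can be computed from a memory snapshot, the sketch below rounds each entry's raw compressed size up to the nearest of the eight allowed sizes and divides the uncompressed footprint by the stored footprint. The function names and the input format (a list of per-entry compressed byte counts, as a compressor such as BPC might report them) are assumptions made for this example and do not describe the tooling used in the evaluation.

```python
import bisect

# Allowed compressed memory-entry sizes in bytes, as listed in the text.
ALLOWED_SIZES = [0, 8, 16, 32, 64, 80, 96, 128]
ENTRY_BYTES = 128  # uncompressed memory-entry size


def quantized_size(compressed_bytes: int) -> int:
    """Round a raw compressed size up to the smallest allowed size."""
    if not 0 <= compressed_bytes <= ENTRY_BYTES:
        raise ValueError("compressed size must be within 0..128 bytes")
    return ALLOWED_SIZES[bisect.bisect_left(ALLOWED_SIZES, compressed_bytes)]


def compression_ratio(compressed_sizes: list[int]) -> float:
    """Uncompressed bytes divided by quantized stored bytes (no packing overhead)."""
    stored = sum(quantized_size(s) for s in compressed_sizes)
    if stored == 0:
        return float("inf")  # every entry was all-zero
    return ENTRY_BYTES * len(compressed_sizes) / stored
```

For instance, four entries that compress to 0B, 20B, 64B, and 128B are stored as 0B, 32B, 64B, and 128B, giving a ratio of 512/224, or roughly 2.29.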
Compressibility Changes. As compared to previously studied
workloads [14, 15, 17], the compressibility changes over time more
often in GPU benchmarks. The most extreme example is 355.seismic,
which begins with many zero values but slowly asymptotes to a 2x
compression ratio over its execution. We also observe that although
the overall compression ratio of the DL workloads stays roughly
constant, there are frequent compressibility changes for individual
memory entries. This is likely due to the fact that DL frameworks
perform their own asynchronous GPU memory allocation with
software-managed memory pools, and may reuse the same memory
location for a variety of purposes over program execution [45]. In
some cases, the data in a memory-entry can grow more random over
time, thereby decreasing its compressibility. This would require
more space to be allocated for the same memory-entry, causing a
memory-entry overflow, and thereby additional data movement, as
discussed in Section 2.4.
3.2 Buddy Compression Overview
Buddy Compression allows a programmer or DL framework to annotate
memory allocations such that they take up less device memory than
the allocation size. For instance, if a user has 24GB of data and a
GPU with only 12GB of memory capacity, the data can be allocated
with a target of 2x compression. This means that only half of the full
data size is allocated on the GPU device memory. We use compres-
sion to opportunistically fit data into this reduced device-resident
allocation. If a memory-entry does not compress sufficiently,
an NVLink-connected larger-but-slower memory is available
as overflow storage. The buddy-memory is used as extended
storage, with the data being striped across at 128B memory-entry
granularity. Data from compressible memory-entries is sourced
completely from GPU device memory, while incompressible
memory-entries are sourced from both device and buddy-memory.
As shown in Figure 4, Buddy Compression stripes the data
using 32B sectors. This 32B sector size is chosen to match the
access granularity of GPU memory, specifically, GDDR5, GDDR5X,
GDDR6, and HBM2-based GPUs. For example, if an allocation
targets a compression ratio of 2x, the first two sectors per 128B
memory-entry are mapped to device memory, while the last two
are mapped to system memory. Therefore, if a memory-entry can
be compressed by 2x or more, it fits completely in device memory,
and otherwise, the rest of its data is saved in its fixed pre-allocated
Fig. 3: The average compression ratio of the allocated memory for the complete run of each benchmark. Ten equally distributed memory-
snapshots are taken during the entire run of each benchmark, and the compression ratio calculated.
Fig. 4: Depending on its data, a 128B memory-entry compresses to
occupy from 1-4 sectors of 32B each. Here, the target compression
ratio is 2x. If an entry does not compress to 2x, left over sectors are
accessed from buddy-memory.
spot in the buddy-memory. The allowed compression ratios for
this study are 1x, 1.33x, 2x and 4x (4, 3, 2, or 1 sectors allocated
in device memory). These ratios are chosen to keep the sector
interleaving simple and avoid unaligned sector accesses.
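The sector striping described above can be sketched as follows. This is a minimal model, assuming each entry's compressed size is known in bytes. The names `DEVICE_SECTORS` and `split_entry` are hypothetical, and treating an all-zero entry as needing no sectors is an assumption of this sketch.

```python
import math

SECTOR_BYTES = 32
SECTORS_PER_ENTRY = 4  # a 128B memory-entry spans four 32B sectors

# Target compression ratio -> sectors reserved in device memory per entry.
DEVICE_SECTORS = {"1x": 4, "1.33x": 3, "2x": 2, "4x": 1}


def split_entry(compressed_bytes: int, target: str) -> tuple[int, int]:
    """Return (device sectors used, buddy sectors used) for one memory-entry."""
    if not 0 <= compressed_bytes <= SECTOR_BYTES * SECTORS_PER_ENTRY:
        raise ValueError("compressed size must be within 0..128 bytes")
    needed = math.ceil(compressed_bytes / SECTOR_BYTES)
    reserved = DEVICE_SECTORS[target]
    on_device = min(needed, reserved)
    # Whatever does not fit in the reserved device sectors overflows
    # into the entry's fixed, pre-allocated slot in buddy-memory.
    return on_device, needed - on_device
```

With a 2x target, an entry that compresses to 64B is served entirely from device memory, while one that only compresses to 100B needs all four sectors, two of which come from buddy-memory.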
Buddy-Memory Carve-Out Region. At boot time, the host
system carves out a physically-contiguous chunk of its memory
for each GPU, dedicating this storage to be used as each GPU’s
buddy memory. Those regions are then never accessed by the host,
eliminating any coherence issues. The buddy-memory size should
correspond to the maximum target compression ratio for the GPU.
As an example, if the maximum target compression ratio is 4x,
then the carve-out region should be 3x as large as GPU device
memory, in order to allow each memory-entry to have 3 sectors
in host memory (in the worst case) and only 1 on the GPU.
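The carve-out sizing rule can be checked with a few lines of arithmetic. In this sketch, `carve_out_bytes` is a hypothetical helper and the ratio is passed as a `Fraction` so that 1.33x can be expressed exactly as 4/3. For a maximum target ratio R, device memory holds 1/R of every entry, so the worst case needs (R - 1) times the device capacity in buddy-memory.

```python
from fractions import Fraction

GIB = 1 << 30


def carve_out_bytes(device_bytes: int, max_target_ratio: Fraction) -> int:
    """Buddy-memory needed if every entry overflows at the maximum target ratio."""
    if max_target_ratio < 1:
        raise ValueError("target ratio must be at least 1")
    # Logical capacity is device_bytes * R; device memory holds device_bytes of it.
    return int(device_bytes * (max_target_ratio - 1))
```

For a 16GiB GPU with a 4x maximum target, this gives 48GiB, matching the 3x figure in the text.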
Compression Metadata and Address Translation. Once the
data is stored in compressed form, addressing it requires some
additional translation and metadata. This metadata informs us
about (i) the target compression ratio, (ii) whether or not a partic-
ular memory-entry was compressed to the target ratio, and (iii) the
address in buddy-memory to be accessed for memory-entries that
did not compress to the target ratio.
A global base address for the buddy-memory carve-out region
is stored in a Global Buddy Base-address Register (GBBR). The
page-table and TLBs are extended to store the information about
whether the page is compressed or not, the target compression ratio,
and the offset of buddy-page from the global base address. This
adds a total overhead of 24 bits per page-table entry. To know the
actual compressed size of each 128B memory-entry, there are 4 bits
of metadata per cache block, stored in a dedicated region of device
memory, amounting to a 0.4% overhead in storage. The metadata
storage overheads of Buddy Compression are either comparable
to, or less than those of previous works in memory compression
in CPUs [13–17]. Figure 5a shows a very high-level view of the
metadata setup and translation. The simple GBBR-offset based
addressing makes the overall translation mechanism very simple.
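To make the offset-based addressing concrete, the sketch below computes the buddy-memory address of one overflow sector. The text specifies only that the GBBR holds a global base and that each page-table entry stores a buddy-page offset and target ratio. The `PageEntry` fields, the `buddy_address` function, and the dense packing of overflow sectors within a buddy page are all assumptions of this example.

```python
from dataclasses import dataclass

SECTOR_BYTES = 32
SECTORS_PER_ENTRY = 4


@dataclass(frozen=True)
class PageEntry:
    """Hypothetical view of the extra page-table fields named in the text."""
    compressed: bool
    device_sectors: int     # sectors per entry kept in device memory (1..4)
    buddy_page_offset: int  # byte offset of the buddy page from the GBBR


def buddy_address(gbbr: int, pte: PageEntry, entry_index: int, sector: int) -> int:
    """Address of one overflow sector, assuming densely packed buddy sectors."""
    if not pte.compressed or not pte.device_sectors <= sector < SECTORS_PER_ENTRY:
        raise ValueError("sector is not stored in buddy-memory")
    buddy_sectors = SECTORS_PER_ENTRY - pte.device_sectors
    slot = entry_index * buddy_sectors + (sector - pte.device_sectors)
    return gbbr + pte.buddy_page_offset + slot * SECTOR_BYTES
```

Because every entry has a fixed slot, the lookup is pure arithmetic on values already held in the TLB, which is the property the design relies on.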
A metadata cache is used to avoid additional memory accesses
each time memory is accessed. Figure 5b shows the metadata cache
hit ratios as a function of the metadata cache size. Most applications
have high hit ratios. We use a 4-way 64KB metadata cache, that
is split into 8 slices, 1 per DRAM channel. Each metadata cache
entry is 32B, thereby causing a prefetch of metadata corresponding
to 63 neighboring 128B memory-entries on every metadata cache
miss. The metadata is assumed to be interleaved across the DRAM
channels using the same hashing mechanism as regular physical-
address interleaving.
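The metadata figures quoted above follow from simple arithmetic, reproduced here as a sketch. The linear, entry-ordered metadata layout used by `metadata_location` is an assumption, since the text states only that metadata is interleaved across channels like regular data.

```python
ENTRY_BYTES = 128
METADATA_BITS_PER_ENTRY = 4
METADATA_LINE_BYTES = 32

# One 32B metadata cache entry holds the 4-bit fields of 64 memory-entries,
# i.e. the requested entry plus 63 neighbors.
ENTRIES_PER_LINE = METADATA_LINE_BYTES * 8 // METADATA_BITS_PER_ENTRY
DATA_BYTES_PER_LINE = ENTRIES_PER_LINE * ENTRY_BYTES
STORAGE_OVERHEAD = METADATA_BITS_PER_ENTRY / (ENTRY_BYTES * 8)


def metadata_location(data_address: int) -> tuple[int, int]:
    """Return (metadata line index, slot within line) for a data address.

    Assumes metadata is stored linearly in memory-entry order.
    """
    entry = data_address // ENTRY_BYTES
    return entry // ENTRIES_PER_LINE, entry % ENTRIES_PER_LINE
```

Under these numbers a single metadata cache entry covers 8KB of data, and the storage overhead is 4/1024, or about 0.39%, consistent with the 0.4% stated earlier.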
3.3 Benefits of the Buddy Compression Design
No Page-Faulting Expense. The immense parallelism of a GPU
increases the throughput of work done. Driver-based page-fault
handling, however, is remote and non-distributed, making GPU
page-faults during the runtime of a kernel very expensive [11]. As
data is written back to memory, its compressibility can decrease,
requiring new page allocations to store the same data. The page fault
overhead in GPUs makes reducing the compressed data movement
a very important directive. The uniqueness of the design lies in the
fact that the compressibility of each memory-entry affects only its
own allocation, thereby never having to cause page movement.
Low Translation Overhead. Memory bandwidth is a frequent
bottleneck for GPUs. Accordingly, there has been fruitful re-
search on bandwidth compression of GPU main memory [43, 46].
Buddy Compression uses compression to amplify both the band-
width and capacity of GPU memory. However, as discussed earlier,
compression-for-capacity requires additional metadata accesses for
translation into the compressed address space. This makes reduc-
ing the metadata size and keeping translation simple important.
Buddy Compression requires only 0.4% metadata, and since the
carve-out region is contiguous in host memory, addressing into the
buddy-memory is offset-based, and trivial.
No Impact on Small Workloads. If the available GPU device
memory is enough to allocate the memory required by the application,
Buddy Compression can be disabled. In that case, the
design does not affect the performance at all.
3.4 Reducing Buddy Compression Overheads
With the design of Buddy Compression, the obvious overhead
comes from having to access the slower buddy-memory in cases
of unexpectedly low compression.
Profiling for Target Compression Ratio. Choosing the right
target compression ratio is important, since aggressive compression
ratios will lead to more memory-entries exceeding the allocated
device memory and requiring buddy-memory accesses. To choose
the target compression ratio, we use a simple profiling pass on a
representative dataset. For HPC workloads, the profiling pass is
run using a smaller dataset, like the train dataset for SpecAccel2.
Fig. 5: Compression metadata handling and architecture.
(a) High-level overview of architectural changes/additions with Buddy Compression. [Figure labels: GPU cores, L2, compressor, decompressor, TLB, GBBR, memory controller, metadata cache (each entry holds 4 bits per 128B memory-entry), device memory, NVLink, buddy-memory.]
(b) Metadata cache hit rates with different sizes of total metadata cache.
Fig. 6: Spatial patterns of compressibility. Each plot is a heatmap of compressibility per 128B memory-entry for the allocated GPU memory.
Each horizontal line is an 8KB page and pages are stacked vertically as per address.
[Color scale: compressed size from 128B down to 0B, with smaller sizes indicating higher compressibility. Panels: 351.palm, 352.ep, 354.cg, 355.seismic, 356.sp, 357.csp, 360.ilbdc, 370.bt, FF_HPGMG, FF_LULESH, BigLSTM, AlexNet, InceptionV2, Squeezenetv1.1, VGG16, ResNet50.]
For DL workloads, this profiling pass is run with a smaller batch
size, and can be embedded in the training platform, like PyTorch
or TensorFlow. Furthermore, the target ratios can be periodically
updated for long running applications, e.g., for DL training, the
target ratio update can be combined with checkpointing in the
framework. In this paper, for simplicity, we consider a single static
target compression ratio throughout the run of the application.
Annotation Granularity. The granularity with which the
programmer annotates memory is also important—the best
annotation granularity depends on the spatial characteristics of
compressibility. Naive Buddy Compression considers a single,
conservative target compression ratio for the whole-program. As
shown in Figure 7, we find this granularity to be too coarse. The
naive mechanism achieves an overall compression ratio of 1.57x for
HPC workloads, and 1.18x for DL workloads, with 8% of accesses over
the interconnect to the buddy-memory for HPC, and 32% for DL.
The overall compression is low, and, given that even the highest
available bandwidth on the interconnect (NVLink2, 150GBps) is
6x lower than the GPU device memory bandwidth (900GBps),
the overheads from these buddy-memory accesses are high. The
desirable solution for us would be something that effectively lowers
the buddy-memory accesses, while maintaining high compression
ratios. In order to find such a solution, we present a deep dive into
the detailed compression data patterns in these workloads.
Understanding Compressibility Patterns. Figure 6 shows a
spatial plot (in the virtual address space) of the compressibility of
each workload’s data. Each sub-plot is a spatial heat map that shows
the compressibility of each 128B memory-entry in the memory
allocated by each benchmark. A colder color (blue) signifies high
compressibility and a hotter color (red) shows low compressibility.
The plot is structured assuming 8KB pages, where each page has 64
128B memory-entries laid along the x-axis. The y-axis is the total number
of pages in the memory of the benchmark. Figure 6 shows that
the spatial locality of the compressibility of data varies significantly
across benchmarks. While most HPC benchmarks have large
homogeneous regions of similar compressibility, the distribution
is more random in DL workloads. FF_HPGMG shows specific
patterns of compressibility that can directly be correlated with
the arrays of heterogeneous structs that are used in its allocation.
Although the DL workloads do not show the level of homogeneity
Fig. 7: Sensitivity of the compression ratio and buddy-memory accesses to design optimizations.
[Plot: per benchmark (351.palm through ResNet50, plus GMEAN_HPC and GMEAN_DL), bars show the fraction of accesses to buddy-memory (left axis, 0.0 to 0.5) for naive conservative Buddy Compression, per-allocation compression, and zero-page optimized (final) Buddy Compression; markers show the target compression ratio per bar (right axis, 1 to 5).]
Fig. 8: The fraction of buddy storage accesses over the execution
of one iteration in DL training. We achieve a constant compression
ratio of 1.49 for SqueezeNet and 1.64 for Resnet.
that can be seen in HPC workloads, there are still some mostly-red
or mostly-blue regions. Based on the insights from these plots, we
propose optimizations to the design of Buddy Compression.
Per-Allocation Compression Targets. Figure 6 shows that
there are several regions that are mostly-red, or mostly-blue. We
find that the majority of these region boundaries overlap with
cudaMalloc() boundaries. A special allocation API for compressed
regions allows us to capture this behavior and eliminate the futile
effort of compressing the red regions.
During profiling, we periodically take snapshots of memory, to
track the compression ratios per allocation. At the end of profiling,
we decide target compression ratios per allocation using heuristics
to trade off the compression ratio against the buddy-memory accesses.
The compression ratio chosen is conservative to minimize the
buddy-memory accesses. As an example, based on Figure 3, for
355.seismic, for most allocations, the target ratio used will be 2x,
and not 7x or 6x. We use a static target compression ratio for the
entire run of the application. This is because a dynamic target
compression ratio would require reallocating and moving around
the pages, making the compression management more complicated
and less performant, unless the applications are very long running
and the overheads are amortized.
Buddy Threshold. Most benchmarks have regions that
are highly homogeneous in their compressibility, making the
per-allocation target ratio decision simple. However, for bench-
marks like AlexNet and ResNet50, the regions are mixed in
compressibility. Therefore, the target compression ratio decision in
these cases involves a trade-off between compression ratio and
buddy-memory accesses. We define a Buddy Threshold that sets a
limit on the fraction of memory-entries that require accessing the
buddy-memory, per-allocation. A higher Buddy Threshold achieves
a higher compression ratio at the cost of more buddy-memory
accesses, and hence, lower performance. These buddy-memory
accesses are calculated per target compression ratio, using a
histogram of the static memory snapshots.
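One way to read the Buddy Threshold rule is sketched below: for each candidate target ratio, count the fraction of an allocation's memory-entries that would overflow into buddy-memory, and keep the most aggressive target that stays within the threshold. This is an illustrative heuristic under that reading. The function names, the candidate order, and the use of raw per-entry compressed sizes from a single snapshot are assumptions, not the profiler's actual implementation.

```python
import math

SECTOR_BYTES = 32
# Candidate targets, most aggressive first: (label, device sectors per entry).
TARGETS = [("4x", 1), ("2x", 2), ("1.33x", 3), ("1x", 4)]


def overflow_fraction(compressed_sizes: list[int], device_sectors: int) -> float:
    """Fraction of entries needing more sectors than device memory reserves."""
    if not compressed_sizes:
        raise ValueError("allocation has no memory-entries")
    over = sum(1 for size in compressed_sizes
               if math.ceil(size / SECTOR_BYTES) > device_sectors)
    return over / len(compressed_sizes)


def pick_target(compressed_sizes: list[int], buddy_threshold: float = 0.30) -> str:
    """Most aggressive target whose overflow fraction is within the threshold."""
    for label, device_sectors in TARGETS:
        if overflow_fraction(compressed_sizes, device_sectors) <= buddy_threshold:
            return label
    return "1x"  # not reached for sizes up to 128B; kept as a safe default
```

An allocation in which 80% of entries compress to 20B and the rest are nearly incompressible would get a 4x target at a 30% threshold, whereas one with 40% incompressible entries would fall back to 1x unless the threshold is raised.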
Figure 9 shows the results from a sensitivity study of the Buddy
Threshold (10% to 40%). In addition to this, the figure shows the best
achievable compression ratio assuming no constraints are placed
on the buddy-memory accesses. The bars in the figure show that
the buddy-memory accesses remain very low for HPC benchmarks,
due to their homogeneous regions. For DL benchmarks however,
the buddy-memory accesses are more frequent, and increase further
as the buddy threshold is increased. Similarly, the compression
benefits from increasing the buddy threshold are mostly seen in
DL benchmarks. With the exception of FF_HPGMG, we are able to
achieve near-optimal compression, as can be seen in comparison
with the black marker. FF_HPGMG, as discussed earlier, has a
peculiar striped compressibility pattern resulting from the struct it
uses. To capture the maximum compression, FF_HPGMG requires
more than 80% Buddy Threshold for most of its allocated memory
region. Another interesting scenario is seen in benchmarks
354.cg and 370.bt. Since these benchmarks mostly consist of
incompressible data, without the per-allocation targets, Buddy
Compression was unable to compress them at all. However, with
the per-allocation targets, we are able to compress them by 1.1x and
1.3x respectively. Overall, since a 30% Buddy Threshold achieves
a good balance between the compression and buddy-memory
accesses, we choose this for our final Buddy Compression design.
Since the target compression ratio remains constant, while actual
data compressibility can change over time, these statically calcu-
lated buddy-memory accesses may not be similar across the run. To
investigate this, we observed the buddy-memory accesses across all
the memory dumps, while maintaining constant target compression
ratios. Figure 8 presents the results from ResNet50 and SqueezeNet,
both of which have frequent changes in compression ratio per
memory-entry, and have high accesses to buddy-memory to begin
with. We observe that the buddy-memory accesses do not change a
lot over time. This is because even though the individual memory-
entries frequently change their compressibility, the changes are
almost equal in both directions, making the overall effects small.
Furthermore, as mentioned earlier, for benchmarks that see large
changes in their overall compressibility, like 355.seismic, we avoid
this challenge by choosing conservative target compression ratios.
Special Case For Mostly-Zero Allocations. Based on the spa-
tial plots, we observe that there are areas in memory that remain
mostly-zero even across complete benchmark executions. To cap-
ture the capacity-expanding opportunity of such allocations, we add
an aggressive target compression ratio of 16x where we keep only
Fig. 9: Sensitivity of the compression ratio and buddy-memory accesses to the Buddy Threshold parameter.
[Bar chart over 351.palm, 352.ep, 354.cg, 355.seismic, 356.sp, 357.csp, 360.ilbdc, 370.bt, FF_HPGMG, FF_Lulesh, BigLSTM, AlexNet, Inception_V2, SqueezeNet, VGG16, ResNet50, GMEAN_HPC, and GMEAN_DL. Left axis: fraction of accesses to buddy-memory (0.0 to 0.2) for Buddy Thresholds of 10%, 20%, 30%, and 40%. Right axis: compression ratio (1 to 5), showing the target compression ratio per bar and the best achievable compression ratio.]
8B out of each 128B in device memory. Note that the only change
involved here is an additional encoding for page size in the TLB.
This optimization allows us to increase the compression ratio for
benchmarks with large highly-compressible regions, for example,
352.ep, and VGG16. Note that this optimization does not have much
impact on the buddy-memory accesses, since such compressible
data would always fit in device memory. Figure 7 shows the impact
of this optimization. For HPC benchmarks, the compression ratio
goes up from 1.7x to 1.9x, while for DL, from 1.42x to 1.5x.
For this optimization, it is important to identify allocations that
are mostly zero, and remain so for the entirety of the run of the
benchmark, unless there is a periodic update of target compression
ratio. The profiler marks the regions that can be compressed with
this optimization, such that the overall compression ratio is still
under 4x, limited by the buddy-memory carve-out region.
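The capacity arithmetic behind this limit can be sketched in a few lines. Keeping 8B of each 128B entry in device memory corresponds to a 128/8 = 16x target, while a buddy-memory region of 3x the device memory bounds the overall ratio at 4x. The sketch below is an illustration under those two facts only; the function names and the example allocation sizes are hypothetical.

```python
MAX_OVERALL_RATIO = 4.0  # device memory plus a 3x buddy-memory carve-out


def overall_ratio(allocations):
    """Overall compression ratio for (uncompressed_bytes, target_ratio) pairs."""
    total = sum(size for size, _ in allocations)
    device = sum(size / ratio for size, ratio in allocations)
    return total / device


def fits_carve_out(allocations, max_ratio=MAX_OVERALL_RATIO):
    """True if the chosen targets stay within the carve-out limit."""
    return overall_ratio(allocations) <= max_ratio


GIB = 1 << 30
# Hypothetical program: one mostly-zero allocation at 16x, one at 2x,
# and one left uncompressed.
allocs = [(1 * GIB, 16.0), (1 * GIB, 2.0), (2 * GIB, 1.0)]
print(round(overall_ratio(allocs), 3), fits_carve_out(allocs))
```

A program consisting only of 16x allocations would exceed the limit, which is why the 16x target is applied selectively.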
Possible Optimization for Metadata Access. We note that
on a metadata cache miss, both the device-memory data and
its metadata can be accessed in parallel, since the metadata
only informs us about the buddy-memory part of data. We do
not, however, access the buddy-memory in parallel, since the
buddy-memory accesses are rare on average (Figure 7).
3.5 Final Design
Buddy Compression uses a Buddy Threshold default of 30%, a meta-
data cache of 4KB per DRAM channel, and a buddy-memory region
of size 3x of the GPU device memory, to support a 4x maximum compression
ratio. The application is first profiled with a smaller dataset,
during which our tool periodically calculates a histogram of com-
pressed memory-entries per allocation. At the end of profiling, the
tool reports the target compression ratios, which are then used by
the DL platform or the HPC user to annotate cudaMalloc API calls,
enabling running a larger dataset without the overheads of Unified
Memory. Figure 7 shows the compression ratio and buddy-memory
accesses for the final design. We achieve 1.9x compression for
HPC and 1.5x compression for DL workloads. The average buddy-
memory accesses are 0.08% for HPC and 4% for DL workloads.
4 PERFORMANCE EVALUATION
We have already presented results regarding buddy-memory
accesses and compression ratios from Buddy Compression in
Figure 7. In this section, we first discuss the performance impact
of Buddy Compression relative to an ideal large-capacity GPU,
followed by a comparison to UM-based oversubscription. We then
Tab. 2: Performance simulation parameters.
Core:     1.3 GHz; 2 greedy-then-oldest warp schedulers per SM; max 64 32-thread warps per SM
Caches:   24KB private L1/texture cache per SM, 128B lines; 64KB dedicated scratchpad per SM; 4MB shared L2, 32 slices, 128B lines, 16 ways
Off-Chip: 32 HBM2 channels at 875MHz (900 GBps); 6 NVLink2 bricks (150 GBps full-duplex*)
Buddy:    4KB* metadata cache per L2 slice, 128B lines, 4 ways; compression/decompression latency = +11 cycles
* These parameters are swept in later parts of the evaluation.
present a case-study of DL training to estimate the performance
benefits from increased capacity.
4.1 Methodology
Workloads. As previously described, we evaluate Buddy Com-
pression’s effectiveness on workloads from the SpecAccel [47]
and FastForward benchmark suites for HPC workloads. We collect
a representative trace from each benchmark while running the
reference datasets. Each trace contains 1–9 billion warp instruc-
tions and corresponds to the dominant kernel of each benchmark
at a point in execution that exhibits the average compression
ratio for that entire benchmark execution [48]. For DL, we use a
set of 5 convolutional neural networks: AlexNet [49], Inception
v2 [50], SqueezeNet v1.1 [51], VGG16 [52], and ResNet50 [53],
all of which were run under the Caffe [45] framework with the
ImageNet [54] dataset. Additionally we consider a long short-term
memory network, BigLSTM [55], which is a 2-layer LSTM with
an 8192+1024 dimensional recurrent state in each of the layers and
uses the English language model. The traces for the DL training
workloads span one full training iteration.
Fig. 10: Our simulator correlates with a V100 GPU (left, with slope=1
line). It is two orders of magnitude faster than GPGPUSim [56],
enabling longer programs (right, linear regression lines).
[Left panel: simulator cycles vs. silicon cycles, both from 1.E+04 to 1.E+10, for our simulator and GPGPUSim. Right panel: simulation wall clock time in hours (1.E-05 to 1.E+01) vs. simulator cycles (1.E+04 to 1.E+10).]
Fig. 11: The performance impact of compression, not accounting for capacity benefits. Systems with different link bandwidths are evaluated
(showing unidirectional full-duplex bandwidths), with results normalized to a system with unlimited memory and a 150GBps interconnect.
Simulation Infrastructure. We use a dependency-driven GPU
performance simulator, similar to the one used by Arunkumar
et al. and others [57–59]. We configure the simulator based on
publicly available information about NVIDIA’s P100 Pascal GPU [60]
and the interconnect characteristics of recent Volta GPUs [61]
(Tab. 2). Non-public microarchitectural details are configured using
microbenchmark results from Jia et al. [44]. Each SM is modeled
as an in-order processor with greedy-then-oldest warp scheduling.
We model a multi-level cache hierarchy with private L1 caches
and a shared sectored L2 cache with 128B lines and 32B sectors.
Caches are banked to provide the necessary parallelism to saturate
DRAM bandwidth. We model software-based cache coherence in
the private caches, similar to state-of-the-art GPUs. The memory
system consists of 32 HBM2 channels and the GPU is connected
to the system with 6 NVLink2 bricks.
We conservatively model decompression latency as 11 DRAM
cycles, as discussed in prior work [43]. Unless explicitly noted,
the default metadata cache configuration is 4-way set associative
4KB per L2 slice. Additionally, to separate out its performance
impact, we also evaluate bandwidth-only interconnect compression
between the L2 cache and device memory. Such compression does
not increase the effective memory capacity, but it can increase the
bandwidth between L2 cache and memory without requiring any
metadata or buddy-memory accesses.
Figure 10 (left) shows that our simulator correlates well
(correlation coefficient 0.989) with the total cycles spent on a real
V100 GPU across a wide variety of benchmarks from a number of
domains (footnote 1). Corresponding numbers from GPGPUSim (correlation
coefficient 0.948), a widely-used academic simulator, are also shown.
Our motivation in using a proprietary simulator comes from the
two orders-of-magnitude speed benefit shown in Figure 10 (right),
which enables us to simulate larger and more realistic workloads.
4.2 Performance Relative to an Ideal GPU
Apart from increasing the memory capacity, Buddy Compression
can affect the performance of the system in the following ways:
(i) Buddy-memory accesses can cause a performance hit. (ii) De-
compression latency can hinder performance. (iii) Metadata cache
misses can cause additional requests to device memory. (iv) It has
two conflicting effects on the effective bandwidth from L2 cache to
device memory. First, since compression is done at the cache-block
granularity, the minimum L2 fill granularity is no longer a single
32B sector. Instead, the complete cache-block is transferred to L2
upon a load access to a compressed memory-entry. This may result
in over-fetch for fine-grained accesses, squandering device memory
bandwidth. However, for compressible workloads with high
locality, compression allows an increase in effective bandwidth
because cache-blocks can be fetched with fewer memory accesses.
Footnote 1: We show simulator correlation results with a slightly different configuration than
is used for evaluation, in order to also be able to show comparable GPGPUSim results.
The P100 configuration used in the paper also correlates well with silicon.
We evaluate Buddy Compression alongside bandwidth-only
compression that compresses the data being transferred between
L2 cache and device memory. We also sweep the buddy-memory
interconnect bandwidth from 50 to 200GBps on full-duplex connec-
tion, where 150GBps represents NVLink2. The results are shown
in Figure 11. Bandwidth-only compression achieves an overall
speedup of 5.5%. Most of this speedup comes from the DL training
workloads. This is because of the regular, streaming memory ac-
cesses of these workloads, which are essentially performing matrix
multiplications. Since most of their memory accesses are coalesced
to access all sectors in each cache-block, bandwidth compression
achieves higher effective bandwidth by requiring fewer packets
per request. On the other hand, the HPC applications 354.cg and
360.ilbdc experience slowdowns with bandwidth compression. This
is because of the random and irregular memory access pattern of
these benchmarks. Most of their memory accesses require only one
sector. However, bandwidth compression leads to a full cache-block
transfer of any compressible data, potentially lowering the effective
bandwidth for random accesses. FF_Lulesh experiences a slowdown
despite having a regular memory access pattern. We find the reason
behind this to be the compression and decompression latency,
which both lie on the critical path for bandwidth compression.
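The two opposing bandwidth effects discussed above can be illustrated with a small byte-counting sketch. It assumes 128B cache-blocks with 32B sectors, that an incompressible block is still fetched sector by sector, and that a compressible block is always fetched whole in its compressed form. The function name and the example access patterns are hypothetical, and the sketch ignores latency, metadata, and buddy-memory traffic.

```python
SECTOR_BYTES = 32
SECTORS_PER_BLOCK = 4  # one 128B cache-block


def bytes_fetched(sectors_needed, compressed_sectors):
    """Device-memory bytes moved to serve one L2 miss.

    sectors_needed: sectors the access actually uses (1 to 4).
    compressed_sectors: size of the block in memory, in sectors (1 to 4);
    4 means the block is stored uncompressed.
    """
    if not 1 <= sectors_needed <= SECTORS_PER_BLOCK:
        raise ValueError("sectors_needed must be between 1 and 4")
    if not 1 <= compressed_sectors <= SECTORS_PER_BLOCK:
        raise ValueError("compressed_sectors must be between 1 and 4")
    if compressed_sectors == SECTORS_PER_BLOCK:
        # Uncompressed block: only the requested sectors are transferred.
        return sectors_needed * SECTOR_BYTES
    # Compressed block: the whole compressed block must be transferred.
    return compressed_sectors * SECTOR_BYTES


# Irregular access using one sector of a block that compresses to 2 sectors.
print(bytes_fetched(1, 4), bytes_fetched(1, 2))
# Streaming access using all four sectors of the same block.
print(bytes_fetched(4, 4), bytes_fetched(4, 2))
```

In this model the single-sector access moves more bytes once the block is compressed, while the fully coalesced access moves fewer, matching the direction of the effects reported for 354.cg and 360.ilbdc versus the DL training workloads.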
Buddy Compression introduces additional overheads on top of
bandwidth compression, in the form of metadata cache misses and
buddy-memory accesses. Figure 11 shows that while an intercon-
nect bandwidth of 200GBps still achieves a 2% average speedup
using Buddy Compression, all lower interconnect bandwidths expe-
rience some slowdown relative to the ideal large-capacity GPU that
serves as the baseline. Note that the performance benefits from a
larger memory capacity are not accounted for in these experiments.
The benchmarks 351.palm and 355.seismic experience slowdown
due to a higher metadata cache miss rate, as can be seen from
Figure 5b. Since the other benchmarks have high metadata cache
hit rates, metadata accesses do not have a discernible impact on
their performance.
Most HPC benchmarks have rare buddy-memory accesses
(Figure 7), leading to negligible slowdowns with a high bandwidth
interconnect. However, when the interconnect bandwidth is
reduced, even these 1% accesses from buddy-memory can cause
a considerable slowdown of bandwidth-sensitive applications like
352.ep and 355.seismic. Note that FF_HPGMG has host-memory
accesses in its native form, due to synchronous copies from host
to device. Therefore, lowering the link bandwidth shows a drastic
impact on its performance (since all results are normalized to a
baseline system without compression and a 150 GBps interconnect).
DL training workloads have a higher percentage of buddy-
memory accesses, as can be seen in Figure 7. These buddy-memory
accesses are caused by a lack of compression locality in the work-
loads. For example, AlexNet requires accesses to 5.4% of memory
locations to go to buddy-memory, leading to a 6.5% slowdown
relative to ideal (with a 150GBps full-duplex interconnect). This
is because of the difference in the bandwidth available from device
memory vs. buddy-memory, which in our setup is 900GBps and
150GBps. Performance degenerates quickly as this disparity grows,
with the 50GBps full-duplex connection seeing a 35% slowdown.
These results show that recently-developed high-speed GPU
interconnects are an enabling technology for Buddy Compression.
The slowest link we evaluate (50 GBps full-duplex) is still faster
than the most recent PCIe generation (x16 PCIe4.0, providing
32GBps full-duplex bandwidth) yet it suffers from more than 20%
average slowdown relative to the ideal GPU. However, using high
bandwidth interconnects such as NVLink2 (150 GBps full-duplex)
enables Buddy Compression to come within 1% of the performance
of the large-capacity GPU on HPC benchmarks, and within 2.2%
of ideal on DL training workloads.
4.3 Comparison with Unified Memory
Faithfully comparing the performance of Buddy Compression to Unified
Memory in simulation is not feasible due to the complex host-driver
interactions and page migration policies implemented within
UM. Instead, we choose to understand UM performance in oversubscription
scenarios on real hardware. Figure 12 shows the measured performance
of three applications on an IBM Power9 system, connected
to a Tesla V100 GPU via NVLink2 (3 bricks, 75 GBps full-duplex
bandwidth). We use SpecAccel applications with the managed PGI
compiler flag, and force varying levels of oversubscription through
an interposer that hogs GPU memory at application startup. We
also run the applications using a compiler flag to pin all allocations
in host memory, showing the slowdown in dotted lines. Our results
indicate that UM migration heuristics often perform worse than
running applications completely pinned in host memory; perhaps
because UM was primarily intended for the ease of programming
and has not yet been tuned for high-performance memory oversub-
scription. Previous work [11, 31] supports our observation that the
slowdown due to UM oversubscription can be excessive without
more extensive hardware support. Figure 11 shows that Buddy Com-
pression suffers from at most 1.67x slowdown for these programs
when oversubscribing by 50%, even with a conservative 50 GBps
NVLink speed. This indicates that it is a better alternative for
high-performance memory oversubscription than software-based UM.
[Figure 12 plot. x-axis: percent forced workload oversubscription (0% to 40%). y-axis: runtime relative to original (ticks at 1, 4, 16, 64). Series: 360.ilbdc, 356.sp, and 351.palm, each with a pinned variant.]
Fig. 12: Measured overheads of using UM oversubscription.
A Power9 CPU is connected via 3 NVLink2 bricks (75 GBps
full-duplex) to an NVIDIA V100 GPU. Dotted lines show the
performance when all allocations are in the host memory.
4.4 Case Study: DL Training Benefits from Increased Memory Capacity
Thus far, we have compared the performance of Buddy Compres-
sion to an uncompressed, large-memory baseline (Figure 11). This
excludes the benefits of having access to a larger memory, thus
ignoring the main purpose of Buddy Compression. In the case
of HPC benchmarks, a larger memory enables solving a larger
problem. Such benefits are important yet difficult to quantify.
Accordingly, we instead perform a case-study on DL training
workloads to quantify the performance benefits from compression.
Stochastic Gradient Descent (SGD) is widely used to update
weight values during DL network training. SGD iterates repeatedly
through the training dataset, optimizing and updating the model
each iteration. Updates depend on hyperparameters (such as the
chosen learning rate) and dynamically react as the classification
accuracy increases. The entire dataset is divided into mini-batches,
and each iteration goes through one mini-batch and updates the
model. While there is an ongoing debate concerning the utility of
large mini-batches across all domains, they can help regularize and
improve convergence in many cases, as evidenced by [3, 5].
Memory Footprints of DL Workloads. The memory footprint
of a network during training depends on the mini-batch size. Larger
mini-batch sizes require a larger part of the dataset to reside in device
memory, along with more intermediate data (activations and
gradients). Figure 13a shows the memory footprint of each of our DL
training workloads as the mini-batch size is increased. The sizes are
increased up to the maximum size that a Titan Xp GPU can support
(12GB device memory). Initially there is not much difference as the
batch size is doubled. Eventually, however, the memory footprint
grows almost linearly with increasing mini-batch size. This tran-
sition point depends on the size of the network parameters, which
do not vary with mini-batch size. For example, for AlexNet, the
network parameters are a large portion of the overall memory con-
sumption due to the three large fully-connected layers and relatively
few (five) convolutional layers. This leads to a later transition point
for AlexNet at a batch-size of 96; all other tested networks transition
to an increasing memory footprint at a batch size of 32 or below.
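The shape of these footprint curves is consistent with a simple first-order model, sketched below. This model is an assumption made for illustration and is not the measurement procedure behind Figure 13a: it treats the footprint as a fixed part (network parameters and other batch-independent data) plus a part that grows linearly with the mini-batch size. The function names and the example sizes are hypothetical.

```python
def footprint_bytes(batch_size, fixed_bytes, per_sample_bytes):
    """First-order footprint: batch-independent part plus per-sample part."""
    return fixed_bytes + batch_size * per_sample_bytes


def knee_batch_size(fixed_bytes, per_sample_bytes):
    """Batch size at which the per-sample part equals the fixed part.

    Below this point doubling the batch changes the footprint little;
    well above it the footprint grows almost linearly with batch size.
    """
    return fixed_bytes / per_sample_bytes


MIB = 1 << 20
# Hypothetical networks: one dominated by parameters, one by activations.
print(knee_batch_size(fixed_bytes=960 * MIB, per_sample_bytes=10 * MIB))
print(knee_batch_size(fixed_bytes=160 * MIB, per_sample_bytes=10 * MIB))
```

Under this model, a network with a larger fixed part relative to its per-sample data reaches the linear regime at a larger batch size, which is the qualitative behavior described for AlexNet.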
Performance Impact of Larger Mini-Batches. A larger batch
size is beneficial for DL training [3, 55], because it allows more
work to be done per iteration, leading to higher resource utilization.
We use an analytical model very similar to [62, 63] to project this,
since we cannot collect traces for DL execution with memory
capacity requirements that are larger than current GPUs’ capacity.
We extensively validate the model, and its projections correlate
highly with a range of existing commercial GPUs. Figure 13b
shows the projected speedup for each network as the mini-batch
size is increased. It is generated using a detailed analytical model
of deep learning training efficiency (Section 4.1). As shown in the
figure, increasing the mini-batch size leads to higher speed in terms
of frames per second. This effect, however, is only seen until the
mini-batch size is large enough to utilize most of the GPU resources.
After the point of full GPU utilization, the effect plateaus.
Fig. 13: Impact of increasing mini-batch size on DL training.
(a) Memory footprint of DL workloads as a function of batch size. (PyTorch, Titan Xp)
(b) Projected speedup in images per second as a function of mini-batch size.
(c) Projected speedup in images per second by using Buddy Compression to achieve a larger batch size.
(d) Validation accuracy with different mini-batch sizes. ResNet50 is trained until 100 epochs with CIFAR100.
Buddy Compression allows us to fit a larger mini-batch into
GPU memory. Figure 13c shows the relative speedup projected
by our model for this larger mini-batch size over a baseline GPU
with 12GB of device memory. The average speedup is 14%, while
individual workloads like BigLSTM and VGG16 achieve high
speedups of 28% and 30%, respectively. The reason for the higher
speedup in these workloads follows from Figures 13a and 13b.
Without compression, both of these are unable to fit the mini-batch
size of 64, which is needed for good resource utilization.
This overall speedup of 14% is much higher than the 2.2%
performance overhead due to Buddy Compression (Figure 11).
This indicates that Buddy Compression can lead to significant
performance gain for capacity-constrained GPUs by allowing the
use of larger mini-batch sizes.
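How a higher effective capacity translates into a larger mini-batch can be sketched with the same kind of first-order footprint model (a fixed part plus a per-sample part). This is an illustrative assumption, not the analytical model used for Figure 13c: it treats compression as uniformly scaling the usable capacity, and the function name and all sizes except the 12GB capacity and the 1.5x DL compression ratio are hypothetical.

```python
def largest_batch(capacity_bytes, fixed_bytes, per_sample_bytes,
                  compression_ratio=1.0):
    """Largest mini-batch whose footprint fits in the effective capacity."""
    effective = capacity_bytes * compression_ratio
    if effective < fixed_bytes:
        return 0  # even the batch-independent data does not fit
    return int((effective - fixed_bytes) // per_sample_bytes)


GIB = 1 << 30
capacity = 12 * GIB          # device memory of the baseline GPU
fixed = 2 * GIB              # hypothetical batch-independent data
per_sample = GIB // 4        # hypothetical per-sample data (256 MiB)

print(largest_batch(capacity, fixed, per_sample))
print(largest_batch(capacity, fixed, per_sample, compression_ratio=1.5))
```

For these hypothetical sizes, the uncompressed GPU cannot reach a mini-batch of 64 while the 1.5x effective capacity can, which is the kind of gap that produces the larger speedups discussed above.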
Better Convergence with Larger Mini-Batches. Apart from
improving computational throughput with better resource utiliza-
tion, the mini-batch size can also impact the training accuracy. In
order to investigate this, we train ResNet50 on the CIFAR100 [64]
dataset for 100 epochs on a Titan Xp GPU with different mini-batch
sizes. Figure 13d shows the validation accuracy results for these
runs. As can be seen, very small mini-batches of 16 and 32 do not
reach the maximum accuracy, despite using the corresponding,
tuned hyperparameters. Additionally, although the mini-batch size
of 64 trains to the maximum accuracy, it converges slower than
the larger mini-batches. With batch normalization, the jitter in
the accuracy is also higher with small mini-batch sizes. While we
observe good validation accuracy up to a batch size of 256, which
is in line with the reported results in previous work [3], it has been
reported that increasing the mini-batch beyond a certain size can
be detrimental to the network’s generalization. However, there has
been other work on tuning loss functions and hyperparameters
for successful training with large mini-batches [65, 66].
Huge DL Networks. Recent object detection networks like
MegDet [3] and natural language processing networks like BERT [5]
are unable to fit more than 2-4 input samples per GPU during train-
ing, due to memory capacity limits. This is a hurdle for developers,
since the best regularization technique, Batch Normalization, requires
a batch size of at least 32 samples to be effective [67]. As a
result, developers resort to horizontal scaling by spreading a mini-
batch across many GPUs. As an example, the version of MegDet
that won the COCO challenge [68] in 2017 performs batch normalization
across 128 GPUs, resulting in high communication overhead.
The authors also present results showing that larger mini-batches lead to
higher accuracy, and are faster to train. Using horizontal scaling
alone to support larger batches is not sustainable due to the inter-
GPU communication bottleneck. While our simulation infrastruc-
ture is unable to support such huge DL training networks, Buddy
Compression enables modest vertical scaling, which, when com-
bined with horizontal scaling can lead to more sustainable solutions.
The final takeaway from this case-study is that most DL
networks require a mini-batch of at least 64 or 128 in order to
achieve near-maximum throughput and best accuracy (with batch
normalization). Buddy Compression can help achieve the required
mini-batch sizes for large networks using fewer GPUs.
5 CONCLUSIONS
This work proposes and evaluates Buddy Compression: the first
general-purpose mechanism that can be used to increase the user-visible
memory capacity of GPUs. Buddy Compression
is enabled by modern high-bandwidth interconnects that allow a
remote memory pool to be used as a backup when the compressibility
is not sufficient. Buddy Compression is able to achieve
1.5–1.9× memory compression ratios across a wide range of
HPC and deep learning workloads while incurring only a 1–2%
performance penalty compared to a system with a larger GPU
memory capacity, due to its unique design where compressibility
changes do not incur additional data movement. This combination
of high performance and reasonable compression ratios makes
Buddy Compression an attractive and performant alternative to
existing technologies like Unified Memory oversubscription.
REFERENCES
[1] L. Gu, J. Siegel, and X. Li, “Using GPUs to Compute Large Out-of-card FFTs,” in
Proceedings of the International Conference on Supercomputing, ser. ICS ’11, 2011.
[2] F. Song, S. Tomov, and J. Dongarra, “Enabling and scaling matrix computations
on heterogeneous multi-core and multi-gpu systems,” in Proceedings of the 26th
ACM International Conference on Supercomputing, ser. ICS ’12, 2012.
[3] C. Peng, T. Xiao, Z. Li, Y. Jiang, X. Zhang, K. Jia, G. Yu, and J. Sun, “Megdet: A
large mini-batch object detector,” CoRR, vol. abs/1711.07240, 2017.
[4] X. Chen, D. Z. Chen, and X. S. Hu, “moDNN: Memory optimal DNN training
on GPUs,” in Proceedings of the Conference on Design, Automation, and Test in
Europe (DATE), Mar. 2018, pp. 13–18.
[5] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of
Deep Bidirectional Transformers for Language Understanding,” CoRR, vol.
abs/1810.04805, 2018.
[6] M. Wang, C.-c. Huang, and J. Li, “Supporting Very Large Models using Automatic
Dataflow Graph Partitioning,” arXiv:1807.08887 [cs], Jul. 2018.
[7] T. Akiba, T. Kerola, Y. Niitani, T. Ogawa, S. Sano, and S. Suzuki, “PFDet:
2nd Place Solution to Open Images Challenge 2018 Object Detection Track,”
arXiv:1809.00778 [cs], Sep. 2018.
[8] M. Rhu, N. Gimelshein, J. Clemons, A. Zulfiqar, and S. W. Keckler, “vDNN: Vir-
tualized Deep Neural Networks for Scalable, Memory-efficient Neural Network
Design,” in Proceedings of the International Symposium on Microarchitecture
(MICRO), 2016, pp. 18:1–18:13.
[9] M. Rhu, M. O’Connor, N. Chatterjee, J. Pool, Y. Kwon, and S. W. Keckler,
“Compressing DMA Engine: Leveraging Activation Sparsity for Training Deep
Neural Networks,” in Proceedings of the International Symposium on High
Performance Computer Architecture (HPCA), Feb. 2018, pp. 78–91.
[10] M. Harris. (2016) Unified memory for CUDA beginners. NVIDIA Blog. [Online;
accessed 18-Jan-2018].
[11] T. Zheng, D. Nellans, A. Zulfiqar, M. Stephenson, and S. W. Keckler, “Towards
high performance paged memory for gpus,” in 2016 IEEE International Symposium
on High Performance Computer Architecture (HPCA). IEEE, 2016.
[12] Ammar Ahmad Awan, Ching-Hsiang Chu, Hari Subramoni, Xiaoyi Lu, and
Dhabaleswar K. Panda, “Can Unified-Memory support on Pascal and Volta GPUs
enable Out-of-Core DNN Training?”
[13] R. B. Tremaine, P. A. Franaszek, J. T. Robinson, C. O. Schulz, T. B. Smith,
M. Wazlowski, and P. M. Bland, “IBM Memory Expansion Technology (MXT),”
in IBM Journal of Research and Development, vol. 45, no. 2, 2001.
[14] M. Ekman and P. Stenstrom, “A Robust Main-Memory Compression Scheme,” in
Proceedings of the 32nd Annual International Symposium on Computer Architecture,
2005.
[15] G. Pekhimenko, V. Seshadri, Y. Kim, H. Xin, O. Mutlu, P. Gibbons, M. Kozuch,
and T. Mowry, “Linearly compressed pages: a low-complexity, low-latency main
memory compression framework,” in Proceedings of the 46th Annual IEEE/ACM
International Symposium on Microarchitecture, 2013.
[16] J. Zhao, S. Li, J. Chang, J. L. Byrne, L. L. Ramirez, K. Lim, Y. Xie, and P. Fara-
boschi, “Buri: Scaling Big-Memory Computing with Hardware-Based Memory
Expansion,” ACM Trans. Archit. Code Optim., vol. 12, no. 3, 2015.
[17] E. Choukse, M. Erez, and A. R. Alameldeen, “Compresso: Pragmatic Main Memory
Compression,” in Proceedings of the International Symposium on Microarchitecture
(MICRO), 2018.
[18] J. Nystad, A. Lassen, A. Pomianowski, S. Ellis, and T. Olson, “Adaptive scalable
texture compression,” in Proceedings of the Fourth ACM SIGGRAPH / Eurographics
Conference on High-Performance Graphics, ser. EGGH-HPG’12, 2012.
[19] A. Jain, A. Phanishayee, J. Mars, L. Tang, and G. Pekhimenko, “Gist: Efficient Data
Encoding for Deep Neural Network Training,” in Proceedings of the International
Symposium on Computer Architecture (ISCA), Jun. 2018, pp. 776–789.
[20] NVIDIA. NVIDIA DGX-2: The world’s most powerful AI system for the most
complex AI challenges. https://www.nvidia.com/en-us/data-center/dgx-2/.
[21] O. Consortium. OpenCAPI 3.0 Data Link Specification. [Online]. Available: https:
//opencapi.org/wp-content/uploads/2016/09/OC-DL-Specification.10.14.16.pdf
[22] “Intel Hints Towards An Xe ‘Coherent Multi-GPU’ Future With CXL
Interconnect.” [Online]. Available:
https://wccftech.com/intel-xe-coherent-multi-gpu-cxl/
[23] L. Liu, L. L. Deng, X. Hu, M. Zhu, G. Li, Y. Ding, and Y. Xie, “Dynamic sparse
graph for efficient deep learning,” CoRR, vol. abs/1810.00859, 2018.
[24] A. Gruslys, R. Munos, I. Danihelka, M. Lanctot, and A. Graves, “Memory-efficient
backpropagation through time,” in NIPS, 2016.
[25] M. Rhu, N. Gimelshein, J. Clemons, A. Zulfiqar, and S. W. Keckler, “vdnn: Virtual-
ized deep neural networks for scalable, memory-efficient neural network design,”
2016 49th Annual IEEE/ACM International Symposium on Microarchitecture
(MICRO), pp. 1–13, 2016.
[26] C. D. Sa, M. Leszczynski, J. Zhang, A. Marzoev, C. R. Aberger, K. Olukotun, and
C. Ré, “High-accuracy low-precision training,” CoRR, vol. abs/1803.03383, 2018.
[27] Y. Ito, R. Matsumiya, and T. Endo, “ooc_cudnn: Accommodating convolutional
neural networks over GPU memory capacity,” in 2017 IEEE International
Conference on Big Data (Big Data), Dec. 2017, pp. 183–192.
[28] NVIDIA. NVIDIA Turing GPU Architecture. [Online]. Available: https://www.
nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/
turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf
[29] N. Sakharnykh. (2017) Unified memory on pascal and
volta. GPU Technology Conference (GTC). [Online]. Avail-
able: http://on-demand.gputechconf.com/gtc/2017/presentation/
s7285-nikolay-sakharnykh-unified-memory-on-pascal-and-volta.pdf
[30] “zram: Compressed RAM based block devices,” 2012. [Online]. Available:
https://www.kernel.org/doc/Documentation/blockdev/zram.txt
[31] N. Sakharnykh. (2016, December) Beyond gpu memory lim-
its with unified memory on pascal. [Online]. Available: https:
//devblogs.nvidia.com/beyond-gpu-memory-limits-unified-memory-pascal/
[32] ——. (2018, March) Everything you need to know about unified
memory. http://on-demand.gputechconf.com/gtc/2018/presentation/
s8430-everything-you-need-to-know-about-unified-memory.pdf. GPU
Technology Conference (GTC).
[33] “Heterogenous System Architecture (HSA).” [Online]. Available:
https://www.hsafoundation.com
[34] “AMD Kaveri: Support for Heterogenous System Architecture (HSA).”
[Online]. Available: http://www.amd.com/en-us/products/processors/desktop/
a-series-apu
[35] Tiffany Trader. (2017) TSUBAME3.0 points to future HPE
Pascal-NVLink-OPA server. https://www.hpcwire.com/2017/02/17/
tsubame3-0-points-future-hpe-pascal-nvlink-opa-server/. HPC Wire.
[36] A. Caldeira. (2018, March) Ibm power system ac922 in-
troduction and technical overview. [Online]. Available: https:
//www.redbooks.ibm.com/redpapers/pdfs/redp5472.pdf
[37] P. Markthub, M. E. Belviranli, S. Lee, and J. S. Vetter, “DRAGON: Breaking
GPU Memory Capacity Limits with Direct NVM Access,” in Proceedings of the
International Conference on High Performance Computing, Networking, Storage
and Analysis (SC), 2018.
[38] K. Lim, J. Chang, T. Mudge, P. Ranganathan, S. K. Reinhardt, and T. F.
Wenisch, “Disaggregated memory for expansion and sharing in blade servers,” in
Proceedings of the International Symposium on Computer Architecture (ISCA), 2009.
[39] G. Pekhimenko, V. Seshadri, O. Mutlu, P. Gibbons, M. Kozuch, and T. Mowry,
“Base-Delta-Immediate Compression: Practical Data Compression for On-Chip
Caches,” in Proceedings of the International Conference on Parallel Architectures
and Compilation Techniques (PACT), 2012.
[40] A. R. Alameldeen and D. A. Wood, “Frequent Pattern Compression: A significance-based
compression scheme for L2 caches,” Technical Report 1500, Computer
Sciences Department, University of Wisconsin-Madison, 2004.
[41] J. Yang, Y. Zhang, and R. Gupta, “Frequent value compression in data caches,”
in Proceedings of the 33rd Annual IEEE/ACM International Symposium on
Microarchitecture, 2000.
[42] X. Chen, L. Yang, R. Dick, L. Shang, and H. Lekatsas, “C-PACK: A High-Performance
Microprocessor Cache Compression Algorithm,” in IEEE
Educational Activities Department vol. 18, 2010.
[43] J. Kim, M. Sullivan, E. Choukse, and M. Erez, “Bit-Plane Compression: Transform-
ing Data for Better Compression in Many-Core Architectures,” in Proceedings
of the 43rd Annual International Symposium on Computer Architecture, 2016.
[44] Z. Jia, M. Maggioni, B. Staiger, and D. P. Scarpazza, “Dissecting the NVIDIA
volta GPU architecture via microbenchmarking,” CoRR, 2018. [Online]. Available:
http://arxiv.org/abs/1804.06826
[45] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama,
and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,”
in Proceedings of the 22nd ACM International Conference on Multimedia, ser. MM
’14, 2014.
[46] G. Pekhimenko, E. Bolotin, N. Vijaykumar, O. Mutlu, T. C. Mowry, and S. W.
Keckler, “A case for toggle-aware compression for gpu systems,” 2016 IEEE In-
ternational Symposium on High Performance Computer Architecture (HPCA), 2016.
[47] G. Juckeland, W. C. Brantley, S. Chandrasekaran, B. M. Chapman, S. Che, M. E.
Colgrove, H. Feng, A. Grund, R. Henschel, W. mei W. Hwu, H. Li, M. S. Müller,
W. E. Nagel, M. Perminov, P. Shelepugin, K. Skadron, J. A. Stratton, A. Titov,
K. Wang, G. M. van Waveren, B. Whitney, S. Wienke, R. Xu, and K. Kumaran,
“Spec accel: A standard application suite for measuring hardware accelerator
performance.” 2014.
[48] E. Choukse, M. Erez, and A. R. Alameldeen, “Compresspoints: An evaluation
methodology for compressed memory systems,” IEEE Computer Architecture
Letters, vol. 17, 2018.
[49] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep
convolutional neural networks,” in Proceedings of the 25th International Conference
on Neural Information Processing Systems - Volume 1, ser. NIPS’12, 2012.
[50] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the
inception architecture for computer vision,” 2016 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 2016.
[51] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer,
“Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <1mb model
size,” CoRR, vol. abs/1602.07360, 2016.
[52] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014.
[53] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,”
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[54] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A
Large-Scale Hierarchical Image Database,” in CVPR09, 2009.
[55] R. Józefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu, “Exploring the
limits of language modeling,” CoRR, vol. abs/1602.02410, 2016.
[56] A. Bakhoda, G. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt,
“Analyzing CUDA Workloads Using a Detailed GPU Simulator,” IEEE International
Symposium on Performance Analysis of Systems and Software (ISPASS), 2009.
[57] A. Arunkumar, E. Bolotin, B. Cho, U. Milic, E. Ebrahimi, O. Villa, A. Jaleel,
C.-J. Wu, and D. Nellans, “Mcm-gpu: Multi-chip-module gpus for continued
performance scalability,” in Proceedings of the 44th Annual International
Symposium on Computer Architecture, ser. ISCA ’17, 2017.
[58] U. Milic, O. Villa, E. Bolotin, A. Arunkumar, E. Ebrahimi, A. Jaleel, A. Ramirez,
and D. Nellans, “Beyond the socket: NUMA-aware GPUs,” in Proceedings of the
50th Annual IEEE/ACM International Symposium on Microarchitecture. ACM,
2017, pp. 123–135.
[59] V. Young, A. Jaleel, E. Bolotin, E. Ebrahimi, D. Nellans, and O. Villa, “Combining
HW/SW Mechanisms to Improve NUMA Performance of Multi-GPU Systems,”
in Proceedings of the 51st Annual IEEE/ACM International Symposium on
Microarchitecture. ACM, 2018.
[60] NVIDIA. NVIDIA Pascal GPU Architecture. [Online]. Available:
https://www.nvidia.com/object/pascal-architecture-whitepaper.html
[61] NVIDIA. NVIDIA Volta GPU Architecture. [Online]. Available:
https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
[62] H. Qi, E. R. Sparks, and A. S. Talwalkar, “Paleo: a performance model for deep
neural networks,” 2017.
[63] S. Lym, D. Lee, M. O’Connor, N. Chatterjee, and M. Erez, “DeLTA: GPU
Performance Model for Deep Learning Applications with In-depth Memory
System Traffic Analysis,” in Proceedings of the IEEE International Symposium on
Performance Analysis of Systems and Software (ISPASS), 2019.
[64] A. Krizhevsky, V. Nair, and G. Hinton, “Cifar-100 (canadian institute for advanced
research).” [Online]. Available: http://www.cs.toronto.edu/~kriz/cifar.html
[65] P. Goyal, P. Dollár, R. B. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola,
A. Tulloch, Y. Jia, and K. He, “Accurate, large minibatch sgd: Training imagenet
in 1 hour,” CoRR, 2017.
[66] C. J. Shallue, J. Lee, J. Antognini, J. Sohl-Dickstein, R. Frostig, and G. E. Dahl,
“Measuring the effects of data parallelism on neural network training,” arXiv
preprint arXiv:1811.03600, 2018.
[67] Y. Wu and K. He, “Group Normalization,” in ECCV, 2018.
[68] T.-Y. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona,
D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common Objects
in Context,” in ECCV, 2014.
