Performance Evaluation of Advanced Features in CUDA Unified Memory by Chien, Steven W. D. et al.
Performance Evaluation of Advanced Features in
CUDA Unified Memory
Steven W. D. Chien
KTH Royal Institute of Technology
Stockholm, Sweden
wdchien@kth.se
Ivy B. Peng
Lawrence Livermore National Laboratory
Livermore, USA
peng8@llnl.gov
Stefano Markidis
KTH Royal Institute of Technology
Stockholm, Sweden
markidis@kth.se
Abstract—CUDA Unified Memory improves the GPU pro-
grammability and also enables GPU memory oversubscription.
Recently, two advanced memory features, memory advises and
asynchronous prefetch, have been introduced. In this work, we
evaluate the new features on two platforms that feature different
CPUs, GPUs, and interconnects. We derive a benchmark suite
for the experiments and stress the memory system to evaluate
both in-memory and oversubscription performance.
The results show that memory advises on the Intel-Volta/Pascal-
PCIe platform bring negligible improvement for in-memory exe-
cutions. However, when GPU memory is oversubscribed by about
50%, using memory advises results in up to 25% performance
improvement compared to the basic CUDA Unified Memory. In
contrast, the Power9-Volta-NVLink platform can substantially
benefit from memory advises, achieving up to 34% performance
gain for in-memory executions. However, when GPU memory is
oversubscribed on this platform, using memory advises increases
GPU page faults and results in considerable performance loss.
The CUDA prefetch also shows different performance impact on
the two platforms. It improves performance by up to 50% on the
Intel-Volta/Pascal-PCI-E platform but brings little benefit to the
Power9-Volta-NVLink platform.
Index Terms—CUDA Unified Memory, UVM, CUDA memory
hints, GPU, memory oversubscription
I. INTRODUCTION
Recently, leadership supercomputers are becoming increas-
ingly heterogeneous. For instance, the two fastest super-
computers in the world [17], Summit and Sierra, are both
equipped with Nvidia V100 GPUs [6], [12] for accelerating
workloads. One major challenge in programming applications
on these heterogeneous systems arises from the physically
separate memories on the host (CPU) and the device (GPU).
Kernel execution on GPU can only access data stored on the
device memory. Thus, programmers either need to explicitly
manage data using the memory management API in CUDA or
relying on programming systems, such as OpenMP 4.5 [7] and
RAJA [5], for generating portable programs. Today, a GPU can
have up to 16 GB memory on top supercomputers while the
system memory on the host can reach 256 GB. Leveraging the
large CPU memory as a memory extension to the relatively
small GPU memory becomes a promising and yet challenging
task for enabling large-scale HPC applications.
CUDA Unified Memory (UM) addresses the challenges as
mentioned above by providing a single and consistent logical
view of the host and device memories on a system. UM
uses the virtual memory abstraction to hide the heterogeneity
in GPU and CPU memories. Therefore, pages in the virtual
address space in an application process may be mapped to
physical pages either on CPU or GPU memory. Based on UM,
CUDA runtime can leverage page faults, which is supported
on recent GPU architectures, e.g., Nvidia Pascal and Volta
architectures, to enable automatic data migration between de-
vice and host memories. For instance, when a device accesses
a virtual page that is not mapped to a physical page on the
device memory, a page fault is generated. Then, the runtime
resolves the fault by remapping the page to a physical page
on the device memory and copying the data. This procedure
is also called on-demand paging. Now with the hardware-
supported page fault and the runtime-managed data migration
on UM, oversubscribing the GPU memory becomes feasible.
For instance, when there is no physical memory available on
the device for newly accessed pages, the runtime evicts pages
from GPU to CPU and then bring the on-demand page.
CUDA has introduced new features for optimizing the data
migration on UM, i.e., memory advises and prefetch. Instead
of solely relying on page faults, the memory advises feature
allows the programmer to provide data access pattern for each
memory object so that the runtime can optimize migration
decisions. The prefetch proactively triggers asynchronous data
migration to GPU before the data is accessed, which reduces
page faults and, consequently, the overhead in handling page
faults.
In this paper, we evaluate the effectiveness of these new
memory features on CUDA applications using UM. Due
to the absence of benchmarks designed for this purpose,
we developed a benchmark suite of six GPU applications
using UM. We evaluate the impact of the memory features
in both in-memory and oversubscription executions on two
platforms. The use of memory advises results in performance
improvement only when when we oversubscribe the GPU
memory on the Intel-Volta/Pascal-PCI-E systems. On Power9-
Volta-NVLink based system, using memory advises leads to
performance improvement only for in-memory executions.
With GPU memory oversubscription, it results in substantial
performance degradation. Our main contributions in this work
are as follows:
• We survey state-of-art practice in UM memory advises,
prefetch, and GPU memory oversubscription.
ar
X
iv
:1
91
0.
09
59
8v
1 
 [c
s.D
C]
  2
1 O
ct 
20
19
CPU GPU
(1) *a = 1
(2) Page fault (3) Unmap from GPU
(4) Map on Host
(5) *a = 1
Fig. 1: CPU writes to a page resident on the GPU, triggering
a page fault and the page is migrated to CPU.
• We design a UM benchmark suite consisting of six
applications for evaluating advanced memory features.
• We evaluate the performance impact of memory advises,
prefetch on two systems with Intel, Nvidia Pascal, and
Volta GPUs connected via PCI-E and a system with IBM
Power9 and Nvidia Volta GPU connected via NVLink.
• Our results indicate that using memory advises improves
application performance in oversubscription execution on
the Intel platform and in-memory executions on the IBM
platform.
• We show that UM prefetch provides a significant per-
formance improvement on the Intel-Volta/Pascal-PCI-E
based systems while it does not show a performance
improvement on the Power9-Volta-NVLink based system.
II. UNIFIED MEMORY
In this section, we introduce the underlying mechanism in
GPU UM, and the three memory advises. We also describe
the prefetching and memory oversubscription.
A. CUDA Unified Memory
UM creates a unified logical view of the physically separate
memories across host and GPU. Currently, modern CPUs sup-
port 48-bit memory addresses while Unified Memory uses 49-
bit virtual addressing, which can address both host and GPU
memories [14]. One of the main goals of Unified Memory is to
provide a consistent view of data between devices. The system
ensures a memory page can only be accessed by one process at
a time. When a process accesses a page that is not resident of
its memory system, a page fault occurs. The memory system
holding the requested page will unmap it from its page table,
and the page will be migrated to the faulting process. Figure 1
illustrates an example when the CPU accesses a page on GPU
memory, and the page is migrated to CPU memory. Similarly,
when GPU accesses a page not physically stored on GPU
memory, the page will be moved to GPU.
UM was first introduced in CUDA 6.0 [21]. Only until
the recent Nvidia Pascal microarchitecture that has hardware
support for page faults, bi-directional on-demand page migra-
tion becomes feasible [14]. Resolving a page fault has high
overhead, and memory thrashing that moves the same pages
back and forth between the memories is even a performance
bottleneck. The massive parallelism on GPU further exacer-
bates the page fault overhead because processes stall when
page faults are being resolved, and multiple threads in different
warps accessing the same page can cause multiple duplicated
faults [18].
B. Data Movement Advises
CUDA 8.0 introduces a new programming interface,
called memory advise [15]. The concept is similar to
posix_madvise in Linux, which uses application knowl-
edge about access patterns to make informed decisions on
page handling [1]. The UM advise focuses on data locality,
i.e., whether a page is likely to be accessed from the host
or device. The main objective is to reduce unnecessary page
migration and their associated overhead. Currently, developers
can specify three access patterns to the CUDA runtime:
cudaMemAdviseSetReadMostly implies a read-intensive
data region. In the basic UM, accessing a page on
a remote side triggers page migration. However, with
cudaMemAdviseSetReadMostly, a read-only duplicate
of the page will be created on the faulting side, which
prevents page faults and data migration in the future. Figure 2a
illustrates an example, where the second access (step 5) has
no page fault and is local access. This mechanism, however,
results in a high overhead if there is any update to this memory
region because all copies of the corresponding page will be
invalidated to preserve consistency between different copies.
Thus, this advice is often used in read-only data structures,
such as lookup tables and application parameters.
cudaMemAdviseSetPreferredLocation sets the preferred
physical location of pages. This advice pins a page and
prevents it from migrating to other memories. Figure 2b
illustrates a page preferred on the host side, and GPU
uses remote mapping to access the page. This advice es-
tablished a direct (remote) mapping to the memory page.
When accessing the page remotely, data is fetched through
the remote memory instead of generating a page fault. If
the underlying hardware does not support the remote map-
ping, the page will be migrated as in the standard UM.
cudaMemAdviseSetPreferredLocation is useful for
applications with little data sharing between CPU and GPU,
i.e., part of the application is executed completely on the GPU,
and the rest of the application executes on the host. Data that
is being used mostly by the GPU can be pinned to the GPU
with the advice, avoiding memory thrashing.
cudaMemAdviseSetAccessedBy establishes a direct map-
ping of data to a specified device. Figure 2c illustrates an
example of a physical page on GPU being remotely access
from the host. When cudaMemAdviseSetPreferredLocation is
applied, CUDA runtime tries to build a direct mapping to the
page to avoid data migration so that the destination can access
data remotely. Differently from cudaMemAdviseSetPreferred-
Location, this cudaMemAdviseSetAccessedBy does not try to
pin pages on a specific device; instead, its main effect is to
establish mapping on the remote device. This advice takes
effect on the creation of the memory pages. The mapping will
be re-established after the pages are migrated.
CPU GPU
2) int x = *a;
1) Mark Read-mostly 3) Page fault
4) Create read only copy
5) Access
6) Access (without fault)
(a) A read-mostly region is duplicated to the
GPU to avoid page faults in the future.
CPU GPU
(1) *a = 1
(2) page fault
(4) Remote
     access
(3) create
     mapping
(b) A host-preferred region is directly re-
mote mapped to allow remote access from
the GPU.
CPU GPU A
(1) create
    mapping
(2) *a = 1
(3) Remote access without fault
GPU B
(4) Mapping re-created even after migration
(c) A GPU-resident region with accessed-by
CPU advise can be accessed by CPU through
remote memory access.
Fig. 2: Page fault mechanism and effects of the three Memory Advise in Unified Memory.
C. Prefetching
The CUDA interface introduces an asynchronous page
prefetching mechanism, i.e., cudaMemPrefetchAsync() [15],
to trigger data migration. The data migration occurs in a
background CUDA stream to avoid stalling the computation
threads. One natural optimization for prefetching a large
number of pages is to split into multiple streams, i.e., a bulk
transfer, to prefetch pages in a batch of streams concurrently.
If the page is prefetched to the device memory before the data
access, no page faults will occur, and the GPU benefits from
the high bandwidth on its local memory.
The behavior of the prefetching mechanism might change
when used in combination with CUDA memory advises. For
example, when cudaMemAdviseSetReadMostly is set, a read-
only copy will be immediately created. Also, when prefetching
a region with cudaMemAdviseSetPreferredLocation set to an-
other destination memory, the pages will no longer be pinned
to the preferred location. Thus, our evaluation considers the
interplay between these two types of memory features.
D. Oversubscription of Device Memory
GPU memory has a relatively small capacity compared
to the system memory on CPU. One major limitation when
porting large-scale applications to GPUs is to overcome their
memory capacity to enable larger problems. UM in the post-
Pascal page fault capable GPUs can oversubscribe GPU mem-
ory, allowing GPU kernels to use more memory than the
physical capacity on the device. The memory oversubscription
is achieved through the traditional virtual memory manage-
ment, i.e., selected memory pages on the device are evicted
to CPU to make space for newly requested pages. Currently,
the CUDA runtime uses the Least Recently Used (LRU)
replacement policy to select victim pages when running out
of space [19]. Some work also proposed pre-eviction to start
page eviction early to avoid stalling on the critical path [3].
III. METHODOLOGY
We develop a benchmark suite for evaluating UM and
different data migration policies. Although several porting
efforts have been reported for specific applications, there
lacks a suite of diverse kernels for controlled experiments
across platforms. Thus, we extend the memory management in
popular GPU benchmark and applications to utilize UM with
advanced advise and prefetching features.
A. Application and Benchmarks
Our benchmark suite includes six applications, as specified
in Table I. These applications include numerical solvers,
financial application, image processing, and graph problems.
The benchmark suite is available at a repository 1.
For each application, we develop four versions in addition to
the original version that uses explicit GPU memory allocation.
Our benchmarks use long data types to support large input
problems in oversubscription executions. We use GPU kernel
execution time as the figure of merit.
We present detailed tracing results for BS, CG, and FDTD3d
on selected platforms to study the implications of data move-
ment. BS is a financial application that performs option pric-
ing. BS features good data reuse because the same input data
set is used in multiple iterations in the application lifetime. CG
is a linear solver that solves a linear system Ax = b on the
GPU. An error is computed on the host using the results from
GPU computation after the solving iteration finishes. FDTD3d
is a finite difference solver that reads and writes to two arrays
in an interleaving manner. Both arrays are being initialized
using the same data. The output eventually resides in one of
the arrays.
1) UM: The first version is an implementation that uses
UM with minimal changes. We simply replace the memory
allocation in applications from cudaMalloc() to cudaMalloc-
Managed() and eliminate explicit data copy, i.e., cudaMem-
cpy(), between host and device. After the completion of a GPU
kernel, if the application has no subsequent host computation
using the GPU results, an explicit data copy by memcpy() is
inserted to simulate a CPU computation using the results.
2) UM Advise: The second version is UM with Advise.
This version is based on the basic UM version and applies
memory advises to data structures in the application. A stall
in GPU execution, e.g., for resolving page fault, has a sig-
nificant impact on performance due to massive parallelism.
Thus, the main consideration for memory advises is to keep
data used by GPU close to GPU memory. Therefore, we
set a cudaMemAdviseSetPreferredLocation and specify the
preferred location to GPU memory after the memory allo-
cation of a data structure that is accessed by GPU in the
computation. If the data structure is initialized by the CPU,
1https://github.com/steven-chien/um-apps
TABLE I: Applications and data input sizes on different platforms.
Name Description
Input size Intel-Pascal
(Approximate GB)
Input size Intel-Volta & P9-Volta
(Approximate GB)
In-memory Oversubscribe In-memory Oversubscribe
Black-Scholes (BS) A financial application that performs option pricing. 4 6.4 15.2 26
Matrix Multiplication (cuBLAS) A general matrix matrix multiplication in single precision using cuBLAS. 3.9 6.3 15.2 25.4
Conjugate Gradient (CG) A conjugate gradient solver that solves a sparse linear system using cusparse. 3.8 6.4 15.4 25.4
Graph500 Breadth-first search (BFS) kernel of Graph500. 3.63 7.62 8.52 N/A
Convolution 0 (conv0) A FFT-based image convolution using Real-to-Complex and Complex-to-Real FFT plans. 2.8 6.4 11.6 25.6
Convolution 1 (conv1) A FFT-based image convolution using Complex-to-Complex FFT plan. 3.5 6.7 13.6 25.5
Convolution 2 (conv2) A FFT-based image convolution using Complex-to-Complex FFT plan. 3.0 6.4 11.6 25.5
Finite Difference Time Domain (FDTD3d) A finite difference solver in three dimension. 3.8 6.4 15.2 25.3
we set a cudaMemAdviseSetAccessedBy CPU to keep the data
physically on GPU but establish a remote mapping on CPU.
With this optimization, the host data initialization performs
remote accesses to initialize data in GPU memory directly. For
constant data structures, the cudaMemAdviseSetReadMostly
advice is set after data initialization. This optimization will
only have page fault at the first access but keep all subsequent
accesses local.
3) UM Prefetch: The third version is UM with prefetch. We
apply cudaMemPrefetchAsync to trigger page migration
at appropriate sites explicitly. We prefetch large data structures
that will be accessed by GPU kernels in a background stream
while the GPU kernel is launched in the default stream.
After completing the GPU kernel execution, we prefetch the
arrays containing results to the host memory in the default
stream. One advantage of bulk transfer in prefetch, compared
to resolving individual page fault groups, is high memory
bandwidth to utilize the hardware capability fully. Explicitly
triggering page pages in bulk improves transfer efficiency.
Furthermore, to prefetch pages avoids page faults as data
already resides in the physical memory when the kernel starts
executing.
4) UM Both: Finally, in the fourth version, we combine
memory advises and prefetch to examine the mutual effects
of both techniques.
B. Test Environment
We evaluate our benchmark applications on three platforms:
1) Intel-Pascal is a single node system with Intel Core i7-
7820X processor and 32 GB of RAM. It has one GeForce
GTX 1050 ti GPU with 4GB memory. The GPU is
connected through PCIe. The operating system is Ubuntu
18.10 and the host compiler is GCC 8.3.
2) Intel-Volta is a GPU node on Kebnekaise at HPC2N in
Ume. It has an Intel Xeon Gold 6132 processor with 192
GB of RAM. The node has two Tesla V100 GPU with 16
GB memory and the GPU is connected through PCIe. The
operating system is Ubuntu 16.04 and the host compiler
is GCC 8.2.
3) P9-Volta is a node with an IBM Power9 processor and
256 GB of RAM. The system has four Tesla V100 GPUs
with 16 GB of HBM. The GPU is connected through
NVLINK to CPU.
Our platforms consist of two Intel systems that use Pascal
and Volta GPUs, and a Power9 system that uses Volta GPU.
All the systems use CUDA 10.1. We only use one GPU in
the experiments. For each application variation, we perform
benchmark runs up to five times and present the average GPU
kernel execution time and standard deviation. An exception is
Graph500, where we report the average and standard deviation
of BFS iterations. We separate our experiments into two
cases: when problem size fits into GPU memory and when
oversubscription of memory is required. Their problem sizes
are selected to be approximately 80% and 150% to GPU
memory, respectively. A detailed list of sizes is presented in
Table I. Due to the limitation in implementation for input
data size, we only examine Graph500 with oversubscription
on Intel-Pascal. However, the input size does not follow the
150% data size rule.
Apart from benchmark executions, we perform profiling
runs using nvprof for selected applications. We obtain the trace
by –print-gpu-trace. By selecting entries with Unified Memory
Memcpy HtoD and Unified Memory Memcpy DtoH, we can
build a time series of data movement. Through a comparison
of the time series and time spent on memory movement, it
is possible to compare and characterize the intensity of data
movement between different application variations.
IV. RESULTS
In this section, we present the performance and profiling
results of the applications in four configurations: basic UM
(UM), UM with Advise (UM Advise), UM with Prefetch (UM
Prefetch) and UM with both Advise and Prefetch (UM Both).
Each application is evaluated in each configuration with two
problem sizes: one that fits into GPU memory (in-memory
execution) and one that oversubscribes GPU memory (over-
subscription execution). We report the average and standard
deviation of GPU kernel execution time for each application.
A. In-Memory Execution
We present the GPU kernel execution time of the applica-
tions in Fig. 3. The performance of all applications decreases
when using basic UM instead of explicit data movement be-
tween CPU and GPU memories. Performances on Volta GPUs
platforms have a larger performance decrease. In particular,
our convolution and FDTD3d applications exhibit a drastic
increase in execution time. The execution time of conv2 and
FDTD3d are 14× and 9× higher respectively on P9-Volta.
Performance change is similar to Intel-Volta. Performance
decrease is less drastic but still considerable on Intel-Pascal.
The execution of both applications is 2− 3× slower than the
execution time of applications using explicit data movement.
(a) Intel-Pascal (b) Intel-Volta (c) Power9-Volta.
Fig. 3: GPU kernel execution time of applications where data fits in GPU memory.
(a) BS on Intel-Pascal (b) CG on Intel-Pascal (c) BS on P9-Volta (d) CG on P9-Volta
Fig. 4: Breakdown of total time spent handling page faults and data movement when applications are running in-memory.
(a) BS Intel-Pascal (b) CG Intel-Pascal (c) BS P9-Volta (d) CG P9-Volta
Fig. 5: UM data transfer traces when running in-memory.
After applying advises, the performance of our applications
generally improves. It is possible to improve execution time
up to 15% on Intel-based platforms. The impact of advises
is higher for the three FFT based convolution applications on
Intel platforms. Advises have a significant impact on all the ap-
plications, and execution time can be improved by up to 70%
on the P9 platform. Applications, such as CG and cuBLAS,
results in similar execution time to the original version with
explicit memory allocation. This implies that some advises are
more effective than others on the P9 platform.
Expensive page fault handling can be avoided by prefetching
data to the GPU before execution. Our results show that
prefetch has a much higher impact on Intel-based platforms
than P9-based platforms. Application performance generally
improves when prefetch is used: our results show that it has
a much higher impact on Intel-based platforms than the case
advise is used. The performance of FDTD3d improves by up
to 56% on the Intel-Pascal system. The performance of Black-
Scholes application is close to on-par with the application
version using explicit data transfer. As for Intel-Volta, the
performance of FDTD3d improves by up to 65%. Performance
improves by 50% on the P9-Volta system. However, the
improvement is less than when only advises are applied.
Despite that, we notice that when both advises and prefetch
are used together, it generally outperforms the performance of
applications using only advises or prefetch.
To better understand the difference in terms of data move-
ment between the versions, we plot the total time spent on
different UM events in Fig. 4 as stacked bar plots. They
show the total time spent on GPU page fault group handling
and data transfer, respectively. In particular, we have selected
BS and CG for the comparison. The bar plot reveals two
important information for comparison: the time spent on data
movement, which correlates to the amount and efficiency of
data transferred, as well as stalls due to page fault, which
correlates to the number of page fault and efficiency of fault
resolution.
Since the Black-Scholes application uses the same input
dataset repeatedly over iterations, when data size fits in mem-
ory, the first iteration will be slower due to page migration.
Subsequent iterations should be able to execute at full speed as
data already resides in device memory. For this particular ap-
plication, the advise cudaMemAdviseSetReadMostly is applied
to the input arrays. No other advise is applied. The same goes
for prefetch. Figs. 4a and 4c show the break down of total time
spent on data-related activities on the two platforms for the
Black-Scholes application. Comparing to Intel-Pascal, the data
transfer is much faster on P9-Volta, while the impact of stalling
is less profound on Intel-Pascal. This can be attributed to the
larger input data used and a faster interconnect on P9-Volta.
For UM Advise, the time spent on data transfer is similar
while the time spent stalling due to page fault has reduced.
This suggests that page fault handling becomes more efficient
when the advises are applied. The observation is similar for
both Intel-Pascal and Intel-Volta similarly. When prefetch is
used, the same amount of data is being transferred while the
stall due to page fault is eliminated. This implies the complete
elimination of page faults. By prefetching pages in bulk, data
can be transferred at a fast pace to avoid future page faults
when accessed. The observation can be confirmed by Figs. 5a
and 5c, where the detailed transfers are plotted as a time series.
When prefetch is applied, data is transferred as a block at a
much higher rate.
The Conjugate Gradient application solves a linear system
Ax = b iteratively. When applying advises, we set the
preferred location of matrix A and vector b to GPU mem-
ory. We also set a read-mostly advise on the sparse matrix
after completing initialization. The breakdown of time spent
on Intel-Pascal and P9-Volta are shown in Figs.4b and 4d,
respectively. The use of advises results in similar time spent on
data transfer from host to device but a slight reduction in time
on stalls on Intel-Pascal system. A considerable reduction in
time spent on the host to device transfer and stall is observed
on the P9-Volta system. One reason is the use of preferred
location advise, where the data arrays are initialized from the
host on GPU memory through remote memory access. On
Power9, it is possible for the CPU to access GPU memory
while this is not possible on Intel platforms. At the same
time, time spent on transfer from device to host is largely
eliminated on Intel-Pascal. One possible reason is due to the
read-mostly advise. Instead of migrating pages to the GPU
from host memory, a read-only copy is copied to the GPU.
This means that a copy of data exists in both memory systems.
When the Ax is being computed, A can be fetched directly in
host memory. Since P9-Volta initializes data directly in GPU
memory, a copy has to be fetched back to the host. In this
case, the naive use of prefetch results in a reduction of time
spent stalling. Despite the fact that more data is transferred
from device to host, the use of prefetch results in a higher
transfer rate. The data transfer trace is presented in Figs. 5b
and 5d. When used in combination with advises, it results in
a reduction of time for data transfer and stall.
B. Oversubscription Execution
Oversubscription of GPU memory is a key new feature
of UM. It resembles the paging of unused memory pages
to secondary storage to free up memory in classical virtual
memory management. Similarly to the CPU memory subscrip-
tion case, excessive use can lead to system slowdown and
can severely impact performance. Our results show that all
applications execute correctly, even when running out of GPU
memory. However, techniques that improved performance
for in-memory do not necessarily perform well when GPU
memory is oversubscribed. On the contrary, the use of these
techniques without careful optimization can lead to severe
performance degradation.
We present the execution time of our applications in Fig. 6.
Since the case does not exist with original versions with
explicit allocation, a comparison is not possible. Instead, the
minimal UM version is used as a baseline. By using advise,
specific applications can achieve up to over 20% improvement
on Intel platforms. Our P9 platform, on the other hand, shows
a negative impact when advises are used. To better understand
data movement, we perform tracing with the BS and CG on
Intel-Pascal, and with BS and FDTD3d on P9-Volta.
For the Black-Scholes application, the use of advise results
in performance improvement on Intel-Pascal. Fig. 7a shows
the breakdown of time spent on page-fault related events of
BS between host and device while Fig. 8a shows the detailed
tracing on Intel Pascal. One significant difference between
default UM and UM advise is that a lot less time is spent
on transferring data back to the host. The reduction in data
movement can contribute to the improvement in performance.
One possible reason for the reduction in data transfer from
device to host is that instead of migrating data from GPU to
host memory to make space, read-only data can simply be
discarded as a copy already exists on host memory. On the
other hand, on P9-Volta, significantly more time is spent on
stalls. This can be seen in Fig. 7c, where the total time is a few
times higher than when no advise is used. Fig. 8c examines
the data movement traces and clearly shows an intense data
movement in both directions. This implies that the read-mostly
advise has an interestingly negative effect on P9-Volta when
data size exceeds device memory. A naive prefetch on Intel-
Pascal provides performance improvement; however, it has
little to no effect on P9-Volta.
CG on the Intel-Pascal platform benefits from using advise.
The time breakdown for page faults and data movement
is shown in Fig. 7b. As in the case of the Black-Scholes
application, less time is spent on transferring data back to the
host than in the case of basic UM. However, we note that a
similar amount of data is sent from host to device in the two
cases. This can also be seen in the detailed tracing in Fig. 8b,
where less device to host transfer is made.
FDTD3d is a finite difference solver, and it reads and
writes to two arrays in an interleaving manner. Both arrays
are being initialized using the same data. One of the arrays is
being set to prefer GPU memory and will be accessed by
the CPU. No advise is set on the other array. Since both
arrays will be written to during execution, no read-mostly
advise is set for them. However, read-mostly is set for a small
array that contains coefficients. Fig. 7d shows the breakdown
(a) Intel-Pascal (b) Intel-Volta (c) Power9-Volta.
Fig. 6: GPU kernel execution time of applications where data do not fit in GPU memory.
(a) BS on Intel-Pascal (b) CG on Intel-Pascal (c) BS on P9-Volta (d) FDTD3d on P9-Volta
Fig. 7: Breakdown of total time spent handling page faults and data movement when input size exceeds GPU memory.
(a) BS Intel-Pascal (b) CG Intel-Pascal (c) BS P9-Volta (d) FDTD3d P9-Volta
Fig. 8: UM data transfer traces when input size exceeds GPU memory.
of time in handling data movement and page faults on P9-
Volta. Similarly to the Black-Scholes application, the usage
of advise results in much higher spent on stalling. Execution
time also increased significantly by approximately 3×. When
prefetching, only one of those two data arrays is prefetched
as they are originally identical. Interestingly, less data is seen
transferring in both directions when prefetch is used. Fig. 8d
shows the detailed tracing of the application. Smaller data
transfers at the beginning become a bulk transfer. This is also
reflected in the execution time, which is reduced from 60.9s
to 45.3s as well as a reduction in time spent stalling. One
possible reason is the size of the array being prefetched. Since
only one array, which represents 50% of the total problem
size, is prefetched, the entire array can reside entirely on GPU
memory without needed to evict previously prefetched data.
V. RELATED WORK
The separate memory system between host and GPU has
long been a programming challenge for developers. With UM,
the runtime can transparently handle data movement between
CPU and GPU. Earlier works [11], [13] have investigated
the impact of UM in applications while [13] investigated
the programming model support for UM in OpenMP through
an extended LLVM compiler. These studies lack the support
of advanced memory features, which only become available
recently. Recent efforts in the operating system, such as
Heterogeneous Memory Management (HMM) in the Linux
kernel [4], [8], [20], provides mechanisms to mirror CPU page
table on GPU and integrate device memory pages in the system
page table by adding a new type of struct page.
CPU to GPU interconnect is another factor that impacts
the performance of data movement directly. Extensive efforts
have reported evaluation on modern GPU systems [6], [9]. For
instance, [16] developed a microbenchmark tool to evaluate
the raw bandwidth performance with UM. While their works
focus on interconnect performance and provide optimization
insights, our work focuses on the impact of advanced memory
features in optimizing the locality of pages.
Some of the recent works that apply advanced features
of UM are Deep-Learning frameworks. One example is OC-
DNN [2], an extended Caffe framework that uses UM to
support the training of out-of-core batch sizes. They use
memory advises to trigger data eviction and prefetch to trigger
migration. They find these techniques useful in optimizing
training performance. However, incorrect use can lead to
performance degradation.
The memory oversubscription in GPU memory requires
efficient page eviction to make space for newly requested
pages. [3] proposed two pre-eviction policies using a tree-
based neighborhood prefetching technique to select candidate
pages. [10] introduced an ETC framework for eager page pre-
eviction and memory throttling in memory trashing. However,
these optimization techniques target future GPU designs that
require hardware modifications.
VI. CONCLUSION
In this work, we investigated the impact of UM memory
advises, prefetch, and GPU memory oversubscription, on
CUDA application performance. We found that the perfor-
mance of memory advises mostly depends on the system in
use and whether the GPU memory is oversubscribed. The
use of memory advises results in a performance improvement
only when the GPU memory is oversubscribed on the Intel-
Volta/Pascal-PCI-E systems. The use of memory advises on
Power9-Volta-NVLink based system, leads to a performance
improvement when applications run in-memory while it re-
sults in a considerable performance degradation with GPU
memory oversubscription. CUDA Unified prefetch provides
a performance improvement only on the Intel-Volta/Pascal-
PCI-E based systems while it does not show a performance
improvement on the Power9-Volta-NVLink based system.
In this work, we have set memory advises for each memory
object following best-practice guidelines from Nvidia. How-
ever, a future study on how to select optimal advise placement
would help programmers derive different combinations of
advises in different applications. In general, we found both
memory advises and prefetch to be simple and effective.
Overall, we showed that UM is a promising technology that
can be used effectively when programming applications for
GPU systems.
ACKNOWLEDGMENT
Funding for the work is received from the European Commission H2020 program,
Grant Agreement No. 801039 (EPiGRAM-HS). Experiments were performed on re-
sources provided by the Swedish National Infrastructure for Computing (SNIC) at
HPC2N and Lassen supercomputer at LLNL. Part of this work was performed under the
auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory
under Contract DE-AC52-07NA27344 LLNL-PROC-788778. This research was also
supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of
the U.S. Department of Energy Office of Science and the National Nuclear Security
Administration.
REFERENCES
[1] Linux programmer’s manual. http://man7.org/linux/man-pages/man3/
posix madvise.3.html, 2019.
[2] Ammar Ahmad Awan, Ching-Hsiang Chu, Hari Subramoni, Xiaoyi Lu,
and Dhabaleswar K Panda. OC-DNN: Exploiting Advanced Unified
Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core
DNN Training. In 2018 IEEE 25th International Conference on High
Performance Computing (HiPC), pages 143–152. IEEE, 2018.
[3] Debashis Ganguly, Ziyu Zhang, Jun Yang, and Rami Melhem. Interplay
Between Hardware Prefetcher and Page Eviction Policy in CPU-GPU
Unified Virtual Memory. In Proceedings of the 46th International
Symposium on Computer Architecture, ISCA ’19, pages 224–235. ACM,
2019.
[4] Jerome Glisse. Redhat heterogeneous memory management.
https://linuxplumbersconf.org/event/2/contributions/70/attachments/
14/6/hmm-lpc18.pdf, 2018.
[5] Richard D Hornung and Jeffrey A Keasler. The raja portability layer:
overview and status. Technical report, Lawrence Livermore National
Lab.(LLNL), Livermore, CA (United States), 2014.
[6] Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele P Scarpazza.
Dissecting the NVIDIA Volta GPU Architecture via Microbenchmark-
ing. arXiv preprint arXiv:1804.06826, 2018.
[7] Ian Karlin, Tom Scogland, Arpith C Jacob, Samuel F Antao, Gheorghe-
Teodor Bercea, Carlo Bertolli, Bronis R de Supinski, Erik W Draeger,
Alexandre E Eichenberger, Jim Glosli, et al. Early experiences porting
three applications to openmp 4.5. In International Workshop on
OpenMP, pages 281–292. Springer, 2016.
[8] The kernel development community. The linux kernel 4.18.0. https:
//www.kernel.org/doc/html/v4.18/vm/hmm.html, 2019.
[9] Ang Li, Shuaiwen Leon Song, Jieyang Chen, Xu Liu, Nathan Tallent,
and Kevin Barker. Tartan: Evaluating Modern GPU Interconnect via a
Multi-GPU Benchmark Suite. In 2018 IEEE International Symposium
on Workload Characterization (IISWC), pages 191–202. IEEE, 2018.
[10] Chen Li, Rachata Ausavarungnirun, Christopher J. Rossbach, Youtao
Zhang, Onur Mutlu, Yang Guo, and Jun Yang. A framework for memory
oversubscription management in graphics processing units. In Proceed-
ings of the Twenty-Fourth International Conference on Architectural
Support for Programming Languages and Operating Systems, ASPLOS
’19, pages 49–63, New York, NY, USA, 2019. ACM.
[11] Wenqiang Li, Guanghao Jin, Xuewen Cui, and Simon See. An
Evaluation of Unified Memory Technology on Nvidia GPUs. In 2015
15th IEEE/ACM International Symposium on Cluster, Cloud and Grid
Computing, pages 1092–1098. IEEE, 2015.
[12] S. Markidis, S. W. D. Chien, E. Laure, I. B. Peng, and J. S. Vetter.
NVIDIA Tensor Core Programmability, Performance Precision. In 2018
IEEE International Parallel and Distributed Processing Symposium
Workshops (IPDPSW), pages 522–531, May 2018.
[13] Alok Mishra, Lingda Li, Martin Kong, Hal Finkel, and Barbara Chap-
man. Benchmarking and evaluating unified memory for OpenMP GPU
offloading. In Proceedings of the Fourth Workshop on the LLVM
Compiler Infrastructure in HPC, page 6. ACM, 2017.
[14] NVIDIA. P100 white paper. NVIDIA Corporation, 2016.
[15] NVIDIA. CUDA C Programming Guide. NVIDIA Corporation, 2019.
[16] Carl Pearson, Abdul Dakkak, Sarah Hashash, Cheng Li, I-Hsin Chung,
Jinjun Xiong, and Wen-Mei Hwu. Evaluating Characteristics of
CUDA Communication Primitives on High-Bandwidth Interconnects.
In Proceedings of the 2019 ACM/SPEC International Conference on
Performance Engineering, ICPE ’19, pages 209–218. ACM, 2019.
[17] The TOP500 project. Top500 lists. https://www.top500.org/lists/2019/
06/, 2019.
[18] Nikolay Sakharnykh. Maximizing unified memory
performance in cuda. https://devblogs.nvidia.com/
maximizing-unified-memory-performance-cuda/, 2017.
[19] Nikolay Sakharnykh. Unified memory on pascal and
volta. http://on-demand.gputechconf.com/gtc/2017/presentation/
s7285-nikolay-sakharnykh-unified-memory-on-pascal-and-volta.pdf,
2017.
[20] Nikolay Sakharnykh. Unified memory on pascal and volta. In GPU
Technology Conference (GTC), 2017.
[21] Nikolay Sakharnykh. Everything you need to know about unified
memory. NVIDIA GTC, 2018.
