Improving Multi-Application Concurrency Support Within the GPU Memory
  System by Ausavarungnirun, Rachata et al.
Rachata Ausavarungnirun⋆ Christopher J. Rossbach†‡ Vance Miller†
Joshua Landgraf† Saugata Ghose⋆ Jayneel Gandhi‡
Adwait Jog§ Onur Mutlu∗⋆
⋆Carnegie Mellon University †University of Texas at Austin ‡VMware Research Group
§College of William and Mary ∗ETH Zurich
ABSTRACT
GPUs exploit a high degree of thread-level parallelism to
efficiently hide long-latency stalls. Thanks to their latency-
hiding abilities and continued improvements in programma-
bility, GPUs are becoming a more essential computational
resource. Due to the heterogeneous compute requirements
of different applications, there is a growing need to share
the GPU across multiple applications in large-scale com-
puting environments. However, while CPUs offer relatively
seamless multi-application concurrency, and are an excellent
fit for multitasking and for virtualized environments, GPUs
currently offer only primitive support for multi-application
concurrency.
Much of the problem in a contemporary GPU lies within
the memory system, where multi-application execution re-
quires virtual memory support to manage the address spaces
of each application and to provide memory protection. In
this work, we perform a detailed analysis of the major problems
in state-of-the-art GPU virtual memory management
that hinder multi-application execution. Existing GPUs are
designed to share memory between the CPU and GPU, but
do not handle multi-application support within the GPU well.
We find that when multiple applications spatially share the
GPU, there is a significant amount of inter-core thrashing on
the shared TLB within the GPU. The TLB contention is high
enough to prevent the GPU from successfully hiding stall la-
tencies, thus becoming a first-order performance concern.
Based on our analysis, we introduce MASK, a memory
hierarchy design that provides low-overhead virtual mem-
ory support for the concurrent execution of multiple appli-
cations. MASK extends the GPU memory hierarchy to effi-
ciently support address translation through the use of multi-
level TLBs, and uses translation-aware memory and cache
management to maximize throughput in the presence of inter-
application contention. MASK uses a novel token-based ap-
proach to reduce TLB miss overheads, and its L2 cache by-
passing mechanisms and application-aware memory schedul-
ing reduce the interference between address translation and
data requests. MASK restores much of the thread-level par-
allelism that was previously lost due to address translation.
Relative to a state-of-the-art GPU TLB, MASK improves
system throughput by 45.2%, improves IPC throughput by
43.4%, and reduces unfairness by 22.4%. MASK performs
within 23% of an ideal design with no translation overhead.
1. INTRODUCTION
Graphics Processing Units (GPUs) provide high through-
put by exploiting a high degree of thread-level parallelism.
A GPU executes a group of threads (i.e., a warp) in lockstep
(i.e., each thread in the warp executes the same instruction
concurrently). When a warp stalls, the GPU hides the la-
tency of this stall by scheduling and executing another warp.
The use of GPUs to accelerate general-purpose GPU (GPGPU)
applications has become common practice, in large part due
to the large performance improvements that GPUs provide
for applications in diverse domains [20, 22, 32, 34, 77]. The
compute density of GPUs continues to grow, with GPUs ex-
pected to provide as many as 128 streaming multiprocessors
per chip in the near future [10,85]. While the increased com-
pute density can help many GPGPU applications, it exacer-
bates the growing need to share the GPU streaming multi-
processors across multiple applications. This is especially
true in large-scale computing environments, such as cloud
servers, where a diverse range of application requirements
exists. In order to enable efficient GPU hardware utilization
in the face of application heterogeneity, these large-scale en-
vironments rely on the ability to virtualize the compute re-
sources and execute multiple applications concurrently [4,7,
38, 39].
The adoption of discrete GPUs in large-scale computing
environments is hindered by the primitive virtualization sup-
port in contemporary GPUs. While hardware virtualization
support has improved for integrated GPUs [19], the current
virtualization support for discrete GPUs is insufficient, even
though discrete GPUs provide the highest available com-
pute density and remain the platform of choice in many do-
mains [1]. Two alternatives for discrete virtualization are
time multiplexing and spatial multiplexing. Emerging GPU
architectures support time multiplexing the GPU by provid-
ing application preemption [57, 65], but this support cur-
rently does not scale well with the number of applications.
Each additional application introduces a high degree of contention
for the GPU resources (Section 2.1). Spatial multiplexing
allows us to share a GPU among applications much
as we currently share CPUs, by providing support for multi-address-space
concurrency (i.e., the concurrent execution of
kernels from different processes or guest VMs). By efficiently
and dynamically managing the kernels that execute
concurrently on the GPU, spatial multiplexing avoids the
scaling issues of time multiplexing. To support spatial multiplexing,
GPUs must provide architectural support for memory
virtualization and memory protection domains.
arXiv:1708.04911v1 [cs.AR] 16 Aug 2017
The architectural support for spatial multiplexing in con-
temporary GPUs is not well-suited for concurrent multi-
application execution. Recent efforts at improving address
translation support within GPUs [25, 67, 68, 84, 94] eschew
MMU-based or IOMMU-based [7, 9] address translation
in favor of TLBs close to shader cores. These works do
not explicitly target concurrent multi-application execution
within the GPU, and are instead focused on unifying the
CPU and GPU memory address spaces [6]. We perform a
thorough analysis of concurrent multi-application execution
when these state-of-the-art address translation techniques
are employed within a state-of-the-art GPU (Section 4). We
make four key observations from our analysis. First, we find
that for concurrent multi-application execution, a shared L2
TLB is more effective than the highly-threaded page table
walker and page walk cache proposed in [68] for the uni-
fied CPU-GPU memory address space. Second, for both the
shared L2 TLB and the page walk cache, TLB misses be-
come a major performance bottleneck with concurrent multi-
application execution, despite the latency-hiding properties
of the GPU. A TLB miss incurs a high latency, as each miss
must walk through multiple levels of a page table to find the
desired address translation. Third, we observe that a sin-
gle TLB miss can frequently stall multiple warps at once.
Fourth, we observe that contention between applications in-
duces significant thrashing on the shared TLB and signif-
icant interference between TLB misses and data requests
throughout the GPU memory system. Thus, with only a few
simultaneous TLB misses, it becomes difficult for the GPU
to find a warp that can be scheduled for execution, defeating
the GPU’s basic techniques for hiding the latency of stalls.
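The cost of a TLB miss described above can be illustrated with a minimal sketch of a multi-level page table walk, in which each level requires a memory access that depends on the previous one. The four-level radix layout, 9 index bits per level, 4 KB pages, and the names `split_vaddr`, `map_page`, and `walk` are assumptions made for illustration (modeled on x86-64), not parameters of the evaluated GPU.

```python
# Illustrative sketch only: a radix page table walk with one dependent
# memory access per level. Level count, index width, and page size are
# assumptions (x86-64-style), not parameters taken from the paper.
LEVELS = 4          # assumed number of page table levels
INDEX_BITS = 9      # assumed index bits per level
PAGE_SHIFT = 12     # assumed 4 KB pages


def split_vaddr(vaddr):
    """Return the per-level indices (root first) and the page offset."""
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)
    vpn = vaddr >> PAGE_SHIFT
    indices = []
    for level in range(LEVELS):
        shift = INDEX_BITS * (LEVELS - 1 - level)
        indices.append((vpn >> shift) & ((1 << INDEX_BITS) - 1))
    return indices, offset


def map_page(root, vaddr, frame):
    """Install a translation from the page containing vaddr to frame."""
    indices, _ = split_vaddr(vaddr)
    node = root
    for index in indices[:-1]:
        node = node.setdefault(index, {})
    node[indices[-1]] = frame


def walk(root, vaddr):
    """Walk nested dicts standing in for page table pages.

    Returns (physical_address, memory_accesses); raises KeyError if the
    address is unmapped.
    """
    indices, offset = split_vaddr(vaddr)
    node = root
    accesses = 0
    for index in indices:
        node = node[index]   # each level depends on the previous load
        accesses += 1
    return (node << PAGE_SHIFT) | offset, accesses
```

Under these assumptions, every TLB miss that is not served by a cached translation pays for `LEVELS` serialized accesses before the data request itself can be issued.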
Based on our extensive analysis, we conclude that
address translation becomes a first-order performance con-
cern in GPUs when multiple applications are executed con-
currently. Our goal in this work is to develop new techniques
that can alleviate the severe address translation bottleneck
existing in state-of-the-art GPUs.
To this end, we propose Multi-Address Space Concurrent
Kernels (MASK), a new cooperative resource management
framework and TLB design for GPUs that minimizes inter-
application interference and translation overheads. MASK
takes advantage of locality across shader cores to reduce
TLB misses, and relies on three novel techniques to min-
imize translation overheads. The overarching key idea is
to make the entire memory hierarchy TLB-request-aware.
First, TLB-Fill Tokens provide a TLB-selective-fill mecha-
nism to reduce thrashing in the shared L2 TLB, including a
bypass cache to increase the TLB hit rate. Second, a low-
cost scheme for selectively bypassing TLB-related requests
at the L2 cache reduces interference between TLB-miss and
data requests. Third, MASK’s memory scheduler prioritizes
TLB-related requests to accelerate page table walks.
The techniques employed by MASK are highly effective at
alleviating the address translation bottleneck. Through the
use of TLB-request-aware policies throughout the memory
hierarchy, MASK ensures that the first two levels of the page
table walk during a TLB miss are serviced quickly. This re-
duces the overall latency of a TLB miss significantly. Com-
bined with a significant reduction in TLB misses, MASK
allows the GPU to successfully hide the latency of the
TLB miss through thread-level parallelism. As a result,
MASK improves system throughput by 45.2%, improves IPC
throughput by 43.4%, and reduces unfairness by 22.4% over
a state-of-the-art GPU memory management unit (MMU)
design [68]. MASK provides performance within only 23%
of a perfect TLB that always hits.
This paper makes the following contributions:
• To our knowledge, this is the first work to provide a
thorough analysis of GPU memory virtualization un-
der multi-address-space concurrency, and to demon-
strate the large impact address translation has on la-
tency hiding within a GPU. We demonstrate a need
for new techniques to alleviate interference induced by
multi-application execution.
• We design an MMU that is optimized for GPUs that
are dynamically partitioned spatially across protection
domains, rather than GPUs that are time-shared.
• We propose MASK, which consists of three novel tech-
niques that increase TLB request awareness across the
entire memory hierarchy. These techniques work to-
gether to significantly improve system performance,
IPC throughput, and fairness over a state-of-the-art
GPU MMU.
2. BACKGROUND
There has been an emerging need to share the GPU hard-
ware among multiple applications. As a result, recent work
has enabled support for GPU virtualization, where a single
physical GPU can be shared transparently across multiple
applications, with each application having its own address
space.1 Much of this work has relied on traditional time and
spatial multiplexing techniques that have been employed by
CPUs, and state-of-the-art GPUs currently contain elements
of both types of techniques [78,81,86]. Unfortunately, as we
discuss in this section, existing GPU virtualization imple-
mentations are too coarse, bake fixed policy into hardware,
or leave system software without the fine-grained resource
management primitives needed to implement truly transpar-
ent device virtualization.
2.1 Time Multiplexing
Most modern systems time-share GPUs [57, 62]. These
designs are optimized for the case where no concurrency ex-
ists between kernels from different address spaces. This sim-
plifies memory protection and scheduling at the cost of two
fundamental tradeoffs. First, it results in underutilization
when kernels from a single address space are unable to fully
utilize all of the GPU’s resources [45, 50, 51, 66, 87]. Second,
it limits the ability of a scheduler to provide forward-progress
or QoS guarantees, leaving applications vulnerable
to unfairness and starvation [74].
1 In this paper, we use the term address space to refer to distinct
memory protection domains, whose access to resources must be
isolated and protected during GPU virtualization.
While preemption support could allow a time-sharing
scheduler to avoid pathological unfairness (e.g., by con-
text switching at a fine granularity), GPU preemption sup-
port remains an active research area [33, 79]. Software ap-
proaches [87] sacrifice memory protection. NVIDIA’s Ke-
pler [62] and Pascal [65] architectures support preemption
at thread block and instruction granularity, respectively. We
find empirically that neither is well optimized for inter-application
interference.
Figure 1 shows the overhead (i.e., performance loss) per
process when the NVIDIA K40 and GTX 1080 GPUs are
contended by multiple application processes. Each pro-
cess runs a kernel that interleaves basic arithmetic opera-
tions with loads and stores into shared and global memory,
with interference from a GPU matrix-multiply program. The
overheads range from 8% per process for the K40, to 10%
or 12% for the GTX 1080. While the performance cost
is significant, we also found inter-application interference
pathologies to be easy to create: for example, a kernel from
one process consuming the majority of shared memory can
easily cause kernels from other processes to fail at dispatch.
While we expect preemption support to improve in future
hardware, we seek a solution that does not depend on it.
[Figure 1 plot: per-process overhead (0% to 12%) versus the number of concurrent processes (2 to 10) for four configurations: Tesla K40, Tesla K40 with interference, Pascal GTX 1080, and Pascal GTX 1080 with interference.]
Figure 1: Context switch overheads under contention on
K40 and GTX 1080.
2.2 Spatial Multiplexing
Resource utilization can be improved with spatial multi-
plexing [2], as the ability to execute multiple kernels concur-
rently enables the system to co-schedule kernels that have
complementary resource demands, and can enable indepen-
dent progress guarantees for different kernels. NVIDIA’s
stream [62] support, which co-schedules kernels from in-
dependent “streams” in a single address space, relies on
similar basic concepts, as does application-specific software
scheduling of multiple kernels in hardware [45,66] and GPU
simulators [11, 50, 51]. Software approaches (e.g., Elastic
Kernels [66]) require programmers to manually time-slice
kernels to enable mapping them onto CUDA streams for
concurrency. While sharing techniques that leverage the
stream abstraction support flexible demand-partitioning of
resources, they all share critical drawbacks. When kernels
from different applications have complementary resource
demands, the GPU remains underutilized. More importantly,
merging kernels into a single address space sacrifices mem-
ory protection, a key requirement in virtualized settings.
Multi-address-space concurrency support can address
these shortcomings by enabling a scheduler to look beyond
kernels from the current address space when resources are
under-utilized. Moreover, QoS and forward-progress guar-
antees can be enabled by giving partitions of the hardware
simultaneously to kernels from different address spaces.
For example, long-running kernels from one application
need not complete before kernels from another may be dis-
patched. NVIDIA and AMD both offer products [3, 35]
with hardware virtualization support for statically partition-
ing GPUs across VMs, but even this approach has critical
shortcomings. The system must select from a handful of
different partitioning schemes, determined at startup, which
is fundamentally inflexible. The system cannot adapt to
changes in demand or mitigate interference, which are key
goals of virtualization layers.
3. BASELINE DESIGN
Our goal in this work is to develop efficient address trans-
lation techniques for GPUs that allow for flexible, fine-
grained spatial multiplexing of the GPU across multiple ad-
dress spaces, ensuring protection across memory protection
domains. Our primary focus is on optimizing a memory hier-
archy design extended with TLBs, which are used in a state-
of-the-art GPU [68]. Kernels running concurrently on dif-
ferent compute units share components of the memory hier-
archy such as lower level caches, so ameliorating contention
for those components is an important concern. In this sec-
tion, we explore the performance costs and bottlenecks in-
duced by different components in a baseline design for GPU
address translation, motivating the need for MASK.
3.1 Memory Protection Support
To ensure that memory accesses from kernels running in
different address spaces remain isolated, we make use of
TLBs and mechanisms for reducing TLB miss costs within
the GPU. We adopt the current state-of-the-art for GPU TLB
design for CPU-GPU heterogeneous systems proposed by
Power et al. [68], and extend the design to handle multi-
address-space concurrency, as shown in Figure 2a. Each
core has a private L1 TLB ( 1 ), and all cores share a highly-
threaded page table walker ( 2 ). On a TLB miss, the shared
page table walker first probes a page walk cache ( 3 ).2 A
miss in the page walk cache goes to the shared L2 cache and
(if need be) main memory.
[Figure 2 diagrams. Panel (a), TLB design from [68]: shader cores, each with a CR3 register and a private L1 TLB, share a page walk cache, a highly threaded page table walker, the shared L2 cache, and main memory. Panel (b), MASK’s baseline TLB design: shader cores, each with a CR3 register and a private L1 TLB, share an L2 TLB, a highly threaded page table walker, the shared L2 cache, and main memory.]
Figure 2: Baseline TLB designs.
2 In our evaluation, we provision the page walk cache to be 16-way,
with 1024 entries.
3.2 Page Walk Caches
Techniques to avoid misses and hide or reduce their la-
tency are well-studied in the literature. To conserve space,
we do not discuss the combinations of techniques that we
considered, and focus on the design which we ultimately se-
lected as the baseline for MASK, shown in Figure 2b. The
design differs from [68] by eliminating the page walk cache,
and instead dedicating the same chip area to 1) a shared
L2 TLB with entries extended with address space identifiers
(ASIDs) and 2) a parallel page table walker. TLB accesses
from multiple threads to the same page are coalesced. On a
private L1 TLB miss, the shared L2 TLB is probed ( 4 ). On
a shared L2 TLB miss, the page table walker begins a walk,
probing the shared L2 cache and main memory.
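The lookup order of this baseline can be summarized in a minimal sketch. Plain dictionaries stand in for the TLBs (no capacity limits or replacement), `page_walk` is a hypothetical callback standing in for the page table walker, and the private L1 TLBs are assumed to hold translations for a single address space at a time; none of these modeling choices come from the design itself.

```python
# Illustrative sketch of the baseline lookup order: private L1 TLB,
# then shared ASID-tagged L2 TLB, then a page table walk.
# Dicts stand in for the hardware structures (assumption: unlimited size).

class BaselineTranslation:
    def __init__(self, num_cores, page_walk):
        # Private per-core L1 TLBs: vpn -> frame. Assumes each core runs
        # one address space at a time, so no ASID tag is modeled here.
        self.l1_tlbs = [dict() for _ in range(num_cores)]
        self.l2_tlb = {}              # shared: (asid, vpn) -> frame
        self.page_walk = page_walk    # hypothetical walker callback
        self.walks = 0                # number of page table walks started

    def translate(self, core, asid, vpn):
        """Return (frame, where_it_was_found) for one coalesced access."""
        l1 = self.l1_tlbs[core]
        if vpn in l1:
            return l1[vpn], "L1 TLB"
        key = (asid, vpn)             # ASID tag separates address spaces
        if key in self.l2_tlb:
            frame, source = self.l2_tlb[key], "L2 TLB"
        else:
            frame, source = self.page_walk(asid, vpn), "page walk"
            self.walks += 1
            self.l2_tlb[key] = frame
        l1[vpn] = frame
        return frame, source
```

In this sketch, a translation fetched by one core is found in the shared L2 TLB by any other core running the same address space, which is the cross-core locality the shared structure is meant to capture.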
Figure 3 compares the performance with multi-address-
space concurrency of our chosen baseline and the design
from [68] against the ideal scenario where every TLB access
is a hit (see Section 6 for our methodology). While Power
et al. find that a page walk cache is more effective than a
shared L2 TLB [68], the design with a shared L2 TLB pro-
vides better performance for all but three workloads, with
13.8% better performance on average. The shared L2 data
cache enables a hit rate for page table walks that is compet-
itive with a dedicated page walk cache, and a shared TLB is
a more effective use of chip area. Hence, we adopt a design
with a shared L2 TLB as the baseline for MASK. We ob-
serve that a 128-entry TLB provides only a 10% reduction
in miss rate over a 64-entry TLB, suggesting that the addi-
tional area needed to double the TLB size is not efficiently
utilized. Thus, we opt for a smaller 64-entry L1 TLB in our
baseline. Note that a shared L2 TLB outperforms the page
walk cache for both L1 TLB sizes. Lastly, we find that both
designs incur a significant performance overhead compared
to the ideal case where every TLB access is a TLB hit.
[Figure 3 plot: normalized performance (0 to 1) of the Translation Cache, Shared TLB, and Ideal designs for each two-application workload (3DS_BP through TRD_RED) and their average.]
Figure 3: Baseline designs vs. ideal performance.
4. DESIGN SPACE ANALYSIS
To inform the design of MASK, we characterize overheads
for address translation, and consider performance challenges
induced by the introduction of multi-address-space concur-
rency and contention.
4.1 Address Translation Overheads
GPU throughput relies on fine-grained multithread-
ing [76, 80] to hide memory latency. However, we observe
a fundamental tension between address translation and fine-
grained multithreading. The need to cache address trans-
lations at a page granularity, combined with application-
level spatial locality, increases the likelihood that transla-
tions fetched in response to a TLB miss will be needed by
more than one thread. Even with the massive levels of paral-
lelism supported by GPUs, we observe that a small number
of outstanding TLB misses can result in the thread scheduler
not having enough ready threads to schedule, which in turn
limits the GPU’s most essential latency-hiding mechanism.
Figure 4 illustrates a scenario where all warps of an ap-
plication access memory. Each box represents a mem-
ory instruction, labeled with the issuing warp. Figure 4a
shows how the GPU behaves when no virtual-to-physical
address translation is required. When Warp A executes
a high-latency memory access, the core does not stall as
long as other warps have schedulable instructions: in this
case, the GPU core selects from among the remaining warps
(Warps B–H) during the next cycle ( 1 ), and continues is-
suing instructions until all requests to DRAM have been
sent. Figure 4b considers the same scenario when address
translation is required. Warp A misses in the TLB (indi-
cated in red), and stalls until the translation is fetched from
memory. If threads belonging to Warps B–D access data
from the same page as the one requested by Warp A, these
warps stall as well (shown in light red) and perform no use-
ful work ( 2 ). If a TLB miss from Warp E similarly stalls
Warps E–G ( 3 ), only Warp H executes an actual data access
( 4 ). Two phenomena harm performance in this scenario.
First, warps stalled on TLB misses reduce the availability of
schedulable warps, lowering utilization. Second, TLB miss
requests must complete before the actual data requests can
issue, which reduces the ability of the GPU to hide latency
by keeping multiple memory requests in flight.
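The scenario of Figure 4b can be reproduced with a small sketch in which a warp is schedulable only if the page it accesses has no outstanding TLB miss. The function name `schedulable_warps` and the page labels `p0`, `p1`, and `p2` are made up for the example; the grouping of warps by page follows the description above.

```python
# Illustrative sketch of the Figure 4 scenario: a warp can issue its data
# access only if the page it touches has no outstanding TLB miss.

def schedulable_warps(warp_pages, missing_pages):
    """Return the warps whose next access does not wait on a TLB miss.

    warp_pages: dict mapping warp name -> virtual page it accesses next.
    missing_pages: set of pages with an outstanding TLB miss.
    """
    return [w for w, page in warp_pages.items() if page not in missing_pages]


# Warps A-D touch one page, Warps E-G another, and Warp H a third.
# The page names are hypothetical labels.
warp_pages = {
    "A": "p0", "B": "p0", "C": "p0", "D": "p0",
    "E": "p1", "F": "p1", "G": "p1",
    "H": "p2",
}
no_translation = schedulable_warps(warp_pages, set())
with_misses = schedulable_warps(warp_pages, {"p0", "p1"})
```

With only two outstanding misses, `with_misses` contains a single warp (Warp H), whereas all eight warps are schedulable when no translation is on the critical path.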
[Figure 4 timelines of the memory instructions issued by Warps A–H. Panel (a), no virtual-to-physical address translation on the critical path: 7 concurrent data requests ( 1 ). Panel (b), virtual-to-physical address translation on the critical path: stalled warps ( 2 , 3 ) and 1 concurrent data request ( 4 ).]
Figure 4: Example bottlenecks created by TLB misses.
Figure 5 shows the number of stalled warps per active
TLB miss, and the average number of maximum concurrent
page table walks (sampled every 10K cycles for a range of
applications). In the worst case, a single TLB miss stalls
over 30 warps, and over 50 outstanding TLB misses con-
tend for access to address translation structures. The large
number of concurrent misses stalls a large number of warps,
which must wait before issuing DRAM requests, so mini-
mizing TLB misses and page table walk latency is critical.
[Figure 5 plot: warps stalled per TLB entry and maximum concurrent page walks (both on a 0 to 60 scale) for each application (3DS through TRD).]
Figure 5: Average number of stalled warps per active TLB
miss and number of concurrent page walks.
Impact of Large Pages. A larger page size can significantly
improve the coverage of the TLB. However, previous work
has observed that the use of large pages significantly increases
the overhead of demand paging in GPUs [94]. We
evaluate this overhead with a 2MB page size and find that it
results in an average slowdown of 93%.
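The coverage benefit of large pages can be made concrete with a short calculation of TLB reach (number of entries times page size). The 64-entry L1 TLB and the 2MB large page size come from the text; the 4KB base page size is an assumption used only for comparison.

```python
# Illustrative arithmetic: TLB reach = number of entries x page size.
# The 4 KB base page size is an assumption; the 64-entry L1 TLB and the
# 2 MB large page size come from the text.
KB = 1024
MB = 1024 * KB


def tlb_reach(entries, page_size):
    """Return the number of bytes covered by a fully populated TLB."""
    return entries * page_size


reach_small = tlb_reach(64, 4 * KB)   # assumed base pages
reach_large = tlb_reach(64, 2 * MB)   # large pages
coverage_gain = reach_large // reach_small
```

Under this assumption, the same 64 entries cover 512 times more memory with 2MB pages, which is the coverage gain that must be weighed against the demand paging overhead reported above.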
4.2 Interference Induced by Sharing
To understand the impact of inter-address-space interfer-
ence through the memory hierarchy, we concurrently run
two applications using the methodology described in Sec-
tion 6. Figure 6 shows the TLB miss breakdown across all
workloads: most applications incur significant L1 and L2
TLB misses. Figure 7 compares the TLB miss rate for ap-
plications running in isolation to the miss rates under con-
tention. The data show that inter-address-space interference
through additional thrashing has a first-order performance
impact.
[Figure 6 plot: L1 and L2 TLB miss rates (0 to 1) for each application (3DS through TRD).]
Figure 6: TLB miss breakdown for all workloads.
[Figure 7 plot: L2 TLB miss rate (0 to 1, lower is better) for the workloads 3DS_HISTO, CONS_LPS, MUM_HISTO, and RED_RAY, with bars for Alone App1, Shared App1, Alone App2, and Shared App2.]
Figure 7: Cross-address-space interference in real appli-
cations. Each set of bars corresponds to a pair of co-
scheduled applications, e.g. “3DS_HISTO” denotes the 3DS
and HISTO benchmarks running concurrently.
Figure 8 illustrates TLB misses in a scenario where two
applications (green and blue) share the GPU. In Figure 8a,
the green application issues five parallel TLB requests, caus-
ing the premature eviction of translations for the blue appli-
cation, increasing its TLB miss rate (Figure 8b). The use of a
shared L2 TLB to cache entries for each application’s (non-
overlapping) page tables dramatically reduces TLB reach.
The resulting inter-core TLB thrashing hurts performance,
and can lead to unfairness and starvation when applications
generate TLB misses at different rates. Our findings of se-
vere performance penalties for increased TLB misses cor-
roborate previous work on GPU memory designs [67,68,84].
However, interference across address spaces can inflate miss
rates in ways not addressed by these works, and which are
best managed with mechanisms that are aware of concur-
rency (as we show in Section 7.1).
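The thrashing behavior illustrated in Figure 8 can be reproduced with a minimal sketch of a shared, fully associative LRU TLB whose entries are tagged with an address space identifier. The capacity of 8 entries, the access patterns, and the names `SharedLruTlb` and `blue_misses` are illustrative assumptions and do not correspond to the evaluated configuration.

```python
# Illustrative sketch: two applications sharing one LRU-managed TLB.
# Capacity and access patterns are made up for the example.
from collections import OrderedDict


class SharedLruTlb:
    """Fully associative LRU TLB whose entries are tagged with an ASID."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # (asid, vpn) -> None, oldest first
        self.misses = {}               # asid -> miss count

    def access(self, asid, vpn):
        """Return True on a hit; on a miss, fill and evict the LRU entry."""
        key = (asid, vpn)
        if key in self.entries:
            self.entries.move_to_end(key)   # mark as most recently used
            return True
        self.misses[asid] = self.misses.get(asid, 0) + 1
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)   # evict least recently used
        self.entries[key] = None
        return False


def blue_misses(with_green):
    """Count the blue application's misses over 10 rounds of page reuse."""
    tlb = SharedLruTlb(capacity=8)
    for _ in range(10):
        for vpn in range(4):           # blue reuses the same 4 pages
            tlb.access("blue", vpn)
        if with_green:
            for vpn in range(8):       # green streams through 8 pages
                tlb.access("green", vpn)
    return tlb.misses["blue"]
```

In this toy setup, the blue application misses only on its first touch of each page when it runs alone, but misses on every access once the green application's requests evict its entries between reuses.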
[Figure 8 diagrams of the tags and entries of the shared TLB under LRU replacement. Panel (a), parallel requests; panel (b), conflict; panel (c), final state.]
Figure 8: Cross-address-space TLB interference.
4.3 Interference from Address Translation
Interference at the Shared Data Cache. Prior work [12]
demonstrated that while cache hits in GPUs reduce the con-
sumption of off-chip memory bandwidth, cache hits result in
a lower load/store instruction latency only when every thread
in the warp hits in the cache. In contrast, when a page table
walk hits in the shared L2 cache, the cache hit has the po-
tential to help reduce the latency of other warps that have
threads which access the same page in memory. While this
makes it desirable to allow the data generated by the page ta-
ble walk to consume entries in the shared cache, TLB-related
data can still interfere with and thrash normal data cache en-
tries, which hurts the overall performance.
Hence, a trade-off exists between prioritizing TLB-related
requests and normal data requests in the GPU memory hierarchy.
Figure 9 shows that entries for translation data from levels
closer to the page table root are more likely to be shared
across warps, and will typically be served by cache hits. Al-
lowing shared structures to cache page walk data from only
the levels closer to the root could alleviate the interference
between low-hit-rate translation data and application data.
Interference at Main Memory. Figure 10 characterizes the
DRAM bandwidth utilization, broken down between data
and address translation requests for pairs of applications sharing
the GPU concurrently. Figure 11 compares the average
latency for data requests and translation requests. We see
that even though page walk requests consume only 13.8%
of the utilized DRAM bandwidth (2.4% of the maximum
bandwidth), their DRAM latency is higher than that of data
requests, which is particularly egregious since data requests
that lead to TLB misses stall while waiting for page walks to
complete. The phenomenon is caused by FR-FCFS memory
schedulers [71, 95], which prioritize accesses that hit in the
row buffer. Data requests from GPGPU applications gener-
ally have very high row buffer locality [11, 50, 88, 93], so a
scheduler that cannot distinguish page walk requests effec-
tively de-prioritizes them, increasing their latency.
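The de-prioritization effect can be seen in a simplified sketch of FR-FCFS request selection: row-buffer hits are served first, and ties are broken by arrival order. The single-queue model, the `Request` fields, and the example addresses are assumptions for illustration; a real scheduler also accounts for DRAM timing constraints, which are omitted here.

```python
# Illustrative sketch of FR-FCFS selection (first-ready, first-come
# first-served). Timing constraints are omitted; this is not a model of
# any specific memory controller.
from collections import namedtuple

Request = namedtuple("Request", "arrival bank row kind")


def fr_fcfs_pick(queue, open_rows):
    """Pick the next request: row-buffer hits first, then the oldest."""
    def priority(req):
        row_hit = open_rows.get(req.bank) == req.row
        return (not row_hit, req.arrival)   # hits sort before misses
    return min(queue, key=priority)


def service_order(queue, open_rows):
    """Return the order in which the queued requests are serviced."""
    open_rows = dict(open_rows)
    pending = list(queue)
    order = []
    while pending:
        req = fr_fcfs_pick(pending, open_rows)
        pending.remove(req)
        open_rows[req.bank] = req.row   # the serviced row becomes open
        order.append(req)
    return order


# A page walk request arrives first but targets a closed row, while later
# data requests hit in the open row of the same bank (hypothetical values).
queue = [
    Request(arrival=0, bank=0, row=7, kind="walk"),
    Request(arrival=1, bank=0, row=3, kind="data"),
    Request(arrival=2, bank=0, row=3, kind="data"),
    Request(arrival=3, bank=0, row=3, kind="data"),
]
kinds = [req.kind for req in service_order(queue, {0: 3})]
```

Even though the page walk request is the oldest in the queue, it is serviced last, because every data request is a row-buffer hit and the scheduler has no notion of request type.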
In summary, we make two important observations about
address translation in GPUs. First, address translation
competes with the GPU’s ability to hide latency through
thread-level parallelism, when multiple warps stall on the
TLB misses for a single translation. Second, the GPU’s
memory-level parallelism generates interference across ad-
dress spaces, and between TLB requests and data requests,
which can lead to unfairness and increased latency. In light
of these observations, the goal of this work is to design
mechanisms that alleviate the translation overhead by 1) increasing
the TLB hit rate through reduced TLB thrashing,
2) decreasing interference between normal data and TLB requests
in the shared L2 cache, 3) decreasing TLB miss latency
by prioritizing TLB-related requests in DRAM, and
4) enhancing memory scheduling to provide fairness without
sacrificing DRAM bandwidth utilization.
[Figure 9 plot: L2 data cache hit rate (0 to 1) of page walk requests for each two-application workload (3DS_BP through TRD_RED), with series labeled 1 to 4.]
Figure 9: L2 cache hit rate for page walk requests.
[Figure 10 plot: DRAM bandwidth utilization (0 to 1), broken down into TLB and data requests, for each two-application workload (3DS_BP through TRD_RED) and their average.]
Figure 10: Bandwidth breakdown of two applications.
5. DESIGN OF MASK
We now introduce Multi-Address Space Concurrent Ker-
nels (MASK), a new cooperative resource management
framework and TLB design for GPUs. Figure 12 provides
a design overview of MASK. MASK employs three compo-
nents in the memory hierarchy to reduce address translation
overheads while requiring minimal hardware change. First,
we introduce TLB-Fill Tokens to lower the number of TLB
misses and utilize a bypass cache to cache frequently used
TLB entries ( 1 ). Second, we design a TLB-Request-Aware
L2 Bypass mechanism for TLB requests that significantly in-
creases the shared L2 data cache utilization, by reducing in-
terference from TLB misses at the shared L2 data cache ( 2 ).
Third, we design an Address-Space-Aware DRAM Scheduler
to further reduce interference between TLB requests and
data requests from different applications ( 3 ). We analyze
the hardware cost of MASK in Section 7.5.
5.1 Memory Protection
MASK uses per-core page table root registers (similar to
x86 CR3) to set the current address space on each core: set-
ting it also sets the value in a page table root cache with
per-core entries at the L2 layer. The page table root cache is
kept coherent with the CR3 value in the core by draining all
in-flight memory requests for that core when the page table
root is set. L2 TLB cache lines are extended with address-
space identifiers (ASIDs); TLB flush operations target a sin-
gle shader core, flushing the core’s L1 TLB and all entries in
the L2 TLB with a matching ASID.
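The flush behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `ShaderCore`, `SharedL2TLB`, and `flush_core` are hypothetical, and the TLBs are modeled as plain dictionaries rather than as set-associative hardware structures.

```python
from dataclasses import dataclass, field


@dataclass
class ShaderCore:
    """One shader core: its current address space and private L1 TLB."""
    asid: int
    l1_tlb: dict = field(default_factory=dict)  # virtual page -> frame


class SharedL2TLB:
    """Shared L2 TLB whose entries are tagged with an ASID."""

    def __init__(self):
        self.entries = {}  # (asid, virtual page) -> frame

    def fill(self, asid, vpage, frame):
        self.entries[(asid, vpage)] = frame

    def lookup(self, asid, vpage):
        # A translation is visible only to the address space that owns it.
        return self.entries.get((asid, vpage))

    def flush_asid(self, asid):
        self.entries = {k: v for k, v in self.entries.items()
                        if k[0] != asid}


def flush_core(core, l2_tlb):
    """Flush targeting one shader core: clear its L1 TLB and every
    L2 TLB entry whose ASID matches the core's address space."""
    core.l1_tlb.clear()
    l2_tlb.flush_asid(core.asid)
```

Tagging entries with the ASID is what lets a flush for one address space leave the other applications' translations in the shared L2 TLB untouched.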
5.2 Reducing L2 TLB Interference
[Figure 11 plot omitted. Y-axis: average DRAM latency
(cycles), shown separately for TLB and data requests; x-axis:
two-application workload pairs.]
Figure 11: Latency breakdown of two applications.
Sections 4.1 and 4.2 demonstrated the need to minimize
TLB miss overheads. MASK addresses this need with a new
mechanism called TLB-Fill Tokens. Figure 13a shows ar-
chitectural additions to support TLB-Fill Tokens. We add
two 16-bit counters to track TLB hits and misses per shader
core, along with a small fully-associative bypass cache to the
shared TLB. Figure 14 illustrates operation of the proposed
TLB fill bypassing logic. When a TLB access arrives (Fig-
ure 14a), tags for both the shared TLB ( 1 ) and bypass cache
( 2 ) are probed in parallel. A hit on either the TLB or the
bypass cache yields a TLB hit.
To reduce inter-core thrashing at the shared L2 TLB, we
use an epoch- and token-based scheme to limit the number of
warps from each shader core that can fill into (and therefore
contend for) the L2 TLB. While every warp can probe the
shared TLB, to prevent thrashing, we allow only warps with
tokens to fill into the shared TLB as shown in Figure 14b.
This token-based mechanism requires two components: one
to determine the number of tokens for each application, and
one to implement the policy for assigning tokens to warps.
Determining the Number of Tokens. At the beginning of
a kernel, MASK performs no bypassing, but tracks the L2
TLB miss rate for each application and the total number of
warps in each core. After the first epoch,3 the initial number
of tokens (InitialTokens) is set to a fraction of the total
number of warps per application. At the end of each
subsequent epoch, MASK compares the shared L2 TLB miss
rates of the current and previous epochs. If the miss rate has
changed since the previous epoch, MASK uses the logic
shown in Figure 13b to decrease or increase the number of
tokens allocated to each application.
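As a rough illustration of the epoch-based adjustment, the Python sketch below uses a simple hill-climbing rule. Only the 10% step and the direction bit are visible in Figure 13b; the exact decision mapping is not recoverable from this text. The rule used here (keep adjusting in the same direction while the miss rate falls, reverse when it rises) is therefore an assumption, as are the function names. The 80% default for the initial fraction is the InitialTokens value used in the evaluation.

```python
def initial_tokens(total_warps, fraction=0.8):
    """Tokens after the first epoch: a fraction of the app's warps."""
    return max(1, int(total_warps * fraction))


def adjust_tokens(tokens, direction, prev_miss_rate, miss_rate,
                  total_warps, step=0.10):
    """Return (new_tokens, new_direction) at the end of an epoch.

    direction is +1 if the last change added tokens, -1 if it removed
    tokens. Assumed policy: keep moving the same way while the miss
    rate falls, reverse when it rises, and do nothing if it is equal.
    """
    if miss_rate == prev_miss_rate:
        return tokens, direction
    if miss_rate > prev_miss_rate:
        direction = -direction  # the last change hurt, so reverse it
    delta = max(1, int(tokens * step))
    new_tokens = tokens + direction * delta
    # Clamp: at least one warp may fill, at most every warp.
    return min(total_warps, max(1, new_tokens)), direction
```

For example, an application with 40 warps starts with 32 tokens. If removing tokens lowered the miss rate in the last epoch, the sketch removes another 10%.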
Assigning Tokens to Warps. Empirically, we observe that
1) warps throughout the GPU cores have a mostly even TLB
miss rate distribution; and 2) it is more beneficial for warps
that previously had tokens to retain them, as it is more
likely that their TLB entries are already in the shared TLB.
We leverage these two observations to simplify the token
assignment logic: TLB-Fill Tokens simply hands out tokens
in round-robin fashion to all cores in warpID order. The
heuristic is effective at reducing thrashing, as contention at
the shared TLB is reduced based on the number of tokens,
and highly-used TLB entries from warps that do not have
tokens can still fill into the bypass cache.
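The assignment and fill rules can be sketched as follows in Python. This is a minimal software model, not the hardware design: `assign_tokens`, `LRUCache`, `tlb_lookup`, and `tlb_fill` are hypothetical names, both structures are modeled as small LRU maps, and the two probes that hardware performs in parallel are done one after the other here.

```python
from collections import OrderedDict


def assign_tokens(warps_per_core, num_tokens):
    """Hand out tokens round-robin across cores, lowest warp ID first.

    warps_per_core[c] is the number of warps on core c. Returns the
    set of (core, warp_id) pairs that hold a token.
    """
    holders = set()
    next_warp = [0] * len(warps_per_core)
    remaining = min(num_tokens, sum(warps_per_core))
    while remaining > 0:
        for core, count in enumerate(warps_per_core):
            if remaining == 0:
                break
            if next_warp[core] < count:
                holders.add((core, next_warp[core]))
                next_warp[core] += 1
                remaining -= 1
    return holders


class LRUCache:
    """Tiny LRU map standing in for a TLB structure."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def lookup(self, key):
        if key in self.data:
            self.data.move_to_end(key)
            return self.data[key]
        return None

    def fill(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used


def tlb_lookup(shared_tlb, bypass_cache, vpage):
    """Probe both structures; a hit in either counts as a TLB hit."""
    hit = shared_tlb.lookup(vpage)
    return hit if hit is not None else bypass_cache.lookup(vpage)


def tlb_fill(shared_tlb, bypass_cache, holders, core, warp_id,
             vpage, frame):
    """Warps with a token fill the shared TLB; others fill the
    bypass cache."""
    holds_token = (core, warp_id) in holders
    target = shared_tlb if holds_token else bypass_cache
    target.fill(vpage, frame)
```

A usage example would create `LRUCache(512)` for the shared TLB and `LRUCache(32)` for the bypass cache, matching the sizes given in this paper, and call `tlb_fill` on every completed page walk.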
Bypass Cache. While TLB-Fill Tokens can reduce thrashing
in the shared TLB, a handful of highly-reused pages from
warps with no tokens may be unable to utilize the shared
TLB. To address this, we add a bypass cache, which is a
small 32-entry fully associative cache. Only warps without
tokens can fill the bypass cache.
3We empirically select an epoch length of 100K cycles.
[Figure 12 diagram omitted. It shows the L1 cache, the page
table walker, and the page table root cache, together with the
three MASK components: (1) the shared L2 TLB with hit and
miss counters, tokens, and the bypass cache; (2) the L2 cache
bypass decision, which compares the per-level hit rate of TLB
requests with the L2 hit rate; and (3) the memory scheduler
with the Golden, Silver, and Normal Queues in front of the
DRAM banks.]
Figure 12: MASK design overview.
[Figure 13 diagrams omitted. (a) TLB hit and miss counters:
the shared L2 TLB extended with # Hits, # Misses, Prev. Hit,
Tokens, and Tokens Dir fields, plus the bypass cache. (b) TLB
counter control logic: compares Prev. Hit with Hits/Misses,
checks the token direction, and adjusts the token count by
10% up or down.]
Figure 13: L2 TLB and token assignment logic.
[Figure 14 diagrams omitted. (a) TLB access: the tags of the
shared TLB (1) and the bypass cache (2) are probed. (b) TLB
fill: a fill from a warp with a token goes to the shared TLB; a
fill from a warp without a token goes to the bypass cache.]
Figure 14: TLB fill bypassing logic in MASK.
Replacement Policy. While it is possible to base the cache
replacement policy on how many warps are stalled per TLB
entry and prioritize TLB entries with more warps sharing an
entry, we observe small variance across TLB entries on the
shared TLB in practice. Consequently, a replacement policy
based on number of warps stalled per TLB entry actually
performs worse than a reuse-based policy. Hence, we use
an LRU replacement policy for the L1 TLBs, the shared L2
TLB, and the bypass cache.
5.3 Minimizing Shared L2 Interference
Interference from TLB Requests. While Power et al. pro-
pose to coalesce TLB requests to minimize the cache pres-
sure and performance impact [68], we find that a TLB miss
generates shared cache accesses with varying degrees of lo-
cality. Translating addresses through a multi-level page ta-
ble (4 levels for MASK) can generate dependent memory re-
quests for each level. This causes significant queuing latency
at the shared L2 cache, corroborating observations from pre-
vious work [12]. Page table entries in levels closer to the
root are more likely to be shared across threads than entries
near the leaves, and more often hit in the shared L2 cache.
To address both the interference and queuing delay at the
shared L2 cache we introduce TLB-Request-Aware L2 By-
pass for TLB requests, as shown in Figure 15. To determine
which TLB requests should be bypassed, we leverage our
insights from Section 4.3. Because of the sharp drop-off in
L2 cache hit rate after the first few levels, we can simplify
the bypassing logic to compare the L2 cache hit rate of each
page level for TLB requests to the L2 cache hit rate for non-
TLB requests. We impose L2 cache bypassing when the hit
rate for TLB requests falls below the hit rate for non-TLB
requests. Memory requests are tagged with three additional
bits specifying page walk depth, allowing MASK to differ-
entiate between request types. These bits are set to zero for
normal data requests, and to 7 for any depth higher than 6.4
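The bypass decision can be illustrated with a short Python sketch. The class and function names are hypothetical. The sketch also assumes that a level with no recorded accesses is not bypassed, and it does not model how hit rates are refreshed once a level starts bypassing the cache; neither point is specified in the text above.

```python
MAX_DEPTH_TAG = 7  # 3-bit tag: 0 = normal data, 7 = any depth above 6


def depth_tag(walk_depth):
    """Encode the page walk depth into the 3-bit request tag.

    walk_depth is None for a normal data request, otherwise 1, 2, ...
    """
    if walk_depth is None:
        return 0
    return min(walk_depth, MAX_DEPTH_TAG)


class HitRateCounter:
    """Hit and access counts for one request class."""

    def __init__(self):
        self.hits = 0
        self.accesses = 0

    def record(self, hit):
        self.accesses += 1
        self.hits += int(hit)

    def rate(self):
        return self.hits / self.accesses if self.accesses else 0.0


class L2BypassPolicy:
    """Bypass TLB requests whose level hits less often than data."""

    def __init__(self):
        # Index 0 tracks normal data; 1..7 track page walk levels.
        self.counters = [HitRateCounter()
                         for _ in range(MAX_DEPTH_TAG + 1)]

    def record(self, tag, hit):
        self.counters[tag].record(hit)

    def should_bypass(self, tag):
        if tag == 0:
            return False  # normal data is never bypassed here
        level = self.counters[tag]
        if level.accesses == 0:
            return False  # assumption: no evidence, so do not bypass
        return level.rate() < self.counters[0].rate()
```

With a data hit rate of 75%, a page table level that hits 100% of the time keeps using the L2 cache, while a level that hits only 25% of the time is sent straight to DRAM.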
[Figure 15 diagram omitted. TLB-request data and normal data
arrive at the L2 cache request buffers. For each page table
level (Level 0 to Level 4), the level hit rate is compared with
the L2 hit rate: if it is greater than or equal to the L2 hit
rate, the request accesses the L2 cache banks; if it is lower,
the TLB-request data bypasses the L2 cache and goes to DRAM.]
Figure 15: Design of TLB-Request-Aware L2 Bypass.
5.4 Minimizing Interference at Main Memory
Section 4.3 demonstrates two different types of interfer-
ence at main memory. Normal data requests can interfere
with TLB requests, and data requests from multiple applica-
tions can interfere with each other. MASK’s memory con-
troller design mitigates both forms of interference using an
Address-Space-Aware DRAM Scheduler.
MASK’s Address-Space-Aware DRAM Scheduler breaks
the traditional DRAM request buffer into three separate
queues, as shown in Figure 12. The first queue, called the
Golden Queue, is a small FIFO queue.5 TLB-related
requests always go to the Golden Queue, while non-TLB-related
requests go to the other two, larger queues (similar in
size to a typical DRAM request buffer). The second
queue, called the Silver Queue, contains data requests from
one selected application. The last queue, called the Normal
Queue, contains data requests from all other applications.
The Golden Queue is used to prioritize TLB misses over
data requests, while the Silver Queue ensures that DRAM
bandwidth is fairly distributed across applications.
4Note that all experiments done in this paper use a depth of 4.
5We observe that TLB-related requests have low row locality.
Thus, we use a FIFO queue to further simplify the design.
Our Address-Space-Aware DRAM Scheduler always
prioritizes requests in the Golden Queue over requests in the
Silver Queue, which are prioritized over requests in the
Normal Queue. Applications take turns being assigned to the
Silver Queue based on two factors: the number of concurrent
page walks, and the number of warps stalled per active TLB
miss. The number of requests each application can add to the
Silver Queue is shown in Equation 1. Application (Appi) in-
serts thresi requests into the Silver Queue. Then, the next
application (Appi+1) is allowed to send thresi+1 requests to
the Silver Queue. Within each queue, FR-FCFS [71, 95] is
used to schedule requests.
\[
thres_i = thres_{max} \times \frac{Concurrent_i \times WrpStalled_i}{\sum_{j=1}^{numApp} Concurrent_j \times WrpStalled_j} \tag{1}
\]
To track the number of outstanding concurrent page
walks, we add a 6-bit counter per application to the shared
TLB.6 This counter tracks the maximum occupancy of the
TLB miss queue, and is used as Concurrent_i in Equation 1.
To track the number of warps stalled per active TLB miss,
we add a 6-bit counter to the TLB MSHRs, to track the
maximum number of warps that hit in each MSHR entry.
This number is used for WrpStalled_i. Note that the
Address-Space-Aware DRAM Scheduler resets all of these
counters every epoch.3
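Equation 1 can be evaluated directly from the two per-application counters. The Python sketch below is illustrative only: the function name is hypothetical, and both the rounding down to whole requests and the even split when no page walks are outstanding are assumptions not stated in the text. The default of 500 is the thres_max value used in the evaluation.

```python
def silver_queue_quotas(concurrent, warps_stalled, thres_max=500):
    """Per-application Silver Queue quota, following Equation 1.

    concurrent[i]    : concurrent page walks of application i
    warps_stalled[i] : warps stalled per active TLB miss of app i
    """
    weights = [c * w for c, w in zip(concurrent, warps_stalled)]
    total = sum(weights)
    if total == 0:
        # Assumption: with no outstanding walks, split evenly.
        return [thres_max // len(weights)] * len(weights)
    # Assumption: quotas are rounded down to whole requests.
    return [thres_max * w // total for w in weights]
```

For two applications with (Concurrent, WrpStalled) of (4, 10) and (2, 5), the weights are 40 and 10, so the first application may insert 400 requests per turn and the second 100.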
We find that the number of concurrent TLB requests that
go to each memory channel is small, so our design has
an additional benefit of lowering page table walk latency
while minimizing interference. The Silver Queue prevents
bandwidth-heavy applications from interfering with applica-
tions utilizing the queue, which in turn prevents starvation.
It also minimizes the reduction in total bandwidth utiliza-
tion, as the per-queue FR-FCFS scheduling policy ensures
that application-level row buffer locality is preserved.
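The queue routing and priority order of the scheduler can be sketched as below in Python. The class name and the request format are hypothetical. As a deliberate simplification, the Silver and Normal Queues are served in FIFO order here, whereas MASK uses FR-FCFS within those queues; rotation of the Silver Queue owner is also left out.

```python
from collections import deque


class AddressSpaceAwareQueues:
    """Three request queues served in strict priority order."""

    def __init__(self, silver_app):
        self.golden = deque()  # TLB-related requests (FIFO)
        self.silver = deque()  # data requests of the selected app
        self.normal = deque()  # data requests of all other apps
        self.silver_app = silver_app

    def enqueue(self, request):
        """request is a dict with keys 'is_tlb' and 'app'."""
        if request["is_tlb"]:
            self.golden.append(request)
        elif request["app"] == self.silver_app:
            self.silver.append(request)
        else:
            self.normal.append(request)

    def next_request(self):
        """Golden first, then Silver, then Normal."""
        for queue in (self.golden, self.silver, self.normal):
            if queue:
                return queue.popleft()
        return None
```

A TLB-related request enqueued last is still issued first, which is the property that shortens page table walks under heavy data traffic.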
5.5 Page Faults and TLB Shootdowns
Address translation inevitably introduces page faults. Our
design can be extended to use techniques from previ-
ous works, such as performing copy-on-write for minor
faults [68], and either exception support [58] or demand pag-
ing techniques [8, 65, 94] for major faults. We leave this as
future work, and do not evaluate these overheads.
Similarly, TLB shootdowns are required when shader
cores change address spaces and when page tables are up-
dated. We do not envision applications that make frequent
changes to memory mappings, so we expect such events
to be rare. Techniques to reduce TLB shootdown over-
head [18,73] are well-explored and can be applied to MASK.
6. METHODOLOGY
We model Maxwell architecture [63] cores, TLB fill by-
passing, bypass cache, and memory scheduling mechanisms
in MASK using the MAFIA framework [45], which is based
on GPGPU-Sim 3.2.2 [13]. We heavily modify the simulator
to accurately model the behavior of CUDA Unified Virtual
Address [63, 65] as described below. Table 1 provides de-
tails on our baseline GPU configuration. In order to show
that MASK works on any GPU architecture, we also evalu-
ate the performance of MASK on a Fermi architecture [61],
which we discuss in Section 7.3.
6We leave techniques to virtualize this counter for more than 64
applications as future work.
System Overview      30 cores, 64 execution units per core, 8 memory partitions
Shader Core Config   1020 MHz, 9-stage pipeline, 64 threads per warp, GTO scheduler [72]
Private L1 Cache     16KB, 4-way associative, LRU, L1 misses are coalesced before accessing L2, 1 cycle latency
Shared L2 Cache      2MB total, 16-way associative, LRU, 2 cache banks, 2 interconnect ports per memory partition, 10 cycle latency
Private L1 TLB       64 entries per core, fully associative, LRU, 1 cycle latency
Shared L2 TLB        512 entries total, 16-way associative, LRU, 2 ports per memory partition (16 ports total), 10 cycle latency
DRAM                 GDDR5 1674 MHz, 8 channels, 8 banks per rank, FR-FCFS scheduler [71, 95], burst length 8
Page Table Walker    64 threads shared page table walker, traversing 4-level page table
Table 1: Configuration of the simulated system.
TLB and Page Table Walk Model. We modify the MAFIA
framework to accurately model the TLB designs from [68]
and the MASK baseline design. We employ the non-blocking
TLB implementation used in the design from Pichai et
al. [67]. Each core has a private L1 TLB. The page table
walker is shared, and admits up to 64 concurrent threads for
walks. The baseline design for MASK adds a shared L2 TLB
instead of page walk caches (see Section 3.2), with a shared
L2 TLB in each memory partition. Both L1 and L2 TLB
entries contain MSHR entries to track in-flight page table
walks. On a TLB miss, a page table walker generates a series
of dependent requests that probe the L2 data cache and main
memory as needed. To correctly model virtual-to-physical
address mapping and dependent memory accesses for multi-
level page walks, we collect traces of all virtual addresses
referenced by each application (executing them to comple-
tion), enabling us to pre-populate disjoint physical address
spaces for each application with valid page tables.
Workloads. We randomly select 27 applications from the
CUDA SDK [60], Rodinia [22], Parboil [77], LULESH [47,
48], and SHOC [27] suites. We classify these benchmarks
based on their L1 and L2 TLB miss rates into one of four
groups. Table 2 shows the categorization for each bench-
mark. For our multi-application results, we randomly se-
lect 35 pairs of applications, avoiding combinations that se-
lect applications from the lowL1miss-lowL2miss category, as
these applications are relatively insensitive to memory pro-
tection overheads. The application that finishes first is re-
launched to keep the SM full and to properly model con-
tention.
We divide these pairs into three workload categories based
on the number of applications that are from highL1miss-
highL2miss category. 0 HMR contains workload bundles
where none of the applications in the bundle are from
highL1miss-highL2miss. 1 HMR contains workloads where
only one application in the bundle is from highL1miss-
highL2miss. 2 HMR contains workloads where both appli-
cations in the bundle are from highL1miss-highL2miss.
L1 TLB Miss   L2 TLB Miss   Benchmark Name
Low           Low           LUD, NN
Low           High          BFS2, FFT, HISTO, NW, QTC, RAY, SAD, SCP
High          Low           BP, GUP, HS, LPS
High          High          3DS, BLK, CFD, CONS, FWT, LUH, MM, MUM, RED, SC, SCAN, SRAD, TRD
Table 2: Categorization of each benchmark.
[Figure 16 plot omitted. Y-axis: weighted speedup; bars for
Static, GPU-MMU, MASK-TLB, MASK-Cache, MASK-DRAM,
and MASK; x-axis: two-application workload pairs.]
Figure 16: System-wide weighted speedup for multiprogrammed workloads.
Evaluation Metrics. We report performance using
weighted speedup [30, 31], defined as
$\sum \frac{IPC_{Shared}}{IPC_{Alone}}$. $IPC_{Alone}$
is the IPC of an application that runs on the same number
of shader cores, but does not share GPU resources with any
other applications, and $IPC_{Shared}$ is the IPC of an
application when running concurrently with other applications.
We report the unfairness of each design using maximum
slowdown, defined as $\max \frac{IPC_{Alone}}{IPC_{Shared}}$ [11, 29].
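The two metrics, weighted speedup and maximum slowdown, are straightforward to compute from per-application IPC values. The short Python sketch below illustrates them; the function names and the example IPC values are hypothetical.

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """Sum over applications of IPC_Shared / IPC_Alone."""
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))


def max_slowdown(ipc_shared, ipc_alone):
    """Largest IPC_Alone / IPC_Shared over all applications."""
    return max(a / s for s, a in zip(ipc_shared, ipc_alone))
```

For two applications that each run at half of their standalone IPC, the weighted speedup is 1.0 and the maximum slowdown is 2.0. A higher weighted speedup is better, while a lower maximum slowdown indicates a fairer design.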
Scheduling and Partitioning of Cores. The design space
for core scheduling is quite large, and finding optimal algo-
rithms is beyond the scope of this paper. To ensure that we
model a scheduler that performs reasonably well, we assume
an oracle schedule that finds the best partition for each pair
of applications. For each pair of applications, concurrent ex-
ecution partitions the cores according to the best weighted
speedup observed for that pair during an exhaustive search
over all possible partitionings.
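The oracle partitioning amounts to an exhaustive search over core splits. A minimal Python sketch for a pair of applications follows. The callback `measure_weighted_speedup` is hypothetical and stands in for a full simulation run of one partitioning. The sketch also assumes that each application receives at least one core.

```python
def best_partition(total_cores, measure_weighted_speedup):
    """Try every split of cores between two applications.

    measure_weighted_speedup(cores_a, cores_b) returns the weighted
    speedup observed for that split. Returns a tuple
    (best_score, cores_a, cores_b).
    """
    best = None
    for cores_a in range(1, total_cores):
        cores_b = total_cores - cores_a
        score = measure_weighted_speedup(cores_a, cores_b)
        if best is None or score > best[0]:
            best = (score, cores_a, cores_b)
    return best
```

With 30 cores this requires 29 evaluations per application pair, which is affordable in simulation but is not meant as a realistic runtime policy.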
Design Parameters. MASK exposes two configurable pa-
rameters: InitialTokens for TLB-Fill Tokens and thresmax
for the Address-Space-Aware DRAM Scheduler. A sweep
over the range of possible InitialTokens values reveals less
than 1% performance variance, as TLB-Fill Tokens is effec-
tive at reconfiguring the total number of tokens to a steady-
state value (shown in Figure 14). In our evaluation, we set
InitialTokens to 80%. We set thresmax to 500 empirically.
7. EVALUATION
We compare the performance of MASK against three de-
signs. The first, called Static, uses a static spatial partition-
ing of resources, where an oracle is used to partition GPU
cores, but the shared L2 cache and memory channels are par-
titioned equally to each application. This design is intended
to capture key design aspects of NVIDIA GRID and AMD
FirePro—however, insufficient information is publicly avail-
able to enable us to build a higher fidelity model. The second
design, called GPU-MMU, models the flexible spatial parti-
tioning GPU MMU design proposed by Power et al. [68].7
The third design we compare to is an ideal scenario, where
every single TLB access is a TLB hit. We also report the
performance impact of individual components of MASK:
TLB-Fill Tokens (MASK-TLB), TLB-Request-Aware L2 Bypass
(MASK-Cache), and Address-Space-Aware DRAM Scheduler
(MASK-DRAM).
7Note that we use the design in Figure 2b instead of the one in
Figure 2a, as it provides better performance for the workloads that
we evaluate.
7.1 Multiprogrammed Performance
Figures 16 and 17 compare the weighted speedup of
multiprogrammed workloads for MASK, as well as each of the
components of MASK, against Static and GPU-MMU. Each
group of bars in the figure represents a pair of co-scheduled
benchmarks. Compared to GPU-MMU, MASK provides
45.2% additional speedup. We also found that MASK per-
forms only 23% worse than the ideal scenario where the TLB
always hits. We observe that MASK provides 43.4% better
aggregate throughput (system wide IPC) compared to GPU-
MMU. Compared to the Static baseline, where resources
are statically partitioned, both GPU-MMU and MASK pro-
vide better performance, because when an application stalls
for concurrent TLB misses, it does not use other shared
resources such as DRAM bandwidth. During such stalls,
other applications can utilize these resources. When multi-
ple GPGPU applications run concurrently, TLB misses from
two or more applications can be staggered, increasing the
likelihood that there will be heterogeneous and complemen-
tary resource demand.
[Figure 17 plot omitted. Y-axis: weighted speedup; bars for
Static, GPU-MMU, MASK-TLB, MASK-DRAM, MASK-Cache,
MASK, and Ideal; x-axis: 0 HMR, 1 HMR, 2 HMR, and Average.]
Figure 17: System-wide weighted speedup for multiprogrammed
workloads.
Figure 18 compares unfairness in MASK against the GPU-
MMU and Static baselines. On average, our mechanism
reduces unfairness by 22.4% compared to GPU-MMU. As
the number of tokens for each application changes based
on the TLB miss rate, applications that benefit more from
the shared TLB are more likely to get more tokens, caus-
ing applications that do not benefit from shared TLB space
to yield that shared TLB space to other applications. Our
application-aware token distribution mechanism and TLB
fill bypassing mechanism can work in tandem to reduce the
amount of inter-application cache thrashing observed in Sec-
tion 4.2. Compared to statically partitioning resources in
Static, allowing both applications access to all of the shared
resources provides better fairness. On average, MASK re-
duces unfairness by 30.7%, and a handful of applications
benefit by as much as 80.3%.
[Figure 18 plot omitted. Y-axis: unfairness; bars for Static,
GPU-MMU, and MASK; x-axis: 0 HMR, 1 HMR, 2 HMR, and
Average.]
Figure 18: Max unfairness of GPU-MMU and MASK.
Individual Application Analysis. MASK provides bet-
ter throughput on all applications sharing the GPU due to
reduced TLB miss rates for each application. The per-
application L2 TLB miss rates are reduced by over 50% on
average, which is in line with the system-wide miss rates
observed in Figure 3. Reducing the number of TLB misses
through the TLB fill bypassing policy (Section 5.2), and re-
ducing the latency of TLB misses through the shared L2 by-
passing (Section 5.3) and the TLB- and application-aware
DRAM scheduling policy (Section 5.4) enables significant
performance improvement.
In some cases, running two applications concurrently
provides better speedup than running each application alone (e.g.,
RED-BP, RED-RAY, SC-FWT). We attribute these cases to
substantial improvements (more than 10%) of two factors: a
lower L2 queuing latency for bypassed TLB requests, and a
higher L1 hit rate when applications share the L2 and main
memory with other applications.
7.2 Component-by-Component Analysis
Effectiveness of TLB-Fill Tokens. Table 3 compares the
TLB hit rates of GPU-MMU and MASK-TLB. We show only
GPU-MMU results for TLB hit rate experiments, as the TLB
hit behavior for Static and GPU-MMU are similar. MASK-
TLB increases TLB hit rates by 49.9% on average, which we
attribute to TLB-Fill Tokens. First, TLB-Fill Tokens reduces
the number of warps utilizing the shared TLB entries, which
in turn reduces the miss rate. Second, the bypass cache can
store frequently-used TLB entries that cannot be filled in the
traditional TLB. Table 4 confirms this, showing the hit rate
of the bypass cache for MASK-TLB. From Table 3 and Ta-
ble 4, we conclude that the TLB-fill bypassing component
of MASK successfully reduces thrashing and ensures that
frequently-used TLB entries stay cached.
Shared TLB Hit Rate     0 HMR   1 HMR   2 HMR   Average
GPU-MMU                 47.8%   45.6%   55.8%   49.3%
MASK-TLB                68.1%   75.2%   76.1%   73.9%
Table 3: Aggregate Shared TLB hit rates.
Bypass Cache Hit Rate   0 HMR   1 HMR   2 HMR   Average
MASK-TLB                63.9%   66.6%   68.8%   66.7%
Table 4: TLB hit rate for bypassed cache.
Effectiveness of TLB-Request-Aware L2 Bypass. Table 5
shows the average L2 data cache hit rate for TLB requests.
For requests that fill into the shared L2 data cache, TLB-
Request-Aware L2 Bypass is effective in selecting which
blocks to cache, resulting in a TLB request hit rate that is
higher than 99% for all of our workloads. At the same time,
TLB-Request-Aware L2 Bypass minimizes the impact of by-
passed TLB requests, leading to 17.6% better performance
on average compared to GPU-MMU, as shown in Figure 17.
L2 Data Cache Hit Rate  0 HMR   1 HMR   2 HMR   Average
GPU-MMU                 71.7%   71.6%   68.7%   70.7%
MASK-Cache              97.9%   98.1%   98.8%   98.3%
Table 5: L2 data cache hit rate for TLB requests.
Effectiveness of Address-Space-Aware DRAM Scheduler.
While the impact of the DRAM scheduler we propose is
minimal for many applications (the average improvement
across all workloads is just 0.83% in Figure 17), we
observe that a few applications that suffered more severely
from interference (see Figures 10 and 11) can significantly
benefit from our scheduler, since it prioritizes TLB-related
requests. Figures 19a and 19b compare the DRAM bandwidth
utilization and DRAM latency of GPU-MMU and
MASK-DRAM for workloads that benefit from
Address-Space-Aware DRAM Scheduler. When our DRAM
scheduler policy is employed, SRAD from the SCAN-SRAD pair
sees an 18.7% performance improvement, while SCAN
and CONS from SCAN-CONS have performance gains of
8.9% and 30.2%, respectively. In cases where the DRAM
latency is high, the DRAM scheduler policy reduces the
latency of TLB requests by up to 10.6% (SCAN-SAD),
while increasing DRAM bandwidth utilization by up to 5.6%
(SCAN-HISTO).
[Figure 19 plots omitted. Both panels compare GPU-MMU and
MASK-DRAM.]
(a) DRAM Bandwidth Utilization
(b) DRAM Latency
Figure 19: DRAM bandwidth utilization and latency.
7.3 Scalability and Performance on Other Architectures
Figure 20a shows the performance of GPU-MMU and
MASK, normalized to the ideal performance with no
translation overhead, as we vary the number of applications
executing concurrently on the GPU. We observe that as the
application count increases, the performance of both
GPU-MMU and MASK falls further from the ideal baseline, due
to contention for shared resources (e.g., shared TLB, shared
data cache). However, MASK provides increasingly better
performance compared to GPU-MMU (35.5% for one
application, 45.2% for two concurrent applications, and 47.3%
for three concurrent applications). We conclude that MASK
provides better scalability with application count than the
state-of-the-art designs.
[Figure 20 plots omitted. Y-axis: normalized performance; bars
for GPU-MMU, MASK, and Ideal. Panel (a) x-axis: 1 App,
2 Apps, 3 Apps. Panel (b) x-axis: GTX 480 (Fermi) and
GTX 750 Ti (Maxwell).]
(a) Scalability analysis
(b) Performance on Fermi
Figure 20: Scalability and portability studies for MASK.
The analyses and designs of MASK are architecture inde-
pendent and should be applicable to any SIMD machine. To
demonstrate this, we evaluate MASK on the GTX 480, which
uses the Fermi architecture [61]. Figure 20b shows the per-
formance of GPU-MMU and MASK, normalized to the ideal
performance with no translation overhead, for the GTX 480
and the GTX 750 Ti. We make three observations. First,
address translation incurs significant performance overhead
in both architectures for the baseline GPU-MMU design.
Second, MASK provides a 29.1% performance improvement
over the GPU-MMU design in the Fermi architecture. Third,
compared to the ideal performance, MASK performs only
22% worse in the Fermi architecture. On top of the data
shown in Figure 20b, we find that MASK reduces unfairness
by 26.4% and increases the TLB hit rates by 64.7% on
average in the Fermi architecture. We conclude that MASK
delivers significant benefits regardless of GPU architecture.
Aside from this, Table 6 provides an evaluation of
MASK on the integrated GPU configuration used in
previous work [68]. This integrated GPU has fewer
GPU cores, a slower L2 cache, and slower, lower-bandwidth
main memory.
Relative Performance         Maxwell   Integrated GPU [68]
Shared TLB                   52.4%     38.2%
MASK + Shared TLB            76.3%     64.5%
Translation Cache            46.0%     52.1%
MASK + Translation Cache     72.6%     72.5%
Table 6: Relative performance vs. the ideal baseline.
From Table 6, we find that 1) MASK is effective in
reducing the latency of address translation and improves
the performance of both the shared L2 TLB and translation
cache designs on both the off-chip Maxwell GPU and the
integrated GPU configurations, and 2) contention at the shared
L2 TLB becomes significantly more severe and causes a
significant performance drop in the integrated GPU setup.
7.4 Sensitivity Studies
Sensitivity to L1 and L2 TLB Sizes. We evaluated the per-
formance of MASK for a range of L1 and L2 TLB sizes. We
find that for both the L1 and L2 TLB, MASK performs closer
to the baseline as the number of TLB entries increases, as the
contention at the L1 and L2 TLB decreases.
Sensitivity to Memory Policies. We study the sensitivity
of MASK to (1) main memory row policy, and (2) memory
scheduling policies. We find that for both the GPU-MMU
baseline and MASK, the workload performance with an
open-row policy is similar (within 0.8%) to the performance
with a closed-row policy, which is used in various CPU
processors [36,37,40]. Aside from the FR-FCFS scheduler [71,95],
we applied MASK to another state-of-the-art GPU memory
scheduler [45] and found that MASK with this scheduler
performs 44.2% better than the GPU-MMU baseline. We conclude
that MASK is effective across different memory policies.
Sensitivity to Different Page Sizes. We evaluate the
performance of MASK with large pages, assuming an ideal page
fault latency. We find that applying MASK allows the GPU to
perform within 1.8% of the ideal baseline.
7.5 Hardware Overheads
To support memory protection, each L2 TLB cache line
adds an address space identifier (ASID). We model 8-bit
ASIDs added to TLB entries, which translates to 7% of the
L2 TLB size.
TLB-Fill Tokens uses two 16-bit counters at each shader
core. We augment the shared cache with a 32-entry
fully-associative content addressable memory (CAM) for the
bypass cache, and 30 15-bit token counts with 30 1-bit token
direction entries to distribute tokens over up to 30 concurrent
applications. In total, we add 436 bytes (4 bytes per core on
the L1 TLB, and 316 bytes in the shared L2 TLB), which
represents 0.5% growth of the L1 TLB and 3.8% of the L2
TLB.
TLB-Request-Aware L2 Bypass uses ten 8-byte counters
per core to track cache hits and cache accesses per level
(including for the data cache). The resulting 80 bytes are less
than 0.1% of the shared L2 cache. Each cache and memory
request requires an additional 3 bits specifying the page walk
level, as discussed in Section 5.3.
Address-Space-Aware DRAM Scheduler adds a 16-entry
FIFO queue in each memory channel for TLB-related re-
quests, and a 64-entry memory request buffer per memory
channel for the Silver Queue, while reducing the size of the
Normal Queue by 64 entries down to 192 entries. This adds
an extra 6% of storage overhead to the DRAM request queue
per memory controller.
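The storage figures quoted in this subsection can be cross-checked with simple arithmetic, shown here as a short Python sketch. Two quantities are inferred rather than stated in the text and should be read as assumptions: the per-entry size of the bypass cache (derived from the 316-byte total) and the 256-entry baseline request buffer (derived from "192 + 64").

```python
CORES = 30

# TLB-Fill Tokens: two 16-bit hit/miss counters per shader core.
l1_bytes_per_core = 2 * 16 // 8
# Shared L2 TLB: 30 15-bit token counts and 30 1-bit directions.
token_bytes = (30 * 15 + 30 * 1) // 8
l2_added_bytes = 316  # stated in the text
# Inferred: the remainder is the 32-entry bypass cache.
bypass_bytes_per_entry = (l2_added_bytes - token_bytes) // 32
total_tlb_fill_bytes = CORES * l1_bytes_per_core + l2_added_bytes

# TLB-Request-Aware L2 Bypass: ten 8-byte counters.
l2_bypass_counter_bytes = 10 * 8

# DRAM scheduler: Golden (16) + Silver (64) + Normal (192) entries,
# versus an assumed 256-entry baseline buffer (192 + 64).
baseline_entries = 192 + 64
new_entries = 16 + 64 + 192
queue_overhead = (new_entries - baseline_entries) / baseline_entries

print(total_tlb_fill_bytes, l2_bypass_counter_bytes, queue_overhead)
```

Under these assumptions the totals agree with the text: 436 bytes for TLB-Fill Tokens, 80 bytes for the bypass counters, and 6.25% additional request queue entries (reported as 6%).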
Area and Power Consumption. We estimate the area and
power consumption of MASK using CACTI [59], comparing
the area and power of the L1 TLB, L2 TLB, the shared
data cache, and the page walk cache. We find that MASK
introduces a negligible overhead, consuming less than 0.1%
additional area and 0.01% additional power compared to both
the shared L2 TLB and page walk cache baselines.
8. RELATED WORK
8.1 Partitioning for GPU Concurrency
Concurrent Kernels and GPU Multiprogramming. The
opportunity to improve utilization with concurrency is well-
recognized, but previous proposals [56, 66, 87, 92] do not
support memory protection. Adriaens et al. [2] observe the
need for spatial sharing across protection domains, but do
not propose or evaluate a design. NVIDIA GRID [35] and
AMD Firepro [3] support static partitioning of hardware to
allow kernels from different VMs to run concurrently, but
the partitions are determined at startup, which causes frag-
mentation and under-utilization (see Section 7.1). MASK’s
goal is flexible, dynamic partitioning.
NVIDIA’s Multi Process Service (MPS) [64] allows mul-
tiple processes to launch kernels on the GPU, but the service
provides no memory protection or error containment. Xu et
al. [91] propose Warped-Slicer, which is a mechanism for
multiple applications to spatially share a GPU core. Similar
to MPS, Warped-Slicer provides no memory protection, and
is not suitable for supporting multi-application execution in a
multi-tenant cloud setting.
Preemption and Context Switching. Preemptive context
switch is an active research area [33, 79, 87], and architec-
tural support [57, 65] will likely improve in future GPUs.
Preemption is complementary to spatial multiplexing, and
we leave techniques to combine them for future work.
GPU Virtualization. Most current hypervisor-based full
virtualization techniques for GPGPUs [49,78,81] must sup-
port a virtual device abstraction without the dedicated hard-
ware support for the Virtual Desktop Infrastructure (VDI)
found in GRID [35] and FirePro [3]. Key components miss-
ing from these proposals include support for the dynamic
partitioning of hardware resources, and efficient techniques
for handling over-subscription. Performance overheads in-
curred by these designs argue strongly for hardware assis-
tance, as we propose. By contrast, API-remoting solutions
such as vmCUDA [86] and rCUDA [28] provide near-native
performance, but require modifications to the guest software
and sacrifice both isolation and compatibility.
Demand Paging in GPUs. Demand paging is an impor-
tant primitive for memory virtualization that is challenging
for GPUs [84]. Recent works on CC-NUMA [8], AMD’s
hUMA [5], and NVIDIA’s Pascal architecture [65, 94]
provide support for demand paging in GPUs. These techniques can
be used in conjunction with MASK.
8.2 TLB Design
GPU TLB Designs. Previous works have explored TLB de-
signs in heterogeneous systems with GPUs [25, 67, 68, 84],
and the adaptation of x86-like TLBs in a heterogeneous
CPU-GPU setting [68]. Key elements in these designs include
probing the TLB after L1 coalescing to reduce the
number of parallel TLB requests, shared concurrent page table
walks, and translation caches to reduce main memory
accesses. MASK owes much to these designs, but we show
empirically that contention patterns at the shared L2 layer re-
quire additional support beyond these designs to accommo-
date contention from multiple address spaces. Cong et al.
propose a TLB design similar to our baseline GPU-MMU
design [25]. However, this design utilizes the host (CPU)
MMU to perform page walks, which is inapplicable in the
context of multi-application GPUs. Pichai et al. [67] ex-
plore a TLB design for heterogeneous CPU-GPU systems,
and add TLB awareness to the existing CCWS GPU warp
scheduler [72]. Warp scheduling is orthogonal to our work,
and can be combined with MASK to further improve performance.
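As a rough illustration of why probing the TLB after coalescing reduces the number of parallel TLB requests, the following minimal Python sketch collapses the per-thread addresses of one warp into unique virtual page numbers. The 4 KB page size, the 32-thread warp, and the function name coalesce_page_requests are assumptions made for illustration; they are not taken from any of the cited designs.

```python
PAGE_SIZE = 4096  # assumed page size in bytes (illustrative only)


def coalesce_page_requests(addresses, page_size=PAGE_SIZE):
    """Return the unique virtual page numbers touched, in first-seen order."""
    seen = set()
    pages = []
    for addr in addresses:
        vpn = addr // page_size  # virtual page number of this access
        if vpn not in seen:
            seen.add(vpn)
            pages.append(vpn)
    return pages


# Hypothetical warp: 32 threads reading consecutive 4-byte words.
warp_addresses = [0x1000 + 4 * tid for tid in range(32)]
translations = coalesce_page_requests(warp_addresses)
print(len(warp_addresses), "thread accesses ->", len(translations), "TLB lookup(s)")
```

In this sketch, accesses that fall on the same page share a single lookup, so a warp with good spatial locality issues far fewer translation requests than it has threads, while a divergent warp can still issue one request per thread.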
Vesely et al. analyze support for virtual memory in hetero-
geneous systems [84], finding that the cost of address trans-
lation in GPUs is an order of magnitude higher than in CPUs,
and that high-latency address translations limit the GPU’s latency
hiding capability and hurt performance (an observation
in line with our own findings in Section 4.1). We show
additionally that thrashing due to interference further slows
down applications sharing the GPU. MASK is capable not
only of reducing interference between multiple applications
(Section 7.1), but of reducing the TLB miss rate in single-
application scenarios as well.
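The thrashing effect described above can be illustrated with a toy model. The sketch below simulates a small, fully associative LRU TLB whose entries are tagged with an address space identifier; the capacity, the working-set sizes, the strict interleaving, and the helper name count_hits are all assumptions chosen for illustration, and the model is not the simulated GPU configuration or the MASK mechanism.

```python
from collections import OrderedDict


def count_hits(accesses, capacity):
    """Simulate a fully associative LRU TLB and return the number of hits."""
    tlb = OrderedDict()  # key: (address space id, virtual page number)
    hits = 0
    for key in accesses:
        if key in tlb:
            hits += 1
            tlb.move_to_end(key)  # mark as most recently used
        else:
            if len(tlb) >= capacity:
                tlb.popitem(last=False)  # evict the least recently used entry
            tlb[key] = True
    return hits


CAPACITY = 4  # assumed number of shared TLB entries
ROUNDS = 10   # assumed number of passes over each working set

# One application cycling over three pages fits within the TLB.
alone = [("A", vpn) for _ in range(ROUNDS) for vpn in range(3)]

# Two applications interleave; their combined working set (6 pages)
# exceeds the TLB capacity.
shared = [(asid, vpn)
          for _ in range(ROUNDS)
          for vpn in range(3)
          for asid in ("A", "B")]

hits_alone = count_hits(alone, CAPACITY)
hits_shared = count_hits(shared, CAPACITY)
print("hits alone: ", hits_alone, "of", len(alone))
print("hits shared:", hits_shared, "of", len(shared))
```

In this toy setup, the application running alone misses only on its first pass over its pages, whereas the interleaved run never hits, because each application's entries are evicted by the other before they are reused.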
Instead of relying on hardware modifications, Lee et al.
propose VAST, a software-managed virtual memory space
for GPUs [53]. Data-parallel applications typically have a
larger working set size compared to the size of GPU mem-
ory, preventing these applications from utilizing the GPUs.
To address this, VAST creates the illusion of a large virtual
memory (without concerns about the physical memory size),
by providing an automatic memory management system that
partitions GPU programs into chunks that fit the physical
memory space. Even though recent GPUs now support demand
paging [65], the observation regarding the large working
set size of GPGPU programs motivates the need for better
virtual memory support, which is what MASK provides.
TLB Designs in CPU Systems. Cox and Bhattacharjee propose
an efficient TLB design that allows entries corresponding
to multiple page sizes to share the same TLB structure,
simplifying the design of TLBs [26]. While this design
can be applied to GPUs, it is solving a different problem:
area and energy efficiency. Thus, this proposal is orthogo-
nal to MASK. Bhattacharjee et al. examine shared last-level
TLB designs [17] and page walk cache designs [16], propos-
ing a mechanism that can accelerate multithreaded applica-
tions by sharing translations between cores. However, these
proposals are likely to be less effective for multiple con-
current GPGPU applications, because translations are not
shared between virtual address spaces. Barr et al. propose
SpecTLB [14], which speculatively predicts address trans-
lations to avoid the TLB miss latency. Speculatively pre-
dicting address translation can be complicated and costly in
GPUs, because there can be multiple concurrent TLB misses
to many different TLB entries in the GPU.
Direct segments [15] and redundant memory map-
pings [46] reduce address translation overheads by mapping
large contiguous virtual memory regions to a contiguous
physical region. These techniques increase the reach of each
TLB entry, and are complementary to those in MASK.
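The notion of TLB reach can be made concrete with a short calculation: reach is the number of entries multiplied by the amount of memory each entry maps. The sketch below uses an assumed 512-entry TLB and assumed mapping sizes of 4 KiB and 2 MiB purely as an example; direct segments and redundant memory mappings map variable-sized ranges, which this simple arithmetic does not model.

```python
def tlb_reach_bytes(num_entries, bytes_per_entry):
    """Memory covered by the TLB when every entry holds a valid translation."""
    return num_entries * bytes_per_entry


KIB = 1024
MIB = 1024 * KIB

ENTRIES = 512  # assumed TLB size (illustrative only)

reach_4k = tlb_reach_bytes(ENTRIES, 4 * KIB)  # small mappings
reach_2m = tlb_reach_bytes(ENTRIES, 2 * MIB)  # larger contiguous mappings

print(reach_4k // MIB, "MiB of reach with 4 KiB mappings")
print(reach_2m // MIB, "MiB of reach with 2 MiB mappings")
```

With the same number of entries, mapping larger contiguous regions per entry increases the covered memory proportionally, which is why such techniques reduce the number of TLB misses without enlarging the TLB.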
8.3 Techniques to Reduce Interference
GPU-Specific Resource Management. Jog et al. pro-
pose MAFIA, a main memory management scheme that im-
proves performance of concurrently-running GPGPU appli-
cations [45]. The design of MAFIA assumes that parallel
applications operate under the same virtual address space,
and does not model address translation overheads or accom-
modate safe, concurrent execution of kernels from different
protection domains. In contrast, we model and study the im-
pact of address translation and memory protection. Lee et al.
propose TAP [52], a TLP-aware cache management mecha-
nism that modifies the CPU cache partitioning policy [70]
and cache insertion policy [42] to lower GPGPU applications’
interference with CPU applications at the shared cache.
However, TAP does not consider address translation or
interference between different GPGPU applications.
Several memory scheduler designs target systems with
GPUs [11, 21, 43, 82, 83, 93]. Unlike MASK, these designs
focus on a single GPGPU application, and are not aware
of page walk traffic. They focus on reducing the complex-
ity of the memory scheduler for a single application by re-
ducing inter-warp interference [21, 93], or by providing re-
source management for heterogeneous CPU-GPU applica-
tions [11, 43, 82, 83]. While some of these works propose
mechanisms that reduce interference [11, 43, 82, 83], they differ
from MASK because 1) they consider interference from
applications with widely different characteristics (CPU
applications vs. GPU applications), and 2) they do not consider
interference between page-walk-related and normal memory
traffic.
Cache Bypassing Policies in GPUs. Techniques to reduce
contention on shared GPU caches [12, 23, 24, 54, 55, 90] employ
memory-divergence-based bypassing [12], reuse-based
cache bypassing [23, 24, 54, 55, 90], and software-based
cache bypassing [89], and sometimes combine these techniques
with throttling [23, 24, 44] to reduce contention. These works
do not differentiate page walk traffic from normal traffic, and
focus on a single application.
Cache and TLB Insertion Policies. Cache insertion poli-
cies that account for cache thrashing [41, 42, 69] or future
reuse [75] work well for CPU applications, but other pre-
vious works have shown these policies to be ineffective for
GPU applications [12, 52]. This observation also holds for the
shared TLB in the multi-address-space scenario.
9. CONCLUSION
Efficiently deploying GPUs in a large-scale computing environment
requires spatial multiplexing support. However, the
existing address translation support stresses a GPU’s fun-
damental latency hiding techniques, and interference from
multiple address spaces can further harm performance. To
alleviate these problems, we propose MASK, a new mem-
ory hierarchy designed for multi-address-space concurrency.
MASK consists of three major components that lower inter-
application interference during address translation and im-
prove L2 cache utilization for translation requests. MASK
successfully alleviates the address translation overhead, improving
performance by 45.2% over the state of the art.
10. REFERENCES
[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro,
G. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat,
I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz,
L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore,
D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever,
K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas,
O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and
X. Zheng, “TensorFlow: Large-Scale Machine Learning on
Heterogeneous Distributed Systems,” 2015. [Online]. Available:
http://download.tensorflow.org/paper/whitepaper2015.pdf
[2] J. Adriaens, K. Compton, N. S. Kim, and M. Schulte, “The Case for
GPGPU Spatial Multitasking,” in HPCA, 2012.
[3] Advanced Micro Devices, “OpenCL: The Future of Accelerated
Application Performance Is Now,” https:
//www.amd.com/Documents/FirePro_OpenCL_Whitepaper.pdf.
[4] Advanced Micro Devices, AMD-V Nested Paging, 2010,
http://developer.amd.com/wordpress/media/2012/10/NPT-WP-1%
201-final-TM.pdf.
[5] Advanced Micro Devices, “Heterogeneous System Architecture: A
Technical Review,” http://amd-dev.wpengine.netdna-cdn.com/
wordpress/media/2012/10/hsa10.pdf, 2012.
[6] Advanced Micro Devices. (2013) What is Heterogeneous System
Architecture (HSA)? [Online]. Available:
http://developer.amd.com/resources/heterogeneous-computing/
what-is-heterogeneous-system-architecture-hsa/
[7] Advanced Micro Devices, “AMD I/O Virtualization Technology
(IOMMU) Specification,” 2016. [Online]. Available:
http://support.amd.com/TechDocs/48882_IOMMU.pdf
[8] N. Agarwal, D. Nellans, M. O’Connor, S. W. Keckler, and T. F.
Wenisch, “Unlocking Bandwidth for GPUs in CC-NUMA Systems,”
in HPCA, 2015.
[9] N. Amit, M. Ben-Yehuda, and B.-A. Yassour, “IOMMU: Strategies
for Mitigating the IOTLB Bottleneck,” in ISCA, 2012.
[10] A. Arunkumar, E. Bolotin, B. Cho, U. Milic, E. Ebrahimi, O. Villa,
A. Jaleel, and C.-J. Wu, “MCM-GPU: Multi-Chip-Module GPUs for
Continued Performance Scalability,” in ISCA, 2017.
[11] R. Ausavarungnirun, K. Chang, L. Subramanian, G. Loh, and
O. Mutlu, “Staged Memory Scheduling: Achieving High
Performance and Scalability in Heterogeneous Systems,” in ISCA,
2012.
[12] R. Ausavarungnirun, S. Ghose, O. Kayıran, G. H. Loh, C. R. Das,
M. T. Kandemir, and O. Mutlu, “Exploiting Inter-Warp
Heterogeneity to Improve GPGPU Performance,” in PACT, 2015.
[13] A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt,
“Analyzing CUDA Workloads Using a Detailed GPU Simulator,” in
ISPASS, 2009.
[14] T. W. Barr, A. L. Cox, and S. Rixner, “SpecTLB: A Mechanism for
Speculative Address Translation,” in ISCA, 2011.
[15] A. Basu, J. Gandhi, J. Chang, M. D. Hill, and M. M. Swift, “Efficient
Virtual Memory for Big Memory Servers,” in ISCA, 2013.
[16] A. Bhattacharjee, “Large-reach Memory Management Unit Caches,”
in MICRO, 2013.
[17] A. Bhattacharjee and M. Martonosi, “Inter-core Cooperative TLB for
Chip Multiprocessors,” in ASPLOS, 2010.
[18] D. L. Black, R. F. Rashid, D. B. Golub, and C. R. Hill, “Translation
Lookaside Buffer Consistency: A Software Approach,” in ASPLOS,
1989.
[19] D. Bouvier and B. Sander, “Applying AMD’s "Kaveri" APU for
Heterogeneous Computing,” in HOTCHIP, 2014.
[20] M. Burtscher, R. Nasre, and K. Pingali, “A Quantitative Study of
Irregular Programs on GPUs,” in IISWC, 2012.
[21] N. Chatterjee, M. O’Connor, G. H. Loh, N. Jayasena, and
R. Balasubramonian, “Managing DRAM Latency Divergence in
Irregular GPGPU Applications,” in SC, 2014.
[22] S. Che, M. Boyer, J. Meng, D. Tarjan, J. Sheaffer, S.-H. Lee, and
K. Skadron, “Rodinia: A Benchmark Suite for Heterogeneous
Computing,” in IISWC, 2009.
[23] X. Chen, L.-W. Chang, C. I. Rodrigues, J. Lv, Z. Wang, and W. W.
Hwu, “Adaptive Cache Management for Energy-Efficient GPU
Computing,” in MICRO, 2014.
[24] X. Chen, S. Wu, L.-W. Chang, W.-S. Huang, C. Pearson, Z. Wang,
and W. W. Hwu, “Adaptive Cache Bypass and Insertion for
Many-Core Accelerators,” in MES, 2014.
[25] J. Cong, Z. Fang, Y. Hao, and G. Reinman, “Supporting Address
Translation for Accelerator-Centric Architectures,” in HPCA, 2017.
[26] G. Cox and A. Bhattacharjee, “Efficient Address Translation with
Multiple Page Sizes,” in ASPLOS, 2016.
[27] A. Danalis, G. Marin, C. McCurdy, J. S. Meredith, P. C. Roth,
K. Spafford, V. Tipparaju, and J. S. Vetter, “The Scalable
Heterogeneous Computing (SHOC) benchmark suite,” in GPGPU,
2010.
[28] J. Duato, A. Pena, F. Silla, R. Mayo, and E. Quintana-Orti, “rCUDA:
Reducing the Number of GPU-based Accelerators in High
Performance Clusters,” in HPCS, 2010.
[29] E. Ebrahimi, O. Mutlu, C. J. Lee, and Y. N. Patt, “Coordinated
Control of Multiple Prefetchers in Multi-core Systems,” in MICRO,
2009.
[30] S. Eyerman and L. Eeckhout, “System-Level Performance Metrics
for Multiprogram Workloads,” IEEE Micro, vol. 28, no. 3, 2008.
[31] S. Eyerman and L. Eeckhout, “Restating the Case for Weighted-IPC
Metrics to Evaluate Multiprogram Workload Performance,” IEEE
CAL, 2014.
[32] M. Flynn, “Very High-Speed Computing Systems,” Proc. of the
IEEE, vol. 54, no. 2, 1966.
[33] M. Gebhart, D. R. Johnson, D. Tarjan, S. W. Keckler, W. J. Dally,
E. Lindholm, and K. Skadron, “Energy-Efficient Mechanisms for
Managing Thread Context in Throughput Processors,” in ISCA, 2011.
[34] B. He, W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang, “Mars: A
MapReduce Framework on Graphics Processors,” in PACT, 2008.
[35] A. Herrera, “NVIDIA GRID: Graphics Accelerated VDI with the
Visual Performance of a Workstation,” May 2014.
[36] Intel Corporation, “Intel(R) Microarchitecture Codename Sandy
Bridge,”
http://www.intel.com/technology/architecture-silicon/2ndgen/.
[37] Intel Corporation. (2012) Products (Formerly Ivy Bridge). [Online].
Available: http://ark.intel.com/products/codename/29902/
[38] Intel Corporation, “Intel 64 and IA-32 Architectures Software
Developer’s Manual,” 2016, https://www-ssl.intel.com/content/
dam/www/public/us/en/documents/manuals/
64-ia-32-architectures-software-developer-manual-325462.pdf.
[39] Intel Corporation, “Intel Virtualization Technology for Directed I/O,”
2016. [Online]. Available:
http://www.intel.com/content/dam/www/public/us/en/documents/
product-specifications/vt-directed-io-spec.pdf
[40] Intel Corporation, “6th Generation Intel® Core™ Processor Family
Datasheet, Vol. 1,” 2017,
http://www.intel.com/content/dam/www/public/us/en/documents/
datasheets/desktop-6th-gen-core-family-datasheet-vol-1.pdf.
[41] A. Jaleel, W. Hasenplaugh, M. Qureshi, J. Sebot, S. Steely, Jr., and
J. Emer, “Adaptive Insertion Policies for Managing Shared Caches,”
in PACT, 2008.
[42] A. Jaleel, K. B. Theobald, S. C. Steely, Jr., and J. Emer, “High
Performance Cache Replacement Using Re-reference Interval
Prediction (RRIP),” in ISCA, 2010.
[43] M. K. Jeong, M. Erez, C. Sudanthi, and N. Paver, “A QoS-Aware
Memory Controller for Dynamically Balancing GPU and CPU
Bandwidth Use in an MPSoC,” in DAC, 2012.
[44] W. Jia, K. A. Shaw, and M. Martonosi, “MRPB: Memory Request
Prioritization for Massively Parallel Processors,” in HPCA, 2014.
[45] A. Jog, O. Kayiran, T. Kesten, A. Pattnaik, E. Bolotin, N. Chatterjee,
S. W. Keckler, M. T. Kandemir, and C. R. Das, “Anatomy of GPU
Memory System for Multi-Application Execution,” in MEMSYS,
2015.
[46] V. Karakostas, J. Gandhi, F. Ayar, A. Cristal, M. D. Hill, K. S.
McKinley, M. Nemirovsky, M. M. Swift, and O. Ünsal, “Redundant
Memory Mappings for Fast Access to Large Memories,” in ISCA,
2015.
[47] I. Karlin, A. Bhatele, J. Keasler, B. Chamberlain, J. Cohen,
Z. DeVito, R. Haque, D. Laney, E. Luke, F. Wang, D. Richards,
M. Schulz, and C. Still, “Exploring Traditional and Emerging Parallel
Programming Models using a Proxy Application,” in IPDPS, 2013.
[48] I. Karlin, J. Keasler, and R. Neely, “Lulesh 2.0 Updates and
Changes,” 2013.
[49] S. Kato, M. McThrow, C. Maltzahn, and S. Brandt, “Gdev:
First-Class GPU Resource Management in the Operating System,” in
USENIX ATC, 2012.
[50] O. Kayıran, A. Jog, M. T. Kandemir, and C. R. Das, “Neither More
Nor Less: Optimizing Thread-Level Parallelism for GPGPUs,” in
PACT, 2013.
[51] O. Kayıran, N. C. Nachiappan, A. Jog, R. Ausavarungnirun, M. T.
Kandemir, G. H. Loh, O. Mutlu, and C. R. Das, “Managing GPU
Concurrency in Heterogeneous Architectures,” in MICRO, 2014.
[52] J. Lee and H. Kim, “TAP: A TLP-Aware Cache Management Policy for a
CPU-GPU Heterogeneous Architecture,” in HPCA, 2012.
[53] J. Lee, M. Samadi, and S. Mahlke, “VAST: The Illusion of a Large
Memory Space for GPUs,” in PACT, 2014.
[54] C. Li, S. L. Song, H. Dai, A. Sidelnik, S. K. S. Hari, and H. Zhou,
“Locality-Driven Dynamic GPU Cache Bypassing,” in ICS, 2015.
[55] D. Li, M. Rhu, D. Johnson, M. O’Connor, M. Erez, D. Burger,
D. Fussell, and S. Redder, “Priority-Based Cache Allocation in
Throughput Processors,” in HPCA, 2015.
[56] T. Li, V. K. Narayana, and T. El-Ghazawi, “Symbiotic Scheduling of
Concurrent GPU Kernels for Performance and Energy
Optimizations,” in CF, 2014.
[57] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, “NVIDIA
Tesla: A Unified Graphics and Computing Architecture,” IEEE
Micro, vol. 28, no. 2, 2008.
[58] J. Menon, M. de Kruijf, and K. Sankaralingam, “iGPU: Exception
Support and Speculative Execution on GPUs,” in ISCA, 2012.
[59] N. Muralimanohar, R. Balasubramonian, and N. Jouppi, “Optimizing
NUCA Organizations and Wiring Alternatives for Large Caches with
CACTI 6.0,” in MICRO, 2007.
[60] NVIDIA Corporation, “CUDA C/C++ SDK Code Samples,”
http://developer.nvidia.com/cuda-cc-sdk-code-samples, 2011.
[61] NVIDIA Corporation, “NVIDIA’s Next Generation CUDA Compute
Architecture: Fermi,” http://www.nvidia.com/content/pdf/fermi_
white_papers/nvidia_fermi_compute_architecture_whitepaper.pdf,
2011.
[62] NVIDIA Corporation, “NVIDIA’s Next Generation CUDA Compute
Architecture: Kepler GK110,” http://www.nvidia.com/content/PDF/
kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf, 2012.
[63] NVIDIA Corporation, “NVIDIA GeForce GTX 750 Ti,”
http://international.download.nvidia.com/geforce-com/international/
pdfs/GeForce-GTX-750-Ti-Whitepaper.pdf, 2014.
[64] NVIDIA Corporation, “Multi-Process Service,” https://docs.nvidia.
com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf,
2015.
[65] NVIDIA Corporation, “NVIDIA Tesla P100,”
https://images.nvidia.com/content/pdf/tesla/whitepaper/
pascal-architecture-whitepaper.pdf, 2016.
[66] S. Pai, M. J. Thazhuthaveetil, and R. Govindarajan, “Improving
GPGPU Concurrency with Elastic Kernels,” in ASPLOS, 2013.
[67] B. Pichai, L. Hsu, and A. Bhattacharjee, “Architectural Support for
Address Translation on GPUs: Designing Memory Management
Units for CPU/GPUs with Unified Address Spaces,” in ASPLOS,
2014.
[68] J. Power, M. D. Hill, and D. A. Wood, “Supporting x86-64 Address
Translation for 100s of GPU Lanes,” in HPCA, 2014.
[69] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely, and J. Emer,
“Adaptive Insertion Policies for High Performance Caching,” in
ISCA, 2007.
[70] M. K. Qureshi and Y. N. Patt, “Utility-Based Cache Partitioning: A
Low-Overhead, High-Performance, Runtime Mechanism to Partition
Shared Caches,” in MICRO, 2006.
[71] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens,
“Memory Access Scheduling,” in ISCA, 2000.
[72] T. G. Rogers, M. O’Connor, and T. M. Aamodt, “Cache-Conscious
Wavefront Scheduling,” in MICRO, 2012.
[73] B. F. Romanescu, A. R. Lebeck, D. J. Sorin, and A. Bracy, “UNified
Instruction/Translation/Data (UNITD) Coherence: One Protocol to
Rule them All,” in HPCA, 2010.
[74] C. J. Rossbach, J. Currey, M. Silberstein, B. Ray, and E. Witchel,
“PTask: Operating System Abstractions to Manage GPUs as
Compute Devices,” in SOSP, 2011.
[75] V. Seshadri, O. Mutlu, M. A. Kozuch, and T. C. Mowry, “The
Evicted-Address Filter: A Unified Mechanism to Address Both
Cache Pollution and Thrashing,” in PACT, 2012.
[76] B. J. Smith, “A Pipelined, Shared Resource MIMD Computer,” in
ICPP, 1978.
[77] J. A. Stratton, C. Rodrigues, I. J. Sung, N. Obeid, L. W. Chang,
N. Anssari, G. D. Liu, and W. W. Hwu, “Parboil: A Revised
Benchmark Suite for Scientific and Commercial Throughput
Computing,” Univ. of Illinois at Urbana-Champaign, Tech. Rep.
IMPACT-12-01, March 2012.
[78] Y. Suzuki, S. Kato, H. Yamada, and K. Kono, “GPUvm: Why Not
Virtualizing GPUs at the Hypervisor?” in USENIX ATC, 2014.
[79] I. Tanasic, I. Gelado, J. Cabezas, A. Ramirez, N. Navarro, and
M. Valero, “Enabling Preemptive Multiprogramming on GPUs,” in
ISCA, 2014.
[80] J. E. Thornton, “Parallel Operation in the Control Data 6600,” AFIPS
FJCC, 1964.
[81] K. Tian, Y. Dong, and D. Cowperthwaite, “A Full GPU Virtualization
Solution with Mediated Pass-Through,” in USENIX ATC, 2014.
[82] H. Usui, L. Subramanian, K. Chang, and O. Mutlu, “SQUASH:
Simple QoS-Aware High-Performance Memory Scheduler for
Heterogeneous Systems with Hardware Accelerators,” arXiv CoRR,
2015.
[83] H. Usui, L. Subramanian, K. Chang, and O. Mutlu, “DASH:
Deadline-Aware High-Performance Memory Scheduler for
Heterogeneous Systems with Hardware Accelerators,” ACM TACO,
vol. 12, no. 4, Jan. 2016.
[84] J. Vesely, A. Basu, M. Oskin, G. H. Loh, and A. Bhattacharjee,
“Observations and Opportunities in Architecting Shared Virtual
Memory for Heterogeneous Systems,” in ISPASS, 2016.
[85] T. Vijayaraghavan, Y. Eckert, G. H. Loh, M. J. Schulte,
M. Ignatowski, B. M. Beckmann, W. C. Brantley, J. L. Greathouse,
W. Huang, A. Karunanithi, O. Kayiran, M. Meswani, I. Paul,
M. Poremba, S. Raasch, S. K. Reinhardt, G. Sadowski, and
V. Sridharan, “Design and Analysis of an APU for Exascale
Computing,” in HPCA, 2017.
[86] L. Vu, H. Sivaraman, and R. Bidarkar, “GPU Virtualization for High
Performance General Purpose Computing on the ESX Hypervisor,”
in HPC, 2014.
[87] Z. Wang, J. Yang, R. Melhem, B. R. Childers, Y. Zhang, and M. Guo,
“Simultaneous Multikernel GPU: Multi-tasking Throughput
Processors via Fine-Grained Sharing,” in HPCA, 2016.
[88] H. Wong, M.-M. Papadopoulou, M. Sadooghi-Alvandi, and
A. Moshovos, “Demystifying GPU Microarchitecture Through
Microbenchmarking,” in ISPASS, 2010.
[89] X. Xie, Y. Liang, G. Sun, and D. Chen, “An Efficient Compiler
Framework for Cache Bypassing on GPUs,” in ICCAD, 2013.
[90] X. Xie, Y. Liang, Y. Wang, G. Sun, and T. Wang, “Coordinated Static
and Dynamic Cache Bypassing for GPUs,” in HPCA, 2015.
[91] Q. Xu, H. Jeon, K. Kim, W. W. Ro, and M. Annavaram,
“Warped-Slicer: Efficient Intra-SM Slicing through Dynamic
Resource Partitioning for GPU Multiprogramming,” in ISCA, 2016.
[92] T. T. Yeh, A. Sabne, P. Sakdhnagool, R. Eigenmann, and T. G.
Rogers, “Pagoda: Fine-Grained GPU Resource Virtualization for
Narrow Tasks,” in PPoPP, 2017.
[93] G. Yuan, A. Bakhoda, and T. Aamodt, “Complexity Effective
Memory Access Scheduling for Many-Core Accelerator
Architectures,” in MICRO, 2009.
[94] T. Zheng, D. Nellans, A. Zulfiqar, M. Stephenson, and S. W. Keckler,
“Towards High Performance Paged Memory for GPUs,” in HPCA,
2016.
[95] W. K. Zuravleff and T. Robinson, “Controller for a Synchronous
DRAM That Maximizes Throughput by Allowing Memory Requests
and Commands to Be Issued Out of Order,” in US Patent Number
5,630,096, 1997.
