Mosaic: An Application-Transparent Hardware-Software Cooperative Memory
  Manager for GPUs by Ausavarungnirun, Rachata et al.
Rachata Ausavarungnirun1 Joshua Landgraf2 Vance Miller2 Saugata Ghose1
Jayneel Gandhi3 Christopher J. Rossbach2,3 Onur Mutlu4,1
1Carnegie Mellon University 2The University of Texas at Austin
3VMware Research 4ETH Zürich
This paper summarizes the idea and key contributions of
Mosaic, which was published at MICRO 2017 [8], and examines
the work’s signicance and future potential. Contemporary
discrete GPUs support rich memory management features such
as virtual memory and demand paging. These features sim-
plify GPU programming by providing a virtual address space
abstraction similar to CPUs and eliminating manual memory
management, but they introduce high performance overheads
during (1) address translation and (2) page faults. A GPU re-
lies on high degrees of thread-level parallelism (TLP) to hide
memory latency. Address translation can undermine TLP, as a
single miss in the translation lookaside buffer (TLB) invokes an
expensive serialized page table walk that often stalls multiple
threads. Demand paging can also undermine TLP, as multiple
threads often stall while they wait for an expensive data transfer
over the system I/O (e.g., PCIe) bus when the GPU demands a
page.
In modern GPUs, we face a trade-off on how the page size
used for memory management affects address translation and
demand paging. The address translation overhead is lower when
we employ a larger page size (e.g., 2MB large pages, compared
with conventional 4KB base pages), which increases TLB cov-
erage and thus reduces TLB misses. Conversely, the demand
paging overhead is lower when we employ a smaller page size,
which decreases the system I/O bus transfer latency. Support
for multiple page sizes can help relax the page size trade-off
so that address translation and demand paging optimizations
work together synergistically. However, existing page coalesc-
ing (i.e., merging base pages into a large page) and splintering
(i.e., splitting a large page into base pages) policies require costly
base page migrations that undermine the benefits multiple page
sizes provide. In this paper, we observe that GPGPU applications
present an opportunity to support multiple page sizes without
costly data migration, as the applications perform most of their
memory allocation en masse (i.e., they allocate a large number
of base pages at once). We show that this en masse allocation
allows us to create intelligent memory allocation policies which
ensure that base pages that are contiguous in virtual memory
are allocated to contiguous physical memory pages. As a result,
coalescing and splintering operations no longer need to migrate
base pages.
We introduce Mosaic, a GPU memory manager that provides
application-transparent support for multiple page sizes. Mosaic
uses base pages to transfer data over the system I/O bus, and
allocates physical memory in a way that (1) preserves base page
contiguity and (2) ensures that a large page frame contains
pages from only a single memory protection domain. We take
advantage of this allocation strategy to design a novel in-place
page size selection mechanism that avoids data migration. This
mechanism allows the TLB to use large pages, reducing address
translation overhead. During data transfer, this mechanism
enables the GPU to transfer only the base pages that are needed
by the application over the system I/O bus, keeping demand
paging overhead low. Our evaluations show that Mosaic reduces
address translation overheads while efficiently achieving the
benefits of demand paging, compared to a contemporary GPU
that uses only a 4KB page size. Relative to a state-of-the-art
GPU memory manager, Mosaic improves the performance of
homogeneous and heterogeneous multi-application workloads
by 55.5% and 29.7% on average, respectively, coming within
6.8% and 15.4% of the performance of an ideal TLB where all
TLB requests are hits.
1. Introduction
Graphics Processing Units (GPUs) are used for an ever-
growing range of application domains due to their capability
to provide high throughput. However, GPUs require a different
programming model than CPUs, making their general adoption difficult. Recent
support within GPUs for memory virtualization features, such
as a unied virtual address space [57,70], demand paging [73],
and preemption [2,73], can ease programming by allowing de-
velopers to exploit key benets such as application portability
and multi-application execution.
Hardware-supported memory virtualization relies on ad-
dress translation to map each virtual memory address to a
physical address within the GPU memory. Address transla-
tion uses page-granularity virtual-to-physical mappings that
are stored within a multi-level page table. To look up a map-
ping within the page table, the GPU performs a page table
walk, where a page table walker traverses through each level
of the page table in main memory until the walker locates
the page table entry for the requested mapping in the last
level of the table. GPUs with virtual memory support have
translation lookaside buers (TLBs), which cache page table
entries and avoid the need to perform a page table walk for
the cached entries, thereby reducing the address translation
latency.
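The walk-then-cache behavior described above can be sketched in a few lines of Python. This is only an illustrative model: the four-level layout, the 9 index bits per level, the LRU replacement policy, and the names `walk`, `map_page`, and `TLB` are assumptions made for the example, not details taken from the paper.

```python
from collections import OrderedDict

PAGE_SHIFT = 12   # 4KB base pages
LEVEL_BITS = 9    # assumed: 9 index bits per page table level
LEVELS = 4        # assumed: four-level page table
INDEX_MASK = (1 << LEVEL_BITS) - 1

def map_page(root, vpn, pfn):
    """Install a virtual-to-physical mapping in a nested-dict page table."""
    node = root
    for level in reversed(range(1, LEVELS)):
        node = node.setdefault((vpn >> (level * LEVEL_BITS)) & INDEX_MASK, {})
    node[vpn & INDEX_MASK] = pfn

def walk(root, vaddr):
    """Return (physical frame number, number of serialized memory accesses)."""
    vpn = vaddr >> PAGE_SHIFT
    node, accesses = root, 0
    for level in reversed(range(LEVELS)):
        accesses += 1  # each level costs one dependent memory access
        node = node[(vpn >> (level * LEVEL_BITS)) & INDEX_MASK]
    return node, accesses

class TLB:
    """Tiny fully associative TLB with LRU replacement (hypothetical model)."""
    def __init__(self, entries):
        self.entries = entries
        self.cache = OrderedDict()

    def translate(self, root, vaddr):
        vpn = vaddr >> PAGE_SHIFT
        if vpn in self.cache:          # hit: no page table walk needed
            self.cache.move_to_end(vpn)
            return self.cache[vpn], 0
        pfn, accesses = walk(root, vaddr)
        self.cache[vpn] = pfn
        if len(self.cache) > self.entries:
            self.cache.popitem(last=False)  # evict least recently used entry
        return pfn, accesses
```

In this model, the first access to a page pays one memory access per page table level, while later accesses to the same page hit in the TLB and pay none.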
State-of-the-art GPU memory virtualization provides support
for demand paging [3, 57, 73, 81, 102]. In demand paging, the
memory used by a GPU application does not all need
to be transferred to the GPU memory at the beginning of
application execution. Instead, during application execution,
when a GPU thread issues a memory request to a page that
has not yet been allocated in the GPU memory, the GPU
issues a page fault, at which point the data for that page is
transferred over the o-chip system I/O bus (e.g., the PCIe
bus [76] in contemporary systems) from the CPU memory to
the GPU memory. The transfer requires a long latency due
to its use of an o-chip bus. Once the transfer completes, the
GPU runtime allocates a physical GPU memory address to
the page, and the thread can complete its memory request.
GPU Virtualization Challenges. Two fundamental
challenges prevent further adoption of virtualization in GPUs:
(1) the address translation challenge, and (2) the demand pag-
ing challenge. The address translation challenge stems from
a long latency process that consists of a series of serialized
memory accesses required to traverse the page table [80, 81].
As many threads can access different data present in a single
page, these serialized page walk accesses significantly limit
GPU concurrency, by lowering thread-level parallelism (TLP)
and thereby reducing the latency hiding capability of a GPU.
Translation lookaside buers (TLBs) can reduce the latency of
address translation by caching recently-used address transla-
tion information. Unfortunately, as application working sets
and DRAM capacity have increased in recent years, state-of-the-art
TLB designs [80, 81] suffer from poor TLB reach, i.e.,
the TLB covers only a small fraction of the physical memory
working set of an application. We found that the poor
TLB reach has a detrimental effect on GPU performance, because
a single TLB miss can stall hundreds of threads at once,
undermining TLP within a GPU and significantly reducing
performance [8, 61, 95].
Figure 1 shows the performance of two GPU-MMU designs:
(1) a design that uses the base 4KB page size, and (2) a design
that uses a 2MB large page size, where both designs have
no demand paging overhead (i.e., the system I/O bus trans-
fer takes zero cycles to transfer a page). We normalize the
performance of the two designs to a GPU with an ideal TLB,
where all TLB requests hit in the L1 TLB. Our full experimen-
tal methodology is described in detail in our MICRO 2017
paper [8]. We make two major observations from the figure.
First, compared to the ideal TLB, the GPU-MMU with 4KB
base pages experiences an average performance loss of 48.1%.
We observe that with 4KB base pages, a single TLB miss
often stalls many of the warps, which undermines the latency
hiding behavior of the SIMT execution model used by GPUs.
Second, the gure shows that using a 2MB page size with
the same number of TLB entries as the 4KB design allows
applications to come within 2% of the ideal TLB performance.
[Figure 1 plot: normalized performance (y-axis, 0.0 to 1.0) for two series: 4KB pages (no demand paging overhead) and 2MB pages (no demand paging overhead).]
Figure 1: Performance of a GPU with no demand paging overhead,
using (1) 4KB base pages and (2) 2MB large pages, normalized
to the performance of a GPU with an ideal TLB.
Reproduced from [8].
We nd that with 2MB pages, the TLB has a much larger reach,
which reduces the TLB miss rate substantially. Thus, there is
strong incentive to use large pages for address translation.
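The size of the reach gap follows from simple arithmetic: TLB reach is the number of entries multiplied by the page size. The short sketch below works this out for a 128-entry TLB (the per-core L1 base page entry count listed in Table 1); the function name `tlb_reach` is introduced here purely for illustration.

```python
KB = 1024
MB = 1024 * KB

def tlb_reach(entries, page_size_bytes):
    """Amount of memory a TLB can map at once: entries times page size."""
    return entries * page_size_bytes

base_reach = tlb_reach(128, 4 * KB)    # reach with 4KB base pages
large_reach = tlb_reach(128, 2 * MB)   # reach with 2MB large pages

print(base_reach // KB, "KB with 4KB pages")
print(large_reach // MB, "MB with 2MB pages")
print("ratio:", large_reach // base_reach)
```

With the same number of entries, 2MB pages cover 512 times more memory than 4KB pages, which is why the miss rate drops so sharply.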
To increase the TLB reach, large pages (e.g., the 2MB or
1GB pages used in many modern CPU architectures [39, 40])
can be employed. However, large pages increase the risk of
internal fragmentation, where a portion of the large page is un-
allocated (or unused). Internal fragmentation occurs because
it might often be dicult for an application to completely uti-
lize large contiguous regions of memory. This fragmentation
leads to (1) memory bloat, where a much greater amount of
physical memory is allocated than the amount of memory
that the application needs; and (2) longer memory access
latencies, due to a potentially lower effective TLB reach and
more page faults [56].
The demand paging challenge stems from a page fault,
which requires a long-latency data transfer for an entire page
over the system I/O bus [76]. Since GPU threads often access
data in the same page due to data locality, a single page fault
can cause multiple threads to stall at once. As a result, the
page fault can signicantly reduce the amount of TLP that
the GPU can exploit, and thus signicantly degrade perfor-
mance [8, 102].
Unlike address translation, which benefits from larger
pages, the overhead of demand paging is smaller when a
smaller page size is used. A larger amount of data transfer
increases the transfer time, increases the amount of time that
GPU threads stall, and decreases TLP. Furthermore, as the
size of a page increases, there is a greater probability that an
application does not need all of the data in the page. As a
result, threads may stall for a longer time without gaining
any further benet in return. Based on these two conicting
observations, memory virtualization in GPU systems has a
fundamental trade-o due to the page size choice. We provide
more detail on the trade-o in our MICRO 2017 paper [8].
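To make the demand paging side of the trade-off concrete, the sketch below estimates the bus transfer time for one base page and one large page. The 16 GB/s bandwidth is a hypothetical round number chosen for illustration, not a measurement from the paper, and the model ignores any fixed per-transfer latency.

```python
KB = 1024
MB = 1024 * KB

# Hypothetical bus bandwidth, assumed only for this example.
ASSUMED_BANDWIDTH_BYTES_PER_S = 16e9

def transfer_time_us(num_bytes, bandwidth_bytes_per_s):
    """Time to move num_bytes over the bus, in microseconds."""
    return num_bytes / bandwidth_bytes_per_s * 1e6

base_us = transfer_time_us(4 * KB, ASSUMED_BANDWIDTH_BYTES_PER_S)
large_us = transfer_time_us(2 * MB, ASSUMED_BANDWIDTH_BYTES_PER_S)

print(f"4KB page: {base_us:.3f} us")
print(f"2MB page: {large_us:.3f} us")
```

Whatever the actual bandwidth, this simple model makes the stall for a 2MB page fault 512 times longer than for a 4KB page fault, and the extra time is wasted if the application touches only a few base pages within the large page.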
2. Mosaic
In our MICRO 2017 paper [8], we propose Mosaic, a new
GPU memory management scheme that aims to get the best
of both small and large page sizes. Mosaic relaxes the page
size trade-off by using multiple page sizes transparently to the
application, and, thus, to the programmer. With multiple page
sizes, and the ability to change virtual-to-physical mappings
dynamically, the GPU system can support good TLB reach
by using large pages for address translation, while providing
good demand paging performance by using base pages for data
transfer. However, while coalescing multiple small pages into
a large page requires a contiguous region, existing memory
allocation mechanisms make it difficult to find regions of
physical memory where base pages can be coalesced without
a large number of page migration operations. This is because
existing GPU memory allocation mechanisms do not allocate
base pages in a manner that is aware of the contiguity of
memory allocated to each application. Figure 2 shows how a
state-of-the-art GPU memory manager [81] allocates mem-
ory for two applications. Within a single large page frame
(i.e., a contiguous piece of physical memory that is the size
of a large page and whose starting address is page aligned),
the GPU memory manager allocates base pages from both
Applications 1 and 2 (① in Figure 2). As a result, the memory
manager cannot coalesce the base pages into a large page
(②) without first migrating some of the base pages, which
would incur a high latency. Instead, Mosaic allocates physical
base pages in a way that avoids the need to migrate data during
coalescing (③ in Figure 3), and uses a simple coalescing
mechanism to combine base pages into large pages (e.g., 2MB)
and thus increase TLB reach (④ in Figure 3).
[Figure 2 diagram: two large page frames under standard memory allocation (①), each holding a mix of Application 1 base pages, Application 2 base pages, and unallocated pages; the pages cannot be coalesced without migrating data (②).]
Figure 2: Page allocation and coalescing behavior of a
state-of-the-art GPU memory manager [81]. Adapted from [8].
[Figure 3 diagram: two large page frames under Contiguity-Conserving Allocation (③), which the Lazy Coalescer turns into Coalesced Large Page 1 and Coalesced Large Page 2 (④).]
Figure 3: Page allocation and coalescing behavior of Mosaic.
Adapted from [8].
We make a key observation about the memory behavior
of contemporary general-purpose GPU (GPGPU) applications.
We find that the vast majority of memory allocations
in GPGPU applications are performed en masse (i.e., a large
number of pages are allocated at the same time). The en masse
memory allocation presents us with an opportunity: with so
many pages being allocated at once, we can rearrange how
we allocate the base pages to ensure that (1) all of the base
pages allocated within a large page frame belong to the same
virtual address space, and (2) base pages that are contiguous
in virtual memory are allocated to a contiguous portion of
physical memory and aligned within the large page frame.
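The two allocation goals above can be illustrated with a small Python model. This is a simplified sketch under stated assumptions, not the allocator from the paper: the class name, the per-application frame table, and the first-free frame selection are all hypothetical, and the sketch enforces single-application frames strictly rather than as a soft guarantee.

```python
BASE_PAGES_PER_LARGE = 512  # 2MB large page / 4KB base page

class ContiguityConservingAllocator:
    """Toy allocator: one owner per large page frame, aligned placement."""

    def __init__(self, num_large_frames):
        self.free_frames = list(range(num_large_frames))
        self.frame_of = {}  # (app_id, virtual large page number) -> frame
        self.owner = {}     # frame -> app_id

    def allocate(self, app_id, first_vpn, count):
        """Map `count` virtual base pages starting at `first_vpn`.

        Returns a dict from virtual page number to physical page number.
        """
        mapping = {}
        for vpn in range(first_vpn, first_vpn + count):
            key = (app_id, vpn // BASE_PAGES_PER_LARGE)
            if key not in self.frame_of:
                frame = self.free_frames.pop(0)  # IndexError if memory is full
                self.frame_of[key] = frame
                self.owner[frame] = app_id
            frame = self.frame_of[key]
            # Keep the base page at the same offset inside the physical frame
            # as it has inside its virtual large page region.
            offset = vpn % BASE_PAGES_PER_LARGE
            mapping[vpn] = frame * BASE_PAGES_PER_LARGE + offset
        return mapping
```

Because every base page keeps its virtual offset within a frame that belongs to a single application, a fully populated frame is already laid out as a valid large page, so no data needs to move when it is later coalesced.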
Mosaic is designed to achieve these two goals. It consists
of three major components: Contiguity-Conserving
Allocation (CoCoA), the In-Place Coalescer, and Contiguity-Aware
Compaction (CAC). These three components work
together to coalesce (i.e., combine) base pages into large pages
and splinter (i.e., split apart) large pages back to base pages
during memory management. Memory management opera-
tions for Mosaic take place at two times: (1) when memory is
allocated, and (2) when memory is deallocated. We describe
what happens at each component briefly. Figure 4 depicts
the three components of Mosaic, and we will use Figure 4 to
provide a walkthrough of the actions taken during memory
allocation and deallocation.
Memory Allocation. When a GPGPU application wants
to access data that is not currently in the GPU memory, it
sends a request to the GPU runtime (e.g., OpenCL, CUDA
runtimes) to transfer the data from the CPU memory to the
GPU memory (① in Figure 4). A GPGPU application typically
allocates a large number of base pages at the same time.
CoCoA allocates space within the GPU memory (②) for the
base pages, working to conserve the contiguity of base pages,
if possible during allocation. Regardless of contiguity, CoCoA
provides a soft guarantee that a single large page frame contains
base pages from only a single application. Once the base
page is allocated, CoCoA initiates the data transfer across
the system I/O bus (③). When the data transfer is complete
(④), CoCoA notifies the In-Place Coalescer that allocation is
done by sending a list of the large page frame addresses that
were allocated (⑤). For each of these large page frames, the
runtime portion of the In-Place Coalescer then checks to see
whether (1) all base pages within the large page frame have
been allocated, and (2) the base pages within the large page
frame are contiguous in both virtual and physical memory.
If both conditions are true, the hardware portion of the
In-Place Coalescer updates the page table to coalesce the base
pages into a large page (⑥). Section 4.3 of our MICRO 2017
paper [8] describes how page tables are modified to support
coalescing.
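The two conditions checked by the runtime portion of the In-Place Coalescer can be expressed as a short predicate. The sketch below is a hedged illustration only: it assumes a flat dictionary from virtual page number to physical page number for one address space, and the function name `can_coalesce` is hypothetical.

```python
BASE_PAGES_PER_LARGE = 512  # 2MB large page / 4KB base page

def can_coalesce(frame, page_table):
    """Check whether a large page frame can be coalesced in place.

    frame: large page frame number.
    page_table: dict mapping virtual page number -> physical page number
                for a single address space (an assumed representation).
    """
    first_pfn = frame * BASE_PAGES_PER_LARGE
    # Invert the mappings that land inside this frame: pfn -> vpn.
    vpn_of = {pfn: vpn for vpn, pfn in page_table.items()
              if first_pfn <= pfn < first_pfn + BASE_PAGES_PER_LARGE}
    # Condition 1: every base page in the frame has been allocated.
    if len(vpn_of) != BASE_PAGES_PER_LARGE:
        return False
    # Condition 2: the pages are contiguous (and aligned) in virtual memory
    # in the same order as they are in physical memory.
    first_vpn = vpn_of[first_pfn]
    if first_vpn % BASE_PAGES_PER_LARGE != 0:
        return False
    return all(vpn_of[first_pfn + i] == first_vpn + i
               for i in range(BASE_PAGES_PER_LARGE))
```

A frame that passes this check needs only a page table update to become a large page; a frame that fails it is simply left as base pages.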
Memory Deallocation. When a GPGPU application
would like to deallocate memory (e.g., when an application
kernel finishes), it sends a deallocation request to the GPU
runtime (⑦). For all deallocated base pages that are coalesced
into a large page, the runtime invokes Contiguity-Aware
Compaction (CAC) for the corresponding large page. The
runtime portion of CAC checks to see whether the large page
has a high degree of internal fragmentation (i.e., if the number
of unallocated base pages within the large page exceeds
a predetermined threshold). For each large page with high
internal fragmentation, the hardware portion of CAC updates
the page table to splinter the large page back into its constituent
base pages (⑧). Next, CAC compacts the splintered
large page frames, by migrating data from multiple splintered
[Figure 4 diagram: the GPU runtime (Contiguity-Conserving Allocation, In-Place Coalescer, Contiguity-Aware Compaction) and the hardware (page table, TLB miss handling, GPU main memory, system I/O bus), annotated with the steps: ① application demands data; ② allocate memory; ③ transfer data; ④ data transfer done notification; ⑤ send list of large page frames; ⑥ coalesce pages; ⑦ application deallocates data; ⑧ splinter pages; ⑨ compact pages by migrating data; ⑩ send list of newly-free pages after compaction.]
Figure 4: High-level overview of Mosaic, showing how and when its three components interact with the GPU memory.
Reproduced from [8].
large page frames into a single large page frame (⑨). Finally,
CAC notifies CoCoA of the large page frames that are now
free after compaction (⑩), which CoCoA can use for future
memory allocations. We describe each component of Mosaic
in more detail in Sections 4.2, 4.3, and 4.4 of our MICRO 2017
paper [8].
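The deallocation path can be sketched as two small functions: a threshold test that decides when to splinter, and a greedy compaction pass over the splintered frames. This is a rough model under stated assumptions: the threshold value is left to the caller, the greedy fill order is our own choice, and the sketch tracks only per-frame page counts rather than which pages move where.

```python
BASE_PAGES_PER_LARGE = 512  # 2MB large page / 4KB base page

def should_splinter(allocated_pages, threshold):
    """True if the number of unallocated base pages exceeds the threshold."""
    return BASE_PAGES_PER_LARGE - allocated_pages > threshold

def compact(occupancy):
    """Greedily pack live base pages into as few frames as possible.

    occupancy: dict frame -> number of live base pages (splintered frames).
    Returns (new occupancy, list of freed frames, base pages migrated).
    """
    order = sorted(occupancy, key=occupancy.get, reverse=True)
    result = dict(occupancy)
    migrated = 0
    dst, src = 0, len(order) - 1
    while dst < src:
        room = BASE_PAGES_PER_LARGE - result[order[dst]]
        if room == 0:
            dst += 1        # destination frame is full, move to the next one
            continue
        if result[order[src]] == 0:
            src -= 1        # source frame is empty, move to the next one
            continue
        moved = min(room, result[order[src]])
        result[order[dst]] += moved
        result[order[src]] -= moved
        migrated += moved
    freed = [frame for frame in order if result[frame] == 0]
    return result, freed, migrated
```

The freed frames returned by `compact` correspond to the list that CAC hands back to CoCoA for future allocations.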
3. Evaluation Methodology
Table 1 shows the system configuration we simulate for
our evaluations, including the configurations of the GPU
cores and memory partitions. We modify the MAFIA frame-
work [43], which uses GPGPU-Sim 3.2.2 [10], to evaluate
Mosaic. We add a memory allocator into cuda-sim, the
CUDA simulator within GPGPU-Sim, to handle all virtual-
to-physical address translations and to provide memory pro-
tection. We add an accurate model of address translation
to GPGPU-Sim, including TLBs, page tables, and a page
table walker. The page table walker is shared across all
SMs, and allows up to 64 concurrent walks. Both the L1
and L2 TLBs have separate entries for base pages and large
pages [32,47,48,75,78,79]. Each TLB contains miss status holding
registers (MSHRs) [54] to track in-flight page table walks.
Our simulation infrastructure supports demand paging by
detecting page faults and faithfully modeling the system I/O
bus (i.e., PCIe) latency based on measurements from NVIDIA
GTX 1080 cards [74]. We conservatively use a worst-case model
for the performance of our compaction mechanism, stalling
the entire GPU (all SMs) and flushing the pipeline. We
have publicly released our simulator modifications as open
source software [88, 89].
We evaluate the performance of Mosaic using both homo-
geneous and heterogeneous workloads. We categorize each
workload based on the number of concurrently-executing
applications, which ranges from one to five for our homogeneous
workloads, and from two to five for our heterogeneous
workloads. We form our homogeneous workloads using mul-
tiple copies of the same application. We build 27 homoge-
neous workloads for each category using GPGPU applica-
tions from the Parboil [92], SHOC [25], LULESH [49, 50],
Rodinia [20], and CUDA SDK [69] suites. We form our het-
erogeneous workloads by randomly selecting a number of
GPU Core Conguration
Shader Core 30 cores, 1020 MHz, GTO warp scheduler [84]
Cong
Private L1 Cache 16KB, 4-way associative, LRU, L1 misses are
coalesced before accessing L2, 1-cycle latency
Private L1 TLB 128 base page/16 large page entries per core,
fully associative, LRU, single port, 1-cycle latency
Memory Partition Conguration
(6 memory partitions in total
with each partition accessible by all 30 cores)
Shared L2 Cache 2MB total, 16-way associative, LRU, 2 cache banks,
2 ports per memory partition, 10-cycle latency
Shared L2 TLB 512 base page/256 large page entries,
16-way/fully-associative (base page/large page),
, non-inclusive, LRU,2 ports, 10-cycle latency
DRAM 3GB GDDR5 [37, 53], 1674 MHz,
6 channels, 8 banks per rank,
FR-FCFS scheduler [83, 104], burst length 8
Table 1: Conguration of the simulated system. Adapted
from [8].
applications out of these 27 GPGPU applications. We build 25
heterogeneous workloads per category. In total we evaluate
235 homogeneous and heterogeneous workloads.
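The workload totals follow directly from the per-category counts given above. The brief check below reproduces the arithmetic; the variable names are ours.

```python
# Homogeneous workloads: 27 per category, categories of 1 to 5 applications.
homogeneous = 27 * len(range(1, 6))
# Heterogeneous workloads: 25 per category, categories of 2 to 5 applications.
heterogeneous = 25 * len(range(2, 6))
total = homogeneous + heterogeneous

print(homogeneous, heterogeneous, total)
```

This gives 135 homogeneous and 100 heterogeneous workloads, for 235 in total.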
We compare Mosaic to two mechanisms: (1) GPU-MMU, a
baseline GPU with a state-of-the-art memory manager based
on the work by Power et al. [81]; and (2) Ideal TLB, a GPU with
an ideal TLB, where every address translation request hits in
the L1 TLB (i.e., there are no TLB misses). We report work-
load performance using the weighted speedup metric [28,29],
which is calculated as:
\[ \text{Weighted Speedup} = \sum \frac{IPC_{shared}}{IPC_{alone}} \qquad (1) \]
where IPCalone is the retired instructions per cycle (IPC) of an
application in the workload that runs on the same number
of shader cores using the baseline state-of-the-art configuration
[81], but does not share GPU resources with any other
applications; and IPCshared is the IPC of the application when
it runs concurrently with other applications. We report the
performance of each application within a workload using
IPC.
Section 5 of our MICRO 2017 paper [8] provides more detail
on our experimental methodology.
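As a small illustration of the weighted speedup metric defined in Equation (1), the sketch below sums the per-application ratio of shared IPC to alone IPC. The function name and the sample IPC values are made up for the example and are not results from the paper.

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """Sum over applications of IPC when sharing the GPU / IPC when alone."""
    if len(ipc_shared) != len(ipc_alone):
        raise ValueError("need one shared and one alone IPC per application")
    return sum(shared / alone for shared, alone in zip(ipc_shared, ipc_alone))

# Hypothetical two-application workload: each application runs at half of
# its standalone IPC when sharing the GPU.
example = weighted_speedup([1.0, 2.0], [2.0, 4.0])
print(example)
```

A workload of N applications that suffer no slowdown from sharing would reach a weighted speedup of N, which is why the metric grows with the number of concurrently-executing applications.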
4. Experimental Results
Figure 5 shows the performance of Mosaic for the homoge-
neous workloads we evaluated. We make two observations
from the gure. First, we observe that Mosaic is able to re-
cover most of the performance lost due to the overhead of
address translation (i.e., an ideal TLB) in homogeneous work-
loads. Compared to the GPU-MMU baseline, Mosaic improves
system performance by 55.5%, averaged across all 135 of our
homogeneous workloads. The performance of Mosaic comes
within 6.8% of the Ideal TLB performance, indicating that
Mosaic is eective at extending the TLB reach. Second, we
observe that Mosaic provides good scalability. As we increase
the number of concurrently-executing applications, which
puts more pressure on the shared TLBs, we observe that the
performance of Mosaic remains close to the Ideal TLB perfor-
mance.
[Figure 5 plot: weighted speedup (y-axis, 0 to 7) versus number of concurrently-executing applications (x-axis, 1 App to 5 Apps) for GPU-MMU, Mosaic, and Ideal TLB; annotated improvements: 39.0%, 33.8%, 55.4%, 61.5%, 95.0%.]
Figure 5: Homogeneous workload performance of GPU
memory managers as we vary the number of concurrently-executing
applications in each workload. Reproduced
from [8].
Figure 6 shows the performance of Mosaic for heterogeneous
workloads that consist of multiple different randomly-selected
GPGPU applications (100 workloads in total). From
the figure, we observe that on average across all of the workloads,
Mosaic provides a performance improvement of 29.7%
over GPU-MMU, and comes within 15.4% of the Ideal TLB
performance. We find that the improvement comes from the
significant reduction in the TLB miss rate with Mosaic. We
also see that Mosaic’s scalability is good as the number of
applications increases, yet there is still room for improvement
to reach the performance of Ideal TLB. We conclude
that Mosaic is a more effective memory manager than the
state-of-the-art. A detailed analysis of the results in Figures 5
and 6 can be found in Sections 6.1 and 6.2 of our MICRO 2017
paper [8].
Impact of Demand Paging on Performance. All of
our results so far show the performance of the GPU-MMU
baseline and Mosaic when demand paging is enabled. Figure 7
shows the normalized weighted speedup of the GPU-MMU
baseline and Mosaic, compared to GPU-MMU without demand
paging, where all data required by an application is moved
to the GPU memory before the application starts executing.
We make two observations from the gure. First, we nd
[Figure 6 plot: weighted speedup (y-axis, 0 to 7) versus number of concurrently-executing applications (x-axis, 2 Apps to 5 Apps) for GPU-MMU, Mosaic, and Ideal TLB; annotated improvements: 23.7%, 43.1%, 31.5%, 21.4%.]
Figure 6: Heterogeneous workload performance of the GPU
memory managers. Reproduced from [8].
that Mosaic outperforms GPU-MMU without demand pag-
ing by 58.5% on average for homogeneous workloads and
47.5% on average for heterogeneous workloads. Second, we
find that demand paging has little impact on the weighted
speedup. This is because demand paging latency occurs only
when a kernel launches, at which point the GPU retrieves
data from the CPU memory. The data transfer overhead is
required regardless of whether or not demand paging is en-
abled, and thus the GPU incurs similar overhead with and
without demand paging. We conclude that Mosaic improves
performance signicantly, regardless of the demand paging
overhead in the baseline.
[Figure 7 plot: normalized weighted speedup (y-axis, 0.0 to 2.0) for homogeneous and heterogeneous workloads; series: GPU-MMU with no demand paging, GPU-MMU with demand paging, and Mosaic with demand paging; annotated improvements: 58.5% (homogeneous), 47.5% (heterogeneous).]
Figure 7: Performance of GPU-MMU and Mosaic compared
to GPU-MMU without demand paging. Reproduced from [8].
TLB Hit Rate. Figure 8 compares the overall TLB hit rate
of GPU-MMU to Mosaic for 214 of our 235 workloads, which
suffer from limited TLB reach (i.e., workloads that have an
L2 TLB hit rate lower than 98%). We make two observations
from the figure. First, we observe that Mosaic is very effective
at increasing the TLB reach of these workloads. We find that
for the GPU-MMU baseline, every fully-mapped large page
frame contains pages from multiple applications, as the GPU-MMU
allocator does not provide the soft guarantee of CoCoA
(i.e., a single large page frame contains base pages from only
a single application). As a result, GPU-MMU does not have
any opportunities to coalesce base pages into a large page
without performing significant amounts of data migration. In
contrast, Mosaic can coalesce a vast majority of base pages
thanks to CoCoA. As a result, Mosaic reduces the TLB miss
rate drastically for these workloads, with the average miss
rate falling below 1% in both the L1 and L2 TLBs. Second, we
observe an increasing amount of interference in GPU-MMU
when more than three applications are running concurrently.
This results in a lower TLB hit rate as the number of applications
increases from three to four, and from four to five.
The L2 TLB hit rate of GPU-MMU drops from 81% in workloads
with two concurrently-executing applications to 62%
in workloads with five concurrently-executing applications.
Mosaic experiences no such drop due to interference as we
increase the number of concurrently-executing applications,
since it makes much greater use of large page coalescing and
enables a much larger TLB reach. We conclude that Mosaic is
very eective in improving the hit rate.
[Figure 8 plot: L1 and L2 TLB hit rate (y-axis, 0% to 100%) versus number of concurrently-executing applications (x-axis, 1 App to 5 Apps) for GPU-MMU and Mosaic.]
Figure 8: L1 and L2 TLB hit rates for GPU-MMU and Mosaic.
Reproduced from [8].
We provide the following additional results in our full
MICRO 2017 paper [8]:
• Individual applications’ performance with Mosaic and the
baseline GPU-MMU
• TLB size sensitivity of Mosaic and the baseline GPU-MMU
• Analysis of the eectiveness of CAC to reduce memory
fragmentation incurs by using large pages
5. Related Work
To our knowledge, this is the rst work to (1) analyze the
fundamental trade-os between TLB reach, demand paging
performance, and internal page fragmentation; and (2) pro-
pose an application-transparent GPU memory manager that
preemptively coalesces pages at allocation time to improve
address translation performance, while avoiding the demand
paging ineciencies and memory copy overheads typically
associated with large page support. Reducing performance
degradation from address translation overhead is an active
area of work for CPUs, and the performance loss that we
observe as a result of address translation is well corrobo-
rated [13, 15, 31, 33, 63]. In this section, we discuss previous
techniques that aim to reduce the overhead of address trans-
lation and demand paging.
5.1. TLB Designs for CPU Systems
TLB miss overhead can be reduced by (1) accelerating page
table walks [11, 14] or reducing the walk frequency [32]; or
(2) reducing the number of TLB misses (e.g., through prefetch-
ing [16, 46, 90], prediction [75], structural changes to the
TLB [77, 78, 93] or a TLB hierarchy [4, 5, 13, 15, 31, 47, 60, 91]).
Support for Multiple Page Sizes. Multi-page mapping
techniques [77,78,93] use a single TLB entry for multiple page
translations, improving TLB reach by a small factor. Much
greater improvements to TLB reach are needed to deal with
modern memory sizes. MIX TLB [24] accommodates entries
that translate multiple page sizes, eliminating the need for
a dedicated set of large page entries in the TLB. MIX TLB
is orthogonal to our work, and can be used with Mosaic to
further improve TLB reach.
Navarro et al. [66] identify contiguity-awareness and frag-
mentation reduction as primary concerns for large page
management, proposing reservation-based allocation and
deferred promotion (i.e., coalescing) of base pages to large
pages. Similar ideas are widely used in modern OSes [23]. In-
stead of the reservation-based scheme, Ingens [56] employs a
utilization-based scheme that uses a bit vector to track spatial
and temporal utilization of base pages.
Techniques to Increase Memory Contiguity.
GLUE [79] groups contiguous, aligned base page translations
under a single speculative large page translation in the
TLB. GTSM [26] provides hardware support to leverage the
contiguity of physical memory regions even when pages
have been retired due to bit errors. These mechanisms for
preserving or recovering contiguity are orthogonal to the
contiguity-conserving allocation we propose for Mosaic, and
they can help Mosaic by avoiding the need for compaction.
Gorman et al. [35] propose a placement policy for an OS’s
physical page allocator that mitigates fragmentation and pro-
motes contiguity by grouping pages according to the amount
of migration required to achieve contiguity. Subsequent
work [36] proposes a software-exposed interface for applica-
tions to explicitly request large pages like libhugetlbfs [34].
These ideas are complementary to our work. Mosaic can potentially
benefit from similar policies if such policies can be
simplified enough to be implementable in hardware.
Alternative TLB Designs. Research on shared last-level
TLB designs [15, 17, 60] and page walk cache designs [14] has
yielded mechanisms that accelerate multithreaded CPU appli-
cations by sharing translations between cores. SpecTLB [12]
provides a technique that predicts address translations. While
speculation works on CPU applications, speculation for
highly-parallel GPUs is more complicated [41,44], and can potentially
waste off-chip DRAM bandwidth, which is a highly-contended
resource in GPUs. Direct segments [13] and redundant
memory mappings [47] provide virtual memory support
for server workloads that reduces the overhead of address
translation. These proposals map large contiguous chunks
of virtual memory to the physical address space in order to
reduce the address translation overhead. While these tech-
niques improve the TLB reach, they increase the transfer
latency depending on the size of the virtual chunks they
map.
5.2. TLB Designs for GPU Systems
TLB Designs for Heterogeneous Systems. Previous
works provide several TLB designs for heterogeneous sys-
tems with GPUs [80,81,95] and with accelerators [22]. Mosaic
improves upon a state-of-the-art TLB design [81] by provid-
ing application-transparent, high-performance support for
multiple page sizes in GPUs. No prior work provides such
support.
TLB-Aware Warp Scheduler. Pichai et al. [80] extend the
cache-conscious warp scheduler [84] to be aware of the TLB in
heterogeneous CPU-GPU systems. Other more sophisticated
warp schedulers [51,59,62,65,84,85,103] can also be extended
to be TLB aware. These techniques are orthogonal to the
problem we focus on, and can be applied in conjunction with
Mosaic to further improve performance.
TLB-Aware Memory Hierarchy. Ausavarungnirun et
al. [9] improve the performance of the GPU in the presence
of memory protection by redesigning the GPU main
memory hierarchy to be aware of TLB-related memory requests.
Many prior works propose memory scheduling
designs for GPUs [7, 42, 45, 101] and heterogeneous systems
[6, 94]. These memory scheduling designs can be modified
to be aware of TLB-related memory requests and used in
conjunction with Mosaic to further improve GPU performance.
Analysis of Address Translation in GPUs. Vesely et
al. [95] analyze support for virtual memory in heterogeneous
systems, nding that the cost of address translation in GPUs is
an order of magnitude higher than that in CPUs. They observe
that high-latency address translations limit the GPU’s latency
hiding capability and hurt performance. Mei et al. [61] use a
set of microbenchmarks to evaluate the address translation
process in commercial GPUs. Their work concludes that
previous NVIDIA architectures [71, 72] have off-chip L1 and
L2 TLBs, which lead to poor performance.
GPU Core Modications. Many prior works propose
modications to the GPU core design [7, 45, 51, 52, 55, 59, 62,
65, 84, 85, 86, 87, 97, 98, 103]. These techniques are complemen-
tary to Mosaic, and can be combined with Mosaic to further
improve GPU performance.
5.3. GPU Virtualization
VAST [58] is a software-managed virtual memory space
for GPUs. In that work, the authors observe that the lim-
ited size of physical memory typically prevents data-parallel
programs from utilizing GPUs. To address this, VAST automatically
partitions GPU programs into chunks that fit within
the physical memory space to create the illusion of infinite
virtual memory. Unlike Mosaic, VAST is unable to provide
memory protection from concurrently-executing GPGPU ap-
plications. Zorua [96] is a holistic mechanism to virtualize
multiple hardware resources within the GPU. Zorua does not
virtualize the main memory, and is thus orthogonal to our
work. vmCUDA [99] and rCUDA [27] provide close-to-ideal
performance, but they require significant modifications to
GPGPU applications and the operating system, which sacrifice
transparency to the application, performance isolation,
and compatibility across multiple GPU architectures.
5.4. Demand Paging for GPUs
Demand paging is a challenge for GPUs [95]. Recent
works [3, 102], and the AMD hUMA [57] and NVIDIA Pascal
architectures [73, 102] provide various levels of support
for demand paging in GPUs. These techniques do not tackle
the existing trade-off in GPUs between using large pages to
improve address translation and using base pages to minimize
demand paging overhead, which we relax with Mosaic.
6. Potential Impact of Mosaic
While several previous works propose mechanisms to
lower the overhead of virtual memory [13, 15, 26, 31, 33, 63,
79, 80, 81, 95], only a handful of these works extensively eval-
uate virtual memory on GPUs [58, 80, 81, 95], and no work
has investigated virtual memory as a shared resource when
multiple GPGPU applications need to share the GPUs. In this
section, we explore the potential future impact of Mosaic.
Support for Concurrent Application Execution in
GPUs. The large number of cores within a contemporary
GPU makes it an attractive substrate for executing multiple
applications in parallel. This can be especially useful in virtualized
cloud environments, where hardware resources are
safely partitioned across multiple virtual machines to provide
efficient resource sharing. Prior approaches to executing multiple
applications concurrently on a GPU have been limited,
as they either (1) lack sufficient memory protection support
across multiple applications; (2) incur a high performance
overhead to provide memory protection; or (3) perform a
conservative static partitioning of the GPU, which can often
underutilize many resources in the GPU.
Mosaic provides the rst exible support for memory pro-
tection within a GPU, allowing applications to dynamically
partition GPU resources without violating memory protection
guarantees. This support can enable the practical virtualiza-
tion and sharing of GPUs in a cloud environment, which in
turn can increase the appeal of GPGPU programming and
the use cases of GPGPUs. By enabling practical support for
concurrent application execution on GPUs, Mosaic encour-
ages and enables future research in several areas, such as
resource sharing mechanisms, kernel scheduling, and quality-
of-service enforcement within the GPU and heterogeneous
systems.
Virtual Memory for SIMD Architectures. Mosaic is an
important first step to enable low-overhead virtual memory in
GPUs. We believe that the key ideas and general observations
that we make are applicable to any highly-parallel SIMD
architecture [30], and to heterogeneous systems with SIMD-based
processing cores [1, 18, 19, 21, 38, 39, 40, 64, 67, 68, 82,
100]. Future works can expand upon our findings and adapt
our mechanisms to reduce the overhead of page walks and
demand paging on other SIMD-based systems.
Improved Programmability. Aside from memory protection,
virtual memory can be used to (1) improve the programmability
of GPGPU applications, and (2) decouple a GPU
kernel’s working set size from the size of the GPU memory. In
fact, Mosaic transparently allows applications to benefit from
virtual memory without incurring a significant performance
overhead. This is a key advantage for programmers, many
of whom are used to the conventional programming model
used in CPUs to provide application portability and memory
protection. By providing programmers with a familiar and
simple memory abstraction, we expect that a greater number
of programmers will start writing high-performance GPGPU
applications. Furthermore, by enabling low-overhead memory
virtualization, Mosaic can enable new classes of GPGPU
applications. For example, in the past, programmers were
not able to easily write GPGPU applications whose memory
working set sizes exceeded the physical memory within the
GPU. With Mosaic, programmers no longer need to restrict
themselves to applications whose working sets fit within the
physical memory; they can rely on the GPU itself to manage
page migration and address translation transparently to
software.
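The scenario described above, a working set larger than the GPU's physical memory, can be sketched with a small demand-paging model. This is a minimal illustration and not Mosaic's mechanism: the LRU eviction policy, the function name `count_page_faults`, and the access trace are all hypothetical choices made for the example.

```python
from collections import OrderedDict


def count_page_faults(accesses, num_frames):
    """Count demand-paging faults for a page access trace.

    Assumes a fixed number of physical frames and LRU eviction
    (an illustrative policy, not one specified by this work).
    """
    resident = OrderedDict()  # resident pages, least recently used first
    faults = 0
    for page in accesses:
        if page in resident:
            resident.move_to_end(page)        # refresh recency on a hit
        else:
            faults += 1                       # page must be brought in on demand
            if len(resident) >= num_frames:
                resident.popitem(last=False)  # evict the least recently used page
            resident[page] = None
    return faults


trace = [0, 1, 2, 3, 0, 1, 4, 0, 1, 2, 3, 4]   # hypothetical trace touching 5 pages
print(count_page_faults(trace, num_frames=3))  # prints 10
print(count_page_faults(trace, num_frames=5))  # prints 5
```

With five frames, only the first touch of each page faults. With three frames, the same program still runs to completion but pays for additional transfers, which is the cost that low-overhead demand paging aims to keep small.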
Publicly-Released Infrastructure. Our simulation in-
frastructure is publicly available as open-source software [88].
Other researchers can utilize our infrastructure to conduct
future research on virtual memory management on GPUs.
Some examples of research topics that can be investigated us-
ing our infrastructure include (1) how to manage which pages
reside in CPU memory or GPU memory, (2) how to dynam-
ically partition the physical main memory across multiple
concurrently-executing applications, and (3) how to maintain
programmability of the virtual memory as the GPU architec-
ture evolves and becomes more heterogeneous over time. We
hope and believe that our new, open-source infrastructure
can inspire future research in these and other research areas
on GPU and heterogeneous system memory virtualization.
7. Conclusion
We introduce Mosaic, a new GPU memory manager that
provides application-transparent support for multiple page
sizes. The key idea of Mosaic is to perform demand paging
using smaller page sizes, and then coalesce small (i.e., base)
pages into a larger page immediately after allocation, which
allows address translation to use large pages and thus increase
TLB reach. We have shown that Mosaic significantly
outperforms state-of-the-art GPU address translation designs
and achieves performance close to an ideal TLB, across a wide
variety of workloads. We conclude that Mosaic effectively
combines the benefits of large pages and demand paging in
GPUs, thereby breaking the conventional tension that exists
between these two concepts. We hope the ideas presented in
our MICRO 2017 paper can lead to future works that analyze
Mosaic in detail and provide even lower-overhead support for
synergistic address translation and demand paging in GPUs
and heterogeneous systems.
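As a rough illustration of the coalescing idea summarized above, the sketch below checks whether the base pages of one large-page-sized virtual region could be treated as a single large page. It assumes that coalescing requires every base page in the region to be resident and laid out contiguously inside one aligned large page frame; the flat dictionary page table and the name `can_coalesce` are hypothetical and are not the hardware structures used by Mosaic.

```python
BASE_PAGE = 4 * 1024
LARGE_PAGE = 2 * 1024 * 1024
PAGES_PER_LARGE = LARGE_PAGE // BASE_PAGE  # 512 base pages per large page


def can_coalesce(page_table, large_vpn):
    """Return True if a virtual large-page region is eligible for coalescing.

    page_table maps base virtual page numbers to base physical page numbers.
    Eligibility (an assumption of this sketch): all 512 base pages are mapped,
    in order, within one large-page-aligned physical frame.
    """
    first_vpn = large_vpn * PAGES_PER_LARGE
    first_ppn = page_table.get(first_vpn)
    if first_ppn is None or first_ppn % PAGES_PER_LARGE != 0:
        return False
    return all(
        page_table.get(first_vpn + i) == first_ppn + i
        for i in range(PAGES_PER_LARGE)
    )


# Region 0: fully resident and contiguous inside physical large frame 3.
page_table = {vpn: 3 * PAGES_PER_LARGE + vpn for vpn in range(PAGES_PER_LARGE)}
# Region 1: only its first base page has been demand-paged in so far.
page_table[PAGES_PER_LARGE] = 7 * PAGES_PER_LARGE

print(can_coalesce(page_table, 0))  # prints True
print(can_coalesce(page_table, 1))  # prints False
```

In this model, pages are still transferred at base-page granularity, and a region becomes eligible for large-page translation only once the contiguity condition holds.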
Acknowledgments
We thank the anonymous reviewers and SAFARI group
members for their feedback. Special thanks to Nastaran Ha-
jinazar, Juan Gómez Luna, and Mohammad Sadr for their
feedback. We acknowledge the support of our industrial part-
ners, especially Google, Intel, Microsoft, NVIDIA, Samsung,
and VMware. This research was partially supported by the
NSF (grants 1409723 and 1618563), the Intel Science and Tech-
nology Center for Cloud Computing, and the Semiconductor
Research Corporation.
References
[1] Advanced Micro Devices, Inc., “AMD Accelerated Processing Units,” http://www.
amd.com/us/products/technologies/apu/Pages/apu.aspx.
[2] Advanced Micro Devices, Inc., “OpenCL: The Future of Accelerated Applica-
tion Performance Is Now,” https://www.amd.com/Documents/FirePro_OpenCL_
Whitepaper.pdf.
[3] N. Agarwal, D. Nellans, M. O’Connor, S. W. Keckler, and T. F. Wenisch, “Unlock-
ing Bandwidth for GPUs in CC-NUMA Systems,” in HPCA, 2015.
[4] J. Ahn, S. Jin, and J. Huh, “Revisiting Hardware-Assisted Page Walks for Virtual-
ized Systems,” in ISCA, 2012.
[5] J. Ahn, S. Jin, and J. Huh, “Fast Two-Level Address Translation for Virtualized
Systems,” IEEE TC, 2015.
[6] R. Ausavarungnirun, K. Chang, L. Subramanian, G. Loh, and O. Mutlu, “Staged
Memory Scheduling: Achieving High Performance and Scalability in Heteroge-
neous Systems,” in ISCA, 2012.
[7] R. Ausavarungnirun, S. Ghose, O. Kayıran, G. H. Loh, C. R. Das, M. T. Kandemir,
and O. Mutlu, “Exploiting Inter-Warp Heterogeneity to Improve GPGPU Perfor-
mance,” in PACT, 2015.
[8] R. Ausavarungnirun, J. Landgraf, V. Miller, S. Ghose, J. Gandhi, C. J. Rossbach,
and O. Mutlu, “Mosaic: A GPU Memory Manager with Application-Transparent
Support for Multiple Page Sizes,” in MICRO, 2017.
[9] R. Ausavarungnirun, V. Miller, J. Landgraf, S. Ghose, J. Gandhi, A. Jog, C. Ross-
bach, and O. Mutlu, “MASK: Redesigning the GPU Memory Hierarchy to Support
Multi-Application Concurrency,” in ASPLOS, 2018.
[10] A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt, “Analyzing CUDA
Workloads Using a Detailed GPU Simulator,” in ISPASS, 2009.
[11] T. W. Barr, A. L. Cox, and S. Rixner, “Translation Caching: Skip, Don’t Walk (the
Page Table),” in ISCA, 2010.
[12] T. W. Barr, A. L. Cox, and S. Rixner, “SpecTLB: A Mechanism for Speculative
Address Translation,” in ISCA, 2011.
[13] A. Basu, J. Gandhi, J. Chang, M. D. Hill, and M. M. Swift, “Efficient Virtual Memory
for Big Memory Servers,” in ISCA, 2013.
[14] A. Bhattacharjee, “Large-Reach Memory Management Unit Caches,” in MICRO,
2013.
[15] A. Bhattacharjee, D. Lustig, and M. Martonosi, “Shared Last-level TLBs for Chip
Multiprocessors,” in HPCA, 2011.
[16] A. Bhattacharjee and M. Martonosi, “Characterizing the TLB Behavior of Emerg-
ing Parallel Workloads on Chip Multiprocessors,” in PACT, 2009.
[17] A. Bhattacharjee and M. Martonosi, “Inter-Core Cooperative TLB for Chip Mul-
tiprocessors,” in ASPLOS, 2010.
[18] D. Bouvier and B. Sander, “Applying AMD’s "Kaveri" APU for Heterogeneous
Computing,” in Hot Chips, 2014.
[19] B. Burgess, B. Cohen, J. Dundas, J. Rupley, D. Kaplan, and M. Denman, “Bobcat:
AMD’s Low-Power x86 Processor,” IEEE Micro, 2011.
[20] S. Che, M. Boyer, J. Meng, D. Tarjan, J. Sheaffer, S.-H. Lee, and K. Skadron, “Rodinia:
A Benchmark Suite for Heterogeneous Computing,” in IISWC, 2009.
[21] M. Clark, “A New X86 Core Architecture for the Next Generation of Computing,”
in Hot Chips, 2016.
[22] J. Cong, Z. Fang, Y. Hao, and G. Reinman, “Supporting Address Translation for
Accelerator-Centric Architectures,” in HPCA, 2017.
[23] J. Corbet, “Transparent Hugepages,” https://lwn.net/Articles/359158/, 2009.
[24] G. Cox and A. Bhattacharjee, “Efficient Address Translation for Architectures
with Multiple Page Sizes,” in ASPLOS, 2017.
[25] A. Danalis, G. Marin, C. McCurdy, J. S. Meredith, P. C. Roth, K. Spafford, V. Tipparaju,
and J. S. Vetter, “The Scalable Heterogeneous Computing (SHOC) Benchmark
Suite,” in GPGPU, 2010.
[26] Y. Du, M. Zhou, B. R. Childers, D. Mossé, and R. Melhem, “Supporting Super-
pages in Non-Contiguous Physical Memory,” in HPCA, 2015.
[27] J. Duato, A. Pena, F. Silla, R. Mayo, and E. Quintana-Orti, “rCUDA: Reducing
the Number of GPU-based Accelerators in High Performance Clusters,” in HPCS,
2010.
[28] S. Eyerman and L. Eeckhout, “System-Level Performance Metrics for Multipro-
gram Workloads,” IEEE Micro, 2008.
[29] S. Eyerman and L. Eeckhout, “Restating the Case for Weighted-IPC Metrics to
Evaluate Multiprogram Workload Performance,” IEEE CAL, 2014.
[30] M. Flynn, “Very High-Speed Computing Systems,” Proc. of the IEEE, vol. 54, no. 2,
1966.
[31] J. Gandhi, A. Basu, M. D. Hill, and M. M. Swift, “Efficient Memory Virtualization:
Reducing Dimensionality of Nested Page Walks,” in MICRO, 2014.
[32] J. Gandhi, M. D. Hill, and M. M. Swift, “Agile Paging: Exceeding the Best of
Nested and Shadow Paging,” in ISCA, 2016.
[33] F. Gaud, B. Lepers, J. Decouchant, J. Funston, A. Fedorova, and V. Quema, “Large
Pages May Be Harmful on NUMA Systems,” in USENIX ATC, 2014.
[34] M. Gorman, “Huge Pages Part 2 (Interfaces),” https://lwn.net/Articles/375096/,
2010.
[35] M. Gorman and P. Healy, “Supporting Superpage Allocation Without Additional
Hardware Support,” in ISMM, 2008.
[36] M. Gorman and P. Healy, “Performance Characteristics of Explicit Superpage
Support,” in WIOSCA, 2010.
[37] Hynix, “Hynix GDDR5 SGRAM Part H5GQ1H24AFR Revision 1.0,”
http://www.hynix.com/datasheet/pdf/graphics/H5GQ1H24AFR(Rev1.0).pdf
[38] Intel Corp., “Sandy Bridge Intel Processor Graphics Performance Developer’s
Guide.”
[39] Intel Corp., “Introduction to Intel® Architecture,” 2014.
[40] Intel Corp., “6th Generation Intel® Core™ Processor Family Datasheet, Vol. 1,”
2017.
[41] J. A. Jablin, T. B. Jablin, O. Mutlu, and M. Herlihy, “Warp-aware Trace Scheduling
for GPUs,” in PACT, 2014.
[42] M. K. Jeong, M. Erez, C. Sudanthi, and N. Paver, “A QoS-Aware Memory Con-
troller for Dynamically Balancing GPU and CPU Bandwidth Use in an MPSoC,”
in DAC, 2012.
[43] A. Jog, O. Kayıran, T. Kesten, A. Pattnaik, E. Bolotin, N. Chatterjee, S. W. Keckler,
M. T. Kandemir, and C. R. Das, “Anatomy of GPU Memory System for Multi-
Application Execution,” in MEMSYS, 2015.
[44] A. Jog, O. Kayıran, A. K. Mishra, M. T. Kandemir, O. Mutlu, R. Iyer, and C. R. Das,
“Orchestrated Scheduling and Prefetching for GPGPUs,” in ISCA, 2013.
[45] A. Jog, O. Kayıran, A. Pattnaik, M. T. Kandemir, O. Mutlu, R. Iyer, and C. R. Das,
“Exploiting Core Criticality for Enhanced GPU Performance,” in SIGMETRICS,
2016.
[46] G. B. Kandiraju and A. Sivasubramaniam, “Going the Distance for TLB Prefetch-
ing: An Application-Driven Study,” in ISCA, 2002.
[47] V. Karakostas, J. Gandhi, F. Ayar, A. Cristal, M. D. Hill, K. S. McKinley, M. Ne-
mirovsky, M. M. Swift, and O. Ünsal, “Redundant Memory Mappings for Fast
Access to Large Memories,” in ISCA, 2015.
[48] V. Karakostas, J. Gandhi, A. Cristal, M. D. Hill, K. S. McKinley, M. Nemirovsky,
M. M. Swift, and O. Ünsal, “Energy-Efficient Address Translation,” in HPCA,
2016.
[49] I. Karlin et al., “Exploring Traditional and Emerging Parallel Programming Mod-
els Using a Proxy Application,” in IPDPS, 2013.
[50] I. Karlin, J. Keasler, and R. Neely, “LULESH 2.0 Updates and Changes,” Lawrence
Livermore National Lab, Tech. Rep. LLNL-TR-641973, 2013.
[51] O. Kayıran, A. Jog, M. T. Kandemir, and C. R. Das, “Neither More Nor Less: Op-
timizing Thread-Level Parallelism for GPGPUs,” in PACT, 2013.
[52] O. Kayıran, N. C. Nachiappan, A. Jog, R. Ausavarungnirun, M. T. Kandemir, G. H.
Loh, O. Mutlu, and C. R. Das, “Managing GPU Concurrency in Heterogeneous
Architectures,” in MICRO, 2014.
[53] Y. Kim, W. Yang, and O. Mutlu, “Ramulator: A Fast and Extensible DRAM Simulator,”
IEEE CAL, 2015.
[54] D. Kroft, “Lockup-Free Instruction Fetch/Prefetch Cache Organization,” in ISCA,
1981.
[55] H.-K. Kuo, B. C. Lai, and J.-Y. Jou, “Reducing Contention in Shared Last-Level
Cache for Throughput Processors,” ACM TODAES, vol. 20, no. 1, 2014.
[56] Y. Kwon, H. Yu, S. Peter, C. J. Rossbach, and E. Witchel, “Coordinated and Efficient
Huge Page Management with Ingens,” in OSDI, 2016.
[57] G. Kyriazis, “Heterogeneous System Architecture: A Technical Review,” https:
//developer.amd.com/wordpress/media/2012/10/hsa10.pdf, Advanced Micro De-
vices, Inc., 2012.
[58] J. Lee, M. Samadi, and S. Mahlke, “VAST: The Illusion of a Large Memory Space
for GPUs,” in PACT, 2014.
[59] S.-Y. Lee and C.-J. Wu, “CAWS: Criticality-Aware Warp Scheduling for GPGPU
Workloads,” in PACT, 2014.
[60] D. Lustig, A. Bhattacharjee, and M. Martonosi, “TLB Improvements for Chip Mul-
tiprocessors: Inter-Core Cooperative Prefetchers and Shared Last-Level TLBs,”
ACM TACO, 2013.
[61] X. Mei and X. Chu, “Dissecting GPU Memory Hierarchy Through Microbench-
marking,” IEEE TPDS, 2017.
[62] J. Meng, D. Tarjan, and K. Skadron, “Dynamic Warp Subdivision for Integrated
Branch and Memory Divergence Tolerance,” in ISCA, 2010.
[63] T. Merrield and H. R. Taheri, “Performance Implications of Extended Page Ta-
bles on Virtualized x86 Processors,” in VEE, 2016.
[64] R. Mijat, “Take GPU Processing Power Beyond Graphics with Mali GPU Com-
puting,” 2012.
[65] V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu, and Y. N. Patt,
“Improving GPU Performance via Large Warps and Two-Level Warp Scheduling,”
in MICRO, 2011.
[66] J. Navarro, S. Iyer, P. Druschel, and A. Cox, “Practical, Transparent Operating
System Support for Superpages,” in OSDI, 2002.
[67] NVIDIA Corp., “NVIDIA Tegra K1,” http://www.nvidia.com/content/pdf/tegra_
white_papers/tegra-k1-whitepaper-v1.0.pdf.
[68] NVIDIA Corp., “NVIDIA Tegra X1,” https://international.download.nvidia.com/
pdf/tegra/Tegra-X1-whitepaper-v1.0.pdf.
[69] NVIDIA Corp., “CUDA C/C++ SDK Code Samples,” 2011.
[70] NVIDIA Corp., “NVIDIA’s Next Generation CUDA Compute Architecture:
Fermi,” http://www.nvidia.com/content/pdf/fermi_white_papers/nvidia_fermi_
compute_architecture_whitepaper.pdf, 2011.
[71] NVIDIA Corp., “NVIDIA’s Next Generation CUDA Compute Architecture: Ke-
pler GK110,” 2012.
[72] NVIDIA Corp., “NVIDIA GeForce GTX 750 Ti,” 2014.
[73] NVIDIA Corp., “NVIDIA Tesla P100,” 2016.
[74] NVIDIA Corp., “NVIDIA GeForce GTX 1080,” 2017.
[75] M.-M. Papadopoulou, X. Tong, A. Seznec, and A. Moshovos, “Prediction-Based
Superpage-Friendly TLB Designs,” in HPCA, 2015.
[76] PCI-SIG, “PCI Express Base Specification Revision 3.1a,” 2015.
[77] B. Pham, A. Bhattacharjee, Y. Eckert, and G. H. Loh, “Increasing TLB Reach by
Exploiting Clustering in Page Translations,” in HPCA, 2014.
[78] B. Pham, V. Vaidyanathan, A. Jaleel, and A. Bhattacharjee, “CoLT: Coalesced
Large-Reach TLBs,” in MICRO, 2012.
[79] B. Pham, J. Vesely, G. Loh, and A. Bhattacharjee, “Large Pages and Lightweight
Memory Management in Virtualized Systems: Can You Have It Both Ways?” in
MICRO, 2015.
[80] B. Pichai, L. Hsu, and A. Bhattacharjee, “Architectural Support for Address Translation
on GPUs: Designing Memory Management Units for CPU/GPUs with Unified
Address Spaces,” in ASPLOS, 2014.
[81] J. Power, M. D. Hill, and D. A. Wood, “Supporting x86-64 Address Translation
for 100s of GPU Lanes,” in HPCA, 2014.
[82] PowerVR, “PowerVR Hardware Architecture Overview for Developers,”
http://cdn.imgtec.com/sdk-documentation/PowerVR+Hardware.Architecture+
Overview+for+Developers.pdf, 2016.
[83] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens, “Memory Access
Scheduling,” in ISCA, 2000.
[84] T. G. Rogers, M. O’Connor, and T. M. Aamodt, “Cache-Conscious Wavefront
Scheduling,” in MICRO, 2012.
[85] T. G. Rogers, M. O’Connor, and T. M. Aamodt, “Divergence-Aware Warp Schedul-
ing,” in MICRO, 2013.
[86] M. Sadrosadati, A. Mirhosseini, B. Ehsani, H. Sarbazi-Azad, M. P. Drumond,
B. Falsa, R. Ausavarungnirun, and O. Mutlu, “LTRF: A Latency Tolerant Reg-
ister File Architecture for GPUs,” in ASPLOS, 2018.
[87] M. Sadrosadati, A. Mirhosseini, B. Ehsani, H. Sarbazi-Azad, M. P. Drumond,
B. Falsa, R. Ausavarungnirun, and O. Mutlu, “LTRF: Enabling High-Capacity
Register Files for GPUs via Hardware/Software Cooperative Register Prefetch-
ing,” in ASPLOS, 2018.
[88] SAFARI Research Group, “Mosaic – GitHub Repository,” https://github.com/
CMU-SAFARI/Mosaic/.
[89] SAFARI Research Group, “SAFARI Software Tools – GitHub Repository,” https:
//github.com/CMU-SAFARI/.
[90] A. Saulsbury, F. Dahlgren, and P. Stenström, “Recency-Based TLB Preloading,”
in ISCA, 2000.
[91] S. Srikantaiah and M. Kandemir, “Synergistic TLBs for High Performance Ad-
dress Translation in Chip Multiprocessors,” in MICRO, 2010.
[92] J. A. Stratton, C. Rodrigues, I. J. Sung, N. Obeid, L. W. Chang, N. Anssari, G. D.
Liu, and W. W. Hwu, “Parboil: A Revised Benchmark Suite for Scientific and
Commercial Throughput Computing,” Univ. of Illinois at Urbana-Champaign,
IMPACT Research Group, Tech. Rep. IMPACT-12-01, 2012.
[93] M. Talluri and M. D. Hill, “Surpassing the TLB Performance of Superpages with
Less Operating System Support,” in ASPLOS, 1994.
[94] H. Usui, L. Subramanian, K. Chang, and O. Mutlu, “DASH: Deadline-Aware High-
Performance Memory Scheduler for Heterogeneous Systems with Hardware Ac-
celerators,” ACM TACO, vol. 12, no. 4, Jan. 2016.
[95] J. Vesely, A. Basu, M. Oskin, G. H. Loh, and A. Bhattacharjee, “Observations
and Opportunities in Architecting Shared Virtual Memory for Heterogeneous
Systems,” in ISPASS, 2016.
[96] N. Vijaykumar, K. Hsieh, G. Pekhimenko, S. Khan, A. Shrestha, S. Ghose, A. Jog,
P. B. Gibbons, and O. Mutlu, “Zorua: A Holistic Approach to Resource Virtual-
ization in GPUs,” in MICRO, 2016.
[97] N. Vijaykumar, G. Pekhimenko, A. Jog, A. Bhowmick, R. Ausavarungnirun,
C. Das, M. Kandemir, T. C. Mowry, and O. Mutlu, “A Case for Core-Assisted Bot-
tleneck Acceleration in GPUs: Enabling Flexible Data Compression with Assist
Warps,” in ISCA, 2015.
[98] N. Vijaykumar, G. Pekhimenko, A. Jog, S. Ghose, A. Bhowmick, R. Ausavarung-
nirun, C. Das, M. Kandemir, T. C. Mowry, and O. Mutlu, “A Framework for Accel-
erating Bottlenecks in GPU Execution with Assist Warps,” in Advances in GPU
Research and Practice, 2016.
[99] L. Vu, H. Sivaraman, and R. Bidarkar, “GPU Virtualization for High Performance
General Purpose Computing on the ESX Hypervisor,” in HPC, 2014.
[100] S. Wasson, “AMD’s A8-3800 Fusion APU,” http://techreport.com/articles.x/
21730, 2011.
[101] G. Yuan, A. Bakhoda, and T. Aamodt, “Complexity Effective Memory Access
Scheduling for Many-Core Accelerator Architectures,” in MICRO, 2009.
[102] T. Zheng, D. Nellans, A. Zulfiqar, M. Stephenson, and S. W. Keckler, “Towards
High Performance Paged Memory for GPUs,” in HPCA, 2016.
[103] Z. Zheng, Z. Wang, and M. Lipasti, “Adaptive Cache and Concurrency Allocation
on GPGPUs,” IEEE CAL, 2014.
[104] W. K. Zuravleff and T. Robinson, “Controller for a Synchronous DRAM That
Maximizes Throughput by Allowing Memory Requests and Commands to Be
Issued Out of Order,” US Patent No. 5,630,096, 1997.
