UMap: Enabling Application-driven Optimizations for Page Management by Peng, Ivy B. et al.
UMap : Enabling Application-driven Optimizations
for Page Management
Ivy B. Peng
peng8@llnl.gov
Lawrence Livermore National
Laboratory
Livermore, USA
Marty McFadden
mcfadden8@llnl.gov
Lawrence Livermore National
Laboratory
Livermore, USA
Eric Green
green77@llnl.gov
Lawrence Livermore National
Laboratory
Livermore, USA
Keita Iwabuchi
iwabuchi1@llnl.gov
Lawrence Livermore National
Laboratory
Livermore, USA
Kai Wu
kwu42@ucmerced.edu
University of California, Merced
Merced, USA
Dong Li
dli35@ucmerced.edu
University of California, Merced
Merced, USA
Roger Pearce
pearce7@llnl.gov
Lawrence Livermore National
Laboratory
Livermore, USA
Maya Gokhale
gokhale2@llnl.gov
Lawrence Livermore National
Laboratory
Livermore, USA
ABSTRACT
Leadership supercomputers feature a diversity of storage, from node-local persistent memory and NVMe SSDs to network-interconnected flash memory and HDD. Memory mapping files on different tiers of storage provides a uniform interface in applications. However, system-wide services like mmap are optimized for generality and lack flexibility for enabling application-specific optimizations. In this work, we present UMap to enable user-space page management that can be easily adapted to access patterns in applications and storage characteristics. UMap uses the userfaultfd mechanism to handle page faults in multi-threaded applications efficiently. By providing a data object abstraction layer, UMap is extensible to support various backing stores. The design of UMap supports dynamic load balancing and I/O decoupling for scalable performance. UMap also uses application hints to improve the selection of caching, prefetching, and eviction policies. We evaluate UMap in five benchmarks and real applications on two systems. Our results show that leveraging application knowledge for page management could substantially improve performance. On average, UMap achieved 1.25 to 2.5 times improvement using the adapted configurations compared to the system service.
MCHPC’19, Denver, USA
2019. ACM ISBN 978-x-xxxx-xxxx-x/YY/MM.
KEYWORDS
memory mapping, memmap, page fault, user-space paging,
userfaultfd, page management
Reference Format:
Ivy B. Peng, Marty McFadden, Eric Green, Keita Iwabuchi, Kai Wu,
Dong Li, Roger Pearce, and Maya Gokhale. 2019. UMap : Enabling
Application-driven Optimizations for Page Management.
1 INTRODUCTION
Recent leadership supercomputers provide enormous storage resources to cope with expanding data sets in applications. The storage resources come in a hybrid format for balanced cost and performance [9, 11, 13]. Fast and small storage, which is implemented using advanced technologies like persistent memory and NVMe SSDs, is often co-located with computing units inside the compute node. Storage with massive capacity, on the other hand, uses cost-effective technologies like HDD and is connected to compute nodes through the network. In between, burst buffers use fast memory technologies and are accessible through the network. Memory mapping provides a uniform interface to access files on different types of storage as if they were dynamically allocated memory. For instance, out-of-core data analytic workloads often need to process large datasets that exceed the memory capacity of a compute node [17]. Using memory mapping to access these datasets shifts the burden of paging, prefetching, and caching data between storage and memory to the operating system.
Currently, operating systems provide the mmap system call to map files or devices into memory. This system service performs well in loading dynamic libraries and could also support out-of-core execution. However, as a system-level service, it has to be tuned for performance reliability and consistency over a broad range of workloads. Therefore, it may reduce opportunities for optimizing performance based on application characteristics. Moreover, backing stores on different storage exhibit distinctive performance characteristics. Consequently, configurations tuned for one type of storage will need to be adjusted when mapping on another type of storage. In this work, we provide UMap to enable application-specific optimizations for page management when memory mapping various backing stores. UMap is highly configurable to adapt user-space paging to suit application needs. It facilitates application control over caching, prefetching, and eviction policies with minimal porting effort from the programmer. As a user-level solution, UMap confines changes within an application without impacting other applications sharing the platform, which is unachievable in system-level approaches.
We prioritize four design choices for UMap based on surveying realistic use cases. First, we choose to implement UMap as a user-level library so that it can maintain compatibility with the fast-moving Linux kernel without the need to track and modify for frequent kernel updates. Second, we employ the recent userfaultfd [7] mechanism, rather than the signal handling and callback function approach, to reduce overhead and performance variance in multi-threaded applications. Third, we target an adaptive solution that sustains performance even at high concurrency for data-intensive applications, which often employ a large number of threads for hiding data access latency. Our design pays particular attention to load imbalance among service threads to improve the utilization of shared resources even when data accesses to pages are skewed. UMap dynamically balances workloads among all service threads to eliminate bottlenecks in serving hot pages. Finally, for flexible and portable tuning on different computing systems, UMap provides both API and environmental controls to enable configurable page sizes, eviction strategy, application-specific prefetching, and detailed diagnosis information to the programmer.
We evaluate the effectiveness of UMap in five use cases, including two data-intensive benchmarks, i.e., a synthetic sort benchmark and a breadth-first search (BFS) kernel, and three real applications, i.e., Lrzip [8], N-Store database [2], and an asteroid detection application that processes massive data sets from telescopes. We conduct out-of-core experiments on two systems with node-local SSD and network-interconnected HDD storage. Our results show that UMap can enable flexible user-space page management in data-intensive applications. On the AMD testbed with local NVMe SSD, applications achieved 1.25 to 2.5 times improvement compared to the standard system service. On the Intel testbed with network-interconnected HDD, UMap brings the performance of the asteroid detection application close to that achieved with local SSD for 500 GB data sets. In summary, our main contributions are as follows:
• We propose an open-source library¹, called UMap, that leverages the lightweight userfaultfd mechanism to enable application-driven page management.
• We describe the design of UMap for achieving scalable performance in multi-threaded data-intensive applications.
• We demonstrate five use cases of UMap and show that enabling configurable page size is essential for performance tuning in data-intensive applications.
• We show that UMap improves the performance of tested applications by 1.25 to 2.5 times compared to the standard mmap system service.
2 BACKGROUND AND MOTIVATION
In this section, we introduce memory mapping, prospective benefits from user-space page management, and the enabling mechanism userfaultfd.
2.1 Memory Mapping
Memory mapping links images and files in persistent storage to the virtual address space of a process. The operating system employs demand paging to bring only accessed virtual pages into physical memory because virtual memory can be much larger than physical memory. An access to memory-mapped regions triggers a page fault if no page table entry (PTE) is present for the accessed page. When such a page fault is raised, the operating system resolves it by copying in the physical data page from storage to the in-memory page cache.
Common strategies for optimizing memory mapping in the operating system include the page cache, read-ahead, and madvise hints. The page cache is used to keep frequently used pages in memory, while less important pages may need to be evicted from memory to make room for newly requested pages. The Least Recently Used (LRU) policy is commonly used for selecting pages to be evicted. The operating system may proactively flush dirty pages, i.e., modified pages in the page cache, into storage when the ratio of dirty pages exceeds a threshold value [19]. Read-ahead preloads pages into physical memory to avoid the overhead associated with page fault handling, TLB misses, and user-to-kernel mode transitions. Finally, the madvise interface takes hints to allow the operating system to make informed decisions for managing pages.
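As a minimal sketch of the kernel-managed path described above, the following C++ function maps a file with mmap, passes a sequential-access hint through madvise, and sums the bytes so that pages are brought in on demand. The function name sum_mapped_file is hypothetical and only serves the illustration; the calls themselves (open, fstat, mmap, madvise, munmap) are the standard POSIX/Linux interfaces.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: returns the sum of all bytes in the file,
// or -1 if the file cannot be opened, is empty, or cannot be mapped.
std::int64_t sum_mapped_file(const char* path) {
    assert(path != nullptr);
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st{};
    if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return -1; }
    const std::size_t len = static_cast<std::size_t>(st.st_size);

    // Map the file; no data is read until a page is first touched.
    void* addr = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (addr == MAP_FAILED) return -1;

    // A hint only: the kernel may enlarge its read-ahead window.
    madvise(addr, len, MADV_SEQUENTIAL);

    const unsigned char* bytes = static_cast<const unsigned char*>(addr);
    std::int64_t sum = 0;
    for (std::size_t i = 0; i < len; ++i) sum += bytes[i];

    munmap(addr, len);
    return sum;
}
```

Note that the hint is advisory: the application cannot choose the page size or name specific pages to evict through this interface.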
¹ UMAP v2.0.0: https://github.com/LLNL/umap
2.2 User-space Page Management
User-space page management uses application threads to resolve page faults and manage virtual memory in the background as defined by the application. userfaultfd is a lightweight mechanism for enabling user-space paging compared to the traditional SIGSEGV signal and callback function [7]. Applications register address ranges to be managed in user space and specify the type of events to be tracked, e.g., page faults and events in un-cooperative mode. Page faults in the address ranges are delivered asynchronously so that the faulting process is blocked instead of idling, allowing other processes to be scheduled to proceed.
The fault-handling thread in the application can atomically resolve page faults with the UFFDIO_COPY ioctl, which ensures the faulting process is (optionally) woken up only after the requested page has been fully copied into physical memory [7]. The fault-handling threads may utilize application-specific knowledge to optimize this procedure, providing flexibility that is unachievable in kernel mode. For instance, the application could select arbitrary page sizes or read-ahead window sizes, or provide specific pages for prefetching or evicting. All these optimizations remain inside one application and will not impact other applications sharing the same system. User-space paging is also not limited to backing stores on file systems. In contrast to kernel mode, the fault-handling thread has the liberty to fetch data from a variety of backing stores, such as a memory server, databases, and even another process.
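The mechanism described above can be sketched with the raw Linux interface. This is a minimal illustration and not UMap's implementation: it registers a single anonymous page, lets a handler thread wait for the fault event, and resolves it with UFFDIO_COPY. The function name resolve_one_fault and the fill byte are hypothetical. The sketch returns -1 when userfaultfd is not usable in the current environment (for example, when it is disabled for unprivileged users).

```cpp
#include <cassert>
#include <cstdint>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <poll.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <thread>
#include <unistd.h>
#include <vector>

// Hypothetical demo: resolve one missing-page fault in user space.
// Returns the first byte of the resolved page, or -1 if userfaultfd
// is unavailable here.
int resolve_one_fault(char fill) {
    const long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    int uffd = static_cast<int>(syscall(SYS_userfaultfd, O_CLOEXEC));
    if (uffd < 0) return -1;

    struct uffdio_api api{};
    api.api = UFFD_API;
    if (ioctl(uffd, UFFDIO_API, &api) != 0) { close(uffd); return -1; }

    void* region = mmap(nullptr, page, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED) { close(uffd); return -1; }

    // Register the range so that missing-page faults are delivered to uffd.
    struct uffdio_register reg{};
    reg.range.start = reinterpret_cast<std::uintptr_t>(region);
    reg.range.len = static_cast<std::uint64_t>(page);
    reg.mode = UFFDIO_REGISTER_MODE_MISSING;
    if (ioctl(uffd, UFFDIO_REGISTER, &reg) != 0) {
        munmap(region, page);
        close(uffd);
        return -1;
    }

    // Fault-handling thread: wait for the event, then install the page.
    std::thread handler([uffd, page, fill] {
        struct pollfd pfd{};
        pfd.fd = uffd;
        pfd.events = POLLIN;
        poll(&pfd, 1, -1);
        struct uffd_msg msg{};
        if (read(uffd, &msg, sizeof msg) != static_cast<ssize_t>(sizeof msg)) return;
        if (msg.event != UFFD_EVENT_PAGEFAULT) return;
        std::vector<char> src(page, fill);  // stands in for data read from a backing store
        struct uffdio_copy copy{};
        copy.dst = msg.arg.pagefault.address & ~(static_cast<std::uint64_t>(page) - 1);
        copy.src = reinterpret_cast<std::uintptr_t>(src.data());
        copy.len = static_cast<std::uint64_t>(page);
        ioctl(uffd, UFFDIO_COPY, &copy);  // copies the page and wakes the faulting thread
    });

    // The first touch of the registered page blocks until UFFDIO_COPY completes.
    const char first = *static_cast<volatile char*>(region);
    handler.join();
    munmap(region, page);
    close(uffd);
    return static_cast<unsigned char>(first);
}
```

In a real handler, the source buffer would be filled from the backing store chosen by the application, which is where application-specific policies come in.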
3 DESIGN
In this section, we describe the design of UMap. We first provide an overview of the architecture and then focus on four optimizations for achieving high performance in user space.
3.1 Overview
UMap provides an interface for applications to register multiple virtual address ranges, called UMap regions, that bypass the kernel service and are instead managed in user space. Figure 1 presents the UMap architecture. Dark blue regions in the virtual address space are UMap regions. Each region has a backing store, where the data is physically located. UMap provides an abstraction layer in the store object (yellow circles) for accessing different types of storage. When an application accesses a UMap region and the accessed page is not present in physical memory, a page fault is triggered. These page faults queue up in a FIFO buffer, and multiple UMap fillers cooperatively resolve them. If the requested pages have not been fetched yet, UMap fillers invoke the access functions defined in the store object to read data from the underlying storage. If the buffer is fully occupied, some pages need to be evicted following a user-defined strategy. In the background, a group of UMap evictors keeps monitoring the ratio of dirty pages in the buffer. Once the ratio of dirty pages reaches a (configurable) high watermark, UMap evictors write data to the storage in a coordinated manner.
[Diagram labels: Application; Virtual Address Space; Page faults; Internal Buffer; Filler 0 to Filler 3; Evictor0, Evictor1; Prefetching Policies; Eviction Policies; Store Objects; Physical pages; Backend Storage: SSD, PM, Network-attached HDD, Network-attached SSD.]
Figure 1: The UMap architecture.
3.2 I/O Decoupling
Our design decouples the I/O operation from the fault-handling threads to achieve high concurrency in long-latency tasks. I/O operations that move data between storage and memory have a much longer latency than memory accesses. For instance, latency to state-of-the-art persistent memory (PM) is about 100-500 ns [12], latency to NVMe-based SSD is about 20 µs [3], while accesses to HDD require several milliseconds. In contrast, memory accesses typically take 20-100 ns. To improve the I/O performance, UMap employs a configurable number of threads for moving data between storage and memory to exploit the bandwidth supported by the hardware.
The two dedicated groups of I/O threads are referred to as fillers and evictors, as illustrated in the orange and blue boxes in Figure 1. Fillers split the workload of copying pages to memory, while evictors concurrently write data to storage. A separate group of manager threads, typically with low concurrency, keeps polling for notification of tracked events from the operating system. By decoupling the tasks into three groups of workers, UMap has the flexibility to adapt the concurrency in each group to reflect their different workloads. In contrast, a coupled design results in a long blocking operation that offers limited flexibility for optimization.
3.3 Dynamic Load Balancing
UMap employs a dynamic load balancing strategy to improve resource utilization. We find that memory-mapped regions could have hot and cold segments. Hot segments require a higher level of concurrency for frequent data movement and more physical memory for buffering data than cold segments. For instance, social networks are considered a type of scale-free network whose degree distribution follows a power law. A memory segment that stores high-degree vertices would naturally receive more accesses than segments that store low-degree vertices. We design UMap to avoid load imbalance even in such skewed data access patterns by dynamically distributing workloads from all memory regions among UMap fillers.
UMap employs a dynamic scheduling strategy similar to the "work stealing" approach in task-based programming models [15]. UMap uses a single UMap buffer object to manage the metadata of in-memory pages for all regions. When UMap receives the notification of a fault event from the operating system, it appends the workload for resolving this fault to a dynamically growing queue. A group of workers splits the pending workload to load pages from the backing store collectively. Consequently, when hot memory segments generate more workload than others, they are assigned more working threads. Orthogonal to the data fetching task is the data flushing task that writes dirty pages back to the persistent stores. When the number of dirty pages reaches a high watermark, the workload is appended to a separate queue and then split by a different group of workers. Figure 1 illustrates the shared (internal) buffer and the work distribution among workers. The dynamic load balancing design prepares UMap to cope with applications with diverse access patterns.
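To make the benefit of a shared queue concrete, the following sketch compares two scheduling schemes in a deterministic simulation: a static scheme in which each region is bound to one filler, and a shared-queue scheme in which any idle filler takes the next pending fault. This is an illustrative model and not UMap's scheduler; the function names and the assumption that every fault costs one time unit are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// faults_per_region[i] is the number of pending page faults in region i.
// Each fault is assumed to cost one time unit to resolve.

// Static scheme: region r is always served by filler r % fillers.
long static_makespan(const std::vector<long>& faults_per_region, std::size_t fillers) {
    assert(fillers > 0);
    std::vector<long> busy(fillers, 0);
    for (std::size_t r = 0; r < faults_per_region.size(); ++r)
        busy[r % fillers] += faults_per_region[r];
    return *std::max_element(busy.begin(), busy.end());
}

// Shared-queue scheme: all faults enter one queue, and whichever filler
// becomes idle first takes the next fault.
long shared_queue_makespan(const std::vector<long>& faults_per_region, std::size_t fillers) {
    assert(fillers > 0);
    // Min-heap holding the time at which each filler becomes idle.
    std::priority_queue<long, std::vector<long>, std::greater<long>> idle_at;
    for (std::size_t f = 0; f < fillers; ++f) idle_at.push(0);
    long finish = 0;
    for (long faults : faults_per_region) {
        for (long k = 0; k < faults; ++k) {
            long t = idle_at.top();
            idle_at.pop();
            t += 1;  // resolve one fault
            finish = std::max(finish, t);
            idle_at.push(t);
        }
    }
    return finish;
}
```

With one hot region of 90 faults, three cold regions of 10 faults each, and four fillers, the static scheme finishes after 90 time units while the shared queue finishes after 30, because the hot region's faults are spread over all fillers.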
3.4 Extensible Back Store
UMap provides a data object abstraction layer to support different types of backing stores. Currently, applications running on leadership supercomputers have multiple choices of storage, including local SSD, network-interconnected SSD, and HDD. In the future, architectures that provide disaggregated memory and storage resources are likely to emerge. Based on this observation, our design ensures that UMap is extensible for current and future architectures.
UMap lets applications associate their own backing store with each memory region. The application has specific control over which storage layer to access to resolve a page fault. In this way, an application is presented with a uniform interface, the virtual memory address space, while UMap in the backend handles data movement to and from various types of storage.
3.5 User-controlled Page Flushing
We design UMap to enable user-space control over page flushing to a persistent store. There are two motivations. First, the system service may write dirty pages to storage whenever the operating system deems appropriate. Unpredictable behavior may occur if a memory range requires strong consistency, such as atomicity among multiple pages. Second, frequent page flushing is known to cause increased performance variation and degradation. For instance, RHEL triggers page flushing when more than 10% of pages are dirty [19]. With user control, the application could avoid aggressive page flushing by setting a high threshold or even postpone page flushing to a later stage. UMap monitors the ratio of dirty pages and compares it with a user-defined high watermark that triggers page flushing and a low watermark that suspends page flushing.
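The high and low watermarks form a hysteresis loop: flushing starts when the dirty ratio reaches the high watermark and continues until the ratio falls to the low watermark. A minimal sketch of that control logic follows; the class FlushController is hypothetical and does not reproduce UMap's internal code.

```cpp
#include <cassert>

// Hypothetical controller: decides whether flushing should be active,
// given the current ratio of dirty pages in the buffer (0.0 to 1.0).
class FlushController {
public:
    FlushController(double low, double high) : low_(low), high_(high) {
        assert(0.0 <= low && low < high && high <= 1.0);
    }

    // Returns true while flushing should be active.
    bool update(double dirty_ratio) {
        if (!flushing_ && dirty_ratio >= high_) {
            flushing_ = true;   // high watermark reached: start flushing
        } else if (flushing_ && dirty_ratio <= low_) {
            flushing_ = false;  // low watermark reached: suspend flushing
        }
        return flushing_;
    }

private:
    double low_;
    double high_;
    bool flushing_ = false;
};
```

With watermarks of 70% and 90%, a dirty ratio of 85% leaves flushing off if it was off, but keeps it running if it had already started, which avoids rapid toggling around a single threshold.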
3.6 Application-Specific Optimization
UMap maintains a set of parameters for programmers with application knowledge to configure page management. One of the most performance-critical parameters is the internal page size of a memory region, denoted as UMap page. UMap supports an arbitrary page size for each memory region, while the system service only supports fixed page sizes. The UMap page defines the finest granularity of data movement between memory and the backing store. For the same memory region, choosing a large UMap page could reduce the overhead of metadata, but may also move more data into memory than is actually accessed. By tuning the page size, an application could identify an optimal configuration that balances the overhead and data usage. Also, an application can control the page buffer size, which can alleviate the out-of-memory (OOM) situations that may arise with unconstrained mmap.
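The trade-off can be quantified with simple arithmetic. The sketch below uses two hypothetical helper functions: one counts the pages (and therefore metadata entries and potential faults) in a region, and the other gives the fraction of a fetched page that is useful when the application only consumes part of it.

```cpp
#include <cassert>
#include <cstdint>

// Number of pages needed to cover a region (rounded up).
std::uint64_t pages_in_region(std::uint64_t region_bytes, std::uint64_t page_bytes) {
    assert(page_bytes > 0);
    return (region_bytes + page_bytes - 1) / page_bytes;
}

// Fraction of each fetched page that the application actually uses.
double useful_fraction(std::uint64_t used_bytes_per_page, std::uint64_t page_bytes) {
    assert(page_bytes > 0 && used_bytes_per_page <= page_bytes);
    return static_cast<double>(used_bytes_per_page) / static_cast<double>(page_bytes);
}
```

For a 500 GiB region, 4 KiB pages give 131,072,000 pages while 8 MiB pages give 64,000. Conversely, if a random access uses only 4 KiB of an 8 MiB page, just 1/2048 of the transferred data is useful, which is why the best page size depends on the access pattern.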
UMap also supports a flexible prefetching policy that can fetch pages even in irregular patterns. Operating systems usually classify page accesses as either sequential or random, and increase or decrease the readahead window size accordingly. Real-world applications, however, exhibit complex access patterns, and the general prefetching mechanism becomes insufficient. In contrast, UMap could prefetch a set of arbitrary pages into memory, as informed by the application. Moreover, an application can control the start of prefetching to avoid premature data migration that interferes with pages in use. This flexibility, together with knowledge from the application algorithm or offline profiling, eases application performance tuning.
4 IMPLEMENTATION
UMap is implemented in C++ and uses the userfaultfd system call [1]. UMap enables application controls on page management through both API and environment variables. The fault-handling thread resolves the page fault by calling the application-supplied function (if provided), or by performing direct I/O to the backing store by invoking the defined access functions. UMap uses the UFFDIO_COPY ioctl [7] to ensure an atomic copy to the allocated memory page before waking up the blocked process.
4.1 API
UMap provides interfaces similar to mmap to ease porting existing applications. An application can register and unregister multiple memory regions to be managed by UMap through the umap and uunmap interfaces. One additional flexibility provided by UMap is the multi-file backed region. Given a set of files, each with individual offsets and size, UMap maps them into a contiguous memory region. While applications can rely on the UMap runtime for managing pages, UMap also provides a plugin architecture that allows applications to register callback functions. A set of configuration interfaces with the naming convention umapcfg_set_xx allows the application to control paging explicitly: (1) the maximum size of physical memory used for buffering pages; (2) the level of concurrency for processing I/O operations in each group of workers; (3) the threshold value for starting or suspending writing dirty pages to backing stores. Listing 1 illustrates a simple application that uses paging and prefetching services in UMap.
Listing 1: UMap API
// fname, totalbytes, and psize (the UMap page size) are assumed to be
// set by the caller before this fragment runs.
int fd = open(fname, O_RDWR);
void* base_addr = umap(NULL, totalbytes, PROT_READ | PROT_WRITE, UMAP_PRIVATE, fd, 0);
char* base = static_cast<char*>(base_addr);

// Select two non-contiguous pages to prefetch
std::vector<umap_prefetch_item> pfi;
umap_prefetch_item p0 = { .page_base_addr = &base[5 * psize] };
pfi.push_back(p0);
umap_prefetch_item p1 = { .page_base_addr = &base[15 * psize] };
pfi.push_back(p1);
umap_prefetch(pfi.size(), &pfi[0]);

computation();

// Release resources
uunmap(base_addr, totalbytes);
4.2 Environmental Controls
UMap uses a set of environment variables to control the number of fillers and evictors, the buffer size, the buffer draining policy, and the read-ahead window size. We highlight the key environment variables that UMap tracks to dictate its runtime behavior:
• UMAP_PAGESIZE sets the internal page size for memory regions.
• UMAP_PAGE_FILLERS sets the number of workers that perform read operations from the backing store. Default: the number of hardware threads.
• UMAP_PAGE_EVICTORS sets the number of evictors that perform evictions of pages. Eviction includes writing to the backing store if the page is dirty and informing the operating system that the page is no longer needed. Default: the number of hardware threads.
• UMAP_EVICT_HIGH_WATER_THRESHOLD sets the threshold in the UMap buffer that triggers the evicting procedure. Default: 90%.
• UMAP_EVICT_LOW_WATER_THRESHOLD sets the threshold in the UMap buffer that suspends the evicting procedure. Default: 70%.
• UMAP_BUFSIZE sets the size of physical memory to be used for buffering UMap pages. Default: 80% of available memory.
• UMAP_READ_AHEAD sets the number of pages to read ahead when resolving a demand-paging fault. Default: 0.
• UMAP_MAX_FAULT_EVENTS sets the maximum number of page fault events that will be read from the kernel in a single call. Default: the number of hardware threads.
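How such variables might be consumed can be sketched as follows. This is an illustration only and not UMap's parsing code: the helper env_or_default and the struct PagingConfig are hypothetical, and the 4096-byte fallback for UMAP_PAGESIZE is an assumption, since the list above does not state a default for it.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper (not part of UMap): read a non-negative integer
// setting from the environment, falling back when unset or malformed.
long env_or_default(const char* name, long fallback) {
    assert(name != nullptr);
    const char* raw = std::getenv(name);
    if (raw == nullptr || *raw == '\0') return fallback;
    char* end = nullptr;
    const long value = std::strtol(raw, &end, 10);
    return (*end == '\0' && value >= 0) ? value : fallback;
}

// Hypothetical container for a few of the settings listed above.
struct PagingConfig {
    long page_size;
    long fillers;
    long evictors;
    long read_ahead;
};

PagingConfig load_config(long hw_threads) {
    return PagingConfig{
        env_or_default("UMAP_PAGESIZE", 4096),  // assumed fallback, not from the text
        env_or_default("UMAP_PAGE_FILLERS", hw_threads),
        env_or_default("UMAP_PAGE_EVICTORS", hw_threads),
        env_or_default("UMAP_READ_AHEAD", 0),
    };
}
```

Because the settings come from the environment, the same binary can be retuned for a different storage tier without recompilation.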
4.3 Limitations
The current implementation uses the write protection support from the kernel to track dirty pages in physical memory. For pages in write-protected memory ranges, a write triggers a fault that sends a UFFD message to the handling threads. Currently, the write protection support in userfaultfd is only available in an experimental Linux kernel².
5 EXPERIMENTAL SETUP
In this section, we describe the experimental setup for the evaluation. We summarize the configuration parameters of the two testbeds in Tables 1 and 2. The AMD testbed includes three identical machines (Altus, Bertha, Pmemio) that feature two AMD EPYC 7401 (24 cores / 48 hardware threads) processors. The testbed has a total of 256 GB DDR4 DRAM and 16 memory channels that operate at 2400 MT/s. Each machine has a total of 4.65 TiB disk capacity, including a 1.8 GB Micron 5200 Series SATA SSD. The platform runs Fedora 29 with Linux kernel 5.1.0-rc4-uffd-wp-207866-gcc66ef4-dirty (experimental version). We compiled all applications using the GCC 8.3.1 compiler with support for OpenMP. We use the local SSD on the AMD testbed to evaluate the impact of UMap page sizes in all applications. The second testbed, the Intel testbed, is on a cluster called flash. Its storage includes a remote HDD accessed through the Lustre parallel distributed file system. It also features a 1.5 TB local SSD. We test the asteroid detection application on this testbed to compare the performance of the backing store on Lustre with the local SSD. The platform runs the Red Hat Enterprise Linux 7.6 kernel. We compiled all applications using the GCC 8.1.0 compiler.
² Linux patch: https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
Table 1: The AMD Testbed Specifications
Platform    Penguin® Altus® XE2112 (Base Board: MZ91-FS0-ZB)
Processor   AMD EPYC 7401
CPU         24 cores (48 hardware threads) × 2 sockets
Speed       1.2 GHz
Caches      64KB 8-way L1d and 32KB 4-way L1i, 512KB 8-way private L2, 8MB 8-way shared L3 per three cores
Memory      16 GB DDR4 RDIMM × 8 channels (2400 MT/s) × 2 sockets
Storage     ≈ 3 TB NVMe (type: HGST SN200)
Table 2: The Intel Testbed Specifications
Platform    S2600WTTR (Base Board: S2600WTTR)
Processor   Intel Xeon E5-2670 v3 (Haswell)
CPU         12 cores (24 hardware threads) × 2 sockets
Speed       2.3 GHz (Turbo 3.1 GHz)
Caches      32KB 8-way L1d and 32KB 8-way L1i, 256KB 8-way private L2, 30MB 20-way shared L3
Memory      2 16 GB DDR4 RDIMM × 4 channels (1866 MT/s) × 2 sockets
Storage     ≈ 1.5 TB NVMe SSD (type: HGST SN200)
6 EVALUATION
In this section, we evaluate the performance of UMap in data-intensive benchmarks and applications. In particular, we study the performance benefit of enabling flexible page sizes at the application level.
6.1 Out-of-core Sort
Our first evaluation uses an in-house sorting benchmark, called umapsort. Umapsort is a multi-threaded program that performs quicksort on values stored in a file. Thus, umapsort is a read-write workload. For the evaluation, we use a single 500GiB data set consisting of a sequence of ascending 64-bit words. We configured the benchmark to memory map data sets using either the mmap system call or the UMap API. Then, the program sorts the values in the memory region into descending order. The application was configured to run with 96 OpenMP threads on the AMD testbed with 256GiB of physical memory. The data set is stored on the local NVMe-SSD device configured with its default boot-time values. We report the experimental results in Figure 2.
We used different numbers of fillers and evictors to identify the optimal concurrency for this benchmark. In most tested cases, using 48 fillers and 24 evictors brings the best performance. We then fixed the number of fillers and evictors to test the impact of different page sizes. For the mmap tests, we use its default setting and the standard 4KiB page size. For UMap tests, we change the page size to identify the optimal configuration. At the smallest page size, UMap shows much higher overhead than mmap. We find that increasing page sizes in UMap steadily improves the performance. At 64KiB page, UMap starts outperforming mmap. By adjusting the UMap page size to 8MiB, the UMap version achieves 2.5 times speedup compared to the mmap version. One reason for the improved performance at larger page sizes is the reduction in page faults, which reduces the time spent servicing page faults and also aggregates smaller data transfers into bulk transfers that exploit bandwidth. As the change is localized to the application process, there is no need to modify any OS page size or file system prefetch settings.
[Plot: execution time in seconds (left axis) and speedup (right axis) versus page size, 4K to 32M; series: mmap, UMap, Reference, Speedup.]
Figure 2: The performance of UMap for sorting 500 GiB data on NVMe-SSD on the AMD testbed, as normalized to that of mmap. UMap starts outperforming mmap when the page size is larger than 64KB. At the page size of 8MB, UMap achieves 2.5 times improvement compared to mmap.
6.2 Graph Application
We implemented a conventional level-synchronous BFS algorithm. Our BFS program takes a graph in compressed sparse row (CSR) data format and stores only the CSR graph in the storage device. We used a separate program to generate the CSR graph to make a read-only benchmark and dropped the page cache before running the benchmark to achieve consistent results. For the dataset, we used an R-MAT graph generator with the edge probabilities used in Graph500. Figure 3 shows UMap's BFS performance normalized to mmap's best-performing case, where readahead is off. We varied the UMap page size from 4 KB to 4 MB and used the default values for its other environment variables. UMap showed its best performance, outperforming mmap by 1.8X, with a 512 KB page size, whereas mmap slowed down as the page size increased. These results confirm the benefit of UMap's variable page size feature, which provides not only user-level control but also better performance.
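For readers unfamiliar with the kernel, a level-synchronous BFS over a CSR graph can be sketched as below. This is a generic single-threaded illustration and not the benchmark code; the struct CsrGraph and the function bfs_levels are hypothetical names. In the out-of-core setting described above, the two CSR arrays would reside in a memory-mapped region rather than in std::vector.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// CSR graph: the neighbors of vertex v are
// col_idx[row_ptr[v]] .. col_idx[row_ptr[v + 1] - 1].
struct CsrGraph {
    std::vector<std::uint64_t> row_ptr;  // size = number of vertices + 1
    std::vector<std::uint64_t> col_idx;  // size = number of edges
};

// Returns the BFS level of every vertex from `source`, or -1 if unreachable.
std::vector<std::int64_t> bfs_levels(const CsrGraph& g, std::uint64_t source) {
    assert(!g.row_ptr.empty());
    const std::size_t n = g.row_ptr.size() - 1;
    assert(source < n);

    std::vector<std::int64_t> level(n, -1);
    std::vector<std::uint64_t> frontier{source};
    std::vector<std::uint64_t> next;
    level[source] = 0;

    // Process one level at a time: expand the whole frontier, then swap.
    for (std::int64_t depth = 1; !frontier.empty(); ++depth) {
        next.clear();
        for (std::uint64_t v : frontier) {
            for (std::uint64_t e = g.row_ptr[v]; e < g.row_ptr[v + 1]; ++e) {
                const std::uint64_t w = g.col_idx[e];
                if (level[w] < 0) {
                    level[w] = depth;
                    next.push_back(w);
                }
            }
        }
        frontier.swap(next);
    }
    return level;
}
```

The reads of col_idx are contiguous per vertex but jump between vertices, so the access pattern mixes short sequential runs with random jumps, which is one plausible reason why an intermediate page size performs best.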
[Plot: execution time in seconds (left axis) and speedup (right axis) versus page size, 4K to 64M; series: mmap, UMap, Reference, Speedup.]
Figure 3: The relative performance of UMap as compared to that of mmap in BFS on an R-MAT scale 31 CSR graph (529 GB) data on NVMe on the AMD testbed.
6.3 File Compression
Long Range ZIP (lrzip) is a program that implements a full-file compression algorithm [8]. Compression algorithms detect redundancies in input files to reduce size. Lrzip uses a modified RZIP algorithm to achieve an effectively unlimited compression window size. The original mmap version of lrzip uses a large buffer, e.g., one-third of system memory, to mmap a window that 'slides' through the input file. When matches are found, lrzip may use a secondary 64k mmap region to page in any matching regions outside the main window. The UMap version removes these sliding buffers and replaces them with a single UMap region spanning the entire input file. The UMap runtime automatically manages the amount of file data paged in memory during execution.
Our experiments run lrzip in pre-processing mode to compare the performance of mmap with UMap in the RZIP algorithm. We constrain the memory available to the program to ensure out-of-core execution, i.e., 16 GB of memory and 64 GB of input data. The UMap version sets the environment variable that limits the UMap buffer for caching pages in memory. The mmap version requires a command-line option to override the system memory on the testbed. In Figure 4, lrzip shows low sensitivity to the change in page size. This insensitivity is likely due to the mostly sequential access pattern in lrzip, which only has occasional data reuse of earlier portions of the input file, i.e., when duplicated hash values are found. Once the page size exceeds 1MB, the UMap version stabilizes performance at about 1.25 times that of the mmap system call.
[Plot: execution time in seconds (left axis) and speedup (right axis) versus page size, 4K to 8M; series: mmap, UMap, Reference, Speedup.]
Figure 4: The relative performance of UMap as compared to that of the mmap version for LRZIP 64 GB random data on NVMe on the AMD testbed.
6.4 Asteroid Detection Application
In this case study, we use UMap for an on-going study that searches for transient objects, such as asteroids, in intermittent time-series telescope data. We use UMap to create a 3D cube of virtual address space, where each page is directly mapped to pixel data in a series of image files. UMap has the extensibility to integrate an application-specific FITS handler for resolving page faults to a particular file, which would require extensive porting effort to achieve with mmap. The application creates millions or even billions of vectors and then virtually traces them through the image cube to calculate the median pixel value along each vector. The starting points of the vectors have a uniform random distribution in the data, and their slopes follow a given linear function. The backing store contains thousands of FITS format image files. Page faults are resolved to the FITS files containing the requested data, where the pixel data is subsequently read and decoded before being copied into the faulting page. Note that a page fault may require accessing data in multiple files.
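The last point, that one page may span several files, can be illustrated with a small address-translation sketch. It assumes a deliberately simplified layout in which the payloads of the files are placed back to back in the region; the actual layout of the image cube and the FITS decoding step are not described here, and the names FilePiece and pieces_for_page are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One contiguous read needed to fill part of a page.
struct FilePiece {
    std::size_t file_index;     // which file to read
    std::uint64_t file_offset;  // offset within that file's payload
    std::uint64_t length;       // number of bytes to read
};

// Assumption: file i contributes file_sizes[i] payload bytes, and the
// payloads are laid out back to back in the mapped region.
std::vector<FilePiece> pieces_for_page(const std::vector<std::uint64_t>& file_sizes,
                                       std::uint64_t page_offset,
                                       std::uint64_t page_bytes) {
    assert(page_bytes > 0);
    std::vector<FilePiece> pieces;
    const std::uint64_t page_end = page_offset + page_bytes;
    std::uint64_t file_start = 0;
    for (std::size_t i = 0; i < file_sizes.size() && file_start < page_end; ++i) {
        const std::uint64_t file_end = file_start + file_sizes[i];
        const std::uint64_t lo = std::max(file_start, page_offset);
        const std::uint64_t hi = std::min(file_end, page_end);
        if (lo < hi) pieces.push_back({i, lo - file_start, hi - lo});
        file_start = file_end;
    }
    return pieces;
}
```

For files of 100, 50, and 200 bytes, a 100-byte page starting at offset 80 is assembled from three reads: 20 bytes from the first file, all 50 bytes of the second, and the first 30 bytes of the third.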
The evaluation uses a synthetic data set derived from
537 random images taken from an astronomical survey performed
on 12/23/2018 by the Dark Energy Camera in Chile.
These files were resized via bicubic resampling to four times
their original dimension in each axis in order to emulate
the characteristics of real-world datasets. Each file is approximately
977MB with dimensions of 16,000 by 16,000 pixels
after this operation. The entire dataset is approximately
512GB. For the Lustre tests, transparent Lustre compression
and de-duplication reduce this size to 223GB.
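The reported sizes can be cross-checked with a short calculation. The sketch below assumes 4 bytes per pixel, which the text does not state; under that assumption the per-file and total sizes match the reported figures when read in binary units (MiB and GiB).

```python
# Assumption: 32-bit (4-byte) pixels; the pixel width is not given in the text.
BYTES_PER_PIXEL = 4
WIDTH = HEIGHT = 16_000
NUM_FILES = 537

MIB = 1024 ** 2
GIB = 1024 ** 3

file_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL
total_bytes = file_bytes * NUM_FILES

file_mib = file_bytes / MIB      # about 976.6 MiB per file
total_gib = total_bytes / GIB    # about 512.1 GiB for the whole data set
lustre_ratio = 223 / 512         # stored size on Lustre relative to the original

if __name__ == "__main__":
    print(f"per file: {file_mib:.1f} MiB")
    print(f"data set: {total_gib:.1f} GiB")
    print(f"Lustre stored fraction: {lustre_ratio:.2f}")
```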
The experiments process a single pass of 32 million vectors
with a UMap buffer size of 64GB. We demonstrate two
types of backing stores in this application. The first uses the
local SSD on the AMD testbed. The second uses a backing
store mapped to remote disks through a Lustre parallel file
system on the Intel testbed. Figures 5 and 6 present the results.
Our results show that the application has low sensitivity to
page size because of data reuse among the vectors. A slight
performance degradation occurs at large page sizes because larger
pages bring in more unused data. The execution time initially
decreases to a minimum at a 1MiB page size and then
slightly increases as larger amounts of unused data begin
to contend for buffer space.
MCHPC’19, Denver, USA Peng et al.
Figure 5: Execution time of the asteroid application on
local SSD at various UMap page sizes with a 256GB input.
(Plot: execution time in seconds versus page size from 64KB
to 64M.)
Figure 6: Performance comparison of the asteroid application
on local SSD and Lustre using a 512GB input. (Plot:
execution time in seconds versus page size from 4KB to 64M;
series SSD and Lustre.)
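One way to see why large pages contend for buffer space is to count how many distinct pages a fixed buffer can hold. The sketch below is only an illustration of that arithmetic: it reads the 64GB buffer as 64 GiB, and the helper name pages_in_buffer is hypothetical.

```python
KIB = 1024
MIB = 1024 ** 2
GIB = 1024 ** 3

# Assumption: the 64GB UMap buffer is interpreted as 64 GiB.
BUFFER_BYTES = 64 * GIB


def pages_in_buffer(page_size: int, buffer_bytes: int = BUFFER_BYTES) -> int:
    """Number of whole pages of the given size that fit in the buffer."""
    return buffer_bytes // page_size


if __name__ == "__main__":
    for page_size in (64 * KIB, 256 * KIB, 1 * MIB, 4 * MIB, 16 * MIB, 64 * MIB):
        print(f"{page_size // KIB:>6} KiB pages: "
              f"{pages_in_buffer(page_size):>8} resident pages")
```

With the buffer size fixed, each fourfold increase in page size cuts the number of resident pages by a factor of four, so any unused bytes within a page directly reduce the number of distinct regions that can stay cached.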
6.5 Database Workload
This use case demonstrates that UMap can be easily plugged
into existing database applications to improve user-space
control over memory mapping. We ported N-Store [2], an efficient
NVM database, to use the UMap API by changing approximately
ten lines of code. N-Store uses persistent memory,
such as an SSD, as the memory pool for data. Our experiments use
a 384 GB persistent memory pool on the local NVMe SSD
on the AMD testbed. N-Store supports multiple executors
that execute transactions on the database concurrently. In our
evaluation, we sweep from 4 to 32 executors to understand the
scalability of UMap under varying concurrency. Our workload uses
the popular YCSB [4] benchmark with eight million transactions
and five million keys. The measurement is repeated ten
times, and we report throughput from N-Store as the metric
for performance.
We tested different numbers of fillers and evictors and
selected a concurrency of 48 fillers and 24 evictors for this
benchmark. Then, with a fixed number of fillers and evictors,
we test the impact of different page sizes. Figure 7 reports
the throughput of the UMap version at different page sizes and
the original mmap version at the default 4KiB page. We find
that increasing the page size in UMap does show a trend of
increased performance, as in the other applications. The highest
throughput is achieved at a 32KiB page size, which is about
a 34% improvement over the mmap version. This page size is
smaller than the optimal page sizes in other applications
because the access pattern in the benchmark has low locality
and is mostly random.
Figure 7: Database throughput using mmap and UMap.
UMap achieves up to a 34% improvement at a 32KB page size.
(Plot: performance in ops/s and improvement versus page
size from 4K to 512K; series mmap, UMap, and Speedup.)
Figure 8: A scaling test in N-Store using an increasing
number of executors in the database shows that UMap
sustains performance scaling at increased application
concurrency. (Plot: throughput in ops/s and speedup for 4,
8, 16, and 32 executors; series mmap, UMap, and Speedup.)
Figure 8 reports the throughput of the database as application
concurrency increases, i.e., as the number of executors
increases. The scaling test results demonstrate the advantage
of UMap in addressing application requirements
that change dynamically. When the number of executors
increases from four to 32, the gap between the UMap version
and the mmap version increases (in the gray bars). In
particular, the speedup by UMap increases steadily from 1.3x to 1.6x
(the red line). This result highlights the importance
of a scalable design in UMap for handling various application
workloads.
7 DISCUSSION
There are several future directions for UMap to support
emerging architectures.
Multi-tiered Storage has tiered access latency and bandwidth.
Currently, UMap is extensible to new layers by defining
new data objects. In future work, we will automate
data migration between data objects and adapt to application
characteristics to improve storage utilization.
Disaggregated Memory architecture has large-capacity
memory servers connected to compute nodes through a
high-performance network to provide memory on demand. UMap
can be used to port applications to such architectures by providing
a backing store that defines access functions, likely
using RDMA, for moving data to and from the memory server.
Byte-addressable NVM requires strong consistency for
system software like file systems, and DAX-aware mmap
lacks such support [20]. The UMap buffer could provide applications
with explicit control over when to persist changes
cached in volatile memory.
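The last two directions can be sketched together: a backing store is reduced to a pair of access functions, and a buffer in volatile memory writes changes back only when the application asks. This is a minimal illustration under stated assumptions, not the UMap API. The names BackingStore, RemoteMemoryStore, PageBuffer, and persist are hypothetical, and a local bytearray stands in for the memory server that RDMA transfers would reach.

```python
from abc import ABC, abstractmethod


class BackingStore(ABC):
    """Hypothetical store interface: two access functions per backing store."""

    @abstractmethod
    def read(self, offset: int, length: int) -> bytes: ...

    @abstractmethod
    def write(self, offset: int, data: bytes) -> None: ...


class RemoteMemoryStore(BackingStore):
    """Stand-in for a memory server; a bytearray replaces network transfers."""

    def __init__(self, size: int) -> None:
        self._remote = bytearray(size)

    def read(self, offset: int, length: int) -> bytes:
        return bytes(self._remote[offset:offset + length])

    def write(self, offset: int, data: bytes) -> None:
        self._remote[offset:offset + len(data)] = data


class PageBuffer:
    """Caches pages in volatile memory; dirty pages reach the store on persist()."""

    def __init__(self, store: BackingStore, page_size: int) -> None:
        self.store = store
        self.page_size = page_size
        self.pages = {}     # page index -> bytearray holding the cached page
        self.dirty = set()  # indices of pages modified since the last persist()

    def _page(self, index: int) -> bytearray:
        # Fill the page from the store on first touch (the "page fault").
        if index not in self.pages:
            data = self.store.read(index * self.page_size, self.page_size)
            self.pages[index] = bytearray(data)
        return self.pages[index]

    def read_byte(self, address: int) -> int:
        index, offset = divmod(address, self.page_size)
        return self._page(index)[offset]

    def write_byte(self, address: int, value: int) -> None:
        index, offset = divmod(address, self.page_size)
        self._page(index)[offset] = value
        self.dirty.add(index)

    def persist(self) -> None:
        # The application decides when cached changes reach the store.
        for index in sorted(self.dirty):
            self.store.write(index * self.page_size, bytes(self.pages[index]))
        self.dirty.clear()
```

In this sketch a write is visible through the buffer immediately but reaches the store only after persist() is called, which is the kind of explicit control described for byte-addressable NVM.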
8 RELATED WORKS
Previous works have identified limitations in system services
for data-intensive applications that perform out-of-core execution
for large data sets [5, 18]. Song et al. [16] analyze the overhead
in the path through the Linux virtual memory subsystem for
handling memory-mapped I/O. They conclude that kernel-based
paging will prevent applications from exploiting fast storage.
Our approach aims to provide flexibility to adapt memory
mapping to application characteristics and backing store
features.
DI-MMAP [17] provides a loadable kernel module that
combines with a runtime to optimize page eviction and TLB
performance. This approach requires updates to remain compatible
with the fast-moving kernel. CO-PAGER [10] also
provides a user-space paging service by combining a kernel
module with a user-space component. CO-PAGER bypasses
the complex I/O subsystem in the kernel to reduce the overhead
of accessing NVM. Our approach stays in user space completely
and requires no modification of the kernel or updates
due to kernel updates. Moreover, our design can support a variety
of backing stores. For instance, remote memory paging that
fetches data from a memory server or compute node [6, 14]
could be easily integrated into UMap by providing a new
store object.
9 CONCLUSIONS
In this work, we provide a user-space page management
library, called UMap, to flexibly adapt memory mapping
to application characteristics and storage features. UMap
employs the lightweight userfaultfd mechanism to enable
applications to control critical parameters that impact the
performance of memory mapping large data sets while confining
the customizations within the application without
impacting other applications on the same system. We evaluate
UMap in five applications using large data sets on both
local SSD and remote HDD. By adapting the page size in
each application, UMap achieved 1.25 to 2.5 times improvement
compared to the system service mmap. In summary,
UMap can be easily plugged into data-intensive applications
to enable application-specific optimization.
ACKNOWLEDGMENT
This work was performed under the auspices of the U.S. Department of Energy
by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344
(LLNL-PROC-788145). This research was also supported by the Exascale Computing
Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office
of Science and the National Nuclear Security Administration. This document was
prepared as an account of work sponsored by an agency of the United States
government. Neither the United States government nor Lawrence Livermore National
Security, LLC, nor any of their employees makes any warranty, expressed or implied, or
assumes any legal liability or responsibility for the accuracy, completeness, or
usefulness of any information, apparatus, product, or process disclosed, or represents that
its use would not infringe privately owned rights. Reference herein to any specific
commercial product, process, or service by trade name, trademark, manufacturer, or
otherwise does not necessarily constitute or imply its endorsement, recommendation,
or favoring by the United States government or Lawrence Livermore National
Security, LLC. The views and opinions of authors expressed herein do not necessarily state
or reflect those of the United States government or Lawrence Livermore National
Security, LLC, and shall not be used for advertising or product endorsement purposes.
REFERENCES
[1] Andrea Arcangeli. 2019. Userland Page Faults and Beyond.
https://schd.ws/hosted_files/lcccna2016/c4/userfaultfd.pdf.
[2] Joy Arulraj, Andrew Pavlo, and Subramanya R Dulloor. 2015. Let’s
talk about storage & recovery methods for non-volatile memory data-
base systems. In Proceedings of the 2015 ACM SIGMOD International
Conference on Management of Data. ACM, 707–722.
[3] Danny Cobb and Amber Huffman. 2012. NVMe Overview. In Intel
Developer Forum. Intel.
[4] Brian F Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan,
and Russell Sears. 2010. Benchmarking cloud serving systems with
YCSB. In Proceedings of the 1st ACM symposium on Cloud computing.
ACM, 143–154.
[5] Michael Cox and David Ellsworth. 1997. Application-controlled de-
mand paging for out-of-core visualization. In Proceedings. Visualiza-
tion’97 (Cat. No. 97CB36155). IEEE, 235–244.
[6] Sandhya Dwarkadas, Nikolaos Hardavellas, Leonidas Kontothanassis,
Rishiyur Nikhil, and Robert Stets. 1999. Cashmere-VLM: Remote
memory paging for software distributed shared memory. In Proceedings
13th International Parallel Processing Symposium and 10th Symposium
on Parallel and Distributed Processing. IPPS/SPDP 1999. IEEE, 153–159.
[7] Linux kernel. 2019. Userfaultfd.
https://www.kernel.org/doc/Documentation/vm/userfaultfd.txt.
[8] Con Kolivas. 2019. Lrzip – Long Range Zip. https://github.com/
ckolivas/lrzip.
[9] Anthony Kougkas, Hariharan Devarajan, and Xian-He Sun. 2018. Hermes:
a heterogeneous-aware multi-tiered distributed I/O buffering
system. In Proceedings of the 27th International Symposium on
High-Performance Parallel and Distributed Computing. ACM, 219–230.
[10] Feng Li, Daniel G Waddington, and Fengguang Song. 2019. Userland
CO-PAGER: boosting data-intensive applications with non-volatile
memory, userspace paging. In Proceedings of the 3rd International Con-
ference on High Performance Compilation, Computing and Communica-
tions. ACM, 78–83.
[11] Sai Narasimhamurthy, Nikita Danilov, Sining Wu, Ganesan Umanesan,
Stefano Markidis, Sergio Rivas-Gomez, Ivy Bo Peng, Erwin Laure, Dirk
Pleiter, and Shaun DeWitt. 2019. SAGE: percipient storage for exascale
data centric computing. Parallel Comput. 83 (2019), 22–33.
[12] Ivy B. Peng, Maya B. Gokhale, and Eric W. Green. 2019. System
Evaluation of the Intel Optane Byte-addressable NVM. In Proceedings
of the International Symposium on Memory Systems. ACM. https:
//doi.org/10.1145/3357526.3357568
[13] I. B. Peng and J. S. Vetter. 2018. Siena: Exploring the Design Space of
Heterogeneous Memory Systems. In SC18: International Conference
for High Performance Computing, Networking, Storage and Analysis.
427–440. https://doi.org/10.1109/SC.2018.00036
[14] Sergio Rivas-Gomez, Roberto Gioiosa, Ivy Bo Peng, Gokcen Kestor,
Sai Narasimhamurthy, Erwin Laure, and Stefano Markidis. 2018. MPI
windows on storage for HPC applications. Parallel Comput. 77 (2018),
38–56.
[15] Arch Robison, Michael Voss, and Alexey Kukanov. 2008. Optimization
via reflection on work stealing in TBB. In 2008 IEEE International
Symposium on Parallel and Distributed Processing. IEEE, 1–8.
[16] Nae Young Song, Yongseok Son, Hyuck Han, and Heon Young Yeom.
2016. Efficient memory-mapped I/O on fast storage device. ACM
Transactions on Storage (TOS) 12, 4 (2016), 19.
[17] Brian Van Essen, Henry Hsieh, Sasha Ames, Roger Pearce, and Maya
Gokhale. 2013. DI-MMAP–a scalable memory-map runtime for out-
of-core data-intensive applications. Cluster Computing (2013). https:
//doi.org/10.1007/s10586-013-0309-0
[18] Brian Van Essen, Roger Pearce, Sasha Ames, and Maya Gokhale. 2012.
On the role of NVRAM in data-intensive architectures: an evalua-
tion. In 2012 IEEE 26th International Parallel and Distributed Processing
Symposium. IEEE, 703–714.
[19] Rik van Riel and Peter W. Morreale. 2008. Sysctl in kernel version
2.6.29. https://www.kernel.org/doc/Documentation/sysctl/vm.txt.
[20] Jian Xu and Steven Swanson. 2016. NOVA: A Log-structured File Sys-
tem for Hybrid Volatile/Non-volatile Main Memories. In 14th USENIX
Conference on File and Storage Technologies (FAST 16). 323–338.
