Effectively Prefetching Remote Memory with Leap by Maruf, Hasan Al & Chowdhury, Mosharaf
Effectively Prefetching Remote Memory with Leap
Hasan Al Maruf, Mosharaf Chowdhury
University of Michigan
{hasanal,mosharaf}@umich.edu
ABSTRACT
Memory disaggregation over RDMA can improve the perfor-
mance of memory-constrained applications by replacing disk
swapping with remote memory accesses. However, state-of-
the-art memory disaggregation solutions still use data path
components designed for slow disks. As a result, applica-
tions experience remote memory access latency significantly
higher than that of the underlying low-latency network,
which itself is too high for many applications.
In this paper, we propose Leap, a prefetching solution for
remote memory accesses due to memory disaggregation. At
its core, Leap employs an online, majority-based prefetching
algorithm, which increases the page cache hit rate. We com-
plement it with a lightweight and efficient data path in the
kernel that isolates each application’s data path to the disag-
gregated memory and mitigates latency bottlenecks arising
from legacy throughput-optimizing operations. Integration
of Leap in the Linux kernel improves the median and tail re-
mote page access latencies of memory-bound applications by
up to 104.04× and 22.62×, respectively, over the default data
path. This leads to up to 10.16× performance improvements
for applications using disaggregated memory in comparison
to the state-of-the-art solutions.
1 INTRODUCTION
Modern data-intensive applications [5, 28, 29, 70] experience
significant performance loss when their complete working
sets do not fit into the main memory. At the same time,
despite significant and disproportionate memory underuti-
lization in large clusters [62, 78], memory cannot be accessed
beyond machine boundaries. Such unused, stranded memory
can be leveraged by forming a cluster-wide logical mem-
ory pool via memory disaggregation, improving application-
level performance and overall cluster resource utilization
[10, 44, 48].
Two broad avenues have emerged in recent years to ex-
pose remote memory to memory-intensive applications. The
first requires redesigning applications from the ground up
using RDMA primitives [14, 20, 35, 49, 59, 63, 77]. Despite
its efficiency, rewriting applications can be cumbersome and
may not even be possible for many applications [9]. Alter-
natives rely on well-known abstractions to expose remote
memory; e.g., distributed virtual file system (VFS) for remote
file access [9] and distributed virtual memory management
(VMM) for remote memory paging [27, 31, 44, 45, 65].
Because disaggregated remote memory is slower, keeping
hot pages in the faster local memory ensures better perfor-
mance. Colder pages are moved to the far/remote memory as
needed [8, 31, 44]. Subsequent accesses to those cold pages
go through a slow data path inside the kernel – for instance,
our measurements show that an average 4KB remote page
access takes close to 40 µs in existing memory disaggregation
systems. Such high access latency significantly affects perfor-
mance because memory-intensive applications can tolerate
at most single µs latency [27, 44]. Note that the latency of
existing systems is many times more than the 4.3 µs average
latency of a 4KB RDMA operation, which itself is too high
for some applications.
In this paper, we take the following position: an ideal
solution should minimize remote memory accesses in its critical
path as much as possible. In this case, a local cache can reduce
the total number of remote memory accesses – a cache hit
results in a sub-µs latency, comparable to that of a local
page access. An effective prefetcher can proactively bring in
correct pages into the cache and increase the cache hit rate.
Unfortunately, existing prefetching algorithms fall short
for several reasons. First, they are designed to reduce disk
access latency by prefetching sequential disk pages in large
batches. Second, they cannot distinguish accesses from dif-
ferent applications. Finally, they cannot quickly adapt to
temporal changes in page access patterns within the same
process. As a result, being optimistic, they pollute the cache
area with unnecessary pages. At the same time, due to their
rigid pattern detection technique, they often fail to prefetch
the required pages into the cache before they are accessed.
In this paper, we propose Leap, an online prefetching so-
lution that minimizes the total number of remote memory
accesses in the critical path. Unlike existing prefetching al-
gorithms that rely on strict pattern detection, Leap works
with an approximate mechanism. Specifically, it builds on
the Boyer-Moore majority vote algorithm [16] to efficiently
identify remote memory access patterns for each individual
process. Relying on an approximate mechanism instead of
looking for trends in strictly consecutive accesses makes
Leap resilient to short-term irregularities in access patterns
(e.g., due to multi-threading). It also allows Leap to perform
well by detecting trends only from remote page accesses
instead of tracing the full virtual memory footprint of an
application, which demands continuous scanning and log-
ging of the hardware access bits of the whole virtual address
space and results in high CPU and memory overhead. In
addition to identifying the majority access pattern, Leap de-
termines how many pages to prefetch following that pattern
to minimize cache pollution.
While reducing cache pollution and increasing the cache
hit rate, Leap also ensures that the host machine faces mini-
mal memory pressure due to the prefetched cache. To move
pages from local to remote memory, the kernel needs to scan
through the entire cache to find eviction candidates – the
more pages it has, the more time it takes to scan. As a result,
memory allocation time for new pages increases. Therefore,
alongside a background LRU-based asynchronous cache evic-
tion policy, Leap eagerly frees up a cache entry just after it
gets hit and reduces the page allocation wait time.
We complement our algorithm with an efficient data path
design for remote memory accesses that is used in case of
a cache miss. It isolates per-application remote traffic and
cuts inessentials in the end-host software stack (e.g., the
block layer) to reduce host-side latency and handle a cache
miss with latency close to that of the underlying RDMA
operations.
We make the following contributions in this paper:
• We analyze the root causes behind data path latency over-
heads for disaggregated memory systems. We then pro-
pose Leap, a novel online prefetching algorithm and an
eager prefetch cache eviction policy along with a leaner
data path, to improve remote I/O performance.
• We implement Leap on Linux Kernel 4.4.125 as a separate
data path for remote memory access. Applications can
choose either Linux’s default data path for traditional
usage or Leap for going beyond the machine’s boundary
using unmodified Linux ABIs.
• We evaluate Leap’s prefetching algorithm against practi-
cal real-time prefetching algorithms (Next-K Line, Stride,
Linux Read-ahead) and show that the prefetcher itself im-
proves application-level performance by 1.75–3.36× with
an improved prefetch coverage of 3.06–37.51%. Leap’s
prefetching technique also reduces cache pollution and
cache misses by 1.28–1.62× and 1.74–10.47×, respectively.
Depending on the memory access pattern, simply replac-
ing the default Linux prefetcher with Leap’s prefetcher
can provide application-level performance benefit even
when they are paging to slower storage (e.g., HDD, SSD).
• In comparison to the Linux data path, Leap improves the
median and tail latency of VMM disaggregation frameworks
(e.g., Infiniswap [31] or LegoOS [65]) by up to
104.04× and 22.06×, respectively. For VFS disaggregation
solutions (e.g., Remote Regions [9]), Leap also improves
the 4KB page access latency characteristics by 24.96× at
the median and 17.32× at the 99th percentile. Due to its
faster data path, Leap provides application-level performance
improvements of 1.27–10.16× for multiple unmodified
memory-intensive applications: PowerGraph, NumPy,
VoltDB, and Memcached with production workloads.
[Figure 1 diagram: processes 1..N in user space issue file
reads/writes and page faults to the Virtual File System (VFS)
and the Memory Management Unit (MMU) in kernel space, each
with its own cache. On a cache miss, a request (bio) passes
through the device mapping layer, the generic block layer,
the I/O scheduler request queue (insertion, merging, sorting,
staging, and dispatch), the dispatch queue, and the block
device driver before reaching block devices (HDD, SSD, etc.)
or remote memory. Annotated average times: HDD 91.48 µs,
SSD 20 µs, RDMA 4.3 µs; other annotations: 0.27 µs,
10.04 µs, 21.88 µs, 2.1 µs.]
Figure 1: High-level life cycle of page requests in
Linux data path along with the average time spent in
each stage.
2 BACKGROUND AND MOTIVATION
2.1 Remote Memory
In memory disaggregation systems, unused cluster memory
is logically exposed as a global memory pool that is used as
the slower memory for machines with extreme memory de-
mand. This improves the performance of memory-intensive
applications that have to frequently access slower memory in
memory-constrained settings. At the same time, the overall
cluster memory usage gets balanced across the machines,
decreasing the need for memory over-provisioning per ma-
chine.
Access to remote memory over RDMA without significant
application rewrites typically relies on two primary
mechanisms: disaggregated VFS [9], which exposes remote memory
as files, and disaggregated VMM for remote memory paging
[31, 44, 65]. In both cases, data is communicated in small
chunks or pages. In case of remote memory as files, pages go
through the file system before they are written to/read from
the remote memory. For remote memory paging and dis-
tributed OS, page faults cause the virtual memory manager
to write pages to and read them from the remote memory.
2.2 Remote Memory Data Path
State-of-the-art memory disaggregation frameworks depend
on the existing kernel data path that is optimized for slow
disks. Figure 1 depicts the major stages in the life cycle of
a page request. Due to slow disk access times (average
latencies for HDDs and SSDs range between 4–5 ms and
80–160 µs, respectively), frequent disk accesses have a severe
impact on application throughput and latency. Although the
recent rise of memory disaggregation is fueled by the hope
that RDMA can consistently provide single-µs 4KB page
access latency [10, 27, 31], this is often wishful thinking in
practice [79]. Blocking on a page access, be it from HDD,
SSD, or remote memory, is often unacceptable.
[Figure 2 plots: CDFs of page access latency (µs, log scale
from 0.01 to 10000) for Disk, Disaggregated VMM, and
Disaggregated VFS; panels (a) Sequential and (b) Stride-10.]
Figure 2: Data path latencies for two access patterns.
Memory disaggregation systems have some constant
implementation overheads that cap their minimum
latency to around 1 µs.
To avoid blocking on I/O, race conditions, and synchro-
nization issues (e.g., accessing a page while the page out
process is still in progress), the kernel uses a page cache. To
access a page from slower memory, it is first looked up in the
appropriate cache location; a hit results in almost memory-
speed page access latency. However, when the page is not
found in the cache (i.e., a miss), it is accessed through a costly
block device I/O operation that includes several queuing and
batching stages to optimize disk throughput. As a result, a
cache miss leads to more than 100× slower latency than a
hit; it also introduces high latency variations.
2.3 Prefetching in Linux
The Linux kernel tries to store files on the disk in adjacent
sectors to increase sequential disk accesses. The same hap-
pens for paging. Naturally, existing prefetching mechanisms
are designed assuming sequential data layout. The default
Linux prefetcher relies on the last two page faults: if they are
for consecutive pages, it brings in several sequential pages
into the page cache; otherwise, it assumes that there are
no patterns and reduces or stops prefetching. This has sev-
eral drawbacks. First, whenever it observes two consecutive
paging requests for consecutive pages, it over-optimistically
brings in pages that may not even be useful. As a result, it
wastes I/O bandwidth and causes cache pollution by occu-
pying valuable cache space. Second, simply assuming the
absence of any pattern based on the last two requests is overly
pessimistic. Furthermore, all the applications share the same
swap space in Linux; hence, pages from two different
processes can share consecutive places in the swap area. An
application can also have multiple, interleaved stride patterns,
for example, due to multiple concurrent threads. Overall,
considering only the last two requests to prefetch a batch of
future pages falters on both counts.
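To make these drawbacks concrete, the following Python sketch
models the two-request rule described above in a deliberately
simplified form. It is not the kernel's read-ahead code: the
function name simulate_two_fault_prefetcher, the fixed window of
8 pages, and the choice to treat the previously accessed page as
the "last request" are all assumptions made for illustration.

```python
def simulate_two_fault_prefetcher(accesses, window=8):
    """Toy model (assumption, not kernel code): on a miss that is exactly
    one page after the previous access, prefetch the next window-1 pages."""
    cache = set()   # prefetched pages that have not been consumed yet
    hits = 0
    prev = None     # previously accessed page
    for page in accesses:
        if page in cache:
            cache.remove(page)
            hits += 1
        elif prev is not None and page == prev + 1:
            # Two consecutive pages seen: assume a sequential pattern.
            cache.update(range(page + 1, page + window))
        # Any other miss is treated as "no pattern": nothing is prefetched.
        prev = page
    return hits / len(accesses)


sequential = list(range(0, 1000))      # pages 0, 1, 2, ...
stride_10 = list(range(0, 10000, 10))  # pages 0, 10, 20, ...

print("sequential hit rate:", simulate_two_fault_prefetcher(sequential))
print("stride-10 hit rate:", simulate_two_fault_prefetcher(stride_10))
```

In this toy model the sequential trace is served mostly from the
cache, while the stride-10 trace never hits because no two
successive requests are for consecutive pages.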
To illustrate this, we measure the page access latency for
two memory access patterns: (a) Sequential accesses mem-
ory pages sequentially; and (b) Stride-10 accesses memory
in strides of 10 pages. In both cases, we use a simple applica-
tion with its working set size set to 2GB. For disaggregated
VMM, it is provided 1GB memory to ensure that 50% of its
accesses cause paging. For disaggregated VFS, it performs a 1GB
remote write and then another 1GB remote read operation.
Figure 2 shows the latency distributions for 4KB page
accesses from disk and disaggregated remote memory for
both of the access patterns. For a prefetch size of 8 pages,
both perform well for the Sequential pattern; this is because
80% of the requests hit the cache. In contrast, we observe
significantly higher latency in the Stride-10 case because
all the requests miss the page cache due to the lack of
consecutiveness in successive page accesses. By analyzing the
latency breakdown inside the data path for Stride-10 (as
shown in Figure 1), we make two key observations. First,
although RDMA can provide significantly lower latency than
disk (4.3 µs vs. 91.5 µs), RDMA-based solutions do not benefit
as much from that (38.3 µs vs. 125.5 µs). This is because of the
significant data path overhead (on average 34 µs) to prepare
and batch a request before dispatching it. Significant
variations in the preparation and batching stages of the data
path cause the average to stray far from the median. Second,
the existing sequential data layout-based prefetching mechanism
fails to serve its purpose in the presence of diverse remote
page access patterns. Solutions based on fixed stride sizes also
fall short because stride sizes can vary over time within the
same application. Besides, there can be more complicated
patterns beyond stride or no repetitions at all.
Shortcoming of Strict Pattern Finding for Prefetching. Fig-
ure 3 presents the remote page access patterns of four
memory-intensive applications during page faults when they
are run with 50% of their working sets in memory (more de-
tails in Section 5.3). Specifically, we consider all page-fault
sequences of size X ∈ {2, 4, 8} in these applications and
divide them into three categories: sequential when all X are
sequential pages, stride when all X have the same stride from
the first page, and other when it is neither sequential nor
stride.
[Figure 3 plot: stacked bars showing the fractions of
Sequential, Stride, and Other patterns (y-axis: % of patterns,
0 to 1) for PowerGraph, NumPy, VoltDB, and Memcached under
Window-2, Window-4, and Window-8 with strict matching, and
Window-8 with majority-based matching.]
Figure 3: Fractions of sequential, stride, and other
access patterns in page fault sequences of length X
(Window-X).
[Figure 4 plot: CDF of cache eviction latency; x-axis:
Time (s), 0 to 100.]
Figure 4: Due to Linux’s lazy cache eviction policy,
page caches waste the cache area for a significant
amount of time.
The default prefetcher in Linux finds strict sequential
patterns in window size X = 2 and tunes up its aggres-
siveness accordingly. Consequently, for PowerGraph and
VoltDB, it optimistically prefetches many pages into the
cache. Given that both ratios decrease for X = 8, many
of these prefetches only cause cache pollution. At the same
time, all non-sequential patterns with X = 2 fall under the
stride category. Considering the low cache hit rate, Linux
pessimistically decreases or stops prefetching in those cases,
which leads to a stale page cache.
Note that strictly expecting all X accesses to follow the
same pattern results in not having any patterns at all (e.g.,
when X = 8), because this cannot capture the transient
interruptions in sequence. In that case, following the major
sequential and/or stride trend within a limited page access
history window is more resilient to short-term irregularities.
Consequently, when X = 8, a majority-based pattern
detection can detect 11.3%–29.7% more sequential accesses.
Therefore, it can successfully prefetch more accurate pages
into the page cache. Besides sequential and stride access
patterns, it also recognizes irregular access patterns;
e.g., for Memcached, it can detect 96.4% of the irregularity.
Prefetch Cache Eviction. The Linux kernel has an asynchro-
nous background thread (kswapd) to monitor the machine’s
memory consumption. If a memory node goes beyond a
critical memory pressure or a process’s memory usage hits
its limit, it determines the eviction candidates by scanning
over the in-memory pages to find out the least-recently-used
(LRU) ones. Then, it frees up the selected pages from the
main memory to provide space for new pages waiting for
memory allocation. A prefetched cache entry waits in the LRU
list for its turn to get selected for eviction even though it has
already been used by a process (Figure 4). Unnecessary pages
waiting in memory for eviction lead to extra scanning time.
This extra wait time due to the lazy cache eviction policy adds
to the overall latency, especially in a high memory pressure
scenario.
3 REMOTE MEMORY PREFETCHING
In this section, we first highlight the characteristics of
an ideal prefetcher. Next, we present our proposed online
prefetcher along with its different components and the design
principles behind them. Finally, we discuss the complexity
and correctness of our algorithm.
3.1 Properties of an Ideal Prefetcher
A prefetcher’s effectiveness is measured along three axes:
• Accuracy refers to the ratio of total cache hits to the total
pages added to the cache via prefetching.
• Coverage measures the ratio of total cache hits from the
prefetched pages to the total number of requests (e.g.,
page faults in case of remote memory paging solutions).
• Timeliness of an accurately prefetched page is the time
gap from when it was prefetched to when it was first hit.
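As a small illustration of the three definitions above, the
following Python sketch computes them from raw counters. The
class name PrefetchStats, its field names, and the sample numbers
are hypothetical and chosen only for the example; they are not
measurements from this paper.

```python
from dataclasses import dataclass, field


@dataclass
class PrefetchStats:
    prefetched_pages: int   # pages added to the cache via prefetching
    prefetch_hits: int      # cache hits served from prefetched pages
    total_requests: int     # e.g., page faults for remote memory paging
    # Per-hit gap (in µs) between the prefetch and the first hit.
    hit_delays_us: list = field(default_factory=list)

    def accuracy(self):
        return self.prefetch_hits / self.prefetched_pages

    def coverage(self):
        return self.prefetch_hits / self.total_requests

    def mean_timeliness_us(self):
        return sum(self.hit_delays_us) / len(self.hit_delays_us)


# Made-up counters, for illustration only.
stats = PrefetchStats(prefetched_pages=200, prefetch_hits=150,
                      total_requests=500, hit_delays_us=[4.0, 6.0, 8.0])
print(stats.accuracy(), stats.coverage(), stats.mean_timeliness_us())
```

With these made-up counters, three quarters of the prefetched
pages are eventually used (accuracy), while fewer than a third of
all requests are served by prefetched pages (coverage), which
shows that the two metrics can diverge.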
Trade-off. An aggressive prefetcher can hide the slower
memory access latency by bringing pages well ahead of the
access requests. This might increase the accuracy, but as
prefetched pages wait longer to get consumed, this wastes
the effective cache and I/O bandwidth. On the other hand,
a conservative prefetcher has lower prefetch consumption
time and reduces cache and bandwidth contention. However,
it has lower coverage and cannot hide memory access latency
completely. An effective prefetcher must balance all three.
An effective prefetcher must be adaptive to temporal
changes in memory access patterns as well. When there
is a predictable access pattern, it should bring pages aggres-
sively. In contrast, during irregular accesses, prefetch rate
should be throttled down to avoid cache pollution.
Prefetching algorithms use prior page access information
to predict future access patterns. As such, their effectiveness
largely depends on how well they can detect patterns and
predict. A real-time prefetcher has to face a tradeoff between
pattern identification accuracy vs. computational complexity
and resource overhead. High CPU usage and memory
consumption will negatively impact application performance
even though they may help in increasing accuracy.
Technique                     | Low Computational | Low Memory | Unmodified  | HW/SW       | Temporal | Spatial  | High Prefetch
                              | Complexity        | Overhead   | Application | Independent | Locality | Locality | Utilization
Next-N-Line [52]              | ✓                 | ✓          | ✓           | ✓           | ✗        | ✓        | ✗
Stride [13]                   | ✓                 | ✓          | ✓           | ✓           | ✗        | ✓        | ✗
GHB PC [54]                   | ✗                 | ✗          | ✓           | ✗           | ✓        | ✓        | ✓
Instruction Prefetch [26, 40] | ✗                 | ✗          | ✗           | ✗           | ✓        | ✓        | ✓
Linux Read-Ahead [72]         | ✓                 | ✓          | ✓           | ✓           | ✓        | ✓        | ✗
Leap Prefetcher               | ✓                 | ✓          | ✓           | ✓           | ✓        | ✓        | ✓
Table 1: Comparison of prefetching techniques based on different objectives.
Common Prefetching Techniques. The most common and
simple form of prefetching is spatial pattern detection
[51]. Some specific access patterns (e.g., stride, stream)
can be detected with the help of special hardware
features [32, 34, 66, 80]. However, they are typically applied
to identify patterns in instruction access that are more
regular; in contrast, data access patterns are more irregular.
Special prefetch instructions can also be injected into an
application’s source code, based on compiler or post-execution
based analysis [26, 39, 40, 60, 61]. However, compiler-injected
prefetching needs static analysis of the cache miss behavior
before the application runs. Hence, it is not adaptive to
dynamic cache behavior. Finally, the use of these hardware-
or software-dependent prefetching techniques is limited
by the availability of the special hardware/software features
and/or application modification.
Summary. An ideal prefetcher should have low computa-
tional and memory overhead. It should have high accuracy,
coverage, and timeliness to reduce cache pollution; an adap-
tive prefetch window is imperative to fulfill this requirement.
It should also be flexible to both spatial and temporal locality
in memory accesses. Finally, hardware/software indepen-
dence and application transparency make it more generic
and robust.
Table 1 compares different prefetching methods.
3.2 Majority Trend-Based Prefetching
Leap has two main components: detecting trends and deter-
mining what to prefetch. The first component looks for any
approximate trend in earlier accesses. Based on the trend
availability and prefetch utilization information, the latter
component decides how many and which pages to prefetch.
3.2.1 Trend Detection. Existing prefetch solutions rely on
strict pattern identification mechanisms (e.g., sequential or
stride of fixed size) and fail to ignore temporary
irregularities. Instead, we consider a relaxed approach that is
robust to short-term irregularities. Specifically, we identify
the majority ∆ value in a fixed-size (Hsize) window of remote
page accesses (AccessHistory) and ignore the rest. For a window
of size w, a ∆ value is said to be the majority only if it
appears at least ⌊w/2⌋ + 1 times within that window. To find
the majority ∆, we use the Boyer-Moore majority vote algorithm
[16] (Algorithm 1), a linear-time and constant-memory algorithm,
over AccessHistory elements. Given a majority ∆, due to
the temporal nature of remote page access events, it can be
hypothesized that subsequent ∆ values are more likely to be
the same as the majority ∆.
Algorithm 1 Trend Detection
 1: procedure FindTrend(Nsplit)
 2:     Hsize ← size(AccessHistory)
 3:     w ← Hsize / Nsplit        ▷ Start with small detection window
 4:     ∆maj ← ∅
 5:     while true do
 6:         ∆maj ← Boyer-Moore on {Hhead, . . . , Hhead−w−1}
 7:         w ← w ∗ 2
 8:         if ∆maj ≠ major trend then
 9:             ∆maj ← ∅
10:         if ∆maj ≠ ∅ or w > Hsize then
11:             return ∆maj
12:     return ∆maj
Note that if two pages are accessed together, they will be
aged and evicted together in the slower memory space at
contiguous or nearby addresses. Consequently, the temporal
locality in virtual memory accesses will also be observed in
the slower page accesses and an approximate stride should
be enough to detect that.
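The majority test over a window of ∆ values can be sketched in
Python as follows. This is an illustrative version of the
Boyer-Moore majority vote with an explicit verification pass, not
the kernel implementation; the function name majority_delta is an
assumption.

```python
def majority_delta(deltas):
    """Return the value that appears at least len(deltas)//2 + 1 times
    in `deltas`, or None if no such value exists."""
    candidate, count = None, 0
    for d in deltas:                 # voting pass: O(w) time, O(1) state
        if count == 0:
            candidate, count = d, 1
        elif d == candidate:
            count += 1
        else:
            count -= 1
    # Verification pass: the surviving candidate is a true majority only
    # if it really fills more than half of the window.
    if candidate is not None and deltas.count(candidate) >= len(deltas) // 2 + 1:
        return candidate
    return None


print(majority_delta([72, -3, -3, -3]))   # three of four deltas agree
print(majority_delta([-3, -58, 2, 2]))    # no value fills more than half
```

The verification pass matters because the voting pass alone
always ends with some candidate, even when no value occupies more
than half of the window.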
Window Management. If a memory access sequence
follows a regular trend, then the majority ∆ is likely to be
found in almost any part of that sequence. In that case, a
smaller window can be more effective as it reduces the total
number of operations. So instead of considering the entire
AccessHistory, we start with a smaller window that starts
from the head position (Hhead) of AccessHistory. For a
window of size w, we find the major ∆ appearing in the
Hhead, Hhead−1, ..., Hhead−w−1 elements.
[Figure 5 diagram: contents of the eight-slot AccessHistory
(requested addresses and the stored ∆ values) in four panels:
(a) at time t3, (b) at time t7, (c) at time t8, (d) at time t15.]
Figure 5: Content of AccessHistory at different times.
Solid colored boxes indicate the head position at time
ti. Dashed boxes indicate detection windows. Here,
time rolls over at t8.
However, in the presence of short-term irregularities,
small windows may not detect a majority. To address this, the
prefetcher starts with a small detection window and doubles
the window size up to the AccessHistory size until it finds a
majority; otherwise, it determines the absence of a majority.
The smallest window size can be controlled by Nsplit.
Example. Let us consider an AccessHistory with Hsize = 8
and Nsplit = 2. Say pages with the following addresses:
0x48, 0x45, 0x42, 0x3F, 0x3C, 0x02, 0x04, 0x06, 0x08, 0x0A,
0x0C, 0x10, 0x39, 0x12, 0x14, 0x16, were requested in that
order. Figure 5 shows the corresponding ∆ values stored in
AccessHistory, with t0 being the earliest and t15 being the
latest request. At ti, Hhead stays at the ti-th slot.
FindTrend in Algorithm 1 will initially try to detect a
trend using a window size of 4. Upon failure, it will next look
for a trend within a window size of 8.
At time t3, FindTrend successfully finds a trend of -3
within the t0–t3 window (Figure 5a).
Algorithm 2 Prefetch Candidate Generation
 1: procedure GetPrefetchWindowSize(page Pt)
 2:     PWsize_t                  ▷ Current prefetch window size
 3:     PWsize_t−1                ▷ Last prefetch window size
 4:     Chit                      ▷ Prefetched cache hits after last prefetch
 5:     if Chit = 0 then
 6:         if Pt follows the current trend then
 7:             PWsize_t ← 1      ▷ Prefetch a page along trend
 8:         else
 9:             PWsize_t ← 0      ▷ Suspend prefetching
10:     else                      ▷ Earlier prefetches had hits
11:         PWsize_t ← Round up Chit + 1 to closest power of 2
12:         PWsize_t ← min(PWsize_t, PWsize_max)
13:     if PWsize_t < PWsize_t−1 / 2 then      ▷ Low cache hit
14:         PWsize_t ← PWsize_t−1 / 2          ▷ Shrink window smoothly
15:     Chit ← 0
16:     PWsize_t−1 ← PWsize_t
17:     return PWsize_t
18: procedure DoPrefetch(page Pt)
19:     PWsize_t ← GetPrefetchWindowSize(Pt)
20:     if PWsize_t ≠ 0 then
21:         ∆maj ← FindTrend(Nsplit)
22:         if ∆maj ≠ ∅ then
23:             Read PWsize_t pages with ∆maj stride from Pt
24:         else
25:             Read PWsize_t pages around Pt with latest ∆maj
26:     else
27:         Read only page Pt
At time t7, the trend starts to shift from -3 to +2. At
that time, the t4–t7 window does not have a majority ∆, so
the window is doubled to consider t0–t7. This window does
not have any majority ∆ either (Figure 5b). However, at t8,
we will find a majority ∆ of +2 within the t5–t8 window and
adapt to the new trend (Figure 5c).
Similarly, at t15, we have a majority of +2 in the t8–t15
window, which continues the +2 trend found at t8 while ignoring
the short-term variations at t12 and t13 (Figure 5d).
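The walkthrough above can be replayed with a short Python
sketch of the window-doubling search. It is a sketch under stated
assumptions rather than the kernel code: the names majority and
find_trend are hypothetical, the history is a bounded deque whose
last element plays the role of Hhead, the first ∆ is computed
against address 0 (matching the +72 in Figure 5), and
collections.Counter stands in for the Boyer-Moore vote for
brevity.

```python
from collections import Counter, deque


def majority(window):
    """Value filling more than half of `window`, else None.
    Counter is used here in place of the Boyer-Moore vote."""
    if not window:
        return None
    value, count = Counter(window).most_common(1)[0]
    return value if count >= len(window) // 2 + 1 else None


def find_trend(history, h_size, n_split):
    """Search the most recent h_size // n_split deltas first and double
    the window until a majority is found or the window exceeds h_size."""
    recent = list(history)            # oldest ... newest (head is last)
    w = max(1, h_size // n_split)
    while w <= h_size:
        trend = majority(recent[-w:])
        if trend is not None:
            return trend
        w *= 2
    return None


addresses = [0x48, 0x45, 0x42, 0x3F, 0x3C, 0x02, 0x04, 0x06,
             0x08, 0x0A, 0x0C, 0x10, 0x39, 0x12, 0x14, 0x16]
history = deque(maxlen=8)             # Hsize = 8; oldest deltas roll over
trends = {}
prev = 0                              # assumed starting address
for t, addr in enumerate(addresses):
    history.append(addr - prev)
    prev = addr
    trends[t] = find_trend(history, h_size=8, n_split=2)

for t in (3, 7, 8, 15):
    print(f"t{t}: trend = {trends[t]}")
```

Under these assumptions the sketch reports -3 at t3, no trend at
t7, and +2 at both t8 and t15, in line with the example.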
3.2.2 Prefetch Candidate Generation. So far we have
focused on identifying the presence of a trend. Algorithm 2
determines whether and how to use that trend for prefetching
for a request for page Pt.
We determine the prefetch window size (PWsize_t) based on
the accuracy of prefetches between two consecutive prefetch
requests (see GetPrefetchWindowSize). Any cache hit
of the prefetched data between two consecutive prefetch
requests indicates the overall effectiveness of the prefetch.
In case of high effectiveness (i.e., many cache hits), PWsize_t
is expanded until it reaches a maximum size (PWsize_max).
On the other hand, few cache hits indicate low effectiveness;
in that case, the prefetch window size gets reduced.
However, in the presence of drastic drops, prefetching is
not suspended immediately. The prefetch window is shrunk
smoothly to make the algorithm flexible to short-term
irregularities. When prefetching is suspended, no extra pages
are prefetched until a new trend is detected. This is to avoid
cache pollution during irregular/unpredictable accesses.
Given a non-zero PWsize, the prefetcher brings in PWsize
pages following the current trend, if any (DoPrefetch). If
no majority trend exists, instead of giving up right away,
it speculatively brings PWsize pages around Pt’s offset
following the previous trend. This is to ensure that short-term
irregularities cannot completely suspend prefetching.
Prefetching in the Presence of Irregularity. FindTrend can
detect a trend within a window of size w in the presence of
at most ⌊w/2⌋ − 1 irregularities within it. If the window size
is too small or the window has multiple perfectly interleaved
threads with different strides, FindTrend will consider it a
random pattern. In that case, if PWsize has a non-zero
value, then it performs a speculative prefetch (line 25) with
the previous ∆maj. If that ∆maj is one of the interleaved
strides, then this speculation will cause cache hits and
continue. Otherwise, PWsize will eventually be zero and the
prefetcher will stop bringing unnecessary pages. In that case,
the prefetcher cannot be worse than the existing prefetch
algorithms.
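The window sizing and candidate selection described in this
subsection can be sketched in Python as follows. This is an
interpretation of Algorithm 2, not the kernel code. Several
details are assumptions made for the example: the names
PrefetchWindow and prefetch_candidates, the maximum window of 8,
returning the faulting page in addition to the prefetched pages,
and reading "around Pt" as half of the window on each side of Pt.

```python
class PrefetchWindow:
    """Tracks PWsize across prefetch requests (GetPrefetchWindowSize)."""

    def __init__(self, max_size=8):
        self.max_size = max_size   # PWsize_max; 8 is an assumed value
        self.last_size = 0         # PWsize_t-1
        self.hits = 0              # Chit: prefetched-page hits since last call

    def record_hit(self):
        self.hits += 1

    def next_size(self, follows_trend):
        if self.hits == 0:
            size = 1 if follows_trend else 0
        else:
            n = self.hits + 1
            size = 1 << (n - 1).bit_length()   # round n up to a power of 2
            size = min(size, self.max_size)
        if size < self.last_size // 2:         # few hits: shrink smoothly
            size = self.last_size // 2
        self.hits = 0
        self.last_size = size
        return size


def prefetch_candidates(page, size, trend, last_trend):
    """Pages to read for a fault on `page` (DoPrefetch)."""
    if size == 0:
        return [page]                          # demand page only
    if trend is not None:                      # follow the majority trend
        return [page + i * trend for i in range(size + 1)]
    # No majority: speculate around the page using the previous trend
    # (falling back to a step of 1 is an assumption of this sketch).
    step = last_trend if last_trend else 1
    half = size // 2
    return [page + i * step for i in range(-half, size - half + 1)]


pw = PrefetchWindow(max_size=8)
sizes = [pw.next_size(follows_trend=True)]           # no hits yet, on trend
pw.record_hit()
sizes.append(pw.next_size(follows_trend=True))       # one hit
for _ in range(3):
    pw.record_hit()
sizes.append(pw.next_size(follows_trend=True))       # three hits
for _ in range(3):
    sizes.append(pw.next_size(follows_trend=False))  # hits dry up
print(sizes)
print(prefetch_candidates(0x10, 2, 2, None))
```

In this run the window grows while prefetched pages keep getting
hit and then halves step by step once the hits stop, instead of
dropping to zero at once.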
3.3 Analysis
Time Complexity. The FindTrend function in Algorithm 1
initially tries to detect a trend aggressively within a smaller
window using the Boyer-Moore majority vote algorithm. If
it fails, then it expands the window size. The Boyer-Moore
majority vote algorithm (line 6) detects a majority element (if
any) in O(w) time, where w is the size of the window. In the
worst case, it will invoke the Boyer-Moore majority vote
algorithm O(log Hsize) times. However, as the windows
are continuous, searching in a new window does not need
to start from the beginning and the algorithm never accesses
the same item twice. Hence, the worst-case time complexity
of the FindTrend function is O(Hsize), where Hsize is the
size of the AccessHistory queue. For smaller Hsize, the
computational complexity is constant. Even for Hsize = 32, the
prefetcher provides significant performance gain (§5) that
greatly outweighs the slight extra computational cost.
[Figure 6 diagram: processes 1..N in user space issue file
reads/writes and page faults to the Virtual File System (VFS)
and the Memory Management Unit (MMU) in kernel space, each
with its own cache. Leap adds a process-specific page access
tracker, a prefetcher (trend detection and prefetch candidate
generation), and eager cache eviction, and connects cache
misses to remote memory storage. Annotated times: 0.27 µs,
4.3 µs, 2.1 µs.]
Figure 6: Leap has a faster data path for a cache miss.
Memory Complexity. The Boyer-Moore majority vote
algorithm operates on constant memory space. FindTrend
just invokes the Boyer-Moore majority vote algorithm and
does not require any additional memory to execute. So, the
Trend Detection algorithm needs O(1) space to operate.
Correctness of Trend Detection. The correctness of
FindTrend depends on that of the Boyer-Moore majority vote
algorithm, which always provides the majority element, if
one exists, in linear time (see [16] for the formal proof).
4 SYSTEM DESIGN
We have implemented our prefetching algorithm as a data
path replacement for memory disaggregation frameworks
(we refer to this design as Leap data path) alongside the tra-
ditional data path in Linux kernel v4.4.125. Leap has three
primary components: a page access tracker to isolate pro-
cesses, a majority-based prefetching algorithm, and an eager
cache eviction mechanism. All of them work together in the
kernel space to provide a faster data path. Figure 6 shows the
basic architecture of Leap’s remote memory access mecha-
nism. It takes only around 400 lines of code to implement
the page access tracker, prefetcher, and the eager eviction
mechanism.
4.1 Page Access Tracker
Leap isolates each process’s page access data paths. The page
access tracker monitors page accesses inside the kernel to
provide the prefetcher with enough information to detect
the page access trend of a specific application. Leap does not
monitor in-memory pages (hot pages) because continuously
scanning and recording the hardware access bits of a large
number of pages causes significant computational overhead
and memory consumption. Instead, it monitors only the
cache look-ups and records the access sequence of the pages
after I/O requests or page faults, trading off a small loss in
access pattern detection accuracy for low resource overhead.
As temporal locality in the virtual memory space results in
spatial locality in the remote address space, just monitoring
the remote page accesses is often enough.
The page access tracker is added as a separate control
unit inside the kernel. Upon a page fault, during the page-in
operation (do_swap_page() under mm/memory.c), we notify
(log_access_history()) Leap’s page access tracker about
the page fault and the process involved. Leap maintains
process-specific, fixed-size (Hsize) FIFO AccessHistory circular
queues to record the page access history. Instead of
recording exact page addresses, however, we only store the
difference between two consecutive requests (∆). For exam-
ple, if page faults happen for addresses 0x2, 0x5, 0x4, 0x6,
0x1, 0x9, then AccessHistory will store the corresponding
∆ values: 0, +3, -1, +2, -5, +8. This reduces the storage space
and computation overhead during trend detection (§3.2.1).
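The ∆-based recording described above can be sketched as a small circular queue. The class and method names below (AccessHistory, log_access, recent) are hypothetical stand-ins for illustration and are not the kernel data structures; the sketch assumes one instance per process and stores 0 for the first access, as in the example above.

```python
class AccessHistory:
    """Fixed-size circular queue of page-address deltas (illustrative only)."""

    def __init__(self, h_size: int = 32) -> None:
        self.h_size = h_size
        self.deltas = [0] * h_size
        self.head = 0            # index of the next slot to overwrite
        self.count = 0           # number of valid entries, capped at h_size
        self.last_page = None    # previously faulted page, if any

    def log_access(self, page: int) -> int:
        """Record the delta to the previous access and return it."""
        delta = 0 if self.last_page is None else page - self.last_page
        self.last_page = page
        self.deltas[self.head] = delta
        self.head = (self.head + 1) % self.h_size
        self.count = min(self.count + 1, self.h_size)
        return delta

    def recent(self, n: int) -> list:
        """Return up to n of the most recent deltas, newest first."""
        n = min(n, self.count)
        return [self.deltas[(self.head - 1 - i) % self.h_size] for i in range(n)]
```

Feeding the addresses 0x2, 0x5, 0x4, 0x6, 0x1, 0x9 through log_access reproduces the ∆ sequence from the text. Once the queue is full, the oldest entries are overwritten in place.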
4.2 The Prefetcher
To increase the probability of cache hit, Leap incorporates
the majority trend-based prefetching algorithm (§3.2). Here,
the prefetcher considers each process’s earlier remote page
access histories available in the respective AccessHistory to
efficiently identify the access behavior of different processes.
Because threads of the same process share memory with each
other, we choose process-level detection over thread-based detection.
Thread-based pattern detection may result in requesting the
same page for prefetch multiple times for different threads.
Two consecutive page access requests are temporally
correlated in the sense that they may happen together in the
future. The ∆ values stored in the AccessHistory record
the spatial locality in the temporally correlated page accesses.
Therefore, the prefetcher utilizes both temporal and spatial
localities of page accesses to predict future page demand.
The prefetcher is also added as a separate control
unit inside the kernel. While paging-in, instead of going
through the default swapin_readahead(), we re-route it
through the prefetcher’s do_prefetch() function. Once
the prefetcher decides the set of pages that may
be accessed in the future, Leap bypasses the expensive
request scheduling and batching operations of the block
layer (swap_readpage()/swap_writepage() for paging
and generic_file_read()/generic_file_write() for
the file systems) and invokes leap_remote_io_request()
to re-direct the request through Leap’s asynchronous remote
I/O interface over RDMA (§4.4).
4.3 Eager Cache Eviction
Leap maintains a circular linked list of prefetched caches
(PrefetchFifoLruList). Whenever a page is fetched from
the remote memory, besides the kernel’s global LRU lists,
Leap adds it at the tail of the PrefetchFifoLruList. After
the prefetch cache gets hit and the page table is updated,
Leap instantaneously frees the page cache and removes it
from the PrefetchFifoLruList. As an accurate prefetcher
is timely in using the prefetched data, in Leap, prefetched
caches do not wait long in the PrefetchFifoLruList to
get evicted by the background process. This eager eviction
of prefetch caches reduces the scan time to select eviction
candidates. As a result, the wait time to find and allocate
new pages also drops: on average, page allocation time is
reduced by 750ns (36% less than usual). Thus, new pages
can be brought into memory more quickly, leading to a
reduction in the overall data path latency.
However, if the prefetched pages need to be evicted
even before they get consumed (e.g., under severe global
memory pressure or an extremely constrained prefetch cache
size), then, due to the lack of any access history, prefetched
pages follow a FIFO eviction order among themselves
in the PrefetchFifoLruList. Reclamation of other
memory (file-backed or anonymous pages) follows the
existing LRU eviction policy in the kernel. We modify the
kernel’s Memory Management Unit (mm/swap_state.c) to
add the prefetch-eviction-related functions.
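The two eviction rules above (free a prefetched page as soon as it is hit, and fall back to FIFO order for unconsumed pages under pressure) can be sketched in user-space Python. This is a simplified model under stated assumptions: the class PrefetchFifoList and its capacity parameter are hypothetical, and the interaction with the kernel's global LRU lists is not modeled.

```python
from collections import OrderedDict


class PrefetchFifoList:
    """Model of a prefetch list with eager eviction (illustrative only)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity       # assumed limit on prefetched pages
        self.pages = OrderedDict()     # page number -> data, oldest first

    def add_prefetched(self, page: int, data: bytes) -> list:
        """Insert a page at the tail; return pages evicted in FIFO order."""
        evicted = []
        self.pages[page] = data
        self.pages.move_to_end(page)
        # Under pressure, unconsumed pages leave oldest-first because
        # they have no access history to rank them by.
        while len(self.pages) > self.capacity:
            old, _ = self.pages.popitem(last=False)
            evicted.append(old)
        return evicted

    def consume(self, page: int):
        """On a hit, return the data and free the entry immediately."""
        return self.pages.pop(page, None)
```

Because a hit removes the entry right away, the list stays short when prefetches are consumed promptly, which is the effect the text credits for the shorter scan and allocation times.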
Except for the above mentioned re-directions or modifica-
tions, we do not modify any other existing kernel functions.
4.4 Remote I/O Interface
Similar to existing works [9, 31], Leap uses an agent in
each host machine to expose a remote I/O interface to the
VFS/VMM over RDMA. The host machine’s agent
communicates its resource demand to a remote agent
and performs remote memory mapping. The whole remote
memory space is logically divided into fixed-size memory
slabs. A host agent can map slabs across one or more remote
machine(s) according to its resource demand, load balancing,
and fault tolerance policies.
The host agent maintains a per CPU core RDMA connec-
tion to the remote agent. We use the multi-queue IO queuing
mechanism where each CPU core is configured with an indi-
vidual RDMA dispatch queue for staging remote read/write
requests. Upon receiving a remote I/O request, the host gen-
erates/retrieves a slot identifier, extracts the remote memory
address for the page within that slab, and forwards the re-
quest to the RDMA dispatch queue to perform read/write
over the RDMA NIC. During the whole process, Leap com-
pletely bypasses the expensive block layer operations.
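As a rough illustration of the request path above, the following Python sketch maps a page offset to a slab and stages the request on a per-core queue. Several details are assumptions made for the example and are not given in this section: the 1 GiB slab size, the HostAgent class and its method names, and the tuple used to represent a request. Real RDMA verbs and the slot identifier are not modeled.

```python
from collections import deque

PAGE_SIZE = 4096        # 4KB pages, as used in the evaluation
SLAB_SIZE = 1 << 30     # assumed slab size (1 GiB); not specified here


class HostAgent:
    """Toy model of slab mapping and per-core dispatch (illustrative only)."""

    def __init__(self, num_cores: int) -> None:
        self.slab_map = {}  # slab index -> (remote host, remote base address)
        self.dispatch = [deque() for _ in range(num_cores)]  # one queue per core

    def map_slab(self, slab_idx: int, host: str, base_addr: int) -> None:
        """Record which remote machine backs a given slab."""
        self.slab_map[slab_idx] = (host, base_addr)

    def submit(self, core: int, op: str, page_offset: int) -> tuple:
        """Translate a page offset and stage the request on the core's queue."""
        slab_idx, within = divmod(page_offset * PAGE_SIZE, SLAB_SIZE)
        host, base = self.slab_map[slab_idx]
        request = (op, host, base + within, PAGE_SIZE)
        self.dispatch[core].append(request)
        return request
```

Keeping one queue per core means a request never contends with other cores while it is being staged, which is the motivation for the multi-queue design in the text.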
4.5 Resilience, Scalability, and Load Balancing
One can use the existing memory disaggregation frameworks
[9, 31, 65] and still have the performance benefits of Leap
while maintaining respective scalability and fault tolerance
characteristics. We do not claim any innovation here. In
our implementation, the host agent leverages the power of two
choices [53] to minimize memory imbalance across remote
machines, and remote in-memory replication is the default
fault tolerance mechanism in Leap.
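The power of two choices [53] can be sketched in a few lines of Python. The function names, the free-memory dictionary, and the 1 GB slab granularity below are assumptions for illustration; the actual agent protocol is not described in this section.

```python
import random


def pick_remote(free_memory: dict, rng: random.Random) -> str:
    """Sample two distinct machines and return the one with more free memory."""
    a, b = rng.sample(sorted(free_memory), 2)
    return a if free_memory[a] >= free_memory[b] else b


def place_slabs(free_memory: dict, num_slabs: int, slab_gb: int = 1, seed: int = 0):
    """Place slabs one at a time, updating free memory after each choice."""
    rng = random.Random(seed)
    free = dict(free_memory)       # work on a copy of the caller's view
    placement = []
    for _ in range(num_slabs):
        host = pick_remote(free, rng)
        free[host] -= slab_gb
        placement.append(host)
    return placement, free
```

Sampling only two candidates keeps each placement decision cheap while still steering slabs away from the more heavily loaded machine of each pair.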
5 EVALUATION
We have evaluated Leap over a 56 Gbps InfiniBand RDMA
network on CloudLab [3]. Our key results are as follows:
• Leap provides a faster data path to remote memory. La-
tency for 4KB remote page accesses improves by up to
104.04× at the median and 22.06× at the tail in case of
Disaggregated VMM. In case of Disaggregated VFS, the
latency benefit is up to 24.96× at the median and 17.32×
at the tail (§5.1).
• Our prefetching algorithm outperforms its counterparts
(Next-N-Line, Stride, and Linux Read-Ahead) by up to 1.62× in
terms of cache pollution and up to 10.47× in terms of cache misses.
It improves prefetch coverage by up to 37.51% (§5.2).
• Leap improves end-to-end application completion times
of unmodified PowerGraph, NumPy, VoltDB and Mem-
Cached by up to 9.84× and their throughput by up to
10.16× over existing memory disaggregation solutions
(§5.3).
Methodology. As mentioned earlier, we integrated Leap
inside the Linux kernel, both in its VMM and VFS data paths.
As a result, we evaluate its impact on three primary mediums.
• Local disks: Here, Linux swaps to a local HDD and SSD.
• Disaggregated VMM (D-VMM): To evaluate Leap’s benefit
for disaggregated VMM system, we integrate Leap with
the latest commit of Infiniswap on GitHub [4].
• Disaggregated VFS (D-VFS): To evaluate Leap’s benefit for a
disaggregated VFS system, we add Leap to Remote Regions
[9] that we implemented as it is not open-source.
For both memory disaggregation systems, we use their
respective load balancing and fault tolerance mechanisms. Unless
otherwise specified, we use an AccessHistory buffer size Hsize
= 32 and a maximum prefetch window size PWsizemax = 8.
Each machine in our evaluation has 64 GB of DRAM and
2× Intel Xeon E5-2650v2 with 32 virtual cores supporting
AVX instructions.
5.1 Microbenchmark
We start by analyzing Leap’s latency characteristics with the
two simple access patterns described in Section 2.
During sequential access, due to prefetching, 80% of the
total page requests hit the cache in the default mechanism.
On the other hand, during stride access, all prefetched pages
brought in by the Linux prefetcher are unused and every
page access request experiences a cache miss.
[Figure 7 plots: CDF of latency (us, log-scale x-axis from 0.01 to 10000) for D-VMM, D-VMM+Leap, D-VFS, and D-VFS+Leap. Panels: (a) Sequential, (b) Stride-10.]
Figure 7: Leap provides lower 4KB page access latency
for both sequential and stride access patterns.
Due to Leap’s faster data path, for Sequential, it improves
the median by 4.07× and 99-th percentile by 5.48× for disag-
gregated VMM (Figure 7a). For Stride-10, as the prefetcher
can detect strides efficiently, Leap performs almost as well as
it does during the sequential accesses. As a result, in terms of
4KB page access latency, Leap improves disaggregated VMM
by 104.04× at the median and 22.06× at the tail (Figure 7b).
Leap provides similar performance benefit during memory
disaggregation through the file abstraction as well. During
sequential access, Leap improves 4KB page access latency
by 1.99× at the median and 3.42× at the 99th percentile.
During stride access, the median and 99th percentile latency
improves by 24.96× and 17.32×, respectively.
As the idea of using far/remote memory for storing cold
data is getting more popular these days [8, 31, 44], through-
out the rest of the evaluation, we focus only on remote paging
through a disaggregated VMM system.
5.2 Performance Benefit of the Prefetcher
In this section, we focus on the effectiveness of the prefetcher
itself for real-world applications with complex access
patterns. We choose the PowerGraph workload because it has
a significant amount of all three (stride, sequential, and
irregular) remote memory access patterns.
We start with dissecting the latency contribution of our
prefetcher. Then, we evaluate its efficiency over existing
prefetchers. For the latter, we run PowerGraph on disk to
separate only the prefetching algorithm’s benefit without
any other data path optimizations.
5.2.1 Performance Benefit Breakdown. Figure 8a shows
the performance benefit breakdown for each component of
the Leap data path. For PowerGraph at a 50% memory limit, due
to data path optimizations, Leap provides single-digit
µs latency for 4KB page accesses up to the 95th percentile.
Inclusion of the prefetcher ensures sub-µs 4KB page access
latency up to the 85th percentile and improves the 99th
percentile latency by 11.4% over Leap’s optimized data path.
[Figure 8 plots. (a) Benefit Breakdown: CCDF (%) of 4KB page access latency (μs, 0 to 40) for data path optimizations; data path optimizations + prefetcher; data path optimizations + prefetcher + eviction. (b) Prefetcher with Slow Storage, completion time (s): HDD + Leap Prefetcher 263.9; HDD + Read-Ahead 424.47; SSD + Leap Prefetcher 206.65; SSD + Read-Ahead 257.55.]
Figure 8: The prefetcher provides performance benefit
for different storage systems.
[Figure 9 plots, series in order: Next-N-Line, Stride, Read-Ahead, Leap. (a) Impact on cache, count (millions): cache adds 4.9, 3.9, 3.9, 3.0; cache misses 1.1, 1.6, 0.3, 0.2. (b) Application Performance, completion time (s): 683.9, 885.9, 462.5, 263.9.]
Figure 9: The prefetcher improves performance by re-
ducing both cache pollution and cache miss events.
The eager eviction policy reduces the page cache allocation
time and improves the tail latency by another 22.2%.
5.2.2 Performance Benefit for Slow Storage. To observe
the usefulness of the prefetcher for slow disk access, we
incorporate it into Linux’s default data path while paging to
disk. Due to the majority-based prefetching algorithm, the
overall application run time improves by 1.25× and 1.61×
over the default prefetcher on SSD and HDD, respectively
(Figure 8b).
5.2.3 Prefetch Utilization. Here, we run PowerGraph on
disk (with the existing block-layer-based data path) with a 50%
memory limit and compare our prefetching algorithm with
the following practical, real-time prefetching techniques:
• Next-N-Line Prefetcher [52] aggressively brings N pages
sequentially mapped to the page with the cache miss if
they are not in the cache.
• Stride Prefetcher [13] brings pages following a stride pat-
tern relative to the current page upon a cache miss. The
aggressiveness of this prefetcher depends on the accuracy
of the past prefetch.
• Linux Read-Ahead prefetches an aligned block of pages
containing the faulted page [72]. Linux uses prefetch hit
count and an access history of size 2 to control the aggres-
siveness of the prefetcher.
[Figure 10 plots, series in order: Next-N-Line, Stride, Read-Ahead, Leap. (a) Correctness of Prefetch, percentage (%): accuracy 55, 46, 45, 44; coverage 71, 52, 87, 90. (b) Timeliness of Prefetch: CDF of time (ms, log-scale x-axis from 0.0001 to 100000).]
Figure 10: Performance analysis of Leap’s prefetcher.
Impact on the Cache. As the volume of data fetched into the
cache increases, the prefetch hit rate increases. However,
thrashing begins as soon as the working set exceeds cache capacity.
As a result, useful demand-fetched pages are evicted.
Figure 9a shows that Leap’s prefetcher uses 28.15%–62.13%
fewer page caches than the other prefetching algorithms.
A successful prefetcher reduces the number of cache
misses by bringing the most accurate pages into the cache. Leap’s
prefetcher has the fewest cache misses: it experiences 7.19×,
10.47×, and 1.736× fewer cache miss events w.r.t. Next-N-Line,
Stride, and Read-Ahead, respectively (Figure 9a).
Application Performance. Due to the reduction in
cache pollution and cache misses, PowerGraph experiences
the lowest completion time when using Leap’s
prefetcher. It experiences 2.59×, 3.36×, and 1.75×
higher completion time than with Leap when using Next-N-Line,
Stride, and Read-Ahead, respectively (Figure 9b).
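The ratios quoted above can be checked against the bar values of Figure 9b with a short calculation. The sketch assumes the four values (683.9, 885.9, 462.5, 263.9 seconds) map to Next-N-Line, Stride, Read-Ahead, and Leap in legend order; the helper name slowdown_vs is hypothetical.

```python
# Completion times in seconds, assumed to follow the legend order of Figure 9b.
completion_s = {
    "Next-N-Line": 683.9,
    "Stride": 885.9,
    "Read-Ahead": 462.5,
    "Leap": 263.9,
}


def slowdown_vs(baseline: str, times: dict) -> dict:
    """Return each prefetcher's completion time relative to the baseline."""
    return {
        name: round(t / times[baseline], 2)
        for name, t in times.items()
        if name != baseline
    }


ratios = slowdown_vs("Leap", completion_s)
```

Under that assumed mapping, the computed ratios agree with the 2.59×, 3.36×, and 1.75× figures in the text.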
Effectiveness. If a prefetcher brings every possible page into
the page cache, then it will be 100% accurate. However, in
reality, one cannot have infinite cache space due to large
data volumes and/or multiple applications running on the
same machine. Besides, optimistically bringing in pages may
create cache contention, which reduces overall performance.
Leap’s prefetcher trades slightly lower accuracy for less
cache pollution. Compared to the other prefetchers, it shows
0.9–10.88% lower accuracy (Figure 10a). This accuracy loss is
linear in the number of cache adds performed by the prefetchers. As
all the other prefetchers bring in lots of pages, their chances
of getting lucky hits also increase. Although Leap has the
lowest accuracy, its higher coverage (by 3.06–37.51%) (Figure 10a)
allows it to serve accurate prefetches at a lower cache
pollution cost. At the same time, it improves
timeliness (Figure 10b) over Read-Ahead (Next-N-Line) by 12.37×
(13.9×) at the median and 12.47× (1.52×) at the tail. Due to
the higher coverage, better timeliness, and almost similar
accuracy, Leap’s prefetcher thus outperforms others in terms of
application-level performance (Figure 9b). Note that despite
[Figure 11 plots: Disk, D-VMM, and D-VMM+Leap at 100%, 50%, and 25% memory limits; some bars in panels (a), (c), and (d) are marked "Never finishes". Panels: (a) PowerGraph Completion Time (s), (b) NumPy Completion Time (s), (c) VoltDB Throughput (TPS, thousands), (d) Memcached Throughput (OPS, thousands).]
Figure 11: Leap provides lower completion times and higher throughput over Infiniswap’s default data path for
different memory limits. Note that lower is better for completion time, while higher is better for throughput.
having the best timeliness, Stride has the worst coverage, which
impedes its overall performance and yields the worst completion time.
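To make the accuracy and coverage comparison concrete, the sketch below computes both metrics from raw counters. The definitions are assumptions, since this section does not restate them: accuracy is taken as prefetch hits divided by pages added to the cache by prefetching, and coverage as prefetch hits divided by total page requests. The function name prefetch_metrics is hypothetical.

```python
def prefetch_metrics(prefetched: int, prefetch_hits: int, total_requests: int):
    """Return (accuracy, coverage) under the assumed definitions.

    accuracy = prefetch_hits / prefetched      (how many prefetches were used)
    coverage = prefetch_hits / total_requests  (how many requests were served)
    """
    accuracy = prefetch_hits / prefetched if prefetched else 0.0
    coverage = prefetch_hits / total_requests if total_requests else 0.0
    return accuracy, coverage
```

With illustrative counts of 200 prefetched pages, 90 hits, and 100 requests, the function returns an accuracy of 0.45 and a coverage of 0.9. Under these definitions, a prefetcher can have modest accuracy and still serve most requests from the cache.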
5.3 Leap’s Overall Impact on Applications
Finally, we evaluate the overall benefit of Leap (including all
its components). We use four real-world memory-intensive
application and workload combinations with different data
access patterns (Figure 3) used in prior works:
• Twitter dataset [43] on PowerGraph [28];
• Matrix multiplication on NumPy [57];
• TPC-C benchmark [7] on VoltDB [70];
• Facebook workloads [12] on Memcached [5].
The peak memory usage of these applications varies from
9 GB to 38.2 GB. Unless otherwise mentioned, we use the
same workload and application parameters as in prior works
[9, 31]. To prompt remote paging, we limit an application’s
memory usage to 100%, 50%, and 25% of its peak memory
usage through cgroups [2], a Linux kernel feature to control
and monitor a process’s system resource usage. Here, we
consider an extreme memory constraint (e.g., 25%) to
validate the applicability of Leap to recent resource (memory)
disaggregation frameworks that are expected to operate on
a minimal amount of local memory [65].
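The memory limits above can be derived with a small helper. This sketch makes several assumptions that are not stated in the text: peak usage is treated as GiB, the cgroup v1 memory controller is mounted at /sys/fs/cgroup/memory, and the group name passed in is hypothetical. It only computes byte values and the control-file path and does not write to the system.

```python
GIB = 1 << 30


def memory_limits(peak_gib: float, fractions=(1.0, 0.5, 0.25)) -> dict:
    """Map each fraction of peak usage to a limit in bytes."""
    return {f: int(peak_gib * f * GIB) for f in fractions}


def cgroup_v1_limit_file(group: str) -> str:
    """Path of the cgroup v1 memory limit file for a (hypothetical) group."""
    return f"/sys/fs/cgroup/memory/{group}/memory.limit_in_bytes"
```

On a cgroup v1 system, writing one of the computed byte values to the returned path (as root) and adding the application's PID to the group's cgroup.procs file would enforce the limit.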
5.3.1 PowerGraph. PowerGraph suffers significantly from
cache misses in Infiniswap (Figure 11a). In contrast, Leap
experiences more cache hits, as its prefetcher can detect 19.03%
more remote page access patterns than Read-Ahead. The
faster the prefetch cache hit happens, the faster the eager
cache eviction mechanism frees up page caches, which
eventually helps allocate pages faster for new prefetches. Besides,
due to more accurate prefetching, Leap reduces the wastage
of both cache space and RDMA bandwidth. This improves
4KB remote page access time by 8.17× and 2.19× at the 99th
percentile for the 50% and 25% cases, respectively. Overall,
integrating Leap into Infiniswap improves the completion time
by 1.56× and 2.38× for the 50% and 25% cases, respectively.
5.3.2 NumPy. We use NumPy to perform a matrix
multiplication over two large matrices, an operation that is common
in computational applications of linear algebra (e.g.,
machine learning). We load two matrices of non-zero floating-point
values with 100k × 100 and 50k × 100 dimensions from a
previously stored file and compute their matrix dot product.
Here, the peak memory usage is 38.2 GB.
Leap can detect most of the remote page access patterns
(10.4% better than Linux’s default prefetcher). As a result,
similar to PowerGraph, for NumPy, Leap improves the com-
pletion time by 1.27× and 1.4× for Infiniswap at 50% and 25%
memory limit, respectively (Figure 11b). The 4KB page access
time improves by 5.28× and 2.88× at the 99-th percentile at
50% and 25% cases, respectively.
5.3.3 VoltDB. Latency-sensitive applications like VoltDB
suffer significantly from paging overhead. During paging, due
to Linux’s slower data path, Infiniswap suffers (65.12% and
95.72% lower throughput than local memory behavior at the 50%
and 25% limits, respectively). In contrast, Leap’s better prefetching
(11.6% better than Read-Ahead) and instant cache eviction
improve the 4KB page access time: the 99th percentile is 2.51×
and 2.7× better at the 50% and 25% limits, respectively.
However, while executing short random transactions, VoltDB has
an irregular page access pattern (69% of total remote page
accesses). At that time, the Leap prefetcher’s adaptive throttling
helps the most by not congesting the RDMA network. Overall, Leap
faces a smaller throughput loss (3.78% and 57.97% lower than
local memory behavior at the 50% and 25% memory limits,
respectively). Leap improves Infiniswap’s throughput by 2.76×
and 10.16× for the 50% and 25% configurations, respectively
(Figure 11c).
5.3.4 MemCached. This workload has a mostly random
remote page access pattern. Leap’s prefetcher can detect most
of it and avoids prefetching in the presence of randomness.
This results in fewer remote requests and less cache
pollution. As a result, Leap provides MemCached with almost
local-memory-level behavior at the 50% memory limit while
[Figure 12 plots, prefetch cache size (MB): No Limit, 320, 32, 3.2. (a) Completion Time (s): PowerGraph 143.2, 155.3, 158.5, 160.2; NumPy 660.2, 726.1, 734.3, 739.6. (b) Throughput (TPS, thousands): VoltDB 35.6, 33.7, 31.6, 31.0; MemCached 119.0, 119.0, 118.0, 118.0.]
Figure 12: Leap has minimal performance drop for In-
finiswap even in the presence of O(1) MB cache size.
the default data path of Infiniswap faces 10.1% throughput
loss (Figure 11d). At 25% memory limit, Leap deviates from
the local memory throughput behavior by only 1.7%. Here,
the default data path of Infiniswap faces 18.49% throughput
loss. In this phase, Leap improves Infiniswap’s throughput by
1.11× and 1.21× at 50% and 25% memory limits, respectively.
Leap provides 5.94× and 1.08× better 99th-percentile
4KB page access time at the 50% and 25% limits, respectively.
5.3.5 Performance Under Constrained Cache Size. To
observe Leap’s performance benefit in the presence of a limited
prefetch cache size, we run the four applications in the 50%
memory limit configuration at different cache limits (Figure 12).
For MemCached, as most of the accesses follow a random
pattern, most of the performance benefit comes from Leap’s
faster slow path. For the rest of the applications, as the
prefetcher has better timeliness, most of the prefetched
caches get used and evicted before the cache size hits the
limit. For this reason, with an O(1) MB cache size, all of these
applications face a minimal performance drop (11.87–13.05%)
compared to the unlimited cache space scenario. Note that,
for NumPy, a 3.2MB cache size is only 0.02% of its total remote
memory usage.
5.3.6 Multiple Applications Running Together. We run all
four applications simultaneously, each at its 50% memory
limit, and observe the performance benefit of Leap for
Infiniswap when multiple throughput-sensitive (PowerGraph, NumPy)
and latency-sensitive applications (VoltDB, MemCached)
concurrently request remote memory access (Figure 13).
As Leap isolates each application’s page access path, its
prefetcher can consider individual access patterns while mak-
ing prefetch decisions. Therefore, it brings more accurate re-
mote pages for each application and reduces contention over
the network. As a result, overall application performance
improves by 1.1–2.4× over Infiniswap. To enable aggregate
performance comparison, we present end-to-end completion
time of application-workload combinations defined earlier;
application-specific metrics improve as well.
[Figure 13 plot: completion time (s) for D-VMM vs. D-VMM+Leap. PowerGraph: 515.1 vs. 214.8; NumPy: 1429.6 vs. 836.7; VoltDB: 191.7 vs. 92.1; MemCached: 88.4 vs. 82.6.]
Figure 13: Leap improves application-level perfor-
mance when all four applications access remote mem-
ory concurrently.
6 RELATED WORK
Remote Memory Solutions. A good number of software
systems have been proposed over the years to access a remote
machine’s memory for paging [1, 19, 21, 25, 31, 44, 45, 50,
55, 64, 65], global virtual machine abstraction [6, 24, 42],
and distributed data stores and file systems [9, 20, 41, 47, 58].
Hardware-based remote access using PCIe interconnects [48]
or extended NUMA memory fabric [56] has also been proposed
to disaggregate memory. Leap is complementary to these
works.
Kernel Data Path Optimizations. With the emergence of
faster storage devices, several optimization techniques and
design principles have been proposed to fully utilize faster
hardware. Considering the overhead of the block layer, differ-
ent service level optimizations and system re-designs have
been proposed – examples include parallelism in batching
and queuing mechanism [15, 75], avoiding interrupts and
context switching during I/O scheduling [11, 18, 74, 76],
better buffer cache management [33] etc. During remote
memory access, optimization in data path has been pro-
posed through request batching [36, 37, 71], eliminating page
migration bottleneck [73], reducing remote I/O bandwidth
through compression [44], and network-level block devices
[46]. Leap’s data path optimizations are inspired by many of
them.
Prefetching Algorithms. Many prefetching techniques
exist to utilize hardware features [32, 34, 66, 80], compiler-
injected instructions [26, 39, 40, 60, 61], and memory-side
access pattern [22, 54, 67–69] for cache line prefetching.
They are often limited to specific access patterns or application
behavior, or require specific hardware designs. More
importantly, they are designed for a lower level of the memory
stack than Leap’s prefetcher.
A large number of entirely kernel-based prefetching tech-
niques have also been proposed to hide the latency over-
head of file accesses and page faults [17, 23, 30, 38, 72].
Among them, Linux Read-Ahead [72] is the most widely
used. However, it does not consider the access history to
make prefetch decisions. It was also designed for hiding disk
seek time. Therefore, its optimistic look-around approach
often results in lower cache utilization for remote memory
access.
To the best of our knowledge, Leap is the first to consider
a fully software-based kernel-level prefetching technique for
DRAM with remote memory as a backing storage over fast
RDMA-capable networks.
7 CONCLUSION
We propose a remote page prefetching algorithm, Leap, that
relies on majority-based pattern detection instead of strict
detection. We implement it in a leaner and faster data path
for remote memory access over RDMA without any mod-
ifications to the applications or hardware. Relying on a
more permissive, approximate mechanism to detect access
patterns, instead of looking for trends in strictly consecutive
accesses, makes Leap resilient to short-term irregularities.
We have integrated Leap with two major memory disag-
gregation systems (namely, Infiniswap and Remote Regions),
and Leap improves the median and tail remote page access
latencies by up to 104.04× and 22.62×, respectively, over the
default data path in Linux. This leads to application-level
performance improvements of 1.27–10.16× over the state-of-
the-art solutions. Applying Leap to slower storage systems
such as HDD and SSD leads to large performance benefits
too.
REFERENCES
[1] Accelio based network block device. https://github.com/accelio/NBDX.
[2] Cgroup. https://wiki.archlinux.org/index.php/cgroups.
[3] CloudLab. https://www.cloudlab.us.
[4] Infiniswap Github Repository. https://github.com/SymbioticLab/infiniswap.
[5] Memcached - A distributed memory object caching system. http://memcached.org.
[6] The Versatile SMP (vSMP) Architecture. http://www.scalemp.com/technology/versatile-smp-vsmp-architecture/.
[7] TPC Benchmark C (TPC-C). http://www.tpc.org/tpcc.
[8] Neha Agarwal and Thomas F Wenisch. 2017. Thermostat: Application-
transparent Page Management for Two-tiered Main Memory. In ASP-
LOS.
[9] Marcos K. Aguilera, Nadav Amit, Irina Calciu, Xavier Deguillard,
Jayneel Gandhi, Stanko Novaković, Arun Ramanathan, Pratap Subrah-
manyam, Lalith Suresh, Kiran Tati, Rajesh Venkatasubramanian, and
Michael Wei. 2018. Remote regions: a simple abstraction for remote
memory. In ATC 18.
[10] Marcos K Aguilera, Nadav Amit, Irina Calciu, Xavier Deguillard,
Jayneel Gandhi, Pratap Subrahmanyam, Lalith Suresh, Kiran Tati, Ra-
jesh Venkatasubramanian, and Michael Wei. 2017. Remote memory in
the age of fast networks. In SoCC.
[11] Ameen Akel, Adrian M. Caulfield, Todor I. Mollov, Rajesh K. Gupta,
and Steven Swanson. 2011. Onyx: A Protoype Phase Change Memory
Storage Array. In HotStorage’11.
[12] Berk Atikoglu, Yuehai Xu, Eitan Frachtenberg, Song Jiang, and Mike
Paleczny. 2012. Workload Analysis of a Large-scale Key-value Store.
SIGMETRICS Perform. Eval. Rev. (2012).
[13] Jean-Loup Baer and Tien-Fu Chen. 1991. An Effective On-chip Preload-
ing Scheme to Reduce Data Access Penalty. In Proceedings of the 1991
ACM/IEEE Conference on Supercomputing (Supercomputing ’91).
[14] Claude Barthels, Simon Loesing, Gustavo Alonso, and Donald Koss-
mann. 2015. Rack-Scale In-Memory Join Processing Using RDMA. In
SIGMOD.
[15] Matias Bjørling, Jens Axboe, David Nellans, and Philippe Bonnet. 2013.
Linux Block IO: Introducing Multi-queue SSD Access on Multi-core
Systems. In SYSTOR ’13.
[16] Robert S Boyer and J Strother Moore. 1991. MJRTY - A Fast
Majority Vote Algorithm. In Automated Reasoning. 105–117.
[17] Pei Cao, Edward W. Felten, and Kai Li. 1994. Implementation and
Performance of Application-controlled File Caching. In OSDI ’94.
[18] A. M. Caulfield, A. De, J. Coburn, T. I. Mollov, R. K. Gupta, and S. Swanson.
2010. Moneta: A High-Performance Storage Array Architecture
for Next-Generation, Non-volatile Memories. In MICRO.
[19] Haogang Chen, Yingwei Luo, Xiaolin Wang, Binbin Zhang, Yifeng Sun,
and Zhenlin Wang. 2008. A transparent remote paging model for vir-
tual machines. In International Workshop on Virtualization Technology.
[20] Aleksandar Dragojević, Dushyanth Narayanan, Orion Hodson, and
Miguel Castro. 2014. FaRM: Fast Remote Memory. In NSDI.
[21] Sandhya Dwarkadas, Nikolaos Hardavellas, Leonidas Kontothanassis,
Rishiyur Nikhil, and Robert Stets. 1999. Cashmere-VLM: Remote
memory paging for software distributed shared memory. In IPPS/SPDP.
[22] Viacheslav Fedorov, Jinchun Kim, Mian Qin, Paul V. Gratz, and
A. L. Narasimha Reddy. 2017. Speculative Paging for Future NVM
Storage. In MEMSYS ’17.
[23] Viacheslav Fedorov, Jinchun Kim, Mian Qin, Paul V. Gratz, and
A. L. Narasimha Reddy. 2017. Speculative Paging for Future NVM
Storage. In Proceedings of the International Symposium on Memory
Systems (MEMSYS ’17).
[24] Michael J Feeley, William E Morgan, EP Pighin, Anna R Karlin,
Henry M Levy, and Chandramohan A Thekkath. 1995. Implementing
global memory management in a workstation cluster. In SOSP.
[25] Edward W. Felten and John Zahorjan. 1991. Issues in the implementation
of a remote memory paging system. Technical Report. University of
Washington.
[26] Michael Ferdman, Cansu Kaynak, and Babak Falsafi. 2011. Proactive
Instruction Fetch. In MICRO-44.
[27] Peter X Gao, Akshay Narayan, Sagar Karandikar, Joao Carreira, Sangjin
Han, Rachit Agarwal, Sylvia Ratnasamy, and Scott Shenker. 2016. Net-
work requirements for resource disaggregation. In OSDI.
[28] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and
Carlos Guestrin. 2012. PowerGraph: Distributed Graph-Parallel Com-
putation on Natural Graphs. In OSDI 12.
[29] Joseph E Gonzalez, Reynold S Xin, Ankur Dave, Daniel Crankshaw,
Michael J Franklin, and Ion Stoica. 2014. GraphX: Graph processing in
a distributed dataflow framework. In OSDI.
[30] James Griffioen and Randy Appleton. 1994. Reducing File System
Latency Using a Predictive Approach. In USTC’94.
[31] J. Gu, Y. Lee, Y. Zhang, M. Chowdhury, and K. G. Shin. 2017. Efficient
Memory Disaggregation with Infiniswap. In NSDI.
[32] Akanksha Jain and Calvin Lin. 2013. Linearizing Irregular Memory
Accesses for Improved Correlated Prefetching. In MICRO-46.
[33] Song Jiang, Xiaoning Ding, Feng Chen, Enhua Tan, and Xiaodong
Zhang. 2005. DULO: An Effective Buffer Cache Management Scheme
to Exploit Both Temporal and Spatial Localities. In FAST.
[34] Doug Joseph and Dirk Grunwald. 1997. Prefetching Using Markov
Predictors. In ISCA ’97.
[35] Anuj Kalia, Michael Kaminsky, and David G Andersen. 2014. Using
RDMA efficiently for key-value services. In SIGCOMM.
[36] Anuj Kalia, Michael Kaminsky, and David G. Andersen. 2016. Design
Guidelines for High Performance RDMA Systems. In ATC ’16.
[37] Anuj Kalia, Michael Kaminsky, and David G Andersen. 2016. FaSST:
Fast, Scalable and Simple Distributed Transactions with Two-Sided
(RDMA) Datagram RPCs. In OSDI.
[38] Scott F. Kaplan, Lyle A. McGeoch, and Megan F. Cole. 2002. Adaptive
Caching for Demand Prepaging. SIGPLAN Not. (2002).
[39] M. Khan, A. Sandberg, and E. Hagersten. 2014. A Case for Resource
Efficient Prefetching in Multicores. In ICPP.
[40] A. Kolli, A. Saidi, and T. F. Wenisch. 2013. RDIP: Return-address-stack
Directed Instruction Prefetching. In MICRO.
[41] Chinmay Kulkarni, Aniraj Kesavan, Tian Zhang, Robert Ricci, and
Ryan Stutsman. 2017. Rocksteady: Fast Migration for Low-latency
In-memory Storage. In SOSP.
[42] Yossi Kuperman, Joel Nider, Abel Gordon, and Dan Tsafrir. 2016. Par-
avirtual Remote I/O. In ASPLOS.
[43] Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. 2010.
What is Twitter, a Social Network or a News Media? In WWW ’10.
[44] Andres Lagar-Cavilla, Junwhan Ahn, Suleiman Souhlal, Neha Agarwal,
Radoslaw Burny, Shakeel Butt, Jichuan Chang, Ashwin Chaugule, Nan
Deng, Junaid Shahid, Greg Thelen, Kamil Adam Yurtsever, Yu Zhao,
and Parthasarathy Ranganathan. 2019. Software-Defined Far Memory
in Warehouse-Scale Computers. In ASPLOS ’19.
[45] Shuang Liang, Ranjit Noronha, and Dhabaleswar K Panda. 2005. Swap-
ping to remote memory over Infiniband: An approach using a high
performance network block device. In Cluster Computing.
[46] S. Liang, R. Noronha, and D. K. Panda. 2005. Swapping to Remote
Memory over InfiniBand: An Approach using a High Performance
Network Block Device. In 2005 IEEE International Conference on Cluster
Computing.
[47] Hyeontaek Lim, Dongsu Han, David G Andersen, and Michael Kamin-
sky. 2014. MICA: A Holistic Approach to Fast In-Memory Key-Value
Storage. In NSDI.
[48] Kevin Lim, Jichuan Chang, Trevor Mudge, Parthasarathy Ranganathan,
Steven K Reinhardt, and Thomas F Wenisch. 2009. Disaggregated
memory for expansion and sharing in blade servers. In ISCA.
[49] Xiaoyi Lu, Nusrat S. Islam, Md. Wasi-Ur-Rahman, Jithin Jose, Hari
Subramoni, Hao Wang, and Dhabaleswar K. Panda. 2013. High-
Performance Design of Hadoop RPC with RDMA over InfiniBand.
In ICPP ’13.
[50] Evangelos P Markatos and George Dramitinos. 1996. Implementation
of a Reliable Remote Memory Pager. In USENIX ATC.
[51] Marshall K. McKusick, William N. Joy, Samuel J. Leffler, and Robert S.
Fabry. 1984. A Fast File System for UNIX. ACM Trans. Comput. Syst.
(1984).
[52] Sparsh Mittal. 2016. A Survey of Recent Prefetching Techniques for
Processor Caches. ACM Comput. Surv. (2016).
[53] Michael Mitzenmacher, Andrea W. Richa, and Ramesh Sitaraman. 2001.
The Power of Two Random Choices: A Survey of Techniques and
Results. Handbook of Randomized Computing (2001), 255–312. Issue 1.
[54] K.J. Nesbit and J.E. Smith. 2005. Data Cache Prefetching Using a Global
History Buffer. IEEE Micro (2005).
[55] Tia Newhall, Sean Finney, Kuzman Ganchev, and Michael Spiegel. 2003.
Nswap: A network swapping module for Linux clusters. In Euro-Par.
[56] Stanko Novakovic, Alexandros Daglis, Edouard Bugnion, Babak Falsafi,
and Boris Grot. 2014. Scale-out NUMA. In ASPLOS.
[57] Travis Oliphant. NumPy: A guide to NumPy. USA: Trelgol Publishing.
http://www.numpy.org/ [Online; accessed <today>].
[58] Diego Ongaro, Stephen M Rumble, Ryan Stutsman, John Ousterhout,
and Mendel Rosenblum. 2011. Fast Crash Recovery in RAMCloud. In
SOSP.
[59] John Ousterhout, Parag Agrawal, David Erickson, Christos Kozyrakis,
Jacob Leverich, David Mazières, Subhasish Mitra, Aravind Narayanan,
Guru Parulkar, Mendel Rosenblum, Stephen M. Rumble, Eric Strat-
mann, and Ryan Stutsman. 2010. The Case for RAMClouds: Scalable
High Performance Storage Entirely in DRAM. (2010).
[60] Leeor Peled, Shie Mannor, Uri Weiser, and Yoav Etsion. 2015. Semantic
Locality and Context-based Prefetching Using Reinforcement Learning.
In ISCA ’15.
[61] Rodric M. Rabbah, Hariharan Sandanagobalane, Mongkol Ekpa-
nyapong, and Weng-Fai Wong. 2004. Compiler Orchestrated Prefetch-
ing via Speculation and Predication. In ASPLOS XI.
[62] Charles Reiss, Alexey Tumanov, Gregory R Ganger, Randy H Katz, and
Michael A Kozuch. 2012. Heterogeneity and dynamicity of clouds at
scale: Google trace analysis. In SoCC.
[63] Wolf Rödiger, Tobias Mühlbauer, Alfons Kemper, and Thomas Neu-
mann. 2015. High-speed Query Processing over High-speed Networks.
Proc. VLDB Endow. (2015).
[64] Ahmad Samih, Ren Wang, Christian Maciocco, Tsung-Yuan Charlie
Tai, Ronghui Duan, Jiangang Duan, and Yan Solihin. 2012. Evaluating
dynamics and bottlenecks of memory collaboration in cluster systems.
In CCGrid.
[65] Yizhou Shan, Yutong Huang, Yilun Chen, and Yiying Zhang. 2018.
LegoOS: A Disseminated, Distributed OS for Hardware Resource Dis-
aggregation. In OSDI 18.
[66] Timothy Sherwood, Suleyman Sair, and Brad Calder. 2000. Predictor-
directed Stream Buffers. In MICRO 33.
[67] Manjunath Shevgoor, Sahil Koladiya, Rajeev Balasubramonian, Chris
Wilkerson, Seth H. Pugsley, and Zeshan Chishti. 2015. Efficiently
Prefetching Complex Address Patterns. In MICRO-48.
[68] Stephen Somogyi, Thomas F. Wenisch, Anastasia Ailamaki, and Babak
Falsafi. 2009. Spatio-temporal Memory Streaming. In ISCA ’09.
[69] Santhosh Srinath, Onur Mutlu, Hyesoon Kim, and Yale N. Patt.
2007. Feedback Directed Prefetching: Improving the Performance
and Bandwidth-Efficiency of Hardware Prefetchers. In HPCA ’07.
[70] Michael Stonebraker and Ariel Weisberg. 2013. The VoltDB Main
Memory DBMS. IEEE Data Engineering Bulletin (2013).
[71] Shin-Yeh Tsai and Yiying Zhang. 2017. LITE Kernel RDMA Support
for Datacenter Applications. In SOSP ’17.
[72] Yair Wiseman and Song Jiang. 2009. Advanced Operating Systems
and Kernel Applications: Techniques and
Technologies. Information Science Reference - Imprint of: IGI Publish-
ing.
[73] Zi Yan, Daniel Lustig, David Nellans, and Abhishek Bhattacharjee.
2019. Nimble Page Management for Tiered Memory Systems. In ASPLOS
’19.
[74] Jisoo Yang, Dave B. Minturn, and Frank Hady. 2012. When Poll is
Better Than Interrupt. In FAST’12.
[75] Suli Yang, Tyler Harter, Nishant Agrawal, Salini Selvaraj Kowsalya,
Anand Krishnamurthy, Samer Al-Kiswany, Rini T. Kaushik, Andrea C.
Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2015. Split-level I/O
Scheduling. In SOSP ’15.
[76] Young Jin Yu, Dong In Shin, Woong Shin, Nae Young Song, Jae Woo
Choi, Hyeong Seog Kim, Hyeonsang Eom, and Heon Young Yeom.
2014. Optimizing the Block I/O Subsystem for Fast Storage Devices.
ACM Trans. Comput. Syst. (2014).
[77] Erfan Zamanian, Carsten Binnig, Tim Harris, and Tim Kraska. 2017.
The End of a Myth: Distributed Transactions Can Scale. Proc. VLDB
Endow. (2017).
[78] Qi Zhang, Mohamed Faten Zhani, Shuo Zhang, Quanyan Zhu, Raouf
Boutaba, and Joseph L Hellerstein. 2012. Dynamic energy-aware
capacity provisioning for cloud computing environments. In ICAC.
[79] Yiwen Zhang, Juncheng Gu, Youngmoon Lee, Mosharaf Chowdhury,
and Kang G. Shin. 2017. Performance Isolation Anomalies in RDMA.
In KBNets.
[80] Huaiyu Zhu, Yong Chen, and Xian-He Sun. 2010. Timing Local Streams:
Improving Timeliness in Data Prefetching. In ICS ’10.
