A Case for Asymmetric Non-Volatile Memory Architecture by Ma, Teng et al.
A Case for Asymmetric Non-Volatile Memory Architecture
Teng Ma1, Mingxing Zhang1,3, Kang Chen1, Xuehai Qian2 and Yongwei Wu1
1Tsinghua University
2University of Southern California
3Sangfor Inc
Abstract
The byte-addressable Non-Volatile Memory (NVM) is
a promising technology since it simultaneously provides
DRAM-like performance, disk-like capacity, and persistency.
The current NVM deployment is symmetric, where NVM
devices are directly attached to servers. Due to the higher
density, NVM provides much larger capacity and should
be shared among servers. Unfortunately, in the symmetric
setting, the availability of NVM devices is affected by the
specific machine it is attached to. High availability can be
realized by replicating data to NVM on a remote machine.
However, it requires full replication of data structure in local
memory, limiting the size of the working set.
This paper rethinks NVM deployment and makes a case for
the asymmetric non-volatile memory architecture, which de-
couples servers from persistent data storage. In the proposed
AsymNVM architecture, NVM devices (i.e., back-end nodes)
can be shared by multiple servers (i.e., front-end nodes) and
provide recoverable persistent data structures. The asymmet-
ric architecture is made possible by the high-performance
network (e.g., RDMA), and follows the recent industry trend
of resource disaggregation.
We build AsymNVM framework based on AsymNVM archi-
tecture that implements: 1) high performance persistent data
structure update; 2) NVM data management; 3) concurrency
control; and 4) crash-consistency and replication. The central
idea is to use operation logs to reduce the stall due to RDMA
writes and enable efficient batching and caching in front-end
nodes. To evaluate performance, we construct eight widely
used data structures and two applications based on Asym-
NVM framework, and use traces of industry workloads. In a
cluster with ten machines (at most seven machines to emulate
a 60GB NVM using DRAM with additional latency), the re-
sults show that AsymNVM achieves comparable (sometimes
better) performance to the best possible symmetric architecture
while avoiding all of its drawbacks through disaggregation.
Compared to the baseline AsymNVM, the speedup brought by
the proposed optimizations is drastic: 6∼22× across all
benchmarks.
1 Introduction
Emerging Non-Volatile Memory (NVM) is blurring the line
between memory and storage. These kinds of memories,
such as 3D X-Point [1], phase change memory (PCM) [2–
4], spin-transfer torque magnetic memories (STTMs) [5],
and memristors are byte-addressable, and provide DRAM-
like performance, high density, and persistency at the same
time. To unleash the potential of NVM, the existing solutions
attach NVM directly to processors [6–10], enabling high-
performance implementations of persistent data structures
using load and store instructions on local memory.
While accessing NVM via local memory provides good
performance, it may not be the most suitable setting in the
data center due to the desire of sharing NVM. Due to the
higher density, NVM can provide much larger capacity [11],
which would likely exceed the need of a single machine [12].
Moreover, in data center servers, resources are often
underutilized [13, 14]: Google’s study [15] shows resource
utilization lower than 40% on average. We expect that
persistent resource utilization will follow the same trend.
To enable NVM sharing, recent work [16] builds a dis-
tributed shared persistent memory system. The system pro-
vides a global, shared, and persistent memory space for a
pool of machines with NVMs attached to each at the main
memory bus. Once an NVM device is attached to a specific
machine, its data become unavailable when the host machine
goes down. It violates the availability requirement of many
data center applications. To ensure availability, one still
needs to replicate the data to a remote NVM [17]. However,
it requires full replication of data structures in local memory,
limiting the size of the working set.
In essence, these challenges are due to the symmetric nature
of the current NVM deployment [17–19]. To fundamentally
overcome the drawbacks, we rethink NVM deployment and
make a case for the asymmetric NVM architecture, in which
NVM devices are not associated with any individual machine
and are accessed only passively via a fast network. In this
AsymNVM architecture, the number of NVM devices can be much
arXiv:1809.09395v2 [cs.DC] 30 Jan 2019
smaller than the number of machines, and they can be pro-
vided as specialized “blades”.
AsymNVM architecture is an example of the recent trend
of disaggregation architecture, which was first proposed for
capacity expansion and memory resource sharing by Lim et
al. [20, 21]. As described by Gao et al. [22], disaggregated
architecture is a paradigm shift from servers each tightly inte-
grated with a small amount of various resources (i.e., CPU,
memory, storage) to the disaggregated data center built as a
pool of standalone resource blades and interconnected using
a network fabric. In industry, Intel’s RSD [23] and HP’s “The
Machine” [24] are state-of-the-art disaggregation architecture
products. Such systems are not limited by the capability of
a single machine and can provide better resource utilization
and the ease of hardware deployment. Due to these advan-
tages, it is considered to be the future of data centers [24–28].
In AsymNVM architecture, the NVM devices are instances
of disaggregated resources that are not associated with any
server. It also improves the availability since the crash of a
server will not affect NVM devices.
AsymNVM architecture (and resource disaggregation) re-
lies on the high-performance network, e.g., RDMA, which
provides the direct remote access capability with RDMA
verbs (RDMA Write, RDMA Read, etc.), as well as reliable
connection. While promising, the architecture poses key
challenges. First, a straightforward implementation that
replaces local store/load instructions with RDMA Write
and RDMA Read operations suffers from long network latency.
Although the throughput of RDMA over InfiniBand is comparable
to the throughput of NVM, the NIC cannot provide
enough IOPS for fine-grained data structure accesses. Second,
the management of the smaller local volatile memory needs to
be carefully designed to ensure high performance. Third, the
interface of back-end nodes needs to be simple enough to
ensure reliability while remaining efficient.
Based on AsymNVM architecture, this paper builds Asym-
NVM framework for implementing data structures and applica-
tions with high performance and availability. The framework
efficiently implements four components: 1) high-performance
persistent data structure updates, realized by introducing an
operation log to reduce the stall due to RDMA writes and enable
efficient batching and caching; 2) NVM data management
that handles non-volatile memory allocation and release, and
metadata storage; 3) concurrency control for lock-free and
lock-based data structures; and 4) crash consistency and
replication that ensure correct recovery and availability. Taken
together, these components let the AsymNVM framework address
the three challenges discussed above.
To evaluate the performance of AsymNVM framework, we
construct eight widely used data structures (i.e., stack, queue,
hash-table, skip-list, binary search tree, B+tree, multi-version
binary search tree, and multi-version B+tree) and two
applications (i.e., TATP and SmallBank) based on the AsymNVM
framework, and use traces of industry workloads. The data
structures and applications are executed in a cluster with ten
machines, of which at most seven are used to emulate
a 60GB NVM using DRAM with additional latency. The
results show that AsymNVM achieves comparable (sometimes
better) performance to the best possible symmetric architecture
while avoiding all of its drawbacks through disaggregation.
Compared to the baseline AsymNVM, the speedup brought by
the proposed optimizations is drastic: 6∼22× across all
benchmarks.
The remainder of this paper is organized as follows. Sec-
tion 2 discusses the current deployment of the NVM devices.
Section 3 presents the overview of AsymNVM architecture
and framework. Section 4 explains the mechanisms to ensure
efficient persistent updates. Section 5 discusses the NVM
data management. Sections 6 and 7 explain the details
of concurrency control, recovery, and replication. Section 8
evaluates AsymNVM architecture and framework. Section 9
discusses additional related work. Section 10 concludes the
paper.
2 Background and Motivation
We consider two current architectures of deploying NVM de-
vices: 1) single-node setting that only considers one machine
and one NVM device; 2) symmetric distributed setting, where
each machine in the cluster is attached with an NVM device.
2.1 Single-Node Local NVM
To leverage the DRAM-like performance of
byte-addressable NVM, recent studies consider a setting in which
the NVM device is directly accessed via the processor-memory
bus using load/store instructions. This design avoids the
overhead of legacy block-oriented file systems or databases.
Persistent memory also allows programmers to update persis-
tent data structures directly at byte level without the need for
serialization to storage.
Based on this setting, many kinds of persistent data
structures have been proposed. For example, CDDS-Tree [10] uses
multi-versioning to support atomic updates without logging.
NV-Tree [29] is a consistent and cache-optimized B+Tree, which
reduces CPU cache line flush operations. HiKV [30] con-
structs a hybrid index strategy to build a persistent key-value
store. Since all the data accesses are performed by local
store/load instructions, these implementations can offer the
best performance. Although these persistent data structures
can survive a failure of the machine, they are not accessible
while the machine is recovering or restarting.
2.2 Symmetric Distributed NVM
Symmetric architecture is widely used in distributed systems
(e.g., shared memory and distributed file systems). In sym-
metric architecture, each machine has its own NVM device.
To achieve good availability on top of persistency, one needs
to replicate its data structures to multiple NVM devices. Mo-
jim [17] implements this mechanism by adding two more
synchronization APIs (msync/gmsync) in the Linux kernel.
Specifically, it allows users to set up a pair consisting of a
primary node and a mirror node. Once these synchronization APIs
are invoked, Mojim efficiently replicates fine-grained data from the
primary node to the mirror node using an optimized RDMA-
based protocol. This synchronization is implemented by ap-
pending primary node’s logs with end marks to the mirror
node’s log buffer, thereby tolerating a failure of the primary
node. Mojim also allows users to set up several backup nodes
that hold only weakly consistent replicas of the data in the
primary node.
With a similar interface, Hotpot [16] extends Mojim to a
distributed shared persistent memory system. It provides a
global, shared, and persistent memory space for a pool of
machines with NVMs attached at the main memory bus. As a
result, applications can perform native memory load and store
instructions to access both local and remote data in this global
memory space, and can at the same time make their data
persistent and reliable. To achieve this, when a committed
page is written, Hotpot creates a local copy-on-write page
and marks it as dirty. These pages are not write-protected
until they become committed after an explicit invoking of the
synchronization APIs. At this point, the modifications to this
page from all nodes are finished.
The two systems are designed for the symmetric usage of
NVM. As a result, Mojim requires a full replication of the
data structure in local memory. Similarly, Hotpot assumes
that dirty pages can always be held in memory and uses only
a simple LRU-like mechanism to evict redundant and
committed pages.
3 AsymNVM Overview
3.1 AsymNVM Architecture
To overcome the drawbacks of the current single-node or
symmetric distributed NVM architecture, we make a case
for the asymmetric non-volatile memory architecture, i.e.,
AsymNVM architecture. In this asymmetric distributed setting,
the number of NVM devices can be much smaller than the
number of machines, and they can be even attached to only a
few specialized “blades”. Thus, each NVM device/blade is shared
by multiple client machines, and the memory space of these
client machines may be much smaller than the capacity of the
NVM devices.
Figure 1 shows the AsymNVM architecture, which in-
cludes two components performing different functions: 1)
back-end nodes that have NVM attached to their memory bus;
and 2) front-end nodes that actually operate the data structures
on NVM. In AsymNVM architecture, front-end nodes can
only access back-end nodes via fast network. Specifically,
[Figure 1: AsymNVM Architecture. Front-end nodes (DRAM: fast access, volatile; RNIC) reach back-end nodes (RNIC; FPGA/ASIC; NVM: slow access, persistent, huge capacity) through one-sided RDMA (lower latency, bypass).]
the relationship between front-end and back-end nodes is
“many-to-many” (i.e., a front-end node can access multiple
back-end nodes and a back-end node can also be shared by
multiple front-end nodes). Compared to the symmetric architecture,
AsymNVM architecture offers four advantages: a)
it follows the promising trend of disaggregated data centers;
b) it naturally matches the desire of sharing NVM devices;
c) it can ensure availability with multiple back-end nodes;
and d) the potentially simpler back-end nodes lead to better
reliability [31].
To the best of our knowledge, there are very few works
on asymmetric distributed NVM. Octopus [19] can support
asymmetric service with file interface but its current imple-
mentation is still using the symmetric setting. Mitsume [32] is
the only work that can naturally support asymmetric deploy-
ment of NVM devices for providing object store interface.
We go a step further to support byte-addressable data
structures.
3.2 Key Challenges
Despite the benefits, AsymNVM architecture also brings three
key challenges.
The first challenge is network latency. Although the bandwidth
of InfiniBand is comparable to that of NVM, the round-trip
time (RTT) of an RDMA operation is about 2 µs, which is roughly
ten times the latency of an NVM write (about 200 ns). If we simply replace
local read/write operations with RDMA Read and RDMA Write
operations, the performance will significantly degrade.
The second challenge is how to efficiently use the small
volatile space of the front-end nodes. Keeping a full copy of
the data structure in the front-end (like Mojim) can always
offer the best performance. However, it contradicts
the original purpose of the asymmetric setting. The high
density of NVM devices makes them capable of holding terabytes of
data. In contrast, the memory space of the front-end is
typically only tens of gigabytes. This asymmetry means
that only the data needed by the current working set should be
loaded into the front-end for caching.
The third challenge is how to design an interface for the
back-end nodes that is both simple and effective. The most
straightforward interface is to assume that the back-end nodes
can be programmed to perform arbitrary RPC calls. How-
ever, to ensure high reliability, the back-end nodes need to
be simple. Therefore, they should be only asked to perform
a small collection of simple APIs, such as remote memory
read/write, remote memory allocation/release, lock acquire/re-
lease, etc. With a limited interface that constitutes a small set
of fixed APIs, it is possible to implement the simple logic of
the back-end nodes in specialized hardware (e.g., using ASIC
or FPGA), instead of using a general-purpose CPU.
3.3 AsymNVM Framework Overview
To address the challenges, we build the AsymNVM framework,
a general framework for implementing high-performance data
structures based on AsymNVM architecture. The AsymNVM
framework assumes that all persistent data are hosted on the
NVM devices in the remote back-end nodes, and can be
much larger than the limited size of the local volatile memory
in the front-end nodes. Moreover, the back-end nodes are
always passive: they never actively start a communication
with the front-end nodes, but only passively respond to the
API invocations from the front-end.
We assume the back-end nodes are equipped with advanced
NIC that supports RDMA. In other words, the front-end nodes
can directly access their data via RDMA Read/Write opera-
tions. Although it is possible to implement any kind of data
structures with only RDMA verbs, the performance will suf-
fer. In addition, the back-end nodes need to expose methods
to allow front-end nodes to manage NVM data. To this end,
the AsymNVM framework implements two sets of simple and
fixed API functions in the back-end, beyond RDMA verbs.
The first set of APIs provides a transactional interface that
allows the front-end nodes to push a list of update logs to the
back-end for persistency, which is guaranteed to be executed
in an all-or-nothing manner. The transactional interface is
simple and has two variants. Specifically, a transaction can
include: 1) a collection of memory logs, i.e., {memory address,
value} pairs; or 2) an operation log, which includes the operation
and parameters applied to a certain data structure. The
operation log is used to reduce the stall due to remote persistency;
more details are discussed in Section 4. The
back-end nodes ensure that all these addresses are updated
atomically.
The second set of APIs handles memory management,
which includes remote memory allocation, releasing, and
global naming. The AsymNVM framework implements a
two-tier slab-based memory allocator [33]. The back-end tier
runs in the remote NVM to ensure persistency and provides
fixed-size blocks. The front-end tier handles finer-grained
memory allocations. Section 5 is dedicated to NVM
data management.
The AsymNVM framework supports the SWMR (Single
Writer Multiple Reader) access model through concurrency control
mechanisms. This means that if two front-end nodes perform
writes on the same address, they should be synchronized by
locks. In addition, the framework assumes that reads and
writes to the same address are also properly synchronized by
locks. Depending on the application, we can support lock-free
and lock-based data structures. We do not implement a dedicated
API for concurrency control (e.g., for a particular lock
mechanism); instead, we leverage existing RDMA primitives.
The details are discussed in Section 6.
Finally, the AsymNVM framework adopts a consensus-
based voting system to detect machine failures. The details
about recovery and replication are discussed in Section 7.
3.4 RDMA Operations
There are two common programming paradigms of RDMA.
The straightforward approach is the two-sided server-reply
[34, 35] paradigm, which directly replaces traditional
send and receive with RDMA Send and RDMA Recv. The
other is the server-bypass [36–38] paradigm using one-sided
RDMA, which requires the system to be redesigned to exploit this
feature [36–38]. The AsymNVM framework uses one-sided
RDMA to improve performance. In addition, it uses polling to
detect completions, similar to [39]. This means that the front-end
nodes can access the memory space on remote NVM devices
directly via RDMA Write, RDMA Read, and even RDMA
atomic operations without notifying the processing unit (e.g.,
CPU or FPGA/ASIC chips) on the remote side.
In AsymNVM architecture, back-end nodes need to manage
metadata consistently. RDMA provides several atomic
verbs that guarantee that any update to a 64-bit value is atomic.
Thus, we can apply RDMA atomic operations to critical
metadata, e.g., the root pointer of a data structure.
Due to the non-volatile nature of the remote NVM, the
data may be corrupted if the back-end crashes during a single
RDMA Write operation. AsymNVM framework guarantees
the data integrity via checksum.
4 Efficient Persistent Update
4.1 Basic Implementation
At a low level, the persistent data structure implementation
should support read/write operations. A read can return
data that is not yet persisted, but if there is a persistent
fence [40, 41], the read should return the persisted data if
it is produced before the fence. When a write (update) returns,
the data should always be persisted in the back-end NVM.
The straightforward implementation of the two operations
is to perform RDMA read and RDMA write on the back-end
nodes.
However, the simple implementation may incur consider-
able rounds of network communications. As we discussed ear-
lier, the network latency is much higher than memory writes,
thus the performance will suffer. In addition, the volatile
local memory is typically much smaller than the whole data
structure in remote NVM, so the front-end nodes need to use
proper data eviction mechanisms when using DRAM as the
cache.
4.2 Decoupled Memory Log Persistency
To reduce the performance impact of persistency,
DudeTM [9] uses a redo log and decouples the update
of the real data structure in NVM from the persistency of the redo
log. In other words, a write can return after the redo
log is persisted and does not need to wait for the data
structure modification. In AsymNVM framework, we also
use a memory log to improve performance. Unlike in prior work,
the front-end and back-end nodes in AsymNVM architecture
are distributed, so the only reasonable choice is to use a redo log.
In AsymNVM framework, each write (update) generates
a number of memory logs, and the back-end node provides
the transaction APIs to ensure that the memory logs are
persisted atomically, in an all-or-nothing manner. When the
memory logs in the transaction are persisted, the back-end
node sends back an acknowledgement, so that the write in
the front-end can return and is guaranteed to be durable. The
back-end node also guarantees that the modifications to the
real data structure are performed (i.e., the persisted logs are
replayed) in order, because the log is written sequentially.
Specifically, the transaction API is remote_tx_write.
The input parameter is a list of {address, value} pairs, each of
which consists of a memory address and a value that should
be written to this address. The back-end nodes keep two
areas: the data area holds the real data structures; the log
area records the transaction logs. The front-end can directly
read the data area, but any updates have to go through the
log area. To implement remote_tx_write, the framework
library constructs a contiguous set of memory logs and
appends it to the corresponding log area in remote NVM via a
single RDMA Write operation. The format of these memory
logs is shown in Figure 2. Every log entry includes an address,
a length, the data, and a one-byte flag at the head. This flag
indicates whether the value is in the memory log; it is used by an
optimization related to batching, which is discussed
in Section 4.3. A transaction produces several log entries,
a commit flag, and a checksum value. The checksum of a
transaction is recorded as the end mark and can be used to
validate the integrity of the appended log. After the back-end
node restarts, it uses the checksum of the
last transaction to validate whether all the log entries of this
transaction were flushed to the NVM.
Memory Log entry: Flag (1 Byte) | Addr (8 Bytes) | Length (4 Bytes) | Value (Len Bytes)
Transaction Log: Entry | ...... | Entry | CommitFlag (1 Byte) | Checksum (4 Bytes)
Operation Log: OperationType (1 Byte) | Parameters (M Bytes) | Checksum (4 Bytes)
Figure 2: Memory Log vs. Operation Log
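The memory-log transaction layout of Figure 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the paper does not name the checksum algorithm, so CRC32 is used here; the little-endian byte order and the constants FLAG_INLINE_VALUE and COMMIT_FLAG are hypothetical; and the helper names are not part of the AsymNVM API.

```python
from __future__ import annotations

import struct
import zlib

FLAG_INLINE_VALUE = 0  # hypothetical flag value: the entry carries the data itself
COMMIT_FLAG = 0xC1     # hypothetical one-byte commit marker


def encode_entry(addr: int, value: bytes, flag: int = FLAG_INLINE_VALUE) -> bytes:
    """Flag (1 B) | Addr (8 B) | Length (4 B) | Value (Length B)."""
    return struct.pack("<BQI", flag, addr, len(value)) + value


def encode_transaction(writes: list[tuple[int, bytes]]) -> bytes:
    """Entries, then CommitFlag (1 B), then Checksum (4 B, CRC32 assumed)."""
    body = b"".join(encode_entry(addr, value) for addr, value in writes)
    body += struct.pack("<B", COMMIT_FLAG)
    return body + struct.pack("<I", zlib.crc32(body))


def is_complete(tx: bytes) -> bool:
    """Check after a restart that the whole transaction reached the log."""
    if len(tx) < 5:
        return False
    body = tx[:-4]
    (stored,) = struct.unpack("<I", tx[-4:])
    return body[-1] == COMMIT_FLAG and zlib.crc32(body) == stored
```

Because the checksum covers every entry plus the commit flag, a transaction that was only partially written by an interrupted RDMA Write fails the check and can be discarded during recovery.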
The advantages of using the transactional API are 1) reducing
the persistency latency due to modification of the real
data structure; and 2) largely reducing the required rounds
of RDMA operations. Without the transaction API, multiple
rounds of RDMA operations are needed when writing to multiple
non-contiguous areas of the NVM, or a contiguous area
with a size larger than a cache line. Other works [42–44]
propose to add an additional flush operation to the RDMA
standard. However, such a solution will at least add the
latency of invoking this flush operation. Moreover, the
additional operation itself does not make the other RDMA
operations crash-consistent. Importantly, the implementation
of the transactional API is fixed and simple, improving the
reliability of back-end nodes.
4.3 Batching and Caching with Operation
Log
To further reduce the latency of data persistency, we propose
the notion of an operation log, which is shown at the bottom
of Figure 2. Unlike with memory logs, each write
incurs only one operation log, which contains the operation type,
parameters, and checksum. A write can return after the operation
log is persisted in the back-end node. Persisting an operation
log can be achieved by a single one-sided RDMA write to the
back-end node.
The crucial benefit of operation log is that it enables batch-
ing and caching. Once the operation logs are recorded, the
modifications on the real data structure can be postponed and
batched to improve the performance while ensuring crash
consistency (e.g., asynchronous execution to remove network
latency from the critical path, and combining redundant writes
to reduce write operations). This is because, even after a crash,
the proper state can be restored by replaying the operation
logs that are not executed (i.e., have not yet modified the data
area).
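A minimal sketch of this recovery-by-replay idea follows, using a dictionary as a stand-in for the data area. The record layout follows the operation log of Figure 2, but several details are assumptions made for illustration: the parameters are fixed to two 8-byte integers (key and value), the checksum is CRC32, the operation codes are invented, and the `applied` index (how many logged operations already reached the data area) is a hypothetical bookkeeping value.

```python
from __future__ import annotations

import struct
import zlib

OP_INSERT, OP_DELETE = 1, 2            # hypothetical operation type codes
BODY_FORMAT = "<Bqq"                   # OperationType | key | value (assumed parameters)
RECORD_SIZE = struct.calcsize(BODY_FORMAT) + 4


def encode_op(op_type: int, key: int, value: int = 0) -> bytes:
    """OperationType (1 B) | Parameters | Checksum (4 B, CRC32 assumed)."""
    body = struct.pack(BODY_FORMAT, op_type, key, value)
    return body + struct.pack("<I", zlib.crc32(body))


def decode_op(record: bytes):
    """Return (op_type, key, value), or None for a torn or corrupted record."""
    if len(record) != RECORD_SIZE:
        return None
    body = record[:-4]
    (stored,) = struct.unpack("<I", record[-4:])
    if zlib.crc32(body) != stored:
        return None
    return struct.unpack(BODY_FORMAT, body)


def recover(data_area: dict, op_log: list[bytes], applied: int) -> dict:
    """Replay the logged operations that had not yet modified the data area."""
    state = dict(data_area)
    for record in op_log[applied:]:
        op = decode_op(record)
        if op is None:  # incomplete tail record: stop replaying here
            break
        op_type, key, value = op
        if op_type == OP_INSERT:
            state[key] = value
        elif op_type == OP_DELETE:
            state.pop(key, None)
    return state
```

For example, with three logged operations of which only the first was applied before the crash, `recover` re-executes the remaining two against the persisted state.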
It is important to understand the difference between the
operation log and the memory log. With only the memory log, we
can realize only the “postpone” aspect: the real data structure
modification can be delayed as long as the memory log
is persisted. But it cannot achieve the “batched” aspect, because
the memory logs of each write need to be persisted with
a remote_tx_write. The operation log achieves batching
by combining the memory logs of multiple writes into one
remote_tx_write. At a lower level, the operation log reduces
the number of RDMA Write operations. With the memory log
only, each write needs at least two RDMA Write operations:
one for the commit and the others for at least one memory
log. A write typically needs more than one memory log,
thus the number of RDMA Write operations is usually larger
than two. With the operation log, each write still needs one
RDMA Write, but no commit is required, because the operation
log already serves the purpose of committing the operation.
The number of RDMA Write operations for the memory
[Figure 3: AsymNVM Framework Data Access Workflow. A DS operation in the front-end reads from cached pages (①), swaps pages in at page level (②), or reads the persistent DS in back-end NVM directly (③); writes are applied to cached pages (④), appended to the memory log area (⑤) and the operation log area (⊕), applied in the back-end (⑥), and replicated to the mirror replica (⑦).]
logs is smaller, since multiple writes can be opportunistically
coalesced into one RDMA Write, depending on the addresses. In
addition, the commit for the batched memory logs (not for
each write) also needs an RDMA Write.
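The comparison above can be made concrete with a simplified counting model. It is a back-of-the-envelope sketch, not a measurement: it assumes one RDMA Write per memory log plus one per commit in the memory-log-only case, and it assumes that the memory logs of a batch coalesce into `coalesced_per_batch` RDMA Writes (one in the best case). Both function names and the batch size are illustrative.

```python
def writes_memory_log_only(logs_per_op):
    """Each operation: one RDMA Write per memory log plus one for its commit."""
    return sum(n + 1 for n in logs_per_op)


def writes_with_operation_log(num_ops, batch_size, coalesced_per_batch=1):
    """Each operation: one RDMA Write for its operation log.
    Each batch: the coalesced memory-log writes plus one commit."""
    batches = -(-num_ops // batch_size)  # ceiling division
    return num_ops + batches * (coalesced_per_batch + 1)
```

Under these assumptions, eight operations that each produce three memory logs cost 32 RDMA Writes with memory logs only, versus 12 when operation logs are used with a batch size of four.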
Figure 3 shows the workflow of AsymNVM framework
data accesses. Each data-structure-level modification operation
typically needs to first read the data and then write to
perform the modifications. Accordingly, we divide each operation
into two parts and use a Gather-Apply model: gathering
data and applying the modifications. A read-only operation
needs only the gather part. We use the terms gather and
apply, instead of read and write, to explain the holistic flow
including batching and caching, which may involve multiple
reads/writes. Batching can execute multiple operations together
and coalesce memory logs to reduce the number
of both RDMA Write and RDMA Read operations. In addition, caching
reduces the number of RDMA Read operations during the
gathering phase. Both are applicable to all data structures.
Gather Data: The data are fetched from the front-end
cache whenever possible (cache hit, ①). On a cache miss, 1)
the data are either read from the back-end directly by
using remote_nvm_read (③) or, 2) the corresponding page
is swapped in (②) via remote_nvm_read and put into the
cache in the front-end memory; the data are then read from the
front-end cache via ①. The choice between these two strategies
depends on the specific data structure and follows the principle
of using swap-in (②) for cold data and remote_nvm_read
(③) for hot data. Hot data (e.g., the root of a B+Tree) are
accessed more frequently than cold data (e.g., the leaf of a B+Tree).
On a persistent fence, a read after the fence needs to wait
until the memory logs before the fence are persisted in the back-end
node.
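The gather step can be sketched as follows. This is a toy model, not the framework's code: `FakeBackend` stands in for the remote NVM, `remote_nvm_read` is written with underscores as an assumed spelling of the API name, the 4 KB page size is arbitrary, and the caller decides per access (via the hypothetical `swap_in` argument) whether a miss swaps the page in or reads the bytes directly. Reads are assumed not to cross a page boundary.

```python
PAGE_SIZE = 4096  # assumed page size; the paper makes it adjustable per data structure


class FakeBackend:
    """In-memory stand-in for a back-end node's NVM data area."""

    def __init__(self, size):
        self.nvm = bytearray(size)
        self.reads = 0  # number of simulated remote reads

    def remote_nvm_read(self, addr, length):
        self.reads += 1
        return bytes(self.nvm[addr:addr + length])


class FrontEnd:
    def __init__(self, backend):
        self.backend = backend
        self.cache = {}  # page number -> page bytes

    def gather(self, addr, length, swap_in):
        page_no, offset = divmod(addr, PAGE_SIZE)
        page = self.cache.get(page_no)
        if page is None:
            if not swap_in:
                # Direct read: fetch only the requested bytes, do not cache.
                return self.backend.remote_nvm_read(addr, length)
            # Swap-in: fetch the whole page and keep it in the front-end cache.
            page = self.backend.remote_nvm_read(page_no * PAGE_SIZE, PAGE_SIZE)
            self.cache[page_no] = page
        # Cache hit (possibly right after a swap-in): serve from local memory.
        return page[offset:offset + length]
```

After a page has been swapped in once, later gathers on the same page are served locally and cause no further remote reads.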
Apply Modification: Each modification operation causes
one operation log to be flushed to the back-end for recovery
(⊕). The operation log, with the format {insert op, key, value} as
shown in Figure 2, is put in the operation log area. Then,
the memory logs of format {address, data} are generated.
They do not need to be flushed immediately. We
replace the actual data in a memory log with a pointer to the
previously flushed operation log to reduce the size of the data written;
whether an entry holds data or a pointer is indicated by the “Flag”
of the memory log in Figure 2. This is correct because, after the
operation log is stored in
the back-end, the data structure modification is persistent and
recoverable. While flushing the logs, the cached data (if any)
are modified accordingly (④). If a number of operations
have executed successfully, or the buffer is full, the buffered
memory logs, together with an appended TX_COMMIT, are also
flushed to the back-end NVM via remote_tx_write (⑤).
These logs are then handled by the back-end (⑥) and replicated
to the mirror node (⑦) (Section 7 discusses details
of replication). If the back-end fails, the front-end handles
the exception, aborts the transaction, and clears the cache.
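A minimal sketch of the apply step is shown below, assuming a toy back-end that simply records what it receives. The method `append_op_log`, the `batch_ops` threshold, and the string marker "TX_COMMIT" are illustrative assumptions; `remote_tx_write` is written with underscores as an assumed spelling of the API name. The pointer-instead-of-data optimization and replication are left out.

```python
class RecordingBackend:
    """Toy back-end that records what the front-end sends."""

    def __init__(self):
        self.op_log = []  # operation log area
        self.tx_log = []  # memory log area: one item per remote_tx_write

    def append_op_log(self, op):  # hypothetical name for the operation-log append
        self.op_log.append(op)

    def remote_tx_write(self, entries):
        self.tx_log.append(list(entries))


class Writer:
    def __init__(self, backend, batch_ops=4):
        self.backend = backend
        self.batch_ops = batch_ops  # flush after this many operations (assumed policy)
        self.cache = {}             # addr -> value for cached data
        self.buffered = []          # memory logs not yet flushed
        self.pending_ops = 0

    def apply(self, op, writes):
        """op: e.g. ("insert", key, value); writes: [(addr, value)] it produces."""
        # The operation log is persisted first, so the write is durable here.
        self.backend.append_op_log(op)
        for addr, value in writes:
            self.buffered.append((addr, value))
            if addr in self.cache:  # keep cached data consistent
                self.cache[addr] = value
        self.pending_ops += 1
        if self.pending_ops >= self.batch_ops:
            self.flush()

    def flush(self):
        """Send all buffered memory logs plus a commit marker in one transaction."""
        if self.buffered:
            self.backend.remote_tx_write(self.buffered + ["TX_COMMIT"])
        self.buffered = []
        self.pending_ops = 0
```

With `batch_ops=2`, two consecutive inserts produce two operation-log appends but only a single `remote_tx_write` carrying both memory logs.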
To support a data structure even larger than the capacity
of the NVM in a single back-end node, AsymNVM framework
supports partitioning a distributed data structure across
multiple back-ends. Specifically, the distributed data structure
is partitioned by key hashing. When the front-end
node executes a data structure operation, it first locates the
appropriate back-end by hashing the key.
Then, the front-end reads or modifies data using
remote_nvm_read / remote_tx_write in the corresponding
back-end node. From this point on, the processing is similar to the
single back-end scenario.
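Key-hash partitioning can be illustrated with a short helper. The paper does not say which hash function is used; SHA-256 is chosen here only because it is stable across processes (unlike Python's built-in `hash` for bytes), and the function name is hypothetical.

```python
import hashlib


def backend_for_key(key: bytes, num_backends: int) -> int:
    """Map a key to the index of the back-end node that stores it."""
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_backends
```

Every front-end that uses the same function and the same number of back-ends routes a given key to the same node, so no directory lookup is needed before issuing the read or the transactional write.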
4.4 Data Cache in Front-end
Several recent works build NVM systems using DRAM as
cache [9, 30, 45, 46]. Bw-tree [47] uses a cache layer to map
logical pages (Bw-tree nodes) to physical pages. We similarly
use a hash map to translate the addresses of
data structure nodes in NVM to addresses in DRAM. Each item
in the hash map represents a cached page. The page size is
adjustable according to different data structures.
Our cache replacement policy combines the methods of
LRU (Least Recently Used) and RR (Random Replacement).
LRU works well in choosing hot data, but its implementation
is expensive. RR is easy to implement but does not
provide any guarantee of preserving hot data. We use a hybrid
approach: first choosing a random set of pages for
replacement (RR) and then selecting the least recently used page from
the set to discard (LRU). No page flush is needed because the
write workflow has already put the memory logs in the back-end
node. In a micro-benchmark, the hybrid approach (29.2%)
reduces the miss ratio by 33.5 percentage points compared to RR (62.7%)
when the size of the candidate set is 32, and achieves a miss
ratio similar to LRU with nearly 27.5% throughput improvement.
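The hybrid policy can be sketched as a sampled-LRU cache: on eviction, draw a random candidate set and discard the least recently used page within it. The class and parameter names are illustrative, and the logical clock used for recency is an implementation choice of this sketch, not something the paper specifies.

```python
import random


class SampledLRUCache:
    def __init__(self, capacity, sample_size=32, seed=None):
        self.capacity = capacity
        self.sample_size = sample_size  # size of the random candidate set
        self.pages = {}                 # page number -> data
        self.last_used = {}             # page number -> logical timestamp
        self.clock = 0
        self.rng = random.Random(seed)

    def _touch(self, page_no):
        self.clock += 1
        self.last_used[page_no] = self.clock

    def get(self, page_no):
        if page_no not in self.pages:
            return None
        self._touch(page_no)
        return self.pages[page_no]

    def put(self, page_no, data):
        if page_no not in self.pages and len(self.pages) >= self.capacity:
            self._evict()
        self.pages[page_no] = data
        self._touch(page_no)

    def _evict(self):
        # RR step: pick a random candidate set.
        k = min(self.sample_size, len(self.pages))
        candidates = self.rng.sample(list(self.pages), k)
        # LRU step: discard the least recently used candidate.
        # No write-back is needed: the memory logs are already in the back-end.
        victim = min(candidates, key=self.last_used.__getitem__)
        del self.pages[victim]
        del self.last_used[victim]
```

When the candidate set covers the whole cache the policy degenerates to exact LRU; with a small set it only approximates LRU but avoids maintaining a global recency order.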
4.5 Data Structure Specific Optimizations
The AsymNVM framework is general enough to implement different
persistent data structures with high performance. In this
section, we investigate several commonly used data structures
and propose data-structure-specific optimizations.
Stack/Queue We implement Stack and Queue on top of the List
data structure. Because the only data items that can be
accessed in a Stack or Queue are the headers or tails, and
these are accessed more frequently, the front-end nodes only
need to cache the nodes pointed to by them to reduce
remote_nvm_read calls. If there are not enough header and tail
data items in the cache, i.e., fewer than a threshold, the
front-end fetches the corresponding data into the cache.
Moreover, due to the access pattern, operations may be
combined, because operations are only allowed on the stack
header for Stack and on the queue tail for Queue. Thus,
pending pushes can be annulled by pops for Stack, and pending
enqueues can be annulled by dequeues for Queue. Such an
opportunity can be identified by checking the un-executed
operation logs in the front-end memory. For example, for a pop
operation on the stack, we first count the number of
un-executed push and pop operations in the operation log. If
the number of pushes is larger than the number of pops, there
is no need to access the data area. This operation-log-based
optimization reduces the number of RDMA reads and writes.
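The pop check described above can be sketched as follows. This is an illustration under stated assumptions: a Python list stands in for the data area in back-end NVM, the class name LoggedStack and the flush step are hypothetical, and a pop that cannot be served locally first applies the pending log to the data area.

```python
class LoggedStack:
    """Stack whose pops are served from un-executed pushes when possible."""

    def __init__(self, remote):
        self.remote = remote      # list standing in for the remote data area
        self.op_log = []          # un-executed ops: ("push", value) or ("pop", None)
        self.remote_reads = 0     # counts accesses to the data area

    def push(self, value):
        self.op_log.append(("push", value))

    def pop(self):
        pushes = sum(1 for op, _ in self.op_log if op == "push")
        pops = sum(1 for op, _ in self.op_log if op == "pop")
        if pushes > pops:
            # Some logged push is not yet annulled: find the most recent one.
            skip = 0
            for op, value in reversed(self.op_log):
                if op == "pop":
                    skip += 1
                elif skip > 0:
                    skip -= 1
                else:
                    self.op_log.append(("pop", None))
                    return value
        # Otherwise apply the pending log and read the data area.
        self._flush()
        self.remote_reads += 1
        return self.remote.pop()

    def _flush(self):
        for op, value in self.op_log:
            if op == "push":
                self.remote.append(value)
            else:
                self.remote.pop()
        self.op_log.clear()
```

In this sketch a push followed by a pop never touches the data area; only a pop with no pending push triggers a remote access.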
Tree-Like Data Structure Tree-like data structures have a
hierarchical organization. The nodes in higher levels (near
the root) are accessed more frequently than lower-level nodes.
Based on this natural property, we cache higher-level nodes
with higher priority. Specifically, the front-end sets a
threshold N, and nodes with a level larger than N are not
cached; they are accessed directly through RDMA Read. N is
dynamically adjusted according to the cache miss ratio α: if
α > 50%, N = N−1; if α < 25%, N = N+1; otherwise, N stays
unchanged. The native LRU algorithm treats higher-level and
lower-level nodes in the same way, and hence incurs frequent
cache misses. Compared with the primitive LRU algorithm, our
mechanism gives a “hint” to cache the hot nodes.
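The threshold rule can be written down directly. In this sketch the function names and the clamping bounds min_level and max_level are assumptions added for the example, and levels are assumed to be numbered from the root downward (root = 0).

```python
def adjust_level_threshold(n, miss_ratio, min_level=0, max_level=64):
    """Apply the rule: alpha > 50% lowers N, alpha < 25% raises N."""
    if miss_ratio > 0.50:
        n -= 1
    elif miss_ratio < 0.25:
        n += 1
    # Clamping is an assumption; the text does not specify bounds for N.
    return max(min_level, min(max_level, n))


def should_cache(node_level, n):
    """Nodes with a level larger than N bypass the cache."""
    return node_level <= n
```

A high miss ratio shrinks the cached region toward the root, so the remaining cache space is spent on the nodes that every lookup must traverse.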
In addition, because trees are sorted data structures, the
performance can be improved when the operations are sorted.
Based on this insight, we pack the sorted operations into
a vector operation. The operation goes from the root of
the tree down to the leaf nodes. The vector can then be
split accordingly. The operations in vector segments can be
executed in parallel.
Algorithm 1 shows vector insert, one kind of vector write
operation, on a binary search tree following the Gather-Apply
paradigm. It first reads the information needed to decide
where to insert the nodes and then applies the inserts at the
correct positions. Without batching, two read rounds are
needed if insert operations A and B read the same node. When
we execute A and B with one vector write operation, only one
read round is needed to access this node. Similarly, if
several operations modify the same NVM memory, they are
compacted into one NVM write in vector insert.
The caching and batching optimization described for tree-
like data structure can also be applied to skip-list. Specifically,
higher degree nodes in skip-list will be cached. Vector opera-
tion (containing sorted operations) for skip-list can similarly
reduce the number of RDMA Read calls.
Algorithm 1 Vector Insert
procedure VECTOR_INSERT(kvs)                  ▷ keys are sorted
    node ← root
    queue.push(<0, len, node>)
    while queue is not empty do
        begin, end, node ← queue.pop()
        mid ← binary_search(kvs.keys, begin, end, node.key)
        if node.left = null then
            ▷ construct a new sub tree
            sub_tree ← create_sub_tree(kvs[begin : mid])
            node.left ← sub_tree
        else
            ▷ continue with the left child
            queue.push(<begin, mid, node.left>)
        if node.right = null then
            sub_tree ← create_sub_tree(kvs[mid : end])
            node.right ← sub_tree
        else
            ▷ continue with the right child
            queue.push(<mid, end, node.right>)
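A runnable sketch of the vector insert idea for an in-memory binary search tree is given below. It is an illustration, not the framework's code: the names Node, build_subtree, and inorder are hypothetical, keys in one batch are assumed to be sorted and distinct, and overwriting the value of an existing key is an assumption about duplicate handling.

```python
from bisect import bisect_left
from collections import deque


class Node:
    def __init__(self, key, value):
        self.key, self.value = key, value
        self.left = self.right = None


def build_subtree(kvs, lo, hi):
    """Build a balanced subtree from the sorted slice kvs[lo:hi]."""
    if lo >= hi:
        return None
    mid = (lo + hi) // 2
    node = Node(*kvs[mid])
    node.left = build_subtree(kvs, lo, mid)
    node.right = build_subtree(kvs, mid + 1, hi)
    return node


def vector_insert(root, kvs):
    """Insert sorted (key, value) pairs; each existing node is read at most once."""
    if not kvs:
        return root
    if root is None:
        return build_subtree(kvs, 0, len(kvs))
    keys = [k for k, _ in kvs]
    queue = deque([(0, len(kvs), root)])
    while queue:
        begin, end, node = queue.popleft()
        # Split the batch: keys[begin:mid] go left, the rest go right.
        mid = bisect_left(keys, node.key, begin, end)
        right_begin = mid
        if mid < end and keys[mid] == node.key:
            node.value = kvs[mid][1]   # existing key: overwrite (assumption)
            right_begin = mid + 1
        if begin < mid:
            if node.left is None:
                node.left = build_subtree(kvs, begin, mid)
            else:
                queue.append((begin, mid, node.left))
        if right_begin < end:
            if node.right is None:
                node.right = build_subtree(kvs, right_begin, end)
            else:
                queue.append((right_begin, end, node.right))
    return root


def inorder(node):
    return [] if node is None else inorder(node.left) + [node.key] + inorder(node.right)
```

Because the batch is split at each visited node, two inserts that share a prefix of the search path read the shared nodes once instead of twice, which is the saving the text attributes to batching.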
5 NVM Data Management
5.1 Back-end Interface and Metadata
At the back-end nodes, we implement NVM management APIs, since
using only one-sided RDMA operations is inefficient. In
addition, since they provide the basic functions needed by all
applications, it is convenient to support them directly in the
back-end nodes, which reduces the network communication to a
single round for the RPC invocation. In the AsymNVM framework,
two memory management APIs are provided: remote_nvm_malloc and
remote_nvm_free. The front-end node can use them to allocate
and release NVM memory in the back-end nodes. For simplicity,
we only implement fixed-size memory management. Moreover, we
use a persistent bitmap to record the usage of NVM, with one
bit indicating the allocation status of each block. These two
design decisions ensure fast recovery. Since front-end nodes
connect to the back-end via one-sided RDMA, we use RFP [38] to
implement the interfaces. Because the front-end puts the
requests via RDMA Write and gets the responses via RDMA Read,
the back-end is passive and does not need to deal with any
network operation, which simplifies the implementation.
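A fixed-size allocator backed by a bitmap can be sketched as below. This is a minimal in-memory illustration: the class name BitmapAllocator and its method names are hypothetical, and the bytearray stands in for a bitmap that would reside in NVM.

```python
class BitmapAllocator:
    """Fixed-size block allocator; one bit records each block's status."""

    def __init__(self, num_blocks, block_size):
        self.num_blocks = num_blocks
        self.block_size = block_size
        self.bitmap = bytearray((num_blocks + 7) // 8)  # would live in NVM

    def _is_set(self, i):
        return bool(self.bitmap[i // 8] & (1 << (i % 8)))

    def malloc(self):
        """Return the offset of a free block and mark it allocated."""
        for i in range(self.num_blocks):
            if not self._is_set(i):
                self.bitmap[i // 8] |= 1 << (i % 8)
                return i * self.block_size
        raise MemoryError("no free block")

    def free(self, offset):
        i = offset // self.block_size
        if offset % self.block_size or not 0 <= i < self.num_blocks or not self._is_set(i):
            raise ValueError("not an allocated block")
        self.bitmap[i // 8] &= ~(1 << (i % 8)) & 0xFF

    @classmethod
    def recover(cls, bitmap, num_blocks, block_size):
        """Rebuild allocator state from the persisted bitmap alone."""
        allocator = cls(num_blocks, block_size)
        allocator.bitmap = bytearray(bitmap)
        return allocator
```

Recovery needs nothing beyond the bitmap itself, which is one way to read the claim that fixed-size blocks plus a persistent bitmap make recovery fast.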
The back-end nodes also need to store metadata for recovery,
since nothing is left on the front-end after a failure. In the
AsymNVM framework, the metadata are stored in locations that
are “well-known” to all front-end and back-end nodes. This is
the global naming space for recovery [48]. After restarting,
both the front-end and the back-end nodes know where to find
the needed information/data before recovery. Then, the
back-end node maps the virtual memory addresses to the
previously mapped NVM regions. With this mechanism, a pointer
to the back-end NVM is still valid after restarting: it is
mapped to the previous NVM location.
The following metadata are stored in the global naming
space. 1) During recovery, the front-end nodes need to know
Table 1: Comparison of Different Allocators. We set the slab
size as 128 Bytes and 1024 Bytes separately.
Type / Tput (MOPS)               Alloc   Free
Glibc                            21.0    57.0
Pmem                             1.42    1.38
RPC allocator                    0.33    0.88
Two-tier allocator (128 Bytes)   1.33    2.41
Two-tier allocator (1024 Bytes)  6.42    13.90
the address of the NVM area belonging to its data structure;
this NVM area includes the data and log areas. It is needed
for physical-to-virtual address translation for the
corresponding front-end node. 2) The front-end nodes need to
know the location of data structures. This is achieved by
storing the root reference of each data structure, e.g., the
address of the root node for a tree. 3) The allocation bitmaps
indicate whether a block of NVM is allocated. This information
is used to reconstruct the memory usage lists and quickly
recover the back-end allocator. 4) The addresses of the log
areas, the LPNs (Log Processing Number, indicating the next
entry in the memory log area), and the OPNs (Operation
Processing Number, indicating the last operation log whose
memory log is still not persistent) are used to find the logs
together with the location of the next logs. The back-end node
can use them to reproduce logs (memory log), and the front-end
node can use them to recover the data structure operations
(operation log).
5.2 Front-end Allocator
The design of the front-end allocator is inspired by the slab
allocator [33]. The back-end allocator provides slabs to the
front-end allocator, and the front-end manages these slabs at
a finer granularity. The slabs in the front-end are organized
in full/partial/empty lists according to how much capacity is
consumed in the corresponding page. To support finer
granularity allocation, we use a simple best-fit mechanism in
the front-end. To improve NVM utilization, a threshold is
defined as the maximum number of free blocks, and the
front-end nodes reclaim free blocks periodically. While
reclaiming, the front-end nodes send requests to the back-end
nodes via RPC to free the reclaimed slabs. When the allocation
size is larger than the size of a slab, the front-end node
directly allocates memory in the back-end using RPC and the
back-end interface.
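The two-tier idea can be sketched as follows. This is a simplified illustration: objects have a single fixed size instead of the best-fit policy mentioned above, and the class names BackEnd and FrontEndAllocator, the rpcs counter, and max_empty_slabs are hypothetical names introduced for the example.

```python
class BackEnd:
    """Stand-in for the back-end slab allocator reached via RPC."""

    def __init__(self, slab_size):
        self.slab_size = slab_size
        self.next_addr = 0
        self.free_slabs = []
        self.rpcs = 0             # counts simulated RPC round trips

    def alloc_slab(self):
        self.rpcs += 1
        if self.free_slabs:
            return self.free_slabs.pop()
        addr, self.next_addr = self.next_addr, self.next_addr + self.slab_size
        return addr

    def free_slab(self, addr):
        self.rpcs += 1
        self.free_slabs.append(addr)


class FrontEndAllocator:
    """Carves slabs into fixed-size objects without contacting the back-end."""

    def __init__(self, backend, obj_size, max_empty_slabs=1):
        self.backend = backend
        self.obj_size = obj_size
        self.per_slab = backend.slab_size // obj_size
        self.max_empty_slabs = max_empty_slabs
        self.free = {}            # slab base address -> free object offsets

    def malloc(self):
        for base, offsets in self.free.items():
            if offsets:
                return base + offsets.pop()
        base = self.backend.alloc_slab()          # only here an RPC is needed
        self.free[base] = [i * self.obj_size for i in range(self.per_slab)]
        return base + self.free[base].pop()

    def free_obj(self, addr):
        base = addr - addr % self.backend.slab_size
        self.free[base].append(addr - base)

    def reclaim(self):
        """Return fully free slabs beyond the threshold to the back-end."""
        empty = [b for b, offs in self.free.items() if len(offs) == self.per_slab]
        for base in empty[self.max_empty_slabs:]:
            del self.free[base]
            self.backend.free_slab(base)
```

With 1024-byte slabs and 128-byte objects, eight allocations cost one simulated RPC in this sketch, which is the amortization the two-tier design relies on.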
Benchmark. We compare the two-tier allocator of the AsymNVM
framework with a persistent allocator and the standard Linux
Glibc allocator. As Table 1 shows, Glibc achieves the highest
throughput (21.0 MOPS for allocation and 57.0 MOPS for free)
but without a persistence guarantee. The Pmem allocator is a
persistent allocator from the NVML project [49], and can reach
1.42 MOPS. With only the back-end (RPC) allocator, the
throughput is only 23% and 64% of that of the Pmem allocator
because of the network overhead. With the two-tier allocator,
the throughput is similar
Figure 4: Overall Multi-version Data Structure (diagram labels:
Old Root / New Root, Root Version ++, Copy On Write, Path
Copying for Update(A,B,C), Vector Insert of a sub tree for
Insert(D,E,F)).
Figure 5: Naturally Lock-Free in Skip-list (inserting a new
node C between existing nodes: <1> update next pointers; <2>
update previous pointers from bottom to top layer).
Algorithm 2 Writer Lock
procedure WRITER_LOCK
    ▷ spin until the lock is acquired
    while rdma_compare_and_swap(L, Locked) = Locked do
procedure WRITER_UNLOCK
    rdma_atomic_write(L, UnLocked)
to or even better than that of the Pmem allocator.
6 Concurrency Control
This section describes the concurrency control mechanisms that
support SWMR (Single Writer Multiple Reader) access patterns.
Depending on the application, AsymNVM can support both
lock-free and lock-based data structures.
6.1 Exclusive Write
Under the SWMR mode, write operations are exclusive.
Therefore, the writer should acquire an exclusive lock first. If
it succeeds, it fetches the LPN (Log Processing Number, refer
to Section 5.1 for its management), and then executes the
write operation. After finishing appending logs to the remote
NVM based on LPN, it should release this exclusive lock.
While the exclusive writes are being performed, other write
operations (if any) will be blocked until the current writer has
completed the current write operation.
As shown in Algorithm 2, we implement Writer Lock as a
distributed spin-lock using the RDMA atomic verb RDMA Compare
And Swap [50]. When releasing the lock, the writer resets it
via an atomic RDMA write, as in Writer Unlock of Algorithm 2.
In the AsymNVM framework, to handle failures while holding the
lock, every write lock acquire/release operation writes a
record (lock-ahead log) to the back-end node before appending
the memory logs. Thus, if the front-end crashes before
releasing the lock, we can identify during recovery that the
lock needs to be released.
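The spin-lock of Algorithm 2 can be simulated locally as below. This sketch does not use a real RDMA library: the RemoteWord class is a hypothetical stand-in for a lock word in back-end NVM, and a threading.Lock models the atomicity that the NIC would provide. The lock-ahead log is omitted.

```python
import threading

UNLOCKED, LOCKED = 0, 1


class RemoteWord:
    """Simulates a word in back-end NVM that supports atomic verbs."""

    def __init__(self, value=UNLOCKED):
        self._value = value
        self._guard = threading.Lock()   # models the atomicity of the verbs

    def compare_and_swap(self, expected, new):
        """Return the old value; store `new` only if the old value equals `expected`."""
        with self._guard:
            old = self._value
            if old == expected:
                self._value = new
            return old

    def write(self, value):
        with self._guard:
            self._value = value


def writer_lock(word):
    # Spin until the swap observes UNLOCKED, i.e., this caller took the lock.
    while word.compare_and_swap(UNLOCKED, LOCKED) == LOCKED:
        pass


def writer_unlock(word):
    word.write(UNLOCKED)
```

Each failed attempt corresponds to one network round trip in the real setting, which is why the later sections partition data structures to reduce lock contention.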
6.2 Lock-Free Data Structure
Multi-version Data Structure. Our design of lock-free tree-
like data structure is inspired by append-only B-tree [51, 52]
and persistent data structures [53, 54].
Multi-versioning is a widely used method in optimistic
concurrency control [55, 56]. Multi-version data structures
first make copies of the corresponding data if needed. Then
the data are modified or new data items are inserted. For
example, in Figure 4, the writer copies all the affected nodes
along the path to the root, a.k.a. path copying [57]. Then,
the nodes in the new path update the pointers that still point
to the old data. Finally, the data are inserted into the new
path. After finishing all these operations, the root is
atomically changed to the new root by updating the root
pointer. The vector operation discussed in Section 4.5 can
help here to significantly reduce the number of network round
trips. Since the readers can always get consistent data, this
kind of concurrency control does not affect the performance of
readers.
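Path copying can be illustrated with a small in-memory binary search tree. This is a generic sketch of the technique, not the framework's code; the names Node, insert, and keys are chosen for the example.

```python
class Node:
    __slots__ = ("key", "value", "left", "right")

    def __init__(self, key, value, left=None, right=None):
        self.key, self.value, self.left, self.right = key, value, left, right


def insert(root, key, value):
    """Return a new root; only nodes on the search path are copied."""
    if root is None:
        return Node(key, value)
    if key < root.key:
        return Node(root.key, root.value, insert(root.left, key, value), root.right)
    if key > root.key:
        return Node(root.key, root.value, root.left, insert(root.right, key, value))
    return Node(key, value, root.left, root.right)


def keys(root):
    return [] if root is None else keys(root.left) + [root.key] + keys(root.right)
```

A reader holding the old root keeps seeing the old version, and publishing the new version is a single pointer update, for example:

```python
v1 = insert(insert(insert(None, 5, "a"), 3, "b"), 8, "c")
v2 = insert(v1, 4, "d")   # v1 is unchanged; v2 shares the untouched right subtree
```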
Skip-List: Naturally Lock-Free. Some data structures, like the
skip-list, are naturally lock-free, and the only concern is to
carefully choose the order of operations [58, 59]. As shown in
Figure 5, the writer first creates the newly allocated node
and sets the pointers in the new node accordingly (step <1>).
After that, the pointers of the preceding nodes are updated
from the bottom layer to the top (step <2>). Readers can still
get (potentially different) consistent views of the skip-list
in such a scenario; thus, the lock is not required [60].
Lightweight Recovery. In the multi-version data structure, the
only in-place update is the root pointer, and the pointer
change is atomic. Therefore, it does not need a recovery
process, as discussed in [48]. While recovering, the front-end
can use the root pointer (which is well-known via the naming
mechanism) to find the whole data structure.
NVM reclamation. The use of lock-free data structures requires
that memory is safely reclaimed, which further complicates
garbage collection [61]. In the AsymNVM framework, this
requirement is met by a lazy garbage collection mechanism.
After a version change, the front-end releases the old
version's data. The back-end delays this operation for n µs
and then reclaims the corresponding memory. This requires that
the latency of each pending data structure operation be less
than n µs, so that no operation accesses memory that has
already been reclaimed.
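The delayed reclamation can be sketched with explicit timestamps. The class name LazyReclaimer and its methods are hypothetical, times are passed in as microsecond values rather than read from a clock, and retire calls are assumed to arrive in non-decreasing time order.

```python
from collections import deque


class LazyReclaimer:
    """Frees retired blocks only after a grace period of `grace_us` microseconds."""

    def __init__(self, grace_us):
        self.grace_us = grace_us
        self.pending = deque()    # (retire_time_us, address), in retire order

    def retire(self, address, now_us):
        """Record that an old version's block is no longer reachable from the root."""
        self.pending.append((now_us, address))

    def collect(self, now_us):
        """Return the addresses whose grace period has elapsed."""
        freed = []
        while self.pending and now_us - self.pending[0][0] >= self.grace_us:
            freed.append(self.pending.popleft()[1])
        return freed
```

The scheme is safe only if every in-flight operation that may still hold a pointer into the old version finishes within the grace period, which is the latency condition stated above.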
Algorithm 3 Writer Preferred Lock
procedure WRITE_BEGIN
    gcc_atomic_increment(SN)
procedure WRITE_END
    gcc_atomic_increment(SN)
procedure READER_LOCK
    do
        ret ← rdma_atomic_read(SN)
    while ret is odd
    start_sn ← ret
procedure READER_UNLOCK
    return start_sn ≠ rdma_atomic_read(SN)
Figure 6: Data Structure Partition (key ranges Key[N0, N1] =>
root1, ..., Key[Nm-1, Nm] => rootM; each partition has its own
RW lock and log area; the writer uses writer_lock (RDMA CAS),
readers use reader_lock (RDMA atomic), and the back-end uses
Write_begin (gcc atomic)).
6.3 Lock Based Data Structure
Write Preferred Lock. The RDMA library provides atomic verbs,
which are an appropriate way to implement a distributed
sequencer [50, 62]. Algorithm 3 shows the implementation of
retry-based read locks using the sequence number (SN), an
8-byte integer variable. Distinct from Algorithm 2, which is
invoked by the front-end nodes, Write Begin and Write End are
executed by the back-end nodes. When a back-end node applies
the persisted memory log to the real data structure in NVM, it
atomically increases the SN twice, once before and once after
the modification. Reader Lock and Reader Unlock are invoked by
front-end nodes before and after a sequence of reads. To
disallow reads while data are being updated, Reader Lock waits
until the current SN is even (an odd SN means a write is in
progress). To ensure that the reads in between get a
consistent view, Reader Unlock checks that the SN is unchanged
since Reader Lock. If the data are inconsistent, the readers
need to retry and fetch the data again.
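The retry-based read protocol can be sketched in a single process as below. It mirrors the structure of Algorithm 3 but is only an illustration: SeqLockedCell and read_consistent are hypothetical names, plain attribute accesses stand in for the atomic increment and the RDMA atomic read, and max_retries is an assumption added so the example terminates.

```python
class SeqLockedCell:
    """Data guarded by a sequence number: odd means a write is in progress."""

    def __init__(self, data):
        self.sn = 0
        self.data = list(data)

    def write_begin(self):
        self.sn += 1      # SN becomes odd

    def write_end(self):
        self.sn += 1      # SN becomes even again


def read_consistent(cell, max_retries=1000):
    """Return a snapshot taken while no write overlapped the read."""
    for _ in range(max_retries):
        start_sn = cell.sn
        if start_sn % 2:              # reader lock: wait until SN is even
            continue
        snapshot = list(cell.data)    # the sequence of reads
        if cell.sn == start_sn:       # reader unlock: SN unchanged
            return snapshot
    raise TimeoutError("writer still in progress")
```

Readers never block the writer in this scheme; they pay for conflicts with retries instead, which matches the "write preferred" behavior measured in the lock benchmark.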
Lock Benchmark. We run a ping-pong test of the lock's
performance, as in Frangipani [63]. In our test scenario, six
readers and one writer try to access the same data in the
back-end, and the writer's workload is 10% write and 90% read.
The results show that each reader's average throughput is 260
KOPS (1.56 MOPS in total) and the writer's throughput is 539
KOPS. The readers' failure ratio (i.e., the fraction of read
attempts that fail) is only 3%. When the workload is set to
50% write, the readers' throughput drops to only 165 KOPS with
a 26% failure ratio. The write-preferred lock lets writers
achieve a higher throughput than readers.
Table 2: Performance Improvement and Comparison (in KOPS). R: using log reproducing; C: caching 10% NVM size in the
front-end; B: batching with batch size 1024. The evaluation uses the one-to-one setting with 100% write workloads harnessing
all optimizations. Reasons for empty cells: data structures with time complexity O(1) (i.e., HashTable/SmallBank) cannot apply
the batching optimization; in the Queue/Stack implementation, batching and caching should be combined.
                 TX(SmallBank)  TX(TATP)  Queue  Stack  HashTable  SkipList  BST    BPT    MV-BST  MV-BPT
Symmetric        654            214       1199   1087   1097       125.2     84.5   305.2  42.2    18.6
Symmetric-B      -              260       2279   2255   -          209.0     151.0  343.0  146.1   76.0
AsymNVM-Naïve    254            10.2      301    285    315        5.0       19.0   11.5   7.0     7.4
AsymNVM-R        295            12.4      833    828    385        7.7       22.9   13.7   12.3    9.8
AsymNVM-RC       362            63.7      -      -      445        40.4      59.5   77.1   28.4    17.8
AsymNVM-RCB      -              127.5     1678   1449   -          66.0      134.2  184.3  88.9    60.2
6.4 Data Structure Partition
We use partitioning to eliminate the potential bottleneck due
to the lock, achieving both high throughput and better
scalability [64, 65]. Similar to the support for large data
structures in Section 4.3, the AsymNVM framework adopts
key-hashing partitioning to improve the performance of various
data structures. As shown in Figure 6, each partition has its
own write lock and index data structure. While the writer is
executing a write operation in one of the partitions, multiple
readers can still concurrently access other partitions. In our
implementations, we always separate the data structure into
four partitions to simplify the evaluation.
Lock-free data structures benefit the reader but create mul-
tiple copies by writers. Lock-based data structures prioritize
the writer without extra copies, but readers have to read multi-
ple times until consistent data are obtained. The right choice
depends on specific applications.
7 Recovery and Replication
7.1 Replication
Recent work Mojim (based on symmetric architecture) [39]
adopts a primary-mirror architecture to avoid complex proto-
cols for achieving availability. The AsymNVM framework
similarly needs at least one mirror-node attached with the non-
volatile device like SSD, Disk or even NVM. To improve fault
tolerance, it is preferred to deploy mirror-nodes to different
racks. The back-end nodes replicate the memory/operation
logs to mirror-nodes before committing the transaction and
acknowledging the front-end node to avoid high overhead. If
the mirror node is equipped with NVM, the mirror node also
implements a log replay function similar to the back-end to
apply logs to the replicated real data structure. Replicated
logs in mirror nodes are read-only. When the back-end node
crashes, if the mirror node is equipped with NVM, it will be
voted as the new back-end. Otherwise, the front-end nodes
use the logs and data structures from the mirror node to re-
cover the data structure to a new back-end node with the
NVM device.
In our implementation, the back-end node is responsible for
ensuring that the replica is persistent in its mirror-node.
The front-end node only needs to ensure that data is stored in
the back-end NVM, and does not wait for an acknowledgement
that replication has completed. Thus, from the front-end's
perspective, the replication phase is performed
asynchronously.
7.2 Data Structure Recovery
With the log mechanism, AsymNVM framework ensures
crash consistency with the non-volatile data and logs stored
in the back-end node. This section discusses the details of
different recovery scenarios based on the failure of different
components.
Similar to most distributed systems, we use a consensus-based
voting system, e.g., etcd [66] or ZooKeeper [67], to detect
machine failures. Leases are used to identify whether a node
(front-end or back-end) is still alive. If the lease expires
and the node cannot renew it, the node is considered to have
crashed. We implement this mechanism as the keepAlive service.
Case 1: Front-end reader crash. If the front-end node
crashes when performing a read, it only needs to gain the
meta-data via naming mechanism and resume execution after
rebooting.
Case 2: Front-end writer crash. If the front-end node crashes
while performing a write, the back-end learns of it through
the keepAlive service. After the front-end node reboots, if
there are still memory logs from that front-end node that have
not been replayed, the back-end node validates via checksum
whether all log entries of the last transaction were flushed
to the NVM. If this transaction log is consistent (Case 2.a),
the back-end notifies the front-end to resume as normal, as in
the reader-crash case. Otherwise (Case 2.b), the back-end
notifies the front-end that the last transaction log is
inconsistent. The front-end node then fetches the LPN, the
OPN, and the operation logs whose memory logs have not been
replayed, and re-executes the uncommitted transaction
according to the operation log. In most cases (Case 2.c),
there are several operation logs whose corresponding memory
logs have not been flushed to the back-end yet; the front-end
then proceeds as in Case 2.b.
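The checksum validation step can be sketched as below. The log entry layout (a 4-byte length, a 4-byte CRC32, then the payload) is a hypothetical format chosen for the example, since the text does not specify one; encode_entry, last_tx_consistent, and recovery_action are likewise illustrative names.

```python
import struct
import zlib

HEADER = struct.Struct("<II")   # payload length, CRC32 of payload (assumed layout)


def encode_entry(payload: bytes) -> bytes:
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def last_tx_consistent(log: bytes) -> bool:
    """True if every entry is complete and matches its checksum."""
    pos = 0
    while pos < len(log):
        if pos + HEADER.size > len(log):
            return False                      # torn header
        length, crc = HEADER.unpack_from(log, pos)
        payload = log[pos + HEADER.size: pos + HEADER.size + length]
        if len(payload) != length or zlib.crc32(payload) != crc:
            return False                      # torn or corrupted payload
        pos += HEADER.size + length
    return True


def recovery_action(log: bytes) -> str:
    # Case 2.a resumes; Case 2.b re-executes from the operation log.
    return "resume" if last_tx_consistent(log) else "re-execute from operation log"
```

A partially flushed transaction shows up as a truncated or mismatching entry, so the back-end can decide between the two cases without help from the crashed front-end.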
Case 3: Back-end transient failure. When the back-end node
fails while executing an RDMA write/read, the front-end can
detect it through the feedback from the RNIC, i.e., the
destination is unreachable. It then waits for a notification
that the back-end node has recovered or that a new back-end
has been voted in. After rebooting, the back-end node first
reconstructs the mapping between the physical addresses and
virtual addresses. The reconstruction is possible because this
mapping is also stored in NVM and its beginning address is
well-known via the global addressing scheme described in
Section 5.1. After that, the back-end node checks whether the
last transaction log is consistent. If 1) there are no
transaction/operation logs left, or 2) the transaction log is
consistent (Case 3.a), the back-end can start its normal
execution immediately, i.e., reproduce memory logs to data
structures if any log has not been applied, and then notify
the front-end nodes of its liveness. If the transaction log is
inconsistent (Case 3.b), the back-end node notifies the
corresponding front-end nodes about its crash, and the
front-end flushes the memory logs again to redo this
transaction. This is possible because the front-end node
cannot have received the persistence acknowledgement. If
existing operation logs are ahead of the current memory logs
(Case 3.c), which means that the memory logs of several
operation logs have not been flushed from the front-end due to
batching, the back-end node notifies the front-end about this,
and the front-end continues to execute the next operation.
Case 4: Back-end permanent failure. In this case, one of the
mirror nodes is voted as the new back-end and provides service
to the front-end. The new back-end node broadcasts this
information to the living front-ends. After that, the
front-end reconstructs the data structures on the new back-end
by using the data and logs in the mirror nodes.
Case 5: Mirror node crash. The consensus-based service detects
the failure and removes the node from the group.
If both the front-end and the back-end crash, the keepAlive
service coordinates the front-end and back-end nodes and lets
the back-end nodes recover first. They will first check the
status as in Case 2. After that, the front-end will determine
how to recover according to the back-end's failure cases in
Case 1.
8 Evaluation
Our evaluations attempt to answer the following questions:
I. How does AsymNVM perform compared to the symmetric setting
and a naive asymmetric implementation? II. How much
performance improvement can batching and caching deliver?
III. How does AsymNVM perform with multiple front-end nodes?
IV. How does AsymNVM perform under different workloads?
8.1 Evaluation Setup
Hardware Setup. The experiment cluster contains ten machines,
each of which is equipped with an 8-core CPU (Intel Xeon
E5-2640 v2, 2.0 GHz), 96 GB memory, and one Mellanox
ConnectX-3 InfiniBand NIC with a network bandwidth of 40 Gbps.
Up to three machines are used as the back-end nodes or mirror
nodes.
NVM Emulator. We use 60 GB of DRAM as the remote NVM device,
and 6 GB of DRAM as the front-end DRAM for caching data.
Similar to prior works [9, 39, 68], we set the write latency
to 200 ns and the read latency to that of DRAM, reflecting the
read/write asymmetry of NVM.
Figure 7: Throughput with different batch sizes. (a) Lock-Free
Data Structure (MV-BST, MV-BPT, SkipList); (b) Lock Based Data
Structure (BST, BPT, TATP). X-axis: batch size (1 to 2^12);
y-axis: throughput (KOPS).
Figure 8: Throughput with different cache sizes (1%, 5%, 10%,
20%) for BPT, BST, SkipList, TATP, MV-BPT, MV-BST, HashTable,
and SmallBank. Y-axis: throughput (KOPS).
Figure 9: Scalability of multiple readers. The workload of
the writer is 100% insert. R/W represents reader/writer.
(a) Lock-Free Data Structure (MV-BPT, MV-BST); (b) Lock Based
Data Structure (BPT, BST, Skiplist). X-axis: reader machine
number (1 to 6); y-axis: throughput (KOPS).
8.2 AsymNVM Performance
We implement eight widely-used data structures covering
different access time complexity (O(1) and O(log(n))): stack,
Figure 10: Throughput of multiple data structures in the Same
Back-end Machine (SkipList, BST, BPT, MV-BST, MV-BPT). X-axis:
data structure number (1 to 7); y-axis: throughput (KOPS).
Figure 11: CPU utilization with the operation increasing in
BST. The workload is 10% put and 90% get. Series: Back-end,
Front-end. X-axis: million operations; y-axis: utilization (%).
queue, hash-table, skip-list, binary search tree (BST), B+tree
(BPT), multi-version binary search tree (MV-BST), and
multi-version B+tree (MV-BPT). To simplify the evaluations,
the keys and values are all 64 bits. In addition, we use two
applications: TATP and SmallBank.
Table 2 shows the overall performance as well as the com-
parison to symmetric and naive implementations.
Compare to Naïve Implementation. The naïve implementation
accesses remote NVM directly using RDMA reads and writes
without any optimizations. The complete implementation,
denoted as AsymNVM-RCB (with log Reproducing, Caching, and
Batching), provides nearly 6~22× improvement compared to the
naïve implementation. We see that caching is more effective
than the other optimizations (nearly 2~7× performance
improvement). The reason is that the RDMA Read operation
causes the major overhead in accessing a data structure, and
caching largely eliminates this read overhead.
Compare to the Symmetric Setting. We implement the
symmetric NVM architecture by storing data structures in
local NVM and storing logs in remote NVM for fault toler-
ance. The logs are flushed asynchronously (without waiting
for the acknowledgement from remote nodes). It reaches the
upper-bound performance of symmetric NVM architecture,
but will obviously cause inconsistency. From the results, we
see that, AsymNVM-RCB still achieves comparable perfor-
mance to the optimistic performance of symmetric NVM data
structures without consistency. Especially, in a few cases (i.e.,
Queue, Stack, BST, MV-BST, MV-BPT), the performance of
AsymNVM-RCB is even better than symmetric NVM with-
out batching. This is mainly due to the small DRAM cache
in the front-end nodes.
End-to-end Performance. We evaluate application performance
with two transaction benchmarks: SmallBank [69] and TATP [70].
We use HashTable and BPT as the index data structures of
SmallBank and TATP, respectively. As shown in Table 2,
AsymNVM improves the throughput over the naïve implementation
by 1.42× in SmallBank and 12.5× in TATP.
Cost Comparison. In the symmetric setting with $m$ machines,
$n_1 = \max\left(\sum_{i=1}^{m} \lceil S_i / S_0 \rceil,\ m\right)$
NVM devices are needed (assuming the capacity of each NVM
device is $S_0$ and the real NVM usage of machine $i$ is
$S_i$). In contrast, AsymNVM needs
$n_2 = \left\lceil \sum_{i=1}^{m} S_i / S_0 \right\rceil$ NVM
devices, and $n_2 \le n_1$. As we mentioned in Section 1, each
server typically needs a capacity smaller than $S_0$, so the
number of NVM devices required, $n_2$, will be fewer than
$n_1$ (in which case $n_1 = m$).
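A small worked example makes the comparison concrete. It assumes that the asymmetric device count is the total usage divided by the per-device capacity, rounded up; the function names and the usage figures are hypothetical and chosen only for illustration.

```python
import math


def symmetric_devices(usages, capacity):
    """n1: every machine needs at least one device, and enough for its own usage."""
    return max(sum(math.ceil(s / capacity) for s in usages), len(usages))


def asymmetric_devices(usages, capacity):
    """n2: devices are shared, so only the total usage matters (assumed formula)."""
    return math.ceil(sum(usages) / capacity)


# Hypothetical example: four servers using 100, 200, 50, and 150 GB
# of NVM, with 512 GB devices.
usages_gb = [100, 200, 50, 150]
n1 = symmetric_devices(usages_gb, 512)
n2 = asymmetric_devices(usages_gb, 512)
```

Here n1 is 4 because each server holds its own device, while n2 is 1 because the combined 500 GB fits on a single shared device.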
8.3 CPU Utilization
Figure 11 shows the CPU utilization of the front-end and
back-end nodes. The front-end node keeps running at nearly
100% CPU utilization, but the requests incur only a small CPU
usage at the back-end (4%~10% CPU utilization). This matches
the intuition that the back-end has very little computing
overhead and can be made reliable due to its simplicity.
8.4 Effects of Batching and Caching
While batching and caching are applicable to queue and stack,
only a little cache is needed in both queue and stack to reach a
high performance as shown in Table 2. Also, the performance
is less sensitive to the batch size. Thus, we do not discuss
them here.
Batching. We measure the performance of batching with vector
operations under different batch sizes from 1 to 4096. The
results are in Figure 7. Using the values in Table 2, MV-BPT
improves by 3.38× (from 17.8 KOPS without batching to 60.2
KOPS with batch size 1024), and MV-BST improves by about 3.13×
(from 28.4 KOPS to 88.9 KOPS). The improvements are 126%,
139%, and 63% for BST, BPT, and SkipList, respectively.
Multi-version data structures need to perform path copying,
which incurs many write operations. Batching can effectively
reduce this overhead.
Caching. We measure the benefit of caching under different
front-end cache sizes. Binary search tree, B+Tree, hashtable,
and skip-list are used here, and the results are shown in
Figure 8. Overall, the throughput increases with the cache
size. Notice that MV-BPT and MV-BST do not improve much with
caching. This is because the modified data are still kept in
memory for multi-version data structures. We also measure the
improvement due to our special optimizations for tree-like
data structures. The results show that, when using the native
LRU strategy (i.e., accessing any data, including the
lower-level nodes, through the front-end cache), the BPT can
only reach 42.5 KOPS, which is 44% lower than AsymNVM.
Figure 12: Throughput with Different Workloads (100%put, 50%put+50%get, 25%put+75%get, 10%put+90%get, 100%get).
Panels: (a) BST, (b) MV-BST, (c) BPT, (d) MV-BPT, (e) SkipList, (f) Queue, (g) Stack, (h) HashTable. Queue and Stack
use push/pop mixes (100% push, 50% push + 50% pop, 100% pop). Series: Naive, R, RC. Y-axis: throughput (KOPS).
8.5 Multiple Front-end Nodes
The results so far are based on one front-end and one back-
end node. We measure the scalability of AsymNVM using
multiple readers. The results are shown in Figure 9. We
choose five data structures which we mentioned in Section 6
to make comparisons between lock based and lock-free data
structures.
The readers' performance scales well with the increasing
number of front-end nodes. We see that the writer performance
of lock-based data structures decreases more than that of
multi-version data structures. This is because lock-based data
structures need more RDMA rounds, which affects performance.
The effects differ between the concurrency control mechanisms.
With lock-based BST, the average throughput with 6 readers is
26% worse than with only one reader. In the case of MV-BST,
the performance degradation is about 8%. The results confirm
that the multi-version data structures do benefit the readers.
We also see that the lock-free data structures scale better
than their lock-based counterparts. The readers in Figure 9b
achieve about 3.0∼3.2× higher performance than the readers
in Figure 9a. Retries incurred by failed reads are the main
cause of the lower performance. Retries account for about
4%∼16% of total operations with 6 readers and 100% inserts
from the writer; a lighter write workload lowers the retry
ratio.
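The retry behavior described above can be illustrated with a small sketch. It assumes a seqlock-style version counter, which is an assumption made for illustration (this section does not spell out the validation protocol): a writer makes the version odd while it modifies the value and even again afterwards, and a reader retries whenever it sees an odd version or a version change. The names `VersionedCell`, `optimistic_read`, and the `interleave` hook are hypothetical.

```python
class VersionedCell:
    """A value guarded by a version counter (odd = write in progress)."""

    def __init__(self, value):
        self.version = 0
        self.value = value

    def begin_write(self):
        self.version += 1      # odd: readers must not trust the value

    def end_write(self, value):
        self.value = value
        self.version += 1      # even again: the value is stable


def optimistic_read(cell, interleave=None, max_retries=100):
    """Read without locking; retry when a concurrent write is detected.

    `interleave` is a hypothetical hook called between the two version
    reads so that a single-threaded demo can simulate a racing writer.
    Returns (value, number_of_retries).
    """
    retries = 0
    while retries <= max_retries:
        before = cell.version
        value = cell.value
        if interleave is not None:
            interleave(retries)
        after = cell.version
        if before == after and before % 2 == 0:
            return value, retries
        retries += 1           # failed read: count it and try again
    raise RuntimeError("too many retries")


cell = VersionedCell("old")


def racing_writer(attempt):
    # Interfere only with the reader's first attempt.
    if attempt == 0:
        cell.begin_write()
        cell.end_write("new")


value, retries = optimistic_read(cell, racing_writer)
```

In this demo the first attempt is invalidated by the write, so the reader returns the new value after one retry. A heavier write load would invalidate more attempts, which matches the observation that the retry ratio falls as the write workload gets lighter.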
We also measure the throughput of multiple front-end
nodes sharing one NVM device, each accessing its own
distinct data structure. We only test the case in which every
front-end uses the same type of data structure but a different
instance. Figure 10 shows that scalability is almost linear up
to 7 front-ends. The average performance degradation for a
single client is about 7%∼20% compared to the one-to-one
deployment.
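As a rough consistency check on the "almost linear" observation, the per-client degradation can be turned into an aggregate speedup. The sketch assumes, for illustration only, that every front-end suffers the same fractional degradation d relative to the one-to-one deployment, so N front-ends deliver N × (1 − d) times the throughput of one client. The function name is hypothetical.

```python
def aggregate_speedup(num_front_ends, degradation):
    """Aggregate throughput relative to one client in a one-to-one setup.

    Assumes (for illustration) that all front-ends see the same
    fractional degradation.
    """
    if not 0.0 <= degradation < 1.0:
        raise ValueError("degradation must be in [0, 1)")
    return num_front_ends * (1.0 - degradation)


best = aggregate_speedup(7, 0.07)    # lightest reported degradation
worst = aggregate_speedup(7, 0.20)   # heaviest reported degradation
print(f"7 front-ends: {worst:.2f}x to {best:.2f}x of a single client")
```

Under this assumption, seven front-ends deliver roughly 5.6× to 6.5× the throughput of a single client.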
8.6 Multiple Back-end Nodes
As shown in Figure 13, we measure the performance after
partitioning a data structure across multiple back-ends. The
results show no significant performance degradation after
partitioning, because the partition in each back-end is strictly
isolated from those in other back-ends.
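One way to picture why partitions do not interfere is a key-to-back-end mapping in which every key is owned by exactly one back-end. The sketch below assumes hash partitioning, which is an assumption for illustration (this section does not state the partitioning scheme). `PartitionedStore` and its methods are hypothetical, and plain dictionaries stand in for back-end nodes.

```python
import zlib


class PartitionedStore:
    """Routes each key to exactly one back-end partition."""

    def __init__(self, num_back_ends):
        if num_back_ends < 1:
            raise ValueError("need at least one back-end")
        # Each dict stands in for the data structure held by one back-end.
        self.partitions = [dict() for _ in range(num_back_ends)]

    def owner(self, key):
        # crc32 is stable across runs, unlike Python's salted str hash.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def put(self, key, value):
        self.partitions[self.owner(key)][key] = value

    def get(self, key):
        return self.partitions[self.owner(key)].get(key)


store = PartitionedStore(num_back_ends=4)
for i in range(1000):
    store.put(f"key-{i}", i)
```

Because an operation on a key touches only the partition that owns it, no operation needs to coordinate across back-ends in this model.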
8.7 Industry Workloads
We also measure our data structure implementations under
industry workloads from an online service. The workloads
trace real-world user behavior and follow a power-law
distribution.
[Figure 13: line plot. Y-axis: Throughput (KOPS); x-axis:
Back-end Number (1 to 7); series: SKIPLIST, BST, BPT, MV-BST,
MV-BPT.]
Figure 13: Throughput with Partitions
We also use operation traces of industry workloads from an
online service to evaluate AsymNVM. Figure 12 shows the
throughput under different read/write ratios from a single
writer front-end node. For simplicity, insert operations serve
as writes and find operations serve as reads. With fewer
read operations, performance decreases because write
operations incur more overhead. BPT and BST achieve relatively
higher performance than their multi-version counterparts
(MV-BPT and MV-BST). For example, under the full-write
workload, the performance gaps are about 23% and 48%. This is
because, in the multi-version variants, write operations must
write more data during path copying.
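The extra write volume of path copying can be seen in a small sketch of a multi-version binary search tree. This is a generic textbook-style illustration, not the AsymNVM implementation, and the names `Node`, `insert`, and `contains` are hypothetical. Each insert copies every node on the root-to-leaf path, whereas an in-place insert would write only the new node and one parent pointer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root, key):
    """Return (new_root, nodes_written) without modifying the old version."""
    if root is None:
        return Node(key), 1
    if key == root.key:
        return root, 0                  # already present: nothing written
    if key < root.key:
        child, written = insert(root.left, key)
        if written == 0:
            return root, 0
        return Node(root.key, child, root.right), written + 1
    child, written = insert(root.right, key)
    if written == 0:
        return root, 0
    return Node(root.key, root.left, child), written + 1


def contains(root, key):
    while root is not None:
        if key == root.key:
            return True
        root = root.left if key < root.key else root.right
    return False


v1 = None
for k in (50, 30, 70, 20):
    v1, _ = insert(v1, k)
v2, written = insert(v1, 10)   # copies 50, 30, 20 and adds the new leaf
```

Here the last insert writes four nodes (three copies plus the new leaf), the old version `v1` stays readable, and the untouched right subtree is shared between the two versions.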
9 Related Work
Single-node NVM systems [6, 7, 9, 29, 30, 45, 71–74] provide
direct access to NVM via the memory bus, but they lead to lower
NVM utilization and leave data inaccessible upon node failures.
Distributed NVM systems including Octopus [19], Hotpot [16],
Mojim [39], and FaRM [37, 75] combine NVM devices with RDMA,
and all of them use symmetric deployment. Existing asymmetric
deployments provide storage interfaces, including NVMe over
Fabric [76] and Crail [18]. However, they are not
byte-addressable, i.e., they cannot provide data-structure-level
service. A few file systems (e.g., Aerie [77]) adopt a hybrid
paradigm like that of AsymNVM, which allows direct remote reads
and transactional writes with logs. Unlike Aerie, AsymNVM is a
distributed system.
Recent work on persistent allocators for NVM includes
nvm_malloc [78], Makalu [68], PAllocator [79], and
Mnemosyne [8]. These systems address NVM allocation on a
single machine. We take the first step towards a distributed
NVM allocator.
Several projects aim to design the future disaggregated
data center [24–26, 80–84]. LegoOS [82] proposes
splitkernel, an OS model that disseminates functionalities into
loosely-coupled monitors. Some of these works focus on how
to design remote memory. Aguilera et al. [85] discuss the
benefits and challenges of applying remote memory. NAM-DB [86]
proposes the Network-Attached-Memory (NAM) architecture
and implements a timestamp-based concurrency control
algorithm. Distinct from other works, INFINISWAP [87]
provides a page mapping mechanism for memory disaggregation
and decentralized resource management. AsymNVM is
an asymmetric architecture that can be used to organize
disaggregated NVM resources.
10 Conclusion
This paper rethinks NVM deployment and makes a case for
the asymmetric non-volatile memory architecture, which
decouples servers from persistent data storage. We build the
AsymNVM framework based on the AsymNVM architecture; it
implements: 1) high-performance persistent data structure update;
2) NVM data management; 3) concurrency control; and 4)
crash-consistency and replication. The central idea is to use
operation logs to reduce the stall due to RDMA writes and
enable efficient batching and caching in front-end nodes. In a
cluster with ten machines (at most seven machines to emulate
a 60GB NVM using DRAM with additional latency), the
results show that AsymNVM achieves comparable (sometimes
better) performance to the best possible symmetric
architecture while avoiding all of its drawbacks through
disaggregation. Compared to the baseline AsymNVM, the speedup
brought by the proposed optimizations is drastic: 6∼22× among
all benchmarks.
References
[1] H. Packard, “Understanding the intel/micron 3d xpoint
memory.” http://www.hpl.hp.com/research/
systems-research/themachine/, 2015.
[2] G. W. Burr, M. J. Breitwisch, M. M. Franceschini,
D. Garetto, K. Gopalakrishnan, B. L. Jackson, B. N.
Kurdi, C. H. Lam, L. A. Lastras, A. Padilla, et al.,
“Phase change memory technology,” Journal of Vacuum
Science and Technology B, vol. 28, no. 2, pp. 223–262,
2010.
[3] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, “Archi-
tecting phase change memory as a scalable dram alter-
native,” international symposium on computer architec-
ture, vol. 37, no. 3, pp. 2–13, 2009.
[4] P. Zhou, B. Zhao, J. Yang, and Y. Zhang, “A durable and
energy efficient main memory using phase change mem-
ory technology,” international symposium on computer
architecture, vol. 37, no. 3, pp. 14–23, 2009.
[5] D. Apalkov, A. V. Khvalkovskiy, S. M. Watts, V. Nikitin,
X. Tang, D. Lottis, K. Moon, X. Luo, E. Chen, A. E.
Ong, et al., “Spin-transfer torque magnetic random ac-
cess memory (stt-mram),” ACM Journal on Emerging
Technologies in Computing Systems, vol. 9, no. 2, p. 13,
2013.
[6] E. R. Giles, K. Doshi, and P. Varman, “Softwrap: A
lightweight framework for transactional support of stor-
age class memory,” in Mass Storage Systems and Tech-
nologies (MSST), 2015 31st Symposium on, pp. 1–14,
IEEE, 2015.
[7] J. Coburn, A. M. Caulfield, A. Akel, L. M. Grupp, R. K.
Gupta, R. Jhala, and S. Swanson, “Nv-heaps: making
persistent objects fast and safe with next-generation,
non-volatile memories,” ACM Sigplan Notices, vol. 46,
no. 3, pp. 105–118, 2011.
[8] H. Volos, A. J. Tack, and M. M. Swift, “Mnemosyne:
Lightweight persistent memory,” in ACM SIGARCH
Computer Architecture News, vol. 39, pp. 91–104, ACM,
2011.
[9] M. Liu, M. Zhang, K. Chen, X. Qian, Y. Wu, W. Zheng,
and J. Ren, “Dudetm: Building durable transactions
with decoupling for persistent memory,” in Proceedings
of the Twenty-Second International Conference on Ar-
chitectural Support for Programming Languages and
Operating Systems, pp. 329–343, ACM, 2017.
[10] S. Venkataraman, N. Tolia, P. Ranganathan, R. H. Camp-
bell, et al., “Consistent and durable data structures
for non-volatile byte-addressable memory.,” in FAST,
vol. 11, pp. 61–75, 2011.
[11] S. Mittal and J. S. Vetter, “A survey of software tech-
niques for using non-volatile memories for storage and
main memory systems,” IEEE Transactions on Parallel
and Distributed Systems, vol. 27, no. 5, pp. 1537–1550,
2016.
[12] J. Arulraj and A. Pavlo, “How to build a non-volatile
memory database management system,” in Proceedings
of the 2017 ACM International Conference on Manage-
ment of Data, pp. 1753–1758, ACM, 2017.
[13] A. Eisenman, D. Gardner, I. AbdelRahman, J. Axboe,
S. Dong, K. Hazelwood, C. Petersen, A. Cidon, and
S. Katti, “Reducing dram footprint with nvm in face-
book,” in Proceedings of the Thirteenth EuroSys Con-
ference, p. 42, ACM, 2018.
[14] C. Delimitrou and C. Kozyrakis, “Quasar: resource-
efficient and qos-aware cluster management,” ACM SIG-
PLAN Notices, vol. 49, no. 4, pp. 127–144, 2014.
[15] Google, “Clusterdata2011 2 traces.”
https://github.com/google/cluster-data/blob/master/ClusterData2011_2.md,
2018.
[16] Y. Shan, S.-Y. Tsai, and Y. Zhang, “Distributed shared
persistent memory,” in Proceedings of the 2017 Sympo-
sium on Cloud Computing, pp. 323–337, ACM, 2017.
[17] N. S. Islam, M. Wasi-ur Rahman, X. Lu, and D. K.
Panda, “High performance design for hdfs with byte-
addressability of nvm and rdma,” in Proceedings of the
2016 International Conference on Supercomputing, p. 8,
ACM, 2016.
[18] P. Stuedi, A. Trivedi, J. Pfefferle, R. Stoica, B. Met-
zler, N. Ioannou, and I. Koltsidas, “Crail: A high-
performance i/o architecture for distributed data process-
ing.,” IEEE Data Eng. Bull., vol. 40, no. 1, pp. 38–49,
2017.
[19] Y. Lu, J. Shu, Y. Chen, and T. Li, “Octopus: an rdma-
enabled distributed persistent memory file system,” in
2017 USENIX Annual Technical Conference (USENIX
ATC 17), pp. 773–785, 2017.
[20] K. Lim, J. Chang, T. Mudge, P. Ranganathan, S. K.
Reinhardt, and T. F. Wenisch, “Disaggregated mem-
ory for expansion and sharing in blade servers,” in
ACM SIGARCH Computer Architecture News, vol. 37,
pp. 267–278, ACM, 2009.
[21] K. Lim, Y. Turner, J. R. Santos, A. AuYoung, J. Chang,
P. Ranganathan, and T. F. Wenisch, “System-level impli-
cations of disaggregated memory,” in High Performance
Computer Architecture (HPCA), 2012 IEEE 18th Inter-
national Symposium on, pp. 1–12, IEEE, 2012.
[22] P. X. Gao, A. Narayan, S. Karandikar, J. Carreira,
S. Han, R. Agarwal, S. Ratnasamy, and S. Shenker,
“Network requirements for resource disaggregation.,” in
OSDI, vol. 16, pp. 249–264, 2016.
[23] Intel., “Intel rack scale design architecture white paper.”
https://www.intel.com/content/dam/www/
public/us/en/documents/white-papers/rack-
scale-design-architecture-white-paper.pdf,
Jan. 2017.
[24] H. Packard, “The machine.”
https://www.labs.hpe.com/the-machine, 2018.
[25] Intel, “Intel, facebook collaborate on future data center
rack technologies.” https://newsroom.intel.com/
news-releases/intel-facebook-collaborate-
on-future-data-center-rack-technologies/,
2013.
[26] Intel, “Intel rack scale design.”
https://www.intel.com/content/www/us/en/architecture-and-technology/rack-scale-design-overview.html,
2018.
[27] Intel, “Disaggregated servers drive data center
efficiency and innovation.”
https://www.intel.com/content/dam/www/public/us/en/documents/best-practices/disaggregated-server-architecture-drives-data-center-efficiency-paper.pdf.
[28] C.-H. Hsu, Q. Deng, J. Mars, and L. Tang, “Smoothop-
erator: Reducing power fragmentation and improving
power utilization in large-scale datacenters,” in Proceed-
ings of the Twenty-Third International Conference on
Architectural Support for Programming Languages and
Operating Systems, pp. 535–548, ACM, 2018.
[29] J. Yang, Q. Wei, C. Chen, C. Wang, K. L. Yong, and
B. He, “Nv-tree: Reducing consistency cost for nvm-
based single level systems.,” in FAST, vol. 15, pp. 167–
181, 2015.
[30] F. Xia, D. Jiang, J. Xiong, and N. Sun, “Hikv: A hy-
brid index key-value store for dram-nvm memory sys-
tems,” in 2017 USENIX Annual Technical Conference
(USENIX ATC 17), pp. 349–362, USENIX, 2017.
[31] B. W. Lampson, “Hints for computer system design,”
in ACM SIGOPS Operating Systems Review, vol. 17,
pp. 33–48, ACM, 1983.
[32] S.-Y. Tsai and Y. Zhang, “Mitsume: an object-based
remote memory system,” in Workshop on Warehouse-
scale Memory Systems (WAMS), ACM, 2018.
[33] J. Bonwick et al., “The slab allocator: An object-
caching kernel memory allocator.,” in USENIX summer,
vol. 16, Boston, MA, USA, 1994.
[34] A. Kalia, M. Kaminsky, and D. G. Andersen, “Us-
ing rdma efficiently for key-value services,” in ACM
SIGCOMM Computer Communication Review, vol. 44,
pp. 295–306, ACM, 2014.
[35] J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang,
M. Wasi-ur Rahman, N. S. Islam, X. Ouyang, H. Wang,
S. Sur, et al., “Memcached design on high performance
RDMA capable interconnects,” in Proceedings of the In-
ternational Conference on Parallel Processing (ICPP),
pp. 743–752, IEEE, 2011.
[36] C. Mitchell, Y. Geng, and J. Li, “Using one-sided rdma
reads to build a fast, cpu-efficient key-value store.,” in
USENIX Annual Technical Conference, pp. 103–114,
2013.
[37] A. Dragojević, D. Narayanan, O. Hodson, and M. Castro,
“Farm: Fast remote memory,” in Proceedings of the
11th USENIX Conference on Networked Systems Design
and Implementation, pp. 401–414, 2014.
[38] M. Su, M. Zhang, K. Chen, Z. Guo, and Y. Wu, “Rfp:
When rpc is faster than server-bypass with rdma.,” in
EuroSys, pp. 1–15, 2017.
[39] Y. Zhang, J. Yang, A. Memaripour, and S. Swanson,
“Mojim: A reliable and highly-available non-volatile
memory system,” ACM SIGPLAN Notices, vol. 50, no. 4,
pp. 3–18, 2015.
[40] S. Pelley, P. M. Chen, and T. F. Wenisch, “Memory
persistency,” in ACM SIGARCH Computer Architecture
News, vol. 42, pp. 265–276, IEEE Press, 2014.
[41] A. Kolli, J. Rosen, S. Diestelhorst, A. Saidi, S. Pelley,
S. Liu, P. M. Chen, and T. F. Wenisch, “Delegated persist
ordering,” in The 49th Annual IEEE/ACM International
Symposium on Microarchitecture, p. 58, IEEE Press,
2016.
[42] T. Tom, “Remote access to ultra-low-latency storage.”
https://www.snia.org/sites/default/files/
SDC15_presentations/persistant_mem/Talpey-
Remote_Access_Storage.pdf, Aug. 2015.
[43] D. Chet, “Rdma with pmem: Software mechanisms
for enabling access to remote persistent mem-
ory.” http://www.snia.org/sites/default/
files/SDC15_presentations/persistant_mem/
ChetDouglas_RDMA_with_PM.pdf, 2015.
[44] OpenFabric, “Rdma and nvm programming model.”
https://www.openfabrics.org/images/
eventpresos/workshops2015/DevWorkshop/
Monday/monday_12.pdf, 2015.
[45] M. K. Qureshi, V. Srinivasan, and J. A. Rivers, “Scalable
high performance main memory system using phase-
change memory technology,” ACM SIGARCH Com-
puter Architecture News, vol. 37, no. 3, pp. 24–33, 2009.
[46] H. G. Lee, S. Baek, C. Nicopoulos, and J. Kim, “An
energy-and performance-aware dram cache architecture
for hybrid dram/pcm main memory systems,” in Com-
puter Design (ICCD), 2011 IEEE 29th International
Conference on, pp. 381–387, IEEE, 2011.
[47] J. J. Levandoski, D. B. Lomet, and S. Sengupta, “The
bw-tree: A b-tree for new hardware platforms,” in 2013
IEEE 29th International Conference on Data Engineer-
ing (ICDE), pp. 302–313, IEEE, 2013.
[48] J. Arulraj, A. Pavlo, and S. R. Dulloor, “Let’s talk about
storage & recovery methods for non-volatile memory
database systems,” in Proceedings of the 2015 ACM
SIGMOD International Conference on Management of
Data, pp. 707–722, ACM, 2015.
[49] Intel., “Nvm library.” https://github.com/pmem/
nvml, 2018.
[50] D. Y. Yoon, M. Chowdhury, and B. Mozafari, “Dis-
tributed lock management with rdma: Decentralization
without starvation,” in Proceedings of the 2018 Interna-
tional Conference on Management of Data, pp. 1571–
1586, ACM, 2018.
[51] J. C. Anderson, J. Lehnardt, and N. Slater, CouchDB:
The Definitive Guide: Time to Relax. O’Reilly Media,
Inc., 2010.
[52] M. Hedenfalk, “how the append-only btree works.”
http://www.bzero.se/ldapd/btree.html, 2009.
[53] J. R. Driscoll, N. Sarnak, D. D. Sleator, and R. E. Tarjan,
“Making data structures persistent,” Journal of computer
and system sciences, vol. 38, no. 1, pp. 86–124, 1989.
[54] C. Okasaki, Purely functional data structures. Cam-
bridge University Press, 1999.
[55] B. Becker, S. Gschwind, T. Ohler, B. Seeger, and
P. Widmayer, “An asymptotically optimal multiversion b-tree,”
The VLDB Journal: The International Journal on Very
Large Data Bases, vol. 5, no. 4, pp. 264–275, 1996.
[56] B. Sowell, W. Golab, and M. A. Shah, “Minuet: A
scalable distributed multiversion b-tree,” Proceedings of
the VLDB Endowment, vol. 5, no. 9, pp. 884–895, 2012.
[57] O. Rodeh, “B-trees, shadowing, and clones,” ACM
Transactions on Storage (TOS), vol. 3, no. 4, p. 2, 2008.
[58] M. Fomitchev and E. Ruppert, “Lock-free linked lists
and skip lists,” in Proceedings of the twenty-third annual
ACM symposium on Principles of distributed computing,
pp. 50–59, ACM, 2004.
[59] T. Crain, V. Gramoli, and M. Raynal, “No hot spot non-
blocking skip list,” in Distributed Computing Systems
(ICDCS), 2013 IEEE 33rd International Conference on,
pp. 196–205, IEEE, 2013.
[60] M. Herlihy, Y. Lev, V. Luchangco, and N. Shavit, “A
provably correct scalable concurrent skip list,” in Con-
ference On Principles of Distributed Systems (OPODIS),
Citeseer, 2006.
[61] K. Fraser, “Practical lock-freedom,” tech. rep., Univer-
sity of Cambridge, Computer Laboratory, 2004.
[62] A. Kalia, M. Kaminsky, and D. G. Andersen, “Design
guidelines for high performance rdma systems,” in 2016
USENIX Annual Technical Conference, p. 437, 2016.
[63] C. A. Thekkath, T. Mann, and E. K. Lee, “Frangipani:
a scalable distributed file system,” symposium on oper-
ating systems principles, vol. 31, no. 5, pp. 224–237,
1997.
[64] S. Tu, W. Zheng, E. Kohler, B. Liskov, and S. Madden,
“Speedy transactions in multicore in-memory databases,”
in Proceedings of the Twenty-Fourth ACM Symposium
on Operating Systems Principles, pp. 18–32, ACM,
2013.
[65] H. Lim, D. Han, D. G. Andersen, and M. Kaminsky,
“Mica: A holistic approach to fast in-memory key-value
storage,” USENIX, 2014.
[66] “Etcd project.” https://github.com/coreos/etcd,
2018.
[67] P. Hunt, M. Konar, F. P. Junqueira, and B. Reed,
“Zookeeper: Wait-free coordination for internet-scale
systems.,” in USENIX annual technical conference,
vol. 8, Boston, MA, USA, 2010.
[68] K. Bhandari, D. R. Chakrabarti, and H. Boehm,
“Makalu: fast recoverable allocation of non-volatile
memory,” conference on object oriented programming
systems languages and applications, vol. 51, no. 10,
pp. 677–694, 2016.
[69] B. University, “Smallbank benchmark.”
http://hstore.cs.brown.edu/documentation/deployment/benchmarks/smallbank/,
2018.
[70] N. Simo, W. Antoni, M. Markku, and R. Vilho,
“Tatp benchmark.” http://tatpbenchmark.sourceforge.net/,
2011.
[71] A. Kolli, S. Pelley, A. Saidi, P. M. Chen, and T. F.
Wenisch, “High-performance transactions for persis-
tent memories,” ACM SIGPLAN Notices, vol. 51, no. 4,
pp. 399–411, 2016.
[72] A. Chatzistergiou, M. Cintra, and S. D. Viglas, “Rewind:
Recovery write-ahead system for in-memory non-
volatile data-structures,” Proceedings of the VLDB En-
dowment, vol. 8, no. 5, pp. 497–508, 2015.
[73] K. Korgaonkar, I. Bhati, H. Liu, J. Gaur, S. Manipatruni,
S. Subramoney, T. Karnik, S. Swanson, I. Young, and
H. Wang, “Density tradeoffs of non-volatile memory as
a replacement for sram based last level cache,” pp. 315–
327, 2018.
[74] R. Kateja, A. Badam, S. Govindan, B. Sharma, and
G. Ganger, “Viyojit: Decoupling battery and dram ca-
pacities for battery-backed dram,” in Proceedings of
the 44th Annual International Symposium on Computer
Architecture, pp. 613–626, ACM, 2017.
[75] A. Dragojević, D. Narayanan, E. B. Nightingale,
M. Renzelmann, A. Shamis, A. Badam, and M. Castro,
“No compromises: distributed transactions with
consistency, availability, and performance,” in Proceedings
of the 25th symposium on operating systems principles,
pp. 54–70, ACM, 2015.
[76] P. Couvert, “High speed io processor for nvme over
fabric (nvmeof),” Flash Memory Summit, 2016.
[77] H. Volos, S. Nalli, S. Panneerselvam, V. Varadarajan,
P. Saxena, and M. M. Swift, “Aerie: Flexible file-system
interfaces to storage-class memory,” in Proceedings of
the Ninth European Conference on Computer Systems,
p. 14, ACM, 2014.
[78] D. Schwalb, T. Berning, M. Faust, M. Dreseler, and
H. Plattner, “nvm malloc: Memory allocation for
nvram.,” in ADMS@ VLDB, pp. 61–72, 2015.
[79] I. Oukid, D. Booss, A. Lespinasse, W. Lehner, T. Will-
halm, and G. Gomes, “Memory management techniques
for large-scale persistent-main-memory systems,” Pro-
ceedings of the VLDB Endowment, vol. 10, no. 11,
pp. 1166–1177, 2017.
[80] S. Han, N. Egi, A. Panda, S. Ratnasamy, G. Shi, and
S. Shenker, “Network support for resource disaggrega-
tion in next-generation datacenters,” in Proceedings of
the Twelfth ACM Workshop on Hot Topics in Networks,
p. 10, ACM, 2013.
[81] “Seamicro technology overview.”
http://seamicro.com/sites/default/files/SM_TO01_64_v2.5.pdf,
2012.
[82] Y. Shan, Y. Huang, Y. Chen, and Y. Zhang, “Legoos:
A disseminated, distributed OS for hardware resource
disaggregation,” in 13th USENIX Symposium on Oper-
ating Systems Design and Implementation (OSDI 18),
(Carlsbad, CA), pp. 69–87, USENIX Association, 2018.
[83] A. Klimovic, C. Kozyrakis, E. Thereska, B. John, and
S. Kumar, “Flash storage disaggregation,” in Proceed-
ings of the Eleventh European Conference on Computer
Systems, p. 29, ACM, 2016.
[84] M. Nanavati, J. Wires, and A. Warfield, “Decibel: Iso-
lation and sharing in disaggregated rack-scale storage.,”
in NSDI, pp. 17–33, 2017.
[85] M. K. Aguilera, N. Amit, I. Calciu, X. Deguillard,
J. Gandhi, P. Subrahmanyam, L. Suresh, K. Tati,
R. Venkatasubramanian, and M. Wei, “Remote memory
in the age of fast networks,” in Proceedings of the 2017
Symposium on Cloud Computing, pp. 121–127, ACM,
2017.
[86] E. Zamanian, C. Binnig, T. Harris, and T. Kraska,
“The end of a myth: Distributed transactions can scale,”
Proceedings of the VLDB Endowment, vol. 10, no. 6,
pp. 685–696, 2017.
[87] J. Gu, Y. Lee, Y. Zhang, M. Chowdhury, and K. G. Shin,
“Efficient memory disaggregation with infiniswap,” in
14th USENIX Symposium on Networked Systems Design
and Implementation (NSDI 17), (Boston, MA), pp. 649–
667, USENIX Association, 2017.
