Building Atomic, Crash-Consistent Data Stores with Disaggregated
  Persistent Memory by Tsai, Shin-Yeh & Zhang, Yiying
Purdue University
arXiv:1901.01628v1 [cs.DC] 7 Jan 2019
Abstract
Byte-addressable persistent memories (PM) have finally
made their way into production. An important and press-
ing problem that follows is how to deploy them in exist-
ing datacenters. One viable approach is to attach PM as
self-contained devices to the network as disaggregated
persistent memory, or DPM. DPM requires no changes
to existing servers in datacenters; without the need to
include a processor, DPM devices are cheap to build;
and by sharing DPM across compute servers, they offer
great elasticity and efficient resource packing.
This paper explores different ways to organize DPM
and to build data stores with DPM. Specifically, we
propose three architectures of DPM: 1) compute nodes
directly access DPM (DPM-Direct); 2) compute nodes
send requests to a coordinator server, which then ac-
cesses DPM to complete a request (DPM-Central); and
3) compute nodes directly access DPM for data opera-
tions and communicate with a global metadata server for
the control plane (DPM-Sep). Based on these architec-
tures, we built three atomic, crash-consistent data stores.
We evaluated their performance, scalability, and CPU
cost with micro-benchmarks and YCSB. Our evalua-
tion results show that DPM-Direct has great small-size
read but poor write performance; DPM-Central has the
best write performance when the scale of the cluster is
small but performs poorly when the scale increases; and
DPM-Sep performs well overall.
1 Introduction
After years of research, engineering, and commercialization
efforts, persistent memory (PM), non-volatile memory
that can be attached to the main memory bus, is
finally coming to market [27, 29]. As promised, PM can
be accessed like memory and it offers persistence, high
density, and performance that is orders of magnitude
faster than flash. It has the potential to significantly im-
prove the efficiency and reduce the cost of large-scale
data-intensive applications. An immediate question that
follows is how to utilize PM and deploy it in existing
datacenters.
We believe that a promising approach is to directly
attach PM to the network to form disaggregated per-
sistent memory, or DPM. A DPM device only needs a
network interface, a hardware PM controller, and some
PM; it requires no server packaging or any processors.
Datacenter owners can use normal servers as compute
nodes (CNs) and store data in DPM.
The DPM model offers several key benefits. First, un-
like the alternative approach of attaching PM to a server,
DPMs can be integrated into current datacenters without
any disruption to existing servers. Second, without the
need for a processor or a server to host DPM, the mone-
tary and energy cost of DPM is low. Third, multiple CNs
can share one DPM device and one CN can store data
on multiple DPMs. Doing so enables better resource
packing than attaching and confining the usage of PM
to a single node [4, 14, 22, 23, 42]. Fourth, the DPM
model offers great elasticity, since DPMs can be freely
added, removed, and replaced. The number of CNs and
DPMs can scale independently. Finally, although access-
ing DPMs involves network communication, this cost is
becoming lower as datacenter network speed improves
quickly [43, 61].
Despite its benefits, the DPM model presents new
challenges. Without any processing power, all accesses to
DPMs have to come from the network as one-way operations.
DPMs also cannot perform any management tasks for their
own memory resources. Finally, each DPM can fail
independently from CNs or other DPMs. These unique
issues of DPMs have not been addressed in traditional
distributed storage or distributed memory systems.
To confront the challenges and to explore the design
tradeoffs of the DPM model, we propose three archi-
tectures of organizing DPMs (Figure 1(b) to 1(d)). The
first architecture, DPM-Direct, lets CNs directly access
DPMs with one-way operations. This architecture is
cheap to build. With a fully-distributed architecture,
DPM-Direct avoids any throughput bottleneck. How-
ever, it is hard and costly to synchronize concurrent
accesses to DPMs from multiple CNs [32]. It is also dif-
ficult for CNs to manage memory resources in DPMs.
The second architecture, DPM-Central, uses a central
server (the coordinator) to orchestrate the accesses from
CNs to DPMs and to manage DPM resources. CNs can
talk to the coordinator with two-way communication
(e.g., through RPC); the coordinator orchestrates concur-
rent CN requests and issues one-way accesses to DPMs.
The coordinator also stores all metadata and performs
metadata operations locally. Having the coordinator as
a central point makes it easy to manage DPMs and to
coordinate concurrent requests, but the coordinator can
become the performance bottleneck.
To remedy the performance and scalability limitations
of DPM-Direct and DPM-Central, we propose a third
architecture, DPM-Sep. The main idea of DPM-Sep
is to separate the data plane from the control plane.
On the data plane, CNs directly access DPMs. On the
control plane, we use a metadata server (MS) to handle
all metadata operations and manage DPM resources.
CNs talk to the MS with two-way communication to
fetch or update metadata. The MS makes it easy and
more efficient to perform control plane tasks, while not
being the performance bottleneck on the data path.
Based on these three architectures, we designed three
atomic, crash-consistent DPM data store systems. All
these systems provide the same guarantees that when
writing to a data entry, the data entry either has all new
data (if the write is successfully committed) or all old
data (if the write fails), and that CNs only read commit-
ted data (i.e., the read-committed isolation level). These
properties hold even when a DPM crashes during the
write and recovers afterwards.
On top of the DPM-Direct architecture, we built a
data store, DirectDS. To best fit DPM-Direct and pro-
vide good performance, we designed DirectDS with
two principles: reducing network round trips (RTTs)
between CNs and DPMs and avoiding frequent DPM
management tasks or metadata modifications. DirectDS
uses two spaces for each data entry, one to write uncom-
mitted new data and one to store committed data. Doing
so avoids the need for space allocation after a data entry
is created. DirectDS protects each data entry with a lock
stored in DPM and accessed with one-sided RDMA
operations. We employ techniques like error-detecting
code to further reduce RTTs.
On top of DPM-Central, we built CentralDS. Cen-
tralDS leverages the centralized coordinator to perform
space management, to store metadata, and to serve as
the serializing point for concurrent accesses. CNs send
RPC read/write requests to the coordinator, which uses
local locks to protect concurrent accesses, reads/writes
data to DPM, and updates metadata locally.
On top of DPM-Sep, we built SepDS. On the data
path, SepDS performs out-of-place writes that are simi-
lar to log-structured writes. We use a novel data structure
that enables CNs to efficiently locate the latest data en-
try without the need to communicate with the MS. For
the control path, the MS stores all metadata and CNs
cache hot metadata. We move all metadata operations
off the performance-critical path. To minimize the need for
CNs to communicate with the MS, CNs perform lazy,
asynchronous, batched reclamation of old data entries. We
also completely eliminate the need for the MS to
communicate with DPMs; it manages DPM space without
accessing them.
To sustain non-transient DPM failures, it is not
enough to store data in just one DPM. For each of the
three systems, we added the support of replication on
top of our single-copy designs. We also utilize the data
redundancy to provide better load balancing for reads —
we dynamically choose which DPM to replicate data to
based on the loads of each DPM.
We evaluated the three DPM data stores using real
servers as CNs, the coordinator, the MS, and DPM de-
vices, all connected with RDMA. We emulate PM using
DRAM on real machines; we perform an RDMA read to
ensure that data is written to the PM in DPMs [60]. We
perform a systematic, extensive set of experiments to
evaluate the latency, throughput, scalability, network
traffic, and CPU utilization of the three DPM data stores
using microbenchmarks and YCSB workloads [12, 71].
Our evaluation results not only confirm findings that
are easy to deduce from system designs (e.g., that Cen-
tralDS scales poorly with CNs and DPMs and that the
performance of SepDS is overall the best), but also re-
veal more subtle findings (e.g., that DirectDS only scales
well when there is no contention of concurrent accesses
and that SepDS’s good performance relies on CNs being
able to cache hot metadata). Based on our findings, we
summarize the tradeoffs of the three DPM data stores in
Table 1.
This paper makes the following contributions.
• We propose and compare three DPM architec-
tures.
• We built three DPM data stores. As far as we
know, these are the first set of publicly-described
and publicly-available DPM systems.
• We provide a detailed design to demonstrate how
to best separate data plane and control plane under
the DPM model.
• We performed extensive evaluation and learned
a set of new findings that can guide future DPM
research.
The source code of all our DPM systems will be
publicly available soon.
2 Using PM in Datacenters
Non-volatile memory (NVM) technologies such as 3D XPoint [28],
phase change memory (PCM), spin-transfer torque magnetic
memories (STTMs), and the memristor provide byte
addressability, persistence, and latency that is within an
order of magnitude of DRAM [25, 36–38, 46, 56, 62, 70].
NVMs can attach directly to the main memory bus, and we
call such NVMs Persistent Memory, or PM, in this paper.
PM is a disruptive
technology poised to radically alter the landscape of
memory and storage technologies. It has attracted an
extensive amount of research effort over the past years,
most of which targeted single-node
environments [10, 11, 15, 17, 35, 45, 51, 54, 66, 68].
Despite these successful prior research efforts,
there are at least two remaining challenges to be solved
before PMs can be readily used in datacenters. First,
in datacenter environments, PMs should support dis-
tributed applications. When using PMs to store persis-
tent data, they have to provide high availability and
reliability (i.e., sustain node failures). Unfortunately,
there is only limited work in the distributed PM
research space [41, 59, 73]. So far, distributed PM
systems [41, 59] have all taken a model where each node in
a cluster includes some amount of PM used to store data
that can be accessed both locally and by other nodes
(Figure 1(a)).
Figure 1. PM Organization Comparison. Panels: (a) Distributed PM, (b) DPM-Direct, (c) DPM-Central, (d) DPM-Sep. Blue bars indicate two-way communication and pink ones indicate one-way communication.
Bars with both blue and pink mean support for both.
Second, it is not clear how to deploy PMs in existing
datacenters. The distributed PM model requires either
integrating PM into existing servers or purchasing new
servers to host PM. Since PMs attach to the main memory
bus, existing servers can host PM only when they have
empty DIMM slots. On the other hand,
purchasing new servers just to host PM can waste other
resources in the new servers. Moreover, applications
that desire to use PM can only run on these new servers.
With these challenges, we believe that we should seek
new ways to use and deploy PM in datacenters that are
flexible, cost-effective, reliable, and can perform well.
3 Disaggregated PM
Similar to disaggregated memory [39, 40] and other re-
source disaggregation systems [3, 24, 58], disaggregated
PM is an architecture that attaches PM devices directly
to the network and lets servers (CNs) access them across
the network. These PM devices do not have any local
processing units and only have a hardware controller
and a network interface (we simply call a disaggregated
PM device a DPM in this paper). The DPM model orga-
nizes DPMs as a pool of PM resources that can be used
by any CNs. A CN can store data on multiple DPMs
and one DPM can host data for multiple CNs.
The DPM model offers a cost-effective way to deploy
PM in datacenters. Without any processor or machine
packaging, DPMs are cheap to build. They can easily
integrate into existing datacenters without disruption
to existing servers. The DPM model also shares many
benefits with other resource disaggregation
proposals [20, 58]: it offers high resource packing
efficiency, since data can be allocated at any DPM;
datacenters can grow DPMs independently of other servers;
it is easy to add, remove, and upgrade DPMs; and DPMs
can fail independently without affecting other servers.
However, building an efficient DPM data store system
is not easy. A major technical hurdle is the complete
lack of computation power at DPMs. Different from tra-
ditional distributed storage and memory systems, DPMs
can only be accessed and managed remotely. It is
especially hard to provide good performance with con-
current data accesses. In addition, DPMs can fail inde-
pendently and such failures have to be handled properly
to ensure data reliability and high availability.
4 DPM Data Stores
This section first describes the interface of all our DPM
data stores and their common features. We then present
the three data stores, DirectDS, CentralDS, and SepDS.
Finally, we discuss failure handling and load balancing
in these data stores.
4.1 System Interface and Overview
To confront the challenges of DPM, we propose three
architectures of DPM and built three data stores on top
of these architectures. Figure 2 illustrates the read and
write operation flow of these systems and Table 1 sum-
marizes the tradeoffs of these systems. We will explain
Figure 2 and Table 1 in detail in §4.2 to §4.4.
Interface and guarantees. The current data model that
our three data stores support is a simple key-value store,
but these systems can be extended to other data models.
Users can create, read (get), write (put), and delete a
key-value entry. Different CNs can have shared access
to the same data. We manage the consistency of concur-
rent data accesses in software instead of relying on any
hardware-provided coherence like [8, 21, 50].
All our DPM data stores ensure atomicity of an en-
try across concurrent readers and writers. A successful
write indicates that the data is committed (atomically),
and reads only see committed values. We choose
single-key atomic writes and read committed because these
consistency and isolation levels are widely used in many
data store systems and can be extended to other levels.
Since our DPM systems store persistent data, it is im-
portant to provide data reliability and high availability.
Our DPM systems guarantee the consistency of data
when crashes happen. After restart, each data entry is
guaranteed to either only have new data values or old
ones. In addition, all three of our systems provide
replication across DPMs to ensure that data is still
available even after losing N − 1 DPMs (when the degree
of replication is N).
System      Cost    R-RTT  W-RTT(rep)  Scalability  Data    Metadata  Performance
DirectDS    low     3      6(6)        w/ DPM†      large   large     OK write performance when no contention
DirectDS-C  low     1      6(6)        w/ DPM†      large   large     Best for small-sized read, not good otherwise
CentralDS   high    2      3(3)        Neither      small   small     Best for small-scale writes, not good for reads
SepDS       medium  1      3(4)        w/ both      small*  medium    Good overall when CNs can cache hot metadata
Table 1. Comparison of DPM Data Stores. The Cost column represents the energy and monetary cost to build the respective DPM data stores.
The R-RTT and W-RTT(rep) columns show the number of RTTs required to perform a read and a write (with replication). All RTT values are
measured when there is no contention. The Scalability column shows if a system is scalable with the number of CNs, the number of DPMs, both, or
neither. † Only scalable when there is no contention. The Data and Metadata columns show the space needed to store a data entry and its metadata.
* Under the common scenario where reclamation can keep up with the speed of foreground writes.
Network layer. We choose RDMA as the network layer
that connects all servers and DPMs, but most of our
designs are applicable to other network systems that
can perform both one-way and two-way communica-
tion. We use RDMA’s RC (Reliable Connection) mode
which supports one-sided RDMA operations and en-
sures lossless and ordered packet delivery. Similar to
prior solutions [15, 65], we solve RDMA’s scalability
issues using huge memory pages or physical memory to
register memory regions with RDMA NICs.
Ensuring data persistence. For data to be persistent in
DPM, it is not enough to just perform a remote write.
After a remote write (e.g., RDMA write), the data can be
in the NIC, the PCIe hub, or PM. Only when the data is written
to PM can it survive a power failure. To ensure this data
persistence, we follow the guidance of SNIA [60] and
Mellanox [26, 57] by performing a remote read to ensure
that data is actually in PM. Since we use RDMA RC
which guarantees ordered data delivery and PCIe also
follows ordering [53], we only read the last byte of a
data entry to verify its persistence [60].
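The write-then-read-back rule above can be illustrated with a toy model.
This is only a sketch under stated assumptions: ToyDpm, its staging list,
and persistent_write are hypothetical names, the staging list stands in
for NIC and PCIe buffers, and the model simply assumes that an ordered
read forces earlier writes into PM. It does not use any real RDMA API.

```python
class ToyDpm:
    """Toy model of a DPM device (hypothetical, not real RDMA).

    Writes first sit in a volatile staging list that stands in for
    NIC/PCIe buffers; only data copied into `pm` survives power loss.
    """

    def __init__(self, size):
        self.pm = bytearray(size)      # models the persistent media
        self.in_flight = []            # (offset, data) not yet in PM

    def rdma_write(self, offset, data):
        self.in_flight.append((offset, bytes(data)))

    def rdma_read(self, offset, length):
        # Assumed ordering: earlier writes land in PM before the read is served.
        for off, data in self.in_flight:
            self.pm[off:off + len(data)] = data
        self.in_flight.clear()
        return bytes(self.pm[offset:offset + length])

    def power_failure(self):
        self.in_flight.clear()         # volatile buffers are lost


def persistent_write(dpm, offset, data):
    """Write, then read back only the last byte to confirm persistence."""
    dpm.rdma_write(offset, data)
    last = dpm.rdma_read(offset + len(data) - 1, 1)
    return last == data[-1:]


dpm = ToyDpm(16)
dpm.rdma_write(0, b"lost")             # write without a read-back
dpm.power_failure()
lost = bytes(dpm.pm[0:4])              # the write never reached PM

ok = persistent_write(dpm, 0, b"kept") # write followed by last-byte read
dpm.power_failure()
kept = bytes(dpm.pm[0:4])              # the write survives the failure
```

Reading a single byte keeps the validation cheap; it relies on the
ordering guarantees described above, which the toy model hard-codes.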
4.2 Direct Connection
The DPM-Direct architecture (Figure 1(b)) connects
CNs directly to DPMs. CNs perform un-orchestrated,
direct accesses to DPMs using RDMA one-sided oper-
ations. Under DPM-Direct, performing metadata and
control operations from CNs is hard and costly (e.g., by
performing distributed coordination across CNs). Thus,
we made two design choices when building DirectDS.
First, we use two spaces for each data entry, one to
store committed data where reads go to (we call it the
committed space) and one to store in-flight, new data
(un-committed space). Doing so avoids dynamic space
allocation and de-allocation. Second, to avoid reading
and writing metadata from DPMs and the cost of en-
suring metadata consistency under concurrent accesses,
CNs in DPM-Direct locally store all the metadata of
key-value entries, including the key of a value and the
location of its committed and uncommitted spaces.
The only distributed coordination across CNs needed
in DPM-Direct is during the creation and deletion of a
key-value entry. We currently use Memcached [19] as a
metadata server to assist entry creation and deletion, but
other distributed consensus systems can also work.
Figure 2(a) illustrates the read and write protocol of
DirectDS. DirectDS uses locks to isolate data entries
from concurrent read and write accesses. Each entry
has its own lock and we associate an 8-byte value at the
beginning of each data entry to implement its lock. A
CN performs a one-sided RDMA c&s (compare-and-
swap) operation to the value to acquire the lock (e.g.,
comparing whether the value is 0 and if so setting it
to 1). To release the lock, the CN simply performs an
RDMA write and sets the value to 0.
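The locking scheme above can be sketched as follows. This is a minimal
simulation, not the DirectDS implementation: DpmWord, acquire, and
release are hypothetical names, and a Python threading.Lock stands in
for the NIC serializing atomic operations on the 8-byte word.

```python
import threading


class DpmWord:
    """Toy stand-in for the 8-byte lock word of a data entry in DPM."""

    def __init__(self):
        self.value = 0
        self._nic = threading.Lock()   # models the NIC serializing atomics

    def compare_and_swap(self, expected, new):
        """Set value to `new` if it equals `expected`; return the old value."""
        with self._nic:
            old = self.value
            if old == expected:
                self.value = new
            return old

    def write(self, new):
        with self._nic:
            self.value = new


def acquire(word):
    """Retry c&s(0 -> 1) until it succeeds; return the number of attempts."""
    attempts = 1
    while word.compare_and_swap(0, 1) != 0:
        attempts += 1
    return attempts


def release(word):
    word.write(0)                      # a plain write of 0 releases the lock


lock_word = DpmWord()
shared = {"count": 0}


def worker(iterations):
    for _ in range(iterations):
        acquire(lock_word)
        shared["count"] += 1           # critical section
        release(lock_word)


threads = [threading.Thread(target=worker, args=(200,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every failed c&s in acquire corresponds to one wasted network round
trip in the real system, which is why contention is costly in this design.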
Our lock implementation leverages the unique feature
of the DPM model that all memory accesses to DPMs
come from the network (i.e., the NIC). Without processor
accesses to memory, DMA guarantees that network
atomic operations like c&s are atomic [13, 64]. Note
that an RDMA c&s operation to an in-memory value
which can also be accessed locally at the same time does
not guarantee the atomicity of the value [13, 44, 67], and
thus it cannot be used in distributed PM systems in the
same way.
Read. To read a data entry, a CN uses its stored meta-
data to find the location of the data entry’s committed
space (and the first 8-byte lock). It first acquires a lock,
then performs an RDMA read, and finally releases the
lock. Locking reads ensures that CNs will not read
intermediate values during concurrent writes. The read latency
is 3 RTTs when there is no contention, with one RTT
used for the data read. Under contention, the c&s operation
fails and the CN keeps retrying until it succeeds.
Write. To write a data entry, a CN first locates the
entry and locks it. Afterwards, the CN writes the new
data to the un-committed space. To sustain crashes, the
CN issues an RDMA read to the last byte of the un-
committed space to validate that it is actually written to
PM. This uncommitted data serves as the redo copy that
will be used during recovery if a crash happens. The CN
then writes the new data to the committed space with an
RDMA write and validates it with an RDMA read. At
the end, the CN releases the lock. The total write latency
is 6 RTTs (without contention), two of which involve
data read/write.
Figure 2. Read/Write Protocols of DPM Systems. Panels: (a) DirectDS, (b) DirectDS-C, (c) CentralDS, (d) SepDS. Legend: RL = reader lock, WL = writer lock, U = unlock, C = CRC calculation, US = update shortcut, UM = update metadata.
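The uncontended DirectDS read and write sequences described above can
be replayed in a small simulation that counts round trips. This is a
sketch only: Entry and ComputeNode are hypothetical names, each method
call that would cross the network increments a counter, and crash
recovery from the redo copy is not modeled.

```python
class Entry:
    """Toy DirectDS entry in DPM: a lock word plus two fixed spaces."""

    def __init__(self, value=b""):
        self.lock = 0
        self.committed = value
        self.uncommitted = b""


class ComputeNode:
    """Hypothetical CN that counts one network round trip per remote op."""

    def __init__(self):
        self.rtts = 0

    def _lock(self, entry):
        while True:
            self.rtts += 1             # RDMA c&s on the lock word
            if entry.lock == 0:
                entry.lock = 1
                return

    def _unlock(self, entry):
        self.rtts += 1                 # RDMA write of 0
        entry.lock = 0

    def read(self, entry):
        self._lock(entry)
        self.rtts += 1                 # RDMA read of the committed space
        value = entry.committed
        self._unlock(entry)
        return value

    def write(self, entry, value):
        self._lock(entry)
        self.rtts += 1                 # write the redo copy
        entry.uncommitted = value
        self.rtts += 1                 # read last byte to validate persistence
        self.rtts += 1                 # write the committed space
        entry.committed = value
        self.rtts += 1                 # read last byte to validate persistence
        self._unlock(entry)


cn = ComputeNode()
entry = Entry(b"old")
cn.write(entry, b"new")
write_rtts = cn.rtts                   # lock + 2 writes + 2 validations + unlock
value = cn.read(entry)
read_rtts = cn.rtts - write_rtts       # lock + read + unlock
```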
Avoiding read locks with CRC. DirectDS uses locks to
ensure the read-committed isolation level at the cost of
two RTTs to acquire and release the lock for each read.
Instead of a lock, we can use an error-detecting code for
each data entry to detect incomplete data. DirectDS-C
(Figure 2(b)) uses a CRC for this purpose.
To perform a read, a CN simply issues an RDMA read
to fetch the data and then calculates and validates its
CRC. Thus, the read latency of DirectDS-C is one RTT
plus the CRC calculation time. Writes in DirectDS-C are
similar to those in DirectDS, except that before writing the new
data, the CN needs to first calculate and attach a CRC
to the new data entry.
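A lock-free read validated by a CRC can be sketched as below. The
entry layout (a 4-byte CRC32 prefix) and the names pack_entry and
unpack_entry are assumptions made for illustration; the paper does not
specify the CRC width or where it is stored.

```python
import struct
import zlib


def pack_entry(value: bytes) -> bytes:
    """Writer side: prefix the value with its CRC32 (assumed layout)."""
    return struct.pack(">I", zlib.crc32(value)) + value


def unpack_entry(raw: bytes):
    """Reader side: return the value if the CRC matches, else None.

    A None result means the reader saw an incomplete entry and should
    fetch it again.
    """
    if len(raw) < 4:
        return None
    (stored,) = struct.unpack(">I", raw[:4])
    value = raw[4:]
    return value if zlib.crc32(value) == stored else None


old = pack_entry(b"A" * 16)
new = pack_entry(b"B" * 16)

# A reader races with an in-place overwrite and sees the new entry
# with its last byte still holding the old content.
torn = new[:-1] + old[-1:]

complete_read = unpack_entry(new)      # CRC matches the payload
torn_read = unpack_entry(torn)         # CRC mismatch, reader must retry
```

The check costs CPU time proportional to the entry size, which matches
the observation that this variant works best for small reads.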
Discussion. As we will see in §5, as expected, DPM-
Direct data stores scale well when there is no contention
of concurrent accesses to data entries. More surprising is
that they scale very poorly when contention happens,
especially for writes. In general, write performance is
not good because of the high number of RTTs. But writes perform
especially poorly under contention, because multiple
CNs all try to acquire the lock with the c&s
operation and most of them experience many c&s
failures. However, DirectDS-C yields the best read
performance when the read size is small, since it only requires
one lock-free RTT and a CRC over a small entry is fast to compute.
DPM-Direct systems also require large space for both
data and metadata. For each data entry, they double the
space because of the need to store two copies of data.
The metadata overhead is also high, since CNs have to
store all metadata.
4.3 Connecting Through Coordinator
Most limitations of DPM-Direct come from the fact that
there is no central coordination of data, metadata, or
management operations. Specifically, DPM-Direct sys-
tems have to write data twice, once to the un-committed
and once to the committed space, because CNs in DPM-
Direct only know a fixed location to read committed
data. The DPM-Central architecture (Figure 1(c)) takes
the opposite design choice and uses one coordinator
to orchestrate all data accesses and to perform meta-
data and management operations. All CNs send RPC
requests to the coordinator (we use the HERD [30, 31]
RPC system for this purpose). The coordinator han-
dles RPC requests by performing read/write requests to
DPMs. To improve application throughput, we use mul-
tiple threads at the coordinator to handle RPC requests.
Since all requests go through the coordinator, it can
serve as the serialization point for concurrent accesses
to a data entry. We simply use a local read/write lock
for each data entry at the coordinator as the synchro-
nization of multiple coordinator threads. In addition to
orchestrating data accesses, the coordinator performs
all space allocation and de-allocation of data entries.
The coordinator uses its local PM to persistently store
all the metadata for a data entry including its key, its
location, and a read/write lock. With the coordinator
handling all read requests, it can freely direct a read to
the latest location of committed data. Thus, it does not
need to maintain the same location for committed data
and can change the location of committed data after each
write.
Read. To perform a read, a CN sends an RPC read
request to the coordinator. The coordinator finds the
location of the entry’s committed data using its local
metadata, acquires its local lock of the entry, reads the
data from the DPM using an RDMA read, releases the
lock, and finally replies to the CN’s RPC request. The
total read latency (from the CN’s perspective) is 2 RTTs,
both containing data.
Write. After receiving a write RPC request from a CN,
the coordinator allocates a new space in a DPM for the
new data. It then writes the data and validates it with an
RDMA read. Note that we do not need to lock (either
at coordinator or at DPM) during this write, since it is
an out-of-place write to a location that is not exposed to
any other coordinator RPC handlers.
After successfully verifying the write, the coordina-
tor updates its local metadata of where the committed
version of the data entry is and flushes this new meta-
data to its local PM for crash recovery (by performing
CPU cache flushes and memory barrier instructions [9]).
Since concurrent coordinator RPC handlers can update
the same information of where the latest data entry is,
we use a local lock to protect this metadata change. The
total write latency without contention is 3 RTTs, with
two of them containing data and one for validation.
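The coordinator's out-of-place write path can be sketched as follows.
This is a simplified illustration, not the CentralDS code: Coordinator
and its fields are hypothetical, a dict stands in for DPM space, a
single metadata lock replaces the per-entry read/write locks, and the
persistence flush of metadata to local PM is only noted in a comment.

```python
import threading


class Coordinator:
    """Toy CentralDS coordinator; `dpm` is a dict standing in for DPM."""

    def __init__(self):
        self.dpm = {}                  # address -> bytes (models remote PM)
        self.location = {}             # key -> address of committed version
        self.meta_lock = threading.Lock()
        self.alloc_lock = threading.Lock()
        self.next_addr = 0

    def _allocate(self):
        with self.alloc_lock:
            addr = self.next_addr
            self.next_addr += 1
            return addr

    def handle_write(self, key, value):
        """Return the address of the superseded version (None if new key)."""
        addr = self._allocate()        # fresh space, not visible to readers
        self.dpm[addr] = value         # RDMA write plus read validation
        with self.meta_lock:           # serialize the metadata switch
            old = self.location.get(key)
            self.location[key] = addr  # the real system also flushes this to PM
        return old

    def handle_read(self, key):
        with self.meta_lock:
            addr = self.location[key]
            return self.dpm[addr]


co = Coordinator()
first_old = co.handle_write("k", b"v1")
second_old = co.handle_write("k", b"v2")
latest = co.handle_read("k")
```

Because the data write targets unexposed space, only the short
metadata update needs a lock, which is the property the text relies on.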
Discussion. CentralDS largely reduces write RTTs over
DirectDS and thus has good write performance when the
scale of the cluster is small. However, from our experi-
ments, the coordinator soon becomes the performance
bottleneck when either the number of CNs increases or
the number of DPMs increases. CentralDS’s read perfor-
mance is also worse than DirectDS-C with the extra hop
between a CN and the coordinator. In addition, the CPU
utilization of the coordinator is high, since it needs to
have a large number of RPC handlers to sustain parallel
requests from CNs (§5). However, unlike in DPM-Direct,
CNs in the DPM-Central architecture do not need to
store any metadata.
Figure 3. SepDS System Design.
4.4 Separating Data and Control
The main issue with DPM-Direct is its poor write perfor-
mance. CentralDS improves the write performance but
suffers from the scalability bottleneck of the central co-
ordinator. To solve these problems of the first two DPM
architectures, we propose a third architecture, DPM-Sep
(Figure 1(d)), and a data store designed for it, SepDS.
The main idea of DPM-Sep is to separate the data plane
from the control plane. It lets CNs directly access DPMs
for all data operations and uses a metadata server (MS)
for all control plane operations.
The MS stores metadata of all data entries in its local
PM. We keep the amount of metadata small, and 1 TB
of PM (a conservative estimation of the size of PM
a server can host) can store metadata for 64 TB of data
at the granularity of 1 KB per data entry. CNs cache
metadata of hot data entries; under memory pressure,
CNs will evict metadata according to an eviction policy
(we currently support FIFO and LRU).
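The sizing claim above can be checked with a short calculation. The
per-entry figure below is derived here and is not stated in the paper;
it assumes binary units (1 TB = 2^40 bytes, 1 KB = 2^10 bytes).

```python
TB = 2**40                             # assumed binary terabyte
KB = 2**10                             # assumed binary kilobyte

data_bytes = 64 * TB                   # data covered by the metadata
entry_size = 1 * KB                    # granularity of one data entry
metadata_bytes = 1 * TB                # PM available at the MS

entries = data_bytes // entry_size     # number of data entries
bytes_per_entry = metadata_bytes / entries
```

Under these assumptions the MS has a budget of 16 bytes of metadata
per 1 KB data entry, that is, a 1:64 metadata-to-data ratio.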
SepDS aims to deliver scalable, good performance at
the data plane and to avoid the MS being the bottleneck
at the control plane. Our overall approaches to achieve
these design goals include: 1) moving all metadata
operations off the performance-critical path, 2) using lock-free
data structures to increase scalability, 3) employing
optimization mechanisms to reduce network round trips
for data accesses, and 4) leveraging the unique atomic
data access guarantees of DPM. Figure 3 illustrates the
data structures used in SepDS.
4.4.1 Data Plane
To achieve our data plane design goal, we propose a
new mechanism to perform lock-free, fast, and scalable
reads and writes. The basic idea is to allow multiple
committed versions of a data entry in DPMs and to link
them into a chain. Each committed write to a data entry
will move its latest version to a new location. To avoid
the need to update CNs with the new location, we use a
self-identifying data structure that lets CNs find
the latest version.
We include a header with each version of a data entry,
which contains a pointer and some metadata bits used
for garbage collection. The pointers chain all versions
of a data entry together in the order that they are written.
A NULL pointer indicates that the version is the latest.
A CN acquires the header of the chain head from the
MS at the first access to a data entry. It then caches the
header locally to avoid the overhead of contacting the MS
on every data access. As a CN reads or writes an entry, it
advances its cached header. We call a CN-cached header
a cursor.
Read. SepDS reads are lock-free. To read a data entry,
the CN performs a chain walk. The chain walk begins
with fetching the data entry its current cursor points to.
It then follows the pointer in the following entries until
it reaches the last entry. All steps in the chain walk use
one-sided RDMA reads. After a chain walk, the CN
updates its cursor to the last entry.
A chain walk can be slow with long chains when a
cursor is not up to date [69]. Inspired by skip-list [55],
we propose to solve this issue by using a shortcut to
directly point to a newer entry. The shortcut of a data
entry is stored in DPM and the location of the shortcut
never changes during the lifetime of the data. The MS stores
the locations of all shortcuts and CNs cache the hot ones.
Shortcuts are best effort in that they are intended but not
enforced to always point to the last version of an entry.
The CN issues a chain walk read and a shortcut read
in parallel. It returns to the user when the faster one returns
and discards the other result. Note that we do not re-
place chain walks completely with shortcut reads, since
shortcuts are updated asynchronously in the background
and may not be updated as fast as the cursor. When the
CN has a pointer that points to the latest version of data,
a read only takes 1 RTT.
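The chain walk can be sketched as below, counting one round trip per
one-sided read. ToyDpm and chain_walk are hypothetical names, entries
are modeled as (next pointer, data) pairs with None as the NULL
pointer, and the parallel shortcut read is left out for brevity.

```python
class ToyDpm:
    """Toy DPM holding entry chains; counts one RTT per one-sided read."""

    def __init__(self):
        self.mem = {}                  # address -> (next_ptr, data)
        self.rtts = 0

    def read(self, addr):
        self.rtts += 1
        return self.mem[addr]


def chain_walk(dpm, cursor):
    """Follow next pointers from `cursor`; return (tail address, data)."""
    addr = cursor
    while True:
        next_ptr, data = dpm.read(addr)
        if next_ptr is None:           # NULL pointer marks the latest version
            return addr, data
        addr = next_ptr


dpm = ToyDpm()
dpm.mem = {0: (1, b"v0"), 1: (2, b"v1"), 2: (None, b"v2")}

cursor, value = chain_walk(dpm, 0)     # stale cursor: walks three entries
stale_rtts = dpm.rtts

cursor, value_again = chain_walk(dpm, cursor)   # cursor now at the tail
fresh_rtts = dpm.rtts - stale_rtts
```

The gap between the two RTT counts is what the shortcut is meant to
close when a CN's cursor has fallen behind.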
Write. SepDS never overwrites existing data entries and
performs a lock-free out-of-place write before linking
the new data to an entry chain. To write a data entry, a
CN first selects a free DPM buffer assigned to it by the MS
in advance (see § 4.4.2). It performs a one-sided RDMA
write to write the new data to this buffer and then issues
a read of the last byte to ensure that the data is written
in PM. Afterwards, the CN performs an RDMA c&s
operation to link this new entry to the tail of the entry
chain. Specifically, the c&s operation is on the header
that the CN’s cursor points to. It checks whether the pointer in
the header is NULL and, if so, swaps the pointer to point to
the new entry. If the c&s succeeds, we treat this data as
committed and return the write request to the user. If the
pointer is not NULL, it means that the cursor does not
point to the tail of the chain and we will do a chain walk
to reach the tail and then do another c&s.
Afterwards, the CN uses a one-sided RDMA write
to update the shortcut of the entry to point to the new
data entry. This step is off the performance critical path.
The CN also updates its cursor to the newly written data
entry. We do not invalidate or update other CNs’ cursors
at this time to improve the scalability and performance
of SepDS.
SepDS’ chained structure and write mechanism en-
sure that writers do not block readers and readers do not
block writers. They also ensure that readers can only
view committed data. Without high write contention to
the same data entry, one write takes only 3 RTTs.
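The write protocol can be sketched in the same toy model. The sketch below is an illustration under stated assumptions: `cas_next` is a hypothetical stand-in for the RDMA c&s on the header pointer, a dictionary plays the role of the DPM, and the round-trip counter simply adds one per write, read, c&s attempt, or chain-walk hop. The asynchronous shortcut update is omitted.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Entry:
    data: bytes
    next: Optional[int] = None  # header pointer to the next version; None at the tail


def cas_next(dpm: Dict[int, Entry], addr: int, expected: Optional[int], new: int) -> bool:
    """Stand-in for an RDMA c&s on the next pointer in the header at addr."""
    if dpm[addr].next == expected:
        dpm[addr].next = new
        return True
    return False


def write(dpm: Dict[int, Entry], cursor: int, free_addr: int, data: bytes) -> Tuple[int, int]:
    """Append data out of place; return (new cursor, round trips used)."""
    dpm[free_addr] = Entry(data)  # RTT 1: one-sided write into a pre-assigned free buffer
    _ = dpm[free_addr].data[-1:]  # RTT 2: read the last byte back to confirm it reached PM
    rtts, tail = 2, cursor
    while not cas_next(dpm, tail, None, free_addr):  # a failed c&s costs one RTT
        rtts += 1
        while dpm[tail].next is not None:  # stale cursor: walk to the tail
            tail = dpm[tail].next
            rtts += 1
    return free_addr, rtts + 1  # count the successful c&s


dpm = {0: Entry(b"a")}
first = write(dpm, cursor=0, free_addr=1, data=b"b")   # cursor is at the tail
second = write(dpm, cursor=0, free_addr=2, data=b"c")  # another CN with a stale cursor
```

The first write completes in 3 round trips. The second writer's c&s fails because its cursor no longer points to the tail, so it pays for one failed c&s, one chain-walk hop, and a second c&s.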
Retire. After committing a write, a CN can retire the
old data entry, indicating that the entry space can be
reclaimed. To improve performance and minimize the
need to communicate with the MS, CNs perform lazy,
asynchronous, batched retirement of old data entries in
the background. We further avoid the need for MS to
invalidate CN-cached metadata using a combination of
timeout and epoch-based garbage collection.
4.4.2 Control Plane
CNs communicate with the MS using two-sided oper-
ations for all metadata operations. The MS performs
all types of management of DPMs. It manages the physical
memory space of DPMs and stores the location and shortcut
of each data entry. We carefully designed these MS
functionalities to achieve good performance and scalability.
Space allocation. With the data plane out-of-place
write model, SepDS has high demand for DPM space
allocation. We use an efficient space allocation mecha-
nism where MS packages free space of all DPMs into
chunks. Each chunk hosts the same size of data entries
and different chunks can have different data sizes, simi-
lar to FaRM [15] and Hoard [6]. Instead of asking for
a new free entry before every write, each CN requests
multiple entries at a time from the MS in the background.
This approach moves space allocation off the critical
path of writes and is important to deliver good write
performance.
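A minimal sketch of batched, size-class-based allocation follows. The class names, the `(size class, index)` buffer representation, and the refill-when-empty policy are assumptions made for illustration; the text only states that chunks hold same-size entries and that CNs request multiple entries at a time in the background.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

Buffer = Tuple[int, int]  # (size class in bytes, index within that class)


class MetadataServer:
    """Keeps one free list per size class (a hypothetical, simplified layout)."""

    def __init__(self, size_classes: List[int], entries_per_class: int) -> None:
        self.free: Dict[int, Deque[Buffer]] = {
            size: deque((size, i) for i in range(entries_per_class))
            for size in sorted(size_classes)
        }

    def size_class(self, nbytes: int) -> int:
        for size in self.free:  # dicts preserve insertion order, so smallest first
            if nbytes <= size:
                return size
        raise ValueError("no size class large enough")

    def allocate_batch(self, nbytes: int, count: int) -> List[Buffer]:
        free = self.free[self.size_class(nbytes)]
        return [free.popleft() for _ in range(min(count, len(free)))]


class ComputeNode:
    def __init__(self, ms: MetadataServer, batch: int = 4) -> None:
        self.ms, self.batch = ms, batch
        self.local: Dict[int, Deque[Buffer]] = {}
        self.ms_requests = 0

    def take(self, nbytes: int) -> Buffer:
        """Return a free buffer, asking the MS only when the local pool is empty."""
        size = self.ms.size_class(nbytes)
        pool = self.local.setdefault(size, deque())
        if not pool:  # a real CN would refill in the background before running dry
            pool.extend(self.ms.allocate_batch(nbytes, self.batch))
            self.ms_requests += 1
        return pool.popleft()


ms = MetadataServer(size_classes=[64, 256], entries_per_class=8)
cn = ComputeNode(ms, batch=4)
buffers = [cn.take(100) for _ in range(5)]
```

Five 100-byte writes fall into the 256-byte class and trigger only two requests to the MS, instead of one per write.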
Garbage collection. SepDS’ append-only chained data
structure makes its writes very fast. But like all other
append-only or log-structured data stores, SepDS needs
to garbage collect (GC) old data. We designed a new
efficient GC mechanism that does not involve any data
movement or communication to DPM and minimizes
the communication between MS and CNs.
The basic flow of GC is simple: the MS keeps busy
checking and processing incoming retire requests from
CNs. The MS decides when a data entry can be
reclaimed and puts a reclaimed entry to a free list (FreeList).
It gets free entries from this list when CNs request
more free buffers. A reclaimed entry can be used by
any CN for any new entry, as long as the size fits.
Although the above strawman GC implementation is
simple, making GC work correctly, efficiently, and scale
well is challenging. First, to achieve good GC perfor-
mance, we avoid the invalidations of CN cached cursors
after reclaiming entries so as to minimize the network
traffic between the MS and CNs. However, with the
strawman GC implementation, CNs’ outdated cursors
can cause failed chain walks. We solve this problem us-
ing two techniques: 1), the MS does not clear the header
(or the content) of a data entry after reclaiming it, and
2), we assign a GC version to each data entry. The MS
increases the GC version number after reclaiming a data
entry. It gives this new GC version together with the
location of the entry when assigning the entry as a new
free buffer to a CN, A. Before CN A uses the entry for
its new write, the entry content at the DPM still has old
header and data (with old GC version). Other CNs that
have cached cursors to this entry can thus still use the
old pointer to perform a chain walk. A CN determines
whether an entry is its intended data or has already been
reclaimed and reused for other data by comparing the GC
version in its cached cursor with the one it reads from the
DPM. After CN A writes the new data with the new GC
version number, other CNs that have old cursors will
see a mismatched GC version, discard the entry,
and invalidate their cursors. Doing so not only avoids
the need for MS to invalidate cursor caches on CNs, but
also eliminates the need for MS to access DPMs during
GC.
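The GC version check can be sketched as follows. This is a toy illustration with hypothetical names (`Entry`, `Cursor`, `read_via_cursor`); the real header layout is not modeled.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Entry:
    gc_version: int
    data: bytes


@dataclass
class Cursor:
    addr: int
    gc_version: int  # GC version the CN saw when it cached this cursor


def read_via_cursor(dpm: Dict[int, Entry], cursor: Cursor) -> Optional[bytes]:
    """Return the data if the entry is still the one the cursor was cached for."""
    entry = dpm[cursor.addr]
    if entry.gc_version != cursor.gc_version:
        return None  # entry was reclaimed and reused; the CN drops this cursor
    return entry.data


dpm = {5: Entry(gc_version=3, data=b"old")}
cursor = Cursor(addr=5, gc_version=3)

# The MS reclaims entry 5 and bumps its own record to version 4, but it does not
# touch the DPM, so the stale cursor still reads the old, intact entry.
ms_version = {5: 4}
before_reuse = read_via_cursor(dpm, cursor)

# CN A reuses the buffer and writes new data tagged with the new GC version.
dpm[5] = Entry(gc_version=ms_version[5], data=b"new")
after_reuse = read_via_cursor(dpm, cursor)
```

Before the buffer is reused the stale cursor remains usable; after reuse the version mismatch tells the CN to discard the entry without any message from the MS.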
The next challenge is related to our targeted guaran-
tee of read isolation and atomicity (i.e., readers should
always read the data that is consistent with its metadata
header). An inconsistent read can happen if the read to
a data entry takes long and during the reading time, this
entry has been reclaimed and used to write a new data
entry. We use a read timeout scheme similar to [15]. CNs
abort a read operation after T_r, a value agreed upon by
CNs and the MS. The MS delays the actual reclamation
of an entry until T_r after it receives the retire
request for the entry. Specifically, the MS leaves the entry
in a ToGCList for T_r and then moves it to the FreeList.
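The delayed reclamation can be sketched as a queue that holds retired entries for the read timeout before releasing them. The class and function names are hypothetical, times are in arbitrary units, and the sketch assumes retire requests arrive in time order.

```python
from collections import deque
from typing import Deque, List, Tuple


class Reclaimer:
    """MS-side bookkeeping for retired entries (times are in arbitrary units)."""

    def __init__(self, t_r: float) -> None:
        self.t_r = t_r
        self.to_gc: Deque[Tuple[float, int]] = deque()  # (retire time, entry address)
        self.free: List[int] = []

    def retire(self, addr: int, now: float) -> None:
        self.to_gc.append((now, addr))

    def tick(self, now: float) -> None:
        """Move entries that have waited at least the read timeout to the free list."""
        while self.to_gc and now - self.to_gc[0][0] >= self.t_r:
            self.free.append(self.to_gc.popleft()[1])


def read_allowed(start: float, now: float, t_r: float) -> bool:
    """A CN aborts any read that has been running for the read timeout or longer."""
    return now - start < t_r


ms = Reclaimer(t_r=10.0)
ms.retire(addr=7, now=0.0)
ms.retire(addr=8, now=4.0)
ms.tick(now=9.0)
free_at_9 = list(ms.free)
ms.tick(now=10.0)
free_at_10 = list(ms.free)
```

Because any read that started before the retire request is aborted by the time the entry reaches the free list, a reused buffer cannot be observed by a read of the old version.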
The final challenge is the overflow of GC version
numbers. We can only use a limited number of bits for
GC version in the header of a data entry (currently 8
bits), since the header needs to be smaller than the size
of an atomic RDMA operation. When the GC version of
an entry increases beyond the maximum value, we will
have to restart it from zero. With just the GC version
number and our GC mechanism so far, CNs will have
no way to tell if an entry matches its cached cursor
version or has advanced by 2^8 = 256 versions. To solve
this rare issue without invalidation traffic to CNs, we
use an epoch-based timeout mechanism. When the MS
finds the GC version number of a data entry overflows,
it puts the reclaimed entry into OvflowList and waits
for T_e before moving it to the FreeList, from which it can be
assigned to CNs. All CNs invalidate their own cursors
after an inactive period of T_e (if during this time the
CN accesses the entity, it would have advanced the cursor
already). To synchronize epoch time, the MS sends a
message to CNs after T_e, and the MS can choose the
value of T_e. The epoch message is the only communication
the MS issues to CNs during GC.
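The wrap-around case can be made concrete with a short sketch. The 8-bit width comes from the text; the names `bump_gc_version` and `ReclaimQueues`, and the use of plain lists for the OvflowList and FreeList, are assumptions for illustration.

```python
from typing import List, Tuple

GC_VERSION_BITS = 8  # width of the GC version field stated in the text


def bump_gc_version(version: int) -> Tuple[int, bool]:
    """Return (new version, wrapped) for a counter that restarts from zero."""
    new = (version + 1) % (1 << GC_VERSION_BITS)
    return new, new == 0


class ReclaimQueues:
    def __init__(self, t_e: float) -> None:
        self.t_e = t_e
        self.free: List[Tuple[int, int]] = []           # (address, GC version)
        self.ovflow: List[Tuple[float, int, int]] = []  # (parked at, address, GC version)

    def reclaim(self, addr: int, version: int, now: float) -> None:
        new, wrapped = bump_gc_version(version)
        if wrapped:
            self.ovflow.append((now, addr, new))  # hold back for one epoch
        else:
            self.free.append((addr, new))

    def end_epoch(self, now: float) -> None:
        """Release parked entries once the epoch length has passed since parking."""
        ready = [e for e in self.ovflow if now - e[0] >= self.t_e]
        self.ovflow = [e for e in self.ovflow if now - e[0] < self.t_e]
        self.free.extend((addr, version) for _, addr, version in ready)


ms = ReclaimQueues(t_e=100.0)
ms.reclaim(addr=1, version=41, now=0.0)   # ordinary case
ms.reclaim(addr=2, version=255, now=0.0)  # 255 + 1 wraps to 0
free_before = list(ms.free)
ms.end_epoch(now=100.0)
free_after = list(ms.free)
```

Only the entry whose version wrapped is delayed by an epoch; by then every CN cursor that could still hold an ambiguous version has been invalidated.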
4.4.3 Discussion.
The SepDS design offers four benefits. First, SepDS
reads and writes are fast, with 1 RTT and 3 RTTs re-
spectively when there is no contention. Even under con-
tention, SepDS still outperforms DirectDS and Cen-
tralDS. Achieving this low latency and guaranteeing
atomic write and read committed is not easy and is
achieved by the combination of four approaches: 1)
ensuring the data path does not involve the MS, 2) re-
ducing metadata communication to the MS and moving
it off performance critical path, 3) ensuring no mem-
ory copy in the whole data path, and 4) leveraging the
unique advantages of DPM to perform RDMA atomic
operations.
Second, SepDS scales well with the number of CNs
and DPMs, since its reads and writes are both lock free.
Readers do not block writers or other readers and writers
do not block readers. Concurrent writers to the same
entity only contend for the short period of RDMA c&s
operation. SepDS also minimizes the network traffic to
MS and the processing load on MS to make MS scale
well with number of CNs and data operations.
Third, we avoid all data movement or communication
between the MS and DPMs during GC. To scale and
support many CNs with few MSs, we avoid CN inval-
idation messages completely. The MS does not need
to proactively send any other messages to CNs either.
Essentially, the MS never pushes any messages to CNs.
Rather, CNs pull information from the MS.
Finally, the SepDS data structure is flexible and can
support load balancing very well. Different entries of a
data entity do not need to be on the same DPM device.
As we will see in §4.5.2 and §4.6, this flexible placement
is the key to SepDS’ load balancing and data replication
needs.
However, SepDS also has its own limitation. It re-
quires CNs to cache metadata. As we will see in
§5, when CN’s local metadata cache becomes small,
SepDS’s performance drops. Thus, SepDS works the
best when CNs have enough memory or when data ac-
cesses have good temporal locality.
4.5 Failure Handling
DPMs can fail independently from CNs. A DPM system
needs to handle both the transient failure of a DPM
(which can be rebooted) and a permanent failure of one.
For the former, our three DPM systems guarantee crash
consistency, i.e., after reboot, the DPM can recover all
its committed data. For the latter, we add the support for
data replication across multiple DPMs to all the three
data store systems. In addition, CentralDS and SepDS
also need to handle the failure of the coordinator and
the MS.
4.5.1 Recovery from Transient Failures
We now present how each system recovers from a single
DPM’s failure when it restarts. We assume that the rest
of the system (e.g., CNs, the coordinator, the MS) keeps
alive. We will discuss the reliability of the coordinator
and the MS in §4.5.2.
DirectDS. When recovering a DPM in DirectDS, we
need to decide whether to use the data in the committed
space or the un-committed space (i.e., where the redo
copy is). Note that a crash can happen when writing
to the committed space, leaving it in an intermediate
state, in which case a correct recovery should use the un-
committed space (the redo copy). We use a technique
that leverages RDMA’s ordered writes in increasing
address order [15, 63] to ensure the integrity of a data
space. Specifically, DirectDS extends its write data by
attaching a unique 8-byte value to the beginning and
the end of a data entry, and writes the extended data
entry during its write protocol. The unique value can be
calculated by maintaining a monotonically increasing
number at each CN. During recovery, we compare the
first and last 8 bytes of the committed space. A match
indicates the committed space has the complete data.
Otherwise, we check the un-committed space and use
the same way to tell if it has the complete data.
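The integrity check can be sketched as follows, assuming (as the text does) that a write lands in increasing address order, so a torn write leaves a new prefix followed by old bytes. The little-endian tag encoding and the function names are hypothetical; a real deployment would also need the tag to be unique across CNs.

```python
import struct
from typing import Optional

TAG = struct.Struct("<Q")  # 8-byte unsigned tag; the encoding is an assumption


def frame(payload: bytes, seq: int) -> bytes:
    """Attach the same 8-byte value to the beginning and the end of the data."""
    tag = TAG.pack(seq)
    return tag + payload + tag


def is_complete(space: bytes) -> bool:
    """A space is complete when its first and last 8 bytes match."""
    return len(space) >= 2 * TAG.size and space[:TAG.size] == space[-TAG.size:]


def recover(committed: bytes, redo: bytes) -> Optional[bytes]:
    """Prefer the committed space; fall back to the redo copy if it is torn."""
    for space in (committed, redo):
        if is_complete(space):
            return space[TAG.size:-TAG.size]
    return None


old = frame(b"old-data", seq=1)
new = frame(b"new-data", seq=2)
torn = new[:10] + old[10:]  # crash after only the first 10 bytes were overwritten
```

Here `torn` starts with the new tag and ends with the old one, so recovery rejects the committed space and uses the redo copy.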
DirectDS-C does not need this extended write mecha-
nism and can simply validate the data in the committed
space with its CRC. If the CRC is incorrect, we copy
the data from the redo copy to the committed space.
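A comparable sketch for CRC-based recovery is shown below. The text does not say which CRC is used or where it is stored, so the use of CRC-32 from Python's `zlib` module and a 4-byte trailer are assumptions made only for illustration.

```python
import struct
import zlib

CRC = struct.Struct("<I")  # 4-byte CRC-32 trailer; the layout is an assumption


def with_crc(payload: bytes) -> bytes:
    return payload + CRC.pack(zlib.crc32(payload))


def crc_ok(space: bytes) -> bool:
    if len(space) < CRC.size:
        return False
    payload, stored = space[:-CRC.size], CRC.unpack(space[-CRC.size:])[0]
    return zlib.crc32(payload) == stored


def recover_with_crc(committed: bytes, redo: bytes) -> bytes:
    """Keep the committed space if its CRC validates; otherwise take the redo copy."""
    return committed if crc_ok(committed) else redo


redo = with_crc(b"hello")
corrupted = b"j" + redo[1:]  # committed space damaged by a crash mid-write
```

Since the CRC covers the whole payload, its cost grows with the request size, which is consistent with the overhead discussed in the evaluation.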
CentralDS. Handling the failure of a DPM in Cen-
tralDS is simple, as long as the coordinator stays alive.
Since CentralDS performs out-of-place writes and the
coordinator stores the state of all writes, we can simply
use the information in the coordinator to know what
writes have written their redo copies but haven’t com-
mitted yet and what writes have not written redo copies.
For the former case, we advance to the redo copy, and
for the latter, we use the original version.
SepDS. SepDS’ recovery mechanism is also simple. If
a DPM fails before a CN successfully links the new data
it writes to the chain (indicating an un-committed write),
the CN simply unsets lock bits (within a pointer) of the
data entry (releasing the held lock) and discards the new
write (by treating the space as unused).
4.5.2 Adding Redundancy
We now present how we add redundancy to DPM in all
the three systems and how we handle coordinator and
MS failures. With the user-specified degree of replica-
tion being N , our data store systems guarantee that data
is still accessible after N − 1 DPMs have failed.
DirectDS and DirectDS-C. In order to sustain DPM
failure during a write, we need to replicate both the first
write to the un-committed space (the redo copy) and
the second write to the committed space. After getting
the lock, a CN sends the new data to the un-committed
space on N DPMs in parallel. Afterwards, it performs
N read validations, also in parallel. Once read validation
of all the copies succeeds, the CN writes the data to
the committed space of the N DPMs in parallel and
performs a parallel read validation afterwards.
CentralDS. To handle a replicated write RPC request,
the coordinator writes multiple copies of the data to N
DPMs in parallel and performs a parallel read valida-
tion of them. After the read validation, the coordinator
updates its metadata to record the new locations of all
these copies.
SepDS. We propose a new atomic replication mecha-
nism designed for the SepDS data structure. The basic
idea is to link each data entry version D_N to all the
replicas of the next version (e.g., D^a_{N+1}, D^b_{N+1},
and D^c_{N+1} for three replicas) by placing pointers to
all these replicas in the header of D_N. Figure 4 shows an
example of a replicated data entry. With this all-way chaining, SepDS can
always construct a valid chain as long as one copy of
each version in an entry survives.
Each data entry has a primary copy and one or more
secondary copies. To write a data entry D_{N+1} with R
replicas to an entry whose current tail is D_N, a CN first
writes all copies of D_{N+1} to R DPMs. In parallel, the CN
performs a one-sided c&s to a bit, B_w, in the header of
the primary copy of D_N to test if the entry is already
in the middle of a replicated write. If not, the bit will
be set, indicating that the entry is now under replicated
write. All the writes and the c&s operation are sent out
together to minimize latency.
After the CN receives the hardware acknowledgment
of all the operations, it constructs a header that contains
R pointers to the copies of D_{N+1} and writes it to all
the copies of D_N. Once the new header is written to
all copies of D_N, the system can recover D_{N+1} from
crashes (up to R − 1 concurrent DPM failures).
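The effect of all-way chaining on recovery can be illustrated with a small sketch. It models only the pointer structure: the c&s on the replicated-write bit and the read validations are left out, and the names `Entry`, `append_version`, and `rebuild_chain` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

Replica = Tuple[str, int]  # (DPM name, address on that DPM)


@dataclass
class Entry:
    data: bytes
    next_replicas: List[Replica] = field(default_factory=list)  # all copies of the next version


def append_version(dpms: Dict[str, Dict[int, Entry]], tail_copies: List[Replica],
                   new_copies: List[Replica], data: bytes) -> List[Replica]:
    """Write every copy of the new version, then link all tail copies to all of them."""
    for dpm, addr in new_copies:
        dpms[dpm][addr] = Entry(data)
    for dpm, addr in tail_copies:
        dpms[dpm][addr].next_replicas = list(new_copies)
    return new_copies


def rebuild_chain(dpms: Dict[str, Dict[int, Entry]], failed: Set[str],
                  head_copies: List[Replica]) -> List[bytes]:
    """Walk the chain using any surviving copy of each version."""
    versions, copies = [], head_copies
    while copies:
        alive = [(d, a) for d, a in copies if d not in failed]
        if not alive:
            raise RuntimeError("every copy of a version was lost")
        dpm, addr = alive[0]
        versions.append(dpms[dpm][addr].data)
        copies = dpms[dpm][addr].next_replicas
    return versions


dpms = {name: {} for name in ("DPM1", "DPM2", "DPM3", "DPM4")}
head = [("DPM1", 0), ("DPM2", 0)]
for dpm, addr in head:
    dpms[dpm][addr] = Entry(b"v1")
append_version(dpms, head, [("DPM3", 0), ("DPM4", 0)], b"v2")
survivors = rebuild_chain(dpms, failed={"DPM1", "DPM3"}, head_copies=head)
```

With two replicas per version spread over four DPMs, the chain is still reconstructible after losing one copy of each version, but not after losing both copies of the same version.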
Backup coordinator and MS. To avoid the coordinator
or the MS being the single point of failure in CentralDS
and SepDS, we implement a mechanism that enables
one or more backup coordinators (MSs), by having the
primary coordinator (MS) replicate the metadata that
cannot be reconstructed (i.e., keys and locations of val-
ues) to the backup coordinator (MS) when changing
these metadata.
4.6 Load Balancing
With the DPM model, a system will have a pool of
DPMs. Thus, it is beneficial to balance the load to each
of them.
With a centralized place to initiate all requests, it
is easy for CentralDS to perform load balancing. The
coordinator simply records the load to each DPM and
directs new writes to the DPM with lighter load. When
data is replicated, the coordinator can also balance read
loads by selecting the replica that is on the DPM with
lighter load.
We use a novel two-level approach to balance loads
in SepDS: globally at MS and locally at each CN. Our
global management leverages two features in SepDS: 1)
MS assigns all new space to CNs; and 2) data entries of
the same entity in SepDS can be on different DPMs. To
reduce the load on a DPM, MS directs all new writes
to other devices. At a local level, each CN internally
balances the load to different DPMs. Each CN keeps
one bucket per DPM to store free entries. It chooses
free entries from different buckets for new writes according
to its own load balancing needs.
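The CN-local level can be sketched as one bucket of free entries per DPM plus a selection policy. The text leaves the policy to each CN; the least-bytes-written rule below, and the names `LocalBalancer` and `pick`, are assumptions chosen only to make the idea concrete.

```python
from typing import Dict, List, Tuple


class LocalBalancer:
    """One bucket of free entry addresses per DPM, plus a per-DPM byte counter."""

    def __init__(self, buckets: Dict[str, List[int]]) -> None:
        self.buckets = buckets
        self.load = {dpm: 0 for dpm in buckets}

    def pick(self, nbytes: int) -> Tuple[str, int]:
        """Choose a free entry on the DPM this CN has written the least to."""
        candidates = [dpm for dpm, free in self.buckets.items() if free]
        if not candidates:
            raise RuntimeError("no free entries; request more from the MS")
        dpm = min(candidates, key=lambda name: self.load[name])
        self.load[dpm] += nbytes
        return dpm, self.buckets[dpm].pop()


cn = LocalBalancer({"DPM1": [10, 11], "DPM2": [20, 21], "DPM3": [30]})
targets = [cn.pick(1024)[0] for _ in range(5)]
```

Successive writes rotate across the three DPMs until the bucket for DPM3 runs out, after which the CN alternates between the remaining two.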
However, balancing loads with the DPM-Direct ar-
chitecture is hard, since there is no coordination across
CNs.
5 Evaluation Results
This section presents the evaluation results of different
DPM systems including DirectDS, DirectDS-C, Cen-
tralDS, and SepDS. All our experiments were carried out
in a cluster of 14 machines, connected with a 40 Gbps
Mellanox InfiniBand Switch. Each machine is equipped
with two Intel Xeon E5-2620 2.40GHz CPUs, 128 GB
DRAM, and one 40 Gbps Mellanox ConnectX-3 NIC.
5.1 Micro-benchmark Results
We first evaluate DPM data stores’ read and write performance
and compare them to LITE [65]. We chose
LITE for comparison since it offers low latency and uses
a similar physical memory registration method as our
data stores.
Figure 5 plots the average write latency with different
request size. LITE performs a write without read valida-
tion and only models the latency of un-validated writes.
Its latency is thus the lowest. Among DPM systems,
SepDS and CentralDS achieve the best write latency.
SepDS outperforms CentralDS slightly when request
size is large because of its smaller network traffic. DirectDS
and DirectDS-C have similar write performance
when request size is small. However, when request size
increases, the overhead of CRC calculation dominates,
making DirectDS-C perform poorly.
Figure 4. Replicated Data Entity. A replicated data entity on four DPMs (DPM1 to DPM4). The replication factor is two.
Figure 5. Write Latency. [Plot: latency (µs) versus request size (128 B to 4 KB) for DirectDS-C, DirectDS, CentralDS, SepDS, and LITE.]
Figure 6. Read Latency. [Plot: latency (µs) versus request size (128 B to 4 KB) for DirectDS-C, CentralDS, DirectDS, SepDS, and LITE.]
We also evaluated all the DPM systems’ write performance
without read validation (i.e., treating DPM as
volatile memory). We found that each read validation adds
a constant overhead of 1.5 µs.
Figure 6 plots the average read latency. Overall,
SepDS’s performance is the best among DPM systems
and is only slightly worse than LITE. When the
request size is small, DirectDS-C outperforms SepDS
because a DirectDS-C read requires only one round
trip under any circumstance. However, as with writes, the
read performance of DirectDS-C drops dramatically as
request size increases because of the CRC calculation
overhead. As expected, DirectDS and CentralDS perform
worse than SepDS because their reads involve
3 RTTs and 2 RTTs, respectively.
5.2 YCSB Results
We now present our evaluation results using the YCSB
benchmark [12, 71]. We use a total of 100K key-value
entries where the key size is 8 bytes and the value size is
1 KB. The accesses to keys follow the Zipf distribution.
And we use four workloads with different read and write
intensity: read only (workload C), 5% write (workload
B), 50% write (workload A), and write only.
Basic performance. We first evaluate the performance
of all our DPM systems under our default setting: 4
CNs and 4 DPMs, each CN runs 8 application threads.
Figure 7 shows the overall performance of DPM data
stores, replicated DPM data stores (with degree of repli-
cation 2), and Hotpot [59]. The Hotpot runs use four
servers, each running 8 application threads, and we ran
Hotpot with its MRSW (multiple reader single writer)
consistency level without replication. Hotpot serves as
a comparison of the distributed PM model.
SepDS performs the best among all systems regard-
less of read/write intensity, even under high contention
(with Zipf distribution to keys). DirectDS-C performs
well with workloads that are read intensive. DirectDS-
C’s read performance is not affected by contention, since
it does not need to perform any lock. In contrast, Di-
rectDS’s read performance is the worst under contention
because it needs to lock a data entry for each read.
CentralDS’s read performance is worse than DirectDS-
C and SepDS because each read in CentralDS requires
2 RTTs and under contention the coordinator becomes
the bottleneck.
For write-intensive workloads, CentralDS and SepDS
perform better than the DirectDS systems. This is be-
cause under high contention, the lock overhead of the
DirectDS systems becomes high, while CentralDS and
SepDS both avoid the lock contention. CentralDS avoids
it by using a local lock to protect metadata update (not
the write itself) and SepDS uses the lock-free out-of-
place chained data structure.
The overall performance of Hotpot is orders of mag-
nitude worse than all DPM data stores. The reason is
that each read and write in Hotpot involves a complex
protocol that requires RPCs across multiple nodes. Hot-
pot’s performance is especially poor with writes, since
the distributed PM consistency protocol involves fre-
quent invalidation of cached copies, especially under
high write contention to the same data. To confirm this,
we also ran Hotpot with workloads with uniform distri-
bution and found the results to be better, but still much
worse than DPM systems.
Replication overhead. As expected, adding redundancy
lowers the throughput of all data stores with write-heavy
workloads. Even though all systems issue the replication
requests in parallel, they only use one thread to perform
asynchronous RDMA read/write operations and doing
so still has an overhead.
Network traffic. To further understand the cause of per-
formance differences, we record the total network traffic
during each run. Figure 8 plots the average amount of
traffic that each data store incurs to complete one opera-
tion under different workloads. SepDS, DirectDS, and
DirectDS-C send less traffic in read-heavy workloads
since these data stores access DPMs directly. In contrast,
CentralDS incurs high traffic because data is sent once
between the CN and the coordinator and once between
the coordinator and the DPM. As expected, DirectDS
and DirectDS-C send more data for writes because
their writes involve 2 RTTs with data.
Figure 7. DPM Throughput and Replications. Running YCSB on four CNs and four DPMs, with 1K request size and replication degree one and three. Hotpot is running without replication with four nodes and 8 threads per data node. [Plot: throughput (MOPS) for workloads C (0%), B (5%), A (50%), and 100% write; series: DirectDS, DirectDS-C, CentralDS, SepDS, Hotpot-Zipf, DirectDS-R, DirectDS-C-R, CentralDS-R, SepDS-R, and Hotpot-Uniform.]
Figure 8. Network Traffic of DPM stores. Network traffic includes both control and data communication. [Plot: traffic (KB) per operation for workloads C (0%), B (5%), A (50%), and 100% write; series: DirectDS, DirectDS-C, CentralDS, and SepDS.]
Figure 9. Scalability w.r.t. DPMs. Running 4 CNs. Each CN runs 8 threads. [Plots: throughput (MOPS) versus number of DPMs (1, 2, 4, 8) for SepDS, DirectDS-C, CentralDS, and DirectDS; (a) Workload B (5%), (b) Workload A (50%).]
Figure 10. Scalability w.r.t. CNs. Running 4 DPMs. Each CN runs 8 threads. [Plots: throughput (MOPS) versus number of clients (1, 2, 4, 8) for SepDS, CentralDS, DirectDS-C, and DirectDS; (a) Workload B (5%), (b) Workload A (50%).]
Figure 11. CPU Utilization. CPU time to complete ten million requests. All tests run 4 CNs each using 8 threads. [Plot: CPU time (sec) for workloads C (0%), B (5%), A (50%), and 100% write; series: DirectDS, DirectDS-C, CentralDS, and SepDS.]
Figure 12. Effect of Metadata Cache in SepDS. Each bar shows the percentage of total metadata each CN can store. [Plot: throughput (MOPS) for workloads C, B, A, and 100% write with cache sizes of 100%, 10%, 1%, and 0%.]
Figure 13. Effect of Data Cache in CentralDS. Each bar shows the percentage of total data the coordinator can store. [Plot: throughput (MOPS) for workloads C, B, A, and 100% write with cache sizes of 100%, 10%, 1%, and 0%.]
Figure 14. Load Balancing in SepDS. [Plot: traffic (GB) to DPM-1, DPM-2, and DPM-3 under round-robin and load-balanced allocation.]
Scalability. Next, we evaluate the scalability of different
DPM data stores with respect to the number of CNs and
the number of DPMs. Figure 9 shows the scalability
of DPM data stores w.r.t. the number of DPMs. Both
DirectDS-C and SepDS scale well with DPMs because
DirectDS-C and SepDS both let CNs access DPMs di-
rectly, improving the network bandwidth utilization to
DPMs. DirectDS does not scale with DPMs because of
lock contention. CentralDS does not scale well either,
since its bottleneck is the network interface of a single
coordinator.
Figure 10 shows the scalability of DPM data stores
when varying the number of CNs. SepDS has the best
scalability, since there is no single network bottleneck
in SepDS. CentralDS’s scalability is worse than SepDS
again because of the bottleneck of a single coordina-
tor’s network throughput. DirectDS and DirectDS-C do
not scale with CNs. As the number of CNs increases,
contention happens more frequently in DirectDS and
DirectDS-C, which reduces the overall throughput.
CPU utilization. We evaluate the CPU utilization of different
DPM data stores. Figure 11 plots the total CPU
time to complete ten million requests in different work-
loads. For read-intensive workload, DirectDS-C and
SepDS use less CPU than other data stores because
of one-sided primitives. DirectDS suffers from lock
contention which increases total CPU utilization. For
write-intensive workload, SepDS uses less CPU time
than other data stores mainly because SepDS has higher
throughput and separates data plane and control plane
which reduces CPU usage.
Metadata size. Different DPM data stores cache dif-
ferent amounts of metadata in CNs. DirectDS and
DirectDS-C cache all keys and pointers to each entity
for direct access to DPMs. CNs in CentralDS only cache
keys, and rely on coordinators to keep metadata. CNs in
DirectDS and DirectDS-C keep the mapping from keys
to DPMs. Similarly, SepDS caches a shortcut pointer
for each entity to improve performance. SepDS further
supports different sizes of metadata cache.
Metadata caching effect. To evaluate the effect of dif-
ferent sizes of metadata cache at CNs in SepDS, we ran
the same YCSB workloads and configuration as in
Figure 7 and plot the results in Figure 12. Here, we use the
FIFO eviction policy (we also tested LRU and found it
to be similar to or worse than FIFO). With a smaller metadata
cache, the performance of all workloads drops because a CN
has to get the metadata from the MS before accessing
the data entry that does not have local metadata cache.
With no metadata cache (0%), CNs need to get metadata
from the MS before every request. However, under Zipf
distribution, with just 10% metadata cache, SepDS can
already achieve satisfying performance.
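For reference, a FIFO-evicting metadata cache of the kind used in this experiment can be sketched in a few lines. The class below is a hypothetical illustration, not the cache used in SepDS; it maps a key to a cached location and evicts by insertion order, so a hit does not refresh an item's position (which is what distinguishes FIFO from LRU).

```python
from collections import OrderedDict
from typing import Hashable, Optional


class FifoCache:
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.items: "OrderedDict[Hashable, int]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[int]:
        return self.items.get(key)  # no reordering on a hit

    def put(self, key: Hashable, location: int) -> None:
        if self.capacity <= 0:
            return  # a 0% cache stores nothing; every access goes to the MS
        if key not in self.items and len(self.items) >= self.capacity:
            self.items.popitem(last=False)  # evict the oldest insertion
        self.items[key] = location


cache = FifoCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
hit = cache.get("a")  # under LRU this hit would protect "a" from eviction
cache.put("c", 3)     # under FIFO, "a" is still the oldest and is evicted
```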
Data caching effect. We do not cache data at CNs be-
cause doing so would require coherence traffic, result-
ing in performance that is similar to distributed PM.
However, it is possible to cache data at the coordina-
tor with the DPM-Central architecture, because that is
the only copy and does not need any coherence traffic.
By caching hot data in a coordinator, the coordinator
does not need to access DPMs to get data for every read
which can reduce network traffic and improve perfor-
mance. We built a FIFO data cache at the coordinator
for CentralDS to analyze the effect of data caching. Fig-
ure 13 plots the throughput with different percentages
of the data cache in a coordinator. With bigger data
cache, the performance increases. However, the over-
all performance is still limited by network bandwidth.
Furthermore, we observe that data cache improves read
traffic but not write traffic. Overall, we found the effect
of data caching to be small with CentralDS, while it
demands a large amount of PM space at the coordinator.
Load balancing. To evaluate the effect of SepDS’s load
balancing mechanism, we use a synthetic workload with
three entities, A, B, and C. We first create A (without
replication) and B (with 2 replicas) and read these two
entities heavily. Then, we create C (without replication)
and keep updating C. One CN runs this synthetic work-
load on three DPMs. Figure 14 shows the total traffic to
the three DPMs with and without load balancing. With
a naive allocation policy of round-robin across DPMs,
write traffic spreads among all DPMs and read traffic
only goes to the first DPM. With load balancing, SepDS
spreads read traffic across different replicas depending
on the load of DPMs. At the same time, MS allocates
free entries for new writes from the least accessed DPM.
As a result, the total loads across the three DPMs are
balanced.
6 Related Work
Lim et al. [39, 40] first proposed the concept of disag-
gregating memory from processor. Recent years have
seen more industry and academic efforts in network
support for disaggregated memory [8, 18, 21, 48, 50]
and software systems to manage remote memory [2,
15, 23, 34, 49]. FaRM [15, 16] is an RDMA-based distributed
memory platform. FaRM uses one-way communication
for reads and performs both two-way and
one-way communication for replicated writes (depending
on whether it is to the primary copy). Pilaf [47]
and HERD [30, 31] are two RDMA-based key-value
store systems. These systems rely on two-way commu-
nication for writes and HERD and FaSST use two-way
communication for reads too.
NAM-DB [7, 72] is an RDMA-based database system
that uses one-sided communication for both read and
write. Infiniswap [23] is an RDMA-based remote mem-
ory paging system. Remote regions [1] is a system that
exposes remote memory as files that other host servers
can access (through a file system interface). Although
these three systems do not use two-way communication
for the data path, they all rely on processing power at remote
nodes to run data management tasks. SepDS runs
all management tasks (control path) at MS, a separate
node from remote memory.
Mojim [73], Hotpot [59], and Octopus [41] are three
recent distributed PM systems. Mojim [73] is the first
system that targets using PM in distributed, datacen-
ter environments. Mojim provides an efficient, RDMA-
based, asynchronous replication mechanism for PM, to
make it more reliable and available. Hotpot [59] is the
first distributed shared persistent memory system. It
integrates the idea of distributed shared memory and
distributed storage systems to provide a globally coher-
ent, crash-consistent, and reliable distributed PM system
that applications can access with memory instructions.
Octopus [41] is a distributed file system built on top of
PM. None of these systems build on the DPM model,
which presents a whole new set of challenges.
ReFlex [33] is a software-based system that builds on
IX [5] and exposes a logical block interface for users to
access remote Flash with nearly identical performance
as accessing local Flash. RAMCloud [52] is a remote
key-value storage system that stores a full copy of all
data in DRAM and backups in disks or SSDs. Kamino-
Tx [45] proposes a new mechanism to perform transac-
tional updates on PM without any copying of data in the
critical path. These systems all rely on local computa-
tion power at remote memory/storage servers to perform
various online and recovery management services, which
differs from the DPM model.
7 Conclusion
This paper presents the disaggregated PM model, where
PM is attached directly to the network without any local
processors. We proposed three DPM architectures, built
three atomic, crash-consistent, and reliable data stores
on top of these architectures, and performed extensive
evaluation of these data stores. Our findings will be able
to guide future DPM system builders.
References
[1] Marcos K. Aguilera, Nadav Amit, Irina Calciu, Xavier Deguillard,
Jayneel Gandhi, Stanko Novaković, Arun Ramanathan,
Pratap Subrahmanyam, Lalith Suresh, Kiran Tati, Rajesh
Venkatasubramanian, and Michael Wei. 2018. Remote regions: a
simple abstraction for remote memory. In 2018 USENIX Annual
Technical Conference (ATC ’18). Boston, MA.
[2] Marcos K. Aguilera, Nadav Amit, Irina Calciu, Xavier Deguil-
lard, Jayneel Gandhi, Pratap Subrahmanyam, Lalith Suresh, Ki-
ran Tati, Rajesh Venkatasubramanian, and Michael Wei. 2017.
Remote Memory in the Age of Fast Networks. In Proceedings
of the 2017 Symposium on Cloud Computing (SoCC ’17).
[3] Krste Asanović. 2014. FireBox: A Hardware Building Block
for 2020 Warehouse-Scale Computers. Keynote talk at the 12th
USENIX Conference on File and Storage Technologies (FAST
’14).
[4] Luiz André Barroso and Urs Hölzle. 2007. The Case for Energy-
Proportional Computing. Computer (Dec. 2007).
[5] Adam Belay, George Prekas, Ana Klimovic, Samuel Grossman,
Christos Kozyrakis, and Edouard Bugnion. 2014. IX: A Pro-
tected Dataplane Operating System for High Throughput and
Low Latency. In 11th USENIX Symposium on Operating Systems
Design and Implementation (OSDI ’14). Broomfield, CO, USA.
[6] Emery D. Berger, Kathryn S. McKinley, Robert D. Blumofe, and
Paul R. Wilson. 2000. Hoard: A Scalable Memory Allocator for
Multithreaded Applications. In Proceedings of the Ninth Inter-
national Conference on Architectural Support for Programming
Languages and Operating Systems (ASPLOS ’00). Cambridge,
MA, USA.
[7] Carsten Binnig, Andrew Crotty, Alex Galakatos, Tim Kraska,
and Erfan Zamanian. 2016. The End of Slow Networks: It’s
Time for a Redesign. Proceedings of the VLDB Endowment 9, 7
(2016), 528–539.
[8] Cache Coherent Interconnect for Accelerators. 2018. https:
//www.ccixconsortium.com/.
[9] Tao Chen and G. Edward Suh. 2016. Efficient Data Supply
for Hardware Accelerators with Prefetching and Access/Exe-
cute Decoupling. In The 49th Annual IEEE/ACM International
Symposium on Microarchitecture (MICRO-49). Taipei, Taiwan.
[10] Joel Coburn, Adrian M. Caulfield, Ameen Akel, Laura M. Grupp,
Rajesh K. Gupta, Ranjit Jhala, and Steven Swanson. 2011. NV-
Heaps: Making Persistent Objects Fast and Safe with Next-
generation, Non-volatile Memories. In Proceedings of the 16th
International Conference on Architectural Support for Program-
ming Languages and Operating Systems (ASPLOS ’11). New
York, New York.
[11] Jeremy Condit, Edmund B. Nightingale, Christopher Frost, En-
gin Ipek, Benjamin Lee, Doug Burger, and Derrick Coetzee.
2009. Better I/O Through Byte-addressable, Persistent Memory.
In Proceedings of the ACM SIGOPS 22nd Symposium on
Operating Systems Principles (SOSP ’09). Big Sky, MT, USA.
[12] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakr-
ishnan, and Russell Sears. 2010. Benchmarking Cloud Serving
Systems with YCSB. In Proceedings of the 1st ACM Symposium
on Cloud Computing (SoCC ’10). New York, New York.
[13] Alexandros Daglis, Dmitrii Ustiugov, Stanko Novaković,
Edouard Bugnion, Babak Falsafi, and Boris Grot. 2016. SABRes:
Atomic object reads for in-memory rack-scale computing. In
2016 49th Annual IEEE/ACM International Symposium on Mi-
croarchitecture (MICRO ’16). Taipei, Taiwan.
[14] Christina Delimitrou and Christos Kozyrakis. 2014. Quasar:
Resource-efficient and QoS-aware Cluster Management. In Pro-
ceedings of the 19th International Conference on Architectural
Support for Programming Languages and Operating Systems
(ASPLOS ’14).
[15] Aleksandar Dragojević, Dushyanth Narayanan, Orion Hodson,
and Miguel Castro. 2014. FaRM: Fast Remote Memory. In
Proceedings of the 11th USENIX Conference on Networked
Systems Design and Implementation (NSDI ’14). Seattle, WA,
USA.
[16] Aleksandar Dragojević, Dushyanth Narayanan, Edmund B.
Nightingale, Matthew Renzelmann, Alex Shamis, Anirudh
Badam, and Miguel Castro. 2015. No Compromises: Distributed
Transactions with Consistency, Availability, and Performance.
In Proceedings of the 25th Symposium on Operating Systems
Principles (SOSP ’15). Monterey, CA, USA.
[17] Subramanya R. Dulloor, Sanjay Kumar, Anil Keshavamurthy,
Philip Lantz, Dheeraj Reddy, Rajesh Sankaran, and Jeff Jackson.
2014. System Software for Persistent Memory. In Proceed-
ings of the EuroSys Conference (EuroSys ’14). Amsterdam, The
Netherlands.
[18] Paolo Faraboschi, Kimberly Keeton, Tim Marsland, and Dejan
Milojicic. 2015. Beyond Processor-centric Operating Systems.
In 15th Workshop on Hot Topics in Operating Systems (HotOS
’15). Kartause Ittingen, Switzerland.
[19] Brad Fitzpatrick. 2004. Distributed Caching with Memcached.
Linux Journal 2004, 124 (2004), 5.
[20] Peter X. Gao, Akshay Narayan, Sagar Karandikar, Joao Car-
reira, Sangjin Han, Rachit Agarwal, Sylvia Ratnasamy, and Scott
Shenker. 2016. Network Requirements for Resource Disaggrega-
tion. In 12th USENIX Symposium on Operating Systems Design
and Implementation (OSDI ’16). Savannah, GA.
[21] Gen-Z Consortium. 2018. https://genzconsortium.org.
[22] Albert Greenberg, James Hamilton, David A. Maltz, and Parveen
Patel. 2008. The Cost of a Cloud: Research Problems in Data
Center Networks. SIGCOMM Computer Communication
Review 39, 1 (Dec 2008), 68–73.
[23] Juncheng Gu, Youngmoon Lee, Yiwen Zhang, Mosharaf Chowd-
hury, and Kang Shin. 2017. Efficient Memory Disaggregation
with Infiniswap. In Proceedings of the 14th USENIX Symposium
on Networked Systems Design and Implementation (NSDI ’17).
Boston, MA, USA.
[24] Hewlett Packard. 2005. The Machine: A New Kind of
Computer. http://www.hpl.hp.com/research/systems-research/
themachine/.
[25] M Hosomi, H Yamagishi, T Yamamoto, K Bessho, Y Higo, K Ya-
mane, H Yamada, M Shoji, H Hachino, C Fukumoto, et al. 2005.
A Novel Nonvolatile Memory with Spin Torque Transfer Mag-
netization Switching: Spin-RAM. In Electron Devices Meeting,
2005. IEDM Technical Digest. IEEE International. 459–462.
[26] InfiniBand Trade Association. 2015. InfiniBand Architecture
Specification. https://cw.infinibandta.org/document/dl/7859.
[27] Intel. 2019. Intel Optane technology. https://www.intel.com/
content/www/us/en/architecture-and-technology/intel-optane-
technology.html.
[28] Intel Corporation - Product and Performance Informa-
tion. 2018. Intel Non-Volatile Memory 3D XPoint.
http://www.intel.com/content/www/us/en/architecture-and-
technology/non-volatile-memory.html?wapkw=3d+xpoint.
[29] Intel Corporation - Product and Performance Information. 2019.
Reimagining the Data Center Memory and Storage Hierar-
chy. https://newsroom.intel.com/editorials/re-architecting-data-
center-memory-storage-hierarchy/.
[30] Anuj Kalia, Michael Kaminsky, and David G. Andersen. 2014.
Using RDMA Efficiently for Key-value Services. In Proceedings
of the 2014 ACM Conference on Special Interest Group on Data
Communication (SIGCOMM ’14). Chicago, IL, USA.
[31] Anuj Kalia, Michael Kaminsky, and David G. Andersen. 2016.
Design Guidelines for High Performance RDMA Systems. In
Proceedings of the 2016 USENIX Annual Technical Conference
(ATC ’16). Denver, CO, USA.
[32] Anuj Kalia, Michael Kaminsky, and David G. Andersen. 2016.
FaSST: Fast, Scalable and Simple Distributed Transactions with
Two-Sided (RDMA) Datagram RPCs. In 12th USENIX Sympo-
sium on Operating Systems Design and Implementation (OSDI
’16). Savannah, GA, USA.
[33] Ana Klimovic, Heiner Litz, and Christos Kozyrakis. 2017. ReFlex:
Remote Flash ≈ Local Flash. In Proceedings of
the Twenty-Second International Conference on Architectural
Support for Programming Languages and Operating Systems
(ASPLOS ’17). Xi’an, China.
[34] Ana Klimovic, Heiner Litz, and Christos Kozyrakis. 2018. Se-
lecta: Heterogeneous Cloud Storage Configuration for Data Ana-
lytics. In 2018 USENIX Annual Technical Conference (ATC ’18).
Boston, MA.
[35] Aasheesh Kolli, Steven Pelley, Ali Saidi, Peter M. Chen, and
Thomas F. Wenisch. 2016. High-Performance Transactions for
Persistent Memories. In Proceedings of the Twenty-First Inter-
national Conference on Architectural Support for Programming
Languages and Operating Systems (ASPLOS ’16). Atlanta, GA.
[36] Benjamin C. Lee, Engin Ipek, Onur Mutlu, and Doug Burger.
2010. Phase Change Memory Architecture and the Quest for
Scalability. Commun. ACM 53, 7 (2010), 99–106.
[37] Benjamin C Lee, Ping Zhou, Jun Yang, Youtao Zhang, Bo Zhao,
Engin Ipek, Onur Mutlu, and Doug Burger. 2010. Phase-change
Technology and the Future of Main Memory. IEEE Micro 30, 1
(2010), 143.
[38] Myoung-Jae Lee, Chang Bum Lee, Dongsoo Lee, Seung Ryul
Lee, Man Chang, Ji Hyun Hur, Young-Bae Kim, Chang-Jung
Kim, David H Seo, Sunae Seo, et al. 2011. A Fast, High-
Endurance and Scalable Non-Volatile Memory Device Made
from Asymmetric Ta2O(5-x)/TaO(2-x) Bilayer Structures.
Nature Materials 10, 8 (2011), 625–630.
[39] Kevin Lim, Jichuan Chang, Trevor Mudge, Parthasarathy Ran-
ganathan, Steven K. Reinhardt, and Thomas F. Wenisch. 2009.
Disaggregated Memory for Expansion and Sharing in Blade
Servers. In Proceedings of the 36th Annual International Sympo-
sium on Computer Architecture (ISCA ’09). Austin, Texas.
[40] Kevin Lim, Yoshio Turner, Jose Renato Santos, Alvin AuY-
oung, Jichuan Chang, Parthasarathy Ranganathan, and Thomas F.
Wenisch. 2012. System-level Implications of Disaggregated
Memory. In Proceedings of the 2012 IEEE 18th Interna-
tional Symposium on High-Performance Computer Architecture
(HPCA ’12). New Orleans, LA, USA.
[41] Youyou Lu, Jiwu Shu, Youmin Chen, and Tao Li. 2017. Octopus:
an RDMA-enabled Distributed Persistent Memory File System.
In 2017 USENIX Annual Technical Conference (ATC ’17). Santa
Clara, CA, USA.
[42] David Meisner, Brian T. Gold, and Thomas F. Wenisch. 2009.
PowerNap: Eliminating Server Idle Power. In Proceedings of
the 14th International Conference on Architectural Support for
Programming Languages and Operating Systems (ASPLOS ’09).
Washington, DC, USA.
[43] Mellanox. 2015. Mellanox Delivers the World’s Fastest EDR
100Gb/s InfiniBand Switch with Latency Less than 90 Nanosec-
onds. https://www.rdmag.com/news/2015/07/infiniband-now-
connecting-more-50-percent-top500-supercomputing-list.
[44] Mellanox Technologies. 2015. RDMA Aware Networks
Programming User Manual. http://www.mellanox.com/related-
docs/prod_software/RDMA_Aware_Programming_user_
manual.pdf.
[45] Amirsaman Memaripour, Anirudh Badam, Amar Phanishayee,
Yanqi Zhou, Ramnatthan Alagappan, Karin Strauss, and Steven
Swanson. 2017. Atomic In-place Updates for Non-volatile Main
Memories with Kamino-Tx. In Proceedings of the Twelfth Euro-
pean Conference on Computer Systems (EuroSys ’17). Belgrade,
Serbia.
[46] Micron Technology Inc. 2005. P8P Parallel Phase Change
Memory (PCM). https://media.digikey.com/pdf/Data%20Sheets/
Micron%20Technology%20Inc%20PDFs/NP8P128Ax60E_
Rev_K.pdf.
[47] Christopher Mitchell, Yifeng Geng, and Jinyang Li. 2013. Us-
ing One-sided RDMA Reads to Build a Fast, CPU-efficient
Key-value Store. In Proceedings of the 2013 USENIX Annual
Technical Conference (ATC ’13). San Jose, CA, USA.
[48] Stanko Novaković, Alexandros Daglis, Edouard Bugnion, Babak
Falsafi, and Boris Grot. 2014. Scale-out NUMA. In Proceedings
of the 19th International Conference on Architectural Support
for Programming Languages and Operating Systems (ASPLOS
’14). Salt Lake City, UT.
[49] Stanko Novaković, Alexandros Daglis, Edouard Bugnion, Babak
Falsafi, and Boris Grot. 2016. The Case for RackOut: Scalable
Data Serving Using Rack-Scale Systems. In Proceedings of
the Seventh ACM Symposium on Cloud Computing (SoCC ’16).
Santa Clara, CA, USA.
[50] Open Coherent Accelerator Processor Interface. 2018. https://opencapi.org/.
[51] Jiaxin Ou, Jiwu Shu, and Youyou Lu. 2016. A High Performance
File System for Non-volatile Main Memory. In Proceedings
of the Eleventh European Conference on Computer Systems
(EuroSys ’16). London, United Kingdom.
[52] John Ousterhout, Arjun Gopalan, Ashish Gupta, Ankita Kejri-
wal, Collin Lee, Behnam Montazeri, Diego Ongaro, Seo Jin
Park, Henry Qin, Mendel Rosenblum, Stephen Rumble, Ryan
Stutsman, and Stephen Yang. 2015. The RAMCloud Storage
System. ACM Transactions on Computer Systems 33, 3 (August
2015), 7:1–7:55.
[53] PCI Express. 2014. PCI Express Base Specification Revision
4.0 Version 0.3.
[54] Steven Pelley, Peter M. Chen, and Thomas F. Wenisch. 2014.
Memory Persistency. In Proceedings of the 41st Annual International
Symposium on Computer Architecture (ISCA ’14).
Piscataway, NJ.
[55] William Pugh. 1990. Skip Lists: A Probabilistic Alternative to
Balanced Trees. Communications of the ACM 33, 6 (June 1990),
668–676.
[56] Moinuddin K Qureshi, Michele M Franceschini, Luis A Lastras-
Montaño, and John P Karidis. 2010. Morphable memory system:
a robust architecture for exploiting multi-level phase change
memories. In Proceedings of the 37th Annual International Symposium
on Computer Architecture (ISCA ’10).
[57] R. Recio, B. Metzler, P. Culley, J. Hilland, and D. Garcia. 2007.
A Remote Direct Memory Access Protocol Specification. RFC
5040.
[58] Yizhou Shan, Yutong Huang, Yilun Chen, and Yiying Zhang.
2018. LegoOS: A Disseminated, Distributed OS for Hardware
Resource Disaggregation. In 13th USENIX Symposium on Oper-
ating Systems Design and Implementation (OSDI ’18). Carlsbad,
CA.
[59] Yizhou Shan, Shin-Yeh Tsai, and Yiying Zhang. 2017. Dis-
tributed Shared Persistent Memory. In Proceedings of the 8th
Annual Symposium on Cloud Computing (SoCC ’17). Santa
Clara, CA, USA.
[60] SNIA, Chet Douglas. 2015. RDMA with PMEM.
https://www.snia.org/sites/default/files/SDC15_presentations/
persistant_mem/ChetDouglas_RDMA_with_PM.pdf.
[61] Storage Review. 2016. Mellanox Unveils 200Gb/s HDR In-
finiBand Solutions. http://www.storagereview.com/mellanox_
unveils_200gb_s_hdr_infiniband_solutions.
[62] Kosuke Suzuki and Steven Swanson. 2015. The Non-Volatile
Memory Technology Database (NVMDB). Technical Report
CS2015-1011. Department of Computer Science & Engineering,
University of California, San Diego.
[63] Yacine Taleb, Ryan Stutsman, Gabriel Antoniu, and Toni Cortes.
2018. Tailwind: Fast and Atomic RDMA-based Replication. In
2018 USENIX Annual Technical Conference (ATC ’18). Boston,
MA.
[64] Dan Tang, Yungang Bao, Weiwu Hu, and Mingyu Chen. 2010.
DMA cache: Using on-chip storage to architecturally separate
I/O data from CPU data for improving I/O performance. In
The Sixteenth International Symposium on High-Performance
Computer Architecture (HPCA ’10). Bangalore, India, 1–12.
[65] Shin-Yeh Tsai and Yiying Zhang. 2017. LITE Kernel RDMA
Support for Datacenter Applications. In Proceedings of the 26th
Symposium on Operating Systems Principles (SOSP ’17). Shang-
hai, China.
[66] Haris Volos, Andres Jaan Tack, and Michael M. Swift. 2011.
Mnemosyne: Lightweight Persistent Memory. In Proceedings of
the Sixteenth International Conference on Architectural Support
for Programming Languages and Operating Systems (ASPLOS
’11). New York, New York.
[67] Xingda Wei, Jiaxin Shi, Yanzhe Chen, Rong Chen, and Haibo
Chen. 2015. Fast in-memory transaction processing using
RDMA and HTM. In Proceedings of the 25th Symposium on
Operating Systems Principles (SOSP ’15). Monterey, CA, USA.
[68] Xiaojian Wu and A.L.N. Reddy. 2011. SCMFS: A File System
for Storage Class Memory. In International Conference for High
Performance Computing, Networking, Storage and Analysis (SC
’11).
[69] Yingjun Wu, Joy Arulraj, Jiexi Lin, Ran Xian, and Andrew Pavlo.
2017. An Empirical Evaluation of In-memory Multi-version
Concurrency Control. Proceedings of the VLDB Endowment 10,
7 (March 2017), 781–792.
[70] J Joshua Yang, Dmitri B Strukov, and Duncan R Stewart. 2013.
Memristive devices for computing. Nature Nanotechnology 8, 1
(2013), 13–24.
[71] YCSB-C. 2015. https://github.com/basicthinker/YCSB-C.
[72] Erfan Zamanian, Carsten Binnig, Tim Harris, and Tim Kraska.
2017. The End of a Myth: Distributed Transactions Can Scale.
Proceedings of the VLDB Endowment 10, 6 (2017), 685–696.
[73] Yiying Zhang, Jian Yang, Amirsaman Memaripour, and Steven
Swanson. 2015. Mojim: A Reliable and Highly-Available Non-
Volatile Memory System. In Proceedings of the 20th Interna-
tional Conference on Architectural Support for Programming
Languages and Operating Systems (ASPLOS ’15). Istanbul,
Turkey.
