Architecture-Aware, High Performance Transaction for Persistent Memory by Wu, Kai et al.
Architecture-Aware, High Performance Transaction
for Persistent Memory
Kai Wu
kwu42@ucmerced.edu
Jie Ren
jren6@ucmerced.edu
Dong Li
dli35@ucmerced.edu
University of California, Merced
Abstract—Byte-addressable non-volatile main memory (NVM)
demands transactional mechanisms to access and manipulate
data on NVM atomically. Those transaction mechanisms often
employ a logging mechanism (undo logging or redo logging).
However, the logging mechanisms can bring large runtime
overhead (8%-49% in our evaluation), and 41%-78% of the
overhead attributes to the frequent cache-line flushing. Such
large overhead significantly diminishes the performance benefits
offered by NVM. In this paper, we introduce a new method
to reduce the overhead of cache-line flushing for logging-based
transactions. Different from the traditional method that works
at the program level and leverages program semantics to reduce
the logging overhead, we introduce architecture awareness. In
particular, we do not flush certain cache blocks, as long as
they are estimated to be eliminated out of the cache because
of the hardware caching mechanism (e.g., the cache replacement
algorithm). Furthermore, we coalesce those cache blocks with
low dirtiness to improve the efficiency of cache-line flushing.We
implement an architecture-aware, high performance transaction
runtime system for persistent memory, Archapt. Our results show
that comparing with an undo logging (PMDK) and a redo logging
(Mnemosyne), Archapt reduces cache-line flushing by 66% and
improves system throughput by 19% on average (42% at most).
Our crash tests with four hardware caching policies show that
Archapt provides strong a guarantee on crash consistency.
I. INTRODUCTION
Non-volatile memories (NVM), addressed at a byte granu-
larity directly by CPU and accessed roughly at the latency of
main memory, are coming. While NVM as main memory pro-
vides an appealing interface that uses simple load/store,
it brings new challenges to the designs of persistent data
structures, storage systems, and databases. In particular, a
store does not immediately make data persistent, because
the memory hierarchy (e.g., caches and store buffers) can
remain non-persistent. There is a need to ensure that data is
modified atomically when moving from one consistent state
to another, in order to provide consistency after a crash due
to power loss or hardware failure.
The NVM challenges have resulted in investigations of
transactional mechanisms to access and manipulate data on
persistent memory (NVM) atomically [1]–[8]. Those transac-
tional mechanisms often employ a logging technique (undo
logging or redo logging). However, those transactional mech-
anisms have a high overhead. Our performance evaluation
reveals that running TPC-C [9], [10] and YCSB (A-F) [11],
[12] against Redis [13], and OLTP-bench [14] (TPC-C,
LinkBench [15] and YCSB) against SQLite [16] based on an
implementation of undo logging from Intel PMDK [17]) or a
redo logging from [3] to build transactions, we have overheads
of 8%-49%. Such large overhead significantly diminishes the
performance benefit NVM promises to provide.
Most overhead of logging mechanisms comes from data
copy for creating logs and cache-line flushing by special
instructions. Cache-line flushing takes a large portion of the
total overhead. Use our evaluation with the above workloads as
an example again. On average, the cache-line flushing takes
65% and 51% of total overhead for undo logging and redo
logging mechanisms respectively. Reducing the overhead of
cache-line flushing is the key to enable high performance
transaction for persistent memory.
The traditional methods reduce the overhead of cache-
line flushing using asynchronous cache-line flushing (e.g.,
blurring persistent boundary [18] and relaxing persistency
ordering [3], [18]). Those methods remove the overhead of
cache-line flushing off the critical path, by overlapping cache-
line flushing with the transaction. However, the effectiveness
of asynchronous cache-line flushing depends on the charac-
teristics of the transaction (e.g., how frequent data updates
happen), cache-line flushing can still be exposed into the
critical path, increasing the latency of the transaction.
In this paper, we introduce a new method to reduce the
overhead of cache-line flushing. The traditional methods work
at the program level and leverages program semantics: as
long as the transaction semantics remains correct, we can
change the order of persisting data and trigger asynchronous
cache-line flushing. Different from the traditional methods,
our method introduces architecture awareness. In particular,
we do not flush certain cache lines, as long as those cache
lines are eliminated out of the cache because of the hardware
caching mechanism (e.g., the cache replacement algorithm). In
other words, we rely on the existing hardware mechanism to
automatically and implicitly flush cache lines. The traditional
methods do not have architecture awareness. Ignoring the
possible effects of the caching mechanism, the traditional
methods flush cache lines by explicitly issuing cache flush
instructions, even though those cache lines will be soon or
have been eliminated out of the cache by hardware.
Furthermore, we examine the cache line dirtiness to quantify
the efficiency of cache-line flushing. The dirtiness of a cache
line is defined as the ratio of dirty bytes to total number of
bytes in a cache line. Since a cache line is the finest granularity
to enforce data persistency, the whole cache line has to be
flushed, even though only a few bytes in the cache line are
dirty. Use our evaluation with the above workloads as an
example again: the average dirtiness of flushed cache lines in
1
ar
X
iv
:1
90
3.
06
22
6v
2 
 [c
s.D
C]
  1
8 A
pr
 20
19
Redis and SQLite is 49% and 49% for undo and redo logging
mechanisms respectively. Flushing clean data in a cache line
wastes memory bandwidth and decreases the efficiency of
cache-line flushing.
To leverage the architecture awareness to enable high per-
formance transactions, we must address a couple of challenges.
First, we must have a software mechanism to reason and
decide the existence of cache blocks 1 in the cache, without
hardware modification. The mechanism must be lightweight
and allow us to make a quick decision on whether a cache-
line flushing is necessary.
Second, we must provide strong guarantee on crash consis-
tency to implement transactions. Skipping cache-line flushing
for some persistent objects raises the risk of losing data con-
sistency for committed transactions. The software mechanism
to reason the residence of a cache block in the cache is an
approximation to the hardware-based caching mechanism. If
the software mechanism skips a cache-line flushing, but the
corresponding dirty cache block is still in the cache, then there
is a chance that the cache block is inconsistent when a crash
happens. We must have a mechanism to detect and correct
such inconsistency in persistent memory.
To address the above two challenges, we introduce Archapt
(Architecture-aware, performant and persistent transaction),
an architecture-aware, high performance transaction runtime
system. Archapt provides a new way to perform transactional
updates on persistent memory with efficient cache-line flush-
ing. To address the first challenge, Archapt uses an LRU queue
to estimate the residence of cache blocks of a persistent object
in the cache and decide whether cache flushing for a persistent
object in a transaction is necessary.
To address the second challenge, Archapt introduces a
lightweight checksum mechanism. Checksums are built using
multiple cache blocks from one or more persistent objects to
establish implicit invariant relationships between cache blocks.
Leveraging the invariant, Archapt can detect data inconsistency
and make best efforts to correct data inconsistency after a
crash happens. The checksum mechanism provides a strong
guarantee on crash consistency, while causes small runtime
overhead (less than 5% loss in throughput in our evaluation).
Furthermore, to improve the efficiency of cache-line flush-
ing, we examine the implementation of common database
systems (Redis and SQLite), and find two problems accounting
for the low dirtiness of flushed cache lines. The two problems
are unaligned cache-line flushing and uncoordinated cache-
line flushing. The two problems come from the fundamental
limitation of the existing memory allocation mechanism de-
signed for the traditional DRAM. In particular, the existing
memory allocation does not consider the effects of cache-
line flushing on persistent memory, and spread data structures
with different dirtiness across cache blocks. This causes the
low dirtiness of flushed cache lines. Archapt introduces a
1We distinguish cache line and cache block in the paper. The cache line is
a location in the cache, and the cache block refers to the data that goes into
a cache line.
customized memory allocation mechanism to coalesce cache-
line flushing and improve efficiency.
In summary, the paper makes the following contributions:
• An architecture-aware new method to achieve high perfor-
mance transactions on persistent memory;
• A mechanism that determines the necessity of cache-line
flushing based on the locality of cache blocks; A checksum
mechanism to detect and correct data inconsistency to
provide strong guarantee on crash consistency;
• We reveal the low dirtiness of flushed cache lines in two
common databases, and provide a solution to improve the
efficiency of cache-line flushing;
• With Archapt, on average we reduce cache-line flushing
by 66% and improve system throughput by 19% (42%
at most), when running YCSB (A-F) and TPC-C against
Redis, and OLTP-bench (TPC-C, LinkBench and YCSB)
against SQLite, using a undo logging (PMDK) and a redo
logging (Mnemosyne) as baseline. Archapt provides strong
crash consistency demonstrated by our crash tests with four
hardware caching policies.
II. BACKGROUND AND MOTIVATION
Many studies build a atomic and durable transaction [1],
[2], [2], [3], [17]–[22] to handle the crash consistency issue
on NVM. With such a transaction, each update must be “all
or nothing”, i.e., either successfully completes, or fails com-
pletely with the data in NVM intact. With such a transaction,
one has to write back the modified data from the volatile cache
to NVM that provides the durability. To ensure a cache line
is written to NVM in a correct order, one often uses cache-
line flushing instructions (e.g., clflush, clflushopt or
clwb) and persistent barriers (e.g., sfence and mfence).
Cache-line flushing is expensive, because of two reasons: (1)
it may need to invalidate cache lines (with clflush and
clflushopt instructions) and trigger cache-line sized writes
to the memory controller; and (2) it needs persistent barriers
to ensure that all flushes are completed and force any updates
in the memory controller to be written to NVM.
In this paper, we use the term persistent object to represent
a data object that is modified within the transaction and needs
to be persisted. We use the term log record to represent a log
(an copy of the old data in an undo logging mechanism or a
copy of the new data in a redo logging mechanism). To persist
a persistent object, the current common practice is to flush all
cache blocks of the persistent object [17] 2 We use “flushing
all cache blocks” and “cache-line flushing” interchangeably to
indicate persisting a persistent object.
We do not consider battery-backed RAM as a solution to
reduce cache-line flushing. Using battery-backed RAM, it is
possible to avoid cache-line flushing to build durable trans-
actions, because the battery allows the system to flush cache
lines to persistent memory when a crash happens. However,
battery-backed RAM has restrictive temperature requirements,
2We have to flush all cache blocks, even though some of them may not be
in the cache, because there is no mechanism to faithfully track which cache
blocks are in the cache.
2
Fig. 1: Throughput when running YCSB workloads (A-F), TPC-C
and LinkBench against Redis and SQLite with and without logging.
Fig. 2: Breakdown of undo/redo logging latency overhead.
leakage risk, limited storage time, long re-charge cycles, finite
battery shelf life, and overall high cost-of-ownership [23].
A. Performance Analysis on Log-based Transactions
Undo and redo logging are two common mechanisms to
build persistent transactions on persistent memory. In undo and
redo logging, the logging operations (including data copy and
log record manipulation) and persistence operations (including
cache-line flushing and store barrier) are necessary. Both of
them cause performance loss in a transaction. To quantify the
impact of the persistent logging on transaction throughput, we
run multiple workloads, including YCSB and TPC-C against
Redis, and OLTP-bench (TPC-C, LinkBench and YCSB)
against SQLite with and without the persistent logging. For
each workload, we use eight client threads. Section V-A has
more experiment details. Figure 1 shows the results.
The figure reveals that logging decreases throughput by 8%-
49%. For a workload with frequent updates (YCSB-A) or large
updates (LinkBench), the logging overhead can be very large
(33% and 49% for YCSB-A and LinkBench respectively).
Furthermore, we measure the latency overhead caused by
logging operations and persistence operations. Figure 2 shows
the results. In the undo and redo logging, the persistence
operations account for 56%-78% and 41%-64% of the latency
overhead respectively; The overhead of those persistence oper-
ations is exposed to the critical path of transactions. The above
results show that the persist operations can significantly impact
the transaction performance. Thus, we must avoid frequent
cache-line flushing.
Introducing architecture awareness into the design of a
transaction, we want to skip cache flushing by leveraging data
reuse information in the cache. If data reuse is low, then there
is a very good chance that the data is eliminated out of the
cache by the hardware-based caching mechanism. We study
data reuse in the next section.
Fig. 3: Distribution of data reuse for persistent data objects.
TABLE I: Average dirtiness of flushed cache lines.
Redis SQLite
TPC-C YCSB TPC-C LinkBench YCSBA B D E F
0.31 0.55 0.55 0.51 0.51 0.56 0.40 0.49 0.46
B. Data Reuse and Dirtiness Analysis
Data in a transaction includes log records and persistent
objects. Log records, which are used to maintain the transac-
tion atomicity, are seldom reused. We study data reuse at the
persistent object level, and explore whether there are persistent
objects with few reuse. These persistent objects are candidates
for skipping cache-line flushing.
To study data reuse, we count the number of operations
(read and write) for each persistent object, and then report
what is the percentage of persistent objects with 0, 1, 2 or
more operations, which we call the distribution of data reuse.
Figure 3 shows the results. The figure reveals that 78% of
persistent objects are used only once or twice in all workloads
except YCSB-E. In YCSB-E, about 89% of persistent objects
have data reuse no less than 2. Such high data reuse is because
of the following reason: This workload has frequent queries,
each of which cover a range of persistent objects. Those ranges
of queries overlap with each other, causing high data reuse.
We also explore the efficiency of cache-line flushing. In
particular, we quantify the average dirtiness of flushed cache
lines. Table I shows the results for undo and redo logging (the
two logging mechanism have the same dirtiness). In general,
the dirtiness is less than 0.6 in all workloads, which is low.
Conclusions. Using the industry standard workloads, our
analysis on data reuse and dirtiness shows great opportunities
to enable high performance transactions by skipping cache-
line flushing and improving its efficiency.
III. DESIGN
Motivated by the above performance analysis, we introduce
a high-performance transaction runtime system.
A. Overview
Archapt avoids cache-line flushing for persistent objects (but
not log records) to enable high performance transactions with-
out disturbing transaction atomicity. Archapt uses an LRU-
based method to reason if persistent objects are in the cache.
With this approach, Archapt does not immediately make a
decision on flushing cache blocks for a persistent object, when
a cache flushing request is issued from a transaction to persist a
persistent object. Archapt delays the decision until it collects
more information on read/write of the persistent object and
estimates the locality of the persistent object, using the LRU
3
Fig. 4: The architecture of Archapt.
queue. For a persistent object that is estimated to be out of the
cache, the cache flushing for all of its cache blocks is skipped.
Archapt is also featured with a checksum mechanism.
Skipping cache-line flushing for some persistent objects raises
the risk of having inconsistent data for committed transactions,
when a crash happens. To remove the risk, we introduce a
checksum mechanism. This mechanism generates checksums
for persistent objects that have cache-line flushing skipped.
The checksum mechanism builds invariant relationships be-
tween cache blocks. Upon a crash, the checksums are used to
detect and correct data inconsistency. We design the mecha-
nism with the consideration of avoiding runtime overhead and
maximizing the capabilities of correcting data inconsistency.
Further, we identify two reasons accounting for the low
dirtiness of flushed cache lines: unaligned cache-line flushing
and uncoordinated cache line flushing. To address the two
problems, Archapt introduces a customized memory allocation
mechanism. It clusters persistent objects with the same func-
tionality (i.e., key, field, value, or log) into contiguous cache
blocks to coordinate and align cache-line flushing, based on
which Archapt improves the efficiency of cache-line flushing.
Overall architecture of Archapt. Archapt has four major
components: transaction management unit, memory manage-
ment unit, persistent management unit, and history manage-
ment unit. Figure 4 shows the architecture of Archapt.
(1) The transaction management unit includes a set of
APIs to establish a transaction (i.e., start and end). Such
transaction information is sent to the Archapt runtime to
implement transaction semantics. The transaction management
unit processes the requested operations of the transaction.
It also flushes cache blocks for persistent objects that are
estimated to be in the cache. (2) The memory management
unit pre-allocates a set of memory pools for coalescing cache
blocks and manages the pools to meet memory allocation
requests from transactions. (3) The persistent management
unit builds checksums for persistent objects for which Archapt
skips the cache-line flushing. (4) The history management unit
maintains an LRU queue and a hash table, ObjHT . The LRU
queue is used to estimate the locality of persistent objects
(i.e., in the cache or not). ObjHT is used to provide metadata
information for each persistent object in the LRU queue, such
as the location in the LRU queue and whether there is any
pending cache-line flushing.
B. Architecture-Aware Cache-Line Flushing
The architecture-aware cache-line flushing uses an LRU
queue to reason if a persistent object is in the cache or not,
and skips cache-line flushing for it, if not. When a persistent
object is updated, its cache blocks are placed into the LRU
queue (the queue length is equal to the capacity of last level
cache), and the decision for cache flushing for this persistent
object is pending until we have enough information to estimate
the residence of the persistent object in the cache, based on
the LRU queue. We describe our design in details as follows.
First, once Archapt receives a request (i.e., a read or write
operation to a persistent object) from the client, the transaction
management unit queries ObjHT to see if the requested
persistent object has a record there. If yes, we infer that the
persistent object is accessed recently. The hardware cache may
have the persistent object resident in the cache because of a
previous operation on the persistent object. If the previous
operation is a write operation, flushing cache blocks for the
persistent object must have been pending. We finish the cache
flushing for the previous write operation. Furthermore, we
update the location of the persistent object in the LRU queue,
because of the current request. If the current request is a write
operation, we hold the cache flushing for the current request,
waiting for the opportunity to skip it in the future.
If the transaction management unit cannot find the requested
persistent object information in ObjHT , we conclude that the
persistent object has not been accessed recently. The hardware
cache may evict the persistent object out of the cache or never
access it at all. The transaction management unit then skips
any pending cache flushing request for the persistent object.
Afterwards, the transaction management unit asks the history
management unit to insert the information for the persistent
object into the LRU queue, and suspend the cache flushing
for the most recent request if the request writes the persistent
object. In the future transactions, as other persistent objects
are accessed, the target persistent object can be evicted out of
the LRU queue according to the LRU policy, and its record
will then be removed from ObjHT and the pending cache
flushing will be skipped.
We must maintain the commit status of a transaction very
well. After the completion of a transaction, we cannot label
it as commit as in the traditional transaction, because cache
flushing for some persistent objects in the transaction may be
pending. For such a transaction, we label it as logical commit.
Only after all of cache flushing for persistent objects in the
transaction are either finished or skipped (but with checksums
added to the persistent objects. See Section III-C), we label
the transaction as physical commit.
A logically committed transaction has completed all read
and write operations in the transaction. For such a transaction,
the system does not respond to the client for transaction
commit. For a physically committed transaction, the system
does so, as in the traditional undo or redo logging mechanisms.
The modern hardware-based cache hierarchy employs so-
phisticated caching policies. It is possible that a persistent
4
object is resident in the cache while the LRU estimates
otherwise. For this case, skipping cache-line flushing can cause
data inconsistency for a physically committed transaction,
when a crash happens. We introduce a checksum mechanism
to detect and correct inconsistent data (see Section III-C).
In our evaluation (see “random crash tests” in Section V-B),
we find that our LRU-based approach tends to be conservative
to estimate data locality for our workloads. In particular,
evaluating with four hardware caching policies, we find that
when a crash happens, there are at most a few tens of
inconsistent persistent objects, much smaller than the number
of inconsistent persistent objects in the LRU queue. This
means that some persistent objects that are estimated to be
resident in the cache by the LRU approach are not in the cache.
Using such a conservative approach can reduce the number of
inconsistent persistent-objects when the crash happens.
Handling log records. Log records, once created for a
transaction, are seldom accessed (unless a crash happens).
We could skip cache flushing for log records and rely on
the hardware-based caching mechanism to implicitly persist
them. However, by doing so, some log records that are not
timely flushed by the hardware are lost when a crash happens;
We raise the risk of losing transaction atomicity before the
physical commitment of the transaction. Hence, we do not
skip cache-line flushing for log records. They are committed
and maintained as in the traditional logging mechanisms.
Memory utilization. Our LRU-based approach requires a
queue and a hash table. Maintaining these data structures
increases memory footprint (typically a few megabytes, de-
pending on the number of persistent objects estimated to be in
the cache). This memory overhead can pollute the cache and
hence lose performance. However, even with the overhead,
Archapt still brings performance benefit, because (1) some
operations on the two data structures are not in the critical
path, (2) the benefit of skipping cache lines overweigh the
overhead, and (3) some workloads have few data reuse and
hence are not sensitive to the reduction of the cache capacity.
C. Checksum Design
Skipping cache-line flushing for some persistent objects
raises the risk of disturbing transaction atomicity: once a trans-
action is physically committed, there is no strong guarantee on
crash consistency, because we estimate data locality and the
estimation can be inaccurate. To remove the risk, we introduce
a checksum mechanism.
We have multiple requirements for the checksum design.
First, the checksum mechanism should have the capability to
detect data inconsistency in physically committed persistent
objects. Second, the checksum mechanism must provide strong
guarantee on crash consistency for persistent objects when
they are physically committed. Third, the checksum mech-
anism must be lightweight. The overhead of the checksum
maintenance should be smaller than the performance benefit of
skipping cache-line flushing for persistent objects. We describe
the design of the checksum mechanism in this section.
General Description. Our checksums are built with cache
blocks of multiple persistent objects from one or more trans-
actions. To build the checksums, cache blocks of multiple per-
sistent objects are logically organized as an M×N matrix (M
and N are the dimension sizes of the matrix, discussed later).
Those persistent objects have cache-line flushing skipped.
Each column of the matrix corresponds to cache blocks of one
or more persistent objects, where each element of the column
is a cache block. Checksums are built as one extra row (the
(M+1)th row) and one extra column (the (N+1)th column)
of the matrix. The matrix becomes (M + 1)× (N + 1). The
extra row, named consistency checksums, is used to detect and
correct data inconsistency of N columns, and each element
of the extra row is a consistent checksum for one column.
The extra column, named correlation checksums, builds an
invariant relationship between cache blocks across multiple
persistent objects. The correlation checksums can correct data
inconsistency. We name the matrix, virtual matrix.
Consistency checksums to detect data inconsistency.
When a persistent object with cache-line flushing skipped
is logically committed, we immediately create a consistency
checksum. The checksum is a simple summation of cache
blocks of the persistent object. The checksum is immediately
flushed for persistency once it is created. When a crash hap-
pens, for each persistent object with a consistency checksum,
we recalculate the checksum and compare it with the existing
one in persistent memory. If there is a mismatch, then data
inconsistency is detected.
The consistency checksum mechanism is very effective to
detect data inconsistency for a persistent object. Any cache
block of the persistent object with data inconsistency can
easily cause checksum mismatch. In our evaluation running
ten workloads consisting of billions of transactions and 40,000
crash tests under four hardware caching polices, the consis-
tency checksum mechanism detects all inconsistent data.
Correlation checksums to correct data inconsistency. A
correlation checksum, as an element of the (N +1)th column
of the virtual matrix, is a summation of cache blocks of a row
in the virtual matrix. The (N + 1)th column is composed of
M correlation checksums, each of which is for one row. Since
the cache blocks of a row come from at most N persistent
objects, the correlation checksum at the end of the row aims
to correct data inconsistency for any persistent object in the
row. Correlation checksums (i.e., the (N + 1)th column) are
immediately flushed out of the cache for persistency, once they
are fully built.
Once a crash happens, we recalculate correlation checksums
and compare them with the existing ones in persistent memory.
If there is a mismatch in any correlation checksum (e.g., the el-
ement mk(N+1) of the virtual matrix), then the corresponding
row (the row k in the example) must have data inconsistency.
Using consistency checksums, we can reason which element
of the row k is inconsistent. Assume that the element mkj is.
5
(a) Checksum creation (b) An example of correctable data inconsis-
tency
(c) An example of uncorrectable data incon-
sistency
Fig. 5: Three examples for checksum creation and correcting data inconsistency.
This element is corrected by the following:
mkj = mk(N+1) −
N∑
i=1,i 6=j
mki (1)
where mk(N+1) is the correlation checksum committed in
persistent memory.
We pack persistent objects into the virtual matrix following
the order they are allocated. A persistent object can use more
than one columns of the virtual matrix, if one column is not
big enough to hold all cache blocks of the persistent object.
A column, after populated by cache blocks of a persistent
object, can have zero-valued elements, if the persistent object
is not big enough and the next persistent object is too big to
be placed into the column. Persistent objects are allocated by
Archapt, from memory pools pre-allocated before transactions
happen (Section III-D). This method allows Archapt to control
over the assignment of virtual addresses to persistent objects
to implement the virtual matrix-based checksum mechanism.
An example. Figure 5.a shows an example to further explain
the idea of checksums. In this example, we have four persistent
objects with four, four, three, and four cache blocks respec-
tively. The virtual matrix is 4×4, and each column has one per-
sistent object. The consistency checksums are in the fifth row,
and the correlation checksums are in the fifth columns. The
consistency checksums, CkSum1-ChkSum4 can detect data
inconsistency for the first-fourth persistent objects respectively.
The correlation checksums, CkSum5-CkSum8, can be used
to correct data inconsistency for the cache blocks in the first-
fourth rows. Suppose CB#32 has inconsistency detected by
the consistency checksum CkSum2. The inconsistency can be
corrected by the correlation checksum CkSum7. In particular,
CB#32 = CkSum7− CB#31− CB#33− CB#34.
Locating checksums. When a persistent object is updated,
we must locate its related checksums to update them. We intro-
duce a lightweight approach to correlate persistent objects with
their checksums, such that we can quickly locate checksums
for any given persistent object.
Consistency and correlation checksums are placed into the
virtual memory page where the persistent objects protected by
the checksums reside, such that using the virtual addresses of
the persistent objects, we can easily locate the checksums.
In particular, we fix the virtual matrix size as 8 × 8
(N = M = 8). A memory page with the memory page
size of 4KB (annotated as mps) has only one virtual matrix.
The first 3136 bytes of the memory page (i.e., 7 × 7 cache
blocks) is allocated for cache blocks of persistent objects,
which is annotated as oms. The remaining 896 bytes (two 7×1
cache blocks) of the memory page are for the consistency and
correlation checksums, which is annotated as cms. Given a
virtual memory address of a persistent object (obj addr) and
its size, we use the following two equations to locate its check-
sum. The equations calculate the addresses of consistency and
correlation checksums (cons and corr) respectively.
cons = (d(obj offset / (cms / 2)e − 1)× cbs+ oms+ psa (2)
corr = (obj offset% (cms / 2)− cbs+ oms+ (cms / 2) + psa (3)
where obj offset is the distance between the starting address
of the persistent object and the starting address of the memory
page in which the object reside (obj offset = obj addr %
mps), cbs is the cache block size, and psa is the starting
address of the memory page (psa = obj addr - obj offset).
In essence, by putting checksums at a fixed page offset, we
simplify the efforts to locate checksums. The above algorithm
to locate checksums is based on the assumption that a persis-
tent object is never larger than a memory page, which is true
in all workloads we evaluate. If the persistent object is larger
than a memory page, then we build a larger virtual matrix and
use multiple memory pages (instead of one) to hold it, and put
checksums at a fixed offset of the multiple memory pages.
High performance checksum. Our checksum mechanism
does not cause large performance overhead because of the
following reason. Creating checksums is not in the critical path
of a transaction. A checksum for a persistent object is created,
only after the persistent object is estimated to be evicted out of
the LRU queue. This indicates that it is highly likely that the
persistent object will not be accessed in the near future. Hence
it is highly likely that creating checksums does not block
operations on the persistent object. Also, creating checksums
for the persistent object and committing the checksums do not
block the execution of other transactions. Hence, checksum
creation can happen in parallel with other operations, which
removes it from the critical path.
Second, checksums do not need to be updated frequently.
Once a persistent object is updated, its checksums must be
recalculated and updated to maintain the validness of check-
6
sums. Such updates can cause performance overhead. This
performance problem is common in other mechanisms, such
as ECC or RAID. However, it is not a problem in our design,
because we use those persistent objects that are not frequently
accessed (according to the LRU queue) to build checksums.
Updating checksums does not happen often.
Third, the overhead of flushing cache blocks of check-
sums can be smaller than that of flushing cache blocks of
persistent objects in the traditional undo/redo logging-based
methods. This fact is supported by our performance evaluation
(Section V-B). Given an N ×M virtual matrix, we need to
flush (N +M ) cache blocks to make checksums consistent.
In contrast, to make persistent objects in the virtual matrix
consistent in the traditional methods, we flush at least (N )
cache blocks (this case happens when each column has one
small data object with the size of one cache block), and at most
(M ×N ) cache blocks (this case happens when each column
has a large data object with the size of M cache blocks).
If the virtual matrix is populated well with cache blocks of
large data objects, then the number of cache blocks to flush
for checksums is significantly smaller than that for persisting
persistent objects (i.e., N + M vs. M × N ). Hence, we
reduce the number of cache-line flushing, even though we
need to flush checksums. However, if the virtual matrix is not
populated well and each column has only a small data object,
then the number of cache blocks to flush for checksums can be
larger than that for persisting N persistent objects (i.e., N+M
vs. N ). To handle such cases with small data objects, we
pack multiple small objects into the same column to heavily
populate the virtual matrix.
Analysis on the capability of correcting data inconsis-
tency. The correlation checksums have a strong capability
to correct data inconsistency. If a cache block in a row is
inconsistent, we can easily correct it using Equation 1. If
more than one cache block in a row are inconsistent, we can
use the consistency checksums and the correlation checksums
to correct them. The consistency checksum is not only used
to detect data inconsistency, but also used to correct the
inconsistency of cache blocks that fall into the same column,
using the similar method as in Equation 1.
Figure 5.b gives an example where we have two inconsistent
cache blocks in the first row. Using the correlation checksum
CkSum5 alone is not able to correct them. However, using
the consistency checksums CkSum2 or CkSum4, we can
correct at least one inconsistent data. Afterwards, we can use
CkSum5 to correct the other.
It is possible that a row has multiple inconsistent cache
blocks and the columns where those inconsistent cache blocks
reside have another inconsistent cache blocks. Figure 5.c gives
an example. In this example, the first row have two inconsistent
cache blocks (CB#12 and CB#14). They reside in the
columns two and four. These two cache blocks are not cor-
rectable by the correlation checksum CkSum5. Meanwhile,
each of the columns two and four has another inconsistent
cache block (CB#42 and CB#24), making the consistency
checksums (CkSum2 and CkSum4) incapable of correcting
the inconsistent cache blocks too.
In this case, any checksum, including the combination
of consistency checksum and correlation checksum, cannot
correct those cache blocks. However, such a case is extremely
rare: Those inconsistent cache blocks must be so “coincident”
to fall into the same row and column together. In our evaluation
with ten workloads including billions of transactions and
400,000 crash tests under four hardware caching polices, our
checksums can correct all of data inconsistency for committed
transactions.
Post-crash processing. After a crash happens, we examine
persistent objects in persistent memory. If they do not have
checksums and the transactions are physically committed,
then the persistent objects must be consistent without any
cache flushing skipped. If they do not have checksum and the
transactions are not physically committed, then the transaction
updates are cancelled and the persistent objects are restored
using traditional logs.
If the persistent objects have checksums and the transactions
are physically committed, then we use consistency checksums
to detect consistency of each persistent object. If there is
any inconsistency, then we use correlation and consistency
checksums to correct them. If the data inconsistency is not
correctable, which is very rare, then the corresponding trans-
action update is lost. To avoid incorrectable data inconsistency
after physical commitment, we could add another row and
column as consistency and correlation checksums. The new
and old checksums, each of which is built upon half of rows
or columns, can enhance correction capabilities for those rare
cases, but come with larger performance overhead. Study of
this tradeoff is out of scope of this paper, because the current
checksum mechanism already works well in our evaluation.
Ensuring transaction atomicity. Before the physical com-
mitment, Archapt relies on logs, as in the traditional undo
and redo logging mechanisms, to ensure atomicity. After the
physical commitment, the persistent objects are successfully
updated with the assists of checksums and the atomicity is
enforced. In the extremely rare case where the persistent object
is not consistent and the checksum mechanism cannot correct
it, we rely on a traditional checkpoint mechanism and go back
to the last valid checkpoint to ensure atomicity.
D. Coalescing of Cache-Line Flushing
To reduce the overhead of cache-line flushing, we have two
methodologies: one is to avoid cache flushing for persistent
objects as in Section III-B; the other is to coalesce cache-line
flushing to avoid low dirtiness in flushed cache lines. After
investigating two common databases (Redis and SQLite), we
find two reasons accounting for low dirtiness: unaligned cache-
line flushing and uncoordinated cache-line flushing.
The unaligned cache-line flushing happens when a persistent
object is not aligned with cache lines. For example, a persistent
object is 100 bytes. Ideally, the object should use two cache
blocks (assuming that the cache block size is 64 bytes).
However, the object could not be aligned well during the
memory allocation, and uses three cache blocks. Once the
7
object is modified, we have to flush three cache lines instead
of two. This easily increases the number of cache-line flushing
by 50%. We find this problem in Redis and SQLite.
The uncoordinated cache-line flushing happens when mul-
tiple, associated data objects are allocated into separate cache
blocks. The multiple data objects are associated, because they
are often updated together. If they are allocated into the same
cache blocks, then we can reduce the number of cache-line
flushing. This problem happens more often in NoSQL systems,
such as a key-value store system. We use Redis as an example
to explain it further.
As a key-value store system, Redis enables secondary
indexing based on a two-level hash table. In the second level,
Redis has a set of field-value pairs. For each field-value pair,
the field and value objects are allocated separately on different
cache blocks. This is inefficient on persistent memory, because
the field and value have to be persisted by flushing separate
cache lines. The size of the field object is small (usually less
than one cache block), and the field object is usually updated
with the value object together. Therefore, coalescing the field
and value objects into a fewer contiguous cache blocks can
reduce the number of cache-line flushing.
To address the above problems, we introduce a memory
allocation mechanism to improve the efficiency of cache-line
flushing. The original implementation of Redis uses the tradi-
tional memory allocation, without considering the implications
of memory allocation on cache flushing. Whenever a key or
a value is created, Redis allocates the corresponding memory
space on demand, without the coordination with other memory
allocations. In our new design for Redis, we pre-allocate
three memory pools without allocating memory on demand.
The three memory pools, key pool, field value pool and
log pool, meet the memory allocation requests for keys, field
and value pairs, and log records, respectively. We use the
three memory pools, instead of allocating memory on demand,
because the three pools can use separate memory allocation
methods to minimize cache-line flushing; The three memory
pools can also cluster objects with the same functionality
(i.e., key, field, value, or log) into contiguous cache blocks
to coordinate cache-line flushing.
IV. IMPLEMENTATION
Programming APIs. Archapt is implemented as a user-
level library to provide persistence support and be integrated
with the existing log-based transaction implementation, such
as Intel PMDK [17] and Mnemosyne [24]. Archapt includes
a set of APIs, defined in Table II.
Archapt init() is used to pre-allocate multiple memory
pools for coalescing cache-line flushing (Section III-D) and
initialize critical data structures (e.g., the LRU queue and
QbjHT ). Archapt Tx Start() and Archapt Tx End()
are used to identify transactions for the Archapt runtime, and
can be embedded into the existing transaction start/finalization
functions. Archapt Tx LCommit() is used to replace the
traditional transaction commit to implement the logical com-
mit for Archapt. Archapt Malloc() and Archapt Free() are
TABLE II: Archapt APIs
API Name Functionality
Archapt Init() Pre-allocate memory pools and ini-
tialization
Archapt Tx Start() Identify the beginning of a transac-
tion
Archapt Tx End() Identify the end of a transaction
Archapt Tx LCommit() Logical commitment
Archa Malloc(int type,
size t size)
Memory allocation for coalescing
cache-line flushing
Archapt Free(int type,
size t size)
Free memory allocation for coa-
lescing cache-line flushing
used to replace the traditional memory allocation and free APIs
in the transaction implementation. The two APIs are used to
allocate and free memory from/to the pre-allocated memory
pools for coalescing cache-line flushing.
System optimization. Archapt includes a number of op-
timization techniques to enable high performance and thread
safety. These techniques include SIMD vectorization of check-
sum creation and update, a high-performance concurrent lock-
free hash table, and a high-performance LRU queue based on
circular buffers. In addition, to avoid contention on the LRU
queue from multiple transactions (multiple threads), Archapt
creates a transaction management unit for each transaction,
and the transaction management unit puts information on
write/read of persistent objects into a local buffer. By fetching
the information from those local buffers, the history manage-
ment unit collectively updates the LRU queue.
V. EVALUATION
A. Experimental Methodology
The goal of the evaluation is to evaluate the performance
of Archapt with a range of workloads with different charac-
teristics. We use both NoSQL and SQL systems (Redis and
SQLite). We use seven persistent transaction mechanisms for
evaluation: undo logging and redo logging with Archapt, undo
logging and redo logging without Archapt, the rollback journal
system in SQLite, the AOF mechanism (logging every write
operation) in Redis, and Atlas [25] an undo-logging-based
system that provides durability guarantees for failure-atomic
sections). Our implementation of the undo logging and redo
logging are based on Intel PMDK [17] and Mnemosyne [24]
(using asynchronous cache-line flushing) respectively. For the
rollback journal and AOF, whenever a transaction commitment
happens, we commit the transaction updates to memory (not
hard drive), in order to enable fair performance comparison.
We run YCSB [11] (A-F) and TPC-C [9], [10] against
Redis, and run OLTP-bench [26] (particularly, TPC-C,
LinkBench [15] and YCSB) against SQLite. These workloads
are chosen for Redis and SQLite respectively, because they
can easily run on the two database systems without any
modification. Table III gives some details for these workloads.
For YCSB running against Redis, we perform transaction
operations on 10M key-value pairs, and each of the pair size is
1KB. We choose default data access distributions (uniform or
zipfian); For TPC-C aginist Redis, we use 10 warehouses and
run five minutes per test. For other workloads against SQLite,
we use default configurations. For those experiments using
Redis, we run both servers and clients on a single machine as
in [27], [28].
8
TABLE III: The percentage of different operations in evaluated
workloads; “R”, “U”, “I”, “RU”, “S” and “D” standard for read,
update, insert, read & update, scan, and delete operations respectively.
Redis SQLite
Ops TPCC YCSB TPCC LinkBH YCSBA B C D E F
R 8 50 95 100 95 - 50 8 64 50
U 47 50 5 - - - - 47 16 10
I 45 - - - 5 5 - 45 12 5
RU - - - - - - 50 - - 10
S - - - - - 95 - - 4 15
D - - - - - - - - 4 10
All experiments are performed on a 24-core machine with
two 12-core Xeon Gold 6126 processors with 187GB memory
and 19.25MB last level cache. We use DRAM to emulate
NVM, since NVDIMM is not on the market at the time of
preparing this manuscript. For other slower NVMs, the benefits
of Archapt would only be larger because of the reduction of
cache-line flushing. We use clwb and clflushopt instruc-
tions to flush cache lines. These two instructions are the most
recent instructions especially designed for high performance
cache-line flushing. In our study, the transaction performance
of using the two instructions is pretty close to each other
(less than 5% performance difference, much smaller than
the performance benefit offered by Archapt), hence we only
present the results of using clwb for cache-line flushing.
B. Experimental Results
Micro-benchmark Evaluation. We use a micro-benchmark
to compare the overhead of flushing cache blocks of persistent
objects (without checksums as in the traditional undo/redo
logging) and creating/updating checksums. We aim to quantify
and study the overhead of our checksum mechanism.
In the micro-benchmark, we randomly access 1.3M persis-
tent objects. We perform five groups of tests. In each group,
all persistent objects have the same size, but five groups
use different data object sizes. In each group, we perform
three tests: (1) Flushing cache blocks of all persistent objects,
as in the traditional methods; (2) creating checksums for
each persistent object; and (3) updating checksums for each
persistent object. Figure 6 shows execution time for each test.
Figure 6 reveals that the overhead of creating checksums
is smaller than that of flushing cache blocks in the traditional
methods. Creating checksums is cheaper by 29% and 68%
for small data objects (128 bytes) and large data objects
(2048 bytes) respectively. For small persistent objects, packing
them into the same column effectively reduces the checksum
creation overhead.
Figure 6 reveals that for large data objects (2048 bytes),
the overhead of updating checksums is smaller (by 63%) than
that of flushing cache blocks in the traditional methods. For
small objects (128 bytes), updating checksums is slightly more
expensive. However, updating checksums does not happen
frequently for persistent objects with checksums (Otherwise,
those objects should be estimated to be in the cache and do
not have checksums).
Basic Performance. We use different numbers of threads
to evaluate throughput and latency. Figures 7 and 8 show the
Fig. 6: Comparing the overhead of flushing cache blocks of persistent
objects without checksum and creating/updating checksums.
results. We do not include the results for Atlas because of the
space limitation, but we discuss the Atlas results as follows.
Figure 7 reveals that the systems with Archapt have higher
throughput than all other systems. On average, the system with
Archapt offers 54% and 32% higher throughput than the roll-
back journal system in SQLite and AOF in Redis, respectively;
For the redo logging (Mnemosyne) and undo logging (PMDK),
the system with Archapt increases throughput by 16% and
22%, respectively.
The biggest improvement happens on YCSB-A and
LinkBench with undo-Archapt. YCSB-A is write-intensive
(see Table III) and LinkBench has large persistent objects,
which have higher demands on consistency and provide better
opportunities for performance improvement. In the best cases,
the undo logging with Archapt offers up to 32%, 54% and
67% higher throughput than the other three mechanisms (undo
(PMDK), AOF (Redis) and Rollback journal (SQLite)). For
the read-only workload (YCSB-C), Archapt cannot offer any
performance benefit, because no cache flushing is needed in
the original workload, but Archapt provides at most 3% lower
throughput than other mechanisms, which is small.
We also evaluate the performance of Atlas. Atlas uses undo
logging. Comparing with PMDK using undo logging and
Archapt, Atlas performs 9% and 31% worse. Atlas has bad
performance, because of its mechanism to detect dependencies
across failure-atomic sections and instrumentation on persis-
tent operations.
Figure 8 shows the 99th-percentile latency. We run eight
client threads for each workload. On average, Archapt de-
creases the tail latency by 8%, 9%, 9% and 22%, compared
with the traditional undo logging, redo logging, AOF and
the rollback journal system respectively. Such performance
improvement in the tail latency comes from the reduction
of unnecessary cache flushing. For the read-only workload
(YCSB-C) that offers no opportunity to reduce cache flushing,
Archapt still provides comparable performance to the other
transaction mechanisms. As for Atlas (not shown in the
Figure), its transaction tail latency is 5% and 13% longer than
PMDK and Archapt respectively.
Quantifying the effectiveness of reducing cache-line
flushing. We measure the number of cache-line flushing before
and after applying Archapt to undo logging. We only show
the results for undo logging, because it has less cache-line
flushing than the rollback journal and AOF. Hence reducing
cache-line flushing for undo-logging is more challenging than
9
(a) YCSB A-F (Redis).
(b) TPC-C (Redis). (c) Three workloads with SQLite.
Fig. 7: Throughput with the four transaction mechanisms, as the number of threads vary from four to twelve. “T”=“threads”.
(a) Redis (b) SQLite
Fig. 8: 99th-percentile transaction latency, as the number of threads is eight.
doing that for the rollback journal and AOF. Also, we do
not show the results for redo logging, because undo and
redo logging have the similar number of cache-line flushing.
Figure 9 shows the number of reduced cache-line flushing after
applying Archapt. The numbers in the figure are normalized
by the total numbers of cache-line flushing before applying
Archapt. The figure does not include YCSB-C, because this
workload is read-only and does not need cache flushing. The
figure also isolates the contributions of the two techniques (the
LRU-based approach and coalesce of cache-line flushing) to
compare the effectiveness of the two techniques.
The figure reveals that Archapt greatly reduces the number
of cache-line flushing by 66% on average. YCSB-E has less
reduction in the number of cache-line flushing than other
workloads, because it has more data reuse in persistent objects,
which provides less opportunities to skip cache-line flushing.
This result is consistent with that shown in Figure 3.
We further notice that both techniques effectively reduce
cache-line flushing. The contribution of the LRU-based ap-
proach to the reduction of cache-line flushing varies between
different workloads, because different workloads have differ-
ent data reuse of persistent objects. Furthermore, comparing
with Redis, SQLite gains less benefit from the coalesce of
cache-line flushing. This is because of the strict SQL data
structures in SQLite that have some existing optimization on
Fig. 9: The numbers of reduced cache-line flushing after applying
Archapt to undo logging. The numbers are normalized by the total
numbers of cache-line flushing before using Archapt.
cache line alignment.
Quantify dirtiness of flushed cache lines. Figure 10
shows the distribution of the dirtiness of flushed cache lines
before and after applying Archapt. The figure does not in-
clude YCSB-C, because this workload is read-only and does
not need cache flushing. In general, our memory allocation
improves dirtiness by 12% on average. Among all workloads,
YCSB-E (Redis) has the largest increase: the average cache
line dirtiness increases from 51% to 68%.
Random crash tests. We examine data consistency in phys-
ically committed transactions using random crash tests. We aim
to evaluate the effectiveness of our checksum mechanism. We
use an NVM crash emulator [29], because of two reasons:
(1) a large number of crash tests affect the reliability of our
10
TABLE IV: Crash test results. The numbers for each workload in the table are for 100 crash tests; “I-obj”, “DI-obj” and “CC-obj” standard
for number of inconsistent persistent-objects, number of inconsistent persistent-objects detected by checksums, and number of inconsistent
persistent-objects that cannot be corrected by checksums respectively.
Systems Workloads Caching policy: LRU Caching policy: Pseudo-LRU Caching policy: Bimodal insertion Caching policy: randomI-obj DI-obj CC-obj I-obj DI-obj CC-obj I-obj DI-obj CC-obj I-obj DI-obj CC-obj
Redis
TPC-C 539 539 0 504 504 0 429 429 0 798 798 0
YCSB-A 901 901 0 776 776 0 88 88 0 1435 1435 0
YCSB-B 0 0 0 0 0 0 0 0 0 0 0 0
YCSB-D 0 0 0 0 0 0 0 0 0 0 0 0
YCSB-E 0 0 0 0 0 0 0 0 0 0 0 0
YCSB-F 29 29 0 24 24 0 7 7 0 56 56 0
SQLite
TPC-C 782 782 0 713 713 0 335 335 0 941 941 0
LinkBench 605 605 0 583 583 0 305 305 0 889 889 0
YCSB 45 45 0 38 38 0 25 25 0 96 96 0
Fig. 10: Distribution of the dirtiness of flushed cache lines with and
without Archapt; “Y” and “N” standard for using Archapt and no
Archapt respectively.
physical machine (including those battery-backed machines);
and (2) DRAM (used for our NVM emulation) and caches,
lose data when the crash happens. The emulator is based on
PIN [30] and emulates a configurable LRU cache hierarchy.
The crash emulator retains data in the emulated main memory
after a crash is triggered, allowing us to examine data consis-
tency. The crash emulator randomly triggers crashes. In our
evaluation, the cache capacities (19.25 MB in the last level
cache) and associativity (11) in the crash emulator are the
same as those in our physical machine. For each workload we
perform crash tests 100 times to ensure statistical significance.
The emulator only supports the LRU caching policy. We add
three more common caching policies (pseudo-LRU, bimodal
insertion and random eviction).
Table IV shows the results. The table reports total number of
inconsistent persistent-objects measured by the crash emulator
and total numbers of inconsistent persistent-objects detected
and corrected by the checksum mechanism for 100 crash tests.
We have two observations. First, the checksum mechanism
successfully detects and corrects all inconsistent persistent-
objects for all workloads under the four caching polices. This
demonstrates that the checksum mechanism is highly effective-
ness. Second, we do not have a large number of inconsistent
persistent-objects after crashes: Given billions of persistent
objects to update, we have at most hundreds of inconsistent
persistent-objects in each of 100 crash tests. For YCSB-B,
YCSB-D, and YCSB-E, we do not even find any inconsistent
persistent-objects. This indicates that our LRU-based approach
successfully estimates data locality, such that skipping cache-
line flushing causes a small number of inconsistent persistent
objects after transactions are physically committed.
VI. RELATED WORK
Persistency in NVM has received significant research activ-
ities recently [1], [2], [2], [3], [5], [17]–[22] Some of them
employ undo logging [1], [2], [17], [19], [20], [22], while
others employ redo logging [3], [18], [21]. Our work can be
applied to them to improve transaction performance.
Enabling crash consistency on NVM. Strict persis-
tency [31] enforces crash consistency by strictly enforcing
write orders in persistent memory and can cause a large
performance loss. Some work [31]–[35] relaxes the constrains
on write orders to improve performance. The recent work [36]
proposes a software cache solution to improve performance by
buffering writes to NVM. Different from the above existing
work, we do not relax write orders, but optimize performance
by skipping and coalescing cache-line flushing and provide
strong guarantee on data consistency for crashes.
Detection and correction of data errors. Previous efforts
on RAID [37]–[40] and ECC [41]–[46] exploit hardware- and
software-based approaches to detect and correct data errors,
but the relatively large runtime overhead and hardware mod-
ifications make them hard to be applied to detect and correct
data inconsistency on NVM for a transaction mechanism.
Algorithm-based fault tolerance, as an efficient software
mechanism to correct data errors, has been used for fault
tolerance in high performance computing [47]–[54]. However,
they are customized for specific numerical algorithms, and are
hard to be applied to transactional workloads in database.
Probabilistic crash consistency. In nature our work is
built upon probabilistic crash consistency: we skip cache-
line flushing for those cache blocks with high probability of
crash consistency. Chidambaram et al. [55] apply the idea
of probabilistic crash consistency to journaling file systems.
They ask applications to order writes without incurring a
disk flush, and request durability on disks when needed; they
use checksums to detect data/metadata inconsistency. Our
work is different from theirs in three perspectives: (1) We
do not modify write orders; (2) Our checksum can correct
inconsistency; (3) We highly optimize the performance of
Archapt to minimize runtime overhead in NVM (as main
memory), while some techniques (e.g., reuse after notification)
in [55] can cause high overhead in our scenarios.
The lazy persistency [56] is another example of probabilistic
crash consistency. It focuses on loop-based HPC applications
and skips cache-line flushing for all dirty data objects. It
relies on periodical cache flushing of all dirty cache blocks.
Our work is different from them in two perspectives. (1) The
lazy persistency does not decide which cache-line flushing
can be skipped. (2) The lazy persistency cannot correct data
11
inconsistency after a crash. The above two limitations can
cause unpredictable data loss in committed transactions.
VII. CONCLUSIONS
Enabling high performance transaction is critical to release
the performance benefit of persistent memory for many ap-
plications. In this paper, we present Archapt, an architecture-
aware, high performance transaction runtime system for per-
sistent memory. Archapt reduces the number of cache-line
flushing to improve performance of transactions. Archapt
estimates if cache blocks of a persistent objects are in the cache
to determine the necessity of cache-line flushing. Relying
on checksum mechanisms to detect and correct data incon-
sistence, Archapt provides strong crash consistency. Archapt
also coalesces cache blocks with low dirtiness to improve the
efficiency of cache-line flushing. Our results show that on
average, Archapt reduces cache flushing by 66% and improves
system throughput by 19% (42% at most), using a undo
logging (PMDK) and a redo logging (Mnemosyne) as baseline.
REFERENCES
[1] D. R. Chakrabarti, H.-J. Boehm, and K. Bhandari, “Atlas: Leveraging
Locks for Non-volatile Memory Consistency,” in Proceedings of the
2014 ACM International Conference on Object Oriented Programming
Systems Languages & Applications, 2014.
[2] A. Kolli, S. Pelley, A. Saidi, P. M. Chen, and T. F. Wenisch, “High-
Performance Transactions for Persistent Memories,” in Proceedings of
the Twenty-First International Conference on Architectural Support for
Programming Languages and Operating Systems, 2016.
[3] H. Volos, A. J. Tack, and M. M. Swift, “Mnemosyne: Lightweight Per-
sistent Memory,” in Architectural Support for Programming Languages
and Operating Systems, 2011.
[4] Y. Lu, J. Shu, L. Sun, and O. Mutlu, “Loose-Ordering Consistency for
persistent memory,” in 2014 IEEE 32nd International Conference on
Computer Design, Oct 2014.
[5] A. Memaripour, A. Badam, A. Phanishayee, Y. Zhou, R. Alagappan,
K. Strauss, and S. Swanson, “Atomic In-place Updates for Non-volatile
Main Memories with Kamino-Tx,” in Proceedings of the Twelfth Euro-
pean Conference on Computer Systems, 2017.
[6] H. Shu, H. Chen, H. Liu, Y. Lu, Q. Hu, and J. Shu, “Empirical Study of
Transactional Management for Persistent Memory,” in 2018 IEEE 7th
Non-Volatile Memory Systems and Applications Symposium, 2018.
[7] V. J. Marathe, A. Mishra, A. Trivedi, Y. Huang, F. Zaghloul, S. Kashyap,
M. Seltzer, T. Harris, S. Byan, B. Bridge, and D. Dice, “Persistent
Memory Transactions,” CoRR, 2018.
[8] H. Wan, Y. Lu, Y. Xu, and J. Shu, “Empirical Study of Redo and Undo
Logging in Persistent Memory,” in 5th Non-Volatile Memory Systems
and Applications Symposium, 2016.
[9] S. T. Leutenegger and D. Dias, “A Modeling Study of the TPC-C
Benchmark,” in SIGMOD Record, 1993.
[10] “Python TPC-C ,” https://github.com/apavlo/py-tpcc.
[11] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears,
“Benchmarking Cloud Serving Systems with YCSB,” in Proceedings of
the 1st ACM Symposium on Cloud Computing, 2010.
[12] “Yahoo! Cloud Serving Benchmark,” https://github.com/
brianfrankcooper/YCSB.
[13] J. Carlson, “Redis in Action,” in Manning Publications: Greenwich,
2013.
[14] “OLTPBench,” https://github.com/oltpbenchmark/oltpbench.
[15] T. G. Armstrong, V. Ponnekanti, D. Borthakur, and M. Callaghan,
“Linkbench: A database benchmark based on the facebook social graph,”
in Proceedings of the 2013 ACM SIGMOD International Conference on
Management of Data, 2013.
[16] G. Allen and M. Owens, The Definitive Guide to SQLite, 2nd ed.
Berkely, CA, USA: Apress, 2010.
[17] Intel, “ Persistent Memory Development Kit,” https://pmem.io/.
[18] Y. Lu, J. Shu, and L. Sun, “Blurred Persistence in Transactional
Persistent Memory,” in 2015 31st Symposium on Mass Storage Systems
and Technologies, 2015.
[19] S. R. Dulloor, S. Kumar, A. Keshavamurthy, P. Lantz, D. Reddy,
R. Sankaran, and J. Jackson, “System Software for Persistent Memory,”
in Proceedings of the Ninth European Conference on Computer Systems,
2014.
[20] J. Coburn, A. M. Caulfield, A. Akel, L. M. Grupp, R. K. Gupta,
R. Jhala, and S. Swanson, “NV-Heaps: Making Persistent Objects Fast
and Safe with Next-generation, Non-volatile Memories,” in Proceedings
of the Sixteenth International Conference on Architectural Support for
Programming Languages and Operating Systems, 2011.
[21] S. Venkataraman, N. Tolia, P. Ranganathan, and R. H. Campbell, “Con-
sistent and Durable Data Structures for Non-volatile Byte-addressable
Memory,” in Proceedings of the 9th USENIX Conference on File and
Stroage Technologies, 2011.
[22] E. R. Giles, K. Doshi, and P. Varman, “SoftWrAP: A Lightweight
Framework for Transactional Support of Storage Class Memory,” in
2015 31st Symposium on Mass Storage Systems and Technologies, May
2015.
[23] M. LLC, “Database persistence, without the performance penalty,”
http://www.odbms.org/wp-content/uploads/2014/06/IMDS-NVDIMM-
paper.pdf.
[24] H. Volos, A. J. Tack, and M. M. Swift, “Mnemosyne: Lightweight persis-
tent memory,” in Proceedings of the Sixteenth International Conference
on Architectural Support for Programming Languages and Operating
Systems, ser. ASPLOS, 2011.
[25] D. R. Chakrabarti, H.-J. Boehm, and K. Bhandari, “Atlas: Leveraging
locks for non-volatile memory consistency,” in Proceedings of the
2014 ACM International Conference on Object Oriented Programming
Systems Languages & Applications, ser. OOPSLA ’14, 2014.
[26] D. E. Difallah, A. Pavlo, C. Curino, and P. Cudre-Mauroux, “OLTP-
Bench: An Extensible Testbed for Benchmarking Relational Databases,”
Proc. VLDB Endow., 2013.
[27] J. Izraelevitz, J. Yang, L. Zhang, J. Kim, X. Liu, A. Memaripour, Y. J.
Soh, Z. Wang, Y. Xu, S. R. Dulloor, J. Zhao, and S. Swanson, “Basic
performance measurements of the intel optane DC persistent memory
module,” CoRR, vol. abs/1903.05714, 2019.
[28] Q. Liu, J. Izraelevitz, S. K. Lee, M. L. Scott, S. H. Noh, and C. Jung,
“ido: Compiler-directed failure atomicity for nonvolatile memory,” in
2018 51st Annual IEEE/ACM International Symposium on Microarchi-
tecture (MICRO), 2018.
[29] J. Ren, K. Wu, and D. Li, “Understanding Application Recomputability
Without Crash Consistency in Non-Volatile Memory,” in Proceedings of
the Workshop on Memory Centric High Performance Computing, 2018.
[30] V. Reddi, A. Settle, D. Connors, and R. Cohn, “Pin: A Binary Instru-
mentation Tool for Computer Architecture Research and Education,” in
Proceedings of the 2004 workshop on Computer architecture education,
2004.
[31] S. Pelley, P. M. Chen, and T. F. Wenisch, “Memory Persistency,” in
Proceeding of the 41st Annual International Symposium on Computer
Architecuture, 2014.
[32] A. Joshi, V. Nagarajan, M. Cintra, and S. Viglas, “Efficient Persist Bar-
riers for Multicores,” in International Symposium on Microarchitecture,
2015.
[33] A. Kolli, J. Rosen, S. Diestelhorst, A. Saidi, S. Pelley, S. Liu, P. M. Chen,
and T. F. Wenisch, “Delegated Persist Ordering,” in 2016 49th Annual
IEEE/ACM International Symposium on Microarchitecture, 2016.
[34] J. Condit, E. B. Nightingale, C. Frost, E. Ipek, B. Lee, D. Burger, and
D. Coetzee, “Better I/O Through Byte-addressable, Persistent Memory,”
in Proceedings of the ACM SIGOPS 22Nd Symposium on Operating
Systems Principles, 2009.
[35] J. Ren, J. Zhao, S. Khan, J. Choi, Y. Wu, and O. Mutiu, “ThyNVM:
Enabling Software-transparent Crash Consistency in Persistent Memory
Systems,” in 2015 48th Annual IEEE/ACM International Symposium on
Microarchitecture, 2015.
[36] P. Li, D. R. Chakrabarti, C. Ding, and L. Yuan, “Adaptive software
caching for efficient nvram data persistence,” in 2017 IEEE International
Parallel and Distributed Processing Symposium (IPDPS), 2017.
[37] P. M. Chen, E. K. Lee, G. A. Gibson, R. H. Katz, and D. A. Patterson,
“RAID: High-performance, Reliable Secondary Storage,” ACM Comput.
Surv., 1994.
[38] K. Hwang, H. Jin, and R. Ho, “RAID-x: A New Distributed Disk
Array for I/O-centric Cluster Computing,” in Proceedings the Ninth
International Symposium on High-Performance Distributed Computing,
2000.
12
[39] J. Menon and J. Cortney, “The Architecture of a Fault-tolerant Cached
RAID Controller,” in Proceedings of the 20th Annual International
Symposium on Computer Architecture, 1993.
[40] F. Schmuck and R. Haskin, “GPFS: A Shared-Disk File System for
Large Computing Clusters,” in Proceedings of the 1st USENIX Confer-
ence on File and Storage Technologies, 2002.
[41] S. Lu, H. Li, and K. Miyase, “Progressive ECC Techniques for Phase
Change Memory,” in 2018 IEEE 27th Asian Test Symposium, 2018.
[42] D. Li, Z. Chen, P. Wu, and J. S. Vetter, “Rethinking Algorithm-Based
Fault Tolerance with a Cooperative Software-Hardware Approach,” in
ACM/IEEE International Conference for High Performance Computing,
Networking, Storage and Analysis, 2013.
[43] A. N. Udipi, N. Muralimanohar, R. Balsubramonian, A. Davis, and
N. P. Jouppi, “LOT-ECC: Localized and Tiered Reliability Mechanisms
for Commodity Memory Systems,” in International Symposium on
Computer Architecture, 2012.
[44] S. Li, D. H. Yoon, K. Chen, J. Zhao, J. H. Ahn, J. B. Brockman, Y. Xie,
and N. P. Jouppi, “MAGE: Adaptive Granularity and ECC for Resilient
and Power Efficient Memory Systems,” in International Conference for
High Performance Computing, Networking, Storage and Analysis, 2012.
[45] D. H. Yoon and M. Erez, “Virtualized and flexible ecc for main memory,”
in Proceedings of the Fifteenth Edition of ASPLOS on Architectural
Support for Programming Languages and Operating Systems, 2010.
[46] T. Dell, “A White Paper On The Benefits Of Chipkill-Correct ECC for
PC Server Main Memory,” IBM Microelectronics Division, Tech. Rep.,
1997.
[47] S. Yang, K. Wu, Y. Qiao, D. Li, and J. Zhai, “Algorithm-Directed
Crash Consistence in Non-volatile Memory for HPC,” in 2017 IEEE
International Conference on Cluster Computing, 2017.
[48] Z. Chen, “Algorithm-based Recovery for Iterative Methods without
Checkpointing,” in International Symposium on High Performance Dis-
tributed Computing, 2011.
[49] P. Wu, C. Ding, L. Chen, T. Davies, C. Karlsson, and Z. Chen, “On-
line Soft Error Correction in Matrix-Matrix Multiplication,” Journal of
Computational Science, 2013.
[50] T. Davies, C. Karlsson, H. Liu, C. Ding, and Z. Chen, “High Perfor-
mance Linpack Benchmark: A Fault Tolerant Implementation without
Checkpointing,” in International Conference on Supercomputing, 2011.
[51] T. Davies and Z. Chen, “Correcting Soft Errors Online in LU Factor-
ization,” in International Symposium on High-Performance Parallel and
Distributed Computing, 2013.
[52] Z. Chen, “Online-ABFT: An Online Algorithm Based Fault Tolerance
Scheme for Soft Error Detection in Iterative Methods,” in ACM SIG-
PLAN Symposium on Principles and Practice of Parallel Programming,
2013.
[53] K.-H. Huang and Abraham, “Algorithm-Based Fault Tolerance for
Matrix Operations,” IEEE Transactions on Computers, 1984.
[54] P. Du, A. Bouteiller, G. Bosilca, T. Herault, and J. Dongarra, “Algorithm-
based Fault Tolerance for Dense Matrix Factorizations,” in Proceedings
of the 17th ACM SIGPLAN Symposium on Principles and Practice of
Parallel Programming, 2012.
[55] V. Chidambaram, T. S. Pillai, A. C. Arpaci-Dusseau, and R. H. Arpaci-
Dusseau, “Optimistic crash consistency,” in Proceedings of the Twenty-
Fourth ACM Symposium on Operating Systems Principles, ser. SOSP
’13, 2013.
[56] M. Alshboul, J. Tuck, and Y. Solihin, “Lazy Persistency: A High-
Performing and Write-Efficient Software Persistency Technique,” in
2018 ACM/IEEE 45th Annual International Symposium on Computer
Architecture, June 2018.
13
