Reducing memory persistency overheads with transparent out-of-place updates by Coats, Chance Christopher
© 2019 Chance Christopher Coats





Submitted in partial fulfillment of the requirements
for the degree of Master of Science in Electrical and Computer Engineering
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2019
Urbana, Illinois
Adviser:
Assistant Professor Jian Huang
ABSTRACT
Recent advances in memory technology have led to the creation of high-
performance, non-volatile alternatives to traditional DRAM, known as non-
volatile memory (NVM). While this technology has provided immense oppor-
tunities to system designers, it has also presented new challenges since appli-
cations running on NVM systems require data persistence guarantees with
respect to system crashes. To address this problem, many crash-consistency
techniques, including logging and shadow paging, have been proposed. How-
ever, existing solutions can suffer from significant overheads on the critical
path of execution or introduce extra write traffic to NVM, or even both. For
instance, logging approaches introduce double writes for data and logs in the
critical path of program execution, while shadow paging incurs significant
write amplification and cache flushes to ensure durability.
To provide persistence guarantees, this work proposes a transparent and ef-
ficient out-of-place update mechanism which provides atomic data durability
without incurring a substantial number of additional writes or performance
overheads. The key idea of the proposed approach is to write the updated
data to a new location in NVM while keeping the old data unmodified until
after the updated version becomes durable. To support out-of-place updates
in NVM, this work introduces a lightweight and transparent persistence indi-
rection layer, called PIL, along with minor changes to existing processor ar-
chitectures which together enable efficient transaction execution in hardware.
Experimental results with a variety of data structures and data-intensive ap-
plications show that PIL achieves low critical-path latency with small write
amplification, which is close to that of a native system without persistence
support. Compared with the state-of-the-art crash-consistency techniques,
it improves application performance by up to 1.8× while reducing write am-
plification by up to 85.3%. PIL also demonstrates scalable data recovery
capability on multi-core systems.
ACKNOWLEDGMENTS
I first want to thank Assistant Professor Jian Huang at the University of
Illinois at Urbana-Champaign. His guidance during my time as a graduate
student was directly responsible for shaping me as a student, researcher,
employee, and as a person. He not only provided incredible opportunities for
me to grow, but also gave feedback from which I learned valuable lessons.
For passing on his knowledge, I am immensely grateful.
I also want to express my utmost gratitude to my parents and family.
Their love and support while I grew as a person included life lessons that I
will never forget, and from them I learned practical skills which I use every
day. Without their loving influence, the success I’ve found thus far in life
certainly would have been hindered. As I began my academic career, their
support was unwavering and reaffirmed that I was following a path true to
who I am. To put it plainly: without their love and support, I would not be
the person I am today.
Finally, I need to express my admiration and love for the friends I have
made during my time as an undergraduate and graduate student at the
University of Illinois at Urbana-Champaign. I will never forget the time that
we spent together or the bonds we formed. Their support from day to day
is what kept me motivated to pursue my dreams and accomplish my goals
even when I faced immense challenges in my work and social life.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
CHAPTER 1 INTRODUCTION
CHAPTER 2 BACKGROUND
2.1 Non-volatile Memory Technologies
2.2 Out-of-Order Processor Cores
2.3 Memory and Cache Hierarchies
2.4 ACID Properties
2.5 Crash-consistency Techniques
CHAPTER 3 DESIGN CONSIDERATIONS
3.1 Overview and Design Goals
3.2 Transparent Out-of-place Update
3.3 Address Mapping Table
3.4 TxBegin and TxEnd Instructions
3.5 Load and Store Operations in the PIL
3.6 Persistence Optimizations
3.7 OOP Region Organization
3.8 Garbage Collection
3.9 Data Recovery
CHAPTER 4 METHODOLOGY
4.1 Experimental Setup
4.2 Applications and Microbenchmarks
4.3 System Comparison
CHAPTER 5 ANALYSIS
5.1 Transaction Throughput
5.2 Critical Path Latency
5.3 Write Traffic
5.4 Garbage Collection Efficiency
5.5 System Recovery
5.6 Wear Leveling and Sensitivity Analysis
CHAPTER 6 RELATED WORK
6.1 Non-volatile Memory File Systems
6.2 Durable Transaction Systems
6.3 Lock-based Persistence Support
CHAPTER 7 CONCLUSION
7.1 Takeaways
7.2 Future Work Directions
REFERENCES
LIST OF TABLES
4.1 Simulation System Configuration
4.2 Microbenchmarks and Macrobenchmarks
5.1 Average Benchmark Data Removal Ratios
LIST OF FIGURES
2.1 Out-of-Order Processor Microarchitecture
2.2 Memory Hierarchy
2.3 Logging Crash-consistency
2.4 Shadow Paging Crash-consistency
2.5 Log-structured NVM Crash-consistency
2.6 Hardware Out-of-place Update Crash-consistency
3.1 Hardware Transparent Out-of-place Update with PIL
3.2 The Store Process in PIL
3.3 Data Packing in PIL
3.4 Transaction Timeline of Crash-Safe Approaches
3.5 OOP Region Organization
3.6 Data Memory Slice Layout
5.1 Transaction Throughput
5.2 Transaction Throughput for YCSB Benchmarks
5.3 Critical Path Latency
5.4 Normalized Benchmark Write Traffic
5.5 Microbenchmark Performance with Varied Garbage Collection Periods
5.6 Recovery Time of a 1GB Reserved Region
5.7 Gigabytes Written to OOP Blocks 1-16
5.8 Transaction Throughput with Varied NVM Read Latency
LIST OF ABBREVIATIONS
CPU Central Processing Unit
DRAM Dynamic Random Access Memory
GC Garbage Collection
HDD Hard Disk Drive
LLC Last Level Cache
NVM Non-volatile Memory
OLTP Online Transaction Processing
OS Operating System
OSP Optimized Shadow Paging
PCM Phase Change Memory
PIL Persistence Indirection Layer
ReRAM Resistive Random Access Memory
ROB Reorder Buffer
SRAM Static Random Access Memory
STT-MRAM Spin-Transfer Torque Magnetic Random Access Memory




CHAPTER 1 INTRODUCTION
Emerging NVM technologies like PCM [1], STT-MRAM [2], ReRAM [3],
and 3D XPoint [4] offer promising properties including byte-addressability,
non-volatility, and scalable memory capacity. Unlike DRAM-based systems,
applications running on NVM require memory persistence guarantees to en-
sure crash safety [5, 6], which means a set of data updates must behave in
an atomic, consistent, and durable manner with respect to system failures.
Ensuring memory persistence with commodity out-of-order processors and
hardware-controlled cache hierarchies is challenging and costly due to un-
predictable cache evictions. Prior research has developed various crash-
consistency techniques such as logging [7], shadow paging [8], and their op-
timized versions for NVM. However, they either introduce extra write traffic
to NVM, or suffer from high performance overheads in the critical path of
program execution, or even both.
Specifically, although logging provides strong atomic durability against
system crashes, it introduces significant overheads. First, both undo logging
and redo logging must make a data copy before performing the in-place
update. Persisting these data copies incurs extra writes to NVM in the
critical path of program execution [9, 10]. This not only decreases application
performance, but also hurts NVM lifetime [1, 11]. Second, enforcing the
correct persistence ordering between log and data updates requires cache
flushes and memory fences [12, 13]. These costly instructions further cause
significant performance overheads [14, 15, 16, 17, 18].
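The two costs described above can be made concrete with a minimal Python sketch. It is only a toy model under stated assumptions: NVM and the volatile cache are plain dictionaries, and the names ToyNVM, flush, fence, and undo_logged_update are hypothetical stand-ins for cache-line flush and memory fence instructions, not an API taken from this thesis or from any real library.

```python
# Toy model of undo logging. All names here are illustrative assumptions.
class ToyNVM:
    def __init__(self):
        self.durable = {}      # contents that survive a crash
        self.cache = {}        # volatile contents, lost on a crash
        self.nvm_writes = 0    # number of writes that reached NVM
        self.fences = 0        # ordering points on the critical path

    def store(self, addr, value):
        self.cache[addr] = value            # lands in the volatile cache only

    def flush(self, addr):
        if addr in self.cache:
            self.durable[addr] = self.cache.pop(addr)
            self.nvm_writes += 1

    def fence(self):
        self.fences += 1


def undo_logged_update(nvm, addr, new_value):
    old = nvm.durable.get(addr)
    log_addr = ("log", addr)
    nvm.store(log_addr, old)    # 1. copy the old value into the undo log
    nvm.flush(log_addr)         # 2. persist the log entry ...
    nvm.fence()                 #    ... before the data is allowed to change
    nvm.store(addr, new_value)  # 3. perform the in-place update
    nvm.flush(addr)             # 4. persist the updated data
    nvm.fence()


nvm = ToyNVM()
nvm.durable["x"] = 1
undo_logged_update(nvm, "x", 2)
print(nvm.nvm_writes, nvm.fences)  # prints "2 2"
```

In this model a single logical update costs two NVM writes and two ordering points, which corresponds to the double-write and fence overheads attributed to logging.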
To address the aforementioned problems, researchers recently proposed
asynchronous in-place data updates, such as DudeTM [12] and ReDu [16],
in which the systems maintain an explicit main copy of data to perform
in-place updates, and then asynchronously apply these changes to the data
copy, or asynchronously persist the undo logs to NVM [19]. Unfortunately,
these approaches do not mitigate the problem of incurring additional write
traffic, due to the background data synchronization. Kiln [6] alleviates this drawback
by using a non-volatile on-chip cache to buffer data updates. However, it
requires non-trivial hardware modifications to the CPU architecture and its
cache coherence protocol.
Similar to asynchronous in-place updates, an alternative crash-consistency
technique is shadow paging. Unfortunately, shadow paging incurs both ad-
ditional data writes to NVM and performance overheads in the critical path
due to its copy-on-write (CoW) mechanism [8]. Ni et al. [10] optimized
shadow paging by enabling data copies at cache-line granularity. However, this
optimization requires TLB modifications to support the cache-line remapping.
This work proposes a transparent out-of-place update approach in NVM
hardware. The key insight of the proposed approach is to store updated data
outside of their original locations in a dedicated memory region in NVM,
and then apply these updates lazily through an efficient garbage collection
scheme. This reduces the data persistence overheads in three ways. First,
it eliminates the extra writes caused by logging mechanisms, as the old data
copies already exist in NVM and logging is not required. Second, the out-of-
place update does not assume any persistence ordering for store operations,
which allows them to execute in a conventional out-of-order manner. Third,
persisting the updates in new locations does not affect the old data version,
which inherently supports atomic data durability.
Since updates are written to a new place in NVM, this work develops a
lightweight Persistence Indirection Layer, named PIL, in the memory con-
troller to handle the physical address remapping. PIL enables low-cost out-of-
place updates with four major components. First, a dedicated memory region
is organized as a log structure to store data updates, and data packing is
applied to the out-of-place updates. This allows PIL to best utilize the memory
bandwidth of NVM and to reduce write traffic to NVM. Second, to reduce the
memory space cost caused by the out-of-place updates, PIL develops an effi-
cient garbage collection (GC) algorithm to adaptively restore updated data
back to their home locations. It uses a data combination scheme to further re-
duce the GC overhead. Third, PIL maintains a hash-based address-mapping
table in the memory controller for physical-to-physical address translation,
and ensures that load operations always read updated data from NVM with
trivial address translation overhead. Since the entries in the address-mapping
table will be cleaned when the corresponding out-of-place updates are peri-
odically garbage collected, the mapping table size is small. Fourth, PIL also
supports fast data recovery in the event of system failures and crashes by
leveraging the thread parallelism available in multi-core computing systems.
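The interaction between the log-structured region, the address-mapping table, and garbage collection described above can be sketched in a few lines of Python. This is a simplified software model under stated assumptions, not the hardware design: the class ToyPIL and its method names are hypothetical, the out-of-place region is a Python list, and data packing, transactions, and recovery are omitted.

```python
# Toy model of out-of-place updates behind an indirection layer.
# ToyPIL and its methods are hypothetical names used only for illustration.
class ToyPIL:
    def __init__(self, home):
        self.home = dict(home)   # home region: original data locations
        self.oop = []            # append-only out-of-place (OOP) region
        self.mapping = {}        # home address -> index into the OOP region

    def store(self, addr, value):
        self.oop.append((addr, value))           # new version goes to a new place
        self.mapping[addr] = len(self.oop) - 1   # old version stays untouched

    def load(self, addr):
        idx = self.mapping.get(addr)
        if idx is not None:
            return self.oop[idx][1]              # redirected to the newest version
        return self.home[addr]

    def garbage_collect(self):
        # Apply only the newest version of each address to its home location,
        # then drop the mapping entries and reclaim the OOP region.
        for addr, idx in self.mapping.items():
            self.home[addr] = self.oop[idx][1]
        self.mapping.clear()
        self.oop.clear()


pil = ToyPIL({0x10: "a", 0x20: "b"})
pil.store(0x10, "a1")
pil.store(0x10, "a2")
before_gc = (pil.load(0x10), pil.home[0x10])
pil.garbage_collect()
after_gc = (pil.load(0x10), pil.home[0x10], len(pil.mapping))
```

Before garbage collection the load is redirected to the newest version while the home copy still holds the old value; afterwards the home copy is current and the mapping table is empty, which is why the table can stay small.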
Because PIL is developed in the memory controller, it is transparent to
upper-level systems software. No non-volatile cache or TLB modifications
for address translation are required. Moreover, unlike software-based
log-structured solutions such as LSNVMM [14], which requires multiple mem-
ory accesses in a tree index for each read, PIL provides an efficient hardware
solution with minimal performance overhead and write amplification. Over-
all, this work makes the following contributions:
• A hardware out-of-place update scheme is introduced to ensure the
crash-consistency of NVM. It minimizes extra write traffic to NVM
and avoids critical-path latency overheads.
• An efficient GC scheme is used to adaptively apply the recent data
updates from the log-structured memory region to their original copies
to save memory space.
• A lightweight persistence indirection layer named PIL is built into the
memory controller to make out-of-place updates transparent to software
systems, while minimizing hardware cost.
• A scalable data recovery scheme is created, which exploits multi-core
systems to parallelize the NVM recovery procedure after program fail-
ures and crashes.
PIL is implemented in a Pin-based many-core simulator, McSimA+ [20],
with the addition of an NVM simulator. PIL is evaluated against four
representative and well-optimized crash-consistency approaches, including
undo logging [15], redo logging [21], optimized shadow paging [10], and log-
structured NVM [14]. The evaluation uses a set of microbenchmarks run-
ning on popular data structures like hashmaps and B-trees [16, 15, 17, 18]
as well as real-world data-intensive application workloads like the Yahoo! Cloud
Serving Benchmark (YCSB) and transactional databases [22]. Experimental
results demonstrate that PIL significantly outperforms the state-of-the-art
approaches by up to 1.8× in terms of transaction throughput, and reduces
write traffic to NVM by up to 85.3%, while ensuring the same atomic durabil-
ity as existing crash-consistency techniques. PIL also scales the data recovery
procedure as the number of threads on a multi-core system is increased.
The remainder of this thesis is organized as follows. First, necessary back-
ground on NVM technologies, modern out-of-order processors, memory and
cache hierarchies, and ACID properties is provided in Chapter 2. The de-
sign considerations and decisions of this work are discussed in Chapter 3.
Chapter 4 describes the experimental setup and testing methodology of this
work while Chapter 5 presents the results of the work including performance
comparisons with existing solutions. Chapter 6 discusses literature related
to the work in this thesis. Finally, Chapter 7 includes a discussion of the




CHAPTER 2 BACKGROUND
2.1 Non-volatile Memory Technologies
Dynamic random access memory (DRAM) chips form the backbone of com-
puter memory systems due to their high performance and byte-addressability.
These chips are used to keep data physically close to the CPU for fast access
during program execution. However, despite the wide use of DRAM chips
in computer systems, DRAM is not without drawbacks. Due to the physical
construction of DRAM storage cells, power must constantly be provided to
the chips, or else the stored data will be lost. As a result of this require-
ment, memory system power can account for up to 6% of total system power
[23]. Furthermore, the cost of DRAM chips is determined primarily by their
integration density. Each storage cell can store only a single bit of data, lim-
iting scalability and increasing the cost of having a large amount of system
DRAM which is desired for performance reasons.
Recently, new memory technologies such as PCM [1], STT-MRAM [2],
ReRAM [3], and 3D XPoint [4] have offered promising alternatives to tradi-
tional DRAM. These technologies offer performance comparable to DRAM
and byte-addressability and, importantly, are all non-volatile, meaning they
retain their stored data even when power is not provided to the underlying
chips. Further, these technologies offer increased integration density when
compared to DRAM. Given the ability to keep multiple bits of data in a sin-
gle storage cell, these non-volatile memory chips can provide lower costs for
a given capacity when compared with DRAM. The combination of DRAM-
like characteristics with non-volatility and increased density presents a new
paradigm for memory systems designers. The potential for dramatic system
power reduction and increased memory capacity means systems built with
NVM can offer users higher performance and greater energy efficiency.
2.2 Out-of-Order Processor Cores
Since the creation of the Intel 4004 [24] in 1971, processor microarchitecture
has evolved tremendously. Modern processors perform on-the-fly reordering
of program code, speculatively perform computations, and contain highly
complex hardware structures to support out-of-order execution all in the
constant pursuit of increased performance [25]. While the ever-increasing
complexity of processors has provided steady performance gains over the last
four decades, the complexity of the structures within these processors can
present challenges to system designers attempting to maximize performance
or add new functionality to these processors. Of particular interest in the
scope of this thesis are two structures present in modern, out-of-order pro-
cessors known as the reorder buffer (ROB) and the load/store queue [25]. A
description of these structures and their relevance to this work is presented
below.
As seen in Figure 2.1, the microarchitecture of a modern out-of-order pro-
cessor has become quite complex, with several blocks used for instruction
fetching, decoding, and scheduling. These blocks connect with multiple func-
tional units, known as a wide execution back end, to provide high instruction
throughput. Highlighted for clarity are the reorder buffer and load/store
queue.
2.2.1 Reorder Buffers
The vast majority of programs run on computers today are written in a
language which expresses a sequence of steps that together achieve the de-
sired functionality of the programmer [26]. These programs are then turned
into a sequence of instructions which, when executed in program order on
a processor, will perform the desired computation dictated by the program.
As it turns out, programs often contain sections of code which, as a result
of their inputs being independent, need not be executed in the strict se-
quence dictated by the original program. Processor designers recognized this
instruction-level parallelism and started building machines which can execute
and commit instructions out-of-order with respect to the original program
sequence [27]. Despite the impressive performance gains achieved through
out-of-order execution, many challenges arose. When a program is execut-
ing, interrupts and exceptions may occur which require the operating system
to switch execution contexts. When this happens, the currently executing
program must have its state saved and the interrupt or exception is then
handled by the operating system. Once the interrupt or exception has been
handled by the OS, control is returned back to the program which was pre-
viously executing. This is accomplished by returning to the last committed
instruction in the program and resuming execution with the next instruction.
This process seems straightforward for hardware which executes instructions
in-order since the concept of the last committed instruction is obvious, but for
hardware which executes and commits instructions out-of-order, the concept
of the last committed instruction becomes complex to define.
Figure 2.1: Out-of-Order Processor Microarchitecture
Enter the reorder buffer. The goal of the reorder buffer is to allow for
out-of-order execution, but retain the benefits of in-order commit. Namely,
the use of the reorder buffer allows a program to have a well-defined concept
of the last committed instruction. This enables modern processors to exploit
instruction-level parallelism through out-of-order execution while also offer-
ing precise exceptions and interrupts. The reorder buffer works by keeping
all in-flight instructions in the original program order. Furthermore, the
reorder buffer can act as a destination for the result of computations. This
ensures that instructions which complete their execution out-of-order with
respect to the program do not commit and update the architectural state
of the machine until the instruction should normally be allowed to commit
per the program order. The reorder buffer therefore holds all instructions in
the original order defined by the program, and also holds temporary values
until these values can be written into the register file of the processor. With
these two functions, the reorder buffer supports high-performance execution
and maintains a clear ordering of instructions which allows precise context
switching between tasks.
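The two functions of the reorder buffer can be illustrated with a small Python sketch. It is a deliberately simplified model: the class ToyROB and its methods are hypothetical names, each instruction writes one destination register, and squashing on exceptions or mispredictions is not modeled.

```python
from collections import deque

# Toy reorder buffer: entries are allocated in program order and are
# retired only from the head. All names are illustrative assumptions.
class ToyROB:
    def __init__(self):
        self.entries = deque()   # program order, oldest entry at the left
        self.registers = {}      # architectural register file

    def dispatch(self, name, dest):
        entry = {"name": name, "dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value):
        entry["value"] = value   # result is held in the ROB, not yet visible
        entry["done"] = True

    def commit(self):
        committed = []
        # Retire from the head only, so commit follows program order.
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            self.registers[entry["dest"]] = entry["value"]
            committed.append(entry["name"])
        return committed


rob = ToyROB()
i1 = rob.dispatch("i1", "r1")
i2 = rob.dispatch("i2", "r2")
rob.complete(i2, 7)                    # the younger instruction finishes first
first = rob.commit()                   # nothing retires: i1 is still at the head
visible_after_first = dict(rob.registers)
rob.complete(i1, 3)
second = rob.commit()                  # both retire, in program order
```

Although i2 completes first, its result does not reach the register file until i1 has committed, so the last committed instruction is always well defined.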
2.2.2 Load and Store Queues
While out-of-order execution offers tremendous performance gains for in-
structions which perform computations, memory instructions have different
challenges which must be solved in order to exploit instruction-level par-
allelism. The term memory disambiguation [28] describes the challenge of
determining which address a load or store instruction will access. Determin-
ing the address of a load or store is the crucial step in deciding whether two
memory instructions are independent, and correspondingly, whether these in-
structions may be serviced out-of-order with respect to the original program
sequence.
The load/store queue has the job of holding memory instructions in pro-
gram order, similar to the goal of the reorder buffer, with the additional
task of determining which memory instructions can be serviced out-of-order
and which memory instructions must remain in program order. Instructions
which must remain in order are simply serviced when they reach the head of
the load/store queue. On the other hand, independent memory instructions
are offered various performance optimizations. Load speculation [29] can be
performed by the load/store queue whereby a waiting load instruction can
read its data before it is known whether this load is independent of prior
instructions. The load/store queue will simply hold the value until memory
disambiguation has completed for this instruction. If the load is indepen-
dent, it can immediately return its value. If not, the load will replay and
read the correct data. The load/store queue can also perform store-to-load
forwarding [30] whereby a load instruction which follows a store instruction
to the same address need not access the L1 cache and can simply read
the store instruction’s data directly. These performance optimizations are
performed by the load/store queue automatically, but can have a dramatic
impact on the order of memory operations.
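Store-to-load forwarding, as described above, can be sketched as follows. This is a minimal model under stated assumptions: the class ToyStoreQueue is a hypothetical name, addresses are matched exactly at word granularity, and load speculation and replay are not modeled.

```python
# Toy store queue with store-to-load forwarding.
# ToyStoreQueue and its methods are illustrative assumptions.
class ToyStoreQueue:
    def __init__(self, l1_cache):
        self.l1 = l1_cache       # dict: address -> value
        self.pending = []        # stores not yet written to the cache, in program order

    def store(self, addr, value):
        self.pending.append((addr, value))

    def load(self, addr):
        # Search from youngest to oldest so the most recent matching store wins.
        for st_addr, st_value in reversed(self.pending):
            if st_addr == addr:
                return st_value, "forwarded"
        return self.l1[addr], "cache"

    def drain(self):
        # Write pending stores to the cache in program order.
        for addr, value in self.pending:
            self.l1[addr] = value
        self.pending.clear()


sq = ToyStoreQueue({0x40: 1, 0x80: 5})
sq.store(0x40, 2)
sq.store(0x40, 3)
r1 = sq.load(0x40)   # served from the youngest pending store
r2 = sq.load(0x80)   # no matching store, served from the L1 cache
sq.drain()
```

The first load observes the value 3 before it has reached the cache, which shows how the queue can change when and where a memory operation is actually serviced.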
2.3 Memory and Cache Hierarchies
Within a computer system, many different memory technologies are utilized
to best optimize the cost and performance of a given system. There are
memory cells such as SRAM which offer very high performance, but have
low density and correspondingly high cost. On the other hand, magnetic
storage devices such as HDDs offer orders of magnitude higher storage den-
sity and lower costs with a corresponding reduction in performance. It is
therefore possible that higher performance could be achieved with a substan-
tial increase in cost, or lower cost could be achieved with a large sacrifice
in overall performance. System designers keep these trade-offs in mind and
optimize systems to minimize cost and maximize performance in the aver-
age case. This optimization process has led to the creation of the modern
memory hierarchy, shown in Figure 2.2.
Figure 2.2: Memory Hierarchy
Figure 2.2 depicts the trade-offs between different memory technologies.
Given its high performance and high cost, SRAM is used for the processor
register file which must be accessed very quickly, but stores a relatively small
amount of data on the order of kilobytes. Caches which reside on-chip with
the processor cores are also constructed using SRAM. These caches are sub-
stantially larger than the processor register file and can have sizes on the
order of megabytes. These caches are themselves organized into a hierarchy
with multiple levels. The level 1 (L1) caches are smallest and quickest to
access, while the L2 caches are larger and slower, with the L3 cache, or LLC,
being the largest and slowest of them all. The next level of the memory
hierarchy is made up of DRAM and NVM technologies which offer longer
access times than SRAM, but offer improved capacity on the order of tens of
gigabytes. These storage elements most commonly reside off-chip which con-
tributes to the increased access latency and energy consumption due to the
communication overheads. This level of the hierarchy is used for main system
memory. The next two levels of the hierarchy contain storage elements with
access times many orders of magnitude longer than SRAM or even DRAM,
but present long-term storage capabilities due to their non-volatility and ca-
pacity on the order of terabytes. Flash disks utilize floating gate transistors
to store data in a dense, non-volatile manner which offers improved per-
formance when compared with magnetic storage disks [31]. Magnetic disks
include HDDs, which have spinning platters of magnetic material that use
magnetic moments to encode zeros and ones [32]. Due to the spinning disk
and the movement of the head which reads from this disk, HDDs have
dramatically longer access latencies which are also a function of where the
previous access took place.
2.3.1 VIPT Indexing
Virtual memory is an incredibly useful construct in operating system design,
providing benefits such as process isolation, memory protections, a uniform
program view of memory, and many others. While an in-depth discussion
of virtual memory is outside the scope of this thesis, virtual memory plays
an important role in modern systems. At a high level, programs are given
their own virtual address space inside which they store instructions and data.
This virtual address space is not dependent on the physical addresses which
are used to identify where the instructions and data are actually stored in
memory. The connection between a program’s virtual address space and
the physical addresses they are mapped to is maintained by the OS in page
tables. When a program executes, it uses virtual addresses to access memory
which presents a challenge to system designers since these virtual addresses
must be translated to physical addresses eventually.
For a multitude of reasons, including to improve the performance of the
translation process as well as to reduce the hardware complexity of the L2
and L3 caches, modern L1 data caches are virtually indexed and physically
tagged (VIPT). This means that during a cache access, the
virtual address is used to quickly access the data stored in the arrays of the
cache. In parallel with the array access, the virtual address is translated
to its corresponding physical address by accessing the translation lookaside
buffer, a structure whose sole purpose is to cache virtual-to-physical address
mappings. When the array access inside the cache and the translation of the
virtual address have completed, the cache determines whether the request is
a hit or miss by using the physical address to avoid aliasing problems. On a
hit, the cache returns the data to the processor core; on a miss, it fetches the
missing cache line. Importantly, this translation process
taking place in the L1 cache has useful implications. Any misses in the L1
cache will generate a cache line request that is sent to the next level of the
hierarchy. The benefit here is that all requests made by the L1 cache to the
L2 cache, and correspondingly from the L2 cache to the L3 cache, will deal
entirely with physical addresses. Thus, by translating the virtual address
to its corresponding physical address in the L1 cache, the costly translation
hardware is not needed in the L2 and L3 caches. When misses take place in
the L3 cache, the request sent to the memory controller will deal exclusively
with a physical address. This will be crucial for the design of the PIL.
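As a worked illustration of why the array access can begin before translation finishes, the following Python sketch checks whether the set index of a cache is taken entirely from page-offset bits, which are identical in the virtual and physical address. The cache and page parameters (32 KiB, 8-way, 64-byte lines, 4 KiB pages) and the function names are assumptions chosen for the example, not values from this thesis.

```python
def vipt_index_fits_in_page_offset(cache_bytes, ways, line_bytes, page_bytes):
    """Return True if line-offset bits plus index bits fit in the page offset.

    All sizes are assumed to be powers of two.
    """
    sets = cache_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1        # log2(line size)
    index_bits = sets.bit_length() - 1               # log2(number of sets)
    page_offset_bits = page_bytes.bit_length() - 1   # log2(page size)
    return offset_bits + index_bits <= page_offset_bits


def set_index(addr, line_bytes, sets):
    """Set selected by an address: the bits just above the line offset."""
    return (addr // line_bytes) % sets


# 32 KiB, 8-way, 64 B lines, 4 KiB pages: 64 sets, 6 + 6 = 12 bits.
fits = vipt_index_fits_in_page_offset(32 * 1024, 8, 64, 4096)
# Doubling the capacity at the same associativity needs 13 bits.
too_big = vipt_index_fits_in_page_offset(64 * 1024, 8, 64, 4096)

# A virtual address and a physical address that share the same page offset.
virtual_addr = 0x7F001234
physical_addr = 0x1A2B3234
same_set = set_index(virtual_addr, 64, 64) == set_index(physical_addr, 64, 64)
```

With the first configuration the virtual and physical addresses select the same set, so indexing with the virtual address and comparing physical tags gives the same result as a fully physical lookup.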
2.4 ACID Properties
In the context of database systems, transactions are the basic unit of work
performed upon a database [33]. Transactions are used to perform operations
on a database such as adding a new user to a bank’s list of accounts, or
changing an account’s balance during a credit or debit. Given the presence
of unexpected system crashes and power failures, transactions require certain
properties which help guarantee the database will always be in a valid state.
For example, a transaction may credit account A and then debit account B
by the same amount. If the system were to crash after the first account is
credited, but before the second account is debited, then the database will
not be left in a valid state.
The properties required by database systems to ensure a valid state is
maintained are known as atomicity, consistency, isolation, and durability, or
ACID properties for short. Atomicity refers to the desired behavior that a
transaction should be all-or-nothing, meaning that the transaction will either
complete its update to the database (commit), or it will fail and the database
will remain unchanged (abort). Considering the example above, atomicity
will guarantee that account A is not credited unless account B is also debited.
Consistency describes the behavior that transactions on the database should
always take the database from one valid state to another valid state. In the
context of the example, consistency will guarantee that the credit to account
A and the debit from account B will not break future transactions on the
database. The isolation property describes behavior for concurrent changes
to the database. If two transactions are executed simultaneously on the
database, then the final outcome should be the same as if the transactions
were executed one after the other. Considering the ongoing example, isolation
will ensure that two simultaneous transactions which credit account A will
result in account A having been credited twice, and not only by one of the two
transactions. Finally, durability states that if a transaction has committed
(in an atomic, consistent, and isolated manner), then the change to the
database should persist even if the system crashes or a power failure occurs.
Returning to the example, if the transaction successfully credits account A
and debits account B, then a power failure should not return the database
to the previous state where account A had not been credited and account B
had not been debited.
By providing ACID guarantees to transactions executing on a database
system, one can be sure that all possible scenarios of execution, including
system crashes and power failures, will result in a database which is valid
and correct given the history of transactions which have been previously
committed.
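The credit/debit example can be made concrete with a minimal Python sketch. The transfer helper and the dictionary-based "database" are illustrative assumptions, not part of any real database interface. The sketch shows only the all-or-nothing (atomicity) and valid-state (consistency) behavior described above; it does not model isolation or durability.

```python
def transfer(accounts, src, dst, amount):
    """Debit src and credit dst as one all-or-nothing unit.

    Works on a private copy and returns it only if every step succeeds,
    so a failure part-way through leaves the caller's state untouched.
    """
    staged = dict(accounts)          # updates are staged out of place
    staged[dst] += amount            # credit
    if staged[src] < amount:         # consistency rule: no negative balance
        raise ValueError("insufficient funds")   # abort
    staged[src] -= amount            # debit
    return staged                    # commit: caller swaps in the new state


accounts = {"A": 100, "B": 50}

# Committed transaction: both the credit and the debit take effect.
accounts = transfer(accounts, "B", "A", 30)

# Aborted transaction: the credit to A was staged but never becomes visible.
try:
    accounts = transfer(accounts, "B", "A", 500)
except ValueError:
    pass
```

After the aborted transfer, the balances are those left by the last committed transaction, and the total across both accounts is unchanged.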
2.5 Crash-consistency Techniques
Using NVM as a replacement for DRAM in a computer system promises
increased memory capacity and reduced energy consumption. However, the
change from volatile to non-volatile system memory has dramatic implica-
tions for system designers. Changes made to memory must be done in a
manner which ensures that a system crash or power failure does not leave
the computer in an unusable state given that the memory contents prior to
the crash will still be present once the machine is turned back on. In this
sense, the memory contents of a system using NVM can be thought of as a
database. In order to effectively use NVM in systems, it is therefore crucial
that changes made to memory are performed with the same guarantees as
transactions on a database.
Providing ACID guarantees to systems running with NVM is not a novel
concept. Many prior works have been proposed which aim to provide atomic-
ity, consistency, and durability guarantees to applications whose data resides
in NVM. The most prominent of these previous techniques are logging ap-
proaches [34, 15, 17, 35, 36, 37, 21, 12, 16, 18], shadow paging approaches
[8, 10], and log-structured NVM [14]. A further explanation of these ap-
proaches, including their strengths and weaknesses, is provided below.







Figure 2.3: Logging Crash-consistency
Write-ahead logging (WAL) is widely used for NVM. Its core idea is to preserve
a data copy before applying a change to the original data (see Figure
2.3). The logging operations result in double the write traffic and present
wear issues on NVM systems [14, 10]. To reduce the logging overhead,
hardware-assisted logging schemes [16, 15, 18, 36] have been proposed to
leverage log packing, log coalescing, and bulk persistence to increase spatial
efficiency during log persistence. However, these optimizations only partially
mitigate the extra write traffic caused by logging.
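The doubled write traffic can be illustrated with a toy undo-logging model in Python. The UndoLoggedMemory class and its counters are hypothetical and exist only for this sketch; it is not the implementation of any of the cited systems, and it ignores cache flushes and fences entirely.

```python
class UndoLoggedMemory:
    """Toy NVM model: every update first persists the old value to a log."""

    def __init__(self, size):
        self.data = [0] * size
        self.log = []            # persisted (address, old_value) entries
        self.nvm_writes = 0      # log and data writes issued by write()

    def write(self, addr, value):
        self.log.append((addr, self.data[addr]))  # 1) persist the old copy
        self.nvm_writes += 1
        self.data[addr] = value                   # 2) update in place
        self.nvm_writes += 1

    def commit(self):
        self.log.clear()         # the old copies are no longer needed

    def recover(self):
        # Roll back an unfinished transaction, newest entry first.
        for addr, old in reversed(self.log):
            self.data[addr] = old
        self.log.clear()


mem = UndoLoggedMemory(4)
mem.write(0, 7)
mem.commit()
mem.write(0, 9)      # the crash happens before this transaction commits
mem.write(1, 5)
mem.recover()
```

Three logical updates cost six writes in this model, and recovery restores the state left by the only committed transaction.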
Beyond the write traffic issue, logging also incurs lengthy critical-path
latency [12, 5]. This issue is especially serious for undo logging, since it re-
quires a strict persist ordering between log entries and data writes. ATOM
[15] is a hardware undo logging solution to enforce the persistence ordering
in memory controllers, thus eliminating the logging from the critical path of
write operations. Compared with undo logging, redo logging provides more
flexibility as it allows asynchronous log truncation and data checkpointing
[38, 36, 37, 21, 16], which contributes to a shorter critical-path latency. How-
ever, it is noted that redo logging still generates double write traffic eventu-
ally.
Decoupling logging from data updates with asynchronous in-place updates
is another way to improve transaction performance, as proposed in SoftWrAP
[37] and DudeTM [12]. These systems decouple the execution of durable transactions
and logging into asynchronous operations; therefore, the memory barrier
operations can be reduced. However, an address mapping is required for
tracking the updated version of data, and software-based address translation
inevitably introduces additional overhead to the critical-path latency. More-
over, this approach cannot reduce write traffic to NVM when compared with
the conventional logging approaches.
Despite each of these logging approaches applying various optimizations
to improve transaction performance, logging is still expensive, for a simple
reason: these approaches are restricted by their intrinsic additional log write
for each data update, regardless of whether the update takes place in the
foreground or background. Kiln [6] removes the log writes with a non-volatile







Figure 2.4: Shadow Paging Crash-consistency
Shadow paging can eliminate expensive cache flushes and memory fence
instructions, but its write amplification is still a severe issue that impedes
application performance improvements. With shadow paging, an entire page
has to be copied, even though only a small portion of data is modified (see
Figure 2.4). Recent work [10] proposed a fine-grained copy-on-write tech-
nique to reduce the write amplification overhead. In this approach, one
virtual cache line is mapped to two physical cache lines with TLB modifica-
tions, and it achieves data atomicity through cache-line level copy-on-write.
To enforce data durability, however, it must persist the data to NVM in-
stantly when a process performs data modifications to a cache line. This
eager data persistence may sacrifice the performance benefits obtained from









Figure 2.5: Log-structured NVM Crash-consistency
Inspired by the log-structured file system [39], Hu et al. [14] proposed a
software-based log-structured NVM called LSNVMM in which all the writes
are appended into a log space. Such an approach alleviates the double writes
caused by the undo/redo logging. However, it incurs significant software
overhead for read operations due to the complicated data indexing (see Fig-
ure 2.5) and garbage collection. Although the index can be cached in DRAM,
it still requires multiple memory accesses to obtain the data location. For
instance, LSNVMM requires O(log N) memory accesses for each data read
due to the address look-up in an index tree, where N is the number of log en-
tries. This significantly increases the read latency of NVM. Further, garbage
collection for the entire NVM space is costly.




Figure 2.6: Hardware Out-of-place Update Crash-consistency
This work proposes a new approach: hardware out-of-place update, in which
the memory controller writes the new data to a different memory location
in a log-structured manner, and asynchronously applies the data update to
its home address periodically. This approach can be seen in Figure 2.6.
It alleviates the extra write traffic caused by the logging approaches, and
avoids the data copying in the critical path as discussed in shadow paging.
Unlike the log-structured NVM, the proposed approach maintains a small
physical-to-physical address mapping table in the memory controller for ad-
dress translation, and adaptively writes the data updates into their home
addresses. Therefore, it incurs minimal indirection and GC overhead.
The proposed hardware out-of-place update ensures atomic data durability
by default, as it always maintains the old data version in NVM while
persisting the updates in new memory locations in a log-structured manner.
It also does not assume any persistence ordering for store operations, which




3.1 Overview and Design Goals
The architectural overview of PIL’s out-of-place update scheme is presented
in Figure 3.1. To perform out-of-place updates efficiently, the design was
made with three goals in mind:
• To guarantee crash-consistency while minimizing critical path latency
and write traffic to NVM.
• To make minimal changes to the hardware, thus minimizing the cost
of PIL while providing software-transparent functionality.
• To develop a scalable data recovery scheme which leverages parallelism
to quickly recover from program failures and system crashes.
To achieve these goals, the most straightforward approach is to persist
cache lines which hold updated data along with necessary metadata out-of-
place in NVM. However, this solution has two drawbacks. First, persisting
data eagerly will negatively affect system performance because non-volatile
memory technologies usually have a high write latency [1]. Second, persisting
the data and metadata separately at a cache line granularity introduces extra
write traffic. To address these challenges, this work proposes an optimized
transparent out-of-place update scheme for data persistence along with a fast
data recovery procedure.
3.2 Transparent Out-of-place Update
Similar to previous work, PIL provides two transaction-like interfaces (i.e.,
Tx_begin and Tx_end) which demarcate the beginning and

















Figure 3.1: Hardware Transparent Out-of-place Update with PIL
end of a transaction which requires data atomic durability. Note that PIL
requires applications to provide their own concurrency controls such as a
locking protocol to resolve inter-transaction data dependencies [40].
During transaction execution, data is brought into the cache hierarchy
with load and store operations, described in Section 3.5, which requires first
accessing the mapping table to find the most recent version of the desired
cache line. The mapping table is described in Section 3.3. After bringing
the data into the cache hierarchy, updates made in the L1 cache are buffered
inside the PIL in the OOP data buffer. Each entry of this buffer can hold
multiple data updates as well as important metadata such as the home re-
gion addresses of the updates. This buffer is used by the PIL to apply the
persistence optimizations described in Section 3.6. Further, each core has
a dedicated OOP buffer entry to avoid access contention during concurrent
transaction execution.
Once the OOP data buffer fills with updated data and metadata, or the
processor executes the Tx_end instruction, the PIL flushes the updated data
and its metadata to the OOP region. These writes are performed at the
granularity of a memory slice, the structure of which is described in Section
3.7. Leveraging out-of-place writes into the OOP region, the PIL provides
crash-safety by ensuring transactions are made persistent in the OOP region
before any changes are made to the home region.
As more transactions execute, the OOP region will fill with updated data
and metadata. The PIL performs periodic garbage collection (GC) to migrate
the transaction data by scanning the OOP region and writing the most recent
versions of each address back into the home region. This process uses data
combination to minimize write traffic to NVM and improve GC performance.
The GC algorithm is presented in Section 3.8.
Finally, in the event of a power failure or system crash, the PIL leverages
thread parallelism to scan the OOP region and quickly recover the system to
a consistent state. The recovery process is presented in Section 3.9.
3.3 Address Mapping Table
To provide crash-safety, the PIL must ensure that all updates from a trans-
action have been written to NVM before any of the cache lines modified by
the transaction can be evicted and written to the home region.
To guarantee this ordering in the presence of uncontrolled cache evictions,
the PIL writes cache lines which are modified by transactions into the OOP
region instead of the home region. To track these cache lines for future
accesses, the PIL uses a small hash table kept in the memory controller to
map from home region addresses to the OOP region addresses containing the
cache lines (physical-to-physical address mapping). Performing this address
translation transparently in hardware reduces overheads when compared to
software-based mapping approaches [14].
As discussed in detail in Section 3.5, each cache line in the system has a persistent
bit. Whenever a cache line is evicted from the LLC with its persistent
bit set, the cache line is written into the OOP region. The PIL must track
this cache line’s location by adding an entry to the mapping table, where
each entry contains the home region address of the cache line as well as the
OOP region address of the cache line. This table has a size of 2kB per core
based upon the maximum expected number of evicted cache lines from a
single transaction [22].
The entries in the mapping table are removed in two scenarios. First,
when the LLC misses on a cache line, the PIL will check this address in the
mapping table to determine if it must read the data from the home region
or OOP region. If the address is present in the table, the data will be read
from the OOP region. The mapping table entry may now be removed since
the most recent version is located within the cache hierarchy and coherence
mechanisms will ensure this data is read by any other requesting cores. The
second scenario is during garbage collection when the most recent version of
the data is migrated from the OOP region to the home region, detailed in
Section 3.8.
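The life cycle of a mapping table entry can be sketched in Python as follows. The MappingTable class, its method names, and the use of a dictionary are assumptions made for illustration; the actual structure is a small hardware hash table in the memory controller, and the addresses in the usage lines are arbitrary.

```python
class MappingTable:
    """Home-address -> OOP-address map, as a plain dict (hypothetical model)."""

    def __init__(self):
        self.entries = {}

    def on_llc_eviction(self, home_addr, oop_addr, persistent_bit):
        # Only lines modified inside a transaction are redirected.
        if persistent_bit:
            self.entries[home_addr] = oop_addr

    def on_llc_miss(self, home_addr):
        """Return (region, address) to read from; drop the entry on a hit."""
        oop_addr = self.entries.pop(home_addr, None)
        if oop_addr is not None:
            return ("oop", oop_addr)
        return ("home", home_addr)

    def on_gc_migration(self, home_addr):
        # The home region now holds the latest version.
        self.entries.pop(home_addr, None)


mt = MappingTable()
mt.on_llc_eviction(0x1000, 0x9000, persistent_bit=True)
mt.on_llc_eviction(0x2000, 0x9040, persistent_bit=False)
first = mt.on_llc_miss(0x1000)    # served from the OOP region, entry removed
second = mt.on_llc_miss(0x1000)   # no entry left, served from the home region
third = mt.on_llc_miss(0x2000)    # never redirected
```

Removing the entry on the first miss is safe in this model for the reason given above: once the line is back in the cache hierarchy, coherence serves later requests.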
3.3.1 Cache Eviction Buffering
Along with the mapping table, the PIL adds a cache line eviction buffer
which is just over 17kB per core. The size was chosen to match the number
of per-core entries in the mapping table. This buffer holds cache lines (and
their home region addresses) which are written back to main memory during
garbage collection operations. As a result, this buffer is crucial in providing
crash-consistency since the garbage collection process will invalidate entries
in the mapping table as mentioned above.
If a mapping table entry is written during garbage collection, a previous
version of the cache line might be migrated to the home region thus removing
the mapping table entry. By buffering evicted cache lines, the PIL ensures
that if a mapping table entry is removed by the garbage collector, a new
mapping to the most recent version of that cache line can still be maintained.
Enforcing this ordering ensures that misses in the LLC will not read stale
data, while also periodically reducing the number of entries in the mapping
table. This enables the PIL to have a small mapping table, thus minimizing
indirection overheads.
3.4 TxBegin and TxEnd Instructions
As mentioned in Section 3.2, the PIL provides two transaction-like interfaces
(i.e., Tx_begin and Tx_end) which demarcate the beginning and end of a
transaction which requires data atomic durability.
The PIL requires one bit of additional state be added to each processor
core. This bit represents the transaction state of the core. Importantly, the
Tx_begin instruction sets the transaction state bit of the processor core which
executes the instruction. The Tx_end instruction clears the transaction state
bit of the processor core and acts like a barrier to ensure durability of the
committed transaction as described below.
3.4.1 Interactions with the Reorder Buffer
The Tx_begin instruction updates the transaction status bit inside the processor
core which executes the instruction. Once this state change has been
made, a Tx_begin instruction may leave the reorder buffer.
The Tx_end instruction is more complex. It clears
the transaction status bit when it enters the reorder buffer. In addition,
the Tx_end instruction must signal to the memory controller that
the transaction is ending once it reaches the head of the reorder buffer. To
guarantee data durability, a Tx_end instruction will wait at the head of the
reorder buffer until an acknowledgment signal is received from the memory
controller that the OOP data buffer has been flushed to NVM.
3.4.2 Interactions with the Load/Store Queue
Load operations which take place during transactions require no interaction
with the load/store queue beyond what already exists in processors today.
This is because the PIL leverages the cache coherence mechanisms of
modern cache hierarchies. A load during a transaction will simply be serviced
as usual.
Store operations have a slight modification since they update data which
may or may not require ACID guarantees. When a store instruction is added
to the load/store queue, the current transaction status bit is also added
to the queue entry. When the store is serviced at the head of the queue,
this transaction status bit is sent to the L1 cache and is used as described
in Section 3.5. The space and complexity overhead of this modification is
negligible.
3.5 Load and Store Operations in the PIL
In this section, we demonstrate how the PIL handles load and store opera-
tions during transaction execution, as shown in Figure 3.2. In addition to
the processor transaction state bit, the PIL adds one bit per cache line in the
cache hierarchy. This bit is used to tell if a cache line has been modified by
a transaction with persistence guarantees. This overhead is quite reasonable






Figure 3.2: The Store Process in PIL
3.5.1 Load Operations
A load instruction is added to the load queue while it awaits address gener-
ation and disambiguation. Once this load is sent to the L1 cache (step 1), a
compulsory miss will most likely occur and the cache controller will generate
a request that is sent to the lower level caches (step 2). If there is a cache
line miss in the cache hierarchy (step 3), PIL will use this home address to
access the address mapping table. In the event of a mapping table hit, the
translation will be made and the requested data will be read from the OOP
region (step 4). In the event of a mapping table miss, the cache line will be
fetched from the home region using the home address directly (step 5). This
cache line will then return through the memory hierarchy until the original
load request becomes a hit in the L1 cache and the requested data is returned
to the core. During future loads to this data, hits in the cache hierarchy will
incur no more communication overhead than currently exists.
3.5.2 Store Operations
Figure 3.2 also depicts the store operation in PIL. If the store request (Step
1) has a cache line miss in the L1 cache, the cache coherence mechanisms
will search for the cache line in the cache hierarchy (Step 2). Eventually, the
latest version of the corresponding cache line will be retrieved from another
cache or read from non-volatile memory after checking the mapping table.
Once the cache line is loaded into the L1 cache, it will be modified and the
persistent bit in the cache line will be set. This persistent bit is kept with the
modified cache line as it moves through the cache hierarchy and is used by
the mapping table and garbage collection algorithm as described in Sections
3.3 and 3.8 respectively.
Because the vast majority of L1 caches are virtually indexed and physically
tagged (VIPT), the TLB will perform the virtual-to-physical page number
translation and then return the physical page number to the L1 cache. As a
result, in parallel with the cache line modification, the cache controller will
communicate with the PIL to send the modified word of data and its home
address (Step 3). This work models the total latency of this process as 20
cycles, which is comparable to the communication time between the L2 cache
and the memory controller when sending an entire cache line.
The PIL then stores the updated data inside the OOP data buffer accord-
ing to the processor core number. The metadata content in the OOP data
buffer will also be updated. In particular, a global transaction ID (GID)
or commit ID (TxID) will be assigned by the PIL if this is the first or last
store operation within the transaction, respectively. Other necessary metadata like the
home address and slice count are also stored in the OOP data buffer. If a
transaction has filled the buffer, the PIL will allocate a free memory slice
in the OOP block using a bitmap. It will then persist the memory slice in
non-volatile memory (Step 4). At the end of a transaction, the processor
executes the Tx_end instruction and the PIL must ensure all updated data
in the OOP data buffer is flushed to the OOP region.
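The buffering behavior described above (flush when the buffer fills, and flush the remainder at the end of a transaction) can be sketched in Python. The OOPBufferEntry class and its method names are hypothetical, and the sketch deliberately omits the GID, TxID, and slice-count metadata and the bitmap-based slice allocation.

```python
SLICE_WORDS = 8  # a memory slice holds eight 8-byte words (Section 3.7.1)


class OOPBufferEntry:
    """Per-core buffer entry that packs word updates into memory slices."""

    def __init__(self):
        self.words = []          # (home_addr, value) pairs not yet persisted
        self.flushed = []        # slices written to the (modelled) OOP region

    def store(self, home_addr, value):
        self.words.append((home_addr, value))
        if len(self.words) == SLICE_WORDS:   # buffer full: persist a slice
            self._flush()

    def tx_end(self):
        if self.words:                       # persist the last, partial slice
            self._flush()

    def _flush(self):
        self.flushed.append(list(self.words))
        self.words.clear()


buffers = [OOPBufferEntry() for _ in range(8)]   # one entry per core
core = 3
for i in range(10):                              # ten word-sized stores
    buffers[core].store(0x1000 + 8 * i, i)
buffers[core].tx_end()
```

In this model, ten stores from one core produce one full slice and one partial slice, and the entries of the other cores are never touched.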
3.6 Persistence Optimizations
To help improve the overall performance of our out-of-place update solution,
the PIL minimizes the amount of data written to NVM by packing data and












Figure 3.3: Data Packing in PIL
3.6.1 Data Packing
Specifically, the PIL tracks updates to data at a small granularity (i.e., word-
based) instead of a cache line granularity during data persistence. Based on
this, the PIL applies data packing to reduce the write traffic during out-of-
place updates. As shown in Figure 3.3, data residing in several independent
cache lines are compacted into one single cache line. Consequently, when eight
single-word updates fall in eight different cache lines, packing writes one cache line of
data instead of eight, reducing data write traffic by up to 87.5% (1 - 1/8). Similarly,
the PIL also performs metadata packing to further reduce write traffic.
Figure 3.3 shows that metadata which are associated with eight data updates
are also packed into a single cache line. Other approaches which track data
at a cache line granularity will incur additional write traffic even when a
single word is modified in a cache line.
Eight pieces of data and their metadata are packed into a single unit, called
a memory slice, which is shown in Figure 3.6 and described in Section 3.7.
The total size of the OOP data buffer is 1KB under the eight core environ-
ment, which consumes much less chip space compared with prior hardware
solutions [6]. Currently, PIL uses a 40-bit address offset preserved in the
metadata to address the home region (1 TB). Since future NVM systems
could have a larger capacity compared with DRAM systems, the metadata
size would also increase. To solve this issue, the PIL only needs to reduce the
number of packed data items (N). For example, if the home region size is 1
PB (2^50 bytes), thus increasing the size of the metadata, the PIL can instead pack
seven units of data (56 bytes) and their metadata in a memory slice which
still consists of two cache lines.
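The figures quoted in this section can be checked with a few lines of Python arithmetic. The 64-byte cache line follows from a 128-byte slice spanning two cache lines; reading the 1KB buffer as one 128-byte slice per core is an inference from the text rather than a stated design detail.

```python
WORD_BYTES = 8          # granularity at which updates are tracked
CACHE_LINE_BYTES = 64
SLICE_BYTES = 128       # two cache lines per memory slice
CORES = 8

words_per_line = CACHE_LINE_BYTES // WORD_BYTES          # 8

# Eight single-word updates: eight full cache lines without packing,
# one packed cache line of data with packing.
unpacked_bytes = words_per_line * CACHE_LINE_BYTES       # 512
packed_bytes = CACHE_LINE_BYTES                          # 64
data_reduction = 1 - packed_bytes / unpacked_bytes       # 0.875

oop_buffer_bytes = CORES * SLICE_BYTES                   # 1024 bytes
home_region_bytes = 2 ** 40                              # 40-bit offset, 1 TB
```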
3.6.2 Persistence Ordering
PIL maintains the persistence ordering in the memory controller, which does
not require programmers to explicitly execute cache-line flushes and mem-
ory barriers. The transaction execution timeline of different approaches is
depicted in Figure 3.4. Undo logging requires strict ordering for each data up-
date, incurring a substantial number of persistence operations during trans-
action execution. Redo logging mitigates this issue and only requires two
flush operations per transaction, one for the redo logs and another for the
data updates. Both schemes have to perform extra writes to NVM. The
optimized shadow paging scheme can avoid additional data copy overheads,
but it incurs frequent cache-line flush operations to persist data. PIL uses
the OOP data buffer to store the data updated by a transaction, and flushes
the data in units of a memory slice. When executing the Tx_end instruction,
the PIL persists the last memory slice to the OOP region.
Figure 3.4: Transaction Timeline of Crash-Safe Approaches
3.7 OOP Region Organization
PIL organizes the OOP region in a log-structured manner to minimize frag-
mentation and allow for sequential writes which offer high throughput. The
OOP region is divided into multiple OOP blocks with a fixed size of 2MB.
The OOP region has a block index table which stores the index number and
start address of each OOP block. During application execution, this block
index table is cached in the memory controller.
The layout of an OOP block is presented in Figure 3.5. Each OOP block
has an OOP header storing the block metadata. The header consists of (1)






Figure 3.5: OOP Region Organization
Figure 3.6: Data Memory Slice Layout (fields: Data 0 ... Data 7; metadata: Home Addrs (320 bits), Next Slice, TxID, GID, Cnt, Flag, Pad)
and (3) a 2-bit flag denoting the block state (BLK_FULL, BLK_GC, BLK_UNUSED,
BLK_INUSE). The remainder of an OOP block is composed of memory slices
with a fixed size of 128 bytes. The fixed-size memory slices place an upper
bound on the worst-case fragmentation which can occur within an OOP
block, and PIL can easily manage OOP blocks with a memory slice bitmap.
Further, the 128-byte size of a memory slice means the PIL is capable of
flushing the memory slices to the OOP region using two consecutive memory
bursts [41].
3.7.1 Memory Slices
Memory slices can be classified into two categories: data memory slices and
address memory slices. As shown in Figure 3.5, a large transaction can be
composed of multiple data memory slices which are linked together. The start
address of these linked memory slices is stored in an address memory slice.
Address memory slices allow GC to quickly identify committed transactions
in the OOP region.
Figure 3.6 shows the internal layout of a data memory slice. With a total
size of 128 bytes, each slice can hold eight 8-byte words of data which have
been modified during a transaction, as well as metadata which is 64 bytes in
length. Each metadata block contains the reverse mappings (original physical
addresses) of modified data to be used during the GC and recovery processes.
It also contains an OOP block number, an address offset (24 bits) to find the
next data memory slice, a global ID (32 bits) assigned by the PIL at the start
of a transaction, a commit ID (32 bits) assigned at transaction commit, a
count of the updated words (3 bits) in that slice, and several bits used to
identify the state of each slice during GC and recovery.
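As a sanity check on the metadata budget, the field widths stated above can be summed in Python. The dictionary keys are names chosen for this sketch; the 40-bit home address width comes from Section 3.6.1, and the widths of the OOP block number and state bits are not given in the text, so they are left as the remainder.

```python
METADATA_BITS = 64 * 8            # 64-byte metadata block

fields = {
    "home_addrs": 8 * 40,         # eight 40-bit home-region offsets
    "next_slice_offset": 24,
    "global_id": 32,
    "commit_id": 32,
    "word_count": 3,
}

used_bits = sum(fields.values())              # 411
remaining_bits = METADATA_BITS - used_bits    # 101
```

Under these assumptions, 101 bits remain for the OOP block number, the slice state bits, and padding.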
3.7.2 Wear Leveling
The PIL can achieve uniform aging of all cache lines within an OOP block.
In particular, the PIL persists transaction data in the unit of memory slices
which are composed of two cache lines. The PIL manages the available
memory slices in the OOP block using a bitmap and allocates the memory
slices in a round-robin manner. Consequently, all cache lines inside an OOP
block can achieve a uniform wear rate.
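One way to realize round-robin allocation over a bitmap is sketched below in Python. The SliceAllocator class is a hypothetical software model of the hardware bitmap; the point of the sketch is that a freed slice is not reused until the allocation cursor wraps around, which spreads writes across the block.

```python
class SliceAllocator:
    """Round-robin allocation of fixed-size slices tracked by a bitmap."""

    def __init__(self, num_slices):
        self.in_use = [False] * num_slices   # the memory slice bitmap
        self.next = 0                        # where the next search starts

    def allocate(self):
        n = len(self.in_use)
        for step in range(n):
            idx = (self.next + step) % n
            if not self.in_use[idx]:
                self.in_use[idx] = True
                self.next = (idx + 1) % n    # continue after this slice
                return idx
        return None                          # block is full

    def free(self, idx):
        self.in_use[idx] = False


alloc = SliceAllocator(4)
order = [alloc.allocate() for _ in range(3)]   # slices 0, 1, 2
alloc.free(0)
order.append(alloc.allocate())                 # slice 3, not the freed slice 0
order.append(alloc.allocate())                 # wraps around to slice 0
```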
Moreover, the PIL should also guarantee the wear leveling of the whole
system. To solve this problem, the PIL maintains a wear count for each
OOP block and monitors their wear rates. After the garbage collector cleans
the OOP blocks, the PIL will replace OOP blocks whose wear rate is beyond
the predetermined threshold. The PIL communicates with the operating
system kernel to allocate the memory space used for OOP blocks. This
communication is done using two special registers (region_reg and request_reg).
Once the PIL requires a new memory space, it sets the request_reg. The
operating system periodically reads this special register and, when the request
bit is set, will allocate space for an OOP block and record the address in the
region_reg, clearing the request_reg. The PIL sees the change to the request_reg
and can use this new memory space as an OOP block by recording its start
address in the block index table.
Algorithm 1 Garbage Collection
1: Definitions: Home region: Mem_home; OOP region: Mem_oop; OOP block: Blk_oop; Memory slice
bitmap: Bitmap; Mapping table: MT;
2:
3: for all Blk_oop that are BLK_FULL in Mem_oop do
4:   Read all address memory slices S_addr.
5:   Create a hash map H to hold the data during GC.
6:   Start from the latest start address Addr in S_addr.
7:   for each start address Addr in reverse order do
8:     Read all slices of the committed Tx from Mem_oop.
9:     for all memory slices in the Tx do
10:      Read the home addresses Addr_home and Data.
11:      Check if Addr_home hits in H.
12:      if hash entry elem exists then
13:        continue.
14:      else





20:  for all data in H do
21:    Write the data to addr in Mem_home.
22:    if addr is in MT then





28:  Update the memory slice Bitmap.
29:  Update the header in OOP blocks.
3.8 Garbage Collection
The PIL performs background garbage collection which migrates modified
data stored within the OOP region back to their original locations in the
home region. The garbage collection algorithm must address two challenges.
First, as all updated data are preserved in the OOP region, migrating these
old data versions sequentially would cause significant memory write traffic.
Thus, PIL scans the committed transactions in reverse time order and applies
data combination to minimize the data migration overheads. Second, the
garbage collection algorithm must also be crash-safe against system failures.
PIL performs garbage collection operations periodically. The time thresh-
old is adaptive and its performance impact on the system is evaluated in
Chapter 5. Algorithm 1 depicts the background garbage collection workflow.
First, the PIL reads address memory slices which have been committed
in the OOP region (line 4). Then, the PIL creates a hash map H to store the
home addresses and their modified data (line 5). According to the memory
slice start addresses preserved in the address memory slice, the PIL reads
each committed transaction from the OOP block in reverse time order (line
7). For each tuple <home addr, data, TxID> in the committed transaction,
the PIL combines all data with the same address to avoid writing to the
same home location multiple times (lines 9-17). This means a home loca-
tion updated multiple times only requires one record in the hash map which
corresponds to the latest update to this home address.
Once the PIL finishes scanning all committed transactions in the OOP
block, the latest data preserved in the hash map will be migrated back to
the home region (line 21). When a word is migrated back to the home region,
the corresponding cache line address is checked in the mapping table (line
22). If the address hits in the mapping table, the entry is removed since the
most recent version of the data has been migrated back to the home region
(line 23). After restoring all data back to their home locations, PIL clears the
corresponding bits in the memory slice bitmap (line 28). Finally, PIL updates
the OOP block header by resetting the block state to BLK_UNUSED, and
clearing its entry in the block index table (line 29).
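The reverse-order scan with data combination can be sketched in Python as follows. This is a simplified software model, not Algorithm 1 itself: the function name, the list-of-transactions input, and the dictionary models of the home region and mapping table are assumptions, a 64-byte cache line is assumed for the mapping table lookup, and the slice bitmap and block header updates are omitted.

```python
LINE_BYTES = 64  # assumed cache line size


def garbage_collect(committed_txs, home, mapping_table):
    """Migrate the newest committed value of each home address.

    committed_txs: list of transactions in commit order; each transaction
                   is a list of (home_addr, value) updates in program order.
    home:          dict modelling the home region.
    mapping_table: dict of cache-line home address -> OOP address.
    Returns the number of writes issued to the home region.
    """
    latest = {}                                  # hash map H
    for tx in reversed(committed_txs):           # newest transaction first
        for home_addr, value in reversed(tx):    # newest update first
            if home_addr in latest:              # a newer value is recorded
                continue
            latest[home_addr] = value
    for home_addr, value in latest.items():
        home[home_addr] = value                  # one write per address
        line_addr = home_addr & ~(LINE_BYTES - 1)
        mapping_table.pop(line_addr, None)       # entry is now stale
    return len(latest)


home = {0x1000: 0, 0x1008: 0}
mapping_table = {0x1000: 0x9000}
txs = [[(0x1000, 1), (0x1008, 2)], [(0x1000, 3)], [(0x1000, 4), (0x1000, 5)]]
writes = garbage_collect(txs, home, mapping_table)
```

Five logged updates collapse into two home-region writes in this example, because only the newest value for each address is migrated.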
3.8.1 Crash Safety
Garbage collection in the PIL is crash safe. The insight is that the OOP
region always remains in a consistent state during garbage collection. There-
fore, if a system crash happens while reading the memory slices (lines 4-10),
writing the hash table (line 15), or during data migration (line 21), PIL can
simply replay all committed transactions in the OOP region during data
recovery (Section 3.9), thus recovering the system to a consistent state.
3.9 Data Recovery
If a system crash occurs, the PIL will utilize the out-of-place updated data
which are preserved in the OOP region to recover the system to a consistent
state. Notice that the PIL requires the operating system kernel to rebuild
the memory regions of the application when the system restarts. In particu-
lar, the operating system kernel should guarantee the persistence of process
page tables during application execution [42]. After the PIL restores the
preserved data back to their home locations, the application can access the latest
data as if no crash had happened. The remainder of this section describes PIL’s
fast data recovery procedure which leverages thread parallelism to accelerate
data recovery.
During the recovery process, the OS kernel will create multiple recovery
threads. The recovery thread reads the block index table to locate each OOP
block. Each recovery thread will kmap the memory of these OOP blocks into
its address space. Then, all committed address memory slices are read from
the OOP region to get the start address of the memory slice of the committed
transactions in the OOP region. Once the PIL collects these addresses, it
sorts them in the committed order and distributes these addresses to multiple
recovery threads in an interleaved fashion.
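The interleaved distribution, together with the commit-ID-based merging that this section goes on to describe, can be sketched in Python. The recover function and its input format are hypothetical, the "threads" are modelled by a sequential loop, and each share is scanned in forward order with an explicit commit ID comparison rather than in reverse order; the sketch only illustrates that the newest committed value per home address survives regardless of how transactions are partitioned.

```python
def recover(committed_txs, num_threads):
    """committed_txs: list of (commit_id, [(home_addr, value), ...]).

    Returns the home_addr -> value map that would be written back.
    """
    ordered = sorted(committed_txs, key=lambda tx: tx[0])   # commit order
    # Interleaved distribution: worker t gets transactions t, t+n, t+2n, ...
    work = [ordered[t::num_threads] for t in range(num_threads)]

    local_maps = []
    for share in work:                  # each iteration models one thread
        local = {}                      # home_addr -> (commit_id, value)
        for commit_id, updates in share:
            for home_addr, value in updates:
                if home_addr not in local or local[home_addr][0] <= commit_id:
                    local[home_addr] = (commit_id, value)
        local_maps.append(local)

    merged = {}                         # global map built by a single thread
    for local in local_maps:
        for home_addr, (commit_id, value) in local.items():
            if home_addr not in merged or merged[home_addr][0] < commit_id:
                merged[home_addr] = (commit_id, value)
    return {addr: value for addr, (_, value) in merged.items()}


txs = [(2, [(0x10, "b")]), (1, [(0x10, "a"), (0x18, "x")]), (3, [(0x18, "y")])]
result = recover(txs, num_threads=2)
```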
Each recovery thread processes its own working set independently.
Specifically, each thread scans its committed transactions in the OOP region
in reverse order. The thread reads the data memory slices belonging to each
transaction and adds the tuple <home address, commit ID, data> to a local
hash map. During this processing, the thread preserves only the value with
the largest commit ID for each home address. Once all transactions have been
processed by the recovery threads, a single thread aggregates the local
hash maps into a global one, preserving only the latest version for each
home address by checking the commit ID. A thread then splits the global
hash map and leverages the other recovery threads to write the data back to
their home locations in parallel. This is accomplished by reading the cache
line that contains each home address in a thread's share of the hash map.
Once the cache line is present in the cache hierarchy, the latest values for
the home addresses in this cache line are written. After a cache line has
been updated, it is flushed to NVM using cache flush instructions to ensure
durability. Finally,

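The version-selection logic of the recovery procedure can be illustrated with the following Python sketch. It assumes that each committed transaction is available as a `(commit_id, updates)` pair and that a transaction carries at most one value per home address; both the data layout and the function names are hypothetical and chosen only for this example.

```python
def build_local_map(transactions):
    """One recovery thread: for every home address, keep the value that
    carries the largest commit ID within this thread's share."""
    latest = {}
    for commit_id, updates in transactions:
        for home_addr, data in updates:
            if home_addr not in latest or commit_id > latest[home_addr][0]:
                latest[home_addr] = (commit_id, data)
    return latest


def merge_local_maps(local_maps):
    """Aggregate the per-thread maps into one global map that holds only
    the latest version of each home address."""
    merged = {}
    for local in local_maps:
        for home_addr, (commit_id, data) in local.items():
            if home_addr not in merged or commit_id > merged[home_addr][0]:
                merged[home_addr] = (commit_id, data)
    return merged


if __name__ == "__main__":
    thread0 = build_local_map([(1, [(0x100, "a")]), (3, [(0x100, "c")])])
    thread1 = build_local_map([(2, [(0x100, "b")])])
    print(merge_local_maps([thread0, thread1]))
```

Because the comparison uses the commit ID rather than the scan position, the result does not depend on the order in which a thread visits its transactions.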



This chapter details the experimental methodology used to characterize and
analyze the performance of the PIL design. The experimental simulation
setup is first presented, followed by a description of the workloads used to
test the behavior of the implemented system. Finally, the systems against
which the PIL is compared are presented.
4.1 Experimental Setup
The PIL is implemented and evaluated using McSimA+, a Pin-based many-
core simulator [20] in combination with an NVM simulator. The system is
configured to model an out-of-order processor with NVM for system memory,
and a total NVM capacity of 16 GB is used in the experiments. The read and
write latencies of this memory are configured as 50 ns and 150 ns respectively.
The detailed system configuration is presented in Table 4.1.
Table 4.1: Simulation System Configuration
Component            Configuration
Processor            8 x 2.5 GHz, out-of-order, x86
L1 I/D Cache         32 KB, 4-way, private
L2 Cache             256 KB, 8-way, inclusive, shared
Memory timing (ns)   tRCD-tCL-tBL-tWR-tRAS-tRP-tRC-tRRD-tRTP-tWTR-tFAW =
                     10-10-8-10-24-10-34-4-5-5-20
NVM                  Read/Write = 50 ns/150 ns; Capacity: 16 GB
PIL                  Mapping Table: 16 KB; Reserved Region: 160 MB
4.2 Applications and Microbenchmarks
The PIL is evaluated with a series of microbenchmarks and real-world appli-
cations. In the experiment, eight threads are run for each microbenchmark.
To avoid simultaneous accesses to data, primarily to simplify concurrency
controls, each thread operates on independent data structures during transac-
tion execution. These transactions access five popular data structures listed
in Table 4.2. Each thread repeats insert and update operations on these data
structures. Two different data sizes are used for each microbenchmark: a
small set of 64 bytes and a large set of 1 KB.
In addition to the microbenchmarks, two workloads from the WHISPER
benchmark suite [22], YCSB and TPC-C, are tested. In YCSB, reads com-
prise 20% of the transactions while updates comprise 80%. Again, two dif-
ferent data sizes are used: 512 Bytes and 1 KB. All requests follow a Zipfian
distribution. For TPC-C, the new order transactions are tested since this
mode is the most write intensive and best exercises the PIL. Following the
WHISPER benchmark suite, an N-store [43] database is utilized to persist
data to NVM, where each thread executes transactions on its database tables.
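To make the YCSB request mix concrete, the following Python sketch generates a stream with 20% reads and 80% updates whose keys follow a Zipfian distribution. The skew parameter of 0.99, the number of keys, and the function names are assumptions for illustration; the thesis does not state these values.

```python
import random


def zipf_weights(num_keys, skew=0.99):
    """Unnormalized Zipfian weights: the key of rank r gets weight 1 / r**skew."""
    return [1.0 / (rank ** skew) for rank in range(1, num_keys + 1)]


def generate_requests(num_requests, num_keys, read_fraction=0.2, seed=0):
    """Return a list of (operation, key) pairs with Zipfian-distributed keys."""
    rng = random.Random(seed)
    keys = rng.choices(range(num_keys), weights=zipf_weights(num_keys),
                       k=num_requests)
    ops = ["read" if rng.random() < read_fraction else "update" for _ in keys]
    return list(zip(ops, keys))


if __name__ == "__main__":
    requests = generate_requests(10, num_keys=100)
    print(requests)
```

Under a Zipfian distribution a small number of keys receives most of the requests, so repeated updates to the same data are common.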
Table 4.2: Microbenchmarks and Macrobenchmarks
Type    Benchmark      Description
Micro   Vector [16]    Insert/update entries in vector
Micro   Hashmap [15]   Insert/update entries in hash map
Micro   Queue [17]     Insert/update entries in queue
Micro   Rbtree [18]    Insert/update entries in RB-tree
Micro   B-tree [18]    Insert/update entries in B-tree
Macro   YCSB [16]      20%:80% for reads:updates
Macro   TPCC [22]      A real-world OLTP workload
4.3 System Comparison
The PIL implementation is compared against several state-of-the-art so-
lutions [15, 21, 10]. Specifically, three common, high-performance crash-
consistency techniques are selected: undo logging, redo logging, and shadow
paging. Following is a brief introduction to these approaches.
• Undo logging: Hardware-based undo logging is implemented based
upon the work of ATOM [15]. Included in the implementation are the
authors’ post- and source-log optimizations which help to reduce the
critical path of store operations.
• Redo logging: Hardware-based redo logging is implemented based
on the research work in [21]. This implementation of hardware redo
logging supports asynchronous data checkpointing and log truncation.
When redo logging checkpoints the data, it must fetch the data from
NVM, perform the in-place update, and then truncate the log entry.
• Shadow paging: Optimized shadow paging is implemented based on
OSP [10]. In the shadow paging scheme, each virtual cache line is
associated with two physical cache lines. To improve spatial efficiency,
the authors propose a page consolidation approach which has also been
implemented in this experiment to further optimize the shadow paging
design.
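The difference between the two logging schemes listed above can be shown with a toy software model. The Python sketch below is a textbook-style illustration only: it is not the hardware design of ATOM [15] or of the redo logging scheme in [21], and the class and method names are hypothetical. A dictionary stands in for NVM home locations, and `None` is used as a marker for "address did not exist".

```python
class UndoLogMemory:
    """Toy undo logging: record the old value, then update in place."""

    def __init__(self):
        self.home = {}  # stands in for NVM home locations
        self.log = []   # (address, old_value) entries of the open transaction

    def store(self, addr, value):
        self.log.append((addr, self.home.get(addr)))  # log entry comes first
        self.home[addr] = value                       # then the in-place update

    def commit(self):
        self.log.clear()

    def recover(self):
        # Roll back the uncommitted transaction, newest entry first.
        for addr, old in reversed(self.log):
            if old is None:
                self.home.pop(addr, None)
            else:
                self.home[addr] = old
        self.log.clear()


class RedoLogMemory:
    """Toy redo logging: record new values, apply them at checkpoint time."""

    def __init__(self):
        self.home = {}
        self.pending = []  # log entries of the open transaction
        self.log = []      # committed entries that are not yet checkpointed

    def store(self, addr, value):
        self.pending.append((addr, value))

    def commit(self):
        self.log.extend(self.pending)
        self.pending.clear()

    def checkpoint(self):
        for addr, value in self.log:  # in-place update of the home location
            self.home[addr] = value
        self.log.clear()              # log truncation

    def recover(self):
        self.pending.clear()  # uncommitted updates are discarded
        self.checkpoint()     # committed updates are replayed
```

In both models every store produces a log write in addition to the data write, which is the double-write cost that out-of-place updates aim to avoid.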
This work also implements the log-structured non-volatile main memory
(LSNVMM), based on prior work [14]. Following their approach, the address
mapping table is stored in DRAM. During the evaluation, two optimiza-
tions are also applied, tree node cache and group update, to accelerate tree
look-ups. LSNVMM reduces the memory write traffic and memory fragmentation
with log-structured memory. However, its performance bottleneck
is the address mapping tree. During transaction execution, LSNVMM has
to frequently access the address mapping tree stored in DRAM to perform
the address translation, which generates a substantial number of memory
accesses. Experimental results show that their solution is much slower than
hardware approaches (80×). Therefore, the results of LSNVMM are not





Figure 5.1 shows the normalized throughput of the four designs when running
each benchmark. Higher values are better. In this case, WrAP forms the
baseline because it shows the lowest performance of the tested designs and






















[Figure 5.1 plot. x-axis: Vector, Queue, Rbtree, Btree, and Hashmap (small and large data sizes each), TPCC, Geo-mean. Series: ATOM, WrAP, OSP, PIL.]
Figure 5.1: Transaction Throughput
As shown in the figures, PIL exhibits higher performance than all three
other approaches. More specifically, the PIL improves transaction through-
put by 69%, 42.1% and 16.6% compared with WrAP, ATOM, and OSP re-
spectively. From these results it is clear that WrAP logging suffers from
severe performance overheads because it cannot apply optimizations to the
data logging steps. First, WrAP logging does not support log removal. For
each data update, it must create a log entry and then persist this entry to
NVM. In contrast, ATOM logging will only generate one log entry for mul-
tiple updates to the same data within a transaction, reducing the cost of log
persistence. The PIL also provides data removal for multiple updates to the
same address with the OOP data buffer detailed in Section 3.6.
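The effect of log removal on the number of persisted log entries can be illustrated with a short Python sketch. The tracking granularity is an assumption here (64-byte cache lines by default), and the function name is hypothetical.

```python
def count_log_entries(store_addresses, log_removal, granularity=64):
    """Number of log entries persisted for one transaction.

    Without log removal every store creates an entry. With log removal,
    repeated stores to the same tracked unit share a single entry."""
    if not log_removal:
        return len(store_addresses)
    return len({addr // granularity for addr in store_addresses})


if __name__ == "__main__":
    stores = [0x10, 0x18, 0x10, 0x50]  # hypothetical store addresses
    print(count_log_entries(stores, log_removal=False))
    print(count_log_entries(stores, log_removal=True))
```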
WrAP logging must persist both the data and metadata for a single update
using two cache lines which wastes useful memory bandwidth. Undo logging
uses cache line log packing to reduce the number of memory requests by up
to 57% compared with no log packing. The PIL uses a word granularity for
packing data. As a result, eight data updates and their metadata are packed
into only two cache lines. Thus with two memory bursts, the PIL can persist
eight data entries leading to a reduction in bandwidth by 87.5% compared
with no log packing and 30.5% compared with cache line packing.
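The bandwidth figures above can be reproduced with simple arithmetic. The Python sketch below assumes that, without packing, every update costs one cache line for the data and one for its metadata, as stated for WrAP. It also reads the 30.5% figure as the difference in percentage points between the PIL reduction and the 57% reported for cache line packing; this reading is an inference from the numbers, not a statement from the original text.

```python
def lines_without_packing(num_updates):
    """One cache line for the data plus one for its metadata per update."""
    return 2 * num_updates


def reduction(lines_used, lines_baseline):
    """Fraction of memory requests saved relative to the baseline."""
    return 1 - lines_used / lines_baseline


if __name__ == "__main__":
    pil = reduction(2, lines_without_packing(8))  # 8 updates in 2 cache lines
    cache_line_packing = 0.57                     # upper bound quoted above
    print(f"vs. no packing:         {pil:.1%}")
    print(f"vs. cache line packing: {pil - cache_line_packing:.1%}")
```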
WrAP logging applies asynchronous data checkpointing and log truncation
to eliminate these extra operations from the critical path. Despite this, indi-
rection and asynchronous data checkpointing require additional NVM reads.
The PIL also uses an indirect data update policy, but its data combination
during garbage collection, covered in Section 5.3, helps amortize these read
costs. Furthermore, since NVM read latency is much lower than the write
latency (by 3-10×), these costs could be further reduced in the future.
OSP obtains performance similar to that of ATOM and WrAP. OSP ap-
plies a lightweight copy-on-write mechanism to address the write amplifi-
cation issues caused by page-level shadow copies. Unfortunately, there are
three performance issues in their solution. First, to enforce the transac-
tion durability, this approach must persist the updated cache lines to NVM
frequently. This eager persistence in their approach greatly affects the trans-
action throughput. Furthermore, frequently updating the virtual-to-physical
address mapping during transaction execution would cause severe TLB co-
herence issues in a multi-core environment [44]. Finally, page consolidation
in the optimized shadow paging approach also incurs additional data copy
























[Figure 5.2 plot. x-axis: YCSB-A, YCSB-B, YCSB-C, YCSB-D, YCSB-E, YCSB-F. Series: ATOM, WrAP, OSP, PIL.]
Figure 5.2: Transaction Throughput for YCSB Benchmarks
Figure 5.2 also shows the transaction throughput of YCSB benchmarks.
It shows that PIL can significantly improve real-world application
performance when compared with ATOM, WrAP, and OSP. For write-intensive
workloads like YCSB-A (50% read & 50% write) and YCSB-F (50% read &
50% read-modify-write), PIL outperforms ATOM, WrAP, and OSP by
up to 72% due to its efficient out-of-place data updates.




























[Figure 5.3 plot. x-axis: Vector, Queue, Rbtree, Btree, and Hashmap (small and large data sizes each), TPCC, Geo-mean. Series: Native, ATOM, WrAP, OSP, PIL.]
Figure 5.3: Critical Path Latency
In this work, critical path latency is defined as the time taken to execute
the entire transaction, starting from a Tx begin instruction and stopping
with a Tx end instruction. In this experiment, the critical path lengths of five
designs are compared: ATOM, WrAP, OSP, PIL, and a native system without
persistence support. Figure 5.3 demonstrates the experimental results of
these five designs. Lower values are better. The critical path of the native
system forms the baseline because it incurs no persistence overheads.
The PIL achieves a significantly shorter critical path than other approaches.
The critical path latencies for ATOM, WrAP, and OSP are longer than that
for PIL by 52.8%, 46.2%, and 44.3% respectively. Furthermore, the PIL also
achieves a critical path latency close to the native system, being only 21%
longer on average. This is because the PIL leverages the OOP buffer table
instead of applying an eager policy to persist each data update, reducing the
performance overhead incurred by persisting data.
Both ATOM and WrAP deliver a critical path much longer than the PIL.
ATOM logging uses log removal to reduce the number of persisted logs and
enforces the log→data ordering at the memory controller. Both of these op-
timizations separate data persistence operations from store operations [15].
Despite this, ATOM still shows worse performance than WrAP logging be-
cause of the strict persist ordering between log and data residing in the criti-
cal path of transaction execution. Furthermore, asynchronous log truncation
and data checkpointing in WrAP accelerate its critical path execution. As a
result, WrAP outperforms ATOM by up to 10%. Finally, OSP also delivers
a longer critical path latency than PIL by up to 43% due to expensive TLB
coherence overheads.
5.3 Write Traffic
NVM has limited write endurance when compared with DRAM systems, i.e.,
10^8 writes vs. 10^15 writes [45]. Therefore, reducing write traffic is vital to
extend the lifetime of NVM devices. In this section, write traffic caused by
these crash-consistency techniques is measured. Write traffic is defined as the
number of bytes written by these four approaches for data persistence on a
per-transaction basis. The system without persistence support is treated as
the baseline since it incurs no additional write traffic. Figure 5.4 shows the
































[Figure 5.4 plot. x-axis: Vector, Queue, Rbtree, Btree, Hashmap, YCSB, TPC-C. Series: ATOM, WrAP, OSP, PIL.]
Figure 5.4: Normalized Benchmark Write Traffic
Table 5.1: Average Benchmark Data Removal Ratios
Tx Num.   Vector   Queue   RBtree   Btree   Hashmap   YCSB    TPCC
10^1      29.1%    24.3%   23.5%    26.3%   27.7%     23.2%   24.3%
10^2      50.2%    51.8%   53.4%    48.2%   52.4%     49.6%   50.1%
10^3      74.1%    76.4%   73.5%    70.6%   71.2%     70.1%   72.0%
10^4      85.3%    82.2%   81.1%    83.2%   82.5%     81.3%   83.2%
As expected, the PIL delivers the fewest NVM writes compared with the
other three approaches. Both ATOM and WrAP introduce additional writes
for each data update, resulting in heavy write traffic during transaction exe-
cution. ATOM mitigates this issue through log removal, which delivers lower
write traffic than WrAP by an average of 11%. However, ATOM and WrAP
introduce 2.23× and 2.02× more NVM writes than PIL. OSP also has lower
write traffic than the logging approaches, with a reduction of 40.1% and
34.1% versus undo and redo logging respectively. Interestingly, the PIL also
has lower write traffic than OSP by an average of 8%. To explore why the
PIL performs so well, the average data removal ratio of the PIL is measured
by varying the number of combined transactions. The data removal ratio
is defined as the percentage of bytes modified during transactions which are
not written back to the home region during garbage collection due to data
combination. Results are shown in Table 5.1. As the number of combined
transactions increases, the PIL reduces more write traffic when performing
garbage collection. When the number of transactions exceeds 10^4, the PIL
only needs to write a small portion of data (less than 15%) back to their
home locations.
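The data removal ratio defined above can be computed as in the following Python sketch. It assumes that each update is described by a `(home_address, num_bytes)` pair and that updates to the same home address fully overlap; the representation and the function name are hypothetical.

```python
def data_removal_ratio(transactions):
    """Fraction of modified bytes that garbage collection does not write back.

    `transactions` is a list of update lists in commit order; each update is
    a (home_address, num_bytes) pair. Only the newest version of every home
    address has to be migrated after the transactions are combined."""
    modified = 0
    latest = {}  # home address -> size of its newest version
    for updates in transactions:
        for home_addr, num_bytes in updates:
            modified += num_bytes
            latest[home_addr] = num_bytes
    if modified == 0:
        return 0.0
    written_back = sum(latest.values())
    return 1 - written_back / modified


if __name__ == "__main__":
    combined = [[(0, 8), (8, 8)], [(0, 8)], [(0, 8), (16, 8)]]
    print(f"{data_removal_ratio(combined):.1%}")
```

Combining more transactions before migration gives repeated updates to the same address more chances to cancel out, which matches the trend in Table 5.1.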
5.4 Garbage Collection Efficiency
To measure the overhead of PIL’s garbage collection algorithm, the garbage
collection period is varied from 2 ms to 14 ms. Transaction performance of
five microbenchmarks listed in Table 4.2 is measured for each of the periods.
The total transaction throughput is shown in Figure 5.5.
Several observations can be made from this experiment. First, when the
period is short, garbage collection is triggered more frequently to migrate
old transactions from the reserved region to the home region. Unfortunately,
an eager policy like this can reduce the possibility of data combination from
multiple committed transactions. As a result, more NVM bandwidth is used
by the garbage collection process for writing updated data back to its home
location. As the period becomes longer, throughput of the microbenchmarks
steadily increases. This is because a larger amount of data modified by
transactions in the reserved region can be combined, significantly reducing
NVM write traffic during garbage collection and resulting in higher performance.
As can be seen from the figure, almost all microbenchmarks achieve their
[Figure 5.5 plot. x-axis ticks: 2, 4, 6, 8, 10, 12, 14. y-axis scale: ×10^4. Series: vector, queue, rbtree, btree, hashmap.]
Figure 5.5: Microbenchmark Performance with Varied Garbage Collection
Periods
peak throughput when the GC period is 8-10 ms. Finally, when the period
exceeds 11 ms, the application could be blocked by the garbage collector as
there is not enough space to hold committed transactions. This leads to
on-demand garbage collection which must take place in the critical path.
5.5 System Recovery
PIL leverages multiple threads to accelerate the system recovery process. In
this experiment, the reserved region size is 1 GB. The number of threads
performing recovery and the available memory bandwidth is varied to calcu-
late the time taken to recover the system state following a crash. Figure 5.6
shows the experimental results.
As the available memory bandwidth increases, the time needed to recover
the system decreases proportionally. For example, when the NVM bandwidth exceeds 25
GB/s, it only takes 47 ms for the PIL to recover 1 GB of data in the re-
served region. This result is 2.5× faster than the NVM system with only 10
GB/s memory bandwidth. Furthermore, as the number of recovery threads
increases, a lower recovery time is observed due to the benefits of paral-
lel scan and hash map creation. However, as the thread number exceeds a
threshold, multiple memory requests issued by these threads begin to satu-
[Figure 5.6 plot. x-axis ticks: 1, 2, 4, 8, 16. Series: 10 GB/s, 15 GB/s, 20 GB/s, 25 GB/s.]
Figure 5.6: Recovery Time of a 1GB Reserved Region
rate the memory bandwidth. As a result, the memory controller becomes a
performance bottleneck and the data recovery time does not decrease with a
further increase in the number of recovery threads.
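The bandwidth dependence of the recovery time can be checked with a back-of-the-envelope calculation. The Python sketch below computes only a lower bound, namely the time needed to stream the reserved region once at full bandwidth; it is not a model of the simulated system, and the function name is hypothetical.

```python
def min_scan_time_ms(region_gb, bandwidth_gb_per_s):
    """Lower bound on recovery time: one full read of the reserved region."""
    return region_gb / bandwidth_gb_per_s * 1000.0


if __name__ == "__main__":
    for bandwidth in (10, 15, 20, 25):
        bound = min_scan_time_ms(1, bandwidth)
        print(f"{bandwidth} GB/s: at least {bound:.1f} ms")
```

At 25 GB/s the bound is 40 ms, slightly below the 47 ms reported above, and the ratio between the bounds at 10 GB/s and 25 GB/s is 2.5, which agrees with the reported speedup.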
5.6 Wear Leveling and Sensitivity Analysis
In this section, a sensitivity study of the PIL design is performed by varying
NVM read latency. The YCSB benchmark is used for sensitivity testing as
well as to test the wear-leveling capabilities of PIL by looping the benchmark
100 times. This is done in order to write significantly more data (480GB when
looped 100×). Figures 5.7 and 5.8 display the experimental results.
YCSB-A generates a mix of 50% read and 50% update requests to the
N-store database. Looping the YCSB workload 100 times results in a total
of 480GB of data written to NVM. This data is spread across the 80 OOP
blocks in the current PIL configuration. The number of memory slices writ-
ten to each OOP block is counted and multiplied by the size of each slice.
The results are plotted in Figure 5.7. The results show that PIL’s approach
to wear leveling among blocks is quite effective and helps to ensure the NVM
device does not fail prematurely. Looking at the effects of NVM latency
in Figure 5.8, the results show that higher NVM read latency dramatically























[Figure 5.7 plot. x-axis: OOP Block # (1 to 16).]
Figure 5.7: Gigabytes Written to OOP Blocks 1-16
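The per-block write volume is obtained by multiplying the number of memory slices written to each OOP block by the slice size. The Python sketch below illustrates this computation together with a simple evenness metric. The slice size, the slice counts, and the imbalance metric are hypothetical and are not taken from the experiments; only the totals of 480 GB and 80 blocks come from the text.

```python
def bytes_per_block(slice_counts, slice_size):
    """Bytes written to each OOP block = slices written x slice size."""
    return [count * slice_size for count in slice_counts]


def imbalance(written):
    """Most-written block divided by the mean; 1.0 means perfectly even."""
    mean = sum(written) / len(written)
    return max(written) / mean


if __name__ == "__main__":
    # With perfectly even wear, 480 GB over 80 blocks is 6 GB per block.
    print(480 / 80)
    counts = [1000, 1010, 990, 1000]  # hypothetical slice counts
    print(imbalance(bytes_per_block(counts, slice_size=128)))
```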







Figure 5.8: Transaction Throughput with Varied NVM Read Latency
rithm requires more time to read data from the reserved region in order to
perform data combination, which decreases the GC efficiency and contributes




Building systems atop NVM has attracted plenty of attention from the re-
search community. Its promising qualities such as byte-addressability, non-
volatility, and DRAM-like performance have greatly influenced traditional
systems design. Diversified systems like NVM file systems [46, 47, 8, 48]
and transactional libraries [14, 12] have been widely developed to exploit its
characteristics. In the following sections, a brief introduction to these related
works and how they support data atomic durability guarantees with respect
to system crashes is presented.
6.1 Non-volatile Memory File Systems
File systems provide a familiar POSIX-like interface for applications to ac-
cess NVM. However, NVM has posed severe challenges for conventional file
system designs [46, 47, 8, 48]. Existing works use a variety of techniques
including journaling [46, 49, 50], log-structuring [47], shadow paging [8] and
soft updates [48] to provide atomicity guarantees. Similar to the work of this
thesis, reducing the data persistence overheads resulting from these schemes
is crucial to NVM file system performance. Fortunately, existing techniques
can obtain performance benefits from the intrinsic properties of NVM. For
example, Chen et al. [51] leverage the byte-addressable interface to avoid
write amplification during system metadata journaling. Further, BPFS also
uses this property to propose a short-circuit shadow paging solution which
avoids cascading copy-on-write issues found in traditional shadow paging
mechanisms. SoupFS uses soft updates to eliminate synchronous metadata
updates caused by unexpected cache line flushes [48], thus accelerating crit-
ical path execution.
6.2 Durable Transaction Systems
To fully exploit DRAM-like access latencies, the most promising and preva-
lent mechanisms are durable transactional memory systems [52, 38]. Durable
transactions allow applications to access NVM directly with load and store
operations. Many approaches have been proposed to reduce the memory
persistence overheads with NVM. Mnemosyne defers data checkpointing and
log truncation which eliminates them from the critical path of transaction
execution [38]. BPPM [36] uses bulk persistence to delay data checkpointing
and log truncation, though it requires clwb and mfence instructions to persist
each log entry. Both SoftWrAP [37] and DudeTM [12] adopt shadow memory
to decouple redo logging from the critical path. They preserve volatile data
updates in DRAM and persist log entries to NVM asynchronously. They
adopt a DRAM buffer to improve critical path performance, but persisting
log entries from DRAM to NVM increases system write traffic compared with
traditional logging. Furthermore, decoupling redo logging from the critical
path trades strict data durability for performance.
Other problems with durable transaction systems are the ordering require-
ments within and between transactions. DCT [34], LOC [35] and HOPS [22]
aim to relax these requirements for undo or redo logging. DCT applies tech-
niques such as deferred commit to achieve this goal. HOPS proposes two
new ISA primitives, ofence and dfence, to decouple transaction ordering from
durability. Similar techniques have also been applied in the work of BPFS
[8].
Hardware-based logging for transactional systems has been thoroughly
studied. Hardware logging is promising since it can eliminate costly cache line
flushes and enforce ordering without explicit memory fences. Consequently,
hardware-based undo logging [15, 17], redo logging [21, 16] and undo+redo
logging [18] have been proposed. These solutions facilitate efficient logging
in hardware by eliminating the need for costly memory instructions.
Similar to the work of this thesis, LSNVMM [14] also enables out-of-
place data updates during transactions. Their approach is inspired by log-
structured file systems [39]. Their design caches address mappings in DRAM
instead of NVM. Even with the use of an optimized skiplist-based search tree
to accelerate the mapping look-up process, multiple accesses to DRAM for
each address translation will add additional performance overheads in the
critical path of a transaction. Fortunately, this work addresses this issue
with transparent out-of-place updates through the use of an on-chip map-
ping table.
Kamino-Tx [9] and Kiln [6] both propose in-place update solutions. How-
ever, supporting in-place updates in current architectures is non-trivial. In
order to do so, these works either integrate a non-volatile last-level cache into
the chip, or preserve an additional shadow copy for data updates, both of
which incur large storage overheads.
6.3 Lock-based Persistence Support
Apart from transactional memory, researchers have also found that locking
provides well-defined restrictions for atomicity, ordering, and concurrency
among multiple threads executing in parallel. Therefore, they seek to add
durability support into existing lock-based programs [53, 54, 55]. For example,
Atlas uses undo logging to provide durability guarantees to lock-protected
critical sections, which are also called failure-atomic sections in their solution.
In addition, Izraelevitz et al. reduce logging overheads through JUSTDO
logging [54]. This approach requires both non-volatile processor caches and





Enforcing data persistence on NVM using traditional crash-consistency tech-
niques is expensive. This thesis presents a mechanism to reduce memory
persistence overheads using out-of-place updates. The persistence indirec-
tion layer is built to enable transparent and efficient out-of-place updates
during transaction execution. Experimental results show that this approach
achieves a remarkably low critical path latency which nears that of a native
system providing no persistence support, and further provides up to 1.8×
higher transaction throughput than state-of-the-art hardware-based crash-
consistency techniques while also reducing write traffic to NVM by up to
85.3%. Importantly, the PIL design provides all of the same strong data
atomic durability guarantees as previous works.
7.2 Future Work Directions
While this work focuses on computer systems which only have NVM for sys-
tem memory, this setting is only one of many possible system configurations.
Future work will address systems which contain heterogeneous memory such
as DRAM and NVM together. This is a promising new direction for mem-
ory systems research since DRAM offers symmetric read and write latency
while NVM exhibits asymmetric, increased write latency. Future research can
leverage DRAM as a write cache for NVM to allow writes to occur quickly on




[1] M. K. Qureshi, V. Srinivasan, and J. A. Rivers, “Scalable high perfor-
mance main memory system using phase-change memory technology,” in
36th International Symposium on Computer Architecture (ISCA 2009),
June 20-24, 2009, Austin, TX, USA, 2009, pp. 24–33.
[2] T. Kawahara, “Scalable spin-transfer torque RAM technology for
normally-off computing,” IEEE Design & Test of Computers, vol. 28,
no. 1, pp. 52–63, 2011.
[3] D. B. Strukov, G. S. Snider, D. R. Stewart, and R. S. Williams, “The
missing memristor found,” Nature, vol. 453, no. 7191, p. 80, 2008.
[4] “3DXPoint,” https://newsroom.intel.com/news-releases/intel-and-
micron-produce-breakthrough-memory-technology/, 2015.
[5] S. Pelley, P. M. Chen, and T. F. Wenisch, “Memory persistency,” in
ACM/IEEE 41st International Symposium on Computer Architecture,
ISCA 2014, Minneapolis, MN, USA, June 14-18, 2014, 2014, pp. 265–
276.
[6] J. Zhao, S. Li, D. H. Yoon, Y. Xie, and N. P. Jouppi, “Kiln: clos-
ing the performance gap between systems with and without persistence
support,” in The 46th Annual IEEE/ACM International Symposium on
Microarchitecture, MICRO-46, Davis, CA, USA, December 7-11, 2013,
2013, pp. 421–432.
[7] C. Mohan, D. J. Haderle, B. G. Lindsay, H. Pirahesh, and P. M. Schwarz,
“ARIES: A transaction recovery method supporting fine-granularity
locking and partial rollbacks using write-ahead logging,” ACM Trans.
Database Syst., vol. 17, no. 1, pp. 94–162, 1992.
[8] J. Condit, E. B. Nightingale, C. Frost, E. Ipek, B. C. Lee, D. Burger,
and D. Coetzee, “Better I/O through byte-addressable, persistent mem-
ory,” in Proceedings of the 22nd ACM Symposium on Operating Systems
Principles 2009, SOSP 2009, Big Sky, Montana, USA, October 11-14,
2009, 2009, pp. 133–146.
[9] A. Memaripour, A. Badam, A. Phanishayee, Y. Zhou, R. Alagappan,
K. Strauss, and S. Swanson, “Atomic in-place updates for non-volatile
main memories with kamino-tx,” in Proceedings of the Twelfth Euro-
pean Conference on Computer Systems, EuroSys 2017, Belgrade, Serbia,
April 23-26, 2017, 2017, pp. 499–512.
[10] Y. Ni, J. Zhao, D. Bittman, and E. L. Miller, “Reducing NVM writes
with optimized shadow paging,” in 10th USENIX Workshop on Hot Top-
ics in Storage and File Systems, HotStorage 2018, Boston, MA, USA,
July 9-10, 2018., 2018.
[11] M. K. Qureshi, J. Karidis, M. Franceschini, V. Srinivasan, L. Lastras,
and B. Abali, “Enhancing lifetime and security of phase change memo-
ries via start-gap wear leveling,” in Proceedings of the 42nd International
Symposium on Microarchitecture (MICRO’42), Austin, TX, 2009.
[12] M. Liu, M. Zhang, K. Chen, X. Qian, Y. Wu, W. Zheng, and J. Ren,
“Dudetm: Building durable transactions with decoupling for persistent
memory,” in Proceedings of the Twenty-Second International Confer-
ence on Architectural Support for Programming Languages and Operat-
ing Systems, ASPLOS 2017, Xi’an, China, April 8-12, 2017, 2017, pp.
329–343.
[13] “NVML,” https://github.com/pmem/, 2018.
[14] Q. Hu, J. Ren, A. Badam, J. Shu, and T. Moscibroda, “Log-structured
non-volatile main memory,” in 2017 USENIX Annual Technical Confer-
ence, USENIX ATC 2017, Santa Clara, CA, USA, July 12-14, 2017.,
2017, pp. 703–717.
[15] A. Joshi, V. Nagarajan, S. Viglas, and M. Cintra, “ATOM: atomic dura-
bility in non-volatile memory through hardware logging,” in 2017 IEEE
International Symposium on High Performance Computer Architecture,
HPCA 2017, Austin, TX, USA, February 4-8, 2017, 2017, pp. 361–372.
[16] J. Jeong, C. H. Park, J. Huh, and S. Maeng, “Efficient hardware-assisted
logging with asynchronous and direct-update for persistent memory,” in
Proceedings of the 51th Annual IEEE/ACM International Symposium
on Microarchitecture, MICRO 2018, Fukuoka, Japan, October 20-24,
2018, 2018, pp. 178–190.
[17] S. Shin, S. K. Tirukkovalluri, J. Tuck, and Y. Solihin, “Proteus: a flex-
ible and fast software supported hardware logging approach for NVM,”
in Proceedings of the 50th Annual IEEE/ACM International Symposium
on Microarchitecture, MICRO 2017, Cambridge, MA, USA, October 14-
18, 2017, 2017, pp. 178–190.
[18] M. Ogleari, E. L. Miller, and J. Zhao, “Steal but no force: Efficient
hardware undo+redo logging for persistent memory systems,” in IEEE
International Symposium on High Performance Computer Architecture,
HPCA 2018, Vienna, Austria, February 24-28, 2018, 2018, pp. 336–349.
[19] T. M. Nguyen and D. Wentzlaff, “PiCL: a software-transparent, per-
sistent cache log for nonvolatile main memory,” in Proceedings of the
51th Annual IEEE/ACM International Symposium on Microarchitec-
ture, MICRO 2018, Fukuoka, Japan, October 20-24, 2018, 2018, pp.
178–190.
[20] J. H. Ahn, S. Li, S. O, and N. P. Jouppi, “McSimA+: A manycore
simulator with application-level+ simulation and detailed microarchi-
tecture modeling,” in 2012 IEEE International Symposium on Perfor-
mance Analysis of Systems & Software, Austin, TX, USA, 21-23 April,
2013, 2013, pp. 74–85.
[21] K. Doshi, E. Giles, and P. J. Varman, “Atomic persistence for SCM
with a non-intrusive backend controller,” in 2016 IEEE International
Symposium on High Performance Computer Architecture, HPCA 2016,
Barcelona, Spain, March 12-16, 2016, 2016, pp. 77–89.
[22] S. Nalli, S. Haria, M. D. Hill, M. M. Swift, H. Volos, and K. Keeton, “An
analysis of persistent memory use with WHISPER,” in Proceedings of the
Twenty-Second International Conference on Architectural Support for
Programming Languages and Operating Systems, ASPLOS 2017, Xi’an,
China, April 8-12, 2017, 2017, pp. 135–148.
[23] A. Mahesri and V. Vardhan, “Power consumption breakdown on a mod-
ern laptop,” in International Workshop on Power-Aware Computer Sys-
tems. Springer, 2004, pp. 165–180.
[24] F. Faggin, M. E. Hoff, S. Mazor, and M. Shima, “The history of the
4004,” IEEE Micro, vol. 16, no. 6, pp. 10–20, 1996.
[25] R. Singhal, “Inside Intel next generation Nehalem microarchitecture,”
in Hot Chips, vol. 20, 2008, p. 15.
[26] L. B. A. Rabai, B. Cohen, and A. Mili, “Programming language use in
us academia and industry,” Informatics in Education, vol. 14, no. 2, p.
143, 2015.
[27] S. Anderson, J. Earle, R. E. Goldschmidt, and D. Powers, “The IBM
system/360 model 91: floating-point execution unit,” IBM Journal of
Research and Development, vol. 11, no. 1, pp. 34–53, 1967.
[28] D. M. Gallagher, W. Y. Chen, S. A. Mahlke, J. C. Gyllenhaal, and
W.-m. W. Hwu, “Dynamic memory disambiguation using the memory
conflict buffer,” in ACM SIGPLAN Notices, vol. 29, no. 11. ACM,
1994, pp. 183–193.
[29] G. Reinman and B. Calder, “Predictive techniques for aggressive load
speculation,” in Proceedings. 31st Annual ACM/IEEE International
Symposium on Microarchitecture. IEEE, 1998, pp. 127–137.
[30] K. A. Feiste, B. J. Ronchetti, and D. J. Shippy, “System for store for-
warding assigning load and store instructions to groups and reorder
queues to keep track of program order,” Feb. 19 2002, US Patent
6,349,382.
[31] J. Chen and Y. Fong, “High density non-volatile flash memory without
adverse effects of electric field coupling between adjacent floating gates,”
Feb. 2 1999, US Patent 5,867,429.
[32] R. M. White, “Disk-storage technology,” Scientific American, vol. 243,
no. 2, pp. 138–149, 1980.
[33] T. Haerder and A. Reuter, “Principles of transaction-oriented database
recovery,” ACM Comput. Surv., vol. 15, no. 4, pp. 287–317, Dec. 1983.
[Online]. Available: http://doi.acm.org/10.1145/289.291
[34] A. Kolli, S. Pelley, A. G. Saidi, P. M. Chen, and T. F. Wenisch, “High-
performance transactions for persistent memories,” in Proceedings of
the Twenty-First International Conference on Architectural Support for
Programming Languages and Operating Systems, ASPLOS ’16, Atlanta,
GA, USA, April 2-6, 2016, pp. 399–411.
[35] Y. Lu, J. Shu, L. Sun, and O. Mutlu, “Loose-ordering consistency for
persistent memory,” in 32nd IEEE International Conference on Com-
puter Design, ICCD 2014, Seoul, South Korea, October 19-22, 2014,
pp. 216–223.
[36] Y. Lu, J. Shu, and L. Sun, “Blurred persistence in transactional persis-
tent memory,” in IEEE 31st Symposium on Mass Storage Systems and
Technologies, MSST 2015, Santa Clara, CA, USA, May 30 - June 5,
2015, pp. 1–13.
[37] E. Giles, K. Doshi, and P. J. Varman, “Softwrap: A lightweight frame-
work for transactional support of storage class memory,” in IEEE 31st
Symposium on Mass Storage Systems and Technologies, MSST 2015,
Santa Clara, CA, USA, May 30 - June 5, 2015, pp. 1–14.
[38] H. Volos, A. J. Tack, and M. M. Swift, “Mnemosyne: Lightweight per-
sistent memory,” in Proceedings of the 16th International Conference on
Architectural Support for Programming Languages and Operating Sys-
tems, ASPLOS 2011, Newport Beach, CA, USA, March 5-11, 2011,
pp. 91–104.
[39] M. Rosenblum and J. K. Ousterhout, “The design and implementa-
tion of a log-structured file system,” in Proceedings of the Thirteenth
ACM Symposium on Operating System Principles, SOSP 1991, Asilo-
mar Conference Center, Pacific Grove, California, USA, October 13-16,
1991, pp. 1–15.
[40] R. Ramakrishnan and J. Gehrke, Database Management Systems,
3rd ed. McGraw-Hill Education, 2002.
[41] B. Jacob, S. Ng, and D. Wang, Memory Systems. Morgan Kaufmann,
2007.
[42] S. Kannan, A. Gavrilovska, and K. Schwan, “pVM: Persistent virtual
memory for efficient capacity scaling and object storage,” in Proceedings
of the Eleventh European Conference on Computer Systems, EuroSys
2016, London, United Kingdom, April 18-21, 2016, pp. 13:1–13:16.
[43] J. Arulraj, A. Pavlo, and S. Dulloor, “Let’s talk about storage & recovery
methods for non-volatile memory database systems,” in Proceedings of
the 2015 ACM SIGMOD International Conference on Management of
Data, Melbourne, Victoria, Australia, May 31 - June 4, 2015, pp.
707–722.
[44] N. Amit, “Optimizing the TLB shootdown algorithm with page ac-
cess tracking,” in 2017 USENIX Annual Technical Conference, USENIX
ATC 2017, Santa Clara, CA, USA, July 12-14, 2017, pp. 27–39.
[45] S. Mittal and J. S. Vetter, “A survey of software techniques for using
non-volatile memories for storage and main memory systems,” IEEE
Trans. Parallel Distrib. Syst., vol. 27, no. 5, pp. 1537–1550, 2016.
[46] D. S. Rao, S. Kumar, A. Keshavamurthy, P. Lantz, D. Reddy,
R. Sankaran, and J. Jackson, “System software for persistent mem-
ory,” in Ninth Eurosys Conference 2014, EuroSys 2014, Amsterdam,
The Netherlands, April 13-16, 2014, pp. 15:1–15:15.
[47] J. Xu and S. Swanson, “NOVA: A log-structured file system for hy-
brid volatile/non-volatile main memories,” in 14th USENIX Conference
on File and Storage Technologies, FAST 2016, Santa Clara, CA, USA,
February 22-25, 2016, pp. 323–338.
[48] M. Dong and H. Chen, “Soft updates made simple and fast on non-
volatile memory,” in 2017 USENIX Annual Technical Conference,
USENIX ATC 2017, Santa Clara, CA, USA, July 12-14, 2017,
pp. 719–731.
[49] J. Ou, J. Shu, and Y. Lu, “A high performance file system for non-
volatile main memory,” in Proceedings of the Eleventh European Con-
ference on Computer Systems, EuroSys 2016, London, United Kingdom,
April 18-21, 2016, pp. 12:1–12:16.
[50] “Ext4-dax,” https://lwn.net/Articles/613384/, 2014.
[51] C. Chen, J. Yang, Q. Wei, C. Wang, and M. Xue, “Fine-grained meta-
data journaling on NVM,” in 32nd Symposium on Mass Storage Systems
and Technologies, MSST 2016, Santa Clara, CA, USA, May 2-6, 2016,
pp. 1–13.
[52] J. Coburn, A. M. Caulfield, A. Akel, L. M. Grupp, R. K. Gupta,
R. Jhala, and S. Swanson, “NV-Heaps: Making persistent objects fast
and safe with next-generation, non-volatile memories,” in Proceedings
of the 16th International Conference on Architectural Support for Pro-
gramming Languages and Operating Systems, ASPLOS 2011, Newport
Beach, CA, USA, March 5-11, 2011, pp. 105–118.
[53] D. R. Chakrabarti, H. Boehm, and K. Bhandari, “Atlas: leveraging
locks for non-volatile memory consistency,” in Proceedings of the 2014
ACM International Conference on Object Oriented Programming Sys-
tems Languages & Applications, OOPSLA 2014, part of SPLASH 2014,
Portland, OR, USA, October 20-24, 2014, pp. 433–452.
[54] J. Izraelevitz, T. Kelly, and A. Kolli, “Failure-atomic persistent memory
updates via JUSTDO logging,” in Proceedings of the Twenty-First In-
ternational Conference on Architectural Support for Programming Lan-
guages and Operating Systems, ASPLOS ’16, Atlanta, GA, USA, April
2-6, 2016, pp. 427–442.
[55] T. C. Hsu, H. Brügner, I. Roy, K. Keeton, and P. Eugster, “NVthreads:
Practical persistence for multi-threaded applications,” in Proceedings of
the Twelfth European Conference on Computer Systems, EuroSys 2017,
Belgrade, Serbia, April 23-26, 2017, pp. 468–482.
