A Survey on Tiering and Caching in High-Performance Storage Systems by Hoseinzadeh, Morteza
A Survey on Tiering and Caching in High-Performance Storage Systems
Morteza Hoseinzadeh
University of California, San Diego
Abstract
Although every individual invented storage technology made
a big step towards perfection, none of them is spotless. Dif-
ferent data store essentials such as performance, availability,
and recovery requirements have not met together in a single
economically affordable medium, yet. One of the most influ-
ential factors is price. So, there has always been a trade-off
between having a desired set of storage choices and the costs.
To address this issue, a network of various types of storing
media is used to deliver the high performance of expensive
devices such as solid state drives and non-volatile memo-
ries, along with the high capacity of inexpensive ones like
hard disk drives. In software, caching and tiering are long-established
concepts for handling file operations, moving data automatically within
such a storage network, and managing data backup on low-cost media.
Intelligently moving data among different devices based on demand is the key
insight for this matter. In this survey, we discuss some re-
cent pieces of research that have been done to improve high-
performance storage systems with caching and tiering tech-
niques.
1 Introduction¹
With advances in computing and networking technologies, especially
around the Internet, and with the emergence of a tremendous number of
new data sources such as Internet of Things (IoT) endpoints, wearable
devices, mobile platforms, and smart vehicles, the input to enterprise
data-intensive analytics has scaled up to petabytes, and the digital
universe is predicted to exceed 44 zettabytes by 2020 [58]. In response
to this rapid data expansion, hardware has been striving to provide more
capacity at higher density to support high-performance storage systems.
Figure 1 represents available
and emerging storage technologies as of today. In terms of
storage technology, Hard Disk Drives (HDD) are now being supplanted by
fast, reliable Solid State Drives (SSD). Additionally, once-emerging
persistent memory devices are now becoming available on the market, as
Intel has launched Optane DIMMs [56]. Price-wise, when new technologies
become available, dated technologies become cheaper. Nowadays, SSDs are
so common that they are used as All-Flash Arrays (AFA) in data centers
[66]. However, storage I/O is still the biggest bottleneck in large-scale
data centers. As shown in [2], the time spent waiting for I/O is the
primary cause of idle and wasted CPU resources, since many popular cloud
applications are I/O intensive, such as video streaming, file sync,
backup, and data iteration for machine learning.

Figure 1: Memory technologies

¹ Parts of this section are taken from my published papers [66, 17, 69].
To solve the problem caused by I/O bottlenecks, parallel I/O to multiple
HDDs in a Redundant Array of Independent Disks (RAID) has become a common
approach. However, the performance improvement from RAID is still
limited. Therefore, many big data applications, such as Apache Spark,
strive to keep as much intermediate data in memory as possible.
Unfortunately, memory is too expensive, and its capacity is limited
(e.g., 64∼128GB per server), so it alone cannot support super-scale cloud
computing use cases. Some studies propose using NVM-based devices such as
3D XPoint Optane DIMMs [40, 22] and PCM-based DIMMs [17, 19, 22] instead
of DRAM to provide high density and non-volatility. However, these
storage devices are not yet mature enough to be used directly as main
memory, and they are still very expensive.

arXiv:1904.11560v1 [cs.AR] 25 Apr 2019

Table 1: Comparison of different storage technologies [3, 59, 14, 11]

                 STT-RAM        DRAM          NVDIMM        Optane SSD†   NAND SSD‡     HDD
Capacity*        100s of MBs    Up to 128GB   100s of GBs   Up to 1TB     Up to 4TB     Up to 14TB
Read Lat.        6ns            10−20ns       50ns          9µs           35µs          10ms
Write Lat.       13ns           10−20ns       150ns         30µs          68µs          10ms
Price            $1-3K/GB       $7.6/GB       $3-13/GB      $1.30/GB      $0.38/GB      $0.03/GB
Addressability   Byte           Byte          Byte/Block    Block         Block         Block
Volatility       Non-Volatile   Volatile      Non-Volatile  Non-Volatile  Non-Volatile  Non-Volatile

† Intel Optane SSD 905P Series (960GB) (AIC PCIe x 4 3D XPoint)
‡ Samsung 960 Pro 1TB M.2 SSD with 48-layer 3D NAND (Source: Wikibon)
* Per module
Caching and tiering have long been used to hide the long latency of slow
devices in the storage hierarchy. In the past, high-end HDDs such as 15k
RPM drives were used as the performance tier, and low-end HDDs such as
7200 RPM drives served as the capacity tier [41]. Today, NAND-Flash SSDs
replace fast HDDs, and while low-end HDDs are obsolete, high-end HDDs are
used for capacity requirements. Soon, modern storage technologies such as
NVM will break with the past and change the storage semantics. At the
device level, today’s high-speed SSDs are equipped with a write buffer,
as in Apple’s Fusion Drive [51]. At the system level, almost all file
systems come with a page cache, which buffers data pages in DRAM and lets
applications access the contents of the files. When persistent memory is
used as the storage medium, some file systems skip the page cache [63].
At the application level, many big data applications, such as Apache
Spark, strive to keep as much intermediate data in memory as possible.
However, NVM is not economically affordable for a large enterprise
storage system, and SSDs suffer from limited write endurance.
In this survey, we discuss several studies on caching and
tiering solutions for high-performance storage systems. In
section 2, we give a short background of storage devices and
their technologies. Section 3 will investigate several research
studies on caching solutions followed by section 4 which dis-
cusses several papers on storage tiering solutions. At the end
of section 4, we briefly introduce Ziggurat, which is devel-
oped in our group. Finally, section 5 concludes the paper.
2 Background
This section briefly covers background information of indi-
vidual technology parts in the computer memory hierarchy.
We also discuss the counterpart pieces of hardware and soft-
ware required for networking them together.
2.1 Memory Hierarchy
Based on response time, the memory hierarchy separates computer storage
into an organized multi-level structure, aiming to enhance overall
performance and storage management. Different types of storage media are
designated as levels according to their performance, capacity, and
controlling technology. In general, the lower a level is in the
hierarchy, the smaller its bandwidth and the larger its storage capacity.
There are four primary levels in the hierarchy, as follows [57].
2.1.1 Internal
On-chip memory cells such as processor registers and caches fall into
this level. To provide the highest performance, architects use storage
technologies with the lowest response time, such as SRAM, flip-flops, or
latch buffers. Embedded DRAM is another technology, used in some
application-specific integrated circuits (ASIC) [16]. In recent years,
some emerging technologies such as spin-transfer torque random access
memory (STT-RAM) have received attention for the last level cache
[54, 53]. They not only provide low response time, but also offer high
density and persistence.
Notice that there are multiple sub-levels in this level of
the memory hierarchy. Processor register file, which has
the lowest possible latency, resides in the nearest sub-level
to the processor followed by multiple levels of caches (i.e.,
L1, L2, and so on). Although in symmetric multi-processor
(SMP) architecture caches may be private or shared amongst
the cores, they are still considered on the same level in the
hierarchy.
2.1.2 Main
The primary storage or the main memory of the computer
system temporarily maintains all code and data (partial) of
the running applications including the operating system. At
this level of the hierarchy, the capacity is more important
compared with the internal levels. The whole code and data
of running applications settle at this level. Although the stor-
age capacity in this level is much larger than the internal
level, the performance should also be high enough to enable
fast data transfer between the main and internal levels. Using spatial
and temporal locality, the memory controller moves bulks of data back and
forth between the last level cache and the main memory via the address
and data buses. In contrast with the internal levels, in which data can
be accessed in bytes, the unit of access here is a cache or memory line
(usually 64 bytes).
DRAM technology has been long used as the best candi-
date for this level. Other technologies such as phase change
memory (PCM) [26, 27, 18] have been introduced as a
scalable DRAM alternative with the ability to persist data.
3D XPoint [40] has been successfully prototyped and an-
nounced. Detailed information on the storage technologies
can be found in section 2.2.1.
2.1.3 Secondary Storage
The secondary storage or the on-line mass storage level is
composed of persistent block devices to store massive data
permanently. In contrast with the two levels above, the stor-
age is not directly accessible by the processor. Instead, the
storage media are connected to the processor via IO ports.
Solid State Drives (SSD), Hard Disk Drives (HDD), and ro-
tating optical devices are examples of secondary storage me-
dia. When a process is being executed, the processor sub-
mits an IO request to the block device via an IO BUS such as
PCIe, IDE, or SATA in order to load a chunk of data (usually
a block of 4KB) into a specific location in the main memory
using the Direct Memory Access (DMA) feature.
2.1.4 Tertiary Storage
The tertiary storage or off-line bulk storage includes any
kinds of removable storage devices. If accessing the data
is under control of the processing unit, it is called tertiary
storage or near-line storage. For example, a robotic mecha-
nism mounts and dismounts removable devices on demand.
Otherwise, when a user physically attaches and detaches the storage
media, it is called off-line storage. In some storage classifications,
tertiary storage and off-line storage are distinguished. However, we
consider them identical in this
paper. The rest of this section will discuss the most related
technologies and their characteristics.
2.2 Technology
The main factor that distinguishes storage media from each other is their
technology. Throughout computer history, memory technologies have evolved
considerably. Figure 1 represents currently available and emerging
technologies at a glance. Generally, computer memory can be classified
into volatile and non-volatile memories. Traditionally, non-volatile
memories, which usually fall into the secondary and tertiary storage
groups, are used to store data permanently. In contrast, volatile
memories are usually used as caches that temporarily keep data close to
the processor because of their high performance. Nevertheless, these
roles may be exchanged. For example, a high-end SSD may be used as a
cache for slow storage devices. Likewise, recently emerged storage class
memories can be used as non-volatile media to permanently store data
despite residing in the primary storage position. Table 1 compares
different computer storage technologies.
2.2.1 Memory Technology
SRAM and DRAM have been long known as the primary
technologies served as the processor’s internal cache and the
system’s main memory, respectively. Due to the nature of an SRAM cell, it
can store and retrieve information almost instantly. An SRAM cell is
composed of two back-to-back inverters. In its standby state, these two
inverters keep reinforcing each other as long as they are powered. One of
them represents the bit value, and the other holds its inverse. During a
read, a sense amplifier reads the output ports of the inverters, finds
which one has the higher voltage, and determines the stored value.
Although SRAM is almost as fast
as a logic gate circuit, its density is too low as its electronic
structure is made of at least four transistors. Additionally,
it is CMOS compatible, so, integrating SRAM cells in the
processor’s die is possible. On the other hand, a DRAM cell
comprises only one transistor and a capacitor. In contrast
with SRAM which statically keeps data, DRAM requires re-
freshing the data due to the charge leakage nature of the ca-
pacitor. The density of DRAM is much higher than SRAM,
but it is not CMOS compatible. So, integrating DRAM in
the processor’s die is not easy. Also, it requires larger pe-
ripheral circuitry for read and write operations. Since read-
ing from a DRAM cell is destructive, a write should happen
following each read to restore the data. Overall, the higher capacity and
lower cost of DRAM have made it the best candidate for the primary memory
so far. However, DRAM has hit a scaling wall because it uses electric
charge in capacitors to maintain data. As technology scales down, not
only does the reliability of a capacitor drop dramatically, but
cell-to-cell interference also appears. In addition, the active power
consumed by refresh is another challenging issue.
Many emerging technologies have been investigated to ad-
dress the scaling issue among others. Researchers have been
seeking a reliable solution for a byte-addressable and power
efficient alternative to DRAM. Spin-Transfer Torque RAM
(STT-RAM) is one of the high-performance solutions [20].
Having a fixed layer and a free layer of ferromagnetic ma-
terial, it stores bits in the form of high and low resistance
property of the fixed layer based on the spin orientation of
the free layer. Although it provides higher performance compared with
DRAM, along with non-volatility that avoids refreshing, its high cost
makes it an unaffordable option for
DRAM replacement. Its super high density and low power
consumption make it a potential candidate for on-CPU cache
technology.
Nonetheless, Phase Change Memory (PCM) is another
emerging technology which is more promising than the oth-
ers. It stores digital information in the form of resistance
levels of a phase change material which ranges from lit-
tle resistance of its crystalline state to very high resistance
of its amorphous state [26]. As shown in table 1, PCM
has a lower performance compared with DRAM, especially
in write operations. It also can endure a smaller number
of writes and requires refreshing to prevent resistance drift.
There is a body of research focusing on addressing these is-
sues [18, 67, 42].
Notwithstanding, PCM is one of the best options to be used as a storage
class memory technology and in solid-state drives. Table 1 shows the
benefits of PCM and 3D XPoint devices over NVMe drives. Connecting to the
memory bus,
they provide near DRAM performance while having a large
capacity of a storage device. This type of memory technol-
ogy is recognized as Storage Class Memory (SCM) which
can be categorized as memory type (M-SCM, Persistent
Memory, or NVM) with fast access latency and low capac-
ity (as in 3D XPoint DIMM), or storage-type (S-SCM) with
high capacity and low access latency (as in Optane SSD, see
section 2.2.2) [64].
2.2.2 Storage Technology
Besides the internal and the main memory, permanent data
should reside in some storage device to be accessed on de-
mand. For a long time, Hard Disk Drives (HDD) have been
playing this role. An HDD consists of rigid rapidly rotating
disks held around a spindle and a head which relocates using
an actuator arm. Digital data is stored in the form of tran-
sitions in magnetization of a thin film of magnetic material
on each disk. The electromechanical aspect of HDD and the
serialization of the stored data make HDD orders of magni-
tude slower than the mentioned non-volatile memory tech-
nologies. However, its low price and extremely high density
make it a good candidate for secondary and tertiary storage
levels. According to table 1, the capacity of an HDD can
be 1000x larger than DRAM while its operational latency is
roughly 10^6 times higher.
Solid State Drives (SSD) offer higher performance, shock resistance, and
compact storage at the cost of higher prices by using Flash technology. A
Flash cell consists of a MOSFET with a word-line control gate and a
floating gate. It keeps data in the form of an electrical switch in the
floating gate, which can be programmed to be on or off. Depending on
whether the MOSFETs are wired to resemble NAND or NOR logic, the device
is called a NAND-Flash or NOR-Flash SSD. A read is as simple as sensing
the bit-line while charging the word-line. However, writing to a flash
cell requires first erasing a large portion (MBs) of the storage area
using tunnel release and then placing the data with tunnel injection.

Figure 2: Hybrid storage architectures [43]
SSDs may use traditional protocols and file systems such
as SATA, SAS, NTFS, FAT32, etc. There are also some
interfaces such as mSATA, M.2, U.2, and PCIe, and some
protocols such as NVMe that are specifically designed for
SSDs. The capacity of NAND-flash based SSD ranges from
128GB to 100TB, and the performance can be up to 10GB/s.
Despite all the benefits that a NAND-Flash SSD provides, its lifespan is
limited to 10^4 writes per cell. Intel and Micron recently shipped the
Optane SSD with the new 3D XPoint technology [40], which offers a longer
lifespan and higher performance. A 3D XPoint cell preserves data based on
a change in bulk resistance [9]. Due to its stackable cross-gridded
arrangement, the density of 3D XPoint is much higher than that of
traditional non-volatile memory technologies. Intel also announced a 3D
XPoint DIMM form factor, which can provide memory bandwidth for
non-volatile storage.
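
To see why the write traffic sent to a flash device matters, a rough lifetime estimate can be sketched as follows. This is only a back-of-the-envelope model that assumes ideal wear leveling (all cells age evenly); the function name, the write-amplification parameter, and the workload figures are hypothetical values chosen for illustration and do not come from the cited works.

```python
def ssd_lifetime_years(capacity_gb, writes_per_cell, host_writes_gb_per_day,
                       write_amplification=1.0):
    """Estimate years until the write budget is spent (ideal wear leveling)."""
    # Total data that can be written before every cell reaches its limit.
    total_budget_gb = capacity_gb * writes_per_cell
    # Physical writes exceed host writes when GC rewrites valid data.
    physical_gb_per_day = host_writes_gb_per_day * write_amplification
    return total_budget_gb / physical_gb_per_day / 365.0


# Hypothetical 1 TB cache device with a limit of 10^4 writes per cell.
light = ssd_lifetime_years(1000, 10**4, host_writes_gb_per_day=1000)
heavy = ssd_lifetime_years(1000, 10**4, host_writes_gb_per_day=20000,
                           write_amplification=3.0)
print(f"light workload: {light:.1f} years, heavy workload: {heavy:.2f} years")
```

Under these assumed numbers, a cache that absorbs heavy random write traffic exhausts its write budget far sooner than one that is written lightly, which is the motivation for the write-reduction techniques discussed in section 3.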
2.2.3 Mass Storage Dilemma
The technologies mentioned above are employed at different levels of the
memory hierarchy. On the one hand, the organization of the storage system
in the hierarchy can vary based on data intensity. On the other hand, the
pace of data growth in data centers and cloud-storage service providers
compels server administrators to seek a high-performance mass storage
system, which requires data management software running on top of
networked storage devices and server machines. Therefore, choosing a
single technology to design a massive storage system is not the best
solution, and data center experts opt to develop hybrid storage systems
[43]. Figure 2 depicts the overall categories of hybrid storage
architectures. In this study, we focus on host-managed tiering and
caching methods.
The exponentially expanding digital information requires
fast, reliable, and massive data centers to not only archive
data, but also rapidly process them. So, high-performance
and large capacity are both required. However, the portion
of digital information with different values may not be even.
The IDC report [58] predicts that, doubling every two years, the size of
the digital universe might exceed 44 zettabytes (1 zettabyte ≈ 2^70
bytes) by 2020. This tremendous amount of information is not touched
equally. While the cloud will touch only 24% of the digital universe by
2020, 13% will be stored in the cloud, and 63% may not be touched at all
[58]. The share of data that requires protection, which is more than 40%,
is growing even faster than the digital universe itself. So, the majority
of data usually resides in cheaper, more reliable, and larger devices,
and the minority of it, which is still not processed, is kept in fast
storage media. Therefore,
a hybrid storage system with a caching/tiering mechanism
will be undoubtedly required.
3 Storage Caching Solutions
With the aim of alleviating the long latency of slow devices,
a caching mechanism can be used in a hybrid storage sys-
tem. There are two main principles in caching subsystems:
1) while keeping the original data in the moderate levels of
the hierarchy, a copy of under-processing data resides in the
cache; and 2) the lifetime of data in the cache layer is short,
and it is meant to be temporary. The performance of the
storage system with caching is chiefly influenced by four fac-
tors [43]:
1. Data allocation policy essentially controls the data flow
and determines the usefulness of the cache, accordingly.
The distribution of the data among multiple devices is
reflected by the caching policy, such as read-only, write-
back, etc.
2. The translation, depending on its mechanism, may also
influence the performance. In a hybrid storage system,
the same data may be kept in different locations in mul-
tiple devices, and each copy of the data should be ad-
dressable. The address translation mechanism is impor-
tant to be fast for data retrieval, and compact for meta-
data space usage.
3. An accurate data hotness identification method is necessary
for better cache utilization. It helps to prevent cache pollution
with unnecessary data and consequently improves the overall
performance by instantly providing hot data.
4. The cache usage efficiency is another important fac-
tor which is influenced by the scheduling algorithm for
managing the queues, synchronization, and execution
sequence.
A caching mechanism can be managed either in hardware by the device or in
software by the host operating system (see figure 2). Device-managed
caching systems are beyond the scope of this study, so we focus on
host-managed methods. With a host-managed caching mechanism, the host may
use separate devices to enhance performance. One of the most common cases
is using an SSD as a cache because of its high performance compared with
slow HDDs and its high capacity compared with DRAM. Besides SSDs,
emerging Non-Volatile Memory (NVM) devices are expected to play a role in
storage caching mechanisms. In this section of the paper, we discuss a
few storage caching techniques that use either SSD or SCM as a storage
cache.

Figure 3: Dataflows in SSD caches
3.1 SSD as a Cache
Covering the performance gap between the disk drive and the main memory,
SSD devices have been widely used for caching slow drives. Figure 3 shows
common data flows in a caching system using an SSD device. Path 1 occurs
when a read request completes within the SSD cache without involving the
HDD. If the requested block is not in the SSD, the HDD may be accessed to
retrieve the data into DRAM via path 2, and if the block is identified as
hot, it is cached in the SSD via path 3. A background process that
performs hot data identification may migrate data from the HDD to the SSD
via path 4 even if the data has not been requested. A flush command or a
write-back can copy dirty blocks back to the HDD via path 5. A write
operation may be completed directly in the SSD when the block is already
there, as in path 6, and depending on whether the cache uses a
write-through or write-back policy, the dirty block can be copied to the
HDD via path 7. The write-through policy is obsolete in SSD caches, as it
was designed for volatile caches in which dirty blocks must be persisted
at some point. With the read-only or write-around policy, all new write
operations are performed directly on the HDD via path 8. Data may flow
through any of these paths, depending on the caching policy.
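
The numbered data flows of figure 3 can be mimicked with a toy model. The sketch below is not taken from any surveyed system: the class name `HybridCache`, the `hot_threshold` parameter, and the access-count hotness rule are assumptions made for illustration. As a simplification, it allocates every write in the SSD (not only writes to blocks already cached), uses flow 7 for the immediate write-through copy, and uses flow 5 for the deferred flush.

```python
from collections import Counter


class HybridCache:
    """Toy model of an SSD cache in front of an HDD (hypothetical names)."""

    def __init__(self, policy="write-back", hot_threshold=2):
        assert policy in ("write-back", "write-through", "read-only")
        self.policy = policy
        self.hot_threshold = hot_threshold  # assumed rule: hot after N accesses
        self.hdd = {}          # block id -> data
        self.ssd = {}          # block id -> data
        self.dirty = set()     # SSD blocks newer than their HDD copy
        self.hits = Counter()  # per-block access counter
        self.trace = []        # flow numbers taken, for inspection

    def read(self, block):
        self.hits[block] += 1
        if block in self.ssd:                      # flow 1: served by the SSD
            self.trace.append(1)
            return self.ssd[block]
        data = self.hdd.get(block)                 # flow 2: HDD -> DRAM
        self.trace.append(2)
        if data is not None and self.hits[block] >= self.hot_threshold:
            self.ssd[block] = data                 # flow 3: cache the hot block
            self.trace.append(3)
        return data

    def migrate(self, block):
        if block in self.hdd and block not in self.ssd:
            self.ssd[block] = self.hdd[block]      # flow 4: background migration
            self.trace.append(4)

    def write(self, block, data):
        self.hits[block] += 1
        if self.policy == "read-only":
            self.ssd.pop(block, None)              # discard the stale SSD copy
            self.hdd[block] = data                 # flow 8: write around the SSD
            self.trace.append(8)
            return
        self.ssd[block] = data                     # flow 6: write lands in the SSD
        self.trace.append(6)
        if self.policy == "write-through":
            self.hdd[block] = data                 # flow 7: copy to the HDD now
            self.trace.append(7)
        else:
            self.dirty.add(block)                  # write-back: defer the copy

    def flush(self):
        for block in sorted(self.dirty):           # flow 5: write back dirty data
            self.hdd[block] = self.ssd[block]
            self.trace.append(5)
        self.dirty.clear()
```

Inspecting `trace` after replaying a sequence of reads and writes shows which devices each policy touches, for example that a read-only cache never sends a write to the SSD except when a hot block is promoted or migrated.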
3.1.1 SSD as a Read-Only Cache
Upon arrival of a new write request in a read-only cache ar-
chitecture [34, 55, 10, 68, 65, 39] where the accessing block
is not located in SSD, the request is completed by successfully recording
it to HDD via path 8. When the block is already cached in SSD for
priority read operations, the request is considered complete only after
the HDD copy of the data has been updated and the SSD copy has been
discarded. This kind
of cache architecture helps the durability of the SSD device
as the writing traffic to the SSD is limited to fetching data
from HDD. Meanwhile, the cache space can be better uti-
lized for reading operations, and it might improve the over-
all read performance which, unlike write operations, is on the
critical path. However, the SSD lifespan is still vulnerable to
the cache updating policy. If the data selection is not accu-
rate enough, the cache might be polluted with unnecessary
data, and a Garbage Collection (GC) process or a replace-
ment mechanism should run to make space for demanding
data. This process may incur a write overhead to the SSD
and reduce the lifespan.
The replacement algorithm is essential to alleviate the
writing pressure on SSD cache. Section 3.3 will discuss
more on common algorithms. Besides, the block hotness
identification also greatly affects the SSD lifespan. MOLAR [34]
determines data hotness based on a control metric of demotion count and
wisely places the blocks evicted from the tier-1 cache (DRAM) into the
tier-2 cache (SSD). Using the I/O patterns of applications on an HPC
system, [68] proposes a heuristic file-placement algorithm to improve
cache performance. Since applications on an HPC system are more
mechanized than end-user applications, which have unpredictable I/O
patterns, assuming foreknown patterns is not far from realistic. To
understand the I/O pattern, a distributed caching middleware detects and
manipulates the frequently accessed blocks at the user level.
3.1.2 SSD as a Read-Write Cache
Because SSDs are non-volatile, SSD caches do not use a write-through
policy to keep the original data up-to-date, in contrast with DRAM
caches, which are volatile and need to be synchronized or persisted. So,
an SSD R/W cache may employ only a write-back or flushing mechanism.
Using an SSD as an R/W cache to improve the performance of both read and
write operations is very common [21, 28, 35]. In such architectures, new
writes are performed in the SSD cache, as shown by path 6 in figure 3,
and they are written back to the disk later. Since there are two versions
of the same data, a periodic flush operation usually runs to prevent data
synchronization problems. Although an R/W SSD cache normally improves
storage performance, when the cache is nearly full it triggers the GC
process to clean invalid data, which may interfere with the main process
and degrade performance. Meanwhile, if the workload is write-intensive
with a small ratio of data reuse, the HDD may be under a heavy write load
that prevents the disk from having long idle periods. This hinders the
SSD flushing process and imposes extra performance overhead on the
system. However, an SSD can keep data permanently, so flushing all
written data is not necessary, and a write-back cache policy can
therefore improve storage performance. Nevertheless, an occasional flush
operation, at the cost of a small performance degradation, is required to
guard against SSD failure. Furthermore, the limited write endurance of
SSDs is another issue, which is more problematic in R/W caches than in
read-only caches. Notice that random writes on an SSD device are roughly
ten times slower than sequential writes and cause excessive internal
fragmentation. Many algorithms [7, 21, 10, 33] and architectures
[65, 35, 44] have been designed to alleviate the write traffic and
control the GC process in SSD caches.
Random Access First (RAF) [35] cache management extends the SSD lifespan
by splitting the SSD into a read cache and a write cache. The former
maintains random-access data evicted from the file cache with the aim of
reducing flash wear and write hits. The latter is a circular
write-through log that responds to write requests faster and performs
garbage collection. A monitoring module in the kernel intercepts
page-level operations and sends them to a dispatcher, a user-level daemon
that performs random-access data detection and distributes the operations
among the caches. In [44], balancing the read and write traffic in two
different parts of the cache benefits both performance and SSD lifespan.
These parts can use different technologies such as DRAM, NVM, or SSD. In
section 3.1.1, we described the SSD as an RO cache, in which case the
write traffic may go to the DRAM cache. In other designs, the SSD may be
used as a write cache for the HDD.
3.1.3 SSD Caches in Virtualization Environments
In a virtualization environment with multiple Virtual Machines (VM)
running with different I/O patterns, the randomness of write operations
is a pain point for flash SSDs. To reduce the number of random writes,
[28] proposes a cache scheme that adopts the idea of log-structured file
systems at the virtual disk layer and converts random writes into
sequential writes. Leveraging Sequential Virtual Disks (SVD) in a home
cloud server environment with multiple VMs, in which synchronous random
writes dominate, it uses the SSD drives in a completely sequential manner
to prolong their lifespan while improving performance. vCacheShare [39]
is an SSD cache architecture for a virtual cluster that simply skips the
SSD cache for write operations. By tracing the I/O traffic of each
virtual disk and analyzing it periodically, vCacheShare optimally
partitions the SSD cache among the virtual disks.
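
The log-structured idea of turning random writes into sequential ones can be illustrated with a minimal sketch. This is not the SVD implementation of [28]; the class `SequentialVirtualDisk` and its methods are hypothetical, and cleaning of stale log entries is left out.

```python
class SequentialVirtualDisk:
    """Toy log-structured virtual disk (hypothetical, for illustration)."""

    def __init__(self):
        self.log = []     # append-only physical log on the SSD
        self.index = {}   # logical block address -> position in the log

    def write(self, lba, data):
        # Whatever the logical address is, the physical write is an append.
        position = len(self.log)
        self.log.append((lba, data))
        self.index[lba] = position   # the newest version of the block wins
        return position

    def read(self, lba):
        position = self.index.get(lba)
        return None if position is None else self.log[position][1]

    def live_fraction(self):
        """Share of log entries still referenced; the rest awaits cleaning."""
        return len(self.index) / len(self.log) if self.log else 1.0
```

Writes to scattered logical addresses land at consecutive log positions, at the cost of an index lookup on reads and of stale entries that a cleaner must eventually reclaim.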
3.1.4 Deduplication in SSD Caches
For expanding SSD’s lifetime, deduplication is one of the
most effective ways. Some research studies [10, 7] prevent
writing data to the SSD drive if the contents are already cached. For
instance, [7] reduces the number of writes to the SSD by avoiding
duplicated data in a virtualization environment, in which the high
integration of VMs can introduce a lot of data duplication. Using a hash
function (SHA-1), a data signature is calculated when data is fetched
after a cache miss; if the signature is already in the cache, the address
is mapped to the existing content, which saves one write operation.
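
A minimal sketch of this signature check is given below. It only illustrates the general idea; the class `DedupCache`, its fields, and the write counter are hypothetical and are not the data structures of [7].

```python
import hashlib


class DedupCache:
    """Toy content-addressed SSD cache (hypothetical, for illustration)."""

    def __init__(self):
        self.by_signature = {}   # SHA-1 hex digest -> cached content
        self.address_map = {}    # block address -> SHA-1 hex digest
        self.ssd_writes = 0      # number of writes actually sent to the SSD

    def insert_after_miss(self, address, content):
        signature = hashlib.sha1(content).hexdigest()
        if signature not in self.by_signature:
            self.by_signature[signature] = content   # new content: one SSD write
            self.ssd_writes += 1
        # Duplicate content only needs a mapping update, not a data write.
        self.address_map[address] = signature
        return signature

    def lookup(self, address):
        signature = self.address_map.get(address)
        return None if signature is None else self.by_signature[signature]
```

When two virtual disks fetch identical blocks, the second fetch updates only the address map, so `ssd_writes` grows with the number of distinct contents rather than with the number of cached addresses.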
CacheDedup [32] is an in-line deduplication mechanism
for Flash caching in both client machines and servers.
This design is complementary to Nitro [31] with a set of
modifications on the architecture and the algorithms for
deduplication-aware cache management. Deduplication not only improves
the utilization of the cache space but also helps to increase the hit
ratio. Additionally, since flash devices have limited write endurance, it
delays wearing out the device by avoiding excessive writes due to
duplicate data.
Figure 4: Architecture of CacheDedup [32]
As shown inf figure 4, CacheDedup is composed of two
data structures: Metadata Cache and Data Cache. The Meta-
data Cache maintains the information for tracking the foot-
prints of source addresses in the primary storage. It has a
source address index table and a footprint store. When a
read/write operation comes up, the corresponding footprint
index is obtained from the source-to-index mapping, and
then the footprint-to-block-address mapping gives the block
address of corresponding contents in the Data Cache. Since
the source-to-cache address space has a many-to-one
relationship due to the elimination of duplicate data, the size
of the mappings is not bounded by the size of the cache. Also,
to avoid re-fetching data from the primary store, CacheDedup
keeps the historical fingerprints of blocks that have already
been evicted. CacheDedup can be deployed on both client and
server machines. On a client machine, it can better hide the
network I/O for duplicate data and hence improve application
performance. On the server side, multiple clients may request
the same data, and CacheDedup can help with data reduction.
Note that on the server side a cache coherence protocol over
the network is needed to maintain data consistency. Although
the proposed design is described entirely in software, the
authors claim that it can also be embedded in the Flash
Translation Layer of the hardware device. The described
system works at the block I/O level and refers to source block
addresses, but it can also be used at the file system level
with (file handle, offset) tuples. One of the main parts of the
design is the replacement algorithm. There are two algorithms:
D-LRU and D-ARC; the details can be found in section 3.3.
D-ARC is more complicated than D-LRU, but its scan-resistant
nature prevents single-access data from polluting the cache.
Although both algorithms can be used in CacheDedup, D-ARC
achieves better performance while D-LRU is simpler. Both
algorithms have the no-cache-wastage property, i.e., they do
not allow orphaned addresses and orphaned data blocks at the
same time. The study shows improvements in cache hit ratio,
I/O latency, and the number of writes sent to the cache device.
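The two-level lookup described above can be illustrated with a minimal sketch. It is not the CacheDedup implementation: the class name, the use of SHA-1 as the fingerprint, and the absence of any eviction policy are assumptions made only to show how a many-to-one mapping saves writes to the cache device.

```python
import hashlib


class DedupCache:
    """Toy two-level cache: source address -> fingerprint -> cached block."""

    def __init__(self):
        self.addr_to_fp = {}    # Metadata Cache: source address index
        self.fp_to_block = {}   # Data Cache: one stored copy per fingerprint
        self.device_writes = 0  # writes that actually reach the cache device

    def write(self, addr, data):
        # SHA-1 is an assumed fingerprint function for this sketch.
        fp = hashlib.sha1(data).hexdigest()
        if fp not in self.fp_to_block:
            self.fp_to_block[fp] = data   # new content: one device write
            self.device_writes += 1
        self.addr_to_fp[addr] = fp        # duplicates only update metadata

    def read(self, addr):
        fp = self.addr_to_fp.get(addr)
        if fp is None:
            return None                   # miss: address not tracked
        return self.fp_to_block.get(fp)


cache = DedupCache()
cache.write(10, b"A" * 4096)
cache.write(20, b"A" * 4096)   # same content under a different address
cache.write(30, b"B" * 4096)
```

Three source addresses are tracked here, but only two blocks are written to the device, which is the effect the paragraph attributes to deduplication.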
3.1.5 SSD as a Cache for SMR Drives
Although random writes to an SSD are slower than sequential
writes, they are still an order of magnitude faster than random
writes to a Shingled Magnetic Recording (SMR) HDD. Therefore,
to benefit from the high capacity and low $/GB of SMR drives
and the high performance of SSDs, a hybrid storage system may
redirect all random writes to the SSD cache and leave the
sequential writes to the SMR drive, as in [62, 60, 36].
Figure 5: A magnetic disk in an SMR device while new data is
being written to a whole band. (Figure labels: reading head,
writer head, track, band, guard band, ongoing write, persistent
write, buffer/cache, actuator arm, magnetic disk.)
The writing mechanism is depicted in figure 5. In an SMR
drive, writing to a magnetic track partially overlaps a
previously written track and makes it narrower to provide
higher density. This is because of the physical limitations of
the writing head, which is not as delicate as the reading head
and leaves a wider trail while writing onto a magnetic disk.
Consequently, a random write destroys adjacent tracks, and
they all must be rewritten by a read-modify-write (RMW)
operation. In the SMR architecture, a band is a group of
consecutive tracks separated from adjacent bands by a
narrow guard band. A random write requires an RMW operation
on a whole band. So, a persistent cache, which is either a
flash buffer or some non-overlapped tracks on the magnetic
disk, is used to buffer writes before writing them to the
corresponding band. HS-BAS [62] is a hybrid storage system
based on the band-awareness of SMR disks; it improves the
performance of a Shingled Write Disk (SWD, i.e., an SMR disk)
under sustained random writes by using an SSD as a write cache
for the SWD. To make use of SWD devices in a RAID system,
[36] proposes three architectural models that use SSDs as
caches for SWDs. With this option, a RAID system may provide
more storage capacity at lower cost with the same or even
slightly better performance. The Partially Open Region for
Eviction (PORE) [60] caching policy is another use of an SSD
as a cache for SMR devices. For replacement decisions, it
considers, in addition to popularity, the SMR write
amplification caused by a wide Logical Block Address (LBA)
range. To put it simply, the SSD handles random writes and
flushes them sequentially to the SMR device.
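As a rough illustration of why buffering helps, the sketch below compares the number of band-level RMW operations with and without an SSD write cache. The band size of 1024 blocks and the helper name flush_order are assumptions for this example; none of the cited systems is reproduced here.

```python
def flush_order(cached_lbas, blocks_per_band):
    """Group buffered LBAs by band so that each band is rewritten once."""
    bands = {}
    for lba in cached_lbas:
        bands.setdefault(lba // blocks_per_band, []).append(lba)
    # Bands in ascending order, blocks sorted within each band.
    return [(band, sorted(lbas)) for band, lbas in sorted(bands.items())]


BLOCKS_PER_BAND = 1024                  # assumed band size, in blocks
writes = [5, 1030, 7, 1029, 6, 2050]    # random writes arriving at the system

# Without a cache, every random write triggers an RMW of its whole band.
direct_rmw = len(writes)

# With an SSD cache, writes are absorbed first and flushed band by band.
plan = flush_order(writes, BLOCKS_PER_BAND)
cached_rmw = len(plan)
```

In this toy trace, six random writes touch only three bands, so a band-aware flush halves the number of whole-band rewrites.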
3.2 NVM Storage Cache
The advent of NVM technologies, as described in section
2.2.1, allows persistent data operations at near-DRAM
latencies, which are an order of magnitude lower than those of
SSDs. A study [29] on using NVM as an I/O cache for SSD or HDD
reveals that current I/O caching solutions cannot fully benefit
from the low latency and high throughput of NVM. Recent
research has tried to overcome the complexity of using NVM as
a Direct Access (DAX) storage device and of using it as a cache
for SSD/HDD [4, 12, 24, 61]. In recent years, Intel has
released the Persistent Memory Development Kit (PMDK) [1],
which provides several APIs for accessing persistent memory
directly from user level. NVM Bankshot [4] is a user-level
library that exposes the NVM to applications by implementing
caching functions and bypasses the kernel to lower the hit
latency. However, PMDK outperforms Bankshot in many ways as it
is more recent. Most NVM technologies can endure orders of
magnitude more writes than NAND SSDs, but their endurance is
still limited. They also provide in-place byte-granularity
updates, which are much faster than the RMW operations in SSDs.
With these features, most DRAM caching policies can be used for
NVM-based caches, with significant modifications to carefully
manage the write traffic.
Hierarchical ARC (H-ARC) [12] is an NVM-based cache that
optimizes the ARC algorithm to take four states (recency,
frequency, dirty, and clean) into account. It first splits the
cache into dirty-page and clean-page caches and then splits
each part into recency-page and frequency-page caches. Based
on a mechanism similar to ARC (see section 3.3), it adapts the
size of each section hierarchically at each level. As a result,
H-ARC keeps dirty pages with higher frequency longer in the
cache. I/O-Cache [13] also uses NVM as a buffer cache for HDDs
and coalesces multiple dirty blocks into a single sequential
write. This technique is also used in many other NVM-based
designs [25, 69]. Transactional NVM disk Cache (Tinca) [61]
aims to achieve crash consistency through transactional
support while avoiding double writes by exploiting an
NVM-based disk cache. Leveraging the byte addressability of
NVM, Tinca maintains fine-grained cache metadata to enable
copy-on-write (COW) while writing a data block. Tinca also
uses a role-switch method in which each block has a role and
can be either a log block in an ongoing committing transaction
or a buffer block in a completed transaction. With the COW and
role-switch mechanisms, Tinca supports a commit protocol that
coalesces and writes multiple blocks in a single transaction.
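The coalescing of dirty blocks into sequential writes can be sketched as follows. This is a generic illustration rather than the I/O-Cache algorithm itself; the function name and the (start, length) run representation are assumptions.

```python
def coalesce(dirty_blocks):
    """Merge dirty block numbers into (start, length) runs of adjacent blocks."""
    runs = []
    for block in sorted(set(dirty_blocks)):
        if runs and block == runs[-1][0] + runs[-1][1]:
            # Block directly follows the last run: extend it by one.
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            runs.append((block, 1))
    return runs


# Six scattered dirty blocks become two sequential writes to the HDD.
runs = coalesce([7, 3, 4, 5, 9, 8])
```

Each run can then be issued as one sequential write instead of several small random ones.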
3.3 Cache Replacement Algorithms
To keep the most popular blocks in the cache, several
general-purpose and domain-specific algorithms have been
designed. In general, the majority of these algorithms are
based on two empirical assumptions: temporal locality and
skewed popularity [21]. The former assumes that recently used
blocks are likely to be requested again shortly. The latter
supposes that some blocks are accessed more frequently than
others. Accordingly, the well-known Least-Recently-Used (LRU)
and Least-Frequently-Used (LFU) mechanisms were created and
are commonly used for data replacement in caches because of
their simplicity and O(1) overhead. Unlike CPUs, storage
applications may not be interested in temporal locality, since
the page cache in DRAM already manages the locality
adequately. Also, a simple search operation over the entire
storage space may flush all popular blocks from the cache and
replace them with seldom-accessed ones. Many more advanced
algorithms, most of them general-purpose, have been proposed
to address this issue.
3.3.1 General Purpose Algorithms
The Frequency-Based Replacement (FBR) [48] algorithm
benefits from both LRU and LFU. It keeps the LRU ordering and
decides primarily upon the frequency count of the blocks in a
section. Its complexity ranges from O(1) to O(log_2 n)
according to the section size. Using aggregated recency
information to recognize block referencing behavior, Early
Eviction LRU (EELRU) [52] aims to provide an on-line adaptive
replacement method for all reference patterns. It performs LRU
unless many recently fetched blocks have just been evicted. In
that case, a fallback algorithm evicts either the LRU block or
the e-th MRU one, where e is a pre-determined recency
position. The Low Inter-reference Recency Set (LIRS) [23]
algorithm takes reuse distance as a metric for dynamically
ranking accessed blocks. It divides the cache into a Low
Inter-reference Recency (LIR) set
for the most highly ranked blocks and a High Inter-reference
Recency (HIR) set for the other blocks. When an HIR block is
accessed, it moves to the LIR set, and when the LIR set is
full, its lowest ranked block becomes the highest ranked HIR
block. With the aim of removing cold blocks quickly, 2Q [50]
uses one FIFO queue, A1in, and two LRU lists, A1out and Am. A
block accessed for the first time enters A1in, and upon
eviction, it goes to A1out. Reusing the block promotes it to
Am. Similarly, the Multi-Queue [70] algorithm uses multiple
LRU queues Q_0, ..., Q_{m-1}, where the block lifetime in Q_j
is longer than in Q_i (i < j), as a block in Q_i has been hit
at least 2^i times. Adaptive Replacement Cache (ARC) [38]
divides the cache space into T1 and T2, where T1 stores blocks
accessed only once and T2 keeps the rest of the blocks. Two
ghost caches, B1 and B2, maintain the identifiers of blocks
evicted from T1 and T2, respectively. Using B1 and B2, the
sizes of T1 and T2 are dynamically adjusted by a dividing
point P, which is tuned according to hit rates to balance
recency and frequency.
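To make the queue movement in 2Q concrete, here is a simplified sketch that follows the description above. It assumes that A1out stores identifiers only and that the three queue sizes are fixed parameters chosen by the caller; the tuning rules of the original algorithm are not reproduced.

```python
from collections import OrderedDict, deque


class TwoQ:
    """Simplified 2Q: FIFO A1in, identifier list A1out, LRU list Am."""

    def __init__(self, in_size, out_size, main_size):
        self.a1in = deque()         # blocks seen once, FIFO order
        self.a1out = OrderedDict()  # identifiers of blocks evicted from A1in
        self.am = OrderedDict()     # blocks that were reused, LRU order
        self.in_size = in_size
        self.out_size = out_size
        self.main_size = main_size

    def access(self, block):
        """Return True if the block is resident (in A1in or Am)."""
        if block in self.am:
            self.am.move_to_end(block)       # refresh recency in Am
            return True
        if block in self.a1in:
            return True                      # FIFO position is unchanged
        if block in self.a1out:
            del self.a1out[block]            # reuse detected: promote to Am
            self.am[block] = True
            if len(self.am) > self.main_size:
                self.am.popitem(last=False)  # evict the LRU block of Am
            return False
        self.a1in.append(block)              # first access goes to A1in
        if len(self.a1in) > self.in_size:
            old = self.a1in.popleft()
            self.a1out[old] = True           # remember the evicted identifier
            if len(self.a1out) > self.out_size:
                self.a1out.popitem(last=False)
        return False


q = TwoQ(in_size=2, out_size=4, main_size=2)
for b in ["a", "b", "c"]:    # "a" is pushed out of A1in into A1out
    q.access(b)
q.access("a")                # reuse promotes "a" to Am
for b in ["d", "e", "f"]:    # a scan of cold blocks only churns A1in
    q.access(b)
hit_after_scan = q.access("a")
```

The scan of one-time blocks passes through A1in without displacing the reused block in Am, which is the behavior the paragraph refers to as removing cold blocks quickly.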
3.3.2 Domain Specific Algorithms
Based on the write performance of SSDs and their lifetime
issue, SSD caches usually consider two factors: 1) keeping
dirty pages longer in the cache to avoid fetching a page more
than once, and 2) avoiding cache space pollution with
unpopular blocks. Clean First LRU (CFLRU) [45] splits the
cache space into a clean-page cache and a dirty-page cache,
and evicts only from the clean-page cache unless there is no
clean page left. This basic algorithm tries to keep dirty
pages longer in the cache, but it does not skip one-time-access
pages. Lazy ARC (LARC) [21] is designed explicitly for SSD
caches to prevent write overheads and prolong the SSD
lifespan. It filters out seldom-accessed blocks and skips
caching them. Similar to 2Q and ARC, it relies on the
observation that blocks that have recently been hit at least
twice are more likely to be popular. It has a ghost cache that
keeps the identifiers of first-accessed blocks. If a block in
the ghost cache is accessed again, it is considered popular and
placed in the cache. Since it prevents unnecessary writes to
the SSD, it can also be categorized as a data hotness
identification method. The Second-level ARC (L2ARC) [15, 30]
is also optimized for SSD caches, as it reduces the number of
writes to the device. It has been used in the Solaris ZFS file
system. It uses the SSD as a second-level cache behind the
in-DRAM ARC cache and periodically fills it with the most
popular contents of the DRAM cache. At the cost of a large
space overhead, SieveStore [46] keeps the miss count of every
block in the storage system and only allows blocks with large
miss counts to be cached in the SSD. Similar algorithms are
used in some enterprise products such as Intel Turbo Memory
[37].
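The ghost-cache filtering idea behind LARC can be sketched in a few lines. The sketch assumes a fixed-size ghost list and plain LRU replacement inside the SSD cache; the adaptive sizing of the ghost cache in the published algorithm is omitted, and the class and counter names are hypothetical.

```python
from collections import OrderedDict


class LazyCache:
    """Admit a block to the SSD cache only on its second recent access."""

    def __init__(self, cache_size, ghost_size):
        self.cache = OrderedDict()  # blocks resident on the SSD, LRU order
        self.ghost = OrderedDict()  # identifiers of blocks seen once
        self.cache_size = cache_size
        self.ghost_size = ghost_size
        self.ssd_writes = 0         # admissions, i.e., writes to the SSD

    def access(self, block):
        """Return True on an SSD cache hit."""
        if block in self.cache:
            self.cache.move_to_end(block)
            return True
        if block in self.ghost:
            # Second access: the block is considered popular and admitted.
            del self.ghost[block]
            if len(self.cache) >= self.cache_size:
                self.cache.popitem(last=False)
            self.cache[block] = True
            self.ssd_writes += 1
        else:
            # First access: remember the identifier only, no SSD write.
            self.ghost[block] = True
            if len(self.ghost) > self.ghost_size:
                self.ghost.popitem(last=False)
        return False


lazy = LazyCache(cache_size=2, ghost_size=8)
for b in [1, 2, 3, 4, 5]:          # one-time accesses never reach the SSD
    lazy.access(b)
writes_after_scan = lazy.ssd_writes
lazy.access(1)                      # second access admits block 1
hit = lazy.access(1)
```

A stream of one-time accesses leaves the SSD untouched, so write traffic is spent only on blocks that have shown some reuse.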
Similar to ARC, Duplication-aware ARC (D-ARC) [7, 32]
consists of four LRU caches. D-ARC uses cache block contents,
or fingerprints, instead of addresses. Based on high or low
levels of dirty ratio and reference count, it partitions data
into four groups and always evicts the least referenced and
dirtiest cache blocks. Hence, the removed block is more likely
the most unpopular one, which is not going to be reused in the
near future, and a block is not ejected from the SSD until it
is no longer required. This reduces the write bandwidth to the
SSD device and saves the extra writes caused by false
evictions. To the same end, the Expiration-Time Driven (ETD)
[10] cache algorithm delays the eviction of a cache block
until its expiration time: instead of updating the cache on a
miss, it evicts a block when it expires and then chooses a
replacement from a list of candidate blocks. D-LRU [32] is a
duplication-aware LRU that consists of two separate LRU
policies. First, it inserts the address x into the Metadata
Cache using LRU. Second, the corresponding fingerprint of
address x is inserted into the Data Cache using the other
LRU.
PORE [60] is another domain-specific policy, which is
beneficial in SSD-SMR hybrid storage systems. It splits the
SMR LBA range into Open and Forbidden regions. The replacement
policy may only evict dirty blocks in the open region. The
written-back blocks are stored in the SMR write buffer or
persistent cache for subsequent writing to the corresponding
band. The open region is changed periodically so that it
eventually covers all dirty blocks across the SMR LBA range.
This algorithm helps to avoid writing to random bands, which
significantly degrades performance.
4 Storage Tiering Solutions
In the past, high-end and low-end HDDs were used as the
performance and the capacity tiers, respectively. Nowadays,
many types of storage media with different characteristics
and capacities are used in a multi-tiered storage system. The
main difference between caching and tiering is that in a
caching system, a copy of data is kept in the cache while in
a tiering system, the original data migrates between multiple
tiers via two operations of promotion and demotion. Data is
classified based on the application needs and characteristics
of available tiers, usually into hot and cold. The hot data re-
sides in the performance tier leaving the cold data to stay in
the capacity tier. Considering multiple factors such as ran-
domness, transfer speed, etc., there might be more than two
tiers.
Figure 6 illustrates a general storage tiering mechanism
which consists of four phases. In the data collection phase,
the system gathers required information for decision making.
The application's IO pattern profile can be obtained either
online or offline. An online profiling module collects IO
information while the application is running, at the cost of
potential performance overhead. This mechanism is useful when
a user is involved, as in personal computers or virtualized
cloud systems. An offline profiling
Figure 6: General storage tiering phases
module, on the other hand, obtains the application IO profile
before the application runs. This kind of profiling is
suitable for cluster analytics applications in which nothing
random interferes with the IO path except the running
applications, which are predictable. Other information, such
as application/system specifications, can also be fed into the
tiering algorithm by the user or the machine all at once. In
the analysis phase, the system evaluates several possible
plans or models and generates a list of recommendations in the
form of a solution space. Some tiering algorithms may skip
this phase by directly finding the answer with some analysis.
The solution space consists of several estimations under
different circumstances evaluated by a cost function or a
performance model (e.g., running a particular application or
the whole system under a particular distribution of data among
the tiers). Each solution comes with a cost estimation, which
is later used for decision making in the next phase. In that
decision phase, a sorting algorithm might suffice for deciding
which migration plan is worth taking. According to the goals,
the score of each plan, and its cost, a tiering algorithm
determines whether to migrate a chunk of data and in which
direction.
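A decision step based on sorting can be sketched as below. The Plan fields, the net-gain score (benefit minus cost), and the migration budget are all hypothetical choices for illustration; actual tiering systems define their own scores and constraints.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    chunk: str        # identifier of the data chunk (hypothetical)
    direction: str    # "promote" or "demote"
    benefit: float    # estimated gain from the analysis phase
    cost: float       # estimated migration cost


def choose_migrations(plans, budget):
    """Rank plans by net gain and accept them while the cost budget allows."""
    ranked = sorted(plans, key=lambda p: p.benefit - p.cost, reverse=True)
    chosen, spent = [], 0.0
    for plan in ranked:
        if plan.benefit <= plan.cost:
            break                      # remaining plans are not worth taking
        if spent + plan.cost <= budget:
            chosen.append(plan)
            spent += plan.cost
    return chosen


plans = [
    Plan("a", "promote", benefit=10.0, cost=2.0),
    Plan("b", "demote", benefit=3.0, cost=5.0),
    Plan("c", "promote", benefit=6.0, cost=3.0),
]
small = [p.chunk for p in choose_migrations(plans, budget=4.0)]
large = [p.chunk for p in choose_migrations(plans, budget=5.0)]
```

With the smaller budget only the best plan fits; with the larger one the second-ranked plan is accepted as well, while the plan whose cost exceeds its benefit is never taken.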
4.1 SSD as a Performance Tier
A comprehensive study on available storage types is provided
in [24]. It compares the Micron all-PCM SSD prototype with an
eMLC flash SSD in terms of performance and evaluates it as a
promising option for tiered storage systems. Using a
simulation methodology with estimated or measured performance
characteristics of each device, it tests every possible
combination of PCM SSD, eMLC SSD, and HDD. Although Optane
SSDs from Intel and Micron are now available on the market and
are known to offer much better performance than the outdated
all-PCM SSD prototype, this paper assumes that a write
operation on the PCM SSD is 3.5x slower than on the eMLC SSD.
With this assumption and a very simple IOPS-based dynamic
tiering algorithm, the authors show the benefits of using PCM
SSDs in a multi-tiered storage system as an enterprise
solution under a variety of real-world workloads.
Online OS-level Data Tiering (OODT) [49] efficiently
distributes data among multiple tiers based on the access
frequency, data type (metadata or user data), access pattern
(random or sequential), and the read/write operation
frequency. Using a weighted priority function, OODT sorts data
chunks for each tier based on their degree of randomness, read
ratio, and request type. OODT handles fixed-size requests
(4KB). A larger request is broken into several small
sub-requests, which are treated independently in a module
called the dispatcher. By using a mapping table, every data
chunk can be traced to its tier number and physical block
index. To enable online migration, OODT gathers block
statistics and keeps them in an access table, which is kept up
to date by the dispatcher. The most important part of OODT,
and of every other tiering scheme, is the priority computation
(referred to as scoring, sorting, or matching in other
techniques), which determines the matching tier for each piece
of data. Using a simple weighted linear formula with four
inputs, P_access, P_random, P_read, and P_metadata, OODT
calculates the priority for potential migrations.
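The dispatcher's request splitting and a weighted linear priority can be sketched as follows. The weights are placeholders chosen for this example, since the text does not give the values used by OODT, and the function names are hypothetical.

```python
CHUNK = 4096  # fixed request size handled by the tiering layer (4KB)


def split_request(offset, length, chunk=CHUNK):
    """Break a request into sub-requests that never cross a chunk boundary."""
    subs = []
    end = offset + length
    while offset < end:
        step = min(chunk - offset % chunk, end - offset)
        subs.append((offset, step))
        offset += step
    return subs


def priority(p_access, p_random, p_read, p_metadata,
             weights=(0.4, 0.3, 0.2, 0.1)):
    """Weighted linear priority; the weights are assumed, not OODT's."""
    w_access, w_random, w_read, w_metadata = weights
    return (w_access * p_access + w_random * p_random
            + w_read * p_read + w_metadata * p_metadata)


subs = split_request(1000, 4096)            # unaligned 4KB request
hot = priority(0.9, 0.8, 0.5, 1.0)          # frequently accessed metadata
cold = priority(0.1, 0.1, 0.5, 0.0)         # rarely accessed sequential data
```

Chunks with a higher priority value would be the better candidates for the faster tier under this kind of formula.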
Cloud Analytics Storage Tiering (CAST) [8], as its name
suggests, is a storage tiering solution for data analytics
applications in the cloud. Using offline workload profiling,
CAST builds job performance prediction models for each tenant
on different cloud storage services. Then, it combines the
obtained prediction models with the workload specifications
and its goals to produce a cost-efficient data placement and
storage provisioning plan. The authors model data placement
and storage provisioning as a non-linear optimization problem
in which they maximize the tenant utilization in terms of
completion time and cost. An enhanced version, CAST++, is also
proposed in [8]; it adds data reuse patterns and workflow
awareness to CAST.
Based on the IOPS measured in a virtualized environment,
AutoTiering [66] dynamically migrates virtual disk files
(VMDKs) from one tier to another in an all-flash storage
system. It uses a sampling mechanism to estimate the IOPS of
running a VM on other tiers. Based on these measurements and
the costs, it sorts all possible movements by their scores.
For each VMDK, a daemon on the hypervisor collects IO-related
statistics at the end of every sampling period, including the
IOPS results of a latency injection test that emulates a
slower tier. To simulate faster tiers, AutoTiering uses a
linear regression model. If the IOPS does not change when the
IO process is slowed down, and there is a VM in the queue
waiting for the performance tier, a demotion moves the VMDK to
a capacity tier and lets the other VMDK take over the
performance tier by promoting it.
4.2 NVM as a Performance Tier
NVMFS [47] is a hybrid file system that improves the ran-
dom write operations in NAND-flash SSD by exploiting the
Figure 7: Strata design [25]. LibFS directs writes to the update log and serves reads from the shared area. The file data
cache is a read-only cache containing data from SSD or HDD.
byte-addressability of an auxiliary NVM device. The key
feature of this file system is that it redirects all random
IOs, which include metadata and hot file data blocks, to the
NVM. This scheme helps to reduce the write traffic to the SSD
and hence improves the SSD's durability. The technique is to
transform random writes at the file system level into
sequential writes at the SSD level. It groups data with the
same update likelihood and submits a single large SSD write
request.
NVMFS comprises two LRU lists: dirty and clean. The dirty
LRU list absorbs updates in the NVRAM. When a page is written
back to the SSD device, it moves from the dirty list to the
clean list. NVMFS dynamically adjusts the dirty and clean LRU
lists. Once the NVRAM usage reaches 80%, a background thread
starts flushing data from the dirty list and moving them to
the clean LRU list until the usage goes down to 50%. NVMFS has
a non-overwrite policy on the SSD: a periodic cleanup of
internal fragmentation integrates multiple partial extents
into one and recycles the free space.
The authors explain file system consistency through five
steps. 1: Check whether the NVRAM usage is over 80%; 2: if
so, group random small IOs from the dirty LRU list into large
(512K) extents; 3: then, sequentially write the extents to the
SSD (better block erase at the FTL); 4: insert the flushed
pages into the clean LRU list; and finally 5: update the
metadata by recording the new data position within the
page_info structure. Therefore, when a crash happens at any
point, the file system can be recovered.
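The watermark-driven flush in steps 1 to 4 can be sketched as follows. The sketch makes several assumptions that the text does not state: usage is measured as the share of dirty pages, pages are 4K (so a 512K extent holds 128 pages), and the SSD write and the page_info update of step 5 are reduced to a comment.

```python
from collections import OrderedDict

HIGH_PCT, LOW_PCT = 80, 50   # watermarks from the description
PAGES_PER_EXTENT = 128       # 512K extent with an assumed 4K page size


def flush_dirty(dirty, clean, capacity, pages_per_extent=PAGES_PER_EXTENT):
    """Flush LRU dirty pages once usage reaches HIGH_PCT, down to LOW_PCT.

    Returns the extents (lists of page ids) that would be written to SSD.
    """
    if 100 * len(dirty) < HIGH_PCT * capacity:      # step 1: check usage
        return []
    victims = []
    while dirty and 100 * len(dirty) > LOW_PCT * capacity:
        victims.append(dirty.popitem(last=False))   # oldest dirty page first
    # Step 2: group the small pages into large extents.
    extents = [victims[i:i + pages_per_extent]
               for i in range(0, len(victims), pages_per_extent)]
    for extent in extents:
        # Step 3 (one sequential SSD write per extent) would happen here,
        # followed by the metadata update of step 5.
        for page, data in extent:
            clean[page] = data                      # step 4: page is clean
    return [[page for page, _ in extent] for extent in extents]


dirty = OrderedDict((p, b"x") for p in range(9))    # 9 of 10 pages dirty
clean = OrderedDict()
extents = flush_dirty(dirty, clean, capacity=10, pages_per_extent=2)
```

Starting from 90% dirty pages, the pass flushes the four oldest pages in two extents and stops once the dirty share is back at 50%.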
To prevent segment cleaning inconsistency, NVMFS exploits
transactions during defragmentation, similar to the
transactions in log-structured file systems. After choosing a
candidate extent, it migrates the valid blocks of that extent
to NVRAM and then updates the corresponding inodes. Once the
inodes are updated, it releases the space in the SSD. Data
will always be consistent, even when a crash happens in the
middle of the process.
Strata [25] is a multi-tiered file system which exploits
NVM as the performance tier, and SSD/HDD as the capacity
tiers. It consists of two parts: KernFS and LibFS. To use
Strata, applications are required to be recompiled with LibFS,
which re-implements the standard POSIX interface. On the
kernel side, KernFS should be running to grant the application
access to the shared storage area, which is a combination of
NVM, SSD, and HDD. It uses the byte-addressability of the NVM
to coalesce logs and migrate them to lower tiers to minimize
write amplification. In Strata, file data can only be
allocated in NVM, and it can be migrated only from a faster
tier to a slower one. The profiling granularity of Strata is a
page, which increases the bookkeeping overhead and wastes the
locality information of file accesses.
Strata attains fast write operations by separating the tasks
of logging and digesting and delegating them to the user space
and the kernel space, respectively. KernFS grants LibFS direct
access to the NVM for its own private update log and to the
shared area for read-only operations, as shown in figure 7.
KernFS performs the digest operation in parallel via multiple
threads. One benefit of this operation is that despite the
randomness and small size of the initial writes to the update
log, they can be coalesced and written sequentially to the
shared area, which helps to minimize fragmentation and
metadata overhead. This also enables efficient flash erasure
and shingled write operations.
For crash consistency, LibFS works with a durable unit
called the Strata transaction, which provides ACID semantics
to the application's update log. To implement this, Strata
wraps every POSIX system call in one or multiple Strata
transactions. Figure 7 represents the Strata design and the LibFS
and KernFS components.
4.3 NVM as a Metadata Tier
In a journaling file system, like Ext4, the metadata updates
are usually very small (e.g., an inode size of 256B). Although
modifying an inode requires only a small write, a whole inode
block (e.g., 4K) is replaced because storage devices operate
at block granularity. In recent years, Non-Volatile Memories
have attracted a lot of attention because they can be attached
to the memory bus. This feature means that the CPU may issue
byte-level (cache-line-size) persistent updates.
File System Metadata Accelerator (FSMAC) [6] decouples the
data and metadata I/O paths and uses NVM to store file system
metadata because of its small access size and popularity.
Metadata is permanently stored in NVM and, by default, never
flushed back to the disk periodically. All updates to the
metadata are in-place updates. This raises two consistency
issues: a power failure in the middle of a metadata update can
leave the metadata corrupt, and, as the authors argue, because
of the performance gap between NVM and a block device, the
data update lags behind the metadata update, which becomes
persistent in NVM as soon as it is made. Since byte-size
versioning is very complex and tedious to implement, and
block-size versioning imposes write amplification and wastes
NVM space, FSMAC uses fine-grained versioning (inode-size,
i.e., 128 bytes) that can maintain consistency at reasonable
implementation and space costs.
To address the write ordering issue of data and metadata
without destroying the performance gained from the
byte-addressability of NVM, FSMAC uses a light-weight
combination of fine-grained versioning and transactions. A
copy of the original version of the metadata is created before
updating it, so that a crash can be recovered from safely. The
copy is deleted only after the successful completion of the
updating transaction. After that, the whole file system is
consistent.
Using this opportunity, C. Chen et al. proposed fine-grained
metadata journaling on NVM [5]. Although it is not directly a
tiering or caching solution, using NVM to keep a part of the
storage data is a kind of classification problem, which is
fundamental in tiering approaches.
In contrast to conventional journaling file systems, in
which the modified inode blocks in the DRAM page buffer are
persisted to the disk in the form of transactions, in the
NVM-based fine-grained journaling file system [5], only the
modified inodes are linked together and persisted in the NVM
(Figure 8). Using cache flush instructions and memory fences,
it guarantees the consistency of ordered writes. Instead of
the large Descriptor and Commit (or Revoke) blocks, which are
8K in total, a new data structure, TxnInfo, is introduced,
which contains the number of modified inodes in the list
(Count) and a Magic Number for identifying TxnInfo during
recovery.
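The space saving of the fine-grained format can be illustrated with a small sketch. The byte layout is an assumption: TxnInfo is modeled as two 32-bit little-endian fields (Count, Magic Number) placed after the inodes, and the magic value is invented for the example. The 256B inode, 4K block, and 8K descriptor-plus-commit sizes come from the text.

```python
import struct

INODE_SIZE = 256                 # bytes per inode, as in the text
BLOCK_SIZE = 4096                # bytes per inode block, as in the text
MAGIC = 0x54584E49               # hypothetical magic value for this sketch
TXNINFO = struct.Struct("<II")   # assumed layout: Count, Magic Number


def encode_txn(inodes):
    """Serialize modified inodes followed by a TxnInfo record."""
    for inode in inodes:
        if len(inode) != INODE_SIZE:
            raise ValueError("each inode must be INODE_SIZE bytes")
    return b"".join(inodes) + TXNINFO.pack(len(inodes), MAGIC)


def decode_txn(buf):
    """Return the inodes if buf ends with a valid TxnInfo, else None."""
    if len(buf) < TXNINFO.size:
        return None
    body = len(buf) - TXNINFO.size
    count, magic = TXNINFO.unpack_from(buf, body)
    if magic != MAGIC or count * INODE_SIZE != body:
        return None              # torn or foreign record: not committed
    return [buf[i * INODE_SIZE:(i + 1) * INODE_SIZE] for i in range(count)]


inodes = [bytes([i]) * INODE_SIZE for i in range(3)]
txn = encode_txn(inodes)

# Journal bytes for 3 modified inodes that live in 3 different inode blocks.
block_based = 3 * BLOCK_SIZE + 2 * BLOCK_SIZE   # blocks + descriptor/commit
fine_grained = len(txn)                          # inodes + TxnInfo
```

Under these assumptions the block-based journal writes 20480 bytes for the same update that the fine-grained format records in 776 bytes, and a truncated record fails the TxnInfo check during recovery.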
[Figure: traditional block-based journal format on disk (a
Descriptor Block, modified metadata blocks of block size,
e.g., 4KB, and a Commit Block per transaction) compared with
the fine-grained journal format on NVM (modified metadata,
e.g., 256B inodes, and a TxnInfo holding inode no. [1] to
[N], Count, and Magic Number), with head and tail pointers
delimiting the journal.]
Figure 8: Fine-grained Metadata Journal Format on
NVM [5]
The journal area in NVM is a ring buffer with a head and
a tail pointer. Writing to it takes three steps: 1) copy
(memcpy) the modified inodes from DRAM to NVM; 2) flush the
corresponding cache lines and issue a memory barrier; and
3) update the tail pointer in the journal area in NVM with
an atomic 8-byte write, flush its cache line, and issue a
memory barrier.
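The three steps above can be sketched with a small simulation. This is
only an illustration under stated assumptions: the class NvmJournal, its
method names, and the byte-array "NVM" are hypothetical, and the
cache-line flush and memory barrier are replaced by a counter because
Python cannot issue them.

```python
class NvmJournal:
    """Toy model of a journal area in NVM managed as a ring buffer."""

    def __init__(self, size: int) -> None:
        self.area = bytearray(size)  # stands in for the NVM journal area
        self.head = 0                # start of the oldest live journal byte
        self.tail = 0                # end of the last committed write
        self.persist_calls = 0       # counts simulated flush+fence points

    def _persist(self) -> None:
        # Placeholder for "flush the cache lines, then issue a memory
        # barrier". Python cannot do either, so we only count the calls.
        self.persist_calls += 1

    def free_space(self) -> int:
        used = (self.tail - self.head) % len(self.area)
        # One byte stays unused so that head == tail always means "empty".
        return len(self.area) - used - 1

    def append(self, payload: bytes) -> None:
        if len(payload) > self.free_space():
            raise BufferError("journal full; checkpoint before appending")
        size = len(self.area)
        # Step 1: copy the payload into the journal area, starting at tail.
        for i, byte in enumerate(payload):
            self.area[(self.tail + i) % size] = byte
        # Step 2: make the payload durable before it becomes visible.
        self._persist()
        # Step 3: publish the payload by moving tail, then persist tail.
        self.tail = (self.tail + len(payload)) % size
        self._persist()

    def committed(self) -> bytes:
        """Return the bytes between head and tail (the visible journal)."""
        size = len(self.area)
        used = (self.tail - self.head) % size
        return bytes(self.area[(self.head + i) % size] for i in range(used))


journal = NvmJournal(size=16)
journal.append(b"inode-A")
```

The ordering is the point of the sketch: the payload is persisted before
the tail moves, so a reader that trusts only the region between head and
tail never sees a partially written entry.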
In traditional journaling file systems, committing the
Running Transaction, which is a linked list of modified inode
blocks, is triggered by either a predefined timer or a
predefined number of modified inode blocks. In fine-grained
journaling, the committing process likewise starts when a
predefined timer expires. The onset of this process is also
controlled by the number of modified inodes, because TxnInfo
can hold information for only a limited number of them. The
committing process begins by relinking all modified inodes
from the Running Transaction to the Committing Transaction so
that the Running Transaction can accept new modified inodes.
Then, all modified inodes are copied (memcpy) to NVM starting
from the tail, and the TxnInfo is calculated afterwards. The
corresponding cache lines are then flushed, and a memory
fence is issued. Finally, the tail pointer is updated
atomically, which confirms that the transaction is committed.
Note that data remains consistent during this process even if
a crash happens in the middle, because the tail controls the
visibility of data. Compared with traditional journaling,
this method reduces transaction writes by up to 99%.
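To make the role of TxnInfo more concrete, the following sketch packs
the inode numbers, Count, and Magic Number into a record whose size is
an integral multiple of the inode size. The field widths (4 bytes each),
the field order, the magic value, and the function names are assumptions
made for illustration; the survey does not specify them.

```python
import struct

INODE_SIZE = 256       # example inode size used in this section's figure
MAGIC = 0x54584E49     # hypothetical magic value ("TXNI" in ASCII)


def txninfo_capacity(txninfo_size: int = INODE_SIZE) -> int:
    """Number of 4-byte inode numbers that fit next to Count and Magic."""
    return (txninfo_size - 8) // 4


def pack_txninfo(inode_numbers, txninfo_size: int = INODE_SIZE) -> bytes:
    """Build one TxnInfo record: inode numbers, padding, Count, Magic."""
    if txninfo_size % INODE_SIZE != 0:
        raise ValueError("TxnInfo size must be a multiple of the inode size")
    if len(inode_numbers) > txninfo_capacity(txninfo_size):
        # Reaching this limit is what forces a commit in the text above.
        raise ValueError("too many inodes for one transaction")
    body = struct.pack(f"<{len(inode_numbers)}I", *inode_numbers)
    padding = bytes(txninfo_size - 8 - len(body))
    trailer = struct.pack("<II", len(inode_numbers), MAGIC)
    return body + padding + trailer


def unpack_txninfo(record: bytes):
    """Recover the inode numbers, checking the magic number first."""
    count, magic = struct.unpack("<II", record[-8:])
    if magic != MAGIC:
        raise ValueError("not a TxnInfo record")
    return list(struct.unpack(f"<{count}I", record[:4 * count]))


record = pack_txninfo([7, 42, 1000])
```

Under these assumed widths, a 256-byte TxnInfo holds at most 62 inode
numbers, which illustrates why the number of modified inodes bounds the
size of a transaction.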
To keep the journal from growing so long that it degrades
performance, file systems usually checkpoint periodically.
Fine-grained journaling triggers checkpointing every 10
minutes or when NVM utilization reaches 50%. Like
traditional journaling file systems, it takes over the modified
inode block list and writes the blocks one after another. Then,
[Figure panels: (a) The file structure of Ziggurat and basic
migration; (b) Group migration. Each panel shows the inode
log with its head and tail in NVMM, file pages in NVMM and on
disk, and migration Steps 1 to 5. Legend: page state (stale,
live); entry type (inode update, old write entry, new write
entry).]
Figure 9: Migration mechanism of Ziggurat [69]. Ziggurat migrates file data between tiers using its basic migration and
group migration mechanisms. The blue arrows indicate data movement, while the black ones indicate pointers.
it discards the journal in NVM by making the head and tail
pointers equal. Recoverability is guaranteed because, if a
crash happens before that point, the modified inodes are
still available in NVM for recovery.
The recovery process in fine-grained journaling scans the
journal backward, starting from the tail in NVM. It retrieves
the corresponding inode blocks into DRAM and brings the
obsolete inode blocks up to date by applying the modified
inodes to them. After all inode blocks are updated in DRAM,
it flushes them back to the disk. Finally, it atomically
makes the head and tail pointers identical. Consistency is
guaranteed in the same way as in the checkpointing process.
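The recovery steps can be sketched as follows. This is a simplified
model, not the implementation from [5]: the journal is assumed to be a
list of committed transactions (oldest first), each mapping an inode
number to its journaled contents, the disk is a dictionary of inode
blocks, and the constant INODES_PER_BLOCK and the function name recover
are hypothetical.

```python
INODES_PER_BLOCK = 16  # assumes 4 KB blocks and 256 B inodes


def recover(journal, disk):
    """Apply journaled inodes to their inode blocks, newest version first."""
    dram_blocks = {}   # inode blocks loaded into DRAM during recovery
    applied = set()    # inode numbers already brought up to date
    # Walk backward from the tail so the newest copy of an inode wins.
    for txn in reversed(journal):
        for ino, inode in txn.items():
            if ino in applied:
                continue  # a newer version was already applied
            blk = ino // INODES_PER_BLOCK
            if blk not in dram_blocks:
                dram_blocks[blk] = list(disk[blk])  # read the obsolete block
            dram_blocks[blk][ino % INODES_PER_BLOCK] = inode
            applied.add(ino)
    # Flush every updated inode block back to disk.
    for blk, inodes in dram_blocks.items():
        disk[blk] = inodes
    # Discard the journal, which models making head equal to tail.
    journal.clear()
    return sorted(applied)


disk = {0: ["old"] * INODES_PER_BLOCK, 1: ["old"] * INODES_PER_BLOCK}
journal = [{3: "v1", 17: "a"}, {3: "v2"}]
recovered = recover(journal, disk)
```

Here inode 3 appears in two transactions, and scanning backward means
only its latest version ("v2") is written into block 0.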
4.3.1 Our Multi-Tiered File System: Ziggurat2
Ziggurat [69] is a multi-tiered, NVM-based file system that
spans NVM and disks, and it was developed in our research
group. The paper was published in the proceedings of the 17th
USENIX Conference on File and Storage Technologies (FAST
’19). It is based on our well-known NVM-based file system,
NOVA [63]. Ziggurat exploits the benefits of NVM through
intelligent data placement during file writes and data
migration. Ziggurat includes two placement predictors that
analyze the file write sequences and predict whether the
incoming writes are both large and stable and whether updates
to the file are likely to be synchronous. Then, it steers the
incoming writes to the most suitable tier based on the
prediction: writes to synchronously-updated files go to the
NVM tier to minimize the synchronization overhead. Small,
random writes also go to the NVM tier to entirely avoid
random writes to disk. The remaining large sequential writes
to asynchronously-updated files go to disk. Ziggurat pursues
five principal design goals, which are as follows.
2 Parts of this section are taken from the original paper
accepted in FAST ’19 [69].
Send writes to the most suitable tier. Although NVM
is the fastest tier in Ziggurat, file writes should not always
go to NVM. NVM is best-suited for small updates (since
small writes to disk are slow) and synchronous writes (since
NVM has higher bandwidth and lower latency). However,
for larger asynchronous writes, targeting disk is faster, since
Ziggurat can buffer the data in DRAM more quickly than it
can write to NVM, and the write to disk can occur in the
background. Ziggurat uses its synchronicity predictor to an-
alyze the sequence of writes to each file and predict whether
future accesses are likely to be synchronous (i.e., whether the
application will call fsync shortly).
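The placement rule described above can be written as a small decision
function. This is a sketch only: the WriteRequest fields, the
SMALL_WRITE_BYTES cutoff, and the function name choose_tier are
assumptions, and the predictors themselves are not modeled (their
outcome is passed in as a flag).

```python
from dataclasses import dataclass

SMALL_WRITE_BYTES = 64 * 1024  # hypothetical cutoff for a "small" write


@dataclass
class WriteRequest:
    size: int             # bytes to be written
    sequential: bool      # True if the write continues a sequential run
    sync_predicted: bool  # True if the file is predicted to be fsync'ed soon


def choose_tier(req: WriteRequest) -> str:
    """Return "NVM" or "DISK" following the placement rule in the text."""
    # Synchronously updated files go to NVM to cut synchronization cost.
    if req.sync_predicted:
        return "NVM"
    # Small or random writes go to NVM to avoid random writes to disk.
    if req.size < SMALL_WRITE_BYTES or not req.sequential:
        return "NVM"
    # Remaining large, sequential, asynchronous writes go to disk.
    return "DISK"


tier = choose_tier(WriteRequest(size=1 << 20, sequential=True,
                                sync_predicted=False))
```

A large sequential write to a file that is not expected to be synced is
the only case this sketch sends to disk.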
Only migrate cold data in cold files. During migration,
Ziggurat targets the cold portions of cold files. Hot files and
hot data in unevenly-accessed files remain in the faster tier.
When the usage of the fast tier is above a threshold, Ziggurat
selects files with the earliest average modification time to
migrate. Within each file, Ziggurat migrates blocks that are
older than average, unless the whole file is cold (i.e., its
modification time is not recent), in which case it migrates
the entire file.
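The selection rule can be illustrated with a short sketch. The data
layout (a dictionary from file name to per-block modification times),
the cold_age parameter, and the choice to treat a file's modification
time as its most recent block time are all assumptions made for this
example.

```python
def pick_blocks_to_migrate(files, now, cold_age):
    """Pick the coldest file and the blocks of it that should migrate.

    files maps a file name to {block_number: modification_time}.
    """
    def average_mtime(blocks):
        return sum(blocks.values()) / len(blocks)

    # The file with the earliest average modification time goes first.
    name = min(files, key=lambda n: average_mtime(files[n]))
    blocks = files[name]
    if now - max(blocks.values()) > cold_age:
        # The whole file is cold: migrate every block.
        chosen = sorted(blocks)
    else:
        # Otherwise migrate only blocks older than the file's average.
        mean = average_mtime(blocks)
        chosen = sorted(b for b, t in blocks.items() if t < mean)
    return name, chosen


files = {
    "log": {0: 100, 1: 100, 2: 900},  # two cold blocks, one hot block
    "db": {0: 950, 1: 990},           # recently modified file
}
victim = pick_blocks_to_migrate(files, now=1000, cold_age=500)
```

In this example the file "log" is selected, and only its two old blocks
are chosen because block 2 was modified recently.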
High NVM space utilization. Ziggurat fully utilizes
NVM space to improve performance. Ziggurat uses NVM
to absorb synchronous writes. Ziggurat uses a dynamic mi-
gration threshold for NVM based on the read-write pattern
of applications, so it makes the most of NVM to handle file
reads and writes efficiently. We also implement reverse mi-
gration to migrate data from disk to NVM when running
read-dominated workloads.
Migrate file data in groups. To maximize the write band-
width of disks, Ziggurat performs migration to disks as se-
quentially as possible. The placement policy ensures that
most small, random writes go to NVM. However, migrating
these small write entries to disks directly will suffer from the
poor random access performance of drives. To make migra-
tion efficient, Ziggurat coalesces adjacent file data into large
chunks for movement to exploit sequential disk bandwidth.
High scalability. Ziggurat extends NOVA’s per-CPU storage
space allocators to include all the storage tiers. It also
uses per-CPU migration and page cache write-back threads to
improve scalability.
Figure 9a shows the basic procedure by which Ziggurat
migrates a write entry from NVM to disk. The first step is to
allocate contiguous space on disk to hold the migrated data.
Ziggurat then copies the data from NVM to disk. Next, it
appends a new write entry to the inode log with the new
location of the migrated data blocks. After that, it updates
the log tail in NVM and the radix tree in DRAM. Finally,
Ziggurat frees the old blocks of NVM.
Figure 9b shows the steps of group migration, which avoids
fine-grained migration to improve efficiency and maximize
sequential bandwidth to disks. The steps are similar to those
for migrating a single write entry. In step 1, Ziggurat
allocates large chunks of data blocks in the lower tier. In
step 2, it copies multiple pages to the lower tier with a
single sequential write. After that, it appends the log entry
and updates the inode log tail, which commits the group
migration. The old pages and logs are freed afterward.
Ideally, the group migration size (the granularity of group
migration) should be set close to the future I/O size, so
that applications can fetch file data with one sequential
read from disk. Also, it should not exceed the CPU cache
size, to maximize the performance of loading the write
entries from disks.
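The coalescing idea behind group migration can be sketched as below.
This is an illustration under assumptions: pages are identified by
integer page numbers, the function name coalesce and the parameter
max_group_pages (standing in for the group migration size) are
hypothetical, and the allocation, copy, and log-append steps are left
out.

```python
def coalesce(pages, max_group_pages):
    """Group page numbers into runs of adjacent pages.

    Each run holds at most max_group_pages pages, so one run can be
    moved to the lower tier with a single sequential write.
    """
    groups = []
    for page in sorted(set(pages)):
        last = groups[-1] if groups else None
        if last and page == last[-1] + 1 and len(last) < max_group_pages:
            last.append(page)      # extend the current adjacent run
        else:
            groups.append([page])  # start a new run
    return groups


groups = coalesce([3, 0, 1, 2, 7, 8], max_group_pages=4)
```

With a group size of four pages, pages 0 to 3 form one sequential write
and pages 7 and 8 form another, instead of six separate small writes.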
In a nutshell, Ziggurat bridges the gap between disk-based
storage and NVM-based storage and provides high perfor-
mance and large capacity to applications.
5 Conclusion
The diversity of storage technologies and their different
characteristics make each of them individually suitable for a
particular set of storage needs. On the software side, the
ever-expanding cloud of digital information requires
large-scale enterprise data servers with high-performance
storage systems. While older, well-established storage
technologies like HDDs provide large capacity and high
density at a relatively low cost, new technologies such as
SSD and NVM offer very fast and reliable I/O at a much higher
cost. The general desire is to have the high performance of
the new technologies together with high storage capacity and
low cost. Although processors have been advancing much faster
than storage technologies, software solutions such as caching
and tiering have drawn experts’ attention as a way to
overcome this limitation. In this survey, we investigated
several caching and tiering solutions for high-performance
storage systems. We observed that although there are several
caching and tiering proposals which use SSD as the
performance tier, the young technology of NVM has not yet
received enough attention in such systems. This is not
unexpected, since the technology was developed recently and
the first products of this type were shipped to the market
just a few months before this publication. Nevertheless, we
also looked into some recent papers on using NVM as a
performance tier, and we introduced Ziggurat, a multi-tiered
file system that uses NVM as a performance tier to cover the
long latencies of SSDs and HDDs.
References
[1] PMDK. https://pmem.io/pmdk/.
[2] ANDERSEN, D. G., AND SWANSON, S. Rethinking
flash in the data center. IEEE Micro 30, 4 (2010), 52–
54.
[3] ARULRAJ, J., PAVLO, A., AND DULLOOR, S. R.
Let’s talk about storage & recovery methods for non-
volatile memory database systems. In Proceedings of
the 2015 ACM SIGMOD International Conference on
Management of Data (2015), ACM, pp. 707–722.
[4] BHASKARAN, M. S., XU, J., AND SWANSON, S.
Bankshot: Caching slow storage in fast non-volatile
memory. In Proceedings of the 1st Workshop on In-
teractions of NVM/FLASH with Operating Systems and
Workloads (2013), ACM, p. 1.
[5] CHEN, C., YANG, J., WEI, Q., WANG, C., AND XUE,
M. Fine-grained metadata journaling on nvm. In Mass
Storage Systems and Technologies (MSST), 2016 32nd
Symposium on (2016), IEEE, pp. 1–13.
[6] CHEN, J., WEI, Q., CHEN, C., AND WU, L. Fsmac:
A file system metadata accelerator with non-volatile
memory. In Mass Storage Systems and Technologies
(MSST), 2013 IEEE 29th Symposium on (2013), IEEE,
pp. 1–11.
[7] CHEN, X., CHEN, W., LU, Z., LONG, P., YANG, S.,
AND WANG, Z. A duplication-aware ssd-based cache
architecture for primary storage in virtualization envi-
ronment. IEEE Systems journal 11, 4 (2017), 2578–
2589.
[8] CHENG, Y., IQBAL, M. S., GUPTA, A., AND BUTT,
A. R. Cast: Tiering storage for data analytics in the
cloud. In Proceedings of the 24th International Sym-
posium on High-Performance Parallel and Distributed
Computing (2015), ACM, pp. 45–56.
[9] CLARKE, P. Intel, Micron Launch “Bulk-Switching”
ReRAM.
https://www.eetimes.com/document.asp?doc_id=1327289, 2015.
[10] DAI, N., CHAI, Y., LIANG, Y., AND WANG, C.
ETD-Cache: an expiration-time driven cache scheme
to make SSD-based read cache endurable and cost-
efficient. In Proceedings of the 12th ACM Inter-
national Conference on Computing Frontiers (2015),
ACM, p. 26.
[11] DULLOOR, S. R., KUMAR, S., KESHAVAMURTHY,
A., LANTZ, P., REDDY, D., SANKARAN, R., AND
JACKSON, J. System software for persistent memory.
In Proceedings of the Ninth European Conference on
Computer Systems (2014), ACM, p. 15.
[12] FAN, Z., DU, D. H., AND VOIGT, D. H-arc: A
non-volatile memory based cache policy for solid state
drives. In Mass Storage Systems and Technologies
(MSST), 2014 30th Symposium on (2014), IEEE, pp. 1–
11.
[13] FAN, Z., HAGHDOOST, A., DU, D. H., AND VOIGT,
D. I/O-Cache: A Non-volatile Memory Based
Buffer Cache Policy to Improve Storage Performance.
In Modeling, Analysis and Simulation of Computer
and Telecommunication Systems (MASCOTS), 2015
IEEE 23rd International Symposium on (2015), IEEE,
pp. 102–111.
[14] FLOYER, D. Will 3D XPoint make it against 3D NAND?
https://wikibon.com/3d-xpoint-falters/, 2017.
[15] GREGG, B. ZFS L2ARC. Oracle Blogs July 22 (2008).
http://www.brendangregg.com/blog/2008-07-22/zfs-l2arc.html.
[16] HAMZAOGLU, F., ARSLAN, U., BISNIK, N., GHOSH,
S., LAL, M. B., LINDERT, N., METERELLIYOZ, M.,
OSBORNE, R. B., PARK, J., TOMISHIMA, S., WANG,
Y., AND ZHANG, K. 13.1 A 1Gb 2GHz embedded
DRAM in 22nm tri-gate CMOS technology. In 2014
IEEE International Solid-State Circuits Conference Di-
gest of Technical Papers (ISSCC) (Feb 2014), pp. 230–
231.
[17] HOSEINZADEH, M., ARJOMAND, M., AND
SARBAZI-AZAD, H. Reducing access latency of
MLC PCMs through line striping. In Proceeding of the
41st Annual International Symposium on Computer
Architecture (Piscataway, NJ, USA, 2014), ISCA ’14,
IEEE Press, pp. 277–288.
[18] HOSEINZADEH, M., ARJOMAND, M., AND
SARBAZI-AZAD, H. Reducing Access Latency
of MLC PCMs Through Line Striping. In Proceeding
of the 41st Annual International Symposium on
Computer Architecture (Piscataway, NJ, USA, 2014),
ISCA ’14, IEEE Press, pp. 277–288.
[19] HOSEINZADEH, M., ARJOMAND, M., AND
SARBAZI-AZAD, H. SPCM: The striped phase
change memory. ACM Transactions on Architecture
and Code Optimization (TACO) 12, 4 (2016), 38.
[20] HOSOMI, M., YAMAGISHI, H., YAMAMOTO, T.,
BESSHO, K., HIGO, Y., YAMANE, K., YAMADA, H.,
SHOJI, M., HACHINO, H., FUKUMOTO, C., ET AL.
A novel nonvolatile memory with spin torque transfer
magnetization switching: Spin-ram. In Electron De-
vices Meeting, 2005. IEDM Technical Digest. IEEE In-
ternational (2005), IEEE, pp. 459–462.
[21] HUANG, S., WEI, Q., FENG, D., CHEN, J., AND
CHEN, C. Improving flash-based disk cache with lazy
adaptive replacement. ACM Transactions on Storage
(TOS) 12, 2 (2016), 8.
[22] INTEL. Intel optane technology, 2018.
https://www.intel.com/content/www/us/en/architecture-and-technology/intel-optane-technology.html.
[23] JIANG, S., AND ZHANG, X. LIRS: An Efficient
Low Inter-reference Recency Set Replacement Policy
to Improve Buffer Cache Performance. In Proceedings
of the 2002 ACM SIGMETRICS International Confer-
ence on Measurement and Modeling of Computer Sys-
tems (New York, NY, USA, 2002), SIGMETRICS ’02,
ACM, pp. 31–42.
[24] KIM, H., SESHADRI, S., DICKEY, C. L., AND CHIU,
L. Evaluating phase change memory for enterprise
storage systems: A study of caching and tiering ap-
proaches. ACM Transactions on Storage (TOS) 10, 4
(2014), 15.
[25] KWON, Y., FINGLER, H., HUNT, T., PETER, S.,
WITCHEL, E., AND ANDERSON, T. Strata: A cross
media file system. In Proceedings of the 26th Sympo-
sium on Operating Systems Principles (2017), ACM,
pp. 460–477.
[26] LEE, B. C., IPEK, E., MUTLU, O., AND BURGER, D.
Architecting phase change memory as a scalable dram
alternative. In ACM SIGARCH Computer Architecture
News (2009), vol. 37, ACM, pp. 2–13.
[27] LEE, B. C., ZHOU, P., YANG, J., ZHANG, Y., ZHAO,
B., IPEK, E., MUTLU, O., AND BURGER, D. Phase-
change technology and the future of main memory.
IEEE micro 30, 1 (2010).
[28] LEE, D., MIN, C., AND YOUNG, I. E. Effective SSD
caching for high-performance home cloud server. In
IEEE International Conference on Consumer Electron-
ics (ICCE) (2015), pp. 152–153.
[29] LEE, G., LEE, H. G., LEE, J., KIM, B. S., AND MIN,
S. L. An Empirical Study on NVM-based Block I/O
Caches. In Proceedings of the 9th Asia-Pacific Work-
shop on Systems (New York, NY, USA, 2018), AP-
Sys’18, ACM, pp. 11:1–11:8.
[30] LEVENTHAL, A. Flash storage memory. Communica-
tions of the ACM 51, 7 (2008), 47–51.
[31] LI, C., SHILANE, P., DOUGLIS, F., SHIM, H., SMAL-
DONE, S., AND WALLACE, G. Nitro: A capacity-
optimized ssd cache for primary storage. In USENIX
Annual Technical Conference (2014), pp. 501–512.
[32] LI, W., JEAN-BAPTISE, G., RIVEROS, J.,
NARASIMHAN, G., ZHANG, T., AND ZHAO, M.
Cachededup: In-line deduplication for flash caching.
In FAST (2016), pp. 301–314.
[33] LIANG, Y., CHAI, Y., BAO, N., CHEN, H., AND LIU,
Y. Elastic queue: A universal ssd lifetime extension
plug-in for cache replacement algorithms. In Proceed-
ings of the 9th ACM International on Systems and Stor-
age Conference (2016), ACM, p. 5.
[34] LIU, Y., GE, X., HUANG, X., AND DU, D. H. Mo-
lar: A cost-efficient, high-performance hybrid storage
cache. In Cluster Computing (CLUSTER), 2013 IEEE
International Conference on (2013), IEEE, pp. 1–5.
[35] LIU, Y., HUANG, J., XIE, C., AND CAO, Q. RAF: A
Random Access First Cache Management to Improve
SSD-Based Disk Cache. In 2010 IEEE Fifth Inter-
national Conference on Networking, Architecture, and
Storage (July 2010), pp. 492–500.
[36] LU, Z.-W., AND ZHOU, G. Design and Implementa-
tion of Hybrid Shingled Recording RAID System. In
14th Intl Conf on Pervasive Intelligence and Comput-
ing (PiCom) (2016), IEEE, pp. 937–942.
[37] MATTHEWS, J., TRIKA, S., HENSGEN, D., COULSON,
R., AND GRIMSRUD, K. Intel® turbo memory:
Nonvolatile disk caches in the storage hierarchy of
mainstream computer systems. ACM Transactions on
Storage (TOS) 4, 2 (2008), 4.
[38] MEGIDDO, N., AND MODHA, D. S. ARC: A self-
tuning, low overhead replacement cache. In USENIX
Annual Technical Conference, General Track (2003),
vol. 3, pp. 115–130.
[39] MENG, F., ZHOU, L., MA, X., UTTAMCHANDANI,
S., AND LIU, D. vCacheShare: Automated Server
Flash Cache Space Management in a Virtualization En-
vironment. In USENIX Annual Technical Conference
(2014), pp. 133–144.
[40] MICRON. 3d-xpoint technology, 2017.
https://www.micron.com/products/advanced-solutions/3d-xpoint-technology.
[41] MUPPALANENI, N., AND GOPINATH, K. A multi-
tier RAID storage system with RAID1 and RAID5. In
Parallel and Distributed Processing Symposium, 2000.
IPDPS 2000. Proceedings. 14th International (2000),
IEEE, pp. 663–671.
[42] NAIR, P. J., CHOU, C., RAJENDRAN, B., AND
QURESHI, M. K. Reducing read latency of phase
change memory via early read and turbo read. In High
Performance Computer Architecture (HPCA), 2015
IEEE 21st International Symposium on (2015), IEEE,
pp. 309–319.
[43] NIU, J., XU, J., AND XIE, L. Hybrid Storage Systems:
A Survey of Architectures and Algorithms. IEEE AC-
CESS 6 (2018), 13385–13406.
[44] OH, Y., CHOI, J., LEE, D., AND NOH, S. H. Caching
less for better performance: balancing cache size and
update cost of flash memory cache in hybrid storage
systems. In FAST (2012), vol. 12.
[45] PARK, S.-Y., JUNG, D., KANG, J.-U., KIM, J.-S.,
AND LEE, J. Cflru: a replacement algorithm for flash
memory. In Proceedings of the 2006 international con-
ference on Compilers, architecture and synthesis for
embedded systems (2006), ACM, pp. 234–241.
[46] PRITCHETT, T., AND THOTTETHODI, M. Sievestore:
a highly-selective, ensemble-level disk cache for cost-
performance. In ACM SIGARCH Computer Architec-
ture News (2010), vol. 38, ACM, pp. 163–174.
[47] QIU, S., AND REDDY, A. N. Nvmfs: A hybrid file sys-
tem for improving random write in nand-flash ssd. In
Mass Storage Systems and Technologies (MSST), 2013
IEEE 29th Symposium on (2013), IEEE, pp. 1–5.
[48] ROBINSON, J. T., AND DEVARAKONDA, M. V. Data
cache management using frequency-based replace-
ment, vol. 18. ACM, 1990.
[49] SALKHORDEH, R., ASADI, H., AND EBRAHIMI, S.
Operating system level data tiering using online work-
load characterization. The Journal of Supercomputing
71, 4 (2015), 1534–1562.
[50] SHASHA, D., AND JOHNSON, T. 2Q: A low overhead
high performance buffer management replacement
algorithm. In Proceedings of the 20th International
Conference on Very Large Databases (1994).
[51] SHIMPI, A. L. Understanding apple’s fusion drive,
October 2012.
https://www.anandtech.com/show/6406/understanding-apples-fusion-drive.
[52] SMARAGDAKIS, Y., KAPLAN, S., AND WILSON, P.
EELRU: simple and effective adaptive page replace-
ment. In ACM SIGMETRICS Performance Evaluation
Review (1999), vol. 27, ACM, pp. 122–133.
[53] SMULLEN, C. W., MOHAN, V., NIGAM, A., GU-
RUMURTHI, S., AND STAN, M. R. Relaxing non-
volatility for fast and energy-efficient stt-ram caches.
In High Performance Computer Architecture (HPCA),
2011 IEEE 17th International Symposium on (2011),
IEEE, pp. 50–61.
[54] SUN, Z., BI, X., LI, H. H., WONG, W.-F., ONG, Z.-
L., ZHU, X., AND WU, W. Multi retention level STT-
RAM cache designs with a dynamic refresh scheme.
In Proceedings of the 44th Annual IEEE/ACM Interna-
tional Symposium on Microarchitecture (2011), ACM,
pp. 329–338.
[55] TAI, J., SHENG, B., YAO, Y., AND MI, N. Sla-aware
data migration in a shared hybrid storage cluster. Clus-
ter Computing 18, 4 (2015), 1581–1593.
[56] TALLIS, B., AND CUTRESS, I. Intel Launches
Optane DIMMs Up To 512GB: Apache Pass Is Here!
https://www.anandtech.com/show/12828/intel-launches-optane-dimms-up-to-512gb-apache-pass-is-here.
[57] TOY, W. N., AND ZEE, B. Computer Hardware-
Software Architecture. Prentice Hall Professional Tech-
nical Reference, 1986.
[58] TURNER, V., GANTZ, J. F., REINSEL, D., AND
MINTON, S. The digital universe of opportunities:
Rich data and the increasing value of the internet of
things. IDC Analyze the Future 16 (2014).
[59] VOLOS, H., TACK, A. J., AND SWIFT, M. M.
Mnemosyne: Lightweight persistent memory. In Pro-
ceedings of the Sixteenth International Conference
on Architectural Support for Programming Languages
and Operating Systems, ASPLOS XVI (2011), vol. 39,
ACM, pp. 91–104.
[60] WANG, C., WANG, D., CHAI, Y., WANG, C., AND
SUN, D. Larger, cheaper, but faster: SSD-SMR hybrid
storage boosted by a new SMR-oriented cache frame-
work. In Proceedings of the 33rd International Con-
ference on Massive Storage Systems and Technology
(MSST’17) (2017).
[61] WEI, Q., WANG, C., CHEN, C., YANG, Y., YANG,
J., AND XUE, M. Transactional nvm cache with high
performance and crash consistency. In Proceedings
of the International Conference for High Performance
Computing, Networking, Storage and Analysis (2017),
ACM, p. 56.
[62] XIAO, W., DONG, H., MA, L., LIU, Z., AND ZHANG,
Q. HS-BAS: A hybrid storage system based on band
awareness of Shingled Write Disk. In 2016 IEEE 34th
International Conference on Computer Design (ICCD)
(2016), IEEE, pp. 64–71.
[63] XU, J., AND SWANSON, S. Nova: A log-structured file
system for hybrid volatile/non-volatile main memories.
In FAST (2016), pp. 323–338.
[64] YAMADA, T., MATSUI, C., AND TAKEUCHI, K.
Optimal combinations of scm characteristics and
non-volatile cache algorithms for high-performance
scm/nand flash hybrid ssd. In Silicon Nanoelectronics
Workshop (SNW), 2016 IEEE (2016), IEEE, pp. 88–89.
[65] YANG, J., PLASSON, N., GILLIS, G., TALAGALA,
N., SUNDARARAMAN, S., AND WOOD, R. HEC:
improving endurance of high performance flash-based
cache devices. In Proceedings of the 6th International
Systems and Storage Conference (2013), ACM, p. 10.
[66] YANG, Z., HOSEINZADEH, M., ANDREWS, A.,
MAYERS, C., EVANS, D. T., BOLT, R. T., BHIMANI,
J., MI, N., AND SWANSON, S. Autotiering: automatic
data placement manager in multi-tier all-flash datacen-
ter. In Performance Computing and Communications
Conference (IPCCC), 2017 IEEE 36th International
(2017), IEEE, pp. 1–8.
[67] YOON, H., MEZA, J., MURALIMANOHAR, N.,
JOUPPI, N. P., AND MUTLU, O. Efficient data map-
ping and buffering techniques for multilevel cell phase-
change memories. ACM Transactions on Architecture
and Code Optimization (TACO) 11, 4 (2014), 40.
[68] ZHAO, D., QIAO, K., AND RAICU, I. Towards
cost-effective and high-performance caching middle-
ware for distributed systems. International Journal of
Big Data Intelligence 3, 2 (2016), 92–110.
[69] ZHENG, S., HOSEINZADEH, M., AND SWANSON, S.
Ziggurat: A Tiered File System for Non-Volatile Main
Memories and Disks. In FAST (2019).
[70] ZHOU, Y., PHILBIN, J., AND LI, K. The multi-
queue replacement algorithm for second level buffer
caches. In USENIX Annual Technical Conference, Gen-
eral Track (2001).
