Flash Caching on the Storage Client by Holland, David A. et al.
 
Citation David A. Holland, Elaine Angelino, Gideon Wald, Margo I.
Seltzer. 2013.  Flash Caching on the Storage Client. In
Proceedings of the 2013 Usenix Annual Technical Conference,
San Jose, CA, June 26-28, 2013.
Published Version http://0b4af6cdc2f0c5998459-
c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/11792-
atc13-full_proceedings.pdf
Accessed February 19, 2015 12:02:59 PM EST
Citable Link http://nrs.harvard.edu/urn-3:HUL.InstRepos:11324016
Terms of Use This article was downloaded from Harvard University's DASH
repository, and is made available under the terms and conditions
applicable to Open Access Policy Articles, as set forth at
http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#OAP
Flash Caching on the Storage Client
David A. Holland, Elaine Angelino, Gideon Wald, Margo I. Seltzer
Harvard University
Abstract
Flash memory has recently become popular as a caching
medium. Most uses to date are on the storage server side.
We investigate a different structure: ﬂash as a cache on
the client side of a networked storage environment. We
use trace-driven simulation to explore the design space.
We consider a wide range of conﬁgurations and policies
to determine the potential client-side caches might offer
and how best to arrange them.
Our results show that the ﬂash cache writeback policy
does not signiﬁcantly affect performance. Write-through
is sufﬁcient; this greatly simpliﬁes cache consistency
handling. We also ﬁnd that the chief beneﬁt of the ﬂash
cache is its size, not its persistence. Cache persistence of-
fers additional performance beneﬁts at system restart at
essentially no runtime cost. Finally, for some workloads
a large flash cache allows using minuscule amounts of
RAM for file caching (e.g., 256 KB), leaving more mem-
ory available for application use.
1 Introduction
Recently ﬂash memory has become popular not only as
a storage medium but also as a caching layer in high-end
storage systems. The typical scenario has been to com-
bine ﬂash with disks, either locally or on a ﬁle server.
We look at the opposite case: ﬂash combined with the
operating system’s buffer cache, on the client side of a
networked storage system.
We consider compute servers running storage-
intensive workloads that are themselves clients in a net-
worked storage environment. There are many examples
of such servers: application servers in three-tier web ap-
plications, compute servers in data centers, render farms
used in animation, and compute nodes in scientiﬁc com-
putation clusters all ﬁt this model. Our analysis explores
a range of design issues arising from this conﬁguration:
• Must the flash cache be managed together with the
file system RAM cache or can it act as an indepen-
dent layer below it?
• Should the RAM cache be a proper subset of the
flash cache or should the two caches be treated as a
single unified cache to avoid duplication?
• How large must the flash cache be relative to RAM?
• What writeback policies should be used from RAM
to flash and from flash to the file server?
• Should a flash cache be persistent and recoverable?
• How critical is consistency across multiple caches?
This design space is already enormous, so we put aside
other relevant but secondary considerations, such as
cache replacement policy (we use LRU) and wear lev-
eling algorithms. We assume our ﬂash device comes
equipped with a ﬂash translation layer that handles wear
leveling, erase cycles, and other considerations that arise
if one uses raw ﬂash chips directly.
We explore this design space via trace-driven simula-
tion, which allows us to examine the behavior of an ex-
tensive range of conﬁgurations and cache sizes. We vali-
dated our simulator and traces against actual workloads,
but use stochastically generated workloads for our anal-
ysis, because we could not ﬁnd real-world traces with
workloads large enough to stress the ﬂash.
Our results show that all simple writeback policies,
short of synchronously writing from RAM all the way
through to the ﬁle server, produce comparable results.
This means that ﬂash caches can be write-through, which
simpliﬁes cache consistency handling. We also ﬁnd the
primary beneﬁt of ﬂash caching comes from its density.
A volatile cache medium available for a reasonable price
in similar sizes would also be attractive.
In the next section, we briefly discuss the various ways
flash is being used to boost storage performance. We then
outline the flash cache design space in Section 3. We de-
scribe our traces in Section 4 and our simulator in Sec-
tion 5. We discuss how we validated our tools and models
in Section 6 and then present the results of our simulation
study in Section 7. Our conclusions are in Section 8.
2 Related Work
Flash is widely used in high end storage servers [2, 3]
and more recently in hybrid drives that package ﬂash and
spinning media inside a single device [18, 20]. The Net-
App FlashCache [17] is a device that transparently sits in
front of a storage server, using the persistent cache to
reduce latency. FlashTier [19] is a disk controller with
an on-board persistent ﬂash cache. It explores the pos-
sibilities of using a custom ﬂash translation layer opti-
mized for caching rather than storage. All of these so-
lutions place ﬂash on the storage side of a network (or
local SATA), combining ﬂash with disk drives. Our work
examines ﬂash on the client side, combining ﬂash with
the operating system buffer cache.
NetApp’s Project Mercury [6] is a client-side ﬂash
cache that avoids explicit integration with the operating
system. It is a block-level cache that can be deployed
in various ways: a hypervisor ﬁlter driver, an OS ﬁlter
driver, an application cache, or a proxy cache for net-
work storage protocols. Mercury is one point in the de-
sign space this study explores. In Mercury, RAM stores
a proper subset of the data stored in the ﬂash cache, the
writeback policy from RAM is the operating system’s,
and the writeback policy from ﬂash is write-through.
Microsoft’s ReadyBoost [15] is a software solution in
recent Windows releases that uses a standard ﬂash de-
vice as an extension to memory for random read caching.
Windows gradually ﬁlls the ﬂash cache with data and
then services random reads from that cache, when doing
so improves performance.
Recently, Koller et al. [11] experimented with a range
of more sophisticated writeback policies for a ﬂash
cache. They found (as we did) that synchronous write-
through all the way to disk is slow. Their work is oth-
erwise complementary to ours as it explores write-back
policies more sophisticated than those we considered.
(They found, for example, that their policies can increase
write throughput by improving the batching of back-end
write requests; our simulator does not model this effect.)
One key difference is that they were working in an envi-
ronment where applications wait until writes propagate
all the way to disk. We concentrate on a more conven-
tional environment where writes return to the application
once the data is written into the operating system’s buffer
cache. As we will see, this hides the write latency of the
underlying storage tiers except under heavy write trafﬁc.
We also assume a high-performance ﬁler with sophis-
ticated read-ahead, nonvolatile cache, and large server
memory at the back end, rather than a simple disk array.
3 Flash Cache Design Space
We model an application server environment consist-
ing of one or more compute servers (“hosts”) and a ﬁle
server (“ﬁler”) connected by private network segments.
Each host runs one or more applications, involving one
or more threads of execution. Each host has cache space
that is partially RAM and partially ﬂash. As previously
mentioned this environment reﬂects a number of real-life
situations. We consider storage-intensive workloads.
We now address the design issues from Section 1.
3.1 Flash-RAM Integration
We begin by asking whether ﬂash cache support should
be integrated into the operating system’s buffer manager
or if it performs acceptably as an independent entity, as
in Mercury. The former case requires substantial kernel
modiﬁcations. The latter case allows deploying the ﬂash
cache in (or as) a self-contained device driver.
The need for integration depends on the level of co-
ordination required between the RAM and ﬂash caches.
If accessing the ﬂash via ordinary block reads and writes
performs adequately, the ﬂash cache can be independent.
On the other hand, if special policies are required, or ex-
tra metadata must be provided to the ﬂash cache, then
kernel support is required.
3.2 Placement
Our second design question is whether the RAM cache
can be a subset of the ﬂash cache. This is effectively
a choice of block placement policy. The straightfor-
ward approach is to structure the ﬂash cache as an addi-
tional independent tier of cache below the RAM cache.
The ﬂash cache services the RAM cache and the ﬁle
server services the ﬂash. Newly referenced blocks are
ﬁrst placed in ﬂash, then into RAM; the RAM cache is
always a subset of the ﬂash cache. This policy wastes
some of the capacity of the ﬂash, but is relatively simple.
Alternatively, one could use two separate layers of
cache, but choose some more elaborate policy; for ex-
ample, one might place blocks initially into RAM and
then migrate less recently (or less frequently) used blocks
down to flash. Another option is to treat the two stores as
a single uniﬁed cache and come up with some policy for
initial placement and perhaps also internal migration.
The basic question is whether the simple approach is
good enough. We would also like to estimate how much
better (if at all) an alternate placement scheme performs.
3.3 Cache Architecture
We handle integration and placement as a single choice
of cache architecture. Because the number of possible
ﬁll and migration policies is near inﬁnite, we chose three
simple alternatives to implement and test. Other options
are certainly possible and may be a worthwhile subject
of future research. These are the three architectures:
• Naive. The flash cache is treated as an indepen-
dent cache layer beneath the RAM cache; the RAM
cache is always a subset of the flash cache, requiring
no integrated management.
• Lookaside. Based on Mercury [6], writes go di-
rectly from RAM to the file server instead of being
routed through the flash. The flash is updated after
the file server and never contains dirty data. Appli-
cations see persistence guarantees identical to a sys-
tem without flash. The RAM cache is a subset of the
flash cache, requiring no integrated management.
• Unified. RAM and flash are managed together using
a single LRU chain. Data blocks are placed into the
least recently used buffer, whether RAM or flash,
and are never migrated. No attempt is made to prefer
RAM to flash. Here the RAM cache is not a subset
of the flash, so integrated management is needed.
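The structural difference between the naive and unified architectures can be sketched in a few lines of Python. This is an illustrative toy, not the paper's simulator: the class and function names are hypothetical, capacities are counted in blocks, only reads are modeled, and the naive variant refreshes flash recency on every reference (a simplification) so that LRU's inclusion property keeps the RAM contents a subset of the flash contents.

```python
from collections import OrderedDict


class LRUCache:
    """A fixed-capacity set of block numbers kept in LRU order."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def touch(self, block):
        """Reference a block; return True on a hit, False on a miss (then fill)."""
        if block in self.blocks:
            self.blocks.move_to_end(block)
            return True
        self.blocks[block] = None
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used block
        return False


def naive_read(ram, flash, block):
    """Naive layering: flash below RAM; effective capacity is the flash size."""
    in_flash = flash.touch(block)  # simplification: flash sees every reference
    if ram.touch(block):
        return "ram"
    return "flash" if in_flash else "filer"


class UnifiedCache:
    """One LRU chain over RAM and flash buffers; blocks are never migrated."""

    def __init__(self, ram_slots, flash_slots):
        self.free = ["ram"] * ram_slots + ["flash"] * flash_slots
        self.blocks = OrderedDict()  # block -> medium of the buffer holding it

    def read(self, block):
        if block in self.blocks:
            self.blocks.move_to_end(block)
            return self.blocks[block]
        if self.free:
            medium = self.free.pop()
        else:
            # Reuse the least recently used buffer, whether RAM or flash.
            _, medium = self.blocks.popitem(last=False)
        self.blocks[block] = medium
        return "filer"
```

With one RAM buffer and eight flash buffers, a workload cycling over nine blocks misses on every reference in the naive arrangement but fits entirely in the unified one, which is the same capacity effect (64 GB versus 72 GB) discussed for the baseline configuration.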
3.4 Relative Size
What size does the ﬂash cache need to be relative to the
RAM cache to be effective? We use 8 GB as the baseline
RAM size and examine ﬂash sizes ranging from 8 GB to
128 GB (1x to 16x RAM). We use 64 GB as the baseline
flash size based on the old rule of thumb that each succes-
sive layer of cache should be roughly an order of mag-
nitude larger. (Note that the RAM size actually reflects
the amount of RAM available for ﬁle system caching.
For many real-life workloads this is substantially smaller
than the total amount of RAM in the machine.)
3.5 Flash Writeback Policy
We next consider the question of when dirty blocks move
from ﬂash to the ﬁle server. We chose four policies:
• write-through - data is immediately written to the
server, blocking the requester until completion.
• asynchronous write-through - data is immediately
written to the server without blocking the requester.
• periodic - dirty data remains in the cache until a
syncer thread flushes the data back to the server.
• none - dirty data remains in the cache until evicted
for capacity reasons.
We run the periodic case with syncer periods of 1, 5,
15, and 30 seconds, resulting in seven different policies.
3.6 RAM Writeback Policy
We now consider RAM writeback policies. Since (at
least for the naive architecture) these writebacks go to
the ﬂash cache, it does not necessarily follow that the
standard behavior of ﬁle system RAM caches is correct.
We tested the same seven writeback policies that we
used for flash writeback, yielding 49 different policy con-
figurations for each of the three architectures.
We did not try other more elaborate policies (such as
trickle-ﬂushing, writing back asynchronously after a de-
lay, etc.) for either ﬂash or RAM, because we found that
nearly all the policy combinations perform identically.
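The size of the policy space follows directly from the counts above. The sketch below enumerates it; the short labels (s, a, p1, p5, p15, p30, n) are borrowed from the axis labels of Figure 2 and are assumed here to stand for synchronous write-through, asynchronous write-through, the four periodic settings, and none. The dictionary layout is purely illustrative.

```python
from itertools import product

# Syncer periods (in seconds) for the periodic policy.
PERIODS_S = [1, 5, 15, 30]

# Seven writeback policies: two write-through variants, four periodic, and none.
POLICIES = ["s", "a"] + [f"p{p}" for p in PERIODS_S] + ["n"]

ARCHITECTURES = ["naive", "lookaside", "unified"]

# Every (architecture, RAM policy, flash policy) combination.
CONFIGS = [
    {"arch": arch, "ram_policy": ram, "flash_policy": flash}
    for arch, ram, flash in product(ARCHITECTURES, POLICIES, POLICIES)
]
```

Each architecture gets 7 x 7 = 49 policy pairs, so the full sweep covers 147 configurations.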
3.7 Cache Persistence
Volatile RAM caches are emptied by system restart and
are typically left to reﬁll naturally. However, a cache kept
in persistent memory can potentially be recovered after
a crash, to avoid the performance degradation that oc-
curs when reﬁlling the cache [12]. The Rio File Cache
research prototype demonstrated the potential of such ap-
proaches as early as 1996 [7]. Today, the NetApp Mer-
cury cache exploits persistence to avoid performance
degradation after reboot [6], and high end ﬁle servers
typically use battery-backed memory similarly to accom-
plish such warm restarts [1, 2]. With ﬂash caches, cached
data can survive a restart, but the system must take pre-
cautions to ensure that the data is valid.
Our results show that the price/performance of ﬂash
makes it attractive simply as a larger cache. However,
taking advantage of its persistence can provide additional
beneﬁt. There are three chief obstacles: First, cache con-
sistency needs to be maintained; this is discussed in the
next section. Second, the cache indexing structures must
themselves be kept in the ﬂash and kept up to date and
consistent with the data blocks in the ﬂash. This creates
additional ﬂash trafﬁc and additional overhead. A naive
implementation adds an additional flash write latency ev-
ery time the flash cache is updated; a clever implementa-
tion can batch those writes. Third, if the crash was caused
by corruption in the ﬂash itself, a simple reboot may not
be sufﬁcient to restore the system to a running state.
In the lookaside architecture blocks in the ﬂash are
never dirty, so the system cannot crash with dirty blocks
that must be recovered and written back to the ﬁle server.
3.8 Cache Consistency
Normally one writes updated blocks in the RAM cache
back to the ﬁle server quickly, because RAM is volatile.
This motivation disappears with a persistent cache. If the
ﬂash cache is recoverable, as discussed in the previous
section, cache writebacks can be delayed. Some writes
will then die in the cache, reducing network contention.
However, for shared data, it also complicates cache
consistency handling. Data not written back to the ﬁle
server right away must still be reported back to the
server so other hosts do not read stale versions. And, of
course, unmodiﬁed data retained in the cache must also
be tracked in case some other host updates it.
Cache consistency is not a new problem [9, 16, 21] and
does not need a new solution; however, two new issues
arise. The size of ﬂash caches may affect the scalability
of consistency protocols; detailed modeling of this effect
is beyond the scope of our work. Furthermore, a recover-
able cache is unavailable during a reboot; it cannot ﬂush
dirty data or participate in cache consistency protocols
until afterwards. As reboots typically take at least min-
utes, this may induce unacceptable delays.
We concentrate primarily on non-shared data, e.g.,
disk images provided to clients over a SAN. We touch
brieﬂy on cache consistency only to quantify the mag-
nitude of the problem. The simulator invalidates stale
copies of blocks instantly (using global knowledge)
when a new version is ﬁrst written into a cache. This
exposes the overhead caused when these blocks must
be fetched again later. However, we only count invali-
dations; we do not model the overhead of cache consis-
tency trafﬁc, nor do we adopt any particular real-world
cache consistency model. This information gives design-
ers a basic overview of the circumstances that arise with
the much larger caches that ﬂash allows.
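The invalidation counting described above can be illustrated with a small sketch. It assumes a global map from each block to the hosts currently holding a copy; the class name and methods are hypothetical, and cache eviction is ignored for brevity, so this is a simplification rather than the simulator's actual bookkeeping.

```python
from collections import defaultdict


class InvalidationCounter:
    """Tracks which hosts cache which blocks and counts invalidations."""

    def __init__(self):
        self.cached = defaultdict(set)  # block -> set of host IDs holding a copy
        self.invalidations = 0

    def read(self, host, block):
        # A read leaves a copy of the block in the reading host's cache.
        self.cached[block].add(host)

    def write(self, host, block):
        # Every other host's copy is now stale: drop it instantly and count it.
        stale = self.cached[block] - {host}
        self.invalidations += len(stale)
        self.cached[block] = {host}
```

Only the number of invalidations is recorded; no consistency traffic or protocol latency is charged, matching the scope stated in the text.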
4 Traces
For our trace-driven simulation, we use block-level
traces containing read and write operations. Each oper-
ation identiﬁes a ﬁle and a range of blocks within that
ﬁle. Each operation also carries a thread ID and host ID.
During development and validation, we used traces
from the SNIA repository and the Mercury traces, but
for our analysis we use synthetic traces. Adequately large
real traces are, by and large, not available; when working
with a 128GB ﬂash device, we need a trace that churns
through enough data to ﬁll it and then work with it for
long enough to access plenty of data that both is and is
not in the original ﬁll. The largest trace for which we
present results moves roughly 2.5 TB of data, all told;
we were unable to locate any real traces this large.
We wrote a trace generator to produce large traces with
characteristics similar to real traces. The trace generator
starts from a list of ﬁles and ﬁle sizes from the Impres-
sions ﬁle system generator [4]. It samples this ﬁle server
model to produce working sets, then samples these to
produce I/O requests. A portion of the I/O requests are
sampled instead from the whole ﬁle server. The distri-
bution of I/Os among hosts and threads is uniform; the
distribution of I/Os among ﬁles (and selection of ﬁles
for working sets) is weighted by popularity, where small
integer popularities are generated from a Zipﬁan distri-
bution. The distribution of I/O sizes (and selection of
ﬁle subregions for working sets) is Poisson, modiﬁed by
clamping to the ﬁlesize. The distribution of I/O starting
points (and ﬁle subregion starting points) is uniform.
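A rough sketch of the sampling scheme just described may help make it concrete. Everything here is an assumption made for illustration: the function names, the Zipf exponent and popularity range, the mean I/O size, the minimum length of one block, and the treatment of a working set as a plain subset of files (the generator described above also samples file subregions). Only the Python standard library is used, so the Poisson draw uses Knuth's multiplication method.

```python
import math
import random


def zipf_popularity(rng, max_pop=10, s=1.0):
    """Small integer popularity k, drawn with probability proportional to 1/k**s."""
    ranks = list(range(1, max_pop + 1))
    weights = [1.0 / k ** s for k in ranks]
    return rng.choices(ranks, weights=weights)[0]


def poisson(rng, lam):
    """Poisson-distributed integer via Knuth's method (fine for small lam)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def make_request(rng, files, mean_blocks=8):
    """files: list of (name, size_in_blocks, popularity) with size >= 1."""
    name, size, _ = rng.choices(files, weights=[f[2] for f in files])[0]
    # Poisson I/O size clamped to the file size (at least one block: an assumption).
    length = max(1, min(poisson(rng, mean_blocks), size))
    start = rng.randrange(size - length + 1)  # uniform starting block
    return name, start, length


def generate(rng, all_files, working_set, n, ws_fraction=0.8,
             write_fraction=0.3, hosts=1, threads=8):
    """Yield n I/O records; ws_fraction of them come from the working set."""
    for _ in range(n):
        pool = working_set if rng.random() < ws_fraction else all_files
        name, start, length = make_request(rng, pool)
        yield {
            "host": rng.randrange(hosts),      # uniform over hosts
            "thread": rng.randrange(threads),  # uniform over threads
            "op": "write" if rng.random() < write_fraction else "read",
            "file": name, "start": start, "length": length,
        }
```

The defaults mirror parameters quoted later in this section (80% of I/Os from the working set, 30% writes, one host, eight threads per host).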
All traces used in the results presented are based on
the same 1.4 TB ﬁle server model we generated with Im-
pressions. (This is larger than any of the cache sizes we
use.) They use 4K blocks and have 80% of the I/Os com-
ing from the working set. They also use eight threads per
host. They grind through a total volume of data that is, in
all cases, four times the working set size, half of it being
devoted to a warmup period for which statistics are not
collected. This ensures the cache ﬁlls thoroughly. (We
checked the results of changing the working set percent-
age and the number of threads; these did not affect the
conclusions about our key questions.)
The two traces we use as a baseline use one host, one
working set, working set sizes of 60 and 80 GB (for use
with a 64 GB ﬂash), and 30% writes. For many of the
experiments we vary one or more of these parameters.
5 Simulator
As discussed earlier, we model an environment where
some number of computation servers (“hosts”) share a
single networked ﬁle server. We wrote a trace-driven
simulator for this environment.
The simulator issues I/O requests from the trace as
quickly as possible given that each application thread can
have only one I/O in progress. I/O requests may stall at
various points in the system; all executions are fully in-
terleaved. We do not try to produce realistic application-
level I/O schedules; not only is scheduling I/O traces
a known hard problem [10, 14, 22], but ﬂash substan-
tially changes the timing. Timestamps taken from envi-
ronments without ﬂash would have dubious value.
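The issue discipline (each thread submits its next request as soon as its previous one completes, ignoring trace timestamps) can be sketched as a small event loop. This is a hypothetical simplification: `replay` and `service_time` are illustrative names, and contention or stalls inside the caches and network are not modeled here.

```python
import heapq


def replay(trace_by_thread, service_time):
    """Replay per-thread request lists with one outstanding I/O per thread.

    trace_by_thread maps a thread ID to its ordered list of requests;
    service_time(request) returns how long that request takes.
    Returns the time at which the last request completes.
    """
    position = {tid: 0 for tid in trace_by_thread}
    ready = [(0, tid) for tid in trace_by_thread]  # (time thread is free, thread)
    heapq.heapify(ready)
    finish = 0
    while ready:
        now, tid = heapq.heappop(ready)
        i = position[tid]
        if i == len(trace_by_thread[tid]):
            continue  # this thread has no more requests
        done = now + service_time(trace_by_thread[tid][i])
        position[tid] = i + 1
        finish = max(finish, done)
        heapq.heappush(ready, (done, tid))
    return finish
```

Because no think time separates requests, the replay stresses the storage path as hard as the one-I/O-per-thread constraint allows.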
We model the caches in detail; each is a single LRU
chain of blocks. We treat the flash itself as a block device;
that is, we write blocks to it and read them back. We as-
sume a ﬂash translation layer but do not model it directly.
We use average per-block access times derived from test-
ing real ﬂash devices. (See Sections 6.1 and 6.2.)
The network is modeled less exactly: each segment
can carry one packet at a time, and each I/O request uses
one packet in each direction. Each packet is assumed to
incur a ﬁxed latency (for headers, block information, and
so forth) plus a small amount of additional time per bit
of block data transferred.
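The packet cost model above is a simple affine function of payload size. The sketch below evaluates it with the Table 1 network parameters, reading the base latency as 8.2 microseconds per packet (consistent with the microsecond axes of the figures) and the data latency as 1 ns per bit. The function names are illustrative, and the round-trip helper assumes that only the reply to a read carries block data.

```python
BLOCK_BYTES = 4096  # one 4K block


def packet_latency_ns(payload_bytes, base_ns=8200, per_bit_ns=1):
    """Fixed per-packet latency plus a per-bit cost for any block data carried."""
    return base_ns + per_bit_ns * payload_bytes * 8


def read_round_trip_ns(block_bytes=BLOCK_BYTES):
    """Request packet without block data, then a reply carrying one block."""
    return packet_latency_ns(0) + packet_latency_ns(block_bytes)
```

Under these assumptions a packet carrying one 4K block costs 8,200 + 32,768 = 40,968 ns, so the data transfer dominates the fixed latency.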
We do not attempt to model the caches or prefetch-
ing behavior of the ﬁler directly. Many man-years of ef-
fort have gone into providing high-end ﬁle servers with
clever and aggressive caching logic, and modeling this is
irrelevant to the main goals of this work. Instead we use
a simple model: a “fast” latency for cache hits, a “slow”
latency for misses, and a prefetch success rate that deter-
mines what fraction of reads are fast. (Which reads are
fast is random. Writes are buffered and always fast.)
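The filer model reduces to a two-point latency distribution for reads. A minimal sketch, using the Table 1 values read in microseconds (92 for a fast read, 7952 for a slow read, 90% fast) and hypothetical function names:

```python
import random


def filer_read_latency_us(rng, fast_us=92, slow_us=7952, fast_rate=0.90):
    """Randomly pick the fast or slow latency for one read."""
    return fast_us if rng.random() < fast_rate else slow_us


def expected_filer_read_us(fast_us=92, slow_us=7952, fast_rate=0.90):
    """Mean read latency implied by the two-point model."""
    return fast_rate * fast_us + (1.0 - fast_rate) * slow_us
```

With these parameters the mean is about 878 microseconds, so the 10% of slow reads account for roughly nine tenths of the expected filer read time.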
We do not model application overhead, user-kernel
transitions, hypercall delays, processing latency in the
network stack, etc. Most of these are invariant under
caching or can be incorporated elsewhere.
6 Validation
We validate two parts of our system that could produce
fallacious results if not done properly. First, we validate
our simulator against data using NetApp’s Mercury ﬂash
cache. Second, we validate that average read/write laten-
cies for our device reasonably approximate actual ﬂash
latencies.
6.1 Simulator Validation
We validated our simulator against NetApp’s Mer-
cury [6], a hardware implementation of a client-side
ﬂash cache. Working with the Mercury group, we took
four days of traces from a NetApp Windows laptop and
played them back both on their hardware and on our sim-
ulator. These traces were collected below the ﬁle system,
i.e., under the buffer cache, so we played them back di-
rectly through a 32GB flash cache. (In our simulator, that
means we set the RAM cache size to zero.)
[Figure 1 plot: SSD access latency as a function of time. Read latency and write latency (in us, axis ticks at 10, 100, and 1000) against cumulative I/Os performed (millions, 0 to 80).]
Figure 1: Flash device read (top) and write (bottom) latency;
60GB working set workload on a 58GB device. Each point is
the average of 10,000 block I/Os.
We debugged the simulator and adjusted our timing
models as necessary until the I/O throughput and laten-
cies seen above and below the ﬂash cache, as well as
accessing the ﬂash device, plus the cache hit rates, all
or nearly all matched within 10%. Many of the statistics
matched more closely. A perfect alignment is not possi-
ble, because (besides the inherent limitations of simula-
tors) Mercury is not structured identically to the simula-
tor. The simulator also does not account for an additional
application-level or other systemic overhead of roughly
10% seen in the end-to-end run times.
These measurements gave us conﬁdence that the sim-
ulator accurately models the system behavior and that its
results are meaningful.
6.2 Flash Modeling Validation
We worried that average write latencies might not ade-
quately model the behavior of a real device in the pres-
ence of ﬂash erase cycles. We bought two low-end con-
sumer grade SSDs and evaluated their latency behavior.
We modiﬁed the simulator to log I/Os to the ﬂash as
it ran and captured the results for a variety of workloads.
Then we replayed these I/Os to the SSDs and recorded
the actual read and write latencies. We also tried fully
random reads and writes with a read/write mix similar to
that found in the simulator logs.
We found three things of possible interest. First, while
both devices exhibited high variance in their access la-
tency, this variance is short-term; across a group of
10,000 to 100,000 block accesses (much less than the
length of our traces) the variance is high, but from group
to group the average behavior is quite reasonable. Sec-
ond, and perhaps of more interest, both devices main-
tained a single average write latency from beginning to
end across essentially all the workloads. This included
workloads with up to 90% (application) writes. Only
the read latency ﬂuctuated signiﬁcantly over time as
the device filled. We observed a weak relationship be-
tween higher write volumes and worse read performance;
whether this is due to erase cycles or caching or some
other internal phenomenon is anyone’s guess.
Third, the read performance replaying the simulator
logs is much better than the read performance doing
purely random I/Os. Caching workloads are not random.
Figure 1 shows a scatter plot of the read and write
latencies against time for a typical workload run. Each
point is the average of 10,000 block I/Os.
Our conclusion was that a single average access la-
tency is ﬁne for modeling writes, and viable, though not
ideal, for reads. However, our experience with ﬂash de-
vices is that each model is different, exhibiting its own
average latencies and behavioral quirks. Fortunately the
system performance does not appear to be highly sensi-
tive to ﬂash performance; see Section 7.7.
7 Results
We chose a per-block RAM access time of 400 ns, corre-
sponding to roughly 10 GB/sec memory bandwidth. An
internal limitation of the simulator restricts it to integer
multiples of 100 ns, so this speed roughly reﬂects the 10-
12 GB/s expected (and observed on an Intel Core i7 [13])
bandwidth of DDR3 RAM.
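The choice of 400 ns can be checked with a line of arithmetic, assuming a 4K block is 4096 bytes and 1 GB/s is 10^9 bytes per second:

```latex
\[
\frac{4096\ \text{B}}{400\ \text{ns}} = 1.024 \times 10^{10}\ \text{B/s} = 10.24\ \text{GB/s},
\qquad
\frac{4096\ \text{B}}{12\ \text{GB/s}} \approx 341\ \text{ns}
\;\le\; 400\ \text{ns} \;\le\;
\frac{4096\ \text{B}}{10\ \text{GB/s}} \approx 410\ \text{ns}.
\]
```

Under these assumptions, 400 ns is the only multiple of 100 ns whose implied bandwidth falls within the 10-12 GB/s range.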
We used the performance data from validating against
Mercury to choose timing models for the ﬂash and the
combined network and ﬁle server accesses. We then
picked latencies loosely corresponding to a gigabit net-
work for the network and attributed the rest of the com-
bined network and ﬁle server times to the ﬁle server. Ta-
ble 1 summarizes the timing parameters.
In evaluating possible conﬁgurations, we use the la-
tency experienced by the application as the governing
metric. Although the simulator captures a variety of other
metrics (including throughput and latencies at every level
of the stack), we use those only to explain behavior rather
than to evaluate policies.
7.1 Architecture and Writeback Policy
We begin our analysis by evaluating our naive, looka-
side, and unified architectures and how they are affected
by the 49 combinations (seven each for RAM and flash)
of writeback policies. Identifying the promising config-
urations from among the 147 possibilities allows for a
more focused comparison in the rest of the evaluation.
Parameter                   Value
RAM read                    400 ns / 4K block
RAM write                   400 ns / 4K block
Flash read                  88 µs / 4K block
Flash write                 21 µs / 4K block
Network base latency        8.2 µs / packet
Network data latency        1 ns / bit
File server fast read       92 µs / 4K block
File server slow read       7952 µs / 4K block
File server write           92 µs / 4K block
File server fast read rate  90%
Table 1: Timing Model Parameters
We used the two baseline traces described in Section 4.
We ran these traces on the corresponding baseline simu-
lator conﬁguration: 8 GB of RAM and 64 GB of ﬂash.
Figure 2 shows the average read and write latency seen
by the application across all 49 policies for the three dif-
ferent architectures. We show the 80 GB workload; the
60 GB graphs are nearly identical.
Cursory inspection of the ﬁgures reveals the ﬁrst
important result: excepting policies that result in syn-
chronous writes to the ﬁler (synchronous or none) the
writeback policy does not matter. The “none” policy
leads to synchronous evictions once the cache fills. When
the RAM policy allows this effect in the ﬂash cache to
show through to the application, as seen in the front left
and right corners of the write latency graph, multiple
threads doing evictions contend for the network, convoy,
and slow down to (less than) the speed of the ﬁle server.
While this result initially surprised us, it is entirely
reasonable: ﬂash caches are so large that any reason-
able writeback policy maintains an ample supply of clean
blocks to evict and replace; the latency exposed above the
ﬂash cache is never greater than the ﬂash write latency.
For the application to observe greater latency, it would
have to sustain a write bandwidth greater than the write-
back bandwidth to the ﬁle server for sufﬁciently long
to ﬁll many gigabytes of ﬂash with dirty blocks. While
workloads exhibiting this behavior probably exist, we ex-
pect them to be rare. Furthermore, upon filling the flash,
write latency will largely revert to that of the ﬁle server.
This produces the same effect as having no ﬂash cache.
Based on this exploration, we use one policy combi-
nation for most of the remaining analysis: a one-second
periodic RAM writeback policy (as this most closely
matches real system behavior) and asynchronous write-
through for the flash cache. Asynchronous write-through
seems like the best overall choice for the flash, as it
is equivalent to synchronous write-through for consis-
tency and integrity purposes. Meanwhile it avoids expos-
ing synchronous file server writes if the RAM cache be-
comes synchronous through dysfunction, e.g., thrashing.
[Figure 2 plots: two panels, Read Latency (80 GB) and Write Latency (80 GB), each showing latency (in us) for the naive, lookaside, and unified architectures across RAM policy and flash policy (s, a, p1, p5, p15, p30, n).]
Figure 2: Application read and write latency on the 80 GB working set as a function of RAM and flash writeback policies.
Figure 2 also shows the unified architecture produces
the lowest read latencies while the naive and lookaside
architectures produce the lowest write latencies. The read
latency results are unsurprising, because the effective ca-
pacity of the unified architecture is greater: it is the sum
of the RAM and flash sizes (72 GB) instead of just the
flash size (64 GB). When the working set fits in the
flash (60 GB), the difference is tiny, only 3.5%. How-
ever, when the working set falls out of the flash (80 GB),
we see that the larger effective cache size produces a sig-
nificant benefit, improving read latency by as much as
20%. Figure 3 illustrates in more detail how the effective
total cache size affects performance. For two of the cases
in this graph we pretended that the flash has the same
access latency as RAM. This allows distinguishing the
structural effects from the latency properties of the cache
materials. Although it is difficult to see in the graph, the
performance of the RAM-only unified architecture with
8 and 56 GB caches is identical to that of the RAM-only
naive architecture with 8 and 64 GB caches. The differ-
ence between that line and the one above it reflects the
effect the slower flash has on read latency.
[Figure 3 plot: Read Latency as a function of Working Set Size. Latency (in us, 0 to 800) against working set size (in GB, 0 to 700). Series: 8G RAM, 64G flash, Naive; 8G RAM, 64G RAM, Naive; 8G RAM, 56G RAM, Unified.]
Figure 3: Application read latencies comparing effective cache
sizes. See discussion in text.
Returning to the policy comparison in Figure 2, on the
write side, the naive and lookaside architectures perform
at RAM speed, because all writes go directly to RAM
(except for very high write rates). The unified architec-
ture, by contrast, exposes flash latency by nature; since only 1/9
of the data is placed in RAM and the rest in flash, on
average we see 8/9 of the 21 µs flash latency.
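As a worked version of this estimate, take the baseline 8 GB of RAM and 64 GB of flash, read the flash write latency as 21 microseconds and the RAM write latency as 0.4 microseconds, and assume writes land in RAM or flash in proportion to capacity:

```latex
\[
\frac{8}{8 + 64} = \frac{1}{9},
\qquad
\mathrm{E}[t_{\text{write}}] \approx \frac{1}{9}\,(0.4\ \mu\text{s}) + \frac{8}{9}\,(21\ \mu\text{s}) \approx 18.7\ \mu\text{s}.
\]
```

The RAM term contributes less than 0.05 microseconds, which is why the average is essentially 8/9 of the flash write latency.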
Stepping back, these results suggest that for read per-
formance, bigger is better and that for write performance,
the key is to avoid exposing applications to the ﬂash tim-
ing. If we assume a given cost budget, an attractive strat-
egy is to use only enough RAM to act as an effective
write buffer and then buy as much ﬂash as the budget al-
lows. We explore this option in Section 7.5. Unless oth-
erwise speciﬁed, we use the naive architecture in the re-
maining analyses, as it hides the ﬂash write latency and
offers the simplest implementation alternative.
7.2 Flash vs. No Flash
Having settled on policies, we now investigate the ad-
vantage the ﬂash cache offers. To this end we ran a range
of working set sizes, ranging from 5 GB to 640 GB, on
three sizes of flash cache (32 GB, 64 GB, and 128 GB)
as well as with no flash cache. The RAM cache size is 8
GB. The working set sizes range from smaller than RAM
to substantially larger than the largest flash cache.
[Figure 4 plot omitted. Title: Read Latency as a function of Working Set Size. X axis: Working Set Size (in GB); Y axis: Latency (in us). Series: No flash; 32 GB flash; 64 GB flash; 128 GB flash.]
Figure 4: Read latencies as a function of working set size
across a variety of flash sizes. As expected, when the working
set fits in the flash, read latency improves dramatically over
a RAM-only system.
Figure 4 shows that even when the working set far ex-
ceeds the ﬂash size, the ﬂash improves performance sig-
niﬁcantly, because the difference between ﬂash perfor-
mance and ﬁler performance is substantial. In all con-
ﬁgurations, the RAM hit rate is only 3.4%, but the ﬂash
hit rate varies from 0 (with no ﬂash) to 47% in the 128
GB configuration. Although the filer fast read time (92
µs) is quite close to that of flash (88 µs), the two orders
of magnitude difference between fast and slow ﬁler read
times is signiﬁcant, even with the 90% fast ﬁler read rate.
As we shall see in the next section, the ﬁler’s ability to
read ahead is critical in any conﬁguration. The write la-
tency ﬁgures from this experiment are not interesting: all
writes see the RAM write latency of 0.4 µs.
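The reasoning in this section can be illustrated with a simple expected-latency model. This is a sketch, not the simulator: it treats the quoted hit rates as fractions of all application reads, assumes a RAM read latency of 0.4 (the text only gives the write latency), and assumes the slow filer read is exactly 100 times the fast one ("two orders of magnitude"). All latencies share one time unit, and the function name is hypothetical.

```python
def mean_read_latency(ram_hit, flash_hit, ram_lat, flash_lat,
                      filer_fast, filer_slow, filer_fast_rate):
    """Average read latency for a RAM -> flash -> filer hierarchy.

    Hit rates are fractions of all application reads; whatever misses both
    caches goes to the filer, which answers either fast or slow.
    """
    miss = 1.0 - ram_hit - flash_hit
    filer = filer_fast_rate * filer_fast + (1.0 - filer_fast_rate) * filer_slow
    return ram_hit * ram_lat + flash_hit * flash_lat + miss * filer

# From the text: 3.4% RAM hits, 47% flash hits (128 GB), filer fast 92, flash 88,
# 90% fast filer reads. Assumed: ram_lat=0.4 and filer_slow=100 * 92.
common = dict(ram_lat=0.4, flash_lat=88.0, filer_fast=92.0,
              filer_slow=9200.0, filer_fast_rate=0.90)
no_flash = mean_read_latency(0.034, 0.0, **common)
with_flash = mean_read_latency(0.034, 0.47, **common)
print(f"no flash: {no_flash:.0f}, 128 GB flash: {with_flash:.0f}")
```

Even though a flash hit costs about as much as a fast filer read, each hit also avoids the chance of a slow filer read, which is where the saving comes from in this model.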
7.3 Filer Read-Ahead
An effect observed in Mercury [6] suggests that a large
cache reduces the ﬁle server’s ability to prefetch data. We
cannot yet quantify this effect, but we can bound it. In
Figure 5 we show the spread between an 80% prefetch
rate, which we believe to be a reasonable lower bound,
and a 95% prefetch rate, which serves as a plausible up-
per bound. The graph shows the spread for the 64 GB
ﬂash, as well as for no ﬂash, using the same range of
working set sizes used in the previous section.
[Figure 5 plot omitted. Title: Read Latency as a function of Working Set Size. X axis: Working Set Size (in GB); Y axis: Latency (in us). Series: No flash, 80% prefetch rate; No flash, 95% prefetch rate; 64 GB flash, 80% prefetch rate; 64 GB flash, 95% prefetch rate.]
Figure 5: Application-level read latency for different workload
sizes and two filer prefetch rates. Comparing the lines of similar
shape demonstrates the dramatic effect that filer prefetching has
on the resulting latency.
The application read latency is dominated by the cost
of file server misses, which cost milliseconds. In an ideal
world, installing the flash cache would not affect the file
server’s prefetch ability. Then the flash cache is beneficial
for almost all workload sizes, as can be seen in
the figure. In a pessimal world, the prefetch rate might
drop substantially; in this case the cache is beneficial for
a much narrower range of workloads: those that fit in
flash but not in RAM. This can be seen in Figure 5 as
the pocket between the lower (better) no-flash curve and
the upper (worse) with-flash curve.
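The pessimal case can be made concrete by asking how high the flash hit rate must be before a flash cache with degraded prefetching beats no flash with intact prefetching. The sketch below is illustrative only: it reuses the 92 and 88 latencies and the 3.4% RAM hit rate from Section 7.2, and assumes a slow filer read of 100 times the fast one. The function names are hypothetical.

```python
def filer_latency(prefetch_rate, fast=92.0, slow=9200.0):
    """Mean filer read latency; `slow` is an assumed 100x the fast time."""
    return prefetch_rate * fast + (1.0 - prefetch_rate) * slow

def break_even_flash_hit(ram_hit, flash_lat, filer_without, filer_with):
    """Flash hit rate at which with-flash and no-flash read latencies match.

    Solves h*flash_lat + (1 - ram_hit - h)*filer_with
           == (1 - ram_hit)*filer_without
    for h; the RAM term is identical on both sides and cancels.
    """
    return (1.0 - ram_hit) * (filer_with - filer_without) / (filer_with - flash_lat)

good = filer_latency(0.95)   # no flash, prefetching intact
bad = filer_latency(0.80)    # flash installed, prefetching degraded
h = break_even_flash_hit(ram_hit=0.034, flash_lat=88.0,
                         filer_without=good, filer_with=bad)
print(f"flash must serve more than {h:.0%} of reads to pay off")
```

Under these assumed numbers the flash only helps when it captures most of the reads, which matches the narrow pocket of workloads that fit in flash but not in RAM.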
Avoiding the pessimal world is an engineering chal-
lenge and a critical issue for the adoption of ﬂash
caching. In the presence of a ﬂash cache, the ﬁler cache
transitions from a second level cache to a third level
cache; its prefetching and replacement policies must
therefore adapt accordingly [5, 8, 23].
However, in environments where the back end is not a
ﬁler but a plain disk array [11], the prefetch rate will be
negligible and a ﬂash cache is a huge win.
7.4 Flash Cache Size
We next examined the converse case: given a ﬁxed work-
load, what happens as we increase the ﬂash cache size.
As expected, the read latency decreases as a greater por-
tion of the working set falls in the cache until the ﬂash
cache is large enough to capture the entire working set,
at which point the read latency is that of ﬂash. As there
is nothing unexpected in these results, we have omitted
the corresponding graphs.
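For readers who want to reproduce the qualitative shape, a toy trace-driven experiment suffices. The sketch below is not our simulator: it assumes a single LRU cache and a synthetic trace of uniformly random block reads, and every name in it is hypothetical.

```python
import random
from collections import OrderedDict

def lru_hit_rate(cache_blocks, working_set_blocks, accesses=20000, seed=1):
    """Hit rate of an LRU cache under uniformly random block reads."""
    rng = random.Random(seed)
    cache = OrderedDict()

    def touch(block):
        hit = block in cache
        if hit:
            cache.move_to_end(block)          # mark as most recently used
        else:
            cache[block] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)     # evict the least recently used
        return hit

    for block in range(working_set_blocks):   # warm-up pass, not measured
        touch(block)
    hits = sum(touch(rng.randrange(working_set_blocks)) for _ in range(accesses))
    return hits / accesses

for size in (250, 500, 1000, 2000):
    print(size, round(lru_hit_rate(size, working_set_blocks=1000), 3))
```

The hit rate grows with the cache size and saturates once the cache holds the whole working set, after which every read is served at cache speed.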
[Figure 6 plots omitted. Two panels: Read and Write Latency as a function of RAM Size (60 GB working set) and Read and Write Latency as a function of RAM Size (80 GB working set). X axis: RAM Size (log scale, except for 0 which really means 0), from 0 to 4G; Y axes: Write Latency (in us) and Read Latency (in us). Series: Read (p1), Read (a), Write (p1), Write (a).]
Figure 6: Application read and write latencies with small RAM
cache sizes. The (a) and (p1) notations in both graphs refer to
the RAM write-back policy: asynchronous write-through and
1-second periodic respectively. Surprisingly, a small (256 KB)
cache achieves performance comparable to much larger ones.
7.5 No RAM Cache
One intriguing possibility suggested by the previous re-
sults is to dispense with the RAM cache entirely. We run
the baseline workloads with a ﬁxed 64 GB ﬂash cache
and RAM cache sizes ranging from zero to the baseline
8 GB. We run these with both the asynchronous write-
through RAM policy (a) as well as the default 1-second
periodic writeback (p1) we chose above.
Figure 6 shows the application read and write latencies
for the 60 GB and 80 GB working sets, respectively. The
X axis is the base 2 log of the RAM size or zero for none.
The no-RAM conﬁguration does not work well, but it
is surprising how well a relatively small (e.g., 64 MB)
RAM cache performs. If we use the asynchronous write-
through policy, a tiny 256 KB is sufﬁcient as a write
buffer. For the smallest caches the periodic syncer does
not run often enough, so the RAM cache fills with dirty
blocks and performance drops.
[Figure 7 plot omitted. Title: Read and Write Latency as a function of RAM Size (5 GB working set). X axis: RAM Size (log scale, except for 0 which really means 0), from 0 to 4G; Y axes: Write Latency (in us) and Read Latency (in us). Series: Read (p1), Read (a), Write (p1), Write (a).]
Figure 7: Application read and write latencies with small RAM
cache sizes and a small workload.
The somewhat startling conclusion is that with a large,
cheap ﬂash cache, and a workload much larger than
RAM, we can allocate minimal RAM (large enough to
act as a speed-matching buffer) to ﬁle system caching,
leaving the rest of memory available for application or
operating system use!
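The difference between the two RAM writeback policies at small buffer sizes can be seen in a toy model. This sketch is an illustration under strong assumptions: time advances in abstract ticks, one write arrives per tick, the asynchronous policy drains one block per tick, and the periodic syncer cleans the whole buffer instantly once per period. The function and parameter names are hypothetical.

```python
def stalled_writes(capacity, ticks, period=None):
    """Count writes that find the RAM buffer full of dirty blocks.

    One write arrives per tick. With period=None the buffer is drained
    asynchronously, one block per tick; otherwise a syncer cleans every
    dirty block once each `period` ticks.
    """
    dirty = 0
    stalls = 0
    for t in range(ticks):
        if period is None:
            dirty = max(0, dirty - 1)        # background write-through
        elif t % period == 0:
            dirty = 0                        # periodic syncer pass
        if dirty >= capacity:
            stalls += 1                      # must write back synchronously
        else:
            dirty += 1
    return stalls

print(stalled_writes(1, 1000))                # asynchronous, one-block buffer
print(stalled_writes(10, 1000, period=100))   # periodic, buffer too small
print(stalled_writes(100, 1000, period=100))  # periodic, buffer large enough
```

In this model the asynchronous policy never stalls even with a one-block buffer, while the periodic policy needs a buffer at least as large as the number of writes that arrive between syncer passes.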
This was tantalizing, so we tried the small RAM con-
ﬁguration on RAM-sized workloads. Figure 7 shows the
latencies for a workload with a 5 GB working set. As seen
at the right, this conﬁguration carries a 25-30% penalty,
which is noticeable but far less than the factor of ﬁve or
so seen without the ﬂash cache. It may be an acceptable
tradeoff in some circumstances.
7.6 Read-mostly vs. Write-mostly
The previous results all assumed a 30% write percent-
age. We next investigate the sensitivity of our results to
the write percentage. We use our baseline working set
sizes (60 GB and 80 GB) and cache sizes (8 GB RAM
cache and 64 GB ﬂash cache), while varying the per-
centage of writes in the trace from 0% to 100%. Figure 8
shows the application-level read and write latencies. As
expected, read latency remains stable. The write latency
is also unaffected except at very high write rates, where
we start seeing synchronous writebacks from the RAM
cache that expose the ﬂash’s write latency. As the pro-
portion of writes increases, the trace runs faster, because
writes are faster than reads. At very high write rates the
1-second RAM-to-ﬂash syncer starts to fall behind. Sev-
eral other effects come into play as well, such as network
saturation, resulting in complex behavior that may be imperfectly
modeled. The portion of the graphs above 90%
writes should be taken with a grain of salt.
[Figure 8 plot omitted. Title: Read/Write Latency as a function of the % Write Operations. X axis: Percent Write Operations; Y axes: Read Latency (in us) and Write Latency (in us). Series: Read (80 GB), Read (60 GB), Write (80 GB), Write (60 GB).]
Figure 8: Application read and write latencies (in µs) as
a function of write percentage. As long as the write percentage
remains below 90%, avoiding synchronous RAM evictions,
performance is independent of the write rate.
[Figure 9 plot omitted. Title: Read Latency as a function of the flash read time. X axis: Flash read time (in us); Y axis: Latency (in us). Series: Read lookaside (80 GB), Read naive (80 GB), Read unified (80 GB), Read lookaside (60 GB), Read naive (60 GB), Read unified (60 GB).]
Figure 9: Application read latencies (in µs) for a range of flash
read latencies (shown) and write latencies (proportional), in µs.
The beneﬁt of ﬂash caching increases with write ratio
because writes never incur a ﬁle server latency by miss-
ing in the cache: they always go straight to cache and are
written back in the background.
7.7 Flash Timings
As ﬂash devices vary a good deal in performance, we
wanted to test a variety of ﬂash timing conﬁgurations.
Once again, the results were as expected: where the ﬂash
latencies appear directly, they scale with the ﬂash speed;
where they are hidden, changing the ﬂash speed has no
effect; and where they participate in the total latency, the
overall latency scales linearly.
[Figure 10 plot omitted. Title: Read Latency as a function of Working Set Size. X axis: Working Set Size (in GB); Y axis: Latency (in us). Series: No flash warmed; 64 GB flash, not warmed; 64 GB flash warmed.]
Figure 10: Effect of persistence. The not-warmed case is
equivalent to having a non-persistent cache and crashing at the
beginning of the simulator run. The no flash case is provided
for comparison.
Figure 9 shows the application-level read latency for
a range of ﬂash timings for both standard traces and all
three cache architectures. The leftmost point represents
the potential performance of phase-change memory.
When the working set ﬁts in ﬂash, the architecture
makes little difference, but when it falls out, we see the
beneﬁt of the larger effective sizes of the uniﬁed archi-
tecture. In all cases, however, application latency scales
linearly with the ﬂash latency, so improvements in ﬂash
timings are readily visible to the application.
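The linear scaling follows from treating the flash read time as one term of a weighted sum. In this sketch the flash hit fraction (0.8) and the remaining latency terms (50.0) are hypothetical placeholders rather than measured values; only the linear form matters.

```python
def read_latency(flash_latency, flash_hit, other_terms):
    """Read latency as a linear function of the flash read time."""
    return flash_hit * flash_latency + other_terms

# Sample the model at several flash read times and compute successive slopes.
points = [(t, read_latency(t, flash_hit=0.8, other_terms=50.0))
          for t in (0, 25, 50, 100)]
slopes = [(y2 - y1) / (x2 - x1)
          for (x1, y1), (x2, y2) in zip(points, points[1:])]
print(points, slopes)
```

Every slope equals the fraction of reads served from flash, so a faster device lowers application latency in direct proportion to how often the flash is hit.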
7.8 Persistence
We approximated the cost of making the flash persistent by
doubling the flash write latency to model performing two
flash writes per block, one for the data and one for the
metadata describing the block. (We did not attempt to
simulate the recovery phase.) We investigated the ben-
eﬁt by skipping the warming phase of our traces; this
is equivalent to having a non-persistent ﬂash cache and
crashing at the start of the simulator run.
The result is that the increased ﬂash write latency as-
sociated with persistence is invisible to the application.
This is consistent with our other results where the ﬂash
write latency is also invisible. However, the beneﬁt of
persistence, or rather the potential cost of not providing
persistence, is substantial, as shown in Figure 10.
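Why the doubled write cost stays hidden can be sketched with a simple threshold model. This is an assumption-laden illustration, not the simulator: it supposes that writes cost only the RAM latency while the background drain to flash keeps up, and are throttled to the flash write cost otherwise. The latencies 21 and 0.4 are the flash and RAM write latencies quoted earlier (same time unit); the arrival rates and all names are hypothetical.

```python
def sustainable_write_rate(flash_write_latency, writes_per_block):
    """Blocks per unit time that the flash can absorb in the background."""
    return 1.0 / (flash_write_latency * writes_per_block)

def app_write_latency(arrival_rate, ram_latency, flash_write_latency, writes_per_block):
    """Application-visible write latency behind a RAM write buffer.

    Assumption: while the background drain keeps up, writes cost only the
    RAM latency; otherwise they are throttled to the flash write cost.
    """
    if arrival_rate <= sustainable_write_rate(flash_write_latency, writes_per_block):
        return ram_latency
    return flash_write_latency * writes_per_block

# writes_per_block=2 models the extra metadata write needed for persistence.
plain = app_write_latency(0.01, 0.4, 21.0, writes_per_block=1)
persistent = app_write_latency(0.01, 0.4, 21.0, writes_per_block=2)
print(plain, persistent)
```

In this model persistence halves the write rate the flash can sustain, but below that rate the application sees the same RAM-speed writes either way.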
7.9 Cache Consistency
As discussed in Section 3.8, ﬂash caches introduce two
problems related to consistency: their larger size, and,
for recoverable caches, that the cache is offline during
reboots. These may affect cache consistency protocols.
[Figure 11 plots omitted. Two panels: Invalidations as a function of % Write Operations (Y axis: Invalidations (% of blocks written)) and Read Latency as a function of % Write Operations (Y axis: Read Latency (in us)). X axis: Percent Write Operations. Series: No flash (80 GB), No flash (60 GB), 64 GB flash (80 GB), 64 GB flash (60 GB).]
Figure 11: Invalidations required, and read latency, as a function
of write percentage.
We generated two additional families of traces, using
two hosts, to investigate the effect of size on consistency
control. As a worst-case scenario we make the two hosts
share one working set. In the ﬁrst family, we examine
varying write percentages; in the second, we examine
a range of working set sizes. Writing a new version of
a block into a cache must invalidate all copies in other
caches. We measure the fraction of (application-level)
block writes that require invalidations.
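The measurement can be illustrated with a toy two-host model. This sketch is not our trace generator or simulator: it assumes one LRU cache per host, uniformly random accesses to a fully shared working set, and that both reads and writes leave the block in the accessing host's cache. All names and sizes are hypothetical.

```python
import random
from collections import OrderedDict

def invalidation_fraction(cache_blocks, working_set_blocks, write_pct,
                          ops=20000, seed=1):
    """Fraction of block writes that must invalidate a copy on the other host."""
    rng = random.Random(seed)
    caches = [OrderedDict(), OrderedDict()]   # one LRU cache per host
    writes = invalidations = 0
    for _ in range(ops):
        host = rng.randrange(2)
        block = rng.randrange(working_set_blocks)
        mine, other = caches[host], caches[1 - host]
        if rng.random() * 100 < write_pct:
            writes += 1
            if block in other:                # stale copy on the other host
                del other[block]
                invalidations += 1
        mine[block] = True                    # read or write caches the block
        mine.move_to_end(block)               # mark as most recently used
        if len(mine) > cache_blocks:
            mine.popitem(last=False)          # evict the least recently used
    return invalidations / writes if writes else 0.0

small = invalidation_fraction(10, 200, write_pct=30)    # cache much smaller than the working set
large = invalidation_fraction(200, 200, write_pct=30)   # cache holds the whole working set
print(f"small cache: {small:.2f}, large cache: {large:.2f}")
```

The larger the cache relative to the shared working set, the more likely the other host still holds a copy when a block is written, so the invalidation fraction rises with cache size.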
Figure 11 shows the percentage of blocks written re-
quiring invalidation and application read latency, as a
function of the write percentage. The write latencies (for
the 64 GB ﬂash) are comparable to those in Figure 8.
Figure 12 shows, for the baseline setting of 30%
writes, the percentage of invalidations and the applica-
tion read latency as a function of the working set size.
The write latency results are uniform and are not shown.
[Figure 12 plots omitted. Two panels: Invalidations as a function of Working Set Size (Y axis: Invalidations (% of total blocks)) and Read Latency as a function of Working Set Size (Y axis: Read Latency (in us)). X axis: Working Set Size (in GB). Series: No flash; 64 GB flash.]
Figure 12: Invalidations required, and read latency, as a function
of working set size.
The primary finding is that for workloads that fit in
flash, the percentage of writes requiring invalidation is
high, even relative to workloads that fit in RAM with
no ﬂash. The invalidation rate drops off for out-of-cache
workloads, but neither as quickly nor as signiﬁcantly
as with the smaller RAM cache. This has implications
for read performance as well. Comparing the application
read latency graphs (Figure 11 to Figure 8 and Figure 12
to Figure 4), we see that while the ﬂash provides an ad-
vantage, read latency increases with the fraction of in-
validations, because invalidated blocks must be reread
from the ﬁler. Although this is a worst case analysis (both
servers share the entire working set), these results high-
light critical areas in cache management design.
8 Conclusions
The results of our simulations show that even in its sim-
plest implementation, a client-side ﬂash cache provides
signiﬁcant beneﬁts to applications. We now review our
ﬁndings regarding the design questions from Section 1.
The flash cache does not need to be integrated with the
file system. While doing so increases the effective size of
the cache, given the relative sizes (and prices) of RAM
and ﬂash this effect is fairly small and may not justify the
implementation complexity.
The ﬂash cache can be as large relative to RAM as
desired. In fact, except for workloads that ﬁt entirely into
RAM, it makes sense to limit the RAM cache to the space
needed to buffer writes, keeping the cache only in ﬂash.
Any writeback policy that avoids synchronous writes
and does not allow the cache to become full of dirty
data produces good performance. Prompt writeback from
ﬂash exposes cache consistency events at no cost, and
these cache consistency events are potentially important.
It is not necessary to make the cache persistent (that is,
recoverable) to beneﬁt from it. However, doing so offers
signiﬁcant additional beneﬁt.
Cache consistency is a serious issue when multiple
hosts actively modify a shared working set. Even with
a write-through ﬂash cache, such workloads cause sub-
stantially higher invalidation trafﬁc than we see with tra-
ditional RAM-based caches. Also, traditional cache con-
sistency protocols may not be able to cope with a recov-
erable cache being ofﬂine while recovering.
There is much follow-on work to be done. The most
important area of further research is adapting ﬁle servers
to these larger caches, ensuring that we can retain excel-
lent read-ahead behavior when we do miss in the ﬂash. In
the presence of data shared among multiple hosts, each
with its own ﬂash cache, it is necessary to explore the
details of maintaining cache consistency among the mul-
tiple caches. Finally, ﬂash caching is a good candidate
for a custom ﬂash translation layer [19] – exploring ap-
proaches and algorithms as well as establishing satisfac-
tory lifetime for this application remains as future work.
9 Acknowledgements
This work was supported by NetApp. In addition, James
Lentini, Keith Smith, and Chris Small, all of NetApp,
were tremendously helpful in providing us with the
means and expertise to validate our simulator.
References
[1] Smart Array technology: Advantages of battery-backed
cache. http://h10032.www1.hp.com/ctg/Manual/
c00257513.pdf, 2002.
[2] Oracle, Sun launch high-end OLTP server. PCWorld, Sep 2009.
[3] EMC outlines strategy to accelerate ﬂash adoption. In EMCWorld
2011 (May 2011), http://www.emc.com/about/news/
press/2011/20110509-05.htm.
[4] AGRAWAL, N., ARPACI-DUSSEAU, A. C., AND ARPACI-
DUSSEAU, R. H. Generating realistic impressions for ﬁle-system
benchmarking. Trans. Storage 5 (December 2009), 16:1–16:30.
[5] BUTT, A. R., GNIADY, C., AND HU, Y. C. The performance
impact of kernel prefetching on buffer cache replacement algo-
rithms. In Proc. SIGMETRICS 2005 (Banff, Alberta, Canada,
2005), ACM, pp. 157–168.
[6] BYAN, S., ET AL. Mercury: Host-side ﬂash caching for the data
center. In 28th IEEE Symposium on Mass Storage Systems and
Technologies (MSST 2012) (April 2012), pp. 1–12.
[7] CHEN, P. M., NG, W. T., CHANDRA, S., AYCOCK, C., RA-
JAMANI, G., AND LOWELL, D. The Rio ﬁle cache: Surviving
operating system crashes. In Proc. ASPLOS (October 1996).
[8] FORNEY, B. C., ARPACI-DUSSEAU, A. C., AND ARPACI-
DUSSEAU, R. H. Storage-aware caching: revisiting caching for
heterogeneous storage systems. In Proc. FAST (Monterey, CA,
2002), USENIX Association, pp. 5–5.
[9] HOWARD, J. H., ET AL. Scale and performance in a distributed
ﬁle system. ACM Trans. Comput. Syst. 6 (February 1988), 51–81.
[10] JOUKOV, N., WONG, T., AND ZADOK, E. Accurate and efﬁcient
replaying of ﬁle system traces. In Proc. FAST (San Francisco,
CA, 2005), USENIX Association, pp. 25–25.
[11] KOLLER, R., ET AL. Write policies for host-side ﬂash caches. In
Proc. FAST (San Jose, CA, 2013), USENIX Assoc., pp. 45–58.
[12] KOURAI, K. CacheMind: Fast performance recovery using a
virtual machine monitor. In Dependable Systems and Networks
Workshops (DSN-W) (July 2010), pp. 86–92.
[13] MCCALPIN, J. D. Stream: Sustainable memory bandwidth in
high performance computers. Tech. rep., University of Virginia,
Charlottesville, Virginia, 1991-2011. A continually updated tech-
nical report. http://www.cs.virginia.edu/stream/.
[14] MESNIER, M. P., ET AL. Trace: parallel trace replay with ap-
proximate causal events. In Proc. FAST (San Jose, CA, 2007),
USENIX Association, p. 24.
[15] MICROSOFT. ReadyBoost. http://windows.
microsoft.com/en-US/windows7/products/
features/readyboost, 2009.
[16] NELSON, M. N., WELCH, B. B., AND OUSTERHOUT, J. K.
Caching in the Sprite network ﬁle system. ACM Trans. Comput.
Syst. 6 (February 1988), 134–154.
[17] NETAPP. Flash Cache. http://www.netapp.com/us/
products/storage-systems/flash-cache/.
[18] RAIDON. HyBrid RunneR iH2420-2S-S2 data sheet.
http://www.raidon.com.tw/content.php?sno=
0000462&p_id=113, 2010.
[19] SAXENA, M., SWIFT, M. M., AND ZHANG, Y. FlashTier: a
lightweight, consistent and durable storage cache. In Proc. Eu-
roSys (Bern, Switzerland, 2012), ACM, pp. 267–280.
[20] SEAGATE. Momentus XT product data sheet. http:
//www.seagate.com/docs/pdf/datasheet/disc/
ds_momentus_xt_retail.pdf, 2009.
[21] SHEPLER, S., ET AL. NFS version 4 protocol. http://www.
ietf.org/rfc/rfc3530.txt, April 2003.
[22] VIJAYAKUMAR, K., MUELLER, F., MA, X., AND ROTH, P. C.
Scalable I/O tracing and analysis. In Proc. Workshop on Petascale
Data Storage (Portland, Oregon, 2009), ACM, pp. 26–31.
[23] YADGAR, G., FACTOR, M., AND SCHUSTER, A. Karma: know-
it-all replacement for a multilevel cache. In Proc. FAST (San Jose,
CA, 2007), USENIX Association, pp. 25–25.