Stochastic Modeling of Hybrid Cache Systems by Ju, Gaoying et al.
arXiv:1607.00714v2 [cs.PF] 30 Sep 2016
Stochastic Modeling of Hybrid Cache Systems
Gaoying Ju1, Yongkun Li1,2, Yinlong Xu1,3, Jiqiang Chen1, John C. S. Lui4
1School of Computer Science and Technology, University of Science and Technology of China
2Collaborative Innovation Center of High Performance Computing, National University of Defense Technology
3AnHui Province Key Laboratory of High Performance Computing
4Department of Computer Science and Engineering, The Chinese University of Hong Kong
{jgy93317, cjqld}@mail.ustc.edu.cn, {ykli, ylxu}@ustc.edu.cn, cslui@cse.cuhk.edu.hk
Abstract—In recent years, there has been an increasing demand for
big-memory systems to perform large-scale data analytics. Since
DRAM is expensive, some researchers have suggested using
other memory technologies, such as non-volatile memory
(NVM), to build large-memory computing systems.
However, whether NVM technology can be a viable alternative
(both economically and technically) to DRAM remains an open
question. To answer this question, it is important to consider
how to design a memory system from a “system perspective”,
that is, incorporating the different performance characteristics and
price ratios of hybrid memory devices.
This paper presents an analytical model of a “hybrid page
cache system” to understand the diverse design space and
performance impact of a hybrid cache system. We consider
(1) various architectural choices, (2) design strategies, and (3)
configurations of different memory devices. Using this model,
we provide guidelines on how to design a hybrid page cache to
reach a good trade-off between high system throughput (in I/Os
per second, or IOPS) and fast cache reactivity, which is defined by
the time to fill the cache. We also show how one can configure
the DRAM capacity and NVM capacity under a fixed budget.
We pick PCM as an example of NVM and conduct numerical
analysis. Our analysis indicates that incorporating PCM in a page
cache system significantly improves the system performance, and
that in some cases it is more beneficial to allocate more PCM in
the page cache. Besides, under the common setting of the
performance-price ratio of PCM, the “flat architecture” is the better
choice, but the “layered architecture” outperforms it if PCM write
performance can be significantly improved in the future.
Keywords-Stochastic Model; Mean-field Analysis; Hybrid
Cache Systems
I. INTRODUCTION
In modern computer systems, it is commonly accepted that
secondary storage devices such as hard disk drives
(HDDs) are orders of magnitude slower than memory devices
like DRAM. Even though flash-based storage devices
like solid-state drives (SSDs), which are much faster than
HDDs, have developed quickly and become widely used in
recent years, they cannot replace DRAM since SSDs have
lower I/O throughput than DRAM (at least an order of
magnitude lower). Due to the large performance gap between
memory and secondary storage, I/O access is a major
bottleneck for computer system performance. To address this
issue, one commonly used technique is to allocate some memory
as a page cache, which exploits workload locality by buffering
recently accessed data in fast memory for a short
time before flushing it to the slow storage devices. Using
page caches, one can mitigate the performance mismatch
between memory and storage.
Traditional page caches usually use DRAM due to its high
throughput (in terms of IOPS), e.g., [2], [11], [15]. However,
relying solely on DRAM has at least three limitations. First,
the development of DRAM technology has already reached its
limit, e.g., DRAM scaling is more difficult as charge storage
and sensing mechanisms will become less reliable when scaled
to thinner manufacturing processes [17]. Second, the price of
DRAM is still much higher than that of HDDs or SSDs, and
DRAM also consumes much more energy due to its refresh
operations. As a result, DRAM-based main memory accounts for
a significant portion of the total system cost and energy as its
size increases [12]. Finally, DRAM is a volatile device, and data
in DRAM is lost upon any power failure. Hence, keeping a
lot of data in DRAM lowers the system reliability.
Non-volatile memory (NVM) technologies (e.g. PCM, STT-
MRAM, ReRAM) offer an alternative to DRAM due to their
byte-addressable feature (which is similar to DRAM) and
higher throughput than flash memory. In particular, NVM is
commonly accepted as a new tier in the storage hierarchy
“between” DRAM and SSDs, and it also poses a design trade-
off when we use it as page cache. On the one hand, it is much
faster than flash-based SSDs but still slower than DRAM, so
replacing DRAM with NVM in page cache may degrade the
system performance. On the other hand, the price and single-
device capacity of NVM are also considered to lie between
DRAM and SSDs, so one can have more NVM storage
capacity than DRAM given a fixed budget. Furthermore, due
to the non-volatile property of NVM, even keeping a large
amount of data in NVM does not reduce the system reliability.
Thus, it is possible to have a large page cache with NVM,
which increases the cache hit ratio and as a result improves
the overall system performance. Therefore, it remains an open
question whether it is more efficient to consider a hybrid cache
system with both DRAM and NVM, and how to fully utilize
the benefits of NVM in page cache design. This motivates us
to develop a mathematical model to comprehensively study
the impact of architecture design and system configurations
on page cache performance, and explore the full design space
when both DRAM and NVM are available.
However, analyzing a hybrid cache system is challenging.
First, including NVM in page cache clearly introduces system
heterogeneity, and so it offers more choices for system design
and severely increases the analysis complexity. For example,
when both DRAM and NVM are used, should we consider a
“flat architecture” which places DRAM and NVM in the same
level and accesses them in parallel, or consider a “layered ar-
chitecture” which uses DRAM as a cache for NVM? Another
question is how to allocate the capacity of each device under
a fixed budget so as to maximize the system performance.
Second, since accesses to DRAM and NVM have different
latencies, it is not accurate to analyze the system performance
by deriving only the hit ratio as in traditional cache analysis.
In fact, one needs to explicitly take the difference in latency
into account in the analysis. We emphasize that measurement
studies with a simulator or prototype are also feasible, but
they may be inefficient due to the wide range of choices
in system design. In contrast, an analytical model is easy
to parameterize and generally needs less running time.
Motivated by the list-based model developed by Gast et al.
in [6], in this paper, we extend the model to analyze hybrid
cache systems under both the flat and layered architectures.
We also take into consideration the device heterogeneity by
defining a latency-based model to characterize the cache
performance so as to explore the full design space and the
optimal architecture design. To the best of our knowledge, this
is the first work which uses mathematical models to analyze
hybrid cache systems with DRAM and NVM.
The main contributions of this paper are as follows.
• We extend the list-based model in [6] to characterize the
dynamics of cache content distribution in hybrid cache
systems under both flat and layered architectures, and
derive the steady-state solution by using a mean-field
approximation. We make each device operate in a fine
granularity by dividing it into multiple lists with a layered
structure so as to explore the optimal system performance
and full design space.
• We propose a latency-based metric to quantify the hy-
brid cache performance. To support the latency model,
we conduct measurements at the Linux kernel level to
obtain the average request delay at the granularity of
nanoseconds. With this latency model, we are able to
take the heterogeneity of different devices into account
so as to study the impact of different design choices on
hybrid cache performance with higher accuracy.
• We validate our analysis with simulations by modifying
the DRAMSim2 simulator [18]. We further study the
impact of different architectures (flat or layered) and
different system settings, such as the number of lists in
each cache device, the performance-price ratio of NVM,
as well as the capacity allocation of each cache device,
on the hybrid cache performance via numerical analysis.
• Our analysis results show that incorporating PCM in
hybrid cache design significantly improves the system
performance over traditional DRAM-only cache under the
common setting of performance-price ratio. Furthermore,
the hybrid cache design needs to be adjusted accordingly
when the ratio varies. In particular, the number of lists
in each cache device should be configured carefully to
achieve a good trade-off between the cache performance
and cache reactivity. Besides, under the common setting
of performance-price ratio of PCM, flat architecture offers
a better choice, but layered architecture outperforms if
PCM write performance gets significantly improved.
The rest of this paper proceeds as follows. In §II, we
introduce the architecture design and system configurations
of hybrid page cache, and formulate multiple design issues
to motivate our study. We present the Markov model for
characterizing the cache content distribution in §III, and derive
the mean-field approximation in §IV. We validate our analysis
by using the DRAMSim2 simulator in §V, and show the analysis
results and insights via numerical analysis in §VI. Finally, we
review related work in §VII, and conclude the paper in §VIII.
II. DESIGN CHOICES AND ISSUES OF HYBRID CACHE
In this section, we first introduce the system architecture
and design choices of hybrid cache systems that we analyze
in this paper. In particular, we consider two types of system
architectures: flat architecture and layered architecture (see
§II-A), and study a fine-grained list-based cache replacement
algorithm (see §II-B). After that, we formulate several design
issues to motivate our study (see §II-C).
A. System Architecture
We focus on a hybrid cache design that is composed of both
DRAM and NVM. For ease of presentation, we call DRAM
and NVM used in a cache D-Cache and N-Cache, respectively,
and assume that we have mD DRAM pages and mN NVM
pages with the same page size, say 4KB, in the system. That
is, the capacity of D-Cache is mD, and that of N-Cache is
mN . We also denote m as the total capacity of the hybrid
cache, i.e., m = mD + mN . We denote the overall system
cost as $C = m_D c_D + m_N c_N$, where cD and cN denote
the price/cost of each page of DRAM and NVM, respectively.
To organize D-Cache and N-Cache, we further divide each
of them into multiple lists, each of which contains a certain
number of pages, and denote the number of lists in D-Cache
and N-Cache as hD and hN , respectively. We label the lists
of N-Cache as l1, · · · , lhN , and label the lists of D-Cache as
lhN+1, · · · , lh, where h = hN + hD denotes the total number
of lists in the whole system. For list li, we define its capacity
as mi, and we let $\mathbf{m} = (m_1, \dots, m_h)$, with
$\sum_{i=1}^{h} m_i = m$, describe the whole cache system.
We denote the secondary storage layer as list l0. Without
loss of generality, we call list li the i-th list, i.e., li = i.
Figure 1 shows an example of the list-based organization of
D-Cache and N-Cache under different architectures.
To design a hybrid cache with both D-Cache and N-Cache,
we consider two architectures: flat architecture and layered
architecture, which are described as follows.
• Flat architecture: In this design, both D-Cache and N-Cache
are placed at the same level and accessed in
parallel, as shown in Figure 1(a). In particular, a new
data page which has not been cached before is cached
either in D-Cache with probability α or in N-Cache with
probability 1−α. Note that α is a tunable parameter, and
increasing it means that D-Cache is preferred more
often. In the flat architecture, pages are never migrated
between the two types of caches.
Note that both D-Cache and N-Cache contain multiple
lists. To exploit workload locality, we let pages be
first buffered in the list with the smallest label in the
corresponding cache, and then upgraded to the larger-numbered
lists when they become hot (e.g., when a cache
hit happens). That is, lists in the same cache device are
organized in a layered structure.
• Layered architecture: In this design, we use D-Cache as a
caching layer for N-Cache, as shown in Figure 1(b). In
particular, a new data page is first buffered in N-Cache,
and when a page in the list with the largest label in
N-Cache is accessed, it is upgraded to D-Cache. Similarly,
we also organize the lists in both D-Cache and N-Cache in
a layered structure. Note that data migration between
D-Cache and N-Cache happens here, and usually, data in
D-Cache is considered to be hotter than data in N-Cache.
Fig. 1. Architecture of hybrid cache: (a) flat architecture; (b) layered architecture.
B. Cache Replacement Algorithm
For cache replacement, we follow the list-based algorithm
introduced in [6], and extend it to hybrid cache with different
architectures. Roughly speaking, a new data page enters into
a cache through the first list and moves to the upper list by
exchanging with a randomly selected data page whenever a
cache hit occurs. Specifically, when a data page k is requested
at time t, one of the three events below happens:
• Cache miss: Page k is in neither D-Cache nor N-Cache. In
this case, page k enters the first list in D-Cache
(i.e., list lhN+1) with probability α or the first list
in N-Cache (i.e., list l1) with probability (1 − α) under
the flat architecture. For the layered architecture, page k
enters the first list of N-Cache (i.e., list l1). For both
architectures, the position in the list for writing page k is
chosen uniformly at random. Meanwhile, the victim page
in that position moves back to list 0.
• Cache hit in list li where li ≠ lhN and li ≠ lh: In this
case, page k moves to a randomly selected position v of
list li+1, and the victim page in position v of list
li+1 takes the former position of page k.
• Cache hit in list li where li = lhN or li = lh: In this
case, page k remains at the same position under the flat
architecture. However, for the layered architecture with
li = lhN , page k moves to a random position in list li+1
as in the second case.
Figure 1 shows the data flow under flat and layered archi-
tectures. Note that data migration happens between lists of the
same type of cache, while the migration between D-Cache and
N-Cache happens only in the case of layered architecture.
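As a rough illustration of the replacement rules above, the following Python sketch simulates them under independent random requests. It is not the authors' simulator: the function name `simulate_hybrid_cache`, its parameters, and the convention that the cache starts empty (with `None` marking a free slot that is swapped like a page) are assumptions made for this example. Lists are indexed from 0, so indices 0 to h_n − 1 belong to N-Cache and the rest to D-Cache.

```python
import random


def simulate_hybrid_cache(p, sizes, h_n, arch="flat", alpha=0.5,
                          steps=50_000, seed=0):
    """Simulate the list-based replacement rules; return (miss ratio, lists).

    p      : access probability of each page (hypothetical input)
    sizes  : capacities m_1..m_h, N-Cache lists first, then D-Cache lists
    h_n    : number of N-Cache lists (assumed 1 <= h_n < len(sizes))
    """
    rng = random.Random(seed)
    h = len(sizes)
    slots = [[None] * m for m in sizes]   # slots[i][v] = page id or None
    loc = {}                              # page id -> (list index, position)
    misses = 0
    for k in rng.choices(range(len(p)), weights=p, k=steps):
        if k not in loc:
            # Cache miss: enter the first list of D-Cache (flat, prob. alpha)
            # or the first list of N-Cache (otherwise).
            misses += 1
            i = h_n if (arch == "flat" and rng.random() < alpha) else 0
            v = rng.randrange(sizes[i])
            victim = slots[i][v]
            if victim is not None:
                del loc[victim]           # victim goes back to list 0
            slots[i][v] = k
            loc[k] = (i, v)
            continue
        i, pos = loc[k]
        # Top lists: l_h always; l_hN only under the flat architecture.
        if i == h - 1 or (arch == "flat" and i == h_n - 1):
            continue
        # Cache hit below a top list: swap with a random slot one list up.
        v = rng.randrange(sizes[i + 1])
        victim = slots[i + 1][v]
        slots[i + 1][v] = k
        loc[k] = (i + 1, v)
        slots[i][pos] = victim
        if victim is not None:
            loc[victim] = (i, pos)
    return misses / steps, slots
```

Under the flat architecture with `alpha=0.0`, no page ever reaches the D-Cache lists in this sketch, which matches the rule that pages are never migrated between the two caches.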
C. Design Issues
Note that the overall performance of a hybrid cache system
may depend on various factors, such as system architecture,
capacity allocation between DRAM and NVM, as well as
the configuration parameters like the number of lists in each
cache device. Thus, it poses a wide range of design choices
for hybrid cache, which makes it very difficult to explore
the full design space and optimize the cache performance.
To understand the impact of hybrid cache design on system
performance, in this work, we aim to address the following
issues by developing mathematical models.
• For each architecture (flat or layered), what is the impact
of the list-based hierarchical design, and how to set the
best parameters so as to optimize the overall performance,
including the numbers of lists hD and hN , as well as the
preference parameter α for the flat architecture?
• Which architecture should be used when combining both
DRAM and NVM in a hybrid design?
• Under a fixed budget C, what is the best capacity
allocation of each cache type for better performance?
III. SYSTEM MODEL
In this section, we first describe the workload model, then
characterize the dynamics of data pages in hybrid cache, and
finally derive the cache content distribution in steady state.
After that, we define a latency-based performance metric based
on the cache content distribution so as to quantify the overall
cache performance.
A. Workload Model
In this work, we focus on cache-effective applications like
web search and database query [22], [11], in which memory
and I/O latency are critical to system performance. Thus,
caching files in main memory becomes necessary to provide
sufficient throughput for these applications. To provide high
data reliability, we assume the write-through policy, in
which data is also written to the storage tier once it is buffered
in the page cache. With this policy, all data pages in cache
should have a copy in the secondary storage.
In this paper, we focus on the independent reference model
[6] in which requests in a workload are independent of each
other. Since caching mainly benefits read performance, we
focus on read requests only, although our model can also be
extended to write requests. Suppose that we have n data
pages in the system. In each time slot, one read request arrives,
and it accesses data pages according to a particular distribution
where page k (k = 1, 2, ..., n) is accessed with probability pk.
Clearly, we have $\sum_{k=1}^{n} p_k = 1$. Without loss of generality,
we assume that pages are sorted in the decreasing order of
their popularity. That is, if i < j, then pi ≥ pj . It is well
known that workload possesses high skewness in the sense
that a small portion of data pages receive a large fraction of
requests, and the access probability usually follows a Zipf-
like distribution [3], [23]. Thus, we model pk’s as a Zipf-like
distribution. Mathematically, we let
$$p_k = c\,k^{-\gamma}, \qquad \gamma > 0,$$
where c is the normalization constant. We would like to emphasize
that our model also allows other forms of distributions.
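For concreteness, the Zipf-like popularity vector can be computed as in the short Python sketch below. The function name `zipf_popularity` is an assumption of this example; the normalization constant c is simply the reciprocal of the sum of the unnormalized weights.

```python
def zipf_popularity(n, gamma):
    """Return [p_1, ..., p_n] with p_k = c * k**(-gamma) and sum(p) == 1."""
    weights = [k ** (-gamma) for k in range(1, n + 1)]
    c = 1.0 / sum(weights)          # normalization constant
    return [c * w for w in weights]


# Example: the setting n = 1000, gamma = 0.8 used later for validation.
p = zipf_popularity(1000, 0.8)
```

Pages are already sorted by decreasing popularity, so `p[0]` is the access probability of the hottest page.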
B. Markov Model
In this subsection, we extend the mathematical model in
[6] to capture the dynamics of data pages in a hybrid cache
system with different architectures, and then derive the steady-
state distribution to quantify the hit ratio of each request.
Note that we have n data pages in total in the system, and
the total capacity of the hybrid cache is m. Without loss of
generality, we assume that m < n, so only parts of data pages
can be kept in the hybrid cache. To characterize the system
state of the hybrid cache, we use a random variable Xk,i(t)
(k = 1, 2, · · · , n, and i = 1, 2, · · · , h) to denote whether page
k is in list li at time t. If yes, we let Xk,i(t) = 1 and 0
otherwise. If page k does not exist in the hybrid cache, i.e.,
Xk,i(t) = 0 for i = 1, 2, · · · , h, then page k must be stored
in the secondary storage, and we let Xk,0(t) = 1 in this case.
Now we capture the system state from a perspective of lists,
and define Yi(t) = {k | Xk,i(t) = 1} (i ∈ {1, ..., h}) as the set of
pages in list li at time t. We have |Yi(t)| ≤ mi. The process
Yh(t) = (Y1(t),Y2(t), ...,Yh(t)) denotes the distribution of
pages in the hybrid cache at time t. Now the state space of
Yh(t), which we denote as Cn(m), can be viewed as the
set of all sequences of h sets c = {c1, ..., ch} with each set ci
consisting of mi distinct integers taken from the set {1, ..., n}.
In each time slot, only one request arrives and triggers a
state transition accordingly. Under the independent reference
model in §III-A, the process Yh(t) is clearly a Markov chain
on the state space Cn(m) for the cache replacement algorithms
described in §II-B. Now we denote πA(c) with c = {c1, ..., ch}
as the steady-state probability of state c, where A ∈ {F,L}
stands for the flat architecture or the layered architecture. We
use a variable htA(li) to denote the height of list li, which is
defined as the number of steps to move a data page from list
l0 to list li. Precisely, we have
$$ht_F(l_i) = \begin{cases} i, & i = 1, \dots, h_N, \\ i - h_N, & i = h_N + 1, \dots, h, \end{cases} \qquad \text{and} \qquad ht_L(l_i) = i. \tag{1}$$
Now the steady-state probability πA(c) can be derived as
shown in the following theorem.
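Theorem 1. The steady-state probabilities πA(c), with c ∈
Cn(m), can be written as
$$\pi_A(c) = \frac{1}{Z(\mathbf{m})} \prod_{i=1}^{h} \Bigl( \prod_{j \in c_i} p_j \Bigr)^{ht_A(l_i)}, \tag{2}$$
where $Z(\mathbf{m}) = \sum_{c \in C_n(\mathbf{m})} \prod_{i=1}^{h} \bigl( \prod_{j \in c_i} p_j \bigr)^{ht_A(l_i)}$.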
Proof: Please refer to the Appendix.
Remarks: We point out that the steady-state results share the
same structure as the results in [6] for both the flat and layered
architectures. The difference is that our model introduces a
parameter htA(li), which represents the height of lists and
provides the capability of unifying the model for different
architectures. In particular, the notation htA(li) (i.e., the height
of lists) is an “architecture-dependent parameter” (i.e., its value
depends on the architecture of the hybrid system), and we
include it in the analysis so as to enhance the model’s ability
in analyzing different architectures.
According to the probabilities πA(c), we can calculate the
hit probability of list li in steady state, which is denoted
as $H_i = \lim_{t \to \infty} \sum_k p_k E[X_{k,i}(t)]$. We also call this probability
distribution the cache content distribution. Mathematically,
$$H_i = \sum_{c \in C_n(\mathbf{m})} \sum_{k} p_k \mathbf{1}_{\{k \in c_i\}} \pi_A(c), \tag{3}$$
where $\mathbf{1}_{\{k \in c_i\}}$ is a 0-1 variable denoting whether page k is
in list li or not.
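To make (2) and (3) concrete, the following Python sketch evaluates them by brute-force enumeration for a tiny system. It is only an illustration under stated assumptions: the function names are hypothetical, the heights htA(li) are passed in explicitly, and the lists of a state are taken to hold disjoint sets of pages. The number of states grows combinatorially with n and the list capacities, so this approach is only usable for very small instances.

```python
from itertools import combinations
from math import prod


def enumerate_states(pages, sizes):
    """Yield every state c = (c_1, ..., c_h) of disjoint page sets,
    where list i holds exactly sizes[i] pages."""
    if not sizes:
        yield ()
        return
    for first in combinations(pages, sizes[0]):
        rest = [k for k in pages if k not in first]
        for tail in enumerate_states(rest, sizes[1:]):
            yield (first,) + tail


def exact_hit_probabilities(p, sizes, heights):
    """Return [H_1, ..., H_h] from the product form (2) and definition (3)."""
    hits = [0.0] * len(sizes)
    z = 0.0                                   # normalization constant Z(m)
    for state in enumerate_states(list(range(len(p))), sizes):
        weight = prod(prod(p[j] for j in c) ** ht
                      for c, ht in zip(state, heights))
        z += weight
        for i, c in enumerate(state):
            hits[i] += weight * sum(p[k] for k in c)
    return [x / z for x in hits]


# Two lists of one page each over four pages; heights (1, 2) correspond to
# the layered architecture, heights (1, 1) to the flat one with h_N = h_D = 1.
H_layered = exact_hit_probabilities([0.4, 0.3, 0.2, 0.1], [1, 1], [1, 2])
H_flat = exact_hit_probabilities([0.4, 0.3, 0.2, 0.1], [1, 1], [1, 1])
```

With equal list sizes, the list of larger height attracts more popular pages, so its hit probability is at least as large as that of the lower list.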
However, it is not efficient to compute πA(c) by using the
above formula unless the cache capacity m is small. In the
next section, we will introduce a mean-field approach, which
can approximate the cache content distribution very efficiently.
C. Performance Metric
Recall that we focus on hybrid cache systems consisting
of both DRAM and NVM, which show very different char-
acteristics in access latency. To take device heterogeneity into
account, we define a latency-based performance metric to eval-
uate hybrid cache performance. Since requests are processed
differently under different architectures, we distinguish the
definitions for flat architecture and layered architecture.
1) Latency Model under Flat Architecture: Suppose that at
time t, a request arrives. To process this request, we first access
the metadata in the file system to identify where the requested page
currently resides, and there are two cases: (1) cache hit, which
means that the requested page is available in the hybrid cache,
and (2) cache miss, which means that the requested page does
not exist in the hybrid cache. In the following, we derive the
access latency in the above two cases.
At time t, if a cache hit happens, the service time of accessing
a page depends on which cache holds the page. If the hit
occurs in N-Cache, which happens with probability
$\sum_{i=1}^{h_N} \sum_{k=1}^{n} p_k X_{k,i}(t)$, then the
service time includes only the time to read a page from NVM,
and we denote it as $T_{N,r}$, where N denotes N-Cache and r
represents read. Otherwise, the hit occurs in D-Cache, which
happens with probability $\sum_{i=h_N+1}^{h} \sum_{k=1}^{n} p_k X_{k,i}(t)$,
and the service time is the time
to read a page from DRAM, which we denote as $T_{D,r}$.
If a cache miss happens, which occurs with probability
$\sum_{k=1}^{n} p_k X_{k,0}(t)$, then we
need to first copy the data from the secondary storage to the
destination cache (either D-Cache or N-Cache), and then serve the
request from the corresponding cache. So the service time
includes the time to read a page from the secondary storage,
which we denote as $T_{S,r}$, the time to write a page to the cache,
which we denote as $T_{D,w}$ for writing to D-Cache and $T_{N,w}$ for
writing to N-Cache, and the time to read a page from the cache.
Note that under the flat architecture, a new data page is written
to D-Cache (or N-Cache) with probability α (or 1 − α), so
the expected service time in the case of a cache miss is
$$\alpha (T_{S,r} + T_{D,w} + T_{D,r}) + (1 - \alpha)(T_{S,r} + T_{N,w} + T_{N,r}).$$
By summarizing the above two cases and noting that
$H_i(t) = \sum_{k=1}^{n} p_k X_{k,i}(t)$, the average service time of
processing the request at time t under the flat architecture, which
we also call the average latency, can be derived as follows.
$$L_F(t) = E[H_0(t)] \bigl( T_{S,r} + \alpha (T_{D,w} + T_{D,r}) + (1 - \alpha)(T_{N,w} + T_{N,r}) \bigr) + \sum_{i \neq 0} E[H_i(t)]\, T_{d(i),r}, \tag{4}$$
where d(i) is the device type of list li, i.e., d(i) ∈ {D,N, S}.
2) Latency Model under Layered Architecture: Similar to
the above derivation, we can also derive the average latency
under the layered architecture, with two differences.
First, if cache hit occurs in the highest list of N-Cache, i.e.,
in list lhN , then we need to exchange this data in N-Cache
with a data page in D-Cache. As a result, we need one read
from N-Cache, one write to D-Cache, as well as one read
from D-Cache and one write to N-Cache, so the total time is
TN,r + TD,w + TD,r + TN,w. Second, if cache miss happens,
data can only be written to N-Cache, and the service time is
TS,r + TN,w + TN,r. In summary, the average latency under
the layered architecture can be derived as:
$$L_L(t) = E[H_0(t)] (T_{S,r} + T_{N,w} + T_{N,r}) + E[H_{h_N}(t)] (T_{N,r} + T_{N,w} + T_{D,r} + T_{D,w}) + \sum_{i \neq 0, h_N} E[H_i(t)]\, T_{d(i),r}. \tag{5}$$
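The two latency formulas (4) and (5) translate directly into code. The Python sketch below is a minimal illustration: the function names, the dictionary keys for the device timings, and the numeric timing values are all assumptions of this example (the values are arbitrary placeholders, not measurements). The list `H` holds the hit probabilities, with `H[0]` the miss probability and `H[i]` the hit probability of list li.

```python
def latency_flat(H, h_n, T, alpha):
    """Average latency under the flat architecture, following (4)."""
    miss_cost = (T["S_r"]
                 + alpha * (T["D_w"] + T["D_r"])
                 + (1.0 - alpha) * (T["N_w"] + T["N_r"]))
    total = H[0] * miss_cost
    for i in range(1, len(H)):
        # Lists 1..h_n are in N-Cache, lists h_n+1..h are in D-Cache.
        total += H[i] * (T["N_r"] if i <= h_n else T["D_r"])
    return total


def latency_layered(H, h_n, T):
    """Average latency under the layered architecture, following (5)."""
    total = H[0] * (T["S_r"] + T["N_w"] + T["N_r"])
    for i in range(1, len(H)):
        if i == h_n:
            # A hit in the top N-Cache list triggers an exchange with D-Cache.
            total += H[i] * (T["N_r"] + T["N_w"] + T["D_r"] + T["D_w"])
        else:
            total += H[i] * (T["N_r"] if i < h_n else T["D_r"])
    return total


# Placeholder timings in arbitrary units (hypothetical, not measured values).
T = {"S_r": 1000, "D_r": 1, "D_w": 1, "N_r": 2, "N_w": 10}
H = [0.4, 0.2, 0.1, 0.2, 0.1]          # h = 4 lists, h_N = 2
lat_flat = latency_flat(H, 2, T, alpha=0.5)
lat_layered = latency_layered(H, 2, T)
```

Keeping the timings in a single dictionary makes it easy to vary the NVM write cost when comparing the two architectures.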
IV. MEAN FIELD ANALYSIS
In this section, we conduct mean-field analysis to ap-
proximate the cache content distribution so as to make the
computation more efficient. The rough idea of the mean-
field analysis can be stated as follows. Instead of accurately
deriving the steady-state probability distribution directly from
the Markov process, we first formulate a deterministic process
defined by a set of ordinary differential equations (ODEs), then
we show that the Markov process can be approximated by the
deterministic process, which converges to the fixed point (i.e.,
mean-field limit), and finally, we use the mean-field limit to
approximate the steady-state solution of the Markov process.
A. ODEs
As mentioned in [6], the rationale of the mean-field ap-
proximation is that when pk is small and the capacity of
each list mi (i ∈ {0, 1, ..., h}) is large, the dynamics of
one particular data page becomes independent of the hit ratio
of each list, hence, its behavior can be approximated by
a time-inhomogeneous continuous-time Markov chain. As a
result, the stochastic process Yh(t) can be approximated by a
particular deterministic process x(t) = {xk,i(t)} (k = 1, ..., n
and i = 1, ..., h).
Fig. 2. State transitions of a single data page: (a) flat architecture; (b) layered architecture.
To formulate the set of ODEs to define x(t), we first focus
on the flat architecture. According to the state transitions of a
single data page illustrated in Figure 2(a), we can define x(t)
by using the ODEs in (6)-(10).
Case 1: If $i \neq 0, 1, h_N + 1, h, h_N$ (i.e., in middle lists):
$$\dot{x}_{k,i}(t) = p_k x_{k,i-1}(t) - \sum_j p_j x_{j,i-1}(t) \frac{x_{k,i}(t)}{m_i} + \sum_j p_j x_{j,i}(t) \frac{x_{k,i+1}(t)}{m_{i+1}} - p_k x_{k,i}(t). \tag{6}$$
Case 2: If $i = h$ or $i = h_N$ (i.e., in the highest list):
$$\dot{x}_{k,i}(t) = p_k x_{k,i-1}(t) - \sum_j p_j x_{j,i-1}(t) \frac{x_{k,i}(t)}{m_i}. \tag{7}$$
Case 3: If $i = 1$ (i.e., in the lowest list of N-Cache):
$$\dot{x}_{k,i}(t) = (1 - \alpha) p_k x_{k,0}(t) - (1 - \alpha) \sum_j p_j x_{j,0}(t) \frac{x_{k,i}(t)}{m_i} + \sum_j p_j x_{j,i}(t) \frac{x_{k,i+1}(t)}{m_{i+1}} - p_k x_{k,i}(t). \tag{8}$$
Case 4: If $i = h_N + 1$ (i.e., in the lowest list of D-Cache):
$$\dot{x}_{k,i}(t) = \alpha p_k x_{k,0}(t) - \alpha \sum_j p_j x_{j,0}(t) \frac{x_{k,i}(t)}{m_i} + \sum_j p_j x_{j,i}(t) \frac{x_{k,i+1}(t)}{m_{i+1}} - p_k x_{k,i}(t). \tag{9}$$
Case 5: If $i = 0$ (i.e., in the storage layer):
$$\dot{x}_{k,0}(t) = (1 - \alpha) \sum_j p_j x_{j,0}(t) \frac{x_{k,1}(t)}{m_1} + \alpha \sum_j p_j x_{j,0}(t) \frac{x_{k,h_N+1}(t)}{m_{h_N+1}} - p_k x_{k,0}(t). \tag{10}$$
To illustrate the ODEs, we take (6) as an example. First, if
page k is in list i−1 at time t and it is accessed, then it moves
from list i−1 to list i, and the probability is $p_k x_{k,i-1}(t)$. Second,
if a page in list i−1 is accessed, then it will exchange with a
randomly selected page in list i. The probability of accessing
a page in list i−1 is $\sum_j p_j x_{j,i-1}(t)$, which we denote as
$H_{i-1}(t)$, and the probability of page k being in list i and
also being selected for exchanging is $x_{k,i}(t)/m_i$. Thus, with
probability $H_{i-1}(t)\, x_{k,i}(t)/m_i$, page k moves from list i to
list i−1. Third, if a page in list i is accessed, then it will
exchange with a randomly selected page in list i+1. In this
case, the probability of page k being in list i+1 and moving
back to list i is $\sum_{j=1}^{n} p_j x_{j,i}(t)\, x_{k,i+1}(t)/m_{i+1}$.
Finally, if page k is
in list i and is accessed, then it moves from list i to list i+1,
and the corresponding probability is $p_k x_{k,i}(t)$. By summing
the above four terms, with inflows to list i counted as positive
and outflows as negative, we obtain the ODE in (6).
Now we consider the layered architecture. Similar to the
case of the flat architecture, we can also formulate the set of ODEs
according to the state transitions illustrated in Figure 2(b), and
the ODEs are defined by (11)-(12).
Case 1: If $i \neq 0$ (i.e., in the hybrid cache):
$$\dot{x}_{k,i}(t) = p_k x_{k,i-1}(t) - \sum_j p_j x_{j,i-1}(t) \frac{x_{k,i}(t)}{m_i} + \mathbf{1}_{\{i < h\}} \Bigl( \sum_j p_j x_{j,i}(t) \frac{x_{k,i+1}(t)}{m_{i+1}} - p_k x_{k,i}(t) \Bigr). \tag{11}$$
Case 2: If $i = 0$ (i.e., in the storage layer):
$$\dot{x}_{k,0}(t) = \sum_j p_j x_{j,0}(t) \frac{x_{k,1}(t)}{m_1} - p_k x_{k,0}(t). \tag{12}$$
Remarks: We point out that the set of ODEs shares similarities
with the ODEs formulated in [6]. This is mainly because
we also divide each kind of device (DRAM or NVM) into
multiple lists so as to explore the full design space. However,
we emphasize that with the consideration of multiple devices
and different architectures, the lists at the boundary behave in
a very different way, and so the state transitions for boundary
lists are also different.
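One way to explore the transient behavior described by these ODEs is to integrate them numerically. The Python sketch below uses a plain forward-Euler scheme starting from an empty cache; it is an illustration under stated assumptions rather than the authors' procedure. The function name `ode_transient`, the step size `dt = 1` (one request per time unit), and the compact handling of boundary lists through the local variables `entry`, `src`, and `is_top` are choices of this example. The derivative for the storage layer is obtained from conservation of probability for each page, which agrees with (10) and (12).

```python
def ode_transient(p, sizes, h_n, arch="flat", alpha=0.5, steps=2000, dt=1.0):
    """Forward-Euler integration of the mean-field ODEs (6)-(12).

    Returns (miss, x), where miss[t] approximates the miss probability
    H_0 at step t and x[k][i] approximates the probability that page k
    is in list i (i = 0 is the storage layer). Assumes 1 <= h_n < h.
    """
    n, h = len(p), len(sizes)
    m = [None] + list(sizes)                   # m[i]: capacity of list i
    x = [[1.0] + [0.0] * h for _ in range(n)]  # empty cache: all in list 0
    miss = []
    for _ in range(steps):
        H = [sum(p[k] * x[k][i] for k in range(n)) for i in range(h + 1)]
        miss.append(H[0])
        new_x = []
        for k in range(n):
            dx = [0.0] * (h + 1)
            for i in range(1, h + 1):
                first_d = arch == "flat" and i == h_n + 1
                src = 0 if (i == 1 or first_d) else i - 1
                if arch == "flat" and src == 0:
                    entry = alpha if first_d else 1.0 - alpha
                else:
                    entry = 1.0
                is_top = i == h or (arch == "flat" and i == h_n)
                # Inflow from the list below and eviction back to it.
                d = entry * (p[k] * x[k][src] - H[src] * x[k][i] / m[i])
                if not is_top:
                    # Demotion from the list above and promotion to it.
                    d += H[i] * x[k][i + 1] / m[i + 1] - p[k] * x[k][i]
                dx[i] = d
            dx[0] = -sum(dx[1:])               # conservation per page
            new_x.append([x[k][i] + dt * dx[i] for i in range(h + 1)])
        x = new_x
    return miss, x
```

The returned `miss` sequence gives the approximate miss ratio as a function of the number of requests, which is the kind of transient curve that can be compared against simulation.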
B. Fixed Point
We derive the fixed point of the ODEs defined by (6)-(10)
and (11)-(12). The results are stated in the following theorem.
Theorem 2. The ODEs have a unique fixed point, which we
denote as πk,i (k = 1, ..., n and i = 0, ..., h):
$$\pi_{k,i} = \frac{p_k^{ht_A(i)} s_i}{1 + \sum_{j=1}^{h} p_k^{ht_A(j)} s_j}, \tag{13}$$
where htA(i) (A ∈ {F,L}) is defined in (1), and (s1, ..., sh)
is the unique solution of the following equations:
$$\sum_{k=1}^{n} \frac{p_k^{ht_A(i)} s_i}{1 + \sum_{j=1}^{h} p_k^{ht_A(j)} s_j} = m_i, \qquad i = 1, \dots, h.$$
Proof: Please refer to the Appendix.
Remarks: Note that for the layered architecture, we have
htL(i) = i. By substituting it into the equations in
Theorem 2, we obtain the same results as in [6]. This is
because for the layered architecture, the hybrid cache can be
considered as a single unified cache containing h lists when
deriving the cache content distribution. However, we would
like to emphasize that due to the device heterogeneity, the
average latency of the hybrid cache must be different from
that of a single unified cache. On the other hand, for the flat
architecture, we see that in steady state, the fixed point πk,i is
independent of the parameter α. This implies that the hit ratio
is independent of the policy of choosing which cache device to
buffer new data. Thus, we can freely increase α to cache more
missed data pages in the fast-speed D-Cache so as to achieve
better overall cache performance. In terms of the convergence,
note that the stochastic process under each architecture has the
reversible property, which is the same as Corollary 1 in [6], so
we may also apply the method in [1] to show that the process
will concentrate on the fixed point. We also point out that the
fixed point provides an efficient numerical method to compute
the steady-state performance for both architectures.
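As an illustration of how the fixed point in Theorem 2 might be computed, the Python sketch below solves for (s1, ..., sh) with a simple iteration that rescales each si so that list i holds mi pages on average. The paper does not prescribe a particular numerical scheme, so the iteration, the function names `height` and `solve_fixed_point`, and the tolerance parameters are assumptions of this example. The solver does not take α as input, which reflects the remark that the fixed point is independent of α.

```python
def height(i, h_n, arch):
    """Height ht_A(l_i) from (1); lists are numbered 1..h."""
    if arch == "flat" and i > h_n:
        return i - h_n
    return i


def solve_fixed_point(p, sizes, h_n, arch, tol=1e-9, max_iter=200_000):
    """Return (s, pi): s[i-1] = s_i, and pi[k][i] = pi_{k,i} for i = 0..h."""
    h = len(sizes)
    ht = [height(i, h_n, arch) for i in range(1, h + 1)]
    a = [[pk ** t for t in ht] for pk in p]    # a[k][i] = p_k ** ht(l_{i+1})
    s = [0.0] * h
    for _ in range(max_iter):
        denom = [1.0 + sum(ak[j] * s[j] for j in range(h)) for ak in a]
        g = [sum(a[k][i] / denom[k] for k in range(len(p)))
             for i in range(h)]
        # s[i] * g[i] is the expected number of pages in list i+1.
        if max(abs(s[i] * g[i] - sizes[i]) for i in range(h)) < tol:
            pi = [[1.0 / d] + [ak[i] * s[i] / d for i in range(h)]
                  for ak, d in zip(a, denom)]
            return s, pi
        s = [sizes[i] / g[i] for i in range(h)]
    raise RuntimeError("fixed-point iteration did not converge")


def hit_probabilities(p, pi):
    """Approximate [H_0, ..., H_h] as sum_k p_k * pi_{k,i}."""
    return [sum(pk * row[i] for pk, row in zip(p, pi))
            for i in range(len(pi[0]))]
```

For a single list of 5 pages over 10 equally popular pages, the iteration converges to s1 = 10, which gives each page an occupancy probability of 0.5.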
C. Convergence Results
Here, we show that we can use $\delta_i(t) = \sum_k x_{k,i}(t)\, p_k$ to
approximate $H_i(t) = \sum_k X_{k,i}(t)\, p_k$, where $x_{k,i}(t)$ is defined
by the set of ODEs. The convergence result is stated in the
following theorem.
Theorem 3. When $p_k \to 0$ as $n \to \infty$ (i.e., $a = \max_k p_k \to 0$) and
$m_i \to \infty$, then for any T, $E[\sup_{i, t \le T} |H_i(t) - \delta_i(t)|] \to 0$,
with initial condition $H_i(0) = \delta_i(0)$.
Proof: Please refer to the Appendix.
Remarks: Based on Theorem 2 and Theorem 3, we can use
the fixed-point value $\sum_k p_k \pi_{k,i}$ (with $\pi_{k,i}$ derived in (13))
to approximate the
cache content distribution Hi (defined in (3)), which denotes
the hit probability of list i in the steady state. More impor-
tantly, it is efficient to compute Hi with this approximation,
which makes it feasible to further derive the average latency
of the hybrid cache.
V. MODEL VALIDATION
In this section, we first validate the mean-field approxima-
tion by comparing the hit probabilities derived from model and
simulations, then we validate our model analysis of average
latency by modifying the DRAMSim2 simulator [18].
A. Validation on Mean-field Approximation
In this subsection, we validate the mean-field approximation
using the trace-based simulations by setting mN = 200, mD
= 100, n = 1000, and pk by following a Zipf-like distribution
with parameter γ = 0.8.
To validate the mean-field approximation, we use the probability of hitting each page in each device as a metric. Note that the hit probability can be derived from $\pi_{k,i}$. In particular, for a particular page $k$, the probability of hitting page $k$ in N-Cache can be derived as $\sum_{i=1}^{h_N} \pi_{k,i}$, and the probability of hitting $k$ in D-Cache is $\sum_{i=h_N+1}^{h} \pi_{k,i}$. For the simulation, we run 50 times and take the average result.
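The two sums above can be sketched in a few lines of Python. This is only an illustration: the function name and the example vector are hypothetical, and it assumes that entry 0 of the per-page vector corresponds to list 0 (the page residing only in storage), entries 1 to hN to the N-Cache lists, and the remaining entries to the D-Cache lists.

```python
def device_hit_probabilities(pi_k, h_n):
    """Split one page's fixed-point vector (pi_{k,0}, ..., pi_{k,h}) by device.

    Assumed layout (not taken from the paper's code): index 0 is list 0
    (storage), indices 1..h_n are N-Cache lists, h_n+1..h are D-Cache lists.
    """
    storage = pi_k[0]
    n_cache = sum(pi_k[1:h_n + 1])   # sum over i = 1, ..., h_N
    d_cache = sum(pi_k[h_n + 1:])    # sum over i = h_N + 1, ..., h
    return storage, n_cache, d_cache


# Hypothetical vector for one page with h_N = 2 and h = 3.
storage, n_cache, d_cache = device_hit_probabilities([0.1, 0.2, 0.3, 0.4], h_n=2)
```

Since the entries of one page's vector sum to 1, the three returned values also sum to 1.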
[Figure 3 plots: device hit probability (vertical axis, 0 to 1) versus page ID (horizontal axis, up to 1000) for N-Cache, D-Cache, and Storage, each from the ODE model and from simulation. Panels: (a) Flat Architecture; (b) Layered Architecture.]
Fig. 3. Validation on mean-field approximation: The hit probability of each page in each device.
[Figure 4 plots: miss ratio (vertical axis, 0.3 to 0.6) versus number of requests (horizontal axis, 10^3 to 10^5), ODE model versus simulation. Panels: (a) Flat (varying α; legend values 0.8, 0.9, 0.95); (b) Flat (varying hN, hD; legend pairs (2,2), (4,2), (2,4), (4,4)); (c) Layered (varying hN, hD; legend pairs (2,2), (4,2), (2,4), (4,4)).]
Fig. 4. Validation on mean-field approximation: Transient behaviors of the hybrid cache starting from an empty state.
Figure 3 shows the model and simulation results under the flat and layered architectures. We see that the analysis results match well with the simulation results. In particular, even for a very small system (e.g., n = 1000), we can still achieve a good approximation by using the mean-field analysis.
We further validate the mean-field approximation by considering the transient hit probability instead of the steady-state result derived from the mean-field limit. We use the average miss ratio of the hybrid cache over all pages as a metric, and divide time into small intervals to compare the simulation and model results in each time interval. For the model results, since we now focus on the transient behavior, we derive the average miss ratio directly from the ODEs in (6)-(10) and (11)-(12). Precisely, the average miss ratio at time slot $t+1$ is computed by $\sum_k p_k x_{k,0}(t+1) = \sum_k p_k \big(x_{k,0}(t) + \dot{x}_{k,0}(t)\big)$.
For simulations, we record the position of each page after
processing each request, and then measure the average miss
ratio in each time interval.
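The simulation side of this comparison can be sketched as below. This is a minimal illustration, not the authors' simulator: the function name and parameters are hypothetical, and the replacement rules are assumptions inferred from the transition description in the Appendix (a missed page replaces a uniformly chosen slot of the lowest D-Cache list with probability α, otherwise of the lowest N-Cache list; a hit in a non-top list exchanges the page with a uniformly chosen slot of the next list in the same device). Empty slots are modelled as placeholders, which is also an assumption.

```python
import random


def simulate_flat_cache(n, n_lists, d_lists, alpha, gamma,
                        num_requests, interval, seed=0):
    """Return the miss ratio measured in each block of `interval` requests."""
    rng = random.Random(seed)
    weights = [1.0 / (k + 1) ** gamma for k in range(n)]  # Zipf-like popularity
    sizes = list(n_lists) + list(d_lists)   # N-Cache lists first, then D-Cache
    lists = [[None] * size for size in sizes]   # None marks an empty slot
    h_n, h = len(n_lists), len(sizes)
    position = {}                           # page -> (list index, slot index)
    ratios, misses = [], 0
    requests = rng.choices(range(n), weights=weights, k=num_requests)
    for t, page in enumerate(requests, start=1):
        if page not in position:
            misses += 1
            # Miss: lowest D-Cache list with probability alpha, else lowest N-Cache list.
            target = h_n if rng.random() < alpha else 0
            slot = rng.randrange(sizes[target])
            evicted = lists[target][slot]
            if evicted is not None:
                del position[evicted]
            lists[target][slot] = page
            position[page] = (target, slot)
        else:
            i, slot = position[page]
            if i not in (h_n - 1, h - 1):   # a hit in a top list changes nothing
                other_slot = rng.randrange(sizes[i + 1])
                other = lists[i + 1][other_slot]
                lists[i + 1][other_slot] = page
                lists[i][slot] = other
                position[page] = (i + 1, other_slot)
                if other is not None:
                    position[other] = (i, slot)
        if t % interval == 0:
            ratios.append(misses / interval)
            misses = 0
    return ratios


# Illustrative setting: n = 1000, mN = 200 and mD = 100, two equal lists per device.
ratios = simulate_flat_cache(n=1000, n_lists=[100, 100], d_lists=[50, 50],
                             alpha=0.8, gamma=0.8,
                             num_requests=50000, interval=1000)
```

Each entry of `ratios` is the measured miss ratio of one time interval, which is the quantity compared against the ODE-based value in each interval.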
Figure 4 shows the results under different settings by
varying the parameter α under flat architecture (Figure 4(a))
and varying hN and hD under flat and layered architectures
(Figure 4(b) and 4(c)). We see that even for the transient
behavior, the mean-field model still approximates well for
small systems. Another interesting observation is that the
number of lists in each cache device may have a big influence
on the cache reactivity, which is measured by the time to fill
the cache. Precisely, if the number of lists is set to be large,
which results in a small list size, then it may need a very
long time to fill the cache. That is, the convergence rate to the
steady state becomes small.
B. Validation on Average Latency
To validate the model analysis of average latency, we de-
velop a hybrid cache simulator by modifying the DRAMSim2
simulator [18], and it includes the following modules.
• Trace Generation Module: It generates requests with logical
address and request starting time.
• Memory Controller Module: It manages the cache metadata
and controls the page replacement.
• D-Cache Module: It simulates a DRAM device, serves the
requests coming to DRAM, and sends the finishing time to
the Time Collection Module. The timing parameters of DRAM
follow [14].
• N-Cache Module: It simulates an NVM device and serves the
requests coming to NVM. The default timing parameters of
NVM are set according to [14]; we can vary the device-level latency
by adjusting the timing parameters.
• Storage Module: It simulates the access to storage devices by
adding a delay to the request, then sends the new time clock to
the Time Collection Module.
• Time Collection Module: It collects the starting time and
finishing time of each request.
• Device Performance Monitor Module: It collects the average
read/write latency at device level for each cache device (DRAM
and NVM) in each time interval.
In our simulation, we use the Trace Generation Module to
generate requests according to the Zipf-like distribution, and
set the workload size n = 3000. We also use the Device
Performance Monitor Module to measure the device-level
latency parameters of DRAM and NVM (TD,r, TN,r, TD,w,
TN,w), and then use them as inputs to our latency model to
compute the average latency of the hybrid cache.
We validate our model by considering different design
settings, including the system architecture, the capacity of D-
Cache and N-Cache (mD and mN ), and the number of lists
in each cache device (hD and hN ). We only show the results
under some settings in Table I due to the page limit. We see that
the analysis results match well with the simulation results even
under the settings of small systems, and the relative error is at
most 2.87%. We also run more simulations for validation by
varying the timing parameters of the cache devices; the results also
show that our model captures the average latency of the hybrid
cache accurately. We skip these results in the interest of space.
Arc.  mN   mD   hN  hD  Sim. (µs)  Model (µs)  Rel. Err.
F     200  400  3   4   54.38      55.34       1.77%
F     200  400  3   3   55.67      54.68       1.78%
F     200  400  2   4   54.79      56.21       2.59%
F     100  200  3   4   69.44      67.45       2.87%
L     400  200  4   3   49.28      49.30       0.04%
L     400  200  3   3   70.04      67.54       0.36%
L     400  200  4   2   69.22      68.43       1.14%
L     300  100  3   2   83.56      81.41       2.57%
TABLE I
LATENCY VALIDATION UNDER DIFFERENT SETTINGS.
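The relative error column can be reproduced with a one-line formula. The sketch below assumes that the error is taken relative to the simulated latency, which is consistent with the first rows of the table; the function name is hypothetical.

```python
def relative_error_percent(sim_us, model_us):
    """Relative error of the model latency with respect to the simulated one, in %."""
    return abs(model_us - sim_us) / sim_us * 100.0


# First and fourth flat-architecture rows of Table I.
err_row1 = relative_error_percent(54.38, 55.34)
err_row4 = relative_error_percent(69.44, 67.45)
print(f"{err_row1:.2f}%  {err_row4:.2f}%")
```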
VI. NUMERICAL RESULTS AND GUIDELINES
In this section, we use PCM as an example of NVM and
conduct numerical analysis to study the impact of system
architecture and design settings on hybrid cache performance
so as to understand the benefit of NVM and explore the design
space of hybrid cache. In the following, we first introduce the
parameter settings and justify their choices, then we perform
numerical analysis to study the impact of various design
choices and provide insightful guidelines.
[Figure 5 plots: average latency (vertical axis, 50 to 100 µs). Panels: (a) Impact of α (horizontal axis: α from 0.1 to 0.9; curves for TN,w = 128, 64, 32, 16 µs); (b) Impact of hN (horizontal axis: number of lists in N-Cache, 1 to 9; curves for hD = 1, 3, 5, 9); (c) Impact of hD (horizontal axis: number of lists in D-Cache, 1 to 9; curves for hN = 1, 3, 5, 9).]
Fig. 5. Flat architecture: Impact of α, hN and hD on average latency of the hybrid cache (α denotes the probability of writing data to D-Cache when cache miss happens, hN and hD denote the number of lists in N-Cache and D-Cache, respectively).
A. Parameter Settings
Recall that our model takes device heterogeneity into
consideration by using a latency-based performance metric.
Thus, to perform numerical analysis, we first configure the
performance parameters for different devices.
DRAM parameters are measured at nanosecond granularity in a
practical file system page cache environment by patching Linux
kernel 4.0.2. Both read and write latencies are around 0.2µs,
averaged over millions of records. Note that this latency is
nearly 10× longer than the 10 ∼ 25ns reported in [5], [16],
mainly because of the software overhead caused by the file
system. As PCM is not yet available on the market, we refer to
the parameter settings in [13], and let TN,r = 6.7µs and
TN,w = 128.3µs.
To set the latency of accessing the secondary storage, we
consider an example of networked storage application, in
which the file server is equipped with an all-flash storage
system [10]. The network parameters are based on the timing
parameters in previous work [7], and precisely, the network
overhead for 4 KB transmission is calculated as 41.0 µs (8.2
µs basic latency + (4,096 × 8) bits × 1 ns/bit). Thus, the
overall read time is set as 151 µs (110µs file server read time
+ 41 µs network transmission overhead).
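The arithmetic behind these two numbers can be checked with a short sketch. The function name and default arguments are hypothetical; the constants (8.2 µs basic latency, 1 ns per bit, 110 µs file server read time) are the ones quoted above.

```python
def network_overhead_us(payload_bytes, base_latency_us=8.2, ns_per_bit=1.0):
    """Basic latency plus serialization time of the payload, in microseconds."""
    return base_latency_us + payload_bytes * 8 * ns_per_bit / 1000.0


overhead_us = network_overhead_us(4096)          # 8.2 + 32.768 = 40.968
storage_read_us = 110.0 + round(overhead_us, 1)  # file server read + network overhead
```

The exact overhead is 40.968 µs, which the text rounds to 41.0 µs before adding the file server read time.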
Table II summarizes the delay parameters of different de-
vices used in this paper. Note that given the latency parameters
in Table II and the cache content distribution approximated by
πk,i in (13), the average latency under flat and layered archi-
tectures can be computed by using (4) and (5), respectively.
                  DRAM    PCM       Storage
4KB read lat.     0.2µs   6.7µs     151µs
4KB write lat.    0.2µs   128.3µs   -
TABLE II
THE LATENCY OF DIFFERENT DEVICES IN COMMON SETTING.
B. Impact of Design Choices under Flat Architecture
In this subsection, we focus on the flat architecture, and
study the impact of various design choices by setting mN =
15000, mD = 5000, n = 100000, and pk by following a Zipf-
like distribution with parameter γ = 0.8.
Impact of α: We first study the impact of parameter α,
which denotes the probability of writing data to D-Cache when
cache miss happens. Note that the missed data is written to
N-Cache with probability 1− α.
Figure 5(a) shows the analysis results. We see that the
average latency decreases when α increases. That is, if we
write missed data pages to D-Cache with higher probability,
then the overall cache performance increases, because more
data will be served by the high-speed D-Cache. However,
the performance gain shrinks as PCM performance improves,
since the speed gap between DRAM and PCM narrows,
especially for writes. For example,
if we set the PCM write latency to 16µs, which is 8× smaller
than the common setting, then increasing α from 0.1 to 0.9
yields less than 10% improvement. In the
following study, we fix α as 0.8 under the flat architecture.
Impact of hN and hD: Now we study the impact of hN
and hD, which denote the number of lists in N-Cache and D-
Cache, respectively. To decouple the dependency between hN
and hD, we vary hN by fixing hD in Figure 5(b), and vary
hD by fixing hN in Figure 5(c). Based on the results, we have
the following observations.
• Increasing the number of lists in N-Cache (i.e., hN ) does
not always increase the cache performance. For example,
as shown in Figure 5(b), when hD = 9, increasing hN
incurs even longer latency when hN is larger than 7. The
main reason is that increasing the number of lists in N-
Cache may result in a reduction of the overall cache miss
probability of the hybrid cache, but it also leads to a
reduction of the cache hit probability of D-Cache as hot
data is more likely to be trapped in N-Cache.
• The average latency decreases when using more lists in
D-Cache by setting a larger hD, because increasing hD
not only decreases the overall cache miss probability, but
also increases the D-Cache hit probability. Besides, the
performance gain diminishes when hD is already large.
We also conduct the analysis by varying the latency of PCM and the
capacities of N-Cache and D-Cache, and we reach the same
conclusions. We do not show the results here in the interest of
space. Further considering the impact of hN and hD on the
cache reactivity (see Figure 4), we recommend using a large
hD and a small hN under the flat architecture, e.g., setting hD to 4
∼ 6 and hN to 2 ∼ 3.
C. Impact of Design Choices under Layered Architecture
Now we focus on the layered architecture and study the impact of various design choices. Since the major factors under the layered architecture are hN and hD, which denote the number of lists in N-Cache and D-Cache, we study their impact on the average latency of the hybrid cache as before.
[Figure 6 plots: average latency (vertical axis, 90 to 150 µs). Panels: (a) Impact of hN (horizontal axis: number of lists in N-Cache, 1 to 9; curves for hD = 1, 3, 5, 9); (b) Impact of hD (horizontal axis: number of lists in D-Cache, 1 to 9; curves for hN = 1, 3, 5, 9).]
Fig. 6. Layered architecture: Impact of hN and hD on average latency (hN and hD denote the number of lists in N-Cache and D-Cache, respectively).
Impact of hN and hD: Figure 6 shows the analysis results,
and we have the following observations.
• The average latency decreases when either hN or hD in-
creases. That is, better cache performance can be achieved
by adding more lists in both N-Cache and D-Cache.
• The performance improvement is more significant when
adding more lists in N-Cache (i.e., increasing hN ) than
increasing hD. In particular, the improvement is negligi-
ble when increasing hD, especially when hN is large.
We also vary the latency of PCM and the capacity of N-Cache
and D-Cache. The results are in line with the observations
above. Further considering the impact of hN and hD on cache
reactivity (see Figure 4), we recommend setting a large hN and
a small hD under the layered architecture, e.g., setting hN to 4 ∼ 6
and hD to 2 ∼ 3.
D. Impact of PCM Performance and Capacity
In this subsection, we explore the performance impact and
design space of hybrid cache by varying the read and write
performance of PCM, as well as its capacity allocation. To
vary the PCM capacity in the hybrid cache, we fix the total budget
C, and adjust mN and mD by assuming that the price of PCM
is 14× of that of DRAM [13].
Figure 7 shows the impact of PCM performance under
flat and layered architectures. In this analysis, we fix PCM
capacity by setting mN/(mN +mD) = 50%, and we also fix
the read and write performance of DRAM (TD,r and TD,w) as
the common parameters in Table II. We change the read and
write performance of PCM by varying TN,r from 1× to 32×
of TD,r (see Figure 7(a)), and varying TN,w from 1× to 640×
of TD,w (see Figure 7(b)). Note that in the common settings, TN,r
is roughly 32× of TD,r and TN,w is roughly 640× of TD,w (see Table II).
[Figure 7 plots: average latency (vertical axis, 60 to 140 µs) for the Layered and Flat architectures. Panels: (a) Impact of TN,r (horizontal axis: TN,r/TD,r from 32 down to 1); (b) Impact of TN,w (horizontal axis: TN,w/TD,w from 640 down to 1).]
Fig. 7. Impact of PCM read/write performance (TN,r and TN,w). We fix PCM capacity by setting mN/(mN +mD) = 50%.
Results show that the read performance of PCM has a
very small impact on the hybrid cache performance. However,
the impact of the PCM write performance TN,w is significant.
In particular, when PCM writes are slow,
the flat architecture achieves better performance than the
layered architecture. As we improve the PCM write
performance (by decreasing TN,w), the average latency of
the hybrid cache under the layered architecture drops faster, and
the layered architecture eventually outperforms the flat architecture when
PCM writes become fast. Thus, which architecture gives
better hybrid cache performance depends on the
PCM performance characteristics.
To further investigate the architectural choices of hybrid
cache, we also take into consideration the capacity allocation
of different cache devices. Results are shown in Figure 8, in
which the horizontal axis represents the percentage of PCM
cache by fixing the total budget, and the vertical axis shows
the average latency of the hybrid cache.
In Figure 8(a), we use the common settings in Table II to set
PCM write performance (i.e., TN,w ≈ 640TD,w). We see that
if we allocate more budget for PCM, the average latency keeps
decreasing under the flat architecture as we can have a larger
cache size. However, the result is very different for the layered
architecture: the average latency decreases first,
but begins to increase when the PCM capacity becomes very large.
The main reason is that under the layered architecture, even
though we can have a larger cache by using more PCM, doing so
may incur many data migrations between DRAM and PCM,
whose overhead can be large because PCM is two orders of
magnitude slower than DRAM.
As for choosing between the flat and layered architectures,
we see that the flat architecture achieves
better performance than the layered architecture under the common
setting of the performance-price ratio of PCM (as shown in
Figure 8(a)). However, if the write performance of PCM
achieves a big breakthrough, e.g., in the extreme case where
PCM reaches the same performance as DRAM as shown in
Figure 8(b) (i.e., TN,w = TD,w), then the layered architecture
becomes the better choice, and clearly, we do not need to
struggle with the capacity allocation problem in this situation.
To explore the whole design space, we also look for the
boundary condition, as shown in Figure 8(c). In general, for
the common setting of the performance-price ratio of PCM, the flat
architecture outperforms the layered architecture, but when the
write performance of PCM improves, we may need to switch
to the layered architecture.
VII. RELATED WORK
In recent years, researchers have suggested using non-volatile devices to build a large-memory page cache to improve system performance. For example, the authors of [11] and [12] proposed using NAND flash memory as a page cache between DRAM and disk storage so as to reduce the demand for DRAM as system memory. Lee et al. [15] showed the potential of using a small portion of STT-MRAM as a non-volatile buffer cache to eliminate the periodic flush overhead caused by volatile DRAM. For PCM in particular, a large body of work studies how to architect PCM in memory, e.g., [9], [19], [16], [4], [21], [5].
[Figure 8 plots: average latency in µs (vertical axis) versus PCM percentage (horizontal axis, 0 to 1) for the Flat and Layered architectures. Panels: (a) Common setting (TN,w ≈ 640TD,w); (b) High-speed PCM (TN,w = TD,w); (c) Boundary case (TN,w ≈ 30TD,w).]
Fig. 8. Impact of PCM capacity in hybrid cache under different performance conditions.
However, a recent study suggested that we should pay attention
to the difference between the material-level and the
system-level performance, as the industrial technology is still
under development [13]. Thus, a comprehensive study is still
needed when incorporating NVM with DRAM from a
system perspective.
This paper presents an analytical model to study the performance
impact of incorporating NVM in a hybrid page cache
by extending the list-based model in [6]. Our work differs in the
following ways. First, we focus on hybrid cache systems
and consider two system architectural designs. Second, we
take into account the device heterogeneity, and quantify the
hybrid cache performance by developing a latency model.
Last, we conduct trace-driven simulations with the DRAMSim2
simulator to validate our analysis.
VIII. CONCLUSIONS
We develop mathematical models to analyze a hybrid cache
system so as to understand its performance impact and de-
sign space. We study two different architectural designs, flat
architecture and layered architecture, and develop a latency
model by taking into consideration the device heterogeneity.
We conduct trace-driven simulations with DRAMSim2 simu-
lator to validate our model, and perform extensive numerical
analysis by incorporating different performance characteristics
and capacity ratios. Based on our model analysis, we provide
multiple guidelines on how to design hybrid page cache so as
to reach high system throughput.
ACKNOWLEDGEMENTS
This work was supported by the National Natural Science Foundation
of China (61303048 and 61379038), the Anhui Provincial
Natural Science Foundation (1508085SQF214), and the CCF-Tencent
Open Research Fund.
REFERENCES
[1] J.-Y. L. Boudec. The stationary behaviour of fluid limits of re-
versible processes is concentrated on stationary points. arXiv preprint
arXiv:1009.5021, 2010.
[2] D. P. Bovet and M. Cesati. Understanding the Linux Kernel. O’Reilly
Media, Inc., 2005.
[3] L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker. Web Caching
and Zipf-like Distributions: Evidence and Implications. In INFOCOM,
1999.
[4] J.-H. Choi, S.-M. Kim, C. Kim, K.-W. Park, and K. H. Park. OPAMP:
Evaluation Framework for Optimal Page Allocation of Hybrid Main
Memory Architecture. In IEEE, ICPADS’2012.
[5] G. Dhiman, R. Ayoub, and T. Rosing. PDRAM: a Hybrid PRAM and
DRAM Main Memory System. In IEEE DAC, 2009.
[6] N. Gast and B. Van Houdt. Transient and Steady-state Regime of
A Family of List-based Cache Replacement Algorithms. In ACM
SIGMETRICS, 2015.
[7] D. A. Holland, E. L. Angelino, G. Wald, and M. I. Seltzer. Flash Caching
on The Storage Client. In USENIX, ATC’2013.
[8] J. M. Holte. Discrete gronwall lemma and applications. In MAA-NCS
meeting at the University of North Dakota, volume 24, pages 1–7, 2009.
[9] J. Hu, Q. Zhuge, C. J. Xue, W. C. Tseng, and H. M. Sha. Software
Enabled Wear-leveling for Hybrid PCM Main Memory on Embedded
Systems. In DATE, 2013.
[10] IBM. IBM FlashSystem 820 and IBM FlashSystem 720.
[11] T. Kgil and T. Mudge. FlashCache: A NAND Flash Memory File Cache
for Low Power Web Servers. In CASES, 2006.
[12] T. Kgil, D. Roberts, and T. Mudge. Improving NAND Flash Based Disk
Caches. In ISCA. IEEE, 2008.
[13] H. Kim, S. Seshadri, C. L. Dickey, and L. Chiu. Evaluating Phase
Change Memory for Enterprise Storage Systems: A Study of Caching
and Tiering Approaches. In USENIX, FAST’2014.
[14] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger. Architecting Phase Change
Memory As A Scalable Dram Alternative. ACM SIGARCH Computer
Architecture News, 37(3):2–13, 2009.
[15] E. Lee, H. Kang, H. Bahn, and K. Shin. Eliminating Periodic Flush
Overhead of File I/O With Non-volatile Buffer Cache. In Transactions
on Computers. IEEE, 2014.
[16] M. K. Qureshi, V. Srinivasan, and J. A. Rivers. Scalable High Perfor-
mance Main Memory System Using Phase-change Memory Technology.
ACM SIGARCH Computer Architecture News, 37(3):24–33, 2009.
[17] S. Raoux, G. W. Burr, M. J. Breitwisch, C. T. Rettner, Y.-C. Chen, R. M.
Shelby, M. Salinga, D. Krebs, S.-H. Chen, H.-L. Lung, et al. Phase-
change Random Access Memory: A Scalable Technology. IBM Journal
of Research and Development, 52(4.5):465–479, 2008.
[18] P. Rosenfeld, E. Cooper-Balis, and B. Jacob. DRAMSim2: A Cycle
Accurate Memory System Simulator. Computer Architecture Letters,
10(1):16–19, 2011.
[19] C. J. Xue, Y. Zhang, Y. Chen, G. Sun, J. J. Yang, and H. Li.
Emerging Non-volatile Memories: Opportunities and Challenges. In
CODES+ISSS. ACM, 2011.
[20] R. D. Yates. A framework for uplink power control in cellular
radio systems. IEEE Journal on selected areas in communications,
13(7):1341–1347, 1995.
[21] H. Yoon, J. Meza, R. Ausavarungnirun, R. A. Harding, and O. Mutlu.
Row Buffer Locality Aware Caching Policies for Hybrid Memories. In
IEEE, ICCD’2012.
[22] W. Zheng and G. Zhang. FastScale: Accelerate RAID Scaling by
Minimizing Data Migration. In FAST, 2011.
[23] G. K. Zipf. Relative Frequency as a Determinant of Phonetic Change.
Harvard studies in classical philology, 40:1–95, 1929.
APPENDIX
A. Proof of Theorem 1
Note that, for the layered architecture, i.e., when $ht_A(i) = i$, the steady-state probability has been proved in [6, Theorem 1]. Here, we follow the same method to prove that Theorem 1 gives the steady-state probability under the flat architecture.
We use $(i, u)$ to denote item $u$ in list $i$, that is, $u \in c_i$, where $u$ is the item's id. We use $c_{(i,u)\leftrightarrow(j,v)}$ to denote the set that is the same as $c$ except that item $u$ in $c_i$ and item $v$ in $c_j$ are exchanged, and $c_{k\to(i,u)}$ to denote the set that is the same as $c$ except that a new item $k$ from list 0 replaces item $u$ in list $i$. For example, if $c = \{c_1, c_2, c_3, c_4\} = \{\{1, 2\}, \{3, 4\}, \{5, 6\}, \{7, 8\}\}$, then $c_{(1,2)\leftrightarrow(2,4)}$ and $c_{9\to(1,2)}$ are:
$$c_{(1,2)\leftrightarrow(2,4)} = \{\{1, 4\}, \{3, 2\}, \{5, 6\}, \{7, 8\}\},$$
$$c_{9\to(1,2)} = \{\{1, 9\}, \{3, 4\}, \{5, 6\}, \{7, 8\}\}.$$
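The two operations on a cache state can be illustrated with a small sketch. The function names are hypothetical, a state is represented as a Python list of sets with 1-indexed list numbers, and the example reproduces the state used above (exchanging item 2 of list 1 with item 4 of list 2, and replacing item 2 of list 1 with the new item 9).

```python
def exchange(c, i, u, j, v):
    """Return a copy of state c with item u of list i and item v of list j swapped."""
    new = [set(lst) for lst in c]
    if u not in new[i - 1] or v not in new[j - 1]:
        raise ValueError("items are not in the stated lists")
    new[i - 1].remove(u)
    new[i - 1].add(v)
    new[j - 1].remove(v)
    new[j - 1].add(u)
    return new


def replace_from_storage(c, k, i, u):
    """Return a copy of state c in which uncached item k replaces item u of list i."""
    new = [set(lst) for lst in c]
    if u not in new[i - 1] or any(k in lst for lst in new):
        raise ValueError("u must be in list i and k must not be cached")
    new[i - 1].remove(u)
    new[i - 1].add(k)
    return new


c = [{1, 2}, {3, 4}, {5, 6}, {7, 8}]
swapped = exchange(c, 1, 2, 2, 4)
replaced = replace_from_storage(c, 9, 1, 2)
```

Both functions leave the original state `c` untouched, mirroring the notation in which the new sets are defined relative to a fixed $c$.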
To prove Theorem 1, we first show Lemma 1 as follows.
Lemma 1. For both the flat and layered architectures ($A = F$ or $L$), equation (2) is equivalent to the following equations:
$$(\ast)\quad \begin{cases} 1)\ \pi_A(c_{(i,u)\leftrightarrow(i+1,v)})/\pi_A(c) = p_u/p_v, \quad i \neq 0, h_N, h,\\ 2)\ \pi_A(c_{k\to(1,u)})/\pi_A(c) = p_k/p_u,\\ 3)\ \pi_A(c_{k\to(h_N+1,u)})/\pi_A(c) = p_k/p_u. \end{cases}$$
Proof of Lemma 1: First, (2) $\Rightarrow$ ($\ast$) clearly holds. To prove ($\ast$) $\Rightarrow$ (2), we label all the states $c$ in $C_n(m)$ as $c_1, c_2, \dots, c_{|C_n(m)|}$. For each page $k$ in state $c_j$, we define $I_k(c_j)$ as the height of the list that contains page $k$ in state $c_j$; e.g., if list $l_z$ contains page $k$ in state $c_j$, then $I_k(c_j) = ht_A(l_z)$. We normalize each probability with respect to $c_1$. Using ($\ast$), we obtain the ratio of the steady-state probability $\pi_A(c_j)$ of any other state $c_j$ ($c_j \in C_n(m)$, $j \neq 1$) to that of $c_1$:
$$\pi_A(c_j)/\pi_A(c_1) = \prod_{k=1}^{n} p_k^{\,I_k(c_j)-I_k(c_1)}. \qquad (14)$$
Since all the steady-state probabilities sum to 1, i.e., $\sum_{j=1}^{|C_n(m)|} \pi_A(c_j) = 1$, this yields
$$\Big(1 + \sum_{j=2}^{|C_n(m)|} \prod_{k=1}^{n} p_k^{\,I_k(c_j)-I_k(c_1)}\Big)\,\pi_A(c_1) = 1.$$
So we get
$$\pi_A(c_1) = 1 \Big/ \Big(1 + \sum_{j=2}^{|C_n(m)|} \prod_{k=1}^{n} p_k^{\,I_k(c_j)-I_k(c_1)}\Big).$$
By multiplying both the numerator and the denominator on the right-hand side by $\prod_{k=1}^{n} p_k^{\,I_k(c_1)}$, we get
$$\pi_A(c_1) = \prod_{k=1}^{n} p_k^{\,I_k(c_1)} \Big/ \sum_{j=1}^{|C_n(m)|} \prod_{k=1}^{n} p_k^{\,I_k(c_j)},$$
which implies that
$$\pi_A(c_1) = \frac{1}{Z(m)} \prod_{i=1}^{h} \Big(\prod_{j \in c^{1}_{i}} p_j\Big)^{ht_A(l_i)}, \qquad (15)$$
where $c^{1}_{i}$ denotes list $i$ of state $c_1$. By using (14), we can obtain the steady-state probabilities $\pi_A(c_j)$ of the other states. This proves that ($\ast$) $\Rightarrow$ (2) holds.
To prove Theorem 1, we start with the flat architecture. In the flat architecture, we first derive the transition probabilities between state $c$ and the other states, which are as follows.
• The probability that state $c$ transitions to another state can be expressed as
$$\pi_F(c)\Big(1 - \sum_{j \in c_{h_N}} p_j - \sum_{j \in c_h} p_j\Big), \qquad (16)$$
noting that this transition happens unless there is a hit on the highest list in D-Cache or N-Cache.
• Using the notation above, the probability that the other states transition back to state $c$ can be expressed as
$$(1-\alpha) \sum_{k \notin c_1,\dots,c_h}\ \sum_{u \in c_1} \pi_F(c_{k\to(1,u)})\,p_u/m_1 \qquad (17)$$
$$+\ \alpha \sum_{k \notin c_1,\dots,c_h}\ \sum_{u \in c_{h_N+1}} \pi_F(c_{k\to(h_N+1,u)})\,p_u/m_{h_N+1}$$
$$+ \sum_{i \neq 0, h_N, h}\ \sum_{u \in c_i}\ \sum_{v \in c_{i+1}} \pi_F(c_{(i,u)\leftrightarrow(i+1,v)})\,p_v/m_{i+1}.$$
Reaching the steady state means that the probability that state $c$ transitions to another state (i.e., (16)) equals the probability that the other states transition back to state $c$ (i.e., (17)). We can express the global balance equation of state $c$ as follows:
$$\pi_F(c)\Big(1 - \sum_{j \in c_{h_N}} p_j - \sum_{j \in c_h} p_j\Big) \qquad (18)$$
$$= (1-\alpha) \sum_{k \notin c_1,\dots,c_h}\ \sum_{u \in c_1} \pi_F(c_{k\to(1,u)})\,p_u/m_1$$
$$+\ \alpha \sum_{k \notin c_1,\dots,c_h}\ \sum_{u \in c_{h_N+1}} \pi_F(c_{k\to(h_N+1,u)})\,p_u/m_{h_N+1}$$
$$+ \sum_{i \neq 0, h_N, h}\ \sum_{u \in c_i}\ \sum_{v \in c_{i+1}} \pi_F(c_{(i,u)\leftrightarrow(i+1,v)})\,p_v/m_{i+1}.$$
By plugging ($\ast$) into (18) and dividing both sides by $\pi_F(c)$, we obtain
$$1 - \sum_{j \in c_{h_N}} p_j - \sum_{j \in c_h} p_j = \sum_{k \notin c} p_k + \sum_{i \neq 0, h_N, h}\ \sum_{u \in c_i} p_u,$$
which clearly holds as $\sum_k p_k = 1$, showing that ($\ast$) gives the steady-state probabilities. By using Lemma 1, we have proved that Theorem 1 holds for the flat architecture. We also see that the steady-state probability of each state in the flat architecture (i.e., $\pi_F(c)$) is independent of the parameter $\alpha$.
B. Proof of Theorem 2
For the layered architecture, i.e., when $ht_A(i) = i$, it has been proved in [6, Theorem 7] that the ODEs have a fixed point and that the fixed point is unique. Here, we follow the same method to show the uniqueness under the flat architecture.
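Note that the equations have the same structure as those of a birth-and-death process. For list $i \in \{1, \dots, h_N\}$, we have
$$x_{k,i} = \frac{p_k^{\,i}\, m_1 m_2 \cdots m_i}{H_0 H_1 \cdots H_{i-1}}\, x_{k,0}, \quad i \in \{0, 1, \dots, h_N\}.$$
Noting that list $i \in \{h_N+1, \dots, h\}$ is the $(i-h_N)$-th list in D-Cache, we have
$$x_{k,i} = \frac{p_k^{\,i-h_N}\, m_{h_N+1} \cdots m_i}{H_0 H_{h_N+1} \cdots H_{i-1}}\, x_{k,0}, \quad i \in \{h_N+1, \dots, h\}.$$
To simplify the above equations, we define $\vec{s}$ by letting
$$s_i = \begin{cases} \dfrac{m_1 m_2 \cdots m_i}{H_0 H_1 \cdots H_{i-1}}, & \text{if } i \in \{1, \dots, h_N\},\\[2mm] \dfrac{m_{h_N+1} \cdots m_i}{H_0 H_{h_N+1} \cdots H_{i-1}}, & \text{if } i \in \{h_N+1, \dots, h\}. \end{cases}$$
Clearly, we have $\sum_j x_{k,j} = 1$, which implies that
$$x_{k,i} = \frac{p_k^{\,ht_F(i)} s_i}{1 + \sum_{j=1}^{h_N} p_k^{\,j} s_j + \sum_{j=h_N+1}^{h} p_k^{\,j-h_N} s_j}.$$
By using $\sum_{k=1}^{n} x_{k,j} = m_j$, we have
$$m_i = \sum_{k=1}^{n} x_{k,i} = \sum_{k=1}^{n} \frac{p_k^{\,ht_F(i)} s_i}{1 + \sum_{j=1}^{h_N} p_k^{\,j} s_j + \sum_{j=h_N+1}^{h} p_k^{\,j-h_N} s_j}.$$
For a vector $\vec{s} = (s_1, s_2, \dots, s_h)$ where each $s_i > 0$, define $D_i(\vec{s})$, $i \in \{1, \dots, h\}$, by
$$D_i(\vec{s}) = \sum_{k=1}^{n} \frac{p_k^{\,ht_F(i)} s_i}{1 + \sum_{j=1}^{h_N} p_k^{\,j} s_j + \sum_{j=h_N+1}^{h} p_k^{\,j-h_N} s_j}.$$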
Let $\vec{s}_i(y)$ denote the vector whose elements all equal those of $\vec{s}$ except that the $i$-th element is $y$. The equation $D_i(\vec{s}_i(y)) = m_i$ has a unique solution, which we denote as $G_i(\vec{s})$. Hence $G(\vec{s})$ is a vector determined by $\vec{s}$. Since
$$D_i(\vec{s}) = \sum_{k=1}^{n} \frac{p_k^{\,ht_F(i)}}{1/s_i + \sum_{j=1}^{h_N} p_k^{\,j} s_j/s_i + \sum_{j=h_N+1}^{h} p_k^{\,j-h_N} s_j/s_i},$$
$D_i(\vec{s})$ is decreasing in $s_j$ when $j \neq i$, which implies that $G_i(\vec{s})$ is increasing in $\vec{s}$. We define the sequence $\vec{s}(t)$ by $\vec{s}(0) = (0, 0, \dots, 0)$ and $\vec{s}(t+1) = G(\vec{s}(t))$.
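Moreover, for any $t$, $D_i(\vec{s}(t)) \le m_i$, which implies that (writing $s_i^{t}$ for the $i$-th element of $\vec{s}(t)$)
$$n - \sum_{i=1}^{h} m_i \le n - \sum_{i=1}^{h} D_i(\vec{s}(t))$$
$$= n - \sum_{k=1}^{n} \frac{\sum_{i=1}^{h_N} p_k^{\,i} s_i^{t} + \sum_{i=h_N+1}^{h} p_k^{\,i-h_N} s_i^{t}}{1 + \sum_{j=1}^{h_N} p_k^{\,j} s_j^{t} + \sum_{j=h_N+1}^{h} p_k^{\,j-h_N} s_j^{t}}$$
$$= \sum_{k=1}^{n} \frac{1}{1 + \sum_{j=1}^{h_N} p_k^{\,j} s_j^{t} + \sum_{j=h_N+1}^{h} p_k^{\,j-h_N} s_j^{t}}.$$
Therefore, the sequence $\vec{s}(t)$ is bounded. Since it is also increasing, it converges to a fixed point of $G$.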
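Now we prove the uniqueness. Let $\lambda > 1$ and multiply all the elements of the vector $\vec{s}$ by $\lambda$. We have
$$D_i(\lambda\vec{s}) = \sum_{k=1}^{n} \frac{p_k^{\,ht_F(i)} \lambda s_i}{1 + \sum_{j=1}^{h_N} p_k^{\,j} \lambda s_j + \sum_{j=h_N+1}^{h} p_k^{\,j-h_N} \lambda s_j}$$
$$= \sum_{k=1}^{n} \frac{p_k^{\,ht_F(i)} s_i}{1/\lambda + \sum_{j=1}^{h_N} p_k^{\,j} s_j + \sum_{j=h_N+1}^{h} p_k^{\,j-h_N} s_j} > D_i(\vec{s}).$$
This implies that $G_i(\lambda\vec{s}) < \lambda G_i(\vec{s})$, which proves the uniqueness by [20, Theorem 2].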
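The iteration $\vec{s}(t+1) = G(\vec{s}(t))$ used in this proof also suggests a simple numerical procedure. The following is a minimal sketch under stated assumptions rather than the authors' implementation: the function names, tolerances, and the small example (n = 50 pages with Zipf-like popularity, two N-Cache lists and one D-Cache list of 5 pages each) are hypothetical, and each $G_i$ is solved by bisection, which relies on $D_i$ being increasing in its $i$-th argument.

```python
def flat_heights(h_n, h_d):
    """Heights ht_F(i): 1..h_n for the N-Cache lists, then 1..h_d for D-Cache."""
    return list(range(1, h_n + 1)) + list(range(1, h_d + 1))


def solve_fixed_point(p, sizes, heights, tol=1e-10, max_iter=5000):
    """Iterate s(t+1) = G(s(t)) from s(0) = 0; return (s, x) with x[k][0] = x_{k,0}."""
    h = len(sizes)
    w = [[pk ** ht for ht in heights] for pk in p]   # w[k][i] = p_k^{ht(i)}
    s = [0.0] * h
    for _ in range(max_iter):
        new_s = []
        for i in range(h):
            # 1 + sum_{j != i} p_k^{ht(j)} s_j stays fixed while solving for s_i.
            base = [1.0 + sum(wk[j] * s[j] for j in range(h) if j != i) for wk in w]

            def d_i(y):
                return sum(wk[i] * y / (b + wk[i] * y) for wk, b in zip(w, base))

            lo, hi = 0.0, 1.0
            while d_i(hi) < sizes[i]:       # bracket the root of D_i(y) = m_i
                lo, hi = hi, 2.0 * hi
            for _step in range(60):         # bisection inside the bracket
                mid = 0.5 * (lo + hi)
                if d_i(mid) < sizes[i]:
                    lo = mid
                else:
                    hi = mid
            new_s.append(0.5 * (lo + hi))
        change = max(abs(a - b) / b for a, b in zip(s, new_s))
        s = new_s
        if change < tol:
            break
    x = []
    for wk in w:
        denom = 1.0 + sum(wk[j] * s[j] for j in range(h))
        x.append([1.0 / denom] + [wk[j] * s[j] / denom for j in range(h)])
    return s, x


# Illustrative instance: Zipf-like popularity with parameter 0.8.
n, gamma = 50, 0.8
raw = [1.0 / (k + 1) ** gamma for k in range(n)]
p = [r / sum(raw) for r in raw]
sizes = [5, 5, 5]
s, x = solve_fixed_point(p, sizes, flat_heights(h_n=2, h_d=1))
column_sums = [sum(row[i + 1] for row in x) for i in range(len(sizes))]
```

At convergence, each entry of `column_sums` should match the corresponding list size, which is the condition $\sum_k x_{k,i} = m_i$. Note that α does not appear anywhere in the computation, in line with the observation that the fixed point is independent of α.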
C. Proof of Theorem 3
For the layered architecture, i.e., when $ht_A(i) = i$, the validity of the approximation has been proved in [6, Theorem 7]. Here, we follow the same method to show the validity of using $\vec{\delta}(t)$ to approximate $\vec{H}(t)$ under the flat architecture.
Let $\vec{H}(t)$ be the vector that describes the hit probability in each list at time $t$, where each element $H_i(t) = \sum_k p_k X_{k,i}(t)$ is the hit probability of list $i$. Recall that $X_{k,i}(t)$ indicates whether item $k$ is in list $i$ at time $t$: $X_{k,i}(t) = 1$ if so, and 0 otherwise. Hence $0 \le H_i(t) \le 1$.
Let $\vec{\delta}(t)$ be the vector that consists of the elements $\delta_i(t) = \sum_k p_k x_{k,i}(t)$, where $x_{k,i}(t)$ is the unique solution of the ODEs in (6)-(10) with initial conditions $x_{k,i}(0) = X_{k,i}(0)$. The sequence $\vec{\delta}(t)$, $t = 0, 1, \dots$, thus describes the deterministic process. Varying the initial conditions yields a family of such sequences, each of which is a vector $\vec{\delta}$ at any given time. Let $\mathcal{H}$ denote the space of such vectors $\vec{\delta}$. We define two norms on any $\vec{\delta}$ in $\mathcal{H}$: $\|\vec{\delta}\|_\infty = \max_i |\delta_i|$ and $\|\vec{\delta}\|_\infty^2 = \max_i |\delta_i|^2$.
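Under the flat architecture, we define the function $f$ on $\vec{\delta}$ as follows.
Case 1: If $i \neq 0, 1, h_N+1, h, h_N$ (i.e., in middle lists):
$$f_i(\vec{\delta}) = p_k\delta_{i-1} - \frac{\delta_{i-1}\delta_i}{m_i} + \frac{\delta_i\delta_{i+1}}{m_{i+1}} - p_k\delta_i. \qquad (19)$$
Case 2: If $i = h$ or $i = h_N$ (i.e., in the highest list):
$$f_i(\vec{\delta}) = p_k\delta_{i-1} - \frac{\delta_i\delta_{i-1}}{m_i}. \qquad (20)$$
Case 3: If $i = 1$ (i.e., in the lowest list of N-Cache):
$$f_i(\vec{\delta}) = (1-\alpha)\,p_k\delta_{i-1} - (1-\alpha)\frac{\delta_i\delta_{i-1}}{m_i} + \frac{\delta_i\delta_{i+1}}{m_{i+1}} - p_k\delta_i. \qquad (21)$$
Case 4: If $i = h_N + 1$ (i.e., in the lowest list of D-Cache):
$$f_i(\vec{\delta}) = \alpha\,p_k\delta_{i-1} - \alpha\frac{\delta_{i-1}\delta_i}{m_i} + \frac{\delta_i\delta_{i+1}}{m_{i+1}} - p_k\delta_i. \qquad (22)$$
Case 5: If $i = 0$ (i.e., in the storage layer):
$$f_i(\vec{\delta}) = (1-\alpha)\frac{\delta_0\delta_1}{m_1} + \alpha\frac{\delta_0\delta_{h_N+1}}{m_{h_N+1}} - p_k\delta_0. \qquad (23)$$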
We first prove that the following four lemmas hold.
Lemma 2. $f(\vec{H}(t))$ is the average variation of $\vec{H}(t)$, i.e.,
$$E[\vec{H}(t+1) - \vec{H}(t) \mid \mathcal{F}_t] = f(\vec{H}(t)).$$
Lemma 3. The second moment of the variation of $\vec{H}(t)$ is bounded:
$$E[\|\vec{H}(t+1) - \vec{H}(t)\|_\infty^2 \mid \mathcal{F}_t] \le 2a^2.$$
Lemma 4. There exists a constant $L$, independent of the $p_k$'s and $m_i$'s, such that the function $f$ is Lipschitz-continuous with constant $L(a+b)$ on the space $\mathcal{H}$ of vectors $\vec{\delta}$, that is, for all $\vec{\delta}'$ and $\vec{\delta}''$ in $\mathcal{H}$,
$$\|f(\vec{\delta}') - f(\vec{\delta}'')\|_\infty \le L(a+b)\,\|\vec{\delta}' - \vec{\delta}''\|_\infty,$$
where $a = \max_k p_k$ and $b = \max_i (1/m_i)$.
Lemma 5. With initial conditions $\delta_i(0) = E[H_i(0)]$, we have
$$\vec{\delta}(t) = \vec{H}(0) + \sum_{s=0}^{t-1} f(\vec{\delta}(s)).$$
Proof of Lemma 2: Take Case 1 as an example. If $i \neq 0, 1,
h_N + 1, h, h_N$ (i.e., in middle lists), two types of
events can modify the value of $H_i$ at time $t$:
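• At time $t$, an item $k \in \{1, \ldots, n\}$ in list $i-1$ is requested
and exchanged with an item $j \in \{1, \ldots, n\}$ from list
$i$. The average variation of
$H_i$ due to these events is:
\[
\begin{aligned}
& \sum_{k,j} \frac{X_{k,i-1}(t) X_{j,i}(t) p_k}{m_i} (p_k - p_j) \\
&= \sum_{k,j} \frac{X_{k,i-1}(t) X_{j,i}(t) p_k}{m_i}\, p_k - \sum_{k,j} \frac{X_{k,i-1}(t) X_{j,i}(t) p_k}{m_i}\, p_j \\
&= p_k \sum_j \frac{X_{j,i}(t)}{m_i} \sum_k X_{k,i-1}(t) p_k - \sum_k X_{k,i-1}(t) p_k \sum_j \frac{X_{j,i}(t) p_j}{m_i} \\
&= p_k H_{i-1}(t) - \frac{H_{i-1}(t) H_i(t)}{m_i}
\end{aligned}
\tag{24}
\]
The last step uses $H_i(t) = \sum_k p_k X_{k,i}(t)$ and $\sum_j X_{j,i}(t) = m_i$.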
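• At time $t$, an item $k$ in list $i$ is requested and exchanged
with an item $j$ from list $i+1$. The average variation of
$H_i$ due to these events is:
\[
\begin{aligned}
& \sum_{k,j} \frac{X_{k,i}(t) X_{j,i+1}(t) p_k}{m_{i+1}} (p_j - p_k) \\
&= \sum_{k,j} \frac{X_{k,i}(t) X_{j,i+1}(t) p_k}{m_{i+1}}\, p_j - \sum_{k,j} \frac{X_{k,i}(t) X_{j,i+1}(t) p_k}{m_{i+1}}\, p_k \\
&= \sum_j X_{j,i+1}(t) p_j \sum_k \frac{X_{k,i}(t) p_k}{m_{i+1}} - p_k \sum_j \frac{X_{j,i+1}(t)}{m_{i+1}} \sum_k X_{k,i}(t) p_k \\
&= \frac{H_{i+1}(t) H_i(t)}{m_{i+1}} - p_k H_i(t)
\end{aligned}
\tag{25}
\]
The last step uses $H_i(t) = \sum_k p_k X_{k,i}(t)$ and $\sum_j X_{j,i+1}(t) = m_{i+1}$.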
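Summing the two terms, we have for $i \neq 0, 1, h_N + 1, h_N, h$:
\[
E[H_i(t+1) - H_i(t) \mid \mathcal{F}_t] = f_i(\vec{H}(t)).
\]
The remaining values of $i$ (Cases 2 to 5) are handled in the same way.
Since the identity then holds for every component $i$, we have
\[
E[\vec{H}(t+1) - \vec{H}(t) \mid \mathcal{F}_t] = f(\vec{H}(t)),
\]
which proves Lemma 2.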
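Proof of Lemma 3: The second moment of the variation of
$\vec{H}(t)$ can be derived as follows.
\[
\begin{aligned}
E[(H_i(t+1) - H_i(t))^2 \mid \mathcal{F}_t]
&= \sum_{k,j} X_{k,i-1}(t) X_{j,i}(t) (p_k - p_j)^2 \frac{p_k}{m_i} \\
&\quad + \sum_{k,j} X_{k,i}(t) X_{j,i+1}(t) (p_j - p_k)^2 \frac{p_k}{m_{i+1}}
\end{aligned}
\tag{26}
\]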
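Since $0 < p_j, p_k \le \max_k p_k = a$, we have $(p_k - p_j)^2 \le a^2$, so
$E[(H_i(t+1) - H_i(t))^2 \mid \mathcal{F}_t]$ is at most:
\[
\sum_{k,j} \frac{X_{k,i-1}(t) X_{j,i}(t) p_k a^2}{m_i} + \sum_{k,j} \frac{X_{k,i}(t) X_{j,i+1}(t) p_k a^2}{m_{i+1}} = (H_{i-1}(t) + H_i(t))\, a^2 \tag{27}
\]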
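This shows that:
\[
\begin{aligned}
E[\|\vec{H}(t+1) - \vec{H}(t)\|_\infty^2 \mid \mathcal{F}_t]
&= E[\sup_i (H_i(t+1) - H_i(t))^2 \mid \mathcal{F}_t] \\
&\le E[\sum_i (H_i(t+1) - H_i(t))^2 \mid \mathcal{F}_t] \\
&\le \sum_i (H_{i-1}(t) + H_i(t))\, a^2 \\
&\le 2a^2
\end{aligned}
\tag{28}
\]
where the last step uses $\sum_i H_i(t) \le 1$.
Thus, the second moment of the variation of $\vec{H}(t)$ is
bounded.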
Proof of Lemma 4: We first take Case 1 as an example, that
is, $i \neq 0, 1, h_N + 1, h, h_N$ (i.e., in middle lists):
\[
f_i(\vec{\delta}) = p_k \delta_{i-1} - \frac{\delta_{i-1}\delta_i}{m_i} + \frac{\delta_i \delta_{i+1}}{m_{i+1}} - p_k \delta_i
\]
We split $f_i(\vec{\delta})$ into four parts, denote each part by
$g_i(\vec{\delta})$ for ease of presentation, and show that each part
is Lipschitz-continuous individually. Let $\vec{\delta}'$ and $\vec{\delta}''$ be
two arbitrary vectors in $\mathcal{H}$.
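• Part 1: $g_i(\vec{\delta}) = p_k \delta_{i-1}$. For any $i \neq 0, 1, h_N + 1,
h, h_N$:
\[
|g_i(\vec{\delta}') - g_i(\vec{\delta}'')| = |p_k(\delta'_{i-1} - \delta''_{i-1})| \le p_k |\delta'_{i-1} - \delta''_{i-1}| \le a \|\vec{\delta}' - \vec{\delta}''\|_\infty
\]
Hence $\|g(\vec{\delta}') - g(\vec{\delta}'')\|_\infty \le a \|\vec{\delta}' - \vec{\delta}''\|_\infty$.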
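• Part 2: $g_i(\vec{\delta}) = -\frac{\delta_{i-1}\delta_i}{m_i}$. For any $i \neq 0, 1, h_N + 1,
h, h_N$:
\[
\begin{aligned}
|g_i(\vec{\delta}') - g_i(\vec{\delta}'')|
&= \left| \frac{\delta'_i \delta'_{i-1} - \delta''_i \delta''_{i-1}}{m_i} \right| \\
&= \left| \frac{\delta'_{i-1}(\delta'_i - \delta''_i) + \delta''_i(\delta'_{i-1} - \delta''_{i-1})}{m_i} \right| \\
&\le \left| \frac{\delta'_{i-1}(\delta'_i - \delta''_i)}{m_i} \right| + \left| \frac{\delta''_i(\delta'_{i-1} - \delta''_{i-1})}{m_i} \right| \\
&\le b\,\delta'_{i-1} |\delta'_i - \delta''_i| + b\,\delta''_i |\delta'_{i-1} - \delta''_{i-1}| \\
&\le 2b \|\vec{\delta}' - \vec{\delta}''\|_\infty
\end{aligned}
\]
Hence $\|g(\vec{\delta}') - g(\vec{\delta}'')\|_\infty \le 2b \|\vec{\delta}' - \vec{\delta}''\|_\infty$.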
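• Part 3: $g_i(\vec{\delta}) = \frac{\delta_i \delta_{i+1}}{m_{i+1}}$. The proof is similar to Part 2.
Hence $\|g(\vec{\delta}') - g(\vec{\delta}'')\|_\infty \le 2b \|\vec{\delta}' - \vec{\delta}''\|_\infty$.
• Part 4: $g_i(\vec{\delta}) = -p_k \delta_i$. The proof is similar to Part 1. Hence
$\|g(\vec{\delta}') - g(\vec{\delta}'')\|_\infty \le a \|\vec{\delta}' - \vec{\delta}''\|_\infty$.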
Summing the four bounds gives $2a + 4b \le 4(a+b)$. Hence, for $i \neq 0,
1, h_N + 1, h, h_N$ (i.e., in middle lists), there exists a constant
$L$ (here $L = 4$ suffices), independent of the $p_k$'s and $m_i$'s, such that the
function $f$ is Lipschitz-continuous with constant $L(a+b)$ on $\mathcal{H}$,
that is,
\[
\|f(\vec{\delta}') - f(\vec{\delta}'')\|_\infty \le L(a+b)\,\|\vec{\delta}' - \vec{\delta}''\|_\infty,
\]
where $a = \max_k p_k$ and $b = \max_i (1/m_i)$. Treating the remaining
cases of $i$ in the same way shows that Lemma 4 holds.
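The Part 2 bound can be sanity-checked numerically. The sketch below draws random vectors with entries in $[0, 1]$ and random list sizes, then reports the largest observed ratio between the left-hand side and $2b\|\vec{\delta}' - \vec{\delta}''\|_\infty$. The function names, the sampling ranges, and the vector length are illustrative choices, not part of the paper; a ratio of at most 1 is consistent with the bound but does not replace the proof.

```python
import random


def product_term(delta, i, m_i):
    # Part 2 of the split: g_i(delta) = -delta[i-1] * delta[i] / m_i
    return -delta[i - 1] * delta[i] / m_i


def worst_ratio_part2(trials=10_000, size=6, seed=0):
    """Largest observed |g_i(d1) - g_i(d2)| / (2 b ||d1 - d2||_inf)."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        d1 = [rng.random() for _ in range(size)]
        d2 = [rng.random() for _ in range(size)]
        m = [rng.randint(1, 50) for _ in range(size)]
        b = max(1.0 / x for x in m)          # b = max_i (1 / m_i)
        dist = max(abs(x - y) for x, y in zip(d1, d2))
        if dist == 0.0:
            continue                          # identical vectors, nothing to check
        for i in range(1, size):
            lhs = abs(product_term(d1, i, m[i]) - product_term(d2, i, m[i]))
            worst = max(worst, lhs / (2 * b * dist))
    return worst


if __name__ == "__main__":
    print(worst_ratio_part2())
```

The check relies on the same two facts as the proof: every component lies in $[0, 1]$, and $1/m_i \le b$.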
Proof of Lemma 5: The proof is immediate, since $\vec{\delta}(0)$
equals $\vec{H}(0)$ and $f(\vec{\delta}(t))$ is the variation of $\vec{\delta}(t)$ in every component $i$.
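The recursion in Lemma 5 amounts to repeatedly adding the drift to the current state. The following sketch iterates $\vec{\delta}(t+1) = \vec{\delta}(t) + f(\vec{\delta}(t))$ for an arbitrary drift function. Both `iterate_mean_field` and the two-list `toy_drift` are hypothetical names introduced for illustration; `toy_drift` is not the function $f$ of the Flat Architecture.

```python
def iterate_mean_field(f, delta0, steps):
    """Return [delta(0), ..., delta(steps)] with delta(t+1) = delta(t) + f(delta(t))."""
    path = [list(delta0)]
    for _ in range(steps):
        current = path[-1]
        increment = f(current)
        path.append([c + d for c, d in zip(current, increment)])
    return path


def toy_drift(delta):
    # Hypothetical drift: a fixed fraction of list 0's mass moves to list 1.
    rate = 0.1 * delta[0]
    return [-rate, rate]


if __name__ == "__main__":
    path = iterate_mean_field(toy_drift, [1.0, 0.0], 20)
    # Telescoping sum of Lemma 5: delta(t) = delta(0) + sum_{s<t} f(delta(s)).
    total = [sum(toy_drift(p)[i] for p in path[:-1]) for i in range(2)]
    print(path[-1], [1.0 + total[0], 0.0 + total[1]])
```

Any drift whose components sum to zero keeps the total mass of $\vec{\delta}(t)$ constant along the path, as the toy example does.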
Proof of Theorem 3: Let $\vec{M}(t) = \sum_{s=0}^{t-1} \left( \vec{H}(s+1) - \vec{H}(s) - f(\vec{H}(s)) \right)$. Then we have:
\[
\vec{H}(t) = \vec{H}(0) + \sum_{s=0}^{t-1} f(\vec{H}(s)) + \vec{M}(t) \tag{29}
\]
Combining Lemma 5 and Equation (29), we get:
\[
\vec{H}(t) - \vec{\delta}(t) = \sum_{s=0}^{t-1} \left( f(\vec{H}(s)) - f(\vec{\delta}(s)) \right) + \vec{M}(t)
\]
Taking norms, for $t \le \tau$, $\|\vec{H}(t) - \vec{\delta}(t)\|_\infty$ is at
most
\[
\sum_{s=0}^{\tau-1} \|f(\vec{H}(s)) - f(\vec{\delta}(s))\|_\infty + \sup_{t \le \tau} \|\vec{M}(t)\|_\infty
\]
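By Lemma 4, we get $\|f(\vec{H}(s)) - f(\vec{\delta}(s))\|_\infty \le
L(a+b)\|\vec{H}(s) - \vec{\delta}(s)\|_\infty \le L(a+b)$. Hence, for $t \le \tau$,
$E[\|\vec{H}(t) - \vec{\delta}(t)\|_\infty]$ is at most
\[
\sum_{s=0}^{\tau-1} L(a+b)\, E[\|\vec{H}(s) - \vec{\delta}(s)\|_\infty] + E[\sup_{t \le \tau} \|\vec{M}(t)\|_\infty]
\]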
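By Lemma 3, $E[\|\vec{M}(\tau)\|_\infty^2] \le 2a^2\tau$. Besides, we have
\[
\begin{aligned}
& E[\vec{M}(t+1) \mid \vec{M}(t), \vec{M}(t-1), \ldots, \vec{M}(0)] \\
&= E[\vec{M}(t) + \vec{H}(t+1) - \vec{H}(t) - f(\vec{H}(t)) \mid \vec{M}(t), \vec{M}(t-1), \ldots, \vec{M}(0)] \\
&= E[\vec{M}(t) \mid \vec{M}(t), \vec{M}(t-1), \ldots, \vec{M}(0)] \\
&\quad + E[\vec{H}(t+1) - \vec{H}(t) - f(\vec{H}(t)) \mid \vec{M}(t), \vec{M}(t-1), \ldots, \vec{M}(0)] \\
&= E[\vec{M}(t) \mid \vec{M}(t), \vec{M}(t-1), \ldots, \vec{M}(0)] = \vec{M}(t)
\end{aligned}
\]
where the second conditional expectation vanishes by Lemma 2.
Hence $\vec{M}(t)$ is a martingale. Thus, we have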
\[
E[\sup_{t \le \tau} \|\vec{M}(t)\|_\infty] \le \sqrt{E[\|\vec{M}(\tau)\|_\infty^2]} \le \sqrt{2a^2\tau} \le \sqrt{2\tau}\,(a+b)
\]
Altogether, for $t \le \tau$, $E[\|\vec{H}(t) - \vec{\delta}(t)\|_\infty]$ is
at most $L(a+b) \sum_{s=0}^{\tau-1} E[\|\vec{H}(s) - \vec{\delta}(s)\|_\infty] + \sqrt{2\tau}\,(a+b)$.
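By the discrete Gronwall inequality in [8], the above
inequality implies that $E[\sup_{t \le \tau} \|\vec{H}(t) - \vec{\delta}(t)\|_\infty]$ is at
most $\sqrt{2\tau}\,(a+b) \exp(L(a+b)\tau)$. Replacing $\tau$
with $\frac{1}{L(a+b)}$, we obtain that $E[\sup_{t \le \frac{1}{L(a+b)}} \|\vec{H}(t) - \vec{\delta}(t)\|_\infty]$
is at most $\sqrt{2(a+b)/L}\; e$. Finally, when
$p_k \to 0$ as $n \to \infty$ (so $a = \max_k p_k \to 0$) and
$m_i \to \infty$ for $i = 0, 1, \ldots, h$ (so $b = \max_i (1/m_i) \to 0$), we have $\tau = \frac{1}{L(a+b)} \to \infty$ and
$E[\sup_{t \le \tau} \|\vec{H}(t) - \vec{\delta}(t)\|_\infty] \to 0$, where $\delta_i(t) = \sum_k x_{k,i}(t) p_k$,
$H_i(t) = \sum_k X_{k,i}(t) p_k$, and $\vec{H}(0) = \vec{\delta}(0)$.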
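The last two steps can be illustrated numerically. The sketch below first builds the largest sequence allowed by a recursion of the form $u(t) \le c + K \sum_{s<t} u(s)$ and compares it with the Gronwall envelope $c\,e^{Kt}$, then evaluates the bound $\sqrt{2\tau}\,(a+b)\exp(L(a+b)\tau)$ at $\tau = 1/(L(a+b))$. The function names are hypothetical, and the value $L = 4$ used in the example run is only an illustrative choice of constant.

```python
import math


def gronwall_worst_case(c, K, tau):
    """u(t) = c + K * sum_{s<t} u(s) for t = 0..tau (equality case of the recursion)."""
    u, running_sum = [], 0.0
    for _ in range(tau + 1):
        u.append(c + K * running_sum)
        running_sum += u[-1]
    return u


def final_bound(a, b, L):
    """sqrt(2 tau) (a + b) exp(L (a + b) tau) evaluated at tau = 1 / (L (a + b))."""
    tau = 1.0 / (L * (a + b))
    return math.sqrt(2.0 * tau) * (a + b) * math.exp(L * (a + b) * tau)


if __name__ == "__main__":
    u = gronwall_worst_case(c=0.5, K=0.05, tau=40)
    # The equality case grows like c (1 + K)^t, which stays below c exp(K t).
    print(all(u[t] <= 0.5 * math.exp(0.05 * t) + 1e-12 for t in range(41)))
    # The bound shrinks as a and b shrink, matching the limit in the proof.
    for scale in (1e-1, 1e-2, 1e-3, 1e-4):
        print(scale, final_bound(a=scale, b=scale, L=4.0))
```

With this substitution the exponential factor is always $e$, so the bound reduces to $e\sqrt{2(a+b)/L}$ and decays like the square root of $a + b$.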
